From 19bb543447f987fb3d7c68e78fc567aab9e5dd70 Mon Sep 17 00:00:00 2001
From: Thijs Vogels
Date: Mon, 3 Jul 2023 21:59:23 +0200
Subject: [PATCH] Update README.md

---
 README.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/README.md b/README.md
index a8d5bbb..d895480 100644
--- a/README.md
+++ b/README.md
@@ -42,6 +42,19 @@ Usage:
 +    optimizer_step(optimizer, powersgd)
 ```
 
+## Differences with the paper version
+
+The version in this code base is a slight improvement over the version in the PowerSGD paper.
+It is similar to Algorithm 2 in [this follow-up paper](https://arxiv.org/pdf/2008.01425.pdf).
+
+We found that there are two ways to control the approximation quality in PowerSGD: the 'rank' of the approximation and the 'number of iterations'. Because the cost of orthogonalisation grows as $O(\text{rank}^2)$, increasing the rank can become inefficient, which leaves the number of iterations as the better knob to tune.
+
+In the original PowerSGD paper, more iterations only improve the quality of the rank-k approximation, as it converges to the best rank-k approximation. In the [follow-up paper](https://arxiv.org/pdf/2008.01425.pdf), the intermediate results of these power iterations are all used, effectively increasing the rank as the number of iterations grows.
+
+In the original PowerSGD paper, we used two power iterations per SGD step (a left and a right iteration). In that setting, the two variants behave almost identically; the difference appears when you use more power-iteration steps per SGD step.
+
+
+
 ## PyTorch implementation
 
 PyTorch features an implementation of PowerSGD as a [communication hook](https://pytorch.org/docs/stable/ddp_comm_hooks.html) for `DistributedDataParallel` models. Because of the integration with DDP, the code is more involved than the code in this repository.
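
To make the distinction drawn in the added section concrete, here is a minimal, self-contained sketch. It is *not* the code from this repository: the helper names (`orthogonalize`, `best_rank_k_approximation`, `growing_rank_approximation`), the single-matrix setup, and the deflation-style accumulation are illustrative assumptions. The first function spends extra power iterations refining one fixed rank-k approximation (the original-paper behaviour); the second keeps the intermediate result of every iteration, so the effective rank grows with the number of iterations (in the spirit of Algorithm 2 of the follow-up paper).

```python
import torch


def orthogonalize(matrix: torch.Tensor) -> torch.Tensor:
    """Orthonormalise the columns of `matrix`; this is the step whose cost grows roughly with rank^2."""
    q, _ = torch.linalg.qr(matrix)
    return q


def best_rank_k_approximation(m: torch.Tensor, rank: int, num_iterations: int) -> torch.Tensor:
    """Original-paper style: extra iterations only refine the same rank-`rank`
    approximation, which converges towards the best rank-`rank` approximation of `m`."""
    q = torch.randn(m.shape[1], rank)
    for _ in range(num_iterations):
        p = orthogonalize(m @ q)  # left iteration
        q = m.T @ p               # right iteration
    return p @ q.T


def growing_rank_approximation(m: torch.Tensor, rank: int, num_iterations: int) -> torch.Tensor:
    """Sketch of the variant used here: every iteration's intermediate result is kept,
    so the effective rank of the accumulated approximation grows with the iterations."""
    approximation = torch.zeros_like(m)
    residual = m.clone()
    q = torch.randn(m.shape[1], rank)
    for _ in range(num_iterations):
        p = orthogonalize(residual @ q)
        q = residual.T @ p
        approximation += p @ q.T      # keep this iteration's rank-`rank` contribution
        residual = m - approximation  # the next iteration compresses what is left
    return approximation


if __name__ == "__main__":
    torch.manual_seed(0)
    m = torch.randn(256, 128)
    for approximate in (best_rank_k_approximation, growing_rank_approximation):
        error = (torch.linalg.norm(m - approximate(m, rank=2, num_iterations=4))
                 / torch.linalg.norm(m)).item()
        print(f"{approximate.__name__}: relative approximation error {error:.3f}")
```

A real implementation additionally all-reduces the P and Q factors across workers and applies error feedback; this sketch only illustrates how the approximation quality depends on the rank versus the number of power iterations.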