From 19bb543447f987fb3d7c68e78fc567aab9e5dd70 Mon Sep 17 00:00:00 2001
From: Thijs Vogels
Date: Mon, 3 Jul 2023 21:59:23 +0200
Subject: [PATCH] Update README.md

---
 README.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/README.md b/README.md
index a8d5bbb..d895480 100644
--- a/README.md
+++ b/README.md
@@ -42,6 +42,19 @@ Usage:
 +    optimizer_step(optimizer, powersgd)
 ```
 
+## Differences with the paper version
+
+The version in this code base is a slight improvement over the version in the PowerSGD paper.
+It is similar to Algorithm 2 in [this follow-up paper](https://arxiv.org/pdf/2008.01425.pdf).
+
+We found that there are two ways to control the approximation quality in PowerSGD: the 'rank' of the approximation and the 'number of iterations'. Because the cost of orthogonalisation grows as $O(\text{rank}^2)$, increasing the rank can become inefficient, which leaves the number of iterations as the better knob to tune.
+
+In the original PowerSGD paper, more iterations only improve the quality of the rank-k approximation, as it converges to the best rank-k approximation. In the [follow-up paper](https://arxiv.org/pdf/2008.01425.pdf), the intermediate results of these power iterations are all used, effectively increasing the rank as the number of iterations grows.
+
+In the original PowerSGD paper, we used two power iterations per SGD step (a left and a right iteration). In that setting, the two variants behave almost identically; the difference appears when you use more power-iteration steps per SGD step.
+
+
+
 ## PyTorch implementation
 
 PyTorch features an implementation of PowerSGD as a [communication hook](https://pytorch.org/docs/stable/ddp_comm_hooks.html) for `DistributedDataParallel` models. Because of the integration with DDP, the code is more involved than the code in this repository.
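
To make the distinction drawn in the added section concrete, here is a minimal, self-contained sketch. It is *not* the code from this repository: the helper names (`orthogonalize`, `best_rank_k_approximation`, `growing_rank_approximation`), the single-matrix setup, and the deflation-style accumulation are illustrative assumptions. The first function spends extra power iterations refining one fixed rank-k approximation (the original-paper behaviour); the second keeps the intermediate result of every iteration, so the effective rank grows with the number of iterations (in the spirit of Algorithm 2 of the follow-up paper).

```python
import torch


def orthogonalize(matrix: torch.Tensor) -> torch.Tensor:
    """Orthonormalise the columns of `matrix`; this is the step whose cost grows roughly with rank^2."""
    q, _ = torch.linalg.qr(matrix)
    return q


def best_rank_k_approximation(m: torch.Tensor, rank: int, num_iterations: int) -> torch.Tensor:
    """Original-paper style: extra iterations only refine the same rank-`rank`
    approximation, which converges towards the best rank-`rank` approximation of `m`."""
    q = torch.randn(m.shape[1], rank)
    for _ in range(num_iterations):
        p = orthogonalize(m @ q)  # left iteration
        q = m.T @ p               # right iteration
    return p @ q.T


def growing_rank_approximation(m: torch.Tensor, rank: int, num_iterations: int) -> torch.Tensor:
    """Sketch of the variant used here: every iteration's intermediate result is kept,
    so the effective rank of the accumulated approximation grows with the iterations."""
    approximation = torch.zeros_like(m)
    residual = m.clone()
    q = torch.randn(m.shape[1], rank)
    for _ in range(num_iterations):
        p = orthogonalize(residual @ q)
        q = residual.T @ p
        approximation += p @ q.T      # keep this iteration's rank-`rank` contribution
        residual = m - approximation  # the next iteration compresses what is left
    return approximation


if __name__ == "__main__":
    torch.manual_seed(0)
    m = torch.randn(256, 128)
    for approximate in (best_rank_k_approximation, growing_rank_approximation):
        error = (torch.linalg.norm(m - approximate(m, rank=2, num_iterations=4))
                 / torch.linalg.norm(m)).item()
        print(f"{approximate.__name__}: relative approximation error {error:.3f}")
```

A real implementation additionally all-reduces the P and Q factors across workers and applies error feedback; this sketch only illustrates how the approximation quality depends on the rank versus the number of power iterations.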