From d46e6ce6484089f61e83dd102b8456723c3c8fee Mon Sep 17 00:00:00 2001
From: Virginia Fernandez
Date: Wed, 25 Sep 2024 16:20:50 +0100
Subject: [PATCH] Add more information on super-resolution files.

Signed-off-by: Virginia Fernandez
---
 .../2d_sd_super_resolution.ipynb           | 24 +++++++++++++++++++---
 .../2d_sd_super_resolution_lightning.ipynb | 21 ++++++++++++++++++++-
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/generation/2d_super_resolution/2d_sd_super_resolution.ipynb b/generation/2d_super_resolution/2d_sd_super_resolution.ipynb
index e8b30d84a..03c07ddb6 100644
--- a/generation/2d_super_resolution/2d_sd_super_resolution.ipynb
+++ b/generation/2d_super_resolution/2d_sd_super_resolution.ipynb
@@ -24,7 +24,7 @@
    "source": [
     "# Super-resolution using Stable Diffusion v2 Upscalers\n",
     "\n",
-    "This tutorial illustrates how to perform super-resolution on medical images using Latent Diffusion Models (LDMs) [1]. For that, we use an autoencoder to obtain a latent representation of the high-resolution images. Then, we train a diffusion model to infer this latent representation when conditioned on a low-resolution image. \n",
+    "This tutorial illustrates how to perform **super-resolution** on medical images using Latent Diffusion Models (LDMs) [1]. The idea is that we train a spatial autoencoder whose latent space has the same spatial size as the low-resolution images, so that high-resolution images are encoded into latent representations of that size. The LDM then learns how to go from **noise to a latent representation of a high-resolution image**. During training and inference, the **low-resolution image is concatenated to the latent** to condition the generative process. Finally, the high-resolution latent representation is decoded into a high-resolution image.\n",
     "\n",
     "To improve the performance of our models, we will use a method called \"noise conditioning augmentation\" (introduced in [2] and used in Stable Diffusion v2.0 and Imagen Video [3]). During the training, we add noise to the low-resolution images using a random signal-to-noise ratio, and we condition the diffusion models on the amount of noise added. At sampling time, we use a fixed signal-to-noise ratio, representing a small amount of augmentation that aids in removing artefacts in the samples.\n",
     "\n",
@@ -416,6 +416,14 @@
     "## Train Autoencoder"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "a93437fe-d6ef-42d2-bedd-4da735c59dd1",
+   "metadata": {},
+   "source": [
+    "In this section, we train a spatial autoencoder to learn how to compress high-resolution images into a latent representation. We need to ensure that the spatial shape of the latent space matches that of the low-resolution images."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 30,
@@ -733,7 +741,9 @@
    "source": [
     "## Train Diffusion Model\n",
     "\n",
-    "In order to train the diffusion model to perform super-resolution, we will need to concatenate the latent representation of the high-resolution with the low-resolution image. For this, we create a Diffusion model with `in_channels=4`. Since only the outputted latent representation is interesting, we set `out_channels=3`."
+    "In order to train the diffusion model to perform super-resolution, we will need to **concatenate the latent representation of the high-resolution image with the low-resolution image**. Therefore, the number of input channels to the diffusion model is the sum of the number of channels of the low-resolution image (1) and the number of channels of the latent representation of the high-resolution image (3). In this case, we create a diffusion model with `in_channels=4`. Since we are only interested in the denoised latent representation, we set `out_channels=3`.\n",
+    "\n",
+    "**At inference time** we do not have a high-resolution image. Instead, we pass the concatenation of the low-resolution image and noise of the same shape as the latent representation."
    ]
   },
   {
    "cell_type": "code",
@@ -993,7 +1003,7 @@
     "        noisy_low_res_image = scheduler.add_noise(\n",
     "            original_samples=low_res_image, noise=low_res_noise, timesteps=low_res_timesteps\n",
     "        )\n",
-    "\n",
+    "        # Here we concatenate the noisy HR latent and the noisy low-resolution image.\n",
     "        latent_model_input = torch.cat([noisy_latent, noisy_low_res_image], dim=1)\n",
     "\n",
     "        noise_pred = unet(x=latent_model_input, timesteps=timesteps, class_labels=low_res_timesteps)\n",
@@ -1098,6 +1108,14 @@
     "### Plotting sampling example"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "1a2813d4-9087-459e-8913-bce174ac31cd",
+   "metadata": {},
+   "source": [
+    "As mentioned above, at inference time we only need to pass noise of the same shape as the latent, concatenated with the low-resolution image, to obtain the latent representation of the corresponding high-resolution image."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 47,
diff --git a/generation/2d_super_resolution/2d_sd_super_resolution_lightning.ipynb b/generation/2d_super_resolution/2d_sd_super_resolution_lightning.ipynb
index d7eca6096..fde5a8c65 100644
--- a/generation/2d_super_resolution/2d_sd_super_resolution_lightning.ipynb
+++ b/generation/2d_super_resolution/2d_sd_super_resolution_lightning.ipynb
@@ -404,6 +404,14 @@
     "## Train Autoencoder"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "e740cb2d-5a57-42ed-806b-e8c720a6f922",
+   "metadata": {},
+   "source": [
+    "In this section, we train a spatial autoencoder to learn how to compress high-resolution images into a latent representation. We need to ensure that the spatial shape of the latent space matches that of the low-resolution images."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 10,
@@ -708,6 +716,7 @@
    "metadata": {},
    "source": [
     "## Define the LightningModule for DiffusionModelUnet (transforms, network, loaders, etc)\n",
+    "\n",
     "The LightningModule contains a refactoring of your training code. The following module is a reformating of the code in 2d_stable_diffusion_v2_super_resolution."
    ]
   },
@@ -853,7 +862,9 @@
    "source": [
     "## Train Diffusion Model\n",
     "\n",
-    "In order to train the diffusion model to perform super-resolution, we will need to concatenate the latent representation of the high-resolution with the low-resolution image. For this, we create a Diffusion model with `in_channels=4`. Since only the outputted latent representation is interesting, we set `out_channels=3`.\n",
+    "In order to train the diffusion model to perform super-resolution, we will need to **concatenate the latent representation of the high-resolution image with the low-resolution image**. Therefore, the number of input channels to the diffusion model is the sum of the number of channels of the low-resolution image (1) and the number of channels of the latent representation of the high-resolution image (3). In this case, we create a diffusion model with `in_channels=4`. Since we are only interested in the denoised latent representation, we set `out_channels=3`.\n",
+    "\n",
+    "**At inference time** we do not have a high-resolution image. Instead, we pass the concatenation of the low-resolution image and noise of the same shape as the latent representation.\n",
     "\n",
     "As mentioned, we will use the conditioned augmentation (introduced in [2] section 3 and used on Stable Diffusion Upscalers and Imagen Video [3] Section 2.5) as it has been shown critical for cascaded diffusion models, as well for super-resolution tasks. For this, we apply Gaussian noise augmentation to the low-resolution images. We will use a scheduler low_res_scheduler to add this noise, with the t step defining the signal-to-noise ratio and use the t value to condition the diffusion model (inputted using class_labels argument)."
    ]
   },
@@ -1160,6 +1171,14 @@
     "### Plotting sampling example"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "19ba049e-fca6-4c76-b7b1-7e992d370583",
+   "metadata": {},
+   "source": [
+    "As mentioned above, at inference time we only need to pass noise of the same shape as the latent, concatenated with the low-resolution image, to obtain the latent representation of the corresponding high-resolution image."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 26,
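
For reference, the sampling procedure that the new markdown cells describe in both notebooks boils down to a loop along the following lines. This is a minimal sketch rather than the notebooks' exact code: it assumes the `unet`, `autoencoder`, `scheduler`, `low_res_image`, `scale_factor` and `device` objects built earlier in the tutorials, and the fixed `noise_level` value is illustrative.

```python
import torch

noise_level = 10  # illustrative fixed noise-augmentation level (assumption, not the tutorials' exact value)

with torch.no_grad():
    # Noise-conditioning augmentation: corrupt the low-resolution image with a
    # fixed, small amount of noise; the timestep used is later passed to the
    # model through `class_labels`, mirroring the training loop above.
    low_res_timesteps = torch.full(
        (low_res_image.shape[0],), noise_level, device=device, dtype=torch.long
    )
    noisy_low_res_image = scheduler.add_noise(
        original_samples=low_res_image,
        noise=torch.randn_like(low_res_image),
        timesteps=low_res_timesteps,
    )

    # Start from pure noise with the spatial shape of the latent space, which
    # by construction matches the spatial shape of the low-resolution image.
    latents = torch.randn(
        (low_res_image.shape[0], 3) + tuple(low_res_image.shape[2:]), device=device
    )

    scheduler.set_timesteps(num_inference_steps=1000)
    for t in scheduler.timesteps:
        # 3 latent channels + 1 low-resolution channel -> in_channels=4.
        latent_model_input = torch.cat([latents, noisy_low_res_image], dim=1)
        timesteps = torch.full((latents.shape[0],), int(t), device=device, dtype=torch.long)
        noise_pred = unet(x=latent_model_input, timesteps=timesteps, class_labels=low_res_timesteps)
        # MONAI's DDPM scheduler step returns (previous sample, predicted original sample).
        latents, _ = scheduler.step(noise_pred, t, latents)

    # Decode the denoised latent (3 channels, hence out_channels=3) to image space.
    high_res_image = autoencoder.decode_stage_2_outputs(latents / scale_factor)
```

The training loop changed in the `@@ -993` hunk performs the same concatenation, except that the latent comes from encoding a real high-resolution image rather than from pure noise.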