Commit

Add more information on super-resolution files.
Signed-off-by: Virginia Fernandez <[email protected]>
Virginia Fernandez committed Sep 25, 2024
1 parent 019a40e commit d46e6ce
Showing 2 changed files with 41 additions and 4 deletions.
24 changes: 21 additions & 3 deletions generation/2d_super_resolution/2d_sd_super_resolution.ipynb
@@ -24,7 +24,7 @@
"source": [
"# Super-resolution using Stable Diffusion v2 Upscalers\n",
"\n",
"This tutorial illustrates how to perform super-resolution on medical images using Latent Diffusion Models (LDMs) [1]. For that, we use an autoencoder to obtain a latent representation of the high-resolution images. Then, we train a diffusion model to infer this latent representation when conditioned on a low-resolution image. \n",
"This tutorial illustrates how to perform **super-resolution** on medical images using Latent Diffusion Models (LDMs) [1]. The idea is that, given a low-resolution image, we train a spatial autoencoder with a latent space of the same spatial size of the low resolution, so that high resolution images are encoded into a latent space of the same size of the low resolution image. The LDM then learns how to go from **noise to a latent representation of a high resolution image**. On training and inference, the **low resolution image is concatenated to the latent**, to condition the generative process. Finally, the high resolution latent representation is decoded into a high resolution image. \n",
"\n",
"To improve the performance of our models, we will use a method called \"noise conditioning augmentation\" (introduced in [2] and used in Stable Diffusion v2.0 and Imagen Video [3]). During the training, we add noise to the low-resolution images using a random signal-to-noise ratio, and we condition the diffusion models on the amount of noise added. At sampling time, we use a fixed signal-to-noise ratio, representing a small amount of augmentation that aids in removing artefacts in the samples.\n",
"\n",
@@ -416,6 +416,14 @@
"## Train Autoencoder"
]
},
{
"cell_type": "markdown",
"id": "a93437fe-d6ef-42d2-bedd-4da735c59dd1",
"metadata": {},
"source": [
"In this section, we train a spatial autoencoder to learn how to compress high-resolution images into a latent space representation. We need to ensure that the latent space spatial shape matches that of the low resolution images."
]
},
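As a hedged sketch of what this looks like in practice (the configuration below is illustrative, not the notebook's exact one; it assumes 64x64 high-resolution and 16x16 low-resolution images, so two downsampling levels give the required 4x compression):

```python
import torch
from monai.networks.nets import AutoencoderKL

# Three channel levels -> two downsamplings, so 64x64 inputs yield 16x16 latents.
# Note: the `channels` argument is named `num_channels` in some MONAI versions.
autoencoder = AutoencoderKL(
    spatial_dims=2,
    in_channels=1,
    out_channels=1,
    channels=(64, 128, 128),
    latent_channels=3,
    num_res_blocks=(1, 1, 1),
    attention_levels=(False, False, False),
)

high_res = torch.randn(1, 1, 64, 64)
z_mu, z_sigma = autoencoder.encode(high_res)
print(z_mu.shape)  # torch.Size([1, 3, 16, 16]), matching the low-res spatial size
```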
{
"cell_type": "code",
"execution_count": 30,
@@ -733,7 +741,9 @@
"source": [
"## Train Diffusion Model\n",
"\n",
"In order to train the diffusion model to perform super-resolution, we will need to concatenate the latent representation of the high-resolution with the low-resolution image. For this, we create a Diffusion model with `in_channels=4`. Since only the outputted latent representation is interesting, we set `out_channels=3`."
"In order to train the diffusion model to perform super-resolution, we will need to **concatenate the latent representation of the high-resolution with the low-resolution image**. Therefore, the number of input channels to the diffusion model will be the sum of the number of channels in the low-resolution, 1, and the number of channels of the high-resolution image latent representation (3). In this case, we create a Diffusion model with `in_channels=4`. Since only the output latent representation is interesting, we set `out_channels=3`. \n",
"\n",
"**At inference time** we do not have a high-resolution image. Instead, we pass the concatenation of the low resolution image, and noise of the same shape as the latent space representation."
]
},
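A hedged sketch of such a network (the channel sizes and attention settings below are illustrative assumptions; `num_class_embeds` is included because the training loop conditions on the low-resolution noise level via `class_labels`):

```python
from monai.networks.nets import DiffusionModelUNet

# in_channels = 3 (high-res latent) + 1 (low-res image); out_channels = 3 (latent noise prediction).
# Note: the `channels` argument is named `num_channels` in some MONAI versions.
unet = DiffusionModelUNet(
    spatial_dims=2,
    in_channels=4,
    out_channels=3,
    channels=(128, 256, 512),
    attention_levels=(False, True, True),
    num_res_blocks=2,
    num_head_channels=64,
    num_class_embeds=1000,  # enables conditioning on the low-res noise level via class_labels
)
```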
{
@@ -993,7 +1003,7 @@
" noisy_low_res_image = scheduler.add_noise(\n",
" original_samples=low_res_image, noise=low_res_noise, timesteps=low_res_timesteps\n",
" )\n",
"\n",
" # Here we concatenate the HR latent and thje low resolution image.\n",
" latent_model_input = torch.cat([noisy_latent, noisy_low_res_image], dim=1)\n",
"\n",
" noise_pred = unet(x=latent_model_input, timesteps=timesteps, class_labels=low_res_timesteps)\n",
@@ -1098,6 +1108,14 @@
"### Plotting sampling example"
]
},
{
"cell_type": "markdown",
"id": "1a2813d4-9087-459e-8913-bce174ac31cd",
"metadata": {},
"source": [
"As mentioned above, at inference time, we only need to pass noise of the same shape of the latent concatenated to the low-resolution image, to get the latent representation of the corresponding high-resolution image."
]
},
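A hedged sampling sketch, reusing names from the training loop (`unet`, `scheduler`, `low_res_image`, and `device` are assumed to exist; the shapes and the inference step count are illustrative):

```python
import torch

# Start from pure noise with the latent's shape (3 channels, low-res spatial size).
latents = torch.randn((1, 3, 16, 16), device=device)

# Fixed, small noise level for the low-res conditioning image at inference.
low_res_timesteps = torch.full((1,), 1, device=device, dtype=torch.long)
noisy_low_res_image = scheduler.add_noise(
    original_samples=low_res_image,
    noise=torch.randn_like(low_res_image),
    timesteps=low_res_timesteps,
)

scheduler.set_timesteps(num_inference_steps=1000, device=device)
for t in scheduler.timesteps:
    with torch.no_grad():
        model_input = torch.cat([latents, noisy_low_res_image], dim=1)
        noise_pred = unet(
            x=model_input,
            timesteps=t.unsqueeze(0).to(device),
            class_labels=low_res_timesteps,
        )
    # MONAI schedulers return (previous_sample, predicted_original_sample).
    latents, _ = scheduler.step(noise_pred, t, latents)
```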
{
"cell_type": "code",
"execution_count": 47,
21 changes: 20 additions & 1 deletion (second changed file)
@@ -404,6 +404,14 @@
"## Train Autoencoder"
]
},
{
"cell_type": "markdown",
"id": "e740cb2d-5a57-42ed-806b-e8c720a6f922",
"metadata": {},
"source": [
"In this section, we train a spatial autoencoder to learn how to compress high-resolution images into a latent space representation. We need to ensure that the latent space spatial shape matches that of the low resolution images."
]
},
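For example, under assumed sizes (purely illustrative numbers), the required autoencoder depth follows directly from the ratio of the two resolutions:

```python
import math

high_res_size, low_res_size = 64, 16  # assumed sizes for illustration
# Each autoencoder level halves the spatial size, so log2(ratio) downsamplings are needed.
num_downsamplings = int(math.log2(high_res_size / low_res_size))
assert high_res_size // 2**num_downsamplings == low_res_size
print(num_downsamplings)  # 2, i.e. the channel tuple needs num_downsamplings + 1 entries
```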
{
"cell_type": "code",
"execution_count": 10,
@@ -708,6 +716,7 @@
"metadata": {},
"source": [
"## Define the LightningModule for DiffusionModelUnet (transforms, network, loaders, etc)\n",
"\n",
"The LightningModule contains a refactoring of your training code. The following module is a reformating of the code in 2d_stable_diffusion_v2_super_resolution."
]
},
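A minimal, assumption-laden skeleton of such a module (`unet`, `autoencoder`, `scheduler`, the batch keys, and the hyperparameters are placeholders rather than the notebook's exact code):

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class SuperResolutionDiffusion(pl.LightningModule):
    def __init__(self, unet, autoencoder, scheduler, scale_factor=1.0, max_noise_level=350):
        super().__init__()
        self.unet = unet
        self.autoencoder = autoencoder
        self.scheduler = scheduler
        self.scale_factor = scale_factor
        self.max_noise_level = max_noise_level

    def training_step(self, batch, batch_idx):
        high_res, low_res = batch["image"], batch["low_res_image"]

        # Encode the high-res image into the latent space (frozen autoencoder).
        with torch.no_grad():
            latent = self.autoencoder.encode_stage_2_inputs(high_res) * self.scale_factor

        # Standard diffusion training target: predict the added noise.
        noise = torch.randn_like(latent)
        timesteps = torch.randint(
            0, self.scheduler.num_train_timesteps, (latent.shape[0],), device=self.device
        ).long()
        noisy_latent = self.scheduler.add_noise(latent, noise, timesteps)

        # Noise conditioning augmentation on the low-resolution image.
        low_res_timesteps = torch.randint(
            0, self.max_noise_level, (low_res.shape[0],), device=self.device
        ).long()
        noisy_low_res = self.scheduler.add_noise(
            low_res, torch.randn_like(low_res), low_res_timesteps
        )

        model_input = torch.cat([noisy_latent, noisy_low_res], dim=1)
        noise_pred = self.unet(x=model_input, timesteps=timesteps, class_labels=low_res_timesteps)
        loss = F.mse_loss(noise_pred.float(), noise.float())
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.unet.parameters(), lr=1e-4)
```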
@@ -853,7 +862,9 @@
"source": [
"## Train Diffusion Model\n",
"\n",
"In order to train the diffusion model to perform super-resolution, we will need to concatenate the latent representation of the high-resolution with the low-resolution image. For this, we create a Diffusion model with `in_channels=4`. Since only the outputted latent representation is interesting, we set `out_channels=3`.\n",
"In order to train the diffusion model to perform super-resolution, we will need to **concatenate the latent representation of the high-resolution with the low-resolution image**. Therefore, the number of input channels to the diffusion model will be the sum of the number of channels in the low-resolution, 1, and the number of channels of the high-resolution image latent representation (3). In this case, we create a Diffusion model with `in_channels=4`. Since only the output latent representation is interesting, we set `out_channels=3`. \n",
"\n",
"**At inference time** we do not have a high-resolution image. Instead, we pass the concatenation of the low resolution image, and noise of the same shape as the latent space representation.\n",
"\n",
"As mentioned, we will use the conditioned augmentation (introduced in [2] section 3 and used on Stable Diffusion Upscalers and Imagen Video [3] Section 2.5) as it has been shown critical for cascaded diffusion models, as well for super-resolution tasks. For this, we apply Gaussian noise augmentation to the low-resolution images. We will use a scheduler low_res_scheduler to add this noise, with the t step defining the signal-to-noise ratio and use the t value to condition the diffusion model (inputted using class_labels argument)."
]
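Concretely, a short sketch of this augmentation (the scheduler configuration, tensor shapes, and `max_noise_level` cap are assumptions for illustration):

```python
import torch
from monai.networks.schedulers import DDPMScheduler

low_res_scheduler = DDPMScheduler(num_train_timesteps=1000)
low_res_image = torch.randn(4, 1, 16, 16)  # dummy low-res batch for illustration
max_noise_level = 350  # assumed cap on the augmentation noise level

# Draw a random noise level per image and noise the low-res input accordingly.
low_res_timesteps = torch.randint(0, max_noise_level, (low_res_image.shape[0],)).long()
noisy_low_res_image = low_res_scheduler.add_noise(
    original_samples=low_res_image,
    noise=torch.randn_like(low_res_image),
    timesteps=low_res_timesteps,
)
# The same low_res_timesteps are later passed to the UNet via class_labels.
```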
@@ -1160,6 +1171,14 @@
"### Plotting sampling example"
]
},
{
"cell_type": "markdown",
"id": "19ba049e-fca6-4c76-b7b1-7e992d370583",
"metadata": {},
"source": [
"As mentioned above, at inference time, we only need to pass noise of the same shape of the latent concatenated to the low-resolution image, to get the latent representation of the corresponding high-resolution image."
]
},
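After the denoising loop, only the decoding step remains; a one-step sketch with assumed names (`autoencoder`, `latents`, `scale_factor`):

```python
# Undo the latent scaling, then map the denoised latent back to image space.
with torch.no_grad():
    high_res_sample = autoencoder.decode_stage_2_outputs(latents / scale_factor)
```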
{
"cell_type": "code",
"execution_count": 26,
