
[Community Pipeline] Add 🪆Matryoshka Diffusion Models #9157

Draft · wants to merge 49 commits into main
Conversation

@tolgacangoz (Contributor) commented on Aug 12, 2024

Thanks for the opportunity to work on this model!

The Abstract of the paper (emphasis is mine):

Diffusion models are the de-facto approach for generating high-quality images and videos but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to training cascaded models in pixel space, or using a downsampled latent space of a separately trained auto-encoder. In this paper, we introduce Matryoshka Diffusion (MDM), an end-to-end framework for high-resolution image and video synthesis. We propose a diffusion process that denoises inputs at multiple resolutions jointly and uses a NestedUNet architecture where features and parameters for small-scale inputs are nested within those of large scales. In addition, MDM enables a progressive training schedule from lower to higher resolutions which leads to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a single pixel-space model at resolutions of up to 1024 × 1024 pixels, demonstrating strong zero-shot generalization using the CC12M dataset, which contains only 12 million images.

Paper: 🪆Matryoshka Diffusion Models
Repository: https://github.com/apple/ml-mdm
License: MIT license


Key takeaways from the paper:

  • VAE: none, since Matryoshka Diffusion Models operate directly in (extended) pixel space(s).
  • Text-encoder: flan-t5-xl
  • Enables:
    1. a multi-resolution loss that greatly improves the convergence speed of high-resolution input denoising (see the sketch after this list).
    2. an efficient progressive training schedule that starts by training a low-resolution diffusion model and gradually adds higher-resolution inputs and outputs, which speeds up overall convergence.
  • MDM allows training high-resolution models without resorting to cascaded models (since each sub-model is trained separately, generation quality can be bottlenecked by exposure bias (Bengio et al., 2015) from imperfect predictions, and several models must be trained for the different resolutions), latent diffusion (which not only increases the complexity of learning but also bounds the generation quality due to the lossy compression), or other end-to-end models (which do not fully exploit the innate structure of hierarchical generation, so their results lag behind cascaded and latent models).
  • Resolution-specific noise schedules are used.
  • More computation is allocated to the low-resolution feature maps.
  • MDM has extensive parameter sharing across resolutions.
  • The authors observe that increasing from two resolution levels to three consistently improves the model's convergence, while adding nesting levels brings only negligible cost.
  • LDM and MDM methods are complementary. It is possible to build MDM on top of autoencoder codes.
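
The multi-resolution loss and the resolution-specific noise schedules mentioned above can be sketched roughly as follows. This is an illustration, not the PR's code: `model`, `alpha_bars` (one ᾱ schedule per resolution), and the ε-prediction interface are assumptions, and the paper's exact loss weighting and schedule shifts are omitted.

```python
import torch
import torch.nn.functional as F

def q_sample(x0, noise, alpha_bar_t):
    # Standard DDPM forward process: x_t = sqrt(a_bar_t)*x_0 + sqrt(1 - a_bar_t)*eps
    return alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * noise

def multi_resolution_loss(model, x0, t, text_embeds, alpha_bars,
                          resolutions=(64, 256, 1024)):
    # The same image, resized to every nesting level's resolution.
    xs = [F.interpolate(x0, size=(r, r), mode="bilinear", antialias=True)
          for r in resolutions]
    eps = [torch.randn_like(x) for x in xs]
    # Each resolution uses its own (shifted) noise schedule, as in the paper.
    noisy = [q_sample(x, e, alpha_bars[r][t].view(-1, 1, 1, 1))
             for x, e, r in zip(xs, eps, resolutions)]
    # The NestedUNet is assumed to return one epsilon-prediction per level.
    preds = model(noisy, t, text_embeds)
    return sum(F.mse_loss(p, e) for p, e in zip(preds, eps)) / len(xs)
```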

TODOs:
✅ The U-Net, i.e. the inner-most structure, NestedUNet2DConditionModel(nesting_level=0), would approximately be configured as follows:

```python
UNet2DConditionModel(
    in_channels=3, out_channels=3, block_out_channels=(256, 512, 768),
    cross_attention_dim=2048, resnet_time_scale_shift='scale_shift',
    down_block_types=('DownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D'),
    up_block_types=('CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'UpBlock2D'),
    ff_act_fn='gelu', transformer_layers_per_block=[0, 1, 5],
    use_linear_projection='no_projection', attention_bias=True,
    norm_type='layer_norm_matryoshka', ff_norm_type='group_norm_matryoshka',
    cross_attention_norm='layer_norm', attention_pre_only=True,
    encoder_hid_dim_type='text_proj', encoder_hid_dim=2048,
    flip_sin_to_cos=False, masked_cross_attention=False,
    micro_conditioning_scale=64, addition_embed_type='matryoshka')
```
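
For intuition only, here is a toy version of the nesting idea (NestedUNet2DConditionModel itself is far more elaborate): each level denoises its own resolution, and the inner level's output feeds the next level up. Every name below is hypothetical.

```python
from typing import List, Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNestedUNet(nn.Module):
    """Toy model: the level-0 net is literally nested inside level 1, and so on."""
    def __init__(self, inner: Optional["TinyNestedUNet"] = None, channels: int = 64):
        super().__init__()
        self.inner = inner                       # sub-model for the smaller resolutions
        self.encode = nn.Conv2d(3, channels, 3, padding=1)
        self.fuse = nn.Conv2d(channels + 3, channels, 3, padding=1)
        self.decode = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, xs: List[torch.Tensor]) -> List[torch.Tensor]:
        # xs is ordered from the smallest to the largest resolution.
        if self.inner is None:                   # nesting_level == 0
            return [self.decode(self.encode(xs[0]))]
        inner_outs = self.inner(xs[:-1])         # denoise the smaller levels first
        low = F.interpolate(inner_outs[-1], size=xs[-1].shape[-2:], mode="nearest")
        h = self.fuse(torch.cat([self.encode(xs[-1]), low], dim=1))
        return inner_outs + [self.decode(h)]

# Three nesting levels, mirroring the 64 -> 256 -> 1024 setup:
net = TinyNestedUNet(inner=TinyNestedUNet(inner=TinyNestedUNet()))
outs = net([torch.randn(1, 3, r, r) for r in (64, 256, 1024)])
```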

⏳ Scheduler(s)
NestedUNet2DConditionModel(nesting_level=(1, 2))
convert_matryoshka_model_to_diffusers.py
⏳ Verify outputs with the original implementation (a minimal parity-check sketch follows this list) for:

  • 64×64, nesting_level=0
  • 256×256, nesting_level=1
  • 1024×1024, nesting_level=2

⬜ Show example results
⏳ Upload converted checkpoints to HF
README.md
examples/**/train_matryoshka.py
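
For the verification items, a minimal parity check might look like the sketch below; `original_mdm_unet` and `ported_unet` are placeholders for the apple/ml-mdm model and the converted diffusers model, and the call signatures, embedding shape, and tolerance are all assumptions.

```python
import torch

def check_parity(original_mdm_unet, ported_unet, atol=1e-4):
    """Run both UNets on identical inputs and compare their outputs."""
    torch.manual_seed(0)
    sample = torch.randn(1, 3, 64, 64)        # 64x64, i.e. nesting_level=0
    timestep = torch.tensor([500])
    prompt_embeds = torch.randn(1, 77, 2048)  # flan-t5-xl, projected dim assumed

    with torch.no_grad():
        ref = original_mdm_unet(sample, timestep, prompt_embeds)
        out = ported_unet(sample, timestep, encoder_hidden_states=prompt_embeds).sample

    max_diff = (ref - out).abs().max().item()
    assert torch.allclose(ref, out, atol=atol), f"max abs diff: {max_diff}"
```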

@tolgacangoz tolgacangoz changed the title Add Matryoshka Diffusion Models Add 🪆Matryoshka Diffusion Models Aug 12, 2024
@sayakpaul (Member)

@tolgacangoz would you have cycles to work on this soon? Another contributor has expressed interest in working on it. Maybe you two could collaborate?

@tolgacangoz (Contributor, Author)

I am working on the inference code at the moment. Will the training code in examples/**/train_matryoshka.py be implemented as well (since this model is very efficient to train)? If so, he can take this up.

@sayakpaul (Member)

For now, we don't have to focus on training.

@tolgacangoz tolgacangoz changed the title Add 🪆Matryoshka Diffusion Models [Community Pipeline] Add 🪆Matryoshka Diffusion Models Sep 7, 2024