[experiment] CogVideoX 🤝🏼 FreeNoise #9389

a-r-r-o-w · 2024-09-08T11:36:12Z

What does this PR do?

Attempts to make FreeNoise work with CogVideoX to enable longer video generation, prompt interpolation (similar to Luma AI Keyframes, but without image conditioning, since the available Cog models are txt2vid only), etc.

Code

import torch
from diffusers import CogVideoXPipeline, CogVideoXDPMScheduler
from diffusers.utils import export_to_video

# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
model_id = "THUDM/CogVideoX-2b"

pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

pipe.enable_free_noise()

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)

video = pipe(
    prompt=prompt,
    num_frames=81,
    guidance_scale=6,
    use_dynamic_cfg=True,
    num_inference_steps=50,
).frames[0]

filename = "output.mp4"
export_to_video(video, filename, fps=8)

Here are some preliminary txt2vid results:

cogvideox_freenoise--context_stride-2.mp4	cogvideox_freenoise--context_stride-4.mp4
cogvideox_freenoise--context_stride-6.mp4	cogvideox_freenoise--context_stride-8.mp4

I also tried the context-window method with temporal tiling on the transformer in the pipeline, but the results were poorer than with FreeNoise. I don't think it's worth supporting, since we don't support it for AnimateDiff either. FreeNoise performs better, although there is some visible repetition for the same prompt. This is most likely because noise_type in the generations above was "shuffle_context". I expect improvements from better initial noise generation strategies.
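
To illustrate why "shuffle_context" noise might lead to repetition, here is a rough index-level sketch of shuffle-based noise rescheduling. It is not the code in this PR: the function name `shuffle_context_sources` and its parameters are hypothetical, and it only tracks which latent frame each frame's initial noise would be copied from, assuming the first context window keeps independent noise and later frames reuse shuffled noise from earlier ones.

```python
import random


def shuffle_context_sources(num_frames, context_length, context_stride, seed=0):
    """Hypothetical helper: for each latent frame, return the index of the
    frame whose initial noise it reuses under a shuffle-style scheme."""
    rng = random.Random(seed)
    # Frames in the first context window keep their own independent noise.
    source = list(range(num_frames))
    for start in range(context_length, num_frames, context_stride):
        # Take a stride-sized chunk located one context length earlier...
        window_start = start - context_length
        window = source[window_start:window_start + context_stride]
        # ...shuffle its frame order...
        rng.shuffle(window)
        # ...and reuse that noise for the next chunk (truncated at the end).
        end = min(num_frames, start + context_stride)
        source[start:end] = window[:end - start]
    return source


# 21 latent frames with the settings discussed in this PR description.
sources = shuffle_context_sources(num_frames=21, context_length=13, context_stride=4)
print(sources)
```

Under these assumptions, every frame past the first window reuses noise from one of the first 13 frames, only in a different order, which is consistent with similar content reappearing later in the video.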

A context_length of 13 and a context_stride of 4 seem to produce the best results across a variety of prompts. These are latent-frame counts and correspond to 49 and 16 pixel-space frames respectively.
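
The frame arithmetic can be sketched as follows. This assumes a temporal compression factor of 4 in the CogVideoX VAE, with the first frame encoded on its own, so that N latent frames map to (N - 1) * 4 + 1 pixel frames. The helper names are hypothetical, and the sliding-window layout is only an illustration of how a context length and stride could partition the latent frames, not necessarily what this PR does internally.

```python
TEMPORAL_COMPRESSION = 4  # assumption: CogVideoX VAE temporal scale factor


def latent_to_pixel_frames(latent_frames):
    # The first latent frame covers one pixel frame, each later one covers four.
    return (latent_frames - 1) * TEMPORAL_COMPRESSION + 1


def pixel_to_latent_frames(pixel_frames):
    return (pixel_frames - 1) // TEMPORAL_COMPRESSION + 1


def stride_in_pixel_frames(latent_stride):
    # A stride is a shift between windows, so it scales without the +1 offset.
    return latent_stride * TEMPORAL_COMPRESSION


def context_windows(num_latent_frames, context_length, context_stride):
    # Hypothetical sliding-window layout over the latent frames.
    last_start = num_latent_frames - context_length
    return [(s, s + context_length) for s in range(0, last_start + 1, context_stride)]


print(latent_to_pixel_frames(13))      # 49
print(stride_in_pixel_frames(4))       # 16
print(pixel_to_latent_frames(81))      # 21
print(context_windows(21, 13, 4))      # [(0, 13), (4, 17), (8, 21)]
```

With num_frames=81 as in the example above, this gives 21 latent frames, which three windows of length 13 at stride 4 cover exactly.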

TODO:

Add support for prompt interpolation (tricky)
Add support for Cog-5b (tricky too, and might require some refactoring of the last normalization layer)
Tests
Docs

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@DN6 @yiyixuxu @asomoza

@zRzRzRzRzRzRzR @wenyihong because changes involve CogVideoX

@arthur-qiu @tin2tin for visibility

HuggingFaceDocBuilderDev · 2024-09-08T11:42:17Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

tin2tin · 2024-09-09T08:26:50Z

Happy to see work being done on this. Tried to run the code on a 4090. It successfully produced a clip.

The first step takes 8-10 minutes and needs 20 GB of VRAM.
The second step has no progress bar and needs 36 GB of VRAM.
Even after the process has finished, the 36 GB remain allocated in (V)RAM.

output.mp4

a-r-r-o-w added 4 commits September 7, 2024 23:44

update cogvideox freenoise progress

6e03e72

update progress

a012fa5

fix bugs

17b7f8a

make style

2e7502f

a-r-r-o-w added 6 commits September 10, 2024 13:51

Merge branch 'main' into cogvideox/freenoise

c9454bd

update progress

052eeb5

update progress

cce65ab

update progress

e07fe04

update

9aa2e97

Merge branch 'main' into cogvideox/freenoise

43ec0bd
