WIP: video_unet_generator_attn #669
Conversation
UNet = ((ResBlock + Attention) * 2) * 4 for input_blocks

)
]
ch = int(mult * self.inner_channel)
if ds in attn_res:
The variable `attn_res` should not condition the motion module (MM). The MM is mandatory, not conditioned, I believe.
Also, `attn_res` conditions the `AttentionBlock` in the frame-only UNet, and we should keep that code here as well, because the MM is an addition to any configuration of the frame-only UNet.
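A minimal sketch of the layer layout this comment suggests, with stub classes standing in for joliGEN's real `AttentionBlock` and `MotionModule` (whose actual constructors take more arguments): the within-frame attention stays gated on `attn_res`, while the MM is always appended.

```python
# Stub layers: placeholders for joliGEN's AttentionBlock / MotionModule.
class AttentionBlock:
    def __init__(self, channels):
        self.channels = channels

class MotionModule:
    def __init__(self, channels):
        self.channels = channels

def build_level(ch, ds, attn_res):
    layers = []
    # Within-frame attention stays conditioned on the resolution,
    # exactly as in the frame-only UNet.
    if ds in attn_res:
        layers.append(AttentionBlock(ch))
    # The motion module is mandatory: appended at every resolution,
    # independently of attn_res.
    layers.append(MotionModule(ch))
    return layers

layers = build_level(ch=256, ds=4, attn_res=(4, 8))
# -> [AttentionBlock, MotionModule]
```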
efficient=efficient,
freq_space=self.freq_space,
),
MotionModule(
Let's verify this, because:
- the frame-only UNet has a "within-frame" `AttentionBlock` here that needs to be kept;
- I'm not sure the MM applies to the bottleneck: please double-check in the publications and code.
`AttentionBlock` is kept; the MM is added after it.
In the publication's code, whether MM is applied to the bottleneck depends on two options. However, in the two illustration figures in the publication, the bottleneck does not have MM.
)
]
ch = int(self.inner_channel * mult)
if ds in attn_res:
Same remark here.
Since joliGEN's temporal DDPM uses `use_temporal`, it creates tensors of shape (b, f, c, h, w), which differs from the original paper's format of (b, c, f, h, w). So in this version, all tensors flow in (b, f, c, h, w) format. For compatibility with the other models in joliGEN, would it be advantageous to treat the tensor in 4D format during training?
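One way to reconcile the 5D (b, f, c, h, w) layout with 4D-only modules, sketched here with NumPy for illustration (this is not joliGEN code), is to fold the frame axis into the batch before the spatial blocks and unfold it afterwards:

```python
import numpy as np

def to_4d(x):
    # (b, f, c, h, w) -> (b*f, c, h, w); remember b so we can invert later
    b, f, c, h, w = x.shape
    return x.reshape(b * f, c, h, w), b

def to_5d(x, b):
    # (b*f, c, h, w) -> (b, f, c, h, w)
    bf, c, h, w = x.shape
    return x.reshape(b, bf // b, c, h, w)

x = np.random.rand(2, 8, 3, 16, 16)   # (b, f, c, h, w)
x4, b = to_4d(x)                      # (16, 3, 16, 16)
assert np.array_equal(to_5d(x4, b), x)  # the round trip is lossless
```

Because frames simply become extra batch elements, every existing 4D block works unchanged; only the temporal modules need the unfolded 5D view.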
works with the command line: `python3 -W ignore::UserWarning train.py`
launch inference: `cd scripts/`
create videos with this command line: `cd scripts/`
models/palette_model.py
Outdated
for k in range(min(nb_imgs, self.get_current_batch_size())):
    self.fake_B_pool.query(self.visuals[k : k + 1])

if self.opt.G_netG == "unet_vid":
else ?
efficient=efficient,
freq_space=self.freq_space,
),
# MotionModule(
Remove commented code ?
# attention, what we cannot get enough of
### attention_score get
# hidden_states_select = self._attention(query, key, value, attention_mask)
remove commented code ?
data/__init__.py
Outdated
@@ -61,11 +61,20 @@ def create_dataloader(opt, rank, dataset, batch_size):

def create_dataset_temporal(opt, phase):
    dataset_class = find_dataset_using_name("temporal_labeled_mask_online")
    dataset_class = find_dataset_using_name(
I believe this function needs to be changed so that either `temporal_labeled_mask_online` or `self_supervised_temporal_labeled_mask_online` is selected, based on whether `cut` or `palette` is running.
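A sketch of the selection the comment asks for. `find_dataset_using_name` is joliGEN's real helper in `data/__init__.py`, but the option used to detect the running model and the cut/palette-to-dataset mapping are assumptions made here for illustration:

```python
# Assumed mapping: palette (self-supervised diffusion) uses the
# self_supervised dataset, other models (e.g. cut) the labeled one.
TEMPORAL_DATASETS = {
    "cut": "temporal_labeled_mask_online",
    "palette": "self_supervised_temporal_labeled_mask_online",
}

def pick_temporal_dataset_name(model_type):
    # Fall back to the labeled dataset for any unlisted model type.
    return TEMPORAL_DATASETS.get(model_type, "temporal_labeled_mask_online")

# create_dataset_temporal would then call joliGEN's real helper:
#     dataset_class = find_dataset_using_name(
#         pick_temporal_dataset_name(opt.model_type))
print(pick_temporal_dataset_name("palette"))
```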
# sort
self.A_img_paths.sort(key=natural_keys)
self.A_label_mask_paths.sort(key=natural_keys)
if self.use_domain_B:
In the `self_supervised` dataloader, domain B is not needed.
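A minimal sketch of the guard this implies: skip every domain-B code path when the dataset does not use domain B. The attribute names follow the quoted diff; the class itself is a stub, not joliGEN's dataloader, and `natural_keys` is reimplemented here for self-containment.

```python
import re

def natural_keys(s):
    # Numeric-aware ordering: "f10.png" sorts after "f2.png".
    return [int(t) if t.isdigit() else t for t in re.split(r"(\d+)", s)]

class TemporalPathsStub:
    def __init__(self, use_domain_B):
        self.use_domain_B = use_domain_B
        self.A_img_paths = ["f10.png", "f2.png", "f1.png"]
        self.B_img_paths = ["g2.png", "g1.png"] if use_domain_B else []

    def sort_paths(self):
        self.A_img_paths.sort(key=natural_keys)
        if self.use_domain_B:  # never entered for self_supervised data
            self.B_img_paths.sort(key=natural_keys)

ds = TemporalPathsStub(use_domain_B=False)
ds.sort_paths()
# ds.A_img_paths -> ["f1.png", "f2.png", "f10.png"]
```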
create one unit test file "test_run_video_diffusion_online.py" for unit testing
Force-pushed from c7888cb to 1fe3f60
during inference, additional frames beyond the specified
scripts/gen_vid_diffusion.py
Outdated
@@ -346,7 +346,8 @@ def generate(
bbox_select[3] = min(img.shape[0], bbox_select[3])
else:
    bbox = bboxes[bbox_idx]

opt.data_online_creation_load_size_A = (1280, 720)
In general we don't want hardcoded values here.
We temporarily did this to run inference with your bdd100k_vid_64_2 model, since in that model opt.data_online_creation_load_size_A is 720. Normally this hardcoded line is not required; it has been deleted.
"train_batch_size": 1,
"data_temporal_number_frames": 8,
"data_temporal_frame_step": 1,
"G_diff_n_timestep_train": 6,
Beware, I don't believe you can theoretically have timestep_test < timestep_train.
Maybe I misunderstood: opt.G_diff_n_timestep_train is 2000 and opt.G_diff_n_timestep_test is 1000 in the default setting?
I had overwritten "G_diff_n_timestep_test"; it is corrected now.
- …poral consistency and inference
- feat(ml): step2 replace AttentionBlock by MotionModule; ResBlock/MotionModule class instances pass
- feat(ml): UNet = ResBlock + Attention (optional) + MM
- feat(ml): create UNetVid class with temporal MHA for U-Net
- feat(ml): add dataloader
- feat(ml): dataloader works with UNet
- feat(ml): dataloader and UNetVid work for input (b,f,c,h,w), not visdom yet
- feat(ml): visdom shows the training
- feat(ml): dataloader with mask
- feat(ml): dataloader fixed with command line
- feat(ml): visdom shows one batch of frames
- feat(ml): frames are treated as a batch, so no additional normalisation is needed
- feat(ml): inference for UNetVid
- feat(ml): use efficient_attention_xformers for attention
- feat(ml): xformer bug PR
- feat(ml): create video based on generated and original images
- feat(ml): remove unnecessary option --UNetVid
- feat(ml): add doc for training and inference
- feat(ml): fix inference paths requirement
- feat(ml): improve the inference for any paths.txt and longer frames
- feat(ml): unit test only for vid
- feat(ml): debug for unit test on metrics
- doc: modify script for inference
- feat(ml): debug inference paths_file
- feat(ml): add an option for max frames
- feat(ml): inference debug, bbox_in not img_in, and for bdd100k video
- feat(ml): delete hardcoding in inference
- feat(ml): dataloader loads frames from the same video
- feat(ml): adapt processing of frames from either a video series or a single video
step0 Get a similar architecture of UNet (ResBlock + AttentionBlock) in joliGEN, comparable to AnimateDiff (ResBlock + TransformerBlock + MotionModule)
step1 Modify ResBlock to process 5D tensor image for input and output
step1.1 Test ResBlock to process 5D tensor for input and output
step1.2 ResBlock embedding ?
step2 Replace AttentionBlock by MotionModule
step2.1 Using code of MotionModule to replace AttentionBlock
step2.2 Test the replacement of AttentionBlock by MotionModule with 5D tensor input and output
step2.3 MotionModule embedding ?
step3 Merge MotionModule in the Video_generator_attn file
step3.1 Align attention heads, input/output channels and maybe other variables; clean up the code ?
step3.2 Test MotionModule in the Video_generator_attn file for 5D tensor input/output
step3.3 Using QKVAttention for attention score calculation for the whole file ?
step4 Test UNet for 5D tensor input/output ?
step5 Create Dataloader
step6 Test UNet with Dataloader
step7 Test for training and visualization with visdom
step8 Inference script
step9 unit test
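The "temporal MHA" idea of steps 2-3 can be sketched in a few lines: attention runs across the frame axis f, with every spatial location treated as an independent sequence. This is a single-head, projection-free NumPy illustration of the reshaping only; joliGEN's MotionModule is considerably more elaborate.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(x):  # x: (b, f, c, h, w)
    b, f, c, h, w = x.shape
    # Fold spatial dims into the batch: each pixel becomes one
    # length-f sequence of c-dimensional tokens.
    seq = x.transpose(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
    # Scaled dot-product attention over frames (no QKV projections).
    attn = softmax(seq @ seq.transpose(0, 2, 1) / np.sqrt(c))  # (bhw, f, f)
    out = attn @ seq                                           # (bhw, f, c)
    # Unfold back to the 5D (b, f, c, h, w) layout.
    return out.reshape(b, h, w, f, c).transpose(0, 3, 4, 1, 2)

y = temporal_attention(np.random.rand(1, 4, 8, 5, 5))
assert y.shape == (1, 4, 8, 5, 5)  # shape is preserved
```

Treating each pixel as its own sequence is what keeps the module purely temporal: spatial mixing is left entirely to the ResBlocks and within-frame AttentionBlocks.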