
Multi-modality Video Understanding

Datasets

You can find the dataset instructions in DATASET. We provide all the metadata files for our data.

Model Zoo

You can find all the models and the scripts in MODEL_ZOO.

Pre-Training

We use CLIP pretrained models as the unmasked teachers by default.

For training, you can simply run the pretraining scripts as follows:

# masked pretraining
bash ./exp_pt/videomamba_middle_5m/run.sh
# further unmasked pretraining for 1 epoch
bash ./exp_pt/videomamba_middle_5m_unmasked/run.sh

Notes:

  1. Set data_dir and the dataset paths (e.g., your_webvid_path) in data.py before running the scripts.
  2. Set the pretrained model path in vision_encoder.pretrained in the corresponding config files.
  3. Set --rdzv_endpoint to your MASTER_NODE:MASTER_PORT in torchrun.sh, as in the sketch after this list.
  4. save_latest=True automatically saves the latest checkpoint during training.
  5. auto_resume=True automatically loads the best or latest checkpoint during training.
  6. For unmasked pretraining, set pretrained_path to load the checkpoint from the masked pretraining stage.
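
As a rough sketch of note 3, a distributed launch with a rendezvous endpoint typically looks like the following; the hostname, port, process counts, and the script/config paths are placeholders for illustration, not values taken from this repository:

# inside torchrun.sh: point the rendezvous endpoint at the master node
MASTER_NODE=node0.example.com   # placeholder hostname
MASTER_PORT=29500               # placeholder port
torchrun --nnodes=1 --nproc_per_node=8 \
    --rdzv_endpoint="${MASTER_NODE}:${MASTER_PORT}" \
    tasks/pretrain.py ./exp_pt/videomamba_middle_5m/config.py   # illustrative script and config paths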

Zero-shot Evaluation

For zero-shot evaluation, you can simply run the evaluation scripts as follows:

bash ./exp_zs/msrvtt/run.sh

Notes:

  1. Set pretrained_path in the running scripts before launching them.
  2. Set zero_shot=True and evaluate=True for zero-shot evaluation, as in the sketch below.
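
As a rough sketch of these two notes, the settings to adjust before launching look like the following; the checkpoint path is a placeholder, and where exactly these fields live (the run script itself or the config file it loads) is an assumption:

# in ./exp_zs/msrvtt/run.sh (or the config file it loads), set:
#   pretrained_path=/path/to/your_pretrained_checkpoint.pth   # placeholder path
#   zero_shot=True
#   evaluate=True
# then launch the zero-shot evaluation on MSRVTT:
bash ./exp_zs/msrvtt/run.sh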