2023-08-04 (v0.1): This repository is the official implementation of COST (Collaborative Three-Stream Transformers for Video Captioning), which was recently accepted by CVIU.
Collaborative Three-Stream Transformers for Video Captioning
Hao Wang, Libo Zhang, Heng Fan, Tiejian Luo
CVIU
Clone this repository and install the dependencies. We have tested our code with `python=3.8.5`, `torch=1.12.1`, and `cuda=11.3.1`. A suitable conda environment named `cost` can be created and activated with the following commands.
```bash
git clone https://github.com/wanghao14/COST.git
cd COST
conda create -n cost python=3.8.5
conda activate cost
pip install -r requirements.txt
```
Note: The METEOR metric requires `java`. You can install it with conda via `conda install openjdk`. Make sure your locale is set correctly, i.e. `echo $LANG` outputs `en_US.UTF-8`.
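On a Linux machine, the JDK install and locale check might look like the following (a minimal sketch; `en_US.UTF-8` availability can vary by system):

```bash
# Install a JDK into the active conda environment (provides the `java`
# binary needed by the METEOR scorer).
conda install -y openjdk

# Verify the locale; METEOR expects a UTF-8 locale such as en_US.UTF-8.
echo $LANG
export LANG=en_US.UTF-8  # set it for the current shell if necessary
```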
Download the appearance and motion features of YouCookII from Google Drive: `rt_yc2_feat.tar.gz` (12 GB), which is repacked from the features provided by densecap, and the detection features extracted by us: `yc2_detect_feat.tar.gz` (34.7 GB). Extract the former so that the files can be found in `data/mart_video_feature/youcook2/*.npy` under this repository, and the latter into `data/yc2_detect_feature/training_aggre/*.npz` (plus the matching validation split under `data/yc2_detect_feature/`). Alternatively, you can specify the paths for reading video features in `dataset.py`.
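A possible extraction sequence is sketched below, assuming both archives have been downloaded into the repository root; the directory structure inside each tarball is an assumption here, so adjust the `-C` targets (or the paths in `dataset.py`) if the extracted layout differs.

```bash
# Create the expected feature directories.
mkdir -p data/mart_video_feature/youcook2 data/yc2_detect_feature

# Appearance and motion features -> data/mart_video_feature/youcook2/*.npy
tar -xzf rt_yc2_feat.tar.gz -C data/mart_video_feature/youcook2

# Detection features -> data/yc2_detect_feature/<split>_aggre/*.npz
tar -xzf yc2_detect_feat.tar.gz -C data/yc2_detect_feature
```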
All hyper-parameters for our experiments can be modified in the config file; `configs/yc2_non_recurrent.yaml` is used by default in the current version.
```bash
# Train COST on YouCookII
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py
```
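If you have a different number of GPUs, the same launcher should work with the device list and process count adjusted accordingly; for example, a single-GPU run might look like this (a sketch, not verified against the repository's distributed setup):

```bash
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes=1 --nproc_per_node=1 train.py
```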
To validate the provided checkpoints, just modify two values in `configs/yc2_non_recurrent.yaml`:

```yaml
validate: true
exp:
  load_model: "${PATH_TO_CHECKPOINT}"
```
and run the same command as for training. You can download our pretrained model from Google Drive.
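Concretely, an evaluation run might look like the following sketch; the checkpoint path is a hypothetical placeholder for wherever you saved the downloaded model.

```bash
# configs/yc2_non_recurrent.yaml now contains:
#   validate: true
#   exp:
#     load_model: "checkpoints/cost_yc2.pth"  # hypothetical path
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py
```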
- Quantitative results:
  B@4, M, C, and R@4 denote BLEU@4, METEOR, CIDEr-D, and Repetition@4, respectively. The results in the first five rows are evaluated in paragraph-level mode, while the last row uses micro-level mode.

  | Experiment      | B@4   | M     | C     | R@4  |
  |-----------------|-------|-------|-------|------|
  | yc2(val)_TSN    | 9.47  | 17.67 | 45.54 | 4.04 |
  | yc2(val)_COOT   | 11.56 | 19.67 | 60.78 | 6.63 |
  | anet(val)_TSN   | 11.22 | 16.58 | 25.70 | 7.09 |
  | anet(test)_TSN  | 11.14 | 15.91 | 24.77 | 5.86 |
  | anet(test)_COOT | 11.88 | 15.70 | 29.64 | 6.11 |
  | msvd(test)      | 56.8  | 37.2  | 99.2  | 74.3 |
- Qualitative results:
I have been a little busy recently; the following items will be pushed forward in my free time.
- Release the initial version, which supports multi-GPU training and inference on YouCookII
- Release pre-trained models and support training with COOT features as input
- Release detection features and pre-trained models, and support training on ActivityNet-Captions
- Provide instructions for evaluation on Internet videos
We would like to thank the authors of MART and COOT for sharing their code.