
Do you plan to release a notebook demo? #2

Open · ooza opened this issue Sep 6, 2024 · 6 comments


ooza commented Sep 6, 2024

No description provided.


ooza commented Sep 6, 2024

Thanks for the great work! I have a small dataset of 30 video clips and I want to perform zero-shot action recognition with your model. Do you have a simple demo file that I can use? Or could you tell me which function/script/config I should update to work on custom videos?


byminji commented Sep 6, 2024

Hi @ooza, thank you for your interest in our work!

I will share a sample notebook demo in the coming days.

If you want to use your custom dataset before then, please follow the instructions below (please also refer to the example instructions for public datasets in DATASETS.md).

  1. Put all your custom videos under the /PATH/TO/VIDEOS folder.
  2. Create a label file in tc-clip/labels/custom_dataset_labels.csv. The format should look like:
id,name
0,abseiling
1,air drumming
2,answering questions
3,applauding
...
  3. Create an annotation file that contains a list of video filenames and their corresponding labels in tc-clip/datasets_splits/custom_dataset_anns.txt. Each line of the txt file should be <filename> <class id>. For example, suppose that we have aaa.mp4, bbb.mp4, ..., zzz.mp4 under the /PATH/TO/VIDEOS folder:
aaa.mp4 0
bbb.mp4 0
...
zzz.mp4 3
  4. Create a dataset yaml file for your custom dataset in tc-clip/configs/data/custom_dataset.yaml. Below is an example for the inference-only case:
#@package _global_
data:
  test:
    - name: custom_dataset
      protocol: top1
      dataset_list:
      - dataset_name: custom_dataset
        root: /PATH/TO/VIDEOS
        num_classes: <YOUR_ACTUAL_NUM_CLASSES>
        label_file: tc-clip/labels/custom_dataset_labels.csv
        ann_file: tc-clip/datasets_splits/custom_dataset_anns.txt
  5. Now run the command below. Note the data=custom_dataset part:
torchrun --nproc_per_node=4 main.py -cn zero_shot \
data=custom_dataset output=/PATH/TO/OUTPUT \
trainer=tc_clip eval=test resume=/PATH/TO/CHECKPOINTS/zero_shot_k400_tc_clip.pth

If you have any follow-up questions, feel free to ask. I will also mention you after adding a sample notebook.


ooza commented Sep 11, 2024

Thanks @byminji for your quick reply.
I had to modify some source files to avoid using apex's amp, because it is deprecated; I used autocast from PyTorch's amp instead.
torch and torchvision versions: 2.1.2+cu118, 0.16.2+cu118
CUDA version: 12.2
Updated source files:
File: tc_clip.py
Function: forward
Update:
import torch.cuda.amp as amp
...

with amp.autocast():
    image_features, context_tokens, attn, source = self.image_encoder(
        image.type(self.dtype),
        return_layer_num=self.return_layer_num,
        return_attention=return_attention,
        return_source=return_source)

File: engine.py
Function: validate
Update:

with amp.autocast():
    output = model(image_input)

File: main.py
Function: main_testing
Update:

with amp.autocast():
    test_stats = validate(val_loader, model, logger, config)
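(On recent PyTorch versions the same wrapping can also be written with the device-agnostic torch.amp API; a minimal, self-contained sketch, where the model and input are just stand-ins for whichever call is wrapped:)

import torch
from torch import nn

# Stand-ins for the real model and batch; the point is the autocast wrapper,
# which is the device-agnostic form of torch.cuda.amp.autocast().
model = nn.Linear(8, 3).cuda()
image_input = torch.randn(2, 8, device="cuda")

with torch.amp.autocast(device_type="cuda"):
    output = model(image_input)  # runs under mixed precision on CUDA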

As I said before, I have a small dataset of fewer than 30 short videos and 3 classes.
So I updated the accuracy_top1_top5 function in tools.py to handle a smaller number of classes dynamically.
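Roughly, the change looks like the sketch below (my own sketch, not the repository's exact tools.py; the signature only mirrors how the function is called later in this thread):

import torch

def accuracy_top1_top5(similarity: torch.Tensor, labels: torch.Tensor):
    """Top-1 / top-k accuracy where k is capped at the number of classes.

    similarity: [batch, num_classes] scores, labels: [batch] ground-truth ids.
    Returns (#correct@1, #correct@k, top-1 indices, top-k indices).
    """
    num_classes = similarity.size(1)
    k = min(5, num_classes)                   # e.g. top-5 falls back to top-3 with 3 classes
    _, indices_k = similarity.topk(k, dim=1)  # [batch, k], sorted by score
    indices_1 = indices_k[:, 0]               # [batch]
    correct_1 = (indices_1 == labels).sum()
    correct_k = (indices_k == labels.unsqueeze(1)).any(dim=1).sum()
    return correct_1, correct_k, indices_1, indices_k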
I got this result:

(screenshot of the accuracy results attached)

My question is: how can I display / analyze the predicted class for each video?
By the way, the only output file produced is log_rank0.txt.
Thanks again.


byminji commented Sep 12, 2024

Hi @ooza, you can check individual filenames and predictions by modifying some parts of the code. You can get the file-id metadata by running your command with the ++gather_filename=true override (see datasets/build.py#L174). Below is a code snippet that I've used before.

import torch

from utils.print_utils import colorstr
# MetricLogger and accuracy_top1_top5 are the repository's existing utilities
# (accuracy_top1_top5 lives in tools.py).

@torch.no_grad()
def print_individual_predictions(val_loader, model, logger, config):
    """ Code snippet to print individual predictions """

    assert config.num_clip == 1     # Only supports single-view sampling case
    assert config.get("gather_filename")    # Run command with "++gather_filename=true" override

    model.eval()
    num_classes = len(val_loader.dataset.classes)
    class_mapping = {idx: cls for idx, cls in val_loader.dataset.classes}
    metric_logger = MetricLogger(delimiter="  ")
    header = 'Val:'

    logger.info(f"{config.num_clip * config.num_crop} views inference")
    for idx, batch_data in enumerate(metric_logger.log_every(val_loader, config.print_freq, logger, header)):
        image = batch_data['imgs'].cuda(non_blocking=True)
        image = image.view((-1, config.num_frames, 3) + image.size()[-2:])
        label_id = batch_data['label'].cuda(non_blocking=True)
        label_id = label_id.reshape(-1)  # [b]

        # Get file id metadata
        file_id = batch_data['file_id']

        b, t, c, h, w = image.size()
        tot_similarity = torch.zeros((b, num_classes)).cuda()

        # Forward
        output = model(image)
        logits = output["logits"]
        similarity = logits.view(b, -1).softmax(dim=-1)
        tot_similarity += similarity

        # Classification score
        acc1, acc5, indices_1, _ = accuracy_top1_top5(tot_similarity, label_id)
        metric_logger.meters['acc1'].update(float(acc1) / b * 100, n=b)
        metric_logger.meters['acc5'].update(float(acc5) / b * 100, n=b)

        # Print individual predictions
        for batch_idx in range(b):
            filename = val_loader.dataset.video_infos[file_id[batch_idx]]['filename']
            foldername, videoname = filename.split("/")[-2], filename.split("/")[-1]
            gt_label, pred_label = label_id[batch_idx].item(), indices_1[batch_idx].item()
            gt_cls, pred_cls = class_mapping[gt_label], class_mapping[pred_label]
            flag = colorstr("blue", "Correct") if gt_label == pred_label else colorstr("red", "Wrong")
            print(f"{videoname}: [{flag}] GT {gt_cls}, Pred {pred_cls}")

    metric_logger.synchronize_between_processes()
    logger.info(f' * Acc@1 {metric_logger.acc1.global_avg:.3f} Acc@5 {metric_logger.acc5.global_avg:.3f}')
    return metric_logger.get_stats()
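For reference, the file-id metadata override can simply be appended to the same zero-shot command as above, e.g.:

torchrun --nproc_per_node=4 main.py -cn zero_shot \
data=custom_dataset output=/PATH/TO/OUTPUT \
trainer=tc_clip eval=test resume=/PATH/TO/CHECKPOINTS/zero_shot_k400_tc_clip.pth \
++gather_filename=true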

Thank you.


ooza commented Sep 13, 2024

Thanks @byminji!
I added the print_individual_predictions function just before main_testing. Then I modified main_testing to include a check on the gather_filename flag:

# If gather_filename is true, print individual predictions
with amp.autocast():
    if config.get("gather_filename", False):
        logger.info("Using print_individual_predictions function.")
        test_stats = print_individual_predictions(val_loader, model, logger, config)
    else:
        logger.info("Using validate function.")
        test_stats = validate(val_loader, model, logger, config)

I had to add this at the beginning of the function:

if config.get("gather_filename", False):
    config.num_clip = 1

Otherwise I got this error:
File "/home/vlm/tc-clip/main.py", line 151, in print_individual_predictions
assert config.num_clip == 1 # Only supports single-view sampling case
AssertionError

The issue now is a size mismatch between the predictions and the targets:

(screenshot of the error attached)

More details:

(screenshot with further details attached)

But when I modified the existing multi-view inference logic by setting config.num_clip = 1 instead of 2 (i.e. elif config.protocol == 'zero_shot' and config.multi_view_inference: config.num_clip = 1), it works!
Is this safe and correct? Or is there a more generic way to do it? Any further explanation or details would be much appreciated.
Thanks

@byminji
Copy link

byminji commented Sep 14, 2024

Hi @ooza, multi-view inference is a common strategy for increasing the accuracy of video recognition models by ensembling multiple predictions from differently sampled frames. Our paper used a 16-frame × 2-clip setting for comparison with 32-frame sampling models. You can either remove the multi-view inference or modify the code snippet to show results from multiple predictions. I implemented only the single-view case because it was for analysis, not for evaluation.
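If it helps, a rough sketch of the ensembling direction, meant as a drop-in for the forward/score part of the snippet above (the reshaping assumes the loader stacks the clip/crop views along the batch dimension, which should be double-checked against datasets/build.py):

# Assumption: with num_clip/num_crop > 1, frames arrive as
# [batch * views, num_frames, 3, H, W] and every view of a video shares its label.
views = config.num_clip * config.num_crop
logits = model(image)["logits"]                        # [batch * views, num_classes]
similarity = logits.softmax(dim=-1)
similarity = similarity.view(-1, views, num_classes)   # [batch, views, num_classes]
tot_similarity = similarity.mean(dim=1)                # ensemble over the views
# tot_similarity can then be passed to accuracy_top1_top5 as before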
