Video-Captioning-CS5422

Automatic video captioning, a final project for CS5422 Neural Networks and Deep Learning. This project uses a neural network to produce a simple, fixed three-word caption (<noun> <verb> <noun>) for each sequence of video frames.

Neural network architecture

The neural network is composed of 4 main components:

  • Feature extractor using pretrained EfficientNet
  • Object classifier, a linear layer whose sole purpose is to capture the presence of objects in each frame
  • Encoder, a linear layer which helps capture the action happening in each frame
  • Decoder, an RNN which produces the caption
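
The data flow through these four components can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual code: the feature extractor is replaced by random per-frame vectors standing in for EfficientNet features, and the layer sizes, the vocabularies, the averaging over frames, the way object scores feed the decoder, and the one-output-head-per-slot design are all assumptions made for the example. Weights are random, so the caption it prints is meaningless; only the shape of the pipeline is the point.

```python
import math
import random

random.seed(0)

# Hypothetical sizes and vocabularies, chosen only for illustration.
FEATURE_DIM = 8   # stand-in for the size of an EfficientNet feature vector
HIDDEN_DIM = 6
NOUNS = ["person", "dog", "ball", "car"]
VERBS = ["holds", "chases", "rides"]
SLOTS = [NOUNS, VERBS, NOUNS]  # fixed <noun> <verb> <noun> layout


def make_linear(in_dim, out_dim):
    """Create random weights and a zero bias for a linear layer."""
    weights = [[random.uniform(-0.5, 0.5) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(layer, x):
    """Apply a linear layer (weights, bias) to the vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mean_over_frames(vectors):
    """Average a list of equal-length vectors element-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Untrained stand-ins for the linear layers and the RNN decoder.
object_classifier = make_linear(FEATURE_DIM, len(NOUNS))
encoder = make_linear(FEATURE_DIM, HIDDEN_DIM)
rnn_input = make_linear(len(NOUNS), HIDDEN_DIM)
rnn_hidden = make_linear(HIDDEN_DIM, HIDDEN_DIM)
output_heads = [make_linear(HIDDEN_DIM, len(vocab)) for vocab in SLOTS]


def caption(frame_features):
    """Map a sequence of per-frame feature vectors to a three-word caption."""
    # Object classifier: per-frame object scores, averaged over the clip (assumption).
    object_scores = mean_over_frames(
        [linear(object_classifier, f) for f in frame_features])
    # Encoder: per-frame action encoding, averaged to initialise the decoder (assumption).
    hidden = mean_over_frames([linear(encoder, f) for f in frame_features])
    words = []
    for vocab, head in zip(SLOTS, output_heads):
        # One simple (Elman-style) RNN step per caption slot.
        pre_activation = [a + b for a, b in zip(linear(rnn_input, object_scores),
                                                linear(rnn_hidden, hidden))]
        hidden = [math.tanh(p) for p in pre_activation]
        logits = linear(head, hidden)
        words.append(vocab[logits.index(max(logits))])
    return words


# Five frames of random "features" in place of real EfficientNet output.
frames = [[random.gauss(0, 1) for _ in range(FEATURE_DIM)] for _ in range(5)]
print(" ".join(caption(frames)))
```

Restricting each decoder step to its own vocabulary is one simple way to guarantee the fixed noun, verb, noun pattern; the actual project may enforce it differently.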
