Lower Frequency Video Recognition #1695

Open
kerolos opened this issue Jul 18, 2024 · 0 comments

I would like to use Zipformer, a state-of-the-art speech recognition model, to train a sign language recognition model on low-frame-rate video, based on the code in icefall/egs/librispeech/ASR/zipformer/train.py and the paper https://arxiv.org/abs/2310.11230.

Problem Statement
The current dataset has a frame rate (sample rate) of 24 frames per second, with skeleton data yielding a 1662-feature vector per second. The number of tokens ranges from 30,000 to 70,000, which is considerably high. I am looking for recommendations on which parameters to adjust to get better recognition from data with this lower frame rate.
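To make the frame-rate concern concrete, here is a rough sketch of how many encoder output frames a clip would leave. It assumes the overall subsampling of the standard recipe is about 4 (a factor of 2 in the convolutional frontend and a factor of 2 at the encoder output, as described in the Zipformer paper) and ignores edge effects from the convolutions. The function name and the 10-second clip are hypothetical.

```python
def approx_encoder_frames(
    duration_s: float, fps: float = 24.0, overall_subsampling: int = 4
) -> int:
    """Approximate number of encoder output frames for one clip.

    Assumption: total temporal reduction is `overall_subsampling`
    (frontend x2, output x2); boundary losses are ignored.
    """
    num_input_frames = int(duration_s * fps)
    return num_input_frames // overall_subsampling


# A 10 s clip at 24 fps versus the 100 Hz features used for speech.
print(approx_encoder_frames(10.0, fps=24.0))
print(approx_encoder_frames(10.0, fps=100.0))
```

If the label sequence for a clip is longer than this frame count, CTC cannot align it, and the transducer loss is likely to struggle as well, so the count is worth checking before training.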

Parameters to Handle Lower Frequency in the Dataset
Below is a table listing relevant parameters with their default values.

| Parameter | Default |
| --- | --- |
| --num-encoder-layers | "2,2,3,4,3,2" |
| --downsampling-factor | "1,2,4,8,4,2" |
| --feedforward-dim | "512,768,1024,1536,1024,768" |
| --num-heads | "4,4,4,8,4,4" |
| --encoder-dim | "192,256,384,512,384,256" |
| --query-head-dim | "32" |
| --value-head-dim | "12" |
| --pos-head-dim | "4" |
| --pos-dim | 48 |
| --encoder-unmasked-dim | "192,192,256,256,256,192" |
| --cnn-module-kernel | "31,31,15,15,15,31" |
| --decoder-dim | 512 |
| --joiner-dim | 512 |
| --attention-decoder-dim | 512 |
| --attention-decoder-num-layers | 6 |
| --attention-decoder-attention-dim | 512 |
| --attention-decoder-num-heads | 8 |
| --attention-decoder-feedforward-dim | 2048 |
| --causal | False |
| --chunk-size | "16,32,64,-1" |
| --left-context-frames | "64,128,256,-1" |
| --use-transducer | True |
| --use-ctc | False |
| --use-attention-decoder | False |
| --world-size | 1 |
| --ref-duration | 600 |
| --prune-range | 5 |
| --lm-scale | 0.25 |
| --am-scale | 0.0 |
| --simple-loss-scale | 0.5 |
| --ctc-loss-scale | 0.2 |
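For reference, a small sketch of what the default --downsampling-factor implies at each encoder stack for 24 fps input. It assumes the frontend halves the input frame rate before the first stack, as the Zipformer paper describes for 100 Hz speech features. The helper name and the alternative factor string "1,1,2,4,2,1" are hypothetical illustrations, not tested settings.

```python
def stack_frame_rates(
    input_fps: float, downsampling_factors: str, frontend_factor: int = 2
) -> list[float]:
    """Frame rate seen by each encoder stack.

    Assumption: the frontend reduces the rate by `frontend_factor`,
    then each stack runs at that rate divided by its own factor.
    """
    factors = [int(f) for f in downsampling_factors.split(",")]
    base_rate = input_fps / frontend_factor
    return [base_rate / f for f in factors]


default_factors = "1,2,4,8,4,2"
print(stack_frame_rates(100.0, default_factors))  # speech features at 100 Hz
print(stack_frame_rates(24.0, default_factors))   # skeleton features at 24 fps
print(stack_frame_rates(24.0, "1,1,2,4,2,1"))     # hypothetical gentler setting
```

Under these assumptions the middle stack would run at 1.5 Hz for 24 fps input, compared with 6.25 Hz for speech, which suggests --downsampling-factor is the first parameter to revisit.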

Additional Questions
Where can I find the grid recipe in Icefall, specifically the one from issue #150 by @luomingshuang? It no longer seems to be available.

I know this is out of scope for the project, but any guidance would be appreciated. Thanks in advance.
