The loss value is extremely high when fine-tuning the xm_transformer_unity model. Is there a wrong step? #5531

wanhhe opened this issue Aug 7, 2024

❓ Questions and Help

What is your question?

I want to fine-tune the xm_transformer_unity pre-trained model so that its mBART decoder can recognize some new words.
I followed
https://github.com/facebookresearch/fairseq/blob/ust/examples/speech_to_speech/docs/enhanced_direct_s2st_discrete_units.md.
Is it normal to see such high loss values during training? The loss reached 50 and the multitask loss reached 1000. Here are some console outputs:
```
2024-08-07 15:20:51 | INFO | dev | epoch 163 | valid on 'dev' subset | loss 50.401 | nll_loss 18.932 | multitask_target_letter_loss 918.037 | ppl 500238 | wps 0 | wpb 470 | bsz 2 | multitask_target_letter_loss_weight 8 | num_updates 652
2024-08-07 15:20:51 | INFO | fairseq_cli.train | end of epoch 163 (average epoch stats below)
2024-08-07 15:20:51 | INFO | train | epoch 163 | loss 65.175 | nll_loss 19.787 | total None | n_correct None | multitask_target_letter_loss 1303.92 | ppl 904912 | wps 791.8 | ups 0.55 | wpb 1440 | bsz 6.2 | num_updates 652 | multitask_target_letter_loss_weight 8 | lr 7.066e-08 | gnorm 1667.83 | clip 100 | loss_scale None | train_wall 7 | gb_free None | cuda_gb_allocated 16.9 | cuda_gb_reserved 22.1 | cuda_gb_free 22.5 | wall 0
```
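For context on the magnitudes: if I understand fairseq's logging convention correctly, nll_loss is logged in base 2 and ppl is reported as 2 ** nll_loss, so the perplexity values above are consistent with the per-token loss (a quick sanity check, assuming that convention):

```python
# Quick check that the logged ppl values follow from nll_loss, assuming
# fairseq's convention ppl = 2 ** nll_loss (nll_loss logged in base 2).
print(2 ** 18.932)  # ~500238, matches "ppl 500238" in the valid log
print(2 ** 19.787)  # ~904912, matches "ppl 904912" in the train log
```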

Additionally, I also encountered some problems, like `AssertionError: Optimizer does not match; please reset the optimizer (--reset-optimizer). FP16Optimizer vs FairseqAdam`, `exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1) RuntimeError: The size of tensor a (755724736) must match the size of tensor b (931956160) at non-singleton dimension 0`, and gradient overflow. I guess there may be some problems in my custom datasets? But when I don't use `--fp16` in the running command, it works.

In fact, I'm not sure if my steps are correct, so I hope to get some help. Thank you!

Code

This is my command:

```
fairseq-train /home/s2ut/FormattingData/DATA_ROOT \
  --config-yaml /home/s2ut/FormattingData/DATA_ROOT/config.yaml \
  --multitask-config-yaml /home/s2ut/FormattingData/DATA_ROOT/multitask_config.yaml \
  --task speech_to_text --arch xm_transformer_t2 \
  --criterion speech_to_unit_translatotron2 --label-smoothing 0.1 \
  --share-decoder-input-output-embed --adaptor-n-layers 1 --normalize \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --load-pretrained-decoder-from /root/autodl-tmp/code/trained_model/checkpoint_last.pt \
  --w2v-path /root/autodl-tmp/code/trained_model/checkpoint_last.pt \
  --mask-prob 0.3 --mask-channel-length 32 --mask-channel-prob 0.25 \
  --save-dir /root/autodl-tmp/code/trained_model --checkpoint-activations --encoder-proj \
  --lr 0.00000001 --lr-scheduler inverse_sqrt \
  --warmup-init-lr 1e-7 --warmup-updates 2000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \
  --max-update 80000 --max-tokens 5000 --max-tokens-valid 5000 --max-source-positions 5000 \
  --max-target-positions 5000 --update-freq 1 \
  --seed 1234 --num-workers 1 \
  --reset-dataloader --reset-optimizer --batch-size 16 --max-epoch 1000 --save-interval 1000
```

What have you tried?

First, I prepared the manifest file:

```
python examples/wav2vec/wav2vec_manifest.py /home/s2ut/TGT_AUDIO/train \
  --dest /home/s2ut/TGT_AUDIO/train --ext wav --valid-percent 0
```
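As far as I know, wav2vec_manifest.py writes a TSV whose first line is the audio root directory, followed by one `relative_path<TAB>num_samples` line per file, so my train.tsv should look roughly like this (the sample counts below are illustrative, not my real data):

```
/home/s2ut/TGT_AUDIO/train
26.wav	170240
16.wav	138560
```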

Second, I ran the following command to extract units with mhubert_base_vp_en_es_fr_it3_L11_km1000, released in https://github.com/facebookresearch/fairseq/blob/ust/examples/speech_to_speech/docs/textless_s2st_real_data.md:

```
python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
  --feature_type hubert \
  --kmeans_model_path /home/s2ut/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin \
  --acoustic_model_path /home/s2ut/mhubert_base_vp_en_es_fr_it3.pt \
  --layer 11 --manifest_path /home/s2ut/TGT_AUDIO/train/train.tsv \
  --out_quantized_file_path /home/s2ut/TGT_AUDIO/train.txt --extension ".wav"
```
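My understanding is that the quantized output file has one line per utterance, with the file name and the space-separated unit IDs joined by `|`, roughly like this (unit values below are made up for illustration):

```
26|17 17 296 85 85 ...
16|4 4 941 203 ...
```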

Then, I formatted the data to get a config.yaml:

```
python examples/speech_to_speech/preprocessing/prep_s2ut_data.py \
  --source-dir /home/s2ut/SRC_AUDIO --target-dir /home/s2ut/TGT_AUDIO \
  --data-split train dev --output-root /home/s2ut/FormattingData/DATA_ROOT \
  --reduce-unit --vocoder-checkpoint /home/s2ut/g_00500000 \
  --vocoder-cfg /home/s2ut/vocoder_code_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_config.json
```

My task data format is like the following:

```
id audio n_frames tgt_text tgt_n_frames
26 /home/s2ut/SRC_AUDIO/train/26.wav 547 864 497 248
16 /home/s2ut/SRC_AUDIO/train/16.wav 445 39 6 54 192 232
```

I used BPE to generate subwords, looked up the subword IDs in en_zh_spm.dict, and wrote those tokens into the tgt_text of the multitask data (a minimal encoding sketch follows the example below).
My multitask data format is like the following:

```
id tgt_text
26 3476765 2692239 80799 68322236
16 36544 38935 372148
```
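This is roughly how I produce the token IDs (a minimal sketch assuming a standard sentencepiece BPE model and the fairseq dictionary format of one `<symbol> <count>` entry per line; the model path and input text are placeholders):

```python
import sentencepiece as spm

# Load the sentencepiece model used to build en_zh_spm.dict
# (the model file name here is a placeholder).
sp = spm.SentencePieceProcessor(model_file="en_zh_spm.model")

# Map each symbol in en_zh_spm.dict to its line index,
# assuming fairseq's "<symbol> <count>" per-line dictionary format.
dict_index = {}
with open("en_zh_spm.dict", encoding="utf-8") as f:
    for i, line in enumerate(f):
        dict_index[line.split()[0]] = i

text = "some target sentence"
pieces = sp.encode(text, out_type=str)  # BPE subword pieces
ids = [dict_index[p] for p in pieces]   # IDs as found in en_zh_spm.dict
print(" ".join(map(str, ids)))          # what I write into tgt_text
```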

To recognize new words, I replaced some of the original words in the dict. May I ask whether this tgt_text should be raw text or token IDs?

My task file:

```yaml
input_channels: 1
input_feat_per_channel: 80
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - utterance_cmvn
  _train:
  - utterance_cmvn
  - specaugment
vocoder:
  checkpoint: /home/s2ut/g_00500000
  config: /home/s2ut/vocoder_code_hifigan_hubert_base_100_lj_config.json
  type: code_hifigan
decoder_type: transformer
decoder_layer: 2
encoder_layer: 1
loss_weight: 8.0
prepend_bos_and_append_tgt_lang_tag: true
eos_token: lang:en
rdrop_alpha: 10.0
tgt_lang: lang:en
dict: /home/s2ut/FormattingData/DATA_ROOT/dict.txt
standardize_audio: true
use_audio_input: false
apply_ucmvn: true
```

My multitask file:

```yaml
target_letter:
  target_type: text
  decoder_type: transformer
  encoder_layer: 1
  loss_weight: 8.0
  prepend_bos_and_append_tgt_lang_tag: true
  eos_token: "[en_XX]"
  rdrop_alpha: 10.0
  data: /home/s2ut/FormattingData/DATA_ROOT/target_letter
  tgt_lang: lang:en
  src_lang: lang:hok
  dict: /home/s2ut/FormattingData/DATA_ROOT/en_zh_spm.dict
  standardize_audio: true
  use_audio_input: true
  apply_ucmvn: true
```

What's your environment?

I use Python 3.8 and the ust branch of fairseq (0.12.0) on Linux.
