The loss value is extremely high when fine-tuning the xm_transformer_unity model. Is there a wrong step? #5531

wanhhe opened this issue Aug 7, 2024

❓ Questions and Help

What is your question?

I want to fine-tune the xm_transformer_unity pre-trained model so that its mBART decoder can recognize some new words.
I followed
https://github.com/facebookresearch/fairseq/blob/ust/examples/speech_to_speech/docs/enhanced_direct_s2st_discrete_units.md.
Is it normal to see such high loss values during training? The loss reached 50 and the multitask loss reached 1000. Here are some console outputs:
```
2024-08-07 15:20:51 | INFO | dev | epoch 163 | valid on 'dev' subset | loss 50.401 | nll_loss 18.932 | multitask_target_letter_loss 918.037 | ppl 500238 | wps 0 | wpb 470 | bsz 2 | multitask_target_letter_loss_weight 8 | num_updates 652
2024-08-07 15:20:51 | INFO | fairseq_cli.train | end of epoch 163 (average epoch stats below)
2024-08-07 15:20:51 | INFO | train | epoch 163 | loss 65.175 | nll_loss 19.787 | total None | n_correct None | multitask_target_letter_loss 1303.92 | ppl 904912 | wps 791.8 | ups 0.55 | wpb 1440 | bsz 6.2 | num_updates 652 | multitask_target_letter_loss_weight 8 | lr 7.066e-08 | gnorm 1667.83 | clip 100 | loss_scale None | train_wall 7 | gb_free None | cuda_gb_allocated 16.9 | cuda_gb_reserved 22.1 | cuda_gb_free 22.5 | wall 0
```
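For context on the magnitudes: if I understand fairseq's logging convention correctly, nll_loss is logged in base 2 and ppl is reported as 2 ** nll_loss, so the perplexity values above are consistent with the per-token loss (a quick sanity check, assuming that convention):

```python
# Quick check that the logged ppl values follow from nll_loss, assuming
# fairseq's convention ppl = 2 ** nll_loss (nll_loss logged in base 2).
print(2 ** 18.932)  # ~500238, matches "ppl 500238" in the valid log
print(2 ** 19.787)  # ~904912, matches "ppl 904912" in the train log
```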

Additionally, I also encountered some problems, like `AssertionError: Optimizer does not match; please reset the optimizer (--reset-optimizer). FP16Optimizer vs FairseqAdam`, `exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1) RuntimeError: The size of tensor a (755724736) must match the size of tensor b (931956160) at non-singleton dimension 0`, and gradient overflow. I guess there may be some problems in my custom datasets? But when I don't use `--fp16` in the running command, it works.

In fact, I'm not sure if my steps are correct, so I hope to get some help. Thank you!

Code

This is my command:

```
fairseq-train /home/s2ut/FormattingData/DATA_ROOT \
  --config-yaml /home/s2ut/FormattingData/DATA_ROOT/config.yaml \
  --multitask-config-yaml /home/s2ut/FormattingData/DATA_ROOT/multitask_config.yaml \
  --task speech_to_text --arch xm_transformer_t2 \
  --criterion speech_to_unit_translatotron2 --label-smoothing 0.1 \
  --share-decoder-input-output-embed --adaptor-n-layers 1 --normalize \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --load-pretrained-decoder-from /root/autodl-tmp/code/trained_model/checkpoint_last.pt \
  --w2v-path /root/autodl-tmp/code/trained_model/checkpoint_last.pt \
  --mask-prob 0.3 --mask-channel-length 32 --mask-channel-prob 0.25 \
  --save-dir /root/autodl-tmp/code/trained_model --checkpoint-activations --encoder-proj \
  --lr 0.00000001 --lr-scheduler inverse_sqrt \
  --warmup-init-lr 1e-7 --warmup-updates 2000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \
  --max-update 80000 --max-tokens 5000 --max-tokens-valid 5000 --max-source-positions 5000 \
  --max-target-positions 5000 --update-freq 1 \
  --seed 1234 --num-workers 1 \
  --reset-dataloader --reset-optimizer --batch-size 16 --max-epoch 1000 --save-interval 1000
```

What have you tried?

First, I prepared the manifest file:

```
python examples/wav2vec/wav2vec_manifest.py /home/s2ut/TGT_AUDIO/train \
  --dest /home/s2ut/TGT_AUDIO/train --ext wav --valid-percent 0
```
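As far as I know, wav2vec_manifest.py writes a TSV whose first line is the audio root directory, followed by one `relative_path<TAB>num_samples` line per file, so my train.tsv should look roughly like this (the sample counts below are illustrative, not my real data):

```
/home/s2ut/TGT_AUDIO/train
26.wav	170240
16.wav	138560
```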

Second, I ran the following command to extract units with mhubert_base_vp_en_es_fr_it3_L11_km1000, released in https://github.com/facebookresearch/fairseq/blob/ust/examples/speech_to_speech/docs/textless_s2st_real_data.md:

```
python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
  --feature_type hubert \
  --kmeans_model_path /home/s2ut/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin \
  --acoustic_model_path /home/s2ut/mhubert_base_vp_en_es_fr_it3.pt \
  --layer 11 --manifest_path /home/s2ut/TGT_AUDIO/train/train.tsv \
  --out_quantized_file_path /home/s2ut/TGT_AUDIO/train.txt --extension ".wav"
```
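My understanding is that the quantized output file has one line per utterance, with the file name and the space-separated unit IDs joined by `|`, roughly like this (unit values below are made up for illustration):

```
26|17 17 296 85 85 ...
16|4 4 941 203 ...
```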

Then, I formatted the data to get a config.yaml:

```
python examples/speech_to_speech/preprocessing/prep_s2ut_data.py \
  --source-dir /home/s2ut/SRC_AUDIO --target-dir /home/s2ut/TGT_AUDIO \
  --data-split train dev --output-root /home/s2ut/FormattingData/DATA_ROOT \
  --reduce-unit --vocoder-checkpoint /home/s2ut/g_00500000 \
  --vocoder-cfg /home/s2ut/vocoder_code_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_config.json
```

My task data format is like the following:

```
id audio n_frames tgt_text tgt_n_frames
26 /home/s2ut/SRC_AUDIO/train/26.wav 547 864 497 248
16 /home/s2ut/SRC_AUDIO/train/16.wav 445 39 6 54 192 232
```

I used BPE to generate subwords, looked up the subword IDs in en_zh_spm.dict, and wrote those tokens into the tgt_text of the multitask data (a minimal encoding sketch follows the example below).
My multitask data format is like the following:

```
id tgt_text
26 3476765 2692239 80799 68322236
16 36544 38935 372148
```
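This is roughly how I produce the token IDs (a minimal sketch assuming a standard sentencepiece BPE model and the fairseq dictionary format of one `<symbol> <count>` entry per line; the model path and input text are placeholders):

```python
import sentencepiece as spm

# Load the sentencepiece model used to build en_zh_spm.dict
# (the model file name here is a placeholder).
sp = spm.SentencePieceProcessor(model_file="en_zh_spm.model")

# Map each symbol in en_zh_spm.dict to its line index,
# assuming fairseq's "<symbol> <count>" per-line dictionary format.
dict_index = {}
with open("en_zh_spm.dict", encoding="utf-8") as f:
    for i, line in enumerate(f):
        dict_index[line.split()[0]] = i

text = "some target sentence"
pieces = sp.encode(text, out_type=str)  # BPE subword pieces
ids = [dict_index[p] for p in pieces]   # IDs as found in en_zh_spm.dict
print(" ".join(map(str, ids)))          # what I write into tgt_text
```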

To recognize new words, I replaced some of the original words in the dict. May I ask whether this tgt_text should be raw text or token IDs?

My task file:

```yaml
input_channels: 1
input_feat_per_channel: 80
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - utterance_cmvn
  _train:
  - utterance_cmvn
  - specaugment
vocoder:
  checkpoint: /home/s2ut/g_00500000
  config: /home/s2ut/vocoder_code_hifigan_hubert_base_100_lj_config.json
  type: code_hifigan
decoder_type: transformer
decoder_layer: 2
encoder_layer: 1
loss_weight: 8.0
prepend_bos_and_append_tgt_lang_tag: true
eos_token: lang:en
rdrop_alpha: 10.0
tgt_lang: lang:en
dict: /home/s2ut/FormattingData/DATA_ROOT/dict.txt
standardize_audio: true
use_audio_input: false
apply_ucmvn: true
```

My multitask file:

```yaml
target_letter:
  target_type: text
  decoder_type: transformer
  encoder_layer: 1
  loss_weight: 8.0
  prepend_bos_and_append_tgt_lang_tag: true
  eos_token: "[en_XX]"
  rdrop_alpha: 10.0
  data: /home/s2ut/FormattingData/DATA_ROOT/target_letter
  tgt_lang: lang:en
  src_lang: lang:hok
  dict: /home/s2ut/FormattingData/DATA_ROOT/en_zh_spm.dict
  standardize_audio: true
  use_audio_input: true
  apply_ucmvn: true
```

What's your environment?

I use Python 3.8 and the ust branch of fairseq (0.12.0) on Linux.
