Pretrain HuBERT on English and Chinese speech dataset. #5526

Open
shihuai opened this issue Jul 19, 2024 · 5 comments

Comments

@shihuai

shihuai commented Jul 19, 2024

Hi! I'm trying to pretrain HuBERT from scratch on an English and Chinese speech dataset. During pretraining, the first-iteration loss dropped from 6.7 to 3.3 and the second-iteration loss dropped from 11.2 to 4.0. The loss in both iterations still seems quite large. Is this a normal phenomenon?
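
For context, the two iterations here follow the standard HuBERT recipe: iteration 1 is trained on k-means labels over MFCC features (typically 100 clusters), and iteration 2 on k-means labels over features from an intermediate transformer layer of the iteration-1 model (typically layer 6 with 500 clusters for the base model). A rough label-generation sketch, assuming the script names and argument order from fairseq's examples/hubert/simple_kmeans README (split names, shard counts, and paths are placeholders):

# run from fairseq's examples/hubert/simple_kmeans/
# iteration 1: MFCC features -> k-means (100 clusters) -> frame-level labels
python dump_mfcc_feature.py ${tsv_dir} train ${nshard} ${rank} ${mfcc_dir}
python learn_kmeans.py ${mfcc_dir} train ${nshard} ${km_path} 100 --percent 0.1
python dump_km_label.py ${mfcc_dir} train ${km_path} ${nshard} ${rank} ${lab_dir}

# iteration 2: layer-6 features from the iteration-1 checkpoint -> k-means (500 clusters) -> new labels
python dump_hubert_feature.py ${tsv_dir} train ${ckpt_path} 6 ${nshard} ${rank} ${feat_dir}
python learn_kmeans.py ${feat_dir} train ${nshard} ${km_path} 500 --percent 0.1
python dump_km_label.py ${feat_dir} train ${km_path} ${nshard} ${rank} ${lab_dir}

Note that iteration 2 predicts 500 classes rather than 100, so its raw cross-entropy naturally starts from a higher value than iteration 1's.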

@zw76859420

Can you share your training config?

@shihuai
Author

shihuai commented Jul 24, 2024

> Can you share your training config?

I use hubert_base_librispeech.yaml for pretraining and only changed ddp_backend and max_sample_size:

common:
  fp16: true
  log_format: json
  log_interval: 200
  seed: 1337
  tensorboard_logdir: tblog

checkpoint:
  save_interval_updates: 25000
  keep_interval_updates: 1
  no_epoch_checkpoints: true


distributed_training:
  ddp_backend: c10d
  distributed_backend: 'nccl'
  distributed_world_size: 4
  distributed_port: 29671
  nprocs_per_node: 4
  find_unused_parameters: true

task:
  _name: hubert_pretraining
  data: ${task.data}
  label_dir: ${task.label_dir}
  labels: ${task.labels}
  label_rate: ${model.label_rate}
  sample_rate: 16000
  max_sample_size: 320000 #250000
  min_sample_size: 32000
  pad_audio: false
  random_crop: true
  normalize: false # must be consistent with extractor

dataset:
  num_workers: 6
  max_tokens: 1400000
  skip_invalid_size_inputs_valid_test: true
  validate_interval: 5
  validate_interval_updates: 10000

criterion:
  _name: hubert
  pred_masked_weight: 1.0
  pred_nomask_weight: 0.0
  loss_weights: [10,]

optimization:
  max_update: 400000
  lr: [0.00025]
  clip_norm: 10.0

optimizer:
  _name: adam
  adam_betas: (0.9,0.98)
  adam_eps: 1e-06
  weight_decay: 0.01

lr_scheduler:
  _name: polynomial_decay
  warmup_updates: 32000

model:
  _name: hubert
  label_rate: 100
  skip_masked: false
  skip_nomask: false
  mask_prob: 0.80
  extractor_mode: default
  conv_feature_layers: '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2'
  final_dim: 256
  encoder_layerdrop: 0.05
  dropout_input: 0.1
  dropout_features: 0.1
  dropout: 0.1
  attention_dropout: 0.1
  feature_grad_mult: 0.1
  untie_final_proj: true
  activation_dropout: 0.0

hydra:
  job:
    config:
      override_dirname:
        kv_sep: '-'
        item_sep: '__'
        exclude_keys:
          - run
          - task.data
          - task.label_dir
  run:
    dir: ???
  sweep:
    dir: ???
    subdir: ${hydra.job.config_name}__${hydra.job.override_dirname}
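
For reference, a pretraining run with this config goes through fairseq's Hydra entry point roughly as in the sketch below; the paths are placeholders, task.labels and model.label_rate must match the k-means label files (e.g. .km labels at 100 Hz), and since hydra.run.dir is ??? in the config an output directory has to be passed as well:

python fairseq_cli/hydra_train.py \
  --config-dir /path/to/fairseq/examples/hubert/config/pretrain \
  --config-name hubert_base_librispeech \
  task.data=/path/to/tsv_dir \
  task.label_dir=/path/to/km_labels \
  task.labels='["km"]' \
  model.label_rate=100 \
  hydra.run.dir=/path/to/exp_dir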

@zw76859420

On my side, the HuBERT training loss eventually converges to around 2.5. I used the WenetSpeech dataset for pretraining, which provides 10,000 hours of purely Chinese data.

@zw76859420

We believe the key to training a HuBERT base model is to look at the pretrained model's performance on your main downstream tasks. You can finetune the model pretrained with your recipe and then test its accuracy on your tasks.
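
As a concrete example, a CTC finetuning run based on the stock fairseq HuBERT recipe could look roughly like the sketch below; base_10h is the bundled 10-hour finetuning config, and all paths are placeholders for your labeled data and pretrained checkpoint:

python fairseq_cli/hydra_train.py \
  --config-dir /path/to/fairseq/examples/hubert/config/finetune \
  --config-name base_10h \
  task.data=/path/to/finetune_tsv \
  task.label_dir=/path/to/transcriptions \
  model.w2v_path=/path/to/hubert_base_checkpoint.pt

Decoding and WER/CER evaluation can then follow the inference steps in the fairseq HuBERT README.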

@shihuai
Author

shihuai commented Jul 26, 2024

> We believe the key to training a HuBERT base model is to look at the pretrained model's performance on your main downstream tasks. You can finetune the model pretrained with your recipe and then test its accuracy on your tasks.

OK, thank you for your reply! We have tried training a SpeechTokenizer with features from HuBERT, and the reconstructed speech also sounds good. We will try more experiments on downstream tasks.
