Training speed is not improved by using a better GPU #1698

SongLi89 · 2024-07-20T08:38:41Z

Hi,

I have a question about training speed on different GPUs.
We tested the training speed on an A100 and an H100 (a single GPU in each test) using the same training setup.

Settings for A100 environment:
Driver Version: 525.147.05
CUDA Version: 12.0
pytorch 2.0.1
cuda: 11.8

======
Settings for H100 environment:
Driver Version: 525.54.15
CUDA Version: 12.4
pytorch 2.2.2
cuda: 12.1

To compare the two GPUs fairly, we used the same training parameters on both, listed below. (Of course, the H100 has more memory than the A100, so we could use a larger “maximum duration” there. This test is only meant to compare the performance of the two GPUs for training.)

wenetspeech recipe:
./zipformer/train.py \
  --world-size 1 \
  --num-epochs 30 \
  --use-fp16 1 \
  --max-duration 450 \
  --training-subset L \
  --exp-dir zipformer/exp_causal \
  --causal 1 \
  --num-workers 16

However, we found that training speed was not significantly improved with the more expensive GPU (H100).
For a better comparison, we plotted the processing time of each key step for each batch: forward (zipformer), loss calculation, backward propagation, parameter update, and data loading.
The results are plotted in the following two figures.

So we can see that the time for IO, loss calculation, and parameter update is relatively low. The main time cost in training is backward propagation and forward (zipformer). It is unclear to me why the H100 shows no time reduction over the A100. Has anyone else had a similar experience? Or is there something we haven't noticed?
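The thread does not show how the per-step times were collected. As a minimal sketch of one way to do it, the timer below accumulates wall-clock time per named phase. The class name `StepTimer` and its `sync` hook are hypothetical, not part of icefall. On a GPU the hook matters: CUDA kernels are launched asynchronously, so without a synchronization call before reading the clock, time can be attributed to the wrong phase.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Accumulates wall-clock time per named phase of a training step."""

    def __init__(self, sync=None):
        # `sync` is a hypothetical hook. On a GPU, pass torch.cuda.synchronize
        # so that queued kernels finish before the clock is read.
        self.sync = sync if sync is not None else (lambda: None)
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        self.sync()
        start = time.perf_counter()
        try:
            yield
        finally:
            self.sync()
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def mean(self, name):
        """Mean seconds per call for the given phase."""
        return self.totals[name] / self.counts[name]


if __name__ == "__main__":
    timer = StepTimer()  # on a GPU: StepTimer(sync=torch.cuda.synchronize)
    for _ in range(3):
        with timer.phase("forward"):
            time.sleep(0.01)  # stand-in for the model forward pass
        with timer.phase("backward"):
            time.sleep(0.02)  # stand-in for loss.backward()
    for name in timer.totals:
        print(f"{name}: {timer.mean(name) * 1000:.1f} ms/step")
```

Synchronizing around every phase slows training slightly, so a timer like this is better suited to a short diagnostic run than to a full training job.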

Best,
Li

XhrLeokk · 2024-07-20T08:53:39Z

Nice plot. Surprising to see that the gap is that small.
Seems weird. 🤔

Ziyi6 · 2024-07-20T10:24:10Z

Met the same problem. We're also using A100 and H100 servers, and the H100 isn't as fast as we expected, which is clearly abnormal. At least, the price we paid didn't bring us a significant speed improvement. I think something probably has to be set in the training code to use the H100 more efficiently?

rambowu11 · 2024-07-20T12:59:39Z

Marking this thread. We plan to buy H100 GPUs.

yuekaizhang · 2024-07-23T00:57:49Z

Hi @SongLi89, thank you for raising this issue. I will help check if there are any performance bottlenecks. Will reply here with any updates.

yuekaizhang · 2024-07-23T03:48:52Z

However, we found that training speed was not significantly improved with the more expensive GPU (H100).

@SongLi89
Could you tell me the specific comparison results of the training speed in your tests?

I am trying to reproduce your issue.

On the A100, it takes me about 0.6 seconds per step, and on the H100, it takes about 0.36 seconds per step. (by checking log file)

I am not sure if this speed ratio is similar to yours. (I used the aishell1 dataset, where the sentence lengths are slightly shorter, but the max_audio_duration setting is the same as yours.)
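For reference, the speed ratio implied by the two step times quoted above can be worked out directly. This is a small sketch; the helper name `speedup` is only for illustration.

```python
def speedup(baseline_s, new_s):
    """Ratio of baseline step time to new step time (higher is faster)."""
    return baseline_s / new_s


# Step times reported in this comment (aishell1, read from the log files).
a100_step_s = 0.6
h100_step_s = 0.36

print(f"H100 over A100: {speedup(a100_step_s, h100_step_s):.2f}x")
```

So the reproduction shows roughly a 1.67x speedup, whereas equal step times on both GPUs would correspond to 1.00x.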

yuekaizhang · 2024-07-23T03:54:55Z

(Of course, the H100 has more memory than the A100, and we can use a larger number for “maximum duration”. This test is just to compare the performance of these two GPUs for training.)

Also, could you tell me the specific specifications of your GPUs? The A100 80GB and H100 80GB have the same memory size.

SongLi89 · 2024-07-23T04:01:07Z

(Of course, the H100 has more memory than the A100, and we can use a larger number for “maximum duration”. This test is just to compare the performance of these two GPUs for training.)

Also, could you tell me the specific specifications of your GPUs? The A100 80GB and H100 80GB have the same memory size.

Hi yuekai, thanks for the quick reply.
The A100 has 40 GB of memory, whereas the H100 has 80 GB. Below are the two screenshots.

Which torch/CUDA version did you use for the test?
With the training settings above (wenetspeech L), one step takes around 0.5 s on both. The H100 is slightly faster, but 0.36 s is never reached.

yuekaizhang · 2024-07-23T04:17:26Z

Which torch/CUDA version did you use for the test? With the training settings above (wenetspeech L), one step takes around 0.5 s on both. The H100 is slightly faster, but 0.36 s is never reached.

I am using torch 2.3.1 (Host Driver Version: 550.54.15, CUDA Version: 12.4). The Dockerfile is here: https://github.com/modelscope/FunASR/blob/main/runtime/triton_gpu/Dockerfile/Dockerfile.sensevoice

You could use the pre-built image here:

docker pull soar97/triton-sensevoice:24.05
pip install k2==1.24.4.dev20240606+cuda12.1.torch2.3.1 -f https://k2-fsa.github.io/k2/cuda.html
pip install -r icefall/requirements.txt
pip install lhotse

huggingface-cli download --repo-type dataset --local-dir /your_icefall/egs/aishell/ASR/data yuekai/aishell_icefall_fbank
./zipformer/train.py \
  --world-size 1 \
  --num-epochs 30 \
  --use-fp16 1 \
  --max-duration 450 \
  --training-subset L \
  --exp-dir zipformer/exp_causal \
  --causal 1 \
  --num-workers 16

If you are willing to follow the steps above to try it on aishell 1, it would be very helpful. This way, we can use almost identical environments and datasets. For aishell, you just need to follow the command to download the pre-extracted features I prepared, and you can start training.

Since the wenetspeech dataset is relatively large, reproducing it directly would be time-consuming for me. If you can obtain similar conclusions to mine on aishell 1 and then find that the H100 is slower on wenetspeech, I can try using wenetspeech to test it.

However, don't worry. Even if you achieve the same acceleration ratio as I did, I will still check the performance to see if there are any areas in the overall pipeline that can be further accelerated.

SongLi89 · 2024-07-23T04:58:51Z

Thanks a lot, I will try it.

SongLi89 · 2024-07-26T11:23:48Z

Hi yuekai, I tried your environment and got a similar acceleration ratio. Thanks a lot. Still, it would be great if the performance could be improved further. If you have ideas for speeding up the training, please tell me. Thanks.

yuekaizhang · 2024-07-30T03:32:39Z

Hi yuekai, I tried your environment and got a similar acceleration ratio. Thanks a lot. Still, it would be great if the performance could be improved further. If you have ideas for speeding up the training, please tell me. Thanks.

Hi @SongLi89, I have performed a profiling of the whole pipeline and did not find any significant bottlenecks.

It is worth noting that if you are willing to make some modifications to the attention mechanism of Zipformer, changing it to standard Transformer attention, you could leverage FlashAttention to accelerate both inference and training. However, it is uncertain whether this change would result in any loss of model accuracy.

Alternatively, for the H100, the fastest approach would be to use FP8 for training. However, since Zipformer has some gradient rescaling operations, an FP8 training recipe might require addressing related issues first. This may not be a task that can be completed quickly. If you or anyone else is interested, we can discuss or collaborate to achieve this.

danpovey · 2024-07-30T21:44:03Z

Perhaps you are limited by the latency of individual operations, e.g. loading the kernels? The nsys profile output may give more detailed info. Unfortunately it won't look that pretty or be that easy to understand unless you annotate the code with NVTX ranges. Guys, do we have a branch anywhere that can demonstrate how to add nvtx ranges for profiling? We should make available some code somewhere so that people can easily do this.

I profile using commands like the following. The important part is just prepending "nsys profile"; then you have to transfer the .qdrep file to your desktop and view it using NVIDIA Nsight Systems.

 nsys profile  python3 ./pruned_transducer_stateless7/train.py --master-port 71840 --world-size 2 --num-epochs 30 --full-libri 0 --exp-dir pruned_transducer_stateless7/scaled_adam_exp90_2job --max-duration 300 --use-fp16 True --decoder-dim 512 --joiner-dim 512 --start-epoch 5 --num-workers 2 --exit-after-batch 15 &>> nohup2/scaled_adam_exp90_nvtx_2job.out
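As a rough illustration of the NVTX annotation mentioned above, here is a minimal sketch using PyTorch's `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop`. The helper `nvtx_range` and the function `train_one_batch` are hypothetical names, not existing icefall code, and the phase callables are placeholders for the real forward, loss, and backward calls. The ranges show up as named spans on the Nsight Systems timeline when the script is run under "nsys profile".

```python
from contextlib import contextmanager

try:
    import torch
    _HAVE_CUDA = torch.cuda.is_available()
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None
    _HAVE_CUDA = False


@contextmanager
def nvtx_range(name):
    """Mark a named NVTX range; a no-op when CUDA is unavailable."""
    if _HAVE_CUDA:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _HAVE_CUDA:
            torch.cuda.nvtx.range_pop()


def train_one_batch(step_fns):
    """Run the phases of one step in order, each inside its own range.

    `step_fns` is a hypothetical mapping of phase name -> callable.
    """
    results = {}
    for name, fn in step_fns.items():
        with nvtx_range(name):
            results[name] = fn()
    return results


if __name__ == "__main__":
    # Placeholder callables stand in for the real training phases.
    print(train_one_batch({"forward": lambda: 1, "backward": lambda: 2}))
```

In a real training loop the same `with nvtx_range("forward"):` pattern would wrap the existing model call, the loss computation, `loss.backward()`, and the optimizer step.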

Something else I notice is that the time for loading data is quite a lot. You should also check on 'top' whether the data-loader workers are always busy decompressing data (would be 100% CPU) or whether they are waiting for disk access ("D" process state). You do seem to be using a lot of data-loader workers (16) so I'd hope that it wouldn't be waiting on that.
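The same check that 'top' gives can be scripted. The sketch below is Linux-only and assumes you already know the PIDs of the data-loader workers. It reads the one-letter process state from /proc/<pid>/stat, where "R" means running and "D" means uninterruptible sleep (usually waiting on disk). The function names are made up for illustration.

```python
def parse_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def process_state(pid):
    """Read the scheduler state of `pid` (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())


def summarize(states):
    """Count how many workers are in each state."""
    counts = {}
    for s in states:
        counts[s] = counts.get(s, 0) + 1
    return counts


if __name__ == "__main__":
    # Sample lines in the /proc/<pid>/stat layout, for illustration only.
    samples = [
        "101 (python3) D 1 101 101 0",
        "102 (python3) R 1 102 102 0",
        "103 (python3) D 1 103 103 0",
    ]
    print(summarize(parse_state(line) for line in samples))
```

Sampling the states a few times during training gives a rough picture: mostly "R" suggests the workers are CPU-bound on decompression, while mostly "D" points at disk access.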

But definitely the time for model forward and backward is still quite a lot.
I do notice that your --max-duration is really quite small: 450. Unless your model is extremely large, I'd be surprised if that was the largest duration you could use even for the smaller GPU. We normally use over 1000, I think; and that's on GPUs with 32GB of memory.
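To get a feel for what --max-duration means for step count, here is a back-of-the-envelope sketch. It assumes every batch is filled exactly up to max-duration seconds of audio, ignoring padding and bucketing effects, and it uses a round 10,000 hours for the WenetSpeech L subset, which is an assumption and not a number from this thread.

```python
import math


def steps_per_epoch(total_audio_hours, max_duration_s):
    """Idealized number of batches if each holds max_duration_s of audio."""
    return math.ceil(total_audio_hours * 3600 / max_duration_s)


# Assumed round figure for the training set size, for illustration only.
total_hours = 10_000

for max_duration in (450, 1000):
    steps = steps_per_epoch(total_hours, max_duration)
    print(f"--max-duration {max_duration}: ~{steps} steps per epoch")
```

Fewer, larger batches only shorten an epoch if the time per step grows more slowly than the batch size, which is what one would hope for when the GPU is underused at 450.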

galv · 2024-07-31T03:56:20Z

A quick way to get a sense of which part is slow is to use the ability of the "nvtx" pip package to automatically create an nvtx range for every single python function call. You can read more here: https://nvtx.readthedocs.io/en/latest/auto.html

There is an example here: NVIDIA/NeMo#9100 (it also includes how to use cudaProfilerStart and cudaProfilerStop properly, as well as emit_nvtx() from pytorch)

Basically, you can run with and without that enabled to get a sense of what might be slow, without manually putting in nvtx ranges, if you wanted to. Note that enabling automatic nvtx ranges can cause a huge slowdown, which is why it is good to run with and without it, and compare the two .nsys-rep files side by side. I do this all the time for NeMo.

pzelasko · 2024-07-31T15:03:18Z

But definitely the time for model forward and backward is still quite a lot. I do notice that your --max-duration is really quite small: 450. Unless your model is extremely large, I'd be surprised if that was the largest duration you could use even for the smaller GPU. We normally use over 1000, I think; and that's on GPUs with 32GB of memory.

BTW since you mentioned max_duration: you might be interested in our latest efforts in improved batch size calibration for bucketing. We found we're able to use practically 100% of available compute, improving the mean batch sizes for some of our models by as much as 5x. NVIDIA/NeMo#9763

This could easily be ported to Icefall with DynamicBucketingSampler, plus minor changes in oomptimizer.py to accommodate the training step API of Icefall models.
