
Why is GPU-Util 0% when running the GPU version? #3619

Closed
YanzeZHANG opened this issue Dec 1, 2020 · 5 comments

@YanzeZHANG

I am running the GPU version on CentOS, but it does not seem to run any faster than the CPU version.

I set this in train.conf: device_type=gpu
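For reference, the same setting through the LightGBM Python package would look like the sketch below (not from this issue; the dataset file name is hypothetical, and device_type="gpu" mirrors the device_type=gpu line in train.conf):

import lightgbm as lgb

params = {
    "objective": "binary",
    "device_type": "gpu",   # same effect as device_type=gpu in train.conf
    "gpu_platform_id": 0,   # optional: pick the OpenCL platform explicitly
    "gpu_device_id": 0,     # optional: pick the GPU on that platform
}
train_data = lgb.Dataset("train.bin")  # hypothetical dataset file
booster = lgb.train(params, train_data, num_boost_round=100)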

LightGBM output:

[LightGBM] [Info] Number of positive: 1775314, number of negative: 17823595
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 383631
[LightGBM] [Info] Number of data points in the train set: 19598909, number of used features: 87252
[LightGBM] [Info] Using GPU Device: Tesla T4, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 34 dense feature groups (672.88 MB) transferred to GPU in 0.741636 secs. 1 sparse feature groups
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.090582 -> initscore=-2.306546
[LightGBM] [Info] Start training from score -2.306546
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 34 dense feature groups (538.36 MB) transferred to GPU in 0.614478 secs. 1 sparse feature groups
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.299209
[LightGBM] [Info] Iteration:1, training auc : 0.730514
[LightGBM] [Info] Iteration:1, valid_1 binary_logloss : 0.313567
[LightGBM] [Info] Iteration:1, valid_1 auc : 0.718181
[LightGBM] [Info] 1313.718947 seconds elapsed, finished iteration 1

It says LightGBM is running on the GPU, but when I run nvidia-smi I get information like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   55C    P0    29W /  70W |    974MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     25045      C   ../lightgbm                                  963MiB |
+-----------------------------------------------------------------------------+


The GPU-Util is 0%. Does this mean the GPU is not working? How can I solve this problem?

@StrikerRUS
Collaborator

@YanzeZHANG How do you run nvidia-smi? Is it possible that training had actually finished by the time you executed the nvidia-smi command?

Could you try running

watch -n0.1 nvidia-smi

or

nvidia-smi --query-gpu=utilization.gpu --format=csv --loop=1

as suggested in https://stackoverflow.com/questions/45544603/tensorflow-how-do-you-monitor-gpu-performance-during-model-training-in-real-time?


@YanzeZHANG
Author

[LightGBM] [Info] Using GPU Device: Tesla T4, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 34 dense feature groups (672.88 MB) transferred to GPU in 0.741636 secs. 1 sparse feature groups
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...

I got this. Does it mean only a few features are using the GPU?

@StrikerRUS
Collaborator

StrikerRUS commented Dec 4, 2020

I got this. Does it mean only a few features are using the GPU?

Yes, LightGBM utilizes the GPU for only some, not all, sub-tasks during the boosting process (see #768 (comment)), which requires transferring data back and forth between the CPU and the GPU.
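Because the GPU is used only in short bursts, a one-shot nvidia-smi can easily land between them and report 0%. A minimal sketch for catching those bursts, assuming nvidia-smi is on PATH and a single GPU (the polling helper below is illustrative, not part of LightGBM):

import subprocess
import threading
import time

samples = []

def poll_gpu_util(stop_event, interval=0.2):
    # Sample instantaneous GPU utilization until asked to stop.
    while not stop_event.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        )
        samples.append(int(out.stdout.strip()))  # one line per GPU; assumes one
        time.sleep(interval)

stop_event = threading.Event()
worker = threading.Thread(target=poll_gpu_util, args=(stop_event,), daemon=True)
worker.start()

# ... run LightGBM training here ...

stop_event.set()
worker.join()
print("max GPU utilization observed: %d%%" % max(samples))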

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023