
Why is GPU-Util 0% when running the GPU version? #3619

Closed
YanzeZHANG opened this issue Dec 1, 2020 · 5 comments

@YanzeZHANG

I am running the GPU version on CentOS, but it does not seem to run any faster than the CPU version.

I set this in train.conf: device_type=gpu
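For reference, the same setting through the LightGBM Python package would look like the sketch below (not from this issue; the dataset file name is hypothetical, and device_type="gpu" mirrors the device_type=gpu line in train.conf):

import lightgbm as lgb

params = {
    "objective": "binary",
    "device_type": "gpu",   # same effect as device_type=gpu in train.conf
    "gpu_platform_id": 0,   # optional: pick the OpenCL platform explicitly
    "gpu_device_id": 0,     # optional: pick the GPU on that platform
}
train_data = lgb.Dataset("train.bin")  # hypothetical dataset file
booster = lgb.train(params, train_data, num_boost_round=100)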

LightGBM output:

[LightGBM] [Info] Number of positive: 1775314, number of negative: 17823595
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 383631
[LightGBM] [Info] Number of data points in the train set: 19598909, number of used features: 87252
[LightGBM] [Info] Using GPU Device: Tesla T4, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 34 dense feature groups (672.88 MB) transferred to GPU in 0.741636 secs. 1 sparse feature groups
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.090582 -> initscore=-2.306546
[LightGBM] [Info] Start training from score -2.306546
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 34 dense feature groups (538.36 MB) transferred to GPU in 0.614478 secs. 1 sparse feature groups
[LightGBM] [Info] Iteration:1, training binary_logloss : 0.299209
[LightGBM] [Info] Iteration:1, training auc : 0.730514
[LightGBM] [Info] Iteration:1, valid_1 binary_logloss : 0.313567
[LightGBM] [Info] Iteration:1, valid_1 auc : 0.718181
[LightGBM] [Info] 1313.718947 seconds elapsed, finished iteration 1

It says LightGBM is running on the GPU, but when I run nvidia-smi I get information like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   55C    P0    29W /  70W |    974MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     25045      C   ../lightgbm                                  963MiB |
+-----------------------------------------------------------------------------+


The GPU-Util is 0%. Does this mean the GPU is not working? How can I solve this problem?

@StrikerRUS
Collaborator

@YanzeZHANG How do you run nvidia-smi? Is it possible that training had actually finished by the time you executed the nvidia-smi command?

Could you try running

watch -n0.1 nvidia-smi

or

nvidia-smi --query-gpu=utilization.gpu --format=csv --loop=1

as suggested in https://stackoverflow.com/questions/45544603/tensorflow-how-do-you-monitor-gpu-performance-during-model-training-in-real-time?


@YanzeZHANG
Author

[LightGBM] [Info] Using GPU Device: Tesla T4, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 34 dense feature groups (672.88 MB) transferred to GPU in 0.741636 secs. 1 sparse feature groups
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...

I got this. Does it mean only a few features are using the GPU?

@StrikerRUS
Collaborator

StrikerRUS commented Dec 4, 2020

I got this. Does it mean only a few features are using the GPU?

Yes, LightGBM utilizes the GPU for only some, not all, sub-tasks during the boosting process (see #768 (comment)), which requires transferring data back and forth between the CPU and the GPU.
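Because the GPU is used only in short bursts, a one-shot nvidia-smi can easily land between them and report 0%. A minimal sketch for catching those bursts, assuming nvidia-smi is on PATH and a single GPU (the polling helper below is illustrative, not part of LightGBM):

import subprocess
import threading
import time

samples = []

def poll_gpu_util(stop_event, interval=0.2):
    # Sample instantaneous GPU utilization until asked to stop.
    while not stop_event.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        )
        samples.append(int(out.stdout.strip()))  # one line per GPU; assumes one
        time.sleep(interval)

stop_event = threading.Event()
worker = threading.Thread(target=poll_gpu_util, args=(stop_event,), daemon=True)
worker.start()

# ... run LightGBM training here ...

stop_event.set()
worker.join()
print("max GPU utilization observed: %d%%" % max(samples))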

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023