
[CUDA][HIP] too many process spawned on multiple GPU systems #15251

Open
tdavidcl opened this issue Sep 1, 2024 · 8 comments · May be fixed by oneapi-src/unified-runtime#2077
Labels
bug (Something isn't working) · cuda (CUDA back-end) · hip (Issues related to execution on HIP backend)

Comments

tdavidcl commented Sep 1, 2024

Describe the bug

On multi-GPU systems, using HIP or CUDA, a process is spawned on all GPUs instead of being spawned on only one of them (see the "To reproduce" section).

This results in memory leaks when SYCL is used with MPI (observed with both mpich and openmpi), as both GPUs end up receiving data, even though the program (in the following example, a private HPC application) only uses one of them per MPI rank. This produces a graph like the following (memory usage per process over time):
mpirun -n 2 <...>
[Screenshot: Screenshot_2024-09-01_21-14-17]
where the blue and red curves are the working GPU processes, and the two other, growing curves are the same processes on the wrong GPUs.

CUDA_VISIBLE_DEVICES can be used to circumvent the issue

mpirun \                                                                           
    -n 1 -x CUDA_VISIBLE_DEVICES=0 <...> : \
    -n 1 -x CUDA_VISIBLE_DEVICES=1 <...>

[Screenshot: Screenshot_2024-09-01_21-45-25]
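On the AMD/HIP backend, the analogous workaround would presumably be HIP_VISIBLE_DEVICES (or ROCR_VISIBLE_DEVICES), e.g.:

mpirun \
    -n 1 -x HIP_VISIBLE_DEVICES=0 <...> : \
    -n 1 -x HIP_VISIBLE_DEVICES=1 <...>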

To reproduce

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

std::vector<sycl::device> get_sycl_device_list() {
    std::vector<sycl::device> devs;
    const auto &Platforms = sycl::platform::get_platforms();
    for (const auto &Platform : Platforms) {
        const auto &Devices = Platform.get_devices();
        for (const auto &Device : Devices) {
            devs.push_back(Device);
            // Intentionally return after the first device found: only one
            // device should ever be touched by this program.
            return devs;
        }
    }
    return devs;
}

int main(void) {
    for (auto d : get_sycl_device_list()) {
        auto DeviceName = d.get_info<sycl::info::device::name>();
        std::cout << DeviceName << std::endl;
    }
    std::cin.ignore();
}
intel-llvm-installdir/bin/clang++ -fsycl -fsycl-targets=nvidia_gpu_sm_80 test.cpp
./a.out

On a multi-GPU system, this code snippet results in the process appearing on both GPUs, even though only one GPU should be initialized.

❯ nvidia-smi | grep ./a.out
|    0   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723339      C   ./a.out                                       202MiB |

Environment

  • OS: Linux
  • Target device and vendor: CUDA: RTX A5000 and A100-SXM4; AMD: MI250X
  • DPC++ version (output of clang++ --version):
clang version 19.0.0git (https://github.com/intel/llvm.git 3e00e38232f81724c94c18f2e3038e6aea4f0224)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/tdavidcl/Documents/shamrock-dev/Shamrock/build_intel/.env/intel-llvm-installdir/bin
Build config: +assertions
  • Dependencies version (output of sycl-ls --verbose):
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA RTX A5000 8.6 [CUDA 12.6]
[cuda:gpu][cuda:1] NVIDIA CUDA BACKEND, NVIDIA RTX A5000 8.6 [CUDA 12.6]

Platforms: 1
Platform [#1]:
    Version  : CUDA 12.6
    Name     : NVIDIA CUDA BACKEND
    Vendor   : NVIDIA Corporation
    Devices  : 2
        Device [#0]:
        Type              : gpu
        Version           : 8.6
        Name              : NVIDIA RTX A5000
        Vendor            : NVIDIA Corporation
        Driver            : CUDA 12.6
        UUID              : 1524713610692242731361804768205105967369
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_cuda_async_barrier ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_1d_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_external_memory_import ext_oneapi_external_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag ext_oneapi_virtual_mem ext_oneapi_image_array ext_oneapi_unique_addressing_per_dim ext_oneapi_bindless_images_sample_1d_usm ext_oneapi_bindless_images_sample_2d_usm
        [warning] Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
        info::device::sub_group_sizes: 32
        Architecture: nvidia_gpu_sm_86
        Device [#1]:
        Type              : gpu
        Version           : 8.6
        Name              : NVIDIA RTX A5000
        Vendor            : NVIDIA Corporation
        Driver            : CUDA 12.6
        UUID              : 132261661332412015314176217761172047020650
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_cuda_async_barrier ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_1d_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_external_memory_import ext_oneapi_external_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag ext_oneapi_virtual_mem ext_oneapi_image_array ext_oneapi_unique_addressing_per_dim ext_oneapi_bindless_images_sample_1d_usm ext_oneapi_bindless_images_sample_2d_usm
        [warning] Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
        info::device::sub_group_sizes: 32
        Architecture: nvidia_gpu_sm_86
default_selector()      : gpu, NVIDIA CUDA BACKEND, NVIDIA RTX A5000 8.6 [CUDA 12.6]
accelerator_selector()  : No device of requested type available.
cpu_selector()          : No device of requested type available.
gpu_selector()          : gpu, NVIDIA CUDA BACKEND, NVIDIA RTX A5000 8.6 [CUDA 12.6]
custom_selector(gpu)    : gpu, NVIDIA CUDA BACKEND, NVIDIA RTX A5000 8.6 [CUDA 12.6]
custom_selector(cpu)    : No device of requested type available.
custom_selector(acc)    : No device of requested type available.

Additional context

No response

@tdavidcl added the bug label on Sep 1, 2024
@JackAKirk (Contributor) commented:

Forgive me if I've misunderstood the problem, since I'm not sure there is enough information (or maybe I misread), but it looks to me like this is expected behaviour. As you point out, you can resolve this problem by using CUDA_VISIBLE_DEVICES, as we documented at the bottom of:

https://developer.codeplay.com/products/oneapi/nvidia/2024.2.1/guides/MPI-guide#mapping-mpi-ranks-to-specific-devices

Now I think what you are asking for is a way to map CPU ranks to GPUs without using CUDA_VISIBLE_DEVICES. We have documented how to do this in the link above: essentially, you query the rank using your chosen MPI implementation and map it to a chosen device.
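For reference, a minimal sketch of that mapping (illustrative only; it follows the pattern from the guide above but is not copied from it, and it assumes the DPC++ CUDA backend and an MPI implementation):

#include <mpi.h>
#include <sycl/sycl.hpp>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Collect the devices exposed by the CUDA backend.
    // (The guide's snippet uses sycl::backend::cuda; recent DPC++ spells it
    // sycl::backend::ext_oneapi_cuda.)
    std::vector<sycl::device> Devs;
    for (const auto &plt : sycl::platform::get_platforms()) {
        if (plt.get_backend() == sycl::backend::ext_oneapi_cuda) {
            for (const auto &dev : plt.get_devices())
                Devs.push_back(dev);
        }
    }
    if (Devs.empty()) {
        MPI_Finalize();
        return 1;
    }

    // Map this rank to one device (wrap around if there are more ranks
    // than devices on the node).
    sycl::queue q{Devs[rank % Devs.size()]};

    // ... per-rank SYCL work and MPI communication using q ...

    MPI_Finalize();
    return 0;
}

On multi-node runs you would typically map using the node-local rank (e.g. obtained via MPI_Comm_split_type with MPI_COMM_TYPE_SHARED) rather than the global rank.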

This is identical to how you do MPI with native CUDA, and this is generally the case; we have tried to emphasize this in

https://developer.codeplay.com/products/oneapi/nvidia/2024.2.1/guides/MPI-guide
and
https://developer.codeplay.com/products/oneapi/amd/2024.2.1/guides/MPI-guide

If I am wrong and there is a problem with using CUDA-aware MPI in SYCL that is not documented in https://developer.codeplay.com/products/oneapi/nvidia/2024.2.1/guides/MPI-guide, then it would probably help me understand what is happening if you could post a more complete code example.

@AlexeySachkov added the cuda and hip labels on Sep 2, 2024

tdavidcl commented Sep 2, 2024

> Forgive me if I've misunderstood the problem since I'm not sure there is enough information or maybe I misread, but it looks to me like this is expected behaviour.

Indeed, the situation described in https://developer.codeplay.com/products/oneapi/nvidia/2024.2.1/guides/MPI-guide#mapping-mpi-ranks-to-specific-devices is really close to what I'm doing internally.

> Now I think what you are asking is a way to map cpu ranks to gpus without using CUDA_VISIBLE_DEVICES. We have documented how to do this in the link above.

As described in the same guide, I am doing something which looks like this:

std::vector<sycl::device> Devs;
for (const auto &plt : sycl::platform::get_platforms()) {
  if (plt.get_backend() == sycl::backend::cuda)
    Devs.push_back(plt.get_devices()[0]);
}
sycl::queue q{Devs[rank]};

However, correct me if I am wrong, but the expected behavior when I do `mpirun -n 2 ./a.out` on a dual-GPU system would be to see in nvidia-smi one process on GPU 0 and the other on GPU 1:

❯ nvidia-smi | grep ./a.out
|    0   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723340      C   ./a.out                                       202MiB |

Currently, doing so instead gives:

❯ nvidia-smi | grep ./a.out
|    0   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723340      C   ./a.out                                       202MiB |
|    0   N/A  N/A   2723340      C   ./a.out                                       202MiB |

i.e. every rank starts on every GPU, even though each process only uses one of them.

The issue is that there is no way to disable the streams/contexts on the unused devices. This confuses MPI, which in turn, I suspect, creates the memory leak.

Maybe I was unclear in the initial post, but to reproduce the issue you can simply start a SYCL program without MPI and observe that both GPUs show up in nvidia-smi.

Even if this can be worked around by using a proper binding script, I suspect that this is not the expected behavior of DPC++.


JackAKirk commented Sep 2, 2024

> However, correct me if I am wrong, but the expected behavior when I do `mpirun -n 2 ./a.out` on a dual-GPU system would be to see in nvidia-smi one process on GPU 0 and the other on GPU 1. [...] Currently, doing so instead gives [each process listed on every GPU]. [...] Even if this can be worked around by using a proper binding script, I suspect that this is not the expected behavior of DPC++.

Yes, that should be correct. I see what you mean. I have not seen such behaviour, but I can try to reproduce it. I wonder whether it is an artifact of some part of your program. First of all, have you tried our samples that we linked in the documentation? e.g. https://github.com/codeplaysoftware/SYCL-samples/blob/main/src/MPI_with_SYCL/send_recv_usm.cpp

As I understand it, you would expect to see the same behaviour for that sample, but I don't remember ever seeing duplicate processes.

If you do see the same issue with that sample, I suspect this might also be an artifact of your cluster setup. You might also want to confirm that you don't see the same behaviour with a simple cuda MPI program, e.g. https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/
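For what it's worth, a minimal host-only sketch of such a check might look like the following (a sketch only, assuming a CUDA-aware MPI build and two ranks; the buffer size and tag are arbitrary):

#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Bind this rank to one GPU, mirroring the rank-to-device mapping above.
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);

    // Allocate a device buffer and pass the device pointer straight to MPI
    // (this requires a CUDA-aware MPI; otherwise stage through host memory).
    double *buf = nullptr;
    cudaMalloc(&buf, 1024 * sizeof(double));
    if (rank == 0)
        MPI_Send(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}

After running it with mpirun -n 2, nvidia-smi should list each rank on a single GPU only.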

I would be surprised if this is a dpc++ specific issue. Once the program is compiled, as far as MPI is concerned there is no distinction between it being compiled with dpc++ or nvcc.


tdavidcl commented Sep 2, 2024

I will try, but the simplest example already tends to trigger the issue with dpcpp.
I think that just looping over the list of devices results in CUDA being initialized on each GPU.

This simple code on a dual GPU system shows the issue already without MPI:

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

std::vector<sycl::device> get_sycl_device_list() {
    std::vector<sycl::device> devs;
    const auto &Platforms = sycl::platform::get_platforms();
    for (const auto &Platform : Platforms) {
        const auto &Devices = Platform.get_devices();
        for (const auto &Device : Devices) {
            devs.push_back(Device);
            // Intentionally return after the first device found: only one
            // device should ever be touched by this program.
            return devs;
        }
    }
    return devs;
}

int main(void) {
    for (auto d : get_sycl_device_list()) {
        auto DeviceName = d.get_info<sycl::info::device::name>();
        std::cout << DeviceName << std::endl;
    }
    std::cin.ignore();
}
❯ nvidia-smi | grep ./a.out
|    0   N/A  N/A   2723339      C   ./a.out                                       202MiB |
|    1   N/A  N/A   2723339      C   ./a.out                                       202MiB |

Here the process is initialised on both GPUs even though no queues have been created, and only the first device has been used (only to query its name).

Including MPI would do pretty much the same, times two. Send/receives work fine with that setup, except for the weird memory leak (I've checked the allocations and it is not on my side).


JackAKirk commented Sep 2, 2024

> I will try, but the simplest example already tends to trigger the issue with dpcpp. [...] This simple code on a dual GPU system shows the issue already without MPI [code and nvidia-smi output quoted above]. Here the process is initialised on both GPUs even though no queues have been created, and only the first device has been used (only to query its name).

This definitely isn't happening on my system (I just sanity-checked it again using your code quoted above on a multi-GPU system).
The most important point is that this shouldn't be running on the GPU at all, and therefore you should not be getting any output from nvidia-smi. That is what I see. Are you sure you don't have pre-existing processes running on your GPU?


al42and commented Sep 2, 2024

For the record, I tried with oneAPI 2024.2.0 (and a matching Codeplay plugin) on a dual-GPU machine, and have the same output as @tdavidcl:

$ sycl-ls 
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2080 Ti 7.5 [CUDA 12.4]
[cuda:gpu][cuda:1] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 2080 Ti 7.5 [CUDA 12.4]
$ nvidia-smi --query-compute-apps=pid,name,gpu_bus_id,used_gpu_memory --format=csv
pid, process_name, gpu_bus_id, used_gpu_memory [MiB]
$ /opt/tcbsys/intel-oneapi/2024.2.0/compiler/2024.2/bin/compiler/clang++ -fsycl test.cpp
$ ./a.out &
[1] 17000
$ Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
Press any key...
[1]+  Stopped                 ./a.out
$ nvidia-smi --query-compute-apps=pid,name,gpu_bus_id,used_gpu_memory --format=csv
pid, process_name, gpu_bus_id, used_gpu_memory [MiB]
17000, ./a.out, 00000000:17:00.0, 154 MiB
17000, ./a.out, 00000000:65:00.0, 154 MiB

@JackAKirk (Contributor) commented:

> For the record, I tried with oneAPI 2024.2.0 (and a matching Codeplay plugin) on a dual-GPU machine, and have the same output as @tdavidcl: [sycl-ls and nvidia-smi output quoted above]

Thanks, I've now reproduced the issue. We think we understand the root cause, and someone on the team has a patch on the way. It isn't an MPI-specific issue, but a problem with the usage of cuContext that affects all codes.
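Roughly speaking, the eager pattern being described would look something like the following driver-API sketch (illustrative only; this is not the actual unified-runtime code, just the general shape of the problem):

#include <cuda.h>

void enumerate_devices_eagerly() {
    cuInit(0);
    int ndev = 0;
    cuDeviceGetCount(&ndev);
    for (int i = 0; i < ndev; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);
        CUcontext ctx;
        // Eager: per-device driver state is allocated immediately, which is
        // why the process shows up on every GPU in nvidia-smi even if only
        // one device is ever used. A lazy design defers this until first use.
        cuDevicePrimaryCtxRetain(&ctx, dev);
    }
}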

@JackAKirk (Contributor) commented:

Hi @tdavidcl @al42and

I opened a proposed fix for this here: oneapi-src/unified-runtime#2077, along with a code example of how this would change developer code: codeplaysoftware/SYCL-samples#33.

If you have any feedback on this then feel free to post. Thanks
