Quick Start

Prerequisites

Azure SKUs
- ND_A100_v4
- NDm_A100_v4
- ND_H100_v5
- NC_A100_v4 (TBD)
Non-Azure Systems
- NVIDIA A100 GPUs + CUDA >= 11.8
- NVIDIA H100 GPUs + CUDA >= 12.0
- AMD MI250X GPUs + ROCm >= 5.7
- AMD MI300X GPUs + ROCm >= 6.0
OS: tested on Ubuntu 18.04 and 20.04
Libraries: libnuma, MPI (optional)
Others
- For NVIDIA platforms, the nvidia_peermem driver should be loaded on all nodes. Check it via:
```
lsmod | grep nvidia_peermem
```
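
To run the same check from a script, the test can be wrapped in a small helper. This is a minimal sketch: `has_module` is a hypothetical helper name, not part of MSCCL++. It reads `lsmod`-style output from standard input, so output collected from a remote node can be piped in as well, and it matches the module name exactly rather than as a substring.

```shell
# has_module NAME: succeed if NAME appears as a module name (first column)
# in lsmod-style output read from standard input.
has_module() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage on the local node:
#   lsmod | has_module nvidia_peermem || echo "nvidia_peermem is not loaded" >&2
```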

Build from Source

CMake 3.25 or later is required.
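
The version requirement can be checked before configuring. The sketch below assumes GNU coreutils (`sort -V`) is available, as it is on Ubuntu; `version_ge` is a hypothetical helper name.

```shell
# version_ge HAVE WANT: succeed if version HAVE is at least WANT.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage: the first line of `cmake --version` looks like "cmake version 3.27.4".
#   have=$(cmake --version | awk 'NR == 1 { print $3 }')
#   version_ge "$have" 3.25 || echo "CMake $have is older than 3.25" >&2
```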

$ git clone https://github.com/microsoft/mscclpp.git
$ mkdir -p mscclpp/build && cd mscclpp/build

For NVIDIA platforms, build MSCCL++ as follows.

# For NVIDIA platforms
$ cmake -DCMAKE_BUILD_TYPE=Release ..
$ make -j

For AMD platforms, use HIPCC instead of the default C++ compiler. Replace /path/to/hipcc in the command below with the path to your HIPCC.

# For AMD platforms
$ CXX=/path/to/hipcc cmake -DCMAKE_BUILD_TYPE=Release ..
$ make -j
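
If the HIPCC path is not known in advance, it can be looked up first. The following is only a sketch under stated assumptions: `find_hipcc` is a hypothetical helper, and the fallback location `/opt/rocm` (overridable through `ROCM_PATH`) is the common ROCm install prefix, which may differ on your system.

```shell
# find_hipcc: print the path of the first hipcc found, or fail if none is found.
# Search order (an assumption of this sketch): PATH, then ${ROCM_PATH:-/opt/rocm}/bin.
find_hipcc() {
    if command -v hipcc >/dev/null 2>&1; then
        command -v hipcc
    elif [ -x "${ROCM_PATH:-/opt/rocm}/bin/hipcc" ]; then
        printf '%s\n' "${ROCM_PATH:-/opt/rocm}/bin/hipcc"
    else
        return 1
    fi
}

# Usage, from the build directory:
#   CXX=$(find_hipcc) cmake -DCMAKE_BUILD_TYPE=Release ..
```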

Install from Source (Libraries and Headers)

# Install the generated headers and binaries to /usr/local/mscclpp
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/local/mscclpp -DBUILD_PYTHON_BINDINGS=OFF ..
$ make -j mscclpp mscclpp_static
$ sudo make install/fast

Install from Source (Python Module)

Python 3.8 or later is required.

# For NVIDIA platforms
$ python -m pip install .
# For AMD platforms
$ CXX=/path/to/hipcc python -m pip install .

Docker Images

Our base image installs all prerequisites for MSCCL++.

$ docker pull ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.3

See all available images here.

Unit Tests

unit_tests requires one GPU on the system. It tests only the operation of basic components.

$ make -j unit_tests
$ ./test/unit_tests

For thorough testing of MSCCL++ features, use mp_unit_tests, which requires at least two GPUs and an MPI installation on the system. For example, the following commands compile and run mp_unit_tests with two processes (two GPUs). To use a different number of GPUs, change the number of processes.

$ make -j mp_unit_tests
$ mpirun -np 2 ./test/mp_unit_tests

To run mp_unit_tests across multiple nodes, use the -ip_port argument to specify an IP address and port that are accessible from all nodes. For example:

$ mpirun -np 16 -npernode 8 -hostfile hostfile ./test/mp_unit_tests -ip_port 10.0.0.5:50000
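
In the command above, the total process count passed to -np is the number of nodes multiplied by the number of processes per node (one process per GPU). A small sketch that builds the command line from those values follows; `mp_test_cmd` is a hypothetical helper, and the hostfile name is taken from the example.

```shell
# mp_test_cmd NODES GPUS_PER_NODE IP_PORT: print the mpirun command line that
# launches one process per GPU on every node.
mp_test_cmd() {
    local nodes=$1 per_node=$2 ip_port=$3
    printf 'mpirun -np %d -npernode %d -hostfile hostfile ./test/mp_unit_tests -ip_port %s\n' \
        "$((nodes * per_node))" "$per_node" "$ip_port"
}

# Usage: 2 nodes with 8 GPUs each reproduces the command above.
#   mp_test_cmd 2 8 10.0.0.5:50000
```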

Performance Benchmark

Python Benchmark

Install the MSCCL++ Python package and run our Python AllReduce benchmark as follows. It requires MPI on the system.

# Choose `requirements_*.txt` according to your CUDA/ROCm version.
$ python3 -m pip install -r ./python/requirements_cuda12.txt
$ mpirun -tag-output -np 8 python3 ./python/mscclpp_benchmark/allreduce_bench.py
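
Selecting the requirements file can be scripted. The sketch below assumes that the files follow the naming pattern suggested by requirements_cuda12.txt, that is, the platform name followed by the major version; check the ./python directory for the files that actually exist. `requirements_for` is a hypothetical helper.

```shell
# requirements_for PLATFORM VERSION: print the requirements file for a platform
# ("cuda" or "rocm") and a version such as "12.3". Only the major version is used.
# The file-name pattern is an assumption based on requirements_cuda12.txt.
requirements_for() {
    printf './python/requirements_%s%s.txt\n' "$1" "${2%%.*}"
}

# Usage:
#   python3 -m pip install -r "$(requirements_for cuda 12.3)"
```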

C++ Benchmark (mscclpp-test)

NOTE: mscclpp-test will be retired soon and will be maintained only as an example of a C++ implementation. For the latest performance numbers, use the Python benchmark instead.

mscclpp-test is a set of C++ performance benchmarks. It requires MPI on the system, and the MPI installation path must be passed to the CMake build system via the MPI_HOME environment variable.

$ MPI_HOME=/path/to/mpi cmake -DCMAKE_BUILD_TYPE=Release ..
$ make -j allgather_test_perf allreduce_test_perf
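
If the MPI prefix is not known, one common way to guess it is to take the directory two levels above the mpirun binary. This is a sketch under that assumption only: `mpi_home_from` is a hypothetical helper, and when mpirun is a symlink (for example, one managed by the distribution) the result may not contain the MPI headers and libraries, in which case MPI_HOME should be set by hand.

```shell
# mpi_home_from MPIRUN_PATH: print the prefix two levels above the mpirun binary,
# e.g. /opt/openmpi/bin/mpirun -> /opt/openmpi.
mpi_home_from() {
    dirname "$(dirname "$1")"
}

# Usage, from the build directory:
#   MPI_HOME=$(mpi_home_from "$(command -v mpirun)") cmake -DCMAKE_BUILD_TYPE=Release ..
```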

For example, the following command runs the allreduce5 algorithm on 8 GPUs with message sizes from 3MB to 48MB, doubling the size at each step. To try a different algorithm, change the -k 5 option to another value (e.g., -k 3 runs allreduce3). All algorithms are listed in the code: allreduce_test.cu and allgather_test.cu.

$ mpirun --bind-to numa -np 8 ./test/mscclpp-test/allreduce_test_perf -b 3m -e 48m -G 100 -n 100 -w 20 -f 2 -k 5
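
To see which message sizes such a sweep visits, the arithmetic behind -b 3m -e 48m -f 2 can be reproduced in a few lines. This is an illustration of the doubling described above, not the benchmark's own code; `sweep_sizes` is a hypothetical helper, and treating the end size as inclusive is an assumption.

```shell
# sweep_sizes MIN MAX FACTOR: print each size visited by a multiplicative sweep,
# from MIN up to and including MAX (all values in the same unit, here MB).
sweep_sizes() {
    local size=$1 max=$2 factor=$3
    # A factor of 1 or less would never reach MAX.
    [ "$factor" -gt 1 ] || return 1
    while [ "$size" -le "$max" ]; do
        printf '%s\n' "$size"
        size=$((size * factor))
    done
}

# Usage: sizes, in MB, for -b 3m -e 48m -f 2 (prints 3, 6, 12, 24, 48, one per line)
#   sweep_sizes 3 48 2
```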

NOTE: a few algorithms impose a condition on the total data size, such as being a multiple of 3. If the condition is not met, the command reports a corresponding error.

Check the help message for more details.

$ ./test/mscclpp-test/allreduce_test_perf --help
USAGE: allreduce_test_perf
        [-b,--minbytes <min size in bytes>]
        [-e,--maxbytes <max size in bytes>]
        [-i,--stepbytes <increment size>]
        [-f,--stepfactor <increment factor>]
        [-n,--iters <iteration count>]
        [-w,--warmup_iters <warmup iteration count>]
        [-c,--check <0/1>]
        [-T,--timeout <time in seconds>]
        [-G,--cudagraph <num graph launches>]
        [-a,--average <0/1/2/3> report average iteration time <0=RANK0/1=AVG/2=MIN/3=MAX>]
        [-k,--kernel_num <kernel number of commnication primitive>]
        [-o, --output_file <output file name>]
        [-h,--help]

NCCL over MSCCL++

We implement the NCCL APIs using MSCCL++. To use them:

- Build MSCCL++ from source.
- Replace your libnccl.so library with libmscclpp_nccl.so, which is built under the ./build/apps/nccl/ directory.

For example, you can run nccl-tests using libmscclpp_nccl.so as follows, where MSCCLPP_BUILD is your MSCCL++ build directory.

$ mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$MSCCLPP_BUILD/apps/nccl/libmscclpp_nccl.so ./build/all_reduce_perf -b 1K -e 256M -f 2 -d half -G 20 -w 10 -n 50

If MSCCL++ is built on an AMD platform, libmscclpp_nccl.so replaces the RCCL library (i.e., librccl.so) instead.
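
Because a missing preload library is easy to overlook, it can help to verify that the file exists before launching. A minimal sketch, assuming the build layout described above; `nccl_preload` is a hypothetical helper.

```shell
# nccl_preload BUILD_DIR: print the LD_PRELOAD setting for libmscclpp_nccl.so,
# or fail with a message if the library has not been built.
nccl_preload() {
    local lib="$1/apps/nccl/libmscclpp_nccl.so"
    if [ ! -f "$lib" ]; then
        echo "not found: $lib (build MSCCL++ first)" >&2
        return 1
    fi
    printf 'LD_PRELOAD=%s\n' "$lib"
}

# Usage:
#   setting=$(nccl_preload "$MSCCLPP_BUILD") || exit 1
#   mpirun -np 8 --bind-to numa -x "$setting" ./build/all_reduce_perf -b 1K -e 256M -f 2
```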

See the limitations of the current NCCL over MSCCL++ implementation here.
