Skip to content

Latest commit

 

History

History
166 lines (121 loc) · 6.36 KB

quickstart.md

File metadata and controls

166 lines (121 loc) · 6.36 KB

Quick Start

Prerequisites

  • Azure SKUs
  • Non-Azure Systems
    • NVIDIA A100 GPUs + CUDA >= 11.8
    • NVIDIA H100 GPUs + CUDA >= 12.0
    • AMD MI250X GPUs + ROCm >= 5.7
    • AMD MI300X GPUs + ROCm >= 6.0
  • OS: tested over Ubuntu 18.04 and 20.04
  • Libraries: libnuma, MPI (optional)
  • Others
    • For NVIDIA platforms, nvidia_peermem driver should be loaded on all nodes. Check it via:
      lsmod | grep nvidia_peermem
      

Build from Source

CMake 3.25 or later is required.

$ git clone https://github.com/microsoft/mscclpp.git
$ mkdir -p mscclpp/build && cd mscclpp/build

For NVIDIA platforms, build MSCCL++ as follows.

# For NVIDIA platforms
$ cmake -DCMAKE_BUILD_TYPE=Release ..
$ make -j

For AMD platforms, use HIPCC instead of the default C++ compiler. Replace /path/to/hipcc from the command below into the your HIPCC path.

# For AMD platforms
$ CXX=/path/to/hipcc cmake -DCMAKE_BUILD_TYPE=Release ..
$ make -j

Install from Source (Libraries and Headers)

# Install the generated headers and binaries to /usr/local/mscclpp
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/local/mscclpp -DBUILD_PYTHON_BINDINGS=OFF ..
$ make -j mscclpp mscclpp_static
$ sudo make install/fast

Install from Source (Python Module)

Python 3.8 or later is required.

# For NVIDIA platforms
$ python -m pip install .
# For AMD platforms
$ CXX=/path/to/hipcc python -m pip install .

Docker Images

Our base image installs all prerequisites for MSCCL++.

$ docker pull ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.3

See all available images here.

Unit Tests

unit_tests require one GPU on the system. It only tests operation of basic components.

$ make -j unit_tests
$ ./test/unit_tests

For thorough testing of MSCCL++ features, we need to use mp_unit_tests that require at least two GPUs on the system. mp_unit_tests also requires MPI to be installed on the system. For example, the following commands compile and run mp_unit_tests with two processes (two GPUs). The number of GPUs can be changed by changing the number of processes.

$ make -j mp_unit_tests
$ mpirun -np 2 ./test/mp_unit_tests

To run mp_unit_tests with more than two nodes, you need to specify the -ip_port argument that is accessible from all nodes. For example:

$ mpirun -np 16 -npernode 8 -hostfile hostfile ./test/mp_unit_tests -ip_port 10.0.0.5:50000

Performance Benchmark

Python Benchmark

Install the MSCCL++ Python package and run our Python AllReduce benchmark as follows. It requires MPI on the system.

# Choose `requirements_*.txt` according to your CUDA/ROCm version.
$ python3 -m pip install -r ./python/requirements_cuda12.txt
$ mpirun -tag-output -np 8 python3 ./python/mscclpp_benchmark/allreduce_bench.py

C++ Benchmark (mscclpp-test)

NOTE: mscclpp-test will be retired soon and will be maintained only as an example of C++ implementation. If you want to get the latest performance numbers, please use the Python benchmark instead.

mscclpp-test is a set of C++ performance benchmarks. It requires MPI on the system, and the path should be provided via MPI_HOME environment variable to the CMake build system.

$ MPI_HOME=/path/to/mpi cmake -DCMAKE_BUILD_TYPE=Release ..
$ make -j allgather_test_perf allreduce_test_perf

For example, the following command runs the allreduce5 algorithm with 8 GPUs starting from 3MB to 48MB messages, by doubling the message size in between. You can try different algorithms by changing the -k 5 option to another value (e.g., -k 3 runs allreduce3). Check all algorithms from the code: allreduce_test.cu and allgather_test.cu.

$ mpirun --bind-to numa -np 8 ./test/mscclpp-test/allreduce_test_perf -b 3m -e 48m -G 100 -n 100 -w 20 -f 2 -k 5

NOTE: a few algorithms set a condition on the total data size, such as to be a multiple of 3. If the condition is unmet, the command will throw a regarding error.

Check the help message for more details.

$ ./test/mscclpp-test/allreduce_test_perf --help
USAGE: allreduce_test_perf
        [-b,--minbytes <min size in bytes>]
        [-e,--maxbytes <max size in bytes>]
        [-i,--stepbytes <increment size>]
        [-f,--stepfactor <increment factor>]
        [-n,--iters <iteration count>]
        [-w,--warmup_iters <warmup iteration count>]
        [-c,--check <0/1>]
        [-T,--timeout <time in seconds>]
        [-G,--cudagraph <num graph launches>]
        [-a,--average <0/1/2/3> report average iteration time <0=RANK0/1=AVG/2=MIN/3=MAX>]
        [-k,--kernel_num <kernel number of commnication primitive>]
        [-o, --output_file <output file name>]
        [-h,--help]

NCCL over MSCCL++

We implement NCCL APIs using MSCCL++. How to use:

  1. Build MSCCL++ from source.
  2. Replace your libnccl.so library with libmscclpp_nccl.so, which is compiled under ./build/apps/nccl/ directory.

For example, you can run nccl-tests using libmscclpp_nccl.so as follows, where MSCCLPP_BUILD is your MSCCL++ build directory.

mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$MSCCLPP_BUILD/apps/nccl/libmscclpp_nccl.so ./build/all_reduce_perf -b 1K -e 256M -f 2 -d half -G 20 -w 10 -n 50

If MSCCL++ is built on AMD platforms, libmscclpp_nccl.so would replace the RCCL library (i.e., librccl.so).

See limitations of the current NCCL over MSCCL++ from here.