AWS OFI NCCL v1.11.0

@rauteric rauteric released this 19 Aug 20:28
v1.11.0-aws

This release is intended only for use on AWS P* instances. A general release that supports other Libfabric networks will be made in the near future. This release requires Libfabric v1.18.0 or later and supports NCCL 2.22.3-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).
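The version requirements above can be checked mechanically before deploying. The following is a minimal sketch, not part of the plugin: the helper names `parse_version` and `meets_requirements` are hypothetical, and the caller is assumed to already have the installed Libfabric and NCCL version strings in hand.

```python
import re

# Minimum versions taken from the release notes above.
MIN_LIBFABRIC = (1, 18, 0)
MIN_NCCL = (2, 17, 1)


def parse_version(text):
    """Parse 'v1.18.0' or '2.22.3-1' into a tuple of three ints.

    Hypothetical helper: a leading 'v' and anything after the third
    number (such as the '-1' package suffix) are ignored.
    """
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_requirements(libfabric_version, nccl_version):
    """Return True if both versions satisfy the stated minimums."""
    return (
        parse_version(libfabric_version) >= MIN_LIBFABRIC
        and parse_version(nccl_version) >= MIN_NCCL
    )
```

For example, `meets_requirements("1.18.0", "2.22.3-1")` passes, while a Libfabric version of `1.17.2` or an NCCL version of `2.16.5` does not.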

New Features:

  • Autogenerate the topology file on P5 by default from the detected topology, instead of using a static file
  • Support for AWS P5e instance type

Bug fixes:

  • Fixed segfault for platform-aws builds for instance types not explicitly configured
  • Fixed failure in mr cache in SENDRECV protocol for providers that don't require memory registration
  • Re-enabled WRITE_IN_ORDER_ALIGNED_128_BYTES setting and check on P5
  • Added a check that raises an error when the old blocking connect_v4/accept_v4 interfaces are used with the RDMA protocol. The previous release changed connection establishment such that these interfaces cause a deadlock.

The plugin has been tested with the following Libfabric providers, using the tests bundled in the source code and the nccl-tests suite:

  • efa

Checksum (sha512) for the release tarball:

17063f1e10a885fe6cd48e275c9a0d5748b73d04d6514103a5e9a0f28dff604c1766f8a85a55e89ad5691830c54199936d88442d28c65180c2f79be939f0b208  aws-ofi-nccl-1.11.0-aws.tar.gz
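One way to check a downloaded tarball against the checksum above is sketched below. This is an illustrative helper, not something shipped with the release: the function names `sha512_of_file` and `verify_tarball` are hypothetical, and the tarball is assumed to have been downloaded to a local path already.

```python
import hashlib

# SHA-512 digest copied from the release notes above.
EXPECTED_SHA512 = (
    "17063f1e10a885fe6cd48e275c9a0d5748b73d04d6514103a5e9a0f28dff604c"
    "1766f8a85a55e89ad5691830c54199936d88442d28c65180c2f79be939f0b208"
)


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected=EXPECTED_SHA512):
    """Return True if the file at `path` matches the expected digest."""
    return sha512_of_file(path) == expected.lower()
```

Calling `verify_tarball("aws-ofi-nccl-1.11.0-aws.tar.gz")` returns `True` only if the local file's digest matches the published one. Reading in chunks keeps memory use flat regardless of the file size.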