3.24

@alex-aizman released this 27 Sep 19:36

Version 3.24 arrives nearly 4 months after the previous release and contains more than 400 commits, grouped below into the following categories, topics, and sub-topics:

1. Core

1.1 Observability

We improved and optimized stats-reporting logic and introduced multiple new metrics and new management alerts.

There's now an easy way to observe per-backend performance and errors, if any. Instead of (or rather, in addition to) a single combined counter or latency, the system separately tracks requests that utilize AWS, GCP, and/or Azure backends.

For latencies, we additionally added cumulative "total-time" metrics:

  • "GET: total cumulative time (nanoseconds)"
  • "PUT: total cumulative time (nanoseconds)"
  • and more

Together with the respective counters, these total-times can be used to compute precise latencies and throughputs over arbitrary time intervals - either on a per-backend basis or averaged across all remote backends, if any.
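
For example, here is a minimal, pure-Python sketch of the delta-based computation; the sample values are hypothetical, and the function is illustrative rather than AIS code:

```python
# Sample the two monotonic metrics (cumulative total-time and request count)
# at t0 and t1, then divide the deltas to get the average latency over the
# interval. All names and numbers below are illustrative only.

def interval_latency_ns(total_ns_t0: int, total_ns_t1: int,
                        count_t0: int, count_t1: int) -> float:
    """Average per-request latency (ns) over the sampling interval."""
    requests = count_t1 - count_t0
    if requests == 0:
        return 0.0
    return (total_ns_t1 - total_ns_t0) / requests

# two samples of the "GET: total cumulative time" metric taken 60s apart:
print(interval_latency_ns(total_ns_t0=9_500_000_000,
                          total_ns_t1=12_500_000_000,
                          count_t0=1_000, count_t1=1_150))  # 20_000_000.0 ns = 20ms
# throughput over the same window: (1_150 - 1_000) / 60s = 2.5 requests/s
```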

New management alerts include keep-alive, tls-cert-will-soon-expire (see next section), low-memory, low-capacity, and more.

Build-wise, building aisnode with StatsD support now requires the corresponding build tag; Prometheus is effectively the default.

1.2 HTTPS; TLS

HTTPS deployment implies (and requires) that each AIS node (aisnode) has a valid TLS (X.509) certificate.

Every TLS certificate eventually expires, with a standard-defined maximum validity of 13 months - roughly, 397 days.

AIS v3.24 automatically reloads updated certificates, tracks expiration times, and reports any inconsistencies between certificates in a cluster.

Associated Grafana and CLI-visible management alerts:

| alert | comment |
| --- | --- |
| tls-cert-will-soon-expire | Warning: less than 3 days remain until the current X.509 cert expires |
| tls-cert-expired | Critical (red) alert (as the name implies) |
| tls-cert-invalid | ditto |
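
To make the alerting rule concrete, here's a hedged sketch (standard-library Python, hypothetical endpoint; not the AIS implementation) of the check behind tls-cert-will-soon-expire:

```python
import socket
import ssl
import time

def days_until_expiry(host: str, port: int = 443) -> float:
    """Fetch the server certificate and return days left until it expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=3) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return (ssl.cert_time_to_seconds(cert["notAfter"]) - time.time()) / 86400

remaining = days_until_expiry("aistore.example.com")  # hypothetical host
if remaining < 0:
    print("tls-cert-expired")           # critical (red)
elif remaining < 3:
    print("tls-cert-will-soon-expire")  # warning: < 3 days remain
```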

Finally, there's a brand-new management API and ais tls CLI.

1.3 Filesystem Health Checker (FSHC)

The FSHC component detects disk faults, raises associated alerts, and disables degraded mountpaths.

AIS v3.24 comes with a major (version 2) FSHC update, with new capabilities that include (see the sketch after this list):

  • detect mountpath changes at runtime;
  • differentiate in-cluster I/O errors from network and remote-backend errors;
  • support associated configuration (section "API changes; Config changes" below);
  • resolve (mountpath, filesystem) to disk(s), and handle:
    • no-disks exception;
    • disk loss, disk fault;
    • new disk attachments.
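
The following is not the AIS implementation - just a minimal sketch of the soft-error accounting idea, with a made-up threshold standing in for the new configuration knob (section 1.7):

```python
from collections import defaultdict

SOFT_ERR_LIMIT = 20  # made-up stand-in for the configurable threshold

class FSHC:
    def __init__(self) -> None:
        self.soft_errors: dict[str, int] = defaultdict(int)
        self.disabled: set[str] = set()

    def record_error(self, mountpath: str, is_remote_or_network: bool) -> None:
        # v2 differentiates in-cluster I/O errors from network and
        # remote-backend errors; only the former count against the mountpath
        if is_remote_or_network or mountpath in self.disabled:
            return
        self.soft_errors[mountpath] += 1
        if self.soft_errors[mountpath] >= SOFT_ERR_LIMIT:
            self.disabled.add(mountpath)  # disable the degraded mountpath
            print(f"alert: disabled degraded mountpath {mountpath}")

fshc = FSHC()
for _ in range(SOFT_ERR_LIMIT):
    fshc.record_error("/ais/mp1", is_remote_or_network=False)
```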

1.4 Keep-Alive; Primary Election

The in-cluster keep-alive mechanism (a.k.a. heartbeat) was micro-optimized and generally improved. In particular, when failing to ping the primary via the intra-cluster control network, an AIS node will now utilize its public network, if available.

And vice versa.
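
A minimal sketch of that fallback, with hypothetical URLs (AIS's actual ports and health path may differ):

```python
import urllib.error
import urllib.request

def ping(url: str, timeout: float = 1.0) -> bool:
    """Best-effort health probe; True if the endpoint responds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False

CONTROL = "http://10.0.0.10:51082/v1/health"              # hypothetical control URL
PUBLIC = "http://ais-target.example.com:51080/v1/health"  # hypothetical public URL

# try the intra-cluster control network first; fall back to public
primary_alive = ping(CONTROL) or ping(PUBLIC)
print("primary alive:", primary_alive)
```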

As an aside: AIS does not require provisioning three different networks at deployment time. Doing so has always been, and remains, a recommended option. But our experience running Kubernetes clusters in production environments proves that it is, well, highly recommended.

1.5 Rebalance; Erasure Coding: Intra-Cluster streams

Needless to say, erasure coding produces a lot of in-cluster traffic. For all those erasure-coded slice-sending-receiving transactions, AIS targets establish long-living peer-to-peer connections dubbed streams.

Long story short, any operation on an erasure-coded bucket requires streams. But there's also the motivation not to keep those streams open when there's no erasure coding: the associated overhead (expectedly) grows proportionally with the size of the cluster.

In AIS v3.24, we solve this problem, or part of this problem, by piggybacking on keep-alive messages that provide timely updates. Closing EC streams is a lazy process that may take several extra minutes, which is still preferable given that AIS clusters may run for days and weeks at a time with no EC traffic at all.
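
Again, not the AIS implementation - a sketch of the lazy, idle-based teardown, with a made-up grace period standing in for those "several extra minutes":

```python
import time

IDLE_GRACE = 300.0  # seconds; made-up grace period

class ECStream:
    """Stand-in for a long-living peer-to-peer EC stream."""
    def __init__(self, peer: str) -> None:
        self.peer = peer
        self.last_used = time.monotonic()  # refreshed on every EC send/receive

def on_keepalive(streams: dict[str, ECStream]) -> None:
    # piggyback on the keep-alive tick: close streams idle past the grace period
    now = time.monotonic()
    for peer, s in list(streams.items()):
        if now - s.last_used > IDLE_GRACE:
            del streams[peer]  # tear down the idle stream
```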

1.6 List Virtual Directories

Unlike hierarchical POSIX filesystems, object storage is flat, treating the forward slash ('/') in object names as simply another character.

But that's not the entire truth. The other part of it is that users may want to operate on (i.e., list, load, shuffle, copy, transform, etc.) a subset of objects in a dataset that, for lack of a better word, looks exactly like a directory.
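
To make the notion concrete, here's a pure-Python sketch (hypothetical object names) of listing the immediate children of a '/'-delimited prefix in a flat namespace:

```python
def list_virtual_dir(names: list[str], prefix: str) -> set[str]:
    """Immediate children of a 'virtual directory' in a flat namespace."""
    children = set()
    for name in names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition("/")  # embedded '/' => nested virtual dir
        children.add(head + sep)
    return children

objects = ["train/cls/0001.jpg", "train/cls/0002.jpg", "train/idx", "val/0001.jpg"]
print(list_virtual_dir(objects, "train/"))  # {'cls/', 'idx'}
```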

For details, please refer to:

1.7 API changes; Config changes

Including:

  • "[API change] show TLS certificate details; add top-level 'ais tls' command" 091f7b0
  • "[API change]: extend HEAD(object) to check remote metadata" c1004dd
  • "[config change]: FSHC v2: track and handle total number of soft errors" a2d04da
  • and more

1.8 Performance Optimization; Bug fixes; Improvements

Including:

  • "new RMD not to trigger rebalance when disabled in the config" 550cade20
  • "prefetch/copy/transform: number of concurrent workers" a5a30247d, 8aa832619
  • "intra-cluster notifications: reduce locking, mem allocations" b7965b7be
  • and much more

2. Initial Sharding (ishard); Distributed Shuffle (dsort)

The Initial Sharding utility (ishard) creates well-formed, WebDataset-formatted shards from an original dataset.

It goes without saying that original ML datasets will have arbitrary structure: a massive number of small files and/or very large files, deeply nested directories, and so on. Notwithstanding, there's almost always the need to batch associated files (those that constitute computable samples) together, and maybe pre-shuffle them, for immediate consumption by a model.

Hence, ishard.
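
The following is not ishard itself - just a minimal sketch of the core idea, assuming samples are keyed by basename (sans extension) and shard size is measured in samples rather than bytes:

```python
import os
import tarfile
from collections import defaultdict

def make_shards(paths: list[str], out_dir: str, samples_per_shard: int = 64) -> None:
    os.makedirs(out_dir, exist_ok=True)
    # group files by sample key, so that e.g. 0001.jpg and 0001.cls
    # always land in the same sample (and hence the same shard)
    samples: dict[str, list[str]] = defaultdict(list)
    for p in paths:
        samples[os.path.splitext(os.path.basename(p))[0]].append(p)

    keys = sorted(samples)
    for i in range(0, len(keys), samples_per_shard):
        name = os.path.join(out_dir, f"shard-{i // samples_per_shard:06d}.tar")
        with tarfile.open(name, "w") as tar:
            for key in keys[i:i + samples_per_shard]:
                for p in samples[key]:  # never split a sample across shards
                    tar.add(p, arcname=os.path.basename(p))
```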

3. Authentication; Access Control

Beyond code improvements and micro-optimizations (as in continuous refactoring) of the AuthN codebase, the most notable updates include:

| topic | what changed |
| --- | --- |
| CLI | improved token handling; user-friendly (and improved) error management; easy-to-use configuration that entails admin credentials, secret keys, and tokens |
| Configuration | notable (and related) environment variables: AIS_AUTHN_SECRET_KEY, AIS_AUTHN_SU_NAME, AIS_AUTHN_SU_PASS, and AIS_AUTHN_TOKEN |
| AuthN container image | (new) tailored specifically for Kubernetes deployments - for seamless integration and easy setup in K8s environments |

4. CLI

Usability improvements across the board, including:

  • "add 'ais tls validate-certificates' command" 0a2f25c
  • "'ais put --retries ' with increasing timeout, if need be" 99b7a96
  • "copy/transform: add '--num-workers' (number of concurrent workers) option" 2414c68
  • "extend 'show cluster' - add 'alert' column" 40d6580df
  • "show configured backend providers" ba492a1
  • "per-backend cumulative "total" latencies
  • and much more

5. Python: SDK (AIStore, AuthN); PyTorch DataLoader; Tools

| topic | what changed |
| --- | --- |
| SDK | compatibility with Python 3.8 and later versions; support retries via urllib3.Retry; add object group prefixes; improved dataset management for PyTorch |
| AuthN | add AuthN sub-package with Python APIs to manage users, permissions, roles, tokens, and clusters; add ObjectFile |
| PyTorch | dynamic sampling; support for multiple workers; integration with WebDataset; also included: progress bars and improved error handling |
| Tools | add ShardReader; [Google Colab](https://aistore.nvidia.com/blog/2024/09/18/google-colab-aistore); pyaisloader to support ETL |
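
Since the SDK supports retries via urllib3.Retry, here's a minimal, generic illustration of such a retry policy; how the SDK wires it internally is an SDK detail (see the SDK docs):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=5,                                     # overall retry budget
    backoff_factor=0.5,                          # 0.5s, 1s, 2s, ... between tries
    status_forcelist=[429, 500, 502, 503, 504],  # retry on these HTTP statuses
)
session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))
session.mount("https://", HTTPAdapter(max_retries=retry))
```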

6. Build; Lint; Continuous Integration (CI)

| topic | what changed |
| --- | --- |
| CI | upgrade GitHub and GitLab CI configurations; include support for Python 3.8+; improve AuthN testing; fix various CI workflows; add PyTorch integration tests to CI; improve error handling during minikube deployments |
| Build | update Open Source Software (OSS) packages; standardize Dockerfile configurations; make Prometheus the default; address security vulnerabilities (e.g., CVE fixes for google-protobuf and rexml) |
| Lint | enable more golangci-lint linters; clean up linter configurations |
| Deployment | improve deployment scripts and Makefiles; standardize container builds |

7. Documentation and Tests

| topic | what changed |
| --- | --- |
| new and updated references | virtual directories; AuthN SDK examples; TLS certificate management; Python SDK examples; loading, reloading, and generating certificates; switching a cluster between HTTP and HTTPS; streaming ObjectFile examples; and more |
| tests | new ETL tests for concurrent transformations with varying object sizes; improved Python ETL setup for Kubernetes, with fixes for the mock cloud backend; stress tests for initial sharding (ishard); enhancements to race-condition handling and minikube logging |

8. Blog

Finally, there are new technical blogs added during this v3.24 development iteration: