Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable labels for ClusterUUID and CliqueId #965

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions api/config/v1/flags.go
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,7 @@ type GFDCommandLineFlags struct {
NoTimestamp *bool `json:"noTimestamp" yaml:"noTimestamp"`
SleepInterval *Duration `json:"sleepInterval" yaml:"sleepInterval"`
OutputFile *string `json:"outputFile" yaml:"outputFile"`
ImexNodesConfig *string `json:"imexNodesConfig" yaml:"imexNodesConfig"`
MachineTypeFile *string `json:"machineTypeFile" yaml:"machineTypeFile"`
}

Expand Down Expand Up @@ -162,6 +163,8 @@ func (f *Flags) UpdateFromCLIFlags(c *cli.Context, flags []cli.Flag) {
updateFromCLIFlag(&f.GFD.Oneshot, c, n)
case "output-file":
updateFromCLIFlag(&f.GFD.OutputFile, c, n)
case "imex-nodes-config":
updateFromCLIFlag(&f.GFD.ImexNodesConfig, c, n)
case "sleep-interval":
updateFromCLIFlag(&f.GFD.SleepInterval, c, n)
case "no-timestamp":
Expand Down
6 changes: 6 additions & 0 deletions cmd/gpu-feature-discovery/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,12 @@ func main() {
Value: "/etc/kubernetes/node-feature-discovery/features.d/gfd",
EnvVars: []string{"GFD_OUTPUT_FILE"},
},
&cli.StringFlag{
klueska marked this conversation as resolved.
Show resolved Hide resolved
Name: "imex-nodes-config",
Usage: "the path to nvidia-imex nodes config file",
Value: "/etc/nvidia-imex/nodes_config.cfg",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we set a default value here? This means that a user could mount this file in and trigger the IMEX behaviour without explicitly setting the value.

EnvVars: []string{"GFD_IMEX_NODES_CONFIG"},
},
&cli.StringFlag{
Name: "machine-type-file",
Value: "/sys/class/dmi/id/product_name",
Expand Down
13 changes: 13 additions & 0 deletions deployments/helm/nvidia-device-plugin/templates/daemonset-gfd.yml
Original file line number Diff line number Diff line change
Expand Up @@ -163,6 +163,10 @@ spec:
- name: GFD_USE_NODE_FEATURE_API
value: {{ .Values.nfd.enableNodeFeatureApi | quote }}
{{- end }}
{{- if typeIs "string" .Values.imexNodesConfigFile }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also check for a non-empty value?

- name: GFD_IMEX_NODES_CONFIG
value: {{ .Values.imexNodesConfigFile | quote }}
{{- end }}
{{- if $options.hasConfigMap }}
- name: CONFIG_FILE
value: /config/config.yaml
Expand All @@ -182,6 +186,10 @@ spec:
mountPath: "/etc/kubernetes/node-feature-discovery/features.d"
- name: host-sys
mountPath: "/sys"
{{- if typeIs "string" .Values.imexNodesConfigFile }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also check for a non-empty value?

- name: imex-nodes-config
mountPath: {{ .Values.imexNodesConfigFile | quote }}
{{- end }}
{{- if $options.hasConfigMap }}
- name: available-configs
mountPath: /available-configs
Expand All @@ -199,6 +207,11 @@ spec:
- name: host-sys
hostPath:
path: "/sys"
{{- if typeIs "string" .Values.imexNodesConfigFile }}
- name: imex-nodes-config
hostPath:
path: {{ .Values.imexNodesConfigFile | quote }}
{{- end }}
{{- if $options.hasConfigMap }}
- name: available-configs
configMap:
Expand Down
1 change: 1 addition & 0 deletions deployments/helm/nvidia-device-plugin/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@ deviceIDStrategy: null
nvidiaDriverRoot: null
gdsEnabled: null
mofedEnabled: null
imexNodesConfigFile: null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see us wanting to add additional IMEX options. Does something like:

imex:
  nodesConfigFile: null

make sense to allow for extension?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we open this option, I would like to also have a enabled var.

deviceDiscoveryStrategy: null

nameOverride: ""
Expand Down
119 changes: 119 additions & 0 deletions internal/lm/nvml.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,8 +17,13 @@
package lm

import (
"bufio"
"errors"
"fmt"
"math/rand"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will likely trigger a golangci-lint error.

"net"
"os"
"sort"
"strconv"
"strings"

Expand All @@ -28,6 +33,7 @@

spec "github.com/NVIDIA/k8s-device-plugin/api/config/v1"
"github.com/NVIDIA/k8s-device-plugin/internal/resource"
"github.com/google/uuid"
)

var errMPSSharingNotSupported = errors.New("MPS sharing is not supported")
Expand Down Expand Up @@ -80,13 +86,19 @@
return nil, fmt.Errorf("error creating resource labeler: %v", err)
}

imexLabeler, err := newImexDomainLabeler(*config.Flags.GFD.ImexNodesConfig, devices)
if err != nil {
return nil, fmt.Errorf("error creating imex domain labeler: %v", err)
}

l := Merge(
machineTypeLabeler,
versionLabeler,
migCapabilityLabeler,
sharingLabeler,
resourceLabeler,
gpuModeLabeler,
imexLabeler,
)

return l, nil
Expand Down Expand Up @@ -218,6 +230,96 @@
return labels, nil
}

func newImexDomainLabeler(configFile string, device []resource.Device) (Labeler, error) {
if configFile == "" {
return nil, nil
}

// Read file and parse it
_, err := os.Stat(configFile)
if os.IsNotExist(err) {
return nil, nil
}
imexConfig, err := os.Open(configFile)
if err != nil {
return nil, fmt.Errorf("failed to open imex config file: %v", err)
}
defer imexConfig.Close()
Comment on lines +238 to +247
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Read file and parse it
_, err := os.Stat(configFile)
if os.IsNotExist(err) {
return nil, nil
}
imexConfig, err := os.Open(configFile)
if err != nil {
return nil, fmt.Errorf("failed to open imex config file: %v", err)
}
defer imexConfig.Close()
// Read file and parse it
imexConfig, err := os.Open(configFile)
if err != nil && os.IsNotExist(err) {
return nil, nil
}
if err != nil {
return nil, fmt.Errorf("failed to open imex config file: %v", err)
}
defer imexConfig.Close()


klueska marked this conversation as resolved.
Show resolved Hide resolved
// check if the file is empty
stat, err := imexConfig.Stat()
if err != nil {
return nil, nil
}
if stat.Size() == 0 {
return nil, nil
}
Comment on lines +249 to +256
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this explicit check. You can check after scanning if len(ips) == 0.

Comment on lines +250 to +256
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we check stat.Size() after the os.Stat(configFile) call?


// Read the file line by line
var ips []string
scanner := bufio.NewScanner(imexConfig)
for scanner.Scan() {
ip := strings.TrimSpace(scanner.Text())
if net.ParseIP(ip) == nil {
return nil, fmt.Errorf("invalid IP address in imex config file: %s", ip)
}
ips = append(ips, ip)
}

if err := scanner.Err(); err != nil {
return nil, fmt.Errorf("failed to read imex config file: %v", err)
}

// Sort the IP addresses
sort.Strings(ips)
klueska marked this conversation as resolved.
Show resolved Hide resolved

// Join the sorted IPs into a single string
sortedIPs := strings.Join(ips, "\n")

hashedconfig := generateUUIDs(sortedIPs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a need for a temporary variable here, just inline the function call below in the creation of the labels struct.


var commonClusterUUID string
var commonCliqueID string
for _, d := range device {
// Skip non NVML devices
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop comment -- it may be the case that other devices become IMEX capable in the future and the code itself is clear enough here.

if ok, _ := d.IsImexCapable(); !ok {
continue
}
Comment on lines +285 to +287
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this return an error if you just skip it?


clusterUUID, err := d.GetClusterUUID()
if err != nil {
return nil, fmt.Errorf("error getting cluster UUID: %v", err)
}
if commonClusterUUID == "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code might be simpler if we collect the set of unique clusterUUIDs and CliqueIPs in a map and the raise these errors after the loop.

commonClusterUUID = clusterUUID
}
if commonClusterUUID != clusterUUID {
klog.Warningf("Cluster UUIDs are different: %s != %s", commonClusterUUID, clusterUUID)
return nil, nil
}

cliqueID, err := d.GetCliqueIP()
if err != nil {
return nil, fmt.Errorf("error getting clique ID: %v", err)
}
if commonCliqueID == "" {
commonCliqueID = cliqueID
}
if commonCliqueID != cliqueID {
klog.Warningf("Clique IDs are different: %s != %s", commonCliqueID, cliqueID)
return nil, nil
}
}

labels := Labels{
"nvidia.com/gpu.clusteruuid": commonClusterUUID,
"nvidia.com/gpu.cliqueid": commonCliqueID,
"nvidia.com/gpu.imex-domain": hashedconfig + "-" + commonCliqueID,
}

return labels, nil
}

func getModeForClasses(classes []uint32) string {
if len(classes) == 0 {
return "unknown"
Expand Down Expand Up @@ -254,3 +356,20 @@
}
return classes, nil
}

func generateUUIDs(seed string) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func generateUUIDs(seed string) string {
func generateUUID(seed string) string {

Copy link
Member

@elezar elezar Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, does it make sense to just use an existing hash function instead of rolling our own?

https://yourbasic.org/golang/hash-md5-sha256-string-file/

or

https://pkg.go.dev/hash/[email protected]#example-package

rand := rand.New(rand.NewSource(hash(seed)))

Check failure on line 361 in internal/lm/nvml.go

View workflow job for this annotation

GitHub Actions / check

G404: Use of weak random number generator (math/rand or math/rand/v2 instead of crypto/rand) (gosec)

charset := make([]byte, 16)
rand.Read(charset)
uuid, _ := uuid.FromBytes(charset)
return uuid.String()
}

func hash(s string) int64 {
h := int64(0)
for _, c := range s {
h = 31*h + int64(c)
}
return h
}
12 changes: 12 additions & 0 deletions internal/resource/cuda-device.go
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,18 @@ func (d *cudaDevice) IsMigEnabled() (bool, error) {
return false, nil
}

func (d *cudaDevice) IsImexCapable() (bool, error) {
return false, nil
}

func (d *cudaDevice) GetPCIClass() (uint32, error) {
return 0, nil
}

func (d *cudaDevice) GetClusterUUID() (string, error) {
return "", nil
}
Comment on lines +108 to +110
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we wanted to return errors here?


func (d *cudaDevice) GetCliqueIP() (string, error) {
return "", nil
}
Loading
Loading