Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Docs] Add catalog of labels to README #928

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

chipzoller
Copy link
Contributor

@chipzoller chipzoller commented Aug 26, 2024

Signed-off-by: chipzoller [email protected]

Closes #846

This PR adds a new "catalog of labels" section to the device plugin's README and updates the same table for GFD. It also makes a few minor improvements:

  • Lints/standardizes common Markdown elements
  • Improves code blocks
  • Fixes a couple things
  • Updates version numbers to reflect current (v0.16.2)

Signed-off-by: Chip Zoller <[email protected]>
Signed-off-by: Chip Zoller <[email protected]>
Signed-off-by: Chip Zoller <[email protected]>
Signed-off-by: Chip Zoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
Signed-off-by: chipzoller <[email protected]>
@chipzoller
Copy link
Contributor Author

Could I get a review from you please, @mikemckiernan?

@chipzoller
Copy link
Contributor Author

And @elezar or @klueska, could I get some technical review here?

Copy link
Member

@mikemckiernan mikemckiernan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you again for the assist! PLMK what I can clarify.

@@ -148,7 +160,7 @@ With the daemonset deployed, NVIDIA GPUs can now be requested by a container
using the `nvidia.com/gpu` resource type:

```yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically shell, I think.

@@ -570,34 +586,56 @@ total memory and compute resources of the GPU.
**Note**: As of now, the only supported resource available for MPS are `nvidia.com/gpu`
resources and only with full GPUs.

## Catalog of Labels

The NVIDIA device plugin reads and writes a number of different labels which it uses as either
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not the grammar nerd that I should be, but I think s/which/that/ in this case. (It's a lazy guideline, but "that" is almost never wrong.)

Comment on lines +592 to +593
configuration elements or informational elements. The below table documents and describes each
along with their use. See the related table [here](/docs/gpu-feature-discovery/README.md#generated-labels) for GFD labels.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls: "The following table identifies and describes each label."

  • "following" is preferred instead of "below." (I can't recall when I learned that or why, it just is.)
  • Restate the noun, "label," for clarity.

Please move (and tweak) the last sentence after the table. I'd prefer that readers read this table and then be presented with the option to read the GFD labels rather than have the teensy cognitive load to decide whether to read the GFD labels before reading the table.

Maybe: "GPU Feature Discovery also adds labels. Refer to generated labels in the GFD README for more information."

Comment on lines +595 to +596
> [!NOTE]
> Label values in Kubernetes are always of type string. The table's value type describes the type within string formatting.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My pref is to avoid the NOTE to reduce visual busyness. After scrutinizing the message a bit, I think we can remove the this para and the Value Type column (to sidestep explaining a concept that is rarely an obstacle) and indicate in the description something like "Values are true or false."

> [!NOTE]
> Label values in Kubernetes are always of type string. The table's value type describes the type within string formatting.

| Label Name | Value Type | Meaning | Example |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls s/Meaning/Description/

| ----------------------------------| ---------- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -------------- |
| nvidia.com/device-plugin.config | String | The name of the configuration to apply to the node. Not automatically written by the device plugin; must be manually assigned by the user with the specified config. See [here](#updating-per-node-configuration-with-a-node-label) for details. | my-mps-config |
| nvidia.com/gpu.sharing-strategy | String | The sharing strategy in use. Will be set to `none` if not sharing a GPU. Additional values are `mps` and `time-slicing`. | time-slicing |
| nvidia.com/mig.capable | Boolean | If a device is currently in MIG mode. | false |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is true. Need SME verification.

Suggested change
| nvidia.com/mig.capable | Boolean | If a device is currently in MIG mode. | false |
| nvidia.com/mig.capable | Boolean | Specifies if any device on the node supports MIG. | false |

| nvidia.com/device-plugin.config | String | The name of the configuration to apply to the node. Not automatically written by the device plugin; must be manually assigned by the user with the specified config. See [here](#updating-per-node-configuration-with-a-node-label) for details. | my-mps-config |
| nvidia.com/gpu.sharing-strategy | String | The sharing strategy in use. Will be set to `none` if not sharing a GPU. Additional values are `mps` and `time-slicing`. | time-slicing |
| nvidia.com/mig.capable | Boolean | If a device is currently in MIG mode. | false |
| nvidia.com/mps.capable | Boolean | If a device is currently in MPS mode. | false |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto above, need SME review.

Suggested change
| nvidia.com/mps.capable | Boolean | If a device is currently in MPS mode. | false |
| nvidia.com/mps.capable | Boolean | Specifies if devices on the node are configured for MPS. | false |

| nvidia.com/gpu.sharing-strategy | String | The sharing strategy in use. Will be set to `none` if not sharing a GPU. Additional values are `mps` and `time-slicing`. | time-slicing |
| nvidia.com/mig.capable | Boolean | If a device is currently in MIG mode. | false |
| nvidia.com/mps.capable | Boolean | If a device is currently in MPS mode. | false |
| nvidia.com/vgpu.present | Boolean | If vGPU is in use. | false |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| nvidia.com/vgpu.present | Boolean | If vGPU is in use. | false |
| nvidia.com/vgpu.present | Boolean | Specifies if devices on the node use vGPU. | false |

| nvidia.com/mig.capable | Boolean | If a device is currently in MIG mode. | false |
| nvidia.com/mps.capable | Boolean | If a device is currently in MPS mode. | false |
| nvidia.com/vgpu.present | Boolean | If vGPU is in use. | false |
| nvidia.com/vgpu.host-driver-version | String | Version of the vGPU host driver in use on the underlying hypervisor. | 10.11.12 |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| nvidia.com/vgpu.host-driver-version | String | Version of the vGPU host driver in use on the underlying hypervisor. | 10.11.12 |
| nvidia.com/vgpu.host-driver-version | String | Specifies the vGPU host driver version on the underlying hypervisor. | 550.54.16 |

| nvidia.com/mps.capable | Boolean | If a device is currently in MPS mode. | false |
| nvidia.com/vgpu.present | Boolean | If vGPU is in use. | false |
| nvidia.com/vgpu.host-driver-version | String | Version of the vGPU host driver in use on the underlying hypervisor. | 10.11.12 |
| nvidia.com/vgpu.host-driver-branch | String | Branch of the vGPU host driver in use on the underlying hypervisor. | main |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| nvidia.com/vgpu.host-driver-branch | String | Branch of the vGPU host driver in use on the underlying hypervisor. | main |
| nvidia.com/vgpu.host-driver-branch | String | Specifies the vGPU host driver branch on the underlying hypervisor. | r550_40 |

@chipzoller
Copy link
Contributor Author

Thank you for your review, @mikemckiernan. Could we get technical review here first so we ensure the content is accurate before we do any word smithing?

@chipzoller
Copy link
Contributor Author

Requesting technical review, please.

@chlunde
Copy link

chlunde commented Sep 10, 2024

@chipzoller Could you also document if it is possible to not deploy GFD, and the minimum set of labels you would have to apply manually (for example using Karpenter)?

@chipzoller
Copy link
Contributor Author

The docs here already cover opting in to deploying GFD. As for the other part, not sure I follow.

@chlunde
Copy link

chlunde commented Sep 10, 2024

@chipzoller I wonder if it would be possible to not deploy GFD - but that requires knowing what labels you are required to set manually.

@chipzoller
Copy link
Contributor Author

but that requires knowing what labels you are required to set manually.

GFD is an optional component hence why it is an opt-in setting. There are no "required" labels here unless you have a specific use case or other piece of software which depends on one or more labels set by GFD. That seems like it's out of scope for what I'm trying to accomplish in this PR.

@chlunde
Copy link

chlunde commented Sep 10, 2024

@chipzoller OK, thanks, then we probably misunderstood something while trying to upgrade to 0.15 (which failed), we'll go back to debugging some more, sorry for the confusion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add section to README with catalog of device plugin specific labels
3 participants