Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for COS #77

Merged
merged 3 commits into from
Jul 31, 2024
Merged

Support for COS #77

merged 3 commits into from
Jul 31, 2024

Conversation

Signed-off-by: Mike McKiernan <[email protected]>
Copy link

Documentation preview

https://nvidia.github.io/cloud-native-docs/review/pr-77

@mikemckiernan mikemckiernan self-assigned this Jul 27, 2024
@mikemckiernan mikemckiernan added the documentation Issue/PR focused on fixing/editing/adding documentation bits label Jul 27, 2024
Copy link
Contributor

@cdesiniotis cdesiniotis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mikemckiernan, this is a great start. I left a few comments / questions. My main question concerns the overall structure of the document and whether using the operating systems (COS and Ubuntu) as section headers would improve readability. Happy to discuss this.

gpu-operator/google-gke.rst Show resolved Hide resolved
Using NVIDIA Driver Manager
***************************

Perform the following steps to create a GKE cluster with the ``gcloud`` CLI and use the Operator and NVIDIA Driver Manager to manage the GPU driver.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a reader, it is not entirely clear to me that this section, Using NVIDIA Driver Manager, is only applicable for Ubuntu node pools. It may be worth adding a sentence here stating that this approach is only supported on Ubuntu.


You can choose to use the Google driver installer to manage the NVIDIA GPU Driver.
Alternatively, you can use the Operator and NVIDIA Driver Manager to manage the driver lifecycle and upgrades.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am questioning whether NVIDIA Driver Manager is the term we should use throughout this page. I do not have a better idea at the moment.

Comment on lines 42 to 43
You can choose to use the Google driver installer to manage the NVIDIA GPU Driver.
Alternatively, you can use the Operator and NVIDIA Driver Manager to manage the driver lifecycle and upgrades.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the NVIDIA perspective, we recommend the latter approach on Ubuntu (have GPU Operator manage every software component) -- I am wondering if we should make that more evident here and in the procedure for Ubuntu.

gpu-operator/google-gke.rst Show resolved Hide resolved
@Dragoncell
Copy link

I think the second section Using NVIDIA Driver Manager most of part is duplicated to the part Using Google Driver Installer

Maybe combine the command steps, and separate the different steps towards the two approach

Up to step 7 in Using Google Driver Installer are same for Using NVIDIA Driver Manager except the Ubuntu node image.

then for

  1. Using Google Driver Installer : 1. install Google Driver installer 2. Install Operator
  2. Using NVIDIA Driver Manager : 1. Install Operator with driver enabled

@mikemckiernan
Copy link
Member Author

I think the second section Using NVIDIA Driver Manager most of part is duplicated to the part Using Google Driver Installer

Maybe combine the command steps, and separate the different steps towards the two approach

Up to step 7 in Using Google Driver Installer are same for Using NVIDIA Driver Manager except the Ubuntu node image.

then for

  1. Using Google Driver Installer : 1. install Google Driver installer 2. Install Operator
  2. Using NVIDIA Driver Manager : 1. Install Operator with driver enabled

The redundancy is not ideal, but I didn't think they were identical. The first step in the first procedure is to create a node pool. The first step in the second procedure is to create a cluster. I also wasn't under the impression that using a ubuntu_containerd image would require applying the node labels or disabling the automatic driver installation. (Has that changed and the procedure is stale? We can address offline.)

So, based on my understanding that customers either start by creating a node pool with specific node labels and command-line args that aren't common, a middle-of-procedure step to install the driver installer daemon set, and differing arguments to Helm at the end to install the Operator...I felt that was more cognitive load than necessary and could be reduced by separate procedures.

Signed-off-by: Mike McKiernan <[email protected]>
@mikemckiernan
Copy link
Member Author

Thanks @mikemckiernan, this is a great start. I left a few comments / questions. My main question concerns the overall structure of the document and whether using the operating systems (COS and Ubuntu) as section headers would improve readability. Happy to discuss this.

I started off by using the OS to organize the info, but because COS and Ubuntu are both supported by the Google driver installer, I felt like it got too messy too quickly. I revised the intro section and I hope it helps orient readers for the decision to make--driver installer or Operator, and more prominently shows the supported OSes.


.. code-block:: console

$ USE_GKE_GCLOUD_AUTH_PLUGIN=True \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can drop the USE_GKE_GCLOUD_AUTH_PLUGIN=True and rely on gcloud default, it shouldn't be needed

Suggested change
$ USE_GKE_GCLOUD_AUTH_PLUGIN=True \

cc @Dragoncell do you have some context on why we suggested this originally?

@bobbypage
Copy link

Thanks for updating the docs, left a few suggestions!

Signed-off-by: Mike McKiernan <[email protected]>
@mikemckiernan mikemckiernan merged commit e9a9ce9 into NVIDIA:main Jul 31, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Issue/PR focused on fixing/editing/adding documentation bits
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants