Support for COS #77
Conversation
Signed-off-by: Mike McKiernan <[email protected]>
Documentation preview
Thanks @mikemckiernan, this is a great start. I left a few comments / questions. My main question concerns the overall structure of the document and whether using the operating systems (COS and Ubuntu) as section headers would improve readability. Happy to discuss this.
Using NVIDIA Driver Manager
***************************

Perform the following steps to create a GKE cluster with the ``gcloud`` CLI and use the Operator and NVIDIA Driver Manager to manage the GPU driver.
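For context, the cluster-creation step under review might look like the following sketch. The cluster name, zone, machine type, accelerator type, and node label below are hypothetical placeholders, not taken from the PR; the `gke-no-default-nvidia-gpu-device-driver-installer` label is an assumption about how GKE's automatic driver installation is disabled.

```shell
# Hypothetical sketch: create a GKE cluster with Ubuntu nodes and GPUs,
# leaving driver installation to the Operator and NVIDIA Driver Manager.
# All resource names and types here are illustrative placeholders.
gcloud container clusters create demo-cluster \
    --zone us-west1-a \
    --machine-type n1-standard-4 \
    --accelerator type=nvidia-tesla-t4,count=1 \
    --image-type UBUNTU_CONTAINERD \
    --node-labels="gke-no-default-nvidia-gpu-device-driver-installer=true"
```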
As a reader, it is not entirely clear to me that this section, Using NVIDIA Driver Manager, is only applicable to Ubuntu node pools. It may be worth adding a sentence here stating that this approach is only supported on Ubuntu.
gpu-operator/google-gke.rst
Outdated
You can choose to use the Google driver installer to manage the NVIDIA GPU Driver.
Alternatively, you can use the Operator and NVIDIA Driver Manager to manage the driver lifecycle and upgrades.
I am questioning whether NVIDIA Driver Manager is the term we should use throughout this page. I do not have a better idea at the moment.
gpu-operator/google-gke.rst
Outdated
You can choose to use the Google driver installer to manage the NVIDIA GPU Driver.
Alternatively, you can use the Operator and NVIDIA Driver Manager to manage the driver lifecycle and upgrades.
From the NVIDIA perspective, we recommend the latter approach on Ubuntu (have GPU Operator manage every software component) -- I am wondering if we should make that more evident here and in the procedure for Ubuntu.
I think the second section is largely redundant with the first. Maybe combine the common command steps, and separate only the steps that differ between the two approaches: the steps appear to be shared up to step 7, and then they diverge for each approach.
The redundancy is not ideal, but I didn't think they were identical. The first step in the first procedure is to create a node pool; the first step in the second procedure is to create a cluster. I also wasn't under the impression that using a ubuntu_containerd image would require applying the node labels or disabling the automatic driver installation. (Has that changed and the procedure is stale? We can address offline.) So, based on my understanding, customers start by creating a node pool with specific node labels and command-line args that aren't common, perform a middle-of-procedure step to install the driver installer daemon set, and pass differing arguments to Helm at the end to install the Operator. I felt that was more cognitive load than necessary and could be reduced by separate procedures.
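The divergence described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual procedure: cluster and pool names are placeholders, the node label is an assumption, and the daemon set URL is Google's commonly documented driver installer manifest for COS.

```shell
# Hypothetical sketch of why the two procedures diverge.

# Approach 1 (Operator-managed driver): the first step creates a NODE POOL
# with non-default labels and flags (all names below are placeholders).
gcloud container node-pools create gpu-pool \
    --cluster demo-cluster \
    --accelerator type=nvidia-tesla-t4,count=1 \
    --image-type UBUNTU_CONTAINERD \
    --node-labels="gke-no-default-nvidia-gpu-device-driver-installer=true"

# Approach 2 (Google driver installer): the first step creates a CLUSTER,
# and a mid-procedure step applies the driver installer daemon set.
gcloud container clusters create demo-cluster-2 \
    --zone us-west1-a \
    --accelerator type=nvidia-tesla-t4,count=1
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# The final Helm arguments also differ; for example, with the Google
# installer managing the driver, the Operator's driver is disabled.
helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --set driver.enabled=false
```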
Signed-off-by: Mike McKiernan <[email protected]>
I started off by using the OS to organize the info, but because COS and Ubuntu are both supported by the Google driver installer, I felt like it got too messy too quickly. I revised the intro section, and I hope it orients readers toward the decision to make (driver installer or Operator) and more prominently shows the supported OSes.
gpu-operator/google-gke.rst
Outdated
.. code-block:: console

   $ USE_GKE_GCLOUD_AUTH_PLUGIN=True \
I think we can drop the USE_GKE_GCLOUD_AUTH_PLUGIN=True and rely on the gcloud default; it shouldn't be needed.
cc @Dragoncell do you have some context on why we suggested this originally?
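For context on the variable under discussion, a sketch of where it appears: during the migration to the external GKE auth plugin, setting the variable opted `gcloud container clusters get-credentials` into the new plugin; newer gcloud releases use the plugin by default, which is likely why it is no longer needed. The cluster name and zone below are hypothetical placeholders.

```shell
# Older opt-in form: the env var forces the external GKE auth plugin
# when writing kubectl credentials (placeholder cluster name and zone).
USE_GKE_GCLOUD_AUTH_PLUGIN=True \
  gcloud container clusters get-credentials demo-cluster \
    --zone us-west1-a

# Current gcloud releases default to the plugin, so this suffices:
gcloud container clusters get-credentials demo-cluster \
    --zone us-west1-a
```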
Thanks for updating the docs, left a few suggestions!
Signed-off-by: Mike McKiernan <[email protected]>
RN: https://nvidia.github.io/cloud-native-docs/review/pr-77/gpu-operator/latest/release-notes.html#new-features
Review HTML: https://nvidia.github.io/cloud-native-docs/review/pr-77/gpu-operator/latest/google-gke.html