Making sure the NVIDIA driver extension has finished before doing anything more. #6

Open
mikecroucher opened this issue Jul 17, 2020 · 2 comments

Comments

@mikecroucher
Contributor

In the text we recommend waiting for up to 10 minutes.

In the script we wait for 60 seconds.

Could we ask Xavier, for example, how he deals with this in his end-to-end scripts?
Maybe there is a way to probe the VM every 30 seconds or so?

I'm hoping for a better solution than 'wait for an indeterminate amount of time'.
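
To make the "probe every 30 seconds" idea concrete, here is a minimal sketch of a bounded polling loop. It assumes some `probe` callable exists that returns `True` once the VM is ready. What that probe actually checks is left open, and the name `wait_until_ready` and its parameters are hypothetical, not taken from the existing script. The defaults mirror the numbers above: a 30-second interval and a 10-minute ceiling.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    interval: float = 30.0,   # seconds between probes (from the suggestion above)
    timeout: float = 600.0,   # give up after 10 minutes (the figure in the text)
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() every `interval` seconds until it returns True.

    Returns True as soon as a probe succeeds, or False if the next
    probe would fall after the deadline.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

`sleep` and `clock` are injectable only so that the loop can be exercised without real waiting. In a script it would be called as `wait_until_ready(my_probe)`, where `my_probe` is whatever readiness check turns out to be trustworthy.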

@ptooley
Contributor

ptooley commented Jul 17, 2020

Blurb for John:
"The GPU driver installer doesn't work properly and keeps rebooting the machine after it claims it is finished. This means there is no robust way to check when the VM is actually ready to use, so we can't build a reliable end-to-end solution. The options are that we get MS to fix their system, or that we break the script into two parts and explain why we can't do end-to-end scripting."

@mikecroucher
Contributor Author

Info sent to Phil from MS support. We don't think this is very good at all!

I got an update from my PG team: the behaviour you see is expected. The provisioningState is returned as “success” because of the short time window in which the installation must actually succeed and the constraints on working around reboots. The installation can have up to 3 steps (2 reboots), depending on the requirements of the VM. Giving adequate time for the installation to finish post-reboot, or tailing the log file for success, is currently the only way this complicated multi-step installation can be handled with VM extensions.

manasa@Azure:~$ az vm extension list --resource-group manasa-cyclecloud --vm-name nctest4 -o table
Name                  ProvisioningState    Publisher             Version    AutoUpgradeMinorVersion
--------------------  -------------------  --------------------  ---------  -------------------------
NvidiaGpuDriverLinux  Succeeded            Microsoft.HpcCompute  1.3        True

Please let us know if you have any further questions!
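
For reference, the "tail the log file for success" option could look roughly like the sketch below. It is only an illustration under stated assumptions: `read_log` is a hypothetical callable that fetches the installer log text from the VM (the log path and transport are not given in the reply above), and `success_marker` is a hypothetical string that the installer would write only after its final step. Since the VM may be mid-reboot when we ask, a failed read is treated as "not finished yet" rather than as an error.

```python
from typing import Callable


def install_finished(read_log: Callable[[], str], success_marker: str) -> bool:
    """Return True only if the log could be read and contains the marker.

    read_log and success_marker are placeholders: the real log location
    and success message would need to be confirmed with MS support.
    """
    try:
        log_text = read_log()
    except OSError:
        # The VM is unreachable, for example during one of the reboots.
        return False
    # Look for the marker on any line of the log.
    return any(success_marker in line for line in log_text.splitlines())
```

A check like this only helps if the marker is guaranteed to appear after the last reboot; otherwise it has the same weakness as the `Succeeded` provisioning state shown above.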

