Investigate whether sriov-operator can provide complete RDMA metrics. #3790

Open
ty-dc opened this issue Jul 29, 2024 · 0 comments
ty-dc commented Jul 29, 2024

What would you like to be added?

RDMA metrics are needed to confirm that training traffic is going over RDMA as expected and that communication performance is good.

Some RDMA metrics can be seen through node-exporter in both shared mode and exclusive mode. VF metrics on the host can be captured by node-exporter, but in exclusive mode the VF device inside the container is not visible on the host, so node-exporter does not apply there. Investigate sriov-operator's features in large-cluster mode to see whether it can provide these RDMA metrics.
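To make the visibility problem concrete, here is a minimal sketch of how RDMA counters could be read from the kernel's sysfs tree. It assumes the usual Linux layout `/sys/class/infiniband/<device>/ports/<port>/counters` and `hw_counters`. The function name `read_rdma_counters` and the `sysfs_root` parameter are hypothetical and are not part of sriov-operator or node-exporter. Run on the host, a reader like this only sees devices visible in the host namespace. For a VF in exclusive mode it would presumably have to run inside the pod's network namespace, with sysfs mounted there.

```python
from pathlib import Path


def read_rdma_counters(sysfs_root="/sys/class/infiniband"):
    """Return {device: {port: {"group/counter": int}}} for RDMA devices visible here.

    Hypothetical helper: it only reads whatever the current namespace exposes.
    """
    root = Path(sysfs_root)
    result = {}
    if not root.is_dir():
        return result  # no RDMA devices visible from this namespace

    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        ports = {}
        for port in sorted(ports_dir.iterdir()):
            counters = {}
            # "counters" holds the standard port counters, "hw_counters" the
            # driver-specific ones. Either directory may be absent.
            for group in ("counters", "hw_counters"):
                group_dir = port / group
                if not group_dir.is_dir():
                    continue
                for entry in sorted(group_dir.iterdir()):
                    try:
                        value = int(entry.read_text().strip())
                    except (OSError, ValueError):
                        continue  # skip unreadable or non-numeric entries
                    counters[f"{group}/{entry.name}"] = value
            ports[port.name] = counters
        result[dev.name] = ports
    return result


if __name__ == "__main__":
    for device, ports in read_rdma_counters().items():
        for port, counters in ports.items():
            print(device, port, len(counters), "counters")
```

Comparing the output on the host with the output inside a pod that owns a VF would show which devices, and therefore which counters, a host-level exporter cannot reach.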

Why is this needed?

No response

How to implement it (if possible)?

No response

Additional context

No response
