
[core][aDAG] Fix cpu tensor is automatically converted to gpu tensor #47742

Open — wants to merge 1 commit into base: master
Conversation

rkooo567 (Contributor)

Why are these changes needed?

Right now, when CPU and GPU tensors are mixed inside an arbitrary data structure, aDAG automatically converts all of them to GPU tensors on the receiver side. This is unintuitive. This PR fixes it as follows:

  • By default, a tensor is deserialized onto the same type of device it was created on (e.g., cpu -> cpu; gpu -> the GPU assigned to the receiving actor).
  • A tensor can be moved to a different device explicitly by specifying TorchTensorType(_device=X).
  • Implicit device conversion of tensors is no longer allowed, to avoid confusing behavior.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 rkooo567 requested a review from a team as a code owner September 19, 2024 09:23
@stephanie-wang (Contributor)

The implementation looks good!

However, I would prefer slightly different semantics. One issue with keeping the tensor on the same device where it was created is that for GPU -> CPU we can properly throw an error, but for CPU -> GPU the user needs to remember to explicitly move the tensor. The proposed API also doesn't work for cases where a single value contains multiple tensors, some of which need to be moved between GPUs while others need to stay on CPU. That could be fixed with a finer-grained API, but that is probably too complicated to introduce right now.

Here is a proposal that I think is the easiest to use and also gives intuitive behavior:

  1. Each actor's default device is assigned at creation time: CPU by default, or the first device in CUDA_VISIBLE_DEVICES if the actor requested GPU resources.
  2. If no explicit device is specified:
    a. If the actor sends a tensor that is on its own default device, then the receiving actor will deserialize the tensor to its default device. E.g., a GPU actor sending a GPU tensor to a CPU actor means that the receiving actor will receive a CPU tensor.
    b. If the actor sends a tensor that is not on its own default device, then the receiving actor will deserialize the tensor to the sender's device. An error is thrown if the receiver does not have the sender's device visible.
  3. If an explicit device is specified, then the receiving actor will deserialize to that device. An error is thrown if the receiver does not have that device visible.

Some examples:

  • GPU actor sends GPU tensor to a CPU actor -> rule 2a, receiving actor will receive a CPU tensor
  • GPU actor sends CPU tensor to a GPU actor -> rule 2b, receiving actor will receive a CPU tensor
  • CPU actor sends CPU tensor to a GPU actor -> rule 2a, receiving actor will receive a GPU tensor
  • GPU actor sends a CPU tensor to a GPU actor, specifies _device="cuda" -> rule 3, receiving actor will receive a GPU tensor
  • GPU actor sends a CPU tensor to a CPU actor, specifies _device="cuda" -> rule 3, error
