
[core][aDAG] Fix cpu tensor is automatically converted to gpu tensor #47742

Open — wants to merge 1 commit into base: master
Conversation

rkooo567 (Contributor)

Why are these changes needed?

Right now, when CPU and GPU tensors are mixed inside an arbitrary data structure, aDAG automatically converts all of them to GPU tensors on the receiver side. This is unintuitive. This PR fixes it as follows:

  • By default, a tensor is deserialized onto the same type of device it was created on (e.g., cpu -> cpu; gpu -> the GPU assigned to the receiving actor).
  • A tensor can be moved to a different device explicitly by specifying TorchTensorType(_device=X).
  • Implicit device conversion of tensors is no longer allowed, to avoid confusing behavior.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 rkooo567 requested a review from a team as a code owner September 19, 2024 09:23
@stephanie-wang (Contributor)

The implementation looks good!

However, I would prefer slightly different semantics. One issue with keeping the tensor on the same device where it was created is that for GPU -> CPU we can properly throw an error, but for CPU -> GPU the user needs to remember to explicitly move the tensor. The proposed API also doesn't work for cases where a single value contains multiple tensors, some of which need to be moved between GPUs while others need to stay on CPU. That could be fixed with a finer-grained API, but that is probably too complicated to introduce right now.

Here is a proposal that I think is the easiest to use and also gives intuitive behavior:

  1. Each actor's default device is assigned at creation time: CPU by default, or the first device in CUDA_VISIBLE_DEVICES if the actor requested GPU resources.
  2. If no explicit device is specified:
    a. If the actor sends a tensor that is on its own default device, then the receiving actor will deserialize the tensor to its default device. E.g., a GPU actor sending a GPU tensor to a CPU actor means that the receiving actor will receive a CPU tensor.
    b. If the actor sends a tensor that is not on its own default device, then the receiving actor will deserialize the tensor to the sender's device. An error is thrown if the receiver does not have the sender's device visible.
  3. If an explicit device is specified, then the receiving actor will deserialize to that device. An error is thrown if the receiver does not have that device visible.

Some examples:

  • GPU actor sends GPU tensor to a CPU actor -> rule 2a, receiving actor will receive a CPU tensor
  • GPU actor sends CPU tensor to a GPU actor -> rule 2b, receiving actor will receive a CPU tensor
  • CPU actor sends CPU tensor to a GPU actor -> rule 2a, receiving actor will receive a GPU tensor
  • GPU actor sends a CPU tensor to a GPU actor, specifies _device="cuda" -> rule 3, receiving actor will receive a GPU tensor
  • GPU actor sends a CPU tensor to a CPU actor, specifies _device="cuda" -> rule 3, error
