
The BatchSplittingSampler cannot handle empty batches #522

Open
xichens opened this issue Oct 14, 2022 · 1 comment

Comments


xichens commented Oct 14, 2022

🐛 Bug

When Poisson sampling is used, empty batches can occur. However, the BatchSplittingSampler in batch_memory_manager.py, which is used by the BatchMemoryManager, cannot handle empty batches and throws an error.

To Reproduce

To reproduce it, see this Colab link.

Expected behavior

The wrapped batch sampler should handle empty batches properly.

Additional context

I think the issue is with this line. When calling

            split_idxs = np.array_split(
                batch_idxs, math.ceil(len(batch_idxs) / self.max_batch_size)
            )

the batch_idxs can be an empty list, since it comes from a UniformWithReplacementSampler. When it is empty, math.ceil(len(batch_idxs) / self.max_batch_size) evaluates to 0, and np.array_split raises a ValueError because it requires at least one section.
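The failure mode can be reproduced in isolation. The snippet below shows the error with an empty index list, plus one possible guard (clamping the section count to at least 1 is my own suggestion, not necessarily the fix Opacus adopted):

```python
import math

import numpy as np

batch_idxs = []        # Poisson sampling can yield an empty batch
max_batch_size = 8

# math.ceil(0 / 8) == 0, and np.array_split requires at least one section:
try:
    np.array_split(
        np.asarray(batch_idxs), math.ceil(len(batch_idxs) / max_batch_size)
    )
except ValueError as e:
    print(e)  # raises ValueError: number sections must be larger than 0.

# Possible guard: clamp the section count to at least 1, so an empty
# batch splits into a single empty chunk instead of raising.
split_idxs = np.array_split(
    np.asarray(batch_idxs), max(1, math.ceil(len(batch_idxs) / max_batch_size))
)
print(len(split_idxs))  # 1 chunk, containing an empty array
```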


xichens commented Oct 17, 2022

A follow-up issue related to this one. Even without calling wrap_data_loader from BatchMemoryManager, the DPDataLoader has trouble handling empty batches in certain cases. Currently, the wrapt_collate_with_empty function here creates empty tensors for empty batches, but it sets only the shape, not the dtype, of those tensors. By default these empty tensors are float, yet some modules expect a particular input dtype; for example, torch.nn.Embedding expects int or long inputs.

I think wrapt_collate_with_empty should take both the dtypes and the shapes of the actual samples into account when creating the empty tensors.
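A dtype-aware version could look roughly like the sketch below. The function name and signature are illustrative, not the actual Opacus implementation; the point is only that the empty tensors carry the sample's dtype as well as its shape:

```python
import torch


def collate_with_empty_sketch(collate_fn, sample_empty_shapes, dtypes):
    """Hypothetical sketch of a dtype-aware collate wrapper."""

    def collate(batch):
        if len(batch) > 0:
            return collate_fn(batch)
        # Empty batch: build tensors matching both shape AND dtype, so
        # e.g. an nn.Embedding input stays torch.long instead of
        # defaulting to float.
        return [
            torch.zeros(shape, dtype=dtype)
            for shape, dtype in zip(sample_empty_shapes, dtypes)
        ]

    return collate
```

With this, an empty batch collated for an embedding input yields a zero-length long tensor, which `torch.nn.Embedding` accepts without a dtype error.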

facebook-github-bot pushed a commit that referenced this issue Nov 7, 2022
Summary:
## Background

Poisson sampling can sometimes result in an empty input batch, especially if the sampling rate (i.e., the expected batch size) is small. This is not out of the ordinary and should be handled accordingly: gradients (the signal) should be set to 0, and noise should still be added.
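The intended semantics can be sketched in a few lines of NumPy. This is illustrative only, not Opacus's actual DPOptimizer code; the function and parameter names are my own. The key point is that the noise term does not depend on the batch size:

```python
import numpy as np


def dp_step_sketch(per_sample_grads, noise_multiplier, max_grad_norm, dim, rng):
    """Hypothetical sketch of a DP-SGD step that tolerates empty batches."""
    if len(per_sample_grads) == 0:
        # Empty batch: there is no signal, but the step still happens.
        signal = np.zeros(dim)
    else:
        # Clip each per-sample gradient to max_grad_norm, then sum.
        norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
        clipped = per_sample_grads * np.minimum(
            1.0, max_grad_norm / np.maximum(norms, 1e-12)
        )
        signal = clipped.sum(axis=0)
    # Noise is added regardless of batch size; skipping it on empty
    # batches would reveal that no sample was drawn.
    noise = rng.normal(0.0, noise_multiplier * max_grad_norm, size=dim)
    return signal + noise
```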

We've made an [attempt](https://github.com/pytorch/opacus/blob/main/opacus/data_loader.py#L31) to support this behaviour, but it wasn't fully covered with tests and broke over time. As a result, at the moment we have a DataLoader that is capable of producing zero-sized batches, a GradSampleModule that only partially supports them, and a DPOptimizer that doesn't support them at all.

This PR addresses Issue #522 (thanks xichens for reporting)

## Improvements

This diff fixes the following:

* DPOptimizer can now handle empty batches
* BatchMemoryManager can now handle empty batches
* Adds a PrivacyEngine test with empty batches
* Adds a BatchMemoryManager test with empty batches
* DataLoader now respects the dtype of the inputs (previously, empty batches only worked with float input tensors)
* ExpandedWeights still can't process empty batches, which we call out in our README (FYI samdow)

Pull Request resolved: #530

Reviewed By: alexandresablayrolles

Differential Revision: D40676213

Pulled By: ffuuugor

fbshipit-source-id: dc637fd91a3c20d481d22c5de97d22d42e423a71