
Add batch evaluation support when batch_size > 1 #36

Open · wants to merge 13 commits into base: main

Conversation

infinitylogesh (Collaborator)

Fixes #23

@infinitylogesh (Collaborator, Author)

Added num_return_sequences as an argument, since batch_size also acting as num_return_sequences was confusing. Now num_return_sequences holds the number of generations per input in the batch, and batch_size holds the number of inputs in the batch. I hope this change is fine. The docs and examples have been updated with the new argument.
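
For illustration, here is a minimal sketch of how the two arguments map onto transformers' generate() (gpt2 is used as a stand-in model; this is not the harness code):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token
tokenizer.padding_side = "left"             # left-pad for causal batch generation
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = ["def add(a, b):", "def fib(n):"]  # batch_size = 2 inputs
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=32,
    num_return_sequences=3,  # 3 generations per input in the batch
    pad_token_id=tokenizer.eos_token_id,
)
# outputs.shape[0] == batch_size * num_return_sequences == 6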

@infinitylogesh infinitylogesh marked this pull request as ready for review January 26, 2023 12:58
@infinitylogesh infinitylogesh changed the title [WIP] Add batch evaluation support when batch_size > 1 Add batch evaluation support when batch_size > 1 Jan 26, 2023
@infinitylogesh (Collaborator, Author)

@loubnabnl @Muennighoff Please review and let me know your comments. Thanks. (I do not seem to have access to request a review.)

@loubnabnl (Collaborator) left a comment


Thanks for this great addition Logesh, it will be very useful! I did some testing and it works as expected in most cases, but it gives low scores on multiple GPUs when n_samples=num_return_sequences, and that will probably be the primary use case for this feature (when the desired n_samples already fits in memory but we want to generate problems in parallel).

One reason for the low scores was that task_ids weren't in the correct order when one GPU gets more than one task (see comment below). But even after fixing that there are still some small discrepancies, probably not just noise, that we need to investigate. You can find a doc with the experiments here: eval-harness-batching.

We should also probably do more tests with different combinations of num_return_sequences and batch_size in case we missed something.
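
To make the ordering pitfall concrete, here is a simplified sketch (hypothetical helper names, not the harness implementation) of the bookkeeping needed once generations are gathered from several GPUs: results have to be regrouped by task id rather than assumed to arrive in task order.

from collections import defaultdict

def regroup_by_task(gathered):
    """gathered: list of (task_id, generation) pairs collected from all processes."""
    gens_per_task = defaultdict(list)
    for task_id, generation in gathered:
        gens_per_task[task_id].append(generation)
    # Return generations in task order, however the tasks were sharded across GPUs.
    return [gens_per_task[tid] for tid in sorted(gens_per_task)]

# Example: 2 GPUs, 4 tasks, 1 generation each; GPU 0 handled tasks [0, 2], GPU 1 handled [1, 3].
gathered = [(0, "a"), (2, "c"), (1, "b"), (3, "d")]
assert regroup_by_task(gathered) == [["a"], ["b"], ["c"], ["d"]]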

lm_eval/utils.py (review thread, outdated, resolved)
@infinitylogesh (Collaborator, Author)

Thank you so much for the detailed review and for catching this issue. I will look into it further and update!

@infinitylogesh (Collaborator, Author) commented Feb 19, 2023

My updates from further analysis: I found the following to be influencing the variations in the scores (apart from the task id repetition issue):

  1. Device-specific seed: By default the device_specific parameter in set_seed is set to True. In cases where num_return_sequences=n_samples, changing the batch size might place a given task on a different GPU at runtime, which can introduce variation in the results because the seed then varies with the device. I have currently set the device_specific flag to False for the num_return_sequences=n_samples case.
  2. Transformers repo: generations from the model vary with batch size even when the inputs passed to the model are ensured to be the same. I have tried to replicate the variations in this colab for SantaCoder and CodeGen. There are existing issues reported in the transformers repo pointing to this behaviour (Issue1, Issue2, Issue3, Issue4). Digging a bit deeper, I suspect the reason for these variations is the following (also shown in the colab, and in the minimal check after this list):
    • logits from transformers vary for the same input as the batch size varies
    • torch.multinomial, used for sampling the next token, can return a different next token for the same input when the batch size changes, if the same input happens to land at a different index in the batch (which is expected when the batch size changes)
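
A minimal check along those lines (gpt2 as a stand-in model, not the harness code): score the same prompt at batch size 1 and duplicated at batch size 2, compare the next-token logits, and see whether sampling with a fixed seed still picks the same token. On GPU the logits typically differ by a small epsilon, which can be enough to change torch.multinomial's pick and hence the whole continuation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tokenizer("def hello_world():", return_tensors="pt").input_ids

with torch.no_grad():
    logits_bs1 = model(ids).logits[0, -1]               # batch size 1
    logits_bs2 = model(ids.repeat(2, 1)).logits[0, -1]  # same prompt, batch size 2

print("max |logits_bs1 - logits_bs2|:", (logits_bs1 - logits_bs2).abs().max().item())

probs_bs1 = torch.softmax(logits_bs1, dim=-1)
probs_bs2 = torch.softmax(logits_bs2, dim=-1)

torch.manual_seed(0)
tok_bs1 = torch.multinomial(probs_bs1, num_samples=1)
torch.manual_seed(0)
tok_bs2 = torch.multinomial(probs_bs2, num_samples=1)
print("same sampled next token:", torch.equal(tok_bs1, tok_bs2))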

I am afraid our scores will only be stable across batch sizes once this variation in the transformers repo is handled. Please let me know if there are any workarounds or suggestions.

lm_eval/generation.py (review thread, outdated, resolved)
lm_eval/utils.py (review thread, resolved)
Comment on lines +44 to +49
num_return_sequences: Optional[int] = field(
    default=1,
    metadata={
        "help": "The number of independently computed return sequences for each element in the batch"
    }
)
Contributor:

Why do we need this argument in addition to n_samples? Aren't they kind of the same?

Collaborator (Author):

The n_samples argument captures the overall number of samples to be generated for a prompt/task, while num_return_sequences is the number of samples to be generated in a single pass.

There can be scenarios where n_samples > num_return_sequences, for example when n_samples does not fit in memory. In that case, the task/prompt is repeated over multiple passes to meet the overall n_samples (as implemented here).

For example, to calculate pass@100 I might need n_samples to be 100, but due to the memory limit I can only have num_return_sequences of 10, so the task is repeated 10 times to reach the n_samples count of 100.
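
A simplified sketch of that bookkeeping (not the harness code; generation_plan is a hypothetical helper):

import math

def generation_plan(n_samples: int, num_return_sequences: int):
    """How many generate() passes per prompt are needed to reach n_samples."""
    n_passes = math.ceil(n_samples / num_return_sequences)
    total_generated = n_passes * num_return_sequences  # may slightly overshoot n_samples
    return n_passes, total_generated

# The pass@100 example above: n_samples=100, num_return_sequences=10
assert generation_plan(100, 10) == (10, 100)  # the prompt is repeated 10 times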

Contributor:

But shouldn't the batch_size be responsible for handling the memory limitations? Can't we use it to infer num_return_sequences?

IIURC it means that n_samples=16 batch_size=16 num_return_sequences=1 is the same as n_samples=16 batch_size=1 num_return_sequences=16, right?

Collaborator (Author):

I agree that both settings are the same for the case you have shown, but I am not quite getting how we can infer num_return_sequences from batch_size. Can you please explain? Thanks.

Contributor:

IIURC batch_size is used to pick batch_size new items, so I think something like:

if batch_size < n_samples:
    # Memory requirement is the same as batch_size, but we only pick 1 new item per pass (i.e. batch_size=1)
    num_return_sequences = batch_size
    batch_size = 1
else:
    # If n_samples == 1, just pick batch_size new items; if n_samples > 1, somewhere in between
    num_return_sequences = n_samples
    # Round down, so that we always have <= batch_size items in one go
    batch_size = batch_size // num_return_sequences
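
For concreteness, tracing that snippet (if I am reading it right): n_samples=100 with batch_size=16 gives num_return_sequences=16 and batch_size=1 (one prompt per pass, 16 samples each); n_samples=4 with batch_size=16 gives num_return_sequences=4 and batch_size=4 (four prompts per pass, 4 samples each); and n_samples=1 with batch_size=16 leaves num_return_sequences=1 and batch_size=16.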

@infinitylogesh (Collaborator, Author)

Update! An update on replicating this behaviour of varying generations across batch sizes using an external repo:

I used the batch generation script from the incoder repo (as suggested by Daniel Fried on Slack) and was able to replicate this behaviour (as shown in the screenshot below; full colab here). For the same set of inputs, the generations vary based on the batch size.

So I believe this is a general behaviour and is probably expected to happen, based on my analysis in previous comments.

[screenshot: generations differing across batch sizes]

@Muennighoff (Contributor), replying to the update above:

That's very odd, does it also happen for non-code models using the in-built transformer generate function with a batch? E.g. generating with https://huggingface.co/gpt2

@infinitylogesh (Collaborator, Author)

Yes, this happens with the gpt2 model too. Please check the colab; it has an example with GPT2. This has been discussed in other issues too (issue1, issue2).

@huybery commented Sep 11, 2023

Any new progress? Everyone needs it. 😁

Successfully merging this pull request may close these issues: support for batch size > 1 for single problem generations (n_samples=1).