
Add batch evaluation support when batch_size > 1 #36

Open · wants to merge 13 commits into base: main

Conversation

infinitylogesh (Collaborator)

Fixes #23

@infinitylogesh (Collaborator, Author)

Added num_return_sequences as an argument, since batch_size also acting as num_return_sequences was confusing. Now num_return_sequences holds the number of generations per input in the batch, and batch_size holds the number of inputs in the batch. I hope this change is fine. The docs and examples have been updated with the new argument.
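
For illustration, here is a minimal sketch of how the two arguments map onto transformers' generate() (gpt2 is used as a stand-in model; this is not the harness code):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token
tokenizer.padding_side = "left"             # left-pad for causal batch generation
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = ["def add(a, b):", "def fib(n):"]  # batch_size = 2 inputs
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=32,
    num_return_sequences=3,  # 3 generations per input in the batch
    pad_token_id=tokenizer.eos_token_id,
)
# outputs.shape[0] == batch_size * num_return_sequences == 6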

@infinitylogesh infinitylogesh marked this pull request as ready for review January 26, 2023 12:58
@infinitylogesh infinitylogesh changed the title [WIP] Add batch evaluation support when batch_size > 1 Add batch evaluation support when batch_size > 1 Jan 26, 2023
@infinitylogesh (Collaborator, Author)

@loubnabnl @Muennighoff Please review and let me know your comments. Thanks. (I do not seem to have access to request a review.)

@loubnabnl (Collaborator) left a comment


Thanks for this great addition Logesh, it will be very useful! I did some testing and it works as expected in most cases, but it gives low scores on multiple GPUs when n_samples=num_return_sequences, and that will probably be the primary use case for this feature (when the desired n_samples already fits in memory but we want to generate problems in parallel).

One reason for the low scores was that task_ids weren't in the correct order when one GPU gets more than one task (see comment below). But even after fixing that there are still some small discrepancies, probably not just noise, that we need to investigate. You can find a doc with the experiments here: eval-harness-batching.

We should also probably do more tests with different combinations of num_return_sequences and batch_size in case we missed something.
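
To make the ordering pitfall concrete, here is a simplified sketch (hypothetical helper names, not the harness implementation) of the bookkeeping needed once generations are gathered from several GPUs: results have to be regrouped by task id rather than assumed to arrive in task order.

from collections import defaultdict

def regroup_by_task(gathered):
    """gathered: list of (task_id, generation) pairs collected from all processes."""
    gens_per_task = defaultdict(list)
    for task_id, generation in gathered:
        gens_per_task[task_id].append(generation)
    # Return generations in task order, however the tasks were sharded across GPUs.
    return [gens_per_task[tid] for tid in sorted(gens_per_task)]

# Example: 2 GPUs, 4 tasks, 1 generation each; GPU 0 handled tasks [0, 2], GPU 1 handled [1, 3].
gathered = [(0, "a"), (2, "c"), (1, "b"), (3, "d")]
assert regroup_by_task(gathered) == [["a"], ["b"], ["c"], ["d"]]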

lm_eval/utils.py (review thread, outdated, resolved)
@infinitylogesh (Collaborator, Author)

Thank you so much for the detailed review and for catching this issue. I will look into it further and update!

@infinitylogesh (Collaborator, Author) commented Feb 19, 2023

My updates from further analysis: I found the following to be influencing the variations in the scores (apart from the task id repetition issue):

  1. Device-specific seed: By default the device_specific parameter in set_seed is set to True. In cases where num_return_sequences=n_samples, changing the batch size might place a given task on a different GPU at runtime, which can introduce variation in the results because the seed then varies with the device. I have currently set the device_specific flag to False for the num_return_sequences=n_samples case.
  2. Transformers repo: generations from the model vary with batch size even when the inputs passed to the model are ensured to be the same. I have tried to replicate the variations in this colab for SantaCoder and CodeGen. There are existing issues reported in the transformers repo pointing to this behaviour (Issue1, Issue2, Issue3, Issue4). Digging a bit deeper, I suspect the reason for these variations is the following (also shown in the colab, and in the minimal check after this list):
    • logits from transformers vary for the same input as the batch size varies
    • torch.multinomial, used for sampling the next token, can return a different next token for the same input when the batch size changes, if the same input happens to land at a different index in the batch (which is expected when the batch size changes)
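
A minimal check along those lines (gpt2 as a stand-in model, not the harness code): score the same prompt at batch size 1 and duplicated at batch size 2, compare the next-token logits, and see whether sampling with a fixed seed still picks the same token. On GPU the logits typically differ by a small epsilon, which can be enough to change torch.multinomial's pick and hence the whole continuation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tokenizer("def hello_world():", return_tensors="pt").input_ids

with torch.no_grad():
    logits_bs1 = model(ids).logits[0, -1]               # batch size 1
    logits_bs2 = model(ids.repeat(2, 1)).logits[0, -1]  # same prompt, batch size 2

print("max |logits_bs1 - logits_bs2|:", (logits_bs1 - logits_bs2).abs().max().item())

probs_bs1 = torch.softmax(logits_bs1, dim=-1)
probs_bs2 = torch.softmax(logits_bs2, dim=-1)

torch.manual_seed(0)
tok_bs1 = torch.multinomial(probs_bs1, num_samples=1)
torch.manual_seed(0)
tok_bs2 = torch.multinomial(probs_bs2, num_samples=1)
print("same sampled next token:", torch.equal(tok_bs1, tok_bs2))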

I am afraid our scores will only be stable across batch sizes once this variation in the transformers repo is handled. Please let me know if there are any workarounds or suggestions.

lm_eval/generation.py (review thread, outdated, resolved)
lm_eval/utils.py (review thread, resolved)
Comment on lines +44 to +49
num_return_sequences: Optional[int] = field(
    default=1,
    metadata={
        "help": "The number of independently computed return sequences for each element in the batch"
    }
)
Contributor:

Why do we need this argument in addition to n_samples? Aren't they kind of the same?

Collaborator (Author):

The n_samples argument captures the overall number of samples to be generated for a prompt/task, while num_return_sequences is the number of samples to be generated in a single pass.

There can be scenarios where n_samples > num_return_sequences, for example when n_samples does not fit in memory. In that case, the task/prompt is repeated over multiple passes to meet the overall n_samples (as implemented here).

For example, to calculate pass@100 I might need n_samples to be 100, but due to the memory limit I can only have num_return_sequences of 10, so the task is repeated 10 times to reach the n_samples count of 100.
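
A simplified sketch of that bookkeeping (not the harness code; generation_plan is a hypothetical helper):

import math

def generation_plan(n_samples: int, num_return_sequences: int):
    """How many generate() passes per prompt are needed to reach n_samples."""
    n_passes = math.ceil(n_samples / num_return_sequences)
    total_generated = n_passes * num_return_sequences  # may slightly overshoot n_samples
    return n_passes, total_generated

# The pass@100 example above: n_samples=100, num_return_sequences=10
assert generation_plan(100, 10) == (10, 100)  # the prompt is repeated 10 times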

Contributor:

But shouldn't the batch_size be responsible for handling the memory limitations? Can't we use it to infer num_return_sequences?

IIURC it means that n_samples=16 batch_size=16 num_return_sequences=1 is the same as n_samples=16 batch_size=1 num_return_sequences=16, right?

Collaborator (Author):

I agree that both settings are the same for the case you have shown, but I am not quite getting how we can infer num_return_sequences from batch_size. Can you please explain? Thanks.

Contributor:

IIURC batch_size is used to pick batch_size new items, so I think something like:

if batch_size < n_samples:
    # Memory requirement is the same as batch_size, but we only pick 1 new item per pass (i.e. batch_size=1)
    num_return_sequences = batch_size
    batch_size = 1
else:
    # If n_samples == 1, just pick batch_size new items; if n_samples > 1, somewhere in between
    num_return_sequences = n_samples
    # Round down, so that we always have <= batch_size items in one go
    batch_size = batch_size // num_return_sequences
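
For concreteness, tracing that snippet (if I am reading it right): n_samples=100 with batch_size=16 gives num_return_sequences=16 and batch_size=1 (one prompt per pass, 16 samples each); n_samples=4 with batch_size=16 gives num_return_sequences=4 and batch_size=4 (four prompts per pass, 4 samples each); and n_samples=1 with batch_size=16 leaves num_return_sequences=1 and batch_size=16.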

@infinitylogesh (Collaborator, Author)

Update! An update on replicating this behaviour of varying generations across batch sizes using an external repo:

I used the batch generation script from the incoder repo (as suggested by Daniel Fried on Slack) and was able to replicate this behaviour (as shown in the screenshot below; full colab here). For the same set of inputs, the generations vary based on the batch size.

So I believe this is a general behaviour and is probably expected to happen, based on my analysis in previous comments.

[screenshot: generations differing across batch sizes]

@Muennighoff (Contributor), replying to the update above:

That's very odd, does it also happen for non-code models using the in-built transformer generate function with a batch? E.g. generating with https://huggingface.co/gpt2

@infinitylogesh (Collaborator, Author)

Yes, this happens with the gpt2 model too. Please check the colab; it has an example with GPT2. This has been discussed in other issues too (issue1, issue2).

@huybery commented Sep 11, 2023

Any new progress? Everyone needs it. 😁

Successfully merging this pull request may close these issues: support for batch size > 1 for single problem generations (n_samples=1).