Add batch evaluation support when batch_size > 1 #36
base: main
Conversation
@loubnabnl @Muennighoff Please review and let me know your comments. Thanks. (I don't seem to have access to request a review.)
Thanks for this great addition Logesh, it will be very useful! I did some testing and it works as expected in most cases, but it gives low scores on multiple GPUs when n_samples = num_return_sequences, which will probably be the primary use case for this feature (when the desired n_samples already fits in memory but we want to generate problems in parallel).
One reason for the low scores was that task_ids weren't in the correct order when one GPU gets more than one task (see comment below). But even after fixing that, there are still some small discrepancies, probably not just noise, which we need to investigate. You can find a doc with the experiments here: eval-harness-batching.
We should also probably do more tests with different combinations of num_return_sequences and batch_size in case we missed something.
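For a rough illustration of the ordering problem (a hypothetical sketch, not the harness's actual gather code): with two GPUs and two tasks each, gathering per-step batches interleaves the processes, so generations come back out of task order unless they are re-sorted by task id.

# Hypothetical gathered output: 2 GPUs, 2 generation steps each.
gathered = [
    {"task_id": 0, "generation": "..."},  # gpu0, step 0
    {"task_id": 2, "generation": "..."},  # gpu1, step 0
    {"task_id": 1, "generation": "..."},  # gpu0, step 1
    {"task_id": 3, "generation": "..."},  # gpu1, step 1
]
# Re-sorting by task_id restores the order the evaluator expects.
gathered.sort(key=lambda g: g["task_id"])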
Thank you so much for the detailed review and for catching this issue. I will look into it further and update!
Co-authored-by: Loubna Ben Allal <[email protected]>
An update from further analysis: I found the following to be influencing the variations in the scores (apart from the task-id repetition issue):
I am afraid that unless this variation in the transformers repo is handled, our scores will not be stable across batch sizes. Please let me know if there is a workaround or any suggestions.
num_return_sequences: Optional[int] = field(
    default=1,
    metadata={
        "help": "The number of independently computed return sequences for each element in the batch"
    },
)
Why do we need this argument in addition to n_samples? Aren't they kind of the same?
The n_samples argument captures the overall number of samples to be generated for a prompt/task, while num_return_sequences is the number of samples generated in a single pass.
There can be scenarios where n_samples > num_return_sequences, such as when n_samples does not fit in memory. In that case, the task/prompt is repeated over multiple passes to meet the overall n_samples (as implemented here).
For example, to calculate pass@100 I might need n_samples to be 100, but due to memory limits I can only set num_return_sequences to 10, so the task is repeated 10 times to reach the n_samples count of 100.
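As a minimal sketch (a hypothetical helper, not the harness's code), the number of passes works out as:

import math

def num_passes(n_samples: int, num_return_sequences: int) -> int:
    # Each pass yields num_return_sequences completions for the prompt,
    # so the prompt is repeated until n_samples is reached.
    return math.ceil(n_samples / num_return_sequences)

print(num_passes(100, 10))  # 10 passes for pass@100 at 10 samples per pass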
But shouldn't the batch_size be responsible for handling the memory limitations? Can't we use it to infer num_return_sequences?
IIUC it means that n_samples=16 batch_size=16 num_return_sequences=1 is the same as n_samples=16 batch_size=1 num_return_sequences=16, right?
I agree that both settings are the same for the case you have shown, but I am not quite getting how we can infer num_return_sequences from batch_size. Can you please explain? Thanks.
IIUC batch_size is used to pick batch_size new items, so I think something like:

if batch_size < n_samples:
    # Memory requirement will be the same as batch_size, but we only pick
    # 1 new item per pass (i.e. effectively batch_size=1).
    num_return_sequences = batch_size
    batch_size = 1
else:
    # If n_samples == 1, just pick batch_size new items; if n_samples > 1,
    # somewhere in-between.
    num_return_sequences = n_samples
    # Round down, so that we always have <= batch_size items in one go.
    batch_size = batch_size // num_return_sequences
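For example, with batch_size=16 and n_samples=100 this would give num_return_sequences=16 and batch_size=1 (one task per pass, 16 samples at a time); with batch_size=32 and n_samples=10 it would give num_return_sequences=10 and batch_size=3, i.e. 30 sequences per pass, staying within the original memory budget of 32.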
Update! I used the batch generation script from the incoder repo (as suggested by Daniel Fried on Slack) and was able to replicate this behaviour (as shown in the screenshot below, full colab here). For the same set of inputs, the generations vary with the batch size. So I believe this is a global behaviour and is probably expected to happen, based on my analysis in the previous comments.
That's very odd, does it also happen for non-code models using the built-in transformers generate function with a batch? E.g. generating with https://huggingface.co/gpt2
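A minimal sketch of how one could check that with gpt2 (greedy decoding; the prompts are illustrative): compare a prompt generated alone against the same prompt inside a padded batch.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
tok.padding_side = "left"  # pad on the left so generation starts right after the prompt
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompts = ["def add(a, b):", "def fib(n):"]

def generate(batch):
    inputs = tok(batch, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=20, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    return [tok.decode(o, skip_special_tokens=True) for o in out]

solo = generate([prompts[0]])[0]
batched = generate(prompts)[0]
# Can print False: padding and batched kernels can change the numerics slightly.
print(solo == batched)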
Any new progress? Everyone needs it. 😁
Fixes #23