
Add batch evaluation support when batch_size > 1 #36

Open · wants to merge 13 commits into base: main
3 changes: 2 additions & 1 deletion README.md
@@ -68,14 +68,15 @@ accelerate launch main.py \
--temperature <TEMPERATURE> \
--do_sample True \
--n_samples 100 \
--num_return_sequences 20 \
--batch_size 10 \
--allow_code_execution=False
```
* `limit` represents the number of problems to solve, if it's not provided all problems in the benchmark are selected.
* `allow_code_execution` is for executing the generated code: read the displayed warning before setting it to `True`.

Some tasks don't require code execution such as
`codexglue_code_to_text-<LANGUAGE>`/`codexglue_code_to_text-python-left`/`conala`/`concode` that use BLEU evaluation. In addition, we generate one candidate solution for each problem in these tasks, so use `n_samples=1` and `batch_size=1`. (Note that `batch_size` should always be equal or less than `n_samples`).
`codexglue_code_to_text-<LANGUAGE>`/`codexglue_code_to_text-python-left`/`conala`/`concode` that use BLEU evaluation. In addition, we generate one candidate solution for each problem in these tasks, so use `n_samples=1` and `num_return_sequences=1`. (Note that `num_return_sequences` should always be less than or equal to `n_samples`.)
* For APPS tasks, you can use `n_samples=1` for strict and average accuracies (from the original APPS paper) and `n_samples>1` for pass@k.

### Generation only
4 changes: 3 additions & 1 deletion docs/README.md
@@ -43,6 +43,7 @@ accelerate launch main.py \
--tasks humaneval \
--temperature 0.2 \
--n_samples 200 \
--num_return_sequences 20 \
--batch_size 10 \
--allow_code_execution=False
```
@@ -70,6 +71,7 @@ accelerate launch main.py \
--tasks mbpp \
--temperature 0.1 \
--n_samples 15 \
--num_return_sequences 15 \
--batch_size 10 \
--allow_code_execution=False \
```
@@ -139,7 +141,7 @@ accelerate launch main.py \
--tasks apps-introductory \
--n_samples 1 \
--temperature 0.1 \
--batch_size 1 \
--batch_size 5 \
--allow_code_execution=False
```
We expect a model [finetuned](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/finetuning/APPS) on the train split of APPS.
6 changes: 6 additions & 0 deletions lm_eval/arguments.py
@@ -41,3 +41,9 @@ class EvalArguments:
seed: Optional[int] = field(
default=0, metadata={"help": "Random seed used for evaluation."}
)
num_return_sequences: Optional[int] = field(
default=1,
metadata={
"help":"The number of independently computed return sequences for each element in the batch"
}
)
Comment on lines +44 to +49

Contributor:
Why do we need this argument in addition to n_samples? Aren't they kind of the same?

Collaborator (Author):

The n_samples argument captures the overall number of samples to be generated per prompt/task, while num_return_sequences is the number of samples generated in one single pass.

There can be scenarios where n_samples > num_return_sequences, for example when n_samples completions do not fit in memory. In that case, the task/prompt is repeated (multiple passes) to reach the overall n_samples (as implemented here).

For example, to calculate pass@100 I might need n_samples to be 100, and due to memory limits I can set num_return_sequences to 10, so the task is repeated 10 times to reach the n_samples count of 100.
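To make the arithmetic concrete, here is a minimal sketch of the relationship described above (plain Python; variable names are illustrative rather than copied from the harness):

```python
# Each prompt is queued n_copies times; every pass through generate() yields
# num_return_sequences completions, so the totals line up with n_samples.
n_samples = 100            # completions wanted per prompt, e.g. for pass@100
num_return_sequences = 10  # completions that fit in memory per pass

n_copies = n_samples // num_return_sequences  # 10 passes over the same prompt
assert n_copies * num_return_sequences == n_samples
```

This mirrors the `n_copies = args.n_samples // args.num_return_sequences` line in the `lm_eval/generation.py` diff below.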

Contributor:

But shouldn't the batch_size be responsible for handling the memory limitations? Can't we use it to infer num_return_sequences?

IIURC it means that n_samples=16 batch_size=16 num_return_sequences=1 is the same as n_samples=16 batch_size=1 num_return_sequences=16, right?

Collaborator (Author):

I agree that both settings are the same for the case you have shown, but I am not quite getting how we can infer num_return_sequences from batch_size. Can you please explain? Thanks

Contributor:

IIURC batch_size is used to pick batch_size new items, so I think sth like:

if batch_size < n_samples:
    # Memory requirement will be the same as batch_size, but we only pick 1 new item (i.e. batch_size=1)
    num_return_sequences = batch_size
    batch_size = 1
else:
    # If n_samples == 1, just pick batch_size new items; if n_samples > 1, somewhere in-between
    num_return_sequences = n_samples
    # Round down, such that we always have <= batch_size items in one go
    batch_size = batch_size // num_return_sequences
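
For what it's worth, a hedged sketch of the same inference wrapped into a helper (the function name and return shape are illustrative, not an existing API in the harness):

```python
def infer_generation_settings(n_samples: int, batch_size: int) -> dict:
    """Sketch of deriving num_return_sequences from batch_size as proposed above."""
    if batch_size < n_samples:
        # Memory stays at batch_size sequences, but only one new prompt per step.
        return {"num_return_sequences": batch_size, "batch_size": 1}
    # Otherwise each prompt gets all n_samples in one pass; round down so the
    # effective number of sequences per step never exceeds batch_size.
    return {"num_return_sequences": n_samples, "batch_size": batch_size // n_samples}

# A couple of concrete inputs:
print(infer_generation_settings(n_samples=100, batch_size=10))  # {'num_return_sequences': 10, 'batch_size': 1}
print(infer_generation_settings(n_samples=1, batch_size=10))    # {'num_return_sequences': 1, 'batch_size': 10}
```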

6 changes: 3 additions & 3 deletions lm_eval/generation.py
@@ -62,7 +62,7 @@ def parallel_generations(task, dataset, accelerator, model, tokenizer, n_tasks,

if accelerator.is_main_process:
print(f"number of problems for this task is {n_tasks}")
n_copies = args.n_samples // args.batch_size
n_copies = args.n_samples // args.num_return_sequences

ds_tokenized = TokenizedDataset(
task,
@@ -76,7 +76,7 @@ def parallel_generations(task, dataset, accelerator, model, tokenizer, n_tasks,
)

# do not confuse args.batch_size, which is actually the num_return_sequences
ds_loader = DataLoader(ds_tokenized, batch_size=1)
ds_loader = DataLoader(ds_tokenized, batch_size=args.batch_size)

model, ds_loader = accelerator.prepare(model, ds_loader)
generations = complete_code(
@@ -86,7 +86,7 @@ def parallel_generations(task, dataset, accelerator, model, tokenizer, n_tasks,
tokenizer,
ds_loader,
n_tasks=n_tasks,
batch_size=args.batch_size,
num_return_sequences=args.num_return_sequences,
prefix=args.prefix,
postprocess=args.postprocess,
**gen_kwargs,
19 changes: 13 additions & 6 deletions lm_eval/utils.py
@@ -49,7 +49,7 @@ def __iter__(self):
if self.n_copies == 1 and self.n_tasks % self.num_devices != 0:
self.n_copies = 2
warnings.warn(
"n_copies (n_samples/batch_size) was changed from 1 to 2 because n_tasks isn't proportional to num devices"
"n_copies (n_samples/num_return_sequences) was changed from 1 to 2 because n_tasks isn't proportional to num devices"
)

for sample in range(self.n_tasks):
@@ -58,6 +58,7 @@ def __iter__(self):
"ids": outputs.input_ids[sample],
"task_id": sample,
"input_len": outputs.attention_mask[sample].sum(),
"attention_mask": outputs.attention_mask[sample],
}


@@ -68,7 +69,7 @@ def complete_code(
tokenizer,
dataloader,
n_tasks,
batch_size=20,
num_return_sequences=20,
prefix="",
postprocess=True,
**gen_kwargs,
@@ -84,13 +85,19 @@
with torch.no_grad():
if task.stop_words:
gen_kwargs["stopping_criteria"][0].start_length = batch["ids"].shape[-1]

if batch["ids"].shape[0]==1:
batch["ids"] = batch["ids"][:,:batch["input_len"]]
batch["attention_mask"] = batch["attention_mask"][:,:batch["input_len"]]

generated_tokens = accelerator.unwrap_model(model).generate(
input_ids=batch["ids"][:, : batch["input_len"]],
num_return_sequences=batch_size,
input_ids=batch["ids"],
attention_mask=batch["attention_mask"],
num_return_sequences=num_return_sequences,
**gen_kwargs,
)
# each task is generated batch_size times
generated_tasks = batch["task_id"].repeat(batch_size)
# each task is generated num_return_sequences times
generated_tasks = batch["task_id"].repeat(num_return_sequences)
generated_tokens = accelerator.pad_across_processes(
generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
)
9 changes: 9 additions & 0 deletions main.py
@@ -151,6 +151,15 @@ def main():
print("bos_token used as eos_token")
else:
raise ValueError("No eos_token or bos_token found")

if args.n_samples < args.num_return_sequences:
raise ValueError("n_samples should always be equal or greater than num_return_sequences ")

# When padding_side = "right", padding tokens are considered during decoding,
# so we set it to "left" to ignore padding tokens while decoding, as per
# https://github.com/huggingface/transformers/pull/7552
if args.batch_size > 1:
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
evaluator = Evaluator(accelerator, model, tokenizer, args)

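The padding change above follows huggingface/transformers#7552. As a standalone illustration (not part of this PR; the checkpoint name is only an example), left padding keeps the prompt flush against the newly generated tokens for decoder-only models:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Decoder-only models continue from the last input token, so pad on the left
# and pass the attention mask so pad tokens are ignored during generation.
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

prompts = ["def add(a, b):", "def fibonacci(n):"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

out = model.generate(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    max_new_tokens=32,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```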
22 changes: 22 additions & 0 deletions tests/test_generation_evaluation.py
@@ -86,3 +86,25 @@ def test_evaluation():
results = evaluator.evaluate(task)
assert results == {"pass@1": 0.25}
print("passed eval")

def test_multi_batch_generation():
args.n_samples = 1
args.batch_size = 2
args.limit = 2
args.do_sample = False
args.generation_only = True
args.generations_path = None
# Increasing the max_length to accommodate pad tokens
# in the final generation
args.max_length_generation=356
tokenizer.padding_side = "left"
evaluator = Evaluator(accelerator, model, tokenizer, args)
for task in TASKS:
print(f"testing task {task}")
generations, references = evaluator.generate_text(task)
true_gens, true_refs = load_generation_examples(task)
# capping the generation to the max length of true gens
for idx,tg in enumerate(true_gens):
generations[idx][0] = generations[idx][0][:len(tg[0])]
assert generations == true_gens
assert references == true_refs