[WIP] Shadereval tasks #173

Draft · wants to merge 9 commits into base: main
1 change: 1 addition & 0 deletions README.md
@@ -39,6 +39,7 @@ Below are the features and tasks of this framework:
- `StarCoderFIM`: which uses the default FIM tokens `"<fim_prefix>", "<fim_middle>", "<fim_suffix>"`, and
- `SantaCoderFIM`: which uses SantaCoder FIM tokens `"<fim-prefix>", "<fim-middle>", "<fim-suffix>"`
- [Mercury](https://huggingface.co/datasets/Elfsong/Mercury) for evaluating computational efficiency of **Python** code generation.
- Shadereval for **GLSL** code understanding ([task1](https://huggingface.co/spaces/Vipitis/ShaderEval)) and generation ([task2](https://huggingface.co/spaces/Vipitis/shadermatch))

More details about each task can be found in the documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
## Setup
3 changes: 2 additions & 1 deletion bigcode_eval/tasks/__init__.py
@@ -5,7 +5,7 @@
concode, ds1000, gsm, humaneval, humanevalplus, humanevalpack,
instruct_humaneval, instruct_wizard_humaneval, mbpp, mbppplus,
multiple, parity, python_bugs, quixbugs, recode, santacoder_fim,
studenteval, mercury)
studenteval, mercury, shadereval)

TASK_REGISTRY = {
**apps.create_all_tasks(),
@@ -31,6 +31,7 @@
**santacoder_fim.create_all_tasks(),
"studenteval": studenteval.StudentEval,
"mercury": mercury.Mercury,
**shadereval.create_all_tasks(),
}

ALL_TASKS = sorted(list(TASK_REGISTRY))
241 changes: 241 additions & 0 deletions bigcode_eval/tasks/shadereval.py
@@ -0,0 +1,241 @@
# This template file is adapted from: https://github.com/EleutherAI/lm-evaluation-harness/blob/master/templates/new_task.py

# TODO: Remove all TODO comments once the implementation is complete.
"""
Paper-Title: Throwing Shaders at Language Models - Evaluating Creative Code Generation
TODO: Paper-URL: unavailable (unpublished)
Description: ShaderEval aims to be a suite of tasks to evaluate generative models on creative code generation, specifically GLSL shader code.
Task1 is a proof of concept and looks at code completion for return statements of Shadertoy functions, using exact_match and greedy decoding.
Homepage: https://huggingface.co/spaces/Vipitis/ShaderEval

Paper-Title: Evaluating language models for computer graphics code completion
TODO: Paper-URL: unavailable (unpublished)
Description: Function completion task for GLSL shader code. The metric statically compares the generated code and then runs it to compare rendered frames against the reference.
Homepage: https://huggingface.co/spaces/Vipitis/Shadermatch
"""
from bigcode_eval.base import Task
import evaluate
import datasets

# TODO: Add the BibTeX citation for the task.
_CITATION = """tbd
"""

def create_all_tasks():
"""assemble all tasks in a dictionary:
- task1: return completion
- task2: function generation
"""
return {
"shadereval-1": ReturnCompletion,
"shadereval-2": FunctionGeneration,
}

# TODO: Replace `NewTask` with the name of your Task.
class ReturnCompletion(Task): #Task1
# TODO: Add the `DATASET_PATH` string. This will be the name of the `Task`
# dataset as denoted in HuggingFace `datasets`.
DATASET_PATH = "Vipitis/Shadertoys-fine" # now defunct.
# TODO: Add the `DATASET_NAME` string. This is the name of a subset within
# `DATASET_PATH`. If there aren't specific subsets you need, leave this as `None`.
DATASET_NAME = "return_completion"

def __init__(self):
super().__init__(
# TODO: Specify the list of stop words in `stop_words` for the code generation task \
# and if the evaluation requires executing the generated code in `requires_execution`.
stop_words=[";"],
requires_execution=False,
)

def get_dataset(self):
# TODO: retrieve the evaluation subset from the loaded dataset (e.g. `self.dataset["test"]`)
"""Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
return self.dataset["test"]

def fewshot_examples(self):
# TODO: load few-shot examples (from lm_eval/tasks/fewshot_examples) if they exist
"""Loads and returns the few-shot examples for the task if they exist."""
pass

def get_prompt(self, doc):
# TODO: build the prompt for the language model from a sample `doc` from the dataset
"""
Builds the prompt for the LM to generate from.
:param doc: dict[str: str]
sample from the test dataset
:return: str
"""
return doc["body"]

def get_reference(self, doc):
# TODO: get the reference solution from a sample `doc` from the dataset
"""
Builds the reference solution for the doc (sample from the test dataset).
:param doc: dict[str: str]
sample from the test dataset
:return: str
"""
return doc["return_statement"].split(";")[0].strip()

def postprocess_generation(self, generation, idx):
# TODO: define the postprocessing for the LM generation
"""
Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int (if needed)
index of doc in the dataset to which the generation belongs
:return: str
"""
generation = generation.split("return")[1] # keep only the text after the first "return" keyword
return generation.split(";")[0].strip()

def process_results(self, generations, references):
# TODO: define how the evaluation score is computed from list of \
# generations and reference solutions
"""
Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations as in {"metric_name": result}.
We encourage directly loading the metric from the `evaluate` library to keep the code concise.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
list of str containing references
:return: dict[str: float]
"""
exact_match = evaluate.load("exact_match")
generations = [
generation[0] for generation in generations
] # unpack the inner list (one generation per prompt, zero-shot)
return exact_match.compute(predictions=generations, references=references)



# TODO: Replace `NewTask` with the name of your Task.
class FunctionGeneration(Task): #task2
DATASET_PATH = "Vipitis/Shadereval-experiments-dev" #as a temporary solution to reduce current problems

# `DATASET_PATH`. If there aren't specific subsets you need, leave this as `None`.
DATASET_NAME = None #this will eventually be a subset for the Shadertoys dataset, but not right now

def __init__(self):
super().__init__(
# TODO: Specify the list of stop words in `stop_words` for the code generation task \
# and if the evaluation requires executing the generated code in `requires_execution`.
# stop_words=["\nfloat ", "\nvec", "\nint", "\nvoid", "\nmat"], #new function starts... so all the keywords
# TODO: stopwords can cause incorrect early stopping, so we don't edn up using them. I am considering using guided generation with tree-sitter to do early stopping.
stop_words=[], #set it's to Falsy?
requires_execution=True, #we run shadercode - could that be harmful? (all in the metric)
)
self._metric = evaluate.load("Vipitis/shadermatch") #load the metric from the evaluate library

def get_dataset(self):
# TODO replace with subset once that is set up
return self.dataset["test"]

def fewshot_examples(self):
# TODO: load few-shot examples (from lm_eval/tasks/fewshot_examples) if they exist
"""Loads and returns the few-shot examples for the task if they exist."""
pass

def get_prompt(self, doc):
# TODO: build the prompt for the language model from a sample `doc` from the dataset
"""
:param doc: dict[str: str]
sample from the test dataset
:return: str
"""
return doc["model_inp"]

def get_reference(self, doc):
# TODO: get the reference solution from a sample `doc` from the dataset
"""
Builds the reference solution for the doc (sample from the test dataset).
:param doc: dict[str: str]
sample from the test dataset
:return: str
"""
return doc["image_code"] #returns full original code

def remove_last_block(self, code):
"""
Adapted from https://github.com/bigcode-project/bigcode-evaluation-harness/blob/be2a44c2faa29c20b5041d7083acb698eb373309/bigcode_eval/tasks/humanevalpack.py#L275C5-L311C20
"""
# TODO: can be removed
for w in self.stop_words:
if w in code:
code = code[:code.find(w)]

### Find the first position where the chain of { } is closed
open_brackets = 1
cut = False
for i, c in enumerate(code.encode("utf-8")):
c = chr(c)
if c == '{':
open_brackets += 1
elif c == '}':
open_brackets -= 1
if open_brackets == 0:
code = code.encode("utf-8")[:i+1].decode("utf-8", "ignore")
cut = True
break
if not cut:
if '}' in code:
code = code[:code.rfind('}')] + '}'
else:
code = code + "// incomplete generation! \n"
return code

def postprocess_generation(self, generation, idx):
# TODO: define the postprocessing for the LM generation
"""
Defines the postprocessing for a LM generation.
:param generation: str
code generation from LM
:param idx: int (if needed)
index of doc in the dataset to which the generation belongs
:return: str
"""

row = self.dataset["test"][idx]
truncated = self._metric.truncate_generation(model_inp="", generation=generation)
# TODO: the metric methods will rename their args to be more broadly usable... maybe even refactor the bit at the top.
altered = self._metric.replace_body(ref_code=row["image_code"], altered_body=truncated, end_header_byte=row["func_bytes"][0], end_function_byte=row["func_bytes"][4])
return altered

# TODO: remove the old code
# assemble into the full code with just the function replaced
ref = self.dataset["test"][idx]
model_ctx = ref["model_ctx"]
full_code = ref["full_code"]
start, end = ref["func_range"]
before_gen = full_code.encode("utf-8")[:start].decode("utf-8")
after_gen = full_code.encode("utf-8")[end:].decode("utf-8")

if self.prompt == "full":
gen = self.remove_last_block(generation.encode("utf-8")[start + len(model_ctx.encode("utf-8")):].decode("utf-8"))
else:
gen = self.remove_last_block(generation.encode("utf-8")[len(model_ctx.encode("utf-8")):].decode("utf-8")) #remove last block to avoid syntax errors
return before_gen + model_ctx + gen + after_gen #does this patch it together correctly?

def process_results(self, generations, references):
# TODO: define how the evaluation score is computed from list of \
# generations and reference solutions
"""
Takes the list of LM generations and evaluates them against ground truth references,
returning the metric for the generations as in {"metric_name": result}.
We encourage directly loading the metric from the `evaluate` library to keep the code concise.
:param generations: list(list(str))
list of lists containing generations
:param references: list(str)
list of str containing references
:return: dict[str: float]
"""
# shadermatch = evaluate.load("Vipitis/shadermatch")
generations = [
generation[0] for generation in generations
] # unpack the inner list (one generation per prompt, zero-shot)
results = self._metric.compute(predictions=generations, references=references)
# this also includes a list of all individual labels (in order).
return results
42 changes: 42 additions & 0 deletions docs/README.md
@@ -426,6 +426,48 @@ accelerate launch main.py \
--metric_output_path <MODEL_NAME>.json
```

### Shadereval
[Shadereval](tbd.) explores "creative" code generation. Fragment GLSL shaders are sourced from Shadertoy.com and curated into the [Shadertoys](https://github.com/Vipitis/shadertoys-dataset) dataset. The task-specific datasets are built from the Shadertoys dataset.

Task-1: **ReturnCompletion** provides a function header and body, and the model generates a matching return statement. Generations are evaluated with `exact_match`, so no code execution is required. The original publication uses greedy decoding and only 300 samples. The dataset is now private due to a takedown notice; please contact the author for access. This is only meant as a prototype task. A minimal sketch of the postprocessing and scoring is shown after the command below.

```bash
accelerate launch main.py \
--model <MODEL_NAME> \
--tasks shadereval-1 \
--n_samples 300 \
--do_sample False
```
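
The scoring for Task-1 can also be reproduced outside the harness. The following is a minimal sketch, assuming the same postprocessing as the task code (split on the first `return`, cut at the first `;`); the toy generation and reference strings are purely illustrative.

```python
import evaluate


def trim_to_return_expression(generation: str) -> str:
    # Keep only the expression of the first return statement,
    # mirroring the postprocessing used by shadereval-1.
    expression = generation.split("return")[1]
    return expression.split(";")[0].strip()


# Toy example (illustrative, not taken from the dataset):
generations = ["    return fract(sin(x) * 43758.5453);\n}"]
references = ["fract(sin(x) * 43758.5453)"]

exact_match = evaluate.load("exact_match")
predictions = [trim_to_return_expression(g) for g in generations]
print(exact_match.compute(predictions=predictions, references=references))
# expected: {'exact_match': 1.0}
```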

Task-2: **FunctionGeneration** uses the comments directly before a function header, together with the header itself, as model input. The model is expected to generate a complete function that is syntactically sound. Generated functions are inserted into the original shader program for evaluation. A custom metric is hosted in the [demo space](https://huggingface.co/spaces/Vipitis/shadermatch), which renders frames to compare. This requires the additional dependencies [wgpu-shadertoy](https://github.com/pygfx/shadertoy) and [tree-sitter-glsl](https://github.com/tree-sitter-grammars/tree-sitter-glsl). It is recommended to save generations first and evaluate them later; a sketch of the postprocessing step follows the two commands below.
The original publication greedily decodes 512 tokens at BF16. Custom sampling strategies and additional prompts can be attempted, as long as they are accurately reported alongside the claimed results.

```bash
accelerate launch main.py \
--model <MODEL_NAME> \
--tasks shadereval-2 \
--generation_only \
--save_generations \
--save_generations_path "saved_generations.json" \
--allow_code_execution \
--do_sample False \
--precision bf16 \
--max_length_generation 512
```

To evaluate the saved generations later, run the following command:

```bash
accelerate launch main.py \
--model <MODEL_NAME> \
--tasks shadereval-2 \
--load_generations_path "saved_generations.json" \
--allow_code_execution \
--metric_output_path "eval_results.json" \
--precision bf16 \
--max_length_generation 512
```
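
For reference, the postprocessing of a single Task-2 generation roughly follows the pattern below. This is a sketch based on the task code in this pull request; the metric's `truncate_generation` and `replace_body` helpers are used as in `postprocess_generation` above, but their argument names may still change, and the sample row is illustrative.

```python
import evaluate

# Loading the metric requires wgpu-shadertoy and tree-sitter-glsl.
metric = evaluate.load("Vipitis/shadermatch")


def assemble_full_shader(row: dict, generation: str) -> str:
    # Cut the raw generation down to a single function body...
    truncated = metric.truncate_generation(model_inp="", generation=generation)
    # ...and splice it back into the reference shader at the recorded byte offsets.
    return metric.replace_body(
        ref_code=row["image_code"],
        altered_body=truncated,
        end_header_byte=row["func_bytes"][0],
        end_function_byte=row["func_bytes"][4],
    )


# predictions and references are full shader programs; the metric compiles and
# renders frames to compare them:
# results = metric.compute(predictions=[assembled], references=[row["image_code"]])
```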

## Code generation benchmarks without unit tests

For these tasks, we do single generations and compare the generated code against reference solutions and compute BLEU score. For the following tasks, we use a two-shot setting where we include 2 inputs and their solutions in the prompt, all preceded by an instruction such as: ` "Answer the following instructions in a one line SQL query:\n"`. The solutions consist of one line so we stop the generation when a new line is generated. 3 languages are present: Python, SQL and Java.