LLM Evaluation

Introduction

This library aims to provide a scalable interface for evaluating any large language model on any dataset. It currently supports evaluating Gemini Pro on HumanEval.

The library contains two wrappers: models.py wraps large language models from different providers, and evaluation.py wraps the different evaluations and benchmarks.
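To illustrate the two-wrapper design, here is a minimal sketch of how a provider-agnostic model interface and a benchmark interface could fit together. Every name in it (LanguageModel, Benchmark, EchoModel, ExactMatchBenchmark, run) is hypothetical and chosen for illustration; the actual classes in models.py and evaluation.py may look quite different.

```python
from abc import ABC, abstractmethod


class LanguageModel(ABC):
    """Hypothetical provider-agnostic model wrapper (not the real models.py API)."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return one completion for the given prompt."""


class Benchmark(ABC):
    """Hypothetical benchmark wrapper (not the real evaluation.py API)."""

    @abstractmethod
    def prompts(self) -> dict[str, str]:
        """Map each task id to the prompt sent to the model."""

    @abstractmethod
    def score(self, predictions: dict[str, str]) -> float:
        """Return an aggregate score for the predictions, keyed by task id."""


class EchoModel(LanguageModel):
    """Toy stand-in for a real provider: upper-cases the prompt."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


class ExactMatchBenchmark(Benchmark):
    """Toy benchmark: a prediction is correct if it equals the expected string."""

    def __init__(self, examples: dict[str, tuple[str, str]]):
        self.examples = examples  # task id -> (prompt, expected output)

    def prompts(self) -> dict[str, str]:
        return {task_id: prompt for task_id, (prompt, _) in self.examples.items()}

    def score(self, predictions: dict[str, str]) -> float:
        correct = sum(
            predictions[task_id] == expected
            for task_id, (_, expected) in self.examples.items()
        )
        return correct / len(self.examples)


def run(model: LanguageModel, benchmark: Benchmark) -> float:
    """Query the model on every benchmark prompt, then score the predictions."""
    predictions = {
        task_id: model.generate(prompt)
        for task_id, prompt in benchmark.prompts().items()
    }
    return benchmark.score(predictions)
```

With this shape, supporting a new provider or dataset means adding one subclass, and run() stays unchanged.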

Results

Predictions and evaluation results for Gemini Pro on HumanEval are available in results/. Gemini Pro achieves an overall pass@1 of 54.268%.
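For readers unfamiliar with the metric, the sketch below shows the unbiased pass@k estimator commonly used with HumanEval, where n samples are drawn per problem and c of them pass the unit tests. The function names are illustrative and this is not necessarily how evaluation.py computes the score. The per-problem breakdown in the example is an assumption: with one sample per problem, 89 passing problems out of HumanEval's 164 would be consistent with the reported figure.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failing samples, so any k-subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n samples, c correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Assumed breakdown: one sample per problem, 89 of 164 problems solved.
results = [(1, 1)] * 89 + [(1, 0)] * 75
score = 100 * mean_pass_at_k(results, k=1)
print(f"pass@1 = {score:.3f}")
```

With a single sample per problem, pass@1 reduces to the fraction of problems solved.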

Instructions

Setup

Create a new conda environment and install required libraries:

conda create -n llm-eval python=3.10
conda activate llm-eval
pip install -r requirements.txt

Evaluation

Use the following command for evaluation:

python main.py --model {model_name} --dataset {dataset_name} --key {api_key} --data_path {data_path} --out_path {output_path} --n {number_samples}

Example command for evaluating Gemini Pro on HumanEval:

python main.py --model gemini-pro --dataset humaneval --key {api_key}
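The flags in the commands above could be parsed along the following lines. This is a sketch, not the actual contents of main.py: the flag names come from the commands, but which flags are required, their defaults, and the help texts are assumptions, inferred only from the example command omitting --data_path, --out_path, and --n.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the README commands."""
    parser = argparse.ArgumentParser(description="Evaluate an LLM on a benchmark.")
    parser.add_argument("--model", required=True, help="model name, e.g. gemini-pro")
    parser.add_argument("--dataset", required=True, help="dataset name, e.g. humaneval")
    parser.add_argument("--key", required=True, help="API key for the model provider")
    # The defaults below are assumptions; the real defaults are not documented.
    parser.add_argument("--data_path", default=None, help="path to the dataset files")
    parser.add_argument("--out_path", default=None, help="where to write the results")
    parser.add_argument("--n", type=int, default=1, help="number of samples")
    return parser


# Parse the example command's arguments (the API key is a placeholder).
args = build_parser().parse_args(
    ["--model", "gemini-pro", "--dataset", "humaneval", "--key", "YOUR_API_KEY"]
)
print(args.model, args.dataset, args.n)
```

Passing the argument list explicitly to parse_args keeps the sketch runnable without a command line.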

devanshrj/llm-eval

Folders and files

Latest commit

History

Repository files navigation

LLM Evaluation

Introduction

Results

Instructions

Setup

Evaluation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages