BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

This repository contains the information and code for BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems.

Introduction

BattleAgentBench consists of two main parts: stage design and agent interaction. For stage design, we design seven stages spanning three difficulty levels. For agent interaction, we implement the interaction between agents and the game server to support evaluation on these stages.

Next, we will introduce how to evaluate LLMs on BattleAgentBench.

Install

First, execute the following command to install the necessary dependencies.

python setup.py develop

Second, test the installation. Use the following command to start the game server; if it runs successfully, the game window will pop up.

python -m battle_city.server_sync  --turn-off-after-end --map s1

Then use the following command to start a random agent. If it runs successfully, you will see the agent start moving randomly in the game window.

python -m battle_city.examples.client_test  --sync

Playing the Game with LLMs

The game runs on a server that displays the game window and listens for agent connections. Each agent is a separate client (a tank) connected to the game server and controls its tank by sending actions to the server. The game starts once all agents have connected.
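
Conceptually, each client runs a simple observe-act loop: read the latest game state from the server, decide on an action, and send it back. The sketch below is illustrative only; the host, port, message format, and action names are assumptions rather than the repository's actual protocol, which is implemented in battle_city/examples/client.py.

import json
import random
import socket

# Illustrative only: the host, port, and line-delimited JSON protocol below are
# assumptions, not the actual BattleAgentBench wire format.
ACTIONS = ["up", "down", "left", "right", "shoot"]

with socket.create_connection(("127.0.0.1", 8888)) as conn:
    stream = conn.makefile("rw")
    while True:
        line = stream.readline()          # server sends the current game state
        if not line:
            break                         # connection closed: the game is over
        state = json.loads(line)
        action = random.choice(ACTIONS)   # an LLM agent would decide based on `state`
        stream.write(json.dumps({"action": action}) + "\n")
        stream.flush()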

First, configure the API key used to call the LLM. For OpenAI models, set the corresponding key in battle_city/examples/agent/model/gpt.py. For evaluating open-source small language models such as glm4-9b, we recommend the API provided by SiliconFlow, which offers free access to such models; set the corresponding key in battle_city/examples/agent/model/silicon.py.
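
For reference, configuring a key usually amounts to pointing an OpenAI-compatible client at the right endpoint. The snippet below is only a hedged illustration: the actual variable names in gpt.py and silicon.py may differ, and the SiliconFlow base URL should be verified against their documentation.

from openai import OpenAI

# Illustrative configuration; the actual variable names in gpt.py / silicon.py may differ.
openai_client = OpenAI(api_key="YOUR_OPENAI_KEY")

# SiliconFlow exposes an OpenAI-compatible API; the base URL below is an assumption,
# please check it against SiliconFlow's documentation.
silicon_client = OpenAI(
    api_key="YOUR_SILICONFLOW_KEY",
    base_url="https://api.siliconflow.cn/v1",
)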

Second, start the game server and client using the following two commands respectively:

python -m battle_city.server_sync  --turn-off-after-end --map s1
python -m battle_city.examples.client  --sync --model glm4-9b

All available maps are listed in basic.py, and all available models are listed in battle_city/examples/agent/model/__init__.py.

Evaluation

First, use LLMs to play the games. For convenience, we have wrapped server startup, client startup, and parameter configuration into a single script, so you only need to run one command to run the games.

bash run_batch.sh

The script defines two variables, models and scenarios, which specify the models to evaluate and the test stages, respectively. Modify these two variables according to your needs.

Second, compute the metrics. We calculate a score for each game and then aggregate the results across games to obtain the average score.

cd examples
python compute_metric.py
python compute_summary.py
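
The aggregation itself is simple: each game log is scored, and the per-game scores are averaged per model and stage. The sketch below illustrates only the averaging step; the results directory and field names are assumptions, not the actual output format of compute_metric.py.

import json
from collections import defaultdict
from pathlib import Path

# Illustrative aggregation: group per-game scores by (model, stage) and average them.
# The "results" directory and the field names are assumptions.
scores = defaultdict(list)
for path in Path("results").glob("*.json"):
    record = json.loads(path.read_text())
    scores[(record["model"], record["stage"])].append(record["score"])

for (model, stage), values in sorted(scores.items()):
    print(f"{model}\t{stage}\t{sum(values) / len(values):.2f}")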

Citation

If you find our work helpful, please kindly cite our paper:

@article{wang2024battleagentbench,
  title={BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems},
  author={Wang, Wei and Zhang, Dan and Feng, Tao and Wang, Boyan and Tang, Jie},
  journal={arXiv preprint arXiv:2408.15971},
  year={2024}
}

Acknowledgement

The game used by BattleAgentBench is based on and modified from battle-city-ai. We would like to express our sincere gratitude to the creators and contributors for their excellent work. For more information about the original game, please visit: battle-city-ai
