Evaluating LLMs' Cognitive Behavioral Reasoning for Cybersecurity




Overview

Introduction to Interdependent Cybersecurity

Interdependent cybersecurity addresses the complexities and interconnectedness of various systems, emphasizing the need for collaborative and holistic approaches to mitigate risks. This field focuses on how different components, from technology to human factors, influence each other, creating a web of dependencies that must be managed to ensure robust security.

Background and Rationale

Despite significant investments in cybersecurity, many organizations struggle to manage cybersecurity risks effectively due to the increasing complexity and interdependence of their systems. Notably, human factors account for half of the long-lasting challenges in interdependent cybersecurity. Agent-based modeling powered by large language models emerges as a promising solution because it excels at capturing individual characteristics, allowing micro-level agent behaviors to collectively generate emergent macro-level structures. Evaluating LLMs in this context is crucial for legal compliance and effective application development. However, traditional evaluation frameworks for large language models often neglect the human factors and cognitive computing capabilities essential for interdependent cybersecurity. The OllaBench paper introduces a novel evaluation framework designed to fill this gap by assessing LLMs on their ability to reason about human-centric interdependent cybersecurity scenarios, thereby enhancing their application in interdependent cybersecurity threat modeling and risk management.

Main Conclusions

Here we show that OllaBench effectively evaluates the accuracy, wastefulness, and consistency of LLMs in answering scenario-based information-security compliance/non-compliance questions. The results indicate that while commercial LLMs perform best overall, smaller open-weight models also show promising capabilities. The most accurate models are not the most efficient in terms of tokens spent on wrong answers, which unnecessarily increases the cost of adopting them. Finally, the best-performing models are not only accurate but also consistent in the way they answer questions.
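
For concreteness, the three measures could be computed roughly as follows. This is a minimal sketch, not OllaBench's actual implementation: the is_correct, tokens_used, question_id, and answer record fields, and the repeated-answer reading of consistency, are assumptions.

# Minimal sketch of accuracy, wastefulness, and consistency scoring.
# All field names are hypothetical; OllaBench's real schema may differ.
from typing import Dict, List

def score(records: List[Dict]) -> Dict[str, float]:
    total = len(records)
    correct = sum(1 for r in records if r["is_correct"])
    # Wastefulness: tokens spent producing wrong answers.
    wasted = sum(r["tokens_used"] for r in records if not r["is_correct"])
    # Consistency: fraction of questions whose repeated answers all agree.
    answers_by_question: Dict[str, set] = {}
    for r in records:
        answers_by_question.setdefault(r["question_id"], set()).add(r["answer"])
    consistent = sum(1 for a in answers_by_question.values() if len(a) == 1)
    return {
        "accuracy": correct / total,
        "wasted_tokens": float(wasted),
        "consistency": consistent / len(answers_by_question),
    }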

Context and Impact

The findings from OllaBench highlight the opportunities and the importance of fine-tuning existing large language models to address human factors in interdependent cybersecurity. By providing a comprehensive tool for assessing LLM performance in human-centric, complex, interdependent cybersecurity scenarios, this work advances the field by closing the gaps in evaluating large language models in deeply interdisciplinary areas such as human-centric interdependent cybersecurity threat modeling and risk management. The findings also contribute to the development of more reliable and effective cybersecurity systems, ultimately enhancing organizational resilience against evolving cyber threats.

IMPORTANT
The dataset generator and test datasets are in the OllaGen1 subfolder.
You need either a local LLM stack (NVIDIA TensorRT-LLM with Llama_Index in my case) or an OpenAI API key to generate new OllaBench datasets.
OpenAI throttles requests per minute, which may cause significant delays when generating big datasets.
When the OllaBench white paper is published (later in March), the OllaBench benchmark scripts and leaderboard results will be made available.

(Diagram: OllaBench-Flows)

Quick Start

Evaluate with your own code

You can grab the evaluation datasets to run with your own evaluation code. Note that the datasets (CSV files) are for zero-shot evaluation. It is recommended that you modify the OllaBench Generator 1 (OllaGen1) params.json with your desired specs and run OllaGen1.py to generate fresh, UNSEEN datasets that match your custom needs. Check the OllaGen-1 README for more details.
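
For orientation, a zero-shot evaluation loop over one of the CSVs might look like the following minimal sketch. The file name, the question/answer column names, and the ask_model function are hypothetical placeholders; check the OllaGen-1 README for the actual dataset schema.

# Minimal sketch of a zero-shot evaluation over an OllaGen1 CSV file.
# File and column names below are placeholders, not the real schema.
import csv

def ask_model(prompt: str) -> str:
    """Replace with a call to your own model."""
    raise NotImplementedError

correct = 0
total = 0
with open("OllaGen1-dataset.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        total += 1
        # Zero-shot: the question is sent as-is, with no worked examples.
        prediction = ask_model(row["question"])
        if prediction.strip().lower() == row["answer"].strip().lower():
            correct += 1

print(f"accuracy: {correct / total:.3f}")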

Use OllaBench

OllaBench will evaluate your models within the Ollama model zoo using the OllaGen1 default datasets. You can quickly spin up Ollama with Docker Desktop/Compose and download LLMs into Ollama. Please check the Installation section below for more details.
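
Once Ollama is up, any pulled model can be queried over its local REST API, which is the interface a harness like OllaBench builds on. Here is a minimal sketch using only the Python standard library (the model name llama2 assumes you have pulled that model; the /api/generate endpoint and default port are Ollama's documented defaults):

# Send one prompt to a model served by a local Ollama instance.
import json
import urllib.request

payload = {
    "model": "llama2",  # any model you have pulled into Ollama
    "prompt": "Will this employee comply with the security policy? Answer yes or no.",
    "stream": False,    # return the full response in one JSON object
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])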

Tested System Settings

The following system settings have been tested successfully for running the OllaGen1 dataset generator and OllaBench.

  • Primary generative AI model: Llama2
  • Python version: 3.10
  • Windows version: 11
  • GPU: NVIDIA GeForce RTX 3080 Ti
  • Minimum RAM: [your normal RAM use] + [the size of your intended model] (see the worked example after this list)
  • Disk space: [your normal disk use] + [minimum software requirements] + [the size of your intended model]
  • Minimum software requirements: NVIDIA CUDA 12 (NVIDIA CUDA Toolkit), Microsoft MPI, MSVC compiler, llama_index
  • Additional system requirements: Docker Compose and other related Docker requirements if you use the Docker stack
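
As a worked example (with assumed, but typical, numbers): the default Llama2 7B model at 4-bit quantization occupies roughly 4 GB, so a machine that normally has 8 GB of RAM in use should budget at least 12 GB of RAM, and disk space should allow that same ~4 GB for the model on top of normal use and the CUDA/MPI/compiler stack.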

Quick Install of Key Components

This quick install is for a single Windows PC use case (without Docker) and for when you need to use OllaGen1 to generate your own datasets. I assume you have an NVIDIA GPU installed.

  • Go to TensorRT-LLM for Windows and follow the Quick Start section to install TensorRT-LLM and its prerequisites.
  • If you plan to use OllaGen1 with a local LLM, go to Llama_Index for TensorRT-LLM and follow the instructions to install Llama_Index and prepare models for TensorRT-LLM.
  • If you plan to use OllaGen1 with OpenAI, please follow OpenAI's instructions to add the API key to your system environment. You will also need to change the llm_framework param in the OllaGen1 params.json to openai (see the snippet after this list).
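
For illustration, the relevant fragment of params.json would then contain the following. Only the llm_framework key is confirmed by this README; the real file holds additional generation parameters.

{
  "llm_framework": "openai"
}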

Commands to check for key software requirements

Python
python -V
NVIDIA CUDA 12
nvcc -V
Microsoft MPI
mpiexec -help

Installation

The following instructions are mainly for the Docker use case.

Windows Linux Subsystem

If you are using Windows, you need to install WSL. The Windows Subsystem for Linux (WSL) is a compatibility layer introduced by Microsoft that enables users to run Linux binary executables natively on Windows 10, Windows Server 2019, and later versions. WSL provides a Linux-compatible kernel interface developed by Microsoft, which can then run a Linux distribution on top of it. See here for information on how to install it. In this setup, we use Debian Linux. You can verify that Linux was installed by executing wsl -l -v. You enter WSL by executing the command wsl from a Windows command-line window.

Please disregard if you are using a Linux system.

Nvidia Container Toolkit

The NVIDIA Container Toolkit is a powerful set of tools that allows users to build and run GPU-accelerated Docker containers. It leverages NVIDIA GPUs to enable the deployment of containers that require access to NVIDIA graphics processing units for computing tasks. This toolkit is particularly useful for applications in data science, machine learning, and deep learning, where GPU resources are critical for processing large datasets and performing complex computations efficiently. Installation instructions are here.
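
To verify the toolkit works end to end, NVIDIA's documentation suggests running a throwaway test container along these lines (assuming Docker is already installed; it should print your GPU table):

docker run --rm --gpus all ubuntu nvidia-smi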

Please disregard if your computer does not have a GPU.

NVIDIA TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.

  • LlamaIndex Tutorial on Installing TensorRT-LLM
  • TensorRT-LLM Github page

Ollama

  • Install Docker Desktop and Ollama with these instructions.
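
For reference, at the time of writing Ollama's Docker instructions amount to starting the server container and pulling a model, roughly as follows (command details may change, so follow the linked instructions):

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama pull llama2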

Run OllaGen-1

Please go to the OllaGen1 subfolder and follow the instructions to generate the evaluation datasets.