This Python script validates and analyzes the threat scenario spreadsheet, whose entries were provided by experts from CSA's AI Control Working group. It processes the input data, fetches related web content, and uses the Claude API to validate and suggest corrections for threat categories, life cycle stages, assets, and impacts.
- Reads input data from an Excel file
- Fetches and processes web content (both HTML and PDF)
- Uses the Claude API to validate threat categories and impacts
- Writes results to an Excel file and a text log file
- Implements rate limiting for API calls
- Python 3.x
- Required Python packages: pandas, requests, PyPDF2, beautifulsoup4, python-dotenv, anthropic, langdetect
- Install the required packages:
  pip install pandas requests PyPDF2 beautifulsoup4 python-dotenv anthropic langdetect
- Create a .env file in the same directory as the script and add your Claude API key:
  CLAUDE_API_KEY=your_api_key_here
- Prepare an input Excel file named input.xlsx with the required columns. A sample input.xlsx is in the same working directory.
- Run the script:
  python main.py
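As a rough illustration of how the key from the .env file might be read at startup, here is a minimal sketch. The function names get_api_key and load_key_from_dotenv are hypothetical and do not come from the script; only the CLAUDE_API_KEY variable name and the use of python-dotenv are taken from the setup steps above.

```python
import os


def get_api_key(environ=None):
    """Return the Claude API key, or raise if it is missing or a placeholder."""
    environ = os.environ if environ is None else environ
    key = environ.get("CLAUDE_API_KEY", "").strip()
    if not key or key == "your_api_key_here":
        raise RuntimeError("CLAUDE_API_KEY is not set; add it to the .env file")
    return key


def load_key_from_dotenv():
    """Load .env from the current directory, then read the key (hypothetical helper)."""
    # Imported lazily so get_api_key() can be used without python-dotenv installed.
    from dotenv import load_dotenv

    # Copies KEY=value pairs from .env into os.environ; variables that are
    # already set in the environment are left unchanged by default.
    load_dotenv()
    return get_api_key()
```

Failing early with a clear message is a design choice made for this sketch: it avoids discovering a missing key only after web content has already been fetched.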
- Data Structures: The script defines several dictionaries to categorize assets, life cycle stages, threat categories, and impact categories.
- File Handling:
  - Reads input from input.xlsx
  - Writes output to output.xlsx
  - Logs details to result.txt
- Web Content Fetching:
  - get_pdf_content(): Extracts text from PDF files
  - get_html_content(): Extracts text from HTML pages
  - fetch_content(): Determines the content type and calls the appropriate function
- Language Detection: Uses the langdetect library to ensure the fetched content is in English.
- Claude API Integration:
  - call_claude_api(): Constructs the prompt and makes the API call
  - Implements rate limiting (30 calls per minute, max 30,000 tokens)
- Data Processing:
  - Iterates through each row of the input data
  - Fetches related web content
  - Calls the Claude API for validation
  - Updates the output with the API response
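The content-type decision described for fetch_content() could look something like the sketch below. How the script actually tells PDF from HTML is not documented here, so the rule used (check the Content-Type header first, then fall back to the file extension) is an assumption, and classify_content and dispatch are hypothetical names. The handlers are passed in as parameters and stand in for get_pdf_content() and get_html_content().

```python
from urllib.parse import urlparse


def classify_content(url, content_type=""):
    """Return "pdf" or "html" for a URL and its Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";")[0].strip().lower()
    if mime == "application/pdf":
        return "pdf"
    if mime in ("text/html", "application/xhtml+xml"):
        return "html"
    # Header missing or unhelpful: fall back to the path's file extension.
    path = urlparse(url).path.lower()
    return "pdf" if path.endswith(".pdf") else "html"


def dispatch(url, content_type, body, pdf_handler, html_handler):
    """Call the handler that matches the detected content type."""
    kind = classify_content(url, content_type)
    handler = pdf_handler if kind == "pdf" else html_handler
    return handler(body)
```

Keeping the classification separate from the network request makes the decision easy to test without fetching anything.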
- Load the input Excel file
- Process each row (skipping the first two rows)
- Fetch web content based on the provided link
- Check if the content is in English
- If English, prepare the context and row content for the API call
- Call the Claude API for validation
- Update the 'Claude-Review' column with the API response
- Save results to the output Excel file and log file
- Implement rate limiting between API calls
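The steps above can be sketched as a single loop. This is a simplified illustration, not the script's code: rows are plain dictionaries rather than a pandas DataFrame, the column name "Link" is an assumption, and fetch, is_english, and validate are hypothetical callables standing in for fetch_content(), the langdetect check, and call_claude_api(). Saving the results and pausing between API calls are left out.

```python
def process_rows(rows, fetch, is_english, validate, skip=2, link_key="Link"):
    """Add a 'Claude-Review' value to each processed row and return the rows."""
    for index, row in enumerate(rows):
        if index < skip:
            continue  # the first rows are skipped, as in the steps above
        content = fetch(row.get(link_key, ""))
        if not content or not is_english(content):
            continue  # nothing usable to validate against
        # The validator receives the fetched context and the row itself.
        row["Claude-Review"] = validate(content, row)
    return rows
```

With langdetect installed, the English check could be passed as `lambda text: detect(text) == "en"`, using `detect` from the langdetect package.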
- An updated Excel file (output.xlsx) with a new 'Claude-Review' column containing the validation results
- A text file (result.txt) logging all prompts sent to the API and the responses received
The script includes error handling for API calls and file operations, logging any errors to both the console and the result.txt file.
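One way such error handling might be arranged is sketched below. The helper names log_error and run_safely are hypothetical, and the message format is an assumption; only the behavior of reporting to both the console and result.txt is taken from the description above.

```python
import sys


def log_error(message, log_path="result.txt"):
    """Write an error message to stderr and append it to the log file."""
    print(message, file=sys.stderr)
    try:
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(message + "\n")
    except OSError as exc:
        # The log file itself may be unwritable; the console still has the message.
        print(f"Could not write to {log_path}: {exc}", file=sys.stderr)


def run_safely(action, *args, log_path="result.txt", default=None):
    """Run action(*args); on failure, log the error and return default."""
    try:
        return action(*args)
    except Exception as exc:  # broad on purpose: one bad row should not stop the run
        name = getattr(action, "__name__", "action")
        log_error(f"ERROR in {name}: {exc!r}", log_path)
        return default
```

Returning a default instead of re-raising lets a long run continue past a single failing row, at the cost of needing to check the log afterwards.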
To comply with API usage limits, the script implements a rate limiting mechanism:
- Maximum 30 calls per minute
- Maximum 30,000 tokens per minute
- Waits 61 seconds after processing each row
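A fixed 61-second pause after each row keeps the call rate far below 30 per minute. For comparison, the sketch below shows a sliding-window limiter that enforces both limits while waiting only as long as needed. It is an alternative technique, not the mechanism the script uses; the RateLimiter class and its parameters are hypothetical, and the token count per call is assumed to be estimated by the caller.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter for calls and tokens per time window."""

    def __init__(self, max_calls=30, max_tokens=30_000, window=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.max_tokens = max_tokens
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.events = deque()  # (timestamp, tokens) for recent calls

    def _prune(self, now):
        # Forget calls that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()

    def wait_time(self, tokens):
        """Seconds to wait before a call using `tokens` tokens fits the limits."""
        now = self.clock()
        self._prune(now)
        calls = len(self.events)
        used = sum(spent for _, spent in self.events)
        wait = 0.0
        # Walk through recorded calls, oldest first, releasing each one
        # until the new call fits under both limits.
        for stamp, spent in self.events:
            if calls < self.max_calls and used + tokens <= self.max_tokens:
                break
            wait = stamp + self.window - now
            calls -= 1
            used -= spent
        return max(wait, 0.0)

    def acquire(self, tokens):
        """Block until the call fits, then record it."""
        delay = self.wait_time(tokens)
        if delay > 0:
            self.sleep(delay)
        self.events.append((self.clock(), tokens))
```

A caller would invoke `acquire(estimated_tokens)` immediately before each API request. A single request larger than the token limit is allowed through once the window is empty, which is a simplification in this sketch.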
You can modify the assets, life_cycle, threat_categories, and impact_categories dictionaries to update the categories used for validation.
Ensure you have the necessary permissions and comply with the terms of service for the Claude API and any websites you're fetching content from.
This script is designed for threat analysis and validation in the context of LLMs. Always use it responsibly and in compliance with relevant laws and regulations.