tuneBERTopic

tuneBERTopic is a tool designed to optimize the hyperparameters of the BERTopic model using various search strategies, including Bayesian optimization. The tool allows users to configure parameters, load data, and select evaluation metrics for tuning the topic model.

Features

  • Parameter Configuration: Load parameters from a YAML file.
  • Data Loading: Load data from a file or use sample data from the 20 Newsgroups dataset.
  • Search Strategies: Implement different search strategies, with Bayesian optimization as the default.
  • Evaluation Metrics: Support multiple evaluation metrics, including coherence, BLEU, ROUGE, and silhouette scores.
  • Logging: Utilize MLflow for tracking and logging experiments.

Installation

  1. Clone the repository:

    git clone https://github.com/benjaminr/tuneBERTopic.git
    cd tuneBERTopic
  2. Install dependencies:

    poetry install

Usage

Command-Line Interface

The main script main.py can be executed with various command-line arguments:

python main.py <parameter_file> [--data-path <data_path>] [--categories <categories>] [--max-num-samples <num_samples>] [--strategy <strategy>] [--metric <metric>] [--llm <llm>] [--log-level <log_level>]

Examples

  1. Basic Example:

    python main.py parameters.yaml --log-level INFO
  2. Using Custom Data:

    python main.py parameters.yaml --data-path /path/to/data.txt --log-level INFO
  3. Specifying Categories and Maximum Samples:

    python main.py parameters.yaml --categories "alt.atheism" "comp.graphics" --max-num-samples 500 --log-level INFO

Parameter File

The parameter file should be in YAML format. An example parameters.yaml file:

param_grid:
  umap__n_neighbors: [15, 50]
  umap__n_components: [5, 10]
  hdbscan__min_cluster_size: [5, 15]
  bertopic__nr_topics: [2, 5, 10, 50]
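Each `component__param` key lists the candidate values for one pipeline component. Assuming the file has been parsed into a dict (e.g. with `yaml.safe_load`), the grid defines a search space of every value combination; a dependency-free sketch of that expansion, purely to illustrate the size of the space:

```python
import itertools

# Mirrors the example parameters.yaml above; in the tool this dict
# would come from parsing the YAML parameter file.
param_grid = {
    "umap__n_neighbors": [15, 50],
    "umap__n_components": [5, 10],
    "hdbscan__min_cluster_size": [5, 15],
    "bertopic__nr_topics": [2, 5, 10, 50],
}

def expand_grid(grid):
    """Yield one configuration dict per combination of grid values."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(expand_grid(param_grid))
print(len(configs))  # 2 * 2 * 2 * 4 = 32 candidate configurations
```

Note that Bayesian optimization does not enumerate this grid exhaustively; it samples configurations and concentrates on promising regions of the space.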

Components

Data Loading

  • Function load_parameter_file: Load parameters from a YAML file.
  • Function load_data: Load data from a file or sample dataset.
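The exact signatures are internal to the tool, but based on the CLI flags above (`--data-path`, `--max-num-samples`), a minimal sketch of what a loader like `load_data` might do — the fallback documents here are placeholders, not the tool's actual 20 Newsgroups sample:

```python
from pathlib import Path

def load_data(data_path=None, max_num_samples=None):
    """Illustrative loader: one document per line from a text file,
    falling back to a tiny built-in sample when no path is given."""
    if data_path is None:
        docs = ["sample document one", "sample document two"]
    else:
        docs = [line.strip()
                for line in Path(data_path).read_text().splitlines()
                if line.strip()]
    return docs[:max_num_samples] if max_num_samples else docs

print(load_data(max_num_samples=1))
```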

Search Strategies

  • Class SearchStrategy: Base class for search strategies.
  • Class BayesianOptimizationSearch: Implements Bayesian optimization using hyperopt.
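The real `BayesianOptimizationSearch` relies on hyperopt; to illustrate the base-class pattern without that dependency, here is a hypothetical `RandomSearch` subclass, with assumed `param_grid` and `metric_fn` constructor arguments:

```python
import random

class SearchStrategy:
    """Illustrative base class: subclasses decide how to pick the
    next candidate configuration from the search space."""
    def __init__(self, param_grid, metric_fn):
        self.param_grid = param_grid
        self.metric_fn = metric_fn  # maps a config dict to a score

    def search(self, n_trials):
        raise NotImplementedError

class RandomSearch(SearchStrategy):
    """Stand-in for BayesianOptimizationSearch; random sampling
    keeps the sketch dependency-free."""
    def search(self, n_trials):
        best_config, best_score = None, float("-inf")
        for _ in range(n_trials):
            config = {k: random.choice(v) for k, v in self.param_grid.items()}
            score = self.metric_fn(config)
            if score > best_score:
                best_config, best_score = config, score
        return best_config, best_score
```

A Bayesian strategy would differ only inside `search`: instead of sampling uniformly, it would fit a surrogate model over past trials (as hyperopt's TPE does) to choose the next configuration.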

Evaluation Metrics

tuneBERTopic supports various evaluation metrics to assess the quality of the topics generated by the BERTopic model. These include:

  • Coherence Score (c_v): Measures how semantically related the words within a topic are to one another, indicating the interpretability and quality of the topics generated by BERTopic.
  • Silhouette Score: Measures how similar a document is to its own cluster (topic) compared to other clusters, evaluating the quality of the clustering.
  • BLEU (Bilingual Evaluation Understudy) Score: Originally designed for evaluating machine-translated text. Here, an LLM backend generates summaries from the topic keywords, and those summaries are scored against the input documents.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score: A recall-oriented metric for automatic summarization. ROUGE scores are obtained the same way as BLEU scores: an LLM backend generates summaries from the topic keywords, which are compared against the input documents.
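As a concrete illustration of the silhouette formula (in practice one would use `sklearn.metrics.silhouette_score` on the document embeddings), a minimal one-dimensional version:

```python
def silhouette(points, labels):
    """Mean silhouette over 1-D points: for each point, a is the mean
    distance to its own cluster, b is the smallest mean distance to any
    other cluster, and the point's silhouette is (b - a) / max(a, b)."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)

    def dist(i, j):
        return abs(points[i] - points[j])

    scores = []
    for i, label in enumerate(labels):
        same = [j for j in clusters[label] if j != i]
        if not same:  # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(dist(i, j) for j in same) / len(same)
        b = min(sum(dist(i, j) for j in idxs) / len(idxs)
                for c, idxs in clusters.items() if c != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

print(silhouette([0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1]))  # ≈ 0.90: well-separated clusters
```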

Using Evaluation Metrics

The evaluate_model method in the SearchStrategy class calculates these metrics:

  • Coherence and Silhouette: Evaluated directly on the topic model and the input documents.
  • BLEU and ROUGE: Utilize an LLM backend to generate summaries from the topic keywords, which are then compared to the input documents to obtain the scores.
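To make the summary-versus-documents comparison concrete, here is a bare-bones ROUGE-1 recall (clipped unigram overlap divided by reference length); production code would use a library such as `rouge-score`, which also handles stemming and longer n-grams:

```python
from collections import Counter

def rouge1_recall(summary, reference):
    """ROUGE-1 recall: fraction of the reference's unigrams that also
    appear in the summary, with counts clipped to the summary's counts."""
    sum_counts = Counter(summary.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(count, sum_counts[tok]) for tok, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

print(rouge1_recall("topics about space and science",
                    "space science topics"))  # 3 of 3 reference words matched -> 1.0
```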

Logging

MLflow is used for tracking experiments and logging results. Ensure MLflow is properly configured before running the tuning process.

# start the MLflow tracking server
mlflow server --host 127.0.0.1 --port 8080

Contributing

Contributions are welcome! Please fork the repository and submit a pull request.
