pyboretum logo

pyboretum


Fertile grounds to explore and analyze custom decision trees in Python

Overview

Decision trees, also known as recursive partitioning, are among the most widely used statistical/machine learning algorithms in practice, often as part of a random forest [1]. They have gained wide popularity by offering robust performance over a wide range of regression and classification problems with little hyperparameter tuning. As a result, there are many implementations of decision tree algorithms across programming languages, such as the decision trees in scikit-learn in Python and rpart in R.

Although decision trees offer good performance out of the box, they are, in fact, a framework for creating many different algorithms [2]. The framework consists of a few distinct elements: (a) rules to identify the cut variable; (b) rules to identify the cut point; (c) rules to stop or prune; and (d) models to predict from nodes/partitions. Numerous publications change these elements to create decision trees tailored to particular problems. However, scikit-learn, the most popular decision tree implementation in Python, does not provide a way to customize the basic algorithm it offers.

pyboretum solves this problem by offering an object-oriented framework for creating your own decision tree algorithms, along with tools to analyze them in Python. Similar packages exist in other languages, for instance partykit in R. A custom algorithm can significantly improve how efficiently training data are used and lead to better performance; further gains can also be achieved through popular ensemble techniques like random forests or boosting.
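Concretely, these elements map onto pyboretum's API. The sketch below uses only names that appear later in this README and is meant as a conceptual outline rather than additional API:

from pyboretum import DecisionTree, MeanNode, splitters

# (a) + (b): a Splitter chooses the cut variable and the cut point.
# (c): stopping rules are hyperparameters of the tree itself.
# (d): a Node class turns each partition into a prediction model.
dt = DecisionTree(min_samples_leaf=5,    # (c) stopping rule
                  max_depth=5,           # (c) stopping rule
                  node_class=MeanNode)   # (d) per-node prediction

# The Splitter is supplied at training time:
# dt.fit(X, y, splitters.MSESplitter())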

The code currently focuses on regression problems, and we may support classification problems in the future as well (we are open to contributions!). Note that the code base is under active development, and the class interface is still evolving. It is licensed under the MIT License (see license file for more information) and supports both Python 2.7 and 3.7.

Installation

In the directory where you want to keep the pyboretum source code,

git clone git@github.com:picwell/pyboretum.git
cd pyboretum
python setup.py install

This will make pyboretum available through import.
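You can confirm the installation from a Python session with a quick sanity check:

import pyboretum
print(pyboretum.__file__)  # path of the installed package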

Getting Started

In this example, we use a small public dataset of red wine quality to demonstrate the basic pattern of training and inspecting a pyboretum decision tree. The key takeaways are:

  • The Splitter class can be customized to change the variable and cut-point selection rules. The example shows MSESplitter and MAESplitter, which optimize the weighted mean squared error (MSE) and mean absolute error (MAE), respectively.
  • The Node class can be customized to provide different predictions. The example shows MeanNode for the MSE criterion and MedianNode for the MAE criterion.
  • .visualize_tree() creates a visualization of the decision rules.

from pyboretum import DecisionTree, MeanNode, splitters

dt = DecisionTree(min_samples_leaf=5, max_depth=5,
                  node_class=MeanNode)

Training a Decision Tree

Currently, pyboretum trees expect the data to be numeric (continuous or ordered categorical), so you have to encode nominal categorical variables using techniques like one-hot encoding. We plan to support nominal categorical features in the future.
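For example, nominal columns can be one-hot encoded with pandas before fitting (the toy column names below are made up for illustration):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'white', 'red'],
                   'acidity': [0.7, 0.3, 0.5]})

# Expand the nominal 'color' column into 0/1 indicator columns;
# the numeric 'acidity' column passes through unchanged.
encoded = pd.get_dummies(df, columns=['color'])
print(encoded.columns.tolist())  # ['acidity', 'color_red', 'color_white']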

Specifying a Splitter

When we fit a tree, we pass a Splitter object to .fit() in addition to X and Y ("features" and "target" data, respectively); it defaults to MSESplitter if not given. Each Splitter partitions the data to optimize a different split criterion, and this is where users can create their own custom Splitter classes tailored to the problem at hand.
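The Splitter interface is defined in pyboretum/splitters/base.py. The sketch below shows the general shape of a custom split criterion; note that the method name and return convention here are assumptions made for illustration, not the package's actual API:

import numpy as np

class RangeSplitter:
    """Hypothetical splitter: choose the split minimizing the summed
    target range (max - min) of the two child partitions.

    NOTE: the method name and return convention below are assumptions
    for illustration; see pyboretum/splitters/base.py for the actual
    Splitter interface to implement.
    """
    def select_feature_to_cut(self, X, Y):
        X = np.asarray(X, dtype=float)
        Y = np.asarray(Y, dtype=float)
        best = None  # (score, feature_index, cut_point)
        for j in range(X.shape[1]):
            order = np.argsort(X[:, j])
            for i in range(1, len(order)):
                score = np.ptp(Y[order[:i]]) + np.ptp(Y[order[i:]])
                if best is None or score < best[0]:
                    cut = (X[order[i - 1], j] + X[order[i], j]) / 2
                    best = (score, j, cut)
        return best[1], best[2]  # feature index and cut point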

Below, we generate two trees, one minimizing MSE and one minimizing MAE, using the two Splitters included out of the box in pyboretum.

import pandas as pd

# Red wine quality dataset from the UCI Machine Learning Repository.
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
y = data['quality']
X = data[[c for c in data.columns if c != 'quality']]

dt.fit(X, y, splitters.MSESplitter())
dt.visualize_tree(max_depth=2)

MSE Tree

We can pass a different Splitter object to .fit() to generate an alternative tree.

from pyboretum import MedianNode

dt = DecisionTree(min_samples_leaf=5, max_depth=5,
                  node_class=MedianNode)

dt.fit(X, y, splitters.MAESplitter())
dt.visualize_tree(max_depth=2)

MAE Tree
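The node/splitter pairings above are not arbitrary: within a partition, the mean minimizes squared error and the median minimizes absolute error, which is why MeanNode goes with MSESplitter and MedianNode with MAESplitter. A quick numpy check of this fact:

import numpy as np

targets = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
candidates = np.linspace(targets.min(), targets.max(), 1001)

sse = [(np.sum((targets - c) ** 2), c) for c in candidates]
sae = [(np.sum(np.abs(targets - c)), c) for c in candidates]

print(min(sse)[1], targets.mean())      # ~3.6: squared error is minimized at the mean
print(min(sae)[1], np.median(targets))  # ~2.0: absolute error is minimized at the median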

Key Features

  • Support for efficient univariate (vector y) and multivariate (matrix Y) MSE criteria for variable and cut-point selection [3]
  • Support for an efficient univariate MAE criterion for variable and cut-point selection [3]
  • Support for the Mahalanobis distance in the multivariate MSE criterion (see the sketch after this list)
  • Visualization of decision rules
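On the Mahalanobis point: it generalizes squared error to correlated multivariate targets by whitening with the inverse covariance, so correlated target dimensions are not double-counted. The snippet below illustrates the distance itself with plain numpy, independent of pyboretum's API:

import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 2))
Y[:, 1] += 0.8 * Y[:, 0]  # make the two target dimensions correlated

center = Y.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(Y, rowvar=False))

# Squared Mahalanobis distance of each row to the center:
# d_i^2 = (y_i - mu)^T S^{-1} (y_i - mu)
diff = Y - center
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
print(d2[:3])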

Code Organization

<root_dir>/
  pyboretum/
   |-- splitters/
   |    |-- base.py (interface definition for Splitter)
   |    |-- mae_splitter.py (splitter for the MAE criterion)
   |    |-- mse_splitter.py (splitter for the MSE criteria, including the Mahalanobis distance)
   |-- tree/
   |    |-- base.py (interface for Tree)
   |    |-- linked_tree.py (Tree implementation using linked lists)
   |    |-- list_tree.py (Tree implementation using lists)
   |-- decision_tree.py (main decision tree implementation)
   |-- node.py (Node classes used with Tree)
   |-- training_data.py
   |-- utils.py
  test/  (various unit tests)

Key Classes/Interfaces for Customization

What's Coming Next

Release Notes

References

[1] Leo Breiman et al., Classification and Regression Trees, Chapman and Hall/CRC, 1984

[2] Heping Zhang and Burton H. Singer, Recursive Partitioning and Applications, Springer, 2010 (2nd edition)

[3] Luis Fernando Rainho Alves Torgo, Inductive Learning of Tree-based Regression Models, Ph.D. Thesis, University of Porto, 1999