Traffic volume prediction using ML

This repository contains code for data processing, feature selection, and the fitting and evaluation of multiple models on a traffic dataset. Feature selection was performed using supervised and unsupervised methods to achieve higher accuracy. Hyperparameter tuning was performed using GridSearchCV.

Implementation Details

Dataset Details

| Variable Name | Role | Type | Description | Units |
|---|---|---|---|---|
| holiday | Feature | Categorical | US National holidays plus regional holiday, Minnesota State Fair | - |
| temp | Feature | Continuous | Average temperature in Kelvin | Kelvin |
| rain_1h | Feature | Continuous | Amount in millimeters of rain that occurred in the hour | mm |
| snow_1h | Feature | Continuous | Amount in millimeters of snow that occurred in the hour | mm |
| clouds_all | Feature | Integer | Percentage of cloud cover | % |
| weather_main | Feature | Categorical | Short textual description of the current weather | - |
| weather_description | Feature | Categorical | Longer textual description of the current weather | - |
| date_time | Feature | Date | Hour of the data collected in local CST time | - |
| traffic_volume | Target | Integer | Hourly I-94 ATR 301 reported westbound traffic volume | - |

Data processing

1. Target column - traffic_volume

Rows containing NaN values in this column were dropped. Because it is the ground truth column, imputing or replacing NaN values would distort the original data.

2. Date_time column processing

Rows with NaN in the date_time column were dropped, as imputing a timestamp does not make sense.

3. Holiday column

NaN values in the holiday column were converted to 0 and holidays to 1.

4. Temperature, rain, snow and clouds column

NaN values in these columns were filled using forward fill (ffill) after the dataset was sorted by the date_time column.
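Steps 3 and 4 might look roughly like the following in pandas. This is a minimal sketch, not the notebook's code: the frame name `df` and the toy values are assumptions made for illustration, while the column names come from the dataset table.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names follow the dataset table.
df = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2016-01-01 02:00", "2016-01-01 00:00", "2016-01-01 01:00",
    ]),
    "holiday": [np.nan, "New Years Day", np.nan],
    "temp": [np.nan, 270.0, 271.5],
    "rain_1h": [0.0, 0.0, np.nan],
})

# Step 3: a missing holiday becomes 0, any named holiday becomes 1.
df["holiday"] = df["holiday"].notna().astype(int)

# Step 4: sort chronologically first, so that forward fill carries the
# most recent earlier observation into each gap.
df = df.sort_values("date_time").reset_index(drop=True)
weather_cols = ["temp", "rain_1h"]
df[weather_cols] = df[weather_cols].ffill()

print(df)
```

Sorting before filling matters here: on unsorted rows, forward fill would copy a value from whichever row happens to sit above, not from the preceding hour.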

5. Weather description and weather main column

NA values in weather_main were filled using the last word of the corresponding weather_description. The remaining rows containing NA values were then dropped.
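One way to express this fill in pandas is sketched below. The toy rows are invented, and capitalising the extracted word to match the existing weather_main labels is an assumption rather than something the text confirms.

```python
import numpy as np
import pandas as pd

# Made-up rows: one complete, two recoverable, one with no information.
df = pd.DataFrame({
    "weather_main": ["Clouds", np.nan, np.nan, np.nan],
    "weather_description": ["broken clouds", "light rain", "heavy snow", np.nan],
})

# Last word of each description (NaN stays NaN).
# Capitalisation is an assumption to match labels such as "Clouds".
last_word = df["weather_description"].str.split().str[-1].str.capitalize()

# Fill only the missing weather_main entries, then drop what is still empty.
df["weather_main"] = df["weather_main"].fillna(last_word)
df = df.dropna(subset=["weather_main"]).reset_index(drop=True)

print(df)
```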

6. Drop duplicates and unnecessary columns

Next, duplicate rows were dropped, along with the date_time and weather_description columns.

7. Encoding categorical text into numerical value.

Here weather_main, day, month, year, day_time, and weekend were converted to numerical codes using serial values from 1 to len(var).
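A serial encoding of this kind can be sketched as follows. The order in which labels receive their codes (first appearance in the column) is an assumption; the text only states that codes run from 1 to the number of distinct values.

```python
import pandas as pd

# Made-up labels for a single categorical column.
df = pd.DataFrame({"weather_main": ["Clouds", "Rain", "Clear", "Rain", "Clouds"]})

# Map each distinct label to 1..n in order of first appearance (assumed order).
categories = df["weather_main"].unique()
mapping = {label: code for code, label in enumerate(categories, start=1)}
df["weather_main"] = df["weather_main"].map(mapping)

print(mapping)
print(df["weather_main"].tolist())
```

Keeping `mapping` around makes it possible to decode predictions or apply the same codes to new data.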

Data exploration

1. Histogram

All columns were visualized using histograms.


2. Box plot

Box plots were used to visualize and detect outliers in selected columns.
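The points a box plot draws as outliers are conventionally those beyond 1.5 times the interquartile range from the quartiles. The sketch below applies that rule numerically to made-up rain readings; whether the notebook used this exact threshold is an assumption.

```python
import pandas as pd

# Made-up hourly rain readings with one extreme value.
rain = pd.Series([0.0, 0.1, 0.0, 0.3, 0.2, 0.1, 55.0], name="rain_1h")

# Quartiles and interquartile range.
q1, q3 = rain.quantile(0.25), rain.quantile(0.75)
iqr = q3 - q1

# Whisker limits under the conventional 1.5 * IQR rule.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the whiskers are the ones a box plot marks individually.
outliers = rain[(rain < lower) | (rain > upper)]
print(outliers)
```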


Base Model development

For model development, multiple estimators were screened, and the most suitable model was fine-tuned later. The dataset was split into a training dataset and an evaluation dataset. The model screening scores are shown below.

| Model | R2 | MSE |
|---|---|---|
| SVR | 0.1953 | 3187710.8458 |
| LinearRegression | 0.3232 | 2681101.2234 |
| KNeighborsRegressor | 0.7063 | 1163396.5928 |
| SGDRegressor | 0.3213 | 2688483.8860 |
| BayesianRidge | 0.3231 | 2681126.1147 |
| DecisionTreeRegressor | 0.9340 | 261360.1047 |
| GradientBoostingRegressor | 0.9017 | 389401.9654 |
| RandomForestRegressor | 0.9652 | 137792.2456 |
| XGBRegressor | 0.9725 | 108748.1525 |

RandomForestRegressor and XGBRegressor were found to perform best among all the models.
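A screening loop of this kind could be sketched as below. It runs on synthetic data from `make_regression` as a stand-in for the processed traffic features, covers only a few of the estimators, and assumes an 80/20 split; none of these choices are confirmed by the text.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the processed features and target (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=0  # split ratio is an assumption
)

# A subset of the screened estimators, with default settings.
models = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model on the training split and score it on the evaluation split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_eval)
    results[name] = {
        "R2": r2_score(y_eval, pred),
        "MSE": mean_squared_error(y_eval, pred),
    }

for name, scores in results.items():
    print(f"{name:<24} R2={scores['R2']:.4f}  MSE={scores['MSE']:.4f}")
```

The numbers printed on synthetic data say nothing about the traffic dataset; the sketch only shows the shape of the comparison.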

Cross-validation was performed to check for overfitting.

| CV Number | Score |
|---|---|
| 1 | 0.92805313 |
| 2 | 0.9287948 |
| 3 | 0.94174875 |

The scores are lower than the score generated from the evaluation dataset, which indicates overfitting of the model.
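The comparison between fold scores and the hold-out score can be sketched with `cross_val_score`. The choice of RandomForestRegressor, the synthetic data, and the split ratio are assumptions; the three folds match the three scores reported above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the processed dataset (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Three-fold cross-validation on the training split; for regressors the
# default score is R2.
cv_scores = cross_val_score(model, X_train, y_train, cv=3)

# Single hold-out score for comparison.
holdout_score = model.fit(X_train, y_train).score(X_eval, y_eval)

print("CV scores:", cv_scores, "mean:", cv_scores.mean())
print("Hold-out R2:", holdout_score)
```

A hold-out score clearly above the fold scores is the pattern the paragraph above reads as overfitting.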

Supervised feature selection

Pearson Correlation - feature selection

Pearson correlation is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations.


Fig: Pearson correlation coefficients for all pairwise combinations of features.
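As a small illustration of the definition, the sketch below computes the pairwise Pearson matrix with pandas and then reproduces one entry by hand as covariance divided by the product of standard deviations. The values are invented; only the column names come from the dataset table.

```python
import numpy as np
import pandas as pd

# Made-up values for three of the dataset's columns.
df = pd.DataFrame({
    "temp": [270.0, 275.0, 280.0, 285.0, 290.0],
    "clouds_all": [90, 75, 40, 20, 1],
    "traffic_volume": [1200, 2500, 4100, 5200, 6000],
})

# Pairwise Pearson correlation matrix for all columns.
corr = df.corr(method="pearson")

# The same coefficient for one pair, computed from the definition:
# covariance divided by the product of the standard deviations.
x, y = df["temp"], df["traffic_volume"]
r_manual = x.cov(y) / (x.std() * y.std())

print(corr)
print("temp vs traffic_volume:", r_manual)
```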

1. Evaluating the mutual info regression method for feature selection

Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.


Fig-1: The plot shows the dependency of the target on each feature.

The top 70% of the features were selected, and their accuracy was compared with that of the all-features dataset using RandomForestRegressor.

| Metric | Value |
|---|---|
| R2 Score | 0.9656273468841373 |
| Mean Squared Error | 133272.57017553216 |
| Cross-Validation Scores | [0.96480622, 0.96424541, 0.95989378] |
| Percentage Deviation from Mean | 11.21 |
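Keeping the top 70% of features by mutual information could be done with scikit-learn's `SelectPercentile`, as sketched below. Whether the notebook used `SelectPercentile` or ranked the scores manually is not stated, so treat the selector and the synthetic data as assumptions.

```python
from functools import partial

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, mutual_info_regression

# Synthetic stand-in: 10 features, of which 5 carry signal (assumption).
X, y = make_regression(
    n_samples=300, n_features=10, n_informative=5, noise=5.0, random_state=0
)

# Fix the random state so the MI estimates are reproducible.
score_func = partial(mutual_info_regression, random_state=0)

# Keep the 70% of features with the highest mutual information.
selector = SelectPercentile(score_func=score_func, percentile=70)
X_selected = selector.fit_transform(X, y)

print("MI scores:", selector.scores_)
print("Kept feature indices:", selector.get_support(indices=True))
```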

2. Selecting features using f_regression

f_regression uses univariate linear regression tests returning F-statistic and p-values.


Fig-2: Plot of F-statistics for all features against the target.

The top 60% of the features were selected, and their accuracy was compared with that of the all-features dataset using RandomForestRegressor.

| Metric | Value |
|---|---|
| R2 Score | 0.944669474729565 |
| Mean Squared Error | 214532.21219487552 |
| Cross-Validation Scores | [0.94521279, 0.94479329, 0.93781204] |
| Percentage Deviation from Mean | 14.22 |
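The F-statistic ranking and the 60% cut can be sketched in the same way. As before, the use of `SelectPercentile` and the synthetic data are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, f_regression

# Synthetic stand-in: 10 features, of which 5 carry signal (assumption).
X, y = make_regression(
    n_samples=300, n_features=10, n_informative=5, noise=5.0, random_state=0
)

# Univariate F-statistic and p-value for each feature against the target.
f_stats, p_values = f_regression(X, y)

# Keep the 60% of features with the highest F-statistic.
selector = SelectPercentile(score_func=f_regression, percentile=60)
X_selected = selector.fit_transform(X, y)

print("F-statistics:", f_stats)
print("Kept feature indices:", selector.get_support(indices=True))
```

Because `f_regression` tests each feature with a univariate linear model, it can rank a feature low when its relationship with the target is non-linear, which is one possible reason for a gap between the two selection methods.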

Unsupervised feature selection

1. Principal component analysis

PCA is defined as an orthogonal linear transformation on a real inner product space that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

Nine components were selected for the transformed space and evaluated using RandomForestRegressor.

| Metric | Value |
|---|---|
| R2 Score | 0.5565759676724207 |
| Mean Squared Error | 1719281.3213077914 |
| Cross-Validation Scores | [0.53416057, 0.53846765, 0.53979648] |
| Percentage Deviation from Mean | 40.27 |
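Projecting onto nine principal components might look like the sketch below. Standardising the features before PCA is an assumption (the text does not say whether scaling was applied), and the data is synthetic.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 12 features (assumption).
X, y = make_regression(n_samples=300, n_features=12, noise=5.0, random_state=0)

# PCA is sensitive to feature scale, so standardise first (assumed step).
X_scaled = StandardScaler().fit_transform(X)

# Keep the first nine principal components.
pca = PCA(n_components=9)
X_pca = pca.fit_transform(X_scaled)

# Cumulative share of variance retained by the kept components.
print(pca.explained_variance_ratio_.cumsum())
```

PCA ranks directions by variance in the features alone and never looks at the target, so components that retain most of the variance are not guaranteed to retain the signal a regressor needs.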

Hyperparameter tuning

Hyperparameter tuning was performed using GridSearchCV with the following parameter grid:

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

The best parameters were used to train the RandomForestRegressor model on the training dataset, and the model was then evaluated on the test dataset. The score is:

Test Score (R-squared): 0.9935267631943931
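The search itself can be sketched as follows. To keep the example quick, it uses a reduced grid with the same four parameter names rather than the full grid listed above; the synthetic data, three-fold cross-validation, and split ratio are also assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the processed dataset (assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Reduced grid so the sketch runs quickly; same parameter names as the
# full grid, fewer values per parameter.
param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [None, 5],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# Exhaustive search over the grid, scored by R2 with 3-fold CV (assumed).
search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, cv=3, scoring="r2"
)
search.fit(X_train, y_train)

# By default GridSearchCV refits the best configuration on the full
# training split, so best_estimator_ is ready to score on held-out data.
test_score = search.best_estimator_.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print("Test R2:", test_score)
```

Scoring on a test split that played no part in the search keeps the reported number independent of the tuning.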

Inference - RandomForestRegressor and XGBRegressor performed best, and hyperparameter tuning improved the RandomForestRegressor score further.

Key Takeaways

How to perform feature selection using various methods, fit ML models, and evaluate their performance.


The code is available in the Python notebook model.ipynb.


Language: Python

Packages: Sklearn, Matplotlib, Pandas, Seaborn


