This repository contains code for data processing, feature selection, and fitting and evaluating multiple regression models on a traffic volume dataset. Feature selection was performed using both supervised and unsupervised methods to improve accuracy, and hyperparameter tuning was performed using GridSearchCV.
- Dataset - Traffic volume dataset
- Models evaluated:
- Feature selection methods:
- Supervised feature selection
- Unsupervised feature selection
- Hyperparameter tuning
- Using GridSearchCV
Variable Name | Role | Type | Description | Units |
---|---|---|---|---|
holiday | Feature | Categorical | US National holidays plus regional holiday, Minnesota State Fair | - |
temp | Feature | Continuous | Average temperature in Kelvin | Kelvin |
rain_1h | Feature | Continuous | Amount in millimeters of rain that occurred in the hour | mm |
snow_1h | Feature | Continuous | Amount in millimeters of snow that occurred in the hour | mm |
clouds_all | Feature | Integer | Percentage of cloud cover | % |
weather_main | Feature | Categorical | Short textual description of the current weather | - |
weather_description | Feature | Categorical | Longer textual description of the current weather | - |
date_time | Feature | Date | Hour of the data collected in local CST time | - |
traffic_volume | Target | Integer | Hourly I-94 ATR 301 reported westbound traffic volume | - |
The rows containing NaN values in this column were dropped because it is the ground-truth column; imputing or replacing NaN values would compromise the integrity of the dataset.
The rows containing NaN in the date_time column were dropped, as it does not make sense to impute timestamps.
The NaN values in the holiday column were converted to 0 and the holidays to 1.
The NaN values in these columns were filled using forward fill (ffill) after the dataset was sorted by the date_time column.
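A minimal sketch of these cleaning steps, assuming the data is loaded into a pandas DataFrame `df` (the file name is a placeholder; the notebook's exact code may differ):

```python
import pandas as pd

# Hypothetical file name; use the actual dataset path
df = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')

# Drop rows where the ground truth or the timestamp is missing
df = df.dropna(subset=['traffic_volume', 'date_time'])

# Binary-encode holidays: NaN -> 0, any holiday name -> 1
df['holiday'] = df['holiday'].notna().astype(int)

# Sort chronologically, then forward-fill the weather columns
df = df.sort_values('date_time')
cols = ['temp', 'rain_1h', 'snow_1h', 'clouds_all']
df[cols] = df[cols].ffill()
```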
The NA values in weather_main were filled using the last word of the corresponding weather_description entry; the remaining rows still containing NA values were then dropped.
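A sketch of this fill rule, continuing with the DataFrame `df` from above (the notebook may implement it differently):

```python
# Fill missing weather_main with the last word of weather_description,
# e.g. 'light rain' -> 'rain'
mask = df['weather_main'].isna() & df['weather_description'].notna()
df.loc[mask, 'weather_main'] = (
    df.loc[mask, 'weather_description'].str.split().str[-1]
)

# Drop any rows that still contain NA values
df = df.dropna()
```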
Next, duplicate rows were dropped, and the date_time and weather_description columns were dropped as well.
Here weather_main, day, month, year, day_time, and weekend were converted to numerical encodings using serial values starting from 1 up to len(var).
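A sketch consistent with these steps, assuming the day, month, year, day_time, and weekend columns were derived from date_time beforehand:

```python
# Remove duplicate rows, then drop columns that are no longer needed
df = df.drop_duplicates()
df = df.drop(columns=['date_time', 'weather_description'])

# Map each unique category to a serial integer starting at 1
for col in ['weather_main', 'day', 'month', 'year', 'day_time', 'weekend']:
    mapping = {cat: i for i, cat in enumerate(df[col].unique(), start=1)}
    df[col] = df[col].map(mapping)
```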
All columns were visualized using histograms. To check for outliers in certain columns, box plots were used for visualization and detection.
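A minimal sketch of these plots with Matplotlib and Seaborn (figure sizes and the boxplot column are illustrative):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histograms for every column
df.hist(figsize=(12, 10), bins=30)
plt.tight_layout()
plt.show()

# Box plot to inspect outliers in a single column, e.g. traffic_volume
sns.boxplot(x=df['traffic_volume'])
plt.show()
```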
For model development, multiple estimators were screened, and the most appropriate models were fine-tuned later. The dataset was split into a training dataset and an evaluation dataset. Here are the scores from model screening.
Model | R2 | MSE |
---|---|---|
SVR | 0.1953 | 3187710.8458 |
LinearRegression | 0.3232 | 2681101.2234 |
KNeighborsRegressor | 0.7063 | 1163396.5928 |
SGDRegressor | 0.3213 | 2688483.8860 |
BayesianRidge | 0.3231 | 2681126.1147 |
DecisionTreeRegressor | 0.9340 | 261360.1047 |
GradientBoostingRegressor | 0.9017 | 389401.9654 |
RandomForestRegressor | 0.9652 | 137792.2456 |
XGBRegressor | 0.9725 | 108748.1525 |
RandomForestRegressor and XGBRegressor were found to perform best among all the models.
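A sketch of a screening loop that could produce the table above, assuming the cleaned DataFrame `df` (XGBRegressor requires the xgboost package; the notebook's split ratio and seeds may differ):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, SGDRegressor, BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from xgboost import XGBRegressor

X = df.drop(columns=['traffic_volume'])
y = df['traffic_volume']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = [SVR(), LinearRegression(), KNeighborsRegressor(), SGDRegressor(),
          BayesianRidge(), DecisionTreeRegressor(), GradientBoostingRegressor(),
          RandomForestRegressor(), XGBRegressor()]

# Fit each candidate and report R2 and MSE on the evaluation set
for model in models:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(type(model).__name__, r2_score(y_test, pred), mean_squared_error(y_test, pred))
```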
CV Fold | R2 Score |
---|---|
1 | 0.92805313 |
2 | 0.9287948 |
3 | 0.94174875 |
These scores are lower than the score generated from the evaluation dataset, which indicates overfitting of the model.
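A sketch of this cross-validation check, reusing the training split from the screening step above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# 3-fold cross-validation R2 scores on the training data
scores = cross_val_score(RandomForestRegressor(random_state=42),
                         X_train, y_train, cv=3, scoring='r2')
print(scores)
```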
Pearson correlation is a correlation coefficient that measures linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations.
Fig: Pearson correlation coefficients for all pairwise combinations of features.
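A minimal way to compute and visualize this matrix with pandas and Seaborn (a sketch; the notebook's plotting code may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlation: cov(X, Y) / (std(X) * std(Y))
corr = df.corr(method='pearson')
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
```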
Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.
Fig-1: The plot shows the dependency of the target on each feature.
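A sketch of the MI computation and the percentile-based selection described below, assuming `X` and `y` from the earlier split:

```python
from sklearn.feature_selection import mutual_info_regression, SelectPercentile

# MI score of each feature against the target
mi_scores = mutual_info_regression(X, y)

# Keep the features in the top 70% of MI scores
selector = SelectPercentile(mutual_info_regression, percentile=70)
X_selected = selector.fit_transform(X, y)
```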
The top 70% of features were selected, and their accuracy was evaluated against the all-features dataset using RandomForestRegressor.
Metric | Value |
---|---|
R2 Score | 0.9656273468841373 |
Mean Squared Error | 133272.57017553216 |
Cross-Validation Scores | [0.96480622, 0.96424541, 0.95989378] |
Percentage Deviation from Mean | 11.21 |
f_regression uses univariate linear regression tests returning F-statistic and p-values.
Fig-2: Plot of F-statistics for all features against the target.
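A sketch of the F-test scoring and the percentile selection used below (same assumed `X`, `y`):

```python
from sklearn.feature_selection import f_regression, SelectPercentile

# Univariate F-test of each feature against the target
f_stats, p_values = f_regression(X, y)

# Keep the features in the top 60% of F-statistics
selector = SelectPercentile(f_regression, percentile=60)
X_selected = selector.fit_transform(X, y)
```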
The top 60% of features were selected, and their accuracy was evaluated against the all-features dataset using RandomForestRegressor.
Metric | Value |
---|---|
R2 Score | 0.944669474729565 |
Mean Squared Error | 214532.21219487552 |
Cross-Validation Scores | [0.94521279, 0.94479329, 0.93781204] |
Percentage Deviation from Mean | 14.22 |
PCA is defined as an orthogonal linear transformation on a real inner product space that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
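A minimal PCA transformation with scikit-learn; standardizing before PCA is a common prerequisite and an assumption here, as the notebook may handle scaling differently:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize features, then project onto the first 9 principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=9)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)
```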
Nine components were selected for the transformed space, which was then evaluated using RandomForestRegressor.
Metric | Value |
---|---|
R2 Score | 0.5565759676724207 |
Mean Squared Error | 1719281.3213077914 |
Cross-Validation Scores | [0.53416057, 0.53846765, 0.53979648] |
Percentage Deviation from Mean | 40.27 |
Hyperparameter tuning was performed using GridSearchCV with the following parameter grid (the imports and fit call below are a minimal sketch):
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Exhaustive cross-validated search over the grid
grid = GridSearchCV(RandomForestRegressor(), param_grid, cv=3, n_jobs=-1)
grid.fit(X_train, y_train)
```
The best parameters were used to train the RandomForestRegressor model on the training dataset, and the model was evaluated on the test dataset. The score is:
Test Score (R-squared): 0.9935267631943931
*Inference* - RandomForestRegressor and XGBRegressor performed best, and after hyperparameter tuning the performance improved further.
How to perform feature selection using various methods, fit ML models, and evaluate model performance.
The code is available in a Python notebook, model.ipynb. To view the code, please click below.
Language: Python
Packages: scikit-learn, Matplotlib, Pandas, Seaborn
Resources used
- scikit-learn
- OpenAI. (2024). ChatGPT (3.5) Large language model. https://chat.openai.com
If you have any feedback or are interested in collaborating, please reach out to me on LinkedIn.