Machine learning model development for predicting median house value

sonti-roy/california_housing

Project Title

Machine learning model development for predicting median house value in California

Implementation Details

Dataset Details

This dataset was obtained from the StatLib repository.

This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the sklearn.datasets.fetch_california_housing function.

  • California Housing Dataset in Sklearn Documentation
  • 20640 samples
  • 8 Input Features:
    • MedInc median income in block group
    • HouseAge median house age in block group
    • AveRooms average number of rooms per household
    • AveBedrms average number of bedrooms per household
    • Population block group population
    • AveOccup average number of household members
    • Latitude block group latitude
    • Longitude block group longitude
  • Target: Median house value for California districts, expressed in hundreds of thousands of dollars ($100,000)
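
As a minimal sketch of the loading step, assuming scikit-learn and pandas are installed, the dataset can be fetched as shown here. The helper names `load_housing` and `to_dollars` are hypothetical and are not taken from the project's notebook. `fetch_california_housing` downloads and caches the data on first use, so the call is kept inside a function.

```python
# Feature names as listed above.
FEATURE_NAMES = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]


def load_housing():
    """Return (X, y): a DataFrame of the 8 features and a Series of targets.

    Hypothetical helper; downloads the data on the first call.
    """
    from sklearn.datasets import fetch_california_housing

    housing = fetch_california_housing(as_frame=True)  # as_frame needs pandas
    return housing.data, housing.target


def to_dollars(target_value):
    """Convert a target expressed in units of $100,000 to dollars."""
    return target_value * 100_000
```

Calling `X, y = load_housing()` should give 20640 rows with the columns in `FEATURE_NAMES`.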

Exploratory data analysis

  1. Pairwise correlations between the features were computed to check for highly correlated features, so that redundant ones could be removed.
  2. The correlations show that longitude and latitude are highly correlated, so one of them could be removed from the feature list.
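
The correlation check can be sketched roughly as follows. This is an illustration, not the notebook's code: the helper names and the 0.8 threshold are assumptions. `highly_correlated_pairs` accepts the square matrix returned by pandas `DataFrame.corr()` (a dict of dicts also works).

```python
def highly_correlated_pairs(corr, threshold=0.8):
    """List (feature_a, feature_b, r) where |r| exceeds the threshold.

    `corr` is a square correlation matrix indexed as corr[a][b].
    The default threshold of 0.8 is an arbitrary, assumed cut-off.
    """
    names = list(corr)  # column labels of the matrix
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:  # upper triangle only, skips the diagonal
            r = float(corr[a][b])
            if abs(r) > threshold:
                pairs.append((a, b, r))
    return pairs


def housing_correlations():
    """Pearson correlation matrix of the 8 input features (hypothetical helper)."""
    from sklearn.datasets import fetch_california_housing

    features = fetch_california_housing(as_frame=True).data
    return features.corr()  # DataFrame.corr() defaults to Pearson
```

Using the absolute value matters here because a strong negative correlation indicates redundancy just as much as a strong positive one.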

Model fitting and evaluation

  1. Multiple models were evaluated, and their R2 and MSE were compared to select the best model.
  2. The GradientBoostingRegressor model performed best, with the highest R2 (0.773) and the lowest MSE (0.226) of the models evaluated.

| Model | R2 | MSE |
| --- | --- | --- |
| SVR | -0.020689 | 1.017586 |
| LinearRegression | 0.582674 | 0.416057 |
| KNeighborsRegressor | 0.136115 | 0.861259 |
| SGDRegressor | 0.001655 | 0.995310 |
| BayesianRidge | 0.582681 | 0.416051 |
| DecisionTreeRegressor | 0.585701 | 0.413039 |
| GradientBoostingRegressor | 0.772826 | 0.226484 |

  3. Comparison of model predictions with true values

Inference - the predictions show a good linear relationship with the true values, which is consistent with the R2 score.
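
To make the two metrics concrete, here is a small sketch of a comparison loop. The metrics are written out explicitly to show what they measure; `sklearn.metrics.r2_score` and `mean_squared_error` compute the same quantities. The function names, the 80/20 split, the `random_state` values and the choice of two models are all assumptions, so this is not expected to reproduce the reported numbers.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average squared residual."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def compare_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return {name: (R2, MSE)} on the test split."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results[name] = (r2(y_test, y_pred), mse(y_test, y_pred))
    return results


def run_comparison():
    """Example run on the housing data (downloads the data on first use)."""
    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0  # assumed split settings
    )
    models = {
        "LinearRegression": LinearRegression(),
        "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    }
    return compare_models(models, X_train, X_test, y_train, y_test)
```

An R2 of 0 corresponds to always predicting the mean of the true values, so a slightly negative R2, such as the one reported for SVR, means the model did marginally worse than that baseline on the test data.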

Cross validation

To evaluate the GradientBoostingRegressor model further and check for overfitting, cross validation was performed.

  1. Cross validation of the model on the complete dataset with cv = 5 gives lower scores than the fitted model.

| Score 1 | Score 2 | Score 3 | Score 4 | Score 5 |
| --- | --- | --- | --- | --- |
| 0.62413216 | 0.6943188 | 0.71206383 | 0.65481236 | 0.67672756 |

  2. Cross validation of the model on the split dataset gives scores similar to the score of the fitted model.

| Score 1 | Score 2 | Score 3 | Score 4 | Score 5 |
| --- | --- | --- | --- | --- |
| 0.78189507 | 0.78282526 | 0.78389246 | 0.80503452 | 0.80055348 |

Inference - The model needs further tuning so that the scores match in both scenarios.
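
One possible contributor to the gap between the two runs, offered as an assumption rather than a finding of this project: with an integer `cv`, `cross_val_score` uses an unshuffled `KFold` for regressors, so each test fold is a contiguous block of rows, whereas `train_test_split` shuffles by default. If the row order is not random, the contiguous folds can be harder to predict. The sketch below shows the fold boundaries and a way to compare both settings; the helper names and `random_state` values are hypothetical.

```python
def contiguous_folds(n_samples, n_splits=5):
    """Row ranges (start, stop) of the test folds in an unshuffled K-fold split."""
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more row
        folds.append((start, start + size))
        start += size
    return folds


def cross_validate_gbr(X, y, shuffle=False):
    """5-fold R2 scores for GradientBoostingRegressor, with or without shuffling."""
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import KFold, cross_val_score

    # KFold only accepts a random_state when shuffle=True.
    cv = KFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    model = GradientBoostingRegressor(random_state=0)
    return cross_val_score(model, X, y, cv=cv, scoring="r2")
```

For the 20640 samples, `contiguous_folds(20640)` gives five blocks of 4128 rows each. Comparing `cross_validate_gbr(X, y)` with `cross_validate_gbr(X, y, shuffle=True)` would indicate whether row order explains part of the difference.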

Key Takeaways

How to fit a basic ML model and evaluate its performance.

Code

The code is available in a Python notebook, model.ipynb.

Roadmap

  1. Model Exploration
  2. Model Optimization
  3. Hyperparameter Tuning
  4. Exploring Other Ways to Improve Model

Libraries

Language: Python

Packages: Sklearn, Matplotlib, Pandas, Seaborn

Acknowledgements

Resources used

Contact

If you have any feedback or are interested in collaborating, please reach out to me on LinkedIn.

License

MIT License
