Analysis of the Stack Overflow Annual Developer Survey

Git Repo: https://github.com/anandtopu/udacity_data_scientist_nanodegree/

Table of Contents

  1. Installation
  2. Project Motivation
  3. File Descriptions
  4. Technical Details
  5. Conclusions
  6. Licensing and Acknowledgements

Installation

The Jupyter notebook in this project is based on Python 3.7.1. You will need packages such as numpy, pandas, matplotlib, seaborn, pycountry_convert, and jupytab to run it. The pycountry_convert package is used to convert country names to continent names.
After downloading the repository to your local machine, unzip the data.zip file before running the Stack_Overflow_survey_project.ipynb file. (The dataset is zipped because it is too big.)
The file structure should be as follows:

repository/
  Stack_Overflow_survey_project.ipynb
  data.zip
  stackoverflow_ide.twb
  stackoverflow_programming languages.twb
  IDE.xlsx
  language.xlsx
  Stack_Overflow_Survey_Analysis.html
  README.md
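
If you prefer to unzip the data from Python rather than by hand, the step can be sketched with the standard library as below. The function name `extract_dataset` and the choice to extract into the repository root are assumptions for illustration; adjust the target directory to wherever the notebook expects the survey files.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive="data.zip", target_dir="."):
    """Extract the zipped survey data and return the names of the extracted members.

    Hypothetical helper: the README only says data.zip must be unzipped,
    so the default target directory here is an assumption.
    """
    archive_path = Path(archive)
    if not archive_path.is_file():
        raise FileNotFoundError(f"{archive_path} not found; run this from the repository root")
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target_dir)
        return zf.namelist()
```

Calling `extract_dataset()` from the repository root before opening the notebook should leave the survey files next to it.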

Project Motivation

This project uses the Stack Overflow Annual Developer Survey dataset to answer the following questions:

  1. What has been the most popular programming language over the five survey years from 2015 to 2019?
  2. What has been the most popular IDE for professional software developers over the same period?
  3. How much have professional software developers made over the same period? What is the salary increase rate?
  4. How do job satisfaction and salary/compensation vary by education level in 2019?

File Descriptions

  1. Stack_Overflow_Survey_Analysis.ipynb: the main file for data processing, analysis, and visualization.
  2. data.zip: contains the Stack Overflow Annual Developer Survey datasets from 2015 to 2019. These datasets are used for this project.
  3. IDE.xlsx and stackoverflow_ide.twb: Tableau source files to generate the 'most popular IDE' figure.
  4. language.xlsx and stackoverflow_programming languages.twb: Tableau source files to generate the 'most popular languages' figure.

Technical Details

Language and IDE data

To find the most popular programming language, I used the survey data from the most recent five years (2015 to 2019).
To find the most popular IDE, I used only the most recent four years, because the 2015 survey does not include information about IDEs.
To examine the relationship between higher education and salary, I used the 2019 data only.

Salary

The average salary calculation includes only full-time developers.
Salaries differ so much between countries that a single worldwide average for professional software developers is not meaningful. With data from more than 170 countries and dependent territories, calculating an average for every country does not make sense either. So I added a column named 'continent' and calculated the average salary per continent.
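
The continent grouping can be sketched roughly as follows. This is an illustration only, not the notebook's code: the small `COUNTRY_TO_CONTINENT` dictionary is a hand-written stand-in for the lookup that pycountry_convert provides, and the column names `Country` and `Salary` are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the country-to-continent lookup
# that the notebook performs with pycountry_convert.
COUNTRY_TO_CONTINENT = {
    "United States": "North America",
    "Canada": "North America",
    "Germany": "Europe",
    "France": "Europe",
    "India": "Asia",
}


def average_salary_by_continent(df, country_col="Country", salary_col="Salary"):
    """Add a 'continent' column and return the mean salary per continent."""
    out = df.copy()
    # Countries missing from the mapping become NaN and are dropped.
    out["continent"] = out[country_col].map(COUNTRY_TO_CONTINENT)
    out = out.dropna(subset=["continent"])
    return out.groupby("continent")[salary_col].mean()


# Toy data; "Atlantis" is deliberately unmapped.
survey = pd.DataFrame({
    "Country": ["United States", "Canada", "Germany", "France", "India", "Atlantis"],
    "Salary": [100000, 80000, 60000, 50000, 20000, 1],
})
by_continent = average_salary_by_continent(survey)
```

Grouping by continent keeps the number of groups small enough to plot while still separating regions with very different salary levels.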

Missing values

Because the dataset is very large and only a small fraction of rows have no salary information, I removed the rows where salary information is missing.
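
A minimal sketch of this step with pandas, assuming the salary lives in a column named `Salary` (a hypothetical name; the survey files use different column names in different years). It also reports what share of rows was dropped, which is a quick way to confirm that the fraction really is small.

```python
import pandas as pd


def drop_missing_salary(df, salary_col="Salary"):
    """Return only the rows that have a salary value."""
    return df.dropna(subset=[salary_col])


# Toy data with two missing salaries.
survey = pd.DataFrame({
    "Respondent": [1, 2, 3, 4],
    "Salary": [50000.0, None, 72000.0, float("nan")],
})
cleaned = drop_missing_salary(survey)
missing_share = 1 - len(cleaned) / len(survey)
```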

Incorrect values

While examining the dataset in the data assessment step, I observed that many full-time developers report a salary of 0, while others report a salary of 10^30 dollars! These values are obviously incorrect, so I decided to include only salaries within the 0.1 to 0.9 quantile range in the average salary calculation. I believe this is reasonable.
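
The quantile filter can be sketched like this. It is an illustration under assumptions, not the notebook's exact code: the column name `Salary` and the helper name `trim_salary_outliers` are hypothetical.

```python
import pandas as pd


def trim_salary_outliers(df, salary_col="Salary", lower=0.1, upper=0.9):
    """Keep rows whose salary lies between the lower and upper quantiles (inclusive)."""
    low, high = df[salary_col].quantile([lower, upper])
    return df[df[salary_col].between(low, high)]


# Toy data with one zero salary and one absurdly large salary.
survey = pd.DataFrame({
    "Salary": [0, 30000, 40000, 50000, 60000, 70000,
               80000, 90000, 100000, 110000, 1e30],
})
trimmed = trim_salary_outliers(survey)
mean_salary = trimmed["Salary"].mean()
```

One caveat with this approach: if more than 10% of the rows have a salary of 0, the lower quantile is itself 0 and the inclusive comparison keeps those rows, so a strict lower bound may be worth considering.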

Visualization

The two visuals related to salary are generated with matplotlib. The other visuals are generated with Tableau.

Conclusions

  1. The most popular programming languages are JavaScript, HTML/CSS, and SQL.
  2. The most popular IDEs are Visual Studio, Notepad++, and Sublime Text.
  3. North American developers have the highest average salary and a relatively high salary increase rate. Asian developers have the highest increase in salary.
  4. Professionals with advanced degrees clearly earn more than those with only a high school degree and the same years of coding experience. However, as the years of experience increase, the difference becomes smaller.

Blog here.

Licensing and Acknowledgements

Thanks to Stack Overflow for making the survey data available to the general public. The code in this repository is released under the MIT license.
