diff --git a/scripts/README.md b/scripts/README.md
index 396a55ff..2424265e 100644
--- a/scripts/README.md
+++ b/scripts/README.md
@@ -3,4 +3,7 @@
 
 This is probably the most complex folder of all the repository, so I will try to be as detailed as possible. This folder is organized as follows:
 
-- If you are looking for how we extracted documentation data from GitHub, you should look at the `scraper` folder. The `api_scraper.py` file is the main file of this folder, containing the code that requests custom URLs to GitHub API. The file `main.py` presents the whole process of extracting a documentation file, `scrapy.py` shows how to do the URL requets to the `api_scraper.py` module and `validate.py` shows how we validated if a documentation file was valid for qualitative analysis or not. If you want to know how we converted the markdown files to spreadsheets, take a look at `export.py` (noticed that we use cmark-gfm to convert the markdown content to plaintext, which might be a pain if you are not using a system based on Linux). More information about all these files are given as doctstrings.
+- If you are looking for how we extracted documentation data from GitHub, look at the `scraper` folder. The `api_scraper.py` file is the core of this folder, containing the code that requests custom URLs from the GitHub API. The `main.py` file walks through the whole process of extracting a documentation file, `scrapy.py` shows how to issue URL requests through the `api_scraper.py` module, and `validate.py` shows how we checked whether a documentation file was valid for qualitative analysis. If you want to know how we converted the Markdown files to spreadsheets, take a look at `export.py` (note that we use cmark-gfm to convert the Markdown content to plain text, so to run it you will need to build cmark-gfm on your machine). More information about each of these files is given in its docstrings.
+- Inside the `classifier` folder you will find how we performed all the classification steps that led to the final model. The subfolders are meant to be as intuitive as possible: `data_preparation` contains the code we used to prepare the data for classification, `model_selection` covers how we selected the best estimator for our problem, `results_report` holds the scripts used to report on the final model, and `classification` contains the code used to perform the classification itself. If you want to understand the whole process, I recommend starting with the `main.py` file, where I tried to split the stages of this process into clearly named methods.
+
+Don't hesitate to contact me at fronchettl@vcu.edu if you get confused; this was a one-developer job and I know that some parts might be unclear. I did my best.
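
For readers who have never touched the GitHub API, here is a minimal sketch of the kind of request a module like `api_scraper.py` makes. The function name `fetch_readme` and the use of the `requests` library are my own illustration of the pattern, not the repository's actual interface:

```python
import os

import requests

GITHUB_API = "https://api.github.com"


def fetch_readme(owner: str, repo: str) -> str:
    """Download the raw README of a repository through the GitHub API."""
    # Asking for the "raw" media type returns the file contents directly.
    headers = {"Accept": "application/vnd.github.raw"}
    # An access token raises the rate limit from 60 to 5,000 requests/hour.
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"token {token}"

    url = f"{GITHUB_API}/repos/{owner}/{repo}/readme"
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # fail loudly on 404s, rate limits, etc.
    return response.text


if __name__ == "__main__":
    print(fetch_readme("github", "docs")[:200])
```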
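
The cmark-gfm dependency is the part most likely to trip people up. Once the binary is built from GitHub's fork (github.com/github/cmark-gfm) and placed on your `PATH`, a script like `export.py` can shell out to it. The sketch below is my guess at the general pattern, not the exact code in `export.py`:

```python
import subprocess
from pathlib import Path


def markdown_to_plaintext(md_path: Path) -> str:
    """Strip Markdown formatting by piping a file through the cmark-gfm binary."""
    # Assumes a locally built cmark-gfm binary is on the PATH;
    # --to plaintext selects the plain-text renderer of GitHub's fork.
    result = subprocess.run(
        ["cmark-gfm", "--to", "plaintext", str(md_path)],
        capture_output=True,
        text=True,
        check=True,  # raise if the binary is missing or conversion fails
    )
    return result.stdout


if __name__ == "__main__":
    print(markdown_to_plaintext(Path("README.md")))
```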
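
To make the classifier pipeline concrete, here is a compressed sketch of the stages that `main.py` is described as splitting into methods, with comments mapping each step to the subfolder it corresponds to. The scikit-learn estimator and hyperparameter grid are placeholders of my own, not the project's actual model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


def run(documents: list[str], labels: list[str]) -> None:
    # data_preparation: split the labeled documents into train and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        documents, labels, test_size=0.2, random_state=42
    )

    # model_selection: search over candidate hyperparameters for the best estimator.
    pipeline = Pipeline(
        [("tfidf", TfidfVectorizer()), ("clf", LogisticRegression(max_iter=1000))]
    )
    search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
    search.fit(X_train, y_train)

    # results_report: summarize how the final model performs on held-out data.
    # classification: search.predict() is what labels new, unseen documents.
    print(classification_report(y_test, search.predict(X_test)))
```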