scraping-UnB

I made this project to teach myself web scraping in Python, database manipulation, git on the command line, and Markdown, while also gathering information that could be useful to students of the university.

The idea is to parse the entire website of the university where I study (Universidade de Brasília) and gather data about the classes offered this semester: campi, departments, disciplines, classes, teachers, and classrooms. This data is then stored in a database and used to compute some statistics.

To set up the environment for this project, install the following modules:

requests (module for making HTTP requests in a simple way, without dealing with the HTTP protocol directly)

pip install requests

beautifulsoup4 (module for navigating the HTML tree)

pip install beautifulsoup4

lxml (parser for the HTML code)

pip install lxml

mysql-connector-python (module for storing data in and retrieving data from the MySQL database)

pip install mysql-connector-python

The website

The website is organized as a tree:

We must access the root to get the links to the campi pages, then access the campi pages to get the department links, and so on. The leaves of the tree are the discipline pages, where the useful information is found. Because there are only 4 campi, the scrape did not start at the root; it started directly from the campi pages.
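The step from one level of the tree to the next (reading a page and collecting the links to its children) can be sketched as follows. This is only an illustration, not the project's actual code: the markup, the URLs, and the names `LinkCollector` and `extract_links` are hypothetical, and the standard library's `html.parser` is used in place of beautifulsoup4 so the sketch runs without any installed modules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> found on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the child links of one node of the site tree."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical campus page; the real markup and URLs differ.
CAMPUS_PAGE = """
<ul>
  <li><a href="department?cod=1">Department A</a></li>
  <li><a href="department?cod=2">Department B</a></li>
  <li><a name="no-link">Anchor without a link</a></li>
</ul>
"""

department_urls = extract_links(
    CAMPUS_PAGE, "https://example.edu/offer/campus?cod=1"
)
print(department_urls)
```

Applying the same function to each department page would yield the discipline links, and so on down to the leaves.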

This gives a hint of how to explore the website and how to design the database. One thing to keep in mind is that the university server is not prepared for scraping; it does not even have a robots.txt file. So it is good manners to scrape it gently. To that end, I also saved the page URLs in the database so I could access them bit by bit without overloading the servers with requests.
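A minimal sketch of that idea, under stated assumptions: the table name `pages`, the functions `create_queue`, `enqueue`, and `process_batch`, and the example URLs are all hypothetical, and an in-memory SQLite database stands in for MySQL so the example runs on its own. The `fetch` argument is any callable that downloads one page; in this project it would presumably wrap a call to the requests module.

```python
import sqlite3
import time


def create_queue(conn):
    # One row per known page, with a flag recording whether it was visited.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " fetched INTEGER NOT NULL DEFAULT 0)"
    )


def enqueue(conn, urls):
    # INSERT OR IGNORE keeps re-runs from duplicating already-known URLs.
    conn.executemany(
        "INSERT OR IGNORE INTO pages (url) VALUES (?)",
        [(url,) for url in urls],
    )
    conn.commit()


def process_batch(conn, fetch, batch_size=10, delay_seconds=1.0):
    """Fetch up to batch_size pending pages, pausing between requests."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE fetched = 0 ORDER BY url LIMIT ?",
        (batch_size,),
    ).fetchall()
    for index, (url,) in enumerate(rows):
        if index > 0:
            time.sleep(delay_seconds)  # be gentle with the server
        fetch(url)
        # Mark the page as done so an interrupted run can resume later.
        conn.execute("UPDATE pages SET fetched = 1 WHERE url = ?", (url,))
        conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
create_queue(conn)
enqueue(conn, [
    "https://example.edu/a",
    "https://example.edu/b",
    "https://example.edu/c",
])

visited = []  # stand-in for a real download function
first = process_batch(conn, visited.append, batch_size=2, delay_seconds=0)
second = process_batch(conn, visited.append, batch_size=2, delay_seconds=0)
print(first, second, visited)
```

Because progress is stored with each URL, the scrape can be stopped at any point and continued in small batches.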

The database

At first the database is not normalized, because it is more convenient to extract the data following the layout in which the information is presented on the pages, and it looks like this:

Since the database is not updated after the initial load, it is easier to gather the information first and later run some SQL scripts to normalize the database and make it more efficient.
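As an illustration of what such a normalization script might do, the sketch below splits a flat table into two related tables. The schema here is an assumption made for the example (the table names `raw_classes`, `departments`, and `disciplines`, their columns, and the sample rows are hypothetical, not the project's real schema), and SQLite stands in for MySQL so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Flat table as filled by the scraper: the department name is repeated.
CREATE TABLE raw_classes (
    department_name TEXT,
    discipline_name TEXT,
    teacher_name TEXT
);
INSERT INTO raw_classes VALUES
    ('Computer Science', 'Algorithms', 'Teacher A'),
    ('Computer Science', 'Databases', 'Teacher B'),
    ('Mathematics', 'Calculus 1', 'Teacher C');

-- One row per distinct department.
CREATE TABLE departments (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
INSERT INTO departments (name)
    SELECT DISTINCT department_name FROM raw_classes
    ORDER BY department_name;

-- Disciplines reference their department by id instead of by name.
CREATE TABLE disciplines (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department_id INTEGER NOT NULL REFERENCES departments(id)
);
INSERT INTO disciplines (name, department_id)
    SELECT DISTINCT r.discipline_name, d.id
    FROM raw_classes AS r
    JOIN departments AS d ON d.name = r.department_name;
""")

department_count = conn.execute(
    "SELECT COUNT(*) FROM departments"
).fetchone()[0]

# A simple statistic on the normalized data: disciplines per department.
disciplines_per_department = conn.execute(
    "SELECT d.name, COUNT(*) FROM disciplines AS s "
    "JOIN departments AS d ON d.id = s.department_id "
    "GROUP BY d.name ORDER BY d.name"
).fetchall()
print(department_count, disciplines_per_department)
```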

The software

This software is as simple as it could be: there are 3 classes and a main function. Each class parses a different kind of page, and the main function constructs one instance of each class.

Departments: parses the list of departments (as shown in the image), gathering its attributes and storing all of them in the database.
Disciplines: parses the list of disciplines (as shown in the image), gathering its attributes and storing all of them in the database.
Classes: scraping this kind of page (as shown in the image) was quite a challenge because the available data was not well structured: some of it was in images, and some fields appeared only in certain cases, so I had to consider several edge cases to make the scraper work well.
