
Inferring Commuting Routes And Respective Modes Of Transport Through Call Detail Records (Python, PostgreSQL, Google API, ArcGIS)


joelpires/CDRsDataAnalysis


Predicting Commuting Routes And Respective Modes Of Transport Through Call Detail Records

More detailed information about the implementation and the methodology used can be found in this series of articles:
Part 1 | Part 2 | Part 3

The papers mentioned can be seen in detail here:
1st Paper | 2nd Paper

The project in question aims to:

  • Replicate the results of the paper presented at EPIA 2019, namely:

    • Produce graphs and statistics to analyze the impact that varying certain quality parameters of a CDR dataset has on the percentage of users for whom commuting patterns can be inferred.
    • These quality parameters, or features, are: tower density per square kilometer; number of distinct days with call activity; average number of calls the user makes/receives per day; and the regularity with which users make/receive calls.
  • Replicate the results of the paper presented at TRA2020, namely:

    • Produce Shapefiles and statistics showing the probability distribution of different modes of transport used in commuting routes within the city.
    • These modes include: car, bus, metro, train, subway, walking and cycling.
    • Percentages are also produced for the single-mode, multimodal, private, and public solutions adopted by users during their commuting routes.
    • These results require further validation, and the generated shapefiles need to be projected onto a map with the help of ArcGIS or QGIS.
    • In this case, CDRs from PTelecom/Altice Labs dated from 2007 were used, so the results were validated by comparing them with IMTT census data.
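The quality parameters listed above can be computed directly from raw CDRs. A minimal sketch of two of them (distinct active days and average calls per day), assuming a simplified record layout of (user_id, call_date) — the real dataset schema and field names differ:

```python
from collections import defaultdict
from datetime import date

# Hypothetical CDR records: (user_id, call_date).
# The layout is an assumption for illustration only.
cdrs = [
    (1, date(2007, 5, 1)), (1, date(2007, 5, 1)), (1, date(2007, 5, 3)),
    (2, date(2007, 5, 2)),
]

def quality_parameters(records):
    """Per-user quality features: distinct active days and avg calls/day."""
    calls_per_day = defaultdict(lambda: defaultdict(int))
    for user, day in records:
        calls_per_day[user][day] += 1
    features = {}
    for user, days in calls_per_day.items():
        active_days = len(days)                 # distinct days with activity
        total_calls = sum(days.values())
        features[user] = {
            "active_days": active_days,
            "avg_calls_per_day": total_calls / active_days,
        }
    return features

feats = quality_parameters(cdrs)
```

In the repository itself these features are computed in SQL over the full dataset ("2_altice_user_characterization.sql"); this is only the idea in miniature.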

Practically all functions are commented in the code itself.

Prerequisites for running the program:

  • ArcGIS installed
  • Python installed (and PyCharm preferably)
  • PostgreSQL installed (and DataGrip preferably)
  • A Google Cloud Platform account with an API key granting access to the Directions API.

The dataset used belongs to Orange/Altice Labs and is confidential. It was manipulated remotely through a database that belongs to the CISUC Laboratories.
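The Python scripts reach that database through the settings in "Database.ini" (described below). A minimal sketch of how such a file can be parsed — the section and key names here are assumptions, so adjust them to the actual file:

```python
from configparser import ConfigParser

# Example contents of a hypothetical Database.ini; the real file in the
# repo may use different sections/keys.
sample = """\
[postgresql]
host = localhost
port = 5432
database = cdrs
user = analyst
password = secret
"""

parser = ConfigParser()
parser.read_string(sample)          # with a real file: parser.read("Database.ini")
params = dict(parser["postgresql"])

# With psycopg2 installed, the connection would be opened like:
#   import psycopg2
#   conn = psycopg2.connect(**params)
dsn = " ".join(f"{k}={v}" for k, v in params.items())
```

Keeping credentials in an ini file outside version control is what makes the scripts portable between the local machine and the remote CISUC database.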

Files and order of execution:

  • 0. "Database.ini" - File that sets the database connection information so that the Python scripts can access the data.

  • 0. "log.txt" - File that "4_altice_getPossibleRoutes.py" creates and uses to log relevant information during the extraction of each user's possible commuting routes through the Google API.

  • 0. Shapefiles in C:\Users\<your user>\Documents\altice_ODPaths - These shapefiles are created during the execution of "4_altice_getPossibleRoutes.py" and correspond to the possible commuting routes for each user. They can later be opened in ArcGIS or QGIS for proper display.

  • 1. "1_altice_cleaning.sql" - This is the first file to run and probably the most time consuming. It is responsible for cleaning / process data - it eliminates null values, negative data other nonsense data.

  • 2. "1a) altice_oscillations_cases.sql" - File to optionally run. Only gives information on cases that may occur phenomena oscillatory calls and having to remove.

  • 3. "2_altice_user_characterization.sql" - File responsible for creating tables that feature all users of the entire dataset. This characterization involves describing, for example: how many calls they made on average, how many days of different call activity, among other metrics and features extracted.

  • 4. "2a) altice_paper1_insights.sql" - File to run optionally. It's necessary to create the tables for later statistical and graphical analysis performed by "2b) altice_stats_paper1.py" file.

  • 5. "2b) altice_stats_paper1.py" - File to run only optionally. It produces statistical and graphical analysis. It allows you to analyze how the variation of some quality parameters of a CDR dataset impacts the percentage of users from which we can infer commuting patterns.

  • 6. "3_altice_subsampling.sql" - File responsible for subsampling the dataset, so that the inference of commuting routes and modes of transport is faster and more efficient. To make an optimized sub-sampling, it is advised to examine the elements generated by "2b) altice_stats_paper1.py" and choose the best criteria in accordance with the conclusions obtained. (Steps 1, 2 and 3 of Figure 1)

  • 7. "4_altice_getPossibleRoutes.py" - File responsible for making calls to the Google API to collect possible commuting routes for each user as well as to send back such points to tables in PostgreSQL. Shapefiles with these route points are generated. (Steps 4-11 of Figure 1)

  • 8. "5_altice_paper2_final_routes.sql" - Analyzes the possible routes provided by the Google Maps API and, according scoring formulas, assigns a score to each route and chooses the most likely route for each user in your home->workplace route and another route on his way from workplace->home. (Steps 12-17 of Figure 1)

  • 9. "5a) altice_stats_paper2.py" - File to run optionally. Being the most probable routes and respective means of transport associated with them determined, statistics on transport distribution can be calculated. Percentages relative single-mode solutions, multimodal solutions, private and public modes adopted by users during their commuting routes are also produced. (Step 18 of Figure 1)


Figure 1 - General Program Structure
