
Welcome to the wiki!

Database

We are using Supabase to host a PostgreSQL database. Rather than using Supabase's Python client library, we use the SQLAlchemy ORM to insert into and query the database. This keeps the code portable, so we can move to a different host when the time comes.
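To make the pattern concrete, here is a minimal sketch assuming SQLAlchemy 2.x style and a hypothetical parking_codes table; the real models and connection string live in the repo and the credentials doc:

    from sqlalchemy import create_engine, select
    from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, Session

    class Base(DeclarativeBase):
        pass

    class ParkingCode(Base):
        __tablename__ = "parking_codes"  # hypothetical table name
        id: Mapped[int] = mapped_column(primary_key=True)
        state: Mapped[str]
        municipality: Mapped[str]
        url: Mapped[str]

    # Supabase exposes a standard Postgres connection string, so only this
    # line would change if we moved to a different host.
    engine = create_engine("postgresql://user:password@db.example.supabase.co:5432/postgres")

    with Session(engine) as session:
        # insert a row
        session.add(ParkingCode(state="Example State",
                                municipality="Example City",
                                url="https://example.com"))
        session.commit()
        # query rows back
        rows = session.execute(
            select(ParkingCode).where(ParkingCode.state == "Example State")
        ).scalars().all()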

The database credentials are stored in a Google Drive folder. Contact Tung, Wilson, or Tony for read+write access.

Scraping

Running spiders to scrape data

From the top-level web_scraping folder (ls should list scrapy.cfg and another web_scraping folder), run:

scrapy crawl [spider name]

For example:

scrapy crawl searchspider

Crawl results

Depending on the spider, a .json file is created or overwritten in web_crawling/jsons.

Currently:

  • statespider --> states.json (state, url)
  • munispider --> municipalities.json (state, municipality, url)
  • searchspider --> parking_code.json (state, municipality, state_url, parking_code)
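
For reference, a hypothetical parking_code.json entry might look like the following; all values below are placeholders, not real scraped data:

    [
      {
        "state": "Example State",
        "municipality": "Example City",
        "state_url": "https://example.municode.com/example-city",
        "parking_code": "https://example.municode.com/example-city/codes/parking"
      }
    ]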

searchspider

Loops through every entry of municipalities.json and follows the URL for the municipality.

Using scrapy-playwright, for each request the spider (see the sketch after this list):

  1. waits 6 seconds for the page's JavaScript to load
  2. types a keyword into the search bar
  3. presses the "Enter" key
  4. waits 6 seconds for the results to load
  5. sends the results page to parse_search to find the URL containing the parking code
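
A minimal sketch of what one such request might look like, assuming scrapy-playwright's PageMethod API; the search-bar selector, keyword, and JSON path are assumptions, and the real values live in the spider:

    import json
    import scrapy
    from scrapy_playwright.page import PageMethod

    # (scrapy-playwright DOWNLOAD_HANDLERS / TWISTED_REACTOR settings omitted)

    class SearchSpider(scrapy.Spider):
        name = "searchspider"

        def start_requests(self):
            # Loop through every entry of municipalities.json (path assumed)
            with open("web_crawling/jsons/municipalities.json") as f:
                municipalities = json.load(f)
            for entry in municipalities:
                yield scrapy.Request(
                    entry["url"],
                    meta={
                        "playwright": True,
                        "playwright_page_methods": [
                            PageMethod("wait_for_timeout", 6000),                 # 1. wait for JS
                            PageMethod("fill", "input[type=search]", "parking"),  # 2. type keyword (selector/keyword assumed)
                            PageMethod("press", "input[type=search]", "Enter"),   # 3. press Enter
                            PageMethod("wait_for_timeout", 6000),                 # 4. wait for results
                        ],
                    },
                    callback=self.parse_search,  # 5. results page goes to parse_search
                )

        def parse_search(self, response):
            # Currently extracts the first result link (see "To resolve" below)
            yield {"parking_code": response.css("a::attr(href)").get()}

The page methods run in order on the live page before the rendered response reaches the callback.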

To resolve:

  • how to identify the correct link containing the parking code (currently we extract the first link)
  • handling municipalities whose URL redirects to a site that is not Municode
  • handling keywords that return no results