budget-explorer

### Workflow Overview

1. Retrieve the Budget Estimates PDF files from the government budget website.
2. Convert the PDFs to text files by running the shell script convert.sh (PDF -> TXT).
3. Extract the figures from the text files into CSV files by running the Python script parse.py (TXT -> CSV).
4. Upload the CSV files to Google Spreadsheet and cleanse the data.
5. Export the data as JSON through the Google Spreadsheet JSON API.
6. Import the JSON data into Elasticsearch.
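
The TXT -> CSV extraction step can be pictured with a minimal Python sketch. It is not the actual parse.py: the line layout (a text label followed by one or more figures), the `FIGURE` pattern, and the function names `parse_line` and `text_to_csv` are all assumptions made for illustration, and real converted PDFs will likely need more handling.

```python
import csv
import io
import re

# Assumed layout: a label followed by figures, e.g. "Salaries  1,000  2,000.5".
# This pattern is hypothetical and not taken from parse.py.
FIGURE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_line(line):
    """Split one text line into (label, [figures]); return None if it does not fit."""
    tokens = line.split()
    figures = []
    # Walk from the right so that digits inside the label are left alone.
    while tokens and FIGURE.fullmatch(tokens[-1]):
        figures.append(tokens.pop().replace(",", ""))
    if not figures or not tokens:
        return None
    figures.reverse()
    return " ".join(tokens), figures


def text_to_csv(text):
    """Convert the text of one converted PDF into CSV text, one row per matching line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        parsed = parse_line(line)
        if parsed:
            label, figures = parsed
            writer.writerow([label, *figures])
    return out.getvalue()
```

Lines without trailing figures (headings, notes, blank lines) are skipped, so the spreadsheet cleansing step would still be needed for anything the pattern misreads.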

The Chinese-language PDFs are used for processing.

Running `convert_batch.sh 021` downloads the PDFs for head 021 (CEO Office) and parses them.

Download the data (see https://code4hk.hackpad.com/CODE4HK-Budget-Hackathon-4Sgfyk51g5m) and extract it to raw/csv/.
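
For the final import into Elasticsearch, a rough sketch of building a bulk request body is shown here. It makes several assumptions: it reads CSV text directly instead of the JSON exported from Google Spreadsheet, the CSV has a header row, and the index name `budget` and the function name `rows_to_bulk` are hypothetical. Older Elasticsearch releases may additionally expect a `_type` in each action line.

```python
import csv
import io
import json


def rows_to_bulk(csv_text, index_name="budget"):
    """Build a newline-delimited bulk body from CSV text that has a header row.

    index_name is a hypothetical default, not a name used by this project.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # ensure_ascii=False keeps Chinese text readable in the payload.
        lines.append(json.dumps(row, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "".join(line + "\n" for line in lines)
```

The returned string could then be sent to the cluster's `_bulk` endpoint with any HTTP client; reading a file from raw/csv/ and posting the result is left out because the file names and host are not specified here.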

### Environment Setup

cd docker-host-vm && vagrant up

cd .. && vagrant up --provider=docker