This repository has been archived by the owner on Jan 30, 2022. It is now read-only.

figure out how to make this stuff machine readable #13

Open
hypatia opened this issue Jun 26, 2014 · 7 comments

Comments

@hypatia
Member

hypatia commented Jun 26, 2014

here's the CSV/XML format: http://www.eeoc.gov/employers/eeo1survey/eeo1_cvs_specifications.cfm

ASCII/text format: http://www.eeoc.gov/employers/eeo1survey/ee1_datafile_2013.cfm

@jebeck

jebeck commented Jun 27, 2014

Happy to help out on this, but is YAML the best format? D3, for example, provides data loading functions for CSV/TSV, XML, and JSON blobs natively, but not YAML.

@hypatia
Member Author

hypatia commented Jun 27, 2014

@jebeck I have no strong feelings about this - whatever you think would be best! Could you have a glance at the existing CSV/XML stuff I linked to and see if we could just work with that directly?

@hypatia changed the title from "Make a simplified yaml data format for the EEO-1 data for ease of visualization usage" to "figure out how to make this stuff machine readable" on Jun 27, 2014

@jebeck

jebeck commented Jun 27, 2014

I glanced at the CSV spec, and it looks pretty terrible (unlike kinds of information are spread across rows instead of columns, which is generally not friendly data design). The XML spec may be better, but I also wonder how often companies choose to submit in this form. In any case, I'm happy to take on the task of writing some tools to translate between the official specs and a simplified format (I'd argue for JSON). We should chat about tools: we might be able to keep it all client-side and do JavaScript, or we could do Python.
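
As a rough illustration of the kind of translation tool meant here, this is a minimal Python sketch. It assumes a hypothetical row-oriented CSV with `company`, `job_category`, `group`, and `count` columns. Those names and the sample values are made up for illustration and are not taken from the EEOC spec. The function `rows_to_nested` is likewise just a placeholder name.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical row-oriented input: one row per (company, job category, group).
# Column names and values are illustrative only, NOT the official EEO-1 layout.
SAMPLE_CSV = """company,job_category,group,count
ExampleCo,Professionals,women,12
ExampleCo,Professionals,men,30
ExampleCo,Technicians,women,4
"""


def rows_to_nested(csv_text):
    """Pivot long-format rows into {company: {job_category: {group: count}}}."""
    nested = defaultdict(lambda: defaultdict(dict))
    for row in csv.DictReader(io.StringIO(csv_text)):
        nested[row["company"]][row["job_category"]][row["group"]] = int(row["count"])
    # Convert the defaultdicts to plain dicts so the result compares and
    # serializes like ordinary data.
    return {company: dict(cats) for company, cats in nested.items()}


if __name__ == "__main__":
    print(json.dumps(rows_to_nested(SAMPLE_CSV), indent=2))
```

The nested shape keeps each company's numbers together in one object, which tends to be easier to feed to a visualization than many loosely related rows. The real mapping would depend on whatever the official spec actually contains.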

@hougs
Contributor

hougs commented Jun 29, 2014

Do we have access to the EEO-1 CSV or XML files? I think setting up automated tooling to transform these files may be quite a bit of effort in order to parse a small amount of data that is getting updated and added at a relatively slow pace.

I don't mean to suggest that we shouldn't do this, but I would like to propose a few alternatives that I think may get us human & machine readable data faster. I think the following strategies may get us a win (that admittedly isn't particularly flexible) in a short amount of time:

  • Enter data by hand (I think this would take less time than making something automated. Issue #22, "Data in spreadsheet format", already has most of the available data, though I think we need to add the leadership demographic breakdown.)
  • Use a PDF parsing tool like Tabula (thanks @ameliagreenhall for pointing this out)

Also, +1 on aiming for JSON as the data format we keep in this repo. It is machine readable, and human readable & editable.

@hougs
Contributor

hougs commented Jun 29, 2014

I saw on the Double Union mailing list that another goal is to advocate for a standard data format to release diversity data in. This seems related to, but not necessarily dependent on, making the currently released data machine readable. Perhaps we could make another ticket for it?

@jebeck

jebeck commented Jun 29, 2014

AFAIK, @jhlch, we don't have access to the CSV and/or XML data. I think each corporation gets to decide how they want to submit the data (see the links @hypatia pasted opening this issue), and I doubt we're going to have very many, if any, of them releasing the data in these formats.

Given the very small size of these datasets (at least compared to some of the data I'm used to working with...), I think transcription won't be a completely heinous task, unless we start getting 1000s of companies to release data(!!!). (Take note that many of the "PDFs" submitted so far are actually images of a PDF, so something like Tabula isn't going to help much.) Another possibility I'd like to try is setting up a client-side GUI app for transcription; we should be able to leverage the download attribute in browsers that support it to let a transcriber download the results of the form and send it in. Does that sound like a good idea to anyone else or just me? ;)

All in all, my proposal is the following path (and yeah, these should be split out into separate tickets if there's consensus):

  • write a simple JSON Schema to document the standard format (this will mean we can easily validate crowd-sourced transcriptions)
  • make a client-side GUI for transcription (basically just a big form, probably laid out in the same way that the PDFs are laid out, so transcription is dead easy)
  • if desired, write some code to translate between data formats, so that if we have a standard JSON we can also allow downloading the CSV or XML gov't spec; I don't think this is necessary, it'd just be cool :)
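
To make the first bullet a bit more concrete, here is a minimal Python sketch of validating a transcription against a small schema. Everything in it is an assumption for illustration: the field names (`company`, `year`, `counts`), the `SCHEMA` contents, and the hand-rolled `validate` helper, which understands only the handful of JSON Schema keywords used here. A real setup would more likely lean on an existing validator library such as `jsonschema` rather than this toy checker.

```python
# A tiny schema written in JSON Schema vocabulary.
# Field names are hypothetical, not an agreed-upon format.
SCHEMA = {
    "type": "object",
    "required": ["company", "year", "counts"],
    "properties": {
        "company": {"type": "string"},
        "year": {"type": "integer"},
        "counts": {
            "type": "object",
            # Every entry under "counts" must be a non-negative integer.
            "additionalProperties": {"type": "integer", "minimum": 0},
        },
    },
}

TYPES = {"object": dict, "string": str, "integer": int}


def validate(value, schema, path="$"):
    """Return a list of error strings; supports only the keywords used above."""
    expected = TYPES[schema["type"]]
    # bool is a subclass of int in Python, so reject it explicitly for integers.
    if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
        return [f"{path}: expected {schema['type']}"]
    errors = []
    if "minimum" in schema and value < schema["minimum"]:
        errors.append(f"{path}: must be >= {schema['minimum']}")
    if expected is dict:
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required field '{key}'")
        props = schema.get("properties", {})
        extra = schema.get("additionalProperties")
        for key, item in value.items():
            # Named properties win; other keys fall back to additionalProperties.
            sub = props.get(key, extra)
            if isinstance(sub, dict):
                errors.extend(validate(item, sub, f"{path}.{key}"))
    return errors


if __name__ == "__main__":
    submission = {"company": "ExampleCo", "counts": {"women": -1}}
    for problem in validate(submission, SCHEMA):
        print(problem)
```

Returning a list of messages instead of raising on the first problem means a transcriber could see every issue with a submission at once.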

I've got some vacation coming up this week, and I've got some other projects to work on as well, but I could definitely do the JSON Schema proposal, maybe get a start on a simple transcription form.

@hougs
Contributor

hougs commented Jun 30, 2014

I remembered that we have a Gmail account for open diversity data. I bet we could make a Google Form and get a spreadsheet auto-populated in Google Docs. This may be an option to consider as a client-side GUI for crowdsourcing the transcription of the PDF data. Just a thought.


3 participants