
output_checker


output_checker is a tool that allows analysts to check whether their code and outputs are safe to export from a secure environment.

End Goal

  • Check some repo to ensure it's ok to remove from a secure environment
    • flag any large files
      • function to calculate file size
    • flag any long files
      • function to calculate file length (both the size and length helpers are sketched after this list)
    • flag any CSVs, or other known data files
    • flag any images, artifacts etc.
    • scan files for hardcoded data structures, e.g. embedded data, tokens
    • scan files for "entities" e.g. names, or other identifiers
    • scan notebooks to ensure they have had outputs cleared
  • Check "outputs" to ensure they are ok to remove from a secure environment
    • make sure "data" outputs, e.g. CSV/parquet, etc. follow disclosure control rules
    • scan output files for entities
    • flag any long files
    • flag any large files
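
As a first pass, the size and length checks could be small helpers like the sketch below. The 10 MB limit, 1,000-line limit, and extension list are placeholder assumptions, not agreed rules:

import os
from pathlib import Path

# Placeholder thresholds - the real limits would be agreed with the
# secure environment's output-checking policy.
MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024   # assumed 10 MB limit
MAX_FILE_LINES = 1_000                   # assumed 1,000-line limit
DATA_EXTENSIONS = {".csv", ".parquet", ".xlsx", ".feather"}  # assumed list

def file_size(path: Path) -> int:
    """Return the size of a file in bytes."""
    return os.path.getsize(path)

def file_length(path: Path) -> int:
    """Return the number of lines in a file."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def flag_file(path: Path) -> list[str]:
    """Return a list of problem labels for a single file."""
    problems = []
    if file_size(path) > MAX_FILE_SIZE_BYTES:
        problems.append("large file")
    if path.suffix.lower() in DATA_EXTENSIONS:
        problems.append("known data file")
    # only line-count text-like files; known data files are already flagged
    elif file_length(path) > MAX_FILE_LINES:
        problems.append("long file")
    return problems

Keeping each check as its own small function means the later, repo-wide checks can just compose them.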

This will work by:

  • pointing a checking function at a given directory and indicating whether it contains code or data files
    • a function which scans all files in a directory and applies the necessary checks
  • this will return a table detailing each file in which problems were found, and which problems they were
    • functionality which returns all failures in tabular form (sketched after this list)
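
A minimal sketch of that scanning function, reusing the per-file flag_file helper sketched earlier; returning a pandas DataFrame is an assumption about the eventual tabular interface:

from pathlib import Path
import pandas as pd

def check_directory(directory: str) -> pd.DataFrame:
    """Apply every per-file check under a directory and tabulate failures."""
    rows = []
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        for problem in flag_file(path):  # flag_file as sketched above
            rows.append({"file": str(path), "problem": problem})
    return pd.DataFrame(rows, columns=["file", "problem"])

# e.g. check_directory("my_analysis_repo") might return:
#                         file          problem
# 0  my_analysis_repo/data.csv  known data file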

This repo will also contain test cases to show it works correctly. E.g.

  • example code files with and without hardcoded data structures, entities, etc.
  • example data files with and without entities, and which pass and fail disclosure control, etc.
  • for large and long files, functions which can generate them on the fly (a sketch follows this list)
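
Generating these files on the fly keeps bulky test data out of the repo; a sketch, with arbitrary default sizes:

from pathlib import Path

def make_large_file(path: Path, size_bytes: int = 20 * 1024 * 1024) -> Path:
    """Write a file of exactly size_bytes for size-check tests."""
    with open(path, "wb") as f:
        f.seek(size_bytes - 1)  # sparse write: seek to the end, then
        f.write(b"\0")          # write one byte to fix the file size
    return path

def make_long_file(path: Path, n_lines: int = 5_000) -> Path:
    """Write a text file with n_lines lines for length-check tests."""
    with open(path, "w") as f:
        for i in range(n_lines):
            f.write(f"line {i}\n")
    return path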

Components

This will contain functions to do a few different checks for both data outputs and code.

data outputs:

code outputs:

  • ToDo: Large files
  • ToDo: Files which are too long
  • ToDo: Entity Recognition
  • ToDo: Embedded tables
  • ToDo: notebooks without cleared outputs (a sketch of this check follows)
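
As an example, the notebook check can read the .ipynb JSON directly and flag any code cell that still carries outputs or an execution count. This sketch assumes the standard nbformat 4 layout; the nbformat package would be a more robust alternative:

import json
from pathlib import Path

def notebook_has_outputs(path: Path) -> bool:
    """True if any code cell retains outputs or an execution count."""
    notebook = json.loads(Path(path).read_text(encoding="utf-8"))
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        if cell.get("outputs") or cell.get("execution_count") is not None:
            return True
    return False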

Initially, we need to make simple functions which address the above; then we can build other functions which apply them to multiple files.
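
One such simple function, for the embedded-tables check, could walk a Python file's AST and flag suspiciously large literals; the 20-element cut-off below is an arbitrary assumption:

import ast
from pathlib import Path

EMBEDDED_LITERAL_LIMIT = 20  # assumed cut-off for "suspiciously large"

def find_embedded_data(path: Path) -> list[int]:
    """Return line numbers of large literal collections in a Python file."""
    tree = ast.parse(Path(path).read_text(encoding="utf-8"))
    lines = []
    for node in ast.walk(tree):
        if (isinstance(node, (ast.List, ast.Tuple, ast.Set))
                and len(node.elts) > EMBEDDED_LITERAL_LIMIT):
            lines.append(node.lineno)
        elif isinstance(node, ast.Dict) and len(node.keys) > EMBEDDED_LITERAL_LIMIT:
            lines.append(node.lineno)
    return lines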

A core part of the use case is that people can be alerted to which issues exist, and in which files.

It does this by running a few simple checks. The envisaged workflow is:

graph LR
    A[Statistical Disclosure Control] --> B{Values below a threshold?};
    A --> C{Values not rounded?};
    B -->|Pass| D[Ok to output];
    B -->|Fail| E[Identifies problems];
    C -->|Pass| D[Ok to output];
    C -->|Fail| E[Identifies problems];

    click A "https://github.com/SamHollings/output_checker/tree/main/src/disclosure_control_check" "Disclosure Control code" _blank
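A sketch of those two disclosure-control checks for a tabular output, assuming pandas and the common "suppress counts under 5, round to the nearest 5" convention; the real thresholds depend on the secure environment's policy:

import pandas as pd

SMALL_COUNT_THRESHOLD = 5  # assumed disclosure threshold
ROUNDING_BASE = 5          # assumed rounding base

def disclosure_problems(df: pd.DataFrame, count_column: str) -> list[str]:
    """Return disclosure-control problems found in one count column."""
    counts = df[count_column]
    problems = []
    if ((counts > 0) & (counts < SMALL_COUNT_THRESHOLD)).any():
        problems.append(f"values below threshold in '{count_column}'")
    if (counts % ROUNDING_BASE != 0).any():
        problems.append(f"values not rounded to nearest {ROUNDING_BASE} in '{count_column}'")
    return problems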

Getting started

To start using this project, first make sure your system meets its requirements.

Requirements

Contributors have some additional requirements!

To install the Python requirements, open your terminal and enter:

pip install -r requirements.txt

Required secrets and credentials

To run this project, you need a .secrets file with secrets/credentials stored as environment variables. The secrets/credentials should have the following environment variable name(s):

Secret/credential | Environment variable name | Description
------------------|---------------------------|--------------------------------------------
Secret 1          | SECRET_VARIABLE_1         | Plain English description of Secret 1.
Credential 1      | CREDENTIAL_VARIABLE_1     | Plain English description of Credential 1.

Once you've added them, load these environment variables using .env.
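
For example, assuming the python-dotenv package (commonly used in govcookiecutter-based projects), the variables can be loaded like this:

import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv(".secrets")  # read environment variables from the .secrets file

secret_1 = os.environ["SECRET_VARIABLE_1"]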

Licence

Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation. The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.

Contributing

If you want to help us build and improve output_checker, view our contributing guidelines.

Acknowledgements

This project structure is based on the govcookiecutter template project.