Test lists v1.5 #1720

Open
wants to merge 11 commits into
base: master
5 changes: 4 additions & 1 deletion .github/workflows/check-all-lists.yml
@@ -11,5 +11,8 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v2
+      - name: Install hatch
+        run: pip install hatch
+
       - name: Run lint lists against all lists
-        run: python scripts/lint-lists.py lists/
+        run: hatch run lint-lists lists/
102 changes: 102 additions & 0 deletions docs/spec.md
@@ -0,0 +1,102 @@
## Test Lists v1 data format

This section outlines the current data format of the testing lists.

Ideally, this spec would also be enriched with notes on the existing pain
points and the current limitations of the format.

### v1 data format

The testing lists are split into CSV files, which are named as follows:
* `global.csv` for the testing list that applies to all countries
* `[country_code].csv` for country-specific lists, where `country_code` is the
lowercase
[ISO 3166](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes)
alpha-2 country code. The only exception is the `cis` code, which is used
for the Commonwealth of Independent States nations.
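
The naming rule above could be checked with something like the following minimal sketch. The helper name `list_scope` is hypothetical and not part of the repository, and the check only looks at the shape of the file name; it does not verify that a two-letter code is actually assigned in ISO 3166.

```python
import re
from pathlib import Path

# Hypothetical helper (an assumption, not code from the repository):
# classify a list file by its name, following the naming rule in the spec.
_ALPHA2 = re.compile(r"[a-z]{2}")


def list_scope(path: str) -> str:
    """Return 'global', 'cis', or the uppercase alpha-2 code for a list file."""
    p = Path(path)
    if p.suffix != ".csv":
        raise ValueError(f"not a CSV list file: {path}")
    stem = p.stem
    if stem in ("global", "cis"):
        return stem
    if _ALPHA2.fullmatch(stem):
        # Shape check only: two lowercase ASCII letters.
        return stem.upper()
    raise ValueError(f"unexpected list file name: {path}")


print(list_scope("lists/it.csv"))
print(list_scope("lists/global.csv"))
```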

Each CSV file contains the following columns:

* `url` - Full URL of the resource, which must match the following regular expression:
```
re.compile(
r'^(?:http)s?://' # http:// or https://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
r'(?::\d+)?' # optional port
r'(?:/?|[/?]\S+)$', re.IGNORECASE)
```
* `category_code` - Category code (see current category codes)
* `category_description` - Description of the category
* `date_added` - [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) date on which the URL was added to the list, in the format `YYYY-MM-DD`
* `source` - opaque string representing the name of the person that added it to the list
* `notes` - opaque string with notes about this URL
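
As a rough illustration of the column rules above, the sketch below validates the `url` and `date_added` fields of each row. The URL pattern is copied from the spec; the names `VALID_URL`, `V1_COLUMNS`, `is_valid_date` and `row_errors`, as well as the sample rows, are assumptions made for this example and are not taken from the repository's lint script.

```python
import csv
import io
import re
from datetime import datetime

# URL pattern copied verbatim from the spec.
VALID_URL = re.compile(
    r'^(?:http)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

V1_COLUMNS = ["url", "category_code", "category_description",
              "date_added", "source", "notes"]

_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_date(value: str) -> bool:
    """Accept only zero-padded YYYY-MM-DD strings that name a real date."""
    if not _DATE_SHAPE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def row_errors(row: dict) -> list:
    """Return a list of human-readable problems found in one CSV row."""
    errors = []
    if not VALID_URL.match(row.get("url") or ""):
        errors.append("invalid url")
    if not is_valid_date(row.get("date_added") or ""):
        errors.append("invalid date_added")
    return errors


# Made-up sample data: one well-formed row and one broken row.
sample = (
    "url,category_code,category_description,date_added,source,notes\n"
    "https://example.org/,NEWS,News Media,2020-01-31,example,\n"
    "example.org,NEWS,News Media,31/01/2020,example,\n"
)
reader = csv.DictReader(io.StringIO(sample))
results = [row_errors(row) for row in reader]
print(results)
```

The shape check in front of `strptime` is a deliberate choice: `strptime` alone accepts unpadded months and days, which the `YYYY-MM-DD` format in the spec does not allow.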

### v1 category codes

* Alcohol & Drugs,ALDR
* Religion,REL
* Pornography,PORN
* Provocative Attire,PROV
* Political Criticism,POLR
* Human Rights Issues,HUMR
* Environment,ENV
* Terrorism and Militants,MILX
* Hate Speech,HATE
* News Media,NEWS
* Sex Education,XED
* Public Health,PUBH
* Gambling,GMB
* Anonymization and circumvention tools,ANON
* Online Dating,DATE
* Social Networking,GRP
* LGBT,LGBT
* File-sharing,FILE
* Hacking Tools,HACK
* Communication Tools,COMT
* Media sharing,MMED
* Hosting and Blogging Platforms,HOST
* Search Engines,SRCH
* Gaming,GAME
* Culture,CULTR
* Economics,ECON
* Government,GOVT
* E-commerce,COMM
* Control content,CTRL
* Intergovernmental Organizations,IGO
* Miscellaneous content,MISC
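
Since each entry above is a `description,code` pair, a linter could load the list into a lookup table and check that the `category_code` and `category_description` columns of a row agree. The following is only a sketch under that assumption: it uses three of the entries above, and the names `load_categories` and `category_matches` are hypothetical.

```python
import csv
import io

# A small subset of the category list above, kept in its "description,code" form.
CATEGORY_LINES = """\
Alcohol & Drugs,ALDR
Human Rights Issues,HUMR
News Media,NEWS
"""


def load_categories(text: str) -> dict:
    """Map category code -> description from 'description,code' lines."""
    return {code: description
            for description, code in csv.reader(io.StringIO(text))}


CATEGORIES = load_categories(CATEGORY_LINES)


def category_matches(row: dict) -> bool:
    """True when the row's code is known and its description matches it."""
    return CATEGORIES.get(row["category_code"]) == row["category_description"]


print(category_matches(
    {"category_code": "NEWS", "category_description": "News Media"}))
```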

## v1.5 data format

The goal of the v1.5 data format is to define an incremental set of changes to
the list format, so that changes from upstream can be backported relatively
easily while we work on fully migrating to the new format.

Ideally it would only add new columns, without any dramatic changes, to
minimize the likelihood of conflicts when changes are merged from upstream.

The columns of a v1.5 CSV file are:

* `url` - Full URL of the resource
* `category_code` - Category code (see current category codes)
* `category_description` - Description of the category
* `date_added` - ISO timestamp of when it was added
* `source` - string representing the name of the person that added it
* `notes` - a JSON string representing metadata for the URL (see URL Meta below)

Review comment (PR author): TODO: add a note about the quoting format, and about the fact that the JSON format is detected by peeking at the first byte, which should be `{`.


### URL Meta

URL meta is a JSON-encoded column that carries metadata about a URL that is
relevant to analysts performing data analysis.

It should be extensible without adding new columns, since adding or changing
columns can break CSV parsers.

This field is optional: parsers should not expect it to be present, nor to
contain any of the specific keys defined below.

Defined keys:
* `notes`: value coming from the existing notes column
* `context_*`: values representing context that's specific to the URL
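
A parser for the `notes` column might look like the sketch below. It follows the detection rule mentioned in the review comment (a value whose first byte is `{` is treated as JSON) and otherwise falls back to treating the value as a legacy free-text note. The function names, the `context_language` key, and the use of Python's default `csv` quoting are all assumptions for illustration; the spec does not yet define the quoting format.

```python
import csv
import io
import json


def parse_url_meta(notes: str) -> dict:
    """Interpret a v1.5 `notes` cell as URL meta (illustrative sketch)."""
    if not notes:
        # The field is optional, so an empty cell means "no metadata".
        return {}
    if notes.startswith("{"):
        # JSON is detected by peeking at the first character.
        return json.loads(notes)
    # Anything else is treated as a legacy v1 free-text note.
    return {"notes": notes}


def encode_url_meta(meta: dict) -> str:
    """Serialize URL meta to a JSON string suitable for the `notes` cell."""
    return json.dumps(meta, sort_keys=True)


# Round trip through the csv module, which quotes the JSON cell as needed.
# `context_language` is a made-up example of a `context_*` key.
meta_in = {"notes": "old note", "context_language": "en"}
buf = io.StringIO()
csv.writer(buf).writerow(
    ["https://example.org/", "NEWS", "News Media", "2020-01-31",
     "example", encode_url_meta(meta_in)]
)
row = next(csv.reader(io.StringIO(buf.getvalue())))
meta_out = parse_url_meta(row[5])
print(meta_out)
```

Keeping the legacy fallback means v1 rows whose notes are plain text can be read by the same code path while the migration is in progress.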

65 changes: 65 additions & 0 deletions pyproject.toml
@@ -0,0 +1,65 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "test-lists"
dynamic = ["version"]
description = ''
readme = "README.md"
requires-python = ">=3.8"
license = "MPL-2.0"
keywords = []
authors = [{ name = "Arturo Filastò", email = "[email protected]" }]
classifiers = [
"Development Status :: 4 - Beta",
"Programming Language :: Python",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy",
]
dependencies = []

[project.urls]
Documentation = "https://github.com/ooni/test-lists#readme"
Issues = "https://github.com/ooni/test-lists/issues"
Source = "https://github.com/ooni/test-lists"

[tool.hatch.version]
path = "src/test_lists/__about__.py"

[tool.hatch.envs.default]
dependencies = ["coverage[toml]>=6.5", "pytest", "click"]
path = ".venv/"

[tool.hatch.envs.default.scripts]
lint-lists = "python -m test_lists.cli lint-lists {args}"
test = "pytest {args:tests}"
test-cov = "coverage run -m pytest {args:tests}"
cov-report = ["- coverage combine", "coverage report"]
cov = ["test-cov", "cov-report"]

[[tool.hatch.envs.all.matrix]]
python = ["3.8", "3.9", "3.10", "3.11", "3.12"]

[tool.hatch.envs.types]
dependencies = ["mypy>=1.0.0"]
[tool.hatch.envs.types.scripts]
check = "mypy --install-types --non-interactive {args:src/test_lists tests}"

[tool.coverage.run]
source_pkgs = ["test_lists", "tests"]
branch = true
parallel = true
omit = ["src/test_lists/__about__.py"]

[tool.coverage.paths]
test_lists = ["src/test_lists", "*/test-lists/src/test_lists"]
tests = ["tests", "*/test-lists/tests"]

[tool.coverage.report]
exclude_lines = ["no cov", "if __name__ == .__main__.:", "if TYPE_CHECKING:"]