Update README.md
Add using your own data + github release instructions
jelmervdl authored Jan 7, 2024
1 parent 9b84521 commit 8d5c4a2
Showing 1 changed file with 10 additions and 0 deletions.
10 changes: 10 additions & 0 deletions README.md
@@ -41,6 +41,14 @@ Filter each individual dataset, showing you the results immediately.
Compare the dataset at different stages of filtering to see what the impact is of each filter.
[<img src="https://github.com/hplt-project/OpusCleaner/raw/main/.github/screenshots/diff-filter-output.png" width="100%">](https://github.com/hplt-project/OpusCleaner/blob/main/.github/screenshots/diff-filter-output.png)

### Using your own data
OpusCleaner scans for datasets and picks them up automatically if they are in the expected format. OPUS data that you download is converted to this format, and you can add your own data in the same format.

By default, it scans for files matching `data/train-parts/*.*.gz` and derives which files make up a dataset from the filenames: `name.en.gz` and `name.de.gz` form a dataset called _name_. The files use the standard Moses format: one sentence per line, with line N of the first file corresponding to line N of the second file.
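
As a rough illustration of that layout, the following Python sketch writes a list of sentence pairs as two line-aligned gzip files. The helper names `write_parallel_dataset` and `count_lines` are hypothetical and not part of OpusCleaner; the sketch only assumes the `name.<lang>.gz` naming and the one-sentence-per-line format described above.

```python
import gzip
from pathlib import Path


def write_parallel_dataset(directory, name, pairs, src_lang="en", trg_lang="de"):
    """Write sentence pairs as two line-aligned files named name.<lang>.gz.

    Hypothetical helper for illustration; not part of OpusCleaner.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    src_path = directory / f"{name}.{src_lang}.gz"
    trg_path = directory / f"{name}.{trg_lang}.gz"
    with gzip.open(src_path, "wt", encoding="utf-8", newline="\n") as src, \
         gzip.open(trg_path, "wt", encoding="utf-8", newline="\n") as trg:
        for src_sentence, trg_sentence in pairs:
            # One sentence per line: a line break inside a sentence would
            # shift every following line and break the alignment.
            src.write(src_sentence.replace("\r", " ").replace("\n", " ") + "\n")
            trg.write(trg_sentence.replace("\r", " ").replace("\n", " ") + "\n")
    return src_path, trg_path


def count_lines(path):
    """Count the lines in a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


if __name__ == "__main__":
    src, trg = write_parallel_dataset(
        "data/train-parts", "name", [("Hello world", "Hallo Welt")]
    )
    # Both files must have the same number of lines to stay aligned.
    assert count_lines(src) == count_lines(trg)
```

Checking that both files have the same line count is a cheap sanity test before pointing OpusCleaner at a new dataset.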

When in doubt, download one of the OPUS datasets through OpusCleaner and replicate its format for your own dataset.

To use another path, set the `DATA_PATH` environment variable, e.g. run `DATA_PATH="./my-datasets/*.*.gz" opuscleaner-server`.
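
To preview which datasets a given glob would yield, a small Python sketch like the one below can help. It is an approximation of the naming rule described above, not OpusCleaner's actual scanning code: `find_datasets` is a hypothetical helper, and it assumes every matched file ends in `.gz` with the language code as the last dot-separated part before it.

```python
import os
from collections import defaultdict
from glob import glob
from pathlib import Path

# Default taken from the README; DATA_PATH overrides it when set.
DEFAULT_DATA_PATH = "data/train-parts/*.*.gz"


def find_datasets(pattern=None):
    """Group files matching the glob into {dataset name: {language: path}}.

    Hypothetical helper for illustration; not OpusCleaner's implementation.
    """
    if pattern is None:
        pattern = os.environ.get("DATA_PATH", DEFAULT_DATA_PATH)
    datasets = defaultdict(dict)
    for filename in sorted(glob(pattern)):
        # "name.en.gz" -> stem "name.en" -> name "name", language "en".
        # Any extra dots stay in the dataset name ("a.b.en.gz" -> "a.b").
        stem = Path(filename).name[: -len(".gz")]
        name, _, lang = stem.rpartition(".")
        datasets[name][lang] = filename
    return dict(datasets)


if __name__ == "__main__":
    for name, files in find_datasets().items():
        print(name, sorted(files))
```

Running it with the same `DATA_PATH` value you intend to pass to `opuscleaner-server` shows how the files would be grouped under this reading of the naming rule.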

### Paths
- `data/train-parts` is scanned for datasets. You can change this by setting the `DATA_PATH` environment variable; the default is `data/train-parts/*.*.gz`.
@@ -92,6 +100,8 @@ python -m laserembeddings download-models

Run `npm run build` in the `frontend/` directory first, and then run `hatch build .` in the project directory to build the wheel and source distribution.

To push a new release to PyPI from GitHub, tag a commit with a `vX.Y.Z` version number (including the `v` prefix), then publish a release on GitHub. This should trigger a workflow that pushes an sdist and a wheel to PyPI.

# Acknowledgements

This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]