This project forked from hplt-project/opuscleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

OpusCleaner

OpusCleaner is a data cleaner and training scheduler for machine translation and language models. The training scheduler has since moved to OpusTrainer.

Cleaner

The cleaner downloads and cleans multiple datasets and prepares them for training translation models.

opuscleaner-clean --parallel 4 data/train-parts/dataset.filter.json | gzip -c > clean.gz
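
When the cleaner is driven from a Python script rather than a shell, the same pipeline can be reproduced with the standard library. This is a minimal sketch, not part of OpusCleaner: run_and_gzip is a hypothetical helper, and the command in the trailing comment only reuses the arguments shown above.

```python
import gzip
import shutil
import subprocess


def run_and_gzip(command, output_path):
    """Run `command` and write its stdout to `output_path`, gzip-compressed."""
    with subprocess.Popen(command, stdout=subprocess.PIPE) as proc:
        with gzip.open(output_path, "wb") as out:
            # Stream the output in chunks so large datasets never sit in memory.
            shutil.copyfileobj(proc.stdout, out)
    # Leaving the `with` block waits for the process, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, command)


# Hypothetical usage, mirroring the shell command above:
# run_and_gzip(
#     ["opuscleaner-clean", "--parallel", "4",
#      "data/train-parts/dataset.filter.json"],
#     "clean.gz",
# )
```

Raising on a non-zero exit code keeps a failed cleaning run from leaving behind a truncated clean.gz that looks valid.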

Installation for cleaning

If you just want to use OpusCleaner for cleaning, you can install it from PyPI and then run it:

pip3 install opuscleaner
opuscleaner-server serve

Then open http://127.0.0.1:8000/ to view the interface.

You can also install and run OpusCleaner on a remote machine and use SSH local forwarding (e.g. ssh -L 8000:localhost:8000 <user>@<remote-host>) to access the interface from your local machine.
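
Before opening the browser, it can help to confirm that something is actually listening on the local port, whether the server runs locally or behind an SSH forward. A small sketch using only the standard library: is_port_open is a hypothetical helper, and its defaults assume the address and port used above.

```python
import socket


def is_port_open(host="127.0.0.1", port=8000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

A False result with forwarding enabled usually means either the SSH session is not running or the server on the remote machine has not started yet.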

Dependencies

(Mainly listed as shortcuts to documentation)

  • FastAPI as the base for the backend part.
  • Pydantic for conversion of untyped JSON to typed objects, and because FastAPI supports it automatically and gives you useful error messages when the input is malformed.
  • Vue for frontend

Screenshots

List and categorize the datasets you are going to use for training.

Download more datasets right from the interface.

Filter each individual dataset, showing you the results immediately.

Compare the dataset at different stages of filtering to see what the impact is of each filter.

Paths

  • data/train-parts is scanned for datasets. To change this, set the DATA_PATH environment variable; the default is data/train-parts/*.*.gz.
  • filters should contain the filter JSON files. To change this, set the FILTER_PATH environment variable; the default is <PYTHON_PACKAGE>/filters/*.json.
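
To check which files a given DATA_PATH setting would pick up, the lookup can be approximated in a few lines. This is only a sketch under the assumption that the variable is treated as a glob pattern, as its default suggests; find_datasets is a hypothetical helper, not OpusCleaner's own code.

```python
import glob
import os

# Default pattern as stated above.
DEFAULT_DATA_PATH = "data/train-parts/*.*.gz"


def find_datasets(environ=None):
    """Return the sorted files matching DATA_PATH, or the default pattern if unset."""
    environ = os.environ if environ is None else environ
    pattern = environ.get("DATA_PATH", DEFAULT_DATA_PATH)
    return sorted(glob.glob(pattern))
```

Passing a plain dict as environ makes it easy to try a pattern without changing the real environment.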

Installation for development

cd frontend
npm clean-install
npm run build
cd ..

python3 -m venv .env
bash --init-file .env/bin/activate
pip install -e .

Finally, you can run opuscleaner-server as normal. The --reload option makes it restart whenever any of the Python files change.

opuscleaner-server serve --reload

Then go to http://127.0.0.1:8000/ for the "interface" or http://127.0.0.1:8000/docs for the API.
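
The /docs page is generated by FastAPI from an OpenAPI schema, which FastAPI serves at /openapi.json by default. Assuming OpusCleaner keeps that default (not confirmed here), the available endpoints can also be listed from a script. fetch_schema and extract_paths are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def fetch_schema(base_url="http://127.0.0.1:8000", timeout=5.0):
    """Download the OpenAPI schema from a running server (assumed default URL)."""
    with urlopen(f"{base_url}/openapi.json", timeout=timeout) as response:
        return json.load(response)


def extract_paths(schema):
    """Return the endpoint paths declared in an OpenAPI schema, sorted."""
    return sorted(schema.get("paths", {}))


# Hypothetical usage, with the development server running:
# for path in extract_paths(fetch_schema()):
#     print(path)
```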

Frontend development

If you're doing frontend development, try also running:

cd frontend
npm run dev

Then go to http://127.0.0.1:5173/ for the "interface".

This puts Vite in hot-reloading mode for easier JavaScript development. All API requests are proxied to the Python server running on port 8000, which is why you need to run both at the same time.

Filters

If you want to use LASER, you will also need to download its assets:

python -m laserembeddings download-models

Packaging

Run npm run build in the frontend/ directory first, and then run hatch build . in the project directory to build the wheel and source distribution.

Acknowledgements

This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546].

Contributors

ales-t, gramirez-prompsit, jelmervdl, jindrahelcl, miau1, tkhnv, xapajiamnu, zjaume
