Comments (4)
So, how is each job structured? We need generic corpora (we have files with the same name in different languages, and they may end up conflicting when multiple corpora are selected...
Assume two folders:
- S3:/jobs/<my_job_id>/src
- S3:/jobs/<my_job_id>/trg
These files are added:
- Each file in a corpus is named <corpus_id>.<normal_name>.txt (already converted to txt by the time it's here)
- Since the files mirror in each folder, that is how
- The text files may start with a "ID\t" format for Bible references
- For keyterms, add files but with ".keyterm" or some such extension
- Add the config.yaml file in the root, calling out the source and target text, the keywords and the parent model.
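To make the optional "ID\t" prefix concrete, here is a minimal sketch of how a reader could split it off. The function name `split_ref` is hypothetical, and it assumes that any line containing a tab carries a reference ID before the first tab; the real files may need a stricter check.

```python
from typing import Optional, Tuple


def split_ref(line: str) -> Tuple[Optional[str], str]:
    """Split an optional leading reference ID from the sentence text.

    Assumption: a line with a tab has the form "<ID>\t<text>";
    a line without a tab is plain text with no ID.
    """
    line = line.rstrip("\r\n")
    ref, sep, text = line.partition("\t")
    if sep:
        return ref, text
    return None, line
```

A line such as `"MAT 1:1\tSome text"` would come back as the pair `("MAT 1:1", "Some text")`, and a line with no tab comes back with `None` as the ID.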
After the whole process is done, the translated files would be at:
- S3:/jobs/<my_job_id>/translations/<normal_name>.txt
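The naming scheme above could be sketched as a small key builder. `JobLayout` and its method names are hypothetical, and the keys are written relative to the bucket root, which is an assumption about how the paths would map onto S3 object keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobLayout:
    """Hypothetical helper that builds object keys for one job."""

    job_id: str

    def _corpus_key(self, side: str, corpus_id: str, normal_name: str) -> str:
        # Prefixing with the corpus id keeps same-named files from
        # different corpora from colliding in the same folder.
        return f"jobs/{self.job_id}/{side}/{corpus_id}.{normal_name}.txt"

    def source_key(self, corpus_id: str, normal_name: str) -> str:
        return self._corpus_key("src", corpus_id, normal_name)

    def target_key(self, corpus_id: str, normal_name: str) -> str:
        return self._corpus_key("trg", corpus_id, normal_name)

    def translation_key(self, normal_name: str) -> str:
        # Translated output drops the corpus prefix, as in the path above.
        return f"jobs/{self.job_id}/translations/{normal_name}.txt"
```

Because the source and target keys differ only in the folder name, the same (corpus id, file name) pair finds both sides of a file.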
This could be built in a few stages:
- Update machine.py to use the S3 buckets
- Update machine.py to look for these specific files in this specific way
- Update Machine to use S3
- Update Machine to push the specific files in the specific way
from machine.
Most of the job code has already been implemented. I found a bug in the .NET library that I am using to access S3. I am working on fixing that now. Once that is fixed, we just need S3 buckets to point at. Here is the current structure:
- s3://<bucket_name>/builds/<build_id>/
  - train.src.txt: source sentences from all files in all corpora
  - train.trg.txt: target sentences from all files in all corpora
  - pretranslate.src.json: source sentences to pretranslate
  - pretranslate.trg.json: pretranslated target sentences
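As a rough illustration of how the two train files relate, the sketch below flattens every file of every corpus into two texts. It assumes one sentence per line and that line N of the source text pairs with line N of the target text; neither detail is confirmed above, and `build_train_text` is a hypothetical name.

```python
from typing import Iterable, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def build_train_text(
    corpora: Iterable[Iterable[Iterable[SentencePair]]],
) -> Tuple[str, str]:
    """Return the contents of train.src.txt and train.trg.txt.

    `corpora` is a sequence of corpora, each a sequence of files,
    each a sequence of sentence pairs.
    """
    src_lines: List[str] = []
    trg_lines: List[str] = []
    for corpus in corpora:
        for file_pairs in corpus:
            for src, trg in file_pairs:
                # Append both sides together so the line numbers stay aligned.
                src_lines.append(src)
                trg_lines.append(trg)
    src_text = "".join(line + "\n" for line in src_lines)
    trg_text = "".join(line + "\n" for line in trg_lines)
    return src_text, trg_text
```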
Downloading parent models and uploading child models have not been implemented yet. Here is the planned structure:
- s3://<bucket_name>/parent_models/<lang>/: parent model for target language <lang>
- s3://<bucket_name>/models/<engine_id>/: child model for engine <engine_id>
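For reference, the planned model locations could be built and taken apart like this. The helper names are hypothetical; only `urllib.parse.urlparse` from the standard library is used, and the split into bucket and key is the usual way an S3 client would want the location.

```python
from typing import Tuple
from urllib.parse import urlparse


def parent_model_uri(bucket: str, lang: str) -> str:
    """Location of the parent model for a target language."""
    return f"s3://{bucket}/parent_models/{lang}/"


def child_model_uri(bucket: str, engine_id: str) -> str:
    """Location of the child model for an engine."""
    return f"s3://{bucket}/models/{engine_id}/"


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")
```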
So, to get the secrets, etc. for Machine and machine.py, would we do this?
- For Machine, use Rancher secrets and pull them into the configuration as environment variables
- For machine.py, we should be able to use user-based secrets. But then we would also need to register the ClearML user in Machine with Rancher secrets as well.
Does this match your understanding?
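On the machine.py side, picking the secrets up from environment variables could look roughly like the sketch below. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard AWS variable names, but whether the deployment would inject those exact names is an assumption, and `load_s3_credentials` is a hypothetical helper.

```python
import os
from typing import Dict, Mapping

# Assumed variable names; the deployment may expose different ones.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def load_s3_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read S3 credentials from the environment, failing fast if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Passing the mapping in as a parameter keeps the function testable without touching the real process environment.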