j0ma / morph-seg
morphological / word segmentation experiments
Currently, download-data.sh downloads data we are not interested in, e.g. data/raw/en-hi.tgz, which is ~70MB in size.
As a low-priority item, it'd be nice to remove these.
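One possible way to handle this is to prune the unwanted archives after the download step. The following is only a sketch: `prune_unwanted` and the `UNWANTED` list are hypothetical names, and en-hi.tgz is the only file the issue actually mentions.

```python
from pathlib import Path

# Hypothetical list of archives to drop; only en-hi.tgz is named in the issue.
UNWANTED = ["en-hi.tgz"]


def prune_unwanted(raw_dir, unwanted=UNWANTED):
    """Delete unwanted archives under raw_dir and return the number of bytes freed."""
    freed = 0
    for name in unwanted:
        path = Path(raw_dir) / name
        if path.is_file():
            freed += path.stat().st_size
            path.unlink()
    return freed
```

Calling `prune_unwanted("data/raw")` after the download would remove the listed files and leave everything else untouched. Skipping the download in the first place would save bandwidth as well, but that depends on how download-data.sh fetches its files.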
Hi @j0ma,
I see in your repo that you ran into the math domain error, with the stack trace committed in a txt file. I am seeing the same error.
May I ask what you did to fix it?
Thanks,
Marco
Need some sort of script/model binary that can take a pre-trained MORSEL model and apply it to a corpus.
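As a rough illustration of what such a script could look like, the sketch below applies a word-to-morphs table to a tokenized corpus. It assumes the model's analyses have already been exported as tab-separated lines (word, then space-separated morphs); that format and the function names are assumptions for illustration, not MORSEL's actual output format or API.

```python
def load_segmentations(lines):
    """Build a word -> list-of-morphs table from 'word<TAB>morph morph ...' lines."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, _, morphs = line.partition("\t")
        # Fall back to the whole word if no morphs are listed.
        table[word] = morphs.split() or [word]
    return table


def segment_line(line, table, sep=" "):
    """Replace each whitespace-separated token with its morphs; unknown tokens pass through."""
    out = []
    for token in line.split():
        out.extend(table.get(token, [token]))
    return sep.join(out)
```

With a table built from `["walked\twalk ed", "cats\tcat s"]`, the line `cats walked home` becomes `cat s walk ed home`. Tokens the table has never seen are left unsegmented, which is a design choice a real script might want to make configurable.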
Currently, prepare-*.sh contains code that invokes sentencepiece and fairseq. Remove these so that only tokenization is performed.
This could also be done inside flores once the time comes.
Ultimately this code will need to work in the FLoRes repo as well.
It should be moved there once things are working to get started on the experiments.
Need to make LMVR segmentation work, using their hyperparameters.
LMVR modifies FlatCat and allows for an output lexicon size to be set.
Since we used 3 different settings for BPE (2500, 5000, 7500), it could
be worthwhile to investigate the settings for LMVR as well. It's non-obvious
whether the vocab size will matter or not when not using BPE.
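To make the comparison concrete, the experiment grid could be enumerated as below. This is a sketch under the assumption that LMVR is run with the same three sizes as BPE; the issue leaves the LMVR settings open, and the experiment naming scheme is made up for illustration.

```python
from itertools import product

BPE_SIZES = (2500, 5000, 7500)  # the three BPE settings named in the issue
LMVR_SIZES = BPE_SIZES          # assumption: reuse the same sizes for LMVR


def experiment_grid(bpe_sizes=BPE_SIZES, lmvr_sizes=LMVR_SIZES):
    """Return one config dict per (segmenter, size) pair."""
    pairs = list(product(["bpe"], bpe_sizes)) + list(product(["lmvr"], lmvr_sizes))
    return [
        {"segmenter": seg, "size": size, "name": f"{seg}-{size}"}
        for seg, size in pairs
    ]
```

Each entry could then drive one segmentation run, so that BPE and LMVR results at the same size can be compared side by side.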