Comments (3)
The preprocessing and tokenization in ner_stream is pretty basic. However, we are setting up Python, R, and Java APIs right now that will make it easy to use whatever preprocessing and tokenization you want (e.g. NLTK's tokenizer), since different applications generally need different preprocessing and tokenization. But yeah, it's annoying that the default tokenizer doesn't handle a smart quote. I just updated it, so if you pull and recompile it should work properly now.
Cheers,
Davis
from mitie.
Thanks for the update. Will definitely wait for the APIs.
@geovedi the simple ner_stream tokenizer doesn't handle the more general problem of Unicode normalization. We should probably do this. If you can file an issue for posterity and discussion, we can keep track of it.
As @davisking mentioned, the solution may be to put it in the bindings rather than in ner_stream. The catch is that it affects training.
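For reference, here is a rough sketch of what such a normalization step could look like if it were done on the Python side before tokenizing. This is not MITIE code: `normalize_text` and the `SMART_QUOTES` table are hypothetical names made up for illustration, and the choice of NFKC plus an explicit quote mapping is an assumption (NFKC on its own leaves curly quotes untouched). Whatever is applied at prediction time would also have to be applied to the training text.

```python
import unicodedata

# Hypothetical mapping of typographic quotes to their ASCII equivalents.
# NFKC does not rewrite these characters, so they are translated explicitly.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, then map smart quotes to ASCII quotes."""
    return unicodedata.normalize("NFKC", text).translate(SMART_QUOTES)


if __name__ == "__main__":
    # Prints: "It's fine," she said.
    print(normalize_text("\u201cIt\u2019s fine,\u201d she said."))
```

The normalized string can then be handed to whichever tokenizer is in use, the default one or something like NLTK's.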