mit-nlp / mitie

MITIE: library and tools for information extraction

mitie's Issues

UTF-8 problems

Hi,

First of all, let me thank you for this great tool.
We are using MITIE via Python 2.7. To the best of my knowledge, we have to convert our strings from unicode to plain bytes before passing them to MITIE.
When using tokenize_with_offset, this can produce an offset that falls in the middle of a Unicode character spanning multiple bytes, which results in "UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-5: unexpected end of data" when we attempt to decode.

Any ideas?

Many thanks,
Jakub
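
One way to avoid decoding at a byte offset is to translate byte offsets into character offsets up front. The sketch below is written for Python 3 and assumes the reported offsets are byte positions in the UTF-8 encoded string; `byte_to_char_offsets` is a hypothetical helper, not part of MITIE. An offset that lands inside a multi-byte character is rounded down to the start of that character.

```python
from bisect import bisect_right


def byte_to_char_offsets(text, byte_offsets):
    """Map UTF-8 byte offsets into `text` to character offsets (hypothetical helper)."""
    # starts[i] is the byte position where character i begins;
    # the final entry is the total byte length, so end-of-string maps to len(text).
    starts = []
    pos = 0
    for ch in text:
        starts.append(pos)
        pos += len(ch.encode("utf-8"))
    starts.append(pos)
    # bisect_right finds the last character start that is <= the byte offset.
    return [bisect_right(starts, off) - 1 for off in byte_offsets]


if __name__ == "__main__":
    print(byte_to_char_offsets("na\u00efve caf\u00e9", [0, 3, 7]))
```

With character offsets in hand, slicing the original unicode string never cuts a character in half.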

Sourceforge considered harmful

Sourceforge has a troubling history of hijacking projects and should be considered harmful.

GitHub supports arbitrary file downloads attached to releases (git tags). An admin should be able to add the MITIE data download to the repo's releases and move it away from Sourceforge.

enhancement: direct access to word embeddings

I'd like to build a simple sentence classifier using the spectral embeddings and a simple average-of-bag-of-words approach.

I'm thinking about forking this and creating a model using dlib, but it might make sense anyway to write some functions that just return the embeddings for a list of tokens, which can then be wrapped in Python and other bindings. Let me know what you think.
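
To make the averaging idea concrete, here is a minimal sketch. It assumes the embeddings are already available as a plain dict mapping each word to a list of floats; that dict and the function name are hypothetical, since the request here is precisely for MITIE to expose such vectors.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of all known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # out-of-vocabulary word contributes nothing
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total  # all-zero vector when no token is known
    return [value / count for value in total]


if __name__ == "__main__":
    toy = {"good": [1.0, 2.0], "movie": [3.0, 4.0]}  # toy vectors, not real MITIE output
    print(average_embedding(["good", "movie", "unknown"], toy, 2))
```

The resulting fixed-length vector could then be fed to any standard classifier.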

Getting error overlap entity

Good afternoon.
Thanks for such an awesome library.

I'm getting the error "Invalid range given to ner_training_instance.overlaps_any_entity(). It overlaps an entity given to a previous call to add_entity()."

Can you please explain what is going on?
Thanks again.
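
The message suggests that two entity ranges added to the same training sentence share at least one token. A quick way to find the offending annotations before training is sketched below; it assumes entities are kept as (start, end, label) tuples with half-open token ranges, and `find_overlaps` is a hypothetical helper rather than a MITIE function.

```python
from itertools import combinations


def find_overlaps(entities):
    """Return every pair of (start, end, label) entities whose token ranges intersect."""
    return [
        (a, b)
        for a, b in combinations(entities, 2)
        # Two half-open ranges intersect when each starts before the other ends.
        if a[0] < b[1] and b[0] < a[1]
    ]


if __name__ == "__main__":
    annotations = [(0, 2, "person"), (1, 3, "org"), (5, 6, "location")]
    for first, second in find_overlaps(annotations):
        print("overlap:", first, second)
```

Any pair reported here would need to be merged or corrected in the annotation data.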

How to interpret confidence score?

From the comments in the code base, I understand that the confidence score can range from negative values to positive values larger than 1. But when I get a result with a score of 1.3, I'm not sure how confident that is, because I don't have a [min, max] range in mind. If possible, could you give some statistical background on how this value is calculated? Thanks!

How to create a svm model ?

Hello,
I have just started with MITIE, and I would like to understand how I can create a model for binary relations.
Unfortunately I need Italian models, so I must create them from scratch. Reading the examples, I found a good way to create NER models (here: https://github.com/mit-nlp/MITIE/blob/master/examples/java/TrainNerExample.java)

and here: https://github.com/mit-nlp/MITIE/blob/master/examples/java/NerExample.java for BinaryRelationDetector. The problem is: how can I create a .svm file, and how can I train it?

Is there a tutorial explaining how to create a good binary relation model?

Thank you for your support!

running face detection example ERROR

Hi, I get an error when trying to run face_detection.cpp, which I built with
g++ -o output -lpng -ljpeg -O3 -I.. ../dlib/all/source.cpp -lpthread -lX11 face_detection_ex.cpp
./output test.jpg
exception thrown!
Unable to load image in file test.jpg.
You must #define DLIB_PNG_SUPPORT and link to libpng to read PNG files.
Do this by following the instructions at http://dlib.net/compile.html.

The instructions were not clear to me:
Note also that if you want to work with jpeg/png files using dlib then you will need to link your program with libjpeg and/or libpng. You also need to tell dlib about this by defining the DLIB_JPEG_SUPPORT and DLIB_PNG_SUPPORT preprocessor directives.

I have installed libjpeg and libpng and added the flags, but I can't figure out what's wrong.

reserved identifier violation

I would like to point out that identifiers like "MITLL_MITIe_H__" and "MIT_LL_CONLL_PaRSER_H__" do not fit the expected naming convention of the C++ language standard.
Would you consider choosing different unique names?
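
For reference, the C++ standard reserves identifiers that contain a double underscore anywhere, and identifiers that begin with an underscore followed by an uppercase letter. A small check along those lines is sketched below; `is_reserved_cpp_identifier` is a hypothetical helper, and it deliberately ignores the additional rule that names starting with a single underscore are reserved in the global namespace.

```python
import re

# Double underscore anywhere, or a leading underscore followed by an uppercase letter.
_RESERVED = re.compile(r"__|^_[A-Z]")


def is_reserved_cpp_identifier(name):
    """Return True if `name` matches one of the two always-reserved patterns."""
    return bool(_RESERVED.search(name))


if __name__ == "__main__":
    for guard in ["MITLL_MITIe_H__", "MIT_LL_CONLL_PaRSER_H__", "MITLL_MITIE_H"]:
        print(guard, is_reserved_cpp_identifier(guard))
```

Dropping the trailing double underscore from the include guards would be enough to make them pass this check.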

Coreference ability

For my project I also need coreference resolution across sentences to extract information. Is there a good way to do this with MITIE? Any help or pointers would be appreciated.

Thanks

WordRep install/usage

Do you have any documentation on how to install or use the wordrep tool? I attempted to build it with CMake but haven't had any success.

I attempted to adapt the steps in the ReadMe for /tools/ner_stream:

cd tools/ner_stream
mkdir build
cd build
cmake ..
cmake --build . --config Release

but I am getting this error.

-- Check size of void*
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
JPEG_LIBRARY
linked by target "cmTryCompileExec1671790077" in directory /mydir/mitie/tools/wordrep/build/CMakeFiles/CMakeTmp

CMake Error: Internal CMake error, TryCompile configure of cmake failed
-- Check size of void* - failed
-- Found LAPACK library
-- Found CBLAS library
-- Looking for cblas_ddot
-- Looking for cblas_ddot - found
-- Check for STD namespace
-- Check for STD namespace - found
-- Looking for C++ include iostream
-- Looking for C++ include iostream - found
-- Configuring incomplete, errors occurred!

See also "mydir/mitie/tools/wordrep/build/CMakeFiles/CMakeOutput.log".
See also "mydir/mitie/tools/wordrep/build/CMakeFiles/CMakeError.log".

Custom relationship extraction example in java?

I don't see an example for custom relationships in Java. Is this supported? How can I learn more about how to use this feature?

Add more entities to the existing model

Hi there,

Can I add more entities to the existing ner_model.dat model in MITIE-models folder?

Thanks,

Memory growth in c++ code

Hello Davis,

I am training an NER model on about 63 sentences, each containing a lot of keywords. These are primarily Wikipedia extracts, and I am training on words for a specific domain.

My C++ code is at https://github.com/skprasadu/wikipedia-extraction-framework/blob/master/train_movie_relation/train_my_ner.cpp . My main code, at https://github.com/skprasadu/wikipedia-extraction-framework/blob/master/train_movie_relation/train_relation_extraction_example.cpp , calls the train method.

When I run the training process, it takes a long time, which is understandable, but the memory growth is crazy: it uses close to 16 GB to complete the NER training. Can you see what I am doing wrong?

Can you tell me how to optimize this?

Thanks
Krishna

Updating mitie in NPM

Are there any plans to update mitie on NPM? It's almost a year old, and I suspect many users would want to have MITIE integration in Node, as exists for Java, Python, and other languages.

How to bootstrap model with known entities?

Hi Davis,

Thank you so much for this high performance open source library. I have one question that I couldn't find an answer to wrt training the entity recognizer.

I would like to take advantage of already known entities, but also be able to recognize entities not already in the dictionary. For example, the Wikidata project provides millions of entities, and it would be nice to seed the model with those known entities. A couple of approaches I can think of:

1. Train a new model using whatever training data I can gather. Load known entities into a dictionary. At runtime, given a sentence, identify known entities with the dictionary and also run the sentence through the NER model. Then reconcile the two, with the dictionary-based recognition overriding any conflicting judgements. I wrote this, but don't think it is a good idea.
2. Generate a large set of training data by plugging in already known entities. E.g. knowing "Davis King" and "MIT" are entities, generate a training sentence "This library is from Davis King of MIT". I would think this approach's results will be heavily influenced by the variation of the filler text generated as part of the training set.

How would you go about doing this? Is there a straightforward technique to seed the model with known entities, or a recommended technique to supplement the model with a dictionary?
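
The first approach (dictionary lookup plus model output, with the dictionary winning on conflicts) can be sketched roughly as follows. This is only an illustration under stated assumptions: entities are (start, end, label) tuples over half-open token ranges, the model's predictions have already been converted to that form, and both function names are hypothetical.

```python
def dictionary_matches(tokens, known, max_len=5):
    """Greedy longest-match lookup of known entity names in a token list."""
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in known:
                spans.append((i, i + n, known[phrase]))
                i += n
                break
        else:
            i += 1  # no known entity starts at this token
    return spans


def reconcile(dict_spans, model_spans):
    """Keep all dictionary spans; keep model spans only if they overlap none of them."""
    kept = list(dict_spans)
    for m in model_spans:
        if all(m[1] <= d[0] or d[1] <= m[0] for d in dict_spans):
            kept.append(m)
    return sorted(kept)


if __name__ == "__main__":
    tokens = "This library is from Davis King of MIT".split()
    known = {"Davis King": "PERSON", "MIT": "ORGANIZATION"}
    model_output = [(4, 5, "LOCATION"), (0, 1, "MISC")]  # made-up predictions
    print(reconcile(dictionary_matches(tokens, known), model_output))
```

A plain dict lookup like this will not scale to millions of entries with long names; a trie or similar structure would be the natural replacement.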

Get Java API by JavaCpp?

Hello,

This library is really like a dream.

Right now, I need a Java API to train a customized model. Would it be possible to provide a Java API using https://github.com/bytedeco/javacpp?

Thanks.

Is it possible to reduce the size of the model

Hi, I've come across this library, and found it is really amazing! The accuracy is even better than Stanford NER demo!

Although I understand it contains a high dimensional space with over 500,000 dimensions, is it possible to reduce the model size?

Building MITIE java on Windows

MITIE is a cool tool, much more focused and easy to use. I have Cygwin on Windows, and a company restriction means I cannot have VS.Net.

Can someone guide me on how to build it? Thanks in advance.

Krishna

Train custom NER model in Java

Hi guys,
I am using MITIE for Java. I see that your example Python program shows how to train the NER with a sentence and indices.

How would I do the same thing in Java?

For example, I need to train the NER to extract the 'holiday' entity from: When is Christmas?
I am using the compiled Java library you linked to.

Is the NER process fast, and does it consume a lot of memory?

P.S. By the way, for NER, is this library better than OpenNLP and others? What's the advantage in using this library over others? Does it have state of the art advanced algorithms?

Training ner on a new corpus

Is there a memory leak? It's taking a lot of memory for very few training samples.
It gets killed after printing this:

num feats in chunker model: 4095
train: precision, recall, f1-score: 0.984615 0.984615 0.984615
now do training
num training samples: 198

I observed the memory usage and saw that it kept increasing gradually once it reached this point, as if some memory were being filled with garbage in each iteration.

What does total_word_feature_extractor.dat contain?

Hi there,

Can you explain what total_word_feature_extractor.dat in MITIE-models contains?

Thanks in advance.

Support for text classification

Hi guys,

I am working on a C++ project that requires named entity recognition and text classification.
For the entity recognition, I discovered this MITIE library, which is fast and excellent.
I would also really like to see text classification (C++) built into the project, since entity recognition and text classification can complement each other.

For example, OpenNLP has text classification. Could you create a similar text classifier that takes training data in tab-separated format?

Example training data:
category text

If you do not have this and are not going to implement it, can you please point me to a C++ library that has good text classification? I am aware of implementations in other languages, but I need to do this in C++.

It would be really great if you machine learning wizards can include a C++ library for text classification!
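
For illustration, the tab-separated training format described above (one "category<TAB>text" sample per line) could be read as in the sketch below. The request is for C++; Python is used here only to keep the example short, and `parse_labeled_lines` is a hypothetical helper that splits the text on whitespace rather than using a real tokenizer.

```python
def parse_labeled_lines(lines):
    """Parse 'category<TAB>text' lines into (category, tokens) pairs."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        category, sep, text = line.partition("\t")
        if not sep:
            raise ValueError("expected 'category<TAB>text', got: %r" % line)
        samples.append((category, text.split()))
    return samples


if __name__ == "__main__":
    data = ["sports\tThe team won the match\n", "\n", "weather\tRain is expected tomorrow\n"]
    for category, tokens in parse_labeled_lines(data):
        print(category, tokens)
```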

Can you explain what features from a word the total_word_feature_extractor extracts?

The example mentions that it looks at morphological features. What kind of features are they? Thank you!

Is there a Citation I can give?

I'm currently writing a paper, and I'd like to include a reference to the MITIE system. However, there's no mention of it on the CSAIL (?) site, nor any author names, etc.

Can you suggest a paper or technical report that I could cite? - it would be very helpful (and possibly good for the authors too).

All the Best (and thank you for releasing such excellent software)
Martin
:-)

[ner_stream] unclosed tags for non-sentence line

Davis,

I just found a small cosmetic error (a missing closing ']') in the output ner_stream produces for the last detected entity when it's not in a proper sentence (one not ending with punctuation).

$ ./ner_stream MITIE-models/ner_model.dat

Loading MITIE NER model file...
time: 4.38sec

Mozilla CEO Exit Exposes Silicon Valley’s Equality-Freedom Rift.
[ORGANIZATION Mozilla] CEO Exit Exposes Silicon Valley’s [MISC Equality-Freedom Rift] .


Mozilla CEO Exit Exposes Silicon Valley’s Equality-Freedom Rift
[ORGANIZATION Mozilla] CEO Exit Exposes Silicon Valley’s [MISC Equality-Freedom Rift

To contact the editors responsible for this story: Pui-Wing Tam at ptam13@bloomberg.net Reed Stevenson, Ari Levy.
To contact the editors responsible for this story : [PERSON Pui-Wing Tam] at ptam13@bloomberg . net [PERSON Reed Stevenson] , [PERSON Ari Levy] .


To contact the editors responsible for this story: Pui-Wing Tam at ptam13@bloomberg.net Reed Stevenson, Ari Levy
To contact the editors responsible for this story : [PERSON Pui-Wing Tam] at ptam13@bloomberg . net [PERSON Reed Stevenson] , [PERSON Ari Levy

Cheers,
Jim
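
One plausible cause (an assumption, not verified against the ner_stream source) is that the closing bracket is only written when a token after the entity is seen. The Python sketch below shows the alternative of attaching the bracket to the entity's last token, so an entity at the very end of the input is still closed. The function name and the (start, end, label) span format are hypothetical, and ner_stream itself is C++.

```python
def render_tagged(tokens, entities):
    """Render tokens with [LABEL ...] markers; entities are half-open, non-overlapping spans."""
    starts = {start: label for start, end, label in entities}
    ends = {end for start, end, label in entities}
    out = []
    for i, tok in enumerate(tokens):
        piece = tok
        if i in starts:
            piece = "[" + starts[i] + " " + piece
        if i + 1 in ends:
            piece += "]"  # close on the entity's own last token
        out.append(piece)
    return " ".join(out)


if __name__ == "__main__":
    toks = ["Mozilla", "CEO", "Exit", "Exposes", "Equality-Freedom", "Rift"]
    print(render_tagged(toks, [(0, 1, "ORGANIZATION"), (4, 6, "MISC")]))
```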

wordrep raises a std::bad_alloc error when training on new corpus

I've tried to use wordrep to train word embeddings on a specialized corpus. However, a std::bad_alloc occurs during the process:

number of raw ASCII files found: 127
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 127
Sample 50000000 random context vectors
Now do CCA (left size: 50000000, right size: 50000000).
std::bad_alloc

I've tried to launch gdb (after compiling in DEBUG mode) to see the error, but there is no stack trace.
The training data is quite large (though it fits in RAM), and it contains some non-ASCII characters (the files are encoded in Unicode). Could this be due to the encoding?

Sentiment Analysis

I have seen a good number of examples of NER and binary relation detection. Are there plans for sentiment analysis?

Also, could I get an example of binary_relation_detector_trainer in Java? It would help.
Thanks in advance.

Documentation for Java on OS X

I ran into a few problems on OS X with the instructions for Using MITIE from a Java program:

Also note that you must have Swig 1.3.40 or newer, CMake 2.8.4 or newer, and the Java JDK installed to compile the MITIE interface.

In addition, it was necessary to install FFTW to get the build scripts to work properly. Without it, I was getting errors like error: fftw3.h: No such file or directory.

That will place a javamitie shared library and jar file into the mitielib folder.

The build scripts created the javamitie.jar and libjavamitie.jnilib files, but I had to move them from MITIE/mitielib/java/build/lib into MITIE/mitielib manually. Otherwise, I saw errors like this when trying to compile NerExample.java: NerExample.java:6: error: package edu.mit.ll.mitie does not exist.

Once you have those two files you can run the example program in examples/java by running run_ner.bat if you are on Windows or run_ner.sh if you are on a POSIX system like Linux or OS X.

In run_ner.sh, export LD_LIBRARY_PATH=../../mitielib doesn't have the desired effect in OS X -- you'll wind up with errors like this:

Native code library failed to load. 
java.lang.UnsatisfiedLinkError: no javamitie in java.library.path

You need to use DYLD_LIBRARY_PATH=../../mitielib instead, or set it using the -Djava.library.path VM option.

Handling smartquotes

Davis,

Are there any plans to support smartquotes, since modern text editors these days tend to use them instead of regular quotes? This may have an impact on the result. For example, this is the original text.

Mozilla CEO Exit Exposes Silicon Valley’s Equality-Freedom Rift.
[ORGANIZATION Mozilla] CEO Exit Exposes Silicon Valley’s [MISC Equality-Freedom Rift] .

and this is after preprocessing (smartquotes replaced). Silicon Valley is now detected.

Mozilla CEO Exit Exposes Silicon Valley's Equality-Freedom Rift.
[ORGANIZATION Mozilla] CEO Exit Exposes [LOCATION Silicon Valley] 's [MISC Equality-Freedom Rift]

Cheers,
Jim
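
The preprocessing step mentioned above (replacing smartquotes before tagging) can be as small as the sketch below. The mapping covers only the four most common curly quote characters, and `normalize_quotes` is a hypothetical helper name.

```python
# Map curly single and double quotes to their plain ASCII equivalents.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text):
    """Return `text` with smartquotes replaced by regular quotes."""
    return text.translate(_QUOTE_MAP)


if __name__ == "__main__":
    print(normalize_quotes("Mozilla CEO Exit Exposes Silicon Valley\u2019s Equality-Freedom Rift."))
```

Because each replacement is one character for one character, character offsets into the text are unchanged by this step.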

Best way to replace stemmer or have multiple stemmers?

Hi Davis,

Looking at the code, it seems to me that everything is language agnostic apart from the english stemmer used in text categorization.

What would be the best way to replace the stemmer with another one, or even better have multiple stemmers for different languages?

Thank you very much!

python 3 error

Can MITIE work with Python 3?
This is what I get if I try one of the train_ner examples:

Traceback (most recent call last):
  File "p3train.py", line 11, in <module>
    trainer = ner_trainer(u"../../MITIE-models/english/total_word_feature_extractor.dat")
  File "/home/data/experim/MITIE/examples/python/../../mitielib/mitie.py", line 403, in __init__
    self.__obj = _f.mitie_create_ner_trainer(filename)
ctypes.ArgumentError: argument 1: <class 'TypeError'>: wrong type
Exception ignored in: <bound method ner_trainer.__del__ of <mitie.ner_trainer object at 0x7fa2bdeb8828>>
Traceback (most recent call last):
  File "/home/data/experim/MITIE/examples/python/../../mitielib/mitie.py", line 409, in __del__
    self.__mitie_free(self.__obj)
AttributeError: 'ner_trainer' object has no attribute '_ner_trainer__mitie_free'
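
The first traceback is consistent with a Python 3 `str` being passed where ctypes expects `bytes`; in Python 3, `ctypes.c_char_p` accepts bytes but rejects str. That the wrapper declares the filename argument as `c_char_p` is an assumption here. The second error would then follow from the constructor failing before the object was fully set up. A minimal sketch of the usual workaround, with a hypothetical helper name and a placeholder filename:

```python
import ctypes


def to_bytes(value, encoding="utf-8"):
    """Return `value` as bytes, encoding it first if it is a str."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


if __name__ == "__main__":
    filename = "total_word_feature_extractor.dat"  # placeholder path
    arg = ctypes.c_char_p(to_bytes(filename))      # accepted: bytes
    print(arg.value)
    try:
        ctypes.c_char_p(filename)                  # rejected in Python 3: str
    except TypeError as exc:
        print("TypeError:", exc)
```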

Add shared library file for mac os in Precompiled Java 64bit binaries

Hi there,

I am working on integrating MITIE with Apache Tika, along with Prof. Chris Mattmann. It would be really helpful if the shared library for Mac OS were also added to the precompiled binaries and published to the Maven Central repository. Tika could then use the jar directly as an external jar to enable MITIE named entity recognition.

How to make multiple models share the same extractor?

Hi Davis,

Thanks for your help, as always.

We always want to reduce memory usage. Since we normally cannot control the extractor, we at least hope that multiple models can share the same extractor.

With the current C++ implementation, which does not use pointers, there seems to be no way to share the extractor among multiple models. I tried the following code in three cases.

TotalWordFeatureExtractor totalWordFeatureExtractor = TotalWordFeatureExtractor.getEnglishExtractor();
NamedEntityExtractor ner = new NamedEntityExtractor(file.getAbsolutePath(), totalWordFeatureExtractor);

The above code consumes around 680 MB JVM memory.

TotalWordFeatureExtractor totalWordFeatureExtractor = TotalWordFeatureExtractor.getEnglishExtractor();
NamedEntityExtractor ner = new NamedEntityExtractor(file.getAbsolutePath(), totalWordFeatureExtractor);
NamedEntityExtractor ner2 = new NamedEntityExtractor(file.getAbsolutePath(), totalWordFeatureExtractor);

The above code consumes around 975 MB JVM memory.

TotalWordFeatureExtractor totalWordFeatureExtractor = TotalWordFeatureExtractor.getEnglishExtractor();
NamedEntityExtractor ner = new NamedEntityExtractor(file.getAbsolutePath(), totalWordFeatureExtractor);
NamedEntityExtractor ner2 = new NamedEntityExtractor(file.getAbsolutePath(), totalWordFeatureExtractor);
NamedEntityExtractor ner3 = new NamedEntityExtractor(file.getAbsolutePath(), totalWordFeatureExtractor);

The above code consumes around 1.26 GB JVM memory.

For the detailed code, please refer to the following link.
https://github.com/wihoho/MITIE/blob/master/mitielib/java/maven/src/test/java/edu/mit/ll/mitie/NamedEntityExtractorTest.java#L41

Obviously, this is not what we want. Ideally, memory usage would still be around 690 MB even with three different models. So I assume that using pointers in the C++ code is the only way to overcome this issue. We would like to ask for your opinion on resolving this, because we are not very good at C++.

Thank you.
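
The three figures reported above can be turned into a per-model increment with a few lines of arithmetic. The sketch takes 1.26 GB as 1260 MB, and the helper name is hypothetical; JVM memory readings are noisy, so the result is only a rough indication.

```python
def added_memory_per_model(measurements_mb):
    """Given total memory for 1, 2, 3, ... models, return what each extra model added."""
    return [b - a for a, b in zip(measurements_mb, measurements_mb[1:])]


if __name__ == "__main__":
    reported_mb = [680, 975, 1260]  # figures from the three cases above
    print(added_memory_per_model(reported_mb))
```

Each additional model adds roughly 290 MB, which is what one would expect to see if every model holds its own copy of the extractor data.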

Errors Building MITIE for Java

/Users/davidlaxer/MITIE/dlib/dlib/gui_widgets/nativefont.h:29:10: fatal error:
'X11/Xlocale.h' file not found

#include <X11/Xlocale.h>

1 error generated.

OS X 10.10.4.

David-Laxers-MacBook-Pro:MITIE davidlaxer$ xcode-select --install
xcode-select: error: command line tools are already installed, use "Software Update" to install updates

David-Laxers-MacBook-Pro:java davidlaxer$ pwd
/Users/davidlaxer/MITIE/mitielib/java

David-Laxers-MacBook-Pro:java davidlaxer$ java -version
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)
David-Laxers-MacBook-Pro:java davidlaxer$

-- The C compiler identification is AppleClang 6.1.0.6020053
-- The CXX compiler identification is AppleClang 6.1.0.6020053

David-Laxers-MacBook-Pro:java davidlaxer$ mkdir build
David-Laxers-MacBook-Pro:java davidlaxer$ cmake ..
-- The C compiler identification is AppleClang 6.1.0.6020053
-- The CXX compiler identification is AppleClang 6.1.0.6020053
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for png_create_read_struct
-- Looking for png_create_read_struct - found
-- Looking for jpeg_read_header
-- Looking for jpeg_read_header - found
-- Searching for BLAS and LAPACK
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of void*
-- Check size of void* - done
-- Found OpenBLAS library
-- Looking for sgetrf_single
-- Looking for sgetrf_single - not found
-- Found LAPACK library
-- Looking for cblas_ddot
-- Looking for cblas_ddot - found
-- Check for STD namespace
-- Check for STD namespace - found
-- Looking for C++ include iostream
-- Looking for C++ include iostream - found
-- Configuring done
CMake Warning (dev):
Policy CMP0042 is not set: MACOSX_RPATH is enabled by default. Run "cmake
--help-policy CMP0042" for policy details. Use the cmake_policy command to
set the policy and suppress this warning.

MACOSX_RPATH is not specified for the following targets:

mitie

This warning is for project developers. Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /Users/davidlaxer/MITIE/mitielib/java
David-Laxers-MacBook-Pro:java davidlaxer$ cmake --build . --config Release --target install
Scanning dependencies of target dlib
[ 0%] Building CXX object dlib_build/CMakeFiles/dlib.dir/base64/base64_kernel_1.o
[ 1%] Building CXX object dlib_build/CMakeFiles/dlib.dir/bigint/bigint_kernel_1.o
[ 2%] Building CXX object dlib_build/CMakeFiles/dlib.dir/bigint/bigint_kernel_2.o
[ 3%] Building CXX object dlib_build/CMakeFiles/dlib.dir/bit_stream/bit_stream_kernel_1.o
[ 4%] Building CXX object dlib_build/CMakeFiles/dlib.dir/entropy_decoder/entropy_decoder_kernel_1.o
[ 5%] Building CXX object dlib_build/CMakeFiles/dlib.dir/entropy_decoder/entropy_decoder_kernel_2.o
[ 6%] Building CXX object dlib_build/CMakeFiles/dlib.dir/entropy_encoder/entropy_encoder_kernel_1.o
[ 7%] Building CXX object dlib_build/CMakeFiles/dlib.dir/entropy_encoder/entropy_encoder_kernel_2.o
[ 8%] Building CXX object dlib_build/CMakeFiles/dlib.dir/md5/md5_kernel_1.o
[ 9%] Building CXX object dlib_build/CMakeFiles/dlib.dir/tokenizer/tokenizer_kernel_1.o
[ 10%] Building CXX object dlib_build/CMakeFiles/dlib.dir/unicode/unicode.o
[ 11%] Building CXX object dlib_build/CMakeFiles/dlib.dir/data_io/image_dataset_metadata.o
[ 12%] Building CXX object dlib_build/CMakeFiles/dlib.dir/sockets/sockets_kernel_1.o
[ 13%] Building CXX object dlib_build/CMakeFiles/dlib.dir/bsp/bsp.o
[ 14%] Building CXX object dlib_build/CMakeFiles/dlib.dir/dir_nav/dir_nav_kernel_1.o
[ 15%] Building CXX object dlib_build/CMakeFiles/dlib.dir/dir_nav/dir_nav_kernel_2.o
[ 16%] Building CXX object dlib_build/CMakeFiles/dlib.dir/dir_nav/dir_nav_extensions.o
[ 17%] Building CXX object dlib_build/CMakeFiles/dlib.dir/linker/linker_kernel_1.o
[ 18%] Building CXX object dlib_build/CMakeFiles/dlib.dir/logger/extra_logger_headers.o
[ 19%] Building CXX object dlib_build/CMakeFiles/dlib.dir/logger/logger_kernel_1.o
[ 20%] Building CXX object dlib_build/CMakeFiles/dlib.dir/logger/logger_config_file.o
[ 20%] Building CXX object dlib_build/CMakeFiles/dlib.dir/misc_api/misc_api_kernel_1.o
[ 21%] Building CXX object dlib_build/CMakeFiles/dlib.dir/misc_api/misc_api_kernel_2.o
[ 22%] Building CXX object dlib_build/CMakeFiles/dlib.dir/sockets/sockets_extensions.o
[ 23%] Building CXX object dlib_build/CMakeFiles/dlib.dir/sockets/sockets_kernel_2.o
[ 24%] Building CXX object dlib_build/CMakeFiles/dlib.dir/sockstreambuf/sockstreambuf.o
[ 25%] Building CXX object dlib_build/CMakeFiles/dlib.dir/sockstreambuf/sockstreambuf_unbuffered.o
[ 26%] Building CXX object dlib_build/CMakeFiles/dlib.dir/server/server_kernel.o
[ 27%] Building CXX object dlib_build/CMakeFiles/dlib.dir/server/server_iostream.o
[ 28%] Building CXX object dlib_build/CMakeFiles/dlib.dir/server/server_http.o
[ 29%] Building CXX object dlib_build/CMakeFiles/dlib.dir/threads/multithreaded_object_extension.o
[ 30%] Building CXX object dlib_build/CMakeFiles/dlib.dir/threads/threaded_object_extension.o
[ 31%] Building CXX object dlib_build/CMakeFiles/dlib.dir/threads/threads_kernel_1.o
[ 32%] Building CXX object dlib_build/CMakeFiles/dlib.dir/threads/threads_kernel_2.o
[ 33%] Building CXX object dlib_build/CMakeFiles/dlib.dir/threads/threads_kernel_shared.o
[ 34%] Building CXX object dlib_build/CMakeFiles/dlib.dir/threads/thread_pool_extension.o
[ 35%] Building CXX object dlib_build/CMakeFiles/dlib.dir/timer/timer.o
[ 36%] Building CXX object dlib_build/CMakeFiles/dlib.dir/stack_trace.o
[ 37%] Building CXX object dlib_build/CMakeFiles/dlib.dir/gui_widgets/fonts.o
In file included from /Users/davidlaxer/MITIE/dlib/dlib/gui_widgets/fonts.cpp:14:
/Users/davidlaxer/MITIE/dlib/dlib/gui_widgets/nativefont.h:29:10: fatal error:
'X11/Xlocale.h' file not found

#include <X11/Xlocale.h>

1 error generated.
dlib_build/CMakeFiles/dlib.dir/build.make:974: recipe for target 'dlib_build/CMakeFiles/dlib.dir/gui_widgets/fonts.o' failed
gmake[2]: *** [dlib_build/CMakeFiles/dlib.dir/gui_widgets/fonts.o] Error 1
CMakeFiles/Makefile2:122: recipe for target 'dlib_build/CMakeFiles/dlib.dir/all' failed
gmake[1]: *** [dlib_build/CMakeFiles/dlib.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
gmake: *** [all] Error 2
David-Laxers-MacBook-Pro:java davidlaxer$

Can't set num_threads from Python for NER training

I'm training a new NER model in MITIE on a machine with a bunch of cores. Training it with a few thousand samples takes a couple hours, so I'd really like to be able to multithread it. When I set it in my code, though, it seems to default to 4 regardless of what I set.

import sys, os, json
sys.path.append("/home/ahalterman/MITIE/mitielib")
from mitie import *
from collections import defaultdict

trainer = ner_trainer("/home/ahalterman/MITIE/tools/wordrep/build/total_word_feature_extractor.dat")
trainer.num_threads = 8

[all the other stuff here]

When I start the training, though, it says only 4 threads are in use:

now do training
C:           20
epsilon:     0.01
num threads: 4
cache size:  5
loss per missed segment:  3

Sure enough, running ps -o nlwp <pid> says it's got 5 threads (I don't know much about multithreading so I'm not sure what that extra one is).

Any idea what's going on?

Error on building nodejs wrapper

(Linux 3.16.0-4-amd64 Debian 3.16.7-ckt11-1+deb8u5 (2015-10-09) x86_64 GNU/Linux)
Hey guys, sorry for disturbing you.
I have the same problem while building MITIE's nodejs wrapper. It seems that the error occurs in MITIE, not in node-gyp, but I might be wrong.

What could it be? Maybe I'm missing something? Thanks!

> [email protected] install /var/www/dh-nlp/node_modules/mitie
> node-gyp rebuild

make: Entering directory '/var/www/dh-nlp/node_modules/mitie/build'
  CXX(target) Release/obj.target/mitie/src/mitie.o
In file included from ../mitie/dlib/dlib/serialize.h:157:0,
                 from /usr/local/include/mitie/approximate_substring_set.h:9,
                 from /usr/local/include/mitie/word_morphology_feature_extractor.h:7,
                 from /usr/local/include/mitie/total_word_feature_extractor.h:8,
                 from /usr/local/include/mitie/named_entity_extractor.h:7,
                 from ../src/entity_extractor.h:5,
                 from ../src/mitie.cc:2:
../mitie/dlib/dlib/smart_pointers/shared_ptr.h: In member function 'void* dlib::shared_ptr<T>::deleter_template<D>::get_deleter_void(const std::type_info&) const':
../mitie/dlib/dlib/smart_pointers/shared_ptr.h:112:29: error: cannot use typeid with -fno-rtti
                 if (typeid(D) == t)
                             ^
../mitie/dlib/dlib/smart_pointers/shared_ptr.h: In member function 'D* dlib::shared_ptr<T>::_get_deleter() const':
../mitie/dlib/dlib/smart_pointers/shared_ptr.h:432:83: error: cannot use typeid with -fno-rtti
                 return static_cast<D*>(shared_node->del->get_deleter_void(typeid(D)));
                                                                                   ^
g++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-4.9/README.Bugs> for instructions.
mitie.target.mk:94: recipe for target 'Release/obj.target/mitie/src/mitie.o' failed
make: *** [Release/obj.target/mitie/src/mitie.o] Error 4
make: Leaving directory '/var/www/dh-nlp/node_modules/mitie/build'

How is score of entity calculated?

How is the score of the entity calculated? Also, how can one interpret the score?

How to compile on Windows?

Hi Davis,

I am really sorry to disturb you again, but I could not build the Java interface on Windows 7.

The environment I use is as follows:

CMake 3.4.1
MinGW
swig-3.0.8

The error message is as follows.

-- The C compiler identification is unknown
-- The CXX compiler identification is unknown
-- Check for working C compiler using: Visual Studio 10 2010 Win64
-- Check for working C compiler using: Visual Studio 10 2010 Win64 -- broken
CMake Error at C:/Program Files (x86)/cmake-3.4.1-win32-x86/share/cmake-3.4/Modules/CMakeTestCCompiler.cmake:61 (message):
  The C compiler "C:/MinGW/bin/gcc.exe" is not able to compile a simple test
  program.

  It fails with the following output:

   Change Dir: F:/Download/MITIE-master/mitielib/java/build/CMakeFiles/CMakeTmp



  Run Build
  Command:"C:/Windows/Microsoft.NET/Framework/v4.0.30319/MSBuild.exe"
  "cmTC_f7860.vcxproj" "/p:Configuration=Debug" "/p:VisualStudioVersion=10.0"

  Microsoft (R) Build Engine version 4.0.30319.18408

  [Microsoft .NET Framework, version 4.0.30319.18444]

  Copyright (C) Microsoft Corporation.  All rights reserved.



  Build started 8/1/2016 6:59:25 PM.

  Project
  "F:\Download\MITIE-master\mitielib\java\build\CMakeFiles\CMakeTmp\cmTC_f7860.vcxproj"
  on node 1 (default targets).


  F:\Download\MITIE-master\mitielib\java\build\CMakeFiles\CMakeTmp\cmTC_f7860.vcxproj(27,3):
  error MSB4019: The imported project "F:\Microsoft.Cpp.Default.props" was
  not found.  Confirm that the path in the <Import> declaration is correct,
  and that the file exists on disk.

  Done Building Project
  "F:\Download\MITIE-master\mitielib\java\build\CMakeFiles\CMakeTmp\cmTC_f7860.vcxproj"
  (default targets) -- FAILED.



  Build FAILED.




  "F:\Download\MITIE-master\mitielib\java\build\CMakeFiles\CMakeTmp\cmTC_f7860.vcxproj"
  (default target) (1) ->

    F:\Download\MITIE-master\mitielib\java\build\CMakeFiles\CMakeTmp\cmTC_f7860.vcxproj(27,3): error MSB4019: The imported project "F:\Microsoft.Cpp.Default.props" was not found. Confirm that the path in the <Import> declaration is correct, and that the file exists
 on disk.



      0 Warning(s)
      1 Error(s)



  Time Elapsed 00:00:00.02





  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:9 (project)


-- Configuring incomplete, errors occurred!
See also "F:/Download/MITIE-master/mitielib/java/build/CMakeFiles/CMakeOutput.log".
See also "F:/Download/MITIE-master/mitielib/java/build/CMakeFiles/CMakeError.log".

I would really appreciate your help with building on Windows. Thanks a lot.

Is there any method to make MITIE support Chinese language?

I need to handle Chinese text. Is there a method, or some steps I could try, to train NER and binary relation extraction models for it?

Thanks for any help!

Improving Mitie NER for missing entities

Hello,

So far I'm greatly impressed with the product. The assessment in #12 is spot on and I'm looking forward to training some binary relationships as soon as I can think of an interesting one.

I was wondering if there were plans to update the basic NER model over time. Further, is there some kind of way to submit missed entities within text in an organized fashion?

This isn't the best example, since it's hard to determine from the context, but running

"Popular Science is your wormhole to the future. Reporting on what's new and what's next in science and technology, we deliver the future now."

missed 'Popular Science': it was not tagged as miscellaneous or as anything else, and no score was reported.

Move the models off of sourceforge

I'm currently downloading them at 100k/sec while trying to test an automatic provisioning script and not thrilled about it :)

Sourceforge has definitely seen better days, I think.

xpaths for in-browser display

For display of highlights natively in a web browser as a user visits pages in the wild, it is necessary to pass XPath-based offsets to the highlighting JavaScript code running in the browser. It doesn't appear that MITIE generates XPath offsets currently. This ticket is a feature request for adding such output to MITIE.

See here for further details:

It is possible to wrap a tool like MITIE with a converter that reparses the HTML and generates XPath offsets. The python example linked below works in more than half of web pages, but has issues with incomplete tags. We have experience solving this issue in some other contexts. If this feature request gets prioritized by MITIE, we would be glad to help MITIE engineers.

https://github.com/trec-kba/streamcorpus-pipeline/blob/master/streamcorpus_pipeline/offsets.py#L341-L425

MITIE-models examples

Hello all,

Could somebody give me some examples/datasets that were used in the binary relations classifiers:

rel_classifier_book.written_work.author.svm
rel_classifier_film.film.directed_by.svm
rel_classifier_influence.influence_node.influenced_by.svm
rel_classifier_law.inventor.inventions.svm
rel_classifier_location.location.contains.svm
rel_classifier_location.location.nearby_airports.svm
rel_classifier_location.location.partially_contains.svm
rel_classifier_organization.organization.place_founded.svm
rel_classifier_organization.organization_founder.organizations_founded.svm
rel_classifier_organization.organization_scope.organizations_with_this_scope.svm
rel_classifier_people.deceased_person.place_of_death.svm
rel_classifier_people.ethnicity.geographic_distribution.svm
rel_classifier_people.person.ethnicity.svm
rel_classifier_people.person.nationality.svm
rel_classifier_people.person.parents.svm
rel_classifier_people.person.place_of_birth.svm
rel_classifier_people.person.religion.svm
rel_classifier_people.place_of_interment.interred_here.svm
rel_classifier_time.event.includes_event.svm
rel_classifier_time.event.locations.svm
rel_classifier_time.event.people_involved.svm

Thanks in advance!

tokenising misaligned for java strings?

Hi,

I am observing some misalignment of the token indices when calling the Java method edu.mit.ll.mitie.mitie.tokenizeWithOffsets with a string containing multi-byte characters. For example, calling it with the string 'funny “quotation marks” and so on' produces these tokens:

Token 0: index=0, value=<funny>
Token 1: index=6, value=<“quotation>
Token 2: index=19, value=<marks”>
Token 3: index=28, value=<and>
Token 4: index=32, value=<so>
Token 5: index=35, value=<on>

Note that the quotation marks are Unicode U+201C and U+201D. Token 2's index (19) is off by 2 (it should be 17), and token 3's (28) is off by 4 (it should be 24). So the indexing errors appear to accumulate as "multi-byte" characters are encountered.
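The reported indices are consistent with UTF-8 byte offsets: each of the two curly quotes takes 3 bytes in UTF-8 but counts as 1 character, which accounts for the drift of 2 and then 4. Assuming that is what the binding returns (this is an inference from the numbers, not something confirmed from the MITIE source), a caller could convert the offsets back to character positions along these lines. The helper name byte_to_char_offsets is hypothetical and not part of MITIE.

```python
def byte_to_char_offsets(text, byte_offsets):
    """Convert UTF-8 byte offsets into `text` to character offsets.

    Hypothetical helper, not part of MITIE. Assumes every offset falls on
    a character boundary of the UTF-8 encoding of `text`.
    """
    byte_to_char = {}
    byte_pos = 0
    for char_pos, ch in enumerate(text):
        byte_to_char[byte_pos] = char_pos
        byte_pos += len(ch.encode("utf-8"))
    byte_to_char[byte_pos] = len(text)  # offset just past the last character
    # A KeyError here means an offset pointed into the middle of a character.
    return [byte_to_char[b] for b in byte_offsets]


text = "funny \u201cquotation marks\u201d and so on"
reported = [0, 6, 19, 28, 32, 35]  # token indices from the report above
corrected = byte_to_char_offsets(text, reported)
print(corrected)  # [0, 6, 17, 24, 28, 31]
```

The corrected values 17 and 24 match the indices expected in the report. Python character offsets are code points, while Java String indices are UTF-16 code units; the two agree for this example because both quotation marks are in the Basic Multilingual Plane.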

Undefined variable: ner_training_instance error

Hi. I am new to Python and am trying out the MITIE Python API to train a named_entity_extractor, but I am getting an "Undefined variable: ner_training_instance" error. How can I resolve that?

One or two output examples in the README.md file

The MITIE library makes a really exciting impression at first glance. However, it would be very helpful if you could provide one or two examples of what the library actually outputs when it is run on a text. Especially for people like me who are not deeply into language processing, it is hard to get a clear picture of what the library actually does and what it can be used for.

Cheers
Holger

make test fails on osx

First, THANK YOU! The world has needed this forever for the reasons you outline in the readme, so I'm very glad you're building this. Dlib is awesome as well.

I'm able to build and run the example but make test fails with:

$ make test
./ner_stream MITIE-models/ner_model.dat < sample_text.txt > /tmp/test.out
Loading MITIE NER model file...
time: 2.86sec

diff /tmp/test.out sample_text.reference-output
diff: sample_text.reference-output: No such file or directory

Maybe it just needs a git add sample_text.reference-output?

Amazon always (or at least incredibly often) tagged as a location and McDonald's as a person

Hello,

First I'd like to thank the makers of MITIE. It's great to have a state-of-the-art NER free for commercial use. I've been finding that it has issues identifying Amazon and McDonald's as companies, as described below.

Throughout this post I'll run the following code with different input and display the output:

toks = mitie.tokenize(eatiht.extract(fd))
ents = ner.extract_entities(toks)
[(e[1], ' '.join(toks[e[0][0]:e[0][-1]+1]), e[2]) for e in ents]

I've been finding that Amazon is usually tagged as a location. Here's an example of such a text:

"Amazon is among a host of companies asking the Federal Aviation Administration to expand the abilities of small commercial drones and the traffic control system that would monitor them."

which yields

[('LOCATION', 'Amazon', 0.6497587837582173),
('ORGANIZATION', 'Federal Aviation Administration', 1.2338242977625724)]

Here's another:

'Its earnings season again and Amazon, for the first time ever, has broken out the financial results of its cloud services division, Amazon Web Services (AWS). The results are impressive. In less than a decade, Amazon has grown AWS into a $5 billion business that is still growing at 50%.'

which yields

[('LOCATION', 'Amazon', 0.6857005959408439),
('ORGANIZATION', 'Amazon Web Services', 0.8302322174217054),
('LOCATION', 'Amazon', 0.5971181888647527)]

I've been finding similar problems with McDonald's. Take this story, for example: http://www.foxnews.com/leisure/2015/04/27/not-lovin-it-mcdonald-is-trying-to-fix-its-business/

which yields

[('PERSON', 'McDonald', 0.5934354493699665),
('PERSON', 'McDonald', 0.6011390084587678),
('PERSON', 'Steve Easterbrook', 1.62707021179198),
('PERSON', 'McDonald', 0.7252897014392439),
('MISC', 'Turnaround Summit', 0.5459067014161887),
('PERSON', 'McDonald', 0.733597056118872),
('PERSON', 'McDonald', 0.5221199692486291),
('LOCATION', 'San Diego', 1.4569853574876104),
('PERSON', 'John Gordon', 1.4067061157394036),
('PERSON', 'McDonald', 0.6262266486680956),
('PERSON', 'McDonald', 0.5300773536597632),
('LOCATION', 'U.S.', 1.3314469880747613),
('PERSON', 'McDonald', 0.6102694086803659),
('PERSON', 'Paul Shapiro', 1.8008827543760988),
('LOCATION', 'United States', 1.1536681812718963),
('ORGANIZATION', 'McDonald', 0.7478091010431832),
('PERSON', 'Shapiro', 1.1529211339569105),
('ORGANIZATION', 'Chipotle', 0.27411100096279334),
('ORGANIZATION', 'Sofritas', 0.20042503949162985),
('PERSON', 'Denny', 1.0445957119600875),
('PERSON', 'Johnny Rockets', 0.6257758817370944),
('LOCATION', 'White Castle', 0.550052599194086),
('ORGANIZATION', 'McDonald', 0.5866372774064789),
('PERSON', 'McDonald', 0.6089171335562328),
('PERSON', 'Shapiro', 1.4221506683701266),
('ORGANIZATION', 'Chicago Tribune', 0.5300056649547958),
('PERSON', 'Easterbrook', 1.3357314682060835),
('PERSON', 'McDonald', 0.5254242850102231),
('MISC', 'burger', 0.20863861829526598),
('PERSON', 'McDonald', 0.6882332877969929),
('LOCATION', 'U.S.', 1.152526189418561),
('PERSON', 'Robert Reich', 1.3686463530945003),
('LOCATION', 'U.S.', 0.9064906053018911),
('PERSON', 'Clinton', 0.8424408210970944),
('PERSON', 'McDonald', 0.8099967902497076),
('PERSON', 'McDonald', 0.5726923125964373),
('PERSON', 'Laura Ries', 1.2960758120524078),
('ORGANIZATION', 'Ries & Ries', 0.8948838138088949),
('LOCATION', 'Atlanta', 1.4866648917001795),
('PERSON', 'McDonald', 0.7071161079471816)]

This means that out of the 16 times McDonald's occurred, it was tagged as a PERSON 14 times and as an ORGANIZATION 2 times.

MITIE very often does quite well from what I've seen. There seem to be some major companies, though, for which it doesn't. I know that we can train our own models using MITIE, but it seems like it should work for one of the most prolific companies of our time (Amazon) and a company as prevalent as McDonald's. I don't think the problem is a training/inference domain mismatch, since the English section of the CoNLL 2003 NER task was a Reuters corpus and I'm applying it to newswire. Beyond training my own model, all I can think to do is find the cases MITIE gets terribly wrong and create custom rules that change the output of the NER. Do you have any other suggestions?
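The custom-rule idea mentioned above could be sketched as a small post-processing step over the (tag, text, score) tuples produced by the list comprehension earlier in this post. This is only an illustration: apply_overrides, KNOWN_COMPANIES, and the max_score cutoff are hypothetical names and values, not part of MITIE.

```python
def apply_overrides(entities, overrides, max_score=1.0):
    """Relabel entities whose text appears in `overrides`.

    `entities` is a list of (tag, text, score) tuples. Predictions with a
    score above `max_score` are left untouched, on the assumption that a
    high-scoring prediction is more likely to be right than the rule.
    """
    fixed = []
    for tag, text, score in entities:
        if text in overrides and score <= max_score:
            tag = overrides[text]
        fixed.append((tag, text, score))
    return fixed


# Hypothetical rule table. "McDonald" is listed without "'s" because that
# is how the entity text appears in the output above.
KNOWN_COMPANIES = {"Amazon": "ORGANIZATION", "McDonald": "ORGANIZATION"}

ents = [
    ("LOCATION", "Amazon", 0.6497587837582173),
    ("ORGANIZATION", "Federal Aviation Administration", 1.2338242977625724),
]
print(apply_overrides(ents, KNOWN_COMPANIES))
```

A rule table like this trades one kind of error for another: it would also relabel a genuine reference to the Amazon river or to a person named McDonald, so it only makes sense for text where the company reading dominates.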

[Documentation] train-ner script

Hi,

Using the ner_conll example, I found the train-ner script in MITIE/tools/ner_conll/train-ner:

./ner --train-chunker eng.train_all_some_sentences_combined $THREADS > $LOG
./ner --train-id      eng.train_all_some_sentences_combined $THREADS >> $LOG

./ner --test-id eng.testb >> $LOG
./ner --tag-conll-file ner_model.dat eng.testb | ./conlleval
echo $LOG

As I cannot find an executable called "ner", I'm wondering whether it should be train-ner instead, but the train-ner command-line arguments only provide:

parser.add_option("h", "Display this help information.");
parser.add_option("train", "train named_entity_extractor on CoNLL data.");
parser.add_option("test", "test named_entity_extractor on CoNLL data.");
parser.add_option("threads", "Use <arg> threads when doing training (default: 4).",1);
parser.add_option("tag-conll-file", "Read in a CoNLL annotation file and output a copy that is tagged with a MITIE NER model.");

See MITIE/tools/ner_conll/src/main.cpp.

So my question is: should that small train-ner script be updated?

Thanks in advance (and I'm really looking forward to training a model for German),

Stefan

Updating a trained model

Hi Team,

Could you please help me understand how to update a saved model with new training data? Do I have to run all of the original training samples, along with the new samples, through a trainer and create a new model in order to update an existing one? I tried serializing the trainer and reusing it. However, since it contains pointer references, I am not able to save the trainer. I am trying to periodically update my model with new samples.

Kind Regards,
rp

include <X11/Xlocale.h>

include <X11/Xlocale.h>
