mit-nlp / mitie

MITIE: library and tools for information extraction

C++ 97.36% C 0.34% CMake 0.25% MATLAB 0.01% HTML 0.12% Shell 0.10% XSLT 0.57% Python 0.76% Makefile 0.07% Batchfile 0.01% R 0.33% Perl 0.09%
machine-learning natural-language-processing information-extraction python c-plus-plus java

mitie's Introduction

MITIE: MIT Information Extraction

This project provides free (even for commercial use) state-of-the-art information extraction tools. The current release includes tools for performing named entity extraction and binary relation detection as well as tools for training custom extractors and relation detectors.

MITIE is built on top of dlib, a high-performance machine-learning library[1]. It makes use of several state-of-the-art techniques, including distributional word embeddings[2] and Structural Support Vector Machines[3]. MITIE offers several pre-trained models providing varying levels of support for English, Spanish, and German, trained using a variety of linguistic resources (e.g., CoNLL 2003, ACE, Wikipedia, Freebase, and Gigaword). The core MITIE software is written in C++, but bindings for several other languages, including Python, R, Java, C, and MATLAB, allow users to quickly integrate MITIE into their own applications.

Outside projects have created API bindings for OCaml, .NET, .NET Core, PHP, and Ruby. There is also an interactive tool for labeling data and training MITIE.

Using MITIE

MITIE's primary API is a C API which is documented in the mitie.h header file. Beyond this, there are many example programs showing how to use MITIE from C, C++, Java, R, or Python 2.7.

Initial Setup

Before you can run the provided examples you will need to download the trained model files which you can do by running:

make MITIE-models

or by simply downloading the MITIE-models-v0.2.tar.bz2 file and extracting it in your MITIE folder. Note that the Spanish and German models are supplied in separate downloads. So if you want to use the Spanish NER model then download MITIE-models-v0.2-Spanish.zip and extract it into your MITIE folder. Similarly for the German model: MITIE-models-v0.2-German.tar.bz2

Using MITIE from the command line

MITIE comes with a basic streaming NER tool. So you can tell MITIE to process each line of a text file independently and output marked up text with the command:

cat sample_text.txt | ./ner_stream MITIE-models/english/ner_model.dat  

The ner_stream executable can be compiled by running make in the top-level MITIE folder, or by navigating to the tools/ner_stream folder and either running make there or building it with CMake using the following commands:

cd tools/ner_stream
mkdir build
cd build
cmake ..
cmake --build . --config Release

Compiling MITIE as a shared library

On a UNIX like system, this can be accomplished by running make in the top level MITIE folder or by running:

cd mitielib
make

This produces shared and static library files in the mitielib folder. Or you can use CMake to compile a shared library by typing:

cd mitielib
mkdir build
cd build
cmake ..
cmake --build . --config Release --target install

Either of these methods will create a MITIE shared library in the mitielib folder.

Compiling MITIE using OpenBLAS

If you compile MITIE using cmake then it will automatically find and use any optimized BLAS libraries on your machine. However, if you compile using regular make then you have to manually locate your BLAS libraries or dlib will default to its built-in, but slower, BLAS implementation. Therefore, to use OpenBLAS when compiling without cmake, locate libopenblas.a and libgfortran.a, then run make as follows:

cd mitielib
make BLAS_PATH=/path/to/libopenblas.a LIBGFORTRAN_PATH=/path/to/libgfortran.a

Note that if your BLAS libraries are not in standard locations cmake will fail to find them. However, you can tell it what folder to look in by replacing cmake .. with a statement such as:

cmake -DCMAKE_LIBRARY_PATH=/home/me/place/i/put/blas/lib ..

Using MITIE from a Python 2.7 program

Once you have built the MITIE shared library, you can go to the examples/python folder and simply run any of the Python scripts. Each script is a tutorial explaining some aspect of MITIE: named entity recognition and relation extraction, training a custom NER tool, or training a custom relation extractor.
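As a rough sketch of what a named entity recognition script boils down to, the snippet below loads a model, tokenizes a sentence, and reports the entities. It assumes the mitie module is importable and that the English model was extracted to MITIE-models/english/ner_model.dat. The helper names entity_text, describe_entities, and run_ner are made up for illustration, and the (range, tag, score) tuple layout is inferred from the output quoted in the issues further down rather than from a formal API reference, so treat it as an assumption that may not hold for every MITIE version.

```python
def entity_text(tokens, token_range):
    """Join the tokens covered by an entity's token range into one string."""
    words = [tokens[i] for i in token_range]
    # Assumption: under Python 3 the bindings may return bytes tokens.
    words = [w.decode("utf-8") if isinstance(w, bytes) else w for w in words]
    return " ".join(words)


def describe_entities(tokens, entities):
    """Turn (range, tag, score) tuples into (tag, text, score) tuples."""
    return [(e[1], entity_text(tokens, e[0]), e[2]) for e in entities]


def run_ner(text, model_path="MITIE-models/english/ner_model.dat"):
    """Load the model and extract entities. Requires the mitie module."""
    import mitie  # imported lazily so the helpers above work without MITIE

    ner = mitie.named_entity_extractor(model_path)
    tokens = mitie.tokenize(text)
    return describe_entities(tokens, ner.extract_entities(tokens))
```

Calling run_ner("Davis King works at MIT.") should then return a list of (tag, text, score) tuples, provided the model file is present at the assumed path.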

You can also install mitie directly from GitHub with this command: pip install git+https://github.com/mit-nlp/MITIE.git.

Using MITIE from R

MITIE can be installed as an R package. See the README for more details.

Using MITIE from a C program

There are example C programs in the examples/C folder. To compile any of them you simply go into those folders and run make. Or use CMake like so:

cd examples/C/ner
mkdir build
cd build
cmake ..
cmake --build . --config Release

Using MITIE from a C++ program

There are example C++ programs in the examples/cpp folder. To compile any of them you simply go into those folders and run make. Or use CMake like so:

cd examples/cpp/ner
mkdir build
cd build
cmake ..
cmake --build . --config Release

Using MITIE from a Java program

There is an example Java program in the examples/java folder. Before you can run it you must compile MITIE's java interface which you can do like so:

cd mitielib/java
mkdir build
cd build
cmake ..
cmake --build . --config Release --target install

That will place a javamitie shared library and jar file into the mitielib folder. Once you have those two files you can run the example program in examples/java by running run_ner.bat if you are on Windows or run_ner.sh if you are on a POSIX system like Linux or OS X.

Also note that you must have Swig 1.3.40 or newer, CMake 2.8.4 or newer, and the Java JDK installed to compile the MITIE interface. Finally, note that if you are using 64bit Java on Windows then you will need to use a command like:

cmake -G "Visual Studio 10 Win64" ..

instead of cmake .. so that Visual Studio knows to make a 64bit library.

Running MITIE's unit tests

You can run a simple regression test to validate your build. Do this by running the following command from the top level MITIE folder:

make test

make test both builds the example programs and downloads the required example models. If you require a non-standard C++ compiler, change CC in examples/C/makefile and in tools/ner_stream/makefile.

Precompiled Python 2.7 binaries

We have built Python 2.7 binaries packaged with sample models for 64bit Linux and Windows (both 32 and 64 bit versions of Python). You can download the precompiled package here: Precompiled MITIE 0.2

Precompiled Java 64bit binaries

We have built Java binaries for the 64bit JVM which work on Linux and Windows. You can download the precompiled package here: Precompiled Java MITIE 0.3. In the file is an examples/java folder. You can run the example by executing the provided .bat or .sh file.

Citing MITIE

There isn't any paper specifically about MITIE. However, since MITIE is basically just a thin wrapper around dlib please cite dlib's JMLR paper if you use MITIE in your research:

Davis E. King. Dlib-ml: A Machine Learning Toolkit. Journal of Machine Learning Research 10, pp. 1755-1758, 2009

@Article{dlib09,
  author = {Davis E. King},
  title = {Dlib-ml: A Machine Learning Toolkit},
  journal = {Journal of Machine Learning Research},
  year = {2009},
  volume = {10},
  pages = {1755-1758},
}

License

MITIE is licensed under the Boost Software License - Version 1.0 - August 17th, 2003.

Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the "Software") to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so, all subject to the following:

The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

References

[1] Davis E. King. Dlib-ml: A Machine Learning Toolkit. Journal of Machine Learning Research 10, pp. 1755-1758, 2009.

[2] Paramveer Dhillon, Dean Foster and Lyle Ungar, Eigenwords: Spectral Word Embeddings, Journal of Machine Learning Research (JMLR), 16, 2015.

[3] T. Joachims, T. Finley, Chun-Nam Yu, Cutting-Plane Training of Structural SVMs, Machine Learning, 77(1):27-59, 2009.

mitie's People

Contributors

arjunmajum, avitale, baali, benhoff, davisking, dynamite-ready, eamonnbell, hannometer, jinyichao, kecsap, lopuhin, mcelvg, myeesw, paralax, scotthaleen, slamj1, steve98654, str-janus, swadey, vinvinod, weiwujiang, wihoho

mitie's Issues

Best way to replace stemmer or have multiple stemmers?

Hi Davis,

Looking at the code, it seems to me that everything is language agnostic apart from the english stemmer used in text categorization.

What would be the best way to replace the stemmer with another one, or even better have multiple stemmers for different languages?

Thank you very much!

How to bootstrap model with known entities?

Hi Davis,

Thank you so much for this high performance open source library. I have one question that I couldn't find an answer to wrt training the entity recognizer.

I would like to take advantage of already known entities, but also be able to recognize entities not already known to the dictionary. For example, the Wikidata project provides millions of entities and it would be nice to seed the model with those known entities. A couple of approaches I can think of:

  1. Train a new model using whatever training data I can gather. Load known entities into a dictionary. At runtime, say if I am working with a sentence, identify known entities as well as run the sentence through the ner model. Then reconcile the two with the dictionary based reco overriding any conflicting judgements. I wrote this, but don't think this is a good idea.
  2. Generate a large set of training data by plugging in already known entities. E.g. knowing "Davis King" and "MIT" are entities, generate a training sentence "This library is from Davis King of MIT". I would think this approach's results will be heavily influenced by the variation of the filler text generated as part of the training set.
  3. How would you go about doing this? Is there a straight forward technique to seed the model with known entities or a recommended technique to supplement the model with a dictionary?
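The second approach above (plugging known entities into filler sentences) can be sketched as follows. This is only an illustration under stated assumptions: the "{ENT}" placeholder, the templates_by_label layout, and the function name make_training_examples are all hypothetical, and the output is a plain list of (tokens, span, label) tuples rather than anything tied to a particular training API.

```python
def make_training_examples(templates_by_label, entities):
    """Fill each template slot with a known entity and record its token span.

    templates_by_label: dict mapping a label to token lists that each contain
                        exactly one "{ENT}" placeholder (hypothetical format).
    entities:           list of (entity_string, label) pairs.
    Returns a list of (tokens, (start, end), label), with end exclusive.
    """
    examples = []
    for name, label in entities:
        name_tokens = name.split()
        for template in templates_by_label.get(label, []):
            slot = template.index("{ENT}")
            tokens = template[:slot] + name_tokens + template[slot + 1:]
            examples.append((tokens, (slot, slot + len(name_tokens)), label))
    return examples


templates_by_label = {
    "person": [["This", "library", "is", "from", "{ENT}", "."]],
    "organization": [["She", "studied", "at", "{ENT}", "."]],
}
entities = [("Davis King", "person"), ("MIT", "organization")]

for tokens, span, label in make_training_examples(templates_by_label, entities):
    print(label, span, " ".join(tokens))
```

As the post anticipates, a model trained this way can only be as varied as the templates, so a handful of fillers is likely to teach it the templates more than the entities.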

How to make multiple models share the same extractor?

Hi Davis,

Thanks for your help always.

We are always trying to reduce memory usage. Since we normally cannot control the extractor, we hope that multiple models can at least share the same extractor.

The current C++ implementation does not use pointers, so there seems to be no way to share the extractor among multiple models. I tried the following code in three cases.

TotalWordFeatureExtractor totalWordFeatureExtractor = TotalWordFeatureExtractor.getEnglishExtractor();
NamedEntityExtractor ner = new NamedEntityExtractor(file.getAbsolutePath(), totalWordFeatureExtractor);

The above code consumes around 680 MB JVM memory.

TotalWordFeatureExtractor totalWordFeatureExtractor = TotalWordFeatureExtractor.getEnglishExtractor();
NamedEntityExtractor ner = new NamedEntityExtractor(file.getAbsolutePath(), totalWordFeatureExtractor);
NamedEntityExtractor ner2 = new NamedEntityExtractor(file.getAbsolutePath(), totalWordFeatureExtractor);

The above code consumes around 975 MB JVM memory.

TotalWordFeatureExtractor totalWordFeatureExtractor = TotalWordFeatureExtractor.getEnglishExtractor();
NamedEntityExtractor ner = new NamedEntityExtractor(file.getAbsolutePath(), totalWordFeatureExtractor);
NamedEntityExtractor ner2 = new NamedEntityExtractor(file.getAbsolutePath(), totalWordFeatureExtractor);
NamedEntityExtractor ner3 = new NamedEntityExtractor(file.getAbsolutePath(), totalWordFeatureExtractor);

The above code consumes around 1.26 GB JVM memory.

For the detailed code, please refer to the following link.
https://github.com/wihoho/MITIE/blob/master/mitielib/java/maven/src/test/java/edu/mit/ll/mitie/NamedEntityExtractorTest.java#L41

Obviously, this is not what we want. Ideally, memory usage would stay at around 690 MB even with three different models. So I assume that using pointers in the C++ code is the only way to overcome this issue. We would like your opinion on how to resolve it, because we are not very experienced with C++.

Thank you.
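The arithmetic behind the three measurements above can be spelled out in a few lines. The figures are the ones reported in this post, with 1.26 GB taken as 1260 MB (an assumption about the unit convention). The calculation only shows the extra memory per additional NamedEntityExtractor; it does not reveal how much of that is the feature extractor versus the model itself.

```python
# Reported JVM memory in MB, keyed by the number of NamedEntityExtractor objects.
usage_mb = {1: 680, 2: 975, 3: 1260}

# Extra memory observed for each additional extractor object.
increments = [usage_mb[n] - usage_mb[n - 1] for n in (2, 3)]
average_increment = sum(increments) / len(increments)

print(increments)         # [295, 285]
print(average_increment)  # 290.0
```

A roughly constant increment of about 290 MB per object is what one would expect if each object held its own copy of some large structure, which is consistent with the concern raised here.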

UTF-8 problems

Hi,

First of all, let me thank you for this great tool.
We are using MITIE via Python 2.7. To the best of my knowledge, we have to convert our strings from unicode to plain bytes before passing them to MITIE.
When using tokenize_with_offset, this can produce an offset that falls in the middle of a Unicode character spanning multiple bytes, which results in "UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-5: unexpected end of data" when we try to decode.

Any ideas?

Many thanks,
Jakub
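The error quoted in this post is easy to reproduce without MITIE, which may help narrow down where the bad slice happens. The sketch below assumes the offsets are byte positions into the UTF-8 encoded string; the helper token_from_bytes and the sample sentence are made up for illustration.

```python
text = u"na\u00efve caf\u00e9 menu"   # "naive cafe menu" with accented letters
data = text.encode("utf-8")           # the plain bytes that get passed on

# Cutting the byte string in the middle of a multi-byte character fails.
try:
    data[:3].decode("utf-8")          # splits the two-byte i-diaeresis
    failed = False
except UnicodeDecodeError:
    failed = True


def token_from_bytes(data, byte_offset, byte_length):
    """Decode one token given its starting byte offset and length in bytes."""
    return data[byte_offset:byte_offset + byte_length].decode("utf-8")


# The second word starts at byte 7 (6 bytes for the first word plus a space)
# and is 5 bytes long because its last letter takes two bytes.
print(failed, token_from_bytes(data, 7, 5))
```

Slicing by a token's byte length rather than its character length keeps every cut on a character boundary, as long as the starting offset itself is a valid boundary.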

Is there a Citation I can give?

I'm currently writing a paper, and I'd like to include a reference to the MITIE system. However, there's no mention of it on the CSAIL (?) site, nor any author names, etc.

Can you suggest a paper or technical report that I could cite? - it would be very helpful (and possibly good for the authors too).

All the Best (and thank you for releasing such excellent software)
Martin
:-)

Amazon always (or at least incredibly often) tagged as a location and McDonald's as a person

Hello,

First I'd like to thank the makers of MITIE. It's great to have a state-of-the-art NER free for commercial use. I've been finding that it has issues identifying Amazon and McDonald's as companies, as described below.

Throughout this post I'll run the following code with different input and display the output:

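# Extract the article text with eatiht, tokenize it, and run the NER model.
toks = mitie.tokenize(eatiht.extract(fd))
ents = ner.extract_entities(toks)
# Each entity is (token range, tag, score); show (tag, entity text, score).
[(e[1], ' '.join(toks[e[0][0]:e[0][-1]+1]), e[2]) for e in ents]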

I've been finding that Amazon is usually tagged as a location. Here's an example of such a text:

"Amazon is among a host of companies asking the Federal Aviation Administration to expand the abilities of small commercial drones and the traffic control system that would monitor them."

which yields

[('LOCATION', 'Amazon', 0.6497587837582173),
('ORGANIZATION', 'Federal Aviation Administration', 1.2338242977625724)]

Here's another:

'Its earnings season again and Amazon, for the first time ever, has broken out the financial results of its cloud services division, Amazon Web Services (AWS). The results are impressive. In less than a decade, Amazon has grown AWS into a $5 billion business that is still growing at 50%.'

which yields

[('LOCATION', 'Amazon', 0.6857005959408439),
('ORGANIZATION', 'Amazon Web Services', 0.8302322174217054),
('LOCATION', 'Amazon', 0.5971181888647527)]

I've been finding similar problems with McDonald's. Take this story, for example: http://www.foxnews.com/leisure/2015/04/27/not-lovin-it-mcdonald-is-trying-to-fix-its-business/

which yields

[('PERSON', 'McDonald', 0.5934354493699665),
('PERSON', 'McDonald', 0.6011390084587678),
('PERSON', 'Steve Easterbrook', 1.62707021179198),
('PERSON', 'McDonald', 0.7252897014392439),
('MISC', 'Turnaround Summit', 0.5459067014161887),
('PERSON', 'McDonald', 0.733597056118872),
('PERSON', 'McDonald', 0.5221199692486291),
('LOCATION', 'San Diego', 1.4569853574876104),
('PERSON', 'John Gordon', 1.4067061157394036),
('PERSON', 'McDonald', 0.6262266486680956),
('PERSON', 'McDonald', 0.5300773536597632),
('LOCATION', 'U.S.', 1.3314469880747613),
('PERSON', 'McDonald', 0.6102694086803659),
('PERSON', 'Paul Shapiro', 1.8008827543760988),
('LOCATION', 'United States', 1.1536681812718963),
('ORGANIZATION', 'McDonald', 0.7478091010431832),
('PERSON', 'Shapiro', 1.1529211339569105),
('ORGANIZATION', 'Chipotle', 0.27411100096279334),
('ORGANIZATION', 'Sofritas', 0.20042503949162985),
('PERSON', 'Denny', 1.0445957119600875),
('PERSON', 'Johnny Rockets', 0.6257758817370944),
('LOCATION', 'White Castle', 0.550052599194086),
('ORGANIZATION', 'McDonald', 0.5866372774064789),
('PERSON', 'McDonald', 0.6089171335562328),
('PERSON', 'Shapiro', 1.4221506683701266),
('ORGANIZATION', 'Chicago Tribune', 0.5300056649547958),
('PERSON', 'Easterbrook', 1.3357314682060835),
('PERSON', 'McDonald', 0.5254242850102231),
('MISC', 'burger', 0.20863861829526598),
('PERSON', 'McDonald', 0.6882332877969929),
('LOCATION', 'U.S.', 1.152526189418561),
('PERSON', 'Robert Reich', 1.3686463530945003),
('LOCATION', 'U.S.', 0.9064906053018911),
('PERSON', 'Clinton', 0.8424408210970944),
('PERSON', 'McDonald', 0.8099967902497076),
('PERSON', 'McDonald', 0.5726923125964373),
('PERSON', 'Laura Ries', 1.2960758120524078),
('ORGANIZATION', 'Ries & Ries', 0.8948838138088949),
('LOCATION', 'Atlanta', 1.4866648917001795),
('PERSON', 'McDonald', 0.7071161079471816)]

This means that out of the 16 times McDonald's occurred, it was tagged as a PERSON 14 times and as an ORGANIZATION 2 times.
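A tally like the one above can be produced mechanically instead of by eye. The sketch below counts tags per surface string from (tag, text, score) tuples; the function name tag_counts is hypothetical and the sample holds only a few rows copied from the output above, not the full list.

```python
from collections import Counter


def tag_counts(entities, surface):
    """Count how often `surface` received each tag in (tag, text, score) tuples."""
    return Counter(tag for tag, text, _score in entities if text == surface)


# A few rows copied from the output above.
sample = [
    ("PERSON", "McDonald", 0.5934354493699665),
    ("PERSON", "Steve Easterbrook", 1.62707021179198),
    ("ORGANIZATION", "McDonald", 0.7478091010431832),
    ("PERSON", "McDonald", 0.6089171335562328),
]
print(tag_counts(sample, "McDonald"))
```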

MITIE very often does quite well from what I've seen. There seem to be some major companies, though, for which it doesn't. I know that we can train our own models using MITIE, but it just seems like it should work for one of the most prolific companies of our time (Amazon) and a company as prevalent as McDonald's. I don't think the problem is a training/inference domain mismatch, since the English section of the CoNLL 2003 NER task was a Reuters corpus and I'm applying it to newswire. Beyond training my own model, all I can think to do is find the cases MITIE gets terribly wrong and create custom rules that change the output of the NER. Do you have any other suggestions?
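The custom-rule idea mentioned in this post could look something like the sketch below: a lookup table that overrides the tag for specific entity strings after the NER has run. The function name apply_overrides and the rule table are hypothetical, and the input uses the (tag, text, score) layout of the output shown earlier in the post.

```python
def apply_overrides(entities, overrides):
    """Replace the tag of any entity whose text has a hand-written rule.

    entities:  list of (tag, text, score) tuples.
    overrides: dict mapping entity text to the tag it should carry.
    """
    return [(overrides.get(text, tag), text, score) for tag, text, score in entities]


# Hypothetical rules for the two companies discussed here.
overrides = {"Amazon": "ORGANIZATION", "McDonald": "ORGANIZATION"}

fixed = apply_overrides(
    [("LOCATION", "Amazon", 0.6497587837582173),
     ("ORGANIZATION", "Federal Aviation Administration", 1.2338242977625724)],
    overrides,
)
print(fixed)
```

Such a table is blunt: it would also relabel a genuine mention of the Amazon river or of a person named McDonald, so it trades one class of error for another.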

Errors Building MITIE for Java

/Users/davidlaxer/MITIE/dlib/dlib/gui_widgets/nativefont.h:29:10: fatal error:
'X11/Xlocale.h' file not found

#include <X11/Xlocale.h>
         ^

1 error generated.

OS X 10.10.4.

David-Laxers-MacBook-Pro:MITIE davidlaxer$ xcode-select --install
xcode-select: error: command line tools are already installed, use "Software Update" to install updates

David-Laxers-MacBook-Pro:java davidlaxer$ pwd
/Users/davidlaxer/MITIE/mitielib/java

David-Laxers-MacBook-Pro:java davidlaxer$ java -version
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)
David-Laxers-MacBook-Pro:java davidlaxer$

-- The C compiler identification is AppleClang 6.1.0.6020053
-- The CXX compiler identification is AppleClang 6.1.0.6020053

David-Laxers-MacBook-Pro:java davidlaxer$ mkdir build
David-Laxers-MacBook-Pro:java davidlaxer$ cmake ..
-- The C compiler identification is AppleClang 6.1.0.6020053
-- The CXX compiler identification is AppleClang 6.1.0.6020053
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for png_create_read_struct
-- Looking for png_create_read_struct - found
-- Looking for jpeg_read_header
-- Looking for jpeg_read_header - found
-- Searching for BLAS and LAPACK
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of void*
-- Check size of void* - done
-- Found OpenBLAS library
-- Looking for sgetrf_single
-- Looking for sgetrf_single - not found
-- Found LAPACK library
-- Looking for cblas_ddot
-- Looking for cblas_ddot - found
-- Check for STD namespace
-- Check for STD namespace - found
-- Looking for C++ include iostream
-- Looking for C++ include iostream - found
-- Configuring done
CMake Warning (dev):
Policy CMP0042 is not set: MACOSX_RPATH is enabled by default. Run "cmake
--help-policy CMP0042" for policy details. Use the cmake_policy command to
set the policy and suppress this warning.

MACOSX_RPATH is not specified for the following targets:

mitie

This warning is for project developers. Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /Users/davidlaxer/MITIE/mitielib/java
David-Laxers-MacBook-Pro:java davidlaxer$ cmake --build . --config Release --target install
Scanning dependencies of target dlib
[ 0%] Building CXX object dlib_build/CMakeFiles/dlib.dir/base64/base64_kernel_1.o
[ 1%] Building CXX object dlib_build/CMakeFiles/dlib.dir/bigint/bigint_kernel_1.o
[ 2%] Building CXX object dlib_build/CMakeFiles/dlib.dir/bigint/bigint_kernel_2.o
[ 3%] Building CXX object dlib_build/CMakeFiles/dlib.dir/bit_stream/bit_stream_kernel_1.o
[ 4%] Building CXX object dlib_build/CMakeFiles/dlib.dir/entropy_decoder/entropy_decoder_kernel_1.o
[ 5%] Building CXX object dlib_build/CMakeFiles/dlib.dir/entropy_decoder/entropy_decoder_kernel_2.o
[ 6%] Building CXX object dlib_build/CMakeFiles/dlib.dir/entropy_encoder/entropy_encoder_kernel_1.o
[ 7%] Building CXX object dlib_build/CMakeFiles/dlib.dir/entropy_encoder/entropy_encoder_kernel_2.o
[ 8%] Building CXX object dlib_build/CMakeFiles/dlib.dir/md5/md5_kernel_1.o
[ 9%] Building CXX object dlib_build/CMakeFiles/dlib.dir/tokenizer/tokenizer_kernel_1.o
[ 10%] Building CXX object dlib_build/CMakeFiles/dlib.dir/unicode/unicode.o
[ 11%] Building CXX object dlib_build/CMakeFiles/dlib.dir/data_io/image_dataset_metadata.o
[ 12%] Building CXX object dlib_build/CMakeFiles/dlib.dir/sockets/sockets_kernel_1.o
[ 13%] Building CXX object dlib_build/CMakeFiles/dlib.dir/bsp/bsp.o
[ 14%] Building CXX object dlib_build/CMakeFiles/dlib.dir/dir_nav/dir_nav_kernel_1.o
[ 15%] Building CXX object dlib_build/CMakeFiles/dlib.dir/dir_nav/dir_nav_kernel_2.o
[ 16%] Building CXX object dlib_build/CMakeFiles/dlib.dir/dir_nav/dir_nav_extensions.o
[ 17%] Building CXX object dlib_build/CMakeFiles/dlib.dir/linker/linker_kernel_1.o
[ 18%] Building CXX object dlib_build/CMakeFiles/dlib.dir/logger/extra_logger_headers.o
[ 19%] Building CXX object dlib_build/CMakeFiles/dlib.dir/logger/logger_kernel_1.o
[ 20%] Building CXX object dlib_build/CMakeFiles/dlib.dir/logger/logger_config_file.o
[ 20%] Building CXX object dlib_build/CMakeFiles/dlib.dir/misc_api/misc_api_kernel_1.o
[ 21%] Building CXX object dlib_build/CMakeFiles/dlib.dir/misc_api/misc_api_kernel_2.o
[ 22%] Building CXX object dlib_build/CMakeFiles/dlib.dir/sockets/sockets_extensions.o
[ 23%] Building CXX object dlib_build/CMakeFiles/dlib.dir/sockets/sockets_kernel_2.o
[ 24%] Building CXX object dlib_build/CMakeFiles/dlib.dir/sockstreambuf/sockstreambuf.o
[ 25%] Building CXX object dlib_build/CMakeFiles/dlib.dir/sockstreambuf/sockstreambuf_unbuffered.o
[ 26%] Building CXX object dlib_build/CMakeFiles/dlib.dir/server/server_kernel.o
[ 27%] Building CXX object dlib_build/CMakeFiles/dlib.dir/server/server_iostream.o
[ 28%] Building CXX object dlib_build/CMakeFiles/dlib.dir/server/server_http.o
[ 29%] Building CXX object dlib_build/CMakeFiles/dlib.dir/threads/multithreaded_object_extension.o
[ 30%] Building CXX object dlib_build/CMakeFiles/dlib.dir/threads/threaded_object_extension.o
[ 31%] Building CXX object dlib_build/CMakeFiles/dlib.dir/threads/threads_kernel_1.o
[ 32%] Building CXX object dlib_build/CMakeFiles/dlib.dir/threads/threads_kernel_2.o
[ 33%] Building CXX object dlib_build/CMakeFiles/dlib.dir/threads/threads_kernel_shared.o
[ 34%] Building CXX object dlib_build/CMakeFiles/dlib.dir/threads/thread_pool_extension.o
[ 35%] Building CXX object dlib_build/CMakeFiles/dlib.dir/timer/timer.o
[ 36%] Building CXX object dlib_build/CMakeFiles/dlib.dir/stack_trace.o
[ 37%] Building CXX object dlib_build/CMakeFiles/dlib.dir/gui_widgets/fonts.o
In file included from /Users/davidlaxer/MITIE/dlib/dlib/gui_widgets/fonts.cpp:14:
/Users/davidlaxer/MITIE/dlib/dlib/gui_widgets/nativefont.h:29:10: fatal error:
'X11/Xlocale.h' file not found

#include <X11/Xlocale.h>
         ^

1 error generated.
dlib_build/CMakeFiles/dlib.dir/build.make:974: recipe for target 'dlib_build/CMakeFiles/dlib.dir/gui_widgets/fonts.o' failed
gmake[2]: *** [dlib_build/CMakeFiles/dlib.dir/gui_widgets/fonts.o] Error 1
CMakeFiles/Makefile2:122: recipe for target 'dlib_build/CMakeFiles/dlib.dir/all' failed
gmake[1]: *** [dlib_build/CMakeFiles/dlib.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
gmake: *** [all] Error 2
David-Laxers-MacBook-Pro:java davidlaxer$

How to create a svm model ?

Hello,
I have just started with MITIE and would like to understand how I can create a model for binary relation detection.
Unfortunately I need Italian models, so I must create them from scratch. Reading the examples, I found a good way to create NER models (here: https://github.com/mit-nlp/MITIE/blob/master/examples/java/TrainNerExample.java)

and here: https://github.com/mit-nlp/MITIE/blob/master/examples/java/NerExample.java for BinaryRelationDetector. The problem is: how can I create a .svm file, and how can I train it?

Is there a tutorial on how to create a good binary relation model?

Thank you for your support!

wordrep raises a std::bad_alloc error when training on new corpus

I've tried to use wordrep to train word embeddings on a specialized corpus. However, a std::bad_alloc occurs during the process:

number of raw ASCII files found: 127
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 127
Sample 50000000 random context vectors
Now do CCA (left size: 50000000, right size: 50000000).
std::bad_alloc

I've tried to launch gdb (after compiling in DEBUG mode) to see the error, but there is no stacktrace.
The training data is quite large (though it fits in RAM), and it contains some non-ASCII characters (the files are encoded in Unicode). Could this be due to encoding?

Train custom NER model in Java

Hi guys,
I am using MITIE for Java. I see that your example Python program shows how to train the NER with a sentence and indices.

How would I do the same thing in Java?

For example, I need to train the NER, to extract the 'holiday' entity from: When is Christmas?
I am using the compiled java library you linked to.

Is the NER process fast, and does it consume a lot of memory?

P.S. By the way, for NER, is this library better than OpenNLP and others? What's the advantage in using this library over others? Does it have state of the art advanced algorithms?

tokenising misaligned for java strings?

Hi,

I am observing some misalignment of the token indexing when calling Java method edu.mit.ll.mitie.mitie.tokenizeWithOffsets with a string containing multi-byte characters. For example, calling it with the string funny “quotation marks” and so on produces tokens:

Token 0: index=0, value=<funny>
Token 1: index=6, value=<“quotation>
Token 2: index=19, value=<marks”>
Token 3: index=28, value=<and>
Token 4: index=32, value=<so>
Token 5: index=35, value=<on>

Note that the quotation marks are Unicode U+201C and U+201D. Token 2's index (19) is out by 2 (it should be 17), and token 3's (28) is out by 4 (it should be 24). So there appear to be cumulative indexing errors when "multi-byte" characters are encountered.
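The reported indices match byte positions in the UTF-8 encoding of the string, where each curly quote takes three bytes instead of one character. Under that assumption, a small conversion recovers the expected character indices. The sketch is in Python for brevity and the function name byte_to_char_offset is made up; for characters in the Basic Multilingual Plane, such as these quotes, the resulting positions coincide with Java string indices.

```python
def byte_to_char_offset(text, byte_offset):
    """Convert an offset into the UTF-8 encoding of `text` to a character offset."""
    prefix = text.encode("utf-8")[:byte_offset]
    return len(prefix.decode("utf-8"))


text = "funny \u201cquotation marks\u201d and so on"
reported = [0, 6, 19, 28, 32, 35]   # indices from the token listing above
corrected = [byte_to_char_offset(text, b) for b in reported]
print(corrected)  # [0, 6, 17, 24, 28, 31]
```

The helper assumes every offset falls on a character boundary; an offset in the middle of a multi-byte character would make the decode step raise UnicodeDecodeError.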

python 3 error

Can MITIE work with Python 3?
This is what I get if I try one of the train_ner examples:

Traceback (most recent call last):
  File "p3train.py", line 11, in <module>
    trainer = ner_trainer(u"../../MITIE-models/english/total_word_feature_extractor.dat")
  File "/home/data/experim/MITIE/examples/python/../../mitielib/mitie.py", line 403, in __init__
    self.__obj = _f.mitie_create_ner_trainer(filename)
ctypes.ArgumentError: argument 1: <class 'TypeError'>: wrong type
Exception ignored in: <bound method ner_trainer.__del__ of <mitie.ner_trainer object at 0x7fa2bdeb8828>>
Traceback (most recent call last):
  File "/home/data/experim/MITIE/examples/python/../../mitielib/mitie.py", line 409, in __del__
    self.__mitie_free(self.__obj)
AttributeError: 'ner_trainer' object has no attribute '_ner_trainer__mitie_free'
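One plausible reading of the first traceback is that a Python 3 str was passed where the C function expects a char*, which ctypes only accepts as bytes. That is an assumption about the wrapper, not something the traceback proves. The sketch below shows the behaviour with ctypes alone; the helper name to_c_string is hypothetical.

```python
import ctypes


def to_c_string(value, encoding="utf-8"):
    """Return bytes suitable for a ctypes c_char_p argument."""
    return value.encode(encoding) if isinstance(value, str) else value


path = "../../MITIE-models/english/total_word_feature_extractor.dat"

# A Python 3 str is rejected where a char* is expected...
try:
    ctypes.c_char_p(path)
    rejected = False
except TypeError:
    rejected = True

# ...while the encoded bytes are accepted.
arg = ctypes.c_char_p(to_c_string(path))
print(rejected, arg.value)
```

The second traceback (the AttributeError in __del__) looks like a follow-on effect: if __init__ fails before the attribute is set, the destructor has nothing to free.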

xpaths for in-browser display

For display of highlights natively in a web browser as a user visits pages in the wild, it is necessary to pass XPath-based offsets to the highlighting JavaScript code running in the browser. It doesn't appear that MITIE generates XPath offsets currently. This ticket is a feature request for adding such output to MITIE.

See here for further details:

It is possible to wrap a tool like MITIE with a converter that reparses the HTML and generates XPath offsets. The python example linked below works in more than half of web pages, but has issues with incomplete tags. We have experience solving this issue in some other contexts. If this feature request gets prioritized by MITIE, we would be glad to help MITIE engineers.

https://github.com/trec-kba/streamcorpus-pipeline/blob/master/streamcorpus_pipeline/offsets.py#L341-L425

Updating a trained model

Hi Team,

Could you please help me update a saved model with new training data? Do I have to run all of the original training samples, along with the new samples, through a trainer and create a new model in order to update an existing one? I tried serializing the trainer to reuse it. However, since it holds pointer references, I am not able to save the trainer. I am trying to periodically update my model with new samples.

Kind Regards,
rp

Add shared library file for mac os in Precompiled Java 64bit binaries

Hi there,

I am working on integrating MITIE with Apache Tika along with Prof. Chris Mattmann. It would be really helpful if the shared library for Mac OS were also added to the precompiled binaries and published to the Maven Central repository. Tika could then use the jar directly as an external jar to enable MITIE named entity recognition.

Support for text classification

Hi guys,

I am working on a C++ project that requires named entity recognition and text classification.
For the entity recognition, I discovered this MITIE library, which is fast and excellent.
I would also really like to see text classification (C++) built into the project, since entity recognition and text classification can complement each other.

For example, OpenNLP has text classification. Can you guys create a similar text classifier that takes training data in tab-separated format?

Example training data:
category text

If you guys do not have this or are not going to implement it, can you please point me to a C++ library that has good text classification? I am aware of implementations in other languages, but I need to do this in C++.

It would be really great if you machine learning wizards can include a C++ library for text classification!

How to compile on Windows?

Hi Davis,

I am really sorry to disturb you again, but I could not build the Java interface on Windows 7.

The environment I use is as follows:

  • CMake 3.4.1
  • MinGW
  • swig-3.0.8

The error message is as follows:

-- The C compiler identification is unknown
-- The CXX compiler identification is unknown
-- Check for working C compiler using: Visual Studio 10 2010 Win64
-- Check for working C compiler using: Visual Studio 10 2010 Win64 -- broken
CMake Error at C:/Program Files (x86)/cmake-3.4.1-win32-x86/share/cmake-3.4/Modules/CMakeTestCCompiler.cmake:61 (message):
  The C compiler "C:/MinGW/bin/gcc.exe" is not able to compile a simple test
  program.

  It fails with the following output:

   Change Dir: F:/Download/MITIE-master/mitielib/java/build/CMakeFiles/CMakeTmp



  Run Build
  Command:"C:/Windows/Microsoft.NET/Framework/v4.0.30319/MSBuild.exe"
  "cmTC_f7860.vcxproj" "/p:Configuration=Debug" "/p:VisualStudioVersion=10.0"

  Microsoft (R) Build Engine version 4.0.30319.18408

  [Microsoft .NET Framework, version 4.0.30319.18444]

  Copyright (C) Microsoft Corporation.  All rights reserved.



  Build started 8/1/2016 6:59:25 PM.

  Project
  "F:\Download\MITIE-master\mitielib\java\build\CMakeFiles\CMakeTmp\cmTC_f7860.vcxproj"
  on node 1 (default targets).


  F:\Download\MITIE-master\mitielib\java\build\CMakeFiles\CMakeTmp\cmTC_f7860.vcxproj(27,3):
  error MSB4019: The imported project "F:\Microsoft.Cpp.Default.props" was
  not found.  Confirm that the path in the <Import> declaration is correct,
  and that the file exists on disk.

  Done Building Project
  "F:\Download\MITIE-master\mitielib\java\build\CMakeFiles\CMakeTmp\cmTC_f7860.vcxproj"
  (default targets) -- FAILED.



  Build FAILED.




  "F:\Download\MITIE-master\mitielib\java\build\CMakeFiles\CMakeTmp\cmTC_f7860.vcxproj"
  (default target) (1) ->

    F:\Download\MITIE-master\mitielib\java\build\CMakeFiles\CMakeTmp\cmTC_f7860.vcxproj(27,3): error MSB4019: The imported project "F:\Microsoft.Cpp.Default.props" was not found. Confirm that the path in the <Import> declaration is correct, and that the file exists
 on disk.



      0 Warning(s)
      1 Error(s)



  Time Elapsed 00:00:00.02





  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:9 (project)


-- Configuring incomplete, errors occurred!
See also "F:/Download/MITIE-master/mitielib/java/build/CMakeFiles/CMakeOutput.log".
See also "F:/Download/MITIE-master/mitielib/java/build/CMakeFiles/CMakeError.log".

I would really appreciate your help with building on Windows. Thanks a lot.

Memory growth in c++ code

Hello Davis,

I am training an NER model on about 63 sentences, each with a lot of keywords. The data is primarily Wikipedia extracts, and I am training on the words for a specific domain.

My cpp code is @ https://github.com/skprasadu/wikipedia-extraction-framework/blob/master/train_movie_relation/train_my_ner.cpp . And my main code @ https://github.com/skprasadu/wikipedia-extraction-framework/blob/master/train_movie_relation/train_relation_extraction_example.cpp calls the train method.

When I run the training process, it takes a long time. That is understandable, but the memory growth is crazy: it uses close to 16 GB to complete the NER training. Can you see what I am doing wrong?

Can you tell me how to optimize this?

Thanks
Krishna

Coreference ability

For my project I also need coreference resolution between sentences to extract information. Is there a good way to do this with MITIE? Any help or pointers would be appreciated.

Thanks

[Documentation] train-ner script

Hi,

using the ner_conll example I found the train-ner script in MITIE/tools/ner_conll/train-ner:

./ner --train-chunker eng.train_all_some_sentences_combined $THREADS > $LOG
./ner --train-id      eng.train_all_some_sentences_combined $THREADS >> $LOG

./ner --test-id eng.testb >> $LOG
./ner --tag-conll-file ner_model.dat eng.testb | ./conlleval
echo $LOG

As I cannot find an executable called "ner", I'm wondering if it should be train-ner instead, but the train-ner command line arguments only provide:

parser.add_option("h", "Display this help information.");
parser.add_option("train", "train named_entity_extractor on CoNLL data.");
parser.add_option("test", "test named_entity_extractor on CoNLL data.");
parser.add_option("threads", "Use <arg> threads when doing training (default: 4).",1);
parser.add_option("tag-conll-file", "Read in a CoNLL annotation file and output a copy that is tagged with a MITIE NER model.");

See MITIE/tools/ner_conll/src/main.cpp.

So my question is whether that small train-ner script should be updated.

Thanks in advance (+ I'm really looking forward to training a model for German),

Stefan

Is it possible to reduce the size of the model

Hi, I've come across this library, and found it is really amazing! The accuracy is even better than Stanford NER demo!

Although I understand it contains a high dimensional space with over 500,000 dimensions, is it possible to reduce the model size?

[ner_stream] unclosed tags for non-sentence line

Davis,

Just found a small cosmetic error (a missing closing ']') in the output ner_stream produces for the last detected entity when the line is not a proper sentence (not ended with punctuation).

$ ./ner_stream MITIE-models/ner_model.dat

Loading MITIE NER model file...
time: 4.38sec

Mozilla CEO Exit Exposes Silicon Valley’s Equality-Freedom Rift.
[ORGANIZATION Mozilla] CEO Exit Exposes Silicon Valley’s [MISC Equality-Freedom Rift] .


Mozilla CEO Exit Exposes Silicon Valley’s Equality-Freedom Rift
[ORGANIZATION Mozilla] CEO Exit Exposes Silicon Valley’s [MISC Equality-Freedom Rift

To contact the editors responsible for this story: Pui-Wing Tam at ptam13@bloomberg.net Reed Stevenson, Ari Levy.
To contact the editors responsible for this story : [PERSON Pui-Wing Tam] at ptam13@bloomberg . net [PERSON Reed Stevenson] , [PERSON Ari Levy] .


To contact the editors responsible for this story: Pui-Wing Tam at ptam13@bloomberg.net Reed Stevenson, Ari Levy
To contact the editors responsible for this story : [PERSON Pui-Wing Tam] at ptam13@bloomberg . net [PERSON Reed Stevenson] , [PERSON Ari Levy

Cheers,
Jim
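
The expected behavior can be illustrated with a small sketch that closes each bracket on the entity's last token, whether or not any punctuation token follows. This is not ner_stream's actual implementation; `bracket_entities` and the `(start, end, label)` tuples with an exclusive `end` are assumptions made for illustration.

```python
def bracket_entities(tokens, entities):
    """Render tokens with [LABEL ...] markup.

    entities: non-overlapping (start, end, label) tuples, end exclusive.
    """
    starts = {start: label for start, end, label in entities}
    last_tokens = {end - 1 for start, end, label in entities}
    out = []
    for i, tok in enumerate(tokens):
        piece = tok
        if i in starts:
            piece = "[" + starts[i] + " " + piece
        if i in last_tokens:
            # Close on the entity's last token, even if it ends the input.
            piece += "]"
        out.append(piece)
    return " ".join(out)


tokens = ["Mozilla", "CEO", "Exit", "Exposes", "Silicon",
          "Valley\u2019s", "Equality-Freedom", "Rift"]
entities = [(0, 1, "ORGANIZATION"), (6, 8, "MISC")]
result = bracket_entities(tokens, entities)
print(result)
```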

Move the models off of sourceforge

I'm currently downloading them at 100k/sec while trying to test an automatic provisioning script and not thrilled about it :)

Sourceforge has definitely seen better days, I think.

Documentation for Java on OS X

I ran into a few problems on OS X with the instructions for Using MITIE from a Java program:

Also note that you must have Swig 1.3.40 or newer, CMake 2.8.4 or newer, and the Java JDK installed to compile the MITIE interface.

In addition, it was necessary to install FFTW to get the build scripts to work properly. Without it, I was getting errors like error: fftw3.h: No such file or directory.

That will place a javamitie shared library and jar file into the mitielib folder.

The build scripts created the javamitie.jar and libjavamitie.jnilib files, but I had to move them from MITIE/mitielib/java/build/lib into MITIE/mitielib manually. Otherwise, I saw errors like this when trying to compile NerExample.java: NerExample.java:6: error: package edu.mit.ll.mitie does not exist.

Once you have those two files you can run the example program in examples/java by running run_ner.bat if you are on Windows or run_ner.sh if you are on a POSIX system like Linux or OS X.

In run_ner.sh, export LD_LIBRARY_PATH=../../mitielib doesn't have the desired effect in OS X -- you'll wind up with errors like this:

Native code library failed to load. 
java.lang.UnsatisfiedLinkError: no javamitie in java.library.path

You need to use DYLD_LIBRARY_PATH=../../mitielib instead, or set it using the -Djava.library.path VM option.

enhancement: direct access to word embeddings

I'd like to build a simple sentence classifier using the spectral embeddings with a simple average-of-bag-of-words approach.

I'm thinking about forking this and creating a model using dlib, but it might make sense anyway to write some functions which just return the embeddings for a list of tokens, which can then be wrapped in Python and other bindings. Let me know what you think.
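
As a rough illustration of the average-of-bag-of-words idea, the sketch below averages per-token vectors into one sentence vector. The `embeddings` dictionary is toy data standing in for whatever lookup such a function would expose, and `average_embedding` is a hypothetical helper. Nothing here calls MITIE.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known token: fall back to the zero vector.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy 2-dimensional embeddings, made up for this example.
embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
sentence_vector = average_embedding(["good", "movie", "unknownword"], embeddings, 2)
print(sentence_vector)
```

The resulting fixed-length vector could then be fed to any standard classifier.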

Improving MITIE NER for missing entities

Hello,

So far I'm greatly impressed with the product. The assessment in #12 is spot on and I'm looking forward to training some binary relationships as soon as I can think of an interesting one.

I was wondering if there were plans to update the basic NER model over time. Further, is there some kind of way to submit missed entities within text in an organized fashion?

This isn't the best example, since it's hard to determine from the context, but running

"Popular Science is your wormhole to the future. Reporting on what's new and what's next in science and technology, we deliver the future now."

missed 'Popular Science' as miscellaneous or otherwise without any kind of score.

running face detection example ERROR

Hi, I get the following error when I build and run face_detection.cpp with:
g++ -o output -lpng -ljpeg -O3 -I.. ../dlib/all/source.cpp -lpthread -lX11 face_detection_ex.cpp
./output test.jpg
exception thrown!
Unable to load image in file test.jpg.
You must #define DLIB_PNG_SUPPORT and link to libpng to read PNG files.
Do this by following the instructions at http://dlib.net/compile.html.

The instructions were not clear to me:
Note also that if you want to work with jpeg/png files using dlib then you will need to link your program with libjpeg and/or libpng. You also need to tell dlib about this by defining the DLIB_JPEG_SUPPORT and DLIB_PNG_SUPPORT preprocessor directives.

I have installed libjpeg and libpng and added the flags, but I can't figure out what's wrong.

How to interpret confidence score?

From the comments in the code base, I understand that the confidence score can range from negative values to positive values larger than 1. But when I get a result with a score of 1.3, I'm still not sure how confident that is, because I don't have a [min, max] range in mind. If possible, could I have some statistical background on how this value is calculated? Thanks!

Training NER on a new corpus

Is there a memory leak? It's taking a lot of memory for very few training samples.
It gets killed after printing this:

num feats in chunker model: 4095
train: precision, recall, f1-score: 0.984615 0.984615 0.984615
now do training
num training samples: 198

I observed the memory usage and saw that it kept increasing gradually once it reached this point, as if some memory were being filled with garbage in each iteration.

Handling smartquotes

Davis,

Any plan to support smartquotes, as modern text editors these days tend to use them instead of regular quotes? This may have an impact on the result. For example, this is the original text.

Mozilla CEO Exit Exposes Silicon Valley’s Equality-Freedom Rift.
[ORGANIZATION Mozilla] CEO Exit Exposes Silicon Valley’s [MISC Equality-Freedom Rift] .

and this is after preprocessing (smartquotes replaced). Silicon Valley is now detected.

Mozilla CEO Exit Exposes Silicon Valley's Equality-Freedom Rift.
[ORGANIZATION Mozilla] CEO Exit Exposes [LOCATION Silicon Valley] 's [MISC Equality-Freedom Rift]

Cheers,
Jim
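
The preprocessing step described above could look roughly like the sketch below, which maps the four common curly quotes to their ASCII counterparts before the text is tokenized. `normalize_quotes` is a hypothetical helper and the set of characters handled is an assumption; other typographic characters would need their own entries.

```python
# Map curly single and double quotes to plain ASCII quotes.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})


def normalize_quotes(text):
    """Replace smart quotes with regular quotes, one character for one character."""
    return text.translate(SMART_QUOTES)


original = "Mozilla CEO Exit Exposes Silicon Valley\u2019s Equality-Freedom Rift."
cleaned = normalize_quotes(original)
print(cleaned)
```

Because every replacement is one character for one character, character offsets into the cleaned text still line up with the original.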

MITIE-models examples

Hello all,

Could somebody give me some examples/datasets that were used in the binary relations classifiers:

rel_classifier_book.written_work.author.svm
rel_classifier_film.film.directed_by.svm
rel_classifier_influence.influence_node.influenced_by.svm
rel_classifier_law.inventor.inventions.svm
rel_classifier_location.location.contains.svm
rel_classifier_location.location.nearby_airports.svm
rel_classifier_location.location.partially_contains.svm
rel_classifier_organization.organization.place_founded.svm
rel_classifier_organization.organization_founder.organizations_founded.svm
rel_classifier_organization.organization_scope.organizations_with_this_scope.svm
rel_classifier_people.deceased_person.place_of_death.svm
rel_classifier_people.ethnicity.geographic_distribution.svm
rel_classifier_people.person.ethnicity.svm
rel_classifier_people.person.nationality.svm
rel_classifier_people.person.parents.svm
rel_classifier_people.person.place_of_birth.svm
rel_classifier_people.person.religion.svm
rel_classifier_people.place_of_interment.interred_here.svm
rel_classifier_time.event.includes_event.svm
rel_classifier_time.event.locations.svm
rel_classifier_time.event.people_involved.svm

Thanks in advance!

Getting error overlap entity

Good afternoon.
Thanks for such an awesome library.

I'm getting the error: Invalid range given to ner_training_instance.overlaps_any_entity(). It overlaps an entity given to a previous call to add_entity().

Can you please explain to me what is going on?
Thanks again.
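
The message suggests that two labeled entities in the same training sentence share at least one token. A minimal pure-Python sketch of checking for that before adding an entity is shown below. It does not call MITIE; representing entities as `(start, length)` token ranges and the helper names are assumptions made for illustration.

```python
def overlaps_any(ranges, start, length):
    """Return True if [start, start + length) intersects any stored range."""
    end = start + length
    return any(start < s + n and s < end for s, n in ranges)


def add_entity_checked(ranges, start, length):
    """Record the range only if it does not overlap an earlier one."""
    if overlaps_any(ranges, start, length):
        return False
    ranges.append((start, length))
    return True


ranges = []
first = add_entity_checked(ranges, 0, 2)   # tokens 0-1
second = add_entity_checked(ranges, 1, 2)  # tokens 1-2, shares token 1
third = add_entity_checked(ranges, 2, 1)   # token 2, adjacent but not overlapping
print(first, second, third, ranges)
```

Running a check like this over the annotation data can help locate the sentence whose labels collide.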

Sentiment Analysis

I have seen a good number of examples of NER and binary relations. Are there plans for sentiment analysis?

Also, can I get an example of binary_relation_detector_trainer in Java? It would help.
Thanks in advance.

Error on building nodejs wrapper

(Linux 3.16.0-4-amd64 Debian 3.16.7-ckt11-1+deb8u5 (2015-10-09) x86_64 GNU/Linux)
Hey guys, sorry for disturbing you.
I have the same problem while building MITIE's nodejs wrapper. It seems that the error occurs in MITIE, not in node-gyp, but I might be wrong.

What can it be? Maybe I am missing something? Thanks!

> [email protected] install /var/www/dh-nlp/node_modules/mitie
> node-gyp rebuild

make: Entering directory '/var/www/dh-nlp/node_modules/mitie/build'
  CXX(target) Release/obj.target/mitie/src/mitie.o
In file included from ../mitie/dlib/dlib/serialize.h:157:0,
                 from /usr/local/include/mitie/approximate_substring_set.h:9,
                 from /usr/local/include/mitie/word_morphology_feature_extractor.h:7,
                 from /usr/local/include/mitie/total_word_feature_extractor.h:8,
                 from /usr/local/include/mitie/named_entity_extractor.h:7,
                 from ../src/entity_extractor.h:5,
                 from ../src/mitie.cc:2:
../mitie/dlib/dlib/smart_pointers/shared_ptr.h: In member function 'void* dlib::shared_ptr<T>::deleter_template<D>::get_deleter_void(const std::type_info&) const':
../mitie/dlib/dlib/smart_pointers/shared_ptr.h:112:29: error: cannot use typeid with -fno-rtti
                 if (typeid(D) == t)
                             ^
../mitie/dlib/dlib/smart_pointers/shared_ptr.h: In member function 'D* dlib::shared_ptr<T>::_get_deleter() const':
../mitie/dlib/dlib/smart_pointers/shared_ptr.h:432:83: error: cannot use typeid with -fno-rtti
                 return static_cast<D*>(shared_node->del->get_deleter_void(typeid(D)));
                                                                                   ^
g++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-4.9/README.Bugs> for instructions.
mitie.target.mk:94: recipe for target 'Release/obj.target/mitie/src/mitie.o' failed
make: *** [Release/obj.target/mitie/src/mitie.o] Error 4
make: Leaving directory '/var/www/dh-nlp/node_modules/mitie/build'

WordRep install/usage

Do you have any documentation on how to install or use the wordrep tool? I attempted to build it with cmake but haven't had any success.

I attempted to adapt the steps in the ReadMe for /tools/ner_stream:

cd tools/ner_stream
mkdir build
cd build
cmake ..
cmake --build . --config Release

but I am getting this error.

-- Check size of void*
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
JPEG_LIBRARY
linked by target "cmTryCompileExec1671790077" in directory /mydir/mitie/tools/wordrep/build/CMakeFiles/CMakeTmp

CMake Error: Internal CMake error, TryCompile configure of cmake failed
-- Check size of void* - failed
-- Found LAPACK library
-- Found CBLAS library
-- Looking for cblas_ddot
-- Looking for cblas_ddot - found
-- Check for STD namespace
-- Check for STD namespace - found
-- Looking for C++ include iostream
-- Looking for C++ include iostream - found
-- Configuring incomplete, errors occurred!

See also "mydir/mitie/tools/wordrep/build/CMakeFiles/CMakeOutput.log".
See also "mydir/mitie/tools/wordrep/build/CMakeFiles/CMakeError.log".

Updating mitie in NPM

Are there any plans to update mitie on NPM? It's almost a year old, and I suspect many users would want to have MITIE integration in Node, as exists for Java, Python, and other languages.

make test fails on osx

First, THANK YOU! The world has needed this forever for the reasons you outline in the readme, so I'm very glad you're building this. Dlib is awesome as well.

I'm able to build and run the example but make test fails with:

$ make test
./ner_stream MITIE-models/ner_model.dat < sample_text.txt > /tmp/test.out
Loading MITIE NER model file...
time: 2.86sec

diff /tmp/test.out sample_text.reference-output
diff: sample_text.reference-output: No such file or directory

maybe just a git add sample_text.reference-output ?

Can't set num_threads from Python for NER training

I'm training a new NER model in MITIE on a machine with a bunch of cores. Training it with a few thousand samples takes a couple hours, so I'd really like to be able to multithread it. When I set it in my code, though, it seems to default to 4 regardless of what I set.

import sys, os, json
sys.path.append("/home/ahalterman/MITIE/mitielib")
from mitie import *
from collections import defaultdict

trainer = ner_trainer("/home/ahalterman/MITIE/tools/wordrep/build/total_word_feature_extractor.dat")
trainer.num_threads = 8

[all the other stuff here]

When I start the training, though, it says only 4 threads are in use:

now do training
C:           20
epsilon:     0.01
num threads: 4
cache size:  5
loss per missed segment:  3

Sure enough, running ps -o nlwp <pid> says it's got 5 threads (I don't know much about multithreading so I'm not sure what that extra one is).

Any idea what's going on?

Building MITIE java on Windows

MITIE is a cool tool, much more focused and easy to use. I have Cygwin on Windows, and a company restriction means I cannot have VS.Net.

Can someone guide me on how to build it? Thanks in advance.

Krishna

One or two output examples in the README.md file

The MITIE library makes a really exciting impression at first glance. However, it would be very helpful if you could provide one or two examples of what the library actually outputs when used on a text. Especially for people like me who are not totally into language processing, it is hard to get a clear picture of what the library actually does and what it can be used for.

Cheers
Holger
