3alama / arabic-tagger

This project forked from nschneid/arabic-tagger

AQMAR Arabic Tagger: Sequence tagger with cost-augmented structured perceptron training

License: GNU General Public License v3.0

Languages: Java 73.83%, Perl 13.53%, Python 11.61%, Shell 1.03%

AQMAR Arabic Tagger

This package provides a sequence tagger implementation customized for Arabic features, including a named entity detection model intended especially for Arabic Wikipedia. The model was trained on labeled ACE and ANER data as well as an unlabeled Wikipedia corpus. Models are learned with the structured perceptron, optionally with cost-augmented training. Feature extraction is handled as a preprocessing step before learning and decoding.
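
As a rough illustration of cost-augmented structured perceptron training, here is a minimal Python sketch. It is not the Java implementation shipped in this package. The feature representation (a list of feature strings per token plus a previous-label feature), the function names viterbi, features, recall_cost, and train, and the particular cost (a fixed penalty for predicting O where the gold label is an entity) are all assumptions made for this example.

```python
from collections import defaultdict


def viterbi(tokens, labels, weights, gold=None, cost_fn=None):
    """Return the best label sequence under a first-order model.

    If gold and cost_fn are given, cost_fn(gold[i], label) is added to each
    local score, which makes this cost-augmented decoding.
    """
    if not tokens:
        return []

    def local(i, prev, lab):
        score = sum(weights.get((f, lab), 0.0) for f in tokens[i])
        score += weights.get(("PREV=" + prev, lab), 0.0)
        if gold is not None:
            score += cost_fn(gold[i], lab)
        return score

    # best[i][lab] = (score of best path ending in lab at i, backpointer)
    best = [{lab: (local(0, "<s>", lab), None) for lab in labels}]
    for i in range(1, len(tokens)):
        best.append({
            lab: max((best[i - 1][p][0] + local(i, p, lab), p) for p in labels)
            for lab in labels
        })
    path = [max(labels, key=lambda lab: best[-1][lab][0])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


def features(tokens, tags):
    """Count the (feature, label) pairs that fire for a tagged sentence."""
    counts = defaultdict(float)
    prev = "<s>"
    for token_feats, lab in zip(tokens, tags):
        for f in token_feats:
            counts[(f, lab)] += 1.0
        counts[("PREV=" + prev, lab)] += 1.0
        prev = lab
    return counts


def recall_cost(gold_label, predicted_label, penalty=1.0):
    """Hypothetical cost: penalize missing an entity token (a false negative)."""
    return penalty if gold_label != "O" and predicted_label == "O" else 0.0


def train(data, labels, iters=10, cost_fn=None):
    """Unaveraged perceptron; decoding is cost-augmented when cost_fn is set."""
    weights = defaultdict(float)
    for _ in range(iters):
        for tokens, gold in data:
            guess = viterbi(tokens, labels, weights,
                            gold if cost_fn else None, cost_fn)
            if guess != gold:
                # Move weights toward the gold features, away from the guess.
                for key, value in features(tokens, gold).items():
                    weights[key] += value
                for key, value in features(tokens, guess).items():
                    weights[key] -= value
    return weights
```

In this sketch the cost is used only during training: the decoder is pushed toward outputs that would be costly mistakes, so the update corrects them more aggressively. At prediction time, viterbi is called without gold or cost_fn.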

The tagger was used for the experiments reported in

  • Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A. Smith (2012), Recall-Oriented Learning of Named Entities in Arabic Wikipedia. Proceedings of EACL.

and accompanies the AQMAR Arabic Wikipedia Named Entity Corpus also described in that work; both can be obtained at

http://www.ark.cs.cmu.edu/AQMAR/

The Java tagger was adapted from Michael Heilman's supersense tagger implementation for English (http://www.ark.cs.cmu.edu/mheilman/questions/). It requires Java 1.6 or later. Feature extraction uses Python and depends on the MADA toolkit (http://www1.ccls.columbia.edu/MADA/; version 3.1 was used for the Named Entity Corpus).

The AQMAR Arabic Tagger is released under the GNU General Public License (GPL) version 3 or later; see LICENSE. (Michael Heilman's supersense tagger, which we modify, was originally released in 2011 under GPL version 2 or later; the JSAP library, which we link to, was originally released by Martian Software in 2011 under the GNU Lesser General Public License.)

Contents

  • eval/

    README and scripts for NER evaluation.

  • featExtract/

    README and scripts for feature extraction.

  • lib/

    External libraries required for the Java tagger.

  • model/

    Serialized tagging models, namely the best Arabic Wikipedia tagger reported in the EACL paper.

  • src/

    Java source files for the tagger.

  • arabic-tagger.jar

    Compiled Java program for training and decoding with the tagger.

  • build.sh

    Script for compiling the Java sources.

  • sample.properties

    An example properties file that can be used to specify options for the tagger. Options may alternatively be passed as command-line flags; if an option is specified in both places, the command-line value will take precedence.

  • LICENSE

  • README

  • VERSION
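
The precedence rule described for sample.properties (command-line values override values from the properties file) can be sketched in Python as follows. This only illustrates the rule and is not the tagger's own option handling. The helper names parse_properties and resolve_options are hypothetical, and the sketch assumes that properties-file keys match the flag names without the leading dashes.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def resolve_options(file_props, cli_flags):
    """Merge options so that command-line values win over file values."""
    merged = dict(file_props)
    merged.update(cli_flags)
    return merged


# Illustrative values only; the real sample.properties may differ.
file_props = parse_properties("iters=5\nusePrevLabel=true\n")
options = resolve_options(file_props, {"iters": "10"})
```

Here iters is taken from the command line and usePrevLabel from the file, since only iters was given in both places.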

Usage

Extracting features for text data: See featExtract/README.txt

Running the Arabic named entity tagger

For example, the following command will use the existing named entity model in the model/ directory:

java -Xmx8000m -XX:+UseCompressedOops -jar arabic-tagger.jar \
	--load model/arabic-ner-superROP200.selfROP100.ser.gz \
	--test-predict featExtract/sample.bio.nerFeats --usePrevLabel true \
	--properties sample.properties > predictions.out
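
If the tagger needs to be run from a script, for instance over several feature files, the same command can be assembled in Python. The sketch below uses only the flags shown in the command above. The helper names build_decode_command and run_decode are hypothetical, and it assumes that java is on the PATH and that the script runs from the package's top-level directory.

```python
import subprocess


def build_decode_command(model, feats, properties="sample.properties",
                         heap="8000m"):
    """Return the argument list for decoding with an existing model."""
    return [
        "java", "-Xmx" + heap, "-XX:+UseCompressedOops",
        "-jar", "arabic-tagger.jar",
        "--load", model,
        "--test-predict", feats,
        "--usePrevLabel", "true",
        "--properties", properties,
    ]


def run_decode(cmd, out_path):
    """Run the tagger, writing its standard output to out_path."""
    with open(out_path, "wb") as out:
        subprocess.run(cmd, stdout=out, check=True)


cmd = build_decode_command(
    "model/arabic-ner-superROP200.selfROP100.ser.gz",
    "featExtract/sample.bio.nerFeats",
)
# run_decode(cmd, "predictions.out")  # uncomment to actually invoke java
```

Passing the arguments as a list avoids shell quoting problems with file names.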

Training a tagging model

Here is an example command for training a model on the sample feature-extracted data:

java -Xmx8000m -XX:+UseCompressedOops -jar arabic-tagger.jar \
	--save model/sample-model.ser.gz --iters 10 --no-averaging \
	--labels featExtract/sample.labels --train featExtract/sample.nerFeats --debug --disk --weights \
	--properties sample.properties > weights.out

Or, to train on entity boundaries only:

java -Xmx8000m -XX:+UseCompressedOops -jar arabic-tagger.jar \
	--save model/sample-model.ser.gz --iters 10 --no-averaging \
	--labels featExtract/bio.labels --train featExtract/sample.bio.nerFeats --debug --disk --weights \
	--properties sample.properties > weights.out

Until a known bug in weight averaging is fixed, we recommend specifying --no-averaging for training.

For details about options, run

java -jar arabic-tagger.jar --help
