uhh-lt / josimtext

A system for word sense induction and disambiguation based on the JoBimText approach

Home Page: http://jobimtext.org/wsd

Scala 90.02% Shell 1.63% Python 8.22% Makefile 0.12%
distributional-semantics jobimtext sense-induction noun-sense-induction sense-disambiguation sense-clusters spark count-based

josimtext's Introduction

JoSimText

This system performs word sense induction from text. It is an implementation of the JoBimText approach in Scala and Spark, tuned for the induction of word senses (hence the "S" instead of "B" in the name, but also after the name of the initial developer of the project, Johannes Simon). The original JoBimText implementation is written in Java/Pig and is more generic, as it assumes that "Jo"s (i.e. objects) and "Bim"s (i.e. features) can be any linguistic objects. This particular implementation is designed for modeling words and multiword expressions.

The system consists of several modules:

  1. Term feature extraction
  2. Term similarity (this repository), i.e. the construction of a distributional thesaurus from word-feature frequencies; a minimal sketch of the idea is given after this list.
  3. Word sense induction
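
The following is a minimal, hypothetical sketch of the term similarity step with plain Spark RDDs, meant only to illustrate the idea behind a JoBimText-style distributional thesaurus: two terms are similar if they share salient features. The file paths, the pruning threshold and the feature ranking (raw frequency instead of a significance measure such as LMI) are assumptions, not the project's actual code.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("dt-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: tab-separated "word<TAB>feature<TAB>count" lines.
    val wordFeatureCounts = sc.textFile("word-feature-counts.tsv")
      .map(_.split("\t"))
      .collect { case Array(word, feature, count) => (word, (feature, count.toLong)) }

    // Keep only the p most salient features per word (here simply the most frequent ones).
    val p = 1000
    val featureWord = wordFeatureCounts
      .groupByKey()
      .flatMap { case (word, feats) =>
        feats.toSeq.sortBy(-_._2).take(p).map { case (feature, _) => (feature, word) }
      }

    // Two words are similar if they share features; similarity = number of shared features.
    val similarities = featureWord
      .groupByKey()
      .flatMap { case (_, words) =>
        val ws = words.toSeq
        for (w1 <- ws; w2 <- ws if w1 != w2) yield ((w1, w2), 1L)
      }
      .reduceByKey(_ + _)

    similarities
      .map { case ((w1, w2), sim) => s"$w1\t$w2\t$sim" }
      .saveAsTextFile("dt-out")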

System requirements:

  1. git
  2. Java 1.8+
  3. Apache Spark 2.2+

Installation of the tool:

  1. Get the source code:
git clone https://github.com/uhh-lt/josimtext.git
cd josimtext
  2. Build the tool:
make
  3. Set the environment variable SPARK_HOME to the directory of your Spark installation.

Run a command:

  1. To see the list of available commands:
./run
  2. To see the arguments of a particular command, e.g.:
./run WordSimFromTermContext --help
  3. By default, the tool runs locally. To change the Spark and Hadoop parameters of a job (queue, number of executors, memory per job, and so on), modify the conf/env.sh file. A sample file for running the jobs on a CDH YARN cluster is provided in conf/cdh.sh.


josimtext's Issues

Construct a distributional thesaurus with the current pipeline on Wikipedia corpus

Motivation

To learn the current framework and the details of its implementation, you need to compute a word similarity graph (a distributional thesaurus, DT) with the current pipeline. This will help you understand how the pipeline works and how it can be optimised.

Implementation

  1. Download Wikipedia corpus: http://cental.fltr.ucl.ac.be/team/~panchenko/data/corpora/wacky-surface.csv.
  2. For testing purposes, make subcorpora of 50 MB and 500 MB. First do all the experiments on these smaller chunks, then proceed with the entire 5 GB dataset. All experiments are conducted locally on your machine.
  3. Execute the steps up to "Compute DT (noun-sense-induction-scala)" according to these instructions: https://github.com/tudarmstadt-lt/noun-sense-induction-scala.
  4. Upload the result files (DT, feature matrix, etc.) to Dropbox or any other place and post a download link here.

Implement Word Sense Induction (WSI) based on Chinese Whispers (CW) in Spark

Motivation

Currently, one important component of the JoBimText pipeline, word sense induction, runs in a non-distributed fashion. This requires transferring files from HDFS and back, and it limits the scalability of the method. Your goal is to implement this component in a distributed way.

Implementation

  1. Download the similarity graph:
    http://cental.fltr.ucl.ac.be/team/~panchenko/data/serelex/norm60.tgz
  2. Replace all ";" with "\t" in the file (the separator).
  3. Run the current WSI method locally: https://github.com/tudarmstadt-lt/chinese-whispers. Use the script https://github.com/tudarmstadt-lt/chinese-whispers/blob/master/run.sh
  4. Read the paper to understand how the algorithm works: http://www.aclweb.org/website/old_anthology/W/W06/W06-38.pdf#page=83
  5. Implement the algorithm with the Spark GraphX library. For reference, see the sketch after this list.
  6. Write unit tests that make sure that the output of your implementation is the same as the original one.
  7. Write a report measuring memory consumption, computation time and occupied disk space for two implementations of the WSI system.
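
The following is a minimal, hypothetical sketch of Chinese Whispers on Spark GraphX, meant only to illustrate the approach: every node starts in its own class and repeatedly adopts the label with the highest total edge weight among its neighbours. The input path, the word-to-id mapping via hashCode, and the fixed number of synchronous iterations are simplifications (the original algorithm updates nodes one by one in random order), not the project's actual implementation.

    import org.apache.spark.graphx._
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("cw-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: "word1<TAB>word2<TAB>similarity" lines (the DT edges).
    val lines = sc.textFile("dt.tsv").map(_.split("\t")).filter(_.length >= 3)

    // Map words to vertex ids via hashCode (collisions are ignored in this sketch).
    val idToWord = lines.flatMap(a => Seq(a(0), a(1))).distinct()
      .map(w => (w.hashCode.toLong, w))
    val edges = lines.map(a => Edge(a(0).hashCode.toLong, a(1).hashCode.toLong, a(2).toDouble))

    // Every vertex starts with its own id as its class label.
    var graph: Graph[VertexId, Double] = Graph.fromEdges(edges, 0L).mapVertices((id, _) => id)

    val maxIterations = 20
    for (_ <- 1 to maxIterations) {
      // Collect the weighted label votes of each vertex's neighbours ...
      val votes = graph.aggregateMessages[Map[VertexId, Double]](
        ctx => {
          ctx.sendToDst(Map(ctx.srcAttr -> ctx.attr))
          ctx.sendToSrc(Map(ctx.dstAttr -> ctx.attr))
        },
        (a, b) => (a.keySet ++ b.keySet)
          .map(k => k -> (a.getOrElse(k, 0.0) + b.getOrElse(k, 0.0))).toMap
      )
      // ... and adopt the label with the highest total weight.
      graph = graph.outerJoinVertices(votes) { (_, oldLabel, voteOpt) =>
        voteOpt.map(_.maxBy(_._2)._1).getOrElse(oldLabel)
      }
    }

    // Vertices sharing a label form one sense cluster.
    graph.vertices.join(idToWord)
      .map { case (_, (label, word)) => s"$word\t$label" }
      .saveAsTextFile("cw-clusters")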

Comparing original and new trigram JoBimText (JBT) implementations

General motivation

Computational lexical semantics is a subfield of Natural Language Processing that studies computational models of lexical items, such as words, noun phrases and multiword expressions. Modelling semantic relations between words (e.g. synonyms) and word senses (e.g. “python” as a programming language vs “python” as a snake) is of practical interest in the context of various language processing and information retrieval applications.

During the last 20 years, several accurate computational models of lexical semantics emerged, such as distributional semantics (Biemann, 2013; Baroni, 2011) and word embeddings (Mikolov, 2013). In this thesis, you will deal with one of the state-of-the-art approaches to lexical semantics, developed at TU Darmstadt, called JoBimText: http://jobimtext.org. According to multiple evaluations, the JoBimText approach yields cutting edge accuracy on such tasks as semantic relatedness (Biemann, 2013). Besides, it also enables features missing in other frameworks, such as automatic sense discovery.

The current implementation of JoBimText lets us process text corpora of up to 50 GB on a mid-sized Hadoop cluster with 400 cores and 50 TB of HDFS. Your goal will be to re-engineer the system so that it can process text corpora of up to 5 TB (100 times bigger) on the same cluster. This goal will be achieved by using the modern Apache Spark framework for distributed computation, which keeps intermediate results in memory instead of dumping them to temporary files on disk and thus allows implementing incremental algorithms more efficiently.

The ultimate goal of the project will be to develop a system that is able to compute a distributional thesaurus from the Common Crawl corpus (a 541 TB dataset on Amazon AWS).

This is supposed to be the biggest experiment in distributional semantics conducted so far. This will be in line with this initiative: http://www.webdatacommons.org/. Read this thesis for reference on a similar project: thesis.pdf

Motivation of the initial experiment

The initial experiment is needed as a proof of concept and to show the feasibility of the results. In this experiment, you will work with the trigram-holing JoBimText (JBT) approach to the construction of a distributional thesaurus (DT). The goals of the experiment are to:

  1. Ensure by extensive testing that the new (Spark) implementation produces the same outputs as the original (MapReduce) implementation.
  2. Measure and compare the performance of the original and the new implementations.

Implementation of the initial experiment

  1. Download the Wikipedia corpus: http://cental.fltr.ucl.ac.be/team/~panchenko/data/corpora/wacky-surface.csv. For testing purposes, make subcorpora of 50 MB and 500 MB. First do all the experiments on these smaller chunks, then proceed with the entire 5 GB dataset. All experiments are conducted locally on your machine.

  2. Compute a trigram DT with the original JBT implementation.

    python generateHadoopScript.py -q shortrunning -hl trigram -nb corpora/en/wikipedia_eugen -f 5 -w 5 -wf 2
    
  3. Get the DT from the outputs of the original pipeline. Here is a description of the output formats: http://panchenko.me/jbt/

  4. Compute the same DT with the new pipeline (also using the trigram holing). Make sure to use exactly the same parameters! Follow the instructions here: https://github.com/tudarmstadt-lt/noun-sense-induction-scala. Use this script to get the parameters of the trigram holing without lemmatization: https://github.com/tudarmstadt-lt/noun-sense-induction-scala/blob/master/scripts/run-nsi-trigram-nolemma.sh

  5. Create a table in Google Docs comparing the original and the new DT outputs. Rows are runs; columns are the following measurements (a sketch of the comparison is given after this list):

    • size of the input corpus, MB
    • number of words in DT: cat dt.csv | cut -f 1 | sort | uniq | wc -l
    • number of relations in DT
    • overlap of relations, percent
    • size of DT in MB
    • DT computation time in seconds on one core (measured with time)
    • output size of all files in MB
    • memory consumed in MB
  6. Put the results of the experiments for both pipelines online, e.g. on Google Drive.

  7. Write a report including the table above.

  8. Write an outline of the thesis. Add references, e.g. the Spark books and master theses listed below.
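
A minimal, hypothetical sketch of how the number of words, the number of relations and the relation overlap of the two DT outputs could be computed with Spark; the file paths and the "word1<TAB>word2<TAB>score" format are assumptions.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("dt-compare").getOrCreate()
    val sc = spark.sparkContext

    // Read a DT file and keep the distinct (word1, word2) relation pairs.
    def relations(path: String) = sc.textFile(path)
      .map(_.split("\t")).filter(_.length >= 2)
      .map(a => (a(0), a(1)))
      .distinct()

    val originalDt = relations("dt-original.csv") // output of the original (MapReduce) pipeline
    val newDt      = relations("dt-spark.csv")    // output of the new (Spark) pipeline

    val nWordsOriginal = originalDt.keys.distinct().count()
    val nWordsNew      = newDt.keys.distinct().count()
    val nRelOriginal   = originalDt.count()
    val nRelNew        = newDt.count()
    val overlap        = originalDt.intersection(newDt).count()

    println(s"words in DT:     $nWordsOriginal (original) vs $nWordsNew (new)")
    println(s"relations in DT: $nRelOriginal (original) vs $nRelNew (new)")
    println(f"relation overlap: ${100.0 * overlap / nRelOriginal}%.2f%% of the original DT")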

References

Increase the number of digits in the similarity computation

Currently the output looks like this (a possible fix is sketched after the sample):

Specialized–lululemon Garmin-Transitions  0.001
Specialized–lululemon Katusha 0.001
Specialized–lululemon Polti   0.001
Specialized–lululemon Specialized–lululemon 0.001
Specialized–lululemon Milram  0.001
Specialized–lululemon Kawasaki    0.001
Specialized–lululemon Sparkasse   0.001
Specialized–lululemon Sky 0.001
Specialized–lululemon RadioShack  0.001
High-priority   High-priority   0.001
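
The exact fix depends on where the scores are serialized in the code, but the idea is simply to write them with more decimal places so that small differences are not all rounded to 0.001. A hypothetical illustration:

    val sim = 0.0014213
    println(f"$sim%.3f") // 0.001    -- current behaviour, many different scores collapse to the same value
    println(f"$sim%.6f") // 0.001421 -- more digits keep the scores distinguishable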
