uhh-lt / josimtext

A system for word sense induction and disambiguation based on the JoBimText approach

Home Page: http://jobimtext.org/wsd

Scala 90.02% Shell 1.63% Python 8.22% Makefile 0.12%
distributional-semantics jobimtext sense-induction noun-sense-induction sense-disambiguation sense-clusters spark count-based

josimtext's Introduction

JoSimText

This system performs word sense induction from text. It is an implementation of the JoBimText approach in Scala and Spark, tuned for the induction of word senses (hence the "S" instead of "B" in the name, but also after the name of the initial developer of the project, Johannes Simon). The original JoBimText implementation is written in Java/Pig and is more generic, as it assumes that "Jo"s (i.e. objects) and "Bim"s (i.e. features) can be any linguistic objects. This particular implementation is designed for modeling words and multiword expressions.

The system consists of several modules:

  1. Term feature extraction
  2. Term similarity (this repository), i.e. the construction of a distributional thesaurus from word-feature frequencies; a minimal sketch of the idea is given after this list.
  3. Word sense induction
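
The following is a minimal, hypothetical sketch of the term similarity step with plain Spark RDDs, meant only to illustrate the idea behind a JoBimText-style distributional thesaurus: two terms are similar if they share salient features. The file paths, the pruning threshold and the feature ranking (raw frequency instead of a significance measure such as LMI) are assumptions, not the project's actual code.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("dt-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: tab-separated "word<TAB>feature<TAB>count" lines.
    val wordFeatureCounts = sc.textFile("word-feature-counts.tsv")
      .map(_.split("\t"))
      .collect { case Array(word, feature, count) => (word, (feature, count.toLong)) }

    // Keep only the p most salient features per word (here simply the most frequent ones).
    val p = 1000
    val featureWord = wordFeatureCounts
      .groupByKey()
      .flatMap { case (word, feats) =>
        feats.toSeq.sortBy(-_._2).take(p).map { case (feature, _) => (feature, word) }
      }

    // Two words are similar if they share features; similarity = number of shared features.
    val similarities = featureWord
      .groupByKey()
      .flatMap { case (_, words) =>
        val ws = words.toSeq
        for (w1 <- ws; w2 <- ws if w1 != w2) yield ((w1, w2), 1L)
      }
      .reduceByKey(_ + _)

    similarities
      .map { case ((w1, w2), sim) => s"$w1\t$w2\t$sim" }
      .saveAsTextFile("dt-out")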

System requirements:

  1. git
  2. Java 1.8+
  3. Apache Spark 2.2+

Installation of the tool:

  1. Get the source code:
git clone https://github.com/uhh-lt/josimtext.git
cd josimtext
  2. Build the tool:
make
  3. Set the environment variable SPARK_HOME to the directory of your Spark installation.

Run a command:

  1. To see the list of available commands:
./run
  2. To see the arguments of a particular command, e.g.:
./run WordSimFromTermContext --help
  3. By default, the tool runs locally. To change the Spark and Hadoop parameters of a job (queue, number of executors, memory per job, and so on), modify the conf/env.sh file. A sample file for running the jobs on a CDH YARN cluster is provided in conf/cdh.sh.


josimtext's Issues

Construct a distributional thesaurus with the current pipeline on Wikipedia corpus

Motivation

To learn the current framework and the details of its implementation, you need to compute a word similarity graph (a distributional thesaurus, DT) with the current pipeline. This will help you understand how the pipeline works and how it can be optimised.

Implementation

  1. Download Wikipedia corpus: http://cental.fltr.ucl.ac.be/team/~panchenko/data/corpora/wacky-surface.csv.
  2. For testing purposes, make subcorpora of 50 MB and 500 MB. First do all the experiments on these smaller chunks, then proceed with the entire 5 GB dataset. All experiments are conducted locally on your machine.
  3. Execute the steps up to "Compute DT (noun-sense-induction-scala)" according to these instructions: https://github.com/tudarmstadt-lt/noun-sense-induction-scala.
  4. Upload the result files (DT, feature matrix, etc.) to Dropbox or any other place and post a download link here.

Implement Word Sense Induction (WSI) based on Chinese Whispers (CW) in Spark

Motivation

Currently, one important component of the JoBimText pipeline, word sense induction, runs in a non-distributed fashion. This requires transferring files from HDFS and back, and it limits the scalability of the method. Your goal is to implement this component in a distributed way.

Implementation

  1. Download the similarity graph:
    http://cental.fltr.ucl.ac.be/team/~panchenko/data/serelex/norm60.tgz
  2. Replace all ";" with "\t" in the file (the separator).
  3. Run the current WSI method locally: https://github.com/tudarmstadt-lt/chinese-whispers. Use the script https://github.com/tudarmstadt-lt/chinese-whispers/blob/master/run.sh
  4. Read the paper to understand how the algorithm works: http://www.aclweb.org/website/old_anthology/W/W06/W06-38.pdf#page=83
  5. Implement the algorithm with the Spark GraphX library. For reference, see the sketch after this list.
  6. Write unit tests that make sure that the output of your implementation is the same as the original one.
  7. Write a report measuring memory consumption, computation time and occupied disk space for two implementations of the WSI system.
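
The following is a minimal, hypothetical sketch of Chinese Whispers on Spark GraphX, meant only to illustrate the approach: every node starts in its own class and repeatedly adopts the label with the highest total edge weight among its neighbours. The input path, the word-to-id mapping via hashCode, and the fixed number of synchronous iterations are simplifications (the original algorithm updates nodes one by one in random order), not the project's actual implementation.

    import org.apache.spark.graphx._
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("cw-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: "word1<TAB>word2<TAB>similarity" lines (the DT edges).
    val lines = sc.textFile("dt.tsv").map(_.split("\t")).filter(_.length >= 3)

    // Map words to vertex ids via hashCode (collisions are ignored in this sketch).
    val idToWord = lines.flatMap(a => Seq(a(0), a(1))).distinct()
      .map(w => (w.hashCode.toLong, w))
    val edges = lines.map(a => Edge(a(0).hashCode.toLong, a(1).hashCode.toLong, a(2).toDouble))

    // Every vertex starts with its own id as its class label.
    var graph: Graph[VertexId, Double] = Graph.fromEdges(edges, 0L).mapVertices((id, _) => id)

    val maxIterations = 20
    for (_ <- 1 to maxIterations) {
      // Collect the weighted label votes of each vertex's neighbours ...
      val votes = graph.aggregateMessages[Map[VertexId, Double]](
        ctx => {
          ctx.sendToDst(Map(ctx.srcAttr -> ctx.attr))
          ctx.sendToSrc(Map(ctx.dstAttr -> ctx.attr))
        },
        (a, b) => (a.keySet ++ b.keySet)
          .map(k => k -> (a.getOrElse(k, 0.0) + b.getOrElse(k, 0.0))).toMap
      )
      // ... and adopt the label with the highest total weight.
      graph = graph.outerJoinVertices(votes) { (_, oldLabel, voteOpt) =>
        voteOpt.map(_.maxBy(_._2)._1).getOrElse(oldLabel)
      }
    }

    // Vertices sharing a label form one sense cluster.
    graph.vertices.join(idToWord)
      .map { case (_, (label, word)) => s"$word\t$label" }
      .saveAsTextFile("cw-clusters")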

Comparing original and new trigram JoBimText (JBT) implementations

General motivation

Computational lexical semantics is a subfield of Natural Language Processing that studies computational models of lexical items, such as words, noun phrases and multiword expressions. Modelling semantic relations between words (e.g. synonyms) and word senses (e.g. “python” as a programming language vs “python” as a snake) is of practical interest in the context of various language processing and information retrieval applications.

During the last 20 years, several accurate computational models of lexical semantics emerged, such as distributional semantics (Biemann, 2013; Baroni, 2011) and word embeddings (Mikolov, 2013). In this thesis, you will deal with one of the state-of-the-art approaches to lexical semantics, developed at TU Darmstadt, called JoBimText: http://jobimtext.org. According to multiple evaluations, the JoBimText approach yields cutting edge accuracy on such tasks as semantic relatedness (Biemann, 2013). Besides, it also enables features missing in other frameworks, such as automatic sense discovery.

The current implementation of JoBimText lets us process text corpora of up to 50 GB on a mid-sized Hadoop cluster with 400 cores and 50 TB of HDFS. Your goal will be to re-engineer the system so that it can process text corpora of up to 5 TB (100 times bigger) on the same cluster. This goal will be achieved by using the modern Apache Spark framework for distributed computation, which keeps intermediate results in memory instead of dumping them to temporary files on disk and thus allows implementing incremental algorithms more efficiently.

The ultimate goal of the project will be to develop a system that is able to compute a distributional thesaurus from the Common Crawl corpus (a 541 TB dataset on Amazon AWS).

This is supposed to be the biggest experiment in distributional semantics conducted so far. This will be in line with this initiative: http://www.webdatacommons.org/. Read this thesis for reference on a similar project: thesis.pdf

Motivation of the initial experiment

The initial experiment is needed as a proof of concept and to show the feasibility of the results. In this experiment, you will work with the trigram-holing JoBimText (JBT) approach to the construction of a distributional thesaurus (DT). The goals of the experiment are to:

  1. Ensure by extensive testing that the new (Spark) implementation produces the same outputs as the original (MapReduce) implementation.
  2. Measure and compare the performance of the original and the new implementations.

Implementation of the initial experiment

  1. Download the Wikipedia corpus: http://cental.fltr.ucl.ac.be/team/~panchenko/data/corpora/wacky-surface.csv. For testing purposes, make subcorpora of 50 MB and 500 MB. First do all the experiments on these smaller chunks, then proceed with the entire 5 GB dataset. All experiments are conducted locally on your machine.

  2. Compute a trigram DT with the original JBT implementation.

    python generateHadoopScript.py -q shortrunning -hl trigram -nb corpora/en/wikipedia_eugen -f 5 -w 5 -wf 2
    
  3. Get the DT from the outputs of the original pipeline. Here is a description of the output formats: http://panchenko.me/jbt/

  4. Compute the same DT with the new pipeline (also using the trigram holing). Make sure to use exactly the same parameters! Follow the instructions here: https://github.com/tudarmstadt-lt/noun-sense-induction-scala. Use this script to get the parameters of the trigram holing without lemmatization: https://github.com/tudarmstadt-lt/noun-sense-induction-scala/blob/master/scripts/run-nsi-trigram-nolemma.sh

  5. Create a table in Google Docs comparing the original and the new DT outputs. Rows are runs; columns are the following measurements (a sketch of the comparison is given after this list):

    • size of the input corpus, MB
    • number of words in DT: cat dt.csv | cut -f 1 | sort | uniq | wc -l
    • number of relations in DT
    • overlap of relations, percent
    • size of DT in MB
    • DT computation time in seconds on one core (measured with time)
    • output size of all files in MB
    • memory consumed in MB
  6. Put the results of the experiments for both pipelines online, e.g. on Google Drive.

  7. Write a report including the table above.

  8. Write an outline of the thesis. Add references, e.g. the Spark books and master theses listed below.
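
A minimal, hypothetical sketch of how the number of words, the number of relations and the relation overlap of the two DT outputs could be computed with Spark; the file paths and the "word1<TAB>word2<TAB>score" format are assumptions.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("dt-compare").getOrCreate()
    val sc = spark.sparkContext

    // Read a DT file and keep the distinct (word1, word2) relation pairs.
    def relations(path: String) = sc.textFile(path)
      .map(_.split("\t")).filter(_.length >= 2)
      .map(a => (a(0), a(1)))
      .distinct()

    val originalDt = relations("dt-original.csv") // output of the original (MapReduce) pipeline
    val newDt      = relations("dt-spark.csv")    // output of the new (Spark) pipeline

    val nWordsOriginal = originalDt.keys.distinct().count()
    val nWordsNew      = newDt.keys.distinct().count()
    val nRelOriginal   = originalDt.count()
    val nRelNew        = newDt.count()
    val overlap        = originalDt.intersection(newDt).count()

    println(s"words in DT:     $nWordsOriginal (original) vs $nWordsNew (new)")
    println(s"relations in DT: $nRelOriginal (original) vs $nRelNew (new)")
    println(f"relation overlap: ${100.0 * overlap / nRelOriginal}%.2f%% of the original DT")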

References

Increase the number of digits in the similarity computation

Currently the output looks like this (a possible fix is sketched after the sample):

Specialized–lululemon Garmin-Transitions  0.001
Specialized–lululemon Katusha 0.001
Specialized–lululemon Polti   0.001
Specialized–lululemon Specialized–lululemon 0.001
Specialized–lululemon Milram  0.001
Specialized–lululemon Kawasaki    0.001
Specialized–lululemon Sparkasse   0.001
Specialized–lululemon Sky 0.001
Specialized–lululemon RadioShack  0.001
High-priority   High-priority   0.001
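
The exact fix depends on where the scores are serialized in the code, but the idea is simply to write them with more decimal places so that small differences are not all rounded to 0.001. A hypothetical illustration:

    val sim = 0.0014213
    println(f"$sim%.3f") // 0.001    -- current behaviour, many different scores collapse to the same value
    println(f"$sim%.6f") // 0.001421 -- more digits keep the scores distinguishable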
