ontologyportal / semconcor

Semantic Concordancer that uses dependency parsing and the Suggested Upper Merged Ontology (SUMO) to support linguistic analysis at a level of semantic abstraction above the original textual elements

Home Page: http://www.ontologyportal.org

Introduction
============

Semantic Concordancer

Concordancers are an accepted and valuable part of the tool set of linguists and
lexicographers. They allow the user to see the context of use of a word or
phrase in a corpus. One challenge is that there may be too many results for
short phrases or common words when only a specific context is desired. However,
finding meaningful groupings of usage may be challenging or impractical if it
means enumerating long lists of possible values, such as city names. If a tool
existed that could create some semantic abstractions, it would free the
lexicographer from the need to resort to customized development of analysis
software. To address this need, we have developed a Semantic Concordancer that
uses dependency parsing and the Suggested Upper Merged Ontology (SUMO) to
support linguistic analysis at a level of semantic abstraction above the
original textual elements.

Users of this work who publish academic papers are asked to cite:

Pease, A., and Cheung, A. (2018). Toward A Semantic Concordancer, to appear.


Linux Installation
==================

Please install https://github.com/ontologyportal/sigmakee and
https://github.com/ontologyportal/sigmanlp first.

cd ~/workspace
git clone https://github.com/ontologyportal/semconcor
cd ~/Programs
wget 'http://www.h2database.com/h2-2017-06-10.zip'
unzip h2-2017-06-10.zip

You'll need to set Indexer.JDBCstring to point to your corpus database and
modify the main() method of Indexer to read your corpus. Currently,
the system can read from:

- the FCE corpus, https://www.ilexir.co.uk/datasets/index.html
- Wikipedia, http://www.evanjones.ca/software/wikipedia2text-extracted.txt.bz2 (plain text of 10M words)
- or our Hong Kong court judgment corpus, which is not generally available.

Note that the system is currently slow: even indexing the FCE corpus will take days.

Assuming you've downloaded the FCE corpus to ~/corpora/fce-released-dataset,
first create the empty database:

java -cp ~/Programs/h2/bin/h2*.jar org.h2.tools.RunScript -url jdbc:h2:~/corpora/FCE -script ~/workspace/semconcor/script.sql

Then start the DB engine:

java -jar ~/Programs/h2/bin/h2*.jar &

Then run the indexer

java -Xmx2G -cp "$HOME/workspace/semconcor/build/classes:$HOME/workspace/semconcor/build/lib/*:$HOME/workspace/sigmanlp/build/lib/*" com.articulate.semconcor.Indexer

Finally, you can run the concordancer on the command line

java -Xmx2G -cp "$HOME/workspace/semconcor/build/classes:$HOME/workspace/semconcor/build/lib/*:$HOME/workspace/sigmanlp/build/lib/*" com.articulate.semconcor.Searcher -i

or start it up in Tomcat

$CATALINA_HOME/bin/startup.sh

and point your browser at

localhost:8080/semconcor/semconcor.jsp
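
The Indexer and Searcher commands above use the same classpath, so it may be
convenient to define it once. The following is only a sketch, assuming
semconcor and sigmanlp were cloned and built under ~/workspace as in the steps
above. WORKSPACE, SEMCONCOR_CP, run_indexer and run_searcher are hypothetical
names and are not part of semconcor.

```shell
# Hypothetical helper names; set WORKSPACE beforehand if your checkouts live elsewhere.
WORKSPACE="${WORKSPACE:-$HOME/workspace}"

# One classpath shared by both commands. The value is quoted so the shell
# leaves the * wildcards alone and java expands them to the jars in each lib dir.
SEMCONCOR_CP="$WORKSPACE/semconcor/build/classes:$WORKSPACE/semconcor/build/lib/*:$WORKSPACE/sigmanlp/build/lib/*"

# Index the corpus configured in Indexer (this can take days).
run_indexer() {
  java -Xmx2G -cp "$SEMCONCOR_CP" com.articulate.semconcor.Indexer
}

# Start the command-line concordancer with the -i option used above.
run_searcher() {
  java -Xmx2G -cp "$SEMCONCOR_CP" com.articulate.semconcor.Searcher -i
}
```

After sourcing these lines in a shell, calling run_indexer or run_searcher runs
the corresponding command with the shared classpath.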
