
snowball's Introduction

Snowball Algorithm

(Note: This program requires non-code resources that can be locally accessed. Preprocessing/building the database takes a while. This README will be updated with what resources are needed and where to get them, plus a recommended way to organize the resource directory.)

Based on this paper.

How to run

  1. Go to ~/resources/snowball/

  2. Make a directory <some_name>

  3. Go to <some_name>/

  4. Create a file called 'seeds'. Contents are as follows:

    • 1st line: A name for your relation (e.g. located_in). No whitespace.
    • 2nd line: Subject tag and object tag, separated by at least one whitespace character. See the Stanford CoreNLP documentation for the allowed tags. Make sure they are capitalized.
    • 3rd line onwards: One pair of tokens per line. Whitespace inside a token is replaced by one underscore, and the two tokens are separated by whitespace.
  5. Open config/config.py in project directory. Replace <some_name> in SNOWBALL_SEEDS_FILE with the name you chose.

  6. The other SNOWBALL_* parameters can be changed, but be careful. Increasing iterations will make the program run more slowly per iteration.

  7. Start Elasticsearch. Go to ~/resources/elasticsearch-1.7.0/bin/ and run:

    ./elasticsearch

    It is good to do this in a screen session.

  8. Return to project directory and run the following to start Snowball: python main/main.py

  9. You can monitor the system by running: tail -f ~/resources/logs/snowball.log

  10. Tuples will be in ~/resources/snowball/<some_name>/tuples. Each line is a Tuple object that can be evaluated in Python. For a cleaner read of the results, run:

     python classes/read.py > out

     The results will be written to the file out. Each line is of the form SUBJ OBJ.
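
The seeds file layout from step 4 can be sketched in Python as follows. This is a minimal illustration and not part of the project: format_seeds and write_seeds are hypothetical helpers, and the relation name, the ORGANIZATION and LOCATION tags, and the example pairs are assumptions chosen for the demo. Check the Stanford CoreNLP documentation for the tags that apply to your relation.

```python
import os


def format_seeds(relation, subj_tag, obj_tag, pairs):
    """Return the text of a seeds file in the layout described in step 4."""
    if not relation or any(ch.isspace() for ch in relation):
        raise ValueError("relation name must be non-empty and contain no whitespace")
    # Line 1: relation name. Line 2: subject and object tags, capitalized.
    lines = [relation, "%s %s" % (subj_tag.upper(), obj_tag.upper())]
    for subj, obj in pairs:
        # Whitespace inside a token becomes one underscore; a single space
        # separates the subject token from the object token.
        lines.append("%s %s" % ("_".join(subj.split()), "_".join(obj.split())))
    return "\n".join(lines) + "\n"


def write_seeds(directory, relation, subj_tag, obj_tag, pairs):
    """Write the seeds text to <directory>/seeds and return the file path."""
    path = os.path.join(directory, "seeds")
    with open(path, "w") as handle:
        handle.write(format_seeds(relation, subj_tag, obj_tag, pairs))
    return path


if __name__ == "__main__":
    # Assumed example relation, tags, and pairs; adjust to your own data.
    print(format_seeds("located_in", "organization", "location",
                       [("Microsoft", "Redmond"), ("Google", "Mountain View")]))
```

Calling write_seeds with the directory created in step 2 would place the file where step 5 expects it, assuming SNOWBALL_SEEDS_FILE points at <some_name>/seeds.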

snowball's Issues

Errors when indexing wikipedia

Hi,
I've been trying to set up Snowball with Wikipedia.
I'm at the step of indexing Wikipedia with Elasticsearch, which I do by running:
python index/index.py
(By the way, I believe this step is missing from the README.)

The code successfully sets up the corenlp server but then runs into the following error:

...
...
INFO:CoreNLP_PyWrapper:Successful ping. The server has started.
INFO:CoreNLP_PyWrapper:Subprocess is ready.
Python(1828,0x7fffefa9d3c0) malloc: *** mach_vm_map(size=8872781095355314176) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Process Process-2:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "index/index.py", line 35, in work
parse = parser.combined_parse(xml, text, self.corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 221, in combined_parse
sentences = get_sentences(text, corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 203, in get_sentences
sents = parse(text, corenlp)['sentences']
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 88, in parse
return corenlp.parse_doc(text)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 226, in parse_doc
return self.send_command_and_parse_result(cmd, timeout, raw=raw)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 246, in send_command_and_parse_result
data = self.send_command_and_get_string_result(cmd, timeout)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 289, in send_command_and_get_string_result
data = self.outpipe_fp.read(remaining_size)
INFO:CoreNLP_JavaServer: INPUT: 1 documents, 50939 characters, 8747 tokens, 50939.0 char/doc, 8747.0 tok/doc RATES: 0.125 doc/sec, 1092.4 tok/sec
MemoryError

WARNING:CoreNLP_PyWrapper:Bad JSON returned from subprocess; returning null.
WARNING:CoreNLP_PyWrapper:Bad JSON length 392719, starts with: 'J","NN","."],"lemmas":["Industrial","agriculture","base","on","large-scale","monoculture","farming","have","become","the","dominant","agricultural","methodology","."],"tokens":["Industrial","agriculture","based","on","large-scale","monoculture","farming","has","become","the","dominant","agricultural","methodology","."],"char_offsets":[[590,600],[601,612],[613,618],[619,621],[622,633],[634,645],[646,653],[654,657],[658,664],[665,668],[669,677],[678,690],[691,702],[702,703]],"ner":["O","O","O","O","O","O","O","O","O","O","O","O","O","O"],"normner":["","","","","","","","","","","","","",""]},{"pos":["NNP","NNP",",","NN","NN",",","NNS","JJ","IN","NNS","CC","NNS",",","CC","JJ","NNS","VBP","IN","JJ","NNS","RB","VBD","NNS","IN","NN",",","CC","IN","DT","JJ","NN","VBP","VBN","JJ","JJ","NN","CC","JJ","JJ","NN","NNS","."],"lemmas":["Modern","agronomy",",","plant","breeding",",","agrochemical","such","as","pesticide","and","fertilizer",",","and","technological","development","have","in","many","c'
Process Process-1:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "index/index.py", line 35, in work
parse = parser.combined_parse(xml, text, self.corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 221, in combined_parse
sentences = get_sentences(text, corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 203, in get_sentences
sents = parse(text, corenlp)['sentences']
TypeError: 'NoneType' object has no attribute '__getitem__'

I see three problems here: the memory allocation error, the "bad JSON" warning, and the TypeError at the end. I'm not sure which of these is causing the failure, or whether they are connected.

  • I'm using the same CoreNLP and parser versions as in the example config file - "stanford-corenlp-full-2015-04-20".

  • CoreNLP Python wrapper installed from:
    https://github.com/brendano/stanford_corenlp_pywrapper

  • log files:
    log_0.log

    INFO (LOGGER 0): Now processing file: /Users/mac/git/wikipedia/extracted2/AA/wiki_00

    log_1.log

    INFO (LOGGER 1): Now processing file: /Users/mac/git/wikipedia/extracted2/AA/wiki_01

  • Wikipedia was extracted using the latest version of wikipedia-extractor, because the version in this repository wasn't working for me.

Would love to hear your thoughts.

Thanks,
Simon.
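
The last two symptoms in the log above appear to be linked: the wrapper warns that it is "returning null" after the bad JSON, and get_sentences then indexes that None with ['sentences'], which raises the TypeError. A minimal sketch of a guard follows, assuming the wrapper object exposes parse_doc as shown in the traceback. get_sentences_safe is a hypothetical helper and not the project's code, and it does not address the memory allocation failure itself; it only lets a worker skip a document whose parse came back empty.

```python
import logging

logger = logging.getLogger("index")


def get_sentences_safe(text, corenlp):
    """Return the parsed sentences, or an empty list when the parse failed.

    `corenlp` is assumed to be an object with a parse_doc(text) method that
    returns a dict with a 'sentences' key, or None on failure.
    """
    result = corenlp.parse_doc(text)
    if result is None:
        # The wrapper returned no parse (for example after bad JSON).
        logger.warning("No parse returned for a %d-character document; skipping",
                       len(text))
        return []
    return result.get("sentences", [])
```

With a guard like this, the worker can continue with the next document instead of terminating, which may make it easier to see which inputs trigger the allocation error.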
