snowball's Introduction

Snowball Algorithm

(Note: This program requires non-code resources that can be locally accessed. Preprocessing/building the database takes a while. This README will be updated with what resources are needed and where to get them, plus a recommended way to organize the resource directory.)

Based on this paper.

How to run

1. Go to ~/resources/snowball/
2. Make a directory <some_name>
3. Go to <some_name>/
4. Create a file called 'seeds'. Contents are as follows:
   - 1st line: A name for your relation (e.g. located_in). No whitespace.
   - 2nd line: Subject tag and object tag, separated by at least one whitespace character. See the Stanford CoreNLP documentation for the allowed tags. Make sure they are capitalized.
   - 3rd line onwards: One pair of tokens per line. Within a token, replace whitespace with a single underscore; separate the two tokens with whitespace.
5. Open config/config.py in the project directory. Replace <some_name> in SNOWBALL_SEEDS_FILE with the name you chose.
6. The other SNOWBALL_* parameters can be changed, but be careful. Increasing iterations will make the program run more slowly per iteration.
7. Start Elasticsearch. Go to ~/resources/elasticsearch-1.7.0/bin/ and run:

   ./elasticsearch

   It is good to do this in a screen session.
8. Return to the project directory and run the following to start Snowball: python main/main.py
9. You can monitor the system by running: tailf ~/resources/logs/snowball.log
10. Tuples will be in ~/resources/snowball/<some_name>/tuples. Each line is a Tuple object that can be evaluated in Python. For a cleaner read of the results, run:
```
 python classes/read.py > out
```
The results will be in out. Each line is of the form SUBJ OBJ.
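
As a rough illustration of the 'seeds' file layout described in the steps above, the following sketch builds the file contents from a relation name, two tags, and a list of pairs. The function name `format_seeds`, the tags ORGANIZATION and LOCATION, and the example pairs are assumptions made for this example only; check the Stanford CoreNLP documentation for the tags your setup actually accepts.

```python
import re


def format_seeds(relation, subj_tag, obj_tag, pairs):
    """Build the text of a 'seeds' file (hypothetical helper, not part of the project)."""
    if re.search(r"\s", relation):
        raise ValueError("relation name must not contain whitespace")
    # Line 1: relation name. Line 2: capitalized subject and object tags.
    lines = [relation, "%s %s" % (subj_tag.upper(), obj_tag.upper())]
    for subj, obj in pairs:
        # Inside a token, each run of whitespace becomes one underscore.
        subj_token = re.sub(r"\s+", "_", subj.strip())
        obj_token = re.sub(r"\s+", "_", obj.strip())
        lines.append("%s %s" % (subj_token, obj_token))
    return "\n".join(lines) + "\n"


# Illustrative values only; the tags and pairs here are assumptions.
example = format_seeds(
    "located_in",
    "organization",
    "location",
    [("Stanford University", "California"), ("Microsoft", "Redmond")],
)
print(example)
```

The returned string could then be written to ~/resources/snowball/<some_name>/seeds.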

snowball's Issues

Errors when indexing wikipedia

Hi,
I've been trying to set up snowball with Wikipedia.
I'm at the step of indexing Wikipedia using Elasticsearch, which I do by running:
python index/index.py
(By the way, I believe this step is missing from the README.)

The code successfully sets up the corenlp server but then runs into the following error:

...
...
INFO:CoreNLP_PyWrapper:Successful ping. The server has started.
INFO:CoreNLP_PyWrapper:Subprocess is ready.
Python(1828,0x7fffefa9d3c0) malloc: *** mach_vm_map(size=8872781095355314176) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Process Process-2:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "index/index.py", line 35, in work
parse = parser.combined_parse(xml, text, self.corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 221, in combined_parse
sentences = get_sentences(text, corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 203, in get_sentences
sents = parse(text, corenlp)['sentences']
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 88, in parse
return corenlp.parse_doc(text)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 226, in parse_doc
return self.send_command_and_parse_result(cmd, timeout, raw=raw)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 246, in send_command_and_parse_result
data = self.send_command_and_get_string_result(cmd, timeout)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 289, in send_command_and_get_string_result
data = self.outpipe_fp.read(remaining_size)
INFO:CoreNLP_JavaServer: INPUT: 1 documents, 50939 characters, 8747 tokens, 50939.0 char/doc, 8747.0 tok/doc RATES: 0.125 doc/sec, 1092.4 tok/sec
MemoryError

WARNING:CoreNLP_PyWrapper:Bad JSON returned from subprocess; returning null.
WARNING:CoreNLP_PyWrapper:Bad JSON length 392719, starts with: 'J","NN","."],"lemmas":["Industrial","agriculture","base","on","large-scale","monoculture","farming","have","become","the","dominant","agricultural","methodology","."],"tokens":["Industrial","agriculture","based","on","large-scale","monoculture","farming","has","become","the","dominant","agricultural","methodology","."],"char_offsets":[[590,600],[601,612],[613,618],[619,621],[622,633],[634,645],[646,653],[654,657],[658,664],[665,668],[669,677],[678,690],[691,702],[702,703]],"ner":["O","O","O","O","O","O","O","O","O","O","O","O","O","O"],"normner":["","","","","","","","","","","","","",""]},{"pos":["NNP","NNP",",","NN","NN",",","NNS","JJ","IN","NNS","CC","NNS",",","CC","JJ","NNS","VBP","IN","JJ","NNS","RB","VBD","NNS","IN","NN",",","CC","IN","DT","JJ","NN","VBP","VBN","JJ","JJ","NN","CC","JJ","JJ","NN","NNS","."],"lemmas":["Modern","agronomy",",","plant","breeding",",","agrochemical","such","as","pesticide","and","fertilizer",",","and","technological","development","have","in","many","c'
Process Process-1:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "index/index.py", line 35, in work
parse = parser.combined_parse(xml, text, self.corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 221, in combined_parse
sentences = get_sentences(text, corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 203, in get_sentences
sents = parse(text, corenlp)['sentences']
TypeError: 'NoneType' object has no attribute '__getitem__'

I see 3 problems here - the memory allocation error, the "bad json" warning, and the TypeError at the end. Not sure which of these is causing the problem or if they are connected.

I'm using the same corenlp and parser versions as in the example config file - "stanford-corenlp-full-2015-04-20".
Corenlp python wrapper installed from:
https://github.com/brendano/stanford_corenlp_pywrapper
log files:
log_0.log

INFO (LOGGER 0): Now processing file: /Users/mac/git/wikipedia/extracted2/AA/wiki_00

log_1.log

INFO (LOGGER 1): Now processing file: /Users/mac/git/wikipedia/extracted2/AA/wiki_01
Wikipedia was extracted using the latest version of wikipedia-extractor; the version in this repository wasn't working for me.

Would love to hear your thoughts.

Thanks,
Simon.
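
One possible reading of the log above, offered as an assumption rather than a confirmed diagnosis: the wrapper warns that it is "returning null" after the bad JSON, so `parse(text, corenlp)` may yield `None`, and subscripting that with `['sentences']` would produce the final TypeError. A minimal sketch of guarding against that case follows. The names `get_sentences_safely`, `failing_parse`, and `working_parse` are hypothetical stand-ins, not functions from this repository or from the wrapper, and the guard does not address the MemoryError itself.

```python
def get_sentences_safely(text, parse_fn):
    """Return parsed sentences, or an empty list when the parser yields None."""
    result = parse_fn(text)
    if result is None:
        # Skip this document instead of subscripting None.
        return []
    return result.get("sentences", [])


def failing_parse(text):
    # Stand-in for a parse call that hit the "Bad JSON" path.
    return None


def working_parse(text):
    # Stand-in for a successful parse; the real output has more fields.
    return {"sentences": [{"tokens": text.split()}]}


print(get_sentences_safely("Industrial agriculture", failing_parse))
print(get_sentences_safely("Industrial agriculture", working_parse))
```

With a guard like this, a document whose parse fails would be skipped rather than ending the worker process.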

aadah / snowball
