

stanford_pipeline

Program to run scraped news stories through Stanford's CoreNLP program.

The program pulls stories that were added to the database within the past day and that have not yet been parsed with CoreNLP. Once parsed, the parse trees are placed back into the database. The program is currently set to process the first six sentences of each story.
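The story-selection step can be sketched roughly as below. This is an illustrative sketch, not the repository's actual code: the field names `parsed_sents` and `date_added` are assumptions about the schema.

```python
from datetime import datetime, timedelta

def build_story_query(hours=24):
    """Build a MongoDB query dict selecting stories added in the last
    `hours` that have not yet been parsed. The field names
    `parsed_sents` and `date_added` are illustrative assumptions."""
    cutoff = datetime.utcnow() - timedelta(hours=hours)
    return {
        "parsed_sents": {"$exists": False},  # not yet run through CoreNLP
        "date_added": {"$gte": cutoff},      # added within the past day
    }

# With pymongo, the query would be used roughly like:
#   stories = collection.find(build_story_query())
```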

This program makes extensive use of Brendan O'Connor's wrapper for CoreNLP. The current install comes from my (John Beieler) fork. The config file for CoreNLP makes use of the shift-reduce parser introduced in CoreNLP 3.4.

CoreNLP Setup

This pipeline depends on having CoreNLP 3.4 with the shift-reduce parser. Download the models like this:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2014-06-16.zip
unzip stanford-corenlp-full-2014-06-16.zip
mv stanford-corenlp-full-2014-06-16 stanford-corenlp
cd stanford-corenlp
wget http://nlp.stanford.edu/software/stanford-srparser-2014-07-01-models.jar

If errors persist, try changing the path in default_config.ini from the relative path ~/stanford-corenlp to the full path, e.g., /home/ahalterman/stanford-corenlp.

Configuration

The default_config.ini file has several options that can be changed, including the MongoDB database and collection of stories to process and whether all unparsed stories should be processed or just the stories added in the last day.
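A minimal default_config.ini might look like the sketch below. The section and option names here are illustrative guesses based on the description above; check the file shipped with the repository for the exact keys.

```ini
; Hypothetical sketch of default_config.ini -- option names are
; illustrative, not necessarily the repository's exact keys.
[Database]
db = event_scrape
collection = stories

[Options]
; process only stories added in the last day, or the whole collection
parse_all = False

[CoreNLP]
; use a full path here if the relative one fails
stanford_dir = ~/stanford-corenlp
```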

Usage

python process.py

Up to a minute of [Errno 111] Connection refused messages are normal during startup.


stanford_pipeline's Issues

Socket connection refused error

Hello, I'm trying to run stanford_pipeline, and I'm getting the following error:
vagrant@eldiablo:~/stanford_pipeline$ cd /home/vagrant/stanford_pipeline && /usr/bin/python /home/vagrant/stanford_pipeline/process.py
INFO:stanford:Running.
INFO:stanford:Returning 1178 total stories.
INFO:stanford:Setting up CoreNLP.
INFO:StanfordSocketWrap:Starting pipe subprocess, and waiting for signal it's ready, with command: exec java -Xmx4g -cp /usr/local/lib/python2.7/dist-packages/stanford_corenlp_pywrapper/lib/piperunner.jar:/usr/local/lib/python2.7/dist-packages/stanford_corenlp_pywrapper/lib/guava-13.0.1.jar:/usr/local/lib/python2.7/dist-packages/stanford_corenlp_pywrapper/lib/jackson-all-1.9.11.jar:/home/vagrant/stanford-corenlp/stanford-corenlp-3.4.jar:/home/vagrant/stanford-corenlp/stanford-corenlp-3.4-models.jar:/home/vagrant/stanford-corenlp/stanford-srparser-2014-07-01-models.jar corenlp.PipeCommandRunner --server 12340 --mode justparse --configfile stanford_config.ini
[Server] Using mode type: justparse
Adding annotator tokenize
Adding annotator ssplit

Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
done [4.5 sec].
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ...INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
done [2.3 sec].
[Server] Using CoreNLP configuration file: stanford_config.ini
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Adding annotator lemma
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ...INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
INFO:StanfordSocketWrap:Waiting for startup: ping got exception: <class 'socket.error'> [Errno 111] Connection refused
^CTraceback (most recent call last):
  File "/home/vagrant/stanford_pipeline/process.py", line 145, in <module>
    main()
  File "/home/vagrant/stanford_pipeline/process.py", line 141, in main
    parser.stanford_parse(coll, stories, stanford_dir)
  File "/home/vagrant/stanford_pipeline/parser.py", line 33, in stanford_parse
    corenlp_libdir=stanford)
  File "/usr/local/lib/python2.7/dist-packages/stanford_corenlp_pywrapper/sockwrap.py", line 82, in __init__
    self.start_server()
  File "/usr/local/lib/python2.7/dist-packages/stanford_corenlp_pywrapper/sockwrap.py", line 99, in start_server
    time.sleep(STARTUP_BUSY_WAIT_INTERVAL_SEC)
KeyboardInterrupt
WARNING:StanfordSocketWrap:Killing subprocess 28486
vagrant@eldiablo:~/stanford_pipeline$

Any thoughts on how to fix this?

Thanks,
Matt

Problem with relative link to CoreNLP directory

This could just be a Mac problem, but I was getting an error when I tried to run process.py:

AssertionError: CoreNLP jar file does not seem to exist; are the paths correct?  Searched files: ['~/stanford-corenlp-full-2014-10-26/stanford-corenlp-3.4.jar', '~/stanford-corenlp-full-2014-10-26/stanford-corenlp-3.4-models.jar', '~/stanford-corenlp-full-2014-10-26/stanford-srparser-2014-07-01-models.jar']

I've had trouble before with relative paths in Python, so I changed the default_config.ini file to use the full path to CoreNLP on my system (i.e., /Users/andy/stanford-corenlp-3-4/ rather than ~/stanford-corenlp-3-4).

When I did that, everything worked fine.

¯\_(ツ)_/¯

RuntimeError: maximum recursion depth exceeded in cmp

I have tried to set up a pipeline following the instructions on Andy Halterman's blog. Everything goes well until I try to run the Stanford pipeline on a collection of articles in MongoDB.

It seems at first that the process goes well, but after ~20 minutes an error occurs:

RuntimeError: maximum recursion depth exceeded in cmp
INFO:StanfordSocketWrap:Subprocess seems to be stopped, exit code 1
INFO:StanfordSocketWrap:Subprocess seems to be stopped, exit code 1

At first I thought it might simply exceed the recursion limit because of the number of articles the custom scraper pulls from DW (984 in my case), but the same thing happens with a collection of 409 articles.

Though stanford.log shows that the operation was successful:

INFO 2018-10-04 13:18:18,371: Getting today's unparsed stories from db 'event_scrape', collection 'dw_test'
INFO 2018-10-04 13:18:18,371: Querying for all unparsed stories added within the last day
INFO 2018-10-04 13:18:18,373: Returning 984 total stories.
INFO 2018-10-04 13:18:18,375: Setting up CoreNLP.
INFO 2018-10-04 17:00:42,344: Running.

I've checked the database in mongo shell with db.dw_test.findOne() - it still contains unparsed text.

Just in case, I've also tried running the collection through the Phoenix pipeline, but it got 0 sentences coded.

UPD: (just in case) I've tried testing it on a mongo collection with only 1 article, and it's still the same - after numerous retries, it throws the error specified above. Here is a chunk of the terminal output, in case it says something in particular.

Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ... Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: java.io.IOException: Unable to resolve "edu/stanford/nlp/models/srparser/englishSR.ser.gz" as either class path, filename or URL

Any ideas on why I can't 'grab' the shift-reduce parser?
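One way to check whether the shift-reduce model is actually available is to look inside the srparser jar (a jar is just a zip archive) for the `englishSR.ser.gz` entry that CoreNLP failed to resolve. A quick sketch; the jar path in the comment is an assumption, so point it at your own download:

```python
import zipfile

SR_MODEL = "edu/stanford/nlp/models/srparser/englishSR.ser.gz"

def jar_contains_model(jar_path, member=SR_MODEL):
    """Return True if the jar (a zip archive) contains the
    shift-reduce parser model CoreNLP is trying to load."""
    with zipfile.ZipFile(jar_path) as jar:
        return member in jar.namelist()

# Example (path is an assumption -- use your own location):
#   jar_contains_model("stanford-corenlp/stanford-srparser-2014-07-01-models.jar")
```

If this returns False, the jar on the classpath is not the srparser models jar, which would explain the "Unable to resolve" exception.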

OS X 10.10.4.
David-Laxers-MacBook-Pro:stanford_pipeline davidlaxer$ python -V
Python 2.7.10 :: Anaconda 2.3.0 (x86_64)
David-Laxers-MacBook-Pro:stanford_pipeline davidlaxer$ java -version
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

btw - I increased STARTUP_BUSY_WAIT_INTERVAL_SEC to 200.0 in sockwrap.py in stanford_corenlp_pywrapper.

David-Laxers-MacBook-Pro:stanford_pipeline davidlaxer$ python process.py
INFO:stanford:Running.
INFO:stanford:Returning 601 total stories.
INFO:stanford:Setting up CoreNLP.

Setting up StanfordNLP. The program isn't dead. Promise.
INFO:StanfordSocketWrap:Starting pipe subprocess, and waiting for signal it's ready, with command: exec java -Xmx4g -cp /users/davidlaxer/anaconda/lib/python2.7/site-packages/stanford_corenlp_pywrapper-0.1.0-py2.7.egg/stanford_corenlp_pywrapper/lib/piperunner.jar:/users/davidlaxer/anaconda/lib/python2.7/site-packages/stanford_corenlp_pywrapper-0.1.0-py2.7.egg/stanford_corenlp_pywrapper/lib/guava-13.0.1.jar:/users/davidlaxer/anaconda/lib/python2.7/site-packages/stanford_corenlp_pywrapper-0.1.0-py2.7.egg/stanford_corenlp_pywrapper/lib/jackson-all-1.9.11.jar:/Users/davidlaxer/stanford-corenlp-full-2015-04-20/stanford-corenlp-3.5.2.jar:/Users/davidlaxer/stanford-corenlp-full-2015-04-20/stanford-corenlp-3.5.2-models.jar:/Users/davidlaxer/stanford-corenlp-full-2015-04-20/stanford-srparser-2014-07-01-models.jar corenlp.PipeCommandRunner --server 12340 --mode justparse --configfile stanford_config.ini
[Server] Using mode type: justparse
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.6 sec].
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [1.9 sec].
[Server] Using CoreNLP configuration file: stanford_config.ini
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Adding annotator lemma
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ... Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: java.io.IOException: Unable to resolve "edu/stanford/nlp/models/srparser/englishSR.ser.gz" as either class path, filename or URL
at edu.stanford.nlp.parser.common.ParserGrammar.loadModel(ParserGrammar.java:183)
at edu.stanford.nlp.pipeline.ParserAnnotator.loadModel(ParserAnnotator.java:197)
at edu.stanford.nlp.pipeline.ParserAnnotator.<init>(ParserAnnotator.java:107)
at edu.stanford.nlp.pipeline.AnnotatorImplementations.parse(AnnotatorImplementations.java:145)
at edu.stanford.nlp.pipeline.AnnotatorFactories$11.create(AnnotatorFactories.java:453)
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:85)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:289)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:126)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:122)
at corenlp.Parse.setConfigurationFromFile(Parse.java:189)
at corenlp.PipeCommandRunner.main(PipeCommandRunner.java:83)
Caused by: java.io.IOException: Unable to resolve "edu/stanford/nlp/models/srparser/englishSR.ser.gz" as either class path, filename or URL
at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:481)
at edu.stanford.nlp.io.IOUtils.readObjectFromURLOrClasspathOrFileSystem(IOUtils.java:313)
at edu.stanford.nlp.parser.common.ParserGrammar.loadModel(ParserGrammar.java:180)
... 10 more
^CTraceback (most recent call last):
  File "process.py", line 138, in <module>
    run()
  File "process.py", line 134, in run
    parser.stanford_parse(coll, stories, stanford_dir)
  File "/Users/davidlaxer/stanford_pipeline/parser.py", line 33, in stanford_parse
    corenlp_libdir='/Users/davidlaxer/stanford-corenlp-full-2015-04-20/')
  File "/users/davidlaxer/anaconda/lib/python2.7/site-packages/stanford_corenlp_pywrapper-0.1.0-py2.7.egg/stanford_corenlp_pywrapper/sockwrap.py", line 82, in __init__
    self.start_server()
  File "/users/davidlaxer/anaconda/lib/python2.7/site-packages/stanford_corenlp_pywrapper-0.1.0-py2.7.egg/stanford_corenlp_pywrapper/sockwrap.py", line 99, in start_server
    time.sleep(STARTUP_BUSY_WAIT_INTERVAL_SEC)

More config options

You should be able to specify a few more things in the config file:

  • Mongo db
  • Mongo collection
  • process today or process the whole collection
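Reading those options with the standard library's configparser could look like this sketch; the section and key names are assumptions, not the repository's actual ones:

```python
import configparser

def read_pipeline_config(path):
    """Parse the pipeline config file and return the database name,
    collection name, and whether to process the whole collection.
    Section/key names here are illustrative assumptions."""
    config = configparser.ConfigParser()
    config.read(path)
    return {
        "db": config.get("Database", "db"),
        "collection": config.get("Database", "collection"),
        "parse_all": config.getboolean("Options", "parse_all"),
    }
```

`getboolean` accepts the usual ini spellings (`True`/`true`/`yes`/`1`), which keeps the "process today vs. whole collection" flag forgiving of hand edits.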

pywrapper issue

Trying to use the pipeline, got this error:

vagrant@vagrant-ubuntu-trusty-64:~/stanford_pipeline$ python process.py 
INFO:stanford:Running.
INFO:stanford:Returning 7612 total stories.
INFO:stanford:Setting up CoreNLP.

Setting up StanfordNLP. The program isn't dead. Promise.
Exception AttributeError: "SockWrap instance has no attribute 'proc'" in <bound method SockWrap.__del__ of <stanford_corenlp_pywrapper.sockwrap.SockWrap instance at 0x7f1128cd17e8>> ignored
Traceback (most recent call last):
  File "process.py", line 145, in <module>
    main()
  File "process.py", line 141, in main
    parser.stanford_parse(coll, stories, stanford_dir)
  File "/home/vagrant/stanford_pipeline/parser.py", line 33, in stanford_parse
    corenlp_libdir=stanford)
TypeError: __init__() got an unexpected keyword argument 'corenlp_libdir'

Looks like a named argument got removed in the CoreNLP pywrapper, or something like that. Do you guys know anything about this?
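A defensive way to cope with a renamed constructor keyword is to inspect the callable's signature before invoking it. The sketch below uses only the standard library; the alternative name `corenlp_jars` in the usage comment is a guess at what the wrapper renamed the argument to, not a confirmed API:

```python
import inspect

def pick_kwarg(callable_obj, candidates):
    """Return the first candidate keyword name the callable accepts,
    or None if none match. Useful when a library renames an argument
    between versions (e.g. `corenlp_libdir` vs. something newer)."""
    params = inspect.signature(callable_obj).parameters
    return next((name for name in candidates if name in params), None)

# Hypothetical usage against the pywrapper (names are assumptions):
#   kw = pick_kwarg(SockWrap.__init__, ["corenlp_libdir", "corenlp_jars"])
#   parser = SockWrap(**{kw: stanford_dir})
```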
