stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages

Home Page: https://stanfordnlp.github.io/stanza/

License: Other

Python 97.73% Shell 0.16% CSS 0.03% HTML 0.25% JavaScript 1.57% Perl 0.26%
python nlp natural-language-processing machine-learning deep-learning artificial-intelligence pytorch universal-dependencies named-entity-recognition corenlp

stanza's Introduction

Stanza: A Python NLP Library for Many Human Languages

The Stanford NLP Group's official Python NLP library. It contains support for running various accurate natural language processing tools on 60+ languages and for accessing the Java Stanford CoreNLP software from Python. For detailed information please visit our official website.

🔥  A new collection of biomedical and clinical English model packages is now available, offering a seamless experience for syntactic analysis and named entity recognition (NER) on biomedical literature text and clinical notes. For more information, check out our Biomedical models documentation page.
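
For illustration, downloading and running one of these packages might look like the sketch below ('mimic' and 'i2b2' are example package and NER model names from the biomedical documentation; see that page for the current list):

>>> import stanza
>>> stanza.download('en', package='mimic', processors={'ner': 'i2b2'})  # example names, not the only options
>>> nlp = stanza.Pipeline('en', package='mimic', processors={'ner': 'i2b2'})
>>> doc = nlp("The patient was treated with aspirin for chest pain.")
>>> print(doc.entities)  # clinical NER mentions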

References

If you use this library in your research, please cite our ACL 2020 Stanza system demo paper:

@inproceedings{qi2020stanza,
    title={Stanza: A {Python} Natural Language Processing Toolkit for Many Human Languages},
    author={Qi, Peng and Zhang, Yuhao and Zhang, Yuhui and Bolton, Jason and Manning, Christopher D.},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    year={2020}
}

If you use our biomedical and clinical models, please also cite our Stanza Biomedical Models description paper:

@article{zhang2021biomedical,
    author = {Zhang, Yuhao and Zhang, Yuhui and Qi, Peng and Manning, Christopher D and Langlotz, Curtis P},
    title = {Biomedical and clinical {E}nglish model packages for the {S}tanza {P}ython {NLP} library},
    journal = {Journal of the American Medical Informatics Association},
    year = {2021},
    month = {06},
    issn = {1527-974X}
}

The PyTorch implementation of the neural pipeline in this repository is due to Peng Qi (@qipeng), Yuhao Zhang (@yuhaozhang), and Yuhui Zhang (@yuhui-zh15), with help from Jason Bolton (@j38), Tim Dozat (@tdozat) and John Bauer (@AngledLuffa). Maintenance of this repo is currently led by John Bauer.

If you use the CoreNLP software through Stanza, please cite the CoreNLP software package and the respective modules as described here ("Citing Stanford CoreNLP in papers"). The CoreNLP client is mostly written by Arun Chaganty, and Jason Bolton spearheaded merging the two projects together.

If you use the Semgrex or Ssurgeon part of CoreNLP, please cite our GURT paper on Semgrex and Ssurgeon:

@inproceedings{bauer-etal-2023-semgrex,
    title = "Semgrex and Ssurgeon, Searching and Manipulating Dependency Graphs",
    author = "Bauer, John  and
      Kiddon, Chlo{\'e}  and
      Yeh, Eric  and
      Shan, Alex  and
      D. Manning, Christopher",
    booktitle = "Proceedings of the 21st International Workshop on Treebanks and Linguistic Theories (TLT, GURT/SyntaxFest 2023)",
    month = mar,
    year = "2023",
    address = "Washington, D.C.",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.tlt-1.7",
    pages = "67--73",
    abstract = "Searching dependency graphs and manipulating them can be a time consuming and challenging task to get right. We document Semgrex, a system for searching dependency graphs, and introduce Ssurgeon, a system for manipulating the output of Semgrex. The compact language used by these systems allows for easy command line or API processing of dependencies. Additionally, integration with publicly released toolkits in Java and Python allows for searching text relations and attributes over natural text.",
}

Issues and Usage Q&A

To ask questions, report issues or request features 🤔, please use the GitHub Issue Tracker. Before creating a new issue, please make sure to search for existing issues that may solve your problem, or visit the Frequently Asked Questions (FAQ) page on our website.

Contributing to Stanza

We welcome community contributions to Stanza in the form of bugfixes 🛠️ and enhancements 💡! If you want to contribute, please first read our contribution guidelines.

Installation

pip

Stanza supports Python 3.6 or later. We recommend that you install Stanza via pip, the Python package manager. To install, simply run:

pip install stanza

This will also resolve all of Stanza's dependencies, for instance PyTorch 1.3.0 or above.

If you currently have a previous version of Stanza installed, use:

pip install stanza -U

Anaconda

To install Stanza via Anaconda, use the following conda command:

conda install -c stanfordnlp stanza

Note that, for now, installing Stanza via Anaconda does not work for Python 3.10; for Python 3.10, please install via pip.

From Source

Alternatively, you can install from the source of this git repository, which gives you more flexibility when developing on top of Stanza. For this option, run

git clone https://github.com/stanfordnlp/stanza.git
cd stanza
pip install -e .

Running Stanza

Getting Started with the neural pipeline

To run your first Stanza pipeline, simply follow these steps in your Python interactive interpreter:

>>> import stanza
>>> stanza.download('en')       # This downloads the English models for the neural pipeline
>>> nlp = stanza.Pipeline('en') # This sets up a default neural pipeline in English
>>> doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
>>> doc.sentences[0].print_dependencies()

If you encounter requests.exceptions.ConnectionError, please try to use a proxy:

>>> import stanza
>>> proxies = {'http': 'http://ip:port', 'https': 'http://ip:port'}
>>> stanza.download('en', proxies=proxies)  # This downloads the English models for the neural pipeline
>>> nlp = stanza.Pipeline('en')             # This sets up a default neural pipeline in English
>>> doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
>>> doc.sentences[0].print_dependencies()

The last command will print out the words in the first sentence in the input string (or Document, as it is represented in Stanza), as well as the index of the word that governs each word in the Universal Dependencies parse of that sentence (its "head"), along with the dependency relation between the two words. The output should look like:

('Barack', '4', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '4', 'aux:pass')
('born', '0', 'root')
('in', '6', 'case')
('Hawaii', '4', 'obl')
('.', '4', 'punct')

See our getting started guide for more details.

Accessing Java Stanford CoreNLP software

Aside from the neural pipeline, this package also includes an official wrapper for accessing the Java Stanford CoreNLP software with Python code.

There are a few initial setup steps.

  • Download Stanford CoreNLP and models for the language you wish to use
  • Put the model jars in the distribution folder
  • Tell the Python code where Stanford CoreNLP is located by setting the CORENLP_HOME environment variable (e.g., in *nix): export CORENLP_HOME=/path/to/stanford-corenlp-4.5.3

We provide comprehensive examples in our documentation that show how one can use CoreNLP through Stanza and extract various annotations from it.
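
As a minimal sketch of what this looks like (assuming CORENLP_HOME is set as above; the annotators here are just an example subset):

>>> from stanza.server import CoreNLPClient
>>> with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos'], timeout=30000, memory='6G') as client:
...     ann = client.annotate("Stanza talks to CoreNLP through a local server.")
...     print(ann.sentence[0].token[0].pos)  # POS tag of the first token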

Online Colab Notebooks

To get you started, we also provide interactive Jupyter notebooks in the demo folder. You can also open these notebooks and run them interactively on Google Colab. To view all available notebooks, follow these steps:

  • Go to the Google Colab website
  • Navigate to File -> Open notebook, and choose GitHub in the pop-up menu
  • Note that you do not need to give Colab access permission to your GitHub account
  • Type stanfordnlp/stanza in the search bar, and press Enter

Trained Models for the Neural Pipeline

We currently provide models for all of the Universal Dependencies treebanks v2.8, as well as NER models for a few widely-spoken languages. You can find instructions for downloading and using these models here.

Batching To Maximize Pipeline Speed

To maximize speed performance, it is essential to run the pipeline on batches of documents. Running a for loop on one sentence at a time will be very slow. The best approach at this time is to concatenate documents together, with each document separated by a blank line (i.e., two line breaks \n\n). The tokenizer will recognize blank lines as sentence breaks. We are actively working on improving multi-document processing.
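
For example, a sketch of this batching approach (the document texts are placeholders):

>>> import stanza
>>> nlp = stanza.Pipeline('en')
>>> documents = ["This is the first document.", "Here is a second, longer document."]
>>> doc = nlp("\n\n".join(documents))  # blank lines are treated as sentence breaks
>>> print(len(doc.sentences))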

Training your own neural pipelines

All neural modules in this library can be trained with your own data. The tokenizer, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer and the dependency parser require CoNLL-U formatted data, while the NER model requires the BIOES format. Currently, we do not support model training via the Pipeline interface. Therefore, to train your own models, you need to clone this git repository and run training from the source.
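
For reference, BIOES tags mark the Beginning, Inside, and End of multi-token entity spans, plus Single-token entities and Outside tokens. A hypothetical fragment of NER training data in this scheme might look like the following (see the training documentation for the exact file layout):

John    S-PER
visited O
New     B-LOC
York    I-LOC
City    E-LOC
.       O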

For detailed step-by-step guidance on how to train and evaluate your own models, please visit our training documentation.

LICENSE

Stanza is released under the Apache License, Version 2.0. See the LICENSE file for more details.

stanza's People

Contributors

angledluffa, anwesham-lab, arunchaganty, asears, dan-zheng, davidrft, dvzubarev, gawy, gsychi, hungbui0411, innernull, j38, jemoka, k-sap, lwolfsonkin, manning, mgrenander, mihail911, mrapacz, pltrdy, qipeng, secrolol, timgates42, vdobrovolskii, vythaihn, vzhong, yanirmr, yuhaozhang, yuhui-zh15, zhaochaocs


stanza's Issues

[Question] About BPE

The paper doesn't go into detail about the tokenizer when dealing with Unicode languages (like Hindi, Marathi, Japanese, etc.). Is Byte Pair Encoding used for those languages? Recently, very good implementations have been released in C++, used by Facebook LASER and Google's SentencePiece.
Could you release more details about the tokenization of Unicode languages (also for Unicode code points greater than \uffff, i.e., longer code points like \UXXXXXXXX)? Thanks.

Exception happened when executing data preparation.

I tested stanfordnlp in a Jupyter notebook. I followed the "Model Training and Evaluation" Chapter of your official website. All things run well until I typed

!cd stanfordnlp && ./scripts/prep_tokenize_data.sh

for the preparation of tokenized data, and it raised like this:

scripts/treebank_to_shorthand.sh: line 14: lang2lcode: bad array subscript
Preparing tokenizer data...
Traceback (most recent call last):
  File "stanfordnlp/utils/prepare_tokenizer_data.py", line 14, in <module>
    with open(args.plaintext_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/u/nlp/data/dependency_treebanks/CoNLL18///_-ud-.txt'
cp: cannot stat '/u/nlp/data/dependency_treebanks/CoNLL18///_-ud-.conllu': No such file or directory
cp: cannot stat '/u/nlp/data/dependency_treebanks/CoNLL18///_-ud-.txt': No such file or directory
./scripts/prep_tokenize_data.sh: line 31: [: ==: unary operator expected

Issue to train a new pipeline

First of all, thanks a lot for this new neural NLP pipeline. I'm currently trying to train a new model for French with my data + the UD datasets, but before that I would like to be able to properly reproduce the training steps. Right now, I'm struggling to train the tokenizer. Here's what I'm doing:

  • mkdir -p ./extern_data/word2vec
  • scripts/download_vectors.sh ./extern_data/word2vec/
  • mkdir -p ./data/uddata/
  • mkdir ./data/tokenize
  • cd ./data/uddata
  • git clone https://github.com/UniversalDependencies/UD_French-GSD.git
  • cd ../..
  • Changing the UDBASE env variable to ./data/uddata in scripts/config.sh
  • scripts/run_tokenize.sh UD_French-GSD

And I get the following output with several errors:

Preparing tokenizer train data...
Traceback (most recent call last):
  File "stanfordnlp/utils/prepare_tokenizer_data.py", line 14, in <module>
    with open(args.plaintext_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/uddata/UD_French-GSD/fr_gsd-ud-train.txt'
cp: cannot stat './data/uddata/UD_French-GSD/fr_gsd-ud-train.txt': No such file or directory
bash: warning: setlocale: LC_ALL: cannot change locale (fr_FR.UTF-8)
bash: warning: setlocale: LC_ALL: cannot change locale (fr_FR.UTF-8)
Preparing tokenizer dev data...
Traceback (most recent call last):
  File "stanfordnlp/utils/prepare_tokenizer_data.py", line 14, in <module>
    with open(args.plaintext_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/uddata/UD_French-GSD/fr_gsd-ud-dev.txt'
cp: cannot stat './data/uddata/UD_French-GSD/fr_gsd-ud-dev.txt': No such file or directory
Traceback (most recent call last):
  File "stanfordnlp/utils/avg_sent_len.py", line 12, in <module>
    with open(toklabels, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/tokenize/fr_gsd-ud-train.toklabels'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: ceil() argument after * must be an iterable, not float
Running tokenizer with ...
usage: tokenizer.py [-h] [--txt_file TXT_FILE] [--label_file LABEL_FILE]
                    [--json_file JSON_FILE] [--mwt_json_file MWT_JSON_FILE]
                    [--conll_file CONLL_FILE] [--dev_txt_file DEV_TXT_FILE]
                    [--dev_label_file DEV_LABEL_FILE]
                    [--dev_json_file DEV_JSON_FILE]
                    [--dev_conll_gold DEV_CONLL_GOLD] [--lang LANG]
                    [--shorthand SHORTHAND] [--mode {train,predict}]
                    [--emb_dim EMB_DIM] [--hidden_dim HIDDEN_DIM]
                    [--conv_filters CONV_FILTERS] [--no-residual]
                    [--no-hierarchical] [--hier_invtemp HIER_INVTEMP]
                    [--input_dropout] [--conv_res CONV_RES]
                    [--rnn_layers RNN_LAYERS] [--max_grad_norm MAX_GRAD_NORM]
                    [--anneal ANNEAL] [--anneal_after ANNEAL_AFTER]
                    [--lr0 LR0] [--dropout DROPOUT]
                    [--unit_dropout UNIT_DROPOUT] [--tok_noise TOK_NOISE]
                    [--weight_decay WEIGHT_DECAY] [--max_seqlen MAX_SEQLEN]
                    [--batch_size BATCH_SIZE] [--epochs EPOCHS]
                    [--steps STEPS] [--report_steps REPORT_STEPS]
                    [--shuffle_steps SHUFFLE_STEPS] [--eval_steps EVAL_STEPS]
                    [--save_name SAVE_NAME] [--load_name LOAD_NAME]
                    [--save_dir SAVE_DIR] [--cuda CUDA] [--cpu] [--seed SEED]
tokenizer.py: error: argument --max_seqlen: expected one argument
Running tokenizer in predict mode
Directory saved_models/tokenize do not exist; creating...
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jplu/stanfordnlp/stanfordnlp/models/tokenizer.py", line 182, in <module>
    main()
  File "/home/jplu/stanfordnlp/stanfordnlp/models/tokenizer.py", line 93, in main
    evaluate(args)
  File "/home/jplu/stanfordnlp/stanfordnlp/models/tokenizer.py", line 159, in evaluate
    mwt_dict = load_mwt_dict(args['mwt_json_file'])
  File "/home/jplu/stanfordnlp/stanfordnlp/models/tokenize/utils.py", line 10, in load_mwt_dict
    with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './data/tokenize/fr_gsd-ud-dev-mwt.json'
Traceback (most recent call last):
  File "stanfordnlp/utils/conll18_ud_eval.py", line 532, in <module>
    main()
  File "stanfordnlp/utils/conll18_ud_eval.py", line 500, in main
    evaluation = evaluate_wrapper(args)
  File "stanfordnlp/utils/conll18_ud_eval.py", line 483, in evaluate_wrapper
    system_ud = load_conllu_file(args.system_file)
  File "stanfordnlp/utils/conll18_ud_eval.py", line 477, in load_conllu_file
    _file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {}))
FileNotFoundError: [Errno 2] No such file or directory: './data/tokenize/fr_gsd.dev.pred.conllu'
fr_gsd

Apparently, judging by the stanfordnlp/utils/prepare_tokenizer_data.py file, there is indeed a need for txt files, but they do not exist in the repo of the UD dataset I'm using. Any hint on where I can find them? Or, if they are not provided for free, what should they look like, so that I can create them from the conllu files?

Thanks in advance :)

can't import stanfordnlp, error with "cannot import name 'pack_sequence'"

Hi,

I installed the library, but I can't even import it in Python.
The error message is below:

File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/stanfordnlp/models/pos/model.py", line 5, in <module>
    from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence, pack_sequence, PackedSequence
ImportError: cannot import name 'pack_sequence'

I googled this error, but found nothing.
Could anybody please help me?

How to use only the POS tagging part?

I want to re-annotate my corpus, which is already tokenized, so how can I use only the POS tagger? Here is what I have tried:

>>> from stanfordnlp import Pipeline
>>> nlp = Pipeline(processors='pos')
Use device: cpu
---
Loading: pos
With settings: 
{'model_path': '/Users/speedcell/stanfordnlp_resources/en_ewt_models/en_ewt_tagger.pt', 'pretrain_path': '/Users/speedcell/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Done loading processors!
---
>>> nlp("this is nice")
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-4351b2659f0b>", line 1, in <module>
    nlp("this is nice")
  File "/usr/local/lib/python3.7/site-packages/stanfordnlp/pipeline/core.py", line 74, in __call__
    self.process(doc)
  File "/usr/local/lib/python3.7/site-packages/stanfordnlp/pipeline/core.py", line 68, in process
    self.processors[processor_name].process(doc)
  File "/usr/local/lib/python3.7/site-packages/stanfordnlp/pipeline/pos_processor.py", line 19, in process
    doc, self.config['batch_size'], self.config, self.pretrain, vocab=self.vocab, evaluation=True)
  File "/usr/local/lib/python3.7/site-packages/stanfordnlp/models/pos/data.py", line 26, in __init__
    self.conll, data = self.load_doc(doc)
  File "/usr/local/lib/python3.7/site-packages/stanfordnlp/models/pos/data.py", line 127, in load_doc
    data = doc.conll_file.get(['word', 'upos', 'xpos', 'feats'], as_sentences=True)
AttributeError: 'NoneType' object has no attribute 'get'
>>> nlp(["this", "is", "nice"])
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-4-9346a903a804>", line 1, in <module>
    nlp(["this", "is", "nice"])
  File "/usr/local/lib/python3.7/site-packages/stanfordnlp/pipeline/core.py", line 74, in __call__
    self.process(doc)
  File "/usr/local/lib/python3.7/site-packages/stanfordnlp/pipeline/core.py", line 68, in process
    self.processors[processor_name].process(doc)
  File "/usr/local/lib/python3.7/site-packages/stanfordnlp/pipeline/pos_processor.py", line 19, in process
    doc, self.config['batch_size'], self.config, self.pretrain, vocab=self.vocab, evaluation=True)
  File "/usr/local/lib/python3.7/site-packages/stanfordnlp/models/pos/data.py", line 41, in __init__
    data = self.preprocess(data, self.vocab, self.pretrain_vocab, args)
UnboundLocalError: local variable 'data' referenced before assignment
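
For what it's worth, what I was hoping would work is something along these lines (tokenize_pretokenized is my guess from the documentation for feeding in already-tokenized text):

>>> from stanfordnlp import Pipeline
>>> nlp = Pipeline(processors='tokenize,pos', tokenize_pretokenized=True)  # guessed option
>>> doc = nlp("this is nice")
>>> doc.sentences[0].words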

"Vector file is not provided." error on Ubuntu with Python 3.6.8

I am getting this error when loading the pretrained models from disk.
I'm working on Ubuntu 18.04 with Python 3.6.8.

Here is the full traceback:

Traceback (most recent call last):
  File "/home/mega/anaconda3/envs/allennlp/lib/python3.6/site-packages/stanfordnlp/models/common/pretrain.py", line 38, in load
    data = torch.load(self.filename, lambda storage, loc: storage)
  File "/home/mega/anaconda3/envs/allennlp/lib/python3.6/site-packages/torch/serialization.py", line 368, in load
    return _load(f, map_location, pickle_module)
  File "/home/mega/anaconda3/envs/allennlp/lib/python3.6/site-packages/torch/serialization.py", line 542, in _load
    result = unpickler.load()
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mega/anaconda3/envs/allennlp/lib/python3.6/site-packages/stanfordnlp/pipeline/core.py", line 53, in __init__
    use_gpu=self.use_gpu)
  File "/home/mega/anaconda3/envs/allennlp/lib/python3.6/site-packages/stanfordnlp/pipeline/pos_processor.py", line 14, in __init__
    self.trainer = Trainer(pretrain=self.pretrain, model_file=config['model_path'], use_cuda=use_gpu)
  File "/home/mega/anaconda3/envs/allennlp/lib/python3.6/site-packages/stanfordnlp/models/pos/trainer.py", line 31, in __init__
    self.load(pretrain, model_file)
  File "/home/mega/anaconda3/envs/allennlp/lib/python3.6/site-packages/stanfordnlp/models/pos/trainer.py", line 106, in load
    self.model = Tagger(self.args, self.vocab, emb_matrix=pretrain.emb, share_hid=self.args['share_hid'])
  File "/home/mega/anaconda3/envs/allennlp/lib/python3.6/site-packages/stanfordnlp/models/common/pretrain.py", line 32, in emb
    self._vocab, self._emb = self.load()
  File "/home/mega/anaconda3/envs/allennlp/lib/python3.6/site-packages/stanfordnlp/models/common/pretrain.py", line 42, in load
    return self.read_and_save()
  File "/home/mega/anaconda3/envs/allennlp/lib/python3.6/site-packages/stanfordnlp/models/common/pretrain.py", line 50, in read_and_save
    raise Exception("Vector file is not provided.")
Exception: Vector file is not provided.

Exception: Vector file is not provided.

The error comes up on CentOS 7 on Google Cloud, running Python 3.7.2. I had upgraded to Python 3.7.2 specifically to fix this issue, as per the existing troubleshooting suggestions, but it is appearing again.

It works on my home computer under macOS, but gives the above error under CentOS 7 on the server.

storing dependencies output

I'm testing the Hebrew DP and getting interesting results, but I want to store them as a list or dictionary for further analysis. Getting Started shows how to print the dependencies, but is there also a way to get them as an iterable that can be stored as a list, dictionary, etc.? Since print_dependencies returns None, we can only capture the printout; I was wondering if there's a direct way to capture the dependencies (in Python).
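
For example, this is the kind of workaround I'm using now, building the list by hand from the attributes I see on each word object (a sketch):

>>> deps = [(word.text, word.governor, word.dependency_relation)
...         for word in doc.sentences[0].words]  # (text, head index, relation) tuples
>>> deps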

Dependency Parsing for Interrogative Sentences

I'm getting some very strange dependency parses on a set of question (interrogative) sentences, and I'm not sure why. As an example, I get the parse below when I use the old Stanford parser web app (http://nlp.stanford.edu:8080/parser/index.jsp) on the question "Why are people in dense areas more likely to become infected?" :

advmod(infected-11, Why-1)
auxpass(infected-11, are-2)
nsubjpass(infected-11, people-3)
case(areas-6, in-4)
amod(areas-6, dense-5)
nmod(people-3, areas-6)
advmod(likely-8, more-7)
amod(areas-6, likely-8)
mark(become-10, to-9)
acl(people-3, become-10)
root(ROOT-0, infected-11)

Which makes perfect sense to me. But I get the following parse for the same sentence when I use the stanfordnlp.Pipeline() parser (constructed as is with defaults):

advmod(to-9, Why-1)
cop(to-9, are-2)
nsubj(to-9, people-3)
case(more-7, in-4)
amod(more-7, dense-5)
nmod(in-4, areas-6)
advmod(to-9, more-7)
root(Why-1, likely-8)
mark(infected-11, to-9)
xcomp(to-9, become-10)
xcomp(infected-11, infected-11)
punct(to-9, ?-12)

Obviously there are some slight variations in delivery (inclusion of punct, etc.), but that parse seems just bizarre and wrong. I'm using 0.1.2 on Python 3.7.2, if that's of use. The code I used to get the second parse was:

["{}({}-{}, {}-{})".format(t.dependency_relation, sent.words[t.governor].text, t.governor+1, t.
    ...: text, t.index) for t in sent.words]

Cannot load models if default directory is changed.

Currently, I am getting the following error message if I change the Default download directory:

Default download directory: ~/eday/stanfordnlp_resources
Hit enter to continue or type an alternate directory.
~/eday/new_dir/stanfordnlp_resources

Downloading models for: en_ewt
Download location: ~/eday/new_dir/stanfordnlp_resources/en_ewt_models.zip
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.96G/1.96G [04:54<00:00, 11.5MB/s]

Download complete.  Models saved to: ~/eday/new_dir/stanfordnlp_resources/en_ewt_models.zip
Extracting models file for: en_ewt
Cleaning up...Done.
>>> nlp = stanfordnlp.Pipeline()
Use device: gpu
---
Loading: tokenize
With settings:
{'model_path': '~/eday/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Cannot load model from ~/eday/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt

This is probably because the DEFAULT_MODEL_DIR variable in resources.py does not get updated after the download_ud_model function is called.

poor sentence tokenization

Just to note, this is really not a massive deal; in most cases the sentence tokenization is very good, but I've noticed some failure cases and thought I'd mention them in case you're ever looking to improve it.

Like other sentence tokenizers, it doesn't really like seeing multiple capitalized words in sequence, and preemptively splits them apart. This doesn't happen for all instances of "Burger King", but in the corpus it appears to happen more than a handful of times.

1       Near    Near    ADP     IN      _       2       case    _       _
2       Burger  Burger  PROPN   NNP     Number=Sing     0       root    _       _

1       King    king    NOUN    NN      Number=Sing     8       nsubj   _       _
2       in      in      ADP     IN      _       4       case    _       _
3       city    city    NOUN    NN      Number=Sing     4       compound        _       _
4       centre  centre  NOUN    NN      Number=Sing     1       nmod    _       _
5       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   8       cop     _       _
6       the     the     DET     DT      Definite=Def|PronType=Art       8       det     _       _
7       adult   adult   ADJ     JJ      Degree=Pos      8       amod    _       _
8       establishment   establishment   NOUN    NN      Number=Sing     0       root    _       _
9       Alimentum       Alimentum       PROPN   NNP     Number=Sing     8       appos   _       _
10      .       .       PUNCT   .       _       8       punct   _       _
1       Alimentum       Alimentum       PROPN   NNP     Number=Sing     4       nsubj   _       _
2       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4       cop     _       _
3       near    near    ADP     IN      _       4       case    _       _
4       Burger  Burger  PROPN   NNP     Number=Sing     0       root    _       _

1       King    king    NOUN    NN      Number=Sing     0       root    _       _
2       in      in      ADP     IN      _       5       case    _       _
3       the     the     DET     DT      Definite=Def|PronType=Art       5       det     _       _
4       city    city    NOUN    NN      Number=Sing     5       compound        _       _
5       center  center  NOUN    NN      Number=Sing     1       nmod    _       _
6       .       .       PUNCT   .       _       1       punct   _       _

Can't import stanfordnlp

In [1]: import stanfordnlp
  File "stanfordnlp/pipeline/doc.py", line 175
    return f"<{self.__class__.__name__} index={self.index};words={self.words}>"
                                                                              ^
SyntaxError: invalid syntax

I'm getting this error on import. I tried both the 'pip install stanfordnlp' and 'install from git' methods. I'm using Python 2.7.15 on macOS Mojave.

Issue with tokensregex

Hello!

I want to use stanfordnlp to extract some Chinese words from a text file. My code is like this:

from stanfordnlp.server import CoreNLPClient

with open('F:\\project\\test.txt', 'r', encoding='utf8') as f:
    text = f.read()
with CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner','parse','depparse','coref'], timeout=30000, memory='16G') as client:
    pattern = '[{ner:"CITY"}|{ner: "STATE_OR_PROVINCE"}|{ner:"LOCATION"}]+'
    matches = client.tokensregex(text, pattern, to_words=True)
    print(matches)

But it raised an error:

UnicodeEncodeError: 'latin-1' codec can't encode character '\u9648' in position 0: Body ('陈') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

I then changed the encoding when reading the text file:

with open('F:\\project\\test.txt','r',encoding ='latin-1') as f:
     text=f.read()

The error went away, but the extracted words became:

  {'sentences': [{'length': 0}, {'0': {'text': '±±º£', 'begin': 0, 'end': 1}, 'length': 1}, {'length': 0}, {'length': 0}, {'0': {'text': 'Î人', 'begin': 0, 'end': 1}, 'length': 1}, {'length': 0}, {'length': 0}, {'length': 0}, {'length': 0}, {'length': 0}, {'length': 0}, {'0': {'text': 'ÁºÉ½', 'begin': 0, 'end': 1}, 'length': 1}]}

The extracted strings are not Chinese characters. I tried some decode and encode methods on these strange characters, hoping they could be transformed back into Chinese, but failed.

Can you give me some suggestions?

Training of tagger very slow due to a slow lookup in XPOS and UFeats vocabularies while evaluating on dev data

Thank you for the release! I have tried training the Slovenian tagger, and the training time is MUCH longer (days) than in the TensorFlow version (an hour or two), due to a very slow model evaluation on the dev data.

I think I have isolated the issue: it is the lookup of XPOS and UFeats labels in the corresponding vocabulary after prediction.

https://github.com/stanfordnlp/stanfordnlp/blob/421a8b3427178bde7c02c8c42f5621f351afa007/stanfordnlp/models/pos/trainer.py#L72-L73

TypeError: __init__() got an unexpected keyword argument 'reduction'

I am just loading the model, as specified:

>>> import stanfordnlp
>>> stanfordnlp.download('en')   # This downloads the English models for the neural pipeline
>>> nlp = stanfordnlp.Pipeline()

I am not sure why I have the following error messages:

  File "<stdin>", line 1, in <module>
  File "/home/lzy/anaconda3/envs/nlp/lib/python3.6/site-packages/stanfordnlp/pipeline/core.py", line 93, in __init__
    use_gpu=self.use_gpu)
  File "/home/lzy/anaconda3/envs/nlp/lib/python3.6/site-packages/stanfordnlp/pipeline/depparse_processor.py", line 14, in __init__
    self.trainer = Trainer(pretrain=self.pretrain, model_file=config['model_path'], use_cuda=use_gpu)
  File "/home/lzy/anaconda3/envs/nlp/lib/python3.6/site-packages/stanfordnlp/models/depparse/trainer.py", line 33, in __init__
    self.load(pretrain, model_file)
  File "/home/lzy/anaconda3/envs/nlp/lib/python3.6/site-packages/stanfordnlp/models/depparse/trainer.py", line 107, in load
    self.model = Parser(self.args, self.vocab, emb_matrix=pretrain.emb)
  File "/home/lzy/anaconda3/envs/nlp/lib/python3.6/site-packages/stanfordnlp/models/depparse/model.py", line 78, in __init__
    self.crit = nn.CrossEntropyLoss(ignore_index=-1, reduction='sum') # ignore padding
TypeError: __init__() got an unexpected keyword argument 'reduction'

Make torch dependency optional

Is it possible to make the torch package (582 MB) an optional dependency? I'm not sure whether it is necessary for inference on pipelines, but it's a huge package.

Collecting torch (from stanfordnlp->-r requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/31/ca/dd2c64f8ab5e7985c4af6e62da933849293906edcdb70dac679c93477733/torch-1.0.1.post2-cp36-cp36m-manylinux1_x86_64.whl (582.5MB)

Tokenize processor needs to be loaded

I was trying to load just the depparse processor, but I kept getting NoneType errors. It seems that the tokenize processor must be loaded in order to parse documents; otherwise the error occurs. If this is the case, then tokenize should be loaded by default.
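
For reference, a workaround sketch that avoids the error for me by explicitly listing the upstream processors:

>>> import stanfordnlp
>>> nlp = stanfordnlp.Pipeline(processors='tokenize,pos,lemma,depparse')  # tokenize listed explicitly
>>> doc = nlp("This parses fine.")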

Issue with DEFAULT_HOME_DIR

Hi,

Thank you so much for introducing a full fledged Python wrapper.

My issue is:

  1. When the models get downloaded, it prompts me to choose a folder path, with the default being home. My home directory in Linux has low space so I chose a different directory path. The models got downloaded there.
  2. When I run stanfordnlp.Pipeline(), it still looks for the models in the home directory, not where we downloaded them in the above step. This is because in stanfordnlp -> utils -> resources.py, the default directory is always assigned Path.home(). Instead, it should be wherever we downloaded the models.

Hope you guys take care of this real soon.

Thanks again,
Abhinay.

Pretrained file exists but cannot be loaded [Errno 22] Invalid argument, Vector file is not provided

While trying out the demo code I encounter an error Vector file is not provided.

The download of models seems successful and the paths also seem to be correct.

I use Python 3.7.0 (h4eca856_1, conda-forge) on Mac OS X 10.11.6.

How can I fix this?

Thanks in advance and thanks very much too for sharing Stanford NLP!

Downloading models for: en_ewt
Download location: /Users/peter/stanfordnlp_resources/en_ewt_models.zip
100%|██████████| 1.96G/1.96G [10:43<00:00, 4.64MB/s]

Download complete.  Models saved to: /Users/peter/stanfordnlp_resources/en_ewt_models.zip
Extracting models file for: en_ewt
Cleaning up...Done.
Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/Users/peter/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': '/Users/peter/stanfordnlp_resources/en_ewt_models/en_ewt_tagger.pt', 'pretrain_path': '/Users/peter/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Pretrained file exists but cannot be loaded from /Users/peter/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt, due to the following exception:
	[Errno 22] Invalid argument

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
~/anaconda/envs/k4/lib/python3.7/site-packages/stanfordnlp/models/common/pretrain.py in load(self)
     37             try:
---> 38                 data = torch.load(self.filename, lambda storage, loc: storage)
     39             except BaseException as e:

~/anaconda/envs/k4/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module)
    366     try:
--> 367         return _load(f, map_location, pickle_module)
    368     finally:

~/anaconda/envs/k4/lib/python3.7/site-packages/torch/serialization.py in _load(f, map_location, pickle_module)
    537     unpickler.persistent_load = persistent_load
--> 538     result = unpickler.load()
    539 

OSError: [Errno 22] Invalid argument

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-15-22747d34c3ed> in <module>
----> 1 nlp = stanfordnlp.Pipeline()

~/anaconda/envs/k4/lib/python3.7/site-packages/stanfordnlp/pipeline/core.py in __init__(self, processors, lang, models_dir, treebank, use_gpu, **kwargs)
     51             print(curr_processor_config)
     52             self.processors[processor_name] = NAME_TO_PROCESSOR_CLASS[processor_name](config=curr_processor_config,
---> 53                                                                                       use_gpu=self.use_gpu)
     54         print("Done loading processors!")
     55         print('---')

~/anaconda/envs/k4/lib/python3.7/site-packages/stanfordnlp/pipeline/pos_processor.py in __init__(self, config, use_gpu)
     12         self.pretrain = Pretrain(config['pretrain_path'])
     13         # set up trainer
---> 14         self.trainer = Trainer(pretrain=self.pretrain, model_file=config['model_path'], use_cuda=use_gpu)
     15         self.build_final_config(config)
     16 

~/anaconda/envs/k4/lib/python3.7/site-packages/stanfordnlp/models/pos/trainer.py in __init__(self, args, vocab, pretrain, model_file, use_cuda)
     29         if model_file is not None:
     30             # load everything from file
---> 31             self.load(pretrain, model_file)
     32         else:
     33             assert all(var is not None for var in [args, vocab, pretrain])

~/anaconda/envs/k4/lib/python3.7/site-packages/stanfordnlp/models/pos/trainer.py in load(self, pretrain, filename)
    104         self.args = checkpoint['config']
    105         self.vocab = MultiVocab.load_state_dict(checkpoint['vocab'])
--> 106         self.model = Tagger(self.args, self.vocab, emb_matrix=pretrain.emb, share_hid=self.args['share_hid'])
    107         self.model.load_state_dict(checkpoint['model'], strict=False)
    108 

~/anaconda/envs/k4/lib/python3.7/site-packages/stanfordnlp/models/common/pretrain.py in emb(self)
     30     def emb(self):
     31         if not hasattr(self, '_emb'):
---> 32             self._vocab, self._emb = self.load()
     33         return self._emb
     34 

~/anaconda/envs/k4/lib/python3.7/site-packages/stanfordnlp/models/common/pretrain.py in load(self)
     40                 print("Pretrained file exists but cannot be loaded from {}, due to the following exception:".format(self.filename))
     41                 print("\t{}".format(e))
---> 42                 return self.read_and_save()
     43             return data['vocab'], data['emb']
     44         else:

~/anaconda/envs/k4/lib/python3.7/site-packages/stanfordnlp/models/common/pretrain.py in read_and_save(self)
     48         # load from pretrained filename
     49         if self.vec_filename is None:
---> 50             raise Exception("Vector file is not provided.")
     51         print("Reading pretrained vectors from {}...".format(self.vec_filename))
     52         first = True

Exception: Vector file is not provided.

PermanentlyFailedException: Timed out waiting for service to come alive.

I am trying out the demo code for using the CoreNLP server.

# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner','parse','depparse','coref'], timeout=30000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

gives me this error:

PermanentlyFailedException: Timed out waiting for service to come alive.

I tried increasing the timeout limit, but no success.

Can't download catalan model

I am not able to download the Catalan model... I tried using Python:

requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

And directly using wget:

--2019-02-13 13:33:43-- http://nlp.stanford.edu/software/conll_2018/ca_ancora_models.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)… 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80… connected.
HTTP request sent, awaiting response… Read error (Connection reset by peer) in headers.
Retrying.

--2019-02-13 13:34:48-- (try: 2)
http://nlp.stanford.edu/software/conll_2018/ca_ancora_models.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80… connected.
HTTP request sent, awaiting response… Read error (Connection reset by peer) in headers.
Retrying.

--2019-02-13 13:35:54-- (try: 3) http://nlp.stanford.edu/software/conll_2018/ca_ancora_models.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80… connected.
HTTP request sent, awaiting response…

Exception: Vector file is not provided.

Hi! I keep getting this error constantly. I've tried installing on both Python 3.6.8+ and 3.7.2, but the error persists. Any idea what might be the cause?

Traceback (most recent call last):
  File "/home/anthony/anaconda3/envs/py368/lib/python3.6/site-packages/stanfordnlp/models/common/pretrain.py", line 38, in load
    data = torch.load(self.filename, lambda storage, loc: storage)
  File "/home/anthony/anaconda3/envs/py368/lib/python3.6/site-packages/torch/serialization.py", line 368, in load
    return _load(f, map_location, pickle_module)
  File "/home/anthony/anaconda3/envs/py368/lib/python3.6/site-packages/torch/serialization.py", line 542, in _load
    result = unpickler.load()
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anthony/anaconda3/envs/py368/lib/python3.6/site-packages/stanfordnlp/pipeline/core.py", line 93, in __init__
    use_gpu=self.use_gpu)
  File "/home/anthony/anaconda3/envs/py368/lib/python3.6/site-packages/stanfordnlp/pipeline/depparse_processor.py", line 14, in __init__
    self.trainer = Trainer(pretrain=self.pretrain, model_file=config['model_path'], use_cuda=use_gpu)
  File "/home/anthony/anaconda3/envs/py368/lib/python3.6/site-packages/stanfordnlp/models/depparse/trainer.py", line 33, in __init__
    self.load(pretrain, model_file)
  File "/home/anthony/anaconda3/envs/py368/lib/python3.6/site-packages/stanfordnlp/models/depparse/trainer.py", line 107, in load
    self.model = Parser(self.args, self.vocab, emb_matrix=pretrain.emb)
  File "/home/anthony/anaconda3/envs/py368/lib/python3.6/site-packages/stanfordnlp/models/common/pretrain.py", line 32, in emb
    self._vocab, self._emb = self.load()
  File "/home/anthony/anaconda3/envs/py368/lib/python3.6/site-packages/stanfordnlp/models/common/pretrain.py", line 42, in load
    return self.read_and_save()
  File "/home/anthony/anaconda3/envs/py368/lib/python3.6/site-packages/stanfordnlp/models/common/pretrain.py", line 50, in read_and_save
    raise Exception("Vector file is not provided.")
Exception: Vector file is not provided.

Inputting Tokenized data to Depparser

For my application, I have data in a (very large) CoNLL-U file that has already been tokenized and that I would like to parse. How can I (lazily) parse this and write it back to file?

Thank you!

accessing openie triplets

Dear All
I would like to ask how to access the OpenIE triplets from the Python interface.

import stanfordnlp.server as corenlp
text="This is a text."
c=corenlp.CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner','parse','openie'], timeout=60000)
ann = c.annotate(text,annotators=["openie"])

How do I retrieve them from ann?
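
To be concrete, this is the kind of access I'm hoping for (openieTriple is only my guess at the protobuf field name; I haven't verified it):

for sentence in ann.sentence:
    for triple in sentence.openieTriple:  # guessed field name
        print(triple.subject, triple.relation, triple.object)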

Many thanks
valery

Coreference resolution

Hi there!

Could you please tell me if there is any coref model? The classical (non-neural, JVM-based) CoreNLP includes several, but I can't find one here.

About NER for Neural Pipeline

First of all, congratulations! We have been waiting for this for months!!! I'm currently using the CoreNLP server and, among other things, the NER annotators. Are there any plans to bring the current CRF-based named entity recognition in CoreNLP to the new PyTorch neural pipeline?

Using the corenlp client with different languages

I want to access the Arabic and Chinese models from the Java CoreNLP using the CoreNLP client.
The README says that to use the Java models, I should put the models in the "distribution folder".

  1. Is this the same folder as CORENLP_HOME?

  2. How to specify what language model to use when creating the CoreNLP client?

Need some sample code

I tried to use this library, but I couldn't find any sample code for its other functions. The help pages don't have any examples.
Would you please add some?

cannot install the models properly

I have tried to run the proposed command "nlp = stanfordnlp.Pipeline( )".
But the model is too big and it takes a long time to download.
So I tried to download it directly from the website into another folder.
It may sound stupid, but I didn't figure out how to install these models, as I don't want to move them to the default folder.


import stanfordnlp
nlp = stanfordnlp.Pipeline(model_path="D:\python\stanfordnlp_resources\en_ewt_models\en_ewt_tokenizer.pt")
Use device: cpu
Loading: tokenize
With settings:
{'model_path': 'C:\Users\84692/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Cannot load model from C:\Users\84692/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt


So is there any solution to run the models properly?
BTW, can you explain the difference between "stanfordnlp", "Stanford CoreNLP" and "Stanford Parser/POS tagger etc."?
Thank you a lot!!!

Issue with Stanfordnlp 0.1.2

As I am analyzing a large corpus, I concatenated all existing texts as suggested in the description (i.e. with two line breaks between them) and set the parameter tokenize_pretokenized to true. This approach works with stanfordnlp 0.1.1, but since version 0.1.2 I encounter the following error:

Traceback (most recent call last):
  File "some_python_file.py", line 285, in some_function
    doc = nlp(text)
  File "/home/henry/anaconda3/envs/some_env/lib/python3.6/site-packages/stanfordnlp/pipeline/core.py", line 125, in __call__
    self.process(doc)
  File "/home/henry/anaconda3/envs/some_env/lib/python3.6/site-packages/stanfordnlp/pipeline/core.py", line 119, in process
    self.processors[processor_name].process(doc)
  File "/home/henry/anaconda3/envs/some_env/lib/python3.6/site-packages/stanfordnlp/pipeline/pos_processor.py", line 22, in process
    preds += self.trainer.predict(b)
  File "/home/henry/anaconda3/envs/some_env/lib/python3.6/site-packages/stanfordnlp/models/pos/trainer.py", line 71, in predict
    _, preds = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, word_orig_idx, sentlens, wordlens)
  File "/home/henry/anaconda3/envs/some_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/henry/anaconda3/envs/some_env/lib/python3.6/site-packages/stanfordnlp/models/pos/model.py", line 113, in forward
    char_reps = self.charmodel(wordchars, wordchars_mask, word_orig_idx, sentlens, wordlens)
  File "/home/henry/anaconda3/envs/some_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/henry/anaconda3/envs/some_env/lib/python3.6/site-packages/stanfordnlp/models/common/char_model.py", line 28, in forward
    embs = pack_padded_sequence(embs, wordlens, batch_first=True)
  File "/home/henry/anaconda3/envs/some_env/lib/python3.6/site-packages/torch/nn/utils/rnn.py", line 148, in pack_padded_sequence
    return PackedSequence(torch._C._VariableFunctions._pack_padded_sequence(input, lengths, batch_first))
RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0
None

Regarding the requirements, the following package versions are installed:

  • numpy= 1.15.4
  • protobuf= 3.6.1
  • requests= 2.21.0
  • pytorch= 1.0.1
  • tqdm= 4.28.1

and the machine is running Ubuntu 16.04.6.
I thought that maybe there was some empty value between two consecutive blank lines leading to this issue, but that was not the case.
Thank you in advance for looking into this.

XPOS Tagsets Releases

Hi,

Thanks for releasing the library. I am wondering, would it be possible to release or link to the language-specific XPOS tagsets?

'rm' on windows

>>> stanfordnlp.download('en')
Using the default treebank "en_ewt" for language "en".
Would you like to download the models for: en_ewt now? (Y/n)
y

Default download directory: ...
Hit enter to continue or type an alternate directory.


Downloading models for: en_ewt
Download location: .../en_ewt_models.zip
100%|████████████████████████████████████████████████████████████████████████████| 1.96G/1.96G [1:06:17<00:00, 498kB/s]

Download complete.  Models saved to: .../en_ewt_models.zip
Extracting models file for: en_ewt
Cleaning up...'rm' is not recognized as an internal or external command,
operable program or batch file.
Done.

That last line looks like a Linux remnant, which should not happen on Windows.
Python 3.7.2 (64-bit) on Windows 10 (64-bit)

How can I run more processes?

I'm running a simple pipeline of tokenization and POS tagging using a 600 MB text file in Catalan as input. stanfordnlp automatically runs 24 processes and is processing about 1 MB every 10 minutes or so.

I tried to change pos_batch_size (from 10000 to 100000, then from 200000 to 20000, etc.) and tokenize_batch_size (32, 64, 128, then back), but it seems that I'm hitting a bottleneck, because increasing the batch size makes the process slower.

How can I change the number of processes to run?
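
The only related knob I've found so far is torch's own thread count, which I tried setting before building the pipeline, without being sure the pipeline respects it:

import torch
torch.set_num_threads(8)  # attempt to cap the CPU threads used by torch ops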

My system configuration is as follows:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2500.007
CPU max MHz: 2900.0000
CPU min MHz: 1200.0000
BogoMIPS: 4400.80
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

I tried using GPU but it is slower and adjusting the batch_size did not improve the processing time.

NVIDIA-SMI 375.39 Driver Version: 375.39
GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC
Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M.

0 Tesla M40 Off | 0000:04:00.0 Off | 0
N/A 40C P0 63W / 250W | 434MiB / 11443MiB | 0% Default

Processes: GPU Memory
GPU PID Type Process name Usage

0 172198 C python 107MiB
0 172526 C python 107MiB
0 186922 C python 107MiB
0 187828 C python 107MiB

I'm using Python 3.6.8 over an anaconda environment.

How to use a particular language model other than default English?

Hi there,
I am using Python 3.7.2, Anaconda, Windows 10.
Everything works perfectly for me in English. I am trying to use the Arabic model, but it doesn't seem to detect it, even though it's downloaded at this path: C:\Users\mzeid\stanfordnlp_resources\ar_padt_models

Using "extract_pos(arabic_doc)", gives the following results as shown in the screen shot.

How can I fix this and make the library use Arabic instead of the default English?

Thanks

Issue with the Tokenizer?

Hello,

I am using a real-world dataset with the French treebank "fr_sequoia" model, and I am facing an issue: with unusual inputs (PUNCT characters), the processing of the sentence ends before finishing.

With a fictional example:
doc = nlp("La Joconde est un tableau de De Vinci - réalisé entre 1503 et 1506")
doc.sentences[0].print_tokens()

This returns:
<Token index=1;words=[<Word index=1;text=La;lemma=le;upos=DET;xpos=;feats=Definite=Def|Gender=Fem|Number=Sing|PronType=Art;governor=2;dependency_relation=det>]>
<Token index=2;words=[<Word index=2;text=Joconde;lemma=Joconde;upos=PROPN;xpos=
;feats=Gender=Fem|Number=Sing;governor=5;dependency_relation=nsubj>]>
<Token index=3;words=[<Word index=3;text=est;lemma=être;upos=AUX;xpos=;feats=Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin;governor=5;dependency_relation=cop>]>
<Token index=4;words=[<Word index=4;text=un;lemma=un;upos=DET;xpos=
;feats=Definite=Ind|Gender=Masc|Number=Sing|PronType=Art;governor=5;dependency_relation=det>]>
<Token index=5;words=[<Word index=5;text=tableau;lemma=tableau;upos=NOUN;xpos=;feats=Gender=Masc|Number=Sing;governor=0;dependency_relation=root>]>
<Token index=6;words=[<Word index=6;text=de;lemma=de;upos=ADP;xpos=
;feats=;governor=8;dependency_relation=case>]>
<Token index=7;words=[<Word index=7;text=De;lemma=de;upos=ADP;xpos=
;feats=;governor=8;dependency_relation=case>]>
<Token index=8;words=[<Word index=8;text=Vinci;lemma=vinci;upos=PROPN;xpos=
;feats=Gender=Masc|Number=Sing;governor=5;dependency_relation=nmod>]>

Environment:
Python 3.6.3
Torch version: https://download.pytorch.org/whl/cpu/torch-1.0.1.post2-cp36-cp36m-linux_x86_64.whl

Best regards,

question on license of models

Hi,

I've got a question on the license of the models.
The UD treebanks are distributed under different licenses depending on the treebank (e.g. CC-BY-SA / CC-BY-NC-SA / some LGPL / ...).
Under what license do you distribute the models (which essentially allow mimicking the UD databases)? Is it the same license as the UD treebank?

Running tagger and parser with pretokenized files

I tried running the tagger and then the parser with a pretokenized file, calling both from the command line. The parser call was

python -m stanfordnlp.models.parser --eval_file temp.conllu --output_file zh_gsd-pred.conllu --shorthand zh_gsd --mode predict --batch_size 5000

And I got the following error:

Traceback (most recent call last):
  File "/home/erick/.pyenv/versions/3.7.2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/erick/.pyenv/versions/3.7.2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/erick/dev/stanfordnlp/stanfordnlp/models/parser.py", line 245, in <module>
    main()
  File "/home/erick/dev/stanfordnlp/stanfordnlp/models/parser.py", line 97, in main
    evaluate(args)
  File "/home/erick/dev/stanfordnlp/stanfordnlp/models/parser.py", line 223, in evaluate
    batch = DataLoader(args['eval_file'], args['batch_size'], loaded_args, pretrain, vocab=vocab, evaluation=True)
  File "/home/erick/dev/stanfordnlp/stanfordnlp/models/depparse/data.py", line 43, in __init__
    data = self.preprocess(data, self.vocab, self.pretrain_vocab, args)
  File "/home/erick/dev/stanfordnlp/stanfordnlp/models/depparse/data.py", line 84, in preprocess
    processed_sent += [[int(w[5]) for w in sent]]
  File "/home/erick/dev/stanfordnlp/stanfordnlp/models/depparse/data.py", line 84, in <listcomp>
    processed_sent += [[int(w[5]) for w in sent]]
ValueError: invalid literal for int() with base 10: '_'

This happened because the input file had a "_" in the head column and the reader tried to convert it to an int. It's unexpected, though, since the parser was in predict mode. Did I miss something?

Anyway, I could get it to work by changing line 84 in stanfordnlp/models/depparse/data.py to

processed_sent += [[int(w[5]) if w[5] != '_' else 0 for w in sent]]

This is related to #34 but I'm doing everything in the CLI instead of calling the API in Python code. Maybe this could help with the other issue?

Add stanfordnlp recipe to conda-forge.

Thanks for making this available! Please consider adding a feedstock recipe on conda-forge (see here). I, and many of the folks I know who are/will be excited about this release, use conda and conda environments extensively in our research. It tends to be a lot nicer (and more portable) to use the conda package dependency solver for everything (cf. using pip inside of a conda environment).
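If a recipe existed, installation would presumably be the usual one-liner (hypothetical until a feedstock is actually published):

conda install -c conda-forge stanfordnlp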

While someone not on your maintainer team could do it, it's probably a better idea for a maintainer to do it and to add the update process to your release checklist.

Thanks again for the package, and thanks in advance for considering a conda-forge recipe.

cannot download en_ewt_model.zip in running demo/pipeline_demo.py after cloning on my local

Hi,
I'm having trouble with the demo script. After installing stanfordnlp from pip and cloning your repository locally, I cannot run demo/pipeline_demo.py to try your code.

The raised error is below. Could you tell me how to solve this problem?

Downloading models for: en_ewt
Download location: /Users/hayata.yamamoto/stanfordnlp_resources/en_ewt_models.zip
Traceback (most recent call last):
  File "demo/pipeline_demo.py", line 32, in <module>
    stanfordnlp.download(args.lang, args.models_dir, confirm_if_exists=True)
  File "/Users/hayata.yamamoto/anaconda3/envs/analytics/lib/python3.6/site-packages/stanfordnlp/utils/resources.py", line 134, in download
    download_ud_model(default_treebanks[download_label], resource_dir=resource_dir, confirm_if_exists=confirm_if_exists)
  File "/Users/hayata.yamamoto/anaconda3/envs/analytics/lib/python3.6/site-packages/stanfordnlp/utils/resources.py", line 101, in download_ud_model
    with open(download_file_path, 'wb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/hayata.yamamoto/stanfordnlp_resources/en_ewt_models.zip'
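The traceback suggests the target directory does not exist when the download tries to write the zip. A likely workaround (an assumption, not a confirmed fix) is to create the directory first:

import os

# Ensure the default resources directory exists before stanfordnlp.download()
# tries to write en_ewt_models.zip into it
os.makedirs(os.path.expanduser('~/stanfordnlp_resources'), exist_ok=True)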

accessing Constituency Parse

Dear All
I would like to ask how to access the constituency parse (the same as on the http://corenlp.run/ webpage) from the Python interface.

import stanfordnlp.server as corenlp

text = "This is a text."
c = corenlp.CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse'], timeout=60000)
ann = c.annotate(text, annotators=["parse"])

how do I retrieve it from ann?
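Assuming annotate() returns the CoreNLP protobuf Document, the tree should hang off each sentence (a sketch, not verified against every client version):

# each protobuf Sentence carries its constituency tree in the parseTree field
tree = ann.sentence[0].parseTree
print(tree)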

Many thanks
valery

Sentiment Distribution

I've been able to get sentiment analysis working using the CoreNLPClient, but it looks like the standard output doesn't include the sentiment distribution. Is that correct, or am I just missing it? If I switch to the text output format, I can find it.

Minibatching

First, congrats on the release 🎉. It's great to have these models in such an easy-to-use form. I'm looking forward to using them in some tri-training experiments, and I was pleased to see @ines get the spaCy wrapper done so quickly: https://github.com/explosion/spacy-stanfordnlp

One thing that's not great in the wrapper at the moment is that we're not minibatching, so inference is super slow. In spaCy, batched processing is done via the .pipe() method, which takes an iterable of texts and yields Doc objects.

If I understand correctly, StanfordNLP supports batching by concatenating texts together, such as with \n\n. The problem is, after concatenating the texts this way, how can I separate out the documents? I want to do something like this:

def pipe(self, texts, batch_size=32):
    texts = iter(texts)
    while True:
        # islice pulls the next batch_size texts off the iterator
        text_batch = list(itertools.islice(texts, batch_size))
        if not text_batch:
            break
        analyses = stanford_nlp('\n\n'.join(text_batch))
        doc_analyses = unbatch(analyses)  # <-- How do I do this?
        for doc_analysis in doc_analyses:
            yield make_spacy_doc(doc_analysis)

It's the unbatch part that I don't see an obvious solution for. The texts that are coming in could have newline sequences within them, so I can't rely on just looking for \n\n as a control sequence.

Your tokenizer can remove whitespace, but is it guaranteed that non-whitespace characters will be preserved? If so, maybe I can count the non-whitespace characters in the analysis output and use that to find the document boundaries? A sketch of that idea follows.
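One possible shape for that counting approach — a sketch only, assuming the tokenizer preserves every non-whitespace character and the Document.sentences / Sentence.tokens / Token.text accessors; unbatch_by_chars is hypothetical, not a library function:

def unbatch_by_chars(texts, batched_doc):
    """Split a batched Document's sentences back into per-text groups
    by counting non-whitespace characters."""
    # non-whitespace character budget for each original text
    targets = [len(''.join(t.split())) for t in texts]
    groups, current, seen, i = [], [], 0, 0
    for sentence in batched_doc.sentences:
        current.append(sentence)
        seen += sum(len(''.join(tok.text.split())) for tok in sentence.tokens)
        # once this group's budget is spent, close it and start the next text
        if i < len(targets) and seen >= targets[i]:
            groups.append(current)
            current, seen, i = [], 0, i + 1
    if current:
        groups.append(current)
    return groups

This relies on the \n\n separator forcing a sentence break at every document boundary, so each sentence belongs entirely to one input text.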

issue with the output for Simplified Chinese

from stanfordnlp.server import CoreNLPClient

text = '这是个最好的时代,也是一个最坏的时代!'  

properties = {
        # segment
        "tokenize.language": "zh",
        "segment.model": "edu/stanford/nlp/models/segmenter/chinese/ctb.gz",
         ...

with CoreNLPClient(properties=properties, annotators=annotators,timeout=60000, threads=5, memory='4G', be_quiet=False) as client: 
    print('---')
    print('first token of first sentence')
    token = sentence.token[0]
    print(token)
    ...

The output:
first token of first sentence
word: "\350\277\231"
pos: "PN"
value: "\350\277\231"
originalText: "\350\277\231"
ner: "O"
lemma: "\350\277\231"
beginChar: 0
endChar: 1
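This is most likely a display artifact rather than wrong output: protobuf's text printer escapes non-ASCII UTF-8 bytes as octal. Decoding those bytes recovers the expected character:

# \350\277\231 are the octal-escaped UTF-8 bytes of the first character
print(b'\350\277\231'.decode('utf-8'))  # -> '这'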
