
ufal / udpipe

354 stars · 28 watchers · 73 forks · 2.83 MB

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files

License: Mozilla Public License 2.0

HTML 15.46% Shell 1.22% Makefile 0.43% C++ 77.45% C 0.33% Perl 1.71% PHP 1.03% CSS 0.04% C# 0.05% Java 0.05% Python 0.29% XS 0.01% Ragel 1.24% JavaScript 0.05% Dockerfile 0.02% SWIG 0.62%

udpipe's Introduction

UDPipe 1


UDPipe is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. UDPipe is language-agnostic and can be trained given annotated data in CoNLL-U format. Trained models are provided for nearly all UD treebanks. UDPipe is available as a binary for Linux/Windows/OS X, as a library for C++, Python, Perl, Java and C#, and as a web service. A third-party R CRAN package also exists.

UDPipe is free software distributed under the Mozilla Public License 2.0, and the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions. UDPipe is versioned using Semantic Versioning.

Copyright 2017 by Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Czech Republic.

The UDPipe website http://ufal.mff.cuni.cz/udpipe contains download links for both the released packages and the trained models, hosts the documentation, and offers an online web service.

UDPipe development repository http://github.com/ufal/udpipe is hosted on GitHub.

Third-party contribution: Instructions on how to build the UDPipe REST server as a Docker image are available at http://github.com/samisalkosuo/udpipe-rest-server-docker, along with instructions on how to train UDPipe language models using a Docker image.

udpipe's People

Contributors

amirkamran, foxik, martinpopel, samisalkosuo


udpipe's Issues

Hardcoded static library loading code conflicts with custom library resolving

The udpipe_javaJNI class contains a static section that tries to load the udpipe native library from the current working directory or, alternatively, via the system-wide library lookup. I want to load the library manually from a custom location. To prevent the built-in library loading code from failing with an exception, it would be useful to have an option to disable it.

colon vs semi-colon in user manual documentation

In the UDPipe Manual, Section 1.3, there is this sentence:
"If the --tokenize option is supplied, the input is assumed to be plain text and is tokenized using model tokenizer. Additional arguments to the tokenizer might be specified using --tokenizer=data option (which implies --tokenize), where data is a semicolon-separated list of the following options "
I think it is not semicolon-separated: semicolon-separated options throw an error, whereas colon-separated options work fine.

feature_sequences_score assertion

I'm trying to train a POS tagger using UDPipe from data which is already in CoNLL format, but not lemmatized. I'm using custom tagsets as well.

However, when reaching iteration 10, an assertion fails in feature_sequences_optimizer.h.
Here are the logs:

Loading training data: done.
Training the UDPipe model.
Tagger model 1 columns: lemma use=0/provide=0, xpostag use=0/provide=0, feats use=0/provide=0
Creating morphological dictionary for tagger model 1.
Tagger model 1 dictionary options: max_form_analyses=0, custom dictionary_file=none
Tagger model 1 guesser options: suffix_rules=8, prefixes_max=0, prefix_min_count=10, enrich_dictionary=6
Tagger model 1 options: iterations=30, early_stopping=1, templates=tagger
Training tagger model 1.
Iteration 1: done, accuracy 74.91%, heldout accuracy 78.47%t/98.35%l/77.87%b
Iteration 2: done, accuracy 78.81%, heldout accuracy 78.32%t/98.49%l/77.80%b
Iteration 3: done, accuracy 79.94%, heldout accuracy 78.02%t/98.57%l/77.56%b
Iteration 4: done, accuracy 80.45%, heldout accuracy 77.78%t/98.58%l/77.34%b
Iteration 5: done, accuracy 80.74%, heldout accuracy 77.57%t/98.61%l/77.14%b
Iteration 6: done, accuracy 80.92%, heldout accuracy 77.33%t/98.62%l/76.91%b
Iteration 7: done, accuracy 81.06%, heldout accuracy 77.21%t/98.61%l/76.78%b
Iteration 8: done, accuracy 81.14%, heldout accuracy 77.15%t/98.61%l/76.72%b
Iteration 9: done, accuracy 81.18%, heldout accuracy 76.98%t/98.59%l/76.55%b
Iteration 10: done, accuracy 81.19%udpipe: ./morphodita/tagger/feature_sequences_optimizer.h:132: ufal::udpipe::morphodita::feature_sequences_optimizer<FeatureSequences<ElementaryFeatures<ufal::udpipe::morphodita::training_elementary_feature_map>, ufal::udpipe::morphodita::training_feature_sequence_map> >::optimize(const original_feature_sequences&, ufal::udpipe::morphodita::feature_sequences_optimizer<FeatureSequences<ElementaryFeatures<ufal::udpipe::morphodita::training_elementary_feature_map>, ufal::udpipe::morphodita::training_feature_sequence_map> >::optimized_feature_sequences&)::<lambda(ufal::udpipe::utils::binary_encoder&, const ufal::udpipe::morphodita::training_feature_sequence_map::info&)> [with FeatureSequences = ufal::udpipe::morphodita::feature_sequences; ElementaryFeatures = ufal::udpipe::morphodita::conllu_elementary_features]: Assertion `feature_sequence_score(info.gamma) == info.gamma' failed.

Am I doing anything wrong?

address sanitiser issues when using ufal::udpipe::model::load

In the R package that interfaces with udpipe, the CRAN build system (https://www.stats.ox.ac.uk/pub/bdr/memtests/gcc-UBSAN/udpipe/00check.log) reports address sanitiser issues when I use ufal::udpipe::model::load.
The code where this happens is shown below.

// Load language model and return the pointer to be used by udp_tokenise_tag_parse
ufal::udpipe::model *languagemodel;
languagemodel = ufal::udpipe::model::load(file_model);

Loading a model then produces the following issues, reported by UBSAN.
What can be done to fix this?

trying URL 'https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.0/master/inst/udpipe-ud-2.0-170801/dutch-ud-2.0-170801.udpipe'
Content type 'application/octet-stream' length 19992491 bytes (19.1 MB)
==================================================
downloaded 19.1 MB

udpipe.cpp:1545:21: runtime error: load of misaligned address 0x62d00154a401 for type 'uint16_t', which requires 2 byte alignment
0x62d00154a401: note: pointer points here
 00 80 4e  02 32 00 00 01 18 62 00  00 00 3d 00 00 00 cf de  c4 bc d9 d0 5f 3d 68 83  be 3e e8 6c 66
              ^ 
udpipe.cpp:1552:12: runtime error: load of misaligned address 0x62d00154a406 for type 'uint32_t', which requires 4 byte alignment
0x62d00154a406: note: pointer points here
 00 00 01 18 62 00  00 00 3d 00 00 00 cf de  c4 bc d9 d0 5f 3d 68 83  be 3e e8 6c 66 3e f8 ee  8c be
             ^ 
udpipe.cpp:1545:21: runtime error: load of misaligned address 0x7f0772cf8a5f for type 'uint16_t', which requires 2 byte alignment
0x7f0772cf8a5f: note: pointer points here
 21 00 01 00 01  00 03 01 22 00 01 00 02  00 00 10 43 64 22 2d 6e  75 6d 6d 65 72 70 6c 61  74 65 6e
             ^ 
udpipe.cpp:4023:11: runtime error: store to misaligned address 0x61600053ee81 for type 'uint16_t', which requires 2 byte alignment
0x61600053ee81: note: pointer points here
 00 00 70  21 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00
              ^ 
udpipe.cpp:4024:11: runtime error: store to misaligned address 0x61600053ee83 for type 'uint32_t', which requires 4 byte alignment
0x61600053ee83: note: pointer points here
 70  21 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00
              ^ 
udpipe.cpp:4028:11: runtime error: store to misaligned address 0x61600053e883 for type 'uint32_t', which requires 4 byte alignment
0x61600053e883: note: pointer points here
 19  21 00 01 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00
              ^ 
udpipe.cpp:4030:11: runtime error: store to misaligned address 0x627000263dfd for type 'uint16_t', which requires 2 byte alignment
0x627000263dfd: note: pointer points here
 21 00 00 03 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00
             ^ 
udpipe.cpp:3235:21: runtime error: load of misaligned address 0x6110002c2801 for type 'uint16_t', which requires 2 byte alignment
0x6110002c2801: note: pointer points here
 00 80 30  43 01 00 c9 00 00 00 04  00 0f 00 10 00 11 00 08  00 4c 02 00 11 01 47 04  00 00 02 00 04
              ^ 
udpipe.cpp:4062:14: runtime error: load of misaligned address 0x6110002c2805 for type 'volatile const uint16_t', which requires 2 byte alignment
0x6110002c2805: note: pointer points here
 01 00 c9 00 00 00 04  00 0f 00 10 00 11 00 08  00 4c 02 00 11 01 47 04  00 00 02 00 04 00 6d 00  9b
             ^ 
udpipe.cpp:4064:47: runtime error: load of misaligned address 0x6110002c2807 for type 'volatile const uint16_t', which requires 2 byte alignment
0x6110002c2807: note: pointer points here
 c9 00 00 00 04  00 0f 00 10 00 11 00 08  00 4c 02 00 11 01 47 04  00 00 02 00 04 00 6d 00  9b 00 a5
             ^ 
udpipe.cpp:4064:47: runtime error: load of misaligned address 0x6110002c2805 for type 'volatile const uint16_t', which requires 2 byte alignment
0x6110002c2805: note: pointer points here
 01 00 c9 00 00 00 04  00 0f 00 10 00 11 00 08  00 4c 02 00 11 01 47 04  00 00 02 00 04 00 6d 00  9b
             ^ 
udpipe.cpp:4068:14: runtime error: load of misaligned address 0x6110002c2805 for type 'volatile const uint16_t', which requires 2 byte alignment
0x6110002c2805: note: pointer points here
 01 00 c9 00 00 00 04  00 0f 00 10 00 11 00 08  00 4c 02 00 11 01 47 04  00 00 02 00 04 00 6d 00  9b
             ^ 
udpipe.cpp:4070:26: runtime error: load of misaligned address 0x6110002c2803 for type 'const uint16_t', which requires 2 byte alignment
0x6110002c2803: note: pointer points here
 30  43 01 00 c9 00 00 00 04  00 0f 00 10 00 11 00 08  00 4c 02 00 11 01 47 04  00 00 02 00 04 00 6d
              ^ 
udpipe.cpp:4072:44: runtime error: load of misaligned address 0x6110002c2807 for type 'volatile const uint16_t', which requires 2 byte alignment
0x6110002c2807: note: pointer points here
 c9 00 00 00 04  00 0f 00 10 00 11 00 08  00 4c 02 00 11 01 47 04  00 00 02 00 04 00 6d 00  9b 00 a5
             ^ 
udpipe.cpp:4072:44: runtime error: load of misaligned address 0x6110002c2805 for type 'volatile const uint16_t', which requires 2 byte alignment
0x6110002c2805: note: pointer points here
 01 00 c9 00 00 00 04  00 0f 00 10 00 11 00 08  00 4c 02 00 11 01 47 04  00 00 02 00 04 00 6d 00  9b
             ^ 
udpipe.cpp:4073:43: runtime error: load of misaligned address 0x6110002c2803 for type 'const uint16_t', which requires 2 byte alignment
0x6110002c2803: note: pointer points here
 30  43 01 00 c9 00 00 00 04  00 0f 00 10 00 11 00 08  00 4c 02 00 11 01 47 04  00 00 02 00 04 00 6d
              ^ 
udpipe.cpp:3338:11: runtime error: null pointer passed as argument 1, which is declared to never be null
udpipe.cpp:3235:21: runtime error: load of misaligned address 0x61f0000a443f for type 'uint16_t', which requires 2 byte alignment
0x61f0000a443f: note: pointer points here
 00 02 49 6b 9b  05 5c 0b 00 00 02 69 6b  f8 0b 67 0b 00 00 02 41  6c bc 05 9e 0b 00 00 02  49 6c f9
             ^ 
udpipe.cpp:3241:12: runtime error: load of misaligned address 0x61f0000a4441 for type 'uint32_t', which requires 4 byte alignment
0x61f0000a4441: note: pointer points here
 49 6b 9b  05 5c 0b 00 00 02 69 6b  f8 0b 67 0b 00 00 02 41  6c bc 05 9e 0b 00 00 02  49 6c f9 00 88
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x6160000c98cf for type 'const unsigned int', which requires 4 byte alignment
0x6160000c98cf: note: pointer points here
 50 65 72 69 36  00 00 00 7e 58 7e 50 72  65 70 5f 41 64 76 7e 43  61 73 65 3d 4e 6f 6d 7c  44 65 67
             ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x61100005e56f for type 'const unsigned int', which requires 4 byte alignment
0x61100005e56f: note: pointer points here
 1b 00 00 3f 24  00 00 00 41 71 04 00 00  42 71 05 00 00 43 7b 39  00 00 45 86 0c 00 00 47  e5 34 00
             ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x61100009236f for type 'const unsigned int', which requires 4 byte alignment
0x61100009236f: note: pointer points here
 1f 00 00 3f 22  00 00 00 41 39 05 00 00  42 6f 06 00 00 43 f4 35  00 00 44 b2 4b 00 00 45  a7 0d 00
             ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x61300005dfef for type 'const unsigned int', which requires 4 byte alignment
0x61300005dfef: note: pointer points here
 00 00 00 3f 50  00 00 00 41 1a 00 00 00  42 1e 00 00 00 43 2b 00  00 00 44 16 00 00 00 45  19 00 00
             ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x6110000906aa for type 'const unsigned int', which requires 4 byte alignment
0x6110000906aa: note: pointer points here
 03 00  00 3f 3d 03 00 00 41 42  01 00 00 42 f1 01 00 00  43 19 03 00 00 44 35 03  00 00 45 90 01 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x61100007dd2a for type 'const unsigned int', which requires 4 byte alignment
0x61100007dd2a: note: pointer points here
 0d 00  00 3f 70 0c 00 00 41 f2  02 00 00 42 ea 05 00 00  43 f7 0b 00 00 44 47 0c  00 00 45 6e 04 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x61100007d5aa for type 'const unsigned int', which requires 4 byte alignment
0x61100007d5aa: note: pointer points here
 16 00  00 3f 72 19 00 00 41 fe  02 00 00 42 c0 08 00 00  43 27 16 00 00 44 18 17  00 00 45 82 05 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x6110000a456a for type 'const unsigned int', which requires 4 byte alignment
0x6110000a456a: note: pointer points here
 22 00  00 3f 45 25 00 00 41 e1  01 00 00 42 9a 09 00 00  43 43 20 00 00 44 2b 22  00 00 45 41 05 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x6110000a41aa for type 'const unsigned int', which requires 4 byte alignment
0x6110000a41aa: note: pointer points here
 2c 00  00 3f 1c 26 00 00 41 91  01 00 00 42 72 07 00 00  43 1f 2a 00 00 44 61 28  00 00 45 4e 04 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x6110000a366a for type 'const unsigned int', which requires 4 byte alignment
0x6110000a366a: note: pointer points here
 30 00  00 3f 3c 2a 00 00 41 06  01 00 00 42 81 06 00 00  43 2a 2e 00 00 44 42 2d  00 00 45 42 03 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x6110000a33ea for type 'const unsigned int', which requires 4 byte alignment
0x6110000a33ea: note: pointer points here
 35 00  00 3f 31 34 00 00 41 9f  00 00 00 42 c9 06 00 00  43 64 2e 00 00 44 58 30  00 00 45 a4 02 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x6110000a302a for type 'const unsigned int', which requires 4 byte alignment
0x6110000a302a: note: pointer points here
 36 00  00 3f 79 35 00 00 41 76  00 00 00 42 d3 05 00 00  43 b7 2f 00 00 44 15 32  00 00 45 5c 02 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x61300006feef for type 'const unsigned int', which requires 4 byte alignment
0x61300006feef: note: pointer points here
 00 00 00 3f 48  00 00 00 41 21 00 00 00  42 32 00 00 00 43 36 00  00 00 44 31 00 00 00 45  1e 00 00
             ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x6110000a2b2a for type 'const unsigned int', which requires 4 byte alignment
0x6110000a2b2a: note: pointer points here
 02 00  00 3f c4 02 00 00 41 86  00 00 00 42 39 01 00 00  43 f0 02 00 00 44 d2 02  00 00 45 ec 00 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x6110000a24ea for type 'const unsigned int', which requires 4 byte alignment
0x6110000a24ea: note: pointer points here
 09 00  00 3f b6 08 00 00 41 12  01 00 00 42 53 03 00 00  43 5a 08 00 00 44 00 09  00 00 45 4f 02 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x6110000a226a for type 'const unsigned int', which requires 4 byte alignment
0x6110000a226a: note: pointer points here
 13 00  00 3f 93 11 00 00 41 0f  01 00 00 42 d9 05 00 00  43 2c 14 00 00 44 fc 12  00 00 45 45 03 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x6110000a172a for type 'const unsigned int', which requires 4 byte alignment
0x6110000a172a: note: pointer points here
 1f 00  00 3f b7 1d 00 00 41 11  01 00 00 42 cf 06 00 00  43 a4 1d 00 00 44 3f 1d  00 00 45 a4 04 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x6110000a14aa for type 'const unsigned int', which requires 4 byte alignment
0x6110000a14aa: note: pointer points here
 28 00  00 3f 0b 2b 00 00 41 96  00 00 00 42 b0 05 00 00  43 70 28 00 00 44 28 26  00 00 45 4a 03 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x61100005fbea for type 'const unsigned int', which requires 4 byte alignment
0x61100005fbea: note: pointer points here
 30 00  00 3f fd 2c 00 00 41 51  00 00 00 42 52 05 00 00  43 e7 30 00 00 44 54 2b  00 00 45 04 03 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x61100005f96a for type 'const unsigned int', which requires 4 byte alignment
0x61100005f96a: note: pointer points here
 33 00  00 3f d8 35 00 00 41 4a  00 00 00 42 df 04 00 00  43 d4 35 00 00 44 b2 2e  00 00 45 65 02 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x61100005f6ea for type 'const unsigned int', which requires 4 byte alignment
0x61100005f6ea: note: pointer points here
 37 00  00 3f 4a 37 00 00 41 47  00 00 00 42 ea 04 00 00  43 3d 30 00 00 44 67 32  00 00 45 1f 02 00
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x6060000da90a for type 'const unsigned int', which requires 4 byte alignment
0x6060000da90a: note: pointer points here
 00 41  55 58 08 00 00 00 41 44  56 07 00 00 00 44 45 54  0c 00 00 00 4e 55 4d 0b  00 00 00 53 59 4d
              ^ 
udpipe.cpp:9157:15: runtime error: load of misaligned address 0x60300000430a for type 'const unsigned int', which requires 4 byte alignment
0x60300000430a: note: pointer points here
 00 4e  6f 6d 02 00 00 00 44 61  74 04 00 00 00 62 00 00  00 00 00 00 00 00 00 00  02 00 00 00 ff ff
              ^ 
udpipe.cpp:3344:26: runtime error: load of misaligned address 0x6150002fd993 for type 'const uint16_t', which requires 2 byte alignment
0x6150002fd993: note: pointer points here
 00  4c 61 6e 67 65 00 be be  be be be be be be be be  b0 d9 2f 00 50 61 00 00  06 00 00 00 00 00 00
              ^ 
udpipe.cpp:9292:15: runtime error: load of misaligned address 0x62a000060232 for type 'const int', which requires 4 byte alignment
0x62a000060232: note: pointer points here
 57 00  0a 01 14 c9 10 00 0b 01  0e 50 03 00 0c 01 e9 ce  ff ff 0d 01 d5 a7 07 00  0e 01 d6 4e 02 00
              ^ 
udpipe.cpp:3344:26: runtime error: load of misaligned address 0x6150002fd831 for type 'const uint16_t', which requires 2 byte alignment
0x6150002fd831: note: pointer points here
 00 00 00  62 65 6e 00 be be be be  be be be be be be be be  50 d8 2f 00 50 61 00 00  02 00 00 00 00
              ^ 
udpipe.cpp:4094:5: runtime error: load of misaligned address 0x6220000fb14d for type 'uint16_t', which requires 2 byte alignment
0x6220000fb14d: note: pointer points here
 00 05 00 6e 26 00 7f  00 a5 00 d1 00 0c 01 92  01 98 01 cb 01 4b 02 4d  02 66 02 68 02 6f 02 87  02
             ^ 
udpipe.cpp:4095:16: runtime error: load of misaligned address 0x6220000fb14d for type 'uint16_t', which requires 2 byte alignment
0x6220000fb14d: note: pointer points here
 00 05 00 6e 26 00 7f  00 a5 00 d1 00 0c 01 92  01 98 01 cb 01 4b 02 4d  02 66 02 68 02 6f 02 87  02
             ^ 
/usr/include/c++/7/bits/predefined_ops.h:65:22: runtime error: load of misaligned address 0x6220000fb175 for type 'short unsigned int', which requires 2 byte alignment
0x6220000fb175: note: pointer points here
 e3 02 18 03 61 03 8c  03 bb 03 fa 03 00 04 1d  04 25 04 34 04 94 04 a5  04 be 04 31 05 39 05 41  05
             ^ 
udpipe.cpp:4105:80: runtime error: load of misaligned address 0x6220000fb14f for type 'uint16_t', which requires 2 byte alignment
0x6220000fb14f: note: pointer points here
 00 6e 26 00 7f  00 a5 00 d1 00 0c 01 92  01 98 01 cb 01 4b 02 4d  02 66 02 68 02 6f 02 87  02 8f 02
             ^ 
udpipe.cpp:8619:51: runtime error: load of misaligned address 0x7f077f390407 for type 'const uint16_t', which requires 2 byte alignment
0x7f077f390407: note: pointer points here
 00 00 00 01 00  00 05 68 65 72 65 6e 00  00 00 01 00 00 00 00 03  70 65 72 00 01 04 00 00  00 00 00
             ^ 
udpipe.cpp:3235:21: runtime error: load of misaligned address 0x62400102efff for type 'uint16_t', which requires 2 byte alignment
0x62400102efff: note: pointer points here
 00 61 6c 2d 01  00 72 04 00 00 01 00 01  00 65 65 6e 03 00 4d 04  da 04 04 05 00 00 02 00  03 00 05
             ^ 
udpipe.cpp:3235:21: runtime error: load of misaligned address 0x62400102f005 for type 'uint16_t', which requires 2 byte alignment
0x62400102f005: note: pointer points here
 72 04 00 00 01 00 01  00 65 65 6e 03 00 4d 04  da 04 04 05 00 00 02 00  03 00 05 00 0a 00 03 00  03
             ^ 
udpipe.cpp:4112:27: runtime error: load of misaligned address 0x6220000fb915 for type 'uint16_t', which requires 2 byte alignment
0x6220000fb915: note: pointer points here
 81 00 82 00 83 00 85  00 86 00 87 00 89 00 8b  00 8c 00 8d 00 8e 00 8f  00 90 00 91 00 93 00 94  00
             ^ 
udpipe.cpp:4112:81: runtime error: load of misaligned address 0x6220000fb917 for type 'uint16_t', which requires 2 byte alignment
0x6220000fb917: note: pointer points here
 82 00 83 00 85  00 86 00 87 00 89 00 8b  00 8c 00 8d 00 8e 00 8f  00 90 00 91 00 93 00 94  00 95 00
             ^ 
udpipe.cpp:4113:34: runtime error: load of misaligned address 0x6220000fbb53 for type 'uint16_t', which requires 2 byte alignment
0x6220000fbb53: note: pointer points here
 03  00 03 00 0a 00 03 00 03  00 03 00 0a 00 03 00 0a  00 03 00 03 00 03 00 03  00 03 00 03 00 03 00
              ^ 
udpipe.cpp:4112:81: runtime error: load of misaligned address 0x6220000fb917 for type 'uint16_t', which requires 2 byte alignment
0x6220000fb917: note: pointer points here
 82 00 83 00 85  00 86 00 87 00 89 00 8b  00 8c 00 8d 00 8e 00 8f  00 90 00 91 00 93 00 94  00 95 00
             ^ 
trying URL 'https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.0/master/inst/udpipe-ud-2.0-170801/sanskrit-ud-2.0-170801.udpipe'
Content type 'application/octet-stream' length 2107925 bytes (2.0 MB)
==================================================
downloaded 2.0 MB

udpipe.cpp:3235:21: runtime error: load of misaligned address 0x6230000fd547 for type 'uint16_t', which requires 2 byte alignment
0x6230000fd547: note: pointer points here
 a5 e0 a8 a4 01  00 00 95 a4 e0 b5 a4 e0  80 a5 e0 9c a4 01 00 00  83 a4 e0 a4 a4 e0 bf a4  e0 b5 a4
             ^ 
udpipe.cpp:3235:21: runtime error: load of misaligned address 0x61a00038ddc5 for type 'uint16_t', which requires 2 byte alignment
0x61a00038ddc5: note: pointer points here
 a4 e0 a6 a4 01 00 00  be a4 e0 a8 a4 01 00 00  80 a5 e0 a4 a4 01 00 00  87 a5 e0 95 a4 01 00 00  83

FYI. The udpipe code is at https://raw.githubusercontent.com/bnosac/udpipe/master/src/udpipe.cpp

pre-trained models for UD 2.0 don't work

First, congratulations on this amazing work, and thanks for sharing it with us. Below I describe the issue I am facing.

When I run the pre-trained models from udpipe-ud-1.2-160523, it works fine. Execution starts after the command:

./udpipe --tokenize --tag --parse english-ud-1.2-160523.udpipe en-ud-dev.conllu

But when I try to run a pre-trained model from udpipe-ud-2.0-conll17-170315, I get an error:

./udpipe --tokenize --tag --parse english-ud-2.0-conll17-170315.udpipe en-ud-dev.conllu
Loading UDPipe model: Cannot load UDPipe model 'english-ud-2.0-conll17-170315.udpipe'!

Note that I have confirmed that the english-ud-2.0-conll17-170315.udpipe file is in the folder:

UDPipe/git/udpipe/src$ ls
common.h Makefile.builtem sentence english-partut-ud-2.0-conll17-170315.udpipe Makefile.include tokenizer english-ud-1.2-160523.udpipe model trainer english-ud-2.0-conll17-170315.udpipe model.output udpipe en-ud-dev.conllu morphodita udpipe.cpp en-ud-dev.txt parsito unilib libudpipe.a pt-ud-train.conllu utils Makefile rest_server version

Any ideas?

Is the parser using computed tags during training?

Thanks for a great tool!

I have a question regarding the source of PoS tags during parser training with UDPipe. Running udpipe --accuracy --tag --parse reports LAS/UAS both for computed and for gold tags during testing, but it is not clear to me whether the parser uses computed or gold tags during training.

I had always assumed that the parser gets trained on gold tags, until I recently tried training the parser on data with computed tags (while re-using a tagger from an existing model with the option --tagger=from_model=.. and also re-using the same pre-computed form embeddings) and the results became identical to when training on the original data with gold tags.

To try to shed more light on this, I then trained yet another parser on the original gold-tag data, specifying --tagger=none. Running udpipe --accuracy --parse with this model gives different LAS/UAS scores (as reported for "Parsing from gold tokenization with gold tags") than when the UDPipe model also contains a tagger.

In sum, this seems to indicate that computed tags are used during parser training. Any clarification on this issue would be very welcome.

Strange annotation view

Hi everyone,

I'm sorry if the following problem has already been discussed or if it's happened just due to a mistake of mine. However, I don't know how to fix it and maybe it'd be useful for somebody else.

I tried to train my own model on the SynTagRus corpus (a full pipeline from tokenization to parsing) using a morphological dictionary.
This dictionary follows the proposed format; see an excerpt:
[screenshot: dictionary excerpt, 2018-01-10]

The model was trained and it shows satisfying results. However, sometimes the output annotation looks strange, see:

[screenshot: output annotation, 2018-01-10]

So for some words there are only five columns instead of ten, their order is confused, and some letters have disappeared.

As far as I can tell, this happens with words appended to the dictionary built during model training. Has anyone dealt with this?

Thanks in advance!

problem with spaces in form column in training tagger

$ cat ~/source/apertium/languages/apertium-kaz/texts/puupankki/puupankki.kaz.conllu kk-ud-dev.conllu | ~/source/udpipe/src/udpipe --tokenizer epochs=5 --train kaz2.udpipe
Loading training data: done.
Training the UDPipe model.
Epoch 1, logprob: -6.6085e+04, training acc: 96.42%
Epoch 2, logprob: -1.1740e+04, training acc: 99.03%
Epoch 3, logprob: -6.4117e+03, training acc: 99.53%
Epoch 4, logprob: -4.9325e+03, training acc: 99.65%
Epoch 5, logprob: -3.9941e+03, training acc: 99.71%
Creating morphological dictionary for tagger model 1.
An error occurred during model training: Cannot parse replacement rule '  ған емес ' in statistical guesser file!

The offending sentence is:

# sent_id = akorda-random.tagged.txt:209:3751
# text = - Біздің елдеріміз арасында ешқашан ешқандай да қайшылықтар болған емес.
1       -       -       PUNCT   guio    _       8       punct   _       _
2       Біздің  біз     PRON    prn     Case=Gen|Number=Plur|Person=1|PronType=Prs      3       nmod:poss       _       _
3       елдеріміз       ел      NOUN    n       Case=Nom|Number=Plur|Number[psor]=Plur|Person[psor]=1   4       nmod:poss       _       _
4       арасында        ара     NOUN    n       Case=Loc|Number[psor]=Plur,Sing|Person[psor]=3  8       obl     _       _
5       ешқашан ешқашан ADV     adv     _       8       advmod  _       _
6-7     ешқандай да     _       _       _       _       _       _       _       _
6       ешқандай        ешқандай        DET     det     PronType=Neg    8       det     _       _
7       да      да      ADV     postadv _       8       advmod  _       _
8       қайшылықтар     қайшылық        NOUN    n       Case=Nom|Number=Plur    0       root    _       _
9       болған емес     бол     AUX     v       Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Fin      8       cop     _       SpaceAfter=No
10      .       .       PUNCT   sent    _       8       punct   _       _

It seems to be a problem with the spaces in the FORM column, but this should be valid CoNLL-U.

Allow using dictionary_file= to pass dictionary

At the moment embedding_form_file=FILENAME is used to pass embeddings, but to pass a dictionary you need dictionary_file=file:FILENAME. This is inconsistent and could confuse users, especially since no error message is thrown. Could we accept embedding_form_file=file:FILENAME or dictionary_file=FILENAME, so the two are consistent?

Tokenization without segmentation

Is it possible to run just the tokenizer without the segmenter? Of course, if the sentence gets divided into more segments I can merge them (calling addWord() on the first Ufal::UDPipe::Sentence segment to add the words from the other segments), but it is extra work, especially if I also want to handle multiword tokens.

Compilation

Hi, I compiled udpipe with g++ (gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12)) and swig (SWIG Version 3.0.8) on Ubuntu 16.10 and it seems to fail when loading models with a segmentation fault:

./udpipe --tokenize --tag --parse ../../../models/en.model.output ../../../test_en.txt
Loading UDPipe model: Segmentation fault (core dumped)

The model was trained on a different computer with g++ (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)) and swig (SWIG Version 3.0.2) and Ubuntu 16.04.2 LTS.

I can neither load pre-trained models nor train new ones.

Do you have an idea what could be the issue?

Segfault from training UD_Finnish 2

I just tried training UD_Finnish from UD version 2 out of the box, and I am not able to get any result other than a segfault. I have tried sed-ding the spaces away to work around issue #21, but this had no effect. I tried training both with ./udpipe --train UD_Finnish-2.0.udpipe fi-ud-train.conllu as well as with
cat fi-ud-train.conllu | ./udpipe --train fi-ud-2.0.udpipe, but it made no difference either. It always ends in:

Epoch 99, logprob: -2.4581e+03, training acc: 99.84%
Epoch 100, logprob: -2.4962e+03, training acc: 99.84%
Creating morphological dictionary for tagger model 1.
Training tagger model 1.
Speicherzugriffsfehler (Speicherabzug geschrieben)  [German: "Segmentation fault (core dumped)"]

make bindings/csharp error

If I run make in bindings/csharp, I receive the following error message:

swig -O -c++ -outcurrentdir -csharp -namespace Ufal.UDPipe -outdir Ufal\UDPipe -o udpipe_csharp.cpp udpipe_csharp.i
process_begin: CreateProcess(NULL, swig -O -c++ -outcurrentdir -csharp -namespace Ufal.UDPipe -outdir Ufal\UDPipe -o udpipe_csharp.cpp udpipe_csharp.i, ...) failed.
make (e=2): The system cannot find the file specified.
mingw32-make: *** [Makefile:25: udpipe_csharp.cpp] Error 2

udpipe_server

I am trying to start the server with the command:

udpipe_server 8080 /path-to-the-model/udmodel.udpipe

but I get the message "Cannot load specified models!"

What am I missing?

Building parser models using GPUs

Is it possible to speed up the model building (for the parser) in UDPipe using GPU clusters? I can't seem to find anything that mentions this in the UDPipe or Parsito manuals.

Way to represent different tokenisations in dictionary format

In addition to being able to use a morphological analyser/dictionary during inference (#50) it would be cool if the dictionary format allowed for multi-word surface tokens, for example:

15-17   arreglándotelo  _       _       _       _       _       _       _       SpaceAfter=No
15      arreglándo      arreglar        VERB    _       VerbForm=Ger    12      acl     _       _
16      te      tú      PRON    _       Case=Acc,Dat|Number=Sing|Person=2|PrepCase=Npr|PronType=Prs     15      iobj    _       _
17      lo      él      PRON    _       Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs     15      obj     _       _

At the moment this does not seem to be possible in the dictionary format:

dictionary_file (default empty): use a given custom morphological dictionary, where each line contains 5 tab-separated fields FORM, LEMMA, UPOSTAG, XPOSTAG and FEATS. Note that this dictionary data is appended to the dictionary created from the UD training data, not replacing it.
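As a quick illustration of that 5-column format, a tiny custom dictionary file could be produced like this (the entries are made-up English samples, not from any real treebank):

```python
# Each line: FORM, LEMMA, UPOSTAG, XPOSTAG, FEATS, separated by tabs.
entries = [
    ("dogs", "dog", "NOUN", "NNS", "Number=Plur"),
    ("ran",  "run", "VERB", "VBD", "Mood=Ind|Tense=Past|VerbForm=Fin"),
]
with open("custom.dict", "w", encoding="utf-8") as f:
    for row in entries:
        f.write("\t".join(row) + "\n")
```

Per the option-naming issue reported earlier above, the resulting file apparently needs the file: prefix when passed to the tagger, i.e. dictionary_file=file:custom.dict.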

Windows python install

Hello,

I have tried to install ufal.udpipe both by running "pip install ufal.udpipe" and by downloading the PyPI package and running the setup, and I'm getting C++ errors on install in both cases.

[image: C++ compiler error output]

Here's an image with the error I'm getting. Any help would be appreciated.

UDPipe on Intel Xeon

I tried installing UDPipe on Intel Xeon machines on the Azure and Google Cloud platforms. The install seems to complete successfully, but running the code doesn't work. I tried installing with pip as well as building from the sources with MODE=release and MODE=debug, both in current master and in the "stable" branch.

When installing with pip, UDPipe crashes with a segmentation fault when used from Python. When building from sources and using the built binary, I get the error "Could not load model".

The g++ compiler version is 5.4.0 on Ubuntu 16.04, and I am using pretrained ud-1.2 models. I also tried to train my own models, with the same results.

Has anybody tried building UDPipe on Xeon with any luck?

how to prevent sentence detection during tokenization?

Hi,
I was not able to find any option to turn off sentence segmentation during tokenization. My data needs to keep the same number of sentences after tokenization, but UDPipe splits some of the sentences into several.
Thanks

[OS X] pip install, fatal error: 'atomic' file not found

I just tried to install this using pip install ufal.udpipe, but I got the following error:

    building 'ufal_udpipe' extension
    creating build/temp.macosx-10.6-x86_64-3.5
    creating build/temp.macosx-10.6-x86_64-3.5/udpipe
    /usr/bin/clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Applications/anaconda/include -arch x86_64 -Iudpipe/include -I/Applications/anaconda/include/python3.5m -c udpipe/udpipe.cpp -o build/temp.macosx-10.6-x86_64-3.5/udpipe/udpipe.o -std=c++11 -fvisibility=hidden -w
    udpipe/udpipe.cpp:7:10: fatal error: 'atomic' file not found
    #include <atomic>
             ^
    1 error generated.
    error: command '/usr/bin/clang' failed with exit status 1

    ----------------------------------------
Command "/Applications/anaconda/bin/python -u -c "import setuptools, tokenize;__file__='/private/tmp/pip-build-qx1s775a/ufal.udpipe/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-03jnjoj8-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/tmp/pip-build-qx1s775a/ufal.udpipe/

Stack Overflow suggests that this is due to a compiler flag, but I do not know how to set it when installing with pip.

Compilation on Windows

Hello,

When I compile the latest release on Windows x64 (gcc 6.3.0 from MinGW-w64) and launch it, the program shows the help and then crashes, without any further error message!
However, the downloaded pre-compiled binary (1.0 release) works perfectly.

Have you ever had this kind of problem ?

method to plug in a morphological dictionary

As pretty much everybody who opens an issue here, I want to thank you for such a great (out-of-the-box great!) tool.

The UDPipe section of the UD tools page says it is possible to provide a morphological dictionary. Could you please describe how that can be done? I have a lexicon in form-lemma-upos-ufeat format.

problem with byte encoding in udpipe from git

When I try to train a tagger using the latest English data, I get a strange error from UDPipe:

$ cat *.conllu | udpipe --train --tokenize --tagger --parser=no english.udpipe
Loading training data: done.
Training the UDPipe model.
Training tokenizer with the following options: tokenize_url=1, allow_spaces=0, dimension=24
  epochs=100, batch_size=50, learning_rate=0.0050, dropout=0.1000, early_stopping=0
Epoch 1, logprob: -6.1698e+04, training acc: 95.88%
Epoch 2, logprob: -1.7173e+04, training acc: 98.92%
Epoch 3, logprob: -1.4062e+04, training acc: 99.05%
Epoch 4, logprob: -1.2385e+04, training acc: 99.16%
Epoch 5, logprob: -1.1286e+04, training acc: 99.24%
Epoch 6, logprob: -1.1481e+04, training acc: 99.21%
Epoch 7, logprob: -1.0287e+04, training acc: 99.30%
Epoch 8, logprob: -1.0279e+04, training acc: 99.29%
Epoch 9, logprob: -9.8852e+03, training acc: 99.32%
Epoch 10, logprob: -9.6343e+03, training acc: 99.34%
Epoch 11, logprob: -9.4930e+03, training acc: 99.35%
Epoch 12, logprob: -9.2315e+03, training acc: 99.37%
Epoch 13, logprob: -9.2220e+03, training acc: 99.38%
Epoch 14, logprob: -8.8526e+03, training acc: 99.39%
Epoch 15, logprob: -8.7573e+03, training acc: 99.40%
Epoch 16, logprob: -8.8190e+03, training acc: 99.41%
Epoch 17, logprob: -8.8209e+03, training acc: 99.39%
Epoch 18, logprob: -8.3526e+03, training acc: 99.42%
Epoch 19, logprob: -8.3097e+03, training acc: 99.43%
Epoch 20, logprob: -8.7686e+03, training acc: 99.41%
Epoch 21, logprob: -8.5230e+03, training acc: 99.42%
Epoch 22, logprob: -8.0554e+03, training acc: 99.44%
Epoch 23, logprob: -8.0775e+03, training acc: 99.45%
Epoch 24, logprob: -8.4924e+03, training acc: 99.42%
Epoch 25, logprob: -8.2039e+03, training acc: 99.45%
Epoch 26, logprob: -7.9598e+03, training acc: 99.46%
Epoch 27, logprob: -7.9808e+03, training acc: 99.46%
Epoch 28, logprob: -8.0371e+03, training acc: 99.46%
Epoch 29, logprob: -7.9295e+03, training acc: 99.47%
Epoch 30, logprob: -7.5110e+03, training acc: 99.47%
Epoch 31, logprob: -7.9097e+03, training acc: 99.47%
Epoch 32, logprob: -7.8456e+03, training acc: 99.48%
Epoch 33, logprob: -7.9043e+03, training acc: 99.46%
Epoch 34, logprob: -7.7426e+03, training acc: 99.48%
Epoch 35, logprob: -7.6989e+03, training acc: 99.47%
Epoch 36, logprob: -7.7118e+03, training acc: 99.47%
Epoch 37, logprob: -7.8382e+03, training acc: 99.47%
Epoch 38, logprob: -7.6632e+03, training acc: 99.47%
Epoch 39, logprob: -7.6765e+03, training acc: 99.49%
Epoch 40, logprob: -7.7373e+03, training acc: 99.48%
Epoch 41, logprob: -7.5058e+03, training acc: 99.50%
Epoch 42, logprob: -7.4203e+03, training acc: 99.50%
Epoch 43, logprob: -7.2875e+03, training acc: 99.51%
Epoch 44, logprob: -7.5939e+03, training acc: 99.48%
Epoch 45, logprob: -7.4016e+03, training acc: 99.50%
Epoch 46, logprob: -7.3488e+03, training acc: 99.51%
Epoch 47, logprob: -7.3759e+03, training acc: 99.49%
Epoch 48, logprob: -7.7003e+03, training acc: 99.49%
Epoch 49, logprob: -7.1461e+03, training acc: 99.51%
Epoch 50, logprob: -7.4844e+03, training acc: 99.48%
Epoch 51, logprob: -7.4017e+03, training acc: 99.50%
Epoch 52, logprob: -7.3334e+03, training acc: 99.49%
Epoch 53, logprob: -7.1444e+03, training acc: 99.50%
Epoch 54, logprob: -7.2387e+03, training acc: 99.51%
Epoch 55, logprob: -7.1217e+03, training acc: 99.51%
Epoch 56, logprob: -7.4385e+03, training acc: 99.51%
Epoch 57, logprob: -7.1386e+03, training acc: 99.50%
Epoch 58, logprob: -7.1672e+03, training acc: 99.50%
Epoch 59, logprob: -7.3106e+03, training acc: 99.52%
Epoch 60, logprob: -7.1694e+03, training acc: 99.50%
Epoch 61, logprob: -7.1212e+03, training acc: 99.52%
Epoch 62, logprob: -7.0805e+03, training acc: 99.52%
Epoch 63, logprob: -7.0900e+03, training acc: 99.51%
Epoch 64, logprob: -7.2829e+03, training acc: 99.50%
Epoch 65, logprob: -6.8592e+03, training acc: 99.52%
Epoch 66, logprob: -7.2357e+03, training acc: 99.51%
Epoch 67, logprob: -7.1893e+03, training acc: 99.51%
Epoch 68, logprob: -7.2612e+03, training acc: 99.51%
Epoch 69, logprob: -7.0492e+03, training acc: 99.52%
Epoch 70, logprob: -7.2061e+03, training acc: 99.50%
Epoch 71, logprob: -7.0483e+03, training acc: 99.52%
Epoch 72, logprob: -6.9997e+03, training acc: 99.52%
Epoch 73, logprob: -7.1702e+03, training acc: 99.51%
Epoch 74, logprob: -6.9724e+03, training acc: 99.52%
Epoch 75, logprob: -7.2270e+03, training acc: 99.50%
Epoch 76, logprob: -7.0296e+03, training acc: 99.51%
Epoch 77, logprob: -6.9355e+03, training acc: 99.53%
Epoch 78, logprob: -7.1586e+03, training acc: 99.51%
Epoch 79, logprob: -7.0209e+03, training acc: 99.53%
Epoch 80, logprob: -6.9683e+03, training acc: 99.52%
Epoch 81, logprob: -7.1498e+03, training acc: 99.52%
Epoch 82, logprob: -7.2023e+03, training acc: 99.52%
Epoch 83, logprob: -6.8345e+03, training acc: 99.53%
Epoch 84, logprob: -7.1528e+03, training acc: 99.51%
Epoch 85, logprob: -6.6544e+03, training acc: 99.54%
Epoch 86, logprob: -6.9870e+03, training acc: 99.52%
Epoch 87, logprob: -6.9638e+03, training acc: 99.51%
Epoch 88, logprob: -6.9834e+03, training acc: 99.53%
Epoch 89, logprob: -6.5750e+03, training acc: 99.56%
Epoch 90, logprob: -6.9301e+03, training acc: 99.52%
Epoch 91, logprob: -7.0809e+03, training acc: 99.52%
Epoch 92, logprob: -6.9539e+03, training acc: 99.52%
Epoch 93, logprob: -7.1273e+03, training acc: 99.52%
Epoch 94, logprob: -7.0223e+03, training acc: 99.51%
Epoch 95, logprob: -6.8614e+03, training acc: 99.53%
Epoch 96, logprob: -6.8142e+03, training acc: 99.54%
Epoch 97, logprob: -6.9596e+03, training acc: 99.52%
Epoch 98, logprob: -6.8749e+03, training acc: 99.53%
Epoch 99, logprob: -7.0501e+03, training acc: 99.51%
Epoch 100, logprob: -7.1078e+03, training acc: 99.52%
Tagger model 1 columns: lemma use=1/provide=1, xpostag use=1/provide=1, feats use=1/provide=1
Creating morphological dictionary for tagger model 1.
Tagger model 1 dictionary options: max_form_analyses=0, custom dictionary_file=none
Tagger model 1 guesser options: suffix_rules=8, prefixes_max=4, prefix_min_count=10, enrich_dictionary=6
An error occurred during model training: Should encode value 338 in one byte!

I saw in the old issues that this might be fixed by lowering the value of guesser_enrich_dictionary, but perhaps the error message should suggest this to the user? Also, am I right in thinking that this command should work?

$ cat *.conllu | udpipe --train --tokenizer=no --tagger=guesser_enrich_dictionary=3 --parser=no english.udpipe

Documentation examples or validation of command line arguments

It would be handy to have some examples of how to use the training parameters in the documentation, or some kind of validation of command line arguments.

I am perhaps not the typical use case, but I have been trying to change the parameters for a while, and there is no indication which ones are valid or invalid:

$ udpipe --tokenizer none --tagger none --train tr-ud-train.0.udpipe < UD_Turkish/tr-ud-train.conllu 
$ udpipe --tokenizer none --tagger none --parser swap --train tr-ud-train.1.udpipe < UD_Turkish/tr-ud-train.conllu 
$ udpipe --tokenizer none --tagger none --parser "structured_interval=8" --train tr-ud-train.2.udpipe < UD_Turkish/tr-ud-train.conllu 

These all seem to produce the same output model. I'm sure I'm missing something or doing something wrong, but I can't work out what it is from the documentation. :)

Pretrained word embeddings yield lower scores

First of all, thanks a lot for making this tool available. I am using it in my MSc thesis on parsing Norwegian with neural-net parsers and word embeddings.

In my experiments I have first run the parser on my dependency treebank without pretrained word embeddings, in order to establish a baseline. I have then trained word2vec embeddings on the Norwegian Newspaper Corpus with the parameters specified in gen.sh. In other parsers, such as Dyer et al.'s LSTM parser, this yields a significant increase in both LAS and UAS. In UDPipe, however, my results drop by approximately 0.5 LAS and 0.1 UAS. I find this rather curious.

Another thing I find odd is that UDPipe reports slightly lower scores when run in accuracy mode than the scores reported by the CoNLL Shared Task's eval.pl script.

The parameters I use are
./udpipe --train --parser=embedding_form_file=embedding.vec --parser=embedding_form=50 model.output training.conll
./udpipe --parse --input=conllu --output=conllu --outfile=result.conll model.output dev.conll
./udpipe --accuracy --parse model.output dev.conll
perl eval.pl -s result.conll -g dev.conll > eval_result.txt

I suspect that I am doing something wrong, but I simply cannot figure out what. I would be grateful if you could point me in the right direction.

cannot start udpipe_server

./udpipe_server 8080 /root/udpipe-ud-2.0-170801/spanish-ud-2.0-170801.udpipe
Cannot load specified models!

There are any default directory to models?

Exponent input in double parsing code

We're trying to train UDPipe with the Facebook Wikipedia embeddings but are getting an odd error, something to do with parsing floats with an exponent.

command: udpipe --tokenizer none --parser "embedding_form_file=wiki.et.vec"  --train $2 < $1

Loading training data: done.
Training the UDPipe model.
Tagger model 1 columns: lemma use=1/provide=1, xpostag use=1/provide=1, feats use=1/provide=1
Creating morphological dictionary for tagger model 1.
Tagger model 1 dictionary options: max_form_analyses=0, custom dictionary data given=false
Tagger model 1 guesser options: suffix_rules=8, prefixes_max=4, prefix_min_count=10, enrich_dictionary=6
Tagger model 1 options: iterations=20, early_stopping=0, templates=tagger
Training tagger model 1.
Iteration 1: done, accuracy 69.84%
Iteration 2: done, accuracy 90.85%
Iteration 3: done, accuracy 95.14%
Iteration 4: done, accuracy 96.69%
Iteration 5: done, accuracy 97.76%
Iteration 6: done, accuracy 98.55%
Iteration 7: done, accuracy 98.70%
Iteration 8: done, accuracy 98.89%
Iteration 9: done, accuracy 98.91%
Iteration 10: done, accuracy 99.00%
Iteration 11: done, accuracy 99.09%
Iteration 12: done, accuracy 99.25%
Iteration 13: done, accuracy 99.20%
Iteration 14: done, accuracy 99.16%
Iteration 15: done, accuracy 99.37%
Iteration 16: done, accuracy 99.30%
Iteration 17: done, accuracy 99.39%
Iteration 18: done, accuracy 99.44%
Iteration 19: done, accuracy 99.49%
Iteration 20: done, accuracy 99.50%
Parser transition options: system=projective, oracle=dynamic, structured_interval=8, single_root=1
Parser uses lemmas/upos/xpos/feats: automatically generated by tagger
Parser embeddings options: upostag=20, feats=20, xpostag=0, form=50, lemma=0, deprel=20
Parser network options: iterations=10, hidden_layer=200, batch_size=10,
  learning_rate=0.0200, learning_rate_final=0.0010, l2=0.5000, early_stopping=0
Initialized 'universal_tag' embedding with 0,16 words and 0.0%,100.0% coverage.
Initialized 'feats' embedding with 0,367 words and 0.0%,100.0% coverage.
Cannot parse embedding weight double value '-2.2318e-05': non-digit character found.
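For what it's worth, the rejected token is ordinary scientific notation; standard number parsers accept it, which suggests the failure lies in the embedding-weight parser rather than in the file itself. A minimal check:

```python
# The exact token from the error message parses fine with Python's float()
# (and likewise with C's strtod), exponent and all.
weight = float("-2.2318e-05")
print(weight)  # a small negative number, not a parse error
```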

Need help on training syntax

Hi everyone. I'm trying to train a model using UDPipe, and I issue the following command:

../udpipe \
--train \
'modell.udpipe' \
< 'train.conllu' \
--tokenizer 'epochs=60;early_stopping=1;allow_spaces=1' \
--tagger 'templates=tagger' \
--parser 'single_root=0;iterations=20;early_stopping=1;embedding_form_file=fa.word2vec' \
--heldout dev.conllu

Tokenization finishes without any problem. But when it comes to the tagger, it just occupies a huge amount of RAM and does nothing, no matter what I do.
What is my problem? Is it the syntax of the command? Should I provide something else?

Thanks!

p.s.
The word2vec model was made using gensim, though I'm not sure that is the problem.
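One thing worth double-checking in this situation is that the embeddings were exported in the word2vec *text* format (a "count dimension" header followed by one word and its vector components per line), not gensim's native or binary format. A sketch of that layout, with dummy two-dimensional vectors:

```python
# Dummy vectors only; with gensim one would typically export via
# model.wv.save_word2vec_format("fa.word2vec", binary=False) instead.
vectors = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}
dimension = 2
with open("fa.word2vec", "w", encoding="utf-8") as f:
    f.write("%d %d\n" % (len(vectors), dimension))
    for word, vec in vectors.items():
        f.write(word + " " + " ".join("%g" % x for x in vec) + "\n")
```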

weird output for Arabic sentences

Hi,
Thanks for developing this useful tool.
Recently, I have been trying to use UDPipe on Arabic documents. I have tried the pre-trained model for UD 1.2 and I also trained a model on the UD 1.2 treebank myself. But the POS tags returned by the tool are always "X", and as a result the dependency parser returns wrong output as well. Could you help me with this problem? Thanks!
[image: udpipe_arabic output]

error: expected primary-expression before ‘enum’ in tokenizer/multiword_splitter.cpp

Hi,

The CRAN build farm which is now checking on the udpipe R package (https://cran.r-project.org/web/checks/check_results_udpipe.html) gives an error when building on Solaris:
https://www.r-project.org/nosvn/R.check/r-patched-solaris-x86/udpipe-00install.html

This looks like a syntax error in tokenizer/multiword_splitter.cpp
How can this be fixed in ufal/udpipe?

* installing to library ‘/home/ripley/R/Lib32’
* installing *source* package ‘udpipe’ ...
** package ‘udpipe’ successfully unpacked and MD5 sums checked
** libs
/opt/csw//bin/g++ -std=gnu++11 -I/home/ripley/R/gcc/include -DNDEBUG  -I"/home/ripley/R/Lib32/Rcpp/include" -I/opt/csw/include -I/usr/local/include   -fPIC  -O2 -c RcppExports.cpp -o RcppExports.o
/opt/csw//bin/g++ -std=gnu++11 -I/home/ripley/R/gcc/include -DNDEBUG  -I"/home/ripley/R/Lib32/Rcpp/include" -I/opt/csw/include -I/usr/local/include   -fPIC  -O2 -c rcpp_udpipe.cpp -o rcpp_udpipe.o
/opt/csw//bin/g++ -std=gnu++11 -I/home/ripley/R/gcc/include -DNDEBUG  -I"/home/ripley/R/Lib32/Rcpp/include" -I/opt/csw/include -I/usr/local/include   -fPIC  -O2 -c udpipe.cpp -o udpipe.o
udpipe.cpp: In member function ‘void ufal::udpipe::multiword_splitter::append_token(ufal::udpipe::utils::string_piece, ufal::udpipe::utils::string_piece, ufal::udpipe::sentence&) const’:
udpipe.cpp:19769:3: error: expected primary-expression before ‘enum’
   enum { UC_FIRST, UC_ALL, OTHER } casing = OTHER;
   ^
udpipe.cpp:19769:36: error: ‘casing’ was not declared in this scope
   enum { UC_FIRST, UC_ALL, OTHER } casing = OTHER;
                                    ^
udpipe.cpp:19769:45: error: ‘OTHER’ was not declared in this scope
   enum { UC_FIRST, UC_ALL, OTHER } casing = OTHER;
                                             ^
*** Error code 1
make: Fatal error: Command failed for target `udpipe.o'
Current working directory /tmp/RtmpyRaOvb/R.INSTALL2ad75272872/udpipe/src
ERROR: compilation failed for package ‘udpipe’
* removing ‘/home/ripley/R/Lib32/udpipe’

real       40.8
user       37.6
sys         2.4

Minimum working example for Python

This looks like a great tool, but I am struggling to get it working from Python. Could you provide a minimal working example?

I imagine something like:

from ufal.udpipe import Tokenizer

data = open("mydata")
tokenize = Tokenizer()
new_data = tokenize(data)
print(new_data[:3])
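For reference, a hedged sketch of what a minimal example might look like with the actual ufal.udpipe bindings, which expose Model and Pipeline rather than a standalone Tokenizer (the model filename below is a placeholder, and this assumes the package is installed):

```python
try:
    from ufal.udpipe import Model, Pipeline, ProcessingError
except ImportError:  # package not installed; sketch is illustrative only
    Model = None

def annotate(model_path, text):
    """Tokenize, tag and parse `text`, returning CoNLL-U output (a sketch)."""
    model = Model.load(model_path)  # returns None on failure
    if not model:
        raise RuntimeError("cannot load model: " + model_path)
    error = ProcessingError()
    pipeline = Pipeline(model, "tokenize",
                        Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
    conllu = pipeline.process(text, error)
    if error.occurred():
        raise RuntimeError(error.message)
    return conllu

# annotate("english-ud.udpipe", "Hello world.")  # hypothetical model file
```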

Deploy udpipe.jar to Maven Central

It would be great if the Java part of the UDPipe bindings were available on Maven Central.

It would be even better if the native libraries were also available as JARs, e.g. by building and packaging them using JavaCPP. I've seen it used e.g. in DL4J, though I have not used it myself. It seems like a nice way of packaging natively compiled libraries along with their Java interfaces in JARs and distributing them via Maven.

question on nl-lassysmall

I'm trying to reproduce the models released with ud-2.0 using the parameters defined at https://github.com/ufal/udpipe/tree/master/training/models-ud-2.0
In particular, I have a question on nl-lassysmall. The model does not output xpos tags (I checked this on the R side, but also in the web service at http://lindat.mff.cuni.cz/services/udpipe/).
![knipsel](https://user-images.githubusercontent.com/1710810/34611828-65eba812-f227-11e7-9d9e-0fb9b7251fd2.PNG)
Whereas if I train the model from R with the following parameters using the latest conllu files (version 2.1), I do get the xpos tags.
I checked whether the xpos tags were present in version 2.0 (March 1) and they were not available, which probably explains it.
However, I would like to understand the number reported in http://dx.doi.org/10.18653/v1/K17-3009 for xpos on dutch-lassysmall (99.9%). Can you explain where it comes from?
Basically, I am interested in understanding whether anything further was done to produce the models released at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2364, beyond the code at https://github.com/ufal/udpipe/tree/master/training (resplitting data, computing embeddings and training the model with the provided parameters).

library(udpipe)
m <- udpipe_train(file = "dev/nl-lassysmall-token-tag-parse.udpipe", 
                  files_conllu_training = "dev/nl_lassysmall-ud-train.conllu", 
                  files_conllu_holdout = "dev/nl_lassysmall-ud-test.conllu",
                  annotation_tokenizer = list(tokenize_url = 1, allow_spaces = 0,
                                              dimension = 64, epochs = 100, initialization_range = 0.1,
                                              batch_size = 100, learning_rate = 0.005, dropout = 0.1, early_stopping = 1),
                  annotation_tagger = list(models = 2, iterations = 20,
                                           ## Settings for the UPOS/XPOS/FEATS tagger
                                           templates_1 = "tagger", 
                                           guesser_suffix_rules_1 = 10, guesser_enrich_dictionary_1 = 6, guesser_prefixes_max_1 = 1,
                                           use_lemma_1 = 1, use_xpostag_1 = 1, use_feats_1 = 1, 
                                           provide_lemma_1 = 0, provide_xpostag_1 = 1, provide_feats_1 = 1, prune_features_1 = 0,
                                           ## Settings for the Lemmatizer
                                           templates_2 = "lemmatizer",
                                           guesser_suffix_rules_2 = 8, guesser_enrich_dictionary_2 = 5, guesser_prefixes_max_2 = 4,
                                           use_lemma_2 = 1, use_xpostag_2 = 0, use_feats_2 = 0, 
                                           provide_lemma_2 = 1, provide_xpostag_2 = 0, provide_feats_2 = 0, prune_features_2 = 1), 
                  annotation_parser = list(iterations = 30, 
                                           embedding_upostag = 20, embedding_feats = 20, embedding_xpostag = 0, 
                                           embedding_form = 50, embedding_form_file = "dev/ud-2.0-embeddings/nl_lassysmall.skip.forms.50.vectors", 
                                           embedding_lemma = 0, embedding_deprel = 20, 
                                           learning_rate = 0.01, learning_rate_final = 0.001, l2 = 0.3, hidden_layer = 200, 
                                           batch_size = 10, transition_system = "swap", transition_oracle = "static_lazy", 
                                           structured_interval = 8))
m <- udpipe_load_model("dev/nl-lassysmall-token-tag-parse.udpipe")
as.data.frame(udpipe_annotate(m, "Hoe komt dit, er zijn geen xpostags hierbij."))

   doc_id paragraph_id sentence_id                                     sentence token_id    token   lemma  upos                                 xpos
1    doc1            1           1 Hoe komt dit, er zijn geen xpostags hierbij.        1      Hoe     hoe   ADV                                   BW
2    doc1            1           1 Hoe komt dit, er zijn geen xpostags hierbij.        2     komt   komen  VERB                      WW|pv|tgw|met-t
3    doc1            1           1 Hoe komt dit, er zijn geen xpostags hierbij.        3      dit     dit  PRON         VNW|aanw|pron|stan|vol|3o|ev
4    doc1            1           1 Hoe komt dit, er zijn geen xpostags hierbij.        4        ,       , PUNCT                                  LET
5    doc1            1           1 Hoe komt dit, er zijn geen xpostags hierbij.        5       er      er   ADV   VNW|aanw|adv-pron|stan|red|3|getal
6    doc1            1           1 Hoe komt dit, er zijn geen xpostags hierbij.        6     zijn    zijn   AUX                         WW|pv|tgw|mv
7    doc1            1           1 Hoe komt dit, er zijn geen xpostags hierbij.        7     geen    geen   DET VNW|onbep|det|stan|prenom|zonder|agr
8    doc1            1           1 Hoe komt dit, er zijn geen xpostags hierbij.        8 xpostags xpostag  NOUN                     N|soort|mv|basis
9    doc1            1           1 Hoe komt dit, er zijn geen xpostags hierbij.        9  hierbij hierbij   ADV                                   BW
10   doc1            1           1 Hoe komt dit, er zijn geen xpostags hierbij.       10        .       . PUNCT                                  LET

Idea: Allow for custom entity wordlist

I've been playing around with UDPipe and it's simply the best NLP library I've seen so far. Great work! :)

The only problems I've seen are related to miscategorized PROPN tags. I see how these are the hardest ones to get right, given that they follow very few rules and are often multi-token. Given that detecting PROPN is essentially named-entity recognition, and NER is often tackled with big lists of entities (gazetteers), I think that's something UDPipe could do too.

Mind you, I'm not saying that you should bundle these lists yourself. Instead you could let the developer who uses your library (and who knows the domain it's going to be used in) point to a text file with entities. You could then look up whether a word is an entity or not, and use that when deciding whether something should be a PROPN. It could just be another feature, nothing more than that.

Is this a good idea? Would it solve the problems I'm seeing with missing PROPN tags?

bin-win64 contains a 32-bit DLL (udpipe_csharp.dll)

The current release contains a 32-bit version of \bin-win64\csharp\udpipe_csharp.dll
A 64-bit version would allow a C# assembly to run in 64-bit mode (which has some advantages - e.g. larger memory limit).

Bad performance on swedish

Hi. I'm switching an old project that parses Swedish sentences from a custom parser to UDPipe. But when I compare the results for simple sentences, I get pretty bad results.

I'm using this example sentence to show the difference: "Hitta ordklass i svensk text". I've added exclamation marks where results are incorrect.

Swe word | Eng trans  | upos   | features
Hitta    | Find       | VERB   | Mood=Imp❗️, VerbForm=Fin, Voice=Act
ordklass | word class | PRON❗️ | Case=Acc, Definite=Def, Gender=Com, Number=Plur❗️
i        | in         | ADP    | -
svensk   | Swedish    | ADJ    | Case=Nom, Definite=Ind, Degree=Pos, Gender=Com, Number=Sing
text     | text       | NOUN   | Case=Nom, Definite=Ind, Gender=Com, Number=Sing

The other tagger handles all of the above examples correctly. Is it because the architecture is different (a structured perceptron using greedy search for decoding), or because it uses a larger corpus?

method to plug in a morphological guesser

UDPipe 1.1 made it possible to use a morphological dictionary, which dramatically improves accuracy. To improve even further, is there an "official" way to plug in a custom rule-based morphological guesser (which itself uses the dictionary internally to provide possible interpretations at runtime)?
Both external and compile-in solutions are interesting.

Documentation tidying

The sections 3.2. License and 3.3. Platforms and Requirements at http://ufal.mff.cuni.cz/udpipe are redundant, because most of their sentences already appear earlier in the document.

The almost-empty sections 2.1. Online Demo and 2.2. Web Service are visually strange.
I would suggest a single section "2. Online Demo and Web Service"
and adding a few more sentences, e.g.:
The service is freely available for testing / non-commercial...
REST API...

I was afraid to fix this myself, as there may be reasons behind the duplicity and/or links to those sections from outside.

Feel free to dismiss this issue.

Word does not contain offsets

I have trouble anchoring the words produced by the tokenizer to the original text, because Word does not seem to contain character-offset information. Could this be added?
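As a stopgap until offsets are exposed, they can be reconstructed by scanning the original text for each form in order (this assumes the tokenizer does not alter the surface forms; with any normalization it would need to be more careful):

```python
def token_offsets(text, forms):
    """Return (start, end) character offsets for each form, found
    left-to-right in the original text."""
    offsets, pos = [], 0
    for form in forms:
        start = text.find(form, pos)
        if start < 0:
            raise ValueError("form %r not found after offset %d" % (form, pos))
        offsets.append((start, start + len(form)))
        pos = start + len(form)
    return offsets
```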

error on Solaris

Hi,

As the CRAN build farm gives an error when building on Solaris, I'm trying to fix the build errors so that the package gets accepted by CRAN.
When I build the package on Solaris (which is configured as described at https://www.stats.ox.ac.uk/pub/bdr/Rconfig/r-patched-solaris-x86) with the fix and command explained in issue #43, I still get errors.
I've put these below.

This looks like udpipe.cpp uses math.h rather than cmath.
Looking for a solution, I found http://kevinushey.github.io/blog/2016/09/14/pitfalls-in-writing-portable-c++98-code/, which pretty much matches the issues that arise when building this on Solaris.

Standard Library Headers

The following code may fail to compile on Solaris:

#include <cstring>
size_t string_length(const char* string) {
  return ::strlen(string);
}

kevin@soularis:~/scratch
$ CC -library=stlport4 string_length.cpp 
"string_length.cpp", line 3: Error: size_t is not defined.
"string_length.cpp", line 4: Error: strlen is not defined.
2 Error(s) detected.

The C++ standard library headers that ‘wrap’ their C counterparts are typically prefixed with a c and contain no extension, e.g. <cstring>; while the C headers themselves are typically given a .h extension, e.g. <string.h>. When the <cstring> header is included in a translation unit, the C++98 standard dictates that the compiler:

    Must define its members (e.g. strlen) in the std:: namespace, and
    May define its members (e.g. strlen) in the global namespace.

In fact, gcc and clang both accept the above code, but the Solaris compilers do not. (The Solaris compilers do not populate the global namespace when including these headers.)

    Rule: If you include a C++-style standard library header, reference symbols from the std namespace. Prefer using C++-style standard library headers over the original C counterpart. Prefer referencing these symbols explicitly, with the std:: prefix.

How can this be fixed in ufal/udpipe?
The following shows the error from building the package:

/opt/csw/bin/g++ -std=gnu++11 -I/opt/R/R-3.4.1-patched-gcc/lib/R/include -DNDEBUG  -I"/export/home/X8MNlSX/R/Rcpp/include" -I/opt/csw/include -I/usr/local/include   -fPIC  -O2 -c RcppExports.cpp -o RcppExports.o
In file included from /opt/csw/include/c++/5.2.0/cmath:44:0,
                 from /export/home/X8MNlSX/R/Rcpp/include/Rcpp/platform/compiler.h:100,
                 from /export/home/X8MNlSX/R/Rcpp/include/Rcpp/r/headers.h:48,
                 from /export/home/X8MNlSX/R/Rcpp/include/RcppCommon.h:29,
                 from /export/home/X8MNlSX/R/Rcpp/include/Rcpp.h:27,
                 from RcppExports.cpp:4:
/usr/include/math.h:45:12: error: ‘std::float_t’ has not been declared
 using std::float_t;
            ^
/usr/include/math.h:46:12: error: ‘std::double_t’ has not been declared
 using std::double_t;
            ^
/usr/include/math.h:48:12: error: ‘std::fpclassify’ has not been declared
 using std::fpclassify;
            ^
/usr/include/math.h:49:12: error: ‘std::isfinite’ has not been declared
 using std::isfinite;
            ^
/usr/include/math.h:50:12: error: ‘std::isinf’ has not been declared
 using std::isinf;
            ^
/usr/include/math.h:51:12: error: ‘std::isnan’ has not been declared
 using std::isnan;
            ^
/usr/include/math.h:52:12: error: ‘std::isnormal’ has not been declared
 using std::isnormal;
            ^
/usr/include/math.h:53:12: error: ‘std::signbit’ has not been declared
 using std::signbit;
            ^
/usr/include/math.h:55:12: error: ‘std::isgreater’ has not been declared
 using std::isgreater;
            ^
/usr/include/math.h:56:12: error: ‘std::isgreaterequal’ has not been declared
 using std::isgreaterequal;
            ^
/usr/include/math.h:57:12: error: ‘std::isless’ has not been declared
 using std::isless;
            ^
/usr/include/math.h:58:12: error: ‘std::islessequal’ has not been declared
 using std::islessequal;
            ^
/usr/include/math.h:59:12: error: ‘std::islessgreater’ has not been declared
 using std::islessgreater;
            ^
/usr/include/math.h:60:12: error: ‘std::isunordered’ has not been declared
 using std::isunordered;
            ^
/usr/include/math.h:62:12: error: ‘std::acosh’ has not been declared
 using std::acosh;
            ^
/usr/include/math.h:63:12: error: ‘std::asinh’ has not been declared
 using std::asinh;
            ^
/usr/include/math.h:64:12: error: ‘std::atanh’ has not been declared
 using std::atanh;
            ^
/usr/include/math.h:65:12: error: ‘std::cbrt’ has not been declared
 using std::cbrt;
            ^
/usr/include/math.h:66:12: error: ‘std::copysign’ has not been declared
 using std::copysign;
            ^
/usr/include/math.h:67:12: error: ‘std::erf’ has not been declared
 using std::erf;
            ^
/usr/include/math.h:68:12: error: ‘std::erfc’ has not been declared
 using std::erfc;
            ^
/usr/include/math.h:69:12: error: ‘std::exp2’ has not been declared
 using std::exp2;
            ^
/usr/include/math.h:70:12: error: ‘std::expm1’ has not been declared
 using std::expm1;
            ^
/usr/include/math.h:71:12: error: ‘std::fdim’ has not been declared
 using std::fdim;
            ^
/usr/include/math.h:72:12: error: ‘std::fma’ has not been declared
 using std::fma;
            ^
/usr/include/math.h:73:12: error: ‘std::fmax’ has not been declared
 using std::fmax;
            ^
/usr/include/math.h:74:12: error: ‘std::fmin’ has not been declared
 using std::fmin;
            ^
/usr/include/math.h:75:12: error: ‘std::hypot’ has not been declared
 using std::hypot;
            ^
/usr/include/math.h:76:12: error: ‘std::ilogb’ has not been declared
 using std::ilogb;
            ^
/usr/include/math.h:77:12: error: ‘std::lgamma’ has not been declared
 using std::lgamma;
            ^
/usr/include/math.h:78:12: error: ‘std::llrint’ has not been declared
 using std::llrint;
            ^
/usr/include/math.h:79:12: error: ‘std::llround’ has not been declared
 using std::llround;
            ^
/usr/include/math.h:80:12: error: ‘std::log1p’ has not been declared
 using std::log1p;
            ^
/usr/include/math.h:81:12: error: ‘std::log2’ has not been declared
 using std::log2;
            ^
/usr/include/math.h:82:12: error: ‘std::logb’ has not been declared
 using std::logb;
            ^
/usr/include/math.h:83:12: error: ‘std::lrint’ has not been declared
 using std::lrint;
            ^
/usr/include/math.h:84:12: error: ‘std::lround’ has not been declared
 using std::lround;
            ^
/usr/include/math.h:85:12: error: ‘std::nan’ has not been declared
 using std::nan;
            ^
/usr/include/math.h:86:12: error: ‘std::nanf’ has not been declared
 using std::nanf;
            ^
/usr/include/math.h:87:12: error: ‘std::nanl’ has not been declared
 using std::nanl;
            ^
/usr/include/math.h:88:12: error: ‘std::nearbyint’ has not been declared
 using std::nearbyint;
            ^
/usr/include/math.h:89:12: error: ‘std::nextafter’ has not been declared
 using std::nextafter;
            ^
/usr/include/math.h:90:12: error: ‘std::nexttoward’ has not been declared
 using std::nexttoward;
            ^
/usr/include/math.h:91:12: error: ‘std::remainder’ has not been declared
 using std::remainder;
            ^
/usr/include/math.h:92:12: error: ‘std::remquo’ has not been declared
 using std::remquo;
            ^
/usr/include/math.h:93:12: error: ‘std::rint’ has not been declared
 using std::rint;
            ^
/usr/include/math.h:94:12: error: ‘std::round’ has not been declared
 using std::round;
            ^
/usr/include/math.h:95:12: error: ‘std::scalbln’ has not been declared
 using std::scalbln;
            ^
/usr/include/math.h:96:12: error: ‘std::scalbn’ has not been declared
 using std::scalbn;
            ^
/usr/include/math.h:97:12: error: ‘std::tgamma’ has not been declared
 using std::tgamma;
            ^
/usr/include/math.h:98:12: error: ‘std::trunc’ has not been declared
 using std::trunc;
            ^
gmake: *** [/opt/R/R-3.4.1-patched-gcc/lib/R/etc/Makeconf:168: RcppExports.o] Error 1

Differences in accuracies between MacOS and Linux binaries

There seem to be substantial differences between the POS tagging output on macOS and Linux. Has anyone else experienced this? I am attaching screenshots from the version 1.2 binaries, but the issue persisted even when I compiled from the source repo, and with the version 1.1 binaries.

[Screenshots: udpipe-1.2-linux, udpipe-1.2-macos]
