Comments (6)
Hi Uli,
ParZu should preserve the input order - if it doesn't, this is a bug.
What might happen is that the sentences get processed out-of-order, but
they should be put together in the right order again by the wrapper
script (multiprocessed_parsing.py).
I think what you want to achieve should be easiest with
one-sentence-per-line input (which is already supported), and CoNLL
output format, where sentences are delimited by empty lines. I regularly
post-process the CoNLL format into some one-sentence-per-line
representation, e.g. for SMT (
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/wrappers/conll2mosesxml.py
), and it should be easy to do something like that for your purposes,
and then copy the additional annotations from input to output (if
they're both one sentence per line).
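
The post-processing described above could look roughly like the following sketch. It is not part of ParZu: the function names, the tab-separated "documentId, sentenceId, sentence" input layout, and the word@@POS@@lemma@@morphology output layout are assumptions made for illustration. It also assumes tab-separated CoNLL rows with the column order seen in ParZu's output (id, form, lemma, coarse tag, tag, morphology, head, relation).

```python
import io


def read_conll_sentences(stream):
    """Yield one list of column lists per sentence; an empty line ends a sentence."""
    rows = []
    for line in stream:
        line = line.rstrip("\n")
        if line.strip():
            rows.append(line.split("\t"))
        elif rows:
            yield rows
            rows = []
    if rows:  # last sentence without a trailing empty line
        yield rows


def attach_annotations(annotated_input, conll_output):
    """Copy the leading annotations of each input line onto its parse.

    Assumes (hypothetically) that each input line is
    'documentId<TAB>sentenceId<TAB>sentence' and that the n-th CoNLL
    sentence is the parse of the n-th input line.
    """
    lines = [line.rstrip("\n") for line in annotated_input]
    sentences = list(read_conll_sentences(conll_output))
    if len(lines) != len(sentences):
        raise ValueError("%d input lines but %d parsed sentences"
                         % (len(lines), len(sentences)))
    merged = []
    for line, rows in zip(lines, sentences):
        doc_id, sent_id, _sentence = line.split("\t", 2)
        # columns: 1 = form, 4 = tag, 2 = lemma, 5 = morphology
        tokens = " ".join("@@".join((r[1], r[4], r[2], r[5])) for r in rows)
        merged.append("\t".join((doc_id, sent_id, tokens)))
    return merged


if __name__ == "__main__":
    annotated = io.StringIO("0\t0\tDer Arzt\n")
    parsed = io.StringIO(
        "1\tDer\tder\tART\tART\tDef|Masc|Nom|Sg\t2\tdet\t_\t_\n"
        "2\tArzt\tArzt\tN\tNN\tMasc|Nom|Sg\t0\troot\t_\t_\n\n")
    for out_line in attach_annotations(annotated, parsed):
        print(out_line)
```

The length check is deliberate: if the parser ever splits or merges a sentence, the mismatch is reported instead of silently shifting annotations onto the wrong parse.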
If you do find that ParZu mixes up the order, can you give me a
reproducible example of this? Do you also observe it if you start ParZu
singlethreaded ("-p 1")?
best wishes,
Rico
On 03.03.2016 12:44, Uli Fahrer wrote:
Hi,
this is a feature request for an additional input format that also
tackles the output format. I often have additional annotations like
document id or sentence id for input sentences that I want to preserve
in the parses. So an input could look like this:
documentId sentenceId sentence
0 0 first sentence, first document
0 1 second sentence, first document
1 0 first sentence, second document
For the output, I propose a one-sentence-per-line format since this
is easy to post-process.
0 1 Der Arzt arbeitet im Krankenhaus . Der@@ART@@der@@Def|Masc|Nom|Sg Arzt@@NN@@arzt@@Masc|Nom|Sg arbeitet@@VVFIN@@arbeiten@@3|Sg|Pres|Ind im@@APPRART@@in@@Dat Krankenhaus@@NN@@krankenhaus@@Neut|Dat|Sg .@@$.@@--@@_ DET(arzt-1,der-0) SUBJ(arbeiten-2,arzt-1) S(arbeiten-2,arbeiten-2) PP(arbeiten-2,in-3) PN(in-3,krankenhaus-4) ROOT(*-5,*-5)
I already tried to re-add these annotations after the parsing.
However, this solution doesn't work since ParZu computes in a
multi-threaded manner and doesn't preserve the input order in its
output.
To overcome this issue, one could introduce some sort of buffer that
stores finished sentences after parsing. In order to preserve the
input order the buffer could flush sentences in the correct order once
a sequence is computed. Do you see the problem when the output parses
don't match the input?
Best
Uli
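
The buffer idea in this request can be sketched as follows. This is only an illustration of the technique, not how ParZu's wrapper script is implemented; the function name `flush_in_order` and the (index, result) pair format are hypothetical.

```python
def flush_in_order(finished):
    """Yield results in input order.

    `finished` is any iterable of (index, result) pairs that may arrive
    out of order, where index is the 0-based position of the sentence
    in the input (an assumption of this sketch).
    """
    pending = {}      # results that arrived before their turn
    next_index = 0    # position of the next sentence to emit
    for index, result in finished:
        pending[index] = result
        # Flush the longest complete run starting at next_index.
        while next_index in pending:
            yield pending.pop(next_index)
            next_index += 1


if __name__ == "__main__":
    out_of_order = [(2, "third"), (0, "first"), (1, "second"), (3, "fourth")]
    for parse in flush_in_order(out_of_order):
        print(parse)
```

Only results that are ahead of the next expected position are held in memory, so the buffer stays small as long as no single sentence lags far behind the others.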
from parzu.
ParZu should preserve the input order - if it doesn't, this is a bug.
After some more digging, I found the issue that led to my wrong assumption that the output order is not preserved. The input file sometimes contains more than one space as a token delimiter. Therefore, the command
./parzu -i tokenized_lines < inprob -p 12 > prob
with inprob containing
Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86  % , gestiegen .
produces:
1 Die die ART ART Def|_|Nom|Pl 2 det _ _
2 Baukosten Baukosten N NN _|Nom|Pl 3 subj _ _
3 sind sein V VAFIN 3|Pl|Pres|Ind 0 root _ _
4 also also ADV ADV _ 3 adv _ _
5 deutlich deutlich ADV ADJD Pos| 3 pred _ _
6 mehr mehr ADV ADV _ 7 adv _ _
7 als als KOKOM KOKOM _ 3 kom _ _
8 die die ART ART Def|Fem|_|Sg 9 det _ _
9 Geschossfläche Geschossfläche N NN Fem|_|Sg 7 cj _ _
10 , , $, $, _ 0 root _ _
11 nämlich nämlich ADV ADV _ 12 adv _ _
12 um um PREP APPR _ 3 pp _ _
13 insgesamt insgesamt ADV ADV _ 14 adv _ _
14 86 86 CARD CARD _ 12 pn _ _
1 % % N NN _|Nom|_ 3 subj _ _
2 , , $, $, _ 0 root _ _
3 gestiegen steigen V VVPP _ 0 root _ _
4 . . $. $. _ 0 root _ _
The token numbering restarts at "%", so the single input line comes out as two sentences, and the input and output files can no longer be aligned. I suggest changing this behavior and making the tokenized_lines input format more robust.
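
A minimal sketch of what "more robust" could mean here, assuming Python 3. It is not the code in preprocessor/tokenized_lines.py, and the function name `tokenize_line` is hypothetical. In Python 3, `str.split()` without an argument splits on runs of any Unicode whitespace, so repeated spaces and characters such as U+00A0 never produce empty tokens.

```python
def tokenize_line(line):
    """Split one pre-tokenized sentence into its tokens.

    str.split() with no separator treats any run of Unicode whitespace
    (spaces, tabs, U+00A0, ...) as a single delimiter in Python 3.
    """
    return line.split()


if __name__ == "__main__":
    for sentence in ("insgesamt 86  % , gestiegen .",
                     "insgesamt 86\u00a0% , gestiegen ."):
        print(tokenize_line(sentence))
```

Both variants of the sentence yield the same six tokens, so one input line always maps to exactly one sentence.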
from parzu.
Hello Uli,
I'm unable to reproduce your problem:
echo "Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen ." | ./parzu -i tokenized_lines 2> /dev/null
1 Die die ART ART Def|||Pl 2 det _ _
2 Baukosten Baukosten N NN ||Pl 0 root _ _
3 sind sein V VAFIN 3|Pl|Pres|Ind 0 root _ _
4 also also ADV ADV _ 3 adv _ _
5 deutlich deutlich ADV ADJD Pos| 6 attr _ _
6 mehr mehr PRO PIS |Nom|Pl 3 subj _ _
7 als als KOKOM KOKOM _ 3 kom _ _
8 die die ART ART Def|Fem||Sg 9 det _ _
9 Geschossfläche Geschossfläche N NN Fem|_|Sg 7 cj _ _
10 , ,
11 nämlich nämlich ADV ADV _ 12 adv _ _
12 um um PREP APPR _ 3 pp _ _
13 insgesamt insgesamt ADV ADV _ 15 adv _ _
14 86 86 CARD CARD _ 15 attr _ _
15 % % N NN ||Pl 12 pn _ _
16 , ,
17 gestiegen steigen V VVPP _ 3 aux _ _
18 . .
I'm not sure why this behaviour is different on your machine. Is it possible that your original sentence contains some UTF-8 special whitespace character that was stripped when you copied the sentence to this message? Alternatively, are you using a particularly old or new version of Python?
You can also test the tokenizer in isolation:
echo "Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen ." | python preprocessor/tokenized_lines.py
Die
Baukosten
sind
also
deutlich
mehr
als
die
Geschossfläche
,
nämlich
um
insgesamt
86
%
,
gestiegen
.
from parzu.
I'm not sure why this behaviour is different on your machine. Is it possible that your original sentence contains some UTF-8 special whitespace character that was stripped when you copied the sentence to this message?
Interesting, you are right. Your example works for me. However, the attached sentence [1] produces the error I mentioned before.
Can you reproduce the problem with the provided file? I use Python 2.7.10. The character between 86 and % looks like this one [2].
[1] https://www.dropbox.com/s/mp670vkbtdhuoof/inprob?dl=0
[2] http://www.unicodemap.org/details/0x00A0/index.html
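
To check an input file for characters like this, something along these lines may help. It is a sketch assuming Python 3, and the function name `find_unusual_whitespace` is hypothetical. It reports every whitespace character other than space, tab, and line breaks, together with its code point and Unicode name.

```python
import unicodedata


def find_unusual_whitespace(text):
    """Return (offset, code point, name) for each non-ASCII-style whitespace char."""
    hits = []
    for offset, char in enumerate(text):
        if char.isspace() and char not in " \t\n\r":
            hits.append((offset,
                         "U+%04X" % ord(char),
                         unicodedata.name(char, "UNKNOWN")))
    return hits


if __name__ == "__main__":
    print(find_unusual_whitespace("insgesamt 86\u00a0% , gestiegen ."))
```

For a file, read it with an explicit encoding (for example `open(path, encoding="utf-8")`) and pass each line to the function.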
from parzu.
Hi Uli,
hm, I'm tempted to just blame this on the bad Unicode support in Python 2 (tokenized_lines.py works fine in Python 3.4.3), but I just committed a fix that should improve Unicode handling in Python 2.7. I hope this solves the problem for you.
from parzu.
Thank you very much. Looks good to me. I was not aware that ParZu works with Python 3.
from parzu.