Feature: Allow additional annotations for input sentences (parzu, issue #5, closed)

Tooa commented on May 28, 2024

Feature: Allow additional annotations for input sentences


Comments (6)

rsennrich commented on May 28, 2024

Hi Uli,

ParZu should preserve the input order - if it doesn't, this is a bug.
What might happen is that the sentences get processed out-of-order, but
they should be put together in the right order again by the wrapper
script (multiprocessed_parsing.py).

I think what you want to achieve should be easiest with
one-sentence-per-line input (which is already supported) and CoNLL
output format, where sentences are delimited by empty lines. I regularly
post-process the CoNLL format into some one-sentence-per-line
representation, e.g. for SMT
(https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/wrappers/conll2mosesxml.py),
and it should be easy to do something like that for your purposes,
and then copy the additional annotations from input to output (if
they're both one sentence per line).
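
A minimal sketch of the post-processing described above, not the behaviour of conll2mosesxml.py. It assumes tab-separated CoNLL columns with the word form in column 2, the lemma in column 3 and the fine-grained POS tag in column 5, and it assumes the annotations are available as one string per input sentence. The function names and the `form@@POS@@lemma` layout are illustrative only (the layout is loosely modelled on the output format proposed in the request).

```python
import io


def read_conll_sentences(stream):
    """Yield each sentence as a list of column lists; blank lines delimit sentences."""
    sentence = []
    for line in stream:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
        else:
            # Assumption: columns are separated by tabs.
            sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def attach_annotations(annotations, conll_stream):
    """Pair the i-th annotation with the i-th parsed sentence, one sentence per line."""
    parsed = list(read_conll_sentences(conll_stream))
    if len(parsed) != len(annotations):
        # A mismatch means input and output can no longer be aligned by position.
        raise ValueError(
            "%d annotations but %d parsed sentences" % (len(annotations), len(parsed))
        )
    lines = []
    for annotation, sentence in zip(annotations, parsed):
        tokens = ["%s@@%s@@%s" % (row[1], row[4], row[2]) for row in sentence]
        lines.append(annotation + "\t" + " ".join(tokens))
    return lines


# Made-up sample: two parsed sentences and their "documentId sentenceId" prefixes.
sample = io.StringIO(
    "1\tDer\tder\tART\tART\tDef|Masc|Nom|Sg\t2\tdet\t_\t_\n"
    "2\tArzt\tArzt\tN\tNN\tMasc|Nom|Sg\t0\troot\t_\t_\n"
    "\n"
    "1\tJa\tja\tPTKANT\tPTKANT\t_\t0\troot\t_\t_\n"
    "\n"
)
for line in attach_annotations(["0 0", "0 1"], sample):
    print(line)
```

The length check is deliberate: if the parser output contains more or fewer sentences than the input has lines, copying annotations by position would silently attach them to the wrong parses.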

If you do find that ParZu mixes up the order, can you give me a
reproducible example of this? Do you also observe it if you start ParZu
singlethreaded ("-p 1")?

best wishes,
Rico

On 03.03.2016 12:44, Uli Fahrer wrote:

Hi,

this is a feature request for an additional input format that also
tackles the output format. I often have additional annotations like
document id or sentence id for input sentences that I want to preserve
in the parses. So an input could look like this:
documentId sentenceId sentence
0 0 first sentence, first document
0 1 second sentence, first document
1 0 first sentence, second document
For the output, I propose a one-sentence-per-line format, since this
is easy to post-process.
0 1 Der Arzt arbeitet im Krankenhaus . Der@@ART@@der@@Def|Masc|Nom|Sg Arzt@@NN@@arzt@@Masc|Nom|Sg arbeitet@@VVFIN@@arbeiten@@3|Sg|Pres|Ind im@@APPRART@@in@@Dat Krankenhaus@@NN@@krankenhaus@@Neut|Dat|Sg .@@$.@@--@@_ DET(arzt-1,der-0) SUBJ(arbeiten-2,arzt-1) S(arbeiten-2,arbeiten-2) PP(arbeiten-2,in-3) PN(in-3,krankenhaus-4) ROOT(*-5,*-5)
I already tried to re-add these annotations after the parsing.
However, this solution doesn't work since ParZu computes in a
multi-threaded manner and doesn't preserve the input order in its
output.

To overcome this issue, one could introduce some sort of buffer that
stores finished sentences after parsing. In order to preserve the
input order the buffer could flush sentences in the correct order once
a sequence is computed. Do you see the problem when the output parses
don't match the input?
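
The buffer idea described above can be sketched as follows. This is only an illustration of the technique, not ParZu's implementation; it assumes each finished parse arrives tagged with the index of its input sentence, and the function name is made up for the example.

```python
def flush_in_order(results):
    """Yield (index, parse) pairs in index order, given pairs arriving in any order.

    Finished parses wait in a dictionary until every earlier sentence has
    been emitted, so the output order matches the input order.
    """
    pending = {}
    next_index = 0
    for index, parse in results:
        pending[index] = parse
        # Emit the longest run of consecutive sentences that is now complete.
        while next_index in pending:
            yield next_index, pending.pop(next_index)
            next_index += 1


# Sentences 0..3 finish out of order but are emitted in input order.
for index, parse in flush_in_order([(2, "c"), (0, "a"), (1, "b"), (3, "d")]):
    print(index, parse)
```

The buffer only holds parses that finished ahead of a slower earlier sentence, so its size is bounded by how far the workers run ahead of each other.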

Best Uli



Tooa commented on May 28, 2024

ParZu should preserve the input order - if it doesn't, this is a bug.

After some more digging, I found the issue that led to my wrong assumption that the output order is not preserved. The input file sometimes contains more than one space as a token delimiter. With inprob containing the sentence

Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen .

the command

./parzu -i tokenized_lines < inprob -p 12 > prob

produces:

1   Die die ART ART Def|_|Nom|Pl    2   det _   _ 
2   Baukosten   Baukosten   N   NN  _|Nom|Pl    3   subj    _   _ 
3   sind    sein    V   VAFIN   3|Pl|Pres|Ind   0   root    _   _ 
4   also    also    ADV ADV _   3   adv _   _ 
5   deutlich    deutlich    ADV ADJD    Pos|    3   pred    _   _ 
6   mehr    mehr    ADV ADV _   7   adv _   _ 
7   als als KOKOM   KOKOM   _   3   kom _   _ 
8   die die ART ART Def|Fem|_|Sg    9   det _   _ 
9   Geschossfläche Geschossfläche N   NN  Fem|_|Sg    7   cj  _   _ 
10  ,   ,   $,  $,  _   0   root    _   _ 
11  nämlich    nämlich    ADV ADV _   12  adv _   _ 
12  um  um  PREP    APPR    _   3   pp  _   _ 
13  insgesamt   insgesamt   ADV ADV _   14  adv _   _ 
14  86  86  CARD    CARD    _   12  pn  _   _ 

1   %   %   N   NN  _|Nom|_ 3   subj    _   _ 
2   ,   ,   $,  $,  _   0   root    _   _ 
3   gestiegen   steigen V   VVPP    _   0   root    _   _ 
4   .   .   $.  $.  _   0   root    _   _

As a result, the input and output files can no longer be aligned, because one input line has been split into two output sentences. I suggest changing this behavior and making the tokenized_lines input format more robust.


rsennrich commented on May 28, 2024

Hello Uli,

I'm unable to reproduce your problem:

echo "Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen ." | ./parzu -i tokenized_lines 2> /dev/null
1 Die die ART ART Def|_|_|Pl 2 det _ _
2 Baukosten Baukosten N NN _|_|Pl 0 root _ _
3 sind sein V VAFIN 3|Pl|Pres|Ind 0 root _ _
4 also also ADV ADV _ 3 adv _ _
5 deutlich deutlich ADV ADJD Pos| 6 attr _ _
6 mehr mehr PRO PIS _|Nom|Pl 3 subj _ _
7 als als KOKOM KOKOM _ 3 kom _ _
8 die die ART ART Def|Fem|_|Sg 9 det _ _
9 Geschossfläche Geschossfläche N NN Fem|_|Sg 7 cj _ _
10 , , $, $, _ 0 root _ _
11 nämlich nämlich ADV ADV _ 12 adv _ _
12 um um PREP APPR _ 3 pp _ _
13 insgesamt insgesamt ADV ADV _ 15 adv _ _
14 86 86 CARD CARD _ 15 attr _ _
15 % % N NN _|_|Pl 12 pn _ _
16 , , $, $, _ 0 root _ _
17 gestiegen steigen V VVPP _ 3 aux _ _
18 . . $. $. _ 0 root _ _

I'm not sure why this behaviour is different on your machine. Is it possible that your original sentence contains some UTF-8 special whitespace character that was stripped when you copied the sentence to this message? Alternatively, are you using a particularly old or new version of Python?
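
One way to check an input file for such characters is sketched below. It is a diagnostic written for this thread, not part of ParZu; it assumes Python 3 and text that has already been decoded to `str`, and the function name is hypothetical.

```python
import unicodedata


def find_special_whitespace(text):
    """Return (position, code point, name) for whitespace other than space, tab, CR, LF."""
    hits = []
    for pos, ch in enumerate(text):
        if ch in " \t\n\r":
            continue
        # str.isspace() covers Unicode whitespace; category "Zs" is "space separator".
        if ch.isspace() or unicodedata.category(ch) == "Zs":
            hits.append((pos, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN")))
    return hits


# A no-break space between "86" and "%" is reported with its position and name.
print(find_special_whitespace("insgesamt 86\u00a0% , gestiegen ."))
```

Running this over each line of the input shows whether a delimiter that looks like a plain space is actually a different code point.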

You can also test the tokenizer in isolation:

echo "Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen ." | python preprocessor/tokenized_lines.py
Die
Baukosten
sind
also
deutlich
mehr
als
die
Geschossfläche
,
nämlich
um
insgesamt
86
%
,
gestiegen
.


Tooa commented on May 28, 2024

I'm not sure why this behaviour is different on your machine. Is it possible that your original sentence contains some UTF-8 special whitespace character that was stripped when you copied the sentence to this message?

Interesting, you are right. Your example works for me. However, the attached sentence [1] produces the error I mentioned before.

Can you reproduce the problem with the provided file? I use Python 2.7.10. The character between 86 and % appears to be a no-break space, U+00A0 [2].

[1] https://www.dropbox.com/s/mp670vkbtdhuoof/inprob?dl=0
[2] http://www.unicodemap.org/details/0x00A0/index.html
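
As a workaround on the input side, the file could be normalized before it is passed to the parser. The sketch below is an assumption about what such a preprocessing step might look like, not the fix in ParZu; it assumes Python 3, where `\s` in a `str` pattern also matches Unicode whitespace such as U+00A0, and it treats every run of whitespace as a single token delimiter. The names are hypothetical.

```python
import re

# In Python 3, \s on str patterns matches Unicode whitespace, including U+00A0.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_line(line):
    """Collapse every run of whitespace into one plain space and trim the ends."""
    return _WHITESPACE_RUN.sub(" ", line).strip()


print(normalize_line("um insgesamt 86\u00a0%  , gestiegen .\n"))
```

Whether a no-break space should separate tokens or keep them together is a design choice; this sketch takes the first option so that every input line stays a single, space-delimited sentence.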


rsennrich commented on May 28, 2024

Hi Uli,

hm, I'm tempted to just blame this on Python 2's poor Unicode support (tokenized_lines.py works fine in Python 3.4.3), but I just committed a fix that should improve Unicode handling in Python 2.7. I hope this solves the problem for you.


Tooa commented on May 28, 2024

Thank you very much. Looks good to me. I was not aware that ParZu works with Python 3.
