Comments (6)

rsennrich commented on May 28, 2024

Hi Uli,

ParZu should preserve the input order - if it doesn't, this is a bug.
What might happen is that the sentences get processed out-of-order, but
they should be put together in the right order again by the wrapper
script (multiprocessed_parsing.py).

I think what you want to achieve should be easiest with
one-sentence-per-line input (which is already supported), and CoNLL
output format, where sentences are delimited by empty lines. I regularly
post-process the CoNLL format into some one-sentence-per-line
representation, e.g. for SMT (
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/wrappers/conll2mosesxml.py
), and it should be easy to do something like that for your purposes,
and then copy the additional annotations from input to output (if
they're both one sentence per line).
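
The post-processing described above could look roughly like the following. This is a minimal sketch, not part of ParZu: the function names `read_conll_sentences` and `attach_annotations` are hypothetical, the CoNLL output is assumed to be tab-separated with the usual column order (ID, FORM, LEMMA, CPOS, POS, FEATS, HEAD, DEPREL), and the annotated input is assumed to carry the document id and sentence id as the first two space-separated fields, with only the sentence itself having been passed to the parser. The `form/pos/head/deprel` output layout is purely illustrative.

```python
def read_conll_sentences(lines):
    """Yield one list of token rows per sentence; blank lines delimit sentences."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            sentence.append(line.split("\t"))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:  # last sentence may lack a trailing blank line
        yield sentence


def attach_annotations(input_lines, conll_lines):
    """Copy the leading ids of each input line onto the parse of that line."""
    inputs = [line.rstrip("\n") for line in input_lines if line.strip()]
    parses = list(read_conll_sentences(conll_lines))
    if len(inputs) != len(parses):
        # A mismatch means input and output cannot be aligned line by line.
        raise ValueError("%d input lines but %d parsed sentences"
                         % (len(inputs), len(parses)))
    result = []
    for line, parse in zip(inputs, parses):
        doc_id, sent_id, _sentence = line.split(" ", 2)
        # Columns: 1 = form, 4 = fine-grained POS, 6 = head, 7 = dependency label.
        tokens = " ".join("/".join((row[1], row[4], row[6], row[7]))
                          for row in parse)
        result.append("\t".join((doc_id, sent_id, tokens)))
    return result
```

The length check fails loudly if the parser splits or merges a sentence, which is the situation in which copying annotations by position would silently go wrong.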

If you do find that ParZu mixes up the order, can you give me a
reproducible example of this? Do you also observe it if you start ParZu
singlethreaded ("-p 1")?

best wishes,
Rico

On 03.03.2016 12:44, Uli Fahrer wrote:

Hi,

this is a feature request for an additional input format that also
tackles the output format. I often have additional annotations like
document id or sentence id for input sentences that I want to preserve
in the parses. So an input could look like this:

documentId sentenceId sentence
0 0 first sentence, first document
0 1 second sentence, first document
1 0 first sentence, second document

For the output, I propose a one-sentence-per-line format, since this
is easy to post-process.

0 1 Der Arzt arbeitet im Krankenhaus . Der@@ART@@der@@Def|Masc|Nom|Sg Arzt@@NN@@arzt@@Masc|Nom|Sg arbeitet@@VVFIN@@arbeiten@@3|Sg|Pres|Ind im@@APPRART@@in@@Dat Krankenhaus@@NN@@krankenhaus@@Neut|Dat|Sg .@@$.@@--@@_ DET(arzt-1,der-0) SUBJ(arbeiten-2,arzt-1) S(arbeiten-2,arbeiten-2) PP(arbeiten-2,in-3) PN(in-3,krankenhaus-4) ROOT(*-5,*-5)

I already tried to re-add these annotations after the parsing.
However, this solution doesn't work since ParZu computes in a
multi-threaded manner and doesn't preserve the input order in its
output.

To overcome this issue, one could introduce some sort of buffer that
stores finished sentences after parsing. In order to preserve the
input order the buffer could flush sentences in the correct order once
a sequence is computed. Do you see the problem when the output parses
don't match the input?
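
The buffering idea in the paragraph above can be sketched as follows. This is only an illustration of the technique, not ParZu's code (the reply above states that the wrapper script already restores the input order); the function name `flush_in_order` and the `(index, parse)` pair format are assumptions made for the example.

```python
def flush_in_order(results):
    """Yield parses in input order from (index, parse) pairs arriving in any order.

    Finished sentences are held in a buffer until every earlier sentence
    has been emitted, then flushed as a contiguous run.
    """
    pending = {}      # index -> parse, for sentences that finished early
    next_index = 0    # index of the next sentence that may be emitted
    for index, parse in results:
        pending[index] = parse
        while next_index in pending:
            yield pending.pop(next_index)
            next_index += 1


# Workers finish sentence 2 first, then 0, 1 and 3.
ordered = list(flush_in_order([(2, "c"), (0, "a"), (1, "b"), (3, "d")]))
```

The buffer only grows while an early sentence is still being parsed, so memory use depends on how far the workers run ahead of the slowest sentence.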

Best Uli


from parzu.

Tooa commented on May 28, 2024

> ParZu should preserve the input order - if it doesn't, this is a bug.

After some more digging, I found the issue that led to my wrong assumption that the output order is not preserved. The input file sometimes contains more than one space as a token delimiter. As a result, the command:

./parzu -i tokenized_lines < inprob -p 12 > prob

with inprob containing:

Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86   % , gestiegen .

produces:

1   Die die ART ART Def|_|Nom|Pl    2   det _   _ 
2   Baukosten   Baukosten   N   NN  _|Nom|Pl    3   subj    _   _ 
3   sind    sein    V   VAFIN   3|Pl|Pres|Ind   0   root    _   _ 
4   also    also    ADV ADV _   3   adv _   _ 
5   deutlich    deutlich    ADV ADJD    Pos|    3   pred    _   _ 
6   mehr    mehr    ADV ADV _   7   adv _   _ 
7   als als KOKOM   KOKOM   _   3   kom _   _ 
8   die die ART ART Def|Fem|_|Sg    9   det _   _ 
9   Geschossfläche Geschossfläche N   NN  Fem|_|Sg    7   cj  _   _ 
10  ,   ,   $,  $,  _   0   root    _   _ 
11  nämlich    nämlich    ADV ADV _   12  adv _   _ 
12  um  um  PREP    APPR    _   3   pp  _   _ 
13  insgesamt   insgesamt   ADV ADV _   14  adv _   _ 
14  86  86  CARD    CARD    _   12  pn  _   _ 

1   %   %   N   NN  _|Nom|_ 3   subj    _   _ 
2   ,   ,   $,  $,  _   0   root    _   _ 
3   gestiegen   steigen V   VVPP    _   0   root    _   _ 
4   .   .   $.  $.  _   0   root    _   _ 

The single input line is split into two output sentences, so the input and output files can no longer be aligned line by line. I suggest changing this behavior and making the tokenized_lines input format more robust.


rsennrich commented on May 28, 2024

Hello Uli,

I'm unable to reproduce your problem:

echo "Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen ." | ./parzu -i tokenized_lines 2> /dev/null
1 Die die ART ART Def|_|_|Pl 2 det _ _
2 Baukosten Baukosten N NN _|_|Pl 0 root _ _
3 sind sein V VAFIN 3|Pl|Pres|Ind 0 root _ _
4 also also ADV ADV _ 3 adv _ _
5 deutlich deutlich ADV ADJD Pos| 6 attr _ _
6 mehr mehr PRO PIS _|Nom|Pl 3 subj _ _
7 als als KOKOM KOKOM _ 3 kom _ _
8 die die ART ART Def|Fem|_|Sg 9 det _ _
9 Geschossfläche Geschossfläche N NN Fem|_|Sg 7 cj _ _
10 , , $, $, _ 0 root _ _
11 nämlich nämlich ADV ADV _ 12 adv _ _
12 um um PREP APPR _ 3 pp _ _
13 insgesamt insgesamt ADV ADV _ 15 adv _ _
14 86 86 CARD CARD _ 15 attr _ _
15 % % N NN _|_|Pl 12 pn _ _
16 , , $, $, _ 0 root _ _
17 gestiegen steigen V VVPP _ 3 aux _ _
18 . . $. $. _ 0 root _ _

I'm not sure why this behaviour is different on your machine. Is it possible that your original sentence contains some UTF-8 special whitespace character that was stripped when you copied the sentence to this message? Alternatively, are you using a particularly old or new version of Python?

You can also test the tokenizer in isolation:

echo "Die Baukosten sind also deutlich mehr als die Geschossfläche , nämlich um insgesamt 86 % , gestiegen ." | python preprocessor/tokenized_lines.py
Die
Baukosten
sind
also
deutlich
mehr
als
die
Geschossfläche
,
nämlich
um
insgesamt
86
%
,
gestiegen
.


Tooa commented on May 28, 2024

> I'm not sure why this behaviour is different on your machine. Is it possible that your original sentence contains some UTF-8 special whitespace character that was stripped when you copied the sentence to this message?

Interesting, you are right. Your example works for me. However, the attached sentence [1] produces the error I mentioned before.

Can you reproduce the problem with the provided file? I use Python 2.7.10. The character between 86 and % looks like this one [2].

[1] https://www.dropbox.com/s/mp670vkbtdhuoof/inprob?dl=0
[2] http://www.unicodemap.org/details/0x00A0/index.html
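
A quick way to check an input file for such characters, and to normalise them before parsing, might look like this. It is a sketch for Python 3 only and is not ParZu code; the helper names `find_unusual_whitespace` and `normalize_whitespace` are made up for the example. It relies on Python 3's `str.split()` treating U+00A0 (NO-BREAK SPACE) as whitespace.

```python
import unicodedata


def find_unusual_whitespace(line):
    """Return (position, Unicode name) for whitespace other than space, tab, newline."""
    return [(i, unicodedata.name(ch, "U+%04X" % ord(ch)))
            for i, ch in enumerate(line)
            if ch.isspace() and ch not in " \t\n"]


def normalize_whitespace(line):
    """Collapse every run of Unicode whitespace into a single plain space."""
    return " ".join(line.split())


sample = "um insgesamt 86 \u00a0 % , gestiegen ."
print(find_unusual_whitespace(sample))
print(normalize_whitespace(sample))
```

Running `normalize_whitespace` over each line before handing it to the parser keeps one token per space-separated field, so a line with a stray non-breaking space still counts as a single sentence.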


rsennrich commented on May 28, 2024

Hi Uli,

Hm, I'm tempted to just blame this on Python 2's poor Unicode support (tokenized_lines.py works fine in Python 3.4.3), but I just committed a fix that should improve Unicode handling in Python 2.7. I hope this solves the problem for you.


Tooa avatar Tooa commented on May 28, 2024

Thank you very much. Looks good to me. I was not aware that ParZu works with Python 3.
