apertium / apertium-recursive

Recursive structural transfer module for Apertium

Home Page: https://wiki.apertium.org/wiki/Apertium-recursive

License: GNU General Public License v3.0

Makefile 0.63% Python 14.84% Lex 0.08% Yacc 2.08% C++ 79.26% Shell 0.32% M4 1.49% C 0.78% Vim Script 0.53%

apertium-tools apertium-core

apertium-recursive's Introduction

Apertium-recursive

A recursive structural transfer module for Apertium

Compiling

./autogen.sh
make

Running

# compile the rules file
src/rtx-comp rule-file bytecode-file

# run the rules
src/rtx-proc bytecode-file < input

# decompile the rules and examine the bytecode
src/rtx-decomp bytecode-file text-file

# compile XML rule files
src/trx-comp bytecode-file xml-files...

# generate random sentences from a rules file
apertium-recursive/src/randsen.py start_node pair_directory source_language_directory

Options for rtx-comp:

-e don't compile a rule with a particular name
-l load lexicalized weights from a file
-s output summaries of the rules to stderr

Options for trx-comp:

-l load lexicalized weights from a file

Options for rtx-proc:

-a indicates that the input comes from apertium-anaphora
-f trace which parse branches are discarded
-r print which rules are applying
-s trace the execution of the bytecode interpreter
-t mimic the behavior of apertium-transfer and apertium-interchunk
-T print the parse tree rather than applying output rules
-b print both the parse tree and the output
-m set the mode of tree output, available modes are:
- nest (default) print the tree as text indented with tabs
- flat print the tree as text
- latex print the tree as LaTeX source using the forest library
- dot print the tree as a Dot graph
- box print the tree using box-drawing characters
-e a combination of -f and -r
- Intended use: rtx-proc -e -m latex rules.bin < input.txt 2> trace.tex
-F filter branches for things besides parse errors (experimental)
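
When rtx-proc is driven from a script, the options listed above can be assembled programmatically. The following is a minimal Python sketch, not part of the repository: the helper names `build_rtx_proc_command` and `run_rtx_proc` are hypothetical, only flags documented above are used, and `rtx-proc` is assumed to be on the PATH.

```python
import subprocess

# Tree output modes accepted by -m, as listed in the options above.
TREE_MODES = ("nest", "flat", "latex", "dot", "box")


def build_rtx_proc_command(bytecode_file, tree_only=False, tree_mode=None,
                           trace=False, binary="rtx-proc"):
    """Build an argument list for rtx-proc (hypothetical helper)."""
    cmd = [binary]
    if trace:
        cmd.append("-e")   # combination of -f and -r
    if tree_only:
        cmd.append("-T")   # print the parse tree instead of applying output rules
    if tree_mode is not None:
        if tree_mode not in TREE_MODES:
            raise ValueError("unknown tree mode: " + tree_mode)
        cmd.extend(["-m", tree_mode])
    cmd.append(bytecode_file)
    return cmd


def run_rtx_proc(cmd, text):
    """Feed text to the command on stdin and return (stdout, stderr) as str."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate(text.encode("utf-8"))
    return out.decode("utf-8"), err.decode("utf-8")
```

With `trace=True` and `tree_mode="latex"` the builder reproduces the "intended use" command line shown for -e; the trace then arrives on stderr, which `run_rtx_proc` returns separately from the translation.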

Testing

make test

Using in a Pair

In Makefile.am add:

$(PREFIX1).rtx.bin: $(BASENAME).$(PREFIX1).rtx
	rtx-comp $< $@

$(PREFIX2).rtx.bin: $(BASENAME).$(PREFIX2).rtx
	rtx-comp $< $@

and add

$(PREFIX1).rtx.bin \
$(PREFIX2).rtx.bin

to TARGETS_COMMON.

In modes.xml, replace apertium-transfer, apertium-interchunk, and apertium-postchunk with:

<program name="rtx-proc">
  <file name="abc-xyz.rtx.bin"/>
</program>

Documentation

GSoC project proposal: https://wiki.apertium.org/wiki/User:Popcorndude/Recursive_Transfer
File format documentation: https://wiki.apertium.org/wiki/Apertium-recursive/Formalism
Bytecode documentation: https://wiki.apertium.org/wiki/Apertium-recursive/Bytecode
Progress reports: https://wiki.apertium.org/wiki/User:Popcorndude/Recursive_Transfer/Progress and #1
Examples of functioning rule sets can be found in apertium-eng-kir, eng-spa.rtx, and tests/

apertium-recursive's Issues

Restructure Output Conditionals

Right now, output conditionals are compiled on the input side and so cannot access chunk tags set higher up the tree. Fixing this will probably require moving the chunk surface information from OutputChunk to Rule.

@jonorthwash

Embed chunks in chunks


  <rule comment="Det SN" firstChunk="SD">
   <pattern>
    <pattern-item n="Det"/>
    <pattern-item n="SN"/>
   </pattern>
   <action>
      <call-macro n="f_set_chunk_name1"><with-param pos="2"/></call-macro>
      <call-macro n="f_concord2"><with-param pos="2"/><with-param pos="1"/></call-macro>
      <call-macro n="f_link_concord2"><with-param pos="2"/><with-param pos="1"/></call-macro>
      <call-macro n="f_set_determiner2"><with-param pos="2"/><with-param pos="1"/></call-macro>
      <out>
       <chunk namefrom="chunkName">
        <tags>
         <tag><lit-tag v="SD"/></tag>
         <tag><clip pos="2" side="tl" part="a_gen"/></tag>
         <tag><var n="numero"/></tag>
         <tag><clip pos="2" side="tl" part="a_poss"/></tag>
        </tags>
        <chunk>
          <lemma><clip pos="1" side="tl" part="lem"/></lemma>
           <tags><clip pos="1" side="tl" part="tags"/></tags>
           <lu><clip pos="1" side="tl" part="lem"/>
                  <clip pos="1" side="tl" part="tags"/></lu>
         </chunk>
        <b/>
        <lu><clip pos="2" side="tl" part="whole"/></lu>
       </chunk>
      </out>
   </action>
  </rule>

Desired output:

        ^madre<SD><f><sg><px1sg>{
                ^mío<Det><det><pos><mf><3>{
                    ^mío<det><pos><mf><3>$
                 }$
                ^madre<SN><f><3><px1sg>{
                        ^madre<N><f><3><px1sg>{
                                ^madre<n><f><3>$
                        }$
                }$
        }$

Validation Script for TRX DTD

Tool to display TL tree

This is probably easiest as a Python script that parses the output of -Tr and works out what the tree would be if we weren't discarding as we descend.
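
As a starting point, here is a rough Python sketch of the parsing half of such a tool. It reads only the nested chunk notation that appears in the traces quoted on this page (`^header{ ... }$` and `^header$`); it does not parse the full -Tr output, does not handle escaped characters, and the function names are hypothetical.

```python
def parse_nodes(text, pos=0):
    """Parse consecutive nodes starting at pos; return (nodes, new_pos).

    Each node is a (header, children) tuple, where header is the text
    between '^' and the following '{' or '$'.
    """
    nodes = []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        if text[pos] != "^":
            break                      # '}' closing the parent, or other text
        pos += 1
        start = pos
        while pos < len(text) and text[pos] not in "{$":
            pos += 1
        if pos == len(text):
            raise ValueError("unterminated node")
        header = text[start:pos]
        children = []
        if text[pos] == "{":
            children, pos = parse_nodes(text, pos + 1)
            if text[pos:pos + 2] != "}$":
                raise ValueError("expected '}$' at offset %d" % pos)
            pos += 2
        else:
            pos += 1                   # skip the closing '$'
        nodes.append((header, children))
    return nodes, pos


def render(nodes, depth=0):
    """Return the tree as a list of lines indented by depth."""
    lines = []
    for header, children in nodes:
        lines.append("  " * depth + header)
        lines.extend(render(children, depth + 1))
    return lines


sample = "^bonic<SAdj><GD><ND>{\n\t^beautiful<adj>/bonic<adj>$\n}$"
tree, _ = parse_nodes(sample)
print("\n".join(render(tree)))
```

Reconstructing the tree from discarded branches would still require reading the rule-application lines of the trace, which this sketch leaves out.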

Rules not being applied in all cases

Hello,

I am working on this set of rules: https://github.com/apertium/apertium-eng-cat/blob/recursive-eng-cat/apertium-eng-cat.eng-cat.rtx

Given the following input:
^beautiful<adj>/bonic<adj>/$^,<cm>/,<cm>/$

The output of rtx-proc is:
^bonic<adj>$^,<cm>$

However, given that there is a rule on line 107 for "adj", I would expect the output to be:
^bonic<adj><m><sg>$^,<cm>$

This rule (line 107) is applied correctly when only the first token is present.

Taking a look at the summary of the operations, it looks like the rule is recognised but then discarded for some reason:

Reading Input:
^beautiful<adj>/bonic<adj>$

Checking for reductions for branch 1

Applying rule 16 (line 107) to branch 1 with weight 0: ^beautiful<adj>/bonic<adj>$

^bonic<SAdj><GD><ND>{
        ^beautiful<adj>/bonic<adj>$
}$

Checking for reductions for branch 1
No further reductions possible for branch 1.

Splitting stack and creating branch 2
Branch 1: 1 nodes, weight = 0
[Chunk]:
^bonic<SAdj><GD><ND>{
        ^beautiful<adj>/bonic<adj>$
}$
Branch 2: 1 nodes, weight = 0
[Chunk]:
^beautiful<adj>/bonic<adj>$

Filtering Branches:
Branch 1  has no possible continuations.
Branch 2
Reading Input:
^,<cm>/,<cm>$

Checking for reductions for branch 2
No further reductions possible for branch 2.
Branch 2: 3 nodes, weight = 0
[Chunk]:
^beautiful<adj>/bonic<adj>$
[Blank]:
[Chunk]:
^,<cm>/,<cm>$

Filtering Branches:
Input buffer is empty.
Branch 2  has no active branch to compare to.

************************************************************
************************************************************
************************************************************
Outputting Branch 2

[Chunk]:
^beautiful<adj>/bonic<adj>$
[Blank]:
[Chunk]:
^,<cm>/,<cm>$
************************************************************
************************************************************
************************************************************

Output Node:
^beautiful<adj>/bonic<adj>$

^bonic<adj>$Output Node:
^,<cm>/,<cm>$

^,<cm>$

Is this behaviour correct? Am I missing something in my set of rules to handle this? Thanks!

Rule internal variables

I'd like to be able to declare variables locally and be able to clip from them and pass them to macros.

  <rule comment="v_iv" firstChunk="VI">
   <local>
      <var n="copula"/>
   </local>
   <pattern>
    <pattern-item n="v_iv"/>
   </pattern>
   <action>
      <let><var n="copula"/><clip pos="1" side="tl" part="whole"/></let>
     <call-macro n="f_set_chunk_name1"><with-param pos="1"/></call-macro>
     <choose>
       <when><test><equal><clip pos="1" side="tl" part="a_pred"/><lit-tag v="adj.pred"/></equal></test>
        <let><clip pos="1" side="tl" part="a_pred"/><lit-tag v="adj"/></let>
        <call-macro n="f_concord1"><with-param pos="1"/></call-macro>
        <call-macro n="f_agr_fin_verb1"><with-param><var n="copula"/></with-param></call-macro>
        <call-macro n="f_conj_fin_verb1"><with-param><var n="copula"/></with-param></call-macro>
        <out>
         <chunk namefrom="chunkName">
          <tags>
           <tag><lit-tag v="V.iv"/></tag>
           <tag><clip pos="1" side="tl" part="tags"/></tag>
          </tags>
          <lu><lit v="estar"/><lit-tag v="vblex"/><clip part="a_agr"><var n="copula"/></clip> </lu><b/>
          <lu><clip pos="1" side="tl" part="lem"/></lu>
         </chunk>
        </out>
      </when>

Interface for Weight Learning and Lexicalization

It should be as straightforward as possible to incorporate learned weights into a ruleset.

Perhaps it should be possible to put labels on rules, maybe something like

DP -> "pos-s" DP de@pr DP { 3 's@gen 1 } |
      "pos-of" DP de@pr DP { 3 of@pr 1 } ;

The compiler could accept a file of data like

pos-s	1.7	*@DP.* de@pr *@DP.np.*
pos-of	2.1	*@DP.* de@pr noodle@DP.*

Which it could then incorporate into the transducer directly.

Using rule numbers would also be possible, but that seems like a more fragile system.

It could also be possible for this file to be used to exclude rules, allowing the learning program to try different combinations.
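
To make the proposed data file concrete, here is a small Python sketch that reads lines in the format suggested above (label, weight, then a whitespace-separated pattern). This illustrates the proposal only; the format actually accepted by `rtx-comp -l` may differ, and the function names are hypothetical.

```python
def parse_weight_line(line):
    """Split 'label weight pattern...' into (label, weight, pattern_items)."""
    label, weight, pattern = line.split(None, 2)
    return label, float(weight), pattern.split()


def load_weights(lines):
    """Group lexicalized weights by rule label, skipping blank lines."""
    weights = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        label, weight, items = parse_weight_line(line)
        weights.setdefault(label, []).append((weight, items))
    return weights


example = [
    "pos-s\t1.7\t*@DP.* de@pr *@DP.np.*",
    "pos-of\t2.1\t*@DP.* de@pr noodle@DP.*",
]
print(load_weights(example))
```

Keying on labels rather than rule numbers means the file stays valid when rules are reordered, which is the fragility concern raised above.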

@ftyers

Chunk Variables in TRX

TRX should have a way to manipulate chunk variables. Possibly something like

<let>
 <chunk-var n="wh_word"/>
 <clip pos="1" part="whole"/>
</let>

<clip var="wh_word" part="number"/>

There should probably also be some changes to <section-def-vars> or an additional <section-def-chunk-vars>. Either way, the DTD should be updated.

Warn on Propagation of Unused Variable

S: _ ;
S -> DP.$number VP { 1 _1 2 } ;

It would be nice if a situation like this could issue a warning or error along the lines of

S has no tag 'number'.

Exit code 139 on execution with blank input

The rtx-proc program currently exits with code 139 when it is executed with blank input. For example:

echo "" | rtx-proc cat-ron.rtx.bin

Other Apertium modules simply return the blank output and exit with code 0.
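
For context, an exit status of 139 is how a POSIX shell conventionally reports a process killed by signal 11 (128 + 11), which on Linux is SIGSEGV, so this is a segmentation fault rather than a deliberate error exit. A small Python sketch for decoding such statuses in a test harness follows; `describe_exit` is a hypothetical helper, and a program may also deliberately exit with a status above 128, so the interpretation is a heuristic.

```python
import signal


def describe_exit(code):
    """Describe an exit status as reported by a POSIX shell or by subprocess."""
    if code < 0:                 # subprocess convention: -N means signal N
        signum = -code
    elif code > 128:             # shell convention: 128 + N means signal N
        signum = code - 128
    else:
        return "exited with status %d" % code
    try:
        name = signal.Signals(signum).name
    except ValueError:           # not a signal number known on this platform
        name = "signal %d" % signum
    return "killed by " + name


print(describe_exit(139))
```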

how to override POS tag

I'm trying to override the POS tag manually.

An example rule:

      8: бар@AP %vP.cop [$barjoq=barjoq] { %*(vblex)[lemh=have] } |

Here we [correctly] end up with [email protected] instead of [email protected]. The issue is how to override vP with vblex. I could force it with an attribute that applies to both of those, but I think there should be an equivalent of lemh for first tag? @mr-martian, do you know if it exists / what it is?

Should chcontent return the part outside the contents

Should chcontent return the {}?

e.g.

	^madre<SD><f><sg><px1sg>{

			^el<det><def><2><3>$

		^madre<SN><f><3><px1sg>{
			^madre<N><f><3><px1sg>{
				^madre<n><f><3>$
			}$
		}$
	}$

^madre<SD><f><sg><px1sg>{
		^{
			^el<det><def><2><3>$
		}$
		^madre<SN><f><3><px1sg>{
			^madre<N><f><3><px1sg>{
				^madre<n><f><3>$
			}$
		}$
	}$

i=yes for rules

It would be great to be able to use i="yes" on any element (or at least on <rule> and <choose>).

Can't have rules with identical patterns but different conditions

Given a set of rules like this:

NP -> n ?(1.number = sg) { ... } ;
NP -> n ?(1.number = pl) { ... } ;

The two rules will have the same path in the transducer and thus a single end state will be associated with each of them. Currently any particular state can only have one associated rule and the reference to the second rule will be discarded.

The particular situation above should have been written as a single rule, but this will nonetheless fail unintuitively. Also, there could be rules that generate different chunks from input with the same path.

This would require modifying pattern.cc to output both rule numbers and matcher.h to store multiple rules. Perhaps MatchExe2 should store a map<int, vector<pair<int, double>>> and then MatchNode2 can just have an isEndOfRule flag.
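
The difference between the current and the proposed storage can be sketched in a few lines of Python. This only models the data structure (`map<int, vector<pair<int, double>>>`), not the actual code in pattern.cc or matcher.h; the state and rule numbers are made up for the example.

```python
from collections import defaultdict


def build_single(final_states):
    """One rule per final state: a later rule silently replaces an earlier one."""
    table = {}
    for state, rule, weight in final_states:
        table[state] = (rule, weight)
    return table


def build_multi(final_states):
    """Several rules per final state: every (rule, weight) pair is kept."""
    table = defaultdict(list)
    for state, rule, weight in final_states:
        table[state].append((rule, weight))
    return dict(table)


# Two rules with the same pattern end in the same (hypothetical) state 7.
finals = [(7, 1, 0.0), (7, 2, 0.0)]
print(build_single(finals))
print(build_multi(finals))
```

With the list-valued table, the matcher can report both rules at state 7 and leave it to the rule conditions to decide which one applies.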

Lookahead for XML files

Using trx-comp we can compile files in the .t*x format and use them as recursive rules. However, trx-comp does not generate lookahead paths and so loses the speedups from #6.

Doing this will require working out the part of speech tag of every generated chunk, which may force all chunks to have literal pos tags.

Better XML syntax design

The current XML syntax was borrowed directly from t1x and t2x and doesn't correspond very well to how rtx-comp works, which makes certain things (particularly agreement) much more difficult than they should be.

<rule>
  <pattern>
    <pattern-item n="SA"/>
    <pattern-item n="SN"/>
  </pattern>
  <input-action>
    <out>
      <chunk>
        <lemma><clip pos="2" part="lem" side="tl"/></lemma>
        <tags>
          <tag><lit-tag v="SN"/></tag>
          <tag><clip pos="2" part="number" side="tl"/></tag>
        </tags>
        <input-elements/>
      </chunk>
    </out>
  </input-action>
  <output-action>
    <let>
      <clip pos="1" part="number" side="tl"/>
      <clip pos="0" part="number" side="tl"/>
    </let>
    <out>
      <lu pos="2"/>
      <lu pos="1" blank-after="no"/>
    </out>
  </output-action>
</rule>

@ftyers How does this look for a first draft?

lit-tag shouldn’t put <> on an empty string

This issue was automatically made by begiak, Apertium's beloved IRC bot, by the order of khannatanmai on #apertium. A human is yet to update the description.

Crashes on macros that output nothing

$ echo "айтшы!" | apertium -d .. kaz-kir-transfer

Reading Input:
^айт<v><tv><imp><p2><sg>/айт<v><tv><imp><p2><sg>$

Checking for reductions for branch 1

Applying rule 16 (line 197) to branch 1 with weight 2: ^айт<v><tv><imp><p2><sg>/айт<v><tv><imp><p2><sg>$

 -> rule was rejected
This rule was rejeced.


Applying rule 15 (line 196) to branch 1 with weight 1: ^айт<v><tv><imp><p2><sg>/айт<v><tv><imp><p2><sg>$

tried to pop Chunk but mode is 1

If this word is in a text, it outputs only everything up to that word.

Need pkg-config

GSoC Progress

General TODO list:

String Variables in RTX

RTX rules should have some way of reading and setting global string variables.

#15 should probably be resolved first.

infinite loops on деп-related sentences

If you have a sentence like "Мен досум барам деп айттым." in a larger file, rtx-proc loops infinitely.

Automatic Capitalization

Postchunk automatically changes the capitalization of lexical units based on the pseudolemma of the chunk. We can modify case, but right now it's entirely manual.
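
One possible shape for an automatic version is sketched below in Python: detect the case pattern of the chunk's pseudolemma and apply it to the first lexical unit. This is an assumption-laden illustration (the three patterns "aa", "Aa" and "AA" and both function names are choices made for the example), not a description of what postchunk or rtx-proc does in detail.

```python
def case_of(word):
    """Classify a word as 'AA' (all caps), 'Aa' (capitalized) or 'aa'."""
    if len(word) > 1 and word.isupper():
        return "AA"
    if word[:1].isupper():
        return "Aa"
    return "aa"


def apply_case(pattern, word):
    """Apply a case pattern to a word; 'aa' leaves the word unchanged."""
    if pattern == "AA":
        return word.upper()
    if pattern == "Aa":
        return word[:1].upper() + word[1:]
    return word


# The chunk pseudolemma is capitalized, so the first output word should be too.
print(apply_case(case_of("Madre"), "mi"))
```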

Conjoining LUs to Macros

Because the compiler currently can't determine whether an if statement will have output or not, conjoining things inside and outside of if statements is currently disallowed. Unfortunately, this also disallows conjoining to macros.

VP -> vblex.inf prn.obj {1 + 2};

If vblex is defined using a macro, the above rule will fail to compile and it seems like something we would want to compile.

Alternatively, this situation could require

VP -> vblex.inf prn.obj {1(prn_num=2.number, prn_pers=2.person, prn_gen=2.gender)};

features coming from /ref by default??

I seem to be getting subject—possessed-object number agreement, apparently because of /ref?

In the sentence "Мышыктардын үйү бар.", for example, the second noun gets <pl> when only /ref has any indication of pl.

Applying rule 2 (line 218) to branch 2 with weight 1: ^үй<n><px3sp><nom>/house<n><px3sp><nom>/Cat<n><pl><gen>$

^үй<NP><n><pl><px3sp><nom><in><GD>{
	^үй<n><px3sp><nom>/house<n><px3sp><nom>/Cat<n><pl><gen>$
}$

The rule in question is the following:

NP ->	
      1: n.$nptype.$lem/sl.$case/sl.$number.$poss
         [$prep_flag=(if (1.lem in en_nouns_loc_in) in
                      else-if (1.lem in en_nouns_loc_on) on
                      else-if (1.lem in en_nouns_loc_at) at
                      else 1.case/sl>prep_flag
                     ),
          $REF_gender=1.gender/ref>REF_gender
         ] { 1[lemcase=$lemcase, number=$number] } |

Python 3.5 support

https://docs.python.org/3/library/subprocess.html#subprocess.check_output "Changed in version 3.6: encoding and errors were added."

Supported distros Debian Stretch and Ubuntu Xenial are both on Python 3.5, so all Python code must be compatible with 3.5. There may be other issues beyond subprocess.check_output(encoding=...).
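
For the `subprocess.check_output(encoding=...)` case specifically, one 3.5-compatible approach is to decode the returned bytes explicitly. A minimal sketch follows; the wrapper name `check_output_text` is hypothetical and the child command is only a stand-in for the real call.

```python
import subprocess
import sys


def check_output_text(cmd, encoding="utf-8"):
    """Like subprocess.check_output(cmd, encoding=...), but works on Python 3.5."""
    return subprocess.check_output(cmd).decode(encoding)


# Stand-in command: run the current interpreter and capture what it prints.
print(check_output_text([sys.executable, "-c", "print('ok')"]))
```

f-strings are another 3.6 feature worth checking for, since they fail at parse time on 3.5.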

Concordance macros for Spanish

The old transfer had a set of concordance macros to deal with determining gender and number in cases like,

adj.m.sg n.m.sg		el coche nuevo
adj.m.sg n.m.sp		el saltamontes nuevo 
adj.m.sg n.mf.sg	el criminal nuevo
adj.m.sg n.mf.sp	el trotamundos nuevo
adj.m.pl n.m.pl		los coches nuevos
adj.m.pl n.m.sp		los saltamontes nuevos
adj.m.pl n.mf.pl	los criminales nuevos
adj.m.pl n.mf.sp	los trotamundos nuevos
*adj.m.sp n.m.sg		
*adj.m.sp n.m.pl
*adj.m.sp n.m.sp
*adj.m.sp n.mf.sg
*adj.m.sp n.mf.pl
*adj.m.sp n.mf.sp
adj.f.sg n.f.sg		la casa nueva
adj.f.sg n.f.sp		la tesis nueva
adj.f.sg n.mf.sg	la criminal nueva
adj.f.sg n.mf.sp	la trotamundos nueva
adj.f.pl n.f.pl		las casas nuevas
adj.f.pl n.f.sp		las tesis nuevas
adj.f.pl n.mf.pl	las criminales nuevas
adj.f.pl n.mf.sp	las trotamundos nuevas
*adj.f.sp n.f.sg
*adj.f.sp n.f.pl
*adj.f.sp n.f.sp
*adj.f.sp n.mf.sg
*adj.f.sp n.mf.pl
*adj.f.sp n.mf.sp
adj.mf.sg n.m.sg	el coche interesante
adj.mf.sg n.m.sp	el saltamontes interesante
adj.mf.sg n.f.sg	la casa interesante
adj.mf.sg n.f.sp	la tesis interesante
adj.mf.sg n.mf.sg	_ criminal interesante
adj.mf.sg n.mf.sp	_ trotamundos interesante
adj.mf.pl n.m.pl	los coches interesantes
adj.mf.pl n.m.sp	los saltamontes interesantes
adj.mf.pl n.f.pl	las casas interesantes
adj.mf.pl n.f.sp	las tesis interesantes
adj.mf.pl n.mf.pl	_ criminales interesantes
adj.mf.pl n.mf.sp	_ trotamundos interesantes
adj.mf.sp n.m.sg	el coche salvavidas
adj.mf.sp n.m.pl	los coches salvavidas
adj.mf.sp n.m.sp	_ saltamontes salvavidas
adj.mf.sp n.f.sg	la casa salvavidas
adj.mf.sp n.f.pl	las casas salvavidas
adj.mf.sp n.f.sp	_ tesis salvavidas
adj.mf.sp n.mf.sg	_ criminal salvavidas
adj.mf.sp n.mf.pl	_ criminales salvavidas
adj.mf.sp n.mf.sp	_ trotamundos salvavidas

It would be great to have similar macros for this transfer. They are called f_concord1, f_concord2 etc. In principle we should only need 1 and 2 here.
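
The agreement logic implied by the table can be summarized in a short Python sketch: an underspecified value (mf for gender, sp for number) yields to a specific one, and the determiner stays undecided ("_") when both sides are underspecified. This is a reading of the table above, not the implementation of the f_concord macros; the function names are hypothetical, and the starred (impossible) rows are not checked.

```python
def agree(adj_value, noun_value, underspecified):
    """Resolve one feature; return None if the two values conflict."""
    if adj_value == underspecified:
        return noun_value
    if noun_value == underspecified or adj_value == noun_value:
        return adj_value
    return None                      # e.g. m against f


DETERMINERS = {("m", "sg"): "el", ("m", "pl"): "los",
               ("f", "sg"): "la", ("f", "pl"): "las"}


def concord(adj_gender, adj_number, noun_gender, noun_number):
    """Return (gender, number, determiner) for an adjective-noun pair."""
    gender = agree(adj_gender, noun_gender, "mf")
    number = agree(adj_number, noun_number, "sp")
    return gender, number, DETERMINERS.get((gender, number), "_")


print(concord("m", "sg", "mf", "sp"))    # row: el trotamundos nuevo
print(concord("mf", "sp", "f", "pl"))    # row: las casas salvavidas
print(concord("mf", "sg", "mf", "sg"))   # row: _ criminal interesante
```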

Random Sentence Generator Doesn't Handle <re> in Bilingual Dictionary

Currently randsen.py looks for <l> or <r> and misses <re>, leading to odd results. It should look for <re> and also know how to interpret it.

Have macros check %

vP -> %vaux { 1 } ;

Error in macro 'vP', invoked by rule beginning on line 189 of apertium-kaz-kir.kaz-kir.rtx: Macro not given value for attribute 'fin_nonfin'.

Macro expansion should check the grab_all property.

Have Random Sentence Generator Actually Parse Rule File

Currently randsen.py is relying on rtx-comp -s. In order for it to generate grammatical sentences, either rtx-comp -s should be made more informative or randsen.py should directly parse the rule files.

Make String Variables Branch-Specific

Global string variables would be more useful if they were branch-specific like chunk variables.

For efficiency, they should probably be stored in a fixed array like the chunk variables with the size specified in the compiled file.

An array of initial values can also be specified, similar to how initial values are currently set.

Random Sentence Generator is Insufficiently Random

randsen.py has a tendency to choose from the same fairly small set of words on each run.

Automatic resolving of linked properties

Sometimes you want to be able to link stuff to a chunk tag, but the exact position of the tag is uncertain because there may or may not be intervening tags. e.g. if you have
^SN<px3sg><m><sg>$ and ^SN<m><sg>$, you may want to do,

        <lu><clip pos="1" side="tl" part="a_gen" link-to="3"/></lu>

But that would have to be 2 in the case of the second example. Usually tags in a chunk are not ambiguous, so would it be possible to do something like,

<lu><clip pos="1" side="tl" part="a_gen" link/></lu>

And have it automatically resolve? e.g. pull out the part of the chunk that corresponds to a_gen?

Don't output a blank if an LU is empty

Sometimes we want to do things like:

      <out>
       <chunk name="SV">
        <tags>
         <tag><lit-tag v="SV"/></tag>
          <tag><clip pos="1" side="tl" part="a_val"/></tag>
        </tags>
        <var n="CI"/> <b/>
        <lu> <clip pos="1" side="tl" part="whole"/> </lu>
       </chunk>
      </out>

But if the variable is empty, the blank still gets output, which means that we get two spaces instead of one. If the variable is empty, the blank should not be output. I can't think of any counterexample to this.
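
The requested behaviour amounts to joining only the non-empty pieces of the chunk content. A tiny Python sketch of that idea follows; `join_with_blanks` is a hypothetical name and the lexical units in the example are invented.

```python
def join_with_blanks(pieces):
    """Join non-empty pieces with single blanks, skipping empty ones."""
    return " ".join(piece for piece in pieces if piece)


# Empty variable: no leading blank is produced.
print(repr(join_with_blanks(["", "^venir<vblex>$"])))
# Non-empty variable: exactly one blank separates the two pieces.
print(repr(join_with_blanks(["^no<adv>$", "^venir<vblex>$"])))
```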

Something About Conditional Clip Setting

Something in here causes a segfault in rtx-comp:

vaux: (if (1.qst = qst)
          1(verb_nopers)[tense=inf]
       else-if ((1.tense>tense = pres) and
                ( (1.pos_tag = vbser and (1.person = p1 or 1.person = p3))
                  or
                  (1.person = p3 and 1.number = sg)))
          (if (1.negative = neg)
              [ 1(verb_pers) + not@adv ]
           else
              1(verb_pers) )
       else
          (if (1.negative = neg)
              [ 1(verb_nopers) + not@adv ]
           else
              1(verb_nopers) ) );

!!! MACROS !!!

qst_front: (always *(vaux)[lemh=(if (1.lem = жат or 1.lem = э) be
                                 if (1.lem = ал) can
                                 else do),
                           pos_tag=(if (1.lem = жат or 1.lem = э) vbser
                                    if (1.lem = ал) vbmod
                                    else vbdo),
                           person=1.person, number=1.number, tense=1.tense,
                           negative=1.negative, qst=NOqst, lemcase=1.lemcase]);

Probably in connection with lemh=(if ...)

+ between tree nodes does not disassemble as expected

Example:

$ echo "Күшік бар ма?" | apertium -d ../ kaz-kir

The tree is built correctly

[Chunk]: 
^Default<S>{
	^Күчүк<NP><ND><nom>{
		^Күчүк<nP><ND><nom>{
			^Күшік<n><nom>/Күчүк<n><nom>$
		}$
	}$
	^ээээ<VP><TD><aor><p3><sg>{
		^бар<AP>{
			^бар<aP>{
				^бар<adj>/бар<adj>$
			}$
		}$
		^ээээ<vP><cop><TD><aor><p3><sg>{
			^ээ<vP><cop><TD><aor><p3><sg>{
				^е<cop><aor><p3><sg>/э<cop><aor><p3><sg>$
			}$
			^ма<qst>/бы<qst>$
		}$
	}$
}$

But this rule

		2: AP %vP ?(2.v_type=cop & 2.tense=aor) { 1 + 2 } ;

doesn't disassemble correctly:

^Күчүк<n><nom>$ ^бар<adj>$^ээ<cop><aor><p3><sg>+бы<qst>$^?<sent>$^.<sent>$

Expected output is:

^Күчүк<n><nom>$ ^бар<adj>+э<cop><aor><p3><sg>+бы<qst>$^?<sent>$^.<sent>$

Unable to properly conjoin LUs

Sample rule file

prn: _.prn_type.person.gender.number ;
vblex: _.tense;
Clt: _.prn_type.person.gender.number ;
SV: _.vb_type.vb_cnj.tense.person.gender.number ;

gender = (GD m) m f nt @mf GD ;
number = (ND sg) sg pl @sp ND ;
person = (PD p3) p1 p2 p3 PD ;
prn_type = tn itg pro enc ;
tense = pri fti cni imp prs pis fts pii ifi inf ger pp ;
vb_type = vbhaver vblex vbmod vbser ;
vb_cnj = cnj impers ;

Clt ->  1: %prn.enc
                { %1 } ;

SV ->   2: %vblex.inf
                [$vb_cnj=impers]
                { (if($lu-count="2") [%1+>2] else %1 ) } |

        2: %SV.*.impers.inf Clt.enc
                { %1 < 2 } ;

Input

^cantar<vblex><inf>/cantar<vblex><inf>$ ^nos<prn><enc><p1><mf><pl>/ens<prn><enc><p1><mf><pl>$

Output

^cantar<vblex><inf>ens<Clt><enc><p1><mf><pl>$

Expected output

^cantar<vblex><inf>+ens<prn><enc><p1><mf><pl>$

Description

There are two aspects not working as I would expect, both related to the third rule:

- The + is missing in the output between the conjoined parts.
- The second part of the rule is output as the lemma and tags of the chunk created in the first (Clt) rule and not the content; that is, the content inside the chunk is completely ignored.

Thanks!

XML - set default weight based on number of tags in pattern

I have in apertium-quc-spa.quc-spa.rtx

    <def-cat n="n_rel">
      <cat-item tags="n.rel.*"/>
      <cat-item tags="n.rel"/>
    </def-cat>
    <def-cat n="n">
      <cat-item tags="n.*"/>
      <cat-item tags="n"/>
    </def-cat>

...


  <rule comment="rel" firstChunk="Rel">
   <pattern>
    <pattern-item n="n_rel"/>
   </pattern>
   <action>
      <call-macro n="f_set_chunk_name1"><with-param pos="1"/></call-macro>
      <call-macro n="f_concord1"><with-param pos="1"/></call-macro>
      <call-macro n="f_conv_poss1"><with-param pos="1"/></call-macro>
      <out>
       <chunk namefrom="chunkName">
        <tags>
         <tag><lit-tag v="Rel"/></tag>
        </tags>
        <lu><clip pos="1" side="tl" part="whole"/></lu>
       </chunk>
      </out>
   </action>
  </rule>

  <rule comment="n" firstChunk="N">
   <pattern>
    <pattern-item n="n"/>
   </pattern>
   <action>
      <call-macro n="f_set_chunk_name1"><with-param pos="1"/></call-macro>
      <call-macro n="f_concord1"><with-param pos="1"/></call-macro>
      <call-macro n="f_conv_poss1"><with-param pos="1"/></call-macro>
      <out>
       <chunk namefrom="chunkName">
        <tags>
         <tag><lit-tag v="N"/></tag>
         <tag><clip pos="1" side="tl" part="a_gen"/></tag>
         <tag><clip pos="1" side="tl" part="a_nbr"/></tag>
         <tag><clip pos="1" side="sl" part="a_poss"/></tag>
        </tags>
        <lu><clip pos="1" side="tl" part="lem"/>
            <clip pos="1" side="tl" part="a_sust"/>
            <clip pos="1" side="tl" part="a_gen"/>
            <clip pos="1" side="tl" part="a_nbr" link-to="3"/></lu>
       </chunk>
      </out>
   </action>
  </rule>

If I put the Rel rule after the N rule, the N rule matches, even though Rel has a longer pattern. Is this the desired behaviour? If not, then we could sort the rules before applying them to get longest match behaviour?
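
One way to get the behaviour asked about here is to derive a default ordering from how specific each pattern is. The Python sketch below counts the literal (non-wildcard) tags in the cat-items from the example above and sorts the rules accordingly; it illustrates the idea and is not how trx-comp currently assigns weights. All names are hypothetical.

```python
def specificity(cat_item_tags):
    """Count literal (non-wildcard) tags in a cat-item 'tags' attribute."""
    return sum(1 for tag in cat_item_tags.split(".") if tag and tag != "*")


def category_specificity(cat_items):
    """A category is as specific as its most specific cat-item."""
    return max(specificity(tags) for tags in cat_items)


# The def-cats and rules from the example above.
CATEGORIES = {"n_rel": ["n.rel.*", "n.rel"], "n": ["n.*", "n"]}
rules = [("n", ["n"]), ("rel", ["n_rel"])]   # (comment, pattern items)


def rule_key(rule):
    """Sort key: longer patterns first, then more specific ones."""
    comment, pattern = rule
    return (len(pattern),
            sum(category_specificity(CATEGORIES[c]) for c in pattern))


ordered = sorted(rules, key=rule_key, reverse=True)
print([comment for comment, _ in ordered])
```

Here the "rel" rule sorts ahead of "n" regardless of the order in the file, because n.rel carries two literal tags against one.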

Generate chunks with internal structure

It would be nice if we could go from n.ins to PP{NP{n} with@post} in a single rule.

Match unknown words

Some t1x rules match unknown words with

    <def-cat n="unknown">
      <cat-item tags=""/>
    </def-cat>

Perhaps * in the pattern for rtx should do the same?

We would have to make sure this doesn't conflict with the current implementation of lookahead.

better postchunk rules for trx

@ftyers @khannatanmai
Regarding introspecting chunks, there isn't really a way to do that in parsing rules and my way of implementing postchunk rules (copying the structure of t3x with lemma matching) is really messy.

What do you think of rules like this?

<output-rule name="insert_det">
  <!-- assorted manipulations -->
  <out>
    <lu>...</lu>
    ...
  </out>
  <!-- or -->
  <output-all/>
  <!-- for if you changed tags and things but haven't added or removed any chunks -->
</output-rule>

<rule>
  <pattern>...</pattern>
  <action>
    <out>
      <chunk output-rule="insert_det">...</chunk>
    </out>
  </action>
</rule>

Means of modifying which output rule was set for a chunk could of course also be added (whether with <let> and <clip> or with specialized instructions, I'm not sure).

Compile .t*x Files

- I ensured that all rules have separate paths, but this prevents checking for conflicts.
- Some words occasionally lose tags, particularly <ger> and <prn>.
- A few empty chunks are produced.

Error Messages from rtx-comp Should Report What Macro is being Compiled

Discard More Branches

GLR mode is currently somewhat slow. The best option for speeding it up is probably to find some way of discarding paths early, rather than holding on to all of them until completion or a parse error.

numbers get repeated?

@mr-martian, I'm having a weird issue. Currently the following sentence

Ал 17-кылымда төрөлгөн.

translates as

He was borne in the 1717th century.

Do you have any idea why we get the number doubled?

Lemmas with queues cannot be properly conjoined

Currently it does not seem possible to properly conjoin lemmas when the first one has a queue. This is very common with verbs and clitics and can be done easily in apertium-transfer, which allows specifying the clipping position of lemh and lemq. Example with a verb and a clitic:

Source

^tenir# en compte<vblex><ger>$ ^ho<prn><enc><p3><nt>$

Current output

^tenir<vblex><ger># en compte+ho<prn><enc><p3><nt>$

Expected output

^tenir<vblex><ger>+ho<prn><enc><p3><nt># en compte$

For lemmas that are not yet inside of a chunk, it seems straightforward. However, given that it is also possible to conjoin whole chunks, I am not entirely sure how this would be handled.
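
For the simple case of two bare lexical units, the expected output above can be produced by splitting the first lemma into head (lemh) and queue (lemq) and moving the queue to the end. The Python sketch below shows only that string manipulation, under the assumption that the queue starts at the first `#` in the lemma; it says nothing about how conjoined chunks should be handled, and the function names are hypothetical.

```python
def split_lu(lu):
    """Split 'lemh# queue<tags>' into (lemh, lemq, tags)."""
    tag_start = lu.find("<")
    if tag_start == -1:
        lemma, tags = lu, ""
    else:
        lemma, tags = lu[:tag_start], lu[tag_start:]
    hash_pos = lemma.find("#")
    if hash_pos == -1:
        return lemma, "", tags
    return lemma[:hash_pos], lemma[hash_pos:], tags


def conjoin(first, second):
    """Conjoin two LUs, placing the queue of the first after the second."""
    lemh, lemq, tags = split_lu(first)
    return "^" + lemh + tags + "+" + second + lemq + "$"


print(conjoin("tenir# en compte<vblex><ger>", "ho<prn><enc><p3><nt>"))
```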

Clip from variables

I'd like to be able to clip from variables. It should probably work like this:

<clip part="a_agr"><var n="copula"/></clip>

Where the variable is passed to clip as an argument by putting the var tag inside the clip tag.

Multiple Tag-Rewrite Rules

There's currently no way for the compiler to do tag conversion x > y > z. If macros try to rewrite tags, the last one takes precedence and we get x > z. The main trick for fixing this will be to avoid inappropriate applications of x > x.

Interpolation in RTX

In TRX it is possible to insert chunks into other chunks:

<chunk ...>
  <tags>...</tags>
  <clip pos="1" part="chcontent" side="tl"/>
  <b/>
  <clip pos="2" part="whole" side="tl"/>
</chunk>

The relevant postchunk rule can then check the value of <lu-count/> to see if anything has been inserted and move it around if so.

This is currently not possible in RTX and maybe should be.

variables get cleared between macro calls?

    <def-macro n="f_link_concord1" npar="1">
      <choose>
        <when><test><not><equal><clip pos="1" side="tl" part="a_nbr"/><lit-tag v="sp"/></equal></not></test>
                    <let><var n="chunkNumero"/><clip pos="1" side="tl" part="a_nbr"/></let>
                    <let><clip pos="1" side="tl" part="a_nbr"/><lit-tag v="3"/></let>
        </when>
        <otherwise>
            <let><var n="chunkNumero"/><lit-tag v="ND"/></let><!--<clip pos="1" side="tl" part="a_nbr"/></let>-->
        </otherwise>
      </choose>
      <choose>
        <when><test><not><equal><clip pos="1" side="tl" part="a_gen"/><lit-tag v="mf"/></equal></not></test>
                    <let><var n="chunkGenero"/><clip pos="1" side="tl" part="a_gen"/></let>
                    <let><clip pos="1" side="tl" part="a_gen"/><lit-tag v="2"/></let>
        </when>
        <otherwise>
             <let><var n="chunkGenero"/><lit-tag v="GD"/></let><!--<clip pos="1" side="tl" part="a_gen"/></let>-->
        </otherwise>
      </choose>
    </def-macro>

There should be no way for the variables chunkGenero and chunkNumero to be empty after calling this macro: each should be either a non-<sp>, non-<mf> value or <ND> or <GD>. But that is what happens:


  <rule comment="det SN" firstChunk="SD">
   <pattern>
    <pattern-item n="det"/>
    <pattern-item n="SN"/>
   </pattern>
   <action>
      <call-macro n="f_set_chunk_name1"><with-param pos="2"/></call-macro>
      <!--<call-macro n="f_concord2"><with-param pos="2"/><with-param pos="1"/></call-macro>-->
      <call-macro n="f_link_concord1"><with-param pos="2"/></call-macro>
      <!--<call-macro n="f_link_concord1"><with-param pos="1"/></call-macro>-->
      <!--<call-macro n="f_set_determiner2"><with-param pos="2"/><with-param pos="1"/></call-macro>-->
      <out>
       <chunk namefrom="chunkName">
        <tags>
         <tag><lit-tag v="SD"/></tag>
         <tag><var n="chunkGenero"/></tag>
         <tag><var n="chunkNumero"/></tag>
         <tag><clip pos="2" side="tl" part="a_poss"/></tag>
        </tags>
<var n="numero"/><var n="genero"/><var n="chunkNumero"/><var n="chunkGenero"/>
        <lu><clip pos="1" side="tl" part="whole"/></lu>
        <b/>
        <lu><clip pos="2" side="tl" part="whole"/></lu>
       </chunk>
      </out>
   </action>
  </rule>

Output

^parábola<SD>{
	^<sg>$
	^<m>$
	^$
	^$
	^este<det><dem><m><sg>$
	^parábola<SN>{
		^uno<Num>{
			^uno<det><ind><2><3>$
		}$
		^parábola<SN>{
			^parábola<N>{
				^parábola<n><2><3>$
			}$
		}$
	}$
}$
