
apertium / apertium-recursive

Recursive structural transfer module for Apertium

Home Page: https://wiki.apertium.org/wiki/Apertium-recursive

License: GNU General Public License v3.0

Makefile 0.63% Python 14.84% Lex 0.08% Yacc 2.08% C++ 79.26% Shell 0.32% M4 1.49% C 0.78% Vim Script 0.53%
apertium-tools apertium-core

apertium-recursive's Introduction

Apertium-recursive

A recursive structural transfer module for Apertium

Compiling

./autogen.sh
make

Running

# compile the rules file
src/rtx-comp rule-file bytecode-file

# run the rules
src/rtx-proc bytecode-file < input

# decompile the rules and examine the bytecode
src/rtx-decomp bytecode-file text-file

# compile XML rule files
src/trx-comp bytecode-file xml-files...

# generate random sentences from a rules file
apertium-recursive/src/randsen.py start_node pair_directory source_language_directory

Options for rtx-comp:

  • -e don't compile a rule with a particular name
  • -l load lexicalized weights from a file
  • -s output summaries of the rules to stderr

Options for trx-comp:

  • -l load lexicalized weights from a file

Options for rtx-proc:

  • -a indicates that the input comes from apertium-anaphora
  • -f trace which parse branches are discarded
  • -r print which rules are applying
  • -s trace the execution of the bytecode interpreter
  • -t mimic the behavior of apertium-transfer and apertium-interchunk
  • -T print the parse tree rather than applying output rules
  • -b print both the parse tree and the output
  • -m set the mode of tree output, available modes are:
    • nest (default) print the tree as text indented with tabs
    • flat print the tree as text
    • latex print the tree as LaTeX source using the forest library
    • dot print the tree as a Dot graph
    • box print the tree using box-drawing characters
  • -e a combination of -f and -r
    • Intended use: rtx-proc -e -m latex rules.bin < input.txt 2> trace.tex
  • -F filter branches for things besides parse errors (experimental)

Testing

make test

Using in a Pair

In Makefile.am add:

$(PREFIX1).rtx.bin: $(BASENAME).$(PREFIX1).rtx
	rtx-comp $< $@

$(PREFIX2).rtx.bin: $(BASENAME).$(PREFIX2).rtx
	rtx-comp $< $@

and add

$(PREFIX1).rtx.bin \
$(PREFIX2).rtx.bin

to TARGETS_COMMON.

In modes.xml, replace apertium-transfer, apertium-interchunk, and apertium-postchunk with:

<program name="rtx-proc">
  <file name="abc-xyz.rtx.bin"/>
</program>

Documentation

apertium-recursive's People

Contributors

khannatanmai, mr-martian, nlhowell, tinodidriksen, unhammer

apertium-recursive's Issues

Restructure Output Conditionals

Right now, output conditionals are compiled on the input side and so cannot access chunk tags set higher up the tree. Fixing this will probably require moving the chunk surface information from OutputChunk to Rule.

@jonorthwash

Embed chunks in chunks


  <rule comment="Det SN" firstChunk="SD">
   <pattern>
    <pattern-item n="Det"/>
    <pattern-item n="SN"/>
   </pattern>
   <action>
      <call-macro n="f_set_chunk_name1"><with-param pos="2"/></call-macro>
      <call-macro n="f_concord2"><with-param pos="2"/><with-param pos="1"/></call-macro>
      <call-macro n="f_link_concord2"><with-param pos="2"/><with-param pos="1"/></call-macro>
      <call-macro n="f_set_determiner2"><with-param pos="2"/><with-param pos="1"/></call-macro>
      <out>
       <chunk namefrom="chunkName">
        <tags>
         <tag><lit-tag v="SD"/></tag>
         <tag><clip pos="2" side="tl" part="a_gen"/></tag>
         <tag><var n="numero"/></tag>
         <tag><clip pos="2" side="tl" part="a_poss"/></tag>
        </tags>
        <chunk>
          <lemma><clip pos="1" side="tl" part="lem"/></lemma>
           <tags><clip pos="1" side="tl" part="tags"/></tags>
           <lu><clip pos="1" side="tl" part="lem"/>
                  <clip pos="1" side="tl" part="tags"/></lu>
         </chunk>
        <b/>
        <lu><clip pos="2" side="tl" part="whole"/></lu>
       </chunk>
      </out>
   </action>
  </rule>

Desired output:

        ^madre<SD><f><sg><px1sg>{
                ^mío<Det><det><pos><mf><3>{
                    ^mío<det><pos><mf><3>$
                 }$
                ^madre<SN><f><3><px1sg>{
                        ^madre<N><f><3><px1sg>{
                                ^madre<n><f><3>$
                        }$
                }$
        }$

Tool to display TL tree

Probably easiest as a Python script that parses the output of rtx-proc -Tr and works out what the tree would be if we weren't discarding nodes as we descend.
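As a starting point for such a script, here is a minimal sketch of the parsing half only: it reads the nested ^...{ ... }$ chunk notation that appears in the traces on this page and prints it as an indented tree. The function names parse_nodes and render are hypothetical, escaped characters are not handled, and reconstructing discarded nodes is not attempted.

```python
def parse_nodes(text, pos=0):
    """Parse sibling nodes starting at pos; return (nodes, new_pos).

    Each node is a (label, children) tuple. This assumes the trace
    notation ^label{ children }$ for chunks and ^label$ for leaves.
    """
    nodes = []
    while pos < len(text):
        if text.startswith("}$", pos):
            break  # end of the enclosing chunk; the caller consumes it
        if text[pos] != "^":
            pos += 1  # skip whitespace between nodes
            continue
        end = pos + 1
        while end < len(text) and text[end] not in "{$":
            end += 1
        if end == len(text):
            raise ValueError("unterminated node")
        label = text[pos + 1:end]
        if text[end] == "{":
            children, pos = parse_nodes(text, end + 1)
            pos += 2  # skip the closing "}$"
        else:
            children, pos = [], end + 1
        nodes.append((label, children))
    return nodes, pos


def render(nodes, depth=0):
    """Return the tree as a list of lines indented by depth."""
    lines = []
    for label, children in nodes:
        lines.append("  " * depth + label)
        lines.extend(render(children, depth + 1))
    return lines


if __name__ == "__main__":
    sample = "^bonic<SAdj><GD><ND>{\n\t^beautiful<adj>/bonic<adj>$\n}$"
    tree, _ = parse_nodes(sample)
    print("\n".join(render(tree)))
```

A real tool would still need to pick the relevant chunks out of the surrounding trace text before handing them to a parser like this.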

Rules not being applied in all cases

Hello,

I am working on this set of rules: https://github.com/apertium/apertium-eng-cat/blob/recursive-eng-cat/apertium-eng-cat.eng-cat.rtx

Given the following input:
^beautiful<adj>/bonic<adj>/$^,<cm>/,<cm>/$

The output of rtx-proc is:
^bonic<adj>$^,<cm>$

However, given that there is a rule on line 107 for "adj", I would expect the output to be:
^bonic<adj><m><sg>$^,<cm>$

This rule (line 107) is applied correctly when only the first token is present.

Taking a look at the summary of the operations, it looks like the rule is recognised but then discarded for some reason:

Reading Input:
^beautiful<adj>/bonic<adj>$

Checking for reductions for branch 1

Applying rule 16 (line 107) to branch 1 with weight 0: ^beautiful<adj>/bonic<adj>$

^bonic<SAdj><GD><ND>{
        ^beautiful<adj>/bonic<adj>$
}$

Checking for reductions for branch 1
No further reductions possible for branch 1.

Splitting stack and creating branch 2
Branch 1: 1 nodes, weight = 0
[Chunk]:
^bonic<SAdj><GD><ND>{
        ^beautiful<adj>/bonic<adj>$
}$
Branch 2: 1 nodes, weight = 0
[Chunk]:
^beautiful<adj>/bonic<adj>$

Filtering Branches:
Branch 1  has no possible continuations.
Branch 2
Reading Input:
^,<cm>/,<cm>$

Checking for reductions for branch 2
No further reductions possible for branch 2.
Branch 2: 3 nodes, weight = 0
[Chunk]:
^beautiful<adj>/bonic<adj>$
[Blank]:
[Chunk]:
^,<cm>/,<cm>$

Filtering Branches:
Input buffer is empty.
Branch 2  has no active branch to compare to.

************************************************************
************************************************************
************************************************************
Outputting Branch 2

[Chunk]:
^beautiful<adj>/bonic<adj>$
[Blank]:
[Chunk]:
^,<cm>/,<cm>$
************************************************************
************************************************************
************************************************************

Output Node:
^beautiful<adj>/bonic<adj>$

^bonic<adj>$Output Node:
^,<cm>/,<cm>$

^,<cm>$

Is this behaviour correct? Am I missing something in my set of rules to handle this? Thanks!

Rule internal variables

I'd like to be able to declare variables locally and be able to clip from them and pass them to macros.

  <rule comment="v_iv" firstChunk="VI">
   <local>
      <var n="copula"/>
   </local>
   <pattern>
    <pattern-item n="v_iv"/>
   </pattern>
   <action>
      <let><var n="copula"/><clip pos="1" side="tl" part="whole"/></let>
     <call-macro n="f_set_chunk_name1"><with-param pos="1"/></call-macro>
     <choose>
       <when><test><equal><clip pos="1" side="tl" part="a_pred"/><lit-tag v="adj.pred"/></equal></test>
        <let><clip pos="1" side="tl" part="a_pred"/><lit-tag v="adj"/></let>
        <call-macro n="f_concord1"><with-param pos="1"/></call-macro>
        <call-macro n="f_agr_fin_verb1"><with-param><var n="copula"/></with-param></call-macro>
        <call-macro n="f_conj_fin_verb1"><with-param><var n="copula"/></with-param></call-macro>
        <out>
         <chunk namefrom="chunkName">
          <tags>
           <tag><lit-tag v="V.iv"/></tag>
           <tag><clip pos="1" side="tl" part="tags"/></tag>
          </tags>
          <lu><lit v="estar"/><lit-tag v="vblex"/><clip part="a_agr"><var n="copula"/></clip> </lu><b/>
          <lu><clip pos="1" side="tl" part="lem"/></lu>
         </chunk>
        </out>
      </when>

Interface for Weight Learning and Lexicalization

It should be as straightforward as possible to incorporate learned weights into a ruleset.

Perhaps it should be possible to put labels on rules, maybe something like

DP -> "pos-s" DP de@pr DP { 3 's@gen 1 } |
      "pos-of" DP de@pr DP { 3 of@pr 1 } ;

The compiler could accept a file of data like

pos-s	1.7	*@DP.* de@pr *@DP.np.*
pos-of	2.1	*@DP.* de@pr noodle@DP.*

Which it could then incorporate into the transducer directly.

Using rule numbers would also be possible, but that seems like a more fragile system.

It could also be possible for this file to be used to exclude rules, allowing the learning program to try different combinations.
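To make the proposed data file concrete, here is a small sketch of a reader for it. It assumes exactly the layout shown above (label, weight and pattern separated by tabs, pattern tokens separated by spaces); the names WeightEntry and parse_weights are hypothetical, and this is not necessarily the format that the -l option accepts.

```python
from typing import List, NamedTuple


class WeightEntry(NamedTuple):
    label: str          # rule label, e.g. "pos-s"
    weight: float       # learned weight for this lexicalized pattern
    pattern: List[str]  # one token per matched element


def parse_weights(text):
    """Parse tab-separated 'label<TAB>weight<TAB>pattern' lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        label, weight, pattern = line.split("\t")
        entries.append(WeightEntry(label, float(weight), pattern.split()))
    return entries


SAMPLE = (
    "pos-s\t1.7\t*@DP.* de@pr *@DP.np.*\n"
    "pos-of\t2.1\t*@DP.* de@pr noodle@DP.*\n"
)

if __name__ == "__main__":
    for entry in parse_weights(SAMPLE):
        print(entry.label, entry.weight, entry.pattern)
```

Keying entries by label rather than by rule number means that reordering rules in the source file would not invalidate a learned weights file.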

@ftyers

Chunk Variables in TRX

TRX should have a way to manipulate chunk variables. Possibly something like

<let>
 <chunk-var n="wh_word"/>
 <clip pos="1" part="whole"/>
</let>

<clip var="wh_word" part="number"/>

There should probably also be some changes to <section-def-vars> or an additional <section-def-chunk-vars>. Either way, the DTD should be updated.

Warn on Propagation of Unused Variable

S: _ ;
S -> DP.$number VP { 1 _1 2 } ;

It would be nice if a situation like this could issue a warning or error along the lines of

S has no tag 'number'.

Exit code 139 on execution with blank input

The rtx-proc program currently exits with code 139 when it is executed with blank input. For example:

echo "" | rtx-proc cat-ron.rtx.bin

Other Apertium modules simply return the blank output and exit with code 0.
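For context, shells conventionally report a process killed by signal N as exit status 128 + N, so 139 usually means signal 11, a segmentation fault. The helper below is only an illustration of that convention (describe_exit_status is a hypothetical name), and a program can also return a value above 128 deliberately.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell exit status using the 128 + signal convention."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with code {status}"


if __name__ == "__main__":
    print(describe_exit_status(139))
    print(describe_exit_status(0))
```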

how to override POS tag

I'm trying to override the POS tag manually.

An example rule:

      8: бар@AP %vP.cop [$barjoq=barjoq] { %*(vblex)[lemh=have] } |

Here we [correctly] end up with [email protected] instead of [email protected]. The issue is how to override vP with vblex. I could force it with an attribute that applies to both of those, but I think there should be an equivalent of lemh for first tag? @mr-martian, do you know if it exists / what it is?

Should chcontent return the part outside the contents

Should chcontent return the {}?

e.g.

	^madre<SD><f><sg><px1sg>{

			^el<det><def><2><3>$

		^madre<SN><f><3><px1sg>{
			^madre<N><f><3><px1sg>{
				^madre<n><f><3>$
			}$
		}$
	}$

or

^madre<SD><f><sg><px1sg>{
		^{
			^el<det><def><2><3>$
		}$
		^madre<SN><f><3><px1sg>{
			^madre<N><f><3><px1sg>{
				^madre<n><f><3>$
			}$
		}$
	}$

i=yes for rules

It would be great to be able to use i="yes" on any element (or at least on <rule> and <choose>).

Can't have rules with identical patterns but different conditions

Given a set of rules like this:

NP -> n ?(1.number = sg) { ... } ;
NP -> n ?(1.number = pl) { ... } ;

The two rules will have the same path in the transducer and thus a single end state will be associated with each of them. Currently any particular state can only have one associated rule and the reference to the second rule will be discarded.

The particular situation above should have been written as a single rule, but this will nonetheless fail unintuitively. Also, there could be rules that generate different chunks from input with the same path.

This would require modifying pattern.cc to output both rule numbers and matcher.h to store multiple rules. Perhaps MatchExe2 should store a map<int, vector<pair<int, double>>> and then MatchNode2 can just have an isEndOfRule flag.
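The shape of that proposal can be sketched in Python as a mapping from end state to a list of (rule number, weight) pairs. This only illustrates the data structure; RuleTable and its methods are hypothetical names and do not reflect the actual C++ classes.

```python
from collections import defaultdict


class RuleTable:
    """Associate each end state with every rule that finishes there."""

    def __init__(self):
        # end state -> [(rule number, weight), ...]
        self._rules = defaultdict(list)

    def add(self, state, rule, weight):
        # Appending keeps earlier rules instead of overwriting them.
        self._rules[state].append((rule, weight))

    def rules_at(self, state):
        return list(self._rules.get(state, []))

    def is_end_of_rule(self, state):
        return state in self._rules


if __name__ == "__main__":
    table = RuleTable()
    table.add(7, 1, 0.0)  # NP -> n ?(1.number = sg)
    table.add(7, 2, 0.0)  # NP -> n ?(1.number = pl), same path
    print(table.rules_at(7))
```

With this layout, the interpreter would try each rule stored at a state and let the conditions decide which ones apply.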

Lookahead for XML files

Using trx-comp we can compile files in the .t*x format and use them as recursive rules. However, trx-comp does not generate lookahead paths and so loses the speedups from #6.

Doing this will require working out the part of speech tag of every generated chunk, which may force all chunks to have literal pos tags.

Better XML syntax design

Current XML syntax was borrowed directly from t1x and t2x and doesn't correspond very well to how rtx-comp works, which makes certain things (particularly agreement) much more difficult than they should be.

<rule>
  <pattern>
    <pattern-item n="SA"/>
    <pattern-item n="SN"/>
  </pattern>
  <input-action>
    <out>
      <chunk>
        <lemma><clip pos="2" part="lem" side="tl"/></lemma>
        <tags>
          <tag><lit-tag v="SN"/></tag>
          <tag><clip pos="2" part="number" side="tl"/></tag>
        </tags>
        <input-elements/>
      </chunk>
    </out>
  </input-action>
  <output-action>
    <let>
      <clip pos="1" part="number" side="tl"/>
      <clip pos="0" part="number" side="tl"/>
    </let>
    <out>
      <lu pos="2"/>
      <lu pos="1" blank-after="no"/>
    </out>
  </output-action>
</rule>

@ftyers How does this look for a first draft?

Crashes on macros that output nothing

$ echo "айтшы!" | apertium -d .. kaz-kir-transfer

Reading Input:
^айт<v><tv><imp><p2><sg>/айт<v><tv><imp><p2><sg>$

Checking for reductions for branch 1

Applying rule 16 (line 197) to branch 1 with weight 2: ^айт<v><tv><imp><p2><sg>/айт<v><tv><imp><p2><sg>$

 -> rule was rejected
This rule was rejeced.


Applying rule 15 (line 196) to branch 1 with weight 1: ^айт<v><tv><imp><p2><sg>/айт<v><tv><imp><p2><sg>$

tried to pop Chunk but mode is 1

If this word appears in a text, only the text up to that word is output.

GSoC Progress

General TODO list:

  • Formalism and compiler
    • Multiple output chunks
    • Interpolation (for clitics, etc.)
      • This is currently possible but potentially messier than we want
    • Capitalization (probably using pseudo attribute "lemcase")
    • Clean up distinction between input and output rules
    • Tag replacement rules ?
    • Implement non-overwriting (currently being parsed and ignored)
    • Conjoined LUs
  • Interpreter
    • Profile
    • Speed up, if possible (currently takes about twice as long as chunker/interchunk/postchunk)
  • XML compiler
    • Understand what postchunk does
    • Deal with naming conflicts
    • Detect literal chunks
    • Test
  • eng->spa rules
    • Verbs are a bit of a mess
    • and seem to get overwritten a lot (this should maybe be a compiler thing)
  • Tests
  • Documentation
    • More examples
    • More comments in code
      • Comments in header files
      • Explanations of algorithms in code
    • And just clean up code in general
    • Ensure documentation aligns with current behavior

String Variables in RTX

RTX rules should have some way of reading and setting global string variables.

#15 should probably be resolved first.

Automatic Capitalization

Postchunk automatically changes the capitalization of lexical units based on the pseudolemma of the chunk. We can modify case, but right now it's entirely manual.

Conjoining LUs to Macros

Because the compiler currently can't determine whether an if statement will have output or not, conjoining things inside and outside of if statements is currently disallowed. Unfortunately, this also disallows conjoining to macros.

VP -> vblex.inf prn.obj {1 + 2};

If vblex is defined using a macro, the above rule will fail to compile and it seems like something we would want to compile.

Alternatively, this situation could require

VP -> vblex.inf prn.obj {1(prn_num=2.number, prn_pers=2.person, prn_gen=2.gender)};

features coming from /ref by default??

I seem to be getting subject—possessed-object number agreement, apparently because of /ref?

In the sentence "Мышыктардын үйү бар.", for example, the second noun gets <pl> when only /ref has any indication of pl.

Applying rule 2 (line 218) to branch 2 with weight 1: ^үй<n><px3sp><nom>/house<n><px3sp><nom>/Cat<n><pl><gen>$

^үй<NP><n><pl><px3sp><nom><in><GD>{
	^үй<n><px3sp><nom>/house<n><px3sp><nom>/Cat<n><pl><gen>$
}$

The rule in question is the following:

NP ->	
      1: n.$nptype.$lem/sl.$case/sl.$number.$poss
         [$prep_flag=(if (1.lem in en_nouns_loc_in) in
                      else-if (1.lem in en_nouns_loc_on) on
                      else-if (1.lem in en_nouns_loc_at) at
                      else 1.case/sl>prep_flag
                     ),
          $REF_gender=1.gender/ref>REF_gender
         ] { 1[lemcase=$lemcase, number=$number] } |

Concordance macros for Spanish

The old transfer had a set of concordance macros to determine gender and number in cases like the following:

adj.m.sg n.m.sg		el coche nuevo
adj.m.sg n.m.sp		el saltamontes nuevo 
adj.m.sg n.mf.sg	el criminal nuevo
adj.m.sg n.mf.sp	el trotamundos nuevo
adj.m.pl n.m.pl		los coches nuevos
adj.m.pl n.m.sp		los saltamontes nuevos
adj.m.pl n.mf.pl	los criminales nuevos
adj.m.pl n.mf.sp	los trotamundos nuevos
*adj.m.sp n.m.sg		
*adj.m.sp n.m.pl
*adj.m.sp n.m.sp
*adj.m.sp n.mf.sg
*adj.m.sp n.mf.pl
*adj.m.sp n.mf.sp
adj.f.sg n.f.sg		la casa nueva
adj.f.sg n.f.sp		la tesis nueva
adj.f.sg n.mf.sg	la criminal nueva
adj.f.sg n.mf.sp	la trotamundos nueva
adj.f.pl n.f.pl		las casas nuevas
adj.f.pl n.f.sp		las tesis nuevas
adj.f.pl n.mf.pl	las criminales nuevas
adj.f.pl n.mf.sp	las trotamundos nuevas
*adj.f.sp n.f.sg
*adj.f.sp n.f.pl
*adj.f.sp n.f.sp
*adj.f.sp n.mf.sg
*adj.f.sp n.mf.pl
*adj.f.sp n.mf.sp
adj.mf.sg n.m.sg	el coche interesante
adj.mf.sg n.m.sp	el saltamontes interesante
adj.mf.sg n.f.sg	la casa interesante
adj.mf.sg n.f.sp	la tesis interesante
adj.mf.sg n.mf.sg	_ criminal interesante
adj.mf.sg n.mf.sp	_ trotamundos interesante
adj.mf.pl n.m.pl	los coches interesantes
adj.mf.pl n.m.sp	los saltamontes interesantes
adj.mf.pl n.f.pl	las casas interesantes
adj.mf.pl n.f.sp	las tesis interesantes
adj.mf.pl n.mf.pl	_ criminales interesantes
adj.mf.pl n.mf.sp	_ trotamundos interesantes
adj.mf.sp n.m.sg	el coche salvavidas
adj.mf.sp n.m.pl	los coches salvavidas
adj.mf.sp n.m.sp	_ saltamontes salvavidas
adj.mf.sp n.f.sg	la casa salvavidas
adj.mf.sp n.f.pl	las casas salvavidas
adj.mf.sp n.f.sp	_ tesis salvavidas
adj.mf.sp n.mf.sg	_ criminal salvavidas
adj.mf.sp n.mf.pl	_ criminales salvavidas
adj.mf.sp n.mf.sp	_ trotamundos salvavidas

It would be great to have similar macros for this transfer. They are called f_concord1, f_concord2 etc. In principle we should only need 1 and 2 here.
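The pattern in the table can be summarized in a short sketch: the noun's value wins unless it is underspecified (mf or sp), then the adjective's value is used, and if both are underspecified the result stays undetermined (the rows marked _ above). This is a reading of the table and not the old macros themselves; concord and resolve are hypothetical names, GD and ND are borrowed from other rules on this page as the undetermined values, and the starred combinations are not checked.

```python
def resolve(noun_value, adj_value, underspecified, default):
    """Pick the first specified value, preferring the noun's."""
    for value in (noun_value, adj_value):
        if value != underspecified:
            return value
    return default


def concord(adj_gender, adj_number, noun_gender, noun_number):
    """Return the (gender, number) that the phrase should agree in."""
    gender = resolve(noun_gender, adj_gender, "mf", "GD")
    number = resolve(noun_number, adj_number, "sp", "ND")
    return gender, number


if __name__ == "__main__":
    print(concord("m", "sg", "mf", "sp"))   # el trotamundos nuevo
    print(concord("mf", "sp", "m", "sp"))   # _ saltamontes salvavidas
    print(concord("mf", "sg", "mf", "sg"))  # _ criminal interesante
```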

Have macros check %

vP -> %vaux { 1 } ;
Error in macro 'vP', invoked by rule beginning on line 189 of apertium-kaz-kir.kaz-kir.rtx: Macro not given value for attribute 'fin_nonfin'.

Macro expansion should check the grab_all property.

Make String Variables Branch-Specific

Global string variables would be more useful if they were branch-specific like chunk variables.

For efficiency, they should probably be stored in a fixed array like the chunk variables with the size specified in the compiled file.

An array of initial values can also be specified, similar to how initial values are currently set.

Automatic resolving of linked properties

Sometimes you want to be able to link stuff to a chunk tag, but the exact position of the tag is uncertain because there may or may not be intervening tags. e.g. if you have
^SN<px3sg><m><sg>$ and ^SN<m><sg>$, you may want to do,

        <lu><clip pos="1" side="tl" part="a_gen" link-to="3"/></lu>

But that would have to be 2 in the case of the second example. Usually tags in a chunk are not ambiguous, so would it be possible to do something like,

<lu><clip pos="1" side="tl" part="a_gen" link/></lu>

And have it automatically resolve? e.g. pull out the part of the chunk that corresponds to a_gen?
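One way the lookup could work is sketched below: scan the chunk's tags for the first one that belongs to the attribute's value set and use its position. The value set A_GEN and the function name are assumptions for illustration, and positions are counted with the chunk-name tag (SN) as position 1 so that the results match the 3 and 2 mentioned above.

```python
# Assumed value set for a_gen; the real one comes from the rule file.
A_GEN = {"m", "f", "mf", "nt", "GD"}


def find_link_position(chunk_tags, attribute_values):
    """Return the 1-based position of the first tag in the attribute, or None."""
    for position, tag in enumerate(chunk_tags, start=1):
        if tag in attribute_values:
            return position
    return None


if __name__ == "__main__":
    print(find_link_position(["SN", "px3sg", "m", "sg"], A_GEN))
    print(find_link_position(["SN", "m", "sg"], A_GEN))
```

This relies on the observation above that tags in a chunk are usually unambiguous; if two tags matched the same attribute, only the first would be linked.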

Don't output a blank if an LU is empty

Sometimes we want to do things like:

      <out>
       <chunk name="SV">
        <tags>
         <tag><lit-tag v="SV"/></tag>
          <tag><clip pos="1" side="tl" part="a_val"/></tag>
        </tags>
        <var n="CI"/> <b/>
        <lu> <clip pos="1" side="tl" part="whole"/> </lu>
       </chunk>
      </out>

But if the variable is empty, the blank still gets output, which means that we get two spaces instead of one. If the variable is empty, the blank should not be output. I can't think of any counterexample to this.
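The requested behaviour can be sketched as a small output joiner that drops a blank when the item immediately before it is empty. BLANK and join_output are hypothetical names, and the real interpreter works on its own internal structures and not on Python lists.

```python
BLANK = object()  # stands in for <b/>


def join_output(items):
    """Concatenate output items, skipping a blank that follows an empty item."""
    pieces = []
    previous = None
    for item in items:
        if item is BLANK:
            if previous != "":
                pieces.append(" ")
        else:
            pieces.append(item)
        previous = item
    return "".join(pieces)


if __name__ == "__main__":
    # Empty variable: no stray space before the verb.
    print(repr(join_output(["", BLANK, "^venir<vblex>$"])))
    # Non-empty variable: the blank is kept.
    print(repr(join_output(["^se<prn>$", BLANK, "^venir<vblex>$"])))
```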

Something About Conditional Clip Setting

Something in here causes a segfault in rtx-comp:

vaux: (if (1.qst = qst)
          1(verb_nopers)[tense=inf]
       else-if ((1.tense>tense = pres) and
                ( (1.pos_tag = vbser and (1.person = p1 or 1.person = p3))
                  or
                  (1.person = p3 and 1.number = sg)))
          (if (1.negative = neg)
              [ 1(verb_pers) + not@adv ]
           else
              1(verb_pers) )
       else
          (if (1.negative = neg)
              [ 1(verb_nopers) + not@adv ]
           else
              1(verb_nopers) ) );

!!! MACROS !!!

qst_front: (always *(vaux)[lemh=(if (1.lem = жат or 1.lem = э) be
                                 if (1.lem = ал) can
                                 else do),
                           pos_tag=(if (1.lem = жат or 1.lem = э) vbser
                                    if (1.lem = ал) vbmod
                                    else vbdo),
                           person=1.person, number=1.number, tense=1.tense,
                           negative=1.negative, qst=NOqst, lemcase=1.lemcase]);

Probably in connection with lemh=(if ...)

+ between tree nodes does not disassemble as expected

Example:

$ echo "Күшік бар ма?" | apertium -d ../ kaz-kir

The tree is built correctly

[Chunk]: 
^Default<S>{
	^Күчүк<NP><ND><nom>{
		^Күчүк<nP><ND><nom>{
			^Күшік<n><nom>/Күчүк<n><nom>$
		}$
	}$
	^ээээ<VP><TD><aor><p3><sg>{
		^бар<AP>{
			^бар<aP>{
				^бар<adj>/бар<adj>$
			}$
		}$
		^ээээ<vP><cop><TD><aor><p3><sg>{
			^ээ<vP><cop><TD><aor><p3><sg>{
				^е<cop><aor><p3><sg>/э<cop><aor><p3><sg>$
			}$
			^ма<qst>/бы<qst>$
		}$
	}$
}$

But this rule

		2: AP %vP ?(2.v_type=cop & 2.tense=aor) { 1 + 2 } ;

doesn't disassemble correctly:

^Күчүк<n><nom>$ ^бар<adj>$^ээ<cop><aor><p3><sg>+бы<qst>$^?<sent>$^.<sent>$

Expected output is:

^Күчүк<n><nom>$ ^бар<adj>+э<cop><aor><p3><sg>+бы<qst>$^?<sent>$^.<sent>$

Unable to properly conjoin LUs

Sample rule file

prn: _.prn_type.person.gender.number ;
vblex: _.tense;
Clt: _.prn_type.person.gender.number ;
SV: _.vb_type.vb_cnj.tense.person.gender.number ;

gender = (GD m) m f nt @mf GD ;
number = (ND sg) sg pl @sp ND ;
person = (PD p3) p1 p2 p3 PD ;
prn_type = tn itg pro enc ;
tense = pri fti cni imp prs pis fts pii ifi inf ger pp ;
vb_type = vbhaver vblex vbmod vbser ;
vb_cnj = cnj impers ;

Clt ->  1: %prn.enc
                { %1 } ;

SV ->   2: %vblex.inf
                [$vb_cnj=impers]
                { (if($lu-count="2") [%1+>2] else %1 ) } |

        2: %SV.*.impers.inf Clt.enc
                { %1 < 2 } ;

Input

^cantar<vblex><inf>/cantar<vblex><inf>$ ^nos<prn><enc><p1><mf><pl>/ens<prn><enc><p1><mf><pl>$

Output

^cantar<vblex><inf>ens<Clt><enc><p1><mf><pl>$

Expected output

^cantar<vblex><inf>+ens<prn><enc><p1><mf><pl>$ 

Description

Two aspects are not working as I would expect, both related to the third rule:

  1. + is missing in the output between the conjoined parts.

  2. The second part of the rule is output as the lemma and tags of the chunk created in the first (Clt) rule and not the content, that is, the content inside the chunk is completely ignored.

Thanks!

XML - set default weight based on number of tags in pattern

I have in apertium-quc-spa.quc-spa.rtx

    <def-cat n="n_rel">
      <cat-item tags="n.rel.*"/>
      <cat-item tags="n.rel"/>
    </def-cat>
    <def-cat n="n">
      <cat-item tags="n.*"/>
      <cat-item tags="n"/>
    </def-cat>

...


  <rule comment="rel" firstChunk="Rel">
   <pattern>
    <pattern-item n="n_rel"/>
   </pattern>
   <action>
      <call-macro n="f_set_chunk_name1"><with-param pos="1"/></call-macro>
      <call-macro n="f_concord1"><with-param pos="1"/></call-macro>
      <call-macro n="f_conv_poss1"><with-param pos="1"/></call-macro>
      <out>
       <chunk namefrom="chunkName">
        <tags>
         <tag><lit-tag v="Rel"/></tag>
        </tags>
        <lu><clip pos="1" side="tl" part="whole"/></lu>
       </chunk>
      </out>
   </action>
  </rule>

  <rule comment="n" firstChunk="N">
   <pattern>
    <pattern-item n="n"/>
   </pattern>
   <action>
      <call-macro n="f_set_chunk_name1"><with-param pos="1"/></call-macro>
      <call-macro n="f_concord1"><with-param pos="1"/></call-macro>
      <call-macro n="f_conv_poss1"><with-param pos="1"/></call-macro>
      <out>
       <chunk namefrom="chunkName">
        <tags>
         <tag><lit-tag v="N"/></tag>
         <tag><clip pos="1" side="tl" part="a_gen"/></tag>
         <tag><clip pos="1" side="tl" part="a_nbr"/></tag>
         <tag><clip pos="1" side="sl" part="a_poss"/></tag>
        </tags>
        <lu><clip pos="1" side="tl" part="lem"/>
            <clip pos="1" side="tl" part="a_sust"/>
            <clip pos="1" side="tl" part="a_gen"/>
            <clip pos="1" side="tl" part="a_nbr" link-to="3"/></lu>
       </chunk>
      </out>
   </action>
  </rule>

If I put the Rel rule after the N rule, the N rule matches, even though Rel has a longer pattern. Is this the desired behaviour? If not, then we could sort the rules before applying them to get longest match behaviour?
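As an illustration of the idea in the title, the sketch below scores each category by the number of literal tags in its most specific cat-item and orders rules by that score. The scoring scheme and the names specificity and default_score are assumptions; how such a score should map onto actual rule weights is left open.

```python
def specificity(cat_item_tags):
    """Count literal tags in a cat-item such as 'n.rel.*' (wildcards ignored)."""
    return sum(1 for tag in cat_item_tags.split(".") if tag and tag != "*")


def default_score(cat_items):
    """Score a def-cat by its most specific cat-item."""
    return max(specificity(tags) for tags in cat_items)


CATEGORIES = {
    "n": ["n.*", "n"],
    "n_rel": ["n.rel.*", "n.rel"],
}

if __name__ == "__main__":
    ordered = sorted(CATEGORIES, key=lambda name: default_score(CATEGORIES[name]),
                     reverse=True)
    print(ordered)  # most specific category first
```

With a score like this, the rule for n_rel would be preferred over the rule for n regardless of the order in which they appear in the file.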


Match unknown words

Some t1x rules match unknown words with

    <def-cat n="unknown">
      <cat-item tags=""/>
    </def-cat>

Perhaps * in the pattern for rtx should do the same?

We would have to make sure this doesn't conflict with the current implementation of lookahead.

better postchunk rules for trx

@ftyers @khannatanmai
Regarding introspecting chunks, there isn't really a way to do that in parsing rules and my way of implementing postchunk rules (copying the structure of t3x with lemma matching) is really messy.

What do you think of rules like this?

<output-rule name="insert_det">
  <!-- assorted manipulations -->
  <out>
    <lu>...</lu>
    ...
  </out>
  <!-- or -->
  <output-all/>
  <!-- for if you changed tags and things but haven't added or removed any chunks -->
</output-rule>

<rule>
  <pattern>...</pattern>
  <action>
    <out>
      <chunk output-rule="insert_det">...</chunk>
    </out>
  </action>
</rule>

Means of modifying which output rule was set for a chunk could of course also be added (whether with <let> and <clip> or with specialized instructions, I'm not sure).

Compile .t*x Files

  • I ensured that all rules have separate paths, but this prevents checking for conflicts.
  • Some words occasionally lose tags, particularly <ger> and <prn>
  • A few empty chunks are produced

See also #3

Discard More Branches

GLR mode is currently somewhat slow. The best option for speeding it up is probably to find some way of discarding paths early, rather than holding on to all of them until completion or a parse error.

numbers get repeated?

@mr-martian, I'm having a weird issue. Currently the following sentence

Ал 17-кылымда төрөлгөн.

translates as

He was borne in the 1717th century.

Do you have any idea why we get the number doubled?

Lemmas with queues cannot be properly conjoined

Currently it does not seem possible to properly conjoin lemmas when the first one has a queue. This is very common with verbs and clitics and can be done easily in apertium-transfer, which allows the clipping positions of lemh and lemq to be specified. Example with a verb and a clitic:

Source

^tenir# en compte<vblex><ger>$ ^ho<prn><enc><p3><nt>$

Current output

^tenir<vblex><ger># en compte+ho<prn><enc><p3><nt>$

Expected output

^tenir<vblex><ger>+ho<prn><enc><p3><nt># en compte$

For lemmas that are not yet inside of a chunk, it seems straightforward. However, given that it is also possible to conjoin whole chunks, I am not entirely sure how this would be handled.
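For the simple case of two plain lexical units, the expected behaviour can be sketched as string manipulation: split the first lemma into head and queue at the #, then append the queue after the conjoined unit. The function names are hypothetical, escaping is ignored, and the whole-chunk case raised above is not addressed.

```python
def split_lu(lu):
    """Split '^lemh# queue<tags>$' into (lemh, lemq, tags)."""
    body = lu[1:-1]  # assumes the unit is wrapped in ^ ... $
    tag_start = body.find("<")
    if tag_start == -1:
        tag_start = len(body)
    lemma, tags = body[:tag_start], body[tag_start:]
    hash_at = lemma.find("#")
    if hash_at == -1:
        return lemma, "", tags
    return lemma[:hash_at], lemma[hash_at:], tags


def conjoin(first, second):
    """Join two lexical units with +, moving the first unit's queue to the end."""
    lemh, lemq, tags = split_lu(first)
    return "^" + lemh + tags + "+" + second[1:-1] + lemq + "$"


if __name__ == "__main__":
    print(conjoin("^tenir# en compte<vblex><ger>$", "^ho<prn><enc><p3><nt>$"))
```

Run on the source above, this produces the expected output given in this report.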

Clip from variables

I'd like to be able to clip from variables. It should probably work like this:

<clip part="a_agr"><var n="copula"/></clip>

Where the variable is passed to clip as an argument by putting the var tag inside the clip tag.

Multiple Tag-Rewrite Rules

There's currently no way for the compiler to do tag conversion x > y > z. If macros try to rewrite tags, the last one takes precedence and we get x > z. The main trick for fixing this will be to avoid inappropriate applications of x > x.

Interpolation in RTX

In TRX it is possible to insert chunks into other chunks:

<chunk ...>
  <tags>...</tags>
  <clip pos="1" part="chcontent" side="tl"/>
  <b/>
  <clip pos="2" part="whole" side="tl"/>
</chunk>

The relevant postchunk rule can then check the value of <lu-count/> to see if anything has been inserted and move it around if so.

This is currently not possible in RTX and maybe should be.

variables get cleared between macro calls?

    <def-macro n="f_link_concord1" npar="1">
      <choose>
        <when><test><not><equal><clip pos="1" side="tl" part="a_nbr"/><lit-tag v="sp"/></equal></not></test>
                    <let><var n="chunkNumero"/><clip pos="1" side="tl" part="a_nbr"/></let>
                    <let><clip pos="1" side="tl" part="a_nbr"/><lit-tag v="3"/></let>
        </when>
        <otherwise>
            <let><var n="chunkNumero"/><lit-tag v="ND"/></let><!--<clip pos="1" side="tl" part="a_nbr"/></let>-->
        </otherwise>
      </choose>
      <choose>
        <when><test><not><equal><clip pos="1" side="tl" part="a_gen"/><lit-tag v="mf"/></equal></not></test>
                    <let><var n="chunkGenero"/><clip pos="1" side="tl" part="a_gen"/></let>
                    <let><clip pos="1" side="tl" part="a_gen"/><lit-tag v="2"/></let>
        </when>
        <otherwise>
             <let><var n="chunkGenero"/><lit-tag v="GD"/></let><!--<clip pos="1" side="tl" part="a_gen"/></let>-->
        </otherwise>
      </choose>
    </def-macro>

After calling this macro, the variables chunkGenero and chunkNumero should never be empty: chunkNumero should hold either a non-<sp> value or <ND>, and chunkGenero either a non-<mf> value or <GD>. But they do end up empty:


  <rule comment="det SN" firstChunk="SD">
   <pattern>
    <pattern-item n="det"/>
    <pattern-item n="SN"/>
   </pattern>
   <action>
      <call-macro n="f_set_chunk_name1"><with-param pos="2"/></call-macro>
      <!--<call-macro n="f_concord2"><with-param pos="2"/><with-param pos="1"/></call-macro>-->
      <call-macro n="f_link_concord1"><with-param pos="2"/></call-macro>
      <!--<call-macro n="f_link_concord1"><with-param pos="1"/></call-macro>-->
      <!--<call-macro n="f_set_determiner2"><with-param pos="2"/><with-param pos="1"/></call-macro>-->
      <out>
       <chunk namefrom="chunkName">
        <tags>
         <tag><lit-tag v="SD"/></tag>
         <tag><var n="chunkGenero"/></tag>
         <tag><var n="chunkNumero"/></tag>
         <tag><clip pos="2" side="tl" part="a_poss"/></tag>
        </tags>
<var n="numero"/><var n="genero"/><var n="chunkNumero"/><var n="chunkGenero"/>
        <lu><clip pos="1" side="tl" part="whole"/></lu>
        <b/>
        <lu><clip pos="2" side="tl" part="whole"/></lu>
       </chunk>
      </out>
   </action>
  </rule>

Output

^parábola<SD>{
	^<sg>$
	^<m>$
	^$
	^$
	^este<det><dem><m><sg>$
	^parábola<SN>{
		^uno<Num>{
			^uno<det><ind><2><3>$
		}$
		^parábola<SN>{
			^parábola<N>{
				^parábola<n><2><3>$
			}$
		}$
	}$
}$
