apertium / apertium-separable

Module for reordering separable/discontiguous multiwords.

Home Page: https://wiki.apertium.org/wiki/Apertium_separable

License: GNU General Public License v3.0

apertium-separable's Introduction

Lttoolbox provides a module for reordering separable/discontiguous multiwords and processing them in the pipeline. Multiwords are written manually in an additional XML-format dictionary.

Installing

The module is available from the nightly repositories and can be installed with apt-get install apertium-separable.

If you'd like to compile it manually—e.g., for development purposes—you can follow these instructions:

Prerequisites and compilation are the same as lttoolbox and apertium. See Installation.

The code can be found at https://github.com/apertium/apertium-separable, and instructions for compiling the module are:

./autogen.sh
./configure
make
make install

You'll need lttoolbox from git (or a release newer than 3.5.1) and its associated libraries.

Lexical transfer in the pipeline

lsx-proc runs directly after apertium-tagger and apertium-pretransfer.
(Note: this page previously said that lsx-proc runs between apertium-tagger and apertium-pretransfer. It has since been determined that it should run after pretransfer.)

… | apertium-tagger -g en-es.prob |  apertium-pretransfer | lsx-proc en-es.autoseq.bin | …
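
As a rough illustration of this ordering, the following Python sketch assembles the three stages into a shell pipeline string, with lsx-proc placed immediately after apertium-pretransfer. The function name build_pipeline is hypothetical; the file names are simply the ones used in the example above.

```python
import shlex


def build_pipeline(prob="en-es.prob", autoseq="en-es.autoseq.bin"):
    """Return the tagger -> pretransfer -> lsx-proc part of a pipeline."""
    stages = [
        ["apertium-tagger", "-g", prob],
        ["apertium-pretransfer"],
        ["lsx-proc", autoseq],  # lsx-proc comes after pretransfer
    ]
    # Quote each stage safely, then join the stages with shell pipes.
    return " | ".join(shlex.join(stage) for stage in stages)


print(build_pipeline())
```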

Usage

Creating the lsx-dictionary

The lsx dictionary format is largely similar to those of the morphological and bilingual dictionaries (see also: Apertium_New_Language_Pair_HOWTO).

We begin with a declaration of the dictionary. There is currently nothing in it, only a declaration that we want to begin a new dictionary.

<dictionary type="separable">
</dictionary>

Then add the alphabet entry. This can be empty, as the alphabet is only used for tokenisation and the lsx module comes after the text is tokenised. Now we have:

<dictionary type="separable">
    <alphabet></alphabet> 
</dictionary>

Next we need to add the symbol definitions, abbreviated to sdefs. These are the symbols that your words are tagged with, e.g. noun, verb, or adj. You should be able to copy the sdef section from your language's monodix, which should contain many more symbols than this basic example.

<dictionary type="separable">
    <alphabet></alphabet>
    <sdefs>
        <sdef n="adj"/>
        <sdef n="adv"/>
        <sdef n="n"/>
        <sdef n="sep"/>
        <sdef n="vblex"/>
    </sdefs>
</dictionary>

Now we need to add the paradigm definitions, abbreviated to pardefs. These represent patterns of word orders. The following example represents words tagged as adjectives, nouns, noun phrases, and frequency adverbs. See the note below about the tags <w/>, <t/>, and <j/>. The lemma can be represented as any characters (<w/>), as in adj and n below, or by typing out the word itself, as in freq-adv below. Pardefs can be used to create other pardefs, as in SN below. Adding paradigms into the dictionary, we get:

<dictionary type="separable">
    <alphabet></alphabet>
    <sdefs>
        ...
    </sdefs>
    <pardefs>
        <pardef n="adj"> <!-- to represent all adjectives -->
            <e><i><w/><s n="adj"/><j/></i></e> <!-- word only has the adj tag -->
            <e><i><w/><s n="adj"/><t/><j/></i></e> <!-- word has the adj tag followed by one or more other tags -->
        </pardef>
        <pardef n="n"> <!-- to represent all nouns -->
            <e><i><w/><s n="n"/><t/><j/></i></e> <!-- word has the n tag followed by one or more other tags -->
        </pardef>
        <pardef n="SN"> <!-- to represent all noun phrases -->
            <e><par n="n"/></e>
            <e><par n="adj"/><par n="n"/></e> <!-- word phrase is comprised of an adjective word followed by a noun word -->
            <e><par n="adj"/><par n="adj"/><par n="n"/></e> <!-- word phrase is comprised of two adjectives followed by a noun -->
        </pardef>
        <pardef n="freq-adv">
            <e><i>always<s n="adv"/><j/></i></e> <!-- i.e. ^always<adv>$ -->
            <e><i>annually<s n="adv"/><j/></i></e>
            <e><i>biannually<s n="adv"/><j/></i></e>
        </pardef>
    </pardefs>
</dictionary>

Finally, we add the main entries. Here is the final result of our small example dictionary:

<dictionary type="separable">
    <alphabet></alphabet>
    <sdefs>
        <sdef n="adj"/>
        <sdef n="adv"/>
        <sdef n="n"/>
        <sdef n="sep"/>
        <sdef n="vblex"/>
        <sdef n="vbser"/>
    </sdefs>
    <pardefs>
        <pardef n="adj">
            <e><i><w/><s n="adj"/><j/></i></e>
            <e><i><w/><s n="adj"/><t/><j/></i></e>
        </pardef>
        <pardef n="n">
            <e><i><w/><s n="n"/><t/><j/></i></e>
        </pardef>
        <pardef n="SN">
            <e><par n="n"/></e>
            <e><par n="adj"/><par n="n"/></e>
            <e><par n="adj"/><par n="adj"/><par n="n"/></e>
        </pardef>
        <pardef n="freq-adv">
            <e><i>always<s n="adv"/><j/></i></e>
            <e><i>annually<s n="adv"/><j/></i></e>
            <e><i>biannually<s n="adv"/><j/></i></e>
        </pardef>
    </pardefs>
    <section id="main" type="standard">
        <e lm="be late" c="llegar tarde">
            <p><l>be<s n="vbser"/></l><r>be<g><b/>late</g><s n="vbser"/></r></p><i><t/><j/></i>
            <par n="freq-adv"/><p><l>late<t/><j/></l><r></r></p>
        </e>
        <e lm="take away" c="sacar, quitar">
            <p><l>take<s n="vblex"/></l><r>take<g><b/>away</g><s n="vblex"/></r></p><i><t/><j/></i>
            <par n="SN"/><p><l>away<t/><j/></l><r></r></p>
        </e>
    </section>
</dictionary>

Note:

  • <w/> stands for one or more alphabetic symbols

  • <t/> stands for one or more tags (multicharacter symbols).

  • <j/> stands for the word boundary symbol $

i.e.

  • <e><i><w/><s n="adj"/><t/><j/></i></e> is equivalent to any-one-or-more-chars<adj><...one-or-more-tags...><$>
    • ^tall<adj><...>$
  • <e><i><w/><s n="adj"/><j/></i></e> is equivalent to any-one-or-more-chars<adj><$>
    • ^tall<adj>$
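
To make the meaning of <w/>, <t/>, and <j/> more concrete, here is a minimal Python sketch that translates the inside of an <i> element into a regular expression and tests it against lexical units. This is only an approximation for illustration: lsx-comp builds a transducer rather than a regex, the function name entry_to_regex is hypothetical, and the sketch handles only <w/>, <t/>, <j/>, <s n="..."/>, and literal text.

```python
import re

# Rough regex equivalents of the three wildcard elements (an assumption
# made for illustration, not how lsx-comp represents them internally).
WILDCARDS = {
    "<w/>": r"[^<$]+",         # one or more non-tag characters
    "<t/>": r"(?:<[^<>]+>)+",  # one or more tags
    "<j/>": r"\$",             # word boundary symbol
}


def entry_to_regex(body):
    """Translate the inside of an <i>...</i> element into a regex."""
    parts = []
    for token in re.findall(r'<s n="[^"]+"/>|<[wtj]/>|[^<]+', body):
        if token in WILDCARDS:
            parts.append(WILDCARDS[token])
        elif token.startswith("<s "):
            name = re.match(r'<s n="([^"]+)"/>', token).group(1)
            parts.append(re.escape("<" + name + ">"))
        else:
            parts.append(re.escape(token))  # literal lemma text
    return re.compile(r"\^" + "".join(parts))


with_tags = entry_to_regex('<w/><s n="adj"/><t/><j/>')
adj_only = entry_to_regex('<w/><s n="adj"/><j/>')

print(bool(with_tags.fullmatch("^tall<adj><sint>$")))  # True
print(bool(with_tags.fullmatch("^tall<adj>$")))        # False
print(bool(adj_only.fullmatch("^tall<adj>$")))         # True
```

The second call fails because <t/> requires at least one tag after <adj>, which is why the adj pardef lists both variants.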

A larger example dictionary can be found at https://github.com/apertium/apertium-separable/blob/master/examples/apertium-eng-spa.eng-spa.lsx.

The lsx dictionary file names are of the form apertium-A-B.A-B.lsx, where apertium-A-B is the name of the language pair. For example, the file apertium-eng-cat.eng-cat.lsx is the lsx dictionary for the eng-cat pair. The names of the compiled binaries are of the form A-B.autoseq.bin, for example eng-cat.autoseq.bin.

Compilation

Compilation into the binary format is achieved by means of the lsx-comp program. Specifying lr as the mode will produce an analyser, and rl will produce a generator.

$ lsx-comp lr apertium-eng-spa.eng-spa.lsx eng-spa.autoseq.bin
main@standard 61 73
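
The naming convention and the compile command can be sketched in Python as follows. The helper names lsx_file_names and lsx_comp_command are hypothetical, and the sketch only builds the argument list; it does not run lsx-comp. It also reuses the autoseq name for both modes, which is an assumption.

```python
def lsx_file_names(lang_a, lang_b):
    """Return (dictionary file name, compiled binary name) for a pair."""
    pair = f"{lang_a}-{lang_b}"
    source = f"apertium-{pair}.{pair}.lsx"
    binary = f"{pair}.autoseq.bin"
    return source, binary


def lsx_comp_command(lang_a, lang_b, mode="lr"):
    """Build the lsx-comp argument list: lsx-comp MODE SOURCE BINARY."""
    if mode not in ("lr", "rl"):
        raise ValueError("mode must be 'lr' or 'rl'")
    source, binary = lsx_file_names(lang_a, lang_b)
    # Assumption: the same output name is used regardless of mode.
    return ["lsx-comp", mode, source, binary]


print(" ".join(lsx_comp_command("eng", "spa")))
# lsx-comp lr apertium-eng-spa.eng-spa.lsx eng-spa.autoseq.bin
```

The resulting list could be passed to subprocess.run on a machine where lsx-comp is installed.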

Processing

Processing can be done using the lsx-proc program.

The input to lsx-proc is the output of apertium-tagger and apertium-pretransfer:

$ echo '^take<vblex><imp>$ ^prpers<prn><obj><p3><nt><sg>$ ^out of<pr>$ ^there<adv>$^.<sent>$' | lsx-proc eng-spa.autoseq.bin
^take# out<vblex><sep><imp>$ ^prpers<prn><obj><p3><nt><sg>$ ^of<pr>$ ^there<adv>$^.<sent>$
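
When inspecting lsx-proc output it can help to split the stream into lexical units. The following Python sketch does this with a regular expression; parse_stream is a hypothetical helper, and it deliberately ignores escaped characters, superblanks, and ambiguous (slash-separated) readings.

```python
import re

# A lexical unit here is ^lemma<tag><tag>...$ with no escapes or slashes.
LU_PATTERN = re.compile(r"\^([^<$]*)((?:<[^<>]+>)*)\$")


def parse_stream(stream):
    """Return a list of (lemma, [tags]) tuples found in the stream."""
    units = []
    for lemma, tags in LU_PATTERN.findall(stream):
        units.append((lemma, re.findall(r"<([^<>]+)>", tags)))
    return units


output = ("^take# out<vblex><sep><imp>$ ^prpers<prn><obj><p3><nt><sg>$ "
          "^of<pr>$ ^there<adv>$^.<sent>$")
for lemma, tags in parse_stream(output):
    print(lemma, tags)
```

For the output shown above, the first unit comes back with the lemma "take# out" and the tags vblex, sep, and imp, which makes the inserted sep tag easy to spot.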

Example usages

Example #1: A sentence in plain text:

The Aragonese took Ramiro out of a monastery and made him king.

This is the output of feeding the sentence through apertium-tagger and then apertium-pretransfer:

^the<det><def><sp>$ ^Aragonese<n><sg>$ ^take<vblex><past>$ ^Ramiro<np><ant><m><sg>$ ^out of<pr>$ ^a<det><ind><sg>$
^monastery<n><sg>$ ^and<cnjcoo>$ ^make<vblex><pp>$ ^prpers<prn><obj><p3><m><sg>$ ^king<n><sg>$^.<sent>$

This is the output of feeding the output above through lsx-proc with apertium-eng-spa.eng-spa.lsx:

^the<det><def><sp>$ ^Aragonese<n><sg>$ ^take# out<vblex><sep><past>$ ^Ramiro<np><ant><m><sg>$ ^of<pr>$ ^a<det><ind><sg>$
^monastery<n><sg>$ ^and<cnjcoo>$ ^make<vblex><pp>$ ^prpers<prn><obj><p3><m><sg>$ ^king<n><sg>$^.<sent>$

Troubleshooting

Segmentation fault

A segmentation fault can occur upon compilation or usage.
The lsx-dictionary compiles fine with zero entries but gives a segmentation fault once entries are added.
No solution has been found yet.
Possibly something is not updated, or something in the makefile is wrong (?)

make sure that the makefile ...

Complaints about step_override()

Run git pull in lttoolbox, then make and make install.
You'll need an up-to-date version of lttoolbox and its associated libraries, plus zlib (Debian: zlib1g-dev).

Undefined symbol

In your dictionary you are probably using a symbol that you didn't define in the sdefs. Add the symbol to the sdefs.
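
A quick way to find such symbols is to compare the names used in <s n="..."/> elements against those declared in the sdefs. The sketch below does this with Python's standard xml.etree.ElementTree module; the function name undefined_symbols and the tiny inline dictionary are made up for illustration.

```python
import xml.etree.ElementTree as ET


def undefined_symbols(lsx_text):
    """Return the sorted symbol names used in <s> but missing from <sdefs>."""
    root = ET.fromstring(lsx_text)
    defined = {sdef.get("n") for sdef in root.iter("sdef")}
    used = {s.get("n") for s in root.iter("s")}
    return sorted(used - defined)


# Hypothetical dictionary: vbser is used but never declared.
example = """<dictionary type="separable">
  <alphabet></alphabet>
  <sdefs><sdef n="vblex"/></sdefs>
  <section id="main" type="standard">
    <e><p><l>be<s n="vbser"/></l><r>be<s n="vbser"/></r></p><i><t/><j/></i></e>
    <e><i>take<s n="vblex"/><t/><j/></i></e>
  </section>
</dictionary>"""

print(undefined_symbols(example))  # ['vbser']
```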

Future work

Offloading multiwords from transducers to lsx

In theory we're offloading multiwords from the transducers to lsx. This leaves open some questions:

  • how do we do N N compounds with lsx?
  • how does translation to a multiword work? In theory it's possible to invert the transducer, but an attempt to try this results in a transducer that looks right but silently fails to apply to input. Also, it will need to be able to handle the output of transfer. —Firespeaker (talk) 00:02, 1 September 2017 (CEST)

Recycling dictionaries and/or paradigms

lsx-dictionaries are packaged per language pair. Most of the eng-spa lsx-dictionary could be reused by eng-cat. Could we make use of the similarity?

Beta testing

Support for language pairs: we haven't had much beta testing so far. The following language pairs have packaged the lsx module:

    • eng-cat
    • eng-deu (?)
    • kaz-kir

Beta test with more language pairs

Transfer-like super powers

  • Transfer-like capabilities for the lexicon (super powers). E.g., gustar /

The one-to-many bug

Given the following lsx file:

<dictionary type="sequential">
    <alphabet>АӘБВГҒДЕЁЖЗИІЙКҚЛМНҢОӨПРСТУҰҮФХҺЦЧШЩЬЫЪЭЮЯаәбвгғдеёжзиійкқлмнңоөпрстуұүфхһцчшщьыъэюя</alphabet>
    <sdefs>
        <sdef n="adj"/>
        <sdef n="adv"/>
        <sdef n="n"/>
        <sdef n="nom"/>
        <sdef n="dat"/>
        <sdef n="v"/>
    </sdefs>
    <pardefs>
        <pardef n="adj">
            <e><i><w/><s n="adj"/><j/></i></e>
            <e><i><w/><s n="adj"/><t/><j/></i></e>
        </pardef>
        <pardef n="n">
            <e><i><w/><s n="n"/><t/><j/></i></e>
        </pardef>
        <pardef n="SN">
            <e><par n="n"/></e>
            <e><par n="adj"/><par n="n"/></e>
            <e><par n="adj"/><par n="adj"/><par n="n"/></e>
        </pardef>
    </pardefs>
    <section id="main" type="standard">
        <e lm="кабарда" c="хабар ет">
            <p><l>хабар<b/>ет<s n="v"/></l>
                <r>хабар<s n="n"/><s n="nom"/><j/>ет<s n="v"/></r></p><i><t/><j/></i>
        </e>
        <e lm="абайла" c="абай бол">
            <p><l>абай<b/>бол<s n="v"/></l>
                <r>абай<s n="adj"/><j/>бол<s n="v"/></r></p><i><t/><j/></i>
        </e>
        <e lm="абайла" c="абай бол">
            <p><l>абай<b/>бол<s n="v"/></l>
                <r>абай<s n="adj"/><j/>бол<s n="v"/></r></p><i><t/>+ма<t/><j/></i>
            <!-- p><l>абай<s n="adj"/><j/>бол<s n="v"/><t/></l>
                <r>абай<b/>бол<s n="v"/><t/></r></p -->
        </e>
        <e lm="сууга түш" c="шомылда">
            <p><l>сууга<b/>түш<s n="v"/></l>
                <r>суу<s n="n"/><s n="dat"/><j/>түш<s n="v"/></r></p><i><t/><j/></i>
        </e>

    </section>
</dictionary>

and the following code to compile it (where $(PREFIX1) is kaz-kir and $(PREFIX2) is kir-kaz and $(BASENAME) is apertium-kaz-kir; the above file is apertium-kaz-kir.kir-kaz.lsx):

$(PREFIX1).autoseq.bin: $(BASENAME).$(PREFIX1).lsx
    lsx-comp $< $@

$(PREFIX2).autoseq.bin: $(BASENAME).$(PREFIX2).lsx
    lsx-comp $< $@

$(PREFIX1).revautoseq.bin: $(BASENAME).$(PREFIX1).lsx
    lt-print $(PREFIX1).autoseq.bin |  sed 's/ /@_SPACE_@/g' > $(PREFIX1).autoseq.att
    hfst-txt2fst -e ε < $(PREFIX1).autoseq.att > $(PREFIX1).autoseq.hfst
    hfst-invert $(PREFIX1).autoseq.hfst | hfst-minimise > $(PREFIX1).revautoseq.hfst
    hfst-fst2txt $(PREFIX1).revautoseq.hfst | gzip -9 -c -n > $(PREFIX1).revautoseq.att.gz
    zcat < $(PREFIX1).revautoseq.att.gz > $(PREFIX1).revautoseq.att
    sed 's/@0@/ε/g' $(PREFIX1).revautoseq.att > $(PREFIX1).revautoseq.1.att
    lt-comp lr $(PREFIX1).revautoseq.1.att $@


$(PREFIX2).revautoseq.bin: $(BASENAME).$(PREFIX2).lsx
    lt-print $(PREFIX2).autoseq.bin |  sed 's/ /@_SPACE_@/g' > $(PREFIX2).autoseq.att
    hfst-txt2fst -e ε < $(PREFIX2).autoseq.att > $(PREFIX2).autoseq.hfst
    hfst-invert $(PREFIX2).autoseq.hfst | hfst-minimise > $(PREFIX2).revautoseq.hfst
    hfst-fst2txt $(PREFIX2).revautoseq.hfst | gzip -9 -c -n > $(PREFIX2).revautoseq.att.gz
    zcat < $(PREFIX2).revautoseq.att.gz > $(PREFIX2).revautoseq.att
    sed 's/@0@/ε/g' $(PREFIX2).revautoseq.att > $(PREFIX2).revautoseq.1.att
    lt-comp lr $(PREFIX2).revautoseq.1.att $@

EXPECTED OUTPUT:

we expect lr compilation to give the following behaviour:

$ echo "^хабар ет<v><iv><ifi><p1><sg>$" | lsx-proc kaz-kir.autoseq.bin
^хабар<n><nom>$ ^ет<v><iv><ifi><p1><sg>$

and

$ echo "^хабар<n><nom>$ ^ет<v><iv><ifi><p1><sg>$" | lsx-proc kaz-kir.autoseq.bin
^хабар<n><nom>$ ^ет<v><iv><ifi><p1><sg>$

Whereas with rl compilation (with the output named revautoseq), we expect the following behaviour:

$ echo "^хабар<n><nom>$ ^ет<v><iv><ifi><p1><sg>$" | lsx-proc kaz-kir.revautoseq.bin
^хабар ет<v><iv><ifi><p1><sg>$

and

$ echo "^хабар ет<v><iv><ifi><p1><sg>$" | lsx-proc kaz-kir.revautoseq.bin
^хабар ет<v><iv><ifi><p1><sg>$

apertium-separable's People

Contributors

ftyers, himanshu40, itang1, jonorthwash, khannatanmai, mr-martian, nishantwrp, nlhowell, tinodidriksen, unhammer, xavivars

apertium-separable's Issues

no match if `+` after pattern

Acceptable translation, with good match:

$ echo "бүгүнкү күндө түшөт" | apertium -d . kir-eng
he falls these days

$ echo "бүгүнкү күндө түшөт" | apertium -d . kir-eng-disam  # and some manual cleaning
^бүгүн<adv><attr>$ ^күн<n><loc>$ ^түш<v><iv><aor><p3><sg>$^.<sent>$

$ echo "бүгүнкү күндө түшөт" | apertium -d . kir-eng-autoseq
^бүгүнкү күндө<adv>$ ^түш<v><iv><aor><p3><sg>$^.<sent>$

$ echo "бүгүнкү күндө түшөт" | apertium -d . kir-eng-biltrans
^бүгүнкү күндө<adv>/these days<adv>/today<adv>$ ^түш<v><iv><aor><p3><sg>/fall<vblex><aor><p3><sg>$^.<sent>/.<sent>$

Problematic translation:

$ echo "бүгүнкү күндө" | apertium -d . kir-eng
he is on the today day

$ echo "бүгүнкү күндө" | apertium -d . kir-eng-disam # and some manual cleaning
^бүгүн<adv><attr>$ ^күн<n><loc>+э<cop><aor><p3><sg>$^.<sent>$

$ echo "бүгүнкү күндө" | apertium -d . kir-eng-autoseq
^бүгүн<adv><attr>$ ^күн<n><loc>+э<cop><aor><p3><sg>$^.<sent>$

$ echo "бүгүнкү күндө" | apertium -d . kir-eng-biltrans
^бүгүн<adv><attr>/today<adv><attr>$ ^күн<n><loc>/day<n><loc>/sun<n><loc>$ ^э<cop><aor><p3><sg>/be<vbser><aor><p3><sg>$^.<sent>/.<sent>$

The second translation is problematic because the current rule doesn't match, due to the +э<cop>.... Here's the current rule:

		<e lm="бүгүнкү күндө" c="today">
			<p>
				<l>бүгүн<s n="adv"/><s n="attr"/><j/>күн<s n="n"/><s n="loc"/></l>
				<r>бүгүнкү<b/>күндө<s n="adv"/></r>
			</p>
			<i><j/></i>
		</e>

rule-initial <w/> can make other rules match

It seems like a <w/> at the start of a rule can make the analyser move its position into a lexical unit even if the rule doesn't end up fully matching, allowing other rules to match from that point on.

apertium-nno-nob.nob-nno.lsx:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary type="separable">

  <alphabet></alphabet>

  <sdefs>
    <sdef n="adj"/>
  </sdefs>

  <pardefs>
    <pardef n="meh">
      <e><i>meh<s n="adj"/><t/><j/></i></e>
    </pardef>
  </pardefs>

  <section id="main" type="standard">

    <e c="override below rule if adj before">
      <i><w/>stuffnotininput<s n="adj"/><t/><j/></i>
      <i>DROP<s n="adj"/><t/><j/></i>
    </e>

    <e c="drop DROP and LEFT→RIGHT">
      <p><l>DROP<t/><j/></l> <r></r></p>
      <p><l>LEFT</l>           <r>RIGHT</r></p> <i><t/><j/></i>
    </e>
  </section>

</dictionary>
$ lsx-comp lr apertium-nno-nob.nob-nno.lsx nob-nno.autoseq.bin
main@standard 39 44

$ echo '^keptDROP<adj><sg>$ ^LEFT<n><sg>$' | lsx-proc nob-nno.autoseq.bin
^keptRIGHT<n><sg>$

None of the entries should've matched here, yet it seems like we had a partial match on the first one and then only backtracked back to where the second one was able to start matching (instead of backtracking outside of the word ^).

(thanks @victoria-tro for reporting)

is it possible to have <j/> boundaries without spaces?

With the following rule, I'm trying to get "year-old", but instead get "year - old" (with spaces).

Rule:

<e lm="year-old" c="жашар">
	<p>
		<l>year<s n="n"/><s n="sg"/><j/>-<s n="guio"/><j/>old<s n="adj"/><s n="sint"/></l>
		<r>year-old<s n="adj"/></r>
	</p>
	<i><j/></i>
</e>

Example input and output:

Азамат алты жашар кичинекей бала.

↓ tagging, transfer

^Azamat<np><ant><m><sg>$ ^be<vbser><pres><p3><sg>$ ^the<det><def><sp>$ ^six<num><pl>$ ^year-old<adj>$ ^little<adj>$ ^kid<n><sg>$^.<sent>$^.<sent>$

↓ revautoseq

^Azamat<np><ant><m><sg>$ ^be<vbser><pres><p3><sg>$ ^the<det><def><sp>$ ^six<num><pl>$ ^year<n><sg>$ ^-<guio>$ ^old<adj><sint>$ ^little<adj>$ ^kid<n><sg>$^.<sent>$^.<sent>$

↓ generation

Azamat is the six year - old little kid.

Outputs extra chars

fran@ipek:~/source/apertium/pairs/apertium-quc-spa$ echo "rumal rech che" | apertium -d . quc-spa
porque
??fran@ipek:~/source/apertium/pairs/apertium-quc-spa$ echo "rumal rech che." | apertium -d . quc-spa-tagger
^umal<n><rel><px3sg>$ ^ech<n><rel><px3sg>$ ^chi<pr>+ech<n><rel><px3sg>$^.<sent>$^.<sent>$
fran@ipek:~/source/apertium/pairs/apertium-quc-spa$ echo "rumal rech che." | apertium -d . quc-spa-separable
^rumal rech<cnjadv>$ ^chi<pr>$ ^ech<n><rel><px3sg>$^.<sent>$^.<sent>$
?
$ echo "rumal rech che." | apertium -d . quc-spa-tagger | apertium-pretransfer | lsx-proc quc-spa.autoseq.bin | unidump 
      0    005E 0072 0075 006D 0061 006C 0020 0072 0065 0063 0068 0020 0063 0068 0065 003C    ^rumal.rech.che<
     16    0063 006E 006A 0061 0064 0076 003E 0024 005E 002E 003C 0073 0065 006E 0074 003E    cnjadv>$^.<sent>
     32    0024 005E 002E 003C 0073 0065 006E 0074 003E 0024 000A 003F                        $^.<sent>$.?
$ echo "rumal rech che" | hfst-proc quc-spa.automorf.hfst | cg-proc quc-spa.rlx.bin | apertium-tagger -u 2 -g quc-spa.prob|  apertium-pretransfer | lsx-proc quc-spa.autoseq.bin | hexdump -xc
0000000    725e    6d75    6c61    7220    6365    2068    6863    3c65
0000000   ^   r   u   m   a   l       r   e   c   h       c   h   e   <
0000010    6e63    616a    7664    243e    3f0a                        
0000010   c   n   j   a   d   v   >   $  \n   ?                        
000001a

Weird issue with spaces

fran@ipek:~/source/apertium/pairs/apertium-quc-spa$ echo "Jas che mna kixpe chwe’q?" | apertium -d . quc-spa
por qué  *mna vinisteis a @we’*q?
fran@ipek:~/source/apertium/pairs/apertium-quc-spa$ echo "Jas  che mna kixpe chwe’q?" | apertium -d . quc-spa
  @jasche  *mna vinisteis a @we’*q?

Issue with capital letters

This works:

        <e lm="Jun Ajpu" c=""><p>
                <l>jun<s n="num"/><j/>ajpu<s n="np"/><s n="ant"/><s n="m"/><j/></l>
                <r>Jun<b/>Ajpu<s n="np"/><s n="ant"/><s n="m"/></r>
            </p>
        </e>
$ echo "^Jun<num>$ ^Ajpu<np><ant><m>$ " | lsx-proc quc-spa.autoseq.bin 
^Jun Ajpu<np><ant><m>$ 

But this doesn't:

        <e lm="Jun Ajpu" c=""><p>
                <l>Jun<s n="num"/><j/>Ajpu<s n="np"/><s n="ant"/><s n="m"/><j/></l>
                <r>Jun<b/>Ajpu<s n="np"/><s n="ant"/><s n="m"/></r>
            </p>
        </e>
$ echo "^Jun<num>$ ^Ajpu<np><ant><m>$ " | lsx-proc quc-spa.autoseq.bin 
^Jun<num>$ ^Ajpu<np><ant><m>$ 

Extra <t/> probably shouldn't fill up memory

When <t/> is in the code and doesn't match, lsx-proc quickly fills up available memory, and unless stopped quickly, this can cause problems on the machine.

See #7 for an example of this.

When <t/> isn't matched, it should probably fail silently instead—e.g., producing output to stdout and logging a warning to stderr (or similar).

lsx-comp compiling error

When recompiling the apertium-separable rules for fra-frp, which were working and have been untouched for the last year, I get an error:

$ make
lsx-comp lr apertium-fra-frp.fra-frp.l1x fra-frp.autosep1.bin
lsx-comp: symbol lookup error: lsx-comp: undefined symbol: _ZN8Alphabet5writeEP8_IO_FILE
make: *** [Makefile:819: fra-frp.autosep1.bin] Error 127

Problems with a match

In apertium-fra-cat we have the file apertium-fra-cat.cat-fra.l2x

There are several rules which are more or less a copy-and-paste but none of them has been tested.

I've tried with the rule "rendre public", which is supposed to split the multiword "rendre# public" if it is followed by "pas" changing "rendre public pas" into "rendre pas public". Unfortunately it doesn't work:

> no faig públic.
> ^no/no<adv>$ ^faig públic/fer<vblex><pri><p1><sg># públic$^./.<sent>$ 
> ^no/no<adv>$ ^faig públic/fer# públic<vblex><pri><p1><sg>$^./.<sent>$ 
> ^no<adv>$ ^fer# públic<vblex><pri><p1><sg>$^.<sent>$ 
> ^no<adv>/ne<adv>/non<adv>$ ^fer# públic<vblex><pri><p1><sg>/rendre# public<vblex><pri><p1><sg>$^.<sent>/.<sent>$ 
> ^no<adv>/ne<adv>$ ^fer# públic<vblex><pri><p1><sg>/rendre# public<vblex><pri><p1><sg>$^.<sent>/.<sent>$ 
> ^ne<adv>$ ^rendre<vblex><pri><p1><sg># public$ ^pas<adv>$^.<sent>$ 
> ~ne rends public pas~. 
> ne rends public pas. 

I've tried several possibilities of matching, but I couldn't make the rule work (there are two rules now, just as a try: there is no match).

Any help would be appreciated.

LRLM matching?

It doesn't seem to do LRLM (left-to-right longest match) matching; it gives up after the shorter match:

        <e lm="rumal rech che" c="porque"><p>
                <l>umal<s n="n"/><s n="rel"/><s n="px3sg"/><j/>ech<s n="n"/><s n="rel"/><s n="px3sg"/><j/>chi<s n="pr"/><j/>re<s n="prn"/><s n="pers"/><s n="p3"/><s n="sg"/><j/></l>
                <r>rumal<b/>rech<b/>che<s n="cnjadv"/></r>
            </p>
        </e>

        <e lm="rumal rech" c="por cuanto"><p>
                <l>umal<s n="n"/><s n="rel"/><s n="px3sg"/><j/>ech<s n="n"/><s n="rel"/><s n="px3sg"/><j/></l>
                <r>rumal<b/>rech<s n="cnjadv"/></r>
            </p>
        </e>
$ echo "^umal<n><rel><px3sg>$ ^ech<n><rel><px3sg>$ ^chi<pr>$ ^re<prn><pers><p3><sg>$" | lsx-proc quc-spa.autoseq.bin 
^rumal rech<cnjadv>$ ^chi<pr>$ ^re<prn><pers><p3><sg>$

If I delete the shorter entry rumal rech, I get:

$ echo "^umal<n><rel><px3sg>$ ^ech<n><rel><px3sg>$ ^chi<pr>$ ^re<prn><pers><p3><sg>$" | lsx-proc quc-spa.autoseq.bin 
^rumal rech che<cnjadv>$ 
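
For reference, the behaviour expected here can be sketched in Python as a left-to-right scan that prefers the longest matching rule at each position. This illustrates the expected result only; it is not how lsx-proc is implemented, and lrlm_rewrite and its rule format are hypothetical.

```python
def lrlm_rewrite(units, rules):
    """Apply rules left to right, preferring the longest match each time."""
    output = []
    i = 0
    while i < len(units):
        best = None
        for pattern, replacement in rules:
            n = len(pattern)
            if n == 0:
                continue  # skip empty patterns so the scan always advances
            if units[i:i + n] == list(pattern):
                if best is None or n > len(best[0]):
                    best = (pattern, replacement)
        if best is None:
            output.append(units[i])  # no rule applies: copy the unit
            i += 1
        else:
            output.append(best[1])
            i += len(best[0])
    return output


UMAL = "^umal<n><rel><px3sg>$"
ECH = "^ech<n><rel><px3sg>$"
CHI = "^chi<pr>$"
RE = "^re<prn><pers><p3><sg>$"

rules = [
    ((UMAL, ECH), "^rumal rech<cnjadv>$"),
    ((UMAL, ECH, CHI, RE), "^rumal rech che<cnjadv>$"),
]

print(lrlm_rewrite([UMAL, ECH, CHI, RE], rules))
# ['^rumal rech che<cnjadv>$']
```

With both entries present, the four-unit rule wins over the two-unit one, which is the output the report expects from lsx-proc.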

Is it possible to enforce a space?

Almost an inverse of #11, it'd be nice if we could have some way of enforcing a space.

If the input is "a, b" (three lexical units) and the rule outputs two units "c d", then separable notices that there's no space between the first two input units and uses that empty string as the space between the first two output units, so we get "cd" instead of "c d".

$ cat b.lsx
<?xml version="1.0" encoding="UTF-8"?>
<dictionary type="separable">
  <alphabet></alphabet>
  <sdefs>
    <sdef n="ex" c="Exasperative"/>
    <sdef n="ir" c="Irritative"/>
 </sdefs>

  <pardefs>
    <pardef n="meh">
      <e><i><w/><t/><j/></i></e>
    </pardef>
  </pardefs>

  <section id="main" type="standard">
    <e>
      <p><l>a<t/><j/></l> <r></r></p>
      <p><l>,<t/><j/></l> <r></r></p>
      <p><l>b<t/><j/></l> <r></r></p>
      <p><l></l>          <r>c<s n="ex"/><j/></r></p>
      <p><l></l>          <r>d<s n="ir"/><j/></r></p>
    </e>
  </section>

</dictionary>


$ lsx-comp lr b.lsx b.bin
main@standard 17 19

$ echo '^a<ir>$^,<cm>$ ^b<ex>$' | lsx-proc b.bin
^c<ex>$^d<ir>$

Expected:

^c<ex>$ ^d<ir>$

weights

lsx-comp makes an lttoolbox fst, which should support weights. Weights would be useful for making override rules without having to always specify longer contexts just to make LRLM DTRT.

It seems that lsx-comp currently ignores the w attribute (lt-print shows just 0.00000 weights).

Does not work unless <j/> manually added at end of every line

@ftyers reports that lsx-proc doesn't parse correctly unless <j/> is added to the end of every entry. See #6 for discussion of this issue.

Three possible solutions appear to exist:

  • The compiler could add <j/> automatically to the end of every entry.
  • The parser could assume <j/> at the end of every entry.
  • Users could be made aware of this [somewhat arbitrary?] requirement. The documentation would need to be updated.

LU doesn't delete after combining

Input:

;!^take<vblex><past>$ !^Ramiro<np><ant><m><sg>$ ;;^out<adv>$ ^of<pr>$ ^a<det><ind><sg>$

;!^take<vblex><past>$ ^Ramiro<np><ant><m><sg>$ ^out<adv>$ ^of<pr>$ ^a<det><ind><sg>$

Output:

;!^take# out<vblex><sep><past>$ !^Ramiro<np><ant><m><sg>$ ;;^<adv>$ ^of<pr>$ ^a<det><ind><sg>$

;!^take# out<vblex><sep><past>$ ^Ramiro<np><ant><m><sg>$ ^<adv>$ ^of<pr>$ ^a<det><ind><sg>$

Doesn't work for simple example

<dictionary type="sequential">
<sdefs>
<sdef n="det"/>
<sdef n="abl"/>
<sdef n="dem"/>
<sdef n="n"/>
<sdef n="cnjadv"/>
</sdefs>
<section id="main" type="standard">
<e><p><l>bu<s n="det"/><s n="dem"/><j/>yüz<s n="n"/><s n="abl"/></l>
      <r>bu<b/>yüzden<s n="cnjadv"/></r></p></e>
</section>
</dictionary>

Then compile:

$ lsx-comp lr apertium-tur-uzb.tur-uzb.lsx tur-uzb.autosep.bin
main@standard 11 10

Show the transducer:

$ lt-print tur-uzb.autosep.bin
0	1	b	b	0.000000	
1	2	u	u	0.000000	
2	3	<det>	 	0.000000	
3	4	<dem>	y	0.000000	
4	5	<$>	ü	0.000000	
5	6	y	z	0.000000	
6	7	ü	d	0.000000	
7	8	z	e	0.000000	
8	9	<n>	n	0.000000	
9	10	<abl>	<cnjadv>	0.000000	
10	0.000000

But it doesn't work:

$ echo "^bu<det><dem>$ ^yüz<n><abl>$" | lsx-proc tur-uzb.autosep.bin 
^bu<det><dem>$ ^yüz<n><abl>$

Expected output is:

^bu yüzden<cnjadv>$

@jonorthwash @itang1 @unhammer any ideas?

trouble with optional paradigm block

From apertium-eng-kir.eng-kir.lsx

		<e lm="make really hot" c="ысыт">
			<p>
				<l>make<s n="vblex"/></l>
				<r>make<b/>hot<s n="vblex"/></r>
			</p>
			<i><t/><j/></i>
			<par n="SN"/>
			<par n="SAdv"/>
			<p>
				<l>hot<s n="adj"/><s n="sint"/><j/></l>
				<r></r>
			</p>
		</e>

This rule doesn't apply, apparently (tested, confirmed) because of the shorter version of the rule, without SAdv.

In theory, though, there should be an easy way to specify "optional SAdv", so these can be condensed into one rule. Defining such as below does not solve this—it matches the simple rule but not any of the versions with adverbs.

		<pardef n="optSAdv">
			<e></e>
			<e><par n="adv"/></e>
			<e><par n="adv"/><par n="adv"/></e>
		</pardef>

A sentence that can be tested is "Бул тон мени (аябай) ысытып атат" with expected output of "This fur coat is making me (really) hot."

apertium-filter-rules for lsx files

It would be very useful for the oci-fra pair to be able to filter rules in the apertium-separable files, e.g.

    <e lm="far mestièr" v="oci">
      <p><l>far<s n="vblex"/></l><r>far<g><b/>mestièr</g><s n="vblex"/></r></p>
      <i><t/><d/></i>
      <p><l>mestièr<s n="n"/><s n="m"/><s n="sg"/><d/></l><r></r></p>
    </e>

I think only <e> and <pardef> need to be filtered.

Tags on individual unchanged LU's are spread across the whole matching rule

echo 'Hos personer med <a class="crossref" href="https://sml.snl.no/atopisk_eksem">atopisk eksem</a> foreligger ofte arvelige faktorer' | apertium -u -f html-noent nob-nno_e

Output after apertium-pretransfer -z:

^hos<pr><Aa><@adv>$
^person<n><m><pl><ind><aa><@←p-utfyll>$
^med<pr><aa><@adv>$
[[t:a:_3bPiw]]^atopisk<adj><pst><nt><sg><ind><aa><@adj→>$
[[t:a:_3bPiw]]^eksem<n><nt><sg><ind><aa><@←p-utfyll>$
^foreligge<vblex><pres><aa><@fv>$
^ofte<adv><aa><@adv>$
^arvelig<adj><pst><un><pl><ind><aa><@adj→>$
^faktor<n><m><pl><ind><aa><@subj>$^.<sent><clb><aa>$[]

Output after lsx-proc -z -w 'nob-nno.autoseq.bin':

^hos<pr><Aa><@adv>$
^person<n><m><pl><ind><aa><@←p-utfyll>$
[[t:a:_3bPiw; t:a:_3bPiw]]^med<pr><aa><@adv>$
[[t:a:_3bPiw; t:a:_3bPiw]]^atopisk<adj><pst><nt><sg><ind><aa><@adj→>$
[[t:a:_3bPiw; t:a:_3bPiw]]^eksem<n><nt><sg><ind><aa><@←p-utfyll>$
[[t:a:_3bPiw; t:a:_3bPiw]]^ligge<vblex><pres><aa><@fv>$
[[t:a:_3bPiw; t:a:_3bPiw]]^ofte<adv><aa><@adv>$
[[t:a:_3bPiw; t:a:_3bPiw]]^arvelig<adj><pst><un><pl><ind><aa><@adj→>$
[[t:a:_3bPiw; t:a:_3bPiw]]^faktor<n><m><pl><ind><aa><@subj>$
[[t:a:_3bPiw; t:a:_3bPiw]]^fore<adv>$^.<sent><clb><aa>$[]

How to keep caps?

 $ echo '^Den<det><dem><nt><sg>$ ^enkelt<adj><pst><un><sp><def>$ ^departement<n><nt><sg><ind>$'|lsx-proc nob-nno.autoseq.bin
^hver<det><qnt><nt><sg>$ ^enkelt<adj><pst><nt><sg><ind>$ ^departement<n><nt><sg><ind>$
 
– how do you get it to carry over the caps?

lsx has `  <p><l>den<s n="det"/></l><r>hver<s n="det"/></r></p>`

Can't compile on Manjaro linux ARM

Hi, I'm trying to install Apertium on a Pinebook Pro (ARM platform).

It crashes when compiling apertium-separable:

# setup environment
LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:${PKG_CONFIG_PATH}
export PKG_CONFIG_PATH

# install dependencies
sudo pacman --sync --needed expat gawk libxslt pcre gcc-libs libxml2 cmake icu boost gperftools utf8cpp

# install apertium 
git clone https://github.com/apertium/lttoolbox.git 
git clone https://github.com/apertium/apertium.git
git clone https://github.com/apertium/apertium-lex-tools.git
cd lttoolbox
./autogen.sh
make
sudo make install
sudo ldconfig
cd ..
cd apertium
./autogen.sh
make
sudo make install
sudo ldconfig
cd ..
cd apertium-lex-tools
./autogen.sh
make
sudo make install
sudo ldconfig
cd ..
# install cg3
git clone https://github.com/GrammarSoft/cg3 # your documentation is outdated, svn repository doesn't exist anymore
cd cg3
./cmake.sh
make -j3
sudo make install
cd ..
# install apertium-separable
git clone https://github.com/apertium/apertium-separable
cd apertium-separable
./autogen.sh
./configure
make # HERE IS THE BUG!
Making all in src
make[1]: Entering directory '/home/regivanx/apertium-separable/src'
g++ -DPACKAGE_NAME=\"apertium-separable\" -DPACKAGE_TARNAME=\"apertium-separable\" -DPACKAGE_VERSION=\"0.7.0\" -DPACKAGE_STRING=\"apertium-separable\ 0.7.0\" -DPACKAGE_BUGREPORT=\"[email protected]\" -DPACKAGE_URL=\"\" -DPACKAGE=\"apertium-separable\" -DVERSION=\"0.7.0\" -DHAVE_LIBXML2=1 -I.   -Wall -Wextra  -I/usr/local/include  -I/usr/include/libxml2    -Wall -Wextra -g -O2 -std=c++23 -MT lsx_processor.o -MD -MP -MF .deps/lsx_processor.Tpo -c -o lsx_processor.o lsx_processor.cc
In file included from /usr/local/include/lttoolbox/alphabet.h:26,
                 from lsx_processor.h:4,
                 from lsx_processor.cc:1:
/usr/local/include/lttoolbox/ustring.h:25:10: fatal error: utf8.h: No such file or directory
   25 | #include <utf8.h>
      |          ^~~~~~~~
compilation terminated.
make[1]: *** [Makefile:358: lsx_processor.o] Error 1
make[1]: Leaving directory '/home/regivanx/apertium-separable/src'
make: *** [Makefile:400: all-recursive] Error 1

Error: Trying to link nonexistent states

Minimal reproduction:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary type="separable">
  <pardefs>
	<pardef n="e">
	  <e>
		<i>e</i>
	  </e>
	</pardef>
	<pardef n="penn">
	  <e><i>p</i></e>
	</pardef>
  </pardefs>

  <section id="main" type="standard">
	<e><par n="e"/><par n="penn"/></e> <!-- broken -->
	<e><i>e</i><par n="penn"/></e>     <!-- ok -->
  </section>
</dictionary>

rl-compiled transducer fills up memory

This is the relevant entry in apertium-eng-deu.eng-deu.lsx:

    <e lm="switch off" c="abschalten">
      <p><l>switch<s n="vblex"/></l><r>switch<g><b/>off</g><s n="vblex"/><s n="sep"/></r></p><i><t/><j/></i>
      <par n="SN"/><p><l>off<t/><j/></l><r></r></p>
    </e>

Compiled like this:

lsx-comp rl apertium-eng-deu.eng-deu.lsx eng-deu.revautoseq.bin

Testing like this:

$ echo "^switch# off<vblex><sep><past>$ ^the<det><def><sp>$ ^light<n><pl>$^.<sent>$" | lsx-proc eng-deu.revautoseq.bin

The result is that it eats tons of memory and needs to be killed.

If I test it as follows, it returns some output before it starts eating memory again:

$ echo "^PRPERS<prn><subj><p1><mf><sg>$ ^switch# off<vblex><sep><past>$ ^the<det><def><sp>$ ^light<n><pl>$^.<sent>$" | lsx-proc eng-deu.revautoseq.bin
^PRPERS<prn><subj><p1><mf><sg>$^C

lsx-proc eats final blank

On macOS and older distros, the input (http://sprunge.us/FrmAHK)

echo -ne '^Apertium<np><al><m><sg>$ ^être<vbser><pri><p3><sg>$ ^un<det><ind><m><sg>$ ^logiciel<n><m><sg>$ ^de<pr>$ ^traduction<n><f><sg>$ ^automatique<adj><mf><sg>$^.<sent>$[][\n]' | lsx-proc fra-cat.autosep.bin

yields the output

^Apertium<np><al><m><sg>$ ^être<vbser><pri><p3><sg>$ ^un<det><ind><m><sg>$ ^logiciel<n><m><sg>$ ^de<pr>$ ^traduction<n><f><sg>$ ^automatique<adj><mf><sg>$^.<sent>$

where the final blanks are missing.

Works on Ubuntu 20.04, weirdly enough.

Potentially related to #26
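
A quick way to check for the missing final blank is to compare what follows the last lexical unit in the input and in the output. The sketch below is only an illustration, not part of lsx-proc: `trailing_blank` is a hypothetical helper, and it assumes that blanks never contain an unescaped `$`.

```python
def trailing_blank(stream: str) -> str:
    """Return everything after the last unescaped '$' in the stream."""
    i = len(stream) - 1
    while i >= 0:
        if stream[i] == "$":
            # Count the backslashes right before this '$';
            # an odd count means the '$' itself is escaped.
            backslashes = 0
            j = i - 1
            while j >= 0 and stream[j] == "\\":
                backslashes += 1
                j -= 1
            if backslashes % 2 == 0:
                return stream[i + 1:]
        i -= 1
    # No lexical unit found: the whole stream is blank material.
    return stream


if __name__ == "__main__":
    sent = "^automatique<adj><mf><sg>$^.<sent>$[][\n]"
    received = "^automatique<adj><mf><sg>$^.<sent>$"
    # The bug is present when the two tails differ.
    print(trailing_blank(sent) == trailing_blank(received))
```

Feeding the helper the input of the command above and the output captured from lsx-proc makes the regression easy to spot in a test script.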

Double free in lsx-comp

Building apertium-fra-cat-1.3.0 fails on OpenBSD due to a double free in lsx-comp. (apertium-3.5.1, lttoolbox-3.4.1, apertium-separable-0.3.0)

lsx-comp lr apertium-fra-cat.fra-cat.l1x fra-cat.autosep.bin
main@standard 28 32
lsx-comp(68548) in free(): chunk is already free 0x100a2b5489c0
gmake: *** [Makefile:770: fra-cat.autosep.bin] Abort trap (core dumped)
gmake: *** Deleting file 'fra-cat.autosep.bin'

This is 100% reproducible on my system. Unfortunately the backtrace isn’t much help since the crash apparently happens at program exit.

Possible to use for matching on forms?

It seems we can almost use lsx-proc for matching on readings that include forms, the only thing that's missing is not escaping the slash:

input.txt:

^i/i<pr><aa><@adv>$
^lov/lov<n><m><sg><ind><aa><@←p-utfyll>$
^om/om<pr><aa><@adv>$
^frittståande/*frittståande$
^skolar/*skolar$

rules.lsx:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary type="separable">
  <alphabet></alphabet>
  <sdefs>
    <sdef n="np"      c="Proper noun"/>
    <sdef n="pr"      c="Preposition"/>
  </sdefs>
  <pardefs>
    <pardef n="reading:" c="match and drop readings (incl. tagless/unknown). Includes end delimiter">
      <e><p><l>/<w/><d/></l>    <r/></p></e>
      <e><p><l>/<w/><t/><d/></l><r/></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e>
      <i><w/><s n="pr"/><t/><d/></i>
      <p><l>lov</l>          <r></r></p> <par n="reading:"/>
      <p><l>om</l>           <r></r></p> <par n="reading:"/>
      <p><l>frittståande</l> <r></r></p> <par n="reading:"/>
      <p><l>skolar</l>       <r></r></p> <par n="reading:"/>
      <p><l></l> <r>lov<b/>om<b/>frittståande<b/>skolar/lov<b/>om<b/>frittståande<b/>skolar<s n="np"/><d/></r></p>
    </e>
  </section>
</dictionary>

GOT:

$ lsx-comp lr rules.lsx rules.bin
$ lsx-proc rules.bin < input.txt
^i\/i<pr><aa><@adv>$
^lov om frittståande skolar\/lov om frittståande skolar<np>$

EXPECTED:

$ lsx-comp lr rules.lsx rules.bin
$ lsx-proc rules.bin < input.txt
^i/i<pr><aa><@adv>$
^lov om frittståande skolar/lov om frittståande skolar<np>$

Maybe we could have a special symbol for the reading separator (a slash that shouldn't be escaped)? Then we could write:

  <pardefs>
    <pardef n="reading:" c="match and drop readings (incl. tagless/unknown). Includes end delimiter">
      <e><p><l><reading-separator/><w/><d/></l>    <r/></p></e>
      <e><p><l><reading-separator/><w/><t/><d/></l><r/></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e>
      <i><w/><reading-separator/><w/><s n="pr"/><t/><d/></i>
      <p><l>lov</l>          <r></r></p> <par n="reading:"/>
      <p><l>om</l>           <r></r></p> <par n="reading:"/>
      <p><l>frittståande</l> <r></r></p> <par n="reading:"/>
      <p><l>skolar</l>       <r></r></p> <par n="reading:"/>
      <p><l></l> <r>lov<b/>om<b/>frittståande<b/>skolar<reading-separator/>lov<b/>om<b/>frittståande<b/>skolar<s n="np"/><d/></r></p>
    </e>
  </section>

(tag name to be bikeshod. bikeshotten. bikeshought. bikeshawn)


Also, <w/> should maybe not match unescaped / (though it should match escaped, e.g. if lemma is "A/B-testing" which in stream format is ^A\/B-testing/A\/B-testing<n><sg>$ ).
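
The difference between an escaped and an unescaped slash can be shown with a small Python sketch. This is not how lsx-proc parses the stream; `split_readings` is a hypothetical helper that takes the text between `^` and `$` and splits it on unescaped `/` only, so an escaped `\/` stays inside the lemma.

```python
def split_readings(lu: str) -> list[str]:
    """Split the inside of a lexical unit on unescaped '/' only."""
    parts = []
    current = []
    i = 0
    while i < len(lu):
        ch = lu[i]
        if ch == "\\" and i + 1 < len(lu):
            # Keep the backslash and the character it escapes together.
            current.append(lu[i:i + 2])
            i += 2
            continue
        if ch == "/":
            # An unescaped slash separates the surface form from a reading.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


if __name__ == "__main__":
    print(split_readings(r"A\/B-testing/A\/B-testing<n><sg>"))
    print(split_readings("i/i<pr><aa><@adv>"))
```

With this kind of split, the first element is the surface form and the remaining elements are readings, which is the distinction a reading-separator symbol would have to encode.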

Compile error because of missing method

fran@ipek:~/source/apertium/trunk/apertium-separable$ make
Making all in src
make[1]: Entering directory '/home/fran/source/apertium/trunk/apertium-separable/src'
g++ -DPACKAGE_NAME=\"apertium-separable\" -DPACKAGE_TARNAME=\"apertium-separable\" -DPACKAGE_VERSION=\"0.3.4\" -DPACKAGE_STRING=\"apertium-separable\ 0.3.4\" -DPACKAGE_BUGREPORT=\"[email protected]\" -DPACKAGE_URL=\"\" -DPACKAGE=\"apertium-separable\" -DVERSION=\"0.3.4\" -DHAVE_LIBXML2=1 -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DECL_FREAD_UNLOCKED=1 -DHAVE_DECL_FWRITE_UNLOCKED=1 -DHAVE_DECL_FGETC_UNLOCKED=1 -DHAVE_DECL_FPUTC_UNLOCKED=1 -DHAVE_DECL_FPUTS_UNLOCKED=1 -DHAVE_DECL_FGETWC_UNLOCKED=0 -DHAVE_DECL_FPUTWC_UNLOCKED=0 -DHAVE_DECL_FGETWS_UNLOCKED=0 -DHAVE_DECL_FPUTWS_UNLOCKED=0 -I.   -Wall -Wextra  -I/home/fran/local/include/apertium-3.6 -I/home/fran/local/lib/apertium-3.6/include -I/home/fran/local/include/lttoolbox-3.5 -I/usr/include/libxml2 -I/usr/include/libxml2  -Wall -Wextra -g -O2 -std=c++2a -MT lsx_processor.o -MD -MP -MF .deps/lsx_processor.Tpo -c -o lsx_processor.o lsx_processor.cc
lsx_processor.cc: In member function ‘void LSXProcessor::processWord(FILE*, FILE*)’:
lsx_processor.cc:236:64: error: no matching function for call to ‘State::step_override(__gnu_cxx::__alloc_traits<std::allocator<wchar_t>, wchar_t>::value_type&, wint_t, int&, __gnu_cxx::__alloc_traits<std::allocator<wchar_t>, wchar_t>::value_type&)’
         s.step_override(lu[i], towlower(lu[i]), any_char, lu[i]);
                                                                ^
In file included from lsx_processor.h:7,
                 from lsx_processor.cc:1:
/home/fran/local/include/lttoolbox-3.5/lttoolbox/state.h:187:8: note: candidate: ‘void State::step_override(int, int, int)’
   void step_override(int const input, int const old_sym, int const new_sym);
        ^~~~~~~~~~~~~
/home/fran/local/include/lttoolbox-3.5/lttoolbox/state.h:187:8: note:   candidate expects 3 arguments, 4 provided
Makefile:362: recipe for target 'lsx_processor.o' failed
make[1]: *** [lsx_processor.o] Error 1
make[1]: Leaving directory '/home/fran/source/apertium/trunk/apertium-separable/src'
Makefile:401: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1

In lttoolbox, the latest commit is:

$ git log | head
commit acdfe58b793440e3fc6cc7e23635178e696fd460
Author: Francis Tyers <[email protected]>
Date:   Sun Jul 5 21:54:05 2020 +0100

    deprecate SAO code

Issue with blanks

Here's some testing I did:

Testing behaviour with wordbound blanks

Command: lsx-proc eng-spa.autoseq.bin

separable input:
^the<det><def><sp>$ [[t:i:123456]]^Aragonese<n><sg>$ [[t:b:basfs]]^take<vblex><past>$ [[t:s:123545]]^Ramiro<np><ant><m><sg>$ [[t:x:abc123]]^out of<pr>$ [[t:y:vdfdrf]]^a<det><ind><sg>$
^monastery<n><sg>$ ^and<cnjcoo>$ ^make<vblex><pp>$ ^prpers<prn><obj><p3><m><sg>$ ^king<n><sg>$^.<sent>$

separable output:
^the<det><def><sp>$ [[t:i:123456]]^Aragonese<n><sg>$ [[t:b:basfs]]^take# [[t:s:123545]]out<vblex><sep><past>$ [[t:x:abc123]]^Ramiro<np><ant><m><sg>$ ^of<pr>$ [[t:y:vdfdrf]]^a<det><ind><sg>$
^monastery<n><sg>$ ^and<cnjcoo>$ ^make<vblex><pp>$ ^prpers<prn><obj><p3><m><sg>$ ^king<n><sg>$^.<sent>$

Testing if the issue is the "out of"

separable input2:
[[t:b:basfs]]^take<vblex><past>$ [[t:s:123545]]^Ramiro<np><ant><m><sg>$ [[t:x:abc123]]^out<adv>$ ^of<pr>$ [[t:y:vdfdrf]]^a<det><ind><sg>$

separable output2:
[[t:b:basfs]]^take# [[t:s:123545]]out<vblex><sep><past>$ [[t:x:abc123]]^Ramiro<np><ant><m><sg>$ ^<adv>$ ^of<pr>$ [[t:y:vdfdrf]]^a<det><ind><sg>$

Testing if issue exists with normal blanks as well

separable input3:
^the<det><def><sp>$ [<div>]^Aragonese<n><sg>$ [</div>]^take<vblex><past>$ [@tmp:123456]^Ramiro<np><ant><m><sg>$ [<b><i>]^out of<pr>$ []^a<det><ind><sg>$

separable output3:
^the<det><def><sp>$ [<div>]^Aragonese<n><sg>$ [</div>]^take# [@tmp:123456]out<vblex><sep><past>$ [<b><i>]^Ramiro<np><ant><m><sg>$ ^of<pr>$ []^a<det><ind><sg>$

Testing with characters outside LUs (without [..] blanks)

separable input4:
^the<det><def><sp>$ !!^Aragonese<n><sg>$ ;^take<vblex><past>$ ;.^Ramiro<np><ant><m><sg>$   !;^out of<pr>$ ^a<det><ind><sg>$

separable output4:
^the<det><def><sp>$ !!^Aragonese<n><sg>$ ;^take# ;.out<vblex><sep><past>$   !;^Ramiro<np><ant><m><sg>$ ^of<pr>$ ^a<det><ind><sg>$

This seems to be a general issue of LUs not being read as a unit. I'll look for a solution anyway, since I have to modify the parsing for wordbound blanks, but I thought I'd file an issue as well in case anyone has any thoughts.
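
To make "reading LUs as a unit" concrete, here is a minimal Python sketch of a tokeniser that keeps each `^...$` intact and treats everything between two LUs as a single blank. It is an illustration only, not the lsx-proc parser: `tokenize` and the `LU` pattern are hypothetical names, and the sketch assumes that blanks (including `[[...]]` wordbound blanks) contain no unescaped `^` or `$`.

```python
import re

# One lexical unit: '^', then any mix of escaped characters and
# characters other than backslash or '$', then the closing '$'.
LU = re.compile(r"\^(?:\\.|[^\\$])*\$")


def tokenize(stream: str) -> list[tuple[str, str]]:
    """Return ('blank', text) and ('lu', text) tokens in stream order."""
    tokens = []
    pos = 0
    for match in LU.finditer(stream):
        if match.start() > pos:
            # Everything between two LUs is kept as one blank.
            tokens.append(("blank", stream[pos:match.start()]))
        tokens.append(("lu", match.group()))
        pos = match.end()
    if pos < len(stream):
        tokens.append(("blank", stream[pos:]))
    return tokens


if __name__ == "__main__":
    line = "[[t:b:basfs]]^take<vblex><past>$ [[t:s:123545]]^Ramiro<np><ant><m><sg>$"
    for kind, text in tokenize(line):
        print(kind, repr(text))
```

Because every blank stays attached to its position between whole LUs, a reordering step working on such tokens could move an LU without splitting a blank into the middle of another LU, which is what the outputs above show happening.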

lsx-comp not running as before

When fra-frp was released, as stated in the documentation, sentences like j’ai toujours besoin, je n’ai pas besoin or je n’ai pas toujours besoin were translated as j’é tojorn fôta, j’é pas fôta and j’é pas tojorn fôta. This was done thanks to this rule in apertium-fra-frp.fra-frp.l1x:

    <e lm="avoir besoin">
      <p><l>avoir<s n="vblex"/></l><r>avoir<g><b/>besoin</g><s n="vblex"/></r></p>
      <i><t/><j/></i>
      <par n="adv"/>
      <p><l>besoin<s n="n"/><s n="m"/><t/><j/></l><r></r></p>
    </e>

Currently, neither this rule nor, seemingly, any other rule in apertium-fra-frp.fra-frp.l1x is being matched:

$ echo "je n'ai pas besoin" | apertium -d . fra-frp-lsx1
^je<prn><tn><p1><mf><sg>$ ^ne<adv>$ ^avoir<vblex><pri><p1><sg>$ ^pas<adv>$ ^besoin<n><m><sg>$^.<sent>$

$ echo "je n'ai pas besoin" | apertium -d . fra-frp
j'é pas besouen

This is also happening on apertium.org.

Add a way to copy tags in output

<t/> consumes tags... there should also be a way to copy tags.

        <e lm="utz il" c="gustarse"><p>
                <l>utz<s n="adj"/><j/>il<s n="v"/><s n="tv"/><t/><j/></l>
                <r>utz<b/>il<s n="v"/><s n="iv"/></r>
            </p>
        </e>
