divvun / divvunspell

Spell checking library for ZHFST/BHFST spellers, with case handling and tokenization support. (Spell checking derived from hfst-ospell)

License: Apache License 2.0

Rust 93.41% C 1.74% C++ 1.16% CSS 0.35% HTML 0.14% JavaScript 0.60% Svelte 2.44% Nix 0.16%

hfst rust fst spellchecking box

divvunspell's People

Contributors

killercup atul9 fry albbas snorrees bbqsrc icodein

divvunspell's Issues

analyse option

hfst-ospell has the option -a, --analyse: Analyse strings and corrections.
It would be very useful if that option worked in divvunspell too.

Punctuation considered part of words

This causes correctly spelled words to be underlined when followed by punctuation (full stop, comma, semicolon etc). The suggestion is usually the same word without the punctuation mark.

It seems to be a tokeniser regression.

divvunspell does not build for Android, needs updated NDK

Cf https://github.com/divvun/divvunspell/runs/24607451880

Building targets (armeabi-v7a, arm64-v8a)
Building armeabi-v7a (armv7-linux-androideabi)
Exporting CARGO_NDK_ANDROID_TARGET="armeabi-v7a"
error: NDK versions less than r23 are not supported. Install an up-to-date version of the NDK.
::error::The process '/root/.cargo/bin/cargo' failed with exit code 1
Process exited with code: 1

Missing README and usage

Hello! I was wondering if there could be a simple getting started guide in the README that would explain how to embed this in mobile applications. Currently, there is no README or LICENSE.

EDIT: To clarify, I'm trying to see if this implementation is viable to embed within a predictive text engine for mobile platforms. However, I need a license to determine if I can use it at all. Also, as with most open source libraries, I would need the bare minimum of documentation (entry points, an example of how to link with my application). Adding both a simple README and a LICENSE will go a long way to help me and other coders use and support this library! 😃

morphology-based / word-part completion model

This is based on ideas by @aarppe, and may require both a special error model, and a special acceptor: one needs to both suggest and accept word fragments during typing.

Mixed case bug - mixed case accepted

Cf the following test case from Estonian:

echo KÕige | divvunspell suggest -a tools/spellcheckers/et-desktop.zhfst 
Reading from stdin...
Input: KÕige		[CORRECT]
Input: 		[CORRECT]

Compare to hfst-ospell-office (correct behavior):

echo '5 KÕige' | hfst-ospell-office tools/spellcheckers/et-desktop.zhfst 
@@ hfst-ospell-office is alive
&	Käige	Kõige	Kiige	Kaige

Expected behavior: mixed case should only be accepted if found in the lexicon with the exact same casing, ie the suggestions should only keep the casing of the input string if the suggestion string matches the lexical case completely. In all other cases it should be initial upper case if input is initial upper case, or if the lexicon requires an initial upper case (as in names), or all upper case if input is all upper.
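The casing rule for suggestions described above can be sketched roughly as follows. This is a minimal illustration and not divvunspell's implementation: the function name `recase` is hypothetical, and the sketch assumes the speller returns each suggestion in its lexical casing.

```python
def recase(input_word: str, suggestion: str) -> str:
    """Apply the input's casing pattern to a suggestion given in lexical case."""
    if not input_word or not suggestion:
        return suggestion
    # All-upper input: return the suggestion in all upper case.
    if len(input_word) > 1 and input_word.isupper():
        return suggestion.upper()
    # Initial-upper input (this includes mixed case such as "KÕige"):
    # only the first letter is raised, the rest stays in lexical case.
    if input_word[0].isupper():
        return suggestion[0].upper() + suggestion[1:]
    # Lower-case input: keep lexical case, so capitals required by the
    # lexicon (as in names) survive.
    return suggestion


print(recase("KÕige", "kõige"))             # Kõige
print(recase("IDENTITETE", "identiteete"))  # IDENTITEETE
print(recase("tallinn", "Tallinn"))         # Tallinn
```

Because the function only ever raises letters, a suggestion that is capitalised in the lexicon is never lowered.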

The zhfst file is attached:
et-desktop.zhfst.zip

slowness issue

I was demoing the morphology-aware spelling prediction stuff here at the conference and ran into a weird slowness issue that I might need help to debug and profile. I have a zhfst and an input that work instantaneously in hfst-ospell but take 20 minutes in divvunspell. At first I thought the problem was in my code, but it seems to happen without the analysis option as well:

$ echo niwa | time hfst-ospell -a crk-neu.zhfst -n 10
...
1.13user 0.02system 0:01.18elapsed 97%CPU (0avgtext+0avgdata 54824maxresident)k
$ echo niwa | time cargo run -- suggest --archive crk-neu.zhfst -S -n 10
...
1146.24user 1.33system 19:08.48elapsed 99%CPU (0avgtext+0avgdata 126788maxresident)k

Perhaps I am doing something simple wrong? I can upload the zhfst after the conference, when I have a better network connection and more battery.

divvunspell having problems with Cyrillic capital letters (it seems)

To repeat:

uit-mac-443 lang-mns (main)$ e Эспоо|humnsNorm
Эспоо	Эспоо+N+Prop+Sem/Plc+Sg+Nom	0,000000

uit-mac-443 lang-mns (main)$ e Эспоо|divvunspell suggest -a tools/spellcheckers/mns.zhfst 
Reading from stdin...
Input: Эспоо		[INCORRECT]

uit-mac-443 lang-mns (main)$ e Эспоо|hfst-ospell -S -n 5 tools/spellcheckers/mns.zhfst 
"Эспоо" is in the lexicon...

The form Эспоо is in the normative fst, and is recognised by hfst-ospell, but not by divvunspell, even though both spellers access the same freshly compiled version of the speller, mns.zhfst.

The lemma (Эспоо) is from the shared-urj-Cyrl, but that should not be relevant (it is in the resulting fst). The editdist.default.txt file declares the е/э pair (and, on a side note, also the composed long э̄). Capital letters are not declared explicitly:

uit-mac-443 lang-mns (main)$ grep э tools/spellcheckers/editdist.default.txt 
э
э̄
е	э	2
э	е	2

Complete "CHFST" support

Will document this properly later.

Underlining, misspelled string/suggestions one off for combining diacritics

If the misspelled string contains a combining diacritic (instead of a precomposed diacritic), it seems that divvunspell is one off:

If I switch to another speller engine, the problem disappears:

The exact input string used is:

Giélelutnjeme jih nuvviDspeller Gielelutnjeme lea buerie.

The speller version is:

and divvunspell 0.2.0, running on macOS Mojave 10.14.2.
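One way an off-by-one like this can arise is when offsets are counted in one unit and applied in another. The sketch below is an assumption for illustration, not a diagnosis of this bug. It shows that the first word of the test string has different lengths depending on whether "é" is precomposed or written with a combining accent, and on the unit counted.

```python
import unicodedata

# "Giélelutnjeme" with "é" written as e + U+0301 COMBINING ACUTE ACCENT
decomposed = "Gie\u0301lelutnjeme"
# The same word with the precomposed "é" (U+00E9)
composed = unicodedata.normalize("NFC", decomposed)


def lengths(s: str) -> dict:
    """Length of s in three units that text APIs commonly use for offsets."""
    return {
        "code points": len(s),
        "UTF-16 units": len(s.encode("utf-16-le")) // 2,
        "UTF-8 bytes": len(s.encode("utf-8")),
    }


print(lengths(decomposed))  # {'code points': 14, 'UTF-16 units': 14, 'UTF-8 bytes': 15}
print(lengths(composed))    # {'code points': 13, 'UTF-16 units': 13, 'UTF-8 bytes': 14}
```

If one component measures the word in the composed form and another in the decomposed form, the reported range ends one position off.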

Handle sPoNgEbOb case

The pathological case of someone arbitrarily capitalising chunks of words should be handled properly.

Add divvunspell support to Enchant

Filed here for lack of a better place:

We would like to have support for divvunspell in most Linux distros, and the easiest way to do that is to add support for divvunspell to Enchant, in the same way as (lib)voikko is supported, cf AbiWord/enchant#259.

The arguments for doing this are as follows:

divvunspell is faster than hfst-ospell
we get support for the bhfst format (only supported by divvunspell)
we get consistent suggestion behavior across all platforms we support

divvunspell fails to find suggestions hfst-ospell does

It seems that hfst-ospell does a better job of considering all possible
tokenisations of a word; divvunspell fails to offer some suggestions when there
are multiple tokenisations (due to multichar symbols).

After a lot of work, I derived this explanation by finding a minimal failing
example; I hope this effort helps with fixing the bug!

Script to reproduce (run in an empty directory):

echo -e "cat\nca\ncb\ncbt" > multichar
echo -e "cbt:cat\ncb:ca" > orth
echo '?*' | hfst-regexp2fst -o anystar.hfst
hfst-strings2fst -j -m multichar < orth \
| hfst-concatenate anystar.hfst - \
| hfst-concatenate - anystar.hfst \
| hfst-repeat -f 1 -t 3 \
| hfst-disjunct - anystar.hfst \
| hfst-fst2fst -w > errmodel.default.hfst
echo "cat" | hfst-strings2fst -j -m multichar | hfst-fst2fst -w > acceptor.default.hfst
cat > index.xml <<-EOF
<?xml version="1.0" encoding="utf-8"?>
<hfstspeller dtdversion="1.0" hfstversion="3">
    <info>
        <title>cat</title>
	<locale>xxx</locale>
	<producer>xxx</producer>
	<description>cat</description>
    </info>
    <acceptor type="general" id="acceptor.default.hfst">
        <title>cat</title>
	<description>cat</description>
    </acceptor>
    <errmodel id="errmodel.default.hfst">
        <title>error</title>
	<description>cat</description>
        <type type="default"/>
        <model>errmodel.default.hfst</model>
    </errmodel>
</hfstspeller>
EOF

zip test.zhfst index.xml errmodel.default.hfst acceptor.default.hfst
echo "cbt" | hfst-ospell -S test.zhfst
echo "cbt" | divvunspell -S -z test.zhfst
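To illustrate what "multiple tokenisations" means for the input in this reproduction, the sketch below enumerates every way of splitting "cbt" into symbols, given the multichar symbols from the script (cat, ca, cb, cbt) plus single characters. The function is hypothetical and says nothing about how either speller tokenises internally.

```python
def tokenisations(s: str, symbols: set) -> list:
    """Return every split of s into multichar symbols or single characters."""
    if not s:
        return [[]]
    # A single character is always a valid symbol; multichar symbols are
    # candidates when they match at the current position.
    candidates = {s[0]} | {sym for sym in symbols if s.startswith(sym)}
    results = []
    for sym in sorted(candidates, key=len, reverse=True):
        for rest in tokenisations(s[len(sym):], symbols):
            results.append([sym] + rest)
    return results


multichar = {"cat", "ca", "cb", "cbt"}
for split in tokenisations("cbt", multichar):
    print(split)
# ['cbt']
# ['cb', 't']
# ['c', 'b', 't']
```

A lookup that only follows the longest match would explore the first split and never see the other two.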

Suggestions weight differences between divvunspell and hfst-ospell

Compare the following two commands and their output:

$ echo nikinitawiwapamew | divvunspell --always-suggest --zhfst tools/spellcheckers/fstbased/desktop/hfst/crk.zhfst 
Reading from stdin...
Input: nikinitawiwapamew		[INCORRECT]
nikî-nitawi-wâpamâw		40.458984
naki-nitawi-wâpamêw		49.151657
naki-nitawi-wâpamâw		53.151657
nika-nitawi-wâpamâw		53.151657
nikê-nitawi-wâpamâw		53.151657
nikî-nitawi-asamâw		53.151657
nikî-natawi-wâpamâw		57.151657
nikî-nitawi-wanâmâw		57.151657
niwî-nitawi-wâpamâw		57.151657
kikî-nitawi-wâpamâw		67.15166

$ echo nikinitawiwapamew | hfst-ospell -S -b 16 tools/spellcheckers/fstbased/desktop/hfst/crk.zhfst 
"nikinitawiwapamew" is NOT in the lexicon:
Corrections for "nikinitawiwapamew":
nikî-nitawi-wâpamâw    15.458984
naki-nitawi-wâpamêw    24.151655
nikî-nitawi-wanâmâw    27.151655
niwî-nitawi-wâpamâw    27.151655
kikî-nitawi-wâpamâw    27.151655
nikî-natawi-wâpamâw    27.151655
nika-nitawi-wâpamâw    28.151655
nikê-nitawi-wâpamâw    28.151655
nikî-nitawi-asamâw    28.151655
naki-nitawi-wâpamâw    28.151655
nikî-nitawi-wêpimâw    31.151655

What is strange about the weight differences is that the weights are encoded in the FSTs (acceptor and error model). So the expectation would be that identical input should give identical weights for identical output.

On the surface, it looks like divvunspell is giving wrong weights — if one takes the acceptor weight of the suggestion + the weight of each editing operation, one comes close to the hfst-ospell weight:

$ echo nikî-nitawi-wâpamâw | hfst-lookup -q tools/spellcheckers/fstbased/desktop/hfst/acceptor.default.hfst 
nikî-nitawi-wâpamâw	nikî-nitawi-wâpamâw	9,458984
nikî-nitawi-wâpamâw	nikî-nitawi-wâpamâw	11,151655

The lowest weight is the one used, and there are four editing operations applied to the input string, with the following weights:

# strings.default.regex:
{in} -> {î-n}::1,
{iw} -> {i-w}::1;

## editdist.default.txt:
a	â	2

# final_strings.default.txt:
mew:mâw	4

9,458984 + 1 + 1 + 2 + 4 = 17,458984

hfst-ospell is still 2 off, but that is nevertheless way closer than divvunspell's 40.458984.

These differences are problematic for two reasons: it indicates a bug in the weight calculation, and it makes it hard to debug the suggestions and their ordering.
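The arithmetic above, and the gap between the two tools for each suggestion they share, can be checked with a short script. The weights are copied from the two outputs in this issue; the variable and function names are only for illustration.

```python
# Weights copied from the divvunspell output above.
divvunspell = {
    "nikî-nitawi-wâpamâw": 40.458984, "naki-nitawi-wâpamêw": 49.151657,
    "naki-nitawi-wâpamâw": 53.151657, "nika-nitawi-wâpamâw": 53.151657,
    "nikê-nitawi-wâpamâw": 53.151657, "nikî-nitawi-asamâw": 53.151657,
    "nikî-natawi-wâpamâw": 57.151657, "nikî-nitawi-wanâmâw": 57.151657,
    "niwî-nitawi-wâpamâw": 57.151657, "kikî-nitawi-wâpamâw": 67.15166,
}
# Weights copied from the hfst-ospell output above (shared suggestions only).
hfst_ospell = {
    "nikî-nitawi-wâpamâw": 15.458984, "naki-nitawi-wâpamêw": 24.151655,
    "naki-nitawi-wâpamâw": 28.151655, "nika-nitawi-wâpamâw": 28.151655,
    "nikê-nitawi-wâpamâw": 28.151655, "nikî-nitawi-asamâw": 28.151655,
    "nikî-natawi-wâpamâw": 27.151655, "nikî-nitawi-wanâmâw": 27.151655,
    "niwî-nitawi-wâpamâw": 27.151655, "kikî-nitawi-wâpamâw": 27.151655,
}

# Acceptor weight plus the four edit weights listed above.
expected = round(9.458984 + 1 + 1 + 2 + 4, 6)
print(expected)  # 17.458984

# Difference divvunspell minus hfst-ospell for each shared suggestion.
gaps = {w: round(divvunspell[w] - hfst_ospell[w], 2) for w in divvunspell}
for word, gap in gaps.items():
    print(f"{word}\t{gap}")
print(sorted(set(gaps.values())))  # [25.0, 30.0, 40.0]
```

The gap is not constant: it takes the values 25, 30 and 40 in this sample. That would be consistent with extra weight being added on top of the FST weights, but the output alone does not confirm where it comes from.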

Documentation and discoverability of additional algorithms

We have two algorithms at play in divvunspell that don't exist in hfst-ospell:

Case handling
Penalty weighting for a different first letter, a different last letter, and Damerau–Levenshtein distance for middle letters

Things to do to make this good:

Document somewhere sane how the algorithms behave
Add some information to --help either with a link or with the information itself
In the suggestion output for divvunspell, show the penalties, and the unmodified weights, as well as the modified weights
Document how to add the weight information to BHFST files so it can be controlled by the linguist
If possible, add a flag for disabling the penalty weighting algorithm (like --no-case-handling already does somewhat, but separate the two into different flags)

All caps input does not return all caps suggestions

From giellalt/lang-sma#5:

Reading from stdin...
Input: IDENTITETE		[INCORRECT]
Identiteete		82.49508
Identiteetem		116.20703
Identiteeten		117.49508
Identiteeth		117.49508
Identiteeti		117.49508

Expected: all suggestions in all caps

Tested with the latest divvunspell code.

More features

What is missing is a mode for suppressing correct words, i.e. incorporating

cat wordlist | divvunspell suggest -a xxx.zhfst | grep -v "\[CORRECT\]"| ...

into e.g. divvunspell suggest --suppress-correct -a ... or something along those lines.
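Until such a flag exists, the filtering can be done outside divvunspell. The sketch below assumes the output format shown in other issues on this page: an "Input: ... [CORRECT]" or "[INCORRECT]" line followed by suggestion lines. Unlike the grep above, it also drops any suggestion lines that belong to a correct word. The function name is hypothetical.

```python
def suppress_correct(lines: list) -> list:
    """Drop each 'Input: ... [CORRECT]' line and the lines that follow it,
    up to the next 'Input:' line."""
    kept = []
    keep = True
    for line in lines:
        if line.startswith("Input:"):
            keep = not line.rstrip().endswith("[CORRECT]")
        if keep:
            kept.append(line)
    return kept


sample = [
    "Reading from stdin...",
    "Input: KÕige\t\t[CORRECT]",
    "Input: IDENTITETE\t\t[INCORRECT]",
    "Identiteete\t\t82.49508",
]
for line in suppress_correct(sample):
    print(line)
```

To use it as a pipe filter, read `sys.stdin` and pass the stripped lines to `suppress_correct`.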

accuracy needs some more features

Thanks for a good tool in accuracy, as described under Speller testing in the README file. In order to be really useful for development, the tool needs some additional options:

Sorting order

Today, we may sort by input order (default) and by time used. All well. In addition, we need:

Sort by Result, both in the order given here and in the reversed order (with 6 on top)

Top result
Suggestion 1
Suggestion 2
...
Not in suggestions
Not in suggestions (No suggestion)

Other features

We need Average position for all suggestions.

The old setup also calculated precision, recall and accuracy, this was a useful feature.
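As a rough sketch of the first request, the average position could be computed over the test cases where the expected correction appears among the suggestions. The data layout and function names here are assumptions, not the format of the accuracy tool's report; the sample data is taken from other issues on this page. Precision and recall are left out because the old setup's definitions are not given here.

```python
def position(expected: str, suggestions: list):
    """1-based position of the expected correction, or None if absent."""
    try:
        return suggestions.index(expected) + 1
    except ValueError:
        return None


def average_position(cases: list):
    """Mean position over the cases where the correction was suggested."""
    found = [p for p in (position(e, s) for e, s in cases) if p is not None]
    return sum(found) / len(found) if found else None


cases = [
    ("AS:in", ["Ab:in", "NS:in", "AS:in", "ASAin"]),   # position 3
    ("nikî-wâpamâw", ["nikî-wâpamâw", "nikitâpamâw"]), # position 1
    ("AS:in", ["Astan", "Asiin", "Askin"]),            # not in suggestions
]
print(average_position(cases))  # 2.0
```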

Tweaking output

The initial "Speller configuration" is good to have stated. I might have missed it, but it is not clear to me how to manipulate it. Based on the documentation I tried

3 divvunspell (main)$ accuracy -w 5 -o support/accuracy-viewer/public/report.json ../../giellalt/lang-fit/test/data/typos.txt ../../giellalt/lang-fit/tools/spellcheckers/fit.zhfst

... but that still resulted in 10 suggestions, not 5.

Summary

The most important thing is to be able to sort by result; it is critical for improving the suggestion mechanism. The other issues are there for making better reports.

And then, of course, there is the inclusion of this in an easier command setup in each lang-xxx catalogue, but for my own part I am happy with the present setup as soon as I get these features in place.

Multiple acceptors and error models

Both old ideas and new developments suggest a more flexible approach to acceptors and error models. Below is a list of things discussed in the past, plus new ideas inspired by the ongoing machine learning work by @gusmakali on word completion and prediction. Some of the tasks mentioned in #19 are also relevant.

Multiple error models

(neural model)
hand-tuned error model
#29
default/fall-back model (the present one)

The idea is that all of the above could be present in one and the same speller archive, and with some configuration specification as to when to apply which model. A very tentative idea could be that a machine learning error model will either get it right with the top hypothesis, or completely fail (as determined by filtering the hypothesis against the lexicon), thus use that one as a first step, then fall back to a hand-tuned error model, and when that fails (it could be written to be on the safe side, ie not suggest anything outside a certain set of errors), fall back to the default error model.

Exactly how this should work and interact is very much an open question, but divvunspell should provide the machinery so that linguists can experiment with it to reach an optimal setup for a given language and device type.
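The tentative fall-back idea can be sketched as a loop over error models in priority order, where each model's hypotheses are filtered against the lexicon and the first model with a surviving hypothesis wins. Everything here is hypothetical: the toy models stand in for real error models, and a Python set stands in for the acceptor.

```python
from typing import Callable, Iterable, List, Set

ErrorModel = Callable[[str], List[str]]


def suggest_with_fallback(word: str, models: Iterable[ErrorModel],
                          lexicon: Set[str]) -> List[str]:
    """Try each error model in order; return the first non-empty set of
    hypotheses that survive filtering against the lexicon."""
    for model in models:
        hits = [h for h in model(word) if h in lexicon]
        if hits:
            return hits
    return []


# Toy stand-ins for the three kinds of model discussed above.
def neural_model(word):      # fails completely: hypothesis not in lexicon
    return ["nikiwapamow"]

def hand_tuned_model(word):  # stays on the safe side: no suggestion
    return []

def default_model(word):     # the present fall-back model
    return ["nikî-wâpamâw", "nikitâpamâw"]


lexicon = {"nikî-wâpamâw", "nikitâpamâw"}
models = [neural_model, hand_tuned_model, default_model]
print(suggest_with_fallback("nikiwapamaw", models, lexicon))
```

The order of `models` is the configuration point: a linguist could reorder or drop models per language and device type.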

Multiple acceptors

default acceptor (the present one)
suggestion acceptor
#29
rejector

And possibly other variants too.

There are at least two ideas here:

we might want to be more careful with what we suggest, and an easy way to do that is verifying suggestions against a more restricted acceptor, e.g. with no dynamic compounding or derivation (such words would still be accepted, just never suggested). Another way of restricting suggestions is to never suggest anything with a weight higher than a limit X, where X is configurable (this has been discussed several times in the past):
- never suggest if weight higher than configurable weight X
in productive word formation it is easy to overgenerate, e.g. for compounds, but subtracting illegal paths from an fst is hugely inefficient and space consuming. What is way better is to have a rejector fst that contains invalid strings, and anything in that fst should always be rejected, in all cases except when explicitly added to a user dictionary by the user.
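A minimal sketch of how these acceptors could interact, using plain sets in place of FSTs. The class, its field names and the sample strings are all hypothetical and only illustrate the order of the checks described above.

```python
from dataclasses import dataclass, field


@dataclass
class Speller:
    acceptor: set                 # default acceptor: every valid word
    suggestion_acceptor: set      # stricter set used to vet suggestions
    rejector: set = field(default_factory=set)         # always rejected
    user_dictionary: set = field(default_factory=set)  # explicit user additions

    def accepts(self, word: str) -> bool:
        # The user dictionary overrides the rejector.
        if word in self.user_dictionary:
            return True
        return word in self.acceptor and word not in self.rejector

    def may_suggest(self, word: str) -> bool:
        # Words outside the stricter acceptor are accepted but never suggested.
        if word in self.user_dictionary:
            return True
        return word in self.suggestion_acceptor and word not in self.rejector


speller = Speller(
    acceptor={"base", "base-compound", "bad-compound"},
    suggestion_acceptor={"base"},
    rejector={"bad-compound"},
)
print(speller.accepts("base-compound"), speller.may_suggest("base-compound"))
print(speller.accepts("bad-compound"))
```

Keeping the rejector as a separate lookup avoids subtracting paths from the acceptor itself, which is the cost the text above warns about.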

As part of this work it is probably necessary to rework the zhfst archive format, probably by making the bhfst format the standard, including the json config file used there.

Too short help message for divvunspell?

Both divvunspell -h and ... --help on the command line give:

divvunspell -h
Usage: divvunspell [OPTIONS]

Optional arguments:
  -h, --help  print help message

Available commands:
  suggest   get suggestions for provided input
  tokenize  print input in word-separated tokenized form
  predict   predict next words using GPT2 model

This contrasts with the README.md file, which gives

Testing frontend for the Divvunspell library

USAGE:
    divvunspell [FLAGS] [OPTIONS] <--zhfst <ZHFST>|--bhfst <BHFST>|--acceptor <acceptor>> [WORDS]...

FLAGS:
    -S, --always-suggest    Always show suggestions even if word is correct (implies -s)
    -h, --help              Prints help information
        --json              Output results in JSON
    -s, --suggest           Show suggestions for given word(s)
    -V, --version           Prints version information

OPTIONS:
        --acceptor <acceptor>    Use the given acceptor file
    -b, --bhfst <BHFST>          Use the given BHFST file
        --errmodel <errmodel>    Use the given errmodel file
    -n, --nbest <nbest>          Maximum number of results for suggestions
    -w, --weight <weight>        Maximum weight limit for suggestions
    -z, --zhfst <ZHFST>          Use the given ZHFST file

ARGS:
    <WORDS>...    The words to be processed

So, two wishes here:

Extend the short help message to the longer, more useful one
While at it, add the -a flag to the first OPTIONS (also an -e?)

Rejects words accepted by hfst-ospell and hfst-lookup

To repeat:

cd lang-lut
./autogen.sh
./configure --enable-spellers
make -j
echo sčətxʷəd | divvunspell suggest -a tools/spellcheckers/lut.zhfst 
Reading from stdin...
Input: sčətxʷəd		[INCORRECT]
sčətxʷəd		0

echo sčətxʷəd | hfst-ospell -S tools/spellcheckers/lut.zhfst 
"sčətxʷəd" is in the lexicon...

See giellalt/lang-lut#4 for more examples.

The following words fail:

sčətxʷəd
ƛ̕aƛ̕ac̓apəd
sčətxʷəd
ƛ̕aƛ̕ac̓apəd
sčətxʷəd
gʷəl
x̌ʷul̕
ƛ̕uʔibibəš
x̌ʷul̕
ƛ̕uʔibibəš
gʷəl
ƛ̕aƛ̕ac̓apəd
gʷəl
dᶻəgʷaʔ

Common to all of them is that they either contain ʷ (U+02B7, MODIFIER LETTER SMALL W) or ̕ (U+0315, COMBINING COMMA ABOVE RIGHT).

The following words do NOT fail:

ʔi
tsiʔiɬ
hay
ʔah
tiʔəʔ
syəyəhub
ʔə
tiʔiɬ
ʔi
tsiʔiɬ
tiʔəʔ
tsiʔiɬ

None of these words contain the mentioned characters.

To me it seems like some part of divvunspell does not contain a complete definition of the set of word characters. It should be based on a suitable Unicode character class, not on specific character code ranges.
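The Unicode general categories of the two characters can be checked with Python's standard `unicodedata` module. The `is_word` helper is a hypothetical example of a class-based definition (letters and marks); it is not taken from divvunspell.

```python
import unicodedata


def is_word(s: str) -> bool:
    """True if every character is a letter (L*) or a mark (M*)."""
    return bool(s) and all(unicodedata.category(ch)[0] in ("L", "M") for ch in s)


for ch in ("\u02b7", "\u0315"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
# U+02B7 MODIFIER LETTER SMALL W Lm
# U+0315 COMBINING COMMA ABOVE RIGHT Mn

failing = [
    "s\u010d\u0259tx\u02b7\u0259d",  # sčətxʷəd
    "x\u030c\u02b7ul\u0315",         # x̌ʷul̕
]
print([is_word(w) for w in failing])  # [True, True]
```

A test that only looks at letters, such as `str.isalpha()`, accepts the modifier letter but rejects any word with a combining mark, which shows how a narrower definition can split these words.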

Add regression testing

Add a tool for allowing a user to compare two spellers for regressions.

Extra weight when initial letter is changed

One of the most glaring problems with our present spellers is that suggestions get the same weight regardless of where in the input string the changes are made. This is contrary to what we know about the distribution of spelling errors, and causes a lot of noisy suggestions.

Errors are typically made in the middle of words, not at the beginning or end, so the distribution of error positions looks a lot like a Gaussian curve.

We are not able to build error models where the weights for one and the same edit changes depending on the position of the edit. A crude solution that would fix the most glaring errors would be a mechanically added weight to edits that are made to the first letter of the word. At best the initial letter weight would be configurable, for easy adjustments to the rest of the weighting scheme. The default could be 10.
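The crude solution proposed here could look roughly like this. The function name is hypothetical, the default of 10 is the value suggested above, and comparing the first letters case-insensitively is an assumption.

```python
def penalised_weight(input_word: str, suggestion: str, weight: float,
                     initial_penalty: float = 10.0) -> float:
    """Add a fixed penalty when the suggestion changes the first letter."""
    if input_word[:1].casefold() != suggestion[:1].casefold():
        return weight + initial_penalty
    return weight


# Weights taken from an hfst-ospell output elsewhere on this page.
same_initial = penalised_weight("nikinitawiwapamew", "niwî-nitawi-wâpamâw", 27.151655)
changed_initial = penalised_weight("nikinitawiwapamew", "kikî-nitawi-wâpamâw", 27.151655)
print(same_initial, round(changed_initial, 6))
```

With the penalty applied, two suggestions that the error model weighted equally are separated, and the one that keeps the initial letter ranks first.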

Use cthulhu for generating FFI interface

Use the cthulhu crate for providing a standard interface to the FFI functions without having to write broken raw code.

Freeze the command line flags

Need to review whether the current flags are sufficient, and consider the interplay between CHFST and ZHFST files.

Don't suggest if weight >10 000

Anything with such a high weight has it intentionally, and should not be suggested. Implementing that limit will remove a lot of noise from the speller.

In the long run the weight limit for no-sugg should be configurable via the zhfst file, but for now it is ok to just hard-code it.
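A sketch of the hard-coded variant, with the limit exposed as a parameter so it could later be read from the speller archive. The names are hypothetical; the sample weights are taken from another issue on this page.

```python
MAX_SUGGESTION_WEIGHT = 10_000.0  # hard-coded for now, as proposed above


def filter_suggestions(suggestions: list,
                       max_weight: float = MAX_SUGGESTION_WEIGHT) -> list:
    """Keep only (word, weight) pairs whose weight does not exceed the limit."""
    return [(word, weight) for word, weight in suggestions if weight <= max_weight]


candidates = [("Omasvuonas", 105.3018), ("Omas-vuonas", 20092.302734)]
print(filter_suggestions(candidates))  # [('Omasvuonas', 105.3018)]
```

The limit only affects what is suggested; a word with a weight above the limit can still be accepted as correct.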

Mixed-case strings accepted by hfst-ospell-office

The string DavveVássján is accepted by the smj speller (attached, rename suffix to zhfst), although it should not be. hfst-ospell does not accept it:

echo DavveVássján | hfst-ospell -S smj.zhfst | head -n 10
"DavveVássján" is NOT in the lexicon:
Corrections for "DavveVássján":
Davve-Vássján    27.590923 <== this is the intended correction
Davve-Vássjá    37.590923
...

I assume it is a bug in the case handling algorithm. For cases like these, the input string should be accepted IFF it is accepted exactly as given, or with the initial letter downcased, or all upper. Crucially, it should not be accepted if it is only accepted when all lowercased. In the example above, DavveVássján is not an acceptable word, although davvevássján is. But since neither DavveVássján nor davveVássján are accepted, the input string should be rejected, despite davvevássján being accepted.
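The acceptance rule described above can be written down as a small sketch. A set stands in for the acceptor, the function name is hypothetical, and the handling of all-upper input (trying the lower-case and initial-upper forms) is one possible reading of "or all upper".

```python
def is_accepted(word: str, lexicon: set) -> bool:
    """Accept the word as given, with the initial letter downcased, or,
    for all-upper input only, in lower-case or initial-upper form."""
    candidates = {word, word[:1].lower() + word[1:]}
    if word.isupper():
        candidates.add(word.lower())
        candidates.add(word[:1] + word[1:].lower())
    # Crucially, mixed-case input is never lowercased as a whole.
    return any(c in lexicon for c in candidates)


lexicon = {"davvevássján", "Davve-Vássján"}
print(is_accepted("DavveVássján", lexicon))   # False
print(is_accepted("Davvevássján", lexicon))   # True
print(is_accepted("DAVVEVÁSSJÁN", lexicon))   # True
```

DavveVássján is rejected because neither DavveVássján nor davveVássján is in the lexicon, even though davvevássján is.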

smj.zip

Provide alphabet to tokeniser

Some languages are going to consider dashes etc as characters of their alphabet, so we should override Unicode's categories for this case by providing the alphabet from a transducer as the primary source of truth of what belongs in a "word".
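A minimal sketch of the idea: the tokeniser treats a character as part of a word if it is in the alphabet supplied by the transducer, and otherwise falls back to Unicode categories. The function and its fallback rule (letters and marks) are assumptions for illustration, not divvunspell's tokeniser.

```python
import unicodedata


def tokenize(text: str, alphabet=frozenset()) -> list:
    """Split text into words; the supplied alphabet takes precedence over
    the Unicode-category fallback (letters L* and marks M*)."""
    tokens, current = [], []
    for ch in text:
        if ch in alphabet or unicodedata.category(ch)[0] in ("L", "M"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize("Omas-vuonas."))                  # ['Omas', 'vuonas']
print(tokenize("Omas-vuonas.", alphabet={"-"}))  # ['Omas-vuonas']
```

In both calls the trailing full stop stays outside the word, since it is neither in the alphabet nor a letter or mark.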

Differing behavior with different devices (crk)

When I try to correct nikiwapamaw with divvunspell, the first suggestion is the expected nikî-wâpamâw, cf.

divvunspell suggest --archive tools/spellcheckers/crk.zhfst
Reading from stdin...
nikiwapamaw
Input: nikiwapamaw		[INCORRECT]
nikî-wâpamâw		30.360352
nikitâpamâw		36.151657
nikîmâpamâw		41.151657
naki-wâpamâw		42.151657
nika-wâpamâw		43.151657

This is also the behavior within LibreOffice, cf.

However, when I try this with the mobile version of the speller, for whatever reason the second form (according to divvunspell) nikitâpamâw is suggested as the topmost correction, and nikî-wâpamâw is nowhere to be seen, cf.

Correct string not accepted by speller

The input Omas-vuonas is not accepted by divvunspell, although it is accepted or analysed by all other tools:

echo Omas-vuonas | divvunspell suggest -a se.zhfst 
Reading from stdin...
Input: Omas-vuonas		[INCORRECT]
Omasvuonas		105.3018

Compare with the following:

echo '5 Omas-vuonas' | hfst-ospell-office -d se.zhfst 
@@ Loading se.zhfst with args max-weight=-1.00, beam=-1.00, time-cutoff=6.00
@@ hfst-ospell-office is alive
*

(the star * is ospell-office/Hunspell speak for an accepted word).

And:

printf "Omas-vuonas" | hfst-ospell -S tools/spellcheckers/se.zhfst 
"Omas-vuonas" is in the lexicon...

And:

printf "Omas-vuonas" | hfst-lookup -q tools/spellcheckers/acceptor.default.hfst 
Omas-vuonas	Omas-vuonas	20092,302734
Omas-vuonas	Omas-vuonas	20092,302734
Omas-vuonas	Omas-vuonas	20092,302734
Omas-vuonas	Omas-vuonas	20105,302734
Omas-vuonas	Omas-vuonas	20105,302734
Omas-vuonas	Omas-vuonas	20105,302734

And:

printf "Omas-vuonas" | hfst-lookup -q tools/spellcheckers/analyser-desktopspeller-gt-norm.hfst 
Omas-vuonas	Oma+N+Prop+Sem/Plc+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Loc	20092,302734
Omas-vuonas	Oma+N+Prop+Sem/Plc+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Loc	20092,302734
Omas-vuonas	Oma+N+Prop+Sem/Plc+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Loc	20092,302734
Omas-vuonas	Oma+N+Prop+Sem/Sur+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Loc	20092,302734
Omas-vuonas	Oma+N+Prop+Sem/Sur+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Loc	20092,302734
Omas-vuonas	Oma+N+Prop+Sem/Sur+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Loc	20092,302734
Omas-vuonas	Oma+N+Prop+Sem/Plc+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Acc+PxSg3	20105,302734
Omas-vuonas	Oma+N+Prop+Sem/Plc+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Gen+PxSg3	20105,302734
Omas-vuonas	Oma+N+Prop+Sem/Plc+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Acc+PxSg3	20105,302734
Omas-vuonas	Oma+N+Prop+Sem/Plc+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Gen+PxSg3	20105,302734
Omas-vuonas	Oma+N+Prop+Sem/Plc+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Acc+PxSg3	20105,302734
Omas-vuonas	Oma+N+Prop+Sem/Plc+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Gen+PxSg3	20105,302734
Omas-vuonas	Oma+N+Prop+Sem/Sur+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Acc+PxSg3	20105,302734
Omas-vuonas	Oma+N+Prop+Sem/Sur+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Gen+PxSg3	20105,302734
Omas-vuonas	Oma+N+Prop+Sem/Sur+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Acc+PxSg3	20105,302734
Omas-vuonas	Oma+N+Prop+Sem/Sur+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Gen+PxSg3	20105,302734
Omas-vuonas	Oma+N+Prop+Sem/Sur+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Acc+PxSg3	20105,302734
Omas-vuonas	Oma+N+Prop+Sem/Sur+Sg+Loc+Cmp/Cit+Cmp/Hyph+Cmp#vuotna+N+Sem/Plc+Sg+Gen+PxSg3	20105,302734

Expected behavior: divvunspell should accept every input the acceptor accepts.

The speller .zhfst file used for testing is attached.

se.zhfst.zip

Suggestion differences between divvunspell and hfst-ospell(-office)

divvunspell:

AS:ain → AS:in

Result: Not in suggestions

Time: 36.038582 ms

Astan 53.04883
Asiin 55.3018
Askin 55.3018
Asodin 55.3018
Asolin 55.3018
Asošin 55.3018
Assin 55.3018
Astadin 55.3018
Astamin 55.3018
Asudin 55.3018

hfst-ospell-office:

AS:ain -> AS:in

Time: 1.018443 s

Result: 3

Ab:in
NS:in
AS:in
ASAin
Anjain
Annain
Antain
Anyain
Apain
Apiain

The correct suggestion in hfst-ospell has a weight of 72.916. The two top suggestions have the weight 70.301.

The most surprising thing is that the editing distance between the input AS:ain and the top divvunspell suggestion Astan is so big: S-s, :->t, i->0 = 3. It should not be possible to generate most of the divvunspell suggestions, as we by default only allow two edit operations (specific string operations in the error model might override this default).
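The edit counts can be verified with a plain Levenshtein distance. This is the textbook algorithm, case-sensitive and without transpositions; it is not the error model itself, which may weight or combine operations differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("AS:ain", "Astan"))  # 3
print(levenshtein("AS:ain", "AS:in"))  # 1
```

The intended correction is one edit away, while the top divvunspell suggestion is three edits away, which is beyond the default limit of two mentioned above.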

This is even more intriguing as the top divvunspell suggestion (Astan) does not appear in the output of the error model! Cf:

echo AS:ain | hfst-lookup -q errmodel.default.hfst

(It gives 553 732 suggestions, obviously most of them are garbage.)

Download the se.zhfst file for testing.
