zverok / spylls


Pure Python spell-checker, (almost) full port of Hunspell

Home Page: https://spylls.readthedocs.io

License: Mozilla Public License 2.0

Python 92.61% Makefile 4.88% Shell 2.51%

spellcheck spelling hunspell spellchecker

spylls's Introduction

Spylls: Hunspell ported to Python

Spylls is an effort to port prominent spellcheckers into clear, well-structured, well-documented Python. It is intended to be useful both as a library and as a kind of "reference (or investigatory, if you will) implementation". Currently, only Hunspell is ported.

Hunspell is a long-lived, complicated, almost undocumented piece of software, and it was our feeling that a significant part of human knowledge is somehow "locked" in the form of a large C++ project. That's how Spylls was born: as an attempt to "unlock" it via a well-structured and well-documented implementation in a high-level language.

Follow the explanatory blog post series on my blog.

Usage as a library

$ pip install spylls

from spylls.hunspell import Dictionary

# en_US dictionary is distributed with spylls
# See docs to load other dictionaries
dictionary = Dictionary.from_files('en_US')

print(dictionary.lookup('spylls'))
# False
for suggestion in dictionary.suggest('spylls'):
    print(suggestion)
# spells
# spills

Documentation

Full documentation, including detailed source code/algorithms walkthrough, more detailed reasoning and some completeness reports, is available at https://spylls.readthedocs.io/.

Project Links

Docs: https://spylls.readthedocs.io/
GitHub: https://github.com/zverok/spylls
PyPI: https://pypi.python.org/pypi/spylls
Issues: https://github.com/zverok/spylls/issues

License

MPL 2.0. See the bundled LICENSE file for more details. Note that, being an "explanatory rewrite", spylls should be considered a derivative work of Hunspell, and so should all of its ports/rewrites.

We are incredibly grateful to Hunspell's original authors and current maintainers for all the hard work they've put into the most used spellchecker in the world!

spylls's Issues

Question: Does hunspell/spylls have facility to change the root of the word?

Does hunspell/spylls have facility to change the root of the word when making word forms?

I am looking through cs_CZ and a lot of words are directly written there in all forms.

Like stůl (table):

stole
stolech
stolem
stoletím
stolu
stolů
stolům
stoly
stůl

Which seems highly inefficient to me.
But it may be that there is no facility to implement a character transformation such as ů becoming o.
Or is there something in the Hunspell format that can handle this?

Basically, every masculine word with ů gets transformed into the o form like this. And there are several transformations like these in the Czech language.
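
For context, Hunspell suffix rules (SFX lines in the .aff file) carry a "stripping" field: characters removed from the end of the stem before the suffix is appended. That is enough to express an alternation like ůl -> ol. The sketch below imitates this mechanism in plain Python, without using spylls. The `SuffixRule` type, the `apply_suffix` helper, and the three rules are hypothetical illustrations and are not taken from cs_CZ.

```python
import re
from typing import NamedTuple, Optional


class SuffixRule(NamedTuple):
    strip: str      # characters removed from the end of the stem
    add: str        # characters appended after stripping
    condition: str  # regex the end of the stem must match


def apply_suffix(stem: str, rule: SuffixRule) -> Optional[str]:
    """Return the derived form, or None when the rule does not apply."""
    if not re.search(rule.condition + "$", stem):
        return None
    if not stem.endswith(rule.strip):
        return None
    base = stem[: len(stem) - len(rule.strip)]
    return base + rule.add


# Hypothetical rules for one flag covering the "ů -> o" alternation
rules = [
    SuffixRule("ůl", "olu", "ůl"),
    SuffixRule("ůl", "olem", "ůl"),
    SuffixRule("ůl", "oly", "ůl"),
]
forms = [apply_suffix("stůl", rule) for rule in rules]
print(forms)  # ['stolu', 'stolem', 'stoly']
```

With rules of this shape, only the single stem stůl would need to be listed in the .dic file, tagged with the flag that owns the rules.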

Knob to fix word-case

I know you mention this as a design goal, but is there a flag/knob to fix word casing, e.g.:
Walking -> walking
Best wishes!

Infinite loop when suggesting

It seems that for some words the generator from suggest() never finishes. I am using the French dictionary from Firefox as my datasource. This works:

dictionary = Dictionary.from_zip("./fr.xpi")

print(list(dictionary.suggest('sommes')))

This goes into an infinite loop:

print(list(dictionary.suggest('decouverte')))

Any ideas/suggestions?
Spylls version 0.1.7, Python 3.9.7

MIT license certainly isn't compatible with Hunspell's license

When I was discussing linking to Espells from a few places with the author of NSpell, they noted that Spylls likely is in violation of the Hunspell license.

Here is what they had to say:

Spylls is explained as an “explanatory rewrite”; this sounds like carefully reading through the code and then implementing it either the same way or in a clearer way, and in the posts the Spylls author also explicitly mentions reading through the source and what they found there. That means it’s not 100% a new body of work, which means it has to follow the MPL 1.1/GPL 2.0/LGPL 2.1 with the copyright statement of Hunspell and add to it. So the first issue is there, and it’s a big one: Spylls uses MIT, which is very permissive, whereas the author of Hunspell chose a strong license, specifically to prohibit such permissiveness. On top of that, it doesn’t acknowledge the copyright that the Hunspell author(s) have over their code.

I had a vague notion of this issue, but made Espells MIT anyways. However, following this, I've relicensed it to MPL 2.0, which is allowed through the MPL license family that Hunspell is licensed under. I think MPL 2.0 is harmless for a repository like this, so that's what I would recommend.

Support for extracting all valid words from a dictionary

Hello there. First of all, thanks a lot for writing Spylls, all the documentation, and the blog series. I haven't read it all in depth yet, but so far it has been wonderfully informative. From what I've seen, it's the best available documentation on how Hunspell works, and Spylls itself is probably the best port of Hunspell to another language.

I have a use case where I need to perform spell-checking in the browser. From what I've seen, mainly from playing around with nspell, a Hunspell-like approach would be prohibitive for my use case, at least for some languages: it takes too long to generate words or even just parse dictionaries, and keeping all those strings in memory can blow up memory usage fairly rapidly.

I think it's possible to approach the problem differently: pre-parse dictionaries into a compact ~binary representation and work with that directly, trading some lookup performance (hopefully not too much) for much lower memory usage and startup times. There's some shallow documentation about that here, but I think you alluded to something like this in a blog post when mentioning another spell checker, so you might already be familiar with it.

Anyway, I'd like to spend some time implementing something like that. To do so, I would first need to extract all valid words from all dictionaries, and that sounds like a task Spylls could be well suited to. It's not a problem if it takes hours for some dictionaries; I care more about extracting all valid words as correctly as possible than about doing it quickly.

So could Spylls expose an API for doing that?

TypeError: '<' not supported between instances of 'Word' and 'Word'

It works for some words but raises an error for others.

from spylls.hunspell import Dictionary
dictionary = Dictionary.from_files('/root/marathi/dicts/mr_IN')

for suggestion in dictionary.suggest('मान्वी'):
  print(suggestion)

मानवी
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-cec47a2e0f5b> in <module>
      3 dictionary = Dictionary.from_files('/root/marathi/dicts/mr_IN')
      4 
----> 5 for suggestion in dictionary.suggest('मान्वी'):
      6   print(suggestion)

/root/miniforge3/lib/python3.7/site-packages/spylls/hunspell/dictionary.py in suggest(self, word)
    201         """
    202 
--> 203         yield from self.suggester(word)

/root/miniforge3/lib/python3.7/site-packages/spylls/hunspell/algo/suggest.py in __call__(self, word)
    181             word: Word to check
    182         """
--> 183         yield from (suggestion.text for suggestion in self.suggest_internal(word))
    184 
    185     def suggest_internal(self, word: str) -> Iterator[Suggestion]:  # pylint: disable=too-many-statements

/root/miniforge3/lib/python3.7/site-packages/spylls/hunspell/algo/suggest.py in <genexpr>(.0)
    181             word: Word to check
    182         """
--> 183         yield from (suggestion.text for suggestion in self.suggest_internal(word))
    184 
    185     def suggest_internal(self, word: str) -> Iterator[Suggestion]:  # pylint: disable=too-many-statements

/root/miniforge3/lib/python3.7/site-packages/spylls/hunspell/algo/suggest.py in suggest_internal(self, word)
    345 
    346         ngrams_seen = 0
--> 347         for sug in self.ngram_suggestions(word, handled=handled):
    348             for res in handle_found(Suggestion(sug, 'ngram'), check_inclusion=True):
    349                 ngrams_seen += 1

/root/miniforge3/lib/python3.7/site-packages/spylls/hunspell/algo/suggest.py in ngram_suggestions(self, word, handled)
    508                     known={*(word.lower() for word in handled)},
    509                     maxdiff=self.aff.MAXDIFF,
--> 510                     onlymaxdiff=self.aff.ONLYMAXDIFF)
    511 
    512     def phonet_suggestions(self, word: str) -> Iterator[str]:

/root/miniforge3/lib/python3.7/site-packages/spylls/hunspell/algo/ngram_suggest.py in ngram_suggest(misspelling, dictionary_words, prefixes, suffixes, known, maxdiff, onlymaxdiff)
     81             heapq.heappushpop(root_scores, (score, word.stem, word))
     82         else:
---> 83             heapq.heappush(root_scores, (score, word.stem, word))
     84 
     85     roots = heapq.nlargest(MAX_ROOTS, root_scores)

TypeError: '<' not supported between instances of 'Word' and 'Word'
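
The last traceback frame pushes `(score, word.stem, word)` tuples onto a heap. A plausible reading is that two entries tied on both score and stem, so the tuple comparison fell through to the `Word` objects, which define no ordering. The sketch below reproduces that failure with a stand-in `Word` class (an assumption, not the spylls class) and shows the usual tie-breaker workaround with a counter. It does not claim to be the fix adopted upstream.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Stand-in for a dictionary entry: supports equality, but not ordering."""
    stem: str
    flags: str


first = Word("मानवी", "A")
second = Word("मानवी", "B")  # same stem, different flags

# Same score and same stem: tuple comparison falls through to the Word objects.
plain_heap = []
heapq.heappush(plain_heap, (0.5, first.stem, first))
try:
    heapq.heappush(plain_heap, (0.5, second.stem, second))
    failed = False
except TypeError:
    failed = True

# An increasing counter breaks the tie before Word objects are ever compared.
tiebreak = itertools.count()
safe_heap = []
for word in (first, second):
    heapq.heappush(safe_heap, (0.5, word.stem, next(tiebreak), word))

print(failed, len(safe_heap))  # True 2
```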

Spelling mistake in example code on spylls.readthedocs.io

On the docs on your website... there's a spelling mistake... :-) :-) "sugestion" is written with two GGs.

On github this is already fixed.

Upgrade the package

I tried to upgrade, but it seems that the updated code is not available via pip:

pip install --upgrade spylls

Please publish an updated package so that it will be easy to use.

Using spylls to clean-up text file

Is it possible to run spylls against a large corpus and remove all misspelled words?
Something like what is asked here:
https://stackoverflow.com/questions/65785287/using-hunspell-to-find-incorrect-words-in-jamspell

Integrate Black for formatting

Black is a pretty good option for formatting code + can be integrated with Poetry / pyproject.toml

Since it's pretty strict it stops a lot of the battles that can happen over this.

(As a pedantic side note - the indentation in the docs in the README is not 4 spaces, as in PEP8)

.dic and .aff content by param.

Hello!

Would it be possible to populate the dictionary by passing a list with the contents of the .dic and .aff files?

This is useful in the case of Spark UDFs, where it is easier to pass list variables than to copy .dic and .aff files from the driver node to the executors.
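
Until such an API exists, one possible workaround is to write the lists to temporary files on the executor and load them by path. The sketch below only covers the file-writing step. `write_dictionary_files` and the sample lines are hypothetical, and the final `Dictionary.from_files` call is left as a comment so the example runs without spylls.

```python
import os
import tempfile
from typing import List


def write_dictionary_files(aff_lines: List[str], dic_lines: List[str],
                           directory: str, name: str = "custom") -> str:
    """Write <name>.aff and <name>.dic into `directory`; return the path without extension."""
    base = os.path.join(directory, name)
    with open(base + ".aff", "w", encoding="utf-8") as aff_file:
        aff_file.write("\n".join(aff_lines) + "\n")
    with open(base + ".dic", "w", encoding="utf-8") as dic_file:
        dic_file.write("\n".join(dic_lines) + "\n")
    return base


# Hypothetical contents, as they might arrive in a UDF
aff_lines = ["SET UTF-8"]
dic_lines = ["2", "hello", "world"]
base = write_dictionary_files(aff_lines, dic_lines, tempfile.mkdtemp())

# With spylls installed, the files could then be loaded as in the README:
#   from spylls.hunspell import Dictionary
#   dictionary = Dictionary.from_files(base)
```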

Btw, is there any way to implement stemming like in the original Hunspell library? Or is there some alternative for stemming?

It takes too long to return an answer on a corner case

It seems that the algorithm behind "dictionary.lookup()" has a complexity that depends on the length of the input. For example, the following script takes seconds to return an answer:

from spylls.hunspell import Dictionary

HUNSPELL_DICT = Dictionary.from_files('en_US')

text = "-" * 40
print(HUNSPELL_DICT.lookup(text))

I know this may not be an expected input for your design, but I suggest adding a shortcut to reduce the processing time for this kind of obviously invalid input.
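
A caller-side version of such a shortcut could look like the sketch below. It assumes that a string made only of punctuation or symbol characters can be rejected without consulting the dictionary, which may not match what Hunspell itself would answer. `guarded_lookup` is a hypothetical wrapper, and the lookup function is a stub that records its calls.

```python
import unicodedata
from typing import Callable


def is_only_punctuation(word: str) -> bool:
    """True when every character is in a Unicode punctuation (P*) or symbol (S*) category."""
    return bool(word) and all(unicodedata.category(ch)[0] in "PS" for ch in word)


def guarded_lookup(word: str, lookup: Callable[[str], bool]) -> bool:
    """Reject punctuation-only input early; otherwise defer to the real lookup."""
    if is_only_punctuation(word):
        return False
    return lookup(word)


# Stub lookup that records what it was asked, to show the shortcut is taken
calls = []


def recording_lookup(word: str) -> bool:
    calls.append(word)
    return True


print(guarded_lookup("-" * 40, recording_lookup))  # False
print(guarded_lookup("hello", recording_lookup))   # True
print(calls)                                       # ['hello']
```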

aff-regex

This AFF (Czech) contains an invalid regex:
https://github.com/wooorm/dictionaries/blob/main/dictionaries/cs/index.aff#L2119

Therefore this line fails with re.error: unterminated character set at position 36
https://github.com/zverok/spylls/blob/master/spylls/hunspell/data/aff.py#L266
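
One defensive option is to catch `re.error` when compiling a condition and report the offending pattern instead of aborting the whole load. The sketch below shows the idea with a hypothetical `compile_condition` helper and made-up patterns. Whether an invalid rule should be skipped, repaired, or rejected is a design decision this sketch does not settle.

```python
import re
from typing import Optional, Pattern


def compile_condition(pattern: str) -> Optional[Pattern[str]]:
    """Compile an affix condition; return None (with a message) when it is invalid."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        print(f"Skipping invalid condition {pattern!r}: {exc}")
        return None


good = compile_condition("[^aeiou]y$")
bad = compile_condition("[^aeiou")  # unterminated character set
print(good is not None, bad is None)  # True True
```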

ask for the Stemming feature

I'd like this feature :
https://github.com/binhetech/CyHunspell#stemming

spylls fails to load Dutch dictionary

When I tried to load the Dutch dictionary from https://github.com/OpenTaal/opentaal-hunspell, it failed:

In [1]: from spylls.hunspell import Dictionary

In [2]: dictionary = Dictionary.from_files("github/opentaal-hunspell/nl")
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-a2daf207bcaa> in <module>
----> 1 dictionary = Dictionary.from_files("github/opentaal-hunspell/nl")

/usr/local/lib/python3.9/site-packages/spylls/hunspell/dictionary.py in from_files(cls, path)
    116 
    117         aff, context = readers.read_aff(FileReader(path + '.aff'))
--> 118         dic = readers.read_dic(FileReader(path + '.dic', encoding=context.encoding), aff=aff, context=context)
    119 
    120         return cls(aff, dic)

/usr/local/lib/python3.9/site-packages/spylls/hunspell/readers/dic.py in read_dic(source, aff, context)
     58                 # So we just mutate the list of parts we are currently processing, so those fetched
     59                 # by numeric alias would be handled.
---> 60                 parts.extend(aff.AM[part])
     61             else:
     62                 # ...otherwise, it is still part of the word

KeyError: '10'

Looking at the nl.dic file, the relevant word is on line 18906:
boekentop 10.

Hunspell, however, accepts this dictionary and uses it:

hunspell -d github/opentaal-hunspell/nl
Hunspell 1.7.0
boekentop 10
*
*

boekentop-10
& boekentop-10 1 0: boekentop 10

This seems to indicate that the following comment on line 56 of hunspell/readers/dic.py is not correct:
# If it is just numeric AND not the first part in string, it is "morphology alias"
I cannot find the term "morphology alias" in hunspell(5), so I'm not sure what is meant by that.
The manual does show numerical flags used.
But numbers in the stem should not be interpreted, AFAICT.
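
The term presumably refers to Hunspell's AM directive, which defines numbered aliases for morphological fields; the traceback above indexes `aff.AM`. A more tolerant reading would expand a numeric token only when such an alias actually exists and otherwise keep it as part of the word. The sketch below illustrates that rule on a heavily simplified line format (no flags, no explicit morphological fields, non-empty lines only). `split_entry` is a hypothetical helper, not the spylls reader.

```python
from typing import Dict, List, Tuple


def split_entry(line: str, morph_aliases: Dict[str, List[str]]) -> Tuple[str, List[str]]:
    """Split a simplified .dic line into (word, morphology fields)."""
    first, *rest = line.split()
    word_parts = [first]
    morphology: List[str] = []
    for part in rest:
        if part.isdigit() and part in morph_aliases:
            # Numeric token with a matching alias: expand it.
            morphology.extend(morph_aliases[part])
        else:
            # No such alias: the token stays part of the word.
            word_parts.append(part)
    return " ".join(word_parts), morphology


print(split_entry("boekentop 10", {}))               # ('boekentop 10', [])
print(split_entry("foo 1", {"1": ["po:noun"]}))      # ('foo', ['po:noun'])
```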

Generating dictionary wordforms/unmunch

The original Hunspell had two important utilities:

# print all forms for all words whose roots are given in `roots.dic`
# and make use of affix rules defined in `affixes.aff`:
unmunch   roots.dic affixes.aff
# print the forms of ONE given word (a single root with no affix rule)
# which are allowed by the reference dictionary defined by the pair of
# `roots.dic` and `affixes.aff`:
wordforms affixes.aff roots.dic word

How to achieve this in spylls.hunspell?
I use Hunspell to generate Scrabble dictionaries, and I am looking into replacing it with spylls.hunspell.
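
To make the request concrete, here is a toy version of what unmunch does, written in plain Python and independent of spylls. It is a sketch under strong assumptions: single-character flags, rules reduced to (strip, add) pairs, no conditions, no compounding, and prefixes always combined with suffixed forms. The `unmunch` function, the flags S/D/R, and the sample entries are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# flag -> list of (strip, add) pairs; conditions are left out of this sketch
Rules = Dict[str, List[Tuple[str, str]]]


def unmunch(entries: Iterable[str], prefixes: Rules, suffixes: Rules) -> Iterator[str]:
    """Yield every form of every `stem/FLAGS` entry."""
    for entry in entries:
        stem, _, flags = entry.partition("/")
        suffixed = [
            stem[: len(stem) - len(strip)] + add
            for flag in flags
            for strip, add in suffixes.get(flag, [])
            if stem.endswith(strip)
        ]
        bases = [stem] + suffixed
        yield from bases
        for flag in flags:
            for strip, add in prefixes.get(flag, []):
                # Cross product: prefixes apply to the bare stem and to suffixed forms.
                yield from (add + base[len(strip):] for base in bases if base.startswith(strip))


suffixes = {"S": [("", "s")], "D": [("", "ed")]}
prefixes = {"R": [("", "re")]}
forms = list(unmunch(["work/SDR", "cat/S"], prefixes, suffixes))
print(forms)
# ['work', 'works', 'worked', 'rework', 'reworks', 'reworked', 'cat', 'cats']
```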

ValueError: invalid literal for int() with base 10:

Hello,

First of all, thanks for this project. I just tried it with a dictionary I maintain, and it fails with:
ValueError: invalid literal for int() with base 10: 'šími/Fs'

Which is great, because the .aff file has an error, but this message does not say much to an ordinary user...
Is there any function to check the correctness of .dic and .aff files?
Or at least please provide a better error message, with something like this:

@@ -246,10 +246,15 @@ def read_value(source: BaseReader, directive: str, *values, context: Context) ->
         ]
     if directive in ['SFX', 'PFX']:
         flag, crossproduct, count, *_ = values
-        return [
-            make_affix(directive, flag, crossproduct, *line, context=context)
-            for line in _read_array(int(count))
-        ]
+        try:
+            return [
+                make_affix(directive, flag, crossproduct, *line, context=context)
+                for line in _read_array(int(count))
+            ]
+        except ValueError as error:
+            print(f"Error at: directive, values: {directive}, {values}")
+            print(f"Maybe wrong count of rules?")
+            raise error

Most spellcheckers provide functions for end users but forget about dictionary creators and maintainers ;-) (Aspell at least tried to report e.g. affix problems when creating a dictionary from a wordlist...)
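
As an illustration of the kind of check a maintainer might want, the sketch below compares the rule count declared in each SFX/PFX header with the number of rule lines that follow it. It is a standalone approximation, not part of spylls: `check_affix_counts` is hypothetical, and it assumes each block's rule lines directly follow the header with no blank lines or comments in between.

```python
from typing import List


def check_affix_counts(lines: List[str]) -> List[str]:
    """Report SFX/PFX blocks whose declared rule count differs from the actual one."""
    problems = []
    i = 0
    while i < len(lines):
        parts = lines[i].split()
        if len(parts) >= 4 and parts[0] in ("SFX", "PFX"):
            directive, flag, declared = parts[0], parts[1], parts[3]
            # Count the following lines that belong to the same directive and flag.
            j = i + 1
            while j < len(lines) and lines[j].split()[:2] == [directive, flag]:
                j += 1
            actual = j - i - 1
            if not declared.isdigit():
                problems.append(f"line {i + 1}: {directive} {flag} has non-numeric count {declared!r}")
            elif int(declared) != actual:
                problems.append(f"line {i + 1}: {directive} {flag} declares {declared} rules, found {actual}")
            i = j
        else:
            i += 1
    return problems


sample = [
    "SFX A Y 1",        # declares one rule, but two follow
    "SFX A 0 s .",
    "SFX A 0 es [sx]",
    "PFX B Y 1",
    "PFX B 0 re .",
]
problems = check_affix_counts(sample)
print(problems)  # ['line 1: SFX A declares 1 rules, found 2']
```

A header count that is too small is one way to end up with the error above: the leftover rule line is then read as a new header, and its fourth field is passed to int().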