
Comments (6)

wolfgarbe commented on September 28, 2024
  1. SymSpell is based on the Damerau–Levenshtein edit distance. It takes the input term and tries to find the dictionary term with the smallest edit distance. Unfortunately, this is the opposite of the intention behind an abbreviation like gr8 for great: the point of abbreviations is to save as much typing as possible, which results in a large edit distance.
    The best way to solve your problem is to translate all abbreviations before running SymSpell.
    Of course, you could also add gr8 as a valid word to the SymSpell dictionary. In that case the abbreviation gr8 stays unchanged in the text, which may also be the desired behavior (see the sketch after this list).

  2. readying is a valid English word, so it probably should not be removed from the dictionary. Lookup("readying", SymSpell.Verbosity.All, 1) will still return reading as one of the suggestions. You have to choose manually whether readying or reading fits better in your context. The reason is that SymSpell does not yet take the context of the whole sentence into account when choosing the most probable suggestion. The order of suggestions is currently based solely on edit distance and word frequency (for equal edit distance).
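A minimal sketch of both points with the C# implementation; the dictionary file name matches the one shipped in the repo, while the count given to gr8 is an arbitrary assumption:

```csharp
// Sketch of both points above (C#, top-level statements).
var symSpell = new SymSpell(82765, 2, 7); // initial capacity, max dictionary edit distance, prefix length
symSpell.LoadDictionary("frequency_dictionary_en_82_765.txt", 0, 1); // term in column 0, count in column 1

// Point 1: register gr8 as a valid word so it is left unchanged.
symSpell.CreateDictionaryEntry("gr8", 10000); // count is illustrative; it sets the ranking

// Point 2: readying stays valid, but reading is still offered as a suggestion.
foreach (var s in symSpell.Lookup("readying", SymSpell.Verbosity.All, 1))
    Console.WriteLine($"{s.term}  distance={s.distance}  count={s.count}");
```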


fahadshery commented on September 28, 2024

Thanks for the detailed explanation. I deal with a lot of technical terms that are specific to the telecoms industry. How can I add those words to the dictionary? For example, people who misspell broadband as brodband should be autocorrected to broadband. Is there a detailed package documentation/guide I could use that lists all the available methods and their purpose, including what the inputs and outputs are?

Lastly, I deal with hundreds of thousands of customer feedback entries and will have to read the verbatims in via a CSV. What's the best/quickest way to autocorrect these before doing any other NLP-related task?

Many thanks


wolfgarbe commented on September 28, 2024

SymSpell is an algorithm rather than a turn-key product. Its purpose is to find, in a very short time, all terms/candidates from a large dictionary which are within a maximum edit distance (Damerau–Levenshtein) of an input term (to my knowledge much faster than any other algorithm). The possible applications go far beyond spell checking.
A developer can use SymSpell as core technology to find approximate matches from a dictionary within a maximum edit distance extremely fast.
The developer can then customize/extend it with additional pre- and post-processing modules according to their specific requirements in order to build/extend a product.

How can I add those words to the dictionary?

The SymSpell dictionary is a plain text file in UTF-8 encoding. You can edit it and add terms to it with any text editor.
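For reference, each line holds one term and its frequency count, separated by a space. A sketch of the format and of loading it (the counts and the added telecoms term are illustrative):

```csharp
// Each dictionary line is "<term> <count>", e.g. (counts illustrative):
//   the 23135851162
//   broadband 5247113
//   fttp 120000        <- a domain-specific term added by hand
// Load it with the term in column 0 and the count in column 1:
var symSpell = new SymSpell(82765, 2, 7);
symSpell.LoadDictionary("frequency_dictionary_en_82_765.txt", 0, 1);
```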

For example, people who misspell broadband as brodband should be autocorrected to broadband.

If you want to auto correct a text file, you have to read the source text file yourself, split it into terms, feed one term at a time to SymSpell, and write either the original term or one of the terms suggested by SymSpell to the destination file.
SymSpell orders the suggestions by edit distance and word frequency (for equal edit distance). It is your duty to choose the suggestion which is most likely in your context (based on word frequencies in your specific domain, typical errors from your customer base, assisted by deep learning, etc.).
It is also your duty to remember/restore upper/lower case information and punctuation.
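A hedged sketch of that pipeline for a CSV of feedback verbatims; the file names, the tokenization regex, and the keep-the-original fallback are assumptions, not part of SymSpell:

```csharp
using System.IO;
using System.Text.RegularExpressions;

var symSpell = new SymSpell(82765, 2, 7);
symSpell.LoadDictionary("frequency_dictionary_en_82_765.txt", 0, 1);

using var writer = new StreamWriter("feedback_corrected.csv");
foreach (var line in File.ReadLines("feedback.csv"))
{
    // Correct each alphabetic token; everything else passes through unchanged,
    // which preserves punctuation but not the casing of corrected words.
    var corrected = Regex.Replace(line, "[A-Za-z]+", m =>
    {
        var suggestions = symSpell.Lookup(m.Value.ToLower(), SymSpell.Verbosity.Top, 2);
        return suggestions.Count > 0 ? suggestions[0].term : m.Value; // keep original if no match
    });
    writer.WriteLine(corrected);
}
```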

Is there a detailed package documentation

Currently there is no documentation beyond the readme, but there is a lot of additional information in the linked blog posts.

SymSpell has three methods:

  1. Lookup
  2. LookUpCompound
  3. WordSegmentation

Lookup (Single-word spelling correction)
Lookup provides very fast spelling correction of single words.

A Verbosity parameter allows you to control the number of returned results:

  • Top: the single suggestion with the highest term frequency among the suggestions with the smallest edit distance found.
  • Closest: all suggestions with the smallest edit distance found, ordered by term frequency.
  • All: all suggestions within maxEditDistance, ordered by edit distance, then by term frequency.

The maximum edit distance parameter controls up to which edit distance dictionary words are treated as suggestions.
The required word frequency dictionary can either be loaded directly from text files (LoadDictionary) or generated from a large text corpus (CreateDictionary).
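A sketch of the three modes side by side, assuming a symSpell instance loaded as above (memebers is the misspelling used in the readme):

```csharp
using System.Linq;

foreach (var verbosity in new[] { SymSpell.Verbosity.Top,
                                  SymSpell.Verbosity.Closest,
                                  SymSpell.Verbosity.All })
{
    var suggestions = symSpell.Lookup("memebers", verbosity, 2);
    Console.WriteLine($"{verbosity}: " + string.Join(", ",
        suggestions.Select(s => $"{s.term} ({s.distance}/{s.count})")));
}
// Top     -> a single suggestion, e.g. "members"
// Closest -> all suggestions at the smallest edit distance found
// All     -> every suggestion within maxEditDistance
```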

LookupCompound (Compound-aware multi-word spelling correction)

LookupCompound can insert only a single space into a token (a string fragment separated by existing spaces). It is intended for spelling correction of already word-segmented text, but can fix an occasional missing space. Because of the single-space restriction per token there are fewer variants to generate and evaluate; therefore it is faster, and the quality of the correction is usually better.
It returns only a single (best/most likely) correction suggestion, not multiple possible variants.
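A sketch in the style of the readme example; the expected output assumes the standard English frequency dictionary:

```csharp
// LookupCompound takes the whole string and returns one best suggestion.
var results = symSpell.LookupCompound("whereis th elove", 2); // maxEditDistance = 2
Console.WriteLine(results[0].term); // expected: "where is the love"
```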

WordSegmentation (Word segmentation of noisy text)
WordSegmentation can insert as many spaces as required into a token. It is therefore also suitable for long strings without any spaces. The drawback is lower speed and correction quality, as many more potential variants exist which need to be generated, evaluated, and chosen from.
It returns only a single (best/most likely) segmentation variant, not multiple possible variants.
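A sketch; the correctedString and segmentedString field names assume the tuple-style result of the current C# implementation and may differ in other ports:

```csharp
var result = symSpell.WordSegmentation("thequickbrownfoxjumpsoverthelazydog");
Console.WriteLine(result.correctedString); // "the quick brown fox jumps over the lazy dog"
Console.WriteLine(result.segmentedString); // segmentation alone, without spelling correction
```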


wolfgarbe commented on September 28, 2024

Regarding auto correction: instead of using Lookup() as described earlier and having full control, you could theoretically use WordSegmentation(text, maxEditDistance). It does an automatic correction and word segmentation of an arbitrary text.
But it should be used with caution:

  • First, the resulting text will be lowercased and all punctuation will be removed.

  • Second, the automatically chosen corrections are based solely on edit distance and word occurrence frequency. No context-based information is taken into account.

  • Third, as the automatic correction will make some mistakes, it is only useful if the correction reduces the overall number of mistakes, i.e. the number of properly corrected words > the number of newly introduced errors. At first glance this seems obvious, but usually most words are already correct (so there are many candidates where the auto correction could introduce new errors), while only a small percentage of words contain mistakes (so there are only a few words where the auto correction could remove errors). For example, if 5% of the words are misspelled and the corrector fixes 80% of those, it removes errors from 4% of all words; if it also miscorrects just 5% of the 95% correct words, it introduces errors into 4.75% of all words, a net loss.
    See also my post Automatic Spelling Correction: success or disaster (unpublished draft).


fahadshery commented on September 28, 2024

Thank you so much for taking the time to explain in detail. Personally, if I were to program this, how would I pick a suggestion to replace a word? Obviously this would get better over time, but initially do you recommend a threshold value for a probability (if that's what it outputs)?


wolfgarbe commented on September 28, 2024

With symSpell.Lookup(inputTerm, SymSpell.Verbosity.All, maxEditDistanceLookup) you get all matches within the maximum edit distance.
The order of suggestions is based solely on edit distance and word frequency (for equal edit distance).
You can implement custom post-processing and filter and/or re-order those suggestions according to your needs.
E.g. you could implement a weighted edit distance that gives higher priority to suggestions where the original and corrected characters are close to each other on the keyboard layout, or which sound similar (e.g. Soundex or other phonetic algorithms that identify different spellings of the same sound).
You can also try to utilize the context of the whole sentence to guess the most likely suggestion, using bigram or n-gram probabilities, hidden Markov models, Bayes' theorem, or deep learning.
This is hard research rather than a simple programming task.
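A hedged sketch of such post-processing; the scoring weight is purely illustrative, and a real weighted edit distance would compare the actual characters involved (inputTerm and maxEditDistanceLookup are assumed defined as above):

```csharp
using System.Linq;

var suggestions = symSpell.Lookup(inputTerm, SymSpell.Verbosity.All, maxEditDistanceLookup);

// Re-rank with an illustrative score: edit distance discounted by log frequency.
// Swap the score for keyboard adjacency or phonetic similarity as needed.
var reRanked = suggestions
    .OrderBy(s => s.distance - 0.1 * Math.Log(s.count))
    .ToList();
Console.WriteLine(reRanked.First().term); // most likely suggestion under this weighting
```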

