This project forked from bakwc/jamspell

Modern spell checking library - accurate, fast, multi-language

Home Page: https://jamspell.com/

License: MIT License

JamSpell

JamSpell is a spell checking library with the following features:

  • accurate - it considers a word's surroundings (context) for better correction
  • fast - about 5K words per second
  • multi-language - it's written in C++ and available for many languages through SWIG bindings

Colab example

JamSpellPro

jamspell.com - check out the new JamSpell version with the following features:

  • Improved accuracy (catboost gradient boosted decision trees candidates ranking model)
  • Splits merged words
  • Pre-trained models for many languages (small, medium, large) for:
    en, ru, de, fr, it, es, tr, uk, pl, nl, pt, hi, no
  • Ability to add words / sentences at runtime
  • Fine-tuning / additional training
  • Memory optimization for training large models
  • Static dictionary support
  • Built-in Java, C#, Ruby support
  • Windows support

Benchmarks

            Errors    Top 7 Errors    Fix Rate    Top 7 Fix Rate    Broken    Speed (words/second)
JamSpell    3.25%     1.27%           79.53%      84.10%            0.64%     4854
Norvig      7.62%     5.00%           46.58%      66.51%            0.69%     395
Hunspell    13.10%    10.33%          47.52%      68.56%            7.14%     163
Dummy       13.14%    13.14%          0.00%       0.00%             0.00%     -

The model was trained on 300K Wikipedia sentences + 300K news sentences (English). 95% was used for training and 5% for evaluation. An error model was used to generate misspelled text from the original. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector (no corrections).
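The evaluation setup described above (a 95%/5% split plus synthetic misspellings) can be sketched roughly as follows. This is an illustration only: `split_sentences` and `inject_errors` are hypothetical helpers, and the single-letter substitution used here is an assumed stand-in, not JamSpell's actual error model.

```python
import random


def split_sentences(sentences, train_share=0.95, seed=0):
    """Shuffle sentences and split them into train and evaluation parts."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def inject_errors(words, alphabet, error_rate=0.1, seed=0):
    """Return a copy of words where some words get one letter replaced.

    Hypothetical stand-in for an error model: each word is corrupted with
    probability error_rate by substituting a single character.
    """
    rng = random.Random(seed)
    result = []
    for word in words:
        if word and rng.random() < error_rate:
            pos = rng.randrange(len(word))
            # Pick a replacement that differs from the original letter.
            choices = [c for c in alphabet if c != word[pos]]
            word = word[:pos] + rng.choice(choices) + word[pos + 1:]
        result.append(word)
    return result
```

The original words are kept alongside the corrupted ones, so each corrected word can later be compared against its known-good form.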

We used the following metrics:

  • Errors - percent of words with errors after the spell checker has processed the text
  • Top 7 Errors - percent of words missing from the top 7 candidates
  • Fix Rate - percent of misspelled words fixed by the spell checker
  • Top 7 Fix Rate - percent of misspelled words fixed by one of the top 7 candidates
  • Broken - percent of correct words broken by the spell checker
  • Speed - number of words per second
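As a rough illustration of how the Errors, Fix Rate and Broken metrics relate to each other, here is a minimal sketch. It assumes three word lists of equal length (original, misspelled, corrected) aligned position by position; the function name `score` is hypothetical, and this is not the project's evaluate.py. The Top 7 metrics are left out because they need the candidate lists.

```python
def score(original, errored, corrected):
    """Compute Errors, Fix Rate and Broken (in percent) over aligned word lists."""
    errors_left = errored_total = fixed = clean_total = broken = 0
    for orig, err, fix in zip(original, errored, corrected):
        if fix != orig:
            errors_left += 1      # still wrong after correction
        if err != orig:
            errored_total += 1
            if fix == orig:
                fixed += 1        # a misspelling was repaired
        else:
            clean_total += 1
            if fix != orig:
                broken += 1       # a correct word was damaged

    def percent(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "errors": percent(errors_left, len(original)),
        "fix_rate": percent(fixed, errored_total),
        "broken": percent(broken, clean_total),
    }
```

Passing the misspelled list as the corrected one reproduces the "Dummy" row: Fix Rate and Broken are both 0%, and Errors equals the share of misspelled words.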

To make sure the model is not overfitted to Wikipedia + news, we also checked it on the text of "The Adventures of Sherlock Holmes":

            Errors    Top 7 Errors    Fix Rate    Top 7 Fix Rate    Broken    Speed (words/second)
JamSpell    3.56%     1.27%           72.03%      79.73%            0.50%     5524
Norvig      7.60%     5.30%           35.43%      56.06%            0.45%     647
Hunspell    9.36%     6.44%           39.61%      65.77%            2.95%     284
Dummy       11.16%    11.16%          0.00%       0.00%             0.00%     -

More details on reproducing these results are available in the "Train" section.

Usage

Python

  1. Install swig3 (usually available from your distro's package manager)

  2. Install jamspell:

pip install jamspell
  3. Download or train a language model

  4. Use it:

import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')

corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)
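To inspect candidates for every word instead of one position at a time, the calls above can be wrapped in a small loop. This is a minimal sketch: `suggest_changes` is a hypothetical helper, and it assumes only that the corrector exposes `GetCandidates(words, position)` as used above and that the first candidate returned is the preferred one.

```python
def suggest_changes(corrector, words, limit=3):
    """Map each position whose top candidate differs from the word to its candidates."""
    suggestions = {}
    for position, word in enumerate(words):
        candidates = list(corrector.GetCandidates(words, position))
        # Report a position only when the preferred candidate changes the word.
        if candidates and candidates[0] != word:
            suggestions[position] = candidates[:limit]
    return suggestions
```

With a loaded model this would be called as `suggest_changes(corrector, ['i', 'am', 'the', 'begt', 'spell', 'cherken'])`.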

C++

  1. Add jamspell and contrib dirs to your project

  2. Use it:

#include <jamspell/spell_corrector.hpp>

int main(int argc, const char** argv) {

    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");

    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"

    // Candidates for the word at index 3 ("begt")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...

    // Candidates for the word at index 5 ("cherken")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...
    return 0;
}

Other languages

You can generate extensions for other languages by following the SWIG tutorial. The SWIG interface file is jamspell.i. Pull requests with build scripts are welcome.

HTTP API

  • Install cmake

  • Clone and build jamspell (it includes an HTTP server):

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
./web_server/web_server en.bin localhost 8080
  • GET Request example:
$ curl "http://localhost:8080/fix?text=I am the begt spell cherken"
I am the best spell checker
  • POST Request example:
$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker
  • Candidates example:
curl "http://localhost:8080/candidates?text=I am the begt spell cherken"
# or
curl -d "I am the begt spell cherken" http://localhost:8080/candidates
{
    "results": [
        {
            "candidates": [
                "best",
                "beat",
                "belt",
                "bet",
                "bent",
                "beet",
                "beit"
            ],
            "len": 4,
            "pos_from": 9
        },
        {
            "candidates": [
                "checker",
                "chicken",
                "checked",
                "wherein",
                "coherent",
                "cheered",
                "cherokee"
            ],
            "len": 7,
            "pos_from": 20
        }
    ]
}

Here pos_from is the zero-based position of the first letter of the misspelled word, and len is the length of the misspelled word.
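The pos_from and len fields are enough to patch the original text on the client side. The sketch below is an illustration under stated assumptions: `candidates_url` and `apply_top_candidates` are hypothetical helper names, offsets are assumed to count characters, and the first candidate is assumed to be the preferred one. The URL helper percent-encodes the text so that spaces survive in the query string.

```python
import json
from urllib.parse import quote, urlencode


def candidates_url(text, host="localhost", port=8080):
    """Build the /candidates URL with the text percent-encoded."""
    query = urlencode({"text": text}, quote_via=quote)
    return "http://{}:{}/candidates?{}".format(host, port, query)


def apply_top_candidates(text, response_body):
    """Replace each reported span with its first candidate."""
    results = json.loads(response_body)["results"]
    # Work right to left so earlier offsets stay valid after each replacement.
    for item in sorted(results, key=lambda r: r["pos_from"], reverse=True):
        if not item["candidates"]:
            continue
        start = item["pos_from"]
        end = start + item["len"]
        text = text[:start] + item["candidates"][0] + text[end:]
    return text
```

Replacing from the end of the text backwards matters because a candidate may be longer or shorter than the word it replaces, which would shift every later offset.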

Train

To train a custom model:

  1. Install cmake

  2. Clone and build jamspell:

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make
  3. Prepare a UTF-8 text file with sentences to train on (e.g. sherlockholmes.txt) and another file with the language alphabet (e.g. alphabet_en.txt)

  4. Train the model:

./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin
  5. To evaluate the spell checker, you can use the evaluate/evaluate.py script:
python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt
  6. You can use evaluate/generate_dataset.py to generate your train/test data. It supports txt files, the Leipzig Corpora Collection format, and fb2 books.

Download models

Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the Train section above.

Contributors

bakwc, deniskore, jeancsil, narkq, dngros, tizz98, tigitz, stigjb, vjaysln, rqvolkov, reefactor
