
JsonSpeller

Training in progress

Under Development

Main Functions

  1. spell_check(list_text, deep_search=True, context_search=True, sim_char_search=True)
    Replaces the misspelled words in the given list of texts using various strategies without training.

    • list_text : python list of texts.
    • deep_search : default=True | enables / disables deep-learning search.
    • context_search : default=True | enables / disables context search.
    • sim_char_search : default=True | enables / disables similar character search.
  2. train_check(list_text, deep_search=True, context_search=True, sim_char_search=True)
    Trains on the words in the given list, updates misspell.json, then replaces the misspelled words in the given list of texts using various strategies.

    • list_text : python list of texts.
    • deep_search : default=True | enables / disables deep-learning search.
    • context_search : default=True | enables / disables context search.
    • sim_char_search : default=True | enables / disables similar character search.
  3. train(list_text)
    Trains on the words in the given list and updates misspell.json.

    • list_text : python list of texts.
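
As a rough illustration of the spell_check interface, here is a toy stand-in backed by a misspell.json-style mapping. This is not the package's actual implementation; the strategy flags are only stubbed:

```python
# Toy stand-in for misspell.json: misspelled word -> correction.
MISSPELL = {"teh": "the", "wolrd": "world"}

def spell_check(list_text, deep_search=True, context_search=True, sim_char_search=True):
    """Replace known misspellings in each text; the search flags are stubs here."""
    fixed = []
    for text in list_text:
        fixed.append(" ".join(MISSPELL.get(w, w) for w in text.split()))
    return fixed

print(spell_check(["teh wolrd is big"]))  # ['the world is big']
```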

Material used in Training

Comparison

(to be observed once training is done.)

  1. JsonSpeller

  2. autocorrect

  3. textblob

  4. symspellpy

  5. pyspellcheck

Why did I make this?

"If it frustrates you, go do it yourself."

...and I became that "you."

There was no Korean spell checker I was really happy with... so I'm building one... and as a trial run, I made an English version first!

While working on my final project, I realized that there aren't many great spell checkers for Korean.
Therefore, I decided to create one.
I named it JsonSpeller because it works based on a JSON file, not because I am Jason.

Before diving into my spell checker, here are the results of different spell checkers.
We had around 2,000 text samples to spell-check.

Other models

| Model | Time | Result | Tested by |
| --- | --- | --- | --- |
| T5 | 2.5 hours | Does a good job, but some hate speech is labeled as "hate-speech" and some is left as is. | @JasonHeesangLee |
| T5 (detailed prompt) | 4 hours | The prompt will need another revision; for now it only outputs ####Hello#### millions of times... | @JasonHeesangLee |
| ChatGPT API (simple prompt) | 2.5 hours | Good, but too good: for hate speech it cleanses the term and explains the situation (may damage the original if used as is). Most importantly, it costs money. | @SionBang |
| ChatGPT (detailed prompt) | 2.5 hours | Good; hate speech is labeled as [OFF], and when the text is a single word, it explains that word (may damage the original if used as is). It also costs money. | @JasonHeesangLee |
| Rule-based spacing module with the khaiii POS tagger by Kakao Corp | 22 seconds | 95% of spacing corrected (perhaps up to 100% with further rule development). However, it doesn't correct spellings; it only re-concatenates consonants and vowels that were separated (which happens often when typing fast). It may be useful when ensembled with other spell-checking models. | @JasonHeesangLee |
| symspellpy-ko | 20 minutes | A model built on symspellpy. Since it checks spelling against an external dictionary, some new terms can be left out or wrongly corrected. (If a perfect word-list existed, I believe this would be the best model in theory, but we all know the perfect word-list doesn't exist.) | @JasonHeesangLee |
| Pusan-University Spellchecking API | 40-60 minutes | Performs well at spacing, sanitizing profanity, and separating sentences, but doesn't understand neologisms, people's names, etc., and returns incorrect spellings. It also cannot be used for deep learning, as the service is free and offered to the public. | @NayoungBae |
| soynlp | 20 minutes for training (30,000 sentences), 3 seconds for inference | Some neologisms were preserved well and some were not. For other ordinary terms it also returns bad spacing. | @SionBang |
| hanspell | No idea | Errored in every teammate's environment. | ALL |

symspellpy-ko would have been the best of the models above, if only a perfect word-list including neologisms existed.

Then I thought, "What if we make typos for words on purpose, and put them in the dictionary to replace them?"
But creating those typos would be too inefficient, even if we generate them with LLMs.
Below is the modified version of that thought.
(The current version I have developed can only perform spell check on English; Korean support will come soon.)
Please kindly go through it and tell me if this method would be too inefficient or if there are any problems I haven't thought of.
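
To see why generating typos on purpose is inefficient, consider how many edit-distance-1 variants a single word has. The sketch below is my own illustration (not code from the package), generating deletion, transposition, and insertion variants:

```python
def typo_variants(word):
    """Generate edit-distance-1 typo variants (deletes, swaps, inserts)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    inserts = {a + c + b for a, b in splits for c in letters}
    return (deletes | swaps | inserts) - {word}

variants = typo_variants("world")
print("wolrd" in variants)  # True
```

Even without substitutions, a 5-letter word yields well over a hundred variants; storing all of them for every dictionary word is what makes the on-purpose-typos idea so expensive.
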
Disclaimer: We are working on extracting keywords from users' daily records and segmenting them into "Positive" and "Negative" emotions, as explained in this discussion, and JsonSpeller was initially developed solely for that project.

Facts & Hypothesis

A term must appear at least twice to be considered a keyword.
For nouns that appear only once, we extract them as keywords if the spelling is correct (i.e., the word is in the Korean word-list); otherwise we exclude them.
Even if it is included, the user may say: "What is this? It's not a word I wrote down."
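
The keyword rule above can be sketched in a few lines; the word-list here is a toy stand-in (the real project would use a Korean word-list and POS-tagged nouns):

```python
from collections import Counter

WORDLIST = {"coffee", "diary", "walk"}  # toy stand-in for a real word-list

def extract_keywords(nouns):
    """Keep nouns appearing at least twice; keep one-off nouns only if
    they are spelled correctly (i.e. present in the word-list)."""
    counts = Counter(nouns)
    return {w for w, n in counts.items() if n >= 2 or w in WORDLIST}

print(extract_keywords(["coffee", "cofee", "dog", "dog"]))  # -> {'coffee', 'dog'}
```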

How to

  1. Prepare tweet texts, or any texts written informally (not news articles), for pretraining.
  2. List up the words that appear more than once.
    This threshold could be adjusted after a certain number of pretraining runs, once we have enough data to set it.
  3. Extract the words in the possible-misspelled-terms category for those correct words from the text, and put them into a dictionary in the form key (misspelled word) - value_array (possible correct terms).
    3-1. What are the possible-misspelled-terms and possible-correct-terms categories?
    • possible-misspelled-terms category : misspelled terms like "wolrd", "word", "owrld", etc. when the correct word is "world".
      However, the words in this category must appear only once in the text.
      If a word appears more than once, it is added to the list mentioned above.
    • Since real words like "word" exist, the user most likely meant "world" when "world" appears multiple times and "word" appears only once or twice.
    • We exclude actual dictionary words because we don't want future uses of "word" to be recognized as "world".
      For example, a key - value_array entry will look like : "wword" : ["word", "world", "work", ... etc.]
  4. Iterate over the nouns in the string and, when one matches a dictionary key, replace it with the appropriate word from its value_array.
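
The steps above can be sketched end to end. Here "possible-misspelled-terms" is approximated with a simple edit-distance-1 test; this is my own stand-in, and the package's deep / context / similar-character searches are not reproduced:

```python
from collections import Counter

DICTIONARY = {"word", "world"}  # stand-in for a real word-list

def edit_distance_le1(a, b):
    """True if a and b differ by at most one insert, delete, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

def build_misspell_map(texts):
    """Words seen more than once are 'known'; one-off words that are not
    real dictionary words become keys mapping to nearby known words."""
    counts = Counter(w for t in texts for w in t.split())
    known = {w for w, n in counts.items() if n > 1}
    once = {w for w, n in counts.items() if n == 1 and w not in DICTIONARY}
    return {w: c for w in once
            if (c := sorted(k for k in known if edit_distance_le1(w, k)))}

texts = ["the worl is round", "the world turns", "hello world"]
print(build_misspell_map(texts))  # {'worl': ['world']}
```

A step 4 pass would then walk the nouns of each new text and substitute any key it finds with a word from its value_array.
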

Pros

  • If this is applied in a real-world service, it is possible to create a separate JSON file of each user's frequently made typos and keep using it. (Personalized misspelling handling?)
  • Since the process only runs on words recognized as nouns in the sentence (Korean only), the processing time does not grow in proportion to sentence length.
  • Many typo variants can be collected quickly by combining these personalized JSON files.
  • There is no need to build a dictionary of neologisms, because the words people use often will already be among the known words (those appearing multiple times).
  • We can use different datasets to pretrain on a certain topic: data science, sports, medical, or whatever topic you think of.

Cons

  • Some words could be left out or misjudged.
  • It's likely to take a long time to go through each word.
  • But... will there really be that many typo variants for one word?
  • Everything depends on users behaving exactly as planned.

Last but not least, a great THANK YOU to my teammates for being patient with me! @nayoungbae @chanhyukhan @bangsioni
