sabin5105 / anti-cursing

anti-cursing

"anti-cursing" is a Python package that detects curse words and other offensive language in sentences or comments and replaces them 🤬

Install it the same way you install any other package, and then use it in your code.

This is still an early version of the idea, but the package is already available on PyPI (https://pypi.org/project/anti-cursing/0.0.2/).

🙏🏻 Please bear with the program while it downloads the model's weights and biases from Hugging Face the first time you use the package. This is needed only once, but it takes some time and disk space.

Concept

When writing software, you often need to detect forbidden words and replace them with something else. Hardcoding every case is inconvenient, and the Python ecosystem has many packages that address this problem. One of them is "anti-cursing".

The package works exclusively for Korean. Rather than matching text against a fixed list of banned words, it detects and replaces them with a trained deep learning model.

As a result, it can handle newly coined malicious words as long as the model has been trained on them. For this purpose, semi-supervised learning through pseudo-labeling is used.
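The pseudo-labeling idea can be illustrated with a small sketch. This is not the package's training code: the `train` function below is a hypothetical toy scorer standing in for fine-tuning a transformer, and the thresholds and placeholder words are assumptions chosen only to make the loop visible. Confident predictions on unlabeled text are added to the training set, and the model is retrained on the enlarged set.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label); 1 = offensive, 0 = clean


def train(labeled: List[Example]) -> Callable[[str], float]:
    """Toy stand-in for fine-tuning: returns a scorer that gives the share
    of tokens seen only in offensive training examples."""
    bad = {tok for text, y in labeled if y == 1 for tok in text.split()}
    good = {tok for text, y in labeled if y == 0 for tok in text.split()}
    offensive_only = bad - good

    def score(text: str) -> float:
        tokens = text.split()
        if not tokens:
            return 0.0
        return sum(tok in offensive_only for tok in tokens) / len(tokens)

    return score


def pseudo_label(labeled, unlabeled, low=0.0, high=0.5, rounds=3):
    """Repeatedly label confident unlabeled examples and retrain."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        score = train(labeled)
        still_unlabeled = []
        for text in pool:
            p = score(text)
            if p >= high:        # confidently offensive: add as pseudo-label 1
                labeled.append((text, 1))
            elif p <= low:       # confidently clean: add as pseudo-label 0
                labeled.append((text, 0))
            else:                # uncertain: keep for a later round
                still_unlabeled.append(text)
        if len(still_unlabeled) == len(pool):
            break                # nothing new was labeled, so stop early
        pool = still_unlabeled
    return train(labeled)


seed = [("you are badword", 1), ("you are nice", 0)]
model = pseudo_label(seed, ["badword newslur", "have a nice day"])
print(train(seed)("newslur"), model("newslur"))
```

In this toy run the seed model gives the unseen placeholder word `newslur` a score of zero, while the model retrained with pseudo-labels flags it, which is the behavior the paragraph above describes for new malicious words.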

Additionally, instead of masking malicious words with special characters such as --- or ***, the package can convert them into emojis so that the result reads more naturally.
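The difference between masking and emoji replacement can be sketched as follows. The function `replace_flagged` and the list of flagged words are hypothetical: in the package the flagged words would come from the model, and how it performs the substitution internally is not shown in this README.

```python
from typing import Iterable


def replace_flagged(sentence: str, flagged: Iterable[str], replacement: str = "👼🏻") -> str:
    """Replace every occurrence of each flagged word with `replacement`.

    Longer words are handled first so that a short flagged word does not
    break up a longer one that contains it.
    """
    for word in sorted(flagged, key=len, reverse=True):
        sentence = sentence.replace(word, replacement)
    return sentence


text = "you are such a badword"
print(replace_flagged(text, ["badword"], replacement="***"))  # masked style
print(replace_flagged(text, ["badword"]))                     # emoji style
```

Because the replacement is applied to the flagged substring only, any suffix attached to the word is kept, which matches the usage example in this README where the sentence ending survives after the emoji.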

Installation

You can install the package using pip:

pip install anti-cursing

Usage

from anti_cursing.utils import antiCursing

antiCursing.anti_cur("나는 너가 좋지만, 너는 너무 개새끼야")
나는 너가 좋지만, 너는 너무 👼🏻야
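For more than one sentence, a thin wrapper may be convenient. The sketch below is an assumption-laden illustration: `clean_comments` is a hypothetical helper, and it assumes that `antiCursing.anti_cur` returns the cleaned string, which this README does not state explicitly. A stand-in function is used so the example runs without downloading the model.

```python
from typing import Callable, Iterable, List


def clean_comments(comments: Iterable[str], clean: Callable[[str], str]) -> List[str]:
    """Apply `clean` to each non-blank comment; blank comments pass through."""
    return [clean(c) if c.strip() else c for c in comments]


def fake_clean(text: str) -> str:
    # Stand-in for the real model call, used only for this demonstration.
    return text.replace("badword", "👼🏻")


print(clean_comments(["hi badword", "  ", "ok"], fake_clean))

# With the package installed, the real call could be passed instead
# (assuming anti_cur returns the cleaned string):
#   from anti_cursing.utils import antiCursing
#   clean_comments(comments, antiCursing.anti_cur)
```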

Model-comparison

| Classification | KcElectra | KoBERT | RoBERTa-base | RoBERTa-large |
| --- | --- | --- | --- | --- |
| Validation Accuracy | 0.88680 | 0.85721 | 0.83421 | 0.86994 |
| Validation Loss | 1.00431 | 1.23237 | 1.30012 | 1.16179 |
| Training Loss | 0.09908 | 0.03761 | 0.0039 | 0.06255 |
| Epoch | 10 | 40 | 20 | 20 |
| Batch-size | 8 | 32 | 16 | 32 |
| transformers | beomi/KcELECTRA-base | skt/kobert-base-v1 | xlm-roberta-base | klue/roberta-large |
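One way to read the comparison is to rank the models by validation accuracy and to look at the gap between validation and training loss. The sketch below only reuses the numbers from the table; treating a large gap as a possible sign of overfitting is an interpretation added here, not a claim from the original results.

```python
# Figures copied from the model comparison table above.
results = {
    "KcElectra":     {"val_acc": 0.88680, "val_loss": 1.00431, "train_loss": 0.09908},
    "KoBERT":        {"val_acc": 0.85721, "val_loss": 1.23237, "train_loss": 0.03761},
    "RoBERTa-base":  {"val_acc": 0.83421, "val_loss": 1.30012, "train_loss": 0.0039},
    "RoBERTa-large": {"val_acc": 0.86994, "val_loss": 1.16179, "train_loss": 0.06255},
}

# Model with the highest validation accuracy.
best = max(results, key=lambda name: results[name]["val_acc"])

# Difference between validation loss and training loss for each model.
gaps = {name: r["val_loss"] - r["train_loss"] for name, r in results.items()}

print("best by validation accuracy:", best)
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: loss gap {gap:.5f}")
```

On these figures KcElectra has both the highest validation accuracy and the smallest loss gap.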

Dataset

Used-api

Google translator

License

This repository is licensed under the MIT license. See LICENSE for details.


Project-status

80%

Future-work

Updates are coming soon; please bear with me 🙏🏻

