lgomezt / tidyx

Python package to clean raw tweets for ML applications.

Home Page: https://tidyx.readthedocs.io/

License: MIT License

Python 63.34% HTML 36.66%
natural-language-processing tweet-analysis tweets twitter

tidyX

Figure: Before and After tidyX

tidyX is a Python package for cleaning and preprocessing text for machine learning applications, especially text written in Spanish and originating from social networks. The library provides a pipeline that removes unwanted characters, normalizes text, and groups similar terms, among other steps, to prepare text for NLP applications.

For a deeper look at the package, visit our website: https://tidyx.readthedocs.io/

Installation

Install the package using pip:

pip install tidyX

Make sure you have the necessary dependencies installed. If you plan on lemmatizing, you'll need spaCy along with the appropriate language models. For Spanish lemmatization, we recommend downloading the es_core_news_sm model:

python -m spacy download es_core_news_sm 

For English lemmatization, we suggest the en_core_web_sm model:

python -m spacy download en_core_web_sm 

To see a full list of available models for different languages, visit spaCy's documentation.
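
Before lemmatizing, it can help to confirm that a model is actually installed. spaCy models are distributed as regular Python packages, so one minimal check, sketched here with only the standard library, is to ask whether the package can be found. The helper name `model_is_installed` is ours for illustration and is not part of tidyX or spaCy.

```python
from importlib.util import find_spec


def model_is_installed(model_name: str) -> bool:
    """Return True if a package with this name can be found in the environment."""
    # Hypothetical helper: spaCy models fetched with `python -m spacy download`
    # are installed as packages, so they can be located like any other module.
    return find_spec(model_name) is not None


for name in ("es_core_news_sm", "en_core_web_sm"):
    status = "installed" if model_is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a model is reported as missing, run the matching `python -m spacy download` command shown above.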

Features

  • Standardize Text Pipeline: The preprocess() method standardizes input strings in a single call, with a particular focus on tweets. It converts the input to lowercase, strips accents (and emojis, if specified), and removes URLs, hashtags, and certain special characters. It also trims extra spaces, extracts mentions, and removes 'RT' prefixes from retweets, and it can optionally delete stopwords in a specified language.
from tidyX import TextPreprocessor as tp

# Raw tweet example
raw_tweet = "RT @user: Check out this link: https://example.com 🌍 #example 😃"

# Applying the preprocess method
cleaned_text = tp.preprocess(raw_tweet)

# Printing the cleaned text
print("Cleaned Text:", cleaned_text)

Output:

Cleaned Text: check out this link

To remove English stopwords, simply add the parameters remove_stopwords=True and language_stopwords="english":

from tidyX import TextPreprocessor as tp

# Raw tweet example
raw_tweet = "RT @user: Check out this link: https://example.com 🌍 #example 😃"

# Applying the preprocess method with additional parameters
cleaned_text = tp.preprocess(raw_tweet, remove_stopwords=True, language_stopwords="english")

# Printing the cleaned text
print("Cleaned Text:", cleaned_text)

Output:

Cleaned Text: check link

For a more detailed explanation of the customizable steps of the function, visit the official preprocess() documentation.
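
To make the steps listed for preprocess() concrete, here is a rough sketch of a similar cleaning pipeline built only on the standard library. It is not tidyX's implementation: the function name `simple_preprocess`, the regular expressions, and the order of the steps are our assumptions, and the real method has options this sketch ignores (stopword removal, mention extraction, keeping emojis).

```python
import re
import unicodedata


def simple_preprocess(text: str) -> str:
    """Illustrative stand-in for a tweet-cleaning pipeline (not tidyX's code)."""
    text = re.sub(r"^RT\s+", "", text)          # drop the retweet prefix
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)        # remove mentions and hashtags
    text = text.lower()
    # Decompose accented letters, then drop the combining marks.
    # Note: this simplification also turns "ñ" into "n".
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z\s]", " ", text)       # drop emojis, digits, punctuation
    return re.sub(r"\s+", " ", text).strip()    # collapse extra spaces


print(simple_preprocess("RT @user: Check out this link: https://example.com #example"))
# prints: check out this link
```

Removing URLs before mentions and hashtags matters in this sketch: otherwise a fragment such as `#section` inside a URL would be cut out first and leave part of the link behind.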

  • Stemming and Lemmatizing: One of the foundational steps in preparing text for NLP applications is bringing words to a common base or root. This library provides both stemmer() and lemmatizer() functions to perform this task across various languages.
  • Group similar terms: When working with a corpus sourced from social networks, it's common to encounter texts with grammatical errors or words that aren't formally included in dictionaries. These irregularities can pose challenges when creating Term Frequency matrices for NLP algorithms. To address this, we developed the create_bol() function, which allows you to create specific bags of terms to cluster related terms.
  • Remove unwanted elements: such as special characters, extra spaces, accents, emojis, URLs, and Twitter mentions, among others.
  • Dependency Parsing Visualization: Incorporates visualization tools that enable the display of dependency parses, facilitating linguistic analysis and feature engineering.
  • Much more!
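
The idea behind grouping similar terms can be illustrated with a small standard-library sketch. This is not how create_bol() is implemented and does not reproduce its signature or output; the function `group_similar_terms`, the use of `difflib.SequenceMatcher`, and the 0.8 threshold are all assumptions chosen for illustration. Each term joins the first group whose representative it resembles closely enough, otherwise it starts a new group.

```python
from difflib import SequenceMatcher


def group_similar_terms(terms, threshold=0.8):
    """Greedily group terms by string similarity (hypothetical helper)."""
    groups = []
    for term in terms:
        for group in groups:
            # Compare against the group's representative (its first member).
            if SequenceMatcher(None, term, group[0]).ratio() >= threshold:
                group.append(term)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([term])
    return groups


terms = ["bueno", "buenoo", "buenooo", "malo", "maloo"]
print(group_similar_terms(terms))
# prints: [['bueno', 'buenoo', 'buenooo'], ['malo', 'maloo']]
```

Collapsing each group to a single representative before building a Term Frequency matrix keeps spelling variants from being counted as separate terms.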

Tutorials

Contributing

Contributions to enhance tidyX are welcome! Feel free to open issues for bug reports or feature requests, or to submit pull requests. If this package has been helpful, please give us a star :D

Contributors

iamtalhaasghar, lgomezt, terrok9