
preln's Introduction


A package for preprocessing text in Spanish


Preln is a Python package that speeds up development and improves the performance of applications that need text preprocessing for NLP (Natural Language Processing). The library takes into account the special characteristics of data written in Spanish. It makes data suitable and ready to use for complex applications such as training machine-learning models, extracting content from social media, or developing tools that automate language correction, lemmatization and stemming, among many others.

📃 Latest version v0.5.1-alpha out now! 📃

💬​ Contribution & Questions

Type: Platform
🐞 Bug Reports: [GitHub Issue Tracker]
📦 Feature Requests & Ideas: [GitHub Discussions]
🛠️ Usage Questions & Discussions: [GitHub Discussions]

💼​ Features

  • Apply and combine general basic operations to pre-process text in Spanish
  • Connect directly to file paths, databases and other sources for easy reading and writing of data
  • Simple, optimized implementation that is ready to use with configuration files
  • Autocorrect function to improve data quality
  • Methods for privacy control, replacing or removing personal data from the dataset
  • Support for the Spanish and English languages

​💾​ Install Preln

To start using Preln, run the following command:

pip install preln

Note: to use Preln in a Python notebook, you may need to run this command in a code cell.

The main class of the package is called Preprocessing, and it contains all of the package's principal functions. Import this class and create an object in order to use its methods:

from Preln.preprocessing import Preprocessing

preprocessor = Preprocessing(date=False, date_format=None, accents=False, lowercasing=True,   
               privacy=True, privacy_format="multi:replace", correction=True, media=True, 
               media_format="mention:delete", numbers=False, punctuation=True, 
               stopwords=True, tokenizer=True, debug=False)

🔧​ Example of use

This basic example shows how to use the package to process a simple piece of text.

sample_text = "¡Hola @usuario!, mi nombre es Preln, me han creado Adrián y Raúl. Revisa mi documentación en https://www.preln.org"

test = preprocessor.pipeline(sample_text)

print(test) # ['MENTION', 'nombre', 'ORG', 'creado', 'PERSON', 'PERSON', 'revisa', 'documentación', 'URL']

Note: The parameters of the pipeline method, which toggle the core methods, are set by default. It is worth changing them to suit each text you want to process.

💳​ License

Preln is licensed under MIT License.


preln's People

Contributors

adri-hdez, raul-martin-dev


preln's Issues

Manipulate stopwords

  • Implement methods that let the user manually add, remove and modify stopwords.
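
A minimal sketch of what such methods could look like is shown below. The `StopwordList` class and its method names are hypothetical and are not part of Preln's API; the words used are only a small illustrative sample.

```python
class StopwordList:
    """Hypothetical user-editable stopword set (not Preln's API)."""

    def __init__(self, words=None):
        # Store lowercased words in a set for fast membership checks.
        self._words = {w.lower() for w in (words or [])}

    def add(self, word):
        self._words.add(word.lower())

    def remove(self, word):
        # discard() does not raise if the word is absent.
        self._words.discard(word.lower())

    def replace(self, old, new):
        # Modify an entry: drop the old word and add the new one.
        self.remove(old)
        self.add(new)

    def filter(self, tokens):
        # Keep only the tokens that are not stopwords.
        return [t for t in tokens if t.lower() not in self._words]


stopwords = StopwordList(["de", "la", "que"])
stopwords.add("el")
stopwords.remove("que")
print(stopwords.filter(["el", "nombre", "de", "que"]))  # ['nombre', 'que']
```

Keeping the words in a set makes each lookup cheap, which matters when the filter runs over every token of a large corpus.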

Add functionality for accents

Good morning,
it would be very useful to add the option of normalizing accented text in Spanish. This would need a parameter that enables a feature converting accented vowels into unaccented vowels. It is important to preserve case: an uppercase accented vowel becomes an uppercase unaccented vowel, and the same applies to lowercase.
That way, a user who wants both conversions would call lowercase + withoutaccent.
Thank you very much
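
The requested behaviour could be sketched roughly as follows. The function name `without_accents` is hypothetical and not part of Preln; the sketch also assumes that only vowels with an acute accent should change, so "ñ" and "ü" are left untouched.

```python
import unicodedata

# Map only accented vowels to their plain counterparts, case by case.
# Assumption: "ñ" and "ü" are kept, since the request mentions only vowels
# with an accent mark.
_ACCENT_MAP = str.maketrans("áéíóúÁÉÍÓÚ", "aeiouAEIOU")


def without_accents(text):
    """Replace accented vowels with unaccented ones, preserving case."""
    # NFC composes "a" + combining accent into a single "á" so the map applies.
    return unicodedata.normalize("NFC", text).translate(_ACCENT_MAP)


print(without_accents("Adrián visitó ÁVILA en otoño"))  # Adrian visito AVILA en otoño
```

Because case is preserved, the double conversion mentioned in the request would simply chain the two steps, for example `without_accents(text.lower())`.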

Date limits bug

Good morning, when a date is specified, some kind of pattern is needed to check that the day and the month are within a valid range.
If a date format is specified together with an invalid date, the application goes out of range and raises an error. One example is the date "31/13/2022".
There are two possible solutions: a try-catch, or checking the ranges before calling the indexing dictionaries used to build the months.
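
Both suggested approaches can be sketched with the standard library. The function names and the "%d/%m/%Y" format below are assumptions for illustration, and the sketch is independent of how Preln builds its month dictionaries.

```python
import calendar
from datetime import datetime


def is_valid_date(text, date_format="%d/%m/%Y"):
    """Option 1: let strptime validate and catch the error (try/except)."""
    try:
        datetime.strptime(text, date_format)
        return True
    except ValueError:
        return False


def in_range(day, month, year):
    """Option 2: check the ranges explicitly before any dictionary lookup."""
    if not 1 <= month <= 12:
        return False
    # monthrange returns (weekday of first day, number of days in the month).
    return 1 <= day <= calendar.monthrange(year, month)[1]


print(is_valid_date("31/13/2022"))  # False
print(is_valid_date("31/12/2022"))  # True
print(in_range(31, 13, 2022))       # False
```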


date_format_ bug

There is a small bug that appears when the package is called with date=True and the date inside the string is separated by / instead of -.
The problem is in this part of the code; I think the variable ending in "_" should be used throughout.
(image: screenshots of the affected code)
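
As an illustration only, one way to accept both separators is a single pattern that allows either character. The pattern, the `find_dates` helper and the day-month-year order are assumptions and do not reflect Preln's actual code.

```python
import re

# Day, month and year separated by either "/" or "-". The backreference \2
# requires the same separator in both positions.
DATE_PATTERN = re.compile(r"\b(\d{1,2})([/-])(\d{1,2})\2(\d{4})\b")


def find_dates(text):
    """Return (day, month, year) tuples for dates written with "/" or "-"."""
    return [
        (int(m.group(1)), int(m.group(3)), int(m.group(4)))
        for m in DATE_PATTERN.finditer(text)
    ]


print(find_dates("Nació el 05/11/1990 y se mudó el 01-02-2020"))
# [(5, 11, 1990), (1, 2, 2020)]
```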

No module named 'nltk'

Good morning,
with version 0.3.5 there is a dependency on the nltk library that is not resolved correctly. The problem persists even after a complete uninstall and a fresh install. I first tried pip install preln --upgrade and that did not work either.


C:\Users\jgonzal>pip uninstall preln
Found existing installation: preln 0.3.5
Uninstalling preln-0.3.5:
Would remove:
c:\users\jgonzal\appdata\local\programs\python\python39\lib\site-packages\preln-0.3.5.dist-info*
c:\users\jgonzal\appdata\local\programs\python\python39\lib\site-packages\preln*
Proceed (Y/n)? Y
Successfully uninstalled preln-0.3.5

C:\Users\jgonzal>pip install preln
Collecting preln
Using cached preln-0.3.5-py3-none-any.whl (10 kB)
Installing collected packages: preln
Successfully installed preln-0.3.5

PS C:\Users\jgonzal\Dropbox\BOB-NLP> c:; cd 'c:\Users\jgonzal\Dropbox\BOB-NLP'; & 'C:\Users\jgonzal\AppData\Local\Programs\Python\Python39\python.exe' 'c:\Users\jgonzal\.vscode\extensions\ms-python.python-2022.8.1\pythonFiles\lib\python\debugpy\launcher' '36073' '--' 'c:\Users\jgonzal\Dropbox\BOB-NLP\test_nlp_espaniol.py'
Traceback (most recent call last):
  File "c:\Users\jgonzal\Dropbox\BOB-NLP\test_nlp_espaniol.py", line 1, in <module>
    from Preln.preprocessing import Preprocessing
  File "C:\Users\jgonzal\AppData\Local\Programs\Python\Python39\lib\site-packages\Preln\preprocessing.py", line 6, in <module>
    from .core.tokenizer import tokenizer
  File "C:\Users\jgonzal\AppData\Local\Programs\Python\Python39\lib\site-packages\Preln\core\tokenizer.py", line 2, in <module>
    from nltk.tokenize import word_tokenize
ModuleNotFoundError: No module named 'nltk'
PS C:\Users\jgonzal\Dropbox\BOB-NLP>
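
A quick way to confirm which dependencies are missing from an environment is sketched below. The `missing_modules` helper is hypothetical and only handles top-level module names; the only name checked here is nltk, taken from the traceback above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["nltk"])
if missing:
    print("Missing dependencies, try: pip install " + " ".join(missing))
```

If nltk is reported as missing, installing it manually with pip should work around the problem until the package declares the dependency itself.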
