adri-hdez / preln

Preln is a package for preprocessing text in Spanish.

License: MIT License

Python 100.00%

preln's Introduction

A package for preprocessing text in Spanish

Preln is a Python package that speeds up development and optimizes the performance of applications that need careful text preprocessing for NLP (Natural Language Processing). The library takes into account the particular characteristics of text written in Spanish. It prepares data for demanding applications such as training machine-learning models, extracting content from social media, or building tools that automate language correction, lemmatization, and stemming, among many others.

📃 Latest version v0.5.1-alpha out now! 📃

💬 Contribution & Questions

Contribution & Questions Type	Platforms
🐞 Bug Reports	[GitHub Issue Tracker]
📦 Feature Requests & Ideas	[GitHub Discussions]
🛠️ Usage Questions & Discussions	[GitHub Discussions]

💼 Features

Apply and combine general basic operations to pre-process text in Spanish
Connect directly to file paths, databases… for easy reading and writing of data
Simple, optimized implementation that is ready to use with configuration files
Autocorrect function to improve data quality
Methods for privacy control, replacing or removing personal data from the dataset
Support for Spanish and English

💾 Install Preln

To start using Preln, run the following command:

pip install preln

Note: to use Preln in a Python notebook, you might have to run this command in a code cell.

If you are using an old version of Preln, check the update guide to install the package’s new changes.

The main class of the package is called Preprocessing, and it contains all of the package's principal functions. Import this class and create an object in order to use its methods:

from Preln.preprocessing import Preprocessing

preprocessor = Preprocessing(
    date=False, date_format=None, accents=False, lowercasing=True,
    privacy=True, privacy_format="multi:replace", correction=True, media=True,
    media_format="mention:delete", numbers=False, punctuation=True,
    stopwords=True, tokenizer=True, debug=False,
)

🔧 Example of use

In this basic example, you can check how to use the package in order to process a simple piece of text.

sample_text = "¡Hola @usuario!, mi nombre es Preln, me han creado Adrián y Raúl. Revisa mi documentación en https://www.preln.org"

test = preprocessor.pipeline(sample_text)

print(test) # ['MENTION', 'nombre', 'ORG', 'creado', 'PERSON', 'PERSON', 'revisa', 'documentación', 'URL']

Note: The pipeline method's parameters (which toggle the core methods) are set by default. It is worth changing them to suit each text you want to process.

You can review every option of the core methods to find the combination that best suits your dataset.

💳 License

Preln is licensed under the MIT License.

preln's People

Forkers

raul-martin-dev

preln's Issues

Implement date handling

Date formatting needs to be implemented
Date removal needs to be implemented

Manipulate stopwords

Implement methods that let the user manually add, remove, and modify stopwords.
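
A minimal sketch of what such methods could look like, assuming the stopwords are held in a plain Python set. The class `StopwordManager` and all of its method names are hypothetical and are not part of Preln's API.

```python
class StopwordManager:
    """Hypothetical helper for user-managed stopwords; not part of Preln."""

    def __init__(self, words=()):
        # Store stopwords lowercased so lookups are case-insensitive.
        self._words = {w.lower() for w in words}

    def add(self, word):
        self._words.add(word.lower())

    def remove(self, word):
        # discard() does nothing if the word is not present.
        self._words.discard(word.lower())

    def replace(self, old, new):
        """Swap one stopword for another; raise KeyError if `old` is unknown."""
        old = old.lower()
        if old not in self._words:
            raise KeyError(old)
        self._words.remove(old)
        self._words.add(new.lower())

    def filter(self, tokens):
        """Return the tokens that are not stopwords."""
        return [t for t in tokens if t.lower() not in self._words]


stopwords = StopwordManager(["de", "la", "que"])
stopwords.add("el")
stopwords.remove("que")
stopwords.replace("la", "las")
print(stopwords.filter(["El", "nombre", "de", "la", "que", "rosa"]))
# ['nombre', 'la', 'que', 'rosa']
```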

Add accent handling functionality

Good morning,
It would be very useful to add the option of normalizing accented text in Spanish. This would need a parameter that enables a feature to convert accented vowels into unaccented vowels. It is important to preserve case: an accented uppercase vowel becomes an unaccented uppercase vowel, and likewise for lowercase.
That way, a user who wants both conversions would call lowercase + withoutaccent.
Thank you very much
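
One way the requested conversion could work, sketched with only the standard library. The function name `without_accents` is hypothetical, and leaving ñ and ü untouched is an assumption, since the request only mentions accented vowels.

```python
import unicodedata

# Only the five acute-accented vowels, in both cases, are mapped.
_ACCENTED_VOWELS = str.maketrans("áéíóúÁÉÍÓÚ", "aeiouAEIOU")


def without_accents(text):
    """Replace accented vowels with plain ones, preserving case."""
    # Compose characters first so "a" + combining accent becomes "á".
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(_ACCENTED_VOWELS)


print(without_accents("Canción ÁRBOL niño"))  # Cancion ARBOL niño
```

Because case is preserved, lowercasing stays a separate step, which matches the lowercase + withoutaccent combination described in the request.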

Date limits bug

Good morning. When a date is specified, some kind of pattern is needed to check that the day and the month are within a valid range.
If a date format is given together with an invalid date, the application goes out of range and raises an error. One example is the date "31/13/2022".
There are two possible solutions: a try-catch, or checking the ranges before calling the indexing dictionaries used to build the months.
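
The try-catch option could be sketched as follows, letting the standard library's `datetime.strptime` do the range check. The function name `is_valid_date` and the default `%d/%m/%Y` format are assumptions for illustration, not Preln's code.

```python
from datetime import datetime


def is_valid_date(text, fmt="%d/%m/%Y"):
    """Return True if `text` is a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        # Raised for out-of-range days or months, and for malformed input.
        return False
    return True


print(is_valid_date("31/13/2022"))  # False
print(is_valid_date("31/12/2022"))  # True
```

Calling a check like this before indexing into the month dictionaries would also reject dates such as 30 February, which a simple 1 to 31 range check would let through.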

date_format_ bug

There is a small bug that occurs when it is called with date=True and the date inside the string is separated by / instead of -
The problem is in this part of the code; I think the variable ending in "_" should be used throughout
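
The code referred to in this report is not shown, so the following is only an illustrative sketch of accepting both separators, not a patch for Preln. The name `find_dates`, the day-month-year order, and the normalization to `-` are all assumptions.

```python
import re

# Day, separator, month, the same separator again, then a four-digit year.
_DATE = re.compile(r"\b(\d{1,2})([/-])(\d{1,2})\2(\d{4})\b")


def find_dates(text):
    """Return the dates found in `text`, normalized to dd-mm-yyyy."""
    return [
        f"{day:0>2}-{month:0>2}-{year}"
        for day, _sep, month, year in _DATE.findall(text)
    ]


print(find_dates("Cita el 5/3/2022 y el 06-03-2022"))
# ['05-03-2022', '06-03-2022']
```

Normalizing the separator once, right after matching, means later steps only have to handle a single format.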

No module named 'nltk'

Good morning,
Version 0.3.5 has a dependency on the nltk library that is not resolved correctly. Even a complete uninstall followed by a fresh install reproduces the problem. I first tried pip install preln --upgrade, and that did not work either.

`C:\Users\jgonzal>pip uninstall preln
Found existing installation: preln 0.3.5
Uninstalling preln-0.3.5:
Would remove:
c:\users\jgonzal\appdata\local\programs\python\python39\lib\site-packages\preln-0.3.5.dist-info*
c:\users\jgonzal\appdata\local\programs\python\python39\lib\site-packages\preln*
Proceed (Y/n)? Y
Successfully uninstalled preln-0.3.5

C:\Users\jgonzal>pip install preln
Collecting preln
Using cached preln-0.3.5-py3-none-any.whl (10 kB)
Installing collected packages: preln
Successfully installed preln-0.3.5`

PS C:\Users\jgonzal\Dropbox\BOB-NLP> c:; cd 'c:\Users\jgonzal\Dropbox\BOB-NLP'; & 'C:\Users\jgonzal\AppData\Local\Programs\Python\Python39\python.exe' 'c:\Users\jgonzal\.vscode\extensions\ms-python.python-2022.8.1\pythonFiles\lib\python\debugpy\launcher' '36073' '--' 'c:\Users\jgonzal\Dropbox\BOB-NLP\test_nlp_espaniol.py'
Traceback (most recent call last):
  File "c:\Users\jgonzal\Dropbox\BOB-NLP\test_nlp_espaniol.py", line 1, in <module>
    from Preln.preprocessing import Preprocessing
  File "C:\Users\jgonzal\AppData\Local\Programs\Python\Python39\lib\site-packages\Preln\preprocessing.py", line 6, in <module>
    from .core.tokenizer import tokenizer
  File "C:\Users\jgonzal\AppData\Local\Programs\Python\Python39\lib\site-packages\Preln\core\tokenizer.py", line 2, in <module>
    from nltk.tokenize import word_tokenize
ModuleNotFoundError: No module named 'nltk'
PS C:\Users\jgonzal\Dropbox\BOB-NLP>
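
The traceback shows the import of nltk failing inside Preln's tokenizer module, which suggests that nltk was not installed together with preln 0.3.5. A small check such as the sketch below can confirm which modules are missing before importing the package. The helper name `missing_modules` is hypothetical, and installing nltk separately with pip is a likely workaround rather than one confirmed in this thread.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["nltk"])
if missing:
    print("Missing dependencies:", ", ".join(missing))
```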
