Python package to explore natural languages.

License: GNU General Public License v3.0

Linguae

Python package to explore natural languages.

This is a hobby project for learning natural language processing and text mining tools by exploring natural languages.

The available features are parsing, translation, word-embedding similarity, text generation, concordance, verb conjugation, fill-mask, Wiktionary queries, Wikipedia queries, word frequency queries, ConceptNet queries, news from Google, browsing of images and audio samples, text samples, word sentiment, stemming, and a chatbot.

Installation

Create a Python environment using a tool such as conda or pyenv. Then open a terminal and run the following commands:

git clone https://github.com/ajdavidl/Linguae.git
cd Linguae
pip install -r requirements.txt

The parse function uses spaCy models. The commands above install a few of them. To install other models, either edit the shell script InstallSpacyModels.sh or run the following command in the terminal with the name of the model you need. See the spaCy Models page for more information.

python -m spacy download name_of_the_model
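Because spaCy models are installed as ordinary Python packages, a script can check whether a model is already present before running the command above. The following is a minimal sketch and not part of Linguae: the helper names `is_model_installed` and `download_command` are hypothetical, and `en_core_web_sm` is used only as an example model name.

```python
import importlib.util
import sys


def is_model_installed(model_name):
    """Return True if a package named `model_name` can be found for import."""
    return importlib.util.find_spec(model_name) is not None


def download_command(model_name):
    """Build the download command shown above for the current interpreter."""
    return [sys.executable, "-m", "spacy", "download", model_name]


model = "en_core_web_sm"  # example name; replace with the model you need
if is_model_installed(model):
    print(f"{model} is already installed")
else:
    # Only print the command here; run it yourself (or via subprocess.run).
    print("Run:", " ".join(download_command(model)))
```

Using `sys.executable` keeps the download tied to the interpreter of the active environment, rather than to whichever `python` happens to be first on the PATH.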

If you want to experiment with word embeddings, you need the MUSE word vectors; the download links are in the MUSE repository. Download the languages you want and put the files in the Linguae/linguae/data/museWordVectors directory, or edit the shell script DownloadMUSEWordEmbeddings.sh to download the data for you. To use the word embeddings from the ConceptNet project (ConceptNet Numberbatch), run the shell script DownloadConceptnetNumberbatchVectors.sh, which downloads the small version of the data and converts it into a format that the gensim keyed vectors model can load.
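As a rough illustration of what an embedding similarity query involves, here is a minimal standard-library sketch. It assumes the vector files use the word2vec text layout (a header line with vocabulary size and dimension, then one word per line followed by its vector components). It is not Linguae's implementation, and the names `load_vectors` and `cosine_similarity` and the sample numbers are made up for illustration.

```python
import io
import math


def load_vectors(handle, limit=50000):
    """Read up to `limit` vectors from a word2vec-style text file."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        if len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the assumed layout
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Tiny in-memory stand-in for a real vector file (made-up numbers).
sample = io.StringIO("3 2\ncat 1.0 0.0\ndog 0.8 0.6\ncar 0.0 1.0\n")
vectors = load_vectors(sample)
print(cosine_similarity(vectors["cat"], vectors["dog"]))
print(cosine_similarity(vectors["cat"], vectors["car"]))
```

For a real file, pass an open handle instead, for example `open(path, encoding="utf-8")`. The `limit` parameter is there because full vector files are large and the most frequent words usually come first.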

The concordance and text sample functions need the Tatoeba sentences. Download them from Tatoeba (click the sentences.tar.bz2 link), extract the CSV file (sentences.csv), and save it in the Linguae/linguae/data/tatoebaFiles directory. Alternatively, use the shell script DownloadTatoebaSentences.sh to download the sentences.
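A concordance lists the sentences in which a given word occurs. The sketch below shows the idea on Tatoeba-style data, assuming that sentences.csv is tab-separated with three columns (sentence id, language code, text) and that language codes look like `eng` or `por`. It is an illustration only, not Linguae's own concordance function; the name `concordance` and its parameters are hypothetical.

```python
import csv
import io
import re


def concordance(handle, word, lang, limit=10):
    """Return up to `limit` sentences in `lang` that contain `word`."""
    # QUOTE_NONE: sentences may contain quote characters that are not CSV quoting.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    needle = word.lower()
    hits = []
    for row in reader:
        if len(row) != 3:
            continue  # skip rows that do not match the assumed layout
        _sentence_id, row_lang, text = row
        # Compare whole words so that "cat" does not match "category".
        if row_lang == lang and needle in re.findall(r"\w+", text.lower()):
            hits.append(text)
            if len(hits) >= limit:
                break
    return hits


# Small in-memory stand-in for sentences.csv (made-up rows).
sample = io.StringIO(
    "1\teng\tThe cat sleeps.\n"
    "2\tpor\tO gato dorme.\n"
    "3\teng\tI like dogs.\n"
    "4\teng\tMy cat likes fish.\n"
)
for sentence in concordance(sample, "cat", "eng"):
    print(sentence)
```

With the real file, open it with `open(path, encoding="utf-8", newline="")` and pass the handle to the function. The file is read row by row, so it does not need to fit in memory.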

After the steps above, you can already use the linguae package from inside the root folder. You can also install the package in your Python environment with the command:

pip install -e .

Installing in a docker container

You can also install this package in a Docker container. First edit the script DownloadMUSEWordEmbeddings.sh to download the languages you want, then run the following commands in a terminal:

docker build -t linguae --rm .
docker run --rm -ti --name linguae linguae

Keep in mind that the Docker image can take up a lot of disk space because of the word embedding data and the Tatoeba sentences.

Usage

Open Python in the Linguae directory:

import linguae

# translation example
text_en = 'This is an example sentence.'
text_pt = linguae.translate(from_language='en', to_language='pt', text=text_en)
print(text_pt)

# parsing
nlp_en = linguae.loadSpacyModel('en')
pos_en = linguae.parseSpacy(nlp_en, text_en)
print(pos_en)
nlp_pt = linguae.loadSpacyModel('pt')
pos_pt = linguae.parseSpacy(nlp_pt, text_pt)
print(pos_pt)

# get real text examples from news
print(linguae.googleNews('en', 10))  # news in English
print(linguae.googleNews('pt', 10))  # news in Portuguese
print(linguae.googleNews('es', 10))  # news in Spanish

See the examples.py and Use_case.md files for more examples.

Contributing

Pull requests are welcome.

License

GNU General Public License v3.0
