aajanki / spacy-fi

Experimental Finnish language model for SpaCy

License: MIT License

spacy-fi's Introduction

Experimental Finnish language model for spaCy

Finnish language model for spaCy. The model provides POS tagging, dependency parsing, word vectors, noun phrase extraction, word occurrence probability estimates, morphological features, lemmatization and named entity recognition (NER). The lemmatization is based on Voikko.

The main differences between this model and the Finnish language model in the spaCy core:

  • This model includes a different lemmatizer implementation compared to spaCy core. My model's lemmatization accuracy is considerably better but the execution speed is slightly lower.
  • This model requires libvoikko. The spaCy core model does not need any external dependencies.
  • The training data for this model is partly different, and there are other minor tweaks in the pipeline implementation.

Want a hassle-free installation? Install the spaCy core model. Need the highest possible accuracy, especially for lemmatization? Install this model.

I'm planning to continue experimenting with new ideas in this repository and to push the useful features to the spaCy core after testing them here.

Install the Finnish language model

First, install the libvoikko native library and the Finnish morphology data files.

Next, install the model by running:

pip install spacy_fi_experimental_web_md

Compatibility with spaCy versions:

spacy-fi version    Compatible with spaCy versions
0.14.0              3.7.x
0.13.0              3.6.x
0.12.0              3.5.x
0.11.0              3.4.x
0.10.0              3.3.x
0.9.0               >= 3.2.1 and < 3.3.0
0.8.x               3.2.x
0.7.x               3.0.x, 3.1.x
0.6.0               3.0.x
0.5.0               3.0.x
0.4.x               2.3.x
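
As a rough illustration of how to read the compatibility table, the sketch below maps an installed spaCy version to the newest spacy-fi release listed for it. The function name `newest_spacy_fi_for` and the dictionary are hypothetical helpers written for this example, not part of spacy-fi, and the code assumes plain numeric version strings such as "3.7.2".

```python
# Newest spacy-fi release listed for each spaCy minor series,
# transcribed from the compatibility table. Hypothetical helper data,
# not something shipped with spacy-fi.
NEWEST_COMPATIBLE = {
    (3, 7): "0.14.0",
    (3, 6): "0.13.0",
    (3, 5): "0.12.0",
    (3, 4): "0.11.0",
    (3, 3): "0.10.0",
    (3, 2): "0.9.0",  # needs spaCy >= 3.2.1, handled separately below
    (3, 1): "0.7.x",
    (3, 0): "0.7.x",
    (2, 3): "0.4.x",
}


def newest_spacy_fi_for(spacy_version):
    """Return the newest listed spacy-fi version, or None if unlisted."""
    parts = spacy_version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    patch = int(parts[2]) if len(parts) > 2 else 0
    # spaCy 3.2.0 predates the ">= 3.2.1" requirement of spacy-fi 0.9.0,
    # so it falls back to the 0.8.x series.
    if (major, minor, patch) == (3, 2, 0):
        return "0.8.x"
    return NEWEST_COMPATIBLE.get((major, minor))


print(newest_spacy_fi_for("3.7.2"))  # 0.14.0
print(newest_spacy_fi_for("3.2.0"))  # 0.8.x
print(newest_spacy_fi_for("3.2.4"))  # 0.9.0
```

The result can then be pinned when installing, for example `pip install spacy_fi_experimental_web_md==0.14.0`.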

Usage

import spacy

nlp = spacy.load('spacy_fi_experimental_web_md')

doc = nlp('Hän ajoi punaisella autolla.')
for t in doc:
    print(f'{t.lemma_}\t{t.pos_}')

The dependency, part-of-speech and named entity labels are documented on a separate page.

Updating the model

Setting up a development environment

# Install the libvoikko native library with Finnish morphology data.
#
# This will install Voikko on Debian/Ubuntu.
# For other distros and operating systems, see https://voikko.puimula.org/python.html
sudo apt install libvoikko1 voikko-fi

python3 -m venv .venv
source .venv/bin/activate
pip install wheel
pip install -r requirements.txt

Training the model

spacy project assets
spacy project run train-pipeline

Optional steps (slow!) for training certain model components. These steps are not strictly required because their results have been pre-computed and stored in git.

Train floret embeddings:

spacy project run floret-vectors

Pretrain tok2vec weights:

spacy project run pretrain

Testing

Unit tests:

python -m pytest tests/unit

Functional tests for a trained model:

python -m pytest tests/functional

Importing the trained model directly from the file system without packaging it as a module:

import spacy
import fi

nlp = spacy.load('training/merged')

doc = nlp('Hän ajoi punaisella autolla.')
for t in doc:
    print(f'{t.lemma_}\t{t.pos_}')

Packaging and publishing

See packaging.md.

License

MIT license

License for the training data

The data sets downloaded by the tools/download_data.sh script are licensed as follows:

spacy-fi's People

Contributors

aajanki, dependabot[bot]

spacy-fi's Issues

StopIteration exception

Hi,

Spotted this weird bug that might be related to both the Voikko implementation and the way possessive suffixes are handled in spacy-fi.

~/miniconda3/envs/.env/lib/python3.7/site-packages/spacy_fi_experimental_web_md/fi.py in _remove_possessive_suffix(self, word, analysis)
    946         """
    947         suffixes = self.possessive_suffixes[analysis["POSSESSIVE"]]
--> 948         suffix = next((s for s in suffixes if word.endswith(s)))
    949         if not suffix:
    950             return word

To reproduce:

import spacy

nlp = spacy.load('spacy_fi_experimental_web_md')

doc = nlp('Varmasti välit pystyy pitämään ja voisihan maskiakin käyttää.')
for t in doc:
    print(f'{t.lemma_}\t{t.pos_}')

The word that causes the issue here is voisihan.

My environment:

  • Python 3.7.11
  • spacy_fi_experimental_web_md 0.7.1
  • Libvoikko 4.3.1

This problem could be fixed by replacing the next() call with something like this:

def _remove_possessive_suffix(self, word, analysis):
    """Removes possessive suffix from the word.
    Example: "kanssamme" -> "kanssa"
    """
    suffixes = self.possessive_suffixes[analysis["POSSESSIVE"]]
    matches = [s for s in suffixes if word.endswith(s)]
    if not matches:
        return word
    suffix = matches[0]

Interestingly, the issue appears only in specific sentences, so I'm not sure whether this would fix the underlying cause, but it would at least make the method stable.
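
For context, the traceback comes from calling next() on a generator that yields nothing: without a default value next() raises StopIteration, so the `if not suffix` check on the following line is never reached. An alternative to building a list is to pass a default to next(). The snippet below is a minimal standalone sketch of that idea; the function signature, the suffix lists and the slicing step are assumptions made for illustration and are not the actual spacy-fi code.

```python
def remove_possessive_suffix(word, suffixes):
    """Strip the first matching suffix from word, if any.

    Standalone illustration: `suffixes` is passed in directly here,
    whereas the real method looks it up from the Voikko analysis.
    """
    # The second argument to next() is returned when the generator is
    # exhausted, so no StopIteration is raised on a missing match.
    suffix = next((s for s in suffixes if word.endswith(s)), None)
    if not suffix:
        return word
    # Assumed removal step: drop the matched suffix from the end.
    return word[: -len(suffix)]


# Illustrative suffix lists, not the library's possessive_suffixes table.
print(remove_possessive_suffix("kanssamme", ["mme"]))        # kanssa
print(remove_possessive_suffix("voisihan", ["nsa", "nsä"]))  # voisihan
```

Both variants return the word unchanged when no suffix matches; the next() form simply avoids building the intermediate list.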
