huspacy / huspacy

HuSpaCy: industrial-strength Hungarian natural language processing

Home Page: https://huspacy.github.io

License: Apache License 2.0

Python 98.90% Shell 0.84% Awk 0.25%
hungarian hunlp nlp natural-language-processing spacy spacy-models pos-tagger dependency-parsing universal-dependencies information-extraction

huspacy's Introduction

HuSpaCy is a spaCy library providing industrial-strength Hungarian language processing facilities through spaCy models. The released pipelines consist of a tokenizer, sentence splitter, lemmatizer, tagger (which also predicts morphological features), dependency parser and a named entity recognition module. Word and phrase embeddings are also available through spaCy's API. All models have high throughput, decent memory usage and close to state-of-the-art accuracy. A live demo is available, and model releases are published on Hugging Face Hub.

This repository contains material to build HuSpaCy and all of its models in a reproducible way.

Installation

To get started using the tool, first, we need to download one of the models. The easiest way to achieve this is to install huspacy (from PyPI) and then fetch a model through its API.

pip install huspacy
import huspacy

# Download the latest CPU optimized model
huspacy.download()

Install the models directly

You can install the latest models directly from 🤗 Hugging Face Hub:

  • CPU optimized large model: pip install https://huggingface.co/huspacy/hu_core_news_lg/resolve/main/hu_core_news_lg-any-py3-none-any.whl
  • GPU optimized transformers model: pip install https://huggingface.co/huspacy/hu_core_news_trf/resolve/main/hu_core_news_trf-any-py3-none-any.whl

To speed up inference on GPUs, CUDA should be installed as described in https://spacy.io/usage.

Read more on the models here

Quickstart

HuSpaCy is fully compatible with spaCy's API; newcomers can easily get started with the spaCy 101 guide.

Although HuSpaCy models can be loaded with spacy.load(...), the tool provides convenience methods to easily access downloaded models.

# Load the model using spacy.load(...)
import spacy
nlp = spacy.load("hu_core_news_lg")
# Load the default large model (if downloaded)
import huspacy
nlp = huspacy.load()
# Load the model directly as a module
import hu_core_news_lg
nlp = hu_core_news_lg.load()

To process texts, you can simply call the loaded model (i.e. the nlp callable object)

doc = nlp("Csiribiri csiribiri zabszalma - négy csillag közt alszom ma.")

As HuSpaCy is built on spaCy, the returned doc document contains all the annotations given by the pipeline components.
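
As an illustration of reading those annotations, here is a minimal sketch. The helper name token_table is hypothetical; text, lemma_, pos_ and dep_ are standard spaCy Token attributes. The stand-in token below is hand-written so the snippet runs without a downloaded model; it is not model output.

```python
from types import SimpleNamespace


def token_table(doc):
    """Collect (text, lemma, POS tag, dependency label) for every token."""
    return [(t.text, t.lemma_, t.pos_, t.dep_) for t in doc]


# Stand-in token with hand-written values (an assumption for illustration).
# With a model installed, pass the result of nlp("...") instead.
fake_doc = [SimpleNamespace(text="alszom", lemma_="alszik", pos_="VERB", dep_="ROOT")]
print(token_table(fake_doc))
```

With a loaded pipeline, token_table(doc) can be called on the doc returned by nlp(...) in the same way.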

API Documentation is available on our website.

Models overview

We provide several pretrained models:

  1. hu_core_news_lg is a CNN-based large model which achieves a good balance between accuracy and processing speed. This default model provides tokenization, sentence splitting, part-of-speech tagging (UD labels w/ detailed morphosyntactic features), lemmatization, dependency parsing and named entity recognition and ships with pretrained word vectors.
  2. hu_core_news_trf is built on huBERT and provides the same functionality as the large model except the word vectors. It comes with much higher accuracy at the price of increased computational resource usage. We suggest using it with GPU support.
  3. hu_core_news_md greatly improves on hu_core_news_lg's throughput by losing some accuracy. This model could be a good choice when processing speed is crucial.
  4. hu_core_news_trf_xl is an experimental model built on XLM-RoBERTa-large. It provides the same functionality as the hu_core_news_trf model; however, it comes with slightly higher accuracy at the price of significantly increased computational resource usage. We suggest using it with GPU support.

HuSpaCy's model versions follow spaCy's versioning scheme.

A demo of the models is available at Hugging Face Spaces.

To read more about the models' architecture, we suggest reading the relevant sections of spaCy's documentation.

Comparison

| Models | md | lg | trf | trf_xl |
|---|---|---|---|---|
| Embeddings | 100d floret | 300d floret | transformer: huBERT | transformer: XLM-RoBERTa-large |
| Target hardware | CPU | CPU | GPU | GPU |
| Accuracy | ⭑⭑⭑⭒ | ⭑⭑⭑⭑ | ⭑⭑⭑⭑⭒ | ⭑⭑⭑⭑⭑ |
| Resource usage | ⭑⭑⭑⭑⭑ | ⭑⭑⭑⭑ | ⭑⭑ | |
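
The recommendations above could be condensed into a small helper. This is only a sketch: suggest_model and its two flags are hypothetical names, not part of the huspacy API, and the experimental hu_core_news_trf_xl model is deliberately left out.

```python
def suggest_model(has_gpu: bool, speed_critical: bool = False) -> str:
    """Map the hardware and speed guidance from the overview to a model name."""
    if has_gpu:
        # The transformer model is the one suggested for GPU use.
        return "hu_core_news_trf"
    # On CPU: md trades some accuracy for throughput, lg is the default.
    return "hu_core_news_md" if speed_critical else "hu_core_news_lg"


print(suggest_model(has_gpu=False))
print(suggest_model(has_gpu=False, speed_critical=True))
print(suggest_model(has_gpu=True))
```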

Citation

If you use HuSpaCy or any of its models, please cite it as:

@InProceedings{HuSpaCy:2023,
    author= {"Orosz, Gy{\"o}rgy and Szab{\'o}, Gerg{\H{o}} and Berkecz, P{\'e}ter and Sz{\'a}nt{\'o}, Zsolt and Farkas, Rich{\'a}rd"},
    editor= {"Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav"},
    title = {{"Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines"}},
    booktitle = {{"Text, Speech, and Dialogue"}},
    year = "2023",
    publisher = {{"Springer Nature Switzerland"}},
    address = {{"Cham"}},
    pages = "58--69",
    isbn = "978-3-031-40498-6"
}

@InProceedings{HuSpaCy:2021,
  title = {{HuSpaCy: an industrial-strength Hungarian natural language processing toolkit}},
  booktitle = {{XVIII. Magyar Sz{\'a}m{\'\i}t{\'o}g{\'e}pes Nyelv{\'e}szeti Konferencia}},
  author = {Orosz, Gy{\"o}rgy and Sz{\' a}nt{\' o}, Zsolt and Berkecz, P{\' e}ter and Szab{\' o}, Gerg{\H o} and Farkas, Rich{\' a}rd},
  location = {{Szeged}},
  pages = "59--73",
  year = {2022},
}

Contact

For feature requests, issues and bugs please use the GitHub Issue Tracker. Otherwise, reach out to us in the Discussion Forum.

Authors

HuSpaCy is implemented by the SzegedAI team, coordinated by Orosz György, in the Hungarian AI National Laboratory, MILAB program.

License

This library is released under the Apache 2.0 License

Trained models have their own license (CC BY-SA 4.0) as described on the models page.

huspacy's People

Contributors

dependabot[bot], oroszgy, qeterme, rfarkas, szabogergo01, zsozso21

huspacy's Issues

Tokenization bug with !.

Describe the bug
When tokenizing text, for example:
[token for token in nlp("A kutya evett egy csontot!.")]
The expression !. is considered a single token, and is also combined with the preceding word's token.
Problem also occurs with multiple exclamation marks, for example: !!. !!!!!!.
...but not with multiple periods, for example: !.. !!.. !!... <--- these work properly
It also does not occur if it's not directly preceded by a word (for example: there's a space between them, like this: csontot !.)
If there's a chain of this, for example: !.!.!.!.! <- then the entire chain is one token... for example: kutya!.!.!.!. is tokenized simply as
kutya!.!.!.!.

Expected behavior
The exclamation mark and the periods should be separate tokens, like this: kutya!. <--- kutya ! .
Note that question marks for example do behave like this, this bug only happens with exclamation marks (as far as I noticed)
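
Since the report says the problem disappears when a space precedes the punctuation, one possible stopgap is to insert that space before tokenizing. The sketch below uses only the standard library; the function name is hypothetical, and whether the tokenizer then handles long chains such as !.!.!. correctly has not been verified here.

```python
import re

# Zero-width position between a word character and a run of "!" followed by ".".
_BANG_PERIOD = re.compile(r"(?<=\w)(?=!+\.)")


def detach_bang_period(text: str) -> str:
    """Insert a space before '!.' style punctuation glued to a word."""
    return _BANG_PERIOD.sub(" ", text)


print(detach_bang_period("A kutya evett egy csontot!."))
# A kutya evett egy csontot !.
```

The pre-processed string can then be passed to nlp(...) as usual. Note that this changes character offsets relative to the original text.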

Cannot install the TRF model from the Hugging Face Hub if the CPU model is installed

Describe the bug
If I try to install the TRF model from the HFH, I get the following error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
hu-core-news-lg 3.5.0 requires spacy<3.6.0,>=3.5.0, but you have spacy 3.4.4 which is incompatible.

Ideally, all models have the same dependencies so that we can install both CPU and GPU variants in the same environment.
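
To see why pip complains, the requirement string from the error can be checked against the installed version. The following is a simplified sketch with hypothetical helper names: it only handles plain numeric versions and the five operators listed, and is not a full implementation of Python's version specifier rules.

```python
import operator

# Two-character operators come first so ">=" is not mistaken for ">".
_OPS = [(">=", operator.ge), ("<=", operator.le), ("==", operator.eq),
        (">", operator.gt), ("<", operator.lt)]


def parse_version(text):
    """Turn '3.5.0' into (3, 5, 0) so versions compare as tuples."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version, spec):
    """Return True if `version` meets every comma-separated clause in `spec`."""
    installed = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in _OPS:
            if clause.startswith(symbol):
                if not compare(installed, parse_version(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


print(satisfies("3.4.4", "<3.6.0,>=3.5.0"))  # False
print(satisfies("3.5.0", "<3.6.0,>=3.5.0"))  # True
```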

hu_core_news_trf_xl model Memoryview too large error

Describe the bug
After installing the hu_core_news_trf_xl model with the pip install https://huggingface.co/huspacy/hu_core_news_trf_xl/resolve/main/hu_core_news_trf_xl-any-py3-none command (found on huggingface page), an error is raised after the download finishes: ValueError: Memoryview is too large
It appears the problem occurs because of the following msgpack script: msgpack/fallback.py
In that script, there is a hardcoded 2**32 size limit for Memoryview objects (amongst others), that cannot be changed. The Memoryview object the package attempts to use is too large.
I tried with pip version 23.0.1 and pip version 23.1.2, both resulted in this same error.

To Reproduce
Steps to reproduce the behavior:

  1. Use the pip install https://huggingface.co/huspacy/hu_core_news_trf_xl/resolve/main/hu_core_news_trf_xl-any-py3-none command

Expected behavior
The model downloads and installs properly.

Additional context
A possibly similar problem has occurred with another, unrelated package in the past:
dask/dask#7552
The problem in this case is likely the same. The solution with Dask only came with a package update (for Dask).
As a side note, I mentioned in a previous issue that the package URL is broken on the HuSpaCy GitHub page. The issue was closed, but the URL is still broken.

Error during download UD_Hungarian-Szeged

Error during make install. Have the permissions of this dependency changed?

`mkdir -p ./data/raw/UD_Hungarian-Szeged
git clone git@github.com:UniversalDependencies/UD_Hungarian-Szeged.git ./data/raw/UD_Hungarian-Szeged

Cloning into './data/raw/UD_Hungarian-Szeged'...
Host key verification failed.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
make: *** [data/raw/UD_Hungarian-Szeged] Error 128`
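
"Host key verification failed" comes from SSH rather than from repository permissions, so cloning the same public repository over HTTPS is a possible workaround. A minimal sketch of rewriting an SSH-style remote into its HTTPS form (the helper name is hypothetical, and it only covers the git@host:path shape):

```python
import re


def ssh_to_https(url: str) -> str:
    """Rewrite 'git@host:owner/repo.git' as 'https://host/owner/repo.git'."""
    match = re.fullmatch(r"git@([^:]+):(.+)", url)
    if match is None:
        return url  # not an SSH-style remote; leave it untouched
    host, path = match.groups()
    return f"https://{host}/{path}"


print(ssh_to_https("git@github.com:UniversalDependencies/UD_Hungarian-Szeged.git"))
# https://github.com/UniversalDependencies/UD_Hungarian-Szeged.git
```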

token.children is broken in hu_core_news_trf

Describe the bug
token.children always returns an empty generator in the hu_core_news_trf model.

To Reproduce
The following code illustrates it (in a Google Colab environment):

import spacy
from spacy import displacy

nlp = spacy.load("hu_core_news_trf")
doc = nlp('Peti evett egy almát.')
displacy.render(doc, style="dep", jupyter=True)

for token in doc:
    print(token.text, token.head, list(token.children))

Based on displacy's output, the model parses the sentence correctly. This is confirmed by the printed token.head values, which are correct (digging into displacy's code, it turns out that it also uses token.head).
Yet reading the elements of token.children yields an empty list.

Peti evett []
evett evett []
egy almát []
almát evett []
. evett []

Expected behavior
token.children should return the children of the given token.

Additional context
Running the code above with hu_core_news_lg gives the correct output.

Peti evett []
evett evett [Peti, almát, .]
egy almát []
almát evett [egy]
. evett []

I originally noticed the problem while using the DependencyMatcher, and managed to trace the source of the error back to this point.
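
Because the head links are reported as correct, the children can be rebuilt from them as a stopgap. The sketch below works on plain lists so it runs without a model; the function name is hypothetical, and the head indices are transcribed from the printed output above (in spaCy they would come from token.head.i, where the root is its own head).

```python
from collections import defaultdict


def children_from_heads(heads):
    """Map each head index to the indices of its children, in token order."""
    children = defaultdict(list)
    for index, head in enumerate(heads):
        if index != head:  # the root points at itself and is nobody's child
            children[head].append(index)
    return dict(children)


tokens = ["Peti", "evett", "egy", "almát", "."]
heads = [1, 1, 3, 1, 1]  # Peti->evett, evett->evett, egy->almát, almát->evett, .->evett

for head, kids in children_from_heads(heads).items():
    print(tokens[head], [tokens[k] for k in kids])
# evett ['Peti', 'almát', '.']
# almát ['egy']
```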

Spacy lemmatizer does not work with numbers as expected

Dear György, I noticed that spaCy does not always lemmatize numbers (written out in letters) correctly. Here is an example:

import spacy
import hu_core_ud_lg
import pandas as pd

nlp = hu_core_ud_lg.load()  # takes 2-3 minutes

a = "nyolcvanöt"
b = "nyolcvanhat"
c = "nyolcvanhét" 
d = [a, b, c] 
  
df = pd.DataFrame(d, columns = ['datum']) 

output_lemma = []

for i in df.datum:
    mondat = ""
    doc = nlp(i)
    newtext = [(tok.lemma_, tok.is_title) for tok in doc]
    mondat = ' '.join([tok[0].title() if tok[1] == 1 else tok[0] for tok in newtext])
    output_lemma.append(mondat)

output_lemma 
['nyolcvan', 'nyolcvanh', 'nyolcvanhét']

I'm new to GitHub, but I would be very happy to help with the development of the package. Could you please tell me whether this would be a project of realistic difficulty for a beginner, or whether I should look for a simpler task first?
Thank you very much in advance for your reply!

Incompatibility

Unfortunately, the code (huspacy) and the large model (hu-core-news-lg) require different spaCy versions, and the two ranges do not overlap. huspacy requires an older spaCy than the model does. I have not found a way to resolve the incompatibility, and I would like to ask for help with this.

Thanks!
Attila

Error: Can't locate model data

python -m spacy link hu_tagger_web_md hu

Get error:

    Can't locate model data
    The data should be located in hu_tagger_web_md

and in rasa_nlu.train

…
IOError: Can't find model 'hu'

Compatibility with spaCy version 2.3.2

Hi there,

After installing as instructed:
pip install https://github.com/oroszgy/spacy-hungarian-models/releases/download/hu_core_ud_lg-0.3.1/hu_core_ud_lg-0.3.1-py3-none-any.whl

and loading the model the following warning is received:

 ....\lib\site-packages\spacy\util.py:275: UserWarning: [W031] Model 'hu_core_ud_lg' (0.3.1) requires spaCy v2.1 and is incompatible with the current spaCy version (2.3.2). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)

Running: python -m spacy validate outputs:

PS D:\-ji\python>  python -m spacy validate
✔ Loaded compatibility table

====================== Installed models (spaCy v2.3.2) ======================
ℹ spaCy installation:
...\lib\site-packages\spacy

No models found in your current environment.

Can I ignore the above warning?

The time needed to call the load() method (nlp = hu_core_ud_lg.load()) varies from moment to moment from 6 seconds to 26 seconds, typically between 18s and 24s.

Is this speed normal or due to a compatibility issue?

Thanks!

Mismatch between module version and models downloaded (Forced upgrade spacy 3.5 > 3.6)

Describe the bug
The download facility downloads the wrong version of the models.
The installation is spacy 3.5.4, huspacy 0.9 (with hu_core_news_lg-v3.5.2)
As of today, installing the model uninstalls spaCy and replaces it with 3.6, which isn't compatible with the other models we have and use.

To Reproduce
Steps to reproduce the behavior:
pipenv install spacy = "==3.5.4"
pipenv install huspacy = "*"

Then we run this code:
pipenv run python -c "import huspacy; huspacy.download()"

Expected behavior
Install corresponding version of the models.
Provide a way to set the version in the download function, even though this should really match the necessary versions.
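
The pip log in this report shows each model requiring a single spaCy minor series (for example spacy<3.7.0,>=3.6.0 for hu-core-news-lg 3.6.0). Assuming that pattern holds in general, the range a matching model would need can be derived from the pinned spaCy version. This is only a sketch with a hypothetical helper name; it does not call or describe any huspacy API.

```python
def model_spacy_range(spacy_version: str) -> str:
    """Return the one-minor-series range that matches a pinned spaCy version."""
    major, minor = (int(part) for part in spacy_version.split(".")[:2])
    return f">={major}.{minor}.0,<{major}.{minor + 1}.0"


print(model_spacy_range("3.5.4"))  # >=3.5.0,<3.6.0
```

A model release whose own requirement equals this range would leave the pinned spaCy installation untouched.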

Additional context
We have encountered this since today, not sure how this happens as the 0.9 version doesn't seem to have 3.6.
Here is some output from the pipenv installation:

Installing collected packages: pl-core-news-lg
Successfully installed pl-core-news-lg-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('pl_core_news_lg')
==> Loading .env environment variables...
Collecting hu-core-news-lg==any
Downloading https://huggingface.co/huspacy/hu_core_news_lg/resolve/main/hu_core_news_lg-any-py3-none-any.whl (401.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 401.4/401.4 MB 3.7 MB/s eta 0:00:00
Collecting spacy<3.7.0,>=3.6.0 (from hu-core-news-lg==any)
Obtaining dependency information for spacy<3.7.0,>=3.6.0 from https://files.pythonhosted.org/packages/9a/7d/36cab023e0dd65bc2144137f3377481e07f93ead8abfa0be28702f1430e4/spacy-3.6.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata

Downloading spacy-3.6.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.7/6.7 MB 80.4 MB/s eta 0:00:00
Installing collected packages: spacy, hu-core-news-lg
Attempting uninstall: spacy
Found existing installation: spacy 3.5.4
Uninstalling spacy-3.5.4:
Successfully uninstalled spacy-3.5.4

==> ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
Successfully installed hu-core-news-lg-3.6.0 spacy-3.6.1
==> da-core-news-lg 3.5.0 requires spacy<3.6.0,>=3.5.0, but you have spacy 3.6.1 which is incompatible.

Processing long texts

Describe the bug
When using the hu_core_news_trf model, if the input is longer than 512 tokens, the analysis stops with an error message.

To Reproduce

The following Colab notebook illustrates the error:
https://colab.research.google.com/drive/1Z6BLO2RYssRQmvdPo66YHSReXFtDPmR0#scrollTo=o72kt1cYbVxD

When the problematic text is passed directly to huBERT for tokenization, the tokens beyond 512 appear to be simply truncated, and the problem does not occur.

Possibly related: explosion/spaCy#7891

Expected behavior
The model should ignore the part beyond 512 tokens.
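
Until the model handles this itself, the input can be shortened or split before it reaches the pipeline. A minimal sketch follows, with hypothetical function names. It counts items in a plain list as a stand-in; the transformer's limit applies to its own subword tokens, so a real workaround would need a safety margin below 512.

```python
def truncate_tokens(tokens, max_len=512):
    """Keep only the first `max_len` tokens, dropping the rest."""
    return tokens[:max_len]


def chunk_tokens(tokens, max_len=512):
    """Split the tokens into consecutive windows of at most `max_len` items."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


words = [f"w{i}" for i in range(1100)]
print(len(truncate_tokens(words)))              # 512
print([len(c) for c in chunk_tokens(words)])    # [512, 512, 76]
```

Truncation matches the expected behavior described above; chunking is an alternative when the tail of the text should still be analyzed.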

Additional context

Combining HuSpaCy models with pipe objects having n_process > 1 results in error.

Describe the bug
I am trying to use spaCy's nlp.pipe with multiprocessing to turn documents into tokenized, processed versions. I get an error whenever the n_process argument of pipe is greater than 1 and one of the HuSpaCy models is used.
As a test, I tried the same Hungarian models with n_process=1, and a default English spaCy model with higher n_process values; both succeed.

Traceback:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 1
----> 1 texts = docs_to_texts(t_docs)

Cell In[15], line 8, in docs_to_texts(docs)
      4 nlp = spacy.load("hu_core_news_lg", exclude=['tok2vec', 'senter', 'tagger', 'morphologizer', 'parser'])
      6 texts = []
----> 8 for doc in nlp.pipe(docs, n_process=4):
      9     with doc.retokenize() as retokenizer:
     10         for ent in doc.ents:

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/spacy/language.py:1618, in Language.pipe(self, texts, as_tuples, batch_size, disable, component_cfg, n_process)
   1616     for pipe in pipes:
   1617         docs = pipe(docs)
-> 1618 for doc in docs:
   1619     yield doc

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/spacy/language.py:1701, in Language._multiprocessing_pipe(self, texts, pipes, n_process, batch_size)
   1699 elif byte_error is not None:
   1700     error = srsly.msgpack_loads(byte_error)
-> 1701     self.default_error_handler(
   1702         None, None, None, ValueError(Errors.E871.format(error=error))
   1703     )
   1704 if i % batch_size == 0:
   1705     # tell `sender` that one batch was consumed.
   1706     sender.step()

File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/spacy/util.py:1704, in raise_error(proc_name, proc, docs, e)
   1703 def raise_error(proc_name, proc, docs, e):
-> 1704     raise e

ValueError: [E871] Error encountered in nlp.pipe with multiprocessing:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/spacy/language.py", line 2332, in _apply_pipes
    byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/spacy/language.py", line 2332, in <listcomp>
    byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/spacy/util.py", line 1685, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/transition_parser.pyx", line 245, in pipe
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/spacy/util.py", line 1632, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/spacy/util.py", line 1685, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/trainable_pipe.pyx", line 79, in pipe
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/spacy/util.py", line 1704, in raise_error
    raise e
  File "spacy/pipeline/trainable_pipe.pyx", line 75, in spacy.pipeline.trainable_pipe.TrainablePipe.pipe
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/hu_core_news_lg/edit_tree_lemmatizer.py", line 192, in predict
    scores = self.model.predict(docs)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/thinc/model.py", line 334, in predict
    return self._func(self, X, is_train=False)[0]
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/thinc/model.py", line 310, in __call__
    return self._func(self, X, is_train=is_train)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/thinc/model.py", line 310, in __call__
    return self._func(self, X, is_train=is_train)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/thinc/model.py", line 310, in __call__
    return self._func(self, X, is_train=is_train)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/thinc/layers/concatenate.py", line 57, in forward
    Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/thinc/layers/concatenate.py", line 57, in <listcomp>
    Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/thinc/model.py", line 310, in __call__
    return self._func(self, X, is_train=is_train)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/spacy/ml/staticvectors.py", line 56, in forward
    V = vocab.vectors.get_batch(keys)
  File "spacy/vectors.pyx", line 485, in spacy.vectors.Vectors.get_batch
  File "spacy/strings.pyx", line 176, in spacy.strings.StringStore.as_string
  File "spacy/strings.pyx", line 160, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '2911968107040651156'. This usually refers to an issue with the `Vocab` or `StringStore`."

To Reproduce

  • I am using Jupyter Notebook for the project
  • Python 3.10
  • spaCy 3.7.2
  • huspacy 0.11.0
  • hu_core_news_lg 3.7.0 / hu_core_news_md 3.7.0
    This is the involved part of the code:
def docs_to_texts(docs:list[str]) -> list[list[str]]:
    docs = [remove_affixes(doc, affixes) for doc in docs]
    
    nlp = spacy.load("hu_core_news_lg", exclude=['tok2vec', 'senter', 'tagger', 'morphologizer', 'parser'])
    
    texts = []
    
    for doc in nlp.pipe(docs, n_process=4):
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(doc[ent.start:ent.end])
        texts.append([token.lemma_ for token in doc])
            
    return texts

texts = docs_to_texts(t_docs)

Expected behavior
I aimed to use the pipe object to speed up the otherwise very time-consuming iteration over thousands of documents.

Spacy 3

Hi! Do you plan to support Spacy 3.0?

BadZipFile error

I get an error when trying to install through pip.

zipfile.BadZipFile: Bad CRC-32 for file 'hu_core_ud_lg/hu_core_ud_lg-0.2.0/tagger/model'

any ideas?
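
A Bad CRC-32 error usually points to a corrupted or incomplete download, and a wheel is a zip archive, so the file can be checked before installing. The sketch below uses the standard library's ZipFile.testzip(), which returns the name of the first member with a bad checksum or None if all members are intact. The function name is hypothetical, and the in-memory archive is only there to make the example runnable.

```python
import io
import zipfile


def first_corrupt_member(wheel):
    """Return the first archive member failing its CRC check, or None."""
    with zipfile.ZipFile(wheel) as archive:
        return archive.testzip()


# Build a tiny intact archive in memory to demonstrate the call.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("tagger/model", b"weights")
buffer.seek(0)

print(first_corrupt_member(buffer))  # None
```

For a real download, pass the path of the .whl file; if a member name is returned, downloading the file again is the first thing to try.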

Extracting numerals from text

Is your feature request related to a problem? Please describe.
I'm trying to extract numerals of different form from Hungarian text (percentages, dates, etc.). I was wondering if this is supported in Huspacy.

Describe the solution you'd like
Here's a small example I've been experimenting with:

text = "A Magyar Fejlesztési Bank Rt. értékesítette 94,521 százalékos részvénycsomagját a BÁV Bizományi Kereskedőház és Záloghitel Rt-ben. A BÁV fő - 48,26 százalékos - tulajdonosa a kormány gyorsforgalmi úthálózat-fejlesztési programjában meghatározó szerepet játszó Vegyépszer Rt.-hez köthető Pro-Cash Rt. lett, az OTP Bank Rt.-nek a papírok 46,25 százaléka jutott - közölte a Népszabadság."

doc = nlp(text)
for ent in doc.ents:
    print(f"{ent}, {ent.label_}")

which results in:

Magyar Fejlesztési Bank Rt., ORG
BÁV Bizományi Kereskedőház és Záloghitel Rt-ben, ORG
BÁV, ORG
Vegyépszer Rt.-hez, ORG
Pro-Cash Rt., ORG
OTP Bank Rt.-nek, ORG
Népszabadság, ORG

If I run the same with the English model, I can extract percentages, dates, etc. as part of the entities as well.

Describe alternatives you've considered
I wrote some regexes already to extract the numerals, but this way my recall is expected to be lower (as dates & times can take many forms).
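
For reference, the regex route for percentages could look like the sketch below. The pattern and function name are illustrative assumptions, not the regexes mentioned above, and they only cover a number with an optional decimal comma followed by a form of "százalék".

```python
import re

# Number with optional decimal comma, then "százalék" plus any suffix.
PERCENT_RE = re.compile(r"(\d+(?:,\d+)?)\s+százalék\w*")


def extract_percentages(text):
    """Return the percentages found in `text` as floats."""
    return [float(m.replace(",", ".")) for m in PERCENT_RE.findall(text)]


sample = ("értékesítette 94,521 százalékos részvénycsomagját; a BÁV fő - 48,26 "
          "százalékos - tulajdonosa; a papírok 46,25 százaléka jutott")
print(extract_percentages(sample))  # [94.521, 48.26, 46.25]
```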

I can work with this, but it would be great to have this as part of the regular Huspacy NLP pipeline.

I have not checked if for English this feature is part of the NER tagger or something else. We may be limited by the lack of training data.

Additional context
Versions:

  • huspacy==0.9.0
  • spacy==3.5.3
  • model: hu_core_news_lg

hu_core_news_trf model spacy_transformers error?

Describe the bug
When the model starts to analyze any given sentence, execution stops with the following error:

Traceback (most recent call last):
  File "/home/istvanu/PycharmProjects/ABSA-PyTorch/examples/examples.py", line 11, in <module>
    main()
  File "/home/istvanu/PycharmProjects/ABSA-PyTorch/examples/examples.py", line 7, in main
    doc = nlp(test)
  File "/home/istvanu/PycharmProjects/ABSA-PyTorch/venv/lib/python3.9/site-packages/spacy/language.py", line 1047, in __call__
    error_handler(name, proc, [doc], e)
  File "/home/istvanu/PycharmProjects/ABSA-PyTorch/venv/lib/python3.9/site-packages/spacy/util.py", line 1724, in raise_error
    raise e
  File "/home/istvanu/PycharmProjects/ABSA-PyTorch/venv/lib/python3.9/site-packages/spacy/language.py", line 1042, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
  File "/home/istvanu/PycharmProjects/ABSA-PyTorch/venv/lib/python3.9/site-packages/spacy_transformers/pipeline_component.py", line 192, in __call__
    outputs = self.predict([doc])
  File "/home/istvanu/PycharmProjects/ABSA-PyTorch/venv/lib/python3.9/site-packages/spacy_transformers/pipeline_component.py", line 229, in predict
    activations = self.model.predict(docs)
  File "/home/istvanu/PycharmProjects/ABSA-PyTorch/venv/lib/python3.9/site-packages/thinc/model.py", line 334, in predict
    return self._func(self, X, is_train=False)[0]
  File "/home/istvanu/PycharmProjects/ABSA-PyTorch/venv/lib/python3.9/site-packages/spacy_transformers/layers/transformer_model.py", line 199, in forward
    model_output, bp_tensors = transformer(wordpieces, is_train)
  File "/home/istvanu/PycharmProjects/ABSA-PyTorch/venv/lib/python3.9/site-packages/thinc/model.py", line 310, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/istvanu/PycharmProjects/ABSA-PyTorch/venv/lib/python3.9/site-packages/thinc/layers/pytorchwrapper.py", line 224, in forward
    Xtorch, get_dX = convert_inputs(model, X, is_train)
  File "/home/istvanu/PycharmProjects/ABSA-PyTorch/venv/lib/python3.9/site-packages/spacy_transformers/layers/transformer_model.py", line 227, in _convert_transformer_inputs
    "input_ids": xp2torch(wps.input_ids, device=hf_device),
  File "/home/istvanu/PycharmProjects/ABSA-PyTorch/venv/lib/python3.9/site-packages/thinc/util.py", line 401, in xp2torch
    torch_tensor = torch.utils.dlpack.from_dlpack(xp_tensor)
RuntimeError: from_dlpack received an invalid capsule. Note that DLTensor capsules can be consumed only once, so you might have already constructed a tensor from it once.

Used with:

numpy>=1.13.3
torch==1.7.1
transformers==4.0.0

Also tested with torch==1.8.0, 1.8.1, 1.10.0, 1.10.1,
and spacy==3.5.0, 3.6.1, 3.7.1, with the same result.

Model installed directly with the recommended pip command:
pip install https://huggingface.co/huspacy/hu_core_news_trf/resolve/main/hu_core_news_trf-any-py3-none-any.whl

To Reproduce

Example code:

import hu_core_news_trf


def main():
    nlp = hu_core_news_trf.load()
    test = "Példa mondat."
    doc = nlp(test)


if __name__ == '__main__':
    main()

Expected behavior
A doc object should be created. hu_core_news_lg model works fine in the same venv.

Additional context

zipfile.BadZipFile: Bad CRC-32 for file

Hi,

I tried to install this tool with python3 and I got the following error. Can you help me solve this issue, please?

Best regards,
László

$ pip3 install https://github.com/oroszgy/spacy-hungarian-models/releases/download/hu_core_ud_lg-0.2.0/hu_core_ud_lg-0.2.0-py3-none-any.whl
Downloading https://github.com/oroszgy/spacy-hungarian-models/releases/download/hu_core_ud_lg-0.2.0/hu_core_ud_lg-0.2.0-py3-none-any.whl (1362.0MB)
Exception:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/usr/lib/python3/dist-packages/pip/commands/install.py", line 353, in run
    wb.build(autobuilding=True)
  File "/usr/lib/python3/dist-packages/pip/wheel.py", line 749, in build
    self.requirement_set.prepare_files(self.finder)
  File "/usr/lib/python3/dist-packages/pip/req/req_set.py", line 380, in prepare_files
    ignore_dependencies=self.ignore_dependencies))
  File "/usr/lib/python3/dist-packages/pip/req/req_set.py", line 620, in _prepare_file
    session=self.session, hashes=hashes)
  File "/usr/lib/python3/dist-packages/pip/download.py", line 821, in unpack_url
    hashes=hashes
  File "/usr/lib/python3/dist-packages/pip/download.py", line 663, in unpack_http_url
    unpack_file(from_path, location, content_type, link)
  File "/usr/lib/python3/dist-packages/pip/utils/__init__.py", line 617, in unpack_file
    flatten=not filename.endswith('.whl')
  File "/usr/lib/python3/dist-packages/pip/utils/__init__.py", line 506, in unzip_file
    data = zip.read(name)
  File "/usr/lib/python3.6/zipfile.py", line 1338, in read
    return fp.read()
  File "/usr/lib/python3.6/zipfile.py", line 858, in read
    buf += self._read1(self.MAX_N)
  File "/usr/lib/python3.6/zipfile.py", line 962, in _read1
    self._update_crc(data)
  File "/usr/lib/python3.6/zipfile.py", line 890, in _update_crc
    raise BadZipFile("Bad CRC-32 for file %r" % self.name)
zipfile.BadZipFile: Bad CRC-32 for file 'hu_core_ud_lg/hu_core_ud_lg-0.2.0/tagger/model'

IndexError: string index out of range

Describe the bug
Using the following input:

Megjelenítőjük 12 hüvelykes, SVGA( 800¥600) felbontású színes LCD, amelyre 6 milliméteres üveggel védett érintőképernyő kerül, és kültéri használatra tervezett billentyűzete is van.

huspacy raises the following error:

IndexError: string index out of range

Traceback:

File "/home/user/.local/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
    exec(code, module.__dict__)
File "/home/user/app/app.py", line 12, in <module>
    spacy_streamlit.visualize(
File "/home/user/.local/lib/python3.11/site-packages/spacy_streamlit/visualizer.py", line 102, in visualize
    doc = process_text(spacy_model, text)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 211, in wrapper
    return cached_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 240, in __call__
    return self._get_or_create_cached_value(args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 266, in _get_or_create_cached_value
    return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/streamlit/runtime/caching/cache_utils.py", line 320, in _handle_cache_miss
    computed_value = self._info.func(*func_args, **func_kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/spacy_streamlit/util.py", line 16, in process_text
    return nlp(text)
           ^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/spacy/language.py", line 1024, in __call__
    error_handler(name, proc, [doc], e)
File "/home/user/.local/lib/python3.11/site-packages/spacy/util.py", line 1701, in raise_error
    raise e
File "/home/user/.local/lib/python3.11/site-packages/spacy/language.py", line 1019, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/hu_core_news_trf/lookup_lemmatizer.py", line 101, in __call__
    token.lemma_ = self.__replace_numbers(lemma_by_pos[key], token.text)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/hu_core_news_trf/lookup_lemmatizer.py", line 132, in __replace_numbers
    return cls._number_pattern.sub(lambda match: token[match.start()], lemma)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.local/lib/python3.11/site-packages/hu_core_news_trf/lookup_lemmatizer.py", line 132, in <lambda>
    return cls._number_pattern.sub(lambda match: token[match.start()], lemma)
                                                 ~~~~~^^^^^^^^^^^^^^^

To Reproduce
Steps to reproduce the behavior:

  1. Go to the HuSpaCy demo
  2. Paste the above sentence into the textbox.
  3. Run
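The last frames of the traceback show the digit-restoring step indexing the token text with a match position taken from the lemma. That raises IndexError whenever a digit sits at a position in the lemma that is beyond the end of the token. The sketch below reproduces this failure mode in isolation and shows one possible guard. The pattern `\d` is an assumption, since the real `_number_pattern` is not shown in the report, and `replace_numbers_safe` is a hypothetical illustration, not the maintainers' fix.

```python
import re

# Assumption: the actual _number_pattern is not visible in the traceback;
# a single-digit matcher is used here for illustration.
NUMBER_PATTERN = re.compile(r"\d")


def replace_numbers_unsafe(lemma, token):
    # Same expression as line 132 in the traceback: each digit in the lemma
    # is replaced by the token character at the same position.
    return NUMBER_PATTERN.sub(lambda match: token[match.start()], lemma)


def replace_numbers_safe(lemma, token):
    # Hypothetical guard: keep the lemma's own digit when the position
    # does not exist in the token.
    def pick(match):
        index = match.start()
        return token[index] if index < len(token) else match.group(0)

    return NUMBER_PATTERN.sub(pick, lemma)
```

With the made-up pair `lemma="abc0"` and `token="ab"`, the unsafe variant raises `IndexError: string index out of range`, while the guarded variant returns the lemma unchanged. Whether leaving the digit as-is is the right fallback for the lemmatizer is a design question for the maintainers.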
