openredact / nerwhal

This is a prototype of a multi-lingual suite for named-entity recognition in Python.

Home Page: https://openredact.org/

License: MIT License

Python 100.00%

named-entities ner suite recognize recognition entity-ruler flashtext keyword deep-learning statistical

nerwhal's Issues

Add mechanism to validate recognizer results

Certain types of entities, such as IBANs or credit card numbers, can be validated automatically by computing a checksum. Depending on the validation result, the score could be adjusted or the finding dropped.
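A minimal sketch of what such a validation step might look like, assuming the Luhn checksum for card numbers and the ISO 13616 mod-97 check for IBANs. The function names, the `penalty` parameter and the drop-or-downgrade policy are hypothetical and not part of nerwhal.

```python
from typing import Optional


def luhn_is_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_is_valid(iban: str) -> bool:
    """Return True if `iban` passes the mod-97 check (length per country is not checked)."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move country code and check digits to the end, then map A-Z to 10-35.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def validated_score(score: float, is_valid: bool,
                    drop_invalid: bool = False, penalty: float = 0.5) -> Optional[float]:
    """Hypothetical policy: keep valid findings, downgrade or drop (None) invalid ones."""
    if is_valid:
        return score
    return None if drop_invalid else score * penalty
```

A recognizer could call the matching check on each finding and pass the outcome to `validated_score`; whether invalid findings are dropped or only downgraded is left open here, as in the issue.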

Allow each SpacyEntityRulerRecognizer to have its own precision

Currently, all SpacyEntityRulerRecognizers share a single precision value, which is assigned as the score of every PII they find. It would be more appropriate to have one precision value per recognizer.
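One way this could look, sketched with hypothetical stand-in classes rather than nerwhal's actual SpacyEntityRulerRecognizer: each recognizer declares its own precision as a class attribute, and that value becomes the score of its findings. The tags and precision values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    text: str
    tag: str
    score: float


class RulerRecognizer:
    """Hypothetical base class; subclasses override TAG and PRECISION."""
    TAG = "MISC"
    PRECISION = 0.5  # fallback used when a subclass does not set its own value

    def to_finding(self, text: str) -> Finding:
        # The recognizer's own precision becomes the score of its finding.
        return Finding(text=text, tag=self.TAG, score=self.PRECISION)


class EmailRecognizer(RulerRecognizer):
    TAG = "EMAIL"
    PRECISION = 0.95  # illustrative value, not a measured precision


class PhoneRecognizer(RulerRecognizer):
    TAG = "PHONE"
    PRECISION = 0.6  # illustrative value, not a measured precision
```

With this layout, `EmailRecognizer().to_finding("a@example.org")` carries a score of 0.95 while phone findings carry 0.6, instead of both sharing one global value.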

Improve performance by buffering intermediate results

The heavy lifting of setting up the recognizers is currently done in the respective backend's run method (and, for some recognizers, additionally in their init methods). Ideally this work would be done in one place (the recognizer) and then buffered for successive calls of find_piis. Currently the entire setup is repeated on every call.
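The buffering idea can be sketched with `functools.lru_cache` from the standard library. Here `build_recognizers` is a hypothetical stand-in for the expensive setup, and this `find_piis` is a simplified placeholder, not nerwhal's function of the same name.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def build_recognizers(language: str, recognizer_names: tuple) -> tuple:
    """Stand-in for the expensive setup; arguments must be hashable to be cached."""
    return tuple(name.lower() for name in recognizer_names)


def find_piis(text: str, language: str = "en",
              recognizer_names=("Email", "Phone")) -> list:
    # Convert to a tuple so lists can be passed in and still hit the cache.
    recognizers = build_recognizers(language, tuple(recognizer_names))
    # Placeholder "recognition": report which recognizer names occur in the text.
    return [name for name in recognizers if name in text.lower()]
```

Repeated calls with the same language and recognizer set reuse the cached setup; `build_recognizers.cache_info()` shows how many calls were served from the cache.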

Dependencies

I haven't tried running the project, but by looking at the requirements.txt and the code it seems that the dependency dataclasses is missing (used to define Pii).

context_words flag not working

Dear OpenRedact Team,

I am working on an Ubuntu 20.04 machine with nerwhal 0.1.1 installed in a virtual env via pip. The code is run in a Jupyter notebook. I was generally following the Usage instructions from the Readme, but I ran into a KeyError when switching on the context_words flag (it runs fine when the flag is set to False):

recognizers = nerwhal.list_integrated_recognizers()
rconfig = nerwhal.Config(
    language="de",
    use_statistical_ner=True,
    recognizer_paths=recognizers
)
result = nerwhal.recognize(
    text,
    config=rconfig,
    combination_strategy="smart-fusion",
    context_words=True,
    return_tokens=False,
)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-46-b780c40150ac> in <module>
----> 1 results = [
      2     nerwhal.recognize(
      3         wrapper.text,
      4         config=rconfig,

<ipython-input-46-b780c40150ac> in <listcomp>(.0)
      1 results = [
----> 2     nerwhal.recognize(
      3         wrapper.text,
      4         config=rconfig,

path/to/nerwhal/core.py in recognize(text, config, combination_strategy, context_words, return_tokens)
    140             )
    141             sentence_words = [token.text for token in sentence_tokens]
--> 142             context_words = analyzer.recognizer_lookup[ent.recognizer].CONTEXT_WORDS
    143             if any(word in sentence_words for word in context_words):
    144                 ent.score = min(ent.score * analyzer.config.context_word_confidence_boost_factor, 1.0)

KeyError: 'StanzaNerBackend'

Adding the stanza backend by hand leads to an import error:

recognizers.append('path/to/nerwhal/backends/stanza_ner_backend.py')

...
path/to/backends/stanza_ner_backend.py in <module>
----> 1 from .base import Backend
      2 from nerwhal.types import NamedEntity
      3 from nerwhal.nlp_utils import load_stanza_nlp
      4 
      5 # the stanza NER models have an F1 score between 74.3 and 94.8, https://stanfordnlp.github.io/stanza/performance.html

ImportError: attempted relative import with no known parent package

Am I missing something?
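For illustration only, the failing lookup in the traceback could be guarded along these lines. This sketch assumes that recognizer_lookup is a plain dict keyed by recognizer name and that findings from a backend missing from that dict simply have no context words. The helper name and the stand-in class are hypothetical, and this is not a confirmed fix from the project.

```python
class PhoneRecognizer:
    """Hypothetical stand-in for a recognizer that declares context words."""
    CONTEXT_WORDS = ["phone", "call"]


def context_words_for(recognizer_name: str, recognizer_lookup: dict) -> list:
    """Return the recognizer's context words, or an empty list if it is unknown."""
    recognizer = recognizer_lookup.get(recognizer_name)  # None instead of KeyError
    return list(getattr(recognizer, "CONTEXT_WORDS", []))


lookup = {"PhoneRecognizer": PhoneRecognizer}
phone_words = context_words_for("PhoneRecognizer", lookup)
stanza_words = context_words_for("StanzaNerBackend", lookup)  # key is absent
```

With an empty list, the later `any(...)` check over context words is simply false, so the score of such a finding would stay unchanged.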

Add support for context keywords to improve the score

If a recognizer finds, for example, a telephone number and the same sentence contains the word "phone", then it is more likely that the finding is indeed a telephone number and not a credit card number.
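A small sketch of the scoring rule, modeled on the pattern visible in the core.py excerpt quoted in the traceback above (multiply the score by a boost factor and cap it at 1.0). The function name, the regex-based word splitting, the case-insensitive matching and the default factor are assumptions made for this example.

```python
import re


def boost_score(score: float, sentence: str, context_words: list,
                boost_factor: float = 1.35) -> float:
    """Raise the score if any context word occurs in the sentence, capped at 1.0."""
    # Simplistic tokenization for the example; a real pipeline would reuse its tokens.
    words = set(re.findall(r"\w+", sentence.lower()))
    if any(word.lower() in words for word in context_words):
        return min(score * boost_factor, 1.0)
    return score
```

For a phone recognizer with context words such as "phone" or "call", a finding in "My phone number is ..." would be boosted, while the same digits in a sentence without those words keep their original score.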

Overlapping results of entity-ruler recognizers are ignored

If two recognizers using the EntityRulerBackend identify the same (or an overlapping) token/span as an entity, only the first one to be identified is stored. This may give the wrong entity priority, i.e. the one that comes first in the pipeline rather than the larger one or the one with the higher score. The reason is that the spaCy pipeline allows each token to belong to at most one entity.
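One possible resolution strategy, sketched independently of spaCy: collect all candidate spans first, then keep the longest span and break ties by the higher score. The `Candidate` type and `resolve_overlaps` function are hypothetical. spaCy's `spacy.util.filter_spans` applies a comparable longest-first rule but does not take scores into account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    start: int   # index of the first token
    end: int     # index after the last token (exclusive)
    tag: str
    score: float


def resolve_overlaps(candidates: list) -> list:
    """Greedily keep non-overlapping candidates, preferring longer, then higher-scored."""
    ranked = sorted(candidates, key=lambda c: (c.end - c.start, c.score), reverse=True)
    kept = []
    for cand in ranked:
        # Two half-open ranges are disjoint if one ends before the other starts.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda c: c.start)
```

Given a short high-scored span inside a longer low-scored one, this variant keeps the longer span; swapping the order of the sort key would prefer the score instead.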

Rework parallel execution of backends

The parallel execution of backends failed when NERwhal was imported as a library in a different project (openredact-app). I have therefore disabled parallel execution for now in commit 2cd2355.

Details:

pickle wasn't able to import the target function.
When the target was moved to the top level, pickle wasn't able to import the recognizers.
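As background for the details above: pickle serializes functions by reference (module name plus qualified name), so the receiving process must be able to import them under that name. The sketch below only illustrates this general behavior with hypothetical function names. It does not reproduce the NERwhal failure.

```python
import pickle


def top_level_target(text: str) -> str:
    """Defined at module level, so it can be looked up by its qualified name."""
    return text.upper()


def make_nested_target():
    def nested_target(text: str) -> str:
        # A local function has no importable qualified name.
        return text.upper()
    return nested_target


def can_pickle(obj) -> bool:
    """Return True if pickle can serialize `obj`, False if it refuses."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


nested_ok = can_pickle(make_nested_target())  # local functions are rejected
```

A function like `top_level_target` should be picklable as long as its defining module is importable in the worker process. The same condition applies to any recognizer classes passed along with it, which may be related to the second failure described above.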

Tokens may be computed multiple times

Tokens are computed in core.py if either of the options context_words or compute_tokens is true. In addition, the StanzaNerBackend and the EntityRulerBackend each compute their own tokenization when they are used.

Problems with tests that require stanza models

Tests that require the stanza pipeline are skipped for now. Most likely there is an issue with downloading the stanza models.
