openredact / nerwhal

This is a prototype of a multi-lingual suite for named-entity recognition in Python.

Home Page: https://openredact.org/

License: MIT License

Language: Python 100.00%

Topics: named-entities, ner, suite, recognize, recognition, entity-ruler, flashtext, keyword, deep-learning, statistical

nerwhal's People

Contributors

dependabot[bot], langhabel, malteos


nerwhal's Issues

Add mechanism to validate recognizer results

Certain types of entities, such as IBANs or credit card numbers, can be validated automatically by computing a checksum. Depending on the validation result, the score could be adjusted or the finding dropped.
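To make the idea concrete, here is a minimal sketch of two common checksum validations: the Luhn algorithm used for credit card numbers and the mod-97 check used for IBANs. The function names and the `adjust_score` helper (including its `boost` parameter) are hypothetical and not part of nerwhal; the IBAN check covers only the checksum, not country-specific lengths.

```python
from typing import Optional


def luhn_is_valid(number: str) -> bool:
    """Check a card-like number with the Luhn algorithm (non-digits are ignored)."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_is_valid(iban: str) -> bool:
    """Check only the mod-97 checksum of an IBAN (no country-specific length check)."""
    compact = "".join(iban.split()).upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end, then map A..Z to 10..35.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def adjust_score(score: float, is_valid: bool, boost: float = 1.2) -> Optional[float]:
    """Hypothetical policy: boost valid findings, return None to drop invalid ones."""
    return min(score * boost, 1.0) if is_valid else None
```

Whether an invalid finding should be dropped or merely down-weighted is a design decision; dropping is the simplest policy but loses findings that contain a typo.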

Improve performance by buffering intermediate results

The heavy lifting of setting up the recognizers is currently done in the respective backend's run method (and, for some recognizers, additionally in the init method). This work should ideally be done in one place (the recognizer) and then be buffered for successive calls of find_piis. Currently the entire setup is repeated on every call.
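One possible way to buffer the setup is to memoize it on the configuration, so that repeated calls with the same settings reuse the prepared objects. The sketch below uses `functools.lru_cache`; `get_pipeline`, `find_piis_sketch`, and the `SETUP_CALLS` counter are hypothetical stand-ins and do not reflect nerwhal's actual code.

```python
from functools import lru_cache

# Counter that exists only to make the effect of caching visible.
SETUP_CALLS = {"count": 0}


@lru_cache(maxsize=None)
def get_pipeline(language: str, recognizer_names: tuple):
    """Stand-in for the expensive setup (loading models, compiling patterns)."""
    SETUP_CALLS["count"] += 1
    return {"language": language, "recognizers": recognizer_names}


def find_piis_sketch(text: str, language: str, recognizer_names: list):
    """Hypothetical entry point: setup is fetched from the cache on every call."""
    # lru_cache needs hashable arguments, so the list is converted to a tuple.
    pipeline = get_pipeline(language, tuple(recognizer_names))
    return [(name, text) for name in pipeline["recognizers"]]


find_piis_sketch("first text", "de", ["email", "phone"])
find_piis_sketch("second text", "de", ["email", "phone"])
# At this point the setup function has run only once for this configuration.
```

The cache key here is the pair of language and recognizer names; a real implementation would need a key that captures everything the setup depends on.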

Dependencies

I haven't tried running the project, but by looking at the requirements.txt and the code it seems that the dependency dataclasses is missing (used to define Pii).

context_words flag not working

Dear OpenRedact Team,

I am working on an Ubuntu 20.04 machine with nerwhal 0.1.1 installed in a virtual env via pip. The code is run in a Jupyter notebook. I was generally following the Usage instructions from the README, but I run into a KeyError when switching on the context_words flag (it runs fine when the flag is set to False):

import nerwhal

recognizers = nerwhal.list_integrated_recognizers()
rconfig = nerwhal.Config(
    language="de",
    use_statistical_ner=True,
    recognizer_paths=recognizers
)
result = nerwhal.recognize(
    text,
    config=rconfig,
    combination_strategy="smart-fusion",
    context_words=True,
    return_tokens=False,
)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-46-b780c40150ac> in <module>
----> 1 results = [
      2     nerwhal.recognize(
      3         wrapper.text,
      4         config=rconfig,

<ipython-input-46-b780c40150ac> in <listcomp>(.0)
      1 results = [
----> 2     nerwhal.recognize(
      3         wrapper.text,
      4         config=rconfig,

path/to/nerwhal/core.py in recognize(text, config, combination_strategy, context_words, return_tokens)
    140             )
    141             sentence_words = [token.text for token in sentence_tokens]
--> 142             context_words = analyzer.recognizer_lookup[ent.recognizer].CONTEXT_WORDS
    143             if any(word in sentence_words for word in context_words):
    144                 ent.score = min(ent.score * analyzer.config.context_word_confidence_boost_factor, 1.0)

KeyError: 'StanzaNerBackend'

Adding the Stanza backend by hand leads to an import error:

recognizers.append('path/to/nerwhal/backends/stanza_ner_backend.py')
...
path/to/backends/stanza_ner_backend.py in <module>
----> 1 from .base import Backend
      2 from nerwhal.types import NamedEntity
      3 from nerwhal.nlp_utils import load_stanza_nlp
      4 
      5 # the stanza NER models have an F1 score between 74.3 and 94.8, https://stanfordnlp.github.io/stanza/performance.html

ImportError: attempted relative import with no known parent package

Am I missing something?
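The traceback suggests that the lookup has no entry for entities produced by the statistical backend. A defensive variant of the boosting step might look like the sketch below. This is an assumption about a possible fix, not the project's actual patch; `Entity`, `EmailRecognizer`, and `boost_with_context` are hypothetical names modeled on the lines shown in the traceback.

```python
from dataclasses import dataclass


class EmailRecognizer:
    # Hypothetical recognizer that declares context words.
    CONTEXT_WORDS = ["email", "mail"]


@dataclass
class Entity:
    recognizer: str
    score: float


def boost_with_context(ent, sentence_words, recognizer_lookup, boost_factor=1.2):
    """Boost the score only if the producing recognizer declares context words."""
    # .get() avoids the KeyError for producers that are not in the lookup,
    # and getattr() falls back to an empty list when nothing is declared.
    recognizer = recognizer_lookup.get(ent.recognizer)
    context_words = getattr(recognizer, "CONTEXT_WORDS", [])
    if any(word in sentence_words for word in context_words):
        ent.score = min(ent.score * boost_factor, 1.0)
    return ent


lookup = {"EmailRecognizer": EmailRecognizer}
words = ["my", "email", "is", "listed", "below"]
stanza_ent = boost_with_context(Entity("StanzaNerBackend", 0.5), words, lookup)
email_ent = boost_with_context(Entity("EmailRecognizer", 0.5), words, lookup)
```

With this fallback, entities from a backend without context words simply keep their original score.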

Overlapping results of entity-ruler recognizers are ignored

If two recognizers using the EntityRulerBackend identify the same (or an overlapping) token/span as an entity, only the first match is stored. This may give priority to the wrong entity, i.e. the one that comes first in the pipeline rather than the larger one or the one with the higher score. The cause is that the spaCy pipeline allows a token/span to have at most one entity.
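As an illustration of the alternative priority described above, the following sketch collects all candidate findings first and then resolves overlaps by score, with span length as a tie-breaker. It is independent of spaCy; `Finding` and `resolve_overlaps` are hypothetical names, and the ranking order is one possible choice rather than the project's.

```python
from typing import List, NamedTuple


class Finding(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    tag: str
    score: float


def resolve_overlaps(findings: List[Finding]) -> List[Finding]:
    """Keep the best finding among overlapping ones: higher score, then longer span."""
    ranked = sorted(findings, key=lambda f: (-f.score, -(f.end - f.start), f.start))
    kept: List[Finding] = []
    for candidate in ranked:
        # A candidate survives only if it does not overlap anything already kept.
        if all(candidate.end <= k.start or candidate.start >= k.end for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda f: f.start)


candidates = [
    Finding(0, 5, "A", 0.6),
    Finding(0, 10, "B", 0.9),
    Finding(12, 15, "C", 0.5),
]
resolved = resolve_overlaps(candidates)
```

spaCy also provides `spacy.util.filter_spans`, which resolves overlaps by preferring the longest span; it does not take scores into account, so a score-aware rule like the one above would have to run before the spans are assigned to the document.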

Rework parallel execution of backends

Parallel execution of backends failed when NERwhal was imported as a library in a different project (openredact-app), so I disabled it for now in commit 2cd2355.

Details:

  • pickle wasn't able to import the target function
  • when target was moved to the top level, pickle wasn't able to import the recognizers
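Some background on the details above: pickle serializes functions by reference (module and qualified name), so the receiving process must be able to import the function under that name. The sketch below shows the general behavior with hypothetical function names; it does not reproduce the exact failure seen in openredact-app.

```python
import pickle


def top_level_target(text):
    """Defined at module level, so pickle can refer to it by its qualified name."""
    return text.upper()


def make_nested_target():
    """Returns a function that only exists inside this call's local scope."""
    def nested_target(text):
        return text.upper()
    return nested_target


def can_pickle(obj) -> bool:
    """Report whether pickle.dumps succeeds for the given object."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        # Depending on the Python version, a local function raises either error.
        return False


nested_ok = can_pickle(make_nested_target())  # False: no importable name
```

When run as a regular module or script, `can_pickle(top_level_target)` succeeds, but the unpickling side still has to import the defining module. That is also why dynamically loaded recognizer modules can fail in a worker process even after the target itself has been moved to the top level.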

Tokens may be computed multiple times

Tokens are computed in core.py if either of the options context_words or compute_tokens is true. In addition, the StanzaNerBackend and the EntityRulerBackend each compute their own tokenization when they are used.
