scossin / iamsystem_python

Fast dictionary-based approach for semantic annotation / entity linking

License: MIT License
When annotating a corpus, it is useful to know how relevant each parameter is; statistics could help answer such questions.
Because the marked span is built from the tokens' boundaries rather than the text boundaries, some punctuation marks are removed, which generates a Brat error.
from iamsystem import Matcher, english_tokenizer, split_find_iter_closure

tokenizer = english_tokenizer()
tokenizer.split = split_find_iter_closure(pattern=r"(\w|\.|,)+")
matcher = Matcher.build(
    keywords=["calcium 2.6 mmol/L"],
    tokenizer=tokenizer,
)
annots = matcher.annot_text(text="calcium 2.6 mmol/L")
print(annots[0])
# "calcium 2.6 mmol L 0 18 calcium 2.6 mmol/L"
A '/' is missing from the marked span: 'calcium 2.6 mmol L' should be 'calcium 2.6 mmol/L'.
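A possible fix, sketched in plain Python (not the library internals): rebuild the marked span by slicing the original text with the annotation's character offsets instead of joining the normalized tokens, so punctuation such as '/' survives:

```python
# Sketch: slice the raw text with the annotation's character offsets
# (0 and 18 here, taken from the output above) instead of joining the
# normalized token strings.
text = "calcium 2.6 mmol/L"
start, end = 0, 18
marked_span = text[start:end]
print(marked_span)  # calcium 2.6 mmol/L
```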
from iamsystem import Matcher
matcher = Matcher.build(
    keywords=["cancer"]
)
text = "cancer cancer"
annots = matcher.annot_text(text=text)
for annot in annots:
print(annot)
# cancer 0 6 cancer
It outputs a single annotation although the word 'cancer' is repeated twice. This behavior was explained in a comment in the code:
Don't create multiple annotations for the same transition. For example 'cancer cancer' with keyword 'cancer': if an annotation was created for the first 'cancer' occurrence, don't create a new one for the second occurrence.
The rationale was to avoid the creation of two annotations for repeated words when the window is large:
from iamsystem import Matcher
matcher = Matcher.build(
    keywords=["cancer de prostate"],
    w=20,
)
text = "cancer de prostate token token token token prostate"
annots = matcher.annot_text(text=text)
for annot in annots:
print(annot)
# cancer de prostate 0 18 cancer de prostate
However, this is not appropriate for all use cases, and it is not the behavior a user expects; therefore, by default, every word sequence that matches a keyword should receive its own annotation.
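A minimal sketch of the expected behavior, in plain Python (not the iamsystem internals): deduplicate candidate annotations by their character span rather than by trie transition, so each occurrence of a repeated word keeps its own annotation:

```python
# Sketch: two candidates for 'cancer cancer', one per occurrence.
# Deduplicating by (start, end) span keeps both, since only candidates
# covering the exact same characters are true duplicates.
text = "cancer cancer"
candidates = [("cancer", 0, 6), ("cancer", 7, 13)]
seen, annots = set(), []
for label, start, end in candidates:
    if (start, end) not in seen:
        seen.add((start, end))
        annots.append((label, start, end))
print(len(annots))  # 2
```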
matcher = Matcher.build(
    keywords=["cancer du poumon"]
)
annots = matcher.annot_text("""cancer du\npoumon""")
print(annots[0])
# cancer du
# poumon 0 16 cancer du poumon
The print method outputs the annotation on two lines, which is an issue for serialization and for Brat.
from unidecode import unidecode_expect_ascii
unidecode_expect_ascii("μ")
# m
The unit μg is often used in the biomedical domain.
The normalization μg -> mg performed by the unidecode library is not desirable, since 'mg' is a different unit.
In my opinion, a better normalization is μg -> ug.
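A sketch of such a normalization using only the standard library (the helper name to_ascii is hypothetical): pre-map the Greek mu and the micro sign to 'u' before ASCII folding, so 'μg' becomes 'ug' instead of 'mg':

```python
import unicodedata

# Map both mu-like code points to 'u' before folding to ASCII.
MU_TO_U = str.maketrans({"\u03bc": "u",   # GREEK SMALL LETTER MU
                         "\u00b5": "u"})  # MICRO SIGN

def to_ascii(s: str) -> str:
    s = s.translate(MU_TO_U)
    # NFKD + ASCII-ignore folds accents (e.g. é -> e) for the rest.
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()

print(to_ascii("μg"))          # ug
print(to_ascii("2.6 mmol/L"))  # 2.6 mmol/L
```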
Related to #7. It is important and useful to keep track of each token's index after tokenization, in order to know whether a detected sequence of tokens is continuous or not. It is easy to add for any tokenizer.
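A sketch of the idea (the helper names tokenize_with_index and is_continuous are hypothetical): attach an index to each token at tokenization time, then check that the indices of a matched sequence are consecutive:

```python
# Sketch: a whitespace tokenizer that records each token's index.
def tokenize_with_index(text):
    return [(i, tok) for i, tok in enumerate(text.split())]

# A matched sequence is continuous iff its token indices are consecutive.
def is_continuous(indices):
    return indices == list(range(indices[0], indices[0] + len(indices)))

tokens = tokenize_with_index("cancer de la prostate")
print(tokens[0])                # (0, 'cancer')
print(is_continuous([0, 1, 2, 3]))  # True  -> continuous match
print(is_continuous([0, 3]))        # False -> discontinuous match
```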
Negative stopwords ignore every unigram that is not in the keywords.
Since stopwords are handled before the fuzzy algorithms, abbreviations, typos, etc. are discarded before the fuzzy algorithms are called.
from iamsystem import ESpellWiseAlgo, Matcher

matcher = Matcher.build(
    keywords=["cancer du poumon"],
    stopwords=["du"],
    negative=True,
    w=1,
    abbreviations=[("k", "cancer")],
    spellwise=[
        dict(measure=ESpellWiseAlgo.LEVENSHTEIN, max_distance=1)
    ],
)
annots = matcher.annot_text(text="k poumons")
print(len(annots))
# 0
Even though 'k' is not in the keywords, it is defined as an abbreviation; likewise, 'poumons' is not in the keywords but is one deletion away from 'poumon'.
The matcher should therefore be able to generate a match and an annotation.
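A sketch of the desired ordering (all names are hypothetical, and the Levenshtein check is deliberately crude): a token should be treated as a negative stopword only if neither the exact lookup, the abbreviations, nor a fuzzy algorithm can map it to a keyword unigram:

```python
keyword_unigrams = {"cancer", "du", "poumon"}
abbreviations = {"k": "cancer"}

def matches_within_one_deletion(a, b):
    # crude check: equal, or b obtained by deleting one character of a
    return a == b or any(a[:i] + a[i + 1:] == b for i in range(len(a)))

def is_negative_stopword(token):
    # exact and abbreviation lookups first...
    if token in keyword_unigrams or token in abbreviations:
        return False
    # ...then the fuzzy algorithms, BEFORE discarding the token
    return not any(matches_within_one_deletion(token, kw)
                   for kw in keyword_unigrams)

kept = [t for t in "k poumons".split() if not is_negative_stopword(t)]
print(kept)  # ['k', 'poumons']
```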
During performance tests, I noticed that many overlapping annotations can be created when w is large; and since the rm_nested_annot function performs a nested loop over each set of nested annotations, deleting nested annotations can take O(x²), with x the number of annotations in a nested set.
Although there may be a faster algorithm for removing nested annotations, it is not appropriate to annotate every possible transition. For example:
matcher = Matcher.build(
    keywords=["cancer de la prostate"],
    w=3,
)
annots = matcher.annot_text(text="cancer cancer de de la la prostate prostate")
for annot in annots:
print(annot)
# cancer de la prostate 0 6;14 16;23 34 cancer de la prostate
# cancer de la prostate 0 6;17 19;23 34 cancer de la prostate
# cancer de la prostate 0 6;14 16;20 22;26 34 cancer de la prostate
# cancer de la prostate 0 6;17 22;26 34 cancer de la prostate
# cancer de la prostate 0 6;14 16;23 25;35 43 cancer de la prostate
# cancer de la prostate 0 6;17 19;23 25;35 43 cancer de la prostate
# cancer de la prostate 0 6;14 16;20 22;35 43 cancer de la prostate
# cancer de la prostate 0 6;17 22;35 43 cancer de la prostate
# cancer de la prostate 7 16;23 34 cancer de la prostate
# cancer de la prostate 7 13;17 19;23 34 cancer de la prostate
# cancer de la prostate 7 16;20 22;26 34 cancer de la prostate
# cancer de la prostate 7 13;17 22;26 34 cancer de la prostate
# cancer de la prostate 7 16;23 25;35 43 cancer de la prostate
# cancer de la prostate 7 13;17 19;23 25;35 43 cancer de la prostate
# cancer de la prostate 7 16;20 22;35 43 cancer de la prostate
# cancer de la prostate 7 13;17 22;35 43 cancer de la prostate
For each token of the keyword, there are 2 candidate occurrences, so 2⁴ = 16 paths/annotations are generated in this example. One solution to this problem, which arises when a large window is used on long documents, is to store the states in a set rather than an array. A new path would then replace an existing identical path, producing a single annotation in this example. The total number of states/paths the algorithm can take is then bounded by the size of the terminology, so the number of annotations cannot grow exponentially.
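A toy model of this explosion, in plain Python (not the iamsystem code, and the counts are specific to this toy): each partial match is represented by the number of keyword tokens matched so far; a matching token forks every compatible path (the window allows matching a later duplicate instead), and storing the states in a set collapses duplicate paths:

```python
def run(tokens, keyword, dedup):
    states, annots = [], 0
    for tok in tokens:
        nxt = []
        for s in states:
            if keyword[s] == tok:
                nxt.extend([s + 1, s])  # fork: advance, or wait for a later duplicate
            else:
                nxt.append(s)           # token skipped (window not modeled further)
        if keyword[0] == tok:
            nxt.append(1)               # start a new path
        if dedup:
            nxt = list(set(nxt))        # identical paths collapse into one state
        annots += sum(1 for s in nxt if s == len(keyword))
        states = [s for s in nxt if s < len(keyword)]
    return annots

tokens = "cancer cancer de de la la prostate prostate".split()
kw = "cancer de la prostate".split()
print(run(tokens, kw, dedup=False))  # 16 -> exponential in duplicated tokens
print(run(tokens, kw, dedup=True))   # 2  -> bounded by the keyword length
```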
The default matching algorithm, which allows overlaps, is not the original algorithm described in the scientific paper. Several matching strategies are possible; an interface should be declared so that the user can choose a strategy.
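A sketch of what such an interface could look like (all names here are hypothetical, not the iamsystem API): the matcher delegates to a pluggable strategy, and the original no-overlap algorithm becomes one strategy among others:

```python
from typing import List, Protocol, Tuple

class MatchingStrategy(Protocol):
    """A strategy returns (start, end) token spans of matched keywords."""
    def match(self, tokens: List[str], keyword: List[str]) -> List[Tuple[int, int]]:
        ...

class NoOverlapStrategy:
    """Greedy matching: matched tokens are consumed, so no overlaps."""
    def match(self, tokens, keyword):
        out, i = [], 0
        while i <= len(tokens) - len(keyword):
            if tokens[i:i + len(keyword)] == keyword:
                out.append((i, i + len(keyword)))
                i += len(keyword)  # consume the matched tokens
            else:
                i += 1
        return out

strategy: MatchingStrategy = NoOverlapStrategy()
print(strategy.match("cancer de prostate".split(),
                     "cancer de prostate".split()))
# [(0, 3)]
```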
ImportError: /home/**/lib/python3.7/site-packages/pysimstring/_simstring.so: undefined symbol: _ZSt28__throw_bad_array_new_lengthv
python3 -m venv ./test
source test/bin/activate
pip install iamsystem
python -c "import iamsystem"
pip 19.2.3
Platform: "linux-x86_64"
Python version: "3.7"
Current installation scheme: "posix_prefix"
This problem was solved by upgrading pip to version 23.0.
import spacy

from iamsystem.spacy.component import IAMsystemBuildSpacy

nlp = spacy.blank("fr")
nlp.add_pipe(
    "iamsystem_matcher",
    name="iamsystem2",
    last=True,
    config={"build_params": {"keywords": ["cancer"]}},
)
doc = nlp("prostate cancer")
The iamsystem component must be disabled when the tokenizer is called: since the component name 'iamsystem' was hard coded, it is called recursively, causing this error.

These two matching strategies lead to a similar output:
matcher = Matcher.build(
    keywords=["cancer prostate"],
    stopwords=["de", "la"],
    w=1,
)
annots = matcher.annot_text(text="cancer de la prostate")
for annot in annots:
print(annot)
# cancer prostate 0 6;13 21 cancer prostate
matcher = Matcher.build(
    keywords=["cancer prostate"],
    w=3,
)
annots = matcher.annot_text(text="cancer de la prostate")
for annot in annots:
print(annot)
# cancer prostate 0 6;13 21 cancer prostate
It can be useful to keep track of whether stopwords were used in the matching strategy.
Also, in the first case (with stopwords), the matched words are discontinuous even though the annotation can be considered continuous; a human annotator would annotate it this way: cancer de la prostate 0 21 cancer prostate.
When generating the Brat format, it is important to be able to choose whether or not to include the stopwords.
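A sketch of a hypothetical post-processing helper for Brat output (merge_spans is not part of iamsystem): merge the discontinuous fragments of an annotation into a single span when the gaps between them contain only stopwords:

```python
# Sketch: spans are (start, end) character offsets of an annotation's
# fragments; a gap is absorbed if its text contains only stopwords.
def merge_spans(text, spans, stopwords):
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        gap = text[merged[-1][1]:start]
        if all(w.lower() in stopwords for w in gap.split()):
            merged[-1][1] = end          # absorb the stopword-only gap
        else:
            merged.append([start, end])  # keep the discontinuity
    return [tuple(s) for s in merged]

text = "cancer de la prostate"
print(merge_spans(text, [(0, 6), (13, 21)], {"de", "la"}))
# [(0, 21)]  -> the span a human annotator would produce
```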