scossin / iamsystem_python

Fast dictionary-based approach for semantic annotation / entity linking

License: MIT License
When annotating a corpus, it is useful to know how relevant each parameter is; statistics could help answer such questions.
Because the marked span is built from the tokens' boundaries rather than the text boundaries, some punctuation marks are removed, which generates a Brat error.
from iamsystem import Matcher, english_tokenizer, split_find_iter_closure

tokenizer = english_tokenizer()
tokenizer.split = split_find_iter_closure(pattern=r"(\w|\.|,)+")
matcher = Matcher.build(
    keywords=["calcium 2.6 mmol/L"],
    tokenizer=tokenizer,
)
annots = matcher.annot_text(text="calcium 2.6 mmol/L")
print(annots[0])
# "calcium 2.6 mmol L 0 18 calcium 2.6 mmol/L"
A '/' is missing from the marked span: 'calcium 2.6 mmol L' should be 'calcium 2.6 mmol/L'.
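A possible fix, sketched in plain Python (not the library internals): rebuild the marked span by slicing the original text with the annotation's character offsets instead of joining the normalized tokens, so punctuation such as '/' survives:

```python
# Sketch: slice the raw text with the annotation's character offsets
# (0 and 18 here, taken from the output above) instead of joining the
# normalized token strings.
text = "calcium 2.6 mmol/L"
start, end = 0, 18
marked_span = text[start:end]
print(marked_span)  # calcium 2.6 mmol/L
```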
from iamsystem import Matcher
matcher = Matcher.build(
    keywords=["cancer"]
)
text = "cancer cancer"
annots = matcher.annot_text(text=text)
for annot in annots:
print(annot)
# cancer 0 6 cancer
It outputs a single annotation although the word 'cancer' is repeated twice. This behavior was explained in a comment in the code:
Don't create multiple annotations for the same transition. For example 'cancer cancer' with keyword 'cancer': if an annotation was created for the first 'cancer' occurrence, don't create a new one for the second occurrence.
The rationale was to avoid the creation of two annotations for repeated words when the window is large:
from iamsystem import Matcher
matcher = Matcher.build(
    keywords=["cancer de prostate"],
    w=20,
)
text = "cancer de prostate token token token token prostate"
annots = matcher.annot_text(text=text)
for annot in annots:
print(annot)
# cancer de prostate 0 18 cancer de prostate
However, this is not appropriate for all use cases, and it is not the behavior a user expects; therefore, by default, every word sequence that matches a keyword should receive its own annotation.
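A minimal sketch of the expected behavior, in plain Python (not the iamsystem internals): deduplicate candidate annotations by their character span rather than by trie transition, so each occurrence of a repeated word keeps its own annotation:

```python
# Sketch: two candidates for 'cancer cancer', one per occurrence.
# Deduplicating by (start, end) span keeps both, since only candidates
# covering the exact same characters are true duplicates.
text = "cancer cancer"
candidates = [("cancer", 0, 6), ("cancer", 7, 13)]
seen, annots = set(), []
for label, start, end in candidates:
    if (start, end) not in seen:
        seen.add((start, end))
        annots.append((label, start, end))
print(len(annots))  # 2
```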
matcher = Matcher.build(
    keywords=["cancer du poumon"]
)
annots = matcher.annot_text("""cancer du\npoumon""")
print(annots[0])
# cancer du
# poumon 0 16 cancer du poumon
The print method outputs the annotation on two lines, which is an issue for serialization and for Brat.
from unidecode import unidecode_expect_ascii
unidecode_expect_ascii("μ")
# m
The unit μg is often used in the biomedical domain.
The normalization μg -> mg performed by the unidecode library is not desirable, since 'mg' is a different unit.
In my opinion, a better normalization is μg -> ug.
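A sketch of such a normalization using only the standard library (the helper name to_ascii is hypothetical): pre-map the Greek mu and the micro sign to 'u' before ASCII folding, so 'μg' becomes 'ug' instead of 'mg':

```python
import unicodedata

# Map both mu-like code points to 'u' before folding to ASCII.
MU_TO_U = str.maketrans({"\u03bc": "u",   # GREEK SMALL LETTER MU
                         "\u00b5": "u"})  # MICRO SIGN

def to_ascii(s: str) -> str:
    s = s.translate(MU_TO_U)
    # NFKD + ASCII-ignore folds accents (e.g. é -> e) for the rest.
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()

print(to_ascii("μg"))          # ug
print(to_ascii("2.6 mmol/L"))  # 2.6 mmol/L
```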
Related to #7. It is important and useful to keep track of each token's index after tokenization, in order to know whether a detected sequence of tokens is continuous or not. It is easy to add for any tokenizer.
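A sketch of the idea (the helper names tokenize_with_index and is_continuous are hypothetical): attach an index to each token at tokenization time, then check that the indices of a matched sequence are consecutive:

```python
# Sketch: a whitespace tokenizer that records each token's index.
def tokenize_with_index(text):
    return [(i, tok) for i, tok in enumerate(text.split())]

# A matched sequence is continuous iff its token indices are consecutive.
def is_continuous(indices):
    return indices == list(range(indices[0], indices[0] + len(indices)))

tokens = tokenize_with_index("cancer de la prostate")
print(tokens[0])                # (0, 'cancer')
print(is_continuous([0, 1, 2, 3]))  # True  -> continuous match
print(is_continuous([0, 3]))        # False -> discontinuous match
```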
Negative stopwords ignore every unigram that is not in the keywords.
Since stopwords are handled before the fuzzy algorithms, abbreviations, typos, etc. are discarded before the fuzzy algorithms are called.
from iamsystem import ESpellWiseAlgo, Matcher

matcher = Matcher.build(
    keywords=["cancer du poumon"],
    stopwords=["du"],
    negative=True,
    w=1,
    abbreviations=[("k", "cancer")],
    spellwise=[
        dict(measure=ESpellWiseAlgo.LEVENSHTEIN, max_distance=1)
    ],
)
annots = matcher.annot_text(text="k poumons")
print(len(annots))
# 0
Even though 'k' is not in the keywords, it is defined as an abbreviation; likewise, 'poumons' is not in the keywords but is one deletion away from 'poumon'.
The matcher should therefore be able to generate a match and an annotation.
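A sketch of the desired ordering (all names are hypothetical, and the Levenshtein check is deliberately crude): a token should be treated as a negative stopword only if neither the exact lookup, the abbreviations, nor a fuzzy algorithm can map it to a keyword unigram:

```python
keyword_unigrams = {"cancer", "du", "poumon"}
abbreviations = {"k": "cancer"}

def matches_within_one_deletion(a, b):
    # crude check: equal, or b obtained by deleting one character of a
    return a == b or any(a[:i] + a[i + 1:] == b for i in range(len(a)))

def is_negative_stopword(token):
    # exact and abbreviation lookups first...
    if token in keyword_unigrams or token in abbreviations:
        return False
    # ...then the fuzzy algorithms, BEFORE discarding the token
    return not any(matches_within_one_deletion(token, kw)
                   for kw in keyword_unigrams)

kept = [t for t in "k poumons".split() if not is_negative_stopword(t)]
print(kept)  # ['k', 'poumons']
```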
During performance tests, I noticed that many overlapping annotations can be created when w is large; and since the rm_nested_annot function performs a nested loop over each set of nested annotations, deleting nested annotations can take O(x²), with x the number of annotations in a nested set.
Although there may be a faster algorithm for removing nested annotations, it is not appropriate to annotate every possible transition. For example:
matcher = Matcher.build(
    keywords=["cancer de la prostate"],
    w=3,
)
annots = matcher.annot_text(text="cancer cancer de de la la prostate prostate")
for annot in annots:
print(annot)
# cancer de la prostate 0 6;14 16;23 34 cancer de la prostate
# cancer de la prostate 0 6;17 19;23 34 cancer de la prostate
# cancer de la prostate 0 6;14 16;20 22;26 34 cancer de la prostate
# cancer de la prostate 0 6;17 22;26 34 cancer de la prostate
# cancer de la prostate 0 6;14 16;23 25;35 43 cancer de la prostate
# cancer de la prostate 0 6;17 19;23 25;35 43 cancer de la prostate
# cancer de la prostate 0 6;14 16;20 22;35 43 cancer de la prostate
# cancer de la prostate 0 6;17 22;35 43 cancer de la prostate
# cancer de la prostate 7 16;23 34 cancer de la prostate
# cancer de la prostate 7 13;17 19;23 34 cancer de la prostate
# cancer de la prostate 7 16;20 22;26 34 cancer de la prostate
# cancer de la prostate 7 13;17 22;26 34 cancer de la prostate
# cancer de la prostate 7 16;23 25;35 43 cancer de la prostate
# cancer de la prostate 7 13;17 19;23 25;35 43 cancer de la prostate
# cancer de la prostate 7 16;20 22;35 43 cancer de la prostate
# cancer de la prostate 7 13;17 22;35 43 cancer de la prostate
For each token of the keyword, there are 2 candidate occurrences, so 2⁴ = 16 paths/annotations are generated in this example. One solution to this problem, which arises when a large window is used on long documents, is to store the states in a set rather than an array. A new path would then replace an existing identical path, producing a single annotation in this example. The total number of states/paths the algorithm can take is then bounded by the size of the terminology, so the number of annotations cannot grow exponentially.
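A toy model of this explosion, in plain Python (not the iamsystem code, and the counts are specific to this toy): each partial match is represented by the number of keyword tokens matched so far; a matching token forks every compatible path (the window allows matching a later duplicate instead), and storing the states in a set collapses duplicate paths:

```python
def run(tokens, keyword, dedup):
    states, annots = [], 0
    for tok in tokens:
        nxt = []
        for s in states:
            if keyword[s] == tok:
                nxt.extend([s + 1, s])  # fork: advance, or wait for a later duplicate
            else:
                nxt.append(s)           # token skipped (window not modeled further)
        if keyword[0] == tok:
            nxt.append(1)               # start a new path
        if dedup:
            nxt = list(set(nxt))        # identical paths collapse into one state
        annots += sum(1 for s in nxt if s == len(keyword))
        states = [s for s in nxt if s < len(keyword)]
    return annots

tokens = "cancer cancer de de la la prostate prostate".split()
kw = "cancer de la prostate".split()
print(run(tokens, kw, dedup=False))  # 16 -> exponential in duplicated tokens
print(run(tokens, kw, dedup=True))   # 2  -> bounded by the keyword length
```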
The default matching algorithm, which allows overlaps, is not the original algorithm described in the scientific paper. Several matching strategies are possible; an interface should be declared so that the user can choose a strategy.
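A sketch of what such an interface could look like (all names here are hypothetical, not the iamsystem API): the matcher delegates to a pluggable strategy, and the original no-overlap algorithm becomes one strategy among others:

```python
from typing import List, Protocol, Tuple

class MatchingStrategy(Protocol):
    """A strategy returns (start, end) token spans of matched keywords."""
    def match(self, tokens: List[str], keyword: List[str]) -> List[Tuple[int, int]]:
        ...

class NoOverlapStrategy:
    """Greedy matching: matched tokens are consumed, so no overlaps."""
    def match(self, tokens, keyword):
        out, i = [], 0
        while i <= len(tokens) - len(keyword):
            if tokens[i:i + len(keyword)] == keyword:
                out.append((i, i + len(keyword)))
                i += len(keyword)  # consume the matched tokens
            else:
                i += 1
        return out

strategy: MatchingStrategy = NoOverlapStrategy()
print(strategy.match("cancer de prostate".split(),
                     "cancer de prostate".split()))
# [(0, 3)]
```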
ImportError: /home/**/lib/python3.7/site-packages/pysimstring/_simstring.so: undefined symbol: _ZSt28__throw_bad_array_new_lengthv
python3 -m venv ./test
source test/bin/activate
pip install iamsystem
python -c "import iamsystem"
pip 19.2.3
Platform: "linux-x86_64"
Python version: "3.7"
Current installation scheme: "posix_prefix"
This problem was solved by upgrading pip to version 23.0.
import spacy

from iamsystem.spacy.component import IAMsystemBuildSpacy

nlp = spacy.blank("fr")
nlp.add_pipe(
    "iamsystem_matcher",
    name="iamsystem2",
    last=True,
    config={"build_params": {"keywords": ["cancer"]}},
)
doc = nlp("prostate cancer")
The iamsystem component must be disabled when the tokenizer is called: since the component name 'iamsystem' was hard coded, it is called recursively, causing this error.

These two matching strategies lead to a similar output:
matcher = Matcher.build(
    keywords=["cancer prostate"],
    stopwords=["de", "la"],
    w=1,
)
annots = matcher.annot_text(text="cancer de la prostate")
for annot in annots:
print(annot)
# cancer prostate 0 6;13 21 cancer prostate
matcher = Matcher.build(
    keywords=["cancer prostate"],
    w=3,
)
annots = matcher.annot_text(text="cancer de la prostate")
for annot in annots:
print(annot)
# cancer prostate 0 6;13 21 cancer prostate
It can be useful to keep track of whether stopwords were used in the matching strategy.
Also, in the first case (with stopwords), the matched words are discontinuous even though the annotation can be considered continuous; a human annotator would annotate it this way: cancer de la prostate 0 21 cancer prostate.
When generating the Brat format, it is important to be able to choose whether or not to include the stopwords.
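A sketch of a hypothetical post-processing helper for Brat output (merge_spans is not part of iamsystem): merge the discontinuous fragments of an annotation into a single span when the gaps between them contain only stopwords:

```python
# Sketch: spans are (start, end) character offsets of an annotation's
# fragments; a gap is absorbed if its text contains only stopwords.
def merge_spans(text, spans, stopwords):
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        gap = text[merged[-1][1]:start]
        if all(w.lower() in stopwords for w in gap.split()):
            merged[-1][1] = end          # absorb the stopword-only gap
        else:
            merged.append([start, end])  # keep the discontinuity
    return [tuple(s) for s in merged]

text = "cancer de la prostate"
print(merge_spans(text, [(0, 6), (13, 21)], {"de", "la"}))
# [(0, 21)]  -> the span a human annotator would produce
```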