teamheka / medkit

This repository is now archived. Further development has been moved to https://github.com/medkit-lib/medkit.

Home Page: https://github.com/medkit-lib/medkit

License: MIT License

Python 99.98% Makefile 0.02%

annotations clinical-data machine-learning medicine nlp pipelines provenance speech-to-text

medkit's People

Contributors

scossin drfabach

medkit's Issues

Create first GitHub issue templates

Suggestion: create issue templates for features, bugs, and documentation.

Ref: GitHub Issue Templates

Lengths of unicode text and generated ascii text are different

The code below generates a warning:
Lengths of unicode text and generated ascii text are different. Please, pre-process input text before running RegexpMatcher

It would be great if medkit could help with this kind of issue. Since we work with French text, we will surely encounter it often.

Code to reproduce the message:

# !wget https://github.com/aneuraz/casCliniques/raw/main/casCliniques/trainset.json

import json

from spacy import displacy
from medkit.core.text import TextDocument
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy

# Load the training set; each record holds a 'token' list
with open('trainset.json') as f:
    data = json.load(f)

regexp_rules = [
    RegexpMatcherRule(regexp=r"\binsuffisance r.nale\b", label="phenotype"),
]

regexp_matcher = RegexpMatcher(rules=regexp_rules)

colors = {'phenotype': "#85C1E9"}
options = {"ents": ['phenotype'], "colors": colors}

# Render the first document in which the rule matches, then stop
for record in data:
    s = ' '.join(record['token'])
    #s = str(s).encode(encoding = 'ascii', errors = 'replace')
    #s = str(s)
    doc = TextDocument(text=s)
    entities = regexp_matcher.run([doc.raw_segment])
    if len(entities) > 0:
        for entity in entities:
            doc.add_annotation(entity)
        displacy_data = medkit_doc_to_displacy(doc)
        displacy.render(displacy_data, manual=True, style="ent", options=options)
        break
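
Until medkit offers help here, one way to pre-process is to locate the characters whose ASCII form is not exactly one character long, since those are what make the two lengths diverge. The sketch below uses only the standard library. The helper names `ascii_fold` and `find_length_changing_chars` are hypothetical, and the NFKD-based folding is an assumption standing in for whatever conversion RegexpMatcher performs internally, so the characters it flags may not coincide exactly with the ones medkit trips on.

```python
import unicodedata


def ascii_fold(char: str) -> str:
    """Fold one character to ASCII via NFKD, dropping combining marks.

    Assumption: this is a stand-in for the matcher's own conversion.
    """
    decomposed = unicodedata.normalize("NFKD", char)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything still outside ASCII is dropped, which yields an empty string
    return stripped.encode("ascii", errors="ignore").decode("ascii")


def find_length_changing_chars(text: str):
    """Return (index, char, folded) for each char whose folded form is not 1 char long."""
    offenders = []
    for index, char in enumerate(text):
        folded = ascii_fold(char)
        if len(folded) != 1:
            offenders.append((index, char, folded))
    return offenders
```

Accented letters such as "é" fold to a single letter and are not reported, whereas characters like the horizontal ellipsis or the "œ" ligature are. Replacing the reported characters with a same-length substitute before building the TextDocument keeps character offsets aligned.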

DucklingMatcher does not seem to work with empty "dims"

Issue

If no dims are set, as in
duck = DucklingMatcher("ducklin_annot","duckling")
an error is raised:

File ~/test_medikit/.medkit_env/lib/python3.8/site-packages/medkit/text/ner/duckling_matcher.py:86, in DucklingMatcher.run(self, segments)
73 def run(self, segments: List[Segment]) -> List[Entity]:
74 """Return entities for each match in segments
75
76 Parameters
(...)
84 Entities found in segments
85 """
---> 86 return [
87 entity
88 for segment in segments
89 for entity in self._find_matches_in_segment(segment)
90 ]

File ~/test_medikit/.medkit_env/lib/python3.8/site-packages/medkit/text/ner/duckling_matcher.py:89, in (.0)
73 def run(self, segments: List[Segment]) -> List[Entity]:
74 """Return entities for each match in segments
75
76 Parameters
(...)
84 Entities found in segments
85 """
86 return [
87 entity
88 for segment in segments
---> 89 for entity in self._find_matches_in_segment(segment)
90 ]

File ~/test_medikit/.medkit_env/lib/python3.8/site-packages/medkit/text/ner/duckling_matcher.py:111, in DucklingMatcher._find_matches_in_segment(self, segment)
109 matches = api_result.json()
110 for match in matches:
--> 111 if match["dim"] not in self.dims:
112 warnings.warn("Dims are not properly filtered by duckling API call")
113 continue

TypeError: argument of type 'NoneType' is not iterable

At least one dim must be specified.

Warning:

The dim names do not match those shown at https://github.com/facebook/duckling. The accepted names can be found at
https://github.com/facebook/duckling/blob/main/Duckling/Dimensions/Types.hs

Example for time extraction: the dimension is "time", not "Time".
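
The TypeError in the traceback comes from testing membership in `self.dims` while it is None. A minimal sketch of a guard follows, written as a standalone function so it can run without medkit or a Duckling server. The name `filter_matches_by_dim` is hypothetical, and treating None as "no filtering" is an assumption about the desired behavior, not what DucklingMatcher does.

```python
from typing import Any, Dict, List, Optional


def filter_matches_by_dim(
    matches: List[Dict[str, Any]], dims: Optional[List[str]] = None
) -> List[Dict[str, Any]]:
    """Keep matches whose "dim" is listed in dims.

    Assumption: dims=None means "accept every dimension" instead of failing.
    """
    if dims is None:
        return list(matches)
    return [match for match in matches if match["dim"] in dims]
```

The alternative design is to reject `dims=None` in the constructor with an explicit error message, which would match the "at least one dim must be specified" behavior described above.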

Add IAMsystemMatcher (NER)

I would like to add support for IAMsystem, an alternative to QuickUMLS used in medkit.

In my opinion, the easiest solution is to deploy an IAMsystem server in a Docker container with a configuration file (dictionary file, approximate string matching algorithms, etc.) and to have medkit communicate with the Java server over a socket: medkit sends the text to annotate, then parses the JSON response.
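
The client side of this proposal could look roughly like the sketch below. Everything about the wire format is an assumption made for illustration (one UTF-8 JSON object per line in each direction, a `text` field in the request), as are the function names; the issue does not specify the protocol the IAMsystem server would expose.

```python
import json
import socket


def encode_request(text: str) -> bytes:
    """Serialize the text to annotate as one newline-terminated JSON line (assumed format)."""
    return json.dumps({"text": text}).encode("utf-8") + b"\n"


def decode_reply(line: bytes):
    """Parse one JSON line sent back by the server."""
    return json.loads(line.decode("utf-8"))


def annotate_over_socket(text: str, host: str, port: int, timeout: float = 5.0):
    """Send text to the annotation server and return its parsed JSON reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_request(text))
        # Read up to the newline that terminates the reply
        with conn.makefile("rb") as reader:
            reply = reader.readline()
    return decode_reply(reply)
```

Keeping encoding and decoding separate from the socket call makes the message format testable without a running server.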

'Exclusion_regexp' failed when combined with section segmentation

The bug occurred in an NER pipeline. Below is a compact version that reproduces it.

Here is the text needed to reproduce the error; it corresponds to the file exemple.txt in the code.

"
Technique:
Acquisition thoraco-abdominale au temps artériel et abdomino-pelvienne au temps portal.
90 ml de Iomeron 350.
DLP = 606 mGy.cm

Résultats:

LESIONS CIBLES
PANCREAS

GANGLIONS

LESIONS NON CIBLES
POUMON
NOUVELLES LESIONS
GANGLIONS
"

And here is the Python code.

"
from pathlib import Path

import yaml
from spacy import displacy
from medkit.core.pipeline import Pipeline, PipelineStep
from medkit.core.text import TextDocument
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
from medkit.text.segmentation import SectionTokenizer
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy

file = Path("exemple.txt")
doc = TextDocument(text=file.read_text())

# segmentation to sections
with open("default_section_definition.yml", "r") as section_file:
    section_data = yaml.load(section_file, Loader=yaml.FullLoader)
section_tokenizer = SectionTokenizer(section_dict=section_data['sections'], output_label="section")

# entity recognition through regex
regexp_rules = [
    RegexpMatcherRule(
        regexp=r"(\bNon\b|\bNo\b|\bYes\b|\bN+$|\bN\d$|\bOui\b)",
        label="recist_new_lesion",
        unicode_sensitive=True,
        exclusion_regexp=r"(lesions?\snon\scibles?)",
    )
]
regexp_matcher = RegexpMatcher(rules=regexp_rules, attrs_to_copy=["is_negated"])

# chain both operations: sections first, then regex matching on each section
pipeline_steps = [
    PipelineStep(section_tokenizer, input_keys=["full_text"], output_keys=["section_text"]),
    PipelineStep(regexp_matcher, input_keys=["section_text"], output_keys=["entities"]),
]
pipeline = Pipeline(pipeline_steps, input_keys=["full_text"], output_keys=["entities"])

entities = pipeline.run([doc.raw_segment])

for entity in entities:
    doc.add_annotation(entity)
doc.to_dict()
displacy_data = medkit_doc_to_displacy(doc)
options = {"compact": True, "color": "blue", "word_spacing": 2}
displacy.render(docs=displacy_data, manual=True, options=options, style="ent")
"

Note that I edited the end of default_section_definition.yml to try to split my .txt file into the different RECIST sections. I appended the following:
"
'structure_recist_doc':
- 'LESIONS CIBLES'
- 'LESIONS NON CIBLES'
- 'NOUVELLES LESIONS'
- 'LESIONS SANS RAPPORT AVEC LA MALADIE'
"

With this code, I look for keywords such as "non" in a text, but I want to exclude matches on the phrase "LESIONS NON CIBLES". I therefore used the exclusion_regexp parameter with the pattern "(lesions?\snon\scibles?)". However, the match on "NON" still occurs.
Even more surprisingly, if I change the exclusion regex pattern to "sions?\snon\scibles?", it works as expected.
After a few tests, the problem appears to come specifically from segmenting the text into sections.
I checked at each step that the regex is not trying to match against an incomplete fragment of the phrase; that does not seem to be the problem, since "LESIONS NON CIBLES" is present in full.

After discussing it with Bastion, we suspect a span problem, probably an offset introduced when the text is segmented.
Either that, or a section title cannot be matched by an exclusion_regexp for some reason I am not aware of.
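
To separate a pattern problem from a segmentation problem, the two regexes can first be checked with plain `re` on unsegmented text. The sketch below is only a reference check under stated assumptions: the function name is hypothetical, matching is assumed to be case-insensitive, and a match is dropped when it lies entirely inside an exclusion match. This is not necessarily how RegexpMatcher applies exclusion_regexp.

```python
import re


def find_matches_with_exclusion(text, pattern, exclusion_pattern, flags=re.IGNORECASE):
    """Return (start, end, matched_text) for matches not covered by an exclusion match."""
    excluded_spans = [m.span() for m in re.finditer(exclusion_pattern, text, flags)]
    kept = []
    for match in re.finditer(pattern, text, flags):
        start, end = match.span()
        # Drop the match if it sits entirely inside any excluded span
        covered = any(ex_start <= start and end <= ex_end for ex_start, ex_end in excluded_spans)
        if not covered:
            kept.append((start, end, match.group()))
    return kept
```

If this check excludes "NON" on the full text while the pipeline does not, the difference is more likely to come from what each section segment contains or from its spans than from the exclusion pattern itself.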

iamsystem_matcher does not propagate the "is_negated" and "is_family" attributes

Context: In an NER setting, we used the iamsystem_matcher tool provided by medkit to annotate a document, while also wanting to detect negations and family history.

Problem: Unlike the regex-based entity matcher (regexp_matcher = RegexpMatcher(rules=regexp_rules, attrs_to_copy=["is_negated"])), the iamsystem_matcher annotator does not offer attribute propagation. As a result, the recognized entities have no "is_negated" or "is_family" attribute.

Example code to reproduce the error:
"
from medkit.text.ner.iamsystem_matcher import MedkitKeyword, IAMSystemMatcher
from medkit.core.text import TextDocument
from medkit.text.context import NegationDetector, NegationDetectorRule, FamilyDetector, FamilyDetectorRule
from iamsystem import Matcher
from iamsystem import ESpellWiseAlgo
from medkit.text.segmentation import SentenceTokenizer, SyntagmaTokenizer

# dictionary of terms to detect
keywords_list = []
keywords_list.append(MedkitKeyword(label="poumon gauche", kb_id="M001", kb_name="manual", ent_label="anatomy"))
keywords_list.append(MedkitKeyword(label="vascularite", kb_id="M002", kb_name="manual", ent_label="disorder"))

matcher = Matcher.build(
    keywords=keywords_list,
    spellwise=[dict(measure=ESpellWiseAlgo.LEVENSHTEIN, max_distance=1, min_nb_char=5)],
    stopwords=["et"],
    w=2,
)
iam = IAMSystemMatcher(matcher=matcher)

neg_detector = NegationDetector(output_label="is_negated")
fam_detector = FamilyDetector(output_label="family")

doc = TextDocument(text="Le patient présente une asténie de grade 2 et une anémie de grade 3. Atteinte du poumon gauche et droit. Il est traité par chimiothérapie. Son père est décédé d'un cancer du poumon. Il n'a pas de vascularite.")

sent_tokenizer = SentenceTokenizer(
    output_label="sentence",
    punct_chars=[".", "?", "!", "\n"],
)

# the context attributes are attached to the sentences
sentences = sent_tokenizer.run([doc.raw_segment])
neg_detector.run(sentences)
fam_detector.run(sentences)

# the entities are matched on the raw segment and carry no context attribute
entities = iam.run([doc.raw_segment])
for entity in entities:
    doc.anns.add(entity)

    print("text=", entity.text, ", label=", entity.label, ", is_negated=", entity.attrs.get(label="is_negated"), ", spans=", entity.spans, sep='')
    #print(f"text='{entity.text}', label={entity.label}, is_negated={entity.attrs.get(label="is_negated")[0].value}, spans={entity.spans}")
    #print(f"text='{entity.text}', label={entity.label}, spans={entity.spans}")
"
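
For clarity, the behavior being requested can be illustrated without medkit. In the sketch below, `Annotation` and `copy_attrs` are hypothetical toy stand-ins, not medkit classes or functions: each entity receives the listed attributes of the sentence whose character span contains it, which is roughly what attrs_to_copy is expected to provide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Annotation:
    """Toy stand-in for a sentence or entity: a character span plus attributes."""

    span: Tuple[int, int]
    attrs: Dict[str, bool] = field(default_factory=dict)


def copy_attrs(sentences: List[Annotation], entities: List[Annotation], attrs_to_copy: List[str]) -> None:
    """Copy the listed attributes from the containing sentence onto each entity."""
    for entity in entities:
        for sentence in sentences:
            contains = sentence.span[0] <= entity.span[0] and entity.span[1] <= sentence.span[1]
            if contains:
                for label in attrs_to_copy:
                    if label in sentence.attrs:
                        entity.attrs[label] = sentence.attrs[label]
                break  # an entity belongs to at most one sentence here
```

In the reproduction above, this would mean running the matcher on the sentences returned by the tokenizer and detectors, so that each entity has a parent segment to inherit from.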

Define a workflow for the software development

See the developer-guide document.