aphp / edsnlp Goto Github PK

View Code? Open in Web Editor NEW

97.0 8.0 27.0 89.04 MB

Modular, fast NLP framework, compatible with Pytorch and spaCy, offering tailored support for French clinical notes.

Home Page: https://aphp.github.io/edsnlp/

License: BSD 3-Clause "New" or "Revised" License

Python 99.07% Cython 0.89% Makefile 0.04%

nlp medical text-mining clinical-data-warehouse french spacy deep-learning fast multi-task pytorch rule-based

edsnlp's Introduction

EDS-NLP

EDS-NLP is a collaborative NLP framework that aims primarily at extracting information from French clinical notes. At its core, it is a collection of components or pipes, either rule-based functions or deep learning modules. These components are organized into a novel efficient and modular pipeline system, built for hybrid and multitask models. We use spaCy to represent documents and their annotations, and Pytorch as a deep-learning backend for trainable components.

EDS-NLP is versatile and can be used on any textual document. The rule-based components are fully compatible with spaCy's components, and vice versa. This library is a product of collaborative effort, and we encourage further contributions to enhance its capabilities.

Check out our interactive demo !

Features

Rule-based components for French clinical notes
Trainable components: NER, Span classification
Support for multitask deep-learning models with weights sharing
Fast inference, with multi-GPU support out of the box
Easy to use, with a spaCy-like API
Compatible with rule-based spaCy components
Support for various io formats like BRAT, JSON, Parquet, Pandas or Spark

Quick start

Installation

You can install EDS-NLP via pip. We recommend pinning the library version in your projects, or use a strict package manager like Poetry.

pip install edsnlp==0.11.2

or if you want to use the trainable components (using pytorch)

pip install "edsnlp[ml]==0.11.2"

A first pipeline

Once you've installed the library, let's begin with a very simple example that extracts mentions of COVID19 in a text, and detects whether they are negated.

import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")

terms = dict(
    covid=["covid", "coronavirus"],
)

# Split the documents into sentences, this isneeded for negation detection
nlp.add_pipe(eds.sentences())
# Matcher component
nlp.add_pipe(eds.matcher(terms=terms))
# Negation detection (we also support spacy-like API !)
nlp.add_pipe("eds.negation")

# Process your text in one call !
doc = nlp("Le patient n'est pas atteint de covid")

doc.ents
# Out: (covid,)

doc.ents[0]._.negation
# Out: True

Documentation & Tutorials

Go to the documentation for more information.

Disclaimer

The performances of an extraction pipeline may depend on the population and documents that are considered.

Contributing to EDS-NLP

We welcome contributions ! Fork the project and propose a pull request. Take a look at the dedicated page for detail.

Citation

If you use EDS-NLP, please cite us as below.

@misc{edsnlp,
  author = {Wajsburt, Perceval and Petit-Jean, Thomas and Dura, Basile and Cohen, Ariel and Jean, Charline and Bey, Romain},
  doi    = {10.5281/zenodo.6424993},
  title  = {EDS-NLP: efficient information extraction from French clinical notes},
  url    = {https://aphp.github.io/edsnlp}
}

Acknowledgement

We would like to thank Assistance Publique – Hôpitaux de Paris, AP-HP Foundation and Inria for funding this project.

edsnlp's People

Contributors

Stargazers

Watchers

edsnlp's Issues

identify tables

Suggestion for a new pipeline to detect tables of biological results. To discuss possible improvements to this first example.

import spacy
from edsnlp import components
from io import StringIO


nlp = spacy.blank("fr")

regex = dict(
    tables=[r"(\b.*[|¦].*\n)+",],
)

# Sentencizer component, needed for negation detection
nlp.add_pipe("sentences")
# Matcher component
nlp.add_pipe("matcher", config=dict(regex=regex))

text = """
Le patientqsfqfdf bla bla bla
Leucocytes ¦x10*9/L ¦4.97 ¦4.09-11
Hématies ¦x10*12/L¦4.68 ¦4.53-5.79
Hémoglobine ¦g/dL ¦14.8 ¦13.4-16.7
Hématocrite ¦% ¦44.2 ¦39.2-48.6
VGM ¦fL ¦94.4 + ¦79.6-94
TCMH ¦pg ¦31.6 ¦27.3-32.8
CCMH ¦g/dL ¦33.5 ¦32.4-36.3
Plaquettes ¦x10*9/L ¦191 ¦172-398
VMP ¦fL ¦11.5 + ¦7.4-10.8

qdfsdf

"""

doc = nlp(text)

table_str = doc.spans["tables"][0].text
print(table_str)


table_str = doc.spans["tables"][0].text
table_str


table_str_io = StringIO(table_str)

table_pandas = pd.read_csv(table_str_io, sep="¦", engine="python",header=None)

table_pandas

	0	1	2	3
0	Leucocytes	x10*9/L	4.97	4.09-11
1	Hématies	x10*12/L	4.68	4.53-5.79
2	Hémoglobine	g/dL	14.8	13.4-16.7
3	Hématocrite	%	44.2	39.2-48.6
4	VGM	fL	94.4 +	79.6-94
5	TCMH	pg	31.6	27.3-32.8
6	CCMH	g/dL	33.5	32.4-36.3
7	Plaquettes	x10*9/L	191	172-398
8	VMP	fL	11.5 +	7.4-10.8

Matcher documentation

The documentation is light on the differents arguments eds.matcher.

Insist on the type expected by regex and terms
Add documention on "subclassing" eds.matcher

[Numpy version] Incompatibility version with scikit-eds

Description

edsnlp requires a version of numpy >= 1.21 whereas scikit-eds requires numpy < 1.20 (probably due to pyspark issues).

How to reproduce the bug

requirements.txt from edsnlp here
requirements.txt from scikit:

pgpasslib
psycopg2-binary
pandas
numpy<1.20.0  # https://github.com/databricks/koalas/pull/2166
koalas
altair
loguru

Your Environment

Operating System: Windows 10
Python Version Used: 3.7
spaCy Version Used: 3.31
EDS-NLP Version Used: 0.6.0
scikit-eds Version Used: 0.1.0

Sentencizer cut codes in different sentences while it's the same token

Description

For the moment the sentencizer makes a new sentence when there is a "." character followed by a capitalized letter.
This can be problematic for some codes or accronyms, as they can be constructed with those patterns (example : "V.I.H",), and will be divided in different sentences.

The ADICAP codes analysed by the eds.adicap pipeline can be found in text in the form : "code ADICAP : B.H.HP.A7A0", and the eds.contextual-matcher used behind will not capture the code.

A solution would be to create a new sentence if there is a . followed by a space/new line/other separation and a capitalized letter.

How to reproduce the bug

import spacy

nlp = spacy.blank("eds")
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sentences")

code = "B.H.HP.A7A0"

for sent in nlp(code).sents:
    print(sent.text)

B.
H.
HP.
A7A0

Your Environment

Operating System: Ubuntu 22.04.1 LTS
Python Version Used: 3.10.6
spaCy Version Used: 3.4.1
EDS-NLP Version Used: 0.7.4
Environment Information:

Feature request: Pollution

Feature type

Add new patterns

Description

Add pattern to match text like :

2/2Pat : <NOM> <Prenom> le <date> IPP <ipp> Intitulé RCP : Urologie HMN le <date>

1/2Pat : <NOM> <Prenom> le <date> IPP <ipp> Intitulé RCP : Urologie HMN le <date>

TNM doesn't match regex in sentence

Description

Le pipeline TNM ne renvoie pas les scores suivis par un espace. (match avec "aTxN1M0\nanything" mais pas avec "aTxN1M0 anything")

Plus généralement spacy n'a pas l'air ok avec les regex qui finissent par un espace ?

How to reproduce the bug

import spacy
nlp = spacy.blank("fr")
nlp.add_pipe("eds.TNM")

doc = nlp("aTxN1M0")
print(doc.ents)  # works
# (aTxN1M0,)
doc = nlp("aTxN1M0\n")
print(doc.ents)  # works
# (aTxN1M0,)
doc = nlp("aTxN1M0 ")
print(doc.ents)  # doesn't work
# ()

On peut simplifier le problème sous la forme:

import spacy
from edsnlp.matchers.regex import RegexMatcher
nlp = spacy.blank("fr")

matcher = RegexMatcher(attr="TEXT", alignment_mode="strict")
matcher.add("tnm", [r"a ?"]) #regex simple : _a_ peut être suivi par un espace
nlp = spacy.blank("fr")

doc = nlp("test: a") # exemple simple sans espace à la fin
for r in matcher(doc, as_spans=True,return_groupdict=True):
    print(r) 
#(a, {})
doc = nlp("test: a ")  # exemple simple avec espace à la fin
for r in matcher(doc, as_spans=True,return_groupdict=True):
    print(r) 
#ne renvoie rien

Pourtant, un simple match de regex fonctionne:

import re
re.search('a ?', "test: a ")
#<re.Match object; span=(6, 8), match='a '>

quelle est la différence entre re et spacy qui justifie ce comportement ?

Your Environment

EDS-NLP Version Used: V0.7.4

Feature request: smarter reported_speech

Feature type

Detect all the entities made by the patient, not by another person .

Description

According to the documentation doc, I thought that eds.reported_speech allowed to detect all the entities that refer to the patient, and not to the "son", "brother" or "sister .." of the patient.

import spacy
nlp = spacy.blank("fr")
nlp.add_pipe("eds.sentences")

nlp.add_pipe("eds.matcher",config=dict(terms=dict(patient="patient", alcool="alcoolisé",inconscience="inconscient")),)
nlp.add_pipe("eds.reported_speech")

text = ("Le patient est admis aux urgences ce soir après un grave accident de voiture. Il nie être inconscient. Il se plaint d'une douleur au cou, au bras gauche et au dos. Suspicion d'un traumatisme crânien. Le fils du patient était au volant. Il nie être alcoolisé.")
doc = nlp(text)
In: doc.ents
Out: (patient, inconscient, patient, alcoolisé)
In: doc.ents[0]._.reported_speech
Out: False
In: doc.ents[1]._.reported_speech
Out: True
In: doc.ents[2]._.reported_speech
Out: False
In: doc.ents[3]._.reported_speech
Out: True # False ?

It is interesting to use the dependency parse tree from the "patient" node (use appos, flat:name or conj .. cf spacy linguistic features) to make sure that "alcoolisé" doesn't refers to "patient".

For Le fils du patient, we have a "nmod" relationship:
fils <--nmod--> patient
that refer to the root "patient" node "Le patient est admis .."

At the least, we should exclude nmod relationship to detect only the entities made by the patient.

Reason for pydantic <1.10.0

Context

I am using pydantic factory for my study module to generate mock data. It relies on pydantic >=1.10.

Description:

edsnlp precises that pydantic should < 1.10.0 . Is there a specific reason that we do not want to accept later version of pydantic ?

Correct eds.history pipeline to distinguish "medical history" from "history of current disease"

As built, if the use_section=True config is applied to the eds.history pipeline, all "antécédents", "antécédents familiaux" and "histoire de la maladie" sections are used to tag entities as "history".

The problem is that :

"histoire de la maladie" refers to history of the current disease, and not to medical history.
"antécédents familiaux" refers to family diseases

I suggest removing "histoire de la maladie" and "antécédents familiaux" from section_history list in edsnlp/pipelines/qualifiers/history/patterns.py

If an entity refers to the history of the current disease, this will be found with the section title.

Thank you !

Feature request: Terminology matching

Feature type

Pipeline component for easy terminology matching.

Description

We need a proper way to match terminologies. The span label should reflect the matched terminology.

I've started writing a simple terminology matcher that mimics the GenericMatcher. See #75.

We can use the kb_url_template in displacy for better entity linking.

Adding terminologies for ATC code A05AA02 to drugs.json

In reference file for drug identification : https://github.com/aphp/edsnlp/blob/master/edsnlp/resources/drugs.json, some terminologies are missing for ursodesoxycholic acide (A05AA02)

Current list :
"A05AA02": [
"ACIDE URSODESOXYCHOLIQUE",
"CHOLURSO",
"DELURSAN",
"TILLHEPO",
"URSOLVAN",
"ursodesoxycholique acide"
]

Suggested list :
"A05AA02": [
"ACIDE URSODESOXYCHOLIQUE",
"CHOLURSO",
"DELURSAN",
"TILLHEPO",
"URSOLVAN",
"ursodesoxycholique acide",
"URSOFALK",
"DOZURSO",
"AUDC",
"UDCA"
]

Reference : https://www.vidal.fr/medicaments/substances/acide-ursodesoxycholique-128.html

Refactor the parallelization utils

Feature type

Following a brainstorming with @Thomzoy, we'd like to refactor the parallelization utilities to decouple the type of collection (iterators, lists, dataframe pandas, dataframe spark, hive table, etc.) from the type of parallelization (no parallelization, multi cpu, gpu, distributed computing for spark).

Description

Collection types

Most of the processing with edsnlp is done on pandas and spark lists and dataframes (to the best of our knowledge), so we feel it's necessary to handle these cases natively.

The following changes will be made:

the nlp.pipe method (cf refacto) will be able to receive a classic iterable as input, as is already the case, or a dataframe (pandas / spark at least)
to manage the conversion of table rows / dictionaries into spacy.tokens.Doc objects, we add two methods to nlp.__call__ and nlp.pipe to replace (eventually) the parameters (additional_spans, extensions, context, results_extractor):
- parameter to_doc: (Any -> spacy.tokens.Doc)
- parameter from_doc: (spacy.tokens.Doc -> Any)
in the case of a pandas or spark dataframe, we pre-convert the rows of these tables into dictionaries, before calling the to_doc method, and convert a dictionary produced by from_doc into a table row.

How do we plan to support other formats?

It's up to the user to convert the entry into an accepted format. For example, polars to pandas, or polars to dictionary iterator.

Why add two methods to_doc / from_doc, and not simply ask the user to convert everything into an iterator?

Depending on the distribution method, data may not always be accessible locally. For instance, in the case of a spark dataframe, a function is sent to each executor to apply the nlp object to a line. It is therefore necessary to send these conversion functions with the nlp object to each executor, hence these two parameters.
This also allows us to optimize the common but tricky collections (pandas, spark), while giving users some leeway in the choice of columns and outputs.

Acceleration / parallelization mode

We plan to manage acceleration on several processes, one or more gpus, or in a distributed way via spark

the nlp.pipe method will be able to take n_process as input (parameter originally used in the spacy Language object) to distribute via several processes locally
a new "method" parameter will receive an acceleration object (dict / custom object?) containing the acceleration type:
- if gpu method: devices, num_cpu_per_gpus, batch_size
- if spark method (only compatible with dataframe spark) with number of partitions if repartitioning, number of executors, etc.
- ...
This will in turn call a specific function depending on the method
We can probably infer the parallelization method automatically, depending on the type of input collection and the computational resources available.

Pseudo implementation

This is open to discussion:

def pipe(self, data, from_doc, to_doc, method):
    if is_simple_iterable(data):
        accelerator = get_iterable_accelerator(method)  # various accelerators 
        return accelerator(data, from_doc, to_doc)
    elif is_pandas(data):
        iterable = pandas_to_iterable(data)
        accelerator = get_iterable_accelerator(method)  # various accelerators 
        iterable = accelerator(iterable, from_doc, to_doc)
        return iterable_to_pandas(iterable)
    elif is_polars(data): # we haven't decided yet which formats are common enough to be supported natively
        ...
    elif is_spark(data):
        accelerator = get_spark_accelerator(method)  # will check that the accelerator is compatible
        return accelerator(data, from_doc, to_doc)
    else:
        raise NotImplementedError()

Detection of successive empty sections

Description

Using the beta pipeline "eds.sections", I encountered a bug in detecting sections preceded by an empty section. Indeed, if a section is preceded by an empty section, it is not detected and its content is labeled as belonging to the first section.

Empty sections are quite common in documents, so this could lead to errors in section labelling of entities.

How to reproduce the bug

For instance, in the following example, sections "Antécédents :" and "Conclusion" are not distinguished. Therefore, all the content of "Conclusion" section is tagged as "Antécédents".

import spacy

nlp = spacy.blank("eds")

nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sections")

# Definition matcher
regex = dict(
    # Myolyse
    rhabdomyolyse = "rhabdom[yi]ol[yi]se",
    myolyse = "m[yi]ol[yi]se"
)

nlp.add_pipe("eds.matcher", 
             config = dict(
                 regex=regex, 
                 attr="NORM",
                 ignore_excluded=True,
             ),
            )

text = """
Antécédents : 
Conclusion : 
Patient va mieux

Au total:
sortie du patient
"""

doc.spans["sections"]

Your Environment

Operating System:
Python Version Used: 3.10.0
spaCy Version Used: 3.4.1
EDS-NLP Version Used: 0.6.1
Environment Information:

Discussion: Improve the Contextual Matcher

Draft

Starting example:

When looking for ADICAP codes, some notes showed multiples codes:

"adicap : OHGSA7B1, OHGSA7B3"

Using he ContextualMatcher as of today, only the last code will be retrieved.

Proposition

Maybe a replace option in the assing dictionary might be useful:

One assign key can have the replace option.
In this case, those assigned spans will become the new entity.ies
Other assigned value will be "transfered" to those entities. We might see the case were multiple assign values (from the same key) have to be transfered. In this case:
- Either all values are transfered as a list
- Or a single one is transfered (controlled via a parameter like reduce_mode)

Examples

"Le patient a un diabète insulinorequérent de type I et II"

No `replace`, no `expand`, `reduce_mode=None`

`replace`	`expand`	`reduce_mode`
False	False	False

Harmonize processing utils

Description

@aricohen93

Parallel and distributed pipelines do not behave the same regarding the addition of a note_id column after the processing of documents.

TODO:

either remove the note_id select from distributed.py
add note_id after parallel
something else ?

Add support for new family relationships

Description

I noticed the family qualifier lacks nephew and niece patterns which makes it miss some relationships, although unusual. Is there a specific reason for that?

Happy to make a pull request if not.

Not expected behavior on spaces in the measures module

Description

While modifying the measure module (see https://github.com/gozat/edsnlp/tree/59-more-measurements), I find that a string like 12 m 45 is not completely detected : only 12.0m is detected. It seems it comes from the REGEX generation, that puts a \\s* in the pattern generated by pipelines.misc.measures.measures.make_patterns, and then not matching the space ; 12 m45 still produces 12.5m as string of span._.value.

I wonder whether this bug is reproducible or not ?

I'll update this issue when I have more clue about it, sorry for this short introduction.

Problem with special whitespaces using `spacy.blank("eds")`

Edge cases with the eds tokenizer

Description

Due to its more simple implementation, the Tokenizer shipped with the eds language doesn't handle every whitespace characters very well. For instance:

Text	`spacy.blank('fr')`	`spacy.blank('eds')`
`"il\xa0fait\tchaud et froid"`	`['il', '\xa0', 'fait', '\t', 'chaud', 'et', 'froid']`	`['il\xa0fait\tchaud', 'et', 'froid']`

This is then especially problematic when using the Term Matcher

How to reproduce the bug

import spacy

langs = dict(
    fr=spacy.blank('fr'),
    eds=spacy.blank('eds'),
)

text = "il\xa0fait\tchaud et froid"

for lang, nlp in langs.items():
    doc = nlp(text)
    print(lang)
    print([repr(t) for t in doc])

Suggestions

Go back to the classic FrenchTokenizer but change some defaults to get the same features
Handle the case in the normalizer pipeline
Change the word_regex to handle different types of whitespaces

Error when using qualifiers with spacy-transformers model

When loading a pipeline from disk, if the pipeline contains a spacy-transformers model and any edsnlp qualifiers this error is encountered:

KeyError: "Parameter 'W' for model 'softmax' has not been allocated yet."

Description

Full Traceback

File "/Users/Louise/Library/Application Support/JetBrains/PyCharm2023.2/scratches/scratch.py", line 8, in <module>
    nlp = spacy.load("nlp")
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/spacy/__init__.py", line 51, in load
    return util.load_model(
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/spacy/util.py", line 467, in load_model
    return load_model_from_path(Path(name), **kwargs)  # type: ignore[arg-type]
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/spacy/util.py", line 539, in load_model_from_path
    nlp = load_model_from_config(
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/spacy/util.py", line 587, in load_model_from_config
    nlp = lang_cls.from_config(
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/spacy/language.py", line 1864, in from_config
    nlp.add_pipe(
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/spacy/language.py", line 821, in add_pipe
    pipe_component = self.create_pipe(
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/spacy/language.py", line 709, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/confection/__init__.py", line 756, in resolve
    resolved, _ = cls._make(
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/confection/__init__.py", line 805, in _make
    filled, _, resolved = cls._fill(
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/confection/__init__.py", line 877, in _fill
    getter_result = getter(*args, **kwargs)
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/edsnlp/pipelines/qualifiers/negation/negation.py", line 174, in __init__
    super().__init__(
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/edsnlp/pipelines/qualifiers/base.py", line 84, in __init__
    self.phrase_matcher.build_patterns(nlp=nlp, terms=terms)
  File "edsnlp/matchers/phrase.pyx", line 99, in edsnlp.matchers.phrase.EDSPhraseMatcher.build_patterns
  File "edsnlp/matchers/phrase.pyx", line 111, in edsnlp.matchers.phrase.EDSPhraseMatcher.build_patterns
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/spacy/language.py", line 1618, in pipe
    for doc in docs:
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/spacy/util.py", line 1685, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/pipe.pyx", line 55, in pipe
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/spacy/util.py", line 1685, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/transition_parser.pyx", line 245, in pipe
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/spacy/util.py", line 1632, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/spacy/util.py", line 1685, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/trainable_pipe.pyx", line 79, in pipe
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/spacy/util.py", line 1704, in raise_error
    raise e
  File "spacy/pipeline/trainable_pipe.pyx", line 75, in spacy.pipeline.trainable_pipe.TrainablePipe.pipe
  File "spacy/pipeline/tagger.pyx", line 138, in spacy.pipeline.tagger.Tagger.predict
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/thinc/model.py", line 334, in predict
    return self._func(self, X, is_train=False)[0]
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/thinc/model.py", line 310, in __call__
    return self._func(self, X, is_train=is_train)
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/thinc/layers/with_array.py", line 42, in forward
    return cast(Tuple[SeqT, Callable], _list_forward(model, Xseq, is_train))
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/thinc/layers/with_array.py", line 77, in _list_forward
    Yf, get_dXf = layer(Xf, is_train)
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/thinc/model.py", line 310, in __call__
    return self._func(self, X, is_train=is_train)
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/thinc/layers/softmax.py", line 69, in forward
    W = cast(Floats2d, model.get_param("W"))
  File "/Users/Louise/Documents/Projects/ai/scracth_venv/lib/python3.9/site-packages/thinc/model.py", line 235, in get_param
    raise KeyError(
KeyError: "Parameter 'W' for model 'softmax' has not been allocated yet."

The error occurs during the initialization of the qualifiers, where the token_pipelines are ran in EDSPhraseMatcher's build_patterns. I did a bit of digging and it seems like the error comes from the fact that the spacy-transformers pipelines are not fully initialized at this point so running them raises an error. Possible fixes could be to skip the problematic pipes if they are not necessary to run, or do this step once the whole pipeline has been completely initialized (not in the __init__)

How to reproduce the bug

import spacy

nlp = spacy.load("fr_dep_news_trf")
nlp.add_pipe("sentencizer")
nlp.add_pipe("eds.negation", name="eds_negation")
nlp("Test")   # no problem here 
nlp.to_disk("nlp")
nlp = spacy.load("nlp")  # here is the bug

Your Environment

Operating System: macOS
Python Version Used: 3.9
spaCy Version Used: 3.7.2
EDS-NLP Version Used: 0.9.1
Environment Information:
- spacy-tranformers version: 1.3.2

Feature request: custom qualifier patterns

Feature type

Modification of pipelines

Description

Hi 🙂

We're considering migrating from negspacy, that we use for instance with termsets adapted from French Fast Context, to edsnlp, in order to perform assertion status detection, so I had a suggestion about your qualifier components, that I find really cool btw!

I was wondering about the reason(s) why you chose to load the default patterns of the qualifiers during their initializations, compared to include the default patterns as the factory default parameters? I guess on the "pro" side of adding the default patterns during the initialization (your approach), I can see that it's more user-friendly (it's simpler to add new patterns), but on the "con" side, we cannot use entirely custom patterns (since we cannot remove the default patterns), therefore it gives less room to users to optimize the patterns to their own use-cases, if I got that correctly

Is there another reason why you made this design choice? If not, would you be open to change the way the default patterns are working? (this could be done with a parameter that indicates whether or not to add the default patterns to the ones in the parameters for instance, if you want to bring less radical changes to the existing code base)

PS: looking into this, I found what I believe to be a very minor issue in the History component factory, here, in the DEFAULT_CONFIG, you use not None values for history and termination but these values are useless since they are duplicates of the component's defaults class attribute (in the other qualifiers the corresponding parameters are correctly set to None)

Feature request: Score

Score

Description

We analysed the performance of the pipeline eds.charlson over 100 documents extracted from the Bordeaux CHU medical datawarehouse. We compare Charlson score extracted by edsnlp pipeline with Charlson score extracted by hand. Over the hundred documents we found 5 diverging cases which brings out several issues that might be usefull in a more general context of integer score detection.

Proposition

Here are few points that could help to enhance score detection:

Include Roman numerals (i.e 'Charlson score is about II)
Ranges in score (i.e 'Charlson score lies between 2 and 3)
Fuzziness for mispelling score name (i.e 'Charltson score of 3')
Ordering (i.e 'Charlson score > 7)

Feature request: IAM system

Feature type

After discussing with @scossin, it would be a nice feature to be able to use the IAM system with EDS-NLP. Since this library already ships with a spaCy connector, a first step would be to create a documentation page here referencing the project and demonstrating it on an example similar to the Matcher to inform users.

To make this component compatible with edsnlp, the only things missing are:

the ability to add the detected entities to the Doc.ents and to assign them a label, either fixed or dynamic according to the kb_id of the retrieved entity
make it available via spaCy's entry points (spacy_factories)

We could also add the IAM system as a matcher (https://github.com/aphp/edsnlp/tree/master/edsnlp/matchers), but this would require splitting it into multiple parts to remain modular (add spellwise to normalizers, trie + simstring / levenshtein / ..., make it accept a dict of patterns, etc) so let's keep it simple for now.

Feature request: TNM detection

Feature type

NER pipeline to detect and normalise the TNM score.

Description

A simple regex will do.

Recognised entitites assigned values are shifted left by n tokens

Description

eds.normaliser converts '...' into three separate tokens '.' '.' '.' , which shifts the entities detected in the output by n tokens according to the number of '...' the note_text contains

How to reproduce the bug

import spacy

patterns_kal = dict(
    source="kaliemia",
    terms=["kaliemie", "potassium", "hypokaliemie"],
    regex_attr="NORM",
    assign=[
        dict(
            name="value",
            regex=r"\b(\d+[\.]*\d+)\b",
            replace_entity=False,
            window=15,
        )])

nlp = spacy.blank("fr")
nlp.add_pipe("eds.normalizer")
nlp.add_pipe(
    "eds.contextual-matcher",
    name="Kaliemia",
    config=dict(
        patterns=patterns_kal,
        include_assigned=False,
    ),
)


test_text = """
Contexte de prélè…………1 ¦ ¦Non précisé ¦
Potassium ¦mmol/L ¦4.3
"""

doc=nlp(test_text)
e=doc.ent[0]
e._.assigned 

# ...

Your Environment

Operating System: linux ubuntu20
Python Version Used: 3.7.16
spaCy Version Used: 3.4.4
EDS-NLP Version Used: 0.7.4
Environment Information: conda

Solution temporaire dans class Normalizer

def __call__(self, doc: Doc) -> Doc:
        """
        Apply the normalisation pipeline, one component at a time.

        Parameters
        ----------
        doc : Doc
            spaCy `Doc` object

        Returns
        -------
        Doc
            Doc object with `NORM` attribute modified
        """
        for token in doc:
            token.norm_ = token.text.lower()
        if not self.lowercase:
            remove_lowercase(doc)
        if self.accents is not None:
            self.accents(doc)
        if self.quotes is not None:
            self.quotes(doc)
        if self.pollution is not None:
            self.pollution(doc)

        return doc

Bug: `explain` type is incorrect in `History` factory

Hi! I find your work with edsnlp really cool, and I'm working on integrating it in one of our tools at Arkhn 🙂
I found a very little mistake which prevents us from trying the History component at all and which should be very easy to fix!

Description

In the History class, you use an attribute explain, which is a boolean, however in the corresponding factory, you made it a string (right here).

As a consequence, spaCy converts the parameter to a string, so the "explain mode" is always on, which is an issue when we try to serialize the documents (e.g. with doc.to_bytes()) as the history cues are not serializable. 😬

How to reproduce the bug

import spacy

nlp = spacy.blank("fr")
nlp.add_pipe("eds.history")

name, component = nlp.pipeline[-1]
print(name)
print(component.explain)
print(type(component.explain))
print(bool(component.explain))

which gives:

eds.history
False
<class 'str'>
True

the last line exhibits that the explain feature here is always activated 😕

Your Environment

Operating System: MacOS
Python Version Used: 3.9.13
spaCy Version Used: 3.1.3
EDS-NLP Version Used: 0.6.2
Environment Information:

Makefile

You can use make to automate different parts of the project :

run linters like flake8
run tests
installing dependencies/requirements
...

A good reference : http://www.gnu.org/software/make/manual/make.html

UMLS matching

(copied from APHP's gitlab — 30/09/22)

Feature type

For now, EDS-NLP only allows to extract and normalize entities to ATC (via ROMEDI), and ICD10.
As UMLS is an international resource and gathers many terminologies (including SnomedCT) in many languages, integrating it would greatly benefit the library and its users to

automatically categorize the texts of a corpus according to different concept IDs
perform entity searching
create processing rules (if ent.concept_id is a child of CUIXXXXX then, ...)
do corpus pre-annotation
...

Several points are targeted:

Downloading the resource

The UMLS contains several tables. We are mainly interested in the MRCONSO table, which contains synonyms and concept IDs (2GB for the 2022AA version). It does not seem reasonable to ask the users to download it themselves, the procedure is long and painful. Fortunately, there is the small (but very well done) umls_downloader library that allows to automate this process provided you have an UMLS license (which is necessary anyway), and store the tables in a shared cache folder.

We should therefore:

decide when the download is done (at the installation ? at the instantiation of the eds.umls pipeline ?)
see how downloading and caching of resources (like cim10) could be generalized for edsnlp

Exact & approximate matching

Once the resource is downloaded, we need to find the UMLS synonyms (MRCONSO table) in the texts. For this, we can use the EDSPhraseMatcher of edsnlp for exact matching and the SimstringMatcher for approximate matching.
The easiest way is to adapt one of the two other TerminologyMatcher implemented for ICD10 or for ATC.

This would require:

pre-process the MRCONSO table downloaded in the previous step (e.g. with optional filters on some columns)
load it into a TerminologyMatcher

Normalization

Once the synonyms have been identified, we need to decide how to present the extracted information to the user. The UMLS aligns synonyms with a unique identifier, the CUI, but also offers alignments to the IDs of all the terminologies it contains. At the moment, the ent.kb_id_ attribute contains the various identifiers jumbled together (ATC / ICD10), making it difficult to use if you start mixing terminologies in a pipeline.
A more robust solution (#47) would be to store in different extensions the IDs proposed by the terminology.

expose ignore_exluded option

edsnlp/edsnlp/pipelines/misc/dates/dates.py

Line 59 in bb86cc0

def __init__(

expose ignore_exluded option

Feature request: extract date format

Feature type

New option / feature / span extension of eds.dates (@acalliger)

Description

It would be great if we could extract the date format like "September 5, 2012" -> "%d %B %Y", especially for downstream applications like pseudonymization to shift dates while keeping the same format.

Format codes: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

Feature request: A qualifier that detects the experiencer (beyond family members)

Feature type

New qualifier pipeline, overtaking and improving the eds.family component.

Description

As of today, EDS-NLP only equiped to detect family members. We should add detection of other types of experiencers, perhaps as a catch-all other modality?

Possible experiencers in that paradigm:

patient or self
family
other

I feel this would cover most use-case. The family modality is particularly important in case of hereditary diseases (although one could argue we need to know whether those are blood relatives in that case).

Depending on the speciality (psychology pops to mind), other types of experiencers might be useful (friend?), but I wonder whether the other experiencer is enough even in that case. The mere fact that the healthcare provider mentions an other could well be sufficient information?

Any thoughts?

Adicap : enhancement of regex to match local spelling

Description

In my hospital (CHU de Brest), ADICAP codes are written like this:


ADICAP :B.H.HP.A7A0

Cotations :
ZZQX217      R-AHC-100-A001 R-AHC-10-A015

In this case dots spells adicap structure and dictionnaries for (d1-d8) part of code.

Your regex in adicap ner is without dots, here

Are you ok if I propose this modified regex?

just add 3 conditionnal dots \.{0,1} in d1_4 = r"[A-Z]\.{0,1}[A-Z]\.{0,1}[A-Z]{2}\.{0,1}"

d1_4 = r"[A-Z]\.{0,1}[A-Z]\.{0,1}[A-Z]{2}\.{0,1}"
d5_8_v1 = r"\d{4}"
d5_8_v2 = r"\d{4}|[A-Z][0-9A-Z][A-Z][0-9]"
d5_8_v3 = r"[0-9A-Z][0-9][09A-Z][0-9]"
d5_8_v4 = r"0[A-Z][0-9]{2}"


adicap_prefix = r"(?i)(codification|adicap)"
base_code = (
    r"("
    + d1_4
    + r"(?:"
    + d5_8_v1
    + r"|"
    + d5_8_v2
    + r"|"
    + d5_8_v3
    + r"|"
    + d5_8_v4
    + r"))"
)

test :

Many thanks

Feature request: opt-ins for the pollution

Feature type

Modification of the pollution matcher to be more center-specific (eg AP-HP, Bordeaux, etc).

Description

Light modification of the eds.pollution pipeline to separate patterns between the main case and center-specific patterns. Most pollution comes from the upstream text extraction phase, and is mostly hospital-specific.

Feature request: terminology matcher with normalisation

Feature type

Matcher pipeline to handle the single label/multiple subconcepts use-case.

Description

As discussed in #58, we would certainly benefit from having EDS-NLP handle the nitty-gritty detail of matching a terminology with automatic concept normalisation.

For now, it is reasonably easy to match a terminology wherein the label is the normalisation. However, we could use the kb_id_ attribute (see spaCy documentation) to include a more hierarchical structure.

For instance, paracetamol/tylenol should probably get the label drug and a kb_id_ like ATC=N02BE01.

Proposition

We could modify the eds.matcher component to handle this case natively, or create a new component.

Feature request: efficient fuzzy matching

Feature type

Some performance comparisons between different matching algorithms (SimString, FlashText and EDS.Matcher)

Description

SimString

Search with a given measure (for example Cosine Similarity) if a string match another string. It is also the base of QuickUMLS.
Quite annoying to use (you have to split the text in order to have a list of query word/expressions).

FlashText

Build a Tree with all the words in the dictionary, and go through it with the query tokens. Built to find exact match. Possibilitie to allow some misspeling through the Levensthein's Distance, already implemented in the GitHub but not in the pip version, so the code is in edsnlp\u tils\flashtext.py.

Global Comparison with SimString, FlashText (with and without a spaCy pipeline) and the EDS.matcher pipeline :

FTspaCy is the FlashText algorithm included in a spaCy pipeline to include all the computation added by it.

Making the number of keywords evolves (we search them in 10k words)

EXACT COMPETITION
Count | SimString | FlashText | FTspaCy | EDSmatcher |
------------------------------------------------------
0     | 0.16901   | 0.01900   | 0.88706 | 0.78106    |
500   | 0.29502   | 0.02600   | 0.83106 | 1.32510    |
1000  | 0.28202   | 0.02800   | 0.78306 | 1.84014    |
1500  | 0.32902   | 0.03000   | 0.83706 | 2.64216    |
2000  | 0.21301   | 0.03200   | 0.81806 | 3.09923    |
2500  | 0.35903   | 0.03700   | 0.76006 | 3.44325    |

FUZZY COMPETITION
Count | SimString | FlashText | FTspaCy |
-----------------------------------------
0     | 1.16013   | 0.05101   | 0.89707 |
500   | 1.04108   | 2.24417   | 2.92714 |
1000  | 1.05508   | 2.32512   | 3.05924 |
1500  | 1.03408   | 3.09474   | 3.92729 |
2000  | 1.21609   | 2.06315   | 2.95722 |
2500  | 1.10708   | 2.16816   | 2.96522 |

Making the number of total words evolves (we search 2587 keywords)

EXACT COMPETITION
Count | SimString | FlashText | FTspaCy | EDSmatcher |
------------------------------------------------------
10000 | 0.39403   | 0.04200   | 1.17209 | 4.73035    |
20000 | 0.59956   | 0.06701   | 1.83514 | 8.09860    |
30000 | 0.71705   | 0.08101   | 2.53819 | 12.29189   |
40000 | 0.94507   | 0.14701   | 4.00055 | 16.79729   |

FUZZY COMPETITION

Count | SimString | FlashText |  FTspaCy  |
-------------------------------------------
10000 | 1.35110   | 2.88821   | 4.91637   |
20000 | 2.21816   | 6.73250   | 7.80158   |
30000 | 2.33017   | 9.09268   | 12.11979  |
40000 | 3.31178   | 11.93089  | 14.07093  |

Feature request: section extension

Description

Suggested by @marieverdoux
Add a new extension ._.section (or something else to the same extent) to easily retrieve the section of an entity.

Currently, to get the section title of the section of an entity, an iteration of entities / sections must be implemented manually.

Potential issues: depending on how we implement this, entities added after this assignment is performed will not be updated, unless with define ._.section as a getter, but a getter might be slow if we have to iterate over all sections for each entity.

Feature request: [feature]

Feature type

Add the ignore_excluded option in the Dates pipeline

Feature request: Unify span getters / setters

Feature type

We might want to have a more uniform way of getting spans in pipelines. Currently, we have on_ents_only, on_spans, etc...
An idea is to expose a span_getter key in the configuration that could look like:

span_getter = dict(
    ents = True,
    spans = ["first_span_key","second_span_key"], # or `True` to get all SpanGroups
    labels = ["relevant_label"], # to keep only entities with a specific `label_`
)

If a more complex getter is needed, it could come from a span_getter factory

Termination improvement: Support for newline character

Hi,

Would you consider adding the newline character "\n" to the list of terminations (in edsnlp.pipelines.terminations)?

In a note like this:


pas de toux
NOVATREX

NOVATREX would for example qualify as a negation because the line above contains "pas". I have see this list structure with "\n" used as a separator and no punctuation a few times.

Also: Great work on the package! Thank you :)

Feature request: relieve constraints on non edsnlp custom attributes

Feature type

Enhance compatibility of EDSnlp custom attributes with potential external pipelines.

Description

in the BaseComponent class, in this commit you added this line :

        Span.set_extension(
            "value",
            getter=lambda span: span._.get(span.label_)
            if span._.has(span.label_)
            else None,
            force=True,
        )`

Doing this, you are enforcing (overwriting if already defined) an attribute which is non edsnlp specific. i.e not named specificaly for your use case, and could be very easily required by anyone using your package as part of a broader pipeline.(see spacy good practices regarding naming components/attributes)

Forcing a getter function means if later on a component were to try to set a value, it would be ignored.

Example of potential conflict:

import spacy
# Setting up a first pipeline
random_nlp = spacy.blank("fr")
random_nlp.add_pipe(
    "eds.terminology", # This component is enforcing the "value" custom attribute
    name="test",
    config=dict(label="Any",terms={}),
)

# Setting up another custom pipeline somewhere else in the code
nlp = spacy.blank("fr")
text = "hello this is a test"
doc = nlp(text)
my_span = doc[0:3]
my_span._.value = "CustomValue" # This raises no error.

assert my_span._.value == "CustomValue" # Error: my_span._.value is None.

WIth my modest experience I would suggest avoiding enforcing attributes in general, but if necessary, renaming the attribute to avoid conflicts. If not possible allowing renaming of the attribute, and/or make sure you let the user know what attributes you are enforcing. (I believe this should be made very clear in the doc specifically in the case of attributes such as "value").

I ran into this conflict upgrading edsnlp from 0.7.4 to 0.9 I might be missing something here and would love to hear what the reasons are if this is absolutely needed.

Feature request: [feature] Feddback annotation EDS-TeVa from Emmanuelle

Description

Hypothesis : Take into account following verbs such as "embolie suspectée"
Negation: Add Elimination to patterns

dates that do not exist

Description

What to do when we found in the text dates that are not possible. Currently, an error is returned

ValueError: day is out of range for month

Should we return None or coerce to last day of the month ? Some action is required for compatibility

How to reproduce the bug

import edsnlp
import spacy

nlp = spacy.blank("eds")
nlp.add_pipe("eds.dates")
text = " Le 31/06/17, la dernière dose."
doc = nlp(text)
ent = doc.spans["dates"][0]
ent._.date.norm()
ent._.date.to_datetime()

Feature request: more measurements

From the experience we have with clinical notes, it seems some measurements could be added, and some discarded or modified.

Angle

Unless you have specific examples, it seems to me that angle are more often written in degrees (and eventually seconds of arcs) than hours and minutes. Units are then represented by °, ' and '' symbols.

Discussion

Angle detection would then be quite close to temperature detection, and might be confusing : °C

Note also that angle and temperature data are often described in context : fièvre à 38.5, température normale, pas de température, ...

Volume

To add volume measures might be useful, and not more difficult than weight and size to implement. This is usually given in perfusion/injection, but also in ophtalmic / ORL prescriptions.

Units : ml, cl, dl, l (+ L replacing l), cc, goutte (1 goutte or droplet = 1 ml)

Sources

https://www.fda.gov/media/88498/download
https://www.drugs.com/article/measurement-conversions.html

Discussion

Not sure whether càs, càc, cuillère à café ou cuilère à soupe should be added.

Weight

Sometimes, weights can be given in a composite format, as e.g. in the case of drugs prescription (2x500mg).

Some hypothesis patterns are too generic

Specifically, in the following patterns, the terms ":\n" and ": \n" are too generic and for instance:

"the patient with covid receives the following care:

ventilation
etc."

import spacy

nlp = spacy.blank("fr")
nlp.add_pipe("eds.hypothesis")
text = "patient atteint de covid recoit les soins suivant :\n- ventilation\n- etc."
doc = nlp(text)
doc.ents[-1]._.hypothesis
True

Installation issues on mac M1/M2 with python 3.9

Description

An error occurs when pip installing edsnlp with python 3.9 on mac M1/M2. It can't be reproduced on intel processors, and it works with python 3.10. Here is the tail of the error:

        ERROR: Failed building wheel for numpy
      Failed to build numpy
      ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Full log here

It could be caused by the version of numpy required that is not compatible with M1/M2 processors

How to reproduce the bug

Run pip install edsnlp on a python3.9 env, on mac M1/M2

Your Environment

Operating System: macos
Python Version Used: 3.9
EDS-NLP Version Used: 0.7.4

Full-hour matching error

Description

Cc @paul-bssr, thank you for reporting this.
Full hours are not correctly matched (time is not part of the matched span)

How to reproduce the bug

import edsnlp

nlp = edsnlp.blank('eds')
nlp.add_pipe('eds.normalizer')
nlp.add_pipe('eds.sentences')
nlp.add_pipe('eds.dates')

assert str(nlp("17/10/2023 18:37").spans['dates'][0]) == "17/10/2023 18:37"
assert str(nlp("17/10/2023 18:00").spans['dates'][0]) == "17/10/2023 18:00"   # Fails here

Problem with date.norm()

Description

There is a bag when calling the .norm() method, it seems that it happens on dates of mode <Mode.UNTIL: 'UNTIL'>

How to reproduce the bug

import edsnlp
import spacy
nlp = spacy.blank("eds")
nlp.add_pipe("eds.dates")
text_bad = "Pas d'arrêt de travail en cours, la patiente poursuivant ses activités jusqu'à ce jour."
text_good = "Ce jour, le patient est ..."
for text in [text_good, text_bad]:
    doc = nlp(text)
    ent = doc.spans["dates"][0]
    print(ent._.date.norm())

Architecture choice on custom extensions

Description

The way we've handled spaCy extensions in EDS-NLP has been erratic at best, with each pipeline declaring its own set of new extensions, cluttering spaCy's Underscore object.

For instance, the pipeline eds.dates, eds.measures and eds.emergency.priority all include a parsing component, and each pipeline saves its result in a different attribute.

We can clearly do better than that by adopting a more holistic and uniform approach.

Proposition

To start off the discussion, here are a few ideas/questions:

Every extensions managed by edsnlp should probably land within a Obect._.edsnlp master attribute. This would avoid cluttering the extensions, and leave more room for the end users.
We should regroup similar extensions as much as possible. In the example above, all three pipelines could write the parsing results to a unique key, for instance value or parsed.
We could introduce a norm key, that contains the normalised variant for a given entity (eg stripped of pollution, accents, etc). A reasonable idea is to provide the text used for matching. The normalised variant should perhaps be computed on the fly, using a getter function ?
To enable the use of getters and methods within the edsnlp extensions, we could use an Underscore object. Not sure if that is overkill and/or incurs significant added complexity.

RegexMatcher with ignore_excluded=True fails on spans

RegexMatcher fails on spans that do not start at the beginning of the doc when ignore_excluded=True

import spacy
from edsnlp.matchers.regex import RegexMatcher

nlp = spacy.blank("fr")
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sentences")

doc = nlp("Premier motif avant pollution. Ceci est ma phrase ======== avec des pollutions. Voici un motif à matcher.")
matcher = RegexMatcher(ignore_excluded=True)
matcher.add("motif", ["motif"])
print(list(matcher(list(doc.sents)[0], as_spans=True)))
# [motif]
print(list(matcher(list(doc.sents)[2], as_spans=True)))
# [avant pollution]
print(list(matcher(doc, as_spans=True)))
# [motif, motif]
print(list(matcher(doc[1:-1], as_spans=True)))
# [Premier, Voici]

Discussion: Using dependency parsing

Description

We might benefit from using a parse tree.

Parsing is a most transverse task, and could provide an easy way to create baselines for many pipeline components (qualifiers, linking events, etc).

However, this is uncharted territory for me. A few questions:

Has anyone tried using models trained on non-speciality French language in the clinical setting?
If re-training is needed, do you have an idea of the effort that's needed in terms of annotation?

Any thoughts?

ModuleNotFoundError: No module named 'edsnlp.matchers.phrase' once edsnlp is installed using pip in conda environment

Description

Can not load spacy model once edsnlp is installed in my virtual environment created using (mini)conda. Installation of spacy via mamba

How to reproduce the bug

import spacy

nlp = spacy.blank("fr")

# ...

Note no edsnlp-module is called

Returns

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/miniconda3/envs/spacy/lib/python3.9/site-packages/spacy/__init__.py", line 74, in blank
    return LangClass.from_config(config, vocab=vocab, meta=meta)
  File "~/miniconda3/envs/spacy/lib/python3.9/site-packages/spacy/language.py", line 1749, in from_config
    nlp = lang_cls(vocab=vocab, create_tokenizer=create_tokenizer, meta=meta)
  File "~/miniconda3/envs/spacy/lib/python3.9/site-packages/spacy/language.py", line 162, in __init__
    util.registry._entry_point_factories.get_all()
  File "~/miniconda3/envs/spacy/lib/python3.9/site-packages/catalogue/__init__.py", line 109, in get_all
    result.update(self.get_entry_points())
  File "~/miniconda3/envs/spacy/lib/python3.9/site-packages/catalogue/__init__.py", line 124, in get_entry_points
    result[entry_point.name] = entry_point.load()
  File "~/miniconda3/envs/spacy/lib/python3.9/importlib/metadata.py", line 86, in load
    module = import_module(match.group('module'))
  File "~/miniconda3/envs/spacy/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "~/Documents/edsnlp/edsnlp/components.py", line 1, in <module>
    from edsnlp.pipelines.factories import *  # noqa : used to import pipelines
  File "~/Documents/edsnlp/edsnlp/pipelines/factories.py", line 2, in <module>
    from .core.advanced.factory import create_component as advanced
  File "~/Documents/edsnlp/edsnlp/pipelines/core/advanced/__init__.py", line 1, in <module>
    from .advanced import AdvancedRegex
  File "~/Documents/edsnlp/edsnlp/pipelines/core/advanced/advanced.py", line 9, in <module>
    from edsnlp.pipelines.core.matcher import GenericMatcher
  File "~/Documents/edsnlp/edsnlp/pipelines/core/matcher/__init__.py", line 1, in <module>
    from .matcher import GenericMatcher
  File "~/Documents/edsnlp/edsnlp/pipelines/core/matcher/matcher.py", line 6, in <module>
    from edsnlp.matchers.phrase import EDSPhraseMatcher
ModuleNotFoundError: No module named 'edsnlp.matchers.phrase'

Your Environment

Operating System: Linux Ubuntu
Python Version Used: 3.9.1
spaCy Version Used: 3.2.1
EDS-NLP Version Used: 0.5.1
Environment Information: conda environment, using mamba to install spacy and spyder, then using pip to install edsnlp

aphp / edsnlp Goto Github PK

edsnlp's Introduction

EDS-NLP

Features

Quick start

Installation

A first pipeline

Documentation & Tutorials

Disclaimer

Contributing to EDS-NLP

Citation

Acknowledgement

edsnlp's People

Contributors

Stargazers

Watchers

Forkers

edsnlp's Issues

Description

How to reproduce the bug

Your Environment

Description

How to reproduce the bug

Your Environment

Feature type

Description

Description

How to reproduce the bug

Your Environment

Feature type

Description

Context

Description:

Feature type

Description

Feature type

Description

Collection types

Acceleration / parallelization mode

Pseudo implementation

Description

How to reproduce the bug

Your Environment

Draft

Starting example:

Proposition

Examples

No replace, no expand, reduce_mode=None

Description

Description

Description

Description

How to reproduce the bug

Suggestions

Description

How to reproduce the bug

Your Environment

Feature type

Description

Score

Description

Proposition

Feature type

Feature type

Description

Description

How to reproduce the bug

Your Environment

Solution temporaire dans class Normalizer

Description

How to reproduce the bug

Your Environment

Feature type

Downloading the resource

Exact & approximate matching

Normalization

Feature type

Description

Feature type

Description

No `replace`, no `expand`, `reduce_mode=None`