Comments (7)

percevalw commented on June 16, 2024

Great idea, I second this. Here are a few points from our IRL discussion.

Since most end users will only use the existing attributes (negated, value, ...) rather than write new ones, we should keep a simple notation under the base _ getter, as is currently the case, while limiting the number of attributes. As suggested, we could therefore use:

  • span._.value for dates, measures, scores, concepts, ...,
  • and span._.negation, span._.rspeech, etc for more syntax-based attributes.

As the norm attribute is now heavily used throughout the lib as a cleaned-up ASCII version of the text, changing it would probably mean refactoring most of the code. To generate semantically normalized text, we can instead use the __str__ and __repr__ methods of the generic value attribute.

Following your dates revamp, the _.value extension could inherit from pydantic.BaseModel, like:

from pydantic import BaseModel

class EDSValue(BaseModel):
    def __str__(self): ...
    def __repr__(self): ...
    ...

class Date(EDSValue):
    ...

class Measure(EDSValue):
    ...

class Concept(EDSValue):
    ...
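To make the idea a bit more concrete, here is a minimal sketch of what such a base class could look like. The `norm` helper, the `value` and `unit` fields, and the output format are assumptions made for illustration only; they are not taken from edsnlp.

```python
from pydantic import BaseModel


class EDSValue(BaseModel):
    """Sketch of a base class for normalized values (hypothetical)."""

    def norm(self) -> str:
        # Hypothetical hook: each subclass returns its normalized text.
        raise NotImplementedError

    def __str__(self) -> str:
        return self.norm()

    def __repr__(self) -> str:
        return f"{type(self).__name__}({self.norm()!r})"


class Measure(EDSValue):
    # Illustrative fields, not the library's actual schema.
    value: float
    unit: str

    def norm(self) -> str:
        # ":g" drops the trailing ".0" of whole numbers.
        return f"{self.value:g} {self.unit}"


weight = Measure(value=70, unit="kg")
print(str(weight))
print(repr(weight))
```

Since pydantic models compare by field values, two `Measure` objects built from the same data would also compare equal, which is convenient for tests on `span._.value`.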

from edsnlp.

bdura commented on June 16, 2024

@percevalw, @Thomzoy, @Aremaki, I'd love to get your thoughts on this!

bdura commented on June 16, 2024

Sounds good! 🎉

percevalw commented on June 16, 2024

As discussed, here is a potential solution that standardizes the current architecture for custom extensions (@Thomzoy @aricohen93).
Each component can create a Span extension named after the label of the entities it creates:

  • eds.adicap creates entities labeled adicap, and adds an ent._.adicap extension containing the decoding information
  • eds.tnm creates entities labeled tnm, and adds an ent._.tnm extension
  • eds.drugs creates entities labeled drug, and adds an ent._.drug extension
  • ...

A specific ._.value extension is defined as an aggregator and retrieves the field associated with the label via a getter such that ent._.value == getattr(ent._, ent.label_). The str representation of ._.value could be the one displayed in the demonstrator.

This way, we can keep a consistent typing of each extension (tnm -> TNMScore, adicap -> AdicapCode, date -> Date, ...), while offering a unique entry point for some use cases via the value extension.

This does not prevent us from defining other extensions if needed, or from keeping the old entity extensions and deprecating them in future versions.
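As a rough illustration of the aggregator described above, here is a sketch using plain spaCy. The `TNMScore` dataclass is a hypothetical stand-in (not the library's class), and the getter checks `has` before `get` because, as far as I know, spaCy's underscore `get` only takes a name.

```python
from dataclasses import dataclass

import spacy
from spacy.tokens import Span


@dataclass(frozen=True)
class TNMScore:
    """Hypothetical stand-in for a typed normalized value."""

    tumour: str
    node: str

    def __str__(self) -> str:
        return f"T{self.tumour}N{self.node}"


# Extension named after the label of the entities the component creates.
Span.set_extension("tnm", default=None, force=True)


def value_getter(span):
    # Return the extension named after the span's label, if there is one.
    label = span.label_
    return span._.get(label) if label and span._.has(label) else None


# Aggregator: a single entry point whatever the label is.
Span.set_extension("value", getter=value_getter, force=True)

nlp = spacy.blank("en")
doc = nlp("Tumour staged T2N0 today")
ent = Span(doc, 2, 3, label="tnm")
ent._.tnm = TNMScore(tumour="2", node="0")

# A span whose label has no matching extension simply has no value.
other = Span(doc, 0, 1, label="unknown")

print(ent._.value, other._.value)
```

Here `ent._.tnm` keeps its own type, while `ent._.value` gives generic access to it.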

percevalw commented on June 16, 2024

@Vincent-Maladiere

percevalw commented on June 16, 2024

Following the discussion with @Thomzoy, we are going ahead with the approach described above:

  • each pipe defines the extensions it needs (negation, scores, etc)
  • the extensions related to a normalized value should be named with the label_ of the entity extracted (if any)
  • the value extension is defined as the following getter: lambda span: span._.get(span.label_, default=None). Having multiple extensions plus an aggregator extension allows multiple pipes to modify a single entity, and to prioritize the normalized value of the entity by setting its label (for instance, to choose between the extraction of eds.drugs, label = drug, and that of eds.umls, label = umls) without losing information between pipes
  • the normalized extension can be anything: an int, a bool, a string, an object, depending on the complexity of the extraction, and should implement the equality operator so that any span1._.value == span2._.value comparison works

For instance, the following (non-exhaustive) modifications should be made:

  • dates: the dates will be labelled as date, to match the date extension (instead of absolute/relative since this info is already stored in the span._.date object)
  • measurements: the label of the extracted measurements becomes measurement, and the previous label (e.g. eds.weight) is added to the normalized span._.measurement object
  • consultation_dates: the label of the consultation_dates spans will become consultation_date and the consultation_date extension will be the extracted date
  • tables: labelled as table and span._.table is a getter to span._.to_pd_table(as_values=True)
  • umls: labelled as umls (this is already the case), and span._.umls becomes a new UMLSConcept(id=the cui, sty=the semantic type) object
    ...
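The prioritization by label could be sketched as follows with plain spaCy. The extension values are placeholder strings, and the getter is written with `has`/`get` rather than a `default` argument, on the assumption that spaCy's underscore `get` only accepts a name.

```python
import spacy
from spacy.tokens import Span

# Each pipe registers the extension named after its own label.
for name in ("drug", "umls"):
    Span.set_extension(name, default=None, force=True)


def value_getter(span):
    # Aggregator: look up the extension named after the current label.
    label = span.label_
    return span._.get(label) if label and span._.has(label) else None


Span.set_extension("value", getter=value_getter, force=True)

nlp = spacy.blank("en")
doc = nlp("Patient takes paracetamol daily")
ent = Span(doc, 2, 3, label="drug")

# Two pipes annotate the same entity (placeholder values).
ent._.drug = "drug-value"
ent._.umls = "umls-value"

value_as_drug = ent._.value

# Relabelling the entity switches which normalized value is exposed...
ent.label_ = "umls"
value_as_umls = ent._.value

# ...while the value set by the other pipe is still available.
kept_drug = ent._.drug

print(value_as_drug, value_as_umls, kept_drug)
```

Because the values here are plain strings, the `==` requirement is trivially met; richer objects would need to define `__eq__` themselves (dataclasses and pydantic models already do).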

percevalw commented on June 16, 2024

These suggestions have been integrated in #213
