microsoft / recognizers-text

Microsoft.Recognizers.Text provides recognition and resolution of numbers, units, date/time, etc. in multiple languages (ZH, EN, FR, ES, PT, DE, IT, TR, HI, NL. Partial support for JA, KO, AR, SV). Packages available at: https://www.nuget.org/profiles/Recognizers.Text, https://www.npmjs.com/~recognizers.text

License: MIT License

C# 41.33% Batchfile 0.14% PowerShell 0.04% JavaScript 9.79% TypeScript 9.65% HTML 0.14% Shell 0.02% Python 19.74% Java 19.14% Dockerfile 0.01%
nlp datetime timex parser-library ner hacktoberfest date entity-extraction number-expression numbers

recognizers-text's Introduction

Microsoft Recognizers Text Overview

Microsoft.Recognizers.Text provides robust recognition and resolution of entities like numbers, units, and date/time; expressed in multiple languages. Full support for Chinese, English, French, Spanish, Portuguese, German, Italian, Turkish, Hindi, and Dutch. Partial support for Japanese, Korean, Arabic, and Swedish. More on the way.

Utilizing the Project

Microsoft.Recognizers.Text powers pre-built entities in LUIS: Language Understanding Intelligent Service, Power Virtual Agents, and Microsoft Bot Framework; base entity types in Text Analytics Cognitive Service; and it is also available as standalone packages (for the base classes and the different entity recognizers).

The Microsoft.Recognizers.Text packages currently target four platforms:

Contributions are greatly welcome! Both for fixes and extensions in the currently supported languages and for expansion to new ones. Especially for Japanese, Korean, Arabic, Swedish, and others! More info below.

.NET is the primary package version and contributions propagate to the other platforms with time.

Citing the Recognizers-Text project

If you use the recognizers in academic work, please cite the project as below (you can omit the version number or update it to a specific version if relevant):

@software{soft:recognizers-text,
  author    = {Wenhao Huang and Zijia Lin and Chris McConnell and B{\"{o}}rje F. Karlsson},
  title     = {{Recognizers-Text}: {R}ecognition and resolution of numbers, units, and date/time entities expressed across multiple languages},
  month     = jul,
  year      = 2017,
  publisher = {Zenodo},
  version   = {1.0.0},
  doi       = {10.5281/zenodo.6860598},
  url       = {https://doi.org/10.5281/zenodo.6860598}
}

Feel free to change "@software" to "@misc" if it better fits your templates.

Help

If you have any questions, please go ahead and open an issue, even if it's not an actual bug. Issues are an acceptable discussion forum as well.

Contributing

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Good starting points for contribution are:

  • the list of open issues (especially those marked as help wanted);
  • the json spec cases temporarily marked as NotSupported (Specs); and
  • translating json test spec cases that work in English, but don't yet exist in a target language.

The links below describe the project structure and provide both an overview and tips on how to contribute (although some steps may have become a little out-of-date). Thank you!

Supported Entities across Cultures

The table below summarizes the currently supported entities. Support for English is usually more complete than others. The primary platform is .NET (shown in table) and support should propagate to the others.

Entity Type EN ZH-CN NL FR DE IT JA KO PT ES
Number (cardinal)
Ordinal
Percentage
Number Range PA/EO
Unit - Age PA/EO
Unit - Currency PA/EO
Unit - Dimensions PA/EO
Unit - Temperature
Choice - Boolean SO
Seq. - E-mail G G* G G G G G* G* G G
Seq. - GUID G G G G G G G G G G
Seq. - Social G G G G G G G G G G
Seq. - IP Address G G G G G G G G G G
Seq. - Phone Number G G G G G G G G G G
Seq. - URL G G* G G G G G* G* G G
DateTime (+subtypes) SO
Entity Type SV BG TR HI AR
Number (cardinal) PA/EO
Ordinal PA/EO
Percentage PA/EO
Number Range PA/EO
Unit - Age
Unit - Currency
Unit - Dimensions
Unit - Temperature
Choice - Boolean
Seq. - E-mail G G G G G
Seq. - GUID G G G G G
Seq. - Social G G G G G
Seq. - IP Address G G G G G
Seq. - Phone Number
Seq. - URL G G G G* G*
DateTime (+subtypes) SP SO
  • G: Generic entity, not language-specific (* unicode TLDs not-supported);
  • EO: Extraction-only (parsing/resolution/normalization pending);
  • PA: Partial support (type not fully supported);
  • SO: Specs-only (test specs coverage OK, but support pending);
  • SP: Partial specs;
  • SI: Very initial specs (typically language support start for a new language).

recognizers-text's People

Contributors

acblacktea, aitelint, aitelintii, aliandi, amitstein, anichikage, chopperman33, dependabot[bot], ejadib, enzocano, gasper-az, grey0202, guom08, haoyangms, imicknl, johnataylor, juanar, matthewshim-ms, neelisha-saxena, neudurgeshp, paradoxarg, pcostantini, pete1854, rubio41, sanxing-chen, songwenhao1, sothan, tellarin, visionshao, wgx998877

recognizers-text's Issues

Support for PT-PT or PT-BR for date-time

I'm interested in date/time for Portuguese. How hard would it be to adapt the existing code for Portuguese strings? Are there any steps one could follow?

Thanks in advance.

[EN DateTimeV2] Information loss when recognizing terms such as 'this week' or 'next week'

Phrases such as 'last week', 'this week' and 'next week' are recognized by the datetime recognizers. This is great, but the resolution goes too far, with the result that the original meaning is lost.

Specifically the utterance is resolved to a daterange relative to now.

Not only is this calculation something the client could easily do for itself, but in some cases doing it client-side has advantages, specifically when the client wants to choose which "now" the calculation is based on. For example, if the utterance is captured and its meaning applied later (next week, say), the original meaning would have been lost.

What would be most useful is for the recognizer to just recognize and not calculate ranges.

There is a precedent for just returning the recognition. Consider the case of "now", which resolves to "PRESENT_REF". It would be far better if a phrase like 'this week' resolved to something similar, such as 'PRESENT_WEEK' or some other symbol the client could switch on. Perhaps the range TIMEX syntax could be used, for example something like (PRESENT_REF,,P7D) for this week or (PRESENT_REF+P7D,,P7D) for next week.
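As a rough illustration of the client-side calculation described above, the sketch below resolves a week offset against a reference date chosen by the caller. The function name `resolveWeek`, the integer offset representation (0 for this week, 1 for next week, -1 for last week), Monday-based weeks, and the exclusive end date are all assumptions made for this example. None of them are part of the library.

```javascript
// Hypothetical helper: resolve a relative week against a caller-supplied "now".
// Assumes weeks start on Monday and returns an exclusive end date.
function resolveWeek(offsetWeeks, referenceDate) {
  // Normalize to midnight UTC so time-of-day and DST do not affect the result.
  const start = new Date(Date.UTC(
    referenceDate.getUTCFullYear(),
    referenceDate.getUTCMonth(),
    referenceDate.getUTCDate()
  ));
  const dayOfWeek = start.getUTCDay() || 7; // Monday=1 ... Sunday=7
  // Step back to Monday of the reference week, then apply the offset.
  start.setUTCDate(start.getUTCDate() - (dayOfWeek - 1) + 7 * offsetWeeks);
  const end = new Date(start.getTime());
  end.setUTCDate(end.getUTCDate() + 7);
  const toIsoDate = (d) => d.toISOString().slice(0, 10);
  return { start: toIsoDate(start), end: toIsoDate(end) };
}

// "next week", resolved relative to 2017-11-13 (a Monday).
console.log(resolveWeek(1, new Date(Date.UTC(2017, 10, 13))));
// { start: '2017-11-20', end: '2017-11-27' }
```

Because the reference date is a parameter, the same captured utterance can be resolved later against whichever "now" the client considers correct.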

[EN DateTimeV2] Support hemisphere property conf, so DatePeriodParser can properly resolve seasons

TestDatePeriodParser.cs

BasicTestFuture("I'll leave this summer", 2016, 6, 1, 2016, 9, 1);
BasicTestFuture("I'll leave in summer", 2017, 6, 1, 2017, 9, 1);
BasicTestFuture("I'll leave in winter", 2016, 12, 1, 2017, 3, 1);
BasicTestFuture("I'll leave in winter, 2017", 2017, 12, 1, 2018, 3, 1);

The resolution of seasons differs between the northern and southern hemispheres. The code should support a configuration parameter for this and assume a default (north, as mentioned in the schema).
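A minimal sketch of what a hemisphere-aware season resolution could look like, written in JavaScript for illustration. The function `seasonRange`, its `hemisphere` parameter, and the lookup tables are hypothetical names. The northern month boundaries are taken from the test cases above, and the southern mapping simply swaps each season with its opposite, which is an assumption.

```javascript
// Start month of each season in the northern hemisphere (from the tests above).
const NORTHERN_START_MONTH = { spring: 3, summer: 6, autumn: 9, winter: 12 };
// Assumption: a southern season spans the months of the opposite northern season.
const OPPOSITE_SEASON = { spring: 'autumn', summer: 'winter', autumn: 'spring', winter: 'summer' };

// Returns [year, month, day] triples; the end date is exclusive.
function seasonRange(season, year, hemisphere = 'north') {
  const effective = hemisphere === 'south' ? OPPOSITE_SEASON[season] : season;
  const startMonth = NORTHERN_START_MONTH[effective];
  if (!startMonth) throw new Error(`Unknown season: ${season}`);
  const endMonth = startMonth + 3;
  return {
    start: [year, startMonth, 1],
    // A season starting in December ends in March of the following year.
    end: endMonth > 12 ? [year + 1, endMonth - 12, 1] : [year, endMonth, 1],
  };
}

console.log(seasonRange('winter', 2016));          // start 2016-12-01, end 2017-03-01
console.log(seasonRange('summer', 2016, 'south')); // start 2016-12-01, end 2017-03-01
```

Defaulting `hemisphere` to `'north'` keeps the existing test expectations unchanged.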

[* Unit] Age recognizer should use TIMEX duration

Rather than returning:
"resolution": {
"value": "50",
"unit": "Year"
},
for the text '50 years old' it would be more consistent to return a TIMEX duration. For this example that would be "P50Y"
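To make the proposal concrete, here is a small sketch of the conversion from the current resolution shape to a TIMEX duration string. The helper `ageToTimexDuration` and the set of unit names in the lookup table are assumptions for this example; only "Year" appears in the output quoted above.

```javascript
// Assumed unit names; only "Year" is confirmed by the example resolution above.
const AGE_UNIT_DESIGNATORS = new Map([
  ['Year', 'Y'],
  ['Month', 'M'],
  ['Week', 'W'],
  ['Day', 'D'],
]);

// Hypothetical helper: { value: "50", unit: "Year" } -> "P50Y"
function ageToTimexDuration(resolution) {
  const designator = AGE_UNIT_DESIGNATORS.get(resolution.unit);
  if (!designator) throw new Error(`Unsupported age unit: ${resolution.unit}`);
  return `P${resolution.value}${designator}`;
}

console.log(ageToTimexDuration({ value: '50', unit: 'Year' })); // P50Y
```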

single number around date

In an input such as "tomorrow 3", the single number "3" should be extracted as a time if there is a date around it.
This should only be effective in calendar mode.

Support date-time alternatives in datetime extraction

Examples:

  • "Would be great to get together next week. Copying person_name here, who can schedule a coffee for us. I could do Monday from 12 to 4 or Wednesday from 10 to 11"
  • "Please come by. Monday 8-9am or 9-10 am works for me."

Such cases could be recognized as a single entity instead of two separate datetimes.

Ambiguity for "8p"

"8p" sometimes means "8pm" and sometimes "8 people".
A field should be added to indicate that the result is ambiguous.

date time recognizers get the ISO week wrong

Try running something like this (using the JavaScript though I think the bug is also in the C# because LUIS also appears to get this wrong.)

dateTimeModel.parse('this week')

Today (I'm writing this on 2017-11-13) the resulting TIMEX expression is 2017-W47.

In other words, the library thinks it is week 47; however, the ISO week is actually week 46.

TIMEX should contain the ISO week number.

For reference there are a bunch of web sites that show this. https://weeknumber.net/ and https://www.epochconverter.com/weeknumbers are a couple of examples.

Alternatively you can use Excel. Open Excel and type =ISOWEEKNUM(TODAY()) into a cell.

Needless to say next week, last week etc. are also wrong.
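For reference, the ISO 8601 week number can be computed with the standard "Thursday of the same week" rule. The sketch below is one way to do that in plain JavaScript and is not the library's implementation. The function names `isoWeek` and `toWeekTimex` are made up for this example.

```javascript
// ISO 8601: weeks start on Monday, and week 1 is the week that contains
// the first Thursday of the year.
function isoWeek(date) {
  // Work in UTC so local DST shifts cannot move the date.
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const dayOfWeek = d.getUTCDay() || 7; // Monday=1 ... Sunday=7
  // Move to the Thursday of the same ISO week; its year is the ISO year.
  d.setUTCDate(d.getUTCDate() + 4 - dayOfWeek);
  const isoYear = d.getUTCFullYear();
  const yearStart = Date.UTC(isoYear, 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / 86400000 + 1) / 7);
  return { year: isoYear, week };
}

// Format as a week TIMEX such as "2017-W46".
function toWeekTimex(date) {
  const { year, week } = isoWeek(date);
  return `${year}-W${String(week).padStart(2, '0')}`;
}

console.log(toWeekTimex(new Date(Date.UTC(2017, 10, 13)))); // 2017-W46
```

Note that the ISO year can differ from the calendar year near year boundaries: 2016-01-01 falls in 2015-W53.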

Bug in handling "every morning"

Every morning should be recognized as a set. "time": "XXXX-XX-XXTMO"

Currently it returns only "morning":
{ "type": "builtin.datetimeV2.timerange", "entity": "morning", "resolution": { "values": [{"timex": "TMO", "type": "timerange", "start": "08:00:00", "end": "12:00:00"}] } }

datetime recognizer should return simple JavaScript objects and arrays

Currently the result of parse appears to contain the newer JavaScript data structures, things like Set and Map.

It would be better to restrict this small result to simple JavaScript object and array structures. This would make them easier to work with: the developer could, for example, trivially use JSON.stringify and then use the simple language operators . and []
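Until the result shape changes, a client could flatten the structures itself. The following is a minimal sketch, assuming the result only nests Map, Set, arrays, plain objects, and primitives. The helper name `toPlain` is hypothetical.

```javascript
// Recursively convert Map -> plain object and Set -> array.
function toPlain(value) {
  if (value instanceof Map) {
    const obj = {};
    for (const [key, inner] of value) obj[String(key)] = toPlain(inner);
    return obj;
  }
  if (value instanceof Set) return Array.from(value, (inner) => toPlain(inner));
  if (Array.isArray(value)) return value.map((inner) => toPlain(inner));
  if (value !== null && typeof value === 'object' && !(value instanceof Date)) {
    const obj = {};
    for (const [key, inner] of Object.entries(value)) obj[key] = toPlain(inner);
    return obj;
  }
  return value; // primitives and Date instances pass through unchanged
}

// Illustrative input only; not an actual recognizer result.
const sample = new Map([['values', new Set([new Map([['timex', 'TMO']])])]]);
console.log(JSON.stringify(toPlain(sample))); // {"values":[{"timex":"TMO"}]}
```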

Unit test failures in JavaScript

When I build the JavaScript code (using the build.cmd as the manual steps appear to be broken) I get three unit tests broken:

5305 passed
3 failed
209 skipped

DateTime - Spanish - DateTimePeriodParser - "Estaré afuera de 2:00pm, 2016-2-21 a 3:32, 04/23/2016"
C:\private\Recognizers-Text\JavaScript\test\runner-datetime.js:100
100: t.is(actual.value.timex, expected.Value.Timex, 'Result.Value.Timex');
Result.Value.Timex
Difference:
  • '(2016-02-21T14:00,2016-04-23T03:32,PT1477H)'
  • '(2016-02-21T14:00,2016-04-23T03:32,PT1478H)'
_.zip.forEach.o (test/runner-datetime.js:100:19)
Test.t [as fn] (test/runner.js:26:35)

DateTime - English - DateTimePeriodParser - "I'll be out from 2:00pm, 2016-2-21 to 3:32, 04/23/2016"
C:\private\Recognizers-Text\JavaScript\test\runner-datetime.js:100
100: t.is(actual.value.timex, expected.Value.Timex, 'Result.Value.Timex');
Result.Value.Timex
Difference:
  • '(2016-02-21T14:00,2016-04-23T03:32,PT1477H)'
  • '(2016-02-21T14:00,2016-04-23T03:32,PT1478H)'
_.zip.forEach.o (test/runner-datetime.js:100:19)
Test.t [as fn] (test/runner.js:26:35)

DateTime - English - DateTimePeriodParser - "I'll be out I'll be out from 2:00pm, 2016-2-21 to 3:32, 04/23/2016"
C:\private\Recognizers-Text\JavaScript\test\runner-datetime.js:100
100: t.is(actual.value.timex, expected.Value.Timex, 'Result.Value.Timex');
Result.Value.Timex
Difference:
  • '(2016-02-21T14:00,2016-04-23T03:32,PT1477H)'
  • '(2016-02-21T14:00,2016-04-23T03:32,PT1478H)'

Bug handling "week of September 30th" as a range

Currently recognizing only "September 30". "date": "XXXX-09-30"

Should be:
{ "entity": "september 30th", "type": "builtin.datetimeV2.date", "resolution": { "values": [{ "timex": "XXXX-09-30", "type": "date", "value": "2016-09-30" },{"timex": "XXXX-09-30","type": "date", "value": "2017-09-30" }]} }

Schema doesn't match LUIS

The schema returned by the recognizers is close to the LUIS schema for Entities but not exactly the same. It should be exactly the same. For example the number recognizer returns "start", "end", and "text" fields. These are "startIndex", "endIndex", and "entity" fields in LUIS.

A developer should be able to work with an entity the same way, regardless of whether it came from a call to LUIS or a call to a recognizer. The schemas need to be identical for that to be true.
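In the meantime, a thin adapter on the client side can rename the three fields mentioned above. This is only a sketch: the function name `toLuisEntity` is hypothetical, the sample input is illustrative, and it assumes "end" and "endIndex" follow the same inclusive/exclusive convention, which should be verified before relying on it.

```javascript
// Hypothetical adapter: rename start/end/text to startIndex/endIndex/entity
// and pass any other fields through unchanged.
function toLuisEntity(result) {
  const { start, end, text, ...rest } = result;
  return { entity: text, startIndex: start, endIndex: end, ...rest };
}

// Illustrative recognizer-style result; the values are made up.
const recognized = { start: 10, end: 14, text: 'fifty', resolution: { value: '50' } };
console.log(toLuisEntity(recognized));
// { entity: 'fifty', startIndex: 10, endIndex: 14, resolution: { value: '50' } }
```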

datetime recognizer inconsistent with LUIS

When I give LUIS the string "in 5 minutes" it resolves that to a datetime "2017-11-13T21:39:41".

However, when I use the local JavaScript recognizer, it just recognizes the "5 minutes" and produces a duration "PT5M".

The local JavaScript recognizer should have found the datetime.
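As a stopgap, a client can turn the recognized duration into a datetime by adding it to a reference time. The sketch below handles only simple time durations of the form PT#H#M#S. The helper `addTimeDuration` is hypothetical, and the reference time 21:34:41 is an assumption chosen so that the result matches the LUIS value quoted above.

```javascript
// Hypothetical helper: add a simple TIMEX time duration (e.g. "PT5M") to a reference Date.
function addTimeDuration(reference, timex) {
  const match = /^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$/.exec(timex);
  if (!match || timex === 'PT') throw new Error(`Unsupported duration: ${timex}`);
  const [hours, minutes, seconds] = match.slice(1).map((part) => Number(part || 0));
  const totalSeconds = (hours * 60 + minutes) * 60 + seconds;
  return new Date(reference.getTime() + totalSeconds * 1000);
}

// Assumed reference time ("now") for illustration.
const now = new Date(Date.UTC(2017, 10, 13, 21, 34, 41));
console.log(addTimeDuration(now, 'PT5M').toISOString().slice(0, 19)); // 2017-11-13T21:39:41
```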

[* Number] Percentage recognizer resolution should resolve to fraction

Recognizing an utterance such as 'fifty percent' as a percentage value is great but it would be more helpful to client code if the resolution was to a floating point value.

Specifically: an utterance such as 'fifty percent' results in:
"resolution": {
"value": "50%"
},

Given the type has been correctly identified as percentage it would be better for the client code to receive a floating point number.

This would also fit more naturally with NumberFormat where the client could use code like this:

        const formatter = new Intl.NumberFormat('en-US', { style: 'percent' });
        return formatter.format(obj.value);

This would assume obj.value to be 0.5 for 50%.
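With the current string resolution, the client has to do this conversion itself. A minimal sketch, assuming the resolution value is always a decimal number followed by a percent sign; `percentToFraction` is a hypothetical helper name:

```javascript
// Hypothetical helper: "50%" -> 0.5
function percentToFraction(value) {
  const match = /^\s*(-?\d+(?:\.\d+)?)\s*%\s*$/.exec(value);
  if (!match) throw new Error(`Not a percentage string: ${value}`);
  return Number(match[1]) / 100;
}

const percentFormatter = new Intl.NumberFormat('en-US', { style: 'percent' });
const fraction = percentToFraction('50%');
console.log(fraction);                          // 0.5
console.log(percentFormatter.format(fraction)); // 50%
```

If the recognizer returned the fraction directly, the parsing step would disappear and only the formatting call would remain.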
