axa-group / nlp.js
An NLP library for building bots, with entity extraction, sentiment analysis, automatic language identification, and more.
License: MIT License
Add a library that generates API documentation from JSDoc comments, add the task to package.json, generate a first version of the documentation (it will surely need refinement), and upload it.
Remember to add the documentation path to .npmignore, because the docs are not needed in the npm package.
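As a sketch of the task wiring (the tool choice and output path are assumptions, not decided in this issue; `jsdoc` is just one candidate generator), the package.json script could look like:

```json
{
  "scripts": {
    "docs": "jsdoc -r -d docs lib"
  }
}
```

The matching .npmignore entry would then be the same `docs` path.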
Is your feature request related to a problem? Please describe.
Japanese currently only works with katakana, because the Natural stemmer only supports katakana.
Describe the solution you'd like
Support Katakana, Hiragana and Jōyō Kanji; perhaps it can be achieved with a Kanji -> Hiragana -> Katakana translation.
Describe alternatives you've considered
Doing complete stemming on hiragana and katakana. Synonyms over Kanji. Translating to base romaji.
Describe the bug
Some languages that should work with language detection are not working. The problem seems to be that franc uses one set of language codes and this package uses another.
To Reproduce
const { Language } = require('node-nlp');
const language = new Language();
console.log(language.guessBest('你叫什么名字?')); // Returns `undefined`
Expected behavior
Should show the language is Chinese.
Additional context
zh (or zho) is the language code used in lib/language/languages.json, but franc uses cmn to represent Mandarin Chinese. Updating the 3-character code to match the one from franc seems to work.
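One possible direction is a small translation table from franc's codes to the codes used in languages.json. This is only a sketch; the function name and the table entries are illustrative, not the library's actual data:

```javascript
// Map franc's ISO 639-3 output to the codes used by this package.
// Only illustrative entries are shown; the real table would cover
// every language whose codes differ between the two sources.
const francToLocal = {
  cmn: 'zho', // Mandarin Chinese
};

// Normalize a code coming from franc before looking it up
// in lib/language/languages.json.
function normalizeFrancCode(code) {
  return francToLocal[code] || code;
}

console.log(normalizeFrancCode('cmn')); // 'zho'
console.log(normalizeFrancCode('eng')); // 'eng' (unchanged)
```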
Is browser support on the roadmap? I have tried to use this module in an Angular 6 application.
Hello, I've built an app that constantly trains new data.
However, it seems I cannot reliably use the NLP server while a train()/save() is in progress.
What is the best solution to overcome this issue?
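One common workaround is to train a second instance in the background and atomically swap it in once training finishes, so the serving instance is never mid-train. This is a sketch of the general pattern only, not node-nlp's API; `buildAndTrainManager` is a hypothetical function standing in for your own training code:

```javascript
// Double-buffering: serve from `active` while a candidate trains.
let active = null;

// Hypothetical: builds a fresh manager and trains it on current data.
async function buildAndTrainManager(data) {
  // ... create manager, addDocument for each item, await train() ...
  return { process: (utterance) => `trained on ${data.length} items` };
}

async function retrain(data) {
  const candidate = await buildAndTrainManager(data);
  active = candidate; // swap is a single assignment: no torn state
}

// Requests always hit a fully trained instance (or none yet).
function handleRequest(utterance) {
  if (!active) throw new Error('model not ready');
  return active.process(utterance);
}

retrain(['a', 'b']).then(() => console.log(handleRequest('hello')));
// prints 'trained on 2 items'
```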
Describe the solution you'd like
Describe alternatives you've considered
Load lists of synonyms that are available on GitHub (like WordNet)
It's not urgent
Describe the bug
When indicating toCity and date, and after replying to the fromCity question, the answer misses fromCity even if the context is complete.
Same bug when indicating fromCity + date.
It may be related to the builtin (date) working as the final slot? The order in addSlot shouldn't matter (?)
To Reproduce
[After upgrading to 2.1.1 (I spent 3h pulling my hair out with 2 different versions of node-nlp on the same machine)]
I can't speak for the MSBot, but the adaptation to the terminal exposed this behavior:
travel to mpl today
bot> From where you are traveling?
bcn
bot>You want to travel from to mpl today
travel from bcn today
bot> Where do you want to go?
mpl
{ date: 'today', fromCity: 'bcn', toCity: 'mpl' }
bot> You want to travel from bcn to today
Expected behavior
You want to travel from bcn to mpl today
That only works for:
travel to mpl
bcn
today
and
travel from bcn
mpl
today
Additional info
now
is correctly interpreted as a datetime entity, but it does not show in the final answer AND it removes toCity (shifting?):
now
undefined (say(result))
{ toCity: 'mpl', fromCity: 'bcn', datetime: 'now', date: 'now' } (say(context))
bot> You want to travel from to mpl
I wish I could help more in the future and hope QT helps 😄
Describe the bug
It looks like the entity is set using builtin entity recognition only.
Therefore, you can't extract what you want using a regex.
To Reproduce
NER:
Using the example, {{hashtag}} converts to #proudtobeaxa
Expected behavior
It should extract proudtobeaxa
as %hashtag%, since the group doesn't include "#" in the NER regex group.
Additional question
Is there any way to extract 2 groups from a regex like /\b\#(\w+)[, ]\#(\w+)\b/ig
into %hashtag1% %hashtag2%?
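As a point of reference outside the library, plain JavaScript can pull both capture groups out of such a regex with matchAll. This is just standard regex behavior, not node-nlp's NER; the regex shown is a slight variant of the one in the question:

```javascript
// Extract both hashtag capture groups from an utterance.
const re = /\B#(\w+)[, ]+#(\w+)\b/g;
const text = 'Loving it #proudtobeaxa, #nlpjs today';

const tags = [];
for (const match of text.matchAll(re)) {
  tags.push(match[1], match[2]); // the groups, without the leading '#'
}

console.log(tags); // [ 'proudtobeaxa', 'nlpjs' ]
```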
Describe the bug
Currently a context object can be provided to NlpManager.process(). The NLG answers can be conditional, based on conditions over context variables. This is described in the tests and in the Excel file provided with the tests ("It should use context if conversation id is provided"):
https://github.com/axa-group/nlp.js/blob/master/test/recognizer/recognizer.test.js
Also, the Microsoft Recognizer automatically creates a context manager, so when inside a conversation it adds the last retrieved entities into the conversation.
This allows something like the following. Suppose that you have two intents:
This makes it possible to get a conversation like:
user> Who is Spiderman?
bot> Peter Parker
user> Which are his powers?
bot> Super-agility
As you can see, the second question does not contain the name of the hero, because it is automatically stored in the context, so the user can continue the conversation without having to repeat it in each question.
Is your feature request related to a problem? Please describe.
A domain is a logical grouping of intents under a common topic.
This also opens the way to having prebuilt domains. For example, a domain can be "personality", and with one single line of code the bot will be prefilled with common personality questions and answers. This will be discussed and developed in another topic.
Describe the solution you'd like
When adding an intent, a domain can be specified. If not specified, it will be "default", so the name of the domain is optional. As internally it can be a characteristic of an intent, there is no need to add a class and refactor the NLP manager load and save. So it will simply be a property of each intent.
Describe alternatives you've considered
Another alternative is to have it as a tree, so the NLP Manager has a Domain Manager, and each Domain Manager contains the intents. This would be useful if Domain were a more complex class, with more properties or methods, but that is not the case, and the KISS principle must be followed.
brain.js was updated from 1.4.5 to 1.4.6. This version is covered by your current version range, and after updating it in your project the build failed.
brain.js is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.
Adds toFunction()
to RNNTimeStep and a number of fixes to do with hidden layers in recurrent nets.
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
Describe the bug
Named Entity Recognition is not working as expected:
To Reproduce
const { NlpManager } = require('node-nlp');
const manager = new NlpManager({ languages: ['en'] });
manager.addRegexEntity('mail', /\b(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})\b/ig);
manager.addNamedEntityText('location', 'barcelona', ['en'], ['Barcelona', 'Barna']);
manager.addNamedEntityText('location', 'madrid', ['en'], ['Madrid']);
const result = manager.process('en', 'My mail is [email protected] and i live in madrid', {});
console.log(result);
Expected behavior
Currently it returns:
[ { start: 11,
end: 20,
levenshtein: 0,
accuracy: 1,
option: 'barcelona',
sourceText: 'Barcelona',
entity: 'location',
utteranceText: 'barcelona' },
{ start: 11,
end: 33,
accuracy: 1,
sourceText: '[email protected]',
utteranceText: '[email protected]',
entity: 'mail' } ]
It should return:
[ { start: 11,
end: 33,
accuracy: 1,
sourceText: '[email protected]',
utteranceText: '[email protected]',
entity: 'mail' },
{ start: 48,
end: 53,
levenshtein: 0,
accuracy: 1,
option: 'madrid',
sourceText: 'Madrid',
entity: 'location',
utteranceText: 'madrid' }
]
Describe the bug
The string "Thu, Nov 1, 2018 at 5:06 PM" is not recognised as a date/datetime, despite being a very common way to describe a date. Ideally I would like to be able to recognise the whole datetime, but even getting the date from "Thu, Nov 1, 2018" would be helpful.
As a subset of this problem, I have noticed that dates in the format "7 Nov 2018" are not recognised either.
To Reproduce
Steps to reproduce the behavior:
import { NerManager } from "node-nlp";

const manager = new NerManager();

(async () => {
  const results = await manager.findEntities(
    "This message was sent: Thu, Nov 1, 2018 at 5:06 PM, and I expect a reply before 7 Nov 2018"
  );
  const dates = results.filter(entity => entity.entity === "date");
  console.log(dates); // []
})();
Expected behavior
Output something like:
[
{
"start": 0,
"end": 9,
"len": 10,
"accuracy": 0.95,
"sourceText": "Thu, Nov 1, 2018 at 5:06 PM",
"utteranceText": "Thu, Nov 1, 2018 at 5:06 PM",
"entity": "date",
"resolution": {
"type": "date",
"timex": "2018-11-01",
"strValue": "2018-11-01",
"date": "2018-11-01T17:06:00.000Z"
}
},
{
"start": 0,
"end": 9,
"len": 10,
"accuracy": 0.95,
"sourceText": "7 Nov 2018",
"utteranceText": "7 Nov 2018",
"entity": "date",
"resolution": {
"type": "date",
"timex": "2018-11-07",
"strValue": "2018-11-07",
"date": "2018-11-07T00:00:00.000Z"
}
}
]
Screenshots
NA
Desktop (please complete the following information):
Additional context
The rest of the entity extraction works like a charm (numbers, emails, etc.).
I would be happy to help out on this if somebody could point me in the right direction 😀
Describe the bug
https://github.com/axa-group/nlp.js/blob/master/docs/nlp-classifier.md
It must be updated to include Tamil (ta), making 27 languages instead of 26.
Describe the bug
The NLU benchmark is currently returning a score of 0.89 instead of 0.90. This is due to the fact that manager.train() is called synchronously and not awaited, so the process starts before training ends. The fix is simply to await the manager.train();
To Reproduce
Steps to reproduce the behavior:
Is your feature request related to a problem? Please describe.
Training time can be reduced by using worker_threads (when available, based on the Node version).
Describe the solution you'd like
Computing the thetas does a gradient descent that is independent for each classification label, so the calculation of each theta could theoretically be executed in a thread:
https://github.com/axa-group/nlp.js/blob/master/lib/math/mathops.js#L165
I would like to know how I can isolate a specific entity. I don't know if it's a bug, but I would like to isolate an entity in my intent with this pattern:
'%BOOK% %PAGE_START% %PARAGRAPH_START%'
In the result I have PAGE_START duplicated and PARAGRAPH_START duplicated:
...
"intent": "[BOOK] search_paragraph",
"domain": "default",
"score": 0.9987136557407928,
"entities": [
{
"start": 0,
"end": 2,
"len": 3,
"levenshtein": 0,
"accuracy": 1,
"option": "DAILY_PLANET",
"sourceText": "Daily",
"entity": "BOOK",
"utteranceText": "dai"
},
{
"start": 4,
"end": 4,
"len": 1,
"levenshtein": 0,
"accuracy": 1,
"option": "1",
"sourceText": "2",
"entity": "PAGE_START",
"utteranceText": "2"
},
{
"start": 6,
"end": 6,
"len": 1,
"levenshtein": 0,
"accuracy": 1,
"option": "1",
"sourceText": "3",
"entity": "PAGE_START",
"utteranceText": "3"
},
{
"start": 4,
"end": 4,
"len": 1,
"levenshtein": 0,
"accuracy": 1,
"option": "1",
"sourceText": "2",
"entity": "PARAGRAPH_START",
"utteranceText": "2"
},
{
"start": 6,
"end": 6,
"len": 1,
"levenshtein": 0,
"accuracy": 1,
"option": "1",
"sourceText": "3",
"entity": "PARAGRAPH_START",
"utteranceText": "3"
}
],
...
I would like to have only 3 entities in the response (not the duplicated PAGE_START and PARAGRAPH_START).
How can I achieve that? Is it a bug?
Hi, how do I do Q&A from a dataset like SQuAD? I have a Turkish dataset; I will change the configuration for Turkish, but I don't know how to do it. Do you have any Q&A script for training data from the SQuAD dataset? Or how do I do it?
Is your feature request related to a problem? Please describe.
Currently the documentation is only in the README.md, and as it grows, it is becoming bigger and harder to read.
Describe the solution you'd like
Create a docs folder, and split documentation into smaller md files with partial information, and the README.md should contain the basic information to work with the library (install, basic usage, license...), and a Table of Contents pointing to the correct md file and hash.
DON'T READ, scroll to the end
Describe the bug
Replying with a city name is not correctly interpreted.
To Reproduce
See previous bug report. Same code. No error.
i want to travel today to London
From where you are traveling?{"locale":"en","localeIso2":"en","language":"English","utterance":"i want to travel today to London","classification":[{"label":"travel","value":1}],"intent":"travel","domain":"default","score":1,"entities":[{"start":17,"end":21,"len":5,"accuracy":0.95,"sourceText":"today","utteranceText":"today","entity":"date","resolution":{"type":"date","timex":"2018-10-15","strValue":"2018-10-15","date":"2018-10-15T00:00:00.000Z"}},{"type":"afterLast","start":26,"end":31,"len":6,"accuracy":0.99,"sourceText":"London","utteranceText":"London","entity":"toCity"}],"sentiment":{"score":-0.275,"comparative":-0.03928571428571429,"vote":"negative","numWords":7,"numHits":1,"type":"senticon","language":"en"},"srcAnswer":"From where you are traveling?","answer":"From where you are traveling?","slotFill":{"localeIso2":"en","intent":"travel","entities":[{"start":17,"end":21,"len":5,"accuracy":0.95,"sourceText":"today","utteranceText":"today","entity":"date","resolution":{"type":"date","timex":"2018-10-15","strValue":"2018-10-15","date":"2018-10-15T00:00:00.000Z"}},{"type":"afterLast","start":26,"end":31,"len":6,"accuracy":0.99,"sourceText":"London","utteranceText":"London","entity":"toCity"}],"answer":"You want to travel from to London today","srcAnswer":"You want to travel from {{ fromCity }} to {{ toCity }} {{ date }}","currentSlot":"fromCity"}}. :( (-0.275)
Barcelona
Sorry, I don't understand, {"locale":"en","localeIso2":"en","language":"English","utterance":"Barcelona","classification":[{"label":"travel","value":0.5}],"intent":"None","domain":"default","score":1,"entities":[],"sentiment":{"score":0,"comparative":0,"vote":"neutral","numWords":1,"numHits":0,"type":"senticon","language":"en"}}.
Expected behavior
Record the reply as the city name and give the final answer You want to travel from {{ fromCity }} to {{ toCity }} {{ date }}
Additional context
Updated node to 8.12. No more regex error.
Should I add city name entities?
It looks like currentSlot is forgotten after output.
Maybe my code flushes the memory state (waiting for input to fill the slot).
Here's the generated model:
{
"settings": {
"fullSearchWhenGuessed": true,
"useNlg": true,
"useNeural": true
},
"languages": [
"en"
],
"intentDomains": {
"travel": "default"
},
"nerManager": {
"settings": {},
"threshold": 0.8,
"builtins": [
"Number",
"Ordinal",
"Percentage",
"Age",
"Currency",
"Dimension",
"Temperature",
"DateTime",
"PhoneNumber",
"IpAddress",
"Boolean",
"Email",
"Hashtag",
"URL"
],
"namedEntities": {
"fromCity": {
"type": "trim",
"name": "fromCity",
"localeFallback": {
"*": "en"
},
"locales": {
"en": {
"conditions": [
{
"type": "between",
"options": {
"skip": [
"travel"
]
},
"leftWords": [
"from"
],
"rightWords": [
"to"
],
"regex": "/(?<= from )(.*)(?= to )/gi"
},
{
"type": "afterLast",
"options": {
"skip": [
"travel"
]
},
"words": [
"from"
]
}
]
}
}
},
"toCity": {
"type": "trim",
"name": "toCity",
"localeFallback": {
"*": "en"
},
"locales": {
"en": {
"conditions": [
{
"type": "between",
"options": {
"skip": [
"travel"
]
},
"leftWords": [
"to"
],
"rightWords": [
"from"
],
"regex": "/(?<= to )(.*)(?= from )/gi"
},
{
"type": "afterLast",
"options": {
"skip": [
"travel"
]
},
"words": [
"to"
]
}
]
}
}
}
}
},
"slotManager": {
"travel": {
"toCity": {
"intent": "travel",
"entity": "toCity",
"mandatory": true,
"locales": {
"en": "Where do you want to go?"
}
},
"fromCity": {
"intent": "travel",
"entity": "fromCity",
"mandatory": true,
"locales": {
"en": "From where you are traveling?"
}
},
"date": {
"intent": "travel",
"entity": "date",
"mandatory": true,
"locales": {
"en": "When do you want to travel?"
}
}
}
},
"classifiers": [
{
"language": "en",
"docs": [
{
"intent": "travel",
"utterance": [
"i",
"want",
"to",
"travel",
"from",
"fromciti",
"to",
"tociti",
"date"
]
}
],
"features": {
"i": 1,
"want": 1,
"to": 2,
"travel": 1,
"from": 1,
"fromciti": 1,
"tociti": 1,
"date": 1
},
"logistic": {
"observations": {
"travel": [
[
1,
2,
3,
4,
5,
6,
7
]
]
},
"labels": [
"travel"
],
"observationCount": 1
},
"useNeural": true,
"neuralClassifier": {
"settings": {
"config": {
"activation": "leaky-relu",
"hiddenLayers": [],
"learningRate": 0.1,
"errorThresh": 0.0005
}
},
"classifierMap": {}
}
}
],
"responses": {
"en": {
"travel": [
{
"response": "You want to travel from {{ fromCity }} to {{ toCity }} {{ date }}"
}
]
}
}
}
We are close to publishing version 2.0.0, with async process, transformations, slot filling... and many more features.
How should we evolve to version 3.0.0? From my point of view, evolving to a monorepo with lerna could be positive.
Pros:
Cons:
What do you think?
Hi,
Is your feature request related to a problem? Please describe.
Is it possible to use addRegexEntity in the manager to create a wildcard?
NlpManager produces false positives with score 1
Sample code:
const { NlpManager } = require('node-nlp');
const manager = new NlpManager({
languages: ['de'],
});
manager.addDocument('de', 'ich will auto kaufen', 'buy');
(async () => {
await manager.train();
console.log(await manager.process('ich will auto kaufen'));
console.log(await manager.process('ich habe hunger'));
})();
Result with version 2.0.2:
{ locale: 'de', localeIso2: 'de', language: 'German', utterance: 'ich will auto kaufen', classification: [ { label: 'buy', value: 0.9975597509481738 } ], intent: 'buy', domain: 'default', score: 0.9975597509481738, entities: [], sentiment: { score: 0, comparative: 0, vote: 'neutral', numWords: 4, numHits: 0, type: 'senticon', language: 'de' } } { locale: 'de', localeIso2: 'de', language: 'German', utterance: 'ich habe hunger', classification: [ { label: 'buy', value: 0.8180665881599193 } ], intent: 'buy', domain: 'default', score: 0.8180665881599193, entities: [], sentiment: { score: -0.0565, comparative: -0.018833333333333334, vote: 'negative', numWords: 3, numHits: 1, type: 'senticon', language: 'de' } }
Result with the current version:
{ locale: 'de', localeIso2: 'de', language: 'German', utterance: 'ich will auto kaufen', classification: [ { label: 'buy', value: 1 } ], intent: 'buy', domain: 'default', score: 1, entities: [], sentiment: { score: 0, comparative: 0, vote: 'neutral', numWords: 4, numHits: 0, type: 'senticon', language: 'de' } } { locale: 'de', localeIso2: 'de', language: 'German', utterance: 'ich habe hunger', classification: [ { label: 'buy', value: 1 } ], intent: 'buy', domain: 'default', score: 1, entities: [], sentiment: { score: -0.0565, comparative: -0.018833333333333334, vote: 'negative', numWords: 3, numHits: 1, type: 'senticon', language: 'de' } }
Hello,
I have built an app where users constantly train new intents/utterances and answers into the model, and the model constantly retrains itself every 4 minutes.
I have noticed that the scores of the utterances that users ask keep decreasing.
For example, one user created an utterance/intent for "What is your favorite color?". Typing the exact utterance used to return a score of ~0.9. Now it returns ~0.3.
What is the cause of this, and how can I reliably solve this problem?
Is your feature request related to a problem? Please describe.
It's not a problem, but it would be much easier to maintain the project with Prettier (fewer diffs, a single code style).
Describe the solution you'd like
Add prettier, plus the lint-staged/husky packages, to format code on commit.
What about migrating the codebase to TS? Not the whole codebase at once, but at least add support for tsc to make it possible to add types later.
Is your feature request related to a problem? Please describe.
The .nlp files are very big. Most of the space is consumed by the classification matrix of each label and the weights matrix. The classification matrix has 0/1 values per cell, where each cell position represents a feature, so it can be represented in a more compact way.
Describe the solution you'd like
Currently the classification matrix of each label is stored like this (imagine 20 features):
[ [ 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
[ 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ],
[ 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 ] ]
The more features, the more zeros there will be in each vector, so the bigger the model, the more space is saved.
The proposal is to store each vector as an object. Imagine that the features are labeled "feat0", "feat1", ...
Then the previous matrix can be stored as:
[ { feat3: 1, feat4: 1, feat9: 1 },
{ feat3: 1, feat4: 1, feat8: 1, feat9: 1 },
{ feat4: 1, feat9: 1, feat19: 1 } ]
This compression should be done in the save() method of the NLP Manager, along with a decompression from object back to vector. For the decompression to be faster, avoid the use of indexOf on an array of features; a dictionary object that relates feature name to position should be used instead. The idea for the decompression is to generate a zero vector with length equal to the number of features, and then put ones in the positions of the features.
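A minimal sketch of the proposed compression and decompression (the feature names and shapes are illustrative; the real save() integration would differ):

```javascript
// Compress a 0/1 vector into a sparse object keyed by feature name.
function compressVector(vector, featureNames) {
  const obj = {};
  vector.forEach((value, i) => {
    if (value) obj[featureNames[i]] = 1;
  });
  return obj;
}

// Decompress using a name -> position dictionary (no indexOf).
function decompressVector(obj, positionByName, numFeatures) {
  const vector = new Array(numFeatures).fill(0);
  Object.keys(obj).forEach(name => {
    vector[positionByName[name]] = obj[name];
  });
  return vector;
}

const names = ['feat0', 'feat1', 'feat2', 'feat3', 'feat4'];
const positions = {};
names.forEach((name, i) => { positions[name] = i; });

const compressed = compressVector([0, 0, 0, 1, 1], names);
console.log(compressed); // { feat3: 1, feat4: 1 }
console.log(decompressVector(compressed, positions, 5)); // [ 0, 0, 0, 1, 1 ]
```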
Is your feature request related to a problem? Please describe.
Not related to a specific problem. It would be awesome to have a way of importing and exporting models from sources other than files. Examples: from a persistent database or from memory.
Describe the solution you'd like
I imagine that the following class methods could work:
NlpManager.import(data) takes a string, parses it as JSON, and incorporates it into the class.
NlpManager.export() returns a JSON string (or maybe a plain object) that can be saved anywhere (a database, a variable, etc.).
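A sketch of the general pattern, not node-nlp's internals (the state fields here are placeholders), showing how the two methods could mirror each other:

```javascript
// Generic export/import pattern: serialize the trained state to a
// string, and rebuild an equivalent instance from that string later.
class Model {
  constructor() {
    this.settings = {};
    this.intents = {};
  }

  export() {
    // A plain-object snapshot of everything needed to rebuild.
    return JSON.stringify({ settings: this.settings, intents: this.intents });
  }

  import(data) {
    const parsed = JSON.parse(data);
    this.settings = parsed.settings;
    this.intents = parsed.intents;
  }
}

const a = new Model();
a.intents.greet = ['hello', 'hi'];

const b = new Model();
b.import(a.export()); // b now mirrors a, no filesystem involved
console.log(b.intents.greet); // [ 'hello', 'hi' ]
```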
Describe alternatives you've considered
I've considered extending the original NlpManager class. In fact, I've done it (check it out).
Although my alternative definitely works, I feel like this is a feature that would be very useful if it were integrated into the library. Also, any additional changes in how nlp.js handles models would probably break any class extension if not considered correctly.
Additional context
GIST: Extending NlpManager to add .export() and .import()
I can definitely submit a PR with the .import() and .export() methods in NlpManager, but I figured it would be better to submit an issue first in case this is something you've already considered, or I'm missing something.
Thank you for this library, it's pretty awesome! 👋
When I try the example script, I get this error:
mrpeker@MrPeker-MacBook-Air ~/Desktop/nlp.js-master/examples/console-bot node index.js 1 ↵ 3354 02:08:15
Say something!
hello
/Users/mrpeker/Desktop/nlp.js-master/examples/console-bot/index.js:51
if (result.sentiment.score !== 0) {
^
TypeError: Cannot read property 'score' of undefined
at Interface.rl.on (/Users/mrpeker/Desktop/nlp.js-master/examples/console-bot/index.js:51:26)
at Interface.emit (events.js:180:13)
at Interface._onLine (readline.js:285:10)
at Interface._normalWrite (readline.js:433:12)
at ReadStream.ondata (readline.js:144:10)
at ReadStream.emit (events.js:180:13)
at addChunk (_stream_readable.js:274:12)
at readableAddChunk (_stream_readable.js:261:11)
at ReadStream.Readable.push (_stream_readable.js:218:10)
at TTY.onread (net.js:581:20)
Describe the bug
When guessing a very short utterance, the classification is correct, but the language is guessed as Spanish, hence not giving the English reply.
User: [email protected]
AxaBot:
Sorry, I don't understand, {"locale":"es","localeIso2":"es","language":"Spanish","utterance":"[email protected]","classification":[{"label":"email2","value":0.9819053927264487},{"label":"email","value":0.6188665942300879},{"label":"realname","value":0.12743325382186388},{"label":"whois","value":0.08484683107101253},{"label":"whereis","value":0.08484683107101253},{"label":"hashtag","value":0.029910769295364337}],"intent":"email2","domain":"default","score":0.9819053927264487,"entities":[{"start":0,"end":14,"accuracy":1,"sourceText":"[email protected]","utteranceText":"[email protected]","entity":"mail"}],"sentiment":{"score":0,"comparative":0,"vote":"neutral","numWords":3,"numHits":0,"type":"senticon","language":"es"}}.
Expected behavior
Default to any chosen language, or better, be able to force the language for this input (once we have guessed the preferred language from prior talks and written it to the user's settings).
Describe the bug
Out of memory error when training on a public set. Full error message below.
To Reproduce
Download the train.csv file, unzip it, and adapt it to XLS (NER is empty):
NLP:
intent | language | utterance
EAP | en | id26305, This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.
... (19568 lines!)
NLG:
intent | condition | language | response
EAP | | en | EAP
MWS | | en | MWS
HPL | | en | HPL
Expected behavior
A trained model saved to model.nlp
More globally
Could we have an example test script using XLS?
**Full error message**
Training, please wait..
<--- Last few GCs --->
[5192:000FA808] 298302 ms: Mark-sweep 666.6 (736.9) -> 666.6 (736.9) MB, 596.3 / 0.0 ms allocation failure GC in old space requested
[5192:000FA808] 298985 ms: Mark-sweep 666.6 (736.9) -> 666.6 (720.9) MB, 682.9 / 0.0 ms last resort GC in old space requested[5192:000FA808] 299666 ms: Mark-sweep 666.6 (720.9) -> 666.6 (720.4) MB, 681.5 / 0.0 ms last resort GC in old space requested
<--- JS stacktrace --->
==== JS stack trace =========================================
Security context: 048961D9 <JSObject>
1: /* anonymous */(aka /* anonymous */) [E:\disctut\node_modules\node-nlp\lib\nlp\nlp-classifier.js:~195] [pc=0C2E84FB](this=33C0417D <undefined>,srcToken=01E45131 <String[7]: id03416>)
2: arguments adaptor frame: 3->1
3: forEach(this=18768569 <JSArray[35184]>)
4: tokensToNeural [E:\disctut\node_modules\node-nlp\lib\nlp\nlp-classifier.js:195] [bytecode=01F6C4AD offset=153](this=061E773...
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
1: node_module_register
2: v8::internal::Factory::NewFixedArray
3: v8::internal::HashTable<v8::internal::SeededNumberDictionary,v8::internal::SeededNumberDictionaryShape>::IsKey
Describe the bug
The language support documentation is not up to date with the latest builtin entity extraction documentation.
Language support: https://github.com/axa-group/nlp.js/blob/master/docs/language-support.md
Builtin entity extraction: https://github.com/axa-group/nlp.js/blob/master/docs/builtin-entity-extraction.md
Is your feature request related to a problem? Please describe.
For machine learning, I need to tokenize and stem words.
Describe the solution you'd like
Create a function to tokenize and stem.
Describe alternatives you've considered
NaN
Additional context
My project https://github.com/ran-j/ChatBotNodeJS/blob/master/routes/index.js#L40
Is your feature request related to a problem? Please describe.
Currently a logistic regression classifier is used. Logistic regression provides a great way to classify, but requires a CPU-intensive process to train.
A Bayes classifier, on the other hand, doesn't need a CPU-intensive process to train, so it can grow progressively without consuming time.
Why is a Bayes classifier useful?
While a frontend is being developed, users will be able to add intents and try the bot on the fly. If a logistic regression classifier is used while the user teaches the bot, too much time will be consumed, because the user will train very often. One solution is to use a Bayes classifier while teaching, and then train with a logistic regression classifier when deploying.
Also evaluate: is it possible to combine a logistic regression classifier with a Bayes classifier to get better accuracy? Example: I have 30 intents trained with the logistic regression classifier, and I add a new intent. When I write an utterance, it passes through the Bayes classifier; if it is identified as an intent that has been retrained, continue with Bayes; if it is identified as an intent already trained with the LRC, then pass it through the LRC.
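For reference, the reason Bayes training is cheap: a multinomial Naive Bayes classifier only keeps token counts, so adding an utterance costs O(tokens) with no gradient descent. A minimal sketch (illustrative only, not the classifier this library would ship):

```javascript
// Minimal multinomial Naive Bayes over whitespace tokens.
class TinyBayes {
  constructor() {
    this.counts = {};      // label -> token -> count
    this.totals = {};      // label -> total tokens seen
    this.docs = {};        // label -> number of utterances
    this.vocab = new Set();
  }

  // "Training" is just counting, so it is incremental and fast.
  add(label, utterance) {
    this.counts[label] = this.counts[label] || {};
    this.totals[label] = this.totals[label] || 0;
    this.docs[label] = (this.docs[label] || 0) + 1;
    for (const token of utterance.toLowerCase().split(/\s+/)) {
      this.counts[label][token] = (this.counts[label][token] || 0) + 1;
      this.totals[label] += 1;
      this.vocab.add(token);
    }
  }

  classify(utterance) {
    const tokens = utterance.toLowerCase().split(/\s+/);
    const totalDocs = Object.values(this.docs).reduce((a, b) => a + b, 0);
    let best = null;
    let bestScore = -Infinity;
    for (const label of Object.keys(this.counts)) {
      // log prior + sum of log likelihoods with Laplace smoothing
      let score = Math.log(this.docs[label] / totalDocs);
      for (const token of tokens) {
        const count = this.counts[label][token] || 0;
        score += Math.log((count + 1) / (this.totals[label] + this.vocab.size));
      }
      if (score > bestScore) { bestScore = score; best = label; }
    }
    return best;
  }
}

const bayes = new TinyBayes();
bayes.add('greet', 'hello there');
bayes.add('bye', 'goodbye see you');
console.log(bayes.classify('hello friend')); // 'greet'
```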
Describe the bug
There is an exception being thrown from lib/math/mathops.js:113 (Unable to find minimum) when training on certain data sets.
To Reproduce
const { NlpManager } = require('./lib');
const manager = new NlpManager({ languages: ['nb'] });
manager.addDocument('nb', 'foo', 'foo');
const input = "Orkanen Florence treffer fredag østkysten av USA som en såkalt kategori 1-orkan."
manager.addDocument('nb', input, 'bar');
manager.train();
It seems to be triggered by the digits in the input: e.g. removing the digit 1 fixes it.
I noticed that some utterances such as "hey" never get properly trained into the model. Therefore, I can never correctly get the intent for these utterances.
So far I have discovered that "hey" is one such word.
Are there other utterances that never get trained into the model? If so, what are they? What are the rules that determine which utterances get ignored?
Is your feature request related to a problem? Please describe.
Currently we are using the Recognizers-Text suite for default named entity extraction. Giving the user the choice between it and duckling is a good idea.
Describe the solution you'd like
In the ner-manager, when retrieving the entities, it should be done based on configuration. If duckling is configured, then duckling should be called, and the answer translated to the nlp.js format.
This also means that this part must be done asynchronously, because it will make a request to duckling.
Additional context
https://github.com/facebook/duckling
Is your feature request related to a problem? Please describe.
As we want to open the development to integration with APIs (for example, Duckling in issue #15), the process method must be asynchronous, while currently it is synchronous.
Describe the solution you'd like
Async/await is welcome for better syntax.
Another approach is to have NerManager.findEntities and NlpManager.process with Sync and Async versions (following Node standards, only the sync version should have the suffix).
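The Node convention mentioned here, sketched with a hypothetical findEntities (not the library's real implementation), looks like:

```javascript
// Node convention: the async form keeps the plain name, the sync form
// gets the "Sync" suffix (like fs.readFile / fs.readFileSync).
function findEntitiesSync(utterance) {
  // ... synchronous, local extraction only (hashtags, as a stand-in) ...
  return utterance.split(/\s+/).filter(w => w.startsWith('#'));
}

async function findEntities(utterance) {
  // Free to call remote services (e.g. duckling) before resolving.
  return findEntitiesSync(utterance);
}

findEntities('see #nlpjs and #nodejs').then(found => console.log(found));
// [ '#nlpjs', '#nodejs' ]
```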
Is your feature request related to a problem? Please describe.
Currently we can know the entities related to a given intent by the text in the intent, and the intent structure will have the names of the entities so they can be extracted.
The idea is to have slot filling, that is: for each entity we should know whether it is mandatory or optional, and if it is mandatory, for each language we should have the question that the chatbot should ask the user when the slot is not filled.
Example: we have the intent travel and 2 entities: location and date. There are four types of conversation:
The user fills all the slots
user> I want to travel tomorrow to London
bot> Ok, preparing your travel to London for 27/08/2028
The user fills date slot:
user> I want to travel tomorrow
bot> What is your destination?
user> London
bot> Ok, preparing your travel to London for 27/08/2028
The user fills location slot:
user> I want to travel to London
bot> When do you want to travel to London?
user> I want to travel tomorrow
bot> Ok, preparing your travel to London for 27/08/2028
The user does not fill any slot
user> I want to travel
bot> What is your destination?
user> I want to go to London
bot> When do you want to travel to London?
user> I want to travel tomorrow
bot> Ok, preparing your travel to London for 27/08/2028
Describe the solution you'd like
I think this can be implemented as a slot manager: a new hard entity slot should be defined, and inside the NLP manager an implementation of the slot manager is provided. Questions in the slots should be templated so they can use things from the context (as in the example when it says When do you want to travel to London).
Also, the Microsoft Bot Framework Recognizer will need hard work so that, when an answer is received with slots to fill, it is able to take control of the dialog by inserting a new artificial prompt or a new dialog state. About where the logic should be implemented, I think it should be sent to the NLP manager as contextualized information. The reason is: imagine that, as in the previous example, the user does not fill any slot, so 2 slot questions are pending; if the logic for both questions lives in the recognizer, and in the first question the user answers both slots, the recognizer will not know it and will ask for the date even though the user already provided it.
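The conversation flows above boil down to a small loop: extract entities from every turn, then ask for the first missing mandatory slot. This is a sketch under assumed shapes, with entity extraction faked by a lookup so the control flow stays visible:

```javascript
// Slot-filling loop: keep asking for the first missing mandatory slot.
const slots = {
  location: { mandatory: true, question: 'What is your destination?' },
  date: { mandatory: true, question: 'When do you want to travel?' },
};

// Stand-in for real NER: map known words to entities.
function extractEntities(utterance) {
  const found = {};
  if (/london/i.test(utterance)) found.location = 'London';
  if (/tomorrow/i.test(utterance)) found.date = 'tomorrow';
  return found;
}

// Returns the next question, or the final answer once all slots fill.
// Note it absorbs every entity in the turn, so answering two slots at
// once skips the second question (the case discussed above).
function handleTurn(context, utterance) {
  Object.assign(context, extractEntities(utterance));
  for (const [name, slot] of Object.entries(slots)) {
    if (slot.mandatory && !(name in context)) return slot.question;
  }
  return `Ok, preparing your travel to ${context.location} for ${context.date}`;
}

const context = {};
console.log(handleTurn(context, 'I want to travel tomorrow'));
// 'What is your destination?'
console.log(handleTurn(context, 'London'));
// 'Ok, preparing your travel to London for tomorrow'
```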
Describe the bug
I can run the example and model.nlp is created, but there's no classification, nor answer:
No error is thrown.
To Reproduce
win 10, node.js 8.9.3, npm 5.5.1
npm i node-nlp
await is not working, changed to use .then()
(see screenshot)
Expected behavior
Same result as the example in the readme.
Describe the bug
copy-pasting-adapting test example for slot throws:
SyntaxError: Invalid regular expression: /(?<= from )(.*)(?= to )/: Invalid group
To Reproduce
Copy paste Slot example.
Adapt to use npm package:
const fs = require('fs');
const { NlpManager } = require('node-nlp');
const modelName = './model.nlp';
const threshold = 0.7;
const nlpManager = new NlpManager();
nlpManager.addLanguage('en');
const fromEntity = nlpManager.addTrimEntity('fromCity');
fromEntity.addBetweenCondition('en', 'from', 'to', { skip: ['travel'] });
fromEntity.addAfterLastCondition('en', 'from', { skip: ['travel'] });
const toEntity = nlpManager.addTrimEntity('toCity');
toEntity.addBetweenCondition('en', 'to', 'from', { skip: ['travel'] });
toEntity.addAfterLastCondition('en', 'to', { skip: ['travel'] });
nlpManager.slotManager.addSlot('travel', 'toCity', true, {
en: 'Where do you want to go?',
});
nlpManager.slotManager.addSlot('travel', 'fromCity', true, {
en: 'From where you are traveling?',
});
nlpManager.slotManager.addSlot('travel', 'date', true, {
en: 'When do you want to travel?',
});
nlpManager.addDocument(
'en',
'I want to travel from %fromCity% to %toCity% %date%',
'travel'
);
nlpManager.addAnswer(
'en',
'travel',
'You want to travel from {{ fromCity }} to {{ toCity }} {{ date }}'
);
if (fs.existsSync(modelName)) {
nlpManager.load(modelName);
} else {
//nlpManager.loadExcel(excelName);
nlpManager.train();
nlpManager.save(modelName);
}
run and you get:
E:\disctut\node_modules\node-nlp\lib\ner\regex-named-entity.js:113
return new RegExp(str.slice(1, index), str.slice(index + 1));
^
SyntaxError: Invalid regular expression: /(?<= from )(.*)(?= to )/: Invalid group
at new RegExp (<anonymous>)
at Function.str2regex (E:\disctut\node_modules\node-nlp\lib\ner\regex-named-entity.js:113:12)
at languages.forEach.language (E:\disctut\node_modules\node-nlp\lib\ner\trim-named-entity.js:74:33)
at Array.forEach (<anonymous>)
at TrimNamedEntity.addBetweenCondition (E:\disctut\node_modules\node-nlp\lib\ner\trim-named-entity.js:65:15)
at Object.<anonymous> (E:\disctut\server.js:56:12)
at Module._compile (module.js:635:30)
at Object.Module._extensions..js (module.js:646:10)
at Module.load (module.js:554:32)
at tryModuleLoad (module.js:497:12)
at Function.Module._load (module.js:489:3)
at Function.Module.runMain (module.js:676:10)
at startup (bootstrap_node.js:187:16)
at bootstrap_node.js:608:3
BEFORE model is saved.
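The SyntaxError above most likely comes from the lookbehind assertion `(?<= from )`: the V8 engine bundled with Node 8 does not support lookbehind by default, so constructing the regex throws "Invalid group" (Node 10+ supports it). A quick self-contained check, as a sketch:

```javascript
// Detect whether this Node runtime supports lookbehind assertions,
// which the generated pattern /(?<= from )(.*)(?= to )/ requires.
// On Node 8 the RegExp constructor throws "Invalid group".
let supportsLookbehind = true;
try {
  // Build the pattern from a string so older parsers do not choke
  // on the literal at parse time.
  new RegExp('(?<= from )(.*)(?= to )');
} catch (e) {
  supportsLookbehind = false;
}

console.log(supportsLookbehind); // true on Node 10+, false on Node 8
```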
Expected behavior
working bot 😄
But I guess I messed up again somewhere in my adaptation?
Additional
How to add memory slots to XLS?
Is your feature request related to a problem? Please describe.
Currently the named entity extraction is done in three layers:
This comes with several problems: the Recognizer text suite returns the units, metrics, dimensions, etc. translated into the target language, with no option to keep them in English to provide a common interface in code. As a result, the code contains a dictionary structure like:
initializeDictionary() {
this.dictionary = {
Año: 'Year',
Mes: 'Month',
Día: 'Day',
Semana: 'Week',
Ans: 'Year',
Mois: 'Month',
Semaines: 'Week',
Jour: 'Day',
Ano: 'Year',
Mês: 'Month',
Dia: 'Day',
};
}
This is far from done, so what we have to do is continue this work with other entities, completing this dictionary. We should also keep this dictionary as a JSON file, perhaps split by language, to make it more maintainable and to avoid possible collisions (the same word existing in different languages with different meanings).
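The per-language split suggested above could look like the following sketch, which reuses the entries from the dictionary shown earlier. The locale keys and helper name are illustrative (in a real implementation each locale's map would live in its own JSON file):

```javascript
// Hypothetical sketch: per-locale dictionaries mapping localized unit
// names back to English, so the same word in two languages cannot collide.
const dictionaries = {
  es: { 'Año': 'Year', 'Mes': 'Month', 'Día': 'Day', 'Semana': 'Week' },
  fr: { 'Ans': 'Year', 'Mois': 'Month', 'Semaines': 'Week', 'Jour': 'Day' },
  pt: { 'Ano': 'Year', 'Mês': 'Month', 'Dia': 'Day' },
};

// Look a localized word up in its locale's dictionary only.
function toEnglishUnit(locale, word) {
  return (dictionaries[locale] || {})[word];
}

console.log(toEnglishUnit('fr', 'Mois')); // 'Month'
console.log(toEnglishUnit('es', 'Semana')); // 'Week'
```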
Describe the solution you'd like
Describe the bug
Currently the example at https://github.com/axa-group/nlp.js/tree/master/examples/console-bot is not working properly: training is now asynchronous, so when the model is saved to the file model.nlp, the weights are still not calculated.
To Reproduce
Steps to reproduce the behavior:
Recommended fix
manager.train() must be awaited, so the surrounding function must be async.
In index.js, wrap the main code in an async main() function and call main().
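The fix can be sketched with a self-contained stand-in for the manager (the `manager` object here is a mock, not the real NlpManager, so the ordering guarantee is easy to see):

```javascript
// Stand-in for the NlpManager: train() is asynchronous, and save()
// must not run until training has finished.
const manager = {
  trained: false,
  train() {
    return new Promise((resolve) =>
      setTimeout(() => { this.trained = true; resolve(); }, 10));
  },
  save() {
    if (!this.trained) throw new Error('saving before training finished');
    return 'model.nlp';
  },
};

// Recommended pattern: wrap the main code in an async function so
// train() can be awaited before save().
async function main() {
  await manager.train(); // without this await, save() runs too early
  return manager.save();
}

main().then((file) => console.log(file)); // prints 'model.nlp'
```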
Is your feature request related to a problem? Please describe.
When a trim named entity collides with another entity, instead of removing one of the edges, the trim entity can be reduced in size to fit alongside the other entity.
Example: Suppose this utterance:
"I want to travel from Barcelona to London tomorrow".
With those entities:
fromEntity: between from and to or after last from
toEntity: between to and from (skip travel) or after last to
date entity
It will result into three edges:
When the edges are reduced, since "London tomorrow" and "tomorrow" collide, the one with higher accuracy (or greater length when accuracy is equal) survives and the other is removed.
Describe the solution you'd like
The reduceEdges algorithm must take into account a first loop detecting collisions of Trim Named Entities with another ones, trying to split the Trim Named Entity. After that first loop, the normal edge collision is passed, resulting, for the provided example, in three edges:
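The splitting step proposed above can be sketched on plain character spans. This is an illustrative helper, not the actual reduceEdges implementation; spans are inclusive `{ start, end }` character indices, and the trim span is shrunk rather than discarded when it overlaps a higher-priority entity:

```javascript
// Hypothetical sketch of the splitting idea: shrink a trim entity's
// span so it no longer overlaps another entity's span.
function shrinkTrimSpan(trimSpan, otherSpan) {
  // No overlap: keep the trim span as-is.
  if (trimSpan.end < otherSpan.start || trimSpan.start > otherSpan.end) {
    return trimSpan;
  }
  // Overlap on the right: cut the trim span before the other span.
  if (trimSpan.start < otherSpan.start) {
    return { start: trimSpan.start, end: otherSpan.start - 1 };
  }
  // Overlap on the left: start the trim span after the other span.
  if (trimSpan.end > otherSpan.end) {
    return { start: otherSpan.end + 1, end: trimSpan.end };
  }
  // Fully contained: nothing of the trim span survives.
  return null;
}

// In "I want to travel from Barcelona to London tomorrow",
// "London tomorrow" spans 35..49 and "tomorrow" spans 42..49,
// so the trim span is shrunk to cover only "London ".
console.log(shrinkTrimSpan({ start: 35, end: 49 }, { start: 42, end: 49 }));
// { start: 35, end: 41 }
```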
Is your feature request related to a problem? Please describe.
Currently the JSON format returned is fine. But there are already NLUs on the market, like LUIS, DialogFlow, Wit, RASA, or Snips. The idea is to provide transformations to those market JSON formats, so users who already have implementations built on them can use NLP.js without having to change their code's behaviour.
Describe the solution you'd like
Implement a base class for a transformer, and derived classes at least for LUIS and DialogFlow.
There should be a way to attach a transformer to an NlpManager so that when we call process, the answer is piped through the transformer.
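A base class plus one derived transformer could look like the sketch below. The class names and the exact target fields are illustrative assumptions (loosely modeled on LUIS's `query`/`topScoringIntent` shape), not the real nlp.js or LUIS contract:

```javascript
// Hypothetical sketch: a base transformer with an identity transform,
// and a LUIS-flavoured subclass that reshapes the default nlp.js result.
class OutputTransformer {
  transform(result) {
    return result; // identity by default
  }
}

class LuisLikeTransformer extends OutputTransformer {
  transform(result) {
    return {
      query: result.utterance,
      topScoringIntent: { intent: result.intent, score: result.score },
      entities: result.entities || [],
    };
  }
}

const transformer = new LuisLikeTransformer();
console.log(transformer.transform({
  utterance: 'I have a red car',
  intent: 'car.getcolor',
  score: 0.85,
  entities: [],
}));
```

With this shape, process could simply call `transformer.transform(answer)` as its final step when a transformer is attached.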
Is your feature request related to a problem? Please describe.
In the documentation of NlpManager there is no information about how load, save, import, or export work.
Is your feature request related to a problem? Please describe.
It's a typical feature that would be very useful to have built-in inside this great lib. As @sys.color in DialogFlow https://dialogflow.com/docs/reference/system-entities
Describe the solution you'd like
Extract colors from an input.
e.g.:
"I have a red car"
Output:
{
  "locale": "en",
  "localeIso2": "en",
  "language": "English",
  "utterance": "I have a red car",
  "classification": [{
    "label": "color",
    "value": 0.8567240019144264
  }],
  "intent": "car.getcolor",
  "domain": "default",
  "score": 0.8567240019144264,
  "entities": [{
    "start": 9,
    "end": 11,
    "len": 3,
    "levenshtein": 0,
    "accuracy": 1,
    "option": "color",
    "sourceText": "red",
    "entity": "general",
    "utteranceText": "red"
  }],
  ...,
  "srcAnswer": "ok, we took note of your color",
  "answer": "ok, we took note of your color"
}
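A minimal sketch of the extraction itself, using a tiny hard-coded colour list (a real implementation would ship per-language lists, as DialogFlow's @sys.color does). The helper name and entity shape mirror the example output above but are otherwise illustrative:

```javascript
// Hypothetical sketch: find known colour words in an utterance and
// emit entity records shaped like the example output above.
const COLORS = ['red', 'green', 'blue', 'black', 'white', 'yellow'];

function extractColors(utterance) {
  const lower = utterance.toLowerCase();
  const entities = [];
  for (const color of COLORS) {
    const start = lower.indexOf(color);
    if (start !== -1) {
      entities.push({
        start,
        end: start + color.length - 1, // inclusive end index
        len: color.length,
        entity: 'color',
        sourceText: utterance.slice(start, start + color.length),
      });
    }
  }
  return entities;
}

console.log(extractColors('I have a red car'));
// [{ start: 9, end: 11, len: 3, entity: 'color', sourceText: 'red' }]
```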
Describe the bug
Currently the save and load process of the NLP Manager takes into account the EnumNamedEntity and RegexNamedEntity classes, but not the TrimNamedEntity class.
To Reproduce
Create a new NLP Manager, add some TrimNamedEntity instances to it, save, and load: the model.nlp file does not contain info about the trim named entities.
Hello,
After multiple tests, I'm stuck on regex in XLS :(
I searched the issues and tried using the NER Manager example before filing this issue.
Adding an intent and response to the regex entity didn't produce any intent or response.
Added code (to answer
):
"Sorry, I don't understand, " + JSON.stringify(result) + ", ";
Input:
my mail is [email protected]
Response:
Sorry, I don't understand, {"locale":"en","localeIso2":"en","language":"English","utterance":"my mail is [email protected]","intent":"None","domain":"default","score":1,"entities":[{"start":11,"end":25,"accuracy":1,"sourceText":"[email protected]","utteranceText":"[email protected]","entity":"mail"}],"sentiment":{"score":0,"comparative":0,"vote":"neutral","numWords":6,"numHits":0,"type":"senticon","language":"en"}}, .
I may write to you
(of course, in the future I'd like I may write to you at %mail%
)
| Software | Version |
| --- | --- |
| nlp.js | 2.1.0 |
| node | 8.9.3 |
| npm | 5.5.1 |
| Operating System | Win10 Pro |
Thanks for your help!
Is your feature request related to a problem? Please describe.
Currently the NLG returns answers based on locale and context conditions, and that is fine. But what if, given an intent and conditions, we were able to return a script of predefined actions to be executed?
Examples:
Describe the solution you'd like