retextjs / retext-keywords

plugin to extract keywords and key-phrases

Home Page: https://retextjs.github.io/retext-keywords

License: MIT License

JavaScript 100.00%
retext retext-plugin natural-language keyword-extraction keyword term tensorflow

retext-keywords's People

Contributors

eklem, facundoolano, iamstarkov, wooorm

retext-keywords's Issues

United not a keyword

"United" is always filtered out, even though it is a very common word in country and organization names.

TypeError: Cannot read property 'push' of undefined

I'm getting the following error whenever I want to process a string that contains the word "constructor":

TypeError: Cannot read property 'push' of undefined
    at /Users/facundo/dev/gp-keywords/node_modules/retext-keywords/index.js:100:36
    at one (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:72:22)
    at all (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:48:26)
    at one (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:76:20)
    at all (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:48:26)
    at one (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:76:20)
    at all (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:48:26)
    at one (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:76:20)
    at visit (/Users/facundo/dev/gp-keywords/node_modules/unist-util-visit/index.js:82:5)
    at getImportantWords (/Users/facundo/dev/gp-keywords/node_modules/retext-keywords/index.js:81:5)

You can see the code where I use retext-keywords here. I first found the issue when passing the string 'Happy Bike Race: sMAShy for WhEEls the WantED - Bridge Road: COnstrucTOR' to the process function, but I can also reproduce it by passing just 'constructor'.
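
One plausible cause, offered as an assumption rather than a reading of the plugin's source: if words are collected in a plain object keyed by the lowercased word, a lookup for 'constructor' finds the inherited Object.prototype.constructor, and a later property access on that value fails with this kind of TypeError. The sketch below uses hypothetical helper names (collectWithPlainObject, collectWithMap) to reproduce that class of bug and to show a Map-based variant that avoids it.

```javascript
// Hypothetical reduction of the failure; NOT the actual retext-keywords source.
function collectWithPlainObject(words) {
  const seen = {};
  for (const word of words) {
    const key = word.toLowerCase();
    if (seen[key]) {
      // For 'constructor', seen[key] is the inherited Object function,
      // so `.matches` is undefined and calling `.push` throws a TypeError.
      seen[key].matches.push(word);
    } else {
      seen[key] = {matches: [word]};
    }
  }
  return seen;
}

// A Map has no inherited keys, so 'constructor' behaves like any other word.
function collectWithMap(words) {
  const seen = new Map();
  for (const word of words) {
    const key = word.toLowerCase();
    const entry = seen.get(key);
    if (entry) {
      entry.matches.push(word);
    } else {
      seen.set(key, {matches: [word]});
    }
  }
  return seen;
}
```

Using Object.create(null) or an own-property check would be alternative ways to get the same effect with a plain object.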

Problem with words like "night’s" causing keyphrases to be malformed due to ’ character

Hi Guys,

I'm hoping you can help me with an issue I'm having. I've created an example based on the example code provided so that the bug can be reproduced.

The issue is that when analysing text which contains words such as "night’s", e.g.:

"Last night’s concert was the third in a series organised by the Lizz Hobbs Group, who have now produced concerts at Slessor Gardens two years in a row."

the keyphrase is given as "Last night2’2s concert". I'm wondering how I can work around or resolve this so that the 2s do not appear around the ’ character.

Please see the following example code:

var retext = require('retext');
var keywords = require('retext-keywords');
var nlcstToString = require('nlcst-to-string');

var text = "Last night’s concert was the third in a series organised by the Lizz Hobbs Group, who have now produced concerts at Slessor Gardens two years in a row.";

retext()
  .use(keywords)
  .process(text, function (err, file) {
    if (err) throw err;

    console.log('Keywords:');
    file.data.keywords.forEach(function (keyword) {
      console.log(nlcstToString(keyword.matches[0].node));
    });

    console.log();
    console.log('Key-phrases:');
    file.data.keyphrases.forEach(function (phrase) {
      console.log(phrase.matches[0].nodes.map(nlcstToString).join(''));
    });
  }
);

output is:

Keywords:
concert
Last
night’s
third
series
Lizz
Hobbs
Group
Slessor
Gardens
two
years
row

Key-phrases:
Last night2’2s concert
Slessor Gardens two years
Lizz Hobbs Group
concerts

see Last night2’2s concert

node v10.10.0
npm 6.4.1

I'd appreciate any pointers, as I'm not sure how to approach resolving the issue.
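
A likely explanation, assuming the installed nlcst-to-string accepts an optional separator as its second argument: Array.prototype.map calls its callback with (node, index, array), so nodes.map(nlcstToString) passes each node's index as the separator. The word "night’s" sits at index 2 of the phrase and has three children (night, ’, s), which are then joined with "2". The toText function below is a simplified stand-in written for this illustration, not the library itself; the node shapes are likewise reduced to the minimum needed.

```javascript
// Simplified stand-in for a to-string helper with an optional separator.
// This imitates the assumed behaviour; it is not nlcst-to-string's code.
function toText(node, separator) {
  const sep = separator || '';
  if (typeof node.value === 'string') return node.value;
  return node.children.map((child) => toText(child, sep)).join(sep);
}

// Minimal hand-built nodes for the start of the phrase "Last night’s".
const nodes = [
  {type: 'WordNode', children: [{type: 'TextNode', value: 'Last'}]},
  {type: 'WhiteSpaceNode', value: ' '},
  {
    type: 'WordNode',
    children: [
      {type: 'TextNode', value: 'night'},
      {type: 'PunctuationNode', value: '’'},
      {type: 'TextNode', value: 's'}
    ]
  }
];

// map passes (node, index, array), so the index leaks in as the separator.
const broken = nodes.map(toText).join('');

// Wrapping the call ensures only the node is passed.
const fixed = nodes.map((node) => toText(node)).join('');

console.log(broken); // Last night2’2s
console.log(fixed); // Last night’s
```

If this is indeed the cause, replacing nodes.map(nlcstToString) with nodes.map(function (node) { return nlcstToString(node); }) in the example should remove the stray digits.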

Not working with head version of retext

I think the problem is that the transformer function returned by the plugin expects file as its second parameter, whereas the head version of retext calls the transformer with options (null) instead.

Spanish

Hi,
What's the algorithm/logic used? I am considering using it to get keywords from text in Spanish. Would it work?
Thank you!

Not working with custom texts

I copied the content of this blog post, https://codeforgeek.com/2015/01/nodejs-mysql-tutorial/, and passed it as a string in the code below.

var retext = require('retext');
var keywords = require('retext-keywords');
var nlcstToString = require('nlcst-to-string');
var phraseData2 = "string of that web page"; 
retext()
  .use(keywords)
  .process(phraseData2, function (err, file) {
    if (err) throw err;

    console.log('Keywords:');
    file.data.keywords.forEach(function (keyword) {
      console.log(nlcstToString(keyword.matches[0].node));
    });

    console.log();
    console.log('Key-phrases:');
    file.data.keyphrases.forEach(function (phrase) {
      console.log(phrase.matches[0].nodes.map(nlcstToString).join(''));
    });
  }
);

It returns nothing for both keywords and key-phrases.

Identify American English

Keywords that are identified when written in British English (en-uk) are not identified when written in American English (en-us).

Example: favourite is identified as a keyword, but favorite is not.

I tried out a few workarounds but didn't get anywhere. Please let me know if a workaround already exists.

Stopwords that contain apostrophe are not filtered

Hello! Thanks for this project, it's very useful. I've been using it to extract keywords from app titles and descriptions on Google Play and iTunes, and I've found that some words that I think should be considered stopwords, but contain apostrophes, aren't filtered by this plugin. Some examples are: it's, you'll, you're, we'll, can't, won't, etc.

For now I'm just adding a string replacement for every new case I find before sending it to the processor, but I was wondering if there's a more generic way to filter them out.
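
A more generic version of that pre-processing step could look like the sketch below. It is only an illustration under stated assumptions: the CONTRACTIONS set and the removeContractions helper are hypothetical names, the list is deliberately incomplete, and the function runs on the raw string before it reaches the processor rather than hooking into the plugin.

```javascript
// Hypothetical, incomplete list of contractions to treat as stopwords.
const CONTRACTIONS = new Set([
  "it's", "you'll", "you're", "we'll", "can't", "won't"
]);

// Drops whitespace-separated tokens that are listed contractions.
// Note: a removed token takes any attached punctuation with it,
// and runs of whitespace are collapsed to single spaces.
function removeContractions(text) {
  return text
    .replace(/[\u2018\u2019]/g, "'") // normalise curly apostrophes
    .split(/\s+/)
    .filter((token) => {
      const bare = token.toLowerCase().replace(/[^a-z']/g, '');
      return !CONTRACTIONS.has(bare);
    })
    .join(' ');
}

console.log(removeContractions("It’s great, you'll love it"));
// great, love it
```

Because the list is a Set, new cases can be added in one place instead of writing a separate string replacement for each.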

Phrases

Subject of the feature

I have noticed a lot of single-word phrases are being returned in some documents. Considering "phrases" is plural I think it should return 2+ word phrases.

Problem

It would be nice to be able to select phrases by a number of words or greater than a specific number of words.

Expected behavior

Return "phrases", not single words. Preferably with the ability to choose between, say, 3-6 or 2-4 word phrases.
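
Until such an option exists, the result can be filtered after processing. The sketch below assumes each entry of file.data.keyphrases exposes matches[0].nodes, as in the example code elsewhere in these issues, and that words carry the NLCST type 'WordNode'. filterPhrasesByWordCount is a hypothetical helper, not part of the plugin.

```javascript
// Hypothetical helper: keep phrases whose first match has between
// `min` and `max` words (inclusive). Whitespace and punctuation nodes
// are not counted.
function filterPhrasesByWordCount(keyphrases, min, max) {
  return keyphrases.filter((phrase) => {
    const count = phrase.matches[0].nodes.filter(
      (node) => node.type === 'WordNode'
    ).length;
    return count >= min && count <= max;
  });
}

// Hand-built stand-ins for two key-phrases: "concerts" and "Lizz Hobbs Group".
const word = {type: 'WordNode'};
const space = {type: 'WhiteSpaceNode'};
const examplePhrases = [
  {matches: [{nodes: [word]}]},
  {matches: [{nodes: [word, space, word, space, word]}]}
];

console.log(filterPhrasesByWordCount(examplePhrases, 2, 4).length); // 1
```

In the real callback this would be called as filterPhrasesByWordCount(file.data.keyphrases, 2, 4).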

Alternatives

TypeError: Cannot read property 'children' of undefined when using options

Subject of the issue

When using the default example given in the readme, adding an options object with maximum value causes an error.

Environment

  • OS: OSX Mojave 10.14.4
  • Packages:
    Package          Version
    retext           ^7.0.1
    retext-keywords  ^5.0.0
    nlcst-to-string  latest
    retext-pos       ^2.0.2
    to-vfile         ^6.0.0
  • Env:
    Package  Version
    node     8.7.0
    npm      6.13.1

Steps to reproduce

  1. Start a new project in a clean directory
  2. npm install retext
  3. npm install retext-keywords
  4. npm install nlcst-to-string
  5. npm install retext-pos
  6. npm install to-vfile
  7. create a file called index.js
  8. copy the contents of the example given in the readme
  9. create the example.txt file in the same directory
  10. add an options object to the keywords with a maximum: 8
  11. run node index

Expected behaviour

We should see a similar console log, but with around 8 keywords and phrases.

Actual behaviour

This error is thrown:

/Users/Mario/Sites/retext-test/node_modules/unist-util-visit-parents/index.js:41
    if (node.children && result[0] !== SKIP) {
             ^

TypeError: Cannot read property 'children' of undefined
    at one (/Users/Mario/Sites/retext-test/node_modules/unist-util-visit-parents/index.js:41:14)
    at visitParents (/Users/Mario/Sites/retext-test/node_modules/unist-util-visit-parents/index.js:26:3)
    at visit (/Users/Mario/Sites/retext-test/node_modules/unist-util-visit/index.js:22:3)
    at getImportantWords (/Users/Mario/Sites/retext-test/node_modules/retext-keywords/index.js:214:3)
    at Function.transformer (/Users/Mario/Sites/retext-test/node_modules/retext-keywords/index.js:17:21)
    at freeze (/Users/Mario/Sites/retext-test/node_modules/unified/index.js:118:28)
    at Function.process (/Users/Mario/Sites/retext-test/node_modules/unified/index.js:352:5)
    at Object.<anonymous> (/Users/Mario/Sites/retext-test/index.js:14:6)
    at Module._compile (module.js:624:30)
    at Object.Module._extensions..js (module.js:635:10)

Please see attached example project.
retext-test.zip

Error: Attempted import error: 'color' is not exported from 'unist-util-visit-parents/do-not-use-color' (imported as 'color').

Initial checklist

Affected packages and versions

  • retext: ^9.0.0
  • retext-keywords: ^8.0.1
  • retext-pos: ^5.0.0
  • nlcst-to-string: ^4.0.0
  • to-vfile: ^8.0.0

Link to runnable example

No response

Steps to reproduce

  1. Start a new Next.js (V13) project in a clean directory
  2. pnpm install retext
  3. pnpm install retext-keywords
  4. pnpm install nlcst-to-string
  5. pnpm install retext-pos
  6. pnpm install to-vfile
  7. create a file called index.tsx
  8. copy the contents of the example given in the readme
  9. create the example.txt file in the same directory
  10. run app

Expected behavior

We should be able to see keywords and phrases, but instead, we encounter the following error:

./node_modules/.pnpm/[email protected]/node_modules/unist-util-visit-parents/lib/index.js
Attempted import error: 'color' is not exported from 'unist-util-visit-parents/do-not-use-color' (imported as 'color').

Affected runtime and version

[email protected]

Affected package manager and version

[email protected]

Affected OS and version

MacOS Sonoma 14.1.2

Build and bundle tools

Next.js

Duplicates using the same word

Hi, great project btw!

There is a slight issue when the same word appears twice in a title. For example:
"How to Get best More Likes on Your Facebook Page " -> "keywords": [ "facebook", "page" ].

But if you put "page" in the title twice:
"How to Get best More page Likes on Your Facebook Page" -> "keywords": [ "page,page" ].

var Retext = require('retext'),
    visit = require('retext-visit'),
    keywords = require('retext-keywords'),
    sentiment = require('retext-sentiment'),
    _ = require('lodash'); // assumption: `_` is lodash (or underscore); the original report does not show it
var rt = new Retext().use(visit).use(keywords).use(sentiment);

// `headline` (the title string) and `cb` (a Node-style callback) come from the surrounding code.
rt.parse(headline, function (err, tree) {
    if (err) return cb(err);
    if (tree.length == 0) {
        return cb(new Error('Error loading the data!'));
    }

    var s = [];
    _.forEach(tree.keywords({ 'minimum': 1 }), function (n) {
        s.push(n.nodes.toString());
    });
});
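
The "page,page" value is consistent with how JavaScript stringifies arrays, assuming n.nodes is an array holding one node per occurrence of the keyword: Array.prototype.toString joins the elements with commas. The sketch below uses hypothetical stand-in objects instead of real retext nodes to show the effect, and reads only the first occurrence as one possible workaround.

```javascript
// Stand-in for a node whose toString() returns its text (hypothetical shape).
const makeNode = (text) => ({toString: () => text});

// A keyword that occurs twice in the title: one node per occurrence.
const keyword = {nodes: [makeNode('page'), makeNode('page')]};

// Array.prototype.toString joins all elements with commas.
const joined = keyword.nodes.toString();

// Reading only the first occurrence gives the bare word.
const first = keyword.nodes[0].toString();

console.log(joined); // page,page
console.log(first); // page
```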

Version bump on retext-pos dependency?

I'm trying to bundle retext-keywords using webpack, and apparently it's tripping on the block in retext-pos:

/*
 * Duo and component / npm and component.
 */

try {
    posjs = require('pos');
} catch (err) {
    /* istanbul ignore next - browser */
    posjs = require('pos-js');
}

It looks like webpack sees the require() statement inside catch and tries to import that file, which of course doesn't exist since pos-js is not an npm module. retext-pos v2.0.0 does not appear to have this kind of try..catch in there.
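
Independently of a version bump, the bundler can be told to skip the missing module. The fragment below is a sketch that assumes webpack 5 or later, where an alias value of false makes webpack ignore the module; it would not apply to older webpack releases, and the rest of the configuration is omitted.

```javascript
// webpack.config.js (fragment; assumes webpack 5+)
const config = {
  resolve: {
    alias: {
      // 'pos-js' is only required inside the catch branch and is not on npm,
      // so tell webpack not to try to resolve it.
      'pos-js': false
    }
  }
};

module.exports = config;
```

With this in place the fallback branch resolves to an empty module, which only matters if requiring 'pos' itself fails at runtime.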

What algorithm does this use?

Thanks for building this! Do you think you can put the underlying algorithm (RAKE?) in README for easier estimation of big-O?

AWS Lambda Insights

Hi There -

This is a pretty handy package. I've been using it on the server/local machine just fine.

I've been trying to get this working in AWS Lambda, but it keeps failing with no error (the function just times out when trying to require the module).

Looking through the source code, I can't see anything in there that would cause it to do that (eg: native libraries).

Do you have any insights into this?

Getting "it's" as a keyword

Initial checklist

Affected packages and versions

retext-keyword

Link to runnable example

sqlite> select distinct(tag) from tags where tag LIKE 'it%';
it's
italian
item
items
itext
itoco
it’ll
it’s
it’s-leadership

Here are some URLs for which "it's" is returned as a keyword (I'm passing the HTML to deno-dom, then to moz-readability, to get article.content and article.title):

it's|https://www.bbc.co.uk/news/science-environment-59268393
it's|https://www.cbsnews.com/news/china-grows-as-campaign-theme-during-coronavirus-pandemic/
it's|https://www.cbsnews.com/news/classic-cars-electric-vehicles-london-mechanic/
it's|https://www.cnbc.com/2021/11/18/inside-cornings-new-vaccine-vial-factory-in-north-carolina.html
it's|https://www.cnn.com/travel/article/beautiful-towns-europe/index.html
it's|https://www.cnn.com/travel/article/dead-sea-shrinks-as-jordan-turns-tide-on-tourism/index.html
it's|https://www.cnn.com/travel/article/uk-tourism-decline-restrictions-cmd/index.html
it's|https://www.cracked.com/article_31747_canadian-children-marched-to-protest-the-rising-price-of-candy.html
it's|https://www.euronews.com/green/2021/11/18/climate-misinformation-is-getting-more-sophisticated-and-experts-say-cop26-progress-could-
it's|https://www.firstshowing.net/2021/watch-remember-a-visual-poem-film-about-interconnectedness/
it's|https://www.freep.com/story/news/local/michigan/2021/11/18/ann-arbor-ordinance-tampons-pads-all-public-bathrooms/8652533002/
it's|https://www.globalcryptopress.com/2021/10/bitcoin-network-holds-over-1-trillion.html
it's|https://www.inc.com/anna-meyer/jennifer-fleiss-rent-the-runway-jetblack-volition-brands.html
it's|https://www.inc.com/joe-sanok/psychologist-joe-sanok-reveals-the-best-parts-of-his-new-book-thursday-is-the-new-friday.html
it's|https://www.inc.com/suzanne-lucas/osha-wont-enforce-covid-rules-pending-court-prepare-anyway.html
it's|https://www.neatorama.com/2021/11/18/Every-Picture-Tells-a-Story-This-One-is-a-Romantic-Comedy/
it's|https://www.npr.org/2021/11/16/1056263648/pfizer-says-it-will-share-the-rights-to-its-covid-19-pill
it's|https://www.npr.org/2021/11/17/1056646740/la-palma-volcano-brings-both-destruction-and-renewal-to-the-island
it's|https://www.rt.com/sport/540633-djokovic-covid-vaccine-status/
it's|https://www.slashfilm.com/664323/jennifer-coolidge-will-star-in-ryan-murphys-the-watcher-tv-series-for-netflix/
it's|https://www.wired.com/gallery/25-amazing-holiday-gift-ideas-under-25-2021/
it's|https://www.wired.com/story/best-black-friday-outdoors-deals-rei-2021/
it's|https://www.wired.com/story/best-buy-early-black-friday-deals-2021-2/
it's|https://www.wired.com/story/early-black-friday-deals-2021/
it's|https://wyrk.com/what-would-you-tear-down-in-buffalo-and-why/

Steps to reproduce

    const doc = new DOMParser().parseFromString(body, 'text/html');
    article = new Readability(doc).parse();


    const nodes = await retext()
      .use(retextPos)
      .use(retextKeywords)
      .process(`${article.title} - ${article.textContent}`);

Expected behavior

It seems like "it's" should be treated as a stop word.

Actual behavior

Getting "it's" as a keyword

Runtime

Deno v1

Package manager

Other (please specify in steps to reproduce)

OS

Linux

Build and bundle tools

import retextKeywords from 'https://cdn.skypack.dev/[email protected]?dts';
