
python-rake's Introduction

python-rake

Note on Upgrades

Some users have reported issues importing the stoplists after upgrading to 1.1.*. If you experience import issues after upgrading, try a full uninstall and reinstall.



A Python module implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons. Initially by @aneesha, packaged by @tomaspinho.

The source code is released under the MIT License.

Installation

pip install python-rake #or pip3

Usage

For external .txt, .csv, etc. files: takes the path as a string. Words can be on the same or different lines but must be separated by non-word characters. This should support all languages, as it is based on Unicode, but please validate the results for non-Western languages and report any issues, as they haven't been thoroughly tested.

import RAKE
Rake = RAKE.Rake(<path_to_your_stopwords_file>)
Rake.run(<text>)

To change how the stopwords file is split into words, pass your own regex as shown below. The default regex is [\W\n]+.

RAKE.Rake(<path_to_your_stopwords_file>, regex='<your regex>')
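To see what the default regex does to the contents of a stopwords file, here is a minimal sketch using only the standard library. The helper name split_stopwords is hypothetical and is not part of python-rake; it only illustrates splitting on [\W\n]+ versus a custom regex.

```python
import re


def split_stopwords(raw, regex=r"[\W\n]+"):
    """Split raw stoplist text into words (hypothetical helper, not the package's code)."""
    # Drop empty strings produced by leading or trailing separators.
    return [word for word in re.split(regex, raw) if word]


# With the default regex, any run of non-word characters separates words.
default_words = split_stopwords("a, an; the\nof")

# A custom regex that splits on newlines only keeps multi-word entries intact.
line_words = split_stopwords("new york\nthe", regex=r"\n")

print(default_words)
print(line_words)
```

A newline-only regex can be handy when a stoplist contains entries with spaces or punctuation in them.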

For lists:

import RAKE
Rake = RAKE.Rake(<list>)  # takes stopwords as a list of strings
Rake.run(<text>)

SmartStopList(), FoxStopList(), NLTKStopList() and MySQLStopList() return the expected lists as lists; they can be used as shown below. GoogleSearchStopList() returns what were thought to be stop words in Google search back when large numbers of search suggestions were available. RanksNLStopList() and RanksNLLongStopList() return the in-house developed stoplists from Ranks NL, a webmaster suite.

import RAKE
Rake = RAKE.Rake(RAKE.SmartStopList())
Rake.run(<text>)

Additional flags:

The run method also accepts minCharacters, maxWords and minFrequency flags to better tune your output. minCharacters is the minimum number of characters allowed in a keyword. maxWords is the maximum number of words allowed in a phrase considered as a keyword. minFrequency is the minimum number of occurrences a keyword must have to be considered a keyword. An example showing the default values is as follows:

import RAKE
rake = RAKE.Rake(RAKE.SmartStopList())
rake.run(<text>, minCharacters = 1, maxWords = 5, minFrequency = 1)
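As a rough illustration of what the three flags mean, the sketch below filters a list of candidate phrases with the same three thresholds. The function filter_candidates is hypothetical, and the exact way python-rake applies each flag internally (for example, whether minCharacters is checked per phrase or per word) is an assumption here, not something stated above.

```python
from collections import Counter


def filter_candidates(phrases, minCharacters=1, maxWords=5, minFrequency=1):
    """Keep phrases that satisfy all three thresholds (illustrative only)."""
    counts = Counter(phrases)  # how often each candidate phrase occurs
    kept = []
    for phrase, count in counts.items():
        if len(phrase) < minCharacters:    # too short (assumed: whole phrase length)
            continue
        if len(phrase.split()) > maxWords:  # too many words in the phrase
            continue
        if count < minFrequency:           # does not occur often enough
            continue
        kept.append(phrase)
    return kept


candidates = ["lung cancer", "lung cancer", "a", "one two three four five six"]
with_defaults = filter_candidates(candidates)
stricter = filter_candidates(candidates, minCharacters=2, maxWords=5, minFrequency=2)
print(with_defaults)
print(stricter)
```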

Other stoplists and stoplists in other languages can be found at https://github.com/trec-kba/many-stop-words/tree/master/orig, at http://www.ranks.nl/stopwords, at https://sites.google.com/site/kevinbouge/stopwords-lists and in the NLTK stopwords package.

Releases

I will push releases to PyPI periodically, but if there is a feature in master that has not been built and pushed and you want it to be, just ping me.

Credit

This is a maintained fork of the original Python RAKE project, which can be found here: https://github.com/aneesha/RAKE

The Fox stopwords list was originally created by Christopher Fox: http://dl.acm.org/citation.cfm?id=378888

The Smart stopwords list was originally created by Gerard Salton and Chris Buckley for the experimental SMART information retrieval system at Cornell University.

The MySQL stopwords list is (surprisingly) from MySQL, owned and maintained by Oracle and under the GPL2 license.

The NLTK stopword list was created by the NLTK project under the Apache license; project here: https://github.com/nltk/nltk

The Ranks NL stopword lists were created by Ranks NL, who also compiled the Google Search stopword list and said via email that we could include them in this package if we credited them.


python-rake's Issues

Change Run Syntax

@fabianvf Why not do the pythonic thing and change the run syntax from

Rake = RAKE.Rake([path_to_your_stopwords_file]);
Rake.run(text);

to
Rake = RAKE.run(text,[path_to_your_stopwords_file]);
?
It uses a little less memory and is much cleaner.
I'll do it myself if you want.
Under the init method for the Rake class, you can just print that the syntax was cleaned up under the newest version and the new function, it shouldn't cause any confusion.

Add CSV Support

Per issue #18 discussion, once the current pull requests stuff is thoroughly done I want to add CSV support, but I'm not sure what the most pythonic way to do it is.

Rake.split_sentences(text) uses 'u' as separator

Hello,
I ran into an issue where the split_sentences(text) function uses 'u' as a separator. For instance:
text: "is an incredibly popular library and for good reason it s powerful fast"
sentences list: [u'is an incredibly pop', u'lar library and for good reason it s powerf', u'l fast']
I can certainly fix it in my environment, but I wonder what I did wrong and why nobody has hit this issue before.
My environment is Python 2.7; python-rake is installed with pip.
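The output above is exactly what splitting on the literal letter u would produce. One possible explanation, not confirmed anywhere in this thread, is that a \uXXXX escape inside the delimiter character class was not interpreted as a Unicode escape under Python 2, leaving a bare u in the class. The sketch below only reproduces the symptom with a hypothetical delimiter class; it is not the package's actual pattern.

```python
import re

text = "is an incredibly popular library and for good reason it s powerful fast"

# Hypothetical delimiter class that accidentally contains a literal 'u'.
broken = re.compile(r"[.!?,;:u]")
# The same hypothetical class without the stray 'u'.
fixed = re.compile(r"[.!?,;:]")

broken_result = broken.split(text)  # splits inside "popular" and "powerful"
fixed_result = fixed.split(text)    # no delimiters present, text stays whole

print(broken_result)
print(fixed_result)
```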

word with "-" inside not found

Hello,
I really like python-rake.
I have a question about a case I ran into today.
The phrase "document-word" occurs twice in the text (file doc_text.txt), but only "('document-', 1.0)" was found.

What I did:
from RAKE import Rake  # module python-rake
# r = Rake("SmartStoplist.txt")
r = Rake("SmartStoplist.txt", minCharacters=2, maxWords=3, minFrequency=2)
file = open("doc_text.txt", "r")
t = file.read()

# remove \n
a = t.split("\n")
up = " ".join(a)
z = r.run(up)
print(z)

Results (file result.txt): only ('document-', 1.0) was found.

I'm just worried about the process. Can you comment or suggest what to do?
test.zip

Adding an Asterisk * to StopWords

Is it possible to eliminate an asterisk from returned keywords?
I've tried adding it to the StopWords list, but I get an error:

sre_constants.error: nothing to repeat at position 9892

Is it acting as a wildcard in regex?

BTW, python-rake is a very helpful utility, and quick to implement.
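The error message is what Python's re module raises when a repetition operator such as * has nothing before it to repeat, which is consistent with stopwords being inserted into a regex unescaped. The sketch below assumes a stopword pattern built by joining \bword\b alternatives; build_stopword_pattern is a hypothetical helper and may differ from how python-rake builds its pattern.

```python
import re


def build_stopword_pattern(stopwords, escape=True):
    """Join stopwords into one alternation (hypothetical, for illustration)."""
    parts = []
    for word in stopwords:
        token = re.escape(word) if escape else word
        parts.append(r"\b" + token + r"\b")
    return re.compile("|".join(parts), re.IGNORECASE)


# Unescaped, the asterisk is read as a repetition operator and fails to compile.
try:
    build_stopword_pattern(["the", "*"], escape=False)
    error_message = None
except re.error as exc:
    error_message = str(exc)

# Escaped with re.escape, the same list compiles without error.
safe_pattern = build_stopword_pattern(["the", "*"])

# Stripping the character from the text beforehand sidesteps the regex entirely.
cleaned = " ".join("fast * keyword extraction".replace("*", " ").split())

print(error_message)
print(cleaned)
```

Note that a word-boundary pattern around a pure symbol rarely matches anything, so removing the asterisk from the input text before extraction may be the simpler route.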

Scaling to massive datasets

@fabianvf For a competition, I'm interested in finding the keywords of every single Wikipedia page. Any thoughts on changes I could make to this code base to make that plausible? I'm contemplating running this on PyPy, using Cython, or rewriting it in Go.

All phrases scored as 1.0?

Python 3.6 venv >> pip install python-rake

import RAKE
Rake = RAKE.Rake(RAKE.SmartStopList())

text="The initiating oncogenic event in almost half of human lung adenocarcinomas is still unknown, a fact that complicates the development of selective targeted therapies. Yet these tumours harbour a number of alterations without obvious oncogenic function including BRAF-inactivating mutations. Researchers at the Spanish National Cancer Research Centre (CNIO) have demonstrated that the expression of an endogenous Braf (D631A) kinase-inactive isoform in mice (corresponding to the human BRAF(D594A) mutation) triggers lung adenocarcinoma in vivo, indicating that BRAF-inactivating mutations are initiating events in lung oncogenesis. The paper, published in Nature, indicates that the signal intensity of the MAPK pathway is a critical determinant not only in tumour development, but also in dictating the nature of the cancer-initiating cell and ultimately the resulting tumour phenotype."

Rake.run(text)
[('mapk pathway', 1.0), ('still unknown', 1.0), ('initiating oncogenic event', 1.0), ('spanish national cancer research centre', 1.0), ('demonstrated', 1.0), ('dictating', 1.0), ('kinase-inactive isoform', 1.0), ('ultimately', 1.0), ('selective targeted therapies', 1.0), ('vivo', 1.0), ('researchers', 1.0), ('development', 1.0), ('tumours harbour', 1.0), ('yet', 1.0), ('mice', 1.0), ('braf-inactivating mutations', 1.0), ('tumour development', 1.0), ('expression', 1.0), ('indicating', 1.0), ('cancer-initiating cell', 1.0), ('triggers lung adenocarcinoma', 1.0), ('signal intensity', 1.0), ('critical determinant', 1.0), ('resulting tumour phenotype', 1.0), ('complicates', 1.0), ('corresponding', 1.0), ('lung oncogenesis', 1.0), ('human lung adenocarcinomas', 1.0), ('paper', 1.0), ('mutation', 1.0), ('published', 1.0), ('cnio', 1.0), ('d594a', 1.0), ('number', 1.0), ('initiating events', 1.0), ('d631a', 1.0), ('fact', 1.0), ('endogenous braf', 1.0), ('nature', 1.0), ('indicates', 1.0), ('almost half', 1.0), ('also', 1.0), ('human braf', 1.0)]

All scored as 1.0 (also on imported text, from a file); am I doing something wrong? Thanks ...


Additional tests:

text="Halifax, an Atlantic Ocean port in eastern Canada, is the provincial capital of Nova Scotia. A major business centre, it’s also known for its maritime history. The city’s dominated by the hilltop Citadel, a star-shaped fort completed in the 1850s. Waterfront warehouses known as the Historic Properties recall Halifax’s days as a trading hub for privateers, notably during the War of 1812. Halifax, legally known as the Halifax Regional Municipality (HRM), is the capital of the province of Nova Scotia, Canada. The municipality had a population of 403,131 in 2016, with 316,701 in the urban area centred on Halifax Harbour. The regional municipality consists of four former municipalities that were amalgamated in 1996: Halifax, Dartmouth, Bedford, and the Municipality of Halifax County. Halifax is a major economic centre in Atlantic Canada with a large concentration of government services and private sector companies. Major employers and economic generators include the Department of National Defence, Dalhousie University, Saint Mary's University, the Halifax Shipyard, various levels of government, and the Port of Halifax. Agriculture, fishing, mining, forestry and natural gas extraction are major resource industries found in the rural areas of the municipality. Halifax was ranked by MoneySense magazine as the fourth best place to live in Canada for 2012, placed first on a list of 'large cities by quality of life' and placed second in a list of 'large cities of the future', both conducted by fDi Magazine for North and South American cities. Additionally, Halifax has consistently placed in the top 10 for business friendliness of North and South American cities, as conducted by fDi Magazine. For a city with more pubs and clubs per capita than almost any city in Canada, it’s fitting that our most famous brewmaster was also our mayor. Three times. Alexander Keith’s original 1820 brewery welcomes visitors with costumed guides, stories and, of course, good ale. 
Walk across the street from Keith’s Brewery to the Halifax waterfront boardwalk that follows the water’s edge alongside the world’s second largest ice-free harbour. Stretching from the Canadian Museum of Immigration at Pier 21 – the gateway into Canada for over one million immigrants – to Casino Nova Scotia, you’ll pass unique shops, restaurants, and in the warmer months, graceful tall ships. Hop aboard the ferry, North America's longest running saltwater ferry, in fact, and cross the harbour to the Dartmouth side which is filled with more locally-owned shops, galleries, cafés, restaurants, and pubs. A visit to Halifax is not complete without trying the fabled donair, the offical food of Halifax. Become a soldier for a day at Halifax Citadel National Historic Site. Visit a 200-year-old restored fishing village at Fisherman’s Cove. Hear captivating sea stories from small to the Titanic at the Maritime Museum of the Atlantic. Discover the stories of over 1 million immigrants who landed in Halifax at Pier 21. Explore the new Halifax Cental Library, named as one of CNN's 10 eye-popping new buildings in 2014. Skate or bike The Emera Oval. The long-track speed skating oval on the Halifax Commons is an outdoor activity destination in summer and in winter. Stroll through the beautiful Victorian flower gardens and grounds at Halifax Public Gardens. Take in one of Canada’s best walks along the Halifax Waterfront. Be inspired by Atlantic Canada’s largest art collection at the Art Gallery of Nova Scotia. Ride the oldest running saltwater ferry service in North America (second oldest in the world) when you take the ferry between Dartmouth and Halifax. Experience the craftsmanship of Canada's only mouth-blown, hand-cut crystal maker, NovaScotian Crystal on the Halifax Waterfront. Venture to McNabs Island, located at the mouth of the Halifax Harbour, for secluded trails, a beautiful beach, and a historic fort. 
Explore the oldest continuously running farmers' market in North America at the Halifax Seaport Farmers' Market. Visit Alderney Landing on the Dartmouth Waterfront and peruse the shops, art gallery, community theatre, and restaurants. For the golfer - you have plenty of golfing choices to make while golfing in Halifax Metro."

  • I crafted the text string above, to have some degree of repetition.

Rake.run(text)
[('provincial capital', 1.0), ('national defence', 1.0), ('nova scotia', 1.0), ('skate', 1.0), ('street', 1.0), ('clubs per capita', 1.0), ('privateers', 1.0), ('population', 1.0), ('three times', 1.0), ('south american cities', 1.0), ('take', 1.0), ('galleries', 1.0), ('fisherman', 1.0), ('halifax regional municipality', 1.0), ('edge alongside', 1.0), ('fitting', 1.0), ('agriculture', 1.0), ('urban area centred', 1.0), ('business friendliness', 1.0), ('top 10', 1.0), ('plenty', 1.0), ('mining', 1.0), ('mcnabs island', 1.0), ('peruse', 1.0), ('complete without trying', 1.0), ('atlantic canada', 1.0), ('ll pass unique shops', 1.0), ('stretching', 1.0), ('large cities', 1.0), ('fishing', 1.0), ('future', 1.0), ('star-shaped fort completed', 1.0), ('maritime museum', 1.0), ('original 1820 brewery welcomes visitors', 1.0), ('hilltop citadel', 1.0), ('list', 1.0), ('walk across', 1.0), ('cnn', 1.0), ('quality', 1.0), ('filled', 1.0), ('ride', 1.0), ('notably', 1.0), ('days', 1.0), ('beautiful beach', 1.0), ('large concentration', 1.0), ('longest running saltwater ferry', 1.0), ('also', 1.0), ('largest art collection', 1.0), ('port', 1.0), ('restaurants', 1.0), ('market', 1.0), ('fourth best place', 1.0), ('gateway', 1.0), ('water', 1.0), ('alexander keith', 1.0), ('canadian museum', 1.0), ('various levels', 1.0), ('oldest continuously running farmers', 1.0), ('additionally', 1.0), ('named', 1.0), ('mouth-blown', 1.0), ('city', 1.0), ('department', 1.0), ('summer', 1.0), ('halifax harbour', 1.0), ('hear captivating sea stories', 1.0), ('placed first', 1.0), ('almost', 1.0), ('ferry', 1.0), ('follows', 1.0), ('novascotian crystal', 1.0), ('cross', 1.0), ('amalgamated', 1.0), ('cove', 1.0), ('beautiful victorian flower gardens', 1.0), ('new halifax cental library', 1.0), ('university', 1.0), ('10 eye-popping new buildings', 1.0), ('soldier', 1.0), ('course', 1.0), ('become', 1.0), ('1850s', 1.0), ('halifax public gardens', 1.0), ('winter', 1.0), ('1 million immigrants', 
1.0), ('warmer months', 1.0), ('maritime history', 1.0), ('craftsmanship', 1.0), ('brewery', 1.0), ('economic generators include', 1.0), ('capital', 1.0), ('hop aboard', 1.0), ('venture', 1.0), ('oldest running saltwater ferry service', 1.0), ('keith', 1.0), ('one million immigrants', 1.0), ('hand-cut crystal maker', 1.0), ('waterfront warehouses known', 1.0), ('day', 1.0), ('mayor', 1.0), ('halifax waterfront boardwalk', 1.0), ('municipality', 1.0), ('major economic centre', 1.0), ('regional municipality consists', 1.0), ('dartmouth waterfront', 1.0), ('historic properties recall halifax', 1.0), ('located', 1.0), ('government', 1.0), ('dartmouth', 1.0), ('canada', 1.0), ('halifax shipyard', 1.0), ('war', 1.0), ('also known', 1.0), ('forestry', 1.0), ('golfing', 1.0), ('trading hub', 1.0), ('best walks along', 1.0), ('explore', 1.0), ('dalhousie university', 1.0), ('halifax seaport farmers', 1.0), ('cafés', 1.0), ('harbour', 1.0), ('mouth', 1.0), ('legally known', 1.0), ('north america', 1.0), ('major business centre', 1.0), ('bike', 1.0), ('private sector companies', 1.0), ('hrm', 1.0), ('offical food', 1.0), ('titanic', 1.0), ('major employers', 1.0), ('visit', 1.0), ('small', 1.0), ('shops', 1.0), ('saint mary', 1.0), ('moneysense magazine', 1.0), ('natural gas extraction', 1.0), ('pubs', 1.0), ('bedford', 1.0), ('ranked', 1.0), ('locally-owned shops', 1.0), ('life', 1.0), ('discover', 1.0), ('province', 1.0), ('eastern canada', 1.0), ('stories', 1.0), ('stroll', 1.0), ('inspired', 1.0), ('immigration', 1.0), ('live', 1.0), ('famous brewmaster', 1.0), ('placed second', 1.0), ('visit alderney landing', 1.0), ('graceful tall ships', 1.0), ('halifax metro', 1.0), ('second largest ice-free harbour', 1.0), ('golfer', 1.0), ('outdoor activity destination', 1.0), ('pier 21', 1.0), ('dartmouth side', 1.0), ('art gallery', 1.0), ('world', 1.0), ('four former municipalities', 1.0), ('costumed guides', 1.0), ('fact', 1.0), ('long-track speed skating oval', 1.0), ('second 
oldest', 1.0), ('fabled donair', 1.0), ('secluded trails', 1.0), ('emera oval', 1.0), ('halifax commons', 1.0), ('north', 1.0), ('consistently placed', 1.0), ('fdi magazine', 1.0), ('good ale', 1.0), ('conducted', 1.0), ('atlantic', 1.0), ('government services', 1.0), ('historic fort', 1.0), ('community theatre', 1.0), ('200-year-old restored fishing village', 1.0), ('halifax county', 1.0), ('experience', 1.0), ('halifax citadel national historic site', 1.0), ('halifax', 1.0), ('make', 1.0), ('grounds', 1.0), ('landed', 1.0), ('one', 1.0), ('atlantic ocean port', 1.0), ('golfing choices', 1.0), ('halifax waterfront', 1.0), ('rural areas', 1.0), ('casino nova scotia', 1.0), ('major resource industries found', 1.0), ('dominated', 1.0), ('403', 0), ('2014', 0), ('131', 0), ('2012', 0), ('1996', 0), ('2016', 0), ('1812', 0), ('701', 0), ('316', 0)]

Rake = RAKE.Rake(RAKE.NLTKStopList())

Rake.run(text)
[('provincial capital', 1.0), ('national defence', 1.0), ('nova scotia', 1.0), ('skate', 1.0), ('street', 1.0), ('clubs per capita', 1.0), ('privateers', 1.0), ('population', 1.0), ('three times', 1.0), ('south american cities', 1.0), ('take', 1.0), ('galleries', 1.0), ('fisherman', 1.0), ('halifax regional municipality', 1.0), ('edge alongside', 1.0), ('fitting', 1.0), ('agriculture', 1.0), ('urban area centred', 1.0), ('business friendliness', 1.0), ('top 10', 1.0), ('plenty', 1.0), ('mining', 1.0), ('mcnabs island', 1.0), ('peruse', 1.0), ('complete without trying', 1.0), ('atlantic canada', 1.0), ('ll pass unique shops', 1.0), ('stretching', 1.0), ('large cities', 1.0), ('fishing', 1.0), ('future', 1.0), ('star-shaped fort completed', 1.0), ('maritime museum', 1.0), ('original 1820 brewery welcomes visitors', 1.0), ('hilltop citadel', 1.0), ('list', 1.0), ('walk across', 1.0), ('cnn', 1.0), ('quality', 1.0), ('filled', 1.0), ('ride', 1.0), ('notably', 1.0), ('days', 1.0), ('beautiful beach', 1.0), ('large concentration', 1.0), ('longest running saltwater ferry', 1.0), ('also', 1.0), ('largest art collection', 1.0), ('port', 1.0), ('restaurants', 1.0), ('market', 1.0), ('fourth best place', 1.0), ('gateway', 1.0), ('water', 1.0), ('alexander keith', 1.0), ('canadian museum', 1.0), ('various levels', 1.0), ('oldest continuously running farmers', 1.0), ('additionally', 1.0), ('named', 1.0), ('mouth-blown', 1.0), ('city', 1.0), ('department', 1.0), ('summer', 1.0), ('halifax harbour', 1.0), ('hear captivating sea stories', 1.0), ('placed first', 1.0), ('almost', 1.0), ('ferry', 1.0), ('follows', 1.0), ('novascotian crystal', 1.0), ('cross', 1.0), ('amalgamated', 1.0), ('cove', 1.0), ('beautiful victorian flower gardens', 1.0), ('new halifax cental library', 1.0), ('university', 1.0), ('10 eye-popping new buildings', 1.0), ('soldier', 1.0), ('course', 1.0), ('become', 1.0), ('1850s', 1.0), ('halifax public gardens', 1.0), ('winter', 1.0), ('1 million immigrants', 
1.0), ('warmer months', 1.0), ('maritime history', 1.0), ('craftsmanship', 1.0), ('brewery', 1.0), ('economic generators include', 1.0), ('capital', 1.0), ('hop aboard', 1.0), ('venture', 1.0), ('oldest running saltwater ferry service', 1.0), ('keith', 1.0), ('one million immigrants', 1.0), ('hand-cut crystal maker', 1.0), ('waterfront warehouses known', 1.0), ('day', 1.0), ('mayor', 1.0), ('halifax waterfront boardwalk', 1.0), ('municipality', 1.0), ('major economic centre', 1.0), ('regional municipality consists', 1.0), ('dartmouth waterfront', 1.0), ('historic properties recall halifax', 1.0), ('located', 1.0), ('government', 1.0), ('dartmouth', 1.0), ('canada', 1.0), ('halifax shipyard', 1.0), ('war', 1.0), ('also known', 1.0), ('forestry', 1.0), ('golfing', 1.0), ('trading hub', 1.0), ('best walks along', 1.0), ('explore', 1.0), ('dalhousie university', 1.0), ('halifax seaport farmers', 1.0), ('cafés', 1.0), ('harbour', 1.0), ('mouth', 1.0), ('legally known', 1.0), ('north america', 1.0), ('major business centre', 1.0), ('bike', 1.0), ('private sector companies', 1.0), ('hrm', 1.0), ('offical food', 1.0), ('titanic', 1.0), ('major employers', 1.0), ('visit', 1.0), ('small', 1.0), ('shops', 1.0), ('saint mary', 1.0), ('moneysense magazine', 1.0), ('natural gas extraction', 1.0), ('pubs', 1.0), ('bedford', 1.0), ('ranked', 1.0), ('locally-owned shops', 1.0), ('life', 1.0), ('discover', 1.0), ('province', 1.0), ('eastern canada', 1.0), ('stories', 1.0), ('stroll', 1.0), ('inspired', 1.0), ('immigration', 1.0), ('live', 1.0), ('famous brewmaster', 1.0), ('placed second', 1.0), ('visit alderney landing', 1.0), ('graceful tall ships', 1.0), ('halifax metro', 1.0), ('second largest ice-free harbour', 1.0), ('golfer', 1.0), ('outdoor activity destination', 1.0), ('pier 21', 1.0), ('dartmouth side', 1.0), ('art gallery', 1.0), ('world', 1.0), ('four former municipalities', 1.0), ('costumed guides', 1.0), ('fact', 1.0), ('long-track speed skating oval', 1.0), ('second 
oldest', 1.0), ('fabled donair', 1.0), ('secluded trails', 1.0), ('emera oval', 1.0), ('halifax commons', 1.0), ('north', 1.0), ('consistently placed', 1.0), ('fdi magazine', 1.0), ('good ale', 1.0), ('conducted', 1.0), ('atlantic', 1.0), ('government services', 1.0), ('historic fort', 1.0), ('community theatre', 1.0), ('200-year-old restored fishing village', 1.0), ('halifax county', 1.0), ('experience', 1.0), ('halifax citadel national historic site', 1.0), ('halifax', 1.0), ('make', 1.0), ('grounds', 1.0), ('landed', 1.0), ('one', 1.0), ('atlantic ocean port', 1.0), ('golfing choices', 1.0), ('halifax waterfront', 1.0), ('rural areas', 1.0), ('casino nova scotia', 1.0), ('major resource industries found', 1.0), ('dominated', 1.0), ('403', 0), ('2014', 0), ('131', 0), ('2012', 0), ('1996', 0), ('2016', 0), ('1812', 0), ('701', 0), ('316', 0)]

  • Updates:
    • same (basically) in Python 2.7 venv: all values 1.0, with numbers (integers, dates) scoring as 0.
    • same, in host (Python 3.6.3) env, RAKE installed via sudo pip3 install python-rake
    • OS: Arch Linux x86_64
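For comparison, the RAKE paper scores each word as degree(word) / frequency(word) and each phrase as the sum of its word scores, so multi-word phrases would normally score above 1.0. The sketch below is a simplified standalone version of that scoring, written under the assumption of a very basic tokenizer; rake_scores is a hypothetical function and is not the package's implementation.

```python
import re
from collections import defaultdict


def rake_scores(text, stopwords):
    """Score candidate phrases with deg(w)/freq(w), summed per phrase (sketch)."""
    stop = {w.lower() for w in stopwords}
    phrases = []
    # Split into sentence-like chunks, then cut each chunk at stopwords.
    for chunk in re.split(r"[.!?,;:\t\n()\"]+", text.lower()):
        current = []
        for word in re.findall(r"[\w-]+", chunk):
            if word in stop:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself

    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rake_scores("lung cancer research. cancer is bad", ["is"])
print(ranked)
```

If every phrase comes back as 1.0 regardless of length, that suggests the degree or frequency step is not behaving as described in the paper for that installation.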

Would be nice for the example usage to show how to get to the included stoplists

At the moment the example just says:

import RAKE
Rake = RAKE.Rake([path_to_your_stopwords_file]);
# You can use one of the stoplists included in the repository under stoplists/
Rake.run(text);

But as a naive user installing this using pip - I've no idea where those stoplists have been installed to. Or even if they've actually been installed.

Would be nice to have this work as simply as:

import RAKE
Rake = RAKE.Rake(RAKE.FoxStopList);
Rake.run(text);

Badly need tests

Not even fancy ones, just something that runs at least a happy path with a small amount of text to make sure everything is valid.

Filter results by word/phrase category

Love the package! I was wondering if there is a way to return only results that fit one of NLTK's "Parts of Speech"? I'd love to contribute, but sadly my knowledge regarding the inner workings of NLTK is a bit lacking (at the moment!).

As an example, I'm hoping to use some NLTK/RAKE type package to automatically assign keywords to blog posts. After running a few test scenarios using the current package, I find that some of the higher-ranked keywords returned by python-rake are, say, adjectives, adverbs, etc., where I'm really hoping for nouns, noun phrases, possibly even proper nouns.

I feel like this feature would benefit anyone using the package, but I could be wrong. Thoughts?

Unexpected results for german text with umlauts

The following small sample program demonstrates the problem

import RAKE
Rake = RAKE.Rake(['da']);
print(Rake.run(u'und da\xdfselbe nochmal'))

as it returns [(u'\xdfselbe nochmal', 4.0), (u'und', 1.0)]

Tested with python 2.7.6

The issue seems to be that the regexes for word splitting and stopword removal are not Unicode-aware.
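The difference can be reproduced with the standard re module alone. In Python 3, str patterns are Unicode-aware by default, and the re.ASCII flag restricts \w and \b to ASCII, which is roughly how they behave on Python 2 when re.UNICODE is not set. The sketch below uses that flag to mimic the reported behaviour; the stopword pattern shown is an assumption, not the package's code.

```python
import re

text = "und da\xdfselbe nochmal"  # "und daßselbe nochmal"

# Unicode-aware: 'ß' counts as a word character, so "da" is not a whole word here.
unicode_stopword = re.compile(r"\bda\b")
# ASCII-only: 'ß' is treated as a non-word character, so "da" matches as a word.
ascii_stopword = re.compile(r"\bda\b", re.ASCII)

unicode_match = unicode_stopword.search(text)
ascii_match = ascii_stopword.search(text)

# The same effect shows up when splitting the text into words.
unicode_words = re.split(r"\W+", text)
ascii_words = re.split(r"\W+", text, flags=re.ASCII)

print(unicode_match, ascii_match)
print(unicode_words, ascii_words)
```

On Python 2, passing re.UNICODE when compiling the patterns is the usual way to get the Unicode-aware behaviour.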

[Python3.5] AttributeError: 'dict' object has no attribute 'iteritems'

The above issue occurs when trying to run the following code:

import RAKE
import os

rake = RAKE.Rake(os.path.dirname(__file__) + '/../stoplist/japanese.stop')
text = 'これそれあれこのそのあの'
keywords = rake.run(text)
print(keywords)

Here's the stacktrace:

Traceback (most recent call last):
  File "/vagrant/workspace/bambooshoot-ads/analytics/core/keyword.py", line 6, in <module>
    keywords = rake.run(text)
  File "/home/vagrant/.pyenv/versions/3.5.0/lib/python3.5/site-packages/RAKE/RAKE.py", line 132, in run
    sorted_keywords = sorted(keyword_candidates.iteritems(), key=operator.itemgetter(1), reverse=True)
AttributeError: 'dict' object has no attribute 'iteritems'

Appears to be an issue in the original code, as you can see here: aneesha/RAKE#9

I've attempted the above fix without success.
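For context, dict.iteritems() was removed in Python 3, and dict.items() is the replacement that works on both Python 2 and 3. The sketch below applies that change to the sorting line from the traceback; the dictionary contents are made-up sample scores for illustration.

```python
import operator

# Made-up candidate scores standing in for the real keyword_candidates dict.
keyword_candidates = {"mapk pathway": 4.0, "paper": 1.0, "signal intensity": 4.5}

# items() replaces iteritems(); the rest of the line is unchanged.
sorted_keywords = sorted(
    keyword_candidates.items(), key=operator.itemgetter(1), reverse=True
)
print(sorted_keywords)
```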

Imports broken in python3

As per discussion in #9, the code

import RAKE

RAKE.SmartStopList()

works in python 2 but not in python3

@justinkterry working on it, forgot the import system changed so much, didn't think to look there. My fault.

README.md Issue With Python 3.5

When I sudo pip3 install python-rake on Ubuntu Gnome 16.04.2 (v1.0.7 according to pip3 freeze) and use IPython 3.5, the README.md file's usage instructions seem to be wrong.

Instructions:

import RAKE
Rake = RAKE.Rake([path_to_your_stopwords_file]);
# You can use one of the stoplists included in the repository under stoplists/
Rake.run(text);

What I tried:

import RAKE
rake = RAKE.Rake('/home/justin/Documents/python_rake/stoplists/SmartStoplist.txt')
text = "Compatibility of systems of linear constraints over the set of natural numbers."
rake.run(text)

Output:

runfile('/home/justin/Documents/rake test.py', wdir='/home/justin/Documents')
Traceback (most recent call last):

  File "<ipython-input-42-c9f95de7800a>", line 1, in <module>
    runfile('/home/justin/Documents/rake test.py', wdir='/home/justin/Documents')

  File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

  File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/home/justin/Documents/rake test.py", line 3, in <module>
    rake = RAKE.Rake('/home/justin/Documents/python_rake/stoplists/SmartStoplist.txt')

AttributeError: module 'RAKE' has no attribute 'Rake'

How can I get the resulting keywords?

It's nicely explained how to use stopword lists, but how can I get the keywords?

Instead of what I would have expected, the result is an array of floats like ['4.0', '9.0', '1.0']:

rake = RAKE.Rake("stopwords-de.txt")
res = rake.run(text, minCharacters = 3, maxWords = 6, minFrequency = 1)
print(res)

Maybe you could add a few words about that to the README.
Thanks
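Judging by the outputs posted in the other issues on this page, run returns a list of (keyword, score) tuples sorted by score. Assuming that shape, the keywords can be pulled out with a list comprehension. The sample result below is copied from the German umlaut issue; the 2.0 threshold is arbitrary.

```python
# Sample result in the (keyword, score) shape shown in other issues on this page.
results = [("\xdfselbe nochmal", 4.0), ("und", 1.0)]

# Keep only the keyword strings, dropping the scores.
keywords = [phrase for phrase, score in results]

# Or keep only keywords at or above an arbitrary score threshold.
top_keywords = [phrase for phrase, score in results if score >= 2.0]

print(keywords)
print(top_keywords)
```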

Import error under Python 3.5

Hi there. I'm new to Python, and was hoping to use this module for a prototype I'm working on. Unfortunately, after installing with pip, I get this error in the interpreter when trying to import:

import RAKE
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/13tales/notesproto/lib/python3.5/site-packages/RAKE/__init__.py", line 1, in <module>
    from RAKE import Rake
ImportError: cannot import name 'Rake'

Any suggestions on how I can resolve this?
