maartengr / keybert

Minimal keyword extraction with BERT

Home Page: https://MaartenGr.github.io/KeyBERT/

License: MIT License

Languages: Python 99.65%, Makefile 0.35%
Topics: keyword-extraction, keyphrase-extraction, bert, mmr

keybert's Introduction


KeyBERT

KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

The corresponding Medium post can be found here.

Table of Contents

  1. About the Project
  2. Getting Started
    2.1. Installation
    2.2. Basic Usage
    2.3. Max Sum Distance
    2.4. Maximal Marginal Relevance
    2.5. Embedding Models
  3. Large Language Models

1. About the Project

Back to ToC

Although there are already many methods available for keyword generation (e.g., Rake, YAKE!, TF-IDF, etc.), I wanted to create a very basic, but powerful, method for extracting keywords and keyphrases. This is where KeyBERT comes in: it uses BERT embeddings and simple cosine similarity to find the sub-phrases in a document that are most similar to the document itself.

First, document embeddings are extracted with BERT to get a document-level representation. Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity to find the words/phrases that are the most similar to the document. The most similar words could then be identified as the words that best describe the entire document.
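To make the pipeline concrete, a minimal sketch of these three steps is shown below, using sentence-transformers and scikit-learn directly rather than KeyBERT itself (the model name "all-MiniLM-L6-v2" and the n-gram range are illustrative choices, not a fixed part of the method):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

doc = "Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs."

# 1. Candidate n-grams from the document
count = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit([doc])
candidates = count.get_feature_names_out().tolist()

# 2. Embed the document and the candidate words/phrases
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(candidates)

# 3. Rank candidates by cosine similarity to the document
similarities = cosine_similarity(candidate_embeddings, doc_embedding).flatten()
keywords = [candidates[i] for i in similarities.argsort()[-5:][::-1]]
print(keywords)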

KeyBERT is by no means unique and was created as a quick and easy method for creating keywords and keyphrases. Although there are many great papers and solutions out there that use BERT embeddings (e.g., 1, 2, 3), I could not find a BERT-based solution that did not have to be trained from scratch and could easily be used by beginners (correct me if I'm wrong!). Thus, the goal was a pip install keybert and at most 3 lines of code in usage.

2. Getting Started

Back to ToC

2.1. Installation

Installation can be done via PyPI:

pip install keybert

You may want to install additional dependencies depending on the transformer and language backends that you will be using. The possible installations are:

pip install keybert[flair]
pip install keybert[gensim]
pip install keybert[spacy]
pip install keybert[use]

2.2. Basic Usage

The most minimal example can be seen below for the extraction of keywords:

from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
      """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)

You can set keyphrase_ngram_range to set the length of the resulting keywords/keyphrases:

>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
 ('algorithm', 0.4556),
 ('training', 0.4487),
 ('class', 0.4086),
 ('mapping', 0.3700)]

To extract keyphrases, simply set keyphrase_ngram_range to (1, 2) or higher depending on the number of words you would like in the resulting keyphrases:

>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
[('learning algorithm', 0.6978),
 ('machine learning', 0.6305),
 ('supervised learning', 0.5985),
 ('algorithm analyzes', 0.5860),
 ('learning function', 0.5850)]

We can highlight the keywords in the document by simply setting highlight=True:

keywords = kw_model.extract_keywords(doc, highlight=True)

NOTE: For a full overview of all possible transformer models, see sentence-transformers. I would advise either "all-MiniLM-L6-v2" for English documents or "paraphrase-multilingual-MiniLM-L12-v2" for multilingual documents or any other language.

2.3. Max Sum Distance

To diversify the results, we take the 2 x top_n most similar words/phrases to the document. Then, we take all top_n combinations from those 2 x top_n words and extract the combination whose members are least similar to each other by cosine similarity.

>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_maxsum=True, nr_candidates=20, top_n=5)
[('set training examples', 0.7504),
 ('generalize training data', 0.7727),
 ('requires learning algorithm', 0.5050),
 ('supervised learning algorithm', 0.3779),
 ('learning machine learning', 0.2891)]
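For illustration, the candidate selection described above could be sketched as follows, assuming precomputed document and candidate embeddings (variable and function names are illustrative, not KeyBERT's internals):

import itertools
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def max_sum_distance(doc_embedding, word_embeddings, words, top_n=5, nr_candidates=20):
    # Similarity of each candidate to the document and between candidates
    doc_sim = cosine_similarity(word_embeddings, doc_embedding.reshape(1, -1)).flatten()
    word_sim = cosine_similarity(word_embeddings)

    # Keep the nr_candidates most similar candidates (2 x top_n in the description above)
    candidate_idx = np.argsort(doc_sim)[-nr_candidates:]

    # Of all top_n combinations, pick the one whose members are least similar to each other
    best_combo, lowest_sim = None, np.inf
    for combo in itertools.combinations(candidate_idx, top_n):
        sim = sum(word_sim[i][j] for i, j in itertools.combinations(combo, 2))
        if sim < lowest_sim:
            best_combo, lowest_sim = combo, sim
    return [(words[i], round(float(doc_sim[i]), 4)) for i in best_combo]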

2.4. Maximal Marginal Relevance

To diversify the results, we can use Maximal Marginal Relevance (MMR) to create keywords/keyphrases, which is also based on cosine similarity. The results with high diversity:

>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_mmr=True, diversity=0.7)
[('algorithm generalize training', 0.7727),
 ('labels unseen instances', 0.1649),
 ('new examples optimal', 0.4185),
 ('determine class labels', 0.4774),
 ('supervised learning algorithm', 0.7502)]

The results with low diversity:

>>> kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english',
                              use_mmr=True, diversity=0.2)
[('algorithm generalize training', 0.7727),
 ('supervised learning algorithm', 0.7502),
 ('learning machine learning', 0.7577),
 ('learning algorithm analyzes', 0.7587),
 ('learning algorithm generalize', 0.7514)]
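The MMR selection itself can be sketched roughly as follows, again assuming precomputed embeddings (a simplified illustration rather than KeyBERT's actual code): start with the candidate most similar to the document, then repeatedly add the candidate that balances relevance to the document against redundancy with what has already been selected, controlled by diversity.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr(doc_embedding, word_embeddings, words, top_n=5, diversity=0.7):
    doc_sim = cosine_similarity(word_embeddings, doc_embedding.reshape(1, -1)).flatten()
    word_sim = cosine_similarity(word_embeddings)

    # Start with the candidate most similar to the document
    selected = [int(np.argmax(doc_sim))]
    candidates = [i for i in range(len(words)) if i not in selected]

    for _ in range(top_n - 1):
        # Trade off relevance to the document against similarity to already-selected keywords
        relevance = doc_sim[candidates]
        redundancy = np.max(word_sim[np.ix_(candidates, selected)], axis=1)
        scores = (1 - diversity) * relevance - diversity * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)

    return [(words[i], round(float(doc_sim[i]), 4)) for i in selected]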

2.5. Embedding Models

KeyBERT supports many embedding models that can be used to embed the documents and words:

  • Sentence-Transformers
  • Flair
  • Spacy
  • Gensim
  • USE

Click here for a full overview of all supported embedding models.

Sentence-Transformers
You can select any model from sentence-transformers here and pass it through KeyBERT with model:

from keybert import KeyBERT
kw_model = KeyBERT(model='all-MiniLM-L6-v2')

Or select a SentenceTransformer model with your own parameters:

from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)

Flair
Flair allows you to choose almost any embedding model that is publicly available. Flair can be used as follows:

from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
kw_model = KeyBERT(model=roberta)

You can select any 🤗 transformers model here.

3. Large Language Models

Back to ToC

With KeyLLM you can now perform keyword extraction with Large Language Models (LLMs). You can find the full documentation here, but there are two examples that are common with this new method. Make sure to install the OpenAI package through pip install openai before you start.

First, we can ask OpenAI directly to extract keywords:

import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

# Create your LLM
client = openai.OpenAI(api_key=MY_API_KEY)
llm = OpenAI(client)

# Load it in KeyLLM
kw_model = KeyLLM(llm)

This will query any ChatGPT model and ask it to extract keywords from text.
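A minimal call would then look like the line below (assuming MY_DOCUMENTS is a list of strings; the exact output depends on the prompt and model used):

keywords = kw_model.extract_keywords(MY_DOCUMENTS)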

Second, we can find documents that are likely to have the same keywords and only extract keywords for those. This is much more efficient than asking for the keywords of every single document, as there are likely documents that share the exact same keywords. Doing so is straightforward:

import openai
from keybert.llm import OpenAI
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer

# Extract embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(MY_DOCUMENTS, convert_to_tensor=True)

# Create your LLM
client = openai.OpenAI(api_key=MY_API_KEY)
llm = OpenAI(client)

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(MY_DOCUMENTS, embeddings=embeddings, threshold=.75)

You can use the threshold parameter to decide how similar documents need to be in order to receive the same keywords.
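Conceptually, the threshold acts on the cosine similarity between document embeddings: documents whose embeddings are at least that similar are grouped together, and keywords only need to be generated once per group. A rough sketch of that grouping idea, reusing embeddings and MY_DOCUMENTS from the snippet above (an illustration of the concept, not KeyLLM's actual implementation):

from sklearn.metrics.pairwise import cosine_similarity

threshold = 0.75
similarity = cosine_similarity(embeddings.cpu().numpy())

groups, assigned = [], set()
for i in range(len(MY_DOCUMENTS)):
    if i in assigned:
        continue
    # Documents whose embeddings are at least `threshold`-similar share one group
    group = [j for j in range(len(MY_DOCUMENTS)) if similarity[i][j] >= threshold and j not in assigned]
    assigned.update(group)
    groups.append(group)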

Citation

To cite KeyBERT in your work, please use the following bibtex reference:

@misc{grootendorst2020keybert,
  author       = {Maarten Grootendorst},
  title        = {KeyBERT: Minimal keyword extraction with BERT.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.3.0},
  doi          = {10.5281/zenodo.4461265},
  url          = {https://doi.org/10.5281/zenodo.4461265}
}

References

Below, you can find several resources that were used for the creation of KeyBERT; most importantly, these are amazing resources for creating impressive keyword extraction models:

Papers:

Github Repos:

MMR: The selection of keywords/keyphrases was modeled after:

NOTE: If you find a paper or github repo that has an easy-to-use implementation of BERT-embeddings for keyword/keyphrase extraction, let me know! I'll make sure to add a reference to this repo.

keybert's People

Contributors

adhadse, artmatsak, igor-pechersky, koaning, kunihik0, lucafirefox, maartengr, mabhay3420, priyanshul-govil, sam-frampton, shengbo-ma, yusuke1997


keybert's Issues

Dutch

Hi Maarten,

As you are Dutch, what is your take on Dutch embeddings? What is your experience with Dutch models (for example, which model gives the best results for KeyBERT)?
BERTje is the Dutch BERT model, but for sentence embeddings there is no trained BERTje sentence model. Did you experiment with it? I am really interested in your experience.

App with KeyBERT installed keeps crashing.

Hi Maarten,

I've deployed a Streamlit app which relies on KeyBERT (2 embedding models are currently being used: Flair and DistilBERT).

The app regularly crashes with the following logtrace (please see below).

I'd suspect the RAM limit being reached on my app, but could it be anything else related to KeyBERT or one of the Hugging Face models?

Many thanks,
Charly

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s][2021-07-20 07:07:37.718555] 
Downloading:  49%|████▊     | 113k/232k [00:00<00:00, 1.04MB/s][2021-07-20 07:07:37.739656] 
Downloading: 100%|██████████| 232k/232k [00:00<00:00, 1.80MB/s][2021-07-20 07:07:37.740166] 
2021-07-20 07:07:37.739 storing https://huggingface.co/sentence-transformers/distilbert-base-nli-mean-tokens/resolve/535a778226b02d76f0cae7cab98e435c58572fec/vocab.txt in cache at /home/appuser/.cache/torch/sentence_transformers/sentence-transformers__distilbert-base-nli-mean-tokens.535a778226b02d76f0cae7cab98e435c58572fec/vocab.txt
2021-07-20 07:07:37.739 Lock 140451516694608 released on /home/appuser/.cache/torch/sentence_transformers/sentence-transformers__distilbert-base-nli-mean-tokens.535a778226b02d76f0cae7cab98e435c58572fec/vocab.txt.lock
2021-07-20 07:07:38.033 Lock 140451516715664 acquired on /home/appuser/.cache/torch/sentence_transformers/sentence-transformers__distilbert-base-nli-mean-tokens.535a778226b02d76f0cae7cab98e435c58572fec/1_Pooling/config.json.lock
2021-07-20 07:07:38.034 downloading https://huggingface.co/sentence-transformers/distilbert-base-nli-mean-tokens/resolve/535a778226b02d76f0cae7cab98e435c58572fec/1_Pooling/config.json to /home/appuser/.cache/torch/sentence_transformers/sentence-transformers__distilbert-base-nli-mean-tokens.535a778226b02d76f0cae7cab98e435c58572fec/tmpzdm288m_

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s][2021-07-20 07:07:38.329461] 
Downloading: 100%|██████████| 190/190 [00:00<00:00, 128kB/s][2021-07-20 07:07:38.329779] 
2021-07-20 07:07:38.329 storing https://huggingface.co/sentence-transformers/distilbert-base-nli-mean-tokens/resolve/535a778226b02d76f0cae7cab98e435c58572fec/1_Pooling/config.json in cache at /home/appuser/.cache/torch/sentence_transformers/sentence-transformers__distilbert-base-nli-mean-tokens.535a778226b02d76f0cae7cab98e435c58572fec/1_Pooling/config.json
2021-07-20 07:07:38.329 Lock 140451516715664 released on /home/appuser/.cache/torch/sentence_transformers/sentence-transformers__distilbert-base-nli-mean-tokens.535a778226b02d76f0cae7cab98e435c58572fec/1_Pooling/config.json.lock
2021-07-20 07:07:38.941 Use pytorch device: cpu
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
2021-07-20 07:29:22.655 Load pretrained SentenceTransformer: distilbert-base-nli-mean-tokens
2021-07-20 07:29:24.144 Use pytorch device: cpu
[manager] Error checking Streamlit healthz: Get "http://localhost:8501/healthz": dial tcp 127.0.0.1:8501: connect: connection refused

Could I use .pt file?

I need to use a .pt file (BERT model), not a .bin file.
Can I use a .pt file with sentence-transformers?

Obtained keywords

Hi,
in the documentation it seems that the returned keywords are tuples with keyword value and weight:

kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
[('learning', 0.4604),
 ('algorithm', 0.4556),
 ('training', 0.4487),
 ('class', 0.4086),
 ('mapping', 0.3700)]

However, when I run it I get just a list of keywords:

['fermi', 'tungsten', 'superconducting', 'superconductors', ...

Are these keywords ordered by any criteria? I mean, if I fetch the top 5 but then I'm interested in the most "important" one, may I just pick the first?

Regards
Luca

How to cite this paper?

First of all, thanks for this excellent repo.
Any BibTeX for this repo? I want to cite this repo in my paper.
Thanks!

keyphrase_length is not working

Hello,
I just copied the code from your readme and got the following error:

File "/Users/vladimir/Work/python/foo/main.py", line 35, in
keywords = model.extract_keywords(doc, keyphrase_length=3, stop_words=None)
TypeError: extract_keywords() got an unexpected keyword argument 'keyphrase_length'

Confused about performance on Chinese text

I wonder why, when I use the multilingual model for extracting keywords from Chinese text with the n-gram range set to (1, 1), the result is several words rather than one word? Thanks!

min_df and vectorizer parameters

Maarten thanks for this interesting work.

When trying out extract_keywords I have experimented with different values for the min_df parameter. According to the sklearn documentation, it seems to control stop words based on their frequency of occurrence in the documents. However, in my experiment there is no observable difference between the results when using a value of 10 or 100. Am I missing something, or does the parameter actually have a different meaning?

On the other hand, is there a reason why the corresponding max_df parameter is not available?

Another parameter is vectorizer, which seems to allow a pre-instantiated CountVectorizer to be passed in for modelling. Can you please share the typical scenario in which this would be useful? In particular, is it possible to add domain-specific vocabulary to the vectorizer, such that the tokens, and thus the keyword candidates, derived by the model will be domain specific?

Thanks.

Return weights/scores for keywords?

Hi Maarten
I was wondering if it is possible to return, for each keyword, a kind of relevancy score/weight that could be computed using the cosine distance?
In that case the return type of extract_keywords would change from List[str] to List[Tuple[str, float]].

What do you think?

Best regards
Olivier Terrier

installation error

I am having trouble installing keybert using the command 'pip install keybert'.
Below is the error I got.

Collecting transformers<5.0.0,>=3.1.0
Using cached transformers-4.6.1-py3-none-any.whl (2.2 MB)
Collecting tqdm
Using cached tqdm-4.61.1-py2.py3-none-any.whl (75 kB)
Collecting torch>=1.6.0
Using cached torch-1.7.0-cp36-cp36m-win_amd64.whl (184.0 MB)
ERROR: torch has an invalid wheel, .dist-info directory not found

Add Max Sum Similarity

Instead of MMR, we can use Max Sum Similarity to extract the keywords that are similar to the document, but different from each other to maximize diversity.

cp950' codec can't decode byte 0xf0 in position 8324

      File "C:\Users\tinlok\AppData\Local\Temp\pip-req-build-hnmzzea8\setup.py", line 29, in <module>
        long_description = fh.read()
    UnicodeDecodeError: 'cp950' codec can't decode byte 0xf0 in position 8324: illegal multibyte sequence

Please modify line 28 in setup.py to:

with open("README.md", "r", encoding='utf-8-sig') as fh:
    long_description = fh.read()

Differences between KeyBERT and BERTopic

Hi,
thanks for sharing these projects, super neat work!

I just wanted to ask what the main differences are between KeyBERT and BERTopic.
The two approaches may look similar, as one of the approaches in BERTopic could perhaps be applied to recreate KeyBERT exactly.

In which case should I use one instead of the other, in your opinion?
Thanks!

Passing in pre-made document embeddings?

I would like to leverage some of the infrastructure already engineered in KeyBERT for finding keyphrases. However, for downstream modeling tasks we've already built embeddings for each of our documents. Is there a way to pass these embeddings in and skip that step of the KeyBERT pipeline to try and save on processing time?

Demo doesn't work

Hey, I am interested in trying out the demo, but it gives an error:
doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs.[1] It infers a
function from labeled training data consisting of a set of training examples.[2]
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
model = KeyBERT('distilbert-base-nli-mean-tokens')
keywords = model.extract_keywords(doc)


TypeError Traceback (most recent call last)
in
14 """
15 model = KeyBERT('distilbert-base-nli-mean-tokens')
---> 16 keywords = model.extract_keywords(doc)

~/anaconda3/lib/python3.7/site-packages/keybert/model.py in extract_keywords(self, docs, keyphrase_length, stop_words, top_n, min_df, use_maxsum, use_mmr, diversity, nr_candidates)
90 use_mmr,
91 diversity,
---> 92 nr_candidates)
93 elif isinstance(docs, list):
94 warnings.warn("Although extracting keywords for multiple documents is faster "

~/anaconda3/lib/python3.7/site-packages/keybert/model.py in _extract_keywords_single_doc(self, doc, keyphrase_length, stop_words, top_n, use_maxsum, use_mmr, diversity, nr_candidates)
129 # Extract Words
130 n_gram_range = (keyphrase_length, keyphrase_length)
--> 131 count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit([doc])
132 words = count.get_feature_names()
133

~/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in fit(self, raw_documents, y)
1163 """
1164 self._warn_for_unused_params()
-> 1165 self.fit_transform(raw_documents)
1166 return self
1167

~/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
1218 max_doc_count,
1219 min_doc_count,
-> 1220 max_features)
1221 if max_features is None:
1222 X = self._sort_features(X, vocabulary)

~/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in _limit_features(self, X, vocabulary, high, low, limit)
1088 raise ValueError("After pruning, no terms remain. Try a lower"
1089 " min_df or a higher max_df.")
-> 1090 return X[:, kept_indices], removed_terms
1091
1092 def _count_vocab(self, raw_documents, fixed_vocab):

~/anaconda3/lib/python3.7/site-packages/scipy/sparse/_index.py in getitem(self, key)
33 """
34 def getitem(self, key):
---> 35 row, col = self._validate_indices(key)
36 # Dispatch to specialized methods.
37 if isinstance(row, INT_TYPES):

~/anaconda3/lib/python3.7/site-packages/scipy/sparse/_index.py in _validate_indices(self, key)
146 col += N
147 elif not isinstance(col, slice):
--> 148 col = self._asindices(col, N)
149
150 return row, col

~/anaconda3/lib/python3.7/site-packages/scipy/sparse/_index.py in _asindices(self, idx, length)
167
168 # Check bounds
--> 169 max_indx = x.max()
170 if max_indx >= length:
171 raise IndexError('index (%d) out of range' % max_indx)

~/anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py in _amax(a, axis, out, keepdims, initial, where)
37 def _amax(a, axis=None, out=None, keepdims=False,
38 initial=_NoValue, where=True):
---> 39 return umr_maximum(a, axis, None, out, keepdims, initial, where)
40
41 def _amin(a, axis=None, out=None, keepdims=False,

TypeError: int() argument must be a string, a bytes-like object or a number, not '_NoValueType'

When using multiple threads, "RuntimeError: already borrowed" occurs

Excuse me, when I use KeyBERT with multiple threads, an error like "RuntimeError: already borrowed" happens. I searched for a solution and found a discussion about "TF Dataset Pipeline throws RuntimeError: Already borrowed when tokenizing" (huggingface/transformers#10434). It means that different threads reading the tokenizers cause the conflict. I tried to upgrade and downgrade transformers and tokenizers, but the error still exists. I don't know how to solve it. Thanks.

N-gram range as parameter in extract_keywords method

I am using this repository for generating keywords from documents and meeting transcriptions. I found out that sometimes it is better to accept keywords of varying n-gram length, e.g. ngram_range=(1, 3), rather than only 3-grams, because sometimes a single word is a good keyword rather than the whole phrase. Instead of creating my own modified repository, I would like to propose a modification to your codebase.

Instead of passing parameter:

    def _extract_keywords_single_doc(self,
                                     doc: str,
                                     keyphrase_length: int = 1,
                                     ...

You could provide parameter as follow:

    def _extract_keywords_single_doc(self,
                                     doc: str,
                                     keyphrase_ngram_range: Tuple[int, int] = (1,1),
                                     ...

for both methods _extract_keywords_single_doc and _extract_keywords_multiple_docs.

Let me know what you think about this.

Class of KeyBERT

Hi Maarten!

Would it make the code more coherent if you let KeyBERT inherit from the class of your favorite SentenceTransformer?

Thank you.

Using Universal Sentence Encoder as the embedding model for documents/candidate keyphrases

Can you suggest a way to use Universal Sentence Encoder, or any custom model for generating document/candidate embeddings, instead of options just from HF/Flair/sentence-transformers? I think in general it would be nice to have the functionality to be able to use MMR/MaxSumSimilarity with a generic model! Let me know if you have any thoughts about this, would love to hear your opinion!

Multi-document usage doesn't work with MMR diversity changes

When attempting to look at different keyphrase lists by adjusting diversity for the MMR similarity metric, I find that the keyphrases never change regardless of the diversity value used. The keyphrase output is corrected when I take the exemplar document out of list form and just feed it as a single document. Please see the code and results below (using the text snippet from the README tutorial):

Version with a single piece of text in a single-element list:

keywords = kw_model.extract_keywords(
    docs=[doc],
    keyphrase_ngram_range=(2,5),
    use_mmr=True,
    diversity=0.7,
    vectorizer=None
)

>>> [[('examples supervised learning example', 0.6984),
  ('supervised learning machine', 0.6989),
  ('supervised learning', 0.7035),
  ('examples supervised learning', 0.7041),
  ('supervised learning example', 0.7558)]]

Version with a single piece of text fed directly as just a single document:

keywords = kw_model.extract_keywords(
    docs=doc,
    keyphrase_ngram_range=(2,5),
    use_mmr=True,
    diversity=0.7,
    vectorizer=None
)

>>> [('supervised learning example', 0.7558),
 ('output pairs', 0.1079),
 ('reasonable way', 0.1529),
 ('object typically vector', 0.2502),
 ('value called supervisory signal', 0.3348)]

Looking at the code for KeyBERT.extract_keywords(), it appears this is intentional, as there aren't even MMR/MaxSum options for _extract_keywords_multiple_docs(). Is there a reason why these diversity-enhancing metrics can't be used in the multi-document scenario?

Lots of ['None Found'] results for Japanese language.

So I was testing how this fares for the Japanese language. I used the BERT tokenizer from the BERT model (https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking). I could change the tokenizer to Sudachi or MeCab, but I think the results would be more or less the same.

from transformers import BertJapaneseTokenizer
from keybert import KeyBERT

# Japanese pre-trained model
MODEL_NAME = 'cl-tohoku/bert-base-japanese-whole-word-masking'
tokenizer = BertJapaneseTokenizer.from_pretrained(MODEL_NAME)

kw_model = KeyBERT()
kw_model.extract_keywords(tokenizer.tokenize("感染力の強い変異株の影響により、これまでに経験のないスピードで感染が拡大しています。"), keyphrase_ngram_range=(1, 2), stop_words=None)

The output I get is

[[('感染', 1.0)],
 ['None Found'],
 ['None Found'],
 [('強い', 1.0)],
 [('変異', 1.0)],
 ['None Found'],
 ['None Found'],
 [('影響', 1.0)],
 [('により', 1.0)],
 ['None Found'],
 [('これ', 1.0)],
 [('まで', 1.0)],
 ['None Found'],
 [('経験', 1.0)],
 ['None Found'],
 [('ない', 1.0)],
 [('スピード', 1.0)],
 ['None Found'],
 [('感染', 1.0)],
 ['None Found'],
 [('拡大', 1.0)],
 ['None Found'],
 ['None Found'],
 ['None Found'],
 [('ます', 1.0)],
 ['None Found']]

It seems that the score is either 0 or 1 and the extraction does not seem very satisfactory. Any ideas on how to improve the performance?

Spanish and other languages

Hi,
Just want to know if KeyBERT works in Spanish. Can I change the BERT model to one that is suitable for Spanish?

Thank you

What about training?

What are your thoughts on the training of a selected model to use with KeyBERT?

Add unit tests

Add a few basic unit tests and implement a github workflow

`xla_device` argument has been deprecated

I found notifications like this:

The xla_device argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your config.json file.
The xla_device argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your config.json file.

My model code:
sentence_model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens", device="cpu")
kw_model = KeyBERT(model=sentence_model)

Thanks

Keyword extraction results vs YAKE

Hi Maarten,
I was super excited when I found out about your project because I wasn't happy with the results of the "static" algorithms (TF-IDF, RAKE, etc) and I thought that adding the Transformers twist could have been a game changer.
I just ran KeyBERT on a bunch of text and unfortunately the results are far from what I expected...I wanted to understand if I'm missing something in the configuration...I ran a comparison against RAKE, which I believe delivers a good selection. I highlighted what I believe are the right keywords among those extracted.

abb_brief.txt: Six steps to predict...

KeyBert
N-gram 1
['powerful', 'crucial', 'heuristic', 'heuristics', 'holistic']

N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['engineers', 'training', 'costly', 'decades', 'holistic']

N-gram 1 | mmr True | diversity 0.7
['powerful', 'environmental', 'oil', 'decades', 'conducting']

N-gram 1 | mmr True | diversity 0.2
['powerful', 'decades', 'engineers', 'holistic', 'oil']

Yake
[ ('maintenance', 0.009284128714982649),
('predictive maintenance', 0.014353824666396151),
('asset', 0.017838983298883827),
('equipment', 0.01827858388518511),
('assets', 0.01921121278341335),
('data', 0.02202428913616752),
('performance', 0.02562450472920611),
('system', 0.030684635541426298),
('predictive', 0.03602109202391525),
('predictive maintenance strategy', 0.04021698272687705),
('preventative maintenance', 0.04097700545772903),
('key', 0.045187052701102695),
('asset performance', 0.04537221430023765),
('plant', 0.045993250652619805),
('step', 0.04701900159897897),
('asset health', 0.05129570336498491),
('maintenance strategy', 0.05190103991418995),
('asset performance management', 0.053141427401525),
('asset management system', 0.054003690529928906),
('systems', 0.056255165159281556)]

honeywell_brief.txt: Honeywell Brings Ene...

KeyBert
N-gram 1
['oil', 'norway', 'norwegian', 'oslo', 'offshore']

N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['train', '70mw', 'environmental', 'offshore', 'norway']

N-gram 1 | mmr True | diversity 0.7
['oil', 'norway', 'accounting', 'daily', 'environmental']

N-gram 1 | mmr True | diversity 0.2
['oil', 'norway', 'offshore', 'compressors', 'environmental']

Yake
[ ('edvard grieg', 0.0047243731441952855),
('honeywell brings energy', 0.005048731021543624),
('lundin norway edvard', 0.00774357023651215),
('norway edvard grieg', 0.007859681432647146),
('lundin', 0.01097489749993232),
('honeywell', 0.011659075482465663),
('edvard grieg platform', 0.011747470639833146),
('honeywell forge', 0.012242825657779946),
('lundin norway creates', 0.012446534894446847),
('honeywell brings', 0.012646039494275783),
('asset performance management', 0.013202226795784838),
('brings energy accounting', 0.01421896527633952),
('enterprise performance management', 0.014355323112609336),
('lundin norway', 0.014378904821544412),
('performance management', 0.01779551307545978),
('honeywell forge asset', 0.017857115752163682),
('asset performance', 0.02052810653424872),
('north sea', 0.02116890341510034),
('edvard grieg serves', 0.022453059681042196),
('performance management software', 0.022799786193990212)]

ibm_brief.txt: Essential intelligen...

KeyBert
N-gram 1
['optimizing', 'improving', 'workflow', 'adaptability', 'workflows']

N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['expanded', 'analytics', 'lifecycle', 'efficiency', 'workflow']

N-gram 1 | mmr True | diversity 0.7
['optimizing', 'ibm', 'lake', 'global', 'lifecycle']

N-gram 1 | mmr True | diversity 0.2
['optimizing', 'workflow', 'improving', 'lifecycle', 'adaptability']

Yake
[ ('asset', 0.019425671610029897),
('maximo', 0.03385207707983861),
('maintenance', 0.043681051375471604),
('data', 0.04682441705459914),
('asset management', 0.0489307794359093),
('eam', 0.06313545614404494),
('management', 0.06687517191726874),
('operational', 0.06982561607664019),
('assets', 0.06993241779610762),
('costs', 0.07471213680215111),
('reduce', 0.07615429287108003),
('essential intelligence', 0.0768640289294993),
('reliable asset management', 0.07907749923254684),
('single', 0.09584152535240265),
('maximo manage', 0.09600993459669702),
('applications', 0.098110217945894),
('cmms', 0.0991284264831352),
('maximo mobile', 0.10147884696539622),
('operations', 0.10275673612171073),
('maximo application suite', 0.10860457497355133)]

aspentech_brief.txt: The Wide-Ranging Imp...

KeyBert
N-gram 1
['degrading', 'dangerous', 'toxins', 'damaging', 'hurts']

N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['competitive', 'powerful', 'shutdowns', 'robbing', 'toxins']

N-gram 1 | mmr True | diversity 0.7
['degrading', 'california', 'accounting', 'tomorrow', 'safest']

N-gram 1 | mmr True | diversity 0.2
['degrading', 'dangerous', 'toxins', 'damaging', 'hurts']

Yake
[ ('unplanned downtime', 0.007983226969004767),
('unplanned shutdowns', 0.010538494059896914),
('unplanned', 0.011342119859815765),
('downtime', 0.01742835270368349),
('reduce unplanned shutdowns', 0.018388374839244975),
('reduce unplanned downtime', 0.02003633232508374),
('technology', 0.021390981645117383),
('reduce unplanned', 0.022972668884322592),
('safety', 0.026816852757341234),
('shutdowns', 0.027532689357221824),
('operations', 0.029102917228258057),
('unplanned shutdown', 0.0368847292096392),
('maintenance', 0.037754074459235676),
('reduce', 0.04587756910104556),
('shutdown', 0.0458878155953697),
('predictive analytics', 0.05089898702999869),
('unplanned shutdowns cost', 0.05166283039628185),
('business', 0.05171852210565718),
('companies', 0.05193281587506164),
('operation', 0.05335534825180644)]

aspentech_blog.txt: From food and bevera...

KeyBert
N-gram 1
['businesses', 'aspentech', 'pharmaceuticals', 'everyone', 'petrochemical']

N-gram 1 | maxsum True | nr_candidates 20 | top_n 5
['owner', 'excited', 'oil', 'pharmaceuticals', 'aspentech']

N-gram 1 | mmr True | diversity 0.7
['businesses', 'eager', 'aspentech', 'oil', 'twins']

N-gram 1 | mmr True | diversity 0.2
['businesses', 'aspentech', 'pharmaceuticals', 'oil', 'petrochemical']

Yake
[ ('asset performance management', 0.001774287093354811),
('performance management', 0.004664558428596046),
('adopting asset performance', 0.006574406683268778),
('apm', 0.023224587127132566),
('actively adopting asset', 0.0241279733584699),
('asset performance', 0.025645678465285548),
('apm technology', 0.029995299363958845),
('technology', 0.04937113116851416),
('gas production', 0.055571440652063854),
('data', 0.05568651853043156),
('pharmaceuticals to oil', 0.057392310280758474),
('oil and gas', 0.057392310280758474),
('actively adopting', 0.057392310280758474),
('’re', 0.05870376376363502),
('assets', 0.06046162516715442),
('myth', 0.06051908857502686),
('understand', 0.06485308801630693),
('management', 0.06736024901319339),
('performance', 0.06892642785642587),
('asset', 0.0725539502005853)]

Spacy's non-transformer models in KeyBERT example code bugs in Anaconda project environment

Following the KeyBERT documentation, I used the code below to use spaCy's non-transformer models in KeyBERT within an Anaconda project environment. But I get an error, see the code and traceback below. How do I use spaCy models with KeyBERT? Does the Anaconda environment matter? I used Python 3.8.
https://maartengr.github.io/KeyBERT/guides/embeddings.html

import spacy
from keybert import KeyBERT

nlp = spacy.load("en_core_web_sm", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
kw_model = KeyBERT(model=nlp)

kw_model = KeyBERT(model=nlp)
File "C:\Users\User\anaconda3\envs\test\lib\site-packages\keybert\model.py", line 33, in __init__
self.model = SentenceTransformer(model)
File "C:\Users\User\anaconda3\envs\test\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 41, in __init__
if not os.path.isdir(model_path) and not model_path.startswith('http://') and not model_path.startswith('https://'):
File "C:\Users\User\anaconda3\envs\test\lib\genericpath.py", line 42, in isdir
st = os.stat(s)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not English

When using transformers model with Flair, an error occurred

KeyBERT: v0.3.0
Flair: v0.8.0

from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

doc = """ ****** """
roberta = TransformerDocumentEmbeddings('roberta-base')
model = KeyBERT(model=roberta)
keywords = model.extract_keywords(doc, keyphrase_ngram_range=(2, 2), stop_words='english', use_mmr=True, diversity=0.7)
print(keywords)

TypeError: stat: path should be string, bytes, os.PathLike or integer, not TransformerDocumentEmbeddings

How to use KeyBERT in deployment scenarios?

Hey,

Great work with KeyBERT.

Could you advise on how we can use this service in production?

How do we load the saved pre-trained model from storage and set up an API endpoint?

Regards,
Paritosh

Problem with scipy during install: Preparing wheel metadata ... error, ERROR: Command errored out with exit status 1:

A weird problem happens when I try to install KeyBERT on my laptop: it fails during a scipy dependency check. On my desktop it works, even with the same project and other packages. For some reason it doesn't work for me even in a blank, empty project without any other packages.

I'm using Pycharm with virtual interpreter, Python 3.8, pip 21.0.1, setuptools 56.0.0.

I would really appreciate any help.

This is the exact output; I only slightly censored my folder structure.

C:\my-path>pip install keybert
Collecting keybert
Using cached keybert-0.2.0.tar.gz (12 kB)
Collecting sentence-transformers>=0.3.8
Using cached sentence-transformers-1.0.4.tar.gz (74 kB)
Requirement already satisfied: scikit-learn>=0.22.2 in c:\my-path\appdata\local\programs\python\python38-32\lib\site-packages (from keybert) (0
.24.1)
Requirement already satisfied: numpy>=1.18.5 in c:\my-path\appdata\local\programs\python\python38-32\lib\site-packages (from keybert) (1.19.5)
Requirement already satisfied: joblib>=0.11 in c:\my-path\appdata\local\programs\python\python38-32\lib\site-packages (from scikit-learn>=0.22.
2->keybert) (1.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\my-path\appdata\local\programs\python\python38-32\lib\site-packages (from scikit-lear
n>=0.22.2->keybert) (2.1.0)
Requirement already satisfied: scipy>=0.19.1 in c:\my-path\appdata\local\programs\python\python38-32\lib\site-packages (from scikit-learn>=0.22
.2->keybert) (1.6.2)
Collecting transformers<5.0.0,>=3.1.0
Using cached transformers-4.5.1-py3-none-any.whl (2.1 MB)
Requirement already satisfied: tqdm in c:\my-path\appdata\local\programs\python\python38-32\lib\site-packages (from sentence-transformers>=0.3.
8->keybert) (4.54.1)
Collecting sentence-transformers>=0.3.8
Using cached sentence-transformers-1.0.3.tar.gz (74 kB)
Using cached sentence-transformers-1.0.2.tar.gz (74 kB)
Using cached sentence-transformers-1.0.1.tar.gz (74 kB)
Using cached sentence-transformers-1.0.0.tar.gz (74 kB)
Using cached sentence-transformers-0.4.1.2.tar.gz (64 kB)
Using cached sentence-transformers-0.4.1.1.tar.gz (64 kB)
Using cached sentence-transformers-0.4.1.tar.gz (64 kB)
Using cached sentence-transformers-0.4.0.tar.gz (65 kB)
Using cached sentence-transformers-0.3.9.tar.gz (64 kB)
Collecting transformers<3.6.0,>=3.1.0
Using cached transformers-3.5.1-py3-none-any.whl (1.3 MB)
Collecting sentence-transformers>=0.3.8
Using cached sentence-transformers-0.3.8.tar.gz (66 kB)
Collecting transformers<3.4.0,>=3.1.0
Using cached transformers-3.3.1-py3-none-any.whl (1.1 MB)
INFO: pip is looking at multiple versions of scipy to determine which version is compatible with other requirements. This could take a while.
Collecting scipy>=0.19.1
Using cached scipy-1.6.1-cp38-cp38-win32.whl (29.5 MB)
Using cached scipy-1.6.0-cp38-cp38-win32.whl (29.5 MB)
Using cached scipy-1.5.4-cp38-cp38-win32.whl (28.4 MB)
Using cached scipy-1.5.3-cp38-cp38-win32.whl (28.4 MB)
Using cached scipy-1.5.2-cp38-cp38-win32.whl (28.4 MB)
Using cached scipy-1.5.1-cp38-cp38-win32.whl (28.4 MB)
Using cached scipy-1.5.0-cp38-cp38-win32.whl (28.4 MB)
Using cached scipy-1.4.1-cp38-cp38-win32.whl (27.9 MB)
Using cached scipy-1.4.0-cp38-cp38-win32.whl (27.9 MB)
Using cached scipy-1.3.3-cp38-cp38-win32.whl (27.4 MB)
Using cached scipy-1.3.2-cp38-cp38-win32.whl (27.4 MB)
Using cached scipy-1.3.1.tar.gz (23.6 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... error
ERROR: Command errored out with exit status 1:
command: 'c:\my-path\appdata\local\programs\python\python38-32\python.exe' 'c:\my-path\appdata\local\programs\python\python38-32\lib\s
ite-packages\pip_vendor\pep517_in_process.py' prepare_metadata_for_build_wheel 'C:\my-path\AppData\Local\Temp\tmpquyu57o9'
cwd: C:\my-path\AppData\Local\Temp\pip-install-3scfomzx\scipy_98374e591d234ac8937f2333572e7063
Complete output (192 lines):
lapack_opt_info:
lapack_mkl_info:
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
libraries mkl_rt not found in ['c:\users\patri\appdata\local\programs\python\python38-32\lib', 'C:\', 'c:\users\patri\appdata\lo
cal\programs\python\python38-32\libs']
NOT AVAILABLE

openblas_lapack_info:
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries openblas not found in ['c:\\users\\patri\\appdata\\local\\programs\\python\\python38-32\\lib', 'C:\\', 'c:\\users\\patri\\appdata\\

local\programs\python\python38-32\libs']
get_default_fcompiler: matching types: '['gnu', 'intelv', 'absoft', 'compaqv', 'intelev', 'gnu95', 'g95', 'intelvem', 'intelem', 'flang']'
customize GnuFCompiler
Could not locate executable g77
Could not locate executable f77
customize IntelVisualFCompiler
Could not locate executable ifort
Could not locate executable ifl
customize AbsoftFCompiler
Could not locate executable f90
customize CompaqVisualFCompiler
Could not locate executable DF
customize IntelItaniumVisualFCompiler
Could not locate executable efl
customize Gnu95FCompiler
Could not locate executable gfortran
Could not locate executable f95
customize G95FCompiler
Could not locate executable g95
customize IntelEM64VisualFCompiler
customize IntelEM64TFCompiler
Could not locate executable efort
Could not locate executable efc
customize PGroupFlangCompiler
Could not locate executable flang
don't know how to compile Fortran code on platform 'nt'
NOT AVAILABLE

openblas_clapack_info:
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries openblas,lapack not found in ['c:\\users\\patri\\appdata\\local\\programs\\python\\python38-32\\lib', 'C:\\', 'c:\\users\\patri\\ap

pdata\local\programs\python\python38-32\libs']
NOT AVAILABLE

atlas_3_10_threads_info:
Setting PTATLAS=ATLAS
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries tatlas,tatlas not found in c:\my-path\appdata\local\programs\python\python38-32\lib
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries lapack_atlas not found in c:\my-path\appdata\local\programs\python\python38-32\lib
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries tatlas,tatlas not found in C:\
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries lapack_atlas not found in C:\
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries tatlas,tatlas not found in c:\my-path\appdata\local\programs\python\python38-32\libs
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries lapack_atlas not found in c:\my-path\appdata\local\programs\python\python38-32\libs
<class 'numpy.distutils.system_info.atlas_3_10_threads_info'>
  NOT AVAILABLE

atlas_3_10_info:
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries satlas,satlas not found in c:\my-path\appdata\local\programs\python\python38-32\lib
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries lapack_atlas not found in c:\my-path\appdata\local\programs\python\python38-32\lib
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries satlas,satlas not found in C:\
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries lapack_atlas not found in C:\
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries satlas,satlas not found in c:\my-path\appdata\local\programs\python\python38-32\libs
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries lapack_atlas not found in c:\my-path\appdata\local\programs\python\python38-32\libs
<class 'numpy.distutils.system_info.atlas_3_10_info'>
  NOT AVAILABLE

atlas_threads_info:
Setting PTATLAS=ATLAS
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries ptf77blas,ptcblas,atlas not found in c:\my-path\appdata\local\programs\python\python38-32\lib
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries lapack_atlas not found in c:\my-path\appdata\local\programs\python\python38-32\lib
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries ptf77blas,ptcblas,atlas not found in C:\
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries lapack_atlas not found in C:\
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries ptf77blas,ptcblas,atlas not found in c:\my-path\appdata\local\programs\python\python38-32\libs
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries lapack_atlas not found in c:\my-path\appdata\local\programs\python\python38-32\libs
<class 'numpy.distutils.system_info.atlas_threads_info'>
  NOT AVAILABLE

atlas_info:
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries f77blas,cblas,atlas not found in c:\my-path\appdata\local\programs\python\python38-32\lib
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries lapack_atlas not found in c:\my-path\appdata\local\programs\python\python38-32\lib
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries f77blas,cblas,atlas not found in C:\
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries lapack_atlas not found in C:\
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries f77blas,cblas,atlas not found in c:\my-path\appdata\local\programs\python\python38-32\libs
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries lapack_atlas not found in c:\my-path\appdata\local\programs\python\python38-32\libs
<class 'numpy.distutils.system_info.atlas_info'>
  NOT AVAILABLE

lapack_info:
No module named 'numpy.distutils._msvccompiler' in numpy.distutils; trying from distutils
customize MSVCCompiler
  libraries lapack not found in ['c:\\users\\patri\\appdata\\local\\programs\\python\\python38-32\\lib', 'C:\\', 'c:\\users\\patri\\appdata\\lo

cal\programs\python\python38-32\libs']
NOT AVAILABLE

lapack_src_info:
  NOT AVAILABLE

  NOT AVAILABLE

setup.py:386: UserWarning: Unrecognized setuptools command ('dist_info --egg-base C:\my-path\AppData\Local\Temp\pip-modern-metadata-5c5bml6

g'), proceeding with generating Cython sources and expanding templates
warnings.warn("Unrecognized setuptools command ('{}'), proceeding with "
Running from scipy source directory.
C:\my-path\AppData\Local\Temp\pip-build-env-0dcoq7t6\overlay\Lib\site-packages\numpy\distutils\system_info.py:624: UserWarning:
Atlas (http://math-atlas.sourceforge.net/) libraries not found.
Directories to search for the libraries can be specified in the
numpy/distutils/site.cfg file (section [atlas]) or by setting
the ATLAS environment variable.
self.calc_info()
C:\my-path\AppData\Local\Temp\pip-build-env-0dcoq7t6\overlay\Lib\site-packages\numpy\distutils\system_info.py:624: UserWarning:
Lapack (http://www.netlib.org/lapack/) libraries not found.
Directories to search for the libraries can be specified in the
numpy/distutils/site.cfg file (section [lapack]) or by setting
the LAPACK environment variable.
self.calc_info()
C:\my-path\AppData\Local\Temp\pip-build-env-0dcoq7t6\overlay\Lib\site-packages\numpy\distutils\system_info.py:624: UserWarning:
Lapack (http://www.netlib.org/lapack/) sources not found.
Directories to search for the sources can be specified in the
numpy/distutils/site.cfg file (section [lapack_src]) or by setting
the LAPACK_SRC environment variable.
self.calc_info()
Traceback (most recent call last):
File "c:\my-path\appdata\local\programs\python\python38-32\lib\site-packages\pip_vendor\pep517_in_process.py", line 280, in
main()
File "c:\my-path\appdata\local\programs\python\python38-32\lib\site-packages\pip_vendor\pep517_in_process.py", line 263, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "c:\my-path\appdata\local\programs\python\python38-32\lib\site-packages\pip_vendor\pep517_in_process.py", line 133, in prepare_met
adata_for_build_wheel
return hook(metadata_directory, config_settings)
File "C:\my-path\AppData\Local\Temp\pip-build-env-0dcoq7t6\overlay\Lib\site-packages\setuptools\build_meta.py", line 166, in prepare_meta
data_for_build_wheel
self.run_setup()
File "C:\my-path\AppData\Local\Temp\pip-build-env-0dcoq7t6\overlay\Lib\site-packages\setuptools\build_meta.py", line 258, in run_setup
super(_BuildMetaLegacyBackend,
File "C:\my-path\AppData\Local\Temp\pip-build-env-0dcoq7t6\overlay\Lib\site-packages\setuptools\build_meta.py", line 150, in run_setup
exec(compile(code, file, 'exec'), locals())
File "setup.py", line 505, in
setup_package()
File "setup.py", line 501, in setup_package
setup(**metadata)
File "C:\my-path\AppData\Local\Temp\pip-build-env-0dcoq7t6\overlay\Lib\site-packages\numpy\distutils\core.py", line 135, in setup
config = configuration()
File "setup.py", line 403, in configuration
raise NotFoundError(msg)
numpy.distutils.system_info.NotFoundError: No lapack/blas resources found.
----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/ee/5b/5afcd1c46f97b3c2ac3489dbc95d6ca28eacf8e3634e51f495da68d97f0f/scipy-1.3.1.tar.gz#s
ha256=2643cfb46d97b7797d1dbdb6f3c23fe3402904e3c90e6facfe6a9b98d808c1b5 (from https://pypi.org/simple/scipy/) (requires-python:>=3.5). Command error
ed out with exit status 1: 'c:\my-path\appdata\local\programs\python\python38-32\python.exe' 'c:\my-path\appdata\local\programs\python\pyth
on38-32\lib\site-packages\pip_vendor\pep517_in_process.py' prepare_metadata_for_build_wheel 'C:\my-path\AppData\Local\Temp\tmpquyu57o9' Check
the logs for full command output

The output result has no floating point numbers.

I use this code to test keybert.

from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs.[1] It infers a
         function from labeled training data consisting of a set of training examples.[2]
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).
      """
model = KeyBERT('distilbert-base-nli-mean-tokens')
print(model.extract_keywords(doc, keyphrase_ngram_range=(1, 1)))

The output result is like this.
['learning', 'algorithm', 'training', 'class', 'mapping']

There are only words and no floating-point numbers corresponding to the words. Why?

The official result is like this.

[('learning', 0.4604),
 ('algorithm', 0.4556),
 ('training', 0.4487),
 ('class', 0.4086),
 ('mapping', 0.3700)]

Add MMR candidate selection

EmbedRank uses Maximal Margin Relevance to select the resulting keywords which is an interesting technique to diversify the selected candidates.

It should be noted that if MMR is used, then KeyBERT is essentially EmbedRank with BERT and without the selection of candidate phrases based on part-of-speech sequences.

However, the base implementation will likely, as a default, use simple cosine similarity to keep the usage without too many parameters.

Some problem about tokenizer

I have tried your model, and it is suitable for extracting keywords and capturing semantic info.

What I want to ask is: you tokenize the doc with CountVectorizer first, but when it comes to a keyword with a blank inside, such as "learning progress", it seems to be tokenized again in the encode method of the sentence-transformers model, since the pre_tokenized parameter is set to False. So the tokenizers used in these two steps seem to differ: one is the default, and the other is the tokenizer of the transformers model, i.e. some pretrained tokenizer's tokenize method. Could this mismatch cause problems?
In my case I use Chinese documents, so I pre-tokenize the doc into phrases and join them with blanks to simulate English-like input, so that both tokenizers process the doc as a list of phrases rather than a list of characters, without changing the tokenizer inside the model. This produces reasonable results, but for other tasks or domains, could this mismatch cause problems?

How to use languages other than English?

I would like to use KeyBERT with the French language.
To do this, must I select a model and pass it to KeyBERT via model?
Like this:

from keybert import KeyBERT
doc = """
L'apprentissage supervisé est la tâche d'apprentissage machine qui consiste à apprendre une fonction qui associe une entrée à une sortie en se basant sur des exemples de paires entrée-sortie [1]. 
Il déduit une fonction à partir de données de formation étiquetées consistant en un ensemble d'exemples de formation.
Dans l'apprentissage supervisé, chaque exemple est une paire constituée d'un objet d'entrée  (généralement un vecteur) et une valeur de sortie souhaitée (également appelée signal de supervision). 
"""
model = KeyBERT(model='MODEL_TO_CHOOSE')
keywords = model.extract_keywords(doc)

For the French language:

  • Which model do you recommend?
  • Is xlm-r-bert-base-nli-stsb-mean-tokens a good choice?

Multiple Sentence Input to KeyBERT

Hi

I would like to provide two sentence input to KeyBERT to extract keywords.

from keybert import KeyBERT

doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)

In the above example there is one sentence input, let's call it sentence A. I would like to include a sentence B and generate keywords or phrases after modelling both simultaneously.

Is there a way to do this ?

Thanks,
Subham

cannot use without internet connectivity

While using "kw_extractor = KeyBERT('distilbert-base-nli-mean-tokens')", it tries to download a pre-trained model from the internet. The code is written in a way that I cannot customize to use it while working on a server with no internet connectivity. Is there any way to modify the code so that I can upload the pre-trained model to the server and then load it locally?
