Sentence Clustering with BERT (SCBert) is a project that aims to use state-of-the-art BERT models to compute vectors for sentences. It also implements a few tools for exploring those vectors and how sentences relate to each other in the latent space.
- Create vectors from raw data:
# Transform raw French texts into vectors using a BERT model.
from SCBert.SCBert import Vectorizer
vectorizer = Vectorizer("flaubert")
text_vectors = vectorizer.vectorize(data)
- Explore the embedded space:
# Explore the relations in your data.
from SCBert.SCBert import EmbeddingExplorer
ee = EmbeddingExplorer(data, text_vectors)
labels = ee.cluster(k=3)  # Cluster with k-means
ee.extract_keywords()  # Extract keywords with the RAKE algorithm; they are then available as ee.keywords
ee.explore(color=labels)  # Plot a PCA projection of the embedded vectors, colored by label
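To give an intuition for what k-means clustering does with sentence vectors, here is a minimal pure-Python sketch. It is not SCBert's implementation: the function name `kmeans`, the `toy_vectors` data and the deterministic farthest-point initialisation are all assumptions made for illustration, and real sentence vectors have far more than two dimensions.

```python
import math


def kmeans(vectors, k, n_iter=100):
    """Illustrative k-means (Lloyd's algorithm); returns one label per vector.

    Hypothetical helper, not part of SCBert.
    """
    # Assumption: deterministic farthest-point initialisation, so the
    # example is reproducible. Libraries usually use randomised schemes.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(math.dist(v, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = [-1] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # Assignments are stable, so the algorithm has converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy 2-D "embeddings" forming three well-separated groups.
toy_vectors = [
    [0.0, 0.1], [0.2, 0.0],
    [5.0, 5.1], [5.2, 4.9],
    [10.0, 0.0], [9.8, 0.2],
]
labels = kmeans(toy_vectors, k=3)
print(labels)
```

Vectors that sit close together end up with the same label, which is the property the colored plot relies on.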
A built-in example is available in the example folder. It comes with its own data, the CLS-fr dataset, composed of Amazon reviews from different sources (DVD, CD, books).
You can either download the zip file or install the PyPI package with the following command:
> pip install SCBert
If you run into problems during installation, the cause may be the multi-rake dependency on cld2-cffi. I will try to address this later on. As a workaround, run the following commands:
> export CFLAGS="-Wno-narrowing"
> pip install cld2-cffi
> pip install multi-rake