m3ttiw / orange_cb_recsys

Content-Based Recommender System Framework in python

License: GNU General Public License v3.0

Python 100.00%

python python3 framwework recommender-system content-based-recommendation

orange_cb_recsys's Introduction

Orange_cb_recsys

Framework for content-based recommender system

Installation

pip install orange-cb-recsys

PyLucene is required but is not installed automatically with the other dependencies; you need to install it yourself.

You also need to manually copy the files runnable_instances.xz and categories.xz, which you can find in the source directory, into the installation directory.

Usage

There are two ways to use this framework: through the API or through a config file.

API Usage

Using the API is the classic way of using a library: you invoke its classes and methods directly.

Example:

Config Usage

Using the config file is an automated way of running the framework.

Just indicate which algorithms you want to use and change variables where necessary, without having to call classes or methods. This mode is intended for users who want to use many framework features.

Config File

The Config File shows an example of using the framework in this way.

As mentioned above, you need to change certain variables for the framework to work properly. Here are some examples of these variables:

"content_type": "ITEM"

This can be ITEM, USER or RATING depending on what you are using

"output_directory": "movielens_test"

You can set this value to whichever output directory you want

"raw_source_path": "../../datasets/movies_info_reduced.json"

This is the path of the ITEM, USER or RATING source file that you are using

"source_type": "json"

Here you can specify the source_type, this can be JSON, CSV or SQL

"id_field_name": ["imdbID", "Title"]

Specify the name of the field (or fields, as in this example) that make up the ID

"search_index": "True"

True if you want to use the text indexing technique, otherwise False

"fields": [
{
  "field_name": "Plot",
  "lang": "EN",
  "memory_interface": "None",
  "memory_interface_path": "None",

In "fields" you specify the name of the field to process, its language and the memory interface

The language is specified per field, so a single file can be used to index ITEM or USER contents in several languages

"pipeline_list": [
    {
    "field_content_production": {"class": "search_index"},
    "preprocessing_list": [
      ]
    },
    {
    "field_content_production": {"class": "embedding",
      "combining_technique": {"class":  "centroid"},
      "embedding_source": {"class": "binary_file", "file_path": "../../datasets/doc2vec/doc2vec.bin", "embedding_type":  "doc2vec"},
      "granularity": "doc"},
    "preprocessing_list": [
      {"class": "nltk", "url_tagging":"True", "strip_multiple_whitespaces": "True"}
      ]
    },
    {
    "field_content_production": {"class": "lucene_tf-idf"},
    "preprocessing_list": [
      {"class": "nltk", "lemmatization": "True"}
      ]
    }

The pipeline_list is where you define the pipeline:

For each field you can create several representations; in this example they are search_index, embedding and tf-idf.

For each representation you can specify the preprocessing list to be used.

For example, the tf-idf representation uses the nltk class, which performs natural language processing, with lemmatization enabled

When using nltk, these are the variables that can be changed: stopwords_removal, stemming, lemmatization, strip_multiple_whitespaces and url_tagging

When specifying embedding as field_content_production, you must also specify the combining_technique (currently only centroid), the source of the embedding, and its granularity, which can be word, doc or sentence

orange_cb_recsys's People

itsfrank98 marcopoli

orange_cb_recsys's Issues

FastText with gensim

Provide an implementation of FastText using gensim

Create ratings importer class

Create a RatingsImporter class that imports ratings from a dedicated file.
In this class the user must specify:

an instance of RawInformationSource representing the source of the ratings
the fields where the preferences are found
the field where the user id is found
the field where the item id is found
the field where the timestamp is found.

Create an abstract class RatingProcessor that derives numeric scores (in the range [-1, 1]) from the original ratings, which may be in a different format.

Adapt SentimentAnalysis into an implementation of RatingProcessor, and modify its method so that it computes the score for a single rating.

RatingsImporter must also hold an instance of RatingProcessor.

Add to RatingsImporter a method import_ratings that imports these ratings into a dataframe whose columns are: user_id, item_id, original_rating, derived_score, timestamp

In run.py, add a configuration with content type rating that builds an instance of this class.
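
The structure requested in this issue can be sketched as follows. This is only an illustration under assumptions: the method name `process`, the `NumberNormalizer` class and the standalone `import_ratings` function are hypothetical, and plain dictionaries stand in for both the RawInformationSource and the dataframe (passing the resulting records to `pandas.DataFrame` would produce the frame).

```python
from abc import ABC, abstractmethod


class RatingProcessor(ABC):
    """Derives a numeric score in [-1, 1] from a single original rating."""

    @abstractmethod
    def process(self, rating) -> float:  # hypothetical method name
        ...


class NumberNormalizer(RatingProcessor):
    """Hypothetical processor: linearly rescales a numeric rating to [-1, 1]."""

    def __init__(self, min_rating: float, max_rating: float):
        if max_rating <= min_rating:
            raise ValueError("max_rating must be greater than min_rating")
        self.min_rating = min_rating
        self.max_rating = max_rating

    def process(self, rating) -> float:
        span = self.max_rating - self.min_rating
        return 2.0 * (float(rating) - self.min_rating) / span - 1.0


def import_ratings(rows, processor, user_field, item_field,
                   rating_field, timestamp_field):
    """Build one record per rating with the five columns named in the issue."""
    return [
        {
            "user_id": row[user_field],
            "item_id": row[item_field],
            "original_rating": row[rating_field],
            "derived_score": processor.process(row[rating_field]),
            "timestamp": row[timestamp_field],
        }
        for row in rows
    ]


# Illustrative data only; field names are made up for the example.
raw = [
    {"uid": "u1", "iid": "i1", "stars": 5, "ts": 1591000000},
    {"uid": "u1", "iid": "i2", "stars": 1, "ts": 1591000100},
]
records = import_ratings(raw, NumberNormalizer(1, 5), "uid", "iid", "stars", "ts")
```

Keeping the processor behind an abstract class lets a sentiment-based scorer replace the numeric one without touching the importer.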

YAML configuration file

Modify run.py so that it checks whether the input file is JSON or YAML and acts accordingly
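
One possible way to do this check is to dispatch on the file extension. The sketch below is an assumption about how run.py could be organised, not its actual code; `load_config` is a hypothetical helper, and the YAML branch relies on PyYAML, which is imported only when needed.

```python
import json
from pathlib import Path


def load_config(path):
    """Parse a config file as JSON or YAML depending on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".json":
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    if suffix in (".yml", ".yaml"):
        import yaml  # PyYAML; imported lazily so JSON users do not need it
        with open(path, encoding="utf-8") as f:
            return yaml.safe_load(f)
    raise ValueError(f"Unsupported config format: {suffix!r}")
```

Both branches return plain Python dictionaries and lists, so the rest of the run logic does not need to know which format was used.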

ScorePredictionMetric

Implement the functions that compute the ScorePredictionMetric values; the input is two series: predictions and truth. Remove global MAE and globalRMSE
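
As an illustration, MAE and RMSE over two aligned series can be computed as below. The function names are hypothetical and plain Python sequences stand in for the series.

```python
import math


def _check_aligned(predictions, truth):
    """Both series must be non-empty and have the same length."""
    if len(predictions) != len(truth) or len(truth) == 0:
        raise ValueError("predictions and truth must be non-empty and aligned")


def mae(predictions, truth):
    """Mean absolute error between predicted and true scores."""
    _check_aligned(predictions, truth)
    return sum(abs(p - t) for p, t in zip(predictions, truth)) / len(truth)


def rmse(predictions, truth):
    """Root mean squared error between predicted and true scores."""
    _check_aligned(predictions, truth)
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predictions, truth)) / len(truth)
    )


predictions = [0.5, -0.5, 1.0]
truth = [1.0, -1.0, 1.0]
print(mae(predictions, truth), rmse(predictions, truth))
```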

RankingMetric

Implement the functions that compute the RankingMetric values; the input is two frames, predictions and truth, both with the columns item and rating.
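
The issue does not say which ranking metrics are meant, so the sketch below picks precision@k and NDCG@k as two common examples. It assumes each frame is a list of (item, rating) pairs, that truth ratings used as gains are non-negative, and that an item counts as relevant when its true rating exceeds a threshold; all names are hypothetical.

```python
import math


def top_k(frame, k):
    """Items of a frame ordered by descending rating, cut at k."""
    ordered = sorted(frame, key=lambda row: row[1], reverse=True)
    return [item for item, _ in ordered[:k]]


def precision_at_k(predictions, truth, k, relevant_threshold=0.0):
    """Fraction of the top-k predicted items that are relevant in truth."""
    relevant = {item for item, rating in truth if rating > relevant_threshold}
    ranked = top_k(predictions, k)
    if not ranked:
        return 0.0
    return sum(1 for item in ranked if item in relevant) / len(ranked)


def ndcg_at_k(predictions, truth, k):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    gains = dict(truth)

    def dcg(items):
        return sum(
            gains.get(item, 0.0) / math.log2(position + 2)
            for position, item in enumerate(items)
        )

    ideal = dcg(top_k(truth, k))
    return dcg(top_k(predictions, k)) / ideal if ideal > 0 else 0.0


predictions = [("a", 0.9), ("b", 0.8), ("c", 0.1), ("d", 0.4)]
truth = [("a", 1.0), ("b", 0.0), ("c", 0.5), ("d", 1.0)]
print(precision_at_k(predictions, truth, 2), ndcg_at_k(predictions, truth, 2))
```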

LOD Properties Retrieval

Add a LOD_properties attribute to the Content class
Add an abstract class LODPropertiesRetrieval; a first implementation is the one based on DBPedia
Let the configuration choose which LODPropertiesRetrieval to use, including None
From the content analyzer main, run the LODPropertiesRetrieval techniques before moving on to those that instantiate the fields

Graph structure

Create an abstract class Graph, specialized into BipartiteGraph and TripartiteGraph.
The implementations of these classes can be built, for example, with networkx or with pyneo4j using the neo4j DBMS.
The graph is a DAG and is built from the information in ratings_dataframe, creating directed, weighted edges. Semantically, an edge represents a judgment given by a user (from) to an item (to) with a weight given by the formula: (1 - score) / 2
This is because lower weights correspond to edges that are easier to traverse, and normalizing the weights
(from the range [-1, 1] to the range [0, 1]) makes the recommenders easier to implement.
So 1.0 is the maximum weight for traversing an edge and corresponds to a 'dislike', while 0.0 is the minimum and corresponds to a 'like'.
Provide methods to navigate the graph efficiently; it must also be explorable against the direction of the edges.
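
The weighting rule and the two-way navigation can be illustrated with a small in-memory sketch. It uses plain dictionaries rather than networkx or neo4j, and the class and method names are hypothetical.

```python
from collections import defaultdict


def edge_weight(score):
    """Map a score in [-1, 1] to a weight in [0, 1]: like -> 0.0, dislike -> 1.0."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must be in [-1, 1]")
    return (1.0 - score) / 2.0


class BipartiteRatingGraph:
    """Directed, weighted user -> item edges, navigable in both directions."""

    def __init__(self):
        self._successors = defaultdict(dict)    # user -> {item: weight}
        self._predecessors = defaultdict(dict)  # item -> {user: weight}

    def add_rating(self, user_id, item_id, score):
        weight = edge_weight(score)
        self._successors[user_id][item_id] = weight
        self._predecessors[item_id][user_id] = weight

    def successors(self, user_id):
        """Items rated by a user, with edge weights."""
        return dict(self._successors.get(user_id, {}))

    def predecessors(self, item_id):
        """Users who rated an item (reverse navigation), with edge weights."""
        return dict(self._predecessors.get(item_id, {}))


graph = BipartiteRatingGraph()
graph.add_rating("u1", "i1", 1.0)    # like
graph.add_rating("u2", "i1", -1.0)   # dislike
```

Storing a second index keyed by item is what makes the reverse traversal cheap, at the cost of keeping the two maps in sync on every insertion.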

Provide an implementation with networkX

Adapt fit to handle ratings as content instances

Also add a technique that simply parses item: rating pairs, for the case where the rating is a field of the user

Instantiate config classes from a JSON configuration file

There must be a single configuration file from which to instantiate the objects RawDataConfig and ContentAnalyzerConfig

Implement classifier recommender

Implement a ClassifierRecommender class as a child of RatingsSPA.
Use a decisionTree classifier, that is: learn a decision tree using as examples the items the user has rated, then evaluate the "new" item with this tree.

Implement sum combining technique

Sum the rows of the input numpy matrix

Make the dataset preparation phase generic for techniques based on the whole collection

At the moment we check for tf-idf and serialize into an index the fields for which this technique was chosen; this phase must be made generic, since any technique may need to "refactor" the dataset into a structure that is not necessarily an index

Latent semantic analysis with gensim

Provide an implementation of Latent semantic analysis using gensim

Sentiment analysis technique

Create a sentiment analysis class as an implementation of FieldProductionTechnique to obtain numeric scores from textual ratings.

RandomIndexing with gensim RpModel

EmbeddingSource wikipedia2vec - ExplicitSemanticAnalysis

Provide an implementation of Explicit semantic analysis using wikipedia2vec

Write docstrings

word2vec with gensim

Provide an implementation of word2vec using gensim

Implement str methods

Provide an implementation of str for every class that does not have one

Implement SQLDatabase class

Run a query that fetches the whole content of the specified table and iterate over the result (iter method).
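
The iteration pattern can be sketched with the standard library's sqlite3 module, used here only as a stand-in: the project depends on MySQL connectors, and the class name below is hypothetical. The sample table and rows are made up for the example.

```python
import sqlite3


class SQLiteTableSource:
    """Iterates over every row of one table, yielding a dict per row."""

    def __init__(self, connection, table_name):
        self._connection = connection
        self._table_name = table_name

    def __iter__(self):
        # Table names cannot be bound as query parameters, so the
        # identifier is quoted explicitly before being interpolated.
        quoted = '"' + self._table_name.replace('"', '""') + '"'
        cursor = self._connection.execute(f"SELECT * FROM {quoted}")
        columns = [description[0] for description in cursor.description]
        for row in cursor:
            yield dict(zip(columns, row))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (imdbID TEXT, Title TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("id1", "First Movie"), ("id2", "Second Movie")],
)
rows = list(SQLiteTableSource(conn, "movies"))
```

Yielding rows one at a time keeps memory use flat for large tables, which matters when the table holds a whole item catalogue.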

import from pypi error

I tried to install the package from PyPI and the mysqlclient package raised this error:

Collecting orange-cb-recsys
  Downloading orange_cb_recsys-0.1-py3-none-any.whl (14 kB)
Requirement already satisfied: PyYAML==5.3.1 in /home/mattia/PycharmProjects/untitled/venv/lib/python3.8/site-packages (from orange-cb-recsys) (5.3.1)
Collecting mysql-connector-python==8.0.20
  Downloading mysql_connector_python-8.0.20-cp38-cp38-manylinux1_x86_64.whl (14.8 MB)
Collecting numpy==1.18.4
  Downloading numpy-1.18.4-cp38-cp38-manylinux1_x86_64.whl (20.7 MB)
Collecting nltk==3.5
  Downloading nltk-3.5.zip (1.4 MB)
Collecting wikipedia2vec==1.0.4
  Downloading wikipedia2vec-1.0.4.tar.gz (1.2 MB)
Collecting babelpy==1.0.1
  Downloading BabelPy-1.0.1.tar.gz (8.0 kB)
Collecting mysql==0.0.2
  Downloading mysql-0.0.2.tar.gz (1.9 kB)
Collecting gensim==3.8.3
  Downloading gensim-3.8.3-cp38-cp38-manylinux1_x86_64.whl (24.2 MB)
Collecting protobuf>=3.0.0
  Downloading protobuf-3.12.2-cp38-cp38-manylinux1_x86_64.whl (1.3 MB)
Collecting click
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Requirement already satisfied: joblib in /home/mattia/PycharmProjects/untitled/venv/lib/python3.8/site-packages (from nltk==3.5->orange-cb-recsys) (0.15.1)
Collecting regex
  Downloading regex-2020.6.8-cp38-cp38-manylinux2010_x86_64.whl (673 kB)
Collecting tqdm
  Downloading tqdm-4.46.1-py2.py3-none-any.whl (63 kB)
Collecting jieba
  Downloading jieba-0.42.1.tar.gz (19.2 MB)
Collecting lmdb
  Downloading lmdb-0.98.tar.gz (869 kB)
Collecting marisa-trie
  Downloading marisa-trie-0.7.5.tar.gz (270 kB)
Collecting mwparserfromhell
  Downloading mwparserfromhell-0.5.4.tar.gz (135 kB)
Requirement already satisfied: scipy in /home/mattia/PycharmProjects/untitled/venv/lib/python3.8/site-packages (from wikipedia2vec==1.0.4->orange-cb-recsys) (1.4.1)
Requirement already satisfied: six in /home/mattia/PycharmProjects/untitled/venv/lib/python3.8/site-packages (from wikipedia2vec==1.0.4->orange-cb-recsys) (1.15.0)
Collecting mysqlclient
  Downloading mysqlclient-1.4.6.tar.gz (85 kB)

    ERROR: Command errored out with exit status 1:
     command: /home/mattia/PycharmProjects/untitled/venv/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pycharm-packaging5/mysqlclient/setup.py'"'"'; __file__='"'"'/tmp/pycharm-packaging5/mysqlclient/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-89vzbxx0
         cwd: /tmp/pycharm-packaging5/mysqlclient/
    Complete output (12 lines):
    /bin/sh: 1: mysql_config: not found
    /bin/sh: 1: mariadb_config: not found
    /bin/sh: 1: mysql_config: not found
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pycharm-packaging5/mysqlclient/setup.py", line 16, in <module>
        metadata, options = get_config()
      File "/tmp/pycharm-packaging5/mysqlclient/setup_posix.py", line 61, in get_config
        libs = mysql_config("libs")
      File "/tmp/pycharm-packaging5/mysqlclient/setup_posix.py", line 29, in mysql_config
        raise EnvironmentError("%s not found" % (_mysql_config_path,))
    OSError: mysql_config not found
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Improve coverage percentage

Add more tests in order to increase the coverage percentage

Complete docstrings

The classes CollectionBasedTechnique, SingleContentTechnique, LuceneTfIdf and IndexInterface are still missing docstrings

Compute fairness metrics

Implement methods that compute fairness metrics (fairness_metrics.py);
the input is a dataframe with the columns user, item, rating. The frame is global, that is, it contains all the recommendations made for all users and all items.
The metrics to compute are:

catalog coverage
delta gaps
gini index
pop ratio profile vs recs
pop recs correlation
recs long tail distr
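
Two of the listed metrics, catalog coverage and the Gini index, can be sketched as follows. The exact definitions intended by the issue are not stated, so this assumes the usual ones: coverage is the share of catalogue items that appear in at least one recommendation, and the Gini index is computed over how often each recommended item appears. Tuples of (user, item, rating) stand in for the dataframe rows, and the function names are hypothetical.

```python
from collections import Counter


def catalog_coverage(recs, catalog):
    """Share of catalogue items that appear in at least one recommendation."""
    if not catalog:
        raise ValueError("catalog must not be empty")
    recommended = {item for _, item, _ in recs}
    return len(recommended & set(catalog)) / len(set(catalog))


def gini_index(recs):
    """Gini index of recommendation counts (0.0 means perfectly even).

    Assumption: only items recommended at least once are considered.
    """
    counts = sorted(Counter(item for _, item, _ in recs).values())
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((rank + 1) * count for rank, count in enumerate(counts))
    return (2 * weighted) / (n * total) - (n + 1) / n


recs = [("u1", "a", 0.9), ("u1", "b", 0.7), ("u2", "a", 0.8), ("u3", "a", 0.6)]
catalog = ["a", "b", "c", "d"]
print(catalog_coverage(recs, catalog), gini_index(recs))
```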

Provide an NLP implementation

For example in the OpenNLP class that has already been set up

Adapt fit

Adapt the fit method of recsys, considering that the input will be a frame with the 5 columns

Tf-Idf with tensorflow

Implement the tf-idf computation (and the related dataset preparation) with tensorflow; the class implementing this feature extends TfIdfTechnique

Babelfy Api Key

Describe the bug
We only have a limited number of requests to Babelfy with the default API key.

To Reproduce
Steps to reproduce the behavior:

Run BabelPy entity linking

Expected behavior
To avoid problems, users can register on Babelfy and enter their own key during configuration.
Allow the user to enter the key during configuration, or use a university API key(?)

Centroid recommender over document embeddings

Implement the predict method of the CentroidVector class in the file ratings_based.py: compute the centroid of the vectors representing each item. These vectors are found in the item_field specified by the user, and for this field the representation given in the corresponding parameter must be used.
Raise an exception when

the field does not exist,
the field exists, but the representation does not,
both exist, but the representation is not a document embedding (a document embedding has only one dimension)

Compare the resulting centroid with the vector of the item being predicted, using a similarity function.

Create an abstract class Similarity and give CentroidVector an attribute of this type.

Implement CosineSimilarity as a child of Similarity.
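
The class hierarchy and the centroid comparison described above can be sketched in plain Python. This is an illustration under assumptions: the method name `perform` and the helper functions are hypothetical, vectors are plain lists of floats, and the field and representation checks from the issue are left out because they depend on the Content class.

```python
import math
from abc import ABC, abstractmethod


class Similarity(ABC):
    """Abstract similarity between two vectors of equal length."""

    @abstractmethod
    def perform(self, v1, v2) -> float:  # hypothetical method name
        ...


class CosineSimilarity(Similarity):
    def perform(self, v1, v2) -> float:
        if len(v1) != len(v2):
            raise ValueError("vectors must have the same length")
        dot = sum(a * b for a, b in zip(v1, v2))
        norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
        # A zero vector has no direction; treat its similarity as 0.0.
        return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def predict(rated_item_vectors, candidate_vector, similarity):
    """Score a candidate item against the centroid of the rated items."""
    return similarity.perform(centroid(rated_item_vectors), candidate_vector)


score = predict([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], CosineSimilarity())
```

Passing the similarity in as an object keeps predict independent of the measure, so another Similarity subclass can be swapped in without changing the recommender.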