
Python library for Representation Learning on Knowledge Graphs https://docs.ampligraph.org

License: Apache License 2.0

machine-learning knowledge-graph relational-learning representation-learning graph-representation-learning graph-embeddings knowledge-graph-embeddings

ampligraph's Introduction

AmpliGraph

DOI · Documentation Status · CircleCI · Join the conversation on Slack

Open source library based on TensorFlow that predicts links between concepts in a knowledge graph.

AmpliGraph is a suite of neural machine learning models for relational learning, a branch of machine learning that deals with supervised learning on knowledge graphs.

Use AmpliGraph if you need to:

  • Discover new knowledge from an existing knowledge graph.
  • Complete large knowledge graphs with missing statements.
  • Generate stand-alone knowledge graph embeddings.
  • Develop and evaluate a new relational model.

AmpliGraph's machine learning models generate knowledge graph embeddings, vector representations of concepts in a metric space.

These embeddings are then combined with model-specific scoring functions to predict unseen and novel links.

AmpliGraph 2.0.0 is now available!

The new version features a TensorFlow 2 back-end and Keras-style APIs that make the library faster, easier to use, and easier to extend with new features. Further, the data input/output pipeline has changed, and support for some obsolete models was discontinued.
See the Changelog for a more thorough list of changes.

Key Features

  • Intuitive APIs: AmpliGraph APIs are designed to reduce the amount of code required to learn models that predict links in knowledge graphs. The AmpliGraph 2 APIs follow the Keras style, making the user experience even smoother.
  • GPU-Ready: AmpliGraph 2 is based on TensorFlow 2 and is designed to run seamlessly on CPU and GPU devices to speed up training.
  • Extensible: Roll your own knowledge graph embeddings model by extending AmpliGraph base estimators.

Modules

AmpliGraph includes the following submodules:

  • Datasets: helper functions to load datasets (knowledge graphs).
  • Models: knowledge graph embedding models. AmpliGraph 2 contains TransE, DistMult, ComplEx, HolE and RotatE (more to come!).
  • Evaluation: metrics and evaluation protocols to assess the predictive power of the models.
  • Discovery: High-level convenience APIs for knowledge discovery (discover new facts, cluster entities, predict near duplicates).
  • Compat: submodule that extends the compatibility of AmpliGraph 2 APIs to those of AmpliGraph 1.x, for users already familiar with the latter.
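
As an illustration of how these submodules fit together, here is a minimal end-to-end sketch. It assumes the AmpliGraph 2 Keras-style API (ScoringBasedEmbeddingModel) and the FB15k-237 loader; the hyperparameters are placeholders, not tuned values.

from ampligraph.datasets import load_fb15k_237
from ampligraph.latent_features import ScoringBasedEmbeddingModel
from ampligraph.evaluation import mrr_score, hits_at_n_score

# Load a benchmark knowledge graph as (subject, predicate, object) triples.
X = load_fb15k_237()

# ComplEx embeddings trained with a Keras-style compile/fit workflow.
model = ScoringBasedEmbeddingModel(k=150, eta=10, scoring_type='ComplEx')
model.compile(optimizer='adam', loss='multiclass_nll')
model.fit(X['train'], batch_size=10000, epochs=20)

# Rank test triples against corruptions and report filtered metrics.
ranks = model.evaluate(X['test'],
                       use_filter={'train': X['train'],
                                   'valid': X['valid'],
                                   'test': X['test']})
print('MRR: %.2f, Hits@10: %.2f' % (mrr_score(ranks), hits_at_n_score(ranks, n=10)))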

Installation

Prerequisites

  • Linux, macOS, Windows
  • Python ≥ 3.8

Provision a Virtual Environment

To provision a virtual environment for installing AmpliGraph, any option will work; here we provide instructions for venv and Conda.

venv

The first step is to create and activate the virtual environment.

python3.8 -m venv PATH/TO/NEW/VIRTUAL_ENVIRONMENT
source PATH/TO/NEW/VIRTUAL_ENVIRONMENT/bin/activate

Once this is done, we can proceed with the installation of TensorFlow 2:

pip install "tensorflow==2.9.0"

If you are installing TensorFlow on macOS, use the following instead:

pip install "tensorflow-macos==2.9.0"

IMPORTANT: installing TensorFlow on macOS with an Apple silicon chip can be tricky. Although venv can provide a smooth experience, if issues persist we invite you to refer to the dedicated section below and consider using conda, in line with the TensorFlow Plugin page on the Apple developer site.

Conda

The first step is to create and activate the virtual environment.

conda create --name ampligraph python=3.8
source activate ampligraph

Once this is done, we can proceed with the installation of TensorFlow 2, which can be done through pip or conda.

pip install "tensorflow==2.9.0"

or 

conda install "tensorflow==2.9.0"

Install TensorFlow 2 for Mac OS M1 chip

When installing TensorFlow 2 on macOS with an Apple silicon chip, we recommend using a conda environment.

conda create --name ampligraph python=3.8
source activate ampligraph

After having created and activated the virtual environment, run the following to install Tensorflow.

conda install -c apple tensorflow-deps
pip install --user tensorflow-macos==2.9.0
pip install --user tensorflow-metal==0.6

In case of problems with the installation, or for further details, refer to the TensorFlow Plugin page on the official Apple developer website.

Install AmpliGraph

Once the installation of Tensorflow is complete, we can proceed with the installation of AmpliGraph.

To install the latest stable release from pip:

pip install ampligraph

To sanity check the installation, run the following:

>>> import ampligraph
>>> ampligraph.__version__
'2.1.0'

If instead you want the most recent development version, clone the repository from GitHub, check out the develop branch, and install AmpliGraph from source. In this way, your local working copy will be on the latest commit of the develop branch.

git clone https://github.com/Accenture/AmpliGraph.git
cd AmpliGraph
git checkout develop
pip install -e .

Notice that the code snippet above installs the library in editable mode (-e).

To sanity check the installation run the following:

>>> import ampligraph
>>> ampligraph.__version__
'2.1-dev'

Predictive Power Evaluation (MRR Filtered)

AmpliGraph includes implementations of TransE, DistMult, ComplEx, HolE and RotatE. Versions <2.0 also include ConvE and ConvKB. Their predictive power is reported below and compared against the state-of-the-art results in the literature. More details available here.

                                FB15K-237   WN18RR   YAGO3-10   FB15k    WN18
Literature Best                 0.35*       0.48*    0.49*      0.84**   0.95*
TransE                          0.31        0.22     0.50       0.62     0.66
DistMult                        0.30        0.47     0.48       0.71     0.82
ComplEx                         0.31        0.51     0.49       0.73     0.94
HolE                            0.30        0.47     0.47       0.73     0.94
RotatE                          0.31        0.51     0.43       0.70     0.95
ConvE (AmpliGraph v1.4)         0.26        0.45     0.30       0.50     0.93
ConvE (1-N, AmpliGraph v1.4)    0.32        0.48     0.40       0.80     0.95
ConvKB (AmpliGraph v1.4)        0.23        0.39     0.30       0.65     0.80
* Timothee Lacroix, Nicolas Usunier, and Guillaume Obozinski. Canonical tensor decomposition for knowledge base completion. In International Conference on Machine Learning, 2869–2878. 2018.
** Kadlec, Rudolf, Ondrej Bajgar, and Jan Kleindienst. "Knowledge base completion: Baselines strike back." arXiv preprint arXiv:1705.10744 (2017).
Results above are computed assigning the worst rank to a positive in case of ties. Although this is the most conservative approach, some published literature may adopt an evaluation protocol that assigns the best rank instead.

Documentation

Documentation available here

The project documentation can be built from your local working copy with:

cd docs
make clean autogen html

How to contribute

See guidelines from AmpliGraph documentation.

How to Cite

If you like AmpliGraph and you use it in your project, why not star the project on GitHub!

If you instead use AmpliGraph in an academic publication, cite as:

@misc{ampligraph,
 author= {Luca Costabello and
          Alberto Bernardi and
          Adrianna Janik and
          Aldan Creo and
          Sumit Pai and
          Chan Le Van and
          Rory McGrath and
          Nicholas McCarthy and
          Pedro Tabacof},
 title = {{AmpliGraph: a Library for Representation Learning on Knowledge Graphs}},
 month = mar,
 year  = 2019,
 doi   = {10.5281/zenodo.2595043},
 url   = {https://doi.org/10.5281/zenodo.2595043}
}

License

AmpliGraph is licensed under the Apache 2.0 License.

ampligraph's People

Contributors

acmcmc, adrijanik, albernar, cclauss, chanlevan, dogatekin, iamaziz, idigitopia, lukostaz, nicholasmccarthy, pyvandenbussche, sumitpai, tabacof

ampligraph's Issues

Replace generic exceptions with specific exceptions and helpful messages.

In many places in the code, a specific exception is caught and then a generic Exception is raised with a generic error message.

except KeyError:
    raise Exception('Some of the hyperparams for regularizer were not passed.')

Instead, we should raise a KeyError and list the specific keys that were not passed.
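
A minimal sketch of the requested pattern follows; the expected key names and the hyperparam_dict argument are illustrative, not the library's actual signature.

def get_regularizer_hyperparams(hyperparam_dict):
    # Hypothetical helper: fail loudly and name exactly which keys are missing.
    expected = {'lambda', 'p'}  # illustrative key names
    missing = expected - set(hyperparam_dict)
    if missing:
        raise KeyError('Missing regularizer hyperparameters: {}'.format(sorted(missing)))
    return {key: hyperparam_dict[key] for key in expected}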

Implement code to download datasets.

Background and Context
The current examples require publicly available datasets in order to run. Obtaining these datasets manually may be tedious. This process should be automated.

Description
Create a database class in the module that will download and store the required datasets.
Datasets have been saved to google drive: datasets

Add MD5 checksum for datasets

Description
Each dataset loader should have a check_MD5 argument (set to False by default) that performs an MD5 checksum of the downloaded dataset.
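
A minimal sketch of the kind of check a check_MD5 flag could perform (the function name and chunk size are illustrative):

import hashlib

def md5_checksum(file_path, chunk_size=1024 * 1024):
    # Compute the MD5 digest of a downloaded file, reading it in chunks.
    md5 = hashlib.md5()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

The loader would then compare the digest against a known reference value and raise an error on mismatch.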

Automate datasets download

Background and Context
Downloading and uncompressing each dataset manually contributes to user friction during installation.

Description
Add a script to automatically download and decompress datasets in the desired folder (which must match AMPLIGRAPH_DATA_HOME).
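
A sketch of what such a script could do, assuming one zip archive per dataset and the AMPLIGRAPH_DATA_HOME environment variable (the URL handling and fallback folder are illustrative):

import os
import urllib.request
import zipfile

def download_dataset(url, name):
    # Download a dataset archive and decompress it into AMPLIGRAPH_DATA_HOME.
    data_home = os.environ.get('AMPLIGRAPH_DATA_HOME',
                               os.path.join(os.path.expanduser('~'), 'ampligraph_datasets'))
    os.makedirs(data_home, exist_ok=True)
    archive_path = os.path.join(data_home, name + '.zip')
    urllib.request.urlretrieve(url, archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(data_home)
    return os.path.join(data_home, name)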

Improve datasets loaders documentation

Description
We must add in docstrings:

  • stats on how many triples are in each split, and how many distinct entities and relations
  • warning boxes for datasets with missing entities (FB15k-237, WN18RR)
  • references to where the dataset was first proposed, and from where we downloaded it.

A summary table in the "datasets" section would also help. Something like this:
[example summary table screenshot]

Add single experiment reproduce results

Description
We need to publish a single script that reproduces our best results shown in #17.
e.g.:
$ ./predictive_performance.py -i fb15k_237 -m complex

The script may take up to two arguments from the command line:

  • -m: the model (complex, transe, distmult)
  • -i: the dataset (fb15k, wn18, etc)

If no arguments are passed, all experiments are carried out.

Best hyperparams are hardcoded.

Verbosity should be strictly limited to:

  • output the best hyperparams,
  • overall progress bar on the outermost loop of experiments
  • final results, output as a table (you can use beautifultable)

This script will replace what's in the experiments folder.

Note everything should be kept as simple and user-friendly as possible.
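
A skeleton of what such a script could look like; the model registry, dataset names and the body of run_experiment are placeholders, not the published best configurations.

#!/usr/bin/env python
import argparse

MODELS = ['complex', 'transe', 'distmult']
DATASETS = ['fb15k', 'fb15k_237', 'wn18', 'wn18rr', 'yago3_10']

def run_experiment(dataset, model):
    # Placeholder: train with the hardcoded best hyperparams and report filtered MRR / Hits@N.
    print('Running {} on {}...'.format(model, dataset))

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Reproduce predictive performance experiments.')
    parser.add_argument('-m', choices=MODELS, help='model to train')
    parser.add_argument('-i', choices=DATASETS, help='dataset to use')
    args = parser.parse_args()
    # If no arguments are passed, all experiments are carried out.
    for dataset in ([args.i] if args.i else DATASETS):
        for model in ([args.m] if args.m else MODELS):
            run_experiment(dataset, model)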

Handling datasets with unseen entities

Description

Validation and test sets can contain entities that are unseen at training time. This causes the library to crash.

Actual Behavior

The library crashes when evaluating performance on validation/test sets that contain unseen entities.

Expected Behavior

The library should at least show a friendly error message. Ideally, there should be a configurable strict mode in the performance_evaluation method that lets the user choose whether execution keeps running or stops.

Steps to Reproduce

  1. Use load_wn18rr or load_fb15k_237 to load the data.
  2. Fit the model on the train+valid dataset.
  3. Predict on the test dataset. The library crashes at this step.
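
A sketch of the kind of guard the evaluation routine could apply; the helper name and the strict flag are illustrative, not the library's actual API.

import numpy as np

def filter_unseen_entities(X_test, train_entities, strict=True):
    # Keep only test triples whose subject and object were seen during training.
    seen = np.isin(X_test[:, 0], train_entities) & np.isin(X_test[:, 2], train_entities)
    if strict and not seen.all():
        raise ValueError('{} test triples contain entities unseen at training time.'
                         .format(int((~seen).sum())))
    return X_test[seen]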

Implement multi-class log-loss

Description

Implement multi-class log-loss as presented by Lacroix2018.

In section 6.2 (see screenshot below) they claim multi-class log-loss can be responsible for better results compared to our current binomial nll loss:

[screenshots from Lacroix et al. 2018, Section 6.2]

This is similar to Kadlec2017, where they use a sampled multi-class log loss that seems to perform better than ComplEx's current binomial nll loss (see also #22).

From Kadlec2017:
[screenshot from Kadlec et al. 2017]

And indeed Kadlec2017 uses the loss defined by Toutanova2015 (Section 3.3):

[loss definition screenshots from Toutanova and Chen 2015]
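
For reference, the multi-class log-loss treats each positive triple as a softmax classification over all candidate entities on the corrupted side. A hedged TensorFlow sketch (not the implementation that was eventually merged):

import tensorflow as tf

def multiclass_nll(scores_pos, scores_all_candidates):
    # scores_pos: [batch] scores of the true triples.
    # scores_all_candidates: [batch, n_entities] scores with one side replaced
    # by every candidate entity (the true entity included).
    # Loss per triple: -f(s, p, o) + log(sum_o' exp(f(s, p, o'))).
    return tf.reduce_mean(tf.reduce_logsumexp(scores_all_candidates, axis=1) - scores_pos)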

Add examples to savemodel, restoreModel

Background and Context
All functions' and classes' documentation must include working code examples.

Description
Add working examples of savemodel, restoreModel functions in their docstrings.
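
The kind of docstring example being requested might look like the sketch below. It assumes the ampligraph.utils save_model/restore_model helpers, an already fitted model, and a test array X_test; the file name is arbitrary.

from ampligraph.utils import save_model, restore_model

# Persist a trained model to disk, then load it back for inference.
save_model(model, model_name_path='complex_fb15k_237.pkl')
restored_model = restore_model(model_name_path='complex_fb15k_237.pkl')
y_pred = restored_model.predict(X_test)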

Assess docstring examples quality

Background and Context
AmpliGraph documentation includes examples for each public API.
Some examples may not be up to date with the latest code changes, and some new APIs may miss an example.

Description

  • Make sure all public APIs in the HTML documentation have an example
  • check if each example runs properly. Fix any broken code.
  • make sure HTML rendering of the example is correct
  • Also check the "Examples" section in examples.md.

Revamp documentation

Background and Context
AmpliGraph documentation is currently insufficient.

Description
Improve the API documentation project-wide.

Find a location to host the example datasets.

Background and Context
The datasets will have to be hosted somewhere the automated downloader can access them. Google Drive and Dropbox were investigated, but both require additional dependencies, which cannot be added at this time due to legal constraints around open-sourcing the project.

Description
To allow modules to download datasets automatically, the datasets need to be hosted in a location that does not require additional project dependencies.
Downloading the files from Google Drive, for example, would require additional dependencies:

pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib

from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

Dropbox has similar dependencies.

pip install dropbox

import dropbox

A hosting location will need to be identified and the datasets will need to be moved there.

Optimise code

Replace numpy objects with tensorflow objects where applicable.

Use self-explanatory, meaningful argument names for loss functions, regularizers and models

Background and Context
The signatures of all functions and classes of the library should be easily understandable. Argument names should be intuitive, and default values should lead to viable results, even without user-defined preferences.

Description
Currently, classes in loss_functions.py and regularizers.py such as PairwiseLoss take a hyperparam_dict argument in their __init__() method. This argument should be replaced with Python's kwargs approach and unpacked into meaningful argument names (e.g. margin, gamma).
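
A hedged before/after sketch of the requested change (class and argument names are illustrative, not the final signatures):

# Before: hyperparameters hidden inside an opaque dict.
class PairwiseLoss:
    def __init__(self, eta, hyperparam_dict):
        self.margin = hyperparam_dict['margin']

# After: explicit, self-documenting keyword arguments with viable defaults.
class PairwiseLoss:
    def __init__(self, eta, margin=1.0):
        self.margin = margin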

Implement alternative ranking evaluation protocols

Background

Literature adopts different interpretations of the ranking evaluation protocol. We must implement alternative protocols, to be enabled with a flag in the evaluation function, to get fair comparisons of results.

Description

Protocols to implement:

  • Trouillon16 (ComplEx paper): ranks each positive against corruptions of head and tail together. (currently implemented version)
  • ConvE, RotatE: corrupt head and tail separately

Implement the evaluation strategy described by RotatE and ConvE.
This may also require revisiting the metric implementations (MRR, MR, Hits@N).

From the ConvE paper:

[evaluation protocol excerpt from the ConvE paper]

See also RotatE reference implementation for clues.
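
A hedged sketch of the head/tail-separate protocol (ConvE/RotatE style). It assumes a model.predict-like call that scores an array of triples; bookkeeping such as filtering known positives is omitted.

import numpy as np

def rank_sides_separately(model, triple, entities):
    # Rank one positive against subject-only and object-only corruptions.
    s, p, o = triple
    true_score = model.predict(np.array([triple]))[0]
    obj_corruptions = np.array([(s, p, e) for e in entities if e != o])
    subj_corruptions = np.array([(e, p, o) for e in entities if e != s])
    # Worst rank in case of ties (>=), consistent with the rest of the library.
    rank_o = 1 + int(np.sum(model.predict(obj_corruptions) >= true_score))
    rank_s = 1 + int(np.sum(model.predict(subj_corruptions) >= true_score))
    return rank_s, rank_o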

Experiments fail due to prime_number_list.txt file

Description

When running the experiments, an error is thrown after training has completed.
The error shows that the prime_number_list.txt file cannot be found.

FileNotFoundError: [Errno 2] No such file or directory: 'anaconda3/envs/ampligraph/lib/python3.6/site-packages/ampligraph/latent_features/prime_number_list.txt'

Actual Behavior

An error is thrown after training and no results are generated.

Expected Behavior

The model should continue to evaluation and display results.

Resolve Sphinx warnings when generating documentation.

Background and Context
The Sphinx documentation generation process currently produces a lot of warnings. These warnings should be resolved or handled.

Description
Resolve the warnings generated by Sphinx when the documentation is built. These warnings include missing citations, unexpected unindents, and missing blank lines after block quotes.

YAGO3-10 experiments

Description
Run a batch of experiments on the YAGO3-10 dataset.

From [Lacroix2018]:

[YAGO3-10 results table from Lacroix et al. 2018]

Sphinx documentation fails to build due to logger configuration file.

Sphinx logs of the first reported failure (March 8th)

[rtd-command-info] start-time: 2019-03-08T15:18:41.597007Z, end-time: 2019-03-08T15:18:42.128740Z, duration: 0, exit-code: 2
python /home/docs/checkouts/readthedocs.org/user_builds/accenture-labs-ampligraph/envs/latest/bin/sphinx-build -T -E -b readthedocs -d _build/doctrees-readthedocs -D language=en . _build/html
Running Sphinx v1.7.9

Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/accenture-labs-ampligraph/envs/latest/lib/python3.7/site-packages/sphinx/config.py", line 161, in __init__
    execfile_(filename, config)
  File "/home/docs/checkouts/readthedocs.org/user_builds/accenture-labs-ampligraph/envs/latest/lib/python3.7/site-packages/sphinx/util/pycompat.py", line 150, in execfile_
    exec_(code, _globals)
  File "conf.py", line 26, in <module>
    import ampligraph
  File "/home/docs/checkouts/readthedocs.org/user_builds/accenture-labs-ampligraph/envs/latest/lib/python3.7/site-packages/ampligraph/__init__.py", line 8, in <module>
    logging.config.fileConfig(fname=os.path.join(os.path.abspath(os.path.dirname(__file__)),'logger.conf'), disable_existing_loggers=False)
  File "/home/docs/.pyenv/versions/3.7.1/lib/python3.7/logging/config.py", line 71, in fileConfig
    formatters = _create_formatters(cp)
  File "/home/docs/.pyenv/versions/3.7.1/lib/python3.7/logging/config.py", line 104, in _create_formatters
    flist = cp["formatters"]["keys"]
  File "/home/docs/.pyenv/versions/3.7.1/lib/python3.7/configparser.py", line 958, in __getitem__
    raise KeyError(key)
KeyError: 'formatters'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/accenture-labs-ampligraph/envs/latest/lib/python3.7/site-packages/sphinx/cmdline.py", line 303, in main
    args.warningiserror, args.tags, args.verbosity, args.jobs)
  File "/home/docs/checkouts/readthedocs.org/user_builds/accenture-labs-ampligraph/envs/latest/lib/python3.7/site-packages/sphinx/application.py", line 163, in __init__
    confoverrides or {}, self.tags)
  File "/home/docs/checkouts/readthedocs.org/user_builds/accenture-labs-ampligraph/envs/latest/lib/python3.7/site-packages/sphinx/config.py", line 167, in __init__
    raise ConfigError(CONFIG_ERROR % traceback.format_exc())
sphinx.errors.ConfigError: There is a programable error in your configuration file:

Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/accenture-labs-ampligraph/envs/latest/lib/python3.7/site-packages/sphinx/config.py", line 161, in __init__
    execfile_(filename, config)
  File "/home/docs/checkouts/readthedocs.org/user_builds/accenture-labs-ampligraph/envs/latest/lib/python3.7/site-packages/sphinx/util/pycompat.py", line 150, in execfile_
    exec_(code, _globals)
  File "conf.py", line 26, in <module>
    import ampligraph
  File "/home/docs/checkouts/readthedocs.org/user_builds/accenture-labs-ampligraph/envs/latest/lib/python3.7/site-packages/ampligraph/__init__.py", line 8, in <module>
    logging.config.fileConfig(fname=os.path.join(os.path.abspath(os.path.dirname(__file__)),'logger.conf'), disable_existing_loggers=False)
  File "/home/docs/.pyenv/versions/3.7.1/lib/python3.7/logging/config.py", line 71, in fileConfig
    formatters = _create_formatters(cp)
  File "/home/docs/.pyenv/versions/3.7.1/lib/python3.7/logging/config.py", line 104, in _create_formatters
    flist = cp["formatters"]["keys"]
  File "/home/docs/.pyenv/versions/3.7.1/lib/python3.7/configparser.py", line 958, in __getitem__
    raise KeyError(key)
KeyError: 'formatters'

Configuration error:
There is a programable error in your configuration file:

Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/accenture-labs-ampligraph/envs/latest/lib/python3.7/site-packages/sphinx/config.py", line 161, in __init__
    execfile_(filename, config)
  File "/home/docs/checkouts/readthedocs.org/user_builds/accenture-labs-ampligraph/envs/latest/lib/python3.7/site-packages/sphinx/util/pycompat.py", line 150, in execfile_
    exec_(code, _globals)
  File "conf.py", line 26, in <module>
    import ampligraph
  File "/home/docs/checkouts/readthedocs.org/user_builds/accenture-labs-ampligraph/envs/latest/lib/python3.7/site-packages/ampligraph/__init__.py", line 8, in <module>
    logging.config.fileConfig(fname=os.path.join(os.path.abspath(os.path.dirname(__file__)),'logger.conf'), disable_existing_loggers=False)
  File "/home/docs/.pyenv/versions/3.7.1/lib/python3.7/logging/config.py", line 71, in fileConfig
    formatters = _create_formatters(cp)
  File "/home/docs/.pyenv/versions/3.7.1/lib/python3.7/logging/config.py", line 104, in _create_formatters
    flist = cp["formatters"]["keys"]
  File "/home/docs/.pyenv/versions/3.7.1/lib/python3.7/configparser.py", line 958, in __getitem__
    raise KeyError(key)
KeyError: 'formatters'

Implement HolE

Description

Implement HolE.

Nickel, Maximilian, Lorenzo Rosasco, and Tomaso A. Poggio. "Holographic Embeddings of Knowledge Graphs." AAAI. Vol. 2. No. 1. 2016.
http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/12484/11828

Note that the HolE scoring function can be implemented as in the orange square below (table from the RotatE paper):

[scoring function table from the RotatE paper]

It is also worth taking this paper into account:
Hayashi, Katsuhiko, and Masashi Shimbo. "On the equivalence of holographic and complex embeddings for link prediction." arXiv preprint arXiv:1702.05563 (2017).
https://arxiv.org/pdf/1702.05563.pdf

Hayashi claims that the HolE scoring function is simply:

[HolE/ComplEx equivalence formula from Hayashi and Shimbo 2017]

So it is very easy to implement, but the predictive power and validity of this approach must be carefully tested with experiments.
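
For reference, a NumPy sketch of the circular-correlation form of the HolE scoring function (illustration only, not the implementation that shipped in the library):

import numpy as np

def hole_score(e_s, r_p, e_o):
    # HolE: f(s, p, o) = r_p . (e_s star e_o), where star is circular correlation,
    # computed efficiently in the Fourier domain.
    corr = np.fft.ifft(np.conj(np.fft.fft(e_s)) * np.fft.fft(e_o)).real
    return float(np.dot(r_p, corr))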

Resolve warnings in code to future proof module.

Background and Context
Currently, when running the tests, 156 warnings are generated.
These warnings are due to the use of deprecated methods or future warnings about methods that will be removed in the next iteration of pandas.
To help support future versions we should resolve these issues.

Description
Warnings need to be resolved or handled to avoid future updates of dependencies.

Remove HDT support or make it optional

Description

Currently the documentation suggests that installing with HDT support is optional.
However, unit tests and dataset code will fail to run unless it is installed.

This is due to the dataset.py script importing hdt:

from hdt import HDTDocument

Actual Behavior

Currently hdt support is not optional.

Expected Behavior

hdt support should be optional.
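
A common way to make such a dependency optional is a guarded import, sketched below (illustrative only; not necessarily how the fix was implemented):

# datasets.py: defer the failure until HDT functionality is actually used.
try:
    from hdt import HDTDocument
except ImportError:
    HDTDocument = None

def load_from_hdt(path):
    if HDTDocument is None:
        raise ImportError('HDT support requires the optional hdt package to be installed.')
    return HDTDocument(path)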

Implement tf.data api for loading data

Use tf.data to efficiently extract and preprocess the data and apply transformations like batching, shuffling and mapping functions. This should remove the bottleneck of performing these operations on the CPU.
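
A minimal tf.data sketch of the intended pipeline; the toy triples, batch size and the identity map (a placeholder for corruption generation or other preprocessing) are illustrative.

import tensorflow as tf

# Toy integer-indexed triples standing in for a real training set.
triples = tf.constant([[0, 0, 1], [1, 1, 2], [2, 0, 3]], dtype=tf.int32)

dataset = (tf.data.Dataset.from_tensor_slices(triples)
           .shuffle(buffer_size=10000)
           .map(lambda t: t, num_parallel_calls=tf.data.AUTOTUNE)  # preprocessing goes here
           .batch(2)
           .prefetch(tf.data.AUTOTUNE))

for batch in dataset:
    pass  # feed each batch to the training step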

Run time performance analysis

Description

Can you please report

  • how many milliseconds it takes to train a batch for each model on FB15k-237, when k=200, eta=2, batches_count=100, loss=nll

Tensorboard Visualizations

Background and Context

Embeddings can be visualized nicely in TensorBoard, but require the appropriate checkpoint and metadata files to be written.

Description

  • Implement function to write embeddings and labels to disk as files read by tensorboard.
  • (Optional): include functionality that can give groups/labels for each embedding, as Tensorboard can colour each group in a different colour.
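
A hedged sketch of what such a function could write: the TensorBoard Embedding Projector needs a checkpointed variable plus a tab-separated metadata file (the file and variable names below are illustrative).

import os
import tensorflow as tf

def export_embeddings_for_tensorboard(embeddings, labels, log_dir):
    # Write an embedding checkpoint and a metadata.tsv so TensorBoard's
    # Embedding Projector can display (and colour-code) the entities.
    os.makedirs(log_dir, exist_ok=True)
    with open(os.path.join(log_dir, 'metadata.tsv'), 'w') as f:
        for label in labels:
            f.write('{}\n'.format(label))
    checkpoint = tf.train.Checkpoint(embedding=tf.Variable(embeddings))
    checkpoint.save(os.path.join(log_dir, 'embedding.ckpt'))

A projector_config.pbtxt linking the checkpoint tensor to the metadata file is also needed; the tensorboard.plugins.projector helper can generate it.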

Remove misc.py

The code in this file is for some previous work on explainability, and should be removed from this release.
