stellargraph / stellargraph

StellarGraph - Machine Learning on Graphs

Home Page: https://stellargraph.readthedocs.io/

License: Apache License 2.0

Python 98.78% Shell 1.04% Dockerfile 0.19%

graphs machine-learning machine-learning-algorithms graph-convolutional-networks networkx geometric-deep-learning saliency-map interpretability heterogeneous-networks graph-neural-networks

stellargraph's Issues

Understand HIN GraphSage algorithm

Description

Kevin and Yuriy have implemented a HIN GraphSAGE algorithm. I'd like to understand the implementation.

There are other heterogeneous GCN-like algorithms in the literature; read and understand them. How do they compare? Which algorithms could we implement for the ML library? Can we obtain code and test them on different problems? What input sampling strategies are required for each algorithm? How do training and prediction differ?

Done Checklist (Research)

Add different algorithms to documentation on Google Docs
Add sampling strategies to documentation on Google Docs

Tune node2vec parameters for link prediction demo

Description

Currently, the link prediction demo uses fixed parameter values, e.g., p=q=1 and several other parameters, for node2vec. We need to allow for these parameters to be tuned for improved link prediction performance.
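
To make the role of p and q concrete, here is a minimal sketch of a node2vec-style second-order biased walk together with a small parameter grid. It is not the demo's code: the adjacency-dict graph representation and the names `node2vec_walk` and `parameter_grid` are assumptions made for illustration, and the graph is treated as undirected and unweighted.

```python
import itertools
import random


def node2vec_walk(adj, start, length, p, q, rng):
    """Second-order biased random walk in the style of node2vec.

    adj maps each node to a set of neighbours (hypothetical in-memory format).
    p is the return parameter, q is the in-out parameter.
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])  # sorted so a seeded rng gives repeatable walks
        if not nbrs:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay at distance 1 from the previous node
            else:
                weights.append(1.0 / q)  # move further away from the previous node
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


def parameter_grid(p_values, q_values):
    """All (p, q) combinations to evaluate."""
    return list(itertools.product(p_values, q_values))


adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
for p, q in parameter_grid([0.5, 1.0], [1.0, 2.0]):
    walk = node2vec_walk(adj, 0, 10, p, q, random.Random(0))
    # In a real tuning loop, the walks would be fed to the embedding and
    # link prediction steps, and the (p, q) pair with the best score kept.
```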

User Story

As a: Research Engineer
I want: to tune the hyper-parameters of the node2vec algorithm
so that: I can achieve the highest performance in link prediction

Done Checklist (Development)

Code to tune node2vec hyper-parameters
Pull request

Write simple data feed to Graphsage for StellarML library

Description

Currently the Graphsage code from Kevin runs well, but requires a redis database and is slow for simple single computer testing.

As part of the StellarML library we want to pass data into tensorflow fast. Having a simple in-memory graphsage sampler would be a good start.
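
As a starting point, an in-memory sampler could look like the sketch below. It assumes the graph fits in a plain adjacency dict and samples a fixed number of neighbours per hop with replacement, as GraphSAGE does; the function name `sample_neighbourhood` and the padding rule for isolated nodes are assumptions, not part of Kevin's code.

```python
import random


def sample_neighbourhood(adj, head_nodes, num_samples, rng):
    """Fixed-size neighbour sampling for a batch of head nodes.

    Returns a list of layers: layer 0 holds the head nodes, and layer k holds
    len(layer k-1) * num_samples[k-1] nodes sampled with replacement.
    """
    layers = [list(head_nodes)]
    for n in num_samples:
        next_layer = []
        for node in layers[-1]:
            nbrs = sorted(adj.get(node, ()))
            if nbrs:
                next_layer.extend(rng.choices(nbrs, k=n))
            else:
                # Assumption: an isolated node is padded with itself.
                next_layer.extend([node] * n)
        layers.append(next_layer)
    return layers


adj = {0: {1, 2}, 1: {0}, 2: {0, 3}, 3: {2}}
layers = sample_neighbourhood(adj, [0, 3], [2, 3], random.Random(1))
print([len(layer) for layer in layers])  # [2, 4, 12]
```

Because every layer has a fixed size, the node IDs in each layer can be mapped to feature rows and reshaped into dense arrays before being passed to tensorflow.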

Done Checklist (Development)

Done Checklist (Research)

Done Checklist (Bug)

Bug fixed
Branch and Pull Request build on CI
Branch and Pull Request pass unit tests on CI
Branch and Pull Request pass integration tests on CI
Version number reflects new status
Peer Code Review Performed
Code well commented

Extend HinSAGE for Link Prediction

Description

Use the documentation for HinSAGE link prediction to create a working link prediction example using the Paradise Papers dataset from the Data team.

User Story

As a: data scientist
I want: to use GraphSAGE layers for link prediction
so that: I can run scalable link prediction

Done Checklist (Development)

Working example with Alzheimer data
Code well commented
Documentation in repo
Peer Code Review Performed

Present Architecture(s) and Bottlenecks for GraphSAGE

Description

Present initial analysis of scalable architecture for GraphSAGE, its bottlenecks, and possible ways to fix them.

Checklist

Add to architecture documentation in Google Docs

Done

https://drive.google.com/open?id=1jdo6ZvNZscTaMj6jiQLGyjhuF1n-c8uTKY4TtT0uxbE

Prepare YOW Data Experiment

Description

We need to get started with an interesting demonstration of machine learning on graphs for the YowData! conference in mid-May.
First, we need to decide on a dataset and problem.

Checklist

Write-up of dataset and problem options on Google Docs
Initial implementation of solving the selected problem on the dataset

Adapt the initial library skeleton to include Kevin's initial GraphSAGE workflow

Description

Adapt the initial library skeleton to include Kevin's initial GraphSAGE workflow

Done Checklist (Development)

Improve speed of 'local' sampling method for link prediction

Description

Sampling negative edges for link prediction using the nodes' local neighbourhood structure currently uses BFS, which runs very slowly if target nodes more than 5 edges away need to be sampled. This issue is about replacing BFS with DFS to speed up the sampling algorithm.
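
The idea can be sketched as a depth-limited, randomised DFS that stops at the first node that is not already linked to the source. This is an illustration only: the adjacency-dict format and the name `sample_negative_target` are assumptions, not the demo's actual code.

```python
import random


def sample_negative_target(adj, source, max_depth, rng):
    """Depth-limited randomised DFS from `source`.

    Returns the first node reached that is neither `source` nor one of its
    direct neighbours, or None if no such node is found within `max_depth`.
    """
    visited = {source}
    stack = [(source, 0)]
    while stack:
        node, depth = stack.pop()
        if node != source and node not in adj[source]:
            return node  # (source, node) is a valid negative edge
        if depth < max_depth:
            nbrs = sorted(n for n in adj[node] if n not in visited)
            rng.shuffle(nbrs)  # randomise the order in which branches are explored
            for n in nbrs:
                visited.add(n)
                stack.append((n, depth + 1))
    return None


# Path graph 0 - 1 - 2 - 3
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(sample_negative_target(adj, 0, 2, random.Random(0)))  # 2
print(sample_negative_target(adj, 0, 1, random.Random(0)))  # None
```

Unlike BFS, this does not expand every node at each distance before going deeper, so it can return early. The trade-off is that a node first reached along a long path is marked as visited at that depth, so the search can occasionally miss candidates that BFS would find.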

User Story

As a: Research Engineer
I want: to run link experiments as fast as possible
so that: I maximise my efficiency.

Done Checklist (Development)

Updated source code to replace BFS for target nodes with DFS.
Pull request

YowData: Investigate Netflix prediction using N2V

Description

For the YowData conference, I'd like to present a recommender example. The Netflix prize dataset is well known, and a large amount of effort has been spent on getting results on this dataset. Good performance on this dataset would be impressive.

Recommender systems are often not thought about in terms of graphs. Therefore, posing this in a graph framework and solving it would be interesting. We can start by using node2vec to extract node embeddings and trying to predict the scores from this.

Done Checklist (Research)

Notes or slides on recommendations for movielens with node2vec
Code for recommendations for movielens with node2vec

[Dynamic Node2Vec] Investigate temporal updates of skipgram model

Description

Hooman is performing thorough experiments on how his dynamic random walk methods perform as part of an end-to-end dynamic node2vec algorithm. There are difficulties with how the skip-gram model interacts with random walk updates. To get a good publication we need a description of the skip-gram model and some explanation of how different training update schemes will affect the model.
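
For reference, the input to a skip-gram model is a set of (target, context) pairs extracted from each walk with a sliding window. The sketch below shows that extraction step only; the function name `skipgram_pairs` is hypothetical and no negative sampling or training is included.

```python
def skipgram_pairs(walk, window):
    """Generate (target, context) pairs from one random walk.

    Each node is paired with every other node at most `window` positions
    away from it in the walk.
    """
    pairs = []
    for i, target in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, walk[j]))
    return pairs


print(skipgram_pairs(["a", "b", "c"], window=1))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```

Since the pairs depend only on the walk they come from, a walk that changes after a graph update affects only its own pairs. How the model should be retrained on those changed pairs is the open question in this issue.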

Done Checklist (Research)

Skip-gram model understanding
Documentation of skip-gram model
Documentation of skip-gram model update techniques

Create baseline skeleton library

Description

Create initial dummy library using the documentation and pseudo code already accumulated

Done Checklist

Code
Pull Request
Unit Tests

Test link prediction demo on HIN

Description

The link prediction demo works on homogeneous datasets. We want to test whether it also works on heterogeneous datasets under the assumption that the latter will be treated as homogeneous.

User Story

As a: Research Engineer
I want: to make sure that the link prediction demo works for both homogeneous and heterogeneous networks
so that: I can tackle more general analytics problems.

Done Checklist (Development)

Determine heterogeneous dataset for testing
Link prediction demo works with the selected heterogeneous network as input.
Pull request

Review Kevin's pull request for graphsage demo

Description

Review the pull request #12

Done Checklist (Development)

Peer Code Review Performed
Code well commented
Documentation in repo

Scalable Node Attribute Inference for Graphs

Description

Build a scalable implementation of node attribute inference (NAI) for graphs that works on graphs with at least 10M nodes.

Value

Besides satisfying stakeholders' requirements for scalable attribute inference tasks on large graph datasets (thus expanding the NAI capability of Release 1), this should allow us to find an optimal scalable architecture for other ML tasks on graphs, such as link prediction and classification, recommendations, etc.

Investigate risknet with small-graph version GraphSage

Description

Since risknet has a similar data size to the Cora dataset, check whether the small-graph version of GraphSage works with risknet.

Done Checklist (Research)

Code Review
Documentation in repo

Investigate Metapath2Vec paper for link prediction on heterogeneous graphs

Description

I want to understand the Metapath2Vec algorithm for representation learning in heterogeneous graphs.

User Story

As a: Research Engineer
I want: to understand the MetaPath2Vec algorithm for representation learning on heterogeneous graphs
so that: I can use it for node attribute inference and link prediction.

Done Checklist (Development)

Document differences between Metapath2Vec and Node2Vec algorithms
Determine scalability issues that are unique to Metapath2Vec
Search for reference implementation and, if found, run some experiments on test graphs to better understand its performance.

Extend HinSAGE demo code for unsupervised learning

Description

Implement the wrappers / additional layers for unsupervised learning around the HinSAGE demo code. Create a working example using the risk net dataset.

User Story

As a: data scientist
I want: to run unsupervised learning using GraphSAGE
so that: I can transform my large dataset into node embeddings

Done Checklist

Peer Code Review Performed
Code well commented
Documentation in repo

PoC for Unsupervised GraphSAGE

Description

Currently the GraphSage unsupervised method is not in the library. This task is to add a simple unsupervised GraphSage module to the stellar-ml library.

User Story

As a: Data Scientist
I want: everything that GraphSAGE offers
so that: I have freedom for my unsupervised method experiments

Done Checklist (Development)

Branch and Pull Request build on CI
Code well commented

Write YOWData! presentation

Description

I'll be presenting at YOWData! on the 15th (at 5pm) so I need to prepare some slides!

Done Checklist

Slides on Google Docs
Give YOWData! presentation

Explore Aboleth for library design choices

Description

Can we borrow some design choices, e.g., base classes & inheritance, layer compositions, pipelining?
Link: https://github.com/data61/aboleth

Checklist

List of Aboleth base classes + description, perhaps indicating which base classes can be borrowed into our library
pseudo code for node2vec+logistic workflow with the graph ML library (as we imagine it)
pseudo code for GraphSAGE

Build demo link prediction code from existing

Description

Use existing code to build up a link prediction demo for homogeneous graphs

Checklist

Make stellar-ml-sandbox
Create link-prediction scripts
How-to Readme

Improve data splitting code for link prediction

Description

The node splitter developed for the link prediction demo of issue #8 needs to be improved so that negative samples are more challenging, i.e., they should not be selected at random from all pairs of disconnected nodes but rather from pairs of disconnected nodes that are nearby in the graph.
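
One way to sample "nearby" negatives is to restrict candidates to nodes within a few hops of the source that are not directly linked to it. The sketch below uses a bounded BFS for this; the adjacency-dict format and the names `hop_distances` and `sample_local_negative` are assumptions for illustration, not the splitter's actual interface.

```python
import random
from collections import deque


def hop_distances(adj, source, max_hops):
    """BFS distances from `source`, exploring no further than `max_hops`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def sample_local_negative(adj, source, max_hops, rng):
    """Pick a negative edge (source, target) with 2 <= distance <= max_hops."""
    dist = hop_distances(adj, source, max_hops)
    candidates = sorted(n for n, d in dist.items() if d >= 2)
    if not candidates:
        return None
    return (source, rng.choice(candidates))


# Path graph 0 - 1 - 2 - 3 - 4
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(hop_distances(adj, 0, 3))  # {0: 0, 1: 1, 2: 2, 3: 3}
```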

User Story

As a: Research Engineer
I want: to use my data to correctly evaluate my link prediction algorithm
so that: I am confident about its performance on unseen data.

Done Checklist (Development)

Edge splitter class with improved sampling algorithm
Integration of new edge splitter class with baseline link prediction demo
Pull Request

Create stellar-ml library structure, and populate with base classes

Description

Define the library's structure, base classes, methods, some helper functions, etc.
Create unit tests for all the library's base classes and helper functions

User Story

As a: developer of the library
I want: to see a clear structure of base classes to inherit from, their methods, and examples of composing workflows from them.

Done Checklist (Development)

Assumptions of the user story met
Produced code for required functionality
Branch and Pull Request pass unit tests on CI
Peer Code Review Performed
Code well commented
Documentation in repo

Update GraphSAGE HIN document with ML Task architectures (continuing #6)

Description

Continuing from #6

Done Checklist

Description added to the document of the architectures for
- unsupervised node feature learning
- semi-supervised node attribute inference on HINs with GraphSAGE
- supervised link prediction on HINs

Clean up Movielens using HIN Graphsage and move to demos

Description

The movielens recommender demo developed for YOWData could be useful for other problems (Anna would like to try it out to see if it will work for the medicare dataset).

Currently the code is rough and ready, so I'd like to tidy it up, add documentation and have a quick-to-run test case (say on movielens 100k).

Done Checklist (Research)

Code Review
Documentation in repo
Code well commented

Add equations for GraphSAGE aggregator and nonlinear embedding generator in case of HINs

Description

Add equations (similar to those in Algorithm 1 or 2 in the GraphSAGE paper) for the embedding step (aggregator + dense layer) applied to HINs
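
For orientation, the homogeneous update at layer k from Algorithm 1 of the GraphSAGE paper is written out below, followed by one possible HIN variant. The HIN form is an assumption offered as a starting point, not the equations of the existing implementation: it assumes mean aggregation per edge type r, a separate weight matrix per edge type, and a self weight matrix per node type t_v, with R_v the set of edge types incident to v.

```latex
% Homogeneous GraphSAGE, layer k (Algorithm 1 of the GraphSAGE paper)
\begin{align}
h_{N(v)}^{k} &= \mathrm{AGGREGATE}_k\left(\{ h_u^{k-1} : u \in N(v) \}\right) \\
h_v^{k} &= \sigma\left( W^{k} \cdot \mathrm{CONCAT}\left( h_v^{k-1},\; h_{N(v)}^{k} \right) \right) \\
h_v^{k} &\leftarrow h_v^{k} / \lVert h_v^{k} \rVert_2
\end{align}

% One possible HIN variant (assumption): mean aggregation per edge type r
\begin{align}
h_{N_r(v)}^{k} &= \frac{1}{|N_r(v)|} \sum_{u \in N_r(v)} h_u^{k-1} \\
h_v^{k} &= \sigma\left( W_{t_v,\mathrm{self}}^{k}\, h_v^{k-1}
          + \frac{1}{|R_v|} \sum_{r \in R_v} W_{r,\mathrm{neigh}}^{k}\, h_{N_r(v)}^{k} \right)
\end{align}
```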

Done Checklist

Equations added to Heterogeneous GraphSAGE document created in Issue #6

Investigate message passing for node2vec

Description

Node2vec can be implemented in a message-passing framework. However, this is strictly only true for prediction. Can we also place training in a message passing framework?

User Story

As a: developer of the graphml library
I want: to train and predict using node2vec in a message-passing framework
so that: I can train node2vec in a one-step scalable fashion.

Done Checklist (Research)

Code Review
Documentation on Google Docs

Run HIN Graphsage for Movielens 1M dataset with node attributes

Description

Kevin has written code for HIN GraphSage. I'd like to use this to make predictions on the Movielens 1M dataset with the same train/test split as other examples and using intrinsic user/movie features.

Done Checklist

Obtain performance numbers for node2vec features
Obtain performance numbers for intrinsic features
Documentation on Google Docs
Code Review

Unit tests for link prediction demo

Description

We need unit tests for the link prediction utility classes.

User Story

As a: Research Engineer
I want: to make sure that changes to the link prediction code are not breaking existing functionality
so that: I can be certain that my code works correctly as it is expanded and improved.

Done Checklist (Development)

Create test directory for link prediction demo
Add test for link prediction code
Pull request

Prepare for GraphSAGE/HinSAGE usage during Hackathon

Description

Prepare for the Spotify hackathon to allow everyone to use GraphSAGE/HinSAGE with ease on the day. Investigate the dataset and prepare notes on any requirements such as AWS setup, input batch preparation code, etc.

User Story

As a: Hackathoner
I want: to run Stellar's graph ML algorithms during the Hackathon
so that: we can win the Spotify competition

Done Checklist (Development)

Documentation on Google Docs
Documentation in repo

YowData: Investigate Netflix prediction using GraphSage

Description

Done Checklist (Research)

Team talk
Results for movie rating predictions (RMSE)
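
For reference, RMSE over predicted ratings can be computed as in this small sketch. The function name `rmse` is illustrative and the ratings are made-up values, not experimental results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up ratings for illustration only
print(rmse([3, 4], [3, 4]))  # 0.0
```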

Improve link prediction demo to handle non-integer node IDs

Description

The current implementation of the link prediction demo assumes that node IDs are integers. This is a restrictive assumption because for some datasets the node IDs are not integers. This causes the link prediction demo to fail with an Exception. We need to generalise the code so that it handles non-integer node IDs.
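
A common way to remove the integer-ID assumption is to map arbitrary hashable node IDs to contiguous integer indices once, and to map back when reporting results. The sketch below illustrates this; the name `build_index` and the edge-list input are assumptions, not the demo's actual interface.

```python
def build_index(edges):
    """Map arbitrary hashable node IDs to contiguous integer indices.

    Returns (node_to_idx, idx_to_node, int_edges); indices are assigned in
    order of first appearance in `edges`.
    """
    node_to_idx = {}
    for u, v in edges:
        for node in (u, v):
            if node not in node_to_idx:
                node_to_idx[node] = len(node_to_idx)
    idx_to_node = list(node_to_idx)  # position i holds the original ID of index i
    int_edges = [(node_to_idx[u], node_to_idx[v]) for u, v in edges]
    return node_to_idx, idx_to_node, int_edges


edges = [("alice", "bob"), ("bob", 42), (42, "alice")]
node_to_idx, idx_to_node, int_edges = build_index(edges)
print(int_edges)       # [(0, 1), (1, 2), (2, 0)]
print(idx_to_node[2])  # 42
```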

User Story

As a: Research Engineer
I want: to perform link prediction on a variety of network datasets stored in valid EPGM format
so that: I can be certain of the link prediction algorithm's generalisation

Done Checklist (Development)

Update implementation to handle non-integer node IDs
Add unit tests
Pull request

Collaborate with Platform Team on caching architecture

Description

The platform team is building an experimental stack. We need to ensure that this meets the needs of the ML and Data teams.

User Story

As a: IA ML dev
I want: to ensure that I'm building in sync with the platform team
so that: there is no wasted effort

Done Checklist (Development)

Documentation on Google Docs

Graph Machine Learning library that is easy to use and contribute to

Description

Create a machine learning library in Python that is simple to use and simple to contribute to. The library should focus on deep learning algorithms for graphs, and not attempt to duplicate existing algorithms, e.g., community detection, random forests, etc.

Value

This library will allow Data Scientists and Researchers to create models over network datasets with minimal overhead. The goal is to allow a fast experiment cycle time, with minimal assumed knowledge. For Researchers, it should be a place to add new algorithms, get their algorithms seen, and supply functions for building new deep learning models on graphs.

YowData: Prepare spammers example

Description

I want to present an example at the YowData conference. The spammers dataset is an interesting case for applying graph ML. I want to prepare the spammers dataset and run node attribute inference on it.

Note:
Anna has done some investigation into using GraphSage and node2vec, so I will find out what has been done so far.

Done Checklist (Research)

Gave presentation at YowData

Investigate Turi for graph processing in the ML library and platform

Description

Apple has open-sourced Turi, which is a powerful graph processing framework. We should evaluate this tech with the following criteria:

functionality
ease of use
ease of scalability
performance

This task would be to ingest a > 1M edge dataset and perform a set of graph tasks e.g. BFS/DFS, graph traversal, grabbing neighbours, random sampling.

User Story

As a: data scientist
I want: the graph processing part of the library to be fast and have lots of functionality
so that: I can move on to my tensorflow part to build my model

Done Checklist (Development)

Small experiment setup to run the evaluation, e.g. python script file
Documentation on Google Docs
Team demo

Update EPGM class to use networkx v2.*

Description

Currently, our graph processing module requires an earlier version of networkx, e.g., 1.*. Newer versions of networkx, namely 2.*, have changed how nodes and edges are returned to the user. We need to update our code to work with the newer version of networkx because it is becoming more common and version mismatches often cause problems.

User Story

As a: Research Engineer
I want: my network analytics library to work with the latest version of python modules
so that: I can make use of the latest developments and improvements in these modules

Done Checklist (Development)

Produced code for required functionality
Unit tests updated and new ones added as necessary
Pull request

Git workflow demo

Description

Use example git repositories to understand the git and github workflow.

User Story

As a: Research engineer
I want: to understand git and github workflows
so that: I can work with the rest of the team to develop the ML library.

Done Checklist (Development)

Create test repos
Document the workflow for forking a repo, developing new code, and putting the code back to original repo via a pull request

Investigate Apache Tinkerpop and write GraphSAGE input preparation in gremlin-python

Description

Investigate Gremlin's viability to efficiently prepare inputs for GraphSAGE from a graph database as well as from local memory.

User Story

As a: data scientist
I want: to use gremlin to prepare inputs for my graph ML tasks.
so that: I can efficiently prepare batch inputs from various graph data sources.

Done Checklist (Development)

Code well commented
Documentation
Peer Code Review Performed
Mini-meetup talk

Build inductive NAI with GCN

Steps:
Given a full graph G:

1. Randomly select a test set of nodes {V}_test and remove them from G, resulting in G_train = G - {V}_test.
2. Evaluate \hat{A} = \hat{A}_train and X_train from G_train.
3. Train GCN on G_train (feeding \hat{A}_train, X_train) and save the trained model.
4. Evaluate \hat{A}, X for the full graph G, ensuring that the order of nodes in the intersection of G and G_train is preserved. I.e., update \hat{A}_train, X_train with test nodes to obtain the full graph's \hat{A}, X.
5. Do a forward pass of the updated \hat{A}, X through the trained GCN model, predicting attributes for test nodes.
6. Evaluate predictions by comparing them with true test node attributes.

Repeat steps 1-6 to obtain average prediction metrics.
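
The adjacency bookkeeping in the steps above can be sketched as follows. This is a minimal illustration that assumes \hat{A} is the renormalised adjacency D^{-1/2} (A + I) D^{-1/2} used by GCN, and that preserves node order by listing training nodes first and appending test nodes after them. The function names are hypothetical, and no model training is shown.

```python
import math


def normalized_adjacency(nodes, edges):
    """Dense D^{-1/2} (A + I) D^{-1/2} for the subgraph induced by `nodes`.

    Rows and columns follow the order of `nodes`; edges touching a node
    outside `nodes` are ignored.
    """
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    a = [[0.0] * size for _ in range(size)]
    for i in range(size):
        a[i][i] = 1.0  # self-loops (the "+ I" term)
    for u, v in edges:
        if u in index and v in index:
            a[index[u]][index[v]] = 1.0
            a[index[v]][index[u]] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(size)]
            for i in range(size)]


def inductive_split(nodes, edges, test_nodes):
    """Build the train and full-graph matrices with a consistent node order."""
    test = set(test_nodes)
    train_order = [n for n in nodes if n not in test]
    # Test nodes are appended, so training nodes keep their row indices.
    full_order = train_order + [n for n in nodes if n in test]
    a_train = normalized_adjacency(train_order, edges)
    a_full = normalized_adjacency(full_order, edges)
    return train_order, full_order, a_train, a_full


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
train_order, full_order, a_train, a_full = inductive_split(nodes, edges, ["b"])
print(train_order)  # ['a', 'c', 'd']
print(full_order)   # ['a', 'c', 'd', 'b']
```

The feature matrix X would be reordered in the same way, so that row i of X always describes the node in row i of \hat{A}.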

Investigate feature alignment for link prediction

Description

Investigate whether link features obtained from G_train and G_test are aligned, and whether/how this affects performance of the link prediction classifier.

We need to remove the Confluence docs; it would be good to retrieve the link prediction code from there before doing so.

Done Checklist (Research)

Experimental code/visualisations in 'alignment' branch in stellar-ml-sandbox/link-prediction
Documentation on Google Docs

Start moving code from link-prediction/utils to stellar ML library

Description

Some of the code in link-prediction/utils is mature enough to be integrated into the stellar ML library.

User Story

As a: Research Engineer
I want: to transfer mature code from demo code into the stellar graph ML library
so that: it can be re-used by other IA members and properly unit tested with CI.

Done Checklist (Development)

Produced code for required functionality
Branch and Pull Request build on CI
Branch and Pull Request pass unit tests on CI
Peer Code Review Performed
Code well commented

Organise reference datasets

Description

Organise the reference datasets with a readme.

User Story

As an: IA team member
I want: to have easy access to well defined data sets
So that: I can test my code and minimise dataset confusion

Done Checklist (Bug)

Documented dataset procedure
Document

Graph splitting based on edge type to predict

Description

The data splitter for link prediction should be able to split the graph based on the type of the edge to predict. It should also be able to split based on an edge property. For example, we should be able to split based on timestamps if edges have them as a property.
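
A minimal sketch of such a split is shown below. It assumes edges are given as (source, target, attributes) triples with a "type" attribute and an optional "timestamp" attribute. The function name `split_edges`, the attribute keys, and the rule that edges at or after the threshold are held out are all assumptions for illustration.

```python
def split_edges(edges, edge_type, time_threshold=None, time_key="timestamp"):
    """Split edges into (kept, held_out) by edge type and optional timestamp.

    An edge is held out if its type equals `edge_type` and, when
    `time_threshold` is given, its timestamp is at or after the threshold.
    Edges of the target type without a timestamp are kept in that case.
    """
    kept, held_out = [], []
    for u, v, attrs in edges:
        is_target = attrs.get("type") == edge_type
        if is_target and time_threshold is not None:
            is_target = attrs.get(time_key, float("-inf")) >= time_threshold
        (held_out if is_target else kept).append((u, v, attrs))
    return kept, held_out


edges = [
    ("u1", "m1", {"type": "rates", "timestamp": 10}),
    ("u1", "m2", {"type": "rates", "timestamp": 20}),
    ("u1", "u2", {"type": "friend", "timestamp": 30}),
]
kept, held_out = split_edges(edges, "rates", time_threshold=15)
print([(u, v) for u, v, _ in held_out])  # [('u1', 'm2')]
```

The held-out edges would serve as positive test examples, while the kept edges form the graph used for training.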

User Story

As a: Research Engineer
I want: to prepare my data
so that: I can perform link prediction on HINs based on edge types and properties

Done Checklist (Research)

Implement data splitting based on edge type and/or edge property, e.g., timestamp
Pull request
Unit tests

Organize external engagements with Jia and Jesse

Description

To engage successfully with the research groups led by Jia and Jesse, we need to map out the research interests of both groups and match them with research questions of relevance to us.

User Story

As a: researcher collaborating with the Stellar project
I want: to research graph technologies that are of interest to Stellar
so that: we can get publications for our research and support from Stellar.

Done Checklist (Research)

Documentation of ongoing engagements on Google Docs
Schedule of meetings with Jesse and Jia
Outline of scope of research.
AC review

Sketch a principled way to generalise GraphSAGE to heterogeneous graphs

Description

As is, GraphSAGE works for homogeneous graphs. We need to sketch a principled way to generalise it to HINs, similar to DECAGON, ideally by generalising the existing GraphSAGE code.

Checklist

Documentation in ML Team Google Docs

Done

Setup Travis for stellar-ml

Description

Set up continuous integration for automated tests in the library. Create a buildkite.yml file.

Done Checklist

Triggered commits for BuildKite
Writing to build-bots

Graphsage demo in StellarML Library

Description

Currently StellarML Library has some base classes. We have working Graphsage code from Kevin. We should implement a demo using the classes from StellarML, moving code to this style as required.

Done Checklist (Development)

Done Checklist (Research)

Done Checklist (Bug)

Bug fixed
Branch and Pull Request build on CI
Branch and Pull Request pass unit tests on CI
Branch and Pull Request pass integration tests on CI
Version number reflects new status
Peer Code Review Performed
Code well commented

Move link prediction demo from stellar-ml-sandbox to stellar-ml repo

Description

We need to move the demo for link prediction from the stellar-ml-sandbox repo to here.

User Story

As a: Research Engineer
I want: to have all my code relating to graph ML in one place
so that: I can more effectively develop the graph-ml library and share code with my team

Done Checklist (Development)

Moved code from stellar-ml-sandbox to stellar-ml repo
Pull request