stellargraph / stellargraph
StellarGraph - Machine Learning on Graphs
Home Page: https://stellargraph.readthedocs.io/
License: Apache License 2.0
Kevin and Yuriy have implemented a HIN GraphSAGE algorithm. I'd like to understand the implementation.
There are other heterogeneous GCN-like algorithms in the literature; read and understand them. How do they compare? Which algorithms could we implement for the ML library? Can we obtain code and test them on different problems? What input sampling strategies are required for each algorithm? How do training and prediction differ?
Currently, the link prediction demo uses fixed parameter values, e.g., p=q=1 and several other parameters, for node2vec. We need to allow for these parameters to be tuned for improved link prediction performance.
As a: Research Engineer
I want: to tune the hyper-parameters of the node2vec algorithm
so that: I can achieve the highest performance in link prediction
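A minimal sketch of how such tuning could be organised, assuming a hypothetical `evaluate` callable that trains node2vec with the given parameters and returns a validation score for link prediction (higher is better). The function name, the grid values, and the `walk_length` parameter are illustrative assumptions, not part of the existing demo.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every parameter combination in `grid` and return the best one.

    `evaluate` is a hypothetical callable: it accepts keyword arguments
    (e.g. p, q, walk_length) and returns a validation score, higher is better.
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Illustrative grid only; these values are assumptions, not recommended settings.
node2vec_grid = {
    "p": [0.25, 0.5, 1.0, 2.0, 4.0],
    "q": [0.25, 0.5, 1.0, 2.0, 4.0],
    "walk_length": [40, 80],
}
```

The score should come from a validation split that is kept separate from the final test edges, otherwise the tuned parameters would overstate link prediction performance.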
Currently the GraphSAGE code from Kevin runs well, but it requires a Redis database and is slow for simple single-computer testing.
As part of the StellarML library we want to pass data into TensorFlow fast. Having a simple in-memory GraphSAGE sampler would be a good start.
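A rough sketch of what an in-memory sampler might look like, assuming the graph fits in a plain adjacency dictionary. The function name, the padding rule for isolated nodes, and the layered output layout are assumptions made for illustration; this is not Kevin's implementation.

```python
import random


def sample_neighbourhood(adj, head_nodes, num_samples, rng=None):
    """Sample a fixed-size multi-hop neighbourhood for each head node.

    adj: dict mapping node -> list of neighbour nodes, held in memory.
    num_samples: number of neighbours to draw per hop, e.g. [10, 5].
    Returns a list of layers: layer 0 is the head nodes, and layer k holds
    len(layer k-1) * num_samples[k-1] sampled nodes.
    """
    rng = rng or random.Random()
    layers = [list(head_nodes)]
    for n in num_samples:
        next_layer = []
        for node in layers[-1]:
            neighbours = adj.get(node, [])
            if neighbours:
                # Sample with replacement so every node yields exactly n samples.
                next_layer.extend(rng.choices(neighbours, k=n))
            else:
                # Isolated node: pad with the node itself (an assumption of this sketch).
                next_layer.extend([node] * n)
        layers.append(next_layer)
    return layers
```

Because every layer has a fixed size, the sampled node lists can be turned into dense feature arrays of known shape before being fed to the model.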
Use the documentation for HinSAGE link prediction to create a working link prediction example using the Paradise Papers dataset from the Data team.
As a: data scientist
I want: to use GraphSAGE layers for link prediction
so that: I can run scalable link prediction
Present initial analysis of scalable architecture for GraphSAGE, its bottlenecks, and possible ways to fix them.
https://drive.google.com/open?id=1jdo6ZvNZscTaMj6jiQLGyjhuF1n-c8uTKY4TtT0uxbE
We need to get started with an interesting demonstration of machine learning on graphs for the YowData! conference in mid-May.
First, we need to decide on a dataset and problem.
Adapt the initial library skeleton to include Kevin's initial GraphSAGE workflow
Sampling negative edges for link prediction using the nodes' local neighbourhood structure currently uses BFS, which runs very slowly if target nodes more than 5 edges away need to be sampled. This issue is about replacing BFS with DFS to speed up the sampling algorithm.
As a: Research Engineer
I want: to run link experiments as fast as possible
so that: I maximise my efficiency.
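One possible shape for a DFS-based sampler is sketched below. It assumes an undirected graph stored as a dict of neighbour sets, and the function name is hypothetical. Note that the depth of a node on a depth-first path is only an upper bound on its shortest-path distance from the source, so this trades the exact distances that BFS provides for speed.

```python
import random


def dfs_negative_sample(adj, source, depth, rng=None):
    """Return a node reached by a random depth-first path of `depth` edges
    from `source` that is not directly connected to `source`, or None.

    adj: dict mapping node -> set of neighbours (undirected graph).
    """
    if depth < 2:
        raise ValueError("depth must be at least 2 to reach a non-neighbour")
    rng = rng or random.Random()
    # Each stack entry is (node, number of edges on the path from source).
    stack = [(source, 0)]
    visited = {source}
    while stack:
        node, d = stack.pop()
        if d == depth:
            if node not in adj[source]:
                return node
            continue
        neighbours = [n for n in adj[node] if n not in visited]
        rng.shuffle(neighbours)  # randomise which branch is explored first
        for n in neighbours:
            visited.add(n)
            stack.append((n, d + 1))
    return None
```

Unlike BFS, the search stops as soon as one path of the requested length is found, rather than expanding every node at each intermediate distance.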
For the YowData conference, I'd like to present a recommender example. The Netflix prize dataset is well known, and a large amount of effort has been spent on getting results on this dataset. Good performance on this dataset would be impressive.
Recommender systems are often not thought about in terms of graphs. Therefore, posing this in a graph framework and solving it would be interesting. We can start by using node2vec to extract node embeddings and trying to predict the scores from this.
Hooman is performing thorough experiments on how his dynamic random walk methods perform as part of an end-to-end dynamic node2vec algorithm. There are difficulties with how the skip-gram model interacts with random walk updates. To get a good publication we need a description of the skip-gram model and some explanation of how different training update schemes will affect the model.
Create initial dummy library using the documentation and pseudo code already accumulated
The link prediction demo works on homogeneous datasets. We want to test whether it also works on heterogeneous datasets under the assumption that the latter will be treated as homogeneous.
As a: Research Engineer
I want: to make sure that the link prediction demo works for both homogeneous and heterogeneous networks
so that: I can tackle more general analytics problems.
Review the pull request #12
Build a scalable implementation of node attribute inference (NAI) for graphs, that works for at least 10M node graphs.
Besides satisfying stakeholders' requirements for scalable attribute inference tasks on large graph datasets (thus expanding the NAI capability of Release 1), this should allow us to find an optimal scalable architecture for other ML tasks on graphs, such as link prediction and classification, recommendations, etc.
Since risknet is similar in size to the Cora dataset, check whether the small-graph version of GraphSAGE works with risknet.
I want to understand the Metapath2Vec algorithm for representation learning in heterogeneous graphs.
As a: Research Engineer
I want: to understand the MetaPath2Vec algorithm for representation learning on heterogeneous graphs
so that: I can use it for node attribute inference and link prediction.
Implement the wrappers / additional layers for unsupervised learning around the HinSAGE demo code. Create a working example using the risk net dataset.
As a: data scientist
I want: to run unsupervised learning using GraphSAGE
so that: I can transform my large dataset into node embeddings
Currently the GraphSAGE unsupervised method is not in the library. This task is to add a simple unsupervised GraphSAGE module to the stellar-ml library.
As a: Data Scientist
I want: everything that GraphSAGE offers
so that: I have freedom for my unsupervised method experiments
I'll be presenting at YOWData! on the 15th (at 5pm) so I need to prepare some slides!
Can we borrow some design choices, e.g.: base classes & inheritance, layer compositions, pipelining
Link: https://github.com/data61/aboleth
Use existing code to build up a link prediction demo for homogeneous graphs
The node splitter developed for the link prediction demo of issue #8 needs to be improved such that negative samples are more challenging, i.e., should not be randomly selected out of all pairs of disconnected nodes but rather of disconnected nodes that are nearby in the graph.
As a: Research Engineer
I want: to use my data to correctly evaluate my link prediction algorithm
so that: I am confident about its performance on unseen data.
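A small sketch of how harder negatives could be drawn, assuming an undirected graph held as a dict of neighbour collections. The function names and the default of 3 hops are assumptions for illustration: candidates are restricted to nodes between 2 and `max_hops` edges from the source instead of all disconnected pairs.

```python
import random
from collections import deque


def candidates_within(adj, source, max_hops):
    """Nodes whose shortest-path distance from `source` is between 2 and max_hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for n in adj[node]:
            if n not in dist:
                dist[n] = dist[node] + 1
                queue.append(n)
    # Distance 0 is the source and distance 1 are true edges, so skip both.
    return sorted(n for n, d in dist.items() if d >= 2)


def sample_nearby_negative(adj, source, max_hops=3, rng=None):
    """Return a (source, target) pair that is not an edge but is nearby, or None."""
    rng = rng or random.Random()
    candidates = candidates_within(adj, source, max_hops)
    return (source, rng.choice(candidates)) if candidates else None
```

Smaller values of `max_hops` give negatives that are structurally closer to real edges, which makes the evaluation harder and usually more informative.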
As a: developer of the library
I want: to see a clear structure of base classes to inherit from, their methods, and examples of composing workflows from them.
The movielens recommender demo developed for YOWData could be useful for other problems (Anna would like to try it out to see if it will work for the medicare dataset).
Currently the code is rough and ready, so I'd like to tidy it up, add documentation and have a quick-to-run test case (say on movielens 100k).
Add equations (similar to those in Algorithm 1 or 2 in GraphSAGE paper) for embedding step (aggregator + dense layer) applied to HINs
Node2vec can be implemented in a message-passing framework. However, this is strictly only true for prediction. Can we also place training in a message passing framework?
As a: developer of the graphml library
I want: to train and predict using node2vec in a message-passing framework
so that: I can train node2vec in a one-step scalable fashion.
Kevin has written code for HIN GraphSage, I'd like to use this to make predictions on the Movielens 1M dataset with the same train/test split as other examples and using intrinsic user/movie features.
We need unit tests for the link prediction utility classes.
As a: Research Engineer
I want: to make sure that changes to the link prediction code are not breaking existing functionality
so that: I can be certain that my code works correctly as it is expanded and improved.
Prepare for the Spotify hackathon to allow everyone to use GraphSAGE/HinSAGE with ease on the day. Investigate the dataset and prepare notes on any requirements such as AWS setup, input batch preparation code, etc.
As a: Hackathoner
I want: to run Stellar's graph ML algorithms during the Hackathon
so that: we can win the Spotify competition
The current implementation of the link prediction demo assumes that node IDs are integers. This is a restrictive assumption because for some datasets the node IDs are not integers. This causes the link prediction demo to fail with an Exception. We need to generalise the code so that it handles non-integer node IDs.
As a: Research Engineer
I want: to perform link prediction on a variety of network datasets stored in valid EPGM format
so that: I can be certain of the link prediction algorithm's generalisation
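One way to lift the integer-ID assumption is to map arbitrary node IDs to contiguous integer indices at load time and map back when reporting results. The helper below is a hypothetical sketch of that idea, not the demo's actual code.

```python
def build_index(node_ids):
    """Map arbitrary hashable node IDs to contiguous integer indices.

    Returns (id_to_index, index_to_id); duplicates keep their first index.
    """
    id_to_index = {}
    index_to_id = []
    for node_id in node_ids:
        if node_id not in id_to_index:
            id_to_index[node_id] = len(index_to_id)
            index_to_id.append(node_id)
    return id_to_index, index_to_id
```

Code that needs integers (for example, rows of an embedding matrix) can then use `id_to_index`, while `index_to_id` recovers the original identifiers for output.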
The platform team is building an experimental stack. We need to ensure that this meets the needs of the ML and Data teams.
As a: IA ML dev
I want: to ensure that I'm building in sync with the platform team
so that: there is no wasted effort
Create a machine learning library in Python that is simple to use and simple to contribute to. The library should focus on deep learning algorithms on graphs, and not attempt to duplicate existing algorithms, e.g. community detection, random forests, etc.
This library will allow Data Scientists and Researchers to create models over network datasets with minimal overhead. The goal is to allow a fast experiment cycle time, with minimal assumed knowledge. For Researchers, it should be a place to add new algorithms, get their algorithms seen, and supply functions for building new deep learning models on graphs.
I want to present an example at the YowData conference. The spammers dataset is an interesting case for applying graph ML. I want to prepare the spammers dataset and run node attribute inference on it.
Note:
Anna has done some investigation into using GraphSage and node2vec, so I will find out what has been done so far.
Apple has open-sourced Turi, which is a powerful graph processing framework. We should evaluate this tech with the following criteria:
This task would be to ingest a > 1M edge dataset and perform a set of graph tasks e.g. BFS/DFS, graph traversal, grabbing neighbours, random sampling.
As a: data scientist
I want: the graph processing part of the library to be fast and have lots of functionality
so that: I can move on to my tensorflow part to build my model
Currently, our graph processing module requires an earlier version of networkx, e.g., 1.x. Newer versions of networkx, namely 2.x, have changed how nodes and edges are returned to the user. We need to update our code to work with the newer version of networkx because it is becoming more common and the mismatch often causes problems.
As a: Research Engineer
I want: my network analytics library to work with the latest version of python modules
so that: I can make use of the latest developments and improvements in these modules
Use example git repositories to understand the git and github workflow.
As a: Research engineer
I want: to understand git and github workflows
so that: I can work with the rest of the team to develop the ML library.
Investigate Gremlin's viability to efficiently prepare inputs for GraphSAGE from a graph database as well as from local memory.
As a: data scientist
I want: to use gremlin to prepare inputs for my graph ML tasks.
so that: I can efficiently prepare batch inputs from various graph data sources.
Steps:
Given a full graph G:
Repeat steps 1-5 to obtain average prediction metrics.
Investigate whether link features obtained from G_train and G_test are aligned, and whether/how this affects performance of the link prediction classifier.
We need to remove the Confluence docs; it would be good to retrieve the link prediction code from there first.
Some of the code in link-prediction/utils is mature enough to be integrated into the stellar ML library.
As a: Research Engineer
I want: to transfer mature code from demo code into the stellar graph ML library
so that: it can be re-used by other IA members and properly unit tested with CI.
Organise the reference datasets with a readme.
As an: IA team member
I want: to have easy access to well defined data sets
So that: I can test my code and minimise dataset confusion
The data splitter for link prediction should be able to split the graph based on the type of the edge to predict. It should also be able to split based on an edge property. For example, we should be able to split based on timestamps if edges have them as a property.
As a: Research Engineer
I want: to prepare my data
so that: I can perform link prediction on HINs based on edge types and properties
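A minimal sketch of such a splitter, assuming edges are available as `(source, target, attributes)` tuples. The attribute keys `"label"` and `"time"` and the function name are assumptions chosen for illustration; the real data format may use different names.

```python
def split_edges_by_time(edges, edge_type, threshold,
                        type_key="label", time_key="time"):
    """Split edges into (train, test) by edge type and timestamp.

    edges: iterable of (u, v, attrs) tuples where attrs is a dict.
    Edges of `edge_type` whose timestamp is >= threshold go to the test set;
    every other edge (other types, or no timestamp) stays in the training set.
    """
    train, test = [], []
    for u, v, attrs in edges:
        is_target_type = attrs.get(type_key) == edge_type
        has_time = time_key in attrs
        if is_target_type and has_time and attrs[time_key] >= threshold:
            test.append((u, v, attrs))
        else:
            train.append((u, v, attrs))
    return train, test
```

Splitting on time in this way keeps later edges out of training, so the evaluation mimics predicting future links from past ones.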
To engage successfully with the research groups led by Jia and Jesse, we need to map out the research interests of both groups and match them with research questions of relevance to us.
As a: researcher collaborating with the Stellar project
I want: to research graph technologies that are of interest to Stellar
so that: we can get publications for our research and support from Stellar.
As is, GraphSAGE works for homogeneous graphs. We need to sketch a principled way to generalise it to HINs, similar to DECAGON, ideally by generalising the existing GraphSAGE code.
Set up continuous integration for automated tests in the library. Create a buildkite.yml file.
Currently the StellarML library has some base classes. We have working GraphSAGE code from Kevin. We should implement a demo using the classes from StellarML, moving code to this style as required.
We need to move the demo for link prediction from the stellar-ml-sandbox repo to here.
As a: Research Engineer
I want: to have all my code relating to graph ML in one place
so that: I can more effectively develop the graph-ml library and share code with my team