alibaba / graphscope

🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba | 一站式图计算系统 (a one-stop graph computing system)

Home Page: https://graphscope.io

License: Apache License 2.0

Shell 1.45% Makefile 0.17% CMake 0.56% C++ 27.81% C 0.38% Python 20.86% Java 21.87% Rust 24.49% Dockerfile 0.27% Jupyter Notebook 1.58% JavaScript 0.03% CSS 0.01% TypeScript 0.04% Smarty 0.15% ANTLR 0.20% Scala 0.01% Cypher 0.14%
graph graph-computation graph-neural-networks gremlin graph-analytics graph-data graph-computing analytics big-data data-science

graphscope's Introduction

graphscope-logo

A One-Stop Large-Scale Graph Computing System from Alibaba

GraphScope CI Coverage Playground Open in Colab Artifact HUB Docs-en FAQ-en Docs-zh FAQ-zh README-zh ACM DL

🎉 See our ongoing GraphScope Flex: a LEGO-inspired, modular, and user-friendly GraphScope evolution. 🎉

GraphScope is a unified distributed graph computing platform that provides a one-stop environment for performing diverse graph operations on a cluster of computers through a user-friendly Python interface. GraphScope makes multi-staged processing of large-scale graph data on compute clusters simple by combining several important pieces of Alibaba technology: GRAPE, MaxGraph, and Graph-Learn (GL) for analytics, interactive, and graph neural network (GNN) computation, respectively, and the Vineyard store, which offers efficient in-memory data transfers.

Visit our website at graphscope.io to learn more.

Latest News

Table of Contents

Getting Started

We provide a Playground with a managed JupyterLab. Try GraphScope straight away in your browser!

GraphScope supports running in standalone mode or on clusters managed by Kubernetes within containers. To get started quickly, let's begin with the standalone mode.

Installation for Standalone Mode

The pre-compiled GraphScope package is distributed as a Python package and can be easily installed with pip.

pip3 install graphscope

Note that graphscope requires Python >= 3.8 and pip >= 19.3. The package is built and tested on popular Linux distributions (Ubuntu 20.04+ / CentOS 7+) and on macOS 12+ (Intel and Apple silicon). Windows users may want to install Ubuntu on WSL2 to use this package.
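
To verify the installation, a minimal check like the following should work (assuming the package exposes the usual __version__ attribute):

import graphscope
print(graphscope.__version__)  # e.g., prints the installed release number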

Next, we will walk you through a concrete example to illustrate how GraphScope can be used by data scientists to effectively analyze large graphs.

Demo: Node Classification on Citation Network

ogbn-mag is a heterogeneous network composed of a subset of the Microsoft Academic Graph. It contains four types of entities (papers, authors, institutions, and fields of study), as well as four types of directed relations connecting pairs of entities.

Given the heterogeneous ogbn-mag data, the task is to predict the class of each paper. Node classification can identify papers in multiple venues, which represent different groups of scientific work on different topics. We apply both the attribute and structural information to classify papers. In the graph, each paper node contains a 128-dimensional word2vec vector representing its content, which is obtained by averaging the embeddings of words in its title and abstract. The embeddings of individual words are pre-trained. The structural information is computed on-the-fly.

Loading a graph

GraphScope models graph data as a property graph, in which vertices and edges are labeled and can carry properties. Taking ogbn-mag as an example, the figure below shows its property-graph model.

sample-of-property-graph

This graph has four kinds of vertices, labeled paper, author, institution and field_of_study. Four kinds of edges connect them; each kind of edge has a label and specifies the vertex labels of its two endpoints. For example, cites edges connect two paper vertices, while writes edges require an author source vertex and a paper destination vertex. All vertices and edges may carry properties, e.g., paper vertices have properties such as features, publication year and subject label.

To load this graph into GraphScope with our dataset retrieval module, use the following code:

import graphscope
from graphscope.dataset import load_ogbn_mag

g = load_ogbn_mag()
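
Before querying, you can sanity-check what was loaded. A minimal sketch, assuming the Graph object exposes a schema property (as suggested by the repr()-related issue further below):

# print a summary of the loaded graph and its vertex/edge label schema
print(g)
print(g.schema)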

We provide a set of functions to load graph datasets from ogb and snap for convenience. Please find all the available graphs here. If you want to use your own graph data, please refer to this doc on loading vertices and edges by labels.
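
For instance, building a graph from your own files might look roughly like the sketch below; the file paths and labels are hypothetical, and the add_vertices/add_edges builder API is assumed from the loading doc referenced above:

import graphscope

# hypothetical CSV files, one per vertex/edge label
g = graphscope.g()
g = g.add_vertices("/path/to/paper.csv", label="paper")
g = g.add_vertices("/path/to/author.csv", label="author")
g = g.add_edges("/path/to/writes.csv", label="writes",
                src_label="author", dst_label="paper")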

Interactive query

Interactive queries allow users to directly explore, examine, and present graph data in an exploratory manner in order to locate specific or in-depth information in time. GraphScope adopts a high-level language called Gremlin for graph traversal, and provides efficient execution at scale.

In this example, we use graph traversal to count the number of papers two given authors have co-authored. To simplify the query, we assume the authors can be uniquely identified by ID 2 and 4307, respectively.

# get the endpoint for submitting Gremlin queries on graph g.
interactive = graphscope.gremlin(g)

# count the number of papers two authors (with id 2 and 4307) have co-authored
papers = interactive.execute("g.V().has('author', 'id', 2).out('writes').where(__.in('writes').has('id', 4307)).count()").one()
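
As a hypothetical follow-up using the same endpoint, you could also count all papers written by a single author:

# count all papers written by the author with id 2
total = interactive.execute("g.V().has('author', 'id', 2).out('writes').count()").one()
print(papers, total)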

Graph analytics

Graph analytics is widely used in the real world. Many algorithms, such as community detection, paths and connectivity, and centrality, have proven to be very useful in various businesses. GraphScope ships with a set of built-in algorithms, enabling users to easily analyze their graph data.

Continuing our example, below we first derive a subgraph by extracting publications within a given time range from the entire graph (using Gremlin!), and then run k-core decomposition and triangle counting to generate structural features for each paper node.

Please note that many algorithms only work on homogeneous graphs; therefore, to evaluate them over a property graph, we first project it into a simple graph.

# extract a subgraph of publication within a time range
sub_graph = interactive.subgraph("g.V().has('year', gte(2014).and(lte(2020))).outE('cites')")

# project the property graph to a simple graph.
simple_g = sub_graph.project(vertices={"paper": []}, edges={"cites": []})

ret1 = graphscope.k_core(simple_g, k=5)
ret2 = graphscope.triangles(simple_g)

# add the results as new columns to the citation graph
sub_graph = sub_graph.add_column(ret1, {"kcore": "r"})
sub_graph = sub_graph.add_column(ret2, {"tc": "r"})

In addition, users can write their own algorithms in GraphScope. Currently, GraphScope supports user-defined algorithms in both the Pregel model and the PIE model.
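
As a hedged sketch of what a Pregel-model algorithm can look like (adapted from the SSSP example in the GraphScope documentation; the decorator, callback names and edge-weight property index below are assumptions if your version differs):

from graphscope.analytical.udf.decorators import pregel
from graphscope.framework.app import AppAssets

@pregel(vd_type="double", md_type="double")
class SSSP_Pregel(AppAssets):
    @staticmethod
    def Init(v, context):
        # start every vertex with an "infinite" distance
        v.set_value(1000000000.0)

    @staticmethod
    def Compute(messages, v, context):
        # relax the current distance with the smallest incoming message
        src_id = context.get_config(b"src")
        new_dist = 0.0 if v.id() == src_id else 1000000000.0
        for message in messages:
            new_dist = min(message, new_dist)
        if new_dist < v.value():
            v.set_value(new_dist)
            for e_label_id in range(context.edge_label_num()):
                for e in v.outgoing_edges(e_label_id):
                    # the property at index 2 is assumed to hold the edge weight
                    v.send(e.vertex(), new_dist + e.get_int(2))
        v.vote_to_halt()

# run the algorithm on a projected simple graph
# (the keyword name of the source-vertex argument is an assumption)
ret = SSSP_Pregel(simple_g, src="0")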

Graph neural networks (GNNs)

Graph neural networks (GNNs) combine the strengths of graph analytics and machine learning. GNN algorithms can compress both the structural and the attribute information of a graph into low-dimensional embedding vectors for each node. These embeddings can then be fed into downstream machine learning tasks.

In our example, we train a GCN model to classify the nodes (papers) into 349 categories, each of which represents a venue (e.g., a pre-print server or a conference). To achieve this, we first launch a learning engine and build a graph with the features generated in the last step.

# define the features for learning
paper_features = [f"feat_{i}" for i in range(128)]

paper_features.append("kcore")
paper_features.append("tc")

# launch a learning engine.
lg = graphscope.graphlearn(sub_graph, nodes=[("paper", paper_features)],
                  edges=[("paper", "cites", "paper")],
                  gen_labels=[
                      # split "paper" vertices into 100 buckets:
                      # 75% for train, 10% for validation, 15% for test
                      ("train", "paper", 100, (0, 75)),
                      ("val", "paper", 100, (75, 85)),
                      ("test", "paper", 100, (85, 100))
                  ])

Then we define the training process, and run it.

# Note: we use TensorFlow as the NN backend to train the GNN model,
# so please install tensorflow first.
try:
    # https://www.tensorflow.org/guide/migrate
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()
except ImportError:
    import tensorflow as tf

import graphscope.learning
from graphscope.learning.examples import EgoGraphSAGE
from graphscope.learning.examples import EgoSAGESupervisedDataLoader
from graphscope.learning.examples.tf.trainer import LocalTrainer

# supervised GCN.
def train_gcn(graph, node_type, edge_type, class_num, features_num,
              hops_num=2, nbrs_num=[25, 10], epochs=2,
              hidden_dim=256, in_drop_rate=0.5, learning_rate=0.01,
):
    graphscope.learning.reset_default_tf_graph()

    dimensions = [features_num] + [hidden_dim] * (hops_num - 1) + [class_num]
    model = EgoGraphSAGE(dimensions, act_func=tf.nn.relu, dropout=in_drop_rate)

    # prepare train dataset
    train_data = EgoSAGESupervisedDataLoader(
        graph, graphscope.learning.Mask.TRAIN,
        node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,
    )
    train_embedding = model.forward(train_data.src_ego)
    train_labels = train_data.src_ego.src.labels
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=train_labels, logits=train_embedding,
        )
    )
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

    # prepare test dataset
    test_data = EgoSAGESupervisedDataLoader(
        graph, graphscope.learning.Mask.TEST,
        node_type=node_type, edge_type=edge_type, nbrs_num=nbrs_num, hops_num=hops_num,
    )
    test_embedding = model.forward(test_data.src_ego)
    test_labels = test_data.src_ego.src.labels
    test_indices = tf.math.argmax(test_embedding, 1, output_type=tf.int32)
    test_acc = tf.div(
        tf.reduce_sum(tf.cast(tf.math.equal(test_indices, test_labels), tf.float32)),
        tf.cast(tf.shape(test_labels)[0], tf.float32),
    )

    # train and test
    trainer = LocalTrainer()
    trainer.train(train_data.iterator, loss, optimizer, epochs=epochs)
    trainer.test(test_data.iterator, test_acc)

train_gcn(lg, node_type="paper", edge_type="cites",
          class_num=349,  # output dimension
          features_num=130,  # input dimension, 128 + kcore + triangle count
)

A Python script with the entire process is available here; you may try it out by yourself.

Processing Large Graph on Kubernetes Cluster

GraphScope is designed for processing large graphs, which are usually hard to fit in the memory of a single machine. With Vineyard as the distributed in-memory data manager, GraphScope supports running on a cluster managed by Kubernetes (k8s).

To continue this tutorial, please ensure that you have a k8s-managed cluster and know the credentials for it (e.g., the address of the k8s API server, usually stored in the ~/.kube/config file).

Alternatively, you can set up a local k8s cluster for testing with Kind. You can install and deploy Kind by following its Quick Start guide.

If you did not install the graphscope package in the above step, you can install a subset of the whole package with client functions only.

pip3 install graphscope-client

Next, let's revisit the example by running on a cluster instead.

how-it-works

The figure shows the flow of execution in the cluster mode. When users run code in the python client, it will:

  • Step 1. Create a session or workspace in GraphScope.
  • Step 2 - Step 5. Load a graph, then query, analyze and run learning tasks on it via the Python interface. These steps are the same as in local mode, so users can process huge graphs in a distributed setting just like analyzing a small graph on a single machine. (Note that graphscope.gremlin and graphscope.graphlearn need to be changed to sess.gremlin and sess.graphlearn, respectively, where sess is the name of the Session instance the user created; see the sketch after this list.)
  • Step 6. Close the session.
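
For example, the queries and learning tasks from the standalone walkthrough would be issued through the session roughly as follows (same arguments as before; only the entry points change):

# in cluster mode, engines are obtained from the session instead of the module
interactive = sess.gremlin(g)
lg = sess.graphlearn(sub_graph, nodes=[("paper", paper_features)],
                     edges=[("paper", "cites", "paper")])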

Creating a session

To use GraphScope in a distributed setting, we need to establish a session in a Python interpreter.

For convenience, we provide several demo datasets and an option with_dataset to mount them in the GraphScope cluster. The datasets will be mounted at /dataset in the pods. If you want to use your own data on a k8s cluster, please refer to this.

import graphscope

sess = graphscope.session(with_dataset=True)

On macOS, the session needs to be established with the LoadBalancer service type (the default is NodePort).

import graphscope

sess = graphscope.session(with_dataset=True, k8s_service_type="LoadBalancer")

A session launches a coordinator, which is the entry point to the back-end engines. The coordinator manages a cluster of resources (k8s pods), and the interactive/analytical/learning engines run on those pods. For each pod in the cluster, there is a vineyard instance serving the distributed in-memory data.
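
You can inspect the session object to check that the cluster is up; its printed summary includes the status, the engine pods and the vineyard endpoints (see the session-status issue further below for a sample):

# prints a dict-like summary with 'status', 'engine_hosts',
# 'num_workers' and the vineyard endpoints
print(sess)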

Loading a graph and processing computation tasks

Similar to the standalone mode, we can still use the functions to load a graph easily.

from graphscope.dataset import load_ogbn_mag

# Note that we have mounted the demo datasets at /dataset.
# There are several datasets there, including ogbn_mag_small;
# users can attach to the engine container and explore the directory.
g = load_ogbn_mag(sess, "/dataset/ogbn_mag_small/")

Here, the graph g is loaded in parallel via vineyard and stored in the vineyard instances of the cluster managed by the session.

Next, we can conduct graph queries with Gremlin, invoke various graph algorithms, or run graph-based neural network tasks like we did in the standalone mode. We do not repeat code here, but a .ipynb processing the classification task on k8s is available on the Playground.

Closing the session

An additional step in the distributed mode is closing the session. We close the session after processing all graph tasks.

sess.close()

This operation notifies the backend engines and vineyard to safely unload the graphs and their applications. Then, the coordinator releases all the allocated resources in the k8s cluster.

Please note that we have not hardened this release for production use and it lacks important security features such as authentication and encryption, and therefore it is NOT recommended for production use (yet)!

Development

Building on local

To build the GraphScope Python package and the engine binaries, you need to install some dependencies and build tools.

python3 gsctl.py install-deps dev

# With argument --cn to speed up the download if you are in China.
python3 gsctl.py install-deps dev --cn

Then you can build GraphScope with pre-configured make commands.

# build the whole GraphScope package, including the Python package and the engine binaries.
sudo make install

# or make the engine components
# make interactive
# make analytical
# make learning

Building Docker images

GraphScope ships with a Dockerfile that can build Docker images for releases. The images are built in a builder image with all dependencies installed, and then copied into a runtime-base image. To build images with the latest version of GraphScope, go to the k8s/internal directory under the root directory and run this command.

# by default, the built image is tagged as graphscope/graphscope:SHORTSHA
# cd k8s
make graphscope

Building client library

The GraphScope Python interface is separate from the engine images. If you are developing the Python client and not modifying the protobuf files, the engine images don't need to be rebuilt.

You may want to re-install the Python client locally.

make client

Note that the learning engine client has C/C++ extension modules, and setting up the build environment is a bit tedious. By default, the locally-built client library doesn't include support for the learning engine. If you want to build the client library with the learning engine enabled, please refer to Build Python Wheels.

Testing

To verify the correctness of your developed features, your code changes should pass our tests.

You may run the whole test suite with the following command:

make test

Documentation

Documentation can be generated using Sphinx. Users can build the documentation using:

# build the docs
make graphscope-docs

# to open preview on local
open docs/_build/latest/html/index.html

The latest version of online documentation can be found at https://graphscope.io/docs

License

GraphScope is released under Apache License 2.0. Please note that third-party libraries may not have the same license as GraphScope.

Publications

  • Wenfei Fan, Tao He, Longbin Lai, Xue Li, Yong Li, Zhao Li, Zhengping Qian, Chao Tian, Lei Wang, Jingbo Xu, Youyang Yao, Qiang Yin, Wenyuan Yu, Jingren Zhou, Diwen Zhu, Rong Zhu. GraphScope: A Unified Engine For Big Graph Processing. The 47th International Conference on Very Large Data Bases (VLDB), industry, 2021.
  • Jingbo Xu, Zhanning Bai, Wenfei Fan, Longbin Lai, Xue Li, Zhao Li, Zhengping Qian, Lei Wang, Yanyan Wang, Wenyuan Yu, Jingren Zhou. GraphScope: A One-Stop Large Graph Processing System. The 47th International Conference on Very Large Data Bases (VLDB), demo, 2021

If you use this software, please cite our paper using the following metadata:

@article{fan2021graphscope,
  title={GraphScope: a unified engine for big graph processing},
  author={Fan, Wenfei and He, Tao and Lai, Longbin and Li, Xue and Li, Yong and Li, Zhao and Qian, Zhengping and Tian, Chao and Wang, Lei and Xu, Jingbo and others},
  journal={Proceedings of the VLDB Endowment},
  volume={14},
  number={12},
  pages={2879--2892},
  year={2021},
  publisher={VLDB Endowment}
}

Contributing

Any contributions you make are greatly appreciated!

  • Join in the Slack channel for discussion.
  • Please report bugs by submitting a GitHub issue.
  • Please submit contributions using pull requests.

graphscope's People

Contributors

acezen, bingqinglyu, bmmcq, bufapiqi, dashanji, dependabot[bot], doudoubobo, goldenleaves, haoxins, jsoref, lidongze0629, lisu, liulx20, lixueclaire, lnfjpt, longbinlai, luoxiaojian, meloyang05, shirly121, sighingnow, siyuan0322, songqing, tianliplus, varinic, waruto210, wenyuanyu, wuyueandrew, yecol, youyangy, zhanglei1949


graphscope's Issues

Respect the default value when missing k8s_namespace param in session

Is your feature request related to a problem? Please describe.
To support JupyterHub: when the k8s_namespace param is missing, we can get the current namespace from the k8s context (such as ~/.kube/config) rather than creating a new namespace with a random string (gs-djdjch). This also lets us limit resources for the namespace.
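
For illustration, the current namespace could be read from the active kube context roughly like this (a sketch using the official kubernetes Python client, not existing GraphScope code):

from kubernetes import config

# read the active context from ~/.kube/config and fall back to "default"
_, active_context = config.list_kube_config_contexts()
namespace = active_context["context"].get("namespace", "default")
print(namespace)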

Support running GraphScope on macOS

Is your feature request related to a problem? Please describe.

  • When running the demo on macOS with minikube/docker-desktop, an error occurs due to a known issue about exposing the address.
  • The learning feature is not yet supported on macOS.

Describe the solution you'd like

  • make it fully supported on macOS.


Improve docstring of the graphscope module

Currently, help() on the module itself does not produce very useful information. I think it is a good idea to improve the docstring there, following the example of pandas:

>>> help(graphscope)
Help on package graphscope:

NAME
    graphscope

DESCRIPTION
    # -*- coding: utf-8 -*-
    #
    # Copyright 2020 Alibaba Group Holding Limited. All Rights Reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    #

PACKAGE CONTENTS
    analytical (package)
    client (package)
    config
    dataset (package)
    deploy (package)
    experimental (package)
    framework (package)
    interactive (package)
    learning (package)
    proto (package)
    version

DATA
    Vertex = Vertex

VERSION
    0.1.1

FILE
    /opt/conda/lib/python3.8/site-packages/graphscope/__init__.py

Proxy all traffic from client in coordinator.

Is your feature request related to a problem? Please describe.
Too many services are exposed by GraphScope to the client in one instance, such as the vineyard RPC service, the Gremlin service, and the GLE training server service.

Describe the solution you'd like
It would be great if we could proxy all traffic from the client through one coordinator service.

Refs
Look forward to any helpful information.

[BUG] GraphScope can't run in the namespace with ResourceQuota limit.

Describe the bug

GraphScope can't create pods (such as the GraphManager pod and the gremlin server pod) in a namespace with a ResourceQuota limit.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-demo
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "2"
    limits.memory: 2Gi
Error from server (Forbidden): : pods "xxxx" is forbidden: failed quota: mem-cpu-demo: must specify limits.cpu,limits.memory,requests.cpu,requests.memory

Integration with mars

Is your feature request related to a problem? Please describe.

mars is a distributed tensor-based computation engine. GraphScope's intermediate data/results in dataframe format should be processable in mars, as a part of the workflow pipeline.

Improve "Getting Started" experience

  1. Improve the ./scripts/prepare_env.sh script to check compatibility.
    • Check the version of the OS (for example, WSL1 or WSL2 on Windows)
    • Check the versions of dependencies
      ...
      Tell the user in a friendly way what the problem is if requirements are not met
  2. Improve the README and docs to reflect the latest compatibility status

support running GraphScope on WSL

Is your feature request related to a problem? Please describe.

  • GraphScope is not tested on Windows WSL.

Describe the solution you'd like

  • make the whole demo runnable on WSL
  • since minikube needs systemd to start and WSL does not support systemd init, consider using kind on WSL.


Support list/tensor property types in ArrowPropertyFragment

Is your feature request related to a problem? Please describe.

Add richer types to ArrowPropertyFragment.

Describe alternatives you've considered

Marshal as a fixed-size binary type.

Additional context

For applications like GNN training, splitting a tensor into a set of arrays is bad for the user-side API as well as for performance.

[BUG] Double persist when creating the stream for building subgraph

Describe the bug

The "persist" happens in the Seal operation of GlobalPGStreamBuilder, which is unnecessary, and might be wrong.

VINEYARD_CHECK_OK(client.CreateMetaData(gstream->meta_, gstream->id_));
VINEYARD_CHECK_OK(client.Persist(gstream->id_));
return std::dynamic_pointer_cast<Object>(gstream);

Additional context

N/A

Print logs/hints about what we are doing at the different stages when launching a session

Is your feature request related to a problem? Please describe.

The logs are still not very helpful when an error occurs. Could we print something like the following (when show_log=True):

Launching coordinator...
Coordinator service is ready.
Launching graphscope engines... (or, `Waiting graphscope engines....`)
Graphscope engines are ready.

I mean, we should let the user know what we are doing or waiting for, in a detailed manner, when show_log is True.

Additional context

I just think the log in comment #3 is quite confusing. The word "ready" combined with no extra logs delivers a sense of "we are ready", but actually things are not ready, and users are confused about what they are waiting for and why it doesn't return or raise.

Jupyter Notebook extension for graph visualization

Is your feature request related to a problem? Please describe.

Graph visualization is an essential part of a graph computing platform.
As a first step, we may want a module to visualize the graph data.

Describe the solution you'd like

  • implement this feature via a Jupyter notebook extension in the location /python/jupyter
  • visualize the graph data in the result cell when the draw functions are invoked.
  • the draw APIs:
# assume the g is a Graph in GraphScope.
# draw the whole graph. (maybe cascaded if the graph is huge.)
g.draw()

# draw selected vertices and their induced subgraph
g.draw(vertices=[1,2,3,4])

# draw induced subgraph with hop extension 
g.draw(vertices=[1, 2 ,3, 4], hop=2)

# draw a subgraph with selected labels.
g.draw(vlabel="students", elabel="friends")

# draw the graph with captions
# label.e, label.v, label
# id.e, id.v, id
# prop.XXX
# e.g.,
g.draw(caption="label.e")
g.draw(caption="prop.name")

# pass the args with a dict.
g.draw(config=Dict())


Support load_balancer service type.

Is your feature request related to a problem? Please describe.
The NodePort service type cannot serve a k8s cluster deployed on the cloud, while the LoadBalancer service type meets this requirement.

Describe the solution you'd like
Add a session param named k8s_service_type; valid options are NodePort and LoadBalancer, defaulting to NodePort.

sess = graphscope.session(k8s_service_type='LoadBalancer')

Release plan for v0.3.0

Timeline

We plan to release GraphScope 0.3.0 in March 2021.

Major features

Graph Visualization

  • Support jupyterlab notebook extension (ipygraphin) for graph visualization. (#50)
  • Reorganize graphin Jupyter plugin for graphscope. (#208)

Performance Enhancement

  • Include pre-compiled apps/graphs into docker image (#137)
  • improve the response time of each step in Python (#157)

Kubernetes Related

  • Deprecate usage of minikube and move to kind (#114, by #129)
  • Proxy all traffic from client in coordinator (#78)
  • Replace bundled vineyard container with the helm installation. (#89)
  • GraphScope helm support (#161 )

Integration with Mars

  • Integration with mars (#58 )

Improve the compatibility with NetworkX

Cloud service integration

  • Provisioning k8s cluster on amazon/aliyun (#57)

Run GraphScope locally without K8s/Docker

  • Kubernetes is not necessary for GraphScope (#113 )

Improve graph manipulation APIs

  • a general project operator (#134 )
  • tensor/ndarray attribute (#135 )
  • optimize the API for creating a graph (#142 )
  • create new graph by adding new labels of edges and vertices to existing graphs (#138 )

User-friendly API improvements

  • user-friendly repr() for graphscope objects, e.g., for graphs, print out schema in a nice way. (#164 )

Built-in app extension

Known Breaking API Changes

  • Deprecate the support for minikube (by #129 )

About opening the kube_config.new_client_from_config parameter in the Session initialization function

Is your feature request related to a problem? Please describe.
When I needed to connect to a remote k8s cluster, I found that GS had no configurable parameters to do so

In the script session.py, at line 572:

api_client = kube_config.new_client_from_config()

Let's look at the new_client_from_config function:

def new_client_from_config(
        config_file=None,
        context=None,
        persist_config=True):
    """
    Loads configuration the same as load_kube_config but returns an ApiClient
    to be used with any API object. This will allow the caller to concurrently
    talk with multiple clusters.
    """
    client_config = type.__call__(Configuration)
    load_kube_config(config_file=config_file, context=context,
                     client_configuration=client_config,
                     persist_config=persist_config)
    client_config.verify_ssl = False
    return ApiClient(configuration=client_config)

k8s allows developers to pass in the config_file parameter, but GS hides this parameter.

Describe the solution you'd like
I think the parameters of the function new_client_from_config should be configurable by the developer in the Session class, like:

api_client = kube_config.new_client_from_config(config_file=None, context=None, persist_config=persist_config)

or

api_client = kube_config.new_client_from_config(kw.pop('k8s_client_config'))

This allows us to easily load the k8s configuration:

sess = graphscope.session(kw={"config_file": None, "context": None, "persist_config": True})

or

sess = graphscope.session(kw={"k8s_client_config": {"config_file": None, "context": None, "persist_config": True}})

Additional context

My current solution is to configure the environment variables as follows:

os.environ.setdefault(
    "KUBECONFIG", os.path.join(Config.BASE_DIR, "config/kube_inner_config")
)

import graphscope

This approach requires the environment variable to be declared before graphscope is imported, which is very inelegant.

Add FAQ

GraphScope differs from many existing graph systems. I think it is a good idea to have a FAQ section in the docs to help users get started quickly.
Ref: #16

Session status still active when disconnected from the coordinator.

Is your feature request related to a problem? Please describe.
As titled.

# 2021-01-05 15:53:20,693 [WARNING][rpc:75]: Grpc call 'send_heartbeat' failed: StatusCode.UNAVAILABLE: failed to connect to all addresses
In [19]: sess
Out[19]: {'status': 'active', 'type': 'k8s', 'engine_hosts': 'gs-engine-tttqih-dpfkm,gs-engine-tttqih-tb2lg', 'namespace': 'gs-zcwihz', 'num_workers': 2, 'coordinator_endpoint': 'xxxx:xxxx', 'engine_config': {'experimental': 'ON', 'vineyard_socket': '/tmp/vineyard_workspace/vineyard.sock', 'vineyard_rpc_endpoint': xxxx:xxxx', 'vineyard_service_name': 'gs-vineyard-service-wtnaep'}}

Support bytecode-based requests in interactive engine.

The interactive engine already supports submitting a script to query the remote graph, with syntax like:
interactive.execute("g.V().count()").result().one().

However, there is also a bytecode-based style to query a graph, which should be supported as well:
g.V().count().toList().

We should support the bytecode-based style, as it is the style recommended by TinkerPop; a sketch of what it looks like follows below.
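
For reference, a bytecode-based traversal in gremlinpython looks roughly like the sketch below; the endpoint URL is hypothetical, and how GraphScope would expose the remote connection is exactly what this issue asks for:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# connect to a Gremlin server endpoint (hypothetical address)
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# the traversal is built as bytecode and evaluated lazily
print(g.V().count().toList())
conn.close()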

Serialization/Deserialization of graphs.

Is your feature request related to a problem? Please describe.
Support serializing/deserializing a graph loaded in a session, for fast loading in the future.

Describe the solution you'd like
Functions in the Python client to serialize/deserialize a graph, e.g.,

g1 = sess.load_from(HUGE_CONFIG)
g1.serialize("hdfs://LOCATION")

sess.close()

# next time to analysis the same graph maybe a few days later.
g2 = another_sess.deserialize_from("hdfs://LOCATION")

This may save a huge amount of time, because:

  • the graph may be very large
  • constructing a large graph for the first time is time-consuming.


Is k8s necessary for GraphScope

original issue title: a publicly available k8s cluster on aliyun as playground for tutorials and quick start

As titled. Let's provide a cluster as a playground for beginners.

Since the original issue will be solved via #65, and a question was posed as below, the title was changed to "Is k8s a necessity for GraphScope".

[BUG] graphscope.framework.errors.AnalyticalEngineInternalError

Describe the bug
An internal error occurred while I was loading the graph.

Expected behavior
g = load_ogbn_mag(sess, "/home/bbduser/graphscope_test")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bbduser/.local/lib/python3.7/site-packages/graphscope/dataset/ogbn_mag.py", line 71, in load_ogbn_mag
    return sess.load_from(edges, vertices)
  File "/home/bbduser/.local/lib/python3.7/site-packages/graphscope/client/session.py", line 649, in load_from
    return graphscope.load_from(*args, **kwargs)
  File "/home/bbduser/.local/lib/python3.7/site-packages/graphscope/framework/graph_utils.py", line 615, in load_from
    graph_def = sess.run(op)
  File "/home/bbduser/.local/lib/python3.7/site-packages/graphscope/client/session.py", line 531, in run
    response = self._grpc_client.run(dag)
  File "/home/bbduser/.local/lib/python3.7/site-packages/graphscope/client/rpc.py", line 136, in run
    return self._run_step_impl(dag_def)
  File "/home/bbduser/.local/lib/python3.7/site-packages/graphscope/client/rpc.py", line 45, in with_grpc_catch
    return fn(*args, **kwargs)
  File "/home/bbduser/.local/lib/python3.7/site-packages/graphscope/client/rpc.py", line 223, in _run_step_impl
    return check_grpc_response(response)
  File "/home/bbduser/.local/lib/python3.7/site-packages/graphscope/framework/errors.py", line 171, in check_grpc_response
    raise error_type(status.error_msg, detail)
graphscope.framework.errors.AnalyticalEngineInternalError: 'ArrowError occurred on worker 0: ArrowError occurred on worker 0: /usr/local/lib64/cmake/graphscope-analytical/../../../include/graphscope/core/loader/arrow_fragment_loader.h:368: readTableFromLocation -> IOError: Failed to open the /home/bbduser/graphscope_test/author_affiliated_with_institution.csv because: No such file or directory
This file exists.

Screenshots
image

Environment (please complete the following information):

  • GraphScope version: 0.1
  • OS: Linux version 3.10.0-862.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Fri Apr 20 16:44:24 UTC 2018
  • Python version: 3.6.3
  • Kubernetes version: v1.20.1


Java SDK on GraphScope analytics

Currently the analytical engine only supports user-defined algorithms in C++ or Python (via Cython).
We are planning to support a Java API on the analytical engine, which enables:

  • A set of APIs compatible with Apache Giraph, enabling jars written for Giraph to run seamlessly.
  • APIs compatible with Apache GraphX.
  • Performance comparable to the C++ version.

Print logs after session failed.

Is your feature request related to a problem? Please describe.
Session has a param called show_log which controls whether backend logs are printed to stdout/stderr, which is great. But when we run a session with show_log=False (since we don't want to be distracted by logs) and the session occasionally fails to launch, we can't get much useful information about why it failed and how to fix it; usually we need to rerun the session with show_log=True.

Describe the solution you'd like
Print the logs when the session fails, even when show_log=False.

Deprecate the codepath for minikube (if kind works better for Mac) to simplify our environment preparation steps.

Is your feature request related to a problem? Please describe.

Currently the "Prerequisites" section in README is a bit confusion, e.g., do we advertise that kind is OK as well for Mac? AFAIK we haven't test it yet.

And the codepath for minikube to get the service endpoint on Mac, as well as the k8s_minikube_vm_driver parameter in graphscope.session is a bit hacky.

Describe the solution you'd like

If kind works better we could deprecate the support for minikube and use kind everywhere in our codebase and documentation.

Additional context

N/A

Add AWS S3 support

Support loading from and writing to AWS S3 storage.
Most cloud storage services are compatible with S3.

Replace bundled vineyard container with the helm installation.

Is your feature request related to a problem? Please describe.

Vineyard can be installed using helm:

helm repo add vineyard https://dl.bintray.com/libvineyard/charts/
helm install vineyard vineyard/vineyard

GraphScope should depend on a publicly available vineyard distribution.

Improve the compatibility of NetworkX

Currently, compatibility with the NetworkX API is in preview.
We will make it solid in these aspects:

  • graph manipulations
  • built-in algorithms for large graphs in the distributed setting
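
As a sketch of the intended usage, assuming the graphscope.nx module mirrors the NetworkX interface:

import graphscope.nx as nx

# build a small graph with the NetworkX-compatible API
G = nx.Graph()
G.add_edge(1, 2, weight=3)
G.add_edges_from([(2, 3), (3, 4)])
print(G.number_of_nodes(), G.number_of_edges())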

How to launch GraphScope after running /scripts/prepare_env.sh?

Hi,
I'm puzzled about how to launch GraphScope on k8s after using the prepare_env.sh script to pull all the needed Docker images (graphscope, zookeeper, etcd, ...). How do I launch and arrange the pods? Should they be launched and set up manually and separately, i.e., should I start GraphScope as a pod first and then start etcd or zookeeper as other pods?
Or is all of this work done by GraphScope's Session?

Project fragment: select a set of labels and properties to project

Is your feature request related to a problem? Please describe.

We have Graph.project_to_simple, which selects an edge label, a vertex label and a property to project a property fragment into a simple fragment. But we cannot obtain a subset of the original graph as a new property graph. The implementation should be straightforward.

Describe the solution you'd like

Implement Graph.project(elabels: Map[string, List[string]], vlabels: Map[string, List[string]]) -> Graph.

  • Use session.g() to return a Graph.
  • Delete remove related syntax and docs.
  • Unify project and project_to_simple
  • Make add_column work by checking individual vertex labels' signatures, not by graph object id.
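
A hypothetical usage of the proposed interface (argument names follow the signature above; the property name is illustrative):

# keep only paper vertices (with their feat_0 property) and cites edges
sub = g.project(vlabels={"paper": ["feat_0"]}, elabels={"cites": []})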

Enrich the content of the document of PIE and Pregel

Is your feature request related to a problem? Please describe.
Currently, the documentation of PIE and Pregel, as well as the code samples, is not very intuitive to new users.
https://graphscope.io/docs/analytics_engine.html#writing-your-own-algorithms-in-pie
Describe the solution you'd like

  1. Explain every keyword that hasn't appeared yet, such as context, frag, message, and vd_type, md_type for the Pregel model.
  2. What is graphscope.Vertex, and what does graphscope.declare do?
  3. Give a short description of the graph structure used in the example. For example: what is the meaning of 2 in e.get_int(2)?
  4. Illustrate how to set and get arguments for UDF apps. For example: where do I set the src for context.get_config(b"src")?
    And in the running section below, I saw ret = my_app(g, source="0"), where source and src don't correspond.
  5. Give an example of how to retrieve the SSSP results, right after the running section.
  6. Add more comments to the two app samples; one should get a general idea of what the sample does without diving into every line of code, since it is a little complicated for newcomers, such as:
    for v_label_id in range(v_label_num):
        iv = frag.inner_nodes(v_label_id)
        for v in iv:
            v_dist = context.get_node_value(v)
            for e_label_id in range(e_label_num):
                es = frag.get_outgoing_edges(v, e_label_id)
                for e in es:
                    u = e.neighbor()
                    u_dist = v_dist + e.get_int(2)
                    if context.get_node_value(u) > u_dist:
                        context.set_node_value(u, u_dist)

[BUG] Figure out the required protobuf version for the graphscope python package.

Describe the bug

GraphScope will raise if the existing protobuf package version is too low:

AttributeError: module 'google.protobuf.descriptor' has no attribute '_internal_create_key'

That is because we didn't specify the version range of protobuf in requirements.txt. Raising during import graphscope without a clear error message telling users how to fix it is bad.

Expected behavior

Declare which versions we accept in requirements.txt.

Additional context

N/A.

[BUG] Set etcd memory size in session parameters doesn't take effect.

Describe the bug
Setting k8s_etcd_mem to a custom value in graphscope.session() doesn't take effect.

To Reproduce
Steps to reproduce the behavior:
Run sess = graphscope.session(k8s_etcd_mem='512Mi').
Describe the pod and you will see the limit is still 128Mi.

Expected behavior
The etcd pod's memory limit is set to 512Mi.

Environment (please complete the following information):

  • GraphScope version: v0.1.2
  • OS: Linux
  • Version: Ubuntu 20.04

AttributeError: module 'graphlearn' has no attribute 'encoders'

Describe the bug
tensorflow: 2.2.0

run node_classification_on_citation.ipynb :
config = {"class_num": 349, # output dimension
"features_num": 130, # 128 dimension + kcore + triangle count
"batch_size": 500,
"val_batch_size": 100,
"test_batch_size":100,
"categorical_attrs_desc": "",
"hidden_dim": 256,
"in_drop_rate": 0.5,
"hops_num": 2,
"neighs_num": [5, 10],
"full_graph_mode": False,
"agg_type": "gcn", # mean, sum
"learning_algo": "adam",
"learning_rate": 0.0005,
"weight_decay": 0.000005,
"epoch": 20,
"node_type": "paper",
"edge_type": "cites"}

train(config, lg)

the error:

AttributeError Traceback (most recent call last)
in
18 "edge_type": "cites"}
19
---> 20 train(config, lg)

in train(config, graph)
27 config["learning_algo"],
28 config["learning_rate"],
---> 29 config["weight_decay"]))
30 trainer.train_and_evaluate()

~/anaconda3/envs/graphscope/lib/python3.7/site-packages/graphscope/learning/graphlearn/python/model/tf/trainer.py in init(self, model_func, epoch, optimizer)
161 epoch,
162 optimizer)
--> 163 self.model = self._model_func()
164
165 def init(self, **kwargs):

in model_fn()
21 node_type=config["node_type"],
22 edge_type=config["edge_type"],
---> 23 full_graph_mode=config["full_graph_mode"])
24 trainer = LocalTFTrainer(model_fn,
25 epoch=config["epoch"],

~/anaconda3/envs/graphscope/lib/python3.7/site-packages/graphscope/learning/examples/tf/gcn/gcn.py in init(self, graph, output_dim, features_num, batch_size, val_batch_size, test_batch_size, categorical_attrs_desc, hidden_dim, in_drop_rate, hops_num, neighs_num, full_graph_mode, node_type, edge_type)
88 self.src_ego_spec = gl.EgoSpec(src_spec, hops_spec=[hop_spec] * self.hops_num)
89 # encoders.
---> 90 self.encoders = self._encoders()
91
92 def _sample_seed(self):

~/anaconda3/envs/graphscope/lib/python3.7/site-packages/graphscope/learning/examples/tf/gcn/gcn.py in _encoders(self)
128
129 depth = self.hops_num
--> 130 feature_encoders = [gl.encoders.IdentityEncoder()] * (depth + 1)
131 conv_layers = []
132 for i in range(depth):

AttributeError: module 'graphlearn' has no attribute 'encoders'

Environment (please complete the following information):

  • GraphScope version: main
  • OS: Linux
  • Version centos7
  • Kubernetes Version:1.20.1


Make `show_log` as a global configuration of graphscope.

Is your feature request related to a problem? Please describe.
The show_log parameters between different sessions will affect each other.

Describe the solution you'd like

Deprecate the show_log and log_level params in the session; instead, it is better to define them as a global configuration of graphscope.

import graphscope
graphscope.set_option("show_log", True)
graphscope.get_option("show_log")
