alibaba / graphar

An open source, standard data file format for graph data storage and retrieval

Home Page: https://alibaba.github.io/GraphAr/

License: Apache License 2.0

CMake 2.48% C++ 43.24% Java 21.94% Scala 22.19% Shell 0.71% C 0.08% Makefile 0.09% Python 9.27%
big-data graph graph-storage data-orchestration etl graph-analysis pyspark spark

graphar's Introduction

GraphAr

An open source, standard data file format for graph data storage and retrieval


📢 Join our Weekly Community Meeting to learn more about GraphAr and get involved!

What is GraphAr?


Overview


Graph processing serves as an essential building block for a wide variety of real-world applications such as social network analytics, data mining, network routing, and scientific computing.

GraphAr (short for "Graph Archive") is a project that aims to make it easier for diverse applications and systems (in-memory and out-of-core storages, databases, graph computing systems, and interactive graph query frameworks) to build and access graph data conveniently and efficiently.

It can be used for importing/exporting and persistent storage of graph data, thereby reducing the burden on systems when working together. Additionally, it can serve as a direct data source for graph processing applications.

To achieve this, GraphAr provides:

  • The Graph Archive (GAR) file format: a standardized system-independent file format for storing graph data
  • Libraries: a set of libraries for reading, writing and transforming GAR files

By using GraphAr, you can:

  • Store and persist your graph data in a system-independent way with the GAR file format
  • Easily access and generate GAR files using the libraries
  • Utilize Apache Spark to quickly manipulate and transform your GAR files
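
For a sense of the intended workflow, here is a minimal Spark sketch. The GraphAr-specific calls are left as comments because the class and method names used there (GraphInfo, GraphReader, GraphWriter) are illustrative assumptions, not the verified API; see the library documentation for the real entry points.

import org.apache.spark.sql.SparkSession

object QuickStart {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("graphar-quickstart").getOrCreate()

    // Hypothetical: load the YAML metadata and read all chunks into
    // DataFrames keyed by vertex/edge label.
    // val graphInfo = GraphInfo.loadGraphInfo("ldbc_sample.graph.yml", spark)
    // val (vertexDfs, edgeDfs) = GraphReader.read(graphInfo, spark)

    // ...transform the DataFrames with ordinary Spark operators...

    // Hypothetical: write the (possibly modified) DataFrames back as GAR files.
    // GraphWriter.write("output/", graphInfo, vertexDfs, edgeDfs)

    spark.stop()
  }
}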

The GAR File Format

The GAR file format is designed for storing property graphs. It uses metadata to record all the necessary information of a graph, and maintains the actual data in a chunked way.

A property graph consists of vertices and edges, where each vertex contains a unique identifier and:

  • A text label that describes the vertex type.
  • A collection of properties, each of which can be represented as a key-value pair.

Each edge contains a unique identifier and:

  • The outgoing vertex (source).
  • The incoming vertex (destination).
  • A text label that describes the relationship between the two vertices.
  • A collection of properties.
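
For illustration only, this model maps naturally onto a pair of record types; a minimal sketch of the abstraction (not a GraphAr API):

// A vertex: unique id, type label, and key-value properties.
case class Vertex(id: Long, label: String, properties: Map[String, Any])

// An edge: unique id, source and destination vertices, relationship label,
// and key-value properties.
case class Edge(id: Long, srcId: Long, dstId: Long, label: String,
                properties: Map[String, Any])

val alice = Vertex(0L, "person", Map("firstName" -> "Alice"))
val bob = Vertex(1L, "person", Map("firstName" -> "Bob"))
val knows = Edge(0L, alice.id, bob.id, "knows", Map("creationDate" -> "2020-01-01"))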

The following is an example property graph containing two types of vertices ("person" and "comment") and three types of edges.

[Figure: an example property graph]

Vertices in GraphAr

Logical table of vertices

Each type of vertex (i.e., vertices with the same label) constructs a logical vertex table, in which every vertex is assigned a global index within its type (called the internal vertex id), starting from 0 and corresponding to the vertex's row number in the table. An example layout of the logical table for vertices with the label "person" is provided for reference.

Given an internal vertex id and the vertex label, a vertex is uniquely identifiable and its respective properties can be accessed from this table. The internal vertex id is further used to identify the source and destination vertices when maintaining the topology of the graph.

[Figure: the logical table of "person" vertices]

Physical table of vertices

The logical vertex table is partitioned into multiple contiguous vertex chunks to improve reading/writing efficiency. To preserve the ability to randomly access the data, the chunk size is fixed for all vertex chunks of the same label. To allow reading only the required properties (instead of all properties in the files), and to allow adding vertex properties without modifying the existing files, the columns of the logical table are divided into several column groups.

Take the "person" vertex table as an example, if the chunk size is set to be 500, the logical table will be separated into sub-logical-tables of 500 rows with the exception of the last one, which may have less than 500 rows. The columns for maintaining properties will also be divided into distinct groups (e.g., 2 for our example). As a result, a total of 4 physical vertex tables are created for storing the example logical table, which can be seen from the following figure.

[Figure: the physical tables of "person" vertices]

Note: To efficiently utilize the filter pushdown of payload file formats like Parquet, the internal vertex id is stored in the payload file as a column. Since the internal vertex id is continuous, the payload file format can apply delta encoding to this column, so storing it does not add much overhead.
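
Because the chunk size is fixed for each label, the physical chunk that holds a vertex, and its row inside that chunk, follow directly from the internal vertex id. A minimal sketch, using the chunk size of 500 from the example above:

val chunkSize = 500L  // fixed for all vertex chunks of the "person" label

def chunkIndex(internalId: Long): Long = internalId / chunkSize  // which chunk
def rowInChunk(internalId: Long): Long = internalId % chunkSize  // which row

// Internal vertex id 1234 lives in chunk 2, at row 234 of that chunk.
assert(chunkIndex(1234L) == 2L && rowInChunk(1234L) == 234L)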

Edges in GraphAr

Logical table of edges

To maintain a type of edges (i.e., those with the same triplet of source label, edge label, and destination label), a logical edge table is established. To support quickly creating a graph from the graph storage files, the logical edge table can maintain the topology information in a way similar to CSR/CSC (learn more about CSR/CSC): the edges are ordered by the internal vertex id of either the source or the destination. In this layout, an offset table is required to store the start offset of each vertex's edges, and edges with the same source/destination are stored contiguously in the logical table.
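
To make the CSR-like layout concrete, here is a minimal sketch of how the offset table answers "which edges belong to vertex v": the edges of v occupy the contiguous range [offset(v), offset(v + 1)) of the ordered edge table (toy data, ordered_by_source):

// One offset entry per vertex plus a trailing sentinel; edges are
// (src, dst) pairs sorted by the internal vertex id of the source.
val offsets = Array(0, 2, 3, 3, 5)
val edges = Vector((0L, 1L), (0L, 3L), (1L, 2L), (3L, 0L), (3L, 2L))

def outEdges(v: Int): Vector[(Long, Long)] = edges.slice(offsets(v), offsets(v + 1))

assert(outEdges(0) == Vector((0L, 1L), (0L, 3L)))  // two edges from vertex 0
assert(outEdges(2).isEmpty)                        // vertex 2 has no out-edges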

Taking the "person knows person" edges as an example, the logical edge table looks like:

[Figure: the logical table of "person knows person" edges]

Physical table of edges

As with the vertex table, the logical edge table is partitioned into sub-logical-tables, each of which contains edges whose source (or destination) vertices fall in the same vertex chunk. According to the partition strategy and the ordering of the edges, edges can be stored in GraphAr in one of four types:

  • ordered_by_source: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the source, which can be seen as the CSR format.
  • ordered_by_dest: all the edges in the logical table are ordered and further partitioned by the internal vertex id of the destination, which can be seen as the CSC format.
  • unordered_by_source: the internal id of the source vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can be seen as the COO format.
  • unordered_by_dest: the internal id of the destination vertex is used as the partition key to divide the edges into different sub-logical-tables, and the edges in each sub-logical-table are unordered, which can also be seen as the COO format.

After that, a sub-logical-table is further divided into edge chunks of a predefined, fixed number of rows (referred to as the edge chunk size). Finally, an edge chunk is separated into physical tables in the following way:

  • an adjList table (which contains only two columns: the internal vertex ids of the source and the destination).
  • 0 or more edge property tables, each of which contains a group of properties.

Additionally, there is an offset table for ordered_by_source or ordered_by_dest edges, which records the starting point of the edges for each vertex. The partition of the offset table is aligned with the partition of the corresponding vertex table. The first row of each offset chunk is always 0, indicating the starting point of the corresponding sub-logical-table of edges.

Take the "person knows person" edges to illustrate. Suppose the vertex chunk size is set to 500 and the edge chunk size is 1024, and the edges are ordered_by_source, then the edges could be saved in the following physical tables:

[Figure: physical tables of the "person knows person" edges (part 1)]

[Figure: physical tables of the "person knows person" edges (part 2)]
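
Under these settings, locating a physical edge chunk is pure arithmetic; a minimal sketch using the numbers from the example (vertex chunk size 500, edge chunk size 1024, ordered_by_source):

val vertexChunkSize = 500L  // determines the sub-logical-table of a source vertex
val edgeChunkSize = 1024L   // rows per physical edge chunk

// Which sub-logical-table holds the edges whose source is `srcId`?
def subTableIndex(srcId: Long): Long = srcId / vertexChunkSize

// Within a sub-logical-table, which edge chunk holds its k-th row?
def edgeChunkIndex(k: Long): Long = k / edgeChunkSize

assert(subTableIndex(777L) == 1L)    // source vertex 777 -> sub-logical-table 1
assert(edgeChunkIndex(3000L) == 2L)  // row 3000 -> edge chunk 2 of that sub-table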

Building Libraries

GraphAr offers a collection of libraries for reading, writing and transforming files. Currently, the following libraries are available, and plans are in place to expand support to additional programming languages.

The C++ Library

See GraphAr C++ Library for details about the building of the C++ library.

The Java Library

The GraphAr Java library is created with bindings to the C++ library (currently at version v0.10.0), utilizing Alibaba-FastFFI for implementation. See GraphAr Java Library for details about the building of the Java library.

The Spark Library

See GraphAr Spark Library for details about the Spark library.

The PySpark Library

The GraphAr PySpark library is developed as bindings to the GraphAr Spark library. See GraphAr PySpark Library for details about the PySpark library.

Contributing

Contributing Guidelines

Read through our contribution guidelines to learn about our submission process, coding rules, and more.

Code of Conduct

Help us keep GraphAr open and inclusive. Please read and follow our Code of Conduct.

Getting Involved

Join the conversation and help the community. Even if you do not plan to contribute to GraphAr itself or GraphAr integrations in other projects, we'd be happy to have you involved.

Read through our community introduction to learn about our communication channels, governance, and more.

License

GraphAr is distributed under Apache License 2.0. Please note that third-party libraries may not have the same license as GraphAr.

Publication

@article{li2023enhancing,
  author = {Xue Li and Weibin Zeng and Zhibin Wang and Diwen Zhu and Jingbo Xu and Wenyuan Yu and Jingren Zhou},
  title = {Enhancing Data Lakes with GraphAr: Efficient Graph Data Management with a Specialized Storage Scheme},
  year = {2023},
  url = {https://doi.org/10.48550/arXiv.2312.09577},
  doi = {10.48550/ARXIV.2312.09577},
  eprinttype = {arXiv},
  eprint = {2312.09577},
  biburl = {https://dblp.org/rec/journals/corr/abs-2312-09577.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

graphar's People

Contributors

acezen, andydiwenzhu, haohao0103, jasinliu, liuxiaocs7, lixueclaire, semyonsinchenko, sighingnow, thespica, yecol, yixinglu, zhanglei1949, ziy1-tan


graphar's Issues

[Bug] Inconsistent prefix for vertex property chunks in the test data

In the documentation and metadata, the prefix is "./vertex/person/first_name_last_name_gender", but the file path for the property chunks is "./vertex/person/firstName_lastName_gender".
documentation: https://alibaba.github.io/GraphAr/user-guide/getting-started.html#property-data
meta data: https://github.com/acezen/gar-test/blob/master/ldbc_sample/csv/person.vertex.yml
file path for the chunks: https://github.com/acezen/gar-test/tree/master/ldbc_sample/csv/vertex/person/firstName_lastName_gender

Improve the document about the file format introduction and use examples

Is your feature request related to a problem? Please describe.
Improve the documentation of the GAR file format introduction to make it clearer. Also, re-organize and improve the examples to help users get started with GraphAr.


[Feat] Support users-defined data type parser in graph info

Is your feature request related to a problem? Please describe.
Currently, the GraphAr Information classes allow users to extend their custom data types based on the info version (#27). However, the Reader/Writer implementations of our libraries do not support reading/writing data in user-defined types.

Describe the solution you'd like
Extend the GraphAr libraries to support passing a user-defined parser to the Reader/Writer to handle the custom data types.


[C++] [Improvement] Provide more writing methods in the C++ library

Is your feature request related to a problem? Please describe.
Currently, the low-level writers (VertexPropertyWriter and EdgeChunkWriter) only support writing Arrow tables, so users are required to construct such tables before writing (e.g., to write PageRank results saved in a std::vector into GAR files). For the high-level writers (VerticesBuilder and EdgesBuilder), it is required to first construct Vertex/Edge objects, which are internal high-level data structures in GraphAr.

Describe the solution you'd like
We propose to provide more built-in writing methods in the C++ Writer SDK to support additional data structures besides Arrow tables and GraphAr Vertex/Edge. A possible solution is to support containers from the STL, as the Boost Graph Library does, including:

  • std::vector
  • std::list
  • std::slist
  • std::set
  • std::hash_set
  • std::multiset

[Feat] Ensure the metadata information to behave exactly the same across different languages

Is your feature request related to a problem? Please describe.
Utilize ProtoBuf to ensure that the metadata of the GAR file format behaves exactly the same across different languages.


Release version v0.1.0

Is your feature request related to a problem? Please describe.
Release version v0.1.0

Describe the solution you'd like

  • Check CI pass
  • The release note.


[Feat] Add Spark Examples in GAR using GraphAr Spark tools

Is your feature request related to a problem? Please describe.
The GraphAr Spark tools can be applied in scenarios where the graph format needs to be transformed. They can also be used when taking GraphAr as the data source to execute SQL queries or do graph processing. We can add some examples to show these use cases.

Describe the solution you'd like
Add examples that utilize the Spark tools to:

  • take GAR as data sources to do graph processing (e.g., run CC using GraphX).
  • transform GAR data between different file types (e.g., from ORC to Parquet).
  • transform GAR data between different adjList types (e.g., from COO to CSR).


[Bug]: Offset chunk of spark writer got wrong value and output location

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

  • The offset chunk file path output by the Spark writer is not compatible with the path obtained from the edge info:

run mvn test -Dsuites='com.alibaba.graphar.WriterSuite test edge writer with vertex table and edge table'

the offset0 output path is /tmp/edge/person_knows_person/ordered_by_source/offset/part0
but the [getAdjListOffsetFilePath] method of the edge info returns /tmp/edge/person_knows_person/ordered_by_source/offset0
https://github.com/alibaba/GraphAr/blob/0991064e3f5a5844d453d2743bc2b03dc65fdf14/spark/src/main/scala/com/alibaba/graphar/EdgeInfo.scala#L291

Expected Behavior

  • path: the offset output path should be compatible with the edge info
  • offset value: should be the real offset value, not the edge count.

Minimal Reproducible Example

cd spark
mvn test -Dsuites='com.alibaba.graphar.WriterSuite test edge writer with vertex table and edge table'

Environment

  • Operating system: MacOS
  • GraphAr version:
    v0.1.0

Link to GraphAr Logs

No response

Further Information

No response

Add `AdjListInfo` for EdgeInfo to store adj list information

Is your feature request related to a problem? Please describe.
Based on the graph information file design (example), the AdjList of a graph contains the information about alignment, the edge chunk file type, and the property groups of the edge. But in the C++ library, AdjList is only an enum type, and the other information is stored in EdgeInfo with maps.
This is not aligned with the yaml file design. To address the problem, maybe we should add a middle structure AdjListInfo between PropertyGroup and EdgeInfo to keep track of the adj list information of the graph.

Describe the solution you'd like
The AdjListInfo could look like this (just a proposal):

class AdjListInfo {
    FileType file_type_;
    std::string prefix_;
    std::vector<PropertyGroup> property_groups_;

  public:
    // Constructor
    AdjListInfo(FileType file_type, const std::string& prefix);

    // some add methods
    void AddPropertyGroup(const PropertyGroup& pg);

    // some getter methods
    FileType GetFileType() const;
};

Then, use AdjListInfo objects as member variables to update the implementation of EdgeInfo.

[Feat] Provide libraries for other languages

Is your feature request related to a problem? Please describe.
Currently the libraries for GraphAr are only available for C++ and Spark. But many graph processing systems are implemented in other programming languages (like Neo4j, in Java). We need to provide libraries for more programming languages.

Describe the solution you'd like
Implement libraries in:

  • Java
  • Go
  • Rust
  • Python

[Feat] Implement GraphAr Spark Writer for writing Spark DataFrame into GAR format files

Is your feature request related to a problem? Please describe.
Implement the writer of the Spark tool to provide functions that generate GraphAr format files from Hive tables.

Describe the solution you'd like
It's better to read the Hive table as a Spark DataFrame and use DataFrame operators to generate the files.
The writer should include VertexWriter and EdgeWriter:

  • VertexWriter provides functions to generate the chunk files of each property group, based on the user-defined vertex info
  • EdgeWriter provides functions to generate the chunk files of the adj list/offset/property groups, based on the user-defined edge info


Improve the performance of Spark Reader

Is your feature request related to a problem? Please describe.
Optimize the Spark Reader to support reading multiple chunks in parallel for better performance, while maintaining the relative order of the chunks in the resulting DataFrame.

Describe the solution you'd like
A clear and concise description of what you want to happen.


Add introduction about GraphAr Spark tools in document

Is your feature request related to a problem? Please describe.
Add an individual page in GraphAr document to introduce the Spark tools.

Describe the solution you'd like
The document would include:

  • the high-level overview of the Spark tools
  • how to get the tools
  • how to use them


[Feat][Doc] Generate GAR files of whole `ldbc-sample` property graph as an example to demonstrate GAR format

Is your feature request related to a problem? Please describe.
Maybe we need a widely-used property graph to demonstrate the GAR file format. The LDBC dataset seems to be a good choice.

This issue can be a good first issue for a developer.

Describe the solution you'd like
Generate formatted files in GraphAr for a property graph, including:

  • design the metadata files (in YAML) for the ldbc graph; the file format of the chunk files can be CSV so they are easy to read
  • generate the GAR data files with the Spark library
  • add some tests to check that the data files match the metadata information, utilizing the Info classes


Additional context
related to issue #37

[Feat] Implement `FragmentBuilder` in GraphScope with GraphAr to support building graph from GAR format files

Is your feature request related to a problem? Please describe.
We selected GraphScope as GraphAr's first landing system, to serve as an example of using GraphAr.
Implement a builder of Fragment in GraphScope with GraphAr to support building the in-memory property graph from GraphAr format files.

Describe the solution you'd like
The process of the builder works like this:

  • First, the user should design the yaml files to describe the graph to load; it can be the whole in-memory graph or a subgraph.
  • FragmentBuilder loads the yaml files as Infos (GraphInfo, VertexInfo and EdgeInfo), uses the ArrowChunkReader API of GraphAr to load the chunk files as Arrow tables (including vertex tables, edge tables and offset tables), and uses these tables to construct the fragment.

Here is a prototype implementation of FragmentBuilder


[Feat] Implement GraphAr Spark Reader for reading GAR format files into Spark DataFrame

Is your feature request related to a problem? Please describe.
Implement the Spark Reader to provide functions for reading GraphAr files into Spark DataFrames.

Describe the solution you'd like
The reader should include VertexReader and EdgeReader:

  • VertexReader provides functions to read one type of vertices at a time and assemble the results into Spark DataFrames.
  • EdgeReader provides functions to read edge chunks, including adjList, offset and property chunks.


[Feat] Improve the performance of Spark writer


[Bug]: The `libgar` library built from source exposes its dependencies through its link interface

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When I use the gar library in my project with

target_link_libraries(my_example PUBLIC ${GAR_LIBRARIES})

and build my project

I got this error:

-larrow_static not found

It looks like the linked target has inherited the gar library's dependencies.

GraphAr/CMakeLists.txt

Lines 170 to 172 in e8edfe3

target_link_libraries(gar -Wl,-force_load arrow_static
"${PROJECT_BINARY_DIR}/thirdparty/yaml-cpp/libyaml-cpp.a"
"${PARQUET_STATIC_LIB}"

Expected Behavior

The user's project should NOT inherit GraphAr's dependency interface.

Minimal Reproducible Example

project(MyExample)

find_package(gar REQUIRED)
include_directories(${GAR_INCLUDE_DIRS})

add_executable(my_example my_example.cc)
target_compile_features(my_example PRIVATE cxx_std_17)
target_link_libraries(my_example PRIVATE ${GAR_LIBRARIES})

Environment

  • Operating system: Ubuntu 20.04
  • GraphAr version: commit e8edfe3

Link to GraphAr Logs

No response

Further Information

No response

[Feat] Support more file formats for payload files

Is your feature request related to a problem? Please describe.
GraphAr's chunk files can currently be stored in ORC, Parquet or CSV. We can support more built-in file formats like JSON, HDF5 and Avro to enhance the capability of GraphAr and satisfy different requirements for file formats.

Describe the solution you'd like
Support more file types by extending the metadata information and implementing the related reading/writing functions with the help of Arrow or other third-party libraries.


[Feat] Integrate NebulaGraph spark connector as input data source for GraphAr spark tool

Is your feature request related to a problem? Please describe.
The graph data migration between NebulaGraph and GraphAr could be an important application of GraphAr. This can be implemented based on the NebulaGraph Spark connector and the GraphAr Spark library, including reading graph data from NebulaGraph to generate GAR files, and reading from GraphAr to create/update instances in NebulaGraph.

Describe the solution you'd like
Please refer to the integration with Neo4j (#107).


Add a release tutorial to the contributing guide to make it easy for maintainers to do the release process of GraphAr

Is your feature request related to a problem? Please describe.
Add a GitHub action to simplify the release process of GraphAr, and add a release tutorial for maintainers on how to cut a version.

Describe the solution you'd like
Simplify the process with a tool like action-automatic-releases.


[Feat]: Support data type extension in graph information base on `version` attribute of Infos

Is your feature request related to a problem? Please describe.
The version attribute of the Infos (graph, vertex, edge) is currently just a number. Actually, it can carry the implicit information of which property data types are supported by that version. As the version grows, the supported data types can be extended. For example:
version 1 -> supports bool, int32, int64, float, double, string
version 2 -> supports bool, int32, int64, float, double, string, date32

Describe the solution you'd like

  • Use a string instead of a number as the version, something like the User-Agent string of a browser.
    Version examples:
    gar/v1
    gar/v2
    gar/v3 (user_define1, user_define2) # suppose version 3 or higher supports user-defined types.

  • Add a VersionMeta class to keep a record of each version's supported data types and do the version string parsing.

  • If the yaml contains a data type that its declared version does not support, raise an error to the user (a parsing sketch follows below).
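
A minimal sketch of the proposed version-string parsing (the regex and the VersionMeta shape are illustrative assumptions):

// Parses strings like "gar/v1" or "gar/v3 (user_define1, user_define2)".
case class VersionMeta(version: Int, userDefinedTypes: Seq[String])

val VersionPattern = """^gar/v(\d+)\s*(?:\(([^)]*)\))?$""".r

def parseVersion(s: String): Option[VersionMeta] = s.trim match {
  case VersionPattern(v, null) => Some(VersionMeta(v.toInt, Nil))
  case VersionPattern(v, types) =>
    Some(VersionMeta(v.toInt, types.split(",").map(_.trim).filter(_.nonEmpty).toSeq))
  case _ => None // unsupported version string: raise an error to the user
}

assert(parseVersion("gar/v1") == Some(VersionMeta(1, Nil)))
assert(parseVersion("gar/v3 (d1, d2)") == Some(VersionMeta(3, Seq("d1", "d2"))))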

Add CODE_OF_CONDUCT.md

A code of conduct helps establish expectations for the behavior of the project's participants, and facilitates healthy, constructive community behavior.

We should add a document to the root of the git repository to direct interested individuals to the CoC.

Fix prefix of GAR files in document


[Feat] Add tests for GraphAr spark tool and integrate to CI


[Improvement] Reorganize the code directory to make it easy for developers to extend the library to other languages

Is your feature request related to a problem? Please describe.
Since the C++ library is the first library supported by GraphAr, its code is put directly in the root of the source tree. To make it easy to extend libraries to other languages, we need to reorganize the code directory like:

.
├── cpp (c++ library code)
├── docs
├── examples
├── spark
└── thirdparty

Describe the solution you'd like

  • Put the C++ library code into the cpp directory
  • Add a CMakeLists.txt to manage the building of all libraries

[Feat] Integrate GraphAr into GraphScope

Is your feature request related to a problem? Please describe.
Use GraphScope as our first landing system.

Describe the solution you'd like

  • Implement writer/builder with GraphAr in vineyard.
  • benchmarking (cf, ldbc snb30, ldbc snb100)
  • Add the related API calls in the GraphScope client to enable writing/building graphs with GraphAr
  • Add related documents and tests


Refine the README.rst to make user/developer easy to know `What is GraphAr`

Is your feature request related to a problem? Please describe.
The current README of GraphAr is a little clumsy and incomplete. It does not help users/developers understand what GraphAr is.

Describe the solution you'd like

  • Clear and concise introduction of GraphAr.
  • Goals of GraphAr
  • Links to other documents (for advanced reading)
  • Concise writing
  • Add Code of conduct


[Feat] Integrate LDBC spark connector as input data source for GraphAr spark tool

Is your feature request related to a problem? Please describe.
LDBC provides a synthetic graph generator running on Spark (https://github.com/ldbc/ldbc_snb_datagen_spark). We can utilize the GraphAr spark library to integrate with this graph generator, for dumping the generated graph data into GraphAr files.

Describe the solution you'd like
Refer to the API reference of the Reader/Writer and the graph-level interface of the GraphAr Spark library. The integration with the Neo4j Spark connector (#107) can also help.


[Feat] Implement GraphAr Spark Tools 0.1

Is your feature request related to a problem? Please describe.
GraphAr Spark tools are required as a library to make generating, loading and transforming GAR files with Apache Spark easy.

Describe the solution you'd like
GraphAr Spark tools consist of the following parts:

  • Reader: for reading GAR files into Spark DataFrame #29
  • Writer: for writing Spark DataFrame into GAR files #28
  • IndexGenerator: for helping to generate the vertex index for vertex/edge DataFrames #36
  • Info Classes: for constructing and accessing the meta information of GraphAr #32


[Feat] Implement Graph spark `IndexGenerator` for helping to generate the vertex index for vertex/edge DataFrame

Is your feature request related to a problem? Please describe.
According to the GAR file format, the global vertex index is important: it is continuous and unique.

The original data sources for Spark (e.g., a vertex DataFrame and an edge DataFrame) usually do not contain such a column.
IndexGenerator is a helper object that helps GraphAr generate the vertex index for vertex and edge DataFrames.

Describe the solution you'd like
Here is an API proposal for IndexGenerator:

import org.apache.spark.sql.DataFrame

object IndexGenerator {
  // helper methods for vertex DataFrames
  def constructVertexIndexMapping(vertexDf: DataFrame, primaryKey: String): DataFrame = {
    // return a DataFrame that contains two columns: vertex index & primary key
    ???
  }

  def generateVertexIndexColumn(vertexDf: DataFrame): DataFrame = {
    // add a column that contains the vertex index
    ???
  }

  // helper methods for edge DataFrames
  // generate indices from a vertex mapping
  def generateSrcIndexForEdgesFromMapping(edgeDf: DataFrame, srcColumnName: String, srcIndexMapping: DataFrame): DataFrame = {
    // join the edge table with the vertex index mapping for the source column
    ???
  }

  def generateDstIndexForEdgesFromMapping(edgeDf: DataFrame, dstColumnName: String, dstIndexMapping: DataFrame): DataFrame = {
    // join the edge table with the vertex index mapping for the destination column
    ???
  }

  def generateVertexIndexForEdgesFromMapping(edgeDf: DataFrame, srcColumnName: String, dstColumnName: String, srcIndexMapping: DataFrame, dstIndexMapping: DataFrame): DataFrame = {
    // join the edge table with the vertex index mapping for the source & destination columns
    ???
  }

  // generate indices by sorting the src/dst column
  def generateSrcIndexForEdges(edgeDf: DataFrame, srcColumnName: String): DataFrame = {
    // construct the vertex index for the source column
    ???
  }

  def generateDstIndexForEdges(edgeDf: DataFrame, dstColumnName: String): DataFrame = {
    // construct the vertex index for the destination column
    ???
  }

  def generateSrcAndDstIndexUnitedlyForEdges(edgeDf: DataFrame, srcColumnName: String, dstColumnName: String): DataFrame = {
    // construct the vertex index for the source & destination columns together
    ???
  }
}
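
One plausible way to realize constructVertexIndexMapping is Spark's zipWithIndex, which assigns a continuous, unique, 0-based index without collecting the data to one node. A minimal sketch under that assumption (the "vertexIndex" column name is illustrative):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

def constructVertexIndexMappingSketch(vertexDf: DataFrame, primaryKey: String): DataFrame = {
  val spark = vertexDf.sparkSession
  // zipWithIndex preserves the DataFrame's row order and yields 0, 1, 2, ...
  val indexed = vertexDf.select(primaryKey).rdd.zipWithIndex.map {
    case (row, index) => Row(index, row.get(0))
  }
  val schema = StructType(Seq(
    StructField("vertexIndex", LongType, nullable = false),
    vertexDf.schema(primaryKey)))
  spark.createDataFrame(indexed, schema)
}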


Fully utilize the features of different file formats for improved efficiency

Is your feature request related to a problem? Please describe.
GraphAr currently supports the CSV, ORC and Parquet file formats, and it is going to support more file types such as JSON, HDF5 and Avro. To enhance the efficiency of reading/writing and storing the data, the features of different file formats should be considered and fully utilized, for example, applying the most appropriate compression and encoding scheme to the data, or enabling filter pushdown to improve query performance.


[Feat] Implement `FragmentWriter` with GraphAr in GraphScope to support writing graph to GAR format files

Is your feature request related to a problem? Please describe.
We selected GraphScope as GraphAr's first landing system, to serve as an example of using GraphAr.
Implement a writer of Fragment in GraphScope with GraphAr to support dumping the in-memory property graph to GraphAr format files.

Describe the solution you'd like
The process of the writer works like this:

  • First, the user should design the yaml files to describe the graph to dump; it can be the whole in-memory graph or a subgraph.
  • FragmentWriter loads the yaml files as Infos (GraphInfo, VertexInfo and EdgeInfo), and uses the ArrowChunkWriter API of GraphAr to dump the Arrow tables to GraphAr format files.

Here is a prototype implementation of FragmentWriter


Revise the application example implementation

Is your feature request related to a problem? Please describe.
Currently the examples of GraphAr are implemented like unit tests, and they are not intuitive for users or beginning developers who want to learn how to use GraphAr.
We need to revise the implementation and make them more like examples and showcases.


Additional context
related to #37

[Feat] Implement Info class for GraphAr spark tool to construct and access the meta information of graph

Is your feature request related to a problem? Please describe.
Implement the Info classes for the GraphAr spark tool. The Infos include GraphInfo, VertexInfo and EdgeInfo, and align with the classes of the C++ SDK.

Describe the solution you'd like
Here is a proposal for the Info classes API:

import scala.beans.BeanProperty

class Property () {
  @BeanProperty var name: String = ""
  @BeanProperty var data_type: String = ""
  @BeanProperty var is_primary: Boolean = false
}

//methods of Property:
// -- getName: String
// -- getData_type: String
// -- getData_type_in_gar: GarType.Value
// -- getIs_primary: Boolean

class PropertyGroup () {
  @BeanProperty var prefix: String = ""
  @BeanProperty var file_type: String = ""
  @BeanProperty var properties = new java.util.ArrayList[Property]()
}

//methods of PropertyGroup:
// -- getPrefix: String
// -- getFile_type: String
// -- getFile_type_in_gar: FileType.Value
// -- getProperties:  ArrayList[Property]

class AdjList () {
  @BeanProperty var ordered: Boolean = false
  @BeanProperty var aligned_by: String = "src"
  @BeanProperty var prefix: String = ""
  @BeanProperty var file_type: String = ""
  @BeanProperty var property_groups = new java.util.ArrayList[PropertyGroup]()
}

//methods of AdjList:
// -- getOrdered: Boolean
// -- getAligned_by: String
// -- getPrefix: String
// -- getFile_type: String
// -- getFile_type_in_gar: FileType.Value
// -- getAdjList_type: String
// -- getAdjList_type_in_gar: AdjListType.Value
// -- getPropertyGroups: ArrayList[PropertyGroup]

class GraphInfo() {
  @BeanProperty var name: String = ""
  @BeanProperty var prefix: String = ""
  @BeanProperty var vertices = new java.util.ArrayList[String]()
  @BeanProperty var edges = new java.util.ArrayList[String]()
  @BeanProperty var version: String = ""
}

//methods of GraphInfo:
// -- getName: String
// -- getPrefix: String
// -- getVertices: ArrayList[String]
// -- getEdges: ArrayList[String]
// -- getVersion: String

class VertexInfo() {
  @BeanProperty var label: String = ""
  @BeanProperty var chunk_size: Long = 0
  @BeanProperty var prefix: String = ""
  @BeanProperty var property_groups = new java.util.ArrayList[PropertyGroup]()
  @BeanProperty var version: String = ""
}

//methods of VertexInfo:
// -- getLabel: String
// -- getChunk_size: Long
// -- getPrefix: String
// -- getProperty_groups: ArrayList[PropertyGroup]
// -- getVersion: String
// -- containPropertyGroup(property_group: PropertyGroup) : Boolean
// -- containProperty(property_name: String) : Boolean
// -- getPropertyGroup(property_name: String):PropertyGroup
// -- getPropertyType(property_name: String): GarType.Value
// -- isPrimaryKey(property_name: String): Boolean
// -- getPrimaryKey(): String
// -- isValidated(): Boolean
// -- getVerticesNumFilePath(): String
// -- getFilePath(property_group: PropertyGroup, chunk_index: Long): String
// -- getDirPath(property_group: PropertyGroup): String

class EdgeInfo() {
  @BeanProperty var src_label: String = ""
  @BeanProperty var edge_label: String = ""
  @BeanProperty var dst_label: String = ""
  @BeanProperty var chunk_size: Long = 0
  @BeanProperty var src_chunk_size: Long = 0
  @BeanProperty var dst_chunk_size: Long = 0
  @BeanProperty var directed: Boolean = false
  @BeanProperty var prefix: String = ""
  @BeanProperty var adj_lists = new java.util.ArrayList[AdjList]()
  @BeanProperty var version: String = ""
}

//methods of EdgeInfo:
// -- getSrc_label: String
// -- getEdge_label: String
// -- getDst_label: String
// -- getChunk_size: Long
// -- getSrc_chunk_size: Long
// -- getDst_chunk_size: Long
// -- getDirected: Boolean
// -- getPrefix: String
// -- getAdj_lists: ArrayList[AdjList]
// -- containAdjList(adj_list_type: AdjListType.Value): Boolean
// -- getAdjListPrefix(adj_list_type: AdjListType.Value): String
// -- getAdjListFileType(adj_list_type: AdjListType.Value): FileType.Value
// -- containPropertyGroup(property_group: PropertyGroup, adj_list_type: AdjListType.Value) : Boolean
// -- containProperty(property_name: String) : Boolean
// -- getPropertyGroups(adj_list_type: AdjListType.Value): java.util.ArrayList[PropertyGroup]
// -- getPropertyType(property_name: String): GarType.Value
// -- getPropertyGroup(property_name: String, adj_list_type: AdjListType.Value): PropertyGroup 
// -- isPrimaryKey(property_name: String): Boolean
// -- getPrimaryKey(): String
// -- isValidated(): Boolean
// -- getAdjListOffsetFilePath(chunk_index: Long, adj_list_type: AdjListType.Value) : String
// -- getAdjListOffsetDirPath(adj_list_type: AdjListType.Value) : String
// -- getAdjListFilePath(vertex_chunk_index: Long, chunk_index: Long, adj_list_type: AdjListType.Value) : String
// -- getAdjListDirPath(adj_list_type: AdjListType.Value) : String
// -- getPropertyFilePath(property_group: PropertyGroup, adj_list_type: AdjListType.Value, vertex_chunk_index: Long, chunk_index: Long): String
// -- getPropertyDirPath(property_group: PropertyGroup, adj_list_type: AdjListType.Value) : String
// -- getVersion: String
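
The @BeanProperty fields suggest these classes are meant to be populated directly from the YAML metadata files with a bean-style parser; a minimal sketch, assuming SnakeYAML is on the classpath and using a hypothetical file path:

import org.yaml.snakeyaml.Yaml
import org.yaml.snakeyaml.constructor.Constructor
import java.io.FileInputStream

// Map a vertex metadata file onto the proposed VertexInfo bean.
val yaml = new Yaml(new Constructor(classOf[VertexInfo]))
val input = new FileInputStream("ldbc_sample/person.vertex.yml") // hypothetical path
val vertexInfo = yaml.load(input).asInstanceOf[VertexInfo]
println(s"label=${vertexInfo.getLabel}, chunk_size=${vertexInfo.getChunk_size}")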

[Feat] GraphAr Spark library to support adding new rows/columns

Is your feature request related to a problem? Please describe.
In real use cases, graph data usually changes continuously, including adding, deleting, and modifying vertices or edges. As part of the incremental management functions, we intend to extend the GraphAr Spark tools to support adding new rows/columns conveniently and efficiently.

Describe the solution you'd like
Support adding new rows/columns to the vertex/edge tables and dumping the new data by generating new GAR files or by appending to/rewriting existing GAR files.


[Feat] Integrate Neo4j spark connector as input data source


Refine document to make users/developers easy to use GraphAr


Improve the performance of high-level graph iterators of the C++ library

Is your feature request related to a problem? Please describe.
An important application of GraphAr is serving out-of-core graph processing scenarios. With the graph data saved as GAR files on disk, GraphAr provides a set of reading interfaces that allow loading part of the graph data into memory when needed, to conduct analytics.
For out-of-core graph processing, disk I/O time usually dominates the overall execution time, so it is critically important that the GraphAr C++ library performs efficiently when traversing vertices/edges through the high-level graph iterators.


๐Ÿ›ฃ๏ธ Roadmap

๐Ÿ›ฃ๏ธ Roadmap

Below is a high-level roadmap for GraphAr to provide a sense of direction of where the project is going. It can change at any point and does not reflect many features and improvements that will also be included along the way. For more granular detail of what will be included in upcoming releases, you can review the project milestones as defined in our Release Process documentation.

  • Format Spec

    • #231
    • Extract property groups property from adjacent list in edge info
    • Use the same property group chunks in all adjacent lists, to reduce total file size
    • #275
  • C++

  • Java

    • cross-language schema compatibility for format.
    • Provide ability that can integrate into HugeGraph
    • Refactor SDK to avoid strong binding to C++/arrow
  • Spark

    • cross-language schema compatibility for format.
    • #320
    • #324
    • #330
  • Python

    • cross-language schema compatibility for format.
    • Support Python SDK

[Feat][FileFormat] CSV should include the header row in chunk file

Is your feature request related to a problem? Please describe.
Currently, CSV chunk files generated by the C++/Spark writers do not contain the header row, which loses the schema information of the data. We should include the header row when generating CSV chunk files.

Describe the solution you'd like
Enable the include_header option of the C++ chunk writer; see: https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow3csv12WriteOptions14include_headerE
Enable the header option in the Spark DataFrame writer; see: https://spark.apache.org/docs/latest/sql-data-sources-csv.html
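
On the Spark side this is a built-in CSV option; a minimal, self-contained sketch (the output path and stand-in data are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-header-demo").master("local[1]").getOrCreate()
import spark.implicits._

// A stand-in for one vertex chunk's rows.
val df = Seq((0L, "Alice"), (1L, "Bob")).toDF("id", "firstName")

// Write the chunk with its header row so the schema survives,
df.write.option("header", "true").mode("overwrite").csv("/tmp/vertex_person_chunk0") // hypothetical path
// and parse the header when reading it back.
val restored = spark.read.option("header", "true").csv("/tmp/vertex_person_chunk0")
restored.show()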


[Feat] Include additional built-in data types for GraphAr libraries

Is your feature request related to a problem? Please describe.
Currently, the GraphAr C++ and Spark libraries support only a few basic data types (BOOL, INT32, INT64, FLOAT, DOUBLE, and STRING). To serve more scenarios, more built-in data types need to be added to the GraphAr libraries.

Describe the solution you'd like
Add more common data types to the GraphAr libraries, such as DATE, TIME, BINARY, STRUCT, MAP, ARRAY, and JSON. Since these types are not always supported by the CSV/ORC/Parquet file types and the C++/Spark standard libraries, careful handling is needed in each case, e.g., performing the necessary type conversions.

