vesoft-inc / nebula-importer

Nebula Graph Importer with Go

License: Apache License 2.0


nebula-importer's Introduction


What is NebulaGraph Importer?

NebulaGraph Importer is a tool to import data into NebulaGraph.

Features

  • Supports multiple data sources; currently local, s3, oss, ftp, sftp, hdfs, and gcs are supported.
  • Supports multiple file formats; currently only csv files are supported.
  • Supports files containing multiple tags, multiple edges, and a mixture of both.
  • Supports data transformations.
  • Supports record filtering.
  • Supports multiple modes, including INSERT, UPDATE, and DELETE.
  • Supports connecting to multiple Graph services with automatic load balancing.
  • Supports retry after failure.
  • Human-readable status printing.

See configuration instructions for more features.

How to Install

From Releases

Download the package from the Releases page and give it execute permission.

You can choose according to your needs; the following installation packages are provided:

  • binary
  • archives
  • apk
  • deb
  • rpm

From go install

$ go install github.com/vesoft-inc/nebula-importer/cmd/nebula-importer@latest

From docker

$ docker pull vesoft/nebula-importer:<version>
$ docker run --rm -ti \
      --network=host \
      -v <config_file>:<config_file> \
      -v <data_dir>:<data_dir> \
      vesoft/nebula-importer:<version>
      --config <config_file>

# config_file: the absolute path to the configuration file.
# data_dir: the absolute path to the data directory; can be omitted if the data is not from a local file.
# version: the version of NebulaGraph Importer.

From Source Code

$ git clone https://github.com/vesoft-inc/nebula-importer
$ cd nebula-importer
$ make build

You can find a binary named nebula-importer in the bin directory.

Configuration Instructions

NebulaGraph Importer's configuration file is in YAML format. You can find some examples in examples.

Configuration options are divided into four groups; a minimal skeleton combining them is sketched after this list.

  • client contains the configuration options related to the NebulaGraph connection client.
  • manager contains the global control options for NebulaGraph Importer.
  • log contains the configuration options related to printing logs.
  • sources contains the data source configuration items.
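
For orientation, the following is a minimal, hypothetical skeleton that combines the four groups; the space name, file name, tag name, and column indexes are illustrative only, and each option is explained in the sections below.

client:
  version: v3
  address: "127.0.0.1:9669"
manager:
  spaceName: my_space            # hypothetical space name
log:
  level: INFO
sources:
  - path: ./person.csv           # hypothetical local CSV file
    tags:
      - name: Person             # hypothetical tag name
        id:
          type: "STRING"
          index: 0               # vid comes from column 0
        props:
          - name: "firstName"
            type: "STRING"
            index: 1             # property comes from column 1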

client

client:
  version: v3
  address: "127.0.0.1:9669"
  user: root
  password: nebula
  ssl:
    enable: true
    certPath: "your/cert/file/path"
    keyPath: "your/key/file/path"
    caPath: "your/ca/file/path"
    insecureSkipVerify: false
  concurrencyPerAddress: 16
  reconnectInitialInterval: 1s
  retry: 3
  retryInitialInterval: 1s
  • client.version: Required. Specifies the NebulaGraph version; currently only v3 is supported.
  • client.address: Required. The address of the Graph service in NebulaGraph.
  • client.user: Optional. The user of NebulaGraph. The default value is root.
  • client.password: Optional. The password of NebulaGraph. The default value is nebula.
  • client.ssl: Optional. SSL-related configuration.
  • client.ssl.enable: Optional. Specifies whether to enable SSL authentication. The default value is false.
  • client.ssl.certPath: Required. Specifies the path of the certificate file.
  • client.ssl.keyPath: Required. Specifies the path of the private key file.
  • client.ssl.caPath: Required. Specifies the path of the certification authority file.
  • client.ssl.insecureSkipVerify: Optional. Specifies whether the client skips verifying the server's certificate chain and host name. The default value is false.
  • client.concurrencyPerAddress: Optional. The number of client connections to each Graph address in NebulaGraph. The default value is 10.
  • client.reconnectInitialInterval: Optional. The initial interval for reconnecting to NebulaGraph. The default value is 1s.
  • client.retry: Optional. The number of retries for failed nGQL executions in the NebulaGraph client. The default value is 3.
  • client.retryInitialInterval: Optional. The initial interval between retries. The default value is 1s.
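
Since only client.version and client.address are required, a client block without SSL can be as small as the following sketch; user and password fall back to their defaults (root and nebula) when omitted.

client:
  version: v3
  address: "127.0.0.1:9669"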

manager

manager:
  spaceName: basic_int_examples
  batch: 128
  readerConcurrency: 50
  importerConcurrency: 512
  statsInterval: 10s
  hooks:
    before:
      - statements:
          - UPDATE CONFIGS storage:wal_ttl=3600;
          - UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = true };
      - statements:
          - |
            DROP SPACE IF EXISTS basic_int_examples;
            CREATE SPACE IF NOT EXISTS basic_int_examples(partition_num=5, replica_factor=1, vid_type=int);
            USE basic_int_examples;
        wait: 10s
    after:
      - statements:
          - |
            UPDATE CONFIGS storage:wal_ttl=86400;
            UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = false };
  • manager.spaceName: Required. Specifies which space the data is imported into.
  • manager.batch: Optional. Specifies the batch size of the inserted data for all sources. The default value is 128.
  • manager.readerConcurrency: Optional. Specifies the concurrency with which the reader reads from the sources. The default value is 50.
  • manager.importerConcurrency: Optional. Specifies the concurrency of generating the nGQL insert statements and calling the client to import them. The default value is 512.
  • manager.statsInterval: Optional. Specifies the interval at which statistics are printed. The default value is 10s.
  • manager.hooks.before: Optional. Configures the statements executed before the import begins.
    • manager.hooks.before.[].statements: Defines the list of statements.
    • manager.hooks.before.[].wait: Optional. Defines the waiting time after executing the above statements.
  • manager.hooks.after: Optional. Configures the statements executed after the import is complete.
    • manager.hooks.after.[].statements: Optional. Defines the list of statements.
    • manager.hooks.after.[].wait: Optional. Defines the waiting time after executing the above statements.
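
Because manager.spaceName is the only required option in this group, a minimal manager block can be reduced to the following sketch (the space name is hypothetical); all other options fall back to the defaults listed above.

manager:
  spaceName: my_space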

log

log:
  level: INFO
  console: true
  files:
    - logs/nebula-importer.log
  • log.level: Optional. Specifies the log level; optional values are DEBUG, INFO, WARN, ERROR, PANIC, or FATAL. The default value is INFO.
  • log.console: Optional. Specifies whether to print logs to the console. The default value is true.
  • log.files: Optional. Specifies which files to print logs to.

sources

sources is the configuration of the data source list; each data source contains the data source information, data processing, and schema mapping.

The following are the relevant configuration items; a minimal source entry is sketched after this list.

  • batch specifies the batch size of the inserted data for this source. It takes precedence over manager.batch.
  • path, s3, oss, ftp, sftp, hdfs, and gcs are the information configurations of the various data sources, and only one of them can be configured.
  • csv describes the csv file format information.
  • tags describes the schema definition for tags.
  • edges describes the schema definition for edges.
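
The following sketch shows a minimal, hypothetical source entry for a local CSV file containing one edge type; the file name, edge name, and column indexes are illustrative, and each item is described in detail below.

sources:
  - path: ./knows.csv      # hypothetical local CSV file
    batch: 256             # overrides manager.batch for this source
    csv:
      delimiter: ","
      withHeader: false
    edges:
      - name: KNOWS        # hypothetical edge type
        src:
          id:
            type: "STRING"
            index: 0       # source vid comes from column 0
        dst:
          id:
            type: "STRING"
            index: 1       # destination vid comes from column 1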

path

It only needs to be configured for local file data sources.

path: ./person.csv
  • path: Required. Specifies the path where the data files are stored. If a relative path is used, it is resolved relative to the directory of the current configuration file. Wildcard filenames are also supported, for example ./follower-*.csv; please make sure that all matching files share the same schema.

s3

It only needs to be configured for s3 data sources.

s3:
  endpoint: <endpoint>
  region: <region>
  bucket: <bucket>
  key: <key>
  accessKeyID: <Access Key ID>
  accessKeySecret: <Access Key Secret>
  • endpoint: Optional. The endpoint of s3 service, can be omitted if using aws s3.
  • region: Required. The region of s3 service.
  • bucket: Required. The bucket of file in s3 service.
  • key: Required. The object key of file in s3 service.
  • accessKeyID: Optional. The Access Key ID of s3 service. If it is public data, no need to configure.
  • accessKeySecret: Optional. The Access Key Secret of s3 service. If it is public data, no need to configure.
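
For example, a publicly readable object can be described without any credentials at all; the region, bucket, and key below are hypothetical.

s3:
  region: us-east-1
  bucket: example-bucket
  key: data/person.csv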

oss

It only needs to be configured for oss data sources.

oss:
  endpoint: <endpoint>
  bucket: <bucket>
  key: <key>
  accessKeyID: <Access Key ID>
  accessKeySecret: <Access Key Secret>
  • endpoint: Required. The endpoint of oss service.
  • bucket: Required. The bucket of file in oss service.
  • key: Required. The object key of file in oss service.
  • accessKeyID: Required. The Access Key ID of oss service.
  • accessKeySecret: Required. The Access Key Secret of oss service.

ftp

It only needs to be configured for ftp data sources.

ftp:
  host: 192.168.0.10
  port: 21
  user: <user>
  password: <password>
  path: <path of file>
  • host: Required. The host of ftp service.
  • port: Required. The port of ftp service.
  • user: Required. The user of ftp service.
  • password: Required. The password of ftp service.
  • path: Required. The path of file in the ftp service.

sftp

It only needs to be configured for sftp data sources.

sftp:
  host: 192.168.0.10
  port: 22
  user: <user>
  password: <password>
  keyFile: <keyFile>
  keyData: <keyData>
  passphrase: <passphrase>
  path: <path of file>
  • host: Required. The host of sftp service.
  • port: Required. The port of sftp service.
  • user: Required. The user of sftp service.
  • password: Optional. The password of sftp service.
  • keyFile: Optional. The ssh key file path of sftp service.
  • keyData: Optional. The ssh key file content of sftp service.
  • passphrase: Optional. The ssh key passphrase of sftp service.
  • path: Required. The path of file in the sftp service.
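
Because password, keyFile, keyData, and passphrase are all optional, you can authenticate either with a password or with an SSH key; the following sketch shows key-based authentication with hypothetical values.

sftp:
  host: 192.168.0.10
  port: 22
  user: importer                        # hypothetical user
  keyFile: /home/importer/.ssh/id_rsa   # key-based authentication instead of password
  path: /data/person.csv                # hypothetical path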

hdfs

It only needs to be configured for hdfs data sources.

hdfs:
  address: 192.168.0.10:8020
  user: <user>
  servicePrincipalName: <Kerberos Service Principal Name>
  krb5ConfigFile: <Kerberos config file>
  ccacheFile: <Kerberos ccache file>
  keyTabFile: <Kerberos keytab file>
  password: <Kerberos password>
  dataTransferProtection: <Kerberos Data Transfer Protection>
  disablePAFXFAST: false
  path: <path of file>
  • address: Required. The address of the hdfs service.
  • user: Optional. The user of the hdfs service.
  • servicePrincipalName: Optional. The Kerberos service principal name of the hdfs service when Kerberos is enabled.
  • krb5ConfigFile: Optional. The Kerberos config file of the hdfs service when Kerberos is enabled. The default is /etc/krb5.conf.
  • ccacheFile: Optional. The Kerberos ccache file of the hdfs service when Kerberos is enabled.
  • keyTabFile: Optional. The Kerberos keytab file of the hdfs service when Kerberos is enabled.
  • password: Optional. The Kerberos password of the hdfs service when Kerberos is enabled.
  • dataTransferProtection: Optional. The data transfer protection of the hdfs service.
  • disablePAFXFAST: Optional. Whether to prohibit the client from using PA_FX_FAST.
  • path: Required. The path of the file in the hdfs service.

gcs

It only needs to be configured for gcs data sources.

gcs:
  endpoint: <endpoint>
  bucket: <bucket>
  key: <key>
  credentialsFile: <Service account or refresh token JSON credentials file>
  credentialsJSON: <Service account or refresh token JSON credentials>
  withoutAuthentication: <false | true>
  • endpoint: Optional. The endpoint of GCS service.
  • bucket: Required. The bucket of file in GCS service.
  • key: Required. The object key of file in GCS service.
  • credentialsFile: Optional. Path to the service account or refresh token JSON credentials file. Not required for public data.
  • credentialsJSON: Optional. Content of the service account or refresh token JSON credentials file. Not required for public data.
  • withoutAuthentication: Optional. Specifies that no authentication should be used, defaults to false.
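
For publicly readable objects, authentication can be switched off explicitly; the bucket and key below are hypothetical.

gcs:
  bucket: example-public-bucket
  key: data/person.csv
  withoutAuthentication: true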

batch

batch: 256
  • batch: Optional. Specifies the batch size for this source of the inserted data. The priority is greater than manager.batch.

csv

csv:
  delimiter: ","
  withHeader: false
  lazyQuotes: false
  comment: ""
  • delimiter: Optional. Specifies the delimiter of the CSV files. The default value is ",". Only a one-character delimiter is supported.
  • withHeader: Optional. Specifies whether to ignore the first record in the csv file. The default value is false.
  • lazyQuotes: Optional. If lazyQuotes is true, a quote may appear in an unquoted field and a non-doubled quote may appear in a quoted field.
  • comment: Optional. Specifies the comment character. Lines beginning with the comment character, without preceding whitespace, are ignored.
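
For example, a pipe-delimited file with a header row and '#' comment lines could be described as follows; the delimiter and comment characters are purely illustrative.

csv:
  delimiter: "|"
  withHeader: true   # skip the first record
  comment: "#"       # skip lines starting with '#'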

tags

tags:
- name: Person
  mode: INSERT
  filter:
    expr: (Record[1] == "Mahinda" or Record[1] == "Michael") and Record[3] == "male"
  id:
    type: "STRING"
    function: "hash"
    index: 0
  ignoreExistedIndex: true
  props:
    - name: "firstName"
      type: "STRING"
      index: 1
    - name: "lastName"
      type: "STRING"
      index: 2
    - name: "gender"
      type: "STRING"
      index: 3
      nullable: true
      defaultValue: male
    - name: "birthday"
      type: "DATE"
      index: 4
      nullable: true
      nullValue: _NULL_
    - name: "creationDate"
      type: "DATETIME"
      index: 5
    - name: "locationIP"
      type: "STRING"
      index: 6
    - name: "browserUsed"
      type: "STRING"
      index: 7
      nullable: true
      alternativeIndices:
        - 6

# concatItems examples
tags:
- name: Person
  id:
    type: "STRING"
    concatItems:
      - "abc"
      - 1
    function: hash
  • name: Required. The tag name.
  • mode: Optional. The mode for processing data; optional values are INSERT, UPDATE, or DELETE. The default is INSERT.
  • filter: Optional. The data filtering configuration; filter.expr is the expression that decides which records are imported, as in the example above.
  • id: Required. Describes the tag ID information.
    • type: Optional. The type of the ID. The default value is STRING.
    • index: Optional. The column number in the records. Required if concatItems is not configured.
    • concatItems: Optional. The items concatenated to generate the ID. Each item can be a string, an int, or a mixture of both: a string represents a constant and an int represents an index column. All items are concatenated in order. If set, the above index has no effect.
    • function: Optional. The function used to generate the ID. Currently only hash is supported.
  • ignoreExistedIndex: Optional. Specifies whether to enable IGNORE_EXISTED_INDEX. The default value is true.
  • props: Required. Describes the tag props definition.
    • name: Required. The property name; it must be the same as the tag property in NebulaGraph.
    • type: Optional. The property type; currently BOOL, INT, FLOAT, DOUBLE, STRING, TIME, TIMESTAMP, DATE, DATETIME, GEOGRAPHY, GEOGRAPHY(POINT), GEOGRAPHY(LINESTRING), and GEOGRAPHY(POLYGON) are supported. The default value is STRING.
    • index: Required. The column number in the records.
    • nullable: Optional. Whether this property can be NULL; optional values are true or false. The default is false.
    • nullValue: Optional. Ignored when nullable is false. The value used to determine whether the property is NULL: the property is set to NULL when the record value equals nullValue. The default is "".
    • alternativeIndices: Optional. Ignored when nullable is false. The property is fetched from the record according to these indices, in order, until a value not equal to nullValue is found.
    • defaultValue: Optional. Ignored when nullable is false. The default value of the property, used when all the values obtained by index and alternativeIndices are nullValue.
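
To make the NULL handling concrete, the following sketch (with hypothetical column numbers and values) first reads column 7, falls back to column 6 when it equals the nullValue marker, and finally uses a default value when both columns contain the marker.

props:
  - name: "browserUsed"
    type: "STRING"
    index: 7                  # read column 7 first
    nullable: true
    nullValue: "_NULL_"       # treat this literal as NULL
    alternativeIndices:
      - 6                     # if column 7 equals "_NULL_", try column 6
    defaultValue: "unknown"   # used when every candidate column equals "_NULL_"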

edges

edges:
- name: KNOWS
  mode: INSERT
  filter:
    expr: (Record[1] == "Mahinda" or Record[1] == "Michael") and Record[3] == "male"
  src:
    id:
      type: "INT"
      index: 0
  dst:
    id:
      type: "INT"
      index: 1
  rank:
    index: 0
  ignoreExistedIndex: true
  props:
    - name: "creationDate"
      type: "DATETIME"
      index: 2
      nullable: true
      nullValue: _NULL_
      defaultValue: 0000-00-00T00:00:00
  • name: Required. The edge name.
  • mode: Optional. The mode here is similar to mode in the tags above.
  • filter: Optional. The filter here is similar to filter in the tags above.
  • src: Required. Describes the source definition for the edge.
  • src.id: Required. The id here is similar to id in the tags above.
  • dst: Required. Describes the destination definition for the edge.
  • dst.id: Required. The id here is similar to id in the tags above.
  • rank: Optional. Describes the rank definition for the edge.
  • rank.index: Required. The column number in the records.
  • props: Optional. Similar to the props in the tags, but for edges.
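
Since src.id and dst.id accept the same options as the tag id above, an edge endpoint can, for example, be generated by hashing a string column; the following fragment is only an illustrative sketch based on that description.

src:
  id:
    type: "STRING"
    index: 0
    function: hash   # generate the vid by hashing the value in column 0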

See the Configuration Reference for details on the configurations.


nebula-importer's Issues

CSV Data import error with timestamp data type

I'm trying to import data in csv format to nebula graph using nebula-importer tool.
This is the schema for tag user:

CREATE TAG user(id int, screen_name string, followers_count int, friends_count int, created_at timestamp);

And here are two rows of my csv file to make it clear:

"0",":User","2011-09-13T15:13:20","372861228","danieleverdi",,,
"2",":User","2020-06-02T14:52:27","1267831404525690880","mariorsossi",,,

The problem is related to the timestamp string format: as a matter of fact I get the following error:

2021/05/11 21:54:43 --- START OF NEBULA IMPORTER ---
2021/05/11 21:54:43 [WARN] config.go:217: Invalid retry option in clientSettings.retry, reset to 1
2021/05/11 21:54:43 [WARN] config.go:168: You have not configured whether to remove generated temporary files, reset to default value. removeTempFiles: false
2021/05/11 21:54:43 [INFO] connection_pool.go:74: [nebula-clients] connection pool is initialized successfully
2021/05/11 21:54:43 [INFO] clientmgr.go:28: Create 10 Nebula Graph clients
2021/05/11 21:54:43 [INFO] reader.go:64: Start to read file(0): /home/justin/Desktop/progetto_dm/nebula-docker-compose/users.csv, schema: < :IGNORE,:IGNORE,user.created_at:timestamp,:VID(int)/user.id:int,user.screen_name:string,user.followers_count:int,user.friends_count:int >
2021/05/11 21:54:43 [INFO] reader.go:180: Total lines of file(/home/justin/Desktop/progetto_dm/nebula-docker-compose/users.csv) is: 2, error lines: 0
2021/05/11 21:54:44 [ERROR] handler.go:63: Client 2 fail to execute: INSERT VERTEX user(created_at,id,screen_name,followers_count,friends_count) VALUES 1267831404525690880: (2020-06-02T14:52:27,1267831404525690880,"MarcelloLyotard",,);, ErrMsg: SyntaxError: syntax error near T14', ErrCode: -7
2021/05/11 21:54:44 [ERROR] handler.go:63: Client 1 fail to execute: INSERT VERTEX user(created_at,id,screen_name,followers_count,friends_count) VALUES 372861228: (2011-09-13T15:13:20,372861228,"danielenavone1",,);, ErrMsg: SyntaxError: syntax error near T15', ErrCode: -7
2021/05/11 21:54:44 [INFO] statsmgr.go:61: Done(/home/justin/Desktop/progetto_dm/nebula-docker-compose/users.csv): Time(1.03s), Finished(2), Failed(2), Latency AVG(0us), Batches Req AVG(0us), Rows AVG(1.95/s)
2021/05/11 21:54:44 Total 2 lines fail to insert into nebula graph database
2021/05/11 21:54:45 --- END OF NEBULA IMPORTER ---

In both cases a syntax error occurs because of the letter "T"; however, even after removing it the error persists.
As you can see, in the INSERT VERTEX statement the quotes at the beginning and end of the string are removed.
So my question is: how should I format the timestamp field in the csv to make the import process work?

Thanks for your help.

importing failed

If the :LABEL column is missing from the CSV file, the import cannot succeed, for example:

:VID(string) player.age:int player.name:string
player100 22 lzy
player101 24 zy
player102 25 gc
player103 26 jh

report errors:

vid is not niljie: %!(EXTRA string=:VID, *config.VID=&{0xc00001b040 <nil> <nil>})panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x536e27]

The import only succeeds after adding a :LABEL column.

The configuration file is as follows:

version: v2
description: example
removeTempFiles: false
clientSettings:
  retry: 3
  concurrency: 2 # number of graph clients
  channelBufferSize: 1
  space: basketballplayer
  connection:
    user: root
    password: nebula
    address: 192.168.153.10:9669
  postStart:
    commands: |
      UPDATE CONFIGS storage:wal_ttl=3600;
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = true };
      DROP SPACE IF EXISTS basketballplayer;
      CREATE SPACE IF NOT EXISTS basketballplayer(partition_num=5, replica_factor=1, vid_type=FIXED_STRING(20));
      USE basketballplayer;
      CREATE TAG player(name string, age int);
      CREATE TAG team(name string);
      CREATE EDGE follow(degree int);
      CREATE EDGE serve(start_year int, end_year int);
    afterPeriod: 8s
  preStop:
    commands: |
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = false };
      UPDATE CONFIGS storage:wal_ttl=86400;
logPath: ./err/test.log
files:
  - path: ./basketball.csv
    failDataPath: ./err/course.csv
    batchSize: 2
    inOrder: true
    type: csv
    csv:
      withHeader: true
      withLabel: false
    schema:
      type: vertex

import result error

schema:

CREATE SPACE IF NOT EXISTS sf1(PARTITION_NUM = 24, REPLICA_FACTOR = 3, vid_type = int64);
USE sf1;
CREATE TAG IF NOT EXISTS `Comment`(`creationDate` string,`locationIP` string,`browserUsed` string,`content` string,`length` int);

comment.csv

Import a file with the wrong format.
Config:

{
    "config": {
        "version": "v2",
        "description": "web console import",
        "clientSettings": {
            "concurrency": 10,
            "channelBufferSize": 128,
            "space": "sf1",
            "connection": {
                "user": "1",
                "password": "2",
                "address": "192.168.8.157:9669"
            }
        },
        "logPath": "/Users/xxx/Documents/Work/nebula-studio/tmp/upload/tmp/import.log",
        "files": [{
            "path": "/Users/xxx/Documents/Work/nebula-studio/tmp/upload/comment.csv",
            "failDataPath": "/Users/xxx/Documents/Work/nebula-studio/tmp/upload/tmp/err/数据源 1Fail.csv",
            "batchSize": 10,
            "type": "csv",
            "csv": {
                "withHeader": false,
                "withLabel": false
            },
            "schema": {
                "type": "vertex",
                "vertex": {
                    "vid": {
                        "index": 0,
                        "type": "int"
                    },
                    "tags": [{
                        "name": "Comment",
                        "props": [{
                            "name": "creationDate",
                            "type": "string",
                            "index": 1
                        }, {
                            "name": "locationIP",
                            "type": "string",
                            "index": 2
                        }, {
                            "name": "browserUsed",
                            "type": "string",
                            "index": 3
                        }, {
                            "name": "content",
                            "type": "string",
                            "index": 4
                        }, {
                            "name": "length",
                            "type": "int",
                            "index": 5
                        }]
                    }]
                }
            }
        }]
    },
    "mountPath": "/Users/xxx/Documents/Work/nebula-studio/tmp/upload"
}

(screenshot omitted)

read file line error, but log shows Failed(0)
comment.csv

support real csv headers

Currently, if I set csv.withHeader = true, I have to ensure the headers of the source csv are in the Nebula-defined format, e.g. :DST_VID,follow.likeness:double,:SRC_VID,:RANK. I think these are not the user's real csv headers; they are Nebula's, and would be better set in the config, e.g. as a field mapping from the real source to Nebula. Thanks.

Support CSV file with BOM in windows

(screenshot omitted)

{
  "version": "v2",
  "description": "web console import",
  "clientSettings": {
    "concurrency": 10,
    "channelBufferSize": 128,
    "space": "ashare_1",
    "connection": {
      "user": "user",
      "password": "123",
      "address": "192.168.10.217:9669"
    }
  },
  "logPath": "/Users/hetaohua/Documents/Projects/nebula-graph-studio/tmp/upload/tmp/import.log",
  "files": [
    {
      "path": "/Users/hetaohua/Documents/Projects/nebula-graph-studio/tmp/upload/nodes.csv",
      "failDataPath": "/Users/hetaohua/Documents/Projects/nebula-graph-studio/tmp/upload/tmp/err/数据源 1Fail.csv",
      "batchSize": 10,
      "type": "csv",
      "csv": {
        "withHeader": false,
        "withLabel": false
      },
      "schema": {
        "type": "vertex",
        "vertex": {
          "vid": {
            "index": 0,
            "type": "int"
          },
          "tags": [
            {
              "name": "stocks",
              "props": [
                {
                  "name": "stock_id",
                  "type": "string",
                  "index": 1
                },
                {
                  "name": "name",
                  "type": "string",
                  "index": 2
                },
                {
                  "name": "industry",
                  "type": "string",
                  "index": 3
                }
              ]
            }
          ]
        }
      }
    },
    {
      "path": "/Users/hetaohua/Documents/Projects/nebula-graph-studio/tmp/upload/edges.csv",
      "failDataPath": "/Users/hetaohua/Documents/Projects/nebula-graph-studio/tmp/upload/tmp//err/Edge 1Fail.csv",
      "batchSize": 10,
      "type": "csv",
      "csv": {
        "withHeader": false,
        "withLabel": false
      },
      "schema": {
        "type": "edge",
        "edge": {
          "name": "relation",
          "srcVID": {
            "index": 0,
            "type": "int"
          },
          "dstVID": {
            "index": 1,
            "type": "int"
          },
          "withRanking": false,
          "props": [
            {
              "name": "weight",
              "type": "double",
              "index": 2
            }
          ]
        }
      }
    }
  ]
}

edges.csv
nodes.csv

Add some test

I find that this project has only a couple of test files (just 2 *_test.go files). How about adding some tests? Maybe I can help you :)

Should stop inserting if any insertion fails

Controlled by a parameter in the YAML file.

fail-fast: true

If I write a wrong YAML file, a lot of content is printed to the console and I have to press Ctrl+C to stop it.

So if the importer could fail fast, I would be grateful.

Delete Vertex with label specified in CSV

When I specify - (minus) in the :LABEL column of the CSV file for import, what is the meaning of the other TAG values specified?

In my test case, when a VertexID and a TAG value are specified for deletion, you would expect that only that TAG's data would be deleted for the specific VertexID,
but that is not the case: all TAG values are deleted, so the complete vertex is deleted.

As far as I can find in the Nebula documentation, only DELETE VERTEX is supported, not deleting a specific TAG of a vertex.
So I would expect that you cannot delete a specific TAG of a vertex and leave the other TAGs untouched, which makes :LABEL with TAG values misleading, because the importer will delete the complete vertex for the specified ID.

Configs in .yaml File

Hello Team,
When importing a csv file using a .yaml config, if we state a vertex/edge property without its data type, the importer doesn't throw an exception and doesn't do anything. For example:

 .....
edge:
        name: friend
        withRanking: false
        srcVID:
          index: 0
        dstVID:
          index: 1
        props:
          - name: startdete
         #### type: timestamp
            index: 2

This seems to work very well but actually does not. Maybe it should be an error.
Thanks.

panic when importing a large number of files

question mentioned at https://discuss.nebula-graph.com.cn/t/topic/1579/21?u=les1ie

If I try to import a large number of files, e.g. 1000 csv files, with a yaml file of more than 10000 lines, nebula-importer panics.

How to reproduce:

  1. Generate sample csv files:
from pathlib import Path
import os

dump_dir = Path('./dump')
if not os.path.exists(dump_dir):
    os.mkdir(dump_dir)


def generate_csv():
    num = 10000
    for i in range(num):
        with open(f'{dump_dir}/vertex_{i}.csv', 'w') as f:
            f.write("123\n")


generate_csv()
  2. Download the erroring config.yaml:
    out.zip

  3. Import the csv files:

python3 reproduce.py
cp path_to_nebula_importer_exec dump/
cd dump
./nebula_importer -c out.yaml
  4. Screenshot:
    (screenshot omitted)

self-adaptive import flow control

Not sure if it's practical (or worth the effort), but is it possible to introduce a mechanism similar to TCP RFC 1323 (starting slowly and gradually converging on a proper batch size, and self-adapting when the storage handling capability changes) to help optimize the buffer size, batch size, etc. of each import activity and enable a speed close to the best out of the box?

This could potentially decouple the effort of tuning those parameters for each Nebula cluster (or even for clusters of a different shape/workload, where the capability to handle the import flow could vary).

Support for ignoring some columns in the configuration file when the CSV file has no header line.

As your README says:

Note: The order of properties in the above props must be the same as that of the corresponding data in the CSV data file.
If I have a file course.csv:

name,teacher,id
math,Mr Liu,1
computer,Mr Wang,2

I don't want the first two fields, name and teacher; I only want to import the id field. How can I guarantee the order of the props in the config? Or do I only need to add an :IGNORE field to the csv file? Is there an option like this:

...
   vertex:
       tags:
           - name: course
             props:
                 - ignore: true  # add support for this field, meaning ignore the corresponding column in the csv
                 - ignore: true
                 - name: id
                   type: string
...

thanks

JSON Lines support

Is it possible to also have JSON Lines support? It would mean a lot for fresh users to be able to point the headless importer at their existing file-based sources and load them into Nebula :)

Configuring multiple files for the same schema

Hello, I'd like to ask a question. Suppose I configure two paths in the yaml, student_0.csv and student_1.csv, both used to import vertex data for the schema student (which has the property name). The two files may contain student data with the same vertex id. My questions are:

  1. Is the number of vertices finally imported the union of the data in the two files?
  2. For the same student (same vertex id), if the name property is "A" in student_0.csv but "B" in student_1.csv, what will the name property be in the graph space in the end?

error: Prop index 1 out range 1 of record([847SawUH57a,0.01]))

client

./nebula-importer-linux-amd64-v2.6.0

the csv example

83Vywxuirk7,6.323
83Vxbbzymnl,1.016
83VxVAtOhlM,3.437
83WIM5lP3IQ,0.01
83WWBWpipQl,0.01
83WdUurhv1A,0.01
83WOl3YKkDQ,0.01
83WWBYAtIXl,0.01
83WykP7jnLE,0.01
83W68p5flDi,1.156
83WcD54qUgd,0.01

the YAML config

version: v2
description: journal
removeTempFiles: false
clientSettings:
  retry: 3
  concurrency: 2 # number of graph clients
  channelBufferSize: 128
  space: dataengine
  connection:
    user: root
    password: 123456
    address: 192.168.110.149:31883
  postStart:
    commands: |
      UPDATE CONFIGS storage:wal_ttl=3600;
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = true };
    afterPeriod: 8s
  preStop:
    commands: |
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = false };
      UPDATE CONFIGS storage:wal_ttl=86400;
logPath: ./err/t_journal.log
files:
  - path: ./t_journal.csv
    failDataPath: ./err/t_journal.csv
    batchSize: 2000
    type: csv
    csv:
      withHeader: false
      withLabel: false
      delimiter: '|'
    schema:
      type: vertex
      vertex:
        vid:
          index: 0
          type: string
        tags:
          - name: t_journal
            props:
              - name: journal_id
                type: string
                index: 0
              - name: impact_factor
                type: double
                index: 1

2021/10/29 23:04:24 [ERROR] handler.go:63: Client 0 fail to execute: THERE_ARE_SOME_ERRORS(tag: {0xc000156180 [0xc000142108 0xc000142120]}, error: Prop index 1 out range 1 of record([847SawUH57a,0.01])), ErrMsg: SyntaxError: syntax error near `THERE_ARE_SOME_ERRORS', ErrCode: -1004

File doesn't exist

[root@VM-95-249-centos /data/graphdb/testUpload]# docker run --rm -ti       --network=host       -v /data/graphdb/testUpload/tryImport.yaml:/data/graphdb/testUpload/tryImport.yaml       -v  /data/graphdb/testUpload/       vesoft/nebula-importer:v1      --config /data/graphdb/testUpload/tryImport.yaml
2021/03/20 15:38:19 --- START OF NEBULA IMPORTER ---
2021/03/20 15:38:19 File(/data/graphdb/testUpload/userInfo.csv) doesn't exist
2021/03/20 15:38:20 --- END OF NEBULA IMPORTER ---

but I have the file:

[root@VM-95-249-centos /data/graphdb/testUpload]# ll
total 16
-rw-r--r-- 1 root root  36 Mar 20 23:17 courseInfo.csv
-rw-r--r-- 1 root root 187 Mar 20 23:17 take.csv
-rw-r--r-- 1 root root 997 Mar 20 23:37 tryImport.yaml
-rw-r--r-- 1 root root  34 Mar 20 23:17 userInfo.csv

why???

wrong method name

2021/11/25 18:28:43 --- START OF NEBULA IMPORTER ---
2021/11/25 18:28:44 failed to open connection, error: failed to verify client version: verifyClientVersion failed: wrong method name
2021/11/25 18:28:45 --- END OF NEBULA IMPORTER ---
exit status 200

version: v2
description: example
removeTempFiles: false
clientSettings:
  retry: 3
  concurrency: 1 # number of graph clients
  channelBufferSize: 128
  space: test
  connection:
    user: root
    password: password
    address: *****
logPath: ./err/test.log

nebula version 2.5.0

import error

Trying to run the importer on the same machine as Nebula, with this command:

$ docker run --rm -ti --network=host -v /opt/nebula/import.yaml:/data/import.yaml -v ~/:/data/ vesoft/nebula-importer --config /data/import.yaml

I get the following error:

2020/10/04 11:53:18 --- START OF NEBULA IMPORTER ---
2020/10/04 11:53:18 [INFO] config.go:399: files[1].schema.vertex is nil
2020/10/04 11:53:29 dial tcp 172.29.3.1:3699: i/o timeout
2020/10/04 11:53:30 --- END OF NEBULA IMPORTER ---

Also, when trying to use an external domain name as the Nebula connection address, I get the following error:

2020/10/04 11:57:45 --- START OF NEBULA IMPORTER ---
2020/10/04 11:57:45 [INFO] config.go:399: files[1].schema.vertex is nil
2020/10/04 11:57:45 [INFO] clientmgr.go:28: Create 2 Nebula Graph clients
2020/10/04 11:57:45 [INFO] reader.go:64: Start to read file(0): /data/users_profile.csv, schema: < :VID,user.username:string >
panic: send on closed channel

goroutine 24 [running]:
github.com/vesoft-inc/nebula-importer/pkg/reader.(*Batch).requestClient(0xc000232180)
	/home/nebula-importer/pkg/reader/batch.go:66 +0x14c
github.com/vesoft-inc/nebula-importer/pkg/reader.(*Batch).Add(0xc000232180, 0x1, 0xc00000ec40, 0x2, 0x2)
	/home/nebula-importer/pkg/reader/batch.go:36 +0xa0
github.com/vesoft-inc/nebula-importer/pkg/reader.(*FileReader).Read(0xc0002320c0, 0x0, 0x0)
	/home/nebula-importer/pkg/reader/reader.go:162 +0x570
github.com/vesoft-inc/nebula-importer/pkg/cmd.(*Runner).Run.func2(0xc000014a40, 0xc000014a80, 0xc0002320c0, 0xc000019480, 0x17)
	/home/nebula-importer/pkg/cmd/cmd.go:70 +0x40
created by github.com/vesoft-inc/nebula-importer/pkg/cmd.(*Runner).Run
	/home/nebula-importer/pkg/cmd/cmd.go:69 +0x705

PS: I can connect to the server with Nebula Graph Studio with no problem, but as the file size is so big, I cannot use it for the import.

Support multiple Nebula Graph Servers

For example configuration:

clientSettings:
  connection:
    address: 192.168.8.5:3699,192.168.8.6:3699,192.168.8.7:3699

We should balance the workload across the above 3 servers.
