vesoft-inc / nebula-importer

Nebula Graph Importer with Go

License: Apache License 2.0


nebula-importer's Introduction


What is NebulaGraph Importer?

NebulaGraph Importer is a tool to import data into NebulaGraph.

Features

  • Supports multiple data sources; currently local, s3, oss, ftp, sftp, hdfs, and gcs are supported.
  • Supports multiple file formats; currently only csv files are supported.
  • Supports files containing multiple tags, multiple edges, and a mixture of both.
  • Supports data transformations.
  • Supports record filtering.
  • Supports multiple modes, including INSERT, UPDATE, and DELETE.
  • Supports connecting to multiple Graph services with automatic load balancing.
  • Supports retry after failure.
  • Human-readable status printing.

See configuration instructions for more features.

How to Install

From Releases

Download the package from the Releases page and give it execute permission.

You can choose according to your needs; the following installation packages are provided:

  • binary
  • archives
  • apk
  • deb
  • rpm

From go install

$ go install github.com/vesoft-inc/nebula-importer/cmd/nebula-importer@latest

From docker

$ docker pull vesoft/nebula-importer:<version>
$ docker run --rm -ti \
      --network=host \
      -v <config_file>:<config_file> \
      -v <data_dir>:<data_dir> \
      vesoft/nebula-importer:<version>
      --config <config_file>

# config_file: the absolute path to the configuration file.
# data_dir: the absolute path to the data directory; can be omitted if the data is not from a local file.
# version: the version of NebulaGraph Importer.

From Source Code

$ git clone https://github.com/vesoft-inc/nebula-importer
$ cd nebula-importer
$ make build

You can find a binary named nebula-importer in the bin directory.

Configuration Instructions

NebulaGraph Importer's configuration file is in YAML format. You can find some examples in examples.

Configuration options are divided into four groups; a minimal skeleton combining them is sketched after this list.

  • client contains the configuration options related to the NebulaGraph connection client.
  • manager contains the global control options for NebulaGraph Importer.
  • log contains the configuration options related to printing logs.
  • sources contains the data source configuration items.
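
For orientation, the following is a minimal, hypothetical skeleton that combines the four groups; the space name, file name, tag name, and column indexes are illustrative only, and each option is explained in the sections below.

client:
  version: v3
  address: "127.0.0.1:9669"
manager:
  spaceName: my_space            # hypothetical space name
log:
  level: INFO
sources:
  - path: ./person.csv           # hypothetical local CSV file
    tags:
      - name: Person             # hypothetical tag name
        id:
          type: "STRING"
          index: 0               # vid comes from column 0
        props:
          - name: "firstName"
            type: "STRING"
            index: 1             # property comes from column 1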

client

client:
  version: v3
  address: "127.0.0.1:9669"
  user: root
  password: nebula
  ssl:
    enable: true
    certPath: "your/cert/file/path"
    keyPath: "your/key/file/path"
    caPath: "your/ca/file/path"
    insecureSkipVerify: false
  concurrencyPerAddress: 16
  reconnectInitialInterval: 1s
  retry: 3
  retryInitialInterval: 1s
  • client.version: Required. Specifies the NebulaGraph version; currently only v3 is supported.
  • client.address: Required. The address of the Graph service in NebulaGraph.
  • client.user: Optional. The user of NebulaGraph. The default value is root.
  • client.password: Optional. The password of NebulaGraph. The default value is nebula.
  • client.ssl: Optional. SSL-related configuration.
  • client.ssl.enable: Optional. Specifies whether to enable SSL authentication. The default value is false.
  • client.ssl.certPath: Required. Specifies the path of the certificate file.
  • client.ssl.keyPath: Required. Specifies the path of the private key file.
  • client.ssl.caPath: Required. Specifies the path of the certification authority file.
  • client.ssl.insecureSkipVerify: Optional. Specifies whether the client skips verifying the server's certificate chain and host name. The default value is false.
  • client.concurrencyPerAddress: Optional. The number of client connections to each Graph address in NebulaGraph. The default value is 10.
  • client.reconnectInitialInterval: Optional. The initial interval for reconnecting to NebulaGraph. The default value is 1s.
  • client.retry: Optional. The number of retries for failed nGQL executions in the NebulaGraph client. The default value is 3.
  • client.retryInitialInterval: Optional. The initial interval between retries. The default value is 1s.
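
Since only client.version and client.address are required, a client block without SSL can be as small as the following sketch; user and password fall back to their defaults (root and nebula) when omitted.

client:
  version: v3
  address: "127.0.0.1:9669"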

manager

manager:
  spaceName: basic_int_examples
  batch: 128
  readerConcurrency: 50
  importerConcurrency: 512
  statsInterval: 10s
  hooks:
    before:
      - statements:
          - UPDATE CONFIGS storage:wal_ttl=3600;
          - UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = true };
      - statements:
          - |
            DROP SPACE IF EXISTS basic_int_examples;
            CREATE SPACE IF NOT EXISTS basic_int_examples(partition_num=5, replica_factor=1, vid_type=int);
            USE basic_int_examples;
        wait: 10s
    after:
      - statements:
          - |
            UPDATE CONFIGS storage:wal_ttl=86400;
            UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = false };
  • manager.spaceName: Required. Specifies which space the data is imported into.
  • manager.batch: Optional. Specifies the batch size of the inserted data for all sources. The default value is 128.
  • manager.readerConcurrency: Optional. Specifies the concurrency with which the reader reads from the sources. The default value is 50.
  • manager.importerConcurrency: Optional. Specifies the concurrency of generating the nGQL insert statements and calling the client to import them. The default value is 512.
  • manager.statsInterval: Optional. Specifies the interval at which statistics are printed. The default value is 10s.
  • manager.hooks.before: Optional. Configures the statements executed before the import begins.
    • manager.hooks.before.[].statements: Defines the list of statements.
    • manager.hooks.before.[].wait: Optional. Defines the waiting time after executing the above statements.
  • manager.hooks.after: Optional. Configures the statements executed after the import is complete.
    • manager.hooks.after.[].statements: Optional. Defines the list of statements.
    • manager.hooks.after.[].wait: Optional. Defines the waiting time after executing the above statements.
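
Because manager.spaceName is the only required option in this group, a minimal manager block can be reduced to the following sketch (the space name is hypothetical); all other options fall back to the defaults listed above.

manager:
  spaceName: my_space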

log

log:
  level: INFO
  console: true
  files:
    - logs/nebula-importer.log
  • log.level: Optional. Specifies the log level; optional values are DEBUG, INFO, WARN, ERROR, PANIC, or FATAL. The default value is INFO.
  • log.console: Optional. Specifies whether to print logs to the console. The default value is true.
  • log.files: Optional. Specifies which files to print logs to.

sources

sources is the configuration of the data source list; each data source contains the data source information, data processing, and schema mapping.

The following are the relevant configuration items; a minimal source entry is sketched after this list.

  • batch specifies the batch size of the inserted data for this source. It takes precedence over manager.batch.
  • path, s3, oss, ftp, sftp, hdfs, and gcs are the information configurations of the various data sources, and only one of them can be configured.
  • csv describes the csv file format information.
  • tags describes the schema definition for tags.
  • edges describes the schema definition for edges.
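
The following sketch shows a minimal, hypothetical source entry for a local CSV file containing one edge type; the file name, edge name, and column indexes are illustrative, and each item is described in detail below.

sources:
  - path: ./knows.csv      # hypothetical local CSV file
    batch: 256             # overrides manager.batch for this source
    csv:
      delimiter: ","
      withHeader: false
    edges:
      - name: KNOWS        # hypothetical edge type
        src:
          id:
            type: "STRING"
            index: 0       # source vid comes from column 0
        dst:
          id:
            type: "STRING"
            index: 1       # destination vid comes from column 1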

path

It only needs to be configured for local file data sources.

path: ./person.csv
  • path: Required. Specifies the path where the data files are stored. If a relative path is used, it is resolved relative to the directory of the current configuration file. Wildcard filenames are also supported, for example ./follower-*.csv; please make sure that all matching files share the same schema.

s3

It only needs to be configured for s3 data sources.

s3:
  endpoint: <endpoint>
  region: <region>
  bucket: <bucket>
  key: <key>
  accessKeyID: <Access Key ID>
  accessKeySecret: <Access Key Secret>
  • endpoint: Optional. The endpoint of s3 service, can be omitted if using aws s3.
  • region: Required. The region of s3 service.
  • bucket: Required. The bucket of file in s3 service.
  • key: Required. The object key of file in s3 service.
  • accessKeyID: Optional. The Access Key ID of s3 service. If it is public data, no need to configure.
  • accessKeySecret: Optional. The Access Key Secret of s3 service. If it is public data, no need to configure.
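
For example, a publicly readable object can be described without any credentials at all; the region, bucket, and key below are hypothetical.

s3:
  region: us-east-1
  bucket: example-bucket
  key: data/person.csv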

oss

It only needs to be configured for oss data sources.

oss:
  endpoint: <endpoint>
  bucket: <bucket>
  key: <key>
  accessKeyID: <Access Key ID>
  accessKeySecret: <Access Key Secret>
  • endpoint: Required. The endpoint of oss service.
  • bucket: Required. The bucket of file in oss service.
  • key: Required. The object key of file in oss service.
  • accessKeyID: Required. The Access Key ID of oss service.
  • accessKeySecret: Required. The Access Key Secret of oss service.

ftp

It only needs to be configured for ftp data sources.

ftp:
  host: 192.168.0.10
  port: 21
  user: <user>
  password: <password>
  path: <path of file>
  • host: Required. The host of ftp service.
  • port: Required. The port of ftp service.
  • user: Required. The user of ftp service.
  • password: Required. The password of ftp service.
  • path: Required. The path of file in the ftp service.

sftp

It only needs to be configured for sftp data sources.

sftp:
  host: 192.168.0.10
  port: 22
  user: <user>
  password: <password>
  keyFile: <keyFile>
  keyData: <keyData>
  passphrase: <passphrase>
  path: <path of file>
  • host: Required. The host of sftp service.
  • port: Required. The port of sftp service.
  • user: Required. The user of sftp service.
  • password: Optional. The password of sftp service.
  • keyFile: Optional. The ssh key file path of sftp service.
  • keyData: Optional. The ssh key file content of sftp service.
  • passphrase: Optional. The ssh key passphrase of sftp service.
  • path: Required. The path of file in the sftp service.
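
Because password, keyFile, keyData, and passphrase are all optional, you can authenticate either with a password or with an SSH key; the following sketch shows key-based authentication with hypothetical values.

sftp:
  host: 192.168.0.10
  port: 22
  user: importer                        # hypothetical user
  keyFile: /home/importer/.ssh/id_rsa   # key-based authentication instead of password
  path: /data/person.csv                # hypothetical path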

hdfs

It only needs to be configured for hdfs data sources.

hdfs:
  address: 192.168.0.10:8020
  user: <user>
  servicePrincipalName: <Kerberos Service Principal Name>
  krb5ConfigFile: <Kerberos config file>
  ccacheFile: <Kerberos ccache file>
  keyTabFile: <Kerberos keytab file>
  password: <Kerberos password>
  dataTransferProtection: <Kerberos Data Transfer Protection>
  disablePAFXFAST: false
  path: <path of file>
  • address: Required. The address of the hdfs service.
  • user: Optional. The user of the hdfs service.
  • servicePrincipalName: Optional. The Kerberos service principal name of the hdfs service when Kerberos is enabled.
  • krb5ConfigFile: Optional. The Kerberos config file of the hdfs service when Kerberos is enabled. The default is /etc/krb5.conf.
  • ccacheFile: Optional. The Kerberos ccache file of the hdfs service when Kerberos is enabled.
  • keyTabFile: Optional. The Kerberos keytab file of the hdfs service when Kerberos is enabled.
  • password: Optional. The Kerberos password of the hdfs service when Kerberos is enabled.
  • dataTransferProtection: Optional. The data transfer protection of the hdfs service.
  • disablePAFXFAST: Optional. Whether to prohibit the client from using PA_FX_FAST.
  • path: Required. The path of the file in the hdfs service.

gcs

It only needs to be configured for gcs data sources.

gcs:
  endpoint: <endpoint>
  bucket: <bucket>
  key: <key>
  credentialsFile: <Service account or refresh token JSON credentials file>
  credentialsJSON: <Service account or refresh token JSON credentials>
  withoutAuthentication: <false | true>
  • endpoint: Optional. The endpoint of GCS service.
  • bucket: Required. The bucket of file in GCS service.
  • key: Required. The object key of file in GCS service.
  • credentialsFile: Optional. Path to the service account or refresh token JSON credentials file. Not required for public data.
  • credentialsJSON: Optional. Content of the service account or refresh token JSON credentials file. Not required for public data.
  • withoutAuthentication: Optional. Specifies that no authentication should be used, defaults to false.
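
For publicly readable objects, authentication can be switched off explicitly; the bucket and key below are hypothetical.

gcs:
  bucket: example-public-bucket
  key: data/person.csv
  withoutAuthentication: true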

batch

batch: 256
  • batch: Optional. Specifies the batch size for this source of the inserted data. The priority is greater than manager.batch.

csv

csv:
  delimiter: ","
  withHeader: false
  lazyQuotes: false
  comment: ""
  • delimiter: Optional. Specifies the delimiter of the CSV files. The default value is ",". Only a one-character delimiter is supported.
  • withHeader: Optional. Specifies whether to ignore the first record in the csv file. The default value is false.
  • lazyQuotes: Optional. If lazyQuotes is true, a quote may appear in an unquoted field and a non-doubled quote may appear in a quoted field.
  • comment: Optional. Specifies the comment character. Lines beginning with the comment character, without preceding whitespace, are ignored.
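
For example, a pipe-delimited file with a header row and '#' comment lines could be described as follows; the delimiter and comment characters are purely illustrative.

csv:
  delimiter: "|"
  withHeader: true   # skip the first record
  comment: "#"       # skip lines starting with '#'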

tags

tags:
- name: Person
  mode: INSERT
  filter:
    expr: (Record[1] == "Mahinda" or Record[1] == "Michael") and Record[3] == "male"
  id:
    type: "STRING"
    function: "hash"
    index: 0
  ignoreExistedIndex: true
  props:
    - name: "firstName"
      type: "STRING"
      index: 1
    - name: "lastName"
      type: "STRING"
      index: 2
    - name: "gender"
      type: "STRING"
      index: 3
      nullable: true
      defaultValue: male
    - name: "birthday"
      type: "DATE"
      index: 4
      nullable: true
      nullValue: _NULL_
    - name: "creationDate"
      type: "DATETIME"
      index: 5
    - name: "locationIP"
      type: "STRING"
      index: 6
    - name: "browserUsed"
      type: "STRING"
      index: 7
      nullable: true
      alternativeIndices:
        - 6

# concatItems examples
tags:
- name: Person
  id:
    type: "STRING"
    concatItems:
      - "abc"
      - 1
    function: hash
  • name: Required. The tag name.
  • mode: Optional. The mode for processing data; optional values are INSERT, UPDATE, or DELETE. The default is INSERT.
  • filter: Optional. The data filtering configuration; filter.expr is the expression that decides which records are imported, as in the example above.
  • id: Required. Describes the tag ID information.
    • type: Optional. The type of the ID. The default value is STRING.
    • index: Optional. The column number in the records. Required if concatItems is not configured.
    • concatItems: Optional. The items concatenated to generate the ID. Each item can be a string, an int, or a mixture of both: a string represents a constant and an int represents an index column. All items are concatenated in order. If set, the above index has no effect.
    • function: Optional. The function used to generate the ID. Currently only hash is supported.
  • ignoreExistedIndex: Optional. Specifies whether to enable IGNORE_EXISTED_INDEX. The default value is true.
  • props: Required. Describes the tag props definition.
    • name: Required. The property name; it must be the same as the tag property in NebulaGraph.
    • type: Optional. The property type; currently BOOL, INT, FLOAT, DOUBLE, STRING, TIME, TIMESTAMP, DATE, DATETIME, GEOGRAPHY, GEOGRAPHY(POINT), GEOGRAPHY(LINESTRING), and GEOGRAPHY(POLYGON) are supported. The default value is STRING.
    • index: Required. The column number in the records.
    • nullable: Optional. Whether this property can be NULL; optional values are true or false. The default is false.
    • nullValue: Optional. Ignored when nullable is false. The value used to determine whether the property is NULL: the property is set to NULL when the record value equals nullValue. The default is "".
    • alternativeIndices: Optional. Ignored when nullable is false. The property is fetched from the record according to these indices, in order, until a value not equal to nullValue is found.
    • defaultValue: Optional. Ignored when nullable is false. The default value of the property, used when all the values obtained by index and alternativeIndices are nullValue.
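
To make the NULL handling concrete, the following sketch (with hypothetical column numbers and values) first reads column 7, falls back to column 6 when it equals the nullValue marker, and finally uses a default value when both columns contain the marker.

props:
  - name: "browserUsed"
    type: "STRING"
    index: 7                  # read column 7 first
    nullable: true
    nullValue: "_NULL_"       # treat this literal as NULL
    alternativeIndices:
      - 6                     # if column 7 equals "_NULL_", try column 6
    defaultValue: "unknown"   # used when every candidate column equals "_NULL_"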

edges

edges:
- name: KNOWS
  mode: INSERT
  filter:
    expr: (Record[1] == "Mahinda" or Record[1] == "Michael") and Record[3] == "male"
  src:
    id:
      type: "INT"
      index: 0
  dst:
    id:
      type: "INT"
      index: 1
  rank:
    index: 0
  ignoreExistedIndex: true
  props:
    - name: "creationDate"
      type: "DATETIME"
      index: 2
      nullable: true
      nullValue: _NULL_
      defaultValue: 0000-00-00T00:00:00
  • name: Required. The edge name.
  • mode: Optional. The mode here is similar to mode in the tags above.
  • filter: Optional. The filter here is similar to filter in the tags above.
  • src: Required. Describes the source definition for the edge.
  • src.id: Required. The id here is similar to id in the tags above.
  • dst: Required. Describes the destination definition for the edge.
  • dst.id: Required. The id here is similar to id in the tags above.
  • rank: Optional. Describes the rank definition for the edge.
  • rank.index: Required. The column number in the records.
  • props: Optional. Similar to the props in the tags, but for edges.
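
Since src.id and dst.id accept the same options as the tag id above, an edge endpoint can, for example, be generated by hashing a string column; the following fragment is only an illustrative sketch based on that description.

src:
  id:
    type: "STRING"
    index: 0
    function: hash   # generate the vid by hashing the value in column 0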

See the Configuration Reference for details on the configurations.


nebula-importer's Issues

CSV Data import error with timestamp data type

I'm trying to import data in csv format to nebula graph using nebula-importer tool.
This is the schema for tag user:

CREATE TAG user(id int, screen_name string, followers_count int, friends_count int, created_at timestamp);

And here are two rows of my csv file to make it clear:

"0",":User","2011-09-13T15:13:20","372861228","danieleverdi",,,
"2",":User","2020-06-02T14:52:27","1267831404525690880","mariorsossi",,,

The problem is related to the timestamp string format: as a matter of fact I get the following error:

2021/05/11 21:54:43 --- START OF NEBULA IMPORTER ---
2021/05/11 21:54:43 [WARN] config.go:217: Invalid retry option in clientSettings.retry, reset to 1
2021/05/11 21:54:43 [WARN] config.go:168: You have not configured whether to remove generated temporary files, reset to default value. removeTempFiles: false
2021/05/11 21:54:43 [INFO] connection_pool.go:74: [nebula-clients] connection pool is initialized successfully
2021/05/11 21:54:43 [INFO] clientmgr.go:28: Create 10 Nebula Graph clients
2021/05/11 21:54:43 [INFO] reader.go:64: Start to read file(0): /home/justin/Desktop/progetto_dm/nebula-docker-compose/users.csv, schema: < :IGNORE,:IGNORE,user.created_at:timestamp,:VID(int)/user.id:int,user.screen_name:string,user.followers_count:int,user.friends_count:int >
2021/05/11 21:54:43 [INFO] reader.go:180: Total lines of file(/home/justin/Desktop/progetto_dm/nebula-docker-compose/users.csv) is: 2, error lines: 0
2021/05/11 21:54:44 [ERROR] handler.go:63: Client 2 fail to execute: INSERT VERTEX user(created_at,id,screen_name,followers_count,friends_count) VALUES 1267831404525690880: (2020-06-02T14:52:27,1267831404525690880,"MarcelloLyotard",,);, ErrMsg: SyntaxError: syntax error near T14', ErrCode: -7
2021/05/11 21:54:44 [ERROR] handler.go:63: Client 1 fail to execute: INSERT VERTEX user(created_at,id,screen_name,followers_count,friends_count) VALUES 372861228: (2011-09-13T15:13:20,372861228,"danielenavone1",,);, ErrMsg: SyntaxError: syntax error near T15', ErrCode: -7
2021/05/11 21:54:44 [INFO] statsmgr.go:61: Done(/home/justin/Desktop/progetto_dm/nebula-docker-compose/users.csv): Time(1.03s), Finished(2), Failed(2), Latency AVG(0us), Batches Req AVG(0us), Rows AVG(1.95/s)
2021/05/11 21:54:44 Total 2 lines fail to insert into nebula graph database
2021/05/11 21:54:45 --- END OF NEBULA IMPORTER ---

In both cases a syntax error occurs because of the letter "T"; however, even after removing it the error persists.
As you can see, in the INSERT VERTEX statement the quotes at the beginning and end of the string are removed.
So my question is: how should I format the timestamp field in the csv to make the import process work?

Thanks for your help.

importing failed

If the :LABEL column is missing from the CSV file, the import cannot succeed, for example:

:VID(string) player.age:int player.name:string
player100 22 lzy
player101 24 zy
player102 25 gc
player103 26 jh

report errors:

vid is not niljie: %!(EXTRA string=:VID, *config.VID=&{0xc00001b040 <nil> <nil>})panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x536e27]

The import only succeeds after adding a :LABEL column.

The configuration file is as follows:

version: v2
description: example
removeTempFiles: false
clientSettings:
  retry: 3
  concurrency: 2 # number of graph clients
  channelBufferSize: 1
  space: basketballplayer
  connection:
    user: root
    password: nebula
    address: 192.168.153.10:9669
  postStart:
    commands: |
      UPDATE CONFIGS storage:wal_ttl=3600;
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = true };
      DROP SPACE IF EXISTS basketballplayer;
      CREATE SPACE IF NOT EXISTS basketballplayer(partition_num=5, replica_factor=1, vid_type=FIXED_STRING(20));
      USE basketballplayer;
      CREATE TAG player(name string, age int);
      CREATE TAG team(name string);
      CREATE EDGE follow(degree int);
      CREATE EDGE serve(start_year int, end_year int);
    afterPeriod: 8s
  preStop:
    commands: |
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = false };
      UPDATE CONFIGS storage:wal_ttl=86400;
logPath: ./err/test.log
files:
  - path: ./basketball.csv
    failDataPath: ./err/course.csv
    batchSize: 2
    inOrder: true
    type: csv
    csv:
      withHeader: true
      withLabel: false
    schema:
      type: vertex

import result error

schema:

CREATE SPACE IF NOT EXISTS sf1(PARTITION_NUM = 24, REPLICA_FACTOR = 3, vid_type = int64);
USE sf1;
CREATE TAG IF NOT EXISTS `Comment`(`creationDate` string,`locationIP` string,`browserUsed` string,`content` string,`length` int);

comment.csv

Import a file with the wrong format.
Config:

{
    "config": {
        "version": "v2",
        "description": "web console import",
        "clientSettings": {
            "concurrency": 10,
            "channelBufferSize": 128,
            "space": "sf1",
            "connection": {
                "user": "1",
                "password": "2",
                "address": "192.168.8.157:9669"
            }
        },
        "logPath": "/Users/xxx/Documents/Work/nebula-studio/tmp/upload/tmp/import.log",
        "files": [{
            "path": "/Users/xxx/Documents/Work/nebula-studio/tmp/upload/comment.csv",
            "failDataPath": "/Users/xxx/Documents/Work/nebula-studio/tmp/upload/tmp/err/数据源 1Fail.csv",
            "batchSize": 10,
            "type": "csv",
            "csv": {
                "withHeader": false,
                "withLabel": false
            },
            "schema": {
                "type": "vertex",
                "vertex": {
                    "vid": {
                        "index": 0,
                        "type": "int"
                    },
                    "tags": [{
                        "name": "Comment",
                        "props": [{
                            "name": "creationDate",
                            "type": "string",
                            "index": 1
                        }, {
                            "name": "locationIP",
                            "type": "string",
                            "index": 2
                        }, {
                            "name": "browserUsed",
                            "type": "string",
                            "index": 3
                        }, {
                            "name": "content",
                            "type": "string",
                            "index": 4
                        }, {
                            "name": "length",
                            "type": "int",
                            "index": 5
                        }]
                    }]
                }
            }
        }]
    },
    "mountPath": "/Users/xxx/Documents/Work/nebula-studio/tmp/upload"
}

(screenshot omitted)

read file line error, but log shows Failed(0)
comment.csv

support real csv headers

Currently, if I set csv.withHeader = true, I have to ensure the headers of the source csv are in the Nebula-defined format, e.g. :DST_VID,follow.likeness:double,:SRC_VID,:RANK. I think these are not the user's real csv headers; they are Nebula's, and would be better set in the config, e.g. as a field mapping from the real source to Nebula. Thanks.

Support CSV file with BOM in windows

(screenshot omitted)

{
  "version": "v2",
  "description": "web console import",
  "clientSettings": {
    "concurrency": 10,
    "channelBufferSize": 128,
    "space": "ashare_1",
    "connection": {
      "user": "user",
      "password": "123",
      "address": "192.168.10.217:9669"
    }
  },
  "logPath": "/Users/hetaohua/Documents/Projects/nebula-graph-studio/tmp/upload/tmp/import.log",
  "files": [
    {
      "path": "/Users/hetaohua/Documents/Projects/nebula-graph-studio/tmp/upload/nodes.csv",
      "failDataPath": "/Users/hetaohua/Documents/Projects/nebula-graph-studio/tmp/upload/tmp/err/数据源 1Fail.csv",
      "batchSize": 10,
      "type": "csv",
      "csv": {
        "withHeader": false,
        "withLabel": false
      },
      "schema": {
        "type": "vertex",
        "vertex": {
          "vid": {
            "index": 0,
            "type": "int"
          },
          "tags": [
            {
              "name": "stocks",
              "props": [
                {
                  "name": "stock_id",
                  "type": "string",
                  "index": 1
                },
                {
                  "name": "name",
                  "type": "string",
                  "index": 2
                },
                {
                  "name": "industry",
                  "type": "string",
                  "index": 3
                }
              ]
            }
          ]
        }
      }
    },
    {
      "path": "/Users/hetaohua/Documents/Projects/nebula-graph-studio/tmp/upload/edges.csv",
      "failDataPath": "/Users/hetaohua/Documents/Projects/nebula-graph-studio/tmp/upload/tmp//err/Edge 1Fail.csv",
      "batchSize": 10,
      "type": "csv",
      "csv": {
        "withHeader": false,
        "withLabel": false
      },
      "schema": {
        "type": "edge",
        "edge": {
          "name": "relation",
          "srcVID": {
            "index": 0,
            "type": "int"
          },
          "dstVID": {
            "index": 1,
            "type": "int"
          },
          "withRanking": false,
          "props": [
            {
              "name": "weight",
              "type": "double",
              "index": 2
            }
          ]
        }
      }
    }
  ]
}

edges.csv
nodes.csv

Add some test

I find that this project has only a couple of test files (just 2 *_test.go files). How about adding some tests? Maybe I can help you :)

Should stop inserting if any insertion fails

Controlled by a parameter in the YAML file.

fail-fast: true

If I write a wrong YAML file, a lot of content is printed to the console and I have to press Ctrl+C to stop it.

So if the importer could fail fast, I would be grateful.

Delete Vertex with label specified in CSV

When I specify - (minus) in the :LABEL column of the CSV file for import, what is the meaning of the other TAG values specified?

In my test case, when a VertexID and a TAG value are specified for deletion, you would expect that only that TAG's data would be deleted for the specific VertexID,
but that is not the case: all TAG values are deleted, so the complete vertex is deleted.

As far as I can find in the Nebula documentation, only DELETE VERTEX is supported, not deleting a specific TAG of a vertex.
So I would expect that you cannot delete a specific TAG of a vertex and leave the other TAGs untouched, which makes :LABEL with TAG values misleading, because the importer will delete the complete vertex for the specified ID.

Configs in .yaml File

Hello Team,
When importing a csv file using a .yaml config, if we state a vertex/edge property without its data type, the importer doesn't throw an exception and doesn't do anything. For example:

 .....
edge:
        name: friend
        withRanking: false
        srcVID:
          index: 0
        dstVID:
          index: 1
        props:
          - name: startdete
         #### type: timestamp
            index: 2

This seems to work very well but actually does not. Maybe it should be an error.
Thanks.

panic when importing a large number of files

question mentioned at https://discuss.nebula-graph.com.cn/t/topic/1579/21?u=les1ie

If I try to import a large number of files, e.g. 1000 csv files, with a yaml file of more than 10000 lines, nebula-importer panics.

How to reproduce:

  1. Generate sample csv files:
from pathlib import Path
import os

dump_dir = Path('./dump')
if not os.path.exists(dump_dir):
    os.mkdir(dump_dir)


def generate_csv():
    num = 10000
    for i in range(num):
        with open(f'{dump_dir}/vertex_{i}.csv', 'w') as f:
            f.write("123\n")


generate_csv()
  2. Download the erroring config.yaml:
    out.zip

  3. Import the csv files:

python3 reproduce.py
cp path_to_nebula_importer_exec dump/
cd dump
./nebula_importer -c out.yaml
  4. Screenshot:
    (screenshot omitted)

self-adaptive import flow control

Not sure if it's practical (or worth the effort), but is it possible to introduce a mechanism similar to TCP RFC 1323 (starting slowly and gradually converging on a proper batch size, and self-adapting when the storage handling capability changes) to help optimize the buffer size, batch size, etc. of each import activity and enable a speed close to the best out of the box?

This could potentially decouple the effort of tuning those parameters for each Nebula cluster (or even for clusters of a different shape/workload, where the capability to handle the import flow could vary).

Support for ignoring some columns in the configuration file when the CSV file has no header line.

As your README says:

Note: The order of properties in the above props must be the same as that of the corresponding data in the CSV data file.
If I have a file course.csv:

name,teacher,id
math,Mr Liu,1
computer,Mr Wang,2

I don't want the first two fields, name and teacher; I only want to import the id field. How can I guarantee the order of the props in the config? Or do I only need to add an :IGNORE field to the csv file? Is there an option like this:

...
   vertex:
       tags:
           - name: course
             props:
                 - ignore: true  # add support for this field, meaning ignore the corresponding column in the csv
                 - ignore: true
                 - name: id
                   type: string
...

thanks

JSON Lines support

Is it possible to also have JSON Lines support? It would mean a lot for fresh users to be able to point the headless importer at their existing file-based sources and load them into Nebula :)

Configuring multiple files for the same schema

Hello, I'd like to ask a question. Suppose I configure two paths in the yaml, student_0.csv and student_1.csv, both used to import vertex data for the schema student (which has the property name). The two files may contain student data with the same vertex id. My questions are:

  1. Is the number of vertices finally imported the union of the data in the two files?
  2. For the same student (same vertex id), if the name property is "A" in student_0.csv but "B" in student_1.csv, what will the name property be in the graph space in the end?

error: Prop index 1 out range 1 of record([847SawUH57a,0.01]))

client

./nebula-importer-linux-amd64-v2.6.0

the csv example

83Vywxuirk7,6.323
83Vxbbzymnl,1.016
83VxVAtOhlM,3.437
83WIM5lP3IQ,0.01
83WWBWpipQl,0.01
83WdUurhv1A,0.01
83WOl3YKkDQ,0.01
83WWBYAtIXl,0.01
83WykP7jnLE,0.01
83W68p5flDi,1.156
83WcD54qUgd,0.01

the YAML config

version: v2
description: journal
removeTempFiles: false
clientSettings:
  retry: 3
  concurrency: 2 # number of graph clients
  channelBufferSize: 128
  space: dataengine
  connection:
    user: root
    password: 123456
    address: 192.168.110.149:31883
  postStart:
    commands: |
      UPDATE CONFIGS storage:wal_ttl=3600;
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = true };
    afterPeriod: 8s
  preStop:
    commands: |
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = false };
      UPDATE CONFIGS storage:wal_ttl=86400;
logPath: ./err/t_journal.log
files:
  - path: ./t_journal.csv
    failDataPath: ./err/t_journal.csv
    batchSize: 2000
    type: csv
    csv:
      withHeader: false
      withLabel: false
      delimiter: '|'
    schema:
      type: vertex
      vertex:
        vid:
          index: 0
          type: string
        tags:
          - name: t_journal
            props:
              - name: journal_id
                type: string
                index: 0
              - name: impact_factor
                type: double
                index: 1

2021/10/29 23:04:24 [ERROR] handler.go:63: Client 0 fail to execute: THERE_ARE_SOME_ERRORS(tag: {0xc000156180 [0xc000142108 0xc000142120]}, error: Prop index 1 out range 1 of record([847SawUH57a,0.01])), ErrMsg: SyntaxError: syntax error near `THERE_ARE_SOME_ERRORS', ErrCode: -1004

File doesn't exist

[root@VM-95-249-centos /data/graphdb/testUpload]# docker run --rm -ti       --network=host       -v /data/graphdb/testUpload/tryImport.yaml:/data/graphdb/testUpload/tryImport.yaml       -v  /data/graphdb/testUpload/       vesoft/nebula-importer:v1      --config /data/graphdb/testUpload/tryImport.yaml
2021/03/20 15:38:19 --- START OF NEBULA IMPORTER ---
2021/03/20 15:38:19 File(/data/graphdb/testUpload/userInfo.csv) doesn't exist
2021/03/20 15:38:20 --- END OF NEBULA IMPORTER ---

but I have the file:

[root@VM-95-249-centos /data/graphdb/testUpload]# ll
total 16
-rw-r--r-- 1 root root  36 Mar 20 23:17 courseInfo.csv
-rw-r--r-- 1 root root 187 Mar 20 23:17 take.csv
-rw-r--r-- 1 root root 997 Mar 20 23:37 tryImport.yaml
-rw-r--r-- 1 root root  34 Mar 20 23:17 userInfo.csv

why???

wrong method name

2021/11/25 18:28:43 --- START OF NEBULA IMPORTER ---
2021/11/25 18:28:44 failed to open connection, error: failed to verify client version: verifyClientVersion failed: wrong method name
2021/11/25 18:28:45 --- END OF NEBULA IMPORTER ---
exit status 200

version: v2
description: example
removeTempFiles: false
clientSettings:
  retry: 3
  concurrency: 1 # number of graph clients
  channelBufferSize: 128
  space: test
  connection:
    user: root
    password: password
    address: *****
logPath: ./err/test.log

nebula version 2.5.0

import error

Trying to run the importer on the same machine as Nebula, with this command:

$ docker run --rm -ti --network=host -v /opt/nebula/import.yaml:/data/import.yaml -v ~/:/data/ vesoft/nebula-importer --config /data/import.yaml

I get the following error:

2020/10/04 11:53:18 --- START OF NEBULA IMPORTER ---
2020/10/04 11:53:18 [INFO] config.go:399: files[1].schema.vertex is nil
2020/10/04 11:53:29 dial tcp 172.29.3.1:3699: i/o timeout
2020/10/04 11:53:30 --- END OF NEBULA IMPORTER ---

Also, when trying to use an external domain name as the Nebula connection address, I get the following error:

2020/10/04 11:57:45 --- START OF NEBULA IMPORTER ---
2020/10/04 11:57:45 [INFO] config.go:399: files[1].schema.vertex is nil
2020/10/04 11:57:45 [INFO] clientmgr.go:28: Create 2 Nebula Graph clients
2020/10/04 11:57:45 [INFO] reader.go:64: Start to read file(0): /data/users_profile.csv, schema: < :VID,user.username:string >
panic: send on closed channel

goroutine 24 [running]:
github.com/vesoft-inc/nebula-importer/pkg/reader.(*Batch).requestClient(0xc000232180)
	/home/nebula-importer/pkg/reader/batch.go:66 +0x14c
github.com/vesoft-inc/nebula-importer/pkg/reader.(*Batch).Add(0xc000232180, 0x1, 0xc00000ec40, 0x2, 0x2)
	/home/nebula-importer/pkg/reader/batch.go:36 +0xa0
github.com/vesoft-inc/nebula-importer/pkg/reader.(*FileReader).Read(0xc0002320c0, 0x0, 0x0)
	/home/nebula-importer/pkg/reader/reader.go:162 +0x570
github.com/vesoft-inc/nebula-importer/pkg/cmd.(*Runner).Run.func2(0xc000014a40, 0xc000014a80, 0xc0002320c0, 0xc000019480, 0x17)
	/home/nebula-importer/pkg/cmd/cmd.go:70 +0x40
created by github.com/vesoft-inc/nebula-importer/pkg/cmd.(*Runner).Run
	/home/nebula-importer/pkg/cmd/cmd.go:69 +0x705

PS: I can connect to the server with Nebula Graph Studio with no problem, but as the file size is so big, I cannot use it for the import.

Support multiple Nebula Graph Servers

For example configuration:

clientSettings:
  connection:
    address: 192.168.8.5:3699,192.168.8.6:3699,192.168.8.7:3699

We should balance the workload across the above 3 servers.
