reproio / columnify
Make record oriented data to columnar format.
License: Apache License 2.0
A panic occurred when executing the following command.
$ columnify -schemaType avro -schemaFile ~/wc/src/github.com/reproio/columnify/examples/schema/primitives.avsc =(echo '{"boolean": false, "string": "foobar"}') > a
panic: interface conversion: interface {} is nil, not string [recovered]
panic: send on closed channel
goroutine 11 [running]:
github.com/xitongsys/parquet-go/writer.(*ParquetWriter).flushObjs.func1.1(0xc00007dd20, 0xc00002c900)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:208 +0xc8
panic(0x9ef140, 0xc00018fc20)
/usr/lib/go-1.14/src/runtime/panic.go:969 +0x166
github.com/xitongsys/parquet-go/encoding.WritePlainBYTE_ARRAY(0xc00007de40, 0x1, 0x1, 0x0, 0x0, 0x0)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/encoding/encodingwrite.go:107 +0x1bc
github.com/xitongsys/parquet-go/encoding.WritePlain(0xc00007de40, 0x1, 0x1, 0x6, 0xc0002f5460, 0x0, 0xc00007de40)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/encoding/encodingwrite.go:51 +0x10a
github.com/xitongsys/parquet-go/layout.(*Page).EncodingValues(0xc000038600, 0xc00007de40, 0x1, 0x1, 0x1, 0xc00007de40, 0x0)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/layout/page.go:166 +0xd9
github.com/xitongsys/parquet-go/layout.(*Page).DataPageCompress(0xc000038600, 0x1, 0x0, 0x0, 0xc0002f5460)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/layout/page.go:183 +0x187
github.com/xitongsys/parquet-go/layout.TableToDataPages(0xc000038500, 0xc000002000, 0x1, 0x0, 0x1, 0xc0000e8070, 0xc00031e250)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/layout/page.go:116 +0x812
github.com/xitongsys/parquet-go/writer.(*ParquetWriter).flushObjs.func1(0xc00007dd20, 0xc00002c900, 0xc0000b0280, 0xc0002f5320, 0xc00031e248, 0x1, 0x1, 0x0, 0x1, 0x0)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:230 +0x44d
created by github.com/xitongsys/parquet-go/writer.(*ParquetWriter).flushObjs
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:195 +0x242
This also produces a file that is invalid as Parquet.
The invalid file causes an error on Athena when we use this command as a compressor for fluent-plugin-s3.
But the following command does not panic:
$ columnify -schemaType avro -schemaFile ~/wc/src/github.com/reproio/columnify/examples/schema/primitives.avsc =(echo '{"boolean": false, "string": "foobar"}')
PAR12020/05/01 14:53:35 Failed to write: interface conversion: interface {} is nil, not string
Move examples/ to testdata/ so it can be referenced from *_test.go files.
The current version requires a lot of RAM. I'd like to investigate the biggest memory consumers and try to reduce usage.
ref. https://golang.org/pkg/runtime/pprof/
columnify doesn't show command-line help when it runs without options.
$ ./columnify
2020/04/21 16:07:39 Failed to init: open : no such file or directory
I think showing command-line help would be better than outputting the above error message.
I guess users want to aggregate multiple input files into one or a small number of files to compact the data size.
We have been running columnify as part of a fluent-plugin-s3 compressor (msgpack to parquet) these days.
But columnify caused out-of-memory errors in some environments.
So I want to estimate the memory usage of columnify.
Or is there a way to keep the memory usage constant regardless of the file size?
In my research, memory usage is proportional to the file size.
Large files use 5 to 6 times the file size in memory.
For example, a large msgpack file (223MB) consumes about 1.3GB of memory (the ps command's RSS).
I want to use the msgpack format with fluent-plugin-s3's <format> section. But I cannot see *.parquet files on S3.
I can create a small reproducible case for this issue.
$ ~/wc/src/github.com/reproio/columnify/columnify -recordType msgpack -schemaType avro -schemaFile nginx_access_log.avsc x.msgpack
PAR12020/05/11 17:32:33 Failed to write: reflect: call of reflect.Value.Elem on uint8 Value
$ echo $?
1
I use the following Avro schema definition:
{
"name": "NginxAccessLog",
"type": "record",
"fields": [
{
"name": "container_id",
"type": "string"
},
{
"name": "container_name",
"type": "string"
},
{
"name": "source",
"type": "string"
},
{
"name": "log",
"type": "string"
},
{
"name": "__fluentd_address__",
"type": "string"
},
{
"name": "__fluentd_host__",
"type": "string"
},
{
"name": "role",
"type": "string"
},
{
"name": "host",
"type": "string"
},
{
"name": "remote_ip",
"type": "string"
},
{
"name": "request_host",
"type": "string"
},
{
"name": "user",
"type": "string"
},
{
"name": "method",
"type": "string"
},
{
"name": "path",
"type": "string"
},
{
"name": "status",
"type": "string"
},
{
"name": "size",
"type": "string"
},
{
"name": "referer",
"type": "string"
},
{
"name": "agent",
"type": "string"
},
{
"name": "duration",
"type": "string"
},
{
"name": "country_code",
"type": "string"
},
{
"name": "token_param",
"type": "string"
},
{
"name": "idfv_param",
"type": "string"
},
{
"name": "tag",
"type": "string"
},
{
"name": "time",
"type": "string"
}
]
}
It possibly overwrites a concrete error with nil. In such a case, we lose the chance to detect errors from the writer func. For now, as @itchyny said, I changed this place not to capture the return value of Close() if err is already set.
#18 (comment)
record.FormatToMap() just formats input data into map[string]interface{} without checking the schema, but the following processes then fail (e.g. #24).
So we should ensure schema checks at that point to detect mismatches. That's natural, because the conversion to Parquet is schema-based!
To visualize test coverage
In #33, optional nested record types probably break the output data. The current implementation has some issues around nullable types.
avroTypeToArrowType(f.Type) returns three values, one of which is nullable as a boolean.
func avroFieldToArrowField(f avro.RecordField) (*arrow.Field, error) {
    t, nullable, err := avroTypeToArrowType(f.Type)
    if err != nil {
        return nil, err
    }
    return &arrow.Field{
        Name:     f.Name,
        Type:     t,
        Nullable: nullable,
    }, nil
}
I wonder if this functionality could be split into an independent function.
if t.UnionType != nil {
    if nt := isNullableField(t.UnionType); nt != nil {
        if nested, _, err := avroTypeToArrowType(*nt); err == nil {
            return nested, true, nil
        }
    }
}
I confirmed isNullableField() returns *avro.AvroType. I think this function has two responsibilities: deciding null-or-not and retrieving the AvroType.
func isNullableField(ut *avro.UnionType) *avro.AvroType {
    if len(*ut) == 2 && *(*ut)[0].PrimitiveType == *avro.ToPrimitiveType(avro.AvroPrimitiveType_Null) {
        // According to the Avro spec, "null" is usually listed first
        return &(*ut)[1]
    }
    return nil
}
By refactoring these functions, could you provide a simpler API? I don't know the background of the type conversion, so tell me if I've misunderstood.
Using Fluentd with Columnify, running on Kubernetes to push logs to S3 in Parquet.
Issues arise when trying to use an Avro schema with a nested map or a list of strings.
According to the kubernetes metadata filter plugin's docs (https://github.com/ViaQ/fluent-plugin-kubernetes_metadata_input/blob/master/README.md#kubernetes-labels-and-annotations), I believe columnify would receive a nested array of strings.
I got the following schema to work so far, but am still running into issues upstream with Athena.
{
"type": "record",
"name": "record",
"fields": [
{
"name": "message",
"type": "string"
},
{
"name": "logtag",
"type": "string"
},
{
"name": "stream",
"type": "string"
},
{
"name": "time",
"type": ["null", "string"]
},
{
"name": "docker",
"type": {
"type": "record",
"name": "docker",
"fields": [
{
"name": "container_id",
"type": "string"
}
]
}
},
{
"name": "kubernetes",
"type": {
"type": "record",
"name": "kubernetes",
"fields": [
{
"name": "container_name",
"type": "string"
},
{
"name": "host",
"type": ["null", "string"]
},
{
"name": "master_url",
"type": ["null", "string"]
},
{
"name": "namespace_name",
"type": ["null", "string"]
},
{
"name": "pod_id",
"type": ["null", "string"]
},
{
"name": "pod_name",
"type": ["null", "string"]
},
{
"name": "labels",
"type": {
"type": "array",
"items": {
"name": "label",
"type" : "record",
"fields": [ {
"type": ["null", "string"]
} ]
}
}
}
]
}
}
]
}
Specifically, the issue is with the labels part. I think this should work instead of the record with an array of records:
{
"name": "labels",
"type":{
"type": "array",
"items":{
"type":"list",
"values":"string"
}
}
}
Example data before fluentd filters:
{
"stream": "stdout",
"logtag": "F",
"message": " Tue Nov 22 23:51:12 UTC 2022 Found redis master (172.20.203.160)",
"time": 1669161072.283568,
"docker": {
"container_id": "29e32e64745530e7a1c5e9174f9e266e051707aec6a76d4556871532157a"
},
"kubernetes": {
"container_name": "split-brain-fix",
"namespace_name": "argocd",
"pod_name": "argocd-redis-ha-server-0",
"container_image": "docker.io/library/redis:6.2.6-alpine",
"container_image_id": "docker.io/library/redis@sha256:132337b9d7744ffee4fae83fde53c3530935ad3ba528b7110f2d805f55cbf5",
"pod_id": "ee5af2aa-14d8-446c-9755-",
"pod_ip": "10.64.124.43",
"host": "ip-10-64-116-85.us-west-2.compute.internal",
"labels": {
"app": "redis-ha",
"argocd-redis-ha": "replica",
"controller-revision-hash": "argocd-redis-ha-server-7cd67685d6",
"release": "argocd",
"statefulset_kubernetes_io/pod-name": "argocd-redis-ha-server-0"
},
"master_url": "https://172.20.0.1:443/api",
"namespace_id": "f3d1453d-d227-4c54-982a-457d5b99cc8b",
"namespace_labels": {
"app_kubernetes_io/managed-by": "Helm",
"kubernetes_io/metadata_name": "argocd"
}
},
"tag": "kubernetes.var.log.containers.argocd-redis-ha-server-0_argocd_split-brain-fix-29e32e64745530e7a171e08251707aec6a76d4556871532157a.log"
}
But I get this error:
2022-11-23 00:29:43 +0000 [warn]: #0 [out_s3] got unrecoverable error in primary and no secondary error_class=Fluent::UnrecoverableError error="failed to execute columnify command. stdout= stderr=panic: runtime error: index out of range [0] with length 0\n\ngoroutine 1 [running]:\ngithub.com/xitongsys/parquet-go/layout.PagesToChunk(0x10ea6d8, 0x0, 0x0, 0x20)\n\t/home/runner/go/pkg/mod/github.com/xitongsys/[email protected]/layout/chunk.go:24 +0x90d\ngithub.com/xitongsys/parquet-go/writer.(*ParquetWriter).Flush(0xc00074fcc0, 0xc00010e001, 0x10, 0xa3abc0)\n\t/home/runner/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:285 +0x3d5\ngithub.com/xitongsys/parquet-go/writer.(*ParquetWriter).WriteStop(0xc00074fcc0, 0x0, 0xc00010e050)\n\t/home/runner/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:120 +0x37\ngithub.com/reproio/columnify/columnifier.(*parquetColumnifier).Close(0xc00000c6c0, 0xc00086fe18, 0x9d5cff)\n\t/home/runner/work/columnify/columnify/columnifier/parquet.go:122 +0x2e\nmain.columnify.func1(0xc2d760, 0xc00000c6c0, 0xc00086fec0)\n\t/home/runner/work/columnify/columnify/cmd/columnify/columnify.go:24 +0x35\nmain.columnify(0xc2d760, 0xc00000c6c0, 0xc00013a0f0, 0x1, 0x1, 0x0, 0x0)\n\t/home/runner/work/columnify/columnify/cmd/columnify/columnify.go:36 +0xe2\nmain.main()\n\t/home/runner/work/columnify/columnify/cmd/columnify/columnify.go:71 +0x545\n status=#<Process::Status: pid 48 exit 2>"
2022-11-23 00:29:43 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.1.0/gems/fluent-plugin-s3-1.7.2/lib/fluent/plugin/s3_compressor_parquet.rb:60:in `compress'
2022-11-23 00:29:43 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.1.0/gems/fluent-plugin-s3-1.7.2/lib/fluent/plugin/out_s3.rb:352:in `write'
2022-11-23 00:29:43 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin/output.rb:1180:in `try_flush'
2022-11-23 00:29:43 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin/output.rb:1501:in `flush_thread_run'
2022-11-23 00:29:43 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin/output.rb:501:in `block (2 levels) in start'
2022-11-23 00:29:43 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
The billing plan for reproio does not include access to GitHub Actions. Please contact the organization owner or billing manager for questions about the current billing plan
The current reproio org doesn't have the privilege to execute GitHub Actions. This repo will be open sourced, so would it be better to start using Actions at that point?
Columnify uses Apache Arrow Schema/Record as an intermediate representation between various input formats and outputs (currently only Parquet). Arrow is powerful: it offers fast memory access and supports a columnar-like representation. But the Go implementation is not complete yet; e.g., the Arrow record type doesn't support some types in its subfields, so it's not yet fully applicable to Columnify. Additionally, the Arrow Go implementation doesn't support rich data conversion the way PyArrow does. Finally, Columnify currently uses only the Arrow Schema as a necessary intermediate.
So we have some options to tackle these problems, like:
As a trivial topic, gocredits doesn't work on the Go Arrow dependency. #4
Following the README instructions, I'm trying to install columnify with go get, but I'm getting the following error:
GO111MODULE=off go get github.com/reproio/columnify/cmd/columnify
package github.com/vmihailenco/msgpack/v4: cannot find package "github.com/vmihailenco/msgpack/v4" in any of:
/usr/local/Cellar/go/1.14.4/libexec/src/github.com/vmihailenco/msgpack/v4 (from $GOROOT)
/Users/XXX/.go/src/github.com/vmihailenco/msgpack/v4 (from $GOPATH)
2023-05-22 14:08:44 +0000 [error]: config error file="/etc/td-agent/td-agent.conf" error_class=Fluent::ConfigError error="'columnify' utility must be in PATH for -h compression"
/root/go/bin/columnify
Usage of columnify: columnify [-flags] [input files]
-output string
path to output file; default: stdout
-parquetCompressionCodec string
parquet compression codec, default: SNAPPY (default "SNAPPY")
-parquetPageSize int
parquet file page size, default: 8kB (default 8192)
-parquetRowGroupSize int
parquet file row group size, default: 128MB (default 134217728)
-recordType string
record data format type, [avro|csv|jsonl|ltsv|msgpack|tsv] (default "jsonl")
-schemaFile string
path to schema file
-schemaType string
schema type, [avro|bigquery]
Hi,
I am trying to use columnify to generate output in Parquet format to send data to Azure using the Azure plugin. However, I get the error below when I run the fluentd container. I would appreciate it if someone could help me here.
2024-07-09 22:25:41 +0000 [warn]: #0 got unrecoverable error in primary and no secondary error_class=Fluent::UnrecoverableError error="failed to execute columnify command. stdout= stderr=Failed to close columnifier: interface conversion: interface {} is nil, not string
2024/07/09 22:25:41 Failed to write: interface conversion: interface {} is nil, not string\n status=#<Process::Status: pid 1191 exit 1>"
2024-07-09 22:25:41 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluent-plugin-azurestorage-gen2-0.3.5/lib/fluent/plugin/out_azurestorage_gen2.rb:834:in `compress'
2024-07-09 22:25:41 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluent-plugin-azurestorage-gen2-0.3.5/lib/fluent/plugin/out_azurestorage_gen2.rb:165:in `write'
2024-07-09 22:25:41 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:1225:in `try_flush'
2024-07-09 22:25:41 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
2024-07-09 22:25:41 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
2024-07-09 22:25:41 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
Go 1.13 introduced the error-wrapping %w feature. I like chaining errors since it's useful for debugging and logging. How about using wrapped errors?
See also: https://blog.golang.org/go1.13-errors
Currently, the Columnifier interface has a Flush method but no Close method.
type Columnifier interface {
    Write(data []byte) error
    WriteFromFiles(paths []string) error
    Flush() error
}
I confirmed that parquetColumnifier.Flush has two functions: flushing and closing. In my view, Flush should not imply Close; Flush should write synchronously, or ensure data is written to the data store.
So, I recommend providing two methods: Flush and Close.
func (c *parquetColumnifier) Flush() error {
    if err := c.w.WriteStop(); err != nil {
        return err
    }
    return c.w.PFile.Close()
}
Another merit of having a Close method is that we can use the familiar idiom of opening/closing files with a defer statement.
c, err := columnifier.NewColumnifier(*schemaType, *schemaFile, *recordType, *output)
if err != nil {
    log.Fatalf("Failed to init: %v\n", err)
}
defer c.Close()
It means we don't need to worry about error handling when Close returns an error, since in most cases there is nothing we can do. But we might want to handle an error returned by Flush.
Parquet has various configuration options for tuning the encoded data,
e.g. compression codec (like Snappy), page size, dictionary encoding, ...
https://parquet.apache.org/documentation/latest/
https://github.com/apache/parquet-format/blob/master/Encodings.md
$ columnify -schemaType avro -schemaFile rails-log.avsc -recordType jsonl crash.json.log
PAR1panic: reflect: call of reflect.Value.Type on zero Value [recovered]
panic: send on closed channel
goroutine 11 [running]:
github.com/xitongsys/parquet-go/writer.(*ParquetWriter).flushObjs.func1.1(0xc002ebe4f0, 0xc00002c900)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:208 +0xc8
panic(0x9dd420, 0xc002b18ba0)
/usr/lib/go-1.14/src/runtime/panic.go:969 +0x166
reflect.Value.Type(0x0, 0x0, 0x0, 0x4, 0x9b5420)
/usr/lib/go-1.14/src/reflect/value.go:1872 +0x183
github.com/xitongsys/parquet-go/common.SizeOf(0x0, 0x0, 0x0, 0xc00329ce30)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/common/common.go:503 +0x5a
github.com/xitongsys/parquet-go/layout.TableToDataPages(0xc000038680, 0xc000002000, 0x1, 0x22, 0xc0000ce0e8, 0x2, 0x48b)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/layout/page.go:89 +0x259
github.com/xitongsys/parquet-go/writer.(*ParquetWriter).flushObjs.func1(0xc002ebe4f0, 0xc00002c900, 0xc0000203c0, 0xc002e76cc8, 0xc000189ae8, 0x1, 0x1, 0x0, 0x5b6, 0x0)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:230 +0x44d
created by github.com/xitongsys/parquet-go/writer.(*ParquetWriter).flushObjs
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:195 +0x242
I could not understand the error logs above.
I use the following avro schema:
{
"name": "RailsAccessLog",
"type": "record",
"fields": [
{
"name": "container_id",
"type": "string"
},
{
"name": "container_name",
"type": "string"
},
{
"name": "source",
"type": "string"
},
{
"name": "log",
"type": "string"
},
{
"name": "__fluentd_address__",
"type": "string"
},
{
"name": "__fluentd_host__",
"type": "string"
},
{
"name": "role",
"type": "string"
},
{
"name": "host",
"type": "string"
},
{
"name": "severity",
"type": "string"
},
{
"name": "status",
"type": "string"
},
{
"name": "db",
"type": "float"
},
{
"name": "view",
"type": "float"
},
{
"name": "duration",
"type": "float"
},
{
"name": "method",
"type": "string"
},
{
"name": "path",
"type": "string"
},
{
"name": "remote_ip",
"type": "string"
},
{
"name": "agent",
"type": "string"
},
{
"name": "params",
"type": "string"
},
{
"name": "tag",
"type": "string"
},
{
"name": "time",
"type": "string"
}
]
}
BTW, the following command does not panic:
$ columnify -recordType jsonl -schemaType avro -schemaFile rails-log.avsc =(head -n 9 crash.json.log)
Any help?
I am working on converting JSONL log files to Parquet format to improve log search capabilities.
To achieve this, I've been exploring tools compatible with Fluentd, and I came across the s3 plugin, which uses the columnify tool for conversion.
In my quest to find the most efficient conversion method, I conducted tests using two different approaches:
1. The pandas and pyarrow libraries for JSONL to Parquet conversion.
2. The columnify tool for the same purpose.
I used a JSONL file containing approximately 27,000 log lines, all structured similarly to the following example:
{ "stdouttype": "stdout", "letter": "F", "level": "info", "f_t": "2023-09-21T16:35:46.608Z", "ist_timestamp": "21 Sept 2023, 22:05:46 GMT+5:30", "f_s": "service-name", "f_l": "module_name", "apiName": "<name_of_api>", "workflow": "some-workflow-qwewqe-0", "step": "somestepid0", "sender": "234567854321345670", "traceId": "23456785432134567_wertjlwqkjrtljjjwelfe0", "sid": "", "request": "<stringified-request-body>", "response": "<stringified-request-body>"}
For both methods, I generated GZIP-compressed JSON and Parquet files. The resulting Parquet files were:
main_file.log.gz.parquet (101KB), generated by the Python script (pandas + pyarrow)
main_file1.columnify.parquet (8.7MB), generated by columnify
As shown, the Parquet file generated by columnify is significantly larger than the one created by the Python script.
Upon further investigation, I discovered that the default row_group_size and page_size settings differ between pyarrow (used in the Python script) and columnify (utilizing parquet-go):
In Pyarrow:
Default row_group_size: 1MB (maximum of 64MB)
Default page_size: 1MB
In columnify (parquet-go):
Default row_group_size: 128MB
Default page_size: 8KB
So, I adjusted the page_size for columnify to 1MB (-parquetPageSize 1048576), which reduced the file size from 8.7MB to 438KB. However, modifying the row_group_size option did not result in further size reduction.
I'm seeking help in understanding why the columnify-generated Parquet file remains larger than the one generated by the Python script using pyarrow. Is this due to limitations in the parquet-go library, or am I missing something in my configuration?
Kindly give some insights, advice, or recommendations on optimizing the Parquet conversion process with columnify.
LINKS
pyarrow doc ref. for page_size and row_group_size
pyarrow default row group size value
pyarrow default page_size
parquet-go row_group_size and page_size
columnify/cmd/columnify/columnify.go, lines 37 to 46 (at 21bb871)
log.Fatalf calls os.Exit(1), which terminates the program immediately without running deferred functions. So when WriteFromFiles() fails, the WriteCloser will not be closed.
If we support multiple input files (#9), at least we can support input-file-based concurrent processing; just sharding the Columnifier instances will help minimize processing time. We could also support more flexible control over output files by doing a shuffle before the reduce.
Retry implementing the Arrow-record-typed intermediate representation, once more! I think we can gradually switch to it in the steps below:
prototyping for PoC: (various inputs) -> map's -> arrow -> map's -> json -> parquet
remove parquet writing side Go intermediates: (various inputs) -> map's -> arrow -> json -> parquet
remove input side Go intermediates: (various inputs) -> arrow -> json -> parquet
ideal: (various inputs) -> arrow -> parquet (parquet-go)
Does the package name parquetgo come from https://github.com/xitongsys/parquet-go? I think another name is better, for the reasons below:
- It is easily confused with the dependency parquet-go.
- The go part of the name is obvious in a Go project.
is betterhi, i'm using columnify with avro input record. and found that records of logical types(around datetime: date, timemillis, timemicros, timestampmillis, timestampmicros) are broken.
for example, the sample data gets result below.
# jsonl input(OK)
$ ./columnify -schemaType avro -schemaFile columnifier/testdata/schema/logicals.avsc -recordType jsonl columnifier/testdata/record/logicals.jsonl > jsonl.parquet
$ parquet-tools cat -json jsonl.parquet
{"date":1,"timemillis":1000,"timemicros":1000000,"timestampmillis":1000,"timestampmicros":1000000}
{"date":2,"timemillis":2000,"timemicros":2000000,"timestampmillis":2000,"timestampmicros":2000000}
{"date":3,"timemillis":3000,"timemicros":3000000,"timestampmillis":3000,"timestampmicros":3000000}
{"date":4,"timemillis":4000,"timemicros":4000000,"timestampmillis":4000,"timestampmicros":4000000}
{"date":5,"timemillis":5000,"timemicros":5000000,"timestampmillis":5000,"timestampmicros":5000000}
{"date":6,"timemillis":6000,"timemicros":6000000,"timestampmillis":6000,"timestampmicros":6000000}
{"date":7,"timemillis":7000,"timemicros":7000000,"timestampmillis":7000,"timestampmicros":7000000}
{"date":8,"timemillis":8000,"timemicros":8000000,"timestampmillis":8000,"timestampmicros":8000000}
{"date":9,"timemillis":9000,"timemicros":9000000,"timestampmillis":9000,"timestampmicros":9000000}
{"date":10,"timemillis":10000,"timemicros":10000000,"timestampmillis":10000,"timestampmicros":10000000}
# avro input(NG)
$ ./columnify -schemaType avro -schemaFile columnifier/testdata/schema/logicals.avsc -recordType avro columnifier/testdata/record/logicals.avro > avro.parquet
$ parquet-tools cat -json avro.parquet
{"date":1970,"timemillis":1000000000,"timemicros":1000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":2000000000,"timemicros":2000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":3000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":4000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":5000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":6000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":7000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":8000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":9000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":10000000000,"timestampmillis":1970,"timestampmicros":1970}
This behavior seems to come from goavro, which formats logical types into Go native types (using the time package).
Though I don't have a good idea for reformatting Go native types back into Parquet primitive types before writing :(