reproio / columnify
Make record oriented data to columnar format.
License: Apache License 2.0
A panic occurred when executing the following command.
$ columnify -schemaType avro -schemaFile ~/wc/src/github.com/reproio/columnify/examples/schema/primitives.avsc =(echo '{"boolean": false, "string": "foobar"}') > a
panic: interface conversion: interface {} is nil, not string [recovered]
panic: send on closed channel
goroutine 11 [running]:
github.com/xitongsys/parquet-go/writer.(*ParquetWriter).flushObjs.func1.1(0xc00007dd20, 0xc00002c900)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:208 +0xc8
panic(0x9ef140, 0xc00018fc20)
/usr/lib/go-1.14/src/runtime/panic.go:969 +0x166
github.com/xitongsys/parquet-go/encoding.WritePlainBYTE_ARRAY(0xc00007de40, 0x1, 0x1, 0x0, 0x0, 0x0)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/encoding/encodingwrite.go:107 +0x1bc
github.com/xitongsys/parquet-go/encoding.WritePlain(0xc00007de40, 0x1, 0x1, 0x6, 0xc0002f5460, 0x0, 0xc00007de40)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/encoding/encodingwrite.go:51 +0x10a
github.com/xitongsys/parquet-go/layout.(*Page).EncodingValues(0xc000038600, 0xc00007de40, 0x1, 0x1, 0x1, 0xc00007de40, 0x0)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/layout/page.go:166 +0xd9
github.com/xitongsys/parquet-go/layout.(*Page).DataPageCompress(0xc000038600, 0x1, 0x0, 0x0, 0xc0002f5460)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/layout/page.go:183 +0x187
github.com/xitongsys/parquet-go/layout.TableToDataPages(0xc000038500, 0xc000002000, 0x1, 0x0, 0x1, 0xc0000e8070, 0xc00031e250)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/layout/page.go:116 +0x812
github.com/xitongsys/parquet-go/writer.(*ParquetWriter).flushObjs.func1(0xc00007dd20, 0xc00002c900, 0xc0000b0280, 0xc0002f5320, 0xc00031e248, 0x1, 0x1, 0x0, 0x1, 0x0)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:230 +0x44d
created by github.com/xitongsys/parquet-go/writer.(*ParquetWriter).flushObjs
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:195 +0x242
This also produces a file that is invalid as Parquet.
The invalid file causes an error on Athena when we use this command as a compressor for fluent-plugin-s3.
But the following command does not panic:
$ columnify -schemaType avro -schemaFile ~/wc/src/github.com/reproio/columnify/examples/schema/primitives.avsc =(echo '{"boolean": false, "string": "foobar"}')
PAR12020/05/01 14:53:35 Failed to write: interface conversion: interface {} is nil, not string
Move examples/ to testdata/ so it can be referenced from *_test.go files.
The current version requires a lot of RAM. I'd like to investigate the biggest memory consumers and try to reduce usage.
ref. https://golang.org/pkg/runtime/pprof/
columnify doesn't show command-line help when it runs without options.
$ ./columnify
2020/04/21 16:07:39 Failed to init: open : no such file or directory
I think showing command-line help would be better than outputting the above error message.
I guess users want to aggregate multiple input files into one or a small number of files to compact the data size.
We have been running columnify as part of a fluent-plugin-s3 compressor (msgpack to parquet) these days.
But columnify caused out-of-memory errors in some environments.
So I want to estimate the memory usage of columnify.
Or is there a way to keep the memory usage constant regardless of the file size?
In my research, memory usage is proportional to the file size.
Large files use 5 to 6 times the file size in memory.
For example, a large msgpack file (223MB) consumes about 1.3GB of memory (the ps command's RSS).
I want to use the msgpack format with fluent-plugin-s3's <format> section. But I cannot see *.parquet files on S3.
I can create a small reproducible case for this issue.
$ ~/wc/src/github.com/reproio/columnify/columnify -recordType msgpack -schemaType avro -schemaFile nginx_access_log.avsc x.msgpack
PAR12020/05/11 17:32:33 Failed to write: reflect: call of reflect.Value.Elem on uint8 Value
$ echo $?
1
I use the following Avro schema definition:
{
"name": "NginxAccessLog",
"type": "record",
"fields": [
{
"name": "container_id",
"type": "string"
},
{
"name": "container_name",
"type": "string"
},
{
"name": "source",
"type": "string"
},
{
"name": "log",
"type": "string"
},
{
"name": "__fluentd_address__",
"type": "string"
},
{
"name": "__fluentd_host__",
"type": "string"
},
{
"name": "role",
"type": "string"
},
{
"name": "host",
"type": "string"
},
{
"name": "remote_ip",
"type": "string"
},
{
"name": "request_host",
"type": "string"
},
{
"name": "user",
"type": "string"
},
{
"name": "method",
"type": "string"
},
{
"name": "path",
"type": "string"
},
{
"name": "status",
"type": "string"
},
{
"name": "size",
"type": "string"
},
{
"name": "referer",
"type": "string"
},
{
"name": "agent",
"type": "string"
},
{
"name": "duration",
"type": "string"
},
{
"name": "country_code",
"type": "string"
},
{
"name": "token_param",
"type": "string"
},
{
"name": "idfv_param",
"type": "string"
},
{
"name": "tag",
"type": "string"
},
{
"name": "time",
"type": "string"
}
]
}
It possibly overwrites a concrete error with nil. In such a case, we lose the chance to detect errors from the writer func. For now, as @itchyny said, I changed this place not to capture the return value of Close() if err is already set.
#18 (comment)
record.FormatToMap() just formats input data into map[string]interface{} without checking the schema, but the following processes then fail (e.g. #24).
So we should ensure schema checks at that point to detect mismatches. That's natural, because the conversion to Parquet is schema-based!
To visualize test coverage
In #33, optional nested record types probably break the output data. The current implementation has some issues around nullable types.
avroTypeToArrowType(f.Type) returns three values, one of which is nullable as a boolean.
func avroFieldToArrowField(f avro.RecordField) (*arrow.Field, error) {
    t, nullable, err := avroTypeToArrowType(f.Type)
    if err != nil {
        return nil, err
    }
    return &arrow.Field{
        Name:     f.Name,
        Type:     t,
        Nullable: nullable,
    }, nil
}
I wonder if this functionality could be split into an independent function.
if t.UnionType != nil {
    if nt := isNullableField(t.UnionType); nt != nil {
        if nested, _, err := avroTypeToArrowType(*nt); err == nil {
            return nested, true, nil
        }
    }
}
I confirmed isNullableField() returns *avro.AvroType. I think this function has two responsibilities: deciding null-or-not and retrieving the AvroType.
func isNullableField(ut *avro.UnionType) *avro.AvroType {
    if len(*ut) == 2 && *(*ut)[0].PrimitiveType == *avro.ToPrimitiveType(avro.AvroPrimitiveType_Null) {
        // According to the Avro spec, "null" is usually listed first
        return &(*ut)[1]
    }
    return nil
}
By refactoring these functions, could you provide a simpler API? I don't know the background of the type conversion, so tell me if I've misunderstood.
Using Fluentd with Columnify, running on Kubernetes to push logs to S3 in Parquet.
Issues arise when trying to use an Avro schema with a nested map or a list of strings.
According to the kubernetes metadata filter plugin's docs (https://github.com/ViaQ/fluent-plugin-kubernetes_metadata_input/blob/master/README.md#kubernetes-labels-and-annotations), I believe columnify would receive a nested array of strings.
I got the following schema to work so far, but am still running into issues upstream with Athena.
{
"type": "record",
"name": "record",
"fields": [
{
"name": "message",
"type": "string"
},
{
"name": "logtag",
"type": "string"
},
{
"name": "stream",
"type": "string"
},
{
"name": "time",
"type": ["null", "string"]
},
{
"name": "docker",
"type": {
"type": "record",
"name": "docker",
"fields": [
{
"name": "container_id",
"type": "string"
}
]
}
},
{
"name": "kubernetes",
"type": {
"type": "record",
"name": "kubernetes",
"fields": [
{
"name": "container_name",
"type": "string"
},
{
"name": "host",
"type": ["null", "string"]
},
{
"name": "master_url",
"type": ["null", "string"]
},
{
"name": "namespace_name",
"type": ["null", "string"]
},
{
"name": "pod_id",
"type": ["null", "string"]
},
{
"name": "pod_name",
"type": ["null", "string"]
},
{
"name": "labels",
"type": {
"type": "array",
"items": {
"name": "label",
"type" : "record",
"fields": [ {
"type": ["null", "string"]
} ]
}
}
}
]
}
}
]
}
Specifically, the issue is with the labels part. I think this should work instead of the record with an array of records:
{
"name": "labels",
"type":{
"type": "array",
"items":{
"type":"list",
"values":"string"
}
}
}
Example data before fluentd filters:
{
"stream": "stdout",
"logtag": "F",
"message": " Tue Nov 22 23:51:12 UTC 2022 Found redis master (172.20.203.160)",
"time": 1669161072.283568,
"docker": {
"container_id": "29e32e64745530e7a1c5e9174f9e266e051707aec6a76d4556871532157a"
},
"kubernetes": {
"container_name": "split-brain-fix",
"namespace_name": "argocd",
"pod_name": "argocd-redis-ha-server-0",
"container_image": "docker.io/library/redis:6.2.6-alpine",
"container_image_id": "docker.io/library/redis@sha256:132337b9d7744ffee4fae83fde53c3530935ad3ba528b7110f2d805f55cbf5",
"pod_id": "ee5af2aa-14d8-446c-9755-",
"pod_ip": "10.64.124.43",
"host": "ip-10-64-116-85.us-west-2.compute.internal",
"labels": {
"app": "redis-ha",
"argocd-redis-ha": "replica",
"controller-revision-hash": "argocd-redis-ha-server-7cd67685d6",
"release": "argocd",
"statefulset_kubernetes_io/pod-name": "argocd-redis-ha-server-0"
},
"master_url": "https://172.20.0.1:443/api",
"namespace_id": "f3d1453d-d227-4c54-982a-457d5b99cc8b",
"namespace_labels": {
"app_kubernetes_io/managed-by": "Helm",
"kubernetes_io/metadata_name": "argocd"
}
},
"tag": "kubernetes.var.log.containers.argocd-redis-ha-server-0_argocd_split-brain-fix-29e32e64745530e7a171e08251707aec6a76d4556871532157a.log"
}
But I get this error:
2022-11-23 00:29:43 +0000 [warn]: #0 [out_s3] got unrecoverable error in primary and no secondary error_class=Fluent::UnrecoverableError error="failed to execute columnify command. stdout= stderr=panic: runtime error: index out of range [0] with length 0\n\ngoroutine 1 [running]:\ngithub.com/xitongsys/parquet-go/layout.PagesToChunk(0x10ea6d8, 0x0, 0x0, 0x20)\n\t/home/runner/go/pkg/mod/github.com/xitongsys/[email protected]/layout/chunk.go:24 +0x90d\ngithub.com/xitongsys/parquet-go/writer.(*ParquetWriter).Flush(0xc00074fcc0, 0xc00010e001, 0x10, 0xa3abc0)\n\t/home/runner/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:285 +0x3d5\ngithub.com/xitongsys/parquet-go/writer.(*ParquetWriter).WriteStop(0xc00074fcc0, 0x0, 0xc00010e050)\n\t/home/runner/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:120 +0x37\ngithub.com/reproio/columnify/columnifier.(*parquetColumnifier).Close(0xc00000c6c0, 0xc00086fe18, 0x9d5cff)\n\t/home/runner/work/columnify/columnify/columnifier/parquet.go:122 +0x2e\nmain.columnify.func1(0xc2d760, 0xc00000c6c0, 0xc00086fec0)\n\t/home/runner/work/columnify/columnify/cmd/columnify/columnify.go:24 +0x35\nmain.columnify(0xc2d760, 0xc00000c6c0, 0xc00013a0f0, 0x1, 0x1, 0x0, 0x0)\n\t/home/runner/work/columnify/columnify/cmd/columnify/columnify.go:36 +0xe2\nmain.main()\n\t/home/runner/work/columnify/columnify/cmd/columnify/columnify.go:71 +0x545\n status=#<Process::Status: pid 48 exit 2>"
2022-11-23 00:29:43 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.1.0/gems/fluent-plugin-s3-1.7.2/lib/fluent/plugin/s3_compressor_parquet.rb:60:in `compress'
2022-11-23 00:29:43 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.1.0/gems/fluent-plugin-s3-1.7.2/lib/fluent/plugin/out_s3.rb:352:in `write'
2022-11-23 00:29:43 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin/output.rb:1180:in `try_flush'
2022-11-23 00:29:43 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin/output.rb:1501:in `flush_thread_run'
2022-11-23 00:29:43 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin/output.rb:501:in `block (2 levels) in start'
2022-11-23 00:29:43 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/3.1.0/gems/fluentd-1.15.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
The billing plan for reproio does not include access to GitHub Actions. Please contact the organization owner or billing manager for questions about the current billing plan
The current reproio org doesn't have the privilege to execute GitHub Actions. This repo will be open sourced, so would it be better to start using Actions at that point?
Columnify uses Apache Arrow Schema/Record as an intermediate representation between various input formats and outputs (currently only Parquet). Arrow is powerful: it offers fast memory access and supports a columnar-like representation. But the Go implementation is not complete yet; e.g., the Arrow record type doesn't support some types in its subfields, so it's not yet fully applicable to Columnify. Additionally, the Arrow Go implementation doesn't support rich data conversion the way PyArrow does. Finally, Columnify currently uses only the Arrow Schema as a necessary intermediate.
So we have some options to tackle these problems, like:
As a trivial topic, gocredits doesn't work on the Go Arrow dependency. #4
Following the README instructions, I'm trying to install columnify with go get, but I'm getting the following error:
GO111MODULE=off go get github.com/reproio/columnify/cmd/columnify
package github.com/vmihailenco/msgpack/v4: cannot find package "github.com/vmihailenco/msgpack/v4" in any of:
/usr/local/Cellar/go/1.14.4/libexec/src/github.com/vmihailenco/msgpack/v4 (from $GOROOT)
/Users/XXX/.go/src/github.com/vmihailenco/msgpack/v4 (from $GOPATH)
2023-05-22 14:08:44 +0000 [error]: config error file="/etc/td-agent/td-agent.conf" error_class=Fluent::ConfigError error="'columnify' utility must be in PATH for -h compression"
/root/go/bin/columnify
Usage of columnify: columnify [-flags] [input files]
-output string
path to output file; default: stdout
-parquetCompressionCodec string
parquet compression codec, default: SNAPPY (default "SNAPPY")
-parquetPageSize int
parquet file page size, default: 8kB (default 8192)
-parquetRowGroupSize int
parquet file row group size, default: 128MB (default 134217728)
-recordType string
record data format type, [avro|csv|jsonl|ltsv|msgpack|tsv] (default "jsonl")
-schemaFile string
path to schema file
-schemaType string
schema type, [avro|bigquery]
Hi,
I am trying to use columnify to generate output in Parquet format to send data to Azure using the Azure plugin. However, I get the error below when I run the fluentd container. I would appreciate it if someone could help me here.
2024-07-09 22:25:41 +0000 [warn]: #0 got unrecoverable error in primary and no secondary error_class=Fluent::UnrecoverableError error="failed to execute columnify command. stdout= stderr=Failed to close columnifier: interface conversion: interface {} is nil, not string
2024/07/09 22:25:41 Failed to write: interface conversion: interface {} is nil, not string\n status=#<Process::Status: pid 1191 exit 1>"
2024-07-09 22:25:41 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluent-plugin-azurestorage-gen2-0.3.5/lib/fluent/plugin/out_azurestorage_gen2.rb:834:in `compress'
2024-07-09 22:25:41 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluent-plugin-azurestorage-gen2-0.3.5/lib/fluent/plugin/out_azurestorage_gen2.rb:165:in `write'
2024-07-09 22:25:41 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:1225:in `try_flush'
2024-07-09 22:25:41 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
2024-07-09 22:25:41 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
2024-07-09 22:25:41 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.16.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
Go 1.13 introduced the error-wrapping %w feature. I like chaining errors since it's useful for debugging and logging. How about using wrapped errors?
See also: https://blog.golang.org/go1.13-errors
Currently, the Columnifier interface has a Flush method but no Close method.
type Columnifier interface {
    Write(data []byte) error
    WriteFromFiles(paths []string) error
    Flush() error
}
I confirmed that parquetColumnifier.Flush has two functions: flushing and closing. In my view, Flush should not imply Close; Flush should write synchronously, or ensure data is written to the data store.
So, I recommend providing two methods: Flush and Close.
func (c *parquetColumnifier) Flush() error {
    if err := c.w.WriteStop(); err != nil {
        return err
    }
    return c.w.PFile.Close()
}
Another merit of having a Close method is that we can use the familiar idiom of opening/closing files with a defer statement.
c, err := columnifier.NewColumnifier(*schemaType, *schemaFile, *recordType, *output)
if err != nil {
    log.Fatalf("Failed to init: %v\n", err)
}
defer c.Close()
It means we don't need to worry about error handling when Close returns an error, since in most cases there is nothing we can do. But we might want to handle an error returned by Flush.
Parquet has various configuration options for tuning the encoded data,
e.g. compression codec (like Snappy), page size, dictionary encoding, ...
https://parquet.apache.org/documentation/latest/
https://github.com/apache/parquet-format/blob/master/Encodings.md
$ columnify -schemaType avro -schemaFile rails-log.avsc -recordType jsonl crash.json.log
PAR1panic: reflect: call of reflect.Value.Type on zero Value [recovered]
panic: send on closed channel
goroutine 11 [running]:
github.com/xitongsys/parquet-go/writer.(*ParquetWriter).flushObjs.func1.1(0xc002ebe4f0, 0xc00002c900)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:208 +0xc8
panic(0x9dd420, 0xc002b18ba0)
/usr/lib/go-1.14/src/runtime/panic.go:969 +0x166
reflect.Value.Type(0x0, 0x0, 0x0, 0x4, 0x9b5420)
/usr/lib/go-1.14/src/reflect/value.go:1872 +0x183
github.com/xitongsys/parquet-go/common.SizeOf(0x0, 0x0, 0x0, 0xc00329ce30)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/common/common.go:503 +0x5a
github.com/xitongsys/parquet-go/layout.TableToDataPages(0xc000038680, 0xc000002000, 0x1, 0x22, 0xc0000ce0e8, 0x2, 0x48b)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/layout/page.go:89 +0x259
github.com/xitongsys/parquet-go/writer.(*ParquetWriter).flushObjs.func1(0xc002ebe4f0, 0xc00002c900, 0xc0000203c0, 0xc002e76cc8, 0xc000189ae8, 0x1, 0x1, 0x0, 0x5b6, 0x0)
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:230 +0x44d
created by github.com/xitongsys/parquet-go/writer.(*ParquetWriter).flushObjs
/home/kenji/go/pkg/mod/github.com/xitongsys/[email protected]/writer/writer.go:195 +0x242
I could not understand the error logs above.
I use the following avro schema:
{
"name": "RailsAccessLog",
"type": "record",
"fields": [
{
"name": "container_id",
"type": "string"
},
{
"name": "container_name",
"type": "string"
},
{
"name": "source",
"type": "string"
},
{
"name": "log",
"type": "string"
},
{
"name": "__fluentd_address__",
"type": "string"
},
{
"name": "__fluentd_host__",
"type": "string"
},
{
"name": "role",
"type": "string"
},
{
"name": "host",
"type": "string"
},
{
"name": "severity",
"type": "string"
},
{
"name": "status",
"type": "string"
},
{
"name": "db",
"type": "float"
},
{
"name": "view",
"type": "float"
},
{
"name": "duration",
"type": "float"
},
{
"name": "method",
"type": "string"
},
{
"name": "path",
"type": "string"
},
{
"name": "remote_ip",
"type": "string"
},
{
"name": "agent",
"type": "string"
},
{
"name": "params",
"type": "string"
},
{
"name": "tag",
"type": "string"
},
{
"name": "time",
"type": "string"
}
]
}
BTW, the following command does not panic:
$ columnify -recordType jsonl -schemaType avro -schemaFile rails-log.avsc =(head -n 9 crash.json.log)
Any help?
I am working on converting JSONL log files to Parquet format to improve log search capabilities.
To achieve this, I've been exploring tools compatible with Fluentd, and I came across the s3 plugin, which uses the columnify tool for conversion.
In my quest to find the most efficient conversion method, I conducted tests using two different approaches:
1. The pandas and pyarrow libraries for JSONL to Parquet conversion.
2. The columnify tool for the same purpose.
I used a JSONL file containing approximately 27,000 log lines, all structured similarly to the following example:
{ "stdouttype": "stdout", "letter": "F", "level": "info", "f_t": "2023-09-21T16:35:46.608Z", "ist_timestamp": "21 Sept 2023, 22:05:46 GMT+5:30", "f_s": "service-name", "f_l": "module_name", "apiName": "<name_of_api>", "workflow": "some-workflow-qwewqe-0", "step": "somestepid0", "sender": "234567854321345670", "traceId": "23456785432134567_wertjlwqkjrtljjjwelfe0", "sid": "", "request": "<stringified-request-body>", "response": "<stringified-request-body>"}
For both methods, I generated GZIP-compressed JSON and Parquet files. The resulting Parquet files were:
main_file.log.gz.parquet (101KB), generated by the Python script (pandas + pyarrow)
main_file1.columnify.parquet (8.7MB), generated by columnify
As shown, the Parquet file generated by columnify is significantly larger than the one created by the Python script.
Upon further investigation, I discovered that the default row_group_size and page_size settings differ between pyarrow (used in the Python script) and columnify (utilizing parquet-go):
In Pyarrow:
Default row_group_size: 1MB (maximum of 64MB)
Default page_size: 1MB
In columnify (parquet-go):
Default row_group_size: 128MB
Default page_size: 8KB
So, I adjusted the page_size for columnify to 1MB (-parquetPageSize 1048576), which reduced the file size from 8.7MB to 438KB. However, modifying the row_group_size option did not result in further size reduction.
I'm seeking help in understanding why the columnify-generated Parquet file remains larger than the one generated by the Python script using pyarrow. Is this due to limitations in the parquet-go library, or am I missing something in my configuration?
Kindly give some insights, advice, or recommendations on optimizing the Parquet conversion process with columnify.
LINKS
pyarrow doc ref. for page_size and row_group_size
pyarrow default row group size value
pyarrow default page_size
parquet-go row_group_size and page_size
columnify/cmd/columnify/columnify.go, lines 37 to 46 (at 21bb871)
log.Fatalf calls os.Exit(1), which terminates the program immediately without running deferred functions. So when WriteFromFiles() fails, the WriteCloser will not be closed.
If we support multiple input files (#9), at least we can support input-file-based concurrent processing; just sharding the Columnifier instances will help minimize processing time. We could also support more flexible control over output files by doing a shuffle before the reduce.
Retry implementing the Arrow-record-typed intermediate representation, once more! I think we can gradually switch to it in the steps below:
prototyping for PoC: (various inputs) -> map's -> arrow -> map's -> json -> parquet
remove parquet writing side Go intermediates: (various inputs) -> map's -> arrow -> json -> parquet
remove input side Go intermediates: (various inputs) -> arrow -> json -> parquet
ideal: (various inputs) -> arrow -> parquet (parquet-go)
Does the package name parquetgo come from https://github.com/xitongsys/parquet-go? I think another name is better, for the reasons below:
- It is easily confused with the dependency parquet-go.
- The go part of the name is obvious in a Go project.
is betterhi, i'm using columnify with avro input record. and found that records of logical types(around datetime: date, timemillis, timemicros, timestampmillis, timestampmicros) are broken.
for example, the sample data gets result below.
# jsonl input(OK)
$ ./columnify -schemaType avro -schemaFile columnifier/testdata/schema/logicals.avsc -recordType jsonl columnifier/testdata/record/logicals.jsonl > jsonl.parquet
$ parquet-tools cat -json jsonl.parquet
{"date":1,"timemillis":1000,"timemicros":1000000,"timestampmillis":1000,"timestampmicros":1000000}
{"date":2,"timemillis":2000,"timemicros":2000000,"timestampmillis":2000,"timestampmicros":2000000}
{"date":3,"timemillis":3000,"timemicros":3000000,"timestampmillis":3000,"timestampmicros":3000000}
{"date":4,"timemillis":4000,"timemicros":4000000,"timestampmillis":4000,"timestampmicros":4000000}
{"date":5,"timemillis":5000,"timemicros":5000000,"timestampmillis":5000,"timestampmicros":5000000}
{"date":6,"timemillis":6000,"timemicros":6000000,"timestampmillis":6000,"timestampmicros":6000000}
{"date":7,"timemillis":7000,"timemicros":7000000,"timestampmillis":7000,"timestampmicros":7000000}
{"date":8,"timemillis":8000,"timemicros":8000000,"timestampmillis":8000,"timestampmicros":8000000}
{"date":9,"timemillis":9000,"timemicros":9000000,"timestampmillis":9000,"timestampmicros":9000000}
{"date":10,"timemillis":10000,"timemicros":10000000,"timestampmillis":10000,"timestampmicros":10000000}
# avro input(NG)
$ ./columnify -schemaType avro -schemaFile columnifier/testdata/schema/logicals.avsc -recordType avro columnifier/testdata/record/logicals.avro > avro.parquet
$ parquet-tools cat -json avro.parquet
{"date":1970,"timemillis":1000000000,"timemicros":1000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":2000000000,"timemicros":2000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":3000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":4000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":5000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":6000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":7000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":8000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":9000000000,"timestampmillis":1970,"timestampmicros":1970}
{"date":1970,"timemillis":0,"timemicros":10000000000,"timestampmillis":1970,"timestampmicros":1970}
This behavior seems to come from goavro, which formats logical types into Go native types (using the time package).
Though I don't have a good idea for reformatting Go native types back into Parquet primitive types before writing :(