353solutions / carrow

Go wrapper for Apache Arrow C++

Home Page: https://arrow.apache.org/

License: BSD 3-Clause "New" or "Revised" License

Makefile 2.57% C++ 36.70% C 4.98% Go 48.54% Dockerfile 1.43% Vim Script 0.53% Python 5.26%
arrow golang cpp apache dataframe analytics

carrow's Introduction

carrow - Go bindings to Apache Arrow via C++-API

THIS PROJECT IS NO LONGER MAINTAINED, HAVE A LOOK HERE

Access to Arrow C++ from Go.

FAQ

We'd like to share memory between Go & Python, and the current Arrow bindings for Go don't have that option. Since pyarrow uses the C++ Arrow library under the hood, we can just pass a pointer.

Also, the C++ Arrow library is better maintained than the Go one and has more features.

Development

  • The C++ glue layer is in carrow.cc; we try to keep it simple and unaware of Go.
  • See the Dockerfile & the build-docker target in the Makefile for how to set up an environment
  • See Dockerfile.test for running tests (used in CircleCI)

Debugging

We have Go, C++ & Python code working together. See the Dockerfile for how we get the dependencies and set up the environment for development.

Example using gdb

$ PKG_CONFIG_PATH=/opt/miniconda/lib/pkgconfig LD_LIBRARY_PATH=/opt/miniconda/lib  go build ./_misc/wtr.go
$ LD_LIBRARY_PATH=/opt/miniconda/lib gdb wtr
(gdb) break carrow.cc:write_table
(gdb) run -db /tmp/plasma.db -id 800

carrow's Issues

Shared memory allocator

We'd like an option to allocate Arrow objects in shared memory so that we can have fast IPC.

Expose CSV options to Go

The arrow CSV package has the following options: ReadOptions, ParseOptions & ConvertOptions. Expose them to Go.
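One possible shape for the Go side is a set of plain structs, one per C++ options type. The sketch below is only an assumption about how this could look: the Go type names, the CSVOptions wrapper and DefaultCSVOptions are hypothetical, the fields are a small subset modeled on the C++ option names, and the default values are illustrative placeholders rather than values read from Arrow.

```go
package main

import "fmt"

// ReadOptions is a hypothetical Go mirror of a subset of arrow::csv::ReadOptions.
type ReadOptions struct {
	UseThreads bool  // modeled on use_threads
	BlockSize  int32 // modeled on block_size (bytes)
	SkipRows   int32 // modeled on skip_rows
}

// ParseOptions is a hypothetical Go mirror of a subset of arrow::csv::ParseOptions.
type ParseOptions struct {
	Delimiter byte // modeled on delimiter
	Quoting   bool // modeled on quoting
	QuoteChar byte // modeled on quote_char
}

// ConvertOptions is a hypothetical Go mirror of a subset of arrow::csv::ConvertOptions.
type ConvertOptions struct {
	CheckUTF8  bool     // modeled on check_utf8
	NullValues []string // modeled on null_values
}

// CSVOptions groups the three option sets so a reader function could take one argument.
type CSVOptions struct {
	Read    ReadOptions
	Parse   ParseOptions
	Convert ConvertOptions
}

// DefaultCSVOptions returns illustrative defaults (placeholders, not Arrow's own).
func DefaultCSVOptions() CSVOptions {
	return CSVOptions{
		Read:    ReadOptions{UseThreads: true, BlockSize: 1 << 20},
		Parse:   ParseOptions{Delimiter: ',', Quoting: true, QuoteChar: '"'},
		Convert: ConvertOptions{CheckUTF8: true},
	}
}

func main() {
	opts := DefaultCSVOptions()
	opts.Parse.Delimiter = '\t' // read a tab-separated file
	opts.Read.SkipRows = 1      // skip a leading comment row
	fmt.Printf("%+v\n", opts)
}
```

The glue layer would still need to copy these fields into the C++ option structs; that part is not shown here.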

Generate Append functions by template

This requires a bit more refactoring on the C side in order to get a generic structure for this type of function.

For example:

func (b *TimestampArrayBuilder) Append(val time.Time) error {
	r := C.array_builder_append_timestamp(b.ptr, C.longlong(val.UnixNano()))
	if r.err != nil {
		return errFromResult(r)
	}
	return nil
}
Template parameters for this example:

  • Go type: time.Time
  • C function: C.array_builder_append_timestamp, which should be passed a pointer (the C function will cast it)
  • Mutator: val.UnixNano, which should be defined as a mutator function for the template

Buffered Append

Currently Append calls the C append function for every value. I suggest a "buffered append": Append adds the value to a Go-side buffer, and only when that buffer is full does it call the matching AppendValues C++ method.
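A minimal sketch of the buffering logic, kept free of cgo so it runs on its own. BufferedInt64Builder and flushFunc are hypothetical names; flushFunc stands in for the single cgo call that would hand a whole batch to the C++ AppendValues method.

```go
package main

import "fmt"

// flushFunc stands in for the cgo call that passes a batch to C++ (hypothetical).
// It must not keep a reference to vals, because the buffer is reused.
type flushFunc func(vals []int64) error

// BufferedInt64Builder collects values in Go and crosses into C once per batch.
type BufferedInt64Builder struct {
	buf   []int64
	flush flushFunc
}

// NewBufferedInt64Builder creates a builder that flushes every size values.
func NewBufferedInt64Builder(size int, flush flushFunc) *BufferedInt64Builder {
	return &BufferedInt64Builder{buf: make([]int64, 0, size), flush: flush}
}

// Append stores val and flushes only when the buffer is full.
func (b *BufferedInt64Builder) Append(val int64) error {
	b.buf = append(b.buf, val)
	if len(b.buf) < cap(b.buf) {
		return nil
	}
	return b.Flush()
}

// Flush sends any buffered values in one call and resets the buffer.
// It must be called before the array is finished so no values are lost.
func (b *BufferedInt64Builder) Flush() error {
	if len(b.buf) == 0 {
		return nil
	}
	err := b.flush(b.buf)
	b.buf = b.buf[:0]
	return err
}

func main() {
	calls := 0
	b := NewBufferedInt64Builder(4, func(vals []int64) error {
		calls++ // a real implementation would call into C++ here
		return nil
	})
	for i := int64(0); i < 10; i++ {
		if err := b.Append(i); err != nil {
			panic(err)
		}
	}
	if err := b.Flush(); err != nil {
		panic(err)
	}
	fmt.Println("values: 10, flush calls:", calls)
}
```

The trade-off is that errors from the C++ side surface at flush time, not at the Append call that added the offending value.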

Expose Apache Arrow Flight to Carrow

In the Arrow 0.14 release, Flight was introduced as a new data interoperability technology to deliver a high-performance protocol for big data transfer for analytics across different applications and platforms.

We want this because we are already bound to the new Arrow package and can use it.

Dependabot can't resolve your Go dependency files

Dependabot can't resolve your Go dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:


If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

Type specific builders

Currently we have FloatArrayBuilder and IntArrayBuilder. They should become Float64ArrayBuilder and Int64ArrayBuilder. The Int64ArrayBuilder Append method should take an int64 value.

Single Logger for system

We would like to use only the Go logger in all source code (Go and C++) so we have consistency.

Improvement: The C++ code should also be able to compile stand-alone, without Go (using an #ifdef to switch to a C++ logger).

Python bindings proof of concept

We'd like to be able to use carrow from Python. A proof of concept is:

  • Create sub directory for Python bindings
  • Create a Python extension module that uses carrow. It should expose only one function build() that will return a pyarrow.Table
    • build can re-use the code in example_test.go
  • Convert the returned table to a pandas.DataFrame

plasma testing

Write some tests for plasma. We'll need to start a plasma store (problematic in docker) and then read/write tables to it.

Link statically with arrow

Currently we link with arrow shared library:

$ go test -c
$ ldd carrow.test | grep arrow
	libarrow.so.13 => /usr/lib/x86_64-linux-gnu/libarrow.so.13 (0x00007f0acdaee000)

We should statically link with libarrow.a to enable easier distribution, see #11

Organize C++ code

Move all C++ code to a directory (lib?) and build the .a from there. Have a single api.h header file.

Support Time arrays

We'd like to support time.Time arrays. We need to decide which Arrow type it maps to, probably Timestamp with nanosecond resolution.

Build artifacts with docker

Currently make artifact-linux-x86_64 builds the artifact on the host machine. IMO we should build it in a Docker container (either with a shared volume or by copying the result out of the build container).

@yonidavidson WDYT?

Regarding sharing memory between Go and Python

The README says

Why Not Apache Arrow for Go? We'd like to share memory between Go & Python, and the current Arrow bindings for Go don't have that option. Since pyarrow uses the C++ Arrow library under the hood, we can just pass a pointer. Also, the C++ Arrow library is better maintained than the Go one and has more features.

With the new C Data Interface, it is probably relatively straightforward to share memory between Go and Python (in-process) http://arrow.apache.org/blog/2020/05/03/introducing-arrow-c-data-interface/

Table metadata

We'd like to be able to read/write the table metadata pandas is using

Benchmarks

Have you run any benchmarks on this? I've been working through a similar Arrow implementation where all these cgo calls in loops are adding significant latency.

Better error reporting

Currently when there's an error at the C++ level we return an invalid value, however we don't have the message context.

Since we're a Go library :) I suggest that API functions return a struct with a value and an error message. If the error message is empty, there was no error. Otherwise it contains the error message (probably from status.message()).

carrow.cc

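#include <cstring>  // strdup

struct ptr_reply {
    void *ptr;
    const char *error;  // heap-allocated copy; the caller must free it
};

// ...
if (!status.ok()) {
    // Copy the message: the string from status.message() is not guaranteed
    // to outlive this call, and the Go side releases the copy with C.free.
    return ptr_reply{nullptr, strdup(status.message().c_str())};
}
// ...
// No error
return ptr_reply{table, nullptr};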

carrow.go

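// Needs "errors" and "unsafe" imported, and <stdlib.h> in the cgo preamble.
out := C.some_function()
if out.error != nil {
    msg := C.GoString(out.error)
    C.free(unsafe.Pointer(out.error)) // C.free takes an unsafe.Pointer
    return nil, errors.New(msg)       // do not use msg as a format string
}
return NewTableFromPtr(out.ptr), nil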

Make carrow "go get"able

Currently the build process uses make to generate libcarrow.a and then go to build. This won't work when someone tries to go get the package.

Find a way to make carrow "go get"able.
