apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Home Page: https://arrow.apache.org/

License: Apache License 2.0

Makefile 0.06% C++ 53.72% C 2.27% Shell 0.81% Ruby 3.40% Batchfile 0.06% CMake 1.43% Python 6.20% Java 14.83% FreeMarker 0.01% JavaScript 0.27% HTML 0.01% TypeScript 2.10% Lua 0.02% Go 11.13% Awk 0.01% Meson 0.09% Dockerfile 0.27% Thrift 0.07% R 3.23%

arrow's Introduction

Apache Arrow

Powering In-Memory Analytics

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.

Major components of the project include:

Arrow is an Apache Software Foundation project. Learn more at arrow.apache.org.

What's in the Arrow libraries?

The reference Arrow libraries contain many distinct software components:

  • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types
  • Fast, language agnostic metadata messaging layer (using Google's Flatbuffers library)
  • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files
  • IO interfaces to local and remote filesystems
  • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)
  • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)
  • Conversions to and from other in-memory data structures
  • Readers and writers for various widely-used file formats (such as Parquet, CSV)

Implementation status

The official Arrow libraries in this repository are in different stages of implementing the Arrow format and related features. See our current feature matrix on git main.

How to Contribute

Please read our latest project contribution guide.

Getting involved

Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we'd be happy to have you involved:

arrow's People

Contributors

alamb, alenkaf, andygrove, assignuser, bkietz, cyb70289, dependabot[bot], domoritz, emkornfield, fsaintjacques, jonkeane, jorgecarleitao, jorisvandenbossche, kou, kszucs, lidavidm, liyafan82, maplefu, nealrichardson, nevi-me, paleolimbot, pcmoritz, pitrou, raulcd, thisisnic, tianchen92, wesm, westonpace, xhochy, zeroshade

arrow's Issues

C++: establish a basic function evaluation model

We don't yet have a working model for executing functions on Arrow objects that require memory allocations (either for intermediate storage or for results).

For example:

F(input1, input2) -> output

Since the output type may depend on the input types, F in general will need to be able to request memory to be allocated. This will have to be figured out (some sort of ArrowContext with a memory pool attached) in due course. For example:

F(ctx, input1, input2) -> output

Where ctx is a Context object providing access to memory resources and other system state (user-configurable).
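To make the F(ctx, input1, input2) -> output shape concrete, here is a minimal Python sketch. It is not Arrow's API: the Context class, its allocate method, and the add function are hypothetical names chosen for illustration, and the type promotion rule (any float64 input yields a float64 output) is an assumption. The point is only that the function derives the output type from its inputs and then asks ctx for the memory.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical ctx: hands out memory and tracks how much was requested."""
    bytes_allocated: int = 0

    def allocate(self, typecode: str, length: int) -> array:
        # Zero-filled storage for `length` items of the requested type.
        out = array(typecode, bytes(length * array(typecode).itemsize))
        self.bytes_allocated += len(out) * out.itemsize
        return out


def add(ctx: Context, input1: array, input2: array) -> array:
    """F(ctx, input1, input2) -> output, element-wise addition."""
    if len(input1) != len(input2):
        raise ValueError("inputs must have the same length")
    # The output type depends on the input types: any float64 input promotes.
    out_type = "d" if "d" in (input1.typecode, input2.typecode) else "q"
    output = ctx.allocate(out_type, len(input1))
    for i in range(len(input1)):
        output[i] = input1[i] + input2[i]
    return output


ctx = Context()
ints = array("q", [1, 2, 3])
floats = array("d", [0.5, 0.5, 0.5])
result = add(ctx, ints, floats)
print(result.typecode, list(result), ctx.bytes_allocated)
```

Because the caller owns ctx, it can inspect or cap memory use without F knowing where the memory comes from.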

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-34. Please see the migration documentation for further details.

Set up Travis CI

I will ask INFRA to enable Travis CI for the repo, and then will propose a patch that runs the C++ test suite to start (unless some kind soul beats me to it with a Java patch). We can use a build matrix with one build per language SDK (so gcc and clang for arrow-cpp) to start.

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-8. Please see the migration documentation for further details.

C++: Algorithms for using nested types in a hash table context

Computing hash values (and performing equality comparisons) for top-level slots in nested-type data (for example, computing DISTINCT on a List<List<Int32>>, related: ARROW-32) can be fairly complex. Additionally, value slots at any level of the type tree can be null.

We should explore various algorithms for their performance and memory use in practical settings. For example, one can compute a contiguous "record" / byte array resulting from a depth-first traversal of a single value slot for the purposes of computing a hash value or comparing with another slot. If anyone has other ideas from past experiences I would be keen to learn more.
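The depth-first "record" idea can be sketched in a few lines of Python. This is an illustration of one possible encoding, not a proposal for the C++ implementation: the tag bytes and the encode_slot and distinct helpers are hypothetical, and plain Python lists stand in for List<List<Int32>> slots.

```python
import struct


def encode_slot(value) -> bytes:
    """Depth-first encoding of one value slot into a contiguous byte record.

    Tag bytes keep null, integer and list slots distinct, and list lengths are
    written explicitly so that [[1], [2]] and [[1, 2]] encode differently.
    """
    if value is None:
        return b"N"
    if isinstance(value, int):
        return b"I" + struct.pack("<i", value)  # Int32, little-endian
    if isinstance(value, list):
        parts = [b"L", struct.pack("<I", len(value))]
        parts.extend(encode_slot(child) for child in value)
        return b"".join(parts)
    raise TypeError(f"unsupported slot type: {type(value).__name__}")


def distinct(slots):
    """DISTINCT over top-level slots, preserving first-seen order."""
    seen = set()
    out = []
    for slot in slots:
        key = encode_slot(slot)  # hashing and equality both use the record
        if key not in seen:
            seen.add(key)
            out.append(slot)
    return out


column = [[[1, 2], None], [[1], [2]], [[1, 2], None], None, [[1], [2]], None]
print(distinct(column))
```

The trade-off is memory: every slot is copied into a temporary record, which is exactly the kind of cost the performance exploration above would need to measure.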

Reporter: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-38. Please see the migration documentation for further details.

Error when run maven install

When I run Maven to install, I get the following problem:

Failed to execute goal
org.apache.drill.tools:drill-fmpp-maven-plugin:1.4.0:generate
(generate-fmpp) on project vector: Execution generate-fmpp of goal
org.apache.drill.tools:drill-fmpp-maven-plugin:1.4.0:generate failed:
Plugin org.apache.drill.tools:drill-fmpp-maven-plugin:1.4.0 or one of its
dependencies could not be resolved: Failure to find
org.freemarker:freemarker:jar:2.3.24-SNAPSHOT in
http://repository.apache.org/snapshots was cached in the local repository

By the way, I just cloned the repo and ran mvn clean install.

Dev mailing list link:
http://mail-archives.apache.org/mod_mbox/arrow-dev/201602.mbox/%3CCAABsKVCSEULDTL2hoANL8-wrWMDO8%3Dgv0RFmSQMXt3MdiqUcPw%40mail.gmail.com%3E

Environment: Ubuntu Maven 3.2
Reporter: AllenFang
Assignee: Liwei Lin(Inactive) / @lw-lin

Note: This issue was originally created as ARROW-5. Please see the migration documentation for further details.

C++: Externalize memory allocations and add a MemoryPool abstract interface to builder classes

Currently memory allocations in the C++ implementation are all internal / self-contained, but in practice the array builders will need to integrate with some other memory allocator / tracker.

I'll define a basic abstract memory pool interface and add an example default implementation (with a resizable arrow::Buffer subclass that requests memory from the pool) with all memory managed internally.
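As a rough picture of that plan, the following Python sketch shows an abstract pool, a default implementation that manages memory internally, and a resizable buffer that obtains its memory from the pool. All class and method names here (MemoryPool, DefaultMemoryPool, PoolBuffer, bytes_allocated) are assumptions made for the example; the issue does not specify the C++ interface.

```python
from abc import ABC, abstractmethod


class MemoryPool(ABC):
    """Hypothetical abstract pool interface; names are illustrative only."""

    @abstractmethod
    def allocate(self, nbytes: int) -> bytearray: ...

    @abstractmethod
    def free(self, buf: bytearray) -> None: ...

    @abstractmethod
    def bytes_allocated(self) -> int: ...


class DefaultMemoryPool(MemoryPool):
    """Example default implementation that manages all memory internally."""

    def __init__(self) -> None:
        self._allocated = 0

    def allocate(self, nbytes: int) -> bytearray:
        self._allocated += nbytes
        return bytearray(nbytes)

    def free(self, buf: bytearray) -> None:
        self._allocated -= len(buf)

    def bytes_allocated(self) -> int:
        return self._allocated


class PoolBuffer:
    """Resizable buffer that requests its memory from a pool."""

    def __init__(self, pool: MemoryPool) -> None:
        self._pool = pool
        self._data = pool.allocate(0)

    def resize(self, nbytes: int) -> None:
        new_data = self._pool.allocate(nbytes)
        keep = min(nbytes, len(self._data))
        new_data[:keep] = self._data[:keep]  # preserve existing contents
        self._pool.free(self._data)
        self._data = new_data

    def __len__(self) -> int:
        return len(self._data)


pool = DefaultMemoryPool()
buf = PoolBuffer(pool)
buf.resize(64)
buf.resize(256)
print(len(buf), pool.bytes_allocated())
```

A builder that takes a MemoryPool in its constructor could then be pointed at a tracking or capped allocator without changing the builder itself.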

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-19. Please see the migration documentation for further details.

[C++] Implement delimited file scanner / CSV reader

Like Parquet and binary file formats, text files will be an important data medium for converting to and from in-memory Arrow data.

pandas has some (Apache-compatible) business logic we can learn from here (as one of the gold-standard CSV readers in production use)

https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.h
https://github.com/pydata/pandas/blob/master/pandas/parser.pyx

While the pandas reader is very fast, this should be largely written from scratch to target the Arrow memory layout, but we can reuse certain aspects like the tokenizer DFA (which originally came from the Python interpreter csv module implementation)

https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L713
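For readers unfamiliar with the tokenizer-as-DFA approach, here is a deliberately simplified Python sketch. It is not the pandas or CPython tokenizer: the four states and the tokenize function are assumptions for illustration, and it ignores carriage returns, escape characters, comments, and type conversion.

```python
from enum import Enum, auto


class State(Enum):
    START_FIELD = auto()
    IN_FIELD = auto()
    IN_QUOTED_FIELD = auto()
    QUOTE_IN_QUOTED_FIELD = auto()


def tokenize(text: str, delimiter: str = ",", quote: str = '"'):
    """Split delimited text into rows of string fields, one character at a time."""
    rows, row, field = [], [], []
    state = State.START_FIELD

    def end_field():
        row.append("".join(field))
        field.clear()

    def end_row():
        end_field()
        rows.append(list(row))
        row.clear()

    for ch in text:
        if state is State.IN_QUOTED_FIELD:
            if ch == quote:
                state = State.QUOTE_IN_QUOTED_FIELD
            else:
                field.append(ch)  # delimiters and newlines are literal here
            continue
        if state is State.QUOTE_IN_QUOTED_FIELD and ch == quote:
            field.append(quote)  # a doubled quote is an escaped quote
            state = State.IN_QUOTED_FIELD
            continue
        # Remaining cases: START_FIELD, IN_FIELD, or just past a closing quote.
        if ch == delimiter:
            end_field()
            state = State.START_FIELD
        elif ch == "\n":
            end_row()
            state = State.START_FIELD
        elif ch == quote and state is State.START_FIELD:
            state = State.IN_QUOTED_FIELD
        else:
            field.append(ch)
            state = State.IN_FIELD
    if field or row or state is not State.START_FIELD:
        end_row()  # flush a final row that lacks a trailing newline
    return rows


print(tokenize('a,"b,1",c\n"say ""hi""",,\n'))
```

An Arrow-targeted reader would presumably append each field to a column builder instead of collecting Python strings, which is why the issue suggests rewriting everything around the state machine.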

Reporter: Wes McKinney / @wesm
Assignee: Antoine Pitrou / @pitrou

Note: This issue was originally created as ARROW-25. Please see the migration documentation for further details.

Add Python library build toolchain

I will be working on a patch to make the initial Arrow C++ libarrow.so library callable from Cython (http://cython.org/) extensions. For the uninitiated, Cython is the modern "gold standard" for interoperability with C++ libraries. I have used it recently to create Kudu's Python client (https://github.com/apache/incubator-kudu/tree/master/python/kudu).

A significant amount of Python "glue code" will be needed to interoperate with pandas, NumPy, and other standard Python libraries (in addition to Python's built-in scalar data types), but these will be the subject of many follow up JIRAs.

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-7. Please see the migration documentation for further details.

Python: basic PyList <-> Arrow marshaling code

To start, this will encompass only lists containing flat data (i.e. no list or dict members). For example:

[1, 2, None, 3, 4]
['foo', 'bar', None, 'baz']
[True, False, False, True, None]

Type inference on more complicated Python lists (as occurring from json.loads, for example) is something we can explore later (if it is deemed useful).
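A minimal sketch of what flat-list marshaling involves, written in plain Python rather than against any Arrow binding. The infer_type and marshal helpers and the type names they return are hypothetical; the validity bitmap uses least-significant-bit-first ordering, and the data is left as a Python list instead of being packed into a typed buffer.

```python
def infer_type(values):
    """Infer a single flat type name from a Python list, ignoring None."""
    kinds = set()
    for v in values:
        if v is None:
            continue
        if isinstance(v, bool):  # check bool first: bool is a subclass of int
            kinds.add("bool")
        elif isinstance(v, int):
            kinds.add("int64")
        elif isinstance(v, str):
            kinds.add("string")
        else:
            raise TypeError(f"unsupported element: {v!r}")
    if len(kinds) > 1:
        raise TypeError(f"mixed element types: {sorted(kinds)}")
    return kinds.pop() if kinds else "null"


def marshal(values):
    """Return (type_name, validity_bitmap, data) for a flat Python list.

    Bit i of the bitmap (least-significant bit first within each byte) is 1
    when slot i is non-null. Null slots keep a placeholder of None in data.
    """
    type_name = infer_type(values)
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return type_name, bytes(bitmap), list(values)


for example in ([1, 2, None, 3, 4],
                ["foo", "bar", None, "baz"],
                [True, False, False, True, None]):
    print(marshal(example))
```

Rejecting mixed-type lists keeps the first version simple; richer inference (for example over json.loads output) would replace infer_type without touching the bitmap logic.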

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-31. Please see the migration documentation for further details.

C++: Logical chunked arrays / columns: conforming to fixed chunk sizes

Implementing algorithms on large arrays assembled in physical chunks is problematic if:

  • The chunks are not all the same size (except possibly the last chunk, which can be smaller). In that case, retrieving a particular element is in general an O(log num_chunks) operation

  • The chunk size is not a power of 2. Computing an integer modulus by a value that is not a power of 2 requires more clock cycles (in other words, i % p is much more expensive to compute than i & (p - 1), but the latter only works if p is a power of 2)

Most of the Arrow data adapters will either feature contiguous data (1 chunk, so chunking is not an issue) or a regular chunk size, so this isn't as much of an immediate concern, but we should consider making it a contract of any data structures dealing in multiple arrays.

In general, it would be preferable to reorganize memory into either a regular chunk size (like 64K values per chunk) or a contiguous memory region. I would prefer for the moment not to invest significant energy in writing algorithms for data with irregular chunk sizes.
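The difference between the two lookup strategies can be seen in a short Python sketch. The function names are hypothetical, and Python will not show the clock-cycle gap that matters in C++; the sketch only illustrates why irregular chunks need a search while power-of-2 chunks need just a shift and a mask.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_irregular(chunk_lengths, i):
    """Find (chunk, offset) for logical index i with arbitrary chunk sizes.

    Needs a binary search over cumulative offsets: O(log num_chunks).
    """
    ends = list(accumulate(chunk_lengths))  # would be precomputed once in practice
    chunk = bisect_right(ends, i)
    start = ends[chunk - 1] if chunk else 0
    return chunk, i - start


def locate_power_of_two(log2_chunk_size, i):
    """Find (chunk, offset) when every chunk holds 2**log2_chunk_size values.

    A shift and a mask replace the search, the division and the modulus.
    """
    mask = (1 << log2_chunk_size) - 1
    return i >> log2_chunk_size, i & mask


print(locate_irregular([3, 5, 2], 4))
print(locate_power_of_two(16, 200000))  # 64K values per chunk
```

With 64K-value chunks, index 200000 falls in chunk 3 at offset 3392, matching what i // 65536 and i % 65536 would give.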

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-39. Please see the migration documentation for further details.

C++: Add schema adapter routines for converting flat Parquet schemas to in-memory Arrow schemas

Depends on ARROW-21. There are many cases to implement here (e.g. 3 kinds of Parquet list encoding that all map onto the single Arrow list type) so I may create follow-up JIRAs for some of the less-commonly-occurring Parquet schemas.

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-22. Please see the migration documentation for further details.

Set some vector fields to default access level for Drill compatibility

Drill has methods for serializing and deserializing value vectors, and this functionality was not included in the code checked into the Arrow repo. It was kept in Drill. For these Drill methods to work, some private vector fields need to be made package-level so that the Drill helper methods can access them.

Reporter: Steven Phillips / @StevenMPhillips
Assignee: Steven Phillips / @StevenMPhillips

Note: This issue was originally created as ARROW-17. Please see the migration documentation for further details.

C++: add hash table classes for fixed-byte-width and variable-length primitive arrays

Some of the most important in-memory analytical routines are:

  • unique
  • contains / is-in
  • match (see base::match in R or pandas.match)
  • dictionary-encode (aka "factorize" as I call it)
  • frequency-table (unique + observed frequencies)

At their lowest level these all involve either iterative hash table construction or construct-then-sweep (for the routines involving multiple arrays, e.g. contains/match).

Hashing more complex Arrow structures (e.g. structs or lists-of-structs) will require some more thought, but performing these operations on fixed-byte-width types and lists thereof (e.g. strings as List) is fairly straightforward and can be used to craft more complex hash-table based routines.
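To show how these routines share one hash-table core, here is a small Python sketch that uses the built-in dict as the hash table. The function names follow the list above, but the signatures and the choice to represent nulls as None are assumptions for illustration, not the planned C++ classes.

```python
def dictionary_encode(values):
    """Return (dictionary, indices); None marks a null slot in indices."""
    table = {}       # value -> dictionary index, built iteratively
    dictionary = []
    indices = []
    for v in values:
        if v is None:
            indices.append(None)
            continue
        code = table.get(v)
        if code is None:
            code = table[v] = len(dictionary)
            dictionary.append(v)
        indices.append(code)
    return dictionary, indices


def unique(values):
    return dictionary_encode(values)[0]


def frequency_table(values):
    dictionary, indices = dictionary_encode(values)
    counts = [0] * len(dictionary)
    for code in indices:
        if code is not None:
            counts[code] += 1
    return list(zip(dictionary, counts))


def match(needles, haystack):
    """Construct-then-sweep: build a table over haystack, then probe it."""
    table = {}
    for pos, v in enumerate(haystack):
        table.setdefault(v, pos)  # keep the first position of each value
    return [table.get(v) for v in needles]


data = ["a", "b", "a", None, "c", "b"]
print(dictionary_encode(data))
print(frequency_table(data))
print(match(["b", "z"], ["a", "b", "c", "b"]))
```

unique and frequency-table fall out of dictionary-encode, while match (and contains / is-in, which is match tested against None) uses the two-phase build-then-probe pattern.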

Reporter: Wes McKinney / @wesm
Assignee: Antoine Pitrou / @pitrou

Note: This issue was originally created as ARROW-32. Please see the migration documentation for further details.

[C++] Consider adding a scalar type object model

Just did this on the Python side. In later analytics routines, passing in scalar values (example: Array + Scalar) requires some kind of container. Some systems, like the R language, solve this problem with length-1 arrays, but we should do some analysis of use cases and figure out what will work best for Arrow.
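The two options mentioned here (a dedicated scalar container versus an R-style length-1 array) can be compared with a small Python sketch. The Scalar class and the add function are hypothetical and use plain lists of optional integers in place of Arrow arrays.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Scalar:
    """Hypothetical container for a single, possibly null, value."""
    value: Optional[int]

    @property
    def is_valid(self) -> bool:
        return self.value is not None


def add(array: List[Optional[int]], other: Union[Scalar, List[Optional[int]]]):
    """Array + Scalar or Array + Array; null in either operand gives null."""
    if isinstance(other, Scalar):
        rhs = [other.value] * len(array)  # broadcast the scalar
    elif len(other) == 1:
        rhs = other * len(array)          # R-style length-1 array
    elif len(other) == len(array):
        rhs = other
    else:
        raise ValueError("length mismatch")
    return [None if a is None or b is None else a + b for a, b in zip(array, rhs)]


print(add([1, None, 3], Scalar(10)))
print(add([1, None, 3], [10]))
print(add([1, 2], Scalar(None)))
```

One consideration the sketch exposes: with length-1 arrays the function cannot tell a deliberate scalar from an array that merely happens to have one element, whereas a Scalar type makes the intent explicit.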

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-47. Please see the migration documentation for further details.
