GithubHelp home page GithubHelp logo

doytsujin / xbow Goto Github PK

View Code? Open in Web Editor NEW

This project forked from seertaak/xbow

0.0 1.0 0.0 74 KB

range-v3 views and actions for Arrow C++

CMake 0.55% Python 87.69% Shell 0.10% C++ 11.43% Jupyter Notebook 0.23%

xbow's Introduction

xbow - range-v3 views and actions for Arrow C++

Why xbow?

Because I want to specify arrow types for the rows of an Arrow table like this:

// Below: define a "record", which is just a value type which in addition wraps
// Boost.Hana's BOOST_HANA_DEFINE_STRUCT macro. This macro generates machinery
// which can be used to retrieve field names and types, as well as accessors that
// can read fields based on a compile-time integer (or string) representing the
// index (or name, respectively) of the field. Note that this dereferencing takes
// place in constant time. It's the same as accessing a struct member - it is *not*
// like accessing an item in a hash map (say). This reflective metadata is sufficient
// do the (ugly) Arrow C++ API raw lifting. For example, for the first field,
// an arrow Int32Builder will be used for writing data (what we call the *action*
// side, since the side-effect of an arrow being created is effected), and a
// Int32Array will be used for reading (the *views* side: here we take an regular
// arrow array or table, and wrap it with a range-v3 compatible range). For the
// "name" field, a arrow::StringBuilder and arrow::StringArray will be used, respectively.
// And so on.
def_record(suspect,
    (int32_t, id),
    (string, name),
    (double, salary)
);

... instead of like this. And I want to use it like this:

auto suspects = vector<suspect>{
    {1, "Keyser Söze"s, 1000.0}, {2, "Kobayashi"s, 500.0},   {3, "Fred Fenster"s, 500.0},
    {4, "Jack Baer"s, 100.0},    {5, "Dean Keaton"s, 800.0}, {6, "Michael McManus"s, 100.0},
};

print("input rows: {}\n", suspects);

// below: traverse the rows, changing name to upper case, skipping every other element,
// cycling over rows so that they repeat and taking exactly 20 of these rows, and finally
// this range-v3 range is converted to a regular arrow table.
// This code shows that we can take a bog-standard range-v3 pipeline and convert it to
// an arrow object. This could later, for example, be written to a parquet file (WIP).
const auto table = suspects 
                     | views::transform([](auto&& p) -> suspect& {
                        boost::to_upper(p.name);
                        return p;
                       })
                     | views::stride(2)  
                     | views::cycle 
                     | views::take(20) 
                     | xb::arrow::actions::to_table;

// below: note that to_range<suspect>(table) returns a range consisting of chunks, each of which
// is also a range. These chunks correspond exactly to the actual low-level chunks in the
// arrow file. We view::join this range to produce a single, collated range, which we then
// convert to a std::vector<suspect> for the sole reason of printing. Note how easily we
// taped together the chunks! Normally this would be two-level for loop involving laborious
// extraction of each field, type-casting, urgh!
print("round-tripped rows: {}\n",
xb::arrow::views::to_range<suspect>(table) | views::join | to<vector<suspect>>);

...which should produce output like this:

input rows: {person[id: 1, name: Keyser Söze, salary: 1000], person[id: 2, name: Kobayashi, salary: 500], person[id: 3, name: Fred Fenster, salary: 500], person[id: 4, name: Jack Baer, salary: 100], person[id: 5, name: Dean Keaton, salary: 800], person[id: 6, name: Michael McManus, salary: 100]}
round-tripped rows: {person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500], person[id: 5, name: DEAN KEATON, salary: 800], person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500], person[id: 5, name: DEAN KEATON, salary: 800], person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500], person[id: 5, name: DEAN KEATON, salary: 800], person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500], person[id: 5, name: DEAN KEATON, salary: 800], person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500], person[id: 5, name: DEAN KEATON, salary: 800], person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500], person[id: 5, name: DEAN KEATON, salary: 800], person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500]}

(It does.)

Usage

Usage examples can be found in the tests, and in the examples directory.

Defining Custom Tables

Use the def_record macro to define the types of rows of tables you want to manipulate in xbow. This leverages Boost.Hana to create static introspection machinery, which is used by xbow to figure out the right concrete types for the Arrow array objects.

Example:

def_record(point,
    (double, x),
    (double, y)
);

def_record(person,
    (int64_t, id),
    (xb::date, dob),
    (string, name),
    (double, cost),
    (array<double, 3>, cost_components),
    (point, p)  
);

Notes:

  1. In principle, the record types can be nested as in the example above. This will work for the purpose of Arrow schema generation. Aggregates such as arrays and vectors are also supported. (Again, in principle you can use arbitrary element types, with the caveat below.)
  2. However, round-tripping (actual range use) currently doesn't support structures within structures. It's planned, and it's certainly possible using the reflective setup.
  3. Currently only numeric and string types are supported, with partial support for boolean arrays (stored as bitmasks).
  4. Partial support, with more planned, for dates and timestamp - these map from their Arrow concrete types to suitable modern C++ counterparts: hinnant's date and std's chrono libraries, respectively.

Actions: Ranges To Arrow Objects

One use-case is that we already have a range or container with objects of our (appropriately defined - see above) row type, and our goal is to create the concomitant Arrow objects. This is accomplished using xbows's actions, so called because they are side-effecting: they create new Arrow objects, which involves memory allocation.

// convert a scalar range (i.e. a single column) into the appropriate Arrow array
// object.
template <range R>
auto to_arrow_array(R&& input) -> meta::array_obj<xb::meta::element_t<R>>;

// convert a range over an appropriately defined 
template <ranges::range Range>
auto to_table(Range&& rows);

Views: Views on Arrow Ojects

In the other direction, we start with an Arrow array or table, and want to query or manipulate the data using range-v3 ranges. In this case, no allocation is necessary, and so we adopt the view terminology.

// given a suitably defined record type (see above) convert an Arrow Table to a range 
// over items of the row type.
template <xb::meta::record R>
auto to_range(const std::shared_ptr<::arrow::Table>& table);

// convert a concrete Arrow array object to a view, no memory is allocated. (e.g.
// strings are represented as std::string_views whose pointer points to the Arrow
// memory itself, rather than creating copies. The range produced by this version 
// does not allow null elements. If you have missing elements, use 
// optional_array_view, below.
template <typename A>
auto array_view(const std::shared_ptr<A>& array) -> decltype(auto);

// Same as above, but allowing missing elements. Let's say you create A has element
// "StringType". Then optional_array_view will produce a range of 
// std::optional<std::string>. That is to say, nullity/non-nullity are represented
// using std::optional.
template <typename A>
auto optional_array_view(const std::sharedhttps://en.cppreference.com/w/cpp/chrono_ptr<A>& array) -> decltype(auto);

// Arrow tables are created out of so-called chunked arrays. They're usually pretty
// horrible to deal with. This helper function creates a range of ranges; the outer
// range obviously representing the chunks.
template <ranges::semiregular T>
auto chunked_array_view(const ::arrow::ChunkedArray& chunks) -> unspecified;

Pre-requisites

  • Python 3.9.1
  • Clang 11 or higher (will probably all work with recent gcc)

Note: installing clang is not enough. You need to use update-alternatives on Ubuntu to make sure that c++ and cc call clang++-11 and clang-11, respectively.

Installation

  1. Install conan; this is used to download C++ libraries such as boost and arrow:
    # install conan 
    pip install conan
    # install third-party libs in debug mode using conan...
    conan install . -if build/debug --profile:host=conan/profile/tools --profile:build=conan/profile/debug --build missing
    # ...same for release mode.
    conan install . -if build/release --profile:host=conan/profile/tools --profile:build=conan/profile/release --build missing
    
    conan export conan/libs/arrow.py xbow/stable
    
    # for CLion conan plugin.
    ln conan/profile/debug ~/.conan/profiles/debug
    ln conan/profile/release ~/.conan/profiles/release
  2. Install C++ build tools (CMake, Ninja, and nasm) and any libraries mentioned in this project's conanfile.py.

Development Environment

To make the build tools available in your shell (rather than just in the compile step):

source build/debug/activate.sh

For example:

# "BEFORE"
cmake
# =>
# Could not find command-not-found database. Run 'sudo apt update' to populate it.
# cmake: command not found

# now set up shell
source build/debug/activate.sh 

# "AFTER": -- cmake, nasm, ninja work.
cmake
# =>
# Usage
# 
#   cmake [options] <path-to-source>
#   cmake [options] <path-to-existing-build>
#   cmake [options] -S <path-to-source> -B <path-to-build>
# 
# Specify a source directory to (re-)generate a build system for it in the
# current working directory.  Specify an existing build directory to
# re-generate its build system.

Run 'cmake --help' for more information.

Building

# build debug
conan build . -bf build/debug

# build release
conan build . -bf build/debug

# Alternative, using only cmake:

# build debug
source build/debug/activate.sh    # ... to be able to use cmake.
cd build/debug
cmake .

# build release
source build/release/activate.sh    # ... to be able to use cmake.
cd build/release
cmake .

Future Plans

Much tighter integration with Python. Use PEP484 and python's AST introspection to generate the appropriate record class from a Python specification. Even more ambitious: allow using Numba or Cython to generate UDFs that compiled inline with the aggregation code to allow ultimate performance and flexibility.

  1. A python file defines functions and types which are passed to functions and template types.
    @data
    class employee:
        id: int
        team_id: int
        name: str
        dob: date
        salary: float
    
    employees = dataframe[person].read_parquet('/data/employees.parquet')
    E = employees.accessors
    
    
    while requesthttps://en.cppreference.com/w/cpp/chrono:
       team_id = request.next().team_id
       team_costs = (
             employees
                 .filter(E.team_id == team_id)
                 .accumulate(0, std.plus, E.salary)
       )

Note: first build must occur in xbow env on the command line. That way, the python version is found by pybind11 and set in CMake cache. After that, refreshing the CMake tab and building work in clion.

xbow's People

Contributors

seertaak avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.