
facebookincubator / nimble


New file format for storage of large columnar datasets.

License: Apache License 2.0

CMake 2.36% Makefile 0.19% C++ 94.78% Shell 0.04% Python 2.63%

nimble's Introduction

The Nimble File Format

Nimble (formerly known as “Alpha”) is a new columnar file format for large datasets created by Meta. Nimble is meant to be a replacement for file formats such as Apache Parquet and ORC.

Watch this talk to learn more about Nimble’s internals.

Nimble has the following design principles:

  • Wide: Nimble is better suited for workloads that are wide in nature, such as tables with thousands of columns (or streams) which are commonly found in feature engineering workloads and training tables for machine learning.

  • Extensible: Since the state-of-the-art in data encoding evolves faster than the file layout itself, Nimble decouples stream encoding from the underlying physical layout. Nimble allows encodings to be extended by library users and recursively applied (cascading).

  • Parallel: Nimble is meant to fully leverage highly parallel hardware by providing encodings which are SIMD and GPU friendly. Although this is not implemented yet, we intend to expose metadata to allow developers to better plan decoding trees and schedule kernels without requiring the data streams themselves.

  • Unified: More than a specification, Nimble is a product. We strongly discourage developers from (re-)implementing Nimble’s spec, to prevent the environmental fragmentation issues observed with similar projects in the past. We encourage developers to leverage the single unified Nimble library, and to create high-quality bindings to other languages as needed.
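
To make the idea of cascading concrete, here is a minimal sketch in Python. It is not Nimble's API (the library itself is C++), and the names `rle_encode`, `delta_encode`, and `cascade_encode` are hypothetical, chosen only for illustration. The sketch run-length encodes a stream and then applies a second encoding (delta) to the resulting sub-stream of run values, which is one way an encoding can be applied to the output of another.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a sequence into (run_values, run_lengths)."""
    run_values, run_lengths = [], []
    for value, group in groupby(values):
        run_values.append(value)
        run_lengths.append(sum(1 for _ in group))
    return run_values, run_lengths


def rle_decode(run_values, run_lengths):
    """Expand each run value by its run length."""
    out = []
    for value, length in zip(run_values, run_lengths):
        out.extend([value] * length)
    return out


def delta_encode(values):
    """Store the first value followed by successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out = []
    total = 0
    for i, d in enumerate(deltas):
        total = d if i == 0 else total + d
        out.append(total)
    return out


def cascade_encode(values):
    """Hypothetical cascade: RLE first, then delta on the run values."""
    run_values, run_lengths = rle_encode(values)
    return {"values": delta_encode(run_values), "lengths": run_lengths}


def cascade_decode(encoded):
    """Undo the cascade in reverse order: delta first, then RLE."""
    return rle_decode(delta_decode(encoded["values"]), encoded["lengths"])


stream = [100, 100, 100, 101, 101, 102, 102, 102, 102]
encoded = cascade_encode(stream)
print(encoded)
print(cascade_decode(encoded) == stream)
```

In a real cascade the run-lengths sub-stream could be encoded further as well; the sketch stops at one level to stay short.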

Nimble has the following features:

  • Lighter metadata organization to efficiently support thousands to tens of thousands of columns and streams.

  • Uses FlatBuffers instead of Thrift/Protobuf to more efficiently access large metadata sections.

  • Uses block encoding instead of stream encoding to provide predictable memory usage while decoding/reading.

  • Supports many encodings out of the box; additional encodings can be added as needed.

  • Supports cascading (recursive/composite) encoding of streams.

  • Supports pluggable encoding selection policies.

  • Provides extensibility APIs where encodings and other aspects of the file can be extended.

  • Clear separation between logical and physical encoded types.

  • And more.
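
As a rough illustration of how block encoding and a pluggable selection policy could fit together, consider the Python sketch below. It is an assumption-laden toy, not Nimble's implementation or API: the encoders, the size estimate, and the names `smallest_size_policy` and `encode_in_blocks` are all hypothetical. The stream is split into fixed-size blocks, and a policy function picks an encoding for each block independently, so a decoder only ever needs to hold one block at a time.

```python
from itertools import groupby


def encode_plain(block):
    """Store the block as-is."""
    return ("plain", list(block))


def encode_rle(block):
    """Store the block as (value, run_length) pairs."""
    runs = [(value, sum(1 for _ in group)) for value, group in groupby(block)]
    return ("rle", runs)


def encoded_size(encoded):
    """Crude size estimate: the number of integers in the payload."""
    kind, payload = encoded
    return len(payload) if kind == "plain" else 2 * len(payload)


def smallest_size_policy(block, encoders):
    """Hypothetical policy: try every encoder, keep the smallest result."""
    return min((encoder(block) for encoder in encoders), key=encoded_size)


def encode_in_blocks(values, block_size, policy, encoders):
    """Split the stream into blocks and let the policy encode each one."""
    blocks = []
    for start in range(0, len(values), block_size):
        blocks.append(policy(values[start:start + block_size], encoders))
    return blocks


def decode_block(encoded):
    """Decode a single block without looking at any other block."""
    kind, payload = encoded
    if kind == "plain":
        return list(payload)
    return [value for value, count in payload for _ in range(count)]


data = [7, 7, 7, 7, 1, 2, 3, 4]
blocks = encode_in_blocks(data, 4, smallest_size_policy, [encode_plain, encode_rle])
decoded = [value for block in blocks for value in decode_block(block)]
print([kind for kind, _ in blocks])
print(decoded == data)
```

Swapping in a different policy function (for example, one that favors decode speed over size) would change which encoding each block gets without touching the encoders or the block layout.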

Nimble is a work in progress, and many of the features above are still under design and/or active development. As such, Nimble does not provide stability or versioning guarantees (yet). These guarantees will eventually be provided with a future stable release. Use it at your own risk.

Build

Nimble’s CMake build system is self-sufficient and able to either locate its main dependencies or compile them locally. In order to compile it, one can simply:

$ git clone git@github.com:facebookincubator/nimble.git
$ cd nimble
$ make

To override the default behavior and force the build system to, for example, build a dependency locally (bundle it), one can:

$ folly_SOURCE=BUNDLED make

Nimble builds have been tested using clang 15 and 16. The build system should automatically compile the following dependencies: gtest, glog, folly, abseil, and velox. You may need to first install the following system dependencies for these to compile (example from Ubuntu 22.04):

$ sudo apt install -y \
    git \
    cmake \
    flatbuffers-compiler \
    protobuf-compiler \
    libflatbuffers-dev \
    libgflags-dev \
    libunwind-dev \
    libgoogle-glog-dev \
    libdouble-conversion-dev \
    libevent-dev \
    liblz4-dev \
    liblzo2-dev \
    libelf-dev \
    libdwarf-dev \
    libsnappy-dev \
    libssl-dev \
    bison \
    flex \
    libfl-dev

Although Nimble’s codebase is today closely coupled with velox, we intend to decouple them in the future.

License

Nimble is licensed under the Apache 2.0 License. A copy of the license can be found here.

nimble's People

Contributors

albertdachichen, yuhta, r-barnes, pedroerp, sdruzkin, gownta, helfman, facebook-github-bot, xiaoxmeng


nimble's Issues

Format spec document

I was considering contributing to Nimble, but I can't get a sense of exactly how the file is laid out. For me and other potential contributors, a format spec document would be essential. Something like this. It only needs to describe how to decode, not encode.

Spatial support

Is there any plan to include spatial support, e.g. an equivalent of GeoParquet, OGC compliance, etc.?
Thanks.

Pcodec support

I'm excited that Nimble has such flexible encodings/compressions! It shouldn't be too hard to add Pcodec, which generally gets a much better compression ratio on numerical data than the traditional dictionary/RLE/.../LZ approach. Compression and decompression speeds could benefit too. This seems important, especially for an ML-focused columnar format.
