
gpertea / gfstlib


This project forked from fstpackage/fstlib


fast multi-threaded serialization of tabular data (fstlib) adapted for genomics use cases

License: Mozilla Public License 2.0

C++ 50.03% C 49.07% CMake 0.90%

gfstlib's Introduction

License: AGPLv3

The fst format and fstlib library

Overview

The fstlib library is home to the fst storage format for columnar tabular data. It also contains very fast multi-threaded streamers for fst files and a computational framework that allows for effective use of the format's features for parallel calculations on larger-than-memory datasets.

The fst format

The fst format is used to store columnar tabular data. The format uses hashing and compression for stability, correctness and compactness. It supports a wide range of data types, and tabular data can be compressed with a wide range of settings to maximize throughput to storage devices.

Streaming

The fstlib library is built to access tabular data in the fst format at the maximum possible speed. It employs multi-threading for background reading and writing, and can (de-)compress using the full resources of the CPU. Speeds of multiple GB/s can be reached on fast (NVMe SSD) storage devices.

fstlib uses the excellent LZ4 compressor for high-speed compression at lower ratios and the ZSTD compressor for medium-speed compression at higher ratios. Compression is done on small (16 kB) blocks of data, which allows for (almost) random access to the data. Each column uses its own compression scheme, and different compressors can be mixed within a single column. This flexible setup allows for better-optimized and faster compression of data, boosting speeds.
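To illustrate why small blocks give (almost) random access, here is a minimal sketch that maps a row range to the blocks that would need to be decompressed. It is not fstlib's actual on-disk layout or API: the names `BlockRange` and `blocks_for_rows` are hypothetical, and the sketch assumes fixed-width values packed into blocks of exactly 16 * 1024 uncompressed bytes.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: each block holds 16 kB (taken here as 16 * 1024 bytes) of
// uncompressed, fixed-width values. The real format may differ.
constexpr std::size_t kBlockBytes = 16 * 1024;

// Hypothetical result type: which blocks cover a requested row range.
struct BlockRange {
    std::size_t first_block;  // index of the first block to decompress
    std::size_t last_block;   // index of the last block to decompress (inclusive)
    std::size_t skip_rows;    // rows to discard at the start of first_block
};

// Map the row range [first_row, first_row + row_count) to block indices.
inline BlockRange blocks_for_rows(std::size_t first_row, std::size_t row_count,
                                  std::size_t value_bytes) {
    assert(row_count > 0);
    assert(value_bytes > 0 && kBlockBytes % value_bytes == 0);
    const std::size_t rows_per_block = kBlockBytes / value_bytes;
    const std::size_t last_row = first_row + row_count - 1;
    return BlockRange{first_row / rows_per_block,
                      last_row / rows_per_block,
                      first_row % rows_per_block};
}
```

Under these assumptions, a column of 4-byte values holds 4096 rows per block, so reading 100 rows starting at row 5000 touches only block 1, no matter how long the column is.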

Computational framework

The fstlib library allows for computations on tabular data blocks during loading and decompression of data. This unique approach to processing compressed tabular data enables high-speed computing on larger-than-memory datasets.
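The idea of computing on blocks as they are loaded can be sketched as a chunk-wise reduction. This is an illustration only, not fstlib's computational framework: `for_each_chunk` and `chunked_mean` are hypothetical names, and an in-memory vector stands in for a file whose chunks would normally be produced by decompressing blocks from disk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: hands successive chunks of at most chunk_rows values
// to on_chunk(pointer, count). The vector stands in for a file on disk.
template <typename T, typename Fn>
void for_each_chunk(const std::vector<T>& source, std::size_t chunk_rows,
                    Fn on_chunk) {
    assert(chunk_rows > 0);
    for (std::size_t pos = 0; pos < source.size(); pos += chunk_rows) {
        const std::size_t n = std::min(chunk_rows, source.size() - pos);
        on_chunk(source.data() + pos, n);
    }
}

// Mean of a column computed one chunk at a time; only the running sum and
// count are kept between chunks, so memory use does not grow with the data.
inline double chunked_mean(const std::vector<double>& column,
                           std::size_t chunk_rows) {
    double sum = 0.0;
    std::size_t count = 0;
    for_each_chunk(column, chunk_rows,
                   [&](const double* values, std::size_t n) {
                       for (std::size_t i = 0; i < n; ++i) sum += values[i];
                       count += n;
                   });
    return count == 0 ? 0.0 : sum / static_cast<double>(count);
}
```

Because the reducer carries only a sum and a count across chunks, the same pattern applies when the full column would not fit in memory.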

Goals

The fstlib library was designed with four goals in mind:

  • cross-language compatibility: fstlib compiles on all major platforms and compilers using the CMake toolchain, with Travis builds on the three major platforms.
  • maximum possible speed: fstlib is a multithreaded (OpenMP) library designed entirely around the most important bottleneck for larger-than-memory data analytics: access to storage devices (such as SSDs). It uses background reading and writing and employs fast compression and decompression to move the maximum number of bytes to and from disk in any given time.
  • full (almost) random access: fstlib was designed to facilitate computational platforms. Therefore, data in the format can be accessed with almost full random access, in both columns and rows. This allows for high-speed chunk-based processing of data, crucial for larger-than-memory analytics.
  • flexibility: fstlib needs an interface to columnar in-memory data, but is agnostic to the memory management and precise format of that data. That makes it very effective for use with a wide range of in-memory table-like containers without any overhead for copying data. It can handle Arrow memory structures, but also native R vectors, all zero-copy.
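The flexibility goal, reading from any columnar container without copying, can be sketched with a non-owning view. This is a minimal sketch and not fstlib's real interface: `Int32ColumnView`, `view_of` and `checksum` are hypothetical names, and `checksum` merely stands in for a routine that consumes column data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical non-owning view of a contiguous int32 column. It stores only
// a pointer and a length, so creating one never copies the column data.
struct Int32ColumnView {
    const std::int32_t* data;
    std::size_t length;
};

// Adapters for two different owners of the memory.
inline Int32ColumnView view_of(const std::vector<std::int32_t>& v) {
    return {v.data(), v.size()};
}
inline Int32ColumnView view_of(const std::int32_t* p, std::size_t n) {
    return {p, n};
}

// Stand-in for a consumer such as a writer: reads through the view only.
inline std::int64_t checksum(Int32ColumnView col) {
    assert(col.data != nullptr || col.length == 0);
    std::int64_t total = 0;
    for (std::size_t i = 0; i < col.length; ++i) total += col.data[i];
    return total;
}
```

The consumer depends only on the view, so a `std::vector`, a raw buffer owned by another runtime, or any other contiguous container can be passed in without the library taking over memory management.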

Use cases

Currently, the main use case for fstlib is R's fst package. In that package, fstlib provides the backend for accessing fst files at very high speeds, up to multiple GB/s. In the future, fstlib will be part of similar packages for other languages such as Python, Julia, and Rust.
