slham / bsv

This project forked from nathants/bsv

performant data processing, unix style

License: MIT License

Languages: C 78.00%, Python 20.01%, Go 0.64%, Makefile 0.60%, Java 0.41%, Shell 0.34%

bsv's Introduction

why

it should be possible to process data at speeds approaching that of sequential io.

sequential io is fast. cpu is the bottleneck. sequential only data access is the play.

what

a minimal, row oriented data format designed for time efficiency and ease of use.

small cli utilities manipulate data and can be combined into pipelines.

util/ is shared code.

src/ are independent utilities building on util/.

testing methodology

quickcheck style testing with python implementations of every utility to verify correct behavior for arbitrary inputs and varying buffer sizes.
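
a minimal sketch of this style of test, using only the python standard library. the names here (reference_dedupe, chunked_dedupe, random_rows, check) are hypothetical and not taken from the repo. both sides are python in this sketch; it only illustrates the shape of the property: for arbitrary rows and any buffer size, the implementation under test must agree with a simple model.

```python
import random

def reference_dedupe(rows):
    """simple model: keep a row when its first column differs from the last kept row's."""
    out = []
    for row in rows:
        if not out or out[-1][0] != row[0]:
            out.append(row)
    return out

def chunked_dedupe(rows, buffer_size):
    """hypothetical implementation under test: reads buffer_size rows at a time
    and must carry the previous key across chunk boundaries."""
    out, last = [], None
    for start in range(0, len(rows), buffer_size):
        for row in rows[start:start + buffer_size]:
            if row[0] != last:
                out.append(row)
                last = row[0]
    return out

def random_rows(rng, max_rows=20):
    """tiny alphabet so contiguous duplicates actually show up."""
    return [[rng.choice("abc"), str(rng.randint(0, 9))]
            for _ in range(rng.randint(0, max_rows))]

def check(trials=500, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        rows = random_rows(rng)
        expected = reference_dedupe(rows)
        for buffer_size in (1, 2, 3, 7, 64):
            assert chunked_dedupe(rows, buffer_size) == expected, (rows, buffer_size)
    return trials

check()  # raises AssertionError with the failing input on the first mismatch
```

a fixed seed keeps failures reproducible; the small alphabet is a deliberate choice so that runs of equal keys straddle chunk boundaries often.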

experiments

performance experiments with alternate implementations and approaches.

utilities

  • bbucket - prefix each row with a consistent hash of the first column
  • bcat - cat some bsv file to csv
  • bcopy - pass through data, to benchmark load/dump performance
  • bcounteach - count and collapse each contiguous identical row by strcmp the first column
  • bcountrows - count rows
  • bcut - select some columns
  • bdedupe - dedupe identical contiguous rows by strcmp the first column
  • bdropuntil - drop until the first column is gte to VALUE
  • bmerge - merge sorted files
  • bpartition - split into multiple files by the first column value
  • brmerge - merge reverse sorted files
  • brsort - reverse timsort rows by strcmp the first column
  • bsort - timsort rows by strcmp the first column
  • bsplit - split a stream into multiple files. files are named after the hash of the first chunk and then numbered
  • bsum - integer sum numbers in the first column and output a single value
  • bsv_ascii - convert csv to bsv, numerics remain ascii for faster parsing
  • bsv - convert csv to bsv
  • btake - take while the first column is VALUE
  • btakeuntil - take until the first column is gte to VALUE
  • csv_ascii - convert bsv to csv, numerics are treated as ascii
  • csv - convert bsv to csv
  • xxh3 - xxh3_64 hash stdin, defaults to hex, can be --int, or --stream to hex and pass stdin through

bbucket

prefix each row with a consistent hash of the first column

usage: ... | bbucket NUM_BUCKETS

>> echo '
a
b
c
' | bsv | bbucket 100 | csv
50,a
39,b
83,c
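
the idea can be sketched in python as hash(first column) modulo NUM_BUCKETS. this is an assumption about the general shape only: zlib.crc32 is a stand-in hash, so the bucket numbers will not match the output above, and the function name bucket is hypothetical.

```python
import zlib

def bucket(rows, num_buckets):
    """prefix each row with hash(first column) % num_buckets."""
    out = []
    for row in rows:
        h = zlib.crc32(row[0].encode())  # stand-in hash, an assumption; not the hash bbucket uses
        out.append([str(h % num_buckets)] + row)
    return out

rows = [["a", "x"], ["b", "y"], ["a", "z"]]
bucketed = bucket(rows, 100)
# equal first columns always land in the same bucket
assert bucketed[0][0] == bucketed[2][0]
```

the property that matters is consistency: the same key always maps to the same bucket, which is what lets a later step group rows by bucket.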

bcat

cat some bsv file to csv

usage: bcat [--prefix] [--head NUM] FILE1 ... FILEN

>> for char in a a b b c c; do
     echo $char | bsv >> /tmp/$char
   done

>> bcat --head 1 --prefix /tmp/{a,b,c}
/tmp/a:a
/tmp/b:b
/tmp/c:c

bcopy

pass through data, to benchmark load/dump performance

usage: ... | bcopy

>> echo a,b,c | bsv | bcopy | csv
a,b,c

bcounteach

count and collapse each contiguous identical row by strcmp the first column

usage: ... | bcounteach

>> echo 'a
a
b
b
b
a
' | bsv | bcounteach | csv
a,2
b,3
a,1
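
the same behavior can be modeled in python with itertools.groupby, which groups only contiguous equal keys. count_each is a hypothetical name for this sketch, not the repo's code.

```python
from itertools import groupby

def count_each(rows):
    """collapse each contiguous run of rows sharing a first column into (value, count)."""
    return [(key, len(list(group))) for key, group in groupby(rows, key=lambda r: r[0])]

rows = [["a"], ["a"], ["b"], ["b"], ["b"], ["a"]]
print(count_each(rows))  # [('a', 2), ('b', 3), ('a', 1)]
```

the trailing a is counted separately because it is not contiguous with the first run, which is why sorted input is needed for global counts.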

bcountrows

count rows

usage: ... | bcountrows

>> echo -e '1
2
3
4.1
' | bsv | bcountrows | csv
4

bcut

select some columns

usage: ... | bcut FIELD1,...,FIELDN

>> echo a,b,c | bsv | bcut 3,3,3,2,2,1 | csv
c,c,c,b,b,a
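
as the example shows, fields are 1-based and may be repeated or reordered. a python sketch of that selection (cut is a hypothetical name):

```python
def cut(row, fields):
    """select 1-based columns; repeats and reordering are allowed."""
    return [row[i - 1] for i in fields]

print(cut(["a", "b", "c"], [3, 3, 3, 2, 2, 1]))  # ['c', 'c', 'c', 'b', 'b', 'a']
```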

bdedupe

dedupe identical contiguous rows by strcmp the first column

usage: ... | bdedupe

>> echo '
a
a
b
b
a
a
' | bsv | bdedupe | csv
a
b
a

bdropuntil

drop until the first column is gte to VALUE

usage: ... | bdropuntil VALUE

>> echo '
a
b
c
d
' | bsv | bdropuntil c | csv
c
d
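
a python sketch of the same rule, assuming byte-wise comparison of the first column in the spirit of strcmp. drop_until is a hypothetical name.

```python
from itertools import dropwhile

def drop_until(rows, value):
    """skip rows while the first column compares less than value, byte by byte."""
    v = value.encode()
    return list(dropwhile(lambda r: r[0].encode() < v, rows))

print(drop_until([["a"], ["b"], ["c"], ["d"]], "c"))  # [['c'], ['d']]
```

once the first matching row is seen, every later row passes through untouched, so this is only meaningful on input sorted by the first column.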

bmerge

merge sorted files

usage: bmerge FILE1 ... FILEN

>> echo -e 'a
c
e
' | bsv > a.bsv
>> echo -e 'b
d
f
' | bsv > b.bsv
>> bmerge a.bsv b.bsv
a
b
c
d
e
f
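
merging already sorted inputs is a k-way merge. a python sketch using heapq.merge, comparing the first column as bytes; merge_sorted is a hypothetical name, and the reverse flag models inputs that are sorted in descending order.

```python
import heapq

def merge_sorted(*streams, reverse=False):
    """k-way merge of streams that are each already sorted by their first column."""
    return list(heapq.merge(*streams, key=lambda r: r[0].encode(), reverse=reverse))

a = [["a"], ["c"], ["e"]]
b = [["b"], ["d"], ["f"]]
print(merge_sorted(a, b))  # [['a'], ['b'], ['c'], ['d'], ['e'], ['f']]
print(merge_sorted(a[::-1], b[::-1], reverse=True))  # [['f'], ['e'], ['d'], ['c'], ['b'], ['a']]
```

heapq.merge is lazy and holds one row per input at a time, so the whole merge stays sequential; list() is used here only to print the result.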

bpartition

split into multiple files by the first column value

usage: ... | bbucket NUM_BUCKETS | bpartition PREFIX NUM_BUCKETS

>> echo '
0,a
1,b
2,c
' | bsv | bpartition prefix 10
prefix00
prefix01
prefix02
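
a rough python model of the partitioning step. several details are assumptions read off the example rather than confirmed behavior: the zero-padded width of the file name suffix, dropping the bucket column before writing, and writing csv text instead of bsv. partition is a hypothetical name.

```python
import os
import tempfile

def partition(rows, prefix, num_buckets):
    """append each row to the file named prefix + zero-padded bucket number."""
    width = len(str(num_buckets))  # assumption: gives the 2-digit names shown for 10 buckets
    handles = {}
    try:
        for bucket, *rest in rows:
            n = int(bucket)
            if not 0 <= n < num_buckets:
                raise ValueError(f"bucket {n} out of range")
            if n not in handles:
                handles[n] = open(f"{prefix}{n:0{width}d}", "a")
            # assumption: the bucket column is dropped and rows are written as csv text
            handles[n].write(",".join(rest) + "\n")
    finally:
        for f in handles.values():
            f.close()
    return sorted(f"{prefix}{n:0{width}d}" for n in handles)

with tempfile.TemporaryDirectory() as tmp:
    rows = [["0", "a"], ["1", "b"], ["2", "c"]]
    names = partition(rows, os.path.join(tmp, "prefix"), 10)
    print([os.path.basename(n) for n in names])  # ['prefix00', 'prefix01', 'prefix02']
```

one open handle per bucket keeps every write sequential, which is the point of bucketing before partitioning.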

brmerge

merge reverse sorted files

usage: brmerge FILE1 ... FILEN

>> echo -e 'e
c
a
' | bsv > a.bsv
>> echo -e 'f
d
b
' | bsv > b.bsv
>> brmerge a.bsv b.bsv
f
e
d
c
b
a

brsort

reverse timsort rows by strcmp the first column

usage: ... | brsort

>> echo '
a
b
c
' | bsv | brsort | csv
c
b
a

bsort

timsort rows by strcmp the first column

usage: ... | bsort

>> echo '
c
b
a
' | bsv | bsort | csv
a
b
c
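
python's built-in sorted is itself a timsort, so the ordering can be modeled directly. comparing the first column as bytes approximates strcmp; sort_rows is a hypothetical name.

```python
def sort_rows(rows, reverse=False):
    """stable sort by the bytes of the first column."""
    return sorted(rows, key=lambda r: r[0].encode(), reverse=reverse)

print(sort_rows([["c"], ["b"], ["a"]]))                # [['a'], ['b'], ['c']]
print(sort_rows([["a"], ["b"], ["c"]], reverse=True))  # [['c'], ['b'], ['a']]
print(sort_rows([["a"], ["B"], ["c"]]))                # [['B'], ['a'], ['c']]
```

the last line shows what byte order implies: uppercase ascii sorts before lowercase, and numbers kept as text would sort as strings, not by value.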

bsplit

split a stream into multiple files. files are named after the hash of the first chunk and then numbered

usage: ... | bsplit [chunks_per_file=1]

>> echo -n a,b,c | bsv | bsplit
1595793589_0000000000

bsum

integer sum numbers in the first column and output a single value

usage: ... | bsum

>> echo -e '1
2
3
4.1
' | bsv | bsum | csv
10
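
the example sums 1, 2, 3 and 4.1 to 10, which suggests the fractional part is discarded. a python sketch under that assumption (truncation toward zero is inferred from the example only; integer_sum is a hypothetical name):

```python
def integer_sum(rows):
    """sum the first column as integers, truncating any fractional part."""
    return sum(int(float(r[0])) for r in rows)

print(integer_sum([["1"], ["2"], ["3"], ["4.1"]]))  # 10
```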

bsv_ascii

convert csv to bsv, numerics remain ascii for faster parsing

usage: ... | bsv_ascii

>> echo a,b,c | bsv | bcut 3,2,1 | csv
c,b,a

bsv

convert csv to bsv

usage: ... | bsv

>> echo a,b,c | bsv | bcut 3,2,1 | csv
c,b,a

btake

take while the first column is VALUE

usage: ... | btake VALUE

>> echo '
a
b
c
d
' | bsv | bdropuntil c | btake c | csv
c

btakeuntil

take until the first column is gte to VALUE

usage: ... | btakeuntil VALUE

>> echo '
a
b
c
d
' | bsv | btakeuntil c | csv
a
b
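
both take variants stop at the first row that fails their test, so they can be modeled with itertools.takewhile. comparison is byte-wise on the first column as an approximation of strcmp; take_until and take are hypothetical names.

```python
from itertools import takewhile

def take_until(rows, value):
    """keep rows while the first column compares less than value, then stop."""
    v = value.encode()
    return list(takewhile(lambda r: r[0].encode() < v, rows))

def take(rows, value):
    """keep rows while the first column equals value, then stop."""
    return list(takewhile(lambda r: r[0] == value, rows))

rows = [["a"], ["b"], ["c"], ["d"]]
print(take_until(rows, "c"))  # [['a'], ['b']]
print(take(rows[2:], "c"))    # [['c']]
```

stopping early means the rest of the input is never read, which is what makes these cheap on large sorted streams.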

csv_ascii

convert bsv to csv, numerics are treated as ascii

usage: ... | csv_ascii

>> echo a,b,c | bsv | csv
a,b,c

csv

convert bsv to csv

usage: ... | csv

>> echo a,b,c | bsv | csv
a,b,c

xxh3

xxh3_64 hash stdin, defaults to hex, can be --int, or --stream to hex and pass stdin through

usage: ... | xxh3 [--stream|--int]

>> echo abc | xxh3
B5CA312E51D77D64

bsv's People

Contributors

indragiek, nathants, slham
