it should be possible to process data at speeds approaching that of sequential io.
sequential io is fast. cpu is the bottleneck. sequential-only data access is the play.
a minimal, row oriented data format designed for time efficiency and ease of use.
small cli utilities manipulate data and can be combined into pipelines.
util/ is shared code.
src/ contains independent utilities building on util/.
quickcheck style testing with python implementations of every utility to verify correct behavior for arbitrary inputs and varying buffer sizes.
performance experiments with alternate implementations and approaches.
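a hypothetical sketch of what one such property test might look like, here for bdedupe semantics. it checks a naive reference against itertools.groupby on random inputs; the real harness, the bsv binary format, and the actual python implementations are not shown here:

```python
import itertools
import random

def dedupe(rows):
    # reference semantics of bdedupe: keep only the first of each
    # contiguous run of rows whose first column compares equal
    out = []
    for row in rows:
        if not out or out[-1][0] != row[0]:
            out.append(row)
    return out

# property: dedupe agrees with itertools.groupby keyed on the first
# column, for arbitrary random inputs
for _ in range(1000):
    rows = [[random.choice("ab"), random.choice("xy")]
            for _ in range(random.randrange(20))]
    expected = [next(iter(g))
                for _, g in itertools.groupby(rows, key=lambda r: r[0])]
    assert dedupe(rows) == expected
```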
- bbucket - prefix each row with a consistent hash of the first column
- bcat - cat some bsv file to csv
- bcopy - pass through data, to benchmark load/dump performance
- bcounteach - count and collapse each contiguous identical row by strcmp the first column
- bcountrows - count rows
- bcut - select some columns
- bdedupe - dedupe identical contiguous rows by strcmp the first column
- bdropuntil - drop until the first column is gte to VALUE
- bmerge - merge sorted files
- bpartition - split into multiple files by the first column value
- brmerge - merge reverse sorted files
- brsort - reverse timsort rows by strcmp the first column
- bsort - timsort rows by strcmp the first column
- bsplit - split a stream into multiple files. files are named after the hash of the first chunk and then numbered
- bsum - integer sum numbers in the first column and output a single value
- bsv_ascii - convert csv to bsv, numerics remain ascii for faster parsing
- bsv - convert csv to bsv
- btake - take while the first column is VALUE
- btakeuntil - take until the first column is gte to VALUE
- csv_ascii - convert bsv to csv, numerics are treated as ascii
- csv - convert bsv to csv
- xxh3 - xxh3_64 hash stdin, defaults to hex, can be --int, or --stream to hex and pass stdin through
prefix each row with a consistent hash of the first column
usage: ... | bbucket NUM_BUCKETS
>> echo '
a
b
c
' | bsv | bbucket 100 | csv
50,a
39,b
83,c
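a minimal sketch of the bucketing idea, on rows as lists of strings. crc32 is a stand-in hash: the real bbucket presumably uses xxh3_64 (the repo ships an xxh3 tool), so the bucket numbers here will not match the output above.

```python
import zlib

def bucket(rows, num_buckets):
    # prefix each row with hash(first_column) % num_buckets.
    # stand-in hash: crc32 -- bucket numbers will differ from bbucket's
    return [[str(zlib.crc32(row[0].encode()) % num_buckets)] + row
            for row in rows]
```

the hash is consistent: the same first column always lands in the same bucket, which is what makes bbucket useful ahead of bpartition.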
cat some bsv file to csv
usage: bcat [--prefix] [--head NUM] FILE1 ... FILEN
>> for char in a a b b c c; do
echo $char | bsv >> /tmp/$char
done
>> bcat --head 1 --prefix /tmp/{a,b,c}
/tmp/a:a
/tmp/b:b
/tmp/c:c
pass through data, to benchmark load/dump performance
usage: ... | bcopy
>> echo a,b,c | bsv | bcopy | csv
a,b,c
count and collapse each contiguous identical row by strcmp the first column
usage: ... | bcounteach
>> echo 'a
a
b
b
b
a
' | bsv | bcounteach | csv
a,2
b,3
a,1
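a sketch of the counting semantics on rows as lists of strings, using itertools.groupby, which collapses contiguous runs the same way:

```python
import itertools

def counteach(rows):
    # count each contiguous run of identical first columns,
    # emitting value,count pairs like bcounteach
    return [[key, str(len(list(group)))]
            for key, group in itertools.groupby(rows, key=lambda r: r[0])]
```

note that non-contiguous repeats are counted separately, matching the example above; sort first to get global counts.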
count rows
usage: ... | bcountrows
>> echo -e '1
2
3
4.1
' | bsv | bcountrows | csv
4
select some columns
usage: ... | bcut FIELD1,...,FIELDN
>> echo a,b,c | bsv | bcut 3,3,3,2,2,1 | csv
c,c,c,b,b,a
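the selection semantics sketched on rows as lists of strings. fields are 1-indexed, and the example shows repeats and reordering are allowed:

```python
def cut(rows, fields):
    # select 1-indexed columns; repeating and reordering are allowed,
    # matching the bcut example
    indices = [f - 1 for f in fields]
    return [[row[i] for i in indices] for row in rows]
```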
dedupe identical contiguous rows by strcmp the first column
usage: ... | bdedupe
>> echo '
a
a
b
b
a
a
' | bsv | bdedupe | csv
a
b
a
drop until the first column is gte to VALUE
usage: ... | bdropuntil VALUE
>> echo '
a
b
c
d
' | bsv | bdropuntil c | csv
c
d
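the drop-until semantics map onto itertools.dropwhile; a sketch on rows as lists of strings, assuming plain string comparison stands in for strcmp:

```python
import itertools

def dropuntil(rows, value):
    # drop rows while the first column compares less than value,
    # then keep everything from the first row >= value onward
    return list(itertools.dropwhile(lambda r: r[0] < value, rows))
```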
merge sorted files
usage: bmerge FILE1 ... FILEN
>> echo -e 'a
c
e
' | bsv > a.bsv
>> echo -e 'b
d
f
' | bsv > b.bsv
>> bmerge a.bsv b.bsv | csv
a
b
c
d
e
f
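a k-way merge of sorted inputs is exactly what heapq.merge does; a sketch on rows as lists of strings, keyed on the first column:

```python
import heapq

def merge(*sorted_inputs):
    # lazily k-way merge already-sorted inputs keyed on the first column
    return list(heapq.merge(*sorted_inputs, key=lambda r: r[0]))
```

for reverse-sorted inputs, as brmerge handles, heapq.merge accepts reverse=True.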
split into multiple files by the first column value
usage: ... | bbucket NUM_BUCKETS | bpartition PREFIX NUM_BUCKETS
>> echo '
0,a
1,b
2,c
' | bsv | bpartition prefix 10
prefix00
prefix01
prefix02
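a sketch of the routing logic on rows as lists of strings. instead of writing files it returns a filename-to-rows dict; the zero-padding width is an assumption read off the example, which pads to 2 digits for 10 buckets:

```python
import collections

def partition(rows, prefix, num_buckets):
    # route each row to a bucket named by its first column, zero-padded
    # to the width of num_buckets (assumed from the example above);
    # returns filename -> rows rather than writing files
    width = len(str(num_buckets))
    out = collections.defaultdict(list)
    for row in rows:
        out[f"{prefix}{int(row[0]):0{width}d}"].append(row)
    return dict(out)
```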
merge reverse sorted files
usage: brmerge FILE1 ... FILEN
>> echo -e 'e
c
a
' | bsv > a.bsv
>> echo -e 'f
d
b
' | bsv > b.bsv
>> brmerge a.bsv b.bsv | csv
f
e
d
c
b
a
reverse timsort rows by strcmp the first column
usage: ... | brsort
>> echo '
a
b
c
' | bsv | brsort | csv
c
b
a
timsort rows by strcmp the first column
usage: ... | bsort
>> echo '
c
b
a
' | bsv | bsort | csv
a
b
c
split a stream into multiple files. files are named after the hash of the first chunk and then numbered
usage: ... | bsplit [chunks_per_file=1]
>> echo -n a,b,c | bsv | bsplit
1595793589_0000000000
integer sum numbers in the first column and output a single value
usage: ... | bsum
>> echo -e '1
2
3
4.1
' | bsv | bsum | csv
10
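a sketch of the summing semantics on rows as lists of strings. how the real bsum parses "4.1" is an assumption; truncation toward the integer part is consistent with the example above, where 1, 2, 3, 4.1 sums to 10:

```python
def bsum(rows):
    # integer sum of the first column; a value like "4.1" contributes
    # only its integer part (an assumption consistent with the
    # documented example summing 1, 2, 3, 4.1 to 10)
    return sum(int(float(row[0])) for row in rows)
```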
convert csv to bsv, numerics remain ascii for faster parsing
usage: ... | bsv_ascii
>> echo a,b,c | bsv_ascii | bcut 3,2,1 | csv_ascii
c,b,a
convert csv to bsv
usage: ... | bsv
>> echo a,b,c | bsv | bcut 3,2,1 | csv
c,b,a
take while the first column is VALUE
usage: ... | btake VALUE
>> echo '
a
b
c
d
' | bsv | bdropuntil c | btake c | csv
c
take until the first column is gte to VALUE
usage: ... | btakeuntil VALUE
>> echo '
a
b
c
d
' | bsv | btakeuntil c | csv
a
b
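btake and btakeuntil map onto itertools.takewhile; a sketch on rows as lists of strings, assuming plain string comparison stands in for strcmp:

```python
import itertools

def take(rows, value):
    # keep rows while the first column equals value
    return list(itertools.takewhile(lambda r: r[0] == value, rows))

def takeuntil(rows, value):
    # keep rows while the first column compares less than value
    return list(itertools.takewhile(lambda r: r[0] < value, rows))
```

takeuntil is the complement of dropuntil: concatenating their outputs for the same value reconstructs the input.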
convert bsv to csv, numerics are treated as ascii
usage: ... | csv_ascii
>> echo a,b,c | bsv_ascii | csv_ascii
a,b,c
convert bsv to csv
usage: ... | csv
>> echo a,b,c | bsv | csv
a,b,c
xxh3_64 hash stdin. defaults to hex output; use --int for an integer, or --stream to print hex and pass stdin through
usage: ... | xxh3 [--stream|--int]
>> echo abc | xxh3
B5CA312E51D77D64