it should be possible to process data at speeds approaching that of sequential io.
sequential io is fast. cpu is the bottleneck. sequential-only data access is the play.
a minimal, row oriented data format designed for time efficiency and ease of use.
small cli utilities manipulate data and can be combined into pipelines.
util/ is shared code.
src/ contains independent utilities building on util/.
quickcheck style testing with python implementations of every utility to verify correct behavior for arbitrary inputs and varying buffer sizes.
performance experiments with alternate implementations and approaches.
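a hypothetical sketch of what one such property test might look like, here for bdedupe semantics. it checks a naive reference against itertools.groupby on random inputs; the real harness, the bsv binary format, and the actual python implementations are not shown here:

```python
import itertools
import random

def dedupe(rows):
    # reference semantics of bdedupe: keep only the first of each
    # contiguous run of rows whose first column compares equal
    out = []
    for row in rows:
        if not out or out[-1][0] != row[0]:
            out.append(row)
    return out

# property: dedupe agrees with itertools.groupby keyed on the first
# column, for arbitrary random inputs
for _ in range(1000):
    rows = [[random.choice("ab"), random.choice("xy")]
            for _ in range(random.randrange(20))]
    expected = [next(iter(g))
                for _, g in itertools.groupby(rows, key=lambda r: r[0])]
    assert dedupe(rows) == expected
```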
- bbucket - prefix each row with a consistent hash of the first column
- bcat - cat some bsv file to csv
- bcopy - pass through data, to benchmark load/dump performance
- bcounteach - count and collapse each contiguous identical row by strcmp the first column
- bcountrows - count rows
- bcut - select some columns
- bdedupe - dedupe identical contiguous rows by strcmp the first column
- bdropuntil - drop until the first column is gte to VALUE
- bmerge - merge sorted files
- bpartition - split into multiple files by the first column value
- brmerge - merge reverse sorted files
- brsort - reverse timsort rows by strcmp the first column
- bsort - timsort rows by strcmp the first column
- bsplit - split a stream into multiple files. files are named after the hash of the first chunk and then numbered
- bsum - integer sum numbers in the first column and output a single value
- bsv_ascii - convert csv to bsv, numerics remain ascii for faster parsing
- bsv - convert csv to bsv
- btake - take while the first column is VALUE
- btakeuntil - take until the first column is gte to VALUE
- csv_ascii - convert bsv to csv, numerics are treated as ascii
- csv - convert bsv to csv
- xxh3 - xxh3_64 hash stdin, defaults to hex, can be --int, or --stream to hex and pass stdin through
prefix each row with a consistent hash of the first column
usage: ... | bbucket NUM_BUCKETS
>> echo '
a
b
c
' | bsv | bbucket 100 | csv
50,a
39,b
83,c
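a minimal sketch of the bucketing idea, on rows as lists of strings. crc32 is a stand-in hash: the real bbucket presumably uses xxh3_64 (the repo ships an xxh3 tool), so the bucket numbers here will not match the output above.

```python
import zlib

def bucket(rows, num_buckets):
    # prefix each row with hash(first_column) % num_buckets.
    # stand-in hash: crc32 -- bucket numbers will differ from bbucket's
    return [[str(zlib.crc32(row[0].encode()) % num_buckets)] + row
            for row in rows]
```

the hash is consistent: the same first column always lands in the same bucket, which is what makes bbucket useful ahead of bpartition.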
cat some bsv file to csv
usage: bcat [--prefix] [--head NUM] FILE1 ... FILEN
>> for char in a a b b c c; do
echo $char | bsv >> /tmp/$char
done
>> bcat --head 1 --prefix /tmp/{a,b,c}
/tmp/a:a
/tmp/b:b
/tmp/c:c
pass through data, to benchmark load/dump performance
usage: ... | bcopy
>> echo a,b,c | bsv | bcopy | csv
a,b,c
count and collapse each contiguous identical row by strcmp the first column
usage: ... | bcounteach
>> echo 'a
a
b
b
b
a
' | bsv | bcounteach | csv
a,2
b,3
a,1
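a sketch of the counting semantics on rows as lists of strings, using itertools.groupby, which collapses contiguous runs the same way:

```python
import itertools

def counteach(rows):
    # count each contiguous run of identical first columns,
    # emitting value,count pairs like bcounteach
    return [[key, str(len(list(group)))]
            for key, group in itertools.groupby(rows, key=lambda r: r[0])]
```

note that non-contiguous repeats are counted separately, matching the example above; sort first to get global counts.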
count rows
usage: ... | bcountrows
>> echo -e '1
2
3
4.1
' | bsv | bcountrows | csv
4
select some columns
usage: ... | bcut FIELD1,...,FIELDN
>> echo a,b,c | bsv | bcut 3,3,3,2,2,1 | csv
c,c,c,b,b,a
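the selection semantics sketched on rows as lists of strings. fields are 1-indexed, and the example shows repeats and reordering are allowed:

```python
def cut(rows, fields):
    # select 1-indexed columns; repeating and reordering are allowed,
    # matching the bcut example
    indices = [f - 1 for f in fields]
    return [[row[i] for i in indices] for row in rows]
```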
dedupe identical contiguous rows by strcmp the first column
usage: ... | bdedupe
>> echo '
a
a
b
b
a
a
' | bsv | bdedupe | csv
a
b
a
drop until the first column is gte to VALUE
usage: ... | bdropuntil VALUE
>> echo '
a
b
c
d
' | bsv | bdropuntil c | csv
c
d
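the drop-until semantics map onto itertools.dropwhile; a sketch on rows as lists of strings, assuming plain string comparison stands in for strcmp:

```python
import itertools

def dropuntil(rows, value):
    # drop rows while the first column compares less than value,
    # then keep everything from the first row >= value onward
    return list(itertools.dropwhile(lambda r: r[0] < value, rows))
```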
merge sorted files
usage: bmerge FILE1 ... FILEN
>> echo -e 'a
c
e
' | bsv > a.bsv
>> echo -e 'b
d
f
' | bsv > b.bsv
>> bmerge a.bsv b.bsv | csv
a
b
c
d
e
f
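a k-way merge of sorted inputs is exactly what heapq.merge does; a sketch on rows as lists of strings, keyed on the first column:

```python
import heapq

def merge(*sorted_inputs):
    # lazily k-way merge already-sorted inputs keyed on the first column
    return list(heapq.merge(*sorted_inputs, key=lambda r: r[0]))
```

for reverse-sorted inputs, as brmerge handles, heapq.merge accepts reverse=True.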
split into multiple files by the first column value
usage: ... | bbucket NUM_BUCKETS | bpartition PREFIX NUM_BUCKETS
>> echo '
0,a
1,b
2,c
' | bsv | bpartition prefix 10
prefix00
prefix01
prefix02
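a sketch of the routing logic on rows as lists of strings. instead of writing files it returns a filename-to-rows dict; the zero-padding width is an assumption read off the example, which pads to 2 digits for 10 buckets:

```python
import collections

def partition(rows, prefix, num_buckets):
    # route each row to a bucket named by its first column, zero-padded
    # to the width of num_buckets (assumed from the example above);
    # returns filename -> rows rather than writing files
    width = len(str(num_buckets))
    out = collections.defaultdict(list)
    for row in rows:
        out[f"{prefix}{int(row[0]):0{width}d}"].append(row)
    return dict(out)
```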
merge reverse sorted files
usage: brmerge FILE1 ... FILEN
>> echo -e 'e
c
a
' | bsv > a.bsv
>> echo -e 'f
d
b
' | bsv > b.bsv
>> brmerge a.bsv b.bsv | csv
f
e
d
c
b
a
reverse timsort rows by strcmp the first column
usage: ... | brsort
>> echo '
a
b
c
' | bsv | brsort | csv
c
b
a
timsort rows by strcmp the first column
usage: ... | bsort
>> echo '
c
b
a
' | bsv | bsort | csv
a
b
c
split a stream into multiple files. files are named after the hash of the first chunk and then numbered
usage: ... | bsplit [chunks_per_file=1]
>> echo -n a,b,c | bsv | bsplit
1595793589_0000000000
integer sum numbers in the first column and output a single value
usage: ... | bsum
>> echo -e '1
2
3
4.1
' | bsv | bsum | csv
10
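a sketch of the summing semantics on rows as lists of strings. how the real bsum parses "4.1" is an assumption; truncation toward the integer part is consistent with the example above, where 1, 2, 3, 4.1 sums to 10:

```python
def bsum(rows):
    # integer sum of the first column; a value like "4.1" contributes
    # only its integer part (an assumption consistent with the
    # documented example summing 1, 2, 3, 4.1 to 10)
    return sum(int(float(row[0])) for row in rows)
```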
convert csv to bsv, numerics remain ascii for faster parsing
usage: ... | bsv_ascii
>> echo a,b,c | bsv_ascii | bcut 3,2,1 | csv_ascii
c,b,a
convert csv to bsv
usage: ... | bsv
>> echo a,b,c | bsv | bcut 3,2,1 | csv
c,b,a
take while the first column is VALUE
usage: ... | btake VALUE
>> echo '
a
b
c
d
' | bsv | bdropuntil c | btake c | csv
c
take until the first column is gte to VALUE
usage: ... | btakeuntil VALUE
>> echo '
a
b
c
d
' | bsv | btakeuntil c | csv
a
b
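btake and btakeuntil map onto itertools.takewhile; a sketch on rows as lists of strings, assuming plain string comparison stands in for strcmp:

```python
import itertools

def take(rows, value):
    # keep rows while the first column equals value
    return list(itertools.takewhile(lambda r: r[0] == value, rows))

def takeuntil(rows, value):
    # keep rows while the first column compares less than value
    return list(itertools.takewhile(lambda r: r[0] < value, rows))
```

takeuntil is the complement of dropuntil: concatenating their outputs for the same value reconstructs the input.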
convert bsv to csv, numerics are treated as ascii
usage: ... | csv_ascii
>> echo a,b,c | bsv_ascii | csv_ascii
a,b,c
convert bsv to csv
usage: ... | csv
>> echo a,b,c | bsv | csv
a,b,c
xxh3_64 hash stdin. defaults to hex output; use --int for an integer, or --stream to print hex and pass stdin through
usage: ... | xxh3 [--stream|--int]
>> echo abc | xxh3
B5CA312E51D77D64