
bnclabs / gostore


Storage algorithms.

License: MIT License

Languages: Go 97.14%, Makefile 0.44%, Python 0.48%, Java 1.94%
Topics: balanced-tree, btree, llrb, golang, malloc, mvcc, multithreading, lsm

gostore's Introduction

Storage algorithms in golang


Package storage implements a collection of storage algorithms and the necessary tools and libraries. Applications wishing to use this package should check out the interfaces defined under api/.

As of now, three data structures are available for indexing key,value entries:

  • llrb, an in-memory left-leaning red-black tree.
  • bubt, an immutable, durable bottoms-up btree.
  • bogn, a multi-leveled, LSM-based, ACID-compliant storage.

There are some sub-packages that are common to all storage algorithms:

  • flock read-write mutex locks across processes.
  • lib collections of helper functions.
  • lsm implements log-structured merge.
  • malloc custom memory allocator, which can be used instead of golang's memory allocator or the OS allocator.

How to contribute


  • Pick an issue, or create a new issue. Provide adequate documentation for the issue.
  • Assign the issue or get it assigned.
  • Work on the code, once finished, raise a pull request.
  • Gostore is written in golang, and is expected to follow the global guidelines for writing go programs.
  • If the changeset is more than a few lines, please generate a report card.
  • As of now, branch master is the development branch.

gostore's People

Contributors

prataprc



gostore's Issues

BUBT: Settings parameter for min/max key/value.

Similar to llrb. Add settings parameters:

  • MinKeysize, below which keys won't be accepted.
  • MaxKeysize, above which keys won't be accepted.
  • MinValsize, below which values won't be accepted.
  • MaxValsize, above which values won't be accepted.

Add validation logic that checks these limits and fails the operation otherwise.
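
A minimal sketch of such validation, assuming hypothetical field names lifted from the bullets above (not the actual bubt settings API):

package bubt

import "fmt"

// Limits mirrors the proposed settings parameters; the field names are
// illustrative, not the actual bubt API.
type Limits struct {
    MinKeysize, MaxKeysize int64
    MinValsize, MaxValsize int64
}

// validate fails an entry whose key or value size falls outside the
// configured limits.
func (l Limits) validate(key, value []byte) error {
    if n := int64(len(key)); n < l.MinKeysize || n > l.MaxKeysize {
        return fmt.Errorf("key size %d outside [%d, %d]", n, l.MinKeysize, l.MaxKeysize)
    }
    if n := int64(len(value)); n < l.MinValsize || n > l.MaxValsize {
        return fmt.Errorf("value size %d outside [%d, %d]", n, l.MinValsize, l.MaxValsize)
    }
    return nil
}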

BUBT: Cache all intermediate nodes.

When opening a well-formed bubt index, cache the intermediate nodes (m-blocks) in memory, either using a golang map, an llrb tree, or some other fast lookup mechanism. It would typically be a map of fpos -> m-block buffer.

Implement this as a configurable feature.
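
A minimal sketch of the cache, using a plain golang map guarded by a RWMutex; the type and method names are illustrative, not the bubt implementation:

package bubt

import "sync"

// mblockCache maps fpos -> m-block buffer; illustrative only.
type mblockCache struct {
    mu     sync.RWMutex
    blocks map[int64][]byte
}

func newMblockCache() *mblockCache {
    return &mblockCache{blocks: make(map[int64][]byte)}
}

// get returns the cached m-block for fpos, if present.
func (c *mblockCache) get(fpos int64) ([]byte, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    block, ok := c.blocks[fpos]
    return block, ok
}

// put caches the m-block read from disk at fpos.
func (c *mblockCache) put(fpos int64, block []byte) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.blocks[fpos] = block
}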

BUBT: Implement validate.

Tree validation. This would involve walking the entire tree.

  • Check whether keys are in sorted order.
  • Check that the metadata information is consistent with the configured settings.
  • Check node utilisation.
  • Check tree depth.
  • Check block alignment.
  • Check the value file's sanity.
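
As an illustration of the first check, a hedged sketch of a sorted-order verifier that could be fed each key during the tree walk (the callback shape is assumed, not the bubt API):

package bubt

import (
    "bytes"
    "fmt"
)

// checkSorted returns a callback that is fed every key during a full tree
// walk and fails if keys are not in strictly ascending order.
func checkSorted() func(key []byte) error {
    var prev []byte
    return func(key []byte) error {
        if prev != nil && bytes.Compare(prev, key) >= 0 {
            return fmt.Errorf("key %q not greater than previous key %q", key, prev)
        }
        prev = append(prev[:0], key...) // remember a copy of the key
        return nil
    }
}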

Use piecewise Iteration for a full table scan.

A full table scan on an active tree has the following issues:

  • With MVCC disabled, an Iteration will lock the entire tree until it
    completes.
  • With MVCC enabled, Iterations won't block the writer and won't
    interfere with other concurrent readers. But if there are hundreds
    or thousands of mutations happening in the background while the
    iterator is holding the snapshot, it can lead to huge memory
    pressure.

To avoid this, implement a PiecewiseIterator that scans the tree
part by part and stitches them together to simulate a full-table scan.
Here are some implementation guidelines.

  • Can be implemented only when the application maintains a monotonically
    increasing seqno for every mutation (CREATE, UPDATE, DELETE).
  • LLRB should be created with metadata.bornseqno and metadata.deadseqno
    enabled.
  • PiecewiseIterators will repeat a large number of small Iterations on
    the tree until a full table scan is complete.
  • The application should supply tillseqno as a form of timestamp, to
    filter out mutations that happen after the point in time at which
    the PiecewiseIteration started.
  • The first iteration will have startkey as nil and endkey as nil.
  • Each iteration will read only 1000 entries, and the last key will be
    remembered as the startkey for the next iteration with inclusion
    set to high.
  • For every iteration, the entries read from the tree will be filtered
    and then returned to the application.
    • Get the node's timestamp by picking the larger value between
      bornseqno and deadseqno.
    • The node's timestamp should be less than or equal to tillseqno;
      else skip the entry.

The key difference between Iterator and PiecewiseIterator is that
PiecewiseIterator does not give a point-in-time view of the tree snapshot.
For example, if an entry that has not yet been read by the piecewise-iterator
is updated after the iteration has started, then the entry might be skipped
and won't be part of the final output.

This implies:

  • The final output won't contain the full sample set of entries.
  • The output cannot be queried with point-in-time correctness.

Hopefully it can still be used with LSM reads.
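
A minimal sketch of the piecewise loop described above; the entry and range-scan types are illustrative stand-ins, not the actual LLRB API:

package llrb

// scanEntry and rangeFn are illustrative stand-ins for the LLRB node and
// range-scan API.
type scanEntry struct {
    key                  []byte
    bornseqno, deadseqno uint64
}

// rangeFn returns at most limit entries starting after startkey
// (startkey == nil means start from the beginning of the tree).
type rangeFn func(startkey []byte, limit int) []scanEntry

// piecewiseScan repeats small iterations of 1000 entries each, remembering
// the last key as the startkey for the next iteration, and filters out
// entries mutated after tillseqno.
func piecewiseScan(scan rangeFn, tillseqno uint64, emit func(scanEntry)) {
    var startkey []byte
    for {
        entries := scan(startkey, 1000)
        if len(entries) == 0 {
            return // full table scan complete
        }
        for _, e := range entries {
            // the node's timestamp is the larger of bornseqno and deadseqno
            ts := e.bornseqno
            if e.deadseqno > ts {
                ts = e.deadseqno
            }
            if ts <= tillseqno { // skip mutations that happened after the scan started
                emit(e)
            }
        }
        startkey = entries[len(entries)-1].key
    }
}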

BUBT: Implement Stats().

The returned statistics map should include only pre-computed statistics; avoid walking over the tree.
This must be a cheap call. Some of the statistics to include:

  • keymem: total payload of keys.
  • valmem: total payload of values.
  • paddingmem: bytes wasted, carrying neither payload nor overhead fields.
  • n_mblocks: number of m-blocks in m-index file.
  • n_zblocks: number of z-blocks in z-index file.
  • n_vblocks: number of blocks in value file.
  • seqno: highest seqno associated with a key,value entry.
  • epoch: timestamp since January 1, 1970 UTC.
  • n_deleted: number of entries marked as deleted.

All of them can be part of the infoblock.
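
A minimal sketch of such a call, assuming the values are already decoded from the infoblock (field and key names mirror the list above and are illustrative):

package bubt

// infoblock holds pre-computed statistics persisted at build time; the
// fields shown are illustrative, not the actual bubt layout.
type infoblock struct {
    keymem, valmem, paddingmem      int64
    n_mblocks, n_zblocks, n_vblocks int64
    seqno                           uint64
    epoch                           int64
    n_deleted                       int64
}

// Stats returns only pre-computed statistics; it never walks the tree,
// so the call stays cheap.
func (ib *infoblock) Stats() map[string]interface{} {
    return map[string]interface{}{
        "keymem":     ib.keymem,
        "valmem":     ib.valmem,
        "paddingmem": ib.paddingmem,
        "n_mblocks":  ib.n_mblocks,
        "n_zblocks":  ib.n_zblocks,
        "n_vblocks":  ib.n_vblocks,
        "seqno":      ib.seqno,
        "epoch":      ib.epoch,
        "n_deleted":  ib.n_deleted,
    }
}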

BUBT: Optimise memory allocation.

  • z-block and m-block allocations can be recycled once they are flushed to disk.
  • Benchmark block-level functions and entry-level functions.
  • Memory-profile BUBT for allocations and frees.
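
One possible way to recycle flushed z-blocks and m-blocks is a sync.Pool keyed to a fixed block size; a minimal sketch, with blocksize as a placeholder constant:

package bubt

import "sync"

const blocksize = 4096 // placeholder block size, not the actual setting

// blockpool recycles z-block / m-block buffers once they are flushed.
var blockpool = sync.Pool{
    New: func() interface{} { return make([]byte, blocksize) },
}

// getblock returns a reusable block buffer.
func getblock() []byte { return blockpool.Get().([]byte) }

// putblock recycles a block after it has been flushed to disk.
func putblock(b []byte) { blockpool.Put(b) }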

LSM: Testcase with deleted entries.

Add a test case that has deleted entries and entries that have not suffered deletes. Add another case where deleted entries are created once again.

LSM: Simple test cases.

Create a local data structure, and a list of the same, to test LSM logic. The purpose of this test case, compared to the existing LLRB based merge index, is to make corner-case tests easier to create and maintain.

LLRB: Optimize go_writer.go:respch.

go_writer.go exports several APIs that internally use an unbuffered
chan []interface{} as a response channel, purely for
synchronisation.

Maybe this can be changed to chan struct{}.
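
A minimal sketch of the proposed change, with hypothetical command/response plumbing standing in for the go_writer.go internals:

package llrb

// cmd is an illustrative write command sent to the writer go-routine;
// respch is used purely for synchronisation, so chan struct{} is enough.
type cmd struct {
    op     string
    key    []byte
    respch chan struct{}
}

// apply hands the command to the writer go-routine and blocks until the
// writer signals completion on respch.
func apply(reqch chan cmd, op string, key []byte) {
    respch := make(chan struct{})
    reqch <- cmd{op: op, key: key, respch: respch}
    <-respch // wait for the writer go-routine to acknowledge
}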

BUBT: Implement Log.

Should provide two variants.

  • One for when the index is actively used and periodically logged.
  • Another with involved details on the index, before the index becomes active and/or before it is about to be destroyed.

LSM: Validation and benchmark.

Create a command-line tool, maybe outside the gostore repository, to:

  • Validate the LSMRange and LSMMerge APIs.
  • Benchmark the performance of LSMRange and LSMMerge against single-index iteration.
  • Memory- and CPU-profile LSMRange and LSMMerge.
  • Try with 1, 2, 4, 8 iterators. Even better, make the number of indexes to merge configurable via the command line.

Make this easily repeatable, and add command-line options to integrate the tool with gostore CI.

Statistics: Overview and detailed descriptions.

Statistics are vital for debugging and characterisation. Adequate stats are implemented for the malloc/, bubt/ and llrb/ packages. Create a page with a short overview of storage and memory stats, along with a detailed description of how to interpret them and how they relate to each other.

Memory utilization: Improve and refactor.

Package malloc/ has a global variable MEMUtilization used for
Blocksize calculation. Once the Slab struct is created and
Blocksizes and SuitableSize are localized to Slab, we can
configure MEMUtilization per slab instance instead of keeping
it global.

Package llrb/ has a config parameter called memutilization, used
while validating the llrb tree. There is also an API, ExpectedUtilization,
that makes the config parameter redundant. I suppose we can avoid
both the config parameter and the API, and instead add an argument
to Validate for the expected memory-utilization ratio.
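
A minimal sketch of the per-slab configuration; the Slab type and method bodies are illustrative, not the malloc/ implementation:

package malloc

// Slab localizes Blocksizes and SuitableSize; memutilization is configured
// per instance instead of via the global MEMUtilization.
type Slab struct {
    memutilization float64 // target utilisation ratio, e.g. 0.95
}

// Blocksizes generates block sizes from min to max such that stepping from
// one size to the next keeps utilisation above the configured ratio.
func (s *Slab) Blocksizes(min, max int64) []int64 {
    sizes := []int64{}
    for size := min; size <= max; {
        sizes = append(sizes, size)
        next := int64(float64(size) / s.memutilization)
        if next <= size {
            next = size + 1
        }
        size = next
    }
    return sizes
}

// SuitableSize picks the smallest generated block size that can hold size.
func (s *Slab) SuitableSize(sizes []int64, size int64) int64 {
    for _, bsize := range sizes {
        if size <= bsize {
            return bsize
        }
    }
    return sizes[len(sizes)-1]
}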

README page.

Start a proper README page for gostore. Every sub-package should have a README as well. Let the README content follow this template:

  • State the goals in clear bullet points.
  • Current status of the project; make it informative for potential users of gostore.
  • Quicklinks.
  • Settings are important in gostore. Add an introduction and relevant links to default-settings for bubt, llrb, malloc.
  • Introduction to ideas and concepts.
  • List panic cases, and their recovery.
  • Projects using gostore.
  • Links to external articles, papers, news, blogs.
  • How to contribute.

For sub-package READMEs, keep the original details in the sub-package's doc.go. Include an introduction to basic concepts and ideas, and a getting-started guide.

Cleanup panic and recovery.

  • Panic only when absolutely required.
  • If a function / go-routine panics, document the same.
  • Every go-routine should be coded in a separate file; document exit/panic/recover cases for all go-routines.
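
A small sketch of that convention for a single go-routine, with its exit/panic/recover cases documented next to the recover block (names are illustrative):

package lib

import "log"

// runWriter shows the convention: one go-routine per file, with its
// exit, panic and recover cases documented where they are handled.
func runWriter(reqch chan func(), killch chan struct{}) {
    go func() {
        // Exit: when killch is closed.
        // Panic: only on programming errors inside a request callback.
        // Recover: log the panic and exit the go-routine cleanly.
        defer func() {
            if r := recover(); r != nil {
                log.Printf("writer go-routine recovered: %v", r)
            }
        }()
        for {
            select {
            case fn := <-reqch:
                fn()
            case <-killch:
                return
            }
        }
    }()
}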

LLRB: Read-lock for Clone()

Right now the Clone() implementation acquires a write lock while cloning the
llrb tree. I think this can be converted into a read lock.
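
A minimal sketch of the change, assuming the tree guards its root with a sync.RWMutex (types and names are illustrative, not the llrb implementation):

package llrb

import "sync"

// tree is an illustrative stand-in for the LLRB index.
type tree struct {
    rw   sync.RWMutex
    root *node
}

type node struct {
    key, value  []byte
    left, right *node
}

// Clone takes only a read lock: cloning never mutates the source tree,
// so concurrent readers need not be blocked.
func (t *tree) Clone() *tree {
    t.rw.RLock() // previously a full write lock
    defer t.rw.RUnlock()
    return &tree{root: clonenode(t.root)}
}

func clonenode(nd *node) *node {
    if nd == nil {
        return nil
    }
    newnd := *nd
    newnd.left, newnd.right = clonenode(nd.left), clonenode(nd.right)
    return &newnd
}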

BUBT: Implement Fullstats().

The returned statistics map should include all statistics that are relevant to the btree index.

  • Include build time statistics.
  • Include IndexReader / IndexSnapshot related statistics.
  • If there is a need to walk the entire tree to compute statistics, this is the place to do it. If possible, try to move this computation into Build().

LLRB: Stats review and writeup.

  • Review stats counting in LLRB storage.
  • Create a write-up on stats accounting, what each stat means, and how
    the stats relate to each other. This must be a base document
    for everyone who does characterization.
  • Check whether stats values need to be protected by atomic operations.

Slack channel.

Start a slack channel for the following projects - golog, gosettings, gson, gofast, gostore. If possible, start it under gophers.slack.com.

Reverse-link it from golog, gosettings, gson, gofast, and gostore.

LLRB: Block diagram of go-routines.

A picture can say a thousand words. Create a block diagram of the go-routines, showing how they interact with application logic and the underlying socket. Mostly it is important to trace the execution path and its exceptional cases.

LLRB: Testcases with lsm enabled.

Log-Structured-Merge

Log-Structured-Merge (LSM) is available off-the-shelf with LLRB store.
Just enable lsm via settings while creating the LLRB tree. Enabling
LSM will have the following effects:

  • DeleteMin, DeleteMax and Delete will simply mark entries as deleted,
    and their deadseqno will be updated to currseqno.
  • For a Delete operation, if the entry is missing in the index, an entry
    will be inserted and then marked as deleted, with its deadseqno
    updated to currseqno.
  • When a key marked as deleted is Upserted again, its deadseqno will
    be set to ZERO and the deleted flag cleared.
  • In case of UpsertCAS, the CAS should match before the entry is cleared
    from the delete log.
  • All of the above behaviour is equally applicable with MVCC enabled.

NOTE: DeleteMin and DeleteMax are not useful when the LLRB index is only
holding a subset, called the working-set, of the full index.
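
A minimal sketch of the delete-marking rules above, with illustrative entry fields rather than the actual LLRB node layout:

package llrb

// lsmEntry is an illustrative stand-in for an LLRB node with LSM enabled.
type lsmEntry struct {
    value     []byte
    bornseqno uint64
    deadseqno uint64
    deleted   bool
}

// lsmDelete marks the entry as deleted and stamps deadseqno with currseqno;
// the entry is retained in the index as a delete marker.
func lsmDelete(e *lsmEntry, currseqno uint64) {
    e.deleted = true
    e.deadseqno = currseqno
}

// lsmUpsert clears the delete marker: deadseqno goes back to zero and the
// deleted flag is cleared.
func lsmUpsert(e *lsmEntry, value []byte, currseqno uint64) {
    e.value = value
    e.bornseqno = currseqno
    e.deadseqno = 0
    e.deleted = false
}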

Malloc: from fair-pools to optimal-pools

At present malloc uses a fair model to allocate a pool from the OS. There
are some considerations while allocating an entire pool from the OS:

  • A pool, for any slab, cannot contain more chunks than Maxchunks.
  • If the pool size is too small, the allocator will end up with too many
    pools for the same chunk-size.
  • If the pool size is too large, then partially utilised pools will
    lead to poor memory utilisation.

Right now, the allocator assumes that the number of chunks allocated by the
application in each slab will be the same for all slab sizes. This is
probably a good way to start a new arena instance, but with every
new allocation we learn more of the chunk-size histogram for
each slab, and with that information we can pick an optimal size
while allocating the next pool from the OS.
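
A minimal sketch of picking the next pool's chunk count from a per-slab allocation histogram, capped by Maxchunks (names are illustrative, not the malloc/ API):

package malloc

// histogram counts, per slab size, how many chunks the application has
// allocated so far; illustrative only.
type histogram map[int64]int64

// nextPoolChunks picks how many chunks the next pool for slabsize should
// hold: proportional to observed demand, but never more than maxchunks
// and never fewer than a small fair minimum.
func nextPoolChunks(h histogram, slabsize, maxchunks int64) int64 {
    const fairMin = 64 // fair default for slabs with little history
    chunks := h[slabsize]
    if chunks < fairMin {
        chunks = fairMin
    }
    if chunks > maxchunks {
        chunks = maxchunks
    }
    return chunks
}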

BUBT: Block diagram of go-routines.

A picture can say a thousand words. Create a block diagram of the go-routines, showing how they interact with application logic and the underlying socket. Mostly it is important to trace the execution path and its exceptional cases.

Cleanup dummy imports.

There might be imports of the form

import _ "fmt"

or

import "fmt"

var _ = fmt.Sprintf(...)

Remove them. Search the net and ask the community for alternate ideas.
