andy-byers / calicodb

A tiny embedded, transactional key-value database 🐱

License: MIT License

CMake 1.42% C++ 98.23% Python 0.29% C 0.04% Shell 0.02%

cpp cpp17 database key-value-database key-value-store write-ahead-log bplus-tree

calicodb's Issues

Prefix B+-tree

Is your feature request related to a problem? Please describe.
Not necessarily, this is more of an optimization.

Describe the solution you'd like
I'd like to experiment with converting the B⁺-tree into a prefix B⁺-tree as described in this paper.
This has the potential to save quite a bit of space in internal nodes, depending on how similar the keys are.
Also, if pivot keys are very long, we may be able to avoid having them spill onto overflow pages.

Describe alternatives you've considered
This seems to be the paper that introduced this concept, making it a good starting point.

Additional context

NOTE: The paper talks about B*-trees, rather than B⁺-trees, but there doesn't seem to be too much of a difference, at least for our purposes.

This feature can be implemented by modifying PayloadManager::promote(), which is used to convert a cell read from an external node into a pivot cell suitable for posting to an internal node.
It needs to accept the cell immediately to the left of the cell being promoted.
Additionally, the cell being promoted must be detached before being passed to PayloadManager::promote(), since it will be overwritten.
PayloadManager::promote() should be modified to compare the keys, s and t, of the 2 cells.
The shortest key, k, should be determined, such that s < k <= t.
This is the shortest key that is necessary to direct searches toward the correct external node.
The cell should then be modified to hold k (the key size and overflow ID should be corrected).
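Note that k is always a prefix of t, so it will never be longer than t, and we may be able to avoid copying an overflow chain. (It can be one byte longer than s, in the case where s is itself a prefix of t.)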
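
The key computation described above can be sketched as follows. This is only an illustration, not the actual PayloadManager::promote() code: the function name shortest_separator is hypothetical, keys are assumed to be compared bytewise, and s < t is assumed to hold on entry.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of CalicoDB): returns the shortest key k
// such that s < k <= t, assuming s < t under bytewise comparison.
// The result is always a prefix of t, so it refers to t's memory.
inline std::string_view shortest_separator(std::string_view s, std::string_view t)
{
    assert(s < t);
    const std::size_t limit = std::min(s.size(), t.size());
    // Find the length of the common prefix of s and t.
    std::size_t n = 0;
    while (n < limit && s[n] == t[n]) {
        ++n;
    }
    // Since s < t, t must have a byte at index n: either the keys differ
    // there, or s is a proper prefix of t. Keeping that byte makes k > s.
    return t.substr(0, n + 1);
}
```

For example, promoting t = "banana" when the left neighbor is s = "apple" would post the 1-byte pivot "b" rather than the full 6-byte key.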

At some later point, Tree::redistribute_cells() should be further changed to consider multiple candidates around the chosen split point, selecting the one that produces the shortest pivot key.
This technique is described in the paper, but should be easier to implement here due to the presence of the indirection vector.
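
A rough sketch of the candidate selection, under several assumptions: the keys of the overflowing node are available as a sorted vector of unique strings, the "split point" is the index of the first key that goes to the right sibling, and only key length (not cell size or node balance) is considered. The names separator_length and choose_split are hypothetical and do not appear in Tree::redistribute_cells().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Length of the shortest key k with s < k <= t, for adjacent keys s < t.
inline std::size_t separator_length(const std::string &s, const std::string &t)
{
    const std::size_t limit = std::min(s.size(), t.size());
    std::size_t n = 0;
    while (n < limit && s[n] == t[n]) {
        ++n;
    }
    return n + 1;
}

// Hypothetical: examine split points within `radius` positions of `target`
// and return the one whose pivot would be shortest. A split point i means
// keys[i] becomes the first key in the right sibling. Ties keep `target`.
inline std::size_t choose_split(const std::vector<std::string> &keys,
                                std::size_t target, std::size_t radius)
{
    assert(keys.size() >= 2);
    assert(target >= 1 && target < keys.size());
    const std::size_t lo = target > radius ? target - radius : 1;
    const std::size_t hi = std::min(keys.size() - 1, target + radius);
    std::size_t best = target;
    std::size_t best_len = separator_length(keys[target - 1], keys[target]);
    for (std::size_t i = lo; i <= hi; ++i) {
        const std::size_t len = separator_length(keys[i - 1], keys[i]);
        if (len < best_len) {
            best = i;
            best_len = len;
        }
    }
    return best;
}
```

A larger radius gives shorter pivots at the cost of less evenly filled siblings, so a real implementation would probably bound it by the resulting node sizes.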

I'm pretty sure that's it for the "simple prefix B-tree" part of the paper. At this point, we can consider implementing the "prefix B-tree".
It would likely produce further fan-out improvement at the cost of some more complexity (pivots may need to be recomputed after structure modification operations, or SMOs).

Remove the Bytes object

One of the first things I did when starting this project was to write a slice-type object. I made two versions, one that allows mutating the underlying data, and one that does not. It's actually not a great idea to have the mutating version, as it adds complication and weirdness (const_cast, CRTP stuff in a public header for no good reason). Furthermore, the user never actually needs it to use the API! It's only used internally. Most of the time when we need a non-const pointer for use as an out parameter, we already know the size of the memory it points to, or we are otherwise able to guarantee that it'll be large enough. We should be able to use raw pointers in these cases, and unify the Bytes and BytesView objects into a single Slice object.
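
As a sketch of what the unified type might look like, here is a minimal read-only Slice. The member names (advance, truncate, to_string) and the copy_into helper are assumptions made for illustration, not the library's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical read-only slice: a pointer and a length, with no ownership.
class Slice {
public:
    Slice() = default;
    Slice(const char *data, std::size_t size) : m_data(data), m_size(size) {}
    Slice(const std::string &s) : m_data(s.data()), m_size(s.size()) {}

    const char *data() const { return m_data; }
    std::size_t size() const { return m_size; }
    bool is_empty() const { return m_size == 0; }

    // Drop the first n bytes from the front of the slice.
    Slice &advance(std::size_t n)
    {
        assert(n <= m_size);
        m_data += n;
        m_size -= n;
        return *this;
    }

    // Keep only the first n bytes of the slice.
    Slice &truncate(std::size_t n)
    {
        assert(n <= m_size);
        m_size = n;
        return *this;
    }

    std::string to_string() const { return std::string(m_data, m_size); }

private:
    const char *m_data = "";
    std::size_t m_size = 0;
};

// Internal code that needs an out parameter takes a raw pointer. The caller
// guarantees that `out` has room for at least in.size() bytes.
inline void copy_into(char *out, const Slice &in)
{
    std::memcpy(out, in.data(), in.size());
}
```

Because nothing in the class ever writes through the pointer, there is no need for const_cast or a CRTP base shared between two slice types.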

Vacuum functionality

Is your feature request related to a problem? Please describe.
When database pages are no longer needed, they are added to the free list.
We should have a way to collect these pages at the end of the file, so it can be truncated.
Otherwise, the database file will never get smaller, even if all the records are erased.

Describe the solution you'd like
I've created a stub method, Status Database::vacuum().
Once implemented, this method should reclaim some portion of the free list, if not the whole thing, and truncate the file.

I believe we could do something like this:

1. When working with any page that isn't a tree node (free list pages or overflow value pages), we should leave the 9th byte cleared (this is the "flags" byte in tree nodes). Then in tree nodes, we'll use 2 bits from the flags byte to indicate type, rather than 1. Neither node type will have a flags byte of 0, so we can distinguish node pages from non-node pages.
2. We also need to keep back pointers on each free list page and overflow value page.
3. Then, we can iterate through the free list. For each free list page, we consider the last page in the file. If they are the same page, we can truncate the file and move on. Otherwise, we swap the two pages and update any necessary back references (should be possible for any type of page). Then, the free list page is at the end of the file, so we once again truncate and move on.
4. This process is repeated until the free list is empty.
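
The swap-and-truncate loop can be illustrated with a small in-memory model. Everything here is a simplifying assumption: the "file" is a vector of pages, free pages are marked with a flag instead of being linked into a real free list, each page has at most one back pointer, and the Page, relocate, and vacuum names are hypothetical rather than taken from the codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Page IDs are indices into the vector. kNone means "no reference".
constexpr std::size_t kNone = static_cast<std::size_t>(-1);

struct Page {
    bool is_free = false;
    std::size_t back_ptr = kNone;   // The page that refers to this one
    std::vector<std::size_t> refs;  // Pages that this one refers to
};

// Move the page at `from` into slot `to`, then patch the references that
// point at it (from its parent) and the back pointers of its children.
inline void relocate(std::vector<Page> &pages, std::size_t from, std::size_t to)
{
    pages[to] = pages[from];
    const std::size_t parent = pages[to].back_ptr;
    if (parent != kNone) {
        for (auto &ref : pages[parent].refs) {
            if (ref == from) {
                ref = to;
            }
        }
    }
    for (const std::size_t child : pages[to].refs) {
        pages[child].back_ptr = to;
    }
}

// Returns the number of pages removed from the end of the "file".
inline std::size_t vacuum(std::vector<Page> &pages)
{
    std::size_t removed = 0;
    std::size_t next_free = 0;
    while (!pages.empty()) {
        if (pages.back().is_free) {
            // The last page is free: just truncate.
            pages.pop_back();
            ++removed;
            continue;
        }
        // Find the lowest-numbered free page that is still in the file.
        while (next_free < pages.size() && !pages[next_free].is_free) {
            ++next_free;
        }
        if (next_free >= pages.size()) {
            break; // No free pages remain.
        }
        // Fill the hole with the last page, then truncate.
        relocate(pages, pages.size() - 1, next_free);
        pages.pop_back();
        ++removed;
    }
    return removed;
}
```

In the real file format, the back pointers from step 2 are what make relocate() possible without scanning the whole tree for the page's parent.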

Non-logical refactors

Here are some things I'd like to change that aren't necessarily related to any major program logic. Many will break the API, however.

MIT license
Support common package management tools
Add clang-format/sanitizers to CI workflow?

Repair functionality

Is your feature request related to a problem? Please describe.
It is possible for a database to be corrupted such that it cannot be opened normally. It's also possible that in such a case, there is enough information in the WAL to roll back and fix the problem. We should have a way to repair those databases so they can be opened normally.

Describe the solution you'd like
I've added a stub method, static Status Database::repair(const Slice &, const Options &), that, once implemented, should fix corrupted databases that can't be opened normally. It can't work for many types of corruption, but it's worth a try!

There are so many different things to consider and try in this feature. We may be able to just open the WAL without actually opening the database, and roll the WAL while writing updates straight to the database file.

WAL segment file naming scheme

Describe the bug
We need a naming scheme for the segment files generated by the write-ahead log (WAL)! Currently, we are using the format "wal-xxxxxx", where "xxxxxx" is a 6-digit number. This clearly won't work, since long-running databases will eventually run out of identifiers.

To Reproduce
Steps to reproduce the behavior:
Run the tests.

Expected behavior
We would expect that the segment file naming format can handle arbitrarily large numbers (we start reading WALs at the lowest-numbered segment file during recovery, so the file numbers must be monotonically increasing). Alternatively, we could keep some sort of manifest file and use that to determine the current WAL segment number on startup, then recycle segment file numbers/names.
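
One possible shape for the first option is sketched below: keep the existing "wal-" prefix but drop the fixed 6-digit width, and always compare segment numbers numerically after parsing, since unpadded names no longer sort correctly as strings. The function names are hypothetical, and a 64-bit identifier is assumed to be large enough in practice rather than truly arbitrary.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical: build a segment file name from a 64-bit segment number.
inline std::string encode_segment_name(std::uint64_t id)
{
    return "wal-" + std::to_string(id);
}

// Hypothetical: parse a segment file name. Returns std::nullopt if the name
// lacks the prefix, has no digits, contains a non-digit, or overflows.
// Zero-padded names from the old 6-digit scheme still parse.
inline std::optional<std::uint64_t> decode_segment_name(std::string_view name)
{
    constexpr std::string_view prefix = "wal-";
    if (name.size() <= prefix.size() || name.substr(0, prefix.size()) != prefix) {
        return std::nullopt;
    }
    std::uint64_t id = 0;
    for (const char c : name.substr(prefix.size())) {
        if (c < '0' || c > '9') {
            return std::nullopt;
        }
        const std::uint64_t digit = static_cast<std::uint64_t>(c - '0');
        if (id > (UINT64_MAX - digit) / 10) {
            return std::nullopt; // id * 10 + digit would overflow
        }
        id = id * 10 + digit;
    }
    return id;
}
```

During recovery, the directory listing would be decoded with decode_segment_name() and sorted by the resulting numbers, ignoring any file for which it returns std::nullopt.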

`db_format_fuzzer` crashes

Describe the bug
db_format_fuzzer runs various operations against a database file made up of fuzzer input. It finds crashes in CalicoDB.

To Reproduce
Steps to reproduce the behavior:

1. Clone the repo and run

mkdir build && cd ./build
export CC=clang
export CXX=clang++
cmake -DCMAKE_CXX_FLAGS="-g3 -O3 -fsanitize=fuzzer" \
      -DCALICODB_WithASan=On -DCALICODB_WithUBSan=On \
      -DCALICODB_BuildTests=Off -DCALICODB_BuildFuzzers=On \
      -DCALICODB_FuzzerStandalone=Off ..
cmake --build . --target db_format_fuzzer

in the project root directory to build the fuzzer.
2. It's a good idea to create a corpus directory. Actually, this fuzzer won't find anything interesting unless it starts with a corpus filled with valid databases. There isn't a great way to create such a corpus right now. I just run the tests and copy any leftover databases into a directory and use that.
3. Run the fuzzer with

./fuzzers/db_format_fuzzer ${CORPUS}

The fuzzer should crash pretty quickly.

Expected behavior
This fuzzer should not crash.

Additional context
Each time a node is read and a Node is created, we walk the whole intra-node freelist to determine how much total free space there is. We take this opportunity to make sure the freelist is not corrupted. If it is corrupted, we don't access the node further. We are also able to make sure that any modifications to the freelist while the node is live don't cause corruption.

Maintaining the freelist invariants is less expensive than doing so for all the cells in the node. There are likely to be many more cells than freelist blocks, since freelist blocks are merged whenever possible. Also, it's much easier to tell when the freelist is corrupted, since the freelist blocks are sorted by offset and are always at least 4 bytes offset from one another.

So basically, it seems like whenever the fuzzer crashes, it is trying to access some cell value that is corrupted. The routines that actually parse cells will return -1 if they are about to read off the end of the page, but some other routines (like Node::read_child_id()) do not. We mask cell offsets so that they are always less than the page size, but this doesn't help if we access multiple bytes, since a multi-byte read that starts near the end of the page can still run past it.
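
One way to close that gap is to check the full width of each read, not just the starting offset. The helper below is a hypothetical sketch, not the current Node::read_child_id(): the 4-byte width and little-endian layout are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical: read a 4-byte little-endian integer at `offset`, returning
// std::nullopt if any of the 4 bytes would fall outside of the page.
inline std::optional<std::uint32_t> read_u32_checked(const unsigned char *page,
                                                     std::size_t page_size,
                                                     std::size_t offset)
{
    assert(page != nullptr);
    constexpr std::size_t width = sizeof(std::uint32_t);
    // Written as a subtraction so that offset + width cannot wrap around.
    if (page_size < width || offset > page_size - width) {
        return std::nullopt;
    }
    return static_cast<std::uint32_t>(page[offset]) |
           static_cast<std::uint32_t>(page[offset + 1]) << 8 |
           static_cast<std::uint32_t>(page[offset + 2]) << 16 |
           static_cast<std::uint32_t>(page[offset + 3]) << 24;
}
```

With an 8-byte page, an offset of 7 passes the "less than the page size" mask but is still rejected here, because the read would cover bytes 7 through 10.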

Need to store the "record LSN" for each page in memory

Describe the bug
At some point, I foolishly thought that I could get away with not saving the log sequence number (LSN) of the update that made each dirty page dirty. After implementing this thing, I realized that this information is necessary, otherwise we end up with frequently-used pages staying in the cache forever and never being written to disk. Right now, in develop, we are flushing after each commit. This makes commit very expensive, and is unnecessary. In fact, I don't think it's even enough to really fix the problem in all cases. We also have the problem that we will end up cleaning up WAL segments before the corresponding pages are written back to the database.

To Reproduce
Run the tests.

Expected behavior
For each page that we have in memory, we should store the value of the page LSN before the page was initially made dirty. Then, when the page is written back to disk (made clean), this value is set to the updated page LSN. This value is only needed while pages are in memory (page must be written to disk before its frame can be reused). This means that we can store it as a field in either the dirty list entry type, or the page registry entry type. Then, we can make the decision to flush a page when it gets too out-of-sync, or if we would like to reclaim some WAL segments, etc. Also, we won't have to flush during each commit.
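
A minimal sketch of the bookkeeping, assuming the record LSN lives in a table keyed by page ID (the DirtyTable class and its method names are hypothetical, and the actual dirty list or page registry entry types may look quite different):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>

using Lsn = std::uint64_t;
using PageId = std::uint64_t;

// Hypothetical: tracks the record LSN of each dirty page, i.e. the page LSN
// from just before the page was first made dirty.
class DirtyTable {
public:
    void mark_dirty(PageId id, Lsn page_lsn_before_update)
    {
        // emplace() does nothing if the page is already dirty, so the value
        // recorded when the page first became dirty is kept.
        m_record_lsns.emplace(id, page_lsn_before_update);
    }

    // Call after the page has been written back to the database file.
    void mark_clean(PageId id) { m_record_lsns.erase(id); }

    // Smallest record LSN among dirty pages, or nullopt if none are dirty.
    std::optional<Lsn> oldest_record_lsn() const
    {
        std::optional<Lsn> oldest;
        for (const auto &entry : m_record_lsns) {
            if (!oldest || entry.second < *oldest) {
                oldest = entry.second;
            }
        }
        return oldest;
    }

    // A segment is no longer needed by any dirty page if every update in it
    // is already reflected in the on-disk version of each dirty page.
    bool can_remove_segment(Lsn last_lsn_in_segment) const
    {
        const auto oldest = oldest_record_lsn();
        return !oldest || last_lsn_in_segment <= *oldest;
    }

private:
    std::unordered_map<PageId, Lsn> m_record_lsns;
};
```

The page with the oldest record LSN is the natural candidate to flush first when WAL segments need to be reclaimed. Other constraints on segment removal, such as uncommitted transactions, are outside the scope of this sketch.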

