
yahoo / halodb

493 stars · 27 watchers · 101 forks · 576 KB

A fast, log structured key-value store.

Home Page: https://yahoodevelopers.tumblr.com/post/178250134648/introducing-halodb-a-fast-embedded-key-value

License: Apache License 2.0

Java 100.00%
storage-engine java embedded-database key-value-store big-data

halodb's People

Contributors

amannaly, bellofreedom, erichetti, goelpulkit, gwirvin, retlawrose, rfecher, wangtao724


halodb's Issues

Multiple writer threads may result in low performance

In the HaloDBInternal class, the boolean put(byte[] key, byte[] value) method takes a global write lock, which may result in low performance when multiple threads are writing.

    boolean put(byte[] key, byte[] value) throws IOException, HaloDBException {
        if (key.length > Byte.MAX_VALUE) {
            throw new HaloDBException("key length cannot exceed " + Byte.MAX_VALUE);
        }

        //TODO: more fine-grained locking is possible. 
        writeLock.lock();
        try {
            Record record = new Record(key, value);
            record.setSequenceNumber(getNextSequenceNumber());
            record.setVersion(Versions.CURRENT_DATA_FILE_VERSION);
            RecordMetaDataForCache entry = writeRecordToFile(record);
            markPreviousVersionAsStale(key);

            //TODO: implement getAndSet and use the return value for
            //TODO: markPreviousVersionAsStale method.
            return inMemoryIndex.put(key, entry);
        } finally {
            writeLock.unlock();
        }
    }

Inefficient file formats

It appears that the file formats have a lot of redundancy.

For example, every Record, Tombstone, and Index entry has an individual crc32, a version byte, plus a 4-byte record size.

Let's take IndexFileEntry as an example:

/**
 * checksum         - 4 bytes. 
 * version          - 1 byte.
 * Key size         - 1 bytes.
 * record size      - 4 bytes.
 * record offset    - 4 bytes.
 * sequence number  - 8 bytes
 */

A few things come to mind.

  1. The file could have a header with the version number, since it is identical for all entries. Since the file is only read sequentially, and truncated at the first corrupted item found, the header could contain the first sequenceNumber as well, and values afterward can be deltas relative to this value using a variable length encoding. RecordOffset is similar -- the values are monotonically increasing and could be delta encoded with a variable length integer.
  2. As for the checksum, it could be written for small 'blocks' rather than for each record. This would also accelerate recovery from a crash, as each block could be something like: (2-byte size, 8-byte xxHash checksum, size bytes of index entries). Validating the file would then only need to go a block at a time until it fails. As long as a block had at least 3 entries, it would save space. I suspect something like flushing a block every ~32 entries or 2k bytes (whichever comes first) would work well -- ~9% as many bytes used for checksums, but small enough chunks that it shouldn't significantly impact the chance that data fails to reach disk before a crash. A rough sketch of this idea follows the list.
  3. Also, unless hardware accelerated, crc32 is much slower than xxHash and more prone to collisions.
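
For illustration, here is a minimal sketch of the block-checksum idea from point 2, using the xxHash implementation bundled with the lz4-java dependency that HaloDB already ships; the class name, block limits, and on-disk layout are assumptions for the sketch, not the current file format:

import net.jpountz.xxhash.XXHash64;
import net.jpountz.xxhash.XXHashFactory;

import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

// Illustrative writer: buffers serialized index entries and flushes them as
// (2-byte block size | 8-byte xxHash64 of the entries | the entries themselves)
// once ~32 entries or ~2 KB have accumulated, whichever comes first.
// Assumes each entry is small (an index entry is ~22 bytes).
class ChecksummedBlockWriter {
    private static final int MAX_ENTRIES = 32;
    private static final int MAX_BYTES = 2048;
    private static final long SEED = 0;

    private final XXHash64 hasher = XXHashFactory.fastestInstance().hash64();
    private final ByteBuffer block = ByteBuffer.allocate(MAX_BYTES + 64);
    private final OutputStream out;
    private int entriesInBlock = 0;

    ChecksummedBlockWriter(OutputStream out) {
        this.out = out;
    }

    void append(byte[] entry) throws IOException {
        block.put(entry);
        entriesInBlock++;
        if (entriesInBlock >= MAX_ENTRIES || block.position() >= MAX_BYTES) {
            flushBlock();
        }
    }

    void flushBlock() throws IOException {
        if (entriesInBlock == 0) {
            return;
        }
        int size = block.position();
        long checksum = hasher.hash(block.array(), 0, size, SEED);

        ByteBuffer header = ByteBuffer.allocate(2 + 8);
        header.putShort((short) size).putLong(checksum);
        out.write(header.array());
        out.write(block.array(), 0, size);

        block.clear();
        entriesInBlock = 0;
    }
}

Recovery would then read a block header, hash the following 'size' bytes, and truncate the file at the first block whose checksum does not match.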

Can we support versioning?

I would like support for key-value versioning.

Something like a revision id could be introduced (an incremental number per key, counting up from 0), and at the same time I am interested in tracking key validity (validFrom/createdOn and validTo/invalidatedOn).
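
Until something like this exists inside the engine, a version envelope can be layered on top of the plain byte[] values HaloDB stores today. A rough sketch follows; the field layout and class are illustrative assumptions, not an existing HaloDB API, and note that this only keeps metadata for the latest revision (full history would need something like a composite key of key + revision):

import java.nio.ByteBuffer;

// Illustrative value envelope: 8-byte revision id, 8-byte validFrom and
// 8-byte validTo timestamps (epoch millis; Long.MAX_VALUE = still valid),
// followed by the actual payload bytes.
final class VersionedValue {
    final long revision;
    final long validFrom;
    final long validTo;
    final byte[] payload;

    VersionedValue(long revision, long validFrom, long validTo, byte[] payload) {
        this.revision = revision;
        this.validFrom = validFrom;
        this.validTo = validTo;
        this.payload = payload;
    }

    byte[] encode() {
        return ByteBuffer.allocate(24 + payload.length)
                .putLong(revision)
                .putLong(validFrom)
                .putLong(validTo)
                .put(payload)
                .array();
    }

    static VersionedValue decode(byte[] stored) {
        ByteBuffer buf = ByteBuffer.wrap(stored);
        long revision = buf.getLong();
        long validFrom = buf.getLong();
        long validTo = buf.getLong();
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        return new VersionedValue(revision, validFrom, validTo, payload);
    }
}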

Enhancement: force data flush on write

Hi, thanks for this awesome project. I would like to implement a volume server using HaloDB. As a file store, data durability is a requirement. My question is whether it is possible to enable some sort of write option indicating that a write must be persisted to disk immediately. Would setting options.setFlushDataSizeBytes(0) guarantee data durability?
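
For reference, a minimal sketch of what the question describes, using the options shown in the README (the directory name is illustrative); whether flushDataSizeBytes(0) alone amounts to an fsync-level durability guarantee is exactly the point that needs confirmation:

import com.oath.halodb.HaloDB;
import com.oath.halodb.HaloDBException;
import com.oath.halodb.HaloDBOptions;

public class DurableWriteExample {
    public static void main(String[] args) throws HaloDBException {
        HaloDBOptions options = new HaloDBOptions();
        // Flush written data after every write, as the question suggests.
        options.setFlushDataSizeBytes(0);

        HaloDB db = HaloDB.open("volume_store", options);
        db.put("file-1".getBytes(), new byte[]{1, 2, 3});
        db.close();
    }
}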

Code Coverage maven profile

It's useful to assess what code is and is not covered when writing thorough unit tests.

Adding a maven profile to track and generate code coverage would be useful.

Iteration properties

I need regular database exports, which means iterating over all records.
In my tests, when I create an iterator, records created after the iterator was created are not returned; I have no problem with that.
When I update a record during iteration (insert a different value with the same key), sometimes it is not returned by the iterator. I guess this is because it is a different record: the old one is marked as deleted, and the new one behaves like the first case, where records created after the iterator are not returned.
It also sometimes happens that when I update a record during iteration, the iterator returns both versions of the record with the same key. I cannot reproduce this in my tests, but it happens regularly on a production DB with millions of records.
So if I wanted more consistent results from iteration (no missing updated records or duplicates), am I supposed to stop writing to the DB while iterating? Or maybe also pause compaction?
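
In case it helps frame the question: without snapshot isolation in HaloDB itself, one workable approach is to gate the application's own writes for the duration of the export. A rough sketch, assuming the exporter and the writers live in the same process (the ReadWriteLock is application-side coordination, not a HaloDB feature); whether compaction would also need to be paused to avoid the duplicate case is part of what I am asking:

import com.oath.halodb.HaloDB;
import com.oath.halodb.HaloDBException;
import com.oath.halodb.HaloDBIterator;
import com.oath.halodb.Record;

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class GatedExporter {
    private final ReadWriteLock gate = new ReentrantReadWriteLock();
    private final HaloDB db;

    GatedExporter(HaloDB db) {
        this.db = db;
    }

    // Normal writes take the read side of the gate, so they can run concurrently.
    void put(byte[] key, byte[] value) throws HaloDBException {
        gate.readLock().lock();
        try {
            db.put(key, value);
        } finally {
            gate.readLock().unlock();
        }
    }

    // The export takes the write side, blocking puts for its duration so the
    // iterator sees a stable set of records.
    void export() throws HaloDBException {
        gate.writeLock().lock();
        try {
            HaloDBIterator iterator = db.newIterator();
            while (iterator.hasNext()) {
                Record record = iterator.next();
                writeToExport(record.getKey(), record.getValue());
            }
        } finally {
            gate.writeLock().unlock();
        }
    }

    private void writeToExport(byte[] key, byte[] value) {
        // Placeholder for the actual export sink.
    }
}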

What can I store in HaloDB?

Imagine that both my key and value might be anything from simple data types to complex object structures (coming from JSON or YAML files).

The value could also be, for example, an Avro schema, a JSON schema, JSON, a MessagePack message, whatever.

Is this supported?

Also, now imagine the key is JSON:
{
group: ingestion
type: schema
key: datasource1Pipeline
}

{
group: ingestion
type: schema
key: datasource2Pipeline
}

and the value is the real schema content for both. Now imagine I am going to search for all keys from the group "ingestion" by providing:
{
group: ingestion
}
I expect all keys with that group to be returned.

Is HaloDB designed to implement such query capabilities easily? (I haven't dug into the source code much yet...)

Now I am going to search by key:

Got FileSystemException when repairing database

Every time I repair the database after an unclean shutdown, I get the following error:

Caused by: java.nio.file.FileSystemException: C:\database\1549869944.index.repair -> C:\database\1549869944.index: The process cannot access the file because it is being used by another process.

It seems that when the openDataFilesForReading() method is called, my '.index' file is opened for reading and is not closed afterwards. No other processes use this file.

So when we call the repairFile(DBDirectory dbDirectory) method, we get an exception on the line: Files.move(repairFile.indexFile.getPath(), indexFile.getPath(), REPLACE_EXISTING, ATOMIC_MOVE);

Any suggestions on how to avoid this error?

Data Compression

LZ4 is included as a dependency, but it doesn't look like it is used. Is there a reason for this?

Would you be open to a PR that optionally enables LZ4 compression of keys and/or values?
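
In the meantime, since lz4-java is already on the classpath, values can be compressed at the application layer before calling put. A minimal sketch (prefixing the uncompressed length, which LZ4 needs for decompression; the class is illustrative, not a proposed HaloDB API):

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

import java.nio.ByteBuffer;

final class Lz4Values {
    private static final LZ4Factory FACTORY = LZ4Factory.fastestInstance();

    // Compress a value, prefixing the original length so it can be decompressed later.
    static byte[] compress(byte[] value) {
        LZ4Compressor compressor = FACTORY.fastCompressor();
        byte[] compressed = compressor.compress(value);
        return ByteBuffer.allocate(4 + compressed.length)
                .putInt(value.length)
                .put(compressed)
                .array();
    }

    static byte[] decompress(byte[] stored) {
        ByteBuffer buf = ByteBuffer.wrap(stored);
        int originalLength = buf.getInt();
        LZ4FastDecompressor decompressor = FACTORY.fastDecompressor();
        // The compressed bytes start right after the 4-byte length prefix.
        return decompressor.decompress(stored, 4, originalLength);
    }
}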

Project status?

Hi all! I’m interested in the properties that HaloDB provides, but I’m curious to learn about the status of HaloDB. The repo here looks as though the most recent update is from a few years ago.

Is HaloDB no longer used in production by Yahoo? If not, it would be great to learn what technology was used to replace it, and why 😊 Is there a new/different fork that is the continuation of the project? Keen to learn more. Thanks!

SequenceId seems unnecessary

I may be wrong, but I believe that SequenceId is unnecessary. It eats up 8 bytes per entry in memory, and >= 16 bytes per key on disk (one 8-byte sequence number for each entry in the tombstone/index/data files).

My reasoning:

  • One reason that SequenceId is required is so that, during initialization, tombstones (deletes) and index updates (put/replace) are not processed out of order. Tombstones are processed last, but do not apply to keys that were updated 'after' them.
  • If Tombstones are sequenced in order inside the index file as they occur, then rebuilding the index in the order it was written would resolve the above, without sequenceId.
  • The other reason that SequenceId is required is to support concurrent initialization by multiple threads. For simple 'puts', the fileId is enough to resolve which update should win. But if tombstones are interleaved with index loading as I suggest above, then concurrent threads from different files doing puts and deletes on the same key will have race conditions (if file 2 does a delete on key X, and then file 1 does a put, it would not know that file 2 has already removed it).
  • I see two solutions to the above, assuming tombstones are sequenced in the index/data file in the order they happened relative to the 'puts':
    1. When a delete happens, leave the fileId in the in memory map with the key, marked as deleted (possible with closed addressing but not the current data structures).
    2. Initialize the database in order, from oldest file to newest file, and split the data by hash so that a thread per segment can do the update. This would be limited by how fast one thread could compute the hash of the keys in the tombstone/index file. One optimization would be if compaction were able to merge and split files so that files cover disjoint key hash ranges. For example, compacting files 1, 2, 3, and 4 all together could yield four files "4.a, 4.b, 4.c, 4.d", each representing a distinct hash range, perhaps split by the top two bits of the hash into 4 buckets. Then the initialization could read all four in parallel, as their updates would be disjoint and apply to different Segments.

My biggest concern with such a change is that it is a major overhaul of the file format and in-memory layout, and would not easily share code with the older implementation. The work to make the code support the old and new formats simultaneously could easily be most of the effort.

Storage clustering support

So imagine I will use HaloDB to build some important piece of infrastructure; that means I am interested in running at least 2 instances at a time.

Scheme 1: I can load-balance reads, but push write requests to both instances.
Scheme 2: I can create one instance as a replica over the network and mark reader/writer nodes...
Hmm, both approaches are painful and require a lot of work. What about using https://atomix.io/?

Any ideas are welcome.

Feature request for sequence-based store

Hi @amannaly,

As a true storage DB, would HaloDB support a sequence-based store? That is, implementing byte[] putValue(byte[] value), which returns the sequence number as the key. This would also make things even simpler, and possibly make lookups faster, as the index could be implemented using a simple off-heap ByteBuffer.allocateDirect array instead of Snazy/OHC. Operations on such a store would be strictly PUT, GET, DELETE. As a bonus, iteration with skip/limit is also easy and fast.

Cheers
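
For what it's worth, the proposed API can be prototyped on top of the existing byte[] interface before any engine changes; a rough sketch (the AtomicLong counter and 8-byte key encoding are assumptions for the sketch, and it doesn't capture the memory savings the request is really about, since keys still go through the regular index):

import com.oath.halodb.HaloDB;
import com.oath.halodb.HaloDBException;

import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

class SequenceStore {
    private final HaloDB db;
    // Would need to be re-seeded from existing data on restart; omitted here.
    private final AtomicLong nextSequence = new AtomicLong();

    SequenceStore(HaloDB db) {
        this.db = db;
    }

    // Proposed putValue: the store assigns the sequence number and returns it as the key.
    byte[] putValue(byte[] value) throws HaloDBException {
        byte[] key = ByteBuffer.allocate(8).putLong(nextSequence.getAndIncrement()).array();
        db.put(key, value);
        return key;
    }

    byte[] get(byte[] key) throws HaloDBException {
        return db.get(key);
    }

    void delete(byte[] key) throws HaloDBException {
        db.delete(key);
    }
}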

java.nio.file.AccessDeniedException thrown on Windows during creation of the DBDirectory

When running the README example, I have the following issue on a Windows system:

Exception in thread "main" com.oath.halodb.HaloDBException: Failed to open db halodb_directory
	at com.oath.halodb.HaloDB.open(HaloDB.java:29)
	at com.oath.halodb.HaloDB.open(HaloDB.java:35)
....
Caused by: java.nio.file.AccessDeniedException: D:\halodb_directory
	at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83)
	at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
	at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
	at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(WindowsFileSystemProvider.java:115)
	at java.nio.channels.FileChannel.open(FileChannel.java:287)
	at java.nio.channels.FileChannel.open(FileChannel.java:335)
	at com.oath.halodb.DBDirectory.openReadOnlyChannel(DBDirectory.java:76)
	at com.oath.halodb.DBDirectory.open(DBDirectory.java:35)
	at com.oath.halodb.HaloDBInternal.open(HaloDBInternal.java:79)
	at com.oath.halodb.HaloDB.open(HaloDB.java:26)
	... 12 more

This is a known issue:
http://mail.openjdk.java.net/pipermail/nio-dev/2013-February/002123.html
https://issues.apache.org/jira/browse/HDFS-13586
permazen/permazen#7
cryptomator/fuse-nio-adapter#5

I think the fix made in this commit could be ported to HaloDB:
permazen/permazen@287c94c

CI pipeline for Java 11

The documentation currently says Java 8 only. You should implement some CI for Java 11, so that supporting 11 (and onwards) is zero effort in the future.

Can this be used as a standalone db?

Hi, the description says that HaloDB can be embedded into an application. I was wondering if there is a way to set it up as a standalone DB and connect to it from a remote application. Is that possible?

HaloDBInternal delete not respecting open/close status

Performing a delete operation on an item in a db object that is already in the closed state should throw an exception. Currently this does not happen for the first occurrence of an attempted delete, since the Tombstone file has not yet been created.

A possible solution could be to always initialize a tombstone file whenever a new db file is created.

Truncate Database

Apart from iterating through every record in the database, is there any way to truncate all data?

Seems VERY hard to port beyond JDK 8 - anybody got some ideas on how to do it?

I really like a lot of things about HaloDB and would have liked to contribute some time to port it to more contemporary JDK releases. After looking a bit at the code, though, I am not sure it can be done, at least until Java adds support for manipulating memory regions larger than 2 GB (i.e. replacements for the "unsafe" features that are no longer available, short of using JNI). If anybody has some ideas, I may be interested in putting some effort into implementing them...

Manual Compaction

One of my use cases involves building a large K/V dataset in one location, then closing the data store, transferring and replicating the contents to other locations, and initializing them elsewhere.

For this use case, I would like to have a method to do manual compaction.

Perhaps a method void compact(float threshold) that compacts all data files that have more than threshold ratio of overwritten bytes would be best. For example, calling compact(0.01) would result in all files that have over 1% of their space deleted being compacted.

This could be used for my use case as well as several others. It also would be a useful tool for compaction performance measurement.

Enhancement: Iterator seek

Dear @amannaly,

I have another use case of streaming reads from an offset (think Kafka, DistributedLog, Pulsar). I tried forwarding using iterator.next(); as you can guess, it's slow. Would it be difficult to implement 'HaloDBIterator.seek(offset)' to start the iterator from an offset? BTW, your 'forEachRemaining' is cool, mate. I understand HaloDB gives no order guarantee on delete or update, but for this use case it will be strictly append-only.

Cheers

Supporting keys > 127 bytes long

HaloDB encodes key length as a signed byte on disk, and in memory. It rejects keys longer than 127 bytes.

Although most of my string keys in different databases are between 10 and 80 bytes, I have some rare outliers as large as 780 bytes. Every other K/V store I work with (I have an API abstraction that wraps about 10) can support large keys; only one other breaks at 512 bytes.

There are some options for longer keys:

Easy: read the byte as an unsigned value, and then sizes from 0 to 255 are supported. However, this will make it even harder to support larger keys.
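
The 'easy' variant is mostly a change in how the length byte is interpreted when reading; a small standalone illustration (not the actual HaloDB parsing code):

import java.nio.ByteBuffer;

class KeySizeCodec {
    // Writing: key sizes up to 255 still fit in the existing single length byte.
    static void writeKeySize(ByteBuffer buffer, int keySize) {
        if (keySize > 255) {
            throw new IllegalArgumentException("key length cannot exceed 255");
        }
        buffer.put((byte) keySize);
    }

    // Reading: mask with 0xFF so sizes 128..255 are not read back as negative.
    static int readKeySize(ByteBuffer buffer) {
        return buffer.get() & 0xFF;
    }
}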

Harder: Allow for larger sizes, perhaps up to 2KB.

  • Index/Record/Tombstone files: Steal the top 3 bits from the version byte. Since the version byte is currently 0, new code versions would interpret existing data files the same ( top 3 bits of existing version byte | key size byte ). Old code would interpret any 'extended' keys as a version mismatch and thus still be safe. Therefore I think this can remain version 0. If versions were to get up to 32, a different format would be needed at that time.
  • SegmentWithMemoryPool: No change for now, it will not support key sizes larger than its configured fixedKeySize which would be 127 or less. It could be extended to support key overflow when fixedKeySize is set to 8 or larger. In this case, when the key length is larger than fixedKeySize, then the slot holds a pointer to extended key data, plus whatever prefix of the key fits in the remaining slot (fixedKeySize - 8). An alternative when fixedKeySize is large enough is to keep a portion of the key hash in this area as well, so that the pointer to the extended key data does not need to be followed for most lookups. Even just one byte of the hash that was not used for accessing the hash bucket would decrease the chance that the pointer is followed by a factor of 256 on a miss.
  • SegmentNonMemoryPool: Since all hash entries are individually allocated, (it appears to be closed addressing with a linked chain of entries), the allocated entry in memory can either use a variable length integer encoding for the key/value lengths, or a constant two bytes for the key.

ByteBuffer flip twice

I was attracted by the design of HaloDB recently, and I read its implementation carefully. I found that in the serialize method of InMemoryIndexMetaDataSerializer, the ByteBuffer is flipped twice. I don't understand why it flips twice. Isn't there a problem with flipping twice?
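
For context on what back-to-back flips do in general (a standalone illustration, not a claim about what the HaloDB serializer intends):

import java.nio.ByteBuffer;

public class DoubleFlipDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(42L);                     // position = 8, limit = 16

        buf.flip();                           // position = 0, limit = 8: ready to read
        System.out.println(buf.remaining());  // prints 8

        buf.flip();                           // limit = current position = 0
        System.out.println(buf.remaining());  // prints 0: nothing readable
    }
}

So a second flip with no intervening writes leaves nothing readable; if the serialize method writes into the buffer between the two calls, the second flip would simply re-bound the readable region, which is worth checking against the actual code.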

Allocation-free reads

Every read allocates a byte[] on the Java heap, even if all that read is going to do is deserialize that byte[] into something else.

It would be useful to be able to read the data directly without the intermediate byte[].

Perhaps with a signature similar to:

<A> A get(byte[] key, Function<DataInput, A> reader);

Access to a (native) ByteBuffer would also be useful, but this will become invalid if the file it points into is garbage collected. A different data structure could 'find' the memory again if it moved due to compaction and otherwise continue to use the old value. That might look like

ValueHandle getValueHandle(byte[] key);

interface ValueHandle {
  boolean updated(); // if the value was updated after the handle was created
  <A> A read(Function<DataInput, A> reader); // read whatever the current value is
}

The purpose of these would be to improve performance by decreasing allocations, and to allow for lazy-deserialization of larger data types. For example one might want to read only part of a value initially, and lazily load the remainder.
