yahoo / HaloDB
A fast, log-structured key-value store.
Home Page: https://yahoodevelopers.tumblr.com/post/178250134648/introducing-halodb-a-fast-embedded-key-value
License: Apache License 2.0
In the HaloDBInternal class, the boolean put(byte[] key, byte[] value) method takes a global write lock, which may hurt performance when multiple threads write concurrently.
boolean put(byte[] key, byte[] value) throws IOException, HaloDBException {
    if (key.length > Byte.MAX_VALUE) {
        throw new HaloDBException("key length cannot exceed " + Byte.MAX_VALUE);
    }

    //TODO: more fine-grained locking is possible.
    writeLock.lock();
    try {
        Record record = new Record(key, value);
        record.setSequenceNumber(getNextSequenceNumber());
        record.setVersion(Versions.CURRENT_DATA_FILE_VERSION);
        RecordMetaDataForCache entry = writeRecordToFile(record);
        markPreviousVersionAsStale(key);

        //TODO: implement getAndSet and use the return value for
        //TODO: markPreviousVersionAsStale method.
        return inMemoryIndex.put(key, entry);
    } finally {
        writeLock.unlock();
    }
}
I got the following log message: INFO com.oath.halodb.Uns - OHC using JNA OS native malloc/free Out of memory!!!
Why does that message appear? How can I change the Java memory configuration?
It appears that the file formats have a lot of redundancy. For example, every Record, Tombstone, and index entry has an individual CRC32, a version byte, plus a 4-byte record size.
Let's take IndexFileEntry as an example:
/**
* checksum - 4 bytes.
* version - 1 byte.
* Key size - 1 byte.
* record size - 4 bytes.
* record offset - 4 bytes.
* sequence number - 8 bytes
*/
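For reference, those field widths add up to a 22-byte header per index entry. A minimal sketch (not HaloDB's actual code) that packs such a header with java.nio.ByteBuffer:

```java
import java.nio.ByteBuffer;

public class IndexHeaderSketch {
    // Field widths from the comment above: 4 + 1 + 1 + 4 + 4 + 8 = 22 bytes.
    static final int HEADER_SIZE = 4 + 1 + 1 + 4 + 4 + 8;

    static ByteBuffer packHeader(int checksum, byte version, byte keySize,
                                 int recordSize, int recordOffset, long sequenceNumber) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE);
        buf.putInt(checksum);       // checksum      - 4 bytes
        buf.put(version);           // version       - 1 byte
        buf.put(keySize);           // key size      - 1 byte
        buf.putInt(recordSize);     // record size   - 4 bytes
        buf.putInt(recordOffset);   // record offset - 4 bytes
        buf.putLong(sequenceNumber);// sequence no.  - 8 bytes
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer header = packHeader(0, (byte) 1, (byte) 16, 128, 4096, 42L);
        System.out.println("entry size: " + header.remaining() + " bytes");
    }
}
```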
A few things come to mind. One is to checksum a block of index entries at a time instead of each entry individually. Validating the file would then only need to go a block at a time until it fails. As long as a block had at least 3 entries, it would save space. I suspect something like flushing a block every ~32 entries or 2 KB (whichever comes first) would work well: roughly 9% as many bytes used for checksums, but chunks small enough that they shouldn't significantly increase the chance that data fails to reach disk before a crash.

HaloDBInternal.put has a hard check of the key length: https://github.com/yahoo/HaloDB/blob/master/src/main/java/com/oath/halodb/HaloDBInternal.java#L188
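The block-checksum idea can be illustrated with the JDK's CRC32 (a sketch of the trade-off, not HaloDB code): one 4-byte checksum per block of entries instead of one per entry.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class BlockChecksumSketch {
    public static void main(String[] args) {
        byte[][] entries = new byte[32][];
        for (int i = 0; i < entries.length; i++) {
            entries[i] = ("entry-" + i).getBytes(StandardCharsets.UTF_8);
        }

        // Per-entry checksums: 4 bytes of CRC32 overhead for every entry.
        int perEntryOverhead = entries.length * 4;

        // Per-block checksum: one CRC32 covering all 32 entries at once.
        CRC32 crc = new CRC32();
        for (byte[] e : entries) {
            crc.update(e);
        }
        int perBlockOverhead = 4;

        System.out.println("per-entry overhead: " + perEntryOverhead + " bytes");
        System.out.println("per-block overhead: " + perBlockOverhead + " bytes");
    }
}
```

The downside, as noted above, is that a corrupt block invalidates all entries in it rather than a single entry, which is why block sizes should stay small.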
Is there a hard limit of 127 bytes on keys by design? Or is it intended that longer keys use the fixedKeySize option? Currently that option seems to be ignored altogether.
I would like to support key-value versioning. Something like a revision id could be introduced (an incremental number per key, from 0 upward), and I am also interested in tracking key validity (validFrom/createdOn and validTo/invalidatedOn).
Hi, thanks for this awesome project. I would like to implement a volume server using HaloDB. As a file store, data durability is a requirement. My question is: is it possible to enable some sort of write option to indicate that a write must be persisted to disk immediately? Would setting options.setFlushDataSizeBytes(0) guarantee data durability?
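I can't speak for HaloDB's exact flush semantics, but note that flushing to the OS page cache and fsyncing to the device are different guarantees. A generic stdlib sketch of a durable write with FileChannel.force (not HaloDB code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DurableWriteSketch {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("halodb-durability", ".dat");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap("record".getBytes(StandardCharsets.UTF_8)));
            // force(true) asks the OS to flush both data and metadata to the
            // storage device before returning -- the fsync-level guarantee.
            ch.force(true);
        }
        System.out.println("durably wrote " + Files.size(file) + " bytes");
        Files.delete(file);
    }
}
```

Whether setFlushDataSizeBytes(0) triggers this stronger fsync behavior or only a buffered flush is exactly the question worth confirming against the source.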
Range scans are not supported by HaloDB. If I want to support range scans, what should I do?
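Since HaloDB only supports point lookups, one common workaround is to maintain a sorted secondary index of keys yourself (for example a ConcurrentSkipListMap) and resolve the values with point gets. A sketch, with the store stubbed out as a plain HashMap:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class RangeScanSketch {
    public static void main(String[] args) {
        // Stand-in for HaloDB: point lookups only.
        Map<String, String> store = new HashMap<>();
        // Sorted sidecar index over the same keys.
        ConcurrentSkipListMap<String, Boolean> sorted = new ConcurrentSkipListMap<>();

        for (String k : new String[] {"user:ann", "user:bob", "user:eve", "item:42"}) {
            store.put(k, "value-of-" + k);
            sorted.put(k, Boolean.TRUE);
        }

        // Range scan: all keys in [user:a, user:z), resolved via point gets.
        List<String> hits = new ArrayList<>();
        for (String k : sorted.subMap("user:a", "user:z").keySet()) {
            hits.add(k + " -> " + store.get(k));
        }
        System.out.println(hits);
    }
}
```

The cost is duplicating the keys in memory and keeping the two structures in sync on every write and delete.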
It's useful to assess what code is and is not covered when writing thorough unit tests. Adding a Maven profile to track and generate code coverage would be useful.
I need regular database exports, which is an iteration over all records.
In my tests, when I create an iterator, records created after the iterator creation are not returned, I have no problem with that.
When I update a record during iteration (insert different value with the same key), sometimes it is not returned by the iterator. I guess this is because it is a different record, the old one is marked as deleted, and the new one behaves as the first case - records after iterator creation are not returned.
Also, it sometimes happens that when I update a record during iteration, the iterator returns both versions of the record with the same key. This I cannot reproduce in my tests, but happens regularly on a production DB with millions of records.
So if I wanted to achieve more consistent results from iteration (no missing updated records or duplicates), am I supposed to stop writing to DB while iterating? Or maybe also pause compaction?
Any comment on PR #12?
Here's the background:
Based on the other comments in this file, it is known that this won't work on Windows and results in fatal exceptions: https://grokbase.com/t/lucene/dev/1519kz2s50/recent-java-9-commit-e5b66323ae45-breaks-fsync-on-directory
It seems appropriate to allow for the situation where a file channel cannot be created on a directory, because in that situation (Windows) fsync is not applicable either.
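A sketch of the pattern being discussed: attempt to open a channel on the directory and fsync it, but treat failure as "not applicable on this platform" rather than fatal. This is roughly what the linked permazen fix does; it is not HaloDB's code.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirectoryFsyncSketch {
    /** Try to fsync a directory; return false where that is unsupported (e.g. Windows). */
    static boolean tryFsyncDirectory(Path dir) {
        try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
            ch.force(true);
            return true;
        } catch (IOException e) {
            // On Windows a channel cannot be opened on a directory; fsync is
            // not applicable there, so swallow the error instead of crashing.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("halodb-dirsync");
        System.out.println("directory fsync supported: " + tryFsyncDirectory(dir));
        Files.delete(dir);
    }
}
```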
Imagine that both my keys and values might range from simple data types to complex object structures (coming from JSON or YAML files).
A value could also be, for example, an Avro schema, a JSON schema, JSON, a MessagePack message, whatever. Is this supported?
Also now imagine key is json:
{
group: ingestion
type: schema
key: datasource1Pipeline
}
{
group: ingestion
type: schema
key: datasource2Pipeline
}
and the value is the real schema content for both. Now imagine I am going to search for all keys from group ingestion by providing:
{
group: ingestion
}
I expect all keys with that group to be returned.
Is HaloDB designed so that such query capabilities can be implemented easily? (I haven't dug into the source code yet...)
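HaloDB itself offers only point lookups (no attribute or range queries), so matching on a key field like group would mean maintaining your own secondary index alongside the store. A sketch with plain maps standing in for the store (hypothetical helper names, not a HaloDB API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class GroupIndexSketch {
    // Stand-in for HaloDB: full composite key -> value.
    static final Map<String, String> store = new HashMap<>();
    // Secondary index: group -> set of full keys in that group.
    static final Map<String, Set<String>> byGroup = new HashMap<>();

    static void put(String group, String type, String key, String value) {
        String fullKey = group + "/" + type + "/" + key;
        store.put(fullKey, value);
        byGroup.computeIfAbsent(group, g -> new TreeSet<>()).add(fullKey);
    }

    public static void main(String[] args) {
        put("ingestion", "schema", "datasource1Pipeline", "schema-content-1");
        put("ingestion", "schema", "datasource2Pipeline", "schema-content-2");
        put("billing", "schema", "invoices", "schema-content-3");

        // "Query" for group: ingestion resolves through the secondary index.
        System.out.println(byGroup.get("ingestion"));
    }
}
```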
Every time I repair the database after an unclean shutdown, I get the following error:
Caused by: java.nio.file.FileSystemException: C:\database\1549869944.index.repair -> C:\database\1549869944.index: The process cannot access the file because it is being used by another process.
It seems that when the method openDataFilesForReading() is called, my '.index' file is opened for reading and not closed afterwards. No other process uses this file.
So when the method repairFile(DBDirectory dbDirectory) is called, we get an exception at the line: Files.move(repairFile.indexFile.getPath(), indexFile.getPath(), REPLACE_EXISTING, ATOMIC_MOVE);
Any suggestions on how to avoid this error?
LZ4 is included as a dependency, but it doesn't look to be used. Is there a reason for this?
Would you be open to a PR that optionally enables LZ4 compression of keys and/or values?
Hi all! I’m interested in the properties that HaloDB provides, but I’m curious to learn about the status of HaloDB. The repo here looks as though the most recent update is from a few years ago.
Is HaloDB no longer used in production by Yahoo? If not, it would be great to learn what technology was used to replace it, and why 😊 Is there a new/different fork that is the continuation of the project? Keen to learn more. Thanks!
I may be wrong, but I believe that SequenceId is unnecessary. It eats up 8 bytes per entry in memory, and >=16 bytes per entry on disk (one for each entry in tombstone/index/data files).
My reasoning:
My biggest concern with such a change is that it is a major overhaul of the file format and in-memory layout, and would not easily share code with the older implementation. The work to make the code support the old and new formats simultaneously could easily be most of the effort.
Hi @amannaly
Could we have a timestamp in the header?
Cheers
So imagine I use HaloDB to build an important piece of infrastructure; that means I am interested in running at least 2 instances at a time.
Schema 1: I can load-balance reads, but have to push write requests to both instances.
Schema 2: I can create one instance as a replica over the network and mark reader/writer nodes...
Hmm, both approaches are painful and require a lot of work. What about using https://atomix.io/?
Any ideas are welcome.
Hi @amannaly,
As a true storage DB, would HaloDB support a sequence-based store? That is, implement byte[] putValue(byte[] value) that returns the sequence number as the key. This would also make things even simpler, and possibly give faster lookups, since the index could be implemented using a simple off-heap ByteBuffer.allocateDirect array instead of Snazy's OHC. Operations on such a store are strictly PUT, GET, DELETE. As a bonus, iteration with skip/limit is also easy and fast.
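The idea of indexing by sequence number with a flat off-heap array can be sketched with ByteBuffer.allocateDirect (note: ByteBuffer, not Byte, is the class with allocateDirect). Hypothetical and single-threaded apart from the counter; not HaloDB code:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

public class SequenceStoreSketch {
    static final int SLOT = 16;  // per-sequence slot: 8-byte offset + 8-byte length
    static final ByteBuffer index = ByteBuffer.allocateDirect(1024 * SLOT);
    static final ByteBuffer log = ByteBuffer.allocateDirect(1 << 16);  // stand-in for the data file
    static final AtomicLong nextSeq = new AtomicLong();

    static long put(byte[] value) {
        long seq = nextSeq.getAndIncrement();
        long offset = log.position();
        log.put(value);
        // The index is a flat array addressed directly by sequence number.
        index.putLong((int) (seq * SLOT), offset);
        index.putLong((int) (seq * SLOT + 8), value.length);
        return seq;  // the sequence number acts as the key
    }

    static byte[] get(long seq) {
        long offset = index.getLong((int) (seq * SLOT));
        int len = (int) index.getLong((int) (seq * SLOT + 8));
        byte[] out = new byte[len];
        ByteBuffer view = log.duplicate();
        view.position((int) offset);
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        long a = put("alpha".getBytes(StandardCharsets.UTF_8));
        long b = put("beta".getBytes(StandardCharsets.UTF_8));
        System.out.println(a + "=" + new String(get(a), StandardCharsets.UTF_8)
                + " " + b + "=" + new String(get(b), StandardCharsets.UTF_8));
    }
}
```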
Cheers
When running the README example, I get the following issue on a Windows system:
Exception in thread "main" com.oath.halodb.HaloDBException: Failed to open db halodb_directory
at com.oath.halodb.HaloDB.open(HaloDB.java:29)
at com.oath.halodb.HaloDB.open(HaloDB.java:35)
....
Caused by: java.nio.file.AccessDeniedException: D:\halodb_directory
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(WindowsFileSystemProvider.java:115)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:335)
at com.oath.halodb.DBDirectory.openReadOnlyChannel(DBDirectory.java:76)
at com.oath.halodb.DBDirectory.open(DBDirectory.java:35)
at com.oath.halodb.HaloDBInternal.open(HaloDBInternal.java:79)
at com.oath.halodb.HaloDB.open(HaloDB.java:26)
... 12 more
This is a known issue:
http://mail.openjdk.java.net/pipermail/nio-dev/2013-February/002123.html
https://issues.apache.org/jira/browse/HDFS-13586
permazen/permazen#7
cryptomator/fuse-nio-adapter#5
I think the fix made in this commit could be ported to HaloDB:
permazen/permazen@287c94c
Documentation currently says Java 8 only. You should set up CI for Java 11 so that supporting 11 (and onwards) is zero effort in the future.
Hi, The description says that HaloDB can be embedded into an application. I was wondering if there is a way to set it up as a standalone DB and connect to it from a remote application. Is that possible?
Performing a delete operation on an item in a db object that is already in the closed state should throw an exception. Currently this does not happen for the first attempted delete, since the tombstone file has not yet been created.
A possible solution could be to always initialize a tombstone file whenever a new db file is created.
Apart from iterating through every record in the database, is there any way to truncate all data?
I have added HaloDB to my encyclopedia of databases:
Do you have a logo that I can include? Thanks!
-- Andy
I really like a lot of things about HaloDB and would have liked to contribute some time to port it to more contemporary JDK releases. After looking a bit at the code, though, I am not sure it can be done, at least until Java adds support for manipulating memory regions larger than 2 GB etc. (i.e., the features of Unsafe that are no longer available, at least in any way short of using JNI). If anybody has ideas, I may be interested in putting some effort into an implementation...
This field is never used.
One of my use cases involves building a large K/V dataset in one location, then closing the data store, transferring and replicating the contents to other locations, and initializing them elsewhere.
For this use case, I would like to have a method to do manual compaction.
Perhaps a method void compact(float threshold) that compacts all data files with more than threshold ratio of overwritten bytes would be best. For example, calling compact(0.01) would compact all files that have over 1% of their space deleted.
This could be used for my use case as well as several others. It also would be a useful tool for compaction performance measurement.
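The proposed selection logic could be sketched as below, over per-file stale-byte counts (hypothetical names; HaloDB tracks stale data per file internally, but this is not its API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ManualCompactionSketch {
    /** Return the files whose stale ratio exceeds the threshold. */
    static List<String> filesToCompact(Map<String, long[]> files, double threshold) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, long[]> e : files.entrySet()) {
            long staleBytes = e.getValue()[0];
            long totalBytes = e.getValue()[1];
            if ((double) staleBytes / totalBytes > threshold) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, long[]> files = new LinkedHashMap<>();
        files.put("0.data", new long[] {0, 1_000_000});       // 0% stale
        files.put("1.data", new long[] {50_000, 1_000_000});  // 5% stale
        files.put("2.data", new long[] {5_000, 1_000_000});   // 0.5% stale

        // compact(0.01): everything over 1% stale gets rewritten.
        System.out.println(filesToCompact(files, 0.01));
    }
}
```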
Dear @amannaly,
I have another use case: streaming reads from an offset (think Kafka, DistributedLog, Pulsar). I tried forwarding with iterator.next(); as you can guess, it's slow. Would it be difficult to implement HaloDBIterator.seek(offset) so that the iterator starts from a given offset? BTW, your forEachRemaining is cool, mate. I understand HaloDB gives no ordering guarantee on delete or update, but for this use case it will be strictly append-only.
Cheers
HaloDB encodes key length as a signed byte on disk, and in memory. It rejects keys longer than 127 bytes.
Although most of my string keys in different databases are between 10 and 80 bytes, I have some rare outliers as large as 780 bytes. Every other K/V store I work with (I have an API abstraction that wraps about 10) can support large keys; only one other breaks at 512 bytes.
There are some options for longer keys:
Easy: read the byte as an unsigned value, and then sizes from 0 to 255 are supported. However, this will make it even harder to support larger keys.
Harder: Allow for larger sizes, perhaps up to 2KB.
There is also fixedKeySize, which would be 127 or less. It could be extended to support key overflow when fixedKeySize is set to 8 or larger. In that case, when the key length is larger than fixedKeySize, the slot holds a pointer to extended key data, plus whatever prefix of the key fits in the remaining slot (fixedKeySize - 8). An alternative, when fixedKeySize is large enough, is to keep a portion of the key hash in this area as well, so that the pointer to the extended key data does not need to be followed for most lookups. Even just one byte of the hash that was not used for selecting the hash bucket would decrease the chance that the pointer is followed on a miss by a factor of 256.

I see that 0.5.5 and 0.5.6 have recently been released, but they are not published to https://yahoo.bintray.com/maven. Where can I get those artifacts? Also, is there any way this will ever be published to Maven Central?
Hi @amannaly
I came across Yahoo's Oak. It looks like if OHC were replaced with it, we could even have range scans over metadata, which would be great, without sacrificing off-heap storage. Any plans to do this?
Cheers
Is it easy to bind the storage layer to Postgres, for example?
I was attracted by the design of HaloDB recently, and I carefully read its implementation. I found that in the serialize method of InMemoryIndexMetaDataSerializer, the ByteBuffer is flipped twice. I don't understand why it flips twice. Isn't there a problem with flipping twice?
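For anyone puzzling over the same question: flipping a ByteBuffer twice with nothing in between truncates it, because flip() sets limit to the current position and position to 0, so a second flip() sets limit to 0. A sketch:

```java
import java.nio.ByteBuffer;

public class DoubleFlipSketch {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putInt(42).putInt(7);  // position = 8

        buf.flip();                // position = 0, limit = 8: ready to read
        System.out.println("after 1st flip, remaining = " + buf.remaining());

        buf.flip();                // position was 0, so limit becomes 0
        System.out.println("after 2nd flip, remaining = " + buf.remaining());
    }
}
```

So whether the double flip in the serializer is a bug depends on what happens between the two calls; if the buffer is fully read or copied in between, the second flip just resets it to an empty readable state.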
Every read allocates a byte[] on the java heap, even if all that read is going to do is deserialize that byte[] into something else.
It would be useful to be able to read the data directly without the intermediate byte[].
Perhaps with a signature similar to:
<A> A get(byte[] key, Function<DataInput, A> reader);
Access to a (native) ByteBuffer would also be useful, but this will become invalid if the file it points into is garbage collected. A different data structure could 'find' the memory again if it moved due to compaction and otherwise continue to use the old value. That might look like
ValueHandle getValueHandle(byte[] key);
interface ValueHandle {
    boolean updated(); // whether the value was updated after the handle was created
    <A> A read(Function<DataInput, A> reader); // read whatever the current value is
}
The purpose of these would be to improve performance by decreasing allocations, and to allow for lazy-deserialization of larger data types. For example one might want to read only part of a value initially, and lazily load the remainder.
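The proposed reader-callback shape can be prototyped on the JDK alone: hand the caller a DataInput over the raw bytes and let them deserialize only what they need. A sketch of the idea (hypothetical API, not HaloDB's):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Function;

public class ReaderCallbackSketch {
    /** Stand-in for db.get(key, reader): deserialize directly, no byte[] handed to the caller. */
    static <A> A get(byte[] storedValue, Function<DataInput, A> reader) {
        return reader.apply(new DataInputStream(new ByteArrayInputStream(storedValue)));
    }

    public static void main(String[] args) throws IOException {
        // Simulate a stored value: an int header followed by a UTF string payload.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(7);
        out.writeUTF("payload");
        byte[] stored = bos.toByteArray();

        // Lazily read only the header; the string is never deserialized.
        int header = get(stored, in -> {
            try {
                return in.readInt();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        System.out.println("header = " + header);
    }
}
```

In a real store the DataInput would wrap the mapped file region rather than a copied array, which is where the allocation savings would come from.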
YTHO???
HaloDB db = HaloDB.open(root.toFile, options);
db.contains(key);