spotify / sparkey
Simple constant key/value storage library, for read-heavy systems with infrequent large bulk inserts.
License: Apache License 2.0
Under Ubuntu 12.04 with gcc 4.6.3, running make produced the following error:
bench.c:253:3: error: format ‘%ld’ expects argument of type ‘long int’, but argument 2 has type ‘size_t’ [-Werror=format]
I fixed it by changing
printf(" file size: %ld\n", total_file_size(c->files()));
to
printf(" file size: %ld\n", (long int) total_file_size(c->files()));
I'm looking to do parallel bulk writes.

Since log-files can only be written sequentially, a solution might be to write one log-file per concurrent process and then merge them. Merging the log-files into one can be done quite efficiently, but it might make more sense to allow converting multiple log-files into a hash-file directly in sparkey_hash_write.

That would save the extra merge work, which can only be done with limited parallelism, and it would for instance allow spreading the log-files across multiple disks to improve write speeds (and read speeds when creating the hash table).

If more than one log-file could be passed to sparkey_hash_write, might that also allow doing part of the process in parallel? At least reading the log-files and (re-)hashing the keys.

I'd like to hear the maintainer's opinion on this.

EDIT: I see that the hash-table doesn't contain the key/value data itself but addresses into the log-file. That means that multiple log-files (on multiple disks) might also improve general read performance.
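For what it's worth, a user-land merge seems already possible with the existing iterator and writer APIs. A minimal sketch (assuming the function signatures shown elsewhere in this thread behave as documented; error handling and PUT/DELETE-entry discrimination are elided for brevity):

```c
#include <sparkey/sparkey.h>
#include <stdlib.h>

/* Sketch: append every live entry of src_log to an already-open writer.
 * Calling this once per per-process log-file yields one merged log,
 * after which sparkey_hash_write can build a single hash over it. */
static void append_log(sparkey_logwriter *writer, const char *src_log) {
  sparkey_logreader *reader;
  sparkey_logiter *iter;
  sparkey_logreader_open(&reader, src_log);
  sparkey_logiter_create(&iter, reader);
  while (sparkey_logiter_next(iter, reader) == SPARKEY_SUCCESS &&
         sparkey_logiter_state(iter) == SPARKEY_ITER_ACTIVE) {
    uint64_t klen = sparkey_logiter_keylen(iter);
    uint64_t vlen = sparkey_logiter_valuelen(iter);
    uint8_t *key = malloc(klen), *value = malloc(vlen);
    uint64_t actual;
    sparkey_logiter_fill_key(iter, reader, klen, key, &actual);
    sparkey_logiter_fill_value(iter, reader, vlen, value, &actual);
    sparkey_logwriter_put(writer, klen, key, vlen, value);
    free(key);
    free(value);
  }
  sparkey_logiter_close(&iter);
  sparkey_logreader_close(&reader);
}
```

This still serializes the merge through one writer, which is exactly the limited-parallelism cost that native multi-log support in sparkey_hash_write would avoid.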
Changing the .travis.yml to use newer versions of Ubuntu installs newer versions of gcc, which implement additional compilation checks. Various files fail various checks. See https://travis-ci.org/github/mvanwyk/sparkey/builds/725252565
I have looked at a few source files of your current software and noticed that some checks of return codes are missing.
Would you like to add more error handling for return values from functions like the following?
... since you guys are rather a Python shop?
In our case we had hash/log files that were not corrupt but had different file_identifier fields due to problems in a writing pipeline. The result is that sparkey_hash_open() will (correctly) return -305 (SPARKEY_FILE_IDENTIFIER_MISMATCH), but the log resources will not be cleaned up, because:

- in sparkey_hash_open, reader->open_status is not set when there's a file identifier mismatch (ditto for any of the other checks sparkey_hash_open does)
- sparkey_hash_close does nothing if reader->open_status is not set

The fallout is that munmap and close are never called on the log file, leading to leaked file descriptors and the OS refusing to reclaim those bytes from the filesystem until the process is killed.
I intend to fix this problem, but figured I'd document it first!
After creating an index file with SparkeyWriter and closing it, am I able to read that same index file back and append new data to it?

I have a scenario where the source data comes from many large files. I read the large files one by one and write to several different index files; each new large file should append data to an existing index file.

For example, file1:
absssss,1111
acsssss,1222

When indexing the original data, I plan to route on the first two characters of the key, so that all keys with prefix ab are grouped into an ab.spi file.

file2:
abssss,23444
cdssss,34444

Finally there are three index files: ab.spi, ac.spi and cd.spi. When querying abssss I only query ab.spi; when querying cdssss, I only query cd.spi. This is something similar to query routing, to reduce each file's query load. Our data set has nearly 100 billion entries; that's why I want to split it across different index files.

One way is to keep the SparkeyWriters in memory in a map like
ab->SparkeyWriter1
ac->SparkeyWriter2
cd->SparkeyWriter3
and read all the original files just once: read a line, take the first two characters as the routing key, get the corresponding SparkeyWriter from the map, and write the entry to that index.

But having to hold every SparkeyWriter open until all the work is done seems too slow, and memory may be insufficient. So I want to know whether sparkey supports opening a store for writing multiple times.
For comparison, the paldb project by LinkedIn says in its FAQ:
Can you open a store for writing subsequent times?
No, the final binary file is created when StoreWriter.close() is called.
So does sparkey support this feature? Or maybe I don't need to split the index file at all: is simply putting all 100 billion entries in one store fast enough for queries with sparkey?
Currently the hash writing requires a malloc of the entire index size. If this is larger than available memory, the operation fails.
We have a use case requiring more than 1.65 billion entries, which results in a 32 GB+ hash index. This is larger than the available memory on the machine that builds it.
We could replace the malloc with mmap-ing the file itself, but that would lead to a lot of random reads and writes from disk, which would be horribly slow.
I have an idea to reduce this limitation, but no time to do it at the moment.
For sorting the large set of entries, split the set into reasonably small chunks, sort those in memory, and then run a tree of sequential file merges.
I know this is a bit edge-casey, but when clobbering an existing hash and providing an invalid hash size, sparkey_hash_write incorrectly returns SPARKEY_SUCCESS.
Test case:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sparkey/sparkey.h>
#include <assert.h>

#define check(rc) ({ \
  if (SPARKEY_SUCCESS != rc) { \
    fprintf(stderr, "error: %s (L#%d)\n", sparkey_errstring(rc), __LINE__); \
    exit(1); \
  } \
});

int
main(void) {
  sparkey_logwriter *writer = NULL;
  check(sparkey_logwriter_create(&writer, "test.spl", SPARKEY_COMPRESSION_NONE, 0));
  check(sparkey_logwriter_put(writer, 5, (uint8_t *) "key1", 7, (uint8_t *) "value1"));
  check(sparkey_logwriter_put(writer, 5, (uint8_t *) "key2", 7, (uint8_t *) "value2"));
  check(sparkey_logwriter_put(writer, 5, (uint8_t *) "key3", 7, (uint8_t *) "value3"));
  // flush and close the log before writing hashes
  check(sparkey_logwriter_close(&writer));
  // write a hash with size 4
  check(sparkey_hash_write("test.spi", "test.spl", 4));
  // write a hash with invalid size
  check(sparkey_hash_write("test.spi", "test.spl", 3456789));
  assert(0 && "wrote test.spi with an invalid size :(");
  return 0;
}
I just installed sparkey (commit ec80630) on my local machine (Ubuntu 12.04) as described in the README file:
$ git clone git@github.com:spotify/sparkey.git
$ cd sparkey
$ autoreconf --install
$ ./configure
$ make
$ make check
$ sudo make install
But when I try to execute sparkey, I get the following error:
$ sparkey
sparkey: error while loading shared libraries: libsparkey.so.0:
cannot open shared object file: No such file or directory
Can someone please tell me where sparkey is searching for the file? It is available in /usr/local/lib:
$ ls -la /usr/local/lib/ | grep sparkey
-rw-r--r-- 1 root root 79842 Sep 3 16:02 libsparkey.a
-rwxr-xr-x 1 root root 974 Sep 3 16:02 libsparkey.la
lrwxrwxrwx 1 root root 19 Sep 3 16:02 libsparkey.so -> libsparkey.so.0.0.0
lrwxrwxrwx 1 root root 19 Sep 3 16:02 libsparkey.so.0 -> libsparkey.so.0.0.0
-rwxr-xr-x 1 root root 50900 Sep 3 16:02 libsparkey.so.0.0.0
I also tried copying/symlinking the files to /usr/lib, but this didn't work.
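For what it's worth, this symptom is usually the dynamic linker's cache not knowing about /usr/local/lib yet, rather than anything sparkey-specific. A sketch of the usual fixes, assuming a standard glibc loader setup on Ubuntu:

```shell
# Rebuild the runtime linker cache so it picks up /usr/local/lib
sudo ldconfig

# Alternatively, point the loader at it for the current shell only
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
sparkey
```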
Assume I have two keys in my store: a and c. I can get exact keys, but would it be possible to seek to the next match?
Example:
sparkey_hash_seek(myreader, (uint8_t*)"a", 1, myiter); // exact match, same as sparkey_hash_get
sparkey_hash_seek(myreader, (uint8_t*)"b", 1, myiter); // cannot find "b", so advance to next live key: "c"
I am not sure if/how this could work with the slot traversal algorithm, but this is quite a useful feature when e.g. looking up ranges, etc.
Thanks
I've been polishing up my NuDB key/value database library, which we're using in rippled (https://github.com/ripple/rippled), so I'm looking at other key/value systems for comparison. We were very unhappy with the performance of RocksDB for our use case (which admittedly is rather unique). Interestingly, sparkey comes the closest in terms of implementation to NuDB. They both use an append-only data file and a key file which implements an on-disk hash table. One difference is that NuDB uses linear hashing so that buckets stay at roughly 50% occupancy. When a bucket gets full, it is "spilled" over into the data file (that bucket then becomes a linked list of buckets).
I'd be very interested in your feedback on NuDB's design in comparison to sparkey. I'd like to hear about problems you encountered with other key value stores and the data access pattern of your software. I'm curious to know how NuDB compares for your use case, or if there are any techniques in sparkey that I could adopt. The repo is here:
https://github.com/vinniefalco/NuDB
Thanks!
I downloaded the source code and read it today, but I still have some difficulty understanding how the hash algorithm works. Could you write a more detailed description of the algorithm in this project to help us learn faster? A more detailed design document would be even better. Thanks a lot.
Hello,
Thanks for the very interesting library. However, "clock_gettime" is not available on OS X, so configuration fails on that system; I think it is missing on Windows too. The "clock_gettime" API could be simulated using gettimeofday or mach_absolute_time. Maybe building the benchmark app could be made optional?
I'd suggest moving this project to CMake, as this would simplify the "configure" step and make integration into other projects easier.
Hey @spkrka,
I know that you've not worked on this repo for a while and are working on a full Java version now, but I was wondering if you could tag a release at whatever latest point you feel is reasonable. I'd like to use this as a dependency for a project, but I'm hesitant (out of habit) to simply refer to HEAD or a specific commit hash.
Thanks!
It'd be nice to be able to skip over n entries when iterating a hash. Any plans to add an API for doing so?
I may have misinterpreted the documentation, but it doesn't look like sparkey_logiter_reset actually resets the iterator.
Test case:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sparkey/sparkey.h>
#include <assert.h>

#define SPARKEY_ASSERT(rc) ({ \
  if (SPARKEY_SUCCESS != rc) { \
    fprintf( \
        stderr \
      , "error: %s (%d)\n" \
      , sparkey_errstring(rc) \
      , __LINE__ \
    ); \
    exit(1); \
  } \
});

int
main(void) {
  sparkey_logwriter *writer = NULL;
  sparkey_logreader *reader = NULL;
  sparkey_logiter *iterator = NULL;
  const char *key1 = "key1";
  const char *value1 = "value1";
  size_t key1size = strlen(key1);
  size_t value1size = strlen(value1);
  const char *key2 = "key2";
  const char *value2 = "value2";
  size_t key2size = strlen(key2);
  size_t value2size = strlen(value2);
  uint64_t wanted;
  uint64_t actual;
  uint8_t *buffer = NULL;

  // create a log
  SPARKEY_ASSERT(sparkey_logwriter_create(
      &writer
    , "test.spl"
    , SPARKEY_COMPRESSION_NONE
    , 0
  ));

  // write some stuff
  SPARKEY_ASSERT(sparkey_logwriter_put(
      writer
    , key1size
    , (uint8_t *) key1
    , value1size
    , (uint8_t *) value1
  ));
  SPARKEY_ASSERT(sparkey_logwriter_put(
      writer
    , key2size
    , (uint8_t *) key2
    , value2size
    , (uint8_t *) value2
  ));
  SPARKEY_ASSERT(sparkey_logwriter_close(&writer));

  SPARKEY_ASSERT(sparkey_logreader_open(&reader, "test.spl"));
  SPARKEY_ASSERT(sparkey_logiter_create(&iterator, reader));

  // get first key
  SPARKEY_ASSERT(sparkey_logiter_next(iterator, reader));
  wanted = sparkey_logiter_keylen(iterator);
  assert((buffer = malloc(wanted + 1)));  // +1 for NUL terminator
  SPARKEY_ASSERT(sparkey_logiter_fill_key(
      iterator
    , reader
    , wanted
    , buffer
    , &actual
  ));
  buffer[actual] = '\0';
  printf("buffer: %s\n", buffer);
  assert(0 == strcmp("key1", (char *) buffer));
  free(buffer);

  // reset iterator
  SPARKEY_ASSERT(sparkey_logiter_reset(iterator, reader));

  // get key again
  SPARKEY_ASSERT(sparkey_logiter_next(iterator, reader));
  wanted = sparkey_logiter_keylen(iterator);
  assert((buffer = malloc(wanted + 1)));  // +1 for NUL terminator
  SPARKEY_ASSERT(sparkey_logiter_fill_key(
      iterator
    , reader
    , wanted
    , buffer
    , &actual
  ));
  buffer[actual] = '\0';
  printf("buffer: %s (after reset)\n", buffer);
  assert(0 == strcmp("key1", (char *) buffer));
  free(buffer);
  buffer = NULL;

  // cleanup
  sparkey_logiter_close(&iterator);
  sparkey_logreader_close(&reader);
  return 0;
}
Yields:
$ gcc -lsparkey reset.c -o reset -Wall -Wextra
$ ./reset
buffer: key1
buffer: key2 (after reset)
Assertion failed: (0 == strcmp("key1", (char *) buffer)), function main, file reset.c, line 98.
Abort trap: 6
Thoughts on adding a REPL to sparkey(1)? I'd be happy to put it together if you're interested.
It's unclear if you are referring to the C++ library or the C library.