basho / leveldb

Clone of http://code.google.com/p/leveldb/

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.51% Shell 1.20% C++ 90.38% C 7.91%

leveldb's Introduction

leveldb: A key-value store
Authors: Sanjay Ghemawat ([email protected]) and Jeff Dean ([email protected])

The original Google README is now README.GOOGLE.

** Introduction

This repository contains the Google source code as modified to benefit
the Riak environment.  The typical Riak environment has two attributes
that necessitate leveldb adjustments, both in options and code:

- production servers: Riak often runs in heavy Internet environments:
  servers with many CPU cores, lots of memory, and 24x7 disk activity.
  Basho's leveldb takes advantage of the environment by adding
  hardware CRC calculation, increasing Bloom filter accuracy, and
  defaulting to integrity checking enabled.

- multiple databases open: Riak opens 8 to 128 databases
  simultaneously.  Google's leveldb supports this, but its background
  compaction thread can fall behind.  leveldb will "stall" new user
  writes whenever the compaction thread gets too far behind.  Basho's
  leveldb modifications include multiple thread blocks that each
  contain prioritized threads for specific compaction activities.

Details for Basho's customizations exist in the leveldb wiki:

  http://github.com/basho/leveldb/wiki


** Branch pattern

This repository follows the Basho standard for branch management 
as of November 28, 2013.  The standard is found here:

https://github.com/basho/riak/wiki/Basho-repository-management

In summary, the "develop" branch contains the most recently reviewed
engineering work.  The "master" branch contains the most recently
released work, i.e. distributed as part of a Riak release.


** Basic options needed

Those wishing to truly savor the benefits of Basho's modifications
need to initialize a new leveldb::Options structure similar to the
following before each call to leveldb::DB::Open:

    leveldb::Options * options;

    options=new leveldb::Options;

    options->filter_policy=leveldb::NewBloomFilterPolicy2(16);
    options->write_buffer_size=62914560;   // 60Mbytes
    options->total_leveldb_mem=2684354560; // 2.5Gbytes (details below)
    options->env=leveldb::Env::Default();
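
A minimal sketch of the subsequent Open call, assuming the options object
above and a placeholder database path (error handling abbreviated):

    leveldb::DB * db(NULL);
    leveldb::Status status;

    // "/var/db/example" is an illustrative path, not a Riak default
    status=leveldb::DB::Open(*options, "/var/db/example", &db);

    if (!status.ok())
    {
        // inspect status.ToString() and abort / retry as appropriate
    }   // if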


** Memory plan

Basho's leveldb dramatically departed from Google's original internal
memory allotment plan with Riak 2.0.  Basho's leveldb uses a methodology
called flexcache.  The technical details are here:

   https://github.com/basho/leveldb/wiki/mv-flexcache

The key points are:

- options.total_leveldb_mem is an allocation for the entire process,
  not a single database

- giving different values to options.total_leveldb_mem on subsequent Open
  calls causes memory to be redistributed to the most recent value across
  all open databases

- recommended minimum for Basho's leveldb is 340Mbytes per database.  

- performance improves rapidly from 340Mbytes to 2.5Gbytes per database (3.0Gbytes
  if using Riak's active anti-entropy).  Even more is nice, but not as helpful.

- never assign more than 75% of available RAM to total_leveldb_mem; there is
  too much unaccounted memory overhead (worse if you use the tcmalloc
  library).  A sizing sketch follows this list.

- options.max_open_files and options.block_cache should not be used.
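
As a rough illustration of the sizing guidance above, here is a minimal
sketch; the helper function and the hard 75% cap are assumptions drawn from
the bullet points, not an official leveldb API:

    #include <stdint.h>

    // Hypothetical sizing helper: total_leveldb_mem covers the whole process,
    // so scale the per-database target by the number of open databases,
    // then cap at 75% of available RAM.
    uint64_t ChooseTotalLeveldbMem(uint64_t ram_bytes, unsigned database_count)
    {
        const uint64_t kPerDatabaseTarget = 2684354560ULL;   // 2.5 Gbytes per database
        uint64_t want = kPerDatabaseTarget * (uint64_t)database_count;
        uint64_t cap  = ram_bytes / 4 * 3;                   // 75% of RAM

        return (want < cap ? want : cap);
    }   // ChooseTotalLeveldbMem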
  

leveldb's People

Contributors

andrewjstone, buddhisthead, carter-thaxton, cmeiklejohn, dizzyd, fadushin, fgallaire, ghemawat, jaredmorrow, jeetkundoug, jj1bdx, jonmeredith, jtuple, martinsumner, matthewvon, paulplace, rodiazet, rzezeski, sdebnath, vagabond, vinoski, zmodem


leveldb's Issues

Adjust tiered locks to survive "throw"

Jesse points out that the tiered locks in db/db_impl.cc would never release if a vnode did a "throw". This would leave the lock active and block all other vnodes. Fix as part of try/catch management changes.

Clang build failure on FreeBSD 10.1 [JIRA: RIAK-1600]

Hitting the following build error when building with Clang on FreeBSD (now the default compiler, as of 10.0)

prompt$ CXX=c++ gmake

...

./include/leveldb/env.h:163:11: error: unknown type name 'pthread_t'; did you mean 'pthread'?
  virtual pthread_t StartThread(void (*function)(void* arg), void* arg) = 0;
          ^~~~~~~~~
          pthread
/usr/include/stdio.h:143:9: note: 'pthread' declared here
        struct pthread *_fl_owner;      /* current owner */
               ^
In file included from db/log_reader.cc:8:
./include/leveldb/env.h:374:3: error: unknown type name 'pthread_t'; did you mean 'pthread'?
  pthread_t StartThread(void (*f)(void*), void* a) {
  ^~~~~~~~~

The leveldb env.h file needs to include pthread.h in order to fix this issue on this platform.
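
A minimal sketch of the suggested fix; the exact placement within env.h is an assumption:

    // include/leveldb/env.h
    #include <pthread.h>   // provides pthread_t for StartThread()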

bad read performance when reading from EBS volumes

Hi,
I noticed that in basho/leveldb the mmap code in NewRandomAccessFile function is commented out.

On Amazon EC2 instances with EBS volumes attached, sequential reads on sstables are very slow when implemented using preads.

I've created a small demo program that just reads an input sstable sequentially.

It reads a 600MB sstable in 1m30s with preads (basho/leveldb) when the file is not in the file cache, 9s when the file is in the machine's file cache, and 12s (!) when NewRandomAccessFile is implemented via mmap as in the original leveldb, again reading a cold file. The results are pretty consistent.

The I/O throughput is only 5-10MB/s with preads and around 80-100MB/s with mmap. This is due to the fact that preads use non-optimal read sizes when reading the sstable blocks.

I suggest returning the mmap code back to NewRandomAccessFile.

(I've run my tests on m1.medium instance with usual IOPS provisioning).
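
For reference, a minimal sketch of an mmap-backed random-access reader in the spirit of the upstream implementation; names and error handling are simplified, so treat this as an illustration rather than the exact Google code:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Map the whole file once; Read() then returns a pointer into the
    // mapping instead of issuing a pread() per block.
    class MmapReadableFile
    {
    public:
        MmapReadableFile(const char * fname)
            : base_(NULL), length_(0)
        {
            int fd = open(fname, O_RDONLY);
            struct stat sbuf;
            if (fd >= 0 && 0 == fstat(fd, &sbuf))
            {
                length_ = sbuf.st_size;
                base_ = mmap(NULL, length_, PROT_READ, MAP_SHARED, fd, 0);
            }   // if
            if (fd >= 0)
                close(fd);   // the mapping stays valid after close
        }

        ~MmapReadableFile()
        {
            if (NULL != base_ && MAP_FAILED != base_)
                munmap(base_, length_);
        }

        // Returns a pointer into the mapping, or NULL if out of range.
        const char * Read(size_t offset, size_t n) const
        {
            if (NULL == base_ || MAP_FAILED == base_ || offset + n > length_)
                return NULL;
            return reinterpret_cast<const char *>(base_) + offset;
        }

    private:
        void * base_;
        size_t length_;
    };  // MmapReadableFile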

Bus error on leveldb open after recovery

A full disk followed by repair leads to SIGBUS on Open():

  • One Riak instance with eleveldb enabled. Data directory is in devrel node
    on the same disk as Riak and its logs.

  • basho_bench configured with:

%%
%% Prepopulate database to ~1Gb using 1k objects - requires 750k
%%

%% Make sure this is max/infinity so the populate completes
{mode, max}.
{duration, infinity}.
%% Make this a multiple of final key size - partitioned_sequential_int keygen is fussy
{concurrent, 10}.

%% Set up bucket to use - configure with n_val=1
%% using riak_core_bucket:set_bucket(<<"b1">>,[{n_val,1}]).
%% from riak console
{riakc_pb_bucket, <<"b1">>}.
{key_generator, {int_to_bin, {partitioned_sequential_int, 0, 75000000}}}.
{value_generator, {fixed_bin, 1000}}.
{operations, [{put, 1}]}.

%% Riak connection info
{riakc_pb_ips, [ "127.0.0.1" ]}.
{riakc_pb_replies, default}.

%% Setup cruft
{driver, basho_bench_driver_riakc_pb}.
{code_paths, ["deps/riakc",
              "deps/protobuffs"]}.

  • This is allowed to exhaust disk space during the benchmark. The benchmark halts,
    as riak stops. No message is logged (presumably because disk space is exhausted).
  • leveldb::RepairDB() called on leveldb/0
  • leveldb::DB::Open() called with option create_if_missing=true on leveldb/0
  • See: "Bus error" from SIGBUS and program abort with stack trace:
    #0  0x00007ffff712c5b8 in __memcpy_ssse3 () from /lib64/libc.so.6
    #1  0x000000000042c165 in leveldb::(anonymous namespace)::PosixMmapFile::Append(leveldb::Slice const&) ()
    #2  0x0000000000423f78 in leveldb::TableBuilder::WriteRawBlock(leveldb::Slice const&, leveldb::CompressionType, leveldb::BlockHandle*) ()
    #3  0x0000000000424174 in leveldb::TableBuilder::WriteBlock(leveldb::BlockBuilder*, leveldb::BlockHandle*) ()
    #4  0x0000000000424314 in leveldb::TableBuilder::Flush() ()
    #5  0x00000000004244bb in leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&) ()
    #6  0x000000000042d667 in leveldb::BuildTable(std::string const&, leveldb::Env*, leveldb::Options const&, leveldb::TableCache*, leveldb::Iterator*, leveldb::FileMetaData*) ()
    #7  0x000000000040919b in leveldb::DBImpl::WriteLevel0Table(leveldb::MemTable*, leveldb::VersionEdit*, leveldb::Version*) ()
    #8  0x0000000000409d4d in leveldb::DBImpl::RecoverLogFile(unsigned long, leveldb::VersionEdit*, unsigned long*) ()
    #9  0x000000000040ccf5 in leveldb::DBImpl::Recover(leveldb::VersionEdit*) ()
    #10 0x000000000040d076 in leveldb::DB::Open(leveldb::Options const&, std::string const&, leveldb::DB**) ()
    #11 0x0000000000404260 in main ()

File handle leak - possibly BLOCKS.bad

When I reopen the same database many times, I eventually run out of file handles - lost/BLOCKS.bad looks like a good culprit.

dev1(master) jmeredith$ lsof -p 2698 | head -40
COMMAND   PID      USER   FD   TYPE     DEVICE  SIZE/OFF     NODE NAME
beam.smp 2698 jmeredith  cwd    DIR       14,2       612 10354589 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1
beam.smp 2698 jmeredith  txt    REG       14,2   4107908 10354693 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/erts-5.9.1/bin/beam.smp
beam.smp 2698 jmeredith  txt    REG       14,2    429692 10331716 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/deps/eleveldb/priv/eleveldb.so
beam.smp 2698 jmeredith  txt    REG       14,2    599280   391573 /usr/lib/dyld
beam.smp 2698 jmeredith  txt    REG       14,2 296235008  2817701 /private/var/db/dyld/dyld_shared_cache_x86_64
beam.smp 2698 jmeredith    0u   CHR       16,3 0t1772371    27971 /dev/ttys003
beam.smp 2698 jmeredith    1u   CHR       16,3 0t1772371    27971 /dev/ttys003
beam.smp 2698 jmeredith    2u   CHR       16,3 0t1772371    27971 /dev/ttys003
beam.smp 2698 jmeredith    3   PIPE 0x12fbb648     16384          ->0x12fba3ec
beam.smp 2698 jmeredith    4   PIPE 0x12fba3ec     16384          ->0x12fbb648
beam.smp 2698 jmeredith    5   PIPE 0x0ddeb3e8     16384          ->0x10492bb8
beam.smp 2698 jmeredith    6   PIPE 0x10492bb8     16384          ->0x0ddeb3e8
beam.smp 2698 jmeredith    7   PIPE 0x12fbb4b8     16384          ->0x0ddeb5dc
beam.smp 2698 jmeredith    8   PIPE 0x0ddeb5dc     16384          ->0x12fbb4b8
beam.smp 2698 jmeredith    9   PIPE 0x0ddecc20     16384          ->0x104932c0
beam.smp 2698 jmeredith   10   PIPE 0x104932c0     16384          ->0x0ddecc20
beam.smp 2698 jmeredith   11u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   12u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   13u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   14u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   15u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   16u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   17u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   18u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   19u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   20u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   21u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   22u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   23u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   24u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   25u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   26u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   27u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   28u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   29u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   30u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   31u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   32u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad
beam.smp 2698 jmeredith   33u   REG       14,2         0 10467350 /Users/jmeredith/basho/work/1.2/nomanifest/riak_ee/dev/dev1/noman.db/lost/BLOCKS.bad

Unit Test failure on FreeBSD [JIRA: RIAK-2295]

***** Running cache2_test
==== Test CacheTest.HitAndMiss
==== Test CacheTest.Erase
==== Test CacheTest.EntriesArePinned
==== Test CacheTest.EvictionPolicy
util/cache2_test.cc:170: failed: -1 == 201

rewrite PosixEnv::Schedule() in env_posix.cc

Jesse pointed out that the current routine is bloated from a lot of copy/paste hacking. It needs a rewrite, likely with a supporting function for the queue item remove/add sequences.

Table cache sized by count, not file data size

Table cache is limited by max_file_count. This count does not consider the size of the bloom filter or index record. Previous test code that sized the cache by allocation space needs to be brought to production.

tiered storage configuration issues [JIRA: RIAK-1855]

  1. The config schema now prefixes leveldb_data_root and makes the documentation of tiered storage initialization wrong. Need to add a fourth line to the setup example: leveldb.leveldb_data_root = "./leveldb"
  2. Verify the user rights needed for leveldb_data_root and add them to the documentation.
  3. Verify / code the creation of the leveldb_data_root path (in the startup script?).

Could leveldb avoid some compactions? [JIRA: RIAK-1810]

While loading up some data (64 workers loading to a Q=4 ring, starting off with Key:64 from 1..64+64), I noticed that after loading for 12 hours, 676 / 24921 compactions (~3%) were a single file from the lower level plus zero files from the higher level. Should these be moves?

2015/05/17-02:01:17.614037 7f85b1ffb700 Compacted 1@2 + 0@3 files => 209776678 bytes
2015/05/17-02:03:21.585423 7f85b1ffb700 Compacted 1@4 + 0@5 files => 419553560 bytes
2015/05/17-02:09:42.569225 7f85b27fc700 Compacted 1@2 + 0@3 files => 209776563 bytes
2015/05/17-02:11:44.670296 7f85b27fc700 Compacted 1@2 + 0@3 files => 209777140 bytes
2015/05/17-02:13:37.785211 7f85b1ffb700 Compacted 1@4 + 0@5 files => 419553547 bytes
2015/05/17-02:13:41.615433 7f85b17fa700 Compacted 1@2 + 0@3 files => 209777115 bytes
2015/05/17-02:15:50.199730 7f85b1ffb700 Compacted 1@2 + 0@3 files => 209776844 bytes
2015/05/17-02:17:18.329543 7f85b17fa700 Compacted 1@4 + 0@5 files => 419553584 bytes
2015/05/17-02:18:56.660011 7f85b27fc700 Compacted 1@2 + 0@3 files => 209776751 bytes
2015/05/17-02:25:59.590528 7f85b17fa700 Compacted 1@2 + 0@3 files => 209777167 bytes
2015/05/17-02:27:45.373177 7f85b2ffd700 Compacted 1@4 + 0@5 files => 419553417 bytes
2015/05/17-02:28:21.881114 7f85b27fc700 Compacted 1@2 + 0@3 files => 209776898 bytes
2015/05/17-02:36:01.577150 7f85b1ffb700 Compacted 1@2 + 0@3 files => 209776876 bytes
2015/05/17-02:37:23.790984 7f85b2ffd700 Compacted 1@4 + 0@5 files => 419553682 bytes

FreeBSD: db_test hangs [JIRA: RIAK-2298]

db_test hangs here:

==== Test DBTest.IteratorPinsRef
==== Test DBTest.Snapshot
==== Test DBTest.DeletionMarkers1

CPU at 100% while stuck waiting in DeletionMarkers1 test.

IsTrivialMove interacts poorly with Aggressive Delete [JIRA: RIAK-2232]

When the aggressive delete feature identifies a .sst file for grooming, IsTrivialMove can actually cause the .sst file to migrate to the highest level instead of running a compaction to clear out the data. We likely need to create a flag in the Compaction object that is tested in DBImpl::BackgroundCompaction() when deciding whether or not to use a move operation. The flag needs to be set in VersionSet::Finalize() and then passed to the Compaction object within VersionSet::PickCompaction().

Compaction infinite loop

About 50% of the time, benchmark runs get hung up in an infinite compaction loop, triggered by the compactor deleting a file and forgetting that it was deleted. The test binary, command script used to trigger it, and the LOG files from the DB are included in http://highlandsun.com/hyc/bashobug.tgz

Is there a binding.gyp?

I want to use this in node.js. Does a binding.gyp already exist, or do I need to write one myself?

If there is a binding, can you point me to it? Thanks in advance!

Multiple key revisions in Level 0 files

leveldb removes old / replaced revisions of a key during level to level compactions. It does NOT remove these when copying memory to a new Level 0 file. This causes the Level 0 files to have more bloat.

replace gettimeofday() with sliding counter in LRUCache::Lookup

I hammered gettimeofday() into LRUCache::Lookup in place of the previous mechanism, which constantly reordered the list and therefore required a write lock. gettimeofday() is a relatively expensive OS call. A better design would use an incrementing counter; when the counter crosses a threshold, all instances of the counter would be right shifted 32 bits.
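
A minimal sketch of the counter idea, assuming a GCC atomic builtin; this is an illustration of the proposal, not existing code:

    #include <stdint.h>

    // Cheap monotonically increasing "age" stamp to replace gettimeofday().
    static volatile uint64_t lru_clock = 0;

    static inline uint64_t NextLruTick()
    {
        return __sync_add_and_fetch(&lru_clock, 1);
    }   // NextLruTick

    // Once the clock nears the top of its range, every stored stamp (and the
    // clock itself) would be right shifted 32 bits under the cache lock,
    // preserving relative order while preventing overflow.
    static const uint64_t kLruClockRescaleThreshold = 1ULL << 62;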

db/version_set.cc:60: error: integer constant is too large for ‘long’ type

Environment

  • Mac OS X 10.7.5
  • Using Riak source from http://s3.amazonaws.com/downloads.basho.com/riak/1.4/1.4.0/riak-1.4.0.tar.gz - fetched on 14 JUL 2013
  • Erlang R15B01 (erts-5.9.1) [source] [smp:4:4] [async-threads:0] [hipe] [kernel-poll:false]
  • $ cc -v
    Using built-in specs.
    Target: i686-apple-darwin11
    Configured with: /private/var/tmp/llvmgcc42/llvmgcc42-2336.1~22/src/configure --disable-checking --enable-werror --prefix=/Developer/usr/llvm-gcc-4.2 --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-prefix=llvm- --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin11 --enable-llvm=/private/var/tmp/llvmgcc42/llvmgcc42-2336.1~22/dst-llvmCore/Developer/usr/local --program-prefix=i686-apple-darwin11- --host=x86_64-apple-darwin11 --target=i686-apple-darwin11 --with-gxx-include-dir=/usr/include/c++/4.2.1
    Thread model: posix
    gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)

Expected behavior

All deps build clean.

Observed behavior

The following error emitted when building eleveldb/leveldb dep:

db/version_set.cc:59: warning: this decimal constant is unsigned only in ISO C90
db/version_set.cc:59: warning: this decimal constant is unsigned only in ISO C90
db/version_set.cc:60: error: integer constant is too large for ‘long’ type
db/version_set.cc:60: error: integer constant is too large for ‘long’ type
db/version_set.cc:61: error: integer constant is too large for ‘long’ type
db/version_set.cc:61: error: integer constant is too large for ‘long’ type
db/version_set.cc:62: error: integer constant is too large for ‘long’ type
db/version_set.cc:62: error: integer constant is too large for ‘long’ type
table/filter_block.cc: In member function ‘bool     leveldb::FilterBlockReader::KeyMayMatch(uint64_t, const leveldb::Slice&)’:
table/filter_block.cc:112: warning: comparison between signed and unsigned integer expressions
util/env_posix.cc: In constructor ‘leveldb::<unnamed>::PosixEnv::PosixEnv()’:
util/env_posix.cc:936: warning: unused variable ‘ts’
make[1]: *** [libleveldb.dylib.1.9] Error 1
ERROR: Command [compile] failed!

Fix

Add ULL suffix to the appropriate constants at https://github.com/basho/leveldb/blob/master/db/version_set.cc#L55
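
A hedged illustration of the kind of change; the constant name and value below are placeholders, not the actual lines from version_set.cc:

    // Before: the literal overflows 'long' under llvm-gcc 4.2
    static const uint64_t kLevelMaxBytes = 4294967296;

    // After: mark the literal as unsigned long long
    static const uint64_t kLevelMaxBytes = 4294967296ULL;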

Result

Builds, though there are still some warnings.

perf_dump detaches current memory segment, creates new [JIRA: RIAK-2104]

perf_dump is supposed to perform only read operations against the shared memory segment. However, it calls leveldb::Env::Default(), which also accesses the shared memory segment and will resize it if necessary. This is bad behavior. The Env object is only used for its NowTime(). The best bet is likely to copy that routine to perf_dump.cc, or to create a static equivalent that perf_dump can call (and that is also called by env_posix's NowTime() routine).

Hot on/off switch for perf counters [JIRA: RIAK-1898]

The performance counters are atomic increments. These do have some performance impact. It would be really nice if they could be enabled and disabled via the shared memory segment. Then the impact only occurs during developer work or when a user needs to find a problem in production.
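
A minimal sketch of the gating idea; the flag layout and helper below are assumptions, not the existing perf counter API:

    #include <stdint.h>

    // Hypothetical control word living in the shared memory segment.
    struct PerfControl
    {
        volatile uint32_t counters_enabled;   // 0 = off, non-zero = on
    };

    // Pay for the atomic increment only while counters are switched on.
    static inline void MaybeIncrement(PerfControl * ctl, volatile uint64_t * counter)
    {
        if (0 != ctl->counters_enabled)
            __sync_add_and_fetch(counter, 1);
    }   // MaybeIncrement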

Two threading tweaks [JIRA: RIAK-2336]

  1. Should create a synchronous start-up for thread pool threads, like recently done in throttle.cc. The change is not relevant to normal execution; it simply makes unit tests which construct and deconstruct quickly more reliable.
  2. Consider using the first thread in a pool as a replacement for the semaphore thread. The semaphore thread protects against a race condition where no thread is available as a new task arrives, but by the time the new task is on the queue all threads are back to waiting and will never see the task until a second task gets added. Using a conventional mutex and signaling only the first thread would prevent this case AND not heavily burden the system. (A known bad design had one mutex / condition variable for all threads; this was a performance killer.) ... of course the performance impact needs to be tested.

offline "repair" should include rebuilding manifest

Currently repair will:

  • destroy any existing manifest
  • place all sst files on Level 0
  • exit ...

The "exit" implicitly leaves the redistribution of sst files to appropriate levels to runtime. Runtime redistribution tends to block the shared compaction thread, i.e. eventually halting the node, until hours later when the one vnode is properly redistributed.

The redistribution step should also occur offline.

Time-based expiry of keys [JIRA: RIAK-2678]

Adding per our discussion:

Similar to bitcask expiry of data based on a specified timeframe. TBD whether this should be set at the bucket or bucket-type level. Customer request.

replacement for max_open_files

max_open_files was redefined in Riak 1.4 to be a memory limit. It is completely ignored in 2.0. It needs to be revitalized as a global leveldb limit, not a per-open-database limit, in the same manner as the cache size limits.

db_bench

We were doing testing using db_bench with an earlier release (when the tiering options were introduced) and have found that something has changed such that db_bench will either run very slowly or completely stop moving forward on reasonably large readseq runs (100M keys, key = 16 bytes / value = 1024 bytes). Running on a Fusion-io ioCache card which had reasonable performance with the older versions.

Infrequent, unreproducible missing SST

We've seen a few instances of leveldb failing on compaction as it is missing an SST file

Example:

2012/06/12-20:23:41.713048 97 Generated table #422226: 563 keys, 2107848 bytes
88026049394855470996013819413191996189691609088/107719.sst: No such file or directory 

A possible theory is that the file is added to the MANIFEST before actually being created on disk and then has problems during creation.

write_buffer_size overwritten by gMapSize

The idea to make gMapSize a user option ended with it overwriting write_buffer_size in all cases. This is really, really bad and heavily impacts performance. The correction is going into branch mv-sequential-tuning. The fix is in db/db_impl.cc SanitizeOptions().

Looping compaction error should eventually close DB and print nice error

We've had a few customers hit compaction errors for a variety of reasons (e.g. a file in the manifest that is missing from disk). This results in looping compaction errors:

2014/07/29-17:09:36.253849 7f285f3b1700 Compaction error: IO error: /var/lib/riak/leveldb/936274486415109681974235595958868809467081785344/sst_0/370873.sst: No such file or directory

It would be great if leveldb could identify repeated compaction errors and close the DB throwing an error up the chain of abstractions. Currently, the node sits waiting for riak_kv to start with no console.log entries indicating an issue.

bloom filter for each 2k of disk seems inefficient

The Google-provided bloom filter code segregates a file into 2K chunks and creates a bloom filter for each chunk. This 2K is hard coded and completely ignores block_size. The code also creates placeholder objects for each additional 2K region when a given key/value record covers more than one 2K region of disk. Again, this seems inefficient.
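
For context, the 2K granularity comes from hard-coded constants in the upstream filter block code; roughly the following (paraphrased from memory of table/filter_block.cc, so treat as approximate):

    // Generate a new bloom filter for every 2KB of data file offset.
    static const size_t kFilterBaseLg = 11;
    static const size_t kFilterBase = 1 << kFilterBaseLg;   // 2048 bytes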

Recovery / Repair should "lock" database

The recovery/repair code path does not lock the database directory. This appears to have caused a couple of failures at customer sites. Something / someone brought up the vnode while recovery was still processing the directory.

Super low memory has divide by zero [JIRA: RIAK-1843]

Customer had a divide-by-zero error. No detailed stack trace, but file cache size and block cache size are both zero in the LOG. Should be easy to guess ... and we need to look again at killing leveldb in horrible config situations.

Polling file cache expire [JIRA: RIAK-2297]

Food for thought: file cache expiry currently requires file cache activity to initiate. Is it possible to start the expiry on "dormant" file caches via the Throttle's 1 minute loop (as is now done for compactions)?

Use cases

Hi, sorry if this is an off-topic question. Is leveldb (specifically this fork) suitable for storing files in a caching proxy server?

Creation of bad blocks file at wrong place

The bad blocks file creation was forced into code wherever the options structure was not "const". This is likely the wrong location. Please review and maybe (so horrible to even suggest) make the bad block file mutable in the options structure.

Fix w.done code path in DBImpl::Write()

Agreement out of code review was that the code path of

    if (w.done)
        return w.status;

needs to be updated to flow through the throttle at the bottom of the function to maintain proper throttle participation. The change was not made today due to the lateness of the release cycle.

Tiering LevelDB Levels for FS Record Size Optimization

The integration of dynamic block sizes into the tree provides a performance gain in many scenarios; however, it can be detrimental in the case of the ZFS filesystem. I'd been studying the wiki page at: https://github.com/basho/leveldb/wiki/mv-dynamic-block-size -- thank you for this level of documentation, by the way.

ZFS has the ability to define record sizes at the ZVOL level. It defaults to 128K. If you write a sequence of 32x 4K blocks at once, it is going to end up in a 128K record. Looking up any individual 4K block in that case is going to incur a read of the entire 128K block.

If you run ZFS, you are going to want to set a 4K block size for the youngest levels; but setting a 4K block size for a higher level file does not make sense, especially in the case of dynamic block sizes.

The current directory structure of LevelDB in Riak makes it extremely difficult to set the appropriate block sizes. If the directory structure were predictable and segmented by level or block size, it would be trivial to create individual ZVOLs for each level/block size.

Also, as it currently stands, the algorithm used for block_size_steps is not page aligned, making it difficult to avoid a double-read penalty when the block overlaps two records. Additionally, if ZFS record sizes are larger than the LevelDB block size, we incur a read-modify-write penalty any time we write out a LevelDB block individually.

I realize that this proposed direction would be a drastic restructuring of the directory structure, but at the very least, if we could separate out level-0, it would greatly improve performance.

create "repair light" option [JIRA: RIAK-2334]

Part of the repair procedure is to open and read every .sst file. This is done to verify data CRCs and potentially throw out corrupted files. However, there is a common use case of only wanting to rebuild the manifest file (such as after manual deletion of old .sst files). Data is not considered suspect in this use case.

The goal is to create a flag that would allow a user to skip the read of all .sst files and only rebuild the manifest.

Change util/throttle.cc to use the port::Mutex and port::CondVar wrapper classes [JIRA: RIAK-2080]

Currently, util/throttle.cc uses raw pthread_mutex_t and pthread_cond_t types rather than the corresponding wrapper classes in the porting layer. We should refactor the throttling code to use the wrapper classes.

Note that doing this will require updating port::CondVar, adding an overload of the Wait() method that takes a timeout parameter.

It would also be a good idea to add some unit tests that drive the port::Mutex, port::CondVar, and MutexLock classes.
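
A minimal sketch of the timed Wait() overload this would require, assuming the member names used by the upstream POSIX port (the exact signature is an assumption):

    #include <pthread.h>
    #include <stdint.h>
    #include <sys/time.h>

    // Hypothetical addition to port::CondVar: wait up to timeout_usec
    // microseconds; returns true if signaled, false on timeout.
    bool CondVar::Wait(uint64_t timeout_usec)
    {
        struct timeval now;
        gettimeofday(&now, NULL);

        struct timespec deadline;
        deadline.tv_sec  = now.tv_sec + (now.tv_usec + timeout_usec) / 1000000;
        deadline.tv_nsec = ((now.tv_usec + timeout_usec) % 1000000) * 1000;

        int ret = pthread_cond_timedwait(&cv_, &mu_->mu_, &deadline);
        return (0 == ret);   // ETIMEDOUT (or error) reports false
    }   // CondVar::Wait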

Delete all leveldb key-values, while the disk usage doesn't reduce [JIRA: RIAK-1800]

I have 14GB of leveldb data files. I deleted all of it with eleveldb:iterator_move/2 and eleveldb:delete/3. When I then go to fetch the first key in the DB, it takes a very long time, 5 minutes or longer, and the result shows the DB doesn't have any keys anymore. But my disk usage remains 14GB; I think leveldb didn't automatically compact away those keys. How can I tell whether it will start automatic compaction?
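
For what it's worth, at the C++ level deleted keys only disappear from disk once compaction rewrites the affected files, and a compaction of the whole key range can be forced; a minimal sketch (whether eleveldb exposes an equivalent call is not confirmed here):

    #include "leveldb/db.h"

    // Force compaction of the entire key range so tombstoned keys are
    // dropped and the .sst files shrink on disk.
    void CompactEverything(leveldb::DB * db)
    {
        db->CompactRange(NULL, NULL);   // NULL, NULL == whole database
    }   // CompactEverything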

leveldb.data_root ends up in /var/lib/riak/leveldb regardless of configuration value. [JIRA: RIAK-2470]

riak config describe leveldb.data_root

Documentation for leveldb.data_root
Where LevelDB will store its data.

Valid Values:
- the path to a directory
Default Value : $(platform_data_dir)/leveldb
Value not set in /etc/riak/riak.conf
Internal key : eleveldb.data_root

riak config describe platform_data_dir

Documentation for platform_data_dir
Platform-specific installation paths (substituted by rebar)

Valid Values:
- the path to a directory
Default Value : /var/lib/riak
Set Value : /datadisks/disk1/riak-data
Internal key : riak_core.platform_data_dir

1.8G /var/lib/riak/leveldb/

riak-2.1.3-1.el7.centos.x86_64 in Azure environment.

Even setting a fixed path for leveldb.data_root in /etc/riak.conf won't change it. Working around it with a symbolic link.

Add unit tests for write throttling [JIRA: RIAK-2079]

We need to add some unit tests that drive the throttling calculations in isolation. Doing this effectively might require creating a mock version of the leveldb::Env class, perhaps something like the InMemoryEnv class but writing to null instead of in-memory.
