
Sync frequency (dgraph-io/badger) · 6 comments · CLOSED

dgraph-io commented on May 6, 2024

Sync frequency

Comments (6)

manishrjain commented on May 6, 2024

In async mode, if the value size is lower than Options.ValueThreshold, the key-value pair doesn't get written out to the WAL. In this case, it sounds like all of your values were smaller than 20 bytes (if you used the default options).

They then get written out to the LSM tree, starting with a memtable. Only once a memtable fills up does it get synced to disk, and 1M keys isn't sufficient to do that: a 64MB memtable can take ~3M keys before it fills up.
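A quick back-of-the-envelope check on that claim (a minimal sketch; the 20-byte per-entry size is an assumption that ignores per-entry metadata, so the real count is somewhat lower):

```go
package main

import "fmt"

// memtableEntries estimates how many entries fit in a memtable of a
// given size, assuming a fixed per-entry size. Hypothetical model:
// real entries also carry metadata.
func memtableEntries(memtableBytes, entryBytes int) int {
	return memtableBytes / entryBytes
}

func main() {
	// 64 MiB memtable, ~20-byte entries (small key + small value)
	n := memtableEntries(64<<20, 20)
	fmt.Printf("~%.1fM entries before the memtable fills\n", float64(n)/1e6)
}
```

That lands at roughly 3.4M entries, in the same ballpark as the ~3M figure above.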

This should all be really fast. I suspect the reason it took you 18 mins is that you didn't batch your requests. Batching is absolutely critical to getting good performance: there's quite a bit of overhead per request, which you can amortize well by sending 1,000 keys in one request. You can send many requests in parallel to achieve even better performance.

Try these out. It shouldn't take more than 20 seconds or so. Do remember to Close the store.
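The amortization argument can be made concrete with a toy cost model (all numbers below are hypothetical, chosen only to illustrate fixed-per-request versus per-key costs; they are not badger measurements):

```go
package main

import "fmt"

// totalMicros models total write time under a hypothetical cost model:
// every request pays a fixed overhead, and every key a small marginal
// cost. Batching spreads the fixed overhead over many keys.
func totalMicros(keys, batchSize int, overheadUs, perKeyUs float64) float64 {
	requests := (keys + batchSize - 1) / batchSize // ceiling division
	return float64(requests)*overheadUs + float64(keys)*perKeyUs
}

func main() {
	const keys = 1_000_000
	// Assumed: 100µs overhead per request, 1µs marginal cost per key.
	fmt.Printf("unbatched:     %.1fs\n", totalMicros(keys, 1, 100, 1)/1e6)
	fmt.Printf("batched x1000: %.1fs\n", totalMicros(keys, 1000, 100, 1)/1e6)
}
```

With these made-up numbers, batching by 1,000 cuts the total from about 101s to about 1.1s, which is the shape of the speedup reported later in this thread.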

P.S. You can look at the populate code here to see how we populated data to run the Badger vs. RocksDB benchmarks.


wvh commented on May 6, 2024

First of all, thanks for your insights.

My values are about 12-15 bytes. Is there a technical reason for the default ValueThreshold = 20 option? You'd need to document this clearly, as those records will get lost if the server goes down. 18 minutes without a single sync is – I'm sure you agree – pretty long. The same goes for the 64MB memtable. I understand speed is a priority, but it's not entirely clear to me as a casual user that it might take a long time before anything actually gets synced to disk. I'd not expect to lose more than perhaps a few seconds' worth of records with default settings. I'm not sure how often LevelDB syncs to disk.

I've tried a batch approach, as you suggested, writing 10,000 records in batches of 1,000:

start: populateLevelDB
  end: populateLevelDB Δt: 110.948957ms ops: 90131.53679308585
start: populateBadger
  end: populateBadger Δt: 10.896048533s ops: 917.7639003455051
start: populateBadgerBatch
  end: populateBadgerBatch Δt: 18.422385ms ops: 542817.8816152198
start: populateBoltDB
  end: populateBoltDB Δt: 1m46.367649028s ops: 94.01354727100926
start: populateBoltDBOneTrans
  end: populateBoltDBOneTrans Δt: 212.684806ms ops: 47017.93319453201

That's 400,000-500,000 ops/s: a lot faster than non-batched inserts, and also a lot faster than the other KV stores.

The problem in my use case is that the database stores user-generated events, which can't be batched easily as they happen sporadically.


szymonm commented on May 6, 2024

@wvh AFAIK, there is no deep technical reason behind ValueThreshold = 20. However, you should know that if we store a value in the value log, we need to keep a pointer to that value in the LSM tree. The size of the pointer is 10 bytes, so it makes no sense to have ValueThreshold below that. We just assumed the value should be at least 2x bigger for it to make sense to keep it in a separate place at the cost of keeping the pointer in the LSM tree.

When tuning this parameter, you should know that when retrieving a KV pair there are 3 possible execution paths (I'm simplifying a bit here to give you intuition).

  1. If size(value) < ValueThreshold (i.e. the KV pair is stored in the LSM tree) and the KV pair is in memory, we just retrieve it from memory, so you get the best Get time.
  2. If size(value) < ValueThreshold but the KV pair didn't fit into memory, we have to do a random read to fetch it from disk.
  3. If size(value) >= ValueThreshold, we have to retrieve the pointer to it from the LSM tree (from memory or disk) and then the value from the value log on disk, so you pay at least one disk read.

There is an order of magnitude difference in Get time between 1 and 2, while the difference between 2 and 3 is only linear. If you increase ValueThreshold, more KV pairs move from 3 to 2, but at the same time fewer of them fit in memory, so more of them move from 1 to 2.
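The three paths can be sketched as a small decision function (a simplified illustration of the description above, not badger's actual lookup code; `inMemory` stands in for "the relevant part of the LSM tree is cached in RAM"):

```go
package main

import "fmt"

// diskReads models how many disk reads a Get needs on each of the
// three execution paths described above. Simplified model only.
func diskReads(valueSize, valueThreshold int, inMemory bool) int {
	if valueSize < valueThreshold {
		// Value is stored inline in the LSM tree.
		if inMemory {
			return 0 // path 1: served entirely from memory
		}
		return 1 // path 2: one random read from disk
	}
	// Path 3: find the pointer in the LSM tree (possibly from disk),
	// then read the value itself from the value log.
	reads := 1
	if !inMemory {
		reads++
	}
	return reads
}

func main() {
	fmt.Println(diskReads(12, 20, true))  // path 1
	fmt.Println(diskReads(12, 20, false)) // path 2
	fmt.Println(diskReads(64, 20, true))  // path 3
}
```

Raising ValueThreshold in this model moves large values from the third branch into the first two, at the cost of fatter LSM entries and therefore fewer pairs cached in memory.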

Writes will not get lost when the server goes down if the SyncWrites option is true. So you can populate your database with SyncWrites = false and then restart it with SyncWrites = true to avoid losing production data.

Regarding your problem that events are generated by users: you could do some kind of buffering to get good write throughput and low latency at the same time. You could, for example, flush buffered data every 1ms, so that users won't notice the delay, while data still gets written in batches when the load is high.


szymonm commented on May 6, 2024

@wvh If you have no more questions, let me close the issue.


manishrjain commented on May 6, 2024

> The problem in my use case is that the database stores user-generated events, which can't be batched easily as they happen sporadically.

I'd recommend setting SyncWrites=true. It seems like you don't need very high write throughput, but you do care about your writes being persisted. In that case, a slightly higher write latency shouldn't be that big of a deal.

You can't have it both ways: either you choose sync writes, with higher write latency but guaranteed persistence, or async writes, with lower write latency but no immediate persistence.

Even with sync writes, writes to SSDs (if that's what you're using) are pretty fast, so I don't see much downside to setting SyncWrites=true.


wvh commented on May 6, 2024

szymonm: That's a pretty helpful explanation. It would be great if this turned into documentation for people who understand the bigger picture but not the specific implementation details of each individual KV store.

manishrjain: The difference is pretty minor, so yes, I very much prefer the sync option. Writes are pretty variable in my use case, and while not every write is absolutely crucial, it should be possible to kill the server and preferably not lose any events. Please make sure people understand that sync is off by default.

Thanks to both of you for your time!

