In async mode, if the value size is lower than option.ValueThreshold, the key-value pair doesn't get written out to the WAL. In this case, it sounds like all of your values were lower than 20 bytes (if you used the default options).
Instead, they get written to the LSM tree, i.e. to a memtable. Only once a memtable fills up does it get synced to disk, and 1M keys isn't sufficient for that: a 64MB memtable can take 3M keys before it fills up.
This should all be really fast. I suspect the reason it took you 18 mins is that you didn't batch your requests. Batching is absolutely critical for good performance: there's quite a bit of overhead per request, which you can amortize well by sending a thousand keys in one request. You can send many requests in parallel to achieve even better performance.
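The amortization argument can be made concrete with a toy cost model (the numbers are illustrative, not measured Badger overheads): each request pays a fixed overhead plus a per-key cost, so folding 1000 keys into one request pays the overhead once instead of 1000 times.

```go
package main

import "fmt"

// Toy cost model, purely illustrative: each request pays a fixed
// per-request overhead plus a small per-key cost.
const overheadPerRequest = 100 // arbitrary units
const costPerKey = 1

// n single-key requests: the overhead is paid n times.
func costUnbatched(n int) int {
	return n * (overheadPerRequest + costPerKey)
}

// n keys sent in requests of batchSize keys each: the overhead is
// amortized across every batch.
func costBatched(n, batchSize int) int {
	requests := (n + batchSize - 1) / batchSize
	return requests*overheadPerRequest + n*costPerKey
}

func main() {
	fmt.Println(costUnbatched(10000))     // 1010000
	fmt.Println(costBatched(10000, 1000)) // 11000
}
```

With these (made-up) constants, batching 10000 keys in groups of 1000 is almost two orders of magnitude cheaper, which matches the shape of the benchmark numbers later in this thread.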
Try these out. It shouldn't take more than 20 seconds or so. Do remember to Close the store.
P.S. You can look at the populate code here to see how we populated data to run Badger v/s RocksDB benchmarks.
from badger.
First of all, thanks for your insights.
My values are about 12-15 bytes. Is there a technical reason for the default ValueThreshold = 20 option? You'd need to document this clearly, as those records will get lost if the server goes down. 18 minutes without a single sync is – I'm sure you agree – pretty long. Similarly for the 64MB memtable. I understand speed is a priority, but it's not entirely clear to me as a casual user that it might take a long time before anything actually gets synced to disk... I'd not expect to lose more than perhaps a few seconds' worth of records with default settings. Not sure how often LevelDB syncs to disk.
I've tried a batch approach, as you suggested. Writing 10000 records in batches of 1000:
start: populateLevelDB
end: populateLevelDB Δt: 110.948957ms ops: 90131.53679308585
start: populateBadger
end: populateBadger Δt: 10.896048533s ops: 917.7639003455051
start: populateBadgerBatch
end: populateBadgerBatch Δt: 18.422385ms ops: 542817.8816152198
start: populateBoltDB
end: populateBoltDB Δt: 1m46.367649028s ops: 94.01354727100926
start: populateBoltDBOneTrans
end: populateBoltDBOneTrans Δt: 212.684806ms ops: 47017.93319453201
400,000-500,000 ops/s: a lot faster than non-batched inserts, and also a lot faster than the other KV stores.
The problem in my use case is that the database stores user-generated events, which can't be batched easily as they happen sporadically.
@wvh AFAIK, there is no direct technical reason behind ValueThreshold = 20. However, you should know that if we store a value in the value log, we need to keep a pointer to it in the LSM tree. The pointer is 10 bytes, so it makes no sense to have ValueThreshold below that. We just assumed the value should be at least 2x bigger for it to make sense to keep it in a separate place, at the cost of keeping the pointer in the LSM tree.
When tuning this parameter, you should know that when retrieving a KV pair there are 3 possible execution paths (I'm simplifying a bit here to give you intuition).
- If size(value) < ValueThreshold (i.e. the KV pair is stored in the LSM tree) and the pair is in memory, we just retrieve it from memory, so you get the best Get time.
- If size(value) < ValueThreshold but the KV pair didn't fit into memory, we have to do a random read to fetch it from disk.
- If size(value) > ValueThreshold, we have to retrieve the pointer to it from the LSM tree (from memory or disk) and then the value from disk, so there is at least one disk call.
There is an order-of-magnitude difference in Get time between 1 and 2, while the difference between 2 and 3 is linear. If you increase ValueThreshold, more KV pairs move from 3 to 2, but at the same time fewer of them fit in memory, so more of them move from 1 to 2.
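The three paths can be summarized in a small sketch (the function and parameter names are illustrative, not Badger's actual internals; the boundary conditions follow the explanation above):

```go
package main

import "fmt"

// getPath classifies which of the three execution paths a Get takes,
// per the simplified model above. Illustrative only: "lsmPartInMemory"
// stands for whether the relevant part of the LSM tree is cached.
func getPath(valueSize, valueThreshold int, lsmPartInMemory bool) int {
	if valueSize < valueThreshold {
		if lsmPartInMemory {
			return 1 // value inlined in LSM tree and cached: memory-only Get
		}
		return 2 // value inlined but not cached: one random disk read
	}
	return 3 // value in the value log: pointer lookup plus a disk read
}

func main() {
	fmt.Println(getPath(12, 20, true))  // 1
	fmt.Println(getPath(12, 20, false)) // 2
	fmt.Println(getPath(64, 20, true))  // 3
}
```

Raising valueThreshold moves large values from path 3 into paths 1/2, but inlining more values also shrinks how much of the tree fits in memory, pushing some pairs from 1 to 2, which is exactly the trade-off described.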
The documents will not get lost if the server goes down, provided the SyncWrites option is true. So you can populate your database with SyncWrites = false and then restart it with SyncWrites = true to avoid losing production data.
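For reference, toggling the option looks roughly like this. This is a configuration sketch against the v1-era options struct quoted in this thread (DefaultOptions as a value, badger.NewKV); newer Badger versions configure it differently, e.g. badger.DefaultOptions(dir).WithSyncWrites(true), so check the docs for your version:

```go
import badger "github.com/dgraph-io/badger"

// openDurable opens a store that fsyncs each write: it survives a
// crash, at the cost of higher per-write latency.
func openDurable(dir string) (*badger.KV, error) {
	opts := badger.DefaultOptions // v1-era options struct
	opts.Dir = dir
	opts.ValueDir = dir
	opts.SyncWrites = true
	return badger.NewKV(&opts)
}
```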
Regarding your problem that events are generated by users: you could do some kind of buffering to get good write speed and low latency at the same time. For example, flush buffered data every 1ms, so users won't notice the delay, while data still gets written in batches when the load is high.
@wvh If you have no more questions, let me close the issue.
The problem in my use case is that the database stores user-generated events, which can't be batched easily as they happen sporadically.
I'd recommend you set SyncWrites=true. It seems like you don't need very high write throughput, but do care about your writes being persisted. In that case, a slightly higher write latency shouldn't be that big of a deal.
You can't have it both ways. Either you choose sync writes with higher write latency and ensure persistence, or you choose async writes with lower write latency and give up immediate persistence.
Even with sync writes, writes to SSDs, if that's what you're using, are pretty fast; so I don't see much downside for you to set SyncWrites=true.
szymonm: That's a pretty helpful explanation. It would be great if this turned into documentation for people who understand the bigger picture but not the specific implementation details of each individual KV store.
manishrjain: The difference is pretty minor, so yes, I very much prefer the sync option. Writes are pretty variable in my use case and while not every write is absolutely crucial, it should still be possible to kill the server and preferably not lose any events. Please make sure people understand that this sync is off by default.
Thanks to both of you for your time!