Comments (8)

ppolewicz commented on July 4, 2024

We'll be setting up an environment to test performance of large files this week and after that happens we'll circle back to this one to test performance of small files too.

Thank you for the detailed bug report.

from b2_command_line_tool.

ppolewicz commented on July 4, 2024

Hey,

replication is a cool new feature of b2 that sounds like it might be perfect for this case. Your objects are very small and you are processing a ton of them, which might (I guess) result in the server throttling your operations, so threads wait until the limit is lifted. Perhaps you can try that? It should run much faster than sync.
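For reference, recent versions of the CLI can configure replication directly; the following is only a sketch - the command name and positional arguments are assumptions to be verified against `b2 replication-setup --help`:

```shell
# Sketch only: assumes a recent b2 CLI that ships the replication-setup command.
# Verify the exact arguments with `b2 replication-setup --help` before running.
b2 replication-setup sourceBucket destinationBucket
```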

With 30k files in 2h that's about 4 files per second; assuming an average file size of ~75 KB, that works out to roughly 312 KB/s, but you are reporting 70-100 KB/s. I'm not sure what's up with that. Which cluster are you using? If there is a performance issue with the CLI, I'd like to try to replicate it. Is it the same 30k files where you only change a few of them, or is it a different 30k files every time?
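The arithmetic above can be sanity-checked in one line (the ~75 KB average file size is the assumption implied by the 312 KB/s figure):

```shell
# 30,000 files in 2 hours -> files per second, and the throughput that implies
# at an assumed average file size of 75 KB
awk 'BEGIN { fps = 30000 / (2*3600); printf "%.1f files/s, %.1f KB/s\n", fps, fps*75 }'
# prints: 4.2 files/s, 312.5 KB/s
```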

ToshY commented on July 4, 2024

Hey @ppolewicz 👋

replication is a cool new feature of b2 that sounds like it might be perfect for this case

Well, I've tried replication before by setting it up from the UI, but I found it very unintuitive. It gives almost no indication of how long it actually takes before it's done replicating, and after watching the tutorial, it is said that for existing files "it can take from a few minutes to a few hours". As my experience was also that it takes hours to replicate, I changed to using the CLI with the b2 sync command instead, which at least gives me some sense of how long it will take.

Replication also doesn't really fit my use case, because in the example the sourceBucket is actually a production bucket and the destinationBucket is a development bucket. I don't need to fully replicate the entire production bucket to the development bucket, as I don't want/need all of those replicated files in my development bucket.

The b2 sync command gives me more freedom, because if I decide I want to work on a feature, I can run the command above to sync to my development bucket, wait roughly 2 hours, and then actually start developing. It's a pain that I have to wait that long, but at least I know how long it's been running, at what speed it's processing files, which files it's currently syncing, and I can roughly estimate how long it will take before it's complete.

With 30k files in 2h that's about 4 files per second; assuming an average file size of ~75 KB, that works out to roughly 312 KB/s, but you are reporting 70-100 KB/s. I'm not sure what's up with that. Which cluster are you using? If there is a performance issue with the CLI, I'd like to try to replicate it. Is it the same 30k files where you only change a few of them, or is it a different 30k files every time?

Every 1 to 2 months I run the sync command above, and in the last 6 months the production bucket accumulated 6k additional images (24k before), so roughly 1k images are added to the production bucket each month.

Which cluster are you using?

If you refer to the endpoint/region, it is s3.eu-central-003.backblazeb2.com for both source and destination bucket.


I performed a sync earlier today (with the same command I used above), which again took roughly 1.5-2 hours. So now that the buckets' contents are basically identical, it shouldn't take much time to sync again, right? But as I'm currently running it again, it shows similar speeds in the range of +/- 90-100 kB/s in the console, and handles +/- 5-6 files per second.


After diving a bit deeper into the documentation, I eventually found --compareVersions:

[screenshot of the b2 sync documentation describing the --compareVersions option]

So what I did next was try --compareVersions none and --compareVersions size, both now taking only a couple of seconds (!), which is my desired behaviour. (As I had already synced the files earlier today, I think that's why it's fast: it no longer compares the modified time.)

Now looking back, maybe I had unwittingly assumed that the sync would make a complete copy of the file and its properties (like the modified time). But I guess it makes sense that the "modified time" of the new file in the destination bucket is newer than that of the original in the source bucket.
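For what it's worth, my reading of the documentation is: `none` compares file names only, `size` compares name plus size, and `modTime` (the default) also considers the modification time. A toy model of that decision logic (`needs_transfer` is a made-up helper for illustration, not the CLI's actual implementation):

```shell
# Toy model of b2 sync's --compareVersions modes (not the real implementation).
# Decides whether a file that exists under the same name in both buckets
# should be transferred again.
needs_transfer() {
  mode=$1 src_size=$2 dst_size=$3 src_mtime=$4 dst_mtime=$5
  case $mode in
    none)    return 1 ;;                                   # same name -> always skip
    size)    [ "$src_size" != "$dst_size" ] ;;             # sizes differ -> transfer
    modTime) [ "$src_size" != "$dst_size" ] || [ "$src_mtime" -gt "$dst_mtime" ] ;;
  esac
}

# Same size, but the source's modified time is newer than the destination's:
needs_transfer modTime 100 100 2000 1000 && echo "modTime: transfer"
needs_transfer size    100 100 2000 1000 || echo "size: skip"
```

This matches what was observed above: after a fresh sync, `size` (and `none`) find nothing to do, while the default `modTime` keeps copying because the destination's timestamps differ.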

TLDR

b2 sync --threads <10|100|500> --delete --replaceNewer --compareVersions <none|size> b2://sourceBucket b2://destinationBucket

@ppolewicz final question 🏁

I've now completely wiped my development bucket clean and started a new sync. It currently only performs copy operations for all files, and does so at +/- 75 kB/s. Is this low copy speed related to the server-side throttling of operations you mentioned earlier? And if so, is this out of my control, or are there ways to speed things up?

ppolewicz commented on July 4, 2024

In order to determine if the server is throttling you, you'll have to enable logging (passing --verbose is a simple way to do it).
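A minimal sketch of that (bucket names are placeholders, and the grep pattern is only a guess at what throttling-related log lines might contain):

```shell
# Capture the verbose output of the sync while still seeing it on screen
b2 sync --threads 10 --delete --verbose \
    b2://sourceBucket b2://destinationBucket 2>&1 | tee sync.log

# Then count lines that look like retries/backoff (pattern is an assumption)
grep -icE 'retry|backing off|too many requests|503' sync.log
```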

On any storage system based on erasure coding and HDDs, read performance for small files is not going to be great. If you synced a few bigger files, the speed would go way up.

There is a performance bottleneck somewhere: either the server is throttling you, or Python threading is not doing a very good job with all those copy operations. 6 files/s is way below what I'd expect to see, though, so my bet would be on throttling.

I'm not familiar enough with the throttling subsystem Backblaze B2 eu-central is currently running on, but from the client's perspective you should be able to observe the retries and the threads backing off. If you confirm it's not retries and throttling, then I'll take a look at reproducing and analyzing its performance - B2 and the associated tools are supposed to handle 10 TB objects and buckets with a billion objects, so not being able to deal with 30k files in a timely manner could be a bug.

What are you running this on - Windows, Linux? How did you install it, from pip or from the binary?

ToshY commented on July 4, 2024

Hey @ppolewicz

I've tried adding --verbose, but I can't say I see any keywords related to "throttling", "retry" or "back off" in the output.

Here's a gist with a portion of the logging (I only ran it for a couple of seconds and truncated it to 2300 lines, plus redacted some information). Maybe you can spot things that are out of the ordinary.


I've been running it on the following systems:

  • Ubuntu 22.04.2 LTS (WSL2); binary v3.0.8
  • Ubuntu 22.04.3 LTS (production server); binary v3.0.9

ppolewicz commented on July 4, 2024

The log only shows scanning and 18 transfers - the server wouldn't throttle you this early. You'd have to run it for longer and then show a tail of the log (2k lines would be ok).

Since you are running Ubuntu, it would be easy to pip install --user b2 as some user (or in a venv) to check whether it's perhaps an issue caused by the binary builder.
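A sketch of that, using a throwaway virtualenv so the pip install doesn't touch the system Python:

```shell
# Install the b2 CLI from PyPI into an isolated virtualenv
python3 -m venv b2venv
./b2venv/bin/pip install --upgrade b2
./b2venv/bin/b2 version   # confirm which build is now being used
```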

ToshY commented on July 4, 2024

@ppolewicz But if it already sticks to the 75-100 kB/s range from the very first copies, and doesn't improve over time, then surely it is not related to throttling?

I will try your suggestions later today.

ToshY commented on July 4, 2024

@ppolewicz Installed it with pip and ran the same initial command: roughly the same performance of +/- 100 kB/s. So no performance gain there.

Ran it for roughly 30 minutes and then pasted it (2135 lines) in the gist.
