Comments (8)

ppolewicz commented on July 4, 2024

We'll be setting up an environment to test performance of large files this week and after that happens we'll circle back to this one to test performance of small files too.

Thank you for the detailed bug report.

from b2_command_line_tool.

ppolewicz commented on July 4, 2024

Hey,

replication is a cool new feature of b2 that sounds like it might be perfect for this case. Your objects are very small and you are processing a ton of them, which might (I guess) result in the server throttling your operations, so threads wait until the limit is lifted. Perhaps you can try that? It should run much faster than sync.
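For reference, recent versions of the CLI can configure replication directly; the following is only a sketch - the command name and positional arguments are assumptions to be verified against `b2 replication-setup --help`:

```shell
# Sketch only: assumes a recent b2 CLI that ships the replication-setup command.
# Verify the exact arguments with `b2 replication-setup --help` before running.
b2 replication-setup sourceBucket destinationBucket
```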

With 30k files in 2h that's about 4 files per second; assuming an average file size of ~75 KB, that works out to roughly 312 KB/s, but you are reporting 70-100 KB/s. I'm not sure what's up with that. Which cluster are you using? If there is a performance issue with the CLI, I'd like to try to replicate it. Is it the same 30k files where you only change a few of them, or is it a different 30k files every time?
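The arithmetic above can be sanity-checked in one line (the ~75 KB average file size is the assumption implied by the 312 KB/s figure):

```shell
# 30,000 files in 2 hours -> files per second, and the throughput that implies
# at an assumed average file size of 75 KB
awk 'BEGIN { fps = 30000 / (2*3600); printf "%.1f files/s, %.1f KB/s\n", fps, fps*75 }'
# prints: 4.2 files/s, 312.5 KB/s
```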

ToshY commented on July 4, 2024

Hey @ppolewicz 👋

replication is a cool new feature of b2 that sounds like it might be perfect for this case

Well, I've tried replication before by setting it up from the UI, but I found it very unintuitive. It gives almost no indication of how long it actually takes before it's done replicating, and after watching the tutorial, it is said that for existing files "it can take from a few minutes to a few hours". As my experience was also that it takes hours to replicate, I changed to using the CLI with the b2 sync command instead, which at least gives me some sense of how long it will take.

Replication also doesn't really fit my use case, because in the example the sourceBucket is actually a production bucket and the destinationBucket is a development bucket. I don't need to fully replicate the entire production bucket to the development bucket, as I don't want/need all of those replicated files in my development bucket.

The b2 sync command gives me more freedom, because if I decide I want to work on a feature, I can run the command above to sync to my development bucket, wait roughly 2 hours, and then actually start developing. It's a pain that I have to wait that long, but at least I know how long it's been running, at what speed it's processing files, which files it's currently syncing, and I can roughly estimate how long it will take before it's complete.

With 30k files in 2h that's about 4 files per second; assuming an average file size of ~75 KB, that works out to roughly 312 KB/s, but you are reporting 70-100 KB/s. I'm not sure what's up with that. Which cluster are you using? If there is a performance issue with the CLI, I'd like to try to replicate it. Is it the same 30k files where you only change a few of them, or is it a different 30k files every time?

Every 1 to 2 months I run the sync command above, and in the last 6 months the production bucket accumulated 6k additional images (24k before), so roughly 1k images are added to the production bucket each month.

Which cluster are you using?

If you refer to the endpoint/region, it is s3.eu-central-003.backblazeb2.com for both source and destination bucket.


I performed a sync earlier today (with the same command I used above), which again took roughly 1.5-2 hours. So now that the buckets' contents are basically identical, it shouldn't take much time to sync again, right? But as I'm currently running it again, it shows similar speeds in the range of +/- 90-100 kB/s in the console, and handles +/- 5-6 files per second.


After diving a bit deeper into the documentation, I eventually found --compareVersions:

[screenshot of the b2 sync documentation describing the --compareVersions option]

So what I did next was try --compareVersions none and --compareVersions size, both now taking only a couple of seconds (!), which is my desired behaviour. (As I had already synced the files earlier today, I think that's why it's fast: it no longer compares the modified time.)

Now looking back, maybe I had unwittingly assumed that the sync would make a complete copy of the file and its properties (like the modified time). But I guess it makes sense that the "modified time" of the new file in the destination bucket is newer than that of the original in the source bucket.
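For what it's worth, my reading of the documentation is: `none` compares file names only, `size` compares name plus size, and `modTime` (the default) also considers the modification time. A toy model of that decision logic (`needs_transfer` is a made-up helper for illustration, not the CLI's actual implementation):

```shell
# Toy model of b2 sync's --compareVersions modes (not the real implementation).
# Decides whether a file that exists under the same name in both buckets
# should be transferred again.
needs_transfer() {
  mode=$1 src_size=$2 dst_size=$3 src_mtime=$4 dst_mtime=$5
  case $mode in
    none)    return 1 ;;                                   # same name -> always skip
    size)    [ "$src_size" != "$dst_size" ] ;;             # sizes differ -> transfer
    modTime) [ "$src_size" != "$dst_size" ] || [ "$src_mtime" -gt "$dst_mtime" ] ;;
  esac
}

# Same size, but the source's modified time is newer than the destination's:
needs_transfer modTime 100 100 2000 1000 && echo "modTime: transfer"
needs_transfer size    100 100 2000 1000 || echo "size: skip"
```

This matches what was observed above: after a fresh sync, `size` (and `none`) find nothing to do, while the default `modTime` keeps copying because the destination's timestamps differ.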

TLDR

b2 sync --threads <10|100|500> --delete --replaceNewer --compareVersions <none|size> b2://sourceBucket b2://destinationBucket

@ppolewicz final question 🏁

I've now completely wiped my development bucket clean and started a new sync. It currently only performs copy operations for all files, and does so at +/- 75 kB/s. Is this low copy speed related to the server-side throttling of operations you mentioned earlier? And if so, is this out of my control, or are there ways to speed things up?

ppolewicz commented on July 4, 2024

In order to determine if the server is throttling you, you'll have to enable logging (passing --verbose is a simple way to do it).
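A minimal sketch of that (bucket names are placeholders, and the grep pattern is only a guess at what throttling-related log lines might contain):

```shell
# Capture the verbose output of the sync while still seeing it on screen
b2 sync --threads 10 --delete --verbose \
    b2://sourceBucket b2://destinationBucket 2>&1 | tee sync.log

# Then count lines that look like retries/backoff (pattern is an assumption)
grep -icE 'retry|backing off|too many requests|503' sync.log
```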

On any storage system based on erasure coding and HDDs, read performance for small files is not going to be great. If you synced a few bigger files, the speed would go way up.

There is a performance bottleneck somewhere: either the server is throttling you, or Python threading is not doing a very good job with all those copy operations. 6 files/s is way below what I'd expect to see, though, so my bet would be on throttling.

I'm not familiar enough with the throttling subsystem Backblaze B2 eu-central is currently running on, but from the client's perspective you should be able to observe the retries and the threads backing off. If you confirm it's not retries and throttling, then I'll take a look at reproducing and analyzing its performance - B2 and the associated tools are supposed to handle 10 TB objects and buckets with a billion objects, so not being able to deal with 30k files in a timely manner could be a bug.

What are you running this on - Windows, Linux? How did you install it, from pip or from the binary?

ToshY commented on July 4, 2024

Hey @ppolewicz

I've tried adding --verbose, but I can't say I see any keywords related to "throttling", "retry" or "back off" in the output.

Here's a gist with a portion of the logging (I only ran it for a couple of seconds and truncated it to 2300 lines, plus redacted some information). Maybe you can spot things that are out of the ordinary.


I've been running it on the following systems:

  • Ubuntu 22.04.2 LTS (WSL2); binary v3.0.8
  • Ubuntu 22.04.3 LTS (production server); binary v3.0.9

ppolewicz commented on July 4, 2024

The log only shows scanning and 18 transfers - the server wouldn't throttle you this early. You'd have to run it for longer and then show a tail of the log (2k lines would be ok).

Since you are running Ubuntu, it would be easy to pip install --user b2 as some user (or in a venv) to check whether it's perhaps an issue caused by the binary builder.
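A sketch of that, using a throwaway virtualenv so the pip install doesn't touch the system Python:

```shell
# Install the b2 CLI from PyPI into an isolated virtualenv
python3 -m venv b2venv
./b2venv/bin/pip install --upgrade b2
./b2venv/bin/b2 version   # confirm which build is now being used
```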

ToshY commented on July 4, 2024

@ppolewicz But if it already sticks to the 75-100 kB/s range from the very first copies, and doesn't improve over time, then surely it is not related to throttling?

I will try your suggestions later today.

ToshY commented on July 4, 2024

@ppolewicz Installed it with pip and ran the same initial command: roughly the same performance of +/- 100 kB/s. So no performance gain there.

Ran it for roughly 30 minutes and then pasted it (2135 lines) in the gist.
