Comments (8)
We'll be setting up an environment to test the performance of large files this week, and after that happens we'll circle back to this one to test the performance of small files too.
Thank you for the detailed bug report.
from b2_command_line_tool.
Hey,
replication is a cool new feature of b2 that sounds like it might be perfect for this case. Your objects are very small and you are processing a ton of them, which might (I guess) result in the server throttling your operations, so threads wait until the throttling eases. Perhaps you can try that? It should run much faster than sync.
With 30k files in 2h it'd be about 4 files per second; assuming ~75 KB per file, that's like 312 KB/s, but you are reporting 70-100 KB/s. I'm not sure what's up with that. Which cluster are you using? If there is a performance issue with the CLI, I'd like to try to reproduce it. Is it the same 30k files where you only change a few of them, or is it a different 30k files every time?
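The back-of-the-envelope numbers above can be checked directly (the ~75 KB average object size is an assumption read out of this thread, not a measured figure):

```python
# Sanity-check the throughput math from the comment above.
files = 30_000
duration_s = 2 * 60 * 60           # 2 hours
avg_file_kb = 75                   # assumed average object size

files_per_s = files / duration_s           # ~4.17 files/s
implied_kb_s = files_per_s * avg_file_kb   # ~312 KB/s

print(f"{files_per_s:.2f} files/s -> {implied_kb_s:.0f} KB/s")
```

So at the reported 70-100 KB/s the sync is running at roughly a quarter to a third of what the file count alone would suggest.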
Hey @ppolewicz 👋
> replication is a cool new feature of b2 that sounds like it might be perfect for this case
Well, I've tried replication before by setting it up from the UI, but I found it very unintuitive. It gives almost no indication of how long it actually takes before it's done replicating, and the tutorial says that for existing files "it can take from a few minutes to a few hours". As my experience was also that it takes hours to replicate, I switched to the CLI with the b2 sync command instead, which at least gives me some sense of how long it will take.
Replication also doesn't really fit my use case, because in the example the sourceBucket is actually a production bucket and the destinationBucket is a development bucket. So I don't need to fully replicate the entire production bucket to the development bucket, as I don't want/need those replicated files in my development bucket.
b2 sync gives me more freedom: if I decide I want to work on a feature, I can run the command above to sync it to my development bucket, wait roughly 2 hours, and then actually start developing. It's a pain that I have to wait that long, but at least I know how long it's been running, at what speed it's processing files, which files it's currently syncing, and I can roughly estimate how long it will take to complete.
> With 30k files in 2h it'd be about 4 files per second; assuming ~75 KB per file, that's like 312 KB/s, but you are reporting 70-100 KB/s. [...] Is it the same 30k files where you only change a few of them, or is it a different 30k files every time?
Every 1 to 2 months I run the sync command above, and in the last 6 months the production bucket accumulated 6k additional images (24k before), so roughly 1k images are added to the production bucket each month.
> Which cluster are you using?
If you mean the endpoint/region, it is s3.eu-central-003.backblazeb2.com for both the source and destination buckets.
I performed a sync earlier today (with the same command I used above), which again took roughly 1.5-2 hours. So now that the buckets' contents are basically identical, it shouldn't take much time to sync again, right? But as I'm currently running it again, it shows similar speeds in the range of +/- 90-100 kB/s in the console, doing +/- 5-6 files per second.
After diving a bit deeper into the documentation I eventually found --compareVersions:
So what I did next was try --compareVersions none and --compareVersions size, both now taking only a couple of seconds (!), which is my desired behaviour (as I already synced the files earlier today, I think it's fast because it no longer compares the modified time).
Now looking back, maybe I had unwittingly assumed that the sync would make a complete copy of the file and its properties (like modified time). But I guess it makes sense that the "modified time" of the new file in the destination bucket is newer than that of the source.
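My mental model of the comparison modes, as a sketch (the function and field names below are hypothetical illustrations, not the CLI's actual internals):

```python
# Illustrative model of b2 sync's --compareVersions modes
# (hypothetical code, not the real scanner implementation).
def needs_transfer(mode, src, dst):
    """src/dst are dicts with 'size' and 'mtime' (epoch ms); dst None = missing."""
    if dst is None:
        return True                       # file absent at destination: always copy
    if mode == "none":
        return False                      # same name exists: skip
    if mode == "size":
        return src["size"] != dst["size"] # re-copy only if sizes differ
    # default "modTime": re-copy if the source looks newer
    return src["mtime"] > dst["mtime"]

src = {"size": 70_000, "mtime": 1_700_000_000_000}
dst = {"size": 70_000, "mtime": 1_690_000_000_000}  # copied earlier, older mtime
print(needs_transfer("none", src, dst))     # skip
print(needs_transfer("size", src, dst))     # skip
print(needs_transfer("modTime", src, dst))  # would copy again
```

Under this model, a destination whose copies carry older timestamps than the source gets re-transferred on every default-mode sync, which matches what I observed.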
TLDR
b2 sync --threads <10|100|500> --delete --replaceNewer --compareVersions <none|size> b2://sourceBucket b2://destinationBucket
@ppolewicz final question 🏁
I've now completely wiped my development bucket clean and started a new sync. It currently only performs copy operations for all files and does so at +/- 75 kB/s. Is this low copy speed related to the "server throttling your operations" you mentioned earlier? And if so, is this out of my control, or are there ways to speed things up?
In order to determine whether the server is throttling, you'll have to enable logs (passing --verbose is a simple way to do it).
On any storage system based on erasure coding and HDDs, the performance of reading small files is not going to be great. If you synced a few bigger files, the speed would go way up.
There is a performance bottleneck somewhere: either the server is throttling you, or Python threading is not doing a very good job with all those copy operations. 6 files/s is way below what I'd expect to see though, so my bet would be on throttling.
I'm not familiar enough with the throttling subsystem Backblaze B2 eu-central is currently running, but from the client's perspective you should be able to observe the retries and the threads backing off. If you confirm it's not retries and throttling, then I'll take a look at reproducing and analyzing its performance - B2 and the associated tools are supposed to handle 10 TB objects and buckets with a billion objects, so not being able to deal with 30k files in a timely manner could be a bug.
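One way to check for that backoff from the client side is to count suspicious lines in the --verbose log; a minimal sketch (the patterns are assumptions about likely wording, since the exact log messages vary by version - match whatever your log actually contains):

```python
import re

# Hypothetical helper: count lines in a --verbose log that suggest
# the server is pushing back (retries, 503s, sleeping threads).
PATTERNS = re.compile(
    r"retry|503|service_unavailable|too_many_requests|sleep", re.IGNORECASE
)

def count_backoff_lines(log_text):
    return sum(1 for line in log_text.splitlines() if PATTERNS.search(line))

sample = (
    "INFO copy ok\n"
    "WARNING Retrying in 8 seconds (503 service_unavailable)\n"
)
print(count_backoff_lines(sample))  # 1
```

If a long run produces essentially zero such lines, throttling becomes the less likely explanation.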
What are you running this on? Windows, Linux? How did you install it, from pip or a binary?
Hey @ppolewicz
I've tried adding --verbose but I can't say I see any keywords related to "throttling", "retry" or "backing off".
Here's a gist with a portion of the logging (only ran it for a couple of seconds and truncated it to 2300 lines + redacted some information). Maybe you can spot things that are out of the ordinary.
I've been running it on the following systems:
- Ubuntu 22.04.2 LTS (WSL2); binary v3.0.8
- Ubuntu 22.04.3 LTS (production server); binary v3.0.9
The log only shows scanning and 18 transfers - the server wouldn't throttle you that early. You'd have to run it longer and then show a tail of the log (2k lines would be ok).
Since you are running Ubuntu, it would be easy to pip install --user b2 for some user (or in a venv) to check whether the issue is caused by the binary build.
@ppolewicz But if it already sticks to the 75-100 kB/s range from the very first copies, and doesn't improve over time, then surely it is not related to throttling?
I will try your suggestions later today.
@ppolewicz Installed it with pip and ran the same initial command; roughly the same performance, +/- 100 kB/s. So no performance gain there.
Ran it for roughly 30 minutes and then pasted the tail (2135 lines) into the gist.