
Comments (7)

Cyan4973 commented on April 28, 2024

The first run used ~10 out of 12 physical cores (close to what I'd expect), the second one barely more than 3.

Is that expected?

Yes,
as the level increases, the window size tends to increase,
and as a consequence, the size of each job tends to increase too.
At level 21, each job is likely ~256 MB, so there are fewer jobs running in parallel.

It's possible to take direct control of the job size, as explained in the man page. Quoting:

ADVANCED COMPRESSION OPTIONS
       -B#: Specify the size of each compression job. This parameter is only available when multi-threading is enabled. Each compression job is run in parallel,
       so this value indirectly impacts the nb of active threads. Default job size varies depending on compression level (generally 4 * windowSize). -B# makes it
       possible to manually select a custom size. Note that job size must respect a minimum value which is enforced transparently. This minimum is either 512 KB, or
       overlapSize, whichever is largest. Different job sizes will lead to non-identical compressed frames.

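The same knobs are also exposed programmatically through libzstd's advanced API: ZSTD_c_nbWorkers corresponds to -T#, and ZSTD_c_jobSize to -B#. Below is a minimal streaming sketch (illustrative only, not code from the zstd sources); the 12-worker count, 32 MB job size, and 128 KB buffers are arbitrary example values, and it assumes a libzstd built with multithreading support:

/* Compress stdin to stdout with 12 worker threads and 32 MB jobs.
   Sketch only: error handling is minimal. */
#include <stdio.h>
#include <zstd.h>

int main(void)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    if (cctx == NULL) return 1;
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 19);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 12);      /* like -T12 */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_jobSize, 32 << 20);  /* like -B (32 MB jobs) */

    static char inBuf[1 << 17];
    static char outBuf[1 << 17];
    size_t readSz;
    while ((readSz = fread(inBuf, 1, sizeof inBuf, stdin)) != 0) {
        ZSTD_inBuffer in = { inBuf, readSz, 0 };
        while (in.pos < in.size) {   /* push the whole chunk through */
            ZSTD_outBuffer out = { outBuf, sizeof outBuf, 0 };
            size_t const r = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_continue);
            if (ZSTD_isError(r)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(r)); return 1; }
            fwrite(outBuf, 1, out.pos, stdout);
        }
    }
    size_t remaining;   /* flush buffered data and close the frame */
    do {
        ZSTD_outBuffer out = { outBuf, sizeof outBuf, 0 };
        ZSTD_inBuffer in = { NULL, 0, 0 };
        remaining = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_end);
        if (ZSTD_isError(remaining)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(remaining)); return 1; }
        fwrite(outBuf, 1, out.pos, stdout);
    } while (remaining != 0);
    ZSTD_freeCCtx(cctx);
    return 0;
}

Compile with: cc example.c -lzstd (file name arbitrary). Note that if libzstd was built without ZSTD_MULTITHREAD, setting ZSTD_c_nbWorkers to a nonzero value returns an error code rather than silently running single-threaded.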

Cyan4973 commented on April 28, 2024

If multithreading were working, I would expect user time > real time.
That's not the case here, suggesting something is wrong.

For reference, here is what I'm getting on a local Ubuntu desktop:

time ./zstd -4 linux-6.2.9.tar -c > /dev/null
4.52s user 0.21s system 108% cpu 4.348 total

time ./zstd -4 -T0 linux-6.2.9.tar -c > /dev/null
11.91s user 0.42s system 735% cpu 1.675 total
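
A quick way to read these timings (an aside, using the numbers printed above): the user/real ratio approximates how many cores were kept busy.

/* Effective parallelism = user time / real (wall-clock) time. */
#include <stdio.h>

int main(void)
{
    printf("default: %.2f cores busy\n", 4.52 / 4.348);   /* ~1.04: single-threaded */
    printf("-T0:     %.2f cores busy\n", 11.91 / 1.675);  /* ~7.11: multithreading engaged */
    return 0;
}

This is the check being applied in the comment: with -T0, user time far exceeds real time, so multithreading is clearly engaged.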


Dmitri555 commented on April 28, 2024

The test with your file looks good!

time ./zstd1.5.5 -4 linux-6.2.9.tar -c > /dev/null
real 0m4.046s user 0m4.260s sys 0m0.681s

time ./zstd1.5.5 -4 -T0 linux-6.2.9.tar -c > /dev/null
real 0m0.774s user 0m4.362s sys 0m0.817s

But with another file (just a core file from a crashed process), it does NOT:

time ./zstd1.5.5 -4 core-file -c > /dev/null
real 0m6.837s user 0m5.990s sys 0m6.653s

time ./zstd1.5.5 -4 -T0 core-file -c > /dev/null
real 0m6.886s user 0m6.720s sys 0m7.051s

ls -l linux-6.2.9.tar core-file
-rw-rw-r--. 1 test test 14083944448 Dec 2 14:15 core-file
-rw-rw-r--. 1 test test 1371432960 Mar 30 2023 linux-6.2.9.tar


Cyan4973 commented on April 28, 2024

It implies that this outcome is data dependent, therefore not a generality.

Data compression is indeed data dependent: compression ratio, of course, but compression speed too.
That said, multithreading being less effective on some types of data than on others is a new one; I would not have expected it.

I'm afraid that, without access to a reproduction case, it will be difficult to investigate further.


gyscos commented on April 28, 2024

On my machine, -T0 seems to work on level 19, but not as well on level 21:

% # Prepare test file
% seq 100000000 > seq.txt
% zstd -19 -T0 seq.txt
seq.txt              :  2.83%   (   848 MiB =>   24.0 MiB, seq.txt.zst)        
zstd -19 -T0 seq.txt  410.49s user 0.84s system 989% cpu 41.560 total
% zstd --ultra -21 -T0 seq.txt
zstd: seq.txt.zst already exists; overwrite (y/n) ? y
seq.txt              :  2.79%   (   848 MiB =>   23.6 MiB, seq.txt.zst)        
zstd --ultra -21 -T0 seq.txt  471.95s user 0.62s system 325% cpu 2:25.22 total

The first run used ~10 out of 12 physical cores (close to what I'd expect), the second one barely more than 3.

Is that expected?


Dmitri555 commented on April 28, 2024

Do I understand you correctly?
zstd cannot scale linearly, so when I increase the number of cores 8x (from 4 to 32), I cannot expect an 8x increase in performance, just little or no increase.


Cyan4973 commented on April 28, 2024

It's a combination of factors.

Level 1 is the most likely to scale linearly, because its amount of "hot" memory typically fits inside each core's private cache.
After that, as the level increases, memory requirements grow, and it becomes more and more likely (depending on the exact CPU model) that hot memory will spill over into shared resources, such as the L3 cache or RAM. At that point, increasing the number of cores increases contention on those shared resources. Adding cores will still improve performance, but no longer linearly.

The issue reported by @gyscos is different though: it's a question of the quantity of input.
Given an infinite input stream and infinite input bandwidth, all threads will be kept busy, each compressing its own section.
But if the input is "too small", there will not be enough jobs to distribute. Even if 100 threads are available, if there are only 5 jobs to hand out, it won't be possible to employ the 100 threads.
The problem is especially acute in the --ultra range, because each job becomes huge: at level 21, each job is 256 MB by default. So it becomes probable that only one job, or very few, will be distributed.
This issue is mostly a problem for --ultra levels; lower levels have more reasonable job sizes. Level 19 defines a 32 MB job size by default, and level 1 a 2 MB one. "By default", because it's also possible to take manual control of this value when needed (see the worked numbers below).
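
Applying those default job sizes to the 848 MiB test file from the earlier comment gives a concrete back-of-envelope picture (an illustrative aside; job sizes are the defaults quoted in this thread):

/* Parallel jobs available = ceil(inputSize / jobSize). */
#include <stdio.h>

int main(void)
{
    unsigned long long const input = 848ULL << 20;  /* 848 MiB seq.txt */
    unsigned long long const job19 = 32ULL << 20;   /* level 19 default job size */
    unsigned long long const job21 = 256ULL << 20;  /* level 21 default job size */
    printf("level 19: %llu jobs\n", (input + job19 - 1) / job19);  /* 27 jobs */
    printf("level 21: %llu jobs\n", (input + job21 - 1) / job21);  /* 4 jobs */
    return 0;
}

Four jobs at level 21 lines up with the observed 325% CPU usage, while 27 jobs at level 19 are plenty to feed the ~10 busy cores.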
