Comments (11)
I changed the order of execution and added the -T32 option: same result.
time ./zstd1.4.5 -4 -T32 core-file -c > /dev/null
core-file : 0.15% (14083944448 => 21562128 bytes, /stdout)
real 0m2.876s
user 0m7.874s
sys 0m1.433s

time ./zstd1.5.5 -4 -T32 core-file -c > /dev/null
real 0m6.740s
user 0m6.407s
sys 0m6.808s

---
The sample labelled 1.5.5 has a much higher sys time. I'm not sure what it corresponds to; it's as if the OS had a harder time doing something. Moreover, since user < real, it doesn't even look multi-threaded, which -T0 should achieve.
So, that gives a few possibilities to look into:
- Caching difference: if the target file is accessed for the first time, it's more difficult for the OS to retrieve it, whereas in subsequent runs the file is likely cached. Try accessing the file with 1.5.5 after accessing it with 1.4.5 to rule out this effect.
- Multithreading: maybe -T0 doesn't work, for reasons unknown, and defaults to a single thread. In that case, try manually setting a number of cores, for example -T30.

Also: use -vv, or even -vvv, to extract more debug information. A small comparison sketch is given below.
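Here is a minimal sketch of such a comparison, assuming the two binaries are named ./zstd1.4.5 and ./zstd1.5.5 as in the runs above; warming the page cache first and using an explicit thread count removes two of the variables mentioned:

```sh
# Warm the OS page cache so both versions see the same caching state.
cat core-file > /dev/null

# Compare both binaries with an explicit thread count instead of -T0.
for bin in ./zstd1.4.5 ./zstd1.5.5; do
  echo "== $bin =="
  time $bin -4 -T30 -vv core-file -c > /dev/null
done
```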

---
START=$(date +%s); time ./zstd1.4.5 -vvv -4 -T32 core-file -o 3 ; END=$(date +%s); echo Elapsed time $((END-START)) Sec
*** zstd command line interface 64-bits v1.4.5, by Yann Collet ***
set nb workers = 32
core-file: 1199042560 bytes
core-file : 0.15% (14083944448 => 21562128 bytes, 3)
core-file : Completed in 9.34 sec (cpu load : 100%)
real 0m2.925s
user 0m7.919s
sys 0m1.427s
Elapsed time 3 Sec
START=$(date +%s); time ./zstd1.5.5 -vvv -4 -T32 core-file -o 4 ; END=$(date +%s); echo Elapsed time $((END-START)) Sec
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
--zstd=wlog=21,clog=18,hlog=18,slog=1,mml=5,tlen=0,strat=2
--format=.zst --no-sparse --block-size=0 --memory=134217728 --threads=32 --content-size
set nb workers = 32
core-file: 14083944448 bytes
Decompression will require 2097152 B of memory
core-file : 0.15% (14083944448 B => 21545116 B, 4) %
core-file : Completed in 8.58 sec (cpu load : 171%)
real 0m8.588s
user 0m6.731s
sys 0m7.952s
Elapsed time 9 Sec
For some strange reason, with the -vvv option zstd even reports a wrong execution time:
"Completed in 9.34 sec" vs. "Elapsed time 3 Sec"

---
Some more advanced tests that could be attempted:
- Try v1.5.3. It would help determine if the issue comes from some new feature added in v1.5.4.

I was also wondering if the definition of "level 4" had changed between v1.4.5 and v1.5.5, but after verification it has not, so it should not behave vastly differently.

This could be confirmed by using the internal benchmark module, bypassing a potential I/O bottleneck:

zstd -b4 core-file
zstd -b4 -T32 core-file

---
./zstd1.4.5 -b4 core-file
Not enough memory; testing 2709 MB only...
4#core-file : 2840941909 -> 7883578 (360.36), 2520.8 MB/s, 13232.4 MB/s
./zstd1.5.5 -b4 -T32 core-file
Not enough memory; testing 2709 MB only...
4#core-file :2840941909 -> 8076033 (x351.77), 7734.9 MB/s, 11526.4 MB/s
./zstd1.5.5 -b4 core-file
Not enough memory; testing 2709 MB only...
4#core-file :2840941909 -> 7879476 (x360.55), 6341.0 MB/s, 13338.3 MB/s
Where did the "Not enough memory" message come from? Every new run comes with a new bug :)
cat /proc/meminfo
MemTotal: 131575760 kB
MemAvailable: 116197980 kB

---
The benchmark module has an internal memory limit (8 GB, divided into 3 buffers, hence ~2.7 GB per buffer). If you want to use more memory, you can change the limit manually and recompile.
But I don't think that will be necessary. The point is just to benchmark the inner compression loop and get some speed and ratio estimations. I suspect 2.7 GB of data is good enough to get an idea of the behavior. And there is already enough information there to learn a few things:
- The target file is unusual: zstd achieves extremely high compression ratios on it (x360!). This suggests very high redundancy. We typically don't calibrate for such an extreme range, so changes may have happened there that we did not notice.
- The compression algorithm itself is extremely fast, as expected in such a case. At ~6 GB/s for a single core, it should take about 2 seconds to compress the whole 13 GB file. This is not consistent with the report that it takes > 7 sec to do that.
- Also, oddly enough, v1.5.5 seems to be much faster than v1.4.5, rather than slower, on the tested sample. This is also not consistent with the previous report suggesting that v1.5.5 is slower.
All this seems to point towards I/O as the potential bottleneck. And that's logical, considering the extreme speeds involved.
The next test could employ a ramfs file system, like /tmp/, to remove the physical component of the I/O limitation. But there will still be the file system itself, and the I/O component within zstd, both of which could become bottlenecks considering the extreme speed targeted. A sketch of such a test is given below.
After that, presuming the performance difference comes from the I/O component within zstd, it will be necessary to bisect to find the commit that introduced this performance difference in this specific scenario.
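One possible way to run that test (a sketch: it assumes /dev/shm is a tmpfs mount with enough free RAM for the 13 GB file, which the meminfo output above suggests, and that the binaries are named as in the earlier runs):

```sh
# Copy the file to a RAM-backed file system to take the disks out of the picture.
cp core-file /dev/shm/core-file

# Time both versions reading from RAM and discarding the output.
time ./zstd1.4.5 -4 -T32 /dev/shm/core-file -c > /dev/null
time ./zstd1.5.5 -4 -T32 /dev/shm/core-file -c > /dev/null

# Clean up the 13 GB copy.
rm /dev/shm/core-file
```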

---
I made a few local tests, to attempt to mimic the scenario, using a highly compressible synthetic data source of 13 GB.
On a macos laptop, it gives the following:

| version | -4 | -4 -T0 |
|---|---|---|
| v1.4.5 | 6.2s | 2.68s |
| v1.5.5 | 3.4s | 2.59s |

As can be seen in this measurement, v1.5.5 is indeed way faster than v1.4.5 in single-thread mode. But once -T0 is employed, they both converge towards the same limit, because I/O is now the bottleneck.
In case it would be macos-specific, I also made the same test on an older Ubuntu Desktop:

| version | -4 | -4 -T0 |
|---|---|---|
| v1.4.5 | 6.85s | 2.59s |
| v1.5.5 | 3.85s | 2.31s |
Basically, same conclusion.
So the reported issue is not reproduced.
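The exact generator used for these numbers isn't shown; for anyone trying to reproduce them, one simple way to build a comparably compressible ~13 GB input is sketched below (long runs of zeros compress at extreme ratios, much like the unused regions of a core file):

```sh
# Generate ~13 GB of highly redundant data.
dd if=/dev/zero of=synthetic-13g bs=1M count=13312

# Single-thread and multi-thread runs for both versions.
time ./zstd1.4.5 -4     synthetic-13g -c > /dev/null
time ./zstd1.4.5 -4 -T0 synthetic-13g -c > /dev/null
time ./zstd1.5.5 -4     synthetic-13g -c > /dev/null
time ./zstd1.5.5 -4 -T0 synthetic-13g -c > /dev/null
```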

---
I observed a similar discrepancy. I measured zstd's compression speed on a weak VM (zstd v1.4.8) that has only 4 cores and 8GB RAM and repeated the experiment on a more powerful machine (zstd v1.5.5), which is a few years old but has 128GB of RAM and 64 cores. My expectation was that the latter should be faster (if only because of the more recent software version that promises speedups in the release notes since then).
In a small experiment, I compared 1) zstd, 2) zstd with the "--long" parameter, and 3) lrzip. Each strategy was restricted to only use 4 cores. Each of them was evaluated on compression speed and file size over several compression levels. The runs on v1.5.5 were about 10-15% slower than the runs using v1.4.8. Might not be as drastic as "two times slower", but this was contrary to my expectations and therefore seems suspicious to me.
An additional difference that might be relevant: v1.4.8 was installed via the Debian sources, whereas v1.5.5 was manually compiled with make.
The data to be compressed was a small set of web crawling results, where the single files are of size up to 4GB. Unfortunately, I cannot share the files, but they are comparable to the common crawl web archive files.
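A rough sketch of that kind of comparison (the file name and level range here are placeholders, taskset is used to restrict the run to 4 cores, and the lrzip arm of the experiment is omitted):

```sh
# Compare compression time and output size over several levels, pinned to
# 4 cores so both machines run with the same parallelism.
for level in 1 3 9 19; do
  echo "== zstd -$level =="
  time taskset -c 0-3 zstd -$level -T4 -f crawl-sample.warc -o /tmp/out.zst
  ls -l /tmp/out.zst

  echo "== zstd -$level --long =="
  time taskset -c 0-3 zstd -$level --long -T4 -f crawl-sample.warc -o /tmp/out.zst
  ls -l /tmp/out.zst
done
```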

---
> We typically don't calibrate for such extreme range

As I said before, this is a regular core file from crashed software. It can have large areas of unused/uninitialized memory. It is nothing unusual.
Probably you are trying to reproduce the problem on your notebook with an SSD drive, but I have an old-school server with lots of regular HDDs.
PS: Looks like --no-asyncio solved this performance problem.
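For reference, that corresponds to the earlier command with the CLI's asynchronous I/O disabled (assuming the --no-asyncio flag is available in this v1.5.5 build):

```sh
# Same run as before, but with asynchronous I/O explicitly turned off.
time ./zstd1.5.5 -4 -T32 --no-asyncio core-file -c > /dev/null
```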

---
> Looks like --no-asyncio solved this performance problem

In which case, I would assume the issue started happening between v1.5.2 and v1.5.4.
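If it comes to that, one way to narrow it down is a git bisect between the two release tags. The sketch below makes several assumptions (the path to the core file, the 5-second threshold for "good", and rebuilding only the CLI via the programs/ Makefile), so treat it as a starting point rather than a recipe:

```sh
# Bisect the zstd repository between the two releases, rebuilding the CLI at
# each step and timing it against the problematic file.
git clone https://github.com/facebook/zstd.git && cd zstd
git bisect start v1.5.4 v1.5.2      # bad release first, then the good one
git bisect run sh -c '
  make -C programs zstd > /dev/null 2>&1 || exit 125   # 125 = skip unbuildable commits
  start=$(date +%s)
  ./programs/zstd -4 -T32 /path/to/core-file -c > /dev/null
  end=$(date +%s)
  [ $((end - start)) -lt 5 ]        # exit 0 ("good") while the run stays fast
'
```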

---
Thank you for reporting, @Dmitri555.
I've been able to replicate some use-cases of slow-downs when using AsyncIO on an AMD machine (it didn't reproduce on an Intel server / M1 laptop). The numbers I was getting are similar to what @chschroeder has observed (a 10-15% slowdown).
It only reproduced for me in cases where very few writes happened (very high compression ratio) and compression and reads were very quick (around 2 GB/s). I believe that in those cases the additional overhead of AsyncIO's thread synchronization becomes meaningful.
However, I didn't manage to reproduce the x2 slow-down reported here. My suspicions are that it's either the HDDs or NUMA, but these are just guesses. @Dmitri555, if you can run the same experiment RAM to RAM only, without going through the HDDs, it'd be helpful. Additionally, if you can make sure the process is pinned to one socket (in case there are multiple CPUs on the machine), that'd allow us to rule out NUMA as well. A sketch of such a run is given below.
As for the 15% slowdown, I've spent some time debugging this, and I believe it is caused by the additional overhead introduced by AsyncIO's thread synchronization. It should only manifest in cases where the read, write, and compression workloads are extremely fast, to the point where the added synchronization syscalls take up a meaningful share of the runtime. Even so, it only reproduced for me on an AMD machine.
I don't think there's an easy fix here. One solution is to increase the size of our read buffers, but this could have negative results for other use-cases. The better solution would be to add an io_uring-compatible asyncio implementation, which should allow us to remove most of the overhead. We've built the asyncio module with io_uring in mind, so the same API should work, but implementing and testing it would still take some work.
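A possible way to run the RAM-to-RAM, single-socket experiment suggested above (a sketch: it assumes numactl is installed, /dev/shm is a tmpfs with enough free RAM, and node 0 corresponds to one of the sockets):

```sh
# Stage the input in RAM so the HDDs are not involved at all.
cp core-file /dev/shm/core-file

# Pin both the threads and the memory allocations to socket 0 to rule out
# NUMA effects, and keep the compressed output in the same tmpfs.
time numactl --cpunodebind=0 --membind=0 \
  ./zstd1.5.5 -4 -T32 /dev/shm/core-file -o /dev/shm/core-file.zst -f

rm /dev/shm/core-file /dev/shm/core-file.zst
```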