Comments (11)

Dmitri555 commented on April 28, 2024

I changed the order of execution and added the -T32 option: same result.

time ./zstd1.4.5 -4 -T32 core-file -c > /dev/null
core-file : 0.15% (14083944448 => 21562128 bytes, /stdout)
real 0m2.876s
user 0m7.874s
sys 0m1.433s

time ./zstd1.5.5 -4 -T32 core-file -c > /dev/null
real 0m6.740s
user 0m6.407s
sys 0m6.808s

Cyan4973 commented on April 28, 2024

The sample labelled 1.5.5 has a much higher sys time.
Not sure what it corresponds to, as if the OS had a harder time doing something.
Moreover, since user < real, it doesn't even feel multi-threaded, which -T0 should achieve.

So, that gives a few possibilities to look into (a combined sketch follows below):

  1. Caching difference: if the target file is accessed for the first time, it's more difficult for the OS to retrieve it, whereas in subsequent runs the file is likely cached. Try accessing the file with 1.5.5 after accessing it with 1.4.5 to rule out this effect.
  2. Multithreading: maybe -T0 doesn't work, for reasons unknown, and defaults to a single thread. In that case, try setting a number of cores manually, for example -T30.

Also: use -vv, or even -vvv, to extract more debug information.
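
For instance, a minimal sketch of both checks, reusing the reproducer from above (the worker count is just an example):

# warm the page cache first, so both versions read from RAM
cat core-file > /dev/null
# then compare both binaries with an explicit worker count
time ./zstd1.4.5 -4 -T30 core-file -c > /dev/null
time ./zstd1.5.5 -4 -T30 core-file -c > /dev/null
# -vvv prints the parameters and the number of workers actually used
./zstd1.5.5 -vvv -4 -T30 core-file -c > /dev/null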

Dmitri555 commented on April 28, 2024

START=$(date +%s); time ./zstd1.4.5 -vvv -4 -T32 core-file -o 3 ; END=$(date +%s); echo Elapsed time $((END-START)) Sec
*** zstd command line interface 64-bits v1.4.5, by Yann Collet ***
set nb workers = 32
core-file: 1199042560 bytes
core-file : 0.15% (14083944448 => 21562128 bytes, 3)
core-file : Completed in 9.34 sec (cpu load : 100%)

real 0m2.925s
user 0m7.919s
sys 0m1.427s
Elapsed time 3 Sec

START=$(date +%s); time ./zstd1.5.5 -vvv -4 -T32 core-file -o 4 ; END=$(date +%s); echo Elapsed time $((END-START)) Sec
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
--zstd=wlog=21,clog=18,hlog=18,slog=1,mml=5,tlen=0,strat=2
--format=.zst --no-sparse --block-size=0 --memory=134217728 --threads=32 --content-size
set nb workers = 32
core-file: 14083944448 bytes
Decompression will require 2097152 B of memory
core-file : 0.15% (14083944448 B => 21545116 B, 4) %
core-file : Completed in 8.58 sec (cpu load : 171%)

real 0m8.588s
user 0m6.731s
sys 0m7.952s
Elapsed time 9 Sec

For some strange reason, with the -vvv option zstd even reports the wrong execution time:
"Completed in 9.34 sec" vs "Elapsed time 3 Sec".

Cyan4973 commented on April 28, 2024

Some more advanced tests that could be attempted:

  • Try v1.5.3. It would help determine if the issue comes from some new feature added in v1.5.4.

I was also wondering if the definition of "level 4" has changed between v1.4.5 and v1.5.5, but after verification, it has not, so it should not behave vastly differently.

This could be confirmed by using the internal benchmark module, which bypasses the potential I/O bottleneck:
zstd -b4 core-file
zstd -b4 -T32 core-file
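
If useful, the same module can also sweep a range of levels and run each measurement a bit longer (a sketch; -e sets the last level to test and -i the minimum measurement time in seconds):

zstd -b1 -e6 -i5 core-file
zstd -b1 -e6 -i5 -T32 core-file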

Dmitri555 commented on April 28, 2024

./zstd1.4.5 -b4 core-file
Not enough memory; testing 2709 MB only...
4#core-file :2840941909 -> 7883578 (360.36),2520.8 MB/s ,13232.4 MB/s

./zstd1.5.5 -b4 -T32 core-file
Not enough memory; testing 2709 MB only...
4#core-file :2840941909 -> 8076033 (x351.77), 7734.9 MB/s, 11526.4 MB/s

./zstd1.5.5 -b4 core-file
Not enough memory; testing 2709 MB only...
4#core-file :2840941909 -> 7879476 (x360.55), 6341.0 MB/s, 13338.3 MB/s

Where does the "Not enough memory" message come from? Every new run comes with a new bug :)

cat /proc/meminfo
MemTotal: 131575760 kB
MemAvailable: 116197980 kB

Cyan4973 commented on April 28, 2024

The benchmark module has an internal memory limit (8 GB, divided into 3 buffers, hence ~2.7 GB per buffer). If you want to use more memory, you can change the limit manually and recompile.
But I don't think that will be necessary. The point is just to benchmark the inner compression loop and get some speed and ratio estimates, and I suspect 2.7 GB of data is enough to get a good idea of the behavior. There is already enough information here to learn a few things:

  • The target file is unusual: zstd achieves an extremely high compression ratio on it (x360!). This suggests very high redundancy. We typically don't calibrate for such an extreme range, so changes may have happened there that we did not notice.
  • The compression algorithm itself is extremely fast, as expected in such a case. At ~6 GB/s on a single core, it should take about 2 seconds to compress the whole 13 GB file. This is not consistent with the report that it takes > 7 sec.
  • Also, oddly enough, v1.5.5 seems to be much faster than v1.4.5, rather than slower, on the tested sample. This is also not consistent with the previous report suggesting that v1.5.5 is slower.

All this seems to point towards I/O as the potential bottleneck. And it's logical, considering the extreme speeds requested.

The next test could employ a ramfs file system, like /tmp/, to remove the physical component of I/O limitations. But there will still be the File System itself, and the I/O component within zstd, both of which could become bottlenecks considering the extreme speed targeted.
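
A possible way to run that test, as a sketch (it assumes /dev/shm is a RAM-backed tmpfs with ~14 GB free; adjust the path if /tmp is the RAM-backed file system on this machine):

cp core-file /dev/shm/core-file
time ./zstd1.4.5 -4 -T32 /dev/shm/core-file -c > /dev/null
time ./zstd1.5.5 -4 -T32 /dev/shm/core-file -c > /dev/null
rm /dev/shm/core-file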

After that, presuming the performance difference comes from the I/O component within zstd, it will be necessary to bisect to find the commit that introduced this performance difference in this specific scenario.
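
The bisect itself could look roughly like this (a sketch; it assumes the reproducer from above and a rebuild of the CLI at each step):

git clone https://github.com/facebook/zstd && cd zstd
git bisect start v1.5.5 v1.4.5      # bad release first, then the good one
make -j -C programs zstd            # rebuild the CLI for the current bisect step
time ./programs/zstd -4 -T32 /path/to/core-file -c > /dev/null   # path is a placeholder
git bisect good    # or: git bisect bad, depending on the measured time
# repeat the build + measurement until git reports the first bad commit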

Cyan4973 commented on April 28, 2024

I made a few local tests, to attempt to mimic the scenario, using a highly compressible synthetic data source of 13 GB.

On a macOS laptop, it gives the following:

version    -4       -4 -T0
v1.4.5     6.2s     2.68s
v1.5.5     3.4s     2.59s

As can be seen in this measurement, v1.5.5 is indeed way faster than v1.4.5 in single-thread mode. But once -T0 is employed, they both converge towards the same limit, because I/O is now the bottleneck.

In case it was macOS-specific, I also ran the same test on an older Ubuntu desktop:

version    -4       -4 -T0
v1.4.5     6.85s    2.59s
v1.5.5     3.85s    2.31s

Basically, same conclusion.

So the reported issue is not reproduced.

chschroeder commented on April 28, 2024

I observed a similar discrepancy. I measured zstd's compression speed on a weak VM (zstd v1.4.8) that has only 4 cores and 8 GB of RAM, and repeated the experiment on a more powerful machine (zstd v1.5.5), which is a few years old but has 128 GB of RAM and 64 cores. My expectation was that the latter would be faster (if only because of the more recent software version, whose release notes have promised speedups since then).

In a small experiment, I compared 1) zstd, 2) zstd with the --long parameter, and 3) lrzip. Each strategy was restricted to use only 4 cores, and each was evaluated on compression speed and file size over several compression levels. The runs on v1.5.5 were about 10-15% slower than the runs using v1.4.8. That might not be as drastic as "two times slower", but it was contrary to my expectations and therefore seems suspicious to me.

An additional difference that might be relevant: version 1.4.8 was installed via Debian sources, whereas v1.5.5 was manually compiled with make.
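
For reference, a sketch of that manual build, compared against the packaged binary (it assumes the default Makefile targets and flags, nothing specific to my machine):

git clone https://github.com/facebook/zstd && cd zstd
make -j -C programs zstd      # default build of the CLI
./programs/zstd --version
zstd --version                # the Debian-packaged binary, for comparison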

The data to be compressed was a small set of web crawling results, where the individual files are up to 4 GB in size. Unfortunately, I cannot share the files, but they are comparable to the Common Crawl web archive files.

Dmitri555 commented on April 28, 2024

We typically don't calibrate for such extreme range

As I said before, this is a regular core file from crashed software. It can have large areas of unused/uninitialized memory. It is nothing unusual.

You are probably trying to reproduce the problem on your notebook with an SSD drive, but I have an old-school server with lots of regular HDDs.
PS
Looks like --no-asyncio solved this performance problem.
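
For reference, disabling asynchronous I/O on the earlier command looks like this (a sketch reusing the reproducer from above):

time ./zstd1.5.5 -4 -T32 --no-asyncio core-file -c > /dev/null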

Cyan4973 commented on April 28, 2024

Looks like --no-asyncio solved this performance problem

In which case, I would assume the issue started happening between v1.5.2 and v1.5.4.

yoniko commented on April 28, 2024

Thank you for reporting, @Dmitri555.
I've been able to replicate some use-cases of slow-downs when using AsyncIO on an AMD machine (it didn't reproduce on an Intel server / M1 laptop). The numbers I was getting are similar to what @chschroeder has observed (a 10-15% slowdown).
It only reproduced for me in cases where very few writes happened (very high compression ratio) and compression and reads were very quick (around 2 GB/s). I believe that in those cases the additional overhead of AsyncIO's thread synchronization becomes meaningful.

However, I didn't manage to reproduce the x2 slow-down reported here. My suspicions are that it's either the HDDs or NUMA, but these are just guesses. @Dmitri555, if you can run the same experiment purely RAM-to-RAM, without going through the HDDs, it'd be helpful. Additionally, if you can make sure the process is pinned to one socket (in case there are multiple CPUs in the machine), that'd allow us to rule out NUMA as well.
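
A sketch of how the pinning could be done (it assumes numactl or taskset is available; the node and CPU numbers are just examples):

# pin both CPU and memory allocation to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./zstd1.5.5 -4 -T32 core-file -c > /dev/null
# alternative: restrict the process to a fixed set of logical CPUs
taskset -c 0-15 ./zstd1.5.5 -4 -T32 core-file -c > /dev/null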

As for the 15% slowdown, I've spent some time debugging this, and I believe it is caused by the additional overhead introduced by AsyncIO's thread synchronization. This should only manifest in cases where the read, write, and compression workloads are extremely fast, to the point where the added synchronization syscalls take up a meaningful share of the runtime. Even so, it only reproduced for me on an AMD machine.

I don't think there's an easy fix here. One solution is to increase the size of our read buffers, but this could have negative effects on other use-cases. The better solution would be to add an io_uring-compatible asyncio implementation, which should allow us to remove most of the overhead. We've built the asyncio module with io_uring in mind, so the same API should work, but implementing and testing it would still take some work.
