
Comments (8)

tsolomko commented on July 19, 2024

4.8.0 has been released, which includes the new functionality discussed here!

from swcompression.

tsolomko commented on July 19, 2024

Hi @philmitchell,

Thank you for the PR, and happy new year to you too!

I had a look and, wow, this autoreleasepool thing does wonders: repeating the same tests as I did previously in this thread results in 0.19 GB "maximum resident set size" when using TarReader + autoreleasepool vs. 2.93 GB for TarContainer.open.

As I understand now, the new TarReader/Writer types won't help with memory consumption by themselves: users will still have to use autoreleasepool to achieve the desired effect. So I wonder if it's possible to move the autoreleasepool inside TarReader, because otherwise these new types seem useless.

One way to achieve this would be to add a new function, something like TarReader.processNext<Result>(_ f: (TarEntry) -> Result) throws -> Result, which obtains a new TarEntry (similar to next()) and calls the user-supplied function on it inside the autoreleasepool.
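That idea could be sketched roughly as follows. The TarEntry and TarReader types here are minimal stand-ins, defined only so the sketch compiles on its own; processNext is the hypothetical function being proposed, not an existing SwCompression API:

```swift
import Foundation

// autoreleasepool is Darwin-only; a pass-through keeps this sketch portable.
#if !canImport(ObjectiveC)
func autoreleasepool<T>(invoking body: () throws -> T) rethrows -> T {
    return try body()
}
#endif

// Minimal stand-ins for SwCompression's types, only so the sketch compiles.
struct TarEntry {
    let name: String
}

struct TarReader {
    private var entries: [TarEntry]

    init(entries: [TarEntry]) {
        self.entries = entries
    }

    // Plays the role of the real next(): returns the next entry,
    // or nil at the end of the archive.
    mutating func next() throws -> TarEntry? {
        return entries.isEmpty ? nil : entries.removeFirst()
    }

    // The proposed API: obtains the next entry (like next()) and calls the
    // user-supplied function on it inside an autoreleasepool, so temporary
    // buffers are released as soon as the closure returns.
    mutating func processNext<Result>(_ f: (TarEntry?) throws -> Result) throws -> Result {
        return try autoreleasepool {
            let entry = try self.next()
            return try f(entry)
        }
    }
}
```

The key point is that both reading the entry and the user's processing happen inside the pool, so any autoreleased Data never outlives one iteration.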

Anyway, this is a question for me to think about later in the week (alongside merging your PR, etc.), so thank you once again!


tsolomko commented on July 19, 2024

I have some news to share.

First of all, as I discovered today, autoreleasepool is a Darwin-only thing, so it is not available on Linux and Windows. Apart from having to provide a compatibility implementation for it on those platforms, I wonder what the memory usage situation is there, i.e. whether this new TarReader/Writer stuff will even be helpful on Linux/Windows. I've also found some relevant discussion on the Swift forums about autoreleasepool availability.
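For the compatibility side, a pass-through shim would at least let the same code compile everywhere. This is only a sketch: since there is no Objective-C autorelease machinery on Linux/Windows, it invokes the closure directly, and whether that leaves a memory problem on those platforms is exactly the open question above:

```swift
import Foundation

#if !canImport(ObjectiveC)
// Pass-through stand-in for Darwin's autoreleasepool(invoking:).
// Without Objective-C reference counting there is nothing to drain,
// so it simply invokes the closure and returns its result.
func autoreleasepool<Result>(invoking body: () throws -> Result) rethrows -> Result {
    return try body()
}
#endif
```

With this in place, code written against autoreleasepool builds unchanged on Darwin and non-Darwin platforms alike.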

Secondly, I merged tar-rw branch into develop and made some additional changes:

  1. TarReader.next() has been renamed to TarReader.read().
  2. TarReader.process(_:) has been added (this is the function that automatically uses autoreleasepool).
  3. Removed the --use-rw-api option from swcomp tar: it will now automatically use TarReader/Writer when possible (i.e. when no additional compression options are supplied).
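In use, the new process(_:) function would look something like this (a sketch assuming the develop-branch signatures, where the closure receives nil at the end of the archive; tarURL and outputDir are placeholders):

```swift
var reader = TarReader(fileHandle: try FileHandle(forReadingFrom: tarURL))
while try reader.process({ (entry: TarEntry?) -> Bool in
    guard let entry = entry else { return false } // nil marks the end of the archive
    // Handle the entry here; its data is released once this closure returns.
    try entry.data?.write(to: outputDir.appendingPathComponent(entry.info.name))
    return true
}) { }
```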

Note that there are no changes to TarWriter. As much as I would like it to use autoreleasepool internally, so that users wouldn't have to think about these subtleties, I don't think that is possible. So the only solution there would be to mention something in the documentation along the lines of "you should wrap autoreleasepool around the TarWriter calls".
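Such a documentation note could demonstrate the pattern along these lines (a sketch; handle, fileURLs, and the entry construction are placeholders, and the TarEntryInfo initializer arguments are assumed):

```swift
var writer = TarWriter(fileHandle: handle, format: .pax)
for url in fileURLs {
    try autoreleasepool {
        // Load one file at a time; its Data is released when the pool drains.
        let info = TarEntryInfo(name: url.lastPathComponent, type: .regular)
        let entry = TarEntry(info: info, data: try Data(contentsOf: url))
        try writer.append(entry)
    }
}
try writer.finalize()
```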


tsolomko commented on July 19, 2024

To answer your main question: no, there has been no progress on this issue.

However, I see now that there is a lot of demand for this kind of functionality. As such, I will try to do something about it, starting with your specific case (creating TAR). I will report back with some sort of implementation soon (hopefully, by the end of this week, but no guarantees).


JordanPJamieson commented on July 19, 2024

Thanks a lot for the quick response - I deeply appreciate it.

This functionality would help immensely.

For my specific case, I'm trying to create a tar file and write it to disk efficiently without holding so much data in memory. So a function that accepts tar entries (with a URL pointing to the file on disk rather than the file's Data) and a destination URL to write the archive to in chunks would be very helpful as well.

Thanks again!


tsolomko commented on July 19, 2024

So I did implement something. I added a set of relatively simple APIs that use Foundation's FileHandle:

public struct TarReader {
    public init(fileHandle: FileHandle)
    public mutating func next() throws -> TarEntry?
}

public struct TarWriter {
    public init(fileHandle: FileHandle, format: TarContainer.Format = .pax)
    public mutating func append(_ entry: TarEntry) throws
    public func finalize() throws
}

These should help avoid loading everything into memory at once, loading data only as necessary.
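For example, untarring with these types might look like this (a sketch; error handling and the actual processing of each entry are elided):

```swift
let handle = try FileHandle(forReadingFrom: URL(fileURLWithPath: "archive.tar"))
var reader = TarReader(fileHandle: handle)
// next() returns nil once the end of the archive is reached.
while let entry = try reader.next() {
    // Only this entry's data needs to be in memory at this point.
    print(entry.info.name, entry.data?.count ?? 0)
}
```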

I've tried to check whether they indeed help with memory usage. I used the /usr/bin/time -lp command to measure "maximum resident set size", tarring "Microsoft Word.app" (2.26 GB) as a test input. The results are a bit confusing. For the TarContainer.open function I got 3.265 GB vs. 1.746 GB for the TarReader version. For TarContainer.create and its TarWriter counterpart I got 3.515 GB vs. 2.279 GB, respectively. The full terminal output is available in the attached file (though it may be a bit cryptic): term_output.txt

So, what do you think about all of this? I should note that I have no idea whether or not this metric of "maximum resident set size" is a good one to evaluate memory usage. As I assume you did some memory-related measurements before, any advice on how to better estimate memory usage would be appreciated!

P.S. If you are interested in testing these new functions yourself, they are available in the tar-rw branch.


philmitchell commented on July 19, 2024

I just submitted a PR to address this, but only for the case of untarring. If you want, I'll look at tarring, too ... but I happen to need only untarring :)

I used Xcode instruments to confirm that with the addition of autoreleasepool, the new rw-api branch can untar quite large files without any increase in transient or persistent memory usage.

Thanks Timofey ... and happy new year!


philmitchell commented on July 19, 2024

Ha, that's super interesting! It was surprising to discover that we had to manually release memory during the file read, and now I'm really curious to know how it works in Swift on Linux/Windows. I'm swamped at the moment, but I'll look into this when I can.

Regarding the removal of --use-rw-api: good call, I agree it should be the default. I actually ran some (very crude) tests to see if there was a performance hit when reading the tar file in chunks, and if anything it performs better. If that's correct, there's no reason to do it any other way.

