microsoft / recursiveextractor

RecursiveExtractor is a .NET Standard 2.0 archive extraction library and command-line tool that can process 7zip, ar, bzip2, deb, gzip, iso, rar, tar, vhd, vhdx, vmdk, wim, xzip, and zip archives, and any nested combination of the supported formats.

License: MIT License

Languages: C# 92.02%, HTML 4.70%, JavaScript 0.88%, CSS 2.39%
Topics: nuget, recursion, extractor, archive, disc-image

recursiveextractor's Issues

Some ISOs not parsed from DiscUtils

I've been playing with this library for a few minutes and created a small PowerShell module using it.

$ExtractMe = "$Env:UserProfile\Downloads\DaRT70.iso"
$Extractor = [Microsoft.CST.OpenSource.RecursiveExtractor.Extractor]::new()
$Extractor.ExtractFile($ExtractMe) | Format-Table

This is the error: (screenshot not preserved)

This is bootable .iso.

Wrong numbers of entries in TAR archive

I created a TAR archive using 7-zip and then tried to extract it using this project. I tried both the method that takes a file path and the method that takes a stream of the file; both give the same error.

Steps to recreate this

  • Create a TAR archive using 7-zip (others might work too) containing two or three files
  • Run a test with the following code:
using var stream = new FileStream(path, FileMode.Open);
var entries = new Extractor().Extract(string.Empty, stream).ToList();

or
var entries = new Extractor().Extract(path).ToList();

  • Verify the number of entries against the number of files added to the TAR archive
  • The number of entries in the archive is one less than the number of files added to the archive

Cannot extract files from .7z and .rar (RAR4) archives with unencrypted filenames

When attempting to extract a .7z with unencrypted filenames, I get the following error in the CLI.

2023-03-08 11:20:10.5991|ERROR|Microsoft.CST.RecursiveExtractor.Cli.RecursiveExtractorClient|Exception while extracting. SharpCompress.Common.CryptographicException:Encrypted 7Zip archive has no password specified. ( at SharpCompress.Compressors.LZMA.AesDecoderStream..ctor(Stream input, Byte[] info, IPasswordProvider pass, Int64 limit)
at SharpCompress.Compressors.LZMA.DecoderRegistry.CreateDecoderStream(CMethodId id, Stream[] inStreams, Byte[] info, IPasswordProvider pass, Int64 limit)
at SharpCompress.Compressors.LZMA.DecoderStreamHelper.CreateDecoderStream(Stream[] packStreams, Int64[] packSizes, Stream[] outStreams, CFolder folderInfo, Int32 coderIndex, IPasswordProvider pass)

When trying to extract a .rar with unencrypted filenames, I get no error, but it just creates a damaged copy of the .rar.

The same issue occurs with the .NET Standard library variant.

Steps to reproduce:

  1. Create a text file.
  2. Add it to a .7z archive using 7zip or to a .rar archive using WinRar without checking the Encrypt filenames option.
  3. Try to extract the archives using the RecursiveExtractor Tool.

Add a Cli

To be published as a .NET global tool

Provide an option to use provided file extensions instead of checking bytes

Microsoft Office files with .docx, .xlsx, and perhaps other extensions are identified as archive file types, i.e. MiniMagic.DetectFileType returns type ZIP for them. Technically they are archives: you can unzip them, and they contain other files such as XML documents. For practical purposes, though, this may not be the desired result. The user is unlikely to expect them to be unzipped and would rather have them treated as a single document. This can lead to unexpected results and lost processing time.

Request: check earlier in the detection process whether the file is a known document type, and treat such files as non-archives unless a new override flag is set to ignore file extensions, or something to similar effect.
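
A rough sketch of what such an early extension check could look like. Everything here is an assumption for illustration: `DocumentExtensionFilter`, `TreatAsDocument`, the `ignoreExtensions` flag, and the extension list are hypothetical names, not part of RecursiveExtractor or MiniMagic.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of RecursiveExtractor.
public static class DocumentExtensionFilter
{
    // Assumed list of ZIP-based document extensions to treat as single files.
    private static readonly HashSet<string> DocumentExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".docx", ".xlsx", ".pptx" };

    // Returns true when the file should be passed through without extraction.
    // ignoreExtensions stands in for the override flag requested above.
    public static bool TreatAsDocument(string fileName, bool ignoreExtensions = false)
    {
        if (ignoreExtensions)
        {
            return false; // fall back to byte-signature detection
        }
        return DocumentExtensions.Contains(Path.GetExtension(fileName));
    }
}
```

A caller would run this check before signature detection and only inspect the bytes when it returns false.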

Leading numbers (n.) in file names are removed when extracting ar archives

We have a test archive, test.ar, which contains three files: 1.lorem.txt, 2.lorem.txt and 3.lorem.txt. If we unpack these with RecursiveExtractor, each entry has the same filename. It seems the 1., 2. and 3. prefixes are removed for some reason.

So each entry is returned with the name lorem.txt, which causes a collision when extracting the entries. The full path is also the same: \Lorem.txt, or in the example below "\test.ar\Lorem.txt".

The archive is included within the zip file, since ar files aren't allowed to be attached directly.

Test.zip

using var stream = new FileStream("Test.zip", FileMode.Open);
var entries = new Extractor().Extract(string.Empty, stream, options).ToList();

Support for MSI Format

The MSI installer format is not dissimilar to an archive of many files.

There is a .NET Framework library for parsing MSI files, but no NuGet package is published. It may be possible to fork lessmsi and add MSI extraction support, or to write it from scratch.

Extract swallows timeout exception

Because of an issue in SharpCompress with very slow 7zip extraction, it's easy to get into a situation where it takes many hours to extract a few thousand small files.

Because of this I've enabled a timeout:
new ExtractorOptions() { Timeout = TimeSpan.FromMinutes(10), EnableTiming = true, }

When trying this I found that this exception is swallowed in Extractor.Extract(...) and only a fraction of the files are returned.
Extractor.ExtractAsync(...) does not catch this exception and leaves it to the caller. Here it works as expected.
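
To illustrate the difference in behaviour, here is a minimal self-contained sketch. It does not use the library at all: `Source`, `ExtractSwallowing` and `ExtractPropagating` are hypothetical stand-ins, and the timeout is simulated with a thrown TimeoutException.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins; none of this is RecursiveExtractor code.
public static class TimeoutDemo
{
    // Simulates an extractor that times out after yielding two entries.
    private static IEnumerable<string> Source()
    {
        yield return "a.txt";
        yield return "b.txt";
        throw new TimeoutException("simulated timeout");
    }

    // Swallowing: the caller receives two entries and no sign of failure.
    public static List<string> ExtractSwallowing()
    {
        var results = new List<string>();
        try
        {
            foreach (var entry in Source())
            {
                results.Add(entry);
            }
        }
        catch (TimeoutException)
        {
            // The exception is dropped here, so the result looks complete.
        }
        return results;
    }

    // Propagating: the exception reaches the caller during enumeration.
    public static IEnumerable<string> ExtractPropagating()
    {
        foreach (var entry in Source())
        {
            yield return entry;
        }
    }
}
```

With the swallowing variant a caller cannot distinguish a partial result from a complete one, which matches the behaviour described above.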

Add Verbose Extraction Flag

Currently, if you set ExtractSelfOnFail, you'll get the parent archive but not any (limited) contents that we were able to extract out of the archive. Consider adding an additional flag for use with ExtractSelfOnFail that would also include everything we could find. This would require adding a field to FileEntry to indicate a partially extracted file, perhaps with gradations of how much was extracted.

Refactor parallel behavior to perform only at top level

The parallel option currently enables parallel extraction when possible inside of extractors. However, many implementations are not thread-safe, so this support is limited.

Instead we could parallelize the recursive part of the extraction at the top level - this should sidestep any non-threadsafe implementations in the underlying extractor libraries.

For example, add the parallelization here: https://github.com/microsoft/RecursiveExtractor/blob/84b45b6be7c908e9da9658da2b6d6adeacd66ccc/RecursiveExtractor/Extractor.cs#LL647C23-L647C23

Also part of this refactoring:

  • Remove individual extractor-level parallel implementations.
  • Remove any async parallel implementations.

Alternatively:

  • Deprecate the parallel option entirely: remove all functionality and update doc comments appropriately.

Improve tests to validate the correct contents of extracted files

Currently the tests primarily check that the correct number of files are returned from extraction commands. However, as the issue identified in #102 shows this can be insufficient in cases where the underlying library may not be threadsafe (for example). In such cases the tests can still succeed, finding the correct number of files, but their contents may be incomplete. We should develop additional tests that verify for each file type we support that the contents + sizes of each file are correct. Perhaps the easiest method would be with a hash of the contents of the extracted stream.
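
As one possible shape for such a check, the helper below hashes a stream with SHA-256 so a test can compare it against a known digest. `ContentHash` and `Sha256Hex` are hypothetical names, and the expected digests would have to be recorded separately for each test archive.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical test helper, not part of the existing test suite.
public static class ContentHash
{
    // Returns the SHA-256 digest of the stream's contents as lowercase hex.
    public static string Sha256Hex(Stream content)
    {
        if (content.CanSeek)
        {
            content.Position = 0; // hash from the start of the stream
        }
        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(content);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

A test could then assert that the digest of each extracted entry's content stream equals the recorded value, which catches truncated or corrupted contents that a file count alone would miss.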

Generate Extraction Report

For some use cases it might be useful to have an overview report of the files which were extracted and the status. This would require some amount of reworking in all the extractors.

Support for Encrypted Archives

Problem

Recursive Extractor currently skips encrypted or password-protected archives.

Proposed Feature

Update all applicable extractors to support extracting with a password.

List Implementation

Input is a list of passwords. Attempt each password in turn when encountering anything encrypted.
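
A minimal sketch of the list approach. The `tryOpen` callback is an assumption that stands in for whatever each extractor would do to test a password; `PasswordList` and `TryFindPassword` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of RecursiveExtractor.
public static class PasswordList
{
    // Tries each password in order. tryOpen is an assumed callback that
    // returns true when the archive opens with the given password.
    public static bool TryFindPassword(IEnumerable<string> passwords,
        Func<string, bool> tryOpen, out string match)
    {
        foreach (var password in passwords)
        {
            if (tryOpen(password))
            {
                match = password;
                return true;
            }
        }
        match = string.Empty; // no password worked
        return false;
    }
}
```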

Possible Other Implementations

Input is a Dictionary of Filename (regexes?) to passwords. Only attempt the passwords whose regexes match.
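
The dictionary variant could narrow the candidates before any attempt is made. In this sketch the keys are assumed to be regex pattern strings; `PasswordMap` and `CandidatesFor` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of RecursiveExtractor.
public static class PasswordMap
{
    // Collects the passwords of every entry whose pattern matches the file name.
    // Keys are assumed to be regular expression patterns.
    public static List<string> CandidatesFor(string fileName,
        IDictionary<string, List<string>> patternToPasswords)
    {
        var candidates = new List<string>();
        foreach (var pair in patternToPasswords)
        {
            if (Regex.IsMatch(fileName, pair.Key))
            {
                candidates.AddRange(pair.Value);
            }
        }
        return candidates;
    }
}
```

Only the returned candidates would then be tried against the archive, which avoids testing every password on every encrypted file.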

Add more test cases for VHDX/VHD with various disc formattings

Currently FAT, NTFS and XFS are supported in code. We should also test with various partition tables. I think we test with MBR and GPT in the VHD and VHDX respectively, but we should also test LVM.

NTFS is currently tested implicitly in the VHDX test.
FAT is currently tested implicitly in the VHD test.

Clean up Documentation and Interfaces

Extractor.cs in particular has calls that have duplicative options - passing both an extractor options and the allow/deny filters.
Many extractors also have not had their documentation comments updated.

VHD Extraction Issue (ObjectDisposedException)

A fatal exception occurs while executing the "public IEnumerable Extract" function in VhdExtractor.cs. The exception is thrown when the statement "using var disk = new DiscUtils.Vhd.Disk(fileEntry.Content, Ownership.None);" is declared with the "using" keyword. Removing the "using" keyword resolves the issue. Alternatively, replacing the function with the approach used in VhdxExtractor.cs is enough.

extracting .iso files in parallel with small batch size fails

This test fails

        [DataTestMethod]
        [DataRow("TestData.iso")]
        public void ExtractArchiveSmallBatchSize(string fileName, int expectedNumFiles = 3)
        {
            var extractor = new Extractor();
            var path = Path.Combine(Directory.GetCurrentDirectory(), "TestData", "TestDataArchives", fileName);
            var results = extractor.Extract(path, new ExtractorOptions() { Parallel = true, BatchSize = 2 });
            Assert.AreEqual(expectedNumFiles, results.Count(entry => entry.EntryStatus == FileEntryStatus.Default));
        }

Add file metadata to FileEntry

Some consumers might be interested in the stated permissions and other metadata available about a file. For example, some files could unzip with chmod +x and be executable.

This first requires finding out which common factors are shared between SharpZipLib extracted entries and LibObjectFile file entries. For LibObjectFile entries reading real filesystems this is particularly interesting. We then need to decide on a new schema for FileEntry: either a number of fields or a FileMetadata object.

This is a feature request related to Attack Surface Analyzer.

Use ObsoleteAttribute to mark method as deprecated

I noticed there are a few methods in Extractor.cs whose comment text says they are deprecated. As a minor improvement, the ObsoleteAttribute attribute can be used to programmatically mark a method as deprecated, so any user will get a warning or an error (depending on the second parameter: true=error, false=warning).

        /// <summary>
        /// Deprecated. Use ExtractAsync.
        /// </summary>
        /// <param name="fileEntry"></param>
        /// <param name="opts"></param>
        /// <param name="governor"></param>
        /// <returns></returns>
==>     [ObsoleteAttribute("This method is obsolete. Use ExtractAsync instead.", false)]
        public async IAsyncEnumerable<FileEntry> ExtractFileAsync(FileEntry fileEntry, ExtractorOptions? opts = null, ResourceGovernor? governor = null)
        {
            await foreach(var entry in ExtractAsync(fileEntry, opts, governor))
            {
                yield return entry;
            }
        }

.iso extraction with parallel causes deadlock

Extracting the following file sometimes throws an error and sometimes deadlocks.
https://mirrors.slackware.com/slackware/slackware-iso/slackware64-14.1-iso/slackware64-14.1-install-dvd.iso

        [DataTestMethod]
        [DataRow("slackware64-14.1-install-dvd.iso", 8503)]
        public void ExtractIso(string fileName, int expectedNumFiles)
        {
            var extractor = new Extractor();
            var path = Path.Combine(Directory.GetCurrentDirectory(), "TestData", "TestDataArchives", fileName);
            var results = extractor.Extract(path,
                new ExtractorOptions()
                {
                    Parallel = true, BatchSize = 5, RawExtensions = new List<string>() { ".txz" }
                });
            Assert.AreEqual(expectedNumFiles, results.Count());
        }

With Parallel = false this completes in less than 20 seconds.
The .txz files are unrelated; they are excluded only because extracting them takes extra time.

Propagate file create/modify times

Where available, RecursiveExtractor should pull out file create/modify times so that callers (or the CLI) can set them for created files.

If no data is available, either "now" or epoch would be fine.
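
On the caller side, applying propagated times could look roughly like this. `TimestampWriter` and `Apply` are hypothetical names, and the nullable parameters are an assumption about how missing data might be represented; the fallback here is "now".

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of RecursiveExtractor or its CLI.
public static class TimestampWriter
{
    // createdUtc / modifiedUtc are null when the archive carries no timestamp;
    // in that case the current time is used as the fallback.
    public static void Apply(string path, DateTime? createdUtc, DateTime? modifiedUtc)
    {
        DateTime fallback = DateTime.UtcNow;
        // Creation time is set first: on some platforms setting it may also
        // touch the modification time, which the next call then overwrites.
        File.SetCreationTimeUtc(path, createdUtc ?? fallback);
        File.SetLastWriteTimeUtc(path, modifiedUtc ?? fallback);
    }
}
```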

Option to exclude/include files by archive type

Option currently exists to check by globs, but we check the actual file signature which may not match extension.

This would be to add a new option so you could specify for example: only extract zips, or extract everything but zips.

Options should be two lists of Archive Type Enums.
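
A sketch of how the two lists might be evaluated. The enum below is a hypothetical stand-in for the library's archive type enum, and the precedence rule (deny wins, empty allow list means allow all) is an assumption rather than a decided design.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical enum for illustration; the library's real enum may differ.
public enum SketchArchiveType { Unknown, Zip, Tar, SevenZip, Iso }

// Hypothetical helper, not part of RecursiveExtractor.
public static class ArchiveTypeFilter
{
    // An empty allow list means "allow everything"; the deny list always wins.
    public static bool ShouldExtract(SketchArchiveType type,
        ICollection<SketchArchiveType> allow, ICollection<SketchArchiveType> deny)
    {
        if (deny.Contains(type))
        {
            return false;
        }
        return allow.Count == 0 || allow.Contains(type);
    }
}
```

"Only extract zips" would be an allow list containing Zip with an empty deny list; "everything but zips" would be an empty allow list with Zip in the deny list.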

RecursiveExtractor fails for some archives (fail to create file)

Repro:

  1. Download https://github.com/microsoft/OSSGadget/releases/download/v0.1.237/OSSGadget_linux_0.1.237.zip.
  2. Run the RecursiveExtractor CLI against that zip file.

The output contains something like:

Extracted OSSGadget_linux_0.1.237.zip\OSSGadget_linux_0.1.237\System.Security.Cryptography.Native.OpenSsl.a\openssl.c.o.
Extracted OSSGadget_linux_0.1.237.zip\OSSGadget_linux_0.1.237\System.Security.Cryptography.Native.OpenSsl.a\apibridge.c.o.

2020-09-06 21:14:02.3523|FATAL|Microsoft.CST.RecursiveExtractor.Extractor|Failed to create file at bb\OSSGadget_linux_0.1.237.zip\OSSGadget_linux_0.1.237\System.Security.Cryptography.Native.OpenSsl.a\.

Extracted OSSGadget_linux_0.1.237.zip\OSSGadget_linux_0.1.237\System.Security.Cryptography.Native.OpenSsl.so.
Extracted OSSGadget_linux_0.1.237.zip\OSSGadget_linux_0.1.237\System.Security.Cryptography.OpenSsl.dll.
Extracted OSSGadget_linux_0.1.237.zip\OSSGadget_linux_0.1.237\System.Security.Cryptography.Pkcs.dll.

I believe this is happening in Extractor.ExtractToDirectory since targetPath is a directory that already exists.

Publish Cli to Nuget

Depends on #23

We want to publish this Cli as a Nuget Global Tool.

Action to bind to "RecursiveExtractor"
