Comments (16)

miurahr commented on August 25, 2024

py7zr was designed so that this can be added in the future, but the current implementation does not have the feature.

The main extraction function is built from five code blocks.

  1. Set up the list of directories, files and symlinks from the header data and the output path.
    for f in self.files:

    py7zr/py7zr/py7zr.py

    Lines 753 to 754 in 6ba709c

    self.worker.register_filelike(f.id, outfilename)
    target_files.append((outfilename, f.file_properties()))
  2. Make the directories.

    py7zr/py7zr/py7zr.py

    Lines 755 to 757 in 6ba709c

    for target_dir in sorted(target_dirs):
        try:
            target_dir.mkdir()
  3. Decompress the archive file, walking through the lists. If a path is None, nothing is written.
    self.worker.extract(self.fp, multithread=multi_thread)
  4. Create the symbolic links.
    sym_dst.symlink_to(sym_src)
  5. Set the metadata of files and directories, such as creation time, permissions, etc.

    py7zr/py7zr/py7zr.py

    Lines 791 to 792 in 6ba709c

    for o, p in target_files:
        self._set_file_property(o, p)

Currently, step 1 runs for all items. It calls self.worker.register_filelike() for each target item. When it is called with None as the target path, the decompress function skips writing that item.
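
The register-then-skip idea described above can be pictured with a toy model. This is only a sketch: ToyWorker and plan_extraction are hypothetical names invented for illustration and are not py7zr's classes; only the name register_filelike is borrowed from the snippet above.

```python
from typing import Dict, Iterable, List, Optional


class ToyWorker:
    """Toy stand-in for the extraction worker; not py7zr's real class."""

    def __init__(self) -> None:
        self.targets: Dict[int, Optional[str]] = {}
        self.written: List[str] = []

    def register_filelike(self, file_id: int, path: Optional[str]) -> None:
        # Every archive entry is registered; None marks "do not write this one".
        self.targets[file_id] = path

    def extract(self) -> None:
        # Walk all entries in archive order and write only those with a path.
        for file_id in sorted(self.targets):
            path = self.targets[file_id]
            if path is None:
                continue  # skipped entry: nothing is written
            self.written.append(path)


def plan_extraction(names: Iterable[str], wanted: Iterable[str]) -> List[str]:
    """Register every name, passing None for the ones that were not requested."""
    wanted = set(wanted)
    worker = ToyWorker()
    for file_id, name in enumerate(names):
        worker.register_filelike(file_id, name if name in wanted else None)
    worker.extract()
    return worker.written
```

With the names from test_1.7z used later in this thread, plan_extraction(['setup.py', 'scripts/py7zr', 'setup.cfg'], ['scripts/py7zr']) would record only 'scripts/py7zr' as written.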

from py7zr.

miurahr commented on August 25, 2024

The branch https://github.com/miurahr/py7zr/commits/topic-extraction-filter tries to implement this with a dirty hack.

Please see the test case for how to specify files.

py7zr/tests/test_basic.py

Lines 372 to 383 in c3f580e

@pytest.mark.api
def test_py7zr_extract_specified_file(tmp_path):
    archive = py7zr.SevenZipFile(open(os.path.join(testdata_path, 'test_1.7z'), 'rb'))
    expected = [{'filename': 'scripts/py7zr', 'mode': 33261, 'mtime': 1552522208,
                 'digest': 'b0385e71d6a07eb692f5fb9798e9d33aaf87be7dfff936fd2473eab2a593d4fd'}
                ]
    archive.extract(path=tmp_path, targets=['scripts', 'scripts/py7zr'])
    assert tmp_path.joinpath('scripts').is_dir()
    assert tmp_path.joinpath('scripts/py7zr').exists()
    assert not tmp_path.joinpath('setup.cfg').exists()
    assert not tmp_path.joinpath('setup.py').exists()
    check_output(expected, tmp_path)

There are two items to extract in this test case: one is the 'scripts' directory and the other is the 'scripts/py7zr' file. The other files, 'setup.py' and 'setup.cfg', are skipped.

Any feedback?

michaelfecher commented on August 25, 2024

I nearly implemented a similar solution,
because I wasn't aware that you would respond that fast 😄

# specific_file_list: List[str] as an argument to the function
file_list_pattern = '|'.join('(?:{0})'.format(x) for x in specific_file_list)
file_pattern = re.compile(file_list_pattern)
# in for loop, straight after iteration
if file_pattern.match(f.filename) is None:
    continue

I just wasn't sure whether it was the intended solution, because I only had a brief look at the master code.
And I was misled by the code, because I thought all the worker stuff would need to be adapted as well.

Not sure why you find your solution "dirty"?
I only can assume that you named it "dirty", because you packed everything together in one function...

miurahr commented on August 25, 2024

Your code does almost the same as what I tried.
if re.match('file1 | file2 | ..', f.filename) and if f.filename in file_list produce much the same result.

py7zr/py7zr/py7zr.py

Lines 737 to 739 in c3f580e

if targets is not None and f.filename not in targets:
    self.worker.register_filelike(f.id, None)
    continue
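
The two filters are close but not identical, which may matter for names such as 'feedback.txt' and 'feedback.txt.1'. The following sketch compares them on made-up names (the third name is invented purely to show the effect of an unescaped '.'); it is an illustration, not code from the branch.

```python
import re

names = [
    'FUBAR/65155/feedback.txt',
    'FUBAR/65155/feedback.txt.1',
    'FUBAR/65155/feedbackXtxt',  # invented name, only to show '.' as a wildcard
]
wanted = ['FUBAR/65155/feedback.txt']

# Exact membership test, as in the snippet above.
by_membership = [n for n in names if n in wanted]

# Unescaped alternation with re.match: '.' matches any character and
# the match is anchored only at the start of the name.
loose = re.compile('|'.join('(?:{0})'.format(x) for x in wanted))
by_loose_regex = [n for n in names if loose.match(n)]

# Escaped alternation with fullmatch behaves like the membership test.
strict = re.compile('|'.join(re.escape(x) for x in wanted))
by_strict_regex = [n for n in names if strict.fullmatch(n)]
```

Here the loose pattern selects all three names, while the membership test and the escaped fullmatch select only the first one.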

There are several design considerations for the API.

  1. py7zr has a method getnames() which returns all archived files as a list. It would be better if the value returned by getnames() could be used as an argument.

  2. The extract function needs to be split into several internal functions for better maintainability.

  3. When the user specifies files under directories, py7zr should create those directories before extracting the files. Currently, if the user does not also specify the parent directories, the extract() call fails.

  4. My hack does not handle the difference in path separator characters ('/' or '\').

  5. Would it be better to accept a regular expression as the argument?

  6. The extraction core currently runs through all of the archive data. When skipping files, it just ignores the decompressed data (fileish becomes a NullHandler). This can reduce I/O but cannot reduce CPU time.

py7zr/py7zr/compression.py

Lines 263 to 274 in c3f580e

def extract_single(self, fp: BinaryIO, files, src_start: int) -> None:
    """Single thread extractor that takes file lists in single 7zip folder."""
    fp.seek(src_start)
    for f in files:
        fileish = self.target_filepath.get(f.id, NullHandler())  # type: Handler
        fileish.open()
        # Skip empty file read
        if f.emptystream:
            fileish.write(b'')
        else:
            self.decompress(fp, f.folder, fileish, f.uncompressed[-1], f.compressed)
        fileish.close()
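
For the path separator point (item 4), one possible approach is to normalize user-supplied targets before comparing them with archive names. This is a sketch under the assumption that the archive reports names with '/' (as 'scripts/py7zr' in the test above suggests); normalize_target is a hypothetical helper, not part of py7zr.

```python
from pathlib import PureWindowsPath


def normalize_target(name: str) -> str:
    """Return `name` with '/' separators, whichever separator it used."""
    # PureWindowsPath accepts both '\\' and '/' as separators on any OS,
    # and as_posix() renders the result with forward slashes.
    return PureWindowsPath(name).as_posix()
```

A caller could then build targets with [normalize_target(t) for t in user_targets] before the membership check.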

miurahr commented on August 25, 2024

Here is my idea for how to make use of getnames():

py7zr/tests/test_basic.py

Lines 391 to 397 in e51772a

allfiles = archive.getnames()
filter_pattern = re.compile(r'scripts.*')
targets = []
for f in allfiles:
    if filter_pattern.match(f):
        targets.append(f)
archive.extract(path=tmp_path, targets=targets)

If that would be convenient, I'd like to add a new method such as extract_re(path=<outdir>, filter=<regex>).
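
Until such a method exists, the same effect could be approximated with a small wrapper in client code. The sketch below is hypothetical: extract_matching is an invented name, and it relies only on getnames() and extract(path=..., targets=...) as used in the test above.

```python
import re
from typing import Any, List


def extract_matching(archive: Any, path: Any, pattern: str) -> List[str]:
    """Hypothetical helper: extract the members whose name matches `pattern`."""
    regex = re.compile(pattern)
    # re.match anchors at the start of the name only, as in the snippet above.
    targets = [name for name in archive.getnames() if regex.match(name)]
    archive.extract(path=path, targets=targets)
    return targets
```

Called as extract_matching(archive, tmp_path, r'scripts.*'), it would pass 'scripts' and 'scripts/py7zr' as targets for the test archive.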

michaelfecher commented on August 25, 2024

When trying to run with your changes, I get an error:

  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/py7zr.py", line 772, in extract
    self.worker.extract(self.fp, multithread=multi_thread)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 261, in extract
    self.extract_single(fp, self.files, self.src_start)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 268, in extract_single
    fileish.open()
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 111, in open
    self.fp = self.target.open(mode=mode)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/pathlib.py", line 1203, in open
    opener=self._opener)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/pathlib.py", line 1058, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'FUBAR/65155/feedback.txt'

I called the extract method like this:

zip_location = '/home/mf/code/py7zr/BIG_ARCHIVE_FILE.7z'
archive = py7zr.SevenZipFile(zip_location, mode='r')
# targets are in the archive!
targets = ['FUBAR/65155/feedback.txt.6', 'FUBAR/65155/feedback.txt.5', 'FUBAR/65155/feedback.txt.4', 'FUBAR/65155/feedback.txt.3', 'FUBAR/65155/feedback.txt.2', 'FUBAR/65155/feedback.txt.1', 'FUBAR/65155/feedback.txt']
archive.extract(targets=targets)

An error also occurs if I provide an output path...

  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/py7zr.py", line 772, in extract
    self.worker.extract(self.fp, multithread=multi_thread)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 261, in extract
    self.extract_single(fp, self.files, self.src_start)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 268, in extract_single
    fileish.open()
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 111, in open
    self.fp = self.target.open(mode=mode)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/pathlib.py", line 1203, in open
    opener=self._opener)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/pathlib.py", line 1058, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/logs/FUBAR/65155/feedback.txt'

/tmp/logs/ was provided as the path argument.

miurahr commented on August 25, 2024

This is what I mentioned in

3. When the user specifies files under directories, py7zr should create those directories before extracting the files. If the user does not specify the parent directory, the extract() call fails.

michaelfecher commented on August 25, 2024

I'm asking very specifically because the 2nd sentence confuses me...
Would the workaround be to create the dirs for targets and path in the client code before calling extract()?

miurahr commented on August 25, 2024

It means you should call it with

- targets = ['FUBAR/65155/feedback.txt.6', 'FUBAR/65155/feedback.txt.5', 'FUBAR/65155/feedback.txt.4', 'FUBAR/65155/feedback.txt.3', 'FUBAR/65155/feedback.txt.2', 'FUBAR/65155/feedback.txt.1', 'FUBAR/65155/feedback.txt']
+ targets = ['FUBAR', 'FUBAR/65155', 'FUBAR/65155/feedback.txt.6', 'FUBAR/65155/feedback.txt.5', 'FUBAR/65155/feedback.txt.4', 'FUBAR/65155/feedback.txt.3', 'FUBAR/65155/feedback.txt.2', 'FUBAR/65155/feedback.txt.1', 'FUBAR/65155/feedback.txt']
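
Adding the parent directories by hand can be automated in client code. A minimal sketch, assuming archive member names are relative and use '/' as in the lists above; with_parent_dirs is a hypothetical helper, not a py7zr function.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def with_parent_dirs(files: Iterable[str]) -> List[str]:
    """Return the parent directories of `files` (shallowest first), then the files."""
    files = list(files)
    dirs = set()
    for name in files:
        for parent in PurePosixPath(name).parents:
            if str(parent) != '.':  # '.' is the implicit root of a relative path
                dirs.add(str(parent))
    # Sort by depth first so that 'FUBAR' always precedes 'FUBAR/65155'.
    ordered_dirs = sorted(dirs, key=lambda d: (d.count('/'), d))
    return ordered_dirs + [f for f in files if f not in dirs]
```

For example, with_parent_dirs(['FUBAR/65155/feedback.txt.6', 'FUBAR/65155/feedback.txt']) yields 'FUBAR' and 'FUBAR/65155' followed by the two files.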

miurahr commented on August 25, 2024

Please see

py7zr/tests/test_basic.py

Lines 374 to 379 in e51772a

def test_py7zr_extract_specified_file(tmp_path):
    archive = py7zr.SevenZipFile(open(os.path.join(testdata_path, 'test_1.7z'), 'rb'))
    expected = [{'filename': 'scripts/py7zr', 'mode': 33261, 'mtime': 1552522208,
                 'digest': 'b0385e71d6a07eb692f5fb9798e9d33aaf87be7dfff936fd2473eab2a593d4fd'}
                ]
    archive.extract(path=tmp_path, targets=['scripts', 'scripts/py7zr'])

That is, not archive.extract(path=tmp_path, targets=['scripts/py7zr']) but archive.extract(path=tmp_path, targets=['scripts', 'scripts/py7zr']).

michaelfecher commented on August 25, 2024

Thanks for the hint!
Adapted my code accordingly.
Unfortunately, I'm hitting another issue now :/

I am opening the 7z file BEFORE a loop.
In the loop, I extract the corresponding files via the extract function.
The first iteration is fine; everything behaves as it should.
Unfortunately, in the 2nd iteration an error occurs in the extract method:

  File "/home/mf/miniconda3/envs/xds/lib/python3.6/site-packages/py7zr/py7zr.py", line 772, in extract
    self.worker.extract(self.fp, multithread=multi_thread)
  File "/home/mf/miniconda3/envs/xds/lib/python3.6/site-packages/py7zr/compression.py", line 261, in extract
    self.extract_single(fp, self.files, self.src_start)
  File "/home/mf/miniconda3/envs/xds/lib/python3.6/site-packages/py7zr/compression.py", line 273, in extract_single
    self.decompress(fp, f.folder, fileish, f.uncompressed[-1], f.compressed)
  File "/home/mf/miniconda3/envs/xds/lib/python3.6/site-packages/py7zr/compression.py", line 301, in decompress
    assert out_remaining == 0
AssertionError

My code for the extraction looks like this:

from pathlib import Path

import py7zr

# zip_location and fubar (a mapping to lists of archive member names) are defined elsewhere.
archive = py7zr.SevenZipFile(open(zip_location, 'rb'))
for not_relevant, files_to_extract_list in fubar.items():
    # Collect every parent directory of the requested files, skipping '.'.
    unique_dirs = {str(p) for b in files_to_extract_list
                   for p in Path(b).parents
                   if str(p) != '.'}
    sorted_unique_dirs = sorted(unique_dirs, key=len)
    all_dirs_and_files = [*sorted_unique_dirs, *files_to_extract_list]
    archive.extract(path='/tmp/logs',
                    targets=all_dirs_and_files)

Values of all_dirs_and_files per run:

1st iteration:
['FUBAR', 'FUBAR/65155', 'FUBAR/65155/feedback.txt.6', 'FUBAR/65155/feedback.txt.5', 'FUBAR/65155/feedback.txt.4', 'FUBAR/65155/feedback.txt.3', 'FUBAR/65155/feedback.txt.2', 'FUBAR/65155/feedback.txt.1', 'FUBAR/65155/feedback.txt']

2nd iteration:
['FUBAR', 'FUBAR/65268', 'FUBAR/65268/feedback.txt.5', 'FUBAR/65268/feedback.txt.4', 'FUBAR/65268/feedback.txt.3', 'FUBAR/65268/feedback.txt.2', 'FUBAR/65268/feedback.txt.1', 'FUBAR/65268/feedback.txt']

michaelfecher commented on August 25, 2024

strange...
when I move

archive = py7zr.SevenZipFile(open(zip_location, 'rb'))

into the loop, it works.
Is that intended?
Asking because I'm used to opening the file once, doing the work, and closing the file, or relying on auto-closing à la with open(...).
I don't know the details of the implementation, but won't there be an issue with the number of open file handles?
Still I'm super happy, that it works now 👍
Big thanks for your support so far!!
Will check it out, how it will behave with the big 7z files ;)

miurahr commented on August 25, 2024

The 7-zip format basically uses 'solid' archives, in which all files are compressed into a single archive stream.
When extracting data from the stream, the decompressor has to read the data from the beginning, even if the target data is placed at the end of the stream.
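
As a simplified picture of why this is so, the sketch below stores three made-up "files" back to back in one LZMA stream using the standard lzma module. It is not the 7z container format, only an illustration of the solid-stream idea; members and read_member are invented for the example.

```python
import lzma

# Three "files" stored back to back in a single compressed stream,
# a simplified picture of a solid archive.
members = [('a.txt', b'A' * 1000), ('b.txt', b'B' * 2000), ('c.txt', b'C' * 500)]
solid = lzma.compress(b''.join(data for _, data in members))


def read_member(stream: bytes, wanted: str) -> bytes:
    """Decode the stream from its start and keep only the wanted member."""
    decoder = lzma.LZMADecompressor()
    plain = decoder.decompress(stream)  # everything before `wanted` is decoded too
    offset = 0
    for name, data in members:
        if name == wanted:
            return plain[offset:offset + len(data)]
        offset += len(data)
    raise KeyError(wanted)
```

Reading 'c.txt' still decodes the bytes of 'a.txt' and 'b.txt' first; they are simply thrown away afterwards.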

Both the extract() and extractall() methods have to read and process all of the archive data, even if it is several gigabytes.
extract() reads all the data, discards the chunks that were not requested, and writes only the specified chunks to files.
extractall() reads all the data and writes every data chunk to its target file.

After you have called the extract() method, the internal file pointer is positioned at the end of the data.
We could seek the file pointer back to the start of the data at each iteration, but that is quite inefficient.

You want to process a large archive (30 GB) in a loop. With two iterations, you read 30 GB x 2 = 60 GB from disk. With ten iterations, you read 300 GB from disk!

The solid 7-zip format does not support random access by its nature; it is optimized for compression ratio instead.

Users are recommended to construct the list of files to extract first (you can use a loop there) and then call extract() only once.
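
One way to follow that recommendation is to merge the per-iteration lists before touching the archive. A minimal sketch: merge_targets and the group keys are hypothetical, and the target names are taken from the lists posted earlier in this thread.

```python
from typing import Dict, Iterable, List


def merge_targets(groups: Dict[str, Iterable[str]]) -> List[str]:
    """Flatten per-group target lists into one de-duplicated list, keeping order."""
    merged: List[str] = []
    seen = set()
    for targets in groups.values():
        for name in targets:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


groups = {
    '65155': ['FUBAR', 'FUBAR/65155', 'FUBAR/65155/feedback.txt'],
    '65268': ['FUBAR', 'FUBAR/65268', 'FUBAR/65268/feedback.txt'],
}
all_targets = merge_targets(groups)
# A single pass over the archive then covers every group:
# archive.extract(path='/tmp/logs', targets=all_targets)
```

The loop runs over plain Python lists, so the archive itself is read only once.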

miurahr commented on August 25, 2024

You could call the reset() method before the second call to extract(); it resets the file pointer to the beginning of the data chunk.
However, reset() does not reset the LZMA decompressor state object, so decompressor.eof is still True and the decompressor does not accept any more data.

miurahr commented on August 25, 2024

Unfortunately, the lzma module of the Python standard library does not have a reset() method. The lzma object is created when the archive header is read, so you cannot call extract() twice.
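
The behaviour of the standard library decompressor can be seen in isolation. The sketch below uses only the lzma module on a made-up payload, not py7zr itself: once a decompressor object has reached the end of its stream, feeding it again raises EOFError, and a new object has to be created.

```python
import lzma

payload = lzma.compress(b'hello world')

decoder = lzma.LZMADecompressor()
first = decoder.decompress(payload)
finished = decoder.eof  # True once the end-of-stream marker has been read

try:
    decoder.decompress(payload)  # feeding the same object again
    second_error = None
except EOFError as exc:
    second_error = exc

# The standard library offers no way to rewind the object; a fresh one is needed.
again = lzma.LZMADecompressor().decompress(payload)
```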

miurahr commented on August 25, 2024

Thanks @michaelfecher for testing.
PR #64 now provides extraction of specific files and supports iterating.
See

py7zr/tests/test_basic.py

Lines 405 to 412 in ae9e76a

@pytest.mark.api
def test_py7zr_extract_and_reset_iteration(tmp_path):
    archive = py7zr.SevenZipFile(open(os.path.join(testdata_path, 'test_1.7z'), 'rb'))
    iterations = archive.getnames()
    for target in iterations:
        archive.extract(path=tmp_path, targets=[target])
        archive.reset()
    archive.close()
