dtrx-py / dtrx

This project forked from brettcs/dtrx

Do The Right Extraction

License: GNU General Public License v3.0

Python 85.20% CSS 0.35% HTML 5.43% Dockerfile 2.52% Shell 1.96% Makefile 1.77% JavaScript 2.78%

dtrx's Introduction

dtrx

"Do The Right eXtraction" - don't remember what set of tar flags or where to pipe the output to extract it? no worries!

TL;DR

pip install dtrx

dtrx yolo.tar.gz

This is a copy-paste of the original dtrx repo, and all credit for this software should be attributed to the original author, Brett Smith @brettcs:

https://github.com/brettcs/dtrx

See the original README for more details on what this does!

Changes in this repo

This repo contains some patches on top of the original source to enable using dtrx with python3. The original motivation was to enable dtrx on Ubuntu 20.04+, where the dtrx apt package was removed from the default ppas (likely due to being python2 only).

I attempted to get the tests all working via tox, for which I used a Dockerfile to try to get some kind of environment consistency. You can run the tests by running (requires Docker installed):

./test.sh

Development

Contributions

Contributions are gladly welcomed! Feel free to open a Pull Request with any changes.

Issues

When posting an issue, it can be very handy to provide any example files (for example, the archive that failed to extract) or reproduction steps so we can address the problem quickly.

Releases

Releases are tagged in this repo and published to pypi.org. Maintainers follow these steps to cut a release:

  1. update the version specifier:

    # update the VERSION value in dtrx/dtrx.py, then:
    ❯ git add dtrx/dtrx.py
    ❯ git commit  # fill in the commit message
  2. create an annotated tag for the release. usually good to put a list of new commits since the previous tag, for example by listing them with:

    ❯ git log $(git describe --tags --abbrev=0)..HEAD --oneline
    # create the annotated tag
    ❯ git tag -a <version number>

    be sure to push the tag, git push --tags.

  3. use the make publish-release command to build and publish to GitHub and PyPI

See the Makefile for details on what that rule does.

Invoke + Tests

There are some minimal helper scripts for pyinvoke under tasks/.

To bootstrap, run pip install -r requirements.txt, then inv --list to see available tasks:

❯ inv --list
Available tasks:

  build-docker                build docker image
  push-docker                 push docker image
  quick-test                  run quick tests in docker
  rst2man                     run rst2man in docker
  test-nonexistent-file-cmd   run test-nonexistent-file-cmd.sh
  tox                         run tox in docker
  windows                     just check that windows install fails. pulls a minimal wine docker image to test

To run the tests, run inv tox. Takes a couple of minutes to go through all the python versions.

Linting

Linting is provided by pre-commit. To use it, first install the pre-commit hook:

pip install pre-commit
pre-commit install

pre-commit will run anytime git commit runs (disable with --no-verify). You can manually run it with pre-commit run.

Docker

The tests in CI (and locally) can be run inside a Docker container, which provides all the tested python versions.

This image is defined at Dockerfile. It's pushed to the GitHub Container Registry so it can be managed by the dtrx-py organization on GitHub, since Docker Hub charges for Organizations.

There are Invoke tasks for building + pushing the Docker image, which push both a :latest tag as well as a :2022-09-16 ISO8601 numbered tag. The tag can then be updated in the GitHub actions runner.

Note: there's a bit of complexity around how the image is used, because the dtrx tests need to run as a non-root user (there's one test that checks for error handling when the output directory is not accessible by the current user). To deal with this, there's an entrypoint script that switches to a non-root user that still has read/write access to the mounted host volume (which is the cwd, intended for local development work). This is required on Linux, where it's nice to have the host+container UID+GID matching, so any changes to the mounted host volume have the same permissions set.

In the GitHub actions runner, we need to run inside the same container (to have access to the correct python versions for testing), and the github action for checkout assumes it can write to somewhat arbitrary locations in the file system (basically root access). So we switch to the non-root user after checkout.

dtrx's People

Contributors

brettcs, chrisjefferson, dilinger, noahp, scop, sr-verde

dtrx's Issues

DTRX and docker saved images

I have a wrapper for dtrx and use it in a batch process.

When docker images are saved as described in the documentation and dtrx is run on the exported tar.gz, it extracts everything and at some point runs cpio on it. During this command it gets stuck and shows an error about malformed content. I have to kill the cpio command at that moment so that my batch process continues. Has anyone extracted a docker image with dtrx before?

Fail faster on Windows

I've recently installed dtrx on Windows, tried unpacking several archives and failed. Then I looked up the Trove classifiers (defined here in setup.cfg) to see that the only operating system listed there is Operating System :: POSIX and conclude that Windows isn't supported.

But if Windows isn't supported, perhaps it's better to fail at pip install time and not at call time? Add an assert statement to the setup script or something. It would also be cool if pip checked Trove classifiers for this sort of thing automatically, but that issue is for a different repo...
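The assert-style check suggested above could look roughly like the sketch below. The function name `ensure_posix` is hypothetical and not part of dtrx. A guard in the setup script only runs when the package is built from source; a pre-built wheel is installed without executing it, so this is an illustration of the idea rather than a complete fix.

```python
import platform


def ensure_posix(system=None):
    """Raise early if the platform is Windows; otherwise return its name."""
    # `system` can be injected for testing; by default ask the interpreter.
    system = system if system is not None else platform.system()
    if system == "Windows":
        raise RuntimeError("dtrx does not support Windows (POSIX only)")
    return system


if __name__ == "__main__":
    # A setup script would call this before doing any other work.
    print(ensure_posix())
```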

[feature request] bzip3 support

I don't see many people using it (yet), but bzip3 looks pretty impressive from the claims in the README:

https://github.com/kspalaiologos/bzip3

If the claims are accurate, then I expect that we'll see people switching away from bz2 and xz to bz3. It would be good to support bz3 in that case. So, opening a place holder bug here. :)

Archives with passwords cause an infinite hang while waiting for input

A number of archive types support encryption with a password: zip, rar, 7z.. When dtrx encounters one of these, it just hangs waiting for the user to put in a password. However, it doesn't ask for a password, nor does it echo the password prompt from the underlying command. Instead, it just sits there silently waiting.

This happens because dtrx sets stdout to /dev/null and saves stderr to a temporary file, so the user never sees the password prompt. Zip and rar write their prompts to stderr, and 7z writes its prompt to stdout.
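One possible way to avoid a silent hang is to give the child process no usable stdin, so a prompt that reads stdin sees end-of-file immediately instead of waiting. The sketch below is not how dtrx is implemented; it uses a small Python child as a stand-in for an extractor that prompts, and `run_without_stdin` is a hypothetical helper. Tools that read the password from the terminal device directly rather than from stdin would not be affected by this.

```python
import subprocess
import sys


def run_without_stdin(command):
    """Run `command` with stdin closed off, capturing both output streams."""
    return subprocess.run(
        command,
        stdin=subprocess.DEVNULL,  # a read from stdin hits EOF right away
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        timeout=30,  # safety net in case the child still blocks
    )


# Stand-in for an extractor that asks for a password on stdin.
child = [sys.executable, "-c", "input('Enter password: ')"]
result = run_without_stdin(child)

# The child fails fast instead of hanging, and its error output is available.
print(result.returncode)
print(result.stderr.strip().splitlines()[-1])
```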


In addition, errors to stderr seem to be thrown away, or shown AFTER we're asked where to output (invalid) files

For example:

dilinger@e7470:$ echo foo > test2; zip -P foobar file.zip test2
updating: test2 (stored 0%)
dilinger@e7470:$ dtrx file.zip

file.zip contains one file but its name doesn't match.
Expected: file
Actual: test2
You can:

  • extract the file _I_nside a new directory named file
  • extract the file and _R_ename it file
  • extract the file _H_ere

What do you want to do? (I/r/h) i
dtrx: WARNING: extracting /home/dilinger/file.zip to ./file.rl629f5n
dtrx: WARNING: Error output from this process:
ERROR: Wrong password : test2

[comment] Lifesaver project

I have been looking for a python3 port of this for a while. I attempted to convert it myself but failed. Thanks!

Bug: extracting a zip file with 2 identically-named files fails waiting for input to the `zip` command

Apparently zip files can somehow contain 2 files with the same filename (and thus same path in the zip file). The zip command tries to prompt the user for an action.

Here is an example of a file that does not work in dtrx: identical_names_same_dir.zip

Fix Options

  • Minimum viable fix: raise an error.
  • Ideally, add interactive and non-interactive handlers, where the file is extracted and renamed in non-interactive mode

Unable to install on Ubuntu Unity 23.04

Hello,

The install command mentioned in the README is: pip install dtrx.

Obviously, when running it on a freshly installed distro, it outputs this:

Command 'pip' not found, but can be installed with:
sudo apt install python3-pip

But after executing that command and then retrying the first one, a new message, one that I never encountered before, appears:

error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try apt install
    python3-xyz, where xyz is the package you are trying to
    install.
    
    If you wish to install a non-Debian-packaged Python package,
    create a virtual environment using python3 -m venv path/to/venv.
    Then use path/to/venv/bin/python and path/to/venv/bin/pip. Make
    sure you have python3-full installed.
    
    If you wish to install a non-Debian packaged Python application,
    it may be easiest to use pipx install xyz, which will manage a
    virtual environment for you. Make sure you have pipx installed.
    
    See /usr/share/doc/python3.11/README.venv for more information.

note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
hint: See PEP 668 for the detailed specification.

So I ran sudo apt install pipx and then pipx install dtrx, which led to the following output:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/packaging/requirements.py", line 35, in __init__
    parsed = parse_requirement(requirement_string)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/packaging/_parser.py", line 64, in parse_requirement
    return _parse_requirement(Tokenizer(source, rules=DEFAULT_RULES))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/packaging/_parser.py", line 82, in _parse_requirement
    url, specifier, marker = _parse_requirement_details(tokenizer)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/packaging/_parser.py", line 120, in _parse_requirement_details
    specifier = _parse_specifier(tokenizer)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/packaging/_parser.py", line 206, in _parse_specifier
    with tokenizer.enclosing_tokens("LEFT_PARENTHESIS", "RIGHT_PARENTHESIS"):
  File "/usr/lib/python3.11/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/usr/lib/python3/dist-packages/packaging/_tokenizer.py", line 183, in enclosing_tokens
    self.raise_syntax_error(
  File "/usr/lib/python3/dist-packages/packaging/_tokenizer.py", line 163, in raise_syntax_error
    raise ParserSyntaxError(
packaging._tokenizer.ParserSyntaxError: Expected closing RIGHT_PARENTHESIS
    platform (==unsupported) ; platform_system == "Windows"
             ~^

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/bin/pipx", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/usr/lib/python3/dist-packages/pipx/main.py", line 819, in cli
    return run_pipx_command(parsed_pipx_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pipx/main.py", line 202, in run_pipx_command
    return commands.install(
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pipx/commands/install.py", line 60, in install
    venv.install_package(
  File "/usr/lib/python3/dist-packages/pipx/venv.py", line 244, in install_package
    self._update_package_metadata(
  File "/usr/lib/python3/dist-packages/pipx/venv.py", line 318, in _update_package_metadata
    venv_package_metadata = self.get_venv_metadata_for_package(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pipx/venv.py", line 300, in get_venv_metadata_for_package
    venv_metadata = inspect_venv(
                    ^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pipx/venv_inspect.py", line 251, in inspect_venv
    app_paths_of_dependencies = _dfs_package_apps(
                                ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pipx/venv_inspect.py", line 121, in _dfs_package_apps
    dependencies = get_package_dependencies(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pipx/venv_inspect.py", line 54, in get_package_dependencies
    for req in map(Requirement, dist.requires or []):
  File "/usr/lib/python3/dist-packages/packaging/requirements.py", line 37, in __init__
    raise InvalidRequirement(str(e)) from e
packaging.requirements.InvalidRequirement: Expected closing RIGHT_PARENTHESIS
    platform (==unsupported) ; platform_system == "Windows"

Thanks

Support Chrome Extension (.crx) files

Despite the slightly different header (thus causing file to report the file as "Google Chrome extension"), these are basically ZIP files, and can be extracted by the unzip command identically to normal ZIP files.
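The claim that extra leading bytes do not stop a ZIP reader can be illustrated with Python's standard zipfile module, which locates the archive from its end-of-archive record. This is a sketch only: the prepended bytes below are arbitrary placeholder data, not the real CRX header layout, and the issue itself refers to the unzip command rather than to Python.

```python
import io
import zipfile

# Build a small ZIP archive in memory.
payload = io.BytesIO()
with zipfile.ZipFile(payload, "w") as zf:
    zf.writestr("manifest.json", "{}")

# Prepend placeholder bytes to imitate a non-ZIP header before the archive.
# This is NOT the actual CRX header format, just arbitrary leading data.
fake_crx = b"Cr24" + b"\x00" * 12 + payload.getvalue()

# The ZIP reader still finds the central directory and the member data.
with zipfile.ZipFile(io.BytesIO(fake_crx)) as zf:
    names = zf.namelist()
    manifest = zf.read("manifest.json")

print(names, manifest)
```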

dtrx gets stuck sometimes

We are using dtrx to uncompress various files in a given directory. We noticed, however, that quite often dtrx gets stuck on some file (or even file types). I could not find a pattern here, but we often see this with ISO files.

Could you check whether dtrx can be improved so that it exits cleanly even if it is given a file (or file type) it cannot work with?

Thanks,
André

[Feature request] Specify destination

Hello,

I sometimes want to extract an archive elsewhere than its own location, so it would be nice for dtrx to allow specifying destination.

Thanks
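Until such an option exists, a wrapper script could approximate it by running dtrx from inside the destination directory. The sketch below assumes dtrx writes its output relative to the current working directory; `extract_to` and its `runner` parameter are hypothetical names introduced for illustration.

```python
import os
import subprocess


def extract_to(archive, destination, runner=subprocess.run):
    """Run dtrx on `archive` with `destination` as the working directory."""
    os.makedirs(destination, exist_ok=True)
    # The archive path must be absolute because the child process
    # starts in a different directory.
    command = ["dtrx", os.path.abspath(archive)]
    return runner(command, cwd=destination, check=True)
```

The `runner` argument exists only so the function can be exercised without dtrx installed; in normal use it would be left at its default.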

platform==unsupported line preventing install

The line in setup.cfg that prevents installation on Windows seems to be causing errors when installing on other platforms. I've tried pipx on Mac and the standard AUR install on Arch, and others seem to be having the same issue. The root of the error seems to be the following; just patching out the line lets me install just fine.

packaging.requirements.InvalidRequirement: Expected end or semicolon (after name and no valid version specifier)
    platform==unsupported;platform_system=="Windows"

Crash with file not found even though the file does exist

Using dtrx with any (or no) options crashes with the following stacktrace:

dtrx hello.zip
Traceback (most recent call last):
  File "/usr/local/bin/dtrx", line 1404, in <module>
    sys.exit(app.run())
  File "/usr/local/bin/dtrx", line 1388, in run
    self.try_extractors(filename,
  File "/usr/local/bin/dtrx", line 1335, in try_extractors
    for extractor in builder:
  File "/usr/local/bin/dtrx", line 1054, in get_extractor
    getattr(self, 'try_by_' + func_name)(self.filename)
  File "/usr/local/bin/dtrx", line 1081, in try_by_magic
    process = subprocess.Popen(['file', '-zL', filename],
  File "/usr/local/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'file'
~# command terminated with exit code 137

The file exists and can even be uncompressed using unzip.

Using dtrx-noahp-7.1.2
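In the traceback above, the FileNotFoundError names 'file', which is the external program passed to subprocess.Popen, not hello.zip. A clearer failure could be produced by checking for the program before launching it. A minimal sketch, where `require_tool` is a hypothetical helper and not part of dtrx:

```python
import shutil


def require_tool(name):
    """Return the full path of an external program, or raise a clear error."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"required external program not found on PATH: {name!r}"
        )
    return path
```

Calling `require_tool("file")` before spawning the process would turn the confusing message into one that names the missing program explicitly.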

Extensibility of file types; en- and disable them for a call

Thanks for this tool!

It would be great if it could use additional configuration (/etc/dtrx.conf or a .d directory) to specify how to extract other kinds of data - like .jar, .odt, .docx, etc.

Then it might be useful to enable/disable recursive extraction of these for individual calls - sometimes I want to extract everything (virus scanning), other times text data (.odt) should be ignored.

`pip install dtrx` failing on Windows with Python 3.9

Using pip install dtrx, the install failed on Windows 10 with Python 3.9.

Error message:

      __main__.UnsupportedPython: One or more packages do not support your version of Python (3.9.13). The installation will stop now. To force installation, set the ALLOW_UNSUPPORTED_PYTHON environment variable to any value. This may result in a broken package environment.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for unsupported-python
Failed to build unsupported-python
ERROR: Could not build wheels for unsupported-python, which is required to install pyproject.toml-based projects

What's going on here?

If the issue is that Windows is outright unsupported, then that should be indicated in the README.

[FR] CLIs are nice but APIs are better

First, thanks for taking this project over! Great effort.

My feature request: I would love to see the monolithic ExtractorApplication broken down into smaller pieces. Ideally, I would be able to do something like this instead of calling ExtractorApplication.run(...) in my code:

archive = dtrx.Archive.from_path(...)
for path in archive:
  print(path)
archive.extract(dst=...)

Cheers!
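To make the requested shape concrete, here is a minimal sketch of such an interface built on Python's standard zipfile module. Everything in it is hypothetical: dtrx does not provide an `Archive` class, and this version handles ZIP files only.

```python
import zipfile


class Archive:
    """Hypothetical wrapper illustrating the requested interface (ZIP only)."""

    def __init__(self, path):
        self.path = path

    @classmethod
    def from_path(cls, path):
        # Reject anything the standard library does not recognise as ZIP.
        if not zipfile.is_zipfile(path):
            raise ValueError(f"unsupported archive: {path}")
        return cls(path)

    def __iter__(self):
        # Yield member paths without extracting anything.
        with zipfile.ZipFile(self.path) as zf:
            yield from zf.namelist()

    def extract(self, dst):
        # Extract every member below `dst`, creating it if needed.
        with zipfile.ZipFile(self.path) as zf:
            zf.extractall(dst)
```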

Cannot list contents of 7z archive

Hi,
I really love having one tool to deal with so many archive formats. Thanks.

Unfortunately, I'm getting this error when trying to list archive contents with this command:
dtrx -l archive.7z .

I first experienced this error a few years ago. Since then I have tried many times with different Python versions
over multiple system installs.

Maybe it's related to the 7z output when listing.

File "/home/user/.pyenv/versions/3.11.2/lib/python3.11/site-packages/dtrx/dtrx.py", line 716, in get_filenames
    fn_index = string.rindex(line, " ") + 1
               ^^^^^^^^^^^^^
AttributeError: module 'string' has no attribute 'rindex'

Again thanks for the good work.
Have a nice day.
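The AttributeError arises because Python 3's string module has no rindex function; the equivalent is the rindex method on str objects. A minimal sketch of the same lookup written that way, where `filename_from_line` is a hypothetical name and the sample line is made up rather than real 7z output:

```python
def filename_from_line(line):
    """Return the text after the last space in a listing line."""
    # Python 3: call the str method instead of string.rindex(line, " ").
    fn_index = line.rindex(" ") + 1
    return line[fn_index:]


# Made-up listing line for illustration only.
print(filename_from_line("col1 col2 docs/readme.txt"))
```

As in the original line of code, a file name that itself contains spaces would be cut at its last space.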

Non-interactive mode not always non-interactive

The unzip and 7z handlers don't seem to implement non-interactive mode when the switch is activated. It works after changing the lines to:

extract_command = ['7z', 'x', '-y']

and

extract_command = ['unzip', '-o', '-q']

But this forces non-interactive mode without checking for the argument at all. It works in my case/fork but might be different for you. Still, this is a bug.
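A variant that respects the switch would add the no-prompt flags only when non-interactive mode is requested. The sketch below is hypothetical and not taken from dtrx; `build_extract_command` and its `interactive` parameter are names introduced here, and the flags are the same ones proposed above.

```python
def build_extract_command(tool, interactive):
    """Return the argv prefix for `tool`, adding no-prompt flags if needed."""
    if tool == "7z":
        command = ["7z", "x"]
        if not interactive:
            command.append("-y")  # assume "yes" on all queries
    elif tool == "unzip":
        command = ["unzip", "-q"]
        if not interactive:
            command.append("-o")  # overwrite files without prompting
    else:
        raise ValueError(f"unknown tool: {tool}")
    return command


print(build_extract_command("7z", interactive=False))
print(build_extract_command("unzip", interactive=True))
```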

I am writing in this repo because it is the latest active one.

dtrx doesn't error on missing passwords in non-interactive mode

What happens:

dtrx doesn't error on missing passwords on rar archives. I have a password protected archive containing one file file1.txt. When trying to extract the file, I’ll get a password prompt (that will be killed immediately) and after that there is a dir password-protected that contains one empty file.

What I expect:

dtrx should print an error "cannot extract encrypted archive" (as when trying to extract other archive types).

How can you reproduce it:

  1. Create a password protected RAR archive containing a file
  2. Extract it in non-interactive mode

My log:

$ ls
password-protected.rar
$ dtrx -n password-protected.rar 


Enter password (will not be echoed) for file1.txt: %                                             
$ ls 
password-protected  password-protected.rar
$ ls password-protected
file1.txt
$ cat password-protected/file1.txt 
$ 

dtrx warnings

Sometimes, when calling dtrx like this:

dtrx -rn $FILE

I receive many warnings à la:

dtrx: WARNING: extracting /data/av-buffer/tmpFilesArchives/VMware-vCenter-Server-Appliance-6.7.0.52000-19300125-patch-FP/ncurses-6.0-10.ph1/usr/share/man/man3/curs_slk.3x.gz to /data/av-buffer/tmpFilesArchives/VMware-vCenter-Server-Appliance-6.7.0.52000-19300125-patch-FP/ncurses-6.0-10.ph1/usr/share/man/man3/curs_slk.3x.2jzoh7vj

There are dozens of those warnings. What exactly is the meaning of them and does it mean that there are any conflicts?

Thanks

Why require twine?

When I try to build dtrx without having twine installed, I get:

* Getting dependencies for wheel...

ERROR Missing dependencies:
	twine>=1.11.0

Why does dtrx require twine? This was added in 24eeff4 "to enable upload" according to the commit message, but I'm not trying to upload anything.

As I understand it, dtrx is a utility to extract archives and twine is a utility to upload things to pypi. It doesn't seem like dtrx should need twine.

There are no occurrences of "twine" in the source code except in setup_requires in setup.py. If I remove the mention of twine there, it builds fine without twine.

--- setup.py.orig	2021-09-15 15:45:01.000000000 -0500
+++ setup.py	2022-01-07 16:28:35.000000000 -0600
@@ -70,6 +70,6 @@
     long_description_content_type="text/markdown",
     # using markdown as pypi description:
     # https://dustingram.com/articles/2018/03/16/markdown-descriptions-on-pypi
-    setup_requires=["setuptools>=38.6.0", "wheel>=0.31.0", "twine>=1.11.0"],
+    setup_requires=["setuptools>=38.6.0", "wheel>=0.31.0"],
     install_requires=install_requires,

.rar files depends on number of files

Hello @noahp
I have an ETL project where I use dtrx as the backbone for unzipping a lot of different files. Sometimes there are .rar files that fail. I have been able to reduce the problem to a bare-minimum reproduction.

Reproduction steps:
Create a folder called dtrx_test and run the following Python script inside it. It should create 708 txt files.

# Dtrx works just fine.
for i in range(708):
    with open(f"{str(i)}.txt", "w") as f:
        f.write("this is a txt file")

# Dtrx doesnt work
# for i in range(709):
#     with open(f"{str(i)}.txt", "w") as f:
#         f.write("this is a txt file")

Compressing with 

rar a works.rar dtrx_test

This yields a rar file that dtrx can extract. Running the script again with 709 txt files creates a rar file that dtrx can't extract.

dtrx works.rar --noninteractive
dtrx does_not_work.rar --noninteractive

However, running:

unrar -o+ x works.rar
unrar -o+ x does_not_work.rar

This works for both archives, so unrar itself is able to handle them.

I've added both files as a zip file.
(Running dtrx dtrx.zip --recursive --noninteractive should also fail for the zip file.)

NB: Having subdirs, e.g. dtrx_test/subdir/(a lot of files), changes how many files are needed for the extraction to work; however, I think fixing one problem fixes the other.

dtrx.zip

Official list of supported file types

Hi,

We are using dtrx as part of larger scripts to unroll / extract huge bulks of files. As part of the script, dtrx receives input from 'find'. However, this means we call dtrx on files that are obviously not archives. Hence, I would like to create a filter to only call dtrx on files that have supported extensions.

Is there an official list of all the file types supported by dtrx? The README lists a bunch, but that list is not exhaustive.

Thanks
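The filter described above could be sketched as follows. The suffix set is a placeholder chosen for illustration and is not an official list of what dtrx supports; `looks_like_archive` is a hypothetical helper. Since a traceback in another issue shows dtrx also trying detection by magic (try_by_magic), a filter based purely on extensions may skip archives with unusual names.

```python
from pathlib import PurePath

# Placeholder set for illustration only; NOT an official dtrx list.
CANDIDATE_SUFFIXES = {".zip", ".rar", ".7z", ".gz"}


def looks_like_archive(path, suffixes=CANDIDATE_SUFFIXES):
    """Return True if the final suffix of `path` is in `suffixes`."""
    # Compare case-insensitively so "FILE.ZIP" is treated like "file.zip".
    return PurePath(path).suffix.lower() in suffixes


paths = ["data/backup.tar.gz", "notes.txt", "IMAGES.ZIP"]
print([p for p in paths if looks_like_archive(p)])
```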
