beatsbears / tarsafe
A safe subclass of the TarFile class for interacting with tar files. Can be used as a direct drop-in replacement for safe usage of extractall().
License: MIT License
Hello!
We're using tarsafe as part of https://github.com/datadog/guarddog/ and I noticed that for some archives (unrelated to size), it takes a lot of time to extract, much more than the stdlib tarfile.
Sample file: https://files.pythonhosted.org/packages/2a/e3/624e95d2bc75f78ab7ce45e868b3609dea9da210a9f54e0e4e2c8cf95aa3/datadog-api-client-2.10.0.tar.gz (MD5 6f20eb7f5239a051230bb0a211d11f0b, only around 3k files and 1.5M)
Reproduction:
$ time python3 -c 'import tarfile; tarfile.open("datadog-api-client.tar.gz").extractall("/tmp/tarfile")'
python3 -c 0.55s user 0.67s system 96% cpu 1.276 total
$ time python3 -c 'import tarsafe; tarsafe.open("datadog-api-client.tar.gz").extractall("/tmp/tarsafe")'
python3 -c 66.29s user 1.64s system 98% cpu 1:08.97 total
Here's a profile generated using python3 -m cProfile -s tottime repro.py on Python 3.10.9.
489374595 function calls (489363690 primitive calls) in 201.207 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
12956401 54.448 0.000 77.369 0.000 posixpath.py:337(normpath)
12960105 19.456 0.000 31.087 0.000 posixpath.py:71(join)
3600 18.788 0.005 198.102 0.055 tarsafe.py:47(_safetar_check)
12956400 15.129 0.000 153.743 0.000 tarsafe.py:64(_is_traversal_attempt)
64785773 12.333 0.000 12.333 0.000 {method 'startswith' of 'str' objects}
12956401 10.366 0.000 104.856 0.000 posixpath.py:376(abspath)
12956401 8.232 0.000 16.106 0.000 posixpath.py:60(isabs)
99561884 7.588 0.000 7.588 0.000 {method 'append' of 'list' objects}
12956400 6.601 0.000 9.943 0.000 tarsafe.py:83(_is_device)
25920105 6.439 0.000 9.825 0.000 posixpath.py:41(_get_sep)
12956405 5.042 0.000 5.042 0.000 {method 'split' of 'str' objects}
38887752 4.955 0.000 4.955 0.000 {built-in method builtins.isinstance}
12956400 4.522 0.000 6.874 0.000 tarsafe.py:69(_is_unsafe_symlink)
51832955 4.080 0.000 4.080 0.000 {built-in method posix.fspath}
12956400 4.035 0.000 5.829 0.000 tarsafe.py:76(_is_unsafe_link)
12956912 3.352 0.000 3.352 0.000 {method 'join' of 'str' objects}
12963492 2.355 0.000 2.355 0.000 tarfile.py:1417(issym)
12960129 2.340 0.000 2.340 0.000 {method 'endswith' of 'str' objects}
12963600 2.304 0.000 2.927 0.000 tarfile.py:2453(__iter__)
12963598 1.796 0.000 1.796 0.000 tarfile.py:1421(islnk)
12956400 1.725 0.000 1.725 0.000 tarfile.py:1425(ischr)
12956400 1.617 0.000 1.617 0.000 tarfile.py:1429(isblk)
3494 1.466 0.000 1.466 0.000 {built-in method io.open}
Thanks!
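One way to read the profile above: _safetar_check ran 3,600 times and _is_traversal_attempt ran 12,956,400 times, which is exactly 3,600 × 3,599. That is consistent with the whole archive being re-validated once per extracted member, i.e. quadratic work in the number of entries. A minimal sketch of a single-pass check follows. It is not tarsafe's code: the name find_unsafe_members is hypothetical, and it only covers path traversal, not links or device files.

```python
import os


def find_unsafe_members(tar, dest):
    """Return names of members that would land outside dest.

    Hypothetical helper, not part of tarsafe. Each member is inspected
    exactly once, so the cost grows linearly with the number of entries.
    """
    dest_root = os.path.realpath(dest)
    unsafe = []
    for member in tar.getmembers():  # single pass over the archive index
        target = os.path.realpath(os.path.join(dest_root, member.name))
        # A member is inside dest only if dest_root is the common prefix.
        if os.path.commonpath([dest_root, target]) != dest_root:
            unsafe.append(member.name)
    return unsafe
```

Under this assumption, the check would be called once before extraction starts, rather than from inside every per-member extract call.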
This library claims to be safe but it is not.
Since Python 2.7 tarfile uses https://github.com/python/cpython/blob/498598e8c2d64232d26c075de87c513415176bbf/Lib/tarfile.py#L2154 but all your safety checks assume os.path.sep as a directory separator.
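To illustrate the kind of mismatch being described: member names inside a tar archive use "/" as the separator on every platform, while os.path.sep is "\" on Windows. The sketch below uses ntpath.sep to stand in for Windows so it runs anywhere. Both function names are hypothetical and neither is taken from tarsafe; the second one is a deliberately conservative component check, not a complete defence (it ignores drive letters and links, for example).

```python
import ntpath


def naive_is_traversal(name, sep):
    """Hypothetical check that only recognises the platform separator."""
    return name.startswith(".." + sep) or (sep + ".." + sep) in name


def has_dotdot_component(name):
    """Treat both "/" and "\\" as separators and look for a ".." part."""
    parts = name.replace("\\", "/").split("/")
    return ".." in parts


# With the Windows separator, the naive check misses a "/"-separated name.
missed = not naive_is_traversal("../evil.txt", ntpath.sep)
caught = has_dotdot_component("../evil.txt")
```

Splitting on both separators rejects some harmless names (any path with a ".." component), which seems an acceptable trade-off for a safety check.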
I believe there's an issue with performance when iterating over large tar files. I'm not sure whether it depends on the archive size or on the number of entries.
Any chance this is fixable? Looking at the code I don't think this is possible without writing a better native Python implementation of tar extraction.
Thanks for your library btw.
The TarFile.extract function has the same issue as extractall: https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.extract
It'd be helpful to include a safe implementation of this function as well.
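A rough sketch of what a checked extract could look like, assuming a subclass of tarfile.TarFile. The names CheckedTarFile and UnsafeMemberError are hypothetical and this is not how tarsafe implements its checks. It validates only the member being extracted, and it rejects every link and device member outright, which is stricter than strictly necessary.

```python
import os
import tarfile


class UnsafeMemberError(Exception):
    """Raised when a member looks unsafe to extract (hypothetical name)."""


class CheckedTarFile(tarfile.TarFile):
    """Hypothetical subclass that validates one member before extracting it."""

    def extract(self, member, path="", *args, **kwargs):
        # Accept either a member name or a TarInfo, like TarFile.extract.
        tarinfo = self.getmember(member) if isinstance(member, str) else member
        dest_root = os.path.realpath(path)
        target = os.path.realpath(os.path.join(dest_root, tarinfo.name))
        if os.path.commonpath([dest_root, target]) != dest_root:
            raise UnsafeMemberError(f"{tarinfo.name!r} escapes {dest_root!r}")
        # Conservative assumption: refuse symlinks, hard links and devices.
        if tarinfo.issym() or tarinfo.islnk() or tarinfo.isdev():
            raise UnsafeMemberError(f"{tarinfo.name!r} is a link or device")
        return super().extract(tarinfo, path, *args, **kwargs)
```

Usage would mirror the stdlib: open the archive with CheckedTarFile.open(...) and call extract(name, dest). Only the requested member is checked, so the cost per call does not depend on the number of entries in the archive.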