GithubHelp home page GithubHelp logo

gaby / dirhash-python Goto Github PK

View Code? Open in Web Editor NEW

This project forked from andhus/dirhash-python

0.0 0.0 0.0 180 KB

Python module and CLI for hashing of file system directories based on the Dirhash Standard.

License: MIT License

Python 100.00%

dirhash-python's Introduction

codecov

dirhash

A lightweight python module and CLI for computing the hash of any directory based on its files' structure and content.

  • Supports all hashing algorithms of Python's built-in hashlib module.
  • Glob/wildcard (".gitignore style") path matching for expressive filtering of files to include/exclude.
  • Multiprocessing for up to 6x speed-up

The hash is computed according to the Dirhash Standard, which is designed to allow for consistent and collision resistant generation/verification of directory hashes across implementations.

Installation

From PyPI:

pip install dirhash

Or directly from source:

git clone [email protected]:andhus/dirhash-python.git
pip install dirhash/

Usage

Python module:

from dirhash import dirhash

dirpath = "path/to/directory"
dir_md5 = dirhash(dirpath, "md5")
pyfiles_md5 = dirhash(dirpath, "md5", match=["*.py"])
no_hidden_sha1 = dirhash(dirpath, "sha1", ignore=[".*", ".*/"])

CLI:

dirhash path/to/directory -a md5
dirhash path/to/directory -a md5 --match "*.py"
dirhash path/to/directory -a sha1 --ignore ".*"  ".*/"

Why?

If you (or your application) need to verify the integrity of a set of files as well as their name and location, you might find this useful. Use-cases range from verification of your image classification dataset (before spending GPU-$$$ on training your fancy Deep Learning model) to validation of generated files in regression-testing.

There isn't really a standard way of doing this. There are plenty of recipes out there (see e.g. these SO-questions for linux and python) but I couldn't find one that is properly tested (there are some gotcha:s to cover!) and documented with a compelling user interface. dirhash was created with this as the goal.

checksumdir is another python module/tool with similar intent (that inspired this project) but it lacks much of the functionality offered here (most notably including file names/structure in the hash) and lacks tests.

Performance

The python hashlib implementation of common hashing algorithms are highly optimised. dirhash mainly parses the file tree, pipes data to hashlib and combines the output. Reasonable measures have been taken to minimize the overhead and for common use-cases, the majority of time is spent reading data from disk and executing hashlib code.

The main effort to boost performance is support for multiprocessing, where the reading and hashing is parallelized over individual files.

As a reference, let's compare the performance of the dirhash CLI with the shell command:

find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5

which is the top answer for the SO-question: Linux: compute a single hash for a given folder & contents? Results for two test cases are shown below. Both have 1 GiB of random data: in "flat_1k_1MB", split into 1k files (1 MiB each) in a flat structure, and in "nested_32k_32kB", into 32k files (32 KiB each) spread over the 256 leaf directories in a binary tree of depth 8.

Implementation Test Case Time (s) Speed up
shell reference flat_1k_1MB 2.29 -> 1.0
dirhash flat_1k_1MB 1.67 1.36
dirhash(8 workers) flat_1k_1MB 0.48 4.73
shell reference nested_32k_32kB 6.82 -> 1.0
dirhash nested_32k_32kB 3.43 2.00
dirhash(8 workers) nested_32k_32kB 1.14 6.00

The benchmark was run a MacBook Pro (2018), further details and source code here.

Documentation

Please refer to dirhash -h, the python source code and the Dirhash Standard.

dirhash-python's People

Contributors

andhus avatar cottsay avatar irsent avatar jonathanarns avatar matthewfeickert avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.