laktak / chkbit-py

Check the data integrity of your files over time

License: MIT License

Python 100.00%
backup storage-media data-degradation bitrot-detection data-integrity disk-check cloud-backup

chkbit-py's People

Contributors

jminor, laktak, pmjdebruijn, spock

chkbit-py's Issues

slower verify in version 3

Previously, running --verify-index on my 50k files, without any existing index, took 1m 50s with 1 worker and 35s with 6 workers. It simply listed all the files, which was very fast.

Now, with version 3, verification takes about 24m with a single worker and about 6m 40s with 6 workers, which is about the same as updating the index.

Looking at line 39 here (a03d5b4#diff-0c955c1fb53488dd1d4f7f30bec88d9339a66af65b566f0fa1375d7dc926b22bR39), it looks like hashes are calculated and modification times are checked for all new files, even in read-only verification mode.

I think the previous version simply reported new files during verification: a03d5b4#diff-5b1e4f1fcc838580d1fd73bbf4d2b6a8ad87d3a526fd32c45d9743ddd84ae7aaL43 (the link doesn't expand from the URL; see indexthread.py, line 43).
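
The difference described above can be illustrated with a minimal sketch of a read-only verification pass that hashes only files that already have an index entry and merely reports the rest. This is not chkbit's code: the `index` mapping, the `verify` and `hash_file` names, and the report layout are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def verify(root: Path, index: dict[str, str]) -> dict[str, list[str]]:
    """Read-only check: hash only files that already have an index entry.

    `index` is a hypothetical mapping of relative path -> stored hex digest.
    """
    report: dict[str, list[str]] = {"ok": [], "damaged": [], "new": []}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        expected = index.get(rel)
        if expected is None:
            # No stored hash to compare against: report the file, skip hashing.
            report["new"].append(rel)
        elif hash_file(path) == expected:
            report["ok"].append(rel)
        else:
            report["damaged"].append(rel)
    return report
```

With an empty index, a pass like this only walks the tree, which would be consistent with the fast timings reported for the earlier version.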

question: `--algo blake3`

Hi, thank you for a promising-looking file bitrot/hash checker. I especially like the built-in logic of "modified content and date are fine, modified content alone is not". This is exactly what I've been looking for!
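
The rule quoted above can be written down as a small decision function. This is a sketch of the idea only; the `Entry` type and the status names are assumptions, not chkbit's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    digest: str  # content hash recorded for the file
    mtime: int   # modification time recorded for the file


def classify(stored: Entry, current: Entry) -> str:
    """Compare a stored index entry with the file's current state."""
    if current.digest == stored.digest:
        return "ok"       # content unchanged
    if current.mtime != stored.mtime:
        return "updated"  # content and date both changed: a normal edit
    return "damaged"      # content changed but date did not: possible bitrot
```

For example, `classify(Entry("aa", 100), Entry("bb", 100))` returns `"damaged"`, while the same change with a newer modification time is treated as an ordinary update.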

For file integrity checking, BLAKE3 is a fairly new algorithm that is significantly faster (around 9x) than MD5 and also claims to be better; its authors published an article with more details and benchmarks. It was designed specifically for file (content) hashing.

The primary (binary) implementation is in Rust (with parallelization), but there are also reference/educational non-parallel implementations in C and pure Python.

If you think this could be a nice --algo option, what would be the best way to integrate it? As you already have multi-worker support, I guess calling their single-threaded C library (or asking the main Rust library for single-threaded processing) would be best. I haven't yet checked whether Python bindings exist, but I'd assume they do.
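
One way such an option could be wired in is sketched below: a hasher factory keyed by algorithm name, with each worker hashing one file single-threaded. It assumes the third-party `blake3` package from PyPI for the BLAKE3 case and falls back to `hashlib` for everything else; `new_hasher`, `hash_file`, and `hash_many` are hypothetical names, not part of chkbit.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

try:
    # Third-party bindings; assumed to be installed via `pip install blake3`.
    from blake3 import blake3
except ImportError:
    blake3 = None


def new_hasher(algo: str):
    """Return a hashlib-style object (update/hexdigest) for the given name."""
    if algo == "blake3":
        if blake3 is None:
            raise ValueError("the blake3 package is not installed")
        return blake3()
    return hashlib.new(algo)


def hash_file(path, algo: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Hash one file in chunks with the selected algorithm."""
    h = new_hasher(algo)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def hash_many(paths, algo: str = "md5", workers: int = 4) -> dict[str, str]:
    """Hash several files in parallel, one file per worker task."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(lambda p: hash_file(p, algo), paths))
    return dict(zip((str(p) for p in paths), digests))
```

Keeping each file on a single thread leaves the parallelism to the worker pool, which matches the suggestion in the question.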

question: how to match the same `.chkbitignore` files at different depth levels?

I'm struggling with a line definition in .chkbitignore that would, for example, match all Thumbs.db files independent of where they are in the file tree.

I've tried just the name itself, with a leading asterisk, and even things like */*/Thumbs.db, but Thumbs.db files still seem to be added.

What is the right way to do this? The fnmatch manual suggests that *Thumbs.db should be correct...
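
As a reference point, the snippet below shows how Python's `fnmatch` treats these patterns, depending on whether a tool hands it the full relative path or only the file name. Which of the two chkbit does is not stated in this issue, so both helpers (`matches_path`, `matches_name`) are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def matches_path(rel_path: str, pattern: str) -> bool:
    """Match the pattern against the whole relative path."""
    return fnmatchcase(rel_path, pattern)


def matches_name(rel_path: str, pattern: str) -> bool:
    """Match the pattern against the file name only."""
    return fnmatchcase(PurePosixPath(rel_path).name, pattern)


path = "photos/2019/Thumbs.db"

print(matches_path(path, "Thumbs.db"))    # False: the pattern must cover the whole path
print(matches_path(path, "*Thumbs.db"))   # True: in fnmatch, * also matches "/"
print(matches_name(path, "Thumbs.db"))    # True: only the file name is compared
```

If the bare name does not match, the pattern is probably being compared against something other than the file name alone.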

Synology x86_64 libz.so.1

Works great on my Windows machine. I'm trying to get it to work on Linux DSM (Synology NAS), but I get this error:
./chkbit: error while loading shared libraries: libz.so.1: failed to map segment from shared object

Would it be possible to get a copy of the required library file so I can drop it in the same directory? The Synology Linux distro is so stripped down that it's hard to even install dependencies.

Secure hash functions

I noticed from the documentation that this project uses MD5 hashes. While I understand that this works well enough to catch unintentional corruption, I'm wondering if it would be possible to support secure hash functions going forward, such as:

  • SHA-256: well-vetted in practice, protects against collisions
  • SHA-3: also protects against length extension attacks

I understand that it's not a primary goal to protect against adversarial attacks, but I can also imagine some situations (e.g. content-addressed storage) where a program might accidentally swap a file with another that has the same hash. In general, it would be safest if there were no known way to create files with colliding hashes. My understanding is that these hashes would not incur a significant performance penalty on modern computers.
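
The performance claim is easy to check locally. The following is a rough micro-benchmark sketch using only `hashlib`; the `throughput_mb_s` helper and the payload size are arbitrary choices, and the numbers depend heavily on the CPU and the OpenSSL build.

```python
import hashlib
import time


def throughput_mb_s(algo: str, data: bytes, rounds: int = 5) -> float:
    """Hash `data` `rounds` times and return the throughput in MB/s."""
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.new(algo, data).digest()
    elapsed = time.perf_counter() - start
    return rounds * len(data) / elapsed / 1e6


if __name__ == "__main__":
    payload = bytes(16 * 1024 * 1024)  # 16 MiB of zeros, held in memory
    for algo in ("md5", "sha256", "sha3_256"):
        print(f"{algo:9s} {throughput_mb_s(algo, payload):8.1f} MB/s")
```

For large files on spinning disks or network storage, read speed rather than hashing is often the limiting factor, so measuring against real data is more informative than an in-memory test.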

Stargate Blocker

I'm unable to add chkbit to brew.sh (Homebrew) because their CI job blocks with

  • GitHub repository not notable enough (<30 forks, <30 watchers and <75 stars)

If you'd like to install chkbit via brew then please star this repo.

Question: hash and integrity management

Hi,
I got here while learning about file integrity. From my understanding of the code, the hash of a file is kept in a dictionary in the Index class, so I assume chkbit cannot guarantee that a file wasn't modified while it wasn't running, since it has no persistent hash history. It also isn't clear to me whether there is an alerting mechanism while it is running or whether you must check manually.

About the first point, I was wondering whether a SQLite database would be a better fit. It would keep the advantages of using files and make a persistent history possible. It would also introduce more complexity, such as database management, and, more importantly, the hashes should be encrypted, so the key would have to be kept somewhere safe.
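
To make the proposal concrete, here is a minimal sketch of a persistent hash history using Python's built-in `sqlite3` module. The table layout and the `open_history`, `record`, and `history` helpers are assumptions for illustration; this is not how chkbit stores its index, and encryption of the stored hashes is left out.

```python
import sqlite3
import time


def open_history(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the history database."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS hash_history (
               path        TEXT    NOT NULL,
               digest      TEXT    NOT NULL,
               mtime       INTEGER NOT NULL,
               recorded_at REAL    NOT NULL
           )"""
    )
    return con


def record(con: sqlite3.Connection, path: str, digest: str, mtime: int) -> None:
    """Append one observation; earlier rows are never overwritten."""
    with con:  # commits on success, rolls back on error
        con.execute(
            "INSERT INTO hash_history VALUES (?, ?, ?, ?)",
            (path, digest, mtime, time.time()),
        )


def history(con: sqlite3.Connection, path: str) -> list[tuple[str, int]]:
    """Return all recorded (digest, mtime) pairs for a path, oldest first."""
    rows = con.execute(
        "SELECT digest, mtime FROM hash_history WHERE path = ? ORDER BY rowid",
        (path,),
    )
    return rows.fetchall()
```

Because rows are only appended, every past digest of a file stays available for comparison across runs.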

I would like to know more about the project, given my recent interest in file integrity.
