laktak / chkbit-py

Check the data integrity of your files over time

License: MIT License

Python 100.00%

backup storage-media data-degradation bitrot-detection data-integrity disk-check cloud-backup

chkbit-py's Issues

slower verify in version 3

Previously, running --verify-index on my 50k files - without any existing index - was taking 1m 50s with 1 worker, and 35s with 6 workers. It was simply listing all the files, and that was very fast.

Now, with version 3, verification takes about 24m with a single worker, and about 6m 40s with 6 workers; which are about the same times as for updating the index.

Looking at line 39 in a03d5b4#diff-0c955c1fb53488dd1d4f7f30bec88d9339a66af65b566f0fa1375d7dc926b22bR39, it looks like hashes are calculated and modification times are checked for all new files, even in read-only verification mode.

I think in the previous version new files during verification were simply reported: a03d5b4#diff-5b1e4f1fcc838580d1fd73bbf4d2b6a8ad87d3a526fd32c45d9743ddd84ae7aaL43 (doesn't expand from the URL - indexthread.py, line 43)
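To make the difference concrete, here is a minimal sketch of a read-only verify pass that only reports files missing from the index instead of hashing them. It is not chkbit's code: the `verify_dir` function and the index layout (file name mapped to an MD5 hex digest) are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def verify_dir(directory, index):
    """Read-only check of one directory against a stored index.

    `index` maps file name -> expected MD5 hex digest (hypothetical layout).
    Files missing from the index are only reported, never hashed.
    """
    report = {"ok": [], "damaged": [], "new": []}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        expected = index.get(path.name)
        if expected is None:
            # Unknown file: listing it is cheap, hashing it is not.
            report["new"].append(path.name)
            continue
        actual = hashlib.md5(path.read_bytes()).hexdigest()
        key = "ok" if actual == expected else "damaged"
        report[key].append(path.name)
    return report
```

With an empty index every file takes the cheap `new` branch, which would match the fast timings described for the earlier version.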

question: `--algo blake3`

Hi, thank you for a promising-looking file bitrot/hash checker. I especially like the built-in logic of "modified content and date are fine, modified content alone is not" - this is exactly what I've been looking for!
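The rule quoted above can be written down as a small decision function. This is a sketch of the rule as stated in this issue, not chkbit's implementation; the function name and the status strings are hypothetical.

```python
def classify(old_hash, old_mtime, new_hash, new_mtime):
    """Classify a file by comparing its stored and current hash and mtime."""
    if old_hash is None:
        return "new"        # no stored entry for this file yet
    if new_hash == old_hash:
        return "ok"         # content unchanged
    if new_mtime != old_mtime:
        return "updated"    # content and date both changed: a normal edit
    return "damaged"        # content changed but date did not: suspicious
```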

For file integrity checking there is a fairly new algorithm, BLAKE3, that is reported to be significantly faster than MD5 (around 9x) and is also claimed to be better; its authors published an article with more details and benchmarks. It was designed specifically for file (content) hashing.

The primary (binary) implementation is in Rust (with parallelization), but there are also reference/educational non-parallel implementations in C and pure Python.

If you think this could be a nice --algo option, what could be the best way to integrate it? As you already have multi-worker support, I guess calling their single-threaded C library (or asking for single-thread processing from the main Rust library) would be the best? I haven't yet checked if Python bindings exist, but I'd assume they do.
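One possible shape for such an option is sketched below. It assumes the third-party `blake3` package from PyPI, which exposes a hashlib-style object and, as far as I know, hashes on a single thread by default. The helper names `new_hasher` and `hash_file` are hypothetical and not part of chkbit.

```python
import hashlib

try:
    # Third-party binding from PyPI ("pip install blake3"); optional here.
    from blake3 import blake3
except ImportError:
    blake3 = None


def new_hasher(algo):
    """Return a hashlib-style object for `algo` (hypothetical helper)."""
    if algo == "blake3":
        if blake3 is None:
            raise RuntimeError("the blake3 package is not installed")
        return blake3()
    return hashlib.new(algo)


def hash_file(path, algo="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large files never sit fully in memory."""
    hasher = new_hasher(algo)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Because each call hashes one file on one thread, a design like this would leave parallelism to the existing worker pool rather than to the hash library.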

question: how to match the same `.chkbitignore` files at different depth levels?

I'm struggling with a line definition in .chkbitignore that would, for example, match all Thumbs.db files independent of where they are in the file tree.

I've tried just the name itself, the name with a leading asterisk, and even things like */*/Thumbs.db - but Thumbs.db files still seem to be added.

What is the right way to do this? The fnmatch manual suggests that *Thumbs.db should be correct...
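For reference, the snippet below shows how Python's `fnmatch` module itself treats these patterns. It does not describe how chkbit reads `.chkbitignore`; it only illustrates that the result depends on whether a pattern is compared with the bare file name or with a relative path. The `is_ignored` helper is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(rel_path, patterns):
    """True if the file name (last path component) matches any pattern."""
    name = PurePosixPath(rel_path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


path = "photos/2020/Thumbs.db"

# Compared with the whole relative path, the bare name does not match,
# but a leading "*" does, because fnmatch's "*" also matches "/".
print(fnmatchcase(path, "Thumbs.db"))    # False
print(fnmatchcase(path, "*Thumbs.db"))   # True

# Compared with the file name only, the bare name matches at any depth.
print(is_ignored(path, ["Thumbs.db"]))   # True
```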

Synology x86_64 libz.so.1

Works great on my Windows machine. I'm trying to get it to work on Linux DSM (Synology NAS), but I get this error:
./chkbit: error while loading shared libraries: libz.so.1: failed to map segment from shared object

Would it be possible to get a copy of the required library file so I can drop it in the same directory? The Synology Linux distro is so scaled down that it's hard to even install dependencies.

Secure hash functions

I noticed from the documentation that this project uses MD5 hashes. While I understand that this works well enough to catch unintentional corruption, I'm wondering if it would be possible to support secure hash functions going forward, such as:

SHA-256: well-vetted in practice, protects against collisions
SHA-3: also protects against length extension attacks

I understand that it's not a primary goal to protect against adversarial attacks, but I can also imagine some situations (e.g. content-addressed storage) where a program might accidentally swap a file with another that has the same hash. In general, it would be safest if there is no known way to create any files with collisions. My understanding is that these hashes would not cost a significant performance penalty on modern computers.
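The performance assumption is easy to check on a given machine. Below is a rough timing sketch that uses only `hashlib`; the `time_hash` helper is hypothetical, and the numbers it prints will vary with CPU and Python build.

```python
import hashlib
import time


def time_hash(algo, data, repeats=3):
    """Best-of-`repeats` wall time in seconds to hash `data` with `algo`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.new(algo, data).hexdigest()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    payload = bytes(16 * 1024 * 1024)  # 16 MiB of zero bytes
    for algo in ("md5", "sha256", "sha3_256"):
        seconds = time_hash(algo, payload)
        print(f"{algo:>9}: {len(payload) / seconds / 1e6:.0f} MB/s")
```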

Stargate Blocker

I'm unable to add chkbit to brew.sh (Homebrew) because their CI job blocks with

GitHub repository not notable enough (<30 forks, <30 watchers and <75 stars)

If you'd like to install chkbit via brew then please star this repo.

Feature request: Parity files so that backups can be healed

I got here from https://unix.stackexchange.com/questions/136947/protecting-data-against-bit-rot/533728#533728 and think you could successfully add the functionality to chkbit-py, making it more powerful.
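As a toy illustration of what parity data makes possible, the sketch below rebuilds one lost block from the remaining blocks plus a single XOR parity block. It assumes equally sized blocks and exactly one missing block; real tools such as par2 use Reed-Solomon codes, which can repair more damage. All function names are hypothetical.

```python
def xor_blocks(blocks):
    """XOR equally sized byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def make_parity(blocks):
    """Parity block: the XOR of all data blocks."""
    return xor_blocks(blocks)


def recover(blocks, parity, lost_index):
    """Rebuild the block at `lost_index` from the others plus the parity."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_index]
    return xor_blocks(survivors + [parity])
```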

Question: hash and integrity management

Hi,
I got here while learning about file integrity. As I understand the code, the hash of each file is kept in a dictionary in the Index class, so I assume that chkbit cannot guarantee that a file wasn't modified while chkbit wasn't running, since it has no persistent hash history. It also isn't clear to me whether there is an alerting mechanism while it is running, or whether you must check manually.

About the first point, I was wondering whether a SQLite database would be a better fit. It would keep the advantages of using files and would make a persistent history possible. It would also introduce more complexity, such as database management, and, more importantly, the hashes should be encrypted, which means the key would have to be kept somewhere safe.
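A minimal sketch of what such a history could look like with Python's built-in `sqlite3` module follows. The table layout and function names are assumptions made for illustration, and the encryption question raised above is left out.

```python
import sqlite3
import time


def open_history(db_path=":memory:"):
    """Open (or create) a hash history database; in-memory by default."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS history (
               path TEXT NOT NULL,
               mtime REAL NOT NULL,
               digest TEXT NOT NULL,
               checked_at REAL NOT NULL
           )"""
    )
    return conn


def record(conn, path, mtime, digest):
    """Append one observation; earlier rows are kept as history."""
    conn.execute(
        "INSERT INTO history VALUES (?, ?, ?, ?)",
        (path, mtime, digest, time.time()),
    )
    conn.commit()


def last_digest(conn, path):
    """Most recently recorded digest for `path`, or None if never seen."""
    row = conn.execute(
        "SELECT digest FROM history WHERE path = ? "
        "ORDER BY rowid DESC LIMIT 1",
        (path,),
    ).fetchone()
    return row[0] if row else None
```

Passing a file path instead of `":memory:"` makes the history survive between runs.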

I would like to know more about the project, given my recent interest in file integrity.
