laktak / chkbit-py
Check the data integrity of your files over time
License: MIT License
Previously, running --verify-index on my 50k files, without any existing index, took 1m 50s with 1 worker and 35s with 6 workers. It simply listed all the files, and that was very fast.
Now, with version 3, verification takes about 24m with a single worker and about 6m 40s with 6 workers, which is about the same as updating the index.
Looking at line 39 here: a03d5b4#diff-0c955c1fb53488dd1d4f7f30bec88d9339a66af65b566f0fa1375d7dc926b22bR39 , it looks like even in read-only verification mode, hashes are calculated and modified times are checked for all new files.
I think in the previous version, new files were simply reported during verification: a03d5b4#diff-5b1e4f1fcc838580d1fd73bbf4d2b6a8ad87d3a526fd32c45d9743ddd84ae7aaL43 (the link doesn't expand from the URL; see indexthread.py, line 43)
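The behavior described above (report new files during verification instead of hashing them) could look roughly like the following. This is a minimal sketch, not chkbit's actual code: the function name verify_read_only, the plain dict used as the index, and the hash_file callback are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def verify_read_only(
    paths: Iterable[str],
    index: Dict[str, str],
    hash_file: Callable[[str], str],
) -> Tuple[List[str], List[str]]:
    """Hypothetical read-only check: hash only files already in the index.

    Returns (new_files, damaged_files). Files missing from the index are
    only reported, so the expensive hash_file call is skipped for them.
    """
    new_files: List[str] = []
    damaged: List[str] = []
    for path in paths:
        expected = index.get(path)
        if expected is None:
            new_files.append(path)  # report only, no hashing
            continue
        if hash_file(path) != expected:
            damaged.append(path)
    return new_files, damaged
```

With no existing index, every file takes the cheap "report only" branch, which would match the fast timings reported for the earlier version.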
Hi, thank you for a promising-looking file bitrot/hash checker. I especially like the built-in logic of "modified content and date are fine, modified content alone is not". This is exactly what I've been looking for!
For file integrity checking there is a rather new algorithm, BLAKE3, that is significantly faster than MD5 (around 9x) and also claims to be better; its authors published an article with more details and benchmarks. It was designed specifically for file (content) hashing.
The primary (binary) implementation is in Rust (with parallelization), but there are also reference/educational non-parallel implementations in C and pure Python.
If you think this could be a nice --algo option, what would be the best way to integrate it? As you already have multi-worker support, I guess calling their single-threaded C library (or asking for single-threaded processing from the main Rust library) would be best. I haven't yet checked whether Python bindings exist, but I'd assume they do.
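One way such an --algo option could be wired up is a small hasher factory that falls back to hashlib for built-in algorithms. This is only a sketch under assumptions: it presumes the third-party blake3 package from PyPI (which exposes a hashlib-style object) is installed for the "blake3" case, and new_hasher and hash_file are hypothetical names, not chkbit functions.

```python
import hashlib

try:
    import blake3  # optional third-party binding (assumption: pip install blake3)
except ImportError:
    blake3 = None


def new_hasher(algo: str):
    """Return a hashlib-style object for the requested algorithm name."""
    if algo == "blake3":
        if blake3 is None:
            raise ValueError("the blake3 package is not installed")
        return blake3.blake3()
    # hashlib.new raises ValueError for unknown algorithm names
    return hashlib.new(algo)


def hash_file(path: str, algo: str = "md5", chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files are never loaded fully into memory."""
    hasher = new_hasher(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Because each call hashes one file on one thread, this shape would fit an existing multi-worker design: the workers provide the parallelism, and the hasher itself stays single-threaded.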
I'm struggling with a line definition in .chkbitignore that would, for example, match all Thumbs.db files regardless of where they are in the file tree.
I've tried just the name itself, the name with a leading asterisk, and even things like */*/Thumbs.db, but Thumbs.db files still seem to be added.
What is the right way to do this? The fnmatch manual suggests that *Thumbs.db should be correct...
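How each of the patterns tried above behaves in Python's fnmatch depends on which string the pattern is compared against. The thread does not say whether chkbit matches the bare file name or a relative path, so the sketch below shows both cases; is_ignored and its match_basename flag are hypothetical and only illustrate the standard library's matching rules.

```python
import posixpath
from fnmatch import fnmatchcase
from typing import Iterable


def is_ignored(rel_path: str, patterns: Iterable[str], match_basename: bool) -> bool:
    """Hypothetical ignore check using fnmatch rules.

    match_basename=True compares patterns against the file name only;
    match_basename=False compares them against the whole relative path.
    """
    target = posixpath.basename(rel_path) if match_basename else rel_path
    return any(fnmatchcase(target, pattern) for pattern in patterns)


path = "photos/2019/Thumbs.db"

# Matching against the file name: the bare name is enough.
print(is_ignored(path, ["Thumbs.db"], match_basename=True))       # True

# Matching against the full path: the bare name fails, but a leading
# asterisk works because fnmatch's * also matches the / separator.
print(is_ignored(path, ["Thumbs.db"], match_basename=False))      # False
print(is_ignored(path, ["*Thumbs.db"], match_basename=False))     # True

# */*/Thumbs.db only matches when at least two separators precede the name.
print(is_ignored("Thumbs.db", ["*/*/Thumbs.db"], match_basename=False))  # False
```

If none of these patterns take effect in practice, the cause is more likely in how or when the ignore file is read than in the pattern syntax itself.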
Works great on my Windows machine. I'm trying to get it to work on Linux DSM (Synology NAS), but I get this error:
./chkbit: error while loading shared libraries: libz.so.1: failed to map segment from shared object
Would it be possible to get a copy of the required library file so I can drop it in the same directory? The Synology Linux distro is so scaled down that it's hard to even install dependencies.
I noticed from the documentation that this project uses MD5 hashes. While I understand that this works well enough to catch unintentional corruption, I'm wondering if it would be possible to support secure hash functions going forward, such as:
I understand that it's not a primary goal to protect against adversarial attacks, but I can also imagine some situations (e.g. content-addressed storage) where a program might accidentally swap a file with another that has the same hash. In general, it would be safest if there were no known way to create files with colliding hashes. My understanding is that these hashes would not incur a significant performance penalty on modern computers.
I'm unable to add chkbit to brew.sh (Homebrew) because their CI job blocks it with:
- GitHub repository not notable enough (<30 forks, <30 watchers and <75 stars)
If you'd like to install chkbit via brew then please star this repo.
I got here from https://unix.stackexchange.com/questions/136947/protecting-data-against-bit-rot/533728#533728 and think you could successfully add the functionality to chkbit-py, making it more powerful.
Hi,
I got here while learning about file integrity. As I understand the code, the hash of a file is kept in a dictionary in the Index class, so I assume that chkbit cannot guarantee that a file wasn't modified while it wasn't running, as it has no persistent hash history. It also isn't clear to me whether there is an alerting mechanism while it is running or whether you must check manually.
About the first point, I was wondering whether a SQLite database would be a better fit. It would keep the advantages of using files and would make a persistent history possible. Still, it would also introduce more complexity, such as database management, and, more importantly, the hashes should be encrypted, so the key would have to be kept somewhere safe.
I would like to know more about the project, given my recent interest in file integrity.
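The SQLite idea raised above could be sketched as follows. This is an illustration under assumptions, not part of chkbit: the hash_history table, its columns, and the functions open_history and check are hypothetical, and the encryption concern is left out entirely. The status rule follows the logic quoted earlier in this listing ("modified content and date are fine, modified content alone is not").

```python
import sqlite3
import time


def open_history(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical persistent hash history database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hash_history ("
        " path TEXT NOT NULL,"
        " mtime REAL NOT NULL,"
        " digest TEXT NOT NULL,"
        " checked_at REAL NOT NULL)"
    )
    return conn


def check(conn: sqlite3.Connection, path: str, mtime: float, digest: str) -> str:
    """Compare a file's current state with its latest stored entry.

    Returns "new", "ok", "updated" (content and mtime changed) or
    "damaged" (content changed while mtime stayed the same).
    """
    prev = conn.execute(
        "SELECT mtime, digest FROM hash_history"
        " WHERE path = ? ORDER BY rowid DESC LIMIT 1",
        (path,),
    ).fetchone()
    if prev is None:
        status = "new"
    elif prev[1] == digest:
        status = "ok"
    elif prev[0] != mtime:
        status = "updated"
    else:
        status = "damaged"
    if status != "damaged":
        # Damaged states are not stored, so the last good entry stays the baseline.
        conn.execute(
            "INSERT INTO hash_history VALUES (?, ?, ?, ?)",
            (path, mtime, digest, time.time()),
        )
        conn.commit()
    return status
```

Passing a file path instead of ":memory:" to open_history would make the history survive between runs, which is the persistence the post asks about.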