lrq3000 / pyfilefixity

📂🛡️Suite of tools for file fixity (data protection for long term storage⌛) using redundant error correcting codes, hash auditing and duplications with majority vote, all in pure Python🐍

License: MIT License

Python 99.53% JavaScript 0.03% CSS 0.15% Makefile 0.30%
archival data-archival data-protection data-repairing duplication error-correcting-codes long-term reed-solomon reed-solomon-codes

pyfilefixity's People

Contributors

cpburnz, lrq3000

pyfilefixity's Issues

question: `tqdm` progress display keeps lines on the screen

For all operations, a nice tqdm progress bar is shown, but at least for me each update is printed as a new, separate line instead of overwriting the same line over and over (which I believe is the default behavior of tqdm).

This results in lots of scrolling, pushing away more interesting messages (such as file date/checksum mismatches):

 99%|#########9| 48385/48771 [07:03<00:02, 179.60it/s]
 99%|#########9| 48403/48771 [07:03<00:02, 177.26it/s]
 99%|#########9| 48421/48771 [07:03<00:01, 175.20it/s]
 99%|#########9| 48439/48771 [07:03<00:01, 173.68it/s]
 99%|#########9| 48457/48771 [07:03<00:01, 173.15it/s]
 99%|#########9| 48475/48771 [07:03<00:01, 172.24it/s]
 99%|#########9| 48493/48771 [07:03<00:01, 172.36it/s]
 99%|#########9| 48511/48771 [07:04<00:01, 171.92it/s]

Is this behavior by design, or does it happen only on my system?
pyFileFixity version 3.1.4 installed with pip, on Python 3.10.12, on WSL2.
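
For context, progress bars such as tqdm normally redraw in place by starting each update with a carriage return rather than a newline. If the terminal or a log capture does not honour the carriage return, or if the bar is wider than the terminal and wraps, every update can end up on its own line. The sketch below only illustrates that mechanism with the standard library; it is not tqdm's code, and `show_progress` is a hypothetical name.

```python
import io
import sys
import time


def show_progress(total, out=sys.stderr, delay=0.0):
    """Redraw one status line in place using a carriage return per update."""
    for done in range(1, total + 1):
        # "\r" moves the cursor back to column 0 without starting a new line,
        # so a terminal that honours it overwrites the previous update.
        out.write("\r%3d%% %d/%d" % (100 * done // total, done, total))
        out.flush()
        time.sleep(delay)
    out.write("\n")  # end the line once, after the last update


# Capture the raw stream to see what a terminal would receive.
buf = io.StringIO()
show_progress(4, out=buf)
raw = buf.getvalue()
```

The captured stream holds one carriage return per update and a single newline at the end, which is why the output looks like one line only when the display treats `\r` as "return to column 0".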

file encoding on Linux / Windows

Dear Sir,

I am using pyFileFixity, which is a very efficient tool.

But the output files for checksum / header / ECC seem to be OS-dependent (charset-dependent), which can lead to errors.

For example, I computed checksums on Windows 10, then verified them on Linux (Debian buster, UTF-8 encoding), and I found multiple errors.
It appears that the output file was written as CP-1252 on Windows, while it was read as UTF-8 on Linux.
This was easily fixed for the checksums output file by converting the file to UTF-8.
But I am wondering whether the header and ECC output files computed on Windows will work if I repair the file on Linux.
Is there an option in pyFileFixity to set the charset used for reading and writing these files (UTF-8)?
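
The manual conversion described above can be sketched with explicit encodings in plain Python. This is only an illustration: the source encoding (`cp1252`), the function name `transcode`, and the file names are assumptions, and it applies to text files such as the checksum list, not necessarily to the header or ECC files.

```python
import os
import tempfile


def transcode(src_path, dst_path, src_encoding="cp1252", dst_encoding="utf-8"):
    """Rewrite a text file from one explicit encoding to another."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)


# Demo with a throwaway file; the file names and contents are made up.
workdir = tempfile.mkdtemp()
win_file = os.path.join(workdir, "checksums_windows.txt")
utf8_file = os.path.join(workdir, "checksums_utf8.txt")
with open(win_file, "wb") as f:
    f.write("caf\u00e9.jpg\r\n".encode("cp1252"))

transcode(win_file, utf8_file)
with open(utf8_file, "rb") as f:
    converted = f.read()
```

Passing `encoding=` explicitly on both sides avoids relying on the platform default, which is what differs between Windows and Linux here.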

Best regards,

Olivier

Infinite compression

Hi,

This is not an issue, just a question.
If I have an image, I make an ECC file from it. I then deliberately remove the maximum number of bits that can be lost before the file becomes too corrupted to repair. Then I do a second ECC iteration, and again I remove the maximum number of bits before it becomes too corrupted.

By this (sophistic) reasoning, I should be able to compress a file infinitely.

I know this is wrong, but why?
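
One way to see the flaw is to count what has to be stored. The sketch below assumes an ideal (MDS) erasure code, such as Reed-Solomon used with known erasure positions, where recovering `e` missing symbols needs at least `e` parity symbols. The function name and the numbers are hypothetical, and real codes need more redundancy than this best case.

```python
def stored_after_round(data_symbols, erased):
    """Symbols that must be kept to rebuild the data after erasing part of it.

    Assumes an ideal (MDS) erasure code: recovering `erased` missing symbols
    needs at least `erased` parity symbols, and the erased positions are known.
    """
    if not 0 <= erased <= data_symbols:
        raise ValueError("erased must be between 0 and data_symbols")
    surviving = data_symbols - erased
    parity_needed = erased  # best case; unknown positions need about twice this
    return surviving + parity_needed


size = 1000
history = []
for _ in range(5):
    # Erase half of what is left each round, then count what must be stored.
    size = stored_after_round(size, size // 2)
    history.append(size)
```

Under these assumptions the total never drops below the original size: every symbol removed from the file has to be paid for with at least one symbol of ECC, so repeating the trick only moves data from the file into the ECC.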

RAID1 correction

RAID 1 is mirroring one disk with a bit-by-bit copy of another disk.

This is by convention only: the marginal utility of each additional disk drops rapidly, so there are no COTS solutions with more than 2 disks. I run 3-disk RAID 1 arrays precisely so that errors on n-1 disks can be corrected (also, if one disk in a 2-disk array fails, the other one tends to fail soon after in practice, whatever the reason: suddenly bearing the whole load, being similar in age, or coming from the same production batch).

RAID 1 with more disks is merely impractical for archival because of its costly requirement for disk redundancy: it kind of works, but it's not the right tool for the job (as opposed to availability for currently-used data).

However, your point about "no detection of silent corruption" has merit. I suggest an addition to the RAID 1 paragraph:

While it's possible to have multiple disks in a RAID 1 array, you are paying a multiple of the storage price, with the same storage capacity as with a single disk, without a commensurate increase in resilience. In other words, not very efficient.
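
To make the three-copy argument concrete, here is a minimal sketch of per-byte majority voting across copies. It illustrates the idea only: it is not what a RAID controller does and not pyFileFixity's implementation, and `majority_vote` and the sample data are hypothetical.

```python
from collections import Counter


def majority_vote(copies):
    """Rebuild a byte string from equally long copies by per-position majority."""
    if len({len(c) for c in copies}) != 1:
        raise ValueError("all copies must have the same length")
    repaired = bytearray()
    unresolved = []  # offsets where no value reached a strict majority
    for offset, column in enumerate(zip(*copies)):
        value, count = Counter(column).most_common(1)[0]
        if count * 2 <= len(copies):
            unresolved.append(offset)
        repaired.append(value)
    return bytes(repaired), unresolved


original = b"archive data"
copy_a = b"archive data"
copy_b = b"arXhive data"   # one corrupted byte
copy_c = b"archive dYta"   # a different corrupted byte
fixed, ambiguous = majority_vote([copy_a, copy_b, copy_c])
```

With three copies, a position is recoverable as long as at most one copy is wrong there. With only two copies a mismatch can be detected but not resolved, which is the "silent corruption" gap mentioned above.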

IO Error on files with characters like (, ), or comma in the file name

I'm running pyFileFixity on Windows and it fails if a file name has special characters in it. Is there a fix for this? When a file name contains a special character, the script aborts, claiming that the file doesn't exist even though it does.

IOError: [Errno 2] No such file or directory:('C:\directory\file that (does) exist.jpg')

Non latin-1 filenames are not supported

Thank you for a very well-thought-out tool! I'm currently evaluating it for keeping my 400+GB, 50k-file archive safe(r).

While doing so, I came across this exception:

Traceback (most recent call last):
  File "/home/user/.local/bin/pff", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/pff.py", line 108, in main
    return saecc_main(argv=subargs, command=fullcommand)
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/structural_adaptive_ecc.py", line 574, in main
    relfilepath_ecc = compute_ecc_hash_from_string(relfilepath, ecc_manager_intra, hasher_intra, max_block_size, resilience_rate_intra)
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/structural_adaptive_ecc.py", line 203, in compute_ecc_hash_from_string
    fpfile = BytesIO(b(string))
  File "/home/user/.local/lib/python3.10/site-packages/pyFileFixity/lib/_compat.py", line 36, in b
    return codecs.latin_1_encode(x)[0]
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 16-26: ordinal not in range(256)

Looking at the code, it seems that latin-1 is used as an internal encoding, which indeed cannot handle characters outside the Latin-1 range:

if sys.version_info < (3,):
    def b(x):
        return x
else:
    import codecs
    def b(x):
        if isinstance(x, _str):
            return codecs.latin_1_encode(x)[0]  # <-- here
        else:
            return x

The problematic filename had Ukrainian/Cyrillic characters, which I think are not part of the latin-1 encoding.
Example string: зображення.

pyFileFixity version 3.1.4 installed with pip. I'm on Python 3.10.12.
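
The failure can be reproduced in isolation, along with one possible direction for a fix. This is a sketch, not the maintainer's patch: `to_bytes_latin1` and `to_bytes_utf8` are hypothetical names, and switching the encoding would change the bytes that get protected, so it may not be compatible with ECC files generated earlier.

```python
def to_bytes_latin1(text):
    """Same encoding as the Python 3 branch quoted above: limited to U+0000..U+00FF."""
    return text.encode("latin-1")


def to_bytes_utf8(text):
    """Hypothetical alternative: UTF-8 can encode any valid Unicode text."""
    return text.encode("utf-8") if isinstance(text, str) else text


name = "зображення"
try:
    to_bytes_latin1(name)
    latin1_ok = True
except UnicodeEncodeError:
    # Cyrillic code points are above U+00FF, so latin-1 rejects them.
    latin1_ok = False

encoded = to_bytes_utf8(name)
```

Here the Latin-1 path raises `UnicodeEncodeError`, while the UTF-8 path returns bytes that decode back to the same filename.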

Getting errors 20% of the way through creating hashes

Windows 11, Synology NAS mapped drive

  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\lk\AppData\Local\Programs\Python\Python311\Scripts\pff.exe\__main__.py", line 7, in <module>

Can I just append and it will continue where it left off?

Awesome!

This is not a bug, but just kudos... This code is quite fine. Thanks
