dbryant4 / furtive
File Integrity Verification System (furtive) aims to ensure long term data integrity verification for digital archival purposes.
License: MIT License
upload_manifest_to_s3.py should print out a list of files that were not uploaded. It seems like it skips this detail.
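A minimal sketch of how the missing report could be produced, assuming the upload step can be wrapped in a single callable. The names `upload_all`, `report_failures`, and `upload_one` are hypothetical and do not come from upload_manifest_to_s3.py.

```python
def upload_all(paths, upload_one):
    """Try to upload every path; return (path, reason) pairs for the failures."""
    failed = []
    for path in paths:
        try:
            upload_one(path)  # hypothetical callable that uploads one file
        except Exception as error:
            # Keep going so one failure does not hide the others.
            failed.append((path, str(error)))
    return failed


def report_failures(failed):
    """Build a human-readable summary of files that were not uploaded."""
    if not failed:
        return "All files uploaded."
    lines = ["%d file(s) were not uploaded:" % len(failed)]
    lines.extend("  %s (%s)" % (path, reason) for path, reason in failed)
    return "\n".join(lines)
```

Printing `report_failures(upload_all(paths, upload_one))` at the end of the run would surface the skipped files in one place.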
The AWS related scripts seem to be out of scope and are taking up time that should be devoted to Furtive. Perhaps they should all be removed from the furtive repo and placed in their own repo.
For some reason, the process will not stop when all files have been hashed. This problem seems to be intermittent.
Add option to delete empty S3 buckets after uploading the manifest to s3.
Having multiple objects in hasher.py seems unclean. The objects within hasher.py deserve their own file.
Need to store historical information in the manifest instead of deleting the whole database then pushing hashes into it.
Need to:
User should be able to specify the region for s3 storage. Currently, the region is statically set to USWEST.
Currently the script statically retries uploads to S3 5 times. This should be an option.
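Both the region and the retry count could be exposed as command-line options. The sketch below uses `argparse`; the flag names `--region` and `--retries` and the helper `upload_with_retries` are assumptions for illustration, not existing options of the script.

```python
import argparse


def build_parser():
    """Command-line options for region and retry count (hypothetical flags)."""
    parser = argparse.ArgumentParser(description="Upload a manifest to S3 (sketch).")
    parser.add_argument("--region", default=None,
                        help="S3 region; None keeps the current hard-coded behaviour")
    parser.add_argument("--retries", type=int, default=5,
                        help="attempts per file before giving up (default: 5)")
    return parser


def upload_with_retries(upload, retries):
    """Call upload() until it succeeds or `retries` attempts have been made."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _attempt in range(retries):
        try:
            return upload()
        except Exception as error:
            last_error = error  # remember the failure and try again
    raise last_error
```

Keeping the default at 5 preserves the current behaviour for anyone who does not pass the flag.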
Traceback (most recent call last):
File "/root/furtive/upload_manifest_to_s3.py", line 160, in <module>
main()
File "/root/furtive/upload_manifest_to_s3.py", line 138, in main
k.change_storage_class(args.storage_class)
File "/usr/local/lib/python2.7/boto/s3/key.py", line 298, in change_storage_class
validate_dst_bucket=validate_dst_bucket)
File "/usr/local/lib/python2.7/boto/s3/key.py", line 360, in copy
encrypt_key=encrypt_key)
File "/usr/local/lib/python2.7/boto/s3/bucket.py", line 679, in copy_key
acl = src_bucket.get_xml_acl(src_key_name)
File "/usr/local/lib/python2.7/boto/s3/bucket.py", line 741, in get_xml_acl
response.status, response.reason, body)
boto.exception.S3ResponseError: S3ResponseError: 404 Not Found
It seems like furtive should have an option to check the integrity of a manifest.
Error: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
Need to create a method to compare two manifests.
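One possible shape for such a method, assuming each manifest can be loaded as a dictionary mapping file path to hash. The function name `compare_manifests` and the dictionary layout are assumptions, not the existing manifest format.

```python
def compare_manifests(old, new):
    """Compare two {path: digest} mappings and report the differences."""
    old_paths, new_paths = set(old), set(new)
    return {
        # Present only in the newer manifest.
        "added": sorted(new_paths - old_paths),
        # Present only in the older manifest.
        "removed": sorted(old_paths - new_paths),
        # Present in both, but the digest differs.
        "changed": sorted(p for p in old_paths & new_paths if old[p] != new[p]),
    }
```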
When comparing previous hashes in a manifest, Furtive needs to automatically determine which hash algorithm was used previously and reuse that hash algorithm no matter what is set in self.hash_algorithm.
Alternatively, Furtive could hash the file using the detected algorithm, then if the hashes match, rehash using the algorithm set in self.hash_algorithm then store in manifest.
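A rough sketch of the detect, verify, then rehash idea. It guesses the algorithm from the length of the stored hex digest, which is only a heuristic: different algorithms can share a digest length, so a real implementation would probably record the algorithm name in the manifest. `detect_algorithm` and `verify_then_rehash` are hypothetical names.

```python
import hashlib

# Hex digest length -> assumed algorithm. Heuristic only; lengths are not unique.
_ALGORITHM_BY_HEX_LENGTH = {
    32: "md5",
    40: "sha1",
    56: "sha224",
    64: "sha256",
    96: "sha384",
    128: "sha512",
}


def detect_algorithm(hex_digest):
    """Guess which hashlib algorithm produced hex_digest from its length."""
    try:
        return _ALGORITHM_BY_HEX_LENGTH[len(hex_digest)]
    except KeyError:
        raise ValueError("Cannot guess algorithm for digest of length %d" % len(hex_digest))


def verify_then_rehash(data, stored_digest, wanted_algorithm):
    """Return a digest in wanted_algorithm if data matches stored_digest, else None."""
    detected = detect_algorithm(stored_digest)
    if hashlib.new(detected, data).hexdigest() != stored_digest.lower():
        return None  # content changed; do not silently upgrade the hash
    return hashlib.new(wanted_algorithm, data).hexdigest()
```

For brevity the sketch takes the file content as bytes; large files would be hashed in blocks instead.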
sqlite3 seems a little complicated for this purpose. The YAML format seems sufficient for the purposes of this application. YAML is also readable by humans.
In most cases, the walking method should ignore .manifest.db.
Need to make use of multiple core CPUs by implementing threading.
Add arg options for users to select which hash to use
I think the first step should be to implement the ignoring of files during file discovery. The first step should not be to record the exclusions but to require that the same exclusions be provided during subsequent executions of furtive. If exclusions are not provided, then files which were previously ignored will be treated as new files. Conversely, if exclusions are added to subsequent executions of furtive, then those files matching the exclusion patterns will be treated as deleted files.
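Ignoring files during discovery could look roughly like this. The sketch matches shell-style patterns with `fnmatch` against paths relative to the base directory; the function name `discover_files` and the pattern syntax are assumptions, not furtive's current behaviour.

```python
import fnmatch
import os


def discover_files(base_dir, exclude_patterns=()):
    """Yield file paths relative to base_dir, skipping any that match a pattern."""
    for root, _dirs, files in os.walk(base_dir):
        for name in sorted(files):
            relative = os.path.relpath(os.path.join(root, name), base_dir)
            # fnmatch's "*" also matches path separators, so ".git/*" covers subfolders.
            if any(fnmatch.fnmatch(relative, pattern) for pattern in exclude_patterns):
                continue
            yield relative
```

Calling `discover_files(".", [".manifest.db", ".git/*"])` would skip the manifest database and everything under `.git`. Because nothing is recorded, the same patterns have to be passed on every run, as described above.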
Furtive adds the prefix "/" to the lifecycle expiration rule which does not work. Should be "".
furtive currently prints out an ugly form of YAML to the terminal upon completion.
(furtive)Derricks-MacBook-Air:furtive derrickbryant$ furtive --manifest .test_manifest.db compare
INFO:root:Generating temporary updated manifest.
INFO:root:Discovering files in . and adding to processing queue
added:
- ./.git/objects/7a/b159f38e9347f388cd91d8274975f697ad4937
- ./.git/objects/0f/e0c120460aa853de717df8404b5743a635670a
- ./.git/objects/d0/886b59a4ef66cf77a407471c4868e63a43e30e
- ./.git/objects/7d/9c81ec309c0acf66586aa1b5e3447c6e9cca9c
- ./.git/objects/fd/c38b9970ca741afbd71d81c458f8be61e05e6b
- ./.git/objects/ed/3fda7e67467c3ad053945912fe427601f4cdbc
- ./.git/objects/45/fb87956f7e9cf9ce0a72644d0446cef2c528a0
- ./.git/objects/db/c520c7d3be3b94bfc0164ec2988314f5c78562
- ./.git/objects/07/f52b6e2e0319fc33f4377f31d2837d68ba78db
- ./.git/objects/85/014f2645c95089d8ddd480bf7a86f067c771cb
- ./.git/objects/50/d8ecc301a67d8181407636f7f2b4c0443e80ca
- ./.git/objects/f6/e80b435183fef170895f9802160ca25d9b41c4
changed:
- !!python/unicode './.git/logs/HEAD'
- !!python/unicode './.git/logs/refs/heads/9'
- !!python/unicode './.travis.yml'
- !!python/unicode './.git/refs/remotes/origin/9'
- !!python/unicode './.git/refs/heads/9'
- !!python/unicode './.git/logs/refs/remotes/origin/9'
- !!python/unicode './.test_manifest.db'
- !!python/unicode './.git/index'
- !!python/unicode './furtive/manifest.py'
removed:
- !!python/unicode './Furtive.py'
- !!python/unicode './hashDir.py'
- !!python/unicode './docs/Furtive.html'
It should print only the string values, not the !!python/unicode crap before them.
Need to create some test scripts to ensure the module is functioning as expected. I have already uploaded some test data into the test-data directory.
It would be nice to have a feature or script which could take a manifest and create an archive (zip, tar, xz).
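For the tar case, a sketch using the standard `tarfile` module could be as small as the following. It assumes the manifest yields paths relative to a base directory; `archive_manifest` is a hypothetical name.

```python
import os
import tarfile


def archive_manifest(paths, base_dir, archive_path, compression="gz"):
    """Write every path listed in the manifest into a compressed tar archive."""
    with tarfile.open(archive_path, "w:" + compression) as archive:
        for relative in sorted(paths):
            # Store entries under their relative names so the archive is relocatable.
            archive.add(os.path.join(base_dir, relative), arcname=relative)
    return archive_path
```

Passing `compression="xz"` should work on Python 3, where `tarfile` supports the `w:xz` mode; zip output would need the separate `zipfile` module.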
Currently, the hashing process happens after all files have been found and put into a list. It would be nice if hashing started as soon as the files are found.
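One way to overlap discovery and hashing is to feed a generator into a pool's `imap_unordered`, which consumes its input lazily, whereas `map` first turns the iterable into a list. The sketch below swaps in a thread pool to stay simple; that choice, and the names `walk_files`, `hash_as_discovered`, and `hash_one`, are assumptions rather than the current design.

```python
import os
from multiprocessing.pool import ThreadPool


def walk_files(base_dir):
    """Yield file paths one at a time instead of building a full list first."""
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            yield os.path.join(root, name)


def hash_as_discovered(base_dir, hash_one, workers=4):
    """Return {path: digest}; hashing starts while the walk is still running."""
    pool = ThreadPool(workers)
    try:
        # imap_unordered pulls paths from the generator as workers become free.
        pairs = pool.imap_unordered(lambda path: (path, hash_one(path)),
                                    walk_files(base_dir))
        return dict(pairs)
    finally:
        pool.close()
        pool.join()
```

`hash_one` is any callable that takes a path and returns its digest.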
Should create a script that will open a manifest and upload the contents to Amazon Glacier, perhaps also S3. Look at using the python module boto.
We should terminate all processes in the Pool when a KeyboardInterrupt is intercepted.
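A small sketch of the idea, assuming the hashing is driven through an object with the `map`, `terminate`, and `join` methods of `multiprocessing.Pool`. The wrapper name `map_or_terminate` is hypothetical.

```python
def map_or_terminate(pool, func, items):
    """Run pool.map, stopping every worker if the user presses Ctrl-C."""
    try:
        return pool.map(func, items)
    except KeyboardInterrupt:
        pool.terminate()  # stop workers immediately instead of draining the queue
        pool.join()       # wait for them to exit before propagating
        raise
```

The interrupt is re-raised so the caller can still exit with an error. Whether the interrupt reaches this handler promptly can depend on the Python version and platform, so this would need testing against the real pool.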
Perhaps run de-duplication on files residing on the file system or maybe just de-dup files as they are being uploaded to Glacier to save space and money. For either option, there should be a way to store information about the files removed within the manifest.
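Finding the duplicates is straightforward once hashes exist: group paths by digest and keep the groups with more than one member. This sketch assumes the manifest is available as a `{path: digest}` dictionary; `find_duplicates` is a hypothetical name, and how removed files get recorded in the manifest is left open.

```python
from collections import defaultdict


def find_duplicates(manifest):
    """Return {digest: [paths]} for every digest shared by two or more files."""
    by_digest = defaultdict(list)
    for path, digest in manifest.items():
        by_digest[digest].append(path)
    return dict((digest, sorted(paths))
                for digest, paths in by_digest.items() if len(paths) > 1)
```

Matching digests only suggest identical content; a cautious tool might compare the files byte for byte before deleting anything.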
Should have an option for upload_manifest.py such as --prepend which acts like a directory to place the manifest in within glacier.
Example:
upload_manifest.py --dir /path/to/manifest --vault "pictures-backup" --prepend "october"
would upload manifest to a directory called october in the vault pictures-backup.
As a user, I want furtive to begin hashing files as soon as the YAML line is read. Currently, the whole YAML file needs to be read before hashing begins. This can take many seconds.
It seems as though there is a memory error when trying to hash a lot of large files. The last file below is 7.5GB and there is ~2.5 GB of RAM free.
Output:
Tue Dec 8 02:16:06 2015 EST [DEBUG]: Hash for /smb/Pictures/2011/Flight - Culpeper and Back/940_0182.MOV: cbc95ae47af59137b94ca6786903c440
Tue Dec 8 02:16:06 2015 EST [DEBUG]: Starting Hash of /smb/Pictures/2011/Flight - Culpeper and Back/Merged.wmv
Traceback (most recent call last):
File "/usr/local/bin/furtive", line 56, in <module>
main()
File "/usr/local/bin/furtive", line 50, in main
furtive.create()
File "/usr/local/lib/python2.7/dist-packages/furtive/__init__.py", line 35, in create
self.manifest.create()
File "/usr/local/lib/python2.7/dist-packages/furtive/manifest.py", line 26, in create
self.manifest = HashDirectory(self.directory).hash_files()
File "/usr/local/lib/python2.7/dist-packages/furtive/hasher.py", line 75, in hash_files
results = pool.map(hash_task, files_to_hash)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
MemoryError
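The traceback does not show where the memory went. One plausible cause, not confirmed here, is that a worker reads a whole file into memory before hashing it. If so, hashing in fixed-size blocks keeps memory use flat regardless of file size. `hash_file` and its parameters are assumptions, not the existing hasher.py code.

```python
import hashlib


def hash_file(path, algorithm="md5", block_size=1024 * 1024):
    """Hash a file in block_size chunks so memory use stays roughly constant."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at EOF.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

With a 1 MiB block size, a 7.5GB file would be processed one megabyte at a time.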
Also, add support for importing md5sum and other *sum applications
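A sketch of an importer for the plain output format of md5sum and the related sha*sum tools: a hex digest, a space, a mode character (space for text, `*` for binary), then the file name. It does not handle the backslash-escaped form used for unusual file names. `parse_sum_lines` is a hypothetical name.

```python
def parse_sum_lines(lines):
    """Parse md5sum-style lines into a {path: digest} dictionary."""
    manifest = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        digest, separator, rest = line.partition(" ")
        # After the first space comes the mode character, then the file name.
        if not separator or len(rest) < 2 or rest[0] not in " *":
            raise ValueError("Unrecognised checksum line: %r" % line)
        manifest[rest[1:]] = digest.lower()
    return manifest
```

Any iterable of lines works as input, including an open text file.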
89.7% 1018 of 1135
Traceback (most recent call last):
File "upload_manifest.py", line 116, in <module>
main()
File "upload_manifest.py", line 99, in main
vault.create_archive_from_file(file,f)
File "/usr/local/lib/python2.7/boto/glacier/vault.py", line 141, in create_archive_from_file
writer.close()
File "/usr/local/lib/python2.7/boto/glacier/writer.py", line 164, in close
self._uploaded_size)
File "/usr/local/lib/python2.7/boto/glacier/layer1.py", line 513, in complete_multipart_upload
response_headers=response_headers)
File "/usr/local/lib/python2.7/boto/glacier/layer1.py", line 78, in make_request
data=data)
File "/usr/local/lib/python2.7/boto/connection.py", line 910, in make_request
return self._mexe(http_request, sender, override_num_retries)
File "/usr/local/lib/python2.7/boto/connection.py", line 872, in _mexe
raise e
socket.gaierror: [Errno -2] Name or service not known
As a user, I want to be able to use furtive with Python 2.7 and 3.x so that I can continue using furtive when I eventually switch to Python 3.x.
Currently, the full path is added to the manifest file, this should change to the path relative to the basedir.