
dbryant4 / furtive


File Integrity Verification System (furtive) aims to ensure long term data integrity verification for digital archival purposes.

License: MIT License

Python 100.00%

furtive's People

Contributors

beserres, bmannix, dbryant4, quantifiedcode-bot

Forkers

beserres

furtive's Issues

List of failed uploads

upload_manifest_to_s3.py should print a list of the files that were not uploaded. It currently appears to skip this step.
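
One way to approach this, as a rough sketch: collect every failure while uploading and print the list at the end. The function names and the upload_one callable below are hypothetical stand-ins, not part of upload_manifest_to_s3.py.

```python
def upload_all(paths, upload_one):
    """Try to upload every path and return the ones that failed.

    upload_one is a hypothetical callable that uploads a single file
    and raises an exception on failure.
    """
    failed = []
    for path in paths:
        try:
            upload_one(path)
        except Exception as error:  # broad on purpose: record it, keep going
            failed.append((path, str(error)))
    return failed


def report_failures(failed):
    """Format the failures as printable text."""
    lines = ["%d file(s) were not uploaded:" % len(failed)]
    lines.extend("  %s (%s)" % (path, reason) for path, reason in failed)
    return "\n".join(lines)
```

Calling print(report_failures(failed)) after the upload loop would then give the user the missing list.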

Consider Removing AWS Scripts

The AWS-related scripts seem to be out of scope and are taking up time that should be devoted to Furtive. Perhaps they should all be moved out of the furtive repo and into a repo of their own.

Process not terminating

For some reason, the process will not stop when all files have been hashed. This problem seems to be intermittent.

Store historical information manifest

Need to store historical information in the manifest instead of deleting the whole database then pushing hashes into it.

Need to:

  • Store previous hashes, along with date and hash algorithm
    • Storing the algorithm might not be necessary, since we might be able to detect the algorithm from the hash length
  • Do not delete hashes of removed files. Perhaps add a "removed" column to the database or move the hash to a "removed" table within the manifest database
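
A minimal sketch of an append-only layout, assuming the manifest stays in sqlite3. The table name hash_history, its columns, and the helper functions are all made up for illustration and do not reflect the current manifest schema.

```python
import sqlite3
import time

# Hypothetical schema: one row per observation, never overwritten.
SCHEMA = """
CREATE TABLE IF NOT EXISTS hash_history (
    path        TEXT NOT NULL,
    hash        TEXT NOT NULL,
    algorithm   TEXT NOT NULL,
    recorded_at REAL NOT NULL,
    removed     INTEGER NOT NULL DEFAULT 0
)
"""


def open_manifest(db_path):
    """Open (or create) the manifest database and ensure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def record_hash(conn, path, digest, algorithm, removed=False):
    """Append a row instead of replacing, so earlier hashes survive."""
    conn.execute(
        "INSERT INTO hash_history (path, hash, algorithm, recorded_at, removed)"
        " VALUES (?, ?, ?, ?, ?)",
        (path, digest, algorithm, time.time(), int(removed)),
    )
    conn.commit()


def history(conn, path):
    """Return (hash, algorithm, removed) rows for a path, oldest first."""
    rows = conn.execute(
        "SELECT hash, algorithm, removed FROM hash_history"
        " WHERE path = ? ORDER BY rowid",
        (path,),
    )
    return rows.fetchall()
```

With this layout a removed file is recorded as one more row with removed set to 1, so its earlier hashes stay queryable.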

Specify region for s3

Users should be able to specify the region for S3 storage. Currently, the region is hard-coded to USWEST.
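
As a sketch only, the option could be exposed through argparse. The flag name --region and the DEFAULT_REGION constant are assumptions; the default simply mirrors the value named in this issue, and how the value is passed on to the S3 library is left out.

```python
import argparse

# Assumption: keep the current behaviour as the default.
DEFAULT_REGION = "USWEST"


def build_parser():
    """Build a parser with a hypothetical --region option."""
    parser = argparse.ArgumentParser(description="Upload a manifest to S3")
    parser.add_argument(
        "--region",
        default=DEFAULT_REGION,
        help="region to create the bucket in (default: %(default)s)",
    )
    return parser
```

The parsed args.region would replace the hard-coded value wherever the bucket is created.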

Error when uploading manifest to S3

Traceback (most recent call last):
  File "/root/furtive/upload_manifest_to_s3.py", line 160, in <module>
    main()
  File "/root/furtive/upload_manifest_to_s3.py", line 138, in main
    k.change_storage_class(args.storage_class)
  File "/usr/local/lib/python2.7/boto/s3/key.py", line 298, in change_storage_class
    validate_dst_bucket=validate_dst_bucket)
  File "/usr/local/lib/python2.7/boto/s3/key.py", line 360, in copy
    encrypt_key=encrypt_key)
  File "/usr/local/lib/python2.7/boto/s3/bucket.py", line 679, in copy_key
    acl = src_bucket.get_xml_acl(src_key_name)
  File "/usr/local/lib/python2.7/boto/s3/bucket.py", line 741, in get_xml_acl
    response.status, response.reason, body)
boto.exception.S3ResponseError: S3ResponseError: 404 Not Found

Add check feature

It seems like furtive should have an option to check the integrity of a manifest.

Get this error when furtifying a documents directory

Error: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

Compare two manifests

Need to create a method to compare two manifests.

  1. Create Furtive object of one dir called fur1
  2. Create Furtive object of the other dir called fur2
  3. Add new method called compare which will compare hashes variable within each Furtive object
    • fur1.compare(fur2)
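
The comparison itself could look roughly like the sketch below, assuming each Furtive object's hashes variable is a dict mapping file path to hash (an assumption about its shape). It is written as a free function here; the method form would pass self.hashes and other.hashes.

```python
def compare(old_hashes, new_hashes):
    """Compare two {path: hash} dicts.

    Returns a dict with sorted lists of added, removed and changed paths.
    """
    old_paths = set(old_hashes)
    new_paths = set(new_hashes)
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
        "changed": sorted(
            path
            for path in old_paths & new_paths
            if old_hashes[path] != new_hashes[path]
        ),
    }
```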

Automatically detect previous hashes used in existing manifests

When comparing previous hashes in a manifest, Furtive needs to automatically determine which hash algorithm was used previously and reuse that hash algorithm no matter what is set in self.hash_algorithm.

Alternatively, Furtive could hash the file using the detected algorithm and, if the hashes match, rehash it using the algorithm set in self.hash_algorithm and store the new hash in the manifest.
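
A sketch of both ideas, guessing the algorithm from the length of the hex digest. This is only a heuristic: different algorithms can share a digest length, so the table below assumes the manifest only ever used the MD5 and SHA-1/SHA-2 families. The function names are hypothetical.

```python
import hashlib

# Hex digest length -> algorithm name (assumes MD5 / SHA-1 / SHA-2 only).
HEX_LENGTH_TO_ALGORITHM = {
    32: "md5",
    40: "sha1",
    56: "sha224",
    64: "sha256",
    96: "sha384",
    128: "sha512",
}


def detect_algorithm(hex_digest):
    """Guess the algorithm from the digest length; None if unknown."""
    return HEX_LENGTH_TO_ALGORITHM.get(len(hex_digest))


def verify_and_rehash(data, stored_digest, preferred="sha256"):
    """Verify data with the detected algorithm, then rehash on success.

    Returns (matches, new_digest); new_digest is None when the check fails.
    """
    algorithm = detect_algorithm(stored_digest)
    if algorithm is None:
        raise ValueError("cannot detect algorithm for %r" % stored_digest)
    if hashlib.new(algorithm, data).hexdigest() != stored_digest:
        return False, None
    return True, hashlib.new(preferred, data).hexdigest()
```

Here preferred plays the role of self.hash_algorithm. Real code would hash the file in chunks rather than take the whole content as data.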

Change manifest storage to use YAML

sqlite3 seems a little complicated for this purpose. The YAML format seems sufficient for the purposes of this application. YAML is also readable by humans.

Add ability to exclude files from manifest

I think the first step should be to implement the ignoring of files during file discovery. The first step should not be to record the exclusions but to require that the same exclusions be provided during subsequent executions of furtive. If exclusions are not provided, then files which were previously ignored will be treated as new files. Conversely, if exclusions are added to subsequent executions of furtive, then those files matching the exclusion patterns will be treated as deleted files.
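
A minimal sketch of ignoring files during discovery, using shell-style patterns from the standard fnmatch module. The function name and the exclude_patterns parameter are hypothetical; nothing is recorded in the manifest, matching the first step described above.

```python
import fnmatch
import os


def discover_files(directory, exclude_patterns=()):
    """Walk directory and return sorted file paths, skipping excluded ones.

    A path is skipped when it matches any shell-style pattern,
    for example "*/.git/*" or "*.tmp".
    """
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if any(fnmatch.fnmatch(path, pattern) for pattern in exclude_patterns):
                continue
            found.append(path)
    return sorted(found)
```

Because the patterns are not stored, running once with ["*/.git/*"] and later without it would report the .git files as new, exactly as the issue describes.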

Fix s3 expiration prefix

Furtive adds the prefix "/" to the lifecycle expiration rule, which does not work. The prefix should be "" (an empty string).

Fix output of YAML

furtive currently prints out an ugly form of YAML to the terminal upon completion.

(furtive)Derricks-MacBook-Air:furtive derrickbryant$ furtive --manifest .test_manifest.db compare
INFO:root:Generating temporary updated manifest.
INFO:root:Discovering files in . and adding to processing queue
added:
- ./.git/objects/7a/b159f38e9347f388cd91d8274975f697ad4937
- ./.git/objects/0f/e0c120460aa853de717df8404b5743a635670a
- ./.git/objects/d0/886b59a4ef66cf77a407471c4868e63a43e30e
- ./.git/objects/7d/9c81ec309c0acf66586aa1b5e3447c6e9cca9c
- ./.git/objects/fd/c38b9970ca741afbd71d81c458f8be61e05e6b
- ./.git/objects/ed/3fda7e67467c3ad053945912fe427601f4cdbc
- ./.git/objects/45/fb87956f7e9cf9ce0a72644d0446cef2c528a0
- ./.git/objects/db/c520c7d3be3b94bfc0164ec2988314f5c78562
- ./.git/objects/07/f52b6e2e0319fc33f4377f31d2837d68ba78db
- ./.git/objects/85/014f2645c95089d8ddd480bf7a86f067c771cb
- ./.git/objects/50/d8ecc301a67d8181407636f7f2b4c0443e80ca
- ./.git/objects/f6/e80b435183fef170895f9802160ca25d9b41c4
changed:
- !!python/unicode './.git/logs/HEAD'
- !!python/unicode './.git/logs/refs/heads/9'
- !!python/unicode './.travis.yml'
- !!python/unicode './.git/refs/remotes/origin/9'
- !!python/unicode './.git/refs/heads/9'
- !!python/unicode './.git/logs/refs/remotes/origin/9'
- !!python/unicode './.test_manifest.db'
- !!python/unicode './.git/index'
- !!python/unicode './furtive/manifest.py'
removed:
- !!python/unicode './Furtive.py'
- !!python/unicode './hashDir.py'
- !!python/unicode './docs/Furtive.html'

It should print only the string values not the !!python/unicode crap before it.

Create test scripts

Need to create some test scripts to ensure the module is functioning as expected. I have already uploaded some test data into the test-data directory.

Manifest to archive feature

It would be nice to have a feature or script which could take a manifest and create an archive (zip, tar, xz).
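
For the tar case, a rough sketch with the standard tarfile module is below. It assumes the list of paths has already been read from the manifest; the function name and its parameters are hypothetical.

```python
import os
import tarfile


def archive_files(paths, base_dir, archive_path, compression="gz"):
    """Write the given files into a tar archive.

    paths are stored relative to base_dir. compression may be
    "gz", "bz2", "xz", or "" for an uncompressed tar.
    """
    mode = "w:" + compression if compression else "w"
    with tarfile.open(archive_path, mode) as archive:
        for path in paths:
            archive.add(path, arcname=os.path.relpath(path, base_dir))
    return archive_path
```

Zip output would need the separate zipfile module, so a real feature would probably dispatch on the requested format.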

Explore de-duplication

Perhaps run de-duplication on files residing on the file system or maybe just de-dup files as they are being uploaded to Glacier to save space and money. For either option, there should be a way to store information about the files removed within the manifest.
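
Since the manifest already holds a hash per file, finding candidates is cheap. A sketch, assuming the manifest can be read as a dict of path to hash (the function name is hypothetical):

```python
from collections import defaultdict


def find_duplicates(hashes):
    """Group paths that share a hash.

    hashes maps path -> hash. Returns {hash: [paths]} containing only
    the hashes that occur more than once.
    """
    by_digest = defaultdict(list)
    for path, digest in hashes.items():
        by_digest[digest].append(path)
    return {
        digest: sorted(paths)
        for digest, paths in by_digest.items()
        if len(paths) > 1
    }
```

Matching hashes only suggest identical content; comparing the bytes before deleting or skipping a file would be the cautious choice, and the returned grouping is the kind of information that could be stored in the manifest for skipped files.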

Prepend a directory for Glacier uploads

upload_manifest.py should have an option such as --prepend that acts like a directory within Glacier in which to place the manifest.

Example:

upload_manifest.py --dir /path/to/manifest --vault "pictures-backup" --prepend "october"

would upload manifest to a directory called october in the vault pictures-backup.
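
A sketch of how the option might be parsed and applied. It assumes the "directory" is just a prefix joined onto each file's name before upload; the archive_name helper is hypothetical, and the flags --dir and --vault are taken from the example above.

```python
import argparse
import posixpath


def build_parser():
    """Parser with the flags from the example plus a --prepend option."""
    parser = argparse.ArgumentParser(description="Upload a manifest to Glacier")
    parser.add_argument("--dir", required=True, help="path to the manifest")
    parser.add_argument("--vault", required=True, help="Glacier vault name")
    parser.add_argument("--prepend", default="", help="prefix for uploaded names")
    return parser


def archive_name(prepend, relative_path):
    """Join the prefix and the file's relative path with forward slashes."""
    return posixpath.join(prepend, relative_path) if prepend else relative_path
```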

Loading the YAML file is too slow

As a user, I want furtive to begin hashing files as soon as the YAML line is read. Currently, the whole YAML file needs to be read before hashing begins. This can take many seconds.

MemoryError

It seems as though there is a memory error when trying to hash a lot of large files. The last file below is 7.5GB and there is ~2.5 GB of RAM free.

Output:

Tue Dec  8 02:16:06 2015 EST [DEBUG]: Hash for /smb/Pictures/2011/Flight - Culpeper and Back/940_0182.MOV: cbc95ae47af59137b94ca6786903c440
Tue Dec  8 02:16:06 2015 EST [DEBUG]: Starting Hash of /smb/Pictures/2011/Flight - Culpeper and Back/Merged.wmv
Traceback (most recent call last):
  File "/usr/local/bin/furtive", line 56, in <module>
    main()
  File "/usr/local/bin/furtive", line 50, in main
    furtive.create()
  File "/usr/local/lib/python2.7/dist-packages/furtive/__init__.py", line 35, in create
    self.manifest.create()
  File "/usr/local/lib/python2.7/dist-packages/furtive/manifest.py", line 26, in create
    self.manifest = HashDirectory(self.directory).hash_files()
  File "/usr/local/lib/python2.7/dist-packages/furtive/hasher.py", line 75, in hash_files
    results = pool.map(hash_task, files_to_hash)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
MemoryError
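
The traceback does not show where inside the worker the memory ran out. If hash_task reads each file into memory in one go (an assumption, not confirmed here), hashing in fixed-size chunks would keep memory use flat regardless of file size. A sketch, with a hypothetical function name and an md5 default chosen only because the hash in the log above is 32 hex characters:

```python
import hashlib


def hash_file(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file by reading it in chunks of chunk_size bytes."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

At most chunk_size bytes per worker are held at once, so a 7.5 GB file needs no more memory than a small one.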

Bug while uploading many files to Glacier

89.7% 1018 of 1135
Traceback (most recent call last):
  File "upload_manifest.py", line 116, in <module>
    main()
  File "upload_manifest.py", line 99, in main
    vault.create_archive_from_file(file,f)
  File "/usr/local/lib/python2.7/boto/glacier/vault.py", line 141, in create_archive_from_file
    writer.close()
  File "/usr/local/lib/python2.7/boto/glacier/writer.py", line 164, in close
    self._uploaded_size)
  File "/usr/local/lib/python2.7/boto/glacier/layer1.py", line 513, in complete_multipart_upload
    response_headers=response_headers)
  File "/usr/local/lib/python2.7/boto/glacier/layer1.py", line 78, in make_request
    data=data)
  File "/usr/local/lib/python2.7/boto/connection.py", line 910, in make_request
    return self._mexe(http_request, sender, override_num_retries)
  File "/usr/local/lib/python2.7/boto/connection.py", line 872, in _mexe
    raise e
socket.gaierror: [Errno -2] Name or service not known
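
The error is a failed DNS lookup, which is often transient during a long upload. One possible mitigation, sketched below, is to retry the failing call with a growing delay. The retry helper and its parameters are hypothetical, and the operation argument stands in for the upload call.

```python
import socket
import time


def retry(operation, attempts=5, initial_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on socket.gaierror with doubling delays.

    Re-raises the last error once attempts are exhausted.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except socket.gaierror:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2
```

In upload_manifest.py the wrapped call would be something like retry(lambda: vault.create_archive_from_file(file, f)), keeping in mind that the file object may need to be reopened or rewound before each attempt.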

Make Furtive Python 3 compatible

As a user, I want to be able to use furtive with Python 2.7 and 3.x so that I can continue using furtive when I eventually switch to Python 3.x.
