
bagit's People

Contributors

hakbailey, richardrodgers

Forkers

hakbailey

bagit's Issues

Add support for xz compression (in tars)

There might be circumstances where one would be willing to trade speed and memory consumption for a smaller archive, and this format seems more widely supported than the alternatives.

Have archive files end with identifiable suffixes

The decision to offer variant archive formats (with and without timestamps) was good, but the file naming scheme was not. Instead of {bagname}.{format}.{variant}, it really should be simply {bagname}.{format}, so tools can find the '.zip' (or '.tgz') as a suffix they recognize, not '.nt', which has no standard meaning.

The result will now be simpler names:

bag1.zip
bag2.tgz

and the variant type will not be expressed in the filename at all (which is fragile anyway, since the bags often travel over networks where the name is not preserved). This will require a small change in the API to force the intended variant at time of archiving.

Path encoding bug

I recently discovered that the BagIt 1.0 specification requires that CR, LF, and % in file paths within manifest files are percent-encoded, and that there isn't a single BagIt implementation that does this correctly. Implementations either only encode CR and LF but not % or they encode nothing.

This implementation does not encode paths in the manifest, which means that it would fail to validate BagIt 1.0 bags that include file paths containing CR, LF, or %. Likewise, it would create bags that would fail BagIt 1.0 validation in the case that there are paths that naturally contain percent-encoded characters.

For example, let's say a bag contains the file data/file%0A1.txt. Per the spec, this file should be written to the manifest as data/file%250A1.txt. However, this implementation writes it as data/file%0A1.txt. This means that when this implementation validates a properly constructed 1.0 bag, it will look for the file data/file%250A1.txt, which does not exist. Similarly, if another implementation that follows the spec attempts to validate a bag produced by this implementation, it would look for data/file\n1.txt, which does not exist.
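The encoding rule described above can be sketched as follows. This is a minimal illustration, not this library's code: the class name ManifestPathCodec and its methods are hypothetical, and the only escapes handled are the three discussed in this issue (% as %25, CR as %0D, LF as %0A).

```java
final class ManifestPathCodec {
    private ManifestPathCodec() {}

    /** Encodes a payload path for writing into a manifest line. */
    static String encode(String path) {
        // '%' must be escaped first, otherwise the '%' of the CR/LF
        // escapes added below would be escaped a second time.
        return path.replace("%", "%25")
                   .replace("\r", "%0D")
                   .replace("\n", "%0A");
    }

    /** Decodes a manifest entry back into a payload path in a single pass. */
    static String decode(String entry) {
        StringBuilder out = new StringBuilder(entry.length());
        int i = 0;
        while (i < entry.length()) {
            if (entry.regionMatches(true, i, "%0A", 0, 3)) {
                out.append('\n');
                i += 3;
            } else if (entry.regionMatches(true, i, "%0D", 0, 3)) {
                out.append('\r');
                i += 3;
            } else if (entry.startsWith("%25", i)) {
                out.append('%');
                i += 3;
            } else {
                out.append(entry.charAt(i++));
            }
        }
        return out.toString();
    }
}
```

With this sketch, the file name from the example encodes to data/file%250A1.txt and decodes back to data/file%0A1.txt, while the unencoded entry data/file%0A1.txt decodes to a name containing a literal line feed.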

It would seem desirable to me to move the ecosystem in the direction of properly implementing the 1.0 specification, while at the same time acknowledging that there are a large number of 1.0 bags in existence that may then become invalid.

As such, it may be prudent to fall back on a series of tests when validating bags. You may want to first attempt to validate per the spec, and then, if a file cannot be found, attempt to locate it by either decoding only the CR and LF or leaving the path unchanged, ideally validating all of the files using the same method.
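One way to picture this fallback is to try each interpretation in turn and keep the first one under which every manifest entry resolves to an existing file. The sketch below is an assumption about how that could look, not code from this library: PathStrategy and firstThatResolvesAll are hypothetical names, and the file-existence check is passed in as a predicate so the logic stays independent of the file system.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

enum PathStrategy {
    /** Decode %25, %0D and %0A, as the 1.0 spec requires. */
    SPEC {
        @Override String resolve(String entry) { return decode(entry, true); }
    },
    /** Decode only %0D and %0A, as some implementations write them. */
    CRLF_ONLY {
        @Override String resolve(String entry) { return decode(entry, false); }
    },
    /** Use the manifest entry unchanged. */
    RAW {
        @Override String resolve(String entry) { return entry; }
    };

    abstract String resolve(String entry);

    static String decode(String entry, boolean decodePercent) {
        StringBuilder out = new StringBuilder(entry.length());
        int i = 0;
        while (i < entry.length()) {
            if (entry.regionMatches(true, i, "%0A", 0, 3)) {
                out.append('\n');
                i += 3;
            } else if (entry.regionMatches(true, i, "%0D", 0, 3)) {
                out.append('\r');
                i += 3;
            } else if (decodePercent && entry.startsWith("%25", i)) {
                out.append('%');
                i += 3;
            } else {
                out.append(entry.charAt(i++));
            }
        }
        return out.toString();
    }

    /** Returns the first strategy under which all entries exist, if any. */
    static Optional<PathStrategy> firstThatResolvesAll(List<String> entries,
                                                       Predicate<String> exists) {
        for (PathStrategy strategy : values()) {
            if (entries.stream().map(strategy::resolve).allMatch(exists)) {
                return Optional.of(strategy);
            }
        }
        return Optional.empty();
    }
}
```

Choosing one strategy for the whole manifest, rather than per file, follows the suggestion above to validate all of the files using the same method.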

I have not examined fetch.txt implementations, but the same encoding requirements exist for paths in that file as well. This is potentially a thornier problem to address in a backward compatible way as it is unclear if the path data/file%250A1.txt is supposed to create data/file%250A1.txt (incorrect) or data/file%0A1.txt (correct).

Finally, I created a related ticket against the spec discussing this encoding problem, in particular how it breaks checksum utility compatibility.

Fails to read tag files with encoding other than UTF-8

https://github.com/richardrodgers/bagit/blob/master/src/main/java/edu/mit/lib/bagit/Bag.java#L412

Reading will fail, for example, when the manifest is written using UTF-16 or some other encoding. As per the BagIt specification https://tools.ietf.org/html/draft-kunze-bagit-14#section-2.1.1 the encoding should be read from the bagit.txt file and used for reading the other tag files.
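A minimal sketch of that lookup is shown here. It is not this library's code: the class TagEncoding and the method fromDeclaration are hypothetical, and falling back to UTF-8 when the Tag-File-Character-Encoding line is missing is an assumption made for the sketch (a stricter validator might reject such a bag instead).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class TagEncoding {
    static final String KEY = "Tag-File-Character-Encoding";

    private TagEncoding() {}

    /**
     * Finds the charset declared in the lines of bagit.txt.
     * Falls back to UTF-8 if no declaration is present (an assumption).
     */
    static Charset fromDeclaration(List<String> bagitTxtLines) {
        for (String line : bagitTxtLines) {
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase(KEY)) {
                // Charset.forName throws an unchecked exception for unknown names
                return Charset.forName(line.substring(colon + 1).trim());
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```

A caller would read bagit.txt itself as UTF-8, for instance with Files.readAllLines(path, StandardCharsets.UTF_8), and then pass the returned charset to Files.newBufferedReader when opening manifests and other tag files.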

The Library of Congress has a conformance suite that you can test your library against:
https://github.com/LibraryOfCongress/bagit-conformance-suite/tree/master/v0.97/valid/UTF-16-encoded-tag-files

Support multiple manifest files and newer checksum algorithms

The library's current default checksum algorithm, MD5, is deprecated as of version 1.0 of the BagIt spec. Provide at least SHA-256 and SHA-512, in addition to the current MD5 and SHA-1.

Also, the spec allows multiple manifest and tagmanifest files per bag, each using a different checksum algorithm. Supporting this will entail API changes.
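To illustrate what multi-manifest support involves, the sketch below computes several checksums of one payload stream in a single read, using the standard java.security.MessageDigest API. The class MultiDigest and its methods are hypothetical names, not part of this library, and checked exceptions are wrapped in unchecked ones only to keep the example short.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class MultiDigest {
    private MultiDigest() {}

    /** Reads the stream once and returns algorithm name -> lowercase hex checksum. */
    static Map<String, String> checksums(InputStream in, List<String> algorithms) {
        Map<String, MessageDigest> digests = new LinkedHashMap<>();
        try {
            for (String algorithm : algorithms) {
                digests.put(algorithm, MessageDigest.getInstance(algorithm));
            }
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                for (MessageDigest digest : digests.values()) {
                    digest.update(buffer, 0, read);
                }
            }
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unsupported algorithm", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, MessageDigest> entry : digests.entrySet()) {
            result.put(entry.getKey(), toHex(entry.getValue().digest()));
        }
        return result;
    }

    /** Maps a JCA algorithm name such as "SHA-256" to "manifest-sha256.txt". */
    static String manifestName(String algorithm) {
        return "manifest-" + algorithm.toLowerCase(Locale.ROOT).replace("-", "") + ".txt";
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Each entry of the returned map would become one line in the manifest file named by manifestName for that algorithm, so a bag with SHA-256 and SHA-512 manifests needs only one pass over each payload file.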

Retire travisCI in favor of GitHub actions

Travis seems to be moving away from free support for open-source projects.
Besides adding actions/workflow docs, we might have to incorporate all the Gradle wrapper cruft (gradlew, gradlew.bat, gradle/wrapper, etc.) into the repo; it seems that Actions (unlike Travis) expects repo-local Gradle files.

Add programmatic EOL control

Library currently uniformly applies Unix-style line separators ('\n') when generating text files in bags. While this is spec-legal, it would be more flexible to give the client the ability to specify a preference.

Approach is to offer 4 alternatives - passed to the Filler constructor:

PLATFORM (default): use platform-defined separator (\n for *nix, \r\n for Win)
COUNTER_PLATFORM: use opposite of platform-defined separator
UNIX: use *nix-style new-line character
WINDOWS: use CR/LF
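The four alternatives listed above could be modelled as an enum along these lines. This is a sketch under assumptions: the type name EolRule and the separator methods are hypothetical and not confirmed by this issue, and any platform separator other than \r\n is treated as Unix-style.

```java
enum EolRule {
    PLATFORM,
    COUNTER_PLATFORM,
    UNIX,
    WINDOWS;

    /** Separator to use on the JVM's current platform. */
    String separator() {
        return separator(System.lineSeparator());
    }

    /** Same rule, with the platform separator passed in so it can be tested. */
    String separator(String platformSeparator) {
        boolean platformIsWindows = "\r\n".equals(platformSeparator);
        switch (this) {
            case UNIX:
                return "\n";
            case WINDOWS:
                return "\r\n";
            case COUNTER_PLATFORM:
                return platformIsWindows ? "\n" : "\r\n";
            case PLATFORM:
            default:
                return platformIsWindows ? "\r\n" : "\n";
        }
    }
}
```

A constructor taking such a value could default to PLATFORM when the client passes nothing, which matches the default named in the list.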

Add file signature checks for archives

Currently uses crude suffix test for archives - if file ends with 'zip' it's a Zip archive. Better to at least check some magic numbers on the archive.
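A magic-number check of the kind suggested could look like the sketch below. The class ArchiveSniffer is a hypothetical name, not part of this library. It relies on two well-known signatures: a ZIP local file header starts with the bytes 'P', 'K', 0x03, 0x04, and a gzip stream (which covers .tgz) starts with 0x1F, 0x8B.

```java
final class ArchiveSniffer {
    enum Kind { ZIP, GZIP, UNKNOWN }

    private ArchiveSniffer() {}

    /** Classifies an archive from its first few bytes. */
    static Kind sniff(byte[] header) {
        // ZIP local file header signature. An empty ZIP archive starts with
        // 'P','K',0x05,0x06 instead and is reported as UNKNOWN by this sketch.
        if (header.length >= 4
                && header[0] == 'P' && header[1] == 'K'
                && header[2] == 0x03 && header[3] == 0x04) {
            return Kind.ZIP;
        }
        // gzip member header: ID1 = 0x1F, ID2 = 0x8B
        if (header.length >= 2
                && (header[0] & 0xFF) == 0x1F
                && (header[1] & 0xFF) == 0x8B) {
            return Kind.GZIP;
        }
        return Kind.UNKNOWN;
    }
}
```

The header bytes can be obtained by reading the first four bytes of the file, for example with InputStream.readNBytes(4) on Java 9 or later, before deciding which archive reader to use.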

New handling of fetch.txt entries

Version 1.0 adds new requirements for payload files referenced in fetch.txt. The primary one is that all files in fetch.txt must also appear in (all) payload manifests. This introduces a number of complexities, foremost of which is that files for which only a URL, a bag location and (optionally) a length is known cannot be added to a bag, since the manifests will also need checksum(s).

Even worse, in light of the Spec recommendation in section 2.4: "Implementers are encouraged to simplify the process of adding additional manifests using new algorithms to streamline the process of in-place upgrades", we note that it is impossible to deduce a new checksum from any existing one, so the recommended in-place upgrades cannot be performed when a bag contains fetch.txt contents, given this new requirement.

We will add an API call that will allow (given actual file contents, not the URI only) all payload manifests to be properly assigned for fetch.txt files. Essentially, it will be identical to a payload addition, but the file will not be copied into the bag. Other 'unsafe' calls may be considered.

Improve jar packaging

Should add support for making the jar 'executable' (invoking the 'Bagger' command-line tool), and possibly also build a fat jar for simple deployment.
