vbatts / tar-split

checksum-reproducible tar archives (utility/library)

License: BSD 3-Clause "New" or "Revised" License

Go 100.00%

tar-archive payload disassembly golang checksum

tar-split's Introduction

tar-split

Pristinely disassembling a tar archive, and stashing needed raw bytes and offsets to reassemble a validating original archive.

Docs

Code API for libraries provided by tar-split:

Install

The command line utility is installable via:

go get github.com/vbatts/tar-split/cmd/tar-split

Usage

For CLI usage, see its README.md. For the library, see the docs.

Demo

Basic disassembly and assembly

This demonstrates the tar-split command and how to assemble a tar archive from the tar-data.json.gz

youtube video of basic command demo

Docker layer preservation

This demonstrates the tar-split integration for docker-1.8, which provides consistent tar archives for the image layer content.

youtube video of docker layer checksums

Caveat

Eventually this should detect tar archives for which precise reassembly is not possible.

For example, stored sparse files that have "holes" in them will be read as a contiguous file, though the archive contents may be recorded in sparse format. Therefore, when adding the file payload to a reassembled tar, the file payload would need to be precisely re-sparsified to achieve identical output. This is not something I seek to fix immediately, but I would rather have an alert that precise reassembly is not possible. (See more: http://www.gnu.org/software/tar/manual/html_node/Sparse-Formats.html)

Another caveat: while tar archives support having multiple file entries for the same path, we will not support this feature. If there is more than one entry with the same path, expect an err (like ErrDuplicatePath) or a resulting tar stream that does not validate your original checksum/signature.

Contract

Do not break the API of stdlib archive/tar in our fork (ideally find an upstream mergeable solution).

Std Version

The version of golang stdlib archive/tar is from go1.11. It is minimally extended to expose the raw bytes of the TAR, rather than just the marshalled headers and file stream.

Design

See the design.

Stored Metadata

Since the raw bytes of the headers and padding are stored, you may be wondering what the size implications are. The headers are at least 512 bytes per file (sometimes more), at least 1024 null bytes on the end, and then various padding. With a naive storage implementation, the stored metadata therefore grows linearly with the number of files.

First we'll get an archive to work with. For repeatability, we'll make an archive from what you've just cloned:

git archive --format=tar -o tar-split.tar HEAD .

$ go get github.com/vbatts/tar-split/cmd/tar-split
$ tar-split checksize ./tar-split.tar
inspecting "tar-split.tar" (size 210k)
 -- number of files: 50
 -- size of metadata uncompressed: 53k
 -- size of gzip compressed metadata: 3k

So assuming you've managed the extraction of the archive yourself, for reuse of the file payloads from a relative path, then the only additional storage implications are as little as 3kb.

But let's look at a larger archive, with many files.

$ ls -sh ./d.tar
1.4G ./d.tar
$ tar-split checksize ~/d.tar 
inspecting "/home/vbatts/d.tar" (size 1420749k)
 -- number of files: 38718
 -- size of metadata uncompressed: 43261k
 -- size of gzip compressed metadata: 2251k

Here, an archive with 38,718 files has a compressed footprint of about 2mb.

Rolling the null bytes at the end of the archive into the total, we will assume a bytes-per-file rate for the storage implications.

uncompressed	compressed
~ 1kb per/file	0.06kb per/file
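The per-file figures in the table can be reproduced from the checksize output for the larger archive above (43261k uncompressed and 2251k compressed metadata across 38718 files). The helper name perFileKB is just for this sketch.

```go
package main

import "fmt"

// perFileKB returns the metadata size per archive entry, in kilobytes.
func perFileKB(metadataKB, files float64) float64 {
	return metadataKB / files
}

func main() {
	const files = 38718 // number of files reported by checksize

	fmt.Printf("uncompressed: %.2f kB/file\n", perFileKB(43261, files)) // 1.12
	fmt.Printf("compressed:   %.2f kB/file\n", perFileKB(2251, files))  // 0.06
}
```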

What's Next?

More implementations of storage Packer and Unpacker
More implementations of FileGetter and FilePutter
It would be interesting to have an assembler stream that implements io.Seeker

License

See LICENSE

tar-split's Issues

Work-around `tar.TypeGNUSparse`

Perhaps the best work around here, will be to have an entry type, that is like Segment and File, but stashes the whole raw bytes of a sparse file entry. This could make the tar-data.json substantially larger, but could preserve the integrity of sparse files, rather than inflating them.

Panic in archive/tar/reader.go

func (tr *Reader) Next() (*Header, error) can panic because of the code added for RawAccounting.

This block in readHeader https://github.com/vbatts/tar-split/blob/master/archive/tar/reader.go#L616 can allow readHeader to return nil with tr.err == nil. This is not allowed by the readHeader function since it causes a panic here de-referencing the header value.

I am not sure if and where the function should return after it hits this condition or whether it is intended to return at all. I would think the correct thing to do is reset tr.err back to io.EOF and allow the nil return.

Doesn't support tar entries > 8gb

tar-split is used by docker and makes it impossible to build an image that contains some layers that are >8gb.

See the issue in Docker: moby/moby#37581
See the code to reproduce the issue out of Docker: https://gist.github.com/dgageot/0007b4cbfa08f7cf95e93ba0db3bc12a
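For context, the 8 GB boundary coincides with the largest size that fits in the octal size field of a classic tar header: 12 bytes, of which 11 hold octal digits. Larger entries need an extended encoding such as GNU base-256 or a PAX record. Whether that is the root cause of this particular issue is an assumption here, not something the report states. The function name maxOctalSize is illustrative.

```go
package main

import (
	"fmt"
	"strconv"
)

// maxOctalSize returns the largest value representable by the 11 octal
// digits of a classic tar header size field.
func maxOctalSize() int64 {
	max, err := strconv.ParseInt("77777777777", 8, 64)
	if err != nil {
		panic(err) // cannot happen for this constant input
	}
	return max
}

func main() {
	fmt.Println(maxOctalSize())            // 8589934591
	fmt.Println(maxOctalSize() == 8<<30-1) // true: one byte short of 8 GiB
}
```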

needs benchmarking baked in

Looking over #22, I see that we have not made enough use of testing Benchmarks.

sentry entry type for versioning

update `archive/tar` to go1.5, once it is released

$subject

"pre" and "post" values are incorrect

Take this excerpt from main.go:

                pre := tr.RawBytes()
                output.Write(pre)
                sum += int64(len(pre))

                var i int64
                if i, err = io.Copy(output, tr); err != nil {
                    log.Println(err)
                    break
                }
                sum += i

                // I've never seen this be populated
                post := tr.RawBytes()
                output.Write(post)
                sum += int64(len(post))

The // I've never seen this be populated comment can be explained because the first call:

pre := tr.RawBytes()

returns all of the raw bytes and then resets the rawbytes buffer:

func (tr *Reader) RawBytes() []byte {
    if !tr.RawAccounting {
        return nil
    }
    if tr.rawBytes == nil {
        tr.rawBytes = bytes.NewBuffer(nil)
    }
    // if we've read them, then flush them.
    defer tr.rawBytes.Reset()
    return tr.rawBytes.Bytes()
}
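The drain-on-read behaviour can be shown with a small stand-in type. rawRecorder and its methods are hypothetical names for this sketch and are not part of the forked archive/tar. Unlike the excerpt above, the sketch copies the bytes before resetting, so the returned slice does not alias the buffer.

```go
package main

import (
	"bytes"
	"fmt"
)

// rawRecorder is a stand-in for the reader's raw-byte accounting.
type rawRecorder struct {
	buf bytes.Buffer
}

// record appends bytes as they are consumed from the underlying stream.
func (r *rawRecorder) record(p []byte) {
	r.buf.Write(p)
}

// rawBytes returns a copy of everything recorded so far and clears the
// buffer, so a second call with no reads in between returns nothing.
func (r *rawRecorder) rawBytes() []byte {
	out := append([]byte(nil), r.buf.Bytes()...)
	r.buf.Reset()
	return out
}

func main() {
	var r rawRecorder
	r.record(make([]byte, 512)) // one header block

	pre := r.rawBytes()
	post := r.rawBytes() // nothing was recorded since the previous call

	fmt.Println(len(pre), len(post)) // 512 0
}
```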

ALSO

I would assume that the pre value is the bytes from the beginning of an entry header to the ending of any padding before the file entry contents, and that the post value is the padding bytes from the ending of the file entry contents to the end of the next 512 bytes block.

This is not currently the case. Take this excerpt from the example in the README:

$ ./main tar-split.tar
2015/02/20 15:00:58 writing "tar-split.tar" to "tar-split.tar.out"
pax_global_header pre: 512 read: 52 post: 0
LICENSE pre: 972 read: 1075 post: 0
README.md pre: 973 read: 1004 post: 0
..

The first entry, pax_global_header, has 512 bytes of pre data. This is the tar header, which is always a multiple of 512 bytes. The read portion is 52 bytes (the size of the file), and post is 0 bytes, but it should be 512 - (read % 512) -> 460 bytes of padding!

Where did these 460 bytes of padding go? They went to the pre of the next entry! The LICENSE has 972 bytes of pre data. This is weird because it's not a multiple of 512 bytes. It's 512 bytes for the header + the 460 bytes of padding from the previous entry.
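The arithmetic above can be checked with a few lines of Go. The helper name padding is hypothetical. Note that the plain formula 512 - (read % 512) yields 512 for a file whose size is already a multiple of the block size, so the sketch applies a second modulo to return 0 in that case.

```go
package main

import "fmt"

const blockSize = 512

// padding returns how many NUL bytes follow a payload of the given size
// so that the next header starts on a 512-byte boundary.
func padding(size int64) int64 {
	return (blockSize - size%blockSize) % blockSize
}

func main() {
	fmt.Println(padding(52))   // 460: pax_global_header
	fmt.Println(padding(1075)) // 461: LICENSE
	fmt.Println(padding(1024)) // 0: already block-aligned

	// The reported "pre" of each entry is its 512-byte header plus the
	// padding that belongs to the previous entry.
	fmt.Println(blockSize + padding(52))   // 972: reported for LICENSE
	fmt.Println(blockSize + padding(1075)) // 973: reported for README.md
}
```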

failed to read from testdata/46af0962ab5afeb5ce6740d4d91652e69206fc991fd5328c1a94d364ad00e457/layer.tar: archive/tar: missed writing 3348 bytes

e489928 does not seem to pass the moby pkg/tarsum unit test when built on Fedora x86_64 with Fedora Go 1.10 & default flags

+ GOPATH=/builddir/build/BUILD/moby-7cfd3f4229c82ba61fa13a8818b8ecf58a2dcdbf/_build:/usr/share/gocode
+ go test -buildmode pie -compiler gc -ldflags '-extldflags '\''-Wl,-z,relro  '\'''
--- FAIL: TestTarSumRemoveNonExistent (0.00s)
        builder_context_test.go:27: failed to read from testdata/46af0962ab5afeb5ce6740d4d91652e69206fc991fd5328c1a94d364ad00e457/layer.tar: archive/tar: missed writing 3348 bytes
--- FAIL: TestTarSumRemove (0.00s)
        builder_context_test.go:57: failed to read from testdata/46af0962ab5afeb5ce6740d4d91652e69206fc991fd5328c1a94d364ad00e457/layer.tar: archive/tar: missed writing 3348 bytes
--- FAIL: TestTarSums (0.06s)
        tarsum_test.go:370: failed to copy from testdata/46af0962ab5afeb5ce6740d4d91652e69206fc991fd5328c1a94d364ad00e457/layer.tar: archive/tar: missed writing 3348 bytes
        tarsum_test.go:370: failed to copy from testdata/46af0962ab5afeb5ce6740d4d91652e69206fc991fd5328c1a94d364ad00e457/layer.tar: archive/tar: missed writing 3348 bytes
        tarsum_test.go:370: failed to copy from testdata/46af0962ab5afeb5ce6740d4d91652e69206fc991fd5328c1a94d364ad00e457/layer.tar: archive/tar: missed writing 3348 bytes
        tarsum_test.go:370: failed to copy from : archive/tar: missed writing 8192 bytes
        tarsum_test.go:363: failed to read 16KB from testdata/collision/collision-0.tar: archive/tar: missed writing 6 bytes
        tarsum_test.go:363: failed to read 16KB from testdata/collision/collision-1.tar: archive/tar: missed writing 6 bytes
        tarsum_test.go:363: failed to read 16KB from testdata/collision/collision-2.tar: archive/tar: missed writing 6 bytes
        tarsum_test.go:363: failed to read 16KB from testdata/collision/collision-3.tar: archive/tar: missed writing 6 bytes
        tarsum_test.go:370: failed to copy from : archive/tar: missed writing 8192 bytes
        tarsum_test.go:370: failed to copy from : archive/tar: missed writing 8192 bytes
        tarsum_test.go:370: failed to copy from : archive/tar: missed writing 8192 bytes
        tarsum_test.go:370: failed to copy from : archive/tar: missed writing 8192 bytes
        tarsum_test.go:370: failed to copy from : archive/tar: missed writing 8192 bytes
--- FAIL: TestIteration (0.00s)
        tarsum_test.go:511: archive/tar: missed writing 4 bytes
FAIL
exit status 1
FAIL    github.com/moby/moby/pkg/tarsum 0.086s

Empty entry added when no padding at end of file

I've run into an issue where an empty entry is added when reading a tar stream and I wanted to check if this was expected behaviour or not?

I have a scenario where this read operation

tar-split/tar/asm/disassemble.go

Line 130 in f966b14

n, err := outputRdr.Read(paddingChunk[:])

is returning n=0 and err=EOF

In other places in this code you are checking that the length of the payload is greater than 0

tar-split/tar/asm/disassemble.go

Lines 57 to 66 in f966b14

if b := tr.RawBytes(); len(b) > 0 {
    _, err := p.AddEntry(storage.Entry{
        Type:    storage.SegmentType,
        Payload: b,
    })
    if err != nil {
        pW.CloseWithError(err)
        return
    }
}

Should this length check also be applied when adding the padding chunk or is adding the empty entry desired?

Request for a new release tag

Can you please tag a release so the --compress change can be packaged? Also it looks like version/version.go is out of sync with the current release tag.

Thanks very much

Fix goreportcard errors

Happy to do this work in the next month or so.

Reference: https://goreportcard.com/report/github.com/vbatts/tar-split

I have a pretty cool script for this, icecrime/poule is using it now too:

https://github.com/box-builder/box/blob/master/checks.sh

This could be added to your test suite to ensure it doesn't go down again. It will put the report card at 100% when it passes.

Since the gocyclo & golint errors in particular will probably require some refactors, if you're willing to accept this request, we should probably discuss how the code should be reorganized first.

Hope this is useful, thanks.

[RFE] ability to support mtree(5) manifests

bsd man page - http://www.freebsd.org/cgi/man.cgi?mtree(8)
linux port - https://github.com/archiecobbs/mtree-port
libarchive supports this format as well - http://linux.die.net/man/5/libarchive-formats

Segmentation fault on macOS Sierra

I'm getting this error:

go build github.com/docker/libcompose/vendor/github.com/vbatts/tar-split/tar/storage: /usr/local/go/pkg/tool/darwin_amd64/compile: signal: segmentation fault
fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0xc04e1e3a37f pc=0x17cc60]

panic on slice bounds

Seems that there is an issue on function

func (fr *regFileReader) Read(b []byte) (n int, err error)

at line 718

Jul 13 11:53:35 ngcore rc.local[1907]: panic: runtime error: slice bounds out of range [:6620516960021273003] with capacity 32768
Jul 13 11:53:35 ngcore rc.local[1907]: goroutine 892 [running]:
Jul 13 11:53:35 ngcore rc.local[1907]: bufio.(*Reader).Read(0xc000790cc0, {0xc0016f8000, 0x12e8, 0xc0013a9bf8})
Jul 13 11:53:35 ngcore rc.local[1907]:         /usr/lib/go/src/bufio/bufio.go:238 +0x2ed
Jul 13 11:53:35 ngcore rc.local[1907]: io.(*teeReader).Read(0xc000135720, {0xc0016f8000, 0x470, 0x8000})
Jul 13 11:53:35 ngcore rc.local[1907]:         /usr/lib/go/src/io/io.go:560 +0x37

large amounts of padding cause crashes

This occurs because tar-split attempts to preserve the padding at the end of an archive, and it tries to store the padding in-memory. One solution would be to create multiple "chunked" SegmentTypes rather than one "big" one.

asm.go at /4th/tar-split/cmd/tar-split invalid argument c.Args() (type cli.Args) for len

$ go vet ./...

github.com/asellappen/tar-split/cmd/tar-split [github.com/asellappen/tar-split/cmd/tar-split.test]

cmd/tar-split/asm.go:15:8: invalid argument c.Args() (type cli.Args) for len
cmd/tar-split/asm.go:16:65: invalid argument c.Args() (type cli.Args) for len
cmd/tar-split/checksize.go:19:8: invalid argument c.Args() (type cli.Args) for len
cmd/tar-split/checksize.go:22:16: cannot range over c.Args() (type cli.Args)
cmd/tar-split/disasm.go:16:8: invalid argument c.Args() (type cli.Args) for len
cmd/tar-split/disasm.go:25:13: invalid operation: c.Args()[0] (type cli.Args does not support indexing)
cmd/tar-split/disasm.go:28:30: invalid operation: c.Args()[0] (type cli.Args does not support indexing)
cmd/tar-split/disasm.go:62:81: invalid operation: c.Args()[0] (type cli.Args does not support indexing)
cmd/tar-split/main.go:16:5: app.Author undefined (type *cli.App has no field or method Author

found the issue asm.go at /4th/tar-split/cmd/tar-split

func CommandAsm(c *cli.Context) {
    if len(c.Args()) > 0 {
        logrus.Warnf("%d additional arguments passed are ignored", len(c.Args()))
    }

support ISO-8859-1 filenames

Per golang's encoding/json.Marshal only utf-8 strings are supported, and everything else is silently replaced with U+FFFD ... https://golang.org/pkg/encoding/json/#InvalidUTF8Error

Report came from moby/moby#16516
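The replacement is easy to reproduce with the standard library alone. The sketch below round-trips a file name containing the ISO-8859-1 byte 0xE9 through encoding/json; the name itself and the helper roundTrip are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// roundTrip marshals a string to JSON and unmarshals it again.
func roundTrip(name string) (string, error) {
	data, err := json.Marshal(name)
	if err != nil {
		return "", err
	}
	var back string
	if err := json.Unmarshal(data, &back); err != nil {
		return "", err
	}
	return back, nil
}

func main() {
	name := "caf\xe9.txt" // 0xE9 is "é" in ISO-8859-1 and invalid as UTF-8

	fmt.Println(utf8.ValidString(name)) // false

	back, err := roundTrip(name)
	if err != nil {
		panic(err)
	}
	// Marshal reports no error, but the invalid byte has been replaced
	// with U+FFFD, so the original name cannot be recovered.
	fmt.Println(back == name)            // false
	fmt.Println(back == "caf\uFFFD.txt") // true
}
```

Because the original bytes are lost, a path stored this way can no longer be matched against the file on disk, which is consistent with the behaviour described in the report.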

Cannot process non-empty files

When I use tar-split to process the following files, some of them fail, and the ones that succeed are empty files. I don't know what the reason is. Is the method I use wrong?
Thanks.

➜  overlay-layers ls
08bf0890eb080513cf589d700af2212d52bb8c49dadce1fd249567bc759b9fb9.tar-split.gz  a94e0d5a7c404d0e6fa15d8cd4010e69663bd8813b5117fbad71365a73656df9.tar-split.gz
310473bb278192778884ce806a950f53f8e2fb2c78b4ad03fff6900a0eab83a3.tar-split.gz  c5183829c43c4698634093dc38f9bee26d1b931dedeba71dbee984f42fe1270d.tar-split.gz
3358360aedad76edf49d0022818228d959d20a4cccc55d01c32f8b62e226e2c2.tar-split.gz  cd7100a72410606589a54b932cabd804a17f9ae5b42a1882bd56d263e02b6215.tar-split
39bae602f7539e626f6894ecded6d12a6c56745dbab9aee940e258c766e090e8.tar-split.gz  cd7100a72410606589a54b932cabd804a17f9ae5b42a1882bd56d263e02b6215.tar-split.gz
3b9b1148398d96c80e5b97f862f629c63ac961b09d3c01915de8f9366ecb1e6d.tar-split.gz  d85277d1778ae2765061e24ba726b7339757615a48091acb3fdfa9aabf0fa442.tar-split.gz
422e2695ffd98a637663afa241a442b961a8341e3e2c2bb6174dfa67256ced81.tar-split.gz  e00c9229b481290f804888ad7256e46a65126c0fd6221363a185c5a950cb0428.tar-split.gz
60c8050bcd0ed88be60e3331986ca31073f6c00f3f4239c15e9e21adb17b6c53.tar-split.gz  layers.json
86d58c981bd8c4432d6047b7858a88af2fd2727fba4b53365afe3d4845ec0868.tar-split.gz  layers.lock
9653ff9b37fb90837818e156f728b8cad2affe7c99cde564a7dab9a383376617.tar-split.gz
➜  overlay-layers for i in *.gz; do sudo tar-split a --input $i --path . --output $i.tar
for> 
for> ;done
FATA[0000] open etc/apt/apt.conf.d/docker-autoremove-suggests: no such file or directory 
INFO[0000] created 310473bb278192778884ce806a950f53f8e2fb2c78b4ad03fff6900a0eab83a3.tar-split.gz.tar from . and 310473bb278192778884ce806a950f53f8e2fb2c78b4ad03fff6900a0eab83a3.tar-split.gz (wrote 3584 bytes) 
FATA[0000] open bin/bash: no such file or directory     
FATA[0000] open etc/DIR_COLORS: no such file or directory 
INFO[0000] created 3b9b1148398d96c80e5b97f862f629c63ac961b09d3c01915de8f9366ecb1e6d.tar-split.gz.tar from . and 3b9b1148398d96c80e5b97f862f629c63ac961b09d3c01915de8f9366ecb1e6d.tar-split.gz (wrote 14848 bytes) 
FATA[0000] open etc/apt/trusted.gpg: no such file or directory 
INFO[0000] created 60c8050bcd0ed88be60e3331986ca31073f6c00f3f4239c15e9e21adb17b6c53.tar-split.gz.tar from . and 60c8050bcd0ed88be60e3331986ca31073f6c00f3f4239c15e9e21adb17b6c53.tar-split.gz (wrote 6144 bytes) 
FATA[0000] open etc/apt/sources.list: no such file or directory 
FATA[0000] open run/systemd/container: no such file or directory 
FATA[0000] open bin/bash: no such file or directory     
FATA[0000] open bin/[: no such file or directory        
FATA[0000] open bin/busybox: no such file or directory  
FATA[0000] open root/.bash_history: no such file or directory 
FATA[0000] open bin/arch: no such file or directory

GNU @LongLink entries are not handled correctly

$ mkdir -p asfd/asdf/asdf/asfd/asdf/asdf/asdf/asdf/asdf/asdf/asdf/asdf/asdf/asdf/asdf/asdf/asdf/
$ touch asfd/asdf/asdf/asfd/asdf/asdf/asdf/asdf/asdf/asdf/asdf/asdf/asdf/asdf/asdf/asdf/asdf//axcdfasdfasdfasdfasdfasdfasdfasdfasdfasdf
$ tar cf longlink.tar ./asfd/
$ strings longlink.tar | grep -i longlink
././@LongLink
$ rm -rf asfd/
$ mkdir x
$ tar-split d ./longlink.tar | tar -C ./x -x
time="2015-08-03T14:14:10-04:00" level=info msg="created tar-data.json.gz from ./longlink.tar (read 20480 bytes)"
$ tar-split a --path ./x --output ./longlink.tar.1
INFO[0000] created ./longlink.tar.1 from ./x and tar-data.json.gz (wrote 20480 bytes) 
$ sha1sum longlink.tar*
d9f6babe107b7247953dff6b5b5ae31a3a880add  longlink.tar
3c0114d53cb60a597b733909dde206d6201a7da6  longlink.tar.1

if b := tr.RawBytes(); len(b) > 0 {
    _, err := p.AddEntry(storage.Entry{
        Type:    storage.SegmentType,
        Payload: b,
    })
    if err != nil {
        pW.CloseWithError(err)
        return
    }
}