ipld / go-car Goto Github PK

View Code? Open in Web Editor NEW

147.0 20.0 44.0 2.23 MB

A content addressible archive utility

License: Other

Go 100.00%

go-car's Introduction

go-car (go!)

Work with car (Content addressed ARchive) files!

This is a Golang implementation of the CAR specifications, both CARv1 and CARv2.

As a format, there are two major module versions:

go-car/v2 is geared towards reading and writing CARv2 files, and also supports consuming CARv1 files and using CAR files as an IPFS blockstore.
go-car, in the root directory, only supports reading and writing CARv1 files.

Most users should use v2, especially for new software, since the v2 API transparently supports both CAR formats.

Usage / Installation

This repository provides a car binary that can be used for creating, extracting, and working with car files.

To install the latest version of car, run:

go install github.com/ipld/go-car/cmd/car@latest

More information about this binary is available in cmd/car

Features

CARv2 features:

Generate index from an existing CARv1 file
Wrap CARv1 files into a CARv2 with automatic index generation.
Random-access to blocks in a CAR file given their CID via Read-Only blockstore API, with transparent support for both CARv1 and CARv2
Write CARv2 files via Read-Write blockstore API, with support for appending blocks to an existing CARv2 file, and resumption from a partially written CARv2 files.
Individual access to inner CARv1 data payload and index of a CARv2 file via the Reader API.

API Documentation

See docs on pkg.go.dev.

Examples

Here is a shortlist of other examples from the documentation

Maintainers

Contribute

PRs are welcome!

When editing the Readme, please conform to the standard-readme specification.

License

Apache-2.0/MIT © Protocol Labs

go-car's People

Contributors

Stargazers

Watchers

go-car's Issues

update the README after the wip-v2 branch is merged

At least it should explain what the difference is between the root and v2 modules, and how they're related and maintained.

Perhaps also why they both live in master with one as a subdirectory, as opposed to different branches.

cc @masih @willscott

blockstore: add an option to skip duplicate Puts by CID, mimicking carv1's Selective Writer API

Filecoin writes proofs into CAR files which are hashed, so we need their contents to be deterministic.

The way Filecoin currently generates those CARv1 files is via v1's selective writer API, which ensures canonical ordering via traversals, and also deduplicates by CID:

go-car/selectivecar.go

Lines 229 to 230 in 71cfa2f

 if !sct.cidSet.Has(c) { 

 sct.cidSet.Add(c)

For Ignite's current project, they receive blocks via graphsync, which ensures the order of blocks as per the IPLD selector, just like v1's selective writer. However, we might receive duplicate blocks from a client. When graphsync receives blocks they end up getting "Put" to our carv2 read-write blockstore.

If we want to be compatible, we should support deduplicating by CID. I propose a ReadWrite blockstore option for it, like DeduplicateByCID; if one calls Put on the same CID twice, the second call will simply do nothing and return a nil error.

In the future we could satisfy this need by porting Selective Writers to carv2 (#104), but that can't happen for another month or two.

I could also ask Ignite to implement a Blockstore wrapper that does this deduplication on Put calls, but deduplicating by CID also seems like a reasonable opt-in feature that others might want in the future. It wouldn't make the API significantly more complex or the read-write blockstore significantly slower, either.

notes on canonicalization, dag traversal order

@whyrusleeping said (in the filecoin community slack)

@jbenet i’ll do a pass on canonicalization for the car format, but one thing i’m thinking through right now is that you can’t just naively CAR things up and store them in a sector.

For example if you tried to store the filecoin blockchain this way, it would include all the state trees, and even (depending on a few things) all the data in the network

my question about traversal order is because we need to be able to have a specific traversal order to ensure CAR files are deterministically generated

so the thing being referenced needs to be a selector

@hannahhoward Does the ipld selectors work strictly define graph traversal order?

my question about traversal order is because we need to be able to have a specific traversal order to ensure CAR files are deterministically generated

blockstore: add an API to open either v1 or v2 as a read-only blockstore

Consider using base32 encoding.

It's faster (although larger) than base58.
It isn't case sensitive.

Improve resumption from files with unknown CARv1 payload size

this is unfortunate that partial writes that close uncleanly are now going to be something we aren't comfortable resuming.

It seems more likely that this is a case where the index has not been written at all, right?
It may be worth doing the slower recovery:

Scan through the blocks in the carV1
For each, see if the length is valid with regards to the length of the file
either re-hash each block and make sure the contents match the CID, or ideally build up the index while doing the scan of lengths, and then do that hashing backwards until you find a valid block.
if there is content after the last valid block, and it parses as an index, great
if there is at most 1 final invalid block, we crashed in the middle of writing it and can truncate to before that and resume.
otherwise we have an actual corrupt car and should force a new full export.

Originally posted by @willscott in #156 (comment)

ensure that writing DAGs backed by an ipld-prime BlockReadOpener is easy

Right now it's hard with v0; it requires duct taping interfaces, such as:

type readStore ipld.BlockReadOpener

func (rs readStore) Get(c cid.Cid) (blocks.Block, error) {
        link := cidlink.Link{Cid: c}
        r, err := rs(ipld.LinkContext{}, link)
        if err != nil {
                return nil, err
        }
        data, err := ioutil.ReadAll(r)
        if err != nil {
                return nil, err
        }
        return blocks.NewBlockWithCid(data, c)
}

func toReadStore(opener ipld.BlockReadOpener) car.ReadStore { 
        return readStore(opener)
}

This is because v0 wants a Get method similar to a Blockstore (https://pkg.go.dev/github.com/ipld/go-car#NewSelectiveCar), but blocks stored and linkified by ipld-prime are inside a https://pkg.go.dev/github.com/ipld/go-ipld-prime#BlockReadOpener. The two serve largely the same purpose, but the interfaces are slightly different.

If we want to continue maintaining v0 in the long term, that toReadStore helper should probably be exposed as part of its API.

v2 doesn't have this problem yet, as it had to give up anything similar to v0's selective car writer. When it does, we should also make sure it's easy to give it both a BlockReadOpener as well as a ReadStore.

Provide public API for CAR v2 index

we probably want a non-internal Index type - it sounds like the sharded blockstore would like to be able to detach indexes to store them separate from cars, and also be able to recombine a carv1 + index into a carv2

See: #88 (comment)

add opt-in support for treating null padding as EOF in v2

See #140 (comment).

Some downstreams did this via 11b6074, which never made it to master. We should add it to v1 and do a tag, but in a way that is opt-in, to ensure backwards compatibility.

feat: support .car files with no root nodes

Heya,

We’re using .car files as a block store format in a few new use cases and this bit here is giving us some trouble.

https://github.com/ipfs/go-car/blob/6bca8656ee37dcc2ed011117958f64427c2ef2ab/car.go#L109-L111

We’re storing very large files across multiple .car files. One of the .car files has a root node that describes all the chunks of the file and what other .car files the blocks can be retrieved at. But the .car files that just contain a bunch of raw blocks don’t have a logical root node.

Our options right now are to either add every block’s CID as a root, which seems unnecessary and a potential performance issue on read, or to create an unnecessary vector for all the links and put that in there as well, which is really just a bunch of unnecessary data we’re shoving into the file.

It would be nice if .car files could optionally not contain any root nodes at all.

add a test that verifies if ReadOnly.GetSize actually works

I think its CID reading offset math is wrong.

disable Travis

This repository is using the unified Go testing / checking workflow. We should disable Travis (and delete .travis.yml).
I don't have permission to edit CI settings. @Stebalien, do you?

Implement a CARv2 SelectiveCARAPI when clients have upgraded to go-ipld-prime v0.9.0

#102 removes the CARv1 dependency from the WIP CARv2 library and states the motivation for doing so i.e. to not force clients to upgrade to the breaking go-ipld-prime v0.9.0 release because the latest CARv1 release depends on go-ipld-prime v0.9.0.

Towards that goal, the CARv2 library also holds back from implementing a SelectiveCAR API for v2 as the SelectiveCAR implementation would need go-ipld-prime and we should be using the latest ipld prime release i.e. go-ipld-prime v0.9.0 to implement it rather than using an older IPLD prime release.

This issue is meant to capture the fact that we should implementing a CARv2 SelectiveCAR API when clients have upgraded to go-ipld-prime v0.9.0.

carv2/ReadWriteBlockstore: support deferred root CIDs

blockstore.OpenReadWrite requires providing the root CIDs when creating a new blockstore. This design inhibits the ability to use a CARv2 blockstore as the target of a streaming merkle DAG formation like UnixFS, as the root CID is not known beforehand.

We could work around this situation by supplying a placeholder root CID, and once the blockstore is finalized, we could go back and replace those bytes in the header.

Unfortunately, the library doesn't provide APIs to do that safely and without making assumptions about the underlying format, or breaking abstractions.

Some ideas:

Allow blockstore.OpenReadWrite to take a []struct{cid.Builder, func() (cid.Cid, error)}.
- Use the cid.Builders to compute the length of each CID, and preallocate those bytes in the header.
- On Finalize, call each function to get the actual CID to replace it in the header.
Simpler: allow the user to specify a number of bytes to preallocate, and provide primitives to update a CAR header.

Propagate options to internal CARv1 in CARv2

Propagate options to the internal CARv1 used in CARv2. Right now no options are propagated. There is a single option we are looking to propagate here which is ZeroLengthSectionAsEOF

decide what the blockstore semantics should be before tagging a stable release

There's a Slack thread going into how CARv2's blockstore should behave.

Whatever we agree to in there, we should reflect it in the code.

Also worth noting that the current default behavior is somewhat inconsistent: just like go-ipfs, "Has" reports matches by multihash. But unlike go-ipfs, Put doesn't deduplicate by hash, and AllKeysChan doesn't flatten CIDs to use the "raw" codec.

CARv2 blockstore throws "cannot resume from a finalized file"

After integrating the CARv2 blockstore resumption changes, one of our test fails with the following error:

"cannot resume from a finalized file".

Unfortunately, this is a valid edge case we need to support to support restarts of Filecoin data transfers. Here's the scenario:

Storage client transfers a 64Gib file to to the Storage Provider.
Storage provider writes the CARv2 file to a blockstore -> calls Finalize but dosen't persist the fact that it's finalized.
Storage provider restarts and attempts to create a CARv2 blockstore from the CARv2 file it has and then finalizes it.

move repo to IPLD org

The spec is being maintained in the IPLD project as well as some other tools and implementations.

Make CAR v2 index iterable without having to read it all into memory

Ideally, the index would not have to be loaded into memory to be iterated, like this.

See:

filecoin-project/dagstore#2 (comment)

go-ipld-cbor dependency unavailable

The go-ipld-cbor package referenced here (1.4.9 QmeyKxwBJgB7AP1DgoMAbqHajMnt4C6TuubsKBRVGZVFuH) isn't available.

Additionally the latest version in https://github.com/ipfs/go-ipld-cbor is 1.3.0.

Consider supporting multihash in CAR v2 Blockstore

The CARv2 index is queryable by CID exposed via Blockstore API. That API as used in Filecoin is multihash based.

It would be useful if the blockstore API implemented by CAR v2, backed by the index, can be queryable by multihash.

Questions:

When we Get a block should it be the CID as it was encoded in the CAR, or in the format it was requested?
- The format it was requested makes sense.
What format should the keys be in when AllKeysChan is called?
- Probably the format encoded in CAR?
- Could defaulting to V1 cause problems?

Client should be able to get a detached Index from a CARv1 or CARv2 file

Currently, if a client wants to to generate/get just the Index for a given CAR file, the client needs to write a lot of boilerplate, especially if the client dosen't know whether what it has is a CARv1 or an indexed CARv2. We don't need this feature for UnIndexed CARv2s and leave it for you guys to decide if you want to support that too for now.

Assume reader here is an `io.ReadSeeker` and we don't know whether the stream is a CARv1 or a CARv2:

	// determine CAR version and Index accordingly.
	ver, err := car.ReadVersion(reader)
	if err != nil {
		return
	}
	// seek to start of file as reading the version above will change the offset.
	if _, err := reader.Seek(0, io.SeekStart); err != nil {
		return
	}
	
	var idx carindex.Index
	switch ver {
	case 2:
		carreader, err := car.NewReader(reader)
		if err != nil {
			return
		}
		if has := carreader.Header.HasIndex(); !has {
			return
		}
		idx, err = carindex.ReadFrom(carreader.IndexReader())
		if err !=  nil {
			return
		}
	case 1:
		cidx,err  := carindex.Generate(reader)
		if err != nil {
			return
		}
		idx = cidx

It would be great to have an API that takes either a CARv1/v2 stream and can generate/detach the Index from it and return it.

efficient buffer-less ExtractV1File method

CARv2 embeds a CARv1 fully, including its header. It should be possible thus to unpack a CARv1 from a CARv2 file efficiently by os.Truncateing at the end, and io.Copying through *os.File descriptors to allow Go to delegate to the kernel instead of performing the copy in user land.

Two use cases are requested:

Extract inner CARv1 into a new file.
Overwrite existing CARv2 with the CARv1 contents (potentially by opening two *os.File file descriptors against the same file, seeking in the source descriptor to the right offset, and performing io.Copy?).

v2: add a WrapV1 API that works on io.ReadSeeker and io.Writer

Right now we have WrapV1File which works on paths. Add a version that doesn't require files on disk.

Couldn't do this straight away as it required internal refactors.

CARv2 TestBlockstore fails intermittently

Happened on Windows CI build here.

      readwrite_test.go:111: AllKeysChan returned an unexpected amount of keys; expected 1049 but got 0
123

Implement block iterator in CARv2 reader that accepts both v1 and v2

CARv1 reader provides an interface for iterating through sections here. Ideally, we want a similar API for CARv2 reader that does the right thing given v1 or v2 payload

Cc @willscott

v2: Open APIs should set io.Closer finalizers, like os.OpenFile

This way, if someone encounters an error early, or otherwise forgets to close/finalize a CAR reader or blockstore, the underlying open file won't leak.

See https://pkg.go.dev/runtime#SetFinalizer.

cc @masih

v2: add a high level API to transform a CARv1 to a CARv2

Given a CARv1 as a reader, write a CARv2 file wrapping the exact same bytes with a header and index.

This is going to be useful for downstreams where the end user has CARv1 files on disk, which they want to easily convert to a CARv2 without altering the inner CARv1 at all. This is possible today with the current API, but it's quite cumbersome - having to manually write the header, and generate+write the index.

For use cases where one doesn't need the inner CARv1 to be equal, or wants to opt into characteristic bit features like "canonical format", then they should use the Write APIs. That use case is separate from this "wrap carv1 in a carv2" API.

Read/Write blockstore should support resuming

When writing to a blockstore, the write might fail before the blockstore writes are "finalised/flushed". The API should support resuming writes, picking up where it left off.

This is a must for Ignite requirements.

Filtered DAG traversal (with IPLD selector?)

The go-filecoin project wanted to export a blockchain structure. The block headers have "State root" CIDs which refer to HAMT structures, which happen to be stored in the same store. We don't want to export the state trees though - i.e. we don't want to traverse those links. Similarly there is a "receipts" structure referenced from a block header that is not to be exported.

The CAR writer assumes that a complete DAG is to be exported, but we need flexibility. For expedience, we just did our own external traversal and write the CAR via the exported utility methods. A few different options would have been more helpful:

a writer that consumes an iterator over blocks to write
a writer that accepts a predicate against which each CID/block is to be tested before loading/writing
a writer that accepts an IPLD selector describing the sub-DAG to write

That last option would be awesome (I know the selectors did not exist at the time this code was authored).

FYI @frrist @warpfork

ensure we replace binary.Put/ReadUvarint with go-varint before the v2 release

We might want to replace all instances of binary.PutUvarint and binary.ReadUvarint in carv2 with https://github.com/multiformats/go-varint before the release; otherwise we are not guaranteed to have minimal encoding of varints, which is useful for determinism

Most of our uses are Reads, not Puts, but even then - go-varint will error if the incoming varint uses more bytes than it should, which is a good thing. We have a few Puts in e.g. index encoding.

v2: don't expose bits of API that shouldn't be used directly

For example, the insertion index is only used by the incremental read-write blockstore, and is not specced, so it should never be used while encoding CARv2 files. Having it as part of the public API is dangerous - if someone misses that detail, they could encode a CARv2 with such an index directly, which would be unspecified and fragile.

v2: we should leave a forwards compatible way to add options to the API

For example:

type WriteOption func(*Writer)

func NewWriter(ctx context.Context, ng format.NodeGetter, roots []cid.Cid, opts ...WriteOption) *Writer

And later on, if we get a new feature like "write a carv2 without an index", setting the characteristic bitfield, we could then add something like:

func SkipIndex(w *Writer) { w.skipIndex = true } // or whichever internal knob

Since the parameters are ..., one can simply not include any options in a call, or add more later.

blockstore: improve file handling by OpenReadOnly

See https://github.com/ipld/go-car/pull/110/files/4eac6e2bf70403064cb9620a14225e3592201418#r659729739.

add basic baseline benchmarks for carv2

ReadOnly blockstore should safely reject reads after Close() was called

Right now, it will lead to a panic, like the one seen in filecoin-project/lotus#7005. If opened with OpenReadOnly(), Close() will close the backing mmap and the next read will cause a panic.

ReadVersion can be more efficient

if carv1 header is of size != 10 then it is most-likely a v1
If the carv1 header begins with a non-empty roots list then it is most likely a v1.
If the carv1 header begins with an empty roots list, then we skip it (fixed length) and read the version.
If we did that, we'd always read a fixed length, so we could just take a []byte.

We wouldn't support carv1-looking headers whose version is not 1, but the carv1 spec requires the version to always equal 1.

Context: #100 (review)

NewSelectiveCar should accept a dagservice

See comment/resulting code here: ipfs/kubo@39b9c3a32b9bcc682

The core justification is the fact that the dagervice can do proper paralellizing etc within a session. The blockstore is "dumb", thus not a great primitive for an interface-API.

Unify utility `Reader`/`Writer` types into `internalio`

should this live in internal/io?

Originally posted by @willscott in #144 (comment)

[issue] go get -u https://github.com/ipld/go-car

Hi guys I tryed to get go-car and I'm in stuck with this.

Cmd

go get -u https://github.com/ipld/go-car

Error

../../work/pkg/mod/github.com/ipld/[email protected]/selectivecar.go:14:2: module github.com/ipld/go-ipld-prime@latest found (v0.0.3), but does not contain package github.com/ipld/go-ipld-prime/impl/free

CAR v2 characteristics should indicate if index is included

We want to be lazy about generating an index. This is so that if we've already got the carv2 header at the beginning of the car, we can write an index suffix without relocating the actual car data on disk.

v2: consider adding logging options

For example:

func WithLogging(func(format string, args ...interface{})) Option

See #111 (comment). There are a few places where we can't easily return errors, such as AllKeysChan's inner goroutine, so it's a shame if the best we can do is simply throw them away.

cc @dirkmc

CAR walk: decouple logic between getting the links and processing them

go-car/car.go

Lines 58 to 61 in a4380ef

 cw := &carWriter{ds: ds, w: w, walk: walk} 

 seen := cid.NewSet() 

 for _, r := range roots { 

 if err := dag.Walk(ctx, cw.enumGetLinks, r, seen.Visit); err != nil {

The dag.Walk() function receives a GetLinks callback to fetch the children of the parent node r. We overload that function with the processing itself of the node (in this case writing it to the file with (*carWriter).enumGetLinks). Although the visit callback in the same dag.Walk is also undocumented it seems a more appropriate place for that logic.

add support for null-padded carv1 payloads

Filecoin has been using a fork of the carv1 library with commit 11b6074, which allows decoding carv1 files that end with null padding. They use this because their crypto proofs require power-of-two payload lengths.

Spec-wise, this seems valid. A CARv1 is a series of "frames", in the form of:

cid+block length as a varint
cid
block

So, if such a frame begins with the zero byte, then we should simply skip to the next byte, because we have a varint of value 0 for the cid+block length, and thus we have nothing else to do. It's reasonable to simply ignore the frame entirely, because there's no CID and no block data.

There's only one place in the carv2 library where we decode CARv1 frames: index.Generate. Right now this padding makes it error, because it unconditionally reads a CID, even if the frame size is zero. In that way, it incorrectly implements the spec; it shouldn't attempt to read CID bytes when it knows a frame is of zero length. So we definitely have something to fix in carv2.

The CARv1 spec doesn't explicitly allow this kind of empty CID plus empty block, but it's not explicitly disallowed either. It does contain a section on padding, however:

The CAR format contains no specified means of padding to achieve specific total archive sizes or internal byte offset alignment of block data. Because it is not a requirement that each block be part of a coherent DAG under one of the roots of the CAR, dummy block entries may be used to achieve padding. Such padding should also account for the size of the length-prefix varint and the CID for a section. All sections must be valid and dummy entries should still decode to valid IPLD blocks.

I think the ship has sailed in terms of forbidding this kind of padding in the CARv1 spec, given that Filecoin has been using this for some time. We could tell them to "migrate" all of those null-padded CARv1 files, so that they also store them alongside a "padding offset" and cut the payload when using the carv2 library, effectively producing an EOF error when the library reaches the padding.

However, this would come at a relatively high cost to them (migrations are work), whereas on our side it's fairly trivial to amend one line to handle zero-length frames in a better way. What I propose is to skip the zero-length frames, effectively meaning that index.Generate would skip the entire null padding one byte at a time.

That would be pretty slow, but it would work. Later on, we have two options:

A) In the CARv1 spec, specify that a decoder is allowed to treat a frame beginning with a null byte as an EOF, discarding the rest of the input. This would allow us to quickly exit and stop reading bytes.

B) In the carv2 implementation, when we reach a null-byte frame, we read the rest of the input in big chunks, which would allow us to quickly reach EOF as long as there are no non-zero bytes. If there are any non-zero bytes, we would go back to the slow path of decoding each frame starting with a varint. Purely an optimization for the common case.

I lean towards making the fix above today, and later on amending the CARv1 spec as per option A.

Present codebase does not allow opening root-less .car files

https://github.com/ipld/go-car/blob/v0.0.4/car.go#L120-L122 seems somewhat superfluous. In the tests for ipfs/kubo#7011 I worked around this by sticking an "empty CID" \x01\x55\x00\x00 into the root array, but there got to be a better way...?

If there is compelling argument to not accepting root-less .car files: please clarify it in this issue.

Godoc with example usage of carv2 primary flows

There should be useful documentation and examples guiding users through

an import/export roundtrip
Opening an unknown file as a blockstore
writing data to a carv2 via blockstore interface and saving it.
Top level doc todo

v2: allow AllKeysChan to report errors via a callback on Context

The API would add:

var ctxKeyErrorHandler someUnexportedType

func WithErrorHandler(ctx context.Context, func(error)) context.Context

And then one can specify an error handler for AllKeysChan:

ctx := someParentContext
ctx = blockstore.WithErrorHandler(ctx, func(err error) {
    // handle the error
})
... = bs.AllKeysChan(ctx, ...)

Separate from #112, since that's logging, and error values are lost.

Move CID decoding functionality to `go-cid`

See ipfs/go-cid#126

Implement the ability to index a CAR using `io.Reader` alone for both v1 and v2

there should be a way to do this on just an io.Reader without ever needing to seek, but we would end up needing to use internal functions to jump to the internal (post-header) portion of index generation when we encounter a car v1 header.

Seeker is also used to skip over blocks during indexing. Nevertheless, i can imagine situations where you're streaming a car onto disk, and want to generate an index while you're at it. Being able to pipe through this with just an io.Reader is likely the way you'd hope to do that, so we should eventually relax the interface here to what we actually need.