
Comments (12)

willscott commented on August 15, 2024
  • what is the problem we're solving here?

    • One of the pain points I've heard against CAR / UnixFS preparation is that having to do anything more than streaming the existing raw bytes a client has is "overhead". If we still calculate the UnixFS reference, are we going to mitigate this pain point? In particular, with this design, since all of the reference data comes first, a client needs to fully calculate the reference before it can begin to stream data.
  • I worry that re-calculating which offset in the bytes array should be served as the bytes for a reference leaf ends up being somewhat error-prone. What happens when we find a malformed variant of this type of CAR where the bytes don't recalculate to the same reference leaves at the point when one of the leaves is asked for? Is it the receiver's responsibility to verify the reference structure as it receives an intactpack CAR and mark it as valid/invalid, versus doing that work on each retrieval request?


rvagg commented on August 15, 2024

what is the problem we're solving here

This is primarily viewed from a retrieval perspective, given the strong allergies that are appearing toward the plain old IPLD-based protocols. I keep encountering a strong preference for plain piece-based retrievals instead. Unfortunately I never got around to the benchmarking I was going to do on this with the Project Motion data I had stored live, comparing the same data fetched both ways to quantify what's been called "the IPLD tax" as it pertains to transfer speeds.

There also seems to be a growing meme that if the bytes aren't stored contiguously then it's inefficient. Juan tackled this recently, maybe it was at FilDev Iceland, and I broadly agree with him in that your data is almost never stored contiguously anyway because the various layers of storage abstractions all do funky things with it. But, it's a strong meme, and often these things sway users.

I also buy the argument that it'd be simpler for a storage provider to just enable piece retrievals. If we insist on them all exposing trustless, or even bitswap, then we make life harder. If the minimum viable SP stack with retrievals is piece retrievals, then we can make it work with this protocol. Trustless is a value-add on top of it.

What happens when we find a malformed variant of this type of car

I'm proposing that the majority of this work is done via a special API in go-car itself: not the standard API, but one that opts in to seeing the data in this way. With that, you just treat byte sequences derived from UnixFS offsets the same way we treat them today in "untrusted" CARs: the bytes are invalid for the CID and get rejected, just as if you uploaded an invalid CAR today. You could make the Filecoin deal for it, but you're not going to get the data back, except via piece retrieval without verification.

That can be done either on the server (Boost) when serving this trustless style, or on the client side when retrieving byte ranges via piece retrieval. The CID still exists in the reference layer, and its digest must match the bytes in the looked-up range.
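To make the digest check concrete: a minimal sketch, assuming raw leaves (so the served bytes are hashed directly); verifyLeaf is an illustrative name, not a go-car API.

```go
package carverify

import (
	"bytes"
	"fmt"

	"github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

// verifyLeaf re-hashes the byte range served for a reference leaf with the
// same multihash function its CID uses, and rejects the range on mismatch.
// Assumes raw leaves; dag-pb leaves would need their envelope bytes instead.
func verifyLeaf(leafCid cid.Cid, data []byte) error {
	decoded, err := mh.Decode(leafCid.Hash())
	if err != nil {
		return err
	}
	sum, err := mh.Sum(data, decoded.Code, decoded.Length)
	if err != nil {
		return err
	}
	if !bytes.Equal(sum, leafCid.Hash()) {
		return fmt.Errorf("bytes do not match %s: rejecting range", leafCid)
	}
	return nil
}
```

The same check works whether it runs in Boost at serve time or in the client after a ranged piece retrieval.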


rvagg commented on August 15, 2024

And yes, :ack: on the inefficiencies of double-streaming: first calculating the reference layer, then streaming in the file bytes. That's a trade-off that should be acknowledged, but my argument is that it's worth the cost if you want maximally flexible retrievability.
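To illustrate the first of those two streams: a minimal sketch that computes raw-leaf CIDs over fixed-size chunks. The 1 MiB chunk size and the helper name are illustrative, and a real reference layer would also build the UnixFS structure above the leaves.

```go
package refpass

import (
	"io"

	"github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

const chunkSize = 1 << 20 // illustrative 1 MiB raw leaves

// leafCids is pass one of the double-stream: read the whole file once just
// to compute the leaf CIDs for the reference layer. The caller must then
// rewind and stream the same bytes again as the data payload.
func leafCids(r io.Reader) ([]cid.Cid, error) {
	builder := cid.V1Builder{Codec: cid.Raw, MhType: mh.SHA2_256}
	var leaves []cid.Cid
	buf := make([]byte, chunkSize)
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			c, cerr := builder.Sum(buf[:n])
			if cerr != nil {
				return nil, cerr
			}
			leaves = append(leaves, c)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return leaves, nil
		}
		if err != nil {
			return nil, err
		}
	}
}
```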


willscott commented on August 15, 2024

Maybe worth comparing this to a variant where the data is uploaded/downloaded fully as raw bytes (maybe as file-delineated objects in a CAR?) and the IPLD references can be generated as look-aside metadata by/on the SP, for its ability to also serve that data with full IPLD semantics.


rvagg commented on August 15, 2024

One drawback there is that you can't do small-chunk trustless retrieval with just a piece endpoint. By packing the proof tree at the start, you make this all navigable via a piece endpoint with range requests. You can fetch the metadata first, verify that it all makes sense, then fetch the bytes, streaming the whole file and proving it as you go with the tree you already have.
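As a sketch of that client flow: two plain HTTP range requests against a piece endpoint, metadata first, then the file bytes. The URL shape and offsets are assumptions for illustration, not a defined API.

```go
package piecefetch

import (
	"fmt"
	"io"
	"net/http"
)

// fetchRange issues one HTTP Range request against a piece endpoint.
// Call it once for the reference layer at the front of the piece, verify
// that, then again (or repeatedly) for the file bytes it describes.
func fetchRange(pieceURL string, offset, length int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, pieceURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+length-1))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```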

Making /piece/ + Range work for IPLD is one of the key pieces in this proposal. Of course that's just me saying that's an ideal goal, given what I see as the likely trajectory: SPs turning off bitswap and trustless given the overhead, and just offering piece retrieval, unless there's some other forcing mechanism that works. Right now "serves retrievals" is satisfied by piece retrievals, and I suspect it'll stay that way unless Retriev or something else offers token incentives for serving data trustlessly.

CommPact gets part way there but essentially jumps out of an IPLD world, even while staying content addressed. It would be a shame to lose the ecosystem of tooling and techniques because that's all we can do.


masih commented on August 15, 2024

Thank you for writing this up as an issue @rvagg πŸ‘

Regarding this proposal, I am mentally stuck on the "Why both?" question. Mixing raw bytes in a CAR file would certainly make things more complex than just the pure CID+Block sections of the current CAR format. This brings me to ask the same sort of question that motivated the CommPact proposal:

  • Where does an IPLD representation make sense from the end-user's perspective?

For me the answer to that is about understanding the original shape of the data, a minimal adoption barrier, contextual awareness, the "Semantic Web" avenue of thinking. It is less about a cross-cutting representation applied to everything. I believe the most impactful thing we can do is help make the IPLD representation shine in specific use-cases where the existing state-of-the-art representations fall short.


masih commented on August 15, 2024

CommPact gets part way there but essentially jumps out of an IPLD world, even while staying content addressed. It would be a shame to lose the ecosystem of tooling and techniques because that's all we can do.

I feel like I have not explained the core idea behind CommPact very well:

The whole point of it is that it keeps things IPLD-friendly without forcing data prep until we figure out why one should go through with it.


mikeal commented on August 15, 2024

There also seems to be a growing meme that if the bytes aren't stored contiguously then it's inefficient. Juan tackled this recently, maybe it was at FilDev Iceland, and I broadly agree with him in that your data is almost never stored contiguously anyway because the various layers of storage abstractions all do funky things with it. But, it's a strong meme, and often these things sway users.

I don't think the fact that a filesystem implementation may repartition data it writes to disk drivers is a substantive justification for producing additional re-encodings of data written to/from that filesystem, re-encodings that force additional copies of the data to be produced.

Filesystems may re-partition data in the process of maintaining a large logical store that continuously presents that data to other layers in the way it received it. The filesystem does not leak this abstraction the way IPFS encoding does when that filesystem is networked; the cost of partitioning appears where that cost has a justification: in the filesystem, next to the disk driver.

All data said to be available is made available "as it appears", and all of these systems maintain a stable API between protocols, from network to syscall, with respect to "ranges of bytes", since there is no need for further consensus on, or adoption of, the encoding structure the filesystem implements. Networked filesystems "network" the layer above the partitioning, rather than below it.

IPFS, when conceptualized as an encoding rather than as a cryptographic address for an abstract "range of bytes", presents several new standards related to the encoding that have to be adopted and agreed upon, and in the case of files this is merely to produce a semantic representation of "a range of bytes".

There's nowhere to implement this where it doesn't materialize as a forced copy of the data, and there's nowhere this won't show up as an additional overall cost of IPFS over any alternative technology, because it doesn't do anything except produce an IPLD representation of a "range of bytes" the user already has.

If you want the merkle tree in order to do incremental verification, you can publish enough information in content discovery to produce any IPFS tree locally from a generic proof; then you aren't fighting over abstract encodings of what is addressed anywhere but the client, which can produce whatever encoding it prefers. I wrote this up hastily before the holidays as "Encodingless IPFS": https://hackmd.io/@gozala/content-claims#Encodingless-IPFS


rvagg commented on August 15, 2024

@masih:

Mixing raw bytes in a CAR file would certainly make things more complex than just the pure CID+Block sections of the current CAR format.

This is intended to be entirely compatible with the CAR format as-is. The "raw bytes" are still CID-prefixed; it's just that the block is quite large and the CID won't be linked by anything else in the CAR. But you should be able to car ls it and even extract that one big block for your file using the CID it comes with.

As @magik6k pointed out though, current go-car has some size limitations for blocks, because LdRead is quite a common operation and pre-allocates the chunk to be read, so it'll likely have some trouble reading these CARs in its current form. But that can be addressed; it's not a spec limitation.
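For illustration, pulling that one big block back out with go-car v2's streaming reader might look like this. It assumes the v2 MaxAllowedSectionSize option is used to lift the default per-section allocation cap; the 4 GiB figure is illustrative.

```go
package bigblock

import (
	"io"
	"os"

	"github.com/ipfs/go-cid"
	carv2 "github.com/ipld/go-car/v2"
)

// extractBlock scans a CAR for a single (possibly huge) block by CID.
// MaxAllowedSectionSize raises the reader's pre-allocation limit, which is
// the size restriction discussed above.
func extractBlock(path string, target cid.Cid) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	br, err := carv2.NewBlockReader(f, carv2.MaxAllowedSectionSize(1<<32)) // illustrative 4 GiB cap
	if err != nil {
		return nil, err
	}
	for {
		blk, err := br.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		if blk.Cid().Equals(target) {
			return blk.RawData(), nil
		}
	}
	return nil, os.ErrNotExist
}
```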


hannahhoward commented on August 15, 2024

Throwing some thoughts out there I've seen in the process of working on web3.storage.

  1. The person who can generate a reliable index is the person doing the data preparation. From what I've seen, all our sad days with Boost emerge from scanning CAR files hoping they are written just right so our scanning software will work. Moreover, if the data creator signs the data index, even without verifying the underlying blocks, I have more trust in that index on its own than I do in an index generated by whoever happens to be holding the data. Also, it's duplication of work: why index the same CAR six different times on different providers?

  2. We have a set of different network functions we know this network needs in order to provide user value. Fundamentally, storing opaque bytes that get read by range is a different job from using an index to run queries over structured data encoded in opaque bytes, and we need both for most normies to use the network. Our mistake is arguing about whether those functions should be separated by locality, when we really should be thinking about whether they should be separated by incentives. From an incentive perspective, this discussion is largely moot: Filecoin in its current form incentivizes storage and retrieval of bytes; everything else is the honor system and maybe FIL+. What we truly need is to incentivize people to hold indexes on data and use those indexes to serve retrievals from the underlying data.

  3. Indexes themselves can be thought of as opaque bytes, and users should probably pay to store them.

  4. Even our best CAR designs so far don't encode true structured graph data (as opposed to UnixFS files) in ways that are usefully queryable, especially if the byte storage doesn't happen to live where the querying is happening. CAR indexes themselves tell you about blocks, but not about the structure inside blocks. If I wanted to back up a blockchain to Filecoin in a way that could be queried rapidly, I might want to write an index that looked very different from just "here are the blocks and these are their offsets": it might be useful to actually know the paths without having to read the blocks one by one, potentially across a network. Even if that knowledge wasn't completely verifiable on its own, I might still derive value from it, especially if it was signed by someone I knew had encoded that blockchain's data. Ultimately, my objection to "let's write a CAR differently so it works as block storage AND indexed data" is that indexing data is a wide-open set of possibilities for how people might want to do it. I think we can probably confine people to IPLD in most cases, but even IPLD has ways to index that go beyond "here are the blocks and their offsets" (see the sketch below).
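Purely as an illustration of point 4 (plus the signed-index idea from point 1), a richer index entry might look like this. Nothing of this shape exists in go-car; every name here is hypothetical.

```go
package pathindex

import "github.com/ipfs/go-cid"

// PathIndexEntry is a hypothetical record that maps a semantic path to a
// location, rather than only mapping a block CID to an offset.
type PathIndexEntry struct {
	Path   string  // e.g. "/headers/12345" within the encoded chain data
	Block  cid.Cid // the block in which that path's value lives
	Offset uint64  // byte offset of the block in the CAR
	Length uint64  // byte length of the block
}

// SignedIndex pairs entries with the data preparer's signature, so a
// holder can pass on an index more trustworthy than one it generated
// itself by re-scanning the CAR.
type SignedIndex struct {
	Entries   []PathIndexEntry
	Signature []byte // preparer's signature over the serialized entries
}
```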


hannahhoward commented on August 15, 2024

sidebar: does anyone know the motivation for encoding lengths & CIDs next to the blocks?

Like, if we started from scratch on an indexed CAR, wouldn't it make sense to have:

  • header
  • index of CID+offset+length entries
  • just a bunch of bytes afterward

Not for CARv1 streaming, but for CARs with indexes.
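As a sketch only (these types are illustrative, not a spec), that layout might be:

```go
package indexedcar

import "github.com/ipfs/go-cid"

// Layout: | Header | IndexEntry x N | raw block bytes ... |
//
// With the index up front, a reader resolves any CID to an absolute byte
// range in one lookup, and the block region needs no per-section framing.

type Header struct {
	Roots      []cid.Cid
	DataOffset uint64 // where the raw block bytes begin
}

type IndexEntry struct {
	Cid    cid.Cid
	Offset uint64 // relative to DataOffset
	Length uint64
}
```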


rvagg commented on August 15, 2024

I think the original design just came from Jeromy wanting to spew a bunch of blocks into a file and then slurp them back in again, both for Filecoin chain data and for doing the whole deal thing to get a CommP that in some way related to the IPLD structure; not a whole lot of thought beyond that, afaik. The original go-car wasn't very sophisticated beyond spew+slurp.

