Comments (6)
@frank-dspeed, thanks for the details. For sure we can talk about the details, that's where the fun is. :)
First, it is true that under different DAG creation configurations you'd get a different CID (e.g. a different hashing algorithm, DAG layout, raw leaves, etc.), but that's unrelated to your original question, because the definition of DHT you're using isn't correct. This is why I said I was smelling some confusion here, and suspected you wanted to refer to another concept.
In your original question, the references to "between different IPFS nodes" and "DHT" are irrelevant or wrong, since what matters is the DAG creation configuration. To be more explicit, here are some claims:
- Under the same DAG creation configuration (i.e. hashing algorithm, DAG layout, etc.), the CID computed for a piece of data is deterministic. Determinism means producing the same result under the same underlying assumptions. If you change the way you compute something, getting a different result says nothing about determinism.
- If two different IPFS nodes use the same DAG creation configuration, the generated CID is the same. Put differently, different CIDs for the same data aren't caused by different IPFS nodes, only by different configurations. You can get different CIDs for the same data (under different configs) on the same IPFS node. Conclusion: talking about different IPFS nodes isn't relevant to this discussion.
- Your definition of DHT is not what most people in computer science (or the IPFS ecosystem) would understand. DHT stands for Distributed Hash Table, which is basically a distributed map.
- Your claim that "the CID also includes the IPFS node DHT information" still isn't correct under your definition of DHT. The CID format doesn't include any details about the DAG layout (e.g. balanced, etc.); at most it includes "codecs", but that's another thing.
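To make the first two claims concrete, here is a minimal toy sketch. It is not how IPFS actually builds DAGs: `DagConfig` and `address` are hypothetical names, and the "DAG" is just a hash over the digests of fixed-size chunks. It only illustrates that the address is a function of the data and the configuration, never of the node that computes it.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DagConfig:
    """Hypothetical stand-in for a DAG creation configuration."""
    hash_alg: str = "sha256"
    chunk_size: int = 262144


def address(data: bytes, cfg: DagConfig) -> str:
    """Toy content address: hash each chunk, then hash the joined chunk digests."""
    chunks = [data[i:i + cfg.chunk_size]
              for i in range(0, len(data), cfg.chunk_size)] or [b""]
    leaf_digests = [hashlib.new(cfg.hash_alg, c).digest() for c in chunks]
    return hashlib.new(cfg.hash_alg, b"".join(leaf_digests)).hexdigest()


data = b"x" * 1000
default = DagConfig()

node_a = address(data, default)  # "node A", default config
node_b = address(data, default)  # "node B", same config
same_node_other_cfg = address(data, DagConfig(chunk_size=256))

print(node_a == node_b)               # same config: same address
print(node_a == same_node_other_cfg)  # different config: different address
```

Nothing in `address` knows which machine it runs on, so the only way to get a different result for the same data is to pass a different `DagConfig`.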
my conclusion is that we can never think that all are running the same version with the same settings so we have no deduplication
In general, most people in the space use `ipfs add`, which has had the same default values since ~the beginning, mostly to avoid the very problem you're mentioning. If someone changes the DAG creation configuration, they should probably know what they're doing and understand that it will change the CID of the data relative to other people just running `ipfs add`.
If you want to be 100% strict and say that we should clarify this in our paper by adding "under the assumption of always using the same DAG building configuration", I think that's a fair point. It isn't usually spelled out every time someone talks about leveraging content hashing, since content addressing always implies a stable address creation scheme baked in. If you have `f(data) = address`, I think it's fair to say nobody should expect `f` to change in the middle of an argument.
from go-threads.
@jsign you're correct, add that part. You should not underestimate the number of people without the prerequisite knowledge who read the paper.
I think we can assume that someone who uses this software is not generally familiar with the deep implications of content addressing.
No, the CID is purely based on the file content. You can generate a CID for the content without having anything IPFS related.
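As a rough illustration of that, the sketch below builds a CIDv1 for a single block of raw bytes using only the Python standard library. It assumes the `raw` codec (0x55), sha2-256 (multihash code 0x12) and base32 multibase encoding; `raw_cid_v1` is a hypothetical helper name, and data that a real importer would split into several chunks is out of scope here.

```python
import base64
import hashlib


def raw_cid_v1(data: bytes) -> str:
    """CIDv1 string for one raw block: version, codec, multihash, base32-encoded."""
    digest = hashlib.sha256(data).digest()
    # Multihash: <hash code 0x12 = sha2-256><digest length><digest>
    multihash = bytes([0x12, len(digest)]) + digest
    # CIDv1: <version 0x01><codec 0x55 = raw><multihash>
    cid_bytes = bytes([0x01, 0x55]) + multihash
    # Multibase: 'b' prefix means lowercase base32 without padding
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32


cid = raw_cid_v1(b"hello world")
print(cid)  # starts with "bafkrei"
```

No node, network or DHT is involved: the string depends only on the bytes and on the fixed choices of codec, hash function and encoding.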
As an extra question regarding:
.. the CID also Includes the IPFS node DHT Information.
Can you provide the reference where you read that? That claim isn't true.
It feels to me there might be some confusion.
@jsign you can verify that by creating files on different nodes, but if you want the full details, the CID depends on:
- The hashing algorithm used (sha256 or any other).
- The dag-format used (default “balanced”, but it can be anything, and this is what I call DHT Information).
- The chunking algorithm used (default “fixed-262144”, i.e. 256 KiB blocks, but it can and will change).
- Whether --raw-leaves was used.
- For large folders, whether the HAMT directory sharding option was enabled.
Using --raw-leaves (implied by --nocopy, iirc) or --inline should also change the CID (but it might depend on the file content).
My conclusion is that we can never assume that everyone is running the same version with the same settings, so we have no deduplication.
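The deduplication concern can be sketched at the block level. This is a toy model, assuming fixed-size chunking and sha256 only; `chunk_digests` is a hypothetical helper and not part of any IPFS API. Two importers share stored blocks only when they cut the data at the same boundaries.

```python
import hashlib


def chunk_digests(data: bytes, chunk_size: int) -> set[str]:
    """Set of sha256 digests of the fixed-size chunks of data."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


# 1024 bytes made of 256 distinct 4-byte words, so no two chunks are equal.
data = b"".join(i.to_bytes(4, "big") for i in range(256))

same_settings = chunk_digests(data, 256) & chunk_digests(data, 256)
different_settings = chunk_digests(data, 256) & chunk_digests(data, 128)

print(len(same_settings))       # 4 shared blocks: full deduplication
print(len(different_settings))  # 0 shared blocks: nothing deduplicates
```

So deduplication is lost only between imports that used different settings; imports that used the same settings still share every block.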
@frank-dspeed, thanks for your feedback!