probe-lab / zikade
A Go implementation of the libp2p Kademlia DHT specification
License: Other
This issue is a collaborative meta issue to capture all remaining tasks to bring Zikade up to feature parity with go-libp2p-kad-dht. (Originally libp2p/go-libp2p-kad-dht#895)
Migrated from libp2p/go-libp2p-kad-dht#929
Given that many are not happy with the routing.Routing interface, we could avoid "polluting" the DHT API with the routing.Routing methods. So I propose to do the following:
type RoutingDHT struct {
*DHT
}
var _ routing.Routing = (*RoutingDHT)(nil)
func NewRoutingDHT(d *DHT) *RoutingDHT {
return &RoutingDHT{DHT: d}
}
// routing.Routing methods...
Migrated from libp2p/go-libp2p-kad-dht#918
This issue is to address the TODO here: https://github.com/plprobelab/zikade/blob/main/internal/coord/network.go#L229
We should reuse a NodeHandler managed by NetworkBehaviour rather than calling NewNodeHandler.
Migrated from libp2p/go-libp2p-kad-dht#906
We observed that the 32-bit test runner runs out of memory if we increase the test count to 100.
There are two areas where we leak goroutines after the test has stopped:
Revisit after libp2p/go-libp2p#2514 has been resolved.
(Migrated from libp2p/go-libp2p-kad-dht#869)
When we receive a PUT_VALUE RPC from a remote peer, the key will contain a namespace prefix (e.g., pk or ipns) followed by a binary key. In the case of IPNS the key follows the spec:
Key format: /ipns/BINARY_ID
This is in line with how the IPNS key gets generated here:
func (n Name) RoutingKey() []byte {
var buffer bytes.Buffer
buffer.WriteString(NamespacePrefix)
buffer.Write(n.src) // Note: we append raw multihash bytes (no multibase)
return buffer.Bytes()
}
NamespacePrefix is set to /ipns/. In the handlePutValue method the key parameter will be set to the above /ipns/BINARY_ID format. Later in that handler we're generating the datastore key via convertToDsKey(rec.GetKey()):
func convertToDsKey(s []byte) ds.Key {
return ds.NewKey(base32.RawStdEncoding.EncodeToString(s))
}
This means we're taking the bytes of the string /ipns/BINARY_ID and encoding them as base32, so we lose the capability of proper namespacing. The /ipns prefix will still encode to the same base32 prefix for all keys, but that seems like a coincidence rather than a design. In the ProviderManager, by contrast, we're properly encoding each component separately here.
Changing this key format will be incompatible with any persistent stores out there. We'd need to support both types of keys (everything base32 encoded and only BINARY_ID encoded) for some time.
With go-libp2p-kad-dht v2 we could take the liberty to introduce a breaking change here.
Migrated from libp2p/go-libp2p-kad-dht#927
We should remove this function: https://github.com/libp2p/go-libp2p-kad-dht/blob/e86381e824dd2797cf931f14070292fe763a0c93/v2/tele/tele.go#L133
Instead we should plumb a tracer through to the pool/probe/etc.
Migrated from libp2p/go-libp2p-kad-dht#944
The EventQueryProgressed event received by waitForQuery doesn't have accurate stats. For example the following query does not have an accurate start time and the number of requests is unchanged over time:
2023-09-27T15:56:17.932+0100 DEBUG query made progress {"query_id": "0000000000000005", "peer_id": "QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.058+0100 DEBUG query made progress {"query_id": "0000000000000005", "peer_id": "QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.228+0100 DEBUG query made progress {"query_id": "0000000000000005", "peer_id": "12D3KooWDXvBTyoFtdh3WZt18oQs9rEgBVVgoo2YPW9WFsgvbF1Z", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.305+0100 DEBUG query made progress {"query_id": "0000000000000005", "peer_id": "12D3KooWQSmp7ij4nDx6Z1y8myBhZgaKYf9PmX8RJXvHKxf63tXf", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.344+0100 DEBUG query made progress {"query_id": "0000000000000005", "peer_id": "12D3KooWAXJgmkWuYjmGsRfcLcw712Xo2gL94BgR2dV7dGfFdpBz", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.479+0100 DEBUG query made progress {"query_id": "0000000000000005", "peer_id": "12D3KooWE5WcfiAG8NbZoGETDFStgnG9X9D628whJmEPKw3qWutz", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.637+0100 DEBUG query made progress {"query_id": "0000000000000005", "peer_id": "12D3KooWK4x8Azr4Tj5TsVpeDxzxRpUpFVSwT7zGU5x1LTYCYz6c", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.745+0100 DEBUG query made progress {"query_id": "0000000000000005", "peer_id": "12D3KooWQ4BzSdQ9fABqJfUgPEYd9Xwg6ztUW7kvSJxePs5Bmo9H", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.759+0100 DEBUG query made progress {"query_id": "0000000000000005", "peer_id": "12D3KooWHQZJxV5KRix1uUCy6N1ExAZpN557Kxc9t1HwZocp5N2T", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:19.056+0100 DEBUG query made progress {"query_id": "0000000000000005", "peer_id": "12D3KooWMVwrvy9mtKp2vTyGbqy6wWWFWd2zAuJf39UWNCYtfWGU", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
Migrated from libp2p/go-libp2p-kad-dht#921
Migrated from libp2p/go-libp2p-kad-dht#911
ETA: 2023-08-31
Description: Write a thorough test suite that should be usable against all DHT implementations. The tests should cover the protocol implementation, performance, and security.
Migrated from libp2p/go-libp2p-kad-dht#923
Migrated from libp2p/go-libp2p-kad-dht#907
Another usage of our network size estimator 👍
Add more visible documentation on the architecture of Zikade including a diagram. Should cover the role of the DHT, Coordinator, Behaviours, State Machines and the Router.
Migrated from libp2p/go-libp2p-kad-dht#919
After watching the walkthrough video and reviewing the code, I can't help but be reminded of the actor model, which has already addressed similar challenges to the ones we're facing with the current go-kademlia architecture. Further, there's already an established language and common terminology in this area.
In the past, I have worked extensively with protoactor-go (website), which draws inspiration from Akka.NET and Microsoft Orleans. While all three libraries offer more functionality than what we currently need, I believe we can still leverage their concepts and even some of their code. Interestingly, rust-libp2p has also converged on a very similar pattern with its Hierarchical State Machines, albeit from a different starting point. There are several important concepts employed by all three actor libraries that I believe could be relevant for us. For example: actor hierarchies, lifecycle events, behaviours (different from rust-libp2p behaviours), mailboxes, middlewares (easy tracing), supervision, and probably more.
The biggest difference in following the actor pattern would be to have multiple event queues instead of a single one because we'd work with a hierarchy of actors and each actor would have exactly one event queue (mailbox) to which all senders enqueue their messages. I understand why the current implementation uses only a single event queue. We want to have sequential message processing to ensure deterministic tests. By introducing a hierarchical model sequential message processing would only happen on a per actor basis. Communication between actors happens asynchronously via messages.
Looking at the SimpleQuery, I think the control flow is not super obvious because state manipulation is distributed across a few different places like requestError, handleResponse, and newRequest. SimpleQuery is a natural contender to be its own actor.
The current examples show how nodes interact with each other using the SimpleQuery implementation. However, I believe there must be another layer above this where the orchestration of queries and other tasks is handled. In the examples, the top-level functions craft queries and enqueue them into schedulers. In the future, I believe, we'll have, let's say, a DHT struct that handles pushing new queries to the scheduler, performing cleanup tasks on shutdown, or generally orchestrating the several actions that are running. Following this assumption, I believe there's a hierarchy that would naturally translate to a hierarchy of actors.
I can't help but feel we'd miss out if we didn't make use of what has already been built in the past. Though, the biggest disadvantage I can see is that we wouldn't have a single event queue - which seems to be a hard requirement? Perhaps there's a middle ground? From my experience, testing in protoactor-go (with a myriad of actors) was very well possible and also fine. Flakiness was not a problem. However, crafting the sequence of events was not always straightforward. This is super subjective, but for me the actor model really resonates with the way I think and makes high asynchronicity bearable.
Note: I'm not advocating for pulling in protoactor-go as a dependency. I just want us not to miss out on what has already been developed there and to start a discussion around this.
As a result of nodes fetching some content, advertising to the IPFS DHT and quickly churning, some CIDs have a lot of unreachable providers which may be a problem. In order to address this issue, nodes could give a TTL to the provider records they are advertising, given the knowledge of their average uptime. The DHT Servers would store the provider records only for TTL, which should reduce the number of unreachable providers in the IPFS DHT.
In order to implement this change, let's make use of the ttl field in the Provider Record protobuf https://github.com/plprobelab/go-kademlia/blob/dc867cbd3316a89cabaa5be19900cdbf5d2f0805/network/message/ipfsv1/message.proto#L30
Note that this field isn't supported in go-libp2p-kad-dht (so nodes will discard the ttl field), but it can already be parsed by rust-libp2p.
We also need a mechanism on the client/provider that will determine the TTL value, but this value can be provided by the caller (kubo/boxo).
On the Server side, we need to make sure that the TTL field is parsed, and that the Provider Store (once implemented) will only keep the Provider Record for its given TTL (or at most a fixed maximum, e.g., 48 hours).
When an incoming event is queued by a behaviour's Notify method, the context supplied in the method call is also queued alongside the event and reused when the event is actioned by Perform. This was primarily intended to preserve the tracing context for the event so that the event and its consequent outbound events could be traced through the system of coordinator, behaviours and state machines. Secondarily it was intended to allow context cancellation to be effective through the state machines. (Originally the context would be checked for cancellation before actioning the associated event, but this has been lost through refactorings.)
However, these goals can only be attained if the context is consistently preserved everywhere. Currently the coordinator uses its own independent context when dispatching events between state machines, and the events emitted by a behaviour's Perform method are emitted without their associated context.
Additionally, this storing of the context can be harmful if the context is used for an event generated as a side effect, such as a routing notification that adds a node to the include queue. This should have its own independent context that is not subject to the parent context's cancellation.
We should remove the storage of context and use a different mechanism to carry tracing and cancellation metadata.
Extend BehaviourEvent to have a SpanContext method:
// SpanContext returns tracing information associated with the event.
SpanContext() trace.SpanContext
A SpanContext holds the trace id, span id and other tracing flags that should be associated with the event. See spancontext in the specification.
Each outbound event that is generated as a direct result of actioning an inbound event should copy the SpanContext to the new event. Functions that process an event should use the SpanContext, for example:
ctx, span := c.tele.Tracer.Start(trace.ContextWithSpanContext(ctx, ev.SpanContext()), "Coordinator.AddNodes")
defer span.End()
When an event is submitted to the coordinator's Notify method (from an external source or as a result of calling a helper method like Coordinator.Bootstrap), a SpanContext should be created using a method like trace.SpanContextFromContext.
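The propagation rule (copy the inbound event's SpanContext onto every outbound event it causes) can be sketched as follows. To stay self-contained this sketch uses a local SpanContext stub in place of trace.SpanContext, and the event types and deriveOutbound helper are illustrative:

```go
package main

import "fmt"

// SpanContext is a local stand-in for trace.SpanContext in this sketch.
type SpanContext struct{ TraceID, SpanID string }

// BehaviourEvent gains a SpanContext method, per the proposal.
type BehaviourEvent interface {
	SpanContext() SpanContext
}

type EventStartFindCloserQuery struct{ sc SpanContext }

func (e EventStartFindCloserQuery) SpanContext() SpanContext { return e.sc }

type EventOutboundGetCloserNodes struct{ sc SpanContext }

func (e EventOutboundGetCloserNodes) SpanContext() SpanContext { return e.sc }

// deriveOutbound copies the inbound event's SpanContext to the outbound
// event it triggers, so the trace links the two without storing a context.
func deriveOutbound(in BehaviourEvent) EventOutboundGetCloserNodes {
	return EventOutboundGetCloserNodes{sc: in.SpanContext()}
}

func main() {
	in := EventStartFindCloserQuery{sc: SpanContext{TraceID: "t1", SpanID: "s1"}}
	out := deriveOutbound(in)
	fmt.Println(out.SpanContext().TraceID) // t1
}
```

With the real types, a processing function would wrap this as shown above with trace.ContextWithSpanContext before starting its span.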
Events that initiate queries (EventStartFindCloserQuery, EventStartMessageQuery) and broadcasts (EventStartBroadcast) should include a Deadline field that can be used to specify a deadline for the query. The query state machines should use this to terminate the query once it has passed its deadline, and the relevant waitForQuery or waitForBroadcast functions can use it to create a context with an appropriate deadline.
Events that initiate outbound network requests (EventOutboundGetCloserNodes and EventOutboundSendMessage) should also carry a deadline, inherited from the query that ultimately generated the request event.
Update dependencies EVERYWHERE (https://github.com/ipfs/kubo, https://github.com/libp2p/go-libp2p, https://github.com/libp2p/go-libp2p-kad-dht, https://github.com/libp2p/go-libp2p-routing-helpers, https://github.com/ipfs/go-ipfs-provider, etc.) and make sure to have the DHT responsible to Reprovide content (where necessary).
It shouldn’t be up to IPFS implementations (Kubo) to handle persistence in the different content routers.
The proposed interface contains the following functions: StartProvide([]cid.Cid) error, StopProvide([]cid.Cid) error, ListProvide() []cid.Cid, or similar. The interface still needs to be discussed with other stakeholders.
Right now, StopQuery does not cancel the in-flight request but just marks the query as finished. Context: probe-lab/go-kademlia#74
ETA: 2023-08-31
Description: Currently when the Bitswap ProviderSearchDelay is set to 0, the Time To First Byte is worse than when the ProviderSearchDelay is set to 1 second, which doesn’t make sense from a protocol perspective. When the ProviderSearchDelay is set to 0, all requests go through the DHT (compared with only 5% when the ProviderSearchDelay is set to 1 second). One theory that could explain these strange results is that the DHT implementation is slowing down Kubo.
We need to investigate parallel requests in the DHT and try to understand any resource limitation the implementation may have. The DHT is much less chatty than Bitswap, hence there must be a way to make more efficient use of resources (even though the DHT opens more new connections than Bitswap).
Once we are satisfied with the DHT performance, we can set the Bitswap ProviderSearchDelay to 0 (unless we encounter new problems). This is a good step towards getting rid of Bitswap's dumb broadcast!
Migrated from libp2p/go-libp2p-kad-dht#916
This test run failed on a PR with no code changes
https://github.com/plprobelab/zikade/actions/runs/6408785740/job/17398549960?pr=47
--- FAIL: TestDHT_SearchValue_quorum_test_suite (10.79s)
--- PASS: TestDHT_SearchValue_quorum_test_suite/TestQuorumReachedAfterDiscoveryOfBetter (0.27s)
--- FAIL: TestDHT_SearchValue_quorum_test_suite/TestQuorumReachedPrematurely (10.15s)
--- PASS: TestDHT_SearchValue_quorum_test_suite/TestQuorumUnspecified (0.18s)
--- PASS: TestDHT_SearchValue_quorum_test_suite/TestQuorumZero (0.17s)
The NetworkBehaviour maintains a map of NodeHandler but entries are never deleted from this map. Each NodeHandler maintains a goroutine to process events from other state machines that want to send messages to the node. Deleting an idle NodeHandler from the map is safe since its only state is the queue of events waiting to be processed.
A NodeHandler should be removed from the map when:
When the NodeHandler is deleted the goroutine servicing its work queue should also be stopped.
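A minimal sketch of this removal, assuming a NodeHandler whose only state is a work queue drained by one goroutine (all types here are illustrative stand-ins for the coord internals):

```go
package main

import (
	"fmt"
	"sync"
)

// nodeHandler is a stand-in: a work queue drained by a single goroutine.
type nodeHandler struct {
	queue chan string
	done  chan struct{}
}

func newNodeHandler() *nodeHandler {
	h := &nodeHandler{queue: make(chan string, 8), done: make(chan struct{})}
	go func() {
		defer close(h.done)
		for range h.queue { // drain events until the queue is closed
		}
	}()
	return h
}

// stop closes the queue so the servicing goroutine exits, then waits for it.
func (h *nodeHandler) stop() {
	close(h.queue)
	<-h.done
}

// behaviour holds handlers keyed by node id, like NetworkBehaviour's map.
type behaviour struct {
	mu       sync.Mutex
	handlers map[string]*nodeHandler
}

// removeIdle deletes a handler from the map and stops its goroutine,
// which is safe because the handler's only state is its pending queue.
func (b *behaviour) removeIdle(id string) {
	b.mu.Lock()
	h, ok := b.handlers[id]
	delete(b.handlers, id)
	b.mu.Unlock()
	if ok {
		h.stop()
	}
}

func main() {
	b := &behaviour{handlers: map[string]*nodeHandler{"n1": newNodeHandler()}}
	b.removeIdle("n1")
	fmt.Println(len(b.handlers)) // 0
}
```

The key point is that stop both deletes the map entry and joins the goroutine, so neither the entry nor the goroutine can leak.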
ETA: 2023-09-30
Description: Complete the missing/incomplete parts of go-kademlia such that it can be used to implement the DHT in go-libp2p-kad-dht and plumbed into Kubo via an IPFS DHT in boxo
Children:
ETA: 2023-09-30
Description: Once the DHT Refactor is over and tested, we need to replace the old DHT in the Kubo implementation with the Refactored one.
ETA: 2023-08-31
Description: create a plan for side by side performance comparisons of the minimally functional DHT with the existing DHT. Define the success criteria for shipping the new DHT in a Kubo release.
Migrated from libp2p/go-libp2p-kad-dht#909
ETA: 2023-09-30
Test the refactored DHT within Kubo, on MANY clients before deploying the DHT refactor to the next Kubo release.
The tests to be conducted must include testing that Content Routing still behaves as expected, but also performance evaluation and comparison with the legacy DHT implementation.
Migrated from libp2p/go-libp2p-kad-dht#908
This goroutine is holding the QueryBehaviour lock, trying to Notify a Waiter that an EventGetCloserNodesSuccess was received.
There are no goroutines selecting on the waiter's channel. I would expect it to be in Coordinator.waitForQuery, called from Coordinator.QueryMessage.
goroutine 6818 [select, 33 minutes]:
github.com/plprobelab/zikade/internal/coord.(*Waiter[...]).Notify(0xc0018abbe0, {0x2eda480?, 0xc0021febd0}, {0x2ec3ca0, 0xc0034ab2d0})
/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:126 +0x105
github.com/plprobelab/zikade/internal/coord.(*PooledQueryBehaviour).Notify(0xc00014df80, {0x2eda480?, 0xc00216cf00?}, {0x2ec3be0?, 0xc00253a000?})
/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/query.go:189 +0x109f
github.com/plprobelab/zikade/internal/coord.(*NodeHandler).send(0xc00079d880, {0x2eda480, 0xc00216cf00}, {0x2ecf4d0?, 0xc00262e460?})
/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/network.go:165 +0x33e
github.com/plprobelab/zikade/internal/coord.(*WorkQueue[...]).Enqueue.func1.1()
/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:81 +0x108
created by github.com/plprobelab/zikade/internal/coord.(*WorkQueue[...]).Enqueue.func1
/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:75 +0x7a
There are 8 goroutines waiting on the lock at this point:
goroutine 6990 [sync.Mutex.Lock, 33 minutes]:
sync.runtime_SemacquireMutex(0x11eeaff?, 0x80?, 0xc001f4a390?)
/home/iand/sdk/go1.20.5/src/runtime/sema.go:77 +0x26
sync.(*Mutex).lockSlow(0xc00014dfd8)
/home/iand/sdk/go1.20.5/src/sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
/home/iand/sdk/go1.20.5/src/sync/mutex.go:90
github.com/plprobelab/zikade/internal/coord.(*PooledQueryBehaviour).Notify(0xc00014df80, {0x2eda480?, 0xc001f4a390?}, {0x2ec3d40?, 0xc0015c87d0?})
/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/query.go:152 +0x125
github.com/plprobelab/zikade/internal/coord.(*NodeHandler).send(0xc001293c00, {0x2eda480, 0xc001f4a390}, {0x2ecf4f8?, 0xc001293600?})
/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/network.go:186 +0x63b
github.com/plprobelab/zikade/internal/coord.(*WorkQueue[...]).Enqueue.func1.1()
/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:81 +0x108
created by github.com/plprobelab/zikade/internal/coord.(*WorkQueue[...]).Enqueue.func1
/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:75 +0x7a
Somehow we have lost the select that should be reading from the waiter's channel.
Migrated from libp2p/go-libp2p-kad-dht#922
Migrated from libp2p/go-libp2p-kad-dht#945
When searching for an IPNS or PK record, we will store the updated version of that record with all the closest peers that we have found while querying. If the query aborts after the three closest haven't returned anyone closer, we still update the remaining 17 peers that we haven't contacted and store our currently known "best" record with them. However, these 17 peers may also hold the same record - we just don't know because we haven't contacted them yet.
I would change the logic in V2 to only store the updated record with peers that provably returned a stale record during the query operation.
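The proposed filter can be sketched as follows (the queriedPeer type and its fields are illustrative, not the actual query state):

```go
package main

import "fmt"

// queriedPeer records what we learned about a peer during the query.
type queriedPeer struct {
	id       string
	returned bool // did the peer respond with a record at all?
	stale    bool // was the returned record older than our current best?
}

// peersToUpdate keeps only peers that provably returned a stale record,
// instead of updating all uncontacted closest peers as V1 does.
func peersToUpdate(peers []queriedPeer) []string {
	var out []string
	for _, p := range peers {
		if p.returned && p.stale {
			out = append(out, p.id)
		}
	}
	return out
}

func main() {
	peers := []queriedPeer{
		{id: "a", returned: true, stale: true},  // gets the update
		{id: "b", returned: true, stale: false}, // already up to date
		{id: "c"},                               // never contacted: skipped
	}
	fmt.Println(peersToUpdate(peers)) // [a]
}
```

Peers we never contacted (like "c") are skipped because we cannot know whether their copy is stale.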
Migrated from libp2p/go-libp2p-kad-dht#946
At the moment we don't have tests for updating peers that hold stale records (SearchValue/GetValue).
#41 is required, because currently it is the IPFS implementations that must handle the reprovide. This optimization only works if the DHT manages the reprovides itself.
The project is well described at DHT Reprovide Sweep. This change is a client change only, and is expected to have a significant impact, especially on large content providers.
Now the Minimal Working Modular DHT is finished, all the logic of Reprovide Sweep can be encapsulated inside the Provide module.
=== RUN TestRTAdditionOnSuccessfulQuery
topology_test.go:192:
Error Trace: D:/a/zikade/zikade/topology_test.go:170
D:/a/zikade/zikade/topology_test.go:192
D:/a/zikade/zikade/query_test.go:21
Error: Received unexpected error:
test deadline exceeded while waiting for routing updated event
Test: TestRTAdditionOnSuccessfulQuery
https://github.com/libp2p/go-libp2p/actions/runs/6430114845/job/17460469763?pr=2595
Migrated from libp2p/go-libp2p-kad-dht#917
Migrated from libp2p/go-libp2p-kad-dht#883
Deploy Musa and ramp up the number of FIND_NODE RPCs. Find out when it breaks down and if there's a memory leak.
This infrastructure could be used to periodically check if we introduced a regression.
From comment:
Production load test? https://github.com/protocol/bifrost-infra/issues/2766 🙈
Asked in the #libp2p-implementers channel: https://filecoinproject.slack.com/archives/C03K82MU486/p1695125179210719
Apply the following PR to Zikade: libp2p/go-libp2p-kad-dht#947
Migrated from libp2p/go-libp2p-kad-dht#925
Migrated from libp2p/go-libp2p-kad-dht#926
ETA: 2023-08-31
Currently go-libp2p-kad-dht exports the following metrics
We should replicate these or provide equivalents.
See also libp2p/go-libp2p-kad-dht#304 for more discussion
Migrated from libp2p/go-libp2p-kad-dht#950
Similar to the messageSenderImpl of V1.
Fix flaky test, follow up of the draft PR libp2p/go-libp2p-kad-dht#955
We should decide whether we want to create a FullRT routing table implementation, and its associated refresh mechanism (crawler), or if the FullRT from the old implementation should be superseded by Reprovide Sweep. Note that the scope isn't exactly similar. Reprovide Sweep is more efficient at reproviding content than FullRT, but FullRT is (supposed to be) more efficient at lookup than a normal routing table (using Reprovide Sweep). IIUC most of the FullRT users are using it to reprovide content, so the switch to Reprovide Sweep would be an upgrade for them. However, we may break users depending on the FullRT for faster lookup if we don't implement it in the new IPFS DHT implementation. There are other efficient routing table alternatives described in probe-lab/go-kademlia#2.
As the accelerated DHT client is an option in Kubo, we could ship the new IPFS DHT implementation without implementing the FullRT, and Kubo could depend on the old IPFS DHT implementation if the accelerated DHT client option is set.
The libp2p endpoint doesn't yet make use of sync.Pool to optimize memory allocations when writing/reading messages to/from a libp2p stream. This seems to be an easy performance improvement.
go-libp2p-kad-dht is using sync.Pool (see https://github.com/libp2p/go-libp2p-kad-dht/blob/978cb74f5fdf846e09d5769bb4dfb9f962135c38/internal/net/message_manager.go#L360-L368).
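A minimal sketch of the pattern, in the spirit of the linked message_manager code (bufPool and writeMsg are illustrative names, not the actual endpoint API):

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool reuses message buffers across writes instead of allocating a
// fresh slice per message.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 0, 4096) },
}

// writeMsg borrows a buffer from the pool, serializes the payload into
// it, and returns the buffer to the pool afterwards.
func writeMsg(payload []byte) int {
	buf := bufPool.Get().([]byte)
	buf = append(buf[:0], payload...)
	n := len(buf)
	// ... the real code would write buf to the libp2p stream here ...
	bufPool.Put(buf[:0]) // return the buffer (emptied) for reuse
	return n
}

func main() {
	fmt.Println(writeMsg([]byte("hello"))) // 5
}
```

Under steady load this amortizes buffer allocations to near zero, since the GC only reclaims pooled buffers when memory pressure requires it.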