probe-lab / zikade Goto Github PK

View Code? Open in Web Editor NEW

10.0 10.0 3.0 763 KB

A Go implementation of the libp2p Kademlia DHT specification

License: Other

Go 99.94% Makefile 0.06%

golang ipfs kademlia libp2p

zikade's People

Contributors

Stargazers

Watchers

Forkers

preresearch-labs littlespeechless dettanym

zikade's Issues

Feature gap between zikade and go-libp2p-kad-dht

This issue is a collaborative meta issue to capture all remaining tasks to bring Zikade up to feature parity with go-libp2p-kad-dht. (Originally libp2p/go-libp2p-kad-dht#895)

Tasks

Beta Give feedback

Tasks

Beta Give feedback

No tasks being tracked yet.

Options

Add RoutingDHT struct that implements the routing.Routing interface and embeds the DHT

Migrated from libp2p/go-libp2p-kad-dht#929

Given that many are not happy with the routing.Routing interface we could avoid "polluting" the DHT API with the routing.Routing methods. So I propose to do the following:

type RoutingDHT struct {
    *DHT
}

var _ routing.Routing = (*RoutingDHT)(nil)

func NewRoutingDHT(d *DHT) *RoutingDHT {
    return &RoutingDHT{DHT: d}
}

// routing.Routing methods...

Implement Network Size Estimator

Migrated from libp2p/go-libp2p-kad-dht#918

Reuse NodeHandlers in GetClosestNodes

This issue if to address the TODO here: https://github.com/plprobelab/zikade/blob/main/internal/coord/network.go#L229

We should reuse a NodeHandler managed by NetworkBehaviour rather than calling NewNodeHandler

Tests leak goroutines because of go-libp2p and leveldb

Migrated from libp2p/go-libp2p-kad-dht#906

We observed that the 32-bit test runner runs out of memory if we increase the test count to 100.

There are two areas where we leak goroutines after the test has stopped:

leveldb has a one-second delay before properly releasing all resources. There is nothing immediate we can do about that. If we added a delay of 1s + δ, all resources are released.
go-libp2p does not clean up fx resources properly. This is a known issue and will be fixed soon: libp2p/go-libp2p#2514

Revisit after libp2p/go-libp2p#2514 has been resolved.

Datastore Key conversion mixes namespaces prefix

(Migrated from libp2p/go-libp2p-kad-dht#869)

When we receive a PUT_VALUE RPC from a remote peer, the key will contain a namespace prefix (e.g., pk or ipns) followed by a binary key. In the case of IPNS the key follows the spec:

Key format: /ipns/BINARY_ID

This is in line with how the IPNS key gets generated here:

func (n Name) RoutingKey() []byte {
	var buffer bytes.Buffer
	buffer.WriteString(NamespacePrefix)
	buffer.Write(n.src) // Note: we append raw multihash bytes (no multibase)
	return buffer.Bytes()
}

NamespacePrefix is set to /ipns/. In the handlePutValue method the key parameter will be set to the above /ipns/BINARY_ID format. Later in that handler we're generating the datastore key via convertToDsKey(rec.GetKey()):

func convertToDsKey(s []byte) ds.Key {
	return ds.NewKey(base32.RawStdEncoding.EncodeToString(s))
}

This means we're taking the bytes of the string /ipns/BINARY_ID and encode them as base32. This means we're losing the capability of proper namespacing. The /ipns prefix will still be the same in base32 for all keys but still this seems like a coincidence. In the ProviderManager, we're properly encoding each component separately here.

Changing this key format will be incompatible with any persistent stores out there. We'd need to support both types of keys (everything base32 encoded and only BINARY_ID encoded) for some time.

With go-libp2p-kad-dht v2 we could take the liberty to introduce a breaking change here.

Remove StartSpan function from `tele` package

Migrated from libp2p/go-libp2p-kad-dht#927

We should remove this function: https://github.com/libp2p/go-libp2p-kad-dht/blob/e86381e824dd2797cf931f14070292fe763a0c93/v2/tele/tele.go#L133

Instead we should plumb a tracer through to the pool/probe/etc.

Context: libp2p/go-libp2p-kad-dht#924 (comment)

Query stats not updating

Migrated from libp2p/go-libp2p-kad-dht#944

The EventQueryProgressed event received by waitForQuery doesn't have accurate stats. For example the following query does not have an accurate start time and the number of requests is unchanged over time:

2023-09-27T15:56:17.932+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.058+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.228+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWDXvBTyoFtdh3WZt18oQs9rEgBVVgoo2YPW9WFsgvbF1Z", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.305+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWQSmp7ij4nDx6Z1y8myBhZgaKYf9PmX8RJXvHKxf63tXf", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.344+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWAXJgmkWuYjmGsRfcLcw712Xo2gL94BgR2dV7dGfFdpBz", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.479+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWE5WcfiAG8NbZoGETDFStgnG9X9D628whJmEPKw3qWutz", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.637+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWK4x8Azr4Tj5TsVpeDxzxRpUpFVSwT7zGU5x1LTYCYz6c", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.745+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWQ4BzSdQ9fABqJfUgPEYd9Xwg6ztUW7kvSJxePs5Bmo9H", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.759+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWHQZJxV5KRix1uUCy6N1ExAZpN557Kxc9t1HwZocp5N2T", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:19.056+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWMVwrvy9mtKp2vTyGbqy6wWWFWd2zAuJf39UWNCYtfWGU", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}

Allow changing bootstrap peers at runtime

Migrated from libp2p/go-libp2p-kad-dht#921

Implement DHT.SearchValue to support ValueStore interface

Migrated from libp2p/go-libp2p-kad-dht#911

Integration Testing with Kubo

ETA: 2023-09-30

Children:

Thorough Refactored DHT Testing

ETA: 2023-08-31

Description: Write a thorough test suite that should be usable again all DHT implementations. The tests should test the protocol implementation, performance and security.

Connect tracing contexts to allow event loop tracing

Migrated from libp2p/go-libp2p-kad-dht#923

Search for ourselves (refresh routing table) if our local addresses change (subscriber_notifee.go)

Migrated from libp2p/go-libp2p-kad-dht#907

Cache Provider Records in the network

libp2p/go-libp2p-kad-dht#386

Another usage of our network size estimator 👍

Improve documentation of Zikade architecture

Add more visible documentation on the architecture of Zikade including a diagram. Should cover the role of the DHT, Coordinator, Behaviours, State Machines and the Router.

Implement address filter option

Migrated from libp2p/go-libp2p-kad-dht#919

Actor Model Architecture

After watching the walkthrough video and reviewing the code, I can't help but be reminded of the actor model, which has already addressed similar challenges to the ones we're facing with the current go-kademlia architecture. Further, there's already an established language and common terminology in this area.

In the past, I have worked extensively with protoactor-go (website), which draws inspiration from Akka.NET and Microsoft Orleans. While all three libraries offer more functionality than what we currently need, I believe we can still leverage their concepts and even some of their code. Interestingly rust-libp2p has also converged on a very similar pattern with its Hierarchical State Machines, albeit from a different starting point. There are several important concepts employed by all three actor libraries that I believe could be relevant for us. For example, the actor hierarchies, lifecycle events, behaviours (different from rust-libp2p behaviours), mailboxes, middlewares (easy tracing), supervision, and probably more.

The biggest difference in following the actor pattern would be to have multiple event queues instead of a single one because we'd work with a hierarchy of actors and each actor would have exactly one event queue (mailbox) to which all senders enqueue their messages. I understand why the current implementation uses only a single event queue. We want to have sequential message processing to ensure deterministic tests. By introducing a hierarchical model sequential message processing would only happen on a per actor basis. Communication between actors happens asynchronously via messages.

Looking at the SimpleQuery I think the control flow is not super obvious because state manipulation is distributed across a few different places like requestError, handleResponse, and newRequest. SimpleQuery is a natural contender to be its own actor.

The current examples show how nodes interact with each other using the SimpleQuery implementation. However, I believe there must be another layer above this where the orchestration of queries and other tasks are handled. In the examples, the top-level functions craft queries and enqueue them into schedulers. In the future, I believe, we'll have, let's say, a DHT struct that handles pushing new queries to the scheduler, performing cleanup tasks on shutdown, or generally orchestrating the several actions that are running. Following this assumption, I believe there's a hierarchy that would naturally translate to a hierarchy of actors.

I can't help but feel we'd miss out if we didn't make use of what has already been built in the past. Though, the biggest disadvantage I can see is that we wouldn't have a single event queue - which seems to be a hard requirement? Perhaps there's a middle ground? From my experience, testing in protoactor-go (with a myriad of actors) was very well possible and also fine. Flakiness was not a problem. However, crafting the sequence of events was not always straightforward. This is super subjective but for me, the actor model really resonates with the way I think and makes high asynchronicity bearable.

Note: I'm not advocating for pulling in protoactor-go as a dependency. I just want us not to miss out on what has already been developed there and start a discussion around this.

Add TTL to Provider Record

As a result of nodes fetching some content, advertising to the IPFS DHT and quickly churning, some CIDs have a lot of unreachable providers which may be a problem. In order to address this issue, nodes could give a TTL to the provider records they are advertising, given the knowledge of their average uptime. The DHT Servers would store the provider records only for TTL, which should reduce the number of unreachable providers in the IPFS DHT.

In order to implement this change, let's make use of the ttl field in the Provider Record protobuf https://github.com/plprobelab/go-kademlia/blob/dc867cbd3316a89cabaa5be19900cdbf5d2f0805/network/message/ipfsv1/message.proto#L30

Note that this field isn't supported in go-libp2p-kad-dht (so nodes will discard the ttl field), but it can already be parsed by rust-libp2p.

We also need a mechanism on the client/provider that will determine the TTL value, but this value can be provided by the caller (kubo/boxo).

On the Server side, we need to make sure that the TTL field is parsed, and that the Provider Store (once implemented) will only keep the Provider Record for its given TTL (or at most a fixed number e.g 48 hours).

References

Replace stored contexts with explicit tracing and cancellation metadata

When an incoming event is queued by a behaviour's Notify method the context supplied in the method call is also queued alongside the event and reused when the event is actioned by Perform. This was primarily intended to preserve the tracing context for the event so that the event and its consequent outbound events could be traced through the system of coordinator, behaviours and state machines. Secondarily it was intended to allow context cancellation to be effective through the state machines. (Originally the context would be checked for cancellation before actioning the associated event but this has been lost through refactorings)

However these goals can only be attained if the context is consistently preserved everywhere. Currently the coordinator uses its own independent context when dispatching events between state machines and the events emitted by a behaviour's Perform method are done so without their associated context.

Additionally this storing of the context can be harmful if the context is used for an event generated as a side effect, such as a rouuting notification that adds a node to the include queue. This should have its own independent context that is not subject to the parent context's cancellation.

We should remove the storage of context and use a different mechanism to carry tracing and cancellation metadata.

Proposal

Tracing

Extend BehaviourEvent to have a SpanContext method:

// SpanContext returns tracing information associated with the event.
SpanContext() trace.SpanContext

A SpanContext holds the trace id, span id and other tracing flags that should be associated with the event. See spancontext in the specification.

Each outbound event that is generated as a direct result of actioning an inbound event should copy the SpanContext to the new event. Functions that process an event should use the SpanContext, for example:

ctx, span := c.tele.Tracer.Start(trace.ContextWithSpanContext(ctx, ev.SpanContext()), "Coordinator.AddNodes")
defer span.End()

When an event is submitted to the coordinator's Notify method (from an external source or as a result of calling a helper method like Coordinator.Bootstrap) an SpanContext should be created that using a method like trace.SpanContextFromContext.

Cancellation/Deadlines

Events that initiate queries (EventStartFindCloserQuery, EventStartMessageQuery) and broadcasts (EventStartBroadcast) should include a Deadline field that can be used to specify a deadline for the query. The query state machines should use this to terminate the query once it has passed its deadline and the relevant waitForQuery or waitForBroadcast functions can use to create a context with an appropriate deadline.

Events that initiate outbound network requests (EventOutboundGetCloserNodes and EventOutboundSendMessage) should also carry a deadline, inherited from the query that ultimately generated the request event.

Provide Interface

Description

Update dependencies EVERYWHERE (https://github.com/ipfs/kubo, https://github.com/libp2p/go-libp2p, https://github.com/libp2p/go-libp2p-kad-dht, https://github.com/libp2p/go-libp2p-routing-helpers, https://github.com/ipfs/go-ipfs-provider, etc.) and make sure to have the DHT responsible to Reprovide content (where necessary).

It shouldn’t be to IPFS implementations (Kubo) to handle persistence in the different content routers.

The proposed interface contains the following functions StartProvide([]cid.Cid) error, StopProvide([]cid.Cid) error, ListProvide() []cid.Cid or similar. The interface still needs to be discussed with other stakeholders.

Cleanly cancel outbound request

Right now, StopQuery does not cancel the in-flight request but just marks the query as finished. Context probe-lab/go-kademlia#74

Measure DHT Resource Consumption

ETA: 2023-08-31

Description: Currently when the Bitswap ProviderSearchDelay is set to 0, the Time To First Byte is worse than when the ProviderSearchDelay is set to 1 second, which doesn’t make sense from a protocol perspective. When the ProviderSearchDelay is set to 0, all requests go through the DHT (compared with only 5% when the ProviderSearchDelay is set to 1 second). One theory that could explain these strange results, is that the DHT implementation is slowing down Kubo.

We need to investigate around parallel requests in the DHT, and try to understand any resource limitation that the implementation may have. The DHT is much less chatty than Bitswap, hence there must be a way to have a more efficient use of resources (even tough the DHT opens more new connections than Bitswap).

Upon completion

Once we are satisfied with the DHT performances, we can set the Bitswap ProviderSearchDelay to 0 (unless we encounter new problems). This is a good step to get rid of Bitswap dumb broadcast!

References

Implement accelerated DHT (FullRT)

Migrated from libp2p/go-libp2p-kad-dht#916

Flaking test: TestDHT_SearchValue_quorum_test_suite/TestQuorumReachedPrematurely

This test run failed on a PR with no code changes

https://github.com/plprobelab/zikade/actions/runs/6408785740/job/17398549960?pr=47

--- FAIL: TestDHT_SearchValue_quorum_test_suite (10.79s)
    --- PASS: TestDHT_SearchValue_quorum_test_suite/TestQuorumReachedAfterDiscoveryOfBetter (0.27s)
    --- FAIL: TestDHT_SearchValue_quorum_test_suite/TestQuorumReachedPrematurely (10.15s)
    --- PASS: TestDHT_SearchValue_quorum_test_suite/TestQuorumUnspecified (0.18s)
    --- PASS: TestDHT_SearchValue_quorum_test_suite/TestQuorumZero (0.17s)

Cleanup idle NodeHandlers

The NetworkBehaviour maintains a map of NodeHandler but entries are never deleted from this map. Each NodeHandler maintains a goroutine to process events from other state machines that want to send messages to the node. Deleting an idle NodeHandler from the map is safe since its only state is the queue of events waiting to be processed.

A NodeHandler should be removed from the map when:

the node is removed from the routing table
no communication has been requested for a configurable period of time

When the NodeHandler is deleted the goroutine servicing its work queue should also be stopped.

Minimally Functional Modular DHT

ETA: 2023-09-30

Description: Complete the missing/incomplete parts of go-kademlia such that it can be used to implement the DHT in go-libp2p-kad-dht and plumbed into Kubo via an IPFS DHT in boxo

Children:

Plumb Refactored DHT into Kubo

ETA: 2023-09-30

Description: Once the DHT Refactor is over and tested, we need to replace the old DHT in the Kubo implementation with the Refactored one.

Create a performance test plan and success criteria for Kubo testing

ETA: 2023-08-31

Description: create a plan for side by side performance comparisons of the minimally functional DHT with the existing DHT. Define the success criteria for shipping the new DHT in a Kubo release.

Implement routing table diversity filter

Migrated from libp2p/go-libp2p-kad-dht#909

Test Refactored DHT in Kubo

ETA: 2023-09-30

Description

Test the refactored DHT within Kubo, on MANY clients before deploying the DHT refactor to the next Kubo release.

The tests to be conducted must include testing that Content Routing still behaves as expected, but also performance evaluation and comparison with the legacy DHT implementation.

Add peer if it starts supporting the DHT protocol as a reaction to event.EvtPeerProtocolsUpdated or event.EvtPeerIdentificationCompleted

Migrated from libp2p/go-libp2p-kad-dht#908

Deadlock in QueryBehaviour

This goroutine is holding the QueryBehaviour lock, trying to Notify a Waiter that a EventGetCloserNodesSuccess was received.
The are no goroutines selecting on the waiter's channel. I would expect it to be in Coordinator.waitForQuery called from Coordinator.QueryMessage.

goroutine 6818 [select, 33 minutes]:
github.com/plprobelab/zikade/internal/coord.(*Waiter[...]).Notify(0xc0018abbe0, {0x2eda480?, 0xc0021febd0}, {0x2ec3ca0, 0xc0034ab2d0})
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:126 +0x105
github.com/plprobelab/zikade/internal/coord.(*PooledQueryBehaviour).Notify(0xc00014df80, {0x2eda480?, 0xc00216cf00?}, {0x2ec3be0?, 0xc00253a000?})
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/query.go:189 +0x109f
github.com/plprobelab/zikade/internal/coord.(*NodeHandler).send(0xc00079d880, {0x2eda480, 0xc00216cf00}, {0x2ecf4d0?, 0xc00262e460?})
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/network.go:165 +0x33e
github.com/plprobelab/zikade/internal/coord.(*WorkQueue[...]).Enqueue.func1.1()
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:81 +0x108
created by github.com/plprobelab/zikade/internal/coord.(*WorkQueue[...]).Enqueue.func1
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:75 +0x7a

There are 8 goroutines are waiting on the lock at this point:

goroutine 6990 [sync.Mutex.Lock, 33 minutes]:
sync.runtime_SemacquireMutex(0x11eeaff?, 0x80?, 0xc001f4a390?)
	/home/iand/sdk/go1.20.5/src/runtime/sema.go:77 +0x26
sync.(*Mutex).lockSlow(0xc00014dfd8)
	/home/iand/sdk/go1.20.5/src/sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
	/home/iand/sdk/go1.20.5/src/sync/mutex.go:90
github.com/plprobelab/zikade/internal/coord.(*PooledQueryBehaviour).Notify(0xc00014df80, {0x2eda480?, 0xc001f4a390?}, {0x2ec3d40?, 0xc0015c87d0?})
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/query.go:152 +0x125
github.com/plprobelab/zikade/internal/coord.(*NodeHandler).send(0xc001293c00, {0x2eda480, 0xc001f4a390}, {0x2ecf4f8?, 0xc001293600?})
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/network.go:186 +0x63b
github.com/plprobelab/zikade/internal/coord.(*WorkQueue[...]).Enqueue.func1.1()
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:81 +0x108
created by github.com/plprobelab/zikade/internal/coord.(*WorkQueue[...]).Enqueue.func1
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:75 +0x7a

Somehow we have lost the select that should be reading from the waiter's channel.

When searching for an IPNS or PK record, we will store the updated version of that record with all the closest peers that we have found while querying. If the query aborts after the three closest haven't returned anyone closer, we still update the remaining 17 peers that we haven't contacted and store our currently known "best" record with them. However, these 17 peers may also hold the same record - we just don't know because we haven't contacted them yet.

I would change the logic in V2 to only store the updated record with peers that provably returned a stale record during the query operation.

Add tests for SearchValue fixup

Migrated from libp2p/go-libp2p-kad-dht#946

At the moment we don't have tests for updating peers that hold stale records (SearchValue/GetValue).

Reprovide Sweep Implementation

Prerequisites

#41 is required, because currently it is the IPFS implementations that must handle the reprovide. This optimization only works if the DHT manages the reprovides itself.

Description

The project is well described at DHT Reprovide Sweep. This change is a client change only, and is expected to have a significant impact, especially on large content providers.

Now the Minimal Working Modular DHT is finished, all the logic of Reprovide Sweep can be encapsulated inside the Provide module.

References

DHT Reprovide Sweep

Flaky test: TestRTAdditionOnSuccessfulQuery

=== RUN   TestRTAdditionOnSuccessfulQuery
    topology_test.go:192: 
        	Error Trace:	D:/a/zikade/zikade/topology_test.go:170
        	            				D:/a/zikade/zikade/topology_test.go:192
        	            				D:/a/zikade/zikade/query_test.go:21
        	Error:      	Received unexpected error:
        	            	test deadline exceeded while waiting for routing updated event
        	Test:       	TestRTAdditionOnSuccessfulQuery

https://github.com/libp2p/go-libp2p/actions/runs/6430114845/job/17460469763?pr=2595

Implement Optimistic Provide lookup mechanism

Migrated from libp2p/go-libp2p-kad-dht#917

Perform load test with Musa

Migrated from libp2p/go-libp2p-kad-dht#883

Deploy Musa and ramp up the number of FIND_NODE RPCs. Find out when it breaks down and if there's a memory leak.

This infrastructure could be used to periodically check if we introduced a regression.

From comment:

Production load test? https://github.com/protocol/bifrost-infra/issues/2766 🙈

Asked in #libp2p-impelementers channel: https://filecoinproject.slack.com/archives/C03K82MU486/p1695125179210719

Add RoutingDHT that implements the routing.Routing interface

Apply the following PR to Zikade: libp2p/go-libp2p-kad-dht#947

Remove dependency on boxo

Migrated from libp2p/go-libp2p-kad-dht#925

Implement Stringer interface on coordinator events

Migrated from libp2p/go-libp2p-kad-dht#926

context: libp2p/go-libp2p-kad-dht#924 (comment)

Instrument DHT with metrics

ETA: 2023-08-31

Currently go-libp2p-kad-dht exports the following metrics

libp2p.io/dht/kad/received_messages - Total number of messages received per RPC
libp2p.io/dht/kad/received_message_errors - Total number of errors for messages received per RPC
libp2p.io/dht/kad/received_bytes - Total received bytes per RPC
libp2p.io/dht/kad/inbound_request_latency - Latency per RPC
libp2p.io/dht/kad/outbound_request_latency - Latency per RPC
libp2p.io/dht/kad/sent_messages - Total number of messages sent per RPC
libp2p.io/dht/kad/sent_message_errors - Total number of errors for messages sent per RPC
libp2p.io/dht/kad/sent_requests - Total number of requests sent per RPC
libp2p.io/dht/kad/sent_request_errors - Total number of errors for requests sent per RPC
libp2p.io/dht/kad/sent_bytes - Total sent bytes per RPC
libp2p.io/dht/kad/network_size - Network size estimation

We should replicate these or provide equivalents.

See also libp2p/go-libp2p-kad-dht#304 for more discussion

Implement Stream reuse for the router

Migrated from libp2p/go-libp2p-kad-dht#950

Similar to the messageSenderImpl of V1

Fix flaky test

Fix flaky test, follow up of the draft PR libp2p/go-libp2p-kad-dht#955

Accelerated DHT Client

We should decide whether we want to create a FullRT routing table implementation, and its associated refresh mechanism (crawler), or if the FullRT from the old implementation should be superseded by Reprovide Sweep. Note that the scope isn't exactly similar. Reprovide Sweep is more efficient at reproviding content than FullRT, but FullRT is (supposed to be) more efficient at lookup than a normal routing table (using Reprovide Sweep). IIUC most of the FullRT users are using it to reprovide content, so the switch to Reprovide Sweep would be an upgrade for them. However, we may break users depending on the FullRT for faster lookup, if we don't implement it in the new IPFS DHT implementation. There are other efficient routing table alternatives described in probe-lab/go-kademlia#2.

As the accelerated DHT client is an option in Kubo, we could ship the new IPFS DHT implementation without implementing the FullRT, and kubo could depend on the old IPFS DHT implementation if the accelerated DHT client option is set.

Optimize reading libp2p messages with sync.Pool buffers

the libp2p endpoint doesn't yet make use of sync.Pool to optimize memory allocations when writing/reading messages to/from a libp2p stream. This seems to be an easy performance improvement.

go-libp2p-kad-dht is using sync.Pool (see https://github.com/libp2p/go-libp2p-kad-dht/blob/978cb74f5fdf846e09d5769bb4dfb9f962135c38/internal/net/message_manager.go#L360-L368)

References

https://github.com/plprobelab/go-kademlia/blob/main/network/endpoint/libp2pendpoint/libp2pendpoint.go

probe-lab / zikade Goto Github PK

zikade's People

Contributors

Stargazers

Watchers

Forkers

zikade's Issues

Tasks

Tasks

References

Proposal

Tracing

Cancellation/Deadlines

Description

Upon completion

References

Description

Prerequisites

Description

References

References

Recommend Projects

Recommend Topics

Recommend Org

Jobs