
zikade's People

Contributors

dennis-tra, dependabot[bot], guillaumemichel, iand


zikade's Issues

Feature gap between zikade and go-libp2p-kad-dht

This issue is a collaborative meta issue to capture all remaining tasks to bring Zikade up to feature parity with go-libp2p-kad-dht. (Originally libp2p/go-libp2p-kad-dht#895)

Tasks

No tasks being tracked yet.

Tests leak goroutines because of go-libp2p and leveldb

Migrated from libp2p/go-libp2p-kad-dht#906

We observed that the 32-bit test runner runs out of memory if we increase the test count to 100.

There are two areas where we leak goroutines after the test has stopped:

  1. leveldb has a one-second delay before it properly releases all of its resources. There is nothing immediate we can do about that. If we add a delay of 1s + δ after the tests finish, all resources are released.
  2. go-libp2p does not clean up fx resources properly. This is a known issue and will be fixed soon: libp2p/go-libp2p#2514

Revisit after libp2p/go-libp2p#2514 has been resolved.
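For reference, a minimal sketch of how the leveldb delay could be accounted for in a test setup, assuming go.uber.org/goleak is used for leak detection (this is an illustration, not necessarily what the test suite does today):

package zikade

import (
	"fmt"
	"os"
	"testing"
	"time"

	"go.uber.org/goleak"
)

// TestMain runs the test suite, then waits slightly longer than leveldb's
// internal one-second shutdown delay (the "1s + δ" above) before checking
// for leaked goroutines.
func TestMain(m *testing.M) {
	code := m.Run()
	time.Sleep(1100 * time.Millisecond)
	if err := goleak.Find(); err != nil {
		fmt.Fprintln(os.Stderr, "leaked goroutines:", err)
		code = 1
	}
	os.Exit(code)
}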

Datastore Key conversion mixes namespace prefix into the encoded key

(Migrated from libp2p/go-libp2p-kad-dht#869)

When we receive a PUT_VALUE RPC from a remote peer, the key will contain a namespace prefix (e.g., pk or ipns) followed by a binary key. In the case of IPNS the key follows the spec:

Key format: /ipns/BINARY_ID

This is in line with how the IPNS key gets generated here:

func (n Name) RoutingKey() []byte {
	var buffer bytes.Buffer
	buffer.WriteString(NamespacePrefix)
	buffer.Write(n.src) // Note: we append raw multihash bytes (no multibase)
	return buffer.Bytes()
}

NamespacePrefix is set to /ipns/. In the handlePutValue method the key parameter will be set to the above /ipns/BINARY_ID format. Later in that handler we're generating the datastore key via convertToDsKey(rec.GetKey()):

func convertToDsKey(s []byte) ds.Key {
	return ds.NewKey(base32.RawStdEncoding.EncodeToString(s))
}

This means we take the bytes of the string /ipns/BINARY_ID and encode them as base32, so we lose the capability of proper namespacing. The /ipns prefix will still map to the same base32 prefix for all keys, but that seems like a coincidence rather than a design decision. In the ProviderManager, we're properly encoding each component separately here.
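For comparison, a hedged sketch of a namespace-preserving conversion (the function name is made up; whether this is the right key format is exactly what's up for discussion here):

package dht

import (
	"bytes"
	"encoding/base32"

	ds "github.com/ipfs/go-datastore"
)

// convertToDsKeyNamespaced keeps the namespace prefix readable and only
// base32-encodes the binary component, yielding keys like
// "/ipns/<base32(BINARY_ID)>".
func convertToDsKeyNamespaced(s []byte) ds.Key {
	// Expect keys of the form "/<namespace>/<binary id>".
	parts := bytes.SplitN(s, []byte("/"), 3) // ["", namespace, binary id]
	if len(parts) != 3 || len(parts[1]) == 0 {
		// Not a namespaced key; fall back to encoding everything, as today.
		return ds.NewKey(base32.RawStdEncoding.EncodeToString(s))
	}
	return ds.NewKey("/" + string(parts[1]) + "/" +
		base32.RawStdEncoding.EncodeToString(parts[2]))
}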

Changing this key format would be incompatible with any persistent stores out there. We'd need to support both types of keys (everything base32-encoded, and only BINARY_ID base32-encoded) for some time.

With go-libp2p-kad-dht v2 we could take the liberty to introduce a breaking change here.

Query stats not updating

Migrated from libp2p/go-libp2p-kad-dht#944

The EventQueryProgressed event received by waitForQuery doesn't have accurate stats. For example, the following query does not have an accurate start time and the number of requests is unchanged over time (the elapsed value of ~9223372036.85s equals Go's maximum time.Duration, which suggests the elapsed time is being computed from an unset start time):

2023-09-27T15:56:17.932+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.058+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.228+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWDXvBTyoFtdh3WZt18oQs9rEgBVVgoo2YPW9WFsgvbF1Z", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.305+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWQSmp7ij4nDx6Z1y8myBhZgaKYf9PmX8RJXvHKxf63tXf", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.344+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWAXJgmkWuYjmGsRfcLcw712Xo2gL94BgR2dV7dGfFdpBz", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.479+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWE5WcfiAG8NbZoGETDFStgnG9X9D628whJmEPKw3qWutz", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.637+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWK4x8Azr4Tj5TsVpeDxzxRpUpFVSwT7zGU5x1LTYCYz6c", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.745+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWQ4BzSdQ9fABqJfUgPEYd9Xwg6ztUW7kvSJxePs5Bmo9H", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:18.759+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWHQZJxV5KRix1uUCy6N1ExAZpN557Kxc9t1HwZocp5N2T", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}
2023-09-27T15:56:19.056+0100	DEBUG	query made progress	{"query_id": "0000000000000005", "peer_id": "12D3KooWMVwrvy9mtKp2vTyGbqy6wWWFWd2zAuJf39UWNCYtfWGU", "elapsed": 9223372036.854776, "requests": 0, "failures": 0}

Thorough Refactored DHT Testing

ETA: 2023-08-31

Description: Write a thorough test suite that should be usable against all DHT implementations. The tests should cover the protocol implementation, performance, and security.

Actor Model Architecture

After watching the walkthrough video and reviewing the code, I can't help but be reminded of the actor model, which has already addressed similar challenges to the ones we're facing with the current go-kademlia architecture. Further, there's already an established language and common terminology in this area.

In the past, I have worked extensively with protoactor-go (website), which draws inspiration from Akka.NET and Microsoft Orleans. While all three libraries offer more functionality than what we currently need, I believe we can still leverage their concepts and even some of their code. Interestingly rust-libp2p has also converged on a very similar pattern with its Hierarchical State Machines, albeit from a different starting point. There are several important concepts employed by all three actor libraries that I believe could be relevant for us. For example, the actor hierarchies, lifecycle events, behaviours (different from rust-libp2p behaviours), mailboxes, middlewares (easy tracing), supervision, and probably more.

The biggest difference in following the actor pattern would be to have multiple event queues instead of a single one because we'd work with a hierarchy of actors and each actor would have exactly one event queue (mailbox) to which all senders enqueue their messages. I understand why the current implementation uses only a single event queue. We want to have sequential message processing to ensure deterministic tests. By introducing a hierarchical model sequential message processing would only happen on a per actor basis. Communication between actors happens asynchronously via messages.
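For illustration, a minimal sketch of the mailbox-per-actor shape (this is not protoactor-go code; all names are made up):

package actorsketch

import "context"

// Message is whatever event type an actor consumes.
type Message any

// Actor owns exactly one mailbox and processes it sequentially; concurrency
// only exists between actors, which communicate by sending messages.
type Actor struct {
	mailbox chan Message
	handle  func(context.Context, Message)
}

func NewActor(handle func(context.Context, Message)) *Actor {
	return &Actor{mailbox: make(chan Message, 64), handle: handle}
}

// Send enqueues a message into the actor's mailbox.
func (a *Actor) Send(msg Message) {
	a.mailbox <- msg
}

// Run drains the mailbox until the context is cancelled, giving per-actor
// sequential message processing.
func (a *Actor) Run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case msg := <-a.mailbox:
			a.handle(ctx, msg)
		}
	}
}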

Looking at the SimpleQuery I think the control flow is not super obvious because state manipulation is distributed across a few different places like requestError, handleResponse, and newRequest. SimpleQuery is a natural contender to be its own actor.

The current examples show how nodes interact with each other using the SimpleQuery implementation. However, I believe there must be another layer above this where the orchestration of queries and other tasks are handled. In the examples, the top-level functions craft queries and enqueue them into schedulers. In the future, I believe, we'll have, let's say, a DHT struct that handles pushing new queries to the scheduler, performing cleanup tasks on shutdown, or generally orchestrating the several actions that are running. Following this assumption, I believe there's a hierarchy that would naturally translate to a hierarchy of actors.

I can't help but feel we'd miss out if we didn't make use of what has already been built in the past. Though, the biggest disadvantage I can see is that we wouldn't have a single event queue - which seems to be a hard requirement? Perhaps there's a middle ground? From my experience, testing in protoactor-go (with a myriad of actors) was very well possible and also fine. Flakiness was not a problem. However, crafting the sequence of events was not always straightforward. This is super subjective but for me, the actor model really resonates with the way I think and makes high asynchronicity bearable.

Note: I'm not advocating for pulling in protoactor-go as a dependency. I just want us not to miss out on what has already been developed there and start a discussion around this.

Add TTL to Provider Record

As a result of nodes fetching some content, advertising it to the IPFS DHT, and quickly churning, some CIDs have a lot of unreachable providers, which may be a problem. To address this issue, nodes could give a TTL to the provider records they advertise, based on knowledge of their average uptime. The DHT servers would store the provider records only for the TTL, which should reduce the number of unreachable providers in the IPFS DHT.

In order to implement this change, let's make use of the ttl field in the Provider Record protobuf https://github.com/plprobelab/go-kademlia/blob/dc867cbd3316a89cabaa5be19900cdbf5d2f0805/network/message/ipfsv1/message.proto#L30

Note that this field isn't supported in go-libp2p-kad-dht (so nodes will discard the ttl field), but it can already be parsed by rust-libp2p.

We also need a mechanism on the client/provider that will determine the TTL value, but this value can be provided by the caller (kubo/boxo).

On the server side, we need to make sure that the TTL field is parsed, and that the Provider Store (once implemented) will only keep the Provider Record for its given TTL (or at most a fixed maximum, e.g. 48 hours).
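A minimal sketch of the server-side handling under these assumptions (the names maxProviderTTL and expiryForRecord are illustrative, not an existing provider-store API):

package providerstore

import "time"

// maxProviderTTL is an assumed upper bound, e.g. the legacy 48h expiry.
const maxProviderTTL = 48 * time.Hour

// expiryForRecord returns when a provider record should be dropped, given the
// TTL advertised by the provider (0 means the ttl field was not set).
func expiryForRecord(received time.Time, advertisedTTL time.Duration) time.Time {
	ttl := advertisedTTL
	if ttl <= 0 || ttl > maxProviderTTL {
		ttl = maxProviderTTL // clamp unset or oversized TTLs
	}
	return received.Add(ttl)
}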

References

Replace stored contexts with explicit tracing and cancellation metadata

When an incoming event is queued by a behaviour's Notify method, the context supplied in the method call is also queued alongside the event and reused when the event is actioned by Perform. This was primarily intended to preserve the tracing context for the event so that the event and its consequent outbound events could be traced through the system of coordinator, behaviours and state machines. Secondarily, it was intended to allow context cancellation to be effective through the state machines. (Originally the context was checked for cancellation before actioning the associated event, but this has been lost through refactorings.)

However, these goals can only be attained if the context is consistently preserved everywhere. Currently the coordinator uses its own independent context when dispatching events between state machines, and the events emitted by a behaviour's Perform method are emitted without their associated context.

Additionally, this storing of the context can be harmful if the context is used for an event generated as a side effect, such as a routing notification that adds a node to the include queue. Such an event should have its own independent context that is not subject to the parent context's cancellation.

We should remove the storage of context and use a different mechanism to carry tracing and cancellation metadata.

Proposal

Tracing

Extend BehaviourEvent to have a SpanContext method:

// SpanContext returns tracing information associated with the event.
SpanContext() trace.SpanContext

A SpanContext holds the trace id, span id and other tracing flags that should be associated with the event. See spancontext in the specification.

Each outbound event that is generated as a direct result of actioning an inbound event should copy the SpanContext to the new event. Functions that process an event should use the SpanContext, for example:

ctx, span := c.tele.Tracer.Start(trace.ContextWithSpanContext(ctx, ev.SpanContext()), "Coordinator.AddNodes")
defer span.End()

When an event is submitted to the coordinator's Notify method (from an external source or as a result of calling a helper method like Coordinator.Bootstrap), a SpanContext should be created using a method like trace.SpanContextFromContext.

Cancellation/Deadlines

Events that initiate queries (EventStartFindCloserQuery, EventStartMessageQuery) and broadcasts (EventStartBroadcast) should include a Deadline field that can be used to specify a deadline for the query. The query state machines should use this to terminate the query once it has passed its deadline, and the relevant waitForQuery or waitForBroadcast functions can use it to create a context with an appropriate deadline.

Events that initiate outbound network requests (EventOutboundGetCloserNodes and EventOutboundSendMessage) should also carry a deadline, inherited from the query that ultimately generated the request event.
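A minimal sketch of what carrying a deadline on an event could look like (the Deadline field and the expired helper are assumptions, not the current zikade event definitions):

package coordsketch

import "time"

// EventStartFindCloserQuery, as sketched here, carries an explicit Deadline
// instead of a stored context.
type EventStartFindCloserQuery struct {
	QueryID  string    // placeholder for the real query ID type
	Deadline time.Time // zero value means no deadline
}

// expired reports whether the query should be terminated because its
// deadline has passed.
func (e *EventStartFindCloserQuery) expired(now time.Time) bool {
	return !e.Deadline.IsZero() && now.After(e.Deadline)
}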

Provide Interface

Description

Update dependencies EVERYWHERE (https://github.com/ipfs/kubo, https://github.com/libp2p/go-libp2p, https://github.com/libp2p/go-libp2p-kad-dht, https://github.com/libp2p/go-libp2p-routing-helpers, https://github.com/ipfs/go-ipfs-provider, etc.) and make sure the DHT is responsible for reproviding content (where necessary).

It shouldn't be up to IPFS implementations (Kubo) to handle persistence in the different content routers.

The proposed interface contains the following functions: StartProvide([]cid.Cid) error, StopProvide([]cid.Cid) error, and ListProvide() []cid.Cid, or similar. The interface still needs to be discussed with other stakeholders.
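A minimal sketch of what that interface could look like (the interface name is a placeholder and the exact shape is still up for discussion):

package dht

import "github.com/ipfs/go-cid"

// ProvideManager is a placeholder name; the method set mirrors the proposal above.
type ProvideManager interface {
	// StartProvide starts providing (and reproviding) the given CIDs.
	StartProvide(cids []cid.Cid) error
	// StopProvide stops providing the given CIDs.
	StopProvide(cids []cid.Cid) error
	// ListProvide returns the CIDs currently being provided.
	ListProvide() []cid.Cid
}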

Measure DHT Resource Consumption

ETA: 2023-08-31

Description: Currently, when the Bitswap ProviderSearchDelay is set to 0, the Time To First Byte is worse than when the ProviderSearchDelay is set to 1 second, which doesn't make sense from a protocol perspective. When the ProviderSearchDelay is set to 0, all requests go through the DHT (compared with only 5% when the ProviderSearchDelay is set to 1 second). One theory that could explain these strange results is that the DHT implementation is slowing down Kubo.

We need to investigate parallel requests in the DHT and try to understand any resource limitation that the implementation may have. The DHT is much less chatty than Bitswap, so there must be a way to use resources more efficiently (even though the DHT opens more new connections than Bitswap).

Upon completion

Once we are satisfied with the DHT's performance, we can set the Bitswap ProviderSearchDelay to 0 (unless we encounter new problems). This is a good step towards getting rid of Bitswap's dumb broadcast!

References

Flaking test: TestDHT_SearchValue_quorum_test_suite/TestQuorumReachedPrematurely

This test run failed on a PR with no code changes:

https://github.com/plprobelab/zikade/actions/runs/6408785740/job/17398549960?pr=47

--- FAIL: TestDHT_SearchValue_quorum_test_suite (10.79s)
    --- PASS: TestDHT_SearchValue_quorum_test_suite/TestQuorumReachedAfterDiscoveryOfBetter (0.27s)
    --- FAIL: TestDHT_SearchValue_quorum_test_suite/TestQuorumReachedPrematurely (10.15s)
    --- PASS: TestDHT_SearchValue_quorum_test_suite/TestQuorumUnspecified (0.18s)
    --- PASS: TestDHT_SearchValue_quorum_test_suite/TestQuorumZero (0.17s)

Cleanup idle NodeHandlers

The NetworkBehaviour maintains a map of NodeHandlers, but entries are never deleted from this map. Each NodeHandler maintains a goroutine to process events from other state machines that want to send messages to the node. Deleting an idle NodeHandler from the map is safe, since its only state is the queue of events waiting to be processed.

A NodeHandler should be removed from the map when:

  1. the node is removed from the routing table
  2. no communication has been requested for a configurable period of time

When the NodeHandler is deleted the goroutine servicing its work queue should also be stopped.
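A minimal sketch of an idle-timeout sweep under these assumptions (the types and field names below are illustrative, not existing zikade identifiers):

package coordsketch

import (
	"sync"
	"time"
)

type nodeHandler struct {
	lastUsed time.Time
	stop     func() // stops the goroutine servicing the handler's work queue
}

type networkBehaviour struct {
	mu       sync.Mutex
	handlers map[string]*nodeHandler
}

// sweepIdle removes handlers that have been idle longer than idleTimeout and
// stops their work-queue goroutines. It would be called periodically, and
// removal would also be triggered when a node leaves the routing table.
func (b *networkBehaviour) sweepIdle(now time.Time, idleTimeout time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for id, h := range b.handlers {
		if now.Sub(h.lastUsed) > idleTimeout {
			h.stop()
			delete(b.handlers, id)
		}
	}
}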

Plumb Refactored DHT into Kubo

ETA: 2023-09-30

Description: Once the DHT Refactor is over and tested, we need to replace the old DHT in the Kubo implementation with the Refactored one.

Test Refactored DHT in Kubo

ETA: 2023-09-30

Description

Test the refactored DHT within Kubo, on MANY clients before deploying the DHT refactor to the next Kubo release.

The tests to be conducted must include testing that Content Routing still behaves as expected, but also performance evaluation and comparison with the legacy DHT implementation.

Deadlock in QueryBehaviour

This goroutine is holding the QueryBehaviour lock, trying to Notify a Waiter that an EventGetCloserNodesSuccess was received.
There are no goroutines selecting on the waiter's channel. I would expect it to be in Coordinator.waitForQuery, called from Coordinator.QueryMessage.

goroutine 6818 [select, 33 minutes]:
github.com/plprobelab/zikade/internal/coord.(*Waiter[...]).Notify(0xc0018abbe0, {0x2eda480?, 0xc0021febd0}, {0x2ec3ca0, 0xc0034ab2d0})
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:126 +0x105
github.com/plprobelab/zikade/internal/coord.(*PooledQueryBehaviour).Notify(0xc00014df80, {0x2eda480?, 0xc00216cf00?}, {0x2ec3be0?, 0xc00253a000?})
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/query.go:189 +0x109f
github.com/plprobelab/zikade/internal/coord.(*NodeHandler).send(0xc00079d880, {0x2eda480, 0xc00216cf00}, {0x2ecf4d0?, 0xc00262e460?})
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/network.go:165 +0x33e
github.com/plprobelab/zikade/internal/coord.(*WorkQueue[...]).Enqueue.func1.1()
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:81 +0x108
created by github.com/plprobelab/zikade/internal/coord.(*WorkQueue[...]).Enqueue.func1
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:75 +0x7a

There are 8 goroutines waiting on the lock at this point:

goroutine 6990 [sync.Mutex.Lock, 33 minutes]:
sync.runtime_SemacquireMutex(0x11eeaff?, 0x80?, 0xc001f4a390?)
	/home/iand/sdk/go1.20.5/src/runtime/sema.go:77 +0x26
sync.(*Mutex).lockSlow(0xc00014dfd8)
	/home/iand/sdk/go1.20.5/src/sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
	/home/iand/sdk/go1.20.5/src/sync/mutex.go:90
github.com/plprobelab/zikade/internal/coord.(*PooledQueryBehaviour).Notify(0xc00014df80, {0x2eda480?, 0xc001f4a390?}, {0x2ec3d40?, 0xc0015c87d0?})
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/query.go:152 +0x125
github.com/plprobelab/zikade/internal/coord.(*NodeHandler).send(0xc001293c00, {0x2eda480, 0xc001f4a390}, {0x2ecf4f8?, 0xc001293600?})
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/network.go:186 +0x63b
github.com/plprobelab/zikade/internal/coord.(*WorkQueue[...]).Enqueue.func1.1()
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:81 +0x108
created by github.com/plprobelab/zikade/internal/coord.(*WorkQueue[...]).Enqueue.func1
	/home/iand/pkg/mod/github.com/plprobelab/[email protected]/internal/coord/behaviour.go:75 +0x7a

Somehow we have lost the select that should be reading from the waiter's channel.

SearchValue update peers with newer record

Migrated from libp2p/go-libp2p-kad-dht#945

When searching for an IPNS or PK record, we will store the updated version of that record with all of the closest peers that we found while querying. If the query aborts after the three closest peers haven't returned any closer ones, we still update the remaining 17 peers that we haven't contacted and store our currently known "best" record with them. However, these 17 peers may also hold the same record - we just don't know, because we haven't contacted them yet.

I would change the logic in V2 to only store the updated record with peers that provably returned a stale record during the query operation.

Reprovide Sweep Implementation

Prerequisites

#41 is required, because currently it is the IPFS implementations that must handle reproviding. This optimization only works if the DHT manages reprovides itself.

Description

The project is well described at DHT Reprovide Sweep. This change is a client change only, and is expected to have a significant impact, especially on large content providers.

Now that the Minimal Working Modular DHT is finished, all of the Reprovide Sweep logic can be encapsulated inside the Provide module.

References

Instrument DHT with metrics

ETA: 2023-08-31

Currently go-libp2p-kad-dht exports the following metrics

  • libp2p.io/dht/kad/received_messages - Total number of messages received per RPC
  • libp2p.io/dht/kad/received_message_errors - Total number of errors for messages received per RPC
  • libp2p.io/dht/kad/received_bytes - Total received bytes per RPC
  • libp2p.io/dht/kad/inbound_request_latency - Latency per RPC
  • libp2p.io/dht/kad/outbound_request_latency - Latency per RPC
  • libp2p.io/dht/kad/sent_messages - Total number of messages sent per RPC
  • libp2p.io/dht/kad/sent_message_errors - Total number of errors for messages sent per RPC
  • libp2p.io/dht/kad/sent_requests - Total number of requests sent per RPC
  • libp2p.io/dht/kad/sent_request_errors - Total number of errors for requests sent per RPC
  • libp2p.io/dht/kad/sent_bytes - Total sent bytes per RPC
  • libp2p.io/dht/kad/network_size - Network size estimation

We should replicate these or provide equivalents.

See also libp2p/go-libp2p-kad-dht#304 for more discussion
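For illustration, a minimal sketch of how one equivalent counter could be registered with the OpenTelemetry metrics API (the instrument name and attribute key are placeholders, not a decided naming scheme):

package telemetry

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// newReceivedMessagesCounter creates a counter equivalent to
// libp2p.io/dht/kad/received_messages.
func newReceivedMessagesCounter() (metric.Int64Counter, error) {
	meter := otel.Meter("github.com/plprobelab/zikade")
	return meter.Int64Counter(
		"received_messages",
		metric.WithDescription("Total number of messages received per RPC"),
	)
}

// recordReceived increments the counter, tagged with the RPC message type.
func recordReceived(ctx context.Context, c metric.Int64Counter, msgType string) {
	c.Add(ctx, 1, metric.WithAttributes(attribute.String("message_type", msgType)))
}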

Accelerated DHT Client

We should decide whether we want to create a FullRT routing table implementation, and its associated refresh mechanism (crawler), or whether the FullRT from the old implementation should be superseded by Reprovide Sweep. Note that the two don't have exactly the same scope: Reprovide Sweep is more efficient at reproviding content than FullRT, but FullRT is (supposed to be) more efficient at lookups than a normal routing table (using Reprovide Sweep). IIUC most FullRT users are using it to reprovide content, so the switch to Reprovide Sweep would be an upgrade for them. However, we may break users who depend on the FullRT for faster lookups if we don't implement it in the new IPFS DHT implementation. There are other efficient routing table alternatives described in probe-lab/go-kademlia#2.

As the accelerated DHT client is an option in Kubo, we could ship the new IPFS DHT implementation without implementing the FullRT, and Kubo could depend on the old IPFS DHT implementation if the accelerated DHT client option is set.

Optimize reading libp2p messages with sync.Pool buffers

The libp2p endpoint doesn't yet make use of sync.Pool to optimize memory allocations when writing/reading messages to/from a libp2p stream. This seems like an easy performance improvement.

go-libp2p-kad-dht is using sync.Pool (see https://github.com/libp2p/go-libp2p-kad-dht/blob/978cb74f5fdf846e09d5769bb4dfb9f962135c38/internal/net/message_manager.go#L360-L368)
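For illustration, a minimal sketch of a pooled read buffer (readBufPool, maxMessageSize and readMessage are made-up names, not the existing endpoint API):

package libp2pendpoint

import (
	"io"
	"sync"
)

// maxMessageSize is an assumed cap on a single DHT message.
const maxMessageSize = 16 * 1024

var readBufPool = sync.Pool{
	New: func() any {
		buf := make([]byte, maxMessageSize)
		return &buf
	},
}

// readMessage reads into a pooled buffer and passes the bytes to handle,
// which must not retain the slice after returning. A real implementation
// would read a varint length prefix and fill exactly that many bytes; a
// single Read is enough for this sketch.
func readMessage(r io.Reader, handle func([]byte) error) error {
	bufPtr := readBufPool.Get().(*[]byte)
	defer readBufPool.Put(bufPtr)

	n, err := r.Read(*bufPtr)
	if err != nil {
		return err
	}
	return handle((*bufPtr)[:n])
}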

References

https://github.com/plprobelab/go-kademlia/blob/main/network/endpoint/libp2pendpoint/libp2pendpoint.go
