0xcaff / dht-crawler
Tools to crawl and index the BitTorrent DHT.
The following should be easy to call.
This project is awesome, but no one can tell because there isn't a README.
Make a helper to asynchronously and iteratively query find_node and get_node. We would only need find_node for bootstrapping (find_node(self)).
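A minimal sketch of what the iterative lookup helper could look like. NodeId is simplified to a u64 with XOR distance, and the network RPC is replaced by a plain callback so the loop structure is visible; none of these names come from the crawler's actual code.

```rust
use std::collections::HashSet;

// Illustrative stand-in for a 160-bit DHT node id.
type NodeId = u64;

/// Repeatedly query the closest known node we haven't asked yet,
/// folding any returned nodes back into the candidate set, until
/// no unqueried nodes remain. `query` stands in for a find_node RPC.
fn iterative_find_node<F>(target: NodeId, bootstrap: Vec<NodeId>, mut query: F) -> Vec<NodeId>
where
    F: FnMut(NodeId, NodeId) -> Vec<NodeId>,
{
    let mut known = bootstrap;
    let mut queried = HashSet::new();
    loop {
        // Sort by XOR distance so the closest candidate comes first.
        known.sort_by_key(|&n| n ^ target);
        let next = match known.iter().copied().find(|n| !queried.contains(n)) {
            Some(n) => n,
            None => break, // everyone has been asked; lookup converged
        };
        queried.insert(next);
        for n in query(next, target) {
            if !known.contains(&n) {
                known.push(n);
            }
        }
    }
    known
}
```

The real helper would issue the queries concurrently and cap the number of in-flight requests, but the convergence condition is the same.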
Currently, the TransactionFuture doesn't clean up after itself when it is dropped. It should, so that timeouts can be implemented easily.
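One way the cleanup could work, assuming pending transactions live in a shared map keyed by transaction id; the types here are illustrative, not the crawler's actual ones.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

// Hypothetical shared table of in-flight transactions,
// mapping transaction id to the query name.
type Pending = Rc<RefCell<HashMap<u16, &'static str>>>;

struct TransactionFuture {
    id: u16,
    pending: Pending,
}

impl Drop for TransactionFuture {
    /// Deregister the transaction when the future is dropped, e.g.
    /// because a timeout combinator discarded it. Without this, the
    /// entry would leak and a late response would hit a dead slot.
    fn drop(&mut self) {
        self.pending.borrow_mut().remove(&self.id);
    }
}
```

With this in place, a timeout wrapper can simply drop the inner future and the transaction table stays consistent.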
How do I use these tools, and what can be achieved with them? Can I get torrent infohashes, and how are they stored?
From BEP0005:
Each bucket should maintain a "last changed" property to indicate how "fresh" the contents are. When a node in a bucket is pinged and it responds, or a node is added to a bucket, or a node in a bucket is replaced with another node, the bucket's last changed property should be updated. Buckets that have not been changed in 15 minutes should be "refreshed." This is done by picking a random ID in the range of the bucket and performing a find_nodes search on it. Nodes that are able to receive queries from other nodes usually do not need to refresh buckets often. Nodes that are not able to receive queries from other nodes usually will need to refresh all buckets periodically to ensure there are good nodes in their table when the DHT is needed.
I believe this can also be used to do bootstrapping. First, the bootstrap nodes are put into the routing table. Next, the only bucket in the routing table is refreshed. This would be the only time (aside from adding the bootstrap node) that nodes are added to the routing table.
Need to think about the ownership of the bucket refresh process.
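The "last changed" bookkeeping from the quoted passage can be sketched like this; the bucket contents and the find_node search triggered on refresh are elided, and the method names are illustrative.

```rust
use std::time::{Duration, Instant};

// Sketch of a bucket's freshness tracking per BEP 5.
struct Bucket {
    last_changed: Instant,
}

impl Bucket {
    fn new() -> Self {
        Bucket { last_changed: Instant::now() }
    }

    /// Call whenever a node in the bucket responds to a ping,
    /// is added, or replaces another node.
    fn touch(&mut self) {
        self.last_changed = Instant::now();
    }

    /// A bucket untouched for 15 minutes is due for a refresh:
    /// pick a random id in the bucket's range and find_node it.
    /// Taking `now` as a parameter keeps this testable.
    fn needs_refresh_at(&self, now: Instant) -> bool {
        now.duration_since(self.last_changed) >= Duration::from_secs(15 * 60)
    }
}
```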
The errors don't provide enough context.
Depends on #3
Useful for defining the public API of the DHT.
Envelopes usually contain a binary string with an encoded message. Our Envelope isn't an envelope in that sense.
Implement logging.
There should be a way to keep track of relevant events. It also shouldn't be an obstacle to the DHT client.
The DHT port is also used by the uTP protocol, I think. There should be a way for clients to use both of these together. http://www.bittorrent.org/beps/bep_0029.html
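A shared UDP socket could demultiplex the two protocols by inspecting the first byte: KRPC messages are bencoded dictionaries and always start with b'd' (0x64), while a uTP header's first byte packs type and version (0x01, 0x11, 0x21, 0x31, or 0x41) and can never be 0x64. This heuristic is an assumption about how the sharing could be done, not something the crawler implements today.

```rust
/// Classify an incoming UDP datagram on the shared port.
/// Bencoded KRPC traffic starts with the byte b'd'; anything
/// else on this port is treated as uTP (or junk).
fn looks_like_dht(packet: &[u8]) -> bool {
    packet.first() == Some(&b'd')
}
```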
Currently, verifying the validity of a token requires having state, and that state needs to be updated every 15 minutes. We can remove the need for this state by sending an encrypted blob which includes the IP address and the time at which the token expires. This would work similarly to JWT, but with a more compact encoding and a fixed encryption algorithm.
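A sketch of the stateless scheme: bind the querying IP and an expiry timestamp into a self-authenticating token, so verification needs only the secret, not a rotating table. DefaultHasher stands in for a real MAC (e.g. HMAC over the same fields) purely for illustration; it is NOT cryptographically secure.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::Ipv4Addr;

/// Build a token as (cleartext expiry, tag over secret + ip + expiry).
/// DefaultHasher is a placeholder for a proper keyed MAC.
fn make_token(secret: u64, ip: Ipv4Addr, expires_at: u64) -> (u64, u64) {
    let mut h = DefaultHasher::new();
    secret.hash(&mut h);
    ip.hash(&mut h);
    expires_at.hash(&mut h);
    (expires_at, h.finish())
}

/// Accept the token only if the tag matches for this IP
/// and the embedded expiry hasn't passed.
fn verify_token(secret: u64, ip: Ipv4Addr, now: u64, token: (u64, u64)) -> bool {
    let (expires_at, tag) = token;
    now < expires_at && make_token(secret, ip, expires_at).1 == tag
}
```

The server keeps no per-token state at all: anything it can re-derive and match, it issued.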
There's a super janky routing table implementation. Write an actual working one.
Check whether many nodes support the DHT crawl request outlined in BEP 51.
We would like to maximize the number of IP addresses and torrent infohashes collected.
We will collect infohashes whenever we see them. For example, whenever we receive a get_peers or announce_peer request, the infohash will be collected.
We will collect IP addresses whenever someone calls get_peers or announce_peer. BEP 42 says only peers with valid node IDs are considered valid storage targets, making it difficult to scale up the announce_peer part of collection without many IP addresses.
Prior work in this area creates an abusive client which pretends to be near node IDs it finds: https://github.com/boramalper/magnetico
We need to figure out a strategy which balances being good and collecting tons of information.
Operations:
Currently, if a request is sent and a response isn't received, the Future is polled forever. There should be a timeout and an error instead.
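One shape this could take: track a deadline per outstanding transaction and periodically sweep the expired ones, resolving each with a timeout error instead of leaving its future pending. The names here are illustrative, not the crawler's actual types.

```rust
use std::collections::HashMap;
use std::time::Instant;

// Hypothetical table of in-flight requests.
struct Outstanding {
    deadlines: HashMap<u16, Instant>, // transaction id -> deadline
}

impl Outstanding {
    /// Remove and return every transaction whose deadline has
    /// passed; the caller fails each corresponding future with a
    /// timeout error. Run this from a periodic timer.
    fn sweep_expired(&mut self, now: Instant) -> Vec<u16> {
        let expired: Vec<u16> = self
            .deadlines
            .iter()
            .filter(|(_, &deadline)| deadline <= now)
            .map(|(&id, _)| id)
            .collect();
        for id in &expired {
            self.deadlines.remove(id);
        }
        expired
    }
}
```

Combined with the Drop cleanup on TransactionFuture, a timed-out entry disappears from the table exactly once.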
The RoutingTable shouldn't have the same node twice.
There should be a bootstrap method. It returns a future which, while polled, will keep the DHT updated. It should do the following:
Add the read-only flag to the specification and use it when only collecting IP addresses.
After adding a timeout to our only outbound request (#24), we've now encountered another problem: if too much traffic is sent at once, some of it gets dropped. During bootstrap, eventually so much traffic is sent at the same time that all requests fail and the bootstrap dies.
Some ideas on how to handle this:
How others handle this:
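One common approach, sketched here as an illustration rather than a record of what was decided: a token bucket that caps the outbound send rate, so bootstrap bursts get spread out instead of overflowing the socket buffer. The rate and capacity values are arbitrary.

```rust
use std::time::{Duration, Instant};

// Classic token-bucket rate limiter.
struct TokenBucket {
    capacity: f64,       // maximum burst size
    tokens: f64,         // tokens currently available
    refill_per_sec: f64, // steady-state send rate
    last: Instant,       // last refill time
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        TokenBucket { capacity, tokens: capacity, refill_per_sec, last: Instant::now() }
    }

    /// Returns true if a request may be sent now; otherwise the
    /// caller should queue it and retry after a short delay.
    fn try_acquire(&mut self, now: Instant) -> bool {
        // Refill proportionally to elapsed time, capped at capacity.
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.last = now;
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

An alternative with similar effect is a semaphore bounding the number of in-flight requests; the bucket bounds rate, the semaphore bounds concurrency.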
The DHT implementation isn't really useful because it can't be used as a library. get_nodes and announce_peer are needed for this to work well.
The await macro has been in nightly for a while now; we should consider switching to nightly so we can use it.
Encapsulation, tests, and documentation suck. Use packages as a forcing function to make these things better.
Expose announce on the top-level object and keep track of tokens so announce will work.
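A sketch of the per-node token bookkeeping this needs: announce_peer must echo the token that the same node returned in an earlier get_peers response. Type and method names are illustrative, not the crawler's API.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

// Remembers the opaque token each node handed us.
#[derive(Default)]
struct TokenStore {
    tokens: HashMap<SocketAddr, Vec<u8>>,
}

impl TokenStore {
    /// Record the token from a get_peers response.
    fn record(&mut self, node: SocketAddr, token: Vec<u8>) {
        self.tokens.insert(node, token);
    }

    /// Look up the token to attach to an announce_peer query;
    /// None means a get_peers round trip is needed first.
    fn token_for(&self, node: &SocketAddr) -> Option<&[u8]> {
        self.tokens.get(node).map(|t| t.as_slice())
    }
}
```

A production version would also expire entries, since nodes only accept recently issued tokens.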