
Search engine (archives) · open · 28 comments

ipfs-inactive commented on August 19, 2024 (+18)
Search engine

Comments (28)

dokterbob commented on August 19, 2024 (+8)

I've just launched an early alpha of a proper search engine, based on Elasticsearch: http://ipfs-search.com

chainify-io commented on August 19, 2024 (+6)

As mentioned by dokterbob above, we are currently developing an IPFS search engine that indexes from within IPFS rather than searching through a gateway. Specifically, our search engine listens to DHT chatter to find content to index. We just revived the search engine, and it is in an early (let's say) PoC stage, but in general it works. There's still a lot to do.
We think the world needs an (as much as possible) decentralized search engine for IPFS. The major challenge we currently face is getting people on board to help out.
So if there is someone who wants to join us in building an open, decentralized search engine for the web 3.0 -- just raise your hand! :-)
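
As a very rough sketch of that DHT-listening idea (everything named here is a hypothetical stand-in, not an actual chainify-io or libp2p API): subscribe to provider announcements and queue each announced hash for fetching and indexing.

    // Hypothetical sketch of indexing from DHT chatter; both declared
    // functions are assumptions, not real APIs.
    declare function onProviderAnnouncement(cb: (hash: string) => void): void;
    declare function fetchAndIndex(hash: string): Promise<void>;

    onProviderAnnouncement((hash) => {
      // Every hash announced on the DHT is a candidate for the index.
      fetchAndIndex(hash).catch(() => {
        // Unreachable or uninteresting content is simply skipped.
      });
    });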

rcurrie commented on August 19, 2024 (+5)

On a related vertical note: for the GA4GH Cancer Gene Trust, an IPFS-based distributed genomic store, I've implemented an Elasticsearch-based crawler with a web UI that shows the network and allows searching by gene and clinical features.

doesntgolf commented on August 19, 2024 (+4)

I made a prototype search tool here (also available here if it's unavailable on IPFS). The search index is built by scraping JSON search results from a public searx instance (which itself aggregates results from major search providers). This means the only things being searched are the page title, the URL, and a snippet of the page's text. It also means the search index only has about 1100 items. The client-side search is done with lunr.js. (search source, scraper source)

It's nothing impressive but was fun to hack together!
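
For reference, a minimal sketch of this approach, assuming scraped searx results are already in an array (the lunr.js calls are the library's real API; the sample data is made up):

    import lunr from "lunr";

    // Items scraped from a searx instance: title, URL, and text snippet.
    const docs = [
      { url: "https://example.com/page", title: "Example page", snippet: "Some page text ..." },
      // ... ~1100 more scraped results
    ];

    // Build the client-side index; lunr searches only the fields listed here.
    const idx = lunr(function () {
      this.ref("url");
      this.field("title");
      this.field("snippet");
      docs.forEach((d) => this.add(d));
    });

    // A query returns matching refs (URLs) ranked by score.
    console.log(idx.search("distributed search"));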

garyee commented on August 19, 2024 (+4)

I think this is a critical point if there is to be an HTML-style page web on IPFS (a hyperlinked information space -- the equivalent of the whole of the HTML-based web pages out there in the HTTP world).
The point where the internet as we have it today really took off, and became a thing for the average person and not just for nerds, was -- ta-da -- when search engines were invented. So without the means to search the swarm for information using simple words, IPFS will stay a distributed file system for nerds.
Just remember the effect when you had just discovered IPFS, set it up, ran your daemon, discovered the swarm ... and ... and ... nothing.
In order to find some information or file, you had to refer to the old web or some other kind of index.

So I don't know if there is ever going to be a web layer on top of IPFS; this is going to be a tough nut to crack...
I tried "google scholar"-ing it and found this:
http://nsfcac.rutgers.edu/TASSL/Papers/squid-hpdc-03.pdf ... to give an idea of the complexity.

I must admit that I discovered IPFS fairly recently, but nowhere have I found search acknowledged as a critical point in the plans to build the web 3.0 -- which it is...

nbingham1 commented on August 19, 2024 (+1)

If you go with one of the first two options, you lose flexibility -- mainly because the IPFS executable is determining what the index should look like. If you go with option 3, then you can have any number of indices with different structures storing different things. I don't think the search functionality should be integrated into the IPFS structure itself, but should instead be built on top as a web service. The real question you should be asking is not what an index should look like, but what a database built on top of the IPFS system should look like.

https://github.com/ipfs/ipfs/issues/82

RangerMauve commented on August 19, 2024 (+1)

Having a decentralized search engine is definitely a difficult challenge, and I think this will be the job of a company rather than the IPFS core team.

Google search already works fine with IPFS through gateways, so centralized search engines should be able to function just fine while a decentralized solution is developed by the community.

rht commented on August 19, 2024

+1. Static indexes (and any objects computed from the archives) should be static and versioned in the merkledag, just like the archives themselves.

dependency: file metadata ipfs/ipfs#36

davidar commented on August 19, 2024

@rht I'm leaning towards the first one too. I added the latter two mainly because I believe YaCy works that way.

The two main problems that (I think) need to be solved for static indexes are: discovering which nodes provide them (ipfs/notes#15), and how to aggregate them all. Index aggregation could either be done a) entirely client-side (easy to implement but inefficient, especially once the network gets large), or b) through cooperation between index nodes (more difficult to implement, but also more efficient for clients).

@jbenet suggested that CRDTs could be used to achieve (b), but I'm not familiar enough with them to make any intelligent comments :). I do know that trie merging is commutative (unlike some other search trees or hash tables), so it at least fits within a CRDT's assumptions. The main thing to keep in mind is that these indexes are likely to become quite large, so you'd want to coordinate the index merging without nodes having to replicate much of each other's indexes (I'm not sure how this would work within CRDT). One possible way of doing this is to partition prefixes across the nodes (since you only need to look at shared prefixes when performing the merge), but I'm not sure how this would work within IPFS.
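
To make the commutativity point concrete, here is a minimal sketch of a recursive trie merge (the node layout is made up for illustration): because every step is a set union, merge(a, b) and merge(b, a) produce the same trie, which is what lets the aggregation fit CRDT semantics.

    // Minimal trie node: a set of document refs plus labeled children.
    interface TrieNode {
      values: Set<string>;
      children: Map<string, TrieNode>;
    }

    // Recursive merge: set union at each node is commutative and
    // associative, so the final merged trie is independent of the
    // order in which indexes are combined.
    function merge(a: TrieNode, b: TrieNode): TrieNode {
      const out: TrieNode = {
        values: new Set([...a.values, ...b.values]),
        children: new Map(a.children),
      };
      for (const [label, sub] of b.children) {
        const mine = out.children.get(label);
        // Subtrees present in only one input are reused as-is, which is
        // why a merge only has to walk shared prefixes.
        out.children.set(label, mine ? merge(mine, sub) : sub);
      }
      return out;
    }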

davidar commented on August 19, 2024

Looks like @krl is working on search functionality for starlog: https://github.com/krl/aolog (IRC)

davidar commented on August 19, 2024

@nbingham1 None of them necessarily need to be integrated into ipfs itself, I'm just assuming nodes have some kind of indexing service running, which could be an entirely separate executable (and you could have multiple different ones running if you wanted). In terms of what a database on top of ipfs would look like, I'm in favor of one big json object transparently distributed over ipfs. You could then easily build radix trees and such on top of that.
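
As a sketch of what "radix trees on top of one big JSON object" might look like (the node layout and fetchJson helper are assumptions, not an existing IPFS API), each tree node could be a small JSON object whose child links are hashes, so a lookup only fetches the nodes along one path:

    // Hypothetical node layout: children map edge labels to the hash of
    // the child object, so the tree is spread across many small blocks.
    interface IndexNode {
      entries: string[];                    // document hashes for the term ending here
      children: { [edge: string]: string }; // edge label -> hash of child node
    }

    // Assumed helper: fetch and parse one JSON object by hash.
    declare function fetchJson(hash: string): Promise<IndexNode>;

    async function lookup(rootHash: string, term: string): Promise<string[]> {
      let node = await fetchJson(rootHash);
      let rest = term;
      while (rest.length > 0) {
        // Follow the (at most one) edge whose label prefixes the remainder.
        const edge = Object.keys(node.children).find((e) => rest.startsWith(e));
        if (!edge) return []; // term not in the index
        node = await fetchJson(node.children[edge]);
        rest = rest.slice(edge.length);
      }
      return node.entries;
    }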

jbenet commented on August 19, 2024

@nbingham1 @davidar indeed. very much indeed.

nbingham1 commented on August 19, 2024

Ah, OK, that makes much more sense. So each node runs the indexing service and updates the index whenever that node adds a file, but the index is still stored in a database format on the IPFS network. I was misinterpreting the options. (Sorry! I just found out about IPFS yesterday!)

davidar commented on August 19, 2024

@nbingham1 No need to apologise, clarifying questions are good :)

BrendanBenshoof commented on August 19, 2024

Merging the tries via gossip is really viable: essentially, each node just shares the "head node" of the trie it stores on IPFS with its peers. Each peer performs a recursive merge of the index (which normally costs on the order of the smaller index, since you can reuse unaltered entries) and then propagates the new index.

Then we can implement search in JS by pointing at that IPFS node's search index via /ipns/local/searchindexroot or somesuch.

Then anybody who wants can implement a search engine on the client side with whatever UI or features they want.

This runs into the classic Byzantine generals problem of people messing with the index. A proof-of-accuracy-based blockchain actually looks like a decent solution to this problem.

davidar commented on August 19, 2024

@BrendanBenshoof cool, sounds good

Do we actually need a blockchain though? I was thinking all of the (non-malicious) nodes would agree on the merged root node after the gossiping finishes, after which you should just be able to do a simple popular vote?

BrendanBenshoof commented on August 19, 2024

We could fold it into a web of trust, but straight democracy is highly vulnerable to Sybil attacks.

We just wandered into the big open issue in distributed systems: how to reach consensus in the presence of attackers. I'm not a fan of proof-of-work blockchains as a general solution to this problem. I like the idea of a blockchain here because there is a clear "proof of truthiness": nodes can sanity-check a "sub-index" against the contents of the file it indexes to ensure it is at least accurate. Once all the added "sub-indexes" have been inspected, you can also confirm the deterministic merging of them into the "primary index" and agree on the new "root" hash.

My only worry is that it is too wasteful in processing power and too complex. It might be more efficient just to deal with crap in the database.

A block would look like:


Address of last block
new source id | index_head
new source id | index_head
...

new root id

Basic dissemination would look like:

  • I get a new block:
    • I sanity check the new indexes
    • I merge the indexes into the "master index" and confirm the new head node
  • I propagate the block if it passes muster
  • I listen for newly advertised indexes
    • I sanity check that index
    • I propagate the index
    • If I have more than k indexes, publish a new block

This also has the side effect of aggressively caching newly indexed files by all the search engine participants. It also feeds into my preference that indexing be an active choice by a content producer rather than a spider.
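
A rough sketch of what checking such a block could look like (all types and helpers here are hypothetical stand-ins that fill in the layout above):

    // Hypothetical encoding of the block layout sketched above.
    interface IndexBlock {
      prev: string;                                       // address of last block
      additions: { source: string; indexHead: string }[]; // newly advertised sub-indexes
      newRoot: string;                                    // claimed new master-index root
    }

    // Assumed helpers: spot-check a sub-index against the content it
    // indexes, and deterministically merge two indexes by hash.
    declare function sanityCheck(indexHead: string): Promise<boolean>;
    declare function mergeIndexes(rootA: string, rootB: string): Promise<string>;

    // Accept (and propagate) a block only if every new sub-index passes
    // the sanity check and the deterministic merge reproduces newRoot.
    async function validateBlock(block: IndexBlock, currentRoot: string): Promise<boolean> {
      let root = currentRoot;
      for (const { indexHead } of block.additions) {
        if (!(await sanityCheck(indexHead))) return false;
        root = await mergeIndexes(root, indexHead);
      }
      return root === block.newRoot;
    }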

davidar commented on August 19, 2024

@BrendanBenshoof OK, I think I see what you mean, but distributed systems and the attacks on them aren't really my area, and it sounds like you're more familiar with this than I am.

Just to clarify my thoughts (though, as you say, I haven't considered possible attacks on this): suppose you have two nodes, and they swap information in order to agree on the merge of both their indexes. They then both sign this hash, so that everyone else can verify that they're satisfied this merged index contains their own subindexes. You then do a similar thing with merged indexes, and so on, so that you end up with a global root signed by everyone with a stake in it.

Thinking about it some more, I agree a popular vote wouldn't work in general; I guess what I meant was a vote from nodes that you trust (e.g. the bootstrap nodes), which you assume aren't malicious, although some of them might make mistakes.

But like I said, I have no idea how well these things will work in practice ☺

@jbenet @whyrusleeping thoughts?

cryptix commented on August 19, 2024

Maybe cothorities are relevant here? Warning: I came across this recently and am still trying to make sense of it.

To summarize from the slides: collective authorities want to make tamper-evident public logs while solving the forking and freshness problems without falling back on a proof-of-work scheme, the risk of temporary forks, or a 51%-attack vulnerability.

I imagine IPFS could offer this, and individual projects could use it to sign their latest versions and indexes?

rht commented on August 19, 2024

@cryptix I haven't read the paper fully. Regardless, from its scope and aim, it can be used in two ways:

  1. Indirectly, only to keep track of whitelisted index providers. Providers are certified in the same way as HTTP servers. One issue: what if one requests content from a provider (which returns a hash), plus the search index for that content, but the said provider is not in the whitelist log? It looks like the log has to be updated manually and witnessed by the cosigners. In this case, maybe the content credential (one has to trust the provider to return the correct hash) can be reused for the index.
  2. Directly (provider-independent), where every new index is verified by all cosigners -- provided that there is a cheap way to verify an index without recreating it from scratch. One way is to spot-check a few search terms against the full index (much like a Rabin fingerprint; see the sketch below). However, this is not as complete as PKI or validating the proof of a theorem.
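
A minimal sketch of that spot check (fetchText and lookupTerm are assumed helpers, and a real check would need the same tokenisation on both sides):

    // Assumed helpers: fetch the raw text behind a content hash, and look
    // up the posting list for a term in the advertised index.
    declare function fetchText(contentHash: string): Promise<string>;
    declare function lookupTerm(indexRoot: string, term: string): Promise<string[]>;

    // Re-derive the answer for a few sampled terms from the content itself
    // and compare with what the index claims; any mismatch marks it bad.
    async function spotCheck(indexRoot: string, contentHash: string, sampleTerms: string[]): Promise<boolean> {
      const text = (await fetchText(contentHash)).toLowerCase();
      for (const term of sampleTerms) {
        const inContent = text.includes(term.toLowerCase());
        const postings = await lookupTerm(indexRoot, term);
        const inIndex = postings.includes(contentHash);
        if (inContent !== inIndex) return false;
      }
      return true;
    }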

Also, since it is not crucial to get the timestamps of indexes (of immutable content) in order, I don't think a full-blown blockchain (proof-of-work or not) is necessary. There should be a much faster way than method 2 above.

An alternative to fingerprinting the index is to verify the code used to build it, provided the code can be verified to run in a sandbox with known parameters.

(Worth looking at: SCP, which uses quorum slices, enabling more organic growth of the set of signing nodes.)

rht commented on August 19, 2024

(OK, the fingerprint check is basically @BrendanBenshoof's "proof of truthiness", whose drawbacks have been mentioned. I favor the weak-ordering method described in the CRDT discussion, if there is a fleshed-out protocol for it.)

davidar commented on August 19, 2024

I think it would make sense for index providers to sign the aggregated index once they've verified it contains their own index. That way "verification" is simply checking that it's been signed by the provider of the part of the index you're interested in.
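
In data-structure terms, that could look something like the following (purely illustrative; nothing here is a defined IPFS format):

    // Hypothetical record published alongside an aggregated index.
    interface SignedMerge {
      mergedIndex: string;                             // hash of the aggregated index
      parts: string[];                                 // hashes of the constituent sub-indexes
      signatures: { provider: string; sig: string }[]; // one per contributing provider
    }

    // A client only needs to check that the provider of the sub-index it
    // cares about has signed the aggregate it is about to query.
    function trusted(
      m: SignedMerge,
      provider: string,
      verify: (provider: string, sig: string, data: string) => boolean
    ): boolean {
      const s = m.signatures.find((x) => x.provider === provider);
      return s !== undefined && verify(provider, s.sig, m.mergedIndex);
    }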

(deleted user) commented on August 19, 2024

@doesntgolf Very nice to see progress here ;)

victorb commented on August 19, 2024

@dokterbob thanks for sharing, looks very cool :) Maybe it's something you can submit a PR to add to https://github.com/ipfs/awesome-ipfs ?

madorian commented on August 19, 2024

Is nobody working on a distributed search engine? Something where searching costs money.

Kubuxu commented on August 19, 2024

@magik6k was experimenting with pre-computed sharded search for Wikipedia dumps. Magik, any input on this?

magik6k commented on August 19, 2024

There is https://github.com/magik6k/distributed-wiki-search. The code there isn't great, though it should give some idea of how it could be implemented.

RangerMauve commented on August 19, 2024

@chainify-io Would you be interested in another collaborator in your efforts? I've been writing down ideas for a distributed search for web pages and it'd be cool to discuss and see if we can work together.
