I was wondering if it was worth looking at within signature differences to get an idea

to answer <a class="user-mention notranslate" data-hovercard-type="user" data-hovercar

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hi, <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url=

I played with the weighted Jaccard in <a href="https://github.com/luizirber/d

Within signature distance? about sourmash HOT 11 OPEN

sourmash-bio commented on June 8, 2024

Within signature distance?

from sourmash.

Comments (11)

ctb commented on June 8, 2024 1

wow, this IS an old issue!

um, there is no k-mer <-> k-mer edit distance information available, I'm afraid. The hashing function explicitly breaks any link between similar k-mers, in fact!

from sourmash.

ctb commented on June 8, 2024 1

to answer @anmwinter question --

first, major apologies for taking so long!

second, any metric that can operate on k-mers should work with k-mer hashes from scaled signatures. we've looked at things like accumulation curves (see comment) and and also @taylorreiter has thought a bit about diversity metrics for collections of k-mers. Happy to chat further if you have a specific calculation to do!

from sourmash.

nmb85 commented on June 8, 2024

Sorry to raise an old issue from the dead, but this is the only mention of my question that I could find. Do the distances between hashes, i.e., the edit (hamming, levenshtein, etc) distance between, for example, 16571421240 <-> 111660087188, contain any underlying kmer <-> kmer distance information? It would be interesting to build a tree like suggested and use that for a unifrac distance calculation on a collection of signatures.

from sourmash.

nmb85 commented on June 8, 2024

@taylorreiter (and anyone else, haha!) When you have a moment (I realize that you and your team are busy with your main projects), I was wondering what you thought about this:

Could one build a tree like the BK Tree mentioned by @anmwinter, place minhash abundances on that tree, and use the earth mover's distance to get a diversity-informed distance metric between minhash sets? Earth mover's distance is equivalent to unifrac, I think...

Regardless of whether or not it is phylogenetically informative, it still would give a more information-rich distance metric between datasets than Jaccard, right?

from sourmash.

taylorreiter commented on June 8, 2024

I don't have an intuition about whether this would work or not. Can you describe a little more how this would be more information-rich than Jaccard? Jaccard is based on presence absence of the hash, so presumably this has more information because it accounts for abundances?

As Titus said above, if it works on k-mers, it should work on minhashes. It might be worth exploring a proof of concept in k-mer space?

from sourmash.

nmb85 commented on June 8, 2024

Hi, @taylorreiter , thanks very much for your thoughts! First re your questions:

Can you describe a little more how this would be more information-rich than Jaccard?

Sure, as far as I understand it and in analogy only (I'm not fluent in symbolic math, to my chagrin): whereas Jaccard Distance tells you how many kmers/hashes are shared between samples, Earth Mover's Distance (EMD) would tell you how many kmers/hashes "piled up" at defined locations in a 2-D area (think of it as mounds, ponds, and slopes in your backyard: a landscape) that constitute one sample need to be moved to form the same landscape in another sample.

So, to give a practical example that you're probably familiar with, and please ignore if you already know this: in the case of 16s rRNA sequences, earth mover's distance is the same as the weighted unifrac distance. Unifrac is based on the phylogenetic distribution of the sequences across a phylogenetic tree (which incorporates not just the classification of the kmer/sequence but also the phylogenetic distance between the kmers/sequences). The phylogenetic tree that is formed from all samples from which measurements will be made is analogous to the 2-D area across which "dirt" is distributed in samples measured by EMD. Weighted unifrac is the most information-rich distance metric that I'm aware of for phylogenetic trees that contain count information.

With hashes, you can't build a phylogenetic tree of hash distances because hashes don't contain phylogenetic information per @ctb 's comment above - however - if there is still true diversity information contained in those hashes, i.e., hash similarity is proportional somehow to the similarity between the kmers represented by those hashes (which is something I don't know, and my main question to you), then the earth mover's distance of hash sets shared by samples would add at least some information about the similarity between the diversity distribution of the samples. So, as @anmwinter suggested hierarchically clustering hashes by their pairwise edit distances, it struck me that you may be able to use the diversity information contained in the BK-tree to inform an EMD between hash sets (analogously to phylogenetic trees + weighted unifrac for 16S sequences). That would provide a much more information-rich distance metric between samples than Jaccard.

Jaccard is based on presence absence of the hash, so presumably this has more information because it accounts for abundances?

Yes, plus it accounts for the difference in the structure of diversity between two samples, which is an additional source of information.

It might be worth exploring a proof of concept in k-mer space?

Would hierarchically clustering kmers like that be computationally practical?

So, just to clarify, my main question to you in this post is:
"Is hash similarity proportional somehow to the similarity between the kmers represented by those hashes?"

Thanks, Taylor!

from sourmash.

luizirber commented on June 8, 2024

Just dropping a quick note that we actually have hash counts too, and that might be useful information to use...

(Sorry for short message, will comment more next week)

from sourmash.

nmb85 commented on June 8, 2024

Just dropping a quick note that we actually have hash counts too, and that might be useful information to use...

@luizirber For sure! I think count data is required for calculating the earth mover's distance (aka Wassertein or Kantorovich–Rubinstein).

An alternative distance measure that is still more informative than the simple Jaccard distance, if a tree representing hash diversity is not a useful measure of diversity, is the weighted Jaccard and is proposed in this paper on the Order MinHash (OMH). Have you heard of this type of hash? It wouldn't be as informative as EMD, but it might be a nice option if the EMD is impossible to calculate from hashes.

from sourmash.

luizirber commented on June 8, 2024

I played with the weighted Jaccard in another codebase, and it is fairly straighforward to do the same for Scaled MinHash sketches. And while WJ is not well defined for regular MinHash (you would need something like CWS for it), it does work for Scaled MinHash sketches.

(deeper analysis on why WJ works for Scaled MinHash sketches pending =])

from sourmash.

ctb commented on June 8, 2024

(note that we have some empirical evidence now that hash abundances are properly representative - see these (not necessarily permanent... sorry...) links, figure 8 -- hash abundance correlates with mapping for mock and high-coverage real data sets.

https://ctb.github.io/2020-grist-examples/reports/report-SRR606249.html
https://ctb.github.io/2020-grist-examples/reports/report-p8808mo11.html

These are for sourmash gather output, not sourmash --containment output, but I can produce them for the latter easily enough should there be interest.

from sourmash.

ctb commented on June 8, 2024

(key figures screenshotted for futureproofing -- )

from sourmash.

Within signature distance? about sourmash HOT 11 OPEN

Comments (11)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs