GithubHelp home page GithubHelp logo

Comments (11)

ctb avatar ctb commented on June 8, 2024 1

wow, this IS an old issue!

um, there is no k-mer <-> k-mer edit distance information available, I'm afraid. The hashing function explicitly breaks any link between similar k-mers, in fact!

from sourmash.

ctb avatar ctb commented on June 8, 2024 1

to answer @anmwinter question --

first, major apologies for taking so long!

second, any metric that can operate on k-mers should work with k-mer hashes from scaled signatures. we've looked at things like accumulation curves (see comment) and and also @taylorreiter has thought a bit about diversity metrics for collections of k-mers. Happy to chat further if you have a specific calculation to do!

from sourmash.

nmb85 avatar nmb85 commented on June 8, 2024

Sorry to raise an old issue from the dead, but this is the only mention of my question that I could find. Do the distances between hashes, i.e., the edit (hamming, levenshtein, etc) distance between, for example, 16571421240 <-> 111660087188, contain any underlying kmer <-> kmer distance information? It would be interesting to build a tree like suggested and use that for a unifrac distance calculation on a collection of signatures.

from sourmash.

nmb85 avatar nmb85 commented on June 8, 2024

@taylorreiter (and anyone else, haha!) When you have a moment (I realize that you and your team are busy with your main projects), I was wondering what you thought about this:

Could one build a tree like the BK Tree mentioned by @anmwinter, place minhash abundances on that tree, and use the earth mover's distance to get a diversity-informed distance metric between minhash sets? Earth mover's distance is equivalent to unifrac, I think...

Regardless of whether or not it is phylogenetically informative, it still would give a more information-rich distance metric between datasets than Jaccard, right?

from sourmash.

taylorreiter avatar taylorreiter commented on June 8, 2024

I don't have an intuition about whether this would work or not. Can you describe a little more how this would be more information-rich than Jaccard? Jaccard is based on presence absence of the hash, so presumably this has more information because it accounts for abundances?

As Titus said above, if it works on k-mers, it should work on minhashes. It might be worth exploring a proof of concept in k-mer space?

from sourmash.

nmb85 avatar nmb85 commented on June 8, 2024

Hi, @taylorreiter , thanks very much for your thoughts! First re your questions:

Can you describe a little more how this would be more information-rich than Jaccard?

Sure, as far as I understand it and in analogy only (I'm not fluent in symbolic math, to my chagrin): whereas Jaccard Distance tells you how many kmers/hashes are shared between samples, Earth Mover's Distance (EMD) would tell you how many kmers/hashes "piled up" at defined locations in a 2-D area (think of it as mounds, ponds, and slopes in your backyard: a landscape) that constitute one sample need to be moved to form the same landscape in another sample.

So, to give a practical example that you're probably familiar with, and please ignore if you already know this: in the case of 16s rRNA sequences, earth mover's distance is the same as the weighted unifrac distance. Unifrac is based on the phylogenetic distribution of the sequences across a phylogenetic tree (which incorporates not just the classification of the kmer/sequence but also the phylogenetic distance between the kmers/sequences). The phylogenetic tree that is formed from all samples from which measurements will be made is analogous to the 2-D area across which "dirt" is distributed in samples measured by EMD. Weighted unifrac is the most information-rich distance metric that I'm aware of for phylogenetic trees that contain count information.

With hashes, you can't build a phylogenetic tree of hash distances because hashes don't contain phylogenetic information per @ctb 's comment above - however - if there is still true diversity information contained in those hashes, i.e., hash similarity is proportional somehow to the similarity between the kmers represented by those hashes (which is something I don't know, and my main question to you), then the earth mover's distance of hash sets shared by samples would add at least some information about the similarity between the diversity distribution of the samples. So, as @anmwinter suggested hierarchically clustering hashes by their pairwise edit distances, it struck me that you may be able to use the diversity information contained in the BK-tree to inform an EMD between hash sets (analogously to phylogenetic trees + weighted unifrac for 16S sequences). That would provide a much more information-rich distance metric between samples than Jaccard.

Jaccard is based on presence absence of the hash, so presumably this has more information because it accounts for abundances?

Yes, plus it accounts for the difference in the structure of diversity between two samples, which is an additional source of information.

It might be worth exploring a proof of concept in k-mer space?

Would hierarchically clustering kmers like that be computationally practical?

So, just to clarify, my main question to you in this post is:
"Is hash similarity proportional somehow to the similarity between the kmers represented by those hashes?"

Thanks, Taylor!

from sourmash.

luizirber avatar luizirber commented on June 8, 2024

Just dropping a quick note that we actually have hash counts too, and that might be useful information to use...

(Sorry for short message, will comment more next week)

from sourmash.

nmb85 avatar nmb85 commented on June 8, 2024

Just dropping a quick note that we actually have hash counts too, and that might be useful information to use...

@luizirber For sure! I think count data is required for calculating the earth mover's distance (aka Wassertein or Kantorovich–Rubinstein).

An alternative distance measure that is still more informative than the simple Jaccard distance, if a tree representing hash diversity is not a useful measure of diversity, is the weighted Jaccard and is proposed in this paper on the Order MinHash (OMH). Have you heard of this type of hash? It wouldn't be as informative as EMD, but it might be a nice option if the EMD is impossible to calculate from hashes.

from sourmash.

luizirber avatar luizirber commented on June 8, 2024

I played with the weighted Jaccard in another codebase, and it is fairly straighforward to do the same for Scaled MinHash sketches. And while WJ is not well defined for regular MinHash (you would need something like CWS for it), it does work for Scaled MinHash sketches.

(deeper analysis on why WJ works for Scaled MinHash sketches pending =])

from sourmash.

ctb avatar ctb commented on June 8, 2024

(note that we have some empirical evidence now that hash abundances are properly representative - see these (not necessarily permanent... sorry...) links, figure 8 -- hash abundance correlates with mapping for mock and high-coverage real data sets.

https://ctb.github.io/2020-grist-examples/reports/report-SRR606249.html
https://ctb.github.io/2020-grist-examples/reports/report-p8808mo11.html

These are for sourmash gather output, not sourmash --containment output, but I can produce them for the latter easily enough should there be interest.

from sourmash.

ctb avatar ctb commented on June 8, 2024

(key figures screenshotted for futureproofing -- )
Screen Shot 2020-11-07 at 7 11 01 AM
Screen Shot 2020-11-07 at 7 10 55 AM

from sourmash.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.