Comments (11)
wow, this IS an old issue!
um, there is no k-mer <-> k-mer edit distance information available, I'm afraid. The hashing function explicitly breaks any link between similar k-mers, in fact!
from sourmash.
to answer @anmwinter question --
first, major apologies for taking so long!
second, any metric that can operate on k-mers should work with k-mer hashes from scaled signatures. we've looked at things like accumulation curves (see comment) and and also @taylorreiter has thought a bit about diversity metrics for collections of k-mers. Happy to chat further if you have a specific calculation to do!
from sourmash.
Sorry to raise an old issue from the dead, but this is the only mention of my question that I could find. Do the distances between hashes, i.e., the edit (hamming, levenshtein, etc) distance between, for example, 16571421240 <-> 111660087188, contain any underlying kmer <-> kmer distance information? It would be interesting to build a tree like suggested and use that for a unifrac distance calculation on a collection of signatures.
from sourmash.
@taylorreiter (and anyone else, haha!) When you have a moment (I realize that you and your team are busy with your main projects), I was wondering what you thought about this:
Could one build a tree like the BK Tree mentioned by @anmwinter, place minhash abundances on that tree, and use the earth mover's distance to get a diversity-informed distance metric between minhash sets? Earth mover's distance is equivalent to unifrac, I think...
Regardless of whether or not it is phylogenetically informative, it still would give a more information-rich distance metric between datasets than Jaccard, right?
from sourmash.
I don't have an intuition about whether this would work or not. Can you describe a little more how this would be more information-rich than Jaccard? Jaccard is based on presence absence of the hash, so presumably this has more information because it accounts for abundances?
As Titus said above, if it works on k-mers, it should work on minhashes. It might be worth exploring a proof of concept in k-mer space?
from sourmash.
Hi, @taylorreiter , thanks very much for your thoughts! First re your questions:
Can you describe a little more how this would be more information-rich than Jaccard?
Sure, as far as I understand it and in analogy only (I'm not fluent in symbolic math, to my chagrin): whereas Jaccard Distance tells you how many kmers/hashes are shared between samples, Earth Mover's Distance (EMD) would tell you how many kmers/hashes "piled up" at defined locations in a 2-D area (think of it as mounds, ponds, and slopes in your backyard: a landscape) that constitute one sample need to be moved to form the same landscape in another sample.
So, to give a practical example that you're probably familiar with, and please ignore if you already know this: in the case of 16s rRNA sequences, earth mover's distance is the same as the weighted unifrac distance. Unifrac is based on the phylogenetic distribution of the sequences across a phylogenetic tree (which incorporates not just the classification of the kmer/sequence but also the phylogenetic distance between the kmers/sequences). The phylogenetic tree that is formed from all samples from which measurements will be made is analogous to the 2-D area across which "dirt" is distributed in samples measured by EMD. Weighted unifrac is the most information-rich distance metric that I'm aware of for phylogenetic trees that contain count information.
With hashes, you can't build a phylogenetic tree of hash distances because hashes don't contain phylogenetic information per @ctb 's comment above - however - if there is still true diversity information contained in those hashes, i.e., hash similarity is proportional somehow to the similarity between the kmers represented by those hashes (which is something I don't know, and my main question to you), then the earth mover's distance of hash sets shared by samples would add at least some information about the similarity between the diversity distribution of the samples. So, as @anmwinter suggested hierarchically clustering hashes by their pairwise edit distances, it struck me that you may be able to use the diversity information contained in the BK-tree to inform an EMD between hash sets (analogously to phylogenetic trees + weighted unifrac for 16S sequences). That would provide a much more information-rich distance metric between samples than Jaccard.
Jaccard is based on presence absence of the hash, so presumably this has more information because it accounts for abundances?
Yes, plus it accounts for the difference in the structure of diversity between two samples, which is an additional source of information.
It might be worth exploring a proof of concept in k-mer space?
Would hierarchically clustering kmers like that be computationally practical?
So, just to clarify, my main question to you in this post is:
"Is hash similarity proportional somehow to the similarity between the kmers represented by those hashes?"
Thanks, Taylor!
from sourmash.
Just dropping a quick note that we actually have hash counts too, and that might be useful information to use...
(Sorry for short message, will comment more next week)
from sourmash.
Just dropping a quick note that we actually have hash counts too, and that might be useful information to use...
@luizirber For sure! I think count data is required for calculating the earth mover's distance (aka Wassertein or Kantorovich–Rubinstein).
An alternative distance measure that is still more informative than the simple Jaccard distance, if a tree representing hash diversity is not a useful measure of diversity, is the weighted Jaccard and is proposed in this paper on the Order MinHash (OMH). Have you heard of this type of hash? It wouldn't be as informative as EMD, but it might be a nice option if the EMD is impossible to calculate from hashes.
from sourmash.
I played with the weighted Jaccard in another codebase, and it is fairly straighforward to do the same for Scaled MinHash sketches. And while WJ is not well defined for regular MinHash (you would need something like CWS for it), it does work for Scaled MinHash sketches.
(deeper analysis on why WJ works for Scaled MinHash sketches pending =])
from sourmash.
(note that we have some empirical evidence now that hash abundances are properly representative - see these (not necessarily permanent... sorry...) links, figure 8 -- hash abundance correlates with mapping for mock and high-coverage real data sets.
https://ctb.github.io/2020-grist-examples/reports/report-SRR606249.html
https://ctb.github.io/2020-grist-examples/reports/report-p8808mo11.html
These are for sourmash gather output, not sourmash --containment output, but I can produce them for the latter easily enough should there be interest.
from sourmash.
(key figures screenshotted for futureproofing -- )
from sourmash.
Related Issues (20)
- GTDB accessions to LCAs? HOT 2
- rust beta lints are failing HOT 1
- If there is a command or way to annotate individual contigs without creating many files? HOT 4
- unexpected `sig grep`, `sig extract` behavior with identical md5s HOT 2
- update GTDB - R08-RS214 HOT 1
- (3rd party) docs for using sourmash for contamination analysis, etc.
- update `doc/` notebooks & notebook links
- links to tutorial notebooks are not working!?
- LINS tutorials HOT 1
- /.singularity.d/runscript: 3: exec: trim-low-abund.py: not found HOT 4
- `tax grep` doesn't search the identifiers!
- Transcriptome of interest, gather in metaTs signatures, repeat. Faster or directly to mapping? HOT 4
- Panic Error when using sourmash gather HOT 3
- `sig cat` selection relies on md5sum, which can pull in unwanted sigs with duplicated md5 HOT 1
- automated test builds are failing routinely (but sporadically) b/c of coverage HOT 1
- Comparing the taxonomic composition of multiple metagenomes HOT 2
- Finding best reference from NGS data HOT 4
- `sig check` very slow for large databases? HOT 1
- print empty results from `sourmash tax`?
- tutorial TODO: finding the best reference sequences for NGS data HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from sourmash.