Comments (16)
from sourmash.
Titus,
I installed sourmash through pip. I am currently running it from the command line through a Jupyter notebook. Would getting the union be easier if I ran it through Python?
Thanks!
ara
from sourmash.
Right, it's not a built-in feature of the command-line interface, but it's
relatively easy to do via Python.
Can you provide me with an example of the sort of workflow you want to use?
e.g.
- calculate signatures for a bunch of sequences
- cluster signatures at some threshold
- retrieve all signatures that cluster with a specific query signature
- build union or intersection of signatures within a cluster
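The four steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not sourmash's own API: each signature is modeled as a bare set of integer hashes, the sample names and hash values are made up, and `jaccard`, `cluster`, and `cluster_of` are hypothetical helpers written for this example. Clustering here is simple single-linkage on Jaccard similarity.

```python
from itertools import combinations

# Toy signatures: each one is a set of integer hashes (values are made up).
signatures = {
    "bat1": {1, 2, 3, 4, 5},
    "bat2": {1, 2, 3, 4, 6},
    "bat3": {1, 2, 3, 7, 8},
    "soil1": {20, 21, 22, 23, 24},
}


def jaccard(a, b):
    """Jaccard similarity of two hash sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(sigs, threshold):
    """Single-linkage clustering: link any pair with Jaccard >= threshold."""
    parent = {name: name for name in sigs}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(sigs, 2):
        if jaccard(sigs[a], sigs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in sigs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


def cluster_of(query, clusters):
    """Return the cluster (a set of names) that contains the query."""
    return next(c for c in clusters if query in c)


clusters = cluster(signatures, threshold=0.4)
members = cluster_of("bat1", clusters)

# Union and intersection of the hashes within the query's cluster.
union_hashes = set.union(*(signatures[m] for m in members))
shared_hashes = set.intersection(*(signatures[m] for m in members))

print(sorted(members))
print(sorted(union_hashes))
print(sorted(shared_hashes))
```

With real data the hash sets would come from the signature files rather than being typed in by hand; the set operations themselves stay the same.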
from sourmash.
The workflow you describe is pretty much what I am looking for:
calculate signatures for a bunch of sequences
cluster signatures at some threshold
retrieve all signatures that cluster with a specific query signature
build union or intersection of signatures within a cluster
Longer term I'd like to be able to see if there is a signature that occurs across all samples. I am trying to sort out the species signatures and any geographic signatures. Currently our metagenomes are clustering by bat species with some exceptions.
Does sourmash use the same procedure that Mash uses to find similar hashes? And if so is that part coded in python?
One thing I wanted to try to code for was a table of "fuzzy" hashes that occur in each sample.
sample  fuzzyhash1  fuzzyhash2  fuzzyhash3
bat1    4           1           0
bat2    8           0           0
bat3    3           2           3
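A table like this could be built roughly as follows. This is a sketch under assumptions: the thread does not say how hashes would be grouped into "fuzzy" hashes, so the `groups` mapping below is hypothetical and supplied by hand, as are the sample names and hash values. The counts it produces are for the toy data only and are not meant to reproduce the numbers in the table.

```python
# Hypothetical grouping of hashes into "fuzzy" hashes. How to derive such
# groups is an open question; note that a hash function scatters similar
# k-mers, so numeric closeness of hash values is not a basis for grouping.
groups = {
    "fuzzyhash1": {10, 11, 12, 13},
    "fuzzyhash2": {20, 21},
    "fuzzyhash3": {30, 31, 32},
}

# Toy per-sample signatures (sets of made-up hash values).
samples = {
    "bat1": {10, 11, 20},
    "bat2": {10, 11, 12, 13},
    "bat3": {12, 20, 21, 30, 31, 32},
}


def count_table(samples, groups):
    """For each sample, count how many of its hashes fall in each group."""
    names = sorted(groups)
    return {
        sample: [len(hashes & groups[name]) for name in names]
        for sample, hashes in samples.items()
    }


table = count_table(samples, groups)

# Print the table with the group names as the header row.
print("sample", *sorted(groups), sep="\t")
for sample, counts in table.items():
    print(sample, *counts, sep="\t")
```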
from sourmash.
Are signatures and hashes the same thing?
from sourmash.
On Fri, Jul 29, 2016 at 09:55:22AM -0700, Ara Winter wrote:
Are signatures and hashes the same thing?
Here's how I'm using the terms:
Hash: individual k-mer
Signature: collection of hashes
from sourmash.
On Fri, Jul 29, 2016 at 09:30:15AM -0700, Ara Winter wrote:
The workflow you describe is pretty much what I am looking for:
calculate signatures for a bunch of sequences
cluster signatures at some threshold
retrieve all signatures that cluster with a specific query signature
build union or intersection of signatures within a cluster
ok! I'm not sure if I'll get to it this week but please do bump this issue
in a week or so.
Longer term I'd like to be able to see if there is a signature that occurs across all samples. I am trying to sort out the species signatures and any geographic signatures. Currently our metagenomes are clustering by bat species with some exceptions.
ok - I can give you reasons why it might not work, but it's worth a try!
Does sourmash use the same procedure that Mash uses to find similar hashes? And if so is that part coded in python?
Yes (it's mash compatible) and no (not coded in python). It used to be and
I could put together a Python description of the algorithm if you like.
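For readers who want a rough idea of the procedure in the meantime, here is a minimal bottom-k MinHash sketch in Python. It follows the general scheme described in the Mash paper (hash every canonical k-mer, keep the smallest hashes, estimate Jaccard from the merged sketch), but it is not sourmash's or Mash's actual code. In particular, it swaps in `hashlib.blake2b` for MurmurHash3, so its hash values are not Mash-compatible, and the function names are invented for this example.

```python
import hashlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def hash_kmer(kmer):
    """64-bit hash of a k-mer (blake2b stands in for MurmurHash3 here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, ksize, num):
    """Bottom-k sketch: the `num` smallest distinct k-mer hashes, sorted."""
    hashes = {
        hash_kmer(canonical(seq[i:i + ksize]))
        for i in range(len(seq) - ksize + 1)
    }
    return sorted(hashes)[:num]


def estimate_jaccard(a, b, num):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    merged = sorted(set(a) | set(b))[:num]  # bottom-k of the union
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


seq = "ATGGCGTACGCTTGACCGATAGCTAGGCTA"
revcomp = seq.translate(COMPLEMENT)[::-1]

sig_fwd = sketch(seq, ksize=5, num=10)
sig_rev = sketch(revcomp, ksize=5, num=10)

# Canonical k-mers make the sketch strand-independent.
print(sig_fwd == sig_rev)
print(estimate_jaccard(sig_fwd, sig_rev, num=10))
```

In the terms used earlier in the thread, each value returned by `hash_kmer` is a hash and the list returned by `sketch` plays the role of a signature.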
One thing I wanted to try to code for was a table of "fuzzy" hashes that occur in each sample.
sample  fuzzyhash1  fuzzyhash2  fuzzyhash3
bat1    4           1           0
bat2    8           0           0
bat3    3           2           3
Would the fuzzyhash1 / fuzzyhash2 lists of hashes come from some sort of
clustering or grouping of hashes in the signatures?
from sourmash.
ok - I can give you reasons why it might not work, but it's worth a try!
Oh, I'd like to hear why this might not work. I've read through the Mash paper and I am still trying to wrangle with the concepts in there.
Would the fuzzyhash1 / fuzzyhash2 lists of hashes come from some sort of
clustering or grouping of hashes in the signatures?
Yes, I was imagining a clustering plus picking a representative hash (similar to 16S OTU clustering).
I am in my second week of my post-doc and I have some time to develop/use new tools. Using signatures is at the top of my list since I stumbled across sourmash. I have a few other questions that I will start another thread for.
from sourmash.
On Mon, Aug 01, 2016 at 07:30:31AM -0700, Ara Winter wrote:
ok - I can give you reasons why it might not work, but it's worth a try!
Oh, I'd like to hear why this might not work. I've read through the Mash paper and I am still trying to wrangle with the concepts in there.
Basically, the hashes in the signature give you extraordinarily sensitive
ability to detect similar species, but this falls off quickly as species
diverge. The MetaPalette paper (http://msystems.asm.org/content/1/3/e00020-16) gives some good input here with respect to k-mer sizes and species/strain divergence.
So I'd worry about moderately distant genomes being completely disjoint
in signature space.
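A back-of-the-envelope calculation shows how quickly this falls off. The sketch below assumes a deliberately simple model that is not taken from the MetaPalette paper: substitutions hit each base independently at a fixed rate, so a k-mer survives unchanged with probability (1 - d)^k. The function name `kmer_survival` is made up for this example.

```python
def kmer_survival(divergence, ksize):
    """Probability that a k-mer has no substitutions, assuming independent
    per-base substitutions at rate `divergence` (a simplifying assumption)."""
    return (1.0 - divergence) ** ksize


# Expected fraction of k-mers left intact at several divergence levels.
for d in (0.01, 0.05, 0.10, 0.20):
    row = "  ".join(f"k={k}: {kmer_survival(d, k):.3f}" for k in (21, 31, 51))
    print(f"divergence {d:.0%}  {row}")
```

Under this model, at 10% divergence fewer than 5% of 31-mers are expected to survive intact, so two signatures built from such genomes may share few or no hashes.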
Would the fuzzyhash1 / fuzzyhash2 lists of hashes come from some sort of
clustering or grouping of hashes in the signatures?
Yes, I was imagining a clustering plus picking a representative hash (similar to 16S OTU clustering).
You'd probably want to work with as many hashes as possible, for sensitivity
reasons.
I am in my second week of my post-doc and I have some time to develop/use new tools. Using signatures is at the top of my list since I stumbled across sourmash. I have a few other questions that I will start another thread for.
ok! note that the YAML signature files are easy to parse with many
languages, and the overall idea is surprisingly trivial, so you could
easily develop your own code to work with the output of sourmash -
I'd go with what you're comfortable with rather than relying too heavily
on this code :)
from sourmash.
Thanks @ctb ! I will read through the MetaPalette paper later today.
I just wrote a little Python script to parse the YAML signature files so I could start hacking away.
from sourmash.
So I'd worry about moderately distant genomes being completely disjoint
in signature space.
So if you have a decently diverse metagenome, this same issue would crop up? Does increasing the number of hashes help with this?
from sourmash.
very cool! if you want to share at some point it could be useful to others
(or you can tell me what I can provide through this project's docs to help
people like you in the future!)
Gladly! Right now it's just parsing one file. I need to fix it so it loops through all the .sig files. I am not the best at using GitHub, so what is a good way to share the notebook with you there?
Thanks again.
from sourmash.
Morning @ctb, I thought I would give the union hashes a little bump here.
What are the commands for running sourmash through Python? I saw a few .py files in the repo.
thanks!
ara
from sourmash.
Documented all of this over in the API docs a while back, closing!
https://sourmash.readthedocs.io/en/latest/api-example.html#set-operations-on-hashes
from sourmash.