
sourmash-bio / sourmash


Quickly search, compare, and analyze genomic and metagenomic data sets.

Home Page: http://sourmash.readthedocs.io/en/latest/

License: Other

Python 84.43% Shell 0.01% Makefile 0.04% TeX 0.11% Rust 14.64% C 0.59% Nix 0.17%
minhash bioinformatics rust python sourmash fracminhash scaled-minhash kmer sketching taxonomic-classification

sourmash's People

Contributors

betatim, bluegenes, brooksph, camillescott, ccbaumler, connor-reid-tiffany, ctb, dependabot-preview[bot], dependabot[bot], dkoslicki, erikyoung85, halexand, hehouts, hyphaltip, jgardner78, keyabarve, lgautier, luizirber, marisalim, mr-eyes, napsterinblue, olgabot, peterjc, pranathivemuri, pyup-bot, rialc13, ricky-lim, standage, swamidass, taylorreiter


sourmash's Issues

Separate creation of signatures from loading & comparison of signatures

It is rare that we want to update signatures after saving them, unless we're merging etc. Also, the software requirements for building and saving signatures may sometimes be quite different from the requirements for loading and comparing signatures... what's the best way to split up the functionality?

Possible inconsistency of methods of class `MinHash`

The class defines a __len__(), but the length of what is returned by get_mins is not necessarily consistent:

In [1]: import sourmash_lib._minhash

In [2]: mh = sourmash_lib._minhash.MinHash(1000, 21)

In [3]: len(mh)
Out[3]: 1000

In [4]: mh.get_mins()
Out[4]: []
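One way to make the two calls agree is to have `__len__` report the number of hashes actually stored, and to expose the capacity under a separate name. The following is a minimal sketch with a toy class; `ToyMinHash`, `num`, and `add_hash` are hypothetical names for illustration and this is not the `sourmash_lib._minhash.MinHash` implementation.

```python
import bisect


class ToyMinHash:
    """Toy bottom-n sketch; NOT the sourmash_lib._minhash.MinHash class."""

    def __init__(self, n, ksize):
        self.num = n        # capacity: maximum number of hashes to keep
        self.ksize = ksize
        self._mins = []     # retained hash values, kept sorted

    def add_hash(self, h):
        """Insert a hash value, keeping only the n smallest distinct values."""
        if h in self._mins:
            return
        bisect.insort(self._mins, h)
        del self._mins[self.num:]

    def get_mins(self):
        return list(self._mins)

    def __len__(self):
        # Report what is actually stored, so len(mh) == len(mh.get_mins()).
        return len(self._mins)
```

With this layout an empty sketch created with `n=1000` has length 0, and the capacity is still available as `mh.num`.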

Tests with py.test don't run

I draw several fatal errors when I try to run what seems like a reasonable thing here.

git clone {repo}/sourmash.git
cd sourmash
pip install . pytest matplotlib
py.test .

Here are the first few dozen lines of the error log from doing that:

(sourmash)jeremy@anjou:~/src/sourmash (master *=)$ py.test .
============================= test session starts ==============================
platform linux2 -- Python 2.7.11, pytest-2.9.2, py-1.4.31, pluggy-0.3.1
rootdir: /home/jeremy/src/sourmash, inifile: pytest.ini
collected 29 items / 1 errors 

doc/api-example.rst F
doc/api.rst s
doc/command-line.rst s
doc/index.rst s
doc/more-info.rst s
doc/requirements.rst s
sourmash_lib/__init__.py FFFFFFFF
sourmash_lib/signature.py FFFFFFFF
sourmash_lib/test_sourmash.py .FFFFFF

==================================== ERRORS ====================================
________________ ERROR collecting sourmash_lib/test__minhash.py ________________
sourmash_lib/test__minhash.py:39: in <module>
    from ._minhash import MinHash, hash_murmur
E   ImportError: No module named _minhash
=================================== FAILURES ===================================
________________________ [doctest] doc/api-example.rst _________________________
007 
008 Define two sequences:
009 
010 >>> seq1 = "ATGGCA"
011 >>> seq2 = "AGAGCA"
012 
013 Create two estimators using 3-mers, and add the sequences:
014 
015 >>> import sourmash_lib
016 >>> E1 = sourmash_lib.Estimators(n=20, ksize=3)
UNEXPECTED EXCEPTION: ImportError('cannot import name _minhash',)
Traceback (most recent call last):

  File "/usr/share/anaconda/anaconda2/envs/sourmash/lib/python2.7/doctest.py", line 1315, in __run
    compileflags, 1) in test.globs

  File "<doctest api-example.rst[3]>", line 1, in <module>

  File "/home/jeremy/src/sourmash/sourmash_lib/__init__.py", line 27, in __init__
    from . import _minhash

ImportError: cannot import name _minhash

/home/jeremy/src/sourmash/doc/api-example.rst:16: UnexpectedException
________________________________ test_jaccard_1 ________________________________

    def test_jaccard_1():
>       E1 = Estimators(n=5, ksize=20)

sourmash commandline tool should probably be an entry_point

The commandline util sourmash causes some structural grief in the git repo, since it has a name collision with the package directory. Conventional approaches would be to include it instead as sourmash_cli.py somewhere and include it as an entry point in setup.py.

Moving the commandline tool to a different name might also allow you to rename $repo/sourmash_lib to $repo/sourmash which is (at least for the pure python modules I'm familiar with) the usual approach.
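As a rough sketch of the entry-point approach, the fragment below shows the `console_scripts` declaration that setuptools accepts. The module path `sourmash_lib.__main__` and the callable name `main` are assumptions for illustration, and `split_entry_point` is a hypothetical helper that only demonstrates how the spec string is structured.

```python
# Hypothetical setup.py fragment; module path and callable name are assumptions.
ENTRY_POINTS = {
    "console_scripts": [
        # Format: "<command> = <module>:<callable>"
        "sourmash = sourmash_lib.__main__:main",
    ],
}


def split_entry_point(spec):
    """Split 'name = module:callable' into (name, module, callable)."""
    name, target = (part.strip() for part in spec.split("=", 1))
    module, func = target.split(":", 1)
    return name, module, func


# In setup.py the dict would be passed as: setup(..., entry_points=ENTRY_POINTS)
```

With a declaration like this, pip generates the `sourmash` executable at install time, so no script file named `sourmash` has to live next to the package directory.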

error: _minhash.hh: No such file or directory

❯❯❯ pip3 install sourmash
…
    running build_ext
    building 'sourmash_lib._minhash' extension
    creating build/temp.linux-x86_64-3.5
    creating build/temp.linux-x86_64-3.5/sourmash_lib
    creating build/temp.linux-x86_64-3.5/third-party
    creating build/temp.linux-x86_64-3.5/third-party/smhasher
    /gsc/btl/linuxbrew/bin/gcc-5 -pthread -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Os -w -pipe -march=core2 -fPIC -I/gsc/btl/linuxbrew/include -I/gsc/btl/linuxbrew/opt/openssl/include -I/gsc/btl/linuxbrew/opt/sqlite/include -I/gsc/btl/linuxbrew/opt/python3/include/python3.5m -c sourmash_lib/_minhash.cc -o build/temp.linux-x86_64-3.5/sourmash_lib/_minhash.o -std=c++11 -pedantic -O3
    cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
    sourmash_lib/_minhash.cc:66:23: fatal error: _minhash.hh: No such file or directory
    compilation terminated.
    error: command '/gsc/btl/linuxbrew/bin/gcc-5' failed with exit status 1

ImportMismatchError with fresh dev environment

Trying to set up a sourmash dev environment for the first time, I'm getting the following error.

python setup.py build_ext -i
running build_ext
copying build/lib.macosx-10.12-x86_64-3.5/sourmash_lib/_minhash.cpython-35m-darwin.so -> sourmash_lib
pip install '.[test]'
Processing /Users/standage/Software/sourmash
  Requirement already satisfied (use --upgrade to upgrade): sourmash==1.1 from file:///Users/standage/Software/sourmash in ./env/lib/python3.5/site-packages
Requirement already satisfied: screed>=0.9 in ./env/lib/python3.5/site-packages (from sourmash==1.1)
Requirement already satisfied: PyYAML>=3.11 in ./env/lib/python3.5/site-packages (from sourmash==1.1)
Requirement already satisfied: ijson in ./env/lib/python3.5/site-packages (from sourmash==1.1)
Requirement already satisfied: pytest in ./env/lib/python3.5/site-packages (from sourmash==1.1)
Requirement already satisfied: pytest-cov in ./env/lib/python3.5/site-packages (from sourmash==1.1)
Requirement already satisfied: numpy in ./env/lib/python3.5/site-packages (from sourmash==1.1)
Requirement already satisfied: matplotlib in ./env/lib/python3.5/site-packages (from sourmash==1.1)
Requirement already satisfied: scipy in ./env/lib/python3.5/site-packages (from sourmash==1.1)
Requirement already satisfied: bz2file in ./env/lib/python3.5/site-packages (from screed>=0.9->sourmash==1.1)
Requirement already satisfied: py>=1.4.29 in ./env/lib/python3.5/site-packages (from pytest->sourmash==1.1)
Requirement already satisfied: coverage>=3.7.1 in ./env/lib/python3.5/site-packages (from pytest-cov->sourmash==1.1)
Requirement already satisfied: pytz in ./env/lib/python3.5/site-packages (from matplotlib->sourmash==1.1)
Requirement already satisfied: pyparsing!=2.0.0,!=2.0.4,!=2.1.2,>=1.5.6 in ./env/lib/python3.5/site-packages (from matplotlib->sourmash==1.1)
Requirement already satisfied: cycler in ./env/lib/python3.5/site-packages (from matplotlib->sourmash==1.1)
Requirement already satisfied: python-dateutil in ./env/lib/python3.5/site-packages (from matplotlib->sourmash==1.1)
Requirement already satisfied: six in ./env/lib/python3.5/site-packages (from cycler->matplotlib->sourmash==1.1)
python -m pytest
============================= test session starts ==============================
platform darwin -- Python 3.5.2, pytest-3.0.5, py-1.4.32, pluggy-0.4.0
rootdir: /Users/standage/Software/sourmash, inifile: pytest.ini
plugins: cov-2.4.0
collected 0 items / 1 errors

==================================== ERRORS ====================================
______________________________ ERROR collecting  _______________________________
env/lib/python3.5/site-packages/_pytest/config.py:325: in _getconftestmodules
    return self._path2confmods[path]
E   KeyError: local('/Users/standage/Software/sourmash/env/lib/python3.5/site-packages/sourmash_lib')

During handling of the above exception, another exception occurred:
env/lib/python3.5/site-packages/_pytest/config.py:356: in _importconftest
    return self._conftestpath2mod[conftestpath]
E   KeyError: local('/Users/standage/Software/sourmash/env/lib/python3.5/site-packages/sourmash_lib/conftest.py')

During handling of the above exception, another exception occurred:
env/lib/python3.5/site-packages/_pytest/config.py:362: in _importconftest
    mod = conftestpath.pyimport()
env/lib/python3.5/site-packages/py/_path/local.py:680: in pyimport
    raise self.ImportMismatchError(modname, modfile, self)
E   py._path.local.LocalPath.ImportMismatchError: ('sourmash_lib.conftest', '/Users/standage/Software/sourmash/sourmash_lib/conftest.py', local('/Users/standage/Software/sourmash/env/lib/python3.5/site-packages/sourmash_lib/conftest.py'))

During handling of the above exception, another exception occurred:
env/lib/python3.5/site-packages/py/_path/common.py:366: in visit
    for x in Visitor(fil, rec, ignore, bf, sort).gen(self):
env/lib/python3.5/site-packages/py/_path/common.py:415: in gen
    for p in self.gen(subdir):
env/lib/python3.5/site-packages/py/_path/common.py:415: in gen
    for p in self.gen(subdir):
env/lib/python3.5/site-packages/py/_path/common.py:415: in gen
    for p in self.gen(subdir):
env/lib/python3.5/site-packages/py/_path/common.py:415: in gen
    for p in self.gen(subdir):
env/lib/python3.5/site-packages/py/_path/common.py:404: in gen
    dirs = self.optsort([p for p in entries
env/lib/python3.5/site-packages/py/_path/common.py:405: in <listcomp>
    if p.check(dir=1) and (rec is None or rec(p))])
env/lib/python3.5/site-packages/_pytest/main.py:670: in _recurse
    ihook = self.gethookproxy(path)
env/lib/python3.5/site-packages/_pytest/main.py:575: in gethookproxy
    my_conftestmodules = pm._getconftestmodules(fspath)
env/lib/python3.5/site-packages/_pytest/config.py:339: in _getconftestmodules
    mod = self._importconftest(conftestpath)
env/lib/python3.5/site-packages/_pytest/config.py:364: in _importconftest
    raise ConftestImportFailure(conftestpath, sys.exc_info())
E   _pytest.config.ConftestImportFailure: ImportMismatchError('sourmash_lib.conftest', '/Users/standage/Software/sourmash/sourmash_lib/conftest.py', local('/Users/standage/Software/sourmash/env/lib/python3.5/site-packages/sourmash_lib/conftest.py'))
!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!
=========================== 1 error in 0.46 seconds ============================

This was with a clean virtualenv, after cloning the repo, running pip install pytest and then make test.

Any ideas?

Makefile install needs to specify test extras

The following pattern should work, but draws errors:

git checkout v0.9.4
mkvirtualenv --python=/usr/bin/python2.7 sourmash-py27
make clean install test

If the install target used pip install .[test] instead, this would work. I haven't quite got the PR formatted, so I'll file the ticket here as part of my second-pass review for JOSS.

jupyter notebook for sourmash workflow

I put together a notebook for tracking workflow using sourmash here:
https://github.com/bioinfonm/bat_metagenomics_2016/blob/master/python_code/sourmash_CAVE_bat_metagenomes.ipynb

Change log:
pylab.savefig includes the bbox_inches='tight'
pylab.savefig dumps to pdf
added two additional plots which use different metadata categories
network plot using networkx

Things to do:
Instead of separate label files have a single tab separated sheet with IDs and all metadata.
Get unions of each cluster

Within signature distance?

I was wondering if it is worth looking at within-signature differences to get an idea of the diversity of hashes, e.g. using Jaccard, Hamming, or Levenshtein metrics.

I need to check my understanding of how this whole hashing thing works for a metagenome:

  1. For a given set of sequences the file is read in and we get kmers back for each sequence.
  2. Each set of kmers is hashed into a signature. Are the kmers grouped before or after hashing?
  3. Hashes from two signatures are compared and from the union we can calculate the Jaccard distance based on shared hashes.
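The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not sourmash's code: SHA-1 from the standard library stands in for the MurmurHash function, reverse complements are ignored, and the function names are hypothetical. In this sketch the k-mers of all sequences are pooled into one set of hashes before the smallest `n` are kept.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-1 is a stand-in hash here)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(seqs, k, n):
    """Pool the k-mers of all sequences, hash them, keep the n smallest hashes."""
    hashes = {hash_kmer(km) for seq in seqs for km in kmers(seq, k)}
    return sorted(hashes)[:n]


def jaccard_similarity(a, b, n):
    """Bottom-n estimate: share of the n smallest union hashes present in both."""
    sa, sb = set(a), set(b)
    smallest = sorted(sa | sb)[:n]
    if not smallest:
        return 0.0
    shared = sum(1 for h in smallest if h in sa and h in sb)
    return shared / len(smallest)


def jaccard_distance(a, b, n):
    """Distance is one minus the similarity estimate."""
    return 1.0 - jaccard_similarity(a, b, n)
```

Two sketches built from the same sequences give a distance of 0, and sketches with no hashes in common give a distance of 1.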

I hacked together a BK-Tree script from someone else's code to look at which hashes might be similar within a given distance.

Undocumented utility scripts & directories found at root

... for example, the urchin, utils, data, refseq and sigs directories.

I could imagine that these directories are any of the following:

  • test data to avoid regressions
  • useful for particular research questions that sourmash is used for
  • demo data (for the notebook demos)
They could also be some combination of the above. But as a developer or a client of the library, not knowing which imposes significant cognitive overhead.

I wonder if a separate sourmash_demo package belongs in here somewhere. If the data is useful for particular research questions, then maybe there should be add-on or supplementary packages that are bundled separately for PyPI clients or sourmash developers (depending on who the audience is).

openjournals/joss-reviews#27

sbt built with a single sequence is listed as corrupt

% ./sourmash compute -f data/GCF_000005845.2_ASM584v2_genomic.fna.gz
# running sourmash subcommand: compute
computing signatures for files: ['data/GCF_000005845.2_ASM584v2_genomic.fna.gz']
Computing signature for ksizes: [31]
Computing only DNA (and not protein) signatures.
Computing a total of 1 signatures.
Computing signature for ksizes: [31]
... reading sequences from data/GCF_000005845.2_ASM584v2_genomic.fna.gz
calculated 1 signatures for 1 sequences in data/GCF_000005845.2_ASM584v2_genomic.fna.gz

% ./sourmash sbt_index bar GCF_000005845.2_ASM584v2_genomic.fna.gz.sig
# running sourmash subcommand: sbt_index
loading 1 files into SBT
loaded 1 sigs; saving SBT under "bar".

% ./sourmash sbt_search bar GCF_000005845.2_ASM584v2_genomic.fna.gz.sig
# running sourmash subcommand: sbt_search
Traceback (most recent call last):
  File "/Users/t/dev/jup/lib/python3.5/site-packages/khmer-2.0+706.g1745464-py3.5-macosx-10.6-intel.egg/khmer/__init__.py", line 140, in extract_nodegraph_info
    "signature".format(filename) + str(signature))
ValueError: Node graph '.sbt.bar/bar.c7eda0a3534879b70046e3f66870ede2.sbt' is missing file type signatureb'[\n  '

add tests to confirm that MinHash objects are correctly created post-pickle

test_estimators.py::test_pickle only checks the values on the Estimator object, and does not confirm that the new Estimator.mh object behaves properly; we should check all the behavior, including n, k, dna, protein, track_abundance, and (assuming #83) max_hash.

An alternative and better approach might be to have accessors for all the internal MinHash foo, which I believe is being added in #79.

test.sh mentions missing `sourmash clean` subcommand

(Possible duplicate ticket? I am having connectivity problems.)
The test.sh file mentions sourmash clean; this subcommand doesn't exist.
The choices, as far as I can see, are:

  • remove invocation
  • implement clean subcommand
  • change to make clean (?)

Developer Makefile incomplete WRT sphinx, pip install

I did:

mkvirtualenv sourmash-py27
pip install -r requirements.txt # implied but not specified in docs
make clean all test

This failed because sphinx wasn't installed.

Starting fresh:

pip install -r requirements.txt sphinx
make clean all test

succeeds (if you don't have a .tox directory lying around messing things up).

I might add that all is a funny name for a build target that builds the C++ library.

These are minor concerns, but they still need to be addressed to ease developer onboarding.

test code is in unexpected places

for example:
sourmash_lib/__init__.py is an unexpected place to find tests. Do they belong here, or can they be moved to a tests/test_*.py constellation?

Similarly, the files in sourmash_lib/ include test files, which are installed by pip; this is not appropriate for clients who are not developing the sourmash library itself. Probably, all of sourmash_tst_utils.py and sourmash_lib/test_*.py belong in a $repo/tests directory or they need to be renamed.

openjournals/joss-reviews#27

ValueError: invalid DNA character in sequence: N

I encountered the error below with this command: sourmash compute --protein -k 18,21 /mnt/scratch/ljcohen/mmetsp_tmp/SRR1300520.left.fq.head. It looks like a problem with how 'N' characters are handled.

(khmer_env)[ljcohen@dev-intel14 ~]$ sourmash compute --protein -k 18,21 /mnt/scratch/ljcohen/mmetsp_tmp/SRR1300520.left.fq.head
# running sourmash subcommand: compute
computing signatures for files: ['/mnt/scratch/ljcohen/mmetsp_tmp/SRR1300520.left.fq.head']
Computing signature for ksizes: [18, 21]
... reading sequences from /mnt/scratch/ljcohen/mmetsp_tmp/SRR1300520.left.fq.head
Traceback (most recent call last):
  File "/mnt/home/ljcohen/khmer_env/bin/sourmash", line 9, in <module>
    load_entry_point('sourmash==0.9.4', 'console_scripts', 'sourmash')()
  File "/mnt/home/ljcohen/khmer_env/lib/python2.7/site-packages/sourmash_lib/__main__.py", line 338, in main
    SourmashCommands()
  File "/mnt/home/ljcohen/khmer_env/lib/python2.7/site-packages/sourmash_lib/__main__.py", line 42, in __init__
    cmd(sys.argv[2:])
  File "/mnt/home/ljcohen/khmer_env/lib/python2.7/site-packages/sourmash_lib/__main__.py", line 151, in compute
    E.add_sequence(s, args.force)
  File "/mnt/home/ljcohen/khmer_env/lib/python2.7/site-packages/sourmash_lib/__init__.py", line 65, in add_sequence
    self.mh.add_sequence(seq, force)
ValueError: invalid DNA character in sequence: N

developer/testing doc needed

@ctb is on it, but I think there need to be a few pieces of better documentation on how developers can get access to the test environment:
openjournals/joss-reviews#27

For example, #12 and #10 both suggest that there are a few additional steps to take to work as an active developer (rather than a client) of the C bindings.

Figures in `command-line` docs are missing

Figures in command line docs are not found. I can't find the .png in the source code either.

This is still true after running make; is there some other process that needs to be run, on readthedocs or on my own Ubuntu machine, so that the figures show up?

declare sphinx as 'doc' dependency

... same as #23 and #12: it's better to include these dependencies in setup.py as extras so that you can use virtualenv and pip to set up and tear down development directories.
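A minimal sketch of what such extras could look like in setup.py is shown below. The package lists are illustrative assumptions rather than the project's actual requirements, and the combined `dev` extra is a hypothetical convenience.

```python
# Hypothetical setup.py fragment; package lists are illustrative assumptions.
EXTRAS_REQUIRE = {
    "test": ["pytest", "pytest-cov"],
    "doc": ["sphinx"],
}

# Convenience extra that pulls in everything a developer might need.
EXTRAS_REQUIRE["dev"] = sorted(
    {pkg for pkgs in EXTRAS_REQUIRE.values() for pkg in pkgs}
)

# In setup.py the dict would be passed as: setup(..., extras_require=EXTRAS_REQUIRE)
```

A developer could then run `pip install '.[test]'` or `pip install '.[doc]'` inside a fresh virtualenv, and tear the environment down again without touching system packages.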

Version number reported in PyPI, master disagrees with tag

Looks like tag v0.9.4 actually installs sourmash 0.9.4rc1, same as master.

Usual approach for prereleases is to have them on a branch (possibly 'master') and tag the exact release that sets the version number to v0.9.4. Subsequent branches should update the release number to v0.9.5rc1 etc.

undefined symbol error

I've tried installing and compiling this package every which way I can, and nothing seems to get me past this error. Any help would be much appreciated!

(ENV)dd6/analysis/mash$ sourmash compute *.fa
# running sourmash subcommand: compute
computing signatures for files: ['11511_4#10.contigs_spades.fa', '11511_4#11.contigs_spades.fa']
Computing signature for ksizes: [31]
Traceback (most recent call last):
  File "/lustre/scratch108/bacteria/dd6/cholera/ENV/bin/sourmash", line 339, in <module>
    main()
  File "/lustre/scratch108/bacteria/dd6/cholera/ENV/bin/sourmash", line 336, in main
    SourmashCommands()
  File "/lustre/scratch108/bacteria/dd6/cholera/ENV/bin/sourmash", line 43, in __init__
    cmd(sys.argv[2:])
  File "/lustre/scratch108/bacteria/dd6/cholera/ENV/bin/sourmash", line 138, in compute
    protein=args.protein)
  File "/lustre/scratch108/bacteria/dd6/cholera/ENV/local/lib/python2.7/site-packages/sourmash_lib/__init__.py", line 27, in __init__
    from . import _minhash
ImportError: /lustre/scratch108/bacteria/dd6/cholera/ENV/local/lib/python2.7/site-packages/sourmash_lib/_minhash.so: undefined symbol: _ZSt24__throw_out_of_range_fmtPKcz

sourmash YAML for database

@ctb I remember reading your blog about one use for sourmash and YAML would be to connect different researchers together.

I was wondering if such a database could help both to connect researchers and to allow searching against a bunch of metagenomes.

I currently have over 100 bat metagenomes from New Mexico and Arizona. I have completed most of the .sig files for them. I would like to expand the metadata categories in the YAML files to include things like research group and contact, and maybe borrow some metadata categories from MGRAST. I would like to add a link to the mash refseq70 data for each one as well, and eventually roll these into an online database.

I was wondering if GitHub could be leveraged for this kind of database, with a live Jupyter notebook as the interface?

ValueError when ambiguous nucleotide

I'm running sourmash compute on a number of read files and I get this error. Obviously one of the reads contains an N.

Traceback (most recent call last):
  File "/home/cts/local/python34-virtualenv/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/cts/local/python34-virtualenv/lib/python3.4/site-packages/sourmash_lib/__main__.py", line 338, in main
    SourmashCommands()
  File "/home/cts/local/python34-virtualenv/lib/python3.4/site-packages/sourmash_lib/__main__.py", line 42, in __init__
    cmd(sys.argv[2:])
  File "/home/cts/local/python34-virtualenv/lib/python3.4/site-packages/sourmash_lib/__main__.py", line 151, in compute
    E.add_sequence(s, args.force)
  File "/home/cts/local/python34-virtualenv/lib/python3.4/site-packages/sourmash_lib/__init__.py", line 65, in add_sequence
    self.mh.add_sequence(seq, force)
ValueError: invalid DNA character in sequence: N

Is there a reason for throwing an error rather than simply dropping this k-mer? I can get into the code to make changes, but I wanted to know first whether this would affect the sketch too much.
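The "drop the k-mer" alternative asked about here could look roughly like the sketch below. It is a pure-Python illustration with a hypothetical function name, not the behavior of `add_sequence` in sourmash, and it does not answer how much skipping would change the resulting sketch.

```python
VALID_DNA = frozenset("ACGT")


def valid_kmers(seq, k):
    """Yield the k-mers of seq, skipping any that contain a non-ACGT character."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if VALID_DNA.issuperset(kmer):
            yield kmer
```

A single N at position i removes at most k k-mers (those overlapping that position), so for reads with only occasional ambiguous bases most k-mers are retained.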

Compiler woes

I am sadly running RHEL6, which has g++ 4.4.7 and does not accept the -std=c++11 flag. Is there any way to install without upgrading the compiler on the server?

  Running setup.py bdist_wheel for sourmash ... error
  Complete output from command /home/cts/local/python34-virtualenv/bin/python -u -c "import setuptools, tokenize;__file__='/export/data1/tmp/pip-build-uf91zm09/sourmash/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /export/data1/tmp/tmpk2syou66pip-wheel- --python-tag cp34:
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.4
  creating build/lib.linux-x86_64-3.4/sourmash_lib
  copying sourmash_lib/__init__.py -> build/lib.linux-x86_64-3.4/sourmash_lib
  copying sourmash_lib/fig.py -> build/lib.linux-x86_64-3.4/sourmash_lib
  copying sourmash_lib/signature.py -> build/lib.linux-x86_64-3.4/sourmash_lib
  copying sourmash_lib/sourmash_tst_utils.py -> build/lib.linux-x86_64-3.4/sourmash_lib
  copying sourmash_lib/test__minhash.py -> build/lib.linux-x86_64-3.4/sourmash_lib
  copying sourmash_lib/test_sourmash.py -> build/lib.linux-x86_64-3.4/sourmash_lib
  running build_ext
  building 'sourmash_lib._minhash' extension
  creating build/temp.linux-x86_64-3.4
  creating build/temp.linux-x86_64-3.4/sourmash_lib
  creating build/temp.linux-x86_64-3.4/third-party
  creating build/temp.linux-x86_64-3.4/third-party/smhasher
  gcc -pthread -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/python34/include/python3.4m -c sourmash_lib/_minhash.cc -o build/temp.linux-x86_64-3.4/sourmash_lib/_minhash.o -std=c++11 -pedantic -O3
  cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for Ada/C/ObjC but not for C++ [enabled by default]
  cc1plus: error: unrecognized command line option ‘-std=c++11’
  error: command 'gcc' failed with exit status 1

sourmash with bat metagenomes

I thought this might be of interest to folks using sourmash. I am sitting on 80+ metagenomes from bats. We swabbed their external surfaces (i.e., furred surfaces, ears, wings and uropatagia). I ran sourmash on a subset from one field area.

[Figure: sourmash_cave_bats_bb_location]

We see the bats grouping by species, but the Myotis velifer samples are split into three groups. Each group corresponds to a different geographic region within the sampling area. Black text marks cave-caught bats and green text marks surface-netted bats.

What are the right command-line options for computing both protein and DNA?

Right now, --protein switches 'sourmash compute' over to computing only protein MinHashes. It would be nice to read the sequence once, and compute both protein and DNA signatures at the same time.

The functionality to store the protein/DNA signatures in a single .sig file is there, although it's a bit untested.

First thought -- provide a '--dna' boolean flag, and change the '--protein' flag to be bool as well.
Then change '--protein' to no longer "turn off" DNA sketch computation.

One other thought, in the command line tool we should be really clear about how many signatures we're computing!
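A minimal argparse sketch of the proposed flags follows. The `--no-dna` and `--no-protein` spellings, the defaults, and the `count_signatures` helper are assumptions for illustration; this is not the actual `sourmash compute` option handling.

```python
import argparse


def build_parser():
    """Build a parser where --dna and --protein are independent booleans."""
    parser = argparse.ArgumentParser(prog="sourmash compute")
    parser.add_argument("filenames", nargs="+")
    parser.add_argument("-k", "--ksizes", default="31",
                        help="comma-separated list of k-mer sizes")
    parser.add_argument("--dna", dest="dna", action="store_true")
    parser.add_argument("--no-dna", dest="dna", action="store_false")
    parser.add_argument("--protein", dest="protein", action="store_true")
    parser.add_argument("--no-protein", dest="protein", action="store_false")
    # Assumed defaults: DNA on, protein off; --protein no longer turns DNA off.
    parser.set_defaults(dna=True, protein=False)
    return parser


def count_signatures(args):
    """Number of signatures: files x k-mer sizes x selected molecule types."""
    ksizes = [int(k) for k in args.ksizes.split(",")]
    moltypes = int(args.dna) + int(args.protein)
    return len(args.filenames) * len(ksizes) * moltypes
```

Reporting the value of `count_signatures(args)` up front would address the last point about being clear on how many signatures are computed.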

importerror _minhash.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZSt24__throw_out_of_range_fmtPKcz

Hello, dev teams.

I have tried to install the package under Linux Mint 18 Sarah, both via pip and by following all the instructions from here. The installation finished without error, but when I run it, I get the following error:

from ._minhash import MinHash
ImportError: /home/username/sourmash/sourmash_lib/_minhash.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZSt24__throw_out_of_range_fmtPKcz

It seems a solution is needed. Thanks.

Thank you very much.

Identity of minhash driving the clustering?

Hello,

Hopefully I am using the right words here. Is there a way to find out which minhashes are driving each split in the cluster (or the larger groups)? I went through the Hash website and I think I understand how each hash is being created.

Ideally I'd like to grab a set of hashes that drives the split in the cluster and then run those against the RefSeq.msh to see what they are.
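One rough way to approximate "hashes driving a split" is with plain set operations: take the hashes shared by every sample on one side of the split and remove any hash seen on the other side. The function below is a hypothetical sketch that assumes each signature is available as a collection of integer hash values; it is not an existing sourmash feature.

```python
def distinguishing_hashes(group_a, group_b):
    """Hashes present in every sketch of group_a and in no sketch of group_b."""
    if not group_a:
        return set()
    core_a = set.intersection(*(set(s) for s in group_a))
    seen_b = set().union(*(set(s) for s in group_b))
    return core_a - seen_b
```

The resulting hash values are only candidates; mapping them back to k-mers or reference genomes would need the original sequences or a database lookup.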

thanks,
ara

Revisit names and hierarchy

A few things for @lgautier and @luizirber to weigh in on --

  • the sourmash_lib.signature module is badly named; I keep on colliding names with sig, etc. I was thinking of renaming it sigutils or something.

  • the Estimators class is badly named. It's a legacy class anyway, held over from when MinHash didn't exist. It's still a convenient wrapper around the CPython MinHash class, at least for now, since pure Python is much easier to write, change, and test than new C code. One approach might be to deprecate the direct use of MinHash and wrap it more tightly in Estimators, and then rename Estimators to MinHash. Or if that's too confusing, MinHashWrapper. Thoughts?

  • regardless of what we do with the names, Estimators should be moved out of __init__.py.

Not hugely urgent but if you have super strong opinions, now is the time :)
