sourmash_plugin_branchwater

tl;dr Run faster, lower-memory sourmash operations via this plugin.

Details

sourmash is a command-line tool and Python/Rust library for metagenome analysis and genome comparison using k-mers. While sourmash is fast and memory-efficient, sourmash v4 and earlier run single-threaded and rely on Python containers.

The branchwater plugin for sourmash (this plugin!) provides faster and lower-memory implementations of several important sourmash features - sketching, searching, and gather (metagenome decomposition). It does so by implementing higher-level functions in Rust on top of the core Rust library of sourmash. As a result it provides some of the same functionality as sourmash, but 10-100x faster and with 10x lower memory use.

This code is still in prototype mode, and does not have all of the features of sourmash. As we add features we will move it back into the core sourmash code base; eventually, much of the code in this repository will be integrated into sourmash directly.

If you're intrigued but not sure where to start with this plugin, we suggest first identifying what sourmash functionality you need to run to accomplish your goals. Once you have your sourmash commands working, revisit these docs and see if there is a faster implementation available in this plugin!

This repo originated as a PyO3-based Python wrapper around the core branchwater code. Branchwater is a fast, low-memory and multithreaded application for searching very large collections of FracMinHash sketches as generated by sourmash.

For technical details, see the Rust code in src/ and Python wrapper in src/python/.

Documentation

There is a quickstart below, as well as more user documentation here. Nascent developer docs are also available!

Quickstart: multisearch

This quickstart demonstrates multisearch using the 64 genomes from Awad et al., 2017.

1. Install the branchwater plugin

On Linux and Mac OS X, you can install the latest release of the branchwater plugin from conda-forge:

conda install sourmash_plugin_branchwater

Please see the developer docs for information on installing the latest development version.

2. Download sketches.

The following command downloads sourmash sketches for the podar genomes into the file podar-ref.zip:

curl -L https://osf.io/4t6cq/download -o podar-ref.zip

3. Execute!

Now run multisearch to search all the sketches against each other:

sourmash scripts multisearch podar-ref.zip podar-ref.zip -o results.csv --cores 4

You will (hopefully ;)) see a set of results in results.csv. These are comparisons of each query against all matching genomes.
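
If you want to inspect the results programmatically, a quick pandas sketch along these lines should work; the exact column names come from the plugin's CSV output and may vary between versions:

import pandas as pd

results = pd.read_csv("results.csv")
# each row is one query/match comparison; sort to see the strongest
# containment matches first
print(results.sort_values("containment", ascending=False).head())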

Debugging help

If your collections aren't loading properly, try running sourmash sig summarize on them, like so:

sourmash sig summarize podar-ref.zip

If this doesn't work, then you're running into problems creating the collection. Please ask for help on the sourmash issue tracker!
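
You can also poke at the collection from Python via the sourmash API; here's a quick sketch using load_file_as_index to list each sketch's parameters:

import sourmash

# load the zip collection and print basic parameters for each sketch
idx = sourmash.load_file_as_index("podar-ref.zip")
for sig in idx.signatures():
    print(sig.name, sig.minhash.ksize, sig.minhash.scaled)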

Code of Conduct

This project is under the sourmash Code of Conduct.

License

This software is under the AGPL license. Please see LICENSE.txt.

Authors

  • Luiz Irber
  • C. Titus Brown
  • Mohamed Abuelanin
  • N. Tessa Pierce-Ward

sourmash_plugin_branchwater's Issues

benchmarks against GTDB 65k

metagenome against 65k genomic reps

impl             time     ram     notes
manygather       1m 4s    173 MB  32 threads
sourmash gather  14m 48s  4.3 GB

genomes against 65k genomic reps

impl             time     memory   notes
sourmash search  12m 43s  3.67 GB  single genome x 65k
manysearch       36s      139 MB   5 genomes x 65k; 32 threads

complete benchmark info here

benchmarks - SRR606249 x GTDB RS214

4 minutes & 2.2 GB of RAM for fastgather against all of GTDB.
40 seconds and 1.1 GB of RAM for sourmash gather using those results as a picklist.

(how many threads am I using here!? 😆)

appendix

fastgather first

% /usr/bin/time -v sourmash scripts fastgather SRR606249.trim.sig.gz list.gtdb-rs214-k21.txt -o xxx.csv -k 21

== This is sourmash version 4.8.3.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

gathering all sketches in 'SRR606249.trim.sig.gz' against 'list.gtdb-rs214-k21.txt'
Loading query from 'SRR606249.trim.sig.gz'
Loading matchlist from 'list.gtdb-rs214-k21.txt'
Loaded 402709 sig paths in matchlist
using threshold overlap: 100 100000
...
...fastgather is done! gather results in 'xxx.csv'
        Command being timed: "sourmash scripts fastgather SRR606249.trim.sig.gz list.gtdb-rs214-k21.txt -o xxx.csv -k 21"
        User time (seconds): 6616.13
        System time (seconds): 41.06
        Percent of CPU this job got: 2793%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 3:58.34
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2205448
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 3
        Minor (reclaiming a frame) page faults: 3216006
        Voluntary context switches: 823871
        Involuntary context switches: 867030
        Swaps: 0
        File system inputs: 0
        File system outputs: 24
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

gather with a picklist second

/usr/bin/time -v sourmash gather SRR606249.trim.sig.gz /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-k21.zip --picklist xxx.csv:match:ident
...
        Command being timed: "sourmash gather SRR606249.trim.sig.gz /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-k21.zip --picklist xxx.csv:match:ident"
        User time (seconds): 39.86
        System time (seconds): 4.15
        Percent of CPU this job got: 107%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:41.11
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1105064
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 882
        Minor (reclaiming a frame) page faults: 689018
        Voluntary context switches: 3635
        Involuntary context switches: 281
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

names & organization questions re mastiff interface functions

in #58, @bluegenes is adding CLI extension code to this plugin to take advantage of @luizirber's wicked fast inverted index code, codename: mastiff (see also sourmash-bio/sourmash#2230). This PR adds four new plugin commands - for now, index, check, search, and gather - joining manysearch, fastgather, and fastmultigather. That's a lot!

@bluegenes and I talked on Friday, and I think we both agree that eventually these commands will become part of sourmash. For now, we are using the branchwater plugin as a way to (a) learn rust (b) expose functionality that we want & need without having to think about backwards compat (c) explore and have fun.

the question is, how do we manage things in this plugin?

  • do we just add all the mastiff functions? this adds a lot of complexity to this plugin
  • and/or do we want to split things out into another plugin? my hot take is probably not; I expect the utility code will be shared between the CLI-accessible functions.
  • and/or do we want to have some code that's not part of sourmash scripts? tougher question. it's easy to provide other CLI entry points that aren't sourmash plugins.
  • and/or do we want to define some kind of rough naming scheme for sourmash plugins? I'm not even sure what that would be. what, put fast in front of everything? or pyo3? 😆

Some other thoughts:

  • re stability, I don't think we should worry overmuch about semantic versioning for CLI stuff on this plugin - it's probably ok to move fast and break things, as long as there's a reason. People can always pin a specific version, so we should just do regular releases and make sure the code works properly.
  • we need to do both benchmarking and validation ref #68
  • code re-org is good #62, maybe #67

CI fails on rocksdb

over in #58, we get:

 error: failed to run custom build command for `librocksdb-sys v0.11.0+8.1.1`

Caused by:
  process didn't exit successfully: `/home/runner/work/pyo3_branchwater/pyo3_branchwater/target/release/build/librocksdb-sys-07a244c34ecb3d24/build-script-build` (exit status: 101)
  --- stderr
  thread 'main' panicked at 'Unable to find libclang: "couldn't find any valid shared libraries matching: ['libclang.so', 'libclang-*.so', 'libclang.so.*', 

I think this can be dealt with by fixing #20, where we will (probably) switch to running stuff inside of a conda environment.

invest in benchmarking and validation studies

benchmarking: https://github.com/dib-lab/2022-branchwater-benchmarking is great, but a bit too heavyweight. (note that dib-lab/2022-branchwater-benchmarking#9 may be revealing significant performance degradation with my code, so :guilty look:) it would be good to invest in smaller/faster benchmarks.

need to get flamegraphs going in rust, too.


in re validation, I think we need some larger-scale code validation than what is being done in the tests right now, which mostly focus on podar/podar-ref (SRR606249 etc.). I think at the very least some true positives and maybe a true negative test, on a larger scale, would be good. I was thinking of using some of the results from @jessicalumian's paper Biogeographic Distribution of Five Antarctic Cyanobacteria Using Large-Scale k-mer Searching with sourmash branchwater.

benchmark - searching against Genbank cover at 100k

searching SRR606249 (podar) against genbank cover at scaled=100k

"../2022-pymagsearch/multigather.py SRR606249.scaled=10k.sig.gz.list.txt gtdb+genbank.sigs.d.list.txt --scaled=100000"
        User time (seconds): 14.36
        System time (seconds): 10.72
        Percent of CPU this job got: 32%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:16.50
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 160420
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 40499
        Voluntary context switches: 414468
        Involuntary context switches: 3335
        Swaps: 0
        File system inputs: 0
        File system outputs: 136
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

consider changing `manysearch` to load database into memory

the search function loads all the query sketches into memory at once, but then iterates over the list of "against" sketches and re-loads them, potentially many times.

if anything this seems backwards? but honestly, given the low memory requirements, we should probably just load everything into memory - see the sketch below.
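
For illustration only, here is a minimal Python sketch of the two strategies using the public sourmash API; the plugin's actual implementation is in Rust, and the path lists (plus compatible ksize/scaled across sketches) are assumed:

import sourmash

def search_streaming(query_paths, against_paths):
    # roughly the current behavior: queries held in memory, while the
    # "against" sketches are read from disk inside the loop and so get
    # re-loaded on every repeated pass
    queries = [next(iter(sourmash.load_file_as_signatures(p))) for p in query_paths]
    for path in against_paths:
        for subject in sourmash.load_file_as_signatures(path):
            for q in queries:
                q.minhash.contained_by(subject.minhash)

def search_in_memory(query_paths, against_paths):
    # the proposed alternative: load both sides up front; FracMinHash
    # sketches are small, so the extra memory cost should be modest
    queries = [next(iter(sourmash.load_file_as_signatures(p))) for p in query_paths]
    subjects = [s for p in against_paths for s in sourmash.load_file_as_signatures(p)]
    for q in queries:
        for s in subjects:
            q.minhash.contained_by(s.minhash)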

add threaded compare

no reason why we can't also have sourmash compare running in n^2 time, but with threads :)
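
As a rough illustration, here's what a threaded all-by-all compare could look like in Python; a real implementation would live in Rust with rayon, which avoids the GIL limits of a Python thread pool, and compatible ksize/scaled across sketches is assumed:

import itertools
from concurrent.futures import ThreadPoolExecutor

import sourmash

def threaded_compare(sig_zip, cores=4):
    # compute Jaccard for every pair of sketches in a collection
    sigs = list(sourmash.load_file_as_signatures(sig_zip))
    pairs = list(itertools.combinations(range(len(sigs)), 2))

    def jaccard(ij):
        i, j = ij
        return i, j, sigs[i].minhash.jaccard(sigs[j].minhash)

    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(jaccard, pairs))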

add jaccard back into manysearch

In #52 / fixing #25, I added Jaccard calculation/output into manysearch.

this ended up costing a lot of compute time, because my approach was to merge the minhashes and then calculate their length; see #71.

so, in #72 / fixing #71, I removed Jaccard.

But I think we can put it back in:

Upon reflection, we don't need to merge the two minhashes - this is something that (I think) is needed for handling num MinHash and is not needed for FracMinHash. For FracMinHash, we can use num_common_hashes / (size_A + size_B - num_common_hashes) which should be cheap.
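
To make the formula concrete, here's a pure-Python illustration (not the plugin's Rust code) over two collections of hash values:

def fracminhash_jaccard(hashes_a, hashes_b):
    # for two FracMinHash sketches at the same scaled value, Jaccard is
    # |A & B| / |A | B| == common / (|A| + |B| - common), no merge needed
    num_common = len(set(hashes_a) & set(hashes_b))
    return num_common / (len(hashes_a) + len(hashes_b) - num_common)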

more benchmark stuff - searching 2.3m wort isolates

% export RAYON_NUM_THREADS=32; /usr/bin/time -v ../magsearch/bin/searcher query.list.txt sra-isolates-guess-2022-nov-03.list.txt -o matches2.csv -t 0.1 -k 31 -s 2000
...
        Command being timed: "../magsearch/bin/searcher query.list.txt sra-isolates-guess-2022-nov-03.list.txt -o matches2.csv -t 0.1 -k 31 -s 2000"
        User time (seconds): 665402.66
        System time (seconds): 4591.23
        Percent of CPU this job got: 2652%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 7:00:58
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1443436
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 14
        Minor (reclaiming a frame) page faults: 902693107
        Voluntary context switches: 10692901
        Involuntary context switches: 111937953
        Swaps: 0
        File system inputs: 4693214040
        File system outputs: 3896
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 101

some prefetch/gather benchmarks

A comparison of a Rust counter-gather implementation (pymagsearch) with sourmash gather from sourmash v4.5.0.

table hackmd for nice editing

command                   wall time  max RSS   CPU time   thread efficiency
sourmash gather           9:07.73    2,661 MB  489.02 s   0.89
pymagsearch develop mode  5:17.82    113 MB    8735.05 s  27.6x
pymagsearch release mode  1:28.76    107 MB    298.79 s   3x

ht Rob Patro for suggesting release mode 😆

pymagsearch gather.py with 32 threads -

built with default develop flags.

% /usr/bin/time -v ./gather.py SRR606249.k31.sig.gz gtdb-rs207-k31.list.txt  --output-prefetch p.csv --output-gather g.csv
...
Command being timed: "./gather.py SRR606249.k31.sig.gz gtdb-rs207-k31.list.txt --output-prefetch p.csv --output-gather g.csv"
        User time (seconds): 8735.05
        System time (seconds): 15.75
        Percent of CPU this job got: 2753%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 5:17.82
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 113036
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 35
        Minor (reclaiming a frame) page faults: 310501
        Voluntary context switches: 142072
        Involuntary context switches: 961049
        Swaps: 0
        File system inputs: 8456
        File system outputs: 72
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

sourmash gather with 1 thread

% /usr/bin/time -v sourmash gather SRR606249.k31.sig.gz /group/ctbrowngrp/sourmash-db/gtdb-rs207/gtdb-rs207.genomic-reps.dna.k31.zip --save-prefetch-csv p2.csv -o g2.csv
...
        Command being timed: "sourmash gather SRR606249.k31.sig.gz /group/ctbrowngrp/sourmash-db/gtdb-rs207/gtdb-rs207.genomic-reps.dna.k31.zip --save-prefetch-csv p2.csv -o g2.csv"
        User time (seconds): 489.02
        System time (seconds): 4.30
        Percent of CPU this job got: 90%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 9:07.73
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2661284
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 7
        Minor (reclaiming a frame) page faults: 702574
        Voluntary context switches: 2058
        Involuntary context switches: 749858
        Swaps: 0
        File system inputs: 3547408
        File system outputs: 544
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

pymagsearch gather.py with 32 threads -

built with maturin develop --release using suggestions from Rob

removing GCF_014217355.1 Fusobacterium hwasookii strain=KCOM 1249, ASM1421735v1
        Command being timed: "./gather.py SRR606249.k31.sig.gz gtdb-rs207-k31.list.txt --output-prefetch p.csv --output-gather g.csv"
        User time (seconds): 298.79
        System time (seconds): 9.04
        Percent of CPU this job got: 346%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:28.76
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 107128
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 128932
        Voluntary context switches: 242522
        Involuntary context switches: 27046
        Swaps: 0
        File system inputs: 7146392
        File system outputs: 72
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Pretty print the manysearch result

I wanted to share a function that takes the manysearch output (as of version 0.6.0) and writes two CSV files, for Jaccard and containment. The first column is the query, the last column is the best_match, and the columns in between are all targets.

import pandas as pd

def per_query_table(manysearch_output_csv, output_prefix):
    df = pd.read_csv(manysearch_output_csv)

    unique_queries = df['query_name'].unique()
    unique_matches = df['match_name'].unique()

    # build one query-by-match matrix per metric
    containment_df = pd.DataFrame(index=unique_queries, columns=unique_matches)
    jaccard_df = pd.DataFrame(index=unique_queries, columns=unique_matches)

    for query in unique_queries:
        for match in unique_matches:
            subset = df[(df['query_name'] == query) & (df['match_name'] == match)]

            if not subset.empty:
                # take the first reported value for this query/match pair
                containment_df.at[query, match] = subset['containment'].iloc[0]
                jaccard_df.at[query, match] = subset['jaccard'].iloc[0]

    containment_df = containment_df.astype(float)
    jaccard_df = jaccard_df.astype(float)

    # add best_match column: the target with the highest value in each row
    for output_df in [containment_df, jaccard_df]:
        output_df.reset_index(inplace=True)
        output_df.rename(columns={"index": "query_name"}, inplace=True)
        output_df['best_match'] = output_df.iloc[:, 1:].idxmax(axis=1)

    containment_df.to_csv(f"{output_prefix}_containment.csv", index=False)
    jaccard_df.to_csv(f"{output_prefix}_jaccard.csv", index=False)
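
For example, with hypothetical file names:

per_query_table("manysearch_results.csv", "podar")
# writes podar_containment.csv and podar_jaccard.csv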

deal with identical `{basename}` somehow, in `fastmultigather`

deal with this warning from the docs in #21 -

Warning: At the moment, if two different queries have the same {basename}, the CSVs for one will be overwritten. The behavior here is undefined in practice, because this is multithreaded code and we don't know when queries will be executed or which files will be written first.
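
Until this is resolved, duplicate basenames can at least be detected up front; a small, hypothetical pre-flight check in Python:

import os
from collections import Counter

def find_basename_collisions(query_paths):
    # fastmultigather writes one CSV per query {basename}, so duplicate
    # basenames in the query list would silently overwrite each other
    counts = Counter(os.path.basename(p) for p in query_paths)
    return {name: n for name, n in counts.items() if n > 1}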

fastmultigather benchmarks: swine x gtdb reps

This took 6 hours to run when using fastgather + sourmash gather/picklists, per https://github.com/ctb/2023-swine-usda/.

Uses code in #21

here? 36 minutes and 4 GB of RAM. 🎉

Command being timed: "sourmash scripts fastmultigather list.swine-x-reps.txt list.gtdb-reps-rs214-k21.txt --scaled=10000 -k 21"
        User time (seconds): 69820.00
        System time (seconds): 46.70
        Percent of CPU this job got: 3189%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 36:30.68
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4177944
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 18
        Minor (reclaiming a frame) page faults: 16907954
        Voluntary context switches: 889867
        Involuntary context switches: 1281448
        Swaps: 0
        File system inputs: 0
        File system outputs: 45544
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

reorganize repo

clean up src/ - make it like:

src/python
src/<something>

also make sure there are no lingering .py scripts => just the plugin, ma'am. I think this means simplifying the python package substantially: removing __main__, etc.

some multigather benchmarking numbers

Evaluating multigather.py code in #3.

RAYON_NUM_THREADS was set to 32 for all of the below.

single query (SRR606249)

% /usr/bin/time -v ./multigather.py query.txt gtdb-rs207-k31.list.txt 
...
"./multigather.py query.txt gtdb-rs207-k31.list.txt"
        User time (seconds): 245.24
        System time (seconds): 8.96
        Percent of CPU this job got: 477%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:53.21
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3445820
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1486273
        Voluntary context switches: 141565
        Involuntary context switches: 14126
        Swaps: 0
        File system inputs: 0
        File system outputs: 152
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

five queries (5x SRR606249)

% /usr/bin/time -v ./multigather.py query5.txt gtdb-rs207-k31.list.txt 
...
        Command being timed: "./multigather.py query5.txt gtdb-rs207-k31.list.txt"
        User time (seconds): 603.29
        System time (seconds): 7.69
        Percent of CPU this job got: 907%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:07.31
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3681684
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1553319
        Voluntary context switches: 163288
        Involuntary context switches: 48974
        Swaps: 0
        File system inputs: 0
        File system outputs: 760
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

20 random queries from wort

        Command being timed: "./multigather.py wort-a-20.txt gtdb-rs207-k31.list.txt"   
        User time (seconds): 817.36
        System time (seconds): 9.33
        Percent of CPU this job got: 1328%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:02.20
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3774000
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1754234
        Voluntary context switches: 205870
        Involuntary context switches: 70223
        Swaps: 0
        File system inputs: 124560
        File system outputs: 1360
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

100 random queries from wort

Command being timed: "./multigather.py wort-a-100.txt gtdb-rs207-k31.list.txt"  
        User time (seconds): 19878.41
        System time (seconds): 33.23
        Percent of CPU this job got: 1085%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 30:34.56
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 5964248
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 8546828
        Voluntary context switches: 691194
        Involuntary context switches: 1950183
        Swaps: 0
        File system inputs: 1229456
        File system outputs: 12304
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

naming convention for new databases?

#58 introduces mastiff-style rocksdb databases, which are stored as a folder with many internal files.

We don't need a file extension at all, but do we want one as a visual cue for ourselves? I settled on using .rdb in my testing, but totally open to others. Maybe even just .db?

@luizirber @ctb

multigather on wort isolate sigs

I ran multigather on the 100k smallest signatures in wort; took 34m.

Command being timed: "../2022-pymagsearch/multigather.py ls-wort.txt.sorted-size.100k-smallest.catalog gtdb-rs207-k31.list.txt"
    User time (seconds): 65460.14
    System time (seconds): 71.04
    Percent of CPU this job got: 3154%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 34:37.54
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3413460
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 11
    Minor (reclaiming a frame) page faults: 8423131
    Voluntary context switches: 357369
    Involuntary context switches: 6475665
    Swaps: 0
    File system inputs: 2880
    File system outputs: 44656
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

Multithreading is not working

In manysearch, and probably the rest of the commands, multithreading is not working. Even when I hard-code the number of cores, it only uses 2; I noticed this with htop.
