sourmash_plugin_branchwater

tl;dr Run faster, lower-memory sourmash operations via this plugin.

Details

sourmash is a command-line tool and Python/Rust library for metagenome analysis and genome comparison using k-mers. While sourmash is fast and memory-efficient, sourmash v4 and earlier run single-threaded and rely on Python containers.

The branchwater plugin for sourmash (this plugin!) provides faster and lower-memory implementations of several important sourmash features - sketching, searching, and gather (metagenome decomposition). It does so by implementing higher-level functions in Rust on top of the core Rust library of sourmash. As a result it provides some of the same functionality as sourmash, but 10-100x faster and with 10x lower memory use.

This code is still in prototype mode, and does not have all of the features of sourmash. As we add features we will move it back into the core sourmash code base; eventually, much of the code in this repository will be integrated into sourmash directly.

If you're intrigued but not sure where to start with this plugin, we suggest first identifying what sourmash functionality you need to run to accomplish your goals. Once you have your sourmash commands working, revisit these docs and see if there is a faster implementation available in this plugin!

This repo originated as a PyO3-based Python wrapper around the core branchwater code. Branchwater is a fast, low-memory and multithreaded application for searching very large collections of FracMinHash sketches as generated by sourmash.

For technical details, see the Rust code in src/ and Python wrapper in src/python/.

Documentation

There is a quickstart below, as well as more user documentation here. Nascent developer docs are also available!

Quickstart: multisearch

This quickstart demonstrates multisearch using the 64 genomes from Awad et al., 2017.

1. Install the branchwater plugin

On Linux and Mac OS X, you can install the latest release of the branchwater plugin from conda-forge:

conda install sourmash_plugin_branchwater

Please see the developer docs for information on installing the latest development version.

2. Download sketches.

The following command downloads sourmash sketches for the podar genomes into the file podar-ref.zip:

curl -L https://osf.io/4t6cq/download -o podar-ref.zip

3. Execute!

Now run multisearch to search all the sketches against each other:

sourmash scripts multisearch podar-ref.zip podar-ref.zip -o results.csv --cores 4

You will (hopefully ;)) see a set of results in results.csv. These are comparisons of each query against all matching genomes.
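
If you want to inspect the results programmatically, a quick pandas sketch along these lines should work; the exact column names come from the plugin's CSV output and may vary between versions:

import pandas as pd

results = pd.read_csv("results.csv")
# each row is one query/match comparison; sort to see the strongest
# containment matches first
print(results.sort_values("containment", ascending=False).head())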

Debugging help

If your collections aren't loading properly, try running sourmash sig summarize on them, like so:

sourmash sig summarize podar-ref.zip

If this doesn't work, then you're running into problems creating the collection. Please ask for help on the sourmash issue tracker!
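
You can also poke at the collection from Python via the sourmash API; here's a quick sketch using load_file_as_index to list each sketch's parameters:

import sourmash

# load the zip collection and print basic parameters for each sketch
idx = sourmash.load_file_as_index("podar-ref.zip")
for sig in idx.signatures():
    print(sig.name, sig.minhash.ksize, sig.minhash.scaled)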

Code of Conduct

This project is under the sourmash Code of Conduct.

License

This software is under the AGPL license. Please see LICENSE.txt.

Authors

  • Luiz Irber
  • C. Titus Brown
  • Mohamed Abuelanin
  • N. Tessa Pierce-Ward

sourmash_plugin_branchwater's Issues

benchmarks against GTDB 65k

metagenome against 65k genomic reps

impl             time     ram     notes
manygather       1m 4s    173 MB  32 threads
sourmash gather  14m 48s  4.3 GB

genomes against 65k genomic reps

impl             time     memory   notes
sourmash search  12m 43s  3.67 GB  single genome x 65k
manysearch       36s      139 MB   5 genomes x 65k; 32 threads

complete benchmark info here

benchmarks - SRR606249 x GTDB RS214

4 minutes & 2.2 GB of RAM for fastgather against all of GTDB.
40 seconds and 1.1 GB of RAM for sourmash gather using those results as a picklist.

(how many threads am I using here!? 😆)

appendix

fastgather first

% /usr/bin/time -v sourmash scripts fastgather SRR606249.trim.sig.gz list.gtdb-rs214-k21.txt -o xxx.csv -k 21

== This is sourmash version 4.8.3.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

gathering all sketches in 'SRR606249.trim.sig.gz' against 'list.gtdb-rs214-k21.txt'
Loading query from 'SRR606249.trim.sig.gz'
Loading matchlist from 'list.gtdb-rs214-k21.txt'
Loaded 402709 sig paths in matchlist
using threshold overlap: 100 100000
...
...fastgather is done! gather results in 'xxx.csv'
        Command being timed: "sourmash scripts fastgather SRR606249.trim.sig.gz list.gtdb-rs214-k21.txt -o xxx.csv -k 21"
        User time (seconds): 6616.13
        System time (seconds): 41.06
        Percent of CPU this job got: 2793%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 3:58.34
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2205448
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 3
        Minor (reclaiming a frame) page faults: 3216006
        Voluntary context switches: 823871
        Involuntary context switches: 867030
        Swaps: 0
        File system inputs: 0
        File system outputs: 24
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

gather with a picklist second

/usr/bin/time -v sourmash gather SRR606249.trim.sig.gz /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-k21.zip --picklist xxx.csv:match:ident
...
        Command being timed: "sourmash gather SRR606249.trim.sig.gz /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-k21.zip --picklist xxx.csv:match:ident"
        User time (seconds): 39.86
        System time (seconds): 4.15
        Percent of CPU this job got: 107%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:41.11
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1105064
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 882
        Minor (reclaiming a frame) page faults: 689018
        Voluntary context switches: 3635
        Involuntary context switches: 281
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

names & organization questions re mastiff interface functions

in #58, @bluegenes is adding CLI extension code to this plugin to take advantage of @luizirber's wicked fast inverted index code, codename: mastiff (see also sourmash-bio/sourmash#2230). This PR adds four new plugin commands - for now, index, check, search, and gather - joining manysearch, fastgather, and fastmultigather. That's a lot!

@bluegenes and I talked on Friday, and I think we both agree that eventually these commands will become part of sourmash. For now, we are using the branchwater plugin as a way to (a) learn rust (b) expose functionality that we want & need without having to think about backwards compat (c) explore and have fun.

the question is, how do we manage things in this plugin?

  • do we just add all the mastiff functions? this adds a lot of complexity to this plugin
  • and/or do we want to split things out into another plugin? my hot take is probably not; I expect the utility code will be shared between the CLI-accessible functions.
  • and/or do we want to have some code that's not part of sourmash scripts? tougher question. it's easy to provide other CLI entry points that aren't sourmash plugins.
  • and/or do we want to define some kind of rough naming scheme for sourmash plugins? I'm not even sure what that would be. what, put fast in front of everything? or pyo3? 😆

Some other thoughts:

  • re stability, I don't think we should worry overmuch about semantic versioning for CLI stuff on this plugin - it's probably ok to move fast and break things, as long as there's a reason. People can always pin a specific version, so we should just do regular releases and make sure the code works properly.
  • we need to do both benchmarking and validation ref #68
  • code re-org is good #62, maybe #67

CI fails on rocksdb

over in #58, we get:

 error: failed to run custom build command for `librocksdb-sys v0.11.0+8.1.1`

Caused by:
  process didn't exit successfully: `/home/runner/work/pyo3_branchwater/pyo3_branchwater/target/release/build/librocksdb-sys-07a244c34ecb3d24/build-script-build` (exit status: 101)
  --- stderr
  thread 'main' panicked at 'Unable to find libclang: "couldn't find any valid shared libraries matching: ['libclang.so', 'libclang-*.so', 'libclang.so.*', 

I think this can be dealt with by fixing #20, where we will (probably) switch to running stuff inside of a conda environment.

invest in benchmarking and validation studies

benchmarking: https://github.com/dib-lab/2022-branchwater-benchmarking is great, but a bit too heavyweight. (note that dib-lab/2022-branchwater-benchmarking#9 may be revealing significant performance degradation with my code, so :guilty look:) it would be good to invest in smaller/faster benchmarks.

need to get flamegraphs going in rust, too.


in re validation, I think we need some larger-scale code validation than what is being done in the tests right now, which mostly focus on podar/podar-ref (SRR606249 etc.). I think at the very least some true positives and maybe a true negative test, on a larger scale, would be good. I was thinking of using some of the results from @jessicalumian's paper Biogeographic Distribution of Five Antarctic Cyanobacteria Using Large-Scale k-mer Searching with sourmash branchwater.

benchmark - searching against Genbank cover at 100k

searching SRR606249 (podar) against genbank cover at scaled=100k

"../2022-pymagsearch/multigather.py SRR606249.scaled=10k.sig.gz.list.txt gtdb+genbank.sigs.d.list.txt --scaled=100000"
        User time (seconds): 14.36
        System time (seconds): 10.72
        Percent of CPU this job got: 32%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:16.50
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 160420
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 40499
        Voluntary context switches: 414468
        Involuntary context switches: 3335
        Swaps: 0
        File system inputs: 0
        File system outputs: 136
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

consider changing `manysearch` to load database into memory

the search function loads all the query sketches into memory at once, but then iterates over the list of "against" sketches and re-loads them, potentially many times.

if anything this seems backwards? but honestly, given the low memory requirements, we should probably just load everything into memory - see the sketch below.
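
For illustration only, here is a minimal Python sketch of the two strategies using the public sourmash API; the plugin's actual implementation is in Rust, and the path lists (plus compatible ksize/scaled across sketches) are assumed:

import sourmash

def search_streaming(query_paths, against_paths):
    # roughly the current behavior: queries held in memory, while the
    # "against" sketches are read from disk inside the loop and so get
    # re-loaded on every repeated pass
    queries = [next(iter(sourmash.load_file_as_signatures(p))) for p in query_paths]
    for path in against_paths:
        for subject in sourmash.load_file_as_signatures(path):
            for q in queries:
                q.minhash.contained_by(subject.minhash)

def search_in_memory(query_paths, against_paths):
    # the proposed alternative: load both sides up front; FracMinHash
    # sketches are small, so the extra memory cost should be modest
    queries = [next(iter(sourmash.load_file_as_signatures(p))) for p in query_paths]
    subjects = [s for p in against_paths for s in sourmash.load_file_as_signatures(p)]
    for q in queries:
        for s in subjects:
            q.minhash.contained_by(s.minhash)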

add threaded compare

no reason why we can't also have sourmash compare running in n^2 time, but with threads :)
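
As a rough illustration, here's what a threaded all-by-all compare could look like in Python; a real implementation would live in Rust with rayon, which avoids the GIL limits of a Python thread pool, and compatible ksize/scaled across sketches is assumed:

import itertools
from concurrent.futures import ThreadPoolExecutor

import sourmash

def threaded_compare(sig_zip, cores=4):
    # compute Jaccard for every pair of sketches in a collection
    sigs = list(sourmash.load_file_as_signatures(sig_zip))
    pairs = list(itertools.combinations(range(len(sigs)), 2))

    def jaccard(ij):
        i, j = ij
        return i, j, sigs[i].minhash.jaccard(sigs[j].minhash)

    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(jaccard, pairs))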

add jaccard back into manysearch

In #52 / fixing #25, I added Jaccard calculation/output into manysearch.

this ended up costing a lot of compute time, because my approach was to merge the minhashes and then calculate their length; see #71.

so, in #72 / fixing #71, I removed Jaccard.

But I think we can put it back in:

Upon reflection, we don't need to merge the two minhashes - this is something that (I think) is needed for handling num MinHash and is not needed for FracMinHash. For FracMinHash, we can use num_common_hashes / (size_A + size_B - num_common_hashes) which should be cheap.
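
To make the formula concrete, here's a pure-Python illustration (not the plugin's Rust code) over two collections of hash values:

def fracminhash_jaccard(hashes_a, hashes_b):
    # for two FracMinHash sketches at the same scaled value, Jaccard is
    # |A & B| / |A | B| == common / (|A| + |B| - common), no merge needed
    num_common = len(set(hashes_a) & set(hashes_b))
    return num_common / (len(hashes_a) + len(hashes_b) - num_common)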

more benchmark stuff - searching 2.3m wort isolates

% export RAYON_NUM_THREADS=32; /usr/bin/time -v ../magsearch/bin/searcher query.list.txt sra-isolates-guess-2022-nov-03.list.txt -o matches2.csv -t 0.1 -k 31 -s 2000
...
        Command being timed: "../magsearch/bin/searcher query.list.txt sra-isolates-guess-2022-nov-03.list.txt -o matches2.csv -t 0.1 -k 31 -s 2000"
        User time (seconds): 665402.66
        System time (seconds): 4591.23
        Percent of CPU this job got: 2652%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 7:00:58
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1443436
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 14
        Minor (reclaiming a frame) page faults: 902693107
        Voluntary context switches: 10692901
        Involuntary context switches: 111937953
        Swaps: 0
        File system inputs: 4693214040
        File system outputs: 3896
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 101

some prefetch/gather benchmarks

A comparison of a Rust counter-gather implementation (pymagsearch) with sourmash gather from sourmash v4.5.0.

table hackmd for nice editing

command                   wall time  max RSS   CPU time   thread efficiency
sourmash gather           9:07.73    2,661 MB  489.02 s   0.89
pymagsearch develop mode  5:17.82    113 MB    8735.05 s  27.6x
pymagsearch release mode  1:28.76    107 MB    298.79 s   3x

ht Rob Patro for suggesting release mode 😆

pymagsearch gather.py with 32 threads -

built with default develop flags.

% /usr/bin/time -v ./gather.py SRR606249.k31.sig.gz gtdb-rs207-k31.list.txt  --output-prefetch p.csv --output-gather g.csv
...
Command being timed: "./gather.py SRR606249.k31.sig.gz gtdb-rs207-k31.list.txt --output-prefetch p.csv --output-gather g.csv"
        User time (seconds): 8735.05
        System time (seconds): 15.75
        Percent of CPU this job got: 2753%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 5:17.82
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 113036
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 35
        Minor (reclaiming a frame) page faults: 310501
        Voluntary context switches: 142072
        Involuntary context switches: 961049
        Swaps: 0
        File system inputs: 8456
        File system outputs: 72
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

sourmash gather with 1 thread

% /usr/bin/time -v sourmash gather SRR606249.k31.sig.gz /group/ctbrowngrp/sourmash-db/gtdb-rs207/gtdb-rs207.genomic-reps.dna.k31.zip --save-prefetch-csv p2.csv -o g2.csv
...
        Command being timed: "sourmash gather SRR606249.k31.sig.gz /group/ctbrowngrp/sourmash-db/gtdb-rs207/gtdb-rs207.genomic-reps.dna.k31.zip --save-prefetch-csv p2.csv -o g2.csv"
        User time (seconds): 489.02
        System time (seconds): 4.30
        Percent of CPU this job got: 90%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 9:07.73
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2661284
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 7
        Minor (reclaiming a frame) page faults: 702574
        Voluntary context switches: 2058
        Involuntary context switches: 749858
        Swaps: 0
        File system inputs: 3547408
        File system outputs: 544
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

pymagsearch gather.py with 32 threads -

built with maturin develop --release using suggestions from Rob

removing GCF_014217355.1 Fusobacterium hwasookii strain=KCOM 1249, ASM1421735v1
        Command being timed: "./gather.py SRR606249.k31.sig.gz gtdb-rs207-k31.list.txt --output-prefetch p.csv --output-gather g.csv"
        User time (seconds): 298.79
        System time (seconds): 9.04
        Percent of CPU this job got: 346%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:28.76
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 107128
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 128932
        Voluntary context switches: 242522
        Involuntary context switches: 27046
        Swaps: 0
        File system inputs: 7146392
        File system outputs: 72
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Pretty print the manysearch result

I wanted to share a function that takes the manysearch output (as of version 0.6.0) and writes two CSV files, for Jaccard and containment. The first column is the query, the last column is the best_match, and the columns in between are all targets.

import pandas as pd

def per_query_table(manysearch_output_csv, output_prefix):
    df = pd.read_csv(manysearch_output_csv)

    unique_queries = df['query_name'].unique()
    unique_matches = df['match_name'].unique()

    # build one query-by-match matrix per metric
    containment_df = pd.DataFrame(index=unique_queries, columns=unique_matches)
    jaccard_df = pd.DataFrame(index=unique_queries, columns=unique_matches)

    for query in unique_queries:
        for match in unique_matches:
            subset = df[(df['query_name'] == query) & (df['match_name'] == match)]

            if not subset.empty:
                # take the first reported value for this query/match pair
                containment_df.at[query, match] = subset['containment'].iloc[0]
                jaccard_df.at[query, match] = subset['jaccard'].iloc[0]

    containment_df = containment_df.astype(float)
    jaccard_df = jaccard_df.astype(float)

    # add best_match column: the target with the highest value in each row
    for output_df in [containment_df, jaccard_df]:
        output_df.reset_index(inplace=True)
        output_df.rename(columns={"index": "query_name"}, inplace=True)
        output_df['best_match'] = output_df.iloc[:, 1:].idxmax(axis=1)

    containment_df.to_csv(f"{output_prefix}_containment.csv", index=False)
    jaccard_df.to_csv(f"{output_prefix}_jaccard.csv", index=False)
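
For example, with hypothetical file names:

per_query_table("manysearch_results.csv", "podar")
# writes podar_containment.csv and podar_jaccard.csv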

deal with identical `{basename}` somehow, in `fastmultigather`

deal with this warning from the docs in #21 -

Warning: At the moment, if two different queries have the same {basename}, the CSVs for one will be overwritten. The behavior here is undefined in practice, because this is multithreaded code and we don't know when queries will be executed or which files will be written first.
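
Until this is resolved, duplicate basenames can at least be detected up front; a small, hypothetical pre-flight check in Python:

import os
from collections import Counter

def find_basename_collisions(query_paths):
    # fastmultigather writes one CSV per query {basename}, so duplicate
    # basenames in the query list would silently overwrite each other
    counts = Counter(os.path.basename(p) for p in query_paths)
    return {name: n for name, n in counts.items() if n > 1}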

fastmultigather benchmarks: swine x gtdb reps

This took 6 hours to run when using fastgather + sourmash gather/picklists, per https://github.com/ctb/2023-swine-usda/.

Uses code in #21

here? 36 minutes and 4 GB of RAM. 🎉

Command being timed: "sourmash scripts fastmultigather list.swine-x-reps.txt list.gtdb-reps-rs214-k21.txt --scaled=10000 -k 21"
        User time (seconds): 69820.00
        System time (seconds): 46.70
        Percent of CPU this job got: 3189%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 36:30.68
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4177944
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 18
        Minor (reclaiming a frame) page faults: 16907954
        Voluntary context switches: 889867
        Involuntary context switches: 1281448
        Swaps: 0
        File system inputs: 0
        File system outputs: 45544
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

reorganize repo

clean up src/ - make it like:

src/python
src/<something>

also make sure there are no lingering .py scripts => just the plugin, ma'am. I think this means simplifying the python package substantially: removing __main__, etc.

some multigather benchmarking numbers

Evaluating multigather.py code in #3.

RAYON_NUM_THREADS was set to 32 for all of the below.

single query (SRR606249)

% /usr/bin/time -v ./multigather.py query.txt gtdb-rs207-k31.list.txt 
...
"./multigather.py query.txt gtdb-rs207-k31.list.txt"
        User time (seconds): 245.24
        System time (seconds): 8.96
        Percent of CPU this job got: 477%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:53.21
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3445820
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1486273
        Voluntary context switches: 141565
        Involuntary context switches: 14126
        Swaps: 0
        File system inputs: 0
        File system outputs: 152
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

five queries (5x SRR606249)

% /usr/bin/time -v ./multigather.py query5.txt gtdb-rs207-k31.list.txt 
...
        Command being timed: "./multigather.py query5.txt gtdb-rs207-k31.list.txt"
        User time (seconds): 603.29
        System time (seconds): 7.69
        Percent of CPU this job got: 907%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:07.31
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3681684
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1553319
        Voluntary context switches: 163288
        Involuntary context switches: 48974
        Swaps: 0
        File system inputs: 0
        File system outputs: 760
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

20 random queries from wort

        Command being timed: "./multigather.py wort-a-20.txt gtdb-rs207-k31.list.txt"   
        User time (seconds): 817.36
        System time (seconds): 9.33
        Percent of CPU this job got: 1328%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:02.20
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3774000
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1754234
        Voluntary context switches: 205870
        Involuntary context switches: 70223
        Swaps: 0
        File system inputs: 124560
        File system outputs: 1360
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

100 random queries from wort

Command being timed: "./multigather.py wort-a-100.txt gtdb-rs207-k31.list.txt"  
        User time (seconds): 19878.41
        System time (seconds): 33.23
        Percent of CPU this job got: 1085%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 30:34.56
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 5964248
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 8546828
        Voluntary context switches: 691194
        Involuntary context switches: 1950183
        Swaps: 0
        File system inputs: 1229456
        File system outputs: 12304
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

naming convention for new databases?

#58 introduces mastiff-style rocksdb databases, which are stored as a folder with many internal files.

We don't need a file extension at all, but do we want one as a visual cue for ourselves? I settled on using .rdb in my testing, but totally open to others. Maybe even just .db?

@luizirber @ctb

multigather on wort isolate sigs

I ran multigather on the 100k smallest signatures in wort; took 34m.

Command being timed: "../2022-pymagsearch/multigather.py ls-wort.txt.sorted-size.100k-smallest.catalog gtdb-rs207-k31.list.txt"
    User time (seconds): 65460.14
    System time (seconds): 71.04
    Percent of CPU this job got: 3154%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 34:37.54
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3413460
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 11
    Minor (reclaiming a frame) page faults: 8423131
    Voluntary context switches: 357369
    Involuntary context switches: 6475665
    Swaps: 0
    File system inputs: 2880
    File system outputs: 44656
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

Multithreading is not working

In manysearch, and probably the rest of the commands, multithreading is not working. Even when I hard-code the number of cores, it only uses 2; I noticed this with htop.
