
medvedevgroup / howdesbt


Sequence Bloom Tree, supporting determined/how split filters

License: MIT License

Makefile 0.12% C++ 85.84% C 0.10% Python 12.75% Shell 0.57% Cython 0.61%

howdesbt's Introduction

HowDeSBT


Dependencies

Installation

To install HowDeSBT from the source:

1a. Download the latest version of HowDeSBT from GitHub

     git clone https://github.com/medvedevgroup/HowDeSBT  

1b. Modify the Makefile

If you have installed the dependencies somewhere other than ${HOME}, you need to modify the Makefile. Specifically, in both the CXXFLAGS and LDFLAGS definitions $${HOME} should be changed to your install path.

1c. Jellyfish install workaround

(There are other ways to accomplish this, see the note at the end of this step.)

Jellyfish installation requires an extra step for its include directory. After you have installed Jellyfish, do

    cd ${HOME}/include
    ls | grep jellyfish

You should see something like

    jellyfish-2.2.6

where 2.2.6 is the version of Jellyfish you've installed. Then make a symbolic link named 'jellyfish' that points to the includes directory for the version you've installed:

    cd ${HOME}/include
    ln -s jellyfish-2.2.6/jellyfish jellyfish

Note: the symbolic link is a workaround for the way that Jellyfish installs its files. That install expects the user to have the program pkg-config installed and an environment variable PKG_CONFIG_PATH defined. The Makefile here would then use pkg-config to get the path to the include files. While that paradigm is apparently widespread, it isn't universal. The symbolic link workaround seems less of a burden than requiring that users install another package and set up an environment variable. See gmarcais/Jellyfish#139 for more details.

2. Compile:

    cd HowDeSBT  
    make  

3. Install:

    cd HowDeSBT  
    cp howdesbt ${HOME}/bin

Alternatively, make sure the path to the HowDeSBT directory is in your PATH environment variable.

4. Validation:

The quick start tutorial shows expected results which can be compared against your tutorial results.

Quick Start

A usage tutorial can be found at https://github.com/medvedevgroup/HowDeSBT/tree/master/tutorial

The command howdesbt ? will show a list of subcommands with brief descriptions. As of this writing, that will look like this:

    $ howdesbt ?
    makebf--    convert a sequence file to a bloom filter
    cluster--   determine a tree topology by clustering bloom filters
    build--     build a sequence bloom tree from a topology file and leaves
    query--     query a sequence bloom tree
    version--   report this program's version

The command howdesbt ?<subcommand> will give a more detailed description of a subcommand. For example, howdesbt ?makebf gives details for how to convert a sequence file to a bloom filter.

Citation

If you use HowDeSBT, please cite

  • Robert S Harris and Paul Medvedev, Improved representation of sequence bloom trees, Bioinformatics, btz662


howdesbt's Issues

Building a pre-release with new features (--rrr, --unrrr, and --list in bfoperate subcommand)

This is not really an issue.

I'm developing a framework that makes use of HowDeSBT for building SBTs, and for most of the users it would for sure be easier to install HowDeSBT with conda.

Just wondering whether it would be possible to create a new tagged pre-release, so that I could easily update the conda recipe to point to a specific tag with the latest features (i.e., --rrr, --unrrr, and --list in the bfoperate subcommand).

bfdistance: one filter vs a set of filters

This is actually not an issue, it is more like a request to implement a new feature.

The bfdistance command automatically computes the pairwise distance between all the BF files listed in --list=<filename>. This is great, but it is not currently possible to run it in parallel to speed up the whole process, which would be extremely helpful when dealing with hundreds of thousands of BFs. This is probably not straightforward, but it can be solved indirectly by thinking of another problem.

In particular, let's say that I want to compute the distance between a single BF and a set of BFs. This is not currently possible, but it seems very easy to implement, since bfdistance already takes one BF file <filename> and a list of BF files --list=<filename> as input. The idea is that it could compute the distance between <filename> and each file in --list=<filename> when both are specified (currently, it seems to ignore <filename> if --list=<filename> is also specified, but I may be wrong).

At this point, there would be no reason to think about a way to make it parallel since we could simply run multiple instances of howdesbt bfdistance with different <filename> but all with the same --list=<filename>.
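As a rough illustration of the one-versus-many computation requested here, the sketch below compares one bit vector against each entry of a list. It is not HowDeSBT code: the names BitVector, hamming_distance, and one_vs_many are hypothetical, and Hamming distance is only an assumed stand-in for whatever metric bfdistance actually reports.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a bloom filter's bit vector, packed into 64-bit words.
using BitVector = std::vector<std::uint64_t>;

// Count the set bits in one word (portable; no compiler builtins).
inline unsigned popcount64(std::uint64_t w) {
    unsigned n = 0;
    while (w != 0) {
        w &= w - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// Hamming distance between two equally sized bit vectors.
// Assumption: the metric is chosen for illustration only.
std::size_t hamming_distance(const BitVector& a, const BitVector& b) {
    if (a.size() != b.size())
        throw std::invalid_argument("bit vectors differ in size");
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += popcount64(a[i] ^ b[i]);
    return d;
}

// Distance from one query filter to every filter in a list.
// Each query is independent, so separate processes could each handle one query.
std::vector<std::size_t> one_vs_many(const BitVector& query,
                                     const std::vector<BitVector>& others) {
    std::vector<std::size_t> out;
    out.reserve(others.size());
    for (const BitVector& other : others)
        out.push_back(hamming_distance(query, other));
    assert(out.size() == others.size());  // one result per listed filter
    return out;
}
```

Because one_vs_many reads only its own query and the shared list, running it once per query file splits the all-pairs job into independent pieces.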

Hash collision rate may be higher than it ought to be

(All of the following is hypothetical; there is currently no plan to do this work.)

Canonicalizing the hash function as h(forward)+h(revcomp) may be introducing more collisions than necessary. One solution would be to use something like min(h(forward),h(revcomp)), though this can skew the distribution of bits.
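To make the two canonicalization schemes concrete, here is a minimal sketch. The hash is a placeholder (64-bit FNV-1a), not the hash HowDeSBT uses, and every function name is hypothetical. It shows only that both schemes give a k-mer and its reverse complement the same value; it does not measure collision rates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Reverse complement of a DNA string over A, C, G, T.
std::string reverse_complement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default: break;  // leave any other character unchanged
        }
    }
    return rc;
}

// Placeholder hash (64-bit FNV-1a); an assumption for illustration only.
std::uint64_t toy_hash(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Canonicalization by sum; unsigned addition wraps modulo 2^64.
std::uint64_t canonical_sum(const std::string& kmer) {
    return toy_hash(kmer) + toy_hash(reverse_complement(kmer));
}

// Canonicalization by minimum; always returns one of the two strand hashes.
std::uint64_t canonical_min(const std::string& kmer) {
    return std::min(toy_hash(kmer), toy_hash(reverse_complement(kmer)));
}

// Sanity check: a canonical hash is the same for a k-mer and its reverse complement.
void check_canonical(const std::string& kmer) {
    const std::string rc = reverse_complement(kmer);
    assert(canonical_sum(kmer) == canonical_sum(rc));
    assert(canonical_min(kmer) == canonical_min(rc));
}
```

A collision experiment could feed many distinct canonical k-mers through both functions and count repeated outputs for each scheme.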

More testing would be needed to determine whether the current canonicalization is really a problem.

If it is a problem, any solution must remain backward compatible with existing files. The BF file format includes a version number, so newer versions of the program would recognize the earlier file versions and keep using the current hash function with those. Newly created BF files would use the new hash function. Older versions of the program would reject the new files.
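The version-based dispatch described above might look roughly like this. The version numbers, the HashScheme enum, and the function names are all hypothetical; the actual values stored in the BF file header are not given here.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical version numbers; the real BF file format's values are not stated here.
constexpr std::uint32_t kVersionSumHash = 1;  // existing files, current hash
constexpr std::uint32_t kVersionMinHash = 2;  // files written with a new hash
constexpr std::uint32_t kNewestSupported = kVersionMinHash;

enum class HashScheme { Sum, Min };

// Choose the canonicalization scheme from the version stored in a file header.
// Throws for versions this build does not understand, which is also how an
// older build would end up rejecting files written by a newer one.
HashScheme scheme_for_version(std::uint32_t fileVersion) {
    if (fileVersion < kVersionSumHash || fileVersion > kNewestSupported)
        throw std::runtime_error("unsupported bloom filter file version");
    assert(fileVersion >= kVersionSumHash);  // invariant after the range check
    return (fileVersion < kVersionMinHash) ? HashScheme::Sum : HashScheme::Min;
}

// Newly created files always carry the newest version, and so use the new hash.
std::uint32_t version_for_new_file() {
    return kNewestSupported;
}
```

Tying the scheme to the stored version means existing files keep producing the same bit positions, so they would not need to be rebuilt.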

bfoperate: bloom filters with more than one bit vector

I noticed that the bfoperate subcommand now checks that each input bloom filter contains only one bit vector before applying a logical operator.

This totally makes sense, but I don't understand why the condition on the number of bit vectors is > 2 for the second input bloom filter (bfB). Should it be > 1, as it already is for the first bloom filter (bfA)?

void BFOperateCommand::op_and()

HowDeSBT/cmd_bf_operate.cc

Lines 243 to 246 in aaab732

    if (bfA->numBitVectors > 1)
        fatal ("error: \"" + bfFilenames[0] + "\" contains more than one bit vector");
    if (bfB->numBitVectors > 2)
        fatal ("error: \"" + bfFilenames[1] + "\" contains more than one bit vector");

void BFOperateCommand::op_or()

HowDeSBT/cmd_bf_operate.cc

Lines 278 to 281 in aaab732

    if (bfA->numBitVectors > 1)
        fatal ("error: \"" + bfFilenames[0] + "\" contains more than one bit vector");
    if (bfB->numBitVectors > 2)
        fatal ("error: \"" + bfFilenames[1] + "\" contains more than one bit vector");

void BFOperateCommand::op_xor()

HowDeSBT/cmd_bf_operate.cc

Lines 313 to 316 in aaab732

    if (bfA->numBitVectors > 1)
        fatal ("error: \"" + bfFilenames[0] + "\" contains more than one bit vector");
    if (bfB->numBitVectors > 2)
        fatal ("error: \"" + bfFilenames[1] + "\" contains more than one bit vector");

void BFOperateCommand::op_eq()

HowDeSBT/cmd_bf_operate.cc

Lines 348 to 351 in aaab732

    if (bfA->numBitVectors > 1)
        fatal ("error: \"" + bfFilenames[0] + "\" contains more than one bit vector");
    if (bfB->numBitVectors > 2)
        fatal ("error: \"" + bfFilenames[1] + "\" contains more than one bit vector");
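One possible way to keep the threshold identical for every operand is to route all four checks through a single helper. The sketch below is only an illustration, not a proposed patch: FilterInfo and require_single_bit_vector are hypothetical names, and it throws an exception where the quoted code calls fatal().

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Minimal stand-in for the two fields used by the checks quoted above;
// the real bloom filter class has many more members.
struct FilterInfo {
    std::string filename;
    int numBitVectors;
};

// Throw if any input filter holds more than one bit vector.
// A single loop applies the same "> 1" threshold to every operand.
// (The quoted code calls fatal(); an exception is used here for illustration.)
void require_single_bit_vector(const std::vector<FilterInfo>& filters) {
    assert(!filters.empty());  // callers are expected to pass at least one filter
    for (const FilterInfo& bf : filters) {
        if (bf.numBitVectors > 1)
            throw std::runtime_error("error: \"" + bf.filename +
                                     "\" contains more than one bit vector");
    }
}
```

With a helper like this, op_and, op_or, op_xor, and op_eq would each make one call, and a second operand with exactly two bit vectors would be rejected just like the first.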
