
This project is forked from gschofl/bigblastparser.


Parse (very big) NCBI BLAST xml output files (very fast) into an SQLite DB.



bigBlastParser

bigBlastParser is a very fast (SAX-style) NCBI BLAST parser for very large BLAST XML files.
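
SAX-style parsing means the file is read as a stream of element events rather than loaded into memory as a tree, which is what keeps very large files tractable. The following is a minimal Python sketch of that idea, not the project's C++/Xerces code. It assumes the standard NCBI BLAST XML element names `Iteration_query-def` and `Hit`; the class name `QueryHitCounter` and the sample document are made up for illustration.

```python
import xml.sax


class QueryHitCounter(xml.sax.ContentHandler):
    """Count <Hit> elements per query while streaming through BLAST XML."""

    def __init__(self):
        super().__init__()
        self.counts = {}    # query definition -> number of hits seen
        self._query = None  # definition line of the current query
        self._text = []     # character data of the element being read

    def startElement(self, name, attrs):
        self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        if name == "Iteration_query-def":
            self._query = "".join(self._text).strip()
            self.counts[self._query] = 0
        elif name == "Hit":
            self.counts[self._query] += 1
        self._text = []


# Tiny invented document that mimics the nesting of BLAST XML output.
SAMPLE = b"""<?xml version="1.0"?>
<BlastOutput>
 <BlastOutput_iterations>
  <Iteration>
   <Iteration_query-def>seq1</Iteration_query-def>
   <Iteration_hits>
    <Hit><Hit_accession>A1</Hit_accession></Hit>
    <Hit><Hit_accession>A2</Hit_accession></Hit>
   </Iteration_hits>
  </Iteration>
  <Iteration>
   <Iteration_query-def>seq2</Iteration_query-def>
   <Iteration_hits></Iteration_hits>
  </Iteration>
 </BlastOutput_iterations>
</BlastOutput>
"""

handler = QueryHitCounter()
xml.sax.parseString(SAMPLE, handler)
print(handler.counts)
```

Only the current element's text is held at any time, so memory use does not grow with the size of the input file.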

It parses the BLAST data into an SQLite database with three tables (query, hit, and hsp) that can be queried using standard SQL.

The tables are designed as follows:

    CREATE TABLE query(
            query_id      INTEGER,
            query_num     INTEGER,
            query_def     TEXT,
            query_len     INTEGER,
            PRIMARY KEY (query_id)
            );
    CREATE INDEX Fquery ON query (query_id);

    CREATE TABLE hit(
            query_id      INTEGER,
            hit_id        INTEGER,
            hit_num       INTEGER,
            gene_id       TEXT,
            accession     TEXT,
            definition    TEXT,
            length        INTEGER,
            PRIMARY KEY (hit_id),
            FOREIGN KEY (query_id) REFERENCES query (query_id)
            );
    CREATE INDEX Fhit ON hit (hit_id);
    CREATE INDEX Fhit_hit_query ON hit (query_id, hit_id);
    CREATE INDEX Fhit_query ON hit (query_id);

    CREATE TABLE hsp(
            query_id      INTEGER,
            hit_id        INTEGER,
            hsp_id        INTEGER,
            hsp_num       INTEGER,
            bit_score     FLOAT,
            score         INTEGER,
            evalue        FLOAT,
            query_from    INTEGER,
            query_to      INTEGER,
            hit_from      INTEGER,
            hit_to        INTEGER,
            query_frame   INTEGER,
            hit_frame     INTEGER,
            identity      INTEGER,
            positive      INTEGER,
            gaps          INTEGER,
            align_len     INTEGER,
            qseq          TEXT,
            hseq          TEXT,
            midline       TEXT,
            PRIMARY KEY (hsp_id),
            FOREIGN KEY (hit_id) REFERENCES hit (hit_id),
            FOREIGN KEY (query_id) REFERENCES query (query_id)
            );
    CREATE INDEX Fhsp ON hsp (hsp_id);
    CREATE INDEX Fhsp_hit ON hsp (hit_id);
    CREATE INDEX Fhsp_hit_query ON hsp (query_id, hit_id, hsp_id);
    CREATE INDEX Fhsp_query ON hsp (query_id);
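
Once a database has been written, the three tables can be joined on `query_id` and `hit_id`. The following Python sketch uses the standard `sqlite3` module against an in-memory database holding an abridged copy of the schema above and a few invented rows. With a real run you would connect to the generated `.db` file instead; the E-value cutoff of `1e-50` is an arbitrary choice for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for a real <blastfile>.db

# Abridged schema: only the columns this example needs.
con.executescript("""
    CREATE TABLE query(query_id INTEGER PRIMARY KEY, query_num INTEGER,
                       query_def TEXT, query_len INTEGER);
    CREATE TABLE hit(query_id INTEGER, hit_id INTEGER PRIMARY KEY,
                     hit_num INTEGER, accession TEXT, definition TEXT);
    CREATE TABLE hsp(query_id INTEGER, hit_id INTEGER,
                     hsp_id INTEGER PRIMARY KEY, hsp_num INTEGER,
                     bit_score FLOAT, evalue FLOAT);
""")

# Invented rows, for illustration only.
con.executemany("INSERT INTO query VALUES (?, ?, ?, ?)",
                [(1, 1, "seq1", 300), (2, 2, "seq2", 450)])
con.executemany("INSERT INTO hit VALUES (?, ?, ?, ?, ?)",
                [(1, 1, 1, "ACC001", "protein A"),
                 (1, 2, 2, "ACC002", "protein B"),
                 (2, 3, 1, "ACC003", "protein C")])
con.executemany("INSERT INTO hsp VALUES (?, ?, ?, ?, ?, ?)",
                [(1, 1, 1, 1, 250.5, 1e-60),
                 (1, 2, 2, 1, 410.0, 1e-110),
                 (2, 3, 3, 1, 120.0, 1e-25)])

# All HSPs at or below an E-value cutoff, best bit score first per query.
rows = con.execute("""
    SELECT q.query_def, h.accession, s.bit_score
    FROM hsp AS s
    JOIN hit AS h ON h.hit_id = s.hit_id
    JOIN query AS q ON q.query_id = s.query_id
    WHERE s.evalue <= ?
    ORDER BY q.query_id, s.bit_score DESC
""", (1e-50,)).fetchall()

for row in rows:
    print(row)
```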

The maximum number of hits parsed per query and the maximum number of HSPs parsed per hit are controlled by command-line options.

Install

You will need the Xerces-C++ XML parser and SQLite. On Ubuntu, install the development packages with:

    apt-get install libxerces-c-dev libsqlite3-dev

Then download and build the program:

    git clone https://github.com/gschofl/BigBlastParser.git
    cd BigBlastParser
    make
    make clean

Command line usage

Usage

    bigBlastParser [options] <blastfile>.xml

    -o, --out dbName     Output SQLite database (default: <blastfile>.db)
    -a, --append         Append data to an existing SQLite BLAST DB.
    --max_hit n          Maximum number of hits parsed from a query (default: 20);
                         set to -1 to parse all available hits.
    --max_hsp n          Maximum number of HSPs parsed from a hit (default: 20);
                         set to -1 to parse all available HSPs.
    --reset_at n         After <n> queries are parsed, the data is dumped to the
                         database file before parsing is resumed. This helps to
                         keep the memory footprint small (default: 1000).
    -h, --help           Show help.
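
The `--reset_at` option can be pictured as buffered writing: parsed records accumulate in memory and are flushed to the database every n queries. The Python sketch below illustrates that general technique only. How bigBlastParser implements it internally is not shown here, and the function names `load_in_batches` and `_flush`, along with the generated records, are hypothetical.

```python
import sqlite3


def _flush(con, buffer):
    """Write the buffered rows in one transaction, then empty the buffer."""
    with con:  # commits on success, rolls back on error
        con.executemany(
            "INSERT INTO query (query_num, query_def, query_len) VALUES (?, ?, ?)",
            buffer,
        )
    buffer.clear()


def load_in_batches(con, records, reset_at=1000):
    """Insert records, flushing to the database every `reset_at` queries.

    Returns the number of flushes performed.
    """
    buffer = []
    flushes = 0
    for record in records:
        buffer.append(record)
        if len(buffer) >= reset_at:
            _flush(con, buffer)
            flushes += 1
    if buffer:  # write whatever is left after the last full batch
        _flush(con, buffer)
        flushes += 1
    return flushes


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE query(query_id INTEGER PRIMARY KEY, query_num INTEGER, "
    "query_def TEXT, query_len INTEGER)"
)

# 2500 invented query records, produced lazily so they are never all in memory.
records = ((i, f"seq{i}", 100 + i) for i in range(1, 2501))
flushes = load_in_batches(con, records, reset_at=1000)
total = con.execute("SELECT COUNT(*) FROM query").fetchone()[0]
print(flushes, total)
```

A smaller batch size lowers peak memory at the cost of more transactions; a larger one does the opposite.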
