dkfz-odcf / fastqindex

A tool to index and extract data from gzipped FASTQ files

License: MIT License

CMake 2.38% C++ 96.33% C 0.89% Shell 0.40% Roff 0.01%

indexing extraction reads gzip zlib ngs random-access stream index fastq fastq-files

fastqindex's Introduction

FastqIndEx

Description

FastqIndEx creates an index file for a gzip-compressed FASTQ file to enable random access to it. Although its primary goal is to extract data from FASTQ files, you can also use it to index other gzipped text files.

Features at a glance

- Support for:
  - Local file access (either by piped or streamed access)
  - Safe concurrent file access over NFS (using flock(), also with either piped or streamed access)
  - Files in an S3 bucket (experimental, without locking yet!)
- Many checks to make sure that the application does what you expect and runs safely.
- A number of options to tell the indexer how you would like things done.
- A flexible and configurable index size.
- Different extraction strategies:
  - Extract a range of records
  - (Virtually) divide the FASTQ on the fly into n segments and extract one segment of your choice.

License and Contributing

FastqIndEx is published under the MIT license. See LICENSE.txt for more information.

Please note that we only accept contributions that are compatible with the MIT license. We will therefore not accept contributions that are licensed under, e.g., the GPL!

For more information on code contributions, please refer to CONTRIBUTING.md.

General Usage

To see how FastqIndEx is installed, please read the Installation section below.

Index

# Index a file, default options, automatic distance calculation for index entries.
fastqindex index -f=test2.fastq.gz 

# Index piped input with a distance of 2GiB for index entries.
cat test2.fastq.gz | fastqindex index -f=- -i=test2.fastq.fqi -B=2G

# Index an object stored in an S3 bucket; the index is stored in the bucket as well!
fastqindex index -f=s3://bucket/test2.fastq.gz

There are more options available like:

| Option | Description |
| --- | --- |
| -w | Allow the application to overwrite the index file. By default, this is not allowed. |
| -B | Tell the indexer to store an entry after approximately n bytes (like 4M, 2G, 512K). |
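
To illustrate what a value such as 512K, 4M or 2G stands for, here is a minimal sketch of a size parser. The function name `parseByteDistance` is hypothetical and not taken from the FastqIndEx sources. It assumes binary multiples (1K = 1024 bytes), which matches the "2GiB" wording in the example above, and it does not guard against overflow for very large numbers.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of FastqIndEx): converts "512K", "4M" or "2G"
// into a byte count. Assumes binary multiples, i.e. 1K = 1024 bytes.
inline std::uint64_t parseByteDistance(const std::string& text) {
    std::size_t pos = 0;
    std::uint64_t value = 0;
    // Read the leading decimal digits.
    while (pos < text.size() && std::isdigit(static_cast<unsigned char>(text[pos]))) {
        value = value * 10 + static_cast<std::uint64_t>(text[pos] - '0');
        ++pos;
    }
    if (pos == 0) throw std::invalid_argument("size must start with a digit: " + text);
    if (pos == text.size()) return value;  // plain byte count without a unit
    if (pos + 1 != text.size()) throw std::invalid_argument("unexpected suffix: " + text);
    switch (std::toupper(static_cast<unsigned char>(text[pos]))) {
        case 'K': return value << 10;
        case 'M': return value << 20;
        case 'G': return value << 30;
        default: throw std::invalid_argument("unknown unit: " + text);
    }
}
```

With this sketch, `parseByteDistance("2G")` yields 2147483648 bytes.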

Please call the application with

fastqindex index

to see more options.

Extract

# Extract to a file. Note that this command produces an uncompressed file!
# Note that the command will decompress everything (basically gunzip).
fastqindex extract -f=test2.fastq.gz -i=test2.fastq.fqi -o=extracted.fastq

# Or to stdout / the console, this is also uncompressed data!
# Extract 16 records, starting with record 201 (we have a 0 based offset!)
fastqindex extract -f=test2.fastq.gz -i=test2.fastq.fqi -o=- -s=200 -n=16

# Or also from S3 with a locally stored index.
# Extract the second and third record.
fastqindex extract -f=s3://bucket/test2.fastq.gz -i=/local/path/test2.fastq.fqi -o=- -s=1 -n=2

Please note that S3 extraction is still experimental (but working for us). Unfortunately, there will always be an error message stating that a stream was closed. You can ignore it.

There are more options available here as well like:

| Option | Description |
| --- | --- |
| -s | Defines the first record which will be extracted. By default, the application assumes that a record has 4 lines. |
| -n | Defines the number of reads which should be extracted. |
| -e | Defines the size of a record. For FASTQ files this is 4 (record size), but you could use 1 for e.g. regular text files. |
| -w | Allow the application to overwrite the index file. By default, this is not allowed. |
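
The relation between -s, -n and -e can be sketched as simple arithmetic: a record offset and a record count are multiplied by the record size to get a line range. The names `LineRange` and `recordsToLines` below are hypothetical and only illustrate the arithmetic; they are not the tool's internal API.

```cpp
#include <cstdint>

// Hypothetical illustration type; not taken from the FastqIndEx sources.
struct LineRange {
    std::uint64_t firstLine;  // 0-based index of the first line to emit
    std::uint64_t lineCount;  // number of lines to emit
};

// Maps a 0-based record offset (-s) and a record count (-n) to lines,
// given the record size (-e), which is 4 for FASTQ.
inline LineRange recordsToLines(std::uint64_t startRecord,
                                std::uint64_t recordCount,
                                std::uint64_t linesPerRecord = 4) {
    return LineRange{startRecord * linesPerRecord, recordCount * linesPerRecord};
}
```

For the example above (-s=200 -n=16), this gives 64 lines starting at line 800 (0-based).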

Please call the application with

fastqindex extract

to see more options.

Installation

Binary releases

We provide you with stable releases. Please go to the releases section.

Download the desired package, extract it to a location of your choice, and run it.

Compilation

Dependencies

FastqIndEx has the following dependencies, which should be met before building it:

| Dependency | Conda | Version / Git Hash | Purpose |
| --- | --- | --- | --- |
| CMake | yes | 3.13.x | Build tool |
| gcc | yes | 7.2 | Compiler and debugger suite |
| tclap | yes | 1.2.1 | Command line interpreter library |
| zlib | yes | 1.2.11 | Compression library |
| AWS SDK | no | 1.7.125 | S3 Support library |
| UnitTest++ | no | bc5d87f | Unit testing framework |

Use Conda to manage your dependencies

- Install Miniconda / Anaconda.
- The conda recipe is contained in env/environment.yml. Use conda-env to import it: conda env create -n FastqIndEx -f env/environment.yml
- Download and install UnitTest++ as described below; it is not available in Conda.

Compilation with manual installation of dependencies

g++/gcc

Before you run cmake, you might need to set

export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++

to the proper locations of your gcc/g++ binaries.

CMake, zlib...

wget https://github.com/Kitware/CMake/releases/download/v3.13.4/cmake-3.13.4.tar.gz
tar -xvzf cmake-3.13.4.tar.gz
# Configure and build afterwards

wget https://www.zlib.net/zlib-1.2.11.tar.gz
tar -xvzf zlib-1.2.11.tar.gz
cd zlib-1.2.11 && ./configure && make

UnitTest++

git clone https://github.com/unittest-cpp/unittest-cpp.git
cd unittest-cpp
mkdir builds
cd builds
cmake -G "Unix Makefiles" -D CMAKE_INSTALL_PREFIX=/custom/lib/path ..
cmake --build . --target all
cmake --build . --target install

AWS S3

Download the AWS SDK from the Amazon website and install it to a location of your choice. Compile as necessary and remember to first activate the Conda environment of FastqIndEx.

FastqIndEx

First, you need to clone the FastqIndEx Git repository. You can do this by executing:

cd ~/Projects           # Or any other desired location, we'll stick to this
git clone https://github.com/DKFZ-ODCF/FastqIndEx.git
git -C FastqIndEx checkout master     # Or any other version you like

To compile it, create a CMake build directory and run CMake afterwards:

export CONDA_DIR="${HOME}/miniconda2/envs/FastqIndEx"     # Miniconda is normally installed in <home>/miniconda2; "~" is not expanded inside quotes
export CONDA_LIB=${CONDA_DIR}/lib
export CONDA_LIB64=${CONDA_DIR}/lib64
export CONDA_INCLUDE=${CONDA_DIR}/include
# Note: BUILD_ONLY and ENABLE_TESTING are AWS SDK for C++ build options, so these two calls
# appear to configure the AWS SDK itself (first with shared, then with static libraries).
cmake -D CMAKE_BUILD_TYPE=Debug -D BUILD_ONLY="s3;config;transfer" -D BUILD_SHARED_LIBS=ON  -D OPENSSL_ROOT_DIR=${CONDA_DIR}  -D OPENSSL_INCLUDE_DIR=${CONDA_INCLUDE} -D OPENSSL_LIBRARIES=${CONDA_LIB}  -D ZLIB_INCLUDE_DIR=${CONDA_INCLUDE} -DZLIB_LIBRARY=${CONDA_LIB}/libz.a -D CURL_INCLUDE_DIR=${CONDA_INCLUDE} -DCURL_LIBRARY=${CONDA_LIB}/libcurl.so  -D CMAKE_INSTALL_PREFIX=$PWD/../install_shared -D ENABLE_TESTING=OFF ..
cmake -D CMAKE_BUILD_TYPE=Debug -D BUILD_ONLY="s3;config;transfer" -D BUILD_SHARED_LIBS=OFF  -D OPENSSL_ROOT_DIR=${CONDA_DIR}  -D OPENSSL_INCLUDE_DIR=${CONDA_INCLUDE} -D OPENSSL_LIBRARIES=${CONDA_LIB}  -D ZLIB_INCLUDE_DIR=${CONDA_INCLUDE} -DZLIB_LIBRARY=${CONDA_LIB}/libz.a -D CURL_INCLUDE_DIR=${CONDA_INCLUDE} -DCURL_LIBRARY=${CONDA_LIB}/libcurl.so  -D CMAKE_INSTALL_PREFIX=$PWD/../install_shared -D ENABLE_TESTING=OFF ..


export AWS_DIR="/path/to/aws-sdk-cpp/install_shared"
export UNITTESTPP_DIR="/path/to/unittest-cpp"
export ZLIB_DIR="/path/to/zlib-1.2.11"                                    # If necessary

# The following instructions are for a release build of FastqIndEx. 
# If you want to create a debug build, change "release" to "debug"
export MODE=release
cd FastqIndEx
mkdir ${MODE}                                                             
cd ${MODE}
# The UnitTest++_DIR, ZLIB_LIBRARY and ZLIB_INCLUDE_DIR flags are only needed if necessary
# (manually installed libraries). The AWSSDK_DIR and aws-* flags are needed for S3 support.
# Comments must not follow a trailing backslash, otherwise the line continuation breaks.
# Use CMAKE_BUILD_TYPE=Debug for a debug build.
cmake -G "Unix Makefiles" \
    -D "UnitTest++_DIR":PATH="${UNITTESTPP_DIR}/install/lib/cmake/UnitTest++" \
    -D ZLIB_LIBRARY="${ZLIB_DIR}/libz.a" \
    -D ZLIB_INCLUDE_DIR="${ZLIB_DIR}" \
    -D "AWSSDK_DIR":PATH="${AWS_DIR}/lib64/cmake/AWSSDK" \
    -D "aws-cpp-sdk-core_DIR":PATH="${AWS_DIR}/lib64/cmake/aws-cpp-sdk-core" \
    -D "aws-c-event-stream_DIR":PATH="${AWS_DIR}/lib64/aws-c-event-stream/cmake" \
    -D "aws-c-common_DIR":PATH="${AWS_DIR}/lib64/aws-c-common/cmake" \
    -D "aws-checksums_DIR":PATH="${AWS_DIR}/lib64/aws-checksums/cmake" \
    -D "aws-cpp-sdk-s3_DIR":PATH="${AWS_DIR}/lib64/cmake/aws-cpp-sdk-s3" \
    -D CMAKE_BUILD_TYPE=Release \
    ..
cd ..
cmake --build ${MODE} --target all -- -j 2

Note that the -D flags for the libraries and includes are only necessary if you installed the libraries manually. If they are already installed on your system, if you installed them (e.g. UnitTest++) to the system folders, or if you use the Conda environment, you can omit these flags.

To clean the build directory use:

cmake --build ${MODE} --target clean -- -j 2     # ${MODE} is the build directory created above

To run the tests, run the test binary like:
```
(cd ${MODE}/test && ./testapp)
```

If you want, you can add the release or debug directory to your PATH variable, e.g. by adding the following to your local .bashrc file:

# This assumes that you cloned the repo to ~/Projects/FastqIndEx and
# created the release subdirectory as described above.
export PATH=~/Projects/FastqIndEx/release/src:$PATH

fastqindex's Issues

Consider to check the output fqi file size for faulty / abusive indexer behaviour

If the file grows too large (half the size of the FASTQ?), the indexer should stop and report this.

Add more information to the index header

Like:

Number of index entries
Number of lines in file
Number of compressed blocks

Change the source and sink methods so that they use the Result type as the return value for most methods.

Why? Because in the current implementation, methods like readChar, read or write are very C/C++'ish and return e.g. -1 if "bad things happen". With the Result type, we can have a modern approach which
a) returns success or not
b) returns a result, if successful
c) can hold a message, describing the nature of the error.
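
The three properties listed above can be sketched as follows. This is not the project's existing Result type, whose interface is not shown here. The class below and the `readChar` signature are assumptions made for illustration only, and the sketch requires the value type to be default-constructible.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Illustrative sketch only; the real Result type in FastqIndEx may look different.
template <typename T>
class Result {
public:
    // b) carries a value if the call succeeded
    static Result ok(T value) { return Result(true, std::move(value), std::string()); }
    // c) carries a message describing the error otherwise
    static Result fail(std::string message) { return Result(false, T(), std::move(message)); }

    bool success() const { return success_; }  // a) success or not
    const T& value() const { return value_; }
    const std::string& message() const { return message_; }

private:
    Result(bool success, T value, std::string message)
        : success_(success), value_(std::move(value)), message_(std::move(message)) {}

    bool success_;
    T value_;
    std::string message_;
};

// Hypothetical source method: reads one byte from an in-memory buffer
// instead of returning -1 when no data is left.
inline Result<int> readChar(const std::string& buffer, std::size_t position) {
    if (position >= buffer.size()) {
        return Result<int>::fail("No data available in stream");
    }
    return Result<int>::ok(static_cast<unsigned char>(buffer[position]));
}
```

A caller would check `success()` first and only then use `value()`, or log `message()` on failure.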

extract mode will not auto-create and find .fqi file for FASTQ, if -i is not set.

Optimize extraction initialization speed

The current algorithm is fairly stupid:

Load the fqi
Read index entries until you find the starting one
Start extraction with it.

As fqi files can be quite large (e.g. a 16GB FASTQ file results in a 150MB fqi by default), the start of the extraction can take quite a while. We can, however, shortcut this by adding some more information to the fqi:

Number of lines in the source file
Number of index entries in the fqi file (can also be calculated)
The first line in each block as a list

This way, we can load the very small list of line starts (8 bytes per stored index entry), calculate which block we need to start with, jump to the right location in the fqi and read the entry.
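
The lookup described above could be done with a binary search over the list of line starts. The sketch below assumes the list is sorted in ascending order and non-empty; the function name `findStartEntry` is hypothetical and not part of the FastqIndEx sources.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// firstLines[i] holds the 0-based number of the first line stored for index entry i.
// Returns the position of the last entry whose first line is <= wantedLine.
// Assumes firstLines is sorted ascending and not empty.
inline std::size_t findStartEntry(const std::vector<std::uint64_t>& firstLines,
                                  std::uint64_t wantedLine) {
    // upper_bound finds the first entry that starts after wantedLine;
    // the entry right before it is the one to start the extraction with.
    auto it = std::upper_bound(firstLines.begin(), firstLines.end(), wantedLine);
    if (it == firstLines.begin()) {
        return 0;  // wantedLine lies before the first entry: start at the top
    }
    return static_cast<std::size_t>(it - firstLines.begin()) - 1;
}
```

The search needs only O(log n) comparisons on the in-memory list, so the full index entries do not have to be read one by one.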

Add different index entry storage strategies.

Currently, there is only the block distance strategy (with or without failsafe). In addition, there should also be a byte based distance strategy.

Add GitHub page? Or RTD?

Make three separate projects: GZLineIndEx, FastqIndEx and FastqIndExLib

Why? Because the tool is actually independent from FASTQ and can be used on any gz compressed text file containing (hundreds/thousands/millions/billions) of lines. FastqIndEx is only the special case with a fixed record size of four.

Allow retrieval of multiple intervals/partitions in one command

e.g. with segment/partition selector:

fastqindex extract -f=... -i=... --segments=8 --selectedsegment=1,4,8
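
Parsing the proposed comma-separated selector could look like the sketch below. The function name `parseSegmentList` is hypothetical, and the proposed option does not exist in the tool yet.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: turns "1,4,8" into {1, 4, 8}.
// std::stoul throws std::invalid_argument if an item does not start with a number.
inline std::vector<unsigned long> parseSegmentList(const std::string& text) {
    std::vector<unsigned long> segments;
    std::stringstream stream(text);
    std::string item;
    while (std::getline(stream, item, ',')) {
        segments.push_back(std::stoul(item));
    }
    return segments;
}
```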

Current implementation does not allow to process concatenated gz files.

Extend error handling while calling readCompressedDataFromSource()

Is there actually a way to say what kind of error occurred, e.g. a pipe error or an I/O error? If so, this information should be added to the error message, because it enormously helps error diagnosis when using the tool.

Originally posted by @vinjana in https://github.com/DKFZ-ODCF/FastqIndEx/diffs

Using ~ for home in e.g. -i=~/... parameter does not work.

FastqIndEx will complain that it won't get a lockfile and abort.
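
A likely cause is that the shell only expands ~ at the start of a word, so -i=~/... reaches the application unchanged. One possible fix is to expand a leading ~ inside the application. The sketch below uses a hypothetical function `expandTilde`; the caller is assumed to pass in the home directory, for example from the HOME environment variable.

```cpp
#include <string>

// Hypothetical helper: replaces a leading "~" or "~/" with the given home directory.
// Forms like "~user/..." and all other paths are returned unchanged.
inline std::string expandTilde(const std::string& path, const std::string& home) {
    if (path == "~") {
        return home;
    }
    if (path.size() >= 2 && path[0] == '~' && path[1] == '/') {
        return home + path.substr(1);
    }
    return path;
}
```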

Add test for stats mode to ensure that it is running.

Mode is not critical but failed once because it is not yet covered with tests.

Allow random output of records or segments

This could be especially useful for testing workflows:

fqi extract -f=... -randomrecords[=10000]
# OR
fqi extract -f=... -randomsegment=16

where randomrecords would extract a fixed (here 10k) or random number of records from the FASTQ
OR
randomsegment would extract a random segment of (in this case 16) n segments

Implement .bz2 support

Add / Update / Extend tests for the command line interface.

Create md5sum for Index file, use it for extract

Create Buildscript

Runtime check: Can the requested number of lines (or the maximum number of lines) be extracted?

Why? Because this is a very easy consistency check!

The number of lines in the original file is stored in the FQI header.

When extracting, you can

... extract the requested number of lines or just up to the end of the file. The binary then knows that no more lines can be read and everything is alright (exit code == 0; tolerant number check).
... extract exactly the requested number of lines. If that number is not available or cannot be extracted, an error is reported (exit code != 0; strict number check).

This is related to #42
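
The tolerant and strict checks could be sketched like this, assuming the total number of lines is known from the FQI header. The names `ExtractionPlan` and `planExtraction` are hypothetical, and the exit code 1 for the strict failure is an arbitrary choice.

```cpp
#include <cstdint>

// Hypothetical illustration type; not taken from the FastqIndEx sources.
struct ExtractionPlan {
    std::uint64_t linesToExtract;
    int exitCode;  // 0 = fine, non-zero = request cannot be satisfied
};

// totalLines: number of lines in the original file (from the FQI header).
// firstLine: 0-based line at which the extraction starts.
inline ExtractionPlan planExtraction(std::uint64_t totalLines, std::uint64_t firstLine,
                                     std::uint64_t requestedLines, bool strict) {
    const std::uint64_t available = firstLine < totalLines ? totalLines - firstLine : 0;
    if (requestedLines <= available) {
        return ExtractionPlan{requestedLines, 0};
    }
    if (strict) {
        return ExtractionPlan{0, 1};  // strict check: refuse to extract less than requested
    }
    return ExtractionPlan{available, 0};  // tolerant check: extract up to the end
}
```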

S3 input/output

Read FASTQ data from S3, for

read URL from the command line (s3://user:password@server:port/bucket/file)
read/write all files, also index from S3

Improve S3 error recognition for download (And also in general)

I think for the S3 helper, there are two cases:

The application stops without an exit code => Thus all is fine, because the file was downloaded completely.
The application stops with an error! In this case, we need to check whether the number of records read equals the number of records requested. If it does, all is good.

Compress dictionary before writing an index entry to the index file.

This can reduce the size of the index by around 60%.

Rework text elements like temporary folders/files and help texts

Extend CLI interface

The current implementation is not very nice, we might also need to use something else than the Boost CLI library. Also some things are missing.

What I had in mind is:

# Index taking:
# - The mandatory positional compressed FASTQ file argument (or - for stdin)
# - The optional positional indexfile argument
# - The optional blockinterval (Store only every nth IndexEntry)
fastqindex index ( - | (FASTQ)file ) [indexfile] [--blockinterval 1..n]

# Extract taking:
# - The mandatory positional compressed FASTQ file argument
# - The optional positional indexfile argument 
# - The mandatory positional output FASTQ file argument (or - for stdout)
# - The optional extractionmultiplier argument which defaults to 4 for FASTQ files
#   The tool can also be used for regular text files
# - The optional disablefastqchecks argument, which will (in the future) disable further checks for FASTQ consistency
fastqindex extract (FASTQ)file [indexfile] ( - | (FASTQ)file) [--extractionmultiplier 1..n]  [--disablefastqchecks]

Improve quality of explanation for the building and integration of the S3 libraries

Add checksum for first lines of each block to each index entry.

This allows further checks upon extraction:

Was there an error with the dictionary
Are the correct lines correctly extracted

For the latter, it might be necessary to load all checksums into memory, when extraction starts. This will increase the memory usage but also increase security.
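
The issue does not name a checksum algorithm. As one possible choice, the sketch below uses 64-bit FNV-1a, which is small and dependency-free but not cryptographically secure. The function names are hypothetical.

```cpp
#include <cstdint>
#include <string>

// 64-bit FNV-1a over the bytes of a line (an assumed choice, not the project's).
inline std::uint64_t lineChecksum(const std::string& line) {
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char byte : line) {
        hash ^= byte;
        hash *= 1099511628211ULL;  // FNV prime
    }
    return hash;
}

// Compares the first extracted line of a block with the checksum stored in its index entry.
inline bool firstLineMatches(const std::string& extractedLine, std::uint64_t storedChecksum) {
    return lineChecksum(extractedLine) == storedChecksum;
}
```

Storing one such 8-byte value per index entry would keep the extra memory use small when all checksums are loaded at the start of an extraction.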

Make messages shorter without loss of clarity or correctness.

Actually, you could try making the messages shorter without loss of clarity or correctness.

"No data available in stream"

Imagine that somebody has to read all the prose in the logs!

Originally posted by @vinjana in https://github.com/DKFZ-ODCF/FastqIndEx/diffs

Get rid of Boost and lift to C++17

Unfortunately, I had a lot of trouble getting things running with Boost (static linking, Boost version upgrade, ...). As we use it effectively only for pointer and filesystem related stuff, we can just use the new C++ includes. Also we need another library for the CLI interface. Unresolved is the question of network-safe multi-reader/writer synchronization.

Allow extraction by defining file segments and segment number

E.g. tell the extractor to partition the FASTQ / gzip file by a specific number and which of these parts to extract:

fastqindex extract -f=... -i=... --segments=8 --selectedsegment=1
fastqindex extract -f=... -i=... --segments=8 --selectedsegment=2
fastqindex extract -f=... -i=... --segments=8 --selectedsegment=3
...
fastqindex extract -f=... -i=... --segments=8 --selectedsegment=8
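
One way to turn a segment count and a 1-based segment number into a record range is an even split, where the first segments take one extra record if the total is not divisible. How FastqIndEx actually places segment boundaries is not stated here, so the sketch below is an assumption. The names `RecordRange` and `segmentRange` are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>

// Hypothetical illustration type; not taken from the FastqIndEx sources.
struct RecordRange {
    std::uint64_t firstRecord;  // 0-based
    std::uint64_t recordCount;
};

// Splits totalRecords into `segments` nearly equal parts and returns the part
// with the 1-based number `selected`.
inline RecordRange segmentRange(std::uint64_t totalRecords, std::uint64_t segments,
                                std::uint64_t selected) {
    if (segments == 0 || selected == 0 || selected > segments) {
        throw std::out_of_range("selected segment must be in 1..segments");
    }
    const std::uint64_t base = totalRecords / segments;
    const std::uint64_t extra = totalRecords % segments;  // first `extra` segments get one more record
    const std::uint64_t index = selected - 1;
    const std::uint64_t first = index * base + std::min(index, extra);
    const std::uint64_t count = base + (index < extra ? 1 : 0);
    return RecordRange{first, count};
}
```

Running the function for every segment number from 1 to n covers each record exactly once, which is what makes the n extract calls above suitable for parallel processing.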

Implement a state model for e.g. the ZLibBasedFASTQProcessorBase

The states, modelled here as booleans, depend on each other, which allows for inconsistent state combinations!

A normalized way to represent this would be a state graph, where each node represents a valid combination of these booleans. A design that avoids misuse is better than one which requires knowledge about what is possible and what is not.

Originally posted by @vinjana in https://github.com/DKFZ-ODCF/FastqIndEx/diffs
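
A minimal sketch of such a state model is shown below. The state names and allowed transitions are hypothetical, since the actual booleans of ZLibBasedFASTQProcessorBase are not listed here. The point is only that a single enum plus a transition check cannot represent an inconsistent combination.

```cpp
// Hypothetical states; the real processor may need different ones.
enum class ProcessorState { Initialized, Running, Finished, Failed };

// Encodes the edges of the state graph in one place.
inline bool canTransition(ProcessorState from, ProcessorState to) {
    switch (from) {
        case ProcessorState::Initialized:
            return to == ProcessorState::Running || to == ProcessorState::Failed;
        case ProcessorState::Running:
            return to == ProcessorState::Finished || to == ProcessorState::Failed;
        case ProcessorState::Finished:
        case ProcessorState::Failed:
            return false;  // terminal states
    }
    return false;
}

class StateMachine {
public:
    ProcessorState state() const { return state_; }

    // Performs the transition only if the graph allows it.
    bool moveTo(ProcessorState next) {
        if (!canTransition(state_, next)) {
            return false;
        }
        state_ = next;
        return true;
    }

private:
    ProcessorState state_ = ProcessorState::Initialized;
};
```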
