dkfz-odcf / fastqindex

A tool to index and extract data from gzipped FASTQ files

License: MIT License

CMake 2.38% C++ 96.33% C 0.89% Shell 0.40% Roff 0.01%
extraction fastq fastq-files gzip index indexing ngs random-access reads stream zlib

fastqindex's Issues

Get rid of Boost and lift to C++17

Unfortunately, I had a lot of trouble getting things running with Boost (static linking, Boost version upgrade, ...). As we effectively use it only for pointer and filesystem related stuff, we can just use the new C++17 standard headers instead. We also need another library for the CLI. The question of network-safe multi-reader/writer synchronization is still unresolved.
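
To illustrate the pointer and filesystem part, here is a minimal sketch using only `<memory>` and `<filesystem>` from C++17. The function names and the `.fqi` suffix convention are assumptions made for the example, not the project's actual code.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <memory>

namespace fs = std::filesystem;  // stands in for boost::filesystem

// Hypothetical helper: derive an index file name by appending ".fqi".
fs::path defaultIndexPath(const fs::path &fastq) {
    assert(!fastq.empty());
    fs::path index = fastq;
    index += ".fqi";  // operator+= appends without adding a path separator
    return index;
}

// Hypothetical helper: std::shared_ptr stands in for boost::shared_ptr.
// Returns nullptr if the path is not a regular file or cannot be opened.
std::shared_ptr<std::ifstream> openIfRegularFile(const fs::path &file) {
    if (!fs::is_regular_file(file)) {
        return nullptr;
    }
    auto stream = std::make_shared<std::ifstream>(file, std::ios::binary);
    if (!stream->is_open()) {
        return nullptr;
    }
    return stream;
}
```

On older GCC releases `std::filesystem` may require linking against `stdc++fs`, which is worth checking for the static builds mentioned above.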

Add checksum for first lines of each block to each index entry.

This allows further checks upon extraction:

  • Was there an error with the dictionary?
  • Were the correct lines extracted correctly?

For the latter, it might be necessary to load all checksums into memory when extraction starts. This will increase memory usage, but it will also make extraction safer.
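
A rough sketch of how such a per-block checksum could be computed and verified. It uses FNV-1a purely as a dependency-free placeholder; the real index might store a CRC or something else. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a, continued from a previous hash state.
// Placeholder for whatever checksum the index would actually store.
std::uint64_t fnv1a64(const std::string &data,
                      std::uint64_t hash = 14695981039346656037ULL) {
    for (unsigned char c : data) {
        hash ^= c;
        hash *= 1099511628211ULL;
    }
    return hash;
}

// Checksum over the first `count` lines of a decompressed block.
std::uint64_t checksumFirstLines(const std::vector<std::string> &lines,
                                 std::size_t count) {
    assert(count > 0);
    std::uint64_t hash = 14695981039346656037ULL;
    for (std::size_t i = 0; i < lines.size() && i < count; ++i) {
        hash = fnv1a64(lines[i], hash);
        hash = fnv1a64("\n", hash);  // include the terminator so line boundaries matter
    }
    return hash;
}

// Extraction-time check against the value stored in the index entry.
bool blockMatchesIndex(const std::vector<std::string> &lines,
                       std::size_t count, std::uint64_t storedChecksum) {
    return checksumFirstLines(lines, count) == storedChecksum;
}
```

At 8 bytes per index entry, keeping all checksums in memory costs about as much as the list of first line numbers discussed in the initialization speed issue.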

S3 input/output

Read FASTQ data from S3. This means:

  • read the URL from the command line (s3://user:password@server:port/bucket/file)
  • read/write all files from/to S3, including the index
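
A minimal sketch of splitting a URL of the shape given above into its parts. The `S3Location` struct and `parseS3Url` function are hypothetical, and the pattern assumes that neither the user nor the password contains `@` or `/`.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Hypothetical container for the URL components.
struct S3Location {
    std::string user;
    std::string password;
    std::string server;
    int port;
    std::string bucket;
    std::string key;  // object name inside the bucket, may contain '/'
};

// Parses s3://user:password@server:port/bucket/file.
// Returns std::nullopt if the URL does not have exactly this shape.
std::optional<S3Location> parseS3Url(const std::string &url) {
    static const std::regex pattern(
        R"(s3://([^:@/]+):([^@/]*)@([^:/]+):(\d{1,5})/([^/]+)/(.+))");
    std::smatch m;
    if (!std::regex_match(url, m, pattern)) {
        return std::nullopt;
    }
    const int port = std::stoi(m[4].str());
    if (port < 1 || port > 65535) {
        return std::nullopt;
    }
    return S3Location{m[1].str(), m[2].str(), m[3].str(),
                      port, m[5].str(), m[6].str()};
}
```

Passing credentials on the command line exposes them in the process list, so a real implementation may prefer to read them from the environment or a config file.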

Improve S3 error recognition for download (And also in general)

I think there are two cases for the S3 helper:

  1. The application stops without an error exit code. All is fine, because the file was downloaded completely.
  2. The application stops with an error. In this case, we need to check whether the number of records read equals the number of records requested. If it does, all is good.
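
The two cases boil down to a small decision function. This is only a sketch: the function name is made up, and it assumes that a clean stop of the helper is reported as exit code 0.

```cpp
#include <cassert>
#include <cstdint>

// Decide whether a download via the S3 helper can be treated as successful.
// Assumption: helperExitCode == 0 means the helper stopped without an error.
bool downloadSucceeded(int helperExitCode,
                       std::uint64_t recordsRead,
                       std::uint64_t recordsRequested) {
    if (helperExitCode == 0) {
        return true;  // case 1: file was downloaded completely
    }
    // case 2: the helper failed, e.g. because we stopped reading early;
    // this is acceptable only if we already have everything we asked for
    return recordsRead == recordsRequested;
}
```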

Runtime check: Can the requested number of lines (or the maximum number of lines) be extracted?

Why? Because this is a very easy consistency check!

The number of lines in the original file is stored in the FQI header.

When extracting, you can

  1. ... extract the requested number of lines, or just up to the end of the file. The binary then knows that no more lines can be read and everything is alright (exit code == 0; tolerant number check).
  2. ... extract exactly the requested number of lines. If that number is not available or cannot be extracted, an error is reported (exit code != 0; strict number check).
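
A sketch of how both checks could be decided up front from the line count in the FQI header. The types and the concrete non-zero exit code are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

enum class CountCheck { Tolerant, Strict };

// Hypothetical result: how many lines to extract and which exit code to use.
struct ExtractionPlan {
    std::uint64_t lineCount;
    int exitCode;
};

// totalLines comes from the FQI header; firstLine is the zero-based start line.
ExtractionPlan planExtraction(std::uint64_t totalLines,
                              std::uint64_t firstLine,
                              std::uint64_t requestedLines,
                              CountCheck mode) {
    const std::uint64_t available =
        firstLine < totalLines ? totalLines - firstLine : 0;
    if (requestedLines <= available) {
        return {requestedLines, 0};  // request can be fully served
    }
    if (mode == CountCheck::Tolerant) {
        return {available, 0};  // extract up to the end, still a success
    }
    return {0, 1};  // strict: refuse, exit code != 0 (1 is a placeholder)
}
```

For a file with 1000 lines, starting at line 900 and requesting 200 lines, the tolerant check yields 100 lines with exit code 0, while the strict check fails before anything is extracted.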

This is related to #42

Optimize extraction initialization speed

The current algorithm is fairly stupid:

  1. Load the fqi
  2. Read index entries until you find the starting one
  3. Start extraction with it.

As fqi files can be quite large (e.g. a 16 GB FASTQ results in a 150 MB fqi with the default settings), the start of the extraction can take quite a while. We can however shortcut this by adding some more information to the fqi:

  • Number of lines in the source file
  • Number of index entries in the fqi file (can also be calculated)
  • The first line in each block as a list

This way, we can load the very small list of line starts (8 bytes per stored index entry), calculate which block we need to start with, jump to the right location in the fqi, and read the entry.
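
The lookup can be sketched as a binary search over the list of first lines. The sketch assumes that the list is sorted ascending and that index entries have a fixed size on disk, so the file offset can be computed. The header and entry sizes are parameters here, since the actual layout is not specified in this issue.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// firstLines[i] is the number of the first line stored in block i (ascending).
// Returns the index of the block that contains `line`.
std::size_t findStartBlock(const std::vector<std::uint64_t> &firstLines,
                           std::uint64_t line) {
    assert(!firstLines.empty() && firstLines.front() <= line);
    // first block that starts after `line`; the one before it contains `line`
    auto it = std::upper_bound(firstLines.begin(), firstLines.end(), line);
    return static_cast<std::size_t>(it - firstLines.begin()) - 1;
}

// Byte offset of an index entry, assuming fixed-size entries after the header.
std::uint64_t entryOffset(std::size_t blockIndex,
                          std::uint64_t headerSize,
                          std::uint64_t entrySize) {
    return headerSize + static_cast<std::uint64_t>(blockIndex) * entrySize;
}
```

This replaces the linear scan over all index entries with O(log n) comparisons on a small in-memory list, followed by a single seek.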

Extend CLI interface

The current implementation is not very nice, and we might need to use something other than the Boost CLI library. Some things are also missing.

What I had in mind is:

# Index taking:
# - The mandatory positional compressed FASTQ file argument (or - for stdin)
# - The optional positional indexfile argument
# - The optional blockinterval (Store only every nth IndexEntry)
fastqindex index ( - | (FASTQ)file ) [indexfile] [--blockinterval 1..n]

# Extract taking:
# - The mandatory positional compressed FASTQ file argument
# - The optional positional indexfile argument 
# - The mandatory positional output FASTQ file argument (or - for stdout)
# - The optional extractionmultiplier argument which defaults to 4 for FASTQ files
#   The tool can also be used for regular text files
# - The optional disablefastqchecks argument, which will (in the future) disable further checks for FASTQ consistency
fastqindex extract (FASTQ)file [indexfile] ( - | (FASTQ)file ) [--extractionmultiplier 1..n] [--disablefastqchecks]

Allow random output of records or segments

This could be especially useful for testing workflows:

fqi extract -f=... -randomrecords[=10000]
# OR
fqi extract -f=... -randomsegment=16

where randomrecords would extract a fixed number (here 10k) or a random number of records from the FASTQ,
OR
randomsegment would extract one randomly chosen segment out of n segments (here n = 16).
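
A small sketch of the random selection itself. It assumes that segments are numbered from 1 and that randomrecords extracts one contiguous run of records starting at a random position; other interpretations, such as sampling scattered records, are possible. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Pick one of `segments` segments, numbered 1..segments.
std::uint32_t pickRandomSegment(std::uint32_t segments, std::mt19937_64 &rng) {
    assert(segments > 0);
    std::uniform_int_distribution<std::uint32_t> dist(1, segments);
    return dist(rng);
}

// Pick a zero-based start record so that `wanted` records still fit
// before the end of a file with `totalRecords` records.
std::uint64_t pickRandomStartRecord(std::uint64_t totalRecords,
                                    std::uint64_t wanted,
                                    std::mt19937_64 &rng) {
    assert(wanted <= totalRecords);
    std::uniform_int_distribution<std::uint64_t> dist(0, totalRecords - wanted);
    return dist(rng);
}
```

Taking the generator as a parameter lets a test workflow pass a fixed seed and get reproducible extractions.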

Allow extraction by defining file segments and segment number

E.g. tell the extractor to partition the FASTQ / gzip file into a given number of segments, and which of these segments to extract:

fastqindex extract -f=... -i=... --segments=8 --selectedsegment=1
fastqindex extract -f=... -i=... --segments=8 --selectedsegment=2
fastqindex extract -f=... -i=... --segments=8 --selectedsegment=3
...
fastqindex extract -f=... -i=... --segments=8 --selectedsegment=8
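
One way the segment boundaries could be computed is sketched below. It assumes that the file is partitioned by record count (4 lines per FASTQ record) and that segments are numbered from 1 as in the calls above. Partitioning by compressed byte ranges or by index blocks would be an alternative. Names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Zero-based first line and number of lines of one segment.
struct LineRange {
    std::uint64_t firstLine;
    std::uint64_t lineCount;
};

// Split totalRecords into `segments` nearly equal parts; the first
// (totalRecords % segments) segments get one record more than the rest.
LineRange segmentRange(std::uint64_t totalRecords,
                       std::uint32_t segments,
                       std::uint32_t selectedSegment,
                       std::uint32_t linesPerRecord = 4) {
    assert(segments > 0 && selectedSegment >= 1 && selectedSegment <= segments);
    const std::uint64_t base = totalRecords / segments;
    const std::uint64_t extra = totalRecords % segments;
    const std::uint64_t idx = selectedSegment - 1;
    const std::uint64_t firstRecord = idx * base + std::min(idx, extra);
    const std::uint64_t records = base + (idx < extra ? 1 : 0);
    return {firstRecord * linesPerRecord, records * linesPerRecord};
}
```

Splitting on record boundaries keeps every segment a valid FASTQ on its own, and the segments together cover the file exactly once.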
