dkfz-odcf / fastqindex
A tool to index and extract data from gzipped FASTQ files
License: MIT License
The current test data is small. I ran a few tests and they worked well, but this has to be repeated with a much larger set of FASTQ files before the first release.
Unfortunately, I had a lot of trouble getting things running with Boost (static linking, Boost version upgrade, ...). As we effectively use it only for pointer and filesystem related functionality, we can just use the standard C++ headers instead. We also need another library for the command line interface. The question of network-safe multi-reader/writer synchronization is still unresolved.
This allows further checks upon extraction:
For the latter, it might be necessary to load all checksums into memory when extraction starts. This will increase memory usage, but it will also increase security.
Read FASTQ data from S3, for
Why? Because in the current implementation, methods like readChar, read or write are very C/C++'ish and return e.g. -1 if "bad things happen". With the Return type, we can have a modern approach which
a) returns success or not
b) returns a result, if successful
c) can hold a message, describing the nature of the error.
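A minimal sketch of such a type, to make points a) to c) concrete. The class name `Result`, its methods, and the `readChar` signature below are assumptions for illustration and not the actual FastqIndEx API. The sketch also assumes that `T` is default-constructible.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical result type; names are illustrative, not the project's API.
template <typename T>
class Result {
public:
    // Build a successful result carrying a value.
    static Result ok(T value) { return Result(true, std::move(value), ""); }

    // Build a failed result carrying only an error message.
    static Result error(std::string message) {
        return Result(false, T{}, std::move(message));
    }

    // a) success or not
    bool isOk() const { return success_; }

    // b) the result, only meaningful if successful
    const T& value() const {
        assert(success_ && "value() called on a failed Result");
        return value_;
    }

    // c) a message describing the nature of the error (empty on success)
    const std::string& message() const { return message_; }

private:
    Result(bool success, T value, std::string message)
        : success_(success), value_(std::move(value)), message_(std::move(message)) {}

    bool success_;
    T value_;
    std::string message_;
};

// Hypothetical reader that reports "no data" through Result instead of -1.
Result<char> readChar(const std::string& buffer, std::size_t position) {
    if (position >= buffer.size()) {
        return Result<char>::error("No data available in stream");
    }
    return Result<char>::ok(buffer[position]);
}
```

A caller would check `isOk()` first and only then use `value()`; on failure, `message()` can go straight into the log.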
e.g. with segment/partition selector:
fastqindex extract -f=... -i=... --segments=8 --selectedsegment=1,4,8
The states modelled here as boolean are dependent on each other, which allows for inconsistent state combinations!
A normalized way to represent this would be a state graph, where each node represents one valid combination of these booleans. A design that prevents misuse is better than one that requires knowledge about what is possible and what is not.
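One way to sketch the state graph idea is a single enum plus an explicit transition check. The state names (`Closed`, `Open`, `Finished`, `Failed`) and the class `StreamStateMachine` are hypothetical; the real class may need different states and edges.

```cpp
#include <cassert>

// Hypothetical states replacing a set of interdependent booleans.
enum class StreamState { Closed, Open, Finished, Failed };

// Returns true if the (assumed) state graph has an edge from 'from' to 'to'.
bool isValidTransition(StreamState from, StreamState to) {
    switch (from) {
        case StreamState::Closed:
            return to == StreamState::Open;
        case StreamState::Open:
            return to == StreamState::Finished || to == StreamState::Failed;
        case StreamState::Finished:
        case StreamState::Failed:
            return false;  // terminal states
    }
    return false;
}

class StreamStateMachine {
public:
    StreamState state() const { return state_; }

    // Moves to 'next' only if the graph allows it; reports whether it did.
    bool transitionTo(StreamState next) {
        if (!isValidTransition(state_, next)) {
            return false;
        }
        state_ = next;
        return true;
    }

private:
    StreamState state_ = StreamState::Closed;
};

// Usage sketch: a finished stream cannot be reopened.
inline void exampleUsage() {
    StreamStateMachine machine;
    bool opened = machine.transitionTo(StreamState::Open);
    bool finished = machine.transitionTo(StreamState::Finished);
    bool reopened = machine.transitionTo(StreamState::Open);
    assert(opened && finished && !reopened);
    (void)opened; (void)finished; (void)reopened;  // silence warnings with NDEBUG
}
```

Because there is only one state variable, combinations such as "finished and failed at the same time" cannot be expressed at all.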
Originally posted by @vinjana in https://github.com/DKFZ-ODCF/FastqIndEx/diffs
Actually, you could try making the messages shorter without loss of clarity or correctness.
"No data available in stream"
Imagine that somebody has to read all the prose in the logs!
Originally posted by @vinjana in https://github.com/DKFZ-ODCF/FastqIndEx/diffs
I think for the S3 helper, there are two cases:
Why? Because this is a very easy consistency check!
The number of lines in the original file is stored in the FQI header.
When extracting, you can
This is related to #42
If the file grows too large (half the size of the FASTQ?), the indexer should stop and report this.
The mode is not critical, but it failed once because it is not yet covered by tests.
Currently, there is only the block distance strategy (with or without failsafe). In addition, there should also be a byte based distance strategy.
The current algorithm is fairly stupid:
As fqi files can be quite large (e.g. a 16 GB FASTQ results in a 150 MB fqi by default), the start of the extraction will take quite a while. We can, however, shortcut this by adding some more information to the fqi:
This way, we can load the very small list of line starts (8 bytes per stored index entry), calculate which block we need to start with, jump to the right location in the fqi and read the entry.
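The lookup described above could be sketched as a binary search over the list of line starts. The function names, the fixed entry size, and the header size are assumptions for illustration; the actual fqi layout may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// 'lineStarts' holds, for each stored index entry, the number of the first
// line in that block, in ascending order. Returns the position of the entry
// whose block contains 'line'.
std::size_t findEntryForLine(const std::vector<std::uint64_t>& lineStarts,
                             std::uint64_t line) {
    assert(!lineStarts.empty() && lineStarts.front() <= line);
    // upper_bound finds the first entry that starts after 'line';
    // the entry directly before it is the one that contains 'line'.
    auto it = std::upper_bound(lineStarts.begin(), lineStarts.end(), line);
    return static_cast<std::size_t>(std::distance(lineStarts.begin(), it)) - 1;
}

// Byte offset of an entry in the fqi, assuming fixed-size entries that
// follow a header. Both sizes are hypothetical parameters.
std::uint64_t entryOffset(std::size_t entryPosition,
                          std::uint64_t headerSize,
                          std::uint64_t entrySize) {
    return headerSize + static_cast<std::uint64_t>(entryPosition) * entrySize;
}
```

With this, only the list of line starts and a single index entry have to be read before decompression can begin, instead of the whole fqi.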
Is there actually a way to say what kind of error occurred? E.g. a pipe error or an IO error? If so, this information should be added to the error message, because that enormously helps error diagnosis when using the tool.
Originally posted by @vinjana in https://github.com/DKFZ-ODCF/FastqIndEx/diffs
FastqIndEx will complain that it cannot get a lockfile and abort.
Like:
Why? Because the tool is actually independent from FASTQ and can be used on any gz compressed text file containing (hundreds/thousands/millions/billions) of lines. FastqIndEx is only the special case with a fixed record size of four.
The current implementation is not very nice, and we might also need to use something other than the Boost CLI library. Some things are also missing.
What I had in mind is:
# Index taking:
# - The mandatory positional compressed FASTQ file argument (or - for stdin)
# - The optional positional indexfile argument
# - The optional blockinterval (Store only every nth IndexEntry)
fastqindex index ( - | (FASTQ)file ) [indexfile] [--blockinterval 1..n]
# Extract taking:
# - The mandatory positional compressed FASTQ file argument
# - The optional positional indexfile argument
# - The mandatory positional output FASTQ file argument (or - for stdout)
# - The optional extractionmultiplier argument which defaults to 4 for FASTQ files
# The tool can also be used for regular text files
# - The optional disablefastqchecks argument, which will (in the future) disable further checks for FASTQ consistency
fastqindex extract (FASTQ)file [indexfile] ( - | (FASTQ)file) [--extractionmultiplier 1..n] [--disablefastqchecks]
This could be especially useful for testing workflows:
fqi extract -f=... -randomrecords[=10000]
# OR
fqi extract -f=... -randomsegment=16
Here, randomrecords would extract a fixed (here 10k) or random number of records from the FASTQ, and randomsegment would extract one random segment out of n segments (16 in this case).
This can reduce the size of the index by around 60%.
E.g. tell the extractor to partition the FASTQ / gzip file by a specific number and which of these parts to extract:
fastqindex extract -f=... -i=... --segments=8 --selectedsegment=1
fastqindex extract -f=... -i=... --segments=8 --selectedsegment=2
fastqindex extract -f=... -i=... --segments=8 --selectedsegment=3
...
fastqindex extract -f=... -i=... --segments=8 --selectedsegment=8
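A sketch of how a selected segment could be mapped to a range of records, assuming the total number of records is known (e.g. derived from the line count). The struct `RecordRange`, the function `segmentRange`, and the choice to give the leftover records to the first segments are assumptions, not the tool's actual behaviour.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Half-open range [first, last) of record numbers, counted from 0.
struct RecordRange {
    std::uint64_t first;
    std::uint64_t last;
};

// 'selected' is 1-based, as in --selectedsegment.
RecordRange segmentRange(std::uint64_t totalRecords,
                         std::uint64_t segments,
                         std::uint64_t selected) {
    assert(segments > 0 && selected >= 1 && selected <= segments);
    const std::uint64_t base = totalRecords / segments;
    const std::uint64_t remainder = totalRecords % segments;
    const std::uint64_t index = selected - 1;
    // The first 'remainder' segments each take one extra record, so all
    // records are covered and segment sizes differ by at most one.
    const std::uint64_t first = index * base + std::min(index, remainder);
    const std::uint64_t size = base + (index < remainder ? 1 : 0);
    return {first, first + size};
}
```

For a FASTQ file, the corresponding line range follows by multiplying both bounds by the record size of four lines.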