marcelm / dnaio
Efficiently read and write sequencing data from Python
Home Page: https://dnaio.readthedocs.io/
License: MIT License
Hi,
I recently stumbled into an error when trying to open an empty file that does not have a regular extension (fasta, fastq, ...) using open with auto-detection (fileformat=None). Quick example:
$ python3 -m venv .venv
$ source .venv/bin/activate
(.venv) $ pip install dnaio
Collecting dnaio
Using cached dnaio-0.4.2-cp38-cp38-manylinux1_x86_64.whl (126 kB)
Collecting xopen>=0.8.2
Using cached xopen-0.9.0-py2.py3-none-any.whl (8.3 kB)
Installing collected packages: xopen, dnaio
Successfully installed dnaio-0.4.2 xopen-0.9.0
(.venv) $ python3
Python 3.8.2 (default, Apr 27 2020, 15:53:34)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dnaio
>>> import pathlib
>>> pathlib.Path('test.tmp').touch()
>>> dnaio.open('test.tmp')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".venv/lib/python3.8/site-packages/dnaio/__init__.py", line 108, in open
return _open_single(
File ".venv/lib/python3.8/site-packages/dnaio/__init__.py", line 177, in _open_single
fileformat = _detect_format_from_content(file)
File ".venv/lib/python3.8/site-packages/dnaio/__init__.py", line 203, in _detect_format_from_content
file.seek(-1, 1)
OSError: [Errno 22] Invalid argument
>>> dnaio.open('test.tmp', fileformat='Fasta')
<dnaio.readers.FastaReader object at 0x7f0910594880>
I understand that in this case it is not possible to infer a file type, but I would like to suggest using a more detailed error message that explains what is going on.
Thank you for your time!
Dries
Hi,
I am a long-time user of cutadapt.
I tried to update recently, but I encounter this error while installing dnaio:
building 'dnaio._core' extension
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DUSE_SSE2 -I/usr/include/python3.8 -c src/dnaio/_core.c -o build/temp.linux-x86_64-cpython-38/src/dnaio/_core.o
x86_64-linux-gnu-gcc: error: src/dnaio/_core.c: Aucun fichier ou dossier de ce type (No such file or directory)
x86_64-linux-gnu-gcc: fatal error: no input files
compilation terminated.
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
I can't figure out why _core.c is missing.
The buffer protocol is basically a pointer to a struct that contains two function pointers: one function to acquire the buffer and one to release it. Simple buffers only have a buffer acquire function. This function increases the reference count of the object and uses the internal data pointer of the object. Since Python automatically decreases the reference count of the buffer-exporting object on release, an external release function is not required. This is what is used for bytes objects. More complex buffer objects do some extra work, such as setting a lock value on the object, to ensure it can't be changed while exporting a buffer.
Why support this for SequenceRecord? Well, because it is possible to write a FASTQ record to a block of memory, expose this as the buffer and then free the memory again afterwards. This basically adds a __bytes__ method to SequenceRecord. It also allows SequenceRecord objects to be written directly to a Python file, without using the fastq_bytes method. This shaves off a little overhead.
I faced the same choice when programming the BamRecord object for htspy, and decided not to do it (yet) because I feel a to_bytes method is more intuitive, and a new buffer needs to be created anyway. So it is best to communicate this to users even if it has a minor performance penalty (the method lookup). Nevertheless I am curious about your opinion on this.
Probably just these, which are interfaces that the FastaWriter/FastqReader etc. classes implement:
SequenceReader
SequenceWriter
PairedSequenceReader
PairedSequenceWriter
with overloads, similar to pycompression/xopen#112
Hi @marcelm,
I am revising the miRge3.0 code, which uses cutadapt for read trimming (for single-end data). Here I am trying to pass reads in chunks. Below is the code, followed by the error:
buffer_size = 4 * 1024**2
with xopen(infile, "rb") as f:
tlds = [i for i in dnaio.read_chunks(f, buffer_size)]
chunks_ind = []
for i in tlds:
z = io.BytesIO(i)
with dnaio.open(z) as f:
chunks_ind.append([k for k in f])
I later send chunks_ind to a ProcessPoolExecutor to process the list in parallel.
Traceback (most recent call last):
File "chunk_digest_mp.py", line 41, in <module>
chunks_ind.append([k for k in f])
File "chunk_digest_mp.py", line 41, in <listcomp>
chunks_ind.append([k for k in f])
File "src/dnaio/_core.pyx", line 183, in fastq_iter
dnaio.exceptions.FastqFormatError: Error in FASTQ file at line 1341: Line expected to start with '@', but found 'J'
If I give a different buffer size, I get a FASTQ format error at a different line. However, the files themselves don't have issues. I don't know what I am missing here, and memory_file can't be sent through the pool since it can't be pickled. Any suggestions/thoughts?
Thank you,
Arun.
Given that #110 warrants a new release as you say, I think we should name it 1.0.0. It is a stable library which has been used in production for years. Easily the best and one of the most feature-rich parsers out there. See also https://0ver.org
conda install -c bioconda dnaio
or
conda install -c conda-forge -c bioconda dnaio
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: |
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:
Specifications:
- dnaio -> python[version='>=3.6,<3.7.0a0|>=3.7,<3.8.0a0']
Your python: python=3.8
If python is on the left-most side of the chain, that's the version you've asked for.
When python appears to the right, that indicates that the thing on the left is somehow
not available for the python version you are constrained to. Note that conda will not
change your python version to a different minor version unless you explicitly specify
that.
Currently I am not happy with the way we are creating strings, setting maxchar at 255. For FASTQ only ASCII characters are valid, so I feel that we should perform some form of check, the same as would happen if a user used open("my.fastq", "rt", encoding="ascii"). I have several reasons for this:
- If the ASCII check happens in the __init__ method and the getters and setters, then we can be 100% sure that a SequenceRecord object never contains anything but ASCII strings. That means we can drop a lot of the checks in other methods such as fastq_bytes, helping speed.
- Currently this is not possible because the strings we provide are maxchar 255, so that would be incompatible. We can of course make the SequenceRecord object check for maxchar 255, but that does not make sense.

Back when #32 was merged, I noted that we may be able to do it faster than PyUnicode_DecodeASCII. Today I thought of a way to do so.
Checking for ASCII is usually done with an ASCII mask, as ASCII is a 7-bit encoding: the most significant 8th bit is always 0. A mask can be constructed using bytes with the exact value of 128 (0x80 in hex). Since we have 64-bit computers, we can test 8 bytes at a time with the mask 0x8080808080808080. This is what happens in CPython as well when PyUnicode_DecodeASCII is used.
But an 8-byte mask cannot always be applied, for example if we have a read name of 29 characters:
The numbers for a 150-bp Illumina record with a 29-byte name:
A total of 56 checks.
However, since we have the entire record in memory, we can check the entire record for ASCII at once. In this case that is 335 characters.
A total of 48 checks.
Not to mention the simpler logic if we apply one ASCII check to an entire record rather than checking each SequenceRecord member individually.
The ASCII check code can easily be programmed in a separate C header file and imported in Cython. It should have minimal impact on performance.
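The 8-byte mask trick can be sketched in pure Python (for illustration only; the real implementation would live in the C header mentioned above):

```python
ASCII_MASK_8_BYTE = 0x8080808080808080  # the high bit of each of 8 bytes

def is_ascii(data: bytes) -> bool:
    """Check for ASCII by testing the high bit of 8 bytes at a time."""
    n = len(data) - len(data) % 8
    for i in range(0, n, 8):
        if int.from_bytes(data[i:i + 8], "little") & ASCII_MASK_8_BYTE:
            return False
    # Handle the fewer-than-8 remaining bytes one at a time
    return all(byte < 128 for byte in data[n:])
```

Checking a whole record with one such loop needs far fewer tail checks than masking each of the four fields separately, which is the point made above.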
Hi, I have been working on the benchmarking and I am quite intrigued by fastp. The project is very similar to dnaio in terms of its API: paired reading, writing, a focus on speed, using zlib-compatible libraries to speed up (de)compression, etc.
Anyway, I noticed that the fastq.gz reading benchmark between dnaio and fastp is not entirely fair, as the benchmark program always uses a single thread and dnaio gets around the GIL by launching an igzip process. This made me realize that in all my programs I do this:
import functools
from xopen import xopen

DEFAULT_OPENER = functools.partial(xopen, threads=0, compresslevel=1)
# Use DEFAULT_OPENER with all dnaio.open calls
This is quite an ugly way to interact with an API.
I propose that we add open_threads=0 and compresslevel=1 to the dnaio.open interface. open_threads should default to 0, as any polite program only goes multithreaded when explicitly asked (unless it has "parallel" in its name). This will use python-isal by default, which is more efficient than piping through igzip (in terms of total CPU time, not wall-clock time). We can add a nice docstring to the parameter explaining that an external program will be opened when threads > 0.
compresslevel should be set to 1. This is the most efficient compression level in terms of compute time for a given final size. Tools producing FASTQ almost always produce intermediate files (they are mapped afterwards), so the most common use case is one where throughput is most important. An advantage of having a compresslevel parameter in dnaio.open is that it is very easy for users to change it if they feel compression is more important than throughput. I know you have argued differently for cutadapt, but IIRC that was primarily motivated by backwards-compatibility concerns, where drastically changing the behaviour of the program was undesirable (which I agree with). dnaio is not as well established, and in fact still in its 0.x days.
I would have proposed this change with a PR, but since #87 also messes with the open function, I will leave this until after that is merged (or rejected).
When running cutadapt 3.1 to process FASTQ files, I receive the following fatal error:
File "/usr/local/lib/python3.7/site-packages/cutadapt/pipeline.py", line 571, in run
dnaio.read_paired_chunks(f, f2, self.buffer_size)):
File "/usr/local/lib/python3.7/site-packages/dnaio/chunks.py", line 111, in read_paired_chunks
raise ValueError("FASTQ record does not fit into buffer")
ValueError: FASTQ record does not fit into buffer
The error occurs a little over half the time. Other FASTQ files process normally. Have you seen a similar issue? How can I prevent this?
Hi Marcel,
I got a bit fed up with the crappy FASTQ filters in the bioinformatics space, so I started on something that is fast and actually evaluates quality scores correctly (taking into account that they are log values; many people simply compute the arithmetic mean...).
Currently I have implemented a very sturdy and simple pure Python parser that constructs a NamedTuple containing the 4 lines (without \n) as bytes objects. It is about 2.5 times slower than dnaio.
Bytes objects are nice, since quality scores are integers and bytes objects can be treated in Python as a sequence of integers directly. (You can sum bytes, throw them into statistics.median etc. with no issue.)
The same goes for the sequence string. AGTCN etc. are all in the ASCII range, so there is no reason to represent it in unicode.
As for cutadapt: bytes objects support the buffer protocol, which allows for some very lean and mean reading of their internal uint8_t array. Strings do not offer this. This can be quite powerful; see this function for mean quality.
So I wonder why decode is used at all. Wouldn't it be more convenient for Sequence objects to have their members as bytes instead? No decoding means faster parsing and easier usage with other libraries that use Cython (i.e. cutadapt).
Maybe it would be possible to set up a BinarySequence object that is parsed faster. For backwards compatibility this could be converted into a Sequence object by default. What do you think about this?
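To illustrate the bytes convenience, and the correct way to average quality scores mentioned above, here is a small sketch assuming standard phred+33 encoding:

```python
import math

qualities = b"IIII##"  # phred+33 encoded quality string from a FASTQ record

# bytes iterate as integers, so no decoding or ord() calls are needed.
# Phred scores are -10*log10(error probability), so averaging must happen
# in probability space rather than on the scores themselves.
error_probs = [10 ** ((33 - q) / 10) for q in qualities]
mean_error = sum(error_probs) / len(error_probs)
mean_phred = -10 * math.log10(mean_error)
```

Note that the naive arithmetic mean of the scores would be considerably higher than mean_phred here, which is exactly the mistake the correct filter avoids.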
Here are a couple of things I noticed that should possibly be revised before releasing a version 1.0. I'll extend this when I find more.
- SequenceRecord.fastq_bytes_two_headers() is redundant with SequenceRecord.fastq_bytes(two_headers=True)
- record_names_match and record_names_match_bytes should possibly be renamed to record_ids_match and record_ids_match_bytes because those names are more accurate (this also requires renaming the C-level record_ids_match)

Nanopore reads can be delivered as uBAM. While a full-fledged BAM processor is a fool's errand (I have been there...), it is actually quite straightforward to just parse the name, sequence and qualities from a uBAM file.
This will add uBAM support to cutadapt. uBAM is very annoying anyway, as minimap2 won't accept it. It would be nice if cutadapt could take care of the conversion while also trimming away any nanopore helper sequences.
This has bothered me for a long time: a Sequence isn't actually just a sequence; it is an object containing a sequence of nucleotides, a sequence of quality values and a name. The misnaming becomes obvious when one considers that a Sequence has a sequence attribute, so I'm sometimes tempted to write something like s = sequence.sequence or similar.
Biopython has SeqRecord, but I don't like abbreviations in class names, and to reduce the risk of confusion with the Biopython class, I think it should be SequenceRecord.
A while ago I watched some videos on optimization and on inspecting generated assembly code. Since dnaio is a relatively straightforward project, I decided to have a go at it. Unfortunately Cython does not generate very nice C code, so I decided to rewrite the core functionality in C. It is in the allinc branch.
It turns out that this makes reading substantially faster. Writing is at the same speed. So Cython apparently has some friction when creating an iterator that generates a lot of objects. We knew this already, so this rewrite didn't teach me anything new. But it was fun to read records "superfast", so I wanted to share that here.
Looking at the C code, I don't see any way to make our algorithms faster; the code is already very close to the machine. I tried using x86 intrinsics instead of memchr, but apparently GCC already does this with its builtin memchr optimizations, so there is no way to create something faster.
fastq_iter is now a function that uses yield. At least, that is the representation in Cython's custom syntax.
I recently implemented my first iterator in the Python C API. In order to do so you need to create a Python class that fills the tp_iter and tp_iternext slots.
This is exactly what Cython does internally.
What we could do is rewrite fastq_iter into a cdef class fastq_iter and implement a __next__ method. This would require some reorganizing, but not too much.
The big advantage is that we can expose n_records on this new iterator class.
This will help paired reading immensely. We can simply use zip(fastq_iter1, fastq_iter2), iterate over that, and in the end check whether fastq_iter1.n_records == fastq_iter2.n_records. This should reduce the overhead of our paired FASTQ iteration massively!
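The idea can be sketched in pure Python (CountingIter is a stand-in I made up; the real fastq_iter would be a cdef class):

```python
class CountingIter:
    """Iterator wrapper exposing n_records, mimicking the proposed
    fastq_iter class with a __next__ method."""
    def __init__(self, iterable):
        self._it = iter(iterable)
        self.n_records = 0

    def __iter__(self):
        return self

    def __next__(self):
        record = next(self._it)  # propagates StopIteration when exhausted
        self.n_records += 1
        return record

r1 = CountingIter(["a/1", "b/1", "c/1"])
r2 = CountingIter(["a/2", "b/2"])
pairs = list(zip(r1, r2))
# After zipping, unequal counts reveal that the files were out of sync
files_in_sync = r1.n_records == r2.n_records
```

zip does all the pairing work; the only extra cost per record is the counter increment.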
UMIs are becoming a thing now, and our sequencing center is in the habit of sending three files (R1, R2, R3, with the second one annoyingly being the file with the UMIs...). Given that we check for the numbers 1-3 in dnaio, this seems to be fairly standard.
However, there is no three-way comparison between SequenceRecord objects as of now.
This should be fairly doable to implement; I just can't decide on the name yet. I like is_mate, so I guess are_mates is the most appropriate name?
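A rough sketch of what an n-way are_mates could look like (the ID-matching rule here is deliberately simplified and my own guess, not dnaio's actual logic):

```python
def are_mates(*names: str) -> bool:
    """Return True if all read names share the same ID after stripping a
    trailing '/1', '/2', ... mate suffix (simplified matching rule)."""
    def mate_id(name: str) -> str:
        ident = name.split(maxsplit=1)[0]  # the ID is the part before whitespace
        if len(ident) > 2 and ident[-2] == "/" and ident[-1].isdigit():
            return ident[:-2]
        return ident

    first = mate_id(names[0])
    return all(mate_id(name) == first for name in names[1:])
```

Taking *names makes the 2-file and 3-file cases fall out of the same function.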
I want to use it for my htspy project (the one that reads BAM files). BAM records have a read_name property, and it only makes sense to expose this property as an ASCII string. (The current implementation uses bytes.) It is of course very bad practice to advertise maxchar=127 without having checked it.
Furthermore, I think the string_is_ascii test suite we have is a bit out of place here.
What I can do is make a separate repository at https://github.com/rhpvorderman/ascii-check. Then I can
Re-importing it will be a simple git submodule add https://github.com/rhpvorderman/ascii-check and using cdef extern from "ascii-check/ascii-check.h". Another option would be to simply copy the ASCII-check file, as the license permits it.
Any improvements should benefit both dnaio and htspy, and other projects as well. Of course I checked whether a header-only library for ASCII checking is available, and no, there is not. At least not one that uses the "check using an 8-byte mask" trick, despite that being a fairly common trick.
I'd like to make the code in that repo available under the CC0 public-domain dedication, so it can be used for any purpose.
Since I technically waived my rights to this code by publishing it in your repository, I need your explicit permission before I continue with the above.
Currently dnaio uses a slightly unusual architecture where the first value yielded by fastq_iter is a boolean, not a SequenceRecord. It indicates whether the FASTQ records that follow have two headers. FastqWriter has a rather quirky implementation to determine its write method.
I think this is best solved by having a boolean attribute on each SequenceRecord. This can be set instantly without branching (no if statement). We can then add a fastq_bytes_as_input method, which prints one or two headers based on the boolean attribute. That method can then be used by the FastqWriter class.
This will be fairly trivial to implement once the C-code PR is merged.
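A minimal Python sketch of the proposal (the fastq_bytes_as_input name follows the text above; the actual implementation would be in C):

```python
class SequenceRecord:
    # Sketch only: two_headers is set once at parse time, without branching.
    def __init__(self, name, sequence, qualities, two_headers=False):
        self.name = name
        self.sequence = sequence
        self.qualities = qualities
        self.two_headers = two_headers

    def fastq_bytes_as_input(self) -> bytes:
        # Repeat the header after '+' only if the input record had it there
        second_header = self.name if self.two_headers else ""
        return (f"@{self.name}\n{self.sequence}\n"
                f"+{second_header}\n{self.qualities}\n").encode("ascii")
```

A writer can then call fastq_bytes_as_input unconditionally and still reproduce the input style of each record.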
I kind of like offering dnaio.open() as the main entry function, but the open() name of course collides with the built-in function of the same name. This prevents users from writing from dnaio import open. For reading, this is kind of ok, but for writing it looks less nice because you also need SequenceRecord and then have two import lines:
import dnaio
from dnaio import SequenceRecord
with dnaio.open("out.fastq") as f:
f.write(SequenceRecord("name", "ACGT", "####"))
Alternatively, one could write dnaio.SequenceRecord and avoid the second import, but as a user, I may not want the dnaio. prefix everywhere. One can also rename the open function when importing, but that's also not optimal.
So perhaps we should offer the dnaio.open function under a different name.
Hi, I have been trying to run trim_galore on FASTQ files with names like *.fastq.1.gz. However, there is an error which seems to relate to identifying the file type from the contents, i.e.:
import dnaio
f = dnaio.open("13714_1#1.fastq.2.gz")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "//python3.6/site-packages/dnaio/__init__.py", line 83, in open
file1, fileformat=fileformat, mode=mode, qualities=qualities)
File "//python3.6/site-packages/dnaio/__init__.py", line 148, in _open_single
if file.seekable():
AttributeError: 'PipedGzipReader' object has no attribute 'seekable'
In quite a lot of routines concerning quality I use the following pattern.
This is also how it is done in the cutadapt maximum expected errors code.
How nice would it be if the bounds check could be removed; that would make things faster. It would be great if dnaio could guarantee this.
I think it can do so quite cost-effectively: simply do the above pattern without the lookup and instead store each quality after checking. It is a poor-performance version of memcpy that can be utilized when the qualities PyObject is created.
This can then be further upgraded using SSE2 vector instructions (_mm_loadu, _mm_cmpgt_epi8, _mm_cmplt_epi8, _mm_storeu and _mm_or): a quite simple loop where data is loaded, both bounds are checked (there is no unsigned comparison for 8-bit integers in SSE), the result of both bounds checks is ORed into a register that accumulates the result, and the vector is then stored at the destination. In the end, check the accumulated bounds-check vector; if any bounds were crossed, return an error.
Because dnaio does the bounds check, it can be eliminated from any programs using dnaio. That is quite handy for performance.
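The guarantee would amount to a check like this at record-creation time (a pure-Python sketch; the function name and bounds are illustrative, and dnaio would do this in C or with the SSE2 loop described above):

```python
def check_quality_bounds(qualities: bytes, lowest: int = 33, highest: int = 126) -> None:
    """Raise if any phred+33 quality byte falls outside [lowest, highest].
    Once this has run, downstream code can index lookup tables of size 94
    without any further bounds checking."""
    for quality in qualities:
        if quality < lowest or quality > highest:
            raise ValueError(f"Invalid quality character: {chr(quality)!r}")
```

The check runs once per record at parse time, so every later pass over the qualities is branch-free.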
and also Sequence.comment and perhaps Sequence.sep. They could be computed lazily. Perhaps also add .header as an alias for .name? Should setting .id and .comment be supported? Convenient, but perhaps slower than setting the full header directly.
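A sketch of how lazily computed id and comment could look (pure Python, splitting the header on the first run of whitespace; this is my illustration, not dnaio's actual code):

```python
class Record:
    def __init__(self, name: str):
        self.name = name  # full header line without the leading '@'

    @property
    def id(self) -> str:
        # Everything up to the first whitespace
        return self.name.split(maxsplit=1)[0]

    @property
    def comment(self):
        # Everything after the first whitespace, or None if absent
        parts = self.name.split(maxsplit=1)
        return parts[1] if len(parts) == 2 else None
```

Because the split only happens on attribute access, records whose id/comment are never inspected pay nothing.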
Currently the __next__ function is quite messy. memchr is called several times, and a NULL check needs to be performed several times. Any of these conditions might exit the while loop that is put in place.
Alternatively, the update_buffer call could be used to create a newline index. The __next__ call can then simply get the next four lines from the index and yield a record.
Advantages:
- A simpler __next__ call. The current implementation uses a while loop to allow for repeated buffer resizings; instead, the newline indexing step can indicate whether an entire record is in the buffer, which means update_buffer can guarantee that at least one FASTQ record is present. That makes the while loop redundant. Also far fewer NULL checks: basically only FASTQ integrity checks remain.

Disadvantages:
- The FASTQ integrity checks (lines starting with @ and +, and the \r checks) happen in a second pass over the buffer. This means these cache lines need to be fetched from memory/cache again. In contrast, they were probably already populated with the correct memory in our current solution.

In case there is no notable speedup, a reduction in code complexity is also nice (taking the line diff as measure).
Cython is now a build-time requirement, so we don't need to include the generated .c files in the sdist anymore.
Hi, I profiled the fastq_iter method using hyperfine and by commenting out code. This was done on a branch based on the PR in #15.
The script used was simple:
#!/usr/bin/env python3
import dnaio
import sys
with dnaio.FastqReader(sys.argv[1]) as records:
for record in records:
pass
When viewing these tables, I use the rule of thumb that a WGS sequencing file contains 1 billion records. Therefore a time of 1 ns per read corresponds to 1 second when traversing a WGS file.
Area | Time for 5 million reads (milliseconds) | Time for one read (nanoseconds, estimate) | Additional remarks
---|---|---|---
Startup dnaio and determine two_headers | 37 | |
Finding start and end positions of FASTQ record and basic testing | 250 | 50 | 140 ms user, 110 ms system (!)
Yielding an object and iterating over a for loop | 150 | 30 |
Creating a tuple | 5 | 1 |
Creating a Sequence object | 100 | 20 |
Creating three bytes objects | 280 | 56 |
Creating three str objects (latin-1) | 580 | 116 |
Creating three str objects (name: latin-1, sequence and qualities: ASCII) | 640 | 128 |
The following benchmarks were done by adapting the read script and checking the differences between tuples and Sequence objects.
Area | Time for 5 million reads (milliseconds) | Time for one read (nanoseconds, estimate) | Additional remarks
---|---|---|---
Unpack tuple in for loop | 100 | 20 | for name, sequence, qualities in records
Access individual tuple member by index | 90 | 18 | record[1]
Access individual tuple member by global variable index | 130 | 26 | record[SEQUENCE]
Passing a tuple as an argument using * | 100 | 20 | my_func(*record)
Access Sequence member by name | 160 | 32 | record.sequence
- The comment in _core.pyx is outdated. Currently an ASCII mask is used to check for ASCII values. (It is possible that each individual character was checked in older versions of Python, explaining the difference.)
- str objects are about as fast to use as a bytes object, but twice as slow to create.
- Tuples can be passed as an argument using *. (Useful in a function that creates a new FASTQ record.) Sequence objects are way more convenient though.

There is a clear convenience vs. speed tradeoff here. Currently dnaio leans heavily towards convenience. After the holiday I will implement a BinaryFastqReader and a BinaryFastqWriter object which will lean towards the speed side.
x | FastqReader | BinaryFastqReader
---|---|---
dnaio.open mode | r | rb
iter yields | Sequence objects | tuples
String type | ASCII (for sequence and qualities) | bytes
I implemented this on a dummy branch. Traversing 5 million records takes about 1170 ms for the ASCII FastqReader and about 720 ms for the BinaryFastqReader.
After #15 and #23 are merged it will be possible to implement this. See you after the holiday!
When I put BytesSequenceRecord in, there were two major reasons:
I think both issues are now gone.
On top of that, strings are more useful than bytes. Names should be strings. Sequences of nucleotides work more intuitively as strings. And qualities are Phred scores: an ASCII representation of the proper score, which thus works best as strings.
I was working on #65 when I realised that BytesSequenceRecord is just a maintenance burden at this point.
Currently I am working a lot with UMI data that is stored in a separate FASTQ file, meaning I now have 3 files.
I needed to filter those files on average error rate, so I adapted the fastq-filter program to work with multiple files.
To keep the pipeline simple, I opted for a multiple-file reader. This yields 1-tuples for 1 file, 2-tuples for 2 files, 3-tuples for 3 files, etc.
This way I can write the filters to always handle a tuple of SequenceRecord objects and use the same filter in all cases.
Similarly I wrote a multiple writer.
I am wondering if we should do this in dnaio too. There are now two cases in dnaio:
I propose replacing the latter with a multiple-file reader that can read any number of files and yields n-tuples of SequenceRecords. The PairedEndReader and PairedEndWriter interfaces can still be maintained, but these can simply inherit from the MultipleReader and provide a backwards-compatible interface. (Shouldn't be too hard given that it is just the 2-case of the MultipleReader.)
This way I do not have to reinvent the wheel across multiple projects. I also feel this is needed for cutadapt, which needs a sort of auxiliary-file option, where the auxiliary file with the UMIs is kept in sync with the FASTQ files that cutadapt outputs. Currently I have to use biopet-fastqsync to sync the UMI FASTQ file afterwards. (This is not the correct place to raise this issue, but I state it here to show that I think this will be a good move for the future.)
I have already implemented a multiple reader in my FASTQ filter project. At first it was written in a generic manner (everything is a list of multiple files), but I discovered that this severely harms the single-end and paired-end cases: LUMC/fastq-filter#16. I wonder what the best way to implement this in dnaio is. Alternatively there could be separate 1-tuple, 2-tuple, n-tuple readers that all share the same interface through abstract classes.
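The shape of such a multiple-file reader, sketched generically (read_multiple and the opener parameter are my own names; opener stands in for something like dnaio.open):

```python
from contextlib import ExitStack

def read_multiple(*files, opener):
    """Open every file with `opener` (e.g. dnaio.open) and yield n-tuples
    with one record per file, stopping at the shortest file."""
    with ExitStack() as stack:
        readers = [stack.enter_context(opener(f)) for f in files]
        yield from zip(*readers)
```

The paired-end case is then simply the 2-tuple instance of this, so a PairedEndReader could wrap it for backwards compatibility.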
Hello, and thank you for developing dnaio.
I am trying to use it, but I am getting an error:
./Include/object.h:602: _Py_NegativeRefcount: Assertion failed: object has negative ref count
<object at 0x7f11cc6fe660 is freed>
Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed
Python runtime state: finalizing (tstate=0x00007f11d4f08a78)
Current thread 0x00007f11d5143740 (most recent call first):
<no Python frame>
Abandon (core dumped)
Our Python is compiled with debug options, so not many users will see this error, but it is hidden and still occurs when assertions are not enabled. To reproduce this error, a simple script like this one will work:
import dnaio
print("Test")
Would you be willing to take a look at this?
@rhpvorderman What do you think, should we move dnaio to its own organization so that the repo would be at https://github.com/dnaio/dnaio?
The discussion in #76 reminded me of an idea I had a while ago for possibly improving performance when dealing with many SequenceRecord objects: we could implement some kind of SequenceRecordArray class that represents an array of sequence records. It would support iteration (of course) to get the individual SequenceRecord objects. And as #76 suggests, it could also support a fastq_bytes() method that serializes all records to FASTQ en bloc.
This would replace the current functions for reading chunked data. So instead of the read_chunks() function, which returns memoryviews, one would use a function that returns SequenceRecordArrays. Similar to how pandas.read_table does it, we could even add a chunksize parameter to dnaio.open that would then make it return SequenceRecordArray objects.
I think this would simplify some of the code in Cutadapt that transfers sequence data to and from worker processes. For that, I'd also want such an object to have extra attributes chunk_index and n_records.
From PavlidisLab/rnaseq-pipeline#60:
Apparently, there is such a thing as reads with three parts, where the read ID of the 'second' read ends with .3, not with .1. This should be allowed.
As discussed in marcelm/cutadapt#757 adding BAM support to dnaio would have several advantages for cutadapt. The alternative is to defer to pysam instead, and keep dnaio at its basic level.
In order for full BAM support I think the following things need to be added:
Converting a SequenceRecord object to binary BAM bytes is something where I see no technical difficulties. A BAM header is also not very difficult, as it is just a SAM header with a few binary quirks.
BGZF is going to be a slight hurdle as it cannot be appropriately done with xopen, so direct library calls to zlib/libdeflate/isa-l are needed. Since only streamed writing is required it is actually quite easy to write a RawWriter that creates the BGZF blocks and wrap that in an io.BufferedWriter that has a 65200 byte buffer size. That will automatically take care of the sizing and will also be pretty fast.
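That RawWriter idea can be sketched as follows (my own minimal rendering of the BGZF block layout from the SAM specification; the mandatory 28-byte EOF marker block is omitted for brevity):

```python
import io
import struct
import zlib

class BgzfRawWriter(io.RawIOBase):
    """Sketch: every write() call becomes one BGZF block, i.e. a gzip
    member with a BC extra subfield holding the block size. Wrapping it in
    io.BufferedWriter(buffer_size=65200) keeps the uncompressed payload of
    each block under the 65535-byte BGZF limit, as described above."""
    def __init__(self, stream):
        self._stream = stream

    def writable(self):
        return True

    def write(self, data):
        data = bytes(data)
        compressor = zlib.compressobj(1, zlib.DEFLATED, -15)  # raw deflate
        payload = compressor.compress(data) + compressor.flush()
        bsize = len(payload) + 25  # total block length minus one
        header = struct.pack(
            "<4BIBBHBBHH",
            0x1F, 0x8B, 0x08, 0x04,  # gzip magic, deflate, FEXTRA flag set
            0, 0, 0xFF,              # MTIME, XFL, OS=unknown
            6,                       # XLEN: 6 bytes of extra field
            ord("B"), ord("C"), 2,   # BC subfield, subfield length 2
            bsize,
        )
        footer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
        self._stream.write(header + payload + footer)
        return len(data)
```

Because each block is a complete gzip member, a plain gzip reader can decompress the concatenated stream, which is exactly what makes BGZF convenient.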
Tag support is really tough. There are several options here:
Then there is also paired end support. It is possible to put some limits on what is supported. Name-sorted files come to mind. Enabling full support requires supporting indexes and that is also a lot of work. In that case pysam is a better option.
Hi, I have made quite a few PRs in the last week and I still have some plans, so I am putting them here so that the PRs do not come out of the blue.
Planned PRs:
- record_names_match takes 40 ns, which is quite good. Then I benchmarked a Python function that returns True: also 40 ns... A Cython function that returns True: 28 ns. So of those 40 ns, 28 ns is call overhead. This is bad because iterating over two paired FASTQ files takes about 40% longer than iterating over two FASTQ files individually, while zip and zip_longest by themselves have very little overhead. By moving the loop into Cython, a lot of checks in the loop can be avoided.
- sequence_class=tuple, which simply yields a tuple: yield name, sequence, qualities. This is 10% faster than yielding Sequence types. Quite nice if the rich interface of Sequence is not needed.
- dnaio.open(file, mode="rb"). This uses PyBytes_FromStringAndSize internally instead of PyUnicode_DecodeLatin1, which improves speed by more than 20%.

I saw this comment in _core.pyx:
# It would be nice to be able to have the first parameter be an
# unsigned char[:] (memory view), but this fails with a BufferError
# when a bytes object is passed in.
# See <https://stackoverflow.com/questions/28203670/>
This is possible by using the buffer protocol. For a basic implementation see here. This acquires the buffer from a Python object and converts the void * to unsigned char *.
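As a pure-Python illustration of why a read-only buffer request is the right tool here (a sketch, not the actual Cython code): memoryview acquires a read-only buffer from bytes, bytearray, and any other buffer-protocol object, whereas a writable request, which is what a plain unsigned char[:] memoryview in Cython makes, fails with BufferError on immutable bytes.

```python
def ascii_check(data) -> bool:
    """Pure-Python stand-in for the C-level check described above.

    memoryview requests a *read-only* buffer, so this works for bytes,
    bytearray, memoryview slices, etc. A writable request would raise
    BufferError on an immutable bytes object.
    """
    view = memoryview(data).cast("B")  # flat unsigned-char view
    return all(byte < 128 for byte in view)


print(ascii_check(b"ACGT"))                  # True
print(ascii_check(bytearray(b"n\xc3\xa9")))  # False: 0xc3, 0xa9 >= 128
```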
I also think paired_fastq_heads would benefit from using memchr to search for the newlines.
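For intuition, CPython's bytes.find is itself implemented with memchr-style scanning, so a Python sketch of the newline search (illustrative only, not the dnaio implementation) looks like this:

```python
def newline_offsets(buf: bytes):
    """Yield the index of each b'\\n' in buf.

    bytes.find scans contiguous memory in C (memchr-style), which is
    far faster than stepping through the buffer byte by byte -- the
    same reasoning behind using memchr in paired_fastq_heads.
    """
    pos = buf.find(b"\n")
    while pos != -1:
        yield pos
        pos = buf.find(b"\n", pos + 1)


record = b"@r1\nACGT\n+\n!!!!\n"
print(list(newline_offsets(record)))  # [3, 8, 10, 15]
```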
Hello @marcelm!
I recently started adding typing to my code, so I installed mypy in my environment and as a PyCharm plugin, and when testing my code, some warnings came from using your module.
So I made a fresh pull of your repo and ran mypy on it, and this was the output:
/dnaio_copy$ mypy --version
mypy 0.971 (compiled: yes)
/dnaio_copy$ mypy .
doc/conf.py:24: error: Need type annotation for "exclude_patterns" (hint: "exclude_patterns: List[<type>] = ...")
tests/test_open.py:377: error: List item 1 has incompatible type "Tuple[Dict[str, str], Type[MultipleFastqWriter]]"; expected "Tuple[object, Type[object]]"
tests/test_open.py:378: error: List item 2 has incompatible type "Tuple[Dict[str, str], Type[MultipleFastqWriter]]"; expected "Tuple[object, Type[object]]"
tests/test_open.py:379: error: List item 3 has incompatible type "Tuple[Dict[str, str], Type[MultipleFastaWriter]]"; expected "Tuple[object, Type[object]]"
tests/test_internal.py:31: error: Module "dnaio._core" has no attribute "bytes_ascii_check"
tests/read_from_stdin.py:6: error: Item "SingleEndReader" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__enter__"
tests/read_from_stdin.py:6: error: Item "PairedEndReader" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__enter__"
tests/read_from_stdin.py:6: error: Item "SingleEndWriter" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__enter__"
tests/read_from_stdin.py:6: error: Item "PairedEndWriter" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__enter__"
tests/read_from_stdin.py:6: error: Item "MultipleFileWriter" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__enter__"
tests/read_from_stdin.py:6: error: Item "SingleEndReader" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__exit__"
tests/read_from_stdin.py:6: error: Item "PairedEndReader" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__exit__"
tests/read_from_stdin.py:6: error: Item "SingleEndWriter" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__exit__"
tests/read_from_stdin.py:6: error: Item "PairedEndWriter" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__exit__"
tests/read_from_stdin.py:6: error: Item "MultipleFileWriter" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__exit__"
Found 15 errors in 4 files (checked 22 source files)
There were also several PEP 8 warnings:
/dnaio_copy$ pycodestyle .
./setup.py:16:80: E501 line too long (87 > 79 characters)
./src/dnaio/__init__.py:116:80: E501 line too long (81 > 79 characters)
./src/dnaio/__init__.py:139:80: E501 line too long (80 > 79 characters)
./src/dnaio/__init__.py:155:80: E501 line too long (85 > 79 characters)
.....
I'm just starting to learn about type annotations, but I thought it could help to report this to you.
Cheers,
This way the check can be performed on large contiguous chunks of memory. It also removes a branch from the __next__ method. Since the buffer-update method is called relatively few times compared to the __next__ method, it should be a net win.
Also, because the chunks are large and contiguous, the benefits of SIMD optimizations might become noticeable.
I cannot implement this right now, but I have to write this down otherwise it will be stuck in my head for the rest of the day.
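A minimal Python sketch of the idea (the name refill is hypothetical, not dnaio's actual internals): validate each chunk once when the buffer is refilled, so the per-record path does no checking at all.

```python
import io


def refill(stream, size=128 * 1024):
    """Hypothetical buffer-update step: read a large contiguous chunk
    and validate it once. bytes.isascii() scans the whole chunk in C,
    so the per-record __next__ path can skip its own check."""
    chunk = stream.read(size)
    if chunk and not chunk.isascii():
        raise ValueError("non-ASCII byte in input")
    return chunk


data = io.BytesIO(b"@read1\nACGT\n+\n!!!!\n")
print(refill(data))  # b'@read1\nACGT\n+\n!!!!\n'
```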
Dear dnaio team,
is there a method to get the reverse complement?
Thanks
Jorge
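In case it helps, a reverse complement is easy to compute outside the library with a translation table. This is a sketch, not part of the dnaio API; check the dnaio documentation for whether SequenceRecord offers this natively, as I am not certain.

```python
# Translation table covering the standard bases plus N, both cases.
COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the whole string."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGTN"))  # NACGT
```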
When running the tests, there are 12 ResourceWarnings. Showing just one of them:
$ python -X dev -X tracemalloc -m pytest
[...]
tests/test_open.py::test_paired_open_with_multiple_args[fasta-r-PairedEndReader]
.../dnaio/.venv/lib/python3.10/site-packages/_pytest/python.py:192: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/pytest-of-marcel/pytest-191/test_paired_open_with_multiple0/file' encoding='UTF-8'>
result = testfunction(**testargs)
Object allocated at:
File ".../dnaio/.venv/lib/python3.10/site-packages/_pytest/python.py", line 192
result = testfunction(**testargs)
[...]
I don’t have time right now to fix this, so I’m just filing an issue. I would like to at least know what is going on before releasing 0.10.
These warnings were not there in v0.9.1.