I want to use it for my htspy project.

ascii_check.h should be in its own repo (dnaio, closed)

rhpvorderman commented on August 25, 2024

ascii_check.h should be in its own repo


Comments (6)

rhpvorderman commented on August 25, 2024

Thanks!

Regarding how to get the code back into dnaio, I tend to stay away from submodules ... copying the file into the dnaio repository would be my choice at the moment.

Sure thing. Submodules are massively annoying to work with. I will make it so that the files have a nice header that points to the repo, so people can find it.


marcelm commented on August 25, 2024

I don’t think you actually need my permission, but to be on the safe side: I hereby grant permission for publishing all code you contributed to dnaio under a license of your choice, including CC-0.

Regarding how to get the code back into dnaio, I tend to stay away from submodules, mostly because you need to remember to use --recursive when you do git clone. I think in this case just copying the file into the dnaio repository would be my choice at the moment.


rhpvorderman commented on August 25, 2024

It is done. https://github.com/rhpvorderman/ascii-check

In the end I chose the MIT license, as it can be pasted on top of the file pretty easily (the CC0 text turned out to be big). I also made an SSE2 implementation, which is automatically supported by all x86_64 CPUs.
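For readers who have not seen the technique, here is a minimal sketch of what an SSE2-based ASCII check can look like. It is not the code from the ascii-check repository: the function name `is_ascii_sketch` is hypothetical, and the early return on the first non-ASCII chunk is an assumption made for brevity. When SSE2 is not available at compile time, only the plain byte loop is used.

```c
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper, not the ascii-check API.
 * Returns 1 if every byte in buf[0..len) is below 0x80, otherwise 0. */
static int is_ascii_sketch(const char *buf, size_t len)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Process 16 bytes per iteration. _mm_movemask_epi8 gathers the
     * high bit of each byte, so any non-zero result means that at
     * least one byte in the chunk is non-ASCII. */
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        if (_mm_movemask_epi8(chunk) != 0) {
            return 0;
        }
    }
#endif
    /* Scalar tail, and the complete fallback when SSE2 is absent. */
    for (; i < len; i++) {
        if ((unsigned char)buf[i] & 0x80) {
            return 0;
        }
    }
    return 1;
}
```

Calling `is_ascii_sketch(record, record_length)` on a buffer returns 1 for pure ASCII data and 0 as soon as a byte with the high bit set is found.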


rhpvorderman commented on August 25, 2024

Unfortunately, it seems that the non-SSE2 implementation is already so fast that with #60 merged, I cannot measure any difference between the implementations, or even between having the check turned on or off (!).
dnaio-asv also nicely reports an increase in processing time when the ascii check was enabled, only to drop off again to pre-asciicheck levels after #60.
So I wasted quite a lot of time implementing SIMD instructions for nothing. Oh well.

I also thought of even faster branchless ways to do the checks, but that is simply not relevant for dnaio. On the other hand, I am very bothered with Python's very slow PyUnicode_DecodeASCII so I will try to make a contribution there.


marcelm commented on August 25, 2024

So do you just leave things in this repo as they are?

On the other hand, I am very bothered with Python's very slow PyUnicode_DecodeASCII so I will try to make a contribution there.

That would be supercool :-)


rhpvorderman commented on August 25, 2024

So do you just leave things in this repo as they are?

Yes. There is no need to change anything. I did discover that for short ASCII strings the pre-aligning step actually made the check slower. But given that we now check 128 KB chunks, this is not really a concern. It is nice that it is in its own repo so others can use the code.
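To illustrate what "pre-aligning" refers to, here is a rough sketch of a word-at-a-time check that first walks single bytes until the pointer is 8-byte aligned. The names `is_ascii_aligned_sketch` and `HIGH_BITS_SKETCH` are hypothetical, and the code is only an approximation of the idea, not a copy of ascii_check.h. For a string of just a few bytes, the byte-wise loops can make up most of the work, which is one plausible reason why short inputs do not benefit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The high bit of every byte in a 64-bit word (hypothetical name). */
#define HIGH_BITS_SKETCH 0x8080808080808080ULL

/* Hypothetical helper: returns 1 if buf[0..len) is pure ASCII, else 0. */
static int is_ascii_aligned_sketch(const char *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    uint64_t seen = 0;  /* OR of everything read so far */

    /* Pre-alignment: consume single bytes until p is 8-byte aligned. */
    while (len > 0 && ((uintptr_t)p % sizeof(uint64_t)) != 0) {
        seen |= *p++;
        len--;
    }
    /* Aligned body: 8 bytes per iteration. memcpy keeps the load
     * well-defined in C; compilers typically turn it into one load. */
    while (len >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, p, sizeof(word));
        seen |= word;
        p += sizeof(word);
        len -= sizeof(word);
    }
    /* Tail: the remaining 0 to 7 bytes. */
    while (len > 0) {
        seen |= *p++;
        len--;
    }
    /* Any byte >= 0x80 leaves at least one high bit set in seen. */
    return (seen & HIGH_BITS_SKETCH) == 0;
}
```

With large chunks, the two byte-wise loops touch at most 14 bytes in total, so their cost disappears next to the aligned body.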

That would be supercool :-)

I tried but it seems Python is as fast as it can be. For some reason calling PyUnicode_DecodeASCII is slower than calling PyUnicode_New, checking and copying. But it is not in the decode_ascii function. Maybe function overhead (seems unlikely)? Memory locality? My benchmark use case was reading a file line by line with the "ascii" codec. I could not get any performance improvement. In fact, most of the things I tried were slightly slower.

I did notice this Unicode slowness too when working on htspy this weekend. In the end I chose to save the BAM record read name as a bytes object and convert it to a string every time the user requests the name. Bytes objects are simply faster, and the name is usually not of interest in a BAM record. So it did not make sense to slow down parsing by 10% just to have a string available that will probably not be used. For dnaio this is different, of course, since a FASTQ record has only three attributes and all metadata has to be stored in the read name as well.
