
Comments (6)

rhpvorderman commented on August 25, 2024

Thanks!

> Regarding how to get the code back into dnaio, I tend to stay away from submodules ... copying the file into the dnaio repository would be my choice at the moment.

Sure thing. Submodules are massively annoying to work with. I will make it so that the files have a nice header that points to the repo, so people can find it.

from dnaio.

marcelm commented on August 25, 2024

I don’t think you actually need my permission, but to be on the safe side: I hereby grant permission for publishing all code you contributed to dnaio under a license of your choice, including CC-0.

Regarding how to get the code back into dnaio, I tend to stay away from submodules, mostly because you need to remember to use --recursive when you do git clone. I think in this case just copying the file into the dnaio repository would be my choice at the moment.


rhpvorderman commented on August 25, 2024

It is done. https://github.com/rhpvorderman/ascii-check

In the end I chose the MIT license, as it can be pasted on top of the file pretty easily (CC0 turned out to be big). I also made an SSE2 implementation, which is automatically supported by all x86_64 CPUs.


rhpvorderman commented on August 25, 2024

Unfortunately it seems that the non-SSE2 implementation is already so fast that with #60 merged, I cannot measure any difference between the implementations, or even between having the check turned on or off (!).
dnaio-asv also nicely reports an increase in processing time when the ascii check was enabled, only to drop off again to pre-asciicheck levels after #60.
So I wasted quite a lot of time implementing SIMD instructions for nothing. Oh well.

I also thought of even faster branchless ways to do the checks, but that is simply not relevant for dnaio. On the other hand, I am very bothered with Python's very slow PyUnicode_DecodeASCII so I will try to make a contribution there.
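For readers unfamiliar with the technique, the idea behind a word-at-a-time ASCII check can be sketched in Python. This is only an illustration and not the code from the ascii-check repository, which is written in C; the function name `is_ascii_wordwise` and the 8-byte word size are assumptions made for the example.

```python
# Mask with the high bit set in each of 8 bytes. A byte is ASCII
# exactly when its high bit (0x80) is clear.
HIGH_BITS = 0x8080808080808080


def is_ascii_wordwise(data: bytes) -> bool:
    """Return True if every byte in data is below 0x80.

    Hypothetical helper: it ORs 8-byte words into one accumulator and
    tests the high bits once at the end, so there is no branch per byte.
    """
    acc = 0
    full = len(data) - len(data) % 8
    for i in range(0, full, 8):
        # Interpret 8 bytes as one integer and fold it into the accumulator.
        acc |= int.from_bytes(data[i:i + 8], "little")
    for byte in data[full:]:
        # Leftover bytes land in the low byte, which the mask also covers.
        acc |= byte
    return (acc & HIGH_BITS) == 0
```

In pure Python this is far slower than the built-in `bytes.isascii()`; the sketch only shows why a single mask test at the end is enough.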


marcelm commented on August 25, 2024

So do you just leave things in this repo as they are?

> On the other hand, I am very bothered with Python's very slow PyUnicode_DecodeASCII so I will try to make a contribution there.

That would be supercool :-)


rhpvorderman commented on August 25, 2024

> So do you just leave things in this repo as they are?

Yes. There is no need to change. I did discover that for short ASCII strings the pre-aligning stuff actually made it slower. But given that we check 128 KB chunks now, this is not really a concern. It is nice that it is in its own repo so others can use the code.

> That would be supercool :-)

I tried but it seems Python is as fast as it can be. For some reason calling PyUnicode_DecodeASCII is slower than calling PyUnicode_New, checking and copying. But the slowdown is not in the decode_ascii function itself. Maybe function call overhead (seems unlikely)? Memory locality? My benchmark use case was reading a file line by line with the "ascii" codec. I could not get any performance improvement. In fact, most of the things I tried were slightly slower.
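A comparison of this kind can be set up roughly as follows. This is a minimal sketch and not the benchmark actually used above: the helper names, the in-memory input and the repeat count are all assumptions, and it measures decoding from Python rather than the C-level functions directly.

```python
import io
import timeit


def read_lines_text(raw: bytes) -> int:
    """Iterate over lines decoded with the "ascii" codec; return total length."""
    with io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline="\n") as f:
        return sum(len(line) for line in f)


def read_lines_bytes(raw: bytes) -> int:
    """Iterate over the same lines without decoding; return total length."""
    with io.BytesIO(raw) as f:
        return sum(len(line) for line in f)


def compare(raw: bytes, repeat: int = 5) -> dict:
    """Return wall-clock seconds for each reader (hypothetical harness)."""
    return {
        "text": timeit.timeit(lambda: read_lines_text(raw), number=repeat),
        "bytes": timeit.timeit(lambda: read_lines_bytes(raw), number=repeat),
    }
```

Calling `compare()` on a few megabytes of FASTQ-like data gives a rough idea of what the decoding step costs relative to plain line iteration; the numbers depend heavily on the Python version and the machine.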

I did notice this Unicode slowness too when working on htspy this weekend. In the end I chose to save the BAM record read name as a bytes object, and convert it to a string every time the user requests the name. Bytes objects are simply faster and the name is usually not of interest in a BAM record. So it did not make sense to slow down parsing speed by 10% just for having a string available that will probably not be used. For dnaio this is different of course, since there are only three attributes to a FASTQ record, and all metadata has to be stored in the read name as well.
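The bytes-first approach can be sketched like this. The class and attribute names are hypothetical and are not htspy's actual API; the point is only that decoding happens on access, not during parsing.

```python
class Record:
    """Hypothetical record that keeps the read name as raw bytes."""

    __slots__ = ("_name_bytes", "sequence")

    def __init__(self, name: bytes, sequence: str):
        # Parsing only stores the bytes; no Unicode object is created here.
        self._name_bytes = name
        self.sequence = sequence

    @property
    def name(self) -> str:
        # Decoded on every access, so the cost is paid only by callers
        # that actually ask for the name.
        return self._name_bytes.decode("ascii")
```

A caller that never touches `record.name` never pays for the decode, which is the trade-off described above.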

