Comments (9)

gsneha26 commented on June 11, 2024

Since WGA_GPU handles all the sequences in a file separately, [multiple] is not required. I will make changes for sequence name handling and --notrivial.

from segalign.

glennhickey commented on June 11, 2024

From an interface standpoint, it would also be nice if it just operated on the fasta files, rather than needing a 2bit for each sequence.

gsneha26 commented on June 11, 2024

The script run_wga_gpu, which creates the folder of 2bit sequences, should be run rather than calling wga directly. This keeps the interface clean, and the folder of 2bit files is deleted at the end.

We profiled LASTZ, and it showed that a significant share of the time was spent reading the FASTA file and converting it to 2bit. Since we make many LASTZ calls for gapped alignment, using 2bit files was important for runtime.
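To illustrate the splitting step only, here is a minimal Python sketch. It is not the actual run_wga_gpu script: the function names read_fasta and split_fasta and the chrN.fa file naming are hypothetical, and the 2bit conversion itself is left out. It assumes sequence names are taken as the first whitespace-delimited word of each header line.

```python
import tempfile
from pathlib import Path


def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file.

    The name is the first whitespace-delimited word of the header line
    (an assumption made for this sketch).
    """
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                fields = line[1:].split()
                name = fields[0] if fields else ""
                chunks = []
            elif line and name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_fasta(path, out_dir):
    """Write one FASTA file per sequence and return [(name, file_path), ...]."""
    out_dir = Path(out_dir)
    written = []
    for index, (name, sequence) in enumerate(read_fasta(path)):
        # Index-based file names (hypothetical) sidestep characters such
        # as '|' that appear in names like id=0|simMouse.chr6|0.
        out_path = out_dir / f"chr{index}.fa"
        out_path.write_text(f">{name}\n{sequence}\n")
        written.append((name, out_path))
    return written
```

Calling split_fasta inside a `with tempfile.TemporaryDirectory() as work_dir:` block would give the same "folder is deleted at the end" behaviour, since the directory is removed when the block exits.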

gsneha26 commented on June 11, 2024

The --notrivial option has been added to master, and special characters in sequence names can now be handled. The [multiple][nameparse=darkspace] functionality is already built in, so these options should not be specified explicitly.

glennhickey commented on June 11, 2024

It still doesn't work on the file from the cactus test. I've attached it: tmp7hax84k3.tmp.gz

run_wga_gpu ./tmp7hax84k3.tmp tmp7hax84k3.tmp --max-hits 1000000 --format=cigar  --step=1 --ambiguous=iupac,100,100 --ydrop=3000 --notrivial
Splitting reference chromosome
Converting chromosome wise fasta to 2bit format
Splitting query chromosome
Converting chromosome wise fasta to 2bit format
Executing: "wga /home/hickey/dev/work/cactus-gpu/tmp7hax84k3.tmp /home/hickey/dev/work/cactus-gpu/tmp7hax84k3.tmp /home/hickey/dev/work/cactus-gpu/output_21463/data_20523/  --max-hits 1000000 --format=cigar --step=1 --ambiguous=iupac,100,100 --ydrop=3000 --notrivial"
Using 8 threads
Using 1 GPU(s)

Reading query file ...

Reading target file ...

Start alignment ...

Sending reference id=0|simMouse.chr6|0 ...

Sending query id=0|simMouse.chr6|0 with buffer 0 ...

Sending query id=1|simRat.chr6|0 with buffer 1 ...

Starting query id=0|simMouse.chr6|0 with buffer 0 ...
Chromosome id=0|simMouse.chr6|0 interval 1/1 (0:636243) with buffer 0

Starting query id=1|simRat.chr6|0 with buffer 1 ...
Chromosome id=1|simRat.chr6|0 interval 1/1 (0:647196) with buffer 1

Sending reference id=1|simRat.chr6|0 ...
FAILURE: extra segments in file (tmp1.ref1.query0.segments: line 2, id=0|simMouse.chr6|0/id=1|simRat.chr6|0+)
(for this usage segments must appear in the same order as the query file, with
all + strand segments before all - strand segments for each query)

Sending query id=0|simMouse.chr6|0 with buffer 0 ...

Sending query id=1|simRat.chr6|0 with buffer 1 ...

Starting query id=0|simMouse.chr6|0 with buffer 0 ...
Chromosome id=0|simMouse.chr6|0 interval 1/1 (0:636243) with buffer 0

Starting query id=1|simRat.chr6|0 with buffer 1 ...
Chromosome id=1|simRat.chr6|0 interval 1/1 (0:647196) with buffer 1
FAILURE: extra segments in file (tmp1.ref0.query1.segments: line 2, id=1|simRat.chr6|0/id=0|simMouse.chr6|0+)
(for this usage segments must appear in the same order as the query file, with
all + strand segments before all - strand segments for each query)

real	0m8.139s
user	0m15.365s
sys	0m2.445s

cactus runs it with

~/dev/cactus/bin/cPecanLastz ./tmp7hax84k3.tmp[multiple][nameparse=darkspace] ./tmp7hax84k3.tmp[nameparse=darkspace] --format=cigar --notrivial --step=1 --ambiguous=iupac,100,100 --ydrop=3000
cigar: id=0|simMouse.chr6|0 2081 2200 + id=0|simMouse.chr6|0 2003 2120 + 3810 M 97 I 2 M 20
cigar: id=0|simMouse.chr6|0 2003 2120 + id=0|simMouse.chr6|0 2081 2200 + 3810 M 97 D 2 M 20
cigar: id=0|simMouse.chr6|0 634196 634356 + id=0|simMouse.chr6|0 2719 2918 + 6113 M 56 D 33 M 21 D 7 M 37 I 1 M 45
etc.

rsharris commented on June 11, 2024

"[multiple][nameparse=darkspace] functionality is already present and the option should not be specified explicitly."

My opinion: If you want users to be able to use this as a drop-in replacement, you probably ought to accept [multiple] and just ignore it, rather than prohibit it.

The nameparse options are a can of worms. Those exist in lastz because there are so many 'standards' for names in fasta files. I could be wrong, but I get the impression this package only intends to support nameparse=darkspace (which is by far the simplest case, but is not the lastz default). If that's true, I think you'd want to soft-require [nameparse=darkspace] and throw a warning at the user if the command line lacks nameparse or has a different nameparse, so the user has the opportunity to understand that the names in her output might be different from what she expects.
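As a rough sketch of that suggestion (not code from WGA_GPU or lastz), the handling could look something like the Python below. The function name parse_sequence_spec is hypothetical, and it only understands the file[action][key=value] form used in the command lines in this thread, not the full lastz sequence-specifier syntax.

```python
import re
import warnings


def parse_sequence_spec(spec):
    """Split 'path[action][key=value]...' into (path, actions).

    Hypothetical helper: covers only the bracket forms seen in this
    thread, e.g. file[multiple][nameparse=darkspace].
    """
    match = re.fullmatch(r"([^\[\]]+)((?:\[[^\[\]]*\])*)", spec)
    if match is None:
        raise ValueError(f"malformed sequence specifier: {spec!r}")
    path, bracketed = match.groups()

    actions = {}
    for item in re.findall(r"\[([^\[\]]*)\]", bracketed):
        key, _, value = item.partition("=")
        actions[key.strip()] = value.strip()

    # Accept [multiple] and silently ignore it instead of rejecting it.
    actions.pop("multiple", None)

    # Soft-require nameparse=darkspace: warn, but keep going.
    nameparse = actions.get("nameparse")
    if nameparse is None:
        warnings.warn(
            f"{path}: no nameparse given; names will be parsed as darkspace"
        )
    elif nameparse != "darkspace":
        warnings.warn(
            f"{path}: nameparse={nameparse} is not supported; "
            "names will be parsed as darkspace"
        )
    return path, actions
```

With this, "./tmp7hax84k3.tmp[multiple][nameparse=darkspace]" would parse without complaint, while a bare "./tmp7hax84k3.tmp" would still run but emit a warning about the name parsing.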

gsneha26 commented on June 11, 2024

Solved the issue. It should work now.

glennhickey commented on June 11, 2024

@rsharris Thanks for the feedback! Agreed that the more of the lastz syntax that can be supported (even if it's just accepting and ignoring stuff like [multiple]), the easier it will be for people to try this (doubly so for cactus integration).

My command line does run through now, though, so thanks @gsneha26. I will try once again to plug it into cactus.

gsneha26 commented on June 11, 2024

Thank you @rsharris for your input. I will definitely be making changes to the name parse options in the system. Right now, as you rightly pointed out, only [nameparse=darkspace] is supported. It is a temporary feature for cactus compatibility.

About [multiple]: WGA_GPU is not exactly a drop-in replacement for LASTZ. The system is designed so that the user does not have to create multiple jobs for 1) complete genome-to-genome alignment, and 2) multicore, multi-GPU utilization. Also, WGA_GPU supports only the most basic of LASTZ's seeding and filtering options.
