I'm working to make cactus run on arbitrary cigar output ...


Better lastz compliance would allow drop-in replacement in cactus (segalign issue, 9 comments, closed)

gsneha26 commented on June 11, 2024

Better lastz compliance would allow drop-in replacement in cactus

from segalign.

Comments (9)

gsneha26 commented on June 11, 2024

Since WGA_GPU handles all the sequences in a file separately, [multiple] is not required. I will make changes for sequence-name handling and --notrivial.


glennhickey commented on June 11, 2024

From an interface standpoint, it would also be nice if it just operated on the fasta files, rather than needing a 2bit for each sequence.


gsneha26 commented on June 11, 2024

The script run_wga_gpu (which creates the folder of 2bit sequences) needs to be run, rather than calling wga directly. This way the interface stays clean; the folder of 2bit files is deleted at the end.

We profiled LASTZ, and it showed that a significant percentage of the time was spent reading the FASTA file and converting it to 2bit. Since we make many LASTZ calls for gapped alignment, using 2bit files was important for runtime.
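
To make the conversion cost above concrete, here is a minimal Python sketch of packing bases at two bits each, using the base codes of the UCSC .2bit format (T=00, C=01, A=10, G=11). It is an illustration only: a real .2bit file also carries a header, a sequence index, and N-block and mask-block records, and the function names `pack_2bit` and `unpack_2bit` are hypothetical, not part of run_wga_gpu or LASTZ.

```python
# Illustrative only: 2-bit packing with the UCSC .2bit base codes.
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"  # index == 2-bit code


def pack_2bit(seq: str) -> bytes:
    """Pack bases four per byte, first base in the high-order bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()
        byte = 0
        for base in chunk:
            # Raises KeyError on N or IUPAC codes; real files handle
            # those with separate N-block records.
            byte = (byte << 2) | CODES[base]
        # Pad a short final chunk with zero bits on the right.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])
```

Doing this once per sequence and reusing the packed file avoids repeating the text parsing on every gapped-alignment call, which is the point made above.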


gsneha26 commented on June 11, 2024

The --notrivial option has been added to master, and special characters in sequence names are now handled. The [multiple][nameparse=darkspace] functionality is already built in, so these actions should not be specified explicitly.


glennhickey commented on June 11, 2024

It still doesn't work on the file from the cactus test. I've attached it: tmp7hax84k3.tmp.gz

run_wga_gpu ./tmp7hax84k3.tmp tmp7hax84k3.tmp --max-hits 1000000 --format=cigar  --step=1 --ambiguous=iupac,100,100 --ydrop=3000 --notrivial
Splitting reference chromosome
Converting chromosome wise fasta to 2bit format
Splitting query chromosome
Converting chromosome wise fasta to 2bit format
Executing: "wga /home/hickey/dev/work/cactus-gpu/tmp7hax84k3.tmp /home/hickey/dev/work/cactus-gpu/tmp7hax84k3.tmp /home/hickey/dev/work/cactus-gpu/output_21463/data_20523/  --max-hits 1000000 --format=cigar --step=1 --ambiguous=iupac,100,100 --ydrop=3000 --notrivial"
Using 8 threads
Using 1 GPU(s)

Reading query file ...

Reading target file ...

Start alignment ...

Sending reference id=0|simMouse.chr6|0 ...

Sending query id=0|simMouse.chr6|0 with buffer 0 ...

Sending query id=1|simRat.chr6|0 with buffer 1 ...

Starting query id=0|simMouse.chr6|0 with buffer 0 ...
Chromosome id=0|simMouse.chr6|0 interval 1/1 (0:636243) with buffer 0

Starting query id=1|simRat.chr6|0 with buffer 1 ...
Chromosome id=1|simRat.chr6|0 interval 1/1 (0:647196) with buffer 1

Sending reference id=1|simRat.chr6|0 ...
FAILURE: extra segments in file (tmp1.ref1.query0.segments: line 2, id=0|simMouse.chr6|0/id=1|simRat.chr6|0+)
(for this usage segments must appear in the same order as the query file, with
all + strand segments before all - strand segments for each query)

Sending query id=0|simMouse.chr6|0 with buffer 0 ...

Sending query id=1|simRat.chr6|0 with buffer 1 ...

Starting query id=0|simMouse.chr6|0 with buffer 0 ...
Chromosome id=0|simMouse.chr6|0 interval 1/1 (0:636243) with buffer 0

Starting query id=1|simRat.chr6|0 with buffer 1 ...
Chromosome id=1|simRat.chr6|0 interval 1/1 (0:647196) with buffer 1
FAILURE: extra segments in file (tmp1.ref0.query1.segments: line 2, id=1|simRat.chr6|0/id=0|simMouse.chr6|0+)
(for this usage segments must appear in the same order as the query file, with
all + strand segments before all - strand segments for each query)

real	0m8.139s
user	0m15.365s
sys	0m2.445s

cactus runs it with

~/dev/cactus/bin/cPecanLastz ./tmp7hax84k3.tmp[multiple][nameparse=darkspace] ./tmp7hax84k3.tmp[nameparse=darkspace] --format=cigar --notrivial --step=1 --ambiguous=iupac,100,100 --ydrop=3000
cigar: id=0|simMouse.chr6|0 2081 2200 + id=0|simMouse.chr6|0 2003 2120 + 3810 M 97 I 2 M 20
cigar: id=0|simMouse.chr6|0 2003 2120 + id=0|simMouse.chr6|0 2081 2200 + 3810 M 97 D 2 M 20
cigar: id=0|simMouse.chr6|0 634196 634356 + id=0|simMouse.chr6|0 2719 2918 + 6113 M 56 D 33 M 21 D 7 M 37 I 1 M 45
etc.
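
For anyone comparing the two tools' output, a small Python sketch for parsing the cigar lines above may help. It assumes the exonerate-style field order (first sequence's name, start, end, strand, then the second sequence's, then score, then operation/length pairs) and assumes names contain no whitespace, which holds under nameparse=darkspace. The names `CigarRecord`, `parse_cigar_line` and `spans_consistent` are hypothetical, and the query/target labels follow the exonerate convention rather than anything stated in this thread.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CigarRecord:
    query_name: str
    query_start: int
    query_end: int
    query_strand: str
    target_name: str
    target_start: int
    target_end: int
    target_strand: str
    score: int
    ops: List[Tuple[str, int]]  # e.g. [("M", 97), ("I", 2), ("M", 20)]


def parse_cigar_line(line: str) -> CigarRecord:
    """Parse one 'cigar: ...' line into a CigarRecord."""
    fields = line.split()
    if len(fields) < 10 or fields[0] != "cigar:":
        raise ValueError(f"not a cigar line: {line!r}")
    op_fields = fields[10:]
    if len(op_fields) % 2 != 0:
        raise ValueError(f"unpaired cigar operation in: {line!r}")
    ops = [(op_fields[i], int(op_fields[i + 1]))
           for i in range(0, len(op_fields), 2)]
    return CigarRecord(
        fields[1], int(fields[2]), int(fields[3]), fields[4],
        fields[5], int(fields[6]), int(fields[7]), fields[8],
        int(fields[9]), ops,
    )


def spans_consistent(rec: CigarRecord) -> bool:
    """Check that M+I covers the first span and M+D covers the second."""
    m = sum(n for op, n in rec.ops if op == "M")
    i = sum(n for op, n in rec.ops if op == "I")
    d = sum(n for op, n in rec.ops if op == "D")
    return (abs(rec.query_end - rec.query_start) == m + i
            and abs(rec.target_end - rec.target_start) == m + d)
```

On the first line shown above, the first span is 2200 - 2081 = 119 = 97 + 2 + 20 and the second is 2120 - 2003 = 117 = 97 + 20, so the check passes.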


rsharris commented on June 11, 2024

> [multiple][nameparse=darkspace] functionality is already present and the option should not be specified explicitly.

My opinion: If you want users to be able to use this as a drop in replacement, you probably ought to accept [multiple] and just ignore it, rather than prohibit it.

The nameparse options are a can of worms. Those exist in lastz because there are so many 'standards' for names in fasta files. I could be wrong, but I get the impression this package only intends to support nameparse=darkspace (which is by far the simplest case, but is not the lastz default). If that's true, I think you'd want to soft-require [nameparse=darkspace] and warn the user if the command line lacks nameparse or specifies a different one, so the user has the opportunity to understand that the names in her output might differ from what she expects.
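
The suggestion above (accept [multiple] and ignore it, soft-require nameparse=darkspace) could be sketched in a wrapper roughly as follows. This is a hedged Python illustration, not how run_wga_gpu is implemented: `split_file_spec` and `normalize_file_spec` are hypothetical names, and the parsing is a simplification of lastz's file-specifier syntax that assumes bracketed actions at the end of the path, optionally comma-separated.

```python
import re
import warnings
from typing import List, Tuple

ACTION_GROUP_RE = re.compile(r"\[([^\[\]]*)\]")


def split_file_spec(spec: str) -> Tuple[str, List[str]]:
    """Split 'path[a][b,c]' into ('path', ['a', 'b', 'c'])."""
    first = spec.find("[")
    if first == -1:
        return spec, []
    actions: List[str] = []
    for group in ACTION_GROUP_RE.findall(spec[first:]):
        actions.extend(a.strip() for a in group.split(",") if a.strip())
    return spec[:first], actions


def normalize_file_spec(spec: str) -> str:
    """Return the bare path, warning about anything that is not honoured."""
    path, actions = split_file_spec(spec)
    nameparse = None
    for action in actions:
        if action == "multiple":
            continue  # accepted and silently ignored
        if action.startswith("nameparse="):
            nameparse = action.split("=", 1)[1]
        else:
            warnings.warn(f"{path}: ignoring unsupported action [{action}]")
    if nameparse != "darkspace":
        # Soft requirement: keep running, but tell the user how names
        # will actually be parsed.
        warnings.warn(
            f"{path}: only nameparse=darkspace is supported "
            f"(got {nameparse!r}); output names may differ from lastz"
        )
    return path
```

With this, the cactus-style argument `./tmp7hax84k3.tmp[multiple][nameparse=darkspace]` reduces to the bare path with no warning, while a spec without nameparse still runs but tells the user why names may look different.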


gsneha26 commented on June 11, 2024

Solved the issue. It should work now.


glennhickey commented on June 11, 2024

@rsharris Thanks for the feedback! Agreed that the more of the lastz syntax that can be supported (even if it's just accepting and ignoring stuff like [multiple]), the easier it will be for people to try this (doubly so for cactus integration).

My command line does run through now, though, so thanks @gsneha26. I will try once again to plug it into cactus.


gsneha26 commented on June 11, 2024

Thank you @rsharris for your input. I will definitely be making changes to the nameparse options in the system. Right now, as you rightly pointed out, only [nameparse=darkspace] is supported. It is a temporary feature for cactus compatibility.

About [multiple]: WGA_GPU is not exactly a drop-in replacement for LASTZ. The system is designed so that the user does not have to create multiple jobs for 1) complete genome-to-genome alignment, and 2) multicore, multi-GPU utilization. Also, WGA_GPU supports only the most basic of LASTZ's seeding and filtering options.
