Combining RM rows (edta issue, OPEN)

aerilli commented on June 12, 2024
Combining RM rows

from edta.

Comments (5)

oushujun commented on June 12, 2024

Hi,

The directions of these entries are different and the physical distances between them are too large. The last two entries are close enough, but their TE coordinates overlap substantially (4910-7166 vs 6988-8240), so they cannot be considered a single element.

Thanks!
Shujun


aerilli commented on June 12, 2024

Hey Shujun,

Thanks for the clarification! So if a substantial overlap is detected, the rows cannot be considered a single element.
However, it is still a bit unclear to me how this translates into the final annotation of this region, which looks like this:

Chr5    EDTA    Mutator_TIR_transposon  19872566        19873827        10111   -       .       ID=TE_homo_95784;Name=VANDAL21;classification=DNA/MULE-MuDR;sequence_ontology=SO:0002280;identity=0.963;method=homology;ID=TE_homo_98670;sequence_ontology=SO:0002280
Chr5    EDTA    Mutator_TIR_transposon  19873825        19874206        3057    -       .       ID=TE_homo_95785;Name=VANDAL21;classification=DNA/Mutator;sequence_ontology=SO:0002280;identity=0.968;method=homology;ID=TE_homo_98671;sequence_ontology=SO:0002280
Chr5    EDTA    Mutator_TIR_transposon  19873941        19877095        12213   -       .       ID=TE_homo_95786;Name=VANDAL21;classification=DNA/MULE-MuDR;sequence_ontology=SO:0002280;identity=0.966;method=homology;ID=TE_homo_98672;sequence_ontology=SO:0002280
Chr5    EDTA    Mutator_TIR_transposon  19877284        19883063        18267   +       .       ID=TE_homo_95787;Name=VANDAL21;classification=DNA/Mutator;sequence_ontology=SO:0002280;identity=0.976;method=homology;ID=TE_homo_98673;sequence_ontology=SO:0002280
Chr5    EDTA    Mutator_TIR_transposon  19883061        19884298        9665    +       .       ID=TE_homo_95788;Name=VANDAL21;classification=DNA/Mutator;sequence_ontology=SO:0002280;identity=0.964;method=homology;ID=TE_homo_98674;sequence_ontology=SO:0002280

Here, in at least two cases, the overlap is not substantial and the direction is the same.

Many thanks for your support, Shujun! :)


oushujun commented on June 12, 2024

The GFF rows you pasted seem to contain extra information compared to the RM out rows. To combine rows, the physical coordinates, the direction, the TE coordinates, and the divergence all need to be considered. If the physical coordinates, direction, and divergence meet the criteria but the TE coordinates overlap substantially, the rows are still considered two elements. If the TE coordinates are far apart but in an agreeable order (the first piece has the smaller 5' coordinates), the rows are still considered a single element. In such a case, the annotated TE has a large deletion.

Shujun
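
The rule described in this comment can be sketched as a small predicate. This is a minimal Python sketch under stated assumptions, not the logic of combine_RMrows.pl itself: the `Hit` class, `overlap_len`, `can_join`, and especially the `max_te_overlap` cutoff are hypothetical names and values (the thread never states how much consensus overlap counts as substantial). The divergence test assumes that the difference between the two rows is what `-maxdiv` bounds, and the ordering test for the `C` strand is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One RepeatMasker row, reduced to the fields the rule needs."""
    family: str
    strand: str      # "+" or "C"
    q_begin: int     # physical (genomic) start
    q_end: int       # physical (genomic) end
    te_begin: int    # start in the repeat consensus
    te_end: int      # end in the repeat consensus
    div: float       # percent divergence


def overlap_len(a_begin: int, a_end: int, b_begin: int, b_end: int) -> int:
    """Inclusive overlap length of two intervals, 0 if they do not overlap."""
    return max(0, min(a_end, b_end) - max(a_begin, b_begin) + 1)


def can_join(a: Hit, b: Hit, maxgap: int = 35, maxdiv: float = 3.5,
             max_te_overlap: int = 50) -> bool:
    """Return True if b (the downstream row) may be merged into a.

    max_te_overlap is a hypothetical cutoff chosen for illustration only.
    """
    if a.family != b.family or a.strand != b.strand:
        return False
    # Physical distance: overlapping rows give a negative gap, which passes.
    if b.q_begin - a.q_end - 1 > maxgap:
        return False
    # Assumption: -maxdiv bounds the divergence difference between the rows.
    if abs(a.div - b.div) > maxdiv:
        return False
    # Assumption: on "+" the upstream row carries the smaller consensus
    # coordinates; on "C" the order is reversed.
    first, second = (a, b) if a.strand == "+" else (b, a)
    if first.te_begin > second.te_begin:
        return False
    # A large consensus overlap means two elements, not one with a deletion.
    te_overlap = overlap_len(first.te_begin, first.te_end,
                             second.te_begin, second.te_end)
    return te_overlap <= max_te_overlap


# The three VANDAL12 rows quoted later in this thread.
r64678 = Hit("VANDAL12", "+", 17485555, 17489789, 1, 4200, 4.5)
r64679 = Hit("VANDAL12", "+", 17489775, 17494536, 3442, 7944, 2.6)
r64680 = Hit("VANDAL12", "+", 17494533, 17497540, 8849, 11860, 1.4)

print(can_join(r64678, r64679))
print(can_join(r64679, r64680))
```

With these inputs the first pair is rejected because of the consensus overlap and the second pair is accepted, since a large distance between the TE coordinates is treated as a deletion rather than as a reason to keep the rows apart.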


baozg commented on June 12, 2024

Hi, Shujun

Sorry for jumping into this conversation. What we don't understand is why some rows are still not joined even when they meet all the criteria in the script.

Here is the command and a small working example I used:

perl combine_RMrows.pl -rmout test -maxgap 35 -maxdiv 3.5

So rows from the same family, on the same strand, with a gap of less than 35 bp and a divergence difference of less than 3.5 will be joined, right?

But look at these three rows:

# before joining
SW   perc perc perc  query       position in query              matching               repeat                           position in repeat
score   div. del. ins.  sequence    begin    end          (left)   repeat                 class/family            begin     end     (left)        ID
30291    4.5  0.2  0.4  Chr3        17485555 17489789  (8669366) + VANDAL12               DNA/Mutator                     1    4200    (9966)  64678 *
38777    2.6  0.5  0.2  Chr3        17489775 17494536  (8664619) + VANDAL12               DNA/Mutator                  3442    7944    (4030)  64679
26487    1.4  0.2  0.0  Chr3        17494533 17497540  (8661615) + VANDAL12               DNA/Mutator                  8849   11860     (114)  64680 *

# after joining
SW_score        perc_div.       perc_del.       perc_ins.       query_sequence  query_begin     query_end       query_remain    strand  matching_repeat repeat_class/family     repeat_begin  repeat_end       repeat_remain   ID
30291   4.5     0.2     0.4     Chr3    17485555        17489789        8669366 +       VANDAL12        DNA/Mutator     1       4200    (9966)  64678
34020   2.1     0.4     0.1     Chr3    17489775        17497540        8661615 +       VANDAL12        DNA/Mutator     3442    11860   (114)   64679_64680

So 64679 and 64680 were joined (see 64679_64680 in the ID column), but why wasn't 64678 joined with 64679_64680?
✅ Same family (VANDAL12)
✅ Same strand (+)
✅ Overlapping (17485555-17489789 with 17489775-17497540; the overlap is 14 bp). How large an overlap does this script ignore? We think this is not a substantial overlap.
✅ Divergence (4.5-2.1=2.4)


baozg commented on June 12, 2024

For anyone interested in this merging: the case I pasted here was not merged because of the overlap in the repeat consensus coordinates (the "position in repeat" columns near the end of each row). 1-4200 overlaps 3442-11860 by roughly 760 bp (positions 3442-4200).
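
To check this kind of case by hand, both overlaps can be computed directly from the RepeatMasker rows. The following is a minimal Python sketch; `parse_rm_row` and `signed_overlap` are hypothetical helper names, not part of EDTA. It relies on the standard RepeatMasker .out column layout, where the "position in repeat" columns are begin, end, (left) on the `+` strand and (left), end, begin on the `C` strand.

```python
def parse_rm_row(line: str) -> dict:
    """Extract ID, strand, genomic and consensus coordinates from one row."""
    f = line.split()
    strand = f[8]
    if strand == "+":
        te_begin, te_end = int(f[11]), int(f[12])
    else:
        # Complement strand: columns are (left), end, begin.
        te_begin, te_end = int(f[13]), int(f[12])
    return {
        "id": f[14],
        "strand": strand,
        "q_begin": int(f[5]),
        "q_end": int(f[6]),
        "te_begin": te_begin,
        "te_end": te_end,
    }


def signed_overlap(a_begin: int, a_end: int, b_begin: int, b_end: int) -> int:
    """Inclusive overlap in bp if positive; minus the gap size if negative."""
    return min(a_end, b_end) - max(a_begin, b_begin) + 1


# The three rows from the "before joining" table in this thread.
rows = [parse_rm_row(r) for r in (
    "30291 4.5 0.2 0.4 Chr3 17485555 17489789 (8669366) + VANDAL12 DNA/Mutator 1 4200 (9966) 64678 *",
    "38777 2.6 0.5 0.2 Chr3 17489775 17494536 (8664619) + VANDAL12 DNA/Mutator 3442 7944 (4030) 64679",
    "26487 1.4 0.2 0.0 Chr3 17494533 17497540 (8661615) + VANDAL12 DNA/Mutator 8849 11860 (114) 64680 *",
)]

for a, b in zip(rows, rows[1:]):
    genomic = signed_overlap(a["q_begin"], a["q_end"], b["q_begin"], b["q_end"])
    consensus = signed_overlap(a["te_begin"], a["te_end"], b["te_begin"], b["te_end"])
    print(a["id"], b["id"], "genomic:", genomic, "consensus:", consensus)
```

Counting inclusively, 64678 and 64679 share only 15 bp on the genome but 759 bp in the consensus, while 64679 and 64680 share 4 bp on the genome and are separated by 904 bp in the consensus. That difference in the consensus coordinates is what distinguishes the pair that merged from the pair that did not.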
