GithubHelp home page GithubHelp logo

Comments (7)

SAMtoBAM avatar SAMtoBAM commented on May 30, 2024 1

Ok so that confirms what I was saying, that duplications only detect expansions within the query and not contractions.
Although I could technically invert the detection using the reciprocal alignment and find contractions...If I tried this modification would you like to try run it? and see if you find the alternate events either side?

Hmmm this inversion result is more worrying
It will most likely be due to the filtering process which gets rid of a lot of false positives due to the mapping of repetitive regions.
it requires that during the nucmer alignment filtering to find the global alignment, that in the absence of the inverted alignment found in the less strictly filtered alignment, that no other alignments can be found in this region. So perhaps it is because they are in repetitive regions?
Maybe it is currently a little strict when it comes to large inversions and could be relaxed a little more...
The difference is the sizes between the two genomes would most likely be as you said due to the differences in the haplotypes in these regions and/or obscurity in mapping at the boarders of the inversion, due to again repetitive regions. This later case is not unlikely considering inversion commonly form with repetitive material.
Let me know how the dotplots look for these inversions

Have a good day!

from mumandco.

SAMtoBAM avatar SAMtoBAM commented on May 30, 2024

Hi Gabriele,

Thanks for using MUM&Co
For duplications:
Technically that is how it should function, the duplications are to be from haplotype1. MUM&Co doesn't predict 'expansions' and 'contractions' so duplications only are the addition of material in the query. It some cases the contractions would appear as deletions but I cannot be too sure how it will respond in all cases.
Interestingly, now I am writing this, there isn't any reason why I could not find the opposite and look for expansion. Might add this as a feature.
The co-oridnates for the query chromosome positions of duplications can vary in accuracy I have noticed. Do you find this an issue?

Regarding inversions: correct again, any position in the opposite orientation would appear regardless of which haplotype is the query or the reference

Assuming all the chromosomes are highly co-linear, running one at a time should be ok however in most cases I would recommend running the whole genome vs whole genome.

Did you have a look at my pipeline for the phased haplotype assembly? I hope it can be of help.

Hope that helps

from mumandco.

uceleste avatar uceleste commented on May 30, 2024

Thanks for the quick reply.
As mentioned above, for duplications I performed the same process using haplotype2-chr1 as query and haplotype1-chr1 as reference (so inverting them). The results are reported below.

ref_chr query_chr ref_start ref_stop size SV_type query_start query_stop   ref_chr query_chr ref_start ref_stop size SV_type query_start query_stop
chr1 chr1 1349462 1349653 191 duplication 1725124 1725127   chr1 chr1 4794578 4890703 96125 duplication 2549810 2549811
chr1 chr1 1678135 1682964 4829 duplication 1377849 1377851   chr1 chr1 6419492 6432519 13027 duplication 7035943 7193751
chr1 chr1 3156729 3186550 29821 duplication 4085015 4085016   chr1 chr1 11599672 12011191 411519 duplication 12790274 12790275
chr1 chr1 11331514 11384802 53288 duplication 10232990 10232990   chr1 chr1 13854641 13854779 138 duplication 14990751 14990787
chr1 chr1 11783930 11820088 36158 duplication 11051724 11051725   chr1 chr1 16046772 16112494 65722 duplication 17303064 17303064
chr1 chr1 11888669 11888776 107 duplication 11117719 11117799   chr1 chr1 17970320 17984232 13912 duplication 19237291 19237292
chr1 chr1 15899129 15931090 31961 duplication 14797193 14797194   chr1 chr1 18832015 18833967 1952 duplication 32598786 32598787
chr1 chr1 18225961 18226087 126 duplication 16968211 16968325   chr1 chr1 19519200 19561168 41968 duplication 19822394 19822394
chr1 chr1 25382432 25388788 6356 duplication 25545490 25545493   chr1 chr1 25641447 25647307 5860 duplication 25484693 25484694
chr1 chr1 25475440 25484693 9253 duplication 25647307 25647308   chr1 chr1 27616766 27616992 226 duplication 25978671 25978673
chr1 chr1 26899767 26899827 60 duplication 26233660 26233661   chr1 chr1 30293973 30294139 166 duplication 30904862 30904869
chr1 chr1 33461685 33641555 179870 duplication 33013693 33013694   chr1 chr1 31333977 31343066 9089 duplication 32013518 32013519

So, I can say that I have found 10 duplications on haplotype1-chr1 and 10 duplications on haplotype2-chr1, for a total a 20 different duplications. That fact that the coordinates for the query chromosome positions of duplications can vary in accuracy should not be a big issue.
As regards inversions, comparing haplotype1-chr1 vs. haplotype2-chr1 and haplotype2-chr1 vs. haplotype1-chr1, it seems that some inversions were identified only in one case or only in the other. The results are reported below.

ref_chr query_chr ref_start ref_stop size SV_type query_start query_stop   ref_chr query_chr ref_start ref_stop size SV_type query_start query_stop
chr1 chr1 928392 936829 8437 inversion 922519 937228   chr1 chr1 1078963 2013971 935008 inversion 1078547 1725591
chr1 chr1 1078547 1725591 647044 inversion 1078963 2013971   chr1 chr1 2739937 3999333 1259396 inversion 2450920 3188513
chr1 chr1 3322217 4595464 1273247 inversion 4085015 4619181   chr1 chr1 4007253 4010944 3691 inversion 3196440 3200122
chr1 chr1 6718265 7022914 304649 inversion 6261822 6419491   chr1 chr1 4058381 4060924 2543 inversion 3283498 3298224
chr1 chr1 7694800 7732806 38006 inversion 6921248 6933932   chr1 chr1 4085015 4619181 534166 inversion 3322217 4595464
chr1 chr1 7997744 8008258 10514 inversion 7199138 7251443   chr1 chr1 5398638 5836884 438246 inversion 5195438 6293327
chr1 chr1 10914442 11723952 809510 inversion 10166357 10961509   chr1 chr1 6261822 6419491 157669 inversion 6718265 7022914
chr1 chr1 12790274 13200365 410091 inversion 12011191 12036824   chr1 chr1 6921248 6933932 12684 inversion 7694800 7732806
chr1 chr1 18915584 19153861 238277 inversion 17659104 17897649   chr1 chr1 9665694 10131689 465995 inversion 10422224 10879774
chr1 chr1 19349912 19459708 109796 inversion 18117940 18117941   chr1 chr1 10166357 10955599 789242 inversion 10914442 11729862
chr1 chr1 25365381 25374070 8689 inversion 25521330 25530347   chr1 chr1 16514121 16524182 10061 inversion 17772558 17781891
chr1 chr1 25796661 25805131 8470 inversion 25964739 25973163   chr1 chr1 17659104 17897649 238545 inversion 18915584 19153861
chr1 chr1 25819305 26699887 880582 inversion 25987337 26037960   chr1 chr1 25964739 25973163 8424 inversion 25796661 25805131
chr1 chr1 30808857 30829719 20862 inversion 30197948 30219007   chr1 chr1 25987337 26037960 50623 inversion 25819305 26699887
chr1 chr1 32655857 33014034 358177 inversion 32082789 32393776   chr1 chr1 30197948 30219007 21059 inversion 30808857 30829719
chr1 chr1 33016918 33053334 36416 inversion 32396648 32492220   chr1 chr1 31170104 31680072 509968 inversion 31774371 32309684
chr1 chr1 33218101 33325587 107486 inversion 32589793 32697994   chr1 chr1 32027721 32393776 366055 inversion 32710925 33014034
chr1 chr1 33641556 33855619 214063 inversion 33193505 33406455   chr1 chr1 32469931 32571206 101275 inversion 33165089 33199342
                  chr1 chr1 32589793 32697994 108201 inversion 33218101 33325587
                  chr1 chr1 33193505 33406455 212950 inversion 33641556 33855619

Is there a particular reason why this can happen? However, I can verify the veracity of these cases through a dot plot analysis. I noticed that even in cases of shared identification the size of the event can change, but I think this could be due to the presence of gaps and indels within the inverted region. Could it be plausible?
Thanks for all the information you will be able to give me!
Yes, I'll have a look on your pipeline. I wasn't aware of that, thank you again.

Gabriele

from mumandco.

uceleste avatar uceleste commented on May 30, 2024

Thanks for all this information!
I'm happy to run the program again once you have made the necessary changes and I'll let you know the results of the dot plots.
See you soon!

from mumandco.

SAMtoBAM avatar SAMtoBAM commented on May 30, 2024

Hi there,
Sorry for taking a couple days
Essentially I have just reverted the parsing of the alignment for duplications onto the reciprocal alignment instead, and therefore this should find, If I am not mistaken, contractions should be the opposite on duplications. (but it took me a few hours to re-understand all the intermediate files and filters etc)
So here is a draft version of the script
Is is inside a newer version I will release soon with a vcf output
However it is limited to use with MUMmer4 as it requires to be given the threads option (-t) at the end of the line (I have wrote this in the read me on how to use the MUMmer4 specific version)

Let me know how it goes (it's gzipped just so I can upload it here)
mumandco_v3_MUMmer4_multithreads_contractions.sh.gz

from mumandco.

uceleste avatar uceleste commented on May 30, 2024

Hi!
I tested your new code for duplications using haplotype1-chr1 vs. haplotype2-chr1 and then I compared the result with the old one. The results are reported below.
new mumandco_v3 haplotype1-chr1 vs. haplotype2-chr1:

ref_chr query_chr ref_start ref_stop size SV_type query_start query_stop
chr1 chr1 1349462 1349653 191 duplication 1725124 1725127
chr1 chr1 1678135 1682964 4829 duplication 1377849 1377851
chr1 chr1 3156729 3186550 29821 duplication 4085015 4085016
chr1 chr1 11331514 11384802 53288 duplication 10232990 10232990
chr1 chr1 11783930 11820088 36158 duplication 11051724 11051725
chr1 chr1 11888669 11888776 107 duplication 11117719 11117799
chr1 chr1 15899129 15931090 31961 duplication 14797193 14797194
chr1 chr1 18225961 18226087 126 duplication 16968211 16968325
chr1 chr1 25382432 25388788 6356 duplication 25545490 25545493
chr1 chr1 25475440 25484693 9253 duplication 25647307 25647308
chr1 chr1 26899767 26899827 60 duplication 26233660 26233661
chr1 chr1 33461685 33641555 179870 duplication 33013693 33013694
               
chr1 chr1 2549810 2549811 96125 contraction 4794578 4890703
chr1 chr1 14990751 14990787 138 contraction 13854641 13854779
chr1 chr1 17303064 17303064 65722 contraction 16046772 16112494
chr1 chr1 19237291 19237292 13912 contraction 17970320 17984232
chr1 chr1 19822394 19822394 41968 contraction 19519200 19561168
chr1 chr1 25484693 25484694 5860 contraction 25641447 25647307
chr1 chr1 25978671 25978673 226 contraction 27616766 27616992
chr1 chr1 30904862 30904869 166 contraction 30293973 30294139
chr1 chr1 32013518 32013519 9089 contraction 31333977 31343066
chr1 chr1 32598786 32598787 1952 contraction 18832015 18833967

old mumandco_v2.4.2 haplotype2-chr1 vs. haplotype1-chr1 (inverted):

ref_chr query_chr ref_start ref_stop size SV_type query_start query_stop
chr1 chr1 4794578 4890703 96125 duplication 2549810 2549811
chr1 chr1 6419492 6432519 13027 duplication 7035943 7193751
chr1 chr1 11599672 12011191 411519 duplication 12790274 12790275
chr1 chr1 13854641 13854779 138 duplication 14990751 14990787
chr1 chr1 16046772 16112494 65722 duplication 17303064 17303064
chr1 chr1 17970320 17984232 13912 duplication 19237291 19237292
chr1 chr1 18832015 18833967 1952 duplication 32598786 32598787
chr1 chr1 19519200 19561168 41968 duplication 19822394 19822394
chr1 chr1 25641447 25647307 5860 duplication 25484693 25484694
chr1 chr1 27616766 27616992 226 duplication 25978671 25978673
chr1 chr1 30293973 30294139 166 duplication 30904862 30904869
chr1 chr1 31333977 31343066 9089 duplication 32013518 32013519

As you can see, it appears that the program correctly attributed what were previously identified as duplications between haplotype2-chr1 vs. haplotype1-chr1 (inverted) as contractions in the new output. All but two duplications (second and third row of the last table), which are not among the contractions. Do you have an idea why this small lack? Thanks for your active support.
P.S. I have yet to verify the previously discussed inversions through dot plot analysis.

from mumandco.

SAMtoBAM avatar SAMtoBAM commented on May 30, 2024

Unfortunately I don't have any idea at the moment why some duplications are not labelled as contractions in the opposite as it should be performing the same analysis on essentially the same alignment.
I will try play with a little further before it becomes the latest version, but I'm not sure I will figure it out.

No worries about the inversions, I can wait. If you can, I'd be interested to know about the duplications/contractions too. Would be nice to see!

Thanks Gabriele

from mumandco.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.