Hi, I noticed a problem in the insertion type, that there are long fragments of in

The insertion problem in the final output (mumandco issue, open)

samtobam commented on May 30, 2024

The insertion problem in the final output

from mumandco.

Comments (7)

SAMtoBAM commented on May 30, 2024

Hi there,

That definitely happens on occasion and you are right that it is unusual.

This happens because MUM&Co attempts to find the closest alignment on the other side of the gap where the insertion occurs. In some cases an insertion occurs in the same region as a deletion, so the corresponding end of the insertion region lies on the other side of the deletion. I see this often when a TE has been inserted in my sample into the same region where a different TE is inserted in the reference: the reference TE is not present (the deletion) and another TE is present (the insertion). Are you perhaps finding similar features?

I hope this helps


biomichal commented on May 30, 2024

Hi,
Thanks for your kind reply.
In the situation you described, isn't it a translocation, where the TE in the reference moved to another place and the query TE inserted into the reference position that the reference TE left? If that is the case, I think it is more like a translocation.

Moreover, I checked all the length types for the insertions in my sample, listed below:

Case 1: This is the most frequent case: a large insertion that is also called as a deletion.
***sample.SVs_all.tsv:
Chr01 chr01_6 8106077 8181318 53894 insertion_mobile 1167556 1221450
Chr01 chr01_6 8106077 8181318 75241 deletion_mobile 1167556 1221450
***sample_ref.delta_filter.coords
8096720 8106077 1230804 1221450 9358 9355 97.07 48794144 4968688 0.02 0.19 1 -1 Chr01 chr01_6

Case 2: a large insertion that is also called as a 51 bp deletion. 51 bp is the shortest reference span among the cases where the same reference start and end positions are called as both a deletion and an insertion. In this case, the flanking alignments in sample_ref.delta_filter.coords come from two records.
***sample.SVs_all.tsv:
Chr03 chr03_31 23529160 23529211 51 deletion_novel 695401 701995
Chr03 chr03_31 23529160 23529211 6594 insertion_mobile 695401 701995
***sample_ref.delta_filter.coords
23519765 23529160 686025 695401 9396 9377 97.07 34914772 837808 0.03 1.12 1 1 Chr03 chr03_31
23529211 23541250 701995 714107 12040 12113 96.89 34914772 837808 0.03 1.45 1 1 Chr03 chr03_31

Case 3: the start and end positions in the reference differ but are close together. I am not sure whether it is an accurate insertion.
***sample.SVs_all.tsv:
Chr01 chr01_1 1403988 1403993 9091 insertion_mobile 112266 121357
***sample_ref.delta_filter.coords
1403988 1404931 121357 122295 944 939 99.05 48794144 434142 0.00 0.22 1 1 Chr01 chr01_1

Case 4: a normal insertion with the same start and end position in the reference, but aligned in the same way as Case 2.
***sample.SVs_all.tsv:
Chr01 chr01_2 2140354 2140354 98 insertion_mobile 360176 360274
***sample_ref.delta_filter.coords
2138930 2140354 358756 360176 1425 1421 99.02 48794144 1953865 0.00 0.07 1 1 Chr01 chr01_2
2140354 2158104 360274 377984 17751 17711 97.74 48794144 1953865 0.04 0.91 1 1 Chr01 chr01_2
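
As a rough illustration of how the sizes in Case 2 and Case 4 relate to the flanking alignments, the sketch below subtracts the coordinates of the two adjacent records. It assumes that the first four columns of sample_ref.delta_filter.coords are reference start, reference end, query start and query end, and that both alignments are on the forward strand. `flank_gaps` is a hypothetical helper written for this example, not part of MUM&Co.

```python
def flank_gaps(left_record: str, right_record: str) -> tuple[int, int]:
    """Return (reference gap, query gap) between two adjacent forward-strand alignments.

    Assumed column layout: ref start, ref end, query start, query end, ...
    """
    left = left_record.split()
    right = right_record.split()
    ref_gap = int(right[0]) - int(left[1])  # right ref start minus left ref end
    qry_gap = int(right[2]) - int(left[3])  # right query start minus left query end
    return ref_gap, qry_gap


# Records copied from the Case 2 and Case 4 excerpts in this thread.
case2 = flank_gaps(
    "23519765 23529160 686025 695401 9396 9377 97.07 34914772 837808 0.03 1.12 1 1 Chr03 chr03_31",
    "23529211 23541250 701995 714107 12040 12113 96.89 34914772 837808 0.03 1.45 1 1 Chr03 chr03_31",
)
case4 = flank_gaps(
    "2138930 2140354 358756 360176 1425 1421 99.02 48794144 1953865 0.00 0.07 1 1 Chr01 chr01_2",
    "2140354 2158104 360274 377984 17751 17711 97.74 48794144 1953865 0.04 0.91 1 1 Chr01 chr01_2",
)
print(case2)  # (51, 6594): 51 bp missing from the reference, 6594 bp extra in the query
print(case4)  # (0, 98): no reference gap, 98 bp extra in the query
```

In Case 4 the reference gap is zero, so only an insertion is reported. In Case 2 both gaps are non-zero, which matches the paired deletion and insertion calls.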

Case 5: an abnormal insertion that is also called as a duplication.
***sample.SVs_all.tsv:
Chr03 chr03_20 12106000 12109582 3582 duplication 555388 555514
Chr03 chr03_20 12106000 12109582 5337 insertion_mobile 555388 560725
***sample_ref.delta_filter.coords
12106000 12112229 555514 561733 6230 6220 97.80 34914772 2626911 0.02 0.24 1 1 Chr03 chr03_20
12106000 12128808 560725 583490 22809 22766 98.30 34914772 2626911 0.07 0.87 1 1 Chr03 chr03_20
12080003 12109582 525797 555388 29580 29592 98.00 34914772 2626911 0.08 1.13 1 1 Chr03 chr03_20

The above cases generally represent all the types of insertion in my sample. There are 3400 insertions in my sample. Case 1 and Case 2, where an insertion is also a deletion, account for about 1380 of them, and only a few are Case 5. Only 309 are Case 4, which we think is the standard insertion.
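
A tally like the one above could be reproduced with a small script that groups calls by their reference coordinates and reports which other call types share a locus with each insertion. This is only a sketch: the column order (reference chromosome, query contig, reference start, reference stop, size, SV type, query start, query stop) is inferred from the rows shown in this thread, and `classify_insertions` is a hypothetical helper, not part of MUM&Co.

```python
from collections import defaultdict

# Rows copied from the sample.SVs_all.tsv excerpts in this thread.
ROWS = """\
Chr01 chr01_6 8106077 8181318 53894 insertion_mobile 1167556 1221450
Chr01 chr01_6 8106077 8181318 75241 deletion_mobile 1167556 1221450
Chr03 chr03_31 23529160 23529211 51 deletion_novel 695401 701995
Chr03 chr03_31 23529160 23529211 6594 insertion_mobile 695401 701995
Chr01 chr01_2 2140354 2140354 98 insertion_mobile 360176 360274
Chr03 chr03_20 12106000 12109582 3582 duplication 555388 555514
Chr03 chr03_20 12106000 12109582 5337 insertion_mobile 555388 560725
"""


def classify_insertions(text):
    """Map each insertion locus to the other call types sharing its reference span."""
    by_locus = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumed column order, inferred from the thread (see lead-in).
        ref_chr, _qry, ref_start, ref_stop, _size, sv_type, *_ = line.split()
        by_locus[(ref_chr, int(ref_start), int(ref_stop))].append(sv_type)

    result = {}
    for locus, types in by_locus.items():
        if any(t.startswith("insertion") for t in types):
            # Keep only the base type of co-located calls, e.g. "deletion".
            result[locus] = sorted(
                {t.split("_")[0] for t in types if not t.startswith("insertion")}
            )
    return result


for locus, partners in classify_insertions(ROWS).items():
    print(locus, partners or "insertion only")
```

An empty partner list corresponds to the standalone insertions of Case 4, while `deletion` or `duplication` partners correspond to Cases 1, 2 and 5.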

So, finally, I am still confused about how to use the insertion results here.
Could you give me more advice?

Much appreciated!

Best regards!


SAMtoBAM commented on May 30, 2024

> In the situation you described, isn't it a translocation, where the TE in the reference moved to another place and the query TE inserted into the reference position that the reference TE left? If that is the case, I think it is more like a translocation.

I think you are getting confused about the role of the reference, and you are viewing these events in too simple a context.
Events in the reference, considering it is generally not an ancestral sample, may never have happened in your sample. So TEs involved in deletions often never inserted in the sample, and what is called a deletion in your sample is actually an insertion in the reference that never occurred in your sample. So in many cases it isn't a translocation. Also, if you wanted to identify all TE movement as translocations/transpositions, you would need to know where the TE was and where it has moved. This is a difficult task because TEs are repetitive elements (alignment difficulties), their movement is potentially highly dynamic, and different TEs move by different mechanisms. So in most cases these events are labelled as insertions and deletions.

> The above cases generally represent all the types of insertion in my sample. There are 3400 insertions in my sample. Case 1 and Case 2, where an insertion is also a deletion, account for about 1380 of them, and only a few are Case 5. Only 309 are Case 4, which we think is the standard insertion.

Thanks for the analysis results.
So you describe about half of the total insertions ((1380+5+309)/3400); what are the other half?

Finding deletions associated with these insertions, with distant start and end positions, is in line with what I described in my first comment. The insertion happens within a region of the reference which also contains an unaligned region, which is therefore called as a deletion. I gave the example of a TE that is inserted in the reference but not in your query (the deletion), with another TE inserted in your query instead (your insertion), but this can also occur for other reasons. In some cases it could be a duplication-translocation, which is not uncommon in subtelomeric regions, where a chromosome end is lost and another end is duplicated and gained. Or, as you suggest, it could be an undetected translocation labelled as insertions and deletions. Alternatively, they are just errors in your case.
It is hard for me to say what you are dealing with; genomic backgrounds can vary extensively, so I can only presume.

For your Case 1 example you didn't provide the next alignment, so I am unable to judge the gaps.

Thanks


biomichal commented on May 30, 2024

> So you describe about half of the total insertions ((1380+5+309)/3400); what are the other half?

The other half mostly span short unaligned regions in the reference (the deletion), but these are mostly not called as deletions, unlike the large unaligned regions in Case 1.

> For your Case 1 example you didn't provide the next alignment, so I am unable to judge the gaps.

From what I have learned, there shouldn't be an unaligned region in the reference for an insertion, or in the query for a deletion; it should be a specific position in the reference for an insertion and in the query for a deletion. But things are actually more complex because of TE movement. Still, I think the insertion in Case 1 should be filtered out and considered a deletion, since there is a deletion call in the output.

> It is hard for me to say what you are dealing with; genomic backgrounds can vary extensively, so I can only presume.

The results are from different cultivars of the same species (a plant genome).

Thank you for your explanation.


SAMtoBAM commented on May 30, 2024

> From what I have learned, there shouldn't be an unaligned region in the reference for an insertion, or in the query for a deletion; it should be a specific position in the reference for an insertion and in the query for a deletion. But things are actually more complex because of TE movement. Still, I think the insertion in Case 1 should be filtered out and considered a deletion, since there is a deletion call in the output.

If these cases were only considered deletions, you would miss the insertion, and regions where the reference region is missing AND an insertion occurs are real events. Another example is introgression from a distant species: the introgressed region would have trouble aligning, or its alignment would be filtered out due to poorer alignment quality compared with the rest of the genome, so the region appears to be missing in the reference and the introgressed region is called as an insertion in the same region.
In most cases you can just take the start, i.e. the left-most aligned position, of the insertion. It is just as accurate as the end position; it is simply a reference-biased coordinate system.
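
Taking the left-most aligned position could look like the sketch below, which reduces an insertion row to a single reference coordinate. The column order (reference chromosome, query contig, reference start, reference stop, ...) is an assumption inferred from the rows posted earlier in this thread, and `insertion_breakpoint` is a hypothetical helper, not a MUM&Co function.

```python
def insertion_breakpoint(row: str) -> tuple[str, int]:
    """Return (reference chromosome, left-most reference position) for one SV row.

    Assumed column order: ref chr, query contig, ref start, ref stop, ...
    """
    ref_chr, _qry, ref_start, ref_stop, *_ = row.split()
    return ref_chr, min(int(ref_start), int(ref_stop))


# Insertion rows copied from Case 1 and Case 3 earlier in the thread.
case1 = insertion_breakpoint(
    "Chr01 chr01_6 8106077 8181318 53894 insertion_mobile 1167556 1221450"
)
case3 = insertion_breakpoint(
    "Chr01 chr01_1 1403988 1403993 9091 insertion_mobile 112266 121357"
)
print(case1)  # ('Chr01', 8106077)
print(case3)  # ('Chr01', 1403988)
```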


biomichal commented on May 30, 2024

Taking the left start is a good point for both the insertion and the deletion. But how did it occur in the actual genome? I think it may be a chimeric phenomenon: in some cells it is a normal insertion, but in other cells it is a complete deletion with no insertion. Or the differences are caused by different haplotypes in a diploid genome. In any case, it is more difficult than I thought.


SAMtoBAM commented on May 30, 2024

> But how did it occur in the actual genome? I think it may be a chimeric phenomenon: in some cells it is a normal insertion, but in other cells it is a complete deletion with no insertion. Or the differences are caused by different haplotypes in a diploid genome. In any case, it is more difficult than I thought.

I think you have not understood the several possible, and much simpler, reasons I have given above that may explain how this can occur:
- a reference-specific insertion together with a sample-specific insertion, labelled as a deletion and an insertion respectively (which you showed yourself is the most likely reason, as you saw deletions associated with most of them);
- introgression of a homologous region from another species;
- a mislabelled reciprocal translocation;
- error.
Plus, considering you should be aligning a haplotype-collapsed genome, cell heterogeneity or heterozygosity should not be an issue.

Hope this helps.

