samtobam / mumandco

MUM&Co is a simple bash script that uses Whole Genome Alignment information provided by MUMmer (only v4) to detect Structural Variation

License: GNU General Public License v3.0

Shell 100.00%
sv-detection mummer genome structural-variation

mumandco's Introduction


v3.8 release : DOI

MUM&Co is a simple bash script that uses Whole Genome Alignment information provided by MUMmer (v4) to detect variants.
VERSION >= 3 UPDATE
Only uses MUMmer4 now and adds a thread-count option
Adds a VCF output file, with all calls currently being imprecise
Adds another output file containing the calls alongside the respective DNA impacted; this new step requires a samtools installation
Now also calls the reverse of tandem duplications: tandem contractions (>50bp)

MUM&Co is able to detect:
Deletions, insertions, tandem duplications and tandem contractions (>=50bp & <=150kb)
Inversions (>=1kb) and translocations (>=10kb)

MUM&Co requires installation of MUMmer4 and samtools.
MUM&Co will look for the MUMmer toolkit and samtools on the PATH using 'which'.
An error will be printed and the script will stop if these paths cannot be found.
The paths can also be edited directly in the script if required.
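The 'which'-based lookup described above can be sketched roughly as follows (an illustration only, not the script's actual code; the `require` helper is hypothetical):

```shell
#!/bin/sh
# Sketch of a 'which'-style tool lookup: resolve each required tool on the
# PATH, print where it was found, and abort with an error if it is missing.
require() {
    path=$(command -v "$1") || {
        echo "ERROR: $1 not found in PATH" >&2
        exit 1
    }
    echo "$1 -> $path"
}

# MUM&Co's real targets would be nucmer, delta-filter, show-coords and
# samtools; common tools are checked here so the sketch runs anywhere:
require awk
require sed
```

If a tool lives outside the PATH, the resolved variable can instead be set by hand, which mirrors the "edited directly in the script" option above.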

In order to help with downstream analysis:
rename and re-orient the query genome contigs to correspond to their reference counterparts.
Tools such as RaGOO and Ragout can do this alongside scaffolding of contigs (this is not currently recommended for short-read-based assemblies).

Options:

     -r or --reference_genome          path to reference genome
     -q or --query_genome              path to query genome
     -g or --genome_size               size of genome
     -o or --output                    output prefix (default: mumandco)
     -t or --threads                   thread number (default: 1)
     -ml or --minlen                   minimum length of alignments (default: 50bp)
     -b or --blast                     adds a BLAST step to identify whether insertions or deletions look repetitive or novel

Test run script:

     bash mumandco_v*.sh -r ./yeast.tidy.fa -q ./yeast_tidy_DEL100.fa -g 12500000 -o DEL100_test -t 2 -b

OUTPUT FOLDER:
Folder with alignments used for SV detection
Txt file with summary of SVs detected
TSV file with all the detected SVs
TSV file with all detected SVs plus the DNA associated with the event (all from reference except insertions)
VCF file with all calls currently being imprecise

TSV NOTES:
The last column in the TSV file contains notes.
'complicated' : multiple calls within the same region; generally overlapping insertions and deletions
'double' : several calls at the same coordinates; generally tandem duplications or contractions with multiple copy changes
']chrX:xxxxxx]' : a VCF inspired notation for the association of the translocation fragments with the other fragments
e.g. for chr1 with its right border at 250000bp associated with chr2 at 100000bp,
the note would be as follows for chr1: ']chr2:100000]' and for chr2: '[chr1:250000['
As such, each translocation fragment, called as an event, is now a breakend-like call and will be duplicated if both borders are involved in translocations
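As a hypothetical post-processing example, the partner chromosome and coordinate can be pulled out of the bracketed notes with awk, assuming the note sits in the last tab-separated column as in the examples above (the demo file here stands in for a real *.SVs_all.tsv):

```shell
# Hypothetical helper, not part of MUM&Co: extract the partner chromosome
# and coordinate from a breakend-style note such as ']chr2:100000]'.
printf 'chr1\ttransloc\t]chr2:100000]\nchr2\ttransloc\t[chr1:250000[\n' > demo.tsv

awk -F'\t' '$NF ~ /^[][]/ {
    note = $NF
    gsub(/[][]/, "", note)   # strip brackets: ]chr2:100000] -> chr2:100000
    split(note, a, ":")      # a[1] = partner chromosome, a[2] = coordinate
    print $1 "\tpartner=" a[1] "\tpos=" a[2]
}' demo.tsv
```

Note this drops the bracket orientation (which side of the partner the fragment attaches to); keep the raw note if that matters for your analysis.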

VCF TRA EVENT:
The latter notation for the TSV file is currently being added to the ALT column in the VCF for 'TRA' events.
Currently it is not called a breakend site (it contains no nucleotide at the edge) but can be interpreted similarly.

Note:
MUMmer4 is now required because the hard-wired thread option is not available during alignment with MUMmer3.
The BLAST option (-b / --blast) uses BLAST to search for insertion and deletion events in the reference/query in order to label them as either mobile or novel events.

Reference:
Samuel O’Donnell, Gilles Fischer, MUM&Co: accurate detection of all SV types through whole-genome alignment, Bioinformatics, Volume 36, Issue 10, 15 May 2020, Pages 3242–3243, https://doi.org/10.1093/bioinformatics/btaa115

mumandco's People

Contributors: samtobam

mumandco's Issues

Return to 'multiple threads?' closed issue

Several months ago, you and I corresponded about the need for multithreading and it seems you took that opportunity to generate a new version.

I've come back to this after a while and have found that this new version isn't working for me. I'm getting some very strange errors that I have no idea how to deal with.

These new errors seem to stem from this issue that comes up four times during a run of the test data.

g++: error: /opt/ohpc/pub/compiler/gcc/5.4.0/lib/../lib64/libstdc++.so: No such file or directory
g++: error: /opt/ohpc/pub/compiler/gcc/5.4.0/lib/../lib64/libstdc++.so: No such file or directory
g++: error: /opt/ohpc/pub/compiler/gcc/5.4.0/lib/../lib64/libstdc++.so: No such file or directory
g++: error: /opt/ohpc/pub/compiler/gcc/5.4.0/lib/../lib64/libstdc++.so: No such file or directory

I've done some troubleshooting and I think I've narrowed it down to these lines:

$SHOWCOORDS -T -r -c -l -d -g ""$prefix"_ref".delta_filter > ""$prefix"_ref".delta_filter.coordsg
$SHOWCOORDS -T -r -c -l -d ""$prefix"_ref".delta_filter > ""$prefix"_ref".delta_filter.coords

$NUCMER --threads ${threads} --maxmatch --nosimplify -p ""$prefix"_query" $query_assembly $reference_assembly
$DELTAFILTER -m ""$prefix"_query".delta > ""$prefix"_query".delta_filter
$SHOWCOORDS -T -r -c -l -d -g ""$prefix"_query".delta_filter > ""$prefix"_query".delta_filter.coordsg
$SHOWCOORDS -T -r -c -l -d ""$prefix"_query".delta_filter > ""$prefix"_query".delta_filter.coords

There are four $SHOWCOORDS commands.

Any idea what might be going on here?

This could be a 'my HPCC' problem and likely is but I thought I'd ask.

Originally posted by @davidaray in #12 (comment)
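For what it's worth, errors like the above may indicate a broken or stale compiler module on the node rather than a MUM&Co bug. Some hedged diagnostics one could run (the library path is copied from the error message; nothing here is from the MUM&Co docs):

```shell
# Diagnostic sketch: check whether the library g++ complains about
# actually exists, and how show-coords resolves its shared libraries.
lib=/opt/ohpc/pub/compiler/gcc/5.4.0/lib64/libstdc++.so
if [ -e "$lib" ]; then
    echo "library present: $lib"
else
    echo "library missing: $lib (reload or rebuild the gcc module)"
fi

if command -v show-coords >/dev/null 2>&1; then
    # any 'not found' lines here point at the same missing-library problem
    ldd "$(command -v show-coords)" | grep -i 'not found' \
        || echo "show-coords: all shared libraries resolved"
fi
```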

Excessively long execution time

I'm having issues with a large genome (2.7 Gb) sample, and I think the script might actually be stuck. It's been running for about two weeks now, with no output returned after "My what a large genome you have, this may take some time". The latest files updated in the output folder are *.delta and *.delta_filter (the latter with a size of 0). These two files apparently haven't changed since the day the script started running. What is the script doing? What would be the expected execution time for a genome of this length? Is the script stuck? If possible, how can I solve this?

MUM&Co was run as follows:

bash mumandco_v3.8.sh -r reference.fa -q sample.fa -g 2720000000 -o prefix -t 32 -b

Thanks!

multiple threads?

Trying this out for a project. Seems to be running fine on two mammal genomes but it's taking quite a while to run.

Can this package be run on multiple processors to speed things up?

Thanks.

Can the genomes being compared only include one chromosome?

Hello!
When I use MUMandCo to detect SVs between two different genomes of Arabidopsis thaliana, it shows no SVs. It only works when I separate each Arabidopsis thaliana genome into its 5 chromosomes and detect SVs between two chromosomes, for example chr1 (in the first Arabidopsis thaliana) and chr1 (in the second Arabidopsis thaliana). So I wonder whether the genomes being compared can only include one chromosome?

BLAST Database error: Error: File [reference] not found.

Hi,

I am running mumandco and getting errors that concern BLAST. In the example below I have sample files 12, 13 and 14 in the array. I get the following error for most of the runs: "BLAST Database error: No alias or index file found for nucleotide database [supercontigs_mtDNA_mata.fasta] in search path [/April/mumcotest/12::]". It appears that the last sample file I pass is okay, but I don't get why. Some errors are different, e.g. "BLAST Database error: Error: File (supercontigs_mtDNA_mata.fasta.nhr) not found.", but they all have the same reference file that is the basis for the BLAST option. So I do not fully understand why one would get different errors, nor what I am to do with this error.

I installed mummer4 using conda (conda install bioconda::mummer4) last week, in addition to samtools. I do not really get what is going on, as some sample files have been fine, whereas the vast majority are not being processed and stop at the BLAST stage, throwing out BLAST-database-related issues. I would appreciate any help on the matter.

I have copy-pasted a simplified version of my bash script below.

reference=supercontigs_mtDNA_mata.fasta

#samples
samples=(12 13 14)
sample=${samples[$SLURM_ARRAY_TASK_ID]}
echo "$input/$sample/assembly.fasta"

#set working directory
mkdir "/mumcotest/$sample"
cd "/mumcotest/$sample/"

bash mumandco_v3.8.sh -q $input/$sample/assembly.fasta -r $reference -g 43000000 -o $sample --threads 30 -b
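One hedged guess (not confirmed by the developer): because the script above cd's into a per-sample directory, the relative $reference path may no longer resolve when the BLAST database is built. Resolving it to an absolute path first would rule that out:

```shell
# Workaround sketch: resolve the reference to an absolute path *before*
# changing directory, then pass "$reference" to -r as usual.
# GNU readlink -f resolves the path even if the file itself is elsewhere
# being checked later.
reference=$(readlink -f supercontigs_mtDNA_mata.fasta)
echo "$reference"   # now begins with '/', so it survives any later cd
```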

What does the label 'complicated' mean?

Hello,
I have been using MUMandCo to detect SVs, but I found that some lines in the result file were followed by a 'complicated' label.
These variants are usually labeled insertions or deletions in the SV type column (in two consecutive rows), so I don't know the exact type.
I tried to read the annotations in the source code, but I didn't understand them.
Could you please explain the meaning for me?
(screenshot attached: complicated_label_example)

Analyzing short read assemblies

Dear developers, I intend to use MUM&Co for calling variants between a reference and a newly sequenced strain of Drosophila. I have the contigs and also a scaffolded assembly for the same strain. Looking at your yeast test data, I observe that the reference and query sequences carry the same identifiers. Is this an input requirement for running this tool? How can I proceed with my short-read assembly? I also have pseudo-chromosomes built using RaGOO; can I use these as the query? I also see a warning on the Git page not to use this strategy for short-read assemblies. Is that due to N's? If I want to use longer, accurate contigs for calling variants, what is the best way to rename the contigs?
Hoping for a positive response. Thank you.

syntax error near unexpected token `newline'

Hello,

When I run bash mumandco_v3.7.sh according to this website on Ubuntu, I get the following error:

mumandco_v3.7.sh: line 7: syntax error near unexpected token 'newline'
mumandco_v3.7.sh: line 7: ''

(WeChat screenshot attached)

Any help?

Thanks in advance.

Weird translocation results

I ran MUM&Co to detect SV between two varieties of the same species. I observed multiple large translocations in the output (in addition to other smaller SVs). For example:

ref_chr query_chr       ref_start       ref_stop        size    SV_type query_start     query_stop      info
chr01   CM020633.1      1000    13862942        13861942        transloc        367     14309343        [chr08:5094415[
chr01   CM020633.1      1000    13862942        13861942        transloc        367     14309343        ]chr11:24521807]
chr01   CM020633.1	13875813        23593880        9718067 transloc        14309338        24284010        [chr08:7002249[
chr01   CM020633.1	13875813        23593880        9718067 transloc        14309338        24284010        ]chr10:20453007]
chr01   CM020633.1	23606808        35271472        11664664        transloc        24284030        35962660        [chr10:20468470[
chr01   CM020633.1	23606808        35271472        11664664        transloc        24284030        35962660        ]chr08:14766499]

The sizes of the translocations are 10-13 Mb. The chromosome sizes in this species are 31-43 Mb. I have visualized the nucmer alignment results as a dotplot and the chromosomes look collinear:
(dotplot screenshot attached)

How would you interpret these results?

Error finding translocation fragments

I got an error when running the example provided in the documentation. Under the line saying "Finding translocation fragments", the following error message appears:

cat: '*.transloc_pairing': No such file or directory
rm: cannot remove '*.transloc_pairing': No such file or directory

The program does not stop there but continues running, although I'm not sure if the output is correct, given the above message, and given that the total number of SVs is 101, but 100 deletions are reported.

DEL100_test  Total SVs  = 101
DEL100_test  Deletions  = 100
DEL100_test  Insertions  = 0
DEL100_test  Duplications  = 0
DEL100_test  Inversions  = 0
DEL100_test  Translocations  = 0

Is this output normal? What does it mean that *.transloc_pairing is missing?

How much longer will the running time be when using the -b parameter?

Hello developer,
I set the thread count to 44. According to the log file, the time it takes for the program to reach "Removing deletions and insertions called during global alignment due to inversions" is relatively short. The log then stays at "label INDELS events as novel or mobile elements" / "Building a new DB" and has not been updated for a long time.
I would like to ask about the expected increase in running time when using the -b parameter: could it be several days? I would appreciate it if you could help!

The insertion problem in the final output

Hi,
I noticed a problem with the insertion type: there are long fragments of insertion on the ref side, as shown below (others).
Usually, an insertion in the ref is a point, like (1), from 8650645 to 8650645.
However, how should the others be explained? They are not points but long fragments, so what is the insertion position?

I really got puzzled, hope to understand it.

(1) Chr01 chr01_6 8650645 8650645 1908 insertion_mobile 728231 730139

others:
Chr01 chr01_6 7976205 8045109 3163 insertion_mobile 1259821 1262984
Chr01 chr01_6 8056961 8084210 10526 insertion_mobile 1237503 1248029
Chr01 chr01_6 8086825 8096720 4082 insertion_mobile 1230804 1234886
Chr01 chr01_6 8106077 8181318 53894 insertion_mobile 1167556 1221450
Chr01 chr01_6 8449161 8450114 1073 insertion_mobile 930090 931163

MUMandCo v3.8 aborted when using the example files

Hi,
I'm using mumandco v3.8. I ran the example and got this message:

cat: DEL100_test.insertion_blast: No such file or directory

It aborted after 3 minutes running, threads=4.
Any suggestion? Many thanks in advance!

SV call from multiple genomes comparison

Hi, is it possible to call SVs by comparing multiple genomes with a reference genome? The script can take only a single query genome at a time.
bash mumandco.sh -r ./yeast.tidy.fa -q ./yeast_tidy_DEL100.fa -g 12500000 -o DEL100_test
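A simple wrapper loop is one way around this (a sketch with placeholder query file names; the `echo` makes it a dry run, drop it to actually execute):

```shell
# Run MUM&Co once per query genome against the same reference.
# Each run gets its own output prefix derived from the query file name.
for query in query1.fa query2.fa query3.fa; do
    prefix=$(basename "$query" .fa)_vs_ref
    echo bash mumandco.sh -r ./yeast.tidy.fa -q "$query" -g 12500000 -o "$prefix"
done
```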

VCF header format error

Hi,

There is an error in the header of the VCF generated by MUM&Co: the ##contig lines do not end with a ">"

Here is an example:
##query_contig=<ID=chromosome1,length=1957569
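Until this is fixed upstream, a workaround sketch: append the missing '>' with sed, demonstrated here on a throwaway file (substitute the real VCF, whose name depends on your output prefix; lines already ending in '>' are left alone):

```shell
# Close any ##...contig=<...> header line that lacks its trailing '>'.
printf '##query_contig=<ID=chromosome1,length=1957569\n' > demo_header.vcf
sed -i -e '/^##[A-Za-z_]*contig=</{/>$/!s/$/>/;}' demo_header.vcf
cat demo_header.vcf   # -> ##query_contig=<ID=chromosome1,length=1957569>
```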

MUM&Co thread and optimisation issue

I am writing to bring to your attention a persistent issue I have encountered regarding job terminations, despite submitting them with an increased number of CPUs, utilizing up to 72 CPU cores, along with the command "export OMP_NUM_THREADS=$PBS_NCPUS" within my PBS script.

Given the limitations on walltime at my institution, which is set at 48 hours, I had to explore several options in an attempt to address this matter, including increasing the default CPU (thread) count from within the mumandco_v3.8.sh script, specifically modifying "threads" from "1" to "72." Additionally, I have incorporated the "export OMP_NUM_THREADS=$PBS_NCPUS" command within my PBS script.

Regrettably, these jobs continue to be terminated repeatedly, and the program lacks a resume option. Further examination reveals that the program utilizes a significant amount of memory, approximately 40 to 60GB, yet it fails to efficiently utilize the requested threads, despite my best efforts to configure it for optimal performance.

                            %CPU  WallTime  Time Lim     RSS    mem memlim cpus

98252891 R hj3792 te53 mumco1_n 5 21:08:13 48:00:00 64.4GB 64.4GB 80.0GB 72
98252892 R hj3792 te53 mumco2_n 6 21:06:57 48:00:00 64.5GB 64.5GB 80.0GB 72
98252899 R hj3792 te53 mumco1_n 3 21:06:05 48:00:00 65.6GB 65.6GB 90.0GB 72
98252926 R hj3792 te53 mumco2_n 3 21:05:27 48:00:00 65.0GB 65.0GB 90.0GB 72
98253022 R hj3792 te53 mumco2_n 3 21:06:40 48:00:00 64.9GB 64.9GB 90.0GB 72
98298196 R hj3792 te53 mumco2_n 12 04:01:25 48:00:00 64.3GB 64.3GB 80.0GB 72

Considering the circumstances, it appears increasingly likely that the MUM&Co program may require more than the allocated 48 hours to complete successfully. Without an extension of the walltime, these jobs remain trapped in an unproductive cycle of termination and restart.

Hence, I kindly request your valuable assistance in resolving this matter, which may entail optimizing thread utilization to accommodate the program's resource requirements. Your support in this regard would be invaluable and greatly appreciated.

problems in inversion

Dear the author:
I posted about the insertion problems in the last issue. After continuing to use the tool, I found more problems with inversions.

I have attached part of my results here. In the last column, I used "query_stop - query_start - size", because the inversion length should be the same, or nearly so, in a valid inversion. But actually, it differed a lot.

Take the first as an example:
ref_chr query_chr ref_start ref_stop size V_type query_start query_stop query_stop-query_start-size
Chr01 chr01_14 10466364 10511872 45508 inversion 410599 424301 -31806
There is a 31 kb size difference of the inversion body between the reference side and the query side. How can a valid inversion occur like this?
I extracted the information from the alignment:
10437388 10466364 381802 410599 28977 28798 97.91 48794144 2474131 0.06 1.16 1 1 Chr01 chr01_14

10511872 10513769 424301 426187 1898 1887 95.48 48794144 2474131 0.00 0.08 1 1 Chr01 chr01_14
I guess this inversion was determined by these two alignments, but how can one confirm that the so-called inversion bodies from reference and query are well aligned? From these two alignments, I can only know that the flanking regions of the so-called inversion are well aligned, but not the inversion itself.

In this example, if it is an inversion, there should be an additional alignment like this:
10466364 10511872 424301 410599 ............ Chr01 chr01_14


awk: cmd. line:1: fatal: division by zero attempted

I am getting this awk error at the beginning of the run. I'm not sure what causes it or what its consequences are, since the script continues running afterwards.

Matching chromosomes based on names, using '_' seperator and filtering chrMT

awk: cmd. line:1: fatal: division by zero attempted
awk: cmd. line:1: fatal: division by zero attempted
awk: cmd. line:1: fatal: division by zero attempted
(standard_in) 2: syntax error

Finding alignment gaps
...

I attach the input for reproducibility (input.zip). MUMandCo is run as:

bash mumandco_v2.4.2.sh -r reference.fa -q query.fa -g 110000000 -o "test"

The script is run with blast_step=yes, although this does not seem to alter the output.
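For context, a division-by-zero in awk means some denominator (often a count read from an empty or malformed input) came through as zero. A common defensive pattern, shown generically here (not MUM&Co's actual code), is to guard the denominator:

```shell
# Guarded ratio: print NA instead of aborting when the denominator is 0.
echo "10 0" | awk '{ print ($2 != 0 ? $1 / $2 : "NA") }'   # -> NA
echo "10 5" | awk '{ print ($2 != 0 ? $1 / $2 : "NA") }'   # -> 2
```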

How to understand the inversion result?

Hi,

I am using MUMandCo to call inversions between two genomes, but when I check the result I wonder how to understand the inversion events.

1. The output_ref.delta_filter.coords :
[S1] | [E1] | [S2] | [E2] | [LEN 1] | [LEN 2] | [% IDY] | [LEN R] | [LEN Q] | [COV R] | [COV Q] | [FRM] | [TAGS] |   |  
11,665,052 | 11,665,723 | 9,931,117 | 9,931,788 | 672 | 672 | 99.85 | 80,049,271 | 78,190,240 | 0 | 0 | 1 | 1 | Superscaffold1 | Superscaffold1
11,672,615 | 11,676,403 | 9,937,388 | 9,941,179 | 3,789 | 3,792 | 99.53 | 80,049,271 | 78,190,240 | 0 | 0 | 1 | 1 | Superscaffold1 | Superscaffold1

2. grep 'inversion' output.SVs_all.tsv|head -n1
ref_chr | query_chr | ref_start | ref_stop | size | SV_type | query_start | query_stop
Superscaffold1 | Superscaffold1 | 11,665,723 | 11,672,615 | 6,892 | inversion | 9,931,788 | 9,937,388

  1. The result in output_ref.delta_filter.coords shows 11,665,723-11,672,615 was not aligned, but output.SVs_all.tsv shows it was an inversion?

Details are in the attachment:
MUMandCo_inversion_check.xlsx

Thanks in advance!

Variable number of variants found over runs

I have a reproducibility concern: I've found that when I repeat the same analysis (i.e. same input and options), the number of variants found may vary. I attach the input files reference.txt and query.txt (a file-extension change is required). I run MUM&Co as:

bash mumandco_v2.4.2.sh -r reference.fa -q query.fa -g 110000000 -o test

There is a duplication variant, that sometimes appears in the output file, and sometimes does not. The line corresponding to the variant is:

chromosome_2:3065688-8669002(+)	ctg10:50000-5629894(+)	5595283	5597834	2551	duplication	5569429	5577054

Why this behaviour? Is it normal? This suggests to me that there is some stochasticity involved in the process. Should a seed parameter be made available?

Thank you

No insertions but many deletions found

Hi,
I've tried running this with the latest version (v2.4.2) and MUMmer4 on a ~3 Gb assembly. I also tried running Assemblytics on the prefix_ref.delta file (it matches their ordering of nucmer arguments). The MUMandCo script reported 0 insertions while Assemblytics found several thousand, as summarised below.

SV              count (MUM&Co)   count (Assemblytics)
Total SVs       8563
Deletions       6794             4304
Insertions      0                4583
Duplications    479
Inversions      0
Translocations  1290

The assembly is larger than the reference (3 vs 2.7gb), but still these results are unexpected. Do you have any idea as to why there are 0 reported insertions, or any intermediate files I should keep to check this out further?

Thanks,
Alex

awk: cmd. line:1: fatal: division by zero attempted

I tried running MUM&Co on several query-reference pairs. On several occasions, I got the error:

Matching query and reference chromosomes

awk: cmd. line:1: fatal: division by zero attempted

But, unlike in one of the closed issues here with the same error, the program fails after this. How can I fix this? I noticed that this happened on more evolutionarily distant genome pairs. With more similar genomes, MUM&Co ran fine.

Detection of ~5kb insertions/deletions

Hey,
thank you very much for your tool. I am currently looking into two genomes that have a set of smaller deletions/insertions on one of the chromosomes. The deletions are caused by transposon insertions. When I use a combination of nucmer and bedtools, I recover four locations present in one of the strains that correspond to the five transposons responsible. When I run your bash script, I am able to recover three out of the five, while two transposons are missing. I wonder if you could help me understand what is happening when parsing the nucmer files.
Thanks
MFS

--threads has no default value

Hi,

When I don't use the -t parameter, the bash script reports an error, which suggests that ${threads} does not have a default value.

e.g.

$NUCMER --threads ${threads} --maxmatch --nosimplify -p ""$prefix"_ref" $reference_assembly $query_assembly
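Until the script gains a default, the standard shell idiom would fix this (a sketch of what the guard could look like, not the script's actual code):

```shell
# Fall back to a single thread when -t / --threads was not supplied.
threads=${threads:-1}
echo "using $threads thread(s)"
```

Alternatively, passing -t 1 explicitly works around the error without editing the script.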

Total duplications and inversions between the homologous chromosomes of an haplotype-phased genome assembly

Hello,

I'm using MUMandCO to call SVs between the homologous chromosomes of my haplotype-phased genome assembly. I'm particularly interested in duplications and inversions. I ran the program using haplotype1-chr1 as query and haplotype2-chr1 as reference. This is the result for duplications only.

ref_chr query_chr ref_start ref_stop size SV_type query_start query_stop
chr1 chr1 1349462 1349653 191 duplication 1725124 1725127
chr1 chr1 1678135 1682964 4829 duplication 1377849 1377851
chr1 chr1 3156729 3186550 29821 duplication 4085015 4085016
chr1 chr1 11331514 11384802 53288 duplication 10232990 10232990
chr1 chr1 11783930 11820088 36158 duplication 11051724 11051725
chr1 chr1 11888669 11888776 107 duplication 11117719 11117799
chr1 chr1 15899129 15931090 31961 duplication 14797193 14797194
chr1 chr1 18225961 18226087 126 duplication 16968211 16968325
chr1 chr1 25382432 25388788 6356 duplication 25545490 25545493
chr1 chr1 25475440 25484693 9253 duplication 25647307 25647308
chr1 chr1 26899767 26899827 60 duplication 26233660 26233661
chr1 chr1 33461685 33641555 179870 duplication 33013693 33013694

Is there a way to know if the identified duplicated regions belong to haplotype1-chr1 or to haplotype2-chr1? Or are these duplications all on haplotype1-chr1? Do I need to perform the same process with haplotype2-chr1 as query and haplotype1-chr1 as reference (i.e. inverting them) to get the total duplications?

As regards inversions, a single run should be sufficient to obtain the total number of events. Am I right?

Thank you!

Gabriele

support for other alignment formats

Hello,

Thanks for making this package! We were trying to use MUMandCo with another aligner (which we think is better than MUMmer) and wondered if it can support other alignment formats, specifically standard formats like MAF.

Thanks,

No errors were reported, and the program stopped after running for a while

Hello! I ran this command: bash /Bio_data/zzy/software/MUMandCo/mumandco_v3.8.sh -r common_carp.genome.chr.fin.fasta -q C.auratus.chromosome.chr.fasta -g 1531013968 -t 10 -o sv. But after running for a while, it ended automatically, and a bunch of intermediate files were exported. I checked the log file and there was nothing wrong. I have also uploaded my log file. The same thing happened when I added -b. Do you know what the problem is? Thank you very much!
mumandco.log

Can MUMandCo detect intrachromosomal translocation

I want to detect SVs between two samples on chromosome A09 (only one chromosome).

An error occurs like this:

cat: '*.transloc_pairing': No such file or directory
rm: cannot remove '*.transloc_pairing': No such file or directory

And no translocations were detected, so I wonder whether MUMandCo can detect intrachromosomal translocations.

Size of deletions

Hello,

I have been using MUM&Co to call SVs between two homologous chromosomes of a diploid whole-genome assembly. What I noticed in the output was that, in the case of some deletions, the position of the deletion in the query sequence spans several hundred to several thousand base pairs. An example:

ref_chr query_chr       ref_start       ref_stop        size    SV_type query_start     query_stop
chr1    chr_s1  6025961 6101923 75962   deletion        993756  1156942
chr1    chr_s1  6102123 6102195 72      deletion        1157141 1181170
chr1    chr_s1  6103092 6104354 1262    deletion        1182068 1290816
chr1    chr_s1  6104835 6153341 48506   deletion        1291294 1371023
chr1    chr_s1  6153863 6154203 340     deletion        1371539 1371914
chr1    chr_s1  6155165 6162825 7660    deletion        1372885 1422600
chr1    chr_s1  6163903 6244138 80235   deletion        1423676 1538981
chr1    chr_s1  6244596 6244818 222     deletion        1539439 1539661
chr1    chr_s1  6245401 6307265 61864   deletion        1540244 1628838
chr1    chr_s1  6825101 6864750 39649   deletion        1739211 1739889
chr1    chr_s1  6865801 6934365 68564   deletion        1740286 1751443
chr1    chr_s1  8978749 8981110 2361    deletion        3301351 3301351

In the first case, there is a region of 75 kb on chr1, which corresponds to a region of 163186 bp in chr_s1. Other examples with the same situation are listed as well. However, in the last row, there is an example where the deletion of 2361 bp corresponds to 1 base in the query.

I also observed this in your yeast.tidy dataset, although to a much lesser extent. Do you have any possible explanation for this? Ideally, the deletion would of course span zero bases in the query. When I run my pairs of allelic pseudomolecules, I get the pattern described above for around 30% of the deletions. Additionally, if I align one contig (e.g. 2-4 Mb) to the complete allelic pseudomolecule, the size specified in the last two columns is always less than 50 bp and very often in the single digits. What may be the reason for this different outcome when aligning the full versus one part of the chromosome?

Also, just to be sure, are the coordinates in your TSV files zero-based?

Thank you and best regards,
Linus
