samtobam / mumandco

MUM&Co is a simple bash script that uses Whole Genome Alignment information provided by MUMmer (only v4) to detect Structural Variation

License: GNU General Public License v3.0

Shell 100.00%
sv-detection mummer genome structural-variation

mumandco's Introduction


v3.8 release : DOI

MUM&Co is a simple bash script that uses Whole Genome Alignment information provided by MUMmer (v4) to detect variants.
VERSION >= 3 UPDATE
Only uses MUMmer4 now and adds a thread-count option
Adds a VCF output file, with all calls currently being imprecise
Adds another output file containing the calls alongside the respective DNA impacted; this new step requires a samtools installation
Now also calls the reverse of tandem duplications: tandem contractions (>50bp)

MUM&Co is able to detect:
Deletions, insertions, tandem duplications and tandem contractions (>=50bp & <=150kb)
Inversions (>=1kb) and translocations (>=10kb)

MUM&Co requires installation of MUMmer4 and samtools.
MUM&Co will look for the MUMmer toolkit and samtools on the PATH using 'which'.
An error will be printed and the script will stop if these paths cannot be found.
The paths can also be edited directly in the script if required.
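The 'which'-based lookup described above can be sketched roughly as follows (an illustration only, not the script's actual code; the `require` helper is hypothetical):

```shell
#!/bin/sh
# Sketch of a 'which'-style tool lookup: resolve each required tool on the
# PATH, print where it was found, and abort with an error if it is missing.
require() {
    path=$(command -v "$1") || {
        echo "ERROR: $1 not found in PATH" >&2
        exit 1
    }
    echo "$1 -> $path"
}

# MUM&Co's real targets would be nucmer, delta-filter, show-coords and
# samtools; common tools are checked here so the sketch runs anywhere:
require awk
require sed
```

If a tool lives outside the PATH, the resolved variable can instead be set by hand, which mirrors the "edited directly in the script" option above.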

In order to help with downstream analysis:
rename and re-orient the query genome contigs to correspond to their reference counterparts.
Tools such as RaGOO and Ragout can do this alongside scaffolding of contigs (this is not currently recommended for short-read-based assemblies).

Options:

     -r or --reference_genome          path to reference genome
     -q or --query_genome              path to query genome
     -g or --genome_size               size of genome
     -o or --output                    output prefix (default: mumandco)
     -t or --threads                   thread number (default: 1)
     -ml or --minlen                   minimum length of alignments (default: 50bp)
     -b or --blast                     adds a BLAST step to identify whether insertions or deletions look repetitive or novel

Test run script:

     bash mumandco_v*.sh -r ./yeast.tidy.fa -q ./yeast_tidy_DEL100.fa -g 12500000 -o DEL100_test -t 2 -b

OUTPUT FOLDER:
Folder with alignments used for SV detection
Txt file with summary of SVs detected
TSV file with all the detected SVs
TSV file with all detected SVs plus the DNA associated with the event (all from reference except insertions)
VCF file with all calls currently being imprecise

TSV NOTES:
The last column in the TSV file contains notes.
'complicated' : multiple calls within the same region; generally overlapping insertions and deletions
'double' : several calls at the same coordinates; generally tandem duplications or contractions with multiple copy changes
']chrX:xxxxxx]' : a VCF inspired notation for the association of the translocation fragments with the other fragments
e.g. for chr1 with its right border at 250000bp associated with chr2 at 100000bp,
the note would be as follows for chr1: ']chr2:100000]' and for chr2: '[chr1:250000['
As such, each translocation fragment, called as an event, is now a breakend-like call and will be duplicated if both borders are involved in translocations
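As a hypothetical post-processing example, the partner chromosome and coordinate can be pulled out of the bracketed notes with awk, assuming the note sits in the last tab-separated column as in the examples above (the demo file here stands in for a real *.SVs_all.tsv):

```shell
# Hypothetical helper, not part of MUM&Co: extract the partner chromosome
# and coordinate from a breakend-style note such as ']chr2:100000]'.
printf 'chr1\ttransloc\t]chr2:100000]\nchr2\ttransloc\t[chr1:250000[\n' > demo.tsv

awk -F'\t' '$NF ~ /^[][]/ {
    note = $NF
    gsub(/[][]/, "", note)   # strip brackets: ]chr2:100000] -> chr2:100000
    split(note, a, ":")      # a[1] = partner chromosome, a[2] = coordinate
    print $1 "\tpartner=" a[1] "\tpos=" a[2]
}' demo.tsv
```

Note this drops the bracket orientation (which side of the partner the fragment attaches to); keep the raw note if that matters for your analysis.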

VCF TRA EVENT:
The latter notation for the TSV file is currently being added to the ALT column in the VCF for 'TRA' events.
Currently it is not called a breakend site (it contains no nucleotide at the edge) but can be interpreted similarly.

Note:
MUMmer4 is now required because the hard-wired thread option is not available during alignment with MUMmer3.
The BLAST option (-b / --blast) uses BLAST to search for insertion and deletion events in the reference/query in order to label them as either mobile or novel events.

Reference:
Samuel O’Donnell, Gilles Fischer, MUM&Co: accurate detection of all SV types through whole-genome alignment, Bioinformatics, Volume 36, Issue 10, 15 May 2020, Pages 3242–3243, https://doi.org/10.1093/bioinformatics/btaa115

mumandco's People

Contributors: samtobam

mumandco's Issues

Return to 'multiple threads?' closed issue

Several months ago, you and I corresponded about the need for multithreading and it seems you took that opportunity to generate a new version.

I've come back to this after a while and have found that this new version isn't working for me. I'm getting some very strange errors that I have no idea how to deal with.

These new errors seem to stem from this issue that comes up four times during a run of the test data.

g++: error: /opt/ohpc/pub/compiler/gcc/5.4.0/lib/../lib64/libstdc++.so: No such file or directory
g++: error: /opt/ohpc/pub/compiler/gcc/5.4.0/lib/../lib64/libstdc++.so: No such file or directory
g++: error: /opt/ohpc/pub/compiler/gcc/5.4.0/lib/../lib64/libstdc++.so: No such file or directory
g++: error: /opt/ohpc/pub/compiler/gcc/5.4.0/lib/../lib64/libstdc++.so: No such file or directory

I've done some troubleshooting and I think I've narrowed it down to these lines:

$SHOWCOORDS -T -r -c -l -d -g ""$prefix"_ref".delta_filter > ""$prefix"_ref".delta_filter.coordsg
$SHOWCOORDS -T -r -c -l -d ""$prefix"_ref".delta_filter > ""$prefix"_ref".delta_filter.coords

$NUCMER --threads ${threads} --maxmatch --nosimplify -p ""$prefix"_query" $query_assembly $reference_assembly
$DELTAFILTER -m ""$prefix"_query".delta > ""$prefix"_query".delta_filter
$SHOWCOORDS -T -r -c -l -d -g ""$prefix"_query".delta_filter > ""$prefix"_query".delta_filter.coordsg
$SHOWCOORDS -T -r -c -l -d ""$prefix"_query".delta_filter > ""$prefix"_query".delta_filter.coords

There are four $SHOWCOORDS commands.

Any idea what might be going on here?

This could be a 'my HPCC' problem and likely is but I thought I'd ask.

Originally posted by @davidaray in #12 (comment)
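For what it's worth, errors like the above may indicate a broken or stale compiler module on the node rather than a MUM&Co bug. Some hedged diagnostics one could run (the library path is copied from the error message; nothing here is from the MUM&Co docs):

```shell
# Diagnostic sketch: check whether the library g++ complains about
# actually exists, and how show-coords resolves its shared libraries.
lib=/opt/ohpc/pub/compiler/gcc/5.4.0/lib64/libstdc++.so
if [ -e "$lib" ]; then
    echo "library present: $lib"
else
    echo "library missing: $lib (reload or rebuild the gcc module)"
fi

if command -v show-coords >/dev/null 2>&1; then
    # any 'not found' lines here point at the same missing-library problem
    ldd "$(command -v show-coords)" | grep -i 'not found' \
        || echo "show-coords: all shared libraries resolved"
fi
```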

Excessively long execution time

I'm having issues with a large genome (2.7 Gb) sample, and I think the script might actually be stuck. It's been running for about two weeks now, with no output returned after "My what a large genome you have, this may take some time". The latest files updated in the output folder are *.delta and *.delta_filter (the latter with a size of 0). These two files apparently haven't changed since the day the script started running. What is the script doing? What would be the expected execution time for a genome of this length? Is the script stuck? If possible, how can I solve this?

MUM&Co was run as follows:

bash mumandco_v3.8.sh -r reference.fa -q sample.fa -g 2720000000 -o prefix -t 32 -b

Thanks!

multiple threads?

Trying this out for a project. Seems to be running fine on two mammal genomes but it's taking quite a while to run.

Can this package be run on multiple processors to speed things up?

Thanks.

Can the genomes being compared only include one chromosome?

Hello!
When I use MUMandCo to detect SVs between two different genomes of Arabidopsis thaliana, it shows no SVs. It only works when I separate each Arabidopsis thaliana genome into its 5 chromosomes and detect SVs between two chromosomes, for example chr1 (in the first Arabidopsis thaliana) and chr1 (in the second Arabidopsis thaliana). So I wonder whether the genomes being compared can only include one chromosome?

BLAST Database error: Error: File [reference] not found.

Hi,

I am running mumandco and getting errors that concern BLAST. In the example below I have sample files 12, 13 and 14 in the array. I get the following error for most of the runs: "BLAST Database error: No alias or index file found for nucleotide database [supercontigs_mtDNA_mata.fasta] in search path [/April/mumcotest/12::]". It appears that the last sample file I pass is okay, but I don't get why. Some errors are different, e.g. "BLAST Database error: Error: File (supercontigs_mtDNA_mata.fasta.nhr) not found.", but they all have the same reference file that is the basis for the BLAST option. So I do not fully understand why one would get different errors, nor what I am to do with this error.

I installed mummer4 using conda (conda install bioconda::mummer4) last week, in addition to samtools. I do not really get what is going on, as some sample files have been fine, whereas the vast majority are not being processed and stop at the BLAST stage, throwing out BLAST-database-related issues. I would appreciate any help on the matter.

I have copy-pasted a simplified version of my bash script below.

reference=supercontigs_mtDNA_mata.fasta

#samples
samples=(12 13 14)
sample=${samples[$SLURM_ARRAY_TASK_ID]}
echo "$input/$sample/assembly.fasta"

#set working directory
mkdir "/mumcotest/$sample"
cd "/mumcotest/$sample/"

bash mumandco_v3.8.sh -q $input/$sample/assembly.fasta -r $reference -g 43000000 -o $sample --threads 30 -b
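One hedged guess (not confirmed by the developer): because the script above cd's into a per-sample directory, the relative $reference path may no longer resolve when the BLAST database is built. Resolving it to an absolute path first would rule that out:

```shell
# Workaround sketch: resolve the reference to an absolute path *before*
# changing directory, then pass "$reference" to -r as usual.
# GNU readlink -f resolves the path even if the file itself is elsewhere
# being checked later.
reference=$(readlink -f supercontigs_mtDNA_mata.fasta)
echo "$reference"   # now begins with '/', so it survives any later cd
```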

What does the label 'complicated' mean?

Hello,
I have been using MUMandCo to detect SVs, but I found that some lines in the result file were followed by a 'complicated' label.
These variants are usually labeled insertions or deletions in the SV type column (in two consecutive rows), so I don't know the exact type.
I tried to read the annotations in the source code, but I didn't understand them.
Could you please explain the meaning for me?
(screenshot attached: complicated_label_example)

Analyzing short read assemblies

Dear developers, I intend to use MUM&Co for calling variants between a reference and a newly sequenced strain of Drosophila. I have the contigs and also a scaffolded assembly for the same strain. Looking at your yeast test data, I observe that the reference and query sequences carry the same identifiers. Is this an input requirement for running this tool? How can I proceed with my short-read assembly? I also have pseudo-chromosomes built using RaGOO; can I use these as the query? I also see a warning on the Git page not to use this strategy for short-read assemblies. Is that due to N's? If I want to use longer, accurate contigs for calling variants, what is the best way to rename the contigs?
Hoping for a positive response. Thank you.

syntax error near unexpected token `newline'

Hello,

When I run bash mumandco_v3.7.sh according to this website on Ubuntu, I get the following error:

mumandco_v3.7.sh: line 7: syntax error near unexpected token 'newline'
mumandco_v3.7.sh: line 7: ''

(WeChat screenshot attached)

Any help?

Thanks in advance.

Weird translocation results

I ran MUM&Co to detect SV between two varieties of the same species. I observed multiple large translocations in the output (in addition to other smaller SVs). For example:

ref_chr query_chr       ref_start       ref_stop        size    SV_type query_start     query_stop      info
chr01   CM020633.1      1000    13862942        13861942        transloc        367     14309343        [chr08:5094415[
chr01   CM020633.1      1000    13862942        13861942        transloc        367     14309343        ]chr11:24521807]
chr01   CM020633.1	13875813        23593880        9718067 transloc        14309338        24284010        [chr08:7002249[
chr01   CM020633.1	13875813        23593880        9718067 transloc        14309338        24284010        ]chr10:20453007]
chr01   CM020633.1	23606808        35271472        11664664        transloc        24284030        35962660        [chr10:20468470[
chr01   CM020633.1	23606808        35271472        11664664        transloc        24284030        35962660        ]chr08:14766499]

The sizes of the translocations are 10-13 Mb. The chromosome sizes in this species are 31-43 Mb. I have visualized the nucmer alignment results as a dotplot and the chromosomes look collinear:
(dotplot screenshot attached)

How would you interpret these results?

Error finding translocation fragments

I got an error when running the example provided in the documentation. Under the line saying "Finding translocation fragments", the following error message appears:

cat: '*.transloc_pairing': No such file or directory
rm: cannot remove '*.transloc_pairing': No such file or directory

The program does not stop there but continues running, although I'm not sure if the output is correct, given the above message, and given that the total number of SVs is 101, but 100 deletions are reported.

DEL100_test  Total SVs  = 101
DEL100_test  Deletions  = 100
DEL100_test  Insertions  = 0
DEL100_test  Duplications  = 0
DEL100_test  Inversions  = 0
DEL100_test  Translocations  = 0

Is this output normal? What does it mean that *.transloc_pairing is missing?

How much longer will the running time be when using the -b parameter?

Hello developer,
I set the thread count to 44. According to the log file, the time it takes for the program to reach "Removing deletions and insertions called during global alignment due to inversions" is relatively short. The log then stays at "label INDELS events as novel or mobile elements" / "Building a new DB" and has not been updated for a long time.
I would like to ask about the expected increase in running time when using the -b parameter: could it be several days? I would appreciate it if you could help!

The insertion problem in the final output

Hi,
I noticed a problem with the insertion type: there are long fragments of insertion on the ref side, as shown below (others).
Usually, an insertion in the ref is a point, like (1), from 8650645 to 8650645.
However, how should the others be explained? They are not points but long fragments, so what is the insertion position?

I really got puzzled, hope to understand it.

(1) Chr01 chr01_6 8650645 8650645 1908 insertion_mobile 728231 730139

others:
Chr01 chr01_6 7976205 8045109 3163 insertion_mobile 1259821 1262984
Chr01 chr01_6 8056961 8084210 10526 insertion_mobile 1237503 1248029
Chr01 chr01_6 8086825 8096720 4082 insertion_mobile 1230804 1234886
Chr01 chr01_6 8106077 8181318 53894 insertion_mobile 1167556 1221450
Chr01 chr01_6 8449161 8450114 1073 insertion_mobile 930090 931163

MUMandCo v3.8 aborted when using the example files

Hi,
I'm using mumandco v3.8. I ran the example and got this message:

cat: DEL100_test.insertion_blast: No such file or directory

It aborted after 3 minutes running, threads=4.
Any suggestion? Many thanks in advance!

SV call from multiple genomes comparison

Hi, is it possible to call SVs by comparing multiple genomes with a reference genome? The script can take only a single query genome at a time.
bash mumandco.sh -r ./yeast.tidy.fa -q ./yeast_tidy_DEL100.fa -g 12500000 -o DEL100_test
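A simple wrapper loop is one way around this (a sketch with placeholder query file names; the `echo` makes it a dry run, drop it to actually execute):

```shell
# Run MUM&Co once per query genome against the same reference.
# Each run gets its own output prefix derived from the query file name.
for query in query1.fa query2.fa query3.fa; do
    prefix=$(basename "$query" .fa)_vs_ref
    echo bash mumandco.sh -r ./yeast.tidy.fa -q "$query" -g 12500000 -o "$prefix"
done
```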

VCF header format error

Hi,

There is an error in the header of the VCF generated by MUM&Co: the ##contig lines do not end with a ">"

Here is an example:
##query_contig=<ID=chromosome1,length=1957569
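Until this is fixed upstream, a workaround sketch: append the missing '>' with sed, demonstrated here on a throwaway file (substitute the real VCF, whose name depends on your output prefix; lines already ending in '>' are left alone):

```shell
# Close any ##...contig=<...> header line that lacks its trailing '>'.
printf '##query_contig=<ID=chromosome1,length=1957569\n' > demo_header.vcf
sed -i -e '/^##[A-Za-z_]*contig=</{/>$/!s/$/>/;}' demo_header.vcf
cat demo_header.vcf   # -> ##query_contig=<ID=chromosome1,length=1957569>
```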

MUM&Co thread and optimisation issue

I am writing to bring to your attention a persistent issue I have encountered regarding job terminations, despite submitting them with an increased number of CPUs, utilizing up to 72 CPU cores, along with the command "export OMP_NUM_THREADS=$PBS_NCPUS" within my PBS script.

Given the limitations on walltime at my institution, which is set at 48 hours, I had to explore several options in an attempt to address this matter, including increasing the default CPU (thread) count from within the mumandco_v3.8.sh script, specifically modifying "threads" from "1" to "72." Additionally, I have incorporated the "export OMP_NUM_THREADS=$PBS_NCPUS" command within my PBS script.

Regrettably, these jobs continue to be terminated repeatedly, and the program lacks a resume option. Further examination reveals that the program utilizes a significant amount of memory, approximately 40 to 60GB, yet it fails to efficiently utilize the requested threads, despite my best efforts to configure it for optimal performance.

                            %CPU  WallTime  Time Lim     RSS    mem memlim cpus

98252891 R hj3792 te53 mumco1_n 5 21:08:13 48:00:00 64.4GB 64.4GB 80.0GB 72
98252892 R hj3792 te53 mumco2_n 6 21:06:57 48:00:00 64.5GB 64.5GB 80.0GB 72
98252899 R hj3792 te53 mumco1_n 3 21:06:05 48:00:00 65.6GB 65.6GB 90.0GB 72
98252926 R hj3792 te53 mumco2_n 3 21:05:27 48:00:00 65.0GB 65.0GB 90.0GB 72
98253022 R hj3792 te53 mumco2_n 3 21:06:40 48:00:00 64.9GB 64.9GB 90.0GB 72
98298196 R hj3792 te53 mumco2_n 12 04:01:25 48:00:00 64.3GB 64.3GB 80.0GB 72

Considering the circumstances, it appears increasingly likely that the MUM&Co program may require more than the allocated 48 hours to complete successfully. Without an extension of the walltime, these jobs remain trapped in an unproductive cycle of termination and restart.

Hence, I kindly request your valuable assistance in resolving this matter, which may entail optimizing thread utilization to accommodate the program's resource requirements. Your support in this regard would be invaluable and greatly appreciated.

problems in inversion

Dear the author:
I posted about the insertion problems in the last issue. After continuing to use the tool, I found more problems with inversions.

I have attached part of my results here. In the last column, I used "query_stop - query_start - size", because the inversion length should be the same, or nearly so, in a valid inversion. But actually, it differed a lot.

Take the first as an example:
ref_chr query_chr ref_start ref_stop size V_type query_start query_stop query_stop-query_start-size
Chr01 chr01_14 10466364 10511872 45508 inversion 410599 424301 -31806
There is a 31 kb size difference of the inversion body between the reference side and the query side. How can a valid inversion occur like this?
I extracted the information from the alignment:
10437388 10466364 381802 410599 28977 28798 97.91 48794144 2474131 0.06 1.16 1 1 Chr01 chr01_14

10511872 10513769 424301 426187 1898 1887 95.48 48794144 2474131 0.00 0.08 1 1 Chr01 chr01_14
I guess this inversion was determined by these two alignments, but how can one confirm that the so-called inversion bodies from reference and query are well aligned? From these two alignments, I can only know that the flanking regions of the so-called inversion are well aligned, but not the inversion itself.

In this example, if it is an inversion, there should be an additional alignment like this:
10466364 10511872 424301 410599 ............ Chr01 chr01_14


awk: cmd. line:1: fatal: division by zero attempted

I am getting this awk error at the beginning of the run. I'm not sure what causes it or what its consequences are, since the script continues running afterwards.

Matching chromosomes based on names, using '_' seperator and filtering chrMT

awk: cmd. line:1: fatal: division by zero attempted
awk: cmd. line:1: fatal: division by zero attempted
awk: cmd. line:1: fatal: division by zero attempted
(standard_in) 2: syntax error

Finding alignment gaps
...

I attach the input for reproducibility (input.zip). MUMandCo is run as:

bash mumandco_v2.4.2.sh -r reference.fa -q query.fa -g 110000000 -o "test"

The script is run with blast_step=yes, although this does not seem to alter the output.
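For context, a division-by-zero in awk means some denominator (often a count read from an empty or malformed input) came through as zero. A common defensive pattern, shown generically here (not MUM&Co's actual code), is to guard the denominator:

```shell
# Guarded ratio: print NA instead of aborting when the denominator is 0.
echo "10 0" | awk '{ print ($2 != 0 ? $1 / $2 : "NA") }'   # -> NA
echo "10 5" | awk '{ print ($2 != 0 ? $1 / $2 : "NA") }'   # -> 2
```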

How to understand the inversion result?

Hi,

I am using MUMandCo to call inversions between two genomes, but when I check the result I wonder how to understand the inversion events.

1. The output_ref.delta_filter.coords :
[S1] | [E1] | [S2] | [E2] | [LEN 1] | [LEN 2] | [% IDY] | [LEN R] | [LEN Q] | [COV R] | [COV Q] | [FRM] | [TAGS] |   |  
11,665,052 | 11,665,723 | 9,931,117 | 9,931,788 | 672 | 672 | 99.85 | 80,049,271 | 78,190,240 | 0 | 0 | 1 | 1 | Superscaffold1 | Superscaffold1
11,672,615 | 11,676,403 | 9,937,388 | 9,941,179 | 3,789 | 3,792 | 99.53 | 80,049,271 | 78,190,240 | 0 | 0 | 1 | 1 | Superscaffold1 | Superscaffold1

2. grep 'inversion' output.SVs_all.tsv|head -n1
ref_chr | query_chr | ref_start | ref_stop | size | SV_type | query_start | query_stop
Superscaffold1 | Superscaffold1 | 11,665,723 | 11,672,615 | 6,892 | inversion | 9,931,788 | 9,937,388

  1. The result in output_ref.delta_filter.coords shows 11,665,723-11,672,615 was not aligned, but output.SVs_all.tsv shows it was an inversion?

Details are in the attachment:
MUMandCo_inversion_check.xlsx

Thanks in advance!

Variable number of variants found over runs

I have a reproducibility concern: I've found that when I repeat the same analysis (i.e. same input and options), the number of variants found may vary. I attach the input files reference.txt and query.txt (a file-extension change is required). I run MUM&Co as:

bash mumandco_v2.4.2.sh -r reference.fa -q query.fa -g 110000000 -o test

There is a duplication variant, that sometimes appears in the output file, and sometimes does not. The line corresponding to the variant is:

chromosome_2:3065688-8669002(+)	ctg10:50000-5629894(+)	5595283	5597834	2551	duplication	5569429	5577054

Why this behaviour? Is it normal? This suggests to me that there is some stochasticity involved in the process. Should a seed parameter be made available?

Thank you

No insertions but many deletions found

Hi,
I've tried running this with the latest version (v2.4.2) and MUMmer4 on a ~3 Gb assembly. I also tried running Assemblytics on the prefix_ref.delta file (it matches their ordering of nucmer arguments). The MUMandCo script reported 0 insertions while Assemblytics found several thousand, as summarised below.

SV              count (MUM&Co)   count (Assemblytics)
Total SVs       8563
Deletions       6794             4304
Insertions      0                4583
Duplications    479
Inversions      0
Translocations  1290

The assembly is larger than the reference (3 vs 2.7gb), but still these results are unexpected. Do you have any idea as to why there are 0 reported insertions, or any intermediate files I should keep to check this out further?

Thanks,
Alex

awk: cmd. line:1: fatal: division by zero attempted

I tried running MUM&Co on several query-reference pairs. On several occasions, I got the error:

Matching query and reference chromosomes

awk: cmd. line:1: fatal: division by zero attempted

But, unlike in one of the closed issues here with the same error, the program fails after this. How can I fix this? I noticed that this happened on more evolutionarily distant genome pairs. With more similar genomes, MUM&Co ran fine.

Detection of ~5kb insertions/deletions

Hey,
thank you very much for your tool. I am currently looking into two genomes that have a set of smaller deletions/insertions on one of the chromosomes. The deletions are caused by transposon insertions. When I use a combination of nucmer and bedtools, I recover four locations present in one of the strains that correspond to the five transposons responsible. When I run your bash script, I am able to recover three out of the five, while two transposons are missing. I wonder if you could help me understand what is happening when parsing the nucmer files.
Thanks
MFS

--threads has no default value

Hi,

When I don't use the -t parameter, the bash script reports an error, which suggests that ${threads} does not have a default value.

e.g.

$NUCMER --threads ${threads} --maxmatch --nosimplify -p ""$prefix"_ref" $reference_assembly $query_assembly
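Until the script gains a default, the standard shell idiom would fix this (a sketch of what the guard could look like, not the script's actual code):

```shell
# Fall back to a single thread when -t / --threads was not supplied.
threads=${threads:-1}
echo "using $threads thread(s)"
```

Alternatively, passing -t 1 explicitly works around the error without editing the script.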

Total duplications and inversions between the homologous chromosomes of an haplotype-phased genome assembly

Hello,

I'm using MUMandCO to call SVs between the homologous chromosomes of my haplotype-phased genome assembly. I'm particularly interested in duplications and inversions. I ran the program using haplotype1-chr1 as query and haplotype2-chr1 as reference. This is the result for duplications only.

ref_chr query_chr ref_start ref_stop size SV_type query_start query_stop
chr1 chr1 1349462 1349653 191 duplication 1725124 1725127
chr1 chr1 1678135 1682964 4829 duplication 1377849 1377851
chr1 chr1 3156729 3186550 29821 duplication 4085015 4085016
chr1 chr1 11331514 11384802 53288 duplication 10232990 10232990
chr1 chr1 11783930 11820088 36158 duplication 11051724 11051725
chr1 chr1 11888669 11888776 107 duplication 11117719 11117799
chr1 chr1 15899129 15931090 31961 duplication 14797193 14797194
chr1 chr1 18225961 18226087 126 duplication 16968211 16968325
chr1 chr1 25382432 25388788 6356 duplication 25545490 25545493
chr1 chr1 25475440 25484693 9253 duplication 25647307 25647308
chr1 chr1 26899767 26899827 60 duplication 26233660 26233661
chr1 chr1 33461685 33641555 179870 duplication 33013693 33013694

Is there a way to know if the identified duplicated regions belong to haplotype1-chr1 or to haplotype2-chr1? Or are these duplications all on haplotype1-chr1? Do I need to perform the same process with haplotype2-chr1 as query and haplotype1-chr1 as reference (i.e. inverting them) to get the total duplications?

As regards inversions, a single run should be sufficient to obtain the total number of events. Am I right?

Thank you!

Gabriele

support for other alignment formats

Hello,

Thanks for making this package! We were trying to use MUMandCo with another aligner (which we think is better than MUMmer) and wondered if it can support other alignment formats, specifically standard formats like MAF.

Thanks,

No errors were reported, and the program stopped after running for a while

Hello! I ran this command: bash /Bio_data/zzy/software/MUMandCo/mumandco_v3.8.sh -r common_carp.genome.chr.fin.fasta -q C.auratus.chromosome.chr.fasta -g 1531013968 -t 10 -o sv. But after running for a while, it ended automatically, and a bunch of intermediate files were exported. I checked the log file and there was nothing wrong. I have also uploaded my log file. The same thing happened when I added -b. Do you know what the problem is? Thank you very much!
mumandco.log

Can MUMandCo detect intrachromosomal translocation

I want to detect SVs between two samples on chromosome A09 (only one chromosome).

An error occurs like this:

cat: '*.transloc_pairing': No such file or directory
rm: cannot remove '*.transloc_pairing': No such file or directory

And no translocations were detected, so I wonder whether MUMandCo can detect intrachromosomal translocations.

Size of deletions

Hello,

I have been using MUM&Co to call SVs between two homologous chromosomes of a diploid whole-genome assembly. What I noticed in the output was that, in the case of some deletions, the position of the deletion in the query sequence spans several hundred to several thousand base pairs. An example:

ref_chr query_chr       ref_start       ref_stop        size    SV_type query_start     query_stop
chr1    chr_s1  6025961 6101923 75962   deletion        993756  1156942
chr1    chr_s1  6102123 6102195 72      deletion        1157141 1181170
chr1    chr_s1  6103092 6104354 1262    deletion        1182068 1290816
chr1    chr_s1  6104835 6153341 48506   deletion        1291294 1371023
chr1    chr_s1  6153863 6154203 340     deletion        1371539 1371914
chr1    chr_s1  6155165 6162825 7660    deletion        1372885 1422600
chr1    chr_s1  6163903 6244138 80235   deletion        1423676 1538981
chr1    chr_s1  6244596 6244818 222     deletion        1539439 1539661
chr1    chr_s1  6245401 6307265 61864   deletion        1540244 1628838
chr1    chr_s1  6825101 6864750 39649   deletion        1739211 1739889
chr1    chr_s1  6865801 6934365 68564   deletion        1740286 1751443
chr1    chr_s1  8978749 8981110 2361    deletion        3301351 3301351

In the first case, there is a region of 75 kb on chr1, which corresponds to a region of 163186 bp in chr_s1. Other examples with the same situation are listed as well. However, in the last row, there is an example where the deletion of 2361 bp corresponds to 1 base in the query.

I also observed this in your yeast.tidy dataset, although to a much lesser extent. Do you have any possible explanation for this? Ideally, the deletion would of course span zero bases in the query. When I run my pairs of allelic pseudomolecules, I get the pattern described above for around 30% of the deletions. Additionally, if I align one contig (e.g. 2-4 Mb) to the complete allelic pseudomolecule, the size specified in the last two columns is always less than 50 bp and very often in the single digits. What may be the reason for this different outcome when aligning the full versus one part of the chromosome?

Also, just to be sure, are the coordinates in your TSV files zero-based?

Thank you and best regards,
Linus
