seqan / igenvar

The official repository for the iGenVar project.

License: BSD 3-Clause "New" or "Revised" License

CMake 2.82% C++ 53.06% Python 30.27% R 5.62% Shell 7.61% Perl 0.62%

igenvar's Introduction

SeqAn - The Library for Sequence Analysis

NOTE SeqAn3 is out and hosted in a different repository
We recommend using SeqAn3 for new applications.

What Is SeqAn?

SeqAn is an open source C++ library of efficient algorithms and data structures for the analysis of sequences with the focus on biological data. Our library applies a unique generic design that guarantees high performance, generality, extensibility, and integration with other libraries. SeqAn is easy to use and simplifies the development of new software tools with a minimal loss of performance.

License

The SeqAn library itself, the tests and demos are licensed under the very permissive 3-clause BSD License. The licenses for the applications themselves can be found in the LICENSE files.

Prerequisites

Older compiler versions might work but are neither supported nor tested.

Linux, macOS, FreeBSD

GCC ≥ 11
Clang/LLVM ≥ 15
Intel oneAPI C++ Compiler 2024.0.2 (IntelLLVM)

Windows

Visual C++ ≥ 17.0 / Visual Studio ≥ 2022

Architecture support

Intel/AMD platforms, including optimisations for modern instruction sets (POPCNT, SSE4, AVX2, AVX512)
All Debian release architectures supported, including most ARM and all PowerPC platforms.

Build system

To build tests, demos, and official SeqAn applications you also need CMake ≥ 3.12.

Some official applications might have additional requirements or only work on a subset of platforms.

Documentation Resources

Contact

igenvar's People

Contributors

irallia simonsasse joshuak94 eldariont eseiler joergi-w lanlanla schaudge

igenvar's Issues

[BUG] detect_breakends writes to `err` even if no error occurred.

The application detect_breakends writes results to the err although no error occurred.
This behavior can be seen here #19 and in the detect_breakends_cli_test.

EDIT (29.3.2021): Print output to err only according to the verbose flag (-v, -vv, ...). This is a follow-up to issue #20.

iGenVar - Find SNPs & Indels modeled on GATK

Call SNPs & Indels: Since GATK is currently the best standard, we want to use these methods.
GATK calls SNPs & Indels via Mutect2 and HaplotypeCaller.
They both have in common a local assembly and realignment of active sites.
Mutect2 is a somatic caller and, on the other hand, HaplotypeCaller is a germline caller.
Mutect2 uses the GATK tool FilterMutectCalls after the realignment to filter somatic variants (as opposed to germline variants), sequencing errors, ...
The best approach appears to be translating the HaplotypeCaller tool, written in Java, into C++ for this purpose.

-> generating candidate haplotypes
-> local realignment using the pair HMM Model against the candidate haplotypes -> matrix of likelihoods for each read
-> local assembly: assemble these window aligned reads into an assembly graph of local variation
-> infer variants from assembled haplotypes: "Despite its name, HaplotypeCaller does not actually call haplotypes. Rather, it generates haplotypes as an intermediate step to discover variants at individual loci. Here we describe how the GATK engine determines which alt alleles exist in locally assembled haplotypes." (-> variant quality score model)

Other links:

Mutect2
Talk about GATK HaplotypeCaller methods (PDF Slides)

Issue sketches:

reimplement the local assembly and realignment in C++. This is described in: https://www.biorxiv.org/content/10.1101/861054v1.supplementary-material
reimplement FilterMutectCalls in C++
...

Paper and Articles:

Calling Somatic SNVs and Indels with Mutect2
Somatic calling is NOT simply a difference between two callsets
Pair HMM probabilistic realignment in HaplotypeCaller and Mutect PDF
Description of alignment method and GPU implementation Link

[FEATURE] Cluster junctions by hierarchical clustering

Hierarchical clustering is the clustering method employed by SVIM. It will be implemented in src/modules/clustering/hierarchical_clustering_method.cpp.

Steps:

Generate partitions of junctions lying close to each other on the reference genome (clustering is O(n^2) so it's much faster on a small partition than the entire genome)
Cluster the junctions in each partition using agglomerative hierarchical clustering

We could use https://github.com/cdalitz/hclust-cpp as an existing implementation for the 2nd step.
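The two steps above can be sketched as follows. This is a minimal illustration, not the planned implementation: a junction is reduced to a single reference position, the function names `partition_positions` and `cluster_partition` and the parameters `max_gap` and `cutoff` are hypothetical, and the naive single-linkage loop stands in for whatever linkage the final method (or hclust-cpp) would use.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for a junction: only one reference position is kept.
using position_t = int32_t;

// Step 1: sort the positions and start a new partition wherever the gap
// between two neighbouring positions exceeds max_gap.
std::vector<std::vector<position_t>> partition_positions(std::vector<position_t> positions,
                                                         position_t const max_gap)
{
    std::sort(positions.begin(), positions.end());
    std::vector<std::vector<position_t>> partitions{};
    for (position_t const pos : positions)
    {
        if (partitions.empty() || pos - partitions.back().back() > max_gap)
            partitions.emplace_back();
        partitions.back().push_back(pos);
    }
    return partitions;
}

// Step 2: naive single-linkage agglomerative clustering within one partition.
// The two closest clusters are merged until their distance exceeds cutoff.
std::vector<std::vector<position_t>> cluster_partition(std::vector<position_t> const & partition,
                                                       position_t const cutoff)
{
    std::vector<std::vector<position_t>> clusters{};
    for (position_t const pos : partition)
        clusters.push_back({pos});

    // Single linkage: smallest distance between any two members.
    auto linkage = [](std::vector<position_t> const & a, std::vector<position_t> const & b)
    {
        position_t best = std::abs(a.front() - b.front());
        for (position_t const x : a)
            for (position_t const y : b)
                best = std::min(best, static_cast<position_t>(std::abs(x - y)));
        return best;
    };

    while (clusters.size() > 1)
    {
        std::size_t best_i{0};
        std::size_t best_j{1};
        position_t best_dist = linkage(clusters[0], clusters[1]);
        for (std::size_t i = 0; i < clusters.size(); ++i)
        {
            for (std::size_t j = i + 1; j < clusters.size(); ++j)
            {
                position_t const d = linkage(clusters[i], clusters[j]);
                if (d < best_dist)
                {
                    best_dist = d;
                    best_i = i;
                    best_j = j;
                }
            }
        }
        if (best_dist > cutoff)
            break;
        clusters[best_i].insert(clusters[best_i].end(), clusters[best_j].begin(), clusters[best_j].end());
        clusters.erase(clusters.begin() + static_cast<std::ptrdiff_t>(best_j));
    }
    return clusters;
}
```

Because the pairwise search only ever runs inside one partition, the quadratic cost applies to the partition size rather than to the number of junctions on the whole genome.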

[INFRA] modularisation - directory file management

[FEATURE] SeqAn3: Develop a simple VCF parser.

VCF: Variant Call Format, a file format standard in bioinformatics.
https://en.wikipedia.org/wiki/Variant_Call_Format

Simple means a parser that does not check the content of the columns (e.g. without specific info fields).
Content checks will be added in a follow-up issue.
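As an illustration of what "simple" could mean here, the sketch below splits one data line into the eight fixed VCF columns and keeps every column as an unchecked string. It is plain C++ and does not use any SeqAn3 API; the names `raw_vcf_record` and `parse_vcf_line` are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string>

// The eight fixed VCF columns, stored as raw strings without validation.
struct raw_vcf_record
{
    std::array<std::string, 8> fields; // CHROM POS ID REF ALT QUAL FILTER INFO
};

// Splits one tab-separated data line. Returns std::nullopt for empty lines,
// header lines (starting with '#'), and lines with fewer than eight columns.
// Columns after INFO (FORMAT, samples) are ignored in this sketch.
std::optional<raw_vcf_record> parse_vcf_line(std::string const & line)
{
    if (line.empty() || line.front() == '#')
        return std::nullopt;

    raw_vcf_record record{};
    std::size_t start{0};
    for (std::size_t column = 0; column < record.fields.size(); ++column)
    {
        std::size_t const end = line.find('\t', start);
        record.fields[column] = line.substr(start, end == std::string::npos ? end : end - start);
        if (end == std::string::npos)
        {
            if (column + 1 < record.fields.size())
                return std::nullopt; // too few columns
            break;
        }
        start = end + 1;
    }
    return record;
}
```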

[FEATURE] Call Deletions in long reads via the CIGAR string

Given a BAM file of long reads and a FASTA file of the reference sequence, call the deletions contained in the CIGAR string of every read, without clustering, and report them in VCF format on the command line.

Mandatory requirements for the BAM file:

must be sorted

chromosome
position
CIGAR

Optional fields:

supplementary tag -> split alignment
read mates (e.g. RNEXT)
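The core of this feature, walking a CIGAR string and reporting deletions, might look like the sketch below. It works on the textual CIGAR and a 0-based start position (SAM stores POS 1-based, so a caller would subtract one); the struct `deletion`, the function `deletions_from_cigar` and the `min_length` parameter are hypothetical names for illustration only.

```cpp
#include <cctype>
#include <cstdint>
#include <string>
#include <vector>

struct deletion
{
    int32_t ref_start; // 0-based position of the first deleted reference base
    int32_t length;    // number of deleted reference bases
};

// Walks a textual CIGAR string starting at the 0-based reference position pos
// and reports every D operation of at least min_length bases.
std::vector<deletion> deletions_from_cigar(std::string const & cigar,
                                           int32_t const pos,
                                           int32_t const min_length)
{
    std::vector<deletion> result{};
    int32_t ref_pos{pos};
    int32_t count{0};
    for (char const c : cigar)
    {
        if (std::isdigit(static_cast<unsigned char>(c)))
        {
            count = count * 10 + (c - '0');
            continue;
        }
        switch (c)
        {
            case 'D': // deletion: consumes reference only
                if (count >= min_length)
                    result.push_back({ref_pos, count});
                ref_pos += count;
                break;
            case 'M': case '=': case 'X': case 'N': // other reference-consuming operations
                ref_pos += count;
                break;
            default: // I, S, H, P do not consume reference bases
                break;
        }
        count = 0;
    }
    return result;
}
```

For the CIGAR `10M40D5M` starting at position 100, this reports a single deletion of length 40 beginning at position 110.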

[BUG] Possible problem with the enum validator

If I add the following test, which specifies a clustering method directly, I get an error instead of the expected output (commented out below).
It looks like there is something wrong with our new validator.

TEST_F(detect_breakends, with_simple_clustering_method)
{
    cli_test_result result = execute_app("detect_breakends",
                                         data("simulated.minimap2.hg19.coordsorted_cutoff.sam"),
                                         "detect_breakends_insertion_file_out.fasta",
                                         "-c 0");
    std::string expected_err
    {
        "[Error] Validation failed for option -c/--clustering_method: Value simple_clustering is not one of [].\n"
        // "INS1: Reference\tchr21\t41972616\tForward\tRead\t0\t2294\tForward\tm2257/8161/CCS\n"
        // "INS2: Reference\tchr21\t41972616\tReverse\tRead\t0\t3975\tReverse\tm2257/8161/CCS\n"
        // "BND: Reference\tchr22\t17458417\tForward\tReference\tchr21\t41972615\tForward\tm41327/11677/CCS\n"
        // "BND: Reference\tchr22\t17458418\tForward\tReference\tchr21\t41972616\tForward\tm21263/13017/CCS\n"
        // "BND: Reference\tchr22\t17458418\tForward\tReference\tchr21\t41972616\tForward\tm38637/7161/CCS\n"
        // "Start clustering...\n"
        // "Done with clustering. Found 4 junction clusters.\n"
        // "No refinement was selected.\n"
    };
    EXPECT_NE(result.exit_code, 0);
    // EXPECT_EQ(result.exit_code, 0);
    EXPECT_EQ(result.out, "");
    // EXPECT_EQ(result.out, expected_res);
    EXPECT_EQ(result.err, expected_err);
}

Cluster Deletions

Extend the output option to decide whether we want to write to a file and/or to std::cout.

This issue must be discussed:

The question is whether we want to be able to choose between writing an output file and/or writing to std::cout as well. The following review comment gives a general example and further information.

From a comment in #37 (review)

May I suggest something that does not need to be changed or part of this PR:
You go from functions foo() that print to std::cout to function foo(std::ofstream & out_file) that prints to a file. If you would do
template<typename stream_type>
void foo(stream_type & stream)
{
    stream << // ...
}
You can pass either a file OR std::cout. Writing to a dedicated output file is always very helpful in most applications, but I find that, when writing pipelines, it can be very handy if the output can also be written to std::cout. This of course depends on your app and whether you think it may or may not be executed as part of a pipeline. If not, you can ignore my suggestion; if so, you might want to consider writing to cout unless an output_path is given. :) Just ideas

Note that seqan3 also has an output_stream concept for checking that a type has the << operator.

iGenVar - Call Deletions from long reads

Call SVs: For larger structural variations, we want to combine the various known methods to call deletions.

- ~~Distinguish between long and short reads in the input #17~~
- Call Deletions via the CIGAR string
  - Call Deletions in long reads #25
  - ~~Call Deletions in short reads #27~~
- Call Deletion using split alignment #18
- Cluster junctions by hierarchical clustering #54

EDIT (30.03.2021): We have removed some of the requirements as we are outsourcing them:

short read input #17

Extend the deletion calling to consider split alignments

iGenVar - Write API & CLI tests for all functions

Write API test for

(Definition of done: when Codecoverage is above ...)

modules/clustering/hierarchical_clustering_method.hpp
- partition_junctions()
- split_partition_based_on_mate2()
- junction_distance()
- hierarchical_clustering_method()
modules/clustering/simple_clustering_method.hpp ✅ #12
- simple_clustering_method()
modules/sv_detection_methods/analyze_cigar_method.hpp ✅ #12
- analyze_cigar()
modules/sv_detection_methods/analyze_sa_tag_method.hpp ✅ #12
- split_string()
- retrieve_aligned_segments()
- analyze_aligned_segments()
- analyze_sa_tag()
structures/... ~~#112~~
variant_detection/... ~~#112~~
variant_detection/variant_detection.hpp ✅
- detect_junctions_in_long_reads_sam_file
variant_detection/variant_output.hpp
- find_and_output_variants(..., ostream)
- find_and_output_variants(..., path)
variant_parser/variant_record.hpp

Write CLI tests

add example data ✅ #57
write tests ✅ #4 , #69
Add tests for the help pages. ✅ #71

Implement Code Coverage #118

Resolve App-Template issue seqan/app-template#32 and seqan/app-template#46
-> Codecoverage higher than 85%

Method selection over CLI does not work properly

Hi,

for me, the selection of specific junction detection methods (probably the same for clustering methods) via the CLI does not work properly. As introduced with #45, the set of methods to use can be specified via the -m option. But when I add seqan3::debug_stream << args.methods << '\n'; after the argument parsing:

> build/bin/detect_breakends simulated.minimap2.hg19.coordsorted_cutoff.sam insertion_file_out.fasta -m1
[1,2,3,4,1]
...
The read pair method is not yet implemented.
The read depth method is not yet implemented.
...

Apparently, the parser just appends values to the already initialized std::vector<uint8_t> methods{1, 2, 3, 4}; and executes all methods even though -m1 was called.

Best
David
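One possible workaround, given that the parser appends to whatever the vector already contains, is to leave the vector empty before parsing and to fill in the defaults afterwards only if the user passed no -m. This is a sketch of that idea, not a statement about how the argument parser is meant to be used; the struct name `cmd_arguments` and the function `apply_default_methods` are hypothetical.

```cpp
#include <cstdint>
#include <vector>

struct cmd_arguments
{
    // Deliberately empty: values given via -m are appended to this vector,
    // so pre-filling it with defaults would mix defaults and user input.
    std::vector<uint8_t> methods{};
};

// Call this after argument parsing: the defaults are only used
// if no method was specified on the command line.
void apply_default_methods(cmd_arguments & args)
{
    if (args.methods.empty())
        args.methods = {1, 2, 3, 4};
}
```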

[FEATURE] Add output option to detect_breakends & find_deletions

There are already tests that can test this: #19

[FEATURE] Call SVs in short reads via the CIGAR string and SA tag

Given a BAM file of short reads, call the small deletions contained within the CIGAR string.

[TEST] Create a mini example for different SV types

Add a parameter to specify an SV length; the default remains at 30 bp.
Create a small example for different types of SVs: Deletion, Insertion, Duplication, Translocation, Duplication in the Reference
Create a VCF for the resulting junctions
Test the sam with samtools and look at it via IGV
Write a test case for the example file

A call should look something like this:

./bin/detect_breakends ~/Repos/iGenVar/test/data/mini_example.sam ~/.../mini_out.fasta > ~/.../mini_junctions.vcf -l 10

[TEST] Write tests for different input parameters of detect_breakends.

Tests are missing for some method parameter combinations and for the clustering methods.

[FEATURE] Use the SeqAn3 VCF parser to save the output of find_deletions

This Issue depends on two other issues.

Edit 11.11.2020:
Change test/api/deletion_finding_and_printing_test.cpp by using EXPECT_RANGE_EQ. For more information see: #37 (comment)

[FEATURE] Find Deletions in short reads

needs refinement

[FEATURE] SeqAn3: Develop info fields of the VCF parser.

VCF: Variant Call Format, a file format standard in bioinformatics.

Any key in the info field is allowed, although some subfields are reserved (albeit optional), see wikipedia:
https://en.wikipedia.org/wiki/Variant_Call_Format

We want to use an enum for the reserved subfields.
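A minimal sketch of such an enum is shown below. It covers only a subset of the reserved INFO keys from the VCF specification, and the names `reserved_info_key` and `to_reserved_info_key` are hypothetical. Keys that are not reserved yield an empty optional, since any key remains allowed in the INFO field.

```cpp
#include <optional>
#include <string_view>

// A subset of the INFO keys reserved by the VCF specification.
enum class reserved_info_key
{
    AA,      // ancestral allele
    AC,      // allele count in genotypes
    AF,      // allele frequency
    AN,      // total number of alleles in called genotypes
    DP,      // combined depth across samples
    END,     // end position of the variant
    NS,      // number of samples with data
    SOMATIC  // record is a somatic mutation
};

// Maps a key string to the enum. Returns std::nullopt for keys that are
// not in the subset above; such keys are still valid INFO keys.
std::optional<reserved_info_key> to_reserved_info_key(std::string_view const key)
{
    if (key == "AA") return reserved_info_key::AA;
    if (key == "AC") return reserved_info_key::AC;
    if (key == "AF") return reserved_info_key::AF;
    if (key == "AN") return reserved_info_key::AN;
    if (key == "DP") return reserved_info_key::DP;
    if (key == "END") return reserved_info_key::END;
    if (key == "NS") return reserved_info_key::NS;
    if (key == "SOMATIC") return reserved_info_key::SOMATIC;
    return std::nullopt;
}
```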

[EPIC, FEATURE] Find SNPs & Indels modeled on GATK

This is an Epic itself and needs refinement.

GATK calls SNPs & Indels via Mutect2 and HaplotypeCaller.
They both have in common a local assembly and realignment of active sites.
Mutect2 is a somatic caller and, on the other hand, HaplotypeCaller is a germline caller.
Mutect2 uses the GATK tool FilterMutectCalls after the realignment to filter somatic variants (as opposed to germline variants), sequencing errors, ...
...

Issue sketches:

reimplement the local assembly and realignment in C++. This is described in: https://www.biorxiv.org/content/10.1101/861054v1.supplementary-material
reimplement FilterMutectCalls in C++
...

Paper and Articles:

[FEATURE] Modularize input data

[FEATURE] Decouple junctions and a cluster of junctions

After we added a first simple clustering method
eldariont left a comment:

However, I think that the implementation of the simple clustering method (including the corresponding changes to the junction class) could be improved. Previously, the junction class represented a single junction detected from a single read. Now, it is also used to represent clusters of junctions. IMO, we should represent clusters of junctions with a separate class or a std::vector. Then, we could keep all the cluster-related information, such as the number of supporting reads, the read names, etc. separated from the information on a single junction.

We would like to create this new class.

[FEATURE] Cluster junctions by candidate selection based on voting

This clustering method is used in Vaquita:

"2.2 Candidate merging: SE + PE
Two breakpoints with the same orientation can be merged if both the left and right intervals are adjacent or overlapping. A distance of 50 bases is set by default in assessing adjacency. When two breakpoints are merged, the minimum and maximum positions of each left and right intervals are selected to define the merged breakpoint. The original positions are kept in a list, and the median positions are reported as final positions in the last step. We merge all the breakpoints identified by SE [split-read evidence ] or PE [read-pair evidence] according to this principle. For efficiency, the reference genome is divided into equally sized regions that are 1000 bp by default. The left and right intervals of SVs belong to one or more regions according to their size and genomic coordinates. The entire merging process can be efficiently done by identifying breakpoints in the same region.
[...]
2.5.2 Voting based metric for candidate selection
[...] Instead of using a simple sum of signals from different types of evidence, Vaquita provides an additional metric for candidate selection based on voting. In this scheme, each type of evidence for a breakpoint is checked by a relatively lenient cutoff, and then we calculate the number of evidence types that pass the criteria that we denote as VT. For example, a structural variation with VT = 3 is supported by three evidence types."
Source: Kim, Jongkyu and Reinert, Knut (2017) Vaquita: Fast and Accurate Identification of Structural Variation Using Combined Evidence. In: 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). LIPICS (88). Dagstuhl LIPIcs, Saarbrücken/Wadern, 185(13:1)-198(13:14). ISBN 978-3-95977-050-7
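The merging rule and the voting metric quoted above can be sketched as follows. This is an illustration of the description in the paper, not Vaquita's code: all type and function names are hypothetical, intervals are closed, the region-based indexing and the median reporting are left out, and each evidence type is reduced to a support count with its own cutoff.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct interval
{
    int32_t begin;
    int32_t end;
};

struct breakpoint
{
    interval left;
    interval right;
    bool forward; // orientation
};

// Two intervals are adjacent or overlapping if they lie within max_distance
// bases of each other (50 by default, as in the quoted text).
bool adjacent_or_overlapping(interval const & a, interval const & b, int32_t const max_distance = 50)
{
    return a.begin <= b.end + max_distance && b.begin <= a.end + max_distance;
}

// Breakpoints can be merged if they share the orientation and both their
// left and right intervals are adjacent or overlapping.
bool can_merge(breakpoint const & a, breakpoint const & b)
{
    return a.forward == b.forward &&
           adjacent_or_overlapping(a.left, b.left) &&
           adjacent_or_overlapping(a.right, b.right);
}

// The merged breakpoint spans the minimum and maximum positions of the
// left and right intervals of both inputs.
breakpoint merge(breakpoint const & a, breakpoint const & b)
{
    breakpoint merged{};
    merged.left = {std::min(a.left.begin, b.left.begin), std::max(a.left.end, b.left.end)};
    merged.right = {std::min(a.right.begin, b.right.begin), std::max(a.right.end, b.right.end)};
    merged.forward = a.forward;
    return merged;
}

// One evidence type with its (lenient) cutoff.
struct evidence
{
    int32_t support;
    int32_t cutoff;
};

// VT: the number of evidence types that pass their cutoff.
int vote_count(std::vector<evidence> const & evidences)
{
    return static_cast<int>(std::count_if(evidences.begin(), evidences.end(),
                                          [](evidence const & e) { return e.support >= e.cutoff; }));
}
```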

Tests don't compile on MacOS 11.1 (Big Sur)

Hi,

on a brand new installation of MacOS, I installed the current versions of Xcode, Command Line Tools, cmake (homebrew) and gcc@7 (homebrew). When I follow the installation steps in the README, make succeeds but make test fails with:

Test project /Users/eldarion/Documents/Projects/mpi/iGenVar-build
[ 27%] Built target datasource--simulated.minimap2.hg19.coordsorted_cutoff.sam
[ 36%] Built target iGenVar_lib
[ 60%] Built target googletest
[ 63%] Linking CXX executable junction_detection_test
ld: warning: object file (../googletest/src/googletest-build/lib/libgtest.a(gtest-all.cc.o)) was built for newer macOS version (11.0) than being linked (10.16.2)
ld: warning: object file (../googletest/src/googletest-build/lib/libgtest_main.a(gtest_main.cc.o)) was built for newer macOS version (11.0) than being linked (10.16.2)
Undefined symbols for architecture x86_64:
  "testing::internal::PrintStringTo(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_ostream<char, std::char_traits<char> >*)", referenced from:
      testing::AssertionResult testing::internal::CmpHelperEQ<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) in junction_detection_test.cpp.o
  "testing::internal::GetCapturedStdout[abi:cxx11]()", referenced from:
      junction_detection_fasta_out_not_empty_Test::TestBody()      in junction_detection_test.cpp.o
...
ld: symbol(s) not found for architecture x86_64
collect2: error: ld returned 1 exit status
make[4]: *** [test/api/junction_detection_test] Error 1
make[3]: *** [test/api/CMakeFiles/junction_detection_test.dir/all] Error 2
make[2]: *** [test/CMakeFiles/api_test.dir/rule] Error 2
make[1]: *** [test/CMakeFiles/api_test.dir/rule] Error 2
...

For the full log, see test.log. What stands out to me is the linker's warning of mismatching OS versions but I don't fully understand what that means.

Strangely enough, compiling the seqan3 unit tests as described here works fine. What could be the difference in test setups between seqan3 and the app template that causes this issue only for the iGenVar app?

Cheers
David

Versions:
❯ cmake --version
cmake version 3.19.3
❯ g++-7 --version
g++-7 (Homebrew GCC 7.5.0_3) 7.5.0

[FEATURE] Modularize methods for clustering junctions

iGenVar - Add / Use a VCF format parser from SeqAn3

Add a VCF format parser so that writing VCF can be done more easily.

[FEATURE] Add verbose option to iGenVar

We should discuss whether we want a verbose option. There are some commented-out prints that could be used here.

There are already tests that can test this: #19

EDIT (29.03.2021):
To discuss:
In SeqAn3, -hh is the only exception. All other short IDs are single characters and should throw if we use more.
Idea: new option -v, --verbose, which then takes an uint8_t for the verbose levels and has a default of 0.

EDIT (12.04.2021): We want to use the seqan verbose option from the argument parser

_ spare
-v 1 (Level 1): print ERROR <- default
-v 2 (Level 2): print WARNING, ERROR
-v 3 (Level 3): print INFO, WARNING, ERROR

[TEST] Write API tests for functions in deletion_finding_and_printing.cpp

File: src/find_deletions/deletion_finding_and_printing.cpp

Write tests for:

read_junctions()
print_vcf_header()
print_deletion()

[TEST] Add cli tests

Write command-line interface tests for detect_breakends and find_deletions.
You can use the example test from the app-template: 'fastq_to_fasta_options_test'.

[INFRA] Combine two existing executables into a single monolithic executable

Original idea (2019):

Implementation of caller modules as separate executables
Combination of modules with unix pipes, e.g. ./detect_breakends.cpp | ./find_deletions.cpp > output.vcf

Current plan (2021):

Implementation of caller modules in separate source files that are combined into a single large executable

Therefore, we need to

Remove find_deletions.cpp executable and integrate code into the main executable (currently called detect_breakends.cpp)
Update all associated tests
Rename detect_breakends.cpp into something more generic like iGenVar.cpp

[DOC] Add documentation to our classes and structs.

As correctly noted in review post #68 (review), all member variables and functions still lack any documentation.
This concerns:

A documentation could look like

//! \brief Start position of the alignment.
int32_t pos;

...

/*! \brief Position associated with this AlignedSegment.
 *
 * \return The position stored as int32_t
 */
int32_t get_reference_start() const;

[INFRA] Delete app-template example code

Delete remains of the app-template code:

Depends on #4 .

[FEATURE] Decouple detect_breakends output from functions

Decouple the output printed on stdout from the functions into a separate function for forwarding or saving into output files.

Currently, results are being written on the spot directly to stdout. We want to save the results in a data structure and then output them separately (or write them to a file in a subsequent issue, or forward them).

File: 'junction_detection.cpp'
Related functions: 'analyze_cigar()' & 'analyze_aligned_segments()'
The print should happen at the end of 'detect_junctions_in_alignment_file()' in a new separate function.

[FEATURE] Extend deletion calling by using split alignment.

needs refinement

[FEATURE] Modularize methods for refining breakends

iGenVar - Call SVs in short reads

We have distinguished between short and long reads since issue #43, but have not yet implemented anything for short reads.

[BUG] Mixed up output string.

This bug came out of a discussion of: #49 (comment)

eldariont:
test/api/junction_detection_test.cpp lines +17 to +20:

// Reference\tm2257/8161/CCS\t41972616\tForward\tRead \t0\t2294\tForward\tchr21
// INS from Primary Read - Sequence Type: Reference; Sequence Name: m2257/8161/CCS; Position: 41972616; Orientation: Reverse
//                         Sequence Type: Read; Sequence Name: 0; Position: 3975; Orientation: Reverse
//                         Chromosome: chr21

I'm confused by this string:
// Reference\tm2257/8161/CCS\t41972616\tForward\tRead \t0\t2294\tForward\tchr21
It should consist of breakend1, breakend2 and the read name it was detected from. However, the read name here is chr21 and the reference chromosome of breakend1 is m2257/8161/CCS\t41972616. Those are swapped but I don't immediately see why 😕

This output comes from:
iGenVar/src/detect_breakends/junction_detection.cpp
and are built in:
iGenVar/include/junction.hpp lines 41 to 46:

template <typename stream_t>
inline stream_t operator<<(stream_t && stream, junction const & junc)
{
    stream << junc.get_mate1() << '\t' << junc.get_mate2() << '\t' << junc.get_read_name();
    return stream;
}

iGenVar/include/breakend.hpp lines 35 to 43:

template <typename stream_t>
inline stream_t operator<<(stream_t && stream, breakend const & b)
{
    stream << ((b.seq_type == sequence_type::reference) ? "Reference" : "Read ") << '\t'
           << b.seq_name << '\t'
           << b.position  << '\t'
           << ((b.orientation == strand::forward) ? "Forward" : "Reverse");
    return stream;
}

Probably there is a mixup in the passing of chromosome and read in `retrieve_aligned_segments` or `analyze_cigar`.

[TEST] Write API tests for functions in iGenVar (methods)

File: src/modules/sv_detection_methods/analyze_sa_tag_method.cpp
Write tests for:

split_string()
retrieve_aligned_segments()
analyze_aligned_segments()
analyze_sa_tag()

File: src/modules/sv_detection_methods/analyze_cigar_method.cpp
Write tests for:

analyze_cigar()

File: src/modules/clustering/simple_clustering_method.cpp
Write tests for:

simple_clustering_method()

iGenVar - Cluster SVs (deletions)

Remove duplicates and cluster deletions with similar breakends.

We already have a simple method, but as discussed in #82 (comment), it has some clear disadvantages. The following issues describe different cluster principles.

First we should decouple junctions and a cluster of junctions.

Then we should implement different methods:

1: hierarchical clustering #54,
2: self-balancing binary tree #55,
3: candidate selection based on voting #56

iGenVar - General issues as errors and bits and pieces

[FEATURE] Cluster junctions by a self-balancing binary tree

This clustering method is used in Sniffles:

"Clustering and nested SVs.
To enable the study of closely positioned or nested SVs, Sniffles optionally clusters SVs that are supported by the same set of reads. Note that Sniffles does not fully phase the haplotypes, as it does not consider single-nucleotide polymorphisms or small indels, but rather identifies SVs that occur together. If this option is enabled, Sniffles stores the name of each read that supports an SV in a hash table keyed by the read name, with the list of SVs associated with that read name as the value. The hash table is used to find reads that span more than one event, and later to cluster reads that span one or more of the same variants. In this way Sniffles can cluster two or more events, even if the distance between the events is larger than the read length. Future work will include a full phasing of haplotypes including SVs, single-nucleotide polymorphisms, and other small variants. Details are presented in Supplementary Note 2."
Source: Sedlazeck, F.J., Rescheneder, P., Smolka, M. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 15, 461–468 (2018). https://doi.org/10.1038/s41592-018-0001-7

Supplementary Note 2

"All SVs found during step 1 [detect smaller (<1kb) insertions, deletions and regions with an increased number of mismatches and very short (1-5bp) indels] and 2 [large indels, inversions, duplications, translocations] are stored in a self-balancing binary tree. [...] Sniffles traverses the binary tree to merge SV calls that were caused by the same SV." (Supplementary Note 2, page 14)

"2.2.3 Storing/Clustering of SVs
Sniffles use a self-balancing binary tree to store and merge SV calls. Each node in the tree represents a single SV. The SVs are sorted based on the start coordinate of each SV.
Each time Sniffles detects a read that supports a SV, Sniffles traverses the binary tree to see if that particular SV has been observed before. The current SV call is merged with an already known one if their types (e.g. deletion) are the same and their breakpoints are within the maximum distance D. [...]
In the tree, each SV is represented by the coordinates that it was first found at. However, the coordinates from other reads supporting the same SV are stored as well. To store the SV type Sniffles uses a set of bit flags to enable a fast comparison between different SVs. Furthermore, the bit flags allow Sniffles to assign multiple types and additional information to a single SV, especially for nested SVs. For complex types, we allow inversions or deletions to be merged with a candidate SV as long as they agree on the coordinates. Furthermore, we allow insertions and tandem duplications to be merged since a tandem duplication is an insertion of the same element next to itself.
To account for multiple overlapping SVs or SVs in close proximity, especially if the genome is polyploid in this region as commonly observed in human cancers or plant genomes, Sniffles implements a more thorough tree search to assess whether the current SV has already been observed. Here, Sniffles starts at the current parental node and walks using an in-order traversal search through the sub tree to identify an already stored SV that would match the current one. Note that this does not significantly increase the runtime, since this procedure will generally only be performed on a very small subtree.
If Sniffles does not find the current SV in the tree, it adds it as a new leaf node. Each SV is stored together with the name of the read it was observed in, the strands, the start and stop position of the genome, the start and stop position on the read, the bit-flag for the type and information about the source (split reads, alignment event, noisy region)." (Supplementary Note 2, page 19)
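The store-and-merge behaviour described in the supplementary note can be approximated with a std::multimap keyed by the start coordinate, which is typically implemented as a self-balancing (red-black) tree. The sketch below is an illustration under that assumption and not Sniffles' actual data structure: all names are hypothetical, the SV type is a plain enum instead of bit flags, and the read coordinates, strands and source information are omitted.

```cpp
#include <cstdint>
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

enum class sv_type
{
    deletion,
    insertion
};

struct sv_call
{
    sv_type type;
    int32_t start; // coordinates at which the SV was first found
    int32_t stop;
    std::vector<std::string> read_names; // all reads supporting this SV
};

// SVs sorted by start coordinate.
using sv_tree = std::multimap<int32_t, sv_call>;

// Merges the observation into a known SV if the types match and both
// breakpoints lie within max_distance; otherwise inserts a new node.
// Returns true if the observation was merged into an existing SV.
bool add_or_merge(sv_tree & tree,
                  sv_type const type,
                  int32_t const start,
                  int32_t const stop,
                  std::string const & read_name,
                  int32_t const max_distance)
{
    // Only nodes whose start lies within max_distance need to be inspected.
    auto const first = tree.lower_bound(start - max_distance);
    auto const last = tree.upper_bound(start + max_distance);
    for (auto it = first; it != last; ++it)
    {
        sv_call & known = it->second;
        if (known.type == type && std::abs(known.stop - stop) <= max_distance)
        {
            known.read_names.push_back(read_name);
            return true;
        }
    }
    tree.emplace(start, sv_call{type, start, stop, {read_name}});
    return false;
}
```

Restricting the search to the key range around the start coordinate mirrors the remark in the note that the thorough search is only performed on a very small subtree.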

[BUG,TEST] Empty output file in detect_breakends cli test.

The created output file detect_breakends_insertion_file_out.fasta is empty.
Test whether this is always the case or is the test example inappropriate? Choose a better example.

[FEATURE EPIC] Call SNPs, Indels & SVs with iGenVar

This is an overview of all epics.

As a geneticist for rare diseases, I would like to analyze the deletions in a patient's genome so that the events characteristic of the disease can be detected with the help of databases. This helps to narrow down the diagnosis and to initiate tailored therapies corresponding to the genotype.

This includes the following aspects:

Input: We want to allow short and long reads and deal with them differently.
- Create a Structure for BAM Indexing seqan/product_backlog#88
Algorithms: We want to use various algorithms for the various inputs and outputs.
- Call SNPs & Indels: Since GATK is currently the best standard, we want to use its methods. The HaplotypeCaller tool, written in Java, should be translated into C++ for this purpose.
  => seqan/product_backlog#31
- Call SVs: For larger structural variations, we want to combine the various known methods to call deletions.
  => seqan/product_backlog#32
- Add all Methods of Vaquita
  => seqan/product_backlog#84
~~Output: We want to output the deletions in VCF format using a VCF parser from SeqAn3 (needs to be implemented).~~
~~=> seqan/product_backlog#29~~ ✅
~~We want to modularise the different parts of IGenVar so that the user can decide which methods to use and so that we can compare different combinations of methods more easily.~~
~~=> seqan/product_backlog#44~~ ✅
~~Testing: We want to test all functionalities and also prove this with code coverage.~~
~~=> seqan/product_backlog#30~~ ✅
Refinements, bugs, and requests
=> seqan/product_backlog#24

Input

~~Differentiating between the inputs will be processed in the course of Issue seqan/product_backlog#17.~~ ✅
Create a Structure for BAM Indexing seqan/product_backlog#88

Algorithms

Call SNPs & Indels:

Call SVs:

Distinguish between long and short reads in the input seqan/product_backlog#43 ✅

Call Deletions from long reads seqan/product_backlog#32 ✅

Call Deletions via the CIGAR string ✅
Call Deletion using split alignment seqan/product_backlog#18 ✅
Cluster Deletions by hierarchical clustering seqan/product_backlog#54 ✅

Call Insertions from long reads seqan/product_backlog#93 ✅

Add all Methods of Vaquita seqan/product_backlog#84

Call SVs in short reads seqan/product_backlog#17

Cluster SVs: seqan/product_backlog#26

hierarchical clustering seqan/product_backlog#54 seqan/product_backlog#125 ✅
...

Refinement

TODO... (sViper, ...)

Output

~~We need to decouple the output from the functionality #6 so that we can write it to an output file #8 with an output option seqan/product_backlog#21.~~
~~Then a VCF parser has to be developed in SeqAn3 #9 #10, which we want to use for iGenVar #11.~~ ✅

Testing ✅

We want to check the code with CLI seqan/product_backlog#4 and API tests seqan/product_backlog#12 seqan/product_backlog#13 and cover it completely.
-> We now have a codecoverage of > 85%! seqan/product_backlog#116 ✅
~~In order to implement the CodeCoverage, we are waiting for an update in the app template: seqan/app-template#30.~~
Update: CLI tests are implemented. 🎉

Refinements, bugs, and requests

We want fully documented code. seqan/product_backlog#3
A verbose option would be nice. -> seqan/sharg-parser#78 -> seqan/product_backlog#20, seqan/product_backlog#23

[FEATURE] Modularize methods for detecting junctions

[TEST] Add tests for the help pages.

[MISC] Refactor representation of insertions

Current implementation

A breakend is a directed position on the reference genome or a read, e.g. chr1, position 1000, positive strand or read 7:position 210, negative strand.
A novel adjacency is a pair of breakends and represents the connection of two distant positions on the genome or a read.

Deletions can be represented as adjacencies between distant positions on the same chromosome, e.g. chr1:1000 -> chr1: 2000.
Insertions can be represented as two adjacencies:

Adjacency 1 between the insertion location and the start of the insertion sequence on the read, e.g. chr1:1000 -> read7:210
Adjacency 2 between the end of the insertion sequence on the read and the insertion location, e.g. read7:310 -> chr1:1001

This representation of insertions is problematic because:

A single event (the insertion) is represented by two novel adjacencies.
We need both adjacencies to understand what is going on.
Insertions are hard to cluster because we would need to merge adjacencies to different reads (e.g. chr1:1000 -> read7:210 and chr1:1000 -> read8:430 could come from the same insertion).

Desired implementation

A breakend is a directed position on the reference genome, e.g. chr1, position 1000, positive strand.
A novel adjacency is a pair of breakends and represents the connection of two distant genomic positions. Each novel adjacency has a field insertion_sequence that (optionally) stores additional bases inserted between the two joined genomic positions.
As before, deletions can be represented as adjacencies between distant positions on the same chromosome, e.g. chr1:1000 -> chr1: 2000.
Now, insertions can be represented as adjacencies between two neighboring genomic positions, e.g. chr1:1000 -> chr1:1001, with the insertion sequence stored in the respective field.
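The desired representation could be sketched with two plain structs. The member and function names below are hypothetical, except for `insertion_sequence`, which is the field name proposed above; both breakends are given the forward strand purely for illustration.

```cpp
#include <cstdint>
#include <string>

enum class strand
{
    forward,
    reverse
};

// A directed position on the reference genome.
struct breakend
{
    std::string chromosome;
    int32_t position;
    strand orientation;
};

// A pair of breakends plus the bases (optionally) inserted between them.
struct novel_adjacency
{
    breakend mate1;
    breakend mate2;
    std::string insertion_sequence; // empty if no bases are inserted
};

// Deletion: adjacency between two distant positions on the same chromosome.
novel_adjacency make_deletion(std::string const & chromosome, int32_t const start, int32_t const end)
{
    return novel_adjacency{{chromosome, start, strand::forward}, {chromosome, end, strand::forward}, ""};
}

// Insertion: adjacency between two neighbouring positions with the
// inserted bases stored in insertion_sequence.
novel_adjacency make_insertion(std::string const & chromosome,
                               int32_t const position,
                               std::string const & sequence)
{
    return novel_adjacency{{chromosome, position, strand::forward},
                           {chromosome, position + 1, strand::forward},
                           sequence};
}

bool is_insertion(novel_adjacency const & adjacency)
{
    return adjacency.mate1.chromosome == adjacency.mate2.chromosome &&
           adjacency.mate2.position == adjacency.mate1.position + 1 &&
           !adjacency.insertion_sequence.empty();
}
```

With this layout, two reads supporting the same insertion produce adjacencies with identical breakends (e.g. chr1:1000 -> chr1:1001), so they can be clustered like deletions instead of being tied to read-specific coordinates.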

[DOC] Document code

Document functions of detect_breakends & find_deletions.

[FEATURE] Save find_deletions output in a file

Decouple the output printed on stdout from the functions into a separate function for forwarding or saving into output files.

Currently, results are being written on the spot directly to stdout. We want to save the results in a VCF file (without any parser checks, just a simple tab-separated file).
We will do the checks with a VCF parser in a later issue.

File: 'src/find_deletions/deletion_finding_and_printing.cpp'
Related functions: 'find_and_print_deletions()' & 'print_deletion()' & 'print_vcf_header()'

iGenVar - Modularisation of iGenVar

We want to modularise the different parts of iGenVar so that the user can decide which methods to use and so that we can compare different combinations of methods more easily. This also gives us the possibility to easily add a module to iGenVar, e.g. a new cluster method.

This affects

Input Data
methods for detecting junctions
methods for clustering junctions
methods for refining breakends
move everything into different directories