seqan / seqan

SeqAn's official repository.

Home Page: https://www.seqan.de

License: Other

CMake 1.30% C++ 64.50% Shell 0.25% Python 3.09% Awk 0.02% C 1.88% Makefile 0.02% R 0.04% CSS 0.72% JavaScript 1.95% HTML 2.54% TeX 0.02% GLSL 0.01% Tcl 0.01% PHP 0.01% Batchfile 0.04% Roff 0.46% POV-Ray SDL 22.87% Less 0.30%
cpp14 bioinfomatics high-performance simd sequence-alignments alignment bwt indexing suffixarray htslib

seqan's Introduction

SeqAn - The Library for Sequence Analysis

NOTE
SeqAn3 is out and hosted in a different repository
We recommend using SeqAn3 for new applications.

What Is SeqAn?

SeqAn is an open source C++ library of efficient algorithms and data structures for the analysis of sequences with the focus on biological data. Our library applies a unique generic design that guarantees high performance, generality, extensibility, and integration with other libraries. SeqAn is easy to use and simplifies the development of new software tools with a minimal loss of performance.

License

The SeqAn library itself, the tests and demos are licensed under the very permissive 3-clause BSD License. The licenses for the applications themselves can be found in the LICENSE files.

Prerequisites

Older compiler versions might work but are neither supported nor tested.

Linux, macOS, FreeBSD

  • GCC ≥ 11
  • Clang/LLVM ≥ 15
  • Intel oneAPI C++ Compiler 2024.0.2 (IntelLLVM)

Windows

  • Visual C++ ≥ 17.0 / Visual Studio ≥ 2022

Architecture support

  • Intel/AMD platforms, including optimisations for modern instruction sets (POPCNT, SSE4, AVX2, AVX512)
  • All Debian release architectures supported, including most ARM and all PowerPC platforms.

Build system

  • To build tests, demos, and official SeqAn applications you also need CMake ≥ 3.12.

Some official applications might have additional requirements or only work on a subset of platforms.

seqan's People

Contributors

aiche, beekalam, bkahlert, catkira, cpockrandt, dependabot[bot], eseiler, esiragusa, gurgese, h-2, hannespetur, holtgrewe, joergi-w, kreinert, ktrappe, lkuchenb, marehr, mr-c, oyasnev, philliplab, rrahn, rrwick, sgssgene, smehringer, soapgentoo, temehi, weese, wtwhite, xenigmax, xp3i4

seqan's Issues

goDown() a std::string for FMIndex iterator does not compile

Suppose I have an iterator, it, over some index and I wish to descend it using goDown(it, x), where x is some string. x could be a seqan string, an infix, or a std::string. For an ESA or WOTD index this all works just fine. For an FMIndex only the seqan types work; the std::string version fails to compile. See the code below:

I'm using gcc 4.8.2 with the latest git version of seqan.

#include <iostream>
#include <string>

#include <seqan/index.h>

using namespace seqan;

typedef String< Dna5 > string_t;
typedef StringSet< string_t > string_set_t;



template< typename Iterator >
void
descend( Iterator i ) {
    std::cout << representative( i ) << "\n";
    if( goDown( i ) ) {
        while( true ) {
            descend( i );
            if( ! goRight( i ) ) {
                break;
            }
        }
    }
}


template< typename Index >
void
test_index( Index index ) {
    typename Iterator< Index, TopDown<> >::Type iterator( index );
    descend( iterator );
    goDown( iterator, string_t( "AC" ) );
    goDown( iterator, 'A' );
    goDown( iterator, "AC" );
    goDown( iterator, std::string( "AC" ) ); //fails to compile for FMIndex
}


int main( int argc, char * argv[] ) {
    string_set_t seqs;
    appendValue( seqs, "AC" );
    appendValue( seqs, "A" );
    appendValue( seqs, "ACGT" );
    test_index( Index< string_set_t, IndexEsa<> >( seqs ) );
    test_index( Index< string_set_t, IndexWotd<> >( seqs ) );
    test_index( Index< string_set_t, FMIndex<> >( seqs ) );
    return 0;
}

Cannot use on cygwin

Hi. Thanks for your great work on seqan.

I am trying to use seqan on cygwin, but everything fails to build because seqan/system.h includes aio.h, which is not available on cygwin.

Is there any way to make it work? Probably some macro to disable asynchronous IO?

Thanks in advance.

Bug in VCF output

For multiple reference sequences, the writeRecord(Vcf) always writes the ID of the first reference, even for variants on other references. So I think there is a bug in the

streamWriteBlock(stream, &(_vcfIOContext.sequenceNames)[record.rID][0],
length((_vcfIOContext.sequenceNames)[record.rID]));

part (write_vcf.h, line 161) because the function writes the same ID for different record.rID values:

correct ID:
rId 1
Id gi|81239530|gb|CP000034.1|

output within writeRecord:
rID 1
ID gi|170079663|ref|NC_010473.1|37
rID 0
ID gi|170079663|ref|NC_010473.1|#""0

cheers, Kathrin

[Docs] Fix multiple documented functions

In class Gaps, the function iter is listed under both Interface Function Overview and Interface Functions Inherited From. The same issue probably applies to other classes as well.

Header guard mismatch

While compiling OpenMS with clang 3.4 a couple of warnings were generated due to mismatches of the header guards in seqan (see below). We actually use SeqAn 1.4.1 but a quick check showed that some of the mismatches are still in master. Please also note that we only use parts of seqan, so that is not a complete list.

<path-to-seqan-installation>/include/seqan/basic/iterator_position.h:37:9: warning: 'SEQAN_CORE_INCLUDE_SEQAN_BASIC_ITERATOR_POSITION_H_x' is used as a header guard here, followed by #define of a different macro [-Wheader-guard]
#ifndef SEQAN_CORE_INCLUDE_SEQAN_BASIC_ITERATOR_POSITION_H_x
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<path-to-seqan-installation>/include/seqan/basic/iterator_position.h:38:9: note: 'SEQAN_CORE_INCLUDE_SEQAN_BASIC_ITERATOR_POSITION_H_' is defined here; did you mean 'SEQAN_CORE_INCLUDE_SEQAN_BASIC_ITERATOR_POSITION_H_x'?
#define SEQAN_CORE_INCLUDE_SEQAN_BASIC_ITERATOR_POSITION_H_
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        SEQAN_CORE_INCLUDE_SEQAN_BASIC_ITERATOR_POSITION_H_x
<path-to-seqan-installation>/include/seqan/sequence/string_set_dependent_generous.h:38:9: warning: 'SEQAN_SEQUENCE_STRING_SET_DEPENDENT_GENEROUS_H_' is used as a header guard here, followed by #define of a different macro
      [-Wheader-guard]
#ifndef SEQAN_SEQUENCE_STRING_SET_DEPENDENT_GENEROUS_H_
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<path-to-seqan-installation>/include/seqan/sequence/string_set_dependent_generous.h:39:9: note: 'SEQAN_SEQUENCE_STRING_SET_DEPENDENT_GENEROUSH_' is defined here; did you mean 'SEQAN_SEQUENCE_STRING_SET_DEPENDENT_GENEROUS_H_'?
#define SEQAN_SEQUENCE_STRING_SET_DEPENDENT_GENEROUSH_
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        SEQAN_SEQUENCE_STRING_SET_DEPENDENT_GENEROUS_H_
<path-to-seqan-installation>/include/seqan/align/alignment_operations.h:37:9: warning: 'SEQAN_CORE_INCLUDE_SEQAN_ALIGN_ALIGNMENT_OPERATIONS_H_' is used as a header guard here, followed by #define of a different macro
      [-Wheader-guard]
#ifndef SEQAN_CORE_INCLUDE_SEQAN_ALIGN_ALIGNMENT_OPERATIONS_H_
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<path-to-seqan-installation>/include/seqan/align/alignment_operations.h:38:9: note: 'SEQANCORE_INCLUDE_SEQAN_ALIGN_ALIGNMENT_OPERATIONS_H_' is defined here; did you mean 'SEQAN_CORE_INCLUDE_SEQAN_ALIGN_ALIGNMENT_OPERATIONS_H_'?
#define SEQANCORE_INCLUDE_SEQAN_ALIGN_ALIGNMENT_OPERATIONS_H_
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        SEQAN_CORE_INCLUDE_SEQAN_ALIGN_ALIGNMENT_OPERATIONS_H_

ValueSize<> for finite alphabet with more than 256 values

On alphabets with more than 256 characters, the ValueSize<> metafunction does not work properly because its return type is limited to 8 bits.

#include <iostream>

#include <seqan/basic.h>

using namespace seqan;

int main()
{
    // Define an alphabet with 259 characters.
    typedef SimpleType<unsigned, Finite<259> > TAlph;
    // Prints 3 instead of the expected 259.
    std::cout << static_cast<unsigned>(ValueSize<TAlph>::VALUE) << std::endl;
    return 0;
}
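
The reported output of 3 is what one would expect if 259 were reduced modulo 256 by being stored in an 8-bit type. The following standalone sketch reproduces only that arithmetic with the standard library; truncateToByte is a made-up name for illustration and is not part of SeqAn.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of SeqAn: narrows an alphabet size to
// 8 bits, the way an 8-bit return type would.
constexpr unsigned truncateToByte(unsigned alphabetSize)
{
    return static_cast<std::uint8_t>(alphabetSize);
}

static_assert(truncateToByte(259) == 3, "259 mod 256 is 3");
static_assert(truncateToByte(256) == 0, "256 wraps around to 0");
static_assert(truncateToByte(255) == 255, "values below 256 are unchanged");
```

Any alphabet size of 256 or more is affected in the same way, so a wider return type would be needed to represent it.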

Clarify usage of const-ness together with Dependent StringSet.

@weese, we should talk about const-ness together with dependent string sets.

The sets themselves can be modifiable while the strings are const. I think the current code does not allow for all possible/important cases.

Also, getValueById() does not work properly with const string sets of all specializations.

FragmentStore: loading UCSC annotations

Hi! I was trying to write a simple converter from UCSC's files knownGene.txt and knownIsoforms.txt into GTF. But something went wrong and I cannot figure out why.

I've downloaded the UCSC files from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz and http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownIsoforms.txt.gz, but my code

#include <seqan/store.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan;

int main(int argc, char const* argv[])
{
  seqan::Stream<seqan::GZFile> geneStream;
  if (!open(geneStream, "knownGene.txt.gz", "r"))
  {
    std::cerr << "ERROR: Could not open knownGene.txt.gz for reading.\n";
    return 1;
  }

  seqan::Stream<seqan::GZFile> isoformStream;
  if (!open(isoformStream, "knownIsoforms.txt.gz", "r"))
  {
    std::cerr << "ERROR: Could not open knownIsoforms.txt.gz for reading.\n";
    return 1;
  }

  seqan::Stream<seqan::GZFile> gtfStream;
  if (!open(gtfStream, "ucsc.gtf.gz", "w"))
  {
    std::cerr << "ERROR: Could not open ucsc.gtf.gz for writing.\n";
    return 1;
  }

  FragmentStore<> store;

  read(geneStream, store, Ucsc());
  read(isoformStream, store, Ucsc());

  write(gtfStream, store, Gtf());

  return 0;
}

fails with the message:

/data/results/gusev/epigenetics/novo/annotation/build/seqan/include/seqan/sequence/string_base.h:488 Assertion failed : static_cast<TStringPos>(pos) < static_cast<TStringPos>(length(me)) was: 0 >= 0 (Trying to access an element behind the last one!)
Aborted

What can be the problem here? Thank you!

PS: I am using seqan v. 1.4.1.

Parameters of SequenceStream constructor not documented

The Basic IO tutorial mentions that you can specify a file format as a third parameter to the SequenceStream constructor and gives a link to the API doc.
The API mentions the format parameter but there is no info on what it should be for different formats.

Documentation of Alphabets

We found three issues:

  1. Return value type of ValueSize<TAlphabet>::VALUE is not documented.
  2. T valueSize<T>() should be T1 valueSize<T2>().
  3. class AminoAcid: "The amino acids are enumerated from 0 to 15" should be 19, not 15.

Error when trying to do MSA for similar strings

This code, which I tried to use in a unit test for my homework, causes an error:

   Align<TSequence> align;
   resize(rows(align), 4);
   for(int i = 0; i < 4; ++i)
   {
     assignSource(row(align, i), TSequence("KKKPPPGGF"));
   }
   globalMsaAlignment(align, Blosum62(-1, -1));

Please write at [email protected]

Tutorial corrections

Tutorial First Steps in SeqAn:
The output in the solutions of assignments 5 and 6 and the output in "The Final Result" have too many 0s at the end.

Tutorial Alphabets, Assignment 1, solution:
alphSize should be of type typename ValueSize<TAlphabet>::Type and not typename Size<TAlphabet>::Type.

Tutorial Index Iterators:
"How many wood would a woodchuck chuck." should be "How MUCH wood would a woodchuck chuck.".

Extract CharString from BamTagsDict

The function extractTagValue() always returns false when the tag type is 'Z' (return value of getTagType()). The example given on the documentation page of BamTagsDict outputs

3
AA -> ""
AB -> ""
AC -> 30

and not

3
AA -> "value1"
AB -> "value2"
AC -> 30

Adjust members to naming scheme

There are still some member variables that do not follow the naming scheme:

  • BamAlignmentRecord::rNextId -> BamAlignmentRecord::rNextID
  • GenomicRegion::seqId -> GenomicRegion::seqID

Also, all fragment store members and related types should be rechecked.

Windows/Visual Studio Tutorial not building razers2

When walking through the "Getting Started With SeqAn On Windows Using Visual Studio" tutorial, building razers2 does not work. The following error is displayed:

3>C:\Dev\SeqAn\seqan\core\include\seqan/parallel/parallel_sequence.h(77): error C2665: 'seqan::atomicInc' : none of the 2 overloads could convert all the argument types
3> C:\Dev\SeqAn\seqan\core\include\seqan/parallel/parallel_atomic_primitives.h(340): could be 'long seqan::atomicInc(volatile long &)'
3> C:\Dev\SeqAn\seqan\core\include\seqan/parallel/parallel_atomic_primitives.h(341): or 'unsigned long seqan::atomicInc(volatile unsigned long &)'
3> while trying to match the argument list '(volatile unsigned int)'

I also receive the same error for atomicDec.
I am using Microsoft Visual C++ 2010 Express.
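
The error message lists overloads only for long and unsigned long, so an unsigned int argument matches neither. As an illustration of an increment that is generic over the integer width, here is a sketch built on std::atomic. It assumes a compiler with C++11 <atomic> support (which Visual C++ 2010 does not provide), and atomicIncSketch and atomicDecSketch are hypothetical names, not SeqAn's functions.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for an atomic increment that works for any
// integral width; returns the value after the increment.
template <typename TInt>
TInt atomicIncSketch(std::atomic<TInt> & value)
{
    return value.fetch_add(1) + 1;
}

// Counterpart for decrement; returns the value after the decrement.
template <typename TInt>
TInt atomicDecSketch(std::atomic<TInt> & value)
{
    return value.fetch_sub(1) - 1;
}
```

Because the function is a template, the same code covers unsigned int, long, and unsigned long without separate overloads.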

Fix documentation of Swift and Pigeonhole Pattern

The documentation of Swift and Pigeonhole patterns could use some love. Although I had previous knowledge of the code and RazerS 3, I had to look around a bit. The following things need improvements:

  • signatures and parameters
  • we need working examples
  • the parametrization with threshold etc. for Swift and error rate for Pigeonhole should be documented

translateFile2GlobalRefId for rNextId in readRecord

I was reading a bam file with a broken header and stumbled into this:

The function readRecord(record, context, stream, Bam()) in bam_io/read_bam.h corrects the rID but not the rNextId with the field translateFile2GlobalRefId of a BamIOContext. It seems like a bug to me to correct only one of the IDs although I do not fully understand what is going on.
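
To make the expectation concrete, here is a sketch of translating both fields consistently. It assumes that translateFile2GlobalRefId behaves like a lookup table from file-local to global reference IDs and that -1 means "no reference"; RecordSketch, translateId, and translateRecord are hypothetical names and do not mirror SeqAn's actual types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a BAM record; only the two
// reference ID fields matter here.
struct RecordSketch
{
    int rID;
    int rNextId;
};

// Maps a file-local reference ID to a global one; negative IDs
// (no reference) are passed through unchanged.
inline int translateId(std::vector<int> const & fileToGlobal, int id)
{
    if (id < 0)
        return id;
    assert(static_cast<std::size_t>(id) < fileToGlobal.size());
    return fileToGlobal[id];
}

// Applies the same translation to both reference ID fields.
inline void translateRecord(std::vector<int> const & fileToGlobal, RecordSketch & record)
{
    record.rID = translateId(fileToGlobal, record.rID);
    record.rNextId = translateId(fileToGlobal, record.rNextId);
}
```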

beginPos of unmapped reads in BamAlignmentRecord

According to the SAM format specification, the begin position of an unmapped read is 0:
"POS is set as 0 for an unmapped read without coordinate."

However, when calling the functions

readRecord(record, context, reader, seqan::Sam())
write2(outStream, record, context, seqan::Bam())

the begin position of an unmapped read is being changed from 0 to -inf in the resulting output file.
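
For reference, SAM text stores POS as a 1-based value with 0 meaning "no coordinate", while the BAM binary format stores a 0-based position, so "no coordinate" becomes -1 there. The sketch below shows a round trip that preserves this. It assumes a 0-based in-memory position with a sentinel for "no position"; the sentinel and function names are illustrative and not taken from SeqAn.

```cpp
#include <cassert>
#include <cstdint>

// Assumed sentinel for "no position" in a 0-based in-memory record.
// The name and value are illustrative; they are not taken from SeqAn.
constexpr std::int32_t INVALID_POS_SKETCH = -1;

// SAM text: POS is 1-based, 0 means "unmapped, no coordinate".
constexpr std::int32_t fromSamPos(std::int32_t samPos)
{
    return samPos <= 0 ? INVALID_POS_SKETCH : samPos - 1;
}

// Back to SAM text: any invalid position is written as 0.
constexpr std::int32_t toSamPos(std::int32_t pos)
{
    return pos < 0 ? 0 : pos + 1;
}

// BAM binary: pos is 0-based, so "no coordinate" is stored as -1.
constexpr std::int32_t toBamPos(std::int32_t pos)
{
    return pos < 0 ? -1 : pos;
}
```

With this mapping, an unmapped read parsed from SAM with POS 0 is written back as 0 in SAM and as -1 in BAM, whatever sentinel is used internally.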

Header hard to create for VcfStream

Currently, the only way to do this properly is as follows, which requires access to the underscore-prefixed member _context. There should be a better way to do this.

for (unsigned i = 0; i < numSeqs(faiIndex); ++i)
{
    seqan::CharString contigStr = "<ID=";
    append(contigStr, sequenceName(faiIndex, i));
    append(contigStr, ",length=");
    std::stringstream ss;
    ss << sequenceLength(faiIndex, i);
    append(contigStr, ss.str());
    append(contigStr, ">");
    appendValue(vcfStream.header.headerRecords, seqan::VcfHeaderRecord("contig", contigStr));
    appendName(*vcfStream._context.sequenceNames,
               sequenceName(faiIndex, i),
               vcfStream._context.sequenceNamesCache);
}
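
The string-building part of the loop above could be factored into a small helper. This is only a sketch using the standard library; makeContigValue is a hypothetical name, and it does not address the access to _context.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds the value of a "contig" header record,
// e.g. <ID=chr1,length=1000>, from a sequence name and its length.
inline std::string makeContigValue(std::string const & name, unsigned long length)
{
    return "<ID=" + name + ",length=" + std::to_string(length) + ">";
}
```

For example, makeContigValue("chr1", 1000) yields the string <ID=chr1,length=1000>.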

SVN not working in Ubuntu Studio 14.04

When following the startup tutorial, I get the following error when using svn:

ubuntu-studio@ubuntu-studio:~/Dev$ svn co https://github.com/seqan/seqan/branches/master seqan-trunk

svn: E235000: In file '/build/buildd/subversion-1.8.8/subversion/libsvn_wc/wc_db.c' line 1671: assertion failed (SVN_IS_VALID_REVNUM(changed_rev))
Aborted (core dumped)

Could you please advise how to fix this?

Loading index does not include original sequence names

The original names of sequences are not available when you load an index from disk.

I am using the readAll function to load sequences from a FASTA file. I then generate an index based on the StringSet of the previously read sequences. The index takes the StringSet of sequences as input but not the StringSet that contains the sequence identifiers. I then generate the necessary fibres and write the index to disk. I am able to load the index back into the program from disk but am unable to access the original names of the sequences that are in the index.

Is there any way to include the sequence IDs when writing an index out to disk?

Provide call-consensus function for ProfileChar.

ProfileChar should have a public function that takes the consensus (if any). Currently, this is done by casting to the target type with static_cast<> but that could be considered unintuitive. Either document this feature or provide such a function.

Also, such an explicit function could convert to IUPAC in case of ambiguities.

We might also consider adding a function that performs a statistical test.
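
As a rough idea of what an explicit consensus call could look like, here is a sketch over a plain array of counts. It assumes a four-letter DNA profile in the order A, C, G, T and returns 'N' on ties rather than a specific IUPAC code; ProfileCountsSketch and callConsensus are hypothetical names, not SeqAn's ProfileChar interface.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a DNA profile character: one count per
// symbol, in the order A, C, G, T.
using ProfileCountsSketch = std::array<unsigned, 4>;

// Returns the most frequent symbol, or 'N' when the maximum is shared
// by several symbols (which includes the all-zero case).
inline char callConsensus(ProfileCountsSketch const & counts)
{
    static char const symbols[] = {'A', 'C', 'G', 'T'};
    std::size_t best = 0;
    bool tie = false;
    for (std::size_t i = 1; i < counts.size(); ++i)
    {
        if (counts[i] > counts[best])
        {
            best = i;
            tie = false;
        }
        else if (counts[i] == counts[best])
        {
            tie = true;
        }
    }
    return tie ? 'N' : symbols[best];
}
```

A fuller version could map each set of tied symbols to its IUPAC code instead of collapsing every tie to 'N'.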

Add Affix Index Structure

Dear SeqAn team

First, I want to thank you for this excellent library.

As part of an implementation of an approximate string matching algorithm by “Seed-And-Extend”, I am looking for an affix index structure. Such structures should be great for extending a pattern in both directions.

In a naive solution, I could use two FM-Indexes, one on the text and the other on the reversed text. First, I would iterate my seed in the “backward” FM-Index, and then, with an iterator on each index, I would extend in one direction or the other. However, at each change of direction, it would be necessary to iterate the entire new pattern in the opposite index.

This approach does not seem very efficient compared to what Schnattinger et al. proposed in their article “Bidirectional Search in a String with Wavelet Trees” (2010). Indeed, at each iteration in a BWT, they update the interval in the opposite BWT with a time complexity of O(log(σ)).

Moreover, the naive solution implies that we have two FibreText loaded in memory, which would be quite space consuming.

I think it would be a great opportunity to be able to use an affix index in SeqAn. Unfortunately, I do not have sufficient technical knowledge of SeqAn and template programming to implement new structures. I'm wondering if you're planning to implement (or are maybe already implementing!) such a structure in the SeqAn library.

Best regards,

Christophe Vroland

Mason simulator crashes for invalid combination of fragment length-read length-sd

Example call for crash:
./mason_simulator -ir Ecoli_O157H7.fa -n 1832820 -o Ecoli_O157H7_reads1.fa -or Ecoli_O157H7_reads2.fa --fragment-mean-size 1000 --fragment-size-std-dev 300 --illumina-read-length 150

Simulating Reads:
gi|15829254|ref|NC_002695.1| (allele 1) /home/trappek/src/seqan/trunk/core/include/seqan/basic/basic_exception.h:345 FAILED! (Uncaught exception of type std::runtime_error: Illumina read is too long, increase fragment length)

stack trace:
0 [0x42a24e] seqan::ClassTest::fail() + 0xe
1 [0x4a4912] ../mason_simulator()
2 [0x7f74a94cc856] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5e856)
3 [0x7f74a94cb919] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5d919)
4 [0x7f74a94cc4ca] _gxx_personality_v0 + 0x52a
5 [0x7f74a8d547f3] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0xf7f3)
6 [0x7f74a8d54d27] Unwind_Resume + 0x57
7 [0x46419b] ../mason_simulator()
8 [0x465275] ReadSimulatorThread::run(seqan::String<seqan::SimpleType<unsigned char, seqan::Dna5
>, seqan::Alloc >&, PositionMap const&, seqan::String<char, seqan::Alloc > const&, seqan::String<seqan::SimpleType<unsigned char, seqan::Dna5
>, seqan::Alloc >&, int, int) + 0x695
9 [0x427c50] ../mason_simulator()
10 [0x4657b8] MasonSimulatorApp::_simulateReadsDoSimulation() + 0x398
11 [0x424711] main + 0x731
12 [0x7f74a8781de5] __libc_start_main + 0xf5
13 [0x424ac1] ../mason_simulator()

Aborted (core dumped)
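
If fragment sizes are drawn from a normal distribution (an assumption here), a mean of 1000 and a standard deviation of 300 put the read length of 150 about 2.8 standard deviations below the mean, so roughly 0.2% of draws would be too short for a read. One way to avoid the exception is to redraw such fragments. The sketch below is not Mason's actual code, and sampleFragmentLength is a hypothetical name.

```cpp
#include <cassert>
#include <random>

// Hypothetical helper, not Mason's actual code: draws a fragment length
// from a normal distribution and redraws until the fragment can hold a
// full read.
template <typename TRng>
int sampleFragmentLength(TRng & rng, double mean, double stdDev, int readLength)
{
    assert(mean >= readLength);  // otherwise rejection could take very long
    std::normal_distribution<double> dist(mean, stdDev);
    int length;
    do
    {
        length = static_cast<int>(dist(rng));
    } while (length < readLength);
    return length;
}
```

Alternatively, the simulator could validate the parameter combination up front and exit with an error message instead of an uncaught exception.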

Package Generation for git tags

The package generator currently doesn't work with our new repository on GitHub. Please port it to use git tags instead of svn tags so new binaries like Fiona can be provided easily.
