A fast constructor of the compressed de Bruijn graph from many genomes

TwoPaCo 1.0.0

Release date: 29th September 2022

Authors

  • Ilia Minkin (Pennsylvania State University)
  • Son Pham (Salk Institute for Biological Studies)
  • Paul Medvedev (Pennsylvania State University)

Introduction

TwoPaCo is an implementation of the algorithm described in the paper "TwoPaCo: An efficient algorithm to build the compacted de Bruijn graph from many complete genomes".

This distribution contains two programs:

  • twopaco -- a tool for direct construction of the compressed graph from multiple complete genomes
  • graphdump -- a utility that converts the output of twopaco into a text format

Test data

Links to the data used for benchmarking in the paper: https://github.com/medvedevgroup/TwoPaCo/blob/master/data.txt

Compilation

To compile the code, you need the following (Linux only):

  • CMake
  • A GCC compiler supporting C++11
  • The Intel TBB library, properly installed on your system so that g++ can find its headers and libraries

Once you've got all the things above, do the following:

  • Go to the root directory of the project and create the "build" folder
  • Go to the "build" directory
  • Run cmake ../src
  • Run make

This will build two targets: twopaco and graphdump. Compilation on other platforms is possible; portable makefiles are in progress.

TwoPaCo usage

To construct the graph (assuming you are in the directory containing "twopaco"), type:

./twopaco -f <filter_size> -k <value_of_k> <input_files>

This will construct the compressed graph for the vertex size of <value_of_k> using 2^<filter_size> bits in the Bloom filter. The output file is binary; it can be either converted to a text file or read directly using an API (will be available soon).

The filter size -f is a very important parameter that affects both memory usage and speed. TwoPaCo will use at least 2^<filter_size> / 8 bytes of memory, but setting -f too low can massively increase memory usage and slow the program down. We recommend setting -f so that 2^<filter_size> / 8 equals the maximum number of bytes you wish to allocate to the algorithm. If the memory usage then exceeds that value, increase the number of rounds until the memory usage falls below the desired value (see the section "Number of rounds").

If the memory usage is not a concern, then as a rule of thumb for the fastest speed, set the parameter -f as large as possible. Here are the recommended settings given the memory size of a machine:

Machine RAM    Recommended -f value    Corresponding Bloom filter size
4 GB           34                      2.1 GB
8 GB           35                      4.3 GB
16 GB          36                      8.6 GB
32 GB          37                      17.2 GB
64 GB          38                      34.4 GB
128 GB         39                      68.7 GB
256 GB         40                      137.4 GB

For a memory size in between, go up a value; i.e. for 12 GB of RAM use 36, not 35. For more details on how the Bloom filter size affects performance, please see the paper. Below is a description of the other parameters.
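As a sanity check on the table above, the relationship between -f and filter memory can be sketched in Python. The helper names below are illustrative, not part of TwoPaCo:

```python
import math

def bloom_filter_bytes(f):
    """Memory taken by the Bloom filter itself: 2^f bits = 2^f / 8 bytes."""
    return 2 ** f // 8

def recommended_f(ram_gb):
    """Recommended -f for a machine with ram_gb (decimal) gigabytes of RAM,
    matching the table: 4 GB -> 34, 8 GB -> 35, ..., rounding up in between."""
    return 32 + math.ceil(math.log2(ram_gb))
```

For example, `recommended_f(12)` yields 36, consistent with the "go up a value" rule for in-between memory sizes.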

Alternatively, you can specify the memory used by the filter using the "filtermemory" option:

--filtermemory <memory amount in GB>

Note that the filter will be of size 2^n bits, with n as large as possible such that the filter fits in the specified amount of memory. So if you pass 20 as the filtermemory, TwoPaCo will allocate 17.2 GB (2^37 bits) for the Bloom filter.
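The rounding described above can be sketched as follows; the function name is illustrative, not TwoPaCo's:

```python
def filter_bits_exponent(memory_gb):
    """Largest n such that a 2^n-bit Bloom filter fits in memory_gb (decimal GB)."""
    budget_bits = int(memory_gb * 10 ** 9) * 8
    # Largest power of two not exceeding the budget.
    return budget_bits.bit_length() - 1
```

With a 20 GB budget this gives n = 37, i.e. a 2^37-bit (17.2 GB) filter, as stated above.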

Number of rounds

This parameter sets the number of computational rounds. For the fastest performance, use 1 round (the default). Increasing the number of rounds decreases memory usage at the expense of a longer runtime. Before increasing the number of rounds, please make sure the Bloom filter size is set correctly as described above. To set the rounds parameter, use:

-r <number> or --rounds <number>

K-mer size

This value sets the size of a vertex in the de Bruijn graph. The default is 25. To change it, use:

-k <number> or --kvalue <number>

Note that:

  1. TwoPaCo uses k as the size of the vertex and (k + 1) as the size of the edge
  2. k must be odd

The maximum value of K supported by TwoPaCo is determined at compile time. To increase it, increase the value "MAX_CAPACITY" defined in the header "vertexenumerator.h" and recompile. The value of "MAX_CAPACITY" should be at least (K + 4) / 32 + 1. Note that increasing this parameter will slow down compilation.
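The (K + 4) / 32 + 1 rule uses integer division, as in C++; a small sketch (the helper name is made up for illustration):

```python
def min_max_capacity(k):
    """Smallest MAX_CAPACITY supporting vertex size k, per the (K + 4) / 32 + 1
    rule with C++-style integer division."""
    return (k + 4) // 32 + 1
```

For the default k = 25 this gives 1, for k = 31 it gives 2, and supporting k = 291 would require MAX_CAPACITY of at least 10.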

Number of hash functions

The number of hash functions used for the Bloom filter. The default is five. To change, use:

-q <number> or --hashfnumber <number>

Using more hash functions increases the running time. At the same time, more hash functions may decrease the number of false positives and the memory usage.

Number of threads

twopaco can be run with multiple threads. The default is 1. To change, use:

-t <number> or --threads <number>

Temporary directory

The directory for temporary files. The default is the current working directory. To change, use (the directory must exist):

--tmpdir <path_to_the_directory>

Output file name

The name of the output file. The default is "de_bruijn.bin". To change, use:

-o <file_name> or --outfile <file_name>

Running tests

If the flag is set, TwoPaCo will run a set of internal tests instead of processing the input file:

--test

The graphdump usage

This utility turns the binary file into a text one. There are several output formats available. The folder "example" contains an example described in detail.

GFF

In the next release I will add an option to output coordinates of all occurrences of the junctions in GFF format.

DOT

This format is used for visualization. The resulting DOT file can be converted into an image using the Graphviz package:

http://www.graphviz.org/

To get the DOT file, use:

graphdump <twopaco_output_file> -f dot -k <value_of_k>

Note that the graph is a union of graphs built from both strands, with blue edges coming from the main strand and red ones from the reverse strand. Edge labels indicate their positions on a chromosome.

GFA

GFA is the most handy option. It explicitly represents the graph as a list of edges (non-branching paths in the non-compacted de Bruijn graph) and the adjacencies between them. The file also contains all occurrences of the strings spelled by the paths in the input genomes.

In other words, it describes a colored de Bruijn graph where each path is mapped to several locations in the input ("colored"). TwoPaCo supports both GFA1 and GFA2. They are described here:

https://github.com/GFA-spec/GFA-spec

To get GFA output, run:

graphdump <twopaco_output_file> -f gfa[version] -k <value_of_k> -s <input_genomes>

In the resulting file, compacted non-branching paths are "segments", with "links" (GFA1) or "edges" (GFA2) connecting them. "Containment" (GFA1) or "Fragment" (GFA2) records describe the mapping between the non-branching paths in the graph and the input genomes. For GFA1, each input chromosome is also a "segment" described at the very beginning of the GFA file.

GFA1 only: each segment representing an input chromosome is named after the corresponding header of the sequence in the input FASTA file. If there are duplicate headers, one can add a prefix to segment names:

"s<number>_" + header of the sequence in input FASTA file

To do so, use the switch:

--prefix

For an example of GFA output and more detailed explanation, see the "example" folder.
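To give a feel for the record layout, here is a minimal GFA1 reader in Python; it is a hypothetical helper, not part of TwoPaCo, and handles only S (segment) and L (link) records:

```python
def parse_gfa1(text):
    """Collect segments and links from GFA1-formatted text.

    Returns (segments, links): segments maps name -> sequence,
    links is a list of (from, from_orient, to, to_orient, overlap) tuples.
    """
    segments, links = {}, []
    for line in text.splitlines():
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":    # S <name> <sequence> [tags]
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":  # L <from> <fromOrient> <to> <toOrient> <overlap>
            links.append((fields[1], fields[2], fields[3], fields[4], fields[5]))
    return segments, links
```

A real consumer would also handle P/C records (GFA1) or E/F/O records (GFA2) as described above.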

Junctions List Format

In this format the output file contains only the positions of junctions in the input genomes. As described in the paper, you can trivially restore the edge information from this junction list. Note that junctions are mapped to genomes, i.e. one can reconstruct a colored graph from it. To get the junction list, run:

graphdump <twopaco_output_file> -f seq -k <value_of_k>

This command writes text to the standard output. Each line contains a triple indicating an occurrence of a junction:

<seq_id_i> <pos_i> <junction_id_i> 

The first number is the index of the sequence, the second is the position, and the third is the junction id. The index of a sequence is its order in the input file(s). All positions and indices count from 0. Positions appear in the file in the same order they appear in the input genomes. The <junction_id> is a signed integer identifying the junction as it appears on the positive strand. A positive number indicates the "direct" version of the junction, while a negative one indicates the reverse-complementary version of the same junction; for example, +1 and -1 are different versions of the same junction. This way, one can obtain all multi-edges of the graph with a linear scan, as described in the paper. For example, a sequence of junction ids:

a_1
a_2
a_3

generates the edges a_1 -> a_2 and a_2 -> a_3 in the graph corresponding to the positive strand. To obtain the edges of the reverse strand, traverse them in backwards order and negate the signs; for the example above the sequence is -a_3 -> -a_2 -> -a_1. One can also output junctions grouped by id, which is useful for comparing different graphs:

graphdump <twopaco_output_file> -f group -k <value_of_k>

In this format the i-th line corresponds to the i-th junction and has the format:

<seq_id_0> <pos_0>; <seq_id_1> <pos_1>; ....

Where each pair "seq_id_i pos_j" corresponds to an occurrence of the junction in sequence "seq_id_i" at position "pos_j". Sequence ids are simply the ranks of the sequences in the order they appear in the input. All positions count from 0.
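The linear-scan edge reconstruction described above can be sketched in a few lines of Python; the function names are illustrative, not part of graphdump:

```python
def positive_strand_edges(junction_ids):
    """Multi-edges of the positive strand: consecutive junction occurrences
    in one input sequence form an edge."""
    return list(zip(junction_ids, junction_ids[1:]))

def reverse_strand_edges(junction_ids):
    """Edges of the reverse strand: traverse the ids backwards and negate
    their signs, per the convention described above."""
    negated = [-j for j in reversed(junction_ids)]
    return list(zip(negated, negated[1:]))
```

For the junction sequence 1, 2, 3 this yields the positive-strand edges (1, 2) and (2, 3), and the reverse-strand edges (-3, -2) and (-2, -1).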

Read The Binary File Directly

This is the most parsimonious option in terms of resources. One can read junctions and/or edges from the output file using a very simple C++ API. A description will be added in a future release; for now, one can use the sources of graphdump as a reference, which are relatively straightforward.

License

See LICENSE.txt

Contacts

Please e-mail your feedback at [email protected].

You can also report bugs or suggest features using the issue tracker on GitHub: https://github.com/medvedevgroup/TwoPaCo

Citation

If you use TwoPaCo, please cite:

Ilia Minkin, Son Pham, and Paul Medvedev
"TwoPaCo: An efficient algorithm to build the compacted de Bruijn graph from many complete genomes"
Bioinformatics, 2016 doi:10.1093/bioinformatics/btw609

This project has been supported in part by NSF awards DBI-1356529, CCF-1439057, IIS-1453527, and IIS-1421908.

Contributors

dpryan79, ekg, iminkin, pashadag

Issues

Handling N characters in sequences

Hi,

when using input sequences that contain N characters (which almost all eukaryote reference sequences do) I get this error message:

Round 0, 0:1048576
Pass Filling Filtering
1 error: Found an invalid character 'N'

Do I need to re-split my input into contigs to use it with TwoPaCo, or is this some kind of bug?

Thanks,
Chris

`twopaco` and `graphdump` functions not found

Hello,

I am interested in using twopaco on a set of bacterial reference genomes, though I am running into issues.

I cloned the repo and went through the build steps mentioned in the README, though after the build, I can't seem to run the twopaco command.

bash: twopaco: command not found
bash: graphdump: command not found

I can't locate in which directory the command exists or needs to be run from. I am somewhat new to building tools written in C, so pardon my ignorance on the topic.

I am working on a HPC which has C++ version 11 or higher and the required TBB libraries and version of cmake. The tool built without issue.

Thanks,
Domenick

Cause of input corrupted error?

Hey @iminkin!

I'm trying to run this on another dataset however I keep getting the "the input is corrupted" error. I tried to take a look at the source code but can't fully understand what causes this? I checked my Fasta and that seems to be fine, but I might be missing something

typo?

Hello, I am writing because it seems like there is a typo in the main README under the graphdump / GFA section. Seems like graphdummp should be graphdump

Best,
Domenick

Fix error handling

  1. Add error checking after the first pass
  2. Add try/catch around constructing the parser (checking whether the file exists)

Unable to create temp file

Hi,

in all my runs, independent of the provided location TwoPaCo is unable to create a temp file at the given location.

twopaco --tmpdir /home/TwoPaCo/tmp/ --test -t 1 -k 11 -f 20 -o /home/TwoPaCo/test.dbg /home/TwoPaCo/examples/example.fa

What am I doing wrong here?

Thanks,
Chris

Fails to build with TBB 2021.5.0

I am building TwoPaCo 0.9.4 with TBB 2021.5.0 on Debian experimental. The build fails with the following:

In file included from /home/merkys/twopaco/src/graphdump/graphdump.cpp:17:
/home/merkys/twopaco/src/graphdump/../common/streamfastaparser.h:8:10: fatal error: tbb/mutex.h: No such file or directory
    8 | #include <tbb/mutex.h>
      |          ^~~~~~~~~~~~~

Does this mean that TwoPaCo does not support TBB 2021.5.0? If so, are there plans to support it?

Corrupt input when using graphdump on TwoPaCo output

Hi guys,

First of all; awesome work on TwoPaCo --- the method and the software are both fantastically useful, and I'm really excited to start using it for some downstream applications we're working on. I'm running into the following issue. I used TwoPaCo to build the compacted dBG for the human transcriptome (the following command):

twopaco  -k 31 -t 8 -f 32 gencode.v25.pc_transcripts.fa

As the name suggests, the reference is protein coding human transcripts from gencode v25. This seems to work fine, and I get the following output from TwoPaCo:

Threads = 8
Vertex length = 31
Hash functions = 5
Filter size = 4294967296
Capacity = 2
Files:
/mnt/scratch6/avi/data/txptome/gencode.v25.pc_transcripts.fa
--------------------------------------------------------------------------------
Round 0, 0:4294967296
Pass    Filling Filtering
1       9       16
2       3       1
True junctions count = 358144
False junctions count = 56685
Hash table size = 414829
Candidate marks count = 2551662
--------------------------------------------------------------------------------
Reallocating bifurcations time: 0
True marks count: 2540661
Edges construction time: 6
--------------------------------------------------------------------------------
Distinct junctions = 358144

Now, I want to convert this output to a GFA format (I tried both GFA1 and 2 and get the same error in each case). I used the following command:

graphdump -k 31 -s gencode.v25.pc_transcripts.fa -f gfa1 de_bruijn.bin > gencode.twopaco.gfa1

This results in the following error message:

error: The input is corrupted

At this point, some output has been generated, but I presume it's not complete because, despite the fact that there are ~96k input transcripts, I only get 35,451 output paths (i.e., P) entries in the resulting GFA file. Any idea what might be causing this issue or how to fix it?

Thanks!
Rob

Fail to install TwoPaCo

Hi,

Thank you for developing TwoPaCo!

I had the following error when I compiled the files. I know this should be related to the TBB library, which I followed the instruction (https://github.com/oneapi-src/oneTBB/blob/master/INSTALL.md) to install. However, it still throws the error. Could you please clarify if we need to specify the TBB library when installing TwoPaCo? Appreciate your help!

/home/usr/Tools/TwoPaCo/src/graphdump/graphdump.cpp:15:10: fatal error: oneapi/tbb/parallel_sort.h: No such file or directory
 #include "oneapi/tbb/parallel_sort.h"
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
graphdump/CMakeFiles/graphdump.dir/build.make:75: recipe for target 'graphdump/CMakeFiles/graphdump.dir/graphdump.cpp.o' failed
make[2]: *** [graphdump/CMakeFiles/graphdump.dir/graphdump.cpp.o] Error 1
CMakeFiles/Makefile2:115: recipe for target 'graphdump/CMakeFiles/graphdump.dir/all' failed
make[1]: *** [graphdump/CMakeFiles/graphdump.dir/all] Error 2
Makefile:155: recipe for target 'all' failed
make: *** [all] Error 2

Linking twopaco to TBB fails

Hi @IlyaMinkin,

I am trying to build TwoPaCo on RedHat-7-x86_64. Compilation and linking of graphdump proceeds without error but during linking of twopaco the following error occurs:

CMakeFiles/twopaco.dir/vertexenumerator.cpp.o: In function TwoPaCo::VertexEnumeratorImpl<1ul>::DistributeTasks(std::vector<std::string, std::allocator<std::string> > const&, unsigned long, std::vector<std::unique_ptr<tbb::concurrent_bounded_queue<TwoPaCo::Task, tbb::cache_aligned_allocator<TwoPaCo::Task> >, std::default_delete<tbb::concurrent_bounded_queue<TwoPaCo::Task, tbb::cache_aligned_allocator<TwoPaCo::Task> > > >, std::allocator<std::unique_ptr<tbb::concurrent_bounded_queue<TwoPaCo::Task, tbb::cache_aligned_allocator<TwoPaCo::Task> >, std::default_delete<tbb::concurrent_bounded_queue<TwoPaCo::Task, tbb::cache_aligned_allocator<TwoPaCo::Task> > > > > >&, std::unique_ptr<std::runtime_error, std::default_delete<std::runtime_error> >&, tbb::mutex&, std::ostream&) [clone .constprop.2154]: vertexenumerator.cpp:(.text+0x160d): undefined reference to tbb::internal::concurrent_queue_base_v8::internal_push_move(void const*) vertexenumerator.cpp:(.text+0x1aa8): undefined reference to tbb::internal::concurrent_queue_base_v8::internal_push_move_if_not_full(void const*)'

I have tried three different TBB releases: 43_20150209, tbb2017_20170412 and tbb2018_20170919. They all produce the error posted above. The 2018 release produces an additional error:

CMakeFiles/twopaco.dir/constructor.cpp.o: In function tbb::flow::interface10::graph::~graph()': constructor.cpp:(.text._ZN3tbb4flow11interface105graphD2Ev[_ZN3tbb4flow11interface105graphD5Ev]+0x4c): undefined reference to tbb::interface7::internal::task_arena_base::internal_execute(tbb::interface7::internal::delegate_base&) const constructor.cpp:(.text._ZN3tbb4flow11interface105graphD2Ev[_ZN3tbb4flow11interface105graphD5Ev]+0x104): undefined reference to tbb::interface7::internal::task_arena_base::internal_initialize() constructor.cpp:(.text._ZN3tbb4flow11interface105graphD2Ev[_ZN3tbb4flow11interface105graphD5Ev]+0x11c): undefined reference to tbb::interface7::internal::task_arena_base::internal_terminate()

The ldd command shows that graphdump is linked to the correct TBB library. Am I using the wrong TBB version or am I missing any additional libraries?

Thank you for your help!

Large k value

Hello !
I would be very interested to use TwoPaCo with large kmers.
It works with 281 but not with 291 on Ecoli reference genome.
./twopaco -f 30 -k 291 ../../../../data/ecoli.fa
Give a segfault.

Would it be possible for TwoPaCo to work with arbitrary sizes of k?

Is it stuck when there is really low CPU usage?

We just run twopaco for thousands of bacterial genomes and now it's at:

Round 0, 0:4398046511104
Pass    Filling Filtering

However, when we look at top we see that while its loaded in memory there is only 0.3% cpu usage:
514.3g 512.2g 4368 D 0.3 33.9 209:34.18 twopaco
Is this normal or does this mean something is going wrong?

Undefined symbol

/sc1/apps/pets/SibeliaZ/bin/twopaco: symbol lookup error: /sc1/apps/pets/SibeliaZ/bin/twopaco: undefined symbol: _ZN3tbb8internal24concurrent_queue_base_v818internal_push_moveEPKv

meaning of software name

Does TwoPaCo stand for "Two Path Compaction"?

This is for a paper where we're trying to provide full names for acronyms and compressed program names to make the jargon less intense.

Problem building TwoPaCo in BioLinux

Hi @IlyaMinkin,

I am trying to install TwoPaCo in BioLinux. I first installed tbb using sudo apt-get install libtbb-dev, then did git clone https://github.com/medvedevgroup/TwoPaCo.git, cd TwoPaCo, mkdir build, cd build, cmake ../src then finally make. Here is the ouput I get:
$ cmake ../src
-- The C compiler identification is GNU 4.8.4
-- The CXX compiler identification is GNU 4.8.4
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/manager/Programmes/TwoPaCo/build

$ make
Scanning dependencies of target graphdump
[ 8%] Building CXX object graphdump/CMakeFiles/graphdump.dir/graphdump.cpp.o
[ 16%] Building CXX object graphdump/CMakeFiles/graphdump.dir//common/dnachar.cpp.o
[ 25%] Building CXX object graphdump/CMakeFiles/graphdump.dir/
/common/streamfastaparser.cpp.o
In file included from /home/manager/Programmes/TwoPaCo/src/common/streamfastaparser.cpp:4:0:
/home/manager/Programmes/TwoPaCo/src/common/streamfastaparser.h:179:3: error: ‘auto_ptr’ in namespace ‘std’ does not name a type
std::auto_ptrTwoPaCo::StreamFastaParser parser_;
^
/home/manager/Programmes/TwoPaCo/src/common/streamfastaparser.h: In constructor ‘TwoPaCo::ChrReader::ChrReader(const std::vector<std::basic_string >&)’:
/home/manager/Programmes/TwoPaCo/src/common/streamfastaparser.h:145:5: error: ‘parser_’ was not declared in this scope
parser_.reset(new TwoPaCo::StreamFastaParser(fileName[0]));
^
/home/manager/Programmes/TwoPaCo/src/common/streamfastaparser.h: In member function ‘bool TwoPaCo::ChrReader::NextChr(std::string&)’:
/home/manager/Programmes/TwoPaCo/src/common/streamfastaparser.h:154:9: error: ‘parser_’ was not declared in this scope
if (parser_->ReadRecord())
^
make[2]: *** [graphdump/CMakeFiles/graphdump.dir/__/common/streamfastaparser.cpp.o] Error 1
make[1]: *** [graphdump/CMakeFiles/graphdump.dir/all] Error 2
make: *** [all] Error 2

How can I solve this problem?
Thanks a lot in advance for your help.

Creation of many temporary files (of considerable size) when there are many references.

Hi,

I have noticed some behavior I was not expecting when using TwoPaCo to generate compacted dBGs for input fasta files with many distinct references. Specifically, we are making use of TwoPaCo internally in pufferfish indexing, and one of the common use cases now is to index a transcriptome for subsequent salmon quantification. Here, the total size of the sequence is small ~300M for the human transcriptome, but the number of individual fasta entries is very large (~200,000).

The behavior I noticed is that TwoPaCo creates, during processing, a temporary file in its temp directory for every input sequence in the fasta file. So, we get a temp folder with ~200,000 distinct files! This seems to be a particular problem for some users who are doing indexing on cluster machines (with NFS-mounted drives).

In addition to the large number of distinct files being created, the total size of the temporary directory grows quite large. For example, for the human transcriptome (again, ~300M of input sequence), the TwoPaCo temp directory grows to ~14G before files start being deleted.

I have two main questions. First, is this large intermediate disk-space usage expected, and if so is there some way that it can be controlled? Second, is there some way to avoid or alter the behavior of creating one temp file per input sequence? This still works (as long as we're not on an NFS) for transcriptomic sequences, but some large metagenomic sequences have literally created more files in a directory than the file system is willing to handle. Ideally, there may be some way to "block together" temporary files for distinct references so that, rather than 1 temp file per-reference there was a temp file for different buckets of references or some such.

Thanks again for the great tool, and for any insight or suggestions you have on the above!

--Rob

Confused about + and - stranded nodes in GFA

Let's use this very simple FASTA:

>seq1
ATATGTCGCTGATCGACTGAAATAGCATCGACTAGCTATCGAT
>seq2
ATATGTCGCTGATCGACTGAATAGTGAAATAGCATCGACTAGC
>seq3
ATATGTCGCTGATCGACTTTTTTTTGAAATAGCATCGACTAGC

Then we construct the graph: ./twopaco -k 15 -f 16 test.fa -o graph and convert it to GFA: graphdump -k 15 -f gfa2 -s test.fa graph > graph.gfa:

H       VN:Z:2.0
S       36      18      ATATGTCGCTGATCGACT
F       36      seq1+   0       18$     0       18      15M
S       24      18      TTCAGTCGATCAGCGACA
F       24      seq1-   0       18$     3       21      15M
E       36+     24-     3       18$     3       18$     15M
S       14      26      GTCGATGCTATTTCAGTCGATCAGCG
F       14      seq1-   0       26$     6       32      15M
E       24-     14-     0       15      11      26$     15M
S       11      19      TGAAATAGCATCGACTAGC
F       11      seq1+   0       19$     17      36      15M
E       14-     11+     0       15      0       15      15M
S       19      22      ATAGCATCGACTAGCTATCGAT
F       19      seq1+   0       22$     21      43$     15M
E       11+     19+     4       19$     0       15      15M
O       seq1p   36+ 24- 14- 11+ 19+
F       36      seq2+   0       18$     0       18      15M
F       24      seq2-   0       18$     3       21      15M
E       36+     24-     3       18$     3       18$     15M
S       13      33      GTCGATGCTATTTCACTATTCAGTCGATCAGCG
F       13      seq2-   0       33$     6       39      15M
E       24-     13-     0       15      18      33$     15M
F       11      seq2+   0       19$     24      43$     15M
E       13-     11+     0       15      0       15      15M
O       seq2p   36+ 24- 13- 11+
F       36      seq3+   0       18$     0       18      15M
S       12      36      GTCGATGCTATTTCAAAAAAAAGTCGATCAGCGACA
F       12      seq3-   0       36$     3       39      15M
E       36+     12-     3       18$     21      36$     15M
F       11      seq3+   0       19$     24      43$     15M
E       12-     11+     0       15      0       15      15M
O       seq3p   36+ 12- 11+

When we look at the paths we have:


seq1p   36+ 24- 14- 11+ 19+
seq2p   36+ 24- 13- 11+
seq3p   36+ 12- 11+

We can only reconstruct the sequence from the GFA by taking the reverse complement of - nodes. When we look at the paths all nodes are on the same strand (i.e. all - or all +), for example, all 24 nodes are -. So why weren't these just all recorded as +?

GFA version update

Is it planned to update the GFA output to the new GFA v2 or have an option to choose v2 as output?
I would like to use TwoPaCo with another tools that needs GFA v2 as input.

MAC OS compilation failed

Compilation fails on macOS High Sierra 10.13.1 using gcc 4.8.5 with the following error.

In file included from /Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/Arg.h:54:0,
                 from /Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/SwitchArg.h:30,
                 from /Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/CmdLine.h:27,
                 from /Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/constructor.cpp:16:
/Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/ArgTraits.h: In instantiation of ‘struct TCLAP::ArgTraits<long long unsigned int>’:
/Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/ValueArg.h:403:66:   required from ‘void TCLAP::ValueArg<T>::_extractValue(const string&) [with T = long long unsigned int; std::string = std::basic_string<char>]’
/Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/ValueArg.h:363:29:   required from ‘bool TCLAP::ValueArg<T>::processArg(int*, std::vector<std::basic_string<char> >&) [with T = long long unsigned int]’
/Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/constructor.cpp:163:1:   required from here
/Users/tayyu/Desktop/523CSE/code/TwoPaCo/src/graphconstructor/../common/tclap/ArgTraits.h:80:39: error: ‘long long unsigned int’ is not a class, struct, or union type
     typedef typename T::ValueCategory ValueCategory;
                                       ^
make[2]: *** [graphconstructor/CMakeFiles/twopaco.dir/constructor.cpp.o] Error 1
make[1]: *** [graphconstructor/CMakeFiles/twopaco.dir/all] Error 2
make: *** [all] Error 2

error: Inconsistent read size

Hi,

When running TwoPaCo with any dataset that consists of at least two sequences of different size (in one, or more files) I almost immediately get this error message:

Round 0, 0:1048576
Pass Filling Filtering
1 error: Inconsistent read size

Is this wanted behavior, that TwoPaCo can only deal with input sequences of the same size?

Thanks,
Chris

Present API for pufferfish

TwoPaCo is used by Pufferfish which currently embeds a patched code copy of TwoPaCo. Patches seem to transform main() function to make it callable in C/C++ code circumventing the command line interface. Would it be possible to merge Pufferfish's patches to TwoPaCo? This way Pufferfish could be linked against static/shared library of TwoPaCo. In addition, I believe such command line-circumventing API could be useful for other users of TwoPaCo preferring static type checking, for example.

Redundant k-mer in contigs

Hi @IlyaMinkin,

We've run into another minor issue that we think is a bug. I wanted to report the behavior here to get your feedback on it. Basically, what we're seeing is that, for a small number of contigs that TwoPaCo is returning, the contigs contain both a k-mer and its reverse complement. Thus, in the compacted dBG, the k-mer itself is repeated --- which we believe shouldn't happen. I realize that this is possible in the GFA output when the k-mer occurs at the end of a contig, since the GFA file is written such that the overlaps themselves are of length k and hence these k-mers will occur at least twice. However, these repeated k-mers are internal (and seem to happen, in fact, when the entire contig is its own reverse complement).

This issue was discovered by my student @fataltes, who did the legwork to provide the following example. We're working with this reference sequence. We ran TwoPaCo with -k set to 31, and then used graphdump to obtain a GFA1 file. Most of the contigs / segments in this file are OK, but a few of them contain the same k-mer (once in the forward and once in the reverse complement orientation) more than once. Here is the list of such contigs / segments in our output:

2232549 ATGTGTGTGTGTGTATATATATATATATATACACACACACACAT
196044 TGTGTATATATATACACATATATACGTATATATGTGTATATATATACACA
557083 TTTCATGTTTATATATATATATATATGTATATATATATACATATATATATATATATAAACATGAAA
659373 GTGTGTGTGTATATATATATATATATATATACACACACAC
2222892 ATTATATATATATAATATATATATATTATATATATATAAT
2307911 ATATATATATATCATATATATGATATATATATAT
2309111 ATATACATATATATATATATATATATATGTATAT
2861563 AAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTT
2237088 TGTGTGTGTATGTATATTATATAATATACATACACACACA
555324 TATATATATATATACCATATATATGGTATATATATATATA
659376 TGTGTGTGTGTGTGTATATATACACACACACACACA
554875 TATATATATATATATATAATATATATATATATATA
555396 TATATATATATAAATATATATATATTTATATATATATA
162527 TTATATATATATTATATATATAATATATATATAA
2307775 ATATATATGTGTGTATATATATACACACATATATAT
554899 ATATATATATATATATGCATATATATATATATAT
214284 TGTATGTGTGTATATATGTGTGTATATATATATACACACATATATACACACATACA

As you can see, these segments contain quite a few cases where both a k=31-mer and its reverse complement (and even larger k-mers) are present in the same contig. As we are indexing k-mers in the TwoPaCo representation, and expecting each k-mer to occur at most once, this is causing some issues for us. Interestingly, all of these cases seem to be occurring as substrings of segments which are their own reverse complements. So, I presume that this is either (1) expected behavior and we are possibly interpreting the compacted dBG differently from TwoPaCo or (2) some minor corner-case in the contig generation code.

Please let me know if you have any questions about this case or any difficulty re-generating this example. Thanks again!

--Rob

Question about GFA format

Hi @IlyaMinkin,

It's me again :). TwoPaCo has been working great, but I've run into a small issue regarding the GFA file, and I was wondering if you could clear up my confusion. I built a cdBG using TwoPaCo with k=31. Since the documentation states that k is the node size, I expected the cdBG to contain a list of segments (i.e., contigs) that overlap by k-1 bases. However, in the resulting GFA file, all of the contigs instead seem to overlap by k (i.e., they show a 31M overlap). This is causing some issues downstream, as we rely on the invariant that a k-mer (or its reverse complement) appears at most once in the cdBG. When the overlap is of size k, a given k-mer may instead appear as many times as it participates in an overlap.

Have I misunderstood something about the expected format of this graph? Is there an easy way to obtain the cdBG GFA file such that the overlaps are retained as k-1 bases instead of k?

Thanks!
Rob
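The duplication described in the question can be shown with a toy example (the segments below are made up, not TwoPaCo output): when two segments overlap by k bases, the overlap region is itself a full k-mer and is counted in both segments; trimming to a (k-1)-base overlap restores the at-most-once invariant.

```python
# Toy illustration (not TwoPaCo output) of why a k-base overlap
# duplicates a k-mer while a (k-1)-base overlap does not.
from collections import Counter

def kmers(s, k):
    return [s[i:i + k] for i in range(len(s) - k + 1)]

k = 3
seg_a, seg_b = "ACGTA", "GTACC"      # seg_b overlaps seg_a by k bases: "GTA"
counts = Counter(kmers(seg_a, k) + kmers(seg_b, k))
print(counts["GTA"])                  # 2: the overlap k-mer appears twice

seg_b_trimmed = seg_b[1:]             # emulate a (k-1)-base overlap instead
counts = Counter(kmers(seg_a, k) + kmers(seg_b_trimmed, k))
print(counts["GTA"])                  # 1: each k-mer now appears exactly once
```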

gfa1 Format issue (path orientation)

Hi,

The paths in the GFA1 file dumped by graphdump seem to have the wrong format. For some of the segments in a path the orientation is missing, and for the segments that do have one, the orientation is given before the segment number instead of after it (although I'm not sure whether this is an actual error or allowed by the format specification).
Output:
P seq1_repeat_seq2_seq3 -23,17,-31,27,-8 31M,31M,31M,31M
gfa spec example:
P 14 11+,12-,13+ 4M,5M

Is this the correct format or is there something wrong?
Thanks,
Chris
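For reference, the GFA1 spec writes each path step as a segment name followed by a `+`/`-` orientation suffix. A minimal sketch (our own hypothetical helper, not part of graphdump) that normalizes a sign-prefixed segment list like the one above into that form, treating an unsigned entry as forward:

```python
# Hedged sketch: rewrite a P-line segment list of the form "-23,17,-31"
# (sign prefixed, '+' omitted) into spec-style "23-,17+,31-"
# (orientation suffixed, always present).
def fix_path_segments(field):
    out = []
    for seg in field.split(","):
        if seg.startswith("-"):
            out.append(seg[1:] + "-")   # leading '-' becomes a '-' suffix
        else:
            out.append(seg + "+")       # unsigned entries assumed forward
    return ",".join(out)

print(fix_path_segments("-23,17,-31,27,-8"))  # → 23-,17+,31-,27+,8-
```

Note the assumption that entries without a sign are forward-oriented; whether that matches graphdump's intent is exactly the question here.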

undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'

Hitting a general issue with the install, despite following the directions.

As a friendly reminder, if developers expect (moderately) computationally literate biologists to use their software, they need to provide explicit lines of code in the README, not general instructions. Sometimes that's the difference between hundreds of citations or almost none.

user@computer:~$ git clone https://github.com/medvedevgroup/TwoPaCo
Cloning into 'TwoPaCo'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 3548 (delta 19), reused 28 (delta 13), pack-reused 3510
Receiving objects: 100% (3548/3548), 11.90 MiB | 5.98 MiB/s, done.
Resolving deltas: 100% (2325/2325), done.
user@computer:~$ cd TwoPaCo
user@computer:~/TwoPaCo$ mkdir build
user@computer:~/TwoPaCo$ cd build
user@computer:~/TwoPaCo/build$ cmake ../src
-- The C compiler identification is GNU 5.5.0
-- The CXX compiler identification is GNU 5.5.0
-- Check for working C compiler: /home/linuxbrew/.linuxbrew/bin/cc
-- Check for working C compiler: /home/linuxbrew/.linuxbrew/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /home/linuxbrew/.linuxbrew/bin/c++
-- Check for working CXX compiler: /home/linuxbrew/.linuxbrew/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/user/TwoPaCo/build
user@computer:~/TwoPaCo/build$  sudo apt-get install libtbb-dev
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  linux-headers-4.15.0-48 linux-headers-4.15.0-48-generic linux-image-4.15.0-48-generic
  linux-modules-4.15.0-48-generic linux-modules-extra-4.15.0-48-generic python3-dateutil
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  libtbb2
Suggested packages:
  tbb-examples libtbb-doc
The following NEW packages will be installed:
  libtbb-dev libtbb2
0 to upgrade, 2 to newly install, 0 to remove and 77 not to upgrade.
Need to get 342 kB of archives.
After this operation, 2,033 kB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://au.archive.ubuntu.com/ubuntu bionic/universe amd64 libtbb2 amd64 2017~U7-8 [110 kB]
Get:2 http://au.archive.ubuntu.com/ubuntu bionic/universe amd64 libtbb-dev amd64 2017~U7-8 [231 kB]
Fetched 342 kB in 0s (2,559 kB/s)   
Selecting previously unselected package libtbb2:amd64.
(Reading database ... 255576 files and directories currently installed.)
Preparing to unpack .../libtbb2_2017~U7-8_amd64.deb ...
Unpacking libtbb2:amd64 (2017~U7-8) ...
Selecting previously unselected package libtbb-dev:amd64.
Preparing to unpack .../libtbb-dev_2017~U7-8_amd64.deb ...
Unpacking libtbb-dev:amd64 (2017~U7-8) ...
Setting up libtbb2:amd64 (2017~U7-8) ...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
Setting up libtbb-dev:amd64 (2017~U7-8) ...
user@computer:~/TwoPaCo/build$ make
[  7%] Building CXX object graphdump/CMakeFiles/graphdump.dir/graphdump.cpp.o
[ 14%] Building CXX object graphdump/CMakeFiles/graphdump.dir/__/common/dnachar.cpp.o
[ 21%] Building CXX object graphdump/CMakeFiles/graphdump.dir/__/common/streamfastaparser.cpp.o
[ 28%] Linking CXX executable graphdump
/usr/lib/x86_64-linux-gnu/libtbb.so: undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'
/usr/lib/x86_64-linux-gnu/libtbb.so: undefined reference to `std::__exception_ptr::exception_ptr::exception_ptr(void*)@CXXABI_1.3.11'
collect2: error: ld returned 1 exit status
graphdump/CMakeFiles/graphdump.dir/build.make:146: recipe for target 'graphdump/graphdump' failed
make[2]: *** [graphdump/graphdump] Error 1
CMakeFiles/Makefile2:85: recipe for target 'graphdump/CMakeFiles/graphdump.dir/all' failed
make[1]: *** [graphdump/CMakeFiles/graphdump.dir/all] Error 2
Makefile:105: recipe for target 'all' failed
make: *** [all] Error 2
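The undefined `CXXABI_1.3.11` symbols suggest a toolchain mismatch rather than a TwoPaCo bug: the log shows CMake picking up linuxbrew's GCC 5.5, while Ubuntu 18.04's system libtbb was built against a newer libstdc++ (the `CXXABI_1.3.11` symbols first appear in GCC 7). A hedged workaround sketch, forcing CMake to use the distribution's own compilers so the ABI matches the system TBB (paths assume a stock Ubuntu 18.04 install):

```shell
# Workaround sketch: rebuild with the system GCC so the libstdc++ ABI
# matches the system libtbb (paths assume stock Ubuntu 18.04).
sudo apt-get install -y g++ cmake libtbb-dev
cd ~/TwoPaCo
rm -rf build && mkdir build && cd build
# Point CMake at the system compilers instead of the linuxbrew ones:
cmake ../src -DCMAKE_C_COMPILER=/usr/bin/gcc -DCMAKE_CXX_COMPILER=/usr/bin/g++
make
```

Alternatively, keeping the linuxbrew toolchain should also work if TBB itself is installed from linuxbrew, so that compiler and library come from the same libstdc++.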
