vg

variant graph data structures, interchange formats, alignment, genotyping, and variant calling methods

If we know about variation in a given population, we should include that knowledge in our primary sequence analyses, or risk bias against variants we have already observed. Reference bias is real. We can work around it by representing the reference as a graph: either an assembly graph, or a directed acyclic graph similar to the way we represent a multiple sequence alignment.
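To make the idea concrete, here is a minimal Python sketch (not vg's construction code) that turns a linear reference and a few substitutions into a small directed acyclic graph, where each variant site becomes a bubble of alternative nodes. The function name, the (pos, ref, alt) tuple layout, and the restriction to non-overlapping substitutions with non-empty alleles are all assumptions made for this example.

```python
def build_variant_graph(reference, variants):
    """Build a toy DAG from a reference string and substitution variants.

    variants: iterable of (pos, ref, alt) with 0-based pos (hypothetical layout).
    Alleles must be non-empty and sites must not overlap.
    Returns (nodes, edges): {node_id: sequence} and a set of (from_id, to_id).
    """
    nodes, edges = {}, set()
    frontier = []   # nodes that must connect to whatever node comes next
    cursor = 0      # next reference position not yet placed in a node

    def add_node(seq):
        # Create a node and link it from every node in the current frontier.
        node_id = len(nodes) + 1
        nodes[node_id] = seq
        edges.update((prev, node_id) for prev in frontier)
        return node_id

    for pos, ref, alt in sorted(variants):
        assert pos >= cursor, "overlapping variants are not handled here"
        assert reference[pos:pos + len(ref)] == ref, "ref allele mismatch"
        if pos > cursor:
            # Shared reference sequence between the previous site and this one.
            frontier = [add_node(reference[cursor:pos])]
        # Both alleles are created before the frontier is reassigned,
        # so they share the same predecessors and form a bubble.
        frontier = [add_node(ref), add_node(alt)]
        cursor = pos + len(ref)

    if cursor < len(reference):
        add_node(reference[cursor:])
    return nodes, edges


nodes, edges = build_variant_graph("ACGTAC", [(2, "G", "T")])
print(nodes)
print(sorted(edges))
```

A read carrying either allele now matches some path through the graph exactly, which is the property that reduces reference bias.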

Usage

building

You'll need the protobuf and jansson development libraries installed on your system.

sudo apt-get install protobuf-compiler libprotoc-dev libjansson-dev

You can also run make get-deps.

Other libraries may be required. Please report any build difficulties.

Now, obtain the repo and its submodules:

git clone --recursive https://github.com/ekg/vg.git

Then build with make, and run with ./vg.

What can I do?

Try building a graph and aligning to it:

vg construct -r small/x.fa -v small/x.vcf.gz >x.vg
vg align -s CTACTGACAGCAGAAGTTTGCTGTGAAGATTAAATTAGGTGATGCTTG x.vg

Note that you don't have to store the graph on disk at all. You can simply pipe it into the local aligner:

vg construct -r small/x.fa -v small/x.vcf.gz | vg align -s CTACTGACAGCAGAAGTTTGCTGTGAAGATTAAATTAGGTGATGCTTG -

You can also index and then map reads against the index of the graph:

# construct the graph
vg construct -r small/x.fa -v small/x.vcf.gz >x.vg

# store the graph in the index, and also index the graph's kmers of size 11
vg index -s -k 11 x.vg

# align a read to the indexed version of the graph
# note that the graph file is not opened, but x.vg.index is assumed
vg map -s CTACTGACAGCAGAAGTTTGCTGTGAAGATTAAATTAGGTGATGCTTG -k 11 x.vg >alignment.json
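The idea behind kmer-driven mapping can be sketched in a few lines of Python. This is an illustration only, not vg's implementation: an in-memory dict stands in for the disk-backed index, and the function names and the toy graph are assumptions made for the example. Every k-length walk through the graph is recorded with the node and offset where it starts, and a read's kmers are then looked up to find candidate nodes.

```python
from collections import defaultdict


def walk_kmers(nodes, successors, k):
    """Yield (kmer, node_id, offset) for every k-length walk in the graph."""
    def extend(node_id, offset, need):
        seq = nodes[node_id][offset:offset + need]
        if len(seq) == need:
            yield seq
            return
        # The node ran out of sequence: continue along every outgoing edge.
        for nxt in successors.get(node_id, ()):
            for rest in extend(nxt, 0, need - len(seq)):
                yield seq + rest

    for node_id, seq in nodes.items():
        for offset in range(len(seq)):
            for kmer in extend(node_id, offset, k):
                yield kmer, node_id, offset


def build_kmer_index(nodes, successors, k):
    """Map each kmer to the set of (node_id, offset) positions where it starts."""
    index = defaultdict(set)
    for kmer, node_id, offset in walk_kmers(nodes, successors, k):
        index[kmer].add((node_id, offset))
    return index


def seed_nodes(read, index, k):
    """Return the ids of nodes where some kmer of the read starts."""
    hits = set()
    for i in range(len(read) - k + 1):
        for node_id, _offset in index.get(read[i:i + k], ()):
            hits.add(node_id)
    return hits


# Toy graph: AC -> (G | T) -> TAC
nodes = {1: "AC", 2: "G", 3: "T", 4: "TAC"}
successors = {1: [2, 3], 2: [4], 3: [4]}
index = build_kmer_index(nodes, successors, 3)
print(sorted(seed_nodes("ACTTAC", index, 3)))
```

The seed nodes identify the region of the graph worth aligning against, so the full graph never has to be loaded to place a read.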

A variety of commands are available:

construct: graph construction
view: conversion (dot/protobuf/json/GFA)
index: index features of the graph in a disk-backed key/value store
find: use an index to find nodes, edges, kmers, or positions
paths: traverse paths in the graph
align: local alignment
map: global alignment (kmer-driven)
stats: metrics describing graph properties
join: combine graphs (parallel)
concat: combine graphs (serial)
ids: id manipulation

Implementation notes

vg is based around a graph object (vg::VG) which has a native serialized representation that is almost identical on disk and in-memory, with the exception of adjacency indexes that are built when the object is parsed from a stream or file. These graph objects are the results of queries of larger indexes, or manipulation (for example joins or concatenations) of other graphs. vg is designed for interactive, stream-oriented use. You can, for instance, construct a graph, merge it with another one, and pipe the result into a local alignment process. The graph object can be stored in an index (vg::Index), aligned against directly (vg::GSSWAligner), or "mapped" against in a global sense (vg::Mapper), using an index of kmers.
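The split between what is serialized and what is rebuilt on load can be sketched as follows. This is a Python illustration under stated assumptions, not the vg::VG class: JSON stands in for the protobuf encoding, and the class name, field names, and method names are hypothetical.

```python
import json
from collections import defaultdict


class ToyGraph:
    """Nodes and edges are serialized; adjacency indexes are derived on load."""

    def __init__(self, nodes, edges):
        self.nodes = dict(nodes)                  # node id -> sequence
        self.edges = [tuple(e) for e in edges]    # (from_id, to_id) pairs
        self._build_adjacency()

    def _build_adjacency(self):
        # In-memory only: these lookups are never written by dumps().
        self.edges_from = defaultdict(list)
        self.edges_to = defaultdict(list)
        for src, dst in self.edges:
            self.edges_from[src].append(dst)
            self.edges_to[dst].append(src)

    def dumps(self):
        # Serialized form: just the node and edge records.
        return json.dumps({
            "node": [{"id": i, "sequence": s} for i, s in self.nodes.items()],
            "edge": [{"from": a, "to": b} for a, b in self.edges],
        })

    @classmethod
    def loads(cls, text):
        # Parsing goes through __init__, which rebuilds the adjacency indexes.
        record = json.loads(text)
        nodes = {n["id"]: n["sequence"] for n in record["node"]}
        edges = [(e["from"], e["to"]) for e in record["edge"]]
        return cls(nodes, edges)


graph = ToyGraph({1: "AC", 2: "G", 3: "T", 4: "TAC"},
                 [(1, 2), (1, 3), (2, 4), (3, 4)])
copy = ToyGraph.loads(graph.dumps())
print(copy.edges_from[1], copy.edges_to[4])
```

Keeping the derived indexes out of the serialized form is what lets the on-disk and in-memory representations stay nearly identical.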

Once constructed, a variant graph (.vg is the suggested file extension) is typically around the same size as the reference (FASTA) and uncompressed variant set (VCF) that were used to build it. The index, however, may be much larger, perhaps by more than an order of magnitude. This is less of a concern because the index is not loaded into memory, but it could become a pain point as vg is scaled up to whole-genome mapping.

The serialization of very large graphs (>62MB) is enabled by the use of protocol buffer ZeroCopyStreams. Graphs are decomposed into sets of N (presently 10k) nodes, and these are written, with their edges, into graph objects that can be streamed into and out of vg. Graphs of unbounded size are possible using this approach.
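The chunking scheme can be sketched in Python as below. This is an illustration rather than vg's serialization code: JSON with a 4-byte length prefix stands in for protobuf messages on a ZeroCopyStream, the function names are hypothetical, and assigning each edge to the chunk that holds its source node is an assumption made so that every edge is written exactly once.

```python
import io
import json
import struct


def chunk_graph(nodes, edges, chunk_size):
    """Yield subgraphs of at most chunk_size nodes, each with its edges."""
    ids = sorted(nodes)
    for start in range(0, len(ids), chunk_size):
        part = ids[start:start + chunk_size]
        members = set(part)
        yield {
            "nodes": [(i, nodes[i]) for i in part],
            # Assumption: an edge travels with the chunk of its source node.
            "edges": [e for e in edges if e[0] in members],
        }


def write_chunks(stream, chunks):
    """Write each chunk as a length-prefixed record."""
    for chunk in chunks:
        payload = json.dumps(chunk).encode("utf-8")
        stream.write(struct.pack("<I", len(payload)))
        stream.write(payload)


def read_chunks(stream):
    """Read length-prefixed records one at a time until the stream ends."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break
        (size,) = struct.unpack("<I", header)
        yield json.loads(stream.read(size).decode("utf-8"))


def merge_chunks(chunks):
    """Reassemble a whole graph from a sequence of chunks."""
    nodes, edges = {}, []
    for chunk in chunks:
        for node_id, seq in chunk["nodes"]:
            nodes[node_id] = seq
        edges.extend((a, b) for a, b in chunk["edges"])
    return nodes, edges


nodes = {1: "AC", 2: "G", 3: "T", 4: "TAC", 5: "GG"}
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
buffer = io.BytesIO()
write_chunks(buffer, chunk_graph(nodes, edges, 2))
buffer.seek(0)
restored_nodes, restored_edges = merge_chunks(read_chunks(buffer))
print(restored_nodes == nodes, sorted(restored_edges) == sorted(edges))
```

Because no single record grows with the size of the graph, the writer and reader each hold only one chunk at a time, which is why the total graph size is unbounded.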

Development

License

MIT
