This project forked from moinsebi/gretl

Statistics and analysis for variation graphs

License: MIT License

Rust 23.47% Jupyter Notebook 76.53%

gretl - Graph evaluation toolkit

Description

gretl is a tool for basic graph statistics on GFA input. The statistics are based on nodes, edges and paths. Walks can also be used, but they are represented as paths internally. Many commands do not work without path or walk information.

Requirements on the GFA file:

The GFA file must

  • be in format v1.0, v1.1 or v1.2.
  • have numerical node IDs.

Comment:

  • Sorted node IDs are not required, but all "Jump" related statistics will be based on the order of the nodes in the GFA file.
  • We recommend dense node IDs, starting at 1 and ending at the number of nodes +1. Dense IDs are memory efficient on multiple levels.
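
To check a graph against these requirements before running gretl, a small script along the following lines may help. It is only a sketch and not part of gretl: the function name `summarize_node_ids` and the reported keys are hypothetical, and it only looks at segment (S) lines.

```python
def summarize_node_ids(gfa_lines):
    """Summarize the segment (node) IDs found in an iterable of GFA lines."""
    ids = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # Segment lines look like: S <id> <sequence> [tags...]
        if fields[0] == "S" and len(fields) > 1:
            ids.append(fields[1])

    # Numerical IDs are required; plain ASCII digits are assumed here.
    numeric = bool(ids) and all(i.isascii() and i.isdigit() for i in ids)
    if not numeric:
        return {"numeric": False}

    numbers = sorted(set(int(i) for i in ids))
    span = numbers[-1] - numbers[0] + 1
    return {
        "numeric": True,
        "min": numbers[0],
        "max": numbers[-1],
        # Number of IDs between min and max that no segment uses (0 = dense).
        "unused_ids": span - len(numbers),
    }
```

A result with `numeric` set to `False` suggests running `id2int` first; a large `unused_ids` value suggests the IDs are sparse.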

Installation:

Git

git clone https://github.com/MoinSebi/gretl  
cd gretl   
cargo build --release  
./target/release/gretl  

Testing

We provide a small test suite that covers the basic functionality of the tool. If you are interested in the output format, check the data/test/yeast/ directory after running the following command.

cargo test

Usage

Stats

Calculate statistics on a GFA file. A list of all stats can be found here. Please consider using the --pansn option to group the paths by sample. Read more information about PanSN-spec here.
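
As an illustration of what grouping by sample means, the sketch below splits PanSN-style path names (sample, haplotype and contig joined by a delimiter, commonly "#") and collects paths under their sample name. The function name `group_paths_by_sample` is hypothetical, and how gretl groups paths internally is not shown here.

```python
from collections import defaultdict


def group_paths_by_sample(path_names, delimiter="#"):
    """Group PanSN-style path names by their first field (the sample name)."""
    groups = defaultdict(list)
    for name in path_names:
        # PanSN layout: <sample><delim><haplotype><delim><contig>
        sample = name.split(delimiter, 1)[0]
        groups[sample].append(name)
    return dict(groups)


groups = group_paths_by_sample(["HG002#1#chr1", "HG002#2#chr1", "CHM13#1#chr1"])
for sample, names in groups.items():
    print(sample, len(names))
```

Names without the delimiter end up in a group of their own, so graphs that do not follow PanSN-spec are effectively left ungrouped.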

Available options:

  • -bins Adjust number and size of bins. Histogram-like statistics which classify nodes by their length into bins.
  • -path Report statistics for each path in the graph.
  • -y Report output in YAML format (default is tsv).

Graph statistics also include "hybrid" statistics, which are the average and standard deviation of all path statistics. All hybrid stats have the prefix "Path". A full list of all statistics can be found in the paper directory of this repository.
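
A hybrid statistic can be pictured as follows: take one value per path and reduce the values to a mean and a standard deviation. This is a minimal sketch; the function name is hypothetical, and the population standard deviation is an assumption, since the text does not say which variant gretl reports.

```python
import statistics


def hybrid_statistic(per_path_values):
    """Return (mean, standard deviation) of one statistic over all paths."""
    values = list(per_path_values)
    # Assumption: population standard deviation (pstdev), not sample (stdev).
    return statistics.mean(values), statistics.pstdev(values)


# Example: a hypothetical per-path node count for eight paths.
mean, sd = hybrid_statistic([2, 4, 4, 4, 5, 5, 7, 9])
print(mean, sd)
```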

Example


./gretl stats -g /path/to/graph.gfa -o /path/to/output.txt

Result

  • TSV or YAML file with statistics
  • Merge the output of multiple graphs to compare them.
  • Example comparison: plot
  • Example output

ID2INT

Convert any string-based node identifier to numeric values. Use odgi sort to sort the graph in pan-genomic order, which will create more meaningful statistics in gretl stats (see above). In any case, numerical node IDs are required by every gretl command.

Available options:

  • -d, --dict <dict> Write new and old IDs to a plain text file.

Example

./gretl id2int -g /path/to/graph.gfa -o /path/to/output.gfa -d /path/to/dict.txt

Result:

  • GFA file with numerical node IDs

Comment: This function will convert all IDs in the graph. Additional data in (segment-specific) tags will not be converted.
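
The idea behind the conversion can be sketched in a few lines. This is an illustration only, not gretl's implementation: the function name is reused for readability, IDs are assigned in order of first appearance in the S lines (an assumption), and only S, L and P records are rewritten (walks and other record types are passed through unchanged).

```python
def id2int(gfa_lines):
    """Rewrite string node IDs as integers; return (new_lines, mapping)."""
    # First pass: assign 1, 2, 3, ... in order of appearance of the segments.
    mapping = {}
    for line in gfa_lines:
        fields = line.split("\t")
        if fields[0] == "S":
            mapping.setdefault(fields[1], len(mapping) + 1)

    # Second pass: replace the IDs wherever S, L and P records refer to them.
    converted = []
    for line in gfa_lines:
        fields = line.split("\t")
        if fields[0] == "S":
            fields[1] = str(mapping[fields[1]])
        elif fields[0] == "L":
            fields[1] = str(mapping[fields[1]])
            fields[3] = str(mapping[fields[3]])
        elif fields[0] == "P":
            # Path steps look like "<id>+" or "<id>-", separated by commas.
            steps = fields[2].split(",")
            fields[2] = ",".join(str(mapping[s[:-1]]) + s[-1] for s in steps)
        converted.append("\t".join(fields))
    return converted, mapping
```

The returned `mapping` corresponds to what a dictionary file of old and new IDs would contain.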

Node-list

Individual node statistics. Statistics provided:

  • Length
  • Degree
  • Depth
  • Core

Length and degree are based on the graph itself, while depth and core are based on the paths.
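
The sketch below shows how these four values could be derived from a small GFA. The definitions are assumptions made for illustration: degree counts the link records touching a node, depth counts how often paths traverse it, and core counts the number of distinct paths containing it. gretl's exact definitions may differ, and the function name `node_table` is hypothetical.

```python
from collections import defaultdict


def node_table(gfa_lines):
    """Return {node: {"length", "degree", "depth", "core"}} for a GFA."""
    length = {}
    degree = defaultdict(int)
    depth = defaultdict(int)
    carriers = defaultdict(set)
    for line in gfa_lines:
        f = line.split("\t")
        if f[0] == "S":
            length[int(f[1])] = len(f[2])          # graph-based
        elif f[0] == "L":
            degree[int(f[1])] += 1                 # graph-based
            degree[int(f[3])] += 1
        elif f[0] == "P":
            for step in f[2].split(","):
                node = int(step[:-1])              # drop the +/- orientation
                depth[node] += 1                   # path-based: traversals
                carriers[node].add(f[1])           # path-based: distinct paths
    return {
        n: {"length": length[n], "degree": degree[n],
            "depth": depth[n], "core": len(carriers[n])}
        for n in sorted(length)
    }
```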

Example

./gretl node-list -g /path/to/graph.gfa -o /path/to/output.txt

Result

  • Example output

Comment: The information in the reported table can be used for individual lookups or to create your own window-like statistics (over nodes).

Core

Compute user-defined statistics of the graph (-s). The statistic is calculated for each node and then summarized: for each possible value, the number of nodes and the amount of sequence are reported. In an additional file ("...private.txt"), we report for each path the number of nodes and the amount of sequence present solely in this sample.

Available options:

  • -s, --stats <statistics> Define the statistics you want to summarize (see above) [default: similarity].

Example

./gretl core -g /path/to/graph.gfa -o /path/to/output.txt
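
For the default statistic (similarity), the summary can be pictured as below. This is a sketch under assumptions: similarity is taken to be the number of paths that contain a node, nodes that are on no path are ignored, and the function name and input layout (a dict of node lengths and a dict of node lists per path) are hypothetical.

```python
from collections import Counter, defaultdict


def core_summary(node_length, paths):
    """Summarize nodes and sequence per similarity level, plus private content."""
    # Which paths contain each node?
    carriers = defaultdict(set)
    for name, nodes in paths.items():
        for node in nodes:
            carriers[node].add(name)

    nodes_per_level = Counter()
    sequence_per_level = Counter()
    private = {name: [0, 0] for name in paths}  # [nodes, sequence] per path
    for node, names in carriers.items():
        level = len(names)
        nodes_per_level[level] += 1
        sequence_per_level[level] += node_length[node]
        if level == 1:
            # Present in exactly one path: counts as private to that path.
            (owner,) = names
            private[owner][0] += 1
            private[owner][1] += node_length[node]
    return dict(nodes_per_level), dict(sequence_per_level), private
```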

Result

Path similarity (PS)

Calculate for each path the number of nodes and the amount of sequence at each similarity level.

./gretl ps -g /path/to/graph.gfa -o /path/to/output.txt
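
A rough picture of this calculation, assuming that the similarity level of a node is the number of paths containing it and that a node visited several times by the same path is counted once (both are assumptions; the function name is hypothetical):

```python
from collections import defaultdict


def path_similarity(node_length, paths):
    """Return {path: {level: (number_of_nodes, sequence)}}."""
    carriers = defaultdict(set)
    for name, nodes in paths.items():
        for node in nodes:
            carriers[node].add(name)

    result = {}
    for name, nodes in paths.items():
        per_level = defaultdict(lambda: [0, 0])
        for node in set(nodes):               # count each node once per path
            level = len(carriers[node])
            per_level[level][0] += 1
            per_level[level][1] += node_length[node]
        result[name] = {level: tuple(v) for level, v in per_level.items()}
    return result
```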

Result ps plot

Feature

Select nodes based on input settings. The output can be used as input for gfa2bin.

./gretl feature -g /path/to/graph.gfa -o /path/to/nodes.txt -D 10 

Result

  • List of nodes which fulfill the input settings (plain-text, one node per line)

Path

Select paths based on input settings. The output can be used as input for gfa2bin.

./gretl feature -g /path/to/graph.gfa -o /path/to/nodes.txt -s "N/D ration" -m 10

Result

  • List of paths/samples which fulfill the input settings (plain-text, one path per line)

Bootstrap

Sample-based bootstrapping to calculate the number of nodes and the amount of sequence for each possible number of samples. We start with the "complete" graph and remove random paths for each run, then recalculate the general statistics and summarize the amount of sequence/nodes for each level (e.g. similarity).
We recommend bootstrapping graphs that follow PanSN-spec. Use --nodes if the bootstrap should only run on a subset of nodes.
You can adjust the number of bootstraps, calculate only one "level", or provide a meta file as input. Examples are shown in the data/example_data/ directory.
Meta files can be used to apply the same "combinations" to multiple graphs. This only works if the paths/samples of the graphs are in the same order.

Available options:

  • --nodes <nodes> Run bootstrap only on these nodes
  • --meta-input <meta input> Use a meta file as input.
  • --level <level> Run bootstrap only for a specific level
  • --number <number> Number of bootstraps for each number of genomes
  • --meta-line <meta line> Run a bootstrap of a specific line in the meta file.
  • --meta <meta> Report the meta information in the output.
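
The general idea can be sketched as follows. Note that this is a simplified stand-in and not gretl's procedure: instead of removing paths from the complete graph, it draws random subsets of samples for every subset size, and it only reports the total number of nodes plus the nodes and sequence shared by all chosen samples. The function name, arguments and row layout are hypothetical.

```python
import random
from collections import Counter


def bootstrap(node_length, paths, number=3, seed=0):
    """Return rows (size, run, total_nodes, shared_nodes, shared_sequence)."""
    rng = random.Random(seed)        # fixed seed so runs are reproducible
    names = sorted(paths)
    rows = []
    for size in range(1, len(names) + 1):
        for run in range(number):
            chosen = rng.sample(names, size)     # one random combination
            count = Counter()
            for name in chosen:
                count.update(set(paths[name]))   # each node once per sample
            shared = [n for n, c in count.items() if c == size]
            rows.append((size, run, len(count), len(shared),
                         sum(node_length[n] for n in shared)))
    return rows
```

Storing the `chosen` combinations would play the role of a meta file: the same combinations could then be replayed on another graph with the samples in the same order.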

Example

./gretl bootstrap -g /path/to/graph.gfa -o /path/to/output.txt -n 20 

Result

(Sliding, path) window

Calculate statistics on a node level (graph- or path-based) and summarize them for each path in a sliding window approach. In detail: Iterate over the nodes of a path (window-like), summarize the stats of all nodes in the window and report a single value for each window.

Example

./gretl window -g /path/to/graph.gfa -o /path/to/output.txt -s 1000 --step 100
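
The sliding-window summary can be illustrated with a short sketch. Here the window size and step are measured in path steps (nodes) and the summary is the mean of a per-node value; both are assumptions, as is the function name. The units used by gretl's -s and --step options are not specified here.

```python
def sliding_window(path_nodes, node_value, size, step):
    """Return one summarized value (the mean) per window along a path."""
    if not path_nodes:
        return []
    values = []
    # A path shorter than the window still yields one (shorter) window.
    last_start = max(len(path_nodes) - size, 0)
    for start in range(0, last_start + 1, step):
        window = path_nodes[start:start + size]
        values.append(sum(node_value[n] for n in window) / len(window))
    return values


# Example: a per-node statistic (e.g. depth) along one path.
depth = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
print(sliding_window([1, 2, 3, 4, 5], depth, size=3, step=1))
```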

Result

Nwindow

Summarizing the graph by a window of nodes. We iterate numerically over the nodes and calculate the statistics for each window. We start at the current node and move away from it based on provided edges, collecting the new nodes. We repeat this process starting at the "new" nodes until one of the following conditions is met:

  • Jumps: A jump is defined as the difference between the current and the next node. Your input refers to the sum of all jumps in the window.
  • Steps: A step is a move we make in the graph. Your input is the maximum number of steps from the starting node.
  • Sequence: Limit the window by a sequence threshold. We stop if the sequence length is larger than the provided threshold.

Example: How many nodes do I need to collect 1000 bp?

./gretl nwindow -g /path/to/graph.gfa -o /path/to/output.txt --sequence 1000 --node-number

Output: You can return the number of collected nodes, the total number of jumps or the total sequence. Some combinations of input limit and output do not provide any additional information.
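
For the sequence limit, the expansion can be sketched as a breadth-first search that stops once the collected sequence exceeds the threshold. This is an illustration under assumptions: edges are treated as undirected, the adjacency is passed in as a dict, the search may stop in the middle of a level, and the function name is hypothetical.

```python
from collections import deque


def nodes_to_reach(start, neighbors, node_length, sequence_limit):
    """Return (collected_nodes, collected_sequence) for one start node."""
    seen = {start}
    total = node_length[start]
    queue = deque([start])
    # Stop as soon as the collected sequence is larger than the threshold.
    while queue and total <= sequence_limit:
        node = queue.popleft()
        for nxt in sorted(neighbors.get(node, ())):
            if nxt in seen:
                continue
            seen.add(nxt)
            total += node_length[nxt]
            queue.append(nxt)
            if total > sequence_limit:
                break
    return len(seen), total


neighbors = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
node_length = {1: 4, 2: 1, 3: 2, 4: 5}
print(nodes_to_reach(1, neighbors, node_length, sequence_limit=6))
```

Running this for every node answers the question above per window: how many nodes are needed to collect the requested amount of sequence.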

nwindow plot

Find

Find a specific node (10), directed node (10+), or edge (10+20+) in the graph and get its exact (sequence) position in the paths. The output is a BED file with the positions. You can add additional sequence (-l) on both sides, which can help if you want to realign to a database and the node is very small.

./gretl find -g /path/to/graph.gfa -o /path/to/output.txt --length 1000 -f feature.txt 

An example feature file is data/example_data/dirnodes.txt
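
How a node maps to path coordinates can be sketched as follows: walk along each path, sum the node lengths to get the offset, and emit a BED-style interval (0-based start, exclusive end) whenever the node is met. This is a simplified illustration for undirected nodes only; directed nodes and edges are not handled, and the function name and input layout are hypothetical.

```python
def find_node(target, node_length, paths, flank=0):
    """Return BED-like tuples (path, start, end) for every visit of a node."""
    records = []
    for name, nodes in paths.items():
        path_length = sum(node_length[n] for n in nodes)
        offset = 0
        for node in nodes:
            if node == target:
                # Extend by the flank on both sides, clipped to the path.
                start = max(offset - flank, 0)
                end = min(offset + node_length[node] + flank, path_length)
                records.append((name, start, end))
            offset += node_length[node]
    return records


node_length = {1: 4, 2: 1, 3: 2}
paths = {"a": [1, 2, 3], "b": [1, 3]}
for name, start, end in find_node(3, node_length, paths):
    print(f"{name}\t{start}\t{end}")
```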

Scripts

We provide multiple Jupyter notebooks to visualize the output of the tool.

Requirements

  • Jupyter
  • Matplotlib
  • Pandas
  • Numpy
  • Seaborn
