
mattravenhall / sv-pop

Upscaling SV detection to a multi-population level.

License: MIT License

Languages: Python 87.00%, R 7.81%, Shell 4.89%, CSS 0.29%
Topics: malaria, structural-variation, genomics

sv-pop's Introduction

SV-Pop

SV-Pop performs post-discovery SV analysis and visualisation; it contains two modules for these purposes. Both modules should work out of the box, but it's a good idea to run preflightchecks.py (in Analysis/) to check that all dependencies are installed, and to optionally add SVPop to your PATH.

Extended documentation, including specifics regarding input files, is present on this repo's wiki.

Pipeline Overview

Analysis Module

Preview Analysis

Quick start:

SVPop -h

Functions

  • DEFAULT: Process individual vcf files to population-level lists.
  • CONVERT: Convert a variant output file into a window file.
  • FILTER: Filter a variant output file by a range of factors.
  • MERGE-CHR: Merge per-chromosome variants files into one file.
  • MERGE-MODEL: Merge by-model variants files into one file.
  • SUBSET: Create a subset of a given variant or window file.
  • STATS: Produce summary statistics for a variant or window file.
  • PREPROCESS: Process analysis output files for visualisation.

Expanded help can be found on the wiki.

Visualisation Module

Preview Visualiser

Quick start:

SVPop --PREPROCESS --variantFile=YOUR_PREFIX
Rscript easyRun.r

Expected Input

The visualisation module will expect the following files in Visualisation/Files/:

  • <model>_Variants.csv: Reformatted variants file.
  • <model>_Windows.csv: Reformatted windows file.
  • <model>_AllIndex.csv: Locations of all variants, for faster indexing.
  • <model>_FrqIndex.csv: Subset of AllIndex for 'frequent' (>5%) variants only.
  • annotation.txt: The annotation file used for your SVPop Analysis run (in tsv format).

These files can be created from a default SVPop Analysis run with SVPop --PREPROCESS --variantFile=YOUR_PREFIX.
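
Before launching the visualiser, it can help to confirm that the files listed above are actually in place. The following is a minimal sketch, not part of SVPop itself: the function name is hypothetical, "DUP" is only an example model name, and the directory is assumed to be Visualisation/Files/ relative to the working directory.

```python
from pathlib import Path

# Per-model file name endings, taken from the list of expected inputs.
PER_MODEL_SUFFIXES = ["Variants.csv", "Windows.csv", "AllIndex.csv", "FrqIndex.csv"]


def missing_visualisation_files(files_dir, model):
    """Return the expected file names that are absent from files_dir."""
    expected = [f"{model}_{suffix}" for suffix in PER_MODEL_SUFFIXES]
    expected.append("annotation.txt")
    directory = Path(files_dir)
    return [name for name in expected if not (directory / name).is_file()]


if __name__ == "__main__":
    # "DUP" is an example model name; replace it with the model you ran.
    missing = missing_visualisation_files("Visualisation/Files", "DUP")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected files are present.")
```

The check only tests for the presence of each file; it does not validate their contents.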

Expanded help can be found on the wiki.

Citation

Matt Ravenhall, Susana Campino and Taane G. Clark. BMC Bioinformatics 2019 20:136. https://doi.org/10.1186/s12859-019-2718-4

sv-pop's Issues

Allow users to specify sample names

Currently, sample names are pulled from the input vcf file names by assuming they are given as full paths containing only one full stop character. This works in most cases, but will fail if a user's vcf file names contain multiple full stops, or if a user short-hands their path (e.g. ./thisfile.vcf). This method also restricts users who wish to define sample names that differ from their file names.

To resolve this, I plan to allow for a special column in the subPops file (e.g. named 'SampleID') which will link file locations to custom IDs.
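
As an illustration of the idea only (this is not SVPop's current or planned code), a sample name could be derived from the file name alone, with a user-supplied ID taking priority when one is given. All function names below are hypothetical, and the custom_ids dictionary stands in for whatever a 'SampleID' column would provide.

```python
from pathlib import Path

# Longest suffix first, so ".vcf.gz" is stripped before ".vcf" is considered.
VCF_SUFFIXES = (".vcf.gz", ".vcf")


def sample_name_from_path(vcf_path):
    """Derive a sample name from the file name, ignoring directories."""
    name = Path(vcf_path).name
    for suffix in VCF_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def resolve_sample_name(vcf_path, custom_ids=None):
    """Prefer a user-supplied ID for this path, else fall back to the file name."""
    if custom_ids and vcf_path in custom_ids:
        return custom_ids[vcf_path]
    return sample_name_from_path(vcf_path)


if __name__ == "__main__":
    print(resolve_sample_name("./thisfile.vcf"))
    print(resolve_sample_name("/data/sample.v2.vcf.gz"))
    print(resolve_sample_name("/data/a.vcf", {"/data/a.vcf": "Patient_A"}))
```

Stripping only a known suffix means extra full stops in the file name, or a leading ./ in the path, no longer affect the result.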

Steps required to go from analysis to visualization

I tried to go from the analysis part to visualization. However, this raised some issues:

  1. Running the SV-Pop command:
    /software/SV-Pop/Analysis/SVPop -F=SampleFilesFilter.txt -M=DUP -R=${ref_genome_gff} --refFormat=gff --multithreads=True --threads=19 --suppressWarnings=False
    This creates a "merged" file called: Merged_DUP_chrALL_variants_annotated_v1.1.csv

  2. Running the preprocess command:
    ../software/SV-Pop/Analysis/SVPop --PREPROCESS --variantFile=outFile
    This command looks for a duplication file called "outFile_DUP_chrALL_variants_annotated_v1.1.csv". This file does not exist, so I assume I should supply the Merged_DUP_chrALL_variants_annotated_v1.1 file here instead. After creating a symbolic link:
    ln -s Merged_DUP_chrALL_variants_annotated_v1.1.csv outFile_DUP_chrALL_variants_annotated_v1.1.csv
    I got an error message saying that it cannot find the windows file outFile_DUP_chrALL_windows_annotated_v1.1.csv. However, unlike for the variants file, I cannot find an equivalent file called Merged_DUP_chrALL_windows_annotated_v1.1.csv.

  3. Creating a merged windows file based on the variants file.
    I am not sure whether I am using this command correctly, as I could not find more details in the wiki. The help text says that it converts a variant file into a window file. When running the command
    ../software/SV-Pop/Analysis/SVPop --CONVERT --variantFile=outFile_DUP_chrALL_variants_annotated_v1.1.csv --refFile=SVPop_Files/annotation.txt
    I get the same output as before: the files with the extension windows_annotated_v1.1.csv are overwritten with new files, which seem to be the same as before but now contain the annotation.

  4. Concatenating the windows files.
    Next, I tried to concatenate the different files ending with windows_annotated_v1.1.csv using awk:
    awk 'FNR>1' *windows_annotated_v1.1.csv > outFile_DUP_chrALL_windows_annotated_v1.1.csv
    (and then added the header back).
    After that, the ../software/SV-Pop/Analysis/SVPop --PREPROCESS --variantFile=outFile command seems to run without errors. However, when visualizing those files I only get empty plots, and for population I can only choose between "Features" and "Samples".
    Adding a dummy population did not solve the issue either, as this resulted in a value of 0.0 for all windows.
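
The concatenation in step 4 can also be done in one pass that keeps the header. This is a rough sketch under the assumption that every per-chromosome windows file has the same single-line header; the function name is hypothetical, and it does not address the empty plots described above. The output file is excluded from the inputs because its name matches the same pattern.

```python
import glob


def concatenate_with_single_header(input_paths, output_path):
    """Write the first file's header once, then the data rows of every file."""
    header_written = False
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as handle:
                header = handle.readline()
                if header and not header_written:
                    out.write(header if header.endswith("\n") else header + "\n")
                    header_written = True
                for line in handle:
                    # Guard against a final line that lacks a trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")


if __name__ == "__main__":
    output = "outFile_DUP_chrALL_windows_annotated_v1.1.csv"
    inputs = sorted(
        p for p in glob.glob("*windows_annotated_v1.1.csv") if p != output
    )
    if inputs:
        concatenate_with_single_header(inputs, output)
```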

I added the files used for visualization.

viz.zip

SVpop gets stuck on certain chromosomes

When running SVpop on a set of 16 vcf files (created using Delly), SVpop hangs on certain chromosomes. This seems quite reproducible, as the same chromosomes always cause the problem. This is confirmed when filtering the vcf files for each chromosome separately and running SVpop on a per-chromosome basis. When running SVpop on a supercomputer, the issue starts when memory usage suddenly increases (using up to 100G of RAM), followed by full usage of the 2Gb swap space. This seems rather strange to me, as the input files only consist of a few Mb of data.

So we can get output files for some chromosomes (*_v1_windows_annotated_v1.1.csv), and even concatenate them using simple bash commands, run PREPROCESS and do visualization. However, as said, for some chromosomes the output "_v1_windows_annotated_v1.1.csv" is not created.

Error: Too few (0) per-chromosome windows to merge

First of all, thank you for making your software freely available to the scientific community.
Every attempt to run an analysis, using different vcf.gz sample files and any of the model options, ends with the following error message:
"Too few (0) per-chromosome windows to merge".
Any advice that would help us get a successful result would be invaluable. Thank you very much in advance for your reply.

Running SV-Pop

Dear Matt,

I was hoping to run SV-pop on a large number (>1000) of VCFs called by Manta and Canvas. At present these files are merged so that each person has a single VCF with all the SV types combined, with the types appearing in the "INFO" column, e.g. MantaDEL, MantaINS etc.

If I wanted to run SV-pop, am I correct in thinking that I need to separate out each call type per patient?

I also have VCFs of each type of SV call, merged across all 1000 patients, and am unsure whether I can use these directly.

Apologies for the novice questions

All the best

Omid
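
Whether SV-Pop requires one VCF per call type is not answered in this thread. If per-type files do turn out to be needed, a per-sample VCF could be split along the following lines. This is only a sketch: it assumes each record carries the standard SVTYPE key in its INFO field (Manta or Canvas output may need different handling), the function names are hypothetical, and all records are held in memory, which may not suit very large files.

```python
import gzip
from collections import defaultdict


def open_vcf(path):
    """Open a plain or gzip-compressed VCF in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def sv_type(info_field):
    """Return the SVTYPE value from a VCF INFO string, or None if absent."""
    for entry in info_field.split(";"):
        if entry.startswith("SVTYPE="):
            return entry[len("SVTYPE="):]
    return None


def split_vcf_by_sv_type(vcf_path, output_prefix):
    """Write one uncompressed VCF per SVTYPE, each with the original header."""
    header = []
    records = defaultdict(list)
    with open_vcf(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                header.append(line)
                continue
            fields = line.rstrip("\n").split("\t")
            # INFO is the eighth tab-separated column of a VCF record.
            kind = sv_type(fields[7]) if len(fields) > 7 else None
            records[kind or "UNKNOWN"].append(line)
    outputs = {}
    for kind, lines in records.items():
        out_path = f"{output_prefix}_{kind}.vcf"
        with open(out_path, "w") as out:
            out.writelines(header)
            out.writelines(lines)
        outputs[kind] = out_path
    return outputs
```

For example, calling split_vcf_by_sv_type("patient1.vcf.gz", "patient1") on a hypothetical input would write files such as patient1_DEL.vcf and patient1_INS.vcf for whichever types are present, and return a dictionary mapping each type to its output path.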
