mattravenhall / sv-pop

Upscaling SV detection to a multi-population level.

License: MIT License

CSS 0.29% Shell 4.89% Python 87.00% R 7.81%

sv-pop's Introduction

SV-Pop

SV-Pop performs post-discovery SV analysis and visualisation; it contains two modules for these purposes. Both modules should work out of the box, but it's a good idea to run preflightchecks.py (in Analysis/) to check that all dependencies are installed, and to optionally add SVPop to your PATH.

Extended documentation, including specifics regarding input files, is present on this repo's wiki.

Analysis Module

Quick start:

SVPop -h

Functions

DEFAULT: Process individual vcf files to population-level lists.
CONVERT: Convert a variant output file into a window file.
FILTER: Filter a variant output file by a range of factors.
MERGE-CHR: Merge per-chromosome variants files into one file.
MERGE-MODEL: Merge by-model variants files into one file.
SUBSET: Create a subset of a given variant or window file.
STATS: Produce summary statistics for a variant or window file.
PREPROCESS: Process analysis output files for visualisation.

Expanded help can be found on the wiki.

Visualisation Module

Quick start:

SVPop --PREPROCESS --variantFile=YOUR_PREFIX
Rscript easyRun.r

Expected Input

The visualisation module will expect the following files in Visualisation/Files/:

<model>_Variants.csv: Reformatted variants file.
<model>_Windows.csv: Reformatted windows file.
<model>_AllIndex.csv: Locations of all variants, for faster indexing.
<model>_FrqIndex.csv: Subset of AllIndex for 'frequent' (>5%) variants only.
annotation.txt: The annotation file used for your SVPop Analysis run (in tsv format).

These files can be created from a default SVPop Analysis run with SVPop --PREPROCESS --variantFile=YOUR_PREFIX.
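
As a quick sanity check before launching the visualisation, the file list above can be verified with a short script. This is a minimal sketch and not part of SV-Pop: the helper name `missing_visualisation_files` is hypothetical, and it assumes `<model>` is replaced by a model name such as `DUP`.

```python
from pathlib import Path

# Per-model file suffixes taken from the expected-input list above.
MODEL_SUFFIXES = ("Variants", "Windows", "AllIndex", "FrqIndex")


def missing_visualisation_files(files_dir, models):
    """Return the expected input files that are absent from files_dir.

    Hypothetical helper for illustration only; not part of SV-Pop.
    """
    files_dir = Path(files_dir)
    expected = [
        f"{model}_{suffix}.csv" for model in models for suffix in MODEL_SUFFIXES
    ]
    expected.append("annotation.txt")
    # Keep only the names that do not exist as regular files.
    return [name for name in expected if not (files_dir / name).is_file()]
```

For example, `missing_visualisation_files("Visualisation/Files", ["DUP"])` returns an empty list when every expected file is in place, and otherwise lists the file names still to be created.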

Expanded help can be found on the wiki.

Citation

Matt Ravenhall, Susana Campino and Taane G. Clark. BMC Bioinformatics 2019 20:136. https://doi.org/10.1186/s12859-019-2718-4

sv-pop's Issues

Allow users to specify sample names

Currently sample names are pulled from input vcf files by assuming they are given as full paths with only one full stop character. This works in most cases, but will fail if a user's vcf file names contain multiple full stops, or if a user short-hands their path (e.g. ./thisfile.vcf). This method also restricts users who wish to define sample names that differ from their file names.

To resolve this, I plan to allow for a special column in the subPops file (e.g. named 'SampleID') which will link file locations to custom IDs.
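
The idea proposed in this issue can be sketched as follows. This is an illustration under stated assumptions, not SV-Pop's implementation: the function name `resolve_sample_id` and the `custom_ids` mapping (file path to SampleID, as might be read from the proposed column) are hypothetical, and the fallback simply strips a trailing `.vcf` or `.vcf.gz` rather than splitting on the first full stop.

```python
import os


def resolve_sample_id(vcf_path, custom_ids=None):
    """Return a sample ID for vcf_path.

    custom_ids is a hypothetical mapping of file path -> custom ID, as might
    be built from a 'SampleID' column in the subPops file.
    """
    if custom_ids and vcf_path in custom_ids:
        return custom_ids[vcf_path]
    # Fallback: drop the directory and only a trailing .vcf or .vcf.gz,
    # so additional full stops in the file name are preserved.
    name = os.path.basename(vcf_path)
    for extension in (".vcf.gz", ".vcf"):
        if name.endswith(extension):
            return name[: -len(extension)]
    return name
```

With this fallback, a short-hand path such as `./thisfile.vcf` and a name with several full stops such as `sample.1.vcf.gz` both yield usable IDs, while an entry in `custom_ids` always takes precedence.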

How to merge SV files from different callers?

Hi,
Thanks for your tool. How can I use it to merge SV files from different callers, such as manta, delly, lumpy and svaba?

Best,
xiucz

Steps required to go from analysis to visualization

I tried to go from the analysis part to visualization. However, this raised some issues:

1. Running the SV-Pop command:
/software/SV-Pop/Analysis/SVPop -F=SampleFilesFilter.txt -M=DUP -R=${ref_genome_gff} --refFormat=gff --multithreads=True --threads=19 --suppressWarnings=False
This creates a "merged" file called Merged_DUP_chrALL_variants_annotated_v1.1.csv.

2. Running the preprocess command:
../software/SV-Pop/Analysis/SVPop --PREPROCESS --variantFile=outFile
This command looks for a duplication file called outFile_DUP_chrALL_variants_annotated_v1.1.csv. That file does not exist, so I assume I should supply Merged_DUP_chrALL_variants_annotated_v1.1.csv here. After creating a symbolic link:
ln -s Merged_DUP_chrALL_variants_annotated_v1.1.csv outFile_DUP_chrALL_variants_annotated_v1.1.csv
I got an error message saying that it cannot find the windows file outFile_DUP_chrALL_windows_annotated_v1.1.csv. Unlike the variants file, I cannot find an equivalent file called Merged_DUP_chrALL_windows_annotated_v1.1.csv.

3. Creating a merged windows file based on the variants file:
I am not sure whether I am using this command correctly, as I could not find more details in the wiki. The help text says that it converts a variant file into a window file. When running the command
../software/SV-Pop/Analysis/SVPop --CONVERT --variantFile=outFile_DUP_chrALL_variants_annotated_v1.1.csv --refFile=SVPop_Files/annotation.txt
I get the same output as before: the files with the extension windows_annotated_v1.1.csv are overwritten with new files, which seem to be the same as before but now contain the annotation.

4. Concatenating the windows files:
Next, I tried to concatenate the different files ending in windows_annotated_v1.1.csv using awk:
awk 'FNR>1' *windows_annotated_v1.1.csv > outFile_DUP_chrALL_windows_annotated_v1.1.csv
(and afterwards added the header again).
After that, the ../software/SV-Pop/Analysis/SVPop --PREPROCESS --variantFile=outFile command does seem to run without errors. However, when visualizing those files, I only get empty plots, and for population I can only choose between "Features" and "Samples".
Adding a dummy population did not solve the issue either, as this resulted in a value of 0.0 for all windows.

I added the files used for visualization.

viz.zip
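
The manual concatenation step described in this issue (drop each file's header, then add one header back) can be written as a single pass. The following is a minimal sketch, assuming all per-chromosome window files share an identical header line; the function name `concatenate_csvs` is hypothetical and not part of SV-Pop, and the sketch makes no claim about resolving the empty plots reported above.

```python
import glob
import os


def concatenate_csvs(pattern, out_path):
    """Concatenate CSV files matching pattern into out_path, keeping one header.

    Hypothetical helper for illustration only; not part of SV-Pop.
    """
    out_abs = os.path.abspath(out_path)
    # Resolve inputs before opening the output, and skip the output file
    # itself in case it already exists and matches the pattern.
    paths = sorted(
        p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs
    )
    header_written = False
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                header = handle.readline()
                if not header:
                    continue  # Skip empty files.
                if not header_written:
                    out.write(header if header.endswith("\n") else header + "\n")
                    header_written = True
                for line in handle:
                    # Guard against a missing final newline in an input file.
                    out.write(line if line.endswith("\n") else line + "\n")
    return paths
```

Excluding the output path from the inputs matters when the output name itself ends in `windows_annotated_v1.1.csv`: on a second run, a bare wildcard would otherwise pick up the previously merged file as an input.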

SVpop gets stuck on certain chromosomes

When running SVpop on a set of 16 vcf files (created using Delly), SVpop hangs on certain chromosomes. This seems quite reproducible, as the same chromosomes always cause the problem. This was confirmed by filtering the vcf files for each chromosome separately and running SVpop on a per-chromosome basis. When running SVpop on a supercomputer, the issue starts when memory usage suddenly increases (using up to 100G of RAM), followed by full usage of the 2Gb swap space. This seems rather strange to me, as the input files only consist of a few Mb of data.

So we can get output files for some chromosomes (*_v1_windows_annotated_v1.1.csv), and even concatenate them using simple bash commands, run PREPROCESS and do visualization. However, as said, for some chromosomes the output "_v1_windows_annotated_v1.1.csv" is not created.

Error: Too few (0) per-chromosome windows to merge

First of all, I want to thank you for freely making available your software development to the scientific community.
Every attempt to run an analysis with different vcf.gz sample files, using any of the model options, ends with the following error message:
"Too few (0) per-chromosome windows to merge".
Any advice that would help us get a successful result would be invaluable. Thank you so much in advance for your reply.

Running SV-Pop

Dear Matt,

I was hoping to run SV-pop on a large number (>1000) of VCFs called by Manta and Canvas. At present these files are merged so each person has a single VCF with all the SV types merged into one and the types appearing in the "INFO" column, e.g. MantaDEL, MantaINS etc.

If I wanted to run SV-pop, am I correct in thinking I need to separate out each call type per patient?

I also have VCFs of each type of SV call but merged across all 1000 patients and am unsure if I can use this directly?

Apologies for the novice questions

All the best

Omid
