DeepSVFilter

Introduction

DeepSVFilter is a deep-learning-based tool designed to filter false-positive structural variants (SVs) called by any computational approach from short-read whole-genome sequencing data. It can be used either as a stand-alone SV filter or coupled with commonly used SV detection tools (Delly, Lumpy, Manta, etc.) to improve specificity.

Installation

Dependencies

tensorflow==1.15.0, matplotlib==3.1.0, numpy<2.0,>=1.16.0, opencv-python==3.1.0.4, Pillow==7.2.0, pysam==0.15.4, scikit-learn==0.19.2, scipy

Install from GitHub (requires Python 3.6.* or newer)

  1. git clone https://github.com/yongzhuang/DeepSVFilter.git
  2. cd DeepSVFilter
  3. pip install .

Running

usage: DeepSVFilter [OPTIONS]

1. preprocess This option is used to generate SV images for candidate SVs.

usage: DeepSVFilter preprocess [OPTIONS]  

	--sv_type		<STR>	SV type (DEL or DUP) (required)  

	--bam_path		<FILE>	BAM file (required)  

	--bed_path		<FILE>	SV BED file (required)  

	--patch_size		<INT>	image patch size (224 or 299) (required)  

	--output_imgs_dir	<DIR>	output image folder (required)  

	--mean_insert_size	<INT>	mean of the insert size (optional)  

	--sd_insert_size	<INT>	standard deviation of the insert size (optional)  
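
If the two optional insert-size values are to be supplied, one way to obtain them is to sample properly paired reads from the BAM file. The sketch below is an assumption about a reasonable estimation procedure, not DeepSVFilter's own method; the function names, the 10,000 bp cutoff and the sample size are hypothetical.

```python
import itertools
import statistics


def insert_size_stats(lengths, max_len=10000):
    """Return (mean, sd) of the insert sizes, rounded to integers.

    Non-positive values and values above max_len (a hypothetical cutoff)
    are dropped so that a few aberrant pairs do not dominate the estimate.
    """
    kept = [n for n in lengths if 0 < n <= max_len]
    if len(kept) < 2:
        raise ValueError("need at least two usable insert sizes")
    return round(statistics.mean(kept)), round(statistics.stdev(kept))


def sample_insert_sizes(bam_path, n_reads=100000):
    """Collect template lengths from the first n_reads alignments of a BAM file."""
    import pysam  # imported here so insert_size_stats works without pysam

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return [
            abs(read.template_length)
            for read in itertools.islice(bam, n_reads)
            if read.is_proper_pair and read.is_read1
            and not read.is_secondary and not read.is_supplementary
        ]
```

The two returned integers could then be passed as --mean_insert_size and --sd_insert_size, for example `mean, sd = insert_size_stats(sample_insert_sizes("./data/example.bam"))`.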

2. augmentate This option is used to do data augmentation for typical SVs. Note that this step is optional.

usage: DeepSVFilter augmentate [OPTIONS]   

	--output_imgs_dir	<DIR>	output image folder (required)
	
	--image_path_file	<FILE>	input typical true or false image path file (required)
	
	--patch_size		<INT>	image patch size (224 or 299) (required)  

3. train This option is used to train a convolutional neural network (CNN) based SV classification model.

usage: DeepSVFilter train [OPTIONS]   

	--sv_type		<STR>	SV type (DEL or DUP) (required)  
	--checkpoint_dir	<DIR>	checkpoint folder (required)  
	--pos_train_file	<FILE>	path file of positive SV images used for training (required)  
	--neg_train_file	<FILE>	path file of negative SV images used for training (required)  
	--pos_eval_file		<FILE>	path file of positive SV images used for evaluation (required)  
	--neg_eval_file		<FILE>	path file of negative SV images used for evaluation (required)  
	--eval_result_dir	<DIR>	validation result (required)  
	--summary_dir		<DIR>	tensorboard summary (required)  
	--use_gpu		<INT>	gpu flag, 1 for GPU and 0 for CPU (optional, default 0)  
	--gpu_idx		<INT>	GPU idx (optional, default 0)  
	--gpu_mem		<FLOAT>	gpu memory usage (0 to 1) (optional, default 0.5)  
	--model			<STR>	M1(for MobileNet_v1) or M2_1.0(for MobileNet_v2_1.0) or M2_1.4(for MobileNet_v2_1.4) or NAS(for NASNet_A_Mobile) or PNAS(for PNASNet_5_Mobile) or IR_v2(for Inception_ResNet_v2) (optional, default M1)  
	--epoch 		<INT>	total number of epochs (optional, default 13)  
	--batch_size		<INT>	number of samples in one batch (optional, default 16)  
	--start_lr		<FLOAT>	initial learning rate for adam (optional, default 0.001)  
	--eval_every_epoch	<INT>	evaluating and saving checkpoints every # epoch (optional, default 1)  
	--num_cores		<INT>	maximum number of CPU cores (optional, default: use all cpu cores)   

4. predict This option is used to make predictions for candidate SVs, and the SVs with scores less than the specified threshold (default 0.5) are filtered out.

usage: DeepSVFilter predict [OPTIONS]  

	--sv_type		<STR>	SV type (DEL or DUP) (required)  
	--checkpoint_dir	<DIR>	checkpoint folder (required)  
	--test_file		<FILE>	SV image path file (required)  
	--test_result_dir	<DIR>	SV filtering results (required)  
	--use_gpu		<INT>	gpu flag, 1 for GPU and 0 for CPU (optional, default 0)  
	--gpu_idx		<INT>	GPU idx (optional, default 0)  
	--gpu_mem		<FLOAT>	gpu memory usage (0 to 1) (optional, default 0.5)  
	--model			<STR>	M1(for MobileNet_v1) or M2_1.0(for MobileNet_v2_1.0) or M2_1.4(for MobileNet_v2_1.4) or NAS(for NASNet_A_Mobile) or PNAS(for PNASNet_5_Mobile) or IR_v2(for Inception_ResNet_v2) (optional, default M1)   

5. vcf2bed This option is used to convert an SV VCF file to a BED file.

usage: vcf2bed [OPTIONS]  

	--sv_type	<STR>	SV Type (DEL or DUP) (required)  
	--vcf_file	<FILE>	vcf file (required)  
	--bed_file	<FILE>	bed file (required)  
	--tool_name 	<STR>	delly,manta,lumpy or giab (required)  
	--exclude	<FILE>	exclude bed file (optional, default NULL)  
	--length	<INT>	SV length (optional, default 100)  
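
To illustrate the kind of conversion involved, here is a minimal Python sketch that turns one VCF record into a BED-style tuple. It is not the vcf2bed implementation. It assumes that the record carries SVTYPE and END in its INFO column, that --length denotes a minimum SV length (the option text does not say so explicitly), and that POS and END are written unchanged; the function name is hypothetical.

```python
def vcf_line_to_bed(line, sv_type, min_length=100):
    """Convert one VCF record to a (chrom, start, end) tuple, or None.

    Header lines, records of another SV type, records without END and
    records shorter than min_length all yield None.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    chrom, pos, info = fields[0], int(fields[1]), fields[7]
    # INFO is a ';'-separated list; flags without '=' (e.g. IMPRECISE) are skipped.
    tags = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    if tags.get("SVTYPE") != sv_type or "END" not in tags:
        return None
    end = int(tags["END"])
    if end - pos < min_length:
        return None
    # Assumption: coordinates are copied as-is; this README does not state
    # whether vcf2bed shifts them to the 0-based BED convention.
    return chrom, pos, end
```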

6. extract_typical_SV This option is used to extract typical SVs as defined in the manuscript.

usage: extract_typical_SV [OPTIONS]  

	--sv_type		<STR>	SV Type (DEL or DUP) (required)  
	--vcf_file		<FILE>	vcf file (required)  
	--bed_file		<FILE>	bed file (required)  
	--ground_truth_file 	<FILE>	ground truth vcf file (required)  
	--tool_name 		<STR>	delly,manta,lumpy or giab (required)  
	--exclude		<FILE>	exclude bed file (optional, default NULL)  
	--length		<INT>	SV length (optional, default 100)  

Input and Output

  1. The 'preprocess' command takes an SV BED file and a BAM file and outputs an SV image directory, which contains:

     1. image dir: stores all SV images

     2. SV image path file: stores the paths of all SV images

  2. The 'augmentate' command takes an SV image path file and also outputs an SV image directory after data augmentation.

  3. The 'train' command takes four SV image path files and outputs the trained model in the checkpoint directory.

  4. The 'predict' command takes the SV image path file produced by the 'preprocess' command and outputs the filtering result as follows:

    Column 1: chromosome
    Column 2: start
    Column 3: end
    Column 4: score
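
As a rough illustration of working with this four-column result, the Python sketch below keeps only the SVs whose score reaches a chosen threshold. It assumes a whitespace-delimited text file with one SV per line; the delimiter and the function name are assumptions, not something this README specifies.

```python
def filter_predictions(lines, threshold=0.5):
    """Yield (chrom, start, end, score) for SVs scoring at or above threshold."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and any comment lines
        chrom, start, end, score = line.split()[:4]
        if float(score) >= threshold:
            yield chrom, int(start), int(end), float(score)
```

Called on an open file handle, for example `list(filter_predictions(open(path), threshold=0.8))`, this gives a stricter selection than the default 0.5.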

Example

1. run preprocess.sh to get candidate SV images

DeepSVFilter preprocess \
--sv_type=DEL \
--patch_size=224 \
--bam_path=./data/example.bam \
--bed_path=./data/example.bed \
--output_imgs_dir=./result/images

2. run predict.sh to make predictions for candidate SVs

DeepSVFilter predict \
--sv_type DEL \
--test_file ./result/images/IMG_PATH.txt \
--checkpoint_dir ./checkpoint \
--test_result_dir ./result/filteredSVs
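
The two steps above can also be chained from Python. The following sketch only assembles and runs the same command lines; it assumes that preprocess writes IMG_PATH.txt into the image folder, as the paths in the example suggest, and the function names and directory layout are hypothetical.

```python
import os
import subprocess


def build_commands(bam, bed, sv_type, out_dir, checkpoint_dir, patch_size=224):
    """Return the preprocess and predict command lines as argument lists."""
    image_dir = os.path.join(out_dir, "images")
    # Assumption: preprocess writes IMG_PATH.txt into the image folder.
    image_path_file = os.path.join(image_dir, "IMG_PATH.txt")
    preprocess = [
        "DeepSVFilter", "preprocess",
        "--sv_type", sv_type,
        "--patch_size", str(patch_size),
        "--bam_path", bam,
        "--bed_path", bed,
        "--output_imgs_dir", image_dir,
    ]
    predict = [
        "DeepSVFilter", "predict",
        "--sv_type", sv_type,
        "--test_file", image_path_file,
        "--checkpoint_dir", checkpoint_dir,
        "--test_result_dir", os.path.join(out_dir, "filteredSVs"),
    ]
    return [preprocess, predict]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes it possible to inspect or log the exact command lines before anything is run.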

3. augmentate example

DeepSVFilter augmentate \
--image_path_file /path/to/IMG_PATH.txt \
--patch_size 299 \
--output_imgs_dir /path/to/output_imgs_dir

4. train example

DeepSVFilter train \
--use_gpu=0 \
--sv_type=DEL \
--model=M1 \
--epoch=13 \
--batch_size=16 \
--start_lr=0.001 \
--eval_every_epoch=1 \
--checkpoint_dir=/path/to/checkpoint \
--pos_train_file=/path/to/TP_TRAIN.txt \
--neg_train_file=/path/to/FP_TRAIN.txt \
--pos_eval_file=/path/to/TP_EVAL.txt \
--neg_eval_file=/path/to/FP_EVAL.txt \
--eval_result_dir=/path/to/eval_PNAS_results \
--summary_dir=/path/to/M1_summary/
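
The train command expects separate path files for training and evaluation. One possible way to derive them from a single image path file is sketched below in Python; it assumes one image path per line, and the 80/20 split, the fixed seed and the function names are illustrative choices rather than anything prescribed by DeepSVFilter.

```python
import random


def split_paths(paths, eval_fraction=0.2, seed=0):
    """Shuffle image paths and split them into (train, eval) lists."""
    paths = [p.strip() for p in paths if p.strip()]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(paths)
    n_eval = int(len(paths) * eval_fraction)
    return paths[n_eval:], paths[:n_eval]


def write_paths(paths, out_file):
    """Write one image path per line."""
    with open(out_file, "w") as handle:
        handle.writelines(p + "\n" for p in paths)
```

Running this once on the positive image paths and once on the negative ones yields the four files passed as --pos_train_file, --pos_eval_file, --neg_train_file and --neg_eval_file.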

Running Time and Memory Requirements

  1. Training takes less than 12 hours for a typical training set (about 10,000 positive and negative examples).

  2. Prediction takes less than 1 hour for a typical human genome (about 5,000 candidate SVs).

  3. All analyses in the manuscript were run on a server with 384 GB of memory.

Contact

[email protected]

License

MIT
