GithubHelp home page GithubHelp logo

antoineho / cash Goto Github PK

View Code? Open in Web Editor NEW
2.0 2.0 0.0 271 KB

Compare and shuffle heterozygosity using BED and VCF files

License: GNU General Public License v3.0

Python 100.00%
heterozygosity vcf bed contigs comparison assembly defined-regions variants

cash's Introduction

CASH

Compare And Shuffle Heterozygosity using BED and VCF files

1. General

Usage

This tool will help compare heterozygosity density and homozygosity runs along contigs inside and outside regions defined by a BED file.

$ python shuffle.py [-h] {shuffle,plot-dis,plot-comp,compare} ...

Shuffles SNPs and outputs homozygosity runs along contigs.

positional arguments:
  {shuffle,plot-dis,plot-comp,compare}
    shuffle             Shuffles SNPs and output closest distance of observed and shuffled data.
    plot-dis            Plot shuffled vs observed distributions based on shuffle command results.
    plot-comp           Compare heterozygosity of regions in and out of a bed file.
    compare             Compare heterozygosity of regions in and out of a bed file.

optional arguments:
  -h, --help            show this help message and exit

2. Module shuffle

Usage

The shuffler will shuffle SNPs along contigs based on a .BED file. The variants inside regions defined by the .BED file will be shuffled in the span of this region (i.e. a CDS). The variants outside of the regions will be shuffled along the whole length of the contig not spanned by the regions. The shuffled variants distribution can be compared with the observed distribution to check whether CDS or other regions are denser or less dense in variants.

usage: shuffle.py shuffle [-h] [-s SAMPLE] [-lc LOW_COVERAGE]
                          [-hc HIGH_COVERAGE] [-mq MIN_QUAL] [-p [PLOT]]
                          [-cs CONTIG_SIZE]
                          VCF FASTA BED OUT

positional arguments:
  VCF                   <STRING>
                               A path to a the short variants calls (vcf file).
  FASTA                 <STRING>
                               A path to the reference genome (fasta file).
  BED                   <STRING>
                               A path to the regions locations (sorted and merged bed file).
  OUT                   <STRING>
                               An output directory path for the out files.

optional arguments:
  -h, --help            show this help message and exit
  -s SAMPLE, --sample SAMPLE <STRING>
                             Sample to use for short variants positions. Default: ['ancestor']
  -lc LOW_COVERAGE, --low-coverage LOW_COVERAGE <INT>
                               Minimum coverage required to consider a good variant. Default: [None]
  -hc HIGH_COVERAGE, --high-coverage HIGH_COVERAGE <INT>
                               Maximum coverage threshold to consider a good variant. Default: [None]
  -mq MIN_QUAL, --min-qual MIN_QUAL <INT>
                               Minimum QUAL required to consider a good variant. Default: [None]
  -p [PLOT], --plot [PLOT]
                               Output distribution plots. Default: True
  -cs CONTIG_SIZE, --contig-size CONTIG_SIZE <INT>
                               Minimum size of a contig to plot. Default: [500000]bp

Example output: <sample>_DataFrame.csv

Column Name Explanation
1 index
2 POS position along the contig
3 SNP python boolean determining if site is a SNP or not
4 REGIONS python boolean determining if site was in a .BED defined region or not
5 shuffledSNP python boolean determining if site became a SNP after shuffling
6 CHROM name of the contig
,POS,SNP,REGIONS,shuffledSNP,CHROM
0,0,False,False,False,Chrom_3
1,1,False,False,False,Chrom_3
2,2,False,False,False,Chrom_3
3,3,False,False,False,Chrom_3
4,4,False,False,False,Chrom_3
5,5,False,False,False,Chrom_3
6,6,False,False,False,Chrom_3
7,7,False,False,False,Chrom_3
8,8,False,False,False,Chrom_3
...

Example graphics output: simulated data with genes under selection and large homozygous regions at extremities of contigs. The expected curve (orange) shows that heterozygosity should indeed be higher at telomeres while accounting for lower overall %heterozygosity in genes.

3. Module plot-dis

Usage

Takes .CSV output of shuffle.py shuffle and outputs graphs. See output of the shuffle module.

usage: shuffle.py plot-dis [-h] [-o OUTPUT] [-co COLOR_OBSERVED]
                           [-cs COLOR_SHUFFLED] [-wd WIDTH] [-hg HEIGHT]
                           [-s SIZE] [-b BINS]
                           CSV FASTA
                           
positional arguments:
  CSV                   <STRING> A path to a the output of the shuffle command (csv file).
  FASTA                 <STRING> A path to the reference genome (fasta file).
  
optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT <STRING>
                               Directory name to output. Default: path/to/cur_dir/['plots']
  -co COLOR_OBSERVED, --color-observed COLOR_OBSERVED <STRING>
                               A matplotlib valid color for the observed distribution. Default: ['dodgerblue']
  -cs COLOR_SHUFFLED, --color-shuffled COLOR_SHUFFLED <STRING>
                               A matplotlib valid color for the shuffled distribution. Default: ['orange']
  -wd WIDTH, --width WIDTH <INT>
                               Width of plot to output. Default: [14]
  -hg HEIGHT, --height HEIGHT <INT>
                               Height of plot to output. Default: [7]
  -s SIZE, --size SIZE  <INT>
                               Minimum size of a contig to plot. Default: [500000]bp
  -b BINS, --bins BINS  <INT>
                               Number of bins (windows) to make on each chromosome. Default: [501]bp

4. Module compare

Usage

Takes a .VCF, a .FASTA and a .BED file and compares heterozygosity distribution inside and outside defined regions. This helps verifying that some regions have lower or higher densities of variants.

usage: shuffle.py compare [-h] [-s SAMPLE] [-o OUTPUT] [-ci COLOR_IN]
                          [-co COLOR_OUT] [-wd WIDTH] [-hg HEIGHT]
                          [-cs CONTIG_SIZE] [-rs REGION_SIZE] [-MS MAX_SIZE]
                          [-ms MIN_SIZE] [-bn BIN_NUMBER] [-MH MAX_HET]
                          [-mh MIN_HET] [-bhn BIN_HET_NUMBER]
                          VCF FASTA BED

positional arguments:
  VCF                   <STRING>
                                 A path to a the short variants calls (vcf file).
  FASTA                 <STRING>
                                 A path to the reference genome (fasta file).
  BED                   <STRING>
                                 A path to the regions locations (sorted and merged bed file).

optional arguments:
  -h, --help            show this help message and exit
  -s SAMPLE, --sample SAMPLE <STRING>
                                 Sample name to read in VCF. Default: ['ancestor']
  -o OUTPUT, --output OUTPUT <STRING>
                                 Directory name to output. Default: path/to/cur_dir/['plots']
  -ci COLOR_IN, --color-in COLOR_IN <STRING>
                                 A matplotlib valid color for the inside distribution. Default: ['dodgerblue']
  -co COLOR_OUT, --color-out COLOR_OUT <STRING>
                                 A matplotlib valid color for the outside distribution. Default: ['orange']
  -wd WIDTH, --width WIDTH <INT>
                                 Width of plot to output. Default: [14]
  -hg HEIGHT, --height HEIGHT <INT>
                                 Height of plot to output. Default: [7]
  -cs CONTIG_SIZE, --contig-size CONTIG_SIZE <INT>
                                 Minimum size of a contig to plot. Default: [500000]bp
  -rs REGION_SIZE, --region-size REGION_SIZE <INT>
                                 Minimum size of a region to be considered in output plots. Default: [200]bp
  -MS MAX_SIZE, --max-size MAX_SIZE <INT>
                                 Maximum size of a region to plot the histogram (regions in range [ms, MS]). Default: [500000]
  -ms MIN_SIZE, --min-size MIN_SIZE <INT>
                                 Minimum size of a region to plot the histogram (regions in range [ms, MS]). Default: [200]
  -bn BIN_NUMBER, --bin-number BIN_NUMBER <INT>
                                 Number of bins in histogram in range [ms, MS]. Default: [100]
  -MH MAX_HET, --max-het MAX_HET <INT>
                                 Maximum heterozygosity range of a region to plot the histogram (in range [mh, MH]). Default: [10]
  -mh MIN_HET, --min-het MIN_HET <INT>
                                Minimum heterozygosity range of a region to plot the histogram (in range [mh, MH]). Default: [0]
  -bhn BIN_HET_NUMBER, --bin-het-number BIN_HET_NUMBER <INT>
                                 Number of bins in histogram in range [mh, MH]. Default: [100]

Example output: <sample>_Comparison_DataFrame.csv

Column Name Explanation
1 REGIONS index
2 POS length of the region
3 REGIONS python boolean determining if the region is inside or not
4 SNP integer number of variants found in this region
5 CHROM name of the contig
REGIONS,POS,REGIONS,SNP,CHROM
1,4391,False,38,Chrom_3
2,1012,True,18,Chrom_3
3,2802,False,31,Chrom_3
4,1081,True,43,Chrom_3
5,3863,False,149,Chrom_3
6,2861,True,56,Chrom_3
7,3356,False,109,Chrom_3
8,14533,True,0,Chrom_3
9,11302,False,62,Chrom_3

Example graphics output: histogram of heterozygosity inside and outside defined regions

5. Module plot-comp

Usage

Takes the output of shuffle.py compare and a .FASTA reference to output graphs as shown above.

usage: shuffle.py plot-comp [-h] [-s SAMPLE] [-o OUTPUT] [-ci COLOR_IN]
                            [-co COLOR_OUT] [-wd WIDTH] [-hg HEIGHT]
                            [-cs CONTIG_SIZE] [-rs REGION_SIZE] [-MS MAX_SIZE]
                            [-ms MIN_SIZE] [-bn BIN_NUMBER] [-MH MAX_HET]
                            [-mh MIN_HET] [-bhn BIN_HET_NUMBER]
                            CSV FASTA

positional arguments:
  CSV                   <STRING> A path to a the short variants calls (vcf
                        file).
  FASTA                 <STRING> A path to the reference genome (fasta file).

optional arguments:
  -h, --help            show this help message and exit
  -s SAMPLE, --sample SAMPLE <STRING>
                                   Sample name (only for output file names). Default: ['unknown']
  -o OUTPUT, --output OUTPUT <STRING>
                                   Directory name to output. Default: path/to/cur_dir/['compare_plots']
  -ci COLOR_IN, --color-in COLOR_IN <STRING>
                                   A matplotlib valid color for the inside distribution. Default: ['dodgerblue']
  -co COLOR_OUT, --color-out COLOR_OUT <STRING>
                                   A matplotlib valid color for the outside distribution. Default: ['orange']
  -wd WIDTH, --width WIDTH <INT>
                                   Width of plot to output. Default: [14]
  -hg HEIGHT, --height HEIGHT <INT>
                                   Height of plot to output. Default: [7]
  -cs CONTIG_SIZE, --contig-size CONTIG_SIZE <INT>
                                   Minimum size of a contig to plot. Default: [500000]bp
  -rs REGION_SIZE, --region-size REGION_SIZE <INT>
                                   Minimum size of a region to be considered in output plots. Default: [200]bp
  -MS MAX_SIZE, --max-size MAX_SIZE <INT>
                                   Maximum size of a region to plot the histogram (regions in range [ms, MS]). Default: [500000]
  -ms MIN_SIZE, --min-size MIN_SIZE <INT>
                                   Minimum size of a region to plot the histogram (regions in range [ms, MS]). Default: [200]
  -bn BIN_NUMBER, --bin-number BIN_NUMBER <INT>
                                   Number of bins in histogram in range [ms, MS]. Default: [100]
  -MH MAX_HET, --max-het MAX_HET <INT>
                                   Maximum heterozygosity range of a region to plot the histogram (in range [mh, MH]). Default: [10]
  -mh MIN_HET, --min-het MIN_HET <INT>
                                   Minimum heterozygosity range of a region to plot the histogram (in range [mh, MH]). Default: [0]
  -bhn BIN_HET_NUMBER, --bin-het-number BIN_HET_NUMBER <INT>
                                   Number of bins in histogram in range [mh, MH]. Default: [100]

cash's People

Contributors

antoineho avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.