
SqCL - target capture pipeline

Scripts to work with conserved loci data from squamates.

Areas of future improvements

  • matching of contigs to targets shows some unexpected behavior
    • investigate cases in which high-scoring matches are not retained
  • use read error correction prior to assembly
    • http://www.genome.umd.edu/quorum.html
    • check with M. Grundler about this
  • use cap3 to make PRG rather than random picking
  • better trusted SNP set for BQSR
    • maybe do Cortex intersection with raw GATK call
  • check SNP calling filters?
    • https://www.broadinstitute.org/gatk/guide/article?id=3225
    • https://www.broadinstitute.org/gatk/guide/article?id=6925
  • check mitochondrial DNA first, both for misidentification and for contamination
  • use alignment-free k-mer methods (e.g., MASH or kWIP) to calculate pairwise distances and flag possible contamination or misidentification
  • add more plotting to summarize SNP quality
    • heterozygosity vs. AF
    • coverage vs. heterozygosity
  • add more plotting to identify collapsed paralogs or chimeric sequences
    • sliding window plots of p-distance across loci
    • sliding window plots of heterozygosity across loci
    • sliding window plots of depth across loci
    • from these, identify peaky distributions
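    The sliding-window scans listed above could be sketched as follows; the window and step sizes are placeholders, not values from the pipeline:

```python
def window_means(values, size=100, step=50):
    """Mean of per-site values (depth, heterozygosity, p-distance)
    in sliding windows along a locus; peaky profiles can flag
    collapsed paralogs or chimeric sequences."""
    means = []
    for start in range(0, max(len(values) - size + 1, 1), step):
        window = values[start:start + size]
        means.append((start, sum(window) / len(window)))
    return means
```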

Notes before you start

  • When running any script, you can see the script arguments using python script.py --help.
  • This pipeline is best run with the scripts in sequence, but each script can also be run on its own. Note, however, that this piecemeal use hasn't been tested.

Steps

  1. Start with demultiplexed data.

  2. Create a CSV file summarizing data

    • note, column order does not matter.
    • sample: sample name
    • read1: complete path to read 1 file
    • read2: complete path to read 2 file
    • adaptor1: adaptor 1 to scrub; put asterisk where barcode goes
    • adaptor2: adaptor 2 to scrub; put asterisk where barcode goes
    • barcode1: barcode 1 to scrub
    • barcode2: barcode 2 to scrub (if it exists)
    • lineage: the group to which the sample belongs; used for assembly and SNP calling
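    A minimal check that a samples file has the expected columns might look like this; the column names come from the list above, and the function itself is illustrative, not part of the pipeline:

```python
import csv

# barcode2 is optional ("if it exists"), so it is not required here
REQUIRED = {"sample", "read1", "read2", "adaptor1", "adaptor2",
            "barcode1", "lineage"}

def missing_columns(path):
    """Return a sorted list of required columns absent from the CSV header.
    Column order does not matter, so only set membership is checked."""
    with open(path, newline="") as f:
        header = set(next(csv.reader(f)))
    return sorted(REQUIRED - header)
```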
  3. Clean reads

    • Remove adaptors using Trimmomatic
    • Merge reads using PEAR
    • Lightly remove low quality sequence using Trimmomatic
    • Produces three files: two paired read files with read ordering retained, and one unpaired read file
    • Assumes: Trimmomatic v0.36 and PEAR 0.9.10
    python ~/squamateUCE/clean_reads.py --trimjar ~/bin/Trimmomatic-0.36/trimmomatic-0.36.jar \
    	--PEAR ~/bin/pear-0.9.10/pear-0.9.10-bin-64 --dir /scratch/drabosky_flux/sosi/uce_test/ \
    	--sample Anolis_carolinensis --file /scratch/drabosky_flux/sosi/uce_test/samples.csv
    

    Optional: this is a good place to check data quality by running quality_1_reads.py.

  4. Assemble reads

    • Trinity is meant for transcriptomes, but the authors of Phyluce have found it works well with non-transcriptome data sets
    • This script combines read 1 and unpaired reads into one file to use with Trinity
    • Then assembles
    • This is very memory intensive - recommended to give 1 GB of RAM per 1M reads
    • Can use the --normal flag to normalize large data sets
      • found this necessary in testing; otherwise the Butterfly step ran out of heap space
    • Requires Trinity v2.2.0
    python /home/sosi/squamateUCE/trinity_assembly.py --trinity ~/bin/trinityrnaseq-2.2.0/Trinity \
    	--sample Anolis_carolinensis --dir /scratch/drabosky_flux/sosi/uce_test/ \
    	--mem 55 --CPU 16
    
  5. Match assemblies to original targets

    • Uses blat and a soft reciprocal match to identify which probes match to which contigs in the assembly
    • Multiple probes belong to the same locus, and while the probes don't necessarily overlap, they are likely to match to the same contig in the assembly
    • Script uses a regex that is dataset specific to synonymize probes to targeted loci
    • This script only considers matches with an e-value better (lower) than the threshold provided by the user and no more than 10 orders of magnitude worse than the best match
    • Have a few kinds of outcomes
      • easy_recip_match: 1-to-1 unique match between contig and targeted locus
      • complicated_recip_match: 1-to-1 non-unique match, in which one targeted locus matches to multiple contigs
      • ditched_no_recip_match: a case in which the contig matches to the targeted locus, but it isn't the best match
      • ditched_no_match: a case in which the contig matches to the targeted locus, but the locus doesn't match to the contig
      • ditched_too_many_matches: a case in which one contig has multiple good matches to multiple targeted loci
        • removed because possible paralogs?
    • This requires blat v36
    python ~/squamateUCE/match_contigs_to_probes.py --blat ~/bin/blat --sample Anolis_carolinensis \
    	--dir /scratch/drabosky_flux/sosi/uce_test/ --evalue 1e-30 \
    	--db squamate_AHE_UCE_genes_loci.fasta
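    The e-value filtering in this step (a user-supplied cutoff plus a 10-orders-of-magnitude window around the best hit) could be sketched as follows; the function and tuple layout are illustrative, not the script's actual code:

```python
def filter_hits(hits, evalue=1e-30, orders=10):
    """hits: list of (contig, locus, evalue) blat matches.
    Keep hits passing the e-value threshold and within `orders`
    orders of magnitude of the best (lowest) e-value."""
    passing = [h for h in hits if h[2] <= evalue]
    if not passing:
        return []
    best = min(h[2] for h in passing)
    return [h for h in passing if h[2] <= best * 10 ** orders]
```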
    
  6. Generate pseudo-reference genomes

    • Uses lineage designation in sample file created in step 2 to create a pseudo-reference genome (PRG) for each lineage
    • The script identifies all contig matches to all targeted loci across all individuals in that lineage
      • if there are multiple contigs across individuals that match to a given targeted locus, then:
        • either picks the top contig if its match is much better than the other matches (e-value better by a factor of >1e3)
        • or if not, picks the longest one
    • This script can be run with only the 'easy' and / or 'complicated' reciprocal matches
    • Also will reverse complement contigs so that they are oriented the same as the targeted locus (important later for phylogeny making!)
    • Requires no additional programs.
    python ~/squamateUCE/make_PRG.py --lineage l1 --file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    	--dir /scratch/drabosky_flux/sosi/uce_test/ --keep easy_recip_match
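    The contig-choice rule described above (best hit if clearly better, otherwise longest) might be sketched like this; the 1e3 factor comes from the description above, everything else is illustrative:

```python
def pick_contig(hits, ratio=1e3):
    """hits: list of (sequence, evalue) contigs matching one locus.
    Return the best-scoring contig if its e-value beats the
    runner-up by more than `ratio`; otherwise return the longest."""
    if len(hits) == 1:
        return hits[0][0]
    by_eval = sorted(hits, key=lambda h: h[1])
    best, runner_up = by_eval[0], by_eval[1]
    if best[1] > 0 and runner_up[1] / best[1] > ratio:
        return best[0]
    return max(hits, key=lambda h: len(h[0]))[0]
```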
    

    Optional: this is a good place to check data quality by running quality_2_assembly.py.

  7. Align reads

    • align_reads1.py: run by individual
      • Align paired reads & unpaired reads in two separate runs using bwa
        • paired read run performs mate rescue
      • Then run fixmate to fix any broken paired-end information
      • Then sort the BAM files
      • Then mark duplicates
    • align_reads2.py: run by lineage first, then by individual
      • Generates raw set of variants using bcftools for the lineage
      • Filters raw set lightly (--dp and --qual)
      • Uses filtered set to perform base quality score recalibration (BQSR)
      • Recalibrates individual BAM files
    • These scripts assume BWA 0.7.12, samtools 1.3.1, GATK 4, Picard 2.23, and bcftools 1.7
    python ~/squamateUCE/align_reads1.py --sample Mus_musculus \
    	--file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    	--dir /scratch/drabosky_flux/sosi/uce_test/ --bwa ~/bin/bwa-0.7.12/bwa \
    	--samtools ~/bin/samtools-1.3.1/samtools \
    	--picard ~/bin/picard-tools-2.23.1/picard.jar --CPU 1 --mem 1
    	
    python ~/squamateUCE/align_reads2.py --lineage l1 --file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    	--dir /scratch/drabosky_flux/sosi/uce_test/ --samtools ~/bin/samtools-1.3.1/samtools \
    	--bcftools ~/bin/bcftools-1.7/bcftools \
    	--gatk ~/bin/GenomeAnalysisTK.jar --mem 3 --dp 10 --qual 20 --CPU 4
    

    Optional: this is a good place to check data quality by running quality_3_alignment.py.

  8. Call and filter variants

    • Call variant (and invariant!) sites based on "final" BAM files per lineage
      • Note the importance of calling invariant sites - it is the denominator in all pop gen analyses
    • Variants are filtered based on quality (--qual)
    • And then filtered on depth, on a per individual basis (--dp)
      • If an individual has too low of coverage, 'ALLELE/ALLELE' becomes missing ('./.')
      • If every individual in the lineage is missing, then the site is dropped
    • This requires bcftools
    python ~/squamateUCE/call_variants.py --lineage l1 --file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    	--dir /scratch/drabosky_flux/sosi/uce_test/ --bcftools ~/bin/bcftools-1.7/bcftools --mem 4 \
    	--CPU 4 --dp 10 --qual 20
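    The per-individual depth filter described above could be sketched as follows; genotype strings follow VCF conventions, and the function is illustrative:

```python
def mask_site(genotypes, depths, min_dp=10):
    """Set genotypes with depth < min_dp to missing ('./.');
    return None (drop the site) if every individual is missing."""
    masked = [gt if dp >= min_dp else "./."
              for gt, dp in zip(genotypes, depths)]
    if all(gt == "./." for gt in masked):
        return None
    return masked
```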
    

Phylogeny Making

  1. Make alignments.
    • Takes in all the data in the sample file to identify all unique pseudo-reference genomes
    • Creates unaligned fasta files for any locus that is sampled for four or more individuals
      • because most gene tree - species tree programs take unrooted gene trees, and an unrooted three-taxon tree carries no topological information
    • Also creates a summary file that has information about missingness
    python ~/squamateUCE/phylogeny_make_alignments.py --file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    	--dir /scratch/drabosky_flux/sosi/uce_test/
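    The four-individual cutoff could be sketched like this; the data structures are illustrative, not the script's actual code:

```python
def loci_to_align(locus_seqs, min_taxa=4):
    """locus_seqs: dict mapping locus -> {lineage: sequence}.
    Keep only loci sampled for min_taxa or more lineages, since
    unrooted gene trees with fewer than four taxa are
    topologically uninformative."""
    return {locus: seqs for locus, seqs in locus_seqs.items()
            if len(seqs) >= min_taxa}
```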
    
  2. Actually do the alignments and make the gene trees.
    • Does the alignments and makes the gene trees for all files.
    • Uses multiprocessing module in Python to do it more quickly
    • Assumes the user has mafft 7.294 and RAxML 8.2.4
    python ~/squamateUCE/phylogeny_align_genetrees.py --dir /scratch/drabosky_flux/sosi/uce_test/ \
    	--CPU 4 --mafft ~/bin/mafft --raxml ~/bin/standard-RAxML/raxmlHPC
    
  3. Make the concatenated alignment.
    • Designed to be used with RAxML, but could work with any phylogeny program.
    python ~/squamateUCE/phylogeny_make_concatenated.py --file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    	--dir /scratch/drabosky_flux/sosi/uce_test/ --miss 0.8
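    Concatenation with gap-filling for samples missing from a locus could be sketched as follows; the alignment representation is illustrative:

```python
def concatenate(alignments, samples):
    """alignments: list of dicts {sample: aligned sequence}, one per
    locus.  Samples absent from a locus are padded with gaps so all
    concatenated sequences end up the same length."""
    out = {s: [] for s in samples}
    for aln in alignments:
        length = len(next(iter(aln.values())))
        for s in samples:
            out[s].append(aln.get(s, "-" * length))
    return {s: "".join(parts) for s, parts in out.items()}
```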
    
  4. Make the files to be used for ASTRID and ASTRAL.
    • Creates a file with all the best trees for each gene.
    • And another file that has the file paths to the bootstrap trees.
    python ~/squamateUCE/phylogeny_prep_astrid_astral.py --file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    	--dir /scratch/drabosky_flux/sosi/uce_test/ --miss 0.8
    
    • Can then run with commands like these:
    java -jar astral.4.10.6.jar -i best_trees_0.9.trees -b bootstrap_files_0.9.txt -r 100 -o astral_0.9.tre
    ASTRID -b bootstrap_files_0.9.txt -i best_trees_0.9.trees -o astrid_0.9.tre
    
