GithubHelp home page GithubHelp logo

anampc / popgen Goto Github PK

View Code? Open in Web Editor NEW

This project forked from rsmas-compbioteam/popgen

0.0 2.0 0.0 64 KB

Scripts for CompBio-athon PopGen instructional sessions: SNP analyses, Population Differentiation analyses, Outlier analyses & GWAS, eQTL analyses

Makefile 6.53% R 93.47%

popgen's Introduction

PopGen

Scripts for CompBio-athon PopGen instructional sessions: SNP analyses, Population Differentiation analyses, Outlier analyses & GWAS, eQTL analyses

In this tutorial we will learn:

  1. How to call SNPs
  2. Filtering using vcftools
  3. Plotting a PCA
  4. Identifying FST outliers

How to call SNPs

Once you have aligned your reads to a reference genome or transcriptome, there are many different programs that can be used to identify Single Nucleotide Polymorphisms (SNPs). Some of the most widely used include:

Today, we will be using one of the more simple pipelines - samtools. We start with a set of bam alignment files and a fasta file of the reference genome. We'll use a set downloaded from dropbox along with some scripts we will be using:

mkdir snp
cd snp
wget --no-check-certificate https://www.dropbox.com/sh/9l812bwwnorwfom/AAAF3VYK1VmVLMlXzeR-aWtFa?dl=1 -O snp.zip
unzip snp.zip

The script snp.sh shows how we can use samtools to identify SNPs in our dataset. Here is the script:

#!/bin/bash

#BSUB -J SNPcalling
#BSUB -P ccsfellows
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -n 2

module load samtools/1.2
module load bcftools/1.2

##Index fasta reference
samtools faidx WIFL_sub.fa

##Call SNPs
samtools mpileup -uf WIFL_sub.fa bam/*.bam | bcftools call -vmO v > raw_snps.vcf

Let's execute it:

bsub < snp.sh

This usually takes a while - up to a few days depending on how much data you have. The output will be a file in Variant Call Format (VCF). Go ahead and look at this file (here is a reference for what the fields mean: https://samtools.github.io/hts-specs/VCFv4.2.pdf). This VCF will serve as the raw data for all of our downstread SNP analyses.

Filtering using vcftools

The raw VCF file contains a lot of information. We have SNPs, but we also have INDELS. We have variants that are only seen in one individual and we have regions of the genome where we don't even have coverage for many individuals. We have variants of very low quality. The next step is to filter out the variants that we don't want. the best tool for this is vcftools (http://vcftools.sourceforge.net/). We'll do this in interactive mode on Pegasus:

bsub -Is -P ccsfellows bash

Pegasus already has vcftools, so we can just load the module. We also have to load the correct perl module so that it works.

module unload perl
module load perl/5.22.1
module load vcftools/0.1.13

Let's just execute the most simple commange and see what we get:

vcftools --vcf raw_snps.vcf

This tells you how many individuals and how many variants there are in your vcf. It does not save anything and it does not do any filtering. Let's try to do a simple filtering step:

vcftools --vcf raw_snps.vcf --max-missing 0.5

This means that we have to have genotypes in at least 50% of our individuals. How many SNPs do you have now? Notice that we still didn't output a new file. To do that:

vcftools --vcf raw_snps.vcf --recode --out WIFL_filtered

There are many different filters you can work with. Consult the VCFtools manual for the full list. Here are a few of my favorites:

--remove-indels | removes INDELS so only SNPs remain

--min-alleles | minimum numbers of allelic states observed

--max-alleles | maximum number of allelis states

--maf | minor allele frequency cutoff

--minGQ | minimum genotype quality (phred score)

--minDP | minimum read depth

In order to save the vcf, you must use the following options:

--recode

--out filtered

Filter this set of SNPs however you want, we'll use them to make a PCA and for outlier analysis. You can even save multiple filters is you want to compare the downstream analysis!

Analyzing SNP data

Now that we've filtered our SNP data, we are ready to carry out downstream statistical analyses, which we will do in RStudio. There are a couple of R scripts in this Github repository that we'll use for these analyses. An easy way for you to get these scripts onto your computer (which allows us to show you more cool ways to use Github!) is to fork and clone this repository. Creating a fork of this repository makes a copy of it that is owned by you. Once you make this forked copy, you can then clone it (i.e. download it) to your local computer and work with the files inside. Because this copy of the repository is owned by you, you can also make changes and then push them back to Github.

First, click the 'Fork' button at the top right of this page, and select your Github account. Next, open RStudio and click File > New Project. This time, select 'Version Control'. Paste in the web address to your new forked repository, call it 'PopGen', and create it within your ~/github/ directory.

Cool - now you have a new Rproject that contains the scripts we are going to use in the next part of the analysis! Git is tracking everything already so you can keep track of your analysis as it develops from here.

Principal Components Analysis

One of the most basic ways to visualize mulitvariate data is with a principal components analysis (PCA). Here we'll make one based on the filtered SNP file you made in the last step. Since we'll do this in R on our local computers, we first need to get our vcf files from Pegasus into our local repository (make sure to change username and paths):

scp [email protected]:~/snp/filtered.recode.vcf ~/github/PopGen/

Now just step through the SNPRelate.R script to make your PCA! Note: SNPRelate is a great package that can do many other things besides PCA visualization, including calculating FST and relatedness. If you have extra time, you can play around with these functions. Here is a great tutorial: https://bioconductor.org/packages/release/bioc/vignettes/SNPRelate/inst/doc/SNPRelateTutorial.html

If you still have extra time, you can play around with a spatial structure analysis using TESS3 (https://github.com/cayek/TESS3/tree/master/tess3r). Install the R package and follow the tutorial for help with the code.

Outlier Analysis

There are many software programs designed to do outlier analysis. We'll use OutFLANK because it is easy to implement and its non-parametric approach is good for organisms with complex demography. Work through the OutFLANK.R script step by step.

popgen's People

Contributors

rachaelbay avatar jrcunning avatar michaeltconnelly avatar

Watchers

James Cloos avatar Ana Palacio avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.