bbitarello / ncd-statistics

Code for NCD1 and NCD2 statistics to detect long-term balancing selection

R 100.00%
ncd-statistics allele-frequencies site-frequency-spectrum sfs balancing-selection

ncd-statistics's Introduction


Author: Bárbara D. Bitarello

Created: 27.03.2015

Last modified: 28 June 2022

Language: R

Welcome to the NCD repo!

UPDATE (June 2022): We have an R package under development that implements NCD1 and NCD2. Check it out here:

This repository provides scripts related to the article "Signatures of long-term balancing selection in human genomes":

What are the NCD (Non-Central Deviation) statistics?

Figure 1

NCD statistics measure the average difference between the allele frequencies in a given region and a reference point, which we call the 'target frequency (tf)'. So, assuming tf = 0.5, the closer the allele frequencies are to 0.5, the lower the NCD values. We propose two implementations of this statistic: NCD1 (only SNPs are required and used as informative sites) and NCD2 (SNPs and fixed differences are used as informative sites). In the manuscript, NCD2 was used to scan the human genome.
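To make the idea concrete, here is a minimal toy sketch in base R. It is not the code in scripts/NCD_func.R. It assumes the "average difference" is taken as a root-mean-square deviation of minor allele frequencies from tf, and that fixed differences enter NCD2 as sites with frequency 0. Both points, as well as the function name ncd_sketch and the toy frequencies, are assumptions made for illustration only.

```r
# Toy illustration only: NOT the repo's NCD1/NCD2 implementation.
# Assumption: NCD is the root-mean-square deviation of minor allele
# frequencies (MAF) from the target frequency tf.
ncd_sketch <- function(maf, tf = 0.5) {
  stopifnot(length(maf) > 0, all(maf >= 0 & maf <= 0.5))
  sqrt(mean((maf - tf)^2))
}

# Hypothetical windows: one with frequencies near 0.5, one with rare variants.
near_tf <- c(0.45, 0.50, 0.48, 0.42)
far_tf  <- c(0.05, 0.10, 0.02, 0.08)

ncd_near <- ncd_sketch(near_tf)  # small value: frequencies sit close to tf
ncd_far  <- ncd_sketch(far_tf)   # larger value: frequencies sit far from tf

# Assumption: in an NCD2-like calculation, each fixed difference counts as
# an informative site with frequency 0, which pushes the value upwards.
ncd_with_fd <- ncd_sketch(c(near_tf, 0, 0))
```

In this sketch, a window enriched for intermediate-frequency SNPs scores low, and adding fixed differences raises the score, which matches the qualitative behaviour described above.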


Here, we show how to:

  • Run NCD1 and NCD2
  • Use examples from the manuscript (see above), which can be extended to other species

Running NCD:

This requires:

  • SNP input data as a data.table object in R

  • Fixed differences (FD) input data (e.g. human-chimp FD bed file, as used in the manuscript) as a data.table object in R

  • An open R session

Note: NCD1 only requires the SNP input, whereas NCD2 requires both the SNP and the FD input.

Example input files are provided in example_input_files/. Refer to the README.md in that directory for further explanations.


First:

  • go to your working directory and clone this repo:
git clone https://github.com/bbitarello/NCD-statistics.git

Note: this may take a few minutes because large example files are provided!

  • go to root NCD directory
cd NCD-statistics/

Second:

  • open an R session and type
source('scripts/preamble.R')  # loads several packages
source('scripts/NCD_func.R')  # loads the NCD functions NCD1 and NCD2
SNP_input <- readRDS('example_input_files/SNP_test_input.rds')  # necessary for NCD1 and NCD2
FD_input <- readRDS('example_input_files/FD_test_input.rds')    # only necessary for NCD2
# NCD1 run: about 6 seconds
system.time(example.run.ncd1 <- foreach(x = 1:22, .combine = "rbind", .packages = c("data.table")) %dopar% NCD1(X = SNP_input[[x]], W = 3000, S = 1500))
# NCD2 run: about 9 seconds
system.time(example.run.ncd2 <- foreach(x = 1:22, .combine = "rbind", .packages = c("data.table")) %dopar% NCD2(X = SNP_input[[x]], Y = FD_input[[x]], W = 3000, S = 1500))

Note that the runtime will vary considerably depending on the available computational resources. In my experience, with registerDoMC(11) (i.e. 11 parallel workers), each scan of the entire genome takes about one minute. The examples here cover a smaller proportion of the genome and run in a few seconds. See example_input_files/README.md.
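The example calls pass W=3000 and S=1500. The sketch below assumes these are the window width and the step between window starts, both in base pairs; this document does not define them, so treat that reading as an assumption. The helper ncd_windows, the toy positions, and the toy frequencies are hypothetical. The per-window score reuses the root-mean-square form, which is also an assumption, not the code in scripts/NCD_func.R.

```r
# Toy illustration only: NOT the repo's implementation.
# Assumptions: W = window width (bp), S = step between window starts (bp),
# and the per-window score is the RMS deviation of MAF from tf.
ncd_windows <- function(pos, maf, W = 3000, S = 1500, tf = 0.5) {
  stopifnot(length(pos) == length(maf), length(pos) > 0)
  starts <- seq(from = min(pos), to = max(pos), by = S)
  out <- lapply(starts, function(b) {
    in_win <- pos >= b & pos < b + W  # sites falling in [b, b + W)
    data.frame(
      begin   = b,
      end     = b + W - 1,
      n_sites = sum(in_win),
      ncd     = if (any(in_win)) sqrt(mean((maf[in_win] - tf)^2)) else NA_real_
    )
  })
  do.call(rbind, out)
}

# Hypothetical SNP positions and minor allele frequencies.
toy_pos <- c(100, 900, 1600, 2500, 3200, 4400, 5000)
toy_maf <- c(0.45, 0.10, 0.50, 0.30, 0.05, 0.48, 0.20)

res <- ncd_windows(toy_pos, toy_maf)
print(res)
```

With S equal to half of W, consecutive windows overlap by half, so most sites contribute to two windows. Windows with very few informative sites give noisy values, so filtering on n_sites before interpreting the scores is a reasonable precaution.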

Important: This is an example. It is a roadmap of how NCD can be used for other input data (including non-SNP data). Even though the example input data is real (phase 3 1000 Genomes), it does not reproduce the findings from the paper.

Acknowledgements

Many thanks to @VitorAguiar (https://github.com/VitorAguiar) for help with optimizing the NCD code.

Updates

  • Fixed: NCD2 now runs even when there are no FDs in the input file.
  • Fixed: issue with NCD2 output.

ncd-statistics's Issues

How to calculate the P values?

Hi,

I have calculated the NCD1 values using the function NCD1(). I want to ask how to calculate the P values for such NCD1 values?

Leiting

input of NCD2

Hello,
If I use balselr's parse_vcf.R to get the input file for NCD2, how do I calculate NCD2 in NCD-statistics? I ask because I could not work out how to compute NCD2 in balselr.

Thanks

Choice of outgroup

Dear Bárbara,

Thank you for developing this method and useful software.

I want to ask some questions about the choice of the outgroup. I want to use NCD2 to detect signatures of balancing selection in the genome of a fish species. There is a sister species that diverged from my fish species about 2 MYA, and this species also has a similar trait that is hypothesized to be under balancing selection. However, I will only focus on my species, since there are not enough samples of its sister species. Do you think it is appropriate to use the sister species as an outgroup? Does the divergence time (2 MYA) matter? And a more important question: should I choose another species without the candidate trait of interest as an outgroup, or does it not matter, since even if the same gene contributes to this trait in both species, the substitutions should be different because the surrounding loci are caused by mutations?

Thanks in advance!

Best wishes,
Xiaomeng
