jgx65 / hierfstat

the hierfstat package

r hierfstat devtools population-genetics population-genomics kinship quantitative-genetics gwas fstatistics simulations

hierfstat's Introduction

hierfstat

This is the development page of the hierfstat package for the R software.

The hierfstat package is intended for the analysis of population structure using genetic markers. It is suitable for both haploid and diploid data. In particular, it contains functions to estimate and test hierarchical F-statistics for any number of hierarchical levels.

To install the development version of hierfstat

You will need the package devtools to be able to install the devel version of hierfstat. To install devtools:

install.packages("devtools")

To install hierfstat devel:

library(devtools)
install_github("jgx65/hierfstat")
library("hierfstat")

hierfstat's Issues

betas.Rd

Hi Jerome,

I get an unused argument error for (betaij = TRUE) when trying to run the script below. My input data works for basic.stats(dat) and wc(dat). It also works for betas() when I don't include betaij = TRUE, but I would like coancestries/kinships and inbreeding coefficients in matrix form as per Weir and Goudet 2017.

ind.coan<-betas(cbind(1:120,dat[,-1]),betaij=TRUE)
Error in betas(cbind(1:120, dat[, -1]), betaij = TRUE) :
unused argument (betaij = TRUE)

According to R, the package is up-to-date. Am I missing a step or is there something else I should try?

Many thanks,
Yasmin
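
One way to check whether the installed build actually accepts `betaij` is to inspect the function's formal arguments. The helper below is a hypothetical sketch (`has_arg` and `old_betas` are illustrative names, not part of hierfstat) and uses only base R; the hierfstat calls are left as comments so the snippet runs without the package.

```r
# Hypothetical helper: does function `fun` accept an argument named `arg`?
has_arg <- function(fun, arg) {
  arg %in% names(formals(fun))
}

# Toy stand-in for a function whose signature differs between versions
old_betas <- function(dat, nboot = 0, diploid = TRUE) NULL

has_arg(old_betas, "betaij")   # FALSE: calling it with betaij = TRUE would fail
has_arg(old_betas, "diploid")  # TRUE

# With hierfstat installed, the same check would be:
# has_arg(hierfstat::betas, "betaij")
# packageVersion("hierfstat")
```

If the check returns FALSE, the installed version and the documentation being read probably describe different releases.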

problem of pop.freq in pairwise.WCfst

Hi,

I have a multiallelic and multilocus dataset and I use pairwise.WCfst to quantify pairwise population differentiation.
The function works without problem for the complete dataset, but when I try to estimate pairwise fst for each locus separately, I get the error "Error in pop.freq(dat, diploid) : error in frequency estimation. Exiting"

I did not get the same error with gtrunchier dataset but I was able to replicate it with the nancycats dataset of adegenet.

data("nancycats")
nancycats_hf <- genind2hierfstat(nancycats)
pairwise.WCfst(nancycats_hf) ##no problem with that

pairwise.WCfst(nancycats_hf[,c(1,2)])
Error in pop.freq(dat, diploid) : error in frequency estimation. Exiting

Yet, when I try pop.freq(nancycats_hf[,c(1,2)]) I get the allele frequencies per population without problem.
I thought that might be related to missing data, but the third column does not contain any missing data and pairwise.WCfst(nancycats_hf[,c(1,3)]) raises the same error with a different wording:

Error in pop.freq(cbind(rep(1, length(pop)), dat[, -1]), diploid) :
error in frequency estimation. Exiting

I could not find a way around it. I would appreciate any help!
If there is a better way to calculate pairwise fst per locus I would also be happy to hear about it.

Best,
Onur

Cannot install package...

Anyone else experiencing this issue? Was the package recently removed?

Error: Failed to install 'hierfstat' from GitHub:
(converted from warning) installation of package 'units' had non-zero exit status

Thanks for any suggestions.
-Dylan

Parallelizing genet.dist() ?

Hi there! I'm loving hierfstat and am using it quite often. While most of the functions work quite quickly, computing time for genet.dist() increases non-linearly with number of populations and SNPs. I was wondering if there was any way to parallelize the calculations to take advantage of multi-core machines, or other ways one might speed up the calculations?

Thanks for your help!
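
A minimal sketch of one way such a pairwise computation could be spread over cores, assuming the distance can be computed independently for each pair of populations. The helper `pairwise_parallel` and the `euclid` placeholder are hypothetical and are not part of hierfstat; `genet.dist()` itself is not called here.

```r
library(parallel)  # ships with base R

# Hypothetical helper: apply `pair_fun` to every pair of populations.
# `x` is a populations x variables numeric matrix with row names.
pairwise_parallel <- function(x, pair_fun, cores = 1L) {
  pops  <- rownames(x)
  pairs <- combn(length(pops), 2, simplify = FALSE)
  # mclapply forks workers on Unix-alikes; on Windows keep cores = 1
  vals <- mclapply(pairs,
                   function(p) pair_fun(x[p[1], ], x[p[2], ]),
                   mc.cores = cores)
  out <- matrix(0, length(pops), length(pops), dimnames = list(pops, pops))
  for (k in seq_along(pairs)) {
    p <- pairs[[k]]
    out[p[1], p[2]] <- out[p[2], p[1]] <- vals[[k]]
  }
  as.dist(out)
}

# Placeholder distance (Euclidean), NOT one of genet.dist()'s methods
euclid <- function(a, b) sqrt(sum((a - b)^2))

x <- rbind(p1 = c(0, 0), p2 = c(3, 4), p3 = c(0, 1))
d <- pairwise_parallel(x, euclid)
```

The number of pairs grows quadratically with the number of populations, so splitting by pair is the natural unit of work; whether this helps in practice depends on how expensive each pair is relative to the forking overhead.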

`getal.b` should determine `modulo` per column and not once for the whole dataset

The getal.b function looks at the second column of its input data.frame only to determine the encoding of the alleles. This fails when, for instance, the 2nd marker encodes alleles with modulo 100 whereas the first encodes alleles with modulo 1000.

Reproducible example:

> tmp <- data.frame(pop=factor(c(1,1,2,2)),
                                 mrk1=c(150150,150142,134134,150134),
                                 mrk2=c(8882,8882,8880,8882))
> tmp
  pop   mrk1 mrk2
1   1 150150 8882
2   1 150142 8882
3   2 134134 8880
4   2 150134 8882
> getal.b(tmp[,-1])
, , 1

     [,1] [,2]
[1,] 1501   88
[2,] 1501   88
[3,] 1341   88
[4,] 1501   88

, , 2

     [,1] [,2]
[1,]   50   82
[2,]   42   82
[3,]   34   80
[4,]   34   82

I can propose a pull request solving this bug by adding a for loop so that modulo is determined per column. Are you interested?

P.S.: I found this bug because of the error "Error in 2 * nal * p - mho : non-conformable arrays" raised by the function wc.
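
The per-column idea can be sketched as follows. This is not the code of `getal.b` and not the proposed pull request; `split_alleles_per_column` is a hypothetical helper that assumes every genotype code in a column has the same allele width and that no column is entirely missing.

```r
# Hypothetical helper: split two-allele genotype codes with one modulo per marker
split_alleles_per_column <- function(genotypes) {
  genotypes <- as.matrix(genotypes)
  # One modulo per column: 10^(half the digit count of the largest code)
  modulo <- apply(genotypes, 2, function(x) {
    n_digits <- floor(log10(max(x, na.rm = TRUE))) + 1
    10^ceiling(n_digits / 2)
  })
  first  <- sweep(genotypes, 2, modulo, "%/%")  # leading allele
  second <- sweep(genotypes, 2, modulo, "%%")   # trailing allele
  # Same layout as the output above: individuals x markers x allele copy
  array(c(first, second), dim = c(nrow(genotypes), ncol(genotypes), 2))
}

tmp <- data.frame(pop  = factor(c(1, 1, 2, 2)),
                  mrk1 = c(150150, 150142, 134134, 150134),
                  mrk2 = c(8882, 8882, 8880, 8882))
res <- split_alleles_per_column(tmp[, -1])
res[, 1, 1]  # 150 150 134 150
res[, 1, 2]  # 150 142 134 134
res[, 2, 2]  # 82 82 80 82
```

With this split, mrk1 is read with modulo 1000 and mrk2 with modulo 100, so both markers keep their intended alleles.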

make use of genlight object (adegenet)

gstat.randtest missing

I discovered today that the function gstat.randtest() is missing in the latest two releases of the package. Was there a name change or have you removed the function for some reason?

pop.freq returns vector for one locus rather than a table

When calculating chord distance for each locus, I got an error that I traced to pop.freq returning a vector rather than a table for a particular locus (Ttr11 in the example data below). It's not clear to me what is different about this locus from the others. I tried to do a bit of debugging and got as far as finding that there is something about the two matrices that the freq function in pop.freq returns for the two columns in ndat that causes apply to return a two-dimensional matrix rather than a two-element list of matrices as it does for the other loci. I couldn't work out why that is happening though...

Here is a zip file with the data:
test.pop.freq.dat.rdata.zip

Here is some code that demonstrates the problem:

load("test.pop.freq.dat.rdata")
for(i in 4:ncol(dat)) {
  print(pop.freq(dat[, c(1, i)]))
}

And the output:
$EV94

x Coastal Offshore.North Offshore.South
229 0.00000000 0.00000000 0.02777778
239 0.00000000 0.02564103 0.00000000
243 0.00000000 0.11538462 0.16666667
245 0.08823529 0.05128205 0.05555556
247 0.00000000 0.03846154 0.00000000
249 0.34558824 0.32051282 0.30555556
251 0.22058824 0.11538462 0.05555556
253 0.00000000 0.06410256 0.05555556
255 0.00000000 0.06410256 0.02777778
259 0.18382353 0.01282051 0.02777778
261 0.16176471 0.03846154 0.05555556
263 0.00000000 0.05128205 0.08333333
265 0.00000000 0.07692308 0.05555556
269 0.00000000 0.01282051 0.05555556
271 0.00000000 0.01282051 0.02777778

[1] 0.00000000 0.30882353 0.00000000 0.00000000 0.43382353 0.24264706 0.01470588 0.00000000
[9] 0.01282051 0.03846154 0.10256410 0.15384615 0.30769231 0.23076923 0.11538462 0.03846154
[17] 0.00000000 0.00000000 0.19444444 0.08333333 0.13888889 0.30555556 0.08333333 0.11111111
[25] 0.05555556 0.02777778 0.11764706 0.07352941 0.08823529 0.10294118 0.10294118 0.04411765
[33] 0.13235294 0.17647059 0.16176471 0.17500000 0.05000000 0.07500000 0.10000000 0.15000000
[41] 0.10000000 0.07500000 0.15000000 0.12500000 0.33333333 0.11111111 0.05555556 0.11111111
[49] 0.05555556 0.05555556 0.11111111 0.05555556 0.11111111
$Ttr34

x Coastal Offshore.North Offshore.South
179 0.00000000 0.00000000 0.02777778
183 0.00000000 0.13750000 0.08333333
185 0.24264706 0.23750000 0.30555556
187 0.13970588 0.11250000 0.11111111
189 0.00000000 0.21250000 0.19444444
191 0.47794118 0.06250000 0.08333333
193 0.05882353 0.10000000 0.08333333
195 0.08088235 0.11250000 0.08333333
199 0.00000000 0.01250000 0.00000000
201 0.00000000 0.01250000 0.02777778

getal seems to require ordered population labels

#Here is a simple example that illustrates the issue:

library(adegenet)
library(hierfstat)

#create genind object
df <- data.frame(locusA=c("11","11","12","12","22","12","21"),
locusB=c(NA,"21","12","11","11","21","12"),locusC=c("22","22","21","22","11","22","12"))
obj <- df2genind(df, ploidy=2, ncode=1)
pop(obj) <- c(2,2,1,1,1,1,3) #assign individuals to populations
obj #examine genind object
/// GENIND OBJECT /////////

// 7 individuals; 3 loci; 6 alleles; size: 6.8 Kb

// Basic content
@tab: 7 x 6 matrix of allele counts
@loc.n.all: number of alleles per locus (range: 2-2)
@loc.fac: locus factor for the 6 columns of @tab
@all.names: list of allele names for each locus
@ploidy: ploidy of each individual (range: 2-2)
@type: codom
@call: df2genind(X = df, ncode = 1, ploidy = 2)

// Optional content
@pop: population of each individual (group size range: 1-4)

#create hierfstat data frame
objh <- genind2hierfstat(obj)
objh #examine hierfstat data frame
pop locusA locusB locusC
1 1 11 NA 22
2 1 11 21 22
3 1 12 21 21
4 1 12 11 22
5 2 22 11 11
6 2 12 21 22
7 3 12 21 21

#example 1
getal(objh) #produces error message
Error in data.frame(pop = rep(data[, 1], 2), ind = ind, al = rbind(firstal, :
arguments imply differing number of rows: 14, 10

#example 2, with a different assignment of individuals to populations
objh[,1] <- c(1,2,2,1,1,1,3)
getal(objh) # no error message, but individuals are incorrectly assigned to populations
pop ind locusA locusB locusC
1 1 1 1 NA 2
2 2 2 1 2 2
3 2 3 1 2 2
4 1 4 1 1 2
5 1 1 2 1 1
6 1 2 1 2 2
7 3 1 1 2 2
11 1 1 1 NA 2
21 2 2 1 1 2
31 2 3 2 1 1
41 1 4 2 1 2
51 1 1 2 1 1
61 1 2 2 1 2
71 3 1 2 1 1

#example 3, ordered population labels seem required to get correct results
objh[,1] <- c(1,1,1,1,2,2,3)
getal(objh)
pop ind locusA locusB locusC
1 1 1 1 NA 2
2 1 2 1 2 2
3 1 3 1 2 2
4 1 4 1 1 2
5 2 1 2 1 1
6 2 2 1 2 2
7 3 1 1 2 2
11 1 1 1 NA 2
21 1 2 1 1 2
31 1 3 2 1 1
41 1 4 2 1 2
51 2 1 2 1 1
61 2 2 2 1 2
71 3 1 2 1 1

#looking at the code in the getal function, the issue seems to be in the following for loop (which, by the way, falls into the second circle of the R Inferno: growing objects):

for (i in sort(unique(data[, 1]))) {
dum <- 1:sum(data[, 1] == i)
if (i == 1)
ind <- dum
else ind <- c(ind, dum)
}
#if ordered population labels are assumed, then the loop above may be replaced by:
#ind <- sequence(table(data[, 1]))

varcomp.glob

Hi,
I am working on a dataset of 100,000 individuals / 1000 SNPs and am getting some errors when using the varcomp.glob() function.
I did some tests on 100, 500 and 1,000 individuals and everything ran well. However, when running a test on 10,000 individuals, I get this error (messages translated from the French R output):

Error in x[, i] <- rep(1:length(dum1), dum1) :
number of items to replace is not a multiple of replacement length
In addition: Warning messages:
1: In dum[[j]]^2/thisdum :
longer object length is not a multiple of shorter object length
2: In dum[[j]]^2/thisdum :
longer object length is not a multiple of shorter object length

Do you know what is happening?
Thanks
Vincent

boot.ppfst

Hi,
I have noticed that boot.ppfst fails and gives an error message if populations are not sorted.
The issue is easy to solve by sorting the dataframe according to the pop column.
Cheers
Juan
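
The workaround described above can be sketched in base R, assuming the population identifier is the first column of the data frame (the toy data and column names are made up for illustration):

```r
dat <- data.frame(pop  = c(2, 1, 3, 1, 2),
                  loc1 = c(11, 12, 22, 11, 12))

# Sort rows by the population column before calling boot.ppfst
dat_sorted <- dat[order(dat[, 1]), , drop = FALSE]
dat_sorted$pop  # 1 1 2 2 3

# boot.ppfst(dat_sorted, nboot = 100)  # needs hierfstat and real data
```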

explanation of behavior

Good evening,

I'm trying to understand how hierfstat calculates Weir-Cockerham pairwise FST and hope you may provide some clarification at what's going on at

hierfstat/R/nwc.R, lines 66 to 69 in f350697:

 mho<-lapply(hetpl,
 function(x) matrix(unlist(lapply(x,
 function(y) tapply(y,pop,sum,na.rm=TRUE))),ncol=np)
 )

My understanding is that hetpl creates a list of whether an allele appears in a heterozygous state for an individual at a locus
Locus1

Allele1: ind1, ind2, ind3...
Allele2: ind1, ind2, ind3...

Locus2

Allele1: ind1, ind2, ind3...
Allele2: ind1, ind2, ind3...
Allele3: ind1, ind2, ind3...

I'm not clear on what mho <- is doing. For example, when using the nancycats data provided by adegenet, allele 145 (idx = 14) in locus fca8 (idx = 1) appears in the heterozygous state 9 times:

tmp <- hetpl[[1]][[14]]

which(tmp == TRUE) #9 of them
38 44 53 70 72 87 140 159 187

I was under the impression that mho (lines 66 to 69 of hierfstat/R/nwc.R in f350697, before it gets concatenated) is a list of matrices, where within each matrix the rows correspond to unique alleles and the columns correspond to the unique populations. However, looking at the first matrix (presumably the first locus fca8), the 14th row, which should(?) correspond to allele 14 (145), does not add up to 9:

mho[[1]][14,]
1 0 0 5 0 0 0 3 2 4 1 3 0 0 0 0 0

Could you please explain what counts/values are occurring in mho?
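
A toy illustration of one thing that may be going on, using base R only. It makes no claim about what nwc.R does with mho afterwards; it only shows that `matrix(..., ncol = np)` fills column by column, so when per-allele `tapply` results are unlisted and reshaped this way, a row of the intermediate matrix is not one allele unless `byrow = TRUE` is used.

```r
pop <- factor(c("A", "A", "B", "B", "C", "C"))
np  <- nlevels(pop)

# Two alleles; TRUE where the individual carries the allele as a heterozygote
het <- list(a1 = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),   # 3 heterozygotes
            a2 = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)) # 1 heterozygote

# Per-allele counts by population, flattened: a1.A a1.B a1.C a2.A a2.B a2.C
counts <- unlist(lapply(het, function(y) tapply(y, pop, sum, na.rm = TRUE)))

m_col <- matrix(counts, ncol = np)                # column-wise fill
m_row <- matrix(counts, ncol = np, byrow = TRUE)  # one row per allele

rowSums(m_col)  # 2 2 : rows do not match alleles
rowSums(m_row)  # 3 1 : rows match the per-allele totals
```

Unlisting `m_col` returns the values in their original order, so the counts are not lost in the intermediate matrix; they are only laid out differently from what the row index suggests.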

pairwise.WCfst Error in 1:sum(data[, 1] == i) : NA/NaN argument

Hi, I transformed my vcf data using <genind2hierfstat(loci2genind(d), pop=ind.desc$Continents)>, as suggested in the tutorial.

My data has 1120 samples and 2056 unlinked variants.

However when I use <pairwise.WCfst(dat,diploid=TRUE)> I get this error message:
Error in 1:sum(data[, 1] == i) : NA/NaN argument

I can't seem to find a solution to this from the 4 existing issues posted. Any help is highly appreciated. I converted my pop identifiers to integers but that didn't solve the error. Basically all of the hierfstat programs return this error, which suggests maybe it's not recognizing my data. From my standpoint, everything looks OK with data following transformation (very similar to the example data file: gtrunchier).

Here's a sample of the transformed data I am working with:
sample_data_129x30.txt

I can also share the original vcf file if it is helpful in reproducing the error. Thanks!
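
One possible cause, offered as an assumption rather than a diagnosis of this dataset: base R produces exactly this message when the population column contains NA, because the count inside `1:sum(...)` then becomes NA. A small base R sketch:

```r
pop <- c(1, 1, 2, NA, 2)  # toy population column with one missing label

n_in_pop1 <- sum(pop == 1)  # NA, because NA == 1 is NA
res <- tryCatch(1:n_in_pop1, error = function(e) conditionMessage(e))
# `res` now holds the error text instead of an integer sequence

# Pre-check worth running on the first column before calling hierfstat
rows_without_pop <- which(is.na(pop))
rows_without_pop  # 4
```

If `which(is.na(dat[, 1]))` is non-empty, the individuals on those rows have no population assigned after the conversion.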

Genet.dist not keeping population names

First I want to say how much I love this package and many students in my lab use this for genetic studies. So thank you!

My question is that when I use genet.dist on a genind object, it renames the populations from their original names ("MN", "MI", "WI", "NC", "KY", "LA", "OK") to numbers (1, 2,3, 4, 5, 6, 7).

> usda@pop
 [1] MN MN MN MN MN MI WI WI NC NC NC NC NC NC NC NC NC NC NC NC KY KY KY KY KY KY KY KY KY KY KY KY LA LA LA LA LA LA
[39] LA LA LA LA LA LA LA LA LA LA LA LA LA LA LA LA LA LA LA LA LA LA LA OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK
[77] OK OK OK OK OK
Levels: MN MI WI NC KY LA OK

> usda_fst<-genet.dist(usda)
> usda_fst %>% round(digits=3)
      1     2     3     4     5     6
2 0.066                              
3 0.052 0.074                        
4 0.036 0.063 0.048                  
5 0.035 0.061 0.047 0.028            
6 0.038 0.066 0.051 0.030 0.029      
7 0.030 0.058 0.043 0.024 0.024 0.026

I wanted to confirm which number went with which population name. LA is the population that should be the most different. Are these ordered alphabetically perhaps? I just didn't want to make a guess but wanted to be sure.

Thank you so much for your help!
Jenna
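
Once the mapping between numbers and population names has been confirmed, the labels of a `dist` object can be replaced in base R. The sketch below uses toy values and assumes, without having verified it for genet.dist, that the numbering follows the order of the factor levels.

```r
# Toy population factor with levels in a custom (non-alphabetical) order
pop <- factor(c("MN", "MI", "WI", "MN"), levels = c("MN", "MI", "WI"))

# Toy symmetric distance matrix, labelled 1..3 only by position
m <- matrix(c(0.0, 0.1, 0.2,
              0.1, 0.0, 0.3,
              0.2, 0.3, 0.0), nrow = 3)
d <- as.dist(m)

# ASSUMPTION: position k corresponds to levels(pop)[k]
attr(d, "Labels") <- levels(pop)
as.matrix(d)["MI", "MN"]  # 0.1
```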

request to update hierfstat on CRAN

Hi Jerome,

We are currently polishing http://nescent.github.io/popgenInfo/ for publication. The package versions used on the website are currently out of date, because many of the analyses would otherwise break (for example, genind2hierfstat has been removed from adegenet, which breaks this vignette: http://nescent.github.io/popgenInfo/DifferentiationSNP.html#workflow). As the manuscript deadline is set for March 1st, we want to know if there's any way that hierfstat could be released on CRAN before January. This would let us update the vignettes so that the user experience matches what is presented in them.

Best,
Zhian

boot.ppfst

Hello,

I'm trying to gather CIs on my GBS-like SNP dataset. I'm using a genind object, and am using the following code:
data <- read.genepop("data.gen", code =2L)
data.CI<- boot.ppfst(data, nboot = 100, quant = c(0.025, 0.975), diploid = T)

But receiving this error: Error in i + 1 : non-numeric argument to binary operator

I have also run data1 <- genind2hierfstat(data) and still get the same error. My populations are all organised together.

I can calculate the global Fst on the dataset using the wc function, and can get the lsiga, lsigb, and lsigw values from performing wc(data)$sigma.loc but not perform the boot.ppfst function.

I would be hugely grateful for any assistance.

Cheers,
John

inconsistent results: fstat vs fst.pairwise for two pops

I may have a fundamental misunderstanding of what's happening but it
seems to me that the functions fstat() and pairwise.fst() should give the
same value for the case of two populations, which they do not.

A reproducible example:
library(hierfstat)
library(adegenet) # for data
data(nancycats)
obj1 <- seppop(nancycats)$P01
obj2 <- seppop(nancycats)$P02
obj3 <- repool(obj1, obj2)
fstat(obj3)
pairwise.fst(obj3)

And output:

fstat(obj3)
pop Ind
Total 0.1307741 0.2804306
pop 0.0000000 0.1721722
pairwise.fst(obj3)
1
2 0.080185

Any help would be very much appreciated.

Best,
Manuel

boot.ppfst requires pop to be numeric

boot.ppfst requires pop to be numeric, but genind does not, so when a genind object is used that has populations defined as strings/factors, the loop on line 53 fails.
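
A workaround sketch in base R, assuming the population identifier is the first column of the converted data frame: recode the labels to integer codes and keep a key to translate the results back. The toy data are made up for illustration.

```r
dat <- data.frame(pop  = factor(c("north", "south", "north", "east")),
                  loc1 = c(11, 12, 22, 12))

pop_key <- levels(dat$pop)      # "east" "north" "south"
dat$pop <- as.integer(dat$pop)  # 2 3 2 1

# pop_key[k] gives back the original name of population k
pop_key[dat$pop]  # "north" "south" "north" "east"
```

Going through `factor()` matters for character labels, since `as.integer("north")` would give NA.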

Inconsistency in genet.dist function

There are some inconsistencies with the genet.dist function. The result from executing the function with a genind object is not the same as with a dataframe. Please see an example below:

library(adegenet)
library(hierfstat)
data("nancycats")
a <- genet.dist(nancycats, method = "WC84")
b <- genet.dist(as.data.frame(cbind(locality = nancycats$pop, nancycats$tab)), diploid = T, method = "WC84")

Note that a and b are not the same. I think it has something to do with the genind2hierfstat function which was used in the genet.dist function to automatically convert genind object to a dataframe.

nancy <- genind2hierfstat(nancycats)
nancy2 <- as.data.frame(cbind(locality = nancycats$pop, nancycats$tab))

a2 <- genet.dist(nancy, method = "WC84")
b2 <- genet.dist(nancy2, method = "WC84")

I understand that the function's documentation does not mention genind objects, but there are tutorials out there that suggest a genind object can be used (e.g. https://tomjenkins.netlify.app/2020/09/21/r-popgen-getting-started/#5). I thought it would be good to point this out anyway, as the function executes with no warnings when a genind object is used.

pairwise.WCfst dim(X) must have a positive length

Hi,

I have a dataset with 300 sequences from the D-loop region (508 bp) arranged in 17 populations. I'm trying to generate bootstrap matrices in order to use the Monmonier maximum difference algorithm. To do that, I'm reading the file as fasta and then converting it to genind, to keep only the polymorphic loci.

file <- read.dna("file.fasta", format = "fasta")
file2 <- DNAbin2genind(file)

After that I set the populations using a factor:

pop <- factor(c(rep("pop1", number of individuals), ....))
file2@pop <- pop

And then I convert to hierfstat and use pairwise.WCfst:

file3 <- genind2hierfstat(file2)
fst <- pairwise.WCfst(file3, diploid = "FALSE")

I'm having issues when generating those bootstrap matrices. I'm using genind2df in order to use the sample() function with replacement:

x <- as.matrix(fst)
n.row <- nrow(x)
n.col <- ncol(x)

btest <- c(1:100)

bt.reps = NULL

boot <- genind2df(file2)
file2@pop <- NULL

for (i in btest){
boot2 <- boot[, sample(ncol(boot), replace = TRUE)]
colnames(boot2) <- c(1:number of colums) #to avoid duplicate column names

boot3 <- df2genind(boot2, sep = " ", pop = pop)
boot4 <- genind2hierfstat(boot3)

fst2 <- pairwise.WCfst(boot4, diploid = FALSE)

bt <- matrix(0, nrow = n.row, ncol = n.col)
bt[] <- fst2
bt.reps <- rbind(bt.reps, bt)
}

Sometimes when running this loop I get this error:

Error in apply(x, 2, fun2 <- function(x) sum(x > 0)) :
dim(X) must have a positive length

When I get this error, my loop stops and I get only a few pairwise fst matrices. I've tried many ways to do this, but every time it crashes when I'm using pairwise.WCfst. Any ideas?

write function for estimating weir&cockerham for 2 populations only

Error because typo in fstat.from.adegenet.R header

I got the following error when trying to install hierfstat with devtools::install_github():

Error : (converted from warning) /tmp/RtmpYvI5iE/R.INSTALL1eca1279ea8f/hierfstat/man/fstat.from.adegenet.Rd:23: unexpected section header ‘\details’

Apparently this is caused by a missing "}" to close the "\dontrun" section. I forked the repo, edited the fstat.from.adegenet.R file to include the missing "}", and the package installs OK.

create functions specific for SNPs for efficiency

Cannot calculate global Fst with confidence intervals

Hi,
This may be an issue from my side or I did not see something simple. But I was trying to calculate global Fst with some samples for days. I tried a number of different methods and still could not obtain what I want.

I could read my files and calculate pairwise Fst with confidence intervals with something like

pairwise_all_chrs<-boot.ppfst(dat = my_samples,nboot = 100,quant=c(0.025,0.975))

I could also calculate global Fst, but without confidence intervals.

I am using the same file ("my_samples" in this case) as I used for the two calculations above, which worked fine. Therefore I guess there is nothing wrong with my file. But I still cannot obtain global Fst with confidence intervals.

As an example, when I try a command like the following (also tried with quant),

f_stats<-boot.vc(my_samples[,c(1)],my_samples[,-c(1)],nboot=100)

it gives

Error in solve.default(k, meansq) :
system is computationally singular: reciprocal condition number = 0
Calls: boot.vc ... rbind -> varcomp -> apply -> FUN -> solve -> solve.default
Execution halted

I tried changing these commands according to different instructions dozens of times with no luck.

Can you please tell me how I can calculate global Fst with confidence intervals using the same data? I can provide you with my files if it helps.

tests lead to g.star=0 on all permutations

Hello -

I've run varcomp.glob() successfully on my SNP data with three levels (prov, pop, site), but I'm having trouble testing the different levels. I've tried all tests at nperm=100, but each g.star value is 0 for all permutations and the p-value = 1. I'm not sure what is going wrong.

varcomp.glob(levels=levels1, loci=loci, diploid = TRUE)
test.within(loci, test=site, within = pop, nperm=100)
test.between.within(data.frame(loci), within = prov, rand.unit = site, test.lev = pop, nperm=100)
test.between(loci, rand.unit=pop, test=prov, nperm=100)

Any advice?

allelic.richness does not work for single-locus datasets

When trying to calculate allelic richness for datasets with only one locus I get the following error:
Error in apply(data[, -1], 2, dum) : dim(X) must have a positive length

No error is given for datasets with two or more loci. I have tried both the version of hierfstat currently available on cran and the development version from github. To my understanding this is identical to closed bug #7 for basic.stats opened on Sep 18, 2015, so hopefully this should be an easy fix.
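
The error message itself can be reproduced in base R without hierfstat: removing the population column from a two-column table with `data[, -1]` drops the result to a plain vector, and `apply()` then has no dimensions to work on. A minimal sketch with toy data (the fix shown is generic R practice, not necessarily the patch applied in the package):

```r
dats <- cbind(Pop  = rep(c(1, 2), each = 3),
              Loc1 = c(11, 12, 22, 11, 12, 12))

loci_vec <- dats[, -1]                # plain vector: dim() is NULL
loci_mat <- dats[, -1, drop = FALSE]  # still a 6 x 1 matrix

# apply() needs a dim attribute, so the dropped version fails
res <- tryCatch(apply(loci_vec, 2, max), error = function(e) e)
inherits(res, "error")  # TRUE

apply(loci_mat, 2, max)  # works: Loc1 = 22
```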

Conversion genind to hierfstat df a little inconsistent

Is there any particular reason why these functions handle a genind input:

hierfstat/R/ppboot.R, lines 17 to 19 in 5d173d2:

 pp.fst<-function(dat=dat,diploid=TRUE,...){
 cl<-match.call()
 if (is.genind(dat)) dat<-genind2hierfstat(dat)

hierfstat/R/ppboot.R, lines 50 to 52 in 5d173d2:

 boot.ppfst<-function(dat=dat,nboot=100,quant=c(0.025,0.975),diploid=TRUE,...){
 cl<-match.call()
 if (is.genind(dat)) dat<-genind2hierfstat(dat)

hierfstat/R/ppboot.R, lines 112 to 114 in 5d173d2:

 boot.ppfis<-function(dat=dat,nboot=100,quant=c(0.025,0.975),diploid=TRUE,dig=4,...){
 cl<-match.call()
 if (is.genind(dat)) dat<-genind2hierfstat(dat)

But not these related ones:

hierfstat/R/pairwise.fst.R, lines 36 to 37 in 5d173d2:

 pairwise.neifst <- function(dat,diploid=TRUE){
 dat<-dat[order(dat[,1]),]

hierfstat/R/pairwise.fst.R, lines 93 to 94 in 5d173d2:

 pairwise.WCfst <- function(dat,diploid=TRUE){
 dat<-dat[order(dat[,1]),]

If not, I am happy to write a PR to make things a little more homogeneous.
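
A sketch of what a shared input step might look like, combining the two behaviours quoted above (genind conversion and sorting by population). The helper name `as_hierfstat_input` is hypothetical, and `inherits(dat, "genind")` is used here in place of adegenet's `is.genind` so the snippet runs without adegenet loaded.

```r
# Hypothetical shared helper: accept a genind object or a hierfstat-style
# data frame and return a data frame sorted by its first (population) column
as_hierfstat_input <- function(dat) {
  if (inherits(dat, "genind")) {
    if (!requireNamespace("hierfstat", quietly = TRUE)) {
      stop("hierfstat is needed to convert genind objects")
    }
    dat <- hierfstat::genind2hierfstat(dat)
  }
  dat <- as.data.frame(dat)
  dat[order(dat[, 1]), , drop = FALSE]
}

toy <- data.frame(pop = c(2, 1), loc1 = c(11, 12))
out <- as_hierfstat_input(toy)
out$pop  # 1 2
```

Each pairwise function could then start with a single call to such a helper, which keeps the genind handling in one place.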

Bug in `getal`

The following line in misc.R: if (i==1) ind<-dum else ind<-c(ind,dum) causes the following error when one of the populations is called '1' AND it doesn't appear first in sort(unique(data[,1])):

> data
/// GENIND OBJECT /////////

 // 61 individuals; 7 loci; 14 alleles; size: 16.9 Kb

 // Basic content
   @tab:  61 x 14 matrix of allele counts
   @loc.n.all: number of alleles per locus (range: 2-2)
   @loc.fac: locus factor for the 14 columns of @tab
   @all.names: list of allele names for each locus
   @ploidy: ploidy of each individual  (range: 2-2)
   @type:  codom
   @call: df2genind(X = locus, sep = "/", ind.names = sample, pop = pop, ploidy = 2)

 // Optional content
   @pop: population of each individual (group size range: 1-13)
> data$pop
 [1] F F F 1 1 5 1 2 5 4 N 3 D D D D D D D 5 D C C C D C C D D 2 1 4 1 D N F F F
[39] F F F F F F L C C K C C M M M K K C C L F 4 2
Levels: F 1 5 2 4 N 3 D C L K M

> getal(data)
Error in data.frame(pop = rep(data[, 1], 2), ind = ind, al = rbind(firstal,  : 
  arguments imply differing number of rows: 122, 96

The offending line should be fixed to just ind<-c(ind,dum). The obvious workaround for anyone facing the issue is to guarantee that none of the populations is called '1'.

Impossible to find "fstat" function

Hello,

I have only little experience with R and received a script to analyse data from microsatellites. In this script, the "fstat" function is used to provide the three F statistics Fst (pop/total), Fit (Ind/total), and Fis (ind/pop). However I get the message from R that the function is unknown.

gen
/// GENIND OBJECT /////////

 // 239 individuals; 11 loci; 81 alleles; size: 113.7 Kb

 // Basic content
   @tab:  239 x 81 matrix of allele counts
   @loc.n.all: number of alleles per locus (range: 2-22)
   @loc.fac: locus factor for the 81 columns of @tab
   @all.names: list of allele names for each locus
   @ploidy: ploidy of each individual  (range: 2-2)
   @type:  codom
   @call: read.structure(file = paste(dat.path1, "Results_form_structure.str", 
    sep = ""), n.ind = 239, n.loc = 11, onerowperind = F, 
    col.lab = 1, col.pop = 2, col.others = NULL, row.marknames = 1, 
    NA.char = "-9", pop = NULL, sep = NULL, ask = F, quiet = F)

 // Optional content
   @pop: population of each individual (group size range: 19-20)
fstat(gen)
Error in fstat(gen) : could not find function "fstat"

I could not find whether the function was changed or removed. It would be of great help if you can help me with that.

Thank you
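
A quick way to see whether a function is exported by the installed version of a package, using base R only. The helper name `is_exported` is hypothetical; the hierfstat call is left as a comment so the snippet runs without the package.

```r
# Hypothetical helper: is `fun` exported by the installed package `pkg`?
is_exported <- function(pkg, fun) {
  requireNamespace(pkg, quietly = TRUE) && fun %in% getNamespaceExports(pkg)
}

is_exported("stats", "sd")           # TRUE
is_exported("stats", "no_such_fun")  # FALSE

# With hierfstat installed:
# is_exported("hierfstat", "fstat")
# packageVersion("hierfstat")
```

If the function is exported but still not found, the package has most likely not been attached with library() in the current session.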

pairwise.WCfst(genind2hierfstat(x)) and genet.dist(x, method = "WC84") do not yield the same result

Hi,

I have just realized that these two functions do not give the same result, although both should (according to their descriptions) provide pairwise estimates of Fst following Weir and Cockerham 1984.

Best regards

basic.stats does not work with only a single locus

I'm doing some single-locus simulations and wanted to use basic.stats, but ran into the problem that this does not work with only a single locus. See following test script:

library(hierfstat)

# make a genotype array

Pop = rep(c(1,2), each=20)
Loc1 = sample(c(11,12,21,22),40, replace=TRUE)
dats = cbind(Pop, Loc1)
basic.stats(dats) #gives error: Error in apply(data[, -1], 2, dum) : dim(X) must have a positive length

# to verify, now with two loci

Loc2 = sample(c(11,12,21,22),40, replace=TRUE)
dats2 = cbind(Pop, Loc1, Loc2)
basic.stats(dats2) #no problem

Calculating FST with Dosage and Missing Data

Hi @jgx65,

I've been using hierfstat to calculate basic.stats and genetic distances, including WC Fst from VCFs for a while now. I'm using RADseq datasets. Given that the datasets often have a lot of missing data, I calculate the statistics at a variety of missing data thresholds to see how missing data affects the statistics.

Some of my datasets contain related individuals, so I would like to try the dosage based Fst calculation from your 2017 paper, especially pairwise individual or population-level Fsts. However, I run into problems with loci that have missing data. Is there any way around this?

In general my workflow looks like this:

my.vcf <- vcfR::read.vcfR("path/to/vcf.vcf")
genind <- vcfR2genind(my.vcf)
pop.names <- vector.of.populations.that.correspond.to.individuals

df <- genind2hierfstat(genind, pop = pop.names)

basic.stats(df)
genet.dist(df)

This works even with up to 70% missing data (although that is obviously not ideal, and the inferences are not very believable).

However, this does not work:

dos <- biall2dos(df, diploid = TRUE)

# Error in FUN(X[[i]], ...) : 
#  only defined on a data frame with all numeric-alike variables

Is there an easy way to convert genind/vcf or hierfstat data frames to dosage format? I'm happy to send over smaller datasets for you to examine more closely.
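
The message in that error comes from base R: it is raised when a summary such as `max()` is applied to a data frame that contains a non-numeric column. A first diagnostic, sketched with toy data, is to list which columns are not numeric. Whether the population column should be removed before calling biall2dos is not established here and would need checking against its documentation.

```r
df <- data.frame(pop  = factor(c("a", "a", "b")),
                 loc1 = c(11, 12, 22),
                 loc2 = c("11", "12", NA),
                 stringsAsFactors = FALSE)

# Columns that would trigger the "numeric-alike" error
non_numeric <- names(df)[!vapply(df, is.numeric, logical(1))]
non_numeric  # "pop" "loc2"

# Reproduce the class of error on the toy data
res <- tryCatch(max(df), error = function(e) e)
inherits(res, "error")  # TRUE
```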

write.ped() removes first locus from hierfstat object

To whom it may concern. Whenever using something of the sort:

write.ped(hierfstat_object, fname = "path/to/file", pop = pop, ilab = inds)

The first locus is removed from both the map and ped objects.

confidence intervals for average betas (pairwise FST)

Hi Jérôme,

I am using average betas (betas$W) as pairwise FST between two populations. Would it be possible to implement CI for this parameter, like boot.ppfst would do for WC84 FST?

thanks!

basic.stats :Error in sHo/2/n : non-conformable arrays

Hi, I used this function to calculate Ho, Hs and Ht. It works with the full hierfstat data and with genind data, but after I remove some populations from the hierfstat data it fails with: Error in sHo/2/n : non-conformable arrays.
The data format was not changed. The dataset initially had 16 populations; after I removed 12 of them, the function stopped working and reported the error above.

I still can't figure out what's wrong.

Could you please help me check why?

Best!

Pass the check

I currently get (as of b63d8dc):

> check()
Updating hierfstat documentation
Loading hierfstat
Writing fstat.from.adegenet.Rd
'/usr/local/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore  \
  CMD build '/home/thibaut/dev/hierfstat' --no-resave-data --no-manual 

* checking for file ‘/home/thibaut/dev/hierfstat/DESCRIPTION’ ... OK
* preparing ‘hierfstat’:
* checking DESCRIPTION meta-information ... OK
* excluding invalid files
Subdirectory 'R' contains invalid file names:
  ‘FSTAT.INI’ ‘beta.R.old’ ‘gtrunchier.dat’ ‘gtrunchier.out’ ‘test.dat’
  ‘test.out’
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* looking to see if a ‘data/datalist’ file should be added
* building ‘hierfstat_0.04-15.tar.gz’

'/usr/local/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore  \
  CMD check '/tmp/RtmpTrpAoZ/hierfstat_0.04-15.tar.gz' --timings 

* using log directory ‘/tmp/RtmpTrpAoZ/hierfstat.Rcheck’
* using R Under development (unstable) (2015-04-13 r68171)
* using platform: x86_64-unknown-linux-gnu (64-bit)
* using session charset: UTF-8
* checking for file ‘hierfstat/DESCRIPTION’ ... OK
* this is package ‘hierfstat’ version ‘0.04-15’
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘hierfstat’ can be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checking DESCRIPTION meta-information ... NOTE
Malformed Description field: should contain one or more complete sentences.
* checking top-level files ... NOTE
File
  LICENSE
is not mentioned in the DESCRIPTION file.
Non-standard file/directory found at top level:
  ‘hierfstat.Rproj’
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking use of S3 registration ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd line widths ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... WARNING
Codoc mismatches from documentation object 'wc':
wc
  Code: function(ndat, diploid = TRUE, pol = 0)
  Docs: function(ndat, diploid = TRUE, pol = 0, trim = FALSE)
  Argument names in docs not in code:
    trim

* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking contents of ‘data’ directory ... OK
* checking data for non-ASCII characters ... OK
* checking data for ASCII and uncompressed saves ... OK
* checking sizes of PDF files under ‘inst/doc’ ... OK
* checking installed files from ‘inst/doc’ ... OK
* checking examples ... ERROR
Running examples in ‘hierfstat-Ex.R’ failed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: betai
> ### Title: Estimation of beta per population
> ### Aliases: betai
> ### Keywords: univar
> 
> ### ** Examples
> 
> dat<-sim.genot(size=100,N=c(100,1000,10000),nbloc=50,nbal=10)
> betai(dat)$betai 
Error: could not find function "betai"
Execution halted
* checking PDF version of manual ... OK
* DONE

Status: 1 ERROR, 1 WARNING, 2 NOTEs
See
  ‘/tmp/RtmpTrpAoZ/hierfstat.Rcheck/00check.log’
for details.
Error: Command failed (1)
In addition: Warning message:
NAMESPACE not generated by roxygen2. Skipped. 
>

2 funny problems when I use hierfstat

Hello, I am glad that you answered my last question.
Now I have some other questions and hope to get your help.

Is there a one-to-one correspondence between the loci in the genepop file I input and the columns of the data frame that genind2hierfstat produces from the adegenet::genind object? My code is as follows:
x <- read.genepop("maf0.05.recode.p.snps.gen",ncode = 2L)
basic.stats(x)
genet.dist(x)
head(genind2hierfstat(x)[,1:10])
b <- genind2hierfstat(x)

So what I want to know is whether the loci of maf0.05.recode.p.snps.gen and b correspond one-to-one.

I want to extract the loci of b with Fst > 0.01 and convert them to a VCF file. How can I achieve this?
Alternatively, I want to use these loci to perform a PCA. Can you give me some clues?

Sorry about these clueless questions; I am a real beginner in R and bioinformatics.
I am waiting for your reply!
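
For the locus-filtering part of the question, a minimal base-R sketch follows. It assumes the hierfstat-style layout described in this thread (first column is the population, then one column per locus) and a per-locus Fst vector named by locus. The data frame, locus names, and Fst values are invented for illustration; with hierfstat the per-locus values would presumably come from basic.stats(b)$perloc$Fst, which is an assumption here and not checked against your data.

```r
# Toy hierfstat-style data frame: first column = population, one column per locus.
# Genotypes are made-up values for illustration only.
b <- data.frame(
  pop  = c(1, 1, 2, 2),
  locA = c(11, 12, 22, 22),
  locB = c(11, 11, 11, 12),
  locC = c(12, 12, 11, 22)
)

# Hypothetical per-locus Fst values, named by locus.
# With hierfstat these would come from basic.stats(b)$perloc$Fst (assumption).
fst <- c(locA = 0.20, locB = 0.004, locC = NA)

# which() drops NA, so loci with an undefined Fst are excluded.
keep <- names(fst)[which(fst > 0.01)]

# Keep the population column plus the selected loci.
b_sel <- b[, c("pop", keep), drop = FALSE]
print(colnames(b_sel))
```

Selecting by locus name rather than by position avoids depending on the column order being identical to the order in the input file.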

pairwise.neifst doesn't work if I use diploid=FALSE

I am calculating pairwise.neifst for a haploid organism as shown below:
pairwise.neifst(data, diploid=FALSE)

It returns NA for all comparisons. But if I remove the diploid argument, it does return values:
pairwise.neifst(data)

How do I fix this?
Thanks!

plot boot.ppfst

Hello,

Is there a function to plot a heatmap of the matrix obtained using boot.ppfst?
Thanks.
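
One possible approach uses only base R. This sketch assumes the bootstrap result holds a square matrix with only the upper triangle filled (for boot.ppfst this would presumably be the ll or ul element, which is an assumption). The matrix below is a toy stand-in with invented values.

```r
# Toy 3 x 3 matrix standing in for one bound of the bootstrap result
# (values are invented; only the upper triangle is filled).
labs <- paste0("pop", 1:3)
m <- matrix(NA_real_, 3, 3, dimnames = list(labs, labs))
m[upper.tri(m)] <- c(0.08, 0.07, 0.09)

# Mirror the upper triangle into the lower one so the matrix is symmetric.
sym <- m
sym[lower.tri(sym)] <- t(sym)[lower.tri(sym)]
diag(sym) <- 0

# Draw the heatmap without clustering so populations keep their order.
heatmap(sym, Rowv = NA, Colv = NA, scale = "none", symm = TRUE)
```

Setting Rowv and Colv to NA suppresses the dendrograms; drop those arguments if you want populations clustered by similarity.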

Problem while converting from genind to a hierfstat format

The following script raises an error on the last line in my configuration:

The script:

library(adegenet)
library(hierfstat)
data(nancycats)
fstat.nancycats <- genind2hierfstat(nancycats)
basic.stats(fstat.nancycats) #works well
genet.dist(fstat.nancycats) #works well
eucl.dist(fstat.nancycats) #does not work

The error:

Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,  : 
  'max' not meaningful for factors

The problem comes from the fact that in a genind object the populations (genind_object$pop) are (or at least can be) a factor, whereas the function eucl.dist expects the population column (hierfstat_object[,1]) to be numeric.
I'm not sure whether the first column of a "hierfstat object" should in general be a number, or whether it can be either a factor or a number. In the first case, we should check it in the function genind2hierfstat. In the latter case, we should change the function eucl.dist, which I guess is the most appropriate choice. @jgx65, tell me which solution is best for you and I can submit a PR.
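
Until this is settled, a user-side workaround is to convert the population column to integer codes before calling functions that treat it as a number. The sketch below reproduces the failure and the workaround on a toy data frame (invented values), without needing either package.

```r
# Toy data frame mimicking a converted genind: population stored as a factor.
dat <- data.frame(pop  = factor(c("P01", "P01", "P02", "P03")),
                  loc1 = c(11, 12, 22, 12))

# max() on a factor fails, which is the error reported above.
res <- try(max(dat[, 1]), silent = TRUE)
print(inherits(res, "try-error"))

# Workaround: replace the factor with its integer codes.
dat[, 1] <- as.integer(dat[, 1])
print(max(dat[, 1]))
```

Note that as.integer() on a factor returns the level codes (1, 2, 3, ...), not the original labels, so keep the levels somewhere if you need to map results back to population names.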

Request to update hierfstat to CRAN

Hi Jerome,

I'm digging up #8 again and requesting for hierfstat to be updated on CRAN as it has several bug fixes from the past three years that would prevent issues like #27.

Best,
Zhian

write.struct documentation

Hi,

I found two issues that I think ought to be documented in the write.struct manual.

The ilab argument is described as: "ilab: whether to add a column with individual labels".

This suggests you expect a TRUE/FALSE value. However, you actually require a vector of individual labels. This could be clarified in the documentation? Or maybe you could use row.names(dat) by default, when ilab == TRUE?

Also, as I was sorting through the previous issue, I discovered that running write.struct repeatedly appends data to an existing file, rather than over-writing it. This is surprising I think? Maybe it would be good to check for the existence of the output file before adding to it, and issue a warning? That would be in line with what I see in other packages.

If you agree with either of these I can prepare a pull request.
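
To make the second suggestion concrete, here is a rough sketch of the guard I have in mind. write_fresh and appender are hypothetical names used only for illustration; appender stands in for any writer that appends to its output file, as write.struct appears to do.

```r
# Hypothetical helper: refuse to append silently to an existing file.
write_fresh <- function(writer, fname, overwrite = FALSE) {
  if (file.exists(fname)) {
    if (!overwrite) stop("file already exists: ", fname)
    warning("overwriting existing file: ", fname)
    file.remove(fname)
  }
  writer(fname)
  invisible(fname)
}

# Stand-in writer that appends to its output file.
appender <- function(f) cat("line\n", file = f, append = TRUE)

out <- tempfile(fileext = ".str")
write_fresh(appender, out)                              # first call creates the file
second <- try(write_fresh(appender, out), silent = TRUE) # second call is refused
print(inherits(second, "try-error"))
print(length(readLines(out)))
```

Whether an existing file should trigger an error or only a warning is a design choice; the sketch errors by default and overwrites only when asked.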

I can't find the example file ("examplehier.txt")

Hello,
I have recently been learning hierfstat to estimate F-statistics for my data.
I noticed that the article "A step-by-step tutorial to use HierFstat to analyse populations hierarchically structured at multiple levels" refers to an example file, but the link provided by the author shows that the page has been migrated.
Can anyone give me a link to this file, or the file itself?
Thank you very much.

hierfstat::boot.ppfis function ‘+’ not meaningful for factors

Hey there,

I have a basic genind object, and calling either hierfstat::boot.ppfis() or hierfstat::basic.stats() on it results in this error:
Error in 1:sum(data[, 1] == i) : NA/NaN argument
In addition: Warning message:
In Ops.factor(data[dim(data)[1], 1], 1) : ‘+’ not meaningful for factors

Please find attached a subset of my genind causing issues. I have taken a look and other functions appear to work fine with this data.

Regards,
D

Test_Genind.RData.zip

pairwise.fst

Pairwise Fst estimates obtained with pairwise.fst() differ from those obtained with pp.fst():
> data(nancycats)
# convert nancycats to data.frame to be passed by pp.fst
> dfcats <- genind2df(nancycats)
# convert dfcats to numeric as required by pp.fst
> for(i in 2:10){dfcats[,i]<-as.numeric(as.character(dfcats[,i]))}
> dfcats[,1]<-as.numeric(dfcats[,1])
# estimate pairwise Fst using the functions pairwise.fst and pp.fst
> pairwise <- pairwise.fst(nancycats)
> pp <- pp.fst(dfcats)
# compare the estimates
> as.matrix(pairwise)[1:3,1:3]
           1         2          3
1 0.00000000 0.0801850 0.07140847
2 0.08018500 0.0000000 0.08200880
3 0.07140847 0.0820088 0.00000000
> pp$fst.pp[1:3,1:3]
     [,1]      [,2]       [,3]
[1,]  NaN 0.1307741 0.08459701
[2,]  NaN       NaN 0.13111707
[3,]  NaN       NaN        NaN

The two sets of estimates differ.

Make a better use of genind object

check new function genind2hierfstat

hierfstat should now be able to read genind objects, and several functions in hierfstat can take a genind object as argument (basic.stats, wc, genet.dist, ...). In addition, genind2hierfstat does the conversion, and can take haploid or diploid data, as well as alleles encoded with letters (e.g. adegenet::H3N2). I would love to have feedback to make sure that it works. @thibautjombart @thierrygosselin @smanel @zkamvar

Problems estimating observed heterozygosity

In some data imported from adegenet, basic.stats() produces an overestimated observed homozygosity. I emailed you some example data.

Problem using hierfstat with genind object

I have a data set of 4451 individuals and 473 SNPs.

I am able to convert the dataset into a genind object, but cannot use any of the hierfstat functions with this object.
Alleles are encoded as characters, (A,T,G,C). Each allele has two characters.

This is the error I get with basic.stats():
Error in 1:sum(data[, 1] == i) : NA/NaN argument

And this one with pairwise.fst():
Error in if (any(object@ploidy < 1L)) { :
missing value where TRUE/FALSE needed

I am able to run the same functions with the sample data set, "Master_Pinus_data_genotype.txt", so the problem is with my data set.

Huge thanks in advance for any tips on what is causing these errors.

/// GENIND OBJECT /////////

 // 4,386 individuals; 473 loci; 946 alleles; size: 16.3 Mb

 // Basic content
   @tab:  4386 x 946 matrix of allele counts
   @loc.n.all: number of alleles per locus (range: 2-2)
   @loc.fac: locus factor for the 946 columns of @tab
   @all.names: list of allele names for each locus
   @ploidy: ploidy of each individual  (range: 2-2)
   @type:  codom
   @call: df2genind(X = loci, sep = "", pop = pop, ploidy = 2)

 // Optional content
   @pop: population of each individual (group size range: 111-1766)

estimated Fis value outside of bootstrapped CIs

Hello,

I have been using hierfstat to calculate Fis for several populations. But for many populations, the values fis.dosage() gives me lie outside the confidence intervals I get from boot.ppfis().

My code looks like this:
test <- read.vcfR("~/40miss_all_filtOneSNP.vcf", verbose = TRUE)
test <- vcfR2genind(test)
pops <- c(rep(1, 4), rep(2, 10), rep(3, 2), rep(4,9), rep(5, 10), rep(6, 6), rep(7,4), rep(8,9), rep(9,3), rep(10,1), rep(11, 12))
test <- genind2hierfstat(test, pop = pops)
boot_fis <- boot.ppfis(test, nboot=10000)
dos <- fstat2dos(test, diploid = T)
fis <- fis.dosage(dos, pop = pops)

The output in turn looks like this:
Confidence intervals:
$fis.ci
ll hl
1 0.0713 0.0928
2 0.1122 0.1268
3 -0.5400 -0.4413
4 0.1215 0.1379
5 0.1059 0.1194
6 0.1076 0.1271
7 0.1167 0.1432
8 0.1298 0.1465
9 -0.2841 -0.1763
10 -Inf -Inf ## since there is only one ind in this population, this makes sense
11 0.0825 0.0955

values computed by fis.dosage:
1 0.07552508
2 0.10268142
3 0.06572209
4 0.10293569
5 0.09418226
6 0.10427239
7 0.16800170
8 0.11198415
9 -0.18332626
10 NaN
11 0.07371901
All 0.07156975

My dataset consists of around 23000 SNPs with quite a bit of missing data.
Even if I calculate Fis for individual populations using, for example, basic.stats(test[c(15:17),]), the resulting Fis value is still not within the CIs.
Any idea what I can change to get meaningful CIs?

Your help is much appreciated,
Cheers,
Bernhard
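
For readers comparing the two outputs: an interval built by resampling loci brackets the estimator that was resampled, so it need not contain a value computed by a different estimator. The sketch below illustrates a percentile bootstrap over loci for Fis = 1 - Ho/Hs in a single population. It assumes that this is the kind of interval being reported; whether boot.ppfis does exactly this, and how fis.dosage defines its estimate, are assumptions not verified here. All heterozygosity values are simulated.

```r
set.seed(1)

# Toy per-locus heterozygosities for one population (simulated values):
# Ho = observed heterozygosity, Hs = expected heterozygosity (gene diversity).
nloc <- 200
Hs <- runif(nloc, 0.1, 0.5)
Ho <- Hs * runif(nloc, 0.8, 1.0)   # a modest heterozygote deficit

# Point estimate: ratio of sums over loci, not a mean of per-locus ratios.
fis_hat <- 1 - sum(Ho) / sum(Hs)

# Percentile bootstrap: resample loci with replacement.
nboot <- 1000
boot <- replicate(nboot, {
  i <- sample(nloc, replace = TRUE)
  1 - sum(Ho[i]) / sum(Hs[i])
})
ci <- quantile(boot, c(0.025, 0.975))

print(round(c(estimate = fis_hat, ci), 4))
```

In this setup the interval is centred near fis_hat by construction, which is why a point estimate from another method can fall outside it without either computation being wrong.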

	mho<-lapply(hetpl,
	function(x) matrix(unlist(lapply(x,
	function(y) tapply(y,pop,sum,na.rm=TRUE))),ncol=np)
	)

	pp.fst<-function(dat=dat,diploid=TRUE,...){
	cl<-match.call()
	if (is.genind(dat)) dat<-genind2hierfstat(dat)

	boot.ppfst<-function(dat=dat,nboot=100,quant=c(0.025,0.975),diploid=TRUE,...){
	cl<-match.call()
	if (is.genind(dat)) dat<-genind2hierfstat(dat)

	boot.ppfis<-function(dat=dat,nboot=100,quant=c(0.025,0.975),diploid=TRUE,dig=4,...){
	cl<-match.call()
	if (is.genind(dat)) dat<-genind2hierfstat(dat)

	pairwise.neifst <- function(dat,diploid=TRUE){
	dat<-dat[order(dat[,1]),]

	pairwise.WCfst <- function(dat,diploid=TRUE){
	dat<-dat[order(dat[,1]),]

make a genotype array

to verify, now with two loci
