pcadapt has been developed to detect genetic markers involved in biological adaptation. pcadapt provides statistical tools for outlier detection based on Principal Component Analysis (PCA).

In the following, we show how the pcadapt package can perform genome scans for selection based on individual genotype data. We show how to run the package using the example geno3pops that contains genotype data. A total of 150 individuals coming from three different populations were genotyped at 1,500 diploid markers. Simulations were performed with simuPOP using a divergence model assuming that 150 SNPs confer a selective advantage. To run the package on the provided example, just copy and paste shaded R chunks.

To run the package, you need to install the package and load it using the following command lines:

install.packages("pcadapt")
library(pcadapt)

A. Reading genotype data

You should use the read.pcadapt function to convert your genotype file to the bed format, which is PLINK binary biallelic genotype table. read.pcadapt converts different types of files to the bed format and returns a character string containing the name of the converted file, which should be used as input for the pcadapt function. Supported formats are the following: “pcadapt”, “lfmm”, “vcf”, “bed”, “ped”, “pool”. pcadapt files should have individuals in columns, SNPs in lines, and missing values should be encoded by a single character (e.g. 9) different from 0, 1 or 2.

For example, assume your genotype file is called “foo.lfmm” and is located in the directory “path_to_directory”, use the following command lines:

path_to_file <- "path_to_directory/foo.lfmm"
filename <- read.pcadapt(path_to_file, type = "lfmm")

To run the provided example, we retrieve the file location of the example and we use read.pcadapt to convert the bed example to the pcadapt format.

path_to_file <- system.file("extdata", "geno3pops.bed", package = "pcadapt")
filename <- read.pcadapt(path_to_file, type = "bed")

When working with genotype matrices already loaded in the R session, users have to run the read.pcadapt function and to specify the typeargument, which can be pcadapt (individuals in columns, SNPs in lines) or lfmm (individuals in lines, SNPs in columns).

B. Choosing the number K of Principal Components

The pcadapt function performs two successive tasks. First, PCA is performed on the centered and scaled genotype matrix. The second stage consists in computing test statistics and p-values based on the correlations between SNPs and the first K principal components (PCs). To run the function pcadapt, the user should specify the output returned by the function read.pcadapt and the number K of principal components to compute.

To choose K, principal component analysis should first be performed with a large enough number of principal components (e.g. K=20).

x <- pcadapt(input = filename, K = 20) 

NB: by default, data are assumed to be diploid. To specify the ploidy, use the argument ploidy (ploidy=2 for diploid species and ploidy = 1 for haploid species) in the pcadapt function.

B.1. Scree plot

The ‘scree plot’ displays in decreasing order the percentage of variance explained by each PC. Up to a constant, it corresponds to the eigenvalues in decreasing order. The ideal pattern in a scree plot is a steep curve followed by a bend and a straight line. The eigenvalues that correspond to random variation lie on a straight line whereas the ones that correspond to population structure lie on a steep curve. We recommend to keep PCs that correspond to eigenvalues to the left of the straight line (Cattell’s rule). In the provided example, K = 2 is the optimal choice for K. The plot function displays a scree plot:

plot(x, option = "screeplot")

By default, the number of principal components shown in the scree plot is K, but it can be reduced via the argument K.

plot(x, option = "screeplot", K = 10)

B.2. Score plot

Another option to choose the number of PCs is based on the ‘score plot’ that displays population structure. The score plot displays the projections of the individuals onto the specified principal components. Using the score plot, the choice of K can be limited to the values of K that correspond to a relevant level of population structure.

When population labels are known, individuals of the same populations can be displayed with the same color using the pop argument, which should contain the list of indices of the populations of origin. In the geno3pops example, the first population is composed of the first 50 individuals, the second population of the next 50 individuals, and so on. Thus, a vector of indices or characters (population names) that can be provided to the argument pop should look like this:

# With integers
poplist.int <- c(rep(1, 50), rep(2, 50), rep(3, 50))
# With names
poplist.names <- c(rep("POP1", 50),rep("POP2", 50),rep("POP3", 50))
print(poplist.int)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3
print(poplist.names)
##   [1] "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1"
##  [11] "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1"
##  [21] "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1"
##  [31] "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1"
##  [41] "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1" "POP1"
##  [51] "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2"
##  [61] "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2"
##  [71] "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2"
##  [81] "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2"
##  [91] "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2" "POP2"
## [101] "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3"
## [111] "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3"
## [121] "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3"
## [131] "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3"
## [141] "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3" "POP3"

If this field is left empty, the points will be displayed in black. By default, if the values of i and j are not specified, the projection is done onto the first two principal components.

plot(x, option = "scores", pop = poplist.int)

plot(x, option = "scores", pop = poplist.names)

Looking at population structure beyond K = 2 confirms the results of the scree plot. The third and the fourth principal components do not ascertain population structure anymore.

plot(x, option = "scores", i = 3, j = 4, pop = poplist.names)