1 Introduction

This package was initiated to integrate some C/Fortran/SAS programs I have written or used over the years. As such, it would rather be a long-term project, but an immediate benefit is something complementary to other packages currently available from CRAN, e.g. genetics, hwde, etc. I hope eventually this will be part of a bigger effort to fulfil most of the requirements foreseen by many1, within the portable environment of R for data management, analysis, graphics and object-oriented programming. My view has been outlined more formally2,3 in relation to other package systems and also on package kinship4,5. I feel the enormous advantage by shifting to R and would like to my work with others as it is available, which I will not claim this work as exclusively done by myself, but would like to invite others to join me and enlarge the collections and improve them.

With recent work on genomewide association studies (GWASs) especially protein GWASs, I have added many functions such as METAL_forestplot which handles data from software METAL6 and qtlFinder which extracts sentinels from GWAS summary statistics in a way that is very appealing to us compared to some other established software. Meanwhile, the size of the package surpasses the limit as imposed by CRAN, thus the old good feature of S as with R that value both code and data alike has to suffer slightly in that gap.datasets and gap.examples are spun off as two separate packages; they do deserve a glimpse however for some general ideas. A separate initiative has been made in the pQTLtools package.

Notable recent technical updates include:

  1. documentaion from R code via roxygen2 and devtools to allow for easier generation of the .Rd files and NAMESPACE.

  2. vignettes with noweb/Sweave as well as rmarkdown and bookdown that allows for numbered sectioning, multiple figures generated from a code chunk, and Citation Style Language (CSL) (https://citationstyles.org/). Rdpack is now employed for BibTeX citations in the .Rd files.

  3. an experimental shiny App (runshinygap()).

These experiences can be useful for others in their package development. I found it useful to use specific functions without loading the whole package, i.e., library(gap), e.g.,invnormal <- gap::invnormal, log10p <- gap::log10p.

2 Implementation

The currently available functions are shown below,

Name Description
ANALYSIS  
AE3 AE model using nuclear family trios
bt Bradley-Terry model for contingency table
ccsize Power and sample size for case-cohort design
cs Credibel set
fbsize Sample size for family-based linkage and association design
gc.em Gene counting for haplotype analysis
gcontrol genomic control
gcontrol2 genomic control based on p values
gcp Permutation tests using GENECOUNTING
gc.lambda Estionmation of the genomic control inflation statistic (lambda)
genecounting Gene counting for haplotype analysis
gif Kinship coefficient and genetic index of familiality
MCMCgrm Mixed modeling with genetic relationship matrices
hap Haplotype reconstruction
hap.em Gene counting for haplotype analysis
hap.score Score statistics for association of traits with haplotypes
htr Haplotype trend regression
hwe Hardy-Weinberg equilibrium test for a multiallelic marker
hwe.cc A likelihood ratio test of population Hardy-Weinberg equilibrium
hwe.hardy Hardy-Weinberg equilibrium test using MCMC
invnormal Inverse normal transformation
kin.morgan kinship matrix for simple pedigree
LD22 LD statistics for two diallelic markers
LDkl LD statistics for two multiallelic markers
lambda1000 A standardized estimate of the genomic inflation scaling to
  a study of 1,000 cases and 1,000 controls
log10p log10(p) for a standard normal deviate
log10pvalue log10(p) for a P value including its scientific format
logp log(p) for a normal deviate
masize Sample size calculation for mediation analysis
mia multiple imputation analysis for hap
mr Mendelian randomization analysis
mtdt Transmission/disequilibrium test of a multiallelic marker
mtdt2 Transmission/disequilibrium test of a multiallelic marker
  by Bradley-Terry model
mvmeta Multivariate meta-analysis based on generalized least squares
pbsize Power for population-based association design
pbsize2 Power for case-control association design
pfc Probability of familial clustering of disease
pfc.sim Probability of familial clustering of disease
pgc Preparing weight for GENECOUNTING
print.hap.score Print a hap.score object
s2k Statistics for 2 by K table
sentinels Sentinel identification from GWAS summary statistics
tscc Power calculation for two-stage case-control design
   
GRAPHICS  
asplot Regional association plot
ESplot Effect-size plot
circos.cnvplot circos plot of CNVs
circos.cis.vs.trans.plot circos plot of cis/trans classification
circos.mhtplot circos Manhattan plot with gene annotation
circos.mhtplot2 Another circos Manhattan plot
cnvplot genomewide plot of CNVs
labelManhattan Annotate Manhattan or Miami Plot
METAL_forestplot forest plot as R/meta’s forest for METAL outputs
makeRLEplot make relative log expression plot
mhtplot Manhattan plot
mhtplot2 Manhattan plot with annotations
mhtplot.trunc truncated Manhattan plot
miamiplot Miami plot
miamiplot2 Miami plot
mr_forestplot Mendelian Randomization forest plot
pedtodot Converting pedigree(s) to dot file(s)
pedtodot_verbatim Pedigree-drawing with graphviz
plot.hap.score Plot haplotype frequencies versus haplotype score statistics
qqfun Quantile-comparison plots
qqunif Q-Q plot for uniformly distributed random variable
qtl2dplot 2D QTL plot
qtl2dplotly 2D QTL plotly
qtl3dplotly 3D QTL plotly
   
UTILITIES  
SNP Functions for single nucleotide polymorphisms (SNPs)
BFDP Bayesian false-discovery probability
FPRP False-positive report probability
ab Test/Power calculation for mediating effect
b2r Obtain correlation coefficients and their variance-covariances
chow.test Chow’s test for heterogeneity in two regressions
chr_pos_a1_a2 Form SNPID from chromosome, posistion and alleles
cis.vs.trans.classification a cis/trans classifier
ci2ms Effect size and standard error from confidence interval
comp.score score statistics for testing genetic linkage of quantitative trait
GRM functions ReadGRM, ReadGRMBin, ReadGRMPLINK,
  ReadGRMPCA, WriteGRM,
  WriteGRMBin, WriteGRMSAS
  handle genomic relationship matrix involving other software
get_b_se Get b and se from AF, n, and z
get_pve_se Get pve and its standard error from n, z
get_sdy Get sd(y) from AF, n, b, se
h2G A utility function for heritability
h2GE A utility function for heritability involving gene-environment interaction
h2l A utility function for converting observed heritability to its counterpart
  under liability threshold model
h2_mzdz Heritability estimation according to twin correlations
klem Haplotype frequency estimation based on a genotype table
  of two multiallelic markers
makeped A function to prepare pedigrees in post-MAKEPED format
metap Meta-analysis of p values
metareg Fixed and random effects model for meta-analysis
muvar Means and variances under 1- and 2- locus (diallelic) QTL model
pvalue P value for a normal deviate
qtlClassifier A QTL cis/trans classifier
qtlFinder Distance-based signal identification
read.ms.output A utility function to read ms output
revStrand Allele on the reverse strand
runshinygap Start shinygap
snptest_sample A utility to generate SNPTEST sample file
whscore Whittemore-Halpern scores for allele-sharing
weighted.median Weighted median with interpolation

After installation, you will be able to obtain the list by typing library(help=gap) in alphabetical order, or ?gap::gap ordered by category, or view it within a web browser via help.start(). A full list of functions is provided in the Appendix. This file can be viewed with command vignette("gap", package="gap"). You can cut and paste examples at end of each function’s documentation.

Both genecounting and hap are able to handle SNPs and multiallelic markers, with the former be flexible enough to include features such as X-linked data and the later being able to handle large number of SNPs. But they are unable to recode allele labels automatically, so functions gc.em and hap.em are in haplo.em format and used by a modified function hap.score in association testing.

It is notable that multilocus data are handled differently from that in hwde and elegant definitions of basic genetic data can be found in the genetics package. Incidentally, I found my C mixed-radixed sorting routine7 is much faster than R’s internal function.

With exceptions such as function pfc which is very computer-intensive, most functions in the package can easily be adapted for analysis of large datasets involving either SNPs or multiallelic markers. Some are utility functions, e.g. muvar and whscore, which will be part of the other analysis routines in the future.

The benefit with R compared to standalone programs is that for users, all functions have unified format. For developers, it is able to incorporate their C/C++ programs more easily and avoid repetitive work such as preparing own routines for matrix algebra and linear models. Further advantage can be taken from packages in Bioconductor, which are designed and written to deal with large number of genes.

3 Independent programs

To facilitate comparisons and individual preferences, the source codes for EHPLUS8, 2LD, GENECOUNTING, HAP9, now hosted at GitHub, have enjoyed great popularity ahead of GWASs therefore are likely to be more familiar than their R counterparts in gap but you need to follow their instructions to compile for a particular computer system.

I have kept the original pedtodot.sh by David Duffy which enables contrast with pedtodot_verbatim() and pedtodot() reported as application notes. I have also included ms code10 to couple with read.ms.output.

A final note is concerned about twinan90, which is now dropped from the package function list due to difficulty to keep up with the requirements by the R environment nevertheless you will still be able to compile and use otherwise from gap.examples.

4 Demos

This has been a template for adding self-contained examples:

library(gap)
demo(gap)

See examples of haplotype analysis there.

5 Pedigrees and kinship

5.1 Pedigree drawing

I have included the original file for the R News as well as put examples in separate vignettes. They can be accessed via vignette("rnews",package="gap.examples") and vignette("pedtodot", package="gap.examples"), respectively.

We also demonstrate through pedigree 10081 example5 with pedtodot_verbatim.

library(gap)
#> Loading required package: gap.datasets
#> gap version 1.6
knitr::kable(p3,caption="An example pedigree")
Table 5.1: An example pedigree
pid id fid mid sex aff GABRB1 D4S1645
10081 1 2 3 2 2 7/7 7/10
10081 2 0 0 1 1 -/- -/-
10081 3 0 0 2 2 7/9 3/10
10081 4 2 3 2 2 7/9 3/7
10081 5 2 3 2 1 7/7 7/10
10081 6 2 3 1 1 7/7 7/10
10081 7 2 3 2 1 7/7 7/10
10081 8 0 0 1 1 -/- -/-
10081 9 8 4 1 1 7/9 3/10
10081 10 0 0 2 1 -/- -/-
10081 11 2 10 2 1 7/7 7/7
10081 12 2 10 2 2 6/7 7/7
10081 13 0 0 1 1 -/- -/-
10081 14 13 11 1 1 7/8 7/8
10081 15 0 0 1 1 -/- -/-
10081 16 15 12 2 1 6/6 7/7
library(DOT)
# one can see the diagram in RStudio
pedtodot_verbatim(p3,run=TRUE,toDOT=TRUE,return="verbatim")
#> Pedigree  10081 
#> running DOT::dot on 10081
library(DiagrammeR)
pedtodot_verbatim(p3)
#> Pedigree  10081
grViz(readr::read_file("10081.dot"))

Figure 5.1: An example pedigree

5.2 Kinship calculation

Next, I will provide an example for kinship calculation using kin.morgan. It is recommended that individuals in a pedigree are ordered so that parents always precede their children. In this regard, package pedigree can be used, and package kinship2 can be used to produce pedigree diagram as with kinship matrix.

The pedigree diagram is as follows,

# pedigree diagram
data(lukas, package="gap.datasets")
library(kinship2)
#> Loading required package: Matrix
#> Loading required package: quadprog
ped <- with(lukas,pedigree(id,father,mother,sex))
plot(ped,cex=0.4)
A pedigree diagram

Figure 5.2: A pedigree diagram

We then turn to the kinship calculation.

# unordered individuals
gk1 <- kin.morgan(lukas)
write.table(gk1$kin.matrix,"gap_1.txt",quote=FALSE)

library(kinship2)
kk1 <- kinship(lukas[,1],lukas[,2],lukas[,3])
write.table(kk1,"kinship_1.txt",quote=FALSE)

d <- gk1$kin.matrix-kk1
sum(abs(d))
#> [1] 2.443634

# order individuals so that parents precede their children
library(pedigree)
op <- orderPed(lukas)
olukas <- lukas[order(op),]
gk2 <- kin.morgan(olukas)

write.table(olukas,"olukas.csv",quote=FALSE)
write.table(gk2$kin.matrix,"gap_2.txt",quote=FALSE)

kk2 <- kinship(olukas[,1],olukas[,2],olukas[,3])
write.table(kk2,"kinship_2.txt",quote=FALSE)

z <- gk2$kin.matrix-kk2
sum(abs(z))
#> [1] 0

We see that in the second case, the result agrees with kinship2.

6 Study designs

I would like to highlight fbsize, pbsize and ccsize functions used for power/sample calculations in a genome-wide asssociatoin study as reported11,12,13.

It now has an experimental work via Shiny from inst/shinygap.

6.1 Family-based design

The example is as follows,

options(width=150)
library(gap)
models <- matrix(c(
         4.0, 0.01,
         4.0, 0.10,
         4.0, 0.50, 
         4.0, 0.80,
         2.0, 0.01,
         2.0, 0.10,
         2.0, 0.50,
         2.0, 0.80,
         1.5, 0.01,    
         1.5, 0.10,
         1.5, 0.50,
         1.5, 0.80), ncol=2, byrow=TRUE)
outfile <- "fbsize.txt"
cat("gamma","p","Y","N_asp","P_A","H1","N_tdt","H2","N_asp/tdt",
    "L_o","L_s\n",file=outfile,sep="\t")
for(i in 1:12) {
    g <- models[i,1]
    p <- models[i,2]
    z <- fbsize(g,p)
    cat(z$gamma,z$p,z$y,z$n1,z$pA,z$h1,z$n2,z$h2,z$n3,
        z$lambdao,z$lambdas,file=outfile,append=TRUE,sep="\t")
    cat("\n",file=outfile,append=TRUE)
}
table1 <- read.table(outfile,header=TRUE,sep="\t")
nc <- c(4,7,9)
table1[,nc] <- ceiling(table1[,nc])
dc <- c(3,5,6,8,10,11)
table1[,dc] <- round(table1[,dc],2)
unlink(outfile)
knitr::kable(table1,caption="Power/Sample size of family-based designs")
Table 6.1: Power/Sample size of family-based designs
gamma p Y N_asp P_A H1 N_tdt H2 N_asp.tdt L_o L_s
4.0 0.01 0.52 6402 0.80 0.05 1201 0.11 257 1.08 1.09
4.0 0.10 0.60 277 0.80 0.35 165 0.54 53 1.48 1.54
4.0 0.50 0.58 446 0.80 0.50 113 0.42 67 1.36 1.39
4.0 0.80 0.53 3024 0.80 0.24 244 0.16 177 1.12 1.13
2.0 0.01 0.50 445964 0.67 0.03 6371 0.04 2155 1.01 1.01
2.0 0.10 0.52 8087 0.67 0.25 761 0.32 290 1.07 1.08
2.0 0.50 0.53 3753 0.67 0.50 373 0.47 197 1.11 1.11
2.0 0.80 0.51 17909 0.67 0.27 701 0.22 431 1.05 1.05
1.5 0.01 0.50 6944779 0.60 0.02 21138 0.03 8508 1.00 1.00
1.5 0.10 0.51 101926 0.60 0.21 2427 0.25 1030 1.02 1.02
1.5 0.50 0.51 27048 0.60 0.50 1039 0.49 530 1.04 1.04
1.5 0.80 0.51 101926 0.60 0.29 1820 0.25 1030 1.02 1.02

As for APOE4 and Alzheimer’s14

g <- 4.5
p <- 0.15
alz <- data.frame(fbsize(g,p))
knitr::kable(alz,caption="Power/Sample size of study on Alzheimer's disease")
Table 6.2: Power/Sample size of study on Alzheimer’s disease
gamma p y n1 pA h1 n2 h2 n3 lambdao lambdas
4.5 0.15 0.6256916 162.6246 0.8181818 0.4598361 108.994 0.6207625 39.97688 1.671594 1.784353

6.2 Population-based design

The example is as follows,

library(gap)
kp <- c(0.01,0.05,0.10,0.2)
models <- matrix(c(
          4.0, 0.01,
          4.0, 0.10,
          4.0, 0.50, 
          4.0, 0.80,
          2.0, 0.01,
          2.0, 0.10,
          2.0, 0.50,
          2.0, 0.80,
          1.5, 0.01,    
          1.5, 0.10,
          1.5, 0.50,
          1.5, 0.80), ncol=2, byrow=TRUE)
outfile <- "pbsize.txt"
cat("gamma","p","p1","p5","p10","p20\n",sep="\t",file=outfile)
for(i in 1:dim(models)[1])
{
   g <- models[i,1]
   p <- models[i,2]
   n <- vector()
   for(k in kp) n <- c(n,ceiling(pbsize(k,g,p)))
   cat(models[i,1:2],n,sep="\t",file=outfile,append=TRUE)
   cat("\n",file=outfile,append=TRUE)
} 
table5 <- read.table(outfile,header=TRUE,sep="\t")
knitr::kable(table5,caption="Sample size of population-based design")
Table 6.3: Sample size of population-based design
gamma p p1 p5 p10 p20
4.0 0.01 46681 8959 4244 1887
4.0 0.10 8180 1570 744 331
4.0 0.50 10891 2091 991 441
4.0 0.80 31473 6041 2862 1272
2.0 0.01 403970 77530 36725 16323
2.0 0.10 52709 10116 4792 2130
2.0 0.50 35285 6772 3208 1426
2.0 0.80 79391 15237 7218 3208
1.5 0.01 1599920 307056 145448 64644
1.5 0.10 192105 36869 17465 7762
1.5 0.50 98013 18811 8911 3961
1.5 0.80 192105 36869 17465 7762

6.3 Case-cohort design

We obtain results for ARIC and EPIC studies.

library(gap)
# ARIC study
outfile <- "aric.txt"
n <- 15792
pD <- 0.03
p1 <- 0.25
alpha <- 0.05
theta <- c(1.35,1.40,1.45)
beta <- 0.2
s_nb <- c(1463,722,468)
cat("n","pD","p1","hr","q","power","ssize\n",file=outfile,sep="\t")
for(i in 1:3)
{
  q <- s_nb[i]/n
  power <- ccsize(n,q,pD,p1,log(theta[i]),alpha,beta,TRUE)
  ssize <- ccsize(n,q,pD,p1,log(theta[i]),alpha,beta,FALSE)
  cat(n,"\t",pD,"\t",p1,"\t",theta[i],"\t",q,"\t",
      signif(power,3),"\t",ssize,"\n",
      file=outfile,append=TRUE)
}
read.table(outfile,header=TRUE,sep="\t")
#>       n   pD   p1   hr          q power ssize
#> 1 15792 0.03 0.25 1.35 0.09264184   0.8  1463
#> 2 15792 0.03 0.25 1.40 0.04571935   0.8   722
#> 3 15792 0.03 0.25 1.45 0.02963526   0.8   468
unlink(outfile)
# EPIC study
outfile <- "epic.txt"
n <- 25000
alpha <- 0.00000005
beta <- 0.2
s_pD <- c(0.3,0.2,0.1,0.05)
s_p1 <- seq(0.1,0.5,by=0.1)
s_hr <- seq(1.1,1.4,by=0.1)
cat("n","pD","p1","hr","alpha","ssize\n",file=outfile,sep="\t")
# direct calculation
for(pD in s_pD)
{
   for(p1 in s_p1)
   {
      for(hr in s_hr)
      {
         ssize <- ccsize(n,q,pD,p1,log(hr),alpha,beta,FALSE)
         if (ssize>0) cat(n,"\t",pD,"\t",p1,"\t",hr,"\t",alpha,"\t",
                          ssize,"\n",
                          file=outfile,append=TRUE)
      }
   }
}
knitr::kable(read.table(outfile,header=TRUE,sep="\t"),caption="Sample size of case-cohort designs")
Table 6.4: Sample size of case-cohort designs
n pD p1 hr alpha ssize
25000 0.3 0.1 1.3 0 14391
25000 0.3 0.1 1.4 0 5732
25000 0.3 0.2 1.2 0 21529
25000 0.3 0.2 1.3 0 5099
25000 0.3 0.2 1.4 0 2613
25000 0.3 0.3 1.2 0 11095
25000 0.3 0.3 1.3 0 3490
25000 0.3 0.3 1.4 0 1882
25000 0.3 0.4 1.2 0 8596
25000 0.3 0.4 1.3 0 2934
25000 0.3 0.4 1.4 0 1611
25000 0.3 0.5 1.2 0 7995
25000 0.3 0.5 1.3 0 2786
25000 0.3 0.5 1.4 0 1538
25000 0.2 0.1 1.4 0 9277
25000 0.2 0.2 1.3 0 7725
25000 0.2 0.2 1.4 0 3164
25000 0.2 0.3 1.3 0 4548
25000 0.2 0.3 1.4 0 2152
25000 0.2 0.4 1.2 0 20131
25000 0.2 0.4 1.3 0 3648
25000 0.2 0.4 1.4 0 1805
25000 0.2 0.5 1.2 0 17120
25000 0.2 0.5 1.3 0 3422
25000 0.2 0.5 1.4 0 1713
25000 0.1 0.2 1.4 0 8615
25000 0.1 0.3 1.4 0 3776
25000 0.1 0.4 1.3 0 13479
25000 0.1 0.4 1.4 0 2824
25000 0.1 0.5 1.3 0 10837
25000 0.1 0.5 1.4 0 2606
unlink(outfile)

7 Graphics

Some figures from the documentation may be of interest.

7.1 Genome-wide association

The following code is used to obtain a Q-Q plot via qqunif function,

library(gap)
u_obs <- runif(1000)
r <- qqunif(u_obs,pch=21,bg="blue",bty="n")
u_exp <- r$y
hits <- u_exp >= 2.30103
points(r$x[hits],u_exp[hits],pch=21,bg="green")
legend("topleft",sprintf("GC.lambda=%.4f",gc.lambda(u_obs)))
A Q-Q plot

Figure 7.1: A Q-Q plot

Based on a chicken genome scan data, the code below generates a Manhattan plot, demonstrating the use of gaps to separate chromosomes.

ord <- with(w4,order(chr,pos))
w4 <- w4[ord,]
oldpar <- par()
par(cex=0.6)
colors <- c(rep(c("blue","red"),15),"red")
mhtplot(w4,control=mht.control(colors=colors,gap=1000),pch=19,srt=0)
axis(2,cex.axis=2)
suggestiveline <- -log10(3.60036E-05)
genomewideline <- -log10(1.8E-06)
abline(h=suggestiveline, col="blue")
abline(h=genomewideline, col="green")
abline(h=0)
A genome-wide association study on chickens

Figure 7.2: A genome-wide association study on chickens

The code below obtains a Manhattan plot with gene annotation15,

data <- with(mhtdata,cbind(chr,pos,p))
glist <- c("IRS1","SPRY2","FTO","GRIK3","SNED1","HTR1A","MARCH3","WISP3",
           "PPP1R3B","RP1L1","FDFT1","SLC39A14","GFRA1","MC4R")
hdata <- subset(mhtdata,gene%in%glist)[c("chr","pos","p","gene")]
color <- rep(c("lightgray","gray"),11)
glen <- length(glist)
hcolor <- rep("red",glen)  
par(las=2, xpd=TRUE, cex.axis=1.8, cex=0.4)
ops <- mht.control(colors=color,yline=1.5,xline=3)
hops <- hmht.control(data=hdata,colors=hcolor)
mhtplot(data,ops,hops,pch=19)
axis(2,pos=2,at=1:16,cex.axis=0.5)
A Manhattan plot with gene annotation

Figure 7.3: A Manhattan plot with gene annotation

All these look familiar, so revised form of the function called mhtplot2 was created for additional features such as centering the chromosome ticks, allowing for more sophisticated coloring schemes, using prespecified fonts, etc. Please refer to the function’s documentation example.

We could also go further with a circos Manhattan plot,

circos.mhtplot(mhtdata, glist)
#> Note: 11 points are out of plotting region in sector 'chr16', track '3'.
A circos Manhattan plot

Figure 7.4: A circos Manhattan plot

and a version with y-axis,

require(gap.datasets)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
testdat <- mhtdata[c("chr","pos","p","gene","start","end")] %>%
           rename(log10p=p) %>%
           mutate(chr=paste0("chr",chr),log10p=-log10(log10p))
dat <- mutate(testdat,start=pos,end=pos) %>%
       select(chr,start,end,log10p)
labs <- subset(testdat,gene %in% glist) %>%
        group_by(gene,chr,start,end) %>%
        summarize() %>%
        mutate(cols="blue") %>%
        select(chr,start,end,gene,cols)
#> `summarise()` has grouped output by 'gene', 'chr', 'start'. You can override using the `.groups` argument.
labs[2,"cols"] <- "red"
ticks <- 0:2*5
circos.mhtplot2(dat,labs,ticks=ticks,ymax=max(ticks))
#> Note: 43 points are out of plotting region in sector 'chr16', track '5'.
Another circos Manhattan plot

Figure 7.5: Another circos Manhattan plot

As a side note, the data is used by manhattanly.

#{r mhttest, fig.cap="Manhattanly plot", fig.height=8, fig.width=8, plotly=TRUE}
library(manhattanly)
mhttest <- manhattanly(mhtdata, chr = "chr", bp = "pos",
                       snp = "rsn", annotation1 = "gene", suggestiveline = TRUE,
                       annotation2 = "rsn", p = "p")
mhttest
htmlwidgets::saveWidget(mhttest,"mhttest.html")

We now experiment with Miami plot,

mhtdata <- within(mhtdata,{pr=p})
miamiplot(mhtdata,chr="chr",bp="pos",p="p",pr="pr",snp="rsn")
Miami plots

Figure 7.6: Miami plots

# An alternative implementation
gwas <- select(mhtdata,chr,pos,p) %>%
        mutate(z=qnorm(p/2))
chrmaxpos <- miamiplot2(gwas,gwas,name1="Batch 2",name2="Batch 1",z1="z",z2="z")
#> Warning in max(c(dat1$pos[dat1$chr == i], dat2$pos[dat2$chr == i]), na.rm = TRUE): no non-missing arguments to max; returning -Inf
labelManhattan(chr=c(2,16),pos=c(226814165,52373776),name=c("AnonymousGene","FTO"),gwas,gwasZLab="z",chrmaxpos=chrmaxpos)
Miami plots

Figure 7.7: Miami plots

and a truncated Manhattan plot, noting that only data points with -log10(P)>=2 are shown,

par(oma=c(0,0,0,0), mar=c(5,6.5,1,1))
mhtplot.trunc(filter(IL.12B,log10P>=2), chr="Chromosome", bp="Position", z="Z",
   snp="MarkerName",
   suggestiveline=FALSE, genomewideline=-log10(5e-10),
   cex.mtext=1.2, cex.text=1.2,
   annotatelog10P=-log10(5e-10), annotateTop = FALSE,
   highlight=with(genes,gene),
   mtext.line=3, y.brk1=115, y.brk2=300, trunc.yaxis=TRUE, delta=0.01,
   cex.axis=1.5, cex=0.8, font=3, font.axis=1.5,
   y.ax.space=20,
   col = c("blue4", "skyblue")
)
Association of IL-12B

Figure 7.8: Association of IL-12B

The code below obtains a regional association plot with the asplot function,

asplot(CDKNlocus, CDKNmap, CDKNgenes, best.pval=5.4e-8, sf=c(3,6))
#> - CDKN2A 
#> - CDKN2B
title("CDKN2A/CDKN2B Region")
A regional association plot

Figure 7.9: A regional association plot

The function predates the currently popular locuszoom software but leaves the option open for generating such plots on the fly and locally.

Note that all these can serve as templates to customize features of your own.

7.2 Copy number variation

A plot of copy number variation (CNV) is shown here,

cnvplot(gap.datasets::cnv)
A CNV plot

Figure 7.10: A CNV plot

7.3 Effect size plot

The code below obtains an effect size plot via the ESplot function.

library(gap)
rs12075 <- data.frame(id=c("CCL2","CCL7","CCL8","CCL11","CCL13","CXCL6","Monocytes"),
                      b=c(0.1694,-0.0899,-0.0973,0.0749,0.189,0.0816,0.0338387),
                      se=c(0.0113,0.013,0.0116,0.0114,0.0114,0.0115,0.00713386))
ESplot(rs12075)
rs12075 and traits

Figure 7.11: rs12075 and traits

7.4 Forest plot

It draws many forest plots given a list of variants, e.g.,

data(OPG,package="gap.datasets")
meta::settings.meta(method.tau="DL")
METAL_forestplot(OPGtbl,OPGall,OPGrsid,width=6.75,height=5,digits.TE=2,digits.se=2,
                 col.diamond="black",col.inside="black",col.square="black")
#> Joining with `by = join_by(MarkerName)`
#> Joining with `by = join_by(MarkerName)`
Forest plots

Figure 7.12: Forest plots

Forest plots

Figure 7.13: Forest plots

METAL_forestplot(OPGtbl,OPGall,OPGrsid,package="metafor",method="FE",xlab="Effect",
                 showweights=TRUE)
#> Joining with `by = join_by(MarkerName)`
#> Joining with `by = join_by(MarkerName)`
Forest plots

Figure 7.14: Forest plots

Forest plots

Figure 7.15: Forest plots

8 Significance

Our focus is on z ~ Normal(0,1), whose schematic diagram is shown below.

Normal(0,1) distribution

Figure 8.1: Normal(0,1) distribution

The associate R function is z <- function(p) qnorm(p/2,lower.tail=FALSE).

When z is very large, its corresponding p value is very small. A genomewide significance is declared at 0.05/1000000=5e-8 with Bonferroni correction assuming 1 million SNPs are tested. This short note describes how to get -log10(p), which can be used in a Q-Q plot and software such as DEPICT16. The solution here is generic since z is also the square root of a chi-squared statistic, for instance.

8.1 log(p) and log10(p)

First thing first, here are the answers for log(p) and log10(p) given z,

# log(p) for a standard normal deviate z based on log()
logp <- function(z) log(2)+pnorm(-abs(z), lower.tail=TRUE, log.p=TRUE)

# log10(p) for a standard normal deviate z based on log()
log10p <- function(z) log(2, base=10)+pnorm(-abs(z), lower.tail=TRUE, log.p=TRUE)/log(10)

Note logp() will be used for functions such as qnorm() as in function cs() whereas log10p() is more appropriate for Manhattan plot and used in sentinels().

8.2 Rationale

We start with z=1.96 whose corresponding p value is approximately 0.05.

2*pnorm(-1.96,lower.tail=TRUE)

giving an acceptable value 0.04999579, so we proceed to get log10(p)

log10(2)+log10(pnorm(-abs(z),lower.tail=TRUE))

leading to the expression above from the fact that log10(X)=log(X)/log(10) since log(), being the natural log function, ln() – so log(exp(1)) = 1, in R, works far better on the numerator of the second term. The use of -abs() just makes sure we are working on the lower tail of the standard Normal distribution from which our p value is calculated.

8.3 Benchmark

Now we have a stress test,

z <- 20000
-log10p(z)

giving -log10(p) = 86858901.

8.4 Multiple precision arithmetic

We would be curious about the p value itself as well, which is furnished with the Rmpfr package

require(Rmpfr)
2*pnorm(mpfr(-abs(z),100),lower.tail=TRUE,log.p=FALSE)
mpfr(log(2),100) + pnorm(mpfr(-abs(z),100),lower.tail=TRUE,log.p=TRUE)

giving p = 1.660579603192917090365313727164e-86858901 and -log(p) = -200000010.1292789076808554854177, respectively. To carry on we have -log10(p) = -log(p)/log(10)=86858901.

To make -log10(p) usable in R we obtain it directly through

as.numeric(-log10(2*pnorm(mpfr(-abs(z),100),lower.tail=TRUE)))

which actually yields exactly the same 86858901.

If we go very far to have z=50,000. then -log10(p)=542868107 but we have less luck with Rmpfr.

One may wonder the P value in this case, which is 6.6666145952e-542868108 or simply 6.67e-542868108.

The magic function for doing this is defined as follows,

pvalue <- function (z, decimals = 2)
{
    lp <- -log10p(z)
    exponent <- ceiling(lp)
    base <- 10^-(lp - exponent)
    paste0(round(base, decimals), "e", -exponent)
}

and it is more appropriate to express p values in scientific format so they can be handled as follows,

log10pvalue <- function(p=NULL,base=NULL,exponent=NULL)
{
  if(!is.null(p))
  {
    p <- format(p,scientific=TRUE)
    p2 <- strsplit(p,"e")
    base <- as.numeric(lapply(p2,"[",1))
    exponent <- as.numeric(lapply(p2,"[",2))
  } else if(is.null(base) | is.null(exponent)) stop("base and exponent should both be specified")
  log10(base)+exponent
}

used as log10pvalue(p) when p<=1e-323, or log10pvalue(base=1,exponent=-323) otherwise.

One can also derive logpvalue for natural base (e) similarly.

We end with a quick look-up table

require(gap)
zlist <- c(5,10,30,40,50,100,500,1000,2000,3000,5000)
zp <- sapply(zlist,function(z) {c(z,pvalue(z),logp(z),log10p(z))})
rownames(zp) <- c("z","P","log(P)","log10(P)")
knitr::kable(t(zp),caption="z, P, log(P) and log10(P)")
Table 8.1: z, P, log(P) and log10(P)
z P log(P) log10(P)
5 5.73e-7 -14.3718512134288 -6.24161567672667
10 1.52e-23 -52.5381379699525 -22.8170234098221
30 9.81e-198 -453.628096775783 -197.008179265997
40 7.31e-350 -803.915294833194 -349.135976463682
50 2.16e-545 -1254.13821395886 -544.665305866333
100 2.69e-2174 -5004.83106151365 -2173.57051287337
500 2.47e-54290 -125006.440403451 -54289.6072695865
1000 4.58e-217151 -500007.133547632 -217150.339011999
2000 4.34e-868593 -2000007.82669406 -868592.362896546
3000 1.8e-1954329 -4500008.23215903 -1954328.74374587
5000 1.51e-5428685 -12500008.7429846 -5428684.82082061

8.5 Application

The mhtplot.trunc() function accepts three types of arguments:
  1. P values of association statistics, which could be very small.
log10p. log10(P).
  1. normal statistics that could be very large.

In all three cases, a log10(P) counterpart is obtained internally and to accommodate extreme value, the y-axis allows for truncation leaving out a given range to highlight the largest.

See the IL-12B example above.

9 Linear regression

Several functions related to linear regression are detailed here.

9.1 Some preparations

Let \(\mbox{x} = SNP\ dosage\). Note that \(\mbox{Var}(\mbox{x})=2f(1-f)\), \(f=MAF\) or \(1-MAF\) by symmetry.

Our linear regression model is \(\mbox{y}=a + b\mbox{x} + e\). We have \(\mbox{Var}(\mbox{y}) = b^2\mbox{Var}(\mbox{x}) + \mbox{Var}(e)\). Moreover, \(\mbox{Var}(b)=\mbox{Var}(e)(\mbox{x}'\mbox{x})^{-1}=\mbox{Var}(e)/S_\mbox{xx}\), we have \(\mbox{Var}(e) = \mbox{Var}(b)S_\mbox{xx} = N \mbox{Var}(b) \mbox{Var}(\mbox{x})\). Consequently, let \(z = {b}/{SE(b)}\), we have

\[\begin{eqnarray*} \mbox{Var}(\mbox{y}) &=& \mbox{Var}(\mbox{x})(b^2+N\mbox{Var}(b)) \hspace{100cm} \cr &=& \mbox{Var}(\mbox{x})\mbox{Var}(\mbox{b})(z^2+N) \cr &=& 2f(1-f)(z^2+N)\mbox{Var}(b) \end{eqnarray*}\]

Moreover, the mean and the variance of the multiple correlation coefficient or the coefficient of determination (\(R^2\)) are known17 to be \({1}/{(N-1)}\) and \({2(N-2)}/{\left[(N-1)^2(N+1)\right]}\), respectively.

We also need some established results of a ratio (R/S)1, i.e., the mean

\[ \begin{align} E(R/S) \approx \frac{\mu_R}{\mu_S}-\frac{\mbox{Cov}(R,S)}{\mu_S^2}+\frac{\sigma_S^2\mu_R}{\mu_S^3} \hspace{100cm} \tag{9.1} \end{align} \]

and more importantly the variance

\[ \begin{align} \mbox{Var}(R/S) \approx \frac{\mu_R^2}{\mu_S^2} \left[ \frac{\sigma_R^2}{\mu_R^2} -2\frac{\mbox{Cov}(R,S)}{\mu_R\;\mu_S} +\frac{\sigma_S^2}{\mu_S^2} \right] \hspace{100cm} \tag{9.2} \end{align} \]

where \(\mu_R\), \(\mu_S\), \(\sigma_R^2\), \(\sigma_S^2\) are the means and the variances for R and S, respectively.

Finally, we need some facts about \(\chi_1^2\), \(\chi^2\) distribution of one degree of freedom. For \(z \sim N(0,1)\), \(z^2\sim \chi_1^2\), whose mean and variance are 1 and 2, respectively.

We now have the following results.

9.2 Proportion of variance explained

We have

\[ \begin{align} \mbox{PVE}_{\mbox{linear regression}} & = \frac{\mbox{Var}(b\mbox{x})}{\mbox{Var}(\mbox{y})} \hspace{100cm} \\ & = \frac{\mbox{Var}(\mbox{x})b^2}{\mbox{Var}(\mbox{x})(b^2+N\mbox{Var}(b))} \\ & = \frac{\mbox{z}^2}{\mbox{z}^2+N} \tag{9.3} \end{align} \]

On the other hand, for a simple linear regression \(R^2\equiv r^2\) where \(r\) is the Pearson correlation coefficient, which is readily from the \(t\)-statistic of the regression slope, i.e., \(r={t}/{\sqrt{t^2+N-2}}\). so assuming \(t \equiv \ z \sim \chi_1^2\)

\[ \begin{align} \mbox{PVE}_{t-\mbox{statistic}} & = \frac{\chi^2}{\chi^2+N-2} \hspace{100cm} \tag{9.4} \end{align} \]

To obtain coherent estimates of the asymptotic means and variances of both forms we resort to variance of a ratio (R/S). All the required elements are listed in a table below.

Characteristics Linear regression \(t\)-statistic
\(\mu_R\) 1 1
\(\sigma_R^2\) 2 2
\(\mu_S\) \(N+1\) \(N-1\)
\(\sigma_S^2\) 2 2
\(\mbox{Cov}(R,S)\) 2 2

then we have the means and the variances for PVE.

Characteristics Linear regression \(t\)-statistic
mean \(\frac{1}{N+1}\left[1-\frac{2}{N+1}+\frac{2}{(N+1)^2}\right]\) \(\frac{1}{N-1}\left[1-\frac{2}{N-1}+\frac{2}{(N-1)^2}\right]\)
variance \(\frac{2}{(N+1)^2}\left[1-\frac{1}{N+1}\right]^2\) \(\frac{2}{(N-1)^2}\left[1-\frac{1}{N-1}\right]^2\)

Finally, our approximation of PVE for a protein with \(T\) independent pQTLs from the meta-analysis

Characteristics Linear regression \(t\)-statistic
estimate \(\sum_{i=1}^T{\frac{\chi_i^2}{\chi_i^2+N_i}}\) \(\sum_{i=1}^T{\frac{\chi_i^2}{\chi_i^2+N_i-2}}\)
variance \(\sum_{i=1}^T\frac{2}{(N_i+1)^2}\) \(\sum_{i=1}^T\frac{2}{(N_i-1)^2}\)

Therefore they differ from the asymptotic results17 by ratios of \((N-2)(N+1)/(N-1)^2\) and \((N-2)/(N+1)\) for linear regression and \(t\)-statistic, respectively.

Comparisons of Var($R^2$) estimates

Figure 9.1: Comparisons of Var(\(R^2\)) estimates

As the sample size increases, the estimates are quite close nevertheless quite small while the ratios approach 1 quickly from opposite sides after \(N\approx 300\).

9.3 Effect size and standard error

When \(\mbox{Var}(\mbox{y})=1\), as in cis eQTLGen18 data, we have \(b\) and its standard error (se) as follows,

\[ \begin{align} b & = z/d \hspace{100cm} \\ se & = 1/d \tag{9.5} \end{align} \]

where \(d = \sqrt{2f(1-f)(z^2+N)}\).

Now three functions are in place.

9.4 get_b_se

A record of the eQTLGen data is shown below

        SNP    Pvalue SNPChr  SNPPos AssessedAllele OtherAllele Zscore
 rs1003563 2.308e-06     12 6424577              A           G 4.7245
            Gene GeneSymbol GeneChr GenePos NrCohorts NrSamples         FDR
 ENSG00000111321       LTBR      12 6492472        34     23991 0.006278872
 BonferroniP hg19_chr hg19_pos AlleleA AlleleB allA_total allAB_total
           1       12  6424577       A       G       2574        8483
 allB_total AlleleB_all
       7859   0.6396966

from which we obtain the effect size and its standard error as follows,

get_b_se(0.6396966,23991,4.7245)
#>               b          se
#> [1,] 0.04490488 0.009504684

9.5 get_pve_se

This function obtains proportion of explained variation (PVE) from n, z; its standard error is based on variance of the ratio (correction=TRUE) or \(r^2\).

9.6 get_sdy

We continue with the eQTLGen example above,

get_sdy(0.6396966,23991,0.04490488,0.009504684)
#> [1] 1

and indeed the eQTLGen data were standardized.

10 Sentinels of association

10.1 Identification

Following an earlier implementation called sentinels, a distance-based signal identification based on GWAS summary statistics is avaialable from qtlFinder function; to avoid the overhead of data-loading it works on a preselected list of variants from a GWAS and in this case a METAL output file.

10.2 Fine-mapping

We considered a region of interest (which could be approximately independent variants, e.g., \(r^2 \le 0.5\)) using expressions that rely on effect sizes and their standard errors19. More specifically, let Bayes factor (BF) for each variant in the meta-analysis be defined as \(ln(BF) \propto 0.5 \beta^2/SE^2\), where \(\beta\) and \(SE\) are the effect size and standard error from the meta-analysis, respectively. The posterior probability (PP) for being causal for a particular variant is obtained as \(BF_i/\sum_{i=1}^TBF_i\), where \(i=1,\ldots,T\) indexes all variants considered in the region. We generated credible sets within a given region by ranking all variants by PPs in descending order and calculating the number of variants required to reach a cumulative probability of such as 99%.

The function cs obtains credible set.

# zcat METAL/4E.BP1-1.tbl.gz | \
# awk 'NR==1 || ($1==4 && $2 >= 187158034 - 1e6 && $2 < 187158034 + 1e6)' >  4E.BP1.z
tbl <- within(read.delim("4E.BP1.z"),{logp <- logp(Effect/StdErr)})
z <- cs(tbl)
l <- cs(tbl,log_p="logp")

Note in particular that the implementation intends to avoid the naive summation in scenarios such as proteogenomic analysis containing exceptionally large BFs.

11 Polygenic modeling

In line with the recent surge of interest in the polygenic models, a separate vignette is available through vignette("h2",package="gap.examples") demonstrating aspect of the models on heritability. Utility Functions h2G, h2GE and h2l are briefly documented. Functions h2.jags and hwe.jags are also available. The function h2_mzdz can be used for heritability estimation based on monozygotic (MZ) and dizygotic (DZ) twin correlations under the additive genetics, common and specific environment (ACE) model, e.g., 10.1038/s41562-023-01530-y.

12 Mendelian randomization

12.1 The mr function

The function mr was originally developed to rework on data generated from GSMR20, although it could be any harmonised data. The following example is from analysis of a real data on LIF-R protein and CVD/FEV1.

Table 12.1: LIF-R and CAD/FEV1
SNP b.LIF.R SE.LIF.R b.FEV1 SE.FEV1 b.CAD SE.CAD
rs188743906 0.6804 0.1104 0.00177 0.01660 NA NA
rs2289779 -0.0788 0.0134 0.00104 0.00261 -0.007543 0.0092258
rs117804300 -0.2281 0.0390 -0.00392 0.00855 0.109372 0.0362219
rs7033492 -0.0968 0.0147 -0.00585 0.00269 0.022793 0.0119903
rs10793962 0.2098 0.0212 0.00378 0.00536 -0.014567 0.0138196
rs635634 -0.2885 0.0153 -0.02040 0.00334 0.077157 0.0117123
rs176690 -0.0973 0.0142 0.00293 0.00306 -0.000007 0.0107781
rs147278971 -0.2336 0.0378 -0.01240 0.00792 0.079873 0.0397491
rs11562629 0.1155 0.0181 0.00960 0.00378 -0.010040 0.0151460

The MR analysis is as follows,

Mendelian randomization

Figure 12.1: Mendelian randomization

Mendelian randomization

Figure 12.2: Mendelian randomization

Mendelian randomization

Figure 12.3: Mendelian randomization

Mendelian randomization

Figure 12.4: Mendelian randomization

Mendelian randomization

Figure 12.5: Mendelian randomization

Mendelian randomization

Figure 12.6: Mendelian randomization

Table 12.2: LIFR variant rs635634 and CAD/FEV1
LIF.R.CAD LIF.R.FEV1
bIVW -0.187 0.045
sebIVW 0.050 0.012
CochQ 2.116 0.482
CochQp 0.953 1.000
bEGGER -0.184 0.049
sebEGGER 0.060 0.015
intEGGER 0.001 0.001
seintEGGER 0.010 0.002
bWM -0.247 0.062
sebWM 0.045 0.012
bPWM -0.248 0.055
sebPWM 0.044 0.014
IVW 1.79e-04 3.19e-04
EGGER 2.15e-03 1.13e-03
WM 4.15e-08 1.68e-07
PWM 1.73e-08 5.64e-05

This is close to Ligthart et al.21 as used one time at workplace which turns to overlap with TwoSampleMR22.

12.2 Contrast of effect sizes

It would be of interest to contrast their effect sizes in the analysis above as well,

mr_names <- names(mrdat)
LIF.R <- cbind(mrdat[grepl("SNP|LIF.R",mr_names)],trait="LIF.R"); names(LIF.R) <- c("SNP","b","se","trait")
FEV1 <- cbind(mrdat[grepl("SNP|FEV1",mr_names)],trait="FEV1"); names(FEV1) <- c("SNP","b","se","trait")
CAD <- cbind(mrdat[grepl("SNP|CAD",mr_names)],trait="CAD"); names(CAD) <- c("SNP","b","se","trait")
mrdat2 <- within(rbind(LIF.R,FEV1,CAD),{y=b})
library(ggplot2)
p <- ggplot2::ggplot(mrdat2,aes(y = SNP, x = y))+
     ggplot2::theme_bw()+
     ggplot2::geom_point()+
     ggplot2::facet_wrap(~ trait, ncol=3, scales="free_x")+
     ggplot2::geom_segment(aes(x = b-1.96*se, xend = b+1.96*se, yend = SNP))+
     ggplot2::geom_vline(lty=2, ggplot2::aes(xintercept=0), colour = 'red')+
     ggplot2::xlab("Effect size")+
     ggplot2::ylab("")
p
#> Warning: Removed 1 row containing missing values or values outside the scale range (`geom_point()`).
#> Warning: Removed 1 row containing missing values or values outside the scale range (`geom_segment()`).
Combined forest plots for LIF.R, FEV1 and CAD

Figure 12.7: Combined forest plots for LIF.R, FEV1 and CAD

12.3 Forest plots

We illustrate with mr_forestplot(),

mr_forestplot(tnfb, colgap.forest.left="0.05cm", fontsize=14, leftlabs=c("Outcome","b","SE"),
              common=FALSE, random=FALSE, print.I2=FALSE, print.pval.Q=FALSE, print.tau2=FALSE,
              spacing=1.6,digits.TE=2,digits.se=2,xlab="Effect size",type.study="square",col.inside="black",col.square="black")
Forest plots for MR results on TNFB

Figure 12.8: Forest plots for MR results on TNFB

mr_forestplot(tnfb, colgap.forest.left="0.05cm", fontsize=14,
              leftcols="studlab", leftlabs="Outcome", plotwidth="3inch", sm="OR", rightlabs="ci",
              sortvar=tnfb[["Effect"]],
              common=FALSE, random=FALSE, print.I2=FALSE, print.pval.Q=FALSE, print.tau2=FALSE,
              backtransf=TRUE, spacing=1.6,type.study="square",col.inside="black",col.square="black")
Forest plots for MR results on TNFB (no summary statistics)

Figure 12.9: Forest plots for MR results on TNFB (no summary statistics)

mr_forestplot(tnfb,colgap.forest.left="0.05cm", fontsize=14,
              leftcols=c("studlab"), leftlabs=c("Outcome"),
              plotwidth="3inch", sm="OR", sortvar=tnfb[["Effect"]],
              rightcols=c("effect","ci","pval"), rightlabs=c("OR","95%CI","P"),
              digits=3, digits.pval=2, scientific.pval=TRUE,
              common=FALSE, random=FALSE, print.I2=FALSE, print.pval.Q=FALSE, print.tau2=FALSE,
              addrow=TRUE, backtransf=TRUE, spacing=1.6,type.study="square",col.inside="black",col.square="black")
Forest plots for MR results on TNFB (with P values)

Figure 12.10: Forest plots for MR results on TNFB (with P values)

13 Miscellaneous functions

13.1 chr_pos_a1_a2 and inv_chr_pos_a1_a2

They are functions to handle SNPid.

require(gap)
s <- chr_pos_a1_a2(1,c(123,321),letters[1:2],letters[2:1])
s
#> [1] "chr1:123_A_B" "chr1:321_A_B"
inv_chr_pos_a1_a2(s)
#>               chr pos a1 a2
#> chr1:123_A_B chr1 123  A  B
#> chr1:321_A_B chr1 321  A  B
inv_chr_pos_a1_a2("chr1:123-A_B",seps=c(":","-","_"))
#>               chr pos a1 a2
#> chr1:123-A_B chr1 123  A  B

13.2 ci2ms

Here is the documentation example on rs3784099 and breast cancer23.

example(ci2ms)
#> 
#> ci2ms> # rs3784099 and breast cancer recurrence/mortality
#> ci2ms> ms <- ci2ms("1.28-1.72")
#> 
#> ci2ms> print(ms)
#> $m
#> [1] 0.3945922
#> 
#> $s
#> [1] 0.07537491
#> 
#> $direction
#> [1] "+"
#> 
#> 
#> ci2ms> # Vector input
#> ci2ms> ci2 <- c("1.28-1.72","1.25-1.64")
#> 
#> ci2ms> ms2 <- ci2ms(ci2)
#> 
#> ci2ms> print(ms2)
#> $m
#> [1] 0.3945922 0.3589199
#> 
#> $s
#> [1] 0.07537491 0.06927492
#> 
#> $direction
#> [1] "+" "+"

13.3 gc.lambda

The definition is as follows,

gc.lambda <- function(x, logscale=FALSE, z=FALSE) {
  v <- x[!is.na(x)]
  n <- length(v)
  if (z) {
     obs <- v^2
     exp <- qchisq(log(1:n/n),1,lower.tail=FALSE,log.p=TRUE)
  } else {
    if (!logscale)
    {
      obs <- qchisq(v,1,lower.tail=FALSE)
      exp <- qchisq(1:n/n,1,lower.tail=FALSE)
    } else {
      obs <- qchisq(-log(10)*v,1,lower.tail=FALSE,log.p=TRUE)
      exp <- qchisq(log(1:n/n),1,lower.tail=FALSE,log.p=TRUE)
    }
  }

  lambda <- median(obs)/median(exp)
  return(lambda)
}

# A simplified version is as follows,
# obs <- median(chisq)
# exp <- qchisq(0.5, 1) # 0.4549364
# lambda <- obs/exp
# see also estlambda from GenABEL and qq.chisq from snpStats

# A related function

lambda1000 <- function(lambda, ncases, ncontrols)
  1 + (lambda - 1) * (1 / ncases + 1 / ncontrols)/( 1 / 1000 + 1 / 1000)

13.4 invnormal

The function is widely used in various consortium analyses and defined as follows,

invnormal <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))

An example use on data from Poisson distribution is as follows,

set.seed(12345)
Ni <- rpois(50, lambda = 4); table(factor(Ni, 0:max(Ni)))
#> 
#>  0  1  2  3  4  5  6  7  8  9 
#>  2  4  6 11  8 10  4  2  2  1
y <- invnormal(Ni)
sd(y)
#> [1] 0.9755074
mean(y)
#> [1] 0.002650512
Ni <- 1:50
y <- invnormal(Ni)
mean(y)
#> [1] -1.810184e-17
sd(y)
#> [1] 0.9973999

13.5 revStrand

This functions obtains allele(s) on the opposite strand.

alleles <- c("a","c","G","t")
revStrand(alleles)
#> [1] "t" "g" "C" "a"

13.6 snptest_sample

This is a function to output sample file for SNPTEST.

d <- data.frame(ID_1=1,ID_2=1,missing=0,PC1=1,PC2=2,D1=1,P1=10)
snptest_sample(d,C=paste0("PC",1:2),D=paste0("D",1:1),P=paste0("P",1:1))

The commands above generates a file named snptest.sample.

14 Known issues

A lot of code optimization as wtih better memory management is desirable. There are apparent issues with the Shiny interfaces for power/sample size calculation which produce less outputs for certain configurations than expected.

15 Summary

By now the package should have given you a flavor of the project. It sets to build a infrastructure to keep up with the development of R system itself and collect elements from oning work. Over years it also serves to inspire others to join force and develop better alternatives.

Bibliography

1.
Guo, S. W. & Lange, K. Genetic mapping of complex traits: Promises, problems, and prospects. Theor Popul Biol 57, 1–11 (2000).
2.
Zhao, J. H. & Tan, Q. Integrated analysis of genetic data with R. Hum Genomics 2, 258–65 (2006).
3.
Zhao, J. H. & Tan, Q. Genetic dissection of complex traits in silico: Approaches, problems and solutions. Current Bioinformatics 1, 359–369 (2006).
4.
5.
Zhao, J. H. Pedigree-drawing with R and graphviz. Bioinformatics 22, 1013–4 (2006).
6.
Willer, C. J., Li, Y. & Abecasis, G. R. METAL: Fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–1 (2010).
7.
Zhao, J. H. & Sham, P. C. Generic number systems and haplotype analysis. Comput Methods Programs Biomed 70, 1–9 (2003).
8.
Zhao, J. H., Curtis, D. & Sham, P. C. Model-free analysis and permutation tests for allelic associations. Hum Hered 50, 133–9 (2000).
9.
10.
Hudson, R. R. Generating samples under a wright-fisher neutral model of genetic variation. Bioinformatics 18, 337–8 (2002).
11.
Risch, N. & Merikangas, K. The future of genetic studies of complex human diseases. Science 273, 1516–1517 (1996).
12.
Risch, N. & Merikangas, K. Reply to Scott el al. Science 275, 1329–1330 (1997).
13.
Zhao, J. H. gap: Genetic analysis package. Journal of Statistical Software 23, 1–18 (2007).
14.
Scott, W. K., Pericak-Vance, M. A. & Haines, J. L. Genetic analysis of complex diseases. Science 275, 1327; author reply 1329–30 (1997).
15.
16.
17.
Kotz, S., Read, C. B., Balakrishnan, N., Vidakovic, B. & Johnson, N. L. Encyclopedia of Statistical Sciences. (John Wiley & Sons, Inc., Hoboken, New Jersey, 2006).
18.
Võsa, U. et al. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat Genet (2021) doi:10.1038/s41588-021-00913-z.
19.
Graham, S. E. et al. The power of genetic diversity in genome-wide association studies of lipids. Nature 600, 675–679 (2021).
20.
21.
22.
23.
24.
Williams, C. J., Christian, J. C. & Norton, J. A. TWINAN90: A FORTRAN program for conducting ANOVA-based and likelihood-based analyses of twin data. Computer Methods and Programs in Biomedicine 38, 167–176 (1992).

Appendix

After package loading via library(gap), you can use lsf.str("package:gap") and data(package="gap") to generate a list of functions and a list of datasets, respectvely. If this looks odd to you, you might try search() within R to examine what is available in your environment before issuing the lsf.str command.

#>  [1] ".GlobalEnv"           "package:ggplot2"      "package:lattice"      "package:dplyr"        "package:pedigree"     "package:kinship2"    
#>  [7] "package:quadprog"     "package:Matrix"       "package:DiagrammeR"   "package:DOT"          "package:gap"          "package:gap.datasets"
#> [13] "package:stats"        "package:graphics"     "package:grDevices"    "package:utils"        "package:datasets"     "package:methods"     
#> [19] "Autoloads"            "package:base"
#> AE3 : function (model, random, data, seed = 1234, n.sim = 50000, verbose = TRUE)  
#> BFDP : function (a, b, pi1, W, logscale = FALSE)  
#> ESplot : function (ESdat, alpha = 0.05, fontsize = 12)  
#> FPRP : function (a, b, pi0, ORlist, logscale = FALSE)  
#> KCC : function (model, GRR, p1, K)  
#> LD22 : function (h, n)  
#> LDkl : function (n1 = 2, n2 = 2, h, n, optrho = 2, verbose = FALSE)  
#> MCMCgrm : function (model, prior, data, GRM, eps = 0, n.thin = 10, n.burnin = 3000, n.iter = 13000, ...)  
#> METAL_forestplot : function (tbl, all, rsid, flag = "", package = "meta", method = "REML", split = FALSE, ...)  
#> ReadGRM : function (prefix = 51)  
#> ReadGRMBin : function (prefix, AllN = FALSE, size = 4)  
#> WriteGRM : function (prefix = 51, id, N, GRM)  
#> WriteGRMBin : function (prefix, grm, N, id, size = 4)  
#> a2g : function (a1, a2)  
#> ab : function (type = "power", n = 25000, a = 0.15, sa = 0.01, b = log(1.19), sb = 0.01, alpha = 0.05, fold = 1)  
#> allele.recode : function (a1, a2, miss.val = NA)  
#> asplot : function (locus, map, genes, flanking = 1000, best.pval = NULL, sf = c(4, 4), logpmax = 10, pch = 21)  
#> b2r : function (b, s, rho, n)  
#> bt : function (x)  
#> ccsize : function (n, q, pD, p1, theta, alpha, beta = 0.2, power = TRUE, verbose = FALSE)  
#> chow.test : function (y1, x1, y2, x2, x = NULL)  
#> chr_pos_a1_a2 : function (chr, pos, a1, a2, prefix = "chr", seps = c(":", "_", "_"), uppercase = TRUE)  
#> ci2ms : function (ci, logscale = TRUE, alpha = 0.05)  
#> circos.cis.vs.trans.plot : function (hits, panel, id, radius = 1e+06)  
#> circos.cnvplot : function (data)  
#> circos.mhtplot : function (data, glist)  
#> circos.mhtplot2 : function (dat, labs, species = "hg18", ticks = 0:3 * 10, ymax = 30)  
#> cis.vs.trans.classification : function (hits, panel, id, radius = 1e+06)  
#> cnvplot : function (data)  
#> comp.score : function (ibddata = "ibd_dist.out", phenotype = "pheno.dat", mean = 0, var = 1, h2 = 0.3)  
#> cs : function (tbl, b = "Effect", se = "StdErr", log_p = NULL, cutoff = 0.95)  
#> fbsize : function (gamma, p, alpha = c(1e-04, 1e-08, 1e-08), beta = 0.2, debug = 0, error = 0)  
#> g2a : function (g)  
#> gc.em : function (data, locus.label = NA, converge.eps = 1e-06, maxiter = 500, handle.miss = 0, miss.val = 0, control = gc.control())  
#> gc.lambda : function (x, logscale = FALSE, z = FALSE)  
#> gcontrol : function (data, zeta = 1000, kappa = 4, tau2 = 1, epsilon = 0.01, ngib = 500, burn = 50, idum = 2348)  
#> gcontrol2 : function (p, col = palette()[4], lcol = palette()[2], ...)  
#> gcp : function (y, cc, g, handle.miss = 1, miss.val = 0, n.sim = 0, locus.label = NULL, quietly = FALSE)  
#> genecounting : function (data, weight = NULL, loci = NULL, control = gc.control())  
#> geno.recode : function (geno, miss.val = 0)  
#> get_b_se : function (f, n, z)  
#> get_pve_se : function (n, z, correction = TRUE)  
#> get_sdy : function (f, n, b, se, method = "mean", ...)  
#> gif : function (data, gifset)  
#> grid2d : function (chrlen, plot = TRUE, cex.labels = 0.6, xlab = "QTL position", ylab = "Gene position")  
#> h2.jags : function (y, x, G, eps = 1e-04, sigma.p = 0, sigma.r = 1, parms = c("b", "p", "r", "h2"), ...)  
#> h2G : function (V, VCOV, verbose = TRUE)  
#> h2GE : function (V, VCOV, verbose = TRUE)  
#> h2_mzdz : function (mzDat = NULL, dzDat = NULL, rmz = NULL, rdz = NULL, nmz = NULL, ndz = NULL, selV = NULL)  
#> h2l : function (K = 0.05, P = 0.5, h2, se, verbose = TRUE)  
#> hap : function (id, data, nloci, loci = rep(2, nloci), names = paste("loci", 1:nloci, sep = ""), control = hap.control())  
#> hap.control : function (mb = 0, pr = 0, po = 0.001, to = 0.001, th = 1, maxit = 100, n = 0, ss = 0, rs = 0, rp = 0, ro = 0, rv = 0, sd = 0, mm = 0, mi = 0, 
#>     mc = 50, ds = 0.1, de = 0, q = 0, hapfile = "hap.out", assignfile = "assign.out")  
#> hap.em : function (id, data, locus.label = NA, converge.eps = 1e-06, maxiter = 500, miss.val = 0)  
#> hap.score : function (y, geno, trait.type = "gaussian", offset = NA, x.adj = NA, skip.haplo = 0.005, locus.label = NA, miss.val = 0, n.sim = 0, method = "gc", 
#>     id = NA, handle.miss = 0, mloci = NA, sexid = NA)  
#> hmht.control : function (data = NULL, colors = NULL, yoffset = 0.25, cex = 1.5, boxed = FALSE)  
#> hwe : function (data, data.type = "allele", yates.correct = FALSE, miss.val = 0)  
#> hwe.cc : function (model, case, ctrl, k0, initial1, initial2)  
#> hwe.hardy : function (a, alleles = 3, seed = 3000, sample = c(1000, 1000, 5000))  
#> hwe.jags : function (k, n, delta = rep(1/k, k), lambda = 0, lambdamu = -1, lambdasd = 1, parms = c("p", "f", "q", "theta", "lambda"), ...)  
#> inv_chr_pos_a1_a2 : function (chr_pos_a1_a2, prefix = "chr", seps = c(":", "_", "_"))  
#> invnormal : function (x)  
#> ixy : function (x)  
#> kin.morgan : function (ped, verbose = FALSE)  
#> klem : function (obs, k = 2, l = 2)  
#> labelManhattan : function (chr, pos, name, gwas, gwasChrLab = "chr", gwasPosLab = "pos", gwasPLab = "p", gwasZLab = "NULL", chrmaxpos, textPos = 4, angle = 0, 
#>     miamiBottom = FALSE)  
#> log10p : function (z)  
#> log10pvalue : function (p = NULL, base = NULL, exponent = NULL)  
#> logp : function (z)  
#> makeped : function (pifile = "pedfile.pre", pofile = "pedfile.ped", auto.select = 1, with.loop = 0, loop.file = NA, auto.proband = 1, proband.file = NA)  
#> masize : function (model, opts, alpha = 0.025, gamma = 0.2)  
#> metap : function (data, N, verbose = "Y", prefixp = "p", prefixn = "n")  
#> metareg : function (data, N, verbose = "Y", prefixb = "b", prefixse = "se")  
#> mht.control : function (type = "p", usepos = FALSE, logscale = TRUE, base = 10, cutoffs = NULL, colors = NULL, labels = NULL, srt = 45, gap = NULL, cex = 0.4, 
#>     yline = 3, xline = 3)  
#> mhtplot : function (data, control = mht.control(), hcontrol = hmht.control(), ...)  
#> mhtplot.trunc : function (x, chr = "CHR", bp = "BP", p = NULL, log10p = NULL, z = NULL, snp = "SNP", col = c("gray10", "gray60"), chrlabs = NULL, suggestiveline = -log10(1e-05), 
#>     genomewideline = -log10(5e-08), highlight = NULL, annotatelog10P = NULL, annotateTop = FALSE, cex.mtext = 1.5, cex.text = 0.7, mtext.line = 2, 
#>     y.ax.space = 5, y.brk1 = NULL, y.brk2 = NULL, trunc.yaxis = TRUE, cex.axis = 1.2, delta = 0.05, ...)  
#> mhtplot2 : function (data, control = mht.control(), hcontrol = hmht.control(), ...)  
#> mia : function (hapfile = "hap.out", assfile = "assign.out", miafile = "mia.out", so = 0, ns = 0, mi = 0, allsnps = 0, sas = 0)  
#> miamiplot : function (x, chr = "CHR", bp = "BP", p = "P", pr = "PR", snp = "SNP", col = c("midnightblue", "chartreuse4"), col2 = c("royalblue1", "seagreen1"), 
#>     ymax = NULL, highlight = NULL, highlight.add = NULL, pch = 19, cex = 0.75, cex.lab = 1, xlab = "Chromosome", ylab = "-log10(P) [y>0]; log10(P) [y<0]", 
#>     lcols = c("red", "black"), lwds = c(5, 2), ltys = c(1, 2), main = "", ...)  
#> miamiplot2 : function (gwas1, gwas2, name1 = "GWAS 1", name2 = "GWAS 2", chr1 = "chr", chr2 = "chr", pos1 = "pos", pos2 = "pos", p1 = "p", p2 = "p", z1 = NULL, 
#>     z2 = NULL, sug = 1e-05, sig = 5e-08, pcutoff = 0.1, topcols = c("green3", "darkgreen"), botcols = c("royalblue1", "navy"), yAxisInterval = 5)  
#> mr : function (data, X, Y, alpha = 0.05, other_plots = FALSE)  
#> mr_forestplot : function (dat, sm = "", title = "", ...)  
#> mtdt : function (x, n.sim = 0)  
#> mtdt2 : function (x, verbose = TRUE, n.sim = NULL, ...)  
#> muvar : function (n.loci = 1, y1 = c(0, 1, 1), y12 = c(1, 1, 1, 1, 1, 0, 0, 0, 0), p1 = 0.99, p2 = 0.9)  
#> mvmeta : function (b, V)  
#> pbsize : function (kp, gamma = 4.5, p = 0.15, alpha = 5e-08, beta = 0.2)  
#> pbsize2 : function (N, fc = 0.5, alpha = 0.05, gamma = 4.5, p = 0.15, kp = 0.1, model = "additive")  
#> pedtodot : function (pedfile, makeped = FALSE, sink = TRUE, page = "B5", url = "https://jinghuazhao.github.io/", height = 0.5, width = 0.75, rotate = 0, 
#>     dir = "none")  
#> pedtodot_verbatim : function (f, run = FALSE, toDOT = FALSE, ...)  
#> pfc : function (famdata, enum = 0)  
#> pfc.sim : function (famdata, n.sim = 1e+06, n.loop = 1)  
#> pgc : function (data, handle.miss = 1, is.genotype = 0, with.id = 0)  
#> pvalue : function (z, decimals = 2)  
#> qqfun : function (x, distribution = "norm", ylab = deparse(substitute(x)), xlab = paste(distribution, "quantiles"), main = NULL, las = par("las"), envelope = 0.95, 
#>     labels = FALSE, col = palette()[4], lcol = palette()[2], xlim = NULL, ylim = NULL, lwd = 1, pch = 1, bg = palette()[4], cex = 0.4, line = c("quartiles", 
#>         "robust", "none"), ...)  
#> qqunif : function (u, type = "unif", logscale = TRUE, base = 10, col = palette()[4], lcol = palette()[2], ci = FALSE, alpha = 0.05, ...)  
#> qtl2dplot : function (d, chrlen = gap::hg19, snp_name = "SNP", snp_chr = "Chr", snp_pos = "bp", gene_chr = "p.chr", gene_start = "p.start", gene_end = "p.end", 
#>     trait = "p.target.short", gene = "p.gene", TSS = FALSE, cis = "cis", value = "log10p", plot = TRUE, cex.labels = 0.6, cex.points = 0.6, xlab = "QTL position", 
#>     ylab = "Gene position")  
#> qtl2dplotly : function (d, chrlen = gap::hg19, qtl.id = "SNPid:", qtl.prefix = "QTL:", qtl.gene = "Gene:", target.type = "Protein", TSS = FALSE, xlab = "QTL position", 
#>     ylab = "Gene position", ...)  
#> qtl3dplotly : function (d, chrlen = gap::hg19, zmax = 300, qtl.id = "SNPid:", qtl.prefix = "QTL:", qtl.gene = "Gene:", target.type = "Protein", TSS = FALSE, 
#>     xlab = "QTL position", ylab = "Gene position", ...)  
#> qtlClassifier : function (geneSNP, SNPPos, genePos, radius)  
#> qtlFinder : function (d, Chromosome = "Chromosome", Position = "Position", MarkerName = "MarkerName", Allele1 = "Allele1", Allele2 = "Allele2", EAF = "Freq1", 
#>     Effect = "Effect", StdErr = "StdErr", log10P = "log10P", N = "N", radius = 1e+06, collapse.hla = TRUE, build = "hg19")  
#> read.ms.output : function (msout, is.file = TRUE, xpose = TRUE, verbose = TRUE, outfile = NULL, outfileonly = FALSE)  
#> revStrand : function (allele)  
#> runshinygap : function (...)  
#> s2k : function (y1, y2)  
#> sentinels : function (p, pid, st, debug = FALSE, flanking = 1e+06, chr = "Chrom", pos = "End", b = "Effect", se = "StdErr", log_p = NULL, snp = "MarkerName", 
#>     sep = ",")  
#> snptest_sample : function (data, sample_file = "snptest.sample", ID_1 = "ID_1", ID_2 = "ID_2", missing = "missing", C = NULL, D = NULL, P = NULL)  
#> tscc : function (model, GRR, p1, n1, n2, M, alpha.genome, pi.samples, pi.markers, K)  
#> whscore : function (allele, type)  
#> xy : function (x)
#>      Package LibPath                                                        Item   Title                             
#> [1,] "gap"   "/rds/user/jhz22/hpc-work/work/RtmpcOzqwx/Rinst2fb82136a4edc8" "hg18" "Chromosomal lengths for build 36"
#> [2,] "gap"   "/rds/user/jhz22/hpc-work/work/RtmpcOzqwx/Rinst2fb82136a4edc8" "hg19" "Chromosomal lengths for build 37"
#> [3,] "gap"   "/rds/user/jhz22/hpc-work/work/RtmpcOzqwx/Rinst2fb82136a4edc8" "hg38" "Chromosomal lengths for build 38"

  1. Notes by Howard Seltman from Carnegie Mellon University: Pittsburgh, PA, USA.↩︎