Examples in detail

Alexander Klassmann

2021-09-15

1 Overview

Despite a bewildering nomenclature, the idea of Extended Haplotype Homozygosity is simple. Consider the following alignment of nucleotide sequences where only bi-allelic sites have been retained:

The colored areas mark the maximal extension to which at least two sequences carrying the same focal allele are identical, i.e. homozygous to each other. The average length of all sequence-pairwise shared haplotypes yields the iHH scores for the two central alleles, respectively. The (unstandardized) iHS value is the log ratio of them. The statistics XP-EHH and Rsb are constructed in the same way with the two alleles replaced by two populations and while Rsb is normalized to 1 at the focal position, XP-EHH is not. That’s all!

This vignette analyses in great detail two small example data sets delivered with the package rehh (see main vignette). They have been constructed to ease comprehension of the relevant statistics and functionality of the package. The first example has been already discussed in (Gautier et al. 2017) while the second set is an extension to include multiple markers and missing values. The modifications for unphased or unpolarized data have been described in (Klassmann and Gautier 2020).

The pattern of variation seen in the sets and in the alignment above is intended to reflect an evolutionary scenario of an “on-going selective sweep” with one allele of the central marker experiencing strong selection.

The package has to be installed and then loaded by

library(rehh)

2 Example data set 1

2.1 Input

2.1.1 Data files

Both example data sets are provided in two formats: as a pair of haplotype & map files and as a single file in variant call format (vcf). They are copied (together with other files) into the current working directory by the command

make.example.files()

Data set 1 contains the hypothetical variation at a particular genetic locus in 8 sequences of 4 diploid individuals. Eleven markers, which might be imagined as SNPs, have two alleles, coded as ‘0’ and ‘1’, for ancestral and derived allele, respectively.

The file example1.hap conforms to what in the main vignette is referred to as “standard” haplotype format, i.e. chromosomes are given in rows and each marker corresponds to a column. The first column is reserved for identifiers of the individual haplotypes.

cat(readLines("example1.hap"), sep = "\n")
> HG1_1 0 0 1 1 0 0 0 0 0 1 1
> HG1_2 0 0 0 0 1 0 1 0 1 0 0
> HG2_1 0 1 0 1 0 0 1 0 0 0 1
> HG2_2 0 0 1 0 0 0 0 1 1 0 0
> HG3_1 0 0 0 0 0 1 0 0 0 0 1
> HG3_2 0 0 0 0 0 1 0 0 0 0 1
> HG4_1 1 0 0 0 0 1 0 0 0 0 0
> HG4_2 0 0 0 0 0 1 0 0 0 0 0

The calculation of the “integrated” statistics such as iHH and iES requires a measure for the distance between markers, which is provided by the file example1.map. It relates the markers with their chromosomal positions. The latter can represent physical positions (base pairs) or genetic positions derived from estimated recombination rates.

cat(readLines("example1.map"), sep = "\n")
> rs1 chr1 10000 0 1
> rs2 chr1 20000 0 1
> rs3 chr1 30000 0 1
> rs4 chr1 40000 0 1
> rs5 chr1 50000 0 1
> rs6 chr1 60000 0 1
> rs7 chr1 70000 0 1
> rs8 chr1 80000 0 1
> rs9 chr1 90000 0 1
> rs10 chr1 100000 0 1
> rs11 chr1 110000 0 1

The file in variant call format combines the information of haplotype and map files. However, vcf codes alleles with respect to a reference sequence, not with respect to ancestry status. Information about ancestry can be added using a key of the INFO field, conventionally named AA. For instance, in the file example1.vcf, the reference alleles of markers rs6 and rs11 differ from the ancestral alleles.

cat(readLines("example1.vcf"), sep = "\n")
> ##fileformat=VCFv4.2
> ##reference=NCBI38
> ##contig=<ID=chr1,length=120000>
> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
> #CHROM    POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG1 HG2 HG3 HG4
> chr1  10000   rs1 G   T   100 PASS    AA=G    GT  0|0 0|0 0|0 1|0
> chr1  20000   rs2 G   A   100 PASS    AA=G    GT  0|0 1|0 0|0 0|0
> chr1  30000   rs3 T   C   100 PASS    AA=T    GT  1|0 0|1 0|0 0|0
> chr1  40000   rs4 A   C   100 PASS    AA=A    GT  1|0 1|0 0|0 0|0
> chr1  50000   rs5 T   G   100 PASS    AA=T    GT  0|1 0|0 0|0 0|0
> chr1  60000   rs6 G   A   100 PASS    AA=A    GT  1|1 1|1 0|0 0|0
> chr1  70000   rs7 C   T   100 PASS    AA=C    GT  0|1 1|0 0|0 0|0
> chr1  80000   rs8 C   T   100 PASS    AA=C    GT  0|0 0|1 0|0 0|0
> chr1  90000   rs9 T   A   100 PASS    AA=T    GT  0|1 0|1 0|0 0|0
> chr1  100000  rs10    A   G   100 PASS    AA=A    GT  1|0 0|0 0|0 0|0
> chr1  110000  rs11    C   T   100 PASS    AA=T    GT  0|1 0|1 0|0 1|1

2.1.2 Input options

In order to work with the data, it has to be transformed to an internal representation, namely an object of class haplohh. Let us first use the pair of files example1.hap and example1.map. In this case the allele coding in the haplotype file is conform to the 01 format (defined in the main vignette), hence setting allele_coding = "01" is appropriate. This is also the format which is used internally.

hh <- data2haplohh(hap_file = "example1.hap",
                   map_file = "example1.map",
                   allele_coding = "01")
> * Reading input file(s) *
> Map info: 11 markers declared for chromosome chr1 .
> Haplotype input file in standard format assumed.
> Assume that alleles in the haplotype file are coded as:
> NA or '.': missing value
> 0: ancestral allele
> 1, 2, ...: derived allele(s).
> * Filtering data *
> Discard markers genotyped on less than 100 % of haplotypes.
> No marker discarded.
> Data consists of 8 haplotypes and 11 markers.
> Number of mono-, bi-, multi-allelic markers:
> 1 2 
> 0 11

This data set is constructed such that setting allele_coding = "map" or allele_coding = "none" yields an identical haplohh object. In the first case, the fourth column of the map file, standing for the ancestral allele, contains always 0, hence for every marker in the haplotype file the “allele” 0 is assigned by the map file as ancestral and thus recoded by 0; the fifth column in the map file delineates the derived alleles of each marker, here again always 1, hence the “allele” 1 of the haplotype file is translated to the first derived allele, coded by 1. In the second case, which causes a coding by alpha-numeric order, we end up again by just replacing 0 by 0 and 1 by 1.

hh_map <- data2haplohh(hap_file = "example1.hap",
                       map_file = "example1.map",
                       allele_coding = "map",
                       verbose = FALSE)
identical(hh, hh_map)
> [1] TRUE
hh_none <- data2haplohh(hap_file = "example1.hap",
                        map_file = "example1.map",
                        allele_coding = "none",
                        verbose = FALSE)
identical(hh, hh_none)
> [1] TRUE

Finally, we use the vcf file as input. No map file has to be specified. The option polarize_vcf is set by default to TRUE, assuming ancestral alleles are given in the INFO field by key AA. The function data2haplohh then recodes the markers where reference and ancestral alleles differ. If the option is turned off, coding with respect to the reference sequence is taken and the information about ancestry lost.

Again, one yields the same internal representation of the data:

hh_vcf <- data2haplohh(hap_file = "example1.vcf",
                       vcf_reader = "data.table",
                       verbose = FALSE)
identical(hh, hh_vcf)
> [1] TRUE

2.2 Calculations and visualizations

2.2.1 Visualizing the sequences

The haplohh-object can be visualized by a simple plot that shows ancestral alleles in blue and derived ones in red:

plot(hh)
Graphical output of the plot.haplohh() function

Figure 2.1: Graphical output of the plot.haplohh() function

Note, however, that this kind of plot is intended only for relatively small data sets.

2.2.2 EHH

We start with the computation of (allele-specific) EHH which is used for the detection of selection within a single, supposedly homogeneous, population.

We restrict our attention to the central marker with id rs6. We start by computing EHH for its alleles. We include the number of evaluated haplotypes into the output which tells us merely that 4 haplotypes are evaluated for each allele, in agreement with their frequencies. Note that the integral iHH_D is not computed, since the EHH_D is still above the threshold limehh (default value 0.05) at the borders of the chromosome and the option discard_integration_at_border is by default TRUE.

res <- calc_ehh(hh, mrk = "rs6", include_nhaplo = TRUE)
res
> An object of class "ehh"
> [[1]]
> [1] "rs6"
> 
> [[2]]
> FREQ_A FREQ_D 
>    0.5    0.5 
> 
> [[3]]
>      POSITION     EHH_A     EHH_D NHAPLO_A NHAPLO_D
> rs1     10000 0.0000000 0.5000000        0        4
> rs2     20000 0.0000000 1.0000000        0        4
> rs3     30000 0.0000000 1.0000000        4        4
> rs4     40000 0.1666667 1.0000000        4        4
> rs5     50000 0.5000000 1.0000000        4        4
> rs6     60000 1.0000000 1.0000000        4        4
> rs7     70000 0.3333333 1.0000000        4        4
> rs8     80000 0.1666667 1.0000000        4        4
> rs9     90000 0.0000000 1.0000000        4        4
> rs10   100000 0.0000000 1.0000000        0        4
> rs11   110000 0.0000000 0.3333333        0        4
> 
> [[4]]
>    IHH_A    IHH_D 
> 18816.67       NA

The individual values of the table can be easily checked “by hand” using the internal data representation of the haplotypes:

haplo(hh)
>       rs1 rs2 rs3 rs4 rs5 rs6 rs7 rs8 rs9 rs10 rs11
> HG1_1   0   0   1   1   0   0   0   0   0    1    1
> HG1_2   0   0   0   0   1   0   1   0   1    0    0
> HG2_1   0   1   0   1   0   0   1   0   0    0    1
> HG2_2   0   0   1   0   0   0   0   1   1    0    0
> HG3_1   0   0   0   0   0   1   0   0   0    0    1
> HG3_2   0   0   0   0   0   1   0   0   0    0    1
> HG4_1   1   0   0   0   0   1   0   0   0    0    0
> HG4_2   0   0   0   0   0   1   0   0   0    0    0

The chromosomes HG1_1, HG1_2, HG_2_1 and HG_2_2 carry the ancestral allele of marker rs6 and form the “starting set” of shared haplotypes to be extended along the chromosome. By definition the starting set is homozygous and EHH equal to 1 at the focal marker. The shared haplotypes cover so far a single chromosomal position, namely that of the focal marker. We extend them to the next marker on the right, rs7, where two chromosomes carry one allele and the other two another allele (ancestry status matters only at the focal marker!). Hence the initial set of four can be subdivided into two sets of extended shared haplotypes, namely {HG1_1, HG2_2} sharing 00 and {HG1_2, HG2_1} sharing 01. They cover now the region from rs6 til rs7. Plugging the numbers into the formula for EHH (see main vignette) yields \[ EHH^a_{rs6,rs7}=\frac{1}{n_a(n_a-1)}\sum_{k=1}^{K^a_{rs6,rs7}}n_k(n_k-1)=\frac{1}{4\cdot 3}(2\cdot1+2\cdot1)=\frac{1}{3}\;. \] One marker further to the right, at rs8, the first set can be partitioned into two, with HG1_1 having extended haplotype 000, and HG2_2 001. The second set {HG1_2,HG2_1} remains homozygous, now sharing the extended haplotype 010. Thus, at marker rs8 the corresponding EHH value yields \[ EHH^a_{rs6,rs8}=\frac{1}{n_a(n_a-1)}\sum_{k=1}^{K^a_{rs6,rs8}}n_k(n_k-1)=\frac{1}{4\cdot 3}(1\cdot0+2\cdot1+1\cdot0)=\frac{1}{6}\;. \] At marker rs9, chromosome HG1_2 carries allele 1 and HG2_1 allele 0. Hence the extended haplotypes now differ between all four considered chromosomes and EHH becomes zero. An analogous, but independent, calculation can be done for the markers to the left of the focal marker.

The starting set for the derived allele consists of {HG3_1, HG3_2, HG4_1 and HG4_2}. Extending to the right, the corresponding haplotypes remain homozygous and consequently the set is not split until marker rs11. In particular we have \[EHH^d_{rs6,rs7}=EHH^d_{rs6,rs8}=\frac{1}{4\cdot3}\sum_{k=1}^14\cdot3=1\] and essentially the same situation on the left side of the focal marker.

The corresponding plot (Figure 2.2) shows that EHH of the ancestral allele decays more rapidly than that of the derived allele.

plot(res)
Graphical output of the plot.ehh() function

Figure 2.2: Graphical output of the plot.ehh() function

Assume now that the haplotypes are not phased. That means, at each marker for which a diploid individual is heterozygous, it is unknown which allele belongs to chromosome ‘1’ and which to chromosome ‘2’. In this case the concept of extended haplotypes is not well-defined across individuals. However, we can still measure the decay of extended homozygosity within individuals. This is done by setting option phased to FALSE while assuming that the haplotypes in the input files are ordered as pairs belonging to individuals.

res <- calc_ehh(hh, 
                mrk = "rs6", 
                include_nhaplo = TRUE, 
                phased = FALSE)
res
> An object of class "ehh"
> [[1]]
> [1] "rs6"
> 
> [[2]]
> FREQ_A FREQ_D 
>    0.5    0.5 
> 
> [[3]]
>      POSITION EHH_A EHH_D NHAPLO_A NHAPLO_D
> rs1     10000   0.0   0.5        0        4
> rs2     20000   0.0   1.0        0        4
> rs3     30000   0.0   1.0        0        4
> rs4     40000   0.0   1.0        4        4
> rs5     50000   0.5   1.0        4        4
> rs6     60000   1.0   1.0        4        4
> rs7     70000   0.0   1.0        4        4
> rs8     80000   0.0   1.0        0        4
> rs9     90000   0.0   1.0        0        4
> rs10   100000   0.0   1.0        0        4
> rs11   110000   0.0   1.0        0        4
> 
> [[4]]
>   IHH_A   IHH_D 
> 13537.5      NA

Individuals which are heterozygous for the focal marker are excluded from the calculation. In our small sample all 4 individuals are homozygous at marker rs6; HG1 and HG2 for the ancestral allele and HG3 and HG4 for the derived allele, hence no haplotype is excluded. Note that in realistic data a substantial number of haplotypes is expected to belong to heterozygous individuals, cf. the Hardy-Weinberg principle.

Again we can retrace the calculations by hand keeping in mind that EHH is now estimated by the fraction of homozygous individuals at each marker, see the corresponding formula in the main vignette.

Extending the shared haplotypes to the right, we find that both individuals HG1 and HG2 become heterozygous already at marker rs7 and hence EHH at this position becomes 0. Extending to the left, HG1 becomes heterozygous at marker rs5, while HG2 is still homozygous in the region spanning from rs6 to rs5. Hence the proportion of homozygous individuals at this marker is \(\frac{1}{2}\). At marker rs4 the second individual becomes heterozygous, too, and EHH yields 0.

By contrast, the individuals carrying the derived focal allele are homozygous for the entire chromosome except for marker rs1 where HG4 becomes heterozygous.

Figure 2.3 shows again the difference between EHH of the two core alleles:

plot(res)
Graphical output of the plot.ehh() function

Figure 2.3: Graphical output of the plot.ehh() function

2.2.3 EHHS

Next, we calculate EHH per Site or EHHS which forms the basis for cross-population comparisons. The options of the corresponding function calc_ehhs() are essentially identical to those of the function calc_ehh(). The output contains EHHS and its normalized version nEHHS, distinguished merely by a factor such that the latter becomes 1 at the focal marker. We toggle the option discard_integration_at_border to FALSE in order to obtain the two corresponding integrals iES and inES even though the values EHHS and nEHHS have not fallen below the default threshold of 0.05 at the first resp. last marker. Note that elements 3 and 4 of the list can be addressed by res$IES and res$INES, too.

res <- calc_ehhs(hh,
                 mrk = "rs6", 
                 include_nhaplo = TRUE,
                 discard_integration_at_border = FALSE)
res
> An object of class "ehhs"
> [[1]]
> [1] "rs6"
> 
> [[2]]
>      POSITION       EHHS     NEHHS NHAPLO
> rs1     10000 0.10714286 0.2500000      8
> rs2     20000 0.21428571 0.5000000      8
> rs3     30000 0.21428571 0.5000000      8
> rs4     40000 0.25000000 0.5833333      8
> rs5     50000 0.32142857 0.7500000      8
> rs6     60000 0.42857143 1.0000000      8
> rs7     70000 0.28571429 0.6666667      8
> rs8     80000 0.25000000 0.5833333      8
> rs9     90000 0.21428571 0.5000000      8
> rs10   100000 0.21428571 0.5000000      8
> rs11   110000 0.07142857 0.1666667      8
> 
> [[3]]
> [1] 19821.43
> 
> [[4]]
> [1] 52916.67

The calculations can be retraced easily. In fact, the only distinction to EHH is that all haplotypes are considered and not only those with a specific allele at the focal marker. Using visual inspection of the haplotype data file (see above), we can plug the numbers into the formula for EHHS (see main vignette): \[\mathrm{EHHS}_{rs6,rs6}=\frac{1}{n_s(n_s-1)}\left(\sum\limits_{k=1}^{K_{rs6,rs6}}n_k(n_k-1)\right)=\frac{1}{8\cdot7}(4\cdot3+4\cdot3)=\frac{3}{7}\] \[\mathrm{EHHS}_{rs6,rs7}=\frac{1}{n_s(n_s-1)}\left(\sum\limits_{k=1}^{K_{rs6,rs7}}n_k(n_k-1)\right)=\frac{1}{8\cdot7}(2\cdot1+2\cdot1+4\cdot3)=\frac{2}{7}\] \[\mathrm{EHHS}_{rs6,rs8}=\frac{1}{n_s(n_s-1)}\left(\sum\limits_{k=1}^{K_{rs6,rs8}}n_k(n_k-1)\right)=\frac{1}{8\cdot7}(1\cdot0+2\cdot1+1\cdot0+4\cdot3)=\frac{1}{4}\;.\]

By default, the following command shows the (un-normalized) EHHS values as in Figure 2.4. In order to draw the normalized values one can toggle the option nehhs to TRUE.

plot(res)
Graphical output of the plot.ehhs() function

Figure 2.4: Graphical output of the plot.ehhs() function

2.2.4 “Genome-wide” scan

In order to find out whether marker rs6 is “special”, we compute the integrals over EHH and EHHS, yielding resp. iHH and iES, for all markers, hence performing a “genomic scan”. Since we are dealing with a single population, only the iHH values are of interest. We can see that IHH_A and IHH_D are particularly different for the central marker.

res.scan <- scan_hh(hh, discard_integration_at_border = FALSE)
res.scan
>       CHR POSITION FREQ_A FREQ_D NHAPLO_A NHAPLO_D    IHH_A    IHH_D       IES
> rs1  chr1    10000  0.875  0.125        7        1 21992.26     0.00 15295.238
> rs2  chr1    20000  0.875  0.125        7        1 36190.48     0.00 25892.857
> rs3  chr1    30000  0.750  0.250        6        2 45000.00 23512.50 22678.571
> rs4  chr1    40000  0.750  0.250        6        2 47666.67 28025.00 24285.714
> rs5  chr1    50000  0.875  0.125        7        1 33333.33     0.00 23750.000
> rs6  chr1    60000  0.500  0.500        4        4 18816.67 89166.67 19821.429
> rs7  chr1    70000  0.750  0.250        6        2 45333.33 28025.00 23035.714
> rs8  chr1    80000  0.875  0.125        7        1 38571.43     0.00 27678.571
> rs9  chr1    90000  0.750  0.250        6        2 50666.67 23512.50 25714.286
> rs10 chr1   100000  0.875  0.125        7        1 35000.00     0.00 25000.000
> rs11 chr1   110000  0.500  0.500        4        4 25075.00 25833.33  8032.143
>          INES
> rs1  21992.26
> rs2  36190.48
> rs3  43437.50
> rs4  46250.00
> rs5  33333.33
> rs6  52916.67
> rs7  44062.50
> rs8  38571.43
> rs9  48750.00
> rs10 35000.00
> rs11 25416.67

In order to evaluate the differences, the log ratio between IHH_A and IHH_D is calculated, yielding the, as yet, un-normalized values unihs. In general these should be normalized bin-wise, grouping markers with the same frequency of the derived allele. With so few markers, though, it is appropriate to normalize over all markers at once by setting the bin number to 1.

ihs <- ihh2ihs(res.scan, freqbin = 1, verbose = FALSE)
ihs
> $ihs
>       CHR POSITION        IHS  LOGPVALUE
> rs1  chr1    10000         NA         NA
> rs2  chr1    20000         NA         NA
> rs3  chr1    30000  0.5813076 0.25101145
> rs4  chr1    40000  0.4464354 0.18357125
> rs5  chr1    50000         NA         NA
> rs6  chr1    60000 -1.9389615 1.27979084
> rs7  chr1    70000  0.3890668 0.15662597
> rs8  chr1    80000         NA         NA
> rs9  chr1    90000  0.7168779 0.32472642
> rs10 chr1   100000         NA         NA
> rs11 chr1   110000 -0.1947262 0.07283128
> 
> $frequency.class
>             N_MRK MEAN_UNIHS  SD_UNIHS  LOWER_QT  UPPER_QT
> [0.05,0.95)    11  0.1405648 0.8748648 -1.365018 0.7529103

We can use calc_candidate_regions() to delineate regions with “extreme” scores hinting at selection. Because we have only a few markers, we choose an unrealistically low threshold of 1.5 and set the window size to 1, which means that no windowing is performed and only the positions of extreme markers returned.

cr <- calc_candidate_regions(ihs, threshold = 1.5, ignore_sign = TRUE, window_size = 1)
cr
>      CHR START   END  EXTR_MRK
> rs6 chr1 60000 60000 -1.938961

Under the assumption that most sites evolve neutrally, the standardized iHS values should follow a normal distribution with the sites under selection as outliers.

Obviously we do not have enough markers to fit a distribution, but a “genome-wide” plot of the ihs values shows clearly that the central marker is rather an outlier (as much as is possible for such a small sample), see Figure 2.5.

manhattanplot(ihs, threshold = c(-1.5,1.5), cr = cr, ylim = c(-2.5,2.5), pch = 20)
Graphical output of the manhattanplot() function

Figure 2.5: Graphical output of the manhattanplot() function

2.2.5 Furcations and haplotype length

A furcation plot represents a more fine-grained visualization of the homozygosity decay. In particular, individual haplotypes can be discerned which may instigate further investigations. The labels plotted in Figure 2.6 are set in bold face, if the branches with which they are associated encompass further haplotypes.

f <- calc_furcation(hh, mrk = "rs6")
# set equal plot margins on left and right side and save old ones
oldpar <- par(mar = (c(5, 3, 4, 3) + 0.1))
plot(f,
     # increase line width
     lwd = 1.5,
     # set habplotype identifiers as labels
     hap.names = hap.names(hh),
     # find a place for the legend ...
     legend.xy.coords = c(60000, 0.2)
)
# reset old margins
par(oldpar)
Graphical output of the plot.furcation() function

Figure 2.6: Graphical output of the plot.furcation() function

A furcation diagram consists of trees for each allele and both sides (“left” and “right”) of the marker. The individual trees can be exported into a string in Newick format to be rendered by external programs, e.g. the phylogenetic R-package ape, see Figure 2.7.

newick <- as.newick(f, 
                    allele = 0, 
                    side = "left", 
                    hap.names = hap.names(hh))
newick
> [1] "(((HG2_2:0,(HG2_1:0,HG1_1:0):10000):10000,HG1_2:0):10000);"
library(ape)
tree <- ape::read.tree(text = newick)
plot(tree, 
     direction = "leftwards", 
     edge.color = "blue",
     underscore = TRUE,
     no.margin = TRUE)
Graphical output of the plot.phylo() function of package ape

Figure 2.7: Graphical output of the plot.phylo() function of package ape

The end points of shared extended haplotypes can be defined as the “last split” in a furcation, i.e. the positions until which at least two different chromosomes of the sample are homozygous. Calculation of shared haplotype length and its visualization in Figure 2.8 are called by:

h <- calc_haplen(f)
plot(h, 
     hap.names = hap.names(hh), 
     legend.xy.coords = c(70000,1.15))
Graphical output of the plot.haplen() function

Figure 2.8: Graphical output of the plot.haplen() function

In case of unphased haplotypes, furcations can only occur within individuals which limits the informative value of furcation diagrams as in Figure 2.9.

f <- calc_furcation(hh, mrk = "rs6", phased = FALSE)
# set equal plot margins on left and right side and save old ones
oldpar <- par(mar = (c(5, 3, 4, 3) + 0.1))
plot(f,
     # increase line width
     lwd = 1.5,
     # set haplotype identifiers as labels
     hap.names = hap.names(hh),
     # no place for a legend inside the plot
     legend.xy.coords = "none")
# reset old margins
par(oldpar)
Graphical output of the plot.haplen() function

Figure 2.9: Graphical output of the plot.haplen() function

Nevertheless, the length of shared haplotypes, now identical to the ranges of individual homozygosity, can be calculated as before to yield Figure 2.10.

h <- calc_haplen(f)
plot(h, hap.names = hap.names(hh))
Graphical output of the plot.haplen() function

Figure 2.10: Graphical output of the plot.haplen() function

3 Example data set 2

The second data is an extension of the first, containing four additional haplotypes, some missing values (including for the central marker rs6), and two markers with three alleles, namely rs6 and rs9.

3.1 Input

3.1.1 Data files

Let us first inspect the file example2.hap. Missing values are here marked by a point, but NA could be used, too.

cat(readLines("example2.hap"),sep = "\n")
> HG1_1 0 0 . 1 0 0 0 0 0 1 1
> HG1_2 0 0 0 0 1 0 1 0 1 0 0
> HG2_1 0 1 . 1 0 0 1 0 0 0 1
> HG2_2 0 0 1 0 0 0 0 1 . . 0
> HG3_1 0 0 0 0 0 1 0 0 0 0 1
> HG3_2 0 0 0 0 0 1 0 0 0 . 1
> HG4_1 1 0 0 0 0 . 0 0 0 0 0
> HG4_2 0 0 0 0 0 1 0 0 0 0 0
> HG5_1 0 0 0 0 0 2 0 0 2 1 0
> HG5_2 0 0 0 1 0 2 0 0 . 0 0
> HG6_1 0 . 0 0 0 2 0 0 1 0 0
> HG6_2 0 0 0 0 . 2 0 0 0 . 0

If the map information file is used for allele recoding, the fifth column can accommodate multiple (derived) alleles, separated by a comma, as in example2.map:

cat(readLines("example2.map"),sep = "\n")
> rs1 chr1 10000 0 1
> rs2 chr1 20000 0 1
> rs3 chr1 30000 0 1
> rs4 chr1 40000 0 1
> rs5 chr1 50000 0 1
> rs6 chr1 60000 0 1,2
> rs7 chr1 70000 0 1
> rs8 chr1 80000 0 1
> rs9 chr1 90000 0 1,2
> rs10 chr1 100000 0 1
> rs11 chr1 110000 0 1

The file example2.vcf combines the information of haplotype and map file. It contains an additional marker rs12 which lacks information about the ancestral allele. This marker gets excluded since by default polarize_vcf = TRUE, hence the function tries to polarize all markers. If the parameter is set to FALSE, the marker is included, but ancestry information is lost for all markers.

Furthermore, the ancestral allele for the marker rs1 is now given by a lower case latter, which typically means that it is of “low confidence”. This marker gets included since capitalize_AA = TRUE by default. If this option is turned off, the marker cannot be polarized and is discarded.

cat(readLines("example2.vcf"), sep = "\n")
> ##fileformat=VCFv4.2
> ##reference=NCBI38
> ##contig=<ID=chr1,length=120000>
> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
> #CHROM    POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG1 HG2 HG3 HG4 HG5 HG6
> chr1  10000   rs1 G   T   100 PASS    AA=g    GT  0|0 0|0 0|0 1|0 0|0 0|0
> chr1  20000   rs2 G   A   100 PASS    AA=G    GT  0|0 1|0 0|0 0|0 0|0 .|0
> chr1  30000   rs3 T   C   100 PASS    AA=T    GT  .|0 .|1 0|0 0|0 0|0 0|0
> chr1  40000   rs4 A   C   100 PASS    AA=A    GT  1|0 1|0 0|0 0|0 0|1 0|0
> chr1  50000   rs5 T   G   100 PASS    AA=T    GT  0|1 0|0 0|0 0|0 0|0 0|.
> chr1  60000   rs6 G   A,C 100 PASS    AA=A    GT  1|1 1|1 0|0 .|0 2|2 2|2
> chr1  70000   rs7 C   T   100 PASS    AA=C    GT  0|1 1|0 0|0 0|0 0|0 0|0
> chr1  80000   rs8 C   T   100 PASS    AA=C    GT  0|0 0|1 0|0 0|0 0|0 0|0
> chr1  90000   rs9 T   A,G 100 PASS    AA=T    GT  0|1 0|. 0|0 0|0 2|. 1|0
> chr1  100000  rs10    A   G   100 PASS    AA=A    GT  1|0 0|. 0|. 0|0 1|0 0|.
> chr1  110000  rs11    C   T   100 PASS    AA=T    GT  0|1 0|1 0|0 1|1 1|1 1|1
> chr1  120000  rs12    T   G   100 PASS    AA=.    GT  0|0 0|0 1|0 0|0 0|0 0|0

3.1.2 Input options

Alleles in the haplotype input file are already coded by the numbers 0,1 and 2 hence conform to the coding "01" explained in the main vignette.

hh <- data2haplohh(hap_file = "example2.hap",
                   map_file = "example2.map",
                   allele_coding = "01")
> * Reading input file(s) *
> Map info: 11 markers declared for chromosome chr1 .
> Haplotype input file in standard format assumed.
> Assume that alleles in the haplotype file are coded as:
> NA or '.': missing value
> 0: ancestral allele
> 1, 2, ...: derived allele(s).
> * Filtering data *
> Discard markers genotyped on less than 100 % of haplotypes.
> 6 markers discarded.
> 5 markers remaining.
> Data consists of 12 haplotypes and 5 markers.
> Number of mono-, bi-, multi-allelic markers:
> 1 2 
> 0 5

The output tells us that 6 markers with missing values have been eliminated. This is due to the pre-set filter option of min_perc_geno.mrk = 100 which excludes all markers that are not fully genotyped. If we do not want to loose them, we can lower the corresponding threshold, e.g. to 50%. Note that setting this condition to 0 is not allowed since the package cannot handle markers with no data attached.

hh <- data2haplohh(hap_file = "example2.hap",
                   map_file = "example2.map",
                   allele_coding = "01",
                   min_perc_geno.mrk = 50)
> * Reading input file(s) *
> Map info: 11 markers declared for chromosome chr1 .
> Haplotype input file in standard format assumed.
> Assume that alleles in the haplotype file are coded as:
> NA or '.': missing value
> 0: ancestral allele
> 1, 2, ...: derived allele(s).
> * Filtering data *
> Discard markers genotyped on less than 50 % of haplotypes.
> No marker discarded.
> Data consists of 12 haplotypes and 11 markers.
> Number of mono-, bi-, multi-allelic markers:
> 1 2 3 
> 0 9 2

Like with the first example, the data set is constructed such that the allele coding options "01", "map" and "none" applied to the pair of haplotype and map file or input from the vcf file lead to identical haplohh objects:

hh_map <- data2haplohh(hap_file = "example2.hap",
                       map_file = "example2.map",
                       allele_coding = "map",
                       min_perc_geno.mrk = 50,
                       verbose = FALSE)
identical(hh, hh_map)
> [1] TRUE
hh_none <- data2haplohh(hap_file = "example2.hap",
                        map_file = "example2.map",
                        allele_coding = "none",
                        min_perc_geno.mrk = 50,
                        verbose = FALSE)
identical(hh, hh_none)
> [1] TRUE
hh_vcf <- data2haplohh(hap_file = "example2.vcf",
                       min_perc_geno.mrk = 50,
                       vcf_reader = "data.table",
                       verbose = FALSE)
identical(hh, hh_vcf)
> [1] TRUE

3.2 Calculations and visualizations

3.2.1 Visualizing the sequences

The haplohh-object can be visualized by a simple plot command:

plot(hh)
Graphical output of the plot.furcation() function

Figure 3.1: Graphical output of the plot.furcation() function

3.2.2 EHH

As with the first example data set, the function calc_ehh() reports the EHH values for each allele of the focal marker of which there are now three. We set the option include_zero_values to TRUE to include the remaining marker rs11 in the table (and subsequent plot) where all three EHH values reach zero.

res <- calc_ehh(hh, 
                mrk = "rs6", 
                include_nhaplo = TRUE,
                include_zero_values = TRUE,
                discard_integration_at_border = FALSE )
res
> An object of class "ehh"
> [[1]]
> [1] "rs6"
> 
> [[2]]
>    FREQ_A   FREQ_D1   FREQ_D2 
> 0.3636364 0.2727273 0.3636364 
> 
> [[3]]
>      POSITION     EHH_A EHH_D1    EHH_D2 NHAPLO_A NHAPLO_D1 NHAPLO_D2
> rs1     10000 0.0000000      1 0.0000000        0         3         0
> rs2     20000 0.0000000      1 0.0000000        0         3         2
> rs3     30000 0.0000000      1 0.3333333        2         3         3
> rs4     40000 0.1666667      1 0.3333333        4         3         3
> rs5     50000 0.5000000      1 1.0000000        4         3         3
> rs6     60000 1.0000000      1 1.0000000        4         3         4
> rs7     70000 0.3333333      1 1.0000000        4         3         4
> rs8     80000 0.1666667      1 1.0000000        4         3         4
> rs9     90000 0.0000000      1 0.0000000        4         3         3
> rs10   100000 0.0000000      1 0.0000000        0         2         0
> rs11   110000 0.0000000      0 0.0000000        0         2         0
> 
> [[4]]
>    IHH_A   IHH_D1   IHH_D2 
> 18816.67 90012.50 43216.67

Note that the derived alleles are ordered by their internal coding.

Figure 3.2 shows clearly that the first derived allele has a strong extended homozygosity while the second derived allele is not that different from the ancestral allele.

plot(res)
Graphical output of the plot.ehh() function

Figure 3.2: Graphical output of the plot.ehh() function

3.2.3 EHHS

For EHHS the output format does not depend on the number of core alleles. The values however, do, since more alleles at the focal marker entail a lower overall homozygosity.

res <- calc_ehhs(hh,
                 mrk = "rs6", 
                 include_nhaplo = TRUE,
                 include_zero_values = TRUE,
                 discard_integration_at_border = FALSE)
res
> An object of class "ehhs"
> [[1]]
> [1] "rs6"
> 
> [[2]]
>      POSITION       EHHS     NEHHS NHAPLO
> rs1     10000 0.14285714 0.6000000      7
> rs2     20000 0.14285714 0.6000000      7
> rs3     30000 0.14285714 0.5714286      8
> rs4     40000 0.11111111 0.4166667     10
> rs5     50000 0.20000000 0.7500000     10
> rs6     60000 0.27272727 1.0000000     11
> rs7     70000 0.20000000 0.7333333     11
> rs8     80000 0.18181818 0.6666667     11
> rs9     90000 0.06666667 0.2500000     10
> rs10   100000 0.00000000 0.1000000      9
> rs11   110000 0.00000000 0.0000000      9
> 
> [[3]]
> [1] 9582.161
> 
> [[4]]
> [1] 49005.95

Note that the number of evaluated haplotypes NHAPLO decreases with distance to the focal marker due to missing values which lead at each calculation step to subsequent removals of the respective chromosomes. This can sometimes yield a transient increase of EHHS (and in general, EHH, too) as can be seen at position 30 kb in Figure 3.3.

plot(res)
Graphical output of the plot.ehh() function

Figure 3.3: Graphical output of the plot.ehh() function

3.2.4 “Genome-wide” scan

The function scan_hh() does not evaluate EHH for every allele of a focal marker, but chooses, besides the mandatory ancestral allele, the derived allele with highest frequency.

scan <- scan_hh(hh, discard_integration_at_border = FALSE)
scan
>       CHR POSITION    FREQ_A     FREQ_D NHAPLO_A NHAPLO_D    IHH_A    IHH_D
> rs1  chr1    10000 0.9166667 0.08333333       11        1 25045.24     0.00
> rs2  chr1    20000 0.9090909 0.09090909       10        1 32750.99     0.00
> rs3  chr1    30000 0.9000000 0.10000000        9        1 39881.94     0.00
> rs4  chr1    40000 0.7500000 0.25000000        9        3 38453.37 21383.33
> rs5  chr1    50000 0.9090909 0.09090909       10        1 29680.16     0.00
> rs6  chr1    60000 0.3636364 0.36363636        4        4 18816.67 43216.67
> rs7  chr1    70000 0.8333333 0.16666667       10        2 28960.52 28025.00
> rs8  chr1    80000 0.9166667 0.08333333       11        1 27370.13     0.00
> rs9  chr1    90000 0.7000000 0.20000000        7        2 40142.86 33012.50
> rs10 chr1   100000 0.7777778 0.22222222        7        2 24452.38  9025.00
> rs11 chr1   110000 0.6666667 0.33333333        8        4 14291.67 13037.50
>            IES     INES
> rs1  19362.121 25045.24
> rs2  25023.593 32750.99
> rs3  30057.143 39881.94
> rs4  23654.672 38145.51
> rs5  22727.201 29680.16
> rs6   9582.161 49005.95
> rs7  17869.697 28743.68
> rs8  21479.798 27370.13
> rs9  15730.159 39857.95
> rs10 11326.984 23125.00
> rs11  5890.837 14158.23

If alleles are not polarized, the corresponding option should be set to FALSE which replaces “ancestral” and “derived” by “major” and “minor” (hence the two most frequent) alleles.

scan_unpol <- scan_hh(hh, 
                      discard_integration_at_border = FALSE,
                      polarized = FALSE)
scan_unpol
>       CHR POSITION  FREQ_MAJ   FREQ_MIN NHAPLO_MAJ NHAPLO_MIN  IHH_MAJ  IHH_MIN
> rs1  chr1    10000 0.9166667 0.08333333         11          1 25045.24     0.00
> rs2  chr1    20000 0.9090909 0.09090909         10          1 32750.99     0.00
> rs3  chr1    30000 0.9000000 0.10000000          9          1 39881.94     0.00
> rs4  chr1    40000 0.7500000 0.25000000          9          3 38453.37 21383.33
> rs5  chr1    50000 0.9090909 0.09090909         10          1 29680.16     0.00
> rs6  chr1    60000 0.3636364 0.36363636          4          4 18816.67 43216.67
> rs7  chr1    70000 0.8333333 0.16666667         10          2 28960.52 28025.00
> rs8  chr1    80000 0.9166667 0.08333333         11          1 27370.13     0.00
> rs9  chr1    90000 0.7000000 0.20000000          7          2 40142.86 33012.50
> rs10 chr1   100000 0.7777778 0.22222222          7          2 24452.38  9025.00
> rs11 chr1   110000 0.6666667 0.33333333          8          4 14291.67 13037.50
>            IES     INES
> rs1  19362.121 25045.24
> rs2  25023.593 32750.99
> rs3  30057.143 39881.94
> rs4  23654.672 38145.51
> rs5  22727.201 29680.16
> rs6   9582.161 49005.95
> rs7  17869.697 28743.68
> rs8  21479.798 27370.13
> rs9  15730.159 39857.95
> rs10 11326.984 23125.00
> rs11  5890.837 14158.23

We continue with the polarized scan and calculate standardized log ratios of the iHH values without any binning.

ihs <- ihh2ihs(scan, freqbin = 1)
> Discard focal markers with Minor Allele Frequency equal to or below 0.05 .
> No marker discarded.
ihs
> $ihs
>       CHR POSITION         IHS   LOGPVALUE
> rs1  chr1    10000          NA          NA
> rs2  chr1    20000          NA          NA
> rs3  chr1    30000          NA          NA
> rs4  chr1    40000  0.66462142 0.295598363
> rs5  chr1    50000          NA          NA
> rs6  chr1    60000 -1.64513452 1.000251645
> rs7  chr1    70000 -0.23757455 0.090331086
> rs8  chr1    80000          NA          NA
> rs9  chr1    90000  0.02742085 0.009606056
> rs10 chr1   100000  1.33214189 0.737991576
> rs11 chr1   110000 -0.14147509 0.051834262
> 
> $frequency.class
>             N_MRK MEAN_UNIHS  SD_UNIHS   LOWER_QT  UPPER_QT
> [0.05,0.95)    11  0.1787203 0.6140553 -0.7234433 0.9454923

The “genome-wide” ihs values are depicted in Figure 3.4.

manhattanplot(ihs, threshold = c(-1.5, 1.5), ylim = c(-2.5,2.5), pch = 20)
Graphical output of the manhattanplot() function

Figure 3.4: Graphical output of the manhattanplot() function

Now we know that actually the first derived allele at marker rs6 is of interest, but it is not present in the scan because it is not the major derived allele. How can we modify the scan to include it?

It is tempting to combine the two derived alleles by overwriting the internal coding and yielding a bi-allelic marker. This, however, entails a completely new pattern of extended haplotypes and a distortion of the statistics. Another approach may be to exclude the chromosomes carrying the less interesting allele from the analysis as done below:

hh_subset = subset(hh,
                   # exclude haplotypes with allele 2 at "rs6"
                   select.hap = haplo(hh)[ ,"rs6"] != 2, 
                   min_perc_geno.mrk = 50)
> * Subset haplotypes *
> 4 haplotypes discarded.
> 8 haplotypes remaining.
> * Filtering data *
> Discard markers genotyped on less than 50 % of haplotypes.
> No marker discarded.
> Data consists of 8 haplotypes and 11 markers.
> Number of mono-, bi-, multi-allelic markers:
> 1 2 
> 1 10
scan <- scan_hh(hh_subset, discard_integration_at_border = FALSE)
scan
>       CHR POSITION    FREQ_A    FREQ_D NHAPLO_A NHAPLO_D    IHH_A   IHH_D
> rs1  chr1    10000 1.0000000 0.0000000        7        1 26267.86     0.0
> rs2  chr1    20000 0.8571429 0.1428571        6        1 38741.67     0.0
> rs3  chr1    30000 0.8000000 0.2000000        4        1 58370.83     0.0
> rs4  chr1    40000 0.7142857 0.2857143        5        2 39741.67 28025.0
> rs5  chr1    50000 0.8571429 0.1428571        6        1 33958.33     0.0
> rs6  chr1    60000 0.5714286 0.4285714        4        3 18816.67 90012.5
> rs7  chr1    70000 0.7142857 0.2857143        5        2 37241.67 28025.0
> rs8  chr1    80000 0.8571429 0.1428571        6        1 31500.00     0.0
> rs9  chr1    90000 0.8333333 0.1666667        5        1 43333.33     0.0
> rs10 chr1   100000 0.8000000 0.2000000        4        1 27500.00     0.0
> rs11 chr1   110000 0.4285714 0.5714286        3        4 14012.50 13037.5
>            IES     INES
> rs1  26267.857 26267.86
> rs2  24839.286 38741.67
> rs3  32741.667 58370.83
> rs4  25616.071 39697.89
> rs5  21449.405 33958.33
> rs6  18116.071 49710.52
> rs7  16568.452 36061.53
> rs8  20904.762 31500.00
> rs9  26833.333 43333.33
> rs10 14500.000 27500.00
> rs11  4267.857 13050.00

Note that the value of EHH_D, now representing the allele with internal coding 1, is much higher at marker rs6 than before.

However, with so few EHH values due to missing values, there is not much signal left and a standardization by ihh2ihs() averages the alleged outlier away as can be observed in Figure 3.5.

ihs <- ihh2ihs(scan, freqbin = 1, verbose = FALSE)
manhattanplot(ihs, threshold = c(-1.5, 1.5), ylim = c(-2.5,2.5), pch = 20)
Graphical output of the manhattanplot() function

Figure 3.5: Graphical output of the manhattanplot() function

3.2.5 Furcations and haplotype length

A furcation diagram can show the pattern for all three alleles of the focal marker rs6. (Pseudo-)furcations that arise from the removal of chromosomes due to missing values are marked by dashed lines as depicted in Figure 3.6.

f <- calc_furcation(hh, mrk = "rs6")
# set equal plot margins on left and right side and save old ones
oldpar <- par(mar = (c(5, 3, 4, 3) + 0.1))
plot(f,
     lwd = 1.5,
     hap.names = hap.names(hh),
     # no place for a legend inside the plot
     legend.xy.coords = "none")
par(oldpar)
Graphical output of the plot.furcation() function

Figure 3.6: Graphical output of the plot.furcation() function

Again, it is possible to export each tree into Newick format. This format, however, has no option to mark different kinds of branches. We let package ape render the Newick string to yield Figure 3.7.

newick <- as.newick(f, 
                    allele = 0, 
                    side = "left", 
                    hap.names = hap.names(hh))
tree <- ape::read.tree(text = newick)
plot(tree, 
     direction = "leftwards", 
     edge.color = "blue",
     underscore = TRUE,
     no.margin = TRUE)
Graphical output of the plot.phylo() function of package ape

Figure 3.7: Graphical output of the plot.phylo() function of package ape

Likewise, Figure 3.8 of the shared haplotype lengths covers all alleles of the focal marker.

h <- calc_haplen(f)
plot(h, hap.names = hap.names(hh), legend.xy.coords = "none")
Graphical output of the plot.haplen() function

Figure 3.8: Graphical output of the plot.haplen() function

Finally, to clean-up the working directory, we call

remove.example.files()

References

Gautier M., Klassmann A., Vitalis R., 2017 rehh 2.0: a reimplementation of the R package rehh to detect positive selection from haplotype structure. Molecular Ecology Resources 17: 78–90.

Klassmann A., Gautier M., 2020 Detecting selection using extended haplotype homozygosity-based statistics on unphased or unpolarized data (preprint). https://doi.org/10.22541/au.160405572.29972398/v1.