This page provides installation instructions and usage examples for the R package ADPclust. Please see XiaoFeng Wang and Yifan Xu [2015] for details of the procedure.
##Introduction: ADPclust
ADPclust stands for fast clustering using adaptive density peak detection. It is a non-iterative procedure that finds the number of clusters and the cluster assignments for large, high-dimensional data sets by identifying cluster centroids from estimated densities. The procedure builds upon the work by Rodriguez [2014]. ADPclust automatically identifies cluster centroids from a projected two-dimensional decision plot that separates cluster centroids from the rest of the points. This decision plot is generated by calculating two values, \((f(\mathbf{x}), \delta(\mathbf{x}))\), for each data point.
For a data set \(\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\) where each \(\mathbf{x}_i\) is a \(d\)-dimensional vector, ADPclust first estimates the local multivariate Gaussian density \(f(\mathbf{x}_i), i=1,\ldots,n\), by
\[\hat{f}(\mathbf{x}_i; h_1,\ldots,h_d) = n^{-1} \left(\prod_{l=1}^d h_l \right)^{-1} \cdot \sum_{j=1}^n K\left(\frac{x_{i1} - x_{j1}}{h_1}, \ldots, \frac{x_{id} - x_{jd}}{h_d}\right), \]
where \(h_1,\ldots,h_d\) are the bandwidths for each dimension. Two default values for \(h\) are provided in ADPclust: 1) the rule-of-thumb (ROT) bandwidth by Scott [2002]; 2) the asymptotic mean integrated squared error (AMISE) bandwidth by Wand [1994]. Other bandwidths can also be specified if the defaults do not give satisfactory results.
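The density estimator above can be sketched in a few lines of base R. This is only an illustration, not the package's internal code: `kde_at_points` is a hypothetical helper, and the kernel \(K\) is assumed here to be a product of univariate standard normal densities.

```r
# Hypothetical helper (not part of ADPclust): evaluates the kernel density
# estimate at every data point, assuming a product Gaussian kernel.
kde_at_points <- function(x, h) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)
  h <- rep_len(h, d)                 # allow a single bandwidth for all dimensions
  fhat <- numeric(n)
  for (i in seq_len(n)) {
    # Scaled differences (x_jl - x_il) / h_l for all j and l; the Gaussian
    # kernel is symmetric, so the sign of the difference does not matter.
    u <- sweep(sweep(x, 2, x[i, ], "-"), 2, h, "/")
    # Product of univariate normal densities across the d dimensions
    k <- exp(rowSums(dnorm(u, log = TRUE)))
    fhat[i] <- sum(k) / (n * prod(h))
  }
  fhat
}

# Simulated data: two well-separated groups of 30 points in 2 dimensions
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2))
fhat <- kde_at_points(x, h = c(0.3, 0.3))
```

The bandwidth `c(0.3, 0.3)` is an arbitrary illustrative choice, not the ROT or AMISE value.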
Given the density estimates \(\hat{f}(\mathbf{x}_i), i = 1,\ldots,n\), the “isolation” indices \(\delta(\mathbf{x}_i)\) are found by:
\[\hat{\delta}(\mathbf{x}_i) = \min_{j:\hat{f}(\mathbf{x}_i) < \hat{f}(\mathbf{x}_j)}{d(\mathbf{x}_i,\mathbf{x}_j)},\]
where \(d(\mathbf{x}_i,\mathbf{x}_j)\) is the distance measure between \(\mathbf{x}_i\) and \(\mathbf{x}_j\).
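As a rough sketch, the isolation index can be computed from a distance matrix and a vector of density estimates. `isolation_index` is a hypothetical helper, Euclidean distance is assumed for \(d\), and the treatment of the highest-density point (for which no denser point exists) is an assumed convention that the formula above does not specify.

```r
# Hypothetical helper (not part of ADPclust): distance from each point to
# its nearest neighbour with strictly higher estimated density.
isolation_index <- function(x, fhat) {
  dmat <- as.matrix(dist(x))         # Euclidean distances (assumed choice of d)
  n <- length(fhat)
  delta <- numeric(n)
  for (i in seq_len(n)) {
    higher <- which(fhat > fhat[i])  # points with strictly larger density
    delta[i] <- if (length(higher) > 0) {
      min(dmat[i, higher])
    } else {
      max(dmat[i, ])                 # assumed convention for the densest point
    }
  }
  delta
}

# Three points on a line with decreasing density
x <- matrix(c(0, 1, 5), ncol = 1)
fhat <- c(3, 2, 1)
delta <- isolation_index(x, fhat)
```

Points that are both dense and far from any denser point get large values on both axes, which is what places them in the upper-right corner of the decision plot.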
The scatter plot of \((\hat{f}(\mathbf{x}_i), \hat{\delta}(\mathbf{x}_i)), i = 1,\ldots,n\), is called a decision plot. From its upper-right corner, \(k\) centroids are selected automatically or manually, and every other point is assigned to the cluster of its closest centroid.
The average silhouette score is calculated after clusters are assigned and is used to choose the best number of clusters among a sequence of candidate values of \(k\).
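The assignment and scoring steps can be illustrated as follows. This is a sketch under stated assumptions rather than the package's implementation: `assign_to_centroids` is a hypothetical helper, the centroid indices are picked by hand, Euclidean distance is assumed, and the silhouette widths come from `silhouette()` in the `cluster` package.

```r
library(cluster)  # provides silhouette()

# Hypothetical helper (not part of ADPclust): label each point with the
# position (1..k) of its nearest centroid.
assign_to_centroids <- function(x, centroid_idx) {
  dmat <- as.matrix(dist(x))
  unname(apply(dmat[, centroid_idx, drop = FALSE], 1, which.min))
}

# Simulated data: two well-separated groups of 20 points in 2 dimensions
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 4, sd = 0.3), ncol = 2))

# Hand-picked centroids: one point from each group
cl <- assign_to_centroids(x, centroid_idx = c(1, 21))

# Average silhouette width of the resulting partition
sil <- silhouette(cl, dist(x))
avg_sil <- mean(sil[, "sil_width"])
```

Repeating this for several candidate sets of centroids and keeping the partition with the largest `avg_sil` mirrors the selection criterion described above.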
##Installation
Run the following line to install the package.
install.packages("ADPclust_0.6.3.tar.gz", repos = NULL, type = "source")
Run the following line to load the package:
library(ADPclust)
##Example 1: Automatic centroid selection in ADPclust
####Default settings
The automatic centroid selection in ADPclust finds the best bandwidth \(h\) and number of clusters \(k\) from a grid of combinations of testing \((h, k)\) values. By default, the testing \(h\) values are 10 values evenly spread over the interval \([0.5h_0, 3h_0]\), where \(h_0\) is Wand's asymptotic mean integrated squared error (AMISE) bandwidth. The default testing numbers of clusters are \(k = 2,\ldots,10\).
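To make the size of the default search concrete, the grid described above can be written out in base R. The value of `h0` below is an arbitrary placeholder rather than a bandwidth computed by the package, and reading "evenly spread" as equal spacing is an assumption.

```r
# Placeholder reference bandwidth (illustrative value, not computed by ADPclust)
h0 <- 0.2

# 10 equally spaced testing bandwidths in [0.5 * h0, 3 * h0]
h_grid <- seq(0.5 * h0, 3 * h0, length.out = 10)

# Default testing numbers of clusters
k_grid <- 2:10

# Every (h, k) combination that would be evaluated
grid <- expand.grid(h = h_grid, k = k_grid)
n_combinations <- nrow(grid)
```

With these defaults the search covers 10 bandwidths times 9 cluster counts, so 90 combinations.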
# Load a simple simulated data set with 3 clusters.
data(clust3)
ans <- adpclust(clust3)
The output of ADPclust, `ans`, is an object of class `adpclust` with associated `summary` and `plot` methods. The latter, `plot(ans)`, produces a figure similar to the one shown above.
summary(ans)
## -- ADPclust Procedure --
##
## Number of variables: 2
## Number of obs.: 90
## Centroids selection: Automatic
## Bandwith selection: AMISE (0.16)
## Number of clusters: 3
## Avg. Silhouette: 0.7747114
##
## f(x):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.444 5.048 7.907 7.599 10.300 13.210
##
## delta(x):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02322 0.05521 0.09144 0.15980 0.13480 2.06300
####Change the bandwidth in density estimation
We can change the reference bandwidth \(h_0\) to Scott's rule-of-thumb (ROT) value by setting `htype = "ROT"`:
# Result not shown.
ans <- adpclust(clust3, htype = "ROT")
We can also pass a specific value for `h`. This setting overrides `htype`. In the following we also set a different range of testing cluster numbers:
ans <- adpclust(clust3, nclust = 2:15, h = 10)
Another important argument in the automatic selection of centroids is `f.cut`. It specifies a quantile level for the \(\hat{f}(\mathbf{x})\) values (for example, `f.cut = 0.1` corresponds to the 10% quantile). Only the data points whose \(\hat{f}(\mathbf{x}_i)\) values are larger than the `f.cut` quantile are candidates for cluster centroids. We demonstrate the usage in the following example with a data set consisting of 5 clusters in which the points are more closely clustered. Note that we use the function `AMISE()` to calculate the AMISE bandwidth.
# Load the data
data(clust5)
ans <- adpclust(clust5, h = AMISE(clust5), f.cut = 0.01)
As we can see from the middle figure, where the dotted line marks the cutoff, some isolated points (small \(f\) values) are mistaken for cluster centroids. By setting `f.cut = 0.1` we obtain the correct clustering result.
ans <- adpclust(clust5, h = AMISE(clust5), f.cut = 0.1)
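The effect of `f.cut` can be pictured with a small filter over the density estimates. This is a hypothetical illustration of the rule described above, not the package's internal code; `centroid_candidates` and the density values are made up for the example.

```r
# Hypothetical helper (not part of ADPclust): indices of the points whose
# estimated density exceeds the f.cut quantile of all density values.
centroid_candidates <- function(fhat, f.cut = 0.1) {
  which(fhat > quantile(fhat, probs = f.cut))
}

# Made-up density estimates, sorted for readability
fhat <- c(0.2, 1.5, 3.0, 4.2, 5.1, 6.3, 7.7, 8.4, 9.0, 9.9, 10.5)

# A small f.cut keeps nearly every point as a candidate
loose <- centroid_candidates(fhat, f.cut = 0)

# A larger f.cut drops the lowest-density points
strict <- centroid_candidates(fhat, f.cut = 0.1)
```

Here `strict` excludes the two lowest-density points, which is how a larger `f.cut` keeps isolated, low-density points from being picked as centroids even when their \(\delta\) values are large.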
ADPclust also allows the user to interactively select cluster centroids from the \((f(x), \delta(x))\) decision scatter plot. After running the following line, the first figure below is displayed, on which you can click an arbitrary number of centroids, then hit “ESC” to end the selection. The right figure then shows the corresponding clustering result.
data(clust5.1)
ans <- adpclust(clust5.1, centroids = "user")
By default (`h = NULL`), the method specified in `htype` is used to find the bandwidth. The `h` value can also be specified.