BayesChange Tutorial

This is a brief tutorial of the BayesChange package. The package contains functions that perform change point detection on univariate and multivariate time series, and functions that cluster time series and survival functions with common change points. Below we briefly show how to use the main functions.

library(BayesChange)

Detecting change points on time series

Two functions of the package, detect_cp_uni and detect_cp_multi, detect change points on univariate and multivariate time series respectively. The function detect_cp_uni is based on the work of Martínez and Mena (2014), and detect_cp_multi on the work of Corradin, Danese, and Ongaro (2022).

In order to use the function detect_cp_uni we need a vector of observations. For example, we can create a vector of 100 observations where the first 50 are sampled from a normal distribution with mean 0 and standard deviation 0.1, and the remaining 50 from a normal distribution with mean 1 and standard deviation 0.25.

data_uni <- as.numeric(c(rnorm(50,0,0.1), rnorm(50,1,0.25)))
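
As a quick check (this plotting step is not part of the package, and the simulated values will differ from run to run unless a seed is set), we can plot the series with base R; the change in mean and spread after time 50 should be visible.

plot(data_uni, type = "l", xlab = "time", ylab = "observation")
abline(v = 50, lty = 2)  # true change point used in the simulation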

Now we can run the function detect_cp_uni. As arguments we need to specify the number of iterations n_iterations, the probability q of performing a split at each step, the autoregressive coefficient phi for the likelihood of the data, and the parameters a, b and c for the priors.

out <- detect_cp_uni(data = data_uni,                             
                            n_iterations = 2500,
                            q = 0.25,
                            phi = 0.1, a = 1, b = 1, c = 0.1)
#> Completed:   250/2500 - in 0.032 sec
#> Completed:   500/2500 - in 0.06 sec
#> Completed:   750/2500 - in 0.099 sec
#> Completed:   1000/2500 - in 0.137 sec
#> Completed:   1250/2500 - in 0.173 sec
#> Completed:   1500/2500 - in 0.22 sec
#> Completed:   1750/2500 - in 0.256 sec
#> Completed:   2000/2500 - in 0.288 sec
#> Completed:   2250/2500 - in 0.32 sec
#> Completed:   2500/2500 - in 0.403 sec

In order to get a point estimate of the final partition we can use the function get_clust_VI, which chooses as the final estimate the visited partition that minimizes the Variation of Information loss function.

The function correctly detects a change point at time 50.

table(get_clust_VI(out$order[500:2500,]))
#> 
#>  0  1 
#> 50 50
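
The labels returned by get_clust_VI index the estimated regimes, so the change point location can be recovered as the time at which the estimated label changes. A minimal base R sketch, using the out$order matrix as above:

est_order <- get_clust_VI(out$order[500:2500,])
which(diff(as.numeric(est_order)) != 0)  # estimated change point location (here 50)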

Remark: the function get_clust_VI has been included in the package to give the user an easy way to obtain a final estimate of the partition. In the literature there are many other methods, based on different loss functions and algorithms, that estimate a final partition from the sequence of partitions visited by an MCMC algorithm; see for example Dahl, Johnson, and Müller (2022).
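
For instance, the salso package implements the search algorithms of Dahl, Johnson, and Müller (2022). As a hypothetical alternative, assuming salso is installed and that its salso() function accepts the matrix of visited partitions (with cluster labels shifted to start at 1):

if (requireNamespace("salso", quietly = TRUE)) {
  # shift the 0-based labels to 1-based labels before calling salso
  salso::salso(out$order[500:2500,] + 1, loss = "VI")
}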

The function detect_cp_multi works similarly to its univariate counterpart. Data must be a matrix where each row is a component of the multivariate time series.

data_multi <- matrix(NA, nrow = 3, ncol = 100)

data_multi[1,] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_multi[2,] <- as.numeric(c(rnorm(50,0,0.125), rnorm(50,1,0.225)))
data_multi[3,] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))

Arguments k_0, nu_0, phi_0, m_0, par_theta_c, par_theta_d and prior_var_gamma correspond to the parameters of the prior distributions for the multivariate likelihood.

out <- detect_cp_multi(data = data_multi,
                              n_iterations = 2500,
                              q = 0.25, k_0 = 0.25, nu_0 = 4, phi_0 = diag(1,3,3), m_0 = rep(0,3),
                              par_theta_c = 2, par_theta_d = 0.2, prior_var_gamma = 0.1)
#> Completed:   250/2500 - in 0.09 sec
#> Completed:   500/2500 - in 0.142 sec
#> Completed:   750/2500 - in 0.187 sec
#> Completed:   1000/2500 - in 0.226 sec
#> Completed:   1250/2500 - in 0.275 sec
#> Completed:   1500/2500 - in 0.326 sec
#> Completed:   1750/2500 - in 0.409 sec
#> Completed:   2000/2500 - in 0.494 sec
#> Completed:   2250/2500 - in 0.569 sec
#> Completed:   2500/2500 - in 0.635 sec

table(get_clust_VI(out$order[500:2500,]))
#> 
#>  0  1 
#> 50 50
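
Besides the point estimate, the matrix of visited partitions can be used to summarise the posterior distribution of the number of regimes. A minimal base R sketch, assuming as above that each row of out$order is a visited partition of the time indices:

n_regimes <- apply(out$order[500:2500,], 1, function(r) length(unique(r)))
table(n_regimes)  # posterior frequencies of the number of regimes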

Cluster time series with common change points

BayesChange contains two functions, clust_cp_uni and clust_cp_multi, that cluster univariate and multivariate time series with common change points, respectively. Details about these methods can be found in Corradin et al. (2024).

In clust_cp_uni, data must be a matrix where each row is a time series.

data_mat <- matrix(NA, nrow = 5, ncol = 100)

data_mat[1,] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_mat[2,] <- as.numeric(c(rnorm(50,0,0.125), rnorm(50,1,0.225)))
data_mat[3,] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_mat[4,] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_mat[5,] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
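
Before clustering, it can be useful to look at the simulated series: the first three share a change point at time 50 and the last two at time 25. A quick base R plot (not required by the package):

matplot(t(data_mat), type = "l", lty = 1, xlab = "time", ylab = "observation")
abline(v = c(25, 50), lty = 2)  # true change points used in the simulation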

Arguments that need to be specified in clust_cp_uni are the number of iterations n_iterations, the number of elements B used to estimate the normalisation constant, the number of split-and-merge steps L performed when a new partition is proposed, and the autoregressive coefficient gamma for the likelihood of the time series.

out <- clust_cp_uni(data = data_mat, n_iterations = 2500, B = 1000, L = 1, gamma = 0.5)
#> Normalization constant - completed:  100/1000 - in 0.014 sec
#> Normalization constant - completed:  200/1000 - in 0.024 sec
#> Normalization constant - completed:  300/1000 - in 0.037 sec
#> Normalization constant - completed:  400/1000 - in 0.048 sec
#> Normalization constant - completed:  500/1000 - in 0.058 sec
#> Normalization constant - completed:  600/1000 - in 0.068 sec
#> Normalization constant - completed:  700/1000 - in 0.077 sec
#> Normalization constant - completed:  800/1000 - in 0.086 sec
#> Normalization constant - completed:  900/1000 - in 0.096 sec
#> Normalization constant - completed:  1000/1000 - in 0.104 sec
#> 
#> ------ MAIN LOOP ------
#> 
#> Completed:   250/2500 - in 0.284 sec
#> Completed:   500/2500 - in 0.489 sec
#> Completed:   750/2500 - in 0.679 sec
#> Completed:   1000/2500 - in 0.873 sec
#> Completed:   1250/2500 - in 1.067 sec
#> Completed:   1500/2500 - in 1.257 sec
#> Completed:   1750/2500 - in 1.455 sec
#> Completed:   2000/2500 - in 1.642 sec
#> Completed:   2250/2500 - in 1.842 sec
#> Completed:   2500/2500 - in 2.09 sec

get_clust_VI(out$clust[500:2500,])
#>      [,1]
#> [1,]    0
#> [2,]    0
#> [3,]    0
#> [4,]    1
#> [5,]    1
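
The estimated partition can be matched back to the series with a little base R, confirming that the first three series (change point at time 50) and the last two (change point at time 25) form two separate clusters:

data.frame(series = 1:5,
           cluster = as.vector(get_clust_VI(out$clust[500:2500,])))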

In clust_cp_multi, data must be an array where each slice along the third dimension is a multivariate time series represented by a matrix; each row of the matrix is a component of the time series.

data_array <- array(data = NA, dim = c(3,100,5))

data_array[1,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[2,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[3,,1] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))

data_array[1,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[2,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))
data_array[3,,2] <- as.numeric(c(rnorm(50,0,0.100), rnorm(50,1,0.250)))

data_array[1,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_array[2,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))
data_array[3,,3] <- as.numeric(c(rnorm(50,0,0.175), rnorm(50,1,0.280)))

data_array[1,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_array[2,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))
data_array[3,,4] <- as.numeric(c(rnorm(25,0,0.135), rnorm(75,1,0.225)))

data_array[1,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
data_array[2,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
data_array[3,,5] <- as.numeric(c(rnorm(25,0,0.155), rnorm(75,1,0.280)))
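
The slice-by-slice construction above is verbose; an equivalent way to build the same kind of array (with new random draws) is a short loop over series and components:

sds <- list(c(0.100, 0.250), c(0.100, 0.250), c(0.175, 0.280),
            c(0.135, 0.225), c(0.155, 0.280))
cps <- c(50, 50, 50, 25, 25)  # change point of each multivariate time series

for(s in 1:5){
  for(d in 1:3){
    data_array[d,,s] <- c(rnorm(cps[s], 0, sds[[s]][1]),
                          rnorm(100 - cps[s], 1, sds[[s]][2]))
  }
}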

The algorithm parameters are the same as in clust_cp_uni; the parameters of the multivariate likelihood correspond to those of detect_cp_multi.

out <- clust_cp_multi(data = data_array, n_iterations = 2500, B = 1000, L = 1,
                        gamma = 0.1, k_0 = 0.25, nu_0 = 5, phi_0 = diag(0.1,3,3), m_0 = rep(0,3))
#> Normalization constant - completed:  100/1000 - in 0.012 sec
#> Normalization constant - completed:  200/1000 - in 0.023 sec
#> Normalization constant - completed:  300/1000 - in 0.044 sec
#> Normalization constant - completed:  400/1000 - in 0.075 sec
#> Normalization constant - completed:  500/1000 - in 0.104 sec
#> Normalization constant - completed:  600/1000 - in 0.129 sec
#> Normalization constant - completed:  700/1000 - in 0.145 sec
#> Normalization constant - completed:  800/1000 - in 0.156 sec
#> Normalization constant - completed:  900/1000 - in 0.167 sec
#> Normalization constant - completed:  1000/1000 - in 0.18 sec
#> 
#> ------ MAIN LOOP ------
#> 
#> Completed:   250/2500 - in 0.235 sec
#> Completed:   500/2500 - in 0.486 sec
#> Completed:   750/2500 - in 0.735 sec
#> Completed:   1000/2500 - in 0.967 sec
#> Completed:   1250/2500 - in 1.245 sec
#> Completed:   1500/2500 - in 1.54 sec
#> Completed:   1750/2500 - in 1.826 sec
#> Completed:   2000/2500 - in 2.098 sec
#> Completed:   2250/2500 - in 2.365 sec
#> Completed:   2500/2500 - in 2.615 sec

get_clust_VI(out$clust[500:2000,])
#>      [,1]
#> [1,]    0
#> [2,]    0
#> [3,]    0
#> [4,]    1
#> [5,]    1

Cluster survival functions with common change points

The function clust_cp_epi provides a method for clustering survival functions with common change points. Also in this case, details can be found in Corradin et al. (2024).

Data must be a matrix where each row contains the number of infected individuals at each time. The package includes the function sim_epi_data, which simulates infection times.

data_mat <- matrix(NA, nrow = 5, ncol = 50)

# infection rates: three curves change at time 25, two at time 10
betas <- list(c(rep(0.45, 25), rep(0.14, 25)),
              c(rep(0.55, 25), rep(0.11, 25)),
              c(rep(0.50, 25), rep(0.12, 25)),
              c(rep(0.52, 10), rep(0.15, 40)),
              c(rep(0.53, 10), rep(0.13, 40)))

inf_times <- list()

for(i in 1:5){

  # simulate infection times with the time-varying infection rate betas[[i]]
  inf_times[[i]] <- sim_epi_data(S0 = 10000, I0 = 10, max_time = 50, beta_vec = betas[[i]], gamma_0 = 1/8)

  # count the number of infections in each unit time interval
  vec <- rep(0, 50)
  names(vec) <- as.character(1:50)

  counts <- table(floor(inf_times[[i]]))
  for(j in 1:50){
    if(as.character(j) %in% names(counts)){
      vec[j] <- counts[[as.character(j)]]
    }
  }

  data_mat[i,] <- vec
}
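
As before, a quick base R plot of the simulated infection counts (not required by the package) shows the different change points of the five curves:

matplot(t(data_mat), type = "l", lty = 1, xlab = "time", ylab = "number of infected")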

In clust_cp_epi we need to specify, besides the usual parameters, the number of Monte Carlo replications M for the approximation of the integrated likelihood and the recovery rate gamma.

out <- clust_cp_epi(data = data_mat, n_iterations = 2500, M = 250, B = 1000, L = 1, q = 0.1, gamma = 1/8)

get_clust_VI(out$clust[500:2500,])

References

Corradin, Riccardo, Luca Danese, Wasiur R. KhudaBukhsh, and Andrea Ongaro. 2024. “Model-Based Clustering of Time-Dependent Observations with Common Structural Changes.” https://arxiv.org/abs/2410.09552.
Corradin, Riccardo, Luca Danese, and Andrea Ongaro. 2022. “Bayesian Nonparametric Change Point Detection for Multivariate Time Series with Missing Observations.” International Journal of Approximate Reasoning 143: 26–43. https://doi.org/10.1016/j.ijar.2021.12.019.
Dahl, David B., Devin J. Johnson, and Peter Müller. 2022. “Search Algorithms and Loss Functions for Bayesian Clustering.” Journal of Computational and Graphical Statistics 31 (4): 1189–1201. https://doi.org/10.1080/10618600.2022.2069779.
Martínez, Asael Fabian, and Ramsés H. Mena. 2014. “On a Nonparametric Change Point Detection Model in Markovian Regimes.” Bayesian Analysis 9 (4): 823–58. https://doi.org/10.1214/14-BA878.