Introduction to the comparer R package

When coding, especially for data science, there are often multiple ways to solve a problem. When presented with two options, you want to pick the one that is faster and/or more accurate. Comparing different code chunks on the same task can be tedious. It often requires creating data, writing a for loop (or using sapply), then comparing the results.

Motivation from `microbenchmark`

The R package microbenchmark provides the fantastic eponymous function, which makes it simple to run different segments of code and see which is faster. Borrowing an example from http://adv-r.had.co.nz/Performance.html, the following shows its summary of how fast each expression ran.

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  x <- runif(100)
  microbenchmark::microbenchmark(sqrt(x), x ^ .5)
} else {
  "microbenchmark not available on your computer"
}

## Unit: nanoseconds
##     expr  min   lq mean median   uq   max neval
##  sqrt(x)  200  300  658    300  400 11900   100
##    x^0.5 2300 2500 3025   2600 2700 17100   100

However, it gives no summary of the output. For this example that is fine since the output is deterministic, but when working with randomness or model predictions we want some sort of summary or evaluation metric to see which has better accuracy, or just to see how the outputs differ.
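To make the gap concrete, here is a minimal base-R sketch of the manual approach described in the introduction: loop over trials, record both the elapsed time and the output of each chunk, then summarize. The helper name compare_by_hand is hypothetical and is not part of microbenchmark or comparer.

```r
# Hypothetical helper: time a function over several trials and keep its outputs.
compare_by_hand <- function(f, times = 5) {
  elapsed <- numeric(times)
  output <- numeric(times)
  for (i in seq_len(times)) {
    start <- proc.time()[["elapsed"]]
    output[i] <- f()
    elapsed[i] <- proc.time()[["elapsed"]] - start
  }
  list(elapsed = elapsed, output = output)
}

set.seed(1)
small <- compare_by_hand(function() mean(rnorm(10)))
large <- compare_by_hand(function() mean(rnorm(100)))

# Summarize the outputs, which a timing-only benchmark does not report.
c(small = sd(small$output), large = sd(large$output))
```

Even for this tiny case the bookkeeping is longer than the code being compared, which is the tedium the rest of this document aims to remove.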

`mbc` to the rescue

The function mbc in the comparer package was created to solve this problem, where a comparison of the output is desired in addition to the run time.

For example, we may wish to see how the sample size affects an estimate of the mean of a random sample. The following shows the results of finding the mean of 10 and 100 samples from a normal distribution.

library(comparer)

## Loading required package: GauPro

## Loading required package: mixopt

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: ggplot2

## Loading required package: splitfngr

## Loading required package: numDeriv

## Loading required package: rmarkdown

## Loading required package: tidyr

## Loading required package: plyr

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## Loading required package: progress

mbc(mean(rnorm(10)), mean(rnorm(100)))

## Run times (sec)
##           Function        Sort1        Sort2        Sort3        Sort4
## 1  mean(rnorm(10)) 6.914139e-06 7.152557e-06 7.152557e-06 1.096725e-05
## 2 mean(rnorm(100)) 9.059906e-06 9.775162e-06 1.001358e-05 1.001358e-05
##          Sort5        mean           sd neval
## 1 7.970333e-04 1.65844e-04 3.528496e-04     5
## 2 2.121925e-05 1.20163e-05 5.159435e-06     5
## 
## Output summary
##               Func Stat      Sort1        Sort2       Sort3      Sort4
## 1  mean(rnorm(10))    1 -0.1385332  0.005776678  0.12611456 0.29728392
## 2 mean(rnorm(100))    1 -0.1886518 -0.071591735 -0.04465929 0.02597593
##        Sort5        mean        sd
## 1 0.56415632  0.17095965 0.2718657
## 2 0.07643412 -0.04049856 0.1012739

By default mbc runs only 5 trials, but this can be changed with the times parameter. The first part of the output gives the run times. For 5 or fewer trials it shows all the values in sorted order; for more than 5 it shows summary statistics. Most of the run times in this example are only a few microseconds, so small differences between them should not be over-interpreted.

The second section of the output gives the summary of the output. This also will show summary stats for more than 5 trials, but for this small sample size it shows all the values in sorted order with the mean and standard deviation given. The first column shows the name of each, and the second column shows which output statistic is given. Since there is only one output for this code it is called “1”.

Setting times changes the number of trials run. Below the same example as above is run but for 100 trials.

mbc(mean(rnorm(10)), mean(rnorm(100)), times=100)

## Run times (sec)
##           Function         Min.      1st Qu.       Median         Mean
## 1  mean(rnorm(10)) 5.960464e-06 5.960464e-06 6.914139e-06 6.821156e-06
## 2 mean(rnorm(100)) 8.821487e-06 9.059906e-06 9.059906e-06 9.701252e-06
##        3rd Qu.         Max.           sd neval
## 1 6.914139e-06 1.907349e-05 1.354602e-06   100
## 2 1.001358e-05 2.098083e-05 1.495924e-06   100
## 
## Output summary
##               Func Stat       Min.     1st Qu.       Median        Mean
## 1  mean(rnorm(10))    1 -0.8159787 -0.25627116 -0.057128447 -0.01818848
## 2 mean(rnorm(100))    1 -0.1839926 -0.07144344  0.005504562  0.00882492
##      3rd Qu.      Max.        sd
## 1 0.20917428 0.7309789 0.3200503
## 2 0.07699597 0.2467896 0.1014793

We see that the mean of both is around zero, but that the larger sample size (mean(rnorm(100))) has a tighter distribution and a standard deviation a third as large as the other, which is about what we expect for a sample that is 10 times larger (it should be $\sqrt{10} \approx 3.16$ times smaller on average).
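The $\sqrt{10}$ ratio can be checked with a larger simulation in base R. This is only a sanity check of the statistical claim, not something mbc does for you; the number of replicates and the seed are arbitrary choices.

```r
set.seed(42)
reps <- 10000

# Sampling distribution of the mean for n = 10 and n = 100.
means_10 <- replicate(reps, mean(rnorm(10)))
means_100 <- replicate(reps, mean(rnorm(100)))

# The ratio of standard deviations should be close to sqrt(100 / 10).
sd_ratio <- sd(means_10) / sd(means_100)
c(observed = sd_ratio, expected = sqrt(10))
```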

In this example each function had its own input, but many times we want to compare the functions on the same input for better comparison.

Shared input

Input can be passed to the input argument as a list, and then the code will be evaluated in an environment with that data. In this example we compare the functions mean and median on random data from an exponential distribution. The mean should be about 1, while the median should be about $\ln(2) \approx 0.693$.

mbc(mean(x), median(x), input=list(x=rexp(30)))

## Run times (sec)
##    Function        Sort1        Sort2        Sort3        Sort4        Sort5
## 1   mean(x) 5.960464e-06 5.960464e-06 6.914139e-06 9.059906e-06 2.002716e-05
## 2 median(x) 2.193451e-05 2.312660e-05 2.503395e-05 2.908707e-05 1.621246e-04
##           mean           sd neval
## 1 9.584427e-06 5.973325e-06     5
## 2 5.226135e-05 6.147534e-05     5
## 
## Output summary
##        Func Stat     Sort1     Sort2     Sort3     Sort4     Sort5      mean sd
## 1   mean(x)    1 0.9596460 0.9596460 0.9596460 0.9596460 0.9596460 0.9596460  0
## 2 median(x)    1 0.6464802 0.6464802 0.6464802 0.6464802 0.6464802 0.6464802  0

In this case each evaluation is identical since the input is not random. The data passed to input is kept as is, so there is no randomness from the data. If we want randomness in the data, we can use inputi, which evaluates its argument as an expression, meaning that each time it will be different.

Below is the same code as above except inputi is used with x set inside braces instead of a list. We see there is randomness and we can get an idea of the distribution of the median and mean.

mbc(mean(x), median(x), inputi={x=rexp(30)})

## Run times (sec)
##    Function        Sort1        Sort2        Sort3        Sort4        Sort5
## 1   mean(x) 5.960464e-06 5.960464e-06 6.914139e-06 1.192093e-05 1.382828e-05
## 2 median(x) 2.598763e-05 2.694130e-05 2.694130e-05 4.291534e-05 5.197525e-05
##           mean           sd neval
## 1 8.916855e-06 3.695873e-06     5
## 2 3.495216e-05 1.185231e-05     5
## 
## Output summary
##        Func Stat        V1       V2       V3        V4        V5      mean
## 1   mean(x)    1 1.0108746 1.364529 1.057787 0.9138151 1.0079656 1.0709943
## 2 median(x)    1 0.8878218 0.742379 0.619253 0.7204445 0.7747634 0.7489323
##           sd
## 1 0.17221326
## 2 0.09699069
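The difference between the two arguments comes down to when the data-generating code runs. The following base-R sketch illustrates the idea with quote and eval; it is an illustration under that assumption, not a description of how comparer is implemented internally.

```r
set.seed(3)
times <- 5

# Like input: the data are created once and reused in every trial.
fixed <- list(x = rexp(30))
fixed_means <- vapply(seq_len(times),
                      function(i) eval(quote(mean(x)), fixed),
                      numeric(1))

# Like inputi: the expression is stored unevaluated and re-run per trial.
generator <- quote(x <- rexp(30))
fresh_means <- vapply(seq_len(times), function(i) {
  env <- new.env()
  eval(generator, env)        # draws a new x for this trial
  eval(quote(mean(x)), env)   # the code chunk sees that trial's x
}, numeric(1))

length(unique(fixed_means))  # one distinct value: same data every time
length(unique(fresh_means))  # a different value in each trial
```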

When the code chunks to evaluate are simple functions of a single variable, this can be simplified. Look how simple it is to run a test on these!

mbc(mean, median, inputi=rexp(30))

## Run times (sec)
##   Function        Sort1        Sort2        Sort3        Sort4        Sort5
## 1     mean 1.907349e-06 2.145767e-06 2.861023e-06 3.099442e-06 4.053116e-06
## 2   median 1.907349e-06 1.907349e-06 2.145767e-06 3.099442e-06 3.099442e-06
##           mean           sd neval
## 1 2.813339e-06 8.496537e-07     5
## 2 2.431870e-06 6.171312e-07     5

Comparing to expected values

The previous comparisons showed a summary of the outputs, but many times we want to compare output values to true values, then calculate a summary statistic, such as an average error. The argument target specifies the values the code chunks should give. Summary statistics can then be calculated by specifying metric, which defaults to calculating the rmse.

For example, suppose we have data from a linear function, and want to see how accurate the model is when the output values are corrupted with noise. Below we compare two linear models: the first with an intercept term, and the second without. The model with the intercept term should be much better since the data has an intercept of $-0.6$.

The output below differs in a few ways. The Stat column tells what the row is showing. These all say rmse, meaning they give the root mean squared error of the predicted values compared to the true y. There’s also a new section at the bottom titled Compare. This compares the rmse values from the two methods and does a t-test to see if the difference is significant. However, since there is no randomness, it fails to perform the t-test.

n <- 20
x <- seq(0, 1, length.out = n)
y <- 1.8 * x - .6
ynoise <- y + rnorm(n, 0, .2)

mbc(predict(lm(ynoise ~ x), data.frame(x)),
    predict(lm(ynoise ~ x - 1), data.frame(x)),
    target = y)

## Run times (sec)
##                                     Function        Sort1        Sort2
## 1     predict(lm(ynoise ~ x), data.frame(x)) 0.0006148815 0.0006420612
## 2 predict(lm(ynoise ~ x - 1), data.frame(x)) 0.0006229877 0.0006349087
##          Sort3        Sort4        Sort5         mean           sd neval
## 1 0.0006489754 0.0006699562 0.0040249825 0.0013201714 1.512164e-03     5
## 2 0.0006380081 0.0006439686 0.0007719994 0.0006623745 6.175718e-05     5
## 
## Output summary
##                                         Func Stat         V1         V2
## 1     predict(lm(ynoise ~ x), data.frame(x)) rmse 0.03486556 0.03486556
## 2 predict(lm(ynoise ~ x - 1), data.frame(x)) rmse 0.31189957 0.31189957
##           V3         V4         V5       mean sd
## 1 0.03486556 0.03486556 0.03486556 0.03486556  0
## 2 0.31189957 0.31189957 0.31189957 0.31189957  0
## 
## Compare
##                                                                                   Func
## 1 predict(lm(ynoise ~ x), data.frame(x)) vs predict(lm(ynoise ~ x - 1), data.frame(x))
##   Stat conf.low conf.up  t  p
## 1 rmse       NA      NA NA NA

To add randomness we can simply define ynoise in the inputi argument, as shown below. Now there is randomness in the data, so a paired t-test can be computed. It is paired since the same ynoise is given to each model. We see that even with only a sample size of 5, the p-value is highly significant.

mbc(predict(lm(ynoise ~ x), data.frame(x)),
    predict(lm(ynoise ~ x - 1), data.frame(x)),
    inputi={ynoise <- y + rnorm(n, 0, .2)},
    target = y)

## Run times (sec)
##                                     Function        Sort1        Sort2
## 1     predict(lm(ynoise ~ x), data.frame(x)) 0.0006039143 0.0006308556
## 2 predict(lm(ynoise ~ x - 1), data.frame(x)) 0.0005760193 0.0006251335
##          Sort3        Sort4        Sort5         mean          sd neval
## 1 0.0006418228 0.0006690025 0.0009450912 0.0006981373 0.000140010     5
## 2 0.0006289482 0.0006670952 0.0029590130 0.0010912418 0.001044617     5
## 
## Output summary
##                                         Func Stat         V1         V2
## 1     predict(lm(ynoise ~ x), data.frame(x)) rmse 0.04772297 0.04205276
## 2 predict(lm(ynoise ~ x - 1), data.frame(x)) rmse 0.31425846 0.31135820
##           V3         V4         V5       mean          sd
## 1 0.04900406 0.08436103 0.02185191 0.04899855 0.022568322
## 2 0.31485779 0.31178120 0.31209022 0.31286917 0.001577832
## 
## Compare
##                                                                                Func
## t predict(lm(ynoise ~ x), data.frame(x))-predict(lm(ynoise ~ x - 1), data.frame(x))
##   Stat         V1         V2         V3         V4         V5       mean
## t rmse -0.2665355 -0.2693054 -0.2658537 -0.2274202 -0.2902383 -0.2638706
##           sd         t            p
## t 0.02271818 -25.97183 1.305753e-05
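For readers who want to see what such a comparison involves, the same kind of paired analysis can be written by hand in base R. This is a sketch of the idea only: the helper names rmse and trial are hypothetical, the seed is arbitrary, and the numbers will not match the output above.

```r
set.seed(7)
n <- 20
x <- seq(0, 1, length.out = n)
y <- 1.8 * x - 0.6
rmse <- function(pred, truth) sqrt(mean((pred - truth)^2))

# Each trial draws one ynoise and gives it to both models, so results are paired.
trial <- function() {
  ynoise <- y + rnorm(n, 0, 0.2)
  c(with_intercept = rmse(predict(lm(ynoise ~ x), data.frame(x)), y),
    no_intercept   = rmse(predict(lm(ynoise ~ x - 1), data.frame(x)), y))
}
res <- t(replicate(5, trial()))

# Paired t-test on the per-trial RMSE values.
tt <- t.test(res[, "with_intercept"], res[, "no_intercept"], paired = TRUE)
colMeans(res)
tt$p.value
```

Pairing matters here because both models see the same noise in each trial, so the trial-to-trial variation largely cancels in the differences.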

Simplifying with `evaluator`

Many times the code chunks we want to compare only differ by a small amount, such as a single argument. In the example above, the only difference is the formula in the lm command. With mbc, the evaluator can be set to make these cases easier. The argument for evaluator should be an expression including ., which will be replaced with the code chunks provided. The example below rewrites the above comparison using evaluator.

mbc(ynoise ~ x,
    ynoise ~ x - 1,
    evaluator=predict(lm(.), data.frame(x)),
    inputi={ynoise <- y + rnorm(n, 0, .2)},
    target = y)

## Run times (sec)
##         Function        Sort1        Sort2        Sort3        Sort4
## 1     ynoise ~ x 0.0005779266 0.0005910397 0.0006289482 0.0006480217
## 2 ynoise ~ x - 1 0.0005629063 0.0005860329 0.0005991459 0.0006051064
##          Sort5         mean           sd neval
## 1 0.0011458397 0.0007183552 2.406309e-04     5
## 2 0.0007331371 0.0006172657 6.676976e-05     5
## 
## Output summary
##             Func Stat         V1         V2         V3         V4         V5
## 1     ynoise ~ x rmse 0.03486556 0.03486556 0.03486556 0.03486556 0.03486556
## 2 ynoise ~ x - 1 rmse 0.31189957 0.31189957 0.31189957 0.31189957 0.31189957
##         mean sd
## 1 0.03486556  0
## 2 0.31189957  0
## 
## Compare
##                        Func Stat        V1        V2        V3        V4
## 1 ynoise ~ x-ynoise ~ x - 1 rmse -0.277034 -0.277034 -0.277034 -0.277034
##          V5      mean sd  t  p
## 1 -0.277034 -0.277034  0 NA NA
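The placeholder mechanism can be pictured with base R's substitute, which replaces a symbol in an unevaluated expression. The sketch below assumes this is a fair mental model of what evaluator does; it is not taken from the comparer source, and the names template, chunks, and filled are illustrative.

```r
# Template with "." as the placeholder, mirroring the evaluator argument.
template <- quote(predict(lm(.), data.frame(x)))
chunks <- list(quote(ynoise ~ x), quote(ynoise ~ x - 1))

# Replace "." in the template with each code chunk.
filled <- lapply(chunks, function(chunk) {
  do.call(substitute, list(template, setNames(list(chunk), ".")))
})
filled_text <- vapply(filled, deparse, character(1))
filled_text

# The filled-in calls can then be evaluated on data as usual.
set.seed(11)
x <- seq(0, 1, length.out = 20)
ynoise <- 1.8 * x - 0.6 + rnorm(20, 0, 0.2)
dat <- list(x = x, ynoise = ynoise)
preds <- lapply(filled, function(cl) eval(cl, dat))
```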

K-Fold Cross Validation

K-fold cross validation can also be done with mbc by using the kfold parameter. K-fold cross validation involves splitting $N$ data points into $k$ groups. kfold should specify what this $N$ is, since it depends on the data. By default the number of folds, $k$, is set to times, so each replicate evaluates a single fold. Note that this does not run all $k$ folds in each of the times replicates.

To make $k$ different from times, pass in kfold as a vector whose second element is the number of folds. For example, suppose you have 100 data points, want to do 5 folds, and want to repeat this process twice (i.e., evaluate 10 folds). Then you should pass in kfold=c(100,5) and times=10. The first five trials would then be the five separate folds. The sixth through tenth trials would be a new partition of the data into five folds.
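The bookkeeping for 100 points, 5 folds, and 10 trials can be sketched in base R as follows. This only illustrates the partitioning scheme described above; the helper make_partition and the variable names are hypothetical, and comparer's own fold assignment may differ in detail.

```r
set.seed(5)
N <- 100      # number of data points
k <- 5        # number of folds
times <- 10   # total trials: two repetitions of five folds

# One random partition of 1:N into k held-out sets per repetition.
make_partition <- function(N, k) {
  unname(split(sample(N), rep(seq_len(k), length.out = N)))
}
held_out <- unlist(replicate(times / k, make_partition(N, k), simplify = FALSE),
                   recursive = FALSE)

# Training indices for each trial: everything except that trial's fold.
train <- lapply(held_out, function(fold) setdiff(seq_len(N), fold))

lengths(train)                 # 80 training points in each of the 10 trials
sort(unlist(held_out[1:5]))    # the first five folds cover 1:100 exactly once
```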

To use these folds, you must use ki as part of an expression in the code chunk or inputi. The following shows how to use k-fold cross validation when fitting a linear model to the cars dataset. Setting kfold=c(nrow(cars), 5) tells it that you want to use 5 folds on the cars data set. It has 50 rows, so in each trial ki is a subset of 1:50 with 40 elements. Setting times=30 means that we are repeating the five folds six times. The code chunk fits the model, makes predictions on the hold-out data, and calculates the RMSE.

mbc({mod <- lm(dist ~ speed, data=cars[ki,])
     p <- predict(mod,cars[-ki,])
     sqrt(mean((p - cars$dist[-ki])^2))
     },
    kfold=c(nrow(cars), 5),
    times=30)

## Run times (sec)
##                                                                                                           Function
## 1 { mod <- lm(dist ~ speed, data = cars[ki, ]) p <- predict(mod, cars[-ki, ]) sqrt(mean((p - cars$dist[-ki])^2)) }
##           Min.      1st Qu.       Median         Mean      3rd Qu.        Max.
## 1 0.0005500317 0.0005877614 0.0006076097 0.0006978591 0.0006324649 0.002804041
##            sd neval
## 1 0.000403711    30
## 
## Output summary
##                                                                                                               Func
## 1 { mod <- lm(dist ~ speed, data = cars[ki, ]) p <- predict(mod, cars[-ki, ]) sqrt(mean((p - cars$dist[-ki])^2)) }
##   Stat     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.       sd
## 1    1 8.929092 11.72902 13.85972 15.19293 19.13354 23.79548 4.322134

The following example simplifies this a little. Setting targetin tells it what the input to predict should be and setting target="dist" tells it that the target is the dist element from targetin. You cannot set target=cars$dist[-ki] since target cannot be evaluated as an expression.

mbc(lm(dist ~ speed, data=cars[ki,]),
    targetin=cars[-ki,], target="dist",
    kfold=c(nrow(cars), 5),
    times=30)

## Run times (sec)
##                              Function         Min.      1st Qu.       Median
## 1 lm(dist ~ speed, data = cars[ki, ]) 0.0003139973 0.0003354549 0.0003554821
##           Mean      3rd Qu.        Max.           sd neval
## 1 0.0003877401 0.0003721118 0.001163006 0.0001516688    30
## 
## Output summary
##                                  Func Stat     Min.  1st Qu.  Median     Mean
## 1 lm(dist ~ speed, data = cars[ki, ]) rmse 10.51727 11.74167 14.4834 15.48358
##    3rd Qu.     Max.       sd
## 1 18.85401 24.05299 4.106909

Metrics

In the previous example, the output shows that the “Stat” is “rmse”, meaning that it calculated the root-mean-square error from the predictions and target values. The metric, or statistic, calculated can be changed using the metric argument, which defaults to rmse. Three of the other options for metric are t, mis90, and sr27. All three compare target values ($y$) to predicted values ($\hat{y}$) and predicted errors ($s$), so they only work for models that give predicted errors, such as Gaussian process models.

`metric=t`

Using the target value, predicted value, and predicted error, we can calculate a t-score.

\[ t = \frac{\hat{y} - y}{s} \]

The output then shows the distribution of these t-scores by giving their six-number summary.

mbc(lm(dist ~ speed, data=cars[ki,]),
    targetin=cars[-ki,], target="dist",
    kfold=c(nrow(cars), 5),
    times=30,
    metric='t')

## Run times (sec)
##                              Function         Min.      1st Qu.       Median
## 1 lm(dist ~ speed, data = cars[ki, ]) 0.0003368855 0.0003628731 0.0004019737
##           Mean      3rd Qu.         Max.           sd neval
## 1 0.0004223347 0.0004310012 0.0008871555 0.0001087221    30
## 
## Output summary
##                                  Func      Stat        Min.     1st Qu.
## 1 lm(dist ~ speed, data = cars[ki, ])    Min. t -20.9075400 -11.5984276
## 2 lm(dist ~ speed, data = cars[ki, ]) 1st Qu. t  -4.9313685  -2.7018594
## 3 lm(dist ~ speed, data = cars[ki, ])  Median t  -2.8354889  -0.4966254
## 4 lm(dist ~ speed, data = cars[ki, ])    Mean t  -3.2792170  -0.7899931
## 5 lm(dist ~ speed, data = cars[ki, ]) 3rd Qu. t   0.4225208   1.6802855
## 6 lm(dist ~ speed, data = cars[ki, ])    Max. t   3.3310282   6.0047720
##       Median       Mean    3rd Qu.        Max.       sd
## 1 -9.7004842 -9.8979129 -7.8246452 -0.08528217 5.402287
## 2 -2.0802868 -1.8818658 -0.8568723  1.61930520 1.752299
## 3  0.6155601  0.7373985  1.9882930  3.95020736 1.803065
## 4  0.3880399  0.1253952  0.9522826  3.31685915 1.650908
## 5  2.8136150  2.9367746  3.9946167  6.06310737 1.536431
## 6  7.1397249  7.0871224  8.8368563  9.65604663 1.942429

`metric=mis90`

The t-score metric is not very informative because you can get the same t-scores by having a large error and large predicted error as having a small error and small predicted error. mis90 is the mean interval score for 90% coverage intervals as described by Gneiting and Raftery (2007, Equation 43).

\[ 3.28s + 20 \left( \hat{y} - y - 1.64s \right)^+ + 20 \left( y - \hat{y} - 1.64s \right)^+ \]

where $()^+$ denotes the positive part of what is in the parentheses. Smaller values are better. This metric penalizes having large predicted errors and having actual errors different from the predicted errors, so it is very good for judging the accuracy of a prediction interval.
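The contrast with the t-score is easy to see by transcribing both formulas into R. The functions below are direct per-point transcriptions of the formulas in this section, written for illustration; they are not the package's own implementation, and the names t_score and mis90 are chosen here for convenience.

```r
# Direct transcriptions of the t-score and interval-score formulas above.
t_score <- function(y, yhat, s) (yhat - y) / s
mis90 <- function(y, yhat, s) {
  pos <- function(z) pmax(z, 0)  # positive part
  3.28 * s + 20 * pos(yhat - y - 1.64 * s) + 20 * pos(y - yhat - 1.64 * s)
}

y <- 0
# Same t-score, very different errors and predicted errors.
t_same <- t_score(y, yhat = c(1, 0.1), s = c(1, 0.1))
score_same_t <- mis90(y, yhat = c(1, 0.1), s = c(1, 0.1))
# A target outside the 90% interval incurs an extra penalty.
score_outside <- mis90(y, yhat = 2, s = 1)

t_same         # 1 and 1
score_same_t   # 3.28 and 0.328
score_outside  # 10.48
```

The two predictions with identical t-scores get interval scores that differ by a factor of ten, because the narrower interval is rewarded.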

`metric=sr27`

The scoring rule in Equation 27 of Gneiting and Raftery (2007) is another proper scoring rule.

\[ -\left( \frac{\hat{y} - y}{s} \right)^2 - \log s^2 \]

For this metric, larger values are better. A problem with this metric is that if $s=0$, which can happen from numerical issues, the score is no longer finite, which does not happen with the mean interval score.
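A direct transcription of this formula into R shows both the ordering of scores and the problem at $s=0$. As with the other sketches, the function name sr27 is borrowed from the metric name for convenience and this is not the package's own code.

```r
# Per-point transcription of the scoring rule above.
sr27 <- function(y, yhat, s) -((yhat - y) / s)^2 - log(s^2)

calibrated <- sr27(y = 0, yhat = 1, s = 1)       # -1
overconfident <- sr27(y = 0, yhat = 1, s = 0.5)  # lower, so worse
degenerate <- sr27(y = 0, yhat = 1, s = 0)       # not a finite number

c(calibrated = calibrated, overconfident = overconfident)
is.finite(degenerate)  # FALSE
```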

Running time-consuming experiments with `ffexp`

The other main function of the package is ffexp, an abbreviation for full-factorial experiment. It will run a function using all possible combinations of input parameters given. It is useful for running experiments that take a long time to complete.

The first arguments given to ffexp$new should give the possible values for each input parameter. In the example below, a can be 1, 2, or 3, and b can be “a”, “b”, or “c”. Then eval_func should be given a function that can operate on these parameters. For example, using eval_func = paste will paste together the value of a with the value of b.

f1 <- ffexp$new(
  a=1:3,
  b=c("a","b","c"),
  eval_func=paste
)

After creating the ffexp object, we can call f1$run_all to run eval_func on every combination of a and b.

f1$run_all()

Now to see the results in a clean format, look at f1$outcleandf.

f1$outcleandf

##   a b  V1 runtime          start_time            end_time run_number
## 1 1 a 1 a       0 2024-09-29 12:53:37 2024-09-29 12:53:37          1
## 2 2 a 2 a       0 2024-09-29 12:53:37 2024-09-29 12:53:37          2
## 3 3 a 3 a       0 2024-09-29 12:53:37 2024-09-29 12:53:37          3
## 4 1 b 1 b       0 2024-09-29 12:53:37 2024-09-29 12:53:37          4
## 5 2 b 2 b       0 2024-09-29 12:53:37 2024-09-29 12:53:37          5
## 6 3 b 3 b       0 2024-09-29 12:53:37 2024-09-29 12:53:37          6
## 7 1 c 1 c       0 2024-09-29 12:53:37 2024-09-29 12:53:37          7
## 8 2 c 2 c       0 2024-09-29 12:53:37 2024-09-29 12:53:37          8
## 9 3 c 3 c       0 2024-09-29 12:53:37 2024-09-29 12:53:37          9
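For comparison, the same nine combinations can be produced in base R with expand.grid. This sketch only shows what "full factorial" means for this small example; it does not record run times or start and end times as ffexp does, and it is not how ffexp is implemented.

```r
# All combinations of a and b, with a varying fastest as in the table above.
grid <- expand.grid(a = 1:3, b = c("a", "b", "c"), stringsAsFactors = FALSE)

# Apply the evaluation function to each row of the grid.
grid$V1 <- unname(mapply(paste, grid$a, grid$b))
grid
```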

Collin Erickson

2024-09-29
