For even quicker usage instructions, see the README.

The general idea of this package is that you can throw nearly any model built in the typical R fashion. For now, we just need a somewhat standard way of extracting fixed effects, and that `update`

works, which is nearly always the case.

Assuming your model is called `base_model`

, if `coef(summary(base_model))`

gives the estimates and standard errors, the method will work. The result of `coef`

is often a list; this package will then produce estimates for each element of the list. Common examples where there are multiple sets of coefficients are zero-inflated models.

The data should be supplied manually, although it’s currently the case that if the dataframe that creates the model is accessible from `base_model$model`

, `base_model$frame`

or `base_model@frame`

, you won’t have to send the data in to the function separately (but you will get a warning). Be careful here - some models supply this but the method introduces bugs. Much safer to supply the data explicitly.

Let’s work through an example with `glmmTMB`

. We’ll use the data that comes with `glmmboot`

.

```
library(glmmboot)
data(test_data)
head(test_data)
# x_var1 x_var2 x_var3 subj y
# 1 -0.3603511 2.7427146 0.2636070 3 1.00000000
# 2 2.6866123 1.5713332 0.1509874 20 0.08064748
# 3 0.6342672 0.9936967 0.1261363 19 0.43073744
# 4 1.1648519 1.6567148 0.6139765 33 0.98353502
# 5 0.7842167 1.3459927 -0.7063382 1 1.00000000
# 6 2.8261973 1.9342870 1.2697793 3 1.00000000
```

We’re assuming the `x_var`

variables are fixed, and `subj`

is to be treated as a random effect.

Thus our base analysis is:

```
library(glmmTMB)
model_formula <- as.formula(y ~ x_var1 + x_var2 + x_var2 + (1 | subj))
base_run <- glmmTMB(formula = model_formula,
data = test_data,
family = binomial)
# Warning in eval(family$initialize): non-integer #successes in a binomial
# glm!
```

We get a warning because the outcome data is proportional. Not to worry.

Now we’ll use the bootstrap. By default it’ll perform block bootstrapping over the highest entropy random effect - but there’s only one, so of course the winner is `subj`

.

```
bootstrap_over_subj <- bootstrap_model(base_model = base_run,
base_data = test_data,
resamples = 99)
# Performing block resampling, over subj
```

For publications etc, better to run about ten thousand resamples to avoid noise having much of an effect. Of course 99 is far too small, only for an example.

And comparing results:

```
print(bootstrap_over_subj)
# estimate boot 2.5% boot 97.5% boot p_value base p_value
# (Intercept) 0.6018968 -0.1852 1.2099 0.14 0.1636
# x_var1 0.1909343 -0.1195 0.5351 0.24 0.2338
# x_var2 0.3140507 0.0856 0.5092 0.02 0.0396
# base 2.5% base 97.5% boot/base width
# (Intercept) -0.2449 1.4487 0.8237075
# x_var1 -0.1234 0.5052 1.0415190
# x_var2 0.0150 0.6131 0.7082128
```

The above might take a long time in a real setting. If it takes far too long on your machine, you can ideally run it on a bunch of computers. We don’t want each computer to output the fully processed output, only the intermediate outcome. To do this, we set `return_coefs_instead = TRUE`

for each run:

```
b_list1 <- bootstrap_model(base_model = base_run,
base_data = test_data,
resamples = 29,
return_coefs_instead = TRUE)
# Performing block resampling, over subj
b_list2 <- bootstrap_model(base_model = base_run,
base_data = test_data,
resamples = 30,
return_coefs_instead = TRUE)
# Performing block resampling, over subj
b_list3 <- bootstrap_model(base_model = base_run,
base_data = test_data,
resamples = 30,
return_coefs_instead = TRUE)
# Performing block resampling, over subj
```

Combining this is simple enough. If we’ve used a few, we don’t want to mess around with even more lists, so we can enter them into the relevant function:

```
print(combine_resampled_lists(b_list1, b_list2, b_list3))
# estimate boot 2.5% boot 97.5% boot p_value base p_value
# (Intercept) 0.6018968 -0.0381 1.3170 0.0667 0.1636
# x_var1 0.1909343 -0.1595 0.4680 0.2444 0.2338
# x_var2 0.3140507 0.0401 0.5377 0.0222 0.0396
# base 2.5% base 97.5% boot/base width
# (Intercept) -0.2449 1.4487 0.8000964
# x_var1 -0.1234 0.5052 0.9983490
# x_var2 0.0150 0.6131 0.8319557
```

If we’ve run a huge number of such runs, ideally we’ll combine all output to a list of lists, like so:

And we’ll get the same result:

```
print(combine_resampled_lists(list_of_lists_output))
# estimate boot 2.5% boot 97.5% boot p_value base p_value
# (Intercept) 0.6018968 -0.0381 1.3170 0.0667 0.1636
# x_var1 0.1909343 -0.1595 0.4680 0.2444 0.2338
# x_var2 0.3140507 0.0401 0.5377 0.0222 0.0396
# base 2.5% base 97.5% boot/base width
# (Intercept) -0.2449 1.4487 0.8000964
# x_var1 -0.1234 0.5052 0.9983490
# x_var2 0.0150 0.6131 0.8319557
```

You MUST set `return_coefs_instead = TRUE`

for methods like these that combine output.

There are two primary ways to run models in parallel:

`parallel`

: this will use`parallel::mclapply`

to run the models.`future`

: this will use`future.apply::future_lapply`

to run the models

Use he parameter `parallelism`

in the function `bootstrap_model()`

to choose how to parallelise.

`parallelism = "none"`

This is the default, and will run each model sequentially, using `lapply()`

. If you set `num_cores`

to a number greater than 1, you’ll get an error with this option.

This isn’t the default if you set `num_cores`

to a value greater than 1.

`parallelism = "parallel"`

This will use `parallel::mclapply`

to run the models. If you set `num_cores`

, then that many cores will be used. If you don’t set `num_cores`

, the function will use `num_cores = parallel::detectCores() - 1`

.

Example:

```
## will use `parallel::detectCores() - 1` cores
model_results <- bootstrap_model(base_model = some_base_run,
base_data = some_interesting_data,
resamples = 9999,
parallelism = "parallel")
## will use 4 cores:
model_results <- bootstrap_model(base_model = some_base_run,
base_data = some_interesting_data,
resamples = 9999,
parallelism = "parallel",
num_cores = 4)
```

I’ve heard that `parallel::mclapply`

doesn’t play well on Windows. Instead of implementing `parallel::parLapply`

(which I will do if there’s interest), I would recommend using `parallelism = "future"`

, see below.

This becomes the default if you don’t set `parallelism`

but just `num_cores`

.

`parallelism = "future"`

This uses the very nice `future.apply::future_lapply`

function to run the models in parallel. Note that you MUST set the `future::plan`

(see the docs for the `future`

and `future.apply`

packages) to actually make use of multiple cores etc. Using `num_cores`

is NOT the right way to set the backend, and will cause an error.

This should work well with Windows.

Example:

`num_cores`

If you don’t touch the `parallelism`

argument, but just set `num_cores`

, e.g.:

```
model_results <- bootstrap_model(base_model = some_base_run,
base_data = some_interesting_data,
resamples = 9999,
num_cores = 8)
```

It’ll be as if you set `parallelism = "parallel"`

, and will use the `parallel::mclapply`

backend (unless you set `num_cores = 1`

!).

Unfortunately currently none of these methods have any progress bars setup. There used to be a call to a parallel package that did this, but it seemed to have some bugs. Hopefully this will be possible in the future.

The (not great) options for now: * run a small set to estimate timing before running a large set of models * using the split approach (see the Combining Runs approach above), print out some intermediate output between runs.

Let’s look at an example of the second approach, using the `future`

argument. First, we’ll write a somewhat rough logging function:

```
log_remaining <- function(start_time,
cur_time,
j,
total_iters,
time_units = "hours"){
cur_time <- Sys.time()
total_time <- difftime(cur_time, start_time, units = time_units)
est_remaining <- (total_iters - j) * (total_time / j)
paste0("[", cur_time, "] [iteration ", j, "] ",
"total ", time_units, ": ", round(total_time, 3), " // ",
"est remaining ", time_units, ": ", round(est_remaining, 3))
}
```

This looks like:

```
total_iterations <- 5
start_time <- Sys.time()
for (j in 1:total_iterations) {
## simulate expensive operation...
Sys.sleep(2)
print(log_remaining(start_time, Sys.time(), j, total_iterations, "secs"))
}
# [1] "[2019-09-02 13:29:30] [iteration 1] total secs: 2.005 // est remaining secs: 8.019"
# [1] "[2019-09-02 13:29:32] [iteration 2] total secs: 4.014 // est remaining secs: 6.021"
# [1] "[2019-09-02 13:29:34] [iteration 3] total secs: 6.014 // est remaining secs: 4.01"
# [1] "[2019-09-02 13:29:36] [iteration 4] total secs: 8.015 // est remaining secs: 2.004"
# [1] "[2019-09-02 13:29:38] [iteration 5] total secs: 10.021 // est remaining secs: 0"
```

As before, don’t forget `return_coefs_instead = TRUE`

. We’d run:

```
library(future)
plan("multiprocess")
results_list <- list()
num_blocks <- 50
runs_per_block <- 100
start_time <- Sys.time()
for (j in 1:num_blocks) {
results_list[[j]] <- bootstrap_model(base_model = base_run,
base_data = test_data,
resamples = runs_per_block,
parallelism = "future",
return_coefs_instead = TRUE,
suppress_sampling_message = TRUE)
print(log_remaining(start_time, Sys.time(), j, num_blocks, "secs"))
}
combined_results <- combine_resampled_lists(results_list)
```

Of course with 50 blocks of size 100, we’d get the equivalent of 5000 samples.

Let’s first reuse the model from the glmmTMB vignettes:

```
owls <- transform(Owls,
nest = reorder(Nest, NegPerChick),
ncalls = SiblingNegotiation,
ft = FoodTreatment)
fit_zipoisson <- glmmTMB(
ncalls ~ (ft + ArrivalTime) * SexParent +
offset(log(BroodSize)) + (1 | nest),
data = owls,
ziformula = ~1,
family = poisson)
```

We run this more or less the same way as the other models:

In this case the output would be a list with two entries, one matrix in `zero_boot$cond`

, the typical `conditional`

model, and another matrix (in this case a single row, because `ziformula = ~1`

) in `zero_cond$zi`

, each giving the same shape output as we’ve seen above.

In this particular model, the bootstrap model is quite conservative. Assumptions are often useful!