emba

Intro

The emba R package name stands for Ensemble (Boolean) Model Biomarker Analysis. It’s main purpose is to be used on a dataset consisted of an ensemble of boolean models. These models are usually (but not necessarily) different versions of the same initial model, parameterized in different ways (e.g. some boolean operators in the model equations have changed from OR to AND or vice-versa). A prerequisite for using this package, is that this model dataset must be tested in-silico (using some computational way) against a list of drug combinations, in order to assess which drugs combinations behave synergistically for which models. An example software that generates such boolean model ensembles and performs a comprehensive drug response analysis on them is the DrugLogics NTNU software pipeline (see respective documentation).

Given a list of gold-standard (lab-observed/verified) synergies 1, this package enables the easy grouping of the models into different classes based on a specific performance metric evaluation. This model classification enables the discovery and visualization of biomarkers - nodes whose activity and/or boolean model parameterization might affect either the prediction performance of those models or the manifestation of the predicted synergies.

In the next sections we will describe the main inputs and outputs of the general analysis functions (which group a lot of functionality into one) and provide some insights on the implementation behind. Biomarkers will be assessed and visualized using a test dataset generated from the DrugLogics software mentioned above.

The complementary R package usefun has various helpful functions that are used both inside the emba package and during the analysis below.

For further analyses using this package on boolean model ensemble datasets see this GitHub repository. See also an example that demonstrates all the intermediate steps included in the general analysis functions as well as other miscellaneous usages that this guide does not cover. Lastly, you might also want to check a nice presentation I made for a conference about this package.

Setup

# libraries
library(emba)
library(usefun)
library(dplyr)
library(knitr)
library(Ckmeans.1d.dp)

# wrapper to change printing to invisible
pr = function(x) invisible(x)

Input

Test dataset

The test dataset we will use has \(7500\) boolean models with \(139\) nodes each. It helps to think of each boolean model as a network of nodes where the edges represent either activation or inhibition of the corresponding target and the nodes activity can be either active (1) or inactive (0).

The models have been assessed for synergy against a total of \(153\) drug combinations.

data.list = readRDS(url("https://github.com/bblodfon/emba/blob/main/vignettes/data.rds?raw=true"))

model.predictions = data.list$model.predictions
models.stable.state = data.list$models.stable.state
models.link.operator = data.list$models.equations
observed.synergies = data.list$observed.synergies

# (x,y) coordinates for visualization
nice.layout = data.list$nice.layout
# model network as an igraph object
net = data.list$net

# drug combinations
drug.combos = colnames(model.predictions)

# change model names (shorter names for readability)
model.names = paste0("model", 1:7500)
rownames(model.predictions) = model.names
rownames(models.stable.state) = model.names
rownames(models.link.operator) = model.names

Model Predictions

This data represents the results of in-silico testing the boolean models against a drug combination dataset. More specifically, the model predictions is a data.frame whose values (corresponding to a specific model-drug combination element) can be one of the following:

model.predictions[1:5, 77:84] %>% kable(caption = "Model predictions example")
Model predictions example
PI-JN PI-D1 PI-60 PI-SB PI-RU PI-D4 PI-F4 PI-ST
model1 0 0 0 0 0 0 0 NA
model2 0 0 0 0 0 0 0 NA
model3 NA NA NA NA NA NA NA NA
model4 0 1 0 NA 0 NA 0 NA
model5 0 1 0 0 0 0 1 NA

Model stable states

Each model must have a stable state configuration where the nodes have fixed to either 0 (inactive state) or 1 (active state). In other words, a fixpoint attractor. Of course, if a model has multiple attractors or other methods are used to derive a solution to the system of boolean equations that is the model itself, then continuous activity state values (in the \([0,1]\) interval) are also supported.

models.stable.state[1:5, 5:11] %>% kable(caption = "Model stable states example")
Model stable states example
MAP3K4 MAP2K4 IKBKG IKBKB AKT1 BRAF SMAD3
model1 0 1 1 1 1 0 1
model2 0 1 1 1 0 0 0
model3 0 1 1 1 1 0 1
model4 0 1 1 1 0 0 1
model5 0 1 1 0 0 0 1

Observed (GS) synergies

A list of gold standard (GS) drug combinations which have been termed as synergistic via experimental and/or other computational methods. These drug combinations must be a subset of the ones tested in the models (the column names of the model.predictions data).

usefun::pretty_print_vector_values(observed.synergies, vector.values.str = "observed synergies")

17 observed synergies: AK-60, AK-BI, AK-D1, PI-D1, PD-G2, AK-G4, D1-G4, PI-JN, BI-P5, PD-P5, PI-P5, AK-PD, BI-PD, AK-PI, BI-PI, PD-PI, PK-ST

Performance biomarkers

Performance biomarkers are nodes in our studied networks (boolean models) whose activity state and/or boolean model parameterization (link operator) affects the prediction performance of those models. These nodes can be thus used as indicators of either activity or structural changes that have a positive effect on the prediction performance of our models.

The model performance can be assessed via various ways. In this package we offer two ways to group the models to different classification categories: either based on the number of true positive (TP) predictions or on the Matthews correlation coefficient (MCC) score with respect to the drug combination dataset tested for synergy. The function emba::biomarker_tp_analysis() is used for the former classification and the function emba::biomarker_mcc_analysis() for the latter. Note that it’s generally better to use the MCC classification, since it’s a more robust performance evaluation metric compared to the number of TP predictions, since it takes into account all of the four confusion matrix values.

When the models have been grouped to different classification categories, their nodes activity or boolean model parameterization can be summarised in each group and compared to the others, obtaining thus the expected biomarkers using the methodology described below.

TP-based analysis

We use the emba::biomarker_tp_analysis() function with the specified inputs:

tp.analysis.res = emba::biomarker_tp_analysis(
  model.predictions, 
  models.stable.state, 
  models.link.operator, 
  observed.synergies, 
  penalty = 0.1,
  threshold = 0.55)

The penalty term is used to reduce the bias when model groups have different sizes. For example, if I were to compare the average activity of nodes between two groups of models, with respective group sizes 5 and 1000, then the result would be heavily biased towards the group with the larger size, making thus the quality of the results coming out of this comparison questionable. As such, with penalty values closer to 0, more bias is introduced and we expect more biomarkers to be found. The default value of \(0.1\) is a good rule-of-thumb choice for minimizing such biases. See more info on emba::get_vector_diff().


As a first result, we get the predicted synergies - i.e. the drug combinations that are a subset of the observed ones and were predicted by at least one of the models in the dataset:

usefun::pretty_print_vector_values(tp.analysis.res$predicted.synergies, vector.values.str = "predicted synergies")

5 predicted synergies: AK-PD, BI-PD, BI-PI, PD-PI, PI-D1

The percentage of true positive predicted synergies is thus 29.4%. Such a low number might be a sign that the models quality is poor (need for a different parameterization) or other reasons like incorrect assessment of the gold standard synergies, etc.

The next informative barplot shows the distribution of models according to their true positive predictions:

pr(emba::make_barplot_on_models_stats(table(tp.analysis.res$models.synergies.tp), 
  title = "True Positive Synergy Predictions",
  xlab = "Number of maximum correctly predicted synergies",
  ylab = "Number of models"))

Next result we get is the average activity differences per network node for all group classifications:

tp.analysis.res$diff.state.tp.mat %>% 
  as.data.frame() %>%
  select(c("AKT","PTEN","PSEN1","STAT3","CEBPA")) %>% # show only part of the matrix
  kable(caption = "Average Activity Difference Matrix")
Average Activity Difference Matrix
AKT PTEN PSEN1 STAT3 CEBPA
(0,1) -0.0278105 0.4462906 -0.0012761 0.2316447 0.0283239
(0,2) -0.0851093 0.3814671 0.0104114 0.0200058 0.0118586
(0,3) 0.0175886 0.2594653 0.4728458 -0.3232749 0.4327603
(1,2) -0.0630977 0.0363188 0.0113038 -0.1576796 -0.0099449
(1,3) 0.0306321 0.0456991 0.4695673 -0.4304678 0.4157744
(2,3) 0.0916724 0.0305575 0.6086029 -0.4379904 0.5551273

In our case, threshold = 0.55 and thus CEBPA and PSEN1 are returned as active biomarkers:

usefun::pretty_print_vector_values(tp.analysis.res$biomarkers.tp.active,
  vector.values.str = "active biomarkers")

2 active biomarkers: PSEN1, CEBPA

usefun::pretty_print_vector_values(tp.analysis.res$biomarkers.tp.inhibited,
  vector.values.str = "inhibited biomarkers")

0 inhibited biomarkers:

With the models initial network as an igraph object (see emba::construct_network() on how to create such a net object), we can visualize every row of the above matrix as follows:

pr(emba::plot_avg_state_diff_graph(net, tp.analysis.res$diff.state.tp.mat["(2,3)",], 
  layout = nice.layout, title = "Bad models (2 TP) vs Good models (3 TP)"))

Note that with less penalty, more bias would be introduced and thus more biomarkers would be found (even for a higher chosen threshold):

tp.analysis.res.biased = emba::biomarker_tp_analysis(
  model.predictions, 
  models.stable.state, 
  models.link.operator, 
  observed.synergies, 
  penalty = 0,
  threshold = 0.7)

usefun::pretty_print_vector_values(tp.analysis.res.biased$biomarkers.tp.active,
  vector.values.str = "active biomarkers")

13 active biomarkers: PSEN1, CEBPA, MAPK8IP1, MAPK9, JAK1, TYK2, JAK3, IFNGR2/INFGR1, IFNGR1, PTPN11, IFNGR2, IL2RB, IL10RA

usefun::pretty_print_vector_values(tp.analysis.res.biased$biomarkers.tp.inhibited,
  vector.values.str = "inhibited biomarkers")

17 inhibited biomarkers: MAP3K7, MAP2K6, MAP2K3, NLK, IKBKG, STAT3, RXRA, SOCS3, TGFB1, HSPA1A, SALL4, ROCK1, TGFBR1, TRAF6, RHOA, PIK3R1, CASP9

Last result we get is the average link operator differences per network node (whose boolean equation had a link operator) for all group classifications:

tp.analysis.res$diff.link.tp.mat %>% 
  as.data.frame() %>%
  select(c("AKT","PTEN","PSEN1","STAT3","CEBPA")) %>% # show only part of the matrix
  kable(caption = "Average Link Operator Difference Matrix")
Average Link Operator Difference Matrix
AKT PTEN PSEN1 STAT3 CEBPA
(0,1) 0.2019689 0.4291810 -0.0047786 0.2316447 0.0269599
(0,2) 0.1116284 0.3682466 -0.1722642 0.0200058 0.0101784
(0,3) 0.1588424 0.2512846 0.2915064 -0.3232749 0.4356235
(1,2) -0.0440675 0.0363188 -0.1671873 -0.1576796 -0.0105660
(1,3) 0.0617622 0.0456991 0.2913783 -0.4304678 0.4192610
(2,3) 0.1171370 0.0305575 0.5194769 -0.4379904 0.5602202

In our case, threshold = 0.55 and thus CEBPA is returned as an OR link operator biomarker:

usefun::pretty_print_vector_values(tp.analysis.res$biomarkers.tp.or,
  vector.values.str = "'OR' biomarkers")

1 ‘OR’ biomarker: CEBPA

usefun::pretty_print_vector_values(tp.analysis.res$biomarkers.tp.and,
  vector.values.str = "'AND' biomarkers")

0 ‘AND’ biomarkers:

We can also visualize every row of the average link operator differences matrix as follows:

pr(emba::plot_avg_link_operator_diff_graph(net, tp.analysis.res$diff.link.tp.mat["(2,3)",], 
  layout = nice.layout, title = "Bad models (2 TP) vs Good models (3 TP)"))

Interpreting the result regarding the CEBPA biomarker, we look back at its boolean equation and we see that the higher performance models must have the OR NOT link operator in order for CEBPA to be in an active (ON) state (an AND NOT results mostly on an inhibited state for CEBPA):

CEBPA = (GSK3B OR MAP2K1 OR MEK1/2) OR NOT CTNNB1

MCC-based analysis

We use the emba::biomarker_mcc_analysis() function with the specified inputs:

mcc.analysis.res = emba::biomarker_mcc_analysis(
  model.predictions, 
  models.stable.state, 
  models.link.operator, 
  observed.synergies, 
  threshold = 0.65,
  num.of.mcc.classes = 4,
  penalty = 0.2)

First result is the predicted synergies, which are the same as the ones found with the TP-based analysis (the model predictions did not change). As such, the drug combinations which were predicted by at least one of the models in the dataset are:

usefun::pretty_print_vector_values(mcc.analysis.res$predicted.synergies, vector.values.str = "predicted synergies")

5 predicted synergies: AK-PD, BI-PD, BI-PI, PD-PI, PI-D1

We can get a first idea of the range and distribution of the models MCC scores with the next barplot:

pr(emba::make_barplot_on_models_stats(table(mcc.analysis.res$models.mcc), 
  title = "MCC scores", xlab = "MCC value", 
  ylab = "Number of models", cont.values = TRUE))

We can also plot the MCC-model histogram, which in addition shows the estimated density (how many models) and width (MCC range) of each MCC class:

models.mcc = mcc.analysis.res$models.mcc
num.of.mcc.classes = 4

res = Ckmeans.1d.dp(x = models.mcc, k = num.of.mcc.classes)
models.cluster.ids = res$cluster

pr(emba::plot_mcc_classes_hist(models.mcc, models.cluster.ids, num.of.mcc.classes))

Next result we get is the average activity differences per network node for all group classifications:

mcc.analysis.res$diff.state.mcc.mat %>%
  as.data.frame() %>%
  select(c("AKT","PPM1A","PTEN","PSEN1","PTK2","CEBPA")) %>% # show only part of the matrix
  kable(caption = "Average Activity Difference Matrix")
Average Activity Difference Matrix
AKT PPM1A PTEN PSEN1 PTK2 CEBPA
(1,2) -0.0263106 0.5637404 0.5637404 0.0002462 -0.5637404 -0.0180814
(1,3) -0.0419764 0.6791224 0.6791224 -0.0004583 -0.6791224 0.0189502
(1,4) -0.0551735 0.6416618 0.6416618 0.0039844 -0.6416618 -0.0054756
(2,3) -0.0196324 0.1787135 0.1787135 -0.0007511 -0.1787135 0.0390053
(2,4) -0.0300506 0.1517566 0.1517566 0.0034388 -0.1517566 0.0088853
(3,4) -0.0138152 0.0141918 0.0141918 0.0036747 -0.0141918 -0.0191041

In our case, threshold = 0.65 and thus PTEN and PPM1A are returned as active biomarkers and PTK2 as an inhibited biomarker:

usefun::pretty_print_vector_values(mcc.analysis.res$biomarkers.mcc.active,
  vector.values.str = "active biomarkers")

2 active biomarkers: PTEN, PPM1A

usefun::pretty_print_vector_values(mcc.analysis.res$biomarkers.mcc.inhibited,
  vector.values.str = "inhibited biomarkers")

1 inhibited biomarker: PTK2

Note that looking at the respective boolean equations:

we conclude that the only activity biomarker of interest is PTEN as it’s the only regulator whose state directly influences the PPM1A and PTK2 nodes.

With the models initial network as an igraph object (see emba::construct_network() on how to create such a net object), we can visualize every row of the above matrix as follows:

pr(emba::plot_avg_state_diff_graph(net, mcc.analysis.res$diff.state.mcc.mat["(1,4)",], 
  layout = nice.layout, title = "Bad models (MCC Class 1) vs Good models (MCC Class 4)"))

Last result we get is the average link operator differences per network node (whose boolean equation had a link operator) for all group classifications:

mcc.analysis.res$diff.link.mcc.mat %>% 
  as.data.frame() %>%
  select(c("AKT","PTEN","PSEN1","CEBPA","STAT3","JAK1")) %>% # show only part of the matrix
  kable(caption = "Average Link Operator Difference Matrix")
Average Link Operator Difference Matrix
AKT PTEN PSEN1 CEBPA STAT3 JAK1
(1,2) 0.2793768 0.5880847 0.0466807 -0.0280618 -0.3069948 0.2806183
(1,3) 0.3182545 0.6791224 0.0115062 0.0114039 0.0664054 -0.0648736
(1,4) 0.2860364 0.6416618 -0.0109073 -0.0060220 -0.6043843 0.5783232
(2,3) 0.0684637 0.1542205 -0.0343283 0.0407584 0.3818048 -0.3535848
(2,4) 0.0461114 0.1330813 -0.0457409 0.0160442 -0.3147733 0.3112794
(3,4) -0.0055788 0.0141918 -0.0179237 -0.0137708 -0.5550172 0.5321063

In our case, threshold = 0.65 and thus PTEN is returned as an OR link operator biomarker:

usefun::pretty_print_vector_values(mcc.analysis.res$biomarkers.mcc.or,
  vector.values.str = "'OR' biomarkers")

1 ‘OR’ biomarker: PTEN

usefun::pretty_print_vector_values(mcc.analysis.res$biomarkers.mcc.and,
  vector.values.str = "'AND' biomarkers")

0 ‘AND’ biomarkers:

We can also visualize every row of the average link operator differences matrix as in the following example:

pr(emba::plot_avg_link_operator_diff_graph(net, mcc.analysis.res$diff.link.mcc.mat["(1,4)",], 
  layout = nice.layout, title = "Bad models (MCC Class 1) vs Good models (MCC Class 4)"))