Are you a machine learning researcher? Do you publish at NeurIPS and push the state-of-the-art in CV and NLP? This guide will serve as your first introduction to core Keras & TensorFlow API concepts.
In this guide, you will learn about:
Layer
] classadd_loss()
methodtensorflow::tf_function()
]You will also see the Keras API in action in two end-to-end research examples: a Variational Autoencoder, and a Hypernetwork.
TensorFlow is an infrastructure layer for differentiable programming. At its heart, it’s a framework for manipulating N-dimensional arrays (tensors), much like NumPy.
However, there are three key differences between NumPy and TensorFlow:
Let’s take a look at the object that is at the core of TensorFlow: the Tensor.
Here’s a constant tensor:
## tf.Tensor(
## [[5. 2.]
## [1. 3.]], shape=(2, 2), dtype=float64)
You can get its value as a R array by calling
as.array()
:
## [,1] [,2]
## [1,] 5 2
## [2,] 1 3
It features the attributes dtype
and
shape
:
## tf.float64
## TensorShape([2, 2])
A common way to create constant tensors is via tf$ones
and tf$zeros
:
## tf.Tensor(
## [[1.]
## [1.]], shape=(2, 1), dtype=float32)
## tf.Tensor(
## [[0.]
## [0.]], shape=(2, 1), dtype=float32)
You can also create random constant tensors:
Variables are special tensors used to store mutable state (such as
the weights of a neural network). You create a Variable
using some initial value:
## <tf.Variable 'Variable:0' shape=(2, 2) dtype=float32, numpy=
## array([[ 0.9057419 , 0.7916686 ],
## [ 0.28754202, -0.5408822 ]], dtype=float32)>
You update the value of a Variable
by using the methods
$assign(value)
, $assign_add(increment)
, or
$assign_sub(decrement)
:
## <tf.Variable 'UnreadVariable' shape=(2, 2) dtype=float32, numpy=
## array([[-0.3405368 , -2.1463926 ],
## [ 1.2602988 , 0.12241419]], dtype=float32)>
## <tf.Variable 'UnreadVariable' shape=(2, 2) dtype=float32, numpy=
## array([[ 0.04820395, -2.6854615 ],
## [ 0.23246336, 1.4535258 ]], dtype=float32)>
If you’ve used NumPy, doing math in TensorFlow will look very familiar. The main difference is that your TensorFlow code can run on GPU and TPU.
Here’s another big difference with R: you can automatically retrieve the gradient of any differentiable expression.
Just open a GradientTape
, start “watching” a tensor via
tape$watch()
, and compose a differentiable expression using
this tensor as input:
a <- random_normal(shape=c(2, 2))
b <- random_normal(shape=c(2, 2))
with(tf$GradientTape() %as% tape, {
tape$watch(a) # Start recording the history of operations applied to `a`
c <- tf$sqrt(tf$square(a) + tf$square(b)) # Do some math using `a`
# What's the gradient of `c` with respect to `a`?
dc_da <- tape$gradient(c, a)
print(dc_da)
})
## tf.Tensor(
## [[ 0.9969011 -0.7707146 ]
## [ 0.23378514 0.96255165]], shape=(2, 2), dtype=float32)
By default, variables are watched automatically, so you don’t need to
manually watch
them:
a <- tf$Variable(a)
with(tf$GradientTape() %as% tape, {
c <- tf$sqrt(tf$square(a) + tf$square(b))
dc_da <- tape$gradient(c, a)
print(dc_da)
})
## tf.Tensor(
## [[ 0.9969011 -0.7707146 ]
## [ 0.23378514 0.96255165]], shape=(2, 2), dtype=float32)
Note that you can compute higher-order derivatives by nesting tapes:
with(tf$GradientTape() %as% outer_tape, {
with(tf$GradientTape() %as% tape, {
c <- tf$sqrt(tf$square(a) + tf$square(b))
dc_da <- tape$gradient(c, a)
})
d2c_da2 <- outer_tape$gradient(dc_da, a)
print(d2c_da2)
})
## tf.Tensor(
## [[3.3447742e-03 7.1282005e-01]
## [5.7464113e+00 5.5013180e-02]], shape=(2, 2), dtype=float32)
While TensorFlow is an infrastructure layer for differentiable programming, dealing with tensors, variables, and gradients, Keras is a user interface for deep learning, dealing with layers, models, optimizers, loss functions, metrics, and more.
Keras serves as the high-level API for TensorFlow: Keras is what makes TensorFlow simple and productive.
The Layer
class is the fundamental abstraction in Keras.
A Layer
encapsulates a state (weights) and some computation
(defined in the call method).
A simple layer looks like this. The self$add_weight()
method gives you a shortcut for creating weights:
Linear <- new_layer_class(
"Linear",
initialize = function(units = 32, input_dim = 32) {
super$initialize()
self$w <- self$add_weight(
shape = shape(input_dim, units),
initializer = "random_normal",
trainable = TRUE
)
self$b <- self$add_weight(
shape = shape(units),
initializer = "zeros",
trainable = TRUE
)
},
call = function(inputs) {
tf$matmul(inputs, self$w) + self$b
}
)
You would use a Layer
instance much like a R
function:
# Instantiate our layer.
linear_layer <- Linear(units=4, input_dim=2)
# The layer can be treated as a function.
# Here we call it on some data.
y <- linear_layer(tf$ones(shape(2, 2)))
The weight variables (created in initialize
) are
automatically tracked under the weights
property:
## [[1]]
## <KerasVariable shape=(2, 4), dtype=float32, path=linear/variable>
##
## [[2]]
## <KerasVariable shape=(4), dtype=float32, path=linear/variable_1>
You have many built-in layers available, from Dense
to
Conv2D
to LSTM
to fancier ones like
Conv3DTranspose
or ConvLSTM2D
. Be smart about
reusing built-in functionality.
build(input_shape)
It’s often a good idea to defer weight creation to the
build()
method, so that you don’t need to specify the input
dim/shape at layer construction time:
Linear <- new_layer_class(
"Linear",
initialize = function(units = 32) {
super$initialize()
self$units <- units
},
build = function(input_shape) {
self$w <- self$add_weight(
shape = shape(input_shape[-1], self$units),
initializer = "random_normal",
trainable = TRUE
)
self$b <- self$add_weight(
shape = shape(self$units),
initializer = "zeros",
trainable = TRUE
)
},
call = function(inputs) {
tf$matmul(inputs, self$w) + self$b
}
)
# Instantiate our layer.
linear_layer <- Linear(units = 4)
# This will also call `build(input_shape)` and create the weights.
y <- linear_layer(tf$ones(shape(2, 2)))
You can automatically retrieve the gradients of the weights of a
layer by calling it inside a GradientTape
. Using these
gradients, you can update the weights of the layer, either manually, or
using an optimizer object. Of course, you can modify the gradients
before using them, if you need to.
# Prepare a dataset.
c(c(x_train, y_train), .) %<-% dataset_mnist()
x_train <- array_reshape(x_train, c(60000, 784)) / 255
dataset <- tfdatasets::tensor_slices_dataset(list(x_train, y_train)) %>%
tfdatasets::dataset_shuffle(buffer_size=1024) %>%
tfdatasets::dataset_batch(64)
# Instantiate our linear layer (defined above) with 10 units.
linear_layer <- Linear(units = 10)
# Instantiate a logistic loss function that expects integer targets.
loss_fn <- loss_sparse_categorical_crossentropy(from_logits=TRUE)
# Instantiate an optimizer.
optimizer <- optimizer_sgd(learning_rate=1e-3)
# Iterate over the batches of the dataset.
coro::loop(for(data in dataset) {
# Open a GradientTape.
with(tf$GradientTape() %as% tape, {
# Forward pass.
logits <- linear_layer(data[[1]])
# Loss value for this batch.
loss_value <- loss_fn(data[[2]], logits)
})
# Get gradients of the loss wrt the weights.
gradients <- tape$gradient(loss_value, linear_layer$trainable_weights)
# Update the weights of our linear layer.
optimizer$apply_gradients(zip_lists(gradients, linear_layer$trainable_weights))
})
loss_value
## tf.Tensor(1.2819729, shape=(), dtype=float32)
Weights created by layers can be either trainable or non-trainable.
They’re exposed in trainable_weights
and
non_trainable_weights
respectively. Here’s a layer with a
non-trainable weight:
ComputeSum <- new_layer_class(
"ComputeSum",
initialize = function(input_dim) {
super$initialize()
# Create a non-trainable weight.
self$total <- self$add_weight(
initializer = "zeros",
shape = shape(input_dim),
trainable = FALSE
)
},
call = function(inputs) {
self$total$assign_add(tf$reduce_sum(inputs, axis=0L))
self$total
}
)
my_sum <- ComputeSum(input_dim = 2)
x <- tf$ones(shape(2, 2))
as.array(my_sum(x))
## [1] 2 2
## [1] 4 4
## list()
Layers can be recursively nested to create bigger computation blocks. Each layer will track the weights of its sublayers (both trainable and non-trainable).
# Let's reuse the Linear class
# with a `build` method that we defined above.
MLP <- new_layer_class(
"MLP",
initialize = function() {
super$initialize()
self$linear_1 <- Linear(units = 32)
self$linear_2 <- Linear(units = 32)
self$linear_3 <- Linear(units = 10)
},
call = function(inputs) {
x <- self$linear_1(inputs)
x <- tf$nn$relu(x)
x <- self$linear_2(x)
x <- tf$nn$relu(x)
return(self$linear_3(x))
}
)
mlp <- MLP()
# The first call to the `mlp` object will create the weights.
y <- mlp(tf$ones(shape=shape(3, 64)))
# Weights are recursively tracked.
length(mlp$weights)
## [1] 6
Note that our manually-created MLP above is equivalent to the following built-in option:
Layers can create losses during the forward pass via the
add_loss()
method. This is especially useful for
regularization losses. The losses created by sublayers are recursively
tracked by the parent layers.
Here’s a layer that creates an activity regularization loss:
# A layer that creates an activity sparsity regularization loss
ActivityRegularization <- new_layer_class(
"ActivityRegularization",
initialize = function(rate=1e-2) {
super$initialize()
self$rate <- rate
},
call = function(inputs) {
self$add_loss(self$rate * tf$reduce_sum(tf$abs(inputs)))
inputs
}
)
Any model incorporating this layer will track this regularization loss:
# Let's use the loss layer in a MLP block.
SparseMLP <- new_layer_class(
"SparseMLP",
initialize = function() {
super$initialize()
self$linear_1 <- Linear(units = 32)
self$reg <- ActivityRegularization(rate = 1e-2)
self$linear_3 <- Linear(units = 10)
},
call = function(inputs) {
x <- self$linear_1(inputs)
x <- tf$nn$relu(x)
x <- self$reg(x)
return(self$linear_3(x))
}
)
mlp <- SparseMLP()
y <- mlp(tf$ones(shape(10, 10)))
mlp$losses # List containing one float32 scalar
## [[1]]
## tf.Tensor(0.18065463, shape=(), dtype=float32)
These losses are cleared by the top-level layer at the start of each
forward pass – they don’t accumulate. layer.losses
always
contains only the losses created during the last forward pass. You would
typically use these losses by summing them before computing your
gradients when writing a training loop.
## tf.Tensor(
## [[ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]], shape=(10, 10), dtype=float32)
## [1] 1
## tf.Tensor(
## [[ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]
## [ 0.0388482 -0.03920118 0.01624808 -0.01361975 -0.01354899 0.07107338
## -0.01077365 0.05688906 -0.02838149 -0.04084621]], shape=(10, 10), dtype=float32)
## [1] 1
# Let's demonstrate how to use these losses in a training loop.
# Prepare a dataset.
c(c(x_train, y_train), .) %<-% dataset_mnist()
x_train <- array_reshape(x_train, c(60000, 784)) / 255
dataset <- tfdatasets::tensor_slices_dataset(list(x_train, y_train)) %>%
tfdatasets::dataset_shuffle(buffer_size=1024) %>%
tfdatasets::dataset_batch(64)
# A new MLP.
mlp <- SparseMLP()
# Loss and optimizer.
loss_fn <- loss_sparse_categorical_crossentropy(from_logits=TRUE)
optimizer <- optimizer_sgd(learning_rate=1e-3)
coro::loop(for(data in dataset) {
x <- data[[1]]
y <- data[[2]]
with(tf$GradientTape() %as% tape, {
# Forward pass.
logits <- mlp(x)
# External loss value for this batch.
loss <- loss_fn(y, logits)
# Add the losses created during the forward pass.
loss <- loss + Reduce(`+`, mlp$losses)
# Get gradients of the loss wrt the weights.
gradients <- tape$gradient(loss, mlp$trainable_weights)
# Update the weights of our linear layer.
optimizer$apply_gradients(zip_lists(gradients, mlp$trainable_weights))
})
})
Keras offers a broad range of built-in metrics, like
metric_auc
or metric_precision_at_recall
. It’s
also easy to create your own metrics in a few lines of code.
To use a metric in a custom training loop, you would:
metric = metric_auc()
metric$udpate_state(targets, predictions)
method for each batch of datametric$result()
metric$reset_state()
Here’s a simple example:
# Instantiate a metric object
accuracy <- metric_sparse_categorical_accuracy()
# Prepare our layer, loss, and optimizer.
model <- keras_model_sequential() %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 10)
loss_fn <- loss_sparse_categorical_crossentropy(from_logits = TRUE)
optimizer <- optimizer_adam(learning_rate=1e-3)
for (epoch in seq_len(2)) {
coro::loop(for (data in dataset) {
x <- data[[1]]
y <- data[[2]]
with(tf$GradientTape() %as% tape, {
# Forward pass.
logits <- model(x)
# External loss value for this batch.
loss_value <- loss_fn(y, logits)
})
# Update the state of the `accuracy` metric.
accuracy$update_state(y, logits)
# Update the weights of the model to minimize the loss value.
gradients <- tape$gradient(loss_value, model$trainable_weights)
optimizer$apply_gradients(zip_lists(gradients, model$trainable_weights))
})
cat("Epoch:", epoch, "Accuracy:", as.numeric(accuracy$result()), "\n")
accuracy$reset_state()
}
## Epoch: 1 Accuracy: 0.8757833
## Epoch: 2 Accuracy: 0.93915
You can also define your own metrics by subclassing
keras.metrics.Metric
. You need to override the three
functions called above:
update_state()
to update the statistic
values.result()
to return the metric value.reset_state()
to reset the metric to its
initial state.Here is an example where we implement the F1-score metric (with support for sample weighting).
F1Score <- new_metric_class(
"F1Score",
initialize = function(self, name="f1_score", dtype="float32", threshold=0.5, ...) {
super$initialize(name=name, dtype=dtype, ...)
self$threshold <- threshold
self$true_positives <- self$add_weight(
name="tp", dtype=dtype, initializer="zeros"
)
self$false_positives <- self$add_weight(
name="fp", dtype=dtype, initializer="zeros"
)
self$false_negatives <- self$add_weight(
name="fn", dtype=dtype, initializer="zeros"
)
},
update_state = function(y_true, y_pred, sample_weight=NULL) {
y_pred <- tf$math$greater_equal(y_pred, self$threshold)
y_true <- tf$cast(y_true, tf$bool)
y_pred <- tf$cast(y_pred, tf$bool)
true_positives <- tf$cast(y_true & y_pred, self$dtype)
false_positives <- tf$cast((!y_true) & y_pred, self$dtype)
false_negatives <- tf$cast(y_true & (!y_pred), self$dtype)
if (!is.null(sample_weight)) {
sample_weight <- tf$cast(sample_weight, self$dtype)
true_positives <- true_positives * sample_weight
false_positives <- false_positives * sample_weight
false_negatives <- false_negatives * sample_weight
}
self$true_positives$assign_add(tf$reduce_sum(true_positives))
self$false_positives$assign_add(tf$reduce_sum(false_positives))
self$false_negatives$assign_add(tf$reduce_sum(false_negatives))
},
result = function() {
precision <- self$true_positives / (self$true_positives + self$false_positives)
recall <- self$true_positives / (self$true_positives + self$false_negatives)
f1_score <- 2 * precision * recall / (precision + recall)
f1_score
},
reset_state = function() {
self$true_positives$assign(0)
self$false_positives$assign(0)
self$false_negatives$assign(0)
}
)
Let’s test-drive it:
m <- F1Score()
m$update_state(c(0, 1, 0, 0), c(0.3, 0.5, 0.8, 0.9))
cat("Intermediate result:", as.numeric(m$result()), "\n")
## Intermediate result: 0.5
m$update_state(c(1, 1, 1, 1), c(0.1, 0.7, 0.6, 0.0))
cat("Final result:", as.numeric(m$result()), "\n")
## Final result: 0.6
Running eagerly is great for debugging, but you will get better
performance by compiling your computation into static graphs. Static
graphs are a researcher’s best friends. You can compile any function by
wrapping it in a tf.function
decorator.
# Prepare our layer, loss, and optimizer.
model <- keras_model_sequential() %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 10)
loss_fn <- loss_sparse_categorical_crossentropy(from_logits = TRUE)
optimizer <- optimizer_adam(learning_rate=1e-3)
# Create a training step function.
train_on_batch <- tf_function(function(x, y) {
with(tf$GradientTape() %as% tape, {
# Forward pass.
logits <- model(x)
# External loss value for this batch.
loss_value <- loss_fn(y, logits)
})
# Update the weights of the model to minimize the loss value.
gradients <- tape$gradient(loss_value, model$trainable_weights)
optimizer$apply_gradients(zip_lists(gradients, model$trainable_weights))
loss_value
})
# Prepare a dataset.
c(c(x_train, y_train), .) %<-% dataset_mnist()
x_train <- array_reshape(x_train, c(60000, 784)) / 255
dataset <- tfdatasets::tensor_slices_dataset(list(x_train, y_train)) %>%
tfdatasets::dataset_shuffle(buffer_size=1024) %>%
tfdatasets::dataset_batch(64)
i <- 0
coro::loop(for (data in dataset) {
i <- i + 1
x <- data[[1]]
y <- data[[2]]
loss <- train_on_batch(x, y)
if (i %% 100 == 0)
cat("Loss:", as.numeric(loss), "\n")
})
## Loss: 0.551749
## Loss: 0.2131135
## Loss: 0.2765952
## Loss: 0.1296219
## Loss: 0.2657076
## Loss: 0.2683381
## Loss: 0.1570166
## Loss: 0.3139241
## Loss: 0.08981849
Some layers, in particular the BatchNormalization
layer
and the Dropout
layer, have different behaviors during
training and inference. For such layers, it is standard practice to
expose a training
(boolean) argument in the
call
method.
By exposing this argument in call
, you enable the
built-in training and evaluation loops (e.g. fit) to correctly use the
layer in training and inference modes.
Dropout <- new_layer_class(
"Dropout",
initialize = function(rate) {
super$initialize()
self$rate <- rate
},
call = function(inputs, training = NULL) {
if (!is.null(training) && training) {
return(tf$nn$dropout(inputs, rate = self$rate))
}
inputs
}
)
MLPWithDropout <- new_layer_class(
"MLPWithDropout",
initialize = function() {
super$initialize()
self$linear_1 <- Linear(units = 32)
self$dropout <- Dropout(rate = 0.5)
self$linear_3 <- Linear(units = 10)
},
call = function(inputs, training = NULL) {
x <- self$linear_1(inputs)
x <- tf$nn$relu(x)
x <- self$dropout(x, training = training)
self$linear_3(x)
}
)
mlp <- MLPWithDropout()
y_train <- mlp(tf$ones(shape(2, 2)), training=TRUE)
y_test <- mlp(tf$ones(shape(2, 2)), training=FALSE)
To build deep learning models, you don’t have to use object-oriented programming all the time. All layers we’ve seen so far can also be composed functionally, like this (we call it the “Functional API”):
# We use an `Input` object to describe the shape and dtype of the inputs.
# This is the deep learning equivalent of *declaring a type*.
# The shape argument is per-sample; it does not include the batch size.
# The functional API focused on defining per-sample transformations.
# The model we create will automatically batch the per-sample transformations,
# so that it can be called on batches of data.
inputs <- layer_input(shape = 16, dtype = "float32")
# We call layers on these "type" objects
# and they return updated types (new shapes/dtypes).
outputs <- inputs %>%
Linear(units = 32) %>% # We are reusing the Linear layer we defined earlier.
Dropout(rate = 0.5) %>% # We are reusing the Dropout layer we defined earlier.
Linear(units = 10)
# A functional `Model` can be defined by specifying inputs and outputs.
# A model is itself a layer like any other.
model <- keras_model(inputs, outputs)
# A functional model already has weights, before being called on any data.
# That's because we defined its input shape in advance (in `Input`).
length(model$weights)
## [1] 4
## TensorShape([2, 10])
# You can pass a `training` argument in `__call__`
# (it will get passed down to the Dropout layer).
y <- model(tf$ones(shape(2, 16)), training=TRUE)
The Functional API tends to be more concise than subclassing, and provides a few other advantages (generally the same advantages that functional, typed languages provide over untyped OO development). However, it can only be used to define DAGs of layers – recursive networks should be defined as Layer subclasses instead.
Learn more about the Functional API here.
In your research workflows, you may often find yourself mix-and-matching OO models and Functional models.
Note that the Model
class also features built-in
training & evaluation loops: fit()
,
predict()
and evaluate()
(configured via the
compile()
method). These built-in functions give you access
to the following built-in training infrastructure features:
tf.distribute
API –
with no changes to your code.steps_per_execution
argument in
Model.compile()
, you can process multiple batches in a
single tf.function
call, which greatly improves device
utilization on TPUs.We won’t go into the details, but we provide a simple code example below. It leverages the built-in training infrastructure to implement the MNIST example above.
inputs <- layer_input(shape = 784, dtype="float32")
outputs <- inputs %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 10)
model <- keras_model(inputs, outputs)
# Specify the loss, optimizer, and metrics with `compile()`.
model %>% compile(
loss = loss_sparse_categorical_crossentropy(from_logits=TRUE),
optimizer=optimizer_adam(learning_rate=1e-3),
metrics=list(metric_sparse_categorical_accuracy()),
)
# Train the model with the dataset for 2 epochs.
model %>% fit(dataset, epochs=2)
## Epoch 1/2
## 938/938 - 4s - 4ms/step - loss: 0.3958 - sparse_categorical_accuracy: 0.8866
## Epoch 2/2
## 938/938 - 1s - 960us/step - loss: 0.1888 - sparse_categorical_accuracy: 0.9443
## 938/938 - 1s - 1ms/step
## 938/938 - 1s - 1ms/step - loss: 0.1763 - sparse_categorical_accuracy: 0.9454
## $loss
## [1] 0.1763445
##
## $sparse_categorical_accuracy
## [1] 0.9454167
You can always subclass the Model
class (it works
exactly like subclassing Layer
) if you want to leverage
built-in training loops for your OO models. Just override the
Model$train_step()
to customize what happens in
fit()
while retaining support for the built-in
infrastructure features outlined above – callbacks, zero-code
distribution support, and step fusing support. You may also override
test_step()
to customize what happens in
evaluate()
, and override predict_step()
to
customize what happens in predict()
. For more information,
please refer to this
guide.
CustomModel <- new_model_class(
"CustomModel",
initialize = function(...) {
super$initialize(...)
self$loss_tracker <- metric_mean(name="loss")
self$accuracy <- metric_sparse_categorical_accuracy()
self$loss_fn <- loss_sparse_categorical_crossentropy(from_logits=TRUE)
self$optimizer <- optimizer_adam(learning_rate=1e-3)
},
train_step = function(data) {
c(x, y = NULL, sample_weight = NULL) %<-% data
with(tf$GradientTape() %as% tape, {
y_pred <- self(x, training=TRUE)
loss <- self$loss_fn(y = y, y_pred = y_pred, sample_weight=sample_weight)
})
gradients <- tape$gradient(loss, self$trainable_variables)
self$optimizer$apply_gradients(
zip_lists(gradients, self$trainable_variables)
)
# Update metrics (includes the metric that tracks the loss)
self$loss_tracker$update_state(loss)
self$accuracy$update_state(y, y_pred, sample_weight=sample_weight)
# Return a list mapping metric names to current value
list(
loss = self$loss_tracker$result(),
accuracy = self$accuracy$result()
)
},
metrics = mark_active(function() {
list(self$loss_tracker, self$accuracy)
})
)
inputs <- layer_input(shape = 784, dtype="float32")
outputs <- inputs %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 10)
model <- CustomModel(inputs, outputs)
model %>% compile()
model %>% fit(dataset, epochs=2)
## Epoch 1/2
## 938/938 - 2s - 2ms/step - loss: 0.3869 - sparse_categorical_accuracy: 0.8924
## Epoch 2/2
## 938/938 - 1s - 1ms/step - loss: 0.2163 - sparse_categorical_accuracy: 0.9370
Here are some of the things you’ve learned so far:
Layer
encapsulates a state (created in
__init__
or build
) and some computation
(defined in call
).GradientTape
, calling your model inside the tape’s scope,
then retrieving gradients and applying them via an optimizer.@tf.function
decorator.self.add_loss()
.Let’s put all of these things together into an end-to-end example: we’re going to implement a Variational AutoEncoder (VAE). We’ll train it on MNIST digits.
Our VAE will be a subclass of Layer
, built as a nested
composition of layers that subclass Layer
. It will feature
a regularization loss (KL divergence).
Below is our model definition.
First, we have an Encoder
class, which uses a
Sampling
layer to map a MNIST digit to a latent-space
triplet (z_mean, z_log_var, z)
.
Sampling <- new_layer_class(
"Sampling",
call = function(inputs) {
c(z_mean, z_log_var) %<-% inputs
batch <- op_shape(z_mean)[[1]]
dim <- op_shape(z_mean)[[2]]
epsilon <- random_normal(shape = c(batch, dim))
z_mean + op_exp(0.5 * z_log_var) * epsilon
}
)
Encoder <- new_layer_class(
"Encoder",
initialize = function(latent_dim = 32, intermediate_dim = 64, ...) {
super$initialize(...)
self$dense_proj <- layer_dense(units = intermediate_dim, activation = "relu")
self$dense_mean <- layer_dense(units = latent_dim)
self$dense_log_var <- layer_dense(units = latent_dim)
self$sampling <- Sampling()
},
call = function(inputs) {
x <- self$dense_proj(inputs)
z_mean <- self$dense_mean(x)
z_log_var <- self$dense_log_var(x)
z <- self$sampling(list(z_mean, z_log_var))
list(z_mean, z_log_var, z)
}
)
Next, we have a Decoder
class, which maps the
probabilistic latent space coordinates back to a MNIST digit.
Decoder <- new_layer_class(
"Decoder",
initialize = function(original_dim, intermediate_dim = 64, ...) {
super$initialize(...)
self$dense_proj <- layer_dense(units = intermediate_dim, activation = "relu")
self$dense_output <- layer_dense(units = original_dim, activation = "sigmoid")
},
call = function(inputs) {
x <- self$dense_proj(inputs)
self$dense_output(x)
}
)
Finally, our VariationalAutoEncoder
composes together an
encoder and a decoder, and creates a KL divergence regularization loss
via add_loss()
.
VariationalAutoEncoder <- new_model_class(
"VariationalAutoEncoder",
initialize = function(original_dim,
intermediate_dim=64,
latent_dim=32,
name="autoencoder", ...) {
super$initialize(name = name, ...)
self$original_dim <- original_dim
self$encoder <- Encoder(
latent_dim = latent_dim,
intermediate_dim = intermediate_dim
)
self$decoder <- Decoder(
original_dim = original_dim,
intermediate_dim = intermediate_dim
)
},
call = function(inputs) {
c(z_mean, z_log_var, z) %<-% self$encoder(inputs)
reconstructed <- self$decoder(z)
# Add KL divergence regularization loss.
kl_loss <- -0.5 * op_mean(
z_log_var - op_square(z_mean) - op_exp(z_log_var) + 1
)
self$add_loss(kl_loss)
reconstructed
}
)
Now, let’s write a training loop. Our training step is decorated with
a @tf.function
to compile into a super fast graph
function.
# Our model.
vae <- VariationalAutoEncoder(
original_dim = 784,
intermediate_dim = 64,
latent_dim = 32
)
# Loss and optimizer.
loss_fn <- loss_mean_squared_error()
optimizer = optimizer_adam(learning_rate=1e-3)
# Prepare a dataset.
c(c(x_train, .), .) %<-% dataset_mnist()
x_train <- array_reshape(x_train, c(60000, 784)) / 255
dataset <- tfdatasets::tensor_slices_dataset(x_train) %>%
tfdatasets::dataset_shuffle(buffer_size=1024) %>%
tfdatasets::dataset_batch(32)
training_step <- tf_function(function(x) {
with(tf$GradientTape() %as% tape, {
reconstructed <- vae(x) # Compute input reconstruction.
# Compute loss.
loss <- loss_fn(x, reconstructed)
loss <- loss + op_sum(vae$losses) # Add KLD term.
})
# Update the weights of the VAE.
grads <- tape$gradient(loss, vae$trainable_weights)
optimizer$apply_gradients(zip_lists(grads, vae$trainable_weights))
loss
})
losses <- c() # Keep track of the losses over time.
coro::loop(for(data in dataset) {
loss <- training_step(data)
# Logging.
losses[length(losses) + 1] <- as.numeric(loss)
if (length(losses) %% 100 == 0) {
cat("Step:", length(losses), "Loss:", mean(losses), "\n")
}
# Stop after 1000 steps.
# Training the model to convergence is left
# as an exercise to the reader.
if (length(losses) >= 1000) {
break
}
})
## Step: 100 Loss: 0.1270978
## Step: 200 Loss: 0.1003238
## Step: 300 Loss: 0.09001128
## Step: 400 Loss: 0.08493649
## Step: 500 Loss: 0.08171404
## Step: 600 Loss: 0.07926706
## Step: 700 Loss: 0.07790599
## Step: 800 Loss: 0.07670419
## Step: 900 Loss: 0.07570736
## Step: 1000 Loss: 0.07476593
As you can see, building and training this type of model in Keras is quick and painless.
Let’s take a look at another kind of research experiment: hypernetworks.
The idea is to use a small deep neural network (the hypernetwork) to generate the weights for a larger network (the main network).
Let’s implement a really trivial hypernetwork: we’ll use a small 2-layer network to generate the weights of a larger 3-layer network.
input_dim <- 784
classes <- 10
# This is the main network we'll actually use to predict labels.
inputs <- layer_input(shape = input_dim)
dense1 <- layer_dense(units = 64, activation = "relu")
dense1$built <- TRUE
dense2 <- layer_dense(units = classes)
dense2$built <- TRUE
outputs <- inputs %>% dense1() %>% dense2()
main_network <- keras_model(inputs, outputs)
# This is the number of weight coefficients to generate. Each layer in the
# main network requires output_dim * input_dim + output_dim coefficients.
num_weights_to_generate <- (classes * 64 + classes) + (64 * input_dim + 64)
# This is the hypernetwork that generates the weights of the `main_network` above.
hypernetwork <- keras_model_sequential() %>%
layer_dense(units=16, activation="relu") %>%
layer_dense(units=num_weights_to_generate, activation="sigmoid")
This is our training loop. For each batch of data:
hypernetwork
to generate an array of weight
coefficients, weights_pred
main_network
main_network
to compute
the actual MNIST predictionshypernetwork
to minimize the final classification loss# Loss and optimizer.
loss_fn <- loss_sparse_categorical_crossentropy(from_logits = TRUE)
optimizer <- optimizer_adam(learning_rate=1e-4)
# Prepare a dataset.
c(c(x_train, y_train), .) %<-% dataset_mnist()
x_train <- array_reshape(x_train, c(60000, 784)) / 255
dataset <- tfdatasets::tensor_slices_dataset(list(x_train, y_train)) %>%
tfdatasets::dataset_shuffle(buffer_size=1024) %>%
# We'll use a batch size of 1 for this experiment.
tfdatasets::dataset_batch(1)
train_step <- function(x, y) {
with(tf$GradientTape() %as% tape, {
weights_pred <- hypernetwork(x)
# Reshape them to the expected shapes for w and b for the outer model.
# Layer 1 kernel.
start_index <- 1
w1_shape <- c(input_dim, 64)
w1_coeffs <- weights_pred[, start_index:(start_index + prod(w1_shape) - 1)]
w1 <- tf$reshape(w1_coeffs, as.integer(w1_shape))
start_index <- start_index + prod(w1_shape)
# Layer 1 bias.
b1_shape <- c(64)
b1_coeffs <- weights_pred[, start_index:(start_index + prod(b1_shape) - 1)]
b1 <- tf$reshape(b1_coeffs, as.integer(b1_shape))
start_index <- start_index + prod(b1_shape)
# Layer 2 kernel.
w2_shape <- c(64, classes)
w2_coeffs <- weights_pred[, start_index:(start_index + prod(w2_shape) - 1)]
w2 <- tf$reshape(w2_coeffs, as.integer(w2_shape))
start_index <- start_index + prod(w2_shape)
# Layer 2 bias.
b2_shape <- c(classes)
b2_coeffs <- weights_pred[, start_index:(start_index + prod(b2_shape) - 1)]
b2 <- tf$reshape(b2_coeffs, as.integer(b2_shape))
start_index <- start_index + prod(b2_shape)
# Set the weight predictions as the weight variables on the outer model.
dense1$kernel <- w1
dense1$bias <- b1
dense2$kernel <- w2
dense2$bias <- b2
# Inference on the outer model.
preds <- main_network(x)
loss <- loss_fn(y, preds)
})
grads <- tape$gradient(loss, hypernetwork$trainable_weights)
optimizer$apply_gradients(zip_lists(grads, hypernetwork$trainable_weights))
loss
}
losses <- c() # Keep track of the losses over time.
coro::loop(for (data in dataset) {
x <- data[[1]]
y <- data[[2]]
loss <- train_step(x, y)
# Logging.
losses[length(losses) + 1] <- as.numeric(loss)
if (length(losses) %% 100 == 0) {
cat("Step:", length(losses), "Loss:", mean(losses), "\n")
}
# Stop after 1000 steps.
# Training the model to convergence is left
# as an exercise to the reader.
if (length(losses) >= 1000) {
break
}
})
## Step: 100 Loss: 2.536778
## Step: 200 Loss: 2.236472
## Step: 300 Loss: 2.119417
## Step: 400 Loss: 2.040341
## Step: 500 Loss: 1.949125
## Step: 600 Loss: 1.859384
## Step: 700 Loss: 1.845726
## Step: 800 Loss: 1.820594
## Step: 900 Loss: 1.771334
## Step: 1000 Loss: 1.730648
Implementing arbitrary research ideas with Keras is straightforward and highly productive. Imagine trying out 25 ideas per day (20 minutes per experiment on average)!
Keras has been designed to go from idea to results as fast as possible, because we believe this is the key to doing great research.
We hope you enjoyed this quick introduction. Let us know what you build with Keras!