Density Estimation in mlr3proba

Nurul Ain Toha

This vignette is an introduction to performing density estimation in mlr3proba.

A very quick introduction to density estimation

Density estimation is the learning task of finding the unknown distribution from which an i.i.d. data set was generated. We interpret this broadly: the distribution need not be continuous, so it may possess a probability mass function rather than a density. The conditional case, where a distribution is predicted conditional on covariates, is known as ‘probabilistic supervised regression’ and will be implemented in mlr3proba in the near future. In mlr3proba, (unconditional) density estimation is viewed as an unsupervised task, whereas probabilistic supervised regression (or conditional density estimation) is a supervised task.

Density Task

Unconditional density estimation is an unsupervised method. Hence, TaskDens is an unsupervised task that inherits directly from Task, unlike TaskClassif and TaskRegr. However, TaskDens still has a target and a $truth field, as the following example shows:

library(mlr3proba); library(mlr3)

task = TaskDens$new(id = "mpg", backend = datasets::mtcars, target = "mpg")

task
#> <TaskDens:mpg> (32 x 11)
#> * Target: mpg
#> * Properties: -
#> * Features (10):
#>   - dbl (10): am, carb, cyl, disp, drat, gear, hp, qsec, vs, wt

task$truth()[1:10]
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2

Train and Predict

Density learners have train and predict methods, though, being unsupervised, ‘prediction’ is actually ‘estimation’. Training creates a distr6 object; see the distr6 documentation for full tutorials on how to access the pdf, cdf, and other important fields and methods. The predict method is simply a wrapper around self$model$pdf and, if available, self$model$cdf, i.e. it evaluates the pdf/cdf at given points. Note that in prediction the points at which the pdf and cdf are evaluated are determined by the target column of the TaskDens object used for testing.
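As a rough illustration of what training and prediction amount to for a kernel density estimator, the following base-R sketch stores the training data and a bandwidth, then evaluates a Gaussian-kernel pdf and cdf at held-out points. The helpers kde_train, kde_pdf and kde_cdf are hypothetical names introduced here, and the bandwidth rule (stats::bw.nrd0) is an assumption. This is not necessarily how mlr3proba or distr6 implement dens.kde internally.

```r
# Minimal base-R sketch of Gaussian kernel density estimation.
# kde_train, kde_pdf and kde_cdf are hypothetical helpers, not mlr3proba functions.

kde_train = function(x, bandwidth = stats::bw.nrd0(x)) {
  # "Training" only stores the data and the (assumed) bandwidth.
  list(data = x, bandwidth = bandwidth)
}

kde_pdf = function(model, points) {
  # Average of Gaussian kernels centred on each training observation.
  vapply(points, function(p) {
    mean(stats::dnorm((p - model$data) / model$bandwidth)) / model$bandwidth
  }, numeric(1))
}

kde_cdf = function(model, points) {
  # Average of the corresponding Gaussian cdfs.
  vapply(points, function(p) {
    mean(stats::pnorm((p - model$data) / model$bandwidth))
  }, numeric(1))
}

# 80/20 split of the eruption durations
set.seed(1)
eruptions = datasets::faithful$eruptions
train_idx = sample(length(eruptions), floor(0.8 * length(eruptions)))
model = kde_train(eruptions[train_idx])

# "Prediction" evaluates the estimated pdf/cdf at the held-out target values
test_points = eruptions[-train_idx]
pdf_hat = kde_pdf(model, test_points)
cdf_hat = kde_cdf(model, test_points)
head(data.frame(truth = test_points, pdf = pdf_hat, cdf = cdf_hat))
```

The split mirrors the one used with the learner in this section: the model is fitted on 80% of the rows, and the pdf and cdf are evaluated only at the remaining target values.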

# create task and learner

task_faithful = TaskDens$new(id = "eruptions", backend = datasets::faithful,
                             target = "eruptions")
learner = lrn("dens.kde")

# train/test split 

train_set = sample(task_faithful$nrow, 0.8 * task_faithful$nrow)
test_set = setdiff(seq_len(task_faithful$nrow), train_set)

# fitting KDE and model inspection

learner$train(task_faithful, row_ids = train_set)
learner$model
#> Norm_KDE
class(learner$model)
#> [1] "Distribution" "R6"

# make predictions for new data

prediction = learner$predict(task_faithful, row_ids = test_set)

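Every PredictionDens object can estimate the pdf (probability density function), evaluated at the test points.

Some learners can additionally estimate the cdf (cumulative distribution function).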

prediction
#> <PredictionDens> for 55 observations:
#>     row_id truth       pdf
#>          3 3.333 0.1094527
#>          8 3.600 0.2057676
#>         11 1.833 0.3015347
#> ---                       
#>        241 4.150 0.4651316
#>        242 2.350 0.2310265
#>        272 4.467 0.4800183

# score the predicted `pdf` with the default measure, the log-loss

prediction$score()
#> dens.logloss 
#>     1.145351
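To make the score more concrete, here is a hedged base-R sketch of a log-loss for density predictions, taken here to be the mean negative log of the predicted densities. The helper logloss_sketch and its eps clipping are assumptions made for illustration and may differ from the details of dens.logloss. The input values are the six densities printed above, so the result is not expected to match the score over all 55 test observations.

```r
# Densities copied from the six rows of the printed prediction (a partial sample only).
pdf_values = c(0.1094527, 0.2057676, 0.3015347, 0.4651316, 0.2310265, 0.4800183)

# logloss_sketch is a hypothetical helper; eps guards against log(0).
logloss_sketch = function(pdf, eps = 1e-15) {
  mean(-log(pmax(pdf, eps)))
}

logloss_sketch(pdf_values)
```

Lower values are better: a learner that assigns higher density to the observed test points receives a smaller log-loss.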