R package dependence trees

Ioannis Kosmidis

2022-08-26

cranly dependence trees

Since version 0.2, cranly includes functions for constructing and working with package dependence tree objects. Specifically, the packages that are requirements for a specified package (i.e. appear in Depends, Imports or LinkingTo) are found, then the requirements for those packages are found, and so on. In essence, a package’s dependence tree shows what else needs to be installed along with the package in an empty package library, and hence it can be used to

+ remove unnecessary dependencies that “drag” with them all sorts of other packages
+ identify packages that are heavy for the CRAN mirrors
+ produce some neat visuals for the package

Constructing cranly_dependence_tree objects

Constructing cranly_dependence_tree objects is straightforward once a package directives network has been derived.

Let’s attach cranly

library("cranly")

and use an instance of the package directives network

package_network <- readRDS(url("https://raw.githubusercontent.com/ikosmidis/cranly/develop/inst/extdata/package_network.rds"))

from CRAN’s state on 2022-08-26 14:43:43 BST.

Alternatively, today’s package directives network can be constructed by doing

cran_db <- clean_CRAN_db()
package_network <- build_network(cran_db)

We can compute the dependence tree of any package on CRAN by calling the function compute_dependence_tree on the package directives network. For example, the dependence tree of brglm2 is

compute_dependence_tree(package_network, "brglm2")
#>       package generation
#> 1      brglm2          0
#> 2        MASS         -1
#> 3       stats         -1
#> 4      Matrix         -1
#> 5    graphics         -1
#> 6        nnet         -1
#> 7  enrichwith         -1
#> 8    numDeriv         -1
#> 9     methods         -2
#> 10   graphics         -2
#> 11       grid         -2
#> 12      stats         -2
#> 13      utils         -2
#> 14    lattice         -2
#> 15  grDevices         -2

and of tibble is

compute_dependence_tree(package_network, "tibble")
#>      package generation
#> 1     tibble          0
#> 2      fansi         -1
#> 3  lifecycle         -1
#> 4   magrittr         -1
#> 5    methods         -1
#> 6     pillar         -1
#> 7  pkgconfig         -1
#> 8      rlang         -1
#> 9      utils         -1
#> 10     vctrs         -1
#> 11 grDevices         -2
#> 12     utils         -2
#> 13      glue         -2
#> 14     rlang         -2
#> 15       cli         -2
#> 16     fansi         -2
#> 17 lifecycle         -2
#> 18      utf8         -2
#> 19     vctrs         -2
#> 20      glue         -3
#> 21     utils         -3
#> 22 grDevices         -3
#> 23   methods         -3
#> 24     rlang         -3
#> 25       cli         -3

The resulting data frame includes package names and a generation index. The generation of the named package is 0 by default, and the generation index decreases by 1 each time we move back from a set of packages to their requirements. Note that a package can appear in more than one generation. I had loads of fun implementing compute_dependence_tree, because the tree construction can be neatly and cleanly written as a recursion (see the source code of compute_dependence_tree), leveraging the advantages of functional programming (that’s a different and long discussion, though).
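To illustrate the idea of the recursion, here is a minimal base R sketch on a hypothetical toy set of requirements. It is not the cranly implementation: the names requirements and toy_dependence_tree are made up for this illustration, the data are not real CRAN data, and the sketch assumes there are no circular requirements.

```r
# Hypothetical toy requirements (NOT real CRAN data): each element lists the
# packages that the named package depends on, imports or links to.
requirements <- list(
  A = c("B", "C"),
  B = c("C", "D"),
  C = character(0),
  D = character(0)
)

# Illustrative recursion (an assumption, not cranly's code): record the
# packages at the current generation, then recurse into their combined
# requirements one generation further back.
toy_dependence_tree <- function(packages, generation = 0) {
  current <- data.frame(
    package = packages,
    generation = rep(generation, length(packages)),
    stringsAsFactors = FALSE
  )
  parents <- unique(unlist(requirements[packages], use.names = FALSE))
  if (length(parents) == 0) {
    # No further requirements: the recursion stops here
    return(current)
  }
  rbind(current, toy_dependence_tree(parents, generation - 1))
}

recursion_example <- toy_dependence_tree("A")
recursion_example
```

In this toy example, C is required by both A and B, so it shows up at generations -1 and -2, much like graphics and stats do in the brglm2 tree above.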

The method build_dependence_tree uses compute_dependence_tree to construct an edge list for the dependence tree, which we can then visualize. For example, for tibble

tibble_tree <- build_dependence_tree(package_network, "tibble")
plot(tibble_tree)

Package dependence index

The package dependence index is a rough measure of how much “baggage” an R package carries. It is a weighted average of the generation indices of the packages in the dependence tree, with its sign flipped so that the index is non-negative, and with weights that are inversely proportional to the popularity of each package, measured by how many other packages depend on, link to or import it. Mathematically, the package dependence index is defined as \[ -\frac{\sum_{i \in C_p; i \ne p} \frac{1}{N_i} g_i}{\sum_{i \in C_p; i \ne p} \frac{1}{N_i}} \] where \(C_p\) is the dependence tree for the package(s) \(p\), \(N_i\) is the total number of packages that depend on, link to or import package \(i\), and \(g_i\) is the generation in which package \(i\) appears in the dependence tree of package(s) \(p\). The generation takes values in the non-positive integers, with the package(s) \(p\) placed at generation \(0\), the packages that \(p\) depends on, links to or imports at generation \(-1\), and so on.
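As a rough illustration of the formula, the sketch below evaluates it in base R for a hypothetical tree. The tree, the counts in n_dependents and the function toy_dependence_index are all made up for this example and are not part of cranly. The sketch assumes each package appears in a single generation and that the tree has at least one package other than the root.

```r
# Hypothetical dependence tree of a package "p" (NOT real CRAN data)
index_tree <- data.frame(
  package = c("p", "a", "b", "c"),
  generation = c(0, -1, -1, -2),
  stringsAsFactors = FALSE
)

# Hypothetical values of N_i: how many packages depend on, link to or
# import each package in the tree
n_dependents <- c(a = 1, b = 4, c = 2)

# Illustrative evaluation of the formula (an assumption, not cranly's code)
toy_dependence_index <- function(tree, n_dependents, root) {
  others <- tree[tree$package != root, ]          # i in C_p with i != p
  weights <- 1 / n_dependents[others$package]     # 1 / N_i
  -sum(weights * others$generation) / sum(weights)
}

toy_index <- toy_dependence_index(index_tree, n_dependents, root = "p")
toy_index
```

Here the weights are 1, 1/4 and 1/2, so the index is (1 + 1/4 + 2 * 1/2) / (1 + 1/4 + 1/2) = 9/7. The rarely used package a, which sits at generation -1, carries most of the weight.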

For example, the package dependence index for each of the packages in the dependence tree of betareg is

betareg_tree <- build_dependence_tree(package_network, "betareg")
betareg_dep_index <- sapply(betareg_tree$nodes$package, function(package) {
    tree <- build_dependence_tree(package_network, package = package)
    s <- summary(tree)
    s$dependence_index
})
sort(betareg_dep_index)
#>    flexmix    Formula    lattice modeltools       nnet        zoo    betareg 
#> 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.06768005 
#>     lmtest   sandwich 
#> 0.50726552 0.50726552