This vignette provides a comparison of {r2r} with the same-purpose CRAN package {hash}, which also offers an implementation of hash tables based on R environments. We first describe the features offered by both packages, and then perform some benchmark timing comparisons. The package versions referred to in this vignette are:
library(hash)
library(r2r)
packageVersion("hash")
#> [1] '2.2.6.3'
packageVersion("r2r")
#> [1] '0.1.2'
Both {r2r} and {hash} hash tables are built on top of the R built-in environment data structure, and thus have a similar API. In particular, hash table objects have reference semantics in both packages. {r2r} hash tables are S3 class objects, whereas in {hash} the data structure is implemented as an S4 class.
Hash tables provided by r2r support keys and values of arbitrary type, arbitrary key comparison and hash functions, and customizable behaviour upon query of a missing key (either throwing an exception or returning a default value).
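As a quick illustration of these points, consider the following sketch. It is not taken from the package documentation: the constructor arguments default and on_missing_key are assumptions that should be checked against ?hashmap.

m <- hashmap(default = NA, on_missing_key = "default")  # argument names assumed
m[[ c(1, 2, 3) ]] <- "a numeric vector used as a key"   # arbitrary type keys
m[[ list("a", 1) ]] <- "a list used as a key"
m[[ c(1, 2, 3) ]]    # retrieves the value stored above
m[[ "absent key" ]]  # returns the default (NA here) instead of throwing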
In contrast, hash tables in hash currently support only string keys, with basic identity comparison (the hashing is performed automatically by the underlying environment objects); values can be arbitrary R objects. Querying a missing key through non-vectorized [[-subsetting returns the default value NULL, whereas querying it through vectorized [-subsetting results in an error. On the other hand, hash also offers support for inverting hash tables (an experimental feature at the time of writing).
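The behaviour just described can be sketched as follows; this is only an illustration based on the text above, and the name of the inversion function, invert(), is an assumption to be checked against the {hash} documentation.

h <- hash()
h[["a"]] <- 1       # string keys, arbitrary R values
h[["b"]]            # missing key, non-vectorized query: returns NULL
# h["b"]            # missing key, vectorized query: would raise an error
h_inv <- invert(h)  # hash table inversion (experimental at the time of writing)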
The table below summarizes the features of the two packages:
Feature | r2r | hash |
---|---|---|
Basic data structure | R environment | R environment |
Arbitrary type keys | X | |
Arbitrary type values | X | X |
Arbitrary hash function | X | |
Arbitrary key comparison function | X | |
Throw or return default on missing keys | X | |
Hash table inversion | | X |
We will perform our benchmark tests using the CRAN package microbenchmark.
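The package is attached here, so that the microbenchmark() calls in the code blocks below can be run as shown:

library(microbenchmark)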
We start by timing the insertion of random key-value pairs (with possible repetitions). In order to perform a meaningful comparison between the two packages, we restrict to string (i.e. length-one character) keys. We can generate random keys as follows:
chars <- c(letters, LETTERS, 0:9)
random_keys <- function(n) paste0(
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE),
sample(chars, n, replace = TRUE)
)
set.seed(840)
N <- 1e4  # number of key-value pairs (illustrative value, assumed here)
keys <- random_keys(N)
values <- rnorm(N)
We test both the non-vectorized ([[<-) and vectorized ([<-) assignment operators:
microbenchmark(
`r2r_[[<-` = {
for (i in seq_along(keys))
m_r2r[[ keys[[i]] ]] <- values[[i]]
},
`r2r_[<-` = { m_r2r[keys] <- values },
`hash_[[<-` = {
for (i in seq_along(keys))
m_hash[[ keys[[i]] ]] <- values[[i]]
},
`hash_[<-` = { m_hash[keys] <- values },
times = 30,
setup = { m_r2r <- hashmap(); m_hash <- hash() }
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_[[<- 97.1113 126.6628 173.17751 175.37990 206.8687 301.6009 30
#> r2r_[<- 91.2664 113.1550 137.13168 127.47230 158.4516 215.8025 30
#> hash_[[<- 107.5062 133.6981 178.84656 166.02690 201.8416 367.6347 30
#> hash_[<- 40.4989 73.6090 87.99125 86.84845 102.5808 189.4421 30
As the results show, r2r and hash have comparable performance for the non-vectorized insertion of key-value pairs, while hash is somewhat more efficient for vectorized insertion.
We now test key query, again both in non-vectorized and vectorized form:
microbenchmark(
`r2r_[[` = { for (key in keys) m_r2r[[ key ]] },
`r2r_[` = { m_r2r[ keys ] },
`hash_[[` = { for (key in keys) m_hash[[ key ]] },
`hash_[` = { m_hash[ keys ] },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_[[ 88.8908 131.3450 162.35679 166.82115 193.6489 224.0128 30
#> r2r_[ 78.2608 89.1680 133.01280 136.82050 171.4101 192.3764 30
#> hash_[[ 11.3301 13.2122 19.92809 16.61005 21.1257 114.3301 30
#> hash_[ 59.1891 72.5410 97.80876 103.46870 116.3497 157.7794 30
For non-vectorized queries, hash is significantly faster (by one order of magnitude) than r2r. This is likely due to the fact that [[ method dispatch is handled natively by R in hash (i.e. the default [[ method for environments is used), whereas r2r suffers the overhead of S3 method dispatch. This is confirmed by the results for vectorized queries, which are comparable for the two packages; notice that in this case only a single (rather than N) S3 method dispatch occurs in the r2r timed expression.
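To make the dispatch argument more concrete, the following self-contained sketch (not part of either package) compares a native environment lookup with the same lookup routed through an S3 [[ method; the toy_map class and its method are purely illustrative.

e <- new.env(hash = TRUE)
e[["key"]] <- 1

# A toy S3 wrapper around the same environment, with its own [[ method
wrapped <- structure(list(env = e), class = "toy_map")
`[[.toy_map` <- function(x, i) get(i, envir = x$env, inherits = FALSE)

microbenchmark(
  native_env  = e[["key"]],        # default [[ for environments
  s3_dispatch = wrapped[["key"]],  # goes through S3 method dispatch first
  times = 1000
)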
As an additional test, we perform the benchmarks for non-vectorized expressions with a new set of keys:
set.seed(841)
new_keys <- random_keys(N)
microbenchmark(
`r2r_[[_bis` = { for (key in new_keys) m_r2r[[ key ]] },
`hash_[[_bis` = { for (key in new_keys) m_hash[[ key ]] },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_[[_bis 60.9385 66.6542 97.49977 98.44680 118.4777 160.2272 30
#> hash_[[_bis 10.3190 10.9783 14.71031 12.72205 18.6043 23.5141 30
The results are similar to those already discussed. Next, we test the performance of the two packages in checking for the existence of keys (notice that here has_key refers to r2r::has_key, whereas has.key is hash::has.key):
set.seed(842)
mixed_keys <- sample(c(keys, new_keys), N)
microbenchmark(
r2r_has_key = { for (key in mixed_keys) has_key(m_r2r, key) },
hash_has_key = { for (key in mixed_keys) has.key(key, m_hash) },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r2r_has_key 82.1291 105.5541 121.9681 118.7348 138.7861 177.3694 30
#> hash_has_key 199.9635 253.8353 309.2154 296.2679 362.1202 504.9347 30
In this case r2r turns out to be noticeably faster than hash, by roughly a factor of two to three on these runs.
Finally, we test key deletion. In order to handle name collisions, we will use delete() (which refers to r2r::delete()) and del() (which refers to hash::del()).
microbenchmark(
r2r_delete = { for (key in keys) delete(m_r2r, key) },
hash_delete = { for (key in keys) del(key, m_hash) },
hash_vectorized_delete = { del(keys, m_hash) },
times = 30,
setup = {
m_r2r <- hashmap(); m_r2r[keys] <- values
m_hash <- hash(); m_hash[keys] <- values
}
)
#> Unit: milliseconds
#>                    expr      min       lq       mean    median       uq      max neval
#>              r2r_delete 125.3984 147.5797 190.454610 194.47480 219.5499 259.5941    30
#>             hash_delete  63.6216  73.2791 104.263703 102.95080 134.2641 176.3312    30
#>  hash_vectorized_delete   2.6007   3.0677   3.737853   3.53695   4.0799   6.8738    30
The vectorized version of hash significantly outperforms the non-vectorized versions (by roughly two orders of magnitude in speed). Currently, r2r does not support vectorized key deletion [1].
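Until such a feature is available, a simple loop wrapper (a hypothetical helper, not part of the r2r API) can emulate vectorized deletion, at the cost of one delete() call per key:

delete_keys <- function(m, keys) {
  for (key in keys) delete(m, key)  # one non-vectorized deletion per key
  invisible(m)
}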
The two R packages r2r and hash offer hash table implementations with different advantages and drawbacks. r2r focuses on flexibility, and has a richer set of features. hash is more minimal, but offers superior performance in some important tasks. Finally, as a positive note for both parties, the two packages share a similar API, which makes it relatively easy to switch between them according to the needs of the particular use case.
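For reference, the parallel calls used throughout this vignette line up as follows (note the different argument order of the key-existence and deletion helpers):

m_r2r <- hashmap();  m_hash <- hash()
m_r2r[["a"]] <- 1;   m_hash[["a"]] <- 1      # insertion
m_r2r[["a"]];        m_hash[["a"]]           # query
has_key(m_r2r, "a"); has.key("a", m_hash)    # key existence
delete(m_r2r, "a");  del("a", m_hash)        # deletion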
[1] This is due to complications introduced by the internal hash collision handling system of r2r.