alike

Brodie Gaslam

What is Alikeness?

alike is similar to all.equal from base R except it only compares object structure. As with all.equal, the first argument (target) must be matched by the second (current).

library(vetr)
alike(integer(5), 1:5)      # different values, but same structure
[1] TRUE
alike(integer(5), 1:4)      # wrong size
[1] "`length(1:4)` should be 5 (is 4)"
alike(integer(26), letters) # same size, but different types
[1] "`letters` should be type \"integer-like\" (is \"character\")"

alike only compares structural elements that are defined in target (a.k.a. the template). This allows “wildcard” templates. For example, we consider length zero vectors to have undefined length so those match vectors of any length:

alike(integer(), 1:5)
[1] TRUE
alike(integer(), 1:4)
[1] TRUE
alike(integer(), letters)  # type is still defined and must match
[1] "`letters` should be type \"integer-like\" (is \"character\")"

Similarly, if a template does not specify an attribute, objects with any value for that attribute will match:

alike(list(), data.frame())  # a data frame is a list with attributes
[1] TRUE
alike(data.frame(), list())  # but a list does not have the data.frame attributes
[1] "`list()` should be class \"data.frame\" (is \"list\")"

As an extension to the wildcard concept, we interpret partially specified core R attributes. Here we allow any three column integer matrix to match:

mx.tpl <- matrix(integer(), ncol=3)          # partially specified matrix
alike(mx.tpl, matrix(sample(1:12), nrow=4))  # any number of rows match
[1] TRUE
alike(mx.tpl, matrix(sample(1:12), nrow=3))  # but column count must match
[1] "`matrix(sample(1:12), nrow = 3)` should have 3 columns (has 4)"

or a data frame of arbitrary number of rows, but same column structure as iris:

iris.tpl <- iris[0, ]                        # no rows, but structure is defined
alike(iris.tpl, iris[1:10, ])                # any number of rows match
[1] TRUE
alike(iris.tpl, CO2)                         # but column structure must match
[1] "`names(CO2)[1]` should be \"Sepal.Length\" (is \"Plant\")"

“Alikeness” is complex to describe, but should be intuitive to grasp. We recommend you look at example(alike) to get a sense of “alikeness”. If you want to understand the specifics, read on.

Declarative Comparison

alike’s template based comparison is declarative. You declare what structure an object is expected to implement, and vetr infers all the computations required to verify that is so. This makes it particularly well suited for enforcing structural requirements for S3 objects. The S4 system does this and more, but S3 objects are still used extensively in R code, and sometimes S4 classes are not appropriate.
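To make the contrast concrete, here is a small sketch that puts a hand-written check next to a template. The helper name check_by_hand and the two test matrices are made up for illustration. The template is the three column integer matrix used earlier in this document.

```r
library(vetr)

# Imperative version: every structural requirement is spelled out by hand.
# `check_by_hand` is a hypothetical helper, not part of vetr.
check_by_hand <- function(x) is.matrix(x) && is.integer(x) && ncol(x) == 3L

# Declarative version: describe the expected structure once, as a template.
mx.tpl <- matrix(integer(), ncol=3)

good <- matrix(1:12, nrow=4)   # 4 x 3 integer matrix
bad  <- matrix(1:12, nrow=3)   # 3 x 4 integer matrix

check_by_hand(good)            # TRUE
isTRUE(alike(mx.tpl, good))    # TRUE
check_by_hand(bad)             # FALSE
isTRUE(alike(mx.tpl, bad))     # FALSE: alike returns a message, not TRUE
```

The two versions are not strictly equivalent: alike also accepts short integer-like numerics, which the hand-written check rejects.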

There are several advantages to template based comparisons:

The template concept was inspired by vapply.
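For readers unfamiliar with vapply, the following base R sketch shows the idea being borrowed: the FUN.VALUE argument is a template that every result must conform to. The input list is made up for illustration.

```r
x <- list(1:3, 4:6)

# FUN.VALUE is a template: each result must be an integer vector of length 1.
sums <- vapply(x, sum, FUN.VALUE = integer(1L))

# mean() returns doubles, which do not fit the integer template, so vapply
# signals an error instead of silently returning a different type.
res <- tryCatch(
  vapply(x, mean, FUN.VALUE = integer(1L)),
  error = function(e) "error"
)
```

Here sums is an integer vector and res ends up holding the string "error".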

Object Comparison

Overview

alike compares objects on type, length, and attributes. Recursive structures are compared element by element. Language objects and functions are compared specially because the concept of a value within those is more complex (e.g., is the + in x + y just a value?).

We will defer discussion of attribute comparison to the attributes section.

Length Comparison

Objects must be the same length to be alike, unless the template (target) is zero length, in which case the object may be any length. Environments are an exception: we only require that all the elements present in target be present in current. Also, note that calls to ( are ignored in language objects, which may affect length computation.

Type Comparison

Type comparison is done on type (i.e. the typeof) with some adjustments to better align comparisons to “perceived” types as opposed to internal storage types.

Numerics and Integers

We allow integer vectors to be considered numeric, and short integer-like numerics to be treated as integers:

alike(1L, 1)     # `1` is not technically integer, but we treat it as such
[1] TRUE
alike(1L, 1.1)   # 1.1 is not integer-like
[1] "`1.1` should be type \"integer-like\" (is \"double\")"
alike(1.1, 1L)   # integers can match numerics
[1] TRUE

This feature is designed to simplify checks for integer-like numbers. The following two expressions are roughly equivalent:

stopifnot(length(x) == 1L && (is.integer(x) || is.numeric(x) && floor(x) == x))
stopifnot(alike(integer(1L), x))

Note that we only check numerics of length <= 100 for integerness to avoid full scans on large vectors. We expect that the primary source of these integer-like numerics is hand input vectors (e.g. c(1, 2, 3)), so hopefully this compromise is not too limiting. You can modify the threshold length for this treatment via the fuzzy.int.max.len parameter of the settings object (see ?vetr_settings).
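The rule can be sketched in base R as follows. This is an illustration of the rule as described, not vetr's implementation: is_integer_like and its max_len argument are hypothetical names, and the treatment of NA values is an assumption.

```r
# Hypothetical helper; `max_len` mirrors the default threshold described above.
is_integer_like <- function(x, max_len = 100L) {
  if (is.integer(x)) return(TRUE)                      # true integers always pass
  if (!is.numeric(x) || length(x) > max_len) return(FALSE)
  # Doubles pass only if every value is whole. Assumption: a vector
  # containing NA is rejected here, which may differ from what vetr does.
  isTRUE(all(x == floor(x)))
}

is_integer_like(c(1, 2, 3))          # TRUE
is_integer_like(1.1)                 # FALSE
is_integer_like(as.numeric(1:101))   # FALSE: too long to be scanned
is_integer_like(1:101)               # TRUE: stored as integer, no scan needed
```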

Functions

Closures, builtins, and specials are all treated as a single type, even though internally they are stored as different types.
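The three internal types can be seen with typeof in base R. The functions picked here are just convenient examples of each kind.

```r
typeof(mean)   # "closure": an ordinary R function
typeof(sum)    # "builtin": a primitive that evaluates its arguments
typeof(`if`)   # "special": a primitive that receives unevaluated arguments

# All three are functions as far as is.function() is concerned
all.fun <- vapply(list(mean, sum, `if`), is.function, logical(1L))
```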

Recursive Objects

alike will recurse through lists (and by extension data frames), pairlists, expressions, and environments and will check pairwise alikeness between the corresponding elements of the target and current objects.

Environments have slightly different comparison rules in two respects:

NULL elements within templates in recursive objects are considered undefined and as such act like wildcards:

## two NULLs match two length list
alike(list(NULL, NULL), list(1:10, letters))
[1] TRUE
## but not three length list
alike(list(NULL, NULL), list(1:10, letters, iris))
[1] "`length(list(1:10, letters, iris))` should be 2 (is 3)"

Note that top level NULLs do not act as wildcards:

alike(NULL, 1:10)                   # NULL only matches NULL
[1] "`1:10` should be `NULL` (is \"integer\")"

Treating NULL inconsistently depending on whether it is nested or not is a compromise designed to make alike a better fit for argument validation because arguments that are NULL by default are fairly common.

alike will check for self-referential loops in nested environments and prevent infinite recursion. If you somehow introduce a self-referential structure in a template without using environments then alike will get stuck in an infinite recursion loop.

We are currently considering adding new comparison modes for lists that would allow for checks more similar to environments (see #29).

Language Objects, Formulas, and Functions

Alikeness for these types of objects is a little harder to define. We have settled on somewhat arbitrary semantics, though hopefully they are intuitive. These may change in the future as we gain experience using alike with these types of objects. This is particularly true of functions.

Language objects are also compared recursively, but alikeness has a slightly different meaning for them:

Language Objects

alike(quote(sum(a, b)), quote(sum(x, y)))   # calls are consistent
[1] TRUE
alike(quote(sum(a, b)), quote(sum(x, x)))   # calls are inconsistent
[1] "`quote(sum(x, x))[[3]]` should not be `x`"
alike(quote(mean(a, b)), quote(sum(x, y)))  # functions are different
[1] "`quote(sum(x, y))[[1]]` should be a call to `mean` (is a call to `sum`)"

Since variables can contain anything we do not require them to match directly across calls. In the examples above the second call fails because the template defines different variables for each argument, but the current object uses the same variable twice. The third call fails because the functions are different and as such the calls are fundamentally different.

If a function is defined in the calling frame, alike will match.call it prior to testing alikeness:

fun <- function(a, b, c) NULL
alike(quote(fun(p, q, p)), quote(fun(y, x, x)))
[1] "`quote(fun(y, x, x))[[4]]` should be `y` (is `x`)"
# `match.call` re-orders arguments
alike(quote(fun(p, q, p)), quote(fun(b=y, x, x)))
[1] TRUE

Constants match any constants, but keep in mind that expressions like 1:10 or c(1, 2, 3) are calls to : and c respectively, not constants in the context of language objects.
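A quick way to tell the two apart in base R is is.call. The expressions below are arbitrary examples.

```r
is.call(quote(1:10))         # TRUE, a call to `:`
is.call(quote(c(1, 2, 3)))   # TRUE, a call to `c`
is.call(quote(1))            # FALSE, a numeric constant
is.call(quote("a"))          # FALSE, a character constant

fun.name <- quote(1:10)[[1L]]   # the symbol `:`
```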

NULL is a wild card in calls as well:

str(one.arg.tpl <- as.call(list(NULL, NULL)))
 language NULL(NULL)
alike(one.arg.tpl, quote(log(10)))
[1] TRUE
alike(one.arg.tpl, quote(sd(runif(20))))
[1] TRUE
alike(one.arg.tpl, quote(log(10, 10)))
[1] "`quote(log(10, 10))` should have 1 arguments (has 2)"

Calls to ( are ignored when comparing calls since parentheses are redundant in call trees because the tree structure encodes operation precedence independent of operator precedence.
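The base R sketch below shows that parentheses are themselves calls in the tree, and what dropping them looks like. strip_parens is a hypothetical helper written for this illustration. It is not how vetr does it internally, and it does not handle calls with empty arguments such as x[, 1].

```r
# Hypothetical helper: recursively drop calls to `(` from a language object.
strip_parens <- function(x) {
  if (!is.call(x)) return(x)
  if (identical(x[[1L]], as.name("(")) && length(x) == 2L) {
    return(strip_parens(x[[2L]]))
  }
  as.call(lapply(as.list(x), strip_parens))
}

with.parens <- quote((a + b) * c)
paren.sym <- with.parens[[2L]][[1L]]   # the symbol `(`: a real node in the tree

# Without the `(` node the nesting alone still says that a + b is
# evaluated before the multiplication.
stripped <- strip_parens(with.parens)
identical(stripped, call("*", quote(a + b), quote(c)))   # TRUE

length(quote((a)))   # 2: the call to `(` plus its single argument
```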

We concede that the rules for “alikeness” of language objects are arbitrary, but hope the outcomes of those rules are generally intuitive. Unfortunately value and structure are somewhat intertwined for language objects so we must impose our own view of what is value and what is structure.

Formulas

Formulas are treated like calls, except that constants must match:

alike(y ~ x ^ 2, a ~ b ^ 2)
[1] TRUE
alike(y ~ x ^ 2, a ~ b ^ 3)
[1] "`(a ~ b^3)[[3]][[3]]` should have identical constant values"

Functions

Functions are alike if the signature of the current function can reasonably be interpreted as a valid method for the target function.

alike(print, print.default)   # print can be the generic for print.default
[1] TRUE
alike(print.default, print)   # but not vice versa
[1] "`print` should have argument `digits` after argument `x`"

A method of a generic must have all arguments present in the generic, with the same default values if those are defined. If the generic contains ... then the method may have additional arguments, but must also contain ....
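Part of this rule can be sketched in base R with formals. The functions gen, gen.default, and gen.bad, and the helper has_generic_args, are all hypothetical. The helper looks only at argument names, so it is a partial check that ignores argument order and default values.

```r
gen <- function(x, ...) UseMethod("gen")
gen.default <- function(x, digits = 3, ...) NULL   # adds an argument, keeps `...`
gen.bad <- function(y, ...) NULL                   # lacks the generic's `x`

# Hypothetical, partial check: are all of the generic's arguments present?
has_generic_args <- function(generic, method) {
  all(names(formals(generic)) %in% names(formals(method)))
}

has_generic_args(gen, gen.default)   # TRUE
has_generic_args(gen, gen.bad)       # FALSE
```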

Potential changes / improvements for function comparison are being considered in #35.

S4 and R5 (RC Objects)

S4 and RC objects are considered alike if current inherits from class(target). Since these objects embed structural information in their definitions, alike relies on class alone to establish alikeness.
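The inheritance relationship in question can be illustrated with is from the methods package. The class names DemoBase and DemoDerived are made up for this sketch, which shows only the inheritance test and does not call alike.

```r
library(methods)

setClass("DemoBase", representation(x = "numeric"))
setClass("DemoDerived", contains = "DemoBase")

base.obj <- new("DemoBase", x = 1)
derived.obj <- new("DemoDerived", x = 1)

is(derived.obj, "DemoBase")   # TRUE: a derived object inherits from the base
is(base.obj, "DemoDerived")   # FALSE: the reverse does not hold
```

By the rule above, we would expect a DemoBase template to accept derived.obj, but not the other way around.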

Pointer Objects

Objects of the following types are actually references to specific memory locations:

These are typically attached as attributes to other objects that contain the information required to establish alikeness (e.g. data.table, byte-compiled functions), so we only check their type.

Attribute Comparison

Normal Attributes

Much of the structure of an object is determined by attributes. alike recursively compares object attributes and requires them to be alike, unless the attribute is a special attribute or an environment. Environments within attributes in the template must be matched by an environment, but nothing is checked about the environments to avoid expensive computations on objects that commonly include environments in their attributes (e.g. formulas); note this is different than the treatment of environments as actual objects.
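Formulas are a handy way to see an environment stored in an attribute. The formula below is arbitrary.

```r
f <- y ~ x

attr.names <- names(attributes(f))   # includes "class" and ".Environment"
is.environment(attr(f, ".Environment"))   # TRUE
```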

Only attributes present in the template object are checked:

alike(structure(logical(1L), a=integer(3L)), structure(TRUE, a=1:3, b=letters))
[1] TRUE
alike(structure(TRUE, a=1:3, b=letters), structure(logical(1L), a=integer(3L)))
[1] "`structure(logical(1L), a = integer(3L))` should have attribute \"b\""

Attributes present in current but missing in target may be anything at all.

Special Attributes

Overview

The special attributes are names, row.names, dim, dimnames, class, tsp, and levels. These attributes are discussed in sections 2.2 and 2.3 of the R Language Definition, and have well defined and consistently applied semantics in R. Since the semantics of these attributes are well known, we are able to define “alikeness” for them in a more granular way than we can for arbitrary attributes.

We also consider srcref to be a special attribute. This attribute is not checked.

row.names and names

If present in target, then must be matched exactly by the corresponding attribute in current, except that:

  • zero length target names/row.names (i.e. character(0L)) will match any character names/row.names
  • a zero character element (i.e. "") in a target names/row.names character vector will allow any value to match at the corresponding position of the current names/row.names vector

alike(setNames(integer(), character()), 1:3)
[1] "`1:3` should have attribute \"names\""
alike(setNames(integer(), character()), c(a=1, b=2, c=3))
[1] TRUE
alike(setNames(integer(3), c("", "", "Z")), c(a=1, b=2, c=3))
[1] "`names(c(a = 1, b = 2, c = 3))[3]` should be \"Z\" (is \"c\")"
alike(setNames(integer(3), c("", "", "Z")), c(a=1, b=2, Z=3))
[1] TRUE
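The two exceptions can be written out as a base R sketch. names_alike is a hypothetical helper that restates the rule on plain character vectors. It is not vetr's implementation and ignores edge cases such as NA names.

```r
# Hypothetical helper operating directly on names vectors
names_alike <- function(target, current) {
  if (!is.character(current)) return(FALSE)   # current must have names at all
  if (!length(target)) return(TRUE)           # zero length: any names will do
  length(target) == length(current) &&
    all(target == "" | target == current)     # "" is a per-position wildcard
}

names_alike(character(), c("a", "b", "c"))       # TRUE
names_alike(character(), NULL)                   # FALSE
names_alike(c("", "", "Z"), c("a", "b", "c"))    # FALSE
names_alike(c("", "", "Z"), c("a", "b", "Z"))    # TRUE
```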

dim

dim attributes must be identical between target and current, except that if a value of the dim vector is zero in target then the corresponding value in current can be any value. This is how comparisons like the following succeed:

mx.tpl <- matrix(integer(), ncol=3)                # partially specified matrix
alike(mx.tpl, matrix(sample(1:12), nrow=4))
[1] TRUE
alike(mx.tpl, matrix(sample(1:12), nrow=3))        # wrong number of columns
[1] "`matrix(sample(1:12), nrow = 3)` should have 3 columns (has 4)"
str(mx.tpl)    # notice 0 for 1st dimension
 int[0 , 1:3] 
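Stated as code, the rule for dim is short. dims_alike is a hypothetical base R helper that restates the rule above. It is a sketch rather than vetr's implementation.

```r
# Hypothetical helper: a zero in the target acts as a wildcard for that dimension
dims_alike <- function(target, current) {
  length(target) == length(current) && all(target == 0L | target == current)
}

tpl.dim <- dim(matrix(integer(), ncol=3))   # c(0L, 3L)

dims_alike(tpl.dim, dim(matrix(1:12, nrow=4)))   # TRUE: 4 x 3
dims_alike(tpl.dim, dim(matrix(1:12, nrow=3)))   # FALSE: 3 x 4
dims_alike(tpl.dim, c(4L, 3L, 2L))               # FALSE: different number of dims
```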

dimnames

Must also be identical, except that if the target value of the dimnames list for a particular dimension is NULL, then the corresponding dimnames value in current may be anything. As with names, zero character dimnames elements match any name.

mx.tpl <- matrix(integer(), ncol=3, dimnames=list(row.id=NULL, c("R", "G", "")))
mx.cur <- matrix(sample(0:255, 12), ncol=3, dimnames=list(row.id=1:4, rgb=c("R", "G", "Blue")))
mx.cur2 <- matrix(sample(0:255, 12), ncol=3, dimnames=list(1:4, c("R", "G", "b")))

alike(mx.tpl, mx.cur)
[1] TRUE
alike(mx.tpl, mx.cur2)
[1] "`dimnames(mx.cur2)` should have attribute \"names\""

Note that dimnames can have a names attribute. This names attribute is treated as described in row.names and names.

names(dimnames(mx.tpl))
[1] "row.id" ""      

class

S3 objects are considered alike if the current class inherits from the target class. Note that “inheritance” here is used in a stricter context than in the typical S3 application:

  • Every class present in target must be present in current
  • The overlapping classes must be in the same order
  • The last class in current must be the same as the last class in target

To illustrate:

tpl <- structure(TRUE, class=c("a", "b", "c"))
cur <- structure(TRUE, class=c("x", "a", "b", "c"))
cur2 <- structure(TRUE, class=c("a", "b", "c", "x"))

alike(tpl, cur)
[1] TRUE
alike(tpl, cur2)
[1] "`class(cur2)[2]` should be \"a\" (is \"b\")"
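The three rules can be restated as a base R sketch on plain class vectors. class_alike is a hypothetical helper that implements the rules exactly as listed above and assumes no repeated class names. vetr's actual behavior may differ in edge cases the rules do not spell out.

```r
# Hypothetical helper implementing the three listed rules
class_alike <- function(target, current) {
  overlap <- current[current %in% target]
  # Rules 1 and 2: all target classes are present, and in the same order
  identical(overlap, target) &&
    # Rule 3: both class vectors end in the same class
    identical(current[length(current)], target[length(target)])
}

class_alike(c("a", "b", "c"), c("x", "a", "b", "c"))   # TRUE
class_alike(c("a", "b", "c"), c("a", "b", "c", "x"))   # FALSE: last class differs
class_alike(c("a", "b", "c"), c("x", "b", "c"))        # FALSE: "a" is missing
```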

tsp

The tsp attribute of ts objects behaves similarly to the dim attribute. Any component (i.e. start, end, frequency) that is set to zero will act as a wild card. Other components must be identical. It is illegal to set tsp components to zero through the standard R interface, but you may use abstract as a work-around.

levels

Levels are compared like row.names and names.

srcref

This attribute is completely ignored.

Normal Attributes that Happen To Have Special Names

If an object contains one of the special attributes, but the attribute value is inconsistent with the standard definition of the attribute, alike will silently treat that attribute as any other normal attribute.

Modifying Comparison Behavior

You can use the settings parameter to alike to modify comparison behavior. See ?vetr_settings for details.

Creating Templates

From The Ground Up

You can always create your own templates by manually building R structures:

int.scalar <- integer(1L)
int.mat.2.by.4 <- matrix(integer(), 2, 4)
# A df without column names
df.chr.num.num <- structure(
  list(character(), numeric(), numeric()), class="data.frame"
)

Abstracting Existing Structures

Alternatively, you can start with a known structure, and abstract away the instance-specific details. For example, suppose we are sending sample collectors out on the field to record information about iris flowers:

iris.tpl <- iris[0, ]
alike(iris.tpl, iris.sample.1)  # make sure they submit data correctly

Or equivalently:

iris.tpl <- abstract(iris)

abstract is an S3 generic defined by vetr along with methods for common objects. abstract primarily sets the length of atomic vectors to zero:

abstract(list(c(a=1, b=2, c=3), letters))
[[1]]
named numeric(0)

[[2]]
character(0)

and also abstracts the dim, dimnames, and tsp attributes if present. Other attributes are left untouched unless a specific abstract method exists for a particular object that also modifies attributes. One example of such a method is abstract.lm, and it does some minor tweaking to the base abstractions to allow us to match models produced by lm:

df.dummy <- data.frame(x=runif(3), y=runif(3), z=runif(3))
mdl.tpl <- abstract(lm(y ~ x + z, df.dummy))
# TRUE, expecting bi-variate model
alike(mdl.tpl, lm(Sepal.Length ~ Sepal.Width + Petal.Width, iris))
[1] TRUE
alike(mdl.tpl, lm(Sepal.Length ~ Sepal.Width, iris))
[1] "`lm(Sepal.Length ~ Sepal.Width, iris)$terms[[3]]` should be a call to `+` (is \"symbol\")"

The error message is telling us that at index "terms" (i.e. lm(Sepal.Length ~ Sepal.Width, iris)$terms) alike was expecting a call to + instead of a symbol (i.e. Sepal.Width + <somevar> instead of Sepal.Width). The message could certainly be more eloquent, but with a little context it should provide enough information to figure out the problem.

Performance Considerations

Sample Timings

We have gone to great lengths to make alike fast so that it can be included in other functions without concern for the overhead:

type_and_len <- function(a, b)
  typeof(a) == typeof(b) && length(a) == length(b)  # for reference

bench_mark(times=1e4,
  identical(rivers, rivers),
  alike(rivers, rivers),
  type_and_len(rivers, rivers)
)
Mean eval time from 10000 iterations, in microseconds:
  identical(rivers, rivers)     ~  0.3
  alike(rivers, rivers)         ~  2.0
  type_and_len(rivers, rivers)  ~  1.0

While alike is slower than identical, it is competitive with a bare bones R function that checks types and length. As objects grow more complex, identical will obviously pull ahead, though alike should be sufficiently fast for most applications:

bench_mark(times=1e4,
  identical(mtcars, mtcars),
  alike(mtcars, mtcars)
)
Mean eval time from 10000 iterations, in microseconds:
  identical(mtcars, mtcars)  ~  0.3
  alike(mtcars, mtcars)      ~  8.2

In the above example, we are comparing the data frames, their attributes, and the 11 columns individually.

Keep in mind that the complexity of the alike comparison is driven by the complexity of the template, not the object we are checking, so we can always manage the expense of the alike evaluation.

Comparisons that succeed will be substantially faster than comparisons that fail as the construction of error messages is non-trivial and we have prioritized optimization in the success case.

Language object comparison is relatively slow. We intend to optimize this some day.

Templates with large numbers of attributes (e.g. > 25) may scale non-linearly. We intend to optimize this some day, though in our experience objects with that many attributes are rare (note having multiple objects each with a handful of attributes nested in recursive structures is not a problem).

Large objects will be slower to evaluate. Let us revisit the lm example, though this time we compare our template to itself to ensure that the comparisons succeed for alike, all.equal, and identical:

mdl.tpl <- abstract(lm(y ~ x + z, data.frame(x=runif(3), y=runif(3), z=runif(3))))
# compare mdl.tpl to itself to ensure success in all three scenarios
bench_mark(
  alike(mdl.tpl, mdl.tpl),
  all.equal(mdl.tpl, mdl.tpl),   # for reference
  identical(mdl.tpl, mdl.tpl)
)
Mean eval time from 1000 iterations, in microseconds:
  alike(mdl.tpl, mdl.tpl)      ~   105
  all.equal(mdl.tpl, mdl.tpl)  ~  1246
  identical(mdl.tpl, mdl.tpl)  ~     0

Even with a template as large as lm results (check str(mdl.tpl)) we can evaluate alike thousands of times before the overhead becomes noticeable.

Pre-defining Templates

Some fairly innocuous R expressions carry substantial overhead. Consider:

df.tpl <- data.frame(a=integer(), b=numeric())
df.cur <- data.frame(a=1:10, b=1:10 + .1)

bench_mark(
  alike(df.tpl, df.cur),
  alike(data.frame(integer(), numeric()), df.cur)
)
Mean eval time from 1000 iterations, in microseconds:
  alike(df.tpl, df.cur)                     ~    5.5
  alike(data.frame(integer(), numeric())..  ~  221.5

data.frame is a particularly slow constructor, but in general you are best served by defining your templates (including calls to abstract) outside of your function so they are created on package load rather than every time your function is called.
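A minimal sketch of that pattern follows. The names DF.TPL and check_input are hypothetical. The template is built once at the top level, and the function only refers to it.

```r
library(vetr)

# Built once, e.g. when the package is loaded, not on every call
DF.TPL <- data.frame(a=integer(), b=numeric())

# Hypothetical validation function that reuses the pre-built template
check_input <- function(x) {
  res <- alike(DF.TPL, x)        # TRUE, or a character description of the problem
  if (!isTRUE(res)) stop(res)
  invisible(TRUE)
}

check_input(data.frame(a=1:10, b=1:10 + .1))   # passes silently

# Column `a` holds text rather than integers, so this signals an error
bad.res <- try(check_input(data.frame(a=letters, b=1:26 + .1)), silent=TRUE)
```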

Miscellaneous

alike as an S3 generic

alike is not currently an S3 generic, but will likely become one in the future provided we can create an implementation with an acceptable performance profile.