represtools

Overview

The package represtools supports a specific, yet flexible analytic workflow. The guiding principles are:

Any and all of the steps in the research may be reproduced by another researcher.
Each analytic step is modular. This means that it may work in isolation, so long as precedent data exists.
There are four major analytic steps: gather, cook, analyze and publish.
Every step uses rmarkdown for code and documentation.
Data is exchanged using .rda files.
The make program is used to ensure that data and content is (re)built when needed by any module.

The four steps of data analysis

The four steps are verbs. Their outputs are adjectives which describe the data contained. Each verb and adjective gets its own directory. Let’s say we’d like to analyze baseball data. The command below will create this structure.

represtools::NewResearch("Baseball")

The resulting directory tree is as shown below:

- Baseball/
| - analyze/
| - analyzed/
| - cook/
| - cooked/
| - gather/
| - gathered/
| - present/
| - presented/
| - Baseball.Rproj
| - Makefile

By default, NewResearch will add an RStudio project file to the directory. That project file will presume that make is used to build the project. NewResearch adds this makefile to the project directory. Note that the makefile uses some extensions that are present in the GNU version of make as described in the DESCRIPTION file.

Gather some data

Every analysis begins with data. To start gathering data, simply execute the Gather function with a file name which describes the data being gathered.

represtools::Gather("Hitter")

This will create a new .Rmd file in the “gather” directory. The template will set various knitr options as below:

knitr::opts_knit$set(root.dir = normalizePath('../'))
knitr::opts_chunk$set(echo=TRUE, message=TRUE, warning=TRUE, error=TRUE)

The normalizePath option ensures that we may run in a child directory of the analysis and use paths relative to that directory. For the gather step, the default is to echo every line of code and generate every message, warning and error.

After data has been gathered, the template will collect the names of every object in memory to save to an .rda file. By default, it will search for objects which start with the characters “df”, “plt” and “fit”. This is a kind of Hungarian notation to identify the most common analysis objects: “df” - a data frame (or tbl_df), “plt” - a plot object and “fit” - a model fit.

The DescribeObjects function will apply a vector of functions to a vector of objects. The default is to apply the str function. Note that we wrap the object names with a call to NamesToObjects. This will seach for objects in memory based on variable name and return them in a list.

lstObjects <- represtools::ListObjects()
represtools::DescribeObjects(represtools::NamesToObjects(lstObjects))

Finally, we save the output to .rda. The function OutputFile uses elements from the params object to construct the output filename. More about the use of parameters in rmarkdown may be found here: .

save(file = represtools::OutputFile(params)
     , list = lstObjects)

Though not strictly enforced, I find it a best practice to have one Gather file for each data object (typically a data frame).

Cook the data

Gathered data is raw. The opposite of raw data is cooked data. “Cooking”, in this context, refers to all of the manipulation one performs to render data fit for analysis. This is everything from renaming columns to filtering and merging multiple data sets. This could also encompass treatment of data fields like scaling, centering and so on.

The inputFiles argument will add those values to the params section of the YAML. By default, the cook .Rmd template will load those files into memory as the first step of the cooking process.

represtools::Cook("Hitters", inputFiles = c("Hitters", "Salaries"))

Note that there is a new code chunk just after initializing knitr. the LoadObjects function will load everything from all of the .rda files listed in inputFiles. It will return a character string of all the objects loaded. Note that there is a possibility for naming collisions. Be careful not to use the same object name in multiple files.

loadedObjects <- represtools::LoadObjects(params)

Analyze the data

If I’ve properly cooked my data, I’m good to go. The Analyze function will produce a new .Rmd file with the input that I need for analysis.

represtools::Analyze("HitterPerformance", inputFiles = "Hitters")

This template will be less aggressive at informing the user in the published document. Only errors will be shown.

knitr::opts_chunk$set(echo=FALSE, message=FALSE, warning=FALSE, error=TRUE)

Present

Present is slightly different. There will be no data output for further stage. Instead the goal is to generate output for an audience.

represtools::Present("Handedness"
                     , inputFiles = "HitterPerformance"
                     , title = "On the quality of right-handed batters"
                     , output = "word_document")

There is a dummy parameter at the end of the .Rmd file which is used to generate an .rda file. This prevents the make utility from recreating a presentation document. This is also a useful control for the user to ensure that the file is recreated until they are satisfied with the results.

Make

represtools relies on the tool make to control the workflow. make keeps track of file dependencies between various modules of the project. A full description of make may be found in the GNU Make documentation. Make is an option for constructing an RStudio project and this will be reflected in the RStudio project file, if the user asked for one. One may also invoke Make by calling the Make function within represtools. The make program must be installed an available on the system path in order for this to work.

represtools::Make()

Make accepts one parameter, which indicates which recipe to create. By default, every represtools Makefile comes with at least four: “all”, “clean”, “cooked” and “cleanCooked”.

A sample workflow

represtools::NewResearch("Baseball")
represtools::Gather("Hitters")
# write some code
represtools::Cook("Hitters")
# write some code
represtools::Analyze("Handedness")
# write some code
represtools::Present("Handedness", title = "On the quality of right-handed batters", output = "html")
# write some code
represtools::Make()

And that would do it.

Let’s say that I’d like to augment my analysis with another set of batters. I’ll need a new data frame, so I need a new gather file. It’s fairly straightforward to add a new data file and re-cook. If I’m lucky, my cooked data is such that I don’t need to alter my analysis code to generate new output.

represtools::Gather("Japanese")
# write some code to gather
# edit the code in the "Hitters" cook module

represtools::Make("clean")
represtools::Make()

Questions

Why not .rds?

For now, I’m sticking with .rda. It enables you to save more than one item in a file and doesn’t lose the names of the objects. I think it’s possible to introduce confusion if a user loads the same data into two different named objects. As the goal is reproduction, we should try to avoid this. Yes, it’s possible to copy an object to a new variable, but I think this is more transparent in a review of code. I think most R users may not even be aware of the difference between .rda and .rds.

What’s wrong with ProjectTemplate?

The short answer is nothing. John Myles White doesn’t need my help in getting more efficient at analyzing data. His structure works for him and, I’m sure, loads of other people. It just didn’t work for me. My tiny brain can’t handle more than about 4 folders at a time. I struggle to understand the difference between “data” and “cache”, “doc” and “reports”, “lib” and “src”. ProjectTemplate’s philosophy seems monolithic: load all of the data, work with it and produce a single output. I like to break things into bite-size chunks and only look at whatever element of data is relevant for a certain piece of the analysis. I also find that gathering and cooking tend to be steps that get reworked often, even in a later stage of analysis.

None of this is meant as crticism. John White is loads smarter than me and he’s created something very useful for people whose brains function like his. Some people like Coke. Some people like Pepsi.