Randomization Examples using nycflights13 flights data

Chester Ismay and Andrew Bray

2018-01-05

Note: The type argument in generate() is filled in automatically from the entries given to specify() and hypothesize(), so it can be omitted from any of the examples that follow. It is kept here to make explicit which generation process is being performed.

This vignette is designed to show how to use the {infer} package with {dplyr} syntax. It does not show how to calculate observed statistics or p-values using the {infer} package. To see examples of these, check out the “Computation of observed statistics…” vignette instead.

Data preparation

library(nycflights13)
library(dplyr)
library(ggplot2)
library(stringr)
library(infer)
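# Fix the seed so the 500-row sample is reproducible
set.seed(2017)
fli_small <- flights %>% 
  na.omit() %>%                              # keep complete cases only
  sample_n(size = 500) %>%                   # random sample of 500 flights
  mutate(season = case_when(
    month %in% c(10:12, 1:3) ~ "winter",     # October through March
    month %in% c(4:9) ~ "summer"             # April through September
  )) %>% 
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",        # scheduled hour 1 to 12
    between(hour, 13, 24) ~ "not morning"    # scheduled hour 13 to 24
  )) %>% 
  # Keep only the variables used in the examples
  select(arr_delay, dep_delay, season, 
         day_hour, origin, carrier)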

Hypothesis tests

One numerical variable (mean)
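
The code behind the output for this case is not shown, so here is a minimal sketch of a bootstrap test for a mean. The hypothesized value mu = 10 and the capped two-sided p-value formula are assumptions made for illustration, so the result need not match the printed value.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section.
# Observed sample mean of departure delay.
x_bar <- fli_small %>%
  summarize(mean(dep_delay)) %>%
  pull()

# Bootstrap null distribution of the mean under H0: mu = 10 (assumed value).
null_distn <- fli_small %>%
  specify(response = dep_delay) %>%
  hypothesize(null = "point", mu = 10) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Two-sided p-value: double the smaller tail proportion, capped at 1.
p_value <- null_distn %>%
  summarize(p_value = min(1, 2 * min(mean(stat <= x_bar), mean(stat >= x_bar)))) %>%
  pull()
p_value
```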

p_value
0.356

One numerical variable (median)
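
A comparable sketch for the median, again using a bootstrap null distribution. The hypothesized median med = -1 is an assumed value chosen for illustration, not one confirmed by the text.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section.
# Observed sample median of departure delay.
x_tilde <- fli_small %>%
  summarize(median(dep_delay)) %>%
  pull()

# Bootstrap null distribution of the median under H0: median = -1 (assumed value).
null_distn <- fli_small %>%
  specify(response = dep_delay) %>%
  hypothesize(null = "point", med = -1) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "median")

# Two-sided p-value: double the smaller tail proportion, capped at 1.
p_value <- null_distn %>%
  summarize(p_value = min(1, 2 * min(mean(stat <= x_tilde), mean(stat >= x_tilde)))) %>%
  pull()
p_value
```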

p_value
0.02

One categorical (one proportion)
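
For a single proportion, one possible pipeline is sketched below. The null value p = .5 is an assumption for illustration. The type argument of generate() is left out so that it is filled in automatically, as described in the note at the top.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section.
# Observed proportion of morning flights.
p_hat <- fli_small %>%
  summarize(mean(day_hour == "morning")) %>%
  pull()

# Null distribution of the sample proportion under H0: p = 0.5 (assumed value).
null_distn <- fli_small %>%
  specify(response = day_hour, success = "morning") %>%
  hypothesize(null = "point", p = .5) %>%
  generate(reps = 1000) %>%
  calculate(stat = "prop")

# Two-sided p-value: double the smaller tail proportion, capped at 1.
p_value <- null_distn %>%
  summarize(p_value = min(1, 2 * min(mean(stat <= p_hat), mean(stat >= p_hat)))) %>%
  pull()
p_value
```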

p_value
0.028

Logical variables will be coerced to factors:
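
A minimal sketch of that coercion, assuming a hypothetical logical column named day_hour_logical. Because the column is treated as a factor, the success level is given as the string "TRUE". The null value p = .5 is again an illustrative assumption.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section.
# `day_hour_logical` is a hypothetical helper column created for this example.
null_distn <- fli_small %>%
  mutate(day_hour_logical = (day_hour == "morning")) %>%
  specify(response = day_hour_logical, success = "TRUE") %>%
  hypothesize(null = "point", p = .5) %>%
  generate(reps = 1000) %>%
  calculate(stat = "prop")

head(null_distn)
```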

Two categorical (2 level) variables
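
A permutation test for a difference in proportions could look like the sketch below. It is not the code that produced the printed value: the p-value here doubles the smaller tail and caps the result at 1, which is an assumed choice.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section.
# Observed difference in morning-flight proportions (winter minus summer).
# group_by() orders the groups alphabetically, so diff() is winter - summer.
d_hat <- fli_small %>%
  group_by(season) %>%
  summarize(prop = mean(day_hour == "morning")) %>%
  summarize(diff(prop)) %>%
  pull()

# Null distribution under independence of day_hour and season.
null_distn <- fli_small %>%
  specify(day_hour ~ season, success = "morning") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("winter", "summer"))

# Two-sided p-value: double the smaller tail proportion, capped at 1.
p_value <- null_distn %>%
  summarize(p_value = min(1, 2 * min(mean(stat <= d_hat), mean(stat >= d_hat)))) %>%
  pull()
p_value
```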

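## [1] 1.158

Note: a p-value cannot exceed 1. A value above 1 typically results from doubling a one-sided tail proportion that is larger than 0.5, and it should be read as 1.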

One categorical (>2 level) - goodness of fit (GoF)
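
A goodness-of-fit test can be sketched as follows. The null probabilities for the three origin airports are assumed values for illustration, and the observed statistic is computed by hand in base R rather than with {infer}.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section.
# Hypothesized origin probabilities (assumed values; they must sum to 1).
p_null <- c("EWR" = .33, "JFK" = .33, "LGA" = .34)

# Observed chi-square statistic: sum of (observed - expected)^2 / expected.
obs_counts <- table(fli_small$origin)[names(p_null)]
expected <- nrow(fli_small) * p_null
Chisq_hat <- sum((obs_counts - expected)^2 / expected)

# Null distribution of the chi-square statistic under the point null.
null_distn <- fli_small %>%
  specify(response = origin) %>%
  hypothesize(null = "point", p = p_null) %>%
  generate(reps = 1000) %>%
  calculate(stat = "Chisq")

# The chi-square test is one-sided: large values count against the null.
p_value <- null_distn %>%
  summarize(p_value = mean(stat >= Chisq_hat)) %>%
  pull()
p_value
```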

## [1] 0.03

Two categorical (>2 level) variables
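
For a chi-square test of independence, a permutation sketch is given below. The pairing of day_hour with origin is an assumption about which variables this case uses, and the observed statistic comes from base R's chisq.test().

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section.
# Observed chi-square statistic from the two-way table of counts.
Chisq_hat <- unname(
  chisq.test(table(fli_small$day_hour, fli_small$origin), correct = FALSE)$statistic
)

# Null distribution under independence of day_hour and origin.
null_distn <- fli_small %>%
  specify(day_hour ~ origin, success = "morning") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")

# One-sided p-value: proportion of permuted statistics at least as large.
p_value <- null_distn %>%
  summarize(p_value = mean(stat >= Chisq_hat)) %>%
  pull()
p_value
```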

## [1] 0.777

One numerical variable, one categorical (2 levels) (diff in means)
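
A permutation test for a difference in means might be written as below. The direction (summer minus winter) and the capped two-sided p-value are assumptions, so the result need not match the printed value.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section.
# Observed difference in mean departure delay (summer minus winter).
# Groups are ordered summer, winter, so -diff() gives summer - winter.
d_hat <- fli_small %>%
  group_by(season) %>%
  summarize(mean_stat = mean(dep_delay)) %>%
  summarize(-diff(mean_stat)) %>%
  pull()

# Null distribution under independence of dep_delay and season.
null_distn <- fli_small %>%
  specify(dep_delay ~ season) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("summer", "winter"))

# Two-sided p-value: double the smaller tail proportion, capped at 1.
p_value <- null_distn %>%
  summarize(p_value = min(1, 2 * min(mean(stat <= d_hat), mean(stat >= d_hat)))) %>%
  pull()
p_value
```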

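## [1] 1.638

Note: read as a p-value, this should be capped at 1. Values above 1 arise when a one-sided tail proportion larger than 0.5 is doubled.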

One numerical variable, one categorical (2 levels) (diff in medians)
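
Swapping the statistic to a difference in medians leaves the rest of the pipeline unchanged. The sketch below makes the same assumptions about direction (summer minus winter) and about the capped two-sided p-value.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section.
# Observed difference in median departure delay (summer minus winter).
d_hat <- fli_small %>%
  group_by(season) %>%
  summarize(median_stat = median(dep_delay)) %>%
  summarize(-diff(median_stat)) %>%
  pull()

# Null distribution under independence of dep_delay and season.
null_distn <- fli_small %>%
  specify(dep_delay ~ season) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in medians", order = c("summer", "winter"))

# Two-sided p-value: double the smaller tail proportion, capped at 1.
p_value <- null_distn %>%
  summarize(p_value = min(1, 2 * min(mean(stat <= d_hat), mean(stat >= d_hat)))) %>%
  pull()
p_value
```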

## [1] 0.646

One numerical, one categorical (>2 levels) - ANOVA
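
A permutation version of one-way ANOVA is sketched here. Using arr_delay as the response and origin as the grouping variable is an assumption, and the observed F statistic is taken from base R's aov().

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section.
# Observed F statistic from a classical one-way ANOVA fit.
F_hat <- anova(aov(arr_delay ~ origin, data = fli_small))$`F value`[1]

# Null distribution of F under independence of arr_delay and origin.
null_distn <- fli_small %>%
  specify(arr_delay ~ origin) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")

# One-sided p-value: proportion of permuted F statistics at least as large.
p_value <- null_distn %>%
  summarize(p_value = mean(stat >= F_hat)) %>%
  pull()
p_value
```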

## [1] 0.526

Two numerical variables - simple linear regression (SLR)
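
A permutation test for a regression slope could be sketched as follows. Regressing arr_delay on dep_delay is an assumption about the variables involved, and the observed slope is taken from base R's lm().

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section.
# Observed slope of arrival delay on departure delay.
slope_hat <- coef(lm(arr_delay ~ dep_delay, data = fli_small))[["dep_delay"]]

# Null distribution of the slope under independence of the two variables.
null_distn <- fli_small %>%
  specify(arr_delay ~ dep_delay) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope")

# Two-sided p-value: double the smaller tail proportion, capped at 1.
p_value <- null_distn %>%
  summarize(p_value = min(1, 2 * min(mean(stat <= slope_hat), mean(stat >= slope_hat)))) %>%
  pull()
p_value
```

With 1000 replicates, a result of 0 only means that no permuted slope was as extreme as the observed one, so the p-value is below 1/1000 rather than exactly zero.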

## [1] 0

Confidence intervals