Full infer pipeline examples using nycflights13 flights data

Chester Ismay

Updated on 2018-06-14

Data preparation

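library(nycflights13)
library(dplyr)
library(ggplot2)
library(stringr)
library(infer)
# Fix the seed so the 500-row sample is reproducible
set.seed(2017)
fli_small <- flights %>% 
  # Drop rows with any missing value, then sample 500 flights
  na.omit() %>%
  sample_n(size = 500) %>% 
  # Two-level season: October-March is "winter", April-September is "summer"
  mutate(season = case_when(
    month %in% c(10:12, 1:3) ~ "winter",
    month %in% c(4:9) ~ "summer"
  )) %>% 
  # Two-level time of day based on the scheduled departure hour
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",
    between(hour, 13, 24) ~ "not morning"
  )) %>% 
  # Keep only the variables used in the examples
  select(arr_delay, dep_delay, season, 
         day_hour, origin, carrier)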

Hypothesis tests

One numerical variable (mean)

Observed stat

stat
11.49
## Setting `type = "bootstrap"` in `generate()`.

p_value
0.356
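The code that produced the output above is not shown. A minimal sketch of a pipeline of this shape follows, assuming the response is `dep_delay` and the hypothesized mean is `mu = 10`; both are assumptions, so the numbers it returns may differ from those printed above. Leaving `type` unset lets `generate()` choose the bootstrap and print the message seen above.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section is in scope.
# Observed statistic: sample mean of dep_delay (variable choice is an assumption).
x_bar <- fli_small %>%
  specify(response = dep_delay) %>%
  calculate(stat = "mean")

# Null distribution under an assumed hypothesized mean of 10.
null_mean <- fli_small %>%
  specify(response = dep_delay) %>%
  hypothesize(null = "point", mu = 10) %>%
  generate(reps = 1000) %>%
  calculate(stat = "mean")

# Two-sided p-value: share of simulated means at least as extreme as x_bar.
null_mean %>%
  get_p_value(obs_stat = x_bar, direction = "two-sided")
```

In recent infer releases, `visualize(null_mean) + shade_p_value(obs_stat = x_bar, direction = "two-sided")` should draw the null distribution with the observed statistic marked.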

One numerical variable (standardized mean \(t\))

Observed stat

stat
6.827
## Setting `type = "bootstrap"` in `generate()`.

p_value
0
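The standardized version only swaps the statistic. The sketch below assumes `dep_delay` as the response and a placeholder hypothesized mean of `mu = 8`. Because the observed \(t\) depends on the hypothesized mean, its value may not match the output above.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section is in scope.
# mu = 8 is an assumed placeholder for the hypothesized mean.
spec_t <- fli_small %>%
  specify(response = dep_delay) %>%
  hypothesize(null = "point", mu = 8)

# Observed t statistic, computed against the hypothesized mean.
t_obs <- spec_t %>%
  calculate(stat = "t")

# Bootstrap null distribution of the t statistic.
null_t <- spec_t %>%
  generate(reps = 1000) %>%
  calculate(stat = "t")

null_t %>%
  get_p_value(obs_stat = t_obs, direction = "two-sided")
```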

One numerical variable (median)

Observed stat

stat
-2
## Setting `type = "bootstrap"` in `generate()`.

p_value
0.018
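For a median, the point null is given through the `med` argument of `hypothesize()`. This sketch assumes `dep_delay` as the response and a hypothesized median of `-1`; neither is confirmed by the output above.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section is in scope.
# Observed statistic: sample median of dep_delay (variable choice is an assumption).
x_tilde <- fli_small %>%
  specify(response = dep_delay) %>%
  calculate(stat = "median")

# Bootstrap null distribution under an assumed hypothesized median of -1.
null_median <- fli_small %>%
  specify(response = dep_delay) %>%
  hypothesize(null = "point", med = -1) %>%
  generate(reps = 1000) %>%
  calculate(stat = "median")

null_median %>%
  get_p_value(obs_stat = x_tilde, direction = "two-sided")
```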

One categorical variable (one proportion)

Observed stat

stat
0.452
## Setting `type = "simulate"` in `generate()`.

p_value
0.036
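A one-proportion test needs a `success` level in `specify()`. The sketch below assumes the response is `day_hour`, that `"morning"` counts as success, and that the hypothesized proportion is `p = 0.5`. With `type` unset, `generate()` picks the simulation-based type itself; recent infer releases name it `"draw"` rather than `"simulate"`.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section is in scope.
# Observed statistic: proportion of "morning" flights (choices are assumptions).
p_hat <- fli_small %>%
  specify(response = day_hour, success = "morning") %>%
  calculate(stat = "prop")

# Null distribution under an assumed hypothesized proportion of 0.5.
null_prop <- fli_small %>%
  specify(response = day_hour, success = "morning") %>%
  hypothesize(null = "point", p = 0.5) %>%
  generate(reps = 1000) %>%
  calculate(stat = "prop")

null_prop %>%
  get_p_value(obs_stat = p_hat, direction = "two-sided")
```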

Logical variables will be coerced to factors:

## Setting `type = "simulate"` in `generate()`.

One categorical variable (standardized proportion \(z\))

Not yet implemented.

Two categorical (2 level) variables

Observed stat

stat
0.0044
## Setting `type = "permute"` in `generate()`.

p_value
0.954
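With two binary variables, the null is independence and the statistic needs an `order` for the subtraction. A sketch, assuming `day_hour` as the response, `season` as the explanatory variable, and a winter-minus-summer difference (all three are assumptions):

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section is in scope.
# Observed difference in proportions of "morning" flights, winter minus summer.
d_hat <- fli_small %>%
  specify(day_hour ~ season, success = "morning") %>%
  calculate(stat = "diff in props", order = c("winter", "summer"))

# With `type` unset, generate() chooses "permute" for an independence null.
null_diff_props <- fli_small %>%
  specify(day_hour ~ season, success = "morning") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000) %>%
  calculate(stat = "diff in props", order = c("winter", "summer"))

null_diff_props %>%
  get_p_value(obs_stat = d_hat, direction = "two-sided")
```

Replacing `"diff in props"` with `"z"` in both `calculate()` calls gives the standardized variant.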

Two categorical (2 level) variables (z)

Standardized observed stat

stat
0.0985
## Setting `type = "permute"` in `generate()`.

p_value
0.95

Note the similarities between this plot and the previous one; the two p-values (0.954 and 0.95) nearly coincide.

One categorical (>2 level) - GoF

Observed stat

Note the need to add in the hypothesized values here to compute the observed statistic.

stat
7.009

p_value
0.037
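As the note above says, a goodness-of-fit statistic cannot be computed without the hypothesized proportions. The sketch below assumes the response is `origin` and uses illustrative proportions of 0.33, 0.33, and 0.34 for the three airports; these values are assumptions, not taken from the output.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section is in scope.
# Hypothesized proportions are illustrative and must sum to 1.
spec_gof <- fli_small %>%
  specify(response = origin) %>%
  hypothesize(null = "point",
              p = c("EWR" = 0.33, "JFK" = 0.33, "LGA" = 0.34))

# Observed chi-squared statistic against the hypothesized proportions.
chisq_gof_obs <- spec_gof %>%
  calculate(stat = "Chisq")

# `type` is left unset so generate() picks the simulation-based default,
# which also makes it print an informational message.
null_gof <- spec_gof %>%
  generate(reps = 1000) %>%
  calculate(stat = "Chisq")

# Chi-squared tests are right-tailed.
null_gof %>%
  get_p_value(obs_stat = chisq_gof_obs, direction = "greater")
```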

Two categorical (>2 level) variables

Observed stat

stat
0.5284

p_value
0.77
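A chi-squared test of independence follows the same pattern with a formula and a permutation null. This sketch assumes `day_hour ~ origin` as the pair of variables, which the output above does not confirm.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section is in scope.
spec_indep <- fli_small %>%
  specify(day_hour ~ origin) %>%
  hypothesize(null = "independence")

# Observed chi-squared statistic for the two-way table.
chisq_indep_obs <- spec_indep %>%
  calculate(stat = "Chisq")

# Permuting the response breaks any association with origin.
null_indep <- spec_indep %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")

null_indep %>%
  get_p_value(obs_stat = chisq_indep_obs, direction = "greater")
```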

One numerical variable, one categorical (2 levels) (diff in means)

Observed stat

stat
3

p_value
0.338
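For a numerical response split by a two-level factor, the statistic is a difference in means under a permutation null. A sketch, assuming `dep_delay ~ season` and a summer-minus-winter order (both assumptions):

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section is in scope.
# Observed difference in mean departure delay, summer minus winter.
d_means <- fli_small %>%
  specify(dep_delay ~ season) %>%
  calculate(stat = "diff in means", order = c("summer", "winter"))

null_diff_means <- fli_small %>%
  specify(dep_delay ~ season) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("summer", "winter"))

null_diff_means %>%
  get_p_value(obs_stat = d_means, direction = "two-sided")
```

Using `"t"` or `"diff in medians"` as the `stat` in both `calculate()` calls gives the standardized and median-based variants.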

One numerical variable, one categorical (2 levels) (t)

Standardized observed stat

stat
0.8909

p_value
0.4

Note the similarities between this plot and the previous one.

One numerical variable, one categorical (2 levels) (diff in medians)

Observed stat

stat
1

p_value
0.64

One numerical, one categorical (>2 levels) - ANOVA

Observed stat

stat
0.6858

p_value
0.529
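With more than two groups, the \(F\) statistic replaces the difference in means. The sketch assumes `arr_delay ~ origin` as the model, which is not confirmed by the output above.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section is in scope.
spec_anova <- fli_small %>%
  specify(arr_delay ~ origin) %>%
  hypothesize(null = "independence")

# Observed F statistic across the three origins.
f_obs <- spec_anova %>%
  calculate(stat = "F")

null_f <- spec_anova %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")

# F tests are right-tailed.
null_f %>%
  get_p_value(obs_stat = f_obs, direction = "greater")
```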

Two numerical vars - SLR

Observed stat

stat
0.9916

p_value
0
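For simple linear regression, the statistic is the fitted slope and the null is obtained by permuting the response. A sketch, assuming `arr_delay` is regressed on `dep_delay` (an assumption about the variables used):

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section is in scope.
# Observed slope of arr_delay on dep_delay.
slope_obs <- fli_small %>%
  specify(arr_delay ~ dep_delay) %>%
  calculate(stat = "slope")

null_slope <- fli_small %>%
  specify(arr_delay ~ dep_delay) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope")

null_slope %>%
  get_p_value(obs_stat = slope_obs, direction = "two-sided")
```

A reported p-value of 0 means that none of the simulated statistics was as extreme as the observed one, so it is better read as "less than 1 / reps".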

Two numerical vars - correlation

Observed stat

stat
0.8951

p_value
0
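The correlation test differs from the slope test only in the statistic. The sketch again assumes the pair `arr_delay` and `dep_delay`.

```r
library(dplyr)
library(infer)

# Assumes `fli_small` from the Data preparation section is in scope.
# Observed correlation between arr_delay and dep_delay.
cor_obs <- fli_small %>%
  specify(arr_delay ~ dep_delay) %>%
  calculate(stat = "correlation")

null_cor <- fli_small %>%
  specify(arr_delay ~ dep_delay) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "correlation")

null_cor %>%
  get_p_value(obs_stat = cor_obs, direction = "two-sided")
```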

Two numerical vars - SLR (t)

Not currently implemented, since \(t\) could refer to either a standardized slope or a standardized correlation.

Confidence intervals

One categorical variable (standardized proportion \(z\))

Not yet implemented.

Two numerical vars - t

Not currently implemented, since \(t\) could refer to either a standardized slope or a standardized correlation.