A practical introduction to regression

Damian Pavlyshyn

26 May 2023

The structure of a data table

library(tidyverse)  # for slice_sample(), anti_join(), %>%, ggplot(), etc.

data(birthwt, package = "MASS")
# NB: the columns shown below were renamed (and weights converted to kg)
# in the slide setup, which is not reproduced here
head(birthwt)
birth_weight m_age m_weight m_race m_smoker m_prem_labors m_hist_hyper m_uter_irrit m_doc_visits
        2.52    19    82.55  black    FALSE             0        FALSE         TRUE            0
        2.55    33    70.31  other    FALSE             0        FALSE        FALSE            3
        2.56    20    47.63  white     TRUE             0        FALSE        FALSE            1
        2.59    21    48.99  white     TRUE             0        FALSE         TRUE            2
        2.60    18    48.53  white     TRUE             0        FALSE         TRUE            0
        2.62    21    56.25  other    FALSE             0        FALSE        FALSE            0

Training data

Contains measurements of all variables.

birthwt_train <- slice_sample(birthwt, prop = 0.9)  # random 90% of the rows
head(birthwt_train)
birth_weight m_age m_weight m_race m_smoker m_prem_labors m_hist_hyper m_uter_irrit m_doc_visits
        1.89    19    41.28  white     TRUE             2        FALSE         TRUE            0
        3.10    22    54.43  white    FALSE             0         TRUE        FALSE            1
        2.08    19    46.27  white    FALSE             0        FALSE        FALSE            2
        3.20    30    69.40  other    FALSE             0        FALSE        FALSE            0
        3.63    19   106.59  white     TRUE             0         TRUE        FALSE            0
        3.37    16    61.23  white     TRUE             0        FALSE        FALSE            0

New data

Only contains measurements of covariates … but we’d like to know the outcome

birthwt_new <- birthwt %>%
  anti_join(birthwt_train) %>%  # keep only rows not sampled into the training set
  select(-birth_weight)         # drop the outcome we are trying to predict
head(birthwt_new)
m_age m_weight m_race m_smoker m_prem_labors m_hist_hyper m_uter_irrit m_doc_visits
   20    54.43  other    FALSE             0        FALSE         TRUE            0
   17    55.34  white     TRUE             0        FALSE        FALSE            0
   23    58.97  black    FALSE             0        FALSE        FALSE            1
   23    58.06  other    FALSE             0        FALSE        FALSE            0
   30    43.09  white     TRUE             0        FALSE        FALSE            2
   32    60.78  white     TRUE             1        FALSE        FALSE            4
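
A quick size check: birthwt has 189 rows, so the 90% training sample keeps 170 of them and leaves 19 rows of "new" data (this matches the 165 residual degrees of freedom, 170 minus 5 parameters, in the model summary later on).

nrow(birthwt_train)
## [1] 170
nrow(birthwt_new)
## [1] 19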

Regression

How can we use the measurements in the training data to inform our guesses about the outcome in the new data?

Step 1: Fit/train/learn a model

Ingredients:

  1. Type of model: How complicated can the relationships between covariates and outcome be?
  2. Formula: Which outcome and covariates are we considering?
  3. Training data: Which measurements are we using to inform our guesses?

fit <- lm(
  formula = birth_weight ~ m_age + m_weight + m_smoker + m_prem_labors,
  data    = birthwt_train
)

Step 2: Predict outcomes

new_obs <- birthwt_new[1:5, ]
m_age m_weight m_race m_smoker m_prem_labors m_hist_hyper m_uter_irrit m_doc_visits
   20    54.43  other    FALSE             0        FALSE         TRUE            0
   17    55.34  white     TRUE             0        FALSE        FALSE            0
   23    58.97  black    FALSE             0        FALSE        FALSE            1
   23    58.06  other    FALSE             0        FALSE        FALSE            0
   30    43.09  white     TRUE             0        FALSE        FALSE            2

What’s our best educated guess for the birth weight of these people’s children?

predict(fit, newdata = new_obs)
##        1        2        3        4        5 
## 2.985043 2.715692 3.037831 3.029939 2.666900

How did predict() make that calculation?

summary(fit)
## 
## Call:
## lm(formula = birth_weight ~ m_age + m_weight + m_smoker + m_prem_labors, 
##     data = birthwt_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04349 -0.46118 -0.01145  0.49653  1.88205 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.422660   0.311950   7.766 8.12e-13 ***
## m_age          0.004443   0.010529   0.422   0.6736    
## m_weight       0.008700   0.003946   2.205   0.0289 *  
## m_smokerTRUE  -0.263915   0.114576  -2.303   0.0225 *  
## m_prem_labors -0.115341   0.112220  -1.028   0.3055    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7146 on 165 degrees of freedom
## Multiple R-squared:  0.07915,    Adjusted R-squared:  0.05682 
## F-statistic: 3.545 on 4 and 165 DF,  p-value: 0.008366
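
Under the hood, predict() simply plugs each new row into the fitted formula: the intercept plus each coefficient times the corresponding covariate, with FALSE/TRUE treated as 0/1. As a sketch, here is the first prediction computed by hand from the estimates above:

b <- coef(fit)    # named vector of fitted coefficients
x <- new_obs[1, ] # first new observation
unname(
  b["(Intercept)"] +
    b["m_age"] * x$m_age +
    b["m_weight"] * x$m_weight +
    b["m_smokerTRUE"] * x$m_smoker +
    b["m_prem_labors"] * x$m_prem_labors
)
# 2.422660 + 0.004443 * 20 + 0.008700 * 54.43 + 0 + 0 ≈ 2.985, the first
# value returned by predict(fit, newdata = new_obs)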

Significance: will dropping this variable change the predictions?

predict(fit, newdata = new_obs)
##        1        2        3        4        5 
## 2.985043 2.715692 3.037831 3.029939 2.666900

What happens if we remove the non-significant variable m_age?

fit_no_age <- lm(
  formula = birth_weight ~ m_weight + m_smoker + m_prem_labors,
  data    = birthwt_train
)

predict(fit_no_age, newdata = new_obs)
##        1        2        3        4        5 
## 2.997543 2.739253 3.038551 3.030349 2.628531

How about the significant m_smoker?

fit_no_smoker <- lm(
  formula = birth_weight ~ m_age + m_weight + m_prem_labors,
  data    = birthwt_train
)

predict(fit_no_smoker, newdata = new_obs)
##        1        2        3        4        5 
## 2.889462 2.880035 2.945557 2.937779 2.849588
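
The contrast is easy to quantify from the predictions above: across these five observations, dropping the non-significant m_age moves the predictions by at most about 0.04 kg, while dropping the significant m_smoker moves them by up to about 0.18 kg.

# largest change in the five predictions after dropping each variable
max(abs(predict(fit, newdata = new_obs) - predict(fit_no_age, newdata = new_obs)))     # ≈ 0.04
max(abs(predict(fit, newdata = new_obs) - predict(fit_no_smoker, newdata = new_obs)))  # ≈ 0.18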

A one-covariate example

Let’s just look at the relationship between birth weight and maternal weight

p <- birthwt_train %>%
  ggplot(aes(y = birth_weight, x = m_weight)) +
  geom_point(color = g3) +  # g3: a custom colour defined in the slide setup
  theme_burnet              # theme_burnet: a custom ggplot theme (not shown)

We train a model with a new, simpler formula:

weight_fit <- lm(
  formula = birth_weight ~ m_weight,
  data    = birthwt_train
)
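
To see what this model actually is, we can overlay the fitted line on the scatterplot p built earlier (a minimal sketch):

# the fitted model is a single straight line: intercept + slope * m_weight
p + geom_abline(
  intercept = coef(weight_fit)[1],
  slope     = coef(weight_fit)[2]
)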

Now, let's see what the predicted birth weights are for a number of evenly spaced maternal weights:

spaced_weights <- tibble(
  m_weight = seq(30, 110, by = 10)  # evenly spaced maternal weights, in kg
)

predict(weight_fit, newdata = spaced_weights)
##        1        2        3        4        5        6        7        8 
## 2.629677 2.727985 2.826293 2.924601 3.022909 3.121217 3.219525 3.317833 
##        9 
## 3.416141
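
Each 10 kg step in maternal weight raises the prediction by the same amount, because the model is a straight line. The constant increment, about 0.0983 kg, is exactly 10 times the m_weight coefficient reported in the summary below:

diff(predict(weight_fit, newdata = spaced_weights))  # every gap ≈ 0.0983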

summary(weight_fit)
## 
## Call:
## lm(formula = birth_weight ~ m_weight, data = birthwt_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1608 -0.4728  0.0067  0.4764  2.1068 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.334753   0.235627   9.909   <2e-16 ***
## m_weight    0.009831   0.003869   2.541    0.012 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7242 on 168 degrees of freedom
## Multiple R-squared:  0.03701,    Adjusted R-squared:  0.03127 
## F-statistic: 6.456 on 1 and 168 DF,  p-value: 0.01196

Bonus: local linear regression
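
(The slide's plot is not reproduced here.) Local linear regression fits a separate weighted straight line in a window around each prediction point, so the overall curve can bend. A minimal sketch using base R's loess(); note that loess() refuses to extrapolate outside the range of the training m_weight values:

# degree = 1 requests locally linear (rather than locally quadratic) fits
loess_fit <- loess(birth_weight ~ m_weight, data = birthwt_train, degree = 1)
predict(loess_fit, newdata = tibble(m_weight = seq(45, 100, by = 5)))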

Bonus: gradient-boosted tree regression
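
(Again, the plot is not reproduced.) Gradient boosting sums many shallow regression trees, each one fit to the residuals of the ensemble so far, yielding a flexible, step-shaped fit. A hedged sketch using the gbm package (an assumed dependency; xgboost would work similarly):

library(gbm)  # assumed installed

boost_fit <- gbm(
  birth_weight ~ m_weight,
  data         = birthwt_train,
  distribution = "gaussian",  # squared-error loss for a numeric outcome
  n.trees      = 500,
  shrinkage    = 0.01         # small learning rate: each tree contributes a little
)
predict(boost_fit, newdata = spaced_weights, n.trees = 500)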

Problematic cases

You can run a linear regression on any data … but the result may be meaningless!

summary() will never give you the whole story!

What if the model is wrong?

# df is a simulated dataset from the slide setup (not shown) in which the
# true relationship between y and time is non-linear
fit <- lm(y ~ time, data = df)
summary(fit)
## 
## Call:
## lm(formula = y ~ time, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.145 -2.803 -1.165  1.915 20.722 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.0997     0.7787  -3.981 0.000132 ***
## time          3.7385     0.2268  16.485  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.983 on 98 degrees of freedom
## Multiple R-squared:  0.735,  Adjusted R-squared:  0.7322 
## F-statistic: 271.7 on 1 and 98 DF,  p-value: < 2.2e-16

Let’s plot the data

df %>%
  ggplot(aes(x = time, y = y)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE, color = g_accent) +  # custom accent colour
  theme_burnet

The residuals are the errors in our predictions.

df %>%
  mutate(residual = resid(fit)) %>%  # residual = observed y - fitted y
  ggplot(aes(x = time, y = residual)) +
  geom_point() +
  theme_burnet

Unbiased residuals: if the model is right, the residuals scatter evenly around zero, with no visible trend in time. Any pattern, like the one this fit produces, is a sign that the model is wrong.

What if the noise is not Gaussian?

Try to estimate the number of centimetres in an inch by measuring a bunch of objects with a good centimetre ruler and a really terrible inch ruler.

n_obs <- 1000

df <- tibble(
  len_cm = rnorm(n_obs, mean = 10),
  len_in = 2.54 * len_cm + rcauchy(n_obs)  # Cauchy noise: extremely heavy tails
)
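
Why is the inch ruler "really terrible"? Cauchy noise is so heavy-tailed that its mean does not even exist, and enormous outliers are routine; the residual range in the summary below shows one. A quick illustration (exact values vary from run to run):

range(rnorm(n_obs))    # Gaussian noise: typically within about -3 to 3
range(rcauchy(n_obs))  # Cauchy noise: extremes often in the hundreds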

Fit the model and find the slope - we expect it to be around 2.54:

fit <- lm(len_in ~ len_cm, data = df)
summary(fit)
## 
## Call:
## lm(formula = len_in ~ len_cm, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -92.48  -2.57  -1.26   0.00 371.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   10.920      6.256   1.745   0.0812 .
## len_cm         1.581      0.623   2.537   0.0113 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.38 on 998 degrees of freedom
## Multiple R-squared:  0.006409,   Adjusted R-squared:  0.005413 
## F-statistic: 6.437 on 1 and 998 DF,  p-value: 0.01133

The fitted slope of 1.581 is nowhere near 2.54, even though summary() flags it as significant. Let's have a look at the data to see why.

df %>%
  ggplot(aes(y = len_in, x = len_cm)) +
  geom_point() +
  theme_burnet

Check if the residuals are normally distributed

res <- resid(fit)  # use a name that doesn't mask base R's resid() function
qqnorm(res)
qqline(res)        # reference line: normal residuals should hug it

Normally distributed residuals would follow the straight reference line in the Q-Q plot. The extreme departures in the tails here are the signature of the heavy-tailed Cauchy noise, and one more warning that the standard errors and p-values reported by summary() cannot be trusted.