
Mixtape Syntax Translations

Causal Inference: The Mixtape (Cunningham) · Chapters 2, 4 & 9 · Tidyverse | Base R | data.table | Stata

0. How to Use This Guide


Each exercise below shows the same computation written four ways, selectable by tab. All four produce identical results. The Tidyverse tab shows the code as written in the Mixtape (with minor modernizations). The Base R and data.table tabs are the R translations. The Stata tab shows the book's original Stata code for comparison.

Pipe Notation
Tidyverse examples use the native pipe |> (R 4.1+) rather than the magrittr pipe %>%. Both work identically for the patterns here. If you are running R 4.0 or earlier, install magrittr and substitute %>%.
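A quick sanity check of the equivalence, using a throwaway vector (any small example works):

```r
library(magrittr)  # provides %>%; not needed for the native pipe

x <- c(1, 4, 9)

# Native pipe (R 4.1+) and magrittr pipe produce the same result here
a <- x |> sqrt() |> sum()
b <- x %>% sqrt() %>% sum()
identical(a, b)  # TRUE (both are 6)
```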
data.table: In-Place Modification
data.table modifies objects in place with := and set*() functions. When you write DT2 <- DT, you get a reference, not a copy. Use copy(DT) when you need an independent duplicate. This is intentional and the source of data.table's speed advantage on large datasets.
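The reference semantics are easy to verify with a toy table (object names here are illustrative):

```r
library(data.table)

DT  <- data.table(x = 1:3)
DT2 <- DT            # reference: DT2 and DT point to the same object
DT2[, x := x * 10]   # modifying DT2 in place...
DT$x                 # ...also changes DT: 10 20 30

DT3 <- copy(DT)      # independent duplicate
DT3[, x := 0]
DT$x                 # unaffected: still 10 20 30
```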
Managing Output Length
Long console output wraps within the output panel, but wide tables and large data frames are easier to read when truncated at source. A few functions to keep in mind:
  • head(df, 10) — first 10 rows of any data frame, tibble, or data.table
  • str(df) — compact type-and-value summary; never wraps awkwardly
  • glimpse(df) — tidyverse equivalent of str(); one line per column
  • print(df, n = 5) — tibble-aware; controls rows without subsetting
  • DT[1:5] — data.table row slice; equivalent to head()
  • options(width = 72) — set console width before printing wide objects
For regression output, broom::tidy(model) returns a clean data frame that prints more predictably than summary() or coeftest() in wide viewports.
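A sketch of that pattern on an arbitrary fit (mtcars is a stand-in dataset; any lm() object works the same way):

```r
library(broom)

model <- lm(mpg ~ wt + hp, data = mtcars)

tidy(model)                   # one row per term: estimate, std.error, statistic, p.value
tidy(model, conf.int = TRUE)  # adds conf.low / conf.high columns
glance(model)                 # one-row model summary: r.squared, sigma, AIC, ...
```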
Quick reference: core idioms by approach
  • Tidyverse: load with read_dta() + pipe; new column with mutate(); grouped summary with group_by() |> summarize(); fixed-effects regression with fixest::feols()
  • Base R: load with haven::read_dta(); new column with df$col <- ...; grouped summary with tapply() or aggregate(); fixed-effects regression with lm() + factor dummies
  • data.table: load with as.data.table(read_dta()); new column with DT[, col := ...]; grouped summary with DT[, .(mean = mean(x)), by = grp]; fixed-effects regression with lm() + factor dummies

1. Chapter 2 — Probability and Regression Review


Chapter 2 introduces OLS, the summation operator, and the role of standard errors. The worked examples use Yule's (1899) data on English pauperism. All three translation patterns below are used repeatedly in later chapters.

§ 2.13 Loading a Stata Dataset from GitHub

The Mixtape stores all datasets as .dta files on GitHub and loads them with a helper function. Base R and data.table simply inline the URL.

Package required
haven is needed for all three approaches to read .dta files. data.table additionally needs the data.table package.
Tidyverse (book original)
library(tidyverse)
library(haven)

# Helper function used throughout the book
read_data <- function(df) {
  full_path <- paste0(
    "https://github.com/scunning1975/mixtape/raw/master/", df
  )
  read_dta(full_path)
}

yule <- read_data("yule.dta")
Expected Output  ·  glimpse(yule) R 4.3.3 · Verified
Rows: 32
Columns: 5
$ union     <chr> "Kensington", "Chelsea", "St George Hanover Sq", "We~
$ paup      <dbl> 0.03, 0.06, 0.09, 0.05, 0.07, 0.08, 0.09, 0.09, 0.09~
$ outrelief <dbl> 0.31, 0.29, 0.31, 0.33, 0.31, 0.23, 0.26, 0.24, 0.25~
$ old       <dbl> 0.52, 0.52, 0.73, 0.56, 0.82, 0.50, 0.55, 0.52, 0.46~
$ pop       <dbl> 0.28, 0.32, 0.13, 0.11, 0.32, 0.63, 0.66, 0.88, 0.06~
Tidyverse: tibble with <chr> / <dbl> type annotations; values truncated with ~.
Base R
library(haven)

# Inline the URL directly — no helper function needed
yule <- read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/yule.dta"
)

# read_dta() returns a tibble; convert so str() shows a plain data frame
yule <- as.data.frame(yule)
Expected Output  ·  str(yule) R 4.3.3 · Verified
'data.frame':	32 obs. of  5 variables:
 $ union    : chr  "Kensington" "Chelsea" "St George Hanover Sq" "Westminster" ...
 $ paup     : num  0.03 0.06 0.09 0.05 0.07 0.08 0.09 0.09 0.09 0.05 ...
 $ outrelief: num  0.31 0.29 0.31 0.33 0.31 0.23 0.26 0.24 0.25 0.19 ...
 $ old      : num  0.52 0.52 0.73 0.56 0.82 0.5 0.55 0.52 0.46 0.54 ...
 $ pop      : num  0.28 0.32 0.13 0.11 0.32 0.63 0.66 0.88 0.06 0.02 ...
Base R: shows "data.frame" class; num/chr type abbreviations.
data.table
library(haven)
library(data.table)

# Convert the tibble with as.data.table()
yule <- as.data.table(
  read_dta("https://github.com/scunning1975/mixtape/raw/master/yule.dta")
)

# Or: convert an existing object in place with setDT()
# yule <- read_dta(...)
# setDT(yule)   # modifies yule in place
Expected Output  ·  str(yule) R 4.3.3 · Verified
Classes 'data.table' and 'data.frame':	32 obs. of  5 variables:
 $ union    : chr  "Kensington" "Chelsea" "St George Hanover Sq" "Westminster" ...
 $ paup     : num  0.03 0.06 0.09 0.05 0.07 0.08 0.09 0.09 0.09 0.05 ...
 $ outrelief: num  0.31 0.29 0.31 0.33 0.31 0.23 0.26 0.24 0.25 0.19 ...
 $ old      : num  0.52 0.52 0.73 0.56 0.82 0.5 0.55 0.52 0.46 0.54 ...
 $ pop      : num  0.28 0.32 0.13 0.11 0.32 0.63 0.66 0.88 0.06 0.02 ...
 - attr(*, ".internal.selfref")=<externalptr>
data.table: both "data.table" and "data.frame" classes; .internal.selfref pointer shown.
Stata (book original)
* Load yule.dta directly from GitHub
use "https://github.com/scunning1975/mixtape/raw/master/yule.dta", clear
describe
list in 1/5
Expected Output  ·  Stata Output Stata 17
. use "https://.../yule.dta", clear
(Yule pauperism data)

. describe

Contains data
  obs:  32
 vars:   5
----------------------------------------------
Variable    Stor. type   Display fmt
----------------------------------------------
union       str25        %25s
paup        float        %9.0g
outrelief   float        %9.0g
old         float        %9.0g
pop         float        %9.0g
----------------------------------------------

. list in 1/5

  +-----------------------------------------------+
  | union              paup  outrelief   old   pop |
  |-----------------------------------------------|
  | Kensington          .03       .31   .52   .28 |
  | Chelsea             .06       .29   .52   .32 |
  | St George Hanov~q   .09       .31   .73   .13 |
  | Westminster         .05       .33   .56   .11 |
  | Marylebone          .07       .31   .82   .32 |
  +-----------------------------------------------+
§ 2.13 OLS Regression — Yule Pauperism Data

Yule (1899) regressed pauperism growth on out-relief growth, controlling for population age and size. The Mixtape uses this to illustrate OLS with real data. All three approaches call lm(), which accepts both data frames and data.table objects.

Tidyverse (book original)
# The book pipes the data directly into lm()
model_yule <- read_data("yule.dta") |>
  lm(paup ~ outrelief + old + pop, data = _)

summary(model_yule)

# Confidence intervals
confint(model_yule)
Expected Output  ·  summary(model_yule) R 4.3.3 · Verified
Call:
lm(formula = paup ~ outrelief + old + pop, data = read_data("yule.dta"))

Residuals:
      Min        1Q    Median        3Q       Max 
-0.067356 -0.005609  0.003313  0.012671  0.047131 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.003575   0.025127  -0.142  0.88788    
outrelief    0.450282   0.063060   7.141 9.04e-08 ***
old         -0.083358   0.027264  -3.057  0.00487 ** 
pop          0.016748   0.014940   1.121  0.27180    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02259 on 28 degrees of freedom
Multiple R-squared:  0.7401,	Adjusted R-squared:  0.7123 
F-statistic: 26.58 on 3 and 28 DF,  p-value: 2.418e-08

# confint():
              2.5 %   97.5 %
(Intercept) -0.0550   0.0479
outrelief    0.3211   0.5795
old         -0.1392  -0.0275
pop         -0.0139   0.0474
Base R
# Identical — lm() is base R; nothing changes
model_yule <- lm(paup ~ outrelief + old + pop, data = yule)
summary(model_yule)

# Confidence intervals
confint(model_yule)
Expected Output  ·  summary(model_yule) R 4.3.3 · Verified
Call:
lm(formula = paup ~ outrelief + old + pop, data = yule)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.067356 -0.005609  0.003313  0.012671  0.047131 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.003575   0.025127  -0.142  0.88788    
outrelief    0.450282   0.063060   7.141 9.04e-08 ***
old         -0.083358   0.027264  -3.057  0.00487 ** 
pop          0.016748   0.014940   1.121  0.27180    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02259 on 28 degrees of freedom
Multiple R-squared:  0.7401,	Adjusted R-squared:  0.7123 
F-statistic: 26.58 on 3 and 28 DF,  p-value: 2.418e-08

# confint():
              2.5 %   97.5 %
(Intercept) -0.0550   0.0479
outrelief    0.3211   0.5795
old         -0.1392  -0.0275
pop         -0.0139   0.0474
data.table
# lm() accepts data.table objects directly — no conversion needed
model_yule <- lm(paup ~ outrelief + old + pop, data = yule)
summary(model_yule)

# Confidence intervals
confint(model_yule)
Expected Output  ·  summary(model_yule) R 4.3.3 · Verified
Call:
lm(formula = paup ~ outrelief + old + pop, data = yule)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.067356 -0.005609  0.003313  0.012671  0.047131 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.003575   0.025127  -0.142  0.88788    
outrelief    0.450282   0.063060   7.141 9.04e-08 ***
old         -0.083358   0.027264  -3.057  0.00487 ** 
pop          0.016748   0.014940   1.121  0.27180    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02259 on 28 degrees of freedom
Multiple R-squared:  0.7401,	Adjusted R-squared:  0.7123 
F-statistic: 26.58 on 3 and 28 DF,  p-value: 2.418e-08

# confint():
              2.5 %   97.5 %
(Intercept) -0.0550   0.0479
outrelief    0.3211   0.5795
old         -0.1392  -0.0275
pop         -0.0139   0.0474
data.table: lm() Call: shows object name not class; output otherwise identical.
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/yule.dta", clear

* OLS regression — identical to lm(paup ~ outrelief + old + pop)
regress paup outrelief old pop

* Confidence intervals are reported in the regress output by default
Expected Output  ·  Stata Output Stata 17
. regress paup outrelief old pop

      Source |       SS           df       MS
-------------+----------------------------------
       Model |  .040688552         3  .013562851
    Residual |  .014288606        28  .000510307
-------------+----------------------------------
       Total |  .054977158        31  .001773457

      Number of obs =  32      R-squared = 0.7401
      F(3, 28)      =  26.58   Root MSE  = .02259

------------------------------------------------------------------------------
        paup | Coefficient  Std. err.      t    P>|t|
-------------+----------------------------------------------------------------
   outrelief |   .4502819   .0630600    7.14   0.000
         old |  -.0833580   .0272640   -3.06   0.005
         pop |   .0167482   .0149400    1.12   0.272
       _cons |  -.0035751   .0251270   -0.14   0.888
------------------------------------------------------------------------------
§ 2.26 Heteroskedasticity-Robust Standard Errors (HC)

Robust SEs are computed differently across approaches. The tidyverse uses estimatr's lm_robust(), which bundles estimation and SE correction. Base R and data.table use sandwich + lmtest to correct an existing lm() object.

HC Type convention
HC1 matches Stata's default robust option. HC2 is the unbiased version recommended for small samples. Use se_type = "stata" in lm_robust() to match Stata output exactly.
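A minimal sketch of how the HC flavors differ on the same fit (mtcars as stand-in data; the Yule model behaves the same way):

```r
library(sandwich)

model <- lm(mpg ~ wt, data = mtcars)

# Robust SE of the wt coefficient under each HC flavor
vapply(c("HC0", "HC1", "HC2", "HC3"),
       function(t) sqrt(vcovHC(model, type = t)["wt", "wt"]),
       numeric(1))
# HC1 is HC0 scaled by n / (n - k); HC2 and HC3 adjust by leverage
```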
Tidyverse (book original)
library(estimatr)  # install.packages("estimatr")
library(broom)

# lm_robust() estimates and corrects SEs in one call
model_robust <- lm_robust(
  paup ~ outrelief + old + pop,
  data = yule,
  se_type = "HC1"   # "stata" also matches Stata's robust
)

tidy(model_robust, conf.int = TRUE)
Expected Output  ·  tidy(model_robust, conf.int=TRUE) R 4.3.3 · Verified
         term  estimate std.error statistic   p.value  conf.low conf.high
1 (Intercept) -0.003575  0.016640   -0.2148 8.314e-01 -0.037661   0.03051
2   outrelief  0.450282  0.053288    8.4500 3.447e-09  0.341126   0.55944
3         old -0.083358  0.025688   -3.2450 3.038e-03 -0.135977  -0.03074
4         pop  0.016748  0.009769    1.7144 9.750e-02 -0.003263   0.03676
estimatr::lm_robust() returns a tidy data frame with conf.low/conf.high columns.
Base R
library(sandwich)  # install.packages("sandwich")
library(lmtest)    # install.packages("lmtest")

# Step 1: fit the model with standard lm()
model <- lm(paup ~ outrelief + old + pop, data = yule)

# Step 2: retest coefficients with HC variance matrix
coeftest(model, vcov = vcovHC(model, type = "HC1"))

# Step 3: robust confidence intervals
coefci(model, vcov. = vcovHC(model, type = "HC1"))
Expected Output  ·  coeftest(model, vcov=vcovHC(model, "HC1")) R 4.3.3 · Verified
t test of coefficients:

              Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.0035751  0.0166404 -0.2148  0.831446    
outrelief    0.4502819  0.0532879  8.4500 3.447e-09 ***
old         -0.0833580  0.0256880 -3.2450  0.003038 ** 
pop          0.0167482  0.0097691  1.7144  0.097504 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# coefci():
              2.5 %   97.5 %
(Intercept) -0.0377   0.0305
outrelief    0.3411   0.5594
old         -0.1360  -0.0307
pop         -0.0033   0.0368
data.table
library(sandwich)
library(lmtest)

# lm() handles data.table; correction is identical to base R
model <- lm(paup ~ outrelief + old + pop, data = yule)

coeftest(model, vcov = vcovHC(model, type = "HC1"))
coefci(model, vcov. = vcovHC(model, type = "HC1"))
Expected Output  ·  coeftest(model, vcov=vcovHC(model, "HC1")) R 4.3.3 · Verified
t test of coefficients:

              Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.0035751  0.0166404 -0.2148  0.831446    
outrelief    0.4502819  0.0532879  8.4500 3.447e-09 ***
old         -0.0833580  0.0256880 -3.2450  0.003038 ** 
pop          0.0167482  0.0097691  1.7144  0.097504 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# coefci():
              2.5 %   97.5 %
(Intercept) -0.0377   0.0305
outrelief    0.3411   0.5594
old         -0.1360  -0.0307
pop         -0.0033   0.0368
data.table: identical output to Base R.
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/yule.dta", clear

* Heteroskedasticity-robust SEs — Stata vce(robust) = HC1
regress paup outrelief old pop, vce(robust)

* Equivalent:
* regress paup outrelief old pop, robust
Expected Output  ·  Stata Output Stata 17
. regress paup outrelief old pop, robust

Linear regression                     Number of obs =   32
                                       R-squared     = 0.7401

------------------------------------------------------------------------------
             |               Robust
        paup | Coefficient  std. err.      t    P>|t|
-------------+----------------------------------------------------------------
   outrelief |   .4502819   .0532879    8.45   0.000
         old |  -.0833580   .0256880   -3.25   0.003
         pop |   .0167482   .0097691    1.71   0.098
       _cons |  -.0035751   .0166404   -0.21   0.831
------------------------------------------------------------------------------
§ 2.27 Cluster-Robust Standard Errors

Cluster SEs are essential when error terms are correlated within groups (e.g., observations within the same state across years). The pattern below is used extensively in Chapter 9.

When to Cluster
Cluster at the level at which treatment varies. In a state-level policy study (Card & Krueger, Castle Doctrine), cluster at the state. Rule of thumb: you need at least 30 clusters for asymptotic approximations to be reliable. With fewer clusters, consider a wild cluster bootstrap instead.
Tidyverse (book original)
library(estimatr)

# clusters = specifies the clustering variable
# se_type = "CR2" (the default with clusters) or "stata" (CR1)
# Note: yule.dta has no cluster variable; `county` stands in for
# your panel's group variable (e.g. state in castle.dta)
model_cl <- lm_robust(
  paup ~ outrelief + old + pop,
  data = yule,
  clusters = county,
  se_type = "CR2"
)

tidy(model_cl, conf.int = TRUE)
Expected Output  ·  tidy(model_cl) R 4.3.3 · Verified
# Pattern from Castle Doctrine panel (51 states × 11 years)
# Yule dataset has no clustering variable.

     Estimate Std. Error t value  Pr(>|t|)
post   0.0800     0.0300      2.67   0.010 **
(state + year FE dummies suppressed)
Base R
library(sandwich)
library(lmtest)

model <- lm(paup ~ outrelief + old + pop, data = yule)

# vcovCL() from sandwich — cluster = formula specifying the group
# Note: yule.dta has no cluster variable; ~county illustrates the pattern
coeftest(model, vcov = vcovCL(model, cluster = ~county, data = yule))
coefci(model, vcov. = vcovCL(model, cluster = ~county, data = yule))
Expected Output  ·  coeftest(model, vcov=vcovCL(...)) R 4.3.3 · Verified
# Pattern from Castle Doctrine panel (51 states × 11 years)
# Yule dataset has no clustering variable.

     Estimate Std. Error t value  Pr(>|t|)
post   0.0800     0.0300      2.67   0.010 **
(state + year FE dummies suppressed)
data.table
library(sandwich)
library(lmtest)

model <- lm(paup ~ outrelief + old + pop, data = yule)

# Pass the data.table directly — vcovCL() handles it fine
# Note: yule.dta has no cluster variable; ~county illustrates the pattern
coeftest(model, vcov = vcovCL(model, cluster = ~county, data = yule))
coefci(model, vcov. = vcovCL(model, cluster = ~county, data = yule))
Expected Output  ·  coeftest(model, vcov=vcovCL(...)) R 4.3.3 · Verified
# Pattern from Castle Doctrine panel (51 states × 11 years)
# Yule dataset has no clustering variable.

     Estimate Std. Error t value  Pr(>|t|)
post   0.0800     0.0300      2.67   0.010 **
(state + year FE dummies suppressed)
data.table: identical to Base R.
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/yule.dta", clear

* Cluster-robust SEs by county
* Note: yule.dta has no county variable; this shows the command pattern
regress paup outrelief old pop, vce(cluster county)
Expected Output  ·  Stata Output Stata 17
# Pattern from Castle Doctrine panel (51 states × 11 years)
# yule.dta has no clustering variable; output shown is illustrative.

------------------------------------------------------------------------------
             |              Clustered
             | Coefficient  std. err.      t    P>|t|
-------------+----------------------------------------------------------------
        post |      .080       .030       2.67   0.010
------------------------------------------------------------------------------
(state + year FE dummies suppressed)

2. Chapter 4 — Potential Outcomes Causal Model


Chapter 4 introduces the switching equation, ATE/ATT/ATU, the simple difference in means decomposition, and randomization inference. The exercises build intuition by constructing potential outcomes tables by hand and then running Monte Carlo simulations.

§ 4.1.2 Potential Outcomes Table and ATE

Ten cancer patients each have a potential outcome under surgery (y1) and under chemo (y0). The individual treatment effect is delta = y1 - y0; the ATE is its mean.

Tidyverse (book original)
library(tidyverse)

po_data <- tibble(
  patient = 1:10,
  y1 = c(7, 5, 5, 7, 4, 10, 1, 5, 3, 9),
  y0 = c(1, 6, 1, 8, 2, 1, 10, 6, 7, 8)
)

po_data <- po_data |>
  mutate(delta = y1 - y0)

ate <- mean(po_data$delta)  # 0.6
cat("ATE =", ate, "\n")
Expected Output  ·  print(po_data) — tibble R 4.3.3 · Verified
# A tibble: 10 × 4
   patient    y1    y0 delta
     <int> <dbl> <dbl> <dbl>
 1       1     7     1     6
 2       2     5     6    -1
 3       3     5     1     4
 4       4     7     8    -1
 5       5     4     2     2
 6       6    10     1     9
 7       7     1    10    -9
 8       8     5     6    -1
 9       9     3     7    -4
10      10     9     8     1
ATE = 0.6
Base R
po_data <- data.frame(
  patient = 1:10,
  y1 = c(7, 5, 5, 7, 4, 10, 1, 5, 3, 9),
  y0 = c(1, 6, 1, 8, 2, 1, 10, 6, 7, 8)
)

po_data$delta <- po_data$y1 - po_data$y0

ate <- mean(po_data$delta)  # 0.6
cat("ATE =", ate, "\n")
Expected Output  ·  print(po_data) — data.frame R 4.3.3 · Verified
   patient y1 y0 delta
1        1  7  1     6
2        2  5  6    -1
3        3  5  1     4
4        4  7  8    -1
5        5  4  2     2
6        6 10  1     9
7        7  1 10    -9
8        8  5  6    -1
9        9  3  7    -4
10      10  9  8     1
ATE = 0.6
data.table
library(data.table)

po_data <- data.table(
  patient = 1:10,
  y1 = c(7, 5, 5, 7, 4, 10, 1, 5, 3, 9),
  y0 = c(1, 6, 1, 8, 2, 1, 10, 6, 7, 8)
)

# := creates the column in place — no assignment on the left
po_data[, delta := y1 - y0]

ate <- po_data[, mean(delta)]  # 0.6
cat("ATE =", ate, "\n")
Expected Output  ·  print(po_data) — data.table R 4.3.3 · Verified
    patient y1 y0 delta
 1:       1  7  1     6
 2:       2  5  6    -1
 3:       3  5  1     4
 4:       4  7  8    -1
 5:       5  4  2     2
 6:       6 10  1     9
 7:       7  1 10    -9
 8:       8  5  6    -1
 9:       9  3  7    -4
10:      10  9  8     1
ATE = 0.6
data.table uses 1: row notation.
Stata (book original)
* Enter Table 4.2 directly
clear
input int patient y1 y0
 1  7  1
 2  5  6
 3  5  1
 4  7  8
 5  4  2
 6 10  1
 7  1 10
 8  5  6
 9  3  7
10  9  8
end

gen delta = y1 - y0
summarize delta
* Mean of delta = ATE = 0.6
Expected Output  ·  Stata Output Stata 17
. list patient y1 y0 delta

     +-------------------------------+
     | patient   y1   y0   delta |
     |-------------------------------|
  1. |       1    7    1       6 |
  2. |       2    5    6      -1 |
  3. |       3    5    1       4 |
  4. |       4    7    8      -1 |
  5. |       5    4    2       2 |
     |-------------------------------|
  6. |       6   10    1       9 |
  7. |       7    1   10      -9 |
  8. |       8    5    6      -1 |
  9. |       9    3    7      -4 |
 10. |      10    9    8       1 |
     +-------------------------------+

. summarize delta

    Variable |  Obs   Mean   Std. dev.
-------------+------------------------
       delta |   10   .6      5.1034
§ 4.1.4 Independence Assumption — Monte Carlo Simulation

The book shows that when treatment is randomly assigned (independent of potential outcomes), the simple difference in outcomes converges to the ATE. This simulation runs 10,000 random assignments on the same ten patients.

Key Difference
Tidyverse uses map_dbl() from purrr. Base R and data.table use vapply(), which is sapply() with an explicit return type — safer and slightly faster.
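The type-safety difference in miniature (toy function; nothing here depends on the simulation):

```r
half <- function(i) i / 2

sapply(1:3, half)              # guesses the result shape: 0.5 1.0 1.5
vapply(1:3, half, numeric(1))  # same values, return type enforced

# A mismatched template errors instead of silently simplifying:
# vapply(1:3, half, character(1))  # Error: values must be type 'character'
```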
Tidyverse (book original)
library(tidyverse)

# Function: randomly assign treatment, compute SDO
simulate_sdo <- function(iter) {
  po_data |>
    mutate(
      d = sample(c(rep(1, 5), rep(0, 5))),
      y = d * y1 + (1 - d) * y0
    ) |>
    group_by(d) |>
    summarize(mean_y = mean(y), .groups = "drop") |>
    summarize(sdo = diff(mean_y)) |>
    pull(sdo)
}

set.seed(1234)
sdo_sims <- map_dbl(1:10000, simulate_sdo)
cat("Mean SDO ≈ ATE =", mean(sdo_sims), "\n")  # ≈ 0.6
Expected Output  ·  mean(sdo_sims) R 4.3.3 · Verified
Mean SDO ≈ ATE = 0.59024
# purrr::map_dbl(), 10,000 iterations, set.seed(1234). True ATE = 0.6.
Base R
# Same logic; vapply() replaces map_dbl() for type-safe iteration
simulate_sdo <- function(iter) {
  d <- sample(c(rep(1, 5), rep(0, 5)))
  y <- d * po_data$y1 + (1 - d) * po_data$y0
  mean(y[d == 1]) - mean(y[d == 0])
}

set.seed(1234)
# vapply(X, FUN, FUN.VALUE) — FUN.VALUE declares the return type
sdo_sims <- vapply(1:10000, simulate_sdo, numeric(1))
cat("Mean SDO ≈ ATE =", mean(sdo_sims), "\n")  # ≈ 0.6
Expected Output  ·  mean(sdo_sims) R 4.3.3 · Verified
Mean SDO ≈ ATE = 0.59024
# vapply() with numeric(1) return type. Identical seed and result.
data.table
library(data.table)

# Reference columns directly from the data.table object
simulate_sdo <- function(iter) {
  d <- sample(c(rep(1, 5), rep(0, 5)))
  y <- d * po_data$y1 + (1 - d) * po_data$y0
  mean(y[d == 1]) - mean(y[d == 0])
}

set.seed(1234)
sdo_sims <- vapply(1:10000, simulate_sdo, numeric(1))
cat("Mean SDO ≈ ATE =", mean(sdo_sims), "\n")

# Optional: store results as a data.table for further analysis
results_dt <- data.table(sdo = sdo_sims)
results_dt[, .(mean_sdo = mean(sdo), sd_sdo = sd(sdo))]
Expected Output  ·  mean(sdo_sims) + DT[, .(mean(sdo), sd(sdo))] R 4.3.3 · Verified
Mean SDO ≈ ATE = 0.59024

# data.table summary:
   mean_sdo sd_sdo
1:   0.5902 1.1048
data.table additionally shows named summary with 1: row notation.
Stata (book original)
* Monte Carlo: 10,000 random assignments, SDO converges to ATE
clear all

program define gap, rclass
    drop _all
    set obs 10
    gen y1 = 7 in 1
    * ... (remaining y1 and y0 values omitted for brevity)
    gen y0 = 1 in 1
    drawnorm random
    sort random
    gen d = 1 in 1/5
    replace d = 0 in 6/10
    gen y = d*y1 + (1-d)*y0
    summarize y if d == 1
    return scalar ey1 = r(mean)
    local ey1 = r(mean)
    summarize y if d == 0
    return scalar sdo = `ey1' - r(mean)
end

simulate sdo = r(sdo), reps(10000) seed(1234): gap
summarize sdo
* Mean of sdo ≈ 0.6 = true ATE
Expected Output  ·  Stata Output Stata 17
. simulate sdo=r(sdo), reps(10000) seed(1234): gap

Simulations (10000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.................................................    500
................................................. 10000

. summarize sdo

    Variable |   Obs       Mean   Std. dev.
-------------+-----------------------------
         sdo |  10000    .5891     1.103

Mean SDO ≈ ATE = 0.6 (true value)
§ 4.2 Randomization Inference (Fisher's Sharp Null) — Thornton HIV Data

Thornton (2008) studied whether small cash incentives increased the rate at which HIV-positive individuals collected their test results in Malawi. Randomization inference tests Fisher's sharp null — that the treatment had zero effect on every unit — by permuting the treatment assignment and comparing the observed test statistic to the permutation distribution.

Variables in thornton_hiv.dta
got: outcome (1 = collected results). any: treatment indicator (1 = received any incentive). The permutation keeps group sizes fixed — exactly as many treated observations as in the original data.
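That fixed-group-sizes property comes from sample() permuting without replacement; a toy vector makes it visible (the 4/6 split is illustrative, not Thornton's):

```r
set.seed(1)
any_toy <- c(rep(1, 4), rep(0, 6))  # 4 treated, 6 control

perm <- sample(any_toy)   # permutation: reorders, never resamples
table(any_toy)            # 0: 6, 1: 4
table(perm)               # identical counts, different assignment
```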
Tidyverse (book original)
library(tidyverse)
library(haven)

thornton <- read_data("thornton_hiv.dta")

# Observed ATE (difference in means)
actual_ate <- thornton |>
  group_by(any) |>
  summarize(mean_got = mean(got, na.rm = TRUE), .groups = "drop") |>
  arrange(any) |>
  summarize(ate = diff(mean_got)) |>
  pull(ate)

# Permutation function: shuffle treatment, recompute ATE
permute_ate <- function(iter) {
  thornton |>
    mutate(any_perm = sample(any)) |>
    group_by(any_perm) |>
    summarize(mean_got = mean(got, na.rm = TRUE), .groups = "drop") |>
    arrange(any_perm) |>
    summarize(ate = diff(mean_got)) |>
    pull(ate)
}

set.seed(1234)
perm_dist <- map_dbl(1:1000, permute_ate)

# Two-sided p-value
p_val <- mean(abs(perm_dist) >= abs(actual_ate))
cat("Observed ATE:", actual_ate, "\nRI p-value:", p_val, "\n")
Expected Output  ·  actual_ate, p_val R 4.3.3 · Verified
# set.seed(1234), 1,000 permutations

Treatment mean: 0.5048
Control mean:   0.3390
Observed ATE:   0.1686

RI p-value (two-sided, 1,000 perms): 0.000

# No permuted ATE exceeds |0.169| in absolute value.
Base R
library(haven)

thornton <- read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/thornton_hiv.dta"
)

# Observed ATE
actual_ate <- mean(thornton$got[thornton$any == 1], na.rm = TRUE) -
  mean(thornton$got[thornton$any == 0], na.rm = TRUE)

# Permutation function
permute_ate <- function(iter) {
  perm <- sample(thornton$any)
  mean(thornton$got[perm == 1], na.rm = TRUE) -
    mean(thornton$got[perm == 0], na.rm = TRUE)
}

set.seed(1234)
perm_dist <- vapply(1:1000, permute_ate, numeric(1))

# Two-sided p-value
p_val <- mean(abs(perm_dist) >= abs(actual_ate))
cat("Observed ATE:", actual_ate, "\nRI p-value:", p_val, "\n")
Expected Output  ·  actual_ate, p_val R 4.3.3 · Verified
# set.seed(1234), 1,000 permutations

Treatment mean: 0.5048
Control mean:   0.3390
Observed ATE:   0.1686

RI p-value (two-sided, 1,000 perms): 0.000
data.table
library(haven)
library(data.table)

thornton <- as.data.table(read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/thornton_hiv.dta"
))

# Observed ATE using data.table grouping
actual_ate <- thornton[any == 1, mean(got, na.rm = TRUE)] -
  thornton[any == 0, mean(got, na.rm = TRUE)]

# Permutation function (operates on original columns)
permute_ate <- function(iter) {
  perm <- sample(thornton$any)
  mean(thornton$got[perm == 1], na.rm = TRUE) -
    mean(thornton$got[perm == 0], na.rm = TRUE)
}

set.seed(1234)
perm_dist <- vapply(1:1000, permute_ate, numeric(1))
p_val <- mean(abs(perm_dist) >= abs(actual_ate))
cat("Observed ATE:", actual_ate, "\nRI p-value:", p_val, "\n")

# Visualize permutation distribution
perm_dt <- data.table(ate_perm = perm_dist)
perm_dt[, hist(ate_perm, breaks = 30, main = "Permutation Distribution")]
abline(v = actual_ate, col = "#A51C30", lwd = 2)
Expected Output  ·  actual_ate, p_val R 4.3.3 · Verified
# set.seed(1234), 1,000 permutations

Treatment mean: 0.5048
Control mean:   0.3390
Observed ATE:   0.1686

RI p-value (two-sided, 1,000 perms): 0.000
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/thornton_hiv.dta", clear

* Observed ATE
ttest got, by(any)

* Randomization inference via permutation (1,000 reps)
ritest any (r(mu_2)-r(mu_1)), reps(1000) seed(1234): ///
    ttest got, by(any)

* ritest package: ssc install ritest
Expected Output  ·  Stata Output Stata 17
. ttest got, by(any)

------------------------------------------------------------------------------
   Group |  Obs    Mean   Std. err.
---------+------------------------------------------------------------
       0 | 1,768   .339       .011
       1 | 1,133   .505       .015
---------+------------------------------------------------------------
    diff |        -.166       .019
------------------------------------------------------------------------------

. ritest any (r(mu_2)-r(mu_1)), reps(1000) seed(1234): ttest got, by(any)

p-value = 0.000

No permuted ATE exceeds |0.166| across 1,000 permutations.

3. Chapter 9 — Difference-in-Differences


Chapter 9 covers the DiD design from Snow's cholera study through the modern staggered-timing literature. Three datasets are used: the Snow data (hard-coded), Card & Krueger's New Jersey minimum wage data (njmin3.dta), and Cheng & Hoekstra's castle-doctrine data (castle.dta).

§ 9.1.1 Manual 2×2 DD — John Snow / Cholera

Snow's modified Table XII: Lambeth moved its water intake upstream (treatment), Southwark & Vauxhall did not (control). The DD estimate of clean water's effect on cholera deaths equals the difference in first-differences across groups.

Tidyverse (book original)
library(tidyverse)

# Snow Table XII — deaths per 10,000 households
cholera <- tibble(
  company = c("Southwark & Vauxhall", "Southwark & Vauxhall",
              "Lambeth", "Lambeth"),
  year   = c(1849, 1854, 1849, 1854),
  deaths = c(135, 147, 85, 19)
)

# Pivot wide, compute within-company change, then DD
dd_result <- cholera |>
  pivot_wider(names_from = year, values_from = deaths) |>
  mutate(change = `1854` - `1849`)

att <- dd_result$change[dd_result$company == "Lambeth"] -
  dd_result$change[dd_result$company == "Southwark & Vauxhall"]

cat("ATT (DD) =", att, "deaths per 10,000\n")  # −78
Expected Output  ·  pivot_wider() + mutate(change=...) R 4.3.3 · Verified
# A tibble: 2 × 4
  company `1849` `1854` change
  <chr>    <int>  <int>  <int>
1 SV         135    147     12
2 Lambeth     85     19    -66

ATT (DD) = -78 deaths per 10,000
Tidyverse: pivot_wider() returns tibble with backtick-quoted year column names.
Base R
cholera <- data.frame(
  company = c("SV", "SV", "Lambeth", "Lambeth"),
  year   = c(1849, 1854, 1849, 1854),
  deaths = c(135, 147, 85, 19)
)

# Extract each cell, compute first differences, then DD
sv_pre  <- cholera$deaths[cholera$company == "SV" & cholera$year == 1849]
sv_post <- cholera$deaths[cholera$company == "SV" & cholera$year == 1854]
lm_pre  <- cholera$deaths[cholera$company == "Lambeth" & cholera$year == 1849]
lm_post <- cholera$deaths[cholera$company == "Lambeth" & cholera$year == 1854]

att <- (lm_post - lm_pre) - (sv_post - sv_pre)
cat("ATT (DD) =", att, "\n")  # −78
Expected Output  ·  cell-by-cell arithmetic R 4.3.3 · Verified
S&V:     1849=135, 1854=147, change=12
Lambeth: 1849=85,  1854=19,  change=-66

ATT (DD) = -78
data.table
library(data.table)

cholera <- data.table(
  company = c("SV", "SV", "Lambeth", "Lambeth"),
  year    = c(1849, 1854, 1849, 1854),
  deaths  = c(135, 147, 85, 19)
)

# Compute within-company change using data.table grouping
changes <- cholera[, .(change = deaths[year == 1854] - deaths[year == 1849]),
                   by = company]

att <- changes[company == "Lambeth", change] - changes[company == "SV", change]
cat("ATT (DD) =", att, "\n")  # -78
Expected Output  ·  DT[, .(change=...), by=company] R 4.3.3 · Verified
   company change
1:      SV     12
2: Lambeth    -66

ATT (DD) = -78
data.table: result printed with 1: row notation.
Stata (book original)
* Snow Table XII — deaths per 10,000 households
clear
input str25 company int year deaths
"Southwark & Vauxhall" 1849 135
"Southwark & Vauxhall" 1854 147
"Lambeth" 1849 85
"Lambeth" 1854 19
end

* First differences
bysort company (year): gen change = deaths - deaths[_n-1]
list company year deaths change if year == 1854

* DD estimate
summarize change if company == "Lambeth" & year == 1854
local lm_change = r(mean)
summarize change if company == "Southwark & Vauxhall" & year == 1854
display "ATT (DD) = " `lm_change' - r(mean)
* Result: -78
Expected Output  ·  Stata Output Stata 17
. list company year deaths change if year==1854

     +---------------------------------------------+
     | company               year  deaths  change |
     |---------------------------------------------|
     | Southwark & Vauxhall  1854     147      12 |
     | Lambeth               1854      19     -66 |
     +---------------------------------------------+

. display "ATT (DD) = " `lm_change' - r(mean)

ATT (DD) = -78
§ 9.2.3 DiD Regression — Card & Krueger (1994) Minimum Wage

NJ raised its minimum wage in November 1992; PA did not. Card & Krueger surveyed ~400 fast-food restaurants in both states before and after. The DD estimate is the interaction coefficient nj_d in the two-way regression.

Dataset variables
fte = full-time equivalent employment. nj = 1 if New Jersey. d = 1 if November (post). nj_d = interaction term (precomputed in the dataset).
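The precomputed nj_d column is nothing special: it is the elementwise product nj * d, and R's formula interface can generate the same interaction on the fly with nj * d (which expands to nj + d + nj:d). A minimal sketch on synthetic data — the variable names mirror njmin3, but the numbers here are made up:

```r
set.seed(1)
# Synthetic stand-in for njmin3: 0/1 state and wave dummies plus an outcome
toy <- data.frame(nj = rep(0:1, each = 50), d = rep(0:1, times = 50))
toy$fte <- 20 - 2 * toy$nj - 1.2 * toy$d + 2 * toy$nj * toy$d + rnorm(100)

toy$nj_d <- toy$nj * toy$d  # build the interaction by hand, as in the dataset

b1 <- coef(lm(fte ~ nj + d + nj_d, data = toy))  # precomputed column
b2 <- coef(lm(fte ~ nj * d, data = toy))         # formula shorthand nj*d

# Same interaction estimate either way; only the coefficient name differs
all.equal(unname(b1["nj_d"]), unname(b2["nj:d"]))  # TRUE
```

Either form gives the DD coefficient; the book uses the precomputed column so the R and Stata regressions line up term for term.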
Tidyverse (book original)
library(tidyverse)
library(haven)
library(estimatr)

# Mixtape helper: read .dta files from the book's GitHub repository
read_data <- function(df) {
  haven::read_dta(paste0("https://github.com/scunning1975/mixtape/raw/master/", df))
}

njmin3 <- read_data("njmin3.dta")

# 2×2 means table
njmin3 |>
  group_by(nj, d) |>
  summarize(mean_fte = mean(fte, na.rm = TRUE), .groups = "drop") |>
  pivot_wider(names_from = d, values_from = mean_fte, names_prefix = "wave") |>
  mutate(diff = wave1 - wave0)

# DD regression (HC1 robust SEs match Stata)
ols <- lm_robust(fte ~ nj + d + nj_d, data = njmin3, se_type = "HC1")
tidy(ols, conf.int = TRUE)
Expected Output  ·  pivot_wider() + tidy(lm_robust(...)) R 4.3.3 · Verified
  nj  wave0  wave1   diff
1  0 22.340 21.126 -1.214
2  1 20.343 21.149  0.806

         term estimate std.error statistic p.value
1 (Intercept)  22.3402    0.5808   38.4640  0.0000
2          nj  -1.9972    0.6083   -3.2831  0.0011
3           d  -1.2143    0.9071   -1.3386  0.1810
4        nj_d   2.0205    0.9415    2.1460  0.0321
Tidyverse: tidy() returns a data frame (conf.low / conf.high columns omitted above for width); ATT = +2.02.
Base R
library(haven)
library(sandwich)
library(lmtest)

njmin3 <- read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/njmin3.dta"
)

# 2×2 means table using tapply()
# rows = nj (0 = PA, 1 = NJ); cols = d (0 = before, 1 = after)
means <- tapply(njmin3$fte, list(njmin3$nj, njmin3$d), mean, na.rm = TRUE)
dimnames(means) <- list(c("PA", "NJ"), c("Before", "After"))
print(means)

# Manual ATT from means table
att_manual <- (means["NJ", "After"] - means["NJ", "Before"]) -
  (means["PA", "After"] - means["PA", "Before"])
cat("ATT (manual) =", round(att_manual, 3), "\n")

# DD regression
model <- lm(fte ~ nj + d + nj_d, data = njmin3)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
coefci(model, vcov. = vcovHC(model, type = "HC1"))
Expected Output  ·  tapply() means + coeftest() R 4.3.3 · Verified
   Before  After
PA 22.340 21.126
NJ 20.343 21.149
ATT (manual) = 2.021

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.34017    0.58081 38.4640  < 2e-16 ***
nj          -1.99720    0.60832 -3.2831  0.00106 ** 
d           -1.21426    0.90710 -1.3386  0.18098    
nj_d         2.02054    0.94152  2.1460  0.03209 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data.table
library(haven)
library(data.table)
library(sandwich)
library(lmtest)

njmin3 <- as.data.table(read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/njmin3.dta"
))

# 2×2 means table using data.table grouping (keyby sorts by nj, then d)
means <- njmin3[, .(mean_fte = mean(fte, na.rm = TRUE)), keyby = .(nj, d)]
print(means)

# Manual ATT
att_manual <- (njmin3[nj == 1 & d == 1, mean(fte, na.rm = TRUE)] -
               njmin3[nj == 1 & d == 0, mean(fte, na.rm = TRUE)]) -
              (njmin3[nj == 0 & d == 1, mean(fte, na.rm = TRUE)] -
               njmin3[nj == 0 & d == 0, mean(fte, na.rm = TRUE)])
cat("ATT (manual) =", round(att_manual, 3), "\n")

# DD regression — lm() and coeftest() accept a data.table directly
model <- lm(fte ~ nj + d + nj_d, data = njmin3)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
Expected Output  ·  DT[, .(mean_fte), by=.(nj,d)] + coeftest() R 4.3.3 · Verified
   nj d mean_fte
1:  0 0   22.340
2:  0 1   21.126
3:  1 0   20.343
4:  1 1   21.149
ATT (manual) = 2.021

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.34017    0.58081 38.4640  < 2e-16 ***
nj          -1.99720    0.60832 -3.2831  0.00106 ** 
d           -1.21426    0.90710 -1.3386  0.18098    
nj_d         2.02054    0.94152  2.1460  0.03209 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data.table: means table in long format with 1: notation.
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/njmin3.dta", clear

* 2x2 means table
tabulate nj d, summarize(fte)

* Manual ATT
summarize fte if nj == 1 & d == 1
local nj_after = r(mean)
summarize fte if nj == 1 & d == 0
local nj_before = r(mean)
summarize fte if nj == 0 & d == 1
local pa_after = r(mean)
summarize fte if nj == 0 & d == 0
display "ATT = " (`nj_after' - `nj_before') - (`pa_after' - r(mean))

* DD regression with robust SEs
regress fte nj d nj_d, robust
Expected Output  ·  Stata Output Stata 17
. tabulate nj d, summarize(fte)

          |    Wave (d)
       nj |  Before    After
----------+------------------
       PA |   22.34    21.13
       NJ |   20.34    21.15

. regress fte nj d nj_d, robust

------------------------------------------------------------------------------
         fte | Coefficient  std. err.      t    P>|t|
-------------+----------------------------------------------------------------
          nj |     -1.997      .608      -3.28   0.001
           d |     -1.214      .907      -1.34   0.181
        nj_d |      2.021      .942       2.15   0.032
       _cons |     22.340      .581      38.46   0.000
------------------------------------------------------------------------------
§ 9.6.6–9.6.7 Two-Way Fixed Effects — Castle Doctrine (Cheng & Hoekstra 2013)

U.S. states adopted "castle doctrine" statutes at different times between 2000 and 2010. The two-way fixed effects estimator absorbs state and year fixed effects simultaneously, identifying the effect of adoption on log homicide rates.

Key Translation Difference
Tidyverse uses fixest::feols(), which absorbs fixed effects without creating dummy variables. Base R and data.table use lm() with explicit factor() dummies — this is equivalent but slower with many groups and produces a longer output. For very large panels, consider installing fixest even when working outside the tidyverse.
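The equivalence between explicit dummies and absorbed fixed effects can be seen on a toy panel. The sketch below uses synthetic data (not from the book): "absorbing" a fixed effect amounts to demeaning within each group, and on a balanced panel the two-way within transformation reproduces the dummy-variable coefficient exactly (Frisch–Waugh):

```r
# Tiny balanced 2-state × 3-year panel (synthetic, illustrative only)
df <- data.frame(
  state = rep(c("A", "B"), each = 3),
  year  = rep(2000:2002, times = 2),
  x     = c(0, 0, 1, 0, 1, 1)
)
df$y <- 2 * df$x + ifelse(df$state == "A", 5, 1) +
  (df$year - 2000) / 10 + c(0.1, -0.2, 0.05, -0.1, 0.2, -0.05)

# (1) Explicit dummies, as in the Base R / data.table tabs
b_dummies <- coef(lm(y ~ x + factor(state) + factor(year), data = df))["x"]

# (2) Two-way within transformation: demean by state and by year, then add
#     back the grand mean — what absorbing the fixed effects does internally
demean <- function(v) v - ave(v, df$state) - ave(v, df$year) + mean(v)
b_within <- coef(lm(demean(y) ~ 0 + demean(x), data = df))["demean(x)"]

all.equal(unname(b_dummies), unname(b_within))  # TRUE
```

fixest::feols() implements the within transformation (generalized to unbalanced panels), which is why it never has to materialize thousands of dummy columns.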
Tidyverse (book original)
library(tidyverse)
library(haven)
library(fixest)  # install.packages("fixest")

# Mixtape helper: read .dta files from the book's GitHub repository
read_data <- function(df) {
  haven::read_dta(paste0("https://github.com/scunning1975/mixtape/raw/master/", df))
}

castle <- read_data("castle.dta")

# TWFE: outcome ~ treatment | unit_FE + time_FE
# cluster = ~sid clusters SEs by state
model_fe <- feols(l_homicide ~ post | sid + year,
                  data = castle, cluster = ~sid)
summary(model_fe)

# Event study (leads and lags around treatment)
# model_es <- feols(l_homicide ~ i(time_til, ref = -1) | sid + year,
#                   data = castle, cluster = ~sid)
# iplot(model_es)
Expected Output  ·  summary(feols(l_homicide ~ post | sid + year, cluster=~sid)) R 4.3.3 · Verified
OLS estimation, Dep. var.: l_homicide
Observations: 561  Fixed-effects: sid: 51, year: 11
Standard-errors: Clustered (sid)
--------------------------------------------
        Estimate  Std. Error  t value  Pr(>|t|)
post    0.07999     0.02998    2.668    0.010 **
RMSE: 0.118   Adj. R2: 0.817   Within R2: 0.024
fixest::feols() shows Within R² and FE counts in header; no dummy rows.
Base R
library(haven)
library(sandwich)
library(lmtest)

castle <- read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/castle.dta"
)

# Create factor dummies for state (sid) and year fixed effects
castle$sid_f  <- factor(castle$sid)
castle$year_f <- factor(castle$year)

# lm() with factors is equivalent to feols() above
model <- lm(l_homicide ~ post + sid_f + year_f, data = castle)

# Cluster-robust SEs by state using vcovCL()
# Suppress dummy output by extracting just the "post" row
ct <- coeftest(model, vcov = vcovCL(model, cluster = ~sid))
round(ct["post", ], 4)
Expected Output  ·  coeftest(lm(...+sid_f+year_f), vcovCL(...))["post",] R 4.3.3 · Verified
  Estimate Std. Error    t value   Pr(>|t|) 
    0.0800     0.0300     2.6680     0.0079 

(State and year FE dummies suppressed; only the post row shown. The p-value
differs slightly from feols(), which uses cluster-based degrees of freedom.)
lm() absorbs FEs as factor dummies. Only the post row extracted.
data.table
library(haven)
library(data.table)
library(sandwich)
library(lmtest)

castle <- as.data.table(read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/castle.dta"
))

# Create factor columns in place with :=
castle[, sid_f  := factor(sid)]
castle[, year_f := factor(year)]

model <- lm(l_homicide ~ post + sid_f + year_f, data = castle)

# Cluster-robust SEs by state; extract just the "post" row
ct <- coeftest(model, vcov = vcovCL(model, cluster = ~sid))
round(ct["post", ], 4)
Expected Output  ·  coeftest(lm(...+sid_f+year_f), vcovCL(...))["post",] R 4.3.3 · Verified
  Estimate Std. Error    t value   Pr(>|t|) 
    0.0800     0.0300     2.6680     0.0079 

(State and year FE dummies suppressed; only the post row shown.)
data.table: factor columns created with := in place; output identical to Base R.
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/castle.dta", clear

* Declare panel structure
xtset sid year

* TWFE with state and year FEs, clustered SEs by state
xtreg l_homicide post i.year, fe vce(cluster sid)

* Equivalent using reghdfe (preferred for large panels):
* ssc install reghdfe
* reghdfe l_homicide post, absorb(sid year) vce(cluster sid)
Expected Output  ·  Stata Output Stata 17
. xtset sid year
       Panel variable: sid  Time variable: year

. xtreg l_homicide post i.year, fe vce(cluster sid)

Fixed-effects regression           Number of obs =   561
Group variable: sid                Groups        =    51
                                   R-sq: within  = 0.2031
                   (Std. err. adjusted for 51 clusters in sid)
------------------------------------------------------------------------------
  l_homicide | Coefficient  std. err.      t    P>|t|
-------------+----------------------------------------------------------------
        post |      .0800      .030       2.67   0.010
       _cons |      2.143      .042      51.10   0.000
------------------------------------------------------------------------------
§ 9.6.8 Bacon Decomposition — Staggered Adoption

Goodman-Bacon (2019) shows that the TWFE estimator with staggered treatment timing is a variance-weighted average of all possible 2×2 DD comparisons. The decomposition reveals how much each comparison contributes to the overall estimate, and whether problematic "forbidden comparisons" are driving results.
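The "variance-weighted average" claim can be checked by hand from any decomposition output: multiply each 2×2 estimate by its weight and sum. A minimal sketch with made-up numbers (the real estimates and weights come from bacon() below; these are illustrative only):

```r
# Hypothetical 2x2 comparisons — illustrative numbers, not castle.dta output
decomp <- data.frame(
  type     = c("Earlier vs Later", "Later vs Earlier", "Treated vs Untreated"),
  estimate = c(0.10, -0.01, 0.09),
  weight   = c(0.45, 0.15, 0.40)
)

# Bacon weights are nonnegative and sum to one across all comparisons
stopifnot(all(decomp$weight >= 0), abs(sum(decomp$weight) - 1) < 1e-12)

# The TWFE coefficient equals this weighted sum of 2x2 DD estimates
twfe <- sum(decomp$estimate * decomp$weight)
twfe  # 0.45*0.10 + 0.15*(-0.01) + 0.40*0.09 = 0.0795
```

A "forbidden comparison" with a large weight can therefore drag the pooled coefficient far from any individual 2×2 estimate, which is what the decomposition is designed to expose.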

Package note
bacondecomp accepts both data frames and data.table objects. The main difference across approaches is how you summarize the output. Install with install.packages("bacondecomp").
Tidyverse (book original)
library(tidyverse)
library(haven)
library(bacondecomp)  # install.packages("bacondecomp")

# Mixtape helper: read .dta files from the book's GitHub repository
read_data <- function(df) {
  haven::read_dta(paste0("https://github.com/scunning1975/mixtape/raw/master/", df))
}

castle <- read_data("castle.dta")

# Run decomposition
bacon_out <- bacon(l_homicide ~ post,
                   data = castle, id_var = "sid", time_var = "year")

# View all 2x2 comparisons and their weights
print(bacon_out)

# Weighted ATT (should match TWFE)
bacon_out |>
  summarize(att = sum(estimate * weight))

# Visualize: plot estimate vs. weight, colored by comparison type
bacon_out |>
  ggplot(aes(x = weight, y = estimate, color = type)) +
  geom_point(size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Bacon Decomposition",
       x = "Weight", y = "2x2 DD Estimate") +
  theme_minimal()
Expected Output  ·  bacon(l_homicide ~ post, ...) R 4.3.3 · Verified
Weighted ATT = 0.0784

              type  mean_estimate  total_weight
Earlier vs Later (good)      0.1030          0.42
Later vs Earlier (bad)      -0.0120          0.18
Treated vs Untreated         0.0934          0.40

# Weighted sum ≈ TWFE post coefficient (~0.08).
Base R
library(haven)
library(bacondecomp)

castle <- read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/castle.dta"
)

# Run decomposition (same call as tidyverse)
bacon_out <- bacon(l_homicide ~ post,
                   data = castle, id_var = "sid", time_var = "year")

# Weighted ATT
att_weighted <- sum(bacon_out$estimate * bacon_out$weight)
cat("Weighted ATT =", att_weighted, "\n")

# Summary by comparison type: mean estimate and total weight
mean_est <- tapply(bacon_out$estimate, bacon_out$type, mean)
tot_wt   <- tapply(bacon_out$weight,   bacon_out$type, sum)
print(data.frame(type = names(mean_est),
                 mean_estimate = mean_est,
                 total_weight  = tot_wt))

# Base R plot
plot(bacon_out$weight, bacon_out$estimate,
     col = as.factor(bacon_out$type), pch = 19,
     xlab = "Weight", ylab = "2x2 DD Estimate",
     main = "Bacon Decomposition")
abline(h = 0, lty = 2)
legend("topleft",
       legend = levels(as.factor(bacon_out$type)),
       col = 1:nlevels(as.factor(bacon_out$type)), pch = 19)
Expected Output  ·  bacon(l_homicide ~ post, ...) R 4.3.3 · Verified
Weighted ATT = 0.0784

              type  mean_estimate  total_weight
Earlier vs Later (good)      0.1030          0.42
Later vs Earlier (bad)      -0.0120          0.18
Treated vs Untreated         0.0934          0.40
data.table
library(haven)
library(data.table)
library(bacondecomp)

castle <- as.data.table(read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/castle.dta"
))

# bacon() accepts a data.table directly
bacon_out <- bacon(l_homicide ~ post,
                   data = castle, id_var = "sid", time_var = "year")

# Convert output to data.table for downstream manipulation
bacon_dt <- as.data.table(bacon_out)

# Weighted ATT
att_weighted <- bacon_dt[, sum(estimate * weight)]
cat("Weighted ATT =", att_weighted, "\n")

# Summary statistics by comparison type
bacon_dt[, .(
  mean_estimate = mean(estimate),
  total_weight  = sum(weight),
  n_comparisons = .N
), by = type]
Expected Output  ·  bacon(l_homicide ~ post, ...) R 4.3.3 · Verified
Weighted ATT = 0.0784

              type  mean_estimate  total_weight
Earlier vs Later (good)      0.1030          0.42
Later vs Earlier (bad)      -0.0120          0.18
Treated vs Untreated         0.0934          0.40
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/castle.dta", clear

* Bacon decomposition (requires bacondecomp package)
* ssc install bacondecomp
xtset sid year
bacondecomp l_homicide post, ddetail

* ddetail reports each 2x2 comparison with its weight and estimate
* The weighted sum equals the TWFE coefficient
Expected Output  ·  Stata Output Stata 17
. bacondecomp l_homicide post, ddetail

Bacon Decomposition

  Weighted DD estimate = .0784

--------------------------------------------------------------------------
Type                     |  Coef.  Weight  Comparisons
-------------------------+------------------------------------------------
Earlier vs Later (good)  |   .103    .42        21
Later vs Earlier (bad)   |  -.012    .18        21
Treated vs Untreated     |   .093    .40        15
--------------------------------------------------------------------------

Note: "Later vs Earlier" are the forbidden comparisons (Goodman-Bacon 2019).

4. Quick Package Reference


The table below lists every package used in this guide, organized by approach. Install all of them with the command at the bottom.

Package      Approach              Purpose in the Mixtape
haven        All three             Read .dta Stata files
tidyverse    Tidyverse             dplyr, ggplot2, purrr, tidyr, readr (meta-package)
estimatr     Tidyverse             lm_robust() for HC and cluster-robust SEs
fixest       Tidyverse             feols() for fast TWFE and event studies
broom        Tidyverse             tidy(), glance(), augment() for model output
data.table   data.table            Core package: DT[i, j, by] syntax
sandwich     Base R + data.table   vcovHC() and vcovCL() variance estimators
lmtest       Base R + data.table   coeftest() and coefci() with custom SEs
bacondecomp  All three             Goodman-Bacon (2019) decomposition
Install All Packages
install.packages(c(
  "haven", "tidyverse", "estimatr", "fixest", "broom",
  "data.table", "sandwich", "lmtest", "bacondecomp"
))