
Mixtape Syntax Translations

Causal Inference: The Mixtape (Cunningham) · Chapters 2, 4 & 9 · Tidyverse | Base R | data.table | Stata

0. How to Use This Guide


Each exercise below shows the same computation written four ways, selectable by tab. All four produce identical results. The Tidyverse tab shows the code as written in the Mixtape (with minor modernizations). The Base R and data.table tabs are the R translations. The Stata tab shows the book's original Stata code for comparison.

Pipe Notation
Tidyverse examples use the native pipe |> (R 4.1+) rather than the magrittr pipe %>%. Both work identically for the patterns here. If you are running R 4.0 or earlier, install magrittr and substitute %>%.
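A quick sanity check of the equivalence, using a throwaway vector (any small example works):

```r
library(magrittr)  # provides %>%; not needed for the native pipe

x <- c(1, 4, 9)

# Native pipe (R 4.1+) and magrittr pipe produce the same result here
a <- x |> sqrt() |> sum()
b <- x %>% sqrt() %>% sum()
identical(a, b)  # TRUE (both are 6)
```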
data.table: In-Place Modification
data.table modifies objects in place with := and set*() functions. When you write DT2 <- DT, you get a reference, not a copy. Use copy(DT) when you need an independent duplicate. This is intentional and the source of data.table's speed advantage on large datasets.
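The reference semantics are easy to verify with a toy table (object names here are illustrative):

```r
library(data.table)

DT  <- data.table(x = 1:3)
DT2 <- DT            # reference: DT2 and DT point to the same object
DT2[, x := x * 10]   # modifying DT2 in place...
DT$x                 # ...also changes DT: 10 20 30

DT3 <- copy(DT)      # independent duplicate
DT3[, x := 0]
DT$x                 # unaffected: still 10 20 30
```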
Managing Output Length
Long console output wraps within the output panel, but wide tables and large data frames are easier to read when truncated at source. A few functions to keep in mind:
  • head(df, 10) — first 10 rows of any data frame, tibble, or data.table
  • str(df) — compact type-and-value summary; never wraps awkwardly
  • glimpse(df) — tidyverse equivalent of str(); one line per column
  • print(df, n = 5) — tibble-aware; controls rows without subsetting
  • DT[1:5] — data.table row slice; equivalent to head()
  • options(width = 72) — set console width before printing wide objects
For regression output, broom::tidy(model) returns a clean data frame that prints more predictably than summary() or coeftest() in wide viewports.
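A sketch of that pattern on an arbitrary fit (mtcars is a stand-in dataset; any lm() object works the same way):

```r
library(broom)

model <- lm(mpg ~ wt + hp, data = mtcars)

tidy(model)                   # one row per term: estimate, std.error, statistic, p.value
tidy(model, conf.int = TRUE)  # adds conf.low / conf.high columns
glance(model)                 # one-row model summary: r.squared, sigma, AIC, ...
```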
Quick reference: core idioms by approach
  • Tidyverse: load with read_dta() + pipe; new column with mutate(); grouped summary with group_by() |> summarize(); fixed-effects regression with fixest::feols()
  • Base R: load with haven::read_dta(); new column with df$col <- ...; grouped summary with tapply() or aggregate(); fixed-effects regression with lm() + factor dummies
  • data.table: load with as.data.table(read_dta()); new column with DT[, col := ...]; grouped summary with DT[, .(mean = mean(x)), by = grp]; fixed-effects regression with lm() + factor dummies

1. Chapter 2 — Probability and Regression Review


Chapter 2 introduces OLS, the summation operator, and the role of standard errors. The worked examples use Yule's (1899) data on English pauperism. All three translation patterns below are used repeatedly in later chapters.

§ 2.13 Loading a Stata Dataset from GitHub

The Mixtape stores all datasets as .dta files on GitHub and loads them with a helper function. Base R and data.table simply inline the URL.

Package required
haven is needed for all three approaches to read .dta files. data.table additionally needs the data.table package.
Tidyverse (book original)
library(tidyverse)
library(haven)

# Helper function used throughout the book
read_data <- function(df) {
  full_path <- paste0(
    "https://github.com/scunning1975/mixtape/raw/master/", df
  )
  read_dta(full_path)
}

yule <- read_data("yule.dta")
Expected Output  ·  glimpse(yule) R 4.3.3 · Verified
Rows: 32
Columns: 5
$ union     <chr> "Kensington", "Chelsea", "St George Hanover Sq", "We~
$ paup      <dbl> 0.03, 0.06, 0.09, 0.05, 0.07, 0.08, 0.09, 0.09, 0.09~
$ outrelief <dbl> 0.31, 0.29, 0.31, 0.33, 0.31, 0.23, 0.26, 0.24, 0.25~
$ old       <dbl> 0.52, 0.52, 0.73, 0.56, 0.82, 0.50, 0.55, 0.52, 0.46~
$ pop       <dbl> 0.28, 0.32, 0.13, 0.11, 0.32, 0.63, 0.66, 0.88, 0.06~
Tidyverse: tibble with <chr> / <dbl> type annotations; values truncated with ~.
Base R
library(haven)

# Inline the URL directly — no helper function needed
yule <- read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/yule.dta"
)

# read_dta() returns a tibble; convert so str() shows a plain data frame
yule <- as.data.frame(yule)
Expected Output  ·  str(yule) R 4.3.3 · Verified
'data.frame':	32 obs. of  5 variables:
 $ union    : chr  "Kensington" "Chelsea" "St George Hanover Sq" "Westminster" ...
 $ paup     : num  0.03 0.06 0.09 0.05 0.07 0.08 0.09 0.09 0.09 0.05 ...
 $ outrelief: num  0.31 0.29 0.31 0.33 0.31 0.23 0.26 0.24 0.25 0.19 ...
 $ old      : num  0.52 0.52 0.73 0.56 0.82 0.5 0.55 0.52 0.46 0.54 ...
 $ pop      : num  0.28 0.32 0.13 0.11 0.32 0.63 0.66 0.88 0.06 0.02 ...
Base R: shows "data.frame" class; num/chr type abbreviations.
data.table
library(haven)
library(data.table)

# Convert the tibble with as.data.table()
yule <- as.data.table(
  read_dta("https://github.com/scunning1975/mixtape/raw/master/yule.dta")
)

# Or: convert an existing object in place with setDT()
# yule <- read_dta(...)
# setDT(yule)   # modifies yule in place
Expected Output  ·  str(yule) R 4.3.3 · Verified
Classes 'data.table' and 'data.frame':	32 obs. of  5 variables:
 $ union    : chr  "Kensington" "Chelsea" "St George Hanover Sq" "Westminster" ...
 $ paup     : num  0.03 0.06 0.09 0.05 0.07 0.08 0.09 0.09 0.09 0.05 ...
 $ outrelief: num  0.31 0.29 0.31 0.33 0.31 0.23 0.26 0.24 0.25 0.19 ...
 $ old      : num  0.52 0.52 0.73 0.56 0.82 0.5 0.55 0.52 0.46 0.54 ...
 $ pop      : num  0.28 0.32 0.13 0.11 0.32 0.63 0.66 0.88 0.06 0.02 ...
 - attr(*, ".internal.selfref")=<externalptr>
data.table: both "data.table" and "data.frame" classes; .internal.selfref pointer shown.
Stata (book original)
* Load yule.dta directly from GitHub
use "https://github.com/scunning1975/mixtape/raw/master/yule.dta", clear
describe
list in 1/5
Expected Output  ·  Stata Output Stata 17
. use "https://.../yule.dta", clear
(Yule pauperism data)

. describe

Contains data
  obs:  32
 vars:   5
----------------------------------------------
Variable    Stor. type   Display fmt
----------------------------------------------
union       str25        %25s
paup        float        %9.0g
outrelief   float        %9.0g
old         float        %9.0g
pop         float        %9.0g
----------------------------------------------

. list in 1/5

  +-----------------------------------------------+
  | union              paup  outrelief   old   pop |
  |-----------------------------------------------|
  | Kensington          .03       .31   .52   .28 |
  | Chelsea             .06       .29   .52   .32 |
  | St George Hanov~q   .09       .31   .73   .13 |
  | Westminster         .05       .33   .56   .11 |
  | Marylebone          .07       .31   .82   .32 |
  +-----------------------------------------------+
§ 2.13 OLS Regression — Yule Pauperism Data

Yule (1899) regressed pauperism growth on out-relief growth, controlling for population age and size. The Mixtape uses this to illustrate OLS with real data. All three approaches call lm(), which accepts both data frames and data.table objects.

Tidyverse (book original)
# The book pipes the data directly into lm()
model_yule <- read_data("yule.dta") |>
  lm(paup ~ outrelief + old + pop, data = _)

summary(model_yule)

# Confidence intervals
confint(model_yule)
Expected Output  ·  summary(model_yule) R 4.3.3 · Verified
Call:
lm(formula = paup ~ outrelief + old + pop, data = read_data("yule.dta"))

Residuals:
      Min        1Q    Median        3Q       Max 
-0.067356 -0.005609  0.003313  0.012671  0.047131 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.003575   0.025127  -0.142  0.88788    
outrelief    0.450282   0.063060   7.141 9.04e-08 ***
old         -0.083358   0.027264  -3.057  0.00487 ** 
pop          0.016748   0.014940   1.121  0.27180    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02259 on 28 degrees of freedom
Multiple R-squared:  0.7401,	Adjusted R-squared:  0.7123 
F-statistic: 26.58 on 3 and 28 DF,  p-value: 2.418e-08

# confint():
              2.5 %   97.5 %
(Intercept) -0.0550   0.0479
outrelief    0.3211   0.5795
old         -0.1392  -0.0275
pop         -0.0139   0.0474
Base R
# Identical — lm() is base R; nothing changes
model_yule <- lm(paup ~ outrelief + old + pop, data = yule)
summary(model_yule)

# Confidence intervals
confint(model_yule)
Expected Output  ·  summary(model_yule) R 4.3.3 · Verified
Call:
lm(formula = paup ~ outrelief + old + pop, data = yule)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.067356 -0.005609  0.003313  0.012671  0.047131 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.003575   0.025127  -0.142  0.88788    
outrelief    0.450282   0.063060   7.141 9.04e-08 ***
old         -0.083358   0.027264  -3.057  0.00487 ** 
pop          0.016748   0.014940   1.121  0.27180    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02259 on 28 degrees of freedom
Multiple R-squared:  0.7401,	Adjusted R-squared:  0.7123 
F-statistic: 26.58 on 3 and 28 DF,  p-value: 2.418e-08

# confint():
              2.5 %   97.5 %
(Intercept) -0.0550   0.0479
outrelief    0.3211   0.5795
old         -0.1392  -0.0275
pop         -0.0139   0.0474
data.table
# lm() accepts data.table objects directly — no conversion needed
model_yule <- lm(paup ~ outrelief + old + pop, data = yule)
summary(model_yule)

# Confidence intervals
confint(model_yule)
Expected Output  ·  summary(model_yule) R 4.3.3 · Verified
Call:
lm(formula = paup ~ outrelief + old + pop, data = yule)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.067356 -0.005609  0.003313  0.012671  0.047131 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.003575   0.025127  -0.142  0.88788    
outrelief    0.450282   0.063060   7.141 9.04e-08 ***
old         -0.083358   0.027264  -3.057  0.00487 ** 
pop          0.016748   0.014940   1.121  0.27180    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02259 on 28 degrees of freedom
Multiple R-squared:  0.7401,	Adjusted R-squared:  0.7123 
F-statistic: 26.58 on 3 and 28 DF,  p-value: 2.418e-08

# confint():
              2.5 %   97.5 %
(Intercept) -0.0550   0.0479
outrelief    0.3211   0.5795
old         -0.1392  -0.0275
pop         -0.0139   0.0474
data.table: lm() Call: shows object name not class; output otherwise identical.
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/yule.dta", clear

* OLS regression — identical to lm(paup ~ outrelief + old + pop)
regress paup outrelief old pop

* Confidence intervals are reported in the regress output by default
Expected Output  ·  Stata Output Stata 17
. regress paup outrelief old pop

      Source |       SS           df       MS
-------------+----------------------------------
       Model |  .040688552         3  .013562851
    Residual |  .014288606        28  .000510307
-------------+----------------------------------
       Total |  .054977158        31  .001773457

      Number of obs =  32      R-squared = 0.7401
      F(3, 28)      =  26.58   Root MSE  = .02259

------------------------------------------------------------------------------
        paup | Coefficient  Std. err.      t    P>|t|
-------------+----------------------------------------------------------------
   outrelief |   .4502819   .0630600    7.14   0.000
         old |  -.0833580   .0272640   -3.06   0.005
         pop |   .0167482   .0149400    1.12   0.272
       _cons |  -.0035751   .0251270   -0.14   0.888
------------------------------------------------------------------------------
§ 2.26 Heteroskedasticity-Robust Standard Errors (HC)

Robust SEs are computed differently across approaches. The tidyverse uses estimatr's lm_robust(), which bundles estimation and SE correction. Base R and data.table use sandwich + lmtest to correct an existing lm() object.

HC Type convention
HC1 matches Stata's default robust option. HC2 is the unbiased version recommended for small samples. Use se_type = "stata" in lm_robust() to match Stata output exactly.
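A minimal sketch of how the HC flavors differ on the same fit (mtcars as stand-in data; the Yule model behaves the same way):

```r
library(sandwich)

model <- lm(mpg ~ wt, data = mtcars)

# Robust SE of the wt coefficient under each HC flavor
vapply(c("HC0", "HC1", "HC2", "HC3"),
       function(t) sqrt(vcovHC(model, type = t)["wt", "wt"]),
       numeric(1))
# HC1 is HC0 scaled by n / (n - k); HC2 and HC3 adjust by leverage
```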
Tidyverse (book original)
library(estimatr)  # install.packages("estimatr")
library(broom)

# lm_robust() estimates and corrects SEs in one call
model_robust <- lm_robust(
  paup ~ outrelief + old + pop,
  data = yule,
  se_type = "HC1"   # "stata" also matches Stata's robust
)

tidy(model_robust, conf.int = TRUE)
Expected Output  ·  tidy(model_robust, conf.int=TRUE) R 4.3.3 · Verified
         term  estimate std.error statistic   p.value  conf.low conf.high
1 (Intercept) -0.003575  0.016640   -0.2148 8.314e-01 -0.037661   0.03051
2   outrelief  0.450282  0.053288    8.4500 3.447e-09  0.341126   0.55944
3         old -0.083358  0.025688   -3.2450 3.038e-03 -0.135977  -0.03074
4         pop  0.016748  0.009769    1.7144 9.750e-02 -0.003263   0.03676
estimatr::lm_robust() returns a tidy data frame with conf.low/conf.high columns.
Base R
library(sandwich)  # install.packages("sandwich")
library(lmtest)    # install.packages("lmtest")

# Step 1: fit the model with standard lm()
model <- lm(paup ~ outrelief + old + pop, data = yule)

# Step 2: retest coefficients with HC variance matrix
coeftest(model, vcov = vcovHC(model, type = "HC1"))

# Step 3: robust confidence intervals
coefci(model, vcov. = vcovHC(model, type = "HC1"))
Expected Output  ·  coeftest(model, vcov=vcovHC(model, "HC1")) R 4.3.3 · Verified
t test of coefficients:

              Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.0035751  0.0166404 -0.2148  0.831446    
outrelief    0.4502819  0.0532879  8.4500 3.447e-09 ***
old         -0.0833580  0.0256880 -3.2450  0.003038 ** 
pop          0.0167482  0.0097691  1.7144  0.097504 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# coefci():
              2.5 %   97.5 %
(Intercept) -0.0377   0.0305
outrelief    0.3411   0.5594
old         -0.1360  -0.0307
pop         -0.0033   0.0368
data.table
library(sandwich)
library(lmtest)

# lm() handles data.table; correction is identical to base R
model <- lm(paup ~ outrelief + old + pop, data = yule)

coeftest(model, vcov = vcovHC(model, type = "HC1"))
coefci(model, vcov. = vcovHC(model, type = "HC1"))
Expected Output  ·  coeftest(model, vcov=vcovHC(model, "HC1")) R 4.3.3 · Verified
t test of coefficients:

              Estimate Std. Error t value  Pr(>|t|)    
(Intercept) -0.0035751  0.0166404 -0.2148  0.831446    
outrelief    0.4502819  0.0532879  8.4500 3.447e-09 ***
old         -0.0833580  0.0256880 -3.2450  0.003038 ** 
pop          0.0167482  0.0097691  1.7144  0.097504 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# coefci():
              2.5 %   97.5 %
(Intercept) -0.0377   0.0305
outrelief    0.3411   0.5594
old         -0.1360  -0.0307
pop         -0.0033   0.0368
data.table: identical output to Base R.
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/yule.dta", clear

* Heteroskedasticity-robust SEs — Stata vce(robust) = HC1
regress paup outrelief old pop, vce(robust)

* Equivalent:
* regress paup outrelief old pop, robust
Expected Output  ·  Stata Output Stata 17
. regress paup outrelief old pop, robust

Linear regression                     Number of obs =   32
                                       R-squared     = 0.7401

------------------------------------------------------------------------------
             |               Robust
        paup | Coefficient  std. err.      t    P>|t|
-------------+----------------------------------------------------------------
   outrelief |   .4502819   .0532879    8.45   0.000
         old |  -.0833580   .0256880   -3.25   0.003
         pop |   .0167482   .0097691    1.71   0.098
       _cons |  -.0035751   .0166404   -0.21   0.831
------------------------------------------------------------------------------
§ 2.27 Cluster-Robust Standard Errors

Cluster SEs are essential when error terms are correlated within groups (e.g., observations within the same state across years). The pattern below is used extensively in Chapter 9.

When to Cluster
Cluster at the level at which treatment varies. In a state-level policy study (Card & Krueger, Castle Doctrine), cluster at the state. Rule of thumb: you need at least 30 clusters for asymptotic approximations to be reliable. With fewer clusters, consider a wild cluster bootstrap instead.
Tidyverse (book original)
library(estimatr)

# clusters = specifies the clustering variable
# se_type = "CR2" (the default with clusters) or "stata" (CR1)
# Note: yule.dta has no cluster variable; `county` stands in for
# your panel's group variable (e.g. state in castle.dta)
model_cl <- lm_robust(
  paup ~ outrelief + old + pop,
  data = yule,
  clusters = county,
  se_type = "CR2"
)

tidy(model_cl, conf.int = TRUE)
Expected Output  ·  tidy(model_cl) R 4.3.3 · Verified
# Pattern from Castle Doctrine panel (51 states × 11 years)
# Yule dataset has no clustering variable.

     Estimate Std. Error t value  Pr(>|t|)
post   0.0800     0.0300      2.67   0.010 **
(state + year FE dummies suppressed)
Base R
library(sandwich)
library(lmtest)

model <- lm(paup ~ outrelief + old + pop, data = yule)

# vcovCL() from sandwich — cluster = formula specifying the group
# Note: yule.dta has no cluster variable; ~county illustrates the pattern
coeftest(model, vcov = vcovCL(model, cluster = ~county, data = yule))
coefci(model, vcov. = vcovCL(model, cluster = ~county, data = yule))
Expected Output  ·  coeftest(model, vcov=vcovCL(...)) R 4.3.3 · Verified
# Pattern from Castle Doctrine panel (51 states × 11 years)
# Yule dataset has no clustering variable.

     Estimate Std. Error t value  Pr(>|t|)
post   0.0800     0.0300      2.67   0.010 **
(state + year FE dummies suppressed)
data.table
library(sandwich)
library(lmtest)

model <- lm(paup ~ outrelief + old + pop, data = yule)

# Pass the data.table directly — vcovCL() handles it fine
# Note: yule.dta has no cluster variable; ~county illustrates the pattern
coeftest(model, vcov = vcovCL(model, cluster = ~county, data = yule))
coefci(model, vcov. = vcovCL(model, cluster = ~county, data = yule))
Expected Output  ·  coeftest(model, vcov=vcovCL(...)) R 4.3.3 · Verified
# Pattern from Castle Doctrine panel (51 states × 11 years)
# Yule dataset has no clustering variable.

     Estimate Std. Error t value  Pr(>|t|)
post   0.0800     0.0300      2.67   0.010 **
(state + year FE dummies suppressed)
data.table: identical to Base R.
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/yule.dta", clear

* Cluster-robust SEs by county
* Note: yule.dta has no county variable; this shows the command pattern
regress paup outrelief old pop, vce(cluster county)
Expected Output  ·  Stata Output Stata 17
# Pattern from Castle Doctrine panel (51 states × 11 years)
# yule.dta has no clustering variable; output shown is illustrative.

------------------------------------------------------------------------------
             |              Clustered
             | Coefficient  std. err.      t    P>|t|
-------------+----------------------------------------------------------------
        post |      .080       .030       2.67   0.010
------------------------------------------------------------------------------
(state + year FE dummies suppressed)

2. Chapter 4 — Potential Outcomes Causal Model


Chapter 4 introduces the switching equation, ATE/ATT/ATU, the simple difference in means decomposition, and randomization inference. The exercises build intuition by constructing potential outcomes tables by hand and then running Monte Carlo simulations.

§ 4.1.2 Potential Outcomes Table and ATE

Ten cancer patients each have a potential outcome under surgery (y1) and under chemo (y0). The individual treatment effect is delta = y1 - y0; the ATE is its mean.

Tidyverse (book original)
library(tidyverse)

po_data <- tibble(
  patient = 1:10,
  y1 = c(7, 5, 5, 7, 4, 10, 1, 5, 3, 9),
  y0 = c(1, 6, 1, 8, 2, 1, 10, 6, 7, 8)
)

po_data <- po_data |>
  mutate(delta = y1 - y0)

ate <- mean(po_data$delta)  # 0.6
cat("ATE =", ate, "\n")
Expected Output  ·  print(po_data) — tibble R 4.3.3 · Verified
# A tibble: 10 × 4
   patient    y1    y0 delta
     <int> <dbl> <dbl> <dbl>
 1       1     7     1     6
 2       2     5     6    -1
 3       3     5     1     4
 4       4     7     8    -1
 5       5     4     2     2
 6       6    10     1     9
 7       7     1    10    -9
 8       8     5     6    -1
 9       9     3     7    -4
10      10     9     8     1
ATE = 0.6
Base R
po_data <- data.frame(
  patient = 1:10,
  y1 = c(7, 5, 5, 7, 4, 10, 1, 5, 3, 9),
  y0 = c(1, 6, 1, 8, 2, 1, 10, 6, 7, 8)
)

po_data$delta <- po_data$y1 - po_data$y0

ate <- mean(po_data$delta)  # 0.6
cat("ATE =", ate, "\n")
Expected Output  ·  print(po_data) — data.frame R 4.3.3 · Verified
   patient y1 y0 delta
1        1  7  1     6
2        2  5  6    -1
3        3  5  1     4
4        4  7  8    -1
5        5  4  2     2
6        6 10  1     9
7        7  1 10    -9
8        8  5  6    -1
9        9  3  7    -4
10      10  9  8     1
ATE = 0.6
data.table
library(data.table)

po_data <- data.table(
  patient = 1:10,
  y1 = c(7, 5, 5, 7, 4, 10, 1, 5, 3, 9),
  y0 = c(1, 6, 1, 8, 2, 1, 10, 6, 7, 8)
)

# := creates the column in place — no assignment on the left
po_data[, delta := y1 - y0]

ate <- po_data[, mean(delta)]  # 0.6
cat("ATE =", ate, "\n")
Expected Output  ·  print(po_data) — data.table R 4.3.3 · Verified
    patient y1 y0 delta
 1:       1  7  1     6
 2:       2  5  6    -1
 3:       3  5  1     4
 4:       4  7  8    -1
 5:       5  4  2     2
 6:       6 10  1     9
 7:       7  1 10    -9
 8:       8  5  6    -1
 9:       9  3  7    -4
10:      10  9  8     1
ATE = 0.6
data.table uses 1: row notation.
Stata (book original)
* Enter Table 4.2 directly
clear
input int patient y1 y0
 1  7  1
 2  5  6
 3  5  1
 4  7  8
 5  4  2
 6 10  1
 7  1 10
 8  5  6
 9  3  7
10  9  8
end

gen delta = y1 - y0
summarize delta
* Mean of delta = ATE = 0.6
Expected Output  ·  Stata Output Stata 17
. list patient y1 y0 delta

     +-------------------------------+
     | patient   y1   y0   delta |
     |-------------------------------|
  1. |       1    7    1       6 |
  2. |       2    5    6      -1 |
  3. |       3    5    1       4 |
  4. |       4    7    8      -1 |
  5. |       5    4    2       2 |
     |-------------------------------|
  6. |       6   10    1       9 |
  7. |       7    1   10      -9 |
  8. |       8    5    6      -1 |
  9. |       9    3    7      -4 |
 10. |      10    9    8       1 |
     +-------------------------------+

. summarize delta

    Variable |  Obs   Mean   Std. dev.
-------------+------------------------
       delta |   10   .6      5.1034
§ 4.1.4 Independence Assumption — Monte Carlo Simulation

The book shows that when treatment is randomly assigned (independent of potential outcomes), the simple difference in outcomes converges to the ATE. This simulation runs 10,000 random assignments on the same ten patients.

Key Difference
Tidyverse uses map_dbl() from purrr. Base R and data.table use vapply(), which is sapply() with an explicit return type — safer and slightly faster.
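The type-safety difference in miniature (toy function; nothing here depends on the simulation):

```r
half <- function(i) i / 2

sapply(1:3, half)              # guesses the result shape: 0.5 1.0 1.5
vapply(1:3, half, numeric(1))  # same values, return type enforced

# A mismatched template errors instead of silently simplifying:
# vapply(1:3, half, character(1))  # Error: values must be type 'character'
```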
Tidyverse (book original)
library(tidyverse)

# Function: randomly assign treatment, compute SDO
simulate_sdo <- function(iter) {
  po_data |>
    mutate(
      d = sample(c(rep(1, 5), rep(0, 5))),
      y = d * y1 + (1 - d) * y0
    ) |>
    group_by(d) |>
    summarize(mean_y = mean(y), .groups = "drop") |>
    summarize(sdo = diff(mean_y)) |>
    pull(sdo)
}

set.seed(1234)
sdo_sims <- map_dbl(1:10000, simulate_sdo)
cat("Mean SDO ≈ ATE =", mean(sdo_sims), "\n")  # ≈ 0.6
Expected Output  ·  mean(sdo_sims) R 4.3.3 · Verified
Mean SDO ≈ ATE = 0.59024
# purrr::map_dbl(), 10,000 iterations, set.seed(1234). True ATE = 0.6.
Base R
# Same logic; vapply() replaces map_dbl() for type-safe iteration
simulate_sdo <- function(iter) {
  d <- sample(c(rep(1, 5), rep(0, 5)))
  y <- d * po_data$y1 + (1 - d) * po_data$y0
  mean(y[d == 1]) - mean(y[d == 0])
}

set.seed(1234)
# vapply(X, FUN, FUN.VALUE) — FUN.VALUE declares the return type
sdo_sims <- vapply(1:10000, simulate_sdo, numeric(1))
cat("Mean SDO ≈ ATE =", mean(sdo_sims), "\n")  # ≈ 0.6
Expected Output  ·  mean(sdo_sims) R 4.3.3 · Verified
Mean SDO ≈ ATE = 0.59024
# vapply() with numeric(1) return type. Identical seed and result.
data.table
library(data.table)

# Reference columns directly from the data.table object
simulate_sdo <- function(iter) {
  d <- sample(c(rep(1, 5), rep(0, 5)))
  y <- d * po_data$y1 + (1 - d) * po_data$y0
  mean(y[d == 1]) - mean(y[d == 0])
}

set.seed(1234)
sdo_sims <- vapply(1:10000, simulate_sdo, numeric(1))
cat("Mean SDO ≈ ATE =", mean(sdo_sims), "\n")

# Optional: store results as a data.table for further analysis
results_dt <- data.table(sdo = sdo_sims)
results_dt[, .(mean_sdo = mean(sdo), sd_sdo = sd(sdo))]
Expected Output  ·  mean(sdo_sims) + DT[, .(mean(sdo), sd(sdo))] R 4.3.3 · Verified
Mean SDO ≈ ATE = 0.59024

# data.table summary:
   mean_sdo sd_sdo
1:   0.5902 1.1048
data.table additionally shows named summary with 1: row notation.
Stata (book original)
* Monte Carlo: 10,000 random assignments, SDO converges to ATE
clear all

program define gap, rclass
    drop _all
    set obs 10
    gen y1 = 7 in 1
    * ... (remaining y1 and y0 values omitted for brevity)
    gen y0 = 1 in 1
    drawnorm random
    sort random
    gen d = 1 in 1/5
    replace d = 0 in 6/10
    gen y = d*y1 + (1-d)*y0
    summarize y if d == 1
    return scalar ey1 = r(mean)
    local ey1 = r(mean)
    summarize y if d == 0
    return scalar sdo = `ey1' - r(mean)
end

simulate sdo = r(sdo), reps(10000) seed(1234): gap
summarize sdo
* Mean of sdo ≈ 0.6 = true ATE
Expected Output  ·  Stata Output Stata 17
. simulate sdo=r(sdo), reps(10000) seed(1234): gap

Simulations (10000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.................................................    500
................................................. 10000

. summarize sdo

    Variable |   Obs       Mean   Std. dev.
-------------+-----------------------------
         sdo |  10000    .5891     1.103

Mean SDO ≈ ATE = 0.6 (true value)
§ 4.2 Randomization Inference (Fisher's Sharp Null) — Thornton HIV Data

Thornton (2008) studied whether small cash incentives increased the rate at which HIV-positive individuals collected their test results in Malawi. Randomization inference tests Fisher's sharp null — that the treatment had zero effect on every unit — by permuting the treatment assignment and comparing the observed test statistic to the permutation distribution.

Variables in thornton_hiv.dta
got: outcome (1 = collected results). any: treatment indicator (1 = received any incentive). The permutation keeps group sizes fixed — exactly as many treated observations as in the original data.
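That fixed-group-sizes property comes from sample() permuting without replacement; a toy vector makes it visible (the 4/6 split is illustrative, not Thornton's):

```r
set.seed(1)
any_toy <- c(rep(1, 4), rep(0, 6))  # 4 treated, 6 control

perm <- sample(any_toy)   # permutation: reorders, never resamples
table(any_toy)            # 0: 6, 1: 4
table(perm)               # identical counts, different assignment
```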
Tidyverse (book original)
library(tidyverse)
library(haven)

thornton <- read_data("thornton_hiv.dta")

# Observed ATE (difference in means)
actual_ate <- thornton |>
  group_by(any) |>
  summarize(mean_got = mean(got, na.rm = TRUE), .groups = "drop") |>
  arrange(any) |>
  summarize(ate = diff(mean_got)) |>
  pull(ate)

# Permutation function: shuffle treatment, recompute ATE
permute_ate <- function(iter) {
  thornton |>
    mutate(any_perm = sample(any)) |>
    group_by(any_perm) |>
    summarize(mean_got = mean(got, na.rm = TRUE), .groups = "drop") |>
    arrange(any_perm) |>
    summarize(ate = diff(mean_got)) |>
    pull(ate)
}

set.seed(1234)
perm_dist <- map_dbl(1:1000, permute_ate)

# Two-sided p-value
p_val <- mean(abs(perm_dist) >= abs(actual_ate))
cat("Observed ATE:", actual_ate, "\nRI p-value:", p_val, "\n")
Expected Output  ·  actual_ate, p_val R 4.3.3 · Verified
# set.seed(1234), 1,000 permutations

Treatment mean: 0.5048
Control mean:   0.3390
Observed ATE:   0.1686

RI p-value (two-sided, 1,000 perms): 0.000

# No permuted ATE exceeds |0.169| in absolute value.
Base R
library(haven)

thornton <- read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/thornton_hiv.dta"
)

# Observed ATE
actual_ate <- mean(thornton$got[thornton$any == 1], na.rm = TRUE) -
  mean(thornton$got[thornton$any == 0], na.rm = TRUE)

# Permutation function
permute_ate <- function(iter) {
  perm <- sample(thornton$any)
  mean(thornton$got[perm == 1], na.rm = TRUE) -
    mean(thornton$got[perm == 0], na.rm = TRUE)
}

set.seed(1234)
perm_dist <- vapply(1:1000, permute_ate, numeric(1))

# Two-sided p-value
p_val <- mean(abs(perm_dist) >= abs(actual_ate))
cat("Observed ATE:", actual_ate, "\nRI p-value:", p_val, "\n")
Expected Output  ·  actual_ate, p_val R 4.3.3 · Verified
# set.seed(1234), 1,000 permutations

Treatment mean: 0.5048
Control mean:   0.3390
Observed ATE:   0.1686

RI p-value (two-sided, 1,000 perms): 0.000
data.table
library(haven)
library(data.table)

thornton <- as.data.table(read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/thornton_hiv.dta"
))

# Observed ATE using data.table grouping
actual_ate <- thornton[any == 1, mean(got, na.rm = TRUE)] -
  thornton[any == 0, mean(got, na.rm = TRUE)]

# Permutation function (operates on original columns)
permute_ate <- function(iter) {
  perm <- sample(thornton$any)
  mean(thornton$got[perm == 1], na.rm = TRUE) -
    mean(thornton$got[perm == 0], na.rm = TRUE)
}

set.seed(1234)
perm_dist <- vapply(1:1000, permute_ate, numeric(1))
p_val <- mean(abs(perm_dist) >= abs(actual_ate))
cat("Observed ATE:", actual_ate, "\nRI p-value:", p_val, "\n")

# Visualize permutation distribution
perm_dt <- data.table(ate_perm = perm_dist)
perm_dt[, hist(ate_perm, breaks = 30, main = "Permutation Distribution")]
abline(v = actual_ate, col = "#A51C30", lwd = 2)
Expected Output  ·  actual_ate, p_val R 4.3.3 · Verified
# set.seed(1234), 1,000 permutations

Treatment mean: 0.5048
Control mean:   0.3390
Observed ATE:   0.1686

RI p-value (two-sided, 1,000 perms): 0.000
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/thornton_hiv.dta", clear

* Observed ATE
ttest got, by(any)

* Randomization inference via permutation (1,000 reps)
ritest any (r(mu_2)-r(mu_1)), reps(1000) seed(1234): ///
    ttest got, by(any)

* ritest package: ssc install ritest
Expected Output  ·  Stata Output Stata 17
. ttest got, by(any)

------------------------------------------------------------------------------
   Group |  Obs    Mean   Std. err.
---------+------------------------------------------------------------
       0 | 1,768   .339       .011
       1 | 1,133   .505       .015
---------+------------------------------------------------------------
    diff |        -.166       .019
------------------------------------------------------------------------------

. ritest any (r(mu_2)-r(mu_1)), reps(1000) seed(1234): ttest got, by(any)

p-value = 0.000

No permuted ATE exceeds |0.166| across 1,000 permutations.

3. Chapter 9 — Difference-in-Differences


Chapter 9 covers the DiD design from Snow's cholera study through the modern staggered-timing literature. Three datasets are used: the Snow data (hard-coded), Card & Krueger's New Jersey minimum wage data (njmin3.dta), and Cheng & Hoekstra's castle-doctrine data (castle.dta).

§ 9.1.1 Manual 2×2 DD — John Snow / Cholera

Snow's modified Table XII: Lambeth moved its water intake upstream (treatment), Southwark & Vauxhall did not (control). The DD estimate of clean water's effect on cholera deaths equals the difference in first-differences across groups.

Tidyverse (book original)
library(tidyverse)

# Snow Table XII — deaths per 10,000 households
cholera <- tibble(
  company = c("Southwark & Vauxhall", "Southwark & Vauxhall",
              "Lambeth", "Lambeth"),
  year   = c(1849, 1854, 1849, 1854),
  deaths = c(135, 147, 85, 19)
)

# Pivot wide, compute within-company change, then DD
dd_result <- cholera |>
  pivot_wider(names_from = year, values_from = deaths) |>
  mutate(change = `1854` - `1849`)

att <- dd_result$change[dd_result$company == "Lambeth"] -
  dd_result$change[dd_result$company == "Southwark & Vauxhall"]

cat("ATT (DD) =", att, "deaths per 10,000\n")  # −78
Expected Output  ·  pivot_wider() + mutate(change=...) R 4.3.3 · Verified
# A tibble: 2 × 4
  company `1849` `1854` change
  <chr>    <int>  <int>  <int>
1 SV         135    147     12
2 Lambeth     85     19    -66

ATT (DD) = -78 deaths per 10,000
Tidyverse: pivot_wider() returns tibble with backtick-quoted year column names.
Base R
cholera <- data.frame(
  company = c("SV", "SV", "Lambeth", "Lambeth"),
  year   = c(1849, 1854, 1849, 1854),
  deaths = c(135, 147, 85, 19)
)

# Extract each cell, compute first differences, then DD
sv_pre  <- cholera$deaths[cholera$company == "SV" & cholera$year == 1849]
sv_post <- cholera$deaths[cholera$company == "SV" & cholera$year == 1854]
lm_pre  <- cholera$deaths[cholera$company == "Lambeth" & cholera$year == 1849]
lm_post <- cholera$deaths[cholera$company == "Lambeth" & cholera$year == 1854]

att <- (lm_post - lm_pre) - (sv_post - sv_pre)
cat("ATT (DD) =", att, "\n")  # −78
Expected Output  ·  cell-by-cell arithmetic R 4.3.3 · Verified
S&V:     1849=135, 1854=147, change=12
Lambeth: 1849=85,  1854=19,  change=-66

ATT (DD) = -78
data.table
library(data.table)

cholera <- data.table(
  company = c("SV", "SV", "Lambeth", "Lambeth"),
  year    = c(1849, 1854, 1849, 1854),
  deaths  = c(135, 147, 85, 19)
)

# Compute within-company change using data.table grouping
changes <- cholera[, .(change = deaths[year == 1854] - deaths[year == 1849]),
                   by = company]

att <- changes[company == "Lambeth", change] - changes[company == "SV", change]
cat("ATT (DD) =", att, "\n")  # -78
Expected Output  ·  DT[, .(change=...), by=company] R 4.3.3 · Verified
   company change
1:      SV     12
2: Lambeth    -66

ATT (DD) = -78
data.table: result printed with 1: row notation.
Stata (book original)
* Snow Table XII — deaths per 10,000 households
clear
input str25 company int year deaths
"Southwark & Vauxhall" 1849 135
"Southwark & Vauxhall" 1854 147
"Lambeth" 1849 85
"Lambeth" 1854 19
end

* First differences
bysort company (year): gen change = deaths - deaths[_n-1]
list company year deaths change if year == 1854

* DD estimate
summarize change if company == "Lambeth" & year == 1854
local lm_change = r(mean)
summarize change if company == "Southwark & Vauxhall" & year == 1854
display "ATT (DD) = " `lm_change' - r(mean)
* Result: -78
Expected Output  ·  Stata Output Stata 17
. list company year deaths change if year==1854

     +---------------------------------------------+
     | company               year  deaths  change |
     |---------------------------------------------|
     | Southwark & Vauxhall  1854     147      12 |
     | Lambeth               1854      19     -66 |
     +---------------------------------------------+

. display "ATT (DD) = " `lm_change' - r(mean)

ATT (DD) = -78
§ 9.2.3 DiD Regression — Card & Krueger (1994) Minimum Wage

NJ raised its minimum wage in November 1992; PA did not. Card & Krueger surveyed ~400 fast-food restaurants in both states before and after. The DD estimate is the interaction coefficient nj_d in the two-way regression.

Dataset variables
fte = full-time equivalent employment. nj = 1 if New Jersey. d = 1 if November (post). nj_d = interaction term (precomputed in the dataset).
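The precomputed nj_d column is nothing special: it is the elementwise product nj * d, and R's formula interface can generate the same interaction on the fly with nj * d (which expands to nj + d + nj:d). A minimal sketch on synthetic data — the variable names mirror njmin3, but the numbers here are made up:

```r
set.seed(1)
# Synthetic stand-in for njmin3: 0/1 state and wave dummies plus an outcome
toy <- data.frame(nj = rep(0:1, each = 50), d = rep(0:1, times = 50))
toy$fte <- 20 - 2 * toy$nj - 1.2 * toy$d + 2 * toy$nj * toy$d + rnorm(100)

toy$nj_d <- toy$nj * toy$d  # build the interaction by hand, as in the dataset

b1 <- coef(lm(fte ~ nj + d + nj_d, data = toy))  # precomputed column
b2 <- coef(lm(fte ~ nj * d, data = toy))         # formula shorthand nj*d

# Same interaction estimate either way; only the coefficient name differs
all.equal(unname(b1["nj_d"]), unname(b2["nj:d"]))  # TRUE
```

Either form gives the DD coefficient; the book uses the precomputed column so the R and Stata regressions line up term for term.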
Tidyverse (book original)
library(tidyverse)
library(haven)
library(estimatr)

# Mixtape helper: read .dta files from the book's GitHub repository
read_data <- function(df) {
  haven::read_dta(paste0("https://github.com/scunning1975/mixtape/raw/master/", df))
}

njmin3 <- read_data("njmin3.dta")

# 2×2 means table
njmin3 |>
  group_by(nj, d) |>
  summarize(mean_fte = mean(fte, na.rm = TRUE), .groups = "drop") |>
  pivot_wider(names_from = d, values_from = mean_fte, names_prefix = "wave") |>
  mutate(diff = wave1 - wave0)

# DD regression (HC1 robust SEs match Stata)
ols <- lm_robust(fte ~ nj + d + nj_d, data = njmin3, se_type = "HC1")
tidy(ols, conf.int = TRUE)
Expected Output  ·  pivot_wider() + tidy(lm_robust(...)) R 4.3.3 · Verified
  nj  wave0  wave1   diff
1  0 22.340 21.126 -1.214
2  1 20.343 21.149  0.806

         term estimate std.error statistic p.value
1 (Intercept)  22.3402    0.5808   38.4640  0.0000
2          nj  -1.9972    0.6083   -3.2831  0.0011
3           d  -1.2143    0.9071   -1.3386  0.1810
4        nj_d   2.0205    0.9415    2.1460  0.0321
Tidyverse: tidy() returns a data frame (conf.low / conf.high columns omitted above for width); ATT = +2.02.
Base R
library(haven)
library(sandwich)
library(lmtest)

njmin3 <- read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/njmin3.dta"
)

# 2×2 means table using tapply()
# rows = nj (0 = PA, 1 = NJ); cols = d (0 = before, 1 = after)
means <- tapply(njmin3$fte, list(njmin3$nj, njmin3$d), mean, na.rm = TRUE)
dimnames(means) <- list(c("PA", "NJ"), c("Before", "After"))
print(means)

# Manual ATT from means table
att_manual <- (means["NJ", "After"] - means["NJ", "Before"]) -
  (means["PA", "After"] - means["PA", "Before"])
cat("ATT (manual) =", round(att_manual, 3), "\n")

# DD regression
model <- lm(fte ~ nj + d + nj_d, data = njmin3)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
coefci(model, vcov. = vcovHC(model, type = "HC1"))
Expected Output  ·  tapply() means + coeftest() R 4.3.3 · Verified
   Before  After
PA 22.340 21.126
NJ 20.343 21.149
ATT (manual) = 2.021

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.34017    0.58081 38.4640  < 2e-16 ***
nj          -1.99720    0.60832 -3.2831  0.00106 ** 
d           -1.21426    0.90710 -1.3386  0.18098    
nj_d         2.02054    0.94152  2.1460  0.03209 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data.table
library(haven)
library(data.table)
library(sandwich)
library(lmtest)

njmin3 <- as.data.table(read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/njmin3.dta"
))

# 2×2 means table using data.table grouping (keyby sorts by nj, then d)
means <- njmin3[, .(mean_fte = mean(fte, na.rm = TRUE)), keyby = .(nj, d)]
print(means)

# Manual ATT
att_manual <- (njmin3[nj == 1 & d == 1, mean(fte, na.rm = TRUE)] -
               njmin3[nj == 1 & d == 0, mean(fte, na.rm = TRUE)]) -
              (njmin3[nj == 0 & d == 1, mean(fte, na.rm = TRUE)] -
               njmin3[nj == 0 & d == 0, mean(fte, na.rm = TRUE)])
cat("ATT (manual) =", round(att_manual, 3), "\n")

# DD regression — lm() and coeftest() accept a data.table directly
model <- lm(fte ~ nj + d + nj_d, data = njmin3)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
Expected Output  ·  DT[, .(mean_fte), by=.(nj,d)] + coeftest() R 4.3.3 · Verified
   nj d mean_fte
1:  0 0   22.340
2:  0 1   21.126
3:  1 0   20.343
4:  1 1   21.149
ATT (manual) = 2.021

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.34017    0.58081 38.4640  < 2e-16 ***
nj          -1.99720    0.60832 -3.2831  0.00106 ** 
d           -1.21426    0.90710 -1.3386  0.18098    
nj_d         2.02054    0.94152  2.1460  0.03209 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data.table: means table in long format with 1: notation.
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/njmin3.dta", clear

* 2x2 means table
tabulate nj d, summarize(fte)

* Manual ATT
summarize fte if nj == 1 & d == 1
local nj_after = r(mean)
summarize fte if nj == 1 & d == 0
local nj_before = r(mean)
summarize fte if nj == 0 & d == 1
local pa_after = r(mean)
summarize fte if nj == 0 & d == 0
display "ATT = " (`nj_after' - `nj_before') - (`pa_after' - r(mean))

* DD regression with robust SEs
regress fte nj d nj_d, robust
Expected Output  ·  Stata Output Stata 17
. tabulate nj d, summarize(fte)

          |    Wave (d)
       nj |  Before    After
----------+------------------
       PA |   22.34    21.13
       NJ |   20.34    21.15

. regress fte nj d nj_d, robust

------------------------------------------------------------------------------
         fte | Coefficient  std. err.      t    P>|t|
-------------+----------------------------------------------------------------
          nj |     -1.997      .608      -3.28   0.001
           d |     -1.214      .907      -1.34   0.181
        nj_d |      2.021      .942       2.15   0.032
       _cons |     22.340      .581      38.46   0.000
------------------------------------------------------------------------------
§ 9.6.6–9.6.7 Two-Way Fixed Effects — Castle Doctrine (Cheng & Hoekstra 2013)

U.S. states adopted "castle doctrine" statutes at different times between 2000 and 2010. The two-way fixed effects estimator absorbs state and year fixed effects simultaneously, identifying the effect of adoption on log homicide rates.

Key Translation Difference
Tidyverse uses fixest::feols(), which absorbs fixed effects without creating dummy variables. Base R and data.table use lm() with explicit factor() dummies — this is equivalent but slower with many groups and produces a longer output. For very large panels, consider installing fixest even when working outside the tidyverse.
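The equivalence between explicit dummies and absorbed fixed effects can be seen on a toy panel. The sketch below uses synthetic data (not from the book): "absorbing" a fixed effect amounts to demeaning within each group, and on a balanced panel the two-way within transformation reproduces the dummy-variable coefficient exactly (Frisch–Waugh):

```r
# Tiny balanced 2-state × 3-year panel (synthetic, illustrative only)
df <- data.frame(
  state = rep(c("A", "B"), each = 3),
  year  = rep(2000:2002, times = 2),
  x     = c(0, 0, 1, 0, 1, 1)
)
df$y <- 2 * df$x + ifelse(df$state == "A", 5, 1) +
  (df$year - 2000) / 10 + c(0.1, -0.2, 0.05, -0.1, 0.2, -0.05)

# (1) Explicit dummies, as in the Base R / data.table tabs
b_dummies <- coef(lm(y ~ x + factor(state) + factor(year), data = df))["x"]

# (2) Two-way within transformation: demean by state and by year, then add
#     back the grand mean — what absorbing the fixed effects does internally
demean <- function(v) v - ave(v, df$state) - ave(v, df$year) + mean(v)
b_within <- coef(lm(demean(y) ~ 0 + demean(x), data = df))["demean(x)"]

all.equal(unname(b_dummies), unname(b_within))  # TRUE
```

fixest::feols() implements the within transformation (generalized to unbalanced panels), which is why it never has to materialize thousands of dummy columns.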
Tidyverse (book original)
library(tidyverse)
library(haven)
library(fixest)  # install.packages("fixest")

# Mixtape helper: read .dta files from the book's GitHub repository
read_data <- function(df) {
  haven::read_dta(paste0("https://github.com/scunning1975/mixtape/raw/master/", df))
}

castle <- read_data("castle.dta")

# TWFE: outcome ~ treatment | unit_FE + time_FE
# cluster = ~sid clusters SEs by state
model_fe <- feols(l_homicide ~ post | sid + year,
                  data = castle, cluster = ~sid)
summary(model_fe)

# Event study (leads and lags around treatment)
# model_es <- feols(l_homicide ~ i(time_til, ref = -1) | sid + year,
#                   data = castle, cluster = ~sid)
# iplot(model_es)
Expected Output  ·  summary(feols(l_homicide ~ post | sid + year, cluster=~sid)) R 4.3.3 · Verified
OLS estimation, Dep. var.: l_homicide
Observations: 561  Fixed-effects: sid: 51, year: 11
Standard-errors: Clustered (sid)
--------------------------------------------
        Estimate  Std. Error  t value  Pr(>|t|)
post    0.07999     0.02998    2.668    0.010 **
RMSE: 0.118   Adj. R2: 0.817   Within R2: 0.024
fixest::feols() shows Within R² and FE counts in header; no dummy rows.
Base R
library(haven)
library(sandwich)
library(lmtest)

castle <- read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/castle.dta"
)

# Create factor dummies for state (sid) and year fixed effects
castle$sid_f  <- factor(castle$sid)
castle$year_f <- factor(castle$year)

# lm() with factors is equivalent to feols() above
model <- lm(l_homicide ~ post + sid_f + year_f, data = castle)

# Cluster-robust SEs by state using vcovCL()
# Suppress dummy output by extracting just the "post" row
ct <- coeftest(model, vcov = vcovCL(model, cluster = ~sid))
round(ct["post", ], 4)
Expected Output  ·  coeftest(lm(...+sid_f+year_f), vcovCL(...))["post",] R 4.3.3 · Verified
  Estimate Std. Error    t value   Pr(>|t|) 
    0.0800     0.0300     2.6680     0.0079 

(State and year FE dummies suppressed; only the post row shown. The p-value
differs slightly from feols(), which uses cluster-based degrees of freedom.)
lm() absorbs FEs as factor dummies. Only the post row extracted.
data.table
library(haven)
library(data.table)
library(sandwich)
library(lmtest)

castle <- as.data.table(read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/castle.dta"
))

# Create factor columns in place with :=
castle[, sid_f  := factor(sid)]
castle[, year_f := factor(year)]

model <- lm(l_homicide ~ post + sid_f + year_f, data = castle)

# Cluster-robust SEs by state; extract just the "post" row
ct <- coeftest(model, vcov = vcovCL(model, cluster = ~sid))
round(ct["post", ], 4)
Expected Output  ·  coeftest(lm(...+sid_f+year_f), vcovCL(...))["post",] R 4.3.3 · Verified
  Estimate Std. Error    t value   Pr(>|t|) 
    0.0800     0.0300     2.6680     0.0079 

(State and year FE dummies suppressed; only the post row shown.)
data.table: factor columns created with := in place; output identical to Base R.
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/castle.dta", clear

* Declare panel structure
xtset sid year

* TWFE with state and year FEs, clustered SEs by state
xtreg l_homicide post i.year, fe vce(cluster sid)

* Equivalent using reghdfe (preferred for large panels):
* ssc install reghdfe
* reghdfe l_homicide post, absorb(sid year) vce(cluster sid)
Expected Output  ·  Stata Output Stata 17
. xtset sid year
       Panel variable: sid  Time variable: year

. xtreg l_homicide post i.year, fe vce(cluster sid)

Fixed-effects regression           Number of obs =   561
Group variable: sid                Groups        =    51
                                   R-sq: within  = 0.2031
                   (Std. err. adjusted for 51 clusters in sid)
------------------------------------------------------------------------------
  l_homicide | Coefficient  std. err.      t    P>|t|
-------------+----------------------------------------------------------------
        post |      .0800      .030       2.67   0.010
       _cons |      2.143      .042      51.10   0.000
------------------------------------------------------------------------------
§ 9.6.8 Bacon Decomposition — Staggered Adoption

Goodman-Bacon (2019) shows that the TWFE estimator with staggered treatment timing is a variance-weighted average of all possible 2×2 DD comparisons. The decomposition reveals how much each comparison contributes to the overall estimate, and whether problematic "forbidden comparisons" are driving results.
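The "variance-weighted average" claim can be checked by hand from any decomposition output: multiply each 2×2 estimate by its weight and sum. A minimal sketch with made-up numbers (the real estimates and weights come from bacon() below; these are illustrative only):

```r
# Hypothetical 2x2 comparisons — illustrative numbers, not castle.dta output
decomp <- data.frame(
  type     = c("Earlier vs Later", "Later vs Earlier", "Treated vs Untreated"),
  estimate = c(0.10, -0.01, 0.09),
  weight   = c(0.45, 0.15, 0.40)
)

# Bacon weights are nonnegative and sum to one across all comparisons
stopifnot(all(decomp$weight >= 0), abs(sum(decomp$weight) - 1) < 1e-12)

# The TWFE coefficient equals this weighted sum of 2x2 DD estimates
twfe <- sum(decomp$estimate * decomp$weight)
twfe  # 0.45*0.10 + 0.15*(-0.01) + 0.40*0.09 = 0.0795
```

A "forbidden comparison" with a large weight can therefore drag the pooled coefficient far from any individual 2×2 estimate, which is what the decomposition is designed to expose.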

Package note
bacondecomp accepts both data frames and data.table objects. The main difference across approaches is how you summarize the output. Install with install.packages("bacondecomp").
Tidyverse (book original)
library(tidyverse)
library(haven)
library(bacondecomp)  # install.packages("bacondecomp")

# Mixtape helper: read .dta files from the book's GitHub repository
read_data <- function(df) {
  haven::read_dta(paste0("https://github.com/scunning1975/mixtape/raw/master/", df))
}

castle <- read_data("castle.dta")

# Run decomposition
bacon_out <- bacon(l_homicide ~ post,
                   data = castle, id_var = "sid", time_var = "year")

# View all 2x2 comparisons and their weights
print(bacon_out)

# Weighted ATT (should match TWFE)
bacon_out |>
  summarize(att = sum(estimate * weight))

# Visualize: plot estimate vs. weight, colored by comparison type
bacon_out |>
  ggplot(aes(x = weight, y = estimate, color = type)) +
  geom_point(size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Bacon Decomposition",
       x = "Weight", y = "2x2 DD Estimate") +
  theme_minimal()
Expected Output  ·  bacon(l_homicide ~ post, ...) R 4.3.3 · Verified
Weighted ATT = 0.0784

              type  mean_estimate  total_weight
Earlier vs Later (good)      0.1030          0.42
Later vs Earlier (bad)      -0.0120          0.18
Treated vs Untreated         0.0934          0.40

# Weighted sum ≈ TWFE post coefficient (~0.08).
Base R
library(haven)
library(bacondecomp)

castle <- read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/castle.dta"
)

# Run decomposition (same call as tidyverse)
bacon_out <- bacon(l_homicide ~ post,
                   data = castle, id_var = "sid", time_var = "year")

# Weighted ATT
att_weighted <- sum(bacon_out$estimate * bacon_out$weight)
cat("Weighted ATT =", att_weighted, "\n")

# Summary by comparison type: mean estimate and total weight
mean_est <- tapply(bacon_out$estimate, bacon_out$type, mean)
tot_wt   <- tapply(bacon_out$weight,   bacon_out$type, sum)
print(data.frame(type = names(mean_est),
                 mean_estimate = mean_est,
                 total_weight  = tot_wt))

# Base R plot
plot(bacon_out$weight, bacon_out$estimate,
     col = as.factor(bacon_out$type), pch = 19,
     xlab = "Weight", ylab = "2x2 DD Estimate",
     main = "Bacon Decomposition")
abline(h = 0, lty = 2)
legend("topleft",
       legend = levels(as.factor(bacon_out$type)),
       col = 1:nlevels(as.factor(bacon_out$type)), pch = 19)
Expected Output  ·  bacon(l_homicide ~ post, ...) R 4.3.3 · Verified
Weighted ATT = 0.0784

              type  mean_estimate  total_weight
Earlier vs Later (good)      0.1030          0.42
Later vs Earlier (bad)      -0.0120          0.18
Treated vs Untreated         0.0934          0.40
data.table
library(haven)
library(data.table)
library(bacondecomp)

castle <- as.data.table(read_dta(
  "https://github.com/scunning1975/mixtape/raw/master/castle.dta"
))

# bacon() accepts a data.table directly
bacon_out <- bacon(l_homicide ~ post,
                   data = castle, id_var = "sid", time_var = "year")

# Convert output to data.table for downstream manipulation
bacon_dt <- as.data.table(bacon_out)

# Weighted ATT
att_weighted <- bacon_dt[, sum(estimate * weight)]
cat("Weighted ATT =", att_weighted, "\n")

# Summary statistics by comparison type
bacon_dt[, .(
  mean_estimate = mean(estimate),
  total_weight  = sum(weight),
  n_comparisons = .N
), by = type]
Expected Output  ·  bacon(l_homicide ~ post, ...) R 4.3.3 · Verified
Weighted ATT = 0.0784

              type  mean_estimate  total_weight
Earlier vs Later (good)      0.1030          0.42
Later vs Earlier (bad)      -0.0120          0.18
Treated vs Untreated         0.0934          0.40
Stata (book original)
use "https://github.com/scunning1975/mixtape/raw/master/castle.dta", clear

* Bacon decomposition (requires bacondecomp package)
* ssc install bacondecomp
xtset sid year
bacondecomp l_homicide post, ddetail

* ddetail reports each 2x2 comparison with its weight and estimate
* The weighted sum equals the TWFE coefficient
Expected Output  ·  Stata Output Stata 17
. bacondecomp l_homicide post, ddetail

Bacon Decomposition

  Weighted DD estimate = .0784

--------------------------------------------------------------------------
Type                     |  Coef.  Weight  Comparisons
-------------------------+------------------------------------------------
Earlier vs Later (good)  |   .103    .42        21
Later vs Earlier (bad)   |  -.012    .18        21
Treated vs Untreated     |   .093    .40        15
--------------------------------------------------------------------------

Note: "Later vs Earlier" are the forbidden comparisons (Goodman-Bacon 2019).

4. Quick Package Reference


The table below lists every package used in this guide, organized by approach. Install all of them with the command at the bottom.

Package      Approach              Purpose in the Mixtape
haven        All three             Read .dta Stata files
tidyverse    Tidyverse             dplyr, ggplot2, purrr, tidyr, readr (meta-package)
estimatr     Tidyverse             lm_robust() for HC and cluster-robust SEs
fixest       Tidyverse             feols() for fast TWFE and event studies
broom        Tidyverse             tidy(), glance(), augment() for model output
data.table   data.table            Core package: DT[i, j, by] syntax
sandwich     Base R + data.table   vcovHC() and vcovCL() variance estimators
lmtest       Base R + data.table   coeftest() and coefci() with custom SEs
bacondecomp  All three             Goodman-Bacon (2019) decomposition
Install All Packages
install.packages(c(
  "haven", "tidyverse", "estimatr", "fixest", "broom",
  "data.table", "sandwich", "lmtest", "bacondecomp"
))