Installation, Package Management & Libraries · macOS and Windows
1. Overview
R is a statistical computing language and environment, widely used in data analysis, visualization, and reproducible research. RStudio (distributed by Posit) is the most widely adopted integrated development environment (IDE) for R: it provides a code editor, console, environment viewer, and plot panel in a single interface.
You install R first, then RStudio. RStudio detects your R installation automatically and does not work without it.
Before You Begin:
You will need an internet connection and administrator access on your machine. The total installation typically takes under 10 minutes.
Packages: community libraries that extend R's functionality, installed from within R or RStudio (see Section 4).
2. Installing R
R is distributed through the Comprehensive R Archive Network (CRAN). Always install the latest stable release unless a project requires a specific version.
macOS Installation
Go to cran.r-project.org and click Download R for macOS.
Select the correct installer for your chip. Choose the Apple Silicon package (arm64) for M1, M2, M3, or M4 Macs. Choose the Intel package for Intel-based Macs. You can verify via Apple menu > About This Mac.
Download the .pkg file and open it. Follow the installer prompts and accept the default installation location (/Library/Frameworks/R.framework).
Verify the installation by opening RStudio (see Section 3). The Console pane at the bottom left displays a message beginning with R version 4.x.x as soon as RStudio launches. If you see this message, R is installed correctly; no further verification is needed.
Note on Xcode Tools:
A small number of specialist packages require additional compilation tools to install. If you encounter an error message mentioning "no developer tools" or "xcrun" when installing a package later, contact your instructor or IT support. Most users will never need to address this during an introductory course.
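If you prefer to resolve this yourself, the standard remedy on macOS is installing Apple's Command Line Tools from the Terminal. This opens a system dialog and may take several minutes:

```shell
# Install Apple's Command Line Tools (the compilers R uses
# when building packages from source on macOS)
xcode-select --install
```
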
Windows Installation
Go to cran.r-project.org and click Download R for Windows, then base.
Download the .exe installer (e.g., R-4.x.x-win.exe).
Run the installer. Accept the license and keep the default install path (C:\Program Files\R\R-4.x.x). Note that R 4.2 and later ship as 64-bit only; the default component selection is appropriate for almost all users.
Verify the installation by opening RStudio (see Section 3). The Console pane at the bottom left displays a message beginning with R version 4.x.x as soon as RStudio launches. If you see this message, the installation succeeded. If RStudio opens but the Console shows an error, see the callout below.
Rtools (Recommended):
Windows users who need to compile packages from source should also install Rtools, available at the same CRAN Windows page. Match the Rtools version to your installed R version. Most introductory users will not need this immediately.
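One quick way to check whether build tools are already available is from the R console, using the pkgbuild helper package (a small utility package on CRAN; installing it is an extra step not required by the course):

```r
# Install the helper if you do not have it, then check for build tools.
# Returns TRUE when compilation tools (Rtools on Windows,
# Command Line Tools on macOS) are available to R.
install.packages("pkgbuild")
pkgbuild::has_build_tools()
```
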
3. Installing RStudio
RStudio Desktop is the free, open-source edition suitable for individual use on your local machine. It is maintained by Posit, the company behind the tidyverse ecosystem.
macOS Installation
Visit posit.co/download/rstudio-desktop and click Download RStudio Desktop. The page auto-detects your operating system.
Open the downloaded .dmg file and drag the RStudio icon into your Applications folder.
Launch RStudio from Applications. On first open, macOS may prompt you to confirm opening an app downloaded from the internet. Click Open.
RStudio detects your R installation automatically. The Console pane will display your R version on startup, confirming the connection.
Windows Installation
Visit posit.co/download/rstudio-desktop and download the Windows .exe installer.
Run the installer with default settings. RStudio installs to C:\Program Files\RStudio by default.
Launch RStudio from the Start menu or Desktop shortcut. The Console pane should display your R version, confirming that RStudio found the R installation.
If RStudio Cannot Find R:
Open RStudio, navigate to Tools > Global Options > General, and manually set the R version path to the folder where R was installed.
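After restarting RStudio, you can confirm from the Console which R installation it attached to:

```r
# Run in the RStudio Console
R.version.string  # the R build RStudio is using
.libPaths()       # where that R installation keeps its packages
```
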
4. Installing Packages
R packages extend the base language with functions, datasets, and tools. The primary source is CRAN, which hosts over 20,000 packages. Packages are installed once and stored in a local library on your machine.
Installing from CRAN
The easiest way to install a package is through RStudio's built-in point-and-click interface. In the Packages tab (bottom-right pane), click Install, type the package name, and click Install again. RStudio handles the rest.
Using the Packages Tab:
In RStudio, go to the Packages pane (bottom right) and click Install. Type the package name in the dialog box, make sure Install dependencies is checked, and click Install. You only need to do this once per package on your machine.
You can also type the install command directly into the RStudio Console pane (bottom left) and press Enter. Both approaches do exactly the same thing.
Using install.packages()
# Install a single package
install.packages("ggplot2")

# Install multiple packages at once
install.packages(c("dplyr", "tidyr", "readr"))

# Install the full tidyverse meta-package
install.packages("tidyverse")
Installing from GitHub
Development versions of packages, or packages not yet on CRAN, can be installed from GitHub using the pak package.
Using pak (Recommended)
# Install pak first if you do not have it
install.packages("pak")

# Install any GitHub package by user/repo
pak::pkg_install("tidyverse/ggplot2")

# pak also handles CRAN packages and is faster
pak::pkg_install("dplyr")
Why pak:
pak resolves dependencies in parallel, produces clearer error messages, and is the modern recommended approach for package installation as of 2023 onward.
Updating and Removing Packages
To update packages using the menu, go to the Packages tab in RStudio and click Update. A list of packages with available updates will appear; check the ones you want and click Install Updates. You can also use the Console commands below for the same effect.
Package Maintenance
# Check which installed packages have updates available
old.packages()

# Update all outdated packages at once
update.packages(ask = FALSE)

# Remove a package
remove.packages("packagename")

# List all currently installed packages
installed.packages()[, "Package"]
Commonly Used Packages by Category
Package
Category
Purpose
ggplot2
Visualization
Grammar of graphics plotting system
dplyr
Data Wrangling
Data frame manipulation and transformation
tidyr
Data Wrangling
Reshaping and tidying data
data.table
Data Wrangling
High-performance data manipulation for large datasets; faster than dplyr on big files
readr
Import
Fast reading of flat files (CSV, TSV)
readxl
Import
Reading Excel files
lubridate
Date/Time
Intuitive date and time handling
stringr
Strings
Consistent string manipulation functions
purrr
Functional
Functional programming tools and iteration
knitr
Reporting
Dynamic report generation
rmarkdown
Reporting
R Markdown documents and notebooks
5. Loading Libraries
Installing a package makes it available on disk. To use it in a session, you must load it into memory with library(). This call goes at the top of every script or R Markdown file that needs the package.
Install Once, Load Every Session:
install.packages() is run once per machine (or when updating). library() is called at the start of each new R session or script.
The library() Function
Loading Packages into a Session
# Load a single package
library(ggplot2)

# Typical script header: load all dependencies upfront
library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)

# Load without printing startup messages
suppressPackageStartupMessages(library(tidyverse))
Using Functions Without Loading
If you only need one or two functions from a package, call them directly using the :: operator. This avoids attaching the whole package to the search path and makes dependencies explicit in the code.
The :: Operator
# Call a function directly without loading the library
dplyr::filter(my_data, value > 10)
readr::read_csv("data/file.csv")

# Useful when two packages have functions with the same name
stats::filter(x, rep(1/3, 3))  # base R filter, not dplyr::filter
Checking if a Package is Installed
Portable Script Header Pattern
This pattern installs any missing packages automatically when a collaborator runs your script for the first time.
# Define required packages
packages_needed <- c("dplyr", "ggplot2", "readr")
# Install any that are missing
new_packages <- packages_needed[
  !(packages_needed %in% installed.packages()[, "Package"])
]
if (length(new_packages)) install.packages(new_packages)

# Load all
invisible(lapply(packages_needed, library, character.only = TRUE))
Where Libraries Are Stored
Library Paths
# See where R looks for installed packages
.libPaths()

# Example output on macOS:
# [1] "/Library/Frameworks/R.framework/Versions/4.4/Resources/library"

# Example output on Windows:
# [1] "C:/Users/YourName/AppData/Local/R/win-library/4.4"
# [2] "C:/Program Files/R/R-4.4.0/library"
6. R File Types & Helper Files
RStudio supports several distinct file types for writing R code. Each serves a different purpose: some are designed for clean, executable scripts; others weave prose and code together for reporting; others are built for interactive exploration. Understanding which to use, and when, is one of the most practical decisions you will make when setting up a project.
R Script (.R)
An R script is a plain text file containing only R code and comments. It is the simplest and most portable file type: any R installation can run it, and it has no dependencies beyond base R. Scripts are the right choice for data processing pipelines, reusable functions, and any code that should be sourced by other files.
Anatomy of an R Script
# ── script_name.R ────────────────────────────────────────────
# Purpose: Clean and reshape the enrollment dataset
# Author: Your Name
# Updated: 2026-03-23

# 1. Load dependencies ────────────────────────────────────────
library(dplyr)
library(readr)
# 2. Read data ────────────────────────────────────────────────
raw <- readr::read_csv("data/enrollment_raw.csv")
# 3. Clean ────────────────────────────────────────────────────
clean <- raw |>
dplyr::filter(!is.na(id)) |>
dplyr::mutate(year = as.integer(year))
When to Use a Script:
Use .R scripts for data cleaning pipelines, simulation code, helper function definitions, and any file you plan to source() from another file. Scripts run from top to bottom with no markup overhead, which makes them fast and predictable.
R Markdown (.Rmd)
R Markdown files combine prose (written in Markdown) with executable code chunks. When rendered, the file produces a self-contained document in a format of your choice: HTML, PDF, Word, or slides. R Markdown is the standard format for reproducible reports, homework submissions, and any analysis where you need to explain your reasoning alongside the code and output.
Anatomy of an R Markdown File
---
title: "Weekly Analysis"
author: "Your Name"
date: "2026-03-23"
output: html_document
---

## Introduction

This report summarizes enrollment trends for Spring 2026.

```{r setup, include=FALSE}
library(dplyr)
library(ggplot2)
```

```{r plot, echo=FALSE}
ggplot(data, aes(x = week, y = count)) +
  geom_line()
```
Render the document by clicking Knit in RStudio, or by running rmarkdown::render("file.Rmd") in the console.
When to Use R Markdown:
Use .Rmd when the final deliverable is a document: a report, a homework assignment, a methods appendix, or a slide deck. Because the file re-runs all code on render, every figure and table in the output is guaranteed to reflect the current data and code.
R Notebook (.Rmd with notebook output)
An R Notebook is technically an R Markdown file with output: html_notebook set in its YAML header. The key distinction is execution behavior: in a standard .Rmd, all chunks run together when you knit; in a Notebook, each chunk runs independently and its output appears inline immediately below the chunk. This makes Notebooks well-suited for exploratory analysis where you want to inspect results step by step without re-running the entire document.
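A minimal Notebook header differs from a report header only in its output line:

```yaml
---
title: "Exploration Notebook"
output: html_notebook
---
```
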
Save the file in RStudio and a .nb.html preview file is generated automatically alongside it. This preview can be opened in any browser without R installed.
When to Use a Notebook:
Use Notebooks during active exploration: checking data distributions, testing model specifications, or iterating on visualizations. Switch to a standard .Rmd when you are ready to produce a final, fully reproducible document from scratch.
Comparison: Choosing the Right File Type
File Type
Extension
Best For
Output on Run
R Script
.R
Pipelines, functions, sourced utilities
Objects in environment; no document
R Markdown
.Rmd
Reproducible reports, final deliverables
HTML, PDF, Word, or slides on Knit
R Notebook
.Rmd (notebook)
Interactive exploration, iterative work
Inline chunk output; .nb.html preview
Quarto
.qmd
Modern replacement for R Markdown; also supports Python and Julia
HTML, PDF, Word, slides, websites
Helper Files
As a project grows, it becomes useful to separate reusable code into dedicated helper files rather than repeating it across scripts and documents. Helper files are plain .R scripts that contain only function definitions and constants; they carry no side effects and produce no output when sourced.
Creating and Sourcing a Helper File
helpers.R
# ── helpers.R ─────────────────────────────────────────────────
# Reusable utility functions for the project.
# Source this file at the top of any script or .Rmd that needs it.

# Compute percentage change between two values
pct_change <- function(baseline, followup) {
(followup - baseline) / baseline * 100
}
# Standardize a numeric vector to mean 0, sd 1
standardize <- function(x) {
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
# A project-wide ggplot2 theme
theme_project <- ggplot2::theme_minimal() +
ggplot2::theme(
text = ggplot2::element_text(size = 11),
plot.title = ggplot2::element_text(face = "bold")
)
Sourcing the Helper File
# In any script or .Rmd chunk, load helpers with source()
source("helpers.R")

# Or use a path relative to the project root with here::here()
source(here::here("R", "helpers.R"))

# Functions are now available in the session
pct_change(100, 115)  # returns 15
standardize(my_vector)
Recommended Project File Structure
A common convention is to keep all helper files in an R/ subfolder within the project directory. This mirrors the structure used in R packages and makes it easy to source multiple helpers at once.
Create a new project in RStudio via File > New Project. The .Rproj file sets the working directory to the project root automatically whenever you open it, which means all relative file paths work consistently regardless of where the project folder lives on your machine. This is the single most important habit for reproducible work.
Sourcing All Helper Files at Once
Batch Source Pattern
# Source every .R file in the R/ folder
invisible(
lapply(list.files("R", pattern = "\\.R$", full.names = TRUE), source)
)
External Cheat Sheets
Official and community reference cards. Each preview is embedded below; use the Open PDF link to view full-screen or download.
The three cheat sheets below cover the most frequently used functions and patterns for everyday data work in R. Each is organized by task rather than alphabetically so you can scan quickly while working. Use the tabs to switch between packages.
Load the Tidyverse:
library(tidyverse) loads ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats in one call. Alternatively, load only the packages you need.
The Pipe Operator
|> (Native Pipe, R 4.1+)
# The pipe passes the left-hand result into the first argument of the right-hand function
data |> filter(year == 2024) |> select(id, outcome) |> head()
# Equivalent without the pipe (harder to read):
head(select(filter(data, year == 2024), id, outcome))
tidyr: Reshaping Data
Function
What It Does
Example
separate()
Split one column into multiple columns
separate(df, date, into = c("year","month","day"), sep = "-")
unite()
Combine multiple columns into one
unite(df, "full_name", first, last, sep = " ")
drop_na()
Remove rows with NA in specified columns
drop_na(df, score, outcome)
fill()
Fill NA values downward or upward within a column
fill(df, group, .direction = "down")
readr: Reading & Writing Files
Function
What It Does
Example
read_csv()
Read a comma-separated file into a tibble
read_csv("data/file.csv")
read_tsv()
Read a tab-separated file
read_tsv("data/file.tsv")
read_delim()
Read any delimiter; specify with delim = "|"
read_delim("file.txt", delim = "|")
write_csv()
Write a data frame or tibble to CSV
write_csv(df, "output/results.csv")
read_rds() / write_rds()
Read/write R's native binary format; preserves data types exactly
write_rds(model, "output/model.rds")
stringr: String Operations
Function
What It Does
Example
str_detect()
Returns TRUE if pattern is found
filter(df, str_detect(name, "^A"))
str_replace()
Replace first match of a pattern
str_replace(x, "\\.", ",")
str_replace_all()
Replace all matches
str_replace_all(x, " ", "_")
str_trim()
Strip leading/trailing whitespace
mutate(df, name = str_trim(name))
str_to_lower() / str_to_upper()
Change case
str_to_lower(df$country)
str_glue()
String interpolation using {variable} syntax
str_glue("Subject {id}: score = {score}")
str_sub()
Extract substring by position
str_sub(x, 1, 4)
lubridate: Dates & Times
Function
What It Does
Example
ymd(), mdy(), dmy()
Parse date strings in various orders
ymd("2024-03-15")
year(), month(), day()
Extract date components
mutate(df, yr = year(date))
floor_date()
Round date down to unit (week, month, quarter)
floor_date(date, "month")
interval() / as.period()
Compute time between two dates
interval(start, end) / years(1)
today() / now()
Current date or datetime
mutate(df, age_days = today() - dob)
Load data.table:
library(data.table). Convert an existing data frame with setDT(df) (modifies in place) or as.data.table(df) (returns a copy). Read files directly into a data.table with fread().
The Core Syntax: DT[i, j, by]
Every data.table operation fits into a single bracket expression: i filters rows, j selects or computes columns, and by groups the result. Leaving any slot empty means "do nothing for that step."
DT[i, j, by] Pattern
# Think of it as: "Take DT, subset rows with i, compute j, grouped by by"

# Filter rows (i)
DT[age > 18]
DT[country == "US" & !is.na(score)]
# Select / compute columns (j)
DT[, .(id, score)] # select columns
DT[, .(mean_score = mean(score))] # aggregate
DT[, score_log := log(score)]     # add/overwrite column in place

# Group by (by)
DT[, .(mean_score = mean(score)), by = country]
DT[, .(n = .N), by = .(country, year)]
# Combine all three
DT[year >= 2020, .(mean = mean(score)), by = country]
Special Symbols
Symbol
Meaning
Example
.N
Number of rows (in the current group)
DT[, .N, by = country]
:=
Assign a column by reference (no copy made)
DT[, z := x + y]
.()
Shorthand for list() in j and by
DT[, .(a, b), by = .(grp)]
.SD
Subset of Data: the current group's data as a data.table
DT[, lapply(.SD, mean), by = grp]
.SDcols
Restrict .SD to specific columns
DT[, lapply(.SD, mean), by = grp, .SDcols = c("x","y")]
.GRP
Integer index of the current group
DT[, grp_id := .GRP, by = country]
.I
Row indices of the current group
DT[, .I[score == max(score)], by = country]
Reading, Writing & Converting
Function
What It Does
Example
fread()
Read CSV/TSV fast; auto-detects delimiter and column types
fread("data/large_file.csv")
fwrite()
Write to CSV extremely fast
fwrite(DT, "output/results.csv")
setDT()
Convert a data frame to data.table in place (no copy)
setDT(my_df)
as.data.table()
Return a new data.table copy
DT <- as.data.table(my_df)
as.data.frame()
Convert back to a standard data frame
as.data.frame(DT)
Keys, Sorting & Merging
Function
What It Does
Example
setkey()
Sort table and index by one or more columns for fast lookups
setkey(DT, id, year)
setkeyv()
Same as setkey() but accepts a character vector of names
setkeyv(DT, c("id", "year"))
merge()
SQL-style merge; works like base R but faster with keyed tables
merge(DT1, DT2, by = "id", all.x = TRUE)
setorder()
Sort a data.table in place by columns
setorder(DT, -year, country)
setnames()
Rename columns in place
setnames(DT, "old_name", "new_name")
Useful Operations
Task
data.table Syntax
Add multiple columns at once
DT[, c("a","b") := .(x+1, y*2)]
Delete a column
DT[, col_to_drop := NULL]
Filter and count
DT[score > 80, .N]
Cumulative sum by group
DT[, cum_score := cumsum(score), by = id]
Lag/lead a column
DT[, lag_score := shift(score, 1), by = id]
Row-wise between filter
DT[between(score, 50, 80)]
Chain operations
DT[year > 2020][, .N, by = country][order(-N)]
Cross-join / expand grid
CJ(x = 1:3, y = c("a","b"))
In-Place Modification:
Unlike dplyr, data.table modifies objects in place by default when using := or set*() functions. This avoids copying large datasets and is why data.table is faster for big files. Be aware that assigning DT2 <- DT does not create an independent copy; use DT2 <- copy(DT) if you need a true duplicate.
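A minimal sketch of this behavior (column names here are illustrative):

```r
library(data.table)

DT  <- data.table(x = 1:3)
DT2 <- DT               # NOT an independent copy: both names point to one table
DT2[, y := x * 2]       # := modifies in place, so DT gains column y too
"y" %in% names(DT)      # TRUE

DT3 <- copy(DT)         # a true, independent duplicate
DT3[, z := 0]
"z" %in% names(DT)      # FALSE: DT is unaffected
```
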
The Grammar of Graphics:
Every ggplot2 plot is built by layering components: a data source, aesthetic mappings (which variables map to x, y, color, etc.), one or more geoms (the visual marks), optional scales and facets, and a theme. Layers are added with +.
Plot Template
Core ggplot2 Structure
ggplot(data = df, aes(x = var1, y = var2)) +
geom_point() +
scale_x_log10() +
facet_wrap(~ group) +
labs(title = "My Title", x = "X Label", y = "Y Label") +
theme_minimal()
Aesthetic Mappings: aes()
Aesthetic
Controls
Example
x, y
Position on axes
aes(x = year, y = gdp)
color
Point/line color (border for polygons)
aes(color = region)
fill
Fill color of bars, areas, polygons
aes(fill = treatment)
size
Point size or line width
aes(size = population)
shape
Point shape (circle, triangle, etc.)
aes(shape = group)
alpha
Transparency (0 = invisible, 1 = opaque)
aes(alpha = density)
linetype
Solid, dashed, dotted lines
aes(linetype = model)
label
Text labels for geom_text() / geom_label()
aes(label = country)
group
Group data without visual encoding (for lines)
aes(group = subject_id)
Common Geoms
Geom
Best For
Key Arguments
geom_point()
Scatterplots; relationships between two continuous variables
size, alpha, shape
geom_line()
Time series; trends over an ordered variable
linewidth, linetype
geom_col()
Bar charts with pre-computed heights
fill, position = "dodge"
geom_bar()
Bar charts where R counts rows automatically
stat = "count" (default)
geom_histogram()
Distribution of a single continuous variable
bins, binwidth
geom_density()
Smoothed distribution curve
fill, alpha, adjust
geom_boxplot()
Distribution summary: median, IQR, outliers
outlier.shape, notch
geom_violin()
Distribution shape across groups
draw_quantiles = c(0.25, 0.5, 0.75)
geom_smooth()
Fitted trend line with optional confidence band
method = "lm" or "loess", se = FALSE
geom_text()
Text labels on data points
aes(label = name), size, hjust
geom_label()
Text labels with a background box
Same as geom_text()
geom_tile()
Heatmaps; fill encodes a third variable
aes(fill = value)
geom_ribbon()
Shaded area between ymin and ymax (e.g., confidence intervals)
aes(ymin = lo, ymax = hi), alpha
geom_vline() / geom_hline()
Reference lines
xintercept or yintercept, linetype
Scales
Scales control how data values map to visual properties. The naming convention is scale_{aesthetic}_{type}().
Function
What It Does
Example
scale_fill_gradient()
Two-color continuous gradient for the fill aesthetic
scale_fill_gradient(low = "white", high = "#A51C30")
scale_size_area()
Make area (not radius) proportional to value
+ scale_size_area(max_size = 12)
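A few concrete instances of the scale_{aesthetic}_{type}() naming convention (p here stands for any existing ggplot object):

```r
# Each scale function name encodes the aesthetic it controls and its type
p + scale_x_log10()                          # position scale, log10 transform
p + scale_color_brewer(palette = "Set2")     # discrete color palette
p + scale_y_continuous(limits = c(0, 100))   # continuous y axis with limits
```
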
Facets
Faceting: Small Multiples
# One variable: wrap panels automatically into rows/columns
facet_wrap(~ region)
facet_wrap(~ region, ncol = 3, scales = "free_y")
# Two variables: explicit row and column assignment
facet_grid(treatment ~ year)
facet_grid(rows = vars(treatment), cols = vars(year))
Labels & Annotations
Function
What It Controls
Example
labs()
Title, subtitle, caption, axis labels, legend title
labs(title = "...", x = "...", color = "Region")
annotate()
Add a single text or shape annotation at fixed coordinates
annotate("text", x = 2020, y = 50, label = "Policy change")
coord_flip()
Swap x and y axes (e.g., horizontal bars)
+ coord_flip()
coord_cartesian()
Zoom in without dropping data (unlike xlim())
coord_cartesian(ylim = c(0, 100))
Themes
Theme
Look
theme_minimal()
Clean white background; no border; light grid lines. Good default.
theme_bw()
White background with gray grid and black border.
theme_classic()
White background; x and y axes only; no grid. Good for publication.
theme_gray()
Gray background (ggplot2 default).
theme_void()
Completely blank canvas; useful for maps and custom layouts.
Customizing a Theme with theme()
# Override specific theme elements after choosing a base theme
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 10),
legend.position = "bottom",
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = "#1a1a1a"),
strip.text = element_text(color = "white", face = "bold")
)
Saving Plots:
Use ggsave("output/plot.png", width = 8, height = 5, dpi = 300) immediately after your plot code to save the most recently printed plot. Specify plot = my_plot to save a named object. Supported formats include .png, .pdf, .svg, and .tiff.
No Library Needed:
All functions in this section are available in every R session without loading any package. Base R is always present; it is the foundation on which all packages are built.
Getting Help
Function
What It Does
Example
?
Open the help page for a function
?mean
help()
Same as ?
help("lm")
help.search() / ??
Search help pages by keyword
??regression
example()
Run the examples from a help page
example(mean)
args()
Show the arguments of a function
args(lm)
vignette()
Open a package vignette (tutorial document)
vignette("dplyr")
Understanding Your Data
Function
What It Does
Example
str()
Compact display of object structure and types
str(df)
head() / tail()
First or last n rows (default 6)
head(df, 10)
dim()
Number of rows and columns
dim(df)
nrow() / ncol()
Number of rows or columns individually
nrow(df)
names() / colnames()
Column names
names(df)
class()
Object class (e.g., data.frame, numeric, factor)
class(df$age)
typeof()
Low-level storage type (integer, double, character)
typeof(df$id)
summary()
Summary statistics for each column
summary(df)
table()
Frequency counts; also cross-tabulations
table(df$country)
View()
Open a spreadsheet-style viewer in RStudio
View(df)
Data Types & Coercion
Function
What It Does
Example
as.numeric()
Convert to number
as.numeric("3.14")
as.integer()
Convert to whole number
as.integer(3.9) returns 3
as.character()
Convert to text
as.character(42)
as.logical()
Convert to TRUE/FALSE
as.logical(0) returns FALSE
as.factor()
Convert to categorical factor
as.factor(df$group)
as.Date()
Convert a string to a Date object
as.Date("2024-03-15")
is.na()
Test for missing values; returns logical vector
sum(is.na(df$score))
is.numeric(), is.character()
Test type of an object
is.numeric(df$age)
Vectors & Sequences
Creating & Working with Vectors
# Create vectors
x <- c(1, 2, 3, 4, 5)
words <- c("a", "b", "c")
# Sequences
1:10                         # integers 1 to 10
seq(0, 1, by = 0.1)          # 0.0 0.1 0.2 ... 1.0
seq(0, 100, length.out = 5)  # exactly 5 evenly spaced values
rep(0, times = 5)            # 0 0 0 0 0
rep(c("A", "B"), each = 3)   # A A A B B B

# Indexing (1-based)
x[2] # second element
x[x > 3] # elements greater than 3
x[c(1, 3)] # first and third elements
x[-1] # everything except the first
Numeric & Summary Functions
Function
What It Does
Example
sum()
Sum of all values
sum(x, na.rm = TRUE)
mean()
Arithmetic mean
mean(x, na.rm = TRUE)
median()
Median value
median(x, na.rm = TRUE)
sd() / var()
Standard deviation / variance
sd(x, na.rm = TRUE)
min() / max()
Smallest or largest value
max(x, na.rm = TRUE)
range()
Returns c(min, max)
range(x)
quantile()
Percentiles
quantile(x, probs = c(0.25, 0.75))
cumsum() / cumprod()
Cumulative sum or product
cumsum(x)
diff()
Lagged differences
diff(x)
abs()
Absolute value
abs(-5)
round() / ceiling() / floor()
Rounding
round(3.567, 2) returns 3.57
log() / log10() / exp()
Logarithms and exponentiation
log(x) (natural log)
sqrt()
Square root
sqrt(16)
Data Frames
Creating & Subsetting Data Frames
# Create a data frame
df <- data.frame(
id = 1:3,
name = c("Alice", "Bob", "Carol"),
score = c(85, 92, 78)
)
# Access a column (three equivalent ways)
df$score
df[, "score"]
df[, 3]
# Subset rows
df[df$score > 80, ] # rows where score > 80
df[1:2, ]            # first two rows

# Subset rows and columns together
df[df$score > 80, c("id", "name")]
# Add a new column
df$grade <- ifelse(df$score >= 90, "A", "B")
# Remove a column
df$grade <- NULL
Logic & Control Flow
Expression / Function
What It Does
Example
==, !=, <, >, <=, >=
Comparison operators
x == 5
&, |, !
Element-wise AND, OR, NOT
x > 2 & x < 8
&&, ||
Scalar AND / OR (for single TRUE/FALSE values)
if (a > 0 && b > 0)
%in%
Test membership in a set
x %in% c(1, 3, 5)
ifelse()
Vectorised if-else
ifelse(score >= 60, "pass", "fail")
if () {} else {}
Standard conditional (single value)
if (n > 0) { ... } else { ... }
for (i in x) {}
Loop over elements of a vector
for (i in 1:10) { print(i) }
while () {}
Loop while a condition is TRUE
while (x < 100) { x <- x * 2 }
next / break
Skip to next iteration or exit a loop
if (is.na(x)) next
Writing Functions
Function Syntax
# Basic function
add <- function(x, y) {
x + y # last evaluated expression is returned
}
add(3, 4)  # 7

# Default argument values
greet <- function(name, greeting = "Hello") {
paste(greeting, name)
}
greet("Alice") # "Hello Alice"greet("Bob", "Hi") # "Hi Bob"# Explicit return (use when returning early)
safe_log <- function(x) {
if (x <= 0) return(NA)
log(x)
}
Apply Functions
The apply family lets you perform the same operation across rows, columns, or list elements without writing an explicit loop.
Function
What It Does
Example
apply()
Apply a function over rows (1) or columns (2) of a matrix or data frame
apply(df, 2, mean): column means
lapply()
Apply a function to each element of a list; returns a list
lapply(my_list, summary)
sapply()
Like lapply() but simplifies the result to a vector or matrix if possible
sapply(df, class)
tapply()
Apply a function to subgroups defined by a factor
tapply(df$score, df$group, mean)
Map()
Apply a function to corresponding elements of multiple lists
Map("+", list_a, list_b)
Reduce()
Cumulatively apply a function across a list (fold)
Reduce("+", list(1, 2, 3)) returns 6
String Functions
Function
What It Does
Example
paste()
Concatenate strings with a separator (default: space)
paste("ID", 1:3)
Statistics & Distributions
Function
What It Does
Example
glm()
Generalized linear model (logistic, Poisson, etc.)
glm(y ~ x, family = binomial, data = df)
t.test()
One- or two-sample t-test
t.test(score ~ group, data = df)
chisq.test()
Chi-squared test of independence
chisq.test(table(df$a, df$b))
rnorm() / runif()
Draw random samples from normal or uniform distributions
rnorm(100, mean = 0, sd = 1)
set.seed()
Set the random number seed for reproducibility
set.seed(42)
dnorm() / pnorm() / qnorm()
Density, CDF, and quantile of the normal distribution
pnorm(1.96) returns 0.975
How to Read This Table:
Each row shows the same operation written three ways. All three produce equivalent results. data.table is fastest and most memory-efficient for large datasets: ideal for administrative records, claims data, and large cohort files. Tidyverse reads closest to plain English and is widely used in teaching materials and Stack Overflow answers. Base R requires no packages and works in any environment.
Two operations underpin almost every multi-source analysis: joining tables on a shared key, and reshaping a table between wide and long formats. This section covers both in depth, with worked examples across Base R, dplyr, and data.table, and a practical guide to diagnosing problems before and after a join.
Join Types
A join combines rows from two tables based on matching values in one or more key columns. The choice of join type determines which rows appear in the result when the keys do not match perfectly.
Join Type
Rows Kept
Typical Use
Left join
All rows from the left table; matched data from the right where available; NA where no match
Adding characteristics from a lookup table while keeping every row in the primary dataset
Inner join
Only rows with a match in both tables
Restricting analysis to observations with complete data across both sources
Full join
All rows from both tables; NA wherever a match is missing on either side
Auditing two datasets for overlap and discrepancies
Anti join
Rows in the left table with no match in the right
Finding records that failed a linkage, or identifying controls not in the treatment file
Semi join
Rows in the left table that have a match in the right, without adding any right-table columns
Filtering a dataset to only those IDs that appear in a second file, without duplicating columns
Check Your Keys Before Joining:
Always verify that your key column is unique in at least one of the two tables before joining. A many-to-many join (duplicate keys on both sides) silently multiplies rows and is almost never intended. Use anyDuplicated(df$id) or df |> count(id) |> filter(n > 1) to check.
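For reference alongside the dplyr and data.table sections that follow, the same join types can be written in Base R with merge() and %in%. A sketch using two small hypothetical tables:

```r
patients  <- data.frame(patient_id = c(1, 2, 3), age = c(40, 55, 62))
insurance <- data.frame(patient_id = c(1, 3), plan = c("Private", "Medicaid"))

# Left join: all patients, NA plan where unmatched
merge(patients, insurance, by = "patient_id", all.x = TRUE)

# Inner join (merge default): only matched rows
merge(patients, insurance, by = "patient_id")

# Full join: all rows from both sides
merge(patients, insurance, by = "patient_id", all = TRUE)

# Anti join: patients with no insurance record
patients[!(patients$patient_id %in% insurance$patient_id), ]

# Semi join: patients that do have a record, without adding columns
patients[patients$patient_id %in% insurance$patient_id, ]
```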
Joining with dplyr
dplyr Join Functions
library(dplyr)
# Left join: keep all rows in df_patients, add columns from df_insurance
df_joined <- left_join(df_patients, df_insurance, by = "patient_id")
# Inner join: only patients who appear in both tables
df_matched <- inner_join(df_patients, df_labs, by = "patient_id")
# Join on columns with different names in each table
df_joined <- left_join(df_claims, df_providers,
by = c("provider_npi" = "npi"))
# Join on multiple keys (patient + visit date must both match)
df_joined <- left_join(df_vitals, df_meds,
by = c("patient_id", "visit_date"))
# Anti join: patients in df_enrolled with no matching lab result
df_missing_labs <- anti_join(df_enrolled, df_labs, by = "patient_id")
# Semi join: filter df_patients to only those with a pharmacy claim
df_with_rx <- semi_join(df_patients, df_pharmacy, by = "patient_id")
Joining with data.table
data.table joins use the bracket syntax with an on argument, or the merge() function. Setting keys first with setkey() speeds up repeated joins on large tables.
data.table Join Syntax
library(data.table)
setDT(df_patients); setDT(df_insurance)
# Left join using bracket syntax (X[Y] is a right join; Y[X] gives left)
df_joined <- df_insurance[df_patients, on = "patient_id"]
# Left join using merge() - syntax matches base R
df_joined <- merge(df_patients, df_insurance,
by = "patient_id", all.x = TRUE)
# Inner join
df_matched <- merge(df_patients, df_labs, by = "patient_id")
# Full join
df_full <- merge(df_patients, df_labs,
by = "patient_id", all = TRUE)
# Anti join: rows in patients not matched in labs
df_missing <- df_patients[!df_labs, on = "patient_id"]
# Join on columns with different names
df_joined <- merge(df_claims, df_providers,
by.x = "provider_npi", by.y = "npi", all.x = TRUE)
# Set key for fast repeated lookups
setkey(df_insurance, patient_id)
df_joined <- df_insurance[df_patients, on = "patient_id"]
Diagnosing Join Problems
The most common join errors are silent: the operation succeeds but the row count or column values are wrong. These checks catch the most frequent problems before they propagate downstream.
Pre-Join Checks
# 1. Check for duplicate keys on the right (lookup) table
# If n > 1 for any id, a left join will expand your row count unexpectedly
df_insurance |>
count(patient_id) |>
filter(n > 1)
# 2. Check that key columns have the same type in both tables
# Joining character "001" to integer 1 silently produces zero matches
class(df_patients$patient_id)
class(df_insurance$patient_id)
# 3. Check for NA values in the key column
sum(is.na(df_patients$patient_id))
sum(is.na(df_insurance$patient_id))
# 4. Preview key overlap between the two tables
n_left <- n_distinct(df_patients$patient_id)
n_right <- n_distinct(df_insurance$patient_id)
n_shared <- n_distinct(intersect(df_patients$patient_id,
df_insurance$patient_id))
cat(sprintf("Left: %d Right: %d Shared: %d\n", n_left, n_right, n_shared))
Post-Join Checks
# After a left join, row count should equal the left table exactly
stopifnot(nrow(df_joined) == nrow(df_patients))
# Count how many left-table rows had no match (NAs in a right-table column)
sum(is.na(df_joined$insurance_type))
# Full audit: left-only, matched, and right-only rows
df_audit <- full_join(
df_patients |> mutate(in_patients = TRUE),
df_insurance |> mutate(in_insurance = TRUE),
by = "patient_id"
)
table(left_only = is.na(df_audit$in_insurance),
right_only = is.na(df_audit$in_patients))
Reshaping: Wide and Long Formats
Data arrives in two common shapes. Wide format has one row per subject and one column per time point or variable. Long format has one row per observation, with a column identifying which variable or time point each row represents. Most R modelling and plotting functions expect long format; most data entry and reporting tools produce wide format.
Format
Shape
When You Have It
When You Need It
Wide
Many columns, fewer rows
Survey exports, lab panels with one column per test, repeated-measures spreadsheets
Reporting tables, cross-tabulations, some time-series packages
Long
Fewer columns, many rows
Electronic health records, claims files, relational databases
ggplot2, most modelling functions, grouped summaries
Wide to Long: pivot_longer()
pivot_longer() — tidyr
library(tidyr)
# Wide: one row per patient, columns week1 through week4
# patient_id | week1 | week2 | week3 | week4
# 001        |    82 |    85 |    80 |    88
df_long <- df_wide |>
pivot_longer(
cols = starts_with("week"), # columns to stack
names_to = "week", # new column holding the old column names
values_to = "sbp"          # new column holding the values
)
# Result: patient_id | week  | sbp
#         001        | week1 | 82
#         001        | week2 | 85 ...
# Strip the "week" prefix to leave just a number, and coerce to integer
df_long <- df_wide |>
pivot_longer(
cols = starts_with("week"),
names_to = "week",
names_prefix = "week",
names_transform = list(week = as.integer),
values_to = "sbp"
)
# Stack two value types at once (sbp and dbp both measured weekly)
df_long <- df_wide |>
pivot_longer(
cols = matches("^(sbp|dbp)_week"),
names_to = c(".value", "week"), # .value routes sbp and dbp to separate columns
names_sep = "_week"
)
Long to Wide: pivot_wider()
pivot_wider() — tidyr
# Long: one row per patient-week
# patient_id | week  | sbp
# 001        | week1 | 82
df_wide <- df_long |>
pivot_wider(
names_from = week, # column whose values become new column names
values_from = sbp # column whose values fill those new columns
)
# Spread two value columns simultaneously
df_wide <- df_long |>
pivot_wider(
names_from = week,
values_from = c(sbp, dbp) # creates sbp_week1, dbp_week1, sbp_week2 ...
)
# Duplicate keys cause list-columns: summarise first, then widen
df_wide <- df_long |>
group_by(patient_id, week) |>
summarise(sbp = mean(sbp, na.rm = TRUE), .groups = "drop") |>
pivot_wider(names_from = week, values_from = sbp)
Reshaping with data.table: melt() and dcast()
melt() and dcast() — data.table
library(data.table)
# Wide to long: melt()
DT_long <- melt(DT_wide,
id.vars = "patient_id",
measure.vars = c("week1", "week2", "week3", "week4"),
variable.name = "week",
value.name = "sbp"
)
# Melt two value types simultaneously using patterns()
DT_long <- melt(DT_wide,
measure.vars = patterns("^sbp", "^dbp"),
variable.name = "week",
value.name = c("sbp", "dbp")
)
# Long to wide: dcast()
# Formula: rows ~ columns; value.var is the column to spread
DT_wide <- dcast(DT_long,
patient_id ~ week,
value.var = "sbp"
)
# Aggregate while casting (mean sbp per patient per week)
DT_wide <- dcast(DT_long,
patient_id ~ week,
value.var = "sbp",
fun.aggregate = mean,
na.rm = TRUE
)
Binding Rows and Columns
Binding stacks or places tables side by side without matching on a key. Row binding requires matching column names; column binding requires matching row counts.
Row and Column Binding
# Stack two tables with the same columns (e.g., two annual extracts)
# dplyr fills missing columns with NA rather than throwing an error
df_combined <- dplyr::bind_rows(df_2023, df_2024)
# Stack a list of many tables at once, adding a source label column
list_of_dfs <- list(df_2021, df_2022, df_2023, df_2024)
df_combined <- dplyr::bind_rows(list_of_dfs, .id = "year_src")
# data.table equivalent (faster for large tables)
DT_combined <- rbindlist(list(DT_2021, DT_2022, DT_2023), idcol = "year_src")
# fill = TRUE adds NA for columns missing in some tables
DT_combined <- rbindlist(list_of_DTs, use.names = TRUE, fill = TRUE)
# Column binding: place tables side by side (rows must already correspond)
df_combined <- dplyr::bind_cols(df_demographics, df_outcomes)
Prefer Joins Over Column Binding:
bind_cols() and cbind() assume rows in the two tables are in the same order and correspond to the same subjects. This assumption fails silently if either table has been sorted, filtered, or subsetted. A left_join() on an explicit key is almost always safer.
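A small demonstration of this failure mode, using Base R and hypothetical data. The outcomes table arrives sorted differently from the demographics table, so positional binding silently pairs the wrong subjects while a keyed join does not:

```r
demo <- data.frame(id = c(1, 2, 3), sex = c("F", "M", "F"))
out  <- data.frame(id = c(3, 1, 2), died = c(TRUE, FALSE, FALSE))  # different row order!

# Column binding pairs row 1 with row 1 regardless of id -> wrong subjects matched
wrong <- cbind(demo, out[, "died", drop = FALSE])

# Joining on the key pairs each subject correctly
right <- merge(demo, out, by = "id")
```

In `wrong`, subject 1 is paired with subject 3's outcome; in `right`, every `died` value sits next to its own `id`.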
R offers several formats for persisting data between sessions. Choosing the right one depends on whether you need to share a single object or a whole collection, whether the file must be readable outside R, and how large the dataset is. This section covers each format, when to use it, and the practical tradeoffs between them.
Format Comparison
Format
Extension
Saves
R-Only?
Best For
saveRDS() / readRDS()
.rds
One object
Yes
Saving a single cleaned dataset, model, or list between scripts. The object can be loaded under any name.
save() / load()
.RData
Named objects (one or many)
Yes
Checkpointing a set of related objects mid-analysis. Objects are restored under their original names.
save.image() / load()
.RData
Entire workspace
Yes
Generally not recommended. Creates implicit, hard-to-audit dependencies. Avoid for reproducible work.
write.csv() / read.csv()
.csv
One tabular object
No
Sharing data with collaborators in Excel, Python, Stata, or any other tool. Universal but slow and loses column type information.
write_parquet() / read_parquet()
.parquet
One tabular object
No
Large datasets shared with Python (pandas, polars) or cloud pipelines. Columnar storage; fast and compact. Requires the arrow package.
write_fst() / read_fst()
.fst
One data frame or data.table
Near-R-only
Fastest read/write for R-to-R workflows on large tabular data. Supports random column access. Requires the fst package.
write.xlsx() / read_excel()
.xlsx
One or more sheets
No
When a collaborator or system requires .xlsx and CSV is not accepted. Avoid for intermediate analysis storage.
RDS: Saving Individual Objects
saveRDS() and readRDS() are the recommended default for saving any single R object. Unlike save(), the object is not bound to its original variable name on load, which makes it easier to use in different scripts without name collisions.
saveRDS() and readRDS()
# Save one object to disk
saveRDS(df_clean, file = "data/clean/df_clean.rds")
saveRDS(model_logit, file = "output/model_logit.rds")
# Load it back — assign to any name you choose
df_clean <- readRDS("data/clean/df_clean.rds")
model_final <- readRDS("output/model_logit.rds") # original name not required
# RDS preserves all R attributes: factor levels, column types, class, etc.
# A data.table saved with saveRDS() is still a data.table on load.
# A list, model object, or ggplot is preserved exactly as saved.
Use RDS as Your Default:
For any intermediate or final R object that does not need to be opened in another tool, saveRDS() is the safest and most explicit choice. It saves exactly one thing, forces you to name it explicitly on load, and preserves all R-specific attributes such as factor levels, ordered factors, and object class.
RData: Saving Multiple Named Objects
save() stores multiple R objects in a single file. When load() reads the file, each object reappears in the environment under its original name. This is useful for checkpointing a set of related results, but requires discipline: the names are baked into the file, so loading into a session that already has objects with those names will silently overwrite them.
save() and load()
# Save a specific set of objects into one file
save(df_clean, df_joined, model_lm,
file = "output/checkpoint_01.RData")
# Restore all of them at once — names are fixed to what was saved
load("output/checkpoint_01.RData") # df_clean, df_joined, model_lm appear in environment
# Check what a .RData file contains before loading it
load("output/checkpoint_01.RData", verbose = TRUE)
# Safer pattern: load into a new environment to inspect before exposing to global env
checkpoint <- new.env()
load("output/checkpoint_01.RData", envir = checkpoint)
ls(checkpoint) # see what it contains
df_clean <- checkpoint$df_clean # pull out only what you need
Workspace Saving: What to Avoid and Why
RStudio prompts you to save your workspace when you close a session. The default file is .RData in your working directory. Accepting this prompt is one of the most common reproducibility mistakes in R.
Why Workspace Saving Causes Problems
# save.image() writes every object in the current environment to .RData
save.image() # saves to .RData in the working directory
save.image(file = "session_backup.RData") # explicit filename
# The problem: .RData loads silently every time R starts in that directory.
# Objects from old, deleted, or changed scripts persist invisibly.
# Code appears to work only because an old object is in memory,
# not because the script that creates it still runs correctly.
# The fix: turn off automatic workspace saving in RStudio.
# Tools > Global Options > General:
#   "Save workspace to .RData on exit" -> set to Never
#   "Restore .RData into workspace at startup" -> uncheck
# Then start each session clean and source the scripts that rebuild your objects.
# If rebuilding takes too long, save intermediate objects explicitly with saveRDS().
The Blank Slate Principle:
A reproducible analysis is one that produces the same results when run from a blank R session on a machine that has never seen the data before. If your code relies on objects in .RData rather than on scripts that create those objects, it fails this test. Disable automatic workspace saving and use saveRDS() for any intermediate results that are expensive to recompute.
CSV: Universal Plain-Text Exchange
CSV is the safest format for sharing tabular data with any other tool. It is slow, verbose, and does not preserve column types, but it opens in Excel, Python, Stata, SAS, and any text editor. Use it as a delivery format, not an intermediate storage format.
Reading and Writing CSV
# Base R (slow; adds row names by default unless suppressed)
write.csv(df, "output/results.csv", row.names = FALSE)
df <- read.csv("data/file.csv", stringsAsFactors = FALSE)
# readr (fast; prints column type guesses; returns a tibble)
library(readr)
readr::write_csv(df, "output/results.csv") # no row names by default
df <- readr::read_csv("data/file.csv", show_col_types = FALSE)
# data.table (fastest; handles large files well)
library(data.table)
data.table::fwrite(DT, "output/results.csv") # very fast; no row names
DT <- data.table::fread("data/file.csv") # auto-detects delimiter and types
# Preserve a date column across CSV round-trips by formatting explicitly
df$date <- format(df$date, "%Y-%m-%d") # write as ISO string
df$date <- as.Date(df$date) # parse back after reading
Parquet: Fast Cross-Language Storage
Parquet is a columnar binary format supported natively by Python (pandas, polars), Spark, DuckDB, and cloud storage services. It preserves column types, compresses well, and reads far faster than CSV for large files. The arrow package provides the R interface.
arrow: write_parquet() and read_parquet()
install.packages("arrow") # install once
library(arrow)
# Write a data frame or data.table to parquet
arrow::write_parquet(df_clean, "data/clean/df_clean.parquet")
# Read back (returns a tibble by default)
df_clean <- arrow::read_parquet("data/clean/df_clean.parquet")
# Specify only the columns you need (parquet reads column-by-column,
# so selecting columns avoids reading unused data from disk entirely)
df_subset <- arrow::read_parquet(
"data/clean/df_clean.parquet",
col_select = c("patient_id", "age", "outcome")
)
# Convert a data.table to data frame before writing if arrow warns about class
arrow::write_parquet(as.data.frame(DT), "output/DT.parquet")
fst: Fastest R-to-R Binary Format
The fst package provides the fastest read and write speeds available for tabular data in R, often ten times faster than fread() on large files. It also supports random column access, meaning you can read a subset of columns without loading the full file. The format is not widely supported outside R, so use it for intermediate objects in pure-R pipelines.
fst: write_fst() and read_fst()
install.packages("fst") # install once
library(fst)
# Write (accepts data frames and data.tables)
fst::write_fst(df_clean, "data/clean/df_clean.fst")
# Compress (0 = none, 100 = max; default 50 is a good balance)
fst::write_fst(df_clean, "data/clean/df_clean.fst", compress = 75)
# Read the full file
df_clean <- fst::read_fst("data/clean/df_clean.fst")
# Read only specific columns (very fast; no other columns are read from disk)
df_sub <- fst::read_fst("data/clean/df_clean.fst",
columns = c("patient_id", "age", "outcome"))
# Read back as a data.table directly
library(data.table)
DT <- as.data.table(fst::read_fst("data/clean/df_clean.fst"))
Excel: When CSV Is Not an Option
Use Excel format when a collaborator or system requires .xlsx specifically and CSV is not acceptable. For reading Excel files into R, readxl is reliable and requires no Java dependency. For writing, writexl is fast and lightweight; openxlsx supports formatting, multiple sheets, and styled headers when the output format is prescribed.
Reading and Writing Excel Files
# Reading Excel files
install.packages("readxl")
library(readxl)
df <- readxl::read_excel("data/file.xlsx") # first sheet by default
df <- readxl::read_excel("data/file.xlsx", sheet = "Sheet2")
df <- readxl::read_excel("data/file.xlsx", skip = 2, na = "NA")
readxl::excel_sheets("data/file.xlsx") # list all sheet names
# Writing Excel files: writexl (no Java; single or multiple sheets)
install.packages("writexl")
writexl::write_xlsx(df, "output/results.xlsx") # single sheet
writexl::write_xlsx(list(Summary = df_summary, Detail = df_detail),
                    "output/report.xlsx") # multiple sheets; names become tab labels
# openxlsx: styled output, formatted headers, bold rows
install.packages("openxlsx")
library(openxlsx)
wb <- createWorkbook()
addWorksheet(wb, "Results")
writeData(wb, "Results", df_summary, headerStyle = createStyle(textDecoration = "bold"))
saveWorkbook(wb, "output/report.xlsx", overwrite = TRUE)
Choosing a Format
Decision Guide
# Saving one cleaned dataset for use in the next script?
#   -> saveRDS() [default choice; preserves all attributes]
# Saving several related objects (model + data + metadata) as a checkpoint?
#   -> save() [convenient; names are restored on load]
# Large tabular file that only needs to be read back into R?
#   -> write_fst() [fastest read/write; random column access]
# Large tabular file shared with Python, Spark, or a cloud pipeline?
#   -> write_parquet() [cross-language; typed; compressed; widely supported]
# Sharing data with a collaborator using Excel, Stata, or SAS?
#   -> write_csv() / fwrite() [universal; opens in any tool; loses types]
# Collaborator or system requires .xlsx and CSV is not accepted?
#   -> writexl::write_xlsx() or openxlsx [Excel-native; multiple sheets]
# Closing RStudio and asked to save workspace?
#   -> No. Turn this off in Tools > Global Options > General.
File Paths and Project Portability
Hard-coded absolute paths break when a project is moved to a new machine or shared with a collaborator. The here package constructs paths relative to the project root, making all file references portable without any setup.
here::here() for Portable Paths
install.packages("here") # install once
library(here)
# here::here() always resolves relative to the .Rproj file location,
# regardless of where the calling script lives in the project folder
saveRDS(df_clean, here::here("data", "clean", "df_clean.rds"))
df_clean <- readRDS(here::here("data", "clean", "df_clean.rds"))
readr::write_csv(df, here::here("output", "results.csv"))
arrow::write_parquet(df, here::here("data", "clean", "df.parquet"))
# here() builds the path from multiple arguments, joining with the OS separator
# On any machine: /path/to/project/data/clean/df_clean.rds
# No setwd() needed; no broken absolute paths.
Recommended Folder Convention:
Keep raw source files in data/raw/ and treat them as read-only. Write all processed or cleaned objects to data/clean/. Write all final outputs (tables, figures, reports) to output/. This separation makes it unambiguous which files can be regenerated by scripts and which are irreplaceable originals.
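This skeleton can be created once from R itself. A minimal sketch; the folder names simply follow the convention described above:

```r
# Create the conventional project folders if they do not already exist
for (d in c("data/raw", "data/clean", "output")) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
```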
10. Variable Types & Regression Analysis
This section covers how to assign and verify variable types in R, fit linear and logistic regression models with lm() and glm(), and interpret the output that summary() returns. These are the most common modelling steps in public health data analysis.
Assigning Variable Types
R stores data in different types depending on what the values represent. Getting types right before modelling matters: a variable stored as character will be silently dropped; a numeric code stored as numeric instead of factor will be treated as continuous when it should be categorical.
Date
Calendar dates: admission date, date of birth. Enables date arithmetic.
df$dob <- as.Date(df$dob,
format = "%Y-%m-%d")
class(df$dob)
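A short sketch of inspecting and fixing types before modelling. The tiny data frame and its columns are hypothetical; the pattern is the point:

```r
# Hypothetical data: insurance arrives as a numeric code, dob as character
df <- data.frame(age = c(40, 55),
                 insurance = c(1, 2),
                 dob = c("1984-02-01", "1969-07-15"))

sapply(df, class)                        # inspect every column's type at once

df$insurance <- as.factor(df$insurance)  # numeric code -> categorical
df$dob <- as.Date(df$dob, format = "%Y-%m-%d")

sapply(df, class)                        # insurance is now factor, dob is Date
```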
Factors: Reference Levels and Coding
In regression, R uses the first level of a factor as the reference (baseline) category. You should set this deliberately rather than accepting the alphabetical default.
Setting and Checking Factor Levels
# Check current levels (first = reference in regression)
levels(df$insurance)
# e.g. "Medicaid" "Medicare" "Private" "Uninsured"
# Set a specific reference level
df$insurance <- relevel(df$insurance, ref = "Private")
# Verify: Private is now first
levels(df$insurance)
# Relabel levels: assign a character vector in the same order as levels(df$educ)
# (assumes current levels are "lt_hs", "hs", "some_col", "col_plus")
levels(df$educ) <- c(
  "Less than high school",
  "High school / GED",
  "Some college",
  "College or above"
)
Always Inspect Types Before Modelling:
Run str(df) or sapply(df, class) before fitting any model. Numeric codes for categorical variables (e.g., 1, 2, 3 for insurance type) will be treated as continuous unless converted to factors. This is one of the most common sources of silent errors in public health analyses.
Linear Regression: lm()
Use lm() when your outcome is a continuous variable (blood pressure, BMI, length of stay, a cost measure). The formula syntax is outcome ~ predictor1 + predictor2.
Fitting a Linear Model
# Fit the model
model_lm <- lm(sbp ~ age + as.factor(insurance) + bmi,
data = df)
# View full results
summary(model_lm)
# Confidence intervals for coefficients
confint(model_lm)
# Add fitted values and residuals to the data frame
df$fitted <- fitted(model_lm)
df$residual <- residuals(model_lm)
# Basic residual diagnostics (4 plots)
par(mfrow = c(2, 2))
plot(model_lm)
Common Formula Operators
Syntax
Meaning
Example
y ~ x
Simple regression of y on x
lm(sbp ~ age)
y ~ x1 + x2
Multiple regression; additive terms
lm(sbp ~ age + bmi)
y ~ x1 * x2
Main effects plus interaction term
lm(sbp ~ age * insurance)
y ~ x1 + x1:x2
Main effect of x1 plus interaction only (no main effect of x2)
lm(sbp ~ age + age:insurance)
y ~ I(x^2)
Arithmetic inside I(); adds a squared term
lm(sbp ~ age + I(age^2))
y ~ .
All other columns in the data frame as predictors
lm(sbp ~ ., data = df)
y ~ . - x
All columns except x
lm(sbp ~ . - id, data = df)
y ~ 0 + x
Suppress the intercept
lm(sbp ~ 0 + insurance)
Logistic Regression: glm() with Binomial Family
Use glm(family = binomial) when your outcome is binary: died / survived, readmitted / not, disease present / absent. The model estimates log-odds; exponentiating the coefficients gives odds ratios.
Fitting a Logistic Model
# Outcome must be 0/1 numeric or a two-level factor
df$readmit <- as.integer(df$readmit_30day == "Yes")
# Fit the model
model_logit <- glm(readmit ~ age + insurance + n_comorbidities,
data = df,
family = binomial(link = "logit"))
# View results (log-odds scale)
summary(model_logit)
# Odds ratios and 95% CI
exp(coef(model_logit))    # odds ratios
exp(confint(model_logit)) # 95% CI on OR scale
# Predicted probabilities for each observation
df$pred_prob <- predict(model_logit,
                        type = "response")
Probit and Other Links:
The binomial family accepts other link functions. Use link = "probit" for a probit model or link = "cloglog" for a complementary log-log model. For Poisson count outcomes (e.g., number of ED visits), use family = poisson(link = "log").
Reading summary() Output
For lm(): Linear Regression Output
Annotated lm() summary() Output
# Call:
# lm(formula = sbp ~ age + insurance + bmi, data = df)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
#  -28.41   -6.12   -0.44    5.98   31.07
# ^ Residuals should be roughly symmetric around 0.
#   A large Max vs Min asymmetry suggests outliers.
#
# Coefficients:
#                    Estimate Std. Error t value Pr(>|t|)
# (Intercept)           82.14       4.21   19.51   <2e-16 ***
# age                    0.43       0.06    7.18  1.2e-12 ***
# insuranceMedicaid      3.81       1.14    3.34   0.0009 ***
# insuranceMedicare      1.92       1.08    1.78   0.0756 .
# insuranceUninsured     5.60       1.31    4.27  2.3e-05 ***
# bmi                    0.71       0.09    7.89  7.4e-15 ***
# ^ Estimate: the coefficient.
#   For numeric predictors: change in outcome per 1-unit increase.
#   For factor levels: difference vs. the reference level (Private).
# ^ Std. Error: uncertainty around the estimate.
# ^ t value: Estimate / Std. Error.
# ^ Pr(>|t|): p-value; probability of this t-value under H0.
# ^ Signif. codes: *** p<.001  ** p<.01  * p<.05  . p<.1
#
# Residual standard error: 9.83 on 1194 degrees of freedom
# ^ Typical size of prediction error in outcome units (mmHg here).
# Multiple R-squared: 0.213, Adjusted R-squared: 0.210
# ^ R²: proportion of outcome variance explained by the model.
#   Adjusted R² penalises for number of predictors; use this one.
# F-statistic: 64.3 on 5 and 1194 DF, p-value: < 2.2e-16
# ^ Tests whether the model as a whole explains more than chance.
For glm(): Logistic Regression Output
Annotated glm() summary() Output
# Coefficients:
#                   Estimate Std. Error z value Pr(>|z|)
# (Intercept)         -2.841      0.312   -9.11   <2e-16 ***
# age                  0.027      0.006    4.50  6.8e-06 ***
# insuranceMedicaid    0.441      0.142    3.11   0.0019 **
# insuranceMedicare    0.198      0.139    1.42   0.1549
# n_comorbidities      0.312      0.041    7.61  2.8e-14 ***
# ^ Estimates are LOG-ODDS (logit scale), not probabilities.
#   Positive = higher odds of the outcome; negative = lower odds.
#   Use exp(coef()) to convert to odds ratios.
# ^ z value replaces t value; interpretation is the same.
#
# Null deviance:     1284.3 on 1199 degrees of freedom
# Residual deviance: 1091.7 on 1195 degrees of freedom
# ^ Null deviance: fit of intercept-only model.
#   Residual deviance: fit of your model.
#   Larger reduction = better model fit.
# AIC: 1101.7
# ^ Lower AIC = better fit (penalised for complexity).
#   Use AIC to compare models on the same data.
Interpreting Coefficients
Model
Predictor Type
Coefficient Represents
Practical Interpretation
lm()
Continuous (e.g., age)
Change in outcome per 1-unit increase in predictor, holding others constant
Age coefficient = 0.43: each additional year of age is associated with 0.43 mmHg higher systolic BP, adjusted for insurance and BMI
lm()
Factor (e.g., insurance)
Difference in outcome vs. the reference level, holding others constant
Medicaid coefficient = 3.81: Medicaid patients have systolic BP 3.81 mmHg higher on average than Private patients with the same age and BMI
glm() binomial
Continuous
Change in log-odds per 1-unit increase; exp(coef) gives the odds ratio
Age coefficient = 0.027; OR = exp(0.027) = 1.027: each additional year of age is associated with 2.7% higher odds of readmission
glm() binomial
Factor
Log-odds difference vs. reference level; exp(coef) gives the odds ratio
Medicaid coefficient = 0.441; OR = exp(0.441) = 1.55: Medicaid patients have 55% higher odds of readmission compared to Private patients, adjusted for age and comorbidities
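The conversions in the table are plain arithmetic. Using the illustrative coefficients from the annotated output above:

```r
# Continuous predictor: log-odds 0.027 per year -> odds ratio per year
exp(0.027)       # ~1.027: about 2.7% higher odds per additional year

# Factor level: log-odds difference 0.441 vs. the reference level
exp(0.441)       # ~1.55: about 55% higher odds than the reference group

# A 10-year age difference multiplies the odds by exp(10 * 0.027)
exp(10 * 0.027)  # ~1.31
```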
Odds Ratios Are Not Risk Ratios:
An odds ratio of 1.55 does not mean Medicaid patients are 55% more likely to be readmitted. It means their odds are 55% higher. When the outcome is common (prevalence above roughly 10%), odds ratios overstate the relative risk. For common binary outcomes, consider using a log-binomial model (family = binomial(link = "log")) or a Poisson model with robust standard errors to estimate risk ratios directly.
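To see how much an odds ratio can overstate relative risk, the approximate conversion of Zhang & Yu (1998) is useful: RR = OR / (1 - p0 + p0 × OR), where p0 is the outcome risk in the reference group. A sketch (the helper function `or_to_rr` is introduced here for illustration, not part of base R):

```r
# Approximate OR -> RR conversion (Zhang & Yu, 1998), given baseline risk p0
or_to_rr <- function(or, p0) or / (1 - p0 + p0 * or)

or_to_rr(1.55, p0 = 0.05)  # rare outcome: RR stays close to the OR (~1.51)
or_to_rr(1.55, p0 = 0.30)  # common outcome: RR noticeably smaller (~1.33)
```

Note this conversion is itself an approximation; for formal estimates, fit a risk-ratio model directly as described above.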
Extracting Results Programmatically
Tidy Model Output with broom
The broom package converts model output into tidy data frames, making it easy to plot coefficients or export results.
library(broom)
# Coefficients table as a data frame
tidy(model_logit)
tidy(model_logit, conf.int = TRUE, exponentiate = TRUE)
# ^ exponentiate = TRUE gives odds ratios directly
# Model-level statistics (R², AIC, df, etc.)
glance(model_lm)
glance(model_logit)
# Observation-level: fitted values, residuals, influence stats
augment(model_lm, data = df)
augment(model_logit, data = df, type.predict = "response")
# ^ .fitted column gives predicted probabilities for glm
Install broom:
install.packages("broom"). It is part of the tidyverse meta-package so it is already installed if you have run install.packages("tidyverse").
11. Next Steps
With R, RStudio, and your core packages installed, you have a working statistical computing environment. The resources below provide the most reliable paths to building further fluency.
Resource
Focus
Where
R for Data Science (Wickham, Cetinkaya-Rundel & Grolemund)
Data import, transformation, visualisation, and communication with the tidyverse
Free online at r4ds.hadley.nz
Consider using the renv package from the start of any project. It records the exact package versions used in a project lockfile, making your analyses reproducible across machines and over time. Install with install.packages("renv") and initialize with renv::init().