Installation, Package Management & Libraries · macOS and Windows
1. Overview
R is a statistical computing language and environment, widely used in data analysis, visualization, and reproducible research. RStudio (distributed by Posit) is the most widely adopted integrated development environment (IDE) for R: it provides a code editor, console, environment viewer, and plot panel in a single interface.
You install R first, then RStudio. RStudio detects your R installation automatically and does not work without it.
Before You Begin:
You will need an internet connection and administrator access on your machine. The total installation typically takes under 10 minutes.
Packages: community libraries that extend R's functionality, installed from within R or RStudio (see Section 4).
2. Installing R
R is distributed through the Comprehensive R Archive Network (CRAN). Always install the latest stable release unless a project requires a specific version.
macOS Installation
Go to cran.r-project.org and click Download R for macOS.
Select the correct installer for your chip. Choose the Apple Silicon package (arm64) for M1, M2, M3, or M4 Macs. Choose the Intel package for Intel-based Macs. You can verify via Apple menu > About This Mac.
Download the .pkg file and open it. Follow the installer prompts and accept the default installation location (/Library/Frameworks/R.framework).
Verify the installation by opening RStudio (see Section 3). The Console pane at the bottom left displays a message beginning with R version 4.x.x as soon as RStudio launches. If you see this message, R is installed correctly; no further verification is needed.
Note on Xcode Tools:
A small number of specialist packages require additional compilation tools to install. If you encounter an error message mentioning "no developer tools" or "xcrun" when installing a package later, contact your instructor or IT support. Most users will never need to address this during an introductory course.
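If you prefer to resolve this yourself, the standard remedy on macOS is installing Apple's Command Line Tools from the Terminal. This opens a system dialog and may take several minutes:

```shell
# Install Apple's Command Line Tools (the compilers R uses
# when building packages from source on macOS)
xcode-select --install
```
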
Windows Installation
Go to cran.r-project.org and click Download R for Windows, then base.
Download the .exe installer (e.g., R-4.x.x-win.exe).
Run the installer. Accept the license and keep the default install path (C:\Program Files\R\R-4.x.x). Note that R 4.2 and later ship as 64-bit only; the default component selection is appropriate for almost all users.
Verify the installation by opening RStudio (see Section 3). The Console pane at the bottom left displays a message beginning with R version 4.x.x as soon as RStudio launches. If you see this message, the installation succeeded. If RStudio opens but the Console shows an error, see the callout below.
Rtools (Recommended):
Windows users who need to compile packages from source should also install Rtools, available at the same CRAN Windows page. Match the Rtools version to your installed R version. Most introductory users will not need this immediately.
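One quick way to check whether build tools are already available is from the R console, using the pkgbuild helper package (a small utility package on CRAN; installing it is an extra step not required by the course):

```r
# Install the helper if you do not have it, then check for build tools.
# Returns TRUE when compilation tools (Rtools on Windows,
# Command Line Tools on macOS) are available to R.
install.packages("pkgbuild")
pkgbuild::has_build_tools()
```
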
3. Installing RStudio
RStudio Desktop is the free, open-source edition suitable for individual use on your local machine. It is maintained by Posit, the company behind the tidyverse ecosystem.
macOS Installation
Visit posit.co/download/rstudio-desktop and click Download RStudio Desktop. The page auto-detects your operating system.
Open the downloaded .dmg file and drag the RStudio icon into your Applications folder.
Launch RStudio from Applications. On first open, macOS may prompt you to confirm opening an app downloaded from the internet. Click Open.
RStudio detects your R installation automatically. The Console pane will display your R version on startup, confirming the connection.
Windows Installation
Visit posit.co/download/rstudio-desktop and download the Windows .exe installer.
Run the installer with default settings. RStudio installs to C:\Program Files\RStudio by default.
Launch RStudio from the Start menu or Desktop shortcut. The Console pane should display your R version, confirming that RStudio found the R installation.
If RStudio Cannot Find R:
Open RStudio, navigate to Tools > Global Options > General, and manually set the R version path to the folder where R was installed.
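After restarting RStudio, you can confirm from the Console which R installation it attached to:

```r
# Run in the RStudio Console
R.version.string  # the R build RStudio is using
.libPaths()       # where that R installation keeps its packages
```
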
4. Installing Packages
R packages extend the base language with functions, datasets, and tools. The primary source is CRAN, which hosts over 20,000 packages. Packages are installed once and stored in a local library on your machine.
Installing from CRAN
The easiest way to install a package is through RStudio's built-in point-and-click interface. In the Packages tab (bottom-right pane), click Install, type the package name, and click Install again. RStudio handles the rest.
Using the Packages Tab:
In RStudio, go to the Packages pane (bottom right) and click Install. Type the package name in the dialog box, make sure Install dependencies is checked, and click Install. You only need to do this once per package on your machine.
You can also type the install command directly into the RStudio Console pane (bottom left) and press Enter. Both approaches do exactly the same thing.
Using install.packages()
# Install a single package
install.packages("ggplot2")

# Install multiple packages at once
install.packages(c("dplyr", "tidyr", "readr"))

# Install the full tidyverse meta-package
install.packages("tidyverse")
Installing from GitHub
Development versions of packages, or packages not yet on CRAN, can be installed from GitHub using the pak package.
Using pak (Recommended)
# Install pak first if you do not have it
install.packages("pak")

# Install any GitHub package by user/repo
pak::pkg_install("tidyverse/ggplot2")

# pak also handles CRAN packages and is faster
pak::pkg_install("dplyr")
Why pak:
pak resolves dependencies in parallel, produces clearer error messages, and is the modern recommended approach for package installation as of 2023 onward.
Updating and Removing Packages
To update packages using the menu, go to the Packages tab in RStudio and click Update. A list of packages with available updates will appear; check the ones you want and click Install Updates. You can also use the Console commands below for the same effect.
Package Maintenance
# Check which installed packages have updates available
old.packages()

# Update all outdated packages at once
update.packages(ask = FALSE)

# Remove a package
remove.packages("packagename")

# List all currently installed packages
installed.packages()[, "Package"]
Commonly Used Packages by Category
Package
Category
Purpose
ggplot2
Visualization
Grammar of graphics plotting system
dplyr
Data Wrangling
Data frame manipulation and transformation
tidyr
Data Wrangling
Reshaping and tidying data
data.table
Data Wrangling
High-performance data manipulation for large datasets; faster than dplyr on big files
readr
Import
Fast reading of flat files (CSV, TSV)
readxl
Import
Reading Excel files
lubridate
Date/Time
Intuitive date and time handling
stringr
Strings
Consistent string manipulation functions
purrr
Functional
Functional programming tools and iteration
knitr
Reporting
Dynamic report generation
rmarkdown
Reporting
R Markdown documents and notebooks
5. Loading Libraries
Installing a package makes it available on disk. To use it in a session, you must load it into memory with library(). This call goes at the top of every script or R Markdown file that needs the package.
Install Once, Load Every Session:
install.packages() is run once per machine (or when updating). library() is called at the start of each new R session or script.
The library() Function
Loading Packages into a Session
# Load a single package
library(ggplot2)

# Typical script header: load all dependencies upfront
library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)

# Load without printing startup messages
suppressPackageStartupMessages(library(tidyverse))
Using Functions Without Loading
If you only need one or two functions from a package, call them directly using the :: operator. This avoids attaching the whole package to the search path and makes dependencies explicit in the code.
The :: Operator
# Call a function directly without loading the library
dplyr::filter(my_data, value > 10)
readr::read_csv("data/file.csv")

# Useful when two packages have functions with the same name
stats::filter(x, rep(1/3, 3))  # base R filter, not dplyr::filter
Checking if a Package is Installed
Portable Script Header Pattern
This pattern installs any missing packages automatically when a collaborator runs your script for the first time.
# Define required packages
packages_needed <- c("dplyr", "ggplot2", "readr")
# Install any that are missing
new_packages <- packages_needed[
  !(packages_needed %in% installed.packages()[, "Package"])
]
if (length(new_packages)) install.packages(new_packages)

# Load all
invisible(lapply(packages_needed, library, character.only = TRUE))
Where Libraries Are Stored
Library Paths
# See where R looks for installed packages
.libPaths()

# Example output on macOS:
# [1] "/Library/Frameworks/R.framework/Versions/4.4/Resources/library"

# Example output on Windows:
# [1] "C:/Users/YourName/AppData/Local/R/win-library/4.4"
# [2] "C:/Program Files/R/R-4.4.0/library"
6. R File Types & Helper Files
RStudio supports several distinct file types for writing R code. Each serves a different purpose: some are designed for clean, executable scripts; others weave prose and code together for reporting; others are built for interactive exploration. Understanding which to use, and when, is one of the most practical decisions you will make when setting up a project.
R Script (.R)
An R script is a plain text file containing only R code and comments. It is the simplest and most portable file type: any R installation can run it, and it has no dependencies beyond base R. Scripts are the right choice for data processing pipelines, reusable functions, and any code that should be sourced by other files.
Anatomy of an R Script
# ── script_name.R ────────────────────────────────────────────
# Purpose: Clean and reshape the enrollment dataset
# Author: Your Name
# Updated: 2026-03-23

# 1. Load dependencies ────────────────────────────────────────
library(dplyr)
library(readr)
# 2. Read data ────────────────────────────────────────────────
raw <- readr::read_csv("data/enrollment_raw.csv")
# 3. Clean ────────────────────────────────────────────────────
clean <- raw |>
dplyr::filter(!is.na(id)) |>
dplyr::mutate(year = as.integer(year))
When to Use a Script:
Use .R scripts for data cleaning pipelines, simulation code, helper function definitions, and any file you plan to source() from another file. Scripts run from top to bottom with no markup overhead, which makes them fast and predictable.
R Markdown (.Rmd)
R Markdown files combine prose (written in Markdown) with executable code chunks. When rendered, the file produces a self-contained document in a format of your choice: HTML, PDF, Word, or slides. R Markdown is the standard format for reproducible reports, homework submissions, and any analysis where you need to explain your reasoning alongside the code and output.
Anatomy of an R Markdown File
---
title: "Weekly Analysis"
author: "Your Name"
date: "2026-03-23"
output: html_document
---

## Introduction

This report summarizes enrollment trends for Spring 2026.

```{r setup, include=FALSE}
library(dplyr)
library(ggplot2)
```

```{r plot, echo=FALSE}
ggplot(data, aes(x = week, y = count)) +
  geom_line()
```
Render the document by clicking Knit in RStudio, or by running rmarkdown::render("file.Rmd") in the console.
When to Use R Markdown:
Use .Rmd when the final deliverable is a document: a report, a homework assignment, a methods appendix, or a slide deck. Because the file re-runs all code on render, every figure and table in the output is guaranteed to reflect the current data and code.
R Notebook (.Rmd with notebook output)
An R Notebook is technically an R Markdown file with output: html_notebook set in its YAML header. The key distinction is execution behavior: in a standard .Rmd, all chunks run together when you knit; in a Notebook, each chunk runs independently and its output appears inline immediately below the chunk. This makes Notebooks well-suited for exploratory analysis where you want to inspect results step by step without re-running the entire document.
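A minimal Notebook header differs from a report header only in its output line:

```yaml
---
title: "Exploration Notebook"
output: html_notebook
---
```
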
Save the file in RStudio and a .nb.html preview file is generated automatically alongside it. This preview can be opened in any browser without R installed.
When to Use a Notebook:
Use Notebooks during active exploration: checking data distributions, testing model specifications, or iterating on visualizations. Switch to a standard .Rmd when you are ready to produce a final, fully reproducible document from scratch.
Comparison: Choosing the Right File Type
File Type
Extension
Best For
Output on Run
R Script
.R
Pipelines, functions, sourced utilities
Objects in environment; no document
R Markdown
.Rmd
Reproducible reports, final deliverables
HTML, PDF, Word, or slides on Knit
R Notebook
.Rmd (notebook)
Interactive exploration, iterative work
Inline chunk output; .nb.html preview
Quarto
.qmd
Modern replacement for R Markdown; also supports Python and Julia
HTML, PDF, Word, slides, websites
Helper Files
As a project grows, it becomes useful to separate reusable code into dedicated helper files rather than repeating it across scripts and documents. Helper files are plain .R scripts that contain only function definitions and constants; they carry no side effects and produce no output when sourced.
Creating and Sourcing a Helper File
helpers.R
# ── helpers.R ─────────────────────────────────────────────────
# Reusable utility functions for the project.
# Source this file at the top of any script or .Rmd that needs it.

# Compute percentage change between two values
pct_change <- function(baseline, followup) {
(followup - baseline) / baseline * 100
}
# Standardize a numeric vector to mean 0, sd 1
standardize <- function(x) {
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
# A project-wide ggplot2 theme
theme_project <- ggplot2::theme_minimal() +
ggplot2::theme(
text = ggplot2::element_text(size = 11),
plot.title = ggplot2::element_text(face = "bold")
)
Sourcing the Helper File
# In any script or .Rmd chunk, load helpers with source()
source("helpers.R")

# Or use a path relative to the project root with here::here()
source(here::here("R", "helpers.R"))

# Functions are now available in the session
pct_change(100, 115)  # returns 15
standardize(my_vector)
Recommended Project File Structure
A common convention is to keep all helper files in an R/ subfolder within the project directory. This mirrors the structure used in R packages and makes it easy to source multiple helpers at once.
Create a new project in RStudio via File > New Project. The .Rproj file sets the working directory to the project root automatically whenever you open it, which means all relative file paths work consistently regardless of where the project folder lives on your machine. This is the single most important habit for reproducible work.
Sourcing All Helper Files at Once
Batch Source Pattern
# Source every .R file in the R/ folder
invisible(
lapply(list.files("R", pattern = "\\.R$", full.names = TRUE), source)
)
External Cheat Sheets
Official and community reference cards. Each preview is embedded below; use the Open PDF link to view full-screen or download.
The three cheat sheets below cover the most frequently used functions and patterns for everyday data work in R. Each is organized by task rather than alphabetically so you can scan quickly while working. Use the tabs to switch between packages.
Load the Tidyverse:
library(tidyverse) loads ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats in one call. Alternatively, load only the packages you need.
The Pipe Operator
|> (Native Pipe, R 4.1+)
# The pipe passes the left-hand result into the first argument of the right-hand function
data |> filter(year == 2024) |> select(id, outcome) |> head()
# Equivalent without the pipe (harder to read):
head(select(filter(data, year == 2024), id, outcome))
tidyr: Reshaping Data
Function
What It Does
Example
separate()
Split one column into multiple columns
separate(df, date, into = c("year","month","day"), sep = "-")
unite()
Combine multiple columns into one
unite(df, "full_name", first, last, sep = " ")
drop_na()
Remove rows with NA in specified columns
drop_na(df, score, outcome)
fill()
Fill NA values downward or upward within a column
fill(df, group, .direction = "down")
readr: Reading & Writing Files
Function
What It Does
Example
read_csv()
Read a comma-separated file into a tibble
read_csv("data/file.csv")
read_tsv()
Read a tab-separated file
read_tsv("data/file.tsv")
read_delim()
Read any delimiter; specify with delim = "|"
read_delim("file.txt", delim = "|")
write_csv()
Write a data frame or tibble to CSV
write_csv(df, "output/results.csv")
read_rds() / write_rds()
Read/write R's native binary format; preserves data types exactly
write_rds(model, "output/model.rds")
stringr: String Operations
Function
What It Does
Example
str_detect()
Returns TRUE if pattern is found
filter(df, str_detect(name, "^A"))
str_replace()
Replace first match of a pattern
str_replace(x, "\\.", ",")
str_replace_all()
Replace all matches
str_replace_all(x, " ", "_")
str_trim()
Strip leading/trailing whitespace
mutate(df, name = str_trim(name))
str_to_lower() / str_to_upper()
Change case
str_to_lower(df$country)
str_glue()
String interpolation using {variable} syntax
str_glue("Subject {id}: score = {score}")
str_sub()
Extract substring by position
str_sub(x, 1, 4)
lubridate: Dates & Times
Function
What It Does
Example
ymd(), mdy(), dmy()
Parse date strings in various orders
ymd("2024-03-15")
year(), month(), day()
Extract date components
mutate(df, yr = year(date))
floor_date()
Round date down to unit (week, month, quarter)
floor_date(date, "month")
interval() / as.period()
Compute time between two dates
interval(start, end) / years(1)
today() / now()
Current date or datetime
mutate(df, age_days = today() - dob)
Load data.table:
library(data.table). Convert an existing data frame with setDT(df) (modifies in place) or as.data.table(df) (returns a copy). Read files directly into a data.table with fread().
The Core Syntax: DT[i, j, by]
Every data.table operation fits into a single bracket expression: i filters rows, j selects or computes columns, and by groups the result. Leaving any slot empty means "do nothing for that step."
DT[i, j, by] Pattern
# Think of it as: "Take DT, subset rows with i, compute j, grouped by by"

# Filter rows (i)
DT[age > 18]
DT[country == "US" & !is.na(score)]
# Select / compute columns (j)
DT[, .(id, score)] # select columns
DT[, .(mean_score = mean(score))] # aggregate
DT[, score_log := log(score)]     # add/overwrite column in place

# Group by (by)
DT[, .(mean_score = mean(score)), by = country]
DT[, .(n = .N), by = .(country, year)]
# Combine all three
DT[year >= 2020, .(mean = mean(score)), by = country]
Special Symbols
Symbol
Meaning
Example
.N
Number of rows (in the current group)
DT[, .N, by = country]
:=
Assign a column by reference (no copy made)
DT[, z := x + y]
.()
Shorthand for list() in j and by
DT[, .(a, b), by = .(grp)]
.SD
Subset of Data: the current group's data as a data.table
DT[, lapply(.SD, mean), by = grp]
.SDcols
Restrict .SD to specific columns
DT[, lapply(.SD, mean), by = grp, .SDcols = c("x","y")]
.GRP
Integer index of the current group
DT[, grp_id := .GRP, by = country]
.I
Row indices of the current group
DT[, .I[score == max(score)], by = country]
Reading, Writing & Converting
Function
What It Does
Example
fread()
Read CSV/TSV fast; auto-detects delimiter and column types
fread("data/large_file.csv")
fwrite()
Write to CSV extremely fast
fwrite(DT, "output/results.csv")
setDT()
Convert a data frame to data.table in place (no copy)
setDT(my_df)
as.data.table()
Return a new data.table copy
DT <- as.data.table(my_df)
as.data.frame()
Convert back to a standard data frame
as.data.frame(DT)
Keys, Sorting & Merging
Function
What It Does
Example
setkey()
Sort table and index by one or more columns for fast lookups
setkey(DT, id, year)
setkeyv()
Same as setkey() but accepts a character vector of names
setkeyv(DT, c("id", "year"))
merge()
SQL-style merge; works like base R but faster with keyed tables
merge(DT1, DT2, by = "id", all.x = TRUE)
setorder()
Sort a data.table in place by columns
setorder(DT, -year, country)
setnames()
Rename columns in place
setnames(DT, "old_name", "new_name")
Useful Operations
Task
data.table Syntax
Add multiple columns at once
DT[, c("a","b") := .(x+1, y*2)]
Delete a column
DT[, col_to_drop := NULL]
Filter and count
DT[score > 80, .N]
Cumulative sum by group
DT[, cum_score := cumsum(score), by = id]
Lag/lead a column
DT[, lag_score := shift(score, 1), by = id]
Row-wise between filter
DT[between(score, 50, 80)]
Chain operations
DT[year > 2020][, .N, by = country][order(-N)]
Cross-join / expand grid
CJ(x = 1:3, y = c("a","b"))
In-Place Modification:
Unlike dplyr, data.table modifies objects in place by default when using := or set*() functions. This avoids copying large datasets and is why data.table is faster for big files. Be aware that assigning DT2 <- DT does not create an independent copy; use DT2 <- copy(DT) if you need a true duplicate.
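A minimal sketch of this behavior (column names here are illustrative):

```r
library(data.table)

DT  <- data.table(x = 1:3)
DT2 <- DT               # NOT an independent copy: both names point to one table
DT2[, y := x * 2]       # := modifies in place, so DT gains column y too
"y" %in% names(DT)      # TRUE

DT3 <- copy(DT)         # a true, independent duplicate
DT3[, z := 0]
"z" %in% names(DT)      # FALSE: DT is unaffected
```
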
The Grammar of Graphics:
Every ggplot2 plot is built by layering components: a data source, aesthetic mappings (which variables map to x, y, color, etc.), one or more geoms (the visual marks), optional scales and facets, and a theme. Layers are added with +.
Plot Template
Core ggplot2 Structure
ggplot(data = df, aes(x = var1, y = var2)) +
geom_point() +
scale_x_log10() +
facet_wrap(~ group) +
labs(title = "My Title", x = "X Label", y = "Y Label") +
theme_minimal()
Aesthetic Mappings: aes()
Aesthetic
Controls
Example
x, y
Position on axes
aes(x = year, y = gdp)
color
Point/line color (border for polygons)
aes(color = region)
fill
Fill color of bars, areas, polygons
aes(fill = treatment)
size
Point size or line width
aes(size = population)
shape
Point shape (circle, triangle, etc.)
aes(shape = group)
alpha
Transparency (0 = invisible, 1 = opaque)
aes(alpha = density)
linetype
Solid, dashed, dotted lines
aes(linetype = model)
label
Text labels for geom_text() / geom_label()
aes(label = country)
group
Group data without visual encoding (for lines)
aes(group = subject_id)
Common Geoms
Geom
Best For
Key Arguments
geom_point()
Scatterplots; relationships between two continuous variables
size, alpha, shape
geom_line()
Time series; trends over an ordered variable
linewidth, linetype
geom_col()
Bar charts with pre-computed heights
fill, position = "dodge"
geom_bar()
Bar charts where R counts rows automatically
stat = "count" (default)
geom_histogram()
Distribution of a single continuous variable
bins, binwidth
geom_density()
Smoothed distribution curve
fill, alpha, adjust
geom_boxplot()
Distribution summary: median, IQR, outliers
outlier.shape, notch
geom_violin()
Distribution shape across groups
draw_quantiles = c(0.25, 0.5, 0.75)
geom_smooth()
Fitted trend line with optional confidence band
method = "lm" or "loess", se = FALSE
geom_text()
Text labels on data points
aes(label = name), size, hjust
geom_label()
Text labels with a background box
Same as geom_text()
geom_tile()
Heatmaps; fill encodes a third variable
aes(fill = value)
geom_ribbon()
Shaded area between ymin and ymax (e.g., confidence intervals)
aes(ymin = lo, ymax = hi), alpha
geom_vline() / geom_hline()
Reference lines
xintercept or yintercept, linetype
Scales
Scales control how data values map to visual properties. The naming convention is scale_{aesthetic}_{type}().
Function
What It Does
Example
scale_fill_gradient()
Two-color continuous gradient for the fill aesthetic
scale_fill_gradient(low = "white", high = "#A51C30")
scale_size_area()
Make area (not radius) proportional to value
+ scale_size_area(max_size = 12)
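A few concrete instances of the scale_{aesthetic}_{type}() naming convention (p here stands for any existing ggplot object):

```r
# Each scale function name encodes the aesthetic it controls and its type
p + scale_x_log10()                          # position scale, log10 transform
p + scale_color_brewer(palette = "Set2")     # discrete color palette
p + scale_y_continuous(limits = c(0, 100))   # continuous y axis with limits
```
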
Facets
Faceting: Small Multiples
# One variable: wrap panels automatically into rows/columns
facet_wrap(~ region)
facet_wrap(~ region, ncol = 3, scales = "free_y")
# Two variables: explicit row and column assignment
facet_grid(treatment ~ year)
facet_grid(rows = vars(treatment), cols = vars(year))
Labels & Annotations
Function
What It Controls
Example
labs()
Title, subtitle, caption, axis labels, legend title
labs(title = "...", x = "...", color = "Region")
annotate()
Add a single text or shape annotation at fixed coordinates
annotate("text", x = 2020, y = 50, label = "Policy change")
coord_flip()
Swap x and y axes (e.g., horizontal bars)
+ coord_flip()
coord_cartesian()
Zoom in without dropping data (unlike xlim())
coord_cartesian(ylim = c(0, 100))
Themes
Theme
Look
theme_minimal()
Clean white background; no border; light grid lines. Good default.
theme_bw()
White background with gray grid and black border.
theme_classic()
White background; x and y axes only; no grid. Good for publication.
theme_gray()
Gray background (ggplot2 default).
theme_void()
Completely blank canvas; useful for maps and custom layouts.
Customizing a Theme with theme()
# Override specific theme elements after choosing a base theme
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
axis.text = element_text(size = 10),
legend.position = "bottom",
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = "#1a1a1a"),
strip.text = element_text(color = "white", face = "bold")
)
Saving Plots:
Use ggsave("output/plot.png", width = 8, height = 5, dpi = 300) immediately after your plot code to save the most recently printed plot. Specify plot = my_plot to save a named object. Supported formats include .png, .pdf, .svg, and .tiff.
No Library Needed:
All functions in this section are available in every R session without loading any package. Base R is always present; it is the foundation on which all packages are built.
Getting Help
Function
What It Does
Example
?
Open the help page for a function
?mean
help()
Same as ?
help("lm")
help.search() / ??
Search help pages by keyword
??regression
example()
Run the examples from a help page
example(mean)
args()
Show the arguments of a function
args(lm)
vignette()
Open a package vignette (tutorial document)
vignette("dplyr")
Understanding Your Data
Function
What It Does
Example
str()
Compact display of object structure and types
str(df)
head() / tail()
First or last n rows (default 6)
head(df, 10)
dim()
Number of rows and columns
dim(df)
nrow() / ncol()
Number of rows or columns individually
nrow(df)
names() / colnames()
Column names
names(df)
class()
Object class (e.g., data.frame, numeric, factor)
class(df$age)
typeof()
Low-level storage type (integer, double, character)
typeof(df$id)
summary()
Summary statistics for each column
summary(df)
table()
Frequency counts; also cross-tabulations
table(df$country)
View()
Open a spreadsheet-style viewer in RStudio
View(df)
Data Types & Coercion
Function
What It Does
Example
as.numeric()
Convert to number
as.numeric("3.14")
as.integer()
Convert to whole number
as.integer(3.9) returns 3
as.character()
Convert to text
as.character(42)
as.logical()
Convert to TRUE/FALSE
as.logical(0) returns FALSE
as.factor()
Convert to categorical factor
as.factor(df$group)
as.Date()
Convert a string to a Date object
as.Date("2024-03-15")
is.na()
Test for missing values; returns logical vector
sum(is.na(df$score))
is.numeric(), is.character()
Test type of an object
is.numeric(df$age)
Vectors & Sequences
Creating & Working with Vectors
# Create vectors
x <- c(1, 2, 3, 4, 5)
words <- c("a", "b", "c")
# Sequences
1:10                         # integers 1 to 10
seq(0, 1, by = 0.1)          # 0.0 0.1 0.2 ... 1.0
seq(0, 100, length.out = 5)  # exactly 5 evenly spaced values
rep(0, times = 5)            # 0 0 0 0 0
rep(c("A", "B"), each = 3)   # A A A B B B

# Indexing (1-based)
x[2] # second element
x[x > 3] # elements greater than 3
x[c(1, 3)] # first and third elements
x[-1] # everything except the first
Numeric & Summary Functions
Function
What It Does
Example
sum()
Sum of all values
sum(x, na.rm = TRUE)
mean()
Arithmetic mean
mean(x, na.rm = TRUE)
median()
Median value
median(x, na.rm = TRUE)
sd() / var()
Standard deviation / variance
sd(x, na.rm = TRUE)
min() / max()
Smallest or largest value
max(x, na.rm = TRUE)
range()
Returns c(min, max)
range(x)
quantile()
Percentiles
quantile(x, probs = c(0.25, 0.75))
cumsum() / cumprod()
Cumulative sum or product
cumsum(x)
diff()
Lagged differences
diff(x)
abs()
Absolute value
abs(-5)
round() / ceiling() / floor()
Rounding
round(3.567, 2) returns 3.57
log() / log10() / exp()
Logarithms and exponentiation
log(x) (natural log)
sqrt()
Square root
sqrt(16)
Data Frames
Creating & Subsetting Data Frames
# Create a data frame
df <- data.frame(
id = 1:3,
name = c("Alice", "Bob", "Carol"),
score = c(85, 92, 78)
)
# Access a column (three equivalent ways)
df$score
df[, "score"]
df[, 3]
# Subset rows
df[df$score > 80, ] # rows where score > 80
df[1:2, ]            # first two rows

# Subset rows and columns together
df[df$score > 80, c("id", "name")]
# Add a new column
df$grade <- ifelse(df$score >= 90, "A", "B")
# Remove a column
df$grade <- NULL
Logic & Control Flow
Expression / Function
What It Does
Example
==, !=, <, >, <=, >=
Comparison operators
x == 5
&, |, !
Element-wise AND, OR, NOT
x > 2 & x < 8
&&, ||
Scalar AND / OR (for single TRUE/FALSE values)
if (a > 0 && b > 0)
%in%
Test membership in a set
x %in% c(1, 3, 5)
ifelse()
Vectorised if-else
ifelse(score >= 60, "pass", "fail")
if () {} else {}
Standard conditional (single value)
if (n > 0) { ... } else { ... }
for (i in x) {}
Loop over elements of a vector
for (i in 1:10) { print(i) }
while () {}
Loop while a condition is TRUE
while (x < 100) { x <- x * 2 }
next / break
Skip to next iteration or exit a loop
if (is.na(x)) next
Writing Functions
Function Syntax
# Basic function
add <- function(x, y) {
x + y # last evaluated expression is returned
}
add(3, 4)  # 7

# Default argument values
greet <- function(name, greeting = "Hello") {
paste(greeting, name)
}
greet("Alice") # "Hello Alice"greet("Bob", "Hi") # "Hi Bob"# Explicit return (use when returning early)
safe_log <- function(x) {
if (x <= 0) return(NA)
log(x)
}
Apply Functions
The apply family lets you perform the same operation across rows, columns, or list elements without writing an explicit loop.
Function
What It Does
Example
apply()
Apply a function over rows (1) or columns (2) of a matrix or data frame
apply(df, 2, mean): column means
lapply()
Apply a function to each element of a list; returns a list
lapply(my_list, summary)
sapply()
Like lapply() but simplifies the result to a vector or matrix if possible
sapply(df, class)
tapply()
Apply a function to subgroups defined by a factor
tapply(df$score, df$group, mean)
Map()
Apply a function to corresponding elements of multiple lists
Map("+", list_a, list_b)
Reduce()
Cumulatively apply a function across a list (fold)
Reduce("+", list(1, 2, 3)) returns 6
String Functions
Function
What It Does
Example
paste()
Concatenate strings with a separator (default: space)
paste("ID", 1:3)
Statistics & Distributions
Function
What It Does
Example
glm()
Generalized linear model (logistic, Poisson, etc.)
glm(y ~ x, family = binomial, data = df)
t.test()
One- or two-sample t-test
t.test(score ~ group, data = df)
chisq.test()
Chi-squared test of independence
chisq.test(table(df$a, df$b))
rnorm() / runif()
Draw random samples from normal or uniform distributions
rnorm(100, mean = 0, sd = 1)
set.seed()
Set the random number seed for reproducibility
set.seed(42)
dnorm() / pnorm() / qnorm()
Density, CDF, and quantile of the normal distribution
pnorm(1.96) returns 0.975
How to Read This Table:
Each row shows the same operation written three ways. All three produce equivalent results. data.table is fastest and most memory-efficient for large datasets: ideal for administrative records, claims data, and large cohort files. Tidyverse reads closest to plain English and is widely used in teaching materials and Stack Overflow answers. Base R requires no packages and works in any environment.
Two operations underpin almost every multi-source analysis: joining tables on a shared key, and reshaping a table between wide and long formats. This section covers both in depth, with worked examples across Base R, dplyr, and data.table, and a practical guide to diagnosing problems before and after a join.
Join Types
A join combines rows from two tables based on matching values in one or more key columns. The choice of join type determines which rows appear in the result when the keys do not match perfectly.
Join Type
Rows Kept
Typical Use
Left join
All rows from the left table; matched data from the right where available; NA where no match
Adding characteristics from a lookup table while keeping every row in the primary dataset
Inner join
Only rows with a match in both tables
Restricting analysis to observations with complete data across both sources
Full join
All rows from both tables; NA wherever a match is missing on either side
Auditing two datasets for overlap and discrepancies
Anti join
Rows in the left table with no match in the right
Finding records that failed a linkage, or identifying controls not in the treatment file
Semi join
Rows in the left table that have a match in the right, without adding any right-table columns
Filtering a dataset to only those IDs that appear in a second file, without duplicating columns
Check Your Keys Before Joining:
Always verify that your key column is unique in at least one of the two tables before joining. A many-to-many join (duplicate keys on both sides) silently multiplies rows and is almost never intended. Use anyDuplicated(df$id) or df |> count(id) |> filter(n > 1) to check.
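For reference alongside the dplyr and data.table sections that follow, the same join types can be written in Base R with merge() and %in%. A sketch using two small hypothetical tables:

```r
patients  <- data.frame(patient_id = c(1, 2, 3), age = c(40, 55, 62))
insurance <- data.frame(patient_id = c(1, 3), plan = c("Private", "Medicaid"))

# Left join: all patients, NA plan where unmatched
merge(patients, insurance, by = "patient_id", all.x = TRUE)

# Inner join (merge default): only matched rows
merge(patients, insurance, by = "patient_id")

# Full join: all rows from both sides
merge(patients, insurance, by = "patient_id", all = TRUE)

# Anti join: patients with no insurance record
patients[!(patients$patient_id %in% insurance$patient_id), ]

# Semi join: patients that do have a record, without adding columns
patients[patients$patient_id %in% insurance$patient_id, ]
```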
Joining with dplyr
dplyr Join Functions
library(dplyr)
# Left join: keep all rows in df_patients, add columns from df_insurance
df_joined <- left_join(df_patients, df_insurance, by = "patient_id")
# Inner join: only patients who appear in both tables
df_matched <- inner_join(df_patients, df_labs, by = "patient_id")
# Join on columns with different names in each table
df_joined <- left_join(df_claims, df_providers,
by = c("provider_npi" = "npi"))
# Join on multiple keys (patient + visit date must both match)
df_joined <- left_join(df_vitals, df_meds,
by = c("patient_id", "visit_date"))
# Anti join: patients in df_enrolled with no matching lab result
df_missing_labs <- anti_join(df_enrolled, df_labs, by = "patient_id")
# Semi join: filter df_patients to only those with a pharmacy claim
df_with_rx <- semi_join(df_patients, df_pharmacy, by = "patient_id")
Joining with data.table
data.table joins use the bracket syntax with an on argument, or the merge() function. Setting keys first with setkey() speeds up repeated joins on large tables.
data.table Join Syntax
library(data.table)
setDT(df_patients); setDT(df_insurance)
# Left join using bracket syntax (X[Y] is a right join; Y[X] gives left)
df_joined <- df_insurance[df_patients, on = "patient_id"]
# Left join using merge() - syntax matches base R
df_joined <- merge(df_patients, df_insurance,
by = "patient_id", all.x = TRUE)
# Inner join
df_matched <- merge(df_patients, df_labs, by = "patient_id")
# Full join
df_full <- merge(df_patients, df_labs,
by = "patient_id", all = TRUE)
# Anti join: rows in patients not matched in labs
df_missing <- df_patients[!df_labs, on = "patient_id"]
# Join on columns with different names
df_joined <- merge(df_claims, df_providers,
by.x = "provider_npi", by.y = "npi", all.x = TRUE)
# Set key for fast repeated lookups
setkey(df_insurance, patient_id)
df_joined <- df_insurance[df_patients, on = "patient_id"]
Diagnosing Join Problems
The most common join errors are silent: the operation succeeds but the row count or column values are wrong. These checks catch the most frequent problems before they propagate downstream.
Pre-Join Checks
# 1. Check for duplicate keys on the right (lookup) table
# If n > 1 for any id, a left join will expand your row count unexpectedly
df_insurance |>
count(patient_id) |>
filter(n > 1)
# 2. Check that key columns have the same type in both tables
# Joining character "001" to integer 1 silently produces zero matches
class(df_patients$patient_id)
class(df_insurance$patient_id)
# 3. Check for NA values in the key column
sum(is.na(df_patients$patient_id))
sum(is.na(df_insurance$patient_id))
# 4. Preview key overlap between the two tables
n_left <- n_distinct(df_patients$patient_id)
n_right <- n_distinct(df_insurance$patient_id)
n_shared <- n_distinct(intersect(df_patients$patient_id,
df_insurance$patient_id))
cat(sprintf("Left: %d Right: %d Shared: %d\n", n_left, n_right, n_shared))
Post-Join Checks
# After a left join, row count should equal the left table exactly
stopifnot(nrow(df_joined) == nrow(df_patients))
# Count how many left-table rows had no match (NAs in a right-table column)
sum(is.na(df_joined$insurance_type))
# Full audit: left-only, matched, and right-only rows
df_audit <- full_join(
df_patients |> mutate(in_patients = TRUE),
df_insurance |> mutate(in_insurance = TRUE),
by = "patient_id"
)
table(left_only = is.na(df_audit$in_insurance),
right_only = is.na(df_audit$in_patients))
Reshaping: Wide and Long Formats
Data arrives in two common shapes. Wide format has one row per subject and one column per time point or variable. Long format has one row per observation, with a column identifying which variable or time point each row represents. Most R modelling and plotting functions expect long format; most data entry and reporting tools produce wide format.
Format
Shape
When You Have It
When You Need It
Wide
Many columns, fewer rows
Survey exports, lab panels with one column per test, repeated-measures spreadsheets
Reporting tables, cross-tabulations, some time-series packages
Long
Fewer columns, many rows
Electronic health records, claims files, relational databases
ggplot2, most modelling functions, grouped summaries
Wide to Long: pivot_longer()
pivot_longer() — tidyr
library(tidyr)
# Wide: one row per patient, columns week1 through week4
# patient_id | week1 | week2 | week3 | week4
# 001        |    82 |    85 |    80 |    88
df_long <- df_wide |>
pivot_longer(
cols = starts_with("week"), # columns to stack
names_to = "week", # new column holding the old column names
values_to = "sbp"          # new column holding the values
)
# Result: patient_id | week  | sbp
#         001        | week1 | 82
#         001        | week2 | 85 ...
# Strip the "week" prefix to leave just a number, and coerce to integer
df_long <- df_wide |>
pivot_longer(
cols = starts_with("week"),
names_to = "week",
names_prefix = "week",
names_transform = list(week = as.integer),
values_to = "sbp"
)
# Stack two value types at once (sbp and dbp both measured weekly)
df_long <- df_wide |>
pivot_longer(
cols = matches("^(sbp|dbp)_week"),
names_to = c(".value", "week"), # .value routes sbp and dbp to separate columns
names_sep = "_week"
)
Long to Wide: pivot_wider()
pivot_wider() — tidyr
# Long: one row per patient-week
# patient_id | week  | sbp
# 001        | week1 | 82
df_wide <- df_long |>
pivot_wider(
names_from = week, # column whose values become new column names
values_from = sbp # column whose values fill those new columns
)
# Spread two value columns simultaneously
df_wide <- df_long |>
pivot_wider(
names_from = week,
values_from = c(sbp, dbp) # creates sbp_week1, dbp_week1, sbp_week2 ...
)
# Duplicate keys cause list-columns: summarise first, then widen
df_wide <- df_long |>
group_by(patient_id, week) |>
summarise(sbp = mean(sbp, na.rm = TRUE), .groups = "drop") |>
pivot_wider(names_from = week, values_from = sbp)
Reshaping with data.table: melt() and dcast()
melt() and dcast() — data.table
library(data.table)
# Wide to long: melt()
DT_long <- melt(DT_wide,
id.vars = "patient_id",
measure.vars = c("week1", "week2", "week3", "week4"),
variable.name = "week",
value.name = "sbp"
)
# Melt two value types simultaneously using patterns()
DT_long <- melt(DT_wide,
measure.vars = patterns("^sbp", "^dbp"),
variable.name = "week",
value.name = c("sbp", "dbp")
)
# Long to wide: dcast()
# Formula: rows ~ columns; value.var is the column to spread
DT_wide <- dcast(DT_long,
patient_id ~ week,
value.var = "sbp"
)
# Aggregate while casting (mean sbp per patient per week)
DT_wide <- dcast(DT_long,
patient_id ~ week,
value.var = "sbp",
fun.aggregate = mean,
na.rm = TRUE
)
Binding Rows and Columns
Binding stacks or places tables side by side without matching on a key. Row binding requires matching column names; column binding requires matching row counts.
Row and Column Binding
# Stack two tables with the same columns (e.g., two annual extracts)
# dplyr fills missing columns with NA rather than throwing an error
df_combined <- dplyr::bind_rows(df_2023, df_2024)
# Stack a list of many tables at once, adding a source label column
list_of_dfs <- list(df_2021, df_2022, df_2023, df_2024)
df_combined <- dplyr::bind_rows(list_of_dfs, .id = "year_src")
# data.table equivalent (faster for large tables)
DT_combined <- rbindlist(list(DT_2021, DT_2022, DT_2023), idcol = "year_src")
# fill = TRUE adds NA for columns missing in some tables
DT_combined <- rbindlist(list_of_DTs, use.names = TRUE, fill = TRUE)
# Column binding: place tables side by side (rows must already correspond)
df_combined <- dplyr::bind_cols(df_demographics, df_outcomes)
Prefer Joins Over Column Binding:
bind_cols() and cbind() assume rows in the two tables are in the same order and correspond to the same subjects. This assumption fails silently if either table has been sorted, filtered, or subsetted. A left_join() on an explicit key is almost always safer.
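A small demonstration of this failure mode, using Base R and hypothetical data. The outcomes table arrives sorted differently from the demographics table, so positional binding silently pairs the wrong subjects while a keyed join does not:

```r
demo <- data.frame(id = c(1, 2, 3), sex = c("F", "M", "F"))
out  <- data.frame(id = c(3, 1, 2), died = c(TRUE, FALSE, FALSE))  # different row order!

# Column binding pairs row 1 with row 1 regardless of id -> wrong subjects matched
wrong <- cbind(demo, out[, "died", drop = FALSE])

# Joining on the key pairs each subject correctly
right <- merge(demo, out, by = "id")
```

In `wrong`, subject 1 is paired with subject 3's outcome; in `right`, every `died` value sits next to its own `id`.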
R offers several formats for persisting data between sessions. Choosing the right one depends on whether you need to share a single object or a whole collection, whether the file must be readable outside R, and how large the dataset is. This section covers each format, when to use it, and the practical tradeoffs between them.
Format Comparison
Format
Extension
Saves
R-Only?
Best For
saveRDS() / readRDS()
.rds
One object
Yes
Saving a single cleaned dataset, model, or list between scripts. The object can be loaded under any name.
save() / load()
.RData
Named objects (one or many)
Yes
Checkpointing a set of related objects mid-analysis. Objects are restored under their original names.
save.image() / load()
.RData
Entire workspace
Yes
Generally not recommended. Creates implicit, hard-to-audit dependencies. Avoid for reproducible work.
write.csv() / read.csv()
.csv
One tabular object
No
Sharing data with collaborators in Excel, Python, Stata, or any other tool. Universal but slow and loses column type information.
write_parquet() / read_parquet()
.parquet
One tabular object
No
Large datasets shared with Python (pandas, polars) or cloud pipelines. Columnar storage; fast and compact. Requires the arrow package.
write_fst() / read_fst()
.fst
One data frame or data.table
Near-R-only
Fastest read/write for R-to-R workflows on large tabular data. Supports random column access. Requires the fst package.
write.xlsx() / read_excel()
.xlsx
One or more sheets
No
When a collaborator or system requires .xlsx and CSV is not accepted. Avoid for intermediate analysis storage.
RDS: Saving Individual Objects
saveRDS() and readRDS() are the recommended default for saving any single R object. Unlike save(), the object is not bound to its original variable name on load, which makes it easier to use in different scripts without name collisions.
saveRDS() and readRDS()
# Save one object to disk
saveRDS(df_clean, file = "data/clean/df_clean.rds")
saveRDS(model_logit, file = "output/model_logit.rds")
# Load it back — assign to any name you choose
df_clean <- readRDS("data/clean/df_clean.rds")
model_final <- readRDS("output/model_logit.rds") # original name not required
# RDS preserves all R attributes: factor levels, column types, class, etc.
# A data.table saved with saveRDS() is still a data.table on load.
# A list, model object, or ggplot is preserved exactly as saved.
Use RDS as Your Default:
For any intermediate or final R object that does not need to be opened in another tool, saveRDS() is the safest and most explicit choice. It saves exactly one thing, forces you to name it explicitly on load, and preserves all R-specific attributes such as factor levels, ordered factors, and object class.
RData: Saving Multiple Named Objects
save() stores multiple R objects in a single file. When load() reads the file, each object reappears in the environment under its original name. This is useful for checkpointing a set of related results, but requires discipline: the names are baked into the file, so loading into a session that already has objects with those names will silently overwrite them.
save() and load()
# Save a specific set of objects into one file
save(df_clean, df_joined, model_lm,
file = "output/checkpoint_01.RData")
# Restore all of them at once — names are fixed to what was saved
load("output/checkpoint_01.RData") # df_clean, df_joined, model_lm appear in environment
# Check what a .RData file contains before loading it
load("output/checkpoint_01.RData", verbose = TRUE)
# Safer pattern: load into a new environment to inspect before exposing to global env
checkpoint <- new.env()
load("output/checkpoint_01.RData", envir = checkpoint)
ls(checkpoint) # see what it contains
df_clean <- checkpoint$df_clean # pull out only what you need
Workspace Saving: What to Avoid and Why
RStudio prompts you to save your workspace when you close a session. The default file is .RData in your working directory. Accepting this prompt is one of the most common reproducibility mistakes in R.
Why Workspace Saving Causes Problems
# save.image() writes every object in the current environment to .RData
save.image() # saves to .RData in the working directory
save.image(file = "session_backup.RData") # explicit filename
# The problem: .RData loads silently every time R starts in that directory.
# Objects from old, deleted, or changed scripts persist invisibly.
# Code appears to work only because an old object is in memory,
# not because the script that creates it still runs correctly.
# The fix: turn off automatic workspace saving in RStudio.
# Tools > Global Options > General:
#   "Save workspace to .RData on exit" -> set to Never
#   "Restore .RData into workspace at startup" -> uncheck
# Then start each session clean and source the scripts that rebuild your objects.
# If rebuilding takes too long, save intermediate objects explicitly with saveRDS().
The Blank Slate Principle:
A reproducible analysis is one that produces the same results when run from a blank R session on a machine that has never seen the data before. If your code relies on objects in .RData rather than on scripts that create those objects, it fails this test. Disable automatic workspace saving and use saveRDS() for any intermediate results that are expensive to recompute.
CSV: Universal Plain-Text Exchange
CSV is the safest format for sharing tabular data with any other tool. It is slow, verbose, and does not preserve column types, but it opens in Excel, Python, Stata, SAS, and any text editor. Use it as a delivery format, not an intermediate storage format.
Reading and Writing CSV
# Base R (slow; adds row names by default unless suppressed)
write.csv(df, "output/results.csv", row.names = FALSE)
df <- read.csv("data/file.csv", stringsAsFactors = FALSE)
# readr (fast; prints column type guesses; returns a tibble)
library(readr)
readr::write_csv(df, "output/results.csv") # no row names by default
df <- readr::read_csv("data/file.csv", show_col_types = FALSE)
# data.table (fastest; handles large files well)
library(data.table)
data.table::fwrite(DT, "output/results.csv") # very fast; no row names
DT <- data.table::fread("data/file.csv") # auto-detects delimiter and types
# Preserve a date column across CSV round-trips by formatting explicitly
df$date <- format(df$date, "%Y-%m-%d") # write as ISO string
df$date <- as.Date(df$date) # parse back after reading
Parquet: Fast Cross-Language Storage
Parquet is a columnar binary format supported natively by Python (pandas, polars), Spark, DuckDB, and cloud storage services. It preserves column types, compresses well, and reads far faster than CSV for large files. The arrow package provides the R interface.
arrow: write_parquet() and read_parquet()
install.packages("arrow") # install once
library(arrow)
# Write a data frame or data.table to parquet
arrow::write_parquet(df_clean, "data/clean/df_clean.parquet")
# Read back (returns a tibble by default)
df_clean <- arrow::read_parquet("data/clean/df_clean.parquet")
# Specify only the columns you need (parquet reads column-by-column,
# so selecting columns avoids reading unused data from disk entirely)
df_subset <- arrow::read_parquet(
"data/clean/df_clean.parquet",
col_select = c("patient_id", "age", "outcome")
)
# Convert a data.table to data frame before writing if arrow warns about class
arrow::write_parquet(as.data.frame(DT), "output/DT.parquet")
fst: Fastest R-to-R Binary Format
The fst package provides the fastest read and write speeds available for tabular data in R, often ten times faster than fread() on large files. It also supports random column access, meaning you can read a subset of columns without loading the full file. The format is not widely supported outside R, so use it for intermediate objects in pure-R pipelines.
fst: write_fst() and read_fst()
install.packages("fst") # install once
library(fst)
# Write (accepts data frames and data.tables)
fst::write_fst(df_clean, "data/clean/df_clean.fst")
# Compress (0 = none, 100 = max; default 50 is a good balance)
fst::write_fst(df_clean, "data/clean/df_clean.fst", compress = 75)
# Read the full file
df_clean <- fst::read_fst("data/clean/df_clean.fst")
# Read only specific columns (very fast; no other columns are read from disk)
df_sub <- fst::read_fst("data/clean/df_clean.fst",
columns = c("patient_id", "age", "outcome"))
# Read back as a data.table directly
library(data.table)
DT <- as.data.table(fst::read_fst("data/clean/df_clean.fst"))
Excel: When CSV Is Not an Option
Use Excel format when a collaborator or system requires .xlsx specifically and CSV is not acceptable. For reading Excel files into R, readxl is reliable and requires no Java dependency. For writing, writexl is fast and lightweight; openxlsx supports formatting, multiple sheets, and styled headers when the output format is prescribed.
Reading and Writing Excel Files
# Reading Excel files
install.packages("readxl")
library(readxl)
df <- readxl::read_excel("data/file.xlsx") # first sheet by default
df <- readxl::read_excel("data/file.xlsx", sheet = "Sheet2")
df <- readxl::read_excel("data/file.xlsx", skip = 2, na = "NA")
readxl::excel_sheets("data/file.xlsx") # list all sheet names
# Writing Excel files: writexl (no Java; single or multiple sheets)
install.packages("writexl")
writexl::write_xlsx(df, "output/results.xlsx") # single sheet
writexl::write_xlsx(list(Summary = df_summary, Detail = df_detail),
                    "output/report.xlsx") # multiple sheets; names become tab labels
# openxlsx: styled output, formatted headers, bold rows
install.packages("openxlsx")
library(openxlsx)
wb <- createWorkbook()
addWorksheet(wb, "Results")
writeData(wb, "Results", df_summary, headerStyle = createStyle(textDecoration = "bold"))
saveWorkbook(wb, "output/report.xlsx", overwrite = TRUE)
Choosing a Format
Decision Guide
# Saving one cleaned dataset for use in the next script?
#   -> saveRDS() [default choice; preserves all attributes]
# Saving several related objects (model + data + metadata) as a checkpoint?
#   -> save() [convenient; names are restored on load]
# Large tabular file that only needs to be read back into R?
#   -> write_fst() [fastest read/write; random column access]
# Large tabular file shared with Python, Spark, or a cloud pipeline?
#   -> write_parquet() [cross-language; typed; compressed; widely supported]
# Sharing data with a collaborator using Excel, Stata, or SAS?
#   -> write_csv() / fwrite() [universal; opens in any tool; loses types]
# Collaborator or system requires .xlsx and CSV is not accepted?
#   -> writexl::write_xlsx() or openxlsx [Excel-native; multiple sheets]
# Closing RStudio and asked to save workspace?
#   -> No. Turn this off in Tools > Global Options > General.
File Paths and Project Portability
Hard-coded absolute paths break when a project is moved to a new machine or shared with a collaborator. The here package constructs paths relative to the project root, making all file references portable without any setup.
here::here() for Portable Paths
install.packages("here") # install once
library(here)
# here::here() always resolves relative to the .Rproj file location,
# regardless of where the calling script lives in the project folder
saveRDS(df_clean, here::here("data", "clean", "df_clean.rds"))
df_clean <- readRDS(here::here("data", "clean", "df_clean.rds"))
readr::write_csv(df, here::here("output", "results.csv"))
arrow::write_parquet(df, here::here("data", "clean", "df.parquet"))
# here() builds the path from multiple arguments, joining with the OS separator
# On any machine: /path/to/project/data/clean/df_clean.rds
# No setwd() needed; no broken absolute paths.
Recommended Folder Convention:
Keep raw source files in data/raw/ and treat them as read-only. Write all processed or cleaned objects to data/clean/. Write all final outputs (tables, figures, reports) to output/. This separation makes it unambiguous which files can be regenerated by scripts and which are irreplaceable originals.
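This skeleton can be created once from R itself. A minimal sketch; the folder names simply follow the convention described above:

```r
# Create the conventional project folders if they do not already exist
for (d in c("data/raw", "data/clean", "output")) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
```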
10. Variable Types & Regression Analysis
This section covers how to assign and verify variable types in R, fit linear and logistic regression models with lm() and glm(), and interpret the output that summary() returns. These are the most common modelling steps in public health data analysis.
Assigning Variable Types
R stores data in different types depending on what the values represent. Getting types right before modelling matters: a variable stored as character will be silently dropped; a numeric code stored as numeric instead of factor will be treated as continuous when it should be categorical.
Date
Calendar dates: admission date, date of birth. Enables date arithmetic.
df$dob <- as.Date(df$dob,
format = "%Y-%m-%d")
class(df$dob)
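A short sketch of inspecting and fixing types before modelling. The tiny data frame and its columns are hypothetical; the pattern is the point:

```r
# Hypothetical data: insurance arrives as a numeric code, dob as character
df <- data.frame(age = c(40, 55),
                 insurance = c(1, 2),
                 dob = c("1984-02-01", "1969-07-15"))

sapply(df, class)                        # inspect every column's type at once

df$insurance <- as.factor(df$insurance)  # numeric code -> categorical
df$dob <- as.Date(df$dob, format = "%Y-%m-%d")

sapply(df, class)                        # insurance is now factor, dob is Date
```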
Factors: Reference Levels and Coding
In regression, R uses the first level of a factor as the reference (baseline) category. You should set this deliberately rather than accepting the alphabetical default.
Setting and Checking Factor Levels
# Check current levels (first = reference in regression)
levels(df$insurance)
# e.g. "Medicaid" "Medicare" "Private" "Uninsured"
# Set a specific reference level
df$insurance <- relevel(df$insurance, ref = "Private")
# Verify: Private is now first
levels(df$insurance)
# Relabel levels: assign a character vector in the same order as levels(df$educ)
# (assumes current levels are "lt_hs", "hs", "some_col", "col_plus")
levels(df$educ) <- c(
  "Less than high school",
  "High school / GED",
  "Some college",
  "College or above"
)
Always Inspect Types Before Modelling:
Run str(df) or sapply(df, class) before fitting any model. Numeric codes for categorical variables (e.g., 1, 2, 3 for insurance type) will be treated as continuous unless converted to factors. This is one of the most common sources of silent errors in public health analyses.
Linear Regression: lm()
Use lm() when your outcome is a continuous variable (blood pressure, BMI, length of stay, a cost measure). The formula syntax is outcome ~ predictor1 + predictor2.
Fitting a Linear Model
# Fit the model
model_lm <- lm(sbp ~ age + as.factor(insurance) + bmi,
data = df)
# View full results
summary(model_lm)
# Confidence intervals for coefficients
confint(model_lm)
# Add fitted values and residuals to the data frame
df$fitted <- fitted(model_lm)
df$residual <- residuals(model_lm)
# Basic residual diagnostics (4 plots)
par(mfrow = c(2, 2))
plot(model_lm)
Common Formula Operators
Syntax
Meaning
Example
y ~ x
Simple regression of y on x
lm(sbp ~ age)
y ~ x1 + x2
Multiple regression; additive terms
lm(sbp ~ age + bmi)
y ~ x1 * x2
Main effects plus interaction term
lm(sbp ~ age * insurance)
y ~ x1 + x1:x2
Main effect of x1 plus interaction only (no main effect of x2)
lm(sbp ~ age + age:insurance)
y ~ I(x^2)
Arithmetic inside I(); adds a squared term
lm(sbp ~ age + I(age^2))
y ~ .
All other columns in the data frame as predictors
lm(sbp ~ ., data = df)
y ~ . - x
All columns except x
lm(sbp ~ . - id, data = df)
y ~ 0 + x
Suppress the intercept
lm(sbp ~ 0 + insurance)
Logistic Regression: glm() with Binomial Family
Use glm(family = binomial) when your outcome is binary: died / survived, readmitted / not, disease present / absent. The model estimates log-odds; exponentiating the coefficients gives odds ratios.
Fitting a Logistic Model
# Outcome must be 0/1 numeric or a two-level factor
df$readmit <- as.integer(df$readmit_30day == "Yes")
# Fit the model
model_logit <- glm(readmit ~ age + insurance + n_comorbidities,
data = df,
family = binomial(link = "logit"))
# View results (log-odds scale)
summary(model_logit)
# Odds ratios and 95% CI
exp(coef(model_logit))    # odds ratios
exp(confint(model_logit)) # 95% CI on OR scale
# Predicted probabilities for each observation
df$pred_prob <- predict(model_logit,
                        type = "response")
Probit and Other Links:
The binomial family accepts other link functions. Use link = "probit" for a probit model or link = "cloglog" for a complementary log-log model. For Poisson count outcomes (e.g., number of ED visits), use family = poisson(link = "log").
Reading summary() Output
For lm(): Linear Regression Output
Annotated lm() summary() Output
# Call:
# lm(formula = sbp ~ age + insurance + bmi, data = df)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
#  -28.41   -6.12   -0.44    5.98   31.07
# ^ Residuals should be roughly symmetric around 0.
#   A large Max vs Min asymmetry suggests outliers.
#
# Coefficients:
#                    Estimate Std. Error t value Pr(>|t|)
# (Intercept)           82.14       4.21   19.51   <2e-16 ***
# age                    0.43       0.06    7.18  1.2e-12 ***
# insuranceMedicaid      3.81       1.14    3.34   0.0009 ***
# insuranceMedicare      1.92       1.08    1.78   0.0756 .
# insuranceUninsured     5.60       1.31    4.27  2.3e-05 ***
# bmi                    0.71       0.09    7.89  7.4e-15 ***
# ^ Estimate: the coefficient.
#   For numeric predictors: change in outcome per 1-unit increase.
#   For factor levels: difference vs. the reference level (Private).
# ^ Std. Error: uncertainty around the estimate.
# ^ t value: Estimate / Std. Error.
# ^ Pr(>|t|): p-value; probability of this t-value under H0.
# ^ Signif. codes: *** p<.001  ** p<.01  * p<.05  . p<.1
#
# Residual standard error: 9.83 on 1194 degrees of freedom
# ^ Typical size of prediction error in outcome units (mmHg here).
# Multiple R-squared: 0.213, Adjusted R-squared: 0.210
# ^ R²: proportion of outcome variance explained by the model.
#   Adjusted R² penalises for number of predictors; use this one.
# F-statistic: 64.3 on 5 and 1194 DF, p-value: < 2.2e-16
# ^ Tests whether the model as a whole explains more than chance.
For glm(): Logistic Regression Output
Annotated glm() summary() Output
# Coefficients:
#                   Estimate Std. Error z value Pr(>|z|)
# (Intercept)         -2.841      0.312   -9.11   <2e-16 ***
# age                  0.027      0.006    4.50  6.8e-06 ***
# insuranceMedicaid    0.441      0.142    3.11   0.0019 **
# insuranceMedicare    0.198      0.139    1.42   0.1549
# n_comorbidities      0.312      0.041    7.61  2.8e-14 ***
# ^ Estimates are LOG-ODDS (logit scale), not probabilities.
#   Positive = higher odds of the outcome; negative = lower odds.
#   Use exp(coef()) to convert to odds ratios.
# ^ z value replaces t value; interpretation is the same.
#
# Null deviance:     1284.3 on 1199 degrees of freedom
# Residual deviance: 1091.7 on 1195 degrees of freedom
# ^ Null deviance: fit of intercept-only model.
#   Residual deviance: fit of your model.
#   Larger reduction = better model fit.
# AIC: 1101.7
# ^ Lower AIC = better fit (penalised for complexity).
#   Use AIC to compare models on the same data.
Interpreting Coefficients
Model
Predictor Type
Coefficient Represents
Practical Interpretation
lm()
Continuous (e.g., age)
Change in outcome per 1-unit increase in predictor, holding others constant
Age coefficient = 0.43: each additional year of age is associated with 0.43 mmHg higher systolic BP, adjusted for insurance and BMI
lm()
Factor (e.g., insurance)
Difference in outcome vs. the reference level, holding others constant
Medicaid coefficient = 3.81: Medicaid patients have systolic BP 3.81 mmHg higher on average than Private patients with the same age and BMI
glm() binomial
Continuous
Change in log-odds per 1-unit increase; exp(coef) gives the odds ratio
Age coefficient = 0.027; OR = exp(0.027) = 1.027: each additional year of age is associated with 2.7% higher odds of readmission
glm() binomial
Factor
Log-odds difference vs. reference level; exp(coef) gives the odds ratio
Medicaid coefficient = 0.441; OR = exp(0.441) = 1.55: Medicaid patients have 55% higher odds of readmission compared to Private patients, adjusted for age and comorbidities
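The conversions in the table are plain arithmetic. Using the illustrative coefficients from the annotated output above:

```r
# Continuous predictor: log-odds 0.027 per year -> odds ratio per year
exp(0.027)       # ~1.027: about 2.7% higher odds per additional year

# Factor level: log-odds difference 0.441 vs. the reference level
exp(0.441)       # ~1.55: about 55% higher odds than the reference group

# A 10-year age difference multiplies the odds by exp(10 * 0.027)
exp(10 * 0.027)  # ~1.31
```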
Odds Ratios Are Not Risk Ratios:
An odds ratio of 1.55 does not mean Medicaid patients are 55% more likely to be readmitted. It means their odds are 55% higher. When the outcome is common (prevalence above roughly 10%), odds ratios overstate the relative risk. For common binary outcomes, consider using a log-binomial model (family = binomial(link = "log")) or a Poisson model with robust standard errors to estimate risk ratios directly.
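To see how much an odds ratio can overstate relative risk, the approximate conversion of Zhang & Yu (1998) is useful: RR = OR / (1 - p0 + p0 × OR), where p0 is the outcome risk in the reference group. A sketch (the helper function `or_to_rr` is introduced here for illustration, not part of base R):

```r
# Approximate OR -> RR conversion (Zhang & Yu, 1998), given baseline risk p0
or_to_rr <- function(or, p0) or / (1 - p0 + p0 * or)

or_to_rr(1.55, p0 = 0.05)  # rare outcome: RR stays close to the OR (~1.51)
or_to_rr(1.55, p0 = 0.30)  # common outcome: RR noticeably smaller (~1.33)
```

Note this conversion is itself an approximation; for formal estimates, fit a risk-ratio model directly as described above.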
Extracting Results Programmatically
Tidy Model Output with broom
The broom package converts model output into tidy data frames, making it easy to plot coefficients or export results.
library(broom)
# Coefficients table as a data frame
tidy(model_logit)
tidy(model_logit, conf.int = TRUE, exponentiate = TRUE)
# ^ exponentiate = TRUE gives odds ratios directly
# Model-level statistics (R², AIC, df, etc.)
glance(model_lm)
glance(model_logit)
# Observation-level: fitted values, residuals, influence stats
augment(model_lm, data = df)
augment(model_logit, data = df, type.predict = "response")
# ^ .fitted column gives predicted probabilities for glm
Install broom:
install.packages("broom"). It is part of the tidyverse meta-package so it is already installed if you have run install.packages("tidyverse").
11. Next Steps
With R, RStudio, and your core packages installed, you have a working statistical computing environment. The resources below provide the most reliable paths to building further fluency.
Resource
Focus
Where
R for Data Science (Wickham, Cetinkaya-Rundel & Grolemund)
Data import, transformation, visualisation, and communication with the tidyverse
Free online at r4ds.hadley.nz
Consider using the renv package from the start of any project. It records the exact package versions used in a project lockfile, making your analyses reproducible across machines and over time. Install with install.packages("renv") and initialize with renv::init().