R & RSTUDIO  ·  QUICKSTART GUIDE  ·  INSTALLATION, PACKAGES & LIBRARIES

R & RStudio Quickstart Guide

Installation, Package Management & Libraries · macOS and Windows

1. Overview


R is a statistical computing language and environment, widely used in data analysis, visualization, and reproducible research. RStudio (distributed by Posit) is the most widely adopted integrated development environment (IDE) for R: it provides a code editor, console, environment viewer, and plot panel in a single interface.

Install R first, then RStudio. RStudio detects your R installation automatically and will not run without it.

Before You Begin:
You will need an internet connection and administrator access on your machine. The total installation typically takes under 10 minutes.

What Each Component Does

Component | Role | Where to Get It
R | The language engine and base libraries. Required. | cran.r-project.org
RStudio | The IDE. Provides the interface you work in daily. | posit.co/download
CRAN Packages | Community libraries that extend R's functionality. | Installed from within R or RStudio

2. Installing R


R is distributed through the Comprehensive R Archive Network (CRAN). Always install the latest stable release unless a project requires a specific version.

macOS Installation
  1. Go to cran.r-project.org and click Download R for macOS.

  2. Select the correct installer for your chip. Choose the Apple Silicon package (arm64) for M1, M2, M3, or M4 Macs. Choose the Intel package for Intel-based Macs. You can verify via Apple menu > About This Mac.

  3. Download the .pkg file and open it. Follow the installer prompts and accept the default installation location (/Library/Frameworks/R.framework).

  4. Verify the installation by opening RStudio (see Section 3). The Console pane at the bottom left displays a message beginning with R version 4.x.x as soon as RStudio launches. If you see this message, R is installed correctly and no further verification is needed.

Note on Xcode Tools:
A small number of specialist packages require additional compilation tools to install. If you encounter an error message mentioning "no developer tools" or "xcrun" when installing a package later, contact your instructor or IT support. Most users will never need to address this during an introductory course.
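
If you would like to check in advance whether a compiler toolchain is present, the pkgbuild helper package can report it. This is an optional sketch, not part of the official install steps; it assumes you are willing to install the small CRAN package pkgbuild:

```r
# Assumes the CRAN package 'pkgbuild' (not part of base R)
install.packages("pkgbuild")   # once per machine

# TRUE if a working compiler toolchain was found
pkgbuild::has_build_tools()
```
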
Windows Installation
  1. Go to cran.r-project.org and click Download R for Windows, then base.

  2. Download the .exe installer (e.g., R-4.x.x-win.exe).

  3. Run the installer. Accept the license and keep the default install path (C:\Program Files\R\R-4.x.x). The default component selection is fine; since R 4.2, Windows builds of R are 64-bit only.

  4. Verify the installation by opening RStudio (see Section 3). The Console pane at the bottom left displays a message beginning with R version 4.x.x as soon as RStudio launches. If you see this message, the installation succeeded. If RStudio opens but the Console shows an error, see the "If RStudio Cannot Find R" callout in Section 3.

Rtools (Recommended):
Windows users who need to compile packages from source should also install Rtools, available at the same CRAN Windows page. Match the Rtools version to your installed R version. Most introductory users will not need this immediately.
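
On either platform you can also confirm the installed version directly from the Console once RStudio is open; both commands below are base R:

```r
# Print the installed R version as a single string
R.version.string

# Fuller details: version, platform, locale, attached packages
sessionInfo()
```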

3. Installing RStudio


RStudio Desktop is the free, open-source edition suitable for individual use on your local machine. It is maintained by Posit (formerly RStudio, PBC), the company that also maintains the tidyverse ecosystem.

macOS Installation
  1. Visit posit.co/download/rstudio-desktop and click Download RStudio Desktop. The page auto-detects your operating system.

  2. Open the downloaded .dmg file and drag the RStudio icon into your Applications folder.

  3. Launch RStudio from Applications. On first open, macOS may prompt you to confirm opening an app downloaded from the internet. Click Open.

  4. RStudio detects your R installation automatically. The Console pane will display your R version on startup, confirming the connection.

Windows Installation
  1. Visit posit.co/download/rstudio-desktop and download the Windows .exe installer.

  2. Run the installer with default settings. RStudio installs to C:\Program Files\RStudio by default.

  3. Launch RStudio from the Start menu or Desktop shortcut. The Console pane should display your R version, confirming that RStudio found the R installation.

If RStudio Cannot Find R:
Open RStudio, navigate to Tools > Global Options > General, and manually set the R version path to the folder where R was installed.

4. Installing Packages


R packages extend the base language with functions, datasets, and tools. The primary source is CRAN, which hosts over 20,000 packages. Packages are installed once and stored in a local library on your machine.

Installing from CRAN

The easiest way to install a package is through RStudio's built-in point-and-click interface.

Using the Packages Tab:
In RStudio, go to the Packages pane (bottom right) and click Install. Type the package name in the dialog box, make sure Install dependencies is checked, and click Install. You only need to do this once per package on your machine.

You can also type the install command directly into the Console pane (bottom left) and press Enter. Both approaches do exactly the same thing.

Using install.packages()
# Install a single package
install.packages("ggplot2")

# Install multiple packages at once
install.packages(c("dplyr", "tidyr", "readr"))

# Install the full tidyverse meta-package
install.packages("tidyverse")

Installing from GitHub

Development versions of packages, or packages not yet on CRAN, can be installed from GitHub using the pak package.

Using pak (Recommended)
# Install pak first if you do not have it
install.packages("pak")

# Install any GitHub package by user/repo
pak::pkg_install("tidyverse/ggplot2")

# pak also handles CRAN packages and is faster
pak::pkg_install("dplyr")
Why pak:
pak resolves dependencies in parallel, produces clearer error messages, and has been the recommended modern approach to package installation since 2023.

Updating and Removing Packages

To update packages using the menu, go to the Packages tab in RStudio and click Update. A list of packages with available updates will appear; check the ones you want and click Install Updates. You can also use the Console commands below for the same effect.

Package Maintenance
# Check which installed packages have updates available
old.packages()

# Update all outdated packages at once
update.packages(ask = FALSE)

# Remove a package
remove.packages("packagename")

# List all currently installed packages
installed.packages()[, "Package"]

Commonly Used Packages by Category

Package | Category | Purpose
ggplot2 | Visualization | Grammar of graphics plotting system
dplyr | Data Wrangling | Data frame manipulation and transformation
tidyr | Data Wrangling | Reshaping and tidying data
data.table | Data Wrangling | High-performance data manipulation for large datasets; faster than dplyr on big files
readr | Import | Fast reading of flat files (CSV, TSV)
readxl | Import | Reading Excel files
lubridate | Date/Time | Intuitive date and time handling
stringr | Strings | Consistent string manipulation functions
purrr | Functional | Functional programming tools and iteration
knitr | Reporting | Dynamic report generation
rmarkdown | Reporting | R Markdown documents and notebooks

5. Loading Libraries


Installing a package makes it available on disk. To use it in a session, you must load it into memory with library(). This call goes at the top of every script or R Markdown file that needs the package.

Install Once, Load Every Session:
install.packages() is run once per machine (or when updating). library() is called at the start of each new R session or script.

The library() Function

Loading Packages into a Session
# Load a single package
library(ggplot2)

# Typical script header: load all dependencies upfront
library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)

# Load without printing startup messages
suppressPackageStartupMessages(library(tidyverse))

Using Functions Without Loading

If you only need one or two functions from a package, call them directly using the :: operator. This avoids attaching the whole package to the search path and makes dependencies explicit in the code.

The :: Operator
# Call a function directly without loading the library
dplyr::filter(my_data, value > 10)
readr::read_csv("data/file.csv")

# Useful when two packages have functions with the same name
stats::filter(x, rep(1/3, 3))   # base R filter, not dplyr::filter

Checking if a Package is Installed

Portable Script Header Pattern

This pattern installs any missing packages automatically when a collaborator runs your script for the first time.

# Define required packages
packages_needed <- c("dplyr", "ggplot2", "readr")

# Install any that are missing
new_packages <- packages_needed[
  !(packages_needed %in% installed.packages()[, "Package"])
]
if (length(new_packages)) install.packages(new_packages)

# Load all
invisible(lapply(packages_needed, library, character.only = TRUE))
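
For a single package, a lighter-weight check is requireNamespace(), which returns TRUE or FALSE without attaching the package. This base R sketch uses dplyr only as an example name:

```r
# quietly = TRUE suppresses the "there is no package" message
if (requireNamespace("dplyr", quietly = TRUE)) {
  message("dplyr is installed")
} else {
  install.packages("dplyr")
}
```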

Where Libraries Are Stored

Library Paths
# See where R looks for installed packages
.libPaths()

# Example output on macOS:
# [1] "/Library/Frameworks/R.framework/Versions/4.4/Resources/library"

# Example output on Windows:
# [1] "C:/Users/YourName/AppData/Local/R/win-library/4.4"
# [2] "C:/Program Files/R/R-4.4.0/library"

6. R File Types & Helper Files


RStudio supports several distinct file types for writing R code. Each serves a different purpose: some are designed for clean, executable scripts; others weave prose and code together for reporting; others are built for interactive exploration. Understanding which to use, and when, is one of the most practical decisions you will make when setting up a project.

R Script (.R)

An R script is a plain text file containing only R code and comments. It is the simplest and most portable file type: any R installation can run it, and it has no dependencies beyond base R. Scripts are the right choice for data processing pipelines, reusable functions, and any code that should be sourced by other files.

Anatomy of an R Script
# ── script_name.R ────────────────────────────────────────────
# Purpose: Clean and reshape the enrollment dataset
# Author:  Your Name
# Updated: 2026-03-23

# 1. Load dependencies ────────────────────────────────────────
library(dplyr)
library(readr)

# 2. Read data ────────────────────────────────────────────────
raw <- readr::read_csv("data/enrollment_raw.csv")

# 3. Clean ────────────────────────────────────────────────────
clean <- raw |>
  dplyr::filter(!is.na(id)) |>
  dplyr::mutate(year = as.integer(year))
When to Use a Script:
Use .R scripts for data cleaning pipelines, simulation code, helper function definitions, and any file you plan to source() from another file. Scripts run from top to bottom with no markup overhead, which makes them fast and predictable.

R Markdown (.Rmd)

R Markdown files combine prose (written in Markdown) with executable code chunks. When rendered, the file produces a self-contained document in a format of your choice: HTML, PDF, Word, or slides. R Markdown is the standard format for reproducible reports, homework submissions, and any analysis where you need to explain your reasoning alongside the code and output.

Anatomy of an R Markdown File
---
title: "Weekly Analysis"
author: "Your Name"
date: "2026-03-23"
output: html_document
---

## Introduction

This report summarizes enrollment trends for Spring 2026.

```{r setup, include=FALSE}
library(dplyr)
library(ggplot2)
```

```{r plot, echo=FALSE}
ggplot(data, aes(x = week, y = count)) + geom_line()
```

Render the document by clicking Knit in RStudio, or by running rmarkdown::render("file.Rmd") in the console.

When to Use R Markdown:
Use .Rmd when the final deliverable is a document: a report, a homework assignment, a methods appendix, or a slide deck. Because the file re-runs all code on render, every figure and table in the output is guaranteed to reflect the current data and code.

R Notebook (.Rmd with notebook output)

An R Notebook is technically an R Markdown file with output: html_notebook set in its YAML header. The key distinction is execution behavior: in a standard .Rmd, all chunks run together when you knit; in a Notebook, each chunk runs independently and its output appears inline immediately below the chunk. This makes Notebooks well-suited for exploratory analysis where you want to inspect results step by step without re-running the entire document.

Notebook YAML Header
---
title: "Exploratory Analysis"
output: html_notebook
---

When you save the file in RStudio, a .nb.html preview file is generated automatically alongside it; the preview can be opened in any browser without R installed.

When to Use a Notebook:
Use Notebooks during active exploration: checking data distributions, testing model specifications, or iterating on visualizations. Switch to a standard .Rmd when you are ready to produce a final, fully reproducible document from scratch.

Comparison: Choosing the Right File Type

File Type | Extension | Best For | Output on Run
R Script | .R | Pipelines, functions, sourced utilities | Objects in environment; no document
R Markdown | .Rmd | Reproducible reports, final deliverables | HTML, PDF, Word, or slides on Knit
R Notebook | .Rmd (notebook) | Interactive exploration, iterative work | Inline chunk output; .nb.html preview
Quarto | .qmd | Modern replacement for R Markdown; also supports Python and Julia | HTML, PDF, Word, slides, websites

Helper Files

As a project grows, it becomes useful to separate reusable code into dedicated helper files rather than repeating it across scripts and documents. Helper files are plain .R scripts that contain only function definitions and constants; they carry no side effects and produce no output when sourced.

Creating and Sourcing a Helper File

helpers.R
# ── helpers.R ─────────────────────────────────────────────────
# Reusable utility functions for the project.
# Source this file at the top of any script or .Rmd that needs it.

# Compute percentage change between two values
pct_change <- function(baseline, followup) {
  (followup - baseline) / baseline * 100
}

# Standardize a numeric vector to mean 0, sd 1
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# A project-wide ggplot2 theme
theme_project <- ggplot2::theme_minimal() +
  ggplot2::theme(
    text = ggplot2::element_text(size = 11),
    plot.title = ggplot2::element_text(face = "bold")
  )
Sourcing the Helper File
# In any script or .Rmd chunk, load helpers with source()
source("helpers.R")

# Or use a path relative to the project root with here::here()
source(here::here("R", "helpers.R"))

# Functions are now available in the session
pct_change(100, 115)   # returns 15
standardize(my_vector)

Recommended Project File Structure

A common convention is to keep all helper files in an R/ subfolder within the project directory. This mirrors the structure used in R packages and makes it easy to source multiple helpers at once.

Typical Project Layout
my_project/
├── my_project.Rproj       # RStudio project file
├── R/
│   ├── helpers.R          # Utility functions
│   ├── themes.R           # ggplot2 theme definitions
│   └── constants.R        # Shared constants (paths, labels, colors)
├── data/
│   ├── raw/               # Original, unmodified data files
│   └── clean/             # Processed outputs
├── analysis/
│   ├── 01_clean.R
│   ├── 02_model.R
│   └── 03_report.Rmd
└── output/                # Figures, tables, rendered reports
RStudio Project Files (.Rproj):
Create a new project in RStudio via File > New Project. The .Rproj file sets the working directory to the project root automatically whenever you open it, which means all relative file paths work consistently regardless of where the project folder lives on your machine. This is the single most important habit for reproducible work.

Sourcing All Helper Files at Once

Batch Source Pattern
# Source every .R file in the R/ folder
invisible(
  lapply(list.files("R", pattern = "\\.R$", full.names = TRUE), source)
)

External Cheat Sheets


Official and community reference cards; each is available as a downloadable PDF:

  - RStudio IDE (Posit / RStudio)
  - R Markdown (Posit / RStudio)
  - Data Visualization with ggplot2 (Posit / RStudio)
  - data.table (DataCamp)

7. Quick-Reference Cheat Sheets


The cheat sheets below cover the most frequently used functions and patterns for everyday data work in R, grouped by toolkit: the tidyverse, data.table, ggplot2, and base R. Each is organized by task rather than alphabetically so you can scan quickly while working.

Load the Tidyverse:
library(tidyverse) loads ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats in one call. Alternatively, load only the packages you need.

The Pipe Operator

|> (Native Pipe, R 4.1+)
# The pipe passes the left-hand result into the first argument
# of the right-hand function
data |>
  filter(year == 2024) |>
  select(id, outcome) |>
  head()

# Equivalent without the pipe (harder to read):
head(select(filter(data, year == 2024), id, outcome))

dplyr: Data Manipulation

Function | What It Does | Example
filter() | Keep rows matching a condition | filter(df, age > 18, !is.na(score))
select() | Keep or drop columns by name | select(df, id, year, outcome)
mutate() | Add or overwrite columns | mutate(df, log_income = log(income))
rename() | Rename columns | rename(df, participant_id = id)
arrange() | Sort rows | arrange(df, desc(year), name)
group_by() | Define groups for downstream verbs | group_by(df, country, year)
summarise() | Reduce groups to summary rows | summarise(df, mean_score = mean(score, na.rm = TRUE))
count() | Count rows per group | count(df, country, sort = TRUE)
slice_max() | Keep top-n rows by a variable | slice_max(df, order_by = score, n = 5)
distinct() | Remove duplicate rows | distinct(df, id, .keep_all = TRUE)
across() | Apply a function to multiple columns inside mutate() or summarise() | mutate(df, across(where(is.numeric), scale))
relocate() | Move columns to a new position | relocate(df, outcome, .before = everything())

Joins

Function | What It Does
left_join(x, y, by = "id") | Keep all rows in x; match from y where possible
inner_join(x, y, by = "id") | Keep only rows with a match in both tables
full_join(x, y, by = "id") | Keep all rows from both tables
anti_join(x, y, by = "id") | Keep rows in x with no match in y
semi_join(x, y, by = "id") | Keep rows in x that have a match in y, without adding y columns
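
A tiny worked example makes the join behaviors concrete. The two toy tables below are invented for illustration and assume dplyr is loaded:

```r
library(dplyr)

people <- tibble(id = 1:3, name = c("Ana", "Ben", "Cara"))
scores <- tibble(id = c(1, 2, 4), score = c(90, 85, 70))

left_join(people, scores, by = "id")    # 3 rows; Cara's score is NA
inner_join(people, scores, by = "id")   # 2 rows: ids 1 and 2 only
anti_join(people, scores, by = "id")    # 1 row: Cara has no match in scores
```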

tidyr: Reshaping Data

Function | What It Does | Example
pivot_longer() | Wide to long: stack multiple columns into key-value rows | pivot_longer(df, cols = starts_with("week"), names_to = "week", values_to = "score")
pivot_wider() | Long to wide: spread key-value rows into columns | pivot_wider(df, names_from = group, values_from = mean_score)
separate() | Split one column into multiple on a delimiter | separate(df, date, into = c("year","month","day"), sep = "-")
unite() | Combine multiple columns into one | unite(df, "full_name", first, last, sep = " ")
drop_na() | Remove rows with NA in specified columns | drop_na(df, score, outcome)
fill() | Fill NA values downward or upward within a column | fill(df, group, .direction = "down")

readr: Reading & Writing Files

Function | What It Does | Example
read_csv() | Read a comma-separated file into a tibble | read_csv("data/file.csv")
read_tsv() | Read a tab-separated file | read_tsv("data/file.tsv")
read_delim() | Read any delimiter, specified with the delim argument | read_delim("file.txt", delim = "|")
write_csv() | Write a data frame or tibble to CSV | write_csv(df, "output/results.csv")
read_rds() / write_rds() | Read/write R's native binary format; preserves data types exactly | write_rds(model, "output/model.rds")

stringr: String Operations

Function | What It Does | Example
str_detect() | Returns TRUE if pattern is found | filter(df, str_detect(name, "^A"))
str_replace() | Replace first match of a pattern | str_replace(x, "\\.", ",")
str_replace_all() | Replace all matches | str_replace_all(x, " ", "_")
str_trim() | Strip leading/trailing whitespace | mutate(df, name = str_trim(name))
str_to_lower() / str_to_upper() | Change case | str_to_lower(df$country)
str_glue() | String interpolation using {variable} syntax | str_glue("Subject {id}: score = {score}")
str_sub() | Extract substring by position | str_sub(x, 1, 4)

lubridate: Dates & Times

Function | What It Does | Example
ymd(), mdy(), dmy() | Parse date strings in various orders | ymd("2024-03-15")
year(), month(), day() | Extract date components | mutate(df, yr = year(date))
floor_date() | Round date down to unit (week, month, quarter) | floor_date(date, "month")
interval() / as.period() | Compute time between two dates | interval(start, end) / years(1)
today() / now() | Current date or datetime | mutate(df, age_days = today() - dob)

data.table: High-Performance Data Manipulation

Load data.table:
library(data.table). Convert an existing data frame with setDT(df) (modifies in place) or as.data.table(df) (returns a copy). Read files directly into a data.table with fread().

The Core Syntax: DT[i, j, by]

Every data.table operation fits into a single bracket expression: i filters rows, j selects or computes columns, and by groups the result. Leaving any slot empty means "do nothing for that step."

DT[i, j, by] Pattern
# Think of it as: "Take DT, subset rows with i, compute j, grouped by by"

# Filter rows (i)
DT[age > 18]
DT[country == "US" & !is.na(score)]

# Select / compute columns (j)
DT[, .(id, score)]                  # select columns
DT[, .(mean_score = mean(score))]   # aggregate
DT[, score_log := log(score)]       # add/overwrite column in place

# Group by (by)
DT[, .(mean_score = mean(score)), by = country]
DT[, .(n = .N), by = .(country, year)]

# Combine all three
DT[year >= 2020, .(mean = mean(score)), by = country]

Special Symbols

Symbol | Meaning | Example
.N | Number of rows (in the current group) | DT[, .N, by = country]
:= | Assign a column by reference (no copy made) | DT[, z := x + y]
.() | Shorthand for list() in j and by | DT[, .(a, b), by = .(grp)]
.SD | Subset of Data: the current group's data as a data.table | DT[, lapply(.SD, mean), by = grp]
.SDcols | Restrict .SD to specific columns | DT[, lapply(.SD, mean), by = grp, .SDcols = c("x","y")]
.GRP | Integer index of the current group | DT[, grp_id := .GRP, by = country]
.I | Row indices of the current group | DT[, .I[score == max(score)], by = country]

Reading, Writing & Converting

Function | What It Does | Example
fread() | Read CSV/TSV fast; auto-detects delimiter and column types | fread("data/large_file.csv")
fwrite() | Write to CSV extremely fast | fwrite(DT, "output/results.csv")
setDT() | Convert a data frame to data.table in place (no copy) | setDT(my_df)
as.data.table() | Return a new data.table copy | DT <- as.data.table(my_df)
as.data.frame() | Convert back to a standard data frame | as.data.frame(DT)

Keys, Sorting & Merging

Function | What It Does | Example
setkey() | Sort table and index by one or more columns for fast lookups | setkey(DT, id, year)
setkeyv() | Same as setkey() but accepts a character vector of names | setkeyv(DT, c("id", "year"))
merge() | SQL-style merge; works like base R but faster with keyed tables | merge(DT1, DT2, by = "id", all.x = TRUE)
setorder() | Sort a data.table in place by columns | setorder(DT, -year, country)
setnames() | Rename columns in place | setnames(DT, "old_name", "new_name")

Useful Operations

Task | data.table Syntax
Add multiple columns at once | DT[, c("a","b") := .(x+1, y*2)]
Delete a column | DT[, col_to_drop := NULL]
Filter and count | DT[score > 80, .N]
Cumulative sum by group | DT[, cum_score := cumsum(score), by = id]
Lag/lead a column | DT[, lag_score := shift(score, 1), by = id]
Row-wise between filter | DT[between(score, 50, 80)]
Chain operations | DT[year > 2020][, .N, by = country][order(-N)]
Cross-join / expand grid | CJ(x = 1:3, y = c("a","b"))
In-Place Modification:
Unlike dplyr, data.table modifies objects in place by default when using := or set*() functions. This avoids copying large datasets and is why data.table is faster for big files. Be aware that assigning DT2 <- DT does not create an independent copy; use DT2 <- copy(DT) if you need a true duplicate.
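
The reference semantics are easy to demonstrate with a throwaway table (assuming data.table is loaded; the column name here is made up):

```r
library(data.table)

DT  <- data.table(x = 1:3)
DT2 <- DT             # NOT a copy: both names point to the same table
DT2[, x := 0]
DT$x                  # 0 0 0 - the "original" changed too

DT  <- data.table(x = 1:3)
DT3 <- copy(DT)       # a true, independent duplicate
DT3[, x := 0]
DT$x                  # still 1 2 3
```
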

ggplot2: Data Visualization

The Grammar of Graphics:
Every ggplot2 plot is built by layering components: a data source, aesthetic mappings (which variables map to x, y, color, etc.), one or more geoms (the visual marks), optional scales and facets, and a theme. Layers are added with +.

Plot Template

Core ggplot2 Structure
ggplot(data = df, aes(x = var1, y = var2)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ group) +
  labs(title = "My Title", x = "X Label", y = "Y Label") +
  theme_minimal()

Aesthetic Mappings: aes()

Aesthetic | Controls | Example
x, y | Position on axes | aes(x = year, y = gdp)
color | Point/line color (border for polygons) | aes(color = region)
fill | Fill color of bars, areas, polygons | aes(fill = treatment)
size | Point size or line width | aes(size = population)
shape | Point shape (circle, triangle, etc.) | aes(shape = group)
alpha | Transparency (0 = invisible, 1 = opaque) | aes(alpha = density)
linetype | Solid, dashed, dotted lines | aes(linetype = model)
label | Text labels for geom_text() / geom_label() | aes(label = country)
group | Group data without visual encoding (for lines) | aes(group = subject_id)

Common Geoms

Geom | Best For | Key Arguments
geom_point() | Scatterplots; relationships between two continuous variables | size, alpha, shape
geom_line() | Time series; trends over an ordered variable | linewidth, linetype
geom_col() | Bar charts with pre-computed heights | fill, position = "dodge"
geom_bar() | Bar charts where R counts rows automatically | stat = "count" (default)
geom_histogram() | Distribution of a single continuous variable | bins, binwidth
geom_density() | Smoothed distribution curve | fill, alpha, adjust
geom_boxplot() | Distribution summary: median, IQR, outliers | outlier.shape, notch
geom_violin() | Distribution shape across groups | draw_quantiles = c(0.25, 0.5, 0.75)
geom_smooth() | Fitted trend line with optional confidence band | method = "lm" or "loess", se = FALSE
geom_text() | Text labels on data points | aes(label = name), size, hjust
geom_label() | Text labels with a background box | Same as geom_text()
geom_tile() | Heatmaps; fill encodes a third variable | aes(fill = value)
geom_ribbon() | Shaded area between ymin and ymax (e.g., confidence intervals) | aes(ymin = lo, ymax = hi), alpha
geom_vline() / geom_hline() | Reference lines | xintercept or yintercept, linetype

Scales

Scales control how data values map to visual properties. The naming convention is scale_{aesthetic}_{type}().

Scale | What It Controls | Example
scale_x_log10() | Log-transform the x-axis | + scale_x_log10(labels = scales::comma)
scale_x_continuous() | Axis limits, breaks, and labels for continuous x | scale_x_continuous(limits = c(0,100), breaks = seq(0,100,20))
scale_x_date() | Format a date axis | scale_x_date(date_labels = "%b %Y")
scale_color_manual() | Specify exact colors by group | scale_color_manual(values = c("A" = "#A51C30", "B" = "#7A726A"))
scale_fill_brewer() | ColorBrewer palettes for fills | scale_fill_brewer(palette = "Set2")
scale_fill_gradient() | Continuous fill from low to high color | scale_fill_gradient(low = "white", high = "#A51C30")
scale_size_area() | Make area (not radius) proportional to value | + scale_size_area(max_size = 12)

Facets

Faceting: Small Multiples
# One variable: wrap panels automatically into rows/columns
facet_wrap(~ region)
facet_wrap(~ region, ncol = 3, scales = "free_y")

# Two variables: explicit row and column assignment
facet_grid(treatment ~ year)
facet_grid(rows = vars(treatment), cols = vars(year))

Labels & Annotations

Function | What It Controls | Example
labs() | Title, subtitle, caption, axis labels, legend title | labs(title = "...", x = "...", color = "Region")
annotate() | Add a single text or shape annotation at fixed coordinates | annotate("text", x = 2020, y = 50, label = "Policy change")
coord_flip() | Swap x and y axes (e.g., horizontal bars) | + coord_flip()
coord_cartesian() | Zoom in without dropping data (unlike xlim()) | coord_cartesian(ylim = c(0, 100))

Themes

Theme | Look
theme_minimal() | Clean white background; no border; light grid lines. Good default.
theme_bw() | White background with gray grid and black border.
theme_classic() | White background; x and y axes only; no grid. Good for publication.
theme_gray() | Gray background (ggplot2 default).
theme_void() | Completely blank canvas; useful for maps and custom layouts.
Customizing a Theme with theme()
# Override specific theme elements after choosing a base theme
theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.text = element_text(size = 10),
    legend.position = "bottom",
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "#1a1a1a"),
    strip.text = element_text(color = "white", face = "bold")
  )
Saving Plots:
Use ggsave("output/plot.png", width = 8, height = 5, dpi = 300) immediately after your plot code to save the most recently printed plot. Specify plot = my_plot to save a named object. Supported formats include .png, .pdf, .svg, and .tiff.

Base R: Core Functions

No Library Needed:
All functions in this section are available in every R session without loading any package. Base R is always present; it is the foundation on which all packages are built.

Getting Help

Function | What It Does | Example
? | Open the help page for a function | ?mean
help() | Same as ? | help("lm")
help.search() / ?? | Search help pages by keyword | ??regression
example() | Run the examples from a help page | example(mean)
args() | Show the arguments of a function | args(lm)
vignette() | Open a package vignette (tutorial document) | vignette("dplyr")

Understanding Your Data

Function | What It Does | Example
str() | Compact display of object structure and types | str(df)
head() / tail() | First or last n rows (default 6) | head(df, 10)
dim() | Number of rows and columns | dim(df)
nrow() / ncol() | Number of rows or columns individually | nrow(df)
names() / colnames() | Column names | names(df)
class() | Object class (e.g., data.frame, numeric, factor) | class(df$age)
typeof() | Low-level storage type (integer, double, character) | typeof(df$id)
summary() | Summary statistics for each column | summary(df)
table() | Frequency counts; also cross-tabulations | table(df$country)
View() | Open a spreadsheet-style viewer in RStudio | View(df)

Data Types & Coercion

Function | What It Does | Example
as.numeric() | Convert to number | as.numeric("3.14")
as.integer() | Convert to whole number | as.integer(3.9) returns 3
as.character() | Convert to text | as.character(42)
as.logical() | Convert to TRUE/FALSE | as.logical(0) returns FALSE
as.factor() | Convert to categorical factor | as.factor(df$group)
as.Date() | Convert a string to a Date object | as.Date("2024-03-15")
is.na() | Test for missing values; returns logical vector | sum(is.na(df$score))
is.numeric(), is.character() | Test type of an object | is.numeric(df$age)
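
Two coercion behaviors trip up newcomers often enough to be worth a short base R demonstration:

```r
# Unparseable text becomes NA (with a warning), not an error
as.numeric("abc")             # NA

# Factors coerce to their underlying integer codes, not their labels
f <- factor(c("10", "20", "30"))
as.numeric(f)                 # 1 2 3 - the codes
as.numeric(as.character(f))   # 10 20 30 - go through character first
```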

Vectors & Sequences

Creating & Working with Vectors
# Create vectors
x <- c(1, 2, 3, 4, 5)
words <- c("a", "b", "c")

# Sequences
1:10                          # integers 1 to 10
seq(0, 1, by = 0.1)           # 0.0 0.1 0.2 ... 1.0
seq(0, 100, length.out = 5)   # exactly 5 evenly spaced values
rep(0, times = 5)             # 0 0 0 0 0
rep(c("A", "B"), each = 3)    # A A A B B B

# Indexing (1-based)
x[2]         # second element
x[x > 3]     # elements greater than 3
x[c(1, 3)]   # first and third elements
x[-1]        # everything except the first

Numeric & Summary Functions

Function | What It Does | Example
sum() | Sum of all values | sum(x, na.rm = TRUE)
mean() | Arithmetic mean | mean(x, na.rm = TRUE)
median() | Median value | median(x, na.rm = TRUE)
sd() / var() | Standard deviation / variance | sd(x, na.rm = TRUE)
min() / max() | Smallest or largest value | max(x, na.rm = TRUE)
range() | Returns c(min, max) | range(x)
quantile() | Percentiles | quantile(x, probs = c(0.25, 0.75))
cumsum() / cumprod() | Cumulative sum or product | cumsum(x)
diff() | Lagged differences | diff(x)
abs() | Absolute value | abs(-5)
round() / ceiling() / floor() | Rounding | round(3.567, 2) returns 3.57
log() / log10() / exp() | Logarithms and exponentiation | log(x) (natural log)
sqrt() | Square root | sqrt(16)

Data Frames

Creating & Subsetting Data Frames
# Create a data frame
df <- data.frame(
  id = 1:3,
  name = c("Alice", "Bob", "Carol"),
  score = c(85, 92, 78)
)

# Access a column (three equivalent ways)
df$score
df[, "score"]
df[, 3]

# Subset rows
df[df$score > 80, ]   # rows where score > 80
df[1:2, ]             # first two rows

# Subset rows and columns together
df[df$score > 80, c("id", "name")]

# Add a new column
df$grade <- ifelse(df$score >= 90, "A", "B")

# Remove a column
df$grade <- NULL

Logic & Control Flow

Expression / Function | What It Does | Example
==, !=, <, >, <=, >= | Comparison operators | x == 5
&, |, ! | Element-wise AND, OR, NOT | x > 2 & x < 8
&&, || | Scalar AND / OR (for single TRUE/FALSE values) | if (a > 0 && b > 0)
%in% | Test membership in a set | x %in% c(1, 3, 5)
ifelse() | Vectorised if-else | ifelse(score >= 60, "pass", "fail")
if () {} else {} | Standard conditional (single value) | if (n > 0) { ... } else { ... }
for (i in x) {} | Loop over elements of a vector | for (i in 1:10) { print(i) }
while () {} | Loop while a condition is TRUE | while (x < 100) { x <- x * 2 }
next / break | Skip to next iteration or exit a loop | if (is.na(x)) next
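
To see the vectorised/scalar distinction and next in action, a small sketch with invented values:

```r
x <- c(5, NA, 12)

# ifelse() works element-wise over the whole vector
ifelse(is.na(x), "missing", ifelse(x > 10, "high", "low"))
# "low" "missing" "high"

# && short-circuits: the second test never runs if the first is FALSE
a <- 3
if (!is.null(a) && a > 0) print("positive")

# next skips an iteration; break would exit the loop entirely
total <- 0
for (v in x) {
  if (is.na(v)) next
  total <- total + v
}
total  # 17
```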

Writing Functions

Function Syntax
# Basic function
add <- function(x, y) {
  x + y   # last evaluated expression is returned
}
add(3, 4)   # 7

# Default argument values
greet <- function(name, greeting = "Hello") {
  paste(greeting, name)
}
greet("Alice")       # "Hello Alice"
greet("Bob", "Hi")   # "Hi Bob"

# Explicit return (use when returning early)
safe_log <- function(x) {
  if (x <= 0) return(NA)
  log(x)
}

Apply Functions

The apply family lets you perform the same operation across rows, columns, or list elements without writing an explicit loop.

Function | What It Does | Example
apply() | Apply a function over rows (1) or columns (2) of a matrix or data frame | apply(df, 2, mean): column means
lapply() | Apply a function to each element of a list; returns a list | lapply(my_list, summary)
sapply() | Like lapply() but simplifies the result to a vector or matrix if possible | sapply(df, class)
tapply() | Apply a function to subgroups defined by a factor | tapply(df$score, df$group, mean)
Map() | Apply a function to corresponding elements of multiple lists | Map("+", list_a, list_b)
Reduce() | Cumulatively apply a function across a list (fold) | Reduce("+", list(1, 2, 3)) returns 6
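
A compact sketch of the family on a made-up data frame:

```r
scores <- data.frame(math = c(70, 80, 90), reading = c(60, 75, 90))

sapply(scores, mean)     # column means as a named vector: math 80, reading 75
apply(scores, 1, max)    # row-wise maxima: 70 80 90

group <- c("A", "A", "B")
tapply(scores$math, group, mean)   # A 75, B 90

Reduce(`+`, list(1, 2, 3))         # 6
```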

String Functions

Function | What It Does | Example
paste() | Concatenate strings with a separator (default: space) | paste("a", "b", sep = "-")
paste0() | Concatenate with no separator | paste0("id_", 1:3)
nchar() | Number of characters in a string | nchar("hello") returns 5
substr() | Extract substring by start and stop position | substr("abcdef", 2, 4) returns "bcd"
toupper() / tolower() | Change case | toupper("hello")
trimws() | Strip leading and trailing whitespace | trimws(" hello ")
grep() | Return indices where a pattern matches | grep("^A", names(df))
grepl() | Return logical vector: does each element match? | grepl("@", emails)
gsub() | Replace all matches of a pattern | gsub("\\s+", "_", x)
sprintf() | Format strings with C-style placeholders | sprintf("%.2f%%", 95.678) returns "95.68%"
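
Several of these combine naturally when cleaning identifiers; a small sketch with invented strings (note that regex backslashes must be doubled inside R string literals):

```r
ids <- paste0("id_", 1:3)       # "id_1" "id_2" "id_3"
grepl("2", ids)                 # FALSE TRUE FALSE
gsub("\\s+", "_", "a  b c")     # "a_b_c"
substr("abcdef", 2, 4)          # "bcd"
sprintf("%.1f%%", 95.67)        # "95.7%"
```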

Reading & Writing Files

Function | What It Does | Example
read.csv() | Read a CSV file into a data frame | read.csv("data/file.csv")
read.table() | Read any delimited file; specify sep | read.table("file.txt", sep = " ", header = TRUE)
write.csv() | Write a data frame to CSV | write.csv(df, "output/file.csv", row.names = FALSE)
saveRDS() | Save a single R object to a binary file | saveRDS(model, "model.rds")
readRDS() | Load an object saved with saveRDS() | model <- readRDS("model.rds")
load() / save() | Save or restore multiple objects at once | save(df, model, file = "session.RData")
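
A safe way to see the saveRDS()/readRDS() round trip, using a temporary file and the built-in mtcars dataset:

```r
tmp <- tempfile(fileext = ".rds")
saveRDS(mtcars, tmp)
restored <- readRDS(tmp)       # assign to any name you like
identical(restored, mtcars)    # TRUE: all attributes survive the round trip
unlink(tmp)                    # clean up the temporary file
```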

Environment & Session

Function | What It Does | Example
ls() | List all objects in the current environment | ls()
rm() | Remove one or more objects | rm(x, tmp_df)
rm(list = ls()) | Clear the entire environment | rm(list = ls())
getwd() / setwd() | Get or set the working directory | getwd()
source() | Run an external .R script | source("R/helpers.R")
sessionInfo() | Report R version, OS, and loaded packages | sessionInfo()
Sys.time() | Current date and time | start <- Sys.time()
proc.time() | Elapsed CPU time; useful for benchmarking | pt <- proc.time(); ...; proc.time() - pt
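
A minimal timing sketch combining Sys.time() (wall clock) and proc.time() (CPU time); the workload here is arbitrary:

```r
start <- Sys.time()
pt <- proc.time()

invisible(sum(sqrt(seq_len(1e6))))   # some work to time

Sys.time() - start   # wall-clock difftime
proc.time() - pt     # user / system / elapsed CPU time
```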

Basic Statistics & Distributions

Function | What It Does | Example
cor() | Pearson (or Spearman) correlation matrix | cor(df[, numeric_cols])
lm() | Fit a linear model | lm(outcome ~ treat + age, data = df)
glm() | Generalized linear model (logistic, Poisson, etc.) | glm(y ~ x, family = binomial, data = df)
t.test() | One- or two-sample t-test | t.test(score ~ group, data = df)
chisq.test() | Chi-squared test of independence | chisq.test(table(df$a, df$b))
rnorm() / runif() | Draw random samples from normal or uniform distributions | rnorm(100, mean = 0, sd = 1)
set.seed() | Set the random number seed for reproducibility | set.seed(42)
dnorm() / pnorm() / qnorm() | Density, CDF, and quantile of the normal distribution | pnorm(1.96) returns 0.975
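
Putting set.seed() and the distribution functions together; a hedged sketch with simulated data (the exact draws depend only on the seed):

```r
set.seed(42)                          # makes the random draws reproducible
x <- rnorm(1000, mean = 100, sd = 15)
mean(x)                               # close to 100 (sampling error roughly 0.5)

round(pnorm(1.96), 3)                 # 0.975
qnorm(0.975)                          # ~1.96: qnorm inverts pnorm

t.test(x, mu = 100)$p.value           # one-sample test of H0: true mean = 100
```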

How to Read These Tables:
Each row shows the same operation written three ways, and all three produce equivalent results. In the tables below, each task line is followed by three code lines in a fixed order: Base R first, then data.table, then tidyverse. data.table is the fastest and most memory-efficient for large datasets, making it ideal for administrative records, claims data, and large cohort files. The tidyverse reads closest to plain English and is widely used in teaching materials and Stack Overflow answers. Base R requires no packages and works in any environment.

Reading & Writing Data

Task Base R data.table Tidyverse (dplyr / tidyr)
Read a CSV file
read.csv("file.csv")
fread("file.csv")
readr::read_csv("file.csv")
Write to CSV
write.csv(df, "out.csv", row.names = FALSE)
fwrite(DT, "out.csv")
readr::write_csv(df, "out.csv")
Save / load a single R object
saveRDS(x, "x.rds"); readRDS("x.rds")
saveRDS(DT, "dt.rds"); readRDS("dt.rds")
readr::write_rds(df, "x.rds"); readr::read_rds("x.rds")

Inspecting Data

Task Base R data.table Tidyverse (dplyr / tidyr)
Preview first rows
head(df, 6)
head(DT, 6)
dplyr::glimpse(df)
Structure and column types
str(df)
str(DT)
dplyr::glimpse(df)
Summary statistics
summary(df)
summary(DT)
summary(df)
Dimensions (rows × columns)
dim(df); nrow(df); ncol(df)
dim(DT)
dim(df)
Column names
names(df)
names(DT)
names(df)

Filtering Rows

Task Base R data.table Tidyverse (dplyr / tidyr)
Keep rows matching a condition
df[df$age > 18, ]
DT[age > 18]
filter(df, age > 18)
Multiple conditions (AND)
df[df$age > 18 & df$country == "US", ]
DT[age > 18 & country == "US"]
filter(df, age > 18, country == "US")
Membership test (OR across values)
df[df$grp %in% c("A","B"), ]
DT[grp %in% c("A","B")]
filter(df, grp %in% c("A","B"))
Remove rows with missing values
df[!is.na(df$score), ]
DT[!is.na(score)]
filter(df, !is.na(score))
Keep first n rows
head(df, 10)
DT[1:10]
slice_head(df, n = 10)
Remove duplicate rows
unique(df)
unique(DT)
distinct(df)

Selecting & Renaming Columns

Task Base R data.table Tidyverse (dplyr / tidyr)
Select columns by name
df[, c("id", "score")]
DT[, .(id, score)]
select(df, id, score)
Drop a column
df$col <- NULL
DT[, col := NULL]
select(df, -col)
Select columns matching a pattern
df[, grep("^week", names(df))]
DT[, .SD, .SDcols = grep("^week", names(DT))]
select(df, starts_with("week"))
Rename a column
names(df)[names(df)=="old"] <- "new"
setnames(DT, "old", "new")
rename(df, new = old)
Reorder columns
df[, c("b","a","c")]
setcolorder(DT, c("b","a","c"))
relocate(df, b, .before = a)

Creating & Transforming Columns

Task Base R data.table Tidyverse (dplyr / tidyr)
Add a new column
df$log_inc <- log(df$income)
DT[, log_inc := log(income)]
mutate(df, log_inc = log(income))
Add multiple columns at once
df$a <- df$x + 1; df$b <- df$y * 2
DT[, c("a","b") := .(x+1, y*2)]
mutate(df, a = x+1, b = y*2)
Conditional column (if / else)
df$pass <- ifelse( df$score >= 60, "pass", "fail")
DT[, pass := ifelse( score >= 60, "pass", "fail")]
mutate(df, pass = if_else(score >= 60, "pass", "fail"))
Multi-way conditional
df$cat <- cut(df$score, breaks=c(0,60,80,100), labels=c("C","B","A"))
DT[, cat := fcase( score < 60, "C", score < 80, "B", score >= 80,"A")]
mutate(df, cat = case_when( score < 60 ~ "C", score < 80 ~ "B", .default = "A"))
Lag / lead a column by group
df$lag_s <- ave( df$score, df$id, FUN=function(x) c(NA,x[-length(x)]))
DT[, lag_s := shift(score,1), by = id]
df |> group_by(id) |> mutate(lag_s = lag(score))
Cumulative sum by group
df$cum_s <- ave( df$score, df$id, FUN = cumsum)
DT[, cum_s := cumsum(score), by = id]
df |> group_by(id) |> mutate(cum_s = cumsum(score))

Sorting

Task Base R data.table Tidyverse (dplyr / tidyr)
Sort ascending by one column
df[order(df$year), ]
setorder(DT, year)
arrange(df, year)
Sort descending
df[order(-df$year), ]
setorder(DT, -year)
arrange(df, desc(year))
Sort by multiple columns
df[order(df$country, -df$year), ]
setorder(DT, country, -year)
arrange(df, country, desc(year))

Aggregating & Summarising

Task Base R data.table Tidyverse (dplyr / tidyr)
Count rows per group
table(df$group)
DT[, .N, by = group]
count(df, group)
Single summary stat by group
tapply(df$score, df$group, mean, na.rm=TRUE)
DT[, .(mean(score, na.rm=TRUE)), by = group]
df |> group_by(group) |> summarise( m=mean(score, na.rm=TRUE))
Multiple summary stats by group
aggregate( score ~ group, data = df, FUN = function(x) c(m=mean(x), s=sd(x)))
DT[, .(m=mean(score), s=sd(score), n=.N), by = group]
df |> group_by(group) |> summarise( m=mean(score), s=sd(score), n=n())
Top row per group
do.call(rbind, lapply( split(df,df$group), function(x) x[which.max(x$score),]))
DT[DT[,.I[which.max(score)], by=group]$V1]
df |> group_by(group) |> slice_max(score, n=1)
Add group summary back as column
df$grp_mean <- ave( df$score, df$group, FUN = mean)
DT[, grp_mean := mean(score), by = group]
df |> group_by(group) |> mutate( grp_mean=mean(score))

Joining Tables

Task Base R data.table Tidyverse (dplyr / tidyr)
Left join (all rows from left)
merge(df1, df2, by="id", all.x=TRUE)
merge(DT1, DT2, by="id", all.x=TRUE)
left_join(df1, df2, by="id")
Inner join (matching rows only)
merge(df1, df2, by="id")
merge(DT1, DT2, by="id")
inner_join(df1, df2, by="id")
Full join (all rows from both)
merge(df1, df2, by="id", all=TRUE)
merge(DT1, DT2, by="id", all=TRUE)
full_join(df1, df2, by="id")
Anti-join (rows with no match)
df1[!df1$id %in% df2$id, ]
DT1[!DT2, on="id"]
anti_join(df1, df2, by="id")
Join on columns with different names
merge(df1, df2, by.x="pid", by.y="id", all.x=TRUE)
merge(DT1, DT2, by.x="pid", by.y="id", all.x=TRUE)
left_join(df1, df2, by=c("pid"="id"))

Reshaping Data

Task Base R data.table Tidyverse (dplyr / tidyr)
Wide to long (stack columns into rows)
reshape(df, varying=c("w1","w2"), v.names="val", direction="long")
data.table::melt(DT, measure.vars= c("w1","w2"), variable.name="week", value.name="val")
pivot_longer(df, cols=c(w1,w2), names_to="week", values_to="val")
Long to wide (spread rows into columns)
reshape(df, idvar="id", timevar="week", direction="wide")
data.table::dcast(DT, id ~ week, value.var="val")
pivot_wider(df, names_from=week, values_from=val)
Stack two tables with same columns
rbind(df1, df2)
rbind(DT1, DT2)
dplyr::bind_rows(df1, df2)
Combine two tables side by side
cbind(df1, df2)
cbind(DT1, DT2)
dplyr::bind_cols(df1, df2)

8. Merging and Reshaping Data


Two operations underpin almost every multi-source analysis: joining tables on a shared key, and reshaping a table between wide and long formats. This section covers both in depth, with worked examples across Base R, dplyr, and data.table, and a practical guide to diagnosing problems before and after a join.

Join Types

A join combines rows from two tables based on matching values in one or more key columns. The choice of join type determines which rows appear in the result when the keys do not match perfectly.

Join Type | Rows Kept | Typical Use
Left join | All rows from the left table; matched data from the right where available; NA where no match | Adding characteristics from a lookup table while keeping every row in the primary dataset
Inner join | Only rows with a match in both tables | Restricting analysis to observations with complete data across both sources
Full join | All rows from both tables; NA wherever a match is missing on either side | Auditing two datasets for overlap and discrepancies
Anti join | Rows in the left table with no match in the right | Finding records that failed a linkage, or identifying controls not in the treatment file
Semi join | Rows in the left table that have a match in the right, without adding any right-table columns | Filtering a dataset to only those IDs that appear in a second file, without duplicating columns
Check Your Keys Before Joining:
Always verify that your key column is unique in at least one of the two tables before joining. A many-to-many join (duplicate keys on both sides) silently multiplies rows and is almost never intended. Use anyDuplicated(df$id) or df |> count(id) |> filter(n > 1) to check.
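
To see why the duplicate-key check matters, here is a tiny base-R sketch with invented tables in which the right table has a duplicated id:

```r
left  <- data.frame(id = c(1, 2), x = c("a", "b"))
right <- data.frame(id = c(1, 1, 2), y = c(10, 11, 20))  # id 1 appears twice

anyDuplicated(right$id)   # 2: non-zero means the key is not unique

joined <- merge(left, right, by = "id", all.x = TRUE)
nrow(joined)              # 3, not 2: the duplicate key silently multiplied rows
```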

Joining with dplyr

dplyr Join Functions
library(dplyr)

# Left join: keep all rows in df_patients, add columns from df_insurance
df_joined <- left_join(df_patients, df_insurance, by = "patient_id")

# Inner join: only patients who appear in both tables
df_matched <- inner_join(df_patients, df_labs, by = "patient_id")

# Join on columns with different names in each table
df_joined <- left_join(df_claims, df_providers, by = c("provider_npi" = "npi"))

# Join on multiple keys (patient + visit date must both match)
df_joined <- left_join(df_vitals, df_meds, by = c("patient_id", "visit_date"))

# Anti join: patients in df_enrolled with no matching lab result
df_missing_labs <- anti_join(df_enrolled, df_labs, by = "patient_id")

# Semi join: filter df_patients to only those with a pharmacy claim
df_with_rx <- semi_join(df_patients, df_pharmacy, by = "patient_id")

Joining with data.table

data.table joins use the bracket syntax with an on argument, or the merge() function. Setting keys first with setkey() speeds up repeated joins on large tables.

data.table Join Syntax
library(data.table)
setDT(df_patients); setDT(df_insurance)

# Left join using bracket syntax: X[Y] keeps all rows of Y, so putting
# df_patients inside the brackets gives a left join onto df_patients
df_joined <- df_insurance[df_patients, on = "patient_id"]

# Left join using merge() - syntax matches base R
df_joined <- merge(df_patients, df_insurance, by = "patient_id", all.x = TRUE)

# Inner join
df_matched <- merge(df_patients, df_labs, by = "patient_id")

# Full join
df_full <- merge(df_patients, df_labs, by = "patient_id", all = TRUE)

# Anti join: rows in patients not matched in labs
df_missing <- df_patients[!df_labs, on = "patient_id"]

# Join on columns with different names
df_joined <- merge(df_claims, df_providers,
                   by.x = "provider_npi", by.y = "npi", all.x = TRUE)

# Set key for fast repeated lookups
setkey(df_insurance, patient_id)
df_joined <- df_insurance[df_patients, on = "patient_id"]

Diagnosing Join Problems

The most common join errors are silent: the operation succeeds but the row count or column values are wrong. These checks catch the most frequent problems before they propagate downstream.

Pre-Join Checks
library(dplyr)

# 1. Check for duplicate keys on the right (lookup) table.
#    If n > 1 for any id, a left join will expand your row count unexpectedly
df_insurance |> count(patient_id) |> filter(n > 1)

# 2. Check that key columns have the same type in both tables.
#    Joining character "001" to integer 1 silently produces zero matches
class(df_patients$patient_id)
class(df_insurance$patient_id)

# 3. Check for NA values in the key column
sum(is.na(df_patients$patient_id))
sum(is.na(df_insurance$patient_id))

# 4. Preview key overlap between the two tables
n_left   <- n_distinct(df_patients$patient_id)
n_right  <- n_distinct(df_insurance$patient_id)
n_shared <- n_distinct(intersect(df_patients$patient_id, df_insurance$patient_id))
cat(sprintf("Left: %d  Right: %d  Shared: %d\n", n_left, n_right, n_shared))
Post-Join Checks
# After a left join, row count should equal the left table exactly
stopifnot(nrow(df_joined) == nrow(df_patients))

# Count how many left-table rows had no match (NAs in a right-table column)
sum(is.na(df_joined$insurance_type))

# Full audit: left-only, matched, and right-only rows
df_audit <- full_join(
  df_patients  |> mutate(in_patients  = TRUE),
  df_insurance |> mutate(in_insurance = TRUE),
  by = "patient_id"
)
table(left_only  = is.na(df_audit$in_insurance),
      right_only = is.na(df_audit$in_patients))

Reshaping: Wide and Long Formats

Data arrives in two common shapes. Wide format has one row per subject and one column per time point or variable. Long format has one row per observation, with a column identifying which variable or time point each row represents. Most R modelling and plotting functions expect long format; most data entry and reporting tools produce wide format.

Format | Shape | When You Have It | When You Need It
Wide | Many columns, fewer rows | Survey exports, lab panels with one column per test, repeated-measures spreadsheets | Reporting tables, cross-tabulations, some time-series packages
Long | Fewer columns, many rows | Electronic health records, claims files, relational databases | ggplot2 (one row per plotted point), lme4 mixed models, dplyr group_by summaries

Wide to Long: pivot_longer()

pivot_longer() — tidyr
library(tidyr)

# Wide: one row per patient, columns week1 through week4
#   patient_id | week1 | week2 | week3 | week4
#   001        |    82 |    85 |    80 |    88
df_long <- df_wide |>
  pivot_longer(
    cols      = starts_with("week"),  # columns to stack
    names_to  = "week",               # new column holding the old column names
    values_to = "sbp"                 # new column holding the values
  )
# Result: patient_id | week  | sbp
#         001        | week1 |  82
#         001        | week2 |  85  ...

# Strip the "week" prefix to leave just a number, and coerce to integer
df_long <- df_wide |>
  pivot_longer(
    cols            = starts_with("week"),
    names_to        = "week",
    names_prefix    = "week",
    names_transform = list(week = as.integer),
    values_to       = "sbp"
  )

# Stack two value types at once (sbp and dbp both measured weekly)
df_long <- df_wide |>
  pivot_longer(
    cols      = matches("^(sbp|dbp)_week"),
    names_to  = c(".value", "week"),  # .value routes sbp and dbp to separate columns
    names_sep = "_week"
  )

Long to Wide: pivot_wider()

pivot_wider() — tidyr
# Long: one row per patient-week
#   patient_id | week  | sbp
#   001        | week1 |  82
df_wide <- df_long |>
  pivot_wider(
    names_from  = week,  # column whose values become new column names
    values_from = sbp    # column whose values fill those new columns
  )

# Spread two value columns simultaneously
df_wide <- df_long |>
  pivot_wider(
    names_from  = week,
    values_from = c(sbp, dbp)  # creates sbp_week1, dbp_week1, sbp_week2 ...
  )

# Duplicate keys cause list-columns: summarise first, then widen
df_wide <- df_long |>
  group_by(patient_id, week) |>
  summarise(sbp = mean(sbp, na.rm = TRUE), .groups = "drop") |>
  pivot_wider(names_from = week, values_from = sbp)

Reshaping with data.table: melt() and dcast()

melt() and dcast() — data.table
library(data.table)

# Wide to long: melt()
DT_long <- melt(DT_wide,
  id.vars       = "patient_id",
  measure.vars  = c("week1", "week2", "week3", "week4"),
  variable.name = "week",
  value.name    = "sbp"
)

# Melt two value types simultaneously using patterns()
DT_long <- melt(DT_wide,
  measure.vars  = patterns("^sbp", "^dbp"),
  variable.name = "week",
  value.name    = c("sbp", "dbp")
)

# Long to wide: dcast()
# Formula: rows ~ columns; value.var is the column to spread
DT_wide <- dcast(DT_long, patient_id ~ week, value.var = "sbp")

# Aggregate while casting (mean sbp per patient per week)
DT_wide <- dcast(DT_long, patient_id ~ week,
  value.var = "sbp", fun.aggregate = mean, na.rm = TRUE
)

Binding Rows and Columns

Binding stacks or places tables side by side without matching on a key. Row binding requires matching column names; column binding requires matching row counts.

Row and Column Binding
# Stack two tables with the same columns (e.g., two annual extracts)
# dplyr fills missing columns with NA rather than throwing an error
df_combined <- dplyr::bind_rows(df_2023, df_2024)

# Stack a list of many tables at once, adding a source label column
list_of_dfs <- list(df_2021, df_2022, df_2023, df_2024)
df_combined <- dplyr::bind_rows(list_of_dfs, .id = "year_src")

# data.table equivalent (faster for large tables)
DT_combined <- rbindlist(list(DT_2021, DT_2022, DT_2023), idcol = "year_src")

# fill = TRUE adds NA for columns missing in some tables
DT_combined <- rbindlist(list_of_DTs, use.names = TRUE, fill = TRUE)

# Column binding: place tables side by side (rows must already correspond)
df_combined <- dplyr::bind_cols(df_demographics, df_outcomes)
Prefer Joins Over Column Binding:
bind_cols() and cbind() assume rows in the two tables are in the same order and correspond to the same subjects. This assumption fails silently if either table has been sorted, filtered, or subsetted. A left_join() on an explicit key is almost always safer.
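
A base-R sketch of the failure mode described above, with invented data: the two tables hold the same subjects in different row orders.

```r
demo <- data.frame(id = c(1, 2, 3), age = c(30, 40, 50))
out  <- data.frame(id = c(3, 1, 2), score = c(9, 7, 8))  # same ids, shuffled order

bad <- cbind(demo, score = out$score)  # pairs by position: row 1 gets id 3's score
bad$score[bad$id == 1]                 # 9, which is wrong

good <- merge(demo, out, by = "id")    # pairs by key, order-independent
good$score[good$id == 1]               # 7, correct
```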

Common Patterns

Reading and Stacking Multiple Files

Stack All CSV Files in a Folder
library(purrr)
library(readr)
library(dplyr)
library(data.table)

files <- list.files("data/raw", pattern = "\\.csv$", full.names = TRUE)

df_all <- files |>
  set_names(basename(files)) |>
  purrr::map(readr::read_csv, show_col_types = FALSE) |>
  dplyr::bind_rows(.id = "source_file")

# data.table equivalent
DT_all <- rbindlist(
  lapply(files, fread),
  idcol = "source_file",
  fill  = TRUE
)
DT_all[, source_file := files[source_file]]  # replace integer index with filename

Reshaping and Summarising in One Pipeline

Long Format to Summary Table
# Read wide data, reshape, summarise, widen for reporting
df_report <- df_wide |>
  pivot_longer(
    cols      = starts_with("week"),
    names_to  = "week",
    values_to = "sbp"
  ) |>
  group_by(insurance_type, week) |>
  summarise(
    mean_sbp = mean(sbp, na.rm = TRUE),
    n        = n(),
    .groups  = "drop"
  ) |>
  pivot_wider(
    names_from  = week,
    values_from = c(mean_sbp, n)
  )

9. Saving and Loading Data


R offers several formats for persisting data between sessions. Choosing the right one depends on whether you need to share a single object or a whole collection, whether the file must be readable outside R, and how large the dataset is. This section covers each format, when to use it, and the practical tradeoffs between them.

Format Comparison

Format | Extension | Saves | R-Only? | Best For
saveRDS() / readRDS() | .rds | One object | Yes | Saving a single cleaned dataset, model, or list between scripts. The object can be loaded under any name.
save() / load() | .RData | Named objects (one or many) | Yes | Checkpointing a set of related objects mid-analysis. Objects are restored under their original names.
save.image() / load() | .RData | Entire workspace | Yes | Generally not recommended. Creates implicit, hard-to-audit dependencies. Avoid for reproducible work.
write.csv() / read.csv(), write_csv() / read_csv(), fwrite() / fread() | .csv | One tabular object | No | Sharing data with collaborators in Excel, Python, Stata, or any other tool. Universal but slow and loses column type information.
write_parquet() / read_parquet() | .parquet | One tabular object | No | Large datasets shared with Python (pandas, polars) or cloud pipelines. Columnar storage; fast and compact. Requires the arrow package.
write_fst() / read_fst() | .fst | One data frame or data.table | Near-R-only | Fastest read/write for R-to-R workflows on large tabular data. Supports random column access. Requires the fst package.
write.xlsx() / read_excel() | .xlsx | One or more sheets | No | When a collaborator or system requires .xlsx and CSV is not accepted. Avoid for intermediate analysis storage.

RDS: Saving Individual Objects

saveRDS() and readRDS() are the recommended default for saving any single R object. Unlike save(), the object is not bound to its original variable name on load, which makes it easier to use in different scripts without name collisions.

saveRDS() and readRDS()
# Save one object to disk
saveRDS(df_clean, file = "data/clean/df_clean.rds")
saveRDS(model_logit, file = "output/model_logit.rds")

# Load it back, assigning to any name you choose
df_clean    <- readRDS("data/clean/df_clean.rds")
model_final <- readRDS("output/model_logit.rds")  # original name not required

# RDS preserves all R attributes: factor levels, column types, class, etc.
# A data.table saved with saveRDS() is still a data.table on load.
# A list, model object, or ggplot is preserved exactly as saved.
Use RDS as Your Default:
For any intermediate or final R object that does not need to be opened in another tool, saveRDS() is the safest and most explicit choice. It saves exactly one thing, forces you to name it explicitly on load, and preserves all R-specific attributes such as factor levels, ordered factors, and object class.

RData: Saving Multiple Named Objects

save() stores multiple R objects in a single file. When load() reads the file, each object reappears in the environment under its original name. This is useful for checkpointing a set of related results, but requires discipline: the names are baked into the file, so loading into a session that already has objects with those names will silently overwrite them.

save() and load()
# Save a specific set of objects into one file
save(df_clean, df_joined, model_lm, file = "output/checkpoint_01.RData")

# Restore all of them at once; names are fixed to what was saved
load("output/checkpoint_01.RData")
# df_clean, df_joined, model_lm appear in the environment

# Check what a .RData file contains before loading it
load("output/checkpoint_01.RData", verbose = TRUE)

# Safer pattern: load into a new environment to inspect before
# exposing anything to the global environment
checkpoint <- new.env()
load("output/checkpoint_01.RData", envir = checkpoint)
ls(checkpoint)                    # see what it contains
df_clean <- checkpoint$df_clean   # pull out only what you need

Workspace Saving: What to Avoid and Why

RStudio prompts you to save your workspace when you close a session. The default file is .RData in your working directory. Accepting this prompt is one of the most common reproducibility mistakes in R.

Why Workspace Saving Causes Problems
# save.image() writes every object in the current environment to .RData
save.image()                               # saves to .RData in the working directory
save.image(file = "session_backup.RData")  # explicit filename

# The problem: .RData loads silently every time R starts in that directory.
# Objects from old, deleted, or changed scripts persist invisibly.
# Code appears to work only because an old object is in memory,
# not because the script that creates it still runs correctly.

# The fix: turn off automatic workspace saving in RStudio.
#   Tools > Global Options > General:
#     "Save workspace to .RData on exit"         -> set to Never
#     "Restore .RData into workspace at startup" -> uncheck

# Then start each session clean and source the scripts that rebuild your objects.
# If rebuilding takes too long, save intermediate objects explicitly with saveRDS().
The Blank Slate Principle:
A reproducible analysis is one that produces the same results when run from a blank R session on a machine that has never seen the data before. If your code relies on objects in .RData rather than on scripts that create those objects, it fails this test. Disable automatic workspace saving and use saveRDS() for any intermediate results that are expensive to recompute.

CSV: Universal Plain-Text Exchange

CSV is the safest format for sharing tabular data with any other tool. It is slow, verbose, and does not preserve column types, but it opens in Excel, Python, Stata, SAS, and any text editor. Use it as a delivery format, not an intermediate storage format.

Reading and Writing CSV
# Base R (slow; adds row names by default unless suppressed)
write.csv(df, "output/results.csv", row.names = FALSE)
df <- read.csv("data/file.csv", stringsAsFactors = FALSE)

# readr (fast; prints column type guesses; returns a tibble)
library(readr)
readr::write_csv(df, "output/results.csv")   # no row names by default
df <- readr::read_csv("data/file.csv", show_col_types = FALSE)

# data.table (fastest; handles large files well)
library(data.table)
data.table::fwrite(DT, "output/results.csv")  # very fast; no row names
DT <- data.table::fread("data/file.csv")      # auto-detects delimiter and types

# Preserve a date column across CSV round-trips by formatting explicitly
df$date <- format(df$date, "%Y-%m-%d")  # write as ISO string
df$date <- as.Date(df$date)             # parse back after reading

Parquet: Fast Cross-Language Storage

Parquet is a columnar binary format supported natively by Python (pandas, polars), Spark, DuckDB, and cloud storage services. It preserves column types, compresses well, and reads far faster than CSV for large files. The arrow package provides the R interface.

arrow: write_parquet() and read_parquet()
install.packages("arrow")  # install once
library(arrow)

# Write a data frame or data.table to parquet
arrow::write_parquet(df_clean, "data/clean/df_clean.parquet")

# Read back (returns a tibble by default)
df_clean <- arrow::read_parquet("data/clean/df_clean.parquet")

# Specify only the columns you need (parquet reads column-by-column,
# so selecting columns avoids reading unused data from disk entirely)
df_subset <- arrow::read_parquet(
  "data/clean/df_clean.parquet",
  col_select = c("patient_id", "age", "outcome")
)

# Convert a data.table to data frame before writing if arrow warns about class
arrow::write_parquet(as.data.frame(DT), "output/DT.parquet")

fst: Fastest R-to-R Binary Format

The fst package provides some of the fastest read and write speeds available for tabular data in R, often several times faster than fread() on large files. It also supports random column access, meaning you can read a subset of columns without loading the full file. The format is not widely supported outside R, so use it for intermediate objects in pure-R pipelines.

fst: write_fst() and read_fst()
install.packages("fst")  # install once
library(fst)

# Write (accepts data frames and data.tables)
fst::write_fst(df_clean, "data/clean/df_clean.fst")

# Compress (0 = none, 100 = max; the default of 50 is a good balance)
fst::write_fst(df_clean, "data/clean/df_clean.fst", compress = 75)

# Read the full file
df_clean <- fst::read_fst("data/clean/df_clean.fst")

# Read only specific columns (very fast; no other columns are read from disk)
df_sub <- fst::read_fst("data/clean/df_clean.fst",
                        columns = c("patient_id", "age", "outcome"))

# Read back as a data.table directly
library(data.table)
DT <- as.data.table(fst::read_fst("data/clean/df_clean.fst"))

Excel: When CSV Is Not an Option

Use Excel format when a collaborator or system requires .xlsx specifically and CSV is not acceptable. For reading Excel files into R, readxl is reliable and requires no Java dependency. For writing, writexl is fast and lightweight; openxlsx supports formatting, multiple sheets, and styled headers when the output format is prescribed.

Reading and Writing Excel Files
# Reading Excel files
install.packages("readxl")
library(readxl)
df <- readxl::read_excel("data/file.xlsx")                   # first sheet by default
df <- readxl::read_excel("data/file.xlsx", sheet = "Sheet2")
df <- readxl::read_excel("data/file.xlsx", skip = 2, na = "NA")
readxl::excel_sheets("data/file.xlsx")                       # list all sheet names

# Writing Excel files: writexl (no Java; single or multiple sheets)
install.packages("writexl")
writexl::write_xlsx(df, "output/results.xlsx")               # single sheet
writexl::write_xlsx(list(Summary = df_summary, Detail = df_detail),
                    "output/report.xlsx")  # multiple sheets; names become tab labels

# openxlsx: styled output, formatted headers, bold rows
install.packages("openxlsx")
library(openxlsx)
wb <- createWorkbook()
addWorksheet(wb, "Results")
writeData(wb, "Results", df_summary,
          headerStyle = createStyle(textDecoration = "bold"))
saveWorkbook(wb, "output/report.xlsx", overwrite = TRUE)

Choosing a Format

Decision Guide
# Saving one cleaned dataset for use in the next script?
#   -> saveRDS()               [default choice; preserves all attributes]

# Saving several related objects (model + data + metadata) as a checkpoint?
#   -> save()                  [convenient; names are restored on load]

# Large tabular file that only needs to be read back into R?
#   -> write_fst()             [fastest read/write; random column access]

# Large tabular file shared with Python, Spark, or a cloud pipeline?
#   -> write_parquet()         [cross-language; typed; compressed; widely supported]

# Sharing data with a collaborator using Excel, Stata, or SAS?
#   -> write_csv() / fwrite()  [universal; any tool can read it; loses types]

# Collaborator or system requires .xlsx and CSV is not accepted?
#   -> writexl::write_xlsx() or openxlsx  [Excel-native; multiple sheets]

# Closing RStudio and asked to save workspace?
#   -> No. Turn this off in Tools > Global Options > General.

File Paths and Project Portability

Hard-coded absolute paths break when a project is moved to a new machine or shared with a collaborator. The here package constructs paths relative to the project root (located via the .Rproj file), making all file references portable with essentially no setup.

here::here() for Portable Paths
install.packages("here")   # install once
library(here)

# here::here() always resolves relative to the .Rproj file location,
# regardless of where the calling script lives in the project folder
saveRDS(df_clean, here::here("data", "clean", "df_clean.rds"))
df_clean <- readRDS(here::here("data", "clean", "df_clean.rds"))
readr::write_csv(df, here::here("output", "results.csv"))
arrow::write_parquet(df, here::here("data", "clean", "df.parquet"))

# here() builds the path from multiple arguments, joining with the OS separator.
# On any machine: /path/to/project/data/clean/df_clean.rds
# No setwd() needed; no broken absolute paths.
Recommended Folder Convention:
Keep raw source files in data/raw/ and treat them as read-only. Write all processed or cleaned objects to data/clean/. Write all final outputs (tables, figures, reports) to output/. This separation makes it unambiguous which files can be regenerated by scripts and which are irreplaceable originals.
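The folder convention above can be set up once at the start of a project. A minimal sketch (in practice you would pass `here::here()` as the root instead of `"."`, which is used here only so the snippet is self-contained):

```r
# Create the recommended project skeleton; dir.create() is base R.
# recursive = TRUE builds nested paths; showWarnings = FALSE makes
# the script safe to re-run when the folders already exist.
root <- "."   # in a real project: here::here()
for (d in c("data/raw", "data/clean", "output")) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}
```

Because the call is idempotent, it can live at the top of a setup script without harm.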

10. Variable Types & Regression Analysis


This section covers how to assign and verify variable types in R, fit linear and logistic regression models with lm() and glm(), and interpret the output that summary() returns. These are the most common modelling steps in public health data analysis.

Assigning Variable Types

R stores data in different types depending on what the values represent. Getting types right before modelling matters: a character variable will be silently converted to a factor with an alphabetically chosen reference level you did not pick, and a numeric code stored as numeric instead of factor will be treated as continuous when it should be categorical.

Type · When to Use · How to Assign · How to Check

numeric / double
  Use for: Continuous measurements: age, BMI, income, blood pressure
  Assign:  df$age <- as.numeric(df$age)
  Check:   is.numeric(df$age)

integer
  Use for: Whole-number counts: number of visits, year
  Assign:  df$visits <- as.integer(df$visits)
  Check:   is.integer(df$visits)

factor
  Use for: Categorical variables with a fixed set of levels: treatment group, insurance type, race/ethnicity, education tier
  Assign:  df$group <- as.factor(df$group)
  Check:   is.factor(df$group); levels(df$group)

ordered factor
  Use for: Ordinal categories where order matters: low / medium / high severity, Likert scales
  Assign:  df$severity <- factor(df$severity, levels = c("low", "medium", "high"), ordered = TRUE)
  Check:   is.ordered(df$severity)

character
  Use for: Free-text strings: names, notes. Not used directly in models.
  Assign:  df$name <- as.character(df$name)
  Check:   is.character(df$name)

logical
  Use for: Binary TRUE/FALSE indicators: event occurrence, eligibility flags
  Assign:  df$died <- as.logical(df$died)
  Check:   is.logical(df$died)

Date
  Use for: Calendar dates: admission date, date of birth. Enables date arithmetic.
  Assign:  df$dob <- as.Date(df$dob, format = "%Y-%m-%d")
  Check:   class(df$dob)

Factors: Reference Levels and Coding

In regression, R uses the first level of a factor as the reference (baseline) category. You should set this deliberately rather than accepting the alphabetical default.

Setting and Checking Factor Levels
# Check current levels (first = reference in regression)
levels(df$insurance)
# e.g. "Medicaid" "Medicare" "Private" "Uninsured"

# Set a specific reference level
df$insurance <- relevel(df$insurance, ref = "Private")

# Verify: Private is now first
levels(df$insurance)

# Recode and relabel levels
# (list form: each element is new label = old level)
levels(df$educ) <- list(
  "Less than high school" = "lt_hs",
  "High school / GED"     = "hs",
  "Some college"          = "some_col",
  "College or above"      = "col_plus"
)
Always Inspect Types Before Modelling:
Run str(df) or sapply(df, class) before fitting any model. Numeric codes for categorical variables (e.g., 1, 2, 3 for insurance type) will be treated as continuous unless converted to factors. This is one of the most common sources of silent errors in public health analyses.
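As a sketch of that pre-modelling check, the snippet below builds a small illustrative data frame (the column names and codes are invented for the example) in which insurance type arrives as a numeric code, then converts all coded columns to factors in one step:

```r
# Illustrative data: insurance coded 1/2/3 arrives as plain numeric
df <- data.frame(age = c(34, 51, 47), insurance = c(1, 3, 2))

sapply(df, class)   # insurance shows "numeric" -- a red flag before modelling

# Convert every coded categorical column to a factor in bulk;
# extra arguments to lapply() are passed straight to factor()
coded <- c("insurance")   # list all coded columns here
df[coded] <- lapply(df[coded], factor,
                    levels = c(1, 2, 3),
                    labels = c("Private", "Medicaid", "Medicare"))

sapply(df, class)   # insurance is now "factor" with labelled levels
```

Listing the coded columns in one vector keeps the conversion in a single, auditable place rather than scattered across the script.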

Linear Regression: lm()

Use lm() when your outcome is a continuous variable (blood pressure, BMI, length of stay, a cost measure). The formula syntax is outcome ~ predictor1 + predictor2.

Fitting a Linear Model
# Fit the model
model_lm <- lm(sbp ~ age + as.factor(insurance) + bmi, data = df)

# View full results
summary(model_lm)

# Confidence intervals for coefficients
confint(model_lm)

# Add fitted values and residuals to the data frame
df$fitted   <- fitted(model_lm)
df$residual <- residuals(model_lm)

# Basic residual diagnostics (4 plots)
par(mfrow = c(2, 2))
plot(model_lm)

Common Formula Operators

Syntax · Meaning · Example

y ~ x            Simple regression of y on x                                      lm(sbp ~ age)
y ~ x1 + x2      Multiple regression; additive terms                              lm(sbp ~ age + bmi)
y ~ x1 * x2      Main effects plus interaction term                               lm(sbp ~ age * insurance)
y ~ x1 + x1:x2   Main effect of x1 plus interaction only (no main effect of x2)   lm(sbp ~ age + age:insurance)
y ~ I(x^2)       Arithmetic inside I(); adds a squared term                       lm(sbp ~ age + I(age^2))
y ~ .            All other columns in the data frame as predictors                lm(sbp ~ ., data = df)
y ~ . - x        All columns except x                                             lm(sbp ~ . - id, data = df)
y ~ 0 + x        Suppress the intercept                                           lm(sbp ~ 0 + insurance)
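To see how a formula expands a factor into indicator (dummy) columns, with the first level as the all-zero reference, you can inspect the design matrix directly with model.matrix(). The three-row data frame below is purely illustrative:

```r
# Illustrative data frame with an insurance factor; Private is level 1
df <- data.frame(
  insurance = factor(c("Private", "Medicaid", "Medicare"),
                     levels = c("Private", "Medicaid", "Medicare"))
)

# Design matrix for ~ insurance: an intercept column plus one dummy
# per non-reference level. The reference row (Private) is all zeros
# in the dummy columns, so the intercept absorbs it.
mm <- model.matrix(~ insurance, data = df)
mm
```

This is exactly the matrix lm() and glm() fit against, which is why each factor coefficient reads as a contrast against the reference level.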

Logistic Regression: glm() with Binomial Family

Use glm(family = binomial) when your outcome is binary: died / survived, readmitted / not, disease present / absent. The model estimates log-odds; exponentiating the coefficients gives odds ratios.

Fitting a Logistic Model
# Outcome must be 0/1 numeric or a two-level factor
df$readmit <- as.integer(df$readmit_30day == "Yes")

# Fit the model
model_logit <- glm(readmit ~ age + insurance + n_comorbidities,
                   data = df,
                   family = binomial(link = "logit"))

# View results (log-odds scale)
summary(model_logit)

# Odds ratios and 95% CI
exp(coef(model_logit))      # odds ratios
exp(confint(model_logit))   # 95% CI on OR scale

# Predicted probabilities for each observation
df$pred_prob <- predict(model_logit, type = "response")
Probit and Other Links:
The binomial family accepts other link functions. Use link = "probit" for a probit model or link = "cloglog" for a complementary log-log model. For Poisson count outcomes (e.g., number of ED visits), use family = poisson(link = "log").
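As a self-contained sketch of the Poisson case, the snippet below simulates a small dataset (all variable names and effect sizes are invented for illustration) and fits a count model; exponentiated coefficients are then rate ratios rather than odds ratios:

```r
set.seed(42)

# Simulated count outcome: number of ED visits (illustrative data)
df <- data.frame(age       = rnorm(200, mean = 50, sd = 10),
                 uninsured = rbinom(200, size = 1, prob = 0.3))
df$ed_visits <- rpois(200, lambda = exp(-1 + 0.02 * df$age + 0.4 * df$uninsured))

# Poisson regression with a log link for a count outcome
model_pois <- glm(ed_visits ~ age + uninsured,
                  data = df, family = poisson(link = "log"))

exp(coef(model_pois))   # incidence rate ratios, not odds ratios
```

Interpretation mirrors the logistic case: a rate ratio of 1.5 for uninsured would mean uninsured patients accrue ED visits at 1.5 times the rate of insured patients, adjusted for age.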

Reading summary() Output

For lm(): Linear Regression Output

Annotated lm() summary() Output
# Call:
# lm(formula = sbp ~ age + insurance + bmi, data = df)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
#  -28.41   -6.12   -0.44    5.98   31.07
#   ^ Residuals should be roughly symmetric around 0.
#     A large Max vs Min asymmetry suggests outliers.
#
# Coefficients:
#                    Estimate Std. Error t value Pr(>|t|)
# (Intercept)           82.14       4.21   19.51  < 2e-16 ***
# age                    0.43       0.06    7.18  1.2e-12 ***
# insuranceMedicaid      3.81       1.14    3.34   0.0009 ***
# insuranceMedicare      1.92       1.08    1.78   0.0756 .
# insuranceUninsured     5.60       1.31    4.27  2.3e-05 ***
# bmi                    0.71       0.09    7.89  7.4e-15 ***
#   ^ Estimate: the coefficient.
#     For numeric predictors: change in outcome per 1-unit increase.
#     For factor levels: difference vs. the reference level (Private).
#   ^ Std. Error: uncertainty around the estimate.
#   ^ t value: Estimate / Std. Error.
#   ^ Pr(>|t|): p-value; probability of this t-value under H0.
#   ^ Signif. codes: *** p<.001  ** p<.01  * p<.05  . p<.1
#
# Residual standard error: 9.83 on 1194 degrees of freedom
#   ^ Typical size of prediction error in outcome units (mmHg here).
#
# Multiple R-squared: 0.213,  Adjusted R-squared: 0.210
#   ^ R²: proportion of outcome variance explained by the model.
#     Adjusted R² penalises for number of predictors; use this one.
#
# F-statistic: 64.3 on 5 and 1194 DF,  p-value: < 2.2e-16
#   ^ Tests whether the model as a whole explains more than chance.

For glm(): Logistic Regression Output

Annotated glm() summary() Output
# Coefficients:
#                    Estimate Std. Error z value Pr(>|z|)
# (Intercept)          -2.841      0.312   -9.11  < 2e-16 ***
# age                   0.027      0.006    4.50  6.8e-06 ***
# insuranceMedicaid     0.441      0.142    3.11   0.0019 **
# insuranceMedicare     0.198      0.139    1.42   0.1549
# n_comorbidities       0.312      0.041    7.61  2.8e-14 ***
#   ^ Estimates are LOG-ODDS (logit scale), not probabilities.
#     Positive = higher odds of the outcome; negative = lower odds.
#     Use exp(coef()) to convert to odds ratios.
#   ^ z value replaces t value; interpretation is the same.
#
# Null deviance:     1284.3 on 1199 degrees of freedom
# Residual deviance: 1091.7 on 1195 degrees of freedom
#   ^ Null deviance: fit of intercept-only model.
#     Residual deviance: fit of your model.
#     Larger reduction = better model fit.
#
# AIC: 1101.7
#   ^ Lower AIC = better fit (penalised for complexity).
#     Use AIC to compare models on the same data.

Interpreting Coefficients

Model · Predictor Type · Coefficient Represents · Practical Interpretation

lm(), continuous predictor (e.g., age)
  Represents: Change in outcome per 1-unit increase in predictor, holding others constant
  Example: Age coefficient = 0.43: each additional year of age is associated with 0.43 mmHg higher systolic BP, adjusted for insurance and BMI

lm(), factor predictor (e.g., insurance)
  Represents: Difference in outcome vs. the reference level, holding others constant
  Example: Medicaid coefficient = 3.81: Medicaid patients have systolic BP 3.81 mmHg higher on average than Private patients with the same age and BMI

glm() binomial, continuous predictor
  Represents: Change in log-odds per 1-unit increase; exp(coef) gives the odds ratio
  Example: Age coefficient = 0.027; OR = exp(0.027) = 1.027: each additional year of age is associated with 2.7% higher odds of readmission

glm() binomial, factor predictor
  Represents: Log-odds difference vs. reference level; exp(coef) gives the odds ratio
  Example: Medicaid coefficient = 0.441; OR = exp(0.441) = 1.55: Medicaid patients have 55% higher odds of readmission compared to Private patients, adjusted for age and comorbidities
Odds Ratios Are Not Risk Ratios:
An odds ratio of 1.55 does not mean Medicaid patients are 55% more likely to be readmitted. It means their odds are 55% higher. When the outcome is common (prevalence above roughly 10%), odds ratios overstate the relative risk. For common binary outcomes, consider using a log-binomial model (family = binomial(link = "log")) or a Poisson model with robust standard errors to estimate risk ratios directly.
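A quick numeric check makes the distinction concrete. With a common outcome (risk 0.30 in one group, 0.45 in the other), the odds ratio is noticeably larger than the risk ratio:

```r
# Two groups with a common outcome
p0 <- 0.30   # risk in the reference group
p1 <- 0.45   # risk in the comparison group

rr <- p1 / p0                             # risk ratio: 1.50
or <- (p1 / (1 - p1)) / (p0 / (1 - p0))   # odds ratio: ~1.91

round(c(risk_ratio = rr, odds_ratio = or), 2)
```

Here the odds ratio (about 1.91) would badly overstate the 50% increase in risk, which is exactly why risk-ratio models are preferred for common binary outcomes.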

Extracting Results Programmatically

Tidy Model Output with broom

The broom package converts model output into tidy data frames, making it easy to plot coefficients or export results.

library(broom)

# Coefficients table as a data frame
tidy(model_logit)
tidy(model_logit, conf.int = TRUE, exponentiate = TRUE)
#   ^ exponentiate = TRUE gives odds ratios directly

# Model-level statistics (R², AIC, df, etc.)
glance(model_lm)
glance(model_logit)

# Observation-level: fitted values, residuals, influence stats
augment(model_lm, data = df)
augment(model_logit, data = df, type.predict = "response")
#   ^ .fitted column gives predicted probabilities for glm
Install broom:
install.packages("broom"). It is part of the tidyverse meta-package so it is already installed if you have run install.packages("tidyverse").

11. Next Steps


With R, RStudio, and your core packages installed, you have a working statistical computing environment. The resources below provide the most reliable paths to building further fluency.

Resource · Focus · Where

R for Data Science (Wickham, Cetinkaya-Rundel & Grolemund) · Tidyverse workflow; tidy data principles · r4ds.hadley.nz
Advanced R (Wickham) · Language internals; functional programming · adv-r.hadley.nz
CRAN Task Views · Curated packages by domain (statistics, finance, etc.) · cran.r-project.org/web/views
Posit Community Forum · Help, discussion, and package announcements · community.rstudio.com
Reproducibility Tip:
Consider using the renv package from the start of any project. It records the exact package versions used in a project lockfile, making your analyses reproducible across machines and over time. Install with install.packages("renv") and initialize with renv::init().
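A sketch of the typical renv workflow, for orientation rather than top-to-bottom execution (the snapshot/restore calls are run at different times, often on different machines):

```r
# One-time setup in a new project
install.packages("renv")
renv::init()       # create a project-local library and renv.lock

# ... install packages and develop your analysis as usual ...

renv::snapshot()   # record the exact package versions in renv.lock

# On a collaborator's machine (or your future self's):
renv::restore()    # reinstall the recorded versions from renv.lock
```

Committing renv.lock alongside your scripts is what makes the analysis reproducible: anyone who clones the project can rebuild the same package environment.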