R Development

Survival Analysis Pipeline with ggsurvplot

Client

Health Research Institute

Industry

Health & Medical

The challenge

Visualising time-to-event data for publication-quality reports.

Our approach

Using the `survival` and `survminer` packages to create publication-ready Kaplan-Meier curves with risk tables, confidence intervals, and stratified analysis.

The solution

Built reproducible R scripts that generate high-quality survival curves from clinical trial data, with automated reporting via Quarto.

Impact

Hours → Minutes: Report generation time

Full audit trail: Reproducibility

Journal-ready: Figure quality

90% reduction: Analyst rework

Technologies

R, survival, survminer, ggplot2, Quarto

The outcome

Publication-ready survival curves delivered for a peer-reviewed manuscript, including Kaplan-Meier plots with risk tables and stratified comparisons.

The Challenge

A health research institute was conducting a clinical trial evaluating treatment effectiveness across two patient cohorts. The primary statistical analysis required survival analysis — specifically, estimation of time-to-event distributions and comparison of survival curves between treatment groups.

The research team’s existing workflow produced survival plots through a manual, iterative process in a proprietary statistical package. Each revision to the data (adding patients, adjusting exclusion criteria, reclassifying events) required manual regeneration of every figure, followed by manual reassembly of the manuscript appendices. A single data update could require 4-6 hours of figure regeneration and layout work.

The figures also fell short of journal submission standards. The proprietary software produced static images with limited control over typography, colour palettes, and layout — requiring significant post-processing in external graphics software before the figures could be submitted.

Beyond the technical constraints, the team faced a reproducibility gap. The statistical analysis was performed interactively, meaning that the exact sequence of data transformations and model specifications used to produce the published figures was not systematically recorded. This made it difficult to regenerate results if reviewers requested additional analyses or if the dataset was updated after submission.

Statistical Methodology

The analysis uses Kaplan-Meier estimation to construct non-parametric survival curves: step functions that estimate the probability of survival (or of remaining event-free) at each observed time point. The Kaplan-Meier estimator is the standard approach for time-to-event data because it makes no assumptions about the shape of the underlying survival distribution and handles censored observations naturally. Censored patients are those who leave the study, or who are still event-free when follow-up ends, so their event time is known only to exceed their last observed time.

For each patient cohort, the estimator proceeds as follows:

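1. Sort all observed follow-up times (events and censorings) in ascending order
2. At each event time t, calculate the conditional probability of surviving past t given that the patient was still at risk just before t, which is (1 - d/n), and multiply it into the running estimate: S(t) = S(t-1) × (1 - d/n), where S(t-1) is the estimate at the previous event time, d is the number of events at time t and n is the number of patients at risk just before time t. Censored patients contribute no drop; they simply leave the risk set for later time points
3. Plot the resulting step function, with vertical drops at each event time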
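
The steps above can be sketched by hand and checked against `survfit()`. This is a minimal illustration rather than the project's code: the eight follow-up times and event indicators are invented, and the variable names are assumptions.

```r
library(survival)

# Toy data (invented for illustration): follow-up in months, 1 = event, 0 = censored
time   <- c(2, 3, 3, 5, 6, 8, 8, 9)
status <- c(1, 1, 0, 1, 0, 1, 1, 0)

# Distinct times at which at least one event occurred
event_times <- sort(unique(time[status == 1]))

# n = patients still at risk just before each event time; d = events at that time
n_at_risk <- sapply(event_times, function(t) sum(time >= t))
n_events  <- sapply(event_times, function(t) sum(time == t & status == 1))

# Product-limit estimate: running product of the conditional survival (1 - d/n)
km_manual <- cumprod(1 - n_events / n_at_risk)

# Reference estimate from the survival package, read off at the same times
fit <- survfit(Surv(time, status) ~ 1)
km_survfit <- summary(fit, times = event_times)$surv

data.frame(time = event_times, n_at_risk, n_events, km_manual, km_survfit)
```

The two columns should agree. Note that the patient censored at month 3 causes no drop in the curve but is removed from the risk set before month 5.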

Log-rank testing provides a formal statistical comparison between the survival curves of the two cohorts. The log-rank test evaluates the null hypothesis that there is no difference between the survival curves, computing a chi-squared statistic from the observed versus expected event counts at each time point.
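
As a hedged sketch of the log-rank comparison, the example below uses the public `lung` dataset that ships with the survival package, with `sex` standing in for the treatment arm. The trial data and its column names are not shown in this case study, so everything here is illustrative.

```r
library(survival)

# `lung` ships with the survival package; `sex` is a stand-in for treatment arm
lr <- survdiff(Surv(time, status) ~ sex, data = lung)

# Chi-squared statistic with (number of groups - 1) degrees of freedom
chisq   <- lr$chisq
df      <- length(lr$n) - 1
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)

# Observed versus expected event counts per group, the inputs to the statistic
data.frame(group = names(lr$n), observed = lr$obs, expected = lr$exp)
c(chisq = chisq, df = df, p_value = p_value)
```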

Cox proportional hazards regression adds multivariable adjustment. By including covariates (age, sex, disease stage, baseline risk factors), the Cox model estimates hazard ratios: the ratio of the instantaneous event rate in one cohort to that in the other, adjusted for the included covariates and assumed constant over follow-up. This is the standard approach for reporting treatment effects in clinical research.
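
A minimal sketch of an adjusted Cox model follows, again on the public `lung` dataset rather than the trial data. The covariates chosen here (`sex`, `age`, `ph.ecog`) are stand-ins, and the proportional hazards check at the end is an addition for illustration, not a step described in this project.

```r
library(survival)

# Adjusted model: `sex` plays the role of the cohort, age and ECOG score are covariates
cox_fit <- coxph(Surv(time, status) ~ sex + age + ph.ecog, data = lung)

# Hazard ratios are the exponentiated coefficients, shown with 95% confidence limits
hr <- exp(cbind(HR = coef(cox_fit), confint(cox_fit)))
round(hr, 2)

# Schoenfeld residual test of the proportional hazards assumption
cox.zph(cox_fit)
```

A hazard ratio below 1 for a term indicates a lower event rate for higher values of that term, holding the other covariates fixed.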

Technical Implementation

The implementation uses R’s survival package for statistical estimation and survminer/ggplot2 for publication-quality visualisation. The entire pipeline is encapsulated in a single Quarto document, making the analysis fully reproducible.

Data preparation. The survival object is constructed using Surv(time, event), which encodes both the time-to-event and the event indicator (1 = event observed, 0 = censored). The data preparation phase handles common clinical trial data issues: converting date fields to numeric durations, flagging censored observations, and creating stratification variables for subgroup analysis.
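
The preparation step might look like the sketch below. The data frame, its column names (`enrol_date`, `event_date`, `last_contact`) and the age cut-point are all hypothetical, chosen only to show the date-to-duration conversion and the censoring flag.

```r
library(survival)

# Hypothetical raw extract; column names and values are invented for illustration
raw <- data.frame(
  patient_id   = 1:5,
  enrol_date   = as.Date(c("2021-01-10", "2021-02-01", "2021-02-15",
                           "2021-03-03", "2021-03-20")),
  event_date   = as.Date(c("2021-09-12", NA, "2022-01-05", NA, "2021-11-30")),
  last_contact = as.Date(c("2021-09-12", "2022-06-30", "2022-01-05",
                           "2022-04-18", "2021-11-30")),
  age          = c(54, 67, 71, 48, 62)
)

# Event indicator: 1 if an event date was recorded, 0 if the patient was censored
censored  <- is.na(raw$event_date)
raw$event <- as.integer(!censored)

# End of follow-up: the event date if observed, otherwise the last contact date
end_date <- raw$event_date
end_date[censored] <- raw$last_contact[censored]

# Numeric duration in days from enrolment to end of follow-up
raw$time_days <- as.numeric(end_date - raw$enrol_date)

# Stratification variable for subgroup analysis (cut-point is an assumption)
raw$age_group <- cut(raw$age, breaks = c(-Inf, 60, Inf), labels = c("<=60", ">60"))

# Survival object pairing each duration with its event indicator
surv_obj <- Surv(raw$time_days, raw$event)
surv_obj
```

Printing a `Surv` object marks censored durations with a trailing `+`, which gives a quick visual check that the indicator was coded the right way round.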

Model fitting. The survfit() function computes the Kaplan-Meier estimates for each cohort. The survdiff() function performs the log-rank test. The coxph() function fits the Cox proportional hazards model with covariate adjustment. All three analyses are executed sequentially within the same Quarto document, with results passed directly between stages.

Visualisation. The ggsurvplot() function from the survminer package generates the survival curves with confidence intervals, risk tables, and censoring marks in a single call. The output is built on ggplot2, providing fine-grained control over every visual element: axis labels, font sizes, colour schemes, grid lines, and legend positioning. The figures are exported as vector PDFs (not raster images), ensuring that they meet journal requirements for resolution and scalability.
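
A hedged sketch of such a call is shown below, using the public `lung` dataset. The labels, palette, theme and page size are illustrative choices, not the settings used for the manuscript, and the output path is a temporary file.

```r
library(survival)
library(survminer)

# Kaplan-Meier fit by group; `sex` in `lung` stands in for the treatment cohort
fit <- survfit(Surv(time, status) ~ sex, data = lung)

km_plot <- ggsurvplot(
  fit,
  data              = lung,
  conf.int          = TRUE,    # shaded confidence bands
  risk.table        = TRUE,    # numbers at risk below the curves
  censor            = TRUE,    # tick marks at censoring times
  pval              = TRUE,    # log-rank p-value annotation
  break.time.by     = 250,     # shared x-axis breaks for curve and risk table
  xlab              = "Time (days)",
  ylab              = "Survival probability",
  legend.title      = "Sex",
  legend.labs       = c("Male", "Female"),
  palette           = c("#1B9E77", "#D95F02"),
  ggtheme           = ggplot2::theme_classic(base_size = 11),
  risk.table.height = 0.25
)

# Vector PDF export; onefile = FALSE avoids a blank leading page with combined plots
out_file <- file.path(tempdir(), "km_by_sex.pdf")
pdf(out_file, width = 7, height = 6, onefile = FALSE)
print(km_plot)
dev.off()
```

Because a `ggsurvplot` object bundles the curve and the risk table as separate ggplot2 objects, printing it to an open `pdf()` device is one common way to write both panels to a single page.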

Risk tables. Each figure includes a risk table below the survival curve, showing the number of patients at risk at each time point in each cohort. This is critical for interpreting the curves: when few patients remain at risk at later time points, the estimates there are less precise, which shows up as wide confidence intervals.
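
The numbers behind a risk table can also be read directly from the fitted object. The sketch below, on the public `lung` dataset with arbitrarily chosen time points, shows the number at risk alongside the confidence interval width at each point.

```r
library(survival)

fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Evaluate the fit at fixed time points (days); the grid is an illustrative choice
s <- summary(fit, times = c(0, 250, 500, 750))

risk_table <- data.frame(
  strata   = s$strata,
  time     = s$time,
  n_risk   = s$n.risk,
  survival = round(s$surv, 3),
  ci_width = round(s$upper - s$lower, 3)
)
risk_table
```

Reading down each stratum, the number at risk falls as patients have events or are censored, and the interval width can be compared against it row by row.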

Stratified analysis. In addition to the primary two-cohort comparison, the pipeline generates stratified analyses by key subgroups (age groups, disease stage, baseline risk). Each stratified analysis follows the same automated pipeline, producing consistent figures across all subgroups.
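
One way to keep subgroup analyses consistent is to route every subgroup through the same function. The sketch below is an assumption about how that could be structured, not the project's code: `analyse_subgroup` is a hypothetical helper, the age bands are arbitrary, and the public `lung` dataset stands in for the trial data.

```r
library(survival)

# Stand-in subgroup variable: age bands with arbitrary cut-points
dat <- lung
dat$age_group <- cut(dat$age, breaks = c(-Inf, 60, 70, Inf),
                     labels = c("<=60", "61-70", ">70"))

# Hypothetical helper: the same comparison applied to any subset of the data
analyse_subgroup <- function(d) {
  lr <- survdiff(Surv(time, status) ~ sex, data = d)
  data.frame(
    n             = nrow(d),
    events        = sum(d$status == 2),  # in `lung`, status 2 = death
    logrank_chisq = unname(lr$chisq),
    logrank_p     = pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
  )
}

# Split by subgroup, apply the helper, and stack the one-row results
results <- do.call(rbind, lapply(split(dat, dat$age_group), analyse_subgroup))
results
```

The same pattern extends to figures: a helper that calls `ggsurvplot()` with fixed styling arguments would give every subgroup an identically formatted plot.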

Reproducibility and Audit Trail

The Quarto-based approach ensures that every analysis can be regenerated from raw data with a single command. The document records:

Data sources and version identifiers
All data transformation steps with code and intermediate outputs
Model specifications and convergence diagnostics
Figure generation parameters and export settings
Complete statistical test outputs with p-values and confidence intervals
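
As an illustration of recording data sources and version identifiers, a chunk near the top of the document could capture a checksum of the input file together with the software versions. This is a sketch under assumptions: the file name is hypothetical, and a small temporary CSV stands in for the real extract.

```r
# Hypothetical input file; a small temporary CSV stands in for the trial extract
data_path <- file.path(tempdir(), "trial_extract.csv")
write.csv(
  data.frame(id = 1:3, time = c(10, 20, 30), event = c(1, 0, 1)),
  data_path,
  row.names = FALSE
)

# Provenance record: which file was analysed, and with which software
provenance <- list(
  data_file        = basename(data_path),
  data_md5         = unname(tools::md5sum(data_path)),  # changes if the data change
  rendered_at      = format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"),
  r_version        = R.version.string,
  survival_version = as.character(packageVersion("survival"))
)
str(provenance)
```

Printing this record in the rendered output ties each set of figures to a specific version of the data. Calling `sessionInfo()` at the end of the document would list the remaining package versions in the same way.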

If reviewers request additional analyses — a different stratification, a sensitivity analysis excluding early events, or updated hazard ratio estimates — these can be produced within minutes rather than hours. The full computational trail is documented, providing complete transparency for peer review.

Research Process Improvements

Figure generation time. The manual process of regenerating figures after data updates (4-6 hours) was reduced to under 10 minutes — a single Quarto render. This eliminates the bottleneck where data revisions delayed manuscript preparation.

Post-processing eliminated. The vector PDF outputs from ggsurvplot meet journal quality standards directly. The team no longer needs to export figures and manually edit them in external graphics software, saving an estimated 1-2 hours per manuscript.

Consistency across analyses. All figures follow the same template — consistent fonts, colours, axis ranges, and risk table layouts. This consistency is critical for multi-figure manuscripts where reviewers expect uniform presentation across all survival plots.

Analyst rework reduction. With the previous interactive approach, any data error or reviewer comment that required re-running the analysis meant regenerating every figure manually. The automated pipeline reduces this rework to a single render, cutting rework time by approximately 90%.

Outcomes & Benefits

The project delivered both immediate research output and a sustainable analytical capability:

Peer-reviewed manuscript delivered — all survival analysis figures met journal submission standards, produced entirely within R with zero post-processing
Hours reduced to minutes — figure generation and re-generation after data updates dropped from 4-6 hours to under 10 minutes
Full reproducibility achieved — every result can be regenerated from raw data, with complete audit trail for peer review and regulatory compliance
Subgroup analysis capability built — the stratified analysis pipeline enables rapid generation of subgroup survival curves without additional development effort
Analyst capacity freed — the time saved on figure generation and rework allows the research team to focus on analysis design and interpretation rather than manual production
Scalable template — the Quarto document serves as a reusable template for future studies. New trials only require data format alignment and minor parameter adjustments, not new code development

The pipeline has since been adopted for two subsequent studies, with the research team applying the same Quarto template and producing journal-ready survival analysis figures independently.