--- title: "Introduction to twoPhaseGAS" author: "Osvaldo Espin-Garcia, Apostolos Dimitromanolakis, Shelley Bull" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to twoPhaseGAS} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4 ) ``` ## Overview **twoPhaseGAS** provides tools for the design and analysis of two-phase genetic association studies (GAS) in which a subset of the full cohort is selected for targeted sequencing (phase 2), while the remaining individuals contribute only GWAS-level data (phase 1). The package offers: 1. **Data simulation** via `DataGeneration_TPD()`. 2. **Phase 2 design optimisation** via `twoPhaseDesign()` / `twoPhase()`. 3. **Joint analysis** of phase 1 and phase 2 data via `twoPhaseSPML()`, which implements a semiparametric maximum likelihood (SPML) estimator using the EM algorithm. --- ## 1. Simulating two-phase data ```{r simulate} library(twoPhaseGAS) data.table::setDTthreads(1) set.seed(42) data <- DataGeneration_TPD( Beta0 = 2, # intercept Beta1 = 0.5, # genetic effect (G on Y) Disp = 1, # error variance for default gaussian family N = 3000, # phase 1 cohort size LD.r = 0.75, # LD (r) between causal SNP G and GWAS SNP Z P_g = 0.20, # MAF of G P_z = 0.30 # MAF of Z ) head(data) ``` The simulated data frame contains: | Column | Description | |--------|-------------| | `Y` | Continuous outcome | | `G1` | Causal sequence variant (absent from phase 1 GWAS) | | `Z` | GWAS SNP (available for everyone) | | `S` | Stratum variable derived from residuals | --- ## 2. Selecting a phase 2 subsample Here we use simple random sampling; `twoPhaseDesign()` can be used for optimised designs. ```{r subsample} set.seed(1) n2 <- 500 # phase 2 size R <- rep(0L, nrow(data)) R[sample(nrow(data), n2)] <- 1L # 1 = selected for phase 2 data[R == 0, c("G1")] <- NA # Make all phase 1 data complement G1 values missing cat("Phase 1 complement:", sum(R == 0), "individuals\n") cat("Phase 2: ", sum(R == 1), "individuals\n") ``` --- ## 3. Analysis under H₀ (score test) When the missing-by-design covariate (`miscov`) is **absent** from `formula`, `twoPhaseSPML()` fits the null model and returns score statistics. ```{r null-model} res_Ho <- twoPhaseSPML( formula = Y ~ Z, miscov = ~ G1, auxvar = ~ Z, data = data ) print(res_Ho) ``` The `print` method (modelled on `glm`) displays: - the fitted call, - estimated regression coefficients, - family and link, - log-likelihood and number of EM iterations, and - the observed and expected score statistics with a chi-squared *p*-value. --- ## 4. Analysis under Hₐ (Wald test) Including `miscov` in `formula` switches to the alternative model. The EM algorithm now estimates the effect of `G1` jointly with the regression parameters, and Wald statistics are reported. ```{r alt-model} res_Ha <- twoPhaseSPML( formula = Y ~ Z + G1, miscov = ~ G1, auxvar = ~ Z, data = data ) print(res_Ha) ``` --- ## 5. Accessing results programmatically The returned object is a list of class `"twoPhaseSPML"`. ```{r access} # Regression coefficients res_Ha$theta # Log-likelihood res_Ha$ll # Estimated joint distribution of G1 and Z head(res_Ha$qGZ) # Wald statistic and degrees of freedom cat("Wobs =", res_Ha$Wobs, " df =", res_Ha$df, "\n") cat("p-value =", pchisq(res_Ha$Wobs, df = res_Ha$df, lower.tail = FALSE), "\n") ``` --- ## 6. Using `start.values` for warm starts When analysing many variants, passing converged estimates from a nearby model as `start.values` can reduce computation time. ```{r warm-start, eval = FALSE} # Use null-model estimates as starting point for the alternative model res_warm <- twoPhaseSPML( formula = Y ~ Z + G1, miscov = ~ G1, auxvar = ~ Z, data = data, start.values = list(betas = res_Ho$theta, q = res_Ho$qGZ) ) ``` --- ## Session information ```{r session} sessionInfo() ```