---
title: "Introduction to twoPhaseGAS"
author: "Osvaldo Espin-Garcia, Apostolos Dimitromanolakis, Shelley Bull"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to twoPhaseGAS}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  fig.width  = 7,
  fig.height = 4
)
```

## Overview

**twoPhaseGAS** provides tools for the design and analysis of two-phase
genetic association studies (GAS) in which a subset of the full cohort is
selected for targeted sequencing (phase 2), while the remaining individuals
contribute only GWAS-level data (phase 1).

The package offers:

1. **Data simulation** via `DataGeneration_TPD()`.
2. **Phase 2 design optimisation** via `twoPhaseDesign()` / `twoPhase()`.
3. **Joint analysis** of phase 1 and phase 2 data via `twoPhaseSPML()`, which
   implements a semiparametric maximum likelihood (SPML) estimator using the
   EM algorithm.

---

## 1. Simulating two-phase data

```{r simulate}
library(twoPhaseGAS)

data.table::setDTthreads(1)

set.seed(42)
data <- DataGeneration_TPD(
  Beta0  = 2,    # intercept
  Beta1  = 0.5,  # genetic effect (G on Y)
  Disp = 1,      # error variance for default gaussian family
  N      = 3000, # phase 1 cohort size
  LD.r   = 0.75, # LD (r) between causal SNP G and GWAS SNP Z
  P_g    = 0.20, # MAF of G
  P_z    = 0.30  # MAF of Z
)
head(data)
```

The simulated data frame contains:

| Column | Description |
|--------|-------------|
| `Y`    | Continuous outcome |
| `G1`   | Causal sequence variant (absent from phase 1 GWAS) |
| `Z`    | GWAS SNP (available for everyone) |
| `S`    | Stratum variable derived from residuals |

---

## 2. Selecting a phase 2 subsample

Here we use simple random sampling; `twoPhaseDesign()` can be used for
optimised designs.

```{r subsample}
set.seed(1)
n2   <- 500                              # phase 2 size
R    <- rep(0L, nrow(data))
R[sample(nrow(data), n2)] <- 1L         # 1 = selected for phase 2

data[R == 0, c("G1")]  <- NA # Make all phase 1 data complement G1 values  missing

cat("Phase 1 complement:", sum(R == 0), "individuals\n")
cat("Phase 2:           ", sum(R == 1), "individuals\n")
```

---

## 3. Analysis under H₀ (score test)

When the missing-by-design covariate (`miscov`) is **absent** from
`formula`, `twoPhaseSPML()` fits the null model and returns score statistics.

```{r null-model}
res_Ho <- twoPhaseSPML(
  formula = Y ~ Z,
  miscov  = ~ G1,
  auxvar  = ~ Z,
  data   = data
)

print(res_Ho)
```

The `print` method (modelled on `glm`) displays:

- the fitted call,
- estimated regression coefficients,
- family and link,
- log-likelihood and number of EM iterations, and
- the observed and expected score statistics with a chi-squared *p*-value.

---

## 4. Analysis under Hₐ (Wald test)

Including `miscov` in `formula` switches to the alternative model.  The EM
algorithm now estimates the effect of `G1` jointly with the regression
parameters, and Wald statistics are reported.

```{r alt-model}
res_Ha <- twoPhaseSPML(
  formula = Y ~ Z + G1,
  miscov  = ~ G1,
  auxvar  = ~ Z,
  data   = data
)

print(res_Ha)
```

---

## 5. Accessing results programmatically

The returned object is a list of class `"twoPhaseSPML"`.

```{r access}
# Regression coefficients
res_Ha$theta

# Log-likelihood
res_Ha$ll

# Estimated joint distribution of G1 and Z
head(res_Ha$qGZ)

# Wald statistic and degrees of freedom
cat("Wobs =", res_Ha$Wobs, "  df =", res_Ha$df, "\n")
cat("p-value =", pchisq(res_Ha$Wobs, df = res_Ha$df, lower.tail = FALSE), "\n")
```

---

## 6. Using `start.values` for warm starts

When analysing many variants, passing converged estimates from a nearby model
as `start.values` can reduce computation time.

```{r warm-start, eval = FALSE}
# Use null-model estimates as starting point for the alternative model
res_warm <- twoPhaseSPML(
  formula     = Y ~ Z + G1,
  miscov      = ~ G1,
  auxvar      = ~ Z,
  data       = data,
  start.values = list(betas = res_Ho$theta,
                      q     = res_Ho$qGZ)
)
```

---

## Session information

```{r session}
sessionInfo()
```