---
title: "Summarise missing data"
output: 
  html_document:
    pandoc_args: [
      "--number-offset=1,0"
      ]
    number_sections: yes
    toc: yes
vignette: >
  %\VignetteIndexEntry{missing_data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In this vignette, we explore how *OmopSketch* functions can serve as a valuable tool for summarising missingness in databases containing electronic health records mapped to the OMOP Common Data Model.

## Create a mock cdm

To illustrate the package’s functionality, we begin by loading the required packages and creating a mock CDM from the Eunomia GiBleed dataset with `omock`.

```{r, warning=FALSE}
library(dplyr)
library(omock)
library(OmopSketch)

cdm <- mockCdmFromDataset(datasetName = "GiBleed", source = "duckdb")

cdm
```

## Summary of missing data

A common first step in data quality assessment is to identify missing values. In this context, missing data are defined as either NA values or concept IDs equal to 0, and the two cases are counted separately.
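
To make this definition concrete, here is a minimal base R sketch on a toy data frame. The data, the column values, and the names `toy` and `missing_counts` are invented for illustration. The sketch only mirrors the idea of counting NA values and zeros separately, and is not how *OmopSketch* computes its results internally.

```{r}
# Toy table (hypothetical values, not taken from the CDM)
toy <- data.frame(
  condition_concept_id = c(192671L, 0L, NA, 0L, 4112343L),
  condition_end_date = as.Date(
    c("2015-03-01", NA, NA, "2016-07-12", "2018-01-30")
  )
)

# Count NA values and zeros separately for every column;
# zeros are only counted in numeric columns
missing_counts <- data.frame(
  column = names(toy),
  na_count = vapply(toy, function(x) sum(is.na(x)), integer(1)),
  zero_count = vapply(
    toy,
    function(x) if (is.numeric(x)) sum(x == 0, na.rm = TRUE) else 0L,
    integer(1)
  ),
  row.names = NULL
)

missing_counts
```

In this toy example the concept column has one NA and two zeros, while the date column has two NAs and no zeros.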

You can use the `summariseMissingData()` function to summarise missingness in one or more tables of the CDM. Here we start with the `observation_period` table:

```{r, warning=FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = "observation_period"
)

result_missingData |> 
  glimpse()
```

### Summarise by OMOP CDM table

You can choose to summarise missing data for specific OMOP CDM tables using the argument `omopTableName`.

```{r, warning=FALSE, eval = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c(
    "observation_period", "visit_occurrence", "condition_occurrence", 
    "drug_exposure", "procedure_occurrence", "device_exposure", "measurement",
    "observation", "death"
  )
)
```

### Summarise by sex

You can choose to summarise missing data by sex by setting the argument `sex` to `TRUE`.

```{r, warning=FALSE, eval = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c(
    "observation_period", "visit_occurrence", "condition_occurrence",
    "drug_exposure", "procedure_occurrence", "device_exposure", "measurement", 
    "observation", "death"
  ),
  sex = TRUE
)
```

### Summarise by age group

You can choose to summarise missing data by age group by passing a list to the argument `ageGroup`, where each element defines the lower and upper age bounds of one group.

```{r, warning=FALSE, eval = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c(
    "observation_period", "visit_occurrence", "condition_occurrence", 
    "drug_exposure", "procedure_occurrence", "device_exposure", "measurement", 
    "observation", "death"
  ),
  ageGroup = list(c(0, 17), c(18, 64), c(65, 150))
)
```
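
The sketch below illustrates how such a list of bounds partitions ages into groups. The helper `assign_age_group()` and the "lower to upper" label format are hypothetical, and the sketch assumes that both bounds are inclusive; the labels and boundary handling used by *OmopSketch* may differ.

```{r}
age_groups <- list(c(0, 17), c(18, 64), c(65, 150))
ages <- c(5, 17, 18, 40, 64, 65, 90)

# Hypothetical helper: return the label of the first group containing the age,
# or NA if the age falls outside every group
assign_age_group <- function(age, groups) {
  for (g in groups) {
    if (age >= g[1] && age <= g[2]) {
      return(paste(g[1], "to", g[2]))
    }
  }
  NA_character_
}

data.frame(
  age = ages,
  age_group = vapply(ages, assign_age_group, character(1), groups = age_groups)
)
```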

### Summarise by date and/or time interval

You can also summarise missing data within a specific date range or across defined time intervals using the `dateRange` and `interval` arguments. The `interval` argument supports "overall" (no time stratification), "years", "quarters", or "months".

```{r, warning=FALSE, eval = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c(
    "observation_period", "visit_occurrence", "condition_occurrence", 
    "drug_exposure", "procedure_occurrence", "device_exposure", "measurement", 
    "observation", "death"
  ),
  interval = "years",
  dateRange = as.Date(c("2012-01-01", "2019-01-01"))
)
```
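
As a rough illustration of what a date range combined with a yearly interval means, the base R sketch below filters a few invented record dates to the range and then counts them per calendar year. The names `record_dates`, `date_range`, `in_range` and `records_per_year` are hypothetical, and the range is assumed to be inclusive at both ends; the package decides internally which date column of each table is used.

```{r}
# Invented record dates, for illustration only
record_dates <- as.Date(c(
  "2011-06-30", "2012-01-01", "2012-11-15",
  "2015-04-02", "2019-01-01", "2020-02-20"
))
date_range <- as.Date(c("2012-01-01", "2019-01-01"))

# Keep records inside the range (assumed inclusive at both ends)
in_range <- record_dates[
  record_dates >= date_range[1] & record_dates <= date_range[2]
]

# Bucket the remaining records by calendar year
records_per_year <- table(format(in_range, "%Y"))
records_per_year
```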

### Summarise by column

You can also choose to summarise missing data for specific columns in the OMOP CDM tables using the argument `col`.

```{r, warning=FALSE, eval = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c(
    "observation_period", "visit_occurrence", "condition_occurrence", 
    "drug_exposure", "procedure_occurrence", "device_exposure", "measurement", 
    "observation", "death"
  ),
  col = c("observation_period_start_date", "observation_period_end_date")
)
```

### Summarise in a sample of the OMOP CDM

You can also summarise missing data on a subset of subjects via the `sample` argument. Provide an integer to randomly select that many `person_id`s from the `person` table, or a character string naming a `cohort` table to limit counts to its `subject_id`s.

```{r, warning=FALSE, eval = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c(
    "observation_period", "visit_occurrence", "condition_occurrence", 
    "drug_exposure", "procedure_occurrence", "device_exposure", "measurement", 
    "observation", "death"
  ),
  sample = 1000
)
```

### Visualise summary results

You can present these results using the function `tableMissingData()`.

```{r, warning = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c("condition_occurrence", "drug_exposure", "procedure_occurrence"),
  sex = TRUE,
  ageGroup = list(c(0, 17), c(18, 64), c(65, 150)),
  interval = "years",
  dateRange = as.Date(c("2012-01-01", "2019-01-01")),
  sample = 1000
)
result_missingData |> tableMissingData()
```

The table can be rendered as either a [gt](https://gt.rstudio.com/) table (the default) or a [flextable](https://davidgohel.github.io/flextable/), selected with the `type` argument.

```{r, warning = FALSE}
tableMissingData(result = result_missingData, type = "gt")
```

# Disconnect from CDM

Finally, disconnect from the mock CDM.

```{r}
cdmDisconnect(cdm = cdm)
```