---
title: "Summarise missing data"
output:
  html_document:
    pandoc_args: ["--number-offset=1,0"]
    number_sections: yes
    toc: yes
vignette: >
  %\VignetteIndexEntry{missing_data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In this vignette, we explore how *OmopSketch* functions can be used to summarise missingness in databases containing electronic health records mapped to the OMOP Common Data Model.

## Create a mock cdm

To illustrate the package’s functionality, we begin by loading the required packages and connecting to a test CDM using the Eunomia GiBleed dataset.

```{r, warning=FALSE}
library(dplyr)
library(omock)
library(OmopSketch)

cdm <- mockCdmFromDataset(datasetName = "GiBleed", source = "duckdb")

cdm
```

## Summary of missing data

A common first step in data quality assessment is to identify missing values. In this context, missing data are defined as either NA values or concept IDs equal to 0 (the two cases are counted separately). You can use the `summariseMissingData()` function to summarise missingness across the clinical tables in the CDM:

```{r, warning=FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = "observation_period"
)

result_missingData |> glimpse()
```

### Summarise by OMOP CDM table

You can choose to summarise missing data for specific OMOP CDM tables using the argument `omopTableName`.

```{r, warning=FALSE, eval = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c(
    "observation_period", "visit_occurrence", "condition_occurrence", "drug_exposure", "procedure_occurrence",
    "device_exposure", "measurement", "observation", "death"
  )
)
```

### Summarise by sex

You can choose to summarise missing data by sex by setting the argument `sex` to `TRUE`.

```{r, warning=FALSE, eval = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c(
    "observation_period", "visit_occurrence", "condition_occurrence", "drug_exposure", "procedure_occurrence",
    "device_exposure", "measurement", "observation", "death"
  ),
  sex = TRUE
)
```

### Summarise by age group

You can choose to summarise missing data by age group by passing a list that defines the age groups to the `ageGroup` argument.

```{r, warning=FALSE, eval = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c(
    "observation_period", "visit_occurrence", "condition_occurrence", "drug_exposure", "procedure_occurrence",
    "device_exposure", "measurement", "observation", "death"
  ),
  ageGroup = list(c(0, 17), c(18, 64), c(65, 150))
)
```

### Summarise by date and/or time interval

You can also summarise missing data within a specific date range or across defined time intervals using the `dateRange` and `interval` arguments. The `interval` argument supports "overall" (no time stratification), "years", "quarters", or "months".

```{r, warning=FALSE, eval = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c(
    "observation_period", "visit_occurrence", "condition_occurrence", "drug_exposure", "procedure_occurrence",
    "device_exposure", "measurement", "observation", "death"
  ),
  interval = "years",
  dateRange = as.Date(c("2012-01-01", "2019-01-01"))
)
```

### Summarise by column

You can also choose to summarise missing data for specific columns in the OMOP CDM tables using the argument `col`.

```{r, warning=FALSE, eval = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c(
    "observation_period", "visit_occurrence", "condition_occurrence", "drug_exposure", "procedure_occurrence",
    "device_exposure", "measurement", "observation", "death"
  ),
  col = c("observation_period_start_date", "observation_period_end_date")
)
```

### Summarise in sample of OMOP CDM

Finally, you can summarise missing data on a subset of subjects via the `sample` argument: provide an integer to randomly select that many `person_id`s from the `person` table, or a character string naming a `cohort` table to limit counts to its `subject_id`s.

```{r, warning=FALSE, eval = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c(
    "observation_period", "visit_occurrence", "condition_occurrence", "drug_exposure", "procedure_occurrence",
    "device_exposure", "measurement", "observation", "death"
  ),
  sample = 1000
)
```

### Visualise summary results

You can present these results using the function `tableMissingData()`.

```{r, warning = FALSE}
result_missingData <- summariseMissingData(
  cdm = cdm,
  omopTableName = c("condition_occurrence", "drug_exposure", "procedure_occurrence"),
  sex = TRUE,
  ageGroup = list(c(0, 17), c(18, 64), c(65, 150)),
  interval = "years",
  dateRange = as.Date(c("2012-01-01", "2019-01-01")),
  sample = 1000
)

result_missingData |>
  tableMissingData()
```

This table can either be of type [gt](https://gt.rstudio.com/) (default) or [flextable](https://davidgohel.github.io/flextable/).

```{r, warning = FALSE}
tableMissingData(result = result_missingData, type = "gt")
```

# Disconnect from CDM

Finally, disconnect from the mock CDM.

```{r}
cdmDisconnect(cdm = cdm)
```
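
# Appendix: counting missing values by hand

The definition of missingness used in this vignette (NA values, or concept IDs equal to 0, counted separately) can be illustrated without a CDM connection. The chunk below is a minimal sketch with plain `dplyr` on a hypothetical toy table; the table `toy`, its values, and the output column names (`na_concept`, `zero_concept`, `na_start_date`) are invented for illustration, and the code does not show how `summariseMissingData()` is implemented.

```{r}
library(dplyr)

# Hypothetical toy table (assumption: these values are made up and are not
# taken from the GiBleed dataset).
toy <- tibble(
  person_id = 1:5,
  condition_concept_id = c(0L, 201826L, NA, 0L, 4329847L),
  condition_start_date = as.Date(c("2012-01-01", NA, "2013-05-10", "2014-07-01", NA))
)

# Count the two kinds of missingness separately:
# NA values, and concept IDs equal to 0.
toy_missing <- toy |>
  summarise(
    na_concept = sum(is.na(condition_concept_id)),
    zero_concept = sum(condition_concept_id == 0L, na.rm = TRUE),
    na_start_date = sum(is.na(condition_start_date))
  )

toy_missing
```

Keeping the NA count and the zero count in separate columns mirrors the distinction drawn in the definition: a concept ID of 0 is a recorded value with no mapped concept, whereas NA means that nothing was recorded.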