---
title: "Getting Started with prepR4pcm"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with prepR4pcm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Before running any phylogenetic comparative analysis (PGLS, phylogenetic mixed models, ancestral state reconstruction), species names in your data must match the tip labels in your tree. In practice, they rarely do. **prepR4pcm** automates the matching of species names between data and tree (which we call *reconciliation*), records every name-matching decision so you can audit it later, and produces an aligned data frame and a pruned tree (the *aligned objects*) whose species lists match exactly, which is the precondition for any phylogenetic comparative method.

## The problem

Mismatches between data and tree arise from three kinds of difference:

- **Formatting differences**: same species, written differently. For example, the same animal may appear as `Homo_sapiens` in the tree and as `Homo sapiens` in the data; trailing whitespace and attached authority strings (`Homo sapiens Linnaeus, 1758`) cause similar mismatches.
- **Synonymy**: multiple scientific names refer to the same taxonomic group (often a species or genus). For example, when a recent taxonomic revision moves a species to a different genus, the older and newer names both circulate in the literature. See [Synonym (taxonomy) on Wikipedia](https://en.wikipedia.org/wiki/Synonym_\(taxonomy\)) for a fuller introduction.
- **Missing names**: species in the data but not the tree, or in the tree but not the data, with no naming rule that would link them.

Fixing these by hand is tedious, error-prone, and poorly documented. **prepR4pcm** solves this with a structured matching cascade: exact match → normalised match → synonym resolution.
Every decision is recorded in the reconciliation result, where you can inspect it with `reconcile_mapping()` or `reconcile_summary()`.

## Installation

```{r install, eval = FALSE}
# Install pak if you don't have it
# install.packages("pak")

# Install prepR4pcm from GitHub
pak::pak("itchyshin/prepR4pcm")
```

```{r setup}
library(prepR4pcm)
```

## Example 1: Reconcile a dataset against a tree

Suppose you have trait data and a phylogenetic tree with slightly different naming conventions.

```{r example-data}
# Simulated trait data for 6 primate species
trait_data <- data.frame(
  species = c(
    "Homo sapiens",
    "Pan_troglodytes",   # underscore instead of space
    "Gorilla gorilla",
    "Pongo pygmaeus",
    "Macaca mulatta",
    "Cebus capucinus"
  ),
  body_mass = c(70, 50, 160, 80, 8, 3),
  brain_mass = c(1.35, 0.39, 0.50, 0.37, 0.11, 0.07)
)

# Simulated phylogenetic tree (built manually for this example)
tree <- ape::read.tree(text = paste0(
  "((((Homo_sapiens:5,Pan_troglodytes:5):3,",
  "Gorilla_gorilla:8):4,Pongo_pygmaeus:12):6,",
  "(Macaca_mulatta:10,Papio_anubis:10):8);"
))

tree$tip.label   # the tip labels (species names) on the tree
plot(tree)       # quick visual; underscores in tip labels render as spaces
```

(`ape::plot.phylo()` displays underscores as spaces by default; the underlying `tree$tip.label` strings still contain underscores, which is why printing `tree$tip.label` shows them.)

Notice the mismatches:

- `Pan_troglodytes` is the only data name written with an underscore. The tree uses underscores throughout, so the other data names (written with spaces) differ from their tip labels in formatting only.
- `Cebus capucinus` is in the data but not in the tree.
- `Papio_anubis` is in the tree but not in the data.

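
Before handing the names to `reconcile_tree()`, it can help to see these mismatches with nothing but base R. The sketch below is not how prepR4pcm works internally. It only swaps underscores for spaces and trims whitespace, then compares the two name sets. `normalise_name()` is a hypothetical helper introduced for this illustration, and the two name vectors are copied from the example above.

```r
# Names as they appear in the example data and on the example tree
data_names <- c("Homo sapiens", "Pan_troglodytes", "Gorilla gorilla",
                "Pongo pygmaeus", "Macaca mulatta", "Cebus capucinus")
tree_names <- c("Homo_sapiens", "Pan_troglodytes", "Gorilla_gorilla",
                "Pongo_pygmaeus", "Macaca_mulatta", "Papio_anubis")

# Hypothetical helper (not part of prepR4pcm):
# replace underscores with spaces and trim surrounding whitespace
normalise_name <- function(x) trimws(gsub("_", " ", x, fixed = TRUE))

# Verbatim comparison: only names written identically in both sets match
exact_hits <- intersect(data_names, tree_names)

# Comparison after normalisation: what is still missing on either side
data_norm <- normalise_name(data_names)
tree_norm <- normalise_name(tree_names)
only_in_data <- setdiff(data_norm, tree_norm)
only_in_tree <- setdiff(tree_norm, data_norm)

exact_hits     # "Pan_troglodytes"
only_in_data   # "Cebus capucinus"
only_in_tree   # "Papio anubis"
```

A comparison like this finds the gaps but leaves no record of why each name did or did not match, which is the part the reconciliation result is meant to cover.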
```{r reconcile-tree}
result <- reconcile_tree(
  x = trait_data,
  tree = tree,
  x_species = "species",
  authority = NULL,   # skip synonym lookup for this example
  quiet = FALSE
)
```

### Inspect the result

```{r print-result}
print(result)
```

The "Reconciliation: data vs tree" header at the top of the output tells you the call that produced the result; the "Match summary" block underneath gives the count in each match category (exact, normalised, synonym, fuzzy, manual, unresolved).

Use `reconcile_mapping()` to see the full per-name table:

```{r mapping}
reconcile_mapping(result)
```

What the columns mean:

- `name_x`: the species name as it appeared in your **data** (the argument `x` to `reconcile_tree()`).
- `name_y`: the matching tip label on your **tree** (the argument `tree` to `reconcile_tree()`), or `NA` if no match was found.
- `name_resolved`: the canonical name used when synonym resolution applied (the recognised form per the chosen taxonomic authority). `NA` for matches that didn't go through the synonym stage.
- `match_type`: which stage of the cascade matched the name (see *Understanding match types* below).
- `match_score`: confidence on `[0, 1]` (`1` for exact / normalised / synonym / manual; `< 1` for fuzzy / flagged).
- `in_x`, `in_y`: logical flags for whether the name was in the data, in the tree, or both.
- `notes`: human-readable note (e.g. "normalised: lowercased", "via synonym lookup against COL", "fuzzy match score 0.92").

For a detailed report:

```{r summary, eval = FALSE}
reconcile_summary(result)
```

### Apply manual overrides

Suppose you know that `Cebus capucinus` should not be in the analysis. You can document this decision:

```{r override}
result <- reconcile_override(
  result,
  name_x = "Cebus capucinus",
  name_y = NA,
  action = "reject",
  note = "Not in target phylogeny; exclude from analysis"
)
```

`reconcile_override()` updates the existing `result` (the `reconciliation` you built earlier), so there is no need to re-run `reconcile_tree()`; assign the return value back to `result`, as in the chunk above.
The three actions you can pass to `action = ...` are:

- `"accept"`: confirm a specific `name_x → name_y` mapping.
- `"reject"`: mark a name as deliberately excluded.
- `"replace"`: redirect `name_x` to a different `name_y` than the cascade produced.

### Produce aligned objects

Once satisfied with the reconciliation, apply it:

```{r apply}
aligned <- reconcile_apply(
  result,
  data = trait_data,
  tree = tree,
  species_col = "species",
  drop_unresolved = TRUE
)

# Aligned data frame: only species present in both data and tree
aligned$data

# Aligned tree: pruned to matched species
ape::Ntip(aligned$tree)
plot(aligned$tree)   # the pruned tree
```

The `$data` and `$tree` components now have matching species, ready for comparative analysis.

## Example 2: Reconcile two datasets

`prepR4pcm` can also reconcile species names *between two datasets*, not just between a dataset and a tree. The same matching cascade applies. This is useful when merging trait data from different sources, where species names often disagree across datasets.

Here is a toy example:

```{r data-data}
# df1: body mass for three primates (df1 uses an underscore for chimp)
df1 <- data.frame(
  species = c("Homo sapiens", "Pan_troglodytes", "Gorilla gorilla"),
  mass = c(70, 50, 160)
)

# df2: lifespan for three primates (df2 uses a space for chimp; orang
# is here but not gorilla)
df2 <- data.frame(
  species = c("Homo sapiens", "Pan troglodytes", "Pongo pygmaeus"),
  lifespan = c(79, 40, 45)
)

# Reconcile the species columns of df1 and df2 against each other.
# `authority = NULL` skips the synonym-lookup stage (no taxonomic
# database needed for this small example). `quiet = TRUE` suppresses
# progress messages.
result2 <- reconcile_data(
  x = df1,
  y = df2,
  authority = NULL,
  quiet = TRUE
)

# The output shows how many names matched, and via which stage.
print(result2)
```

`Pan_troglodytes` (underscore) in `df1` is matched to `Pan troglodytes` (space) in `df2` via normalisation.
`Gorilla gorilla` is in `df1` only and `Pongo pygmaeus` is in `df2` only, so both end up as `unresolved` rows (`in_x = TRUE, in_y = FALSE` and vice versa).

## Understanding match types

Every row in the `reconcile_mapping()` output has a `match_type` column. Here is what each value means and what action (if any) it requires:

| `match_type` | Meaning | Action needed? |
|---------------|---------|----------------|
| `exact` | Verbatim string equality | None |
| `normalized` | Names matched after stripping underscores, authority strings, and case differences | None; check the `notes` column if you want to confirm |
| `synonym` | Names resolved through a taxonomic authority (e.g., Catalogue of Life) to the same accepted name | Verify the resolved name looks correct |
| `fuzzy` | High-confidence character-level match (score ≥ `flag_threshold`, default 0.95) | Check the `match_score` column; review with `reconcile_suggest()` |
| `flagged` | Lower-confidence match that needs human review: fuzzy score below `flag_threshold`, or an indirect synonym chain | Review with `reconcile_review()` or `reconcile_suggest()` |
| `manual` | Set by `reconcile_override()` or the `overrides` argument | None; you decided this |
| `unresolved` | No match found after all stages | Investigate; use `reconcile_suggest()` for candidates or `reconcile_override()` to document a decision |

Use `reconcile_summary(result, detail = "mismatches_only")` to see only the rows that need attention.

## Example 3: Using a taxonomic authority

A *taxonomic authority* is a curated database of species names that records, for each name, which is the currently recognised one and which are synonyms (alternative names referring to the same taxon). prepR4pcm can use such an authority to recognise that two syntactically different names refer to the same species; this is the "synonym" stage of the matching cascade.

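
To make the synonym stage concrete, here is a toy sketch in base R. It illustrates the idea only and is not the lookup that prepR4pcm or any real authority performs: the two-row `synonyms` table is hand-written for this example, and `accepted_name()` is a hypothetical helper. Two different strings count as the same species when they map to the same accepted name.

```r
# Hand-written toy authority: each known name and its accepted name.
# "Felis concolor" is an older name for the puma, "Puma concolor".
synonyms <- data.frame(
  name     = c("Puma concolor", "Felis concolor"),
  accepted = c("Puma concolor", "Puma concolor"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the accepted name, or NA if the name is unknown
accepted_name <- function(x, table = synonyms) {
  table$accepted[match(x, table$name)]
}

data_name <- "Felis concolor"   # name used in an older dataset
tree_name <- "Puma concolor"    # tip label on a recent tree

# The strings differ, but both resolve to the same accepted name
same_string  <- identical(data_name, tree_name)
same_species <- identical(accepted_name(data_name), accepted_name(tree_name))

same_string                     # FALSE
same_species                    # TRUE
accepted_name("Panthera leo")   # NA: not in the toy table
```

A real authority plays the role of the `synonyms` table at a much larger scale, which is why the choice of authority described next matters.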
Most authorities below are **databases served by the [taxadb](https://docs.ropensci.org/taxadb/) package** (Norman et al. 2020). For example, `authority = "col"` tells `prepR4pcm` to look up synonyms in the taxadb-cached copy of the Catalogue of Life, and so on. The first call for a taxadb provider downloads its database to your local cache (~100 MB); subsequent calls are fast and work offline. One alternative, `"gnverifier"`, is HTTP-backed instead of taxadb: it calls the [Global Names verifier](https://verifier.globalnames.org/) on each lookup. There is no database to download, but each lookup needs network access and the `httr2` package.

- **`col`**: Catalogue of Life ([catalogueoflife.org](https://www.catalogueoflife.org/)). Broad coverage; the most general default for cross-taxon work.
- **`itis`**: Integrated Taxonomic Information System ([itis.gov](https://www.itis.gov/about_itis.html)). North-American emphasis, strong on vertebrates and vascular plants.
- **`gbif`**: Global Biodiversity Information Facility taxonomic backbone (dataset `d7dddbf4-2cf0-4f39-9b2a-bb099caae36c`). Pragmatic synthesis with very wide coverage.
- **`ncbi`**: NCBI Taxonomy ([ncbi.nlm.nih.gov/taxonomy](https://www.ncbi.nlm.nih.gov/taxonomy)). Tracks names that appear in GenBank; most useful for molecular-data workflows.
- **`ott`**: Open Tree Taxonomy ([tree.opentreeoflife.org/about/taxonomy-version](https://tree.opentreeoflife.org/about/taxonomy-version)). Note: `ott` here is a taxadb authority name, not an R package. The R package that *retrieves trees* from Open Tree of Life is called [`rotl`](https://docs.ropensci.org/rotl/), which is separate. Use `authority = "ott"` if you also use `pr_get_tree(source = "rotl")` and want the synonym-resolution step to use the same taxonomy as the tree.
- **`itis_test`**: small bundled subset of ITIS used for the package's own examples and tests; not a general-purpose authority.
- **`gnverifier`**: Global Names verifier ([verifier.globalnames.org](https://verifier.globalnames.org/)). Verifies names against ~100 authoritative sources (CoL, ITIS, GBIF, NCBI, Open Tree, …) in one HTTP call. Wider source coverage than any single taxadb provider and no ~100 MB local download, but each call needs network access and the `httr2` package.

The taxadb-backed entries mirror the providers documented in `?taxadb::td_create`.

**When should you set `authority`?**

Use `authority = NULL` (skip synonym lookup) when:

- You want a quick offline check with no database download.
- Species names in your data and tree are unlikely to differ much (most formatting differences are caught by the normalisation stage anyway).

Set `authority = "col"` (or another taxadb provider) when names differ because of genuine taxonomic revisions: species moved to a different genus, splits, or lumps. The first run downloads a local database (~100 MB); subsequent runs are fast because the database is cached.

Use `authority = "gnverifier"` when you would rather query the Global Names verifier over HTTP than maintain a local taxadb database. It is the right pick when you want broader source coverage than any one taxadb provider (it consults ~100 sources per call), when you do not want to download a ~100 MB cache, or when you would like the synonym stage to pick up upstream-source improvements without re-downloading anything. The trade-off: every call needs network access (on failure the lookup degrades to "name not found", so the rest of the cascade still runs), and each request adds a round-trip to `verifier.globalnames.org`. Install `httr2` (`install.packages("httr2")`) before first use.

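
The decision rules above can be sketched as a small helper. `pick_authority()` is hypothetical and not a prepR4pcm function; it only checks which optional packages are installed, using base R's `requireNamespace()`, and returns one of the `authority` values described in this section, or `NULL` to skip the synonym stage. Preferring taxadb over httr2 is an arbitrary choice made for this illustration.

```r
# Hypothetical helper (not part of prepR4pcm): choose a value for the
# `authority` argument from what is installed and whether you are offline.
pick_authority <- function(offline = FALSE) {
  has_taxadb <- requireNamespace("taxadb", quietly = TRUE)
  has_httr2  <- requireNamespace("httr2", quietly = TRUE)

  if (has_taxadb) {
    "col"          # taxadb copy of the Catalogue of Life (offline once cached)
  } else if (has_httr2 && !offline) {
    "gnverifier"   # HTTP lookup; needs network access on every call
  } else {
    NULL           # skip synonym lookup
  }
}

# With offline = TRUE the helper never returns "gnverifier"
authority <- pick_authority(offline = TRUE)
authority
```

The returned value could then be passed as `authority = authority` to `reconcile_tree()`. The helper does not check whether the taxadb database has already been downloaded, so a first run with `"col"` still needs network access.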
```{r authority, eval = FALSE}
# Requires taxadb and a local database download (automatic on first use)
result3 <- reconcile_tree(
  x = trait_data,
  tree = tree,
  x_species = "species",
  authority = "col"   # Catalogue of Life
)
```

## Example 4: Pre-built overrides

Researchers often maintain a curated list of known corrections. You can pass these as a data frame, or as a path to a file in CSV format:

> The chunks below use `my_data` and `my_tree` as **hypothetical**
> objects (substitute your own data frame and `phylo` object). They
> are marked `eval = FALSE` so the vignette renders without
> requiring those objects to exist.

```{r overrides-table, eval = FALSE}
# A data frame of known corrections
corrections <- data.frame(
  name_x = c("Corvus sp.", "Turdus merulaa"),
  name_y = c("Corvus corax", "Turdus merula"),
  user_note = c("Only one Corvus in our tree", "Typo in source data")
)

result4 <- reconcile_tree(
  x = my_data,
  tree = my_tree,
  overrides = corrections
)

# Or from a CSV file:
result5 <- reconcile_tree(
  x = my_data,
  tree = my_tree,
  overrides = "lab_corrections.csv"
)
```

Overrides are applied before any other matching stage, so they always take priority.

## Example 5: Multiple datasets against one tree

`reconcile_multi()` reconciles several datasets at once, pooling all unique species names before running the cascade:

```{r multi, eval = FALSE}
# Suppose you have several data frames to reconcile against one tree.
# `my_ecology_data`, `my_morpho_data`, and `my_tree` are **hypothetical**
# user-supplied objects; substitute your own.
datasets <- list(
  traits = trait_data,         # defined above
  ecology = my_ecology_data,   # your own data frame
  morpho = my_morpho_data      # your own data frame
)

result6 <- reconcile_multi(datasets, my_tree)
print(result6)
```

## Key design principles

1. **Conservative**: Names are never silently changed. Ambiguous cases are flagged, not auto-resolved.
2. **Transparent**: Every decision is recorded with match type, score, source, and a human-readable note.
3. **Reproducible**: Database versions are pinned. All parameters used to build the result are stored on the result object itself, so a collaborator can re-run the same reconciliation later.
4. **Practical**: Works with the data types comparative biologists already use: a `data.frame` of trait values (one row per species) and a phylogenetic tree as an `ape::phylo` object.

## Typical workflow

> The chunk below uses **hypothetical** files (`species_traits.csv`,
> `species_tree.nwk`); substitute your own paths. The chunk is
> marked `eval = FALSE` so it doesn't try to read files that don't
> exist when the vignette is rendered.

```{r workflow, eval = FALSE}
library(prepR4pcm)

# 1. Load your data and tree (hypothetical paths -- substitute your own)
my_data <- read.csv("species_traits.csv")
my_tree <- ape::read.tree("species_tree.nwk")

# 2. Reconcile
result <- reconcile_tree(my_data, my_tree, authority = "col")

# 3. Review
print(result)
reconcile_summary(result, detail = "mismatches_only")

# 4. Fix manually if needed
result <- reconcile_override(result, "Corvus sp.", "Corvus corax",
                             note = "Only one Corvus in tree")

# 5. Apply
aligned <- reconcile_apply(result, data = my_data, tree = my_tree,
                           drop_unresolved = TRUE)

# 6. Analyse
# aligned$data and aligned$tree are ready for caper, phytools, MCMCglmm, etc.
```

## References

- Hadfield, J.D. (2010) MCMC methods for multi-response generalized linear mixed models: the MCMCglmm R package. *Journal of Statistical Software* 33:1--22. DOI 10.18637/jss.v033.i02
- Orme, D., Freckleton, R., Thomas, G., Petzoldt, T., Fritz, S., Isaac, N. & Pearse, W. (2025) caper: Comparative Analyses of Phylogenetics and Evolution in R. R package version 1.0.4. DOI 10.32614/CRAN.package.caper
- Paradis, E. & Schliep, K. (2019) ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. *Bioinformatics* 35:526--528. DOI 10.1093/bioinformatics/bty633
- Revell, L.J. (2024) phytools 2.0: an updated R ecosystem for phylogenetic comparative methods (and other things). *PeerJ* 12:e16505. DOI 10.7717/peerj.16505