---
title: "Getting started with soilKey"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with soilKey}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

`soilKey` provides automated soil profile classification under WRB 2022 (4th edition), SiBCS 5ª ed. (2018), and USDA Soil Taxonomy (13th edition, 2022). The taxonomic key itself is implemented as deterministic R code driven by versioned YAML rules; vision-language extraction, spatial priors, and OSSL-based attribute prediction sit alongside it as modular layers, never inside it.

# 0. The 30-second on-ramp

If you just want to see soilKey work end-to-end on a real profile -- without writing any R code -- there are two paths.

## A. Zero-code GUI

```{r run-demo, eval = FALSE}
library(soilKey)
run_demo()   # opens a one-screen Shiny app in your browser
```

Pick one of 31 canonical profiles from the dropdown (or upload your own horizons CSV), click **Classify**, and read the WRB / SiBCS / USDA names plus the deterministic key trace and the evidence grade.

## B. One R call, one fixture

```{r quick-start}
library(soilKey)
pedon <- make_ferralsol_canonical()   # canonical Latossolo Vermelho
classify_wrb2022(pedon, on_missing = "silent")$name
classify_sibcs(pedon)$name
classify_usda(pedon, on_missing = "silent")$name
```

That is the whole package: `PedonRecord` in, classification out. The remaining sections walk through how to build your own pedon and how the side modules (VLM, spatial, spectral) fit together.

# 1. Building a PedonRecord from scratch

`PedonRecord` is the central data carrier. It bundles site metadata, the horizons table (with a fixed canonical schema -- see `horizon_column_spec()` for the full list of columns), and optional spectra, images, documents, and a per-attribute provenance log.

```{r build-pedon}
my_pedon <- PedonRecord$new(
  site = list(
    id = "example-001",
    lat = -22.5,
    lon = -43.7,
    country = "BR",
    parent_material = "gneiss"
  ),
  horizons = data.frame(
    top_cm      = c(0, 15, 65, 130),
    bottom_cm   = c(15, 65, 130, 200),
    designation = c("A", "Bw1", "Bw2", "C"),
    clay_pct    = c(50, 60, 65, 60),
    silt_pct    = c(15, 10, 8, 8),
    sand_pct    = c(35, 30, 27, 32),
    cec_cmol    = c(8, 5, 4.5, 4),
    bs_pct      = c(20, 12, 10, 11),
    ph_h2o      = c(4.8, 4.9, 5.0, 5.1),
    oc_pct      = c(2.0, 0.4, 0.2, 0.1)
  )
)

my_pedon$validate()
```

The validator catches inverted depths, texture sums far from 100, implausible pH, sum of bases above CEC, Munsell out-of-range, and a handful of other soil-physical sanity checks.

# 2. Canonical fixtures

`soilKey` ships sixteen canonical fixtures designed so that exactly one of the eleven v0.2 diagnostics passes on each. Each profile also classifies cleanly through the wired WRB key.

```{r fixture-list}
fixtures <- list(
  Ferralsol  = make_ferralsol_canonical(),
  Luvisol    = make_luvisol_canonical(),
  Acrisol    = make_acrisol_canonical(),
  Lixisol    = make_lixisol_canonical(),
  Alisol     = make_alisol_canonical(),
  Chernozem  = make_chernozem_canonical(),
  Kastanozem = make_kastanozem_canonical(),
  Phaeozem   = make_phaeozem_canonical(),
  Calcisol   = make_calcisol_canonical(),
  Gypsisol   = make_gypsisol_canonical(),
  Solonchak  = make_solonchak_canonical(),
  Cambisol   = make_cambisol_canonical(),
  Plinthosol = make_plinthosol_canonical(),
  Podzol     = make_podzol_canonical(),
  Gleysol    = make_gleysol_canonical(),
  Vertisol   = make_vertisol_canonical()
)

ferralsol <- fixtures$Ferralsol
ferralsol
```

# 3. Calling the diagnostics directly

Every diagnostic returns a `DiagnosticResult` carrying the per-sub-test evidence, the missing-attribute report, the indices of the layers that satisfied the test, and the WRB literature reference.

```{r ferralic}
ferralic(ferralsol)
```

# 4. Diagnostic matrix across the canonical fixtures

```{r matrix, results='asis'}
diagnostics <- c("argic", "ferralic", "mollic",
                 "calcic", "gypsic", "salic",
                 "cambic", "plinthic", "spodic",
                 "gleyic_properties", "vertic_properties")

mat <- vapply(fixtures, function(p) {
  vapply(diagnostics, function(d) {
    fn <- get(d, envir = asNamespace("soilKey"))
    isTRUE(fn(p)$passed)
  }, logical(1))
}, logical(length(diagnostics)))

knitr::kable(t(mat))
```

Every fixture activates exactly one diagnostic (or, for the argic-derived RSGs Acrisol / Lixisol / Alisol / Luvisol, just the shared `argic`).

# 5. RSG-derived diagnostics: argic and mollic families

The argic horizon is shared by four RSGs -- Acrisols, Lixisols, Alisols, Luvisols -- which differ by clay activity (CEC per kg clay) and chemistry (BS or Al saturation). soilKey provides one diagnostic per RSG that runs `argic()` internally, then applies the activity and chemistry tests on the argic layer:

```{r argic-derived}
acrisol(make_acrisol_canonical())$passed
lixisol(make_lixisol_canonical())$passed
alisol (make_alisol_canonical())$passed
luvisol(make_luvisol_canonical())$passed
```

Same pattern for the mollic-derived family (Chernozems / Kastanozems / Phaeozems):

```{r mollic-derived}
chernozem (make_chernozem_canonical())$passed
kastanozem(make_kastanozem_canonical())$passed
phaeozem  (make_phaeozem_canonical())$passed
```

# 6. End-to-end WRB classification

`classify_wrb2022()` consumes a `PedonRecord` and runs it through the YAML key (`inst/rules/wrb2022/key.yaml`). v0.2 wires 16 of 32 RSGs end-to-end; the other 16 are stubbed with `not_implemented_v01:` markers and return NA in the trace.

```{r classify-fr}
classify_wrb2022(ferralsol)
```

```{r classify-all}
classifications <- vapply(fixtures, function(p) {
  classify_wrb2022(p, on_missing = "silent")$rsg_or_order
}, character(1))

data.frame(fixture = names(classifications),
           assigned_rsg = classifications)
```

Each canonical fixture maps to its intended RSG.
The trace shows which RSGs were tested, in canonical key order, before the assigned one.

# 7. Provenance and evidence grade

`PedonRecord$add_measurement()` records a value's provenance in a structured log. The final `ClassificationResult$evidence_grade` summarises that log on an A--D scale: A means every recorded value was laboratory-measured, D means the result rests on attributes extracted by VLM or assumed by the user.

```{r provenance}
ferralsol_v <- make_ferralsol_canonical()

# Mark the Bw1 clay value as predicted from spectroscopy
ferralsol_v$add_measurement(
  horizon_idx = 4,
  attribute   = "clay_pct",
  value       = 60,
  source      = "predicted_spectra",
  confidence  = 0.85,
  overwrite   = TRUE
)

classify_wrb2022(ferralsol_v)$evidence_grade
```

```{r provenance-vlm}
ferralsol_w <- make_ferralsol_canonical()
ferralsol_w$add_measurement(1, "clay_pct", 50, "extracted_vlm",
                            confidence = 0.7, overwrite = TRUE)
classify_wrb2022(ferralsol_w)$evidence_grade
```

# 8. Interoperability with `aqp`

`PedonRecord$to_aqp()` returns an `aqp::SoilProfileCollection`, allowing soilKey to plug into aqp's plotting and aggregation tooling without owning that infrastructure:

```{r to-aqp, eval = requireNamespace("aqp", quietly = TRUE)}
spc <- ferralsol$to_aqp()
class(spc)
aqp::profile_id(spc)
```

# 9. Module 4 -- OSSL spectroscopy bridge (gap-filling)

When some horizon attributes are missing, but the profile carries Vis-NIR or MIR spectra, soilKey can fill the gaps via the Open Soil Spectral Library. The pipeline preprocesses the spectra (SNV / SG1 / trim), dispatches to a memory-based or PLSR backend, and writes each prediction into the `PedonRecord` with provenance `predicted_spectra` -- which the authority hierarchy treats as below laboratory-measured but above VLM-extracted values. The PI95 prediction interval is mapped to a `[0, 1]` confidence score via `pi_to_confidence()`.

```{r ossl, eval = FALSE}
# Synthetic example -- a profile with measured spectra but missing CEC.
pr_spec <- make_synthetic_pedon_with_spectra(n_horizons = 4)
pr_spec$horizons$cec_cmol <- NA_real_   # erase CEC

# Predict via memory-based learning against the OSSL global library.
pr_filled <- fill_from_spectra(
  pedon   = pr_spec,
  backend = "mbl",            # or "plsr_local" / "pretrained"
  attrs   = c("cec_cmol")     # which attributes to gap-fill
)

# Each predicted cell is logged with provenance source = "predicted_spectra".
pr_filled$provenance
classify_wrb2022(pr_filled)$evidence_grade   # B (predicted_spectra present)
```

# 10. Module 3 -- SoilGrids / Embrapa spatial prior (sanity check)

Once the deterministic key has reached a verdict, soilKey can **cross-check** that verdict against a spatial prior derived from ISRIC SoilGrids (global) or the Embrapa raster (Brazil). The prior *never* overrides the key -- it only attaches a `prior_check` entry to the result and emits a warning if the deterministic outcome lies in a low-probability region of the prior.

```{r prior, eval = FALSE}
prior <- spatial_prior(lon = -43.7, lat = -22.5, source = "auto")
prior   # data.table of (rsg_code, probability)

res <- classify_wrb2022(
  pedon           = ferralsol,
  prior           = prior,
  prior_threshold = 0.01   # warn if assigned RSG has prior < 1%
)
res$prior_check
```

# 11. Module 2 -- Multimodal extraction via `ellmer`

A field PDF or photo can be turned into a `PedonRecord` via the `extract_*` functions, each driven by an `ellmer` chat object (Anthropic, OpenAI, Google, or Ollama). The output is a schema-validated JSON (draft-07, in `inst/schemas/`) with `{value, confidence, source_quote}` per attribute, then merged into the `PedonRecord` with provenance `extracted_vlm`.

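The per-attribute `{value, confidence, source_quote}` shape described above can be checked in a few lines of base R. This is only an illustrative sketch: `is_valid_cell()` is a hypothetical helper, not part of soilKey, and the `[0, 1]` bound on `confidence` is an assumption here. The package's own validation runs against the JSON schemas in `inst/schemas/`.

```r
# Hypothetical helper (not part of soilKey): checks that one extracted
# attribute carries the three fields value, confidence and source_quote.
# Assumption: confidence is a single number in [0, 1].
is_valid_cell <- function(cell) {
  is.list(cell) &&
    all(c("value", "confidence", "source_quote") %in% names(cell)) &&
    is.numeric(cell$confidence) && length(cell$confidence) == 1L &&
    !is.na(cell$confidence) &&
    cell$confidence >= 0 && cell$confidence <= 1 &&
    is.character(cell$source_quote) && length(cell$source_quote) == 1L
}

good <- list(value = 30, confidence = 0.9,
             source_quote = "30% clay (table 1)")
bad  <- list(value = 30, confidence = 1.4)   # no source_quote, confidence > 1

is_valid_cell(good)
is_valid_cell(bad)
```

A check of this kind is what allows a malformed reply to be rejected and retried before anything is merged into the `PedonRecord`.
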
The package ships a `MockVLMProvider` (R6) so the validation + retry loop can be exercised in tests without an API key:

```{r vlm-mock, eval = FALSE}
mock <- MockVLMProvider$new(
  responses = list(
    list(horizons = list(
      list(top_cm = 0, bottom_cm = 15, designation = "A",
           clay_pct = list(value = 30, confidence = 0.9,
                           source_quote = "30% clay (table 1)")),
      list(top_cm = 15, bottom_cm = 65, designation = "Bw",
           clay_pct = list(value = 55, confidence = 0.85,
                           source_quote = "Bw horizon, 55% clay"))
    ))
  )
)

pr_extracted <- extract_horizons_from_pdf(
  pdf_path = "fieldsheet.pdf",
  provider = mock   # in production: vlm_provider("anthropic")
)

classify_wrb2022(pr_extracted)$evidence_grade   # C or D depending on cell coverage
```

For real use:

```{r vlm-real, eval = FALSE}
chat <- vlm_provider("anthropic", model = "claude-sonnet-4-5")
pr   <- extract_horizons_from_pdf("RADAMBRASIL_perfil_007.pdf", provider = chat)
res  <- classify_wrb2022(pr)
res
```

# 12. SiBCS 5ª edição (Embrapa, 2018)

soilKey ships the parallel SiBCS key alongside WRB 2022. The 13 ordens are wired in canonical Cap 4 order; calling `classify_sibcs()` on any `PedonRecord` runs the same engine that backs `classify_wrb2022()`.

```{r sibcs-demo}
# A canonical Latossolo (Brazilian Ferralsol equivalent)
pr_lat <- make_latossolo_canonical()
classify_sibcs(pr_lat, on_missing = "silent")$rsg_or_order

# A canonical Argissolo (B textural, low BS)
pr_arg <- make_argissolo_canonical()
classify_sibcs(pr_arg, on_missing = "silent")$rsg_or_order

# A canonical Nitossolo (clay >=35% throughout, B/A <=1.5, cerosidade)
pr_nit <- make_nitossolo_canonical()
classify_sibcs(pr_nit, on_missing = "silent")$rsg_or_order

# Cross-system: the SAME profile classified by both keys
classify_wrb2022(pr_lat, on_missing = "silent")$rsg_or_order
classify_sibcs(pr_lat, on_missing = "silent")$rsg_or_order
```

The diagnostic helpers also have Portuguese names that match the SiBCS literature.
For example:

```{r sibcs-atributos}
# Atividade da fração argila (Ta vs Tb) per Cap 1, p 30
atividade_argila_alta(make_luvissolo_canonical())$passed   # TRUE  -> Ta
atividade_argila_alta(make_nitossolo_canonical())$passed   # FALSE -> Tb

# Caráter alítico (Cap 1, p 32): Al >= 4 cmol_c/kg + sat Al >= 50% + V < 50%
carater_alitico(make_argissolo_canonical())$passed
```

# 13. v0.7 scope and the v0.3.3+ roadmap

| Version | Scope |
|----------|----------------------------------------------------------------------------------------------------------------------|
| v0.1 | Core classes; argic, ferralic, mollic; Ferralsols path |
| v0.2 | +calcic, gypsic, salic, cambic, plinthic, spodic, gleyic, vertic; +AC/LX/AL/LV/CH/KS/PH RSG diagnostics; 16/32 wired |
| v0.3 | +histic, leptic, arenic, umbric, duric, technic, andic, fluvic, natric, nitic, planic, stagnic, retic, cryic, anthric; full WRB key (32/32 RSGs wired); 31 canonical fixtures |
| v0.3.1 | Tier-1 corrections vs WRB 2022 Ch 3.1: argic 6/1.4/20 + band 50, ferralic drops ECEC, duric 10/10, vertic >=25 cm, salic alkaline + product gate |
| v0.3.2 | RSG order in `key.yaml` aligned to canonical WRB 2022 Ch 4 (PL/ST before NT/FR; FL before AR) |
| **v0.4** | **Module 4 -- OSSL spectroscopy bridge (MBL, PLSR-local, pretrained)** |
| **v0.5** | **Module 3 -- SoilGrids spatial prior + Embrapa raster (sanity-check, never overrides)** |
| **v0.6** | **Module 2 -- Multimodal extraction (PDF / photo / fieldsheet) via `ellmer`, schema-validated** |
| **v0.3.3** | **Complete WRB Ch 3.1 / 3.2 / 3.3 coverage** -- +18 horizons, +12 properties, +16 materials. Schema +24 columns. |
| **v0.3.4** | **Tier-2 RSG gate strengthening** -- vertisol, andosol, gleysol, planosol, ferralsol, chernozem_strict, kastanozem_strict wired into key.yaml; spodic refined to disambiguate from andic. |
| **v0.3.5** | **Closes WRB Ch 3.1 -- 32/32 horizons** (+tsitelic, panpaic, limonic, protovertic). |
| **v0.7** | **Module 6 -- SiBCS 5ª ed. (Embrapa, 2018) implemented in full**: 17 atributos diagnósticos + 24 horizontes diagnósticos + 13 ordens RSG-level following the canonical Cap 4 key (O→R→V→E→S→G→L→M→C→F→T→N→P). 13 canonical fixtures, all classify correctly; 30 new tests; +830 expectations total in the suite. |
| v0.8 | Module 5 -- USDA Soil Taxonomy parallel key (12 orders) |
| v0.9 | All ~202 WRB qualifiers + 10 specifiers; vignettes 05-09; WoSIS benchmark |
| v1.0 | CRAN submission and methodological paper |

See `ARCHITECTURE.md` (in the package root) for the full design rationale.