---
title: "Getting started with bpgmm"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with bpgmm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

`bpgmm` fits Bayesian parsimonious Gaussian mixture models for model-based
clustering. It targets three posterior inference goals described by Lu, Li, and
Love (2021): the partition of observations, the number of clusters, and the
cluster covariance structure.

The vignettes have separate roles:

| Article | Use it for |
|---|---|
| `data-preparation` | matrix orientation, scaling, and choosing `m_range` and `q_new` |
| `model-and-sampler` | formulas from the paper and their package arguments |
| `examples` | inspecting one fitted object |
| `model-selection` | RJMCMC summaries for \(m\) and covariance model \(v\) |
| `variable-prioritization` | exploratory variable rankings from posterior output |
| `posterior-diagnostics` | independent chains, traces, and co-clustering |

## Fit a model

Input data should be a numeric matrix with variables in rows and observations
in columns. The simulated example below has two variables and eight
observations: four centered near -2 and four centered near 2 on both variables.
The chain fitted to it is deliberately short so that the vignette runs quickly;
applied analyses should use much larger `burn` and `niter` values.

```{r}
library(bpgmm)

set.seed(2026)

X <- cbind(
  matrix(rnorm(8, mean = -2, sd = 0.2), nrow = 2),
  matrix(rnorm(8, mean = 2, sd = 0.2), nrow = 2)
)
known_labels <- rep(1:2, each = 4)
```
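
Data frames usually store observations in rows, which is the transpose of the
layout described above. The chunk below is a minimal base-R sketch of the
conversion and of per-variable standardization. It uses the built-in `iris`
data purely as a stand-in, and the object names `obs_by_var`, `X_iris`, and
`X_iris_std` are illustrative rather than part of `bpgmm`.

```{r}
# Built-in data with observations in rows and variables in columns.
obs_by_var <- as.matrix(iris[, 1:4])
dim(obs_by_var)

# Transpose so that variables are in rows and observations in columns.
X_iris <- t(obs_by_var)
dim(X_iris)

# scale() standardizes columns, so standardize the observation-by-variable
# matrix first and transpose afterwards.
X_iris_std <- t(scale(obs_by_var))

# Each row (variable) now has mean 0 and standard deviation 1.
round(rowMeans(X_iris_std), 10)
apply(X_iris_std, 1, sd)
```

Whether standardization is appropriate depends on the analysis; it is shown
here only because variables on very different scales are a common source of
trouble.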

The scatter plot gives a direct check of the simulated partition. Each point is
one observation, and the colors show the reference labels used later for the
ARI calculation.

```{r, fig.width = 5.5, fig.height = 4, fig.alt = "Scatter plot of the small two-cluster example."}
plot(
  X[1, ], X[2, ],
  col = c("#0072B2", "#D55E00")[known_labels],
  pch = 19,
  xlab = "Variable 1",
  ylab = "Variable 2",
  main = "Small two-cluster example",
  asp = 1
)
legend(
  "topleft",
  legend = paste("Reference", sort(unique(known_labels))),
  col = c("#0072B2", "#D55E00"),
  pch = 19,
  bty = "n"
)
```

The sampler is run with `pgmm_rjmcmc()`. `capture.output()` collects anything
the call prints, so that only the last line is displayed.

```{r}
fit_log <- capture.output({
  fit <- pgmm_rjmcmc(
    X = X,
    m_init = 2,
    m_range = c(1, 3),
    q_new = 1,
    burn = 1,
    niter = 3,
    constraint = model_to_constraint("UUU"),
    m_step = 0,
    v_step = 0,
    verbose = FALSE
  )
})
tail(fit_log, 1)
```

With `m_step = 0` and `v_step = 0`, the call holds the number of clusters at
\(m = 2\) and keeps the covariance model at `"UUU"`, which makes the example
fast. The important arguments are:

- `m_init` is the initial number of clusters.
- `m_range` is the allowed range for the number of clusters.
- `q_new` is the number of latent factors for a new cluster.
- `constraint` selects the starting covariance model.
- `m_step` controls reversible jump MCMC (RJMCMC) updates for the number of
  clusters: `1` allows them, and `0` (used above) keeps \(m\) fixed.
- `v_step` controls RJMCMC updates across covariance-constraint models: `1`
  allows them, and `0` (used above) keeps the starting model.

## Summarize posterior samples

```{r}
summary <- summarize_pgmm_rjmcmc(fit, true_cluster = known_labels)

summary$allocation
summary$n_clusters
summary$n_constraints
summary$ari
```

If a reference partition is available, pass it as `true_cluster` to calculate
the adjusted Rand index.
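
For intuition about what the index measures, the chunk below is a minimal
base-R sketch of the adjusted Rand index computed from a contingency table of
two label vectors. The helper `adjusted_rand_index()` is hypothetical: it is
not exported by `bpgmm` and is not necessarily how `summary$ari` is computed.

```{r}
# Hypothetical helper, not part of bpgmm: adjusted Rand index of two labelings.
adjusted_rand_index <- function(labels_a, labels_b) {
  stopifnot(length(labels_a) == length(labels_b), length(labels_a) > 1)
  counts <- table(labels_a, labels_b)

  # Pairs of observations that share a cell, a row, or a column of the table.
  pairs_cells <- sum(choose(counts, 2))
  pairs_rows <- sum(choose(rowSums(counts), 2))
  pairs_cols <- sum(choose(colSums(counts), 2))

  # Agreement expected by chance and the largest attainable agreement.
  expected <- pairs_rows * pairs_cols / choose(length(labels_a), 2)
  maximum <- (pairs_rows + pairs_cols) / 2

  # Degenerate case, for example when both partitions have a single cluster.
  if (isTRUE(all.equal(maximum, expected))) {
    return(1)
  }
  (pairs_cells - expected) / (maximum - expected)
}

reference <- rep(1:2, each = 4)

# Same partition with the two labels swapped.
adjusted_rand_index(reference, c(2, 2, 2, 2, 1, 1, 1, 1))

# A partition that cuts across the reference clusters.
adjusted_rand_index(reference, c(1, 2, 1, 2, 1, 2, 1, 2))
```

The index depends only on which observations are grouped together, so swapping
cluster labels does not change it. Values near 1 indicate close agreement with
the reference partition, and values near 0 or below indicate agreement no
better than chance.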

The summary allocation can be plotted back on the original coordinates. Treat
this plot as a first sanity check, to be followed by longer chains and
convergence diagnostics.

```{r, fig.width = 5.5, fig.height = 4, fig.alt = "Scatter plot colored by posterior modal allocation."}
plot(
  X[1, ], X[2, ],
  col = c("#009E73", "#CC79A7", "#E69F00")[summary$allocation],
  pch = 19,
  xlab = "Variable 1",
  ylab = "Variable 2",
  main = "Posterior modal allocation",
  asp = 1
)
text(X[1, ], X[2, ], labels = seq_along(summary$allocation), pos = 3, cex = 0.75)
legend(
  "topleft",
  legend = paste("Cluster", sort(unique(summary$allocation))),
  col = c("#009E73", "#CC79A7", "#E69F00")[sort(unique(summary$allocation))],
  pch = 19,
  bty = "n"
)
```

## Common early mistakes

Before running a long chain, check that none of the following applies:

- `X` is accidentally supplied as observations by variables instead of
  variables by observations;
- variables with very different measurement scales are not standardized;
- `m_init` is outside `m_range`;
- `q_new` is too large for a small number of variables;
- `m_step` and `v_step` are left at zero when model selection is desired.
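
Some of these items can be checked mechanically before a long run. The chunk
below is a minimal sketch of such a check. The function `check_pgmm_inputs()`
is hypothetical and not part of `bpgmm`; it assumes that `m_range` is given as
`c(min, max)`, as in the call above, and it does not try to bound `q_new` from
above because the appropriate limit depends on the model.

```{r}
# Hypothetical pre-flight check, not part of bpgmm.
check_pgmm_inputs <- function(X, m_init, m_range, q_new) {
  stopifnot(
    "X must be a numeric matrix" = is.matrix(X) && is.numeric(X),
    "m_range must be c(min, max)" =
      length(m_range) == 2 && m_range[1] <= m_range[2],
    "m_init must lie inside m_range" =
      m_init >= m_range[1] && m_init <= m_range[2],
    "q_new must be a positive whole number" =
      q_new >= 1 && q_new == round(q_new)
  )

  # Heuristic only: more rows than columns often means observations are in rows.
  if (nrow(X) > ncol(X)) {
    warning("X has more rows than columns; are variables really in rows?")
  }

  # Row-wise standard deviations, to reveal variables on very different scales.
  apply(X, 1, sd)
}

# Toy matrix with two variables (rows) and four observations (columns).
toy <- rbind(
  c(-2.0, -1.9, 2.1, 2.0),
  c(-2.2, -2.0, 1.9, 2.1)
)

check_pgmm_inputs(toy, m_init = 2, m_range = c(1, 3), q_new = 1)

# An initial value outside the allowed range is reported as an error.
tryCatch(
  check_pgmm_inputs(toy, m_init = 4, m_range = c(1, 3), q_new = 1),
  error = function(e) conditionMessage(e)
)
```

The returned standard deviations are only a prompt for a closer look; whether
`m_step` and `v_step` are set as intended still has to be confirmed in the
call itself.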

## Naming convention

Starting with version 1.2.0, the public API uses snake_case names throughout.
Use `pgmm_rjmcmc()`, `summarize_pgmm_rjmcmc()`, and snake_case argument names
such as `m_init`, `m_range`, and `q_new`.

## Citation

If you use `bpgmm` in published work, please cite the package and the
methodology paper:

```{r}
citation("bpgmm")
```

Lu, X., Li, Y., & Love, T. (2021). On Bayesian Analysis of Parsimonious
Gaussian Mixture Models. *Journal of Classification*, 38, 576-593.
https://doi.org/10.1007/s00357-021-09391-8