Before calling pgmm_rjmcmc(), the analyst must specify
the data orientation, scaling, missing-value handling, latent-factor
dimension, cluster range, and starting covariance model.
bpgmm expects a numeric matrix with variables in rows
and observations in columns. Many R data sets use the opposite
convention, so transpose after selecting numeric variables.
The package convention is
\[ X = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{p1} & \cdots & x_{pn} \end{bmatrix}, \]
where row \(j\) is variable \(j\) and column \(i\) is observation \(x_i\).
library(bpgmm)
#> bpgmm 1.3.1 loaded. If you use bpgmm in published work, please cite it with citation("bpgmm").
iris_numeric <- as.matrix(iris[, 1:4])
iris_labels <- as.integer(iris$Species)
dim(iris_numeric)
#> [1] 150   4
X <- t(iris_numeric)
dim(X)
#> [1]   4 150
Rows now correspond to variables and columns correspond to observations.
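A quick sanity check can catch a forgotten transpose before it reaches the sampler. The sketch below uses only base R. `check_orientation()` is a hypothetical helper written for this illustration, not a bpgmm function, and the "more rows than columns" warning is only a heuristic that assumes there are fewer variables than observations.

```r
# Hypothetical helper (not part of bpgmm): basic checks on the input matrix.
check_orientation <- function(X) {
  stopifnot(is.matrix(X), is.numeric(X))
  p <- nrow(X)  # number of variables under the package convention
  n <- ncol(X)  # number of observations under the package convention
  if (p > n) {
    # Heuristic only: most data sets have more observations than variables.
    warning("More rows than columns: is X still observations-by-variables?")
  }
  c(variables = p, observations = n)
}

X_demo <- t(as.matrix(iris[, 1:4]))
check_orientation(X_demo)
```

For genuinely high-dimensional data, where variables outnumber observations, the warning can be ignored.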
The sampler requires finite numeric values. Handle missing values before fitting. Common choices include complete-case filtering, domain-specific imputation, or fitting the model to a subset of variables with reliable measurements.
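Complete-case filtering in the variables-in-rows orientation means dropping columns, not rows. The following base R sketch illustrates this on a made-up toy matrix; `X_toy` and its values are invented for the example.

```r
# Toy matrix in the package orientation: 3 variables (rows), 5 observations (columns).
# matrix() fills column by column, so each group of three values is one observation.
X_toy <- matrix(c( 1,  2,   3,
                   4, NA,   6,
                   7,  8,   9,
                  10, 11, Inf,
                  13, 14,  15), nrow = 3)

# Keep only observations (columns) in which every value is finite.
# is.finite() is FALSE for NA, NaN, Inf, and -Inf.
keep <- colSums(!is.finite(X_toy)) == 0
X_complete <- X_toy[, keep, drop = FALSE]  # drop = FALSE keeps the result a matrix

dim(X_complete)
```

Using `is.finite()` rather than `is.na()` also removes infinite values, which matches the requirement of finite numeric input.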
Mixture models are sensitive to measurement scale. If variables are measured in different units, standardizing each variable is usually a sensible default. For each variable \(j\), the usual transformation is
\[ x_{ji}^{\mathrm{scaled}} = \frac{x_{ji} - \bar{x}_{j\cdot}}{s_j}, \qquad s_j^2 = \frac{1}{n - 1}\sum_{i=1}^n (x_{ji} - \bar{x}_{j\cdot})^2 . \]
X_scaled <- t(scale(t(X)))
round(rowMeans(X_scaled), 6)
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width
#>            0            0            0            0
round(apply(X_scaled, 1, sd), 6)
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width
#>            1            1            1            1
The package does not scale internally because some scientific applications need the original measurement scale. Scaling should be an explicit analysis choice.
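To make the standardization formula concrete, the sketch below computes it by hand in base R and compares the result with the `scale()` route. The guard against zero-variance variables is an addition for this illustration: a constant variable has \(s_j = 0\), and dividing by it would produce non-finite values.

```r
X_raw <- t(as.matrix(iris[, 1:4]))  # variables in rows, observations in columns

row_means <- rowMeans(X_raw)
row_sds <- apply(X_raw, 1, sd)      # sd() uses the n - 1 denominator

# A constant variable cannot be standardized; stop early instead of producing NaN.
stopifnot(all(row_sds > 0))

# Vectors recycle down the columns of a matrix, so element [j, i] is centered
# and scaled with the mean and standard deviation of row (variable) j.
X_manual <- (X_raw - row_means) / row_sds

# Compare with the scale()-based route, ignoring attributes.
all.equal(X_manual, t(scale(t(X_raw))), check.attributes = FALSE)
```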
q_new
q_new is the latent-factor dimension assigned to newly
proposed clusters. It controls the dimension of the factor-analyzer part
of the covariance model. In the paper’s notation,
\[ \Lambda_k \in \mathbb{R}^{p \times q_k}, \qquad y_{ki} \in \mathbb{R}^{q_k}. \]
The package uses q_new as the \(q_k\) value for a newly created
component.
Useful starting points:
- q_new = 1 for very small examples or when covariance structure should be simple.
- q_new = 2 or 3 for moderate-dimensional data.

The value should be smaller than the number of observed variables.
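The constraint on the latent dimension can be checked before fitting. The helper below is a hypothetical base R sketch, not a bpgmm function. It assumes that q_new is a single positive integer, which is an assumption about the interface rather than something stated here.

```r
# Hypothetical pre-flight check (not part of bpgmm).
# Assumption: q_new is a single positive whole number.
check_q_new <- function(q_new, X) {
  p <- nrow(X)  # number of observed variables (rows of X)
  stopifnot(
    length(q_new) == 1,
    q_new == round(q_new),
    q_new >= 1,
    q_new < p       # latent dimension must be below the number of variables
  )
  invisible(q_new)
}

X_demo <- t(as.matrix(iris[, 1:4]))
check_q_new(2, X_demo)    # passes silently: 2 < 4
# check_q_new(4, X_demo)  # would stop with an error: 4 is not smaller than 4
```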
m_range
m_range is the allowed cluster-number range. A wide
range gives RJMCMC more freedom, but also increases the model space.
Start with a scientifically reasonable range, then assess
sensitivity.
For data with a known reference label, such as iris, the
reference partition gives a simple check on the range. In unsupervised
applications, use domain knowledge and exploratory plots.
species_cols <- c("#0072B2", "#D55E00", "#009E73")
plot(
X_scaled[1, ], X_scaled[2, ],
col = species_cols[iris_labels],
pch = 19,
xlab = rownames(X_scaled)[1],
ylab = rownames(X_scaled)[2],
main = "Scaled iris data",
asp = 1
)
legend(
"topleft",
legend = levels(iris$Species),
col = species_cols,
pch = 19,
bty = "n"
)
The three-letter model labels describe whether loading matrices and
noise covariances are shared across clusters. UUU is
flexible; CCC is more constrained. A flexible starting
model is often reasonable when using v_step = 1, because
the sampler can move across covariance structures.
The fitted covariance is always
\[ \Sigma_k = \Lambda_k\Lambda_k^\top + \Psi_k, \]
but the label controls whether \(\Lambda_k\) and \(\Psi_k\) are shared and whether \(\Psi_k\) is isotropic or diagonal.
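The decomposition can be illustrated numerically for a single component. The base R sketch below builds \(\Sigma_k\) from a simulated loading matrix with either diagonal or isotropic noise. All values are made up for the example and none of the code calls bpgmm.

```r
set.seed(1)
p <- 4  # observed variables
q <- 2  # latent factors

# Simulated loading matrix Lambda_k (p x q); values are illustrative only.
Lambda <- matrix(rnorm(p * q), nrow = p, ncol = q)

# Two forms of the noise covariance Psi_k.
Psi_diag <- diag(c(0.5, 0.8, 0.3, 0.6))  # diagonal: one variance per variable
Psi_iso <- 0.5 * diag(p)                 # isotropic: one shared variance

# tcrossprod(Lambda) computes Lambda %*% t(Lambda).
Sigma_diag <- tcrossprod(Lambda) + Psi_diag
Sigma_iso <- tcrossprod(Lambda) + Psi_iso

# The low-rank part has rank at most q; adding Psi makes Sigma positive definite.
qr(tcrossprod(Lambda))$rank
isSymmetric(Sigma_diag)
min(eigen(Sigma_diag, symmetric = TRUE)$values) > 0
```

Sharing \(\Lambda_k\) or \(\Psi_k\) across clusters would correspond to reusing the same `Lambda` or `Psi` object for every component.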
The call below is not evaluated in the vignette because applied analyses should use longer chains and repeated runs. It records how the prepared objects enter the package interface.
Before fitting:
- X is numeric with variables in rows;
- m_range;
- q_new smaller than the number of variables;