bpgmm performs Bayesian model selection for the number
of clusters and PGMM covariance structure. It does not currently
implement a formal spike-and-slab or inclusion-indicator
variable-selection prior. Still, users often want to understand which
observed variables help explain the fitted clustering.
The calculations below:
bpgmm;These summaries are diagnostic and exploratory. Treat them as variable prioritization, not as formal posterior variable inclusion probabilities.
The simulation differs from the model-selection example. The aim is not to recover \(m\), but to examine how posterior output can be post-processed to rank variables. The first three variables have different cluster means, variables four and five have mostly covariance differences, and variable six is weak noise.
library(bpgmm)
simulate_screening_data <- function(n_per_cluster = 18, p = 6) {
means <- rbind(
c(-2.5, -1.5, 0.0, 0, 0, 0),
c( 2.0, -0.5, 1.8, 0, 0, 0),
c( 0.0, 2.2, -1.8, 0, 0, 0)
)
covariances <- list(
diag(c(0.35, 0.35, 0.35, 1.40, 0.35, 1.20)),
diag(c(0.35, 0.35, 0.35, 0.35, 1.40, 1.20)),
matrix(c(
0.35, 0, 0, 0, 0, 0,
0, 0.35, 0, 0, 0, 0,
0, 0, 0.35, 0, 0, 0,
0, 0, 0, 1.00, 0.45, 0,
0, 0, 0, 0.45, 1.00, 0,
0, 0, 0, 0, 0, 1.20
), nrow = p)
)
n <- n_per_cluster * 3
X <- matrix(NA_real_, nrow = p, ncol = n)
true_cluster <- rep(seq_len(3), each = n_per_cluster)
column <- 1
for (k in seq_len(3)) {
for (i in seq_len(n_per_cluster)) {
X[, column] <- MASS::mvrnorm(1, mu = means[k, ], Sigma = covariances[[k]])
column <- column + 1
}
}
rownames(X) <- paste0("variable_", seq_len(p))
list(X = X, true_cluster = true_cluster)
}
set.seed(2028)
sim <- simulate_screening_data()
X <- sim$X
true_cluster <- sim$true_clusterThe fitted PGMM still uses
\[ \Sigma_k = \Lambda_k\Lambda_k^\top + \Psi_k, \]
but this simulation deliberately separates mean-driven variables from covariance-driven variables so the two rankings below answer different questions.
fit_log <- capture.output({
fit <- pgmm_rjmcmc(
X = X,
m_init = 3,
m_range = c(1, 5),
q_new = 2,
burn = 2,
niter = 6,
constraint = "UUU",
m_step = 1,
v_step = 1,
verbose = FALSE
)
})
tail(fit_log, 1)
#> character(0)
fit_summary <- summarize_pgmm_rjmcmc(fit, true_cluster = true_cluster)
fit_summary$ari
#> [1] 0.6535139One simple score is the fraction of total variation explained by the posterior modal clusters:
\[ R_j^2 = \frac{\sum_k n_k(\bar{x}_{jk} - \bar{x}_{j\cdot})^2} {\sum_i (x_{ji} - \bar{x}_{j\cdot})^2}. \]
This quantity is not a posterior inclusion probability; it summarizes separation under the fitted partition.
cluster_separation <- function(X, allocation) {
vapply(seq_len(nrow(X)), function(j) {
x <- X[j, ]
overall <- mean(x)
total <- sum((x - overall)^2)
between <- sum(vapply(split(x, allocation), function(group) {
length(group) * (mean(group) - overall)^2
}, numeric(1)))
if (total == 0) 0 else between / total
}, numeric(1))
}
separation <- cluster_separation(X, fit_summary$allocation)
separation_table <- data.frame(
variable = rownames(X),
separation = round(separation, 3)
)
separation_table[order(separation_table$separation, decreasing = TRUE), ]
#> variable separation
#> 2 variable_2 0.809
#> 3 variable_3 0.778
#> 1 variable_1 0.401
#> 5 variable_5 0.305
#> 6 variable_6 0.013
#> 4 variable_4 0.006ordered <- order(separation, decreasing = TRUE)
barplot(
separation[ordered],
names.arg = rownames(X)[ordered],
las = 2,
col = "#56B4E9",
border = NA,
ylab = "Between-cluster variation / total variation",
main = "Exploratory variable separation"
)The loading matrices describe how observed variables relate to latent factors inside each component. A simple posterior diagnostic is the average absolute loading magnitude across saved samples and active components.
\[ L_j = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \frac{1}{|A_s|} \sum_{k \in A_s} \frac{1}{q_k}\sum_{\ell = 1}^{q_k} |\lambda_{jk\ell}^{(s)}|, \]
where \(A_s\) is the set of active clusters in posterior sample \(s\). Large \(L_j\) means variable \(j\) contributes strongly to the latent-factor covariance structure; it is not a posterior inclusion probability.
loading_importance <- function(fit) {
totals <- NULL
count <- 0
for (sample_id in seq_along(fit$lambda_samples)) {
lambdas <- fit$lambda_samples[[sample_id]]
active <- which(fit$active_cluster_samples[[sample_id]] == 1)
for (k in active) {
lambda <- lambdas[[k]]
if (!is.matrix(lambda)) {
next
}
score <- rowMeans(abs(lambda))
if (is.null(totals)) {
totals <- numeric(length(score))
}
totals <- totals + score
count <- count + 1
}
}
if (count == 0) {
return(rep(NA_real_, nrow(X)))
}
totals / count
}
loading_score <- loading_importance(fit)
loading_table <- data.frame(
variable = rownames(X),
mean_abs_loading = round(loading_score, 3)
)
loading_table[order(loading_table$mean_abs_loading, decreasing = TRUE), ]
#> variable mean_abs_loading
#> 5 variable_5 0.505
#> 3 variable_3 0.489
#> 1 variable_1 0.314
#> 6 variable_6 0.295
#> 2 variable_2 0.277
#> 4 variable_4 0.251ordered_loading <- order(loading_score, decreasing = TRUE)
barplot(
loading_score[ordered_loading],
names.arg = rownames(X)[ordered_loading],
las = 2,
col = "#E69F00",
border = NA,
ylab = "Mean absolute loading",
main = "Posterior loading magnitude"
)Use both summaries together:
For formal variable selection, the model would need an explicit variable inclusion prior and posterior inclusion summaries. The current package is best described as supporting model-based clustering, cluster-number selection, and PGMM covariance-structure selection, with exploratory variable prioritization available from posterior outputs.
When using these summaries in a manuscript or analysis report, use wording such as:
We ranked variables by their separation across posterior modal clusters and by posterior loading magnitudes. These rankings are exploratory diagnostics derived from the fitted clustering model, not posterior inclusion probabilities.
Avoid wording such as “selected variables” unless a formal variable-selection model has been fitted. This distinction keeps the statistical target clear and avoids overstating the model.