--- title: "Model selection on a larger simulated MFA data set" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Model selection on a larger simulated MFA data set} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` The following simulation uses a moderately larger mixture-of-factor-analyzers data set to illustrate model selection in `bpgmm`. The chain is short so that the vignette builds quickly on CRAN and pkgdown. Applied analyses should use longer burn-in, more posterior samples, multiple chains, and convergence diagnostics. ## Simulate clustered factor-analytic data The simulator below creates three clusters, six observed variables, and two latent factors. The first three variables carry most of the cluster separation, while the last three variables are intentionally less informative. The generating model matches the paper's mixture-of-factor-analyzers form: \[ x_i = \mu_{z_i} + \Lambda_{z_i}y_i + \epsilon_i, \qquad y_i \sim N_2(0, I_2), \qquad \epsilon_i \sim N_6(0, \Psi_{z_i}). \] ```{r} library(bpgmm) simulate_mfa_data <- function(n_per_cluster = 20, p = 6, q = 2) { means <- rbind( c(-3.0, -2.0, 0.0, 0, 0, 0), c( 2.5, -1.0, 2.0, 0, 0, 0), c( 0.0, 2.5, -2.0, 0, 0, 0) ) lambdas <- list( matrix(c(1.2, 0.5, 0.2, 0, 0, 0, 0.0, 0.2, 0.8, 0.3, 0, 0), nrow = p), matrix(c(0.2, 1.1, 0.6, 0, 0, 0, 0.8, 0.1, 0.2, 0.6, 0, 0), nrow = p), matrix(c(0.7, -0.4, 1.0, 0, 0, 0, -0.2, 0.8, 0.4, 0.3, 0, 0), nrow = p) ) psi <- list( diag(c(0.25, 0.35, 0.30, 0.80, 0.90, 1.00)), diag(c(0.30, 0.25, 0.40, 0.80, 0.90, 1.00)), diag(c(0.35, 0.30, 0.25, 0.80, 0.90, 1.00)) ) n <- n_per_cluster * 3 X <- matrix(NA_real_, nrow = p, ncol = n) true_cluster <- rep(seq_len(3), each = n_per_cluster) column <- 1 for (k in seq_len(3)) { for (i in seq_len(n_per_cluster)) { latent_score <- rnorm(q) noise <- MASS::mvrnorm(1, mu = rep(0, p), Sigma = psi[[k]]) X[, column] <- means[k, ] + lambdas[[k]] %*% latent_score + noise column <- column + 1 } } rownames(X) <- paste0("variable_", seq_len(p)) list(X = X, true_cluster = true_cluster) } set.seed(2027) sim <- simulate_mfa_data() X <- sim$X true_cluster <- sim$true_cluster dim(X) table(true_cluster) ``` The first two variables already show substantial separation, but the model is fit to all six variables. ```{r, fig.width = 5.5, fig.height = 4.2, fig.alt = "Scatter plot of the first two variables colored by true cluster."} cluster_cols <- c("#0072B2", "#D55E00", "#009E73", "#CC79A7", "#E69F00") plot( X[1, ], X[2, ], col = cluster_cols[true_cluster], pch = 19, xlab = rownames(X)[1], ylab = rownames(X)[2], main = "Simulated MFA data", asp = 1 ) legend( "topleft", legend = paste("True cluster", sort(unique(true_cluster))), col = cluster_cols[sort(unique(true_cluster))], pch = 19, bty = "n" ) ``` ## Run RJMCMC model selection Here `m_step = 1` allows the sampler to move across different numbers of clusters, and `v_step = 1` allows moves across the PGMM covariance structures. The example uses only a few iterations so that the vignette is fast. The two discrete targets are \[ p(m \mid X) \quad \text{and} \quad p(v \mid X), \] where \(v\) is one of the eight covariance labels \(\{\mathrm{CCC}, \ldots, \mathrm{UUU}\}\). The saved samples estimate these posterior probabilities by empirical frequencies: \[ \widehat{p}(m = r \mid X) = \frac{1}{S}\sum_{s=1}^S I(m^{(s)} = r), \qquad \widehat{p}(v = a \mid X) = \frac{1}{S}\sum_{s=1}^S I(v^{(s)} = a). \] ```{r} fit_log <- capture.output({ fit <- pgmm_rjmcmc( X = X, m_init = 3, m_range = c(1, 5), q_new = 2, burn = 2, niter = 6, constraint = "UUU", m_step = 1, v_step = 1, split_combine = 0, verbose = FALSE ) }) tail(fit_log, 1) ``` ```{r} fit_summary <- summarize_pgmm_rjmcmc(fit, true_cluster = true_cluster) fit_summary$n_clusters fit_summary$n_constraints fit_summary$ari ``` The model-selection summaries are posterior sample counts. In an applied analysis, these bars should be based on a much longer chain. ```{r, fig.width = 7, fig.height = 4, fig.alt = "Bar plots of posterior counts for cluster number and covariance model."} old_par <- par(mfrow = c(1, 2), mar = c(5, 4, 3, 1)) barplot( fit_summary$n_clusters, col = "#56B4E9", border = NA, xlab = "Number of clusters", ylab = "Posterior sample count", main = "Cluster-number selection" ) barplot( fit_summary$n_constraints, col = "#E69F00", border = NA, xlab = "PGMM model", ylab = "Posterior sample count", main = "Covariance-model selection" ) par(old_par) ``` ## Compare true and posterior allocations The posterior modal allocation can be plotted against the known simulation labels. ```{r, fig.width = 7, fig.height = 4, fig.alt = "Two-panel scatter plot comparing true cluster labels and posterior modal allocation."} old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1)) plot( X[1, ], X[2, ], col = cluster_cols[true_cluster], pch = 19, xlab = rownames(X)[1], ylab = rownames(X)[2], main = "True clusters", asp = 1 ) plot( X[1, ], X[2, ], col = cluster_cols[fit_summary$allocation], pch = 19, xlab = rownames(X)[1], ylab = rownames(X)[2], main = "Posterior modal allocation", asp = 1 ) par(old_par) ``` ## Interpreting short versus long runs The short chain in this vignette demonstrates the mechanics of model selection. It should not be interpreted as a final posterior analysis. In applied work: - increase `burn` and `niter` until summaries are stable; - compare independent chains with `pgmm_rjmcmc_chains()`; - check whether sampled cluster counts are stuck at either edge of `m_range`; - rerun sensitivity checks for plausible values of `q_new`; - inspect whether covariance-model posterior counts change when `v_step = 1`. If the posterior mass stays at the upper end of `m_range`, expand the range or revisit preprocessing. If it stays at the lower end, the data may support fewer clusters than expected, or the covariance model may be too flexible for the sample size.