---
title: "Exploratory variable prioritization after bpgmm clustering"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Exploratory variable prioritization after bpgmm clustering}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

`bpgmm` performs Bayesian model selection for the number of clusters and PGMM
covariance structure. It does not currently implement a formal spike-and-slab
or inclusion-indicator variable-selection prior. Still, users often want to
understand which observed variables help explain the fitted clustering.

The calculations below:

- fit `bpgmm`;
- summarize the posterior modal allocation;
- rank variables by between-cluster separation;
- compare that ranking with posterior loading magnitudes.

These summaries are diagnostic and exploratory. Treat them as variable
prioritization, not as formal posterior variable inclusion probabilities.

## Simulate data with informative and weak variables

The simulation differs from the model-selection example. The aim is not to
recover \(m\), but to examine how posterior output can be post-processed to
rank variables. The first three variables have different cluster means,
variables four and five have mostly covariance differences, and variable six is
weak noise.

```{r}
library(bpgmm)

simulate_screening_data <- function(n_per_cluster = 18, p = 6) {
  means <- rbind(
    c(-2.5, -1.5,  0.0, 0, 0, 0),
    c( 2.0, -0.5,  1.8, 0, 0, 0),
    c( 0.0,  2.2, -1.8, 0, 0, 0)
  )

  covariances <- list(
    diag(c(0.35, 0.35, 0.35, 1.40, 0.35, 1.20)),
    diag(c(0.35, 0.35, 0.35, 0.35, 1.40, 1.20)),
    matrix(c(
      0.35, 0,    0,    0,    0,    0,
      0,    0.35, 0,    0,    0,    0,
      0,    0,    0.35, 0,    0,    0,
      0,    0,    0,    1.00, 0.45, 0,
      0,    0,    0,    0.45, 1.00, 0,
      0,    0,    0,    0,    0,    1.20
    ), nrow = p)
  )

  n <- n_per_cluster * 3
  X <- matrix(NA_real_, nrow = p, ncol = n)
  true_cluster <- rep(seq_len(3), each = n_per_cluster)

  column <- 1
  for (k in seq_len(3)) {
    for (i in seq_len(n_per_cluster)) {
      X[, column] <- MASS::mvrnorm(1, mu = means[k, ], Sigma = covariances[[k]])
      column <- column + 1
    }
  }

  rownames(X) <- paste0("variable_", seq_len(p))
  list(X = X, true_cluster = true_cluster)
}

set.seed(2028)
sim <- simulate_screening_data()
X <- sim$X
true_cluster <- sim$true_cluster
```

The fitted PGMM still uses

\[
\Sigma_k = \Lambda_k\Lambda_k^\top + \Psi_k,
\]

but this simulation deliberately separates mean-driven variables from
covariance-driven variables so the two rankings below answer different
questions.

## Fit a short clustering chain

```{r}
fit_log <- capture.output({
  fit <- pgmm_rjmcmc(
    X = X,
    m_init = 3,
    m_range = c(1, 5),
    q_new = 2,
    burn = 2,
    niter = 6,
    constraint = "UUU",
    m_step = 1,
    v_step = 1,
    verbose = FALSE
  )
})
tail(fit_log, 1)

fit_summary <- summarize_pgmm_rjmcmc(fit, true_cluster = true_cluster)
fit_summary$ari
```

## Rank variables by posterior allocation separation

One simple score is the fraction of total variation explained by the posterior
modal clusters:

\[
R_j^2 =
\frac{\sum_k n_k(\bar{x}_{jk} - \bar{x}_{j\cdot})^2}
     {\sum_i (x_{ji} - \bar{x}_{j\cdot})^2}.
\]

This quantity is not a posterior inclusion probability; it summarizes
separation under the fitted partition.

```{r}
cluster_separation <- function(X, allocation) {
  vapply(seq_len(nrow(X)), function(j) {
    x <- X[j, ]
    overall <- mean(x)
    total <- sum((x - overall)^2)
    between <- sum(vapply(split(x, allocation), function(group) {
      length(group) * (mean(group) - overall)^2
    }, numeric(1)))
    if (total == 0) 0 else between / total
  }, numeric(1))
}

separation <- cluster_separation(X, fit_summary$allocation)
separation_table <- data.frame(
  variable = rownames(X),
  separation = round(separation, 3)
)
separation_table[order(separation_table$separation, decreasing = TRUE), ]
```

```{r, fig.width = 6.5, fig.height = 4, fig.alt = "Bar plot of exploratory variable separation scores."}
ordered <- order(separation, decreasing = TRUE)
barplot(
  separation[ordered],
  names.arg = rownames(X)[ordered],
  las = 2,
  col = "#56B4E9",
  border = NA,
  ylab = "Between-cluster variation / total variation",
  main = "Exploratory variable separation"
)
```

## Compare with loading magnitudes

The loading matrices describe how observed variables relate to latent factors
inside each component. A simple posterior diagnostic is the average absolute
loading magnitude across saved samples and active components.

\[
L_j =
\frac{1}{|\mathcal{S}|}
\sum_{s \in \mathcal{S}}
\frac{1}{|A_s|}
\sum_{k \in A_s}
\frac{1}{q_k}\sum_{\ell = 1}^{q_k}
|\lambda_{jk\ell}^{(s)}|,
\]

where \(A_s\) is the set of active clusters in posterior sample \(s\). Large
\(L_j\) means variable \(j\) contributes strongly to the latent-factor
covariance structure; it is not a posterior inclusion probability.

```{r}
loading_importance <- function(fit) {
  totals <- NULL
  count <- 0

  for (sample_id in seq_along(fit$lambda_samples)) {
    lambdas <- fit$lambda_samples[[sample_id]]
    active <- which(fit$active_cluster_samples[[sample_id]] == 1)

    for (k in active) {
      lambda <- lambdas[[k]]
      if (!is.matrix(lambda)) {
        next
      }
      score <- rowMeans(abs(lambda))
      if (is.null(totals)) {
        totals <- numeric(length(score))
      }
      totals <- totals + score
      count <- count + 1
    }
  }

  if (count == 0) {
    return(rep(NA_real_, nrow(X)))
  }
  totals / count
}

loading_score <- loading_importance(fit)
loading_table <- data.frame(
  variable = rownames(X),
  mean_abs_loading = round(loading_score, 3)
)
loading_table[order(loading_table$mean_abs_loading, decreasing = TRUE), ]
```

```{r, fig.width = 6.5, fig.height = 4, fig.alt = "Bar plot of average posterior absolute loading magnitudes by variable."}
ordered_loading <- order(loading_score, decreasing = TRUE)
barplot(
  loading_score[ordered_loading],
  names.arg = rownames(X)[ordered_loading],
  las = 2,
  col = "#E69F00",
  border = NA,
  ylab = "Mean absolute loading",
  main = "Posterior loading magnitude"
)
```

## Interpretation

Use both summaries together:

- high separation means a variable distinguishes the posterior clusters;
- high loading magnitude means a variable contributes to latent-factor
  covariance structure inside components;
- disagreement between the two summaries indicates that the mean structure and
  covariance structure carry different information.

For formal variable selection, the model would need an explicit variable
inclusion prior and posterior inclusion summaries. The current package is best
described as supporting model-based clustering, cluster-number selection, and
PGMM covariance-structure selection, with exploratory variable prioritization
available from posterior outputs.

## Suggested reporting language

When using these summaries in a manuscript or analysis report, use wording such
as:

> We ranked variables by their separation across posterior modal clusters and
> by posterior loading magnitudes. These rankings are exploratory diagnostics
> derived from the fitted clustering model, not posterior inclusion
> probabilities.

Avoid wording such as "selected variables" unless a formal variable-selection
model has been fitted. This distinction keeps the statistical target clear and
avoids overstating the model.