---
title: "Comparing Feature Engineering Approaches"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Comparing Feature Engineering Approaches}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## All combinations

This vignette runs every combination of feature extractor and clustering
method through the `compare_methods()` function, then unpacks what the
results mean.

We use the bundled `steel_industry` dataset: one full year of 15-minute
energy measurements from a Korean steel plant, including reactive power,
power factor, CO2 emissions, and time-of-day indicators.

```{r setup}
library(cyclicwave)
data(steel_industry)
```

## Preparing the data

Three preprocessing steps:

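1. **Thinning**: keep every 10th row. Fewer samples make the clustering faster, and adjacent 15-minute readings tend to be highly similar, so thinning mainly removes near-duplicates.
2. **Select numeric columns**: discard date and categorical columns.
3. **Z-score normalization**: puts every column on a comparable scale, which distance-based methods need so that no large-valued feature dominates.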

```{r}
data_thin    <- thin_data(steel_industry, step = 10)
numeric_data <- select_numeric_columns(data_thin)
data_scaled  <- normalize_features(numeric_data, method = "zscore")

dim(data_scaled)
```
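
For readers who want to see what these three steps amount to, here is a
minimal base-R sketch on a made-up data frame. It is not how `cyclicwave`
implements `thin_data()`, `select_numeric_columns()`, or
`normalize_features()`. In particular, starting the thinning at row 1 and
dividing by the sample standard deviation (as `scale()` does) are
assumptions.

```{r}
# Toy data frame standing in for steel_industry (columns are made up)
toy <- data.frame(
  stamp = sprintf("t%02d", 1:30),          # non-numeric column, to be dropped
  load  = seq(10, 300, by = 10),
  pf    = seq(0.71, 1, length.out = 30)
)

# 1. Thinning: keep rows 1, 11, 21 (assumes the first row is kept)
toy_thin <- toy[seq(1, nrow(toy), by = 10), ]

# 2. Keep only the numeric columns
toy_numeric <- toy_thin[, sapply(toy_thin, is.numeric), drop = FALSE]

# 3. Z-score: subtract each column mean, divide by each column sd
toy_scaled <- scale(toy_numeric)

colMeans(toy_scaled)
apply(toy_scaled, 2, sd)
```

After scaling, every column has mean 0 and standard deviation 1, so no
column dominates a Euclidean distance just because of its units.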

## Ground-truth labels

`steel_industry` doesn't ship with explicit class labels, but `Usage_kWh`
gives us natural ones: low, medium, and high consumption regimes,
defined by tertile cutoffs. We use these as a yardstick for evaluating
how meaningful each clustering result is.

```{r}
true_labels <- label_by_quantile(data_thin$Usage_kWh,
                                 probs = c(1/3, 2/3))
table(true_labels)
```

By construction, each class holds roughly a third of the observations.
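
The tertile idea can be sketched in base R with `quantile()` and `cut()`.
The vector `toy_usage` is simulated, and the details of
`label_by_quantile()` (which side of each interval is closed, what the
labels are called) are assumptions here, not the package's documented
behavior.

```{r}
set.seed(42)
toy_usage <- rexp(300, rate = 1 / 30)   # simulated, skewed consumption values

# Cutoffs at the 1/3 and 2/3 quantiles
toy_cuts <- quantile(toy_usage, probs = c(1/3, 2/3))

# Three bins: (-Inf, q1], (q1, q2], (q2, Inf)
toy_labels <- cut(
  toy_usage,
  breaks = c(-Inf, toy_cuts, Inf),
  labels = c("low", "medium", "high")   # label names are illustrative
)

table(toy_labels)
```

With continuous values and no ties the three bins come out equal in size.
Real measurements with repeated values can make the split slightly uneven.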

## Defining the feature methods

`compare_methods()` takes a named list of feature extractors. Each is
just a function that takes the data passed to `compare_methods()` (here
the already-scaled matrix) and returns a numeric feature matrix.

```{r}
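feature_methods <- list(
  # Baseline: the first three principal components.
  # The input is already z-scored, so prcomp() skips centering and scaling.
  pca_only = function(d) {
    pca <- prcomp(d, center = FALSE, scale. = FALSE)
    pca$x[, 1:3]
  },
  # The same three components, with the circular (phase) features appended.
  pca_circular = function(d) {
    pca <- prcomp(d, center = FALSE, scale. = FALSE)
    phase <- compute_phase(d, axis = "feature")
    circ <- extract_circular_features(phase)
    cbind(pca$x[, 1:3], circ)
  }
)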
```
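
To make the extractor contract concrete, the sketch below checks a
candidate function against it on random data. Both `toy_pca2()` and
`check_extractor()` are hypothetical helpers written for this
illustration, not part of `cyclicwave`. Requiring one output row per input
row is an assumption, though a natural one, since cluster assignments are
later compared with `true_labels` row by row.

```{r}
# Hypothetical extractor: first two principal components
toy_pca2 <- function(d) {
  prcomp(d, center = FALSE, scale. = FALSE)$x[, 1:2, drop = FALSE]
}

# Hypothetical check: numeric matrix with one row per observation
check_extractor <- function(extract, d) {
  out <- extract(d)
  is.matrix(out) && is.numeric(out) && nrow(out) == nrow(d)
}

set.seed(1)
toy_input <- scale(matrix(rnorm(200), ncol = 4))   # 50 rows, 4 z-scored columns

check_extractor(toy_pca2, toy_input)                     # valid extractor
check_extractor(function(d) colMeans(d), toy_input)      # returns a vector, not a matrix
```

Running a check like this before a long comparison can catch an extractor
that collapses the rows or returns a non-numeric object.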

## Defining the clustering methods

We try DBSCAN with two parameter settings: a loose one with a larger
neighborhood radius (`eps = 0.5`, `min_pts = 8`) and a tight one with a
smaller radius (`eps = 0.3`, `min_pts = 5`). This is a parameter sweep
disguised as a method comparison.

```{r}
cluster_methods <- list(
  dbscan_loose = list(fn = run_dbscan, params = list(eps = 0.5, min_pts = 8)),
  dbscan_tight = list(fn = run_dbscan, params = list(eps = 0.3, min_pts = 5))
)
```
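
The two parameters pull in opposite directions: a larger `eps` makes it
easier for a point to qualify as a core point, while a larger `min_pts`
makes it harder. The base-R sketch below counts core points on simulated
data under both settings. It only illustrates the core-point rule and is
not a DBSCAN implementation. Counting the point itself among its neighbors
is an assumption about `run_dbscan()`.

```{r}
set.seed(7)
# Two simulated blobs of 40 points each
toy_pts <- rbind(
  matrix(rnorm(80, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(80, mean = 3, sd = 0.2), ncol = 2)
)
toy_dist <- as.matrix(dist(toy_pts))   # pairwise Euclidean distances

# A point is "core" if at least min_pts points lie within eps of it
count_core <- function(eps, min_pts) {
  neighbors <- rowSums(toy_dist <= eps)   # includes the point itself
  sum(neighbors >= min_pts)
}

c(loose = count_core(eps = 0.5, min_pts = 8),
  tight = count_core(eps = 0.3, min_pts = 5))
```

Because both parameters change between the two settings, "loose" does not
automatically yield more core points than "tight"; that depends on the
local density of the data.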

## One call to rule them all

`compare_methods()` runs every combination, evaluates each with the
requested metrics, and returns a single comparison table.

```{r}
comparison <- compare_methods(
  data            = data_scaled,
  feature_methods = feature_methods,
  cluster_methods = cluster_methods,
  metrics         = c("dbi", "accuracy", "n_clusters", "n_noise"),
  true_labels     = true_labels,
  normalize       = NULL,
  verbose         = FALSE
)

print(comparison)
```

Four rows, one per combination, four metrics each. Now to read it.
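
The "every combination" pattern behind those four rows can be sketched
with `expand.grid()`. Everything here is a toy stand-in: the extractors,
the clusterers (hierarchical clustering via `hclust()` instead of DBSCAN),
and the single metric are made up for illustration, and the real
`compare_methods()` may be organized quite differently.

```{r}
# Toy extractors and clusterers (hypothetical, not the vignette's methods)
toy_features <- list(
  raw     = function(d) d,
  squared = function(d) d^2
)
toy_clusterers <- list(
  k2 = function(x) cutree(hclust(dist(x)), k = 2),
  k3 = function(x) cutree(hclust(dist(x)), k = 3)
)

set.seed(1)
toy_data <- matrix(rnorm(60), ncol = 2)

# One row per (feature, cluster) pair
combos <- expand.grid(
  feature = names(toy_features),
  cluster = names(toy_clusterers),
  stringsAsFactors = FALSE
)

# Run each pair and record one metric: the number of clusters found
combos$n_clusters <- mapply(
  function(f, cl) {
    feats <- toy_features[[f]](toy_data)
    length(unique(toy_clusterers[[cl]](feats)))
  },
  combos$feature, combos$cluster,
  USE.NAMES = FALSE
)

combos
```

Two extractors times two clusterers gives four rows, the same shape as the
comparison table above, with one column per metric.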