---
title: "Comparing Feature Engineering Approaches"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Comparing Feature Engineering Approaches}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## All combinations

We want a single structure that covers every combination of feature method and clustering method. This vignette walks through the `compare_methods()` function, which provides that structure, then unpacks what the results mean.

We use the bundled `steel_industry` dataset: one full year of 15-minute energy measurements from a Korean steel plant, including reactive power, power factor, CO2 emissions, and time-of-day indicators.

```{r setup}
library(cyclicwave)
data(steel_industry)
```

## Preparing the data

Three preprocessing steps:

1. **Thinning**: keep every 10th row to reduce the number of samples. This makes the clustering faster and, because neighboring 15-minute readings are highly correlated, arguably more meaningful.
2. **Select numeric columns**: discard date and categorical columns.
3. **Z-score normalization**: required for any distance-based method, so that no single feature dominates the distances because of its units.

```{r}
data_thin    <- thin_data(steel_industry, step = 10)
numeric_data <- select_numeric_columns(data_thin)
data_scaled  <- normalize_features(numeric_data, method = "zscore")

dim(data_scaled)
```

## Ground-truth labels

`steel_industry` doesn't ship with explicit class labels, but `Usage_kWh` gives us natural ones: low, medium, and high consumption regimes, defined by tertile cutoffs. We use these as a yardstick for evaluating how meaningful each clustering result is.

```{r}
true_labels <- label_by_quantile(data_thin$Usage_kWh, probs = c(1/3, 2/3))
table(true_labels)
```

Because the cutoffs are tertiles, each class has roughly N/3 observations.

## Defining the feature methods

`compare_methods()` takes a named list of feature extractors. Each is just a function that takes the input data and returns a numeric feature matrix.

```{r}
feature_methods <- list(
  # First three principal components only.
  # The input is already z-scored, so prcomp() skips centering and scaling.
  pca_only = function(d) {
    pca <- prcomp(d, center = FALSE, scale. = FALSE)
    pca$x[, 1:3]
  },
  # First three principal components plus circular (phase) features.
  pca_circular = function(d) {
    pca   <- prcomp(d, center = FALSE, scale. = FALSE)
    phase <- compute_phase(d, axis = "feature")
    circ  <- extract_circular_features(phase)
    cbind(pca$x[, 1:3], circ)
  }
)
```

## Defining the clustering methods

We try DBSCAN with two different parameter settings: one with a larger neighborhood radius (loose) and one with a smaller one (tight). This is a parameter sweep disguised as a method comparison.

```{r}
cluster_methods <- list(
  dbscan_loose = list(fn = run_dbscan, params = list(eps = 0.5, min_pts = 8)),
  dbscan_tight = list(fn = run_dbscan, params = list(eps = 0.3, min_pts = 5))
)
```

## One call to rule them all

`compare_methods()` runs every combination, evaluates each with the requested metrics, and returns a single comparison table.

```{r}
comparison <- compare_methods(
  data            = data_scaled,
  feature_methods = feature_methods,
  cluster_methods = cluster_methods,
  metrics         = c("dbi", "accuracy", "n_clusters", "n_noise"),
  true_labels     = true_labels,
  normalize       = NULL,
  verbose         = FALSE
)

print(comparison)
```

Four rows, one per combination (two feature methods times two clustering methods), four metrics each. Now to read it.
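
To see where the four rows come from, the grid of combinations can be enumerated in base R. This is only a sketch of the idea. It assumes that `compare_methods()` pairs every feature method with every clustering method, and it does not reproduce the package's internals. The name vectors below are plain stand-ins for the two lists defined earlier.

```r
# Stand-in name vectors (illustrative only; they mirror the names of the
# feature_methods and cluster_methods lists defined in this vignette).
feature_names <- c("pca_only", "pca_circular")
cluster_names <- c("dbscan_loose", "dbscan_tight")

# expand.grid() builds the Cartesian product: one row per combination.
grid <- expand.grid(
  feature = feature_names,
  cluster = cluster_names,
  stringsAsFactors = FALSE
)

# Number of combinations: 2 feature methods x 2 clustering methods.
n_combinations <- nrow(grid)
print(grid)
```

Adding a third feature method or clustering configuration grows the grid multiplicatively, which is worth keeping in mind before extending either list.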