| Title: | Detecting Measurement-Unit Errors via Gaussian Mixture Models |
|---|---|
| Description: | Tools to detect and correct measurement-unit errors in multivariate numeric data using model-based clustering. Gaussian mixture models with user-defined translation vectors identify clusters of records that differ in scale or unit. Core functionality includes cluster assignment via the EM algorithm, error correction based on posterior probabilities and pairwise scatterplot visualizations. For more details see Di Zio, Guarnera and Luzi (2005) <https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2005001/article/8087-eng.pdf>. |
| Authors: | Cristina Faricelli [aut, cre], Renato Magistro [aut] |
| Maintainer: | Cristina Faricelli <[email protected]> |
| License: | GPL-3 |
| Version: | 0.0.1 |
| Built: | 2026-05-25 07:56:29 UTC |
| Source: | https://github.com/cristinafaricelli/unitmix |
Performs clustering on a numeric dataset by shifting the mean vector according to user-defined translation vectors (shifts), and fitting a Gaussian Mixture Model (GMM) using the Expectation-Maximization (EM) algorithm.
assign.cluster( db, var, errorPatterns, prob_thrsh = 0, entropy_thrsh = 1, max_iter = 500, tol = 1e-06, ID = NULL )assign.cluster( db, var, errorPatterns, prob_thrsh = 0, entropy_thrsh = 1, max_iter = 500, tol = 1e-06, ID = NULL )
db |
A numeric data frame or matrix containing the variables to be clustered. All columns must be numeric and positive. |
var |
A character vector indicating the names of the numeric variables in 'db' to be used for clustering. All specified variables must be numeric and positive. |
errorPatterns |
A list of positive numeric vectors, each representing a shift from the global
mean vector to define a cluster centroid. Each vector must have the same length
as the number of columns in |
prob_thrsh |
Numeric. Minimum required posterior probability for an observation to be assigned to a cluster (default is 0). If no cluster exceeds this value, the observation is left unassigned. |
entropy_thrsh |
Numeric. Maximum normalized Shannon entropy allowed for a cluster assignment
(default is 1). Entropy is calculated on each observation's posterior distribution and
normalized by |
max_iter |
Integer. Maximum number of iterations for the EM algorithm. Default is 500. |
tol |
Numeric. Convergence threshold for the log-likelihood. EM stops when the change in log-likelihood between iterations is below this value. |
ID |
(Optional) A character string specifying the name of the identifier column in 'db'. If provided and the column exists in 'db', its values will be included in the output as the first column, labeled 'ID'. This allows traceability of each corrected record. If not provided or if the specified column does not exist, a default sequence from 1 to the number of rows will be used as identifiers. |
Each observation is softly assigned to all clusters through posterior probabilities. Final cluster assignment is made only if the maximum posterior probability exceeds a user-defined threshold ('prob_thrsh') and the normalized Shannon entropy of the probability vector is below another threshold ('entropy_thrsh'). This allows uncertainty-aware clustering. Entropy is low, close to zero, when the posterior probability is concentrated on a single cluster. The maximum entropy value is 1. Values below 0.1 are generally considered acceptable. The default value for 'prob_thrsh' is 0, while for 'entropy_thrsh' it is 1, which means that clusters are returned exactly as produced by the algorithm, without any filtering.
This method assumes multivariate log-normality and equal covariance across clusters. It implements robust assignment logic by rejecting uncertain observations based on entropy and minimum probability thresholds. Observations with no dominant cluster (high entropy or low posterior probability) are marked with cluster 0. It is the user's responsibility to ensure that input variables are:
Numeric
Strictly positive (since log transformation is applied)
Appropriately scaled (e.g., unit conversion issues should be captured by 'errorPatterns')
If these conditions are not met, the function will return informative error messages and halt execution.
Input data is not preprocessed automatically: preprocessing must be handled by the user.
A list containing:
A data frame including:
The observation ID (from the 'ID' column if provided, otherwise a sequence number),
The original variables used for clustering,
The final cluster assignment for each observation (with unassigned cases labeled as 0),
The normalized Shannon entropy of the posterior distribution,
The posterior probabilities for each cluster.
A data frame containing posterior probabilities for each cluster. One row per observation, one column per cluster.
The estimated shared covariance matrix used in the Gaussian mixture model.
A matrix of cluster centroids, obtained by translating the overall mean vector with each specified error pattern.
The number of EM iterations executed before convergence.
Di Zio, M., Guarnera, U., & Luzi, O. (2005). Editing systematic unity measure errors through mixture modelling. *Survey Methodology*, 31(1), 53–63. https://www.istat.it/it/files/2014/05/Survey-Methodology-311-53-63.pdf
# Simulated dataset with unit-of-measurement error on 3 values data <- data.frame( x1 = c(rlnorm(50, meanlog = 4, sdlog = 0.3), rlnorm(3, meanlog = 4, sdlog = 0.3) * 1000), x2 = rlnorm(53, meanlog = 6, sdlog = 0.3) ) var <- names(data) # Define two errorPatterns: no shift, and a ×1000 shift in x1 only errorPatterns <- list( c(0, 0), # Cluster 1: correct units c(1000, 0) # Cluster 2: scale error ×1000 ) # Run the clustering algorithm ac <- assign.cluster( db = data, var= var, errorPatterns = errorPatterns ) # Examine the results table(ac$data$cluster)# Simulated dataset with unit-of-measurement error on 3 values data <- data.frame( x1 = c(rlnorm(50, meanlog = 4, sdlog = 0.3), rlnorm(3, meanlog = 4, sdlog = 0.3) * 1000), x2 = rlnorm(53, meanlog = 6, sdlog = 0.3) ) var <- names(data) # Define two errorPatterns: no shift, and a ×1000 shift in x1 only errorPatterns <- list( c(0, 0), # Cluster 1: correct units c(1000, 0) # Cluster 2: scale error ×1000 ) # Run the clustering algorithm ac <- assign.cluster( db = data, var= var, errorPatterns = errorPatterns ) # Examine the results table(ac$data$cluster)
This function dynamically generates a set of scatter plots for all pairs of variables, log-transformed, in the provided dataset, enabling detailed visualisation of cluster assignments. The points are coloured according to the cluster to which each observation is assigned, based on posterior probabilities. If the number of variables is large, the number of plots is automatically adjusted to maintain readability, creating multiple graphs. Observations are only assigned to a cluster if the a posteriori probability for that cluster is above a specified threshold (default = 0.6).
cluster.plot( db, var, postprob = 1, errorPatterns = ncol(data), threshold = 0.6 )cluster.plot( db, var, postprob = 1, errorPatterns = ncol(data), threshold = 0.6 )
db |
A numeric data frame or matrix containing the variables to plot. |
var |
A character vector indicating the names of the ckusters numeric variables in 'db. All specified variables must be numeric and positive. |
postprob |
A matrix or dataframe of posterior probabilities where each column represents a cluster.
It is typically obtained from the function |
errorPatterns |
A list of positive numeric vectors that have been identified as being associated with potential unit errors which will be highlighted in the graphs. If not provided, no special errorPatterns are highlighted (default = ncol(data), used internally) |
threshold |
Minimum probability threshold for cluster assignment (default = 0.6). |
It is the user's responsibility to ensure that input variables are:
Numeric
Strictly positive (since log transformation is applied)
Appropriately scaled (e.g., unit conversion issues should be captured by 'errorPatterns')
If these conditions are not met, the function will return informative error messages and halt execution.
Input data is not preprocessed automatically: preprocessing must be handled by the user.
No value returned. The function produces side-effect graphs, which are useful for visually inspecting the separation between clusters and the reliability of assignments.
data <- data.frame( x1 = c(rlnorm(50, meanlog = 4, sdlog = 0.3), rlnorm(10, meanlog = 4, sdlog = 0.3) * 1000), x2 = rlnorm(60, meanlog = 6, sdlog = 0.3) ) var <- names(data) # Define two errorPatterns: no shift, and a ×1000 shift in x1 only errorPatterns <- list( c(0, 0), # Cluster 1: correct units c(1000, 0) # Cluster 2: scale error ×1000 ) results <- assign.cluster( db = data, var= var, errorPatterns = errorPatterns ) cluster.plot(data, var, postprob = results$postprob, errorPatterns, threshold = 0.6)data <- data.frame( x1 = c(rlnorm(50, meanlog = 4, sdlog = 0.3), rlnorm(10, meanlog = 4, sdlog = 0.3) * 1000), x2 = rlnorm(60, meanlog = 6, sdlog = 0.3) ) var <- names(data) # Define two errorPatterns: no shift, and a ×1000 shift in x1 only errorPatterns <- list( c(0, 0), # Cluster 1: correct units c(1000, 0) # Cluster 2: scale error ×1000 ) results <- assign.cluster( db = data, var= var, errorPatterns = errorPatterns ) cluster.plot(data, var, postprob = results$postprob, errorPatterns, threshold = 0.6)
Post-processing function for the output of assign.cluster.
It evaluates the compatibility of each record with its assigned cluster
using Mahalanobis distance on log-transformed data. Records incompatible
with their cluster (distance > chi-square cutoff) or belonging to unreliable
clusters (low compatibility proportion or small size) are reassigned to
cluster 0 (unassigned). This enhances assignment reliability by discarding
outliers and unstable groups.
refine.cluster( ac, vars, compat_level = 0.99, min_good_prop = 0.9, min_cluster_size = 5 )refine.cluster( ac, vars, compat_level = 0.99, min_good_prop = 0.9, min_cluster_size = 5 )
ac |
List. Output object returned by |
vars |
Character vector. Names of the variables used in
|
compat_level |
Numeric. Confidence level for the chi-square cutoff used to assess individual compatibility (default: 0.99). |
min_good_prop |
Numeric. Minimum proportion of compatible records required to consider a cluster reliable (default: 0.90). |
min_cluster_size |
Integer. Minimum cluster size required to consider a cluster valid (default: 5). |
This method assumes multivariate log-normality and uses the shared covariance
matrix from assign.cluster. It is the user's responsibility to ensure
input variables are numeric and strictly positive (log transformation applied).
Cluster-level diagnostics exclude cluster 0. Good clusters must meet both
size and compatibility thresholds. For visualization pipe to cluster.plot.
A list containing:
data |
A data frame with the original results and additional
columns: |
postprob |
Posterior probability matrix with rows set to zero for discarded records. |
cluster_summary |
Diagnostic table reporting, for each cluster:
|
params |
List of parameters used in the refinement process. |
# Simulated dataset with: # - 50 correct observations # - 3 unit-of-measurement errors (x1 ×1000) # - 2 structural outliers set.seed(123) data <- data.frame( x1 = c( rlnorm(50, meanlog = 4, sdlog = 0.3), # correct data rlnorm(3, meanlog = 4, sdlog = 0.3) * 1000, # scale error rlnorm(2, meanlog = 8, sdlog = 0.1) # outliers ), x2 = c( rlnorm(50, meanlog = 6, sdlog = 0.3), rlnorm(3, meanlog = 6, sdlog = 0.3), rlnorm(2, meanlog = 9, sdlog = 0.1) # outliers ) ) var <- names(data) # Define two errorPatterns: no shift, and ×1000 shift in x1 only errorPatterns <- list( c(0, 0), # Cluster 1: correct units c(1000, 0) # Cluster 2: scale error ×1000 ) # Run the clustering algorithm ac <- assign.cluster( db = data, var = var, errorPatterns = errorPatterns ) # Examine initial cluster assignments table(ac$data$cluster) # Run the refine cluster algorithm rc <- refine.cluster(ac, var, min_cluster_size=1) # Re-examine cluster assignments table(rc$data$cluster_refined)# Simulated dataset with: # - 50 correct observations # - 3 unit-of-measurement errors (x1 ×1000) # - 2 structural outliers set.seed(123) data <- data.frame( x1 = c( rlnorm(50, meanlog = 4, sdlog = 0.3), # correct data rlnorm(3, meanlog = 4, sdlog = 0.3) * 1000, # scale error rlnorm(2, meanlog = 8, sdlog = 0.1) # outliers ), x2 = c( rlnorm(50, meanlog = 6, sdlog = 0.3), rlnorm(3, meanlog = 6, sdlog = 0.3), rlnorm(2, meanlog = 9, sdlog = 0.1) # outliers ) ) var <- names(data) # Define two errorPatterns: no shift, and ×1000 shift in x1 only errorPatterns <- list( c(0, 0), # Cluster 1: correct units c(1000, 0) # Cluster 2: scale error ×1000 ) # Run the clustering algorithm ac <- assign.cluster( db = data, var = var, errorPatterns = errorPatterns ) # Examine initial cluster assignments table(ac$data$cluster) # Run the refine cluster algorithm rc <- refine.cluster(ac, var, min_cluster_size=1) # Re-examine cluster assignments table(rc$data$cluster_refined)