Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark

Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark - Nature

February 18, 2021

Choosing the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. data with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performance was assessed by the Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables, using 7 scenarios with varying population sizes, numbers of clusters, numbers of continuous and categorical variables, proportions of relevant (non-noisy) variables and degrees of variable relevance (low, mild, high). The clustering methods were then applied to data from the EPHESUS randomized clinical trial (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. In the simulations, K-prototypes, Kamila and LCM outperformed all other methods.
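To make the evaluation metric concrete, here is a minimal sketch of the Adjusted Rand Index in plain Python. It is not the code used in the paper (the authors worked in R); the function name `adjusted_rand_index` and the toy labels are assumptions chosen for illustration. The ARI compares a clustering with the known group labels by counting pairs of observations that are grouped together in both partitions, then corrects that count for chance agreement.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same observations.

    Illustrative implementation of the standard pair-counting formula;
    the name and signature are hypothetical, not taken from the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two observations: nothing to disagree on

    # Contingency counts: cells n_ij, row sums a_i, column sums b_j.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of pairs placed together within each cell, row, and column.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # largest attainable agreement

    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy labels (made up for illustration): the ARI ignores how clusters are named.
truth = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]

print(adjusted_rand_index(truth, same_up_to_renaming))  # 1.0
print(adjusted_rand_index(truth, crossed))              # -0.5
```

An ARI of 1 means the clustering recovers the true groups exactly, values near 0 mean agreement no better than chance, and negative values mean agreement worse than chance.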

Read more: https://doi.org/10.1038/s41598-021-83340-8

(or download the PDF of the article: https://rdcu.be/dCnix)
