Publication date:
2006
Abstract:
A distance-based outlier detection method is proposed that finds the top outliers in an unlabeled data set and provides a subset of it,
called the outlier detection solving set, that can be used to predict the outlierness of new, unseen objects. The solving set
includes enough points to permit the detection of the top outliers while considering only a subset of all the pairwise
distances in the data set. The properties of the solving set are investigated, and algorithms for computing it in subquadratic time
are proposed. Experiments on synthetic and real data sets evaluate the effectiveness of the approach.
A scaling analysis of the solving set size is performed, and the false positive rate, that is, the fraction of new objects misclassified as
outliers when the solving set is used instead of the overall data set, is shown to be negligible. Finally, ROC analysis is performed
to investigate the accuracy of the method in separating outliers from inliers. The results show that using the solving set instead of the data
set yields comparable prediction quality at a lower computational cost.
CRIS type:
01.01 Journal article
Keywords:
Distance-based outliers; outlier detection; outlier prediction; data mining
Authors:
Pizzuti, Clara; Basta, Stefano
Published in: