Publication date:
2013
Abstract:
We introduce a distributed method for detecting distance-based outliers in very large data sets. Our approach is based on
the concept of an outlier detection solving set [2], a small subset of the data set that can also be employed for predicting novel
outliers. The method exploits parallel computation to obtain vast time savings. Indeed, beyond preserving the correctness of
the result, the proposed scheme exhibits excellent performance. From the theoretical point of view, for common settings, our
algorithm is expected to run at least three orders of magnitude faster than the classical nested-loop-like approach to detecting
outliers. Experimental results show that the algorithm is efficient and that its running time scales well as the number of
nodes increases. We also discuss a variant of the basic strategy which reduces the amount of data to be transferred in order to improve both the
communication cost and the overall runtime. Importantly, the solving set computed by our approach in a distributed environment has
the same quality as that produced by the corresponding centralized method.
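For orientation, the classical nested-loop baseline mentioned in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' algorithm: it assumes that an object's outlier score is the sum of the distances to its k nearest neighbors and that the top n scores are reported. The names `knn_weight` and `top_n_outliers` and the sample data are hypothetical.

```python
import math


def knn_weight(points, i, k):
    """Assumed outlier score: sum of distances from points[i] to its k nearest neighbors."""
    dists = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return sum(dists[:k])


def top_n_outliers(points, n, k):
    """Nested-loop baseline: score every point against all others (quadratic cost)."""
    scored = [(knn_weight(points, i, k), i) for i in range(len(points))]
    scored.sort(reverse=True)  # largest score first
    return [i for _, i in scored[:n]]


# Four tightly clustered points and one isolated point (hypothetical data).
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = top_n_outliers(data, n=1, k=2)
```

Every point is compared with every other point, so the cost grows quadratically with the data set size; this is the cost that the distributed method described in the abstract aims to reduce.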
CRIS type:
01.01 Journal article
Keywords:
Distance-based outliers; outlier detection; parallel and distributed algorithms
Authors:
Basta, Stefano
Published in: