Incorporating clustering into set similarity join algorithms: The SjClust framework

Conference proceedings contribution
Publication date:
2016
Abstract:
Data cleaning and integration are founded on duplicate record identification, which aims at detecting duplicate records that represent the same real-world entity. Similarity join is widely used to detect pairs of similar records, in combination with a subsequent clustering algorithm that groups together records referring to the same entity. Unfortunately, the clustering algorithm is used strictly as a postprocessing step, which slows down overall performance and means that final results are produced only at the end of the whole process. Motivated by this limitation, in this paper we propose and experimentally assess SjClust, a framework that integrates similarity join and clustering into a single operation. The basic idea of our proposal is to introduce a variety of cluster representations that are smoothly merged during the set similarity task carried out by the join algorithm. An optimization task is further applied on top of this framework. The experimental results, derived from an extensive experimental campaign, are surprising: we are able to outperform the original set similarity join algorithm by an order of magnitude in most settings.
CRIS type:
04.01 Conference proceedings contribution
Keywords:
Big Data Clustering; Big Data Management
Author list:
Cuzzocrea, ALFREDO MASSIMILIANO
Link to the full record:
https://iris.cnr.it/handle/20.500.14243/324304
General Data

URL

http://www.scopus.com/record/display.url?eid=2-s2.0-84981278159&origin=inward