Fully-Automatic XML Clustering by Structure-Constrained Phrases

Conference proceedings contribution
Publication date:
2015
Abstract:
Conventional approaches to XML clustering by content and structure are generally limited by their adoption of the bag-of-words model to represent textual content. This choice may cause structure-constrained textual items of separate XML documents to be treated as related, even though the actual meaning of such items in their respective contexts is different. To overcome this limitation, we propose XML clustering by structure-constrained phrases, a previously unexplored method that relies on the more accurate bag-of-phrases model of XML textual content to better preserve the meaning of structure-constrained content items and thereby improve clustering effectiveness. To study the effectiveness of the proposed method in depth and systematically, we develop a parameter-free prototypical approach to XML partitioning, which projects the XML documents into a space of XML features representing fixed-length sequences of adjacent textual items in the context of root-to-leaf paths. Feature selection without any tunable threshold chooses a subset of the XML features on the basis of their relevance to clustering, which is assessed through a new scoring scheme. Comparative experiments on real-world benchmark XML corpora show higher effectiveness than several state-of-the-art competitors.
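To make the notion of "fixed-length sequences of adjacent textual items in the context of root-to-leaf paths" more concrete, the following is a minimal sketch of how such features might be extracted. It is not the authors' implementation: the function name `path_ngrams`, the whitespace tokenization, the lowercasing, and the explicit `n` argument are all assumptions made for illustration, and the scoring scheme and threshold-free feature selection described in the abstract are not reproduced.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_ngrams(xml_text, n=2):
    """Count (root-to-leaf path, n-gram) pairs in one XML document.

    Hypothetical helper: tokenization and the choice of n are assumptions.
    """
    root = ET.fromstring(xml_text)
    features = Counter()

    def visit(elem, path):
        path = path + (elem.tag,)
        children = list(elem)
        if not children:
            # Leaf element: its text is constrained by the full path to it.
            tokens = (elem.text or "").lower().split()
            for i in range(len(tokens) - n + 1):
                ngram = tuple(tokens[i:i + n])
                features[("/".join(path), ngram)] += 1
        for child in children:
            visit(child, path)

    visit(root, ())
    return features


doc = (
    "<paper><title>XML clustering by phrases</title>"
    "<author><name>Ada Lovelace</name></author></paper>"
)
feats = path_ngrams(doc, n=2)
for (path, ngram), count in sorted(feats.items()):
    print(path, ngram, count)
```

Each feature pairs a path such as `paper/title` with a phrase such as `("xml", "clustering")`, so the same words occurring under a different path, or in a different order, yield a different feature. This is the distinction that a plain bag-of-words representation loses.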
CRIS type:
04.01 Conference proceedings contribution
Keywords:
XML Clustering; Semistructured Data Analysis; Path-Constrained n-Grams
Author list:
Ortale, Riccardo; Costa, Giovanni
Institutional authors:
COSTA GIOVANNI
ORTALE RICCARDO
Link to the full record:
https://iris.cnr.it/handle/20.500.14243/299603