
XML Clustering by Structure-Constrained Phrases: A Fully-Automatic Approach Using Contextualized N-Grams

Article
Publication date:
2017
Abstract:
Using the bag-of-words model to represent the textual data of XML documents may not be beneficial when clustering XML documents by content and structure. Indeed, the occurrence of similar structure-constrained textual items across distinct XML documents may suggest relatedness even when the order in which the items occur in their respective contexts implies different meanings. We propose XML clustering by structure-constrained phrases, a new method that better captures the meaning of the structure-constrained textual items of XML documents by resorting to the more accurate bag-of-phrases model for improved clustering effectiveness. To conduct an in-depth and systematic study of the validity of the proposed method, we develop a parameter-free approach that projects the XML documents into a space of XML features corresponding to sequences of textual items in the context of root-to-leaf paths. Automatic feature selection chooses a subset of XML features, whose relevance is assessed through an innovative scoring scheme. The devised approach can operate with representations of the XML documents over both fixed- and mixed-length sequences of contextualized textual items. A novel criterion is presented to combine XML features with mixed lengths. A comparative evaluation on real-world benchmark XML corpora reveals the superior effectiveness of our approach. This highlights the potential of XML clustering by structure-constrained phrases and encourages further work. The scalability of the devised approach is also investigated.
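
As an illustration of the idea in the abstract, the following minimal Python sketch pairs word n-grams with the root-to-leaf path in which they occur. It is not the authors' implementation: the function name `contextualized_ngrams`, the whitespace tokenization, and the two toy documents are assumptions made only for this example, and the feature selection and scoring scheme described above are not modeled.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def contextualized_ngrams(xml_text, n):
    """Count word n-grams, each paired with the root-to-leaf path it occurs in.

    Hypothetical helper for illustration only; tokenization is a plain
    lowercase whitespace split.
    """
    root = ET.fromstring(xml_text)
    features = Counter()

    def visit(node, path):
        path = path + [node.tag]
        children = list(node)
        if not children:
            # Leaf element: its text is the content constrained by this path.
            tokens = (node.text or "").lower().split()
            for i in range(len(tokens) - n + 1):
                features[("/".join(path), tuple(tokens[i:i + n]))] += 1
        for child in children:
            visit(child, path)

    visit(root, [])
    return features


# Two toy documents with the same words in the same structural context,
# but in a different order (assumed data, not from the paper's corpora).
doc_a = "<paper><title>xml clustering by structure</title></paper>"
doc_b = "<paper><title>structure by clustering xml</title></paper>"

unigrams_a = contextualized_ngrams(doc_a, 1)
unigrams_b = contextualized_ngrams(doc_b, 1)
bigrams_a = contextualized_ngrams(doc_a, 2)
bigrams_b = contextualized_ngrams(doc_b, 2)
```

With n = 1 (a bag-of-words view) the two toy documents yield identical features, whereas with n = 2 they share none, which is the kind of order sensitivity that the bag-of-phrases representation is meant to capture.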
CRIS type:
01.01 Journal article
Keywords:
Semi-structured data analysis; XML clustering by structure and nested text; Structure-constrained phrases; contextualized word n-grams of fixed- and mixed-length
Authors:
Ortale, Riccardo; Costa, Giovanni
Institutional authors:
COSTA GIOVANNI
ORTALE RICCARDO
Link to the full record:
https://iris.cnr.it/handle/20.500.14243/334012
Published in:
INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS
Journal
General data

URL

http://www.scopus.com/record/display.url?eid=2-s2.0-85013643182&origin=inward