
A topic detection method for high dimensional datasets

Conference proceedings contribution
Publication date:
2014
Abstract:
Topic extraction from documents has become increasingly important because of its effectiveness in many tasks, including information retrieval, information filtering, and the organization of document collections in digital libraries. Topic detection consists of finding the most significant topics within a document corpus. In this paper we explore the adoption of a feature extraction and reduction methodology to highlight the most significant topics within a corpus. We use an approach based on a clustering algorithm (X-means) applied to the tf-idf matrix computed from the corpus, which describes the frequency of terms, represented by the columns, occurring in each document, represented by a row. To extract the topics, we build n binary problems, where n is the number of clusters produced by the unsupervised clustering step, and we perform supervised feature selection on each of them, taking the top-ranked features as the topic descriptors. We show the results obtained on two different corpora, both in Italian: the first consists of documents from the University of Naples Federico II, and the second is a collection of medical records.
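The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names `tfidf_matrix` and `topic_descriptors`, the toy English documents, and the `top_k` parameter are hypothetical. The cluster labels are assigned by hand as a stand-in for the X-means output, and the ranking criterion (difference between the mean tf-idf inside a cluster and outside it) is an assumption, since the abstract does not name the supervised feature selection method.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf (an assumed variant; several definitions are in use).
    idf = {term: math.log(n_docs / df[term]) + 1.0 for term in vocab}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        matrix.append(
            [(counts[t] / total * idf[t]) if total else 0.0 for t in vocab]
        )
    return vocab, matrix


def topic_descriptors(vocab, matrix, labels, top_k=3):
    """Build one 'cluster vs. rest' binary problem per cluster and rank terms."""
    topics = {}
    for cluster in sorted(set(labels)):
        inside = [row for row, lab in zip(matrix, labels) if lab == cluster]
        outside = [row for row, lab in zip(matrix, labels) if lab != cluster]
        scores = []
        for j, term in enumerate(vocab):
            mean_in = sum(row[j] for row in inside) / len(inside)
            mean_out = (
                sum(row[j] for row in outside) / len(outside) if outside else 0.0
            )
            # Assumed criterion: terms far more typical of this cluster score high.
            scores.append((mean_in - mean_out, term))
        # Highest score first; ties broken alphabetically for determinism.
        scores.sort(key=lambda s: (-s[0], s[1]))
        topics[cluster] = [term for _, term in scores[:top_k]]
    return topics


# Toy corpus; the labels below stand in for clusters found by X-means.
docs = [
    "exam schedule for engineering students",
    "student exam registration deadline",
    "patient blood pressure therapy",
    "patient therapy follow up",
]
labels = [0, 0, 1, 1]

vocab, matrix = tfidf_matrix(docs)
topics = topic_descriptors(vocab, matrix, labels, top_k=2)
for cluster, terms in topics.items():
    print(cluster, terms)
```

In a realistic setting the hand-assigned labels would come from the clustering step, and the mean-difference score could be swapped for any supervised ranking criterion applied to each binary problem.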
CRIS type:
04.01 Conference proceedings contribution
Keywords:
Clustering; Feature reduction; Tf-idf; Topic detection
Author list:
Gargiulo, Francesco
Institutional authors:
GARGIULO FRANCESCO
Link to full record:
https://iris.cnr.it/handle/20.500.14243/319347
General data

URL

http://www.scopus.com/record/display.url?eid=2-s2.0-84914173148&origin=inward