Jointly modeling and simultaneously discovering topics and clusters in text corpora using word vectors

Academic Article
Publication Date:
2021
Abstract:
An innovative model-based approach to coupling text clustering and topic modeling is introduced, in which the two tasks take advantage of each other. Specifically, the integration is enabled by a new generative model of text corpora. This model explains topics, clusters and document content via a Bayesian generative process. In this process, documents consist of word vectors, which capture the (syntactic and semantic) regularities among words. Topics are multivariate Gaussian distributions on word vectors. Clusters are assigned corresponding topic distributions as their semantics. Content generation is ruled by text clusters and topics, which act as interacting latent factors. Documents are first placed into their respective clusters; the semantics of these clusters is then repeatedly sampled to draw document topics, which are in turn sampled to generate word vectors. Under the proposed model, collapsed Gibbs sampling is derived mathematically and implemented algorithmically, with parameter estimation, for the simultaneous inference of text clusters and topics. A comparative assessment on real-world benchmark corpora demonstrates the effectiveness of this approach in clustering texts and uncovering their semantics. Intrinsic and extrinsic criteria are adopted to investigate its topic modeling performance, and the results are illustrated through a case study. Time efficiency and scalability are also studied.
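The generative process outlined in the abstract (cluster assignment, then topic draws from the cluster's topic distribution, then word-vector draws from the topic's Gaussian) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' model: the symmetric Dirichlet priors, the identity covariance shared by all topics, the function name `generate_corpus` and all of its parameters are hypothetical choices made here, since the abstract does not specify them. The sketch covers sampling only and does not attempt the collapsed Gibbs inference.

```python
import numpy as np


def generate_corpus(n_docs=5, n_words=8, n_clusters=2, n_topics=3, dim=4, seed=0):
    """Sample a toy corpus of word vectors from a cluster/topic generative process."""
    rng = np.random.default_rng(seed)

    # Assumption: symmetric Dirichlet priors over clusters and over topics.
    cluster_weights = rng.dirichlet(np.ones(n_clusters))
    # One topic distribution per cluster (the cluster's "semantics").
    cluster_topic_dist = rng.dirichlet(np.ones(n_topics), size=n_clusters)

    # Each topic is a multivariate Gaussian over word vectors.
    # Assumption: random means and a shared identity covariance.
    topic_means = rng.normal(0.0, 3.0, size=(n_topics, dim))
    topic_cov = np.eye(dim)

    docs = []
    for _ in range(n_docs):
        # 1. Place the document into a cluster.
        cluster = int(rng.choice(n_clusters, p=cluster_weights))
        # 2. Repeatedly sample the cluster's topic distribution, once per word.
        topics = rng.choice(n_topics, size=n_words, p=cluster_topic_dist[cluster])
        # 3. Draw each word vector from the Gaussian of its topic.
        vectors = np.stack(
            [rng.multivariate_normal(topic_means[t], topic_cov) for t in topics]
        )
        docs.append({"cluster": cluster, "topics": topics, "vectors": vectors})
    return docs


if __name__ == "__main__":
    for i, doc in enumerate(generate_corpus()):
        print(i, doc["cluster"], doc["topics"].tolist(), doc["vectors"].shape)
```

Each sampled document records its cluster, the per-word topic indices and an array of word vectors with one row per word, which makes the role of clusters and topics as interacting latent factors easy to inspect.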
Iris type:
01.01 Journal article
Keywords:
Document clustering; Topic modeling; Word embeddings; Bayesian text analysis
List of contributors:
Ortale, Riccardo; Costa, Giovanni
Authors of the University:
COSTA GIOVANNI
ORTALE RICCARDO
Handle:
https://iris.cnr.it/handle/20.500.14243/429174
Published in:
INFORMATION SCIENCES
Journal
URL

http://www.scopus.com/record/display.url?eid=2-s2.0-85107649025&origin=inward