Lightweight random indexing for polylingual text classification

Article
Publication date:
2016
Abstract:
Multilingual Text Classification (MLTC) is a text classification task in which documents are written each in one among a set L of natural languages, and in which all documents must be classified under the same classification scheme, irrespective of language. There are two main variants of MLTC, namely Cross-Lingual Text Classification (CLTC) and Polylingual Text Classification (PLTC). In PLTC, which is the focus of this paper, we assume (differently from CLTC) that for each language in L there is a representative set of training documents; PLTC consists of improving the accuracy of each of the |L| monolingual classifiers by also leveraging the training documents written in the other (|L| - 1) languages. The obvious solution, consisting of generating a single polylingual classifier from the juxtaposed monolingual vector spaces, is usually infeasible, since the dimensionality of the resulting vector space is roughly |L| times that of a monolingual one, and is thus often unmanageable. As a response, the use of machine translation tools or multilingual dictionaries has been proposed. However, these resources are not always available, or are not always free to use. One machine-translation-free and dictionary-free method that, to the best of our knowledge, has never been applied to PLTC before, is Random Indexing (RI). We analyse RI in terms of space and time efficiency, and propose a particular configuration of it (that we dub Lightweight Random Indexing, LRI). By running experiments on two well known public benchmarks, Reuters RCV1/RCV2 (a comparable corpus) and JRC-Acquis (a parallel one), we show LRI to outperform (both in terms of effectiveness and efficiency) a number of previously proposed machine-translation-free and dictionary-free PLTC methods that we use as baselines.
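
As a rough illustration of the Random Indexing idea mentioned in the abstract, the following Python sketch maps a juxtaposed polylingual term space into a smaller, fixed-size space by giving every term a sparse random index vector. It is not the authors' implementation: the function names (`random_index_matrix`, `project`), the toy vocabulary, and the values of `dim` and `k` are assumptions chosen for illustration, and the abstract does not state the configuration that LRI uses.

```python
import numpy as np

def random_index_matrix(n_terms, dim, k, seed=0):
    """Build one sparse random index vector per term (hypothetical helper).

    Each row has exactly k nonzero entries, each drawn from {-1, +1},
    placed at k distinct random positions among the dim columns.
    """
    rng = np.random.default_rng(seed)
    R = np.zeros((n_terms, dim))
    for t in range(n_terms):
        positions = rng.choice(dim, size=k, replace=False)
        R[t, positions] = rng.choice([-1.0, 1.0], size=k)
    return R

def project(X, R):
    """Map a document-by-term matrix X into the reduced space."""
    return X @ R

# Toy juxtaposed vocabulary (assumed): terms 0-5 belong to one language,
# terms 6-11 to another, so the full space has 12 dimensions.
n_terms, dim, k = 12, 8, 2
R = random_index_matrix(n_terms, dim, k)

X = np.zeros((3, n_terms))
X[0, [0, 2]] = 1.0   # document written in the first language
X[1, [7, 9]] = 1.0   # document written in the second language
X[2, [1, 8]] = 1.0   # document mixing terms from both languages

Z = project(X, R)    # every document now lives in the same dim-sized space
```

In this sketch the reduced dimensionality `dim` is chosen independently of the number of languages, which is what sidesteps the roughly |L|-fold growth of the juxtaposed space described in the abstract.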
CRIS type:
01.01 Journal article
Keywords:
Polylingual text classification
Author list:
MOREO FERNANDEZ, Alejandro; Esuli, Andrea; Sebastiani, Fabrizio
Institutional authors:
ESULI ANDREA
MOREO FERNANDEZ ALEJANDRO DAVID
SEBASTIANI FABRIZIO
Link to the full record:
https://iris.cnr.it/handle/20.500.14243/316537
Link to the full text:
https://iris.cnr.it//retrieve/handle/20.500.14243/316537/179463/prod_359535-doc_117979.pdf
Published in:
THE JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH
Journal
General data

URL

http://jair.org/papers/paper5194.html