Automatic Creation of Quality Multi-Word Lexica from Noisy Text Data

Conference proceedings contribution
Publication date:
2012
Abstract:
This paper describes the design of a tool for the automatic creation of multi-word lexica that is deployed as a web service and runs on automatically web-crawled data within the framework of the PANACEA platform. The main purpose of our task is to provide a computationally "light" tool that creates a full, high-quality lexical resource of multi-word items. Within the platform, this tool is typically inserted in a workflow whose first step is automatic web-crawling, so the input data of our lexical extractor is intrinsically noisy. The paper evaluates the capacity of the tool to deal with noisy data, and in particular with texts containing a significant amount of duplicated paragraphs. The accuracy of the extraction of multi-word expressions from the original crawled corpus is compared to the accuracy of the extraction from a later "de-duplicated" version of the corpus. The paper shows that our method can also extract multi-word expressions with sufficiently good precision from the original, noisy crawled data. The output of our tool is a multi-word lexicon formatted and encoded in XML according to the Lexical Markup Framework.
CRIS type:
04.01 Conference proceedings contribution
Keywords:
Lexical induction; multi-word extraction; web-based distributed platform; noisy data
Author list:
Frontini, Francesca; Rubino, Francesco; Quochi, Valeria
Institutional authors:
FRONTINI FRANCESCA
QUOCHI VALERIA
Link to full record:
https://iris.cnr.it/handle/20.500.14243/128272
Book title:
Proceedings of the Sixth Workshop on Analytics for Noisy Unstructured Text Data
General data

URL

http://www.kde.cs.tut.ac.jp/~aono/pdf/COLING2012/AND/pdf/AND04.pdf