Publication Date:
2024
Abstract:
With the increasing importance of multimedia and multilingual data in online encyclopedias,
novel methods are needed to fill domain gaps and automatically connect different modalities
for increased accessibility. For example,Wikipedia is composed of millions of pages written
in multiple languages. Images, when present, often lack textual context, thus remaining
conceptually detached and thus harder to find and manage. In this work, we tackle the novel task
of associating images from Wikipedia pages with the correct caption among a large pool
of available ones written in multiple languages, as required by the image-caption matching
Kaggle challenge organized by the Wikimedia Foundation. A system able to perform this task
would improve the accessibility and completeness of the underlying multi-modal knowledge
graph in online encyclopedias. We propose a cascade of two models, powered by recent
Transformer networks, that efficiently and effectively infers a relevance score between
the query image and the candidate captions. We verify through extensive experiments that the
proposed cascaded approach effectively handles a large pool of images and captions while
keeping the overall computational complexity bounded at inference time. With respect to
other approaches on the challenge leaderboard, we achieve remarkable improvements over
the previous proposals (+8% in nDCG@5 with respect to the sixth position) with constrained
resources. The code is publicly available at https://tinyurl.com/wiki-imcap.
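The abstract describes the cascade only at a high level. Below is a minimal, hypothetical sketch of the two-stage pattern that description suggests: a fast dual-encoder stage shortlists captions by embedding similarity, and a slower relevance scorer re-ranks only the shortlist. The function names (embed_image, embed_captions, rerank_score) and the random embeddings are illustrative assumptions, not the authors' actual models.

```python
# Hypothetical sketch of a two-stage image-caption matching cascade.
# Stage 1: a cheap dual-encoder shortlists the top-k captions by cosine
# similarity; Stage 2: a more expensive scorer re-ranks only the shortlist,
# keeping total inference cost bounded regardless of pool size.
# Embeddings and the re-ranking score are random stand-ins (assumption)
# for the real visual/textual Transformer encoders.
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # embedding size (illustrative)

def embed_image(image_id: int) -> np.ndarray:
    # Placeholder for a visual Transformer encoder.
    return rng.standard_normal(DIM)

def embed_captions(n_captions: int) -> np.ndarray:
    # Placeholder for a multilingual text Transformer encoder.
    return rng.standard_normal((n_captions, DIM))

def rerank_score(image_vec: np.ndarray, caption_vec: np.ndarray) -> float:
    # Placeholder for a joint (cross-attention) relevance model; a dot
    # product is used here only so the sketch runs end to end.
    return float(image_vec @ caption_vec)

def cascade_match(image_vec, caption_vecs, k=5):
    # Stage 1: cosine similarity over the full caption pool (fast).
    img = image_vec / np.linalg.norm(image_vec)
    caps = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    sims = caps @ img
    shortlist = np.argsort(-sims)[:k]  # top-k candidate captions
    # Stage 2: expensive re-ranking restricted to the shortlist.
    return sorted(shortlist,
                  key=lambda i: -rerank_score(image_vec, caption_vecs[i]))

caption_pool = embed_captions(10_000)
ranking = cascade_match(embed_image(42), caption_pool, k=5)
print("Top-5 caption indices after re-ranking:", ranking)
```

Restricting the expensive scorer to a fixed-size shortlist is what keeps inference cost bounded as the caption pool grows, which is the property the abstract claims for the cascaded design.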
CRIS Type:
01.01 Journal article
Keywords:
Multi-modal matching; Information retrieval; Deep learning; Transformer networks
Author list:
Coccomini, Davide Alessandro; Falchi, Fabrizio; Esuli, Andrea; Messina, Nicola
Link to full record:
Link to Full Text:
Published in: