UNI-FIND
cnr.it
Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders

Article
Publication Date:
2021
Abstract:
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences, i.e., image regions and words, respectively, in order to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both the MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task.

Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links would prevent the separate extraction of the visual and textual features needed for the online search and the offline indexing steps in large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way towards effective and efficient methods for large-scale cross-modal information retrieval.

We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% on the Recall@1 metric for the image and the sentence retrieval tasks, respectively. The code used for the experiments is publicly available on GitHub at https://github.com/mesnico/TERAN.
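The late-fusion idea described in the abstract — extract region and word features independently, then combine them only in a final alignment step — can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' TERAN implementation: the function names, the feature dimensions, and the max-then-mean pooling of the alignment matrix are all assumptions made for clarity.

```python
import numpy as np

def cosine_alignments(regions, words):
    """Region-word alignment matrix from independently extracted features.

    regions: (n_regions, d) image-region features
    words:   (n_words, d) word features
    Returns an (n_regions, n_words) matrix of cosine similarities.
    Because the two feature sets are computed separately, image features
    can be indexed offline and queries encoded online.
    """
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    return r @ w.T

def image_sentence_score(regions, words):
    """Pool fine-grained alignments into one global image-sentence score.

    For each word, take its best-matching region, then average over
    words (a common max-sum-style pooling; a simplifying assumption
    here, not necessarily the pooling used by TERAN).
    """
    alignments = cosine_alignments(regions, words)
    return alignments.max(axis=0).mean()

# Toy usage: 36 region vectors vs. a 12-word sentence, 64-dim features.
rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 64))
words = rng.standard_normal((12, 64))
score = image_sentence_score(regions, words)  # scalar in [-1, 1]
```

The key design point mirrored here is that the two modalities never interact before the final scoring step, so region features for the whole collection can be precomputed and stored in an index, and only the sentence needs encoding at query time.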
CRIS Type:
01.01 Journal article
Keywords:
Deep Learning; Cross-modal retrieval; Multi-modal matching; Computer vision; NLP
Author list:
Messina, Nicola; Amato, Giuseppe; Gennaro, Claudio; Falchi, Fabrizio; Esuli, Andrea
Institutional authors:
AMATO GIUSEPPE
ESULI ANDREA
FALCHI FABRIZIO
GENNARO CLAUDIO
Link to the full record:
https://iris.cnr.it/handle/20.500.14243/395783
Link to the Full Text:
https://iris.cnr.it//retrieve/handle/20.500.14243/395783/180071/prod_457546-doc_177571.pdf
Published in:
ACM TRANSACTIONS ON MULTIMEDIA COMPUTING, COMMUNICATIONS AND APPLICATIONS
Journal
General Information

URL:

https://dl.acm.org/doi/10.1145/3451390
Built with VIVO | Designed by Cineca | 26.5.0.0 | Data source: PREPROD