[1] YANG Di, WU Chun-ming. A Cross-modal Image and Text Retrieval Algorithm Integrating Attention Mechanism [J]. Computer Technology and Development, 2023, 33(11): 143-148. [doi:10.3969/j.issn.1673-629X.2023.11.021]
A Cross-modal Image and Text Retrieval Algorithm Integrating Attention Mechanism
Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]
- Volume: 33
- Issue: 2023, No. 11
- Pages: 143-148
- Column: Artificial Intelligence
- Publication Date: 2023-11-10
Article Info
- Title: A Cross-modal Image and Text Retrieval Algorithm Integrating Attention Mechanism
- Article No.: 1673-629X(2023)11-0143-06
- Author(s): YANG Di; WU Chun-ming
- Affiliation: School of Computer and Information Science, Southwest University, Chongqing 400700, China
- Keywords: image-text retrieval; cross-modal; attention mechanism; global feature; local feature
- CLC Number: TP391
- DOI: 10.3969/j.issn.1673-629X.2023.11.021
- Abstract: With the explosive growth of data in different modalities, cross-modal retrieval has become an important research topic in the field of information retrieval. Because the underlying features of semantically identical objects are heterogeneous across modalities, how to measure the similarity between them scientifically is the first important problem to be solved in cross-modal retrieval research. Current mainstream image-text retrieval methods map the heterogeneous features into a common space through a model and then measure similarity. These works follow two main ideas: one achieves global information alignment from the perspective of global features, and the other achieves fine-grained information alignment from the perspective of local features. However, the former easily loses local details, while the latter easily leads to incomplete semantic information. Therefore, we propose a cross-modal image-text retrieval algorithm that integrates an attention mechanism. Firstly, we use the Vision Transformer and BERT models to obtain image and text features containing context information, and then use an attention mechanism to obtain local intra-modal image and text features. Secondly, we use an attention mechanism to obtain global inter-modal image and text features. Finally, we fuse these optimized features with the basic features to conduct cross-modal retrieval. The proposed algorithm not only makes full use of the fine-grained features of different modalities but also gives better consideration to global information, so it can achieve better retrieval accuracy. A large number of comparative experiments on the Wikipedia dataset demonstrate the effectiveness of the proposed algorithm.
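The pipeline described in the abstract (local features refined by attention, pooled into a global vector, fused with a base feature, then compared for retrieval) can be sketched as follows. This is a minimal illustration only, not the paper's implementation: random arrays stand in for Vision Transformer patch features and BERT token features, and the function names (`attention`, `fused_embedding`, `cosine_sim`) are hypothetical.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

def fused_embedding(local_feats):
    # Intra-modal self-attention refines the local features; mean-pooling
    # the refined features gives an attention-optimized global vector,
    # which is fused (here: concatenated) with the pooled base feature.
    refined = attention(local_feats, local_feats, local_feats)
    base = local_feats.mean(axis=0)
    return np.concatenate([refined.mean(axis=0), base])

def cosine_sim(a, b):
    # Similarity in the common embedding space, used to rank retrieval results.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(0)
img_patches = rng.normal(size=(49, 64))  # stand-in for ViT patch features
txt_tokens = rng.normal(size=(12, 64))   # stand-in for BERT token features

img_emb = fused_embedding(img_patches)
txt_emb = fused_embedding(txt_tokens)
print(cosine_sim(img_emb, txt_emb))
```

In the actual method, the inter-modal step would additionally attend across image and text features before fusion, and the projection into the common space would be learned; this sketch only shows the attend-pool-fuse-compare structure.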
Last Update: 2023-11-10