[1] DUAN Mao-mao, WEI Yi-wei. Multimodal Interaction Network for Image Captioning [J]. Computer Technology and Development, 2024, 34(05): 44-51. [doi:10.20165/j.cnki.ISSN1673-629X.2024.0039]

Multimodal Interaction Network for Image Captioning

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
34
Issue:
2024, No. 05
Pages:
44-51
Section:
Media Computing
Publication Date:
2024-05-10

Article Info

Title:
Multimodal Interaction Network for Image Captioning
Article ID:
1673-629X(2024)05-0044-08
Author(s):
DUAN Mao-mao, WEI Yi-wei
School of Petroleum, China University of Petroleum-Beijing at Karamay, Karamay 834000, China
Keywords:
multimodal; image captioning; self-attention; long short-term memory; visual; semantic
CLC Number:
TP391
DOI:
10.20165/j.cnki.ISSN1673-629X.2024.0039
Abstract:
In image captioning, multimodal approaches are widely exploited by simultaneously providing visual inputs and semantic attributes to capture multi-level information. However, most approaches still use the two modalities in isolation, without considering the correlation between them. To fill this gap, we first introduce a Bi-Directional Attention Flow (BiDAF) module that extends the self-attention mechanism in a bi-directional manner to model complex interactions between the modalities. We then introduce a Gated Linear Memory (GLM) module that matches the function of a Long Short-Term Memory (LSTM) network with only a single forget gate, which reduces decoder complexity while capturing multimodal interaction information. Finally, we apply BiDAF and GLM as the encoder and decoder of the image captioning model, respectively, forming a Multimodal Interactive Network (MINet). Experimental results on the MS COCO dataset show that, compared with existing multimodal methods, MINet has a simpler decoder, produces better image descriptions, and achieves higher evaluation scores, while also being more efficient because it requires no pre-training.
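The bi-directional attention between visual and textual features described above can be sketched roughly as follows. This is an illustrative guess, not the paper's implementation: the dot-product similarity, the softmax normalization axes, and the function name `bidaf_interaction` are all assumptions, since the abstract does not specify BiDAF's scoring function.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidaf_interaction(V, T):
    """Hypothetical bi-directional attention between visual regions
    V (n, d) and textual attributes T (m, d).

    Returns text-attended region features (n, d) and
    region-attended text features (m, d)."""
    S = V @ T.T                       # (n, m) similarity matrix
    v2t = softmax(S, axis=1) @ T      # each region attends over the text
    t2v = softmax(S, axis=0).T @ V    # each attribute attends over regions
    return v2t, t2v
```

In this reading, each direction of the flow reuses the same similarity matrix, so the two modalities condition on each other rather than being encoded in isolation.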
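The single-forget-gate decoder cell can likewise be sketched. The abstract only states that GLM reproduces LSTM-like behavior with one forget gate; the update rule below, a convex blend between the previous hidden state and a linear transform of the input, is a hypothetical reading of that description, and the class name and weight shapes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GatedLinearMemory:
    """Sketch of a recurrent cell with a single forget gate."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # forget gate acts on the concatenation [x_t; h_{t-1}]
        self.W_f = rng.normal(0, 0.1, (hidden_dim, input_dim + hidden_dim))
        self.b_f = np.zeros(hidden_dim)
        # linear candidate transform of the current input
        self.W_c = rng.normal(0, 0.1, (hidden_dim, input_dim))

    def step(self, x_t, h_prev):
        f_t = sigmoid(self.W_f @ np.concatenate([x_t, h_prev]) + self.b_f)
        # keep a gated fraction of the old state, admit the rest from x_t
        return f_t * h_prev + (1.0 - f_t) * (self.W_c @ x_t)

    def run(self, xs):
        h = np.zeros(self.W_c.shape[0])
        for x_t in xs:
            h = self.step(x_t, h)
        return h
```

Compared with a full LSTM (input, forget, and output gates plus a separate cell state), a cell of this shape has roughly a third of the gate parameters, which is consistent with the paper's claim of a simpler decoder.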


Last Update: 2024-05-10