[1] DUAN Mao-mao, WEI Yi-wei. Multimodal Interaction Network for Image Captioning [J]. Computer Technology and Development, 2024, 34(05): 44-51. [doi:10.20165/j.cnki.ISSN1673-629X.2024.0039]

Multimodal Interaction Network for Image Captioning

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
34
Issue:
2024, No. 05
Pages:
44-51
Section:
Media Computing
Publication Date:
2024-05-10

Article Info

Title:
Multimodal Interaction Network for Image Captioning
Article ID:
1673-629X(2024)05-0044-08
Author(s):
DUAN Mao-mao, WEI Yi-wei
School of Petroleum, China University of Petroleum-Beijing at Karamay, Karamay 834000, China
Keywords:
multimodal; image captioning; self-attention; long short-term memory; visual; semantic
CLC Number:
TP391
DOI:
10.20165/j.cnki.ISSN1673-629X.2024.0039
Abstract:
In image captioning, multimodal approaches are widely exploited by simultaneously providing visual inputs and semantic attributes to capture multi-level information. However, most approaches still use the two modalities in isolation, without considering the correlation between them. To fill this gap, we first introduce a Bi-Directional Attention Flow (BiDAF) module that extends the self-attention mechanism in a bi-directional manner to model complex interactions between the modalities. We then introduce a Gated Linear Memory (GLM) module that matches the function of a Long Short-Term Memory (LSTM) network with only a single forget gate, which reduces decoder complexity while capturing multimodal interaction information. Finally, we apply BiDAF and GLM as the encoder and decoder of the image captioning model, respectively, forming a Multimodal Interactive Network (MINet). Experimental results on the MS COCO dataset show that, compared with existing multimodal methods, MINet has a simpler decoder, produces better image descriptions, and achieves higher evaluation scores, while also being more efficient because it requires no pre-training.
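The bi-directional attention between visual and textual features described above can be sketched roughly as follows. This is an illustrative guess, not the paper's implementation: the dot-product similarity, the softmax normalization axes, and the function name `bidaf_interaction` are all assumptions, since the abstract does not specify BiDAF's scoring function.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidaf_interaction(V, T):
    """Hypothetical bi-directional attention between visual regions
    V (n, d) and textual attributes T (m, d).

    Returns text-attended region features (n, d) and
    region-attended text features (m, d)."""
    S = V @ T.T                       # (n, m) similarity matrix
    v2t = softmax(S, axis=1) @ T      # each region attends over the text
    t2v = softmax(S, axis=0).T @ V    # each attribute attends over regions
    return v2t, t2v
```

In this reading, each direction of the flow reuses the same similarity matrix, so the two modalities condition on each other rather than being encoded in isolation.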
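The single-forget-gate decoder cell can likewise be sketched. The abstract only states that GLM reproduces LSTM-like behavior with one forget gate; the update rule below, a convex blend between the previous hidden state and a linear transform of the input, is a hypothetical reading of that description, and the class name and weight shapes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GatedLinearMemory:
    """Sketch of a recurrent cell with a single forget gate."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # forget gate acts on the concatenation [x_t; h_{t-1}]
        self.W_f = rng.normal(0, 0.1, (hidden_dim, input_dim + hidden_dim))
        self.b_f = np.zeros(hidden_dim)
        # linear candidate transform of the current input
        self.W_c = rng.normal(0, 0.1, (hidden_dim, input_dim))

    def step(self, x_t, h_prev):
        f_t = sigmoid(self.W_f @ np.concatenate([x_t, h_prev]) + self.b_f)
        # keep a gated fraction of the old state, admit the rest from x_t
        return f_t * h_prev + (1.0 - f_t) * (self.W_c @ x_t)

    def run(self, xs):
        h = np.zeros(self.W_c.shape[0])
        for x_t in xs:
            h = self.step(x_t, h)
        return h
```

Compared with a full LSTM (input, forget, and output gates plus a separate cell state), a cell of this shape has roughly a third of the gate parameters, which is consistent with the paper's claim of a simpler decoder.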


Last Update: 2024-05-10