DUAN Mao-mao, LIAN Pei-yu, SHI Hai-tao. Research on Question and Answer Models in Dynamic Audio-visual Scenarios [J]. Computer Technology and Development, 2024, 34(03): 163-169. [doi:10.3969/j.issn.1673-629X.2024.03.024]

Research on Question and Answer Models in Dynamic Audio-visual Scenarios

《计算机技术与发展》(Computer Technology and Development) [ISSN:1006-6977/CN:61-1281/TN]

Volume: 34
Issue: No. 03, 2024
Pages: 163-169
Section: Artificial Intelligence
Publication Date: 2024-03-10

Article Info

Title:
Research on Question and Answer Models in Dynamic Audio-visual Scenarios
Article ID:
1673-629X(2024)03-0163-07
Author(s):
DUAN Mao-mao, LIAN Pei-yu, SHI Hai-tao
School of Petroleum, China University of Petroleum-Beijing at Karamay, Karamay 834000, Xinjiang, China
Keywords:
audio-visual question and answer; multimodal fusion; joint attention mechanism; Bi-directional Long Short-Term Memory (Bi-LSTM); MUSIC-AVQA
CLC Number:
TP391
DOI:
10.3969/j.issn.1673-629X.2024.03.024
Abstract:
The real world consists of a variety of modalities, and information from different modalities is interrelated and complementary. Exploring the relationships and characteristics between different modalities can effectively compensate for the limitations of individual modalities. The research on dynamic audio-visual question answering (QA) models aims to use multimodal information from videos to answer questions about visual objects, sounds, and their relationships, enabling artificial intelligence to achieve scene understanding and spatio-temporal reasoning capabilities. To address the problem of imprecise audio-visual QA, a spatio-temporal question answering model is proposed. This model combines spatial fusion modelling and temporal fusion modelling to integrate multimodal features and improve the accuracy of QA. Firstly, audio, video and text features are extracted using ResNet-18, VGGish and Bi-LSTM respectively. Secondly, an early fusion approach is applied to spatially fuse the audio and video modalities based on their relationship. Then, a joint attention mechanism is applied to fuse the features after mutual learning to enhance their complementarity. Finally, a post-fusion attention mechanism is added to enhance the correlation between the fused features and the text. Experimental results on the MUSIC-AVQA dataset show an accuracy of 73.49%, indicating the improvement in scene understanding and spatio-temporal reasoning capabilities achieved by the proposed model.
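To make the three-stage pipeline in the abstract concrete, below is a minimal PyTorch sketch of a spatio-temporal AVQA model of this shape: ResNet-18 frame features spatially fused with audio, bidirectional joint (cross-modal) attention for temporal fusion, and question-guided attention over the fused sequence. The class name SpatioTemporalAVQA, all dimensions, the dot-product form of the attention steps, and the use of precomputed 128-d VGGish clip embeddings are illustrative assumptions, not the authors' released implementation.

# A minimal sketch of the pipeline the abstract describes. Audio is assumed to
# arrive as precomputed 128-d VGGish clip embeddings, one per sampled frame;
# pretrained weights are omitted for brevity.
import torch
import torch.nn as nn
import torchvision.models as models

class SpatioTemporalAVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, d=512):
        # d = 512 matches ResNet-18's final channel width; keep them equal.
        super().__init__()
        # Visual encoder: ResNet-18 truncated before global pooling, keeping
        # the 7x7 spatial grid so audio can attend over image regions.
        resnet = models.resnet18(weights=None)
        self.visual_cnn = nn.Sequential(*list(resnet.children())[:-2])
        self.audio_proj = nn.Linear(128, d)      # lift VGGish features to d
        # Text encoder: word embedding + Bi-LSTM, as named in the abstract.
        self.embed = nn.Embedding(vocab_size, 300)
        self.bilstm = nn.LSTM(300, d // 2, batch_first=True, bidirectional=True)
        # Joint attention: each modality queries the other (mutual assistance).
        self.a2v = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.v2a = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.fuse_proj = nn.Linear(2 * d, d)
        self.classifier = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(),
                                        nn.Linear(d, num_answers))

    def spatial_fusion(self, frames, audio):
        # Early spatial fusion: each clip's audio vector attends over the 7x7
        # grid of its frame, pooling the sound-relevant regions.
        B, T = frames.shape[:2]
        fmap = self.visual_cnn(frames.flatten(0, 1))      # (B*T, 512, 7, 7)
        fmap = fmap.flatten(2).transpose(1, 2)            # (B*T, 49, 512)
        a = self.audio_proj(audio).flatten(0, 1)          # (B*T, 512)
        w = torch.softmax(fmap @ a.unsqueeze(-1) / fmap.size(-1) ** 0.5, dim=1)
        pooled = (w * fmap).sum(dim=1)                    # (B*T, 512)
        return pooled.view(B, T, -1), a.view(B, T, -1)

    def forward(self, frames, audio, question):
        # frames: (B, T, 3, 224, 224)  audio: (B, T, 128)  question: (B, L) ids
        v, a = self.spatial_fusion(frames, audio)
        # Temporal fusion: the two streams assist each other via joint
        # attention, then are concatenated and projected per time step.
        v_ctx, _ = self.a2v(v, a, a)
        a_ctx, _ = self.v2a(a, v, v)
        fused = self.fuse_proj(torch.cat([v + v_ctx, a + a_ctx], dim=-1))
        # Post-fusion attention: the question vector reweights the fused
        # sequence so answer-relevant moments dominate.
        q = self.bilstm(self.embed(question))[0].mean(dim=1)     # (B, d)
        w = torch.softmax(fused @ q.unsqueeze(-1) / q.size(-1) ** 0.5, dim=1)
        ctx = (w * fused).sum(dim=1)
        return self.classifier(torch.cat([ctx, q], dim=-1))     # answer logits

As a usage sketch, model = SpatioTemporalAVQA(vocab_size=5000, num_answers=42) takes a (B, T, 3, 224, 224) frame tensor, a (B, T, 128) audio tensor and a (B, L) tensor of token ids, and returns (B, num_answers) logits; the vocabulary and answer counts here are placeholders, not the MUSIC-AVQA statistics.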


Last Update: 2024-03-10