[1] PAN Tao, CHEN Hu, HUANG Ju, et al. Research on Unsupervised Video Summarization Algorithm Based on Multimodal Fusion[J]. Computer Technology and Development, 2024, 34(11): 29-35. [doi:10.20165/j.cnki.ISSN1673-629X.2024.0239]

Research on Unsupervised Video Summarization Algorithm Based on Multimodal Fusion

Computer Technology and Development [ISSN:1006-6977/CN:61-1281/TN]

Volume:
34
Issue:
2024, No. 11
Pages:
29-35
Section:
Media Computing
Publication Date:
2024-11-10

Article Info

Title:
Research on Unsupervised Video Summarization Algorithm Based on Multimodal Fusion
Article ID:
1673-629X(2024)11-0029-07
Author(s):
PAN Tao 1, CHEN Hu 1,2, HUANG Ju 3, WU Chang-ke 2, DENG Biao 3, WU Zhi-hong 1,2
1. School of Computer Science, Sichuan University, Chengdu 610065, China; 2. State Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu 610065, China; 3. Dongfang Electric Corporation, Chengdu 610036, China
Keywords:
unsupervised video summarization; multimodal fusion; self-attention network; feature pyramid network; feature encoder
CLC Number:
TP391.4
DOI:
10.20165/j.cnki.ISSN1673-629X.2024.0239
Abstract:
Video summarization aims to construct a concise yet comprehensive summary by selecting the most informative content of a video, facilitating rapid understanding of the video and conserving storage space. Existing methods face challenges including inadequate utilization of multimodal information and weak feature representation capability. We propose an unsupervised video summarization algorithm based on multimodal fusion and multi-scale temporal information. First, building on video image, audio, and text features, we introduce a two-stage feature fusion module that preserves non-redundant information between modalities and enhances per-frame feature representation. Then, we employ self-attention and feature pyramid networks to model global and local temporal dependencies, select keyframes based on multi-scale contextual information, and form a high-quality video summary. Experimental results show that, compared with other unsupervised video summarization algorithms, the proposed algorithm improves F-Score by 0.5 and 1.4 percentage points on the SumMe dataset under the canonical and augmented settings, respectively, and achieves the highest F-Score on the TVSum dataset.
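The pipeline sketched in the abstract (per-modality features projected into a shared space, a two-stage fusion step, self-attention for global dependencies, and FPN-style multi-scale pooling for local context, followed by top-k keyframe selection) can be illustrated roughly as below. This is a toy NumPy sketch, not the paper's implementation: all dimensions, the helper names (`project`, `multiscale`), and the random weights are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # assumed shared embedding size

def project(x, w):
    # Linear projection of per-frame features.
    return x @ w

def self_attention(x):
    # Single-head scaled dot-product self-attention over the frame axis.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x

def multiscale(x, scales=(1, 2, 4)):
    # FPN-style multi-scale context: pool at several strides, upsample, average.
    T = x.shape[0]
    out = np.zeros_like(x)
    for s in scales:
        pooled = np.array([x[i:i + s].mean(axis=0) for i in range(0, T, s)])
        out += np.repeat(pooled, s, axis=0)[:T]
    return out / len(scales)

# Toy per-frame features for three modalities (T frames each).
T = 8
image = rng.normal(size=(T, 32))
audio = rng.normal(size=(T, 12))
text = rng.normal(size=(T, 20))

# Stage 1: project every modality into a shared space.
zi = project(image, rng.normal(size=(32, D)))
za = project(audio, rng.normal(size=(12, D)))
zt = project(text, rng.normal(size=(20, D)))

# Stage 2: concatenate and re-project, keeping cross-modal information.
fused = project(np.concatenate([zi, za, zt], axis=1), rng.normal(size=(3 * D, D)))

# Global dependencies via self-attention, local context via multi-scale pooling.
context = multiscale(self_attention(fused))

# Score frames and keep the top 25% as the summary.
scores = context @ rng.normal(size=(D,))
k = max(1, T // 4)
keyframes = np.sort(np.argsort(scores)[-k:])  # indices of selected frames
```

The real model would learn the projection and attention weights; here they are random purely to show the data flow from raw modality features to a keyframe index set.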
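The F-Score reported on SumMe and TVSum is conventionally the harmonic mean of precision and recall over the overlap between the selected and the annotated keyframes. A minimal sketch (the function name and the toy masks are hypothetical, not from the paper):

```python
import numpy as np

def summary_f_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """F-measure between predicted and ground-truth binary keyframe masks."""
    overlap = float(np.logical_and(pred, gt).sum())
    if overlap == 0:
        return 0.0
    precision = overlap / pred.sum()
    recall = overlap / gt.sum()
    return 2 * precision * recall / (precision + recall)

# Toy example: 10 frames, 4 selected by the model vs 5 annotated.
pred = np.array([1, 1, 0, 0, 1, 0, 0, 1, 0, 0])
gt = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 0])
print(round(summary_f_score(pred, gt), 3))  # -> 0.667
```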


Last Update: 2024-11-10