[1] LUO Zhi-xin, LIU Zhi-gui, TANG Rong, et al. Multiscale KAN Convolution with Cross-modal Attention for Audiovisual Emotion Recognition [J]. Computer Technology and Development, 2025, (07): 100-107. [doi:10.20165/j.cnki.ISSN1673-629X.2025.0043]

Multiscale KAN Convolution with Cross-modal Attention for Audiovisual Emotion Recognition

Computer Technology and Development [ISSN:1673-629X / CN:61-1281/TN]

Volume:
Issue:
2025, No. 07
Pages:
100-107
Section:
Artificial Intelligence
Publication Date:
2025-07-10

Article Info

Title:
Multiscale KAN Convolution with Cross-modal Attention for Audiovisual Emotion Recognition
Article ID:
1673-629X(2025)07-0100-08
Author(s):
LUO Zhi-xin¹, LIU Zhi-gui¹,², TANG Rong¹, PAN Zhi-xiang³, LI Li¹,²
1. School of Information Engineering, Southwest University of Science and Technology, Mianyang 621000, China;
2. Sichuan Engineering and Technology Research Center for Industrial Autonomous and Controllable Artificial Intelligence, Mianyang 621000, China;
3. Sichuan Hushan Electrical Appliance Company Limited, Mianyang 621000, China
Keywords:
emotion recognition; audiovisual fusion; cross-modal attention; KAN convolution; multiscale extraction
CLC Number:
TP391
DOI:
10.20165/j.cnki.ISSN1673-629X.2025.0043
Abstract:
To address the insufficient study of modality complementarity at the feature extraction level in existing audio-visual emotion recognition methods, as well as the inability of traditional methods to fully exploit the complementary relationship between the audio and video modalities, we propose an emotion recognition model based on Kolmogorov-Arnold Networks (KAN) convolution, multi-scale feature extraction, and a cross-modal attention mechanism. The model introduces KAN convolution into the audio and video feature extraction process, capturing emotion-related features at different levels through multi-scale convolution kernels. KAN convolution models nonlinear patterns in the data using learnable B-spline functions, enhancing the model's ability to learn complex emotional patterns. To improve the complementarity of information between modalities, a cross-modal attention mechanism is adopted in the feature fusion stage. This mechanism effectively weights and fuses the audio and video features, allowing the model to better capture the interactions between the two modalities and thus improving emotion recognition performance. Experiments on the RAVDESS dataset show that the proposed model achieves an accuracy of 75.62% and an F1 score of 77.23%, significantly outperforming traditional methods. The study demonstrates that the proposed model exhibits stronger robustness and adaptability in multimodal emotion recognition tasks, providing a new and effective solution for audio-visual emotion recognition applications.
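The abstract describes KAN convolution as modeling nonlinearities with learnable B-spline functions. The NumPy sketch below evaluates a cubic B-spline basis via the Cox-de Boor recursion and combines it with a SiLU base term, following the public KAN formulation; the knot grid, degree, and coefficient values are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def bspline_basis(x, grid, k=3):
    """Values of the degree-k B-spline basis at points x.

    grid: increasing 1-D knot vector of length G; returns (len(x), G-k-1).
    Built with the Cox-de Boor recursion from degree-0 indicator bases.
    """
    x = np.asarray(x, float)[:, None]
    t = np.asarray(grid, float)[None, :]
    B = ((x >= t[:, :-1]) & (x < t[:, 1:])).astype(float)  # degree 0
    for d in range(1, k + 1):
        left = (x - t[:, :-(d + 1)]) / (t[:, d:-1] - t[:, :-(d + 1)]) * B[:, :-1]
        right = (t[:, d + 1:] - x) / (t[:, d + 1:] - t[:, 1:-d]) * B[:, 1:]
        B = left + right
    return B

def kan_edge(x, coeffs, grid, k=3, w_base=1.0, w_spline=1.0):
    """One learnable KAN edge: phi(x) = w_base*silu(x) + w_spline*sum_i c_i*B_i(x)."""
    silu = x / (1.0 + np.exp(-x))  # smooth base activation
    return w_base * silu + w_spline * bspline_basis(x, grid, k) @ coeffs

# Illustrative: 12 uniform knots on [-2, 2] give 8 cubic basis functions.
grid = np.linspace(-2.0, 2.0, 12)
coeffs = np.random.default_rng(0).standard_normal(8)  # learned in a real model
y = kan_edge(np.linspace(-0.8, 0.8, 5), coeffs, grid)
print(y.shape)  # (5,)
```

In the full model, each convolution-kernel weight is replaced by such an edge function, and the spline coefficients are trained jointly with the rest of the network.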
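The cross-modal fusion stage can be sketched as scaled dot-product attention in which one modality's features form the queries and the other's form the keys and values. The single-head NumPy sketch below is a minimal illustration; the frame counts, feature dimensions, and random projection matrices are placeholders for learned parameters, not values from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio, video, d_k=64, seed=0):
    """Audio frames (queries) attend over video frames (keys/values).

    audio: (Ta, Da), video: (Tv, Dv). Returns fused features (Ta, d_k)
    and the (Ta, Tv) attention weights. The projections here are random
    stand-ins for trained weight matrices.
    """
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((audio.shape[1], d_k)) / np.sqrt(audio.shape[1])
    Wk = rng.standard_normal((video.shape[1], d_k)) / np.sqrt(video.shape[1])
    Wv = rng.standard_normal((video.shape[1], d_k)) / np.sqrt(video.shape[1])
    Q, K, V = audio @ Wq, video @ Wk, video @ Wv
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (Ta, Tv) fusion weights
    return weights @ V, weights

rng = np.random.default_rng(1)
audio = rng.standard_normal((5, 128))  # 5 audio frames, 128-dim features
video = rng.standard_normal((7, 256))  # 7 video frames, 256-dim features
fused, weights = cross_modal_attention(audio, video)
print(fused.shape, weights.shape)  # (5, 64) (5, 7)
```

A symmetric pass with video as queries yields video-conditioned features; the two fused streams would then be combined (e.g. concatenated) before classification.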


Last Update: 2025-07-10