[1] LUO Zhi-xin, LIU Zhi-gui, TANG Rong, et al. Multiscale KAN Convolution with Cross-modal Attention for Audiovisual Emotion Recognition [J]. Computer Technology and Development, 2025, (07): 100-107. [doi:10.20165/j.cnki.ISSN1673-629X.2025.0043]

Multiscale KAN Convolution with Cross-modal Attention for Audiovisual Emotion Recognition

Computer Technology and Development [ISSN:1673-629X / CN:61-1281/TN]

Volume:
Issue:
2025, No. 07
Pages:
100-107
Section:
Artificial Intelligence
Publication Date:
2025-07-10

Article Info

Title:
Multiscale KAN Convolution with Cross-modal Attention for Audiovisual Emotion Recognition
Article ID:
1673-629X(2025)07-0100-08
Author(s):
LUO Zhi-xin¹, LIU Zhi-gui¹,², TANG Rong¹, PAN Zhi-xiang³, LI Li¹,²
1. School of Information Engineering, Southwest University of Science and Technology, Mianyang 621000, China;
2. Sichuan Engineering and Technology Research Center for Industrial Autonomous and Controllable Artificial Intelligence, Mianyang 621000, China;
3. Sichuan Hushan Electrical Appliance Company Limited, Mianyang 621000, China
Keywords:
emotion recognition; audiovisual fusion; cross-modal attention; KAN convolution; multiscale extraction
CLC Number:
TP391
DOI:
10.20165/j.cnki.ISSN1673-629X.2025.0043
Abstract:
To address the insufficient study of modality complementarity at the feature extraction level in existing audio-visual emotion recognition methods, as well as the inability of traditional methods to fully exploit the complementary relationship between the audio and video modalities, we propose an emotion recognition model based on Kolmogorov-Arnold Networks (KAN) convolution, multi-scale feature extraction, and a cross-modal attention mechanism. The model introduces KAN convolution into the audio and video feature extraction process, capturing emotion-related features at different levels through multi-scale convolution kernels. KAN convolution models nonlinear patterns in the data using learnable B-spline functions, enhancing the model's ability to learn complex emotional patterns. To improve the complementarity of information between modalities, a cross-modal attention mechanism is adopted in the feature fusion stage. This mechanism effectively weights and fuses the audio and video features, allowing the model to better capture the interactions between the two modalities and thus improving emotion recognition performance. Experiments on the RAVDESS dataset show that the proposed model achieves an accuracy of 75.62% and an F1 score of 77.23%, significantly outperforming traditional methods. The study demonstrates that the proposed model exhibits stronger robustness and adaptability in multimodal emotion recognition tasks, providing a new and effective solution for audio-visual emotion recognition applications.
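The abstract describes KAN convolution as modeling nonlinearities with learnable B-spline functions. The NumPy sketch below evaluates a cubic B-spline basis via the Cox-de Boor recursion and combines it with a SiLU base term, following the public KAN formulation; the knot grid, degree, and coefficient values are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def bspline_basis(x, grid, k=3):
    """Values of the degree-k B-spline basis at points x.

    grid: increasing 1-D knot vector of length G; returns (len(x), G-k-1).
    Built with the Cox-de Boor recursion from degree-0 indicator bases.
    """
    x = np.asarray(x, float)[:, None]
    t = np.asarray(grid, float)[None, :]
    B = ((x >= t[:, :-1]) & (x < t[:, 1:])).astype(float)  # degree 0
    for d in range(1, k + 1):
        left = (x - t[:, :-(d + 1)]) / (t[:, d:-1] - t[:, :-(d + 1)]) * B[:, :-1]
        right = (t[:, d + 1:] - x) / (t[:, d + 1:] - t[:, 1:-d]) * B[:, 1:]
        B = left + right
    return B

def kan_edge(x, coeffs, grid, k=3, w_base=1.0, w_spline=1.0):
    """One learnable KAN edge: phi(x) = w_base*silu(x) + w_spline*sum_i c_i*B_i(x)."""
    silu = x / (1.0 + np.exp(-x))  # smooth base activation
    return w_base * silu + w_spline * bspline_basis(x, grid, k) @ coeffs

# Illustrative: 12 uniform knots on [-2, 2] give 8 cubic basis functions.
grid = np.linspace(-2.0, 2.0, 12)
coeffs = np.random.default_rng(0).standard_normal(8)  # learned in a real model
y = kan_edge(np.linspace(-0.8, 0.8, 5), coeffs, grid)
print(y.shape)  # (5,)
```

In the full model, each convolution-kernel weight is replaced by such an edge function, and the spline coefficients are trained jointly with the rest of the network.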
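The cross-modal fusion stage can be sketched as scaled dot-product attention in which one modality's features form the queries and the other's form the keys and values. The single-head NumPy sketch below is a minimal illustration; the frame counts, feature dimensions, and random projection matrices are placeholders for learned parameters, not values from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio, video, d_k=64, seed=0):
    """Audio frames (queries) attend over video frames (keys/values).

    audio: (Ta, Da), video: (Tv, Dv). Returns fused features (Ta, d_k)
    and the (Ta, Tv) attention weights. The projections here are random
    stand-ins for trained weight matrices.
    """
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((audio.shape[1], d_k)) / np.sqrt(audio.shape[1])
    Wk = rng.standard_normal((video.shape[1], d_k)) / np.sqrt(video.shape[1])
    Wv = rng.standard_normal((video.shape[1], d_k)) / np.sqrt(video.shape[1])
    Q, K, V = audio @ Wq, video @ Wk, video @ Wv
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (Ta, Tv) fusion weights
    return weights @ V, weights

rng = np.random.default_rng(1)
audio = rng.standard_normal((5, 128))  # 5 audio frames, 128-dim features
video = rng.standard_normal((7, 256))  # 7 video frames, 256-dim features
fused, weights = cross_modal_attention(audio, video)
print(fused.shape, weights.shape)  # (5, 64) (5, 7)
```

A symmetric pass with video as queries yields video-conditioned features; the two fused streams would then be combined (e.g. concatenated) before classification.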


Last Update: 2025-07-10