Research State and Frontiers of Cocktail Party Problem Based on Deep Audio-visual Models - Computer Technology and Development

Article Info

Title:: Research State and Frontiers of Cocktail Party Problem Based on Deep Audio-visual Models

Author(s):: LU Hui-jun; CAI Dun-bo; HUANG Zhi-guo; HANG Tao; FENG Qing-song; QIAN Ling; China Mobile (Suzhou) Software Technology Co., Ltd., Suzhou 215000, China

Keywords:: cocktail party problem; multi-speaker speech separation; deep learning; deep visual-audio model; visual-audio datasets

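Abstract (translated from the Chinese):: The "cocktail party problem" remains one of the most challenging problems in speech processing; at its core is multi-speaker speech separation. Research on this problem has made considerable progress, but a systematic, concise analysis and summary is still lacking. Focusing on solutions to the "cocktail party problem", this article summarizes the development of multi-speaker speech separation methods in speech processing: (1) it analyzes classic speech separation methods, including spectral subtraction, Wiener filtering, and computational auditory scene analysis; (2) it analyzes the speech separation methods that emerged after deep learning was introduced, including the early deep audio-only methods and the later deep audio-visual methods, with emphasis on the main algorithmic ideas of deep-learning-based audio-visual methods and recent progress in their performance; (3) it summarizes the characteristics of the audio-visual datasets commonly used by current deep audio-visual methods. The article closes by reviewing the current state of deep audio-visual models for the cocktail party problem and the challenges that remain, and by looking ahead to future research directions.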

Abstract:: The "cocktail party problem" is still a very challenging problem in the field of speech processing. The core of the problem is multi-speaker speech separation. Research on this problem has made great progress, but a systematic, concise analysis and summary is lacking. This article focuses on solutions to the "cocktail party problem" and summarizes the development of multi-speaker speech separation methods in speech processing. Firstly, the classic speech separation methods are analyzed briefly, including spectral subtraction, Wiener filtering, and computational auditory scene analysis. Secondly, the deep learning based speech separation methods are analyzed in depth, including deep audio-only methods and deep audio-visual methods, with particular attention to new developments in deep audio-visual models. Thirdly, the commonly used audio-visual datasets are reviewed. Finally, the current state of deep audio-visual models for the cocktail party problem and the remaining challenges are reviewed, and future research directions are discussed.