Speech separation, also known as the cocktail party problem, aims to extract each target speaker's speech signal from a mixture of multiple speakers. In recent years, deep-learning-based single-channel speech separation has attracted wide attention in both academia and industry, and it remains an important direction in speech signal processing. This post surveys recent deep-learning-based single-channel speech separation methods.
1. Frequency-domain speech separation methods
1.1 Deep clustering (DC)
1.2 Permutation invariant training (PIT)
PIT comes in frame-level and utterance-level variants. In theory, frame-level PIT would be the ideal case, but when the separated frames are reassembled into an utterance it cannot determine which frames belong to the same speaker, so an additional speaker-tracing step is required. To avoid this problem, utterance-level PIT (uPIT) takes the whole utterance as the unit: it evaluates every possible assignment between the output streams and the target speakers and optimizes the permutation with the smallest MSE. Its training objective is:
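(The equation below is a reconstruction of the standard uPIT magnitude-MSE objective; the notation is assumed here rather than taken from the original post: \hat{M}_s is the estimated mask of output stream s, |Y| the mixture magnitude spectrogram, |X_s| the source magnitude spectrograms, \mathcal{P} the set of permutations of the S speakers, and B a normalization constant.)

$$
J_{\text{uPIT}} = \frac{1}{B}\,\min_{\pi \in \mathcal{P}} \sum_{s=1}^{S} \left\| \hat{M}_s \odot |Y| - |X_{\pi(s)}| \right\|_F^2
$$

As a minimal sketch of how this objective can be computed (assuming PyTorch and magnitude-spectrogram tensors; upit_mse_loss is an illustrative name, not from any specific toolkit):

```python
import itertools
import torch

def upit_mse_loss(est, ref):
    """Utterance-level PIT loss: try every speaker permutation over the
    whole utterance and keep the permutation with the smallest MSE.

    est, ref: tensors of shape (batch, num_spk, time, freq)
    """
    num_spk = est.shape[1]
    losses = []
    for perm in itertools.permutations(range(num_spk)):
        # MSE between each estimated stream and the permuted references,
        # averaged over speakers, time and frequency for the whole utterance.
        mse = torch.mean((est - ref[:, list(perm)]) ** 2, dim=(1, 2, 3))
        losses.append(mse)
    # Choose the best permutation independently for each utterance in the batch.
    return torch.stack(losses, dim=1).min(dim=1).values.mean()
```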
1.3 Speech separation based on deep embedding features and discriminative learning
1.4 Summary
2. Time-domain speech separation methods
2.1 Conv-TasNet
Encoder: a 1-D convolution replaces the STFT and encodes the time-domain waveform samples directly, with the encoder parameters learned by the network.
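A minimal sketch of such a learned encoder (and the matching decoder), assuming PyTorch; the filter count, kernel size, and stride below are illustrative defaults, not the exact Conv-TasNet configuration:

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Learned analysis transform: a 1-D convolution applied to raw waveform
    samples, playing the role the STFT plays in frequency-domain methods."""
    def __init__(self, num_filters=256, kernel_size=16, stride=8):
        super().__init__()
        self.conv = nn.Conv1d(1, num_filters, kernel_size, stride=stride, bias=False)

    def forward(self, wav):                  # wav: (batch, samples)
        x = wav.unsqueeze(1)                 # (batch, 1, samples)
        return torch.relu(self.conv(x))      # non-negative, spectrogram-like representation

class ConvDecoder(nn.Module):
    """Learned synthesis transform: a transposed 1-D convolution that maps the
    (masked) encoder representation back to a time-domain waveform."""
    def __init__(self, num_filters=256, kernel_size=16, stride=8):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(num_filters, 1, kernel_size, stride=stride, bias=False)

    def forward(self, rep):                  # rep: (batch, num_filters, frames)
        return self.deconv(rep).squeeze(1)   # (batch, samples)
```

In Conv-TasNet, a separation network then estimates a mask over the encoder output for each speaker, and the decoder converts each masked representation back into a waveform.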
2.2 Dual-Path RNN (DPRNN)
2.3 Speech separation based on deep attention fusion features and an end-to-end post-filter (E2EPF)
2.4 Summary
3. Conclusion
References
[1] Hershey J R, Chen Z, Roux J L, et al. Deep clustering: Discriminative embeddings for segmentation and separation[C]. ICASSP, 2016: 31-35.
[2] Yu D, Kolbaek M, Tan Z, et al. Permutation invariant training of deep models for speaker-independent multi-talker speech separation[C]. ICASSP, 2017: 241-245.
[3] Kolbaek M, Yu D, Tan Z, et al. Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2017, 25(10): 1901-1913.
[4] Fan C, Liu B, Tao J, et al. Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features[C]. Interspeech, 2019.
[5] Wang Z, Roux J L, Wang D, et al. End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction[C]. Interspeech, 2018: 2708-2712.
[6] Wang Z, Tan K, Wang D, et al. Deep Learning Based Phase Reconstruction for Speaker Separation: A Trigonometric Perspective[C]. ICASSP, 2019: 71-75.
[7] Liu Y, Wang D. Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2019, 27(12): 2092-2102.
[8] Luo Y, Mesgarani N. TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation[C]. ICASSP, 2018: 696-700.
[9] Luo Y, Mesgarani N. Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2019, 27(8): 1256-1266.
[10] Fan C, Tao J, Liu B, et al. End-to-End Post-filter for Speech Separation with Deep Attention Fusion Features[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2020: 1303-1314.