Speaker Recognition Algorithm Based on ASP-SERes2Net
Abstract: To strengthen feature extraction for speaker recognition and address the low recognition rate in noisy environments, a speaker recognition algorithm based on a residual network, ASP-SERes2Net, is proposed. First, the Mel spectrogram is used as the input to the neural network. Second, the residual block of Res2Net is improved, and a squeeze-and-excitation (SE) attention module is introduced after each residual block. Then, the original average pooling is replaced with attention statistics pooling (ASP). Finally, additive angular margin Softmax (AAM-Softmax) is used to classify speaker identity. In experiments, ASP-SERes2Net was compared with a time delay neural network (TDNN), ResNet34, and Res2Net. ASP-SERes2Net achieved a minimum detection cost function (MinDCF) of 0.0401 and an equal error rate (EER) of 0.52%, clearly better than the other three models. The results show that ASP-SERes2Net performs better and is suitable for speaker recognition in noisy environments.
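The pooling and classification steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no layer sizes or hyperparameters, so the attention parameters `W`, `b`, `v`, the margin of 0.2, and the scale of 30 are all assumptions chosen for illustration. The sketch follows the usual formulation of each technique: ASP computes one scalar score per frame, normalises the scores with a softmax over time, and concatenates the weighted mean and weighted standard deviation; AAM-Softmax adds a margin to the angle between the embedding and the target-class weight before scaling the cosine logits.

```python
import numpy as np


def attentive_statistics_pooling(h, W, b, v):
    """Pool frame-level features h (T, D) into one utterance vector (2*D,).

    W (D, A), b (A,), and v (A,) are hypothetical attention parameters;
    their shapes are assumptions, not values taken from the paper.
    """
    scores = np.tanh(h @ W + b) @ v              # (T,) one score per frame
    scores = scores - scores.max()               # stabilise the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()  # (T,) weights, sum to 1
    mean = alpha @ h                             # (D,) weighted mean
    var = alpha @ (h ** 2) - mean ** 2           # (D,) weighted variance
    std = np.sqrt(np.clip(var, 1e-9, None))      # guard against tiny negatives
    return np.concatenate([mean, std])


def aam_softmax_logits(embedding, class_weights, label, margin=0.2, scale=30.0):
    """Scaled cosine logits with an additive angular margin on the target class.

    margin and scale are assumed illustrative values.
    """
    e = embedding / np.linalg.norm(embedding)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(w @ e, -1.0, 1.0)              # (C,) cosine to each class
    theta = np.arccos(cos[label])                # angle to the target class
    logits = cos.copy()
    logits[label] = np.cos(theta + margin)       # penalise the target angle
    return scale * logits
```

Feeding the returned logits to an ordinary softmax cross-entropy loss gives the AAM-Softmax objective. Because the margin lowers only the target-class logit, the network has to pull embeddings of the same speaker closer together in angle to reach the same loss.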