Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemetic Analysis

Last update: Oct 17, 2022

Overview

TDY-CNN for Text-Independent Speaker Verification

Official implementation of

Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemetic Analysis
by Seong-Hu Kim, Hyeonuk Nam, Yong-Hwa Park @ Human Lab, Mechanical Engineering Department, KAIST

Accepted at ICASSP 2022.

This code is based mainly on VoxCeleb_trainer from the paper 'In defence of metric learning for speaker recognition'.

Temporal Dynamic Convolutional Neural Network (TDY-CNN)

TDY-CNN efficiently applies adaptive convolution depending on time bins by changing the computation order as follows:

$y(f, t) = \sigma (\sum_{k=1}^{K} \pi_{k}(t)y_k(f,t))$

where x and y are the input and output of the TDY-CNN module, which depend on the frequency feature f and the time feature t of time-frequency domain data. Here $y_k(f,t) = W_k * x(f,t) + b_k$ is the output of the k-th basis: the k-th basis kernel $W_k$ is convolved with the input and the k-th bias $b_k$ is added. The results are aggregated using the attention weights $\pi_k(t)$, which depend on the time bin. K is the number of basis kernels, and σ is the ReLU activation function. Each attention weight has a value between 0 and 1, and the weights of all basis kernels at a single time bin sum to 1 because they are produced by a softmax.
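The aggregation above can be sketched in plain Python. This is a minimal illustration, not the code in this repository: the function names (softmax, conv2d_same, tdy_conv) are hypothetical, the attention logits are passed in directly rather than computed from the input, and a single channel with stride 1 and zero padding is assumed.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def conv2d_same(x, kernel, bias):
    """Stride-1, zero-padded 2D convolution (cross-correlation, as in deep
    learning libraries) of x[f][t] with an odd-sized kernel, plus a bias."""
    n_freq, n_time = len(x), len(x[0])
    k_f, k_t = len(kernel), len(kernel[0])
    out = []
    for f in range(n_freq):
        row = []
        for t in range(n_time):
            acc = bias
            for i in range(k_f):
                for j in range(k_t):
                    ff, tt = f + i - k_f // 2, t + j - k_t // 2
                    if 0 <= ff < n_freq and 0 <= tt < n_time:
                        acc += kernel[i][j] * x[ff][tt]
            row.append(acc)
        out.append(row)
    return out


def tdy_conv(x, kernels, biases, logits):
    """y(f, t) = ReLU(sum_k pi_k(t) * y_k(f, t)), where pi(t) is the softmax
    over k of logits[k][t]. Returns (y, pi) with pi indexed as pi[t][k]."""
    n_basis, n_freq, n_time = len(kernels), len(x), len(x[0])
    # Step 1: run each basis kernel over the whole input once.
    basis_out = [conv2d_same(x, kernels[k], biases[k]) for k in range(n_basis)]
    # Step 2: one softmax per time bin, so the K weights sum to 1.
    pi = [softmax([logits[k][t] for k in range(n_basis)]) for t in range(n_time)]
    # Step 3: aggregate the K outputs per time bin, then apply ReLU.
    y = [[max(0.0, sum(pi[t][k] * basis_out[k][f][t] for k in range(n_basis)))
          for t in range(n_time)]
         for f in range(n_freq)]
    return y, pi


# Toy input: 2 frequency bins x 3 time bins, K = 2 basis kernels.
x = [[1.0, -2.0, 3.0], [0.0, 4.0, -6.0]]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
zeros = [[0.0] * 3 for _ in range(3)]
y, pi = tdy_conv(x, [identity, zeros], [0.0, 1.0], [[0.0] * 3, [0.0] * 3])
```

With equal logits, both basis outputs receive weight 0.5 at every time bin, so the toy output is simply ReLU(0.5 · x + 0.5). Making the logits differ across time bins changes the mixture per bin without rerunning the convolutions.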

Requirements and versions used

Python 3.7.10 was used with the following libraries:

pytorch == 1.8.1
torchaudio == 0.8.1
numpy == 1.19.2
scipy == 1.5.3
scikit-learn == 0.23.2

Dataset

We used the VoxCeleb1 and VoxCeleb2 datasets in this paper. You can download them by referring to the VoxCeleb1 and VoxCeleb2 pages.

Training

You can train a model and save it in the exps folder by running:

python trainSpeakerNet.py --model TDy_ResNet34_half --log_input True --encoder_type AVG --trainfunc softmaxproto --save_path exps/TDY_CNN_ResNet34 --nPerSpeaker 2 --batch_size 400

This implementation also supports faster training through distributed training and mixed precision training.

Use the --distributed flag to enable distributed training and the --mixedprec flag to enable mixed precision training.
- GPU indices should be set before training: os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3' in trainSpeakerNet.py.
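As a small sketch of the GPU selection step, the environment variable has to be set before CUDA is initialized in the process, so the safe pattern is to set it before importing the deep learning framework. The helper name select_gpus is hypothetical and is not part of this repository.

```python
import os


def select_gpus(indices):
    """Restrict this process to the given GPU indices (hypothetical helper)."""
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


visible = select_gpus([0, 1, 2, 3])
# Import torch only after this point so the setting takes effect.
```

Inside the process, the selected devices are renumbered from 0, so physical GPUs 2 and 3 selected with select_gpus([2, 3]) appear as devices 0 and 1.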

Results:

Network	# Params	EER (%)	C_det (%)
TDY-VGG-M	71.2M	3.04	0.237
TDY-ResNet-34(×0.25)	13.3M	1.58	0.116
TDY-ResNet-34(×0.5)	51.9M	1.48	0.118
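The EER column reports the equal error rate, the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal pure-Python sketch of that definition, assuming higher scores mean "same speaker" and labels of 1 for target trials and 0 for non-target trials. The function name equal_error_rate and the toy scores are illustrative and are not the evaluation code of this repository.

```python
def equal_error_rate(scores, labels):
    """Return (eer, threshold) for similarity scores.

    A trial is accepted when its score is >= threshold. The EER is taken
    at the candidate threshold where the false acceptance rate (FAR) and
    false rejection rate (FRR) are closest, as the mean of the two rates.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    best = None
    for thr in sorted(set(scores)):
        false_accepts = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= thr)
        false_rejects = sum(1 for s, l in zip(scores, labels) if l == 1 and s < thr)
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]


# Toy trial list: four target and four non-target trials.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
eer, threshold = equal_error_rate(scores, labels)
```

In the toy list, one non-target trial scores above one target trial, so at the chosen threshold one of four non-target trials is accepted and one of four target trials is rejected. This exhaustive scan is quadratic in the number of trials; for full trial lists, a sorted sweep or an ROC-based routine is the usual choice.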

This result is a low-dimensional t-SNE projection of frame-level speaker embeddings of MHRM0 and FDAS1 using (a) the baseline model ResNet-34(×0.25) and (b) TDY-ResNet-34(×0.25). The left column shows embeddings for different speakers, and the right column shows embeddings for different phoneme classes.
Embeddings from TDY-ResNet-34(×0.25) are closely gathered regardless of phoneme group. This shows that the temporal dynamic model extracts consistent speaker information regardless of phonemes.

Pretrained models

Pretrained models are provided in the pretrained_model folder.

For example, you can reproduce an EER of 1.4786% with TDY-ResNet-34(×0.5) by running the following command:

python trainSpeakerNet.py --eval --model TDy_ResNet34_half --log_input True --encoder_type AVG --trainfunc softmaxproto --save_path exps/test --eval_frames 400 --initial_model pretrained_model/pretrained_TDy_ResNet34_half.model

Citation

@article{kim2021tdycnn,
  title={Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemetic Analysis},
  author={Kim, Seong-Hu and Nam, Hyeonuk and Park, Yong-Hwa},
  journal={arXiv preprint arXiv:2110.03213},
  year={2021}
}

Please contact Seong-Hu Kim at [email protected] for any query.


