Evaluation and Benchmarking of Speech Super-resolution Methods

Last update: Dec 20, 2022

Overview

Speech Super-resolution Evaluation and Benchmarking

What this repo does:

Provides a toolbox for the evaluation of speech super-resolution algorithms.
Unifies the evaluation pipeline of speech super-resolution algorithms for an easier comparison between different systems.
Benchmarks speech super-resolution methods (pull requests are welcome) and encourages reproducible research.

I built this repo while writing my paper for INTERSPEECH 2022: Neural Vocoder is All You Need for Speech Super-resolution. The model described in that paper, NVSR, will also be open-sourced here.

Installation

Install via pip:

pip3 install ssr_eval

Please make sure you have already installed sox.
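If you want to verify the sox requirement before running an evaluation, a minimal sketch is shown below. It only checks whether a `sox` executable is on your PATH using the standard library; the helper name `sox_available` is hypothetical and is not part of ssr_eval.

```python
import shutil


def sox_available() -> bool:
    """Return True if a `sox` executable can be found on PATH."""
    return shutil.which("sox") is not None


if not sox_available():
    # Only a hint; how you install sox depends on your platform.
    print("sox was not found on PATH; install it before running the evaluation.")
```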

Quick Example

A basic example: evaluate a system that does nothing:

from ssr_eval import test 
test()

The evaluation result is stored as a JSON file in the ./results directory.
The code automatically handles tasks such as downloading the test sets.
At the bottom of the JSON file you will find an "averaged" field like the one below. This field summarizes the performance of the system.

"averaged": {
        "proc_fft_24000_44100": {
            "lsd": 5.152331300436993,
            "log_sispec": 5.8051057146229095,
            "sispec": 30.23394207533686,
            "ssim": 0.8484425044157442
        }
    }
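To pull one metric out of a result like this, something along the following lines should work. It is a sketch that assumes the JSON layout shown above (a top-level "averaged" object keyed by test setting); the function names are hypothetical, and the result file name under ./results is not assumed here.

```python
import json


def load_result(path):
    """Load an evaluation result JSON file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def metric_by_setting(result, metric="lsd"):
    """Map each test setting in the 'averaged' field to one metric value."""
    return {setting: scores[metric] for setting, scores in result["averaged"].items()}


# Inline sample mirroring the structure shown above.
sample = json.loads(
    '{"averaged": {"proc_fft_24000_44100": {'
    '"lsd": 5.152331300436993, "log_sispec": 5.8051057146229095, '
    '"sispec": 30.23394207533686, "ssim": 0.8484425044157442}}}'
)
lsd_scores = metric_by_setting(sample, metric="lsd")
print(lsd_scores)
```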

Here we report four metrics:

Log spectral distance (LSD).
Log scale invariant spectral distance [1] (log-sispec).
Scale invariant spectral distance [1] (sispec).
Structural similarity (SSIM).

⚠️ LSD is the most widely used metric for super-resolution. The other three metrics are included in case you need them.
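For intuition, here is a sketch of LSD under its commonly used definition: the root-mean-square, over frequency bins, of the log10 ratio of the two power spectrograms, averaged over frames. The STFT parameters (`n_fft`, `hop`), the Hann window, and the `eps` term are assumptions made for illustration; the values computed by this toolbox may use different settings.

```python
import numpy as np


def stft_magnitude(x, n_fft=2048, hop=512):
    """Magnitude spectrogram of a 1-D signal, shape [frames, n_fft // 2 + 1]."""
    x = np.asarray(x, dtype=np.float64)
    if len(x) < n_fft:
        x = np.pad(x, (0, n_fft - len(x)))
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


def log_spectral_distance(reference, estimate, eps=1e-8):
    """LSD between two waveforms that share the same sampling rate."""
    n = min(len(reference), len(estimate))
    ref_power = stft_magnitude(reference[:n]) ** 2
    est_power = stft_magnitude(estimate[:n]) ** 2
    log_ratio = np.log10((ref_power + eps) / (est_power + eps))
    per_frame = np.sqrt(np.mean(log_ratio ** 2, axis=1))  # RMS over frequency
    return float(np.mean(per_frame))  # mean over frames


rng = np.random.default_rng(0)
signal = rng.standard_normal(8192)
print(log_spectral_distance(signal, signal))        # identical signals
print(log_spectral_distance(signal, 0.5 * signal))  # halved amplitude
```

Lower is better: identical signals give 0, and any spectral mismatch, including missing high-frequency content, raises the value.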

Below is the code of test()

from ssr_eval import SSR_Eval_Helper, BasicTestee

# You need to implement a class for the model to be evaluated.
class MyTestee(BasicTestee):
    def __init__(self) -> None:
        super().__init__()

    # You need to implement this function
    def infer(self, x):
        """A testee that does nothing.

        Args:
            x (np.array): [sample,], with model_input_sr sample rate

        Returns:
            np.array: [sample,], with model_output_sr sample rate
        """
        return x

def test():
    testee = MyTestee()
    # Initialize an evaluation helper
    helper = SSR_Eval_Helper(
        testee,
        test_name="unprocessed",  # Test name for storing the result
        input_sr=44100,  # The sampling rate of the input x in the 'infer' function
        output_sr=44100,  # The sampling rate of the output x in the 'infer' function
        evaluation_sr=48000,  # The sampling rate to calculate evaluation metrics.
        setting_fft={
            "cutoff_freq": [
                12000
            ],  # The cutoff frequency of the input x in the 'infer' function
        },
        save_processed_result=True
    )
    # Perform evaluation
    ## Use all eight speakers in the test set for evaluation (limit_test_speaker=-1)
    ## Evaluate on 10 utterances for each speaker (limit_test_nums=10)
    helper.evaluate(limit_test_nums=10, limit_test_speaker=-1)

The code automatically handles tasks such as downloading the test sets. The evaluation result is saved in the ./results directory.
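To evaluate your own system, the same pattern applies with a non-trivial `infer`. The sketch below wraps an arbitrary waveform-to-waveform function as a testee. `FunctionTestee` is a hypothetical helper, not part of ssr_eval, and the fallback `BasicTestee` stand-in exists only so the sketch runs when the package is not installed.

```python
import numpy as np

try:
    from ssr_eval import BasicTestee  # real base class when ssr_eval is installed
except ImportError:
    class BasicTestee:  # stand-in so this sketch runs without ssr_eval
        def __init__(self) -> None:
            pass


class FunctionTestee(BasicTestee):
    """Wrap any callable that maps a 1-D waveform to a 1-D waveform."""

    def __init__(self, fn) -> None:
        super().__init__()
        self.fn = fn

    def infer(self, x):
        # x: np.array of shape [sample,] at the input sampling rate.
        y = np.asarray(self.fn(x), dtype=np.float32)
        if y.ndim != 1:
            raise ValueError("infer must return a 1-D array of shape [sample,]")
        return y


# Example: a "model" that only halves the amplitude.
testee = FunctionTestee(lambda x: 0.5 * x)
output = testee.infer(np.ones(4, dtype=np.float32))
```

The resulting object can be passed to SSR_Eval_Helper in place of MyTestee.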

Baselines

We provide several pretrained baselines. For example, to run the NVSR baseline, you can click the link in the following table for more details.

Table 1. Log-spectral distance (LSD) at different input sampling rates (evaluated at 44.1 kHz).

Method	One for all	Params	2kHz	4kHz	8kHz	12kHz	16kHz	24kHz	32kHz	AVG
NVSR [Pretrained Model]	Yes	99.0M	1.04	0.98	0.91	0.85	0.79	0.70	0.60	0.84
WSRGlow(24kHz→48kHz)	No	229.9M	-	-	-	-	-	0.79	-	-
WSRGlow(12kHz→48kHz)	No	229.9M	-	-	-	0.87	-	-	-	-
WSRGlow(8kHz→48kHz)	No	229.9M	-	-	0.98	-	-	-	-	-
WSRGlow(4kHz→48kHz)	No	229.9M	-	1.12	-	-	-	-	-	-
Nu-wave(24kHz→48kHz)	No	3.0M	-	-	-	-	-	1.22	-	-
Nu-wave(12kHz→48kHz)	No	3.0M	-	-	-	1.40	-	-	-	-
Nu-wave(8kHz→48kHz)	No	3.0M	-	-	1.42	-	-	-	-	-
Nu-wave(4kHz→48kHz)	No	3.0M	-	1.42	-	-	-	-	-	-
Unprocessed	-	-	5.69	5.50	5.15	4.85	4.54	3.84	2.95	4.65

Click a model's link for more details.

Here, "one for all" means the model can handle flexible input sampling rates.
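The AVG column appears to be the arithmetic mean of the seven per-rate LSD values, rounded to two decimals. This is an inference from the numbers in Table 1 rather than something stated in the table, and it can be checked as follows:

```python
# LSD values copied from Table 1 (2, 4, 8, 12, 16, 24 and 32 kHz columns).
nvsr = [1.04, 0.98, 0.91, 0.85, 0.79, 0.70, 0.60]
unprocessed = [5.69, 5.50, 5.15, 4.85, 4.54, 3.84, 2.95]


def mean_rounded(values, digits=2):
    """Arithmetic mean rounded to the given number of decimals."""
    return round(sum(values) / len(values), digits)


print(mean_rounded(nvsr))         # compare with the AVG entry for NVSR
print(mean_rounded(unprocessed))  # compare with the AVG entry for Unprocessed
```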

Features

The following code demonstrates all the options of SSR_Eval_Helper:

testee = MyTestee()
helper = SSR_Eval_Helper(testee, # Your testee object with the 'infer' function implemented
                        test_name="unprocess",  # The name of this test. Used for saving the log file in the ./results directory
                        test_data_root="./your_path/vctk_test", # The directory to store the test data, which will be automatically downloaded.
                        input_sr=44100, # The sampling rate of the input x in the 'infer' function
                        output_sr=44100, # The sampling rate of the output x in the 'infer' function
                        evaluation_sr=48000, # The sampling rate to calculate evaluation metrics. 
                        save_processed_result=False, # If True, save model output in the dataset directory.
                        # (Recommended/default) Use the Fourier method to simulate the low-resolution effect
                        setting_fft = {
                            "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000], # The cutoff frequency of the input x in the 'infer' function
                        }, 
                        # Use lowpass filtering to simulate low-resolution effect. All possible combinations will be evaluated. 
                        setting_lowpass_filtering = {
                            "filter":["cheby","butter","bessel","ellip"], # The type of filter 
                            "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000], 
                            "filter_order": [3,6,9] # Filter orders
                        }, 
                        # Use subsampling method to simulate low-resolution effect
                        setting_subsampling = {
                            "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000],
                        }, 
                        # Use mp3 compression method to simulate low-resolution effect
                        setting_mp3_compression = {
                            "low_kbps": [32, 48, 64, 96, 128],
                        },
)

helper.evaluate(limit_test_nums=10, # For each speaker, only evaluate on 10 utterances.
                limit_test_speaker=-1 # Evaluate on all the speakers. 
                )
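Since all combinations of the lowpass filtering options are evaluated, the number of test settings grows quickly. The sketch below simply enumerates the grid from the example above with `itertools.product`; how the helper iterates over it internally is not shown here and may differ.

```python
from itertools import product

setting_lowpass_filtering = {
    "filter": ["cheby", "butter", "bessel", "ellip"],
    "cutoff_freq": [1000, 2000, 4000, 6000, 8000, 12000, 16000],
    "filter_order": [3, 6, 9],
}

# Every (filter, cutoff_freq, filter_order) triple in the grid.
combinations = list(
    product(
        setting_lowpass_filtering["filter"],
        setting_lowpass_filtering["cutoff_freq"],
        setting_lowpass_filtering["filter_order"],
    )
)
print(len(combinations))  # 4 filters * 7 cutoffs * 3 orders = 84
```

Trimming any of the three lists is the simplest way to shorten an evaluation run.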

⚠️ I recommend that all users use the Fourier method (setting_fft) to simulate the low-resolution effect, since this makes it easier to compare different systems.
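As an illustration of what a Fourier-based low-resolution simulation can look like, the sketch below zeroes every FFT bin above the cutoff frequency and transforms back. It is a minimal sketch of the general idea, not the toolbox's own implementation of setting_fft, which may differ (for example, it might operate on an STFT). The function name `fft_lowpass` is hypothetical.

```python
import numpy as np


def fft_lowpass(x, sr, cutoff_freq):
    """Remove all content above cutoff_freq (Hz) from a 1-D signal sampled at sr."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    spectrum[freqs > cutoff_freq] = 0.0  # hard cutoff in the frequency domain
    return np.fft.irfft(spectrum, n=len(x))


sr = 44100
t = np.arange(sr) / sr                 # one second of audio
low = np.sin(2 * np.pi * 1000 * t)     # 1 kHz tone, below the cutoff
high = np.sin(2 * np.pi * 15000 * t)   # 15 kHz tone, above the cutoff
filtered = fft_lowpass(low + high, sr, cutoff_freq=12000)
```

A hard spectral cutoff like this involves no filter type or order, so every system sees the same degradation for a given cutoff.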

Dataset Details

We build the test sets using VCTK (version 0.92), a multi-speaker English corpus that contains 110 speakers with different accents.

Speakers used for the test set: p360, p361, p362, p363, p364, p374, p376, s5
Of the remaining 100 speakers, p280 and p315 are omitted due to technical issues.
The other 98 speakers are used for training.
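If you prepare training data yourself, the split described above can be written down as a small lookup. This is a sketch based only on the speaker IDs listed here; the helper name `speaker_split` is hypothetical, and `p225` is used purely as an example of a speaker outside both lists.

```python
TEST_SPEAKERS = {"p360", "p361", "p362", "p363", "p364", "p374", "p376", "s5"}
OMITTED_SPEAKERS = {"p280", "p315"}


def speaker_split(speaker_id: str) -> str:
    """Return 'test', 'omitted' or 'train' for a VCTK speaker ID."""
    if speaker_id in TEST_SPEAKERS:
        return "test"
    if speaker_id in OMITTED_SPEAKERS:
        return "omitted"
    return "train"


print(speaker_split("p360"), speaker_split("p280"), speaker_split("p225"))
```

Keeping the test speakers out of your training set is what makes results comparable with the numbers reported in this repo.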

Citation

If you find this repo useful for your research, please consider citing:

@misc{liu2022neural,
      title={Neural Vocoder is All You Need for Speech Super-resolution}, 
      author={Haohe Liu and Woosung Choi and Xubo Liu and Qiuqiang Kong and Qiao Tian and DeLiang Wang},
      year={2022},
      eprint={2203.14941},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

Reference

[1] Liu, Haohe, et al. "VoiceFixer: Toward General Speech Restoration with Neural Vocoder." arXiv preprint arXiv:2109.13731 (2021).

Owner

Haohe Liu (刘濠赫)
