Identify the emotion of multiple speakers in an Audio Segment

Last update: Dec 03, 2022

Overview

MevonAI - Speech Emotion Recognition

About the Project
- Built With
Getting Started
- Installation
- Running the Application
How it Works
Contributing
License
Acknowledgements
FAQ

About The Project

The main aim of the project is to identify the emotions of multiple speakers in call audio, as an application for customer satisfaction feedback in call centres.

Built With

Getting Started

Follow the instructions below to set the project up on your local machine.

Installation

Create a Python virtual environment

sudo apt install python3-venv
mkdir mevonAI
cd mevonAI
python3 -m venv mevon-env
source mevon-env/bin/activate

Clone the repo

git clone https://github.com/SuyashMore/MevonAI-Speech-Emotion-Recognition.git

Install Dependencies

cd MevonAI-Speech-Emotion-Recognition/
cd src/
sudo chmod +x setup.sh
./setup.sh

Running the Application

Add the audio files to analyse, in .wav format, to the src/input/ folder.
Run speech emotion recognition with:

python3 speechEmotionRecognition.py

By default, the application uses the pretrained model available in "src/model/".
Diarized files are stored in the "src/output/" folder.
Predicted emotions are stored in a separate .csv file in the src/ folder.

Here's how it works:

Speaker Diarization

Speaker diarization (also spelled diarisation) is the process of partitioning an input audio stream into homogeneous segments according to speaker identity. It answers the question "who spoke when?" It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. Speaker diarization combines two steps: speaker segmentation, which finds speaker change points in an audio stream, and speaker clustering, which groups speech segments together on the basis of speaker characteristics.
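The two steps can be illustrated with a toy sketch. This is not the diarization code used by this project: it assumes each frame is described by a single made-up number (real systems use multi-dimensional speaker embeddings), and the function names and thresholds are hypothetical.

```python
from statistics import mean


def find_change_points(frames, threshold):
    """Segmentation: indices where the feature jumps by more than threshold."""
    return [i for i in range(1, len(frames))
            if abs(frames[i] - frames[i - 1]) > threshold]


def split_segments(n_frames, change_points):
    """Turn change points into (start, end) pairs, end exclusive."""
    bounds = [0, *change_points, n_frames]
    return list(zip(bounds, bounds[1:]))


def cluster_segments(frames, segments, threshold):
    """Clustering: give each segment the label of the first known speaker
    whose centroid is within threshold, otherwise start a new speaker."""
    centroids = []
    labels = []
    for start, end in segments:
        centre = mean(frames[start:end])
        for speaker, centroid in enumerate(centroids):
            if abs(centre - centroid) <= threshold:
                labels.append(speaker)
                break
        else:
            centroids.append(centre)
            labels.append(len(centroids) - 1)
    return labels


# Made-up per-frame values: two speakers taking turns.
frames = [100, 102, 101, 200, 198, 201, 99, 101]
change_points = find_change_points(frames, threshold=50)
segments = split_segments(len(frames), change_points)
labels = cluster_segments(frames, segments, threshold=50)

for (start, end), speaker in zip(segments, labels):
    print(f"frames {start}-{end - 1}: speaker {speaker}")
```

The first and last segments receive the same label because their average values are close, which is the "who spoke when" result in miniature.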

Feature Extraction

For speech recognition tasks, Mel-frequency cepstral coefficients (MFCCs) have been a widely used feature since they were introduced in the 1980s. Sounds produced by a speaker are filtered by the shape of the vocal tract, and this shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to accurately represent this envelope.

The upper image shows the audio waveform; the lower image shows the corresponding MFCC output, which is the input to the CNN model.
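A possible way to produce an MFCC array with the (13, 216, 1) input shape used by the CNN is sketched below. The sample rate, clip duration and hop length are assumptions, not values stated in this README; they are chosen because 2.5 s at 44.1 kHz with a hop of 512 samples gives exactly 216 frames. The helper names are hypothetical, and `extract_mfcc` needs the third-party librosa package.

```python
SAMPLE_RATE = 44100  # assumption: not stated in this README
DURATION_S = 2.5     # assumption: clip length fed to the model
HOP_LENGTH = 512     # assumption: librosa's default hop length
N_MFCC = 13          # matches the first dimension of the model input


def expected_frames(duration_s, sample_rate, hop_length):
    """Frame count of a centred short-time transform: 1 + n_samples // hop."""
    n_samples = int(duration_s * sample_rate)
    return 1 + n_samples // hop_length


def extract_mfcc(path):
    """Load one clip and return an MFCC array shaped (13, frames, 1)."""
    # Imported here so the arithmetic above runs without librosa installed.
    import librosa

    n_samples = int(DURATION_S * SAMPLE_RATE)
    signal, _ = librosa.load(path, sr=SAMPLE_RATE, duration=DURATION_S)
    # Zero-pad clips shorter than DURATION_S so every array has the same width.
    signal = librosa.util.fix_length(signal, size=n_samples)
    mfcc = librosa.feature.mfcc(y=signal, sr=SAMPLE_RATE,
                                n_mfcc=N_MFCC, hop_length=HOP_LENGTH)
    return mfcc[:, :, None]  # add the channel axis expected by Conv2D


print(expected_frames(DURATION_S, SAMPLE_RATE, HOP_LENGTH))
```

If the project uses a different sample rate or clip length, the frame count changes accordingly, so check the training notebook before relying on these values.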

CNN Model

A Convolutional Neural Network recognizes emotion from the MFCCs, using the following architecture:

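# Imports assume the Keras API bundled with TensorFlow 2.x
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D,
                                     Dense, Flatten)

model = Sequential()

# Input layer: 13 MFCC coefficients x 216 frames x 1 channel
model.add(Conv2D(32, 5, strides=2, padding='same',
                 input_shape=(13, 216, 1)))
model.add(Activation('relu'))
model.add(BatchNormalization())

# Hidden layer 1
model.add(Conv2D(64, 5, strides=2, padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization())

# Hidden layer 2
model.add(Conv2D(64, 5, strides=2, padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization())

# Flatten the convolutional feature maps
model.add(Flatten())

# Output layer: softmax over 7 emotion classes
model.add(Dense(7))
model.add(Activation('softmax'))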
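To see how the feature maps shrink through this architecture, the arithmetic can be traced by hand. The sketch below relies only on the rule that a convolution with padding='same' and stride s produces ceil(n / s) outputs along each spatial axis; the helper names are hypothetical.

```python
import math


def same_conv_output(size, stride):
    """Output length of a 'same'-padded convolution along one axis."""
    return math.ceil(size / stride)


def trace_shapes(height, width, filters, stride=2):
    """Return the (height, width, channels) shape after each Conv2D layer."""
    shapes = []
    for n_filters in filters:
        height = same_conv_output(height, stride)
        width = same_conv_output(width, stride)
        shapes.append((height, width, n_filters))
    return shapes


shapes = trace_shapes(13, 216, filters=[32, 64, 64])
flat_units = math.prod(shapes[-1])  # size of the Flatten output
dense_params = flat_units * 7 + 7   # weights plus biases of Dense(7)

for layer, shape in enumerate(shapes, start=1):
    print(f"conv {layer}: {shape}")
print("flatten:", flat_units)
print("dense parameters:", dense_params)
```

The 13 x 216 input is reduced to 2 x 27 x 64 before flattening, so the final Dense layer sees 3456 values.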

Training the Model

Download the RAVDESS emotional speech audio dataset.
The 2DConvolution.ipynb notebook is used to train the model.
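RAVDESS encodes the label of each clip in its filename as seven two-digit fields (modality, vocal channel, emotion, intensity, statement, repetition, actor). The following sketch shows one way to read the emotion label from a filename; the function name is hypothetical and this is not necessarily how the notebook does it. Note that RAVDESS defines eight emotions while the model above has seven outputs; this README does not say how the labels are mapped, so that step is left out.

```python
from pathlib import Path

# Third filename field, as documented by the RAVDESS dataset.
RAVDESS_EMOTIONS = {
    "01": "neutral",
    "02": "calm",
    "03": "happy",
    "04": "sad",
    "05": "angry",
    "06": "fearful",
    "07": "disgust",
    "08": "surprised",
}


def parse_ravdess_name(path):
    """Extract emotion, intensity and actor number from a RAVDESS filename."""
    fields = Path(path).stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"not a RAVDESS filename: {path}")
    return {
        "emotion": RAVDESS_EMOTIONS[fields[2]],
        "intensity": "strong" if fields[3] == "02" else "normal",
        "actor": int(fields[6]),
    }


info = parse_ravdess_name("Actor_12/03-01-06-01-02-01-12.wav")
print(info)
```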

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgements

FAQ

How do I do something specific that is not covered here?
- Create an issue in this repo, and we will respond to the query.

Owner

Suyash More
