Mememoji - A facial expression classification system that recognizes 6 basic emotions: happy, sad, surprise, fear, anger and neutral.

Related tags

Deep Learningmememoji
Overview

alt text

a project built with deep convolutional neural network and ❤️

Table of Contents

  1. Motivation
  2. The Database
  3. The Model
  4. Model Validation
  5. The Apps
  6. About the Author
  7. References

1 Motivation

Human facial expressions can be easily classified into 7 basic emotions: happy, sad, surprise, fear, anger, disgust, and neutral. Our facial emotions are expressed through activation of specific sets of facial muscles. These sometimes subtle, yet complex, signals in an expression often contain an abundant amount of information about our state of mind. Through facial emotion recognition, we are able to measure the effects that content and services have on the audience/users through an easy and low-cost procedure. For example, retailers may use these metrics to evaluate customer interest. Healthcare providers can provide better service by using additional information about patients' emotional state during treatment. Entertainment producers can monitor audience engagement in events to consistently create desired content.

“2016 is the year when machines learn to grasp human emotions” --Andrew Moore, the dean of computer science at Carnegie Mellon.

Humans are well-trained in reading the emotions of others, in fact, at just 14 months old, babies can already tell the difference between happy and sad. But can computers do a better job than us in accessing emotional states? To answer the question, I designed a deep learning neural network that gives machines the ability to make inferences about our emotional states. In other words, I give them eyes to see what we can see.

2 The Database

The dataset I used for training the model is from a Kaggle Facial Expression Recognition Challenge a few years back (FER2013). It comprises a total of 35887 pre-cropped, 48-by-48-pixel grayscale images of faces each labeled with one of the 7 emotion classes: anger, disgust, fear, happiness, sadness, surprise, and neutral.

Figure 1. An overview of FER2013.

As I was exploring the dataset, I discovered an imbalance of the “disgust” class (only 113 samples) compared to many samples of other classes. I decided to merge disgust into anger given that they both represent similar sentiment. To prevent data leakage, I built a data generator fer2013datagen.py that can easily separate training and hold-out set to different files. I used 28709 labeled faces as the training set and held out the remaining two test sets (3589/set) for after-training validation. The resulting is a 6-class, balanced dataset, shown in Figure 2, that contains angry, fear, happy, sad, surprise, and neutral. Now we’re ready to train.

alt text

Figure 2. Training and validation data distribution.

3 The Model

Figure 3. Mr. Bean, the model for the model.

Deep learning is a popular technique used in computer vision. I chose convolutional neural network (CNN) layers as building blocks to create my model architecture. CNNs are known to imitate how the human brain works when analyzing visuals. I will use a picture of Mr. Bean as an example to explain how images are fed into the model, because who doesn’t love Mr. Bean?

A typical architecture of a convolutional neural network will contain an input layer, some convolutional layers, some dense layers (aka. fully-connected layers), and an output layer (Figure 4). These are linearly stacked layers ordered in sequence. In Keras, the model is created as Sequential() and more layers are added to build architecture.

Figure 4. Facial Emotion Recognition CNN Architecture (modification from Eindhoven University of Technology-PARsE).

###3.1 Input Layer

  • The input layer has pre-determined, fixed dimensions, so the image must be pre-processed before it can be fed into the layer. I used OpenCV, a computer vision library, for face detection in the image. The haar-cascade_frontalface_default.xml in OpenCV contains pre-trained filters and uses Adaboost to quickly find and crop the face.
  • The cropped face is then converted into grayscale using cv2.cvtColor and resized to 48-by-48 pixels with cv2.resize. This step greatly reduces the dimensions compared to the original RGB format with three color dimensions (3, 48, 48). The pipeline ensures every image can be fed into the input layer as a (1, 48, 48) numpy array.

###3.2 Convolutional Layers

  • The numpy array gets passed into the Convolution2D layer where I specify the number of filters as one of the hyperparameters. The set of filters(aka. kernel) are unique with randomly generated weights. Each filter, (3, 3) receptive field, slides across the original image with shared weights to create a feature map.
  • Convolution generates feature maps that represent how pixel values are enhanced, for example, edge and pattern detection. In Figure 5, a feature map is created by applying filter 1 across the entire image. Other filters are applied one after another creating a set of feature maps.

Figure 5. Convolution and 1st max-pooling used in the network

  • Pooling is a dimension reduction technique usually applied after one or several convolutional layers. It is an important step when building CNNs as adding more convolutional layers can greatly affect computational time. I used a popular pooling method called MaxPooling2D that uses (2, 2) windows across the feature map only keeping the maximum pixel value. The pooled pixels form an image with dimentions reduced by 4.

###3.3 Dense Layers

  • The dense layer (aka fully connected layers), is inspired by the way neurons transmit signals through the brain. It takes a large number of input features and transform features through layers connected with trainable weights.

Figure 6. Neural network during training: Forward propagation (left) to Backward propagation (right).

  • These weights are trained by forward propagation of training data then backward propagation of its errors. Back propagation starts from evaluating the difference between prediction and true value, and back calculates the weight adjustment needed to every layer before. We can control the training speed and the complexity of the architecture by tuning the hyper-parameters, such as learning rate and network density. As we feed in more data, the network is able to gradually make adjustments until errors are minimized.
  • Essentially, the more layers/nodes we add to the network the better it can pick up signals. As good as it may sound, the model also becomes increasingly prone to overfitting the training data. One method to prevent overfitting and generalize on unseen data is to apply dropout. Dropout randomly selects a portion (usually less than 50%) of nodes to set their weights to zero during training. This method can effectively control the model's sensitivity to noise during training while maintaining the necessary complexity of the architecture.

###3.4 Output Layer

  • Instead of using sigmoid activation function, I used softmax at the output layer. This output presents itself as a probability for each emotion class.
  • Therefore, the model is able to show the detail probability composition of the emotions in the face. As later on, you will see that it is not efficient to classify human facial expression as only a single emotion. Our expressions are usually much complex and contain a mix of emotions that could be used to accurately describe a particular expression.

It is important to note that there is no specific formula to building a neural network that would guarantee to work well. Different problems would require different network architecture and a lot of trail and errors to produce desirable validation accuracy. This is the reason why neural nets are often perceived as "black box algorithms." But don't be discouraged. Time is not wasted when experimenting to find the best model and you will gain valuable experience.

###3.5 Deep Learning I built a simple CNN with an input, three convolution layers, one dense layer, and an output layer to start with. As it turned out, the simple model preformed poorly. The low accuracy of 0.1500 showed that it was merely random guessing one of the six emotions. The simple net architecture failed to pick up the subtle details in facial expressions. This could only mean one thing...

This is where deep learning comes in. Given the pattern complexity of facial expressions, it is necessary to build with a deeper architecture in order to identify subtle signals. So I fiddled combinations of three components to increase model complexity:

  • number and configuraton of convolutional layers
  • number and configuration of dense layers
  • dropout percentage in dense layers

Models with various combinations were trained and evaluated using GPU computing g2.2xlarge on Amazon Web Services (AWS). This greatly reduced training time and increased efficiency in tuning the model (Pro tip: use automation script and tmux detach to train on AWS EC2 instance over night). In the end, my final net architecture was 9 layers deep in convolution with one max-pooling after every three convolution layers as seen in Figure 7.

Figure 7. Final model CNN architecture.

4 Model Validation

###4.1 Performance As it turns out, the final CNN had a validation accuracy of 58%. This actually makes a lot of sense. Because our expressions usually consist a combination of emotions, and only using one label to represent an expression can be hard. In this case, when the model predicts incorrectly, the correct label is often the second most likely emotion as seen in Figure 8 (examples with light blue labels).

Figure 8. Prediction of 24 example faces randomly selected from test set.

###4.2 Analysis

Figure 9. Confusion matrix for true and prediction emotion counts.

Let's take a closer look at predictions for individual emotions. Figure 9 is the confusion matrix for the model predictions on the test set. The matrix gives the counts of emotion predictions and some insights to the performance of the multi-class classification model:

  • The model performs really well on classifying positive emotions resulting in relatively high precision scores for happy and surprised. Happy has a precision of 76.7% which could be explained by having the most examples (~7000) in the training set. Interestingly, surprise has a precision of 69.3% having the least examples in the training set. There must be very strong signals in the suprise expressions.
  • Model performance seems weaker across negative emotions on average. In particularly, the emotion sad has a low precision of only 39.7%. The model frequently misclassified angry, fear and neutral as sad. In addition, it is most confused when predicting sad and neutral faces because these two emotions are probably the least expressive (excluding crying faces).
  • Frequency of prediction that misclassified by less than 3 ranks.

Figure 10. Correct predictions on 2nd and 3rd highest probable emotion.

###4.3 Computer Vision As a result, the feature maps become increasingly abstract down the pipeline when more pooling layers are added. Figure 11 and 12 gives an idea of what the machine sees in feature maps after 2nd and 3rd max-pooling. Deep nets are beautiful!.

Code for analysis and visualiation of the inter-layer outputs in the convolutional neural net: https://github.com/JostineHo/mememoji/blob/master/data_visualization.ipynb

Figure 11. CNN (64-filter) feature maps after 2nd layer of max-pooling.

Figure 12. CNN (128-filter) feature maps after 3nd layer of max-pooling.

5 The Apps

Figure 13. Web application and REST API.

###5.1 REST API I built a REST API that finds human faces within images and make prediction about each facial emotion in POST /v1.0.0/predict. You can paste the url of an image in image_url or drag-and-drop an image file to image_buf . In addition, you have the option to have the API return the image with annotated faces and cropped thumbnail of each face in base64 by using the dropdown menu in annotate_image and crop_image. The API returns the probabilities of emotions for each face (indexed) and an unique ID for each image in json format. MongoDB is installed to store input into facial expression database on EC2 for future training.

POST /v1.0.0/feedback can be used to collect user feedback from the web app for incorrect predictions. Developers have to option to send back user feedback (true emotion) by providing the unique ID and face index. The built-in MongoDB will use unique ID image_id to find the document and face_index to append the true emotion as feedback in the database.

Source Code: https://github.com/JostineHo/mememoji_api

Demo: mememoji.rhobota.com

###5.2 Interactive Web App Mememoji is an interactive emotion recognition system that detects emotions based on facial expressions. This app uses the REST API to predict the compositions of the emotions expressed by users. Users have the option to paste image url, upload your own image, or simply turn on your webcam to interact with the app. Users can also provide feedback by selecting the correct emotion from a dropdown menu should the convolutional neural network predicts incorrectly. This will serve as a training sample and help improve the algorithm in the future.

Special thanks to Chris Impicciche, Web Development Fellow at Galvanize, who made it possible for online demo of the technology.

Source Code: FaceX

Demo: mememoji.me

###5.3 Real-Time Prediction via Webcam In addition, I built a real-time facial emotion analyzer that can be accessed through a webcam. real-time.py overlays a meme face matching the emotion expressed in real-time. live-plotting outputs a live-recording graph that responds to the changes in facial expressions. The program uses OpenCV for face detection and the trained neural network for live prediction.

Source Code: https://github.com/JostineHo/real-time_emotion_analyzer

6 About the Author

Jostine Ho is a data scientist who loves building intelligent applications and exploring the exciting possibilities using deep learning. She is interested in computer vision and automation that creates innovative solutions to real-world problems. She holds a masters degree in Petroleum & Geosystems Engineering at The University of Texas at Austin. You can reach her on LinkedIn.

7 References

  1. "Dataset: Facial Emotion Recognition (FER2013)" ICML 2013 Workshop in Challenges in Representation Learning, June 21 in Atlanta, GA.

  2. "Andrej Karpathy's Convolutional Neural Networks (CNNs / ConvNets)" Convolutional Neural Networks for Visual Recognition (CS231n), Stanford University.

  3. Srivastava et al., 2014. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research, 15:1929-1958.

  4. Duncan, D., Shine, G., English, C., 2016. "Report: Facial Emotion Recognition in Real-time" Convolutional Neural Networks for Visual Recognition (CS231n), Stanford University.

Owner
Jostine Ho
Data scientist passionate about deep learning and behavior analytics. Knocking down data silos to learn the WHY and HOW that enables building products we love.
Jostine Ho
An end-to-end library for editing and rendering motion of 3D characters with deep learning [SIGGRAPH 2020]

Deep-motion-editing This library provides fundamental and advanced functions to work with 3D character animation in deep learning with Pytorch. The co

1.2k Dec 29, 2022
PyTorch Code for the paper "VSE++: Improving Visual-Semantic Embeddings with Hard Negatives"

Improving Visual-Semantic Embeddings with Hard Negatives Code for the image-caption retrieval methods from VSE++: Improving Visual-Semantic Embeddings

Fartash Faghri 441 Dec 05, 2022
[NeurIPS 2021] "Delayed Propagation Transformer: A Universal Computation Engine towards Practical Control in Cyber-Physical Systems"

Delayed Propagation Transformer: A Universal Computation Engine towards Practical Control in Cyber-Physical Systems Introduction Multi-agent control i

VITA 6 May 05, 2022
How the Deep Q-learning method works and discuss the new ideas that makes the algorithm work

Deep Q-Learning Recommend papers The first step is to read and understand the method that you will implement. It was first introduced in a 2013 paper

1 Jan 25, 2022
MaskTrackRCNN for video instance segmentation based on mmdetection

MaskTrackRCNN for video instance segmentation Introduction This repo serves as the official code release of the MaskTrackRCNN model for video instance

411 Jan 05, 2023
Simple Text-Generator with OpenAI gpt-2 Pytorch Implementation

GPT2-Pytorch with Text-Generator Better Language Models and Their Implications Our model, called GPT-2 (a successor to GPT), was trained simply to pre

Tae-Hwan Jung 775 Jan 08, 2023
A face dataset generator with out-of-focus blur detection and dynamic interval adjustment.

A face dataset generator with out-of-focus blur detection and dynamic interval adjustment.

Yutian Liu 2 Jan 29, 2022
Molecular AutoEncoder in PyTorch

MolEncoder Molecular AutoEncoder in PyTorch Install $ git clone https://github.com/cxhernandez/molencoder.git && cd molencoder $ python setup.py insta

Carlos Hernández 80 Dec 05, 2022
The PyTorch implementation of DiscoBox: Weakly Supervised Instance Segmentation and Semantic Correspondence from Box Supervision.

DiscoBox: Weakly Supervised Instance Segmentation and Semantic Correspondence from Box Supervision The PyTorch implementation of DiscoBox: Weakly Supe

Shiyi Lan 1 Oct 23, 2021
This project uses reinforcement learning on stock market and agent tries to learn trading. The goal is to check if the agent can learn to read tape. The project is dedicated to hero in life great Jesse Livermore.

Reinforcement-trading This project uses Reinforcement learning on stock market and agent tries to learn trading. The goal is to check if the agent can

Deepender Singla 1.4k Dec 22, 2022
(Py)TOD: Tensor-based Outlier Detection, A General GPU-Accelerated Framework

(Py)TOD: Tensor-based Outlier Detection, A General GPU-Accelerated Framework Background: Outlier detection (OD) is a key data mining task for identify

Yue Zhao 127 Jan 05, 2023
Multi-Modal Machine Learning toolkit based on PaddlePaddle.

简体中文 | English PaddleMM 简介 飞桨多模态学习工具包 PaddleMM 旨在于提供模态联合学习和跨模态学习算法模型库,为处理图片文本等多模态数据提供高效的解决方案,助力多模态学习应用落地。 近期更新 2022.1.5 发布 PaddleMM 初始版本 v1.0 特性 丰富的任务

njustkmg 520 Dec 28, 2022
Official Implement of CVPR 2021 paper “Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting”

RGBT Crowd Counting Lingbo Liu, Jiaqi Chen, Hefeng Wu, Guanbin Li, Chenglong Li, Liang Lin. "Cross-Modal Collaborative Representation Learning and a L

37 Dec 08, 2022
A JAX-based research framework for writing differentiable numerical simulators with arbitrary discretizations

jaxdf - JAX-based Discretization Framework Overview | Example | Installation | Documentation ⚠️ This library is still in development. Breaking changes

UCL Biomedical Ultrasound Group 65 Dec 23, 2022
implementation for paper "ShelfNet for fast semantic segmentation"

ShelfNet-lightweight for paper (ShelfNet for fast semantic segmentation) This repo contains implementation of ShelfNet-lightweight models for real-tim

Juntang Zhuang 252 Sep 16, 2022
A DeepStack custom model for detecting common objects in dark/night images and videos.

DeepStack_ExDark This repository provides a custom DeepStack model that has been trained and can be used for creating a new object detection API for d

MOSES OLAFENWA 98 Dec 24, 2022
Implementations of the algorithms in the paper Approximative Algorithms for Multi-Marginal Optimal Transport and Free-Support Wasserstein Barycenters

Implementations of the algorithms in the paper Approximative Algorithms for Multi-Marginal Optimal Transport and Free-Support Wasserstein Barycenters

Johannes von Lindheim 3 Oct 29, 2022
A PyTorch library and evaluation platform for end-to-end compression research

CompressAI CompressAI (compress-ay) is a PyTorch library and evaluation platform for end-to-end compression research. CompressAI currently provides: c

InterDigital 680 Jan 06, 2023
Official PyTorch implementation of MX-Font (Multiple Heads are Better than One: Few-shot Font Generation with Multiple Localized Experts)

Introduction Pytorch implementation of Multiple Heads are Better than One: Few-shot Font Generation with Multiple Localized Expert. | paper Song Park1

Clova AI Research 97 Dec 23, 2022
PyGCL: A PyTorch Library for Graph Contrastive Learning

PyGCL is a PyTorch-based open-source Graph Contrastive Learning (GCL) library, which features modularized GCL components from published papers, standa

PyGCL 588 Dec 31, 2022