A practical implementation of a deep neural network for facial emotion recognition

People's emotions are rarely put into words; far more often they are expressed through other cues. The key to intuiting another's feelings lies in the ability to read nonverbal channels: tone of voice, gesture, facial expression and the like. Facial expressions are used by humans to convey various types of meaning in a variety of contexts. The range of meanings extends from basic, probably innate, social-emotional concepts such as "surprise" to complex, culture-specific concepts such as "neglect". The range of contexts in which humans use facial expressions extends from responses to events in the environment to specific linguistic constructs in sign languages. In this paper, we use an artificial neural network to classify each image into one of seven facial emotion classes. The model is trained on the FER+ image database, which we assume is large and diverse enough to indicate which model parameters are generally preferable. The overall results show that the CNN model can efficiently classify images according to emotional state, even in real time.


Introduction
Human faces are arguably the most important things we see. We are quick to detect them in any scene, and they command our attention. Faces express a wealth of important social information, such as whether another person is angry or scared, which in turn allows us to prepare for fight or flight. Does this mean facial expressions are universal? It's a question scientists have debated for half a century, and it remains without a definitive answer. Emotions are essential to our lives. They allow us to improve communication between individuals, to ensure a better understanding of the message conveyed and to adapt to a given situation.
Emotion recognition is an important aspect of affective computing, one of whose objectives is the study and development of behavioral and emotional interactions between humans and machines. It can be useful, for example, to verify that the person standing in front of the camera is not just a two-dimensional representation [1].
It is also important because it allows the observer to infer the emotional states and intentions of others, to anticipate their actions, and to regulate their own behavior accordingly. Thus, the ability to recognize emotions influences one's ability to adapt to a given situation. Facial recognition of emotions is linked to several domains, for example:
- Marketing: applications to measure customer satisfaction and to predict which products interest customers.
- Security: stress detection.
- Medicine: help in detecting certain psychological diseases.
- Human-Computer Interaction: support robots.
- Education: distance learning.

History introducing emotions
The history of emotions is based on the research of several scientists, starting with Darwin, who in 1872 formulated one of the first hypotheses to influence research on emotions [2]. He was followed, from the 1960s onward, by several scientists such as Paul Ekman [3], Carroll Izard, Alan Fridlund and Silvan Tomkins, who sought to demonstrate the universality of certain emotions fundamental to human beings. In the 20th century, the history of emotions took a decisive turn and grew steadily thanks to the studies of Lucien Febvre, who was one of its precursors in France.
Emotions have been classified into two categories: simple and complex. According to Paul Ekman (1984), the simple emotions are happiness, sadness, anger, fear, surprise and disgust; complex emotions are combinations of simple emotions [4]. The images are represented as follows:

Labels distribution and statistics
For a better understanding of the FER+ data, we display a histogram showing the distribution of the emotion classes. For detailed statistics, we used the `describe` method of the pandas library; the result is shown in the table below. The images in the database vary in many parameters that directly affect recognition accuracy and performance: rotation, brightness and illumination change even across images of the same person. To address this, the face images are normalized, which includes detection, de-noising and other preprocessing such as rotation correction. Brightness and contrast variations increase the complexity of the problem.
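As a minimal sketch of this inspection step, the following uses a small hypothetical label series in place of the real FER+ annotations; `value_counts` gives the class distribution for the histogram, and `describe` gives the summary statistics mentioned above:

```python
import pandas as pd

# Hypothetical stand-in for the FER+ per-image emotion labels.
labels = pd.Series(["happiness", "neutral", "happiness", "surprise",
                    "sadness", "anger", "happiness", "neutral"],
                   name="emotion")

counts = labels.value_counts()   # distribution of the emotion classes
stats = counts.describe()        # summary statistics, as in the paper's table
print(counts)
print(stats)
# counts.plot(kind="bar")        # the histogram (requires matplotlib)
```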
In image processing, normalization is a process that changes the range of pixel intensity values. Applications include photographs with poor contrast due to glare, for example. Normalization is sometimes called contrast stretching or histogram stretching. In more general fields of data processing, such as digital signal processing, it is referred to as dynamic range expansion [6].
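Contrast stretching of this kind can be sketched in a few lines of NumPy; the intensity values and the target range `[0, 1]` below are illustrative choices, not values from the paper:

```python
import numpy as np

def normalize(img, new_min=0.0, new_max=1.0):
    """Min-max normalization (contrast stretching) of pixel intensities."""
    img = img.astype(np.float32)
    old_min, old_max = img.min(), img.max()
    if old_max == old_min:          # flat image: nothing to stretch
        return np.full_like(img, new_min)
    return (img - old_min) * (new_max - new_min) / (old_max - old_min) + new_min

pixels = np.array([[50, 100], [150, 200]], dtype=np.uint8)
stretched = normalize(pixels)
print(stretched)  # intensities spread across the full [0, 1] range
```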

Image Cropping
The original face images contain background information that is not important and could make the output less accurate. The cropping region also aims to remove facial parts that do not contribute to the expression.
Cropping is the removal of unwanted outer areas from a photographic or illustrated image. The process usually consists of removing some of the peripheral areas of an image to discard extraneous content, to improve its framing, to change the aspect ratio, or to accentuate or isolate the subject from its background.
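A cropping step of this kind can be sketched as follows; the bounding box is assumed to come from a face detector (the `detect_and_crop` helper shows one common choice, OpenCV's Haar cascade, though the paper does not specify which detector it uses):

```python
import numpy as np

def crop_face(img, box, margin=0):
    """Crop a detected face region (x, y, w, h), clamped to the image bounds."""
    x, y, w, h = box
    y0, y1 = max(0, y - margin), min(img.shape[0], y + h + margin)
    x0, x1 = max(0, x - margin), min(img.shape[1], x + w + margin)
    return img[y0:y1, x0:x1]

def detect_and_crop(gray):
    """Detect faces with OpenCV's Haar cascade (assumed available) and crop them."""
    import cv2  # local import: only needed when actually detecting
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [crop_face(gray, tuple(f)) for f in faces]

frame = np.zeros((120, 160), dtype=np.uint8)   # dummy grayscale frame
face = crop_face(frame, (40, 20, 48, 48))
print(face.shape)  # (48, 48)
```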

Organization and structure
The structure of the paper is as follows:
- First, we process and normalize the training and validation images.
- Then we train the model, verify that the accuracy is good, and make predictions on images taken by the webcam in real time.
- Finally, we analyze the errors and report the performance, restarting the training if it is not satisfactory.
A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm that can take an input image, assign importance (learnable weights and biases) to various aspects/objects in the image, and differentiate one from another. The pre-processing required by a ConvNet is much lower than for other classification algorithms. While in primitive methods filters are hand-engineered, with enough training ConvNets can learn these filters/characteristics. The architecture of a ConvNet is analogous to the connectivity pattern of neurons in the human brain and was inspired by the organization of the visual cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. A collection of such fields overlaps to cover the entire visual area.

Technical information of the model
The parameters of the model are described in the following table: Here are some lines of code to illustrate the model.

Model explanation
This work consists of using a simple Convolutional Neural Network with a linear one-vs-all SVM at the top. The network receives a 48×48 image as input and returns the confidence of each expression as output.
The first layer of the CNN is a convolution layer that applies a 3×3 convolution kernel and outputs a feature map. This layer is followed by a subsampling layer that uses max-pooling with a 2×2 kernel. The activation function [10] used after each convolution is ReLU. Once the filters are applied, the resulting vector is passed through two layers:
- a hidden layer with 1024 neurons;
- an output layer with 7 neurons, one per class, with a softmax function.
The optimization method used is Adam [11]. Adam combines the concept of momentum with that of AdaGrad [12].
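For context (the paper does not state the update rule itself), Adam maintains exponentially decaying moving averages of the gradient $g_t$ and its square, applies a bias correction, and updates the parameters $\theta$:

```latex
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
```
```latex
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```

The momentum-like term $m_t$ plays the role of classical momentum, while the per-parameter scaling by $\sqrt{\hat{v}_t}$ is the AdaGrad-style adaptive step size mentioned above.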
For classification problems using deep learning techniques, it is standard to use a softmax (1-of-K) encoding at the top. Here we have 7 possible classes, so the softmax layer has 7 nodes, denoted π_i with i = 1, …, 7, where the π_i specify a discrete probability distribution [13].
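Written out (with $a_i$ denoting the pre-activation of output node $i$, a notational assumption), the softmax is:

```latex
\pi_i = \frac{e^{a_i}}{\sum_{k=1}^{7} e^{a_k}}, \qquad i = 1, \ldots, 7,
\qquad \sum_{i=1}^{7} \pi_i = 1
```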
Dropout is a technique for reducing overfitting during model training. The term "dropout" refers to the random removal of neurons in the layers of a Deep Learning model [14].
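The architecture described above (3×3 convolutions with ReLU, 2×2 max-pooling, a 1024-neuron hidden layer, dropout, and a 7-way softmax trained with Adam) can be sketched in Keras. The filter counts, the number of convolution blocks, and the dropout rate below are assumptions, as the text does not specify them:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(num_classes=7):
    """Sketch of the CNN described in the paper (hyperparameters assumed)."""
    model = models.Sequential([
        layers.Input(shape=(48, 48, 1)),                        # 48x48 grayscale input
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),                            # 2x2 max-pooling
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),                  # 1024-neuron hidden layer
        layers.Dropout(0.5),                                    # dropout rate assumed
        layers.Dense(num_classes, activation="softmax"),        # 7-way softmax output
    ])
    model.compile(optimizer="adam",                             # Adam, as in the paper
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```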

Result predictions
Here are some results of model predictions on real-time images captured with the camera. The prediction displayed in Figure 10 is an image taken in real time by a function that uses the OpenCV library to launch the webcam, capture the face, apply the prediction and finally take a picture.
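The real-time loop described above can be sketched as follows. The label list, the use of a Haar cascade detector, and the NumPy-only resize in `preprocess` are assumptions for illustration; `model` stands for the trained CNN and is not defined here. Calling `run_webcam_demo(model)` starts the demo (press 'q' to quit):

```python
import numpy as np

def preprocess(face, size=48):
    """Resize a cropped grayscale face to the network input and scale to [0, 1].
    Nearest-neighbour resize in plain NumPy to keep the sketch dependency-free."""
    h, w = face.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = face[rows][:, cols].astype(np.float32) / 255.0
    return resized.reshape(1, size, size, 1)   # batch of one grayscale image

# Assumed label order; the actual mapping depends on how the model was trained.
EMOTIONS = ["neutral", "happiness", "surprise", "sadness",
            "anger", "disgust", "fear"]

def run_webcam_demo(model):
    """Capture frames, detect the face, and overlay the predicted emotion."""
    import cv2  # local import: only needed for the live demo
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(0)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
            probs = model.predict(preprocess(gray[y:y + h, x:x + w]))[0]
            label = EMOTIONS[int(np.argmax(probs))]
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(frame, label, (x, y - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
        cv2.imshow("emotion", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```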

Loss and accuracy analyses
The following figure shows the evolution of the loss function and the accuracy as a function of the number of epochs. The model accuracy is 62% on the training data and 85% on the test data.

Confusion matrix
To analyze the performance and predictions of the model, a confusion matrix, a standard technique for summarizing the performance of a classification algorithm, is displayed in Figure 10.

Conclusion
This paper presents a real-time system to detect, recognize and classify human facial emotions.
Training a neural network can take a long time, ranging from a few hours to a week, depending on the size of the data source and the complexity of the model.
The limitations of our work are mostly related to the variable settings of the FER+ database, in particular the assumption that this dataset is large and diverse enough to indicate which model settings are generally preferable.
The model used in this paper allowed us to obtain a good classification of images according to the emotions on the faces.
The technology of facial expression recognition has enormous market potential and, in the near future, it will enhance most human computer interfaces.