Karthik Balasubramanian May 16th, 2019
Facial emotions are important factors in human communication that help us understand the intentions of others. In general, people infer the emotional states of other people, such as joy, sadness, and anger, from facial expressions and vocal tone. According to various surveys, verbal components convey one-third of human communication and nonverbal components convey two-thirds. Among the several nonverbal components, facial expressions, by carrying emotional meaning, are one of the main information channels in interpersonal communication. Interest in automatic facial emotion recognition (FER) has also been increasing with the rapid development of artificial intelligence techniques, with applications in human-computer interaction (HCI), virtual reality (VR), augmented reality (AR), advanced driver assistance systems (ADASs), and entertainment. Although various sensors such as an electromyograph (EMG), electrocardiogram (ECG), electroencephalograph (EEG), and camera can be used as FER inputs, a camera is the most promising type of sensor because it provides the most informative clues for FER and does not need to be worn.
My journey to deciding on this project was exciting. My motive was to compare the performance of the deep neural nets found in contemporary research against heuristically learned pattern recognition methods. Facial emotion recognition and pattern recognition have been active research areas for a long time, and several academic papers were very helpful in shaping this project.
The objective of this project is to showcase two different solutions to the problem of facial emotion recognition on a posed dataset. Both solutions sit in the problem space of supervised learning, but the first solution I propose is more involved and requires more human interference than the second, which uses state-of-the-art artificial neural nets. The goal is to compare the two approaches using a performance metric, i.e., how well the supervised learning model detects the expression posed in a still image. The posed dataset has labels associated with it; the labels define the most probable emotion. After running the two supervised learning solutions, I will compare their performance using the metrics I define in the next section.
Here is the basic structure of my attempt to solve the problem.
This is a problem of classifying an image into its right emotion, i.e., a multi-class classification problem. Our immediate resort for any such problem is accuracy.

Accuracy is defined as

$Accuracy = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{True Negatives} + \text{False Positives} + \text{False Negatives}}$
But when the dataset is not uniformly distributed across classes, we also use cross entropy loss.

Cross entropy loss compares a predicted class-probability vector with the true label. In the SVM models the probabilities come from the classifier's probability estimates; in the deep learning models a softmax layer is introduced as the final layer to turn the matrix-multiplication outputs into a probability vector. The predicted vector holds the probabilities of all the classes, while the true vector is a one-hot encoding of the output label. The lower the cross entropy, the better the model, so we aim to minimize this loss (i.e., drive it toward 0).
$CE = -\sum_{i}^{C}t_{i} log (s_{i})$
where $t_{i}$ is the ground truth (the one-hot encoded label) and $s_{i}$ is the predicted probability for class $i$ among the $C$ classes.
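To make the metric concrete, here is a minimal NumPy sketch of categorical cross entropy for a single sample. The eight-class probability vector is made up for illustration; in practice sklearn's `log_loss` and Keras's `categorical_crossentropy` compute the same quantity.

```python
import numpy as np

def categorical_cross_entropy(t, s, eps=1e-12):
    """CE = -sum_i t_i * log(s_i) for one sample.

    t : one-hot encoded ground-truth vector
    s : predicted class-probability vector (sums to 1)
    """
    s = np.clip(s, eps, 1.0)            # avoid log(0)
    return -np.sum(t * np.log(s))

# Illustrative example: 8 emotion classes, true class is index 5
t = np.eye(8)[5]
s = np.array([0.02, 0.03, 0.05, 0.05, 0.05, 0.70, 0.05, 0.05])
print(categorical_cross_entropy(t, s))  # ~0.357, lower is better
```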
I use the Cohn-Kanade dataset, introduced by Lucey et al. 210 persons, aged 18 to 50, were recorded depicting emotions. Out of the 210 people, only 123 subjects gave posed facial expressions, and the dataset contains the recordings of their emotions. Both female and male subjects are present, from different backgrounds: 81% are Euro-American and 13% are Afro-American. The images are of size 640 * 490 pixels as well as 640 * 480 pixels, and they are both grayscale and colored. In total there are 593 emotion-labeled sequences. Seven different emotions are depicted: anger, contempt, disgust, fear, happiness, sadness, and surprise.
Each subfolder holds an image sequence of a subject. The first image in the sequence starts with a neutral face and the final image in the subfolder has the actual emotion. So from each subfolder (image sequence) I have to extract two images: the neutral face and the final image with the emotion. Only 327 of the 593 sequences have emotion labels, because these are the only ones that fit the prototypic definition, and each of these files carries exactly one emotion. I have to preprocess this dataset to make the input uniform: I will make sure the images are all of the same size and, for now, contain at most one face depicting the emotion. After detecting the face in the image, I will convert the image to grayscale, crop it, and save it. I will use OpenCV to automate the face-finding process. OpenCV comes with four different pre-trained frontal-face classifiers; I will try each of them to find the face in the image and stop as soon as a face is identified. These identified, cropped, resized images become the input features, and the emotion labels are the output.
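As an illustration of this preprocessing step, here is a minimal sketch, assuming the `opencv-python` package (which exposes the bundled cascade files via `cv2.data.haarcascades`) and the 350 * 350 target size mentioned later in this report. The detection parameters and the single-face check are illustrative, not the exact values used in the project.

```python
import cv2

# The four frontal-face Haar cascades shipped with OpenCV.
CASCADE_FILES = [
    "haarcascade_frontalface_default.xml",
    "haarcascade_frontalface_alt.xml",
    "haarcascade_frontalface_alt2.xml",
    "haarcascade_frontalface_alt_tree.xml",
]
detectors = [cv2.CascadeClassifier(cv2.data.haarcascades + f) for f in CASCADE_FILES]

def extract_face(image_path, size=(350, 350)):
    """Return a cropped, resized grayscale face, or None if no single face is found."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    for detector in detectors:                      # try each cascade in turn
        faces = detector.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=10, minSize=(5, 5))
        if len(faces) == 1:                         # keep only single-face images
            x, y, w, h = faces[0]
            return cv2.resize(gray[y:y + h, x:x + w], size)
    return None
```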
Here is the statistical report of the dataset after I do all the preprocessing.
Abnormalities in the Dataset
I am aware of the following abnormalities in the dataset, but I am still going ahead with it.

- The dataset is not uniformly distributed. Using accuracy as the only metric would lead us into the accuracy paradox, hence I have introduced log loss (categorical cross entropy) as another metric.
- We have a very small dataset per emotion after preprocessing. We had to drop many image sequences because they had no labels.
Before we jump into solving the problem by defining the algorithms, we have to understand the feature extraction process. I will be using standard preprocessing methods for the deep learning models because I plan to use transfer learning. I transfer-learn the emotions from state-of-the-art models like VGG16 and Xception, which come with their own preprocessing methods; all I have to do is present the image in the right format (face identified, cropped, resized) so it can be preprocessed into what these models expect. But a significant amount of background is needed to understand the feature extraction process used in the baseline models.
The feature extraction process has two different phases. I use dlib's `get_frontal_face_detector`, which is handy for identifying the face region, and dlib's `shape_predictor` with its learned landmark model `shape_predictor_68_face_landmarks.dat` to extract action units (AUs). The image below helps us understand the facial action units.

The `shape_predictor_68_face_landmarks` model extracts 67 points, in both the X and Y axes, from any face in the presented image. These X and Y points, combined, become facial landmarks. They describe the positions of all the "moving parts" of the depicted face, the things you use to express an emotion. The good thing about extracting facial landmarks is that I capture very important information from the image for classifying an emotion.
But there are some problems when we capture these facial landmarks directly: the raw (x, y) coordinates depend on where the face sits in the frame, so the same expression at a different position yields different feature values. The solution is to express every point relative to the center of all the landmarks. What we then have is the relationship of all the points to the center point and how they are positioned relative to it in 2D space. Each tuple has the following values: <x, y, distance_from_center, angle_relative_to_center>. This additional information makes each coordinate location invariant, i.e., there is a consistent way to derive these points in the 2D system. These become the features for our baseline models.
Example feature vector
[34.0, 172.0, 143.73715832690573, -163.42042572345252]
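Here is a simplified sketch of this landmark-based feature extraction with dlib. It approximates the `get_landmarks` helper referred to later; the model file path and the decision to skip landmark 0 (giving 67 points, matching the description above) are assumptions.

```python
import math
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Assumes the trained model file has been downloaded separately from dlib.net.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def get_landmark_features(gray_image):
    """Return <x, y, distance_from_center, angle_relative_to_center> per landmark."""
    faces = detector(gray_image, 1)
    if not faces:
        return None
    shape = predictor(gray_image, faces[0])
    xs = [shape.part(i).x for i in range(1, 68)]
    ys = [shape.part(i).y for i in range(1, 68)]
    xmean, ymean = np.mean(xs), np.mean(ys)         # center of all landmarks
    features = []
    for x, y in zip(xs, ys):
        dist = np.linalg.norm(np.array([x - xmean, y - ymean]))
        angle = math.degrees(math.atan2(y - ymean, x - xmean))
        features.extend([x, y, dist, angle])
    return features
```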
I have chosen support vector machines (SVMs) to map the facial features to their emotions. An SVM attempts to find the hyperplane that maximizes the margin between positive and negative observations for a specified emotion class; it is therefore also called a maximum margin classifier.

We use libSVM, which uses a one-vs-one scheme, i.e., it creates $(K \cdot (K-1))/2$ binary classifiers in total, where $K$ is the number of classes, 8 here. A total of 28 binary classifiers are created.
Linear SVM
Definitions taken from the Cohn-Kanade+ paper.
A linear SVM classification decision is made for an unlabeled test observation $x^*$ by

$w^{T}x^{*} > b \;\Rightarrow\; \text{positive (the emotion is present)}$

$w^{T}x^{*} \le b \;\Rightarrow\; \text{negative (the emotion is absent)}$

where $w$ is the vector normal to the separating hyperplane and $b$ is the bias. Both $w$ and $b$ are estimated so that they minimize the risk on the training set, thus avoiding the possibility of overfitting to the training data. Typically, $w$ is not defined explicitly, but through a linear sum of support vectors.
Polynomial SVM
Kernel methods in SVMs are used when the data is not linearly separable; they transform the data into a higher dimension to make it separable. By default, our feature set is expressed through a degree-3 polynomial kernel.
The SVM parameters I set are:

- `random_state` - a seed value so the model returns the same response every time we run it.
- `probability` - we ask the model to provide probability scores for the different categories.
- `kernel` - linear / poly.
- `tolerance` - the tolerance of the stopping criterion for the model.
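A minimal sketch of how these two baseline classifiers could be configured with scikit-learn's `SVC` (which wraps libSVM and uses one-vs-one); the exact `tol` and `random_state` values are assumptions.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, log_loss

# Linear and polynomial (degree 3 by default) SVMs with probability estimates,
# so that cross entropy (log loss) can be computed alongside accuracy.
linear_svm = SVC(kernel="linear", probability=True, tol=1e-3, random_state=0)
poly_svm   = SVC(kernel="poly",   probability=True, tol=1e-3, random_state=0)

# training_data / training_labels / test_data / test_labels come from make_sets():
# linear_svm.fit(training_data, training_labels)
# probs = linear_svm.predict_proba(test_data)
# print(accuracy_score(test_labels, linear_svm.predict(test_data)))
# print(log_loss(test_labels, probs))
```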
Transfer Learning

I have implemented transfer-learned neural nets. My goal here is to do minimal work and reuse the wealth of knowledge in deep networks that have already been proven for image recognition. The concepts stay the same; only the task to identify is different.
Here are the steps I am planning to take to make my model generic and expressive enough to capture the emotions.
VGG16 bottleneck based model
Layer (type) Output Shape Param #
===============================================================
global_average_pooling2d_1 ( (None, 512) 0
_______________________________________________________________
dense_1 (Dense) (None, 8) 4104
===============================================================
Total params: 4,104
Trainable params: 4,104
Non-trainable params: 0
_______________________________________________________________
Xception bottleneck based model
Layer (type) Output Shape Param #
===============================================================
global_average_pooling2d_2 ( (None, 2048) 0
_______________________________________________________________
dense_2 (Dense) (None, 8) 16392
===============================================================
Total params: 16,392
Trainable params: 16,392
Non-trainable params: 0
_______________________________________________________________
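The following sketch shows how such a bottleneck model could be assembled and trained with the settings listed below, using the standalone Keras API of the time. `train_images`, `valid_images` and the one-hot label arrays are assumed to exist already; swapping `VGG16` for `Xception` (with its own `preprocess_input` and 2048-dimensional features) yields the second model.

```python
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.models import Sequential
from keras.layers import GlobalAveragePooling2D, Dense
from keras.callbacks import ModelCheckpoint

# 1) Extract bottleneck features once with the frozen convolutional base.
#    train_images / valid_images are assumed face arrays of shape (n, 350, 350, 3).
base = VGG16(weights="imagenet", include_top=False, input_shape=(350, 350, 3))
train_bottleneck = base.predict(preprocess_input(train_images.astype("float32")))
valid_bottleneck = base.predict(preprocess_input(valid_images.astype("float32")))

# 2) Train the small classification head summarized above on the bottleneck features.
model = Sequential([
    GlobalAveragePooling2D(input_shape=train_bottleneck.shape[1:]),
    Dense(8, activation="softmax"),                 # 8 emotion classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

checkpoint = ModelCheckpoint("vgg16_bottleneck.best.h5",   # hypothetical filename
                             monitor="val_loss", save_best_only=True)
model.fit(train_bottleneck, train_labels_onehot,
          validation_data=(valid_bottleneck, valid_labels_onehot),
          batch_size=20, epochs=20, callbacks=[checkpoint])
```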
The training settings are:

- `batch_size` = 20 - use 20 examples per gradient update (batch).
- `epochs` - I have used 20 epochs, i.e., 20 full passes of training.
- `callbacks` - `ModelCheckpoint` to save the best epoch's weights.
- `gradient optimizer` - `Adam`.
- `loss` - categorical crossentropy.

The baseline models perform pretty well. They are applied after doing the handcrafted feature extraction specified above in the Exploratory Visualization section.
Baseline Algorithm | Cross Entropy Loss - Train | Cross Entropy Loss - Test | Accuracy - Train | Accuracy - Test |
---|---|---|---|---|
Linear SVM | 0.31 | 0.57 | 1.0 | 0.84 |
Polynomial SVM | 0.31 | 0.61 | 1.0 | 0.81 |
I chose the linear SVM as the baseline model, and I want to see whether the transfer-learned neural nets perform better, i.e., achieve lower cross entropy loss and higher accuracy on the test set.
Linear SVM Test data Confusion Matrix
The data preprocessing steps are two-fold. The first is extracting the image files and converting them into a dataset. The second is extracting features and converting them into train/test/valid sets suited to the baseline and deep neural net models.
Dataset creation steps
- The raw recordings are extracted into `source_images` and their emotion labels into `source_emotion`.
- Two working folders are created: `pre_dataset` and `dataset`. `pre_dataset` arranges images by emotion, and `dataset` checks whether an image has a face and resizes the faces of all images to 350 * 350 grayscale images.
- The face-checked, resized images are finally moved from `pre_dataset` to the `dataset` parent folder.

Feature Extraction and Train/Test Split - Baseline
After educating yourself with the Exploratory Visualization section, use the following helper functions.

- `get_files` randomly splits the data in each emotion folder into training and test files. I have used an 80/20 split.
- `make_sets` runs across each emotion folder and gets the training and test files from the `get_files` function. It then converts each image into features using the `get_landmarks` function and tags the file with the emotion label. Thus we have our training_data, training_labels, test_data and test_labels.
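A hedged approximation of these helpers, assuming a `dataset/<emotion>/` folder layout and reusing the `get_landmark_features` sketch from earlier; the actual implementations in the project repository may differ.

```python
import glob
import random
import cv2

# Assumed folder names; get_landmark_features is the helper sketched earlier.
EMOTIONS = ["neutral", "anger", "contempt", "disgust",
            "fear", "happy", "sadness", "surprise"]

def get_files(emotion, split=0.8):
    """Randomly split one emotion folder into training and test file lists."""
    files = glob.glob("dataset/%s/*" % emotion)
    random.shuffle(files)
    cut = int(len(files) * split)
    return files[:cut], files[cut:]

def make_sets():
    """Build feature/label lists for every emotion."""
    training_data, training_labels, test_data, test_labels = [], [], [], []
    for label, emotion in enumerate(EMOTIONS):
        train_files, test_files = get_files(emotion)
        for item in train_files:
            features = get_landmark_features(cv2.imread(item, cv2.IMREAD_GRAYSCALE))
            if features is not None:
                training_data.append(features)
                training_labels.append(label)
        for item in test_files:
            features = get_landmark_features(cv2.imread(item, cv2.IMREAD_GRAYSCALE))
            if features is not None:
                test_data.append(features)
                test_labels.append(label)
    return training_data, training_labels, test_data, test_labels
```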
Feature Extraction and Train/Test/Valid Split - Deep Neural Nets

Now that we have train and test sets from the above process, we want to make our predictions more robust, so I introduce a validation set as well for the deep neural net training. Here are the steps needed for creating the train/test/validation sets for the deep neural nets. The feature extraction process is built into the state-of-the-art neural nets we plan to use.
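As a rough sketch of such a split, a stratified hold-out could carve the validation set out of the training images; the variable names below are placeholders and this may not match the project's exact procedure.

```python
from sklearn.model_selection import train_test_split

# Hypothetical split: hold out 20% of the training images as a validation set,
# stratified by emotion so every class appears in each set.
train_images, valid_images, train_labels, valid_labels = train_test_split(
    all_train_images, all_train_labels,
    test_size=0.2, stratify=all_train_labels, random_state=0)
```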
The non-uniformity of the image distribution among the different emotions could not be resolved, as the data itself is very constrained. The papers I read defined ways of extracting the emotions via peak frames, so if I introduced any data from outside the dataset, I would be going against the methodology of the Cohn-Kanade data collection. The final test set had the following distribution of emotions:
emotion
0 23
1 9
2 3
3 11
4 5
5 13
6 5
7 16
dtype: int64
After obtaining the train, test and valid sets, I used the APIs defined in SKLearn and Keras to implement the algorithms and their parameters. I have clearly defined the algorithms implemented in this project in the Algorithms and Techniques section.
Observations from implementations of Transfer Learning
In both the VGG16 and Xception transfer learning experiments we saw that the training accuracy and training loss improve drastically, but the validation accuracy and loss suffer, reaching a plateau within a few iterations. The deep models did not get enough data to express the right emotions. So my next logical step was to augment the images and feed them to the model.
Image Augmentation
The neural network agents that learn the emotions behind the scenes only see pixels, which are basically RGB inputs. Our underlying problem of having too little data to learn from can be addressed by introducing different versions of the same images, transforming their scale, position and rotation. This gives more varied representations of each emotion and might increase the accuracy of the transfer-learned network. Our goal is to see if this avoids overfitting.

As seen above, the data for the different emotion classes is non-uniform and limited. To address this, we do data augmentation by rotating, shifting and flipping the existing images, and combinations thereof, to get more data from what we already have. Mainly rotation, width and height shifts, and horizontal flips are required. This cannot completely solve the problem of non-uniformity, as the class imbalance is large in this case.
I have employed the following transformations (the parameter names are Keras `ImageDataGenerator` arguments):

- `featurewise_center` = False - do not set the input mean to 0 over the dataset.
- `samplewise_center` = False - do not set each sample mean to 0.
- `featurewise_std_normalization` = False - do not divide inputs by the std of the dataset, feature-wise.
- `samplewise_std_normalization` = False - do not divide each input by its std.
- `zca_whitening` = False - do not apply ZCA whitening.
- `rotation_range` = 50 - rotate images by up to 50 degrees.
- `width_shift_range` = 0.2 - shift the width by up to 20%.
- `height_shift_range` = 0.2 - shift the height by up to 20%.
- `horizontal_flip` = True - randomly flip images horizontally.
- `vertical_flip` = False - do not flip images vertically.
- `fill_mode` = nearest - fill in new pixels created by the preceding operations (especially rotation or translation) with their nearest surrounding pixel values.

I will create two `datagen.flow` objects, one for training and one for validation. The train flow feeds batches of 32 inputs that are trained and learned through backpropagation. The validation flow is used to get the validation loss and accuracy after training the model on each batch.
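Putting the list above together, a minimal Keras sketch might look like this; the image and label arrays are placeholders, and whether the validation flow was augmented in exactly the same way is an assumption based on the description above.

```python
from keras.preprocessing.image import ImageDataGenerator

# Augmentation settings matching the list above; only geometric transforms are enabled.
datagen = ImageDataGenerator(
    featurewise_center=False,
    samplewise_center=False,
    featurewise_std_normalization=False,
    samplewise_std_normalization=False,
    zca_whitening=False,
    rotation_range=50,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    vertical_flip=False,
    fill_mode="nearest",
)

# One flow for training (batches of 32) and one for validation.
train_flow = datagen.flow(train_images, train_labels_onehot, batch_size=32)
valid_flow = datagen.flow(valid_images, valid_labels_onehot, batch_size=32)
```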
Image augmented VGG16 model components

There are a total of six components in our image-augmented VGG16 model.

- The VGG16 model itself. The input to the VGG16 model is an image of shape (resized length, resized width, number of channels). We do not take the last layer of the VGG16 model, we make sure none of the VGG16 layers are trained, and we pass the output of the VGG16 model down as input to the lower layers of our image-augmented custom VGG16 model.
- The model is compiled with `categorical_crossentropy` loss and the `Adam` gradient optimizer. We are also interested in the `Accuracy` metric.
- `ModelCheckpoint` saves only the model with the best validation loss.
- `ReduceLROnPlateau` automatically reduces the learning rate of the gradient optimizer by a percentage if the validation loss has plateaued over the last 5 epochs.
- `EarlyStopping` stops the training if it sees no improvement in validation loss for the last 3 consecutive epochs, which in fact makes `ReduceLROnPlateau` redundant.
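A minimal sketch of how these components could fit together, again with the standalone Keras API; the `train_flow` / `valid_flow` generators come from the augmentation sketch above, and the checkpoint filename, learning-rate factor and epoch count are assumptions.

```python
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import GlobalAveragePooling2D, Dense
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping

# Frozen VGG16 base feeding a small classification head, trained on augmented batches.
base = VGG16(weights="imagenet", include_top=False, input_shape=(350, 350, 3))
for layer in base.layers:
    layer.trainable = False                     # keep the pre-trained weights fixed

x = GlobalAveragePooling2D()(base.output)
outputs = Dense(8, activation="softmax")(x)     # 8 emotion classes
model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    ModelCheckpoint("aug_vgg16.best.h5", monitor="val_loss", save_best_only=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
    EarlyStopping(monitor="val_loss", patience=3),
]
model.fit_generator(train_flow,
                    steps_per_epoch=len(train_images) // 32,
                    validation_data=valid_flow,
                    validation_steps=len(valid_images) // 32,
                    epochs=20, callbacks=callbacks)
```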
But the process could not be improved; it only got worse. I realized my mistakes later on, after completing the project. I had no time to fix them, so I will document them as part of the improvements.
Algorithm | Cross Entropy Loss - Train | Cross Entropy Loss - Valid | Cross Entropy Loss - Test | Accuracy - Train | Accuracy - Valid | Accuracy - Test |
---|---|---|---|---|---|---|
Transfer Learned Xception | 0.74 | 1.13 | 1.14 | 0.79 | 0.68 | 0.61 |
Transfer Learned VGG16 | 0.42 | 1.15 | 1.27 | 0.91 | 0.63 | 0.56 |
Data Augumented VGG16 | 1.71 | 1.83 | 1.78 | 0.25 | 0.24 | 0.21 |
After several iterations over the models' parameters, I present the output above. From this output we can clearly see that none of the tried models is reasonable: on the held-out test set they have barely learnt any emotion other than the skewed neutral emotion. Accuracy has been a very poor metric for this evaluation; we can see from the confusion matrices that the models have barely learnt anything, yet due to the non-uniformity of the data, accuracy still reaches up to 68%. Data augmentation has performed horribly and has greatly hurt the loss. The models above cannot be trusted for a production use case, and I could not decide on a final model based on the performance results I got.
I declare that my deep neural net models did not do better than the baseline SVM models. There are several reasons behind this, and I will capture some here. Before that, I want to document the following.
Since I could not settle on a solution to the problem via deep learning, I will use this section to document what I could have done differently, mostly with respect to the dataset. The emotions in this dataset are acted out rather than authentically expressed. Acted facial expressions are usually more exaggerated than what we see in the real world, so if we were to classify new images with these models, they would most likely underperform; systems trained on datasets created in a controlled lab setting generally fail to generalize across datasets. So if I had another chance, I would take an emotions-in-the-wild dataset to capture the emotions.
To summarize the whole pipeline of the project: I started by identifying a posed dataset that captured 8 different emotions of its subjects. I defined the metrics to compare and explained the feature extraction methods to be applied in the data preprocessing phase. I then carried out data preparation, where I preprocessed the data, resized the images and extracted relevant features to form the train, test and valid sets. I then applied the different algorithms and showcased their performance. Later I ran into problems stemming from too little data and reasoned out the probable ways I could have resolved them. I did not find one good model to solve the problem of facial emotion recognition, but I am confident that this pipeline, when enhanced, can become a decent tool for the problem. The most interesting aspect of this project for me was getting a historical overview of heuristic pattern recognition methods. The most difficult aspect was learning the steps I had to take to improve a model with little data.
There is definitely more work to do to make this project more robust.
http://www.paulvangent.com/2016/08/05/emotion-recognition-using-facial-landmarks/
https://github.com/mayurmadnani/fer
https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a
https://github.com/karthikBalasubramanian/FER