Affective Computing for Subjective Video Quality Assessment: Building EmoPy
This post was AI-generated from the project’s source code, thesis, and documentation. It is an automated summary, not original writing.
Can a Computer Tell How a Video Makes You Feel?
When you watch a sad scene from The Lion King, your face drops, your heart rate shifts, and your skin conductance changes. What if a machine could read all of that in real time? That’s the question behind this project – a university research effort that combines facial expression recognition (FER), bio-sensor data, and deep learning to objectively measure subjective emotional responses to video content.
The broader field is called Affective Computing: teaching machines to recognize, interpret, and respond to human emotions. Its applications range from multimedia quality assessment and social media analytics to clinical tools that aid people with autism in reading emotions. This project focuses on one specific angle – can we assess video quality not just by pixel metrics, but by how the content actually makes viewers feel?
The Experimental Setup
Video Stimuli with Arousal-Valence Mapping
The study uses a 20-minute experimental session built around the arousal-valence model of emotion. Participants watch two 8-minute blocks of video clips, separated by a 3-minute preparation phase and a 1-minute cool-down:
- Positive valence / high arousal: clips from Tarzan, Kevin Hart stand-up, The Lion King, Whose Line Is It Anyway?
- Negative valence / high arousal: clips from Jackass, Cry Freedom, The Ring, Sinister, Walking Death, Hooligans
A key modification to the standard arousal-valence model was extending it with unipolar valence axes – separate positive and negative scales – rather than a single bipolar dimension. The stimuli were manually labelled for higher arousal to ensure clear emotional activation.
Bio-Sensors: ECG and EDA
Alongside the camera-based FER system, participants were fitted with:
- ECG (Electrocardiogram) – three channels measuring cardiac activity. The analysis focused on peak detection, with peaks remaining relatively constant across stimuli but showing potential for heart-rate variability analysis.
- EDA (Electrodermal Activity) – two channels measuring skin conductance on the palm. EDA proved particularly effective at capturing mood shifts: the signal evolution was clearly viewable even in small time frames, with distinct patterns emerging between positive and negative valence video blocks.
Does It Work?
The answer: sometimes. The results revealed an interesting complementarity between modalities:
- For emotions like sadness, bio-sensor data strongly mirrored the FER predictions – both channels agreed.
- For disgust, the face told the story clearly, but bio-sensors showed no significant reaction.
- For anger, neither modality was consistently reliable – bio-sensors often stayed flat even when subjects visually appeared engaged.
This inconsistency raises a compelling question the team posed for future work: Can the bio-sensor and FER approaches benefit from each other?
EmoPy: The Facial Expression Recognition Engine
The software backbone of this project is EmoPy – an emotion recognition system built with Dlib, Keras, and TensorFlow. It classifies facial expressions into 7 emotions: neutral, sad, happy, angry, disgust, surprise, and fear.
Architecture Overview
EmoPy is designed as a modular pipeline with pluggable neural network architectures and preprocessing stages.
Data Collection supports two datasets:
- CK/CK+ (Cohn-Kanade) – 327 directly labelled images with FACS (Facial Action Coding System) data
- FER2013 (Kaggle) – a larger but noisier dataset of 48x48 grayscale facial images
Preprocessing is handled by a hierarchy of preprocessor classes, each tailored to a specific model type:
| Preprocessor | Input Type | Description |
|---|---|---|
ImagePreprocessor | Raw images | Grayscale face images resized to input dimensions |
DlibPreprocessor | Dlib landmarks | Extracts 68 facial landmark points, distances from centroid, and vector angles |
MultiPreprocessor | Both | Combines image pixels with Dlib geometric features |
CapsPreprocessor | Raw images | Specialized preprocessing for Capsule Networks |
The Neural Network Zoo
The project explored a striking variety of architectures – “maybe too many,” as the team candidly noted in their presentation. Here’s what they built:
1. Image Input CNN (imagenn)
A straightforward convolutional neural network processing raw grayscale face images:
Input (48x48x1)
-> BatchNorm -> Conv2D(32, 3x3) -> PReLU -> BatchNorm -> MaxPool(2x2)
-> Conv2D(32, 1x1) -> PReLU -> BatchNorm
-> Conv2D(64, 3x3) -> PReLU -> BatchNorm
-> Conv2D(128, 3x3) -> PReLU -> BatchNorm -> MaxPool(2x2)
-> Conv2D(256, 3x3) -> PReLU -> BatchNorm
-> Flatten -> Dense(256) -> Dense(1024) -> PReLU
-> Dense(2048) -> PReLU -> Dropout(0.25)
-> Dense(512) -> PReLU -> Dense(7, softmax)
Uses PReLU activations throughout and Glorot normal initialization – a solid baseline architecture.
2. Dlib Landmark CNN (dlibnn)
Instead of raw pixels, this network operates on geometric facial features extracted by Dlib’s 68-point face landmark detector. It processes three parallel input streams:
- Landmark coordinates (68 x 2): raw (x, y) positions of facial keypoints
- Distances from centroid (68 x 1): how far each landmark sits from the face center
- Angles relative to centroid (68 x 1): the angular position of each landmark
Each stream passes through its own convolutional tower, then all three are concatenated and fed through dense layers. This is a clever approach – it encodes facial geometry explicitly, making the model invariant to lighting and texture.
3. Multi-Input CNN (minn)
The most sophisticated CNN architecture, combining both raw image features and Dlib geometric features in a single model:
Image branch: Conv2D stack (32->64->128->256) with strides -> Flatten -> 1024 features
Dlib branch: Flatten all 3 streams -> Concatenate -> Dense(256) features
|
Concatenate (1024 + 256 = 1280)
|
Dropout(0.2) -> Dense(1024) -> Dense(7, softmax)
This architecture includes TensorBoard logging, model checkpointing, and learning rate decay – the most production-ready of the set.
4. Capsule Network (caps)
An implementation of Capsule Networks (Sabour et al., 2017) for emotion recognition – a particularly forward-thinking choice for 2018. The architecture features:
- A standard conv layer followed by PrimaryCaps (8-dimensional capsules)
- A CapsuleLayer with dynamic routing producing one 16-dimensional capsule per emotion class
- A decoder network (Dense 512 -> 1024 -> reconstruct input) for regularization
- Custom margin loss with the classic CapsNet formulation:
m+ = 0.9,m- = 0.1
5. Additional Architectures
The codebase also includes Inception-ResNet and VGGFace transfer learning models, as well as commented-out LSTM variants for sequential/video-based emotion recognition – evidence of ambitious experimentation.
Results
The best model achieved ~60% accuracy on the FER2013 dataset’s 7-class problem. While this may sound modest, it’s worth noting that FER2013 is notoriously noisy – even human annotators achieve only about 65% agreement on this dataset. The team experimented extensively with architectural choices:
- Replacing max pooling with strided convolutions slowed learning and increased overfitting
- Adding dropout layers increased training time but helped generalization
- Large learning rates combined with dropout after every conv layer hindered learning entirely
The Runner System
EmoPy is orchestrated through a clean runner script (runners.py) that supports three modes via command-line arguments:
init_data– preprocesses and structures the CK and FER2013 datasetstrain– trains the selected architecture with configurable hyperparameterspredict– runs inference on images, video files, or live webcam feedsvisualize– generates neuron activation visualizations inspired by the Deep Visualization Toolbox
Integration: FER Meets ROS
For real-time deployment, the FER model was integrated into a ROS (Robot Operating System) people model. The pipeline chains together:
- USB camera input -> raw video frames
- Dlib face detection and shape prediction -> facial landmarks
- People model -> tracking individuals with distance, position, and attributes
- Emotion classification -> real-time per-face emotion labels
- Debug visualization -> overlay of predictions on the video feed
This ROS integration allowed the FER system to run alongside the bio-sensor data collection during the video stimuli experiments, providing synchronized multi-modal emotion readings.
Challenges and Lessons Learned
The team identified several fundamental challenges in affective computing:
- Capturing “true” emotions – lab settings induce self-consciousness; posed vs. spontaneous expressions differ significantly.
- Data quality trade-offs – there’s an inherent tension between data clarity (controlled conditions) and ecological validity (natural behavior).
- Aesthetics and emotion are entangled – visual quality of the stimulus itself influences emotional response, making it difficult to isolate content-driven affect.
- Bio-sensor/FER disagreement – different modalities capture different aspects of emotion, and they don’t always align.
Looking Forward
The project outlined several directions for future work:
- Standardizing affective terminology – the field needs clearer, shared definitions for each research aim
- Crowdsourcing and wearables – scaling data collection beyond the lab using consumer devices
- Stronger machine learning integration – leveraging advances in deep learning to fuse multiple modalities more effectively
- FACS-to-continuous-dimension mapping – converting Facial Action Coding System data into continuous arousal-valence dimensions, as proposed by Egon van den Broek’s work on Affective Signal Processing
Technical Stack
| Component | Technology |
|---|---|
| Face detection & landmarks | Dlib (68-point predictor) |
| Deep learning framework | Keras + TensorFlow |
| Transfer learning | VGGFace, Inception-ResNet |
| Computer vision | OpenCV |
| Real-time integration | ROS (Robot Operating System) |
| Bio-sensors | ECG (3-channel), EDA (2-channel) |
| Camera | GoPro (via goprocam) + USB webcam |
| Data analysis | NumPy, SciPy, scikit-learn, Pandas |
Full documents: Report (PDF) · Presentation (PDF)
This project was presented on March 29, 2018, by Jan Laun-Cieslik and Fabian Klopfer (5th Semester, BSc Computer Science) as part of a collaboration between the University of Konstanz and TU Berlin. Contributors: Nida Cilasun, Mengzhu Deng, Franz Hahn, Yazan Kittaneh, Fabian Klopfer, Anlin Liang, Zhou Long, Jan Laun-Cieslik, Daniel Matzkuhn, and Andrei Sacuiu.