Video-based facial recognition solution to generate montages of individual fans enjoying the match.

We built a video-based facial recognition solution that generates montages of individual fans across videos taken from various sports matches. We trained various 2D and 3D neural network models for classification, image captioning for natural-language descriptions of fans, audio analytics for background noise removal, and more. We also organized data collection and labeling, built data-processing pipelines with Luigi, and deployed the models as microservices with Flask.

Client

The client is a US-based mobile app developer. The mobile app is a sports video platform where fans can sign up to receive videos of themselves from live events directly on their phones. Their TV and video board appearances can be saved and shared from within the app. The platform partners with TV stations, NBA, NFL, MLB, and NHL teams, as well as entire leagues, to deliver the footage to fans.

Problem

The client wanted an automated solution for creating fan videos from large amounts of video content. Going through 3-hour videos to select shots with fans, identify them, and prepare a video montage took a great deal of manual time. This caused long turnaround times that prompted negative reviews and a bad user experience. The client therefore wanted an engine that would analyze large volumes of video and prepare the montages automatically.

The client first experimented with open-source facial recognition solutions, but the results were unsatisfactory: camera shake, fan body motion, and varied facial expressions were too difficult for pre-trained models that had been trained on stationary subjects filmed with stationary cameras. Accuracy was crucial to make sure fans don't receive videos featuring other fans.

They reached out to BroutonLab to develop an engine for gender recognition, facial detection & recognition, shot quality measurement, and video captioning, in order to automatically produce a ready-made video to send to fans.

Emotion recognition (from Introduction to Emotion Recognition 2021)

Solution

The solution is a complex engine that can be separated into multiple models, including:

Facial recognition from videos
Gender recognition
Detecting whether the video frame shows fans or not
Emotion recognition
Scene detection - detecting whether an image represents the beginning of a new scene
Audio noise removal
Background removal from video
Automated image captioning

Different models were trained and used for different sports (basketball, baseball, hockey, football) since data differed across videos from various sports and TV broadcasters.

Datasets

The data was provided by the client and labeled by a team of labelers. We organized the process of data collection and labeling, set up the rules to follow during these processes, and built pipelines to manage input and output data flows with Luigi and DVC.

The data came from video recordings of fans supplied by the TV companies that broadcast the sports matches; we kept only the frames that focus on fans. We used both single images and image sequences as inputs to our models.

Example of blurry and small faces for recognition (from State of Origin 2021: Decision on MCG match due in next 24 hours)

Models

Various models that were state of the art at the time were tested and employed during the project. For classification problems, we started with pretrained models and architectures such as ResNet, Xception, and EfficientNet. For scene detection, we handcrafted a custom neural network model by stacking Conv3D layers. The approach was initially based on a research paper, but we later redesigned it to improve the model's speed and accuracy.

Xception architecture overview (from Xception: Deep Learning with Depthwise Separable Convolutions)

Emotion Recognition

Another important task was developing an algorithm to find emotional moments in the broadcasts. This helps our client build additional analytics for video broadcasts and understand which parts of a broadcast are most engaging for users. To do this, we labeled our face images with six basic emotions: anger, disgust, fear, happiness, sadness, and surprise. This set of emotions is standard for emotion recognition problems and follows the article "Basic emotions, rationality, and folk theory".

We trained a simple Xception single-image classifier on our newly labeled dataset and reached an accuracy of 71%. The next step was to develop an algorithm that takes a sequence of faces and predicts an "excited face score" ranging from 0.0 to 1.0. This score was more useful for our client and, with this approach, we achieved an accuracy of 91%, which was sufficient to use the model in production.
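The source does not describe how the sequence model computes its score, so the following is only a rough baseline for comparison, not the production algorithm. It assumes a hypothetical list of per-frame "excited" probabilities from a single-image classifier, smooths them with a trailing moving average, and reports the frame ranges where the smoothed score stays above a threshold. The function names, the window size, and the threshold are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def excited_face_scores(frame_probs: Sequence[float], window: int = 5) -> List[float]:
    """Smooth per-frame probabilities with a trailing moving average.

    frame_probs: hypothetical per-frame classifier outputs in [0.0, 1.0]
    window: number of most recent frames averaged per score (assumed value)
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    scores: List[float] = []
    running_sum = 0.0
    for i, p in enumerate(frame_probs):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0.0, 1.0]")
        running_sum += p
        if i >= window:
            # Drop the frame that just left the window.
            running_sum -= frame_probs[i - window]
        mean = running_sum / min(i + 1, window)
        # Clamp to guard against floating-point drift in the running sum.
        scores.append(min(1.0, max(0.0, mean)))
    return scores


def emotional_moments(scores: Sequence[float], threshold: float = 0.8) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame ranges where score >= threshold."""
    moments: List[Tuple[int, int]] = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            moments.append((start, i - 1))
            start = None
    if start is not None:
        moments.append((start, len(scores) - 1))
    return moments
```

A baseline like this is mainly useful as a reference point: a learned sequence model should beat it before it is worth deploying.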

Gender/Age Recognition

To make facial recognition easier, we reduced the subsets between which facial matching occurs. In other words, when matching two faces in the database, we only compare women with women, older people with older people, and so on. The gender classifier performed very well, so we used it in all cases; age recognition, however, was unstable on smaller face crops, so we used it only for high-quality faces. Both models are based on the Xception architecture.
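The subset restriction described above can be sketched as a filter applied before embedding comparison. This is a minimal illustration under assumptions: the `FaceRecord` fields, the cosine-similarity comparison, and the `min_similarity` threshold are hypothetical and not taken from the project. The age filter is applied only when both faces carry an age group, mirroring the decision to use age only for high-quality faces.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class FaceRecord:
    face_id: str
    embedding: Sequence[float]
    gender: str                      # label from the gender model
    age_group: Optional[str] = None  # None when the crop is too small for a reliable estimate


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_subset(query: FaceRecord, gallery: Sequence[FaceRecord]) -> List[FaceRecord]:
    """Keep gallery faces that share the query's gender and, when both are known, age group."""
    subset = []
    for face in gallery:
        if face.gender != query.gender:
            continue
        if query.age_group and face.age_group and face.age_group != query.age_group:
            continue
        subset.append(face)
    return subset


def best_match(query: FaceRecord, gallery: Sequence[FaceRecord],
               min_similarity: float = 0.5) -> Optional[FaceRecord]:
    """Return the most similar face in the filtered subset, or None if none clears the threshold."""
    best, best_sim = None, min_similarity
    for face in candidate_subset(query, gallery):
        sim = cosine_similarity(query.embedding, face.embedding)
        if sim >= best_sim:
            best, best_sim = face, sim
    return best
```

Besides cutting the number of comparisons, filtering first removes whole classes of possible false matches, which matters when a wrong match means sending a fan someone else's video.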

ArcFace loss explained (from ArcFace: Additive Angular Margin Loss for Deep Face Recognition)

Facial Recognition

The main task was to match the users' selfies from the mobile app with the faces found in the sports broadcast videos. The environment during sports matches is complex and unpredictable: many images are blurry, and faces are sometimes too small to detect. Open-source solutions don't work well in this context, so we addressed the problem by training a custom model.

Rather than spending a lot of time developing a complex and tricky architecture, we decided to focus on data quality and on developing custom metrics. We took the ResNet101 backbone and the ArcFace loss and trained the model on our private database, which contained a combination of faces from selfies and faces from video broadcasts.
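For readers unfamiliar with ArcFace, the loss adds an angular margin m to the angle between an embedding and its ground-truth class centre before a scaled softmax. The sketch below computes it for a single sample in plain Python. It is a simplified illustration, not the project's training code: it omits the handling that real implementations add for angles where theta + m exceeds pi, and the defaults s = 64 and m = 0.5 are the values commonly used in the ArcFace paper, not settings confirmed for this project.

```python
import math
from typing import Sequence


def arcface_loss(cosines: Sequence[float], target: int,
                 scale: float = 64.0, margin: float = 0.5) -> float:
    """ArcFace loss for one sample (simplified).

    cosines: cos(theta_j) between the L2-normalised embedding and each
             L2-normalised class centre
    target:  index of the ground-truth identity
    scale:   s in the paper (assumed default)
    margin:  additive angular margin m in radians (assumed default)
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # keep acos in its valid domain
        if j == target:
            # Penalise the target class by widening its angle by the margin.
            c = math.cos(math.acos(c) + margin)
        logits.append(scale * c)
    # Numerically stable cross-entropy: log-sum-exp minus the target logit.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]
```

With `margin=0.0` this reduces to ordinary softmax cross-entropy on scaled cosines; a positive margin raises the loss for the same embedding, which pushes faces of the same identity into tighter angular clusters.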

The dataset was the main challenge in the project, since facial recognition projects require starting with a clean dataset. We created a pipeline for data collection and parsing using Luigi, Dropbox, and a web application.

Finally, we developed custom metrics to estimate the quality of the model based on the business needs of our client. Examples include hours of manual work saved, execution time per face, and average mismatching per sport or match. We monitored these metrics during the training process, and they helped us choose the most valuable model for our client.
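As a rough idea of what such business-oriented metrics might look like in code, the sketch below computes a mismatch rate per sport, the mean processing time per face, and the manual hours saved. The `MatchResult` record and all function names are hypothetical, and the exact definitions used in the project are not given in the source.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class MatchResult:
    sport: str
    predicted_fan: str
    actual_fan: str
    seconds: float  # processing time spent on this face


def mismatch_rate_per_sport(results: Iterable[MatchResult]) -> Dict[str, float]:
    """Fraction of faces matched to the wrong fan, grouped by sport."""
    totals: Dict[str, int] = defaultdict(int)
    wrong: Dict[str, int] = defaultdict(int)
    for r in results:
        totals[r.sport] += 1
        if r.predicted_fan != r.actual_fan:
            wrong[r.sport] += 1
    return {sport: wrong[sport] / totals[sport] for sport in totals}


def mean_seconds_per_face(results: Iterable[MatchResult]) -> float:
    """Average execution time per face; 0.0 for an empty input."""
    items = list(results)
    if not items:
        return 0.0
    return sum(r.seconds for r in items) / len(items)


def manual_hours_saved(manual_hours: float, automated_hours: float) -> float:
    """Hours of manual work no longer needed for one video (never negative)."""
    return max(0.0, manual_hours - automated_hours)
```

Tracking metrics like these next to the training loss makes it possible to pick the checkpoint that saves the most manual work rather than the one with the best validation accuracy alone.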

Server

The models were deployed as Flask microservices on Microsoft Azure. The client's backend was written in C#, so exposing the models as Flask microservices was the simplest way to integrate them.

Results

Our solution reduced the labor cost of video preparation by 90%. The time needed to process a single video fell from 3 hours to 30 minutes, and thanks to the faster turnaround on fan requests, 20 more contracts were signed with TV channels. The reduction in work and cost associated with video preparation allowed our client to shift its attention to other venues and other aspects of the platform.