How we built a sports fan engagement engine that finds each fan in broadcast footage
A sports fan engagement mobile app needed automated fan-clip production from broadcast footage. We built the multi-model CV engine that cut per-video processing time from 3 hours to 30 minutes and labor cost by ~90%.
The client is a US sports fan engagement mobile app: fans sign up, opt in their selfie, and receive personalized highlight clips from the games they watched. To replace the manual review work that was bottlenecking video delivery, we built a multi-model computer vision engine that cut per-video processing from three hours to thirty minutes and dropped labor cost on video preparation by ~90%.
The problem
The client runs a mobile app that’s part fan engagement platform, part personalized memorabilia. Fans sign up, opt in their selfie, and receive video clips of themselves whenever they’re caught on a stadium camera or TV broadcast during a game. The product partners with TV stations and teams in the NBA, NFL, MLB, and NHL — fan moments captured during broadcasts get delivered directly to the phone.
Before our engagement, the team behind the product produced these clips manually. A typical game means three hours of broadcast footage. To produce even one fan’s clip, someone had to:
- Watch through the broadcast and pull every shot showing fans.
- Identify each visible fan against the opted-in user pool.
- Cut and assemble a personalized montage.
The turnaround between game end and clips becoming buyable in the app was slow enough that fans were leaving negative reviews — the moment had already passed by the time the video arrived. Accuracy was non-negotiable: a fan receiving a clip featuring a different person would be worse than getting nothing.
Manual review gated the entire product. Three-hour broadcasts, one fan at a time, post-game review backlog stretching into the next day’s news cycle.
The client had tried open-source facial recognition first, the obvious starting point. It didn’t work. Pre-trained models in public face recognition SDKs are typically trained on images from stationary cameras and cooperative subjects: passport photos, security checkpoints, posed selfies. Live sports broadcasts give you almost the opposite: shaky camera work, fans in motion, exaggerated facial expressions, partial occlusion, side angles, lighting that changes by the second. Engineers following OpenCV face matching tutorials, or wiring up an off-the-shelf cloud SDK, hit the same domain-gap wall: recall on broadcast frames collapses, false matches climb, and accuracy never gets close to “ship to consumers.”
What the client actually needed wasn’t a face matcher in isolation — it was an end-to-end engine that could chew through broadcast footage, identify fans, find the emotional moments, cut clean scenes, and produce a finished video. That meant a system of coordinated models, not a single classifier.
Our approach
We picked a small set of architectural decisions and built around them.
1. A multi-model engine, not one big classifier. The product needs different things from CV: who is this person, are they a fan or staff, what scene boundary just happened, is this moment emotionally interesting, can we caption the clip cleanly? Each is a separate problem. Stacking them into one mega-model is a research detour. We trained focused models for each task and orchestrated them as a pipeline.
2. Data quality over architectural cleverness. Facial recognition projects live or die on their training data. Instead of spending months tuning architecture, we invested in the data side: a clean, sport-specific dataset combining fan selfies with broadcast face crops, a labeling pipeline the team could run with consistency, and custom evaluation metrics tied to actual business outcomes.
3. Sport-specific model variants. Broadcast characteristics vary by sport — basketball camera work, hockey crowds, baseball pacing, football angles. One model trained on everything is a worse model than four trained per-domain. We split.
4. Microservices, not a monolith. Each model became a Flask service. The client’s backend is C# on Azure. Flask microservices were the cleanest integration boundary — language-agnostic HTTP, easy to scale individual hot paths, easy to swap a model without redeploying the rest.
What we built
Facial recognition — the core matcher
The central task: match a fan’s selfie (clean, well-lit, from their phone) against the same fan’s face appearing in messy broadcast footage. We picked ResNet101 as the backbone with ArcFace loss — a well-understood combination for embedding-based face matching. Both selfies and broadcast face crops map into the same embedding space; matching is cosine similarity against the opted-in pool.
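The matching step can be sketched in plain Python as follows. This is a minimal illustration, not the production code: the function names, the toy three-dimensional embeddings, and the 0.6 acceptance threshold are all assumptions made for the example. Real embeddings would come from the trained backbone and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a degenerate embedding as "no similarity"
    return dot / (norm_a * norm_b)

def best_match(crop_embedding, pool, threshold=0.6):
    """Return (fan_id, score) for the closest selfie embedding in the pool.

    Returns (None, score) when even the best score is below the threshold
    (threshold value is hypothetical and would be tuned on validation data).
    """
    best_id, best_score = None, -1.0
    for fan_id, selfie_embedding in pool.items():
        score = cosine_similarity(crop_embedding, selfie_embedding)
        if score > best_score:
            best_id, best_score = fan_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score

# Toy opted-in pool: fan id -> selfie embedding (illustrative values).
pool = {"fan_a": [1.0, 0.0, 0.0], "fan_b": [0.0, 1.0, 0.0]}

print(best_match([0.9, 0.1, 0.0], pool))  # close to fan_a, accepted
print(best_match([0.5, 0.5, 0.7], pool))  # ambiguous, rejected as no match
```

Rejecting low-confidence matches rather than returning the nearest fan reflects the requirement stated earlier: sending a fan someone else's clip is worse than sending nothing.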
The architecture wasn’t novel. What made the model production-grade was the dataset. We assembled a private training corpus combining:
- Fan selfies from the app (with explicit user opt-in).
- Face crops extracted from broadcast video frames the client provided.
- Negative examples — non-fan staff, mismatched pairs, hard cases drawn from real production confusions.
To build that dataset we set up a pipeline using Luigi for orchestration, DVC for data versioning, and a custom web application backed by Dropbox for the human labeling workflow. The labelers were a managed team; we set the rules, handled the data flow, and kept the labels consistent across sports.
Custom evaluation metrics tied to business outcomes
Standard face-recognition benchmarks (LFW, MegaFace) don’t predict how a model performs on a stadium feed during a basketball game. We developed model-selection metrics that mirrored what mattered to the client:
- Hours of manual review saved per game, for each version of the model.
- Execution time per face crop — directly translates to videos-per-server-hour.
- Average mismatching rate per sport and per match — the production-impact metric.
We watched these throughout training and used them, not accuracy on a public benchmark, to pick the model version that actually shipped.
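As one illustration, a per-sport, per-match mismatch rate could be computed roughly like this. The record schema (`sport`, `match_id`, `predicted`, `actual`) is hypothetical, and the sketch assumes "mismatching rate" means the fraction of reviewed matches that name the wrong fan; the exact definition used in the project is not spelled out here.

```python
from collections import defaultdict

def mismatch_rates(records):
    """Fraction of wrong identifications for each (sport, match_id) pair.

    records: iterable of dicts with hypothetical keys
    'sport', 'match_id', 'predicted', 'actual'.
    """
    wrong = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["sport"], r["match_id"])
        total[key] += 1
        if r["predicted"] != r["actual"]:
            wrong[key] += 1
    return {key: wrong[key] / total[key] for key in total}

def average_by_sport(per_match):
    """Average the per-match rates within each sport."""
    grouped = defaultdict(list)
    for (sport, _match_id), rate in per_match.items():
        grouped[sport].append(rate)
    return {sport: sum(rates) / len(rates) for sport, rates in grouped.items()}

# Illustrative reviewed predictions (made-up data).
records = [
    {"sport": "basketball", "match_id": "g1", "predicted": "u1", "actual": "u1"},
    {"sport": "basketball", "match_id": "g1", "predicted": "u2", "actual": "u2"},
    {"sport": "basketball", "match_id": "g1", "predicted": "u3", "actual": "u9"},
    {"sport": "basketball", "match_id": "g1", "predicted": "u4", "actual": "u4"},
    {"sport": "basketball", "match_id": "g2", "predicted": "u5", "actual": "u5"},
    {"sport": "basketball", "match_id": "g2", "predicted": "u6", "actual": "u6"},
    {"sport": "hockey", "match_id": "g3", "predicted": "u7", "actual": "u7"},
    {"sport": "hockey", "match_id": "g3", "predicted": "u8", "actual": "u1"},
]

per_match = mismatch_rates(records)
print(per_match)
print(average_by_sport(per_match))
```

Averaging per match first, then per sport, keeps one unusually crowded game from dominating the sport-level number.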
Gender and age — to shrink the matching subset
The matching problem gets meaningfully easier if you can filter candidates before running cosine similarity. We trained Xception-based classifiers for gender and age. The trick: at match time, women’s selfies only get compared against women’s broadcast crops; older fans only against older broadcast crops. The candidate pool shrinks, false matches drop, latency improves.
- Gender classification was reliable enough to use universally across all matches.
- Age classification was less stable on small or low-quality crops, so we only applied it when the broadcast crop quality cleared a threshold.
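The filtering rule above might look like the following sketch. The field names (`gender`, `age_bucket`, `quality`), the bucket labels, and the 0.7 quality threshold are assumptions for illustration only.

```python
def filter_candidates(crop, pool, quality_threshold=0.7):
    """Narrow the opted-in pool before running embedding similarity.

    crop: dict with hypothetical keys 'gender', 'age_bucket', 'quality'
          (classifier outputs plus a crop-quality score in [0, 1]).
    pool: list of dicts with 'fan_id', 'gender', 'age_bucket'.
    """
    # The gender filter is applied to every crop.
    candidates = [fan for fan in pool if fan["gender"] == crop["gender"]]
    # The age filter is applied only when the crop is good enough to trust it.
    if crop["quality"] >= quality_threshold:
        candidates = [
            fan for fan in candidates if fan["age_bucket"] == crop["age_bucket"]
        ]
    return [fan["fan_id"] for fan in candidates]

# Illustrative pool (made-up data).
pool = [
    {"fan_id": "u1", "gender": "f", "age_bucket": "18-30"},
    {"fan_id": "u2", "gender": "f", "age_bucket": "50+"},
    {"fan_id": "u3", "gender": "m", "age_bucket": "18-30"},
]

sharp_crop = {"gender": "f", "age_bucket": "50+", "quality": 0.9}
blurry_crop = {"gender": "f", "age_bucket": "50+", "quality": 0.3}

print(filter_candidates(sharp_crop, pool))   # gender and age filters both apply
print(filter_candidates(blurry_crop, pool))  # only the gender filter applies
```

Skipping the age filter on weak crops trades a larger candidate pool for a lower chance of filtering out the correct fan.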
Emotion recognition — finding the moments that matter
Beyond identification, the product wants to surface the emotional moments — fans cheering, gasping, celebrating. That gives the client both a clip-cutting signal (which seconds are highlight-worthy) and analytics (which moments of a broadcast hit hardest with the crowd).
We labeled face crops with the six basic emotions — anger, disgust, fear, happiness, sadness, surprise — and started with a single-frame Xception classifier. It reached 71% accuracy, which sounded reasonable in a vacuum but was nowhere near production-ready for a consumer product.
The fix came from treating emotion as a sequence problem instead of a single-frame one. We built a model that ingests a short sequence of consecutive face crops and outputs an “excited face score” from 0.0 to 1.0: continuous, calibratable, and much more useful for downstream highlight-cutting logic than a single-class label. The sequence-based scorer reached 91% accuracy. That’s what shipped.
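To show why a continuous score is convenient downstream, here is a hedged sketch of one way highlight-cutting logic could consume it. The function, the 0.8 threshold, and the minimum run length are hypothetical; the actual downstream logic is not described in this write-up.

```python
def highlight_windows(scores, threshold=0.8, min_length=2):
    """Find runs of consecutive time steps whose score stays at or above threshold.

    scores: list of excited-face scores in [0.0, 1.0], one per time step.
    Returns (start, end) index pairs with end exclusive; runs shorter
    than min_length are dropped as likely noise.
    """
    windows = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            if i - start >= min_length:
                windows.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        windows.append((start, len(scores)))
    return windows

scores = [0.1, 0.85, 0.9, 0.95, 0.3, 0.9, 0.2, 0.8, 0.88]
print(highlight_windows(scores))  # [(1, 4), (7, 9)]
```

With a single-class label, the threshold and minimum-length knobs would not exist; a calibrated score lets them be tuned without retraining the model.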
Scene detection — boundary frames in continuous broadcast
Broadcast video doesn’t have clean cuts the way edited content does. The camera pans, sweeps, transitions through motion blur. To cut a clean fan clip you need to know where the scenes actually begin and end. We built a custom architecture stacking Conv3D layers to do this — initially modeled on a research paper, then redesigned for the speed and accuracy our pipeline needed. Scene boundaries gate the clipping logic downstream.
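A small sketch of how detected boundaries could gate clipping: given a timestamp of interest, look up the scene that contains it and cut on that scene's edges. The function name and the boundary timestamps are illustrative assumptions; this shows only the lookup, not the Conv3D detector itself.

```python
from bisect import bisect_right

def enclosing_scene(boundaries, t):
    """Return (start, end) of the scene containing time t, in seconds.

    boundaries: sorted list of scene-start timestamps followed by the
    end time of the footage. Returns None if t lies outside the footage.
    """
    if len(boundaries) < 2 or t < boundaries[0] or t >= boundaries[-1]:
        return None
    i = bisect_right(boundaries, t)  # index of the first boundary after t
    return boundaries[i - 1], boundaries[i]

# Made-up boundaries, as a scene detector might emit them.
boundaries = [0.0, 12.5, 31.0, 47.2, 60.0]

print(enclosing_scene(boundaries, 35.0))  # (31.0, 47.2)
print(enclosing_scene(boundaries, 75.0))  # None
```

Cutting on scene edges instead of a fixed window around the moment avoids clips that start or end in the middle of a pan or transition.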
Auxiliary models
The full engine also includes several supporting models, each smaller but each removing a manual step from the original workflow:
- Fan / non-fan frame classifier — skip frames that show only the field, players, or staff. Cuts compute upstream of identification.
- Audio noise removal — clean up the audio of cut clips before delivery.
- Background removal from video — for use cases where the client wants to compose fan clips on alternative backdrops.
- Automated image captioning — natural-language descriptions of what’s in the clip, used for metadata and search inside the app.
Sport-specific variants
We trained separate model versions for basketball, baseball, hockey, and football. Broadcasts differ enough across sports — camera setups, crowd density, lighting, pacing — that a unified model traded too much accuracy on the most-played sports for marginal gains on the long tail. Per-sport variants beat one big model on every metric that mattered.
Deployment — Flask microservices on Azure
Every model in the engine ships as a Flask microservice. The client’s backend is C#; HTTP boundaries between services kept the language gap clean and let either side iterate independently. The full stack runs on Microsoft Azure, sized for multi-game-per-day throughput. We instrumented every model so the team could see latency per stage, push hot paths onto larger instances, and detect quality regressions when a new model version went out.
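For readers unfamiliar with the pattern, a model-serving Flask service can be as small as the sketch below. The route names, the JSON payload shape, and the `score_crop` placeholder are assumptions for illustration; they are not the client's actual API, and a real service would load a trained model at startup.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def score_crop(pixels):
    """Placeholder for model inference (hypothetical).

    Returns a constant so the sketch runs without a trained model.
    """
    return 0.5

@app.route("/health", methods=["GET"])
def health():
    # Lets the calling backend or a load balancer check that the service is up.
    return jsonify(status="ok")

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    if not payload or "pixels" not in payload:
        return jsonify(error="expected JSON body with a 'pixels' field"), 400
    return jsonify(score=score_crop(payload["pixels"]))
```

Because the contract is plain JSON over HTTP, a C# backend can call `/predict` without knowing anything about Python, and the model behind `score_crop` can be swapped without touching the caller.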
Results
After the system shipped to production:
- Labor cost on video preparation dropped ~90%. What had been frame-by-frame manual review became an automated pipeline the team supervised rather than executed.
- Per-video processing time fell from ~3 hours to ~30 minutes — a 6× speedup on the unit of work that actually mattered for the product.
- 20 new TV-channel contracts were signed off the back of the faster turnaround. Networks needed clips inside specific windows; a multi-day turnaround was a dealbreaker, a 30-minute one wasn’t.
- The cost-and-labor reduction freed the client’s product team to shift focus to new venues and other parts of the platform rather than scaling headcount on manual review.
When this architecture applies
The pattern — a multi-model CV engine where individual models handle identification, classification, segmentation, and content cleanup, each evaluated by business-relevant metrics, glued together as microservices — generalizes well beyond this specific client. The same shape applies to:
- Other sports tech companies building fan-content products that have to bridge selfie-quality data with messy in-the-wild broadcast or stadium footage.
- Live-event platforms (concerts, conferences, festivals) where attendees opt in with photos and want personalized memorabilia after the event.
- Out-of-home video systems where the same individual appears across multiple low-quality cameras and needs to be re-identified with strong precision guarantees.
A surprising amount of the engineering value here wasn’t in any single model — it was in the dataset, the labeling pipeline, and the custom business metrics used to pick what shipped. That part transfers cleanly across domains.