Article

Seamless User-Generated Content Processing for Smart Media: Delivering QoE-Aware Live Media with YOLO-Based Bib Number Recognition

by Alberto del Rio 1,*, Álvaro Llorente 1, Sofia Ortiz-Arce 1, Maria Belesioti 2, George Pappas 2, Alejandro Muñiz 1,3, Luis M. Contreras 3 and Dimitris Christopoulos 4
1 Signals, Systems and Radiocommunications Department, Universidad Politécnica de Madrid, 28040 Madrid, Spain
2 Hellenic Telecommunications Organization, S.A., 15124 Athens, Greece
3 Telefónica Innovación Digital, 28050 Madrid, Spain
4 Foundation of the Hellenic World, 17778 Athens, Greece
* Author to whom correspondence should be addressed.
Electronics 2025, 14(20), 4115; https://doi.org/10.3390/electronics14204115
Submission received: 18 September 2025 / Revised: 16 October 2025 / Accepted: 20 October 2025 / Published: 21 October 2025

Abstract

The increasing availability of User-Generated Content during large-scale events is transforming spectators into active co-creators of live narratives while simultaneously introducing challenges in managing heterogeneous sources, ensuring content quality, and orchestrating distributed infrastructures. A trial was conducted to evaluate automated orchestration, media enrichment, and real-time quality assessment in a live sporting scenario. A key innovation of this work is the use of a cloud-native architecture based on Kubernetes, enabling dynamic and scalable integration of smartphone streams and remote production tools into a unified workflow. The system also included advanced cognitive services, such as a Video Quality Probe for estimating perceived visual quality and an AI Engine based on YOLO models for detection and recognition of runners and bib numbers. Together, these components enable a fully automated workflow for live production, combining real-time analysis and quality monitoring, capabilities that previously required manual or offline processing. The results demonstrated consistently high Mean Opinion Score (MOS) values, remaining above 3 in 72.92% of the measurements, confirming acceptable perceived quality under real network conditions, while the AI Engine achieved strong performance with a Precision of 93.6% and Recall of 80.4%.

1. Introduction

The production and consumption of media content during large-scale events has undergone a profound transformation in recent years [1]. Spectators are no longer passive viewers; empowered by the ubiquity of smartphones and high-speed mobile networks, they have become active contributors of live content, often complementing or even competing with professional broadcasting streams [2]. This paradigm shift creates new opportunities for more diverse, personalized event coverage, while simultaneously introducing significant technical challenges [3].
Although User-Generated Content (UGC) provides a flexible and scalable approach for capturing live events with mobile devices, current systems present significant limitations that restrict their use in professional media production. Mobile contributors operate under diverse and often unpredictable conditions, such as varying 5G network coverage, inconsistent device performance, uncontrolled scene lighting, and unsteady camera operation. These challenges result in fluctuating video quality, high latency, and poor stream synchronization. Distortion in UGC videos is inevitable, making it crucial to assess the quality of encoded and distorted content accurately. Therefore, effective UGC integration demands a more intelligent and robust control layer to guarantee that the captured content aligns with the technical requirements of live multimedia workflows and delivers a high-quality viewing experience [4]. Managing the heterogeneity of media sources, guaranteeing consistent Quality of Service (QoS), and orchestrating resources across distributed infrastructures are essential for delivering a seamless user experience [5,6]. At the same time, media-oriented businesses are increasingly seeking cost-effective solutions to enhance user engagement in live events, leveraging participatory content models as both a technological innovation and a sustainable business strategy [7].
Cloud–edge–IoT solutions usually target industrial sectors, yet any new technology also needs citizen acceptance; solutions for media and entertainment therefore deserve equal attention [8]. To showcase the potential of cloud technologies in these areas, we decided to create a prototype solution for live events. We propose a solution that enhances the spectator experience for live marathon running events using the resource management functionalities of a cloud-based Meta Operating System (meta-OS), namely the Next Generation Meta Operating System (NEMO) [9]. NEMO is a meta-OS framework being developed as part of a three-year EU-funded project, which seeks to address these challenges through the development of an intelligent orchestration layer that operates across the cloud–edge–device continuum [10]. Its main objective is to enable flexible, adaptive, and multi-domain service provisioning through a unified meta-OS [11]. Considering the complexity of heterogeneous infrastructures, NEMO supports dynamic resource allocation, intent-driven service management, and cross-domain interoperability, thereby facilitating a wide range of application areas.
Validation within NEMO follows a Living Labs methodology, which integrates real-world deployments with active stakeholder engagement. In contrast to closed testbeds, Living Labs are open, collaborative environments where users, operators, and industry partners jointly evaluate solutions under realistic conditions. Each Living Lab addresses a specific vertical domain, such as media, mobility, or Industry 4.0, while the meta-OS orchestrates heterogeneous resources to meet their requirements.
An implementation of this methodology is the Smart Media City use case (https://meta-os.eu/index.php/trial-5/, accessed on 2 September 2025), designed to evaluate the applicability of meta-OS within the media domain. This use case investigates how distributed infrastructures can support live media coverage in large public events, emphasizing scalability, interactivity, and user engagement. It serves both as a demanding technical benchmark, given the stringent requirements of live media workflows regarding latency, bandwidth, synchronization, and perceived quality, and as a demonstrator of the social and business potential of a hybrid media ecosystem enabled by participatory and AI-driven orchestration.
Specifically, the Smart Media use case was validated during a race event in Athens, where spectators and contributors captured live media along the running circuit, while remote production was coordinated from Spain. Incoming streams were processed at the edge, where they were annotated, transcoded, and handled for live broadcast. This setup was intended not only to assess the technical robustness of distributed media workflows, but also to explore how spectators can be transformed into co-creators of event narratives. Moreover, the trial enhanced media streams with AI-based detection and recognition and GPS-driven contextual information, thereby enriching user experience through interactivity and contextualization. A key research objective was to examine the feasibility of remote production and the combined use of 5G and WiFi connectivity to ensure resilience, scalability, and high-quality delivery of media services in demanding live-event environments.
The main scientific contributions of this work center on the experimental validation and performance analysis of an automated, distributed media ecosystem, specifically:
  • Transformation of spectator roles: Exploring and proving the concept of how spectators can be transformed from passive consumers into active co-creators of event narratives by seamlessly integrating UGC into a professional live production workflow.
  • Feasibility of remote production: Examining the feasibility of remote production and the combined use of 5G and WiFi connectivity to ensure resilience, scalability, and high-quality service delivery in demanding live-event environments.
  • Validation of automated Quality of Experience (QoE) management: Demonstrating the robust performance of a fully automated workflow for live media production that integrates real-time QoE monitoring and analysis.
  • Assessment of AI-enhanced distributed media: Assessing the technical robustness and functional performance of AI-based media enrichment (runner/bib number detection) within a live, distributed workflow.
  • Living Labs methodology validation: Validating the Living Labs methodology within the NEMO framework to assess the social and business potential of a hybrid media ecosystem enabled by participatory and AI-driven orchestration.
Building upon these findings, the core technical novelties proposed through the development of the use case and its underlying components include:
  • Hybrid real-time media pipeline: Implementation of an end-to-end distributed media pipeline that supports a hybrid content model with real-time semantic enrichment.
  • Integrated cognitive services: The seamless integration of advanced cognitive services into the live pipeline.
  • Novel cloud-native architecture: The deployment of a cloud-native architecture based on Kubernetes to enable dynamic and scalable integration of heterogeneous sources (smartphone streams and remote production tools) into a unified live media workflow.
  • Meta-OS orchestration: Leveraging the NEMO framework to provide dynamic resource management and workload migration; autonomous scalability for AI/ML nodes based on predefined Service Level Objectives (SLOs); and advanced network management features to ensure bandwidth requirements and low latency.

2. Background and Related Work

2.1. Smart Media and Immersive Event Coverage in the Literature

Recent research in both academic and industrial domains has increasingly emphasized the role of smart media and immersive event coverage as indicators of a significant transformation in media consumption practices. Audiences have evolved from being passive recipients to active participants, driven by the widespread adoption of smartphones and the availability of high-speed networks [12]. This emerging paradigm transcends traditional broadcasting by promoting an interactive ecosystem that integrates professional media streams with UGC, aiming to deliver highly engaging and personalized narratives [13]. Nonetheless, the literature highlights a central challenge in migrating from centralized, cloud-dependent architectures to distributed infrastructures capable of efficiently managing the heterogeneity and scale of these new media sources [14].
At the core of this emerging paradigm lies the concept of media orchestration, understood as the intelligent coordination of multimedia assets and computational resources across the edge–cloud–device continuum to ensure a seamless user experience and consistent QoS [15,16]. Current research explores frameworks such as meta operating systems, designed to unify heterogeneous resources within a single adaptive layer, thereby enabling dynamic resource allocation and intent-driven service provisioning in complex media workflows. Crucially, this body of work emphasizes that technological progress alone is insufficient; its social and economic impact must also be demonstrated. Accordingly, the literature highlights the importance of Living Labs and other real-world experimentation environments as fundamental mechanisms to assess the technical, social, and commercial viability of emerging media solutions, effectively bridging the gap between controlled testbeds and large-scale market deployment [17,18].

2.2. Edge Computing, AI/ML for Media Annotation, and UGC Orchestration

The convergence of edge computing and AI has created new paradigms for real-time media processing, particularly for live events. Traditional cloud-centric models face challenges with high latency and bandwidth expenses when transmitting large volumes of raw video data from event sites to central servers [19]. Edge computing addresses this by shifting computational tasks, such as transcoding, filtering, and AI inference, to nodes located closer to the data source, namely the cameras and smartphones of contributors [20]. This architecture is critical for enabling low-latency applications. Concurrently, advancements in AI, especially deep learning models like You Only Look Once (YOLO) [21], have revolutionized media annotation. These models can perform real-time object detection, recognition, and tracking directly on edge devices, allowing for the immediate semantic enrichment of video streams.
The proliferation of smartphones and high-speed mobile networks has empowered spectators to become active content creators, leading to a surge in UGC during live events [22]. Automated recognition of objects, faces, or contextual features (e.g., bib numbers in sports events) provides semantic enrichment that enhances the value of UGC [23]. Deep learning frameworks like Transformer-based vision models have been applied to annotation tasks, offering robust accuracy even in heterogeneous and dynamic capture conditions [24,25]. These AI-driven annotations enable content indexing, personalized recommendation, and event-specific storytelling, bridging the gap between raw user contributions and professional-grade production standards [26]. In parallel, research on UGC orchestration highlights the importance of integrating crowd-sourced inputs into coherent workflows [27,28].
While previous studies have applied YOLO and other deep learning models to object detection in sports contexts, they are often limited to offline scenarios or isolated components. In contrast, our work demonstrates a fully integrated and real-time deployment of YOLO-based bib number detection and recognition within a complete media orchestration framework. Moreover, our approach goes beyond model performance, addressing the operational challenges of embedding AI-driven annotation into a live media workflow.

3. Smart Media City Use Case

3.1. Use Case Description

Traditional multimedia coverage has relied predominantly on professional camera crews and centralized production workflows. While this approach ensures high-quality output, it inherently limits the diversity of perspectives and often introduces latency in content distribution. The Smart Media City scenario addresses these limitations by enabling a hybrid model in which professional and crowd-sourced content are dynamically captured, processed, and distributed through the NEMO platform.
The core of this scenario is a multi-stage media pipeline designed to integrate heterogeneous sources into a coherent real-time experience. Spectators equipped with smartphones continuously capture media, while professional cameras and 360° video devices are strategically deployed at checkpoints, finish lines, and other key locations. This dual-layer architecture combines breadth of coverage, provided by spontaneous, crowd-driven contributions, with depth of quality, guaranteed by professional-grade reliability.
Captured media streams are enriched with contextual information, such as geolocation metadata, and contributors may further enhance submissions through manual annotations, providing narrative detail or highlighting specific incidents in real time. This combination of automated analysis and user-driven input transforms raw media into semantically enriched content objects, which can be systematically indexed, searched, and prioritized for distribution.
To ensure scalability and responsiveness, computationally intensive processing tasks are offloaded to edge nodes distributed throughout the infrastructure. These nodes execute advanced operations such as video analytics (e.g., runner bib detection), transcoding, and quality adaptation. By offloading demanding workloads, even resource-constrained devices can participate meaningfully, while edge processing reduces bandwidth requirements and supports near-real-time responsiveness.
Distribution of media follows a hybrid strategy: content is delivered both to dedicated race applications and to public IP channels for broader external access. This model not only increases accessibility but also promotes the creation of a co-produced narrative, where citizen journalists and professional broadcasters contribute on an equal basis. A distinctive element of the Smart Media City scenario is its emphasis on interactive audience participation. Contributors act as active co-creators rather than passive sources, refining uploads, adding annotations, and curating content retrospectively. Likewise, spectators and remote audiences can interact with content by commenting, voting on highlights, or notifying organizers about incidents along the race circuits.
Implementing such a solution traditionally requires overcoming various risks, such as the need for large computing and network resources, the avoidance of high network latency, security, and other factors. This could all be achieved by deploying heavy hardware on site together with local private networks. However, since we based our implementation on top of a meta-OS framework, we had out-of-the-box support for many features, such as:
  • Resource management and migration of workloads across the continuum: The NEMO Meta-Orchestrator (MO) dynamically manages and migrates workloads, optimizing the placement of demanding tasks (such as runner bib detection) based on real-time resource availability and network conditions.
  • Scalability in deploying additional AI/ML nodes on the cloud: This plane monitors defined SLOs and uses AI-driven logic in the NEMO Cybersecure Federated Deep Reinforcement Learning (CF-DRL) component to issue autonomous scaling decisions to the MO.
  • Advanced network management to ensure bandwidth requirements: In this plane, the NEMO meta Network Cluster Controller (mNCC) integrates network features such as monitoring and ensures that bandwidth requirements are met and latency remains low.
  • A secure execution environment: This not only manages access but enforces attested integrity and secure communication for every distributed component.
All this enabled us to use only small IoT devices locally with 5G coverage, as the major functionality ran in the cloud, thus lowering not only the cost but also the hardware complexity.

3.2. System Components and Functional Roles

Figure 1 presents the complete end-to-end architecture of the Smart Media City use case, organized into five functional areas: acquisition, processing, cognitive enrichment, production control, and delivery. The workflow follows a well-defined multimedia pipeline, starting from the capture of raw media streams at the edge and concluding in the delivery of enriched broadcasts to spectators.
The pipeline begins with the Race Stream App, deployed on user smartphones. This application enables contributors to record live video, capture still images, and provide annotations directly from the race circuit. These inputs are then ingested through the Media Gateway Manager, which bridges external contributions with the internal processing pipeline, ensuring reliable integration of all incoming streams.
At the core of the system lies the Media Production Engine, which consolidates, processes, and prepares streams for live mixing. This engine incorporates several key subcomponents: the Stream Transcoder, which standardizes video formats and bitrates to maintain consistency; the Stream Monitor, which tracks performance metrics such as bitrate and memory utilization; and the Production Core, responsible for real-time operations including source selection, scene switching, and live mixing.
The Production Control layer provides remote management capabilities for media workflows. During the trial, operators based in Madrid accessed the Production Core through a secure VPN connection. This interface allowed remote producers to preview multiple incoming feeds, adjust layouts, and control live streams with minimal latency.
Media streams were further refined within the Cognitive Service layer, which introduced AI-based processing and semantic enrichment. Among its key modules were the Video Quality Probe, responsible for automatically estimating the QoE with Mean Opinion Score (MOS) values in real time [29], and the AI Engine [30], which performed inference tasks such as runner detection and bib number recognition. In combination, these components enhanced contextual awareness and helped ensure the overall QoE throughout the trial.
Finally, enriched and curated media streams were disseminated through the Media Delivery Manager, which distributed content to clients via the Real-Time Messaging Protocol (RTMP). The RTMP server was configured with two distinct network interfaces: one dedicated to ingesting input streams and another exposed externally through port forwarding for delivery. The output streams were ultimately consumed by the Race Spectator App, enabling audiences to follow the live event, interact with contributors, and access personalized highlights.
The main video stream, directed towards the Production Control layer for live-view monitoring, is optimized for minimal latency, achieving a near real-time playout with a maximum observable delay of 0.5 to 2 s due to buffering and transport protocols. Conversely, the information stream, which is processed by the Video Quality Probe and the AI Engine for cognitive services, is designed for metric generation to inform intelligent orchestration decisions rather than instant visual feedback. This cognitive processing path therefore tolerates higher latency: the Video Quality Probe is configured to return MOS values periodically every 3.5 s, and the AI Engine requires approximately 5 s for its inference and reporting cycle.

3.2.1. Race Stream and Spectator App

The Race Stream and Spectator App served as the primary interface between end users and the NEMO media ecosystem. Its design integrated content creation, live media consumption, and personal contribution management within a unified platform, thereby enabling both spectators and participants to engage actively with the event in real time.
The Race Stream and Spectator App enabled multi-perspective live streaming, providing users with access to diverse viewpoints and camera sources distributed along the race circuit. Spectators could seamlessly switch between streams to follow specific segments of interest, rather than being limited to a single broadcast feed. This functionality enriched the viewing experience by offering a comprehensive situational understanding of the event, extending beyond the scope of traditional curated highlights.
In addition to consumption, the application empowered users to contribute actively as content creators. Through an intuitive capture interface (see Figure 2), users could record short videos, shoot photos, or add contextual comments. Each submission was automatically tagged with geolocation and timestamp information, ensuring semantic alignment with the ongoing race. A guided workflow, covering location selection, capture, and annotation, made the contribution process straightforward, even for non-technical users.
Once uploaded, UGC contributions were seamlessly integrated into the media pipeline, making them available to both production teams and the final audience. The application also featured a “My Contributions” section (see Figure 3), where users could revisit their uploads, track feedback, and verify whether their content had been featured in the live broadcast. This feature promoted a sense of ownership and recognition, encouraging sustained user engagement.
Beyond content sharing, the application supported bidirectional interaction. Spectators could engage with contributions from others by posting comments or reacting to clips, while organizers were able to highlight selected crowd-sourced moments during the race. This interactive layer transformed the application into more than a passive viewing tool, facilitating collaborative storytelling that merged professional coverage with UGC.
The evaluation of the Smart Media City use case was supported by a multi-source data collection framework integrated into the NEMO platform. The core of this framework was the NEMO RabbitMQ message broker, which served as the information bus interconnecting cognitive services, monitoring probes, and end-user applications (see Figure 1). All relevant metadata, including AI-based annotations and quality metrics, was routed into dedicated queues, thereby ensuring a consistent and centralized flow of information for further analysis.
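For illustration, the following minimal sketch shows how a cognitive service could publish an annotation message to a dedicated queue on the broker; the host name, queue name, and payload fields are assumptions chosen for clarity rather than the exact schema used in the trial.

```python
# Minimal sketch of a cognitive service publishing metadata to RabbitMQ.
# Host, queue name, and payload fields are illustrative assumptions.
import json
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="nemo-rabbitmq.example.org")
)
channel = connection.channel()
channel.queue_declare(queue="smart_media.ai_annotations", durable=True)

message = {
    "stream_id": "camera-01",
    "timestamp": "2024-11-10T10:15:00Z",
    "bib_numbers": [106, 119],
    "mos": 3.4,
}
channel.basic_publish(
    exchange="",
    routing_key="smart_media.ai_annotations",
    body=json.dumps(message),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
```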

3.2.2. Media Gateway and Delivery Manager

The Media Gateway Manager functioned as the primary entry point for all live contributions to the NEMO platform. Streams originating from smartphones, 360° cameras, and professional devices were transmitted over UDP to a set of public IP endpoints hosted on the Proxmox v8.1 cluster (https://www.proxmox.com/, accessed on 20 May 2025). Each stream was assigned to a unique port, simplifying identification and routing across the infrastructure. The Media Gateway Manager was responsible for port forwarding and ingestion, ensuring that every input was accurately redirected to its corresponding Media Production Engine instance.
This architecture enabled parallel processing across three independent production pipelines, thereby providing redundancy and facilitating balanced allocation of computational resources. By separating streams at the gateway level, the system was capable of handling a large number of concurrent inputs without overloading individual components, while maintaining flexibility for orchestration and scaling. In this manner, the Media Gateway Manager served as a robust and modular bridge between external capture devices and the internal processing architecture.
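As an illustration of this port-based separation, the following minimal sketch forwards UDP packets from a public ingest port to its mapped Media Production Engine instance; the port numbers and internal host names are hypothetical and do not reflect the trial's configuration.

```python
# Minimal sketch of per-port UDP ingest forwarding at the gateway level.
# Port numbers and internal endpoints are illustrative assumptions.
import socket

PORT_MAP = {
    5001: ("production-engine-1.internal", 6001),
    5002: ("production-engine-2.internal", 6002),
    5003: ("production-engine-3.internal", 6003),
}

def relay(listen_port: int) -> None:
    """Forward UDP packets received on listen_port to the mapped engine instance."""
    destination = PORT_MAP[listen_port]
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("0.0.0.0", listen_port))
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        packet, _ = rx.recvfrom(65535)   # one UDP datagram at a time
        tx.sendto(packet, destination)   # redirect to the assigned pipeline
```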
In contrast, the Media Delivery Manager acted as the complementary output point, responsible for distributing curated and enriched streams to end users. Once finalized by the Production Core and Production Control, output streams were forwarded to the Media Delivery Manager through a dedicated RTMP entry point.
The RTMP service, hosted within the delivery manager, converted and rebroadcasted the content through a public interface. The Race Spectator App, running on standard mobile devices, subscribed directly to this public stream, allowing audiences to access the final mixed output enriched with contextual metadata and synchronized annotations.

3.2.3. Media Production Engine and Production Control

The Media Production Engine served as the central processing hub of the workflow, responsible for adapting, analyzing, and preparing live streams for final broadcast. It comprised a set of containerized components deployed across the Proxmox and Kubernetes environments, each designed to ensure quality, scalability, and resilience within the media pipeline.
At the input stage, the Stream Transcoder functioned as the primary handler of incoming live streams. Built on top of FFmpeg v6.1 (https://ffmpeg.org/, accessed on 30 April 2025), it dynamically adjusted bitrate and format in real time, thereby ensuring compatibility across heterogeneous sources and adapting to varying network conditions. To support scalability, one transcoder instance was provisioned per stream, enabling parallel processing of multiple video feeds without introducing bottlenecks. Transcoded outputs were subsequently forwarded to downstream components, including the AI Engine for semantic enrichment, the Video Quality Probe for quality assessment, and the Production Core for live mixing and source switching.
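The following minimal sketch illustrates how such a per-stream transcoding step can be wrapped around FFmpeg; the input/output endpoints, codec, and bitrate values are illustrative assumptions and not the exact Stream Transcoder configuration.

```python
# Minimal sketch of a per-stream transcoding wrapper around FFmpeg.
# URLs, codec, and bitrate ladder are illustrative assumptions.
import subprocess

def transcode(input_url: str, output_url: str, bitrate: str = "2500k") -> None:
    """Re-encode an incoming UDP stream to H.264 at a target bitrate and push it over RTMP."""
    cmd = [
        "ffmpeg",
        "-i", input_url,              # e.g. "udp://0.0.0.0:5001"
        "-c:v", "libx264",
        "-preset", "veryfast",
        "-b:v", bitrate,
        "-maxrate", bitrate,
        "-bufsize", "2M",
        "-c:a", "aac",
        "-f", "flv",
        output_url,                   # e.g. "rtmp://production-core/live/cam01"
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    transcode("udp://0.0.0.0:5001", "rtmp://production-core/live/cam01")
```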
Complementing the transcoder, the Stream Monitor continuously tracked the health and performance of each live feed. Real-time metrics such as input/output bitrate, CPU utilization, and memory consumption were collected, providing operators and the orchestration framework with a detailed view of resource usage and system stability.
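A simplified monitoring loop of this kind could be sketched as follows, assuming the psutil package; the bitrate reader is a placeholder, since the internal implementation of the Stream Monitor is not detailed here.

```python
# Minimal sketch of a stream-health monitor; the bitrate source is a placeholder.
import time
import psutil

def read_bitrate_kbps(stream_id: str) -> float:
    """Placeholder: a real probe would derive this from ffprobe output or socket counters."""
    return 0.0

def monitor(stream_id: str, interval_s: float = 5.0) -> None:
    while True:
        sample = {
            "stream_id": stream_id,
            "cpu_percent": psutil.cpu_percent(interval=None),
            "memory_percent": psutil.virtual_memory().percent,
            "bitrate_kbps": read_bitrate_kbps(stream_id),
        }
        print(sample)  # in the trial, such metrics would be forwarded to the orchestrator
        time.sleep(interval_s)
```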
The Production Core, implemented through Voctocore v1.3 (https://github.com/voc/voctomix/tree/voctomix2/voctocore, accessed on 20 May 2025), together with its companion control interface Voctogui v1.3 (https://github.com/voc/voctomix/tree/voctomix2/voctogui, accessed on 20 May 2025), constituted the central component of the live media mixing process. Built on the open-source Voctomix framework, Voctocore was deployed as a virtual machine within the Proxmox cluster, provisioned with substantial computational resources to support concurrent multi-stream processing. Its functionality enabled the real-time combination and switching of live inputs, including smooth transitions between camera angles, picture-in-picture layouts, and side-by-side views.
The Production Control was operated via Voctogui, a GTK-based graphical interface that provided remote producers with monitoring and control capabilities over the live mix. Through a secure VPN connection between Madrid and the Voctocore instance running in Athens, operators could interact with the production pipeline in real time and with minimal latency. This setup allowed for the execution of live editing tasks, such as switching layouts, overlaying additional content, or emphasizing specific moments, while offering immediate visual feedback of the active output. The configuration validated the feasibility of remote production workflows, demonstrating that geographically distributed teams could collaboratively manage live content creation with negligible delay. As illustrated in Figure 4, the left panel displays three camera feeds capturing different race segments, while the right panel renders the final output stream enriched with an embedded secondary feed in picture-in-picture mode.

3.2.4. AI Engine

The AI Engine was developed as a cognitive service to automate runner identification and provide contextual enrichment of video streams. Its primary function was the detection and recognition of runner bib numbers in real time, enabling live annotation of media content with structured metadata. This capability facilitated the seamless association of visual evidence with specific race participants, thereby enhancing the contextual relevance of captured contributions.
The detection pipeline followed a three-stage modular design [30]. In the first stage, a person detection model identified runners within video frames, isolating them from heterogeneous backgrounds containing spectators and background activity. The second stage applied a bib detection model to localize bib regions, ensuring bounding boxes were accurately positioned under varying angles and viewing conditions. In the final stage, a bib number recognition model extracted and classified the digits. This sequential structure enhanced robustness, as errors in one stage could be mitigated through targeted improvements in another.
Implementation relied on the YOLOv8 family of models [31], with the nano variant being selected for deployment to optimize inference latency and support near real-time performance in non-GPU environments. Larger YOLOv8 configurations were also trained and benchmarked on GPU-powered machines, serving as a reference for evaluating accuracy and scalability. For the live Smart Media trial, however, the lightweight nano model provided the most effective balance between detection accuracy and processing speed.
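To make the three-stage design concrete, the following minimal sketch chains three YOLOv8-nano models using the ultralytics package; the weight file names are hypothetical, and the digit-ordering logic is a simplified approximation rather than the deployed recognition step.

```python
# Minimal sketch of the three-stage cascade: person -> bib -> digits.
# Weight file names are hypothetical; digit classes are assumed to map to 0-9.
from ultralytics import YOLO

person_model = YOLO("runner_detector_yolov8n.pt")
bib_model = YOLO("bib_detector_yolov8n.pt")
digit_model = YOLO("digit_recognizer_yolov8n.pt")

def recognize_bibs(frame):
    """Return the bib numbers recognized in one video frame (BGR numpy array)."""
    numbers = []
    for person in person_model(frame)[0].boxes.xyxy.tolist():
        x1, y1, x2, y2 = map(int, person)
        runner_crop = frame[y1:y2, x1:x2]
        for bib in bib_model(runner_crop)[0].boxes.xyxy.tolist():
            bx1, by1, bx2, by2 = map(int, bib)
            bib_crop = runner_crop[by1:by2, bx1:bx2]
            digits = digit_model(bib_crop)[0].boxes
            # Sort detected digits left to right and concatenate their class labels.
            ordered = sorted(
                zip(digits.xyxy.tolist(), digits.cls.tolist()),
                key=lambda d: d[0][0],
            )
            numbers.append("".join(str(int(c)) for _, c in ordered))
    return numbers
```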
Each stage of the pipeline was trained on specialized datasets. Runner detection used the RBNR [32] and TGCRBNW [33] datasets, offering diverse annotations of athletes in motion. Bib localization relied on the BDBD dataset [34], which captures bibs under varied lighting, occlusion, and orientation conditions. Finally, digit recognition was trained on the Street View House Numbers (SVHN) dataset [35], providing robust samples for unconstrained digit classification.
Model training was conducted on external GPU servers, enabling extensive hyperparameter tuning, data augmentation, and comparative evaluation across YOLOv8 variants. Augmentation strategies such as random rotation, scaling, and blurring were applied to replicate the diversity of capture conditions in outdoor race environments. Once optimized, the models were containerized and deployed within the NEMO orchestration framework, running as Kubernetes pods to ensure scalability, resilience, and portability.
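A minimal training sketch for one pipeline stage, assuming the ultralytics package, is shown below; the dataset descriptor and hyperparameter values are illustrative and do not reproduce the project's exact training configuration.

```python
# Minimal sketch of training one cascade stage with rotation/scaling augmentation.
# Dataset descriptor and hyperparameters are illustrative assumptions.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # start from the pretrained nano checkpoint
model.train(
    data="bib_detection.yaml",  # hypothetical dataset descriptor (BDBD-style annotations)
    epochs=100,
    imgsz=640,
    degrees=10.0,   # random rotation
    scale=0.5,      # random scaling
    batch=16,
)
model.export(format="onnx")  # package the model for containerized deployment
```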

3.2.5. Video Quality Probe

The Video Quality Probe is a cognitive component designed to evaluate the perceptual quality of video streams in real time. Its primary objective is to support adaptive streaming mechanisms and trigger alerts in the event of quality degradation, relying on objective, data-driven metrics. This approach overcomes the limitations of traditional subjective evaluation methods, which are costly and time-consuming to carry out. The Video Quality Probe provides consistent and scalable assessment of QoE by automating the estimation of MOS values, following the Recommendation ITU-R BT.500 [36]. The model outputs scores on a five-point scale, ranging from 1 (Bad) to 5 (Excellent), thereby enabling standardized evaluation of perceived video quality.
The core of the system is a machine learning model trained on a heterogeneous dataset comprising more than 3000 annotated video sequences drawn from established public video databases. The dataset was curated to encompass diverse video content and degradation types, ensuring robustness and generalization across real-world scenarios.
The feature extraction process was based on the AGH Video Quality of Experience Team’s tool [37], which provided 70 video quality indicators capturing both spatial and temporal impairments. These included blockiness, block loss, blur, flickering, freezing, contrast variations, noise, and exposure issues. By leveraging this extensive feature set, the model was able to capture global quality trends while simultaneously detecting localized artifacts. Previous analysis revealed that blockiness and block loss were the two main factors that allowed high-quality content to be distinguished from low-quality or even lossy content [38].
A systematic approach was followed to develop the MOS estimation model. The dataset was first normalized using standard scaling, followed by dimensionality reduction through Principal Component Analysis (PCA) to enhance efficiency and minimize overfitting. Two predictive models were tested: Random Forest regressor and Ridge regression, both implemented with scikit-learn v1.2.0 and optimized using GridSearchCV (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html, accessed on 7 October 2025) with five-fold cross-validation. The Random Forest model achieved the best performance, with a mean R2 score of 0.15 and a mean negative mean-squared error of −0.61, demonstrating its ability to capture key variability in subjective quality scores. Based on this performance, the Random Forest regressor was selected as the regression model for the Video Quality Probe component.
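The following sketch outlines this workflow with scikit-learn (scaling, PCA, and a Random Forest tuned via five-fold GridSearchCV); the hyperparameter grid and the PCA variance threshold are illustrative assumptions, not the values used for the deployed model.

```python
# Minimal sketch of the MOS regression workflow: scaling, PCA, Random Forest + GridSearchCV.
# X would hold the 70 AGH indicators per interval, y the annotated MOS values.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95)),          # retain 95% of variance (assumed threshold)
    ("rf", RandomForestRegressor(random_state=42)),
])

param_grid = {                                # illustrative grid
    "rf__n_estimators": [100, 300],
    "rf__max_depth": [None, 10, 20],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,                                     # five-fold cross-validation
    scoring="neg_mean_squared_error",
)
# search.fit(X, y)  # fit once the feature matrix and MOS labels are loaded
```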
The Video Quality Probe facilitates early detection of quality degradation by enabling real-time monitoring of video streams. The component supports adaptive streaming strategies and optimized resource allocation in video delivery systems. Its capability to accurately predict MOS values from a diverse set of quality features not only improves operational efficiency but also provides actionable insights to enhance video compression, encoding, and transmission strategies.
During the use case trial, the Video Quality Probe was deployed to evaluate its performance under live conditions using the video streams generated within the project scenario. The component captured streams in real time, analyzing visual quality and producing results at 2 s intervals. For each interval, the system computed a predicted MOS value, extracted detailed video quality metrics, and provided alerts whenever degradations were detected. All generated information was published to a RabbitMQ messaging queue, ensuring integration with other components of the NEMO platform. Each message encapsulates summarized information about the analyzed video interval (an illustrative payload is sketched after the following list):
  • Stream metadata, including path, timestamp, image resolution (width and height), frame rate, and total number of frames.
  • Video quality indicators, such as spatial activity, temporal activity, blur, blockiness, block loss, exposure, contrast, interlace artifacts, noise, and flickering, offering a comprehensive assessment of both spatial and temporal impairments. Additional flags (e.g., letterbox, pillarbox, freezing, blackout) indicate the presence or absence of specific visual artifacts.
  • Audio quality indicators, covering average and peak audio volume.
  • Binary alert flags for both video and audio, highlighting whether predefined thresholds were exceeded for specific metrics, including blur, blockiness, block loss, freezing, uniform frame, black frame, no audio, and silence.
  • Predicted quality value (MOS value), providing an estimate of perceived quality for the analyzed video interval.
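An illustrative, non-verbatim example of such a message payload is given below; field names and values are assumptions chosen to reflect the categories listed above, not the exact schema emitted by the probe.

```python
# Illustrative example of a Video Quality Probe message as published to RabbitMQ.
# Field names and values are assumptions for clarity.
probe_message = {
    "stream": {"path": "rtmp://ingest/cam01", "timestamp": "2024-11-10T10:15:00Z",
               "width": 1920, "height": 1080, "fps": 30, "frames": 60},
    "video_indicators": {"spatial_activity": 52.1, "temporal_activity": 18.4,
                         "blur": 3.6, "blockiness": 0.9, "block_loss": 1.2,
                         "exposure": 118.0, "contrast": 45.2, "noise": 0.7,
                         "flickering": 0.1, "freezing": False, "blackout": False},
    "audio_indicators": {"avg_volume_db": -23.5, "peak_volume_db": -6.1},
    "alerts": {"video": False, "audio": False},
    "predicted_mos": 3.3,
}
```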

4. Implementation and Deployment

4.1. Technical Setup

The deployment of the Smart Media City use case was supported by a cloud-native, containerized infrastructure designed to ensure scalability, modularity, and automated service orchestration. The technical setup was structured into three main layers: virtualized resources provisioned through Proxmox, service orchestration via Kubernetes, and continuous deployment automation using GitLab and FluxCD. All components were hosted in Athens, with the exception of the Production Control, which operated remotely from Madrid and was securely connected via VPN.
At the infrastructure layer, Proxmox served as the virtualization backbone, providing and managing the computing resources required for the Living Lab trials. It enabled the creation of a flexible pool of virtual machines that could be dynamically allocated to different functional roles, including edge processing nodes, orchestration controllers, and media servers. This virtualization layer ensured efficient hardware utilization while isolating workloads to preserve performance and reliability under varying user demands. On top of these resources, Kubernetes acted as the orchestration engine for containerized services. All core components of the Smart Media City workflow, from media ingestion to transcoding, were deployed as microservices within Kubernetes pods.
To streamline deployment and updates, the consortium adopted a GitOps [39] workflow based on GitLab for version control and FluxCD for continuous delivery. Each service was described declaratively in configuration files stored within GitLab repositories. Modifications to the codebase or deployment descriptors automatically triggered CI/CD pipelines, which validated and propagated updates into the Kubernetes clusters. FluxCD continuously reconciled the actual cluster state with the declared configuration, ensuring that deployed services always aligned with the desired specifications. This approach not only accelerated development cycles but also reinforced reproducibility and robustness across experimental implementations.
The integration of these technologies provided strong automation across the deployment lifecycle. All core media components, including the Stream Transcoder, Stream Monitor, Production Core, Video Quality Probe, and AI Engine, were implemented as microservices and deployed as distinct pods within a dedicated Kubernetes cluster namespace. Other services, such as the Media Gateway Manager, Media Delivery Manager, and Production Control, were installed on virtual machines in the same Proxmox environment. New versions of AI/ML models, for instance, could be containerized, committed to the GitLab repository, and seamlessly rolled out into the Smart Media platform with minimal manual intervention.

4.2. Experimental Setup

The validation of the Smart Media City use case was conducted at the Egaleo Municipal Stadium in Athens (see Figure 5 for the race track and surrounding area), selected as a representative environment for replicating the dynamics of a large-scale race. The stadium offered a controlled but realistic setting, characterized by a dense presence of spectators, continuous movement of participants, and heterogeneous media capture conditions. The trial was fully supported by 5G coverage across the venue, providing high-bandwidth and low-latency connectivity to both contributors and platform services.
Connectivity between the stadium and the NEMO infrastructure was established through a dedicated Access Point Name (APN) called i-trialnemo, configured with an IPsec tunnel to ensure traffic integrity and confidentiality. This private network setup enabled stable communication across smartphones, edge servers, and the central orchestration infrastructure. Public and private IP addressing schemes were carefully designed to expose streaming endpoints while maintaining isolation of the internal Kubernetes services.

5. Results and Validation

The final validation phase of the Smart Media pilot focused on the real-time orchestration of multimedia streams, demonstrating the seamless integration of AI-based analysis, remote production, and live content delivery to end users. The complete multimedia pipeline was tested under actual race conditions, thereby demonstrating its functionality and robustness in a realistic deployment scenario.
A first set of results focused on audience reach and engagement. The live broadcast attracted between 15 and 25 viewers, connecting not only from Athens but also from Madrid and other European locations. These users accessed the curated streams through the Race Spectator App, confirming that the platform could support both local and remote participation simultaneously. The achieved range was well within the project’s target of 10–100 concurrent consumers, demonstrating the feasibility of the use case for multi-site events.
In terms of media quality, the Video Quality Probe continuously estimated MOS values for each stream. During the trial, all three active streams consistently maintained MOS scores above 3, which is the typical threshold used to indicate “acceptable” user experience. This threshold value is defined in Recommendation ITU-R BT.500 [36] and corresponds to the “minimum quality” outlined in Recommendation EBU R132 [40] for signal quality in HDTV production and broadcast services.
The platform also demonstrated its ability to integrate a diverse set of audiovisual sources. During the trial, a total of six distinct inputs were processed: five live smartphone streams transmitted over the 5G network, complemented by a 360-degree camera feed offering immersive event coverage. This diversity validated the flexibility of the NEMO media pipeline in handling both conventional and high-bandwidth content.
From an architectural perspective, the deployment relied on a comprehensive set of services orchestrated across Proxmox and Kubernetes clusters. These included multiple replicas of the Stream Transcoder, AI Engine, and Video Quality Probe, as well as the RTMP server, Production Core, and Production Control interface. The adoption of containerized services not only ensured scalability and redundancy but also enabled detailed monitoring of resource allocation and service performance.

5.1. Captured Streams and Video Quality Assessment

To facilitate the post-analysis of media quality, a representative video stream was recorded during the trial. This recording captured different perspectives of the race, reflecting variations in camera angles, runner positions, and group sizes along the track. The collected material offered a diverse foundation for evaluating both technical parameters and perceptual aspects of video quality. Figure 6 presents selected frames extracted from the 90 s video stream.
Table 1 summarizes the video quality metrics analyzed for the sample video. The average MOS value was 3.27, corresponding to a quality level generally perceived as more than “acceptable” in Recommendation ITU-R BT.500 [36] and Recommendation EBU R132 [40]. The relatively low standard deviation of 0.34 indicates that perceived quality remained consistent across the analyzed video.
Among specific quality indicators, blockiness and blur exhibited low mean values of 0.95 and 3.63, respectively, suggesting minimal compression artifacts and overall visual clarity. By contrast, the block loss metric presented a higher standard deviation of 1.92 and a maximum value of 8.78, pointing to occasional spikes in lost data blocks that may have introduced visual disruption. Other indicators, such as spatial activity, exposure, and contrast, remained within ranges typical of standard video content.
The MOS analysis was performed on the single video, with values extracted every two seconds. As shown in Figure 7, the results indicate that no significant variations in perceived quality were observed throughout the recording. The MOS values remained stable across the entire duration, confirming the consistency of the viewing experience during the trial. In total, 72.92% of the measurements exceeded a MOS value of 3, reflecting an overall level of content quality considered acceptable for viewers.

5.2. Bib Number Detection and Recognition Results

The performance of the AI Engine was evaluated across the recorded video stream, processed at 15 frames per second to emulate a live analysis scenario. The aggregated results, summarized in Table 2, demonstrate high accuracy and reliability, with an overall Precision of 93.64%, Recall of 80.36%, and F1-Score of 86.50%. These metrics indicate that the majority of the model’s predictions were correct (high Precision) while successfully identifying most of the bib numbers present in the frames (strong Recall). This balance is particularly notable given the deployment of the lightweight YOLOv8-nano model, selected to prioritize low-latency inference. The primary objective was to ensure real-time operation on edge infrastructure without requiring high-end GPU resources, thus validating its feasibility for live, large-scale public events.
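For reference, the aggregate figures follow directly from the true positive (TP), false positive (FP), and false negative (FN) counts; the sketch below applies the standard definitions, with counts chosen only so that the output approximates the reported percentages rather than reproducing the trial's actual annotation totals.

```python
# Minimal sketch of deriving Precision, Recall, and F1 from detection counts.
# The counts are illustrative values consistent with the reported percentages,
# not the trial's per-class totals.
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

print(detection_metrics(tp=221, fp=15, fn=54))
# -> precision ~0.936, recall ~0.804, f1 ~0.865
```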
A detailed per-bib number analysis is presented in Figure 8. The results reveal that the engine performed exceptionally well for bib numbers such as 106, 109, and 119, exhibiting a high ratio of True Positives (correct detections) to False Negatives (missed detections). However, numbers like 103 and 117 posed greater challenges, with a comparatively higher number of misses, likely due to in-the-wild conditions such as partial occlusion or motion blur. Despite these challenges, the low total count of False Positives (15) across the video highlights the robustness of the model in avoiding incorrect detections.

6. Discussion

The Smart Media City trial within the “Round of Athens Race” provided a unique opportunity to validate distributed media workflows under real-world conditions. Several key lessons emerged regarding automation, operational feasibility, and the broader implications of the NEMO approach for smart city ecosystems.
One of the most notable strengths of the trial was the level of automation achieved through GitOps-based CI/CD pipelines. By integrating FluxCD with the NEMO orchestration framework, deployments became reliable, reproducible, and fast. Developers could commit updates to GitLab, triggering automatic validation and deployment across Kubernetes clusters, ensuring consistency and minimizing human intervention.
Equally important were the logistical and organizational insights. The experience of securing the Egaleo Stadium in Athens highlighted the necessity of early engagement with stakeholders and local authorities. Live video trials in public spaces require careful planning of permits, safety procedures, and technical logistics. This non-technical dimension proved critical to the smooth execution of the trial, and a proactive approach to site selection and stakeholder coordination is recommended for any future deployment involving public events or sensitive environments.
A third lesson concerned the success of remote production, enabled by the distributed setup between Athens and Madrid. By connecting the OTE infrastructure in Greece with remote servers in Madrid via VPN, the trial demonstrated that real-time production can be fully location-agnostic. Content curation could occur in Athens while being supervised and controlled from Madrid, confirming the feasibility of centralized management combined with distributed content acquisition. This model offers potential benefits in terms of reducing travel costs, logistical overhead, and carbon footprint for large-scale productions.
Nevertheless, the trial also revealed limitations and challenges. While the CI/CD automation streamlined service management, the initial cluster setup and configuration required substantial expertise, potentially limiting rapid adoption by smaller media organizations. This complexity stems from a core architectural trade-off: all media pipeline services, from edge processing to cloud transcoding, are implemented as microservices within a NEMO cluster environment in order to benefit from the platform. For instance, the Production Control component was deployed on a single physical machine with minimal setup, connected via VPN to the cluster. However, extending this lightweight strategy to the rest of the pipeline would sacrifice the value proposition of NEMO.
While the Video Quality Probe component provides objective MOS estimations for video streams, this work did not include a formal end-user subjective validation. Future work will focus on designing and conducting subjective tests to validate and further improve the performance of the Video Quality Probe module in real-world live media scenarios.
Although the trial demonstrated the feasibility of real-time delivery to 15–25 concurrent viewers, additional research is necessary to evaluate the scalability of the Media Delivery Manager. It has yet to be determined whether the current implementation can support a larger number of simultaneous connections or if scaling would demand enhanced computational and network resources. The limited user scale tested in the trial (tens of viewers rather than thousands) also leaves open questions regarding performance under massive-scale public events. Furthermore, while the AI pipeline (detection, bib recognition, digit recognition) utilized general datasets, its performance is highly dependent on the initial training domain; future work should validate generalization by applying the pipeline to other sports with similar identification methodologies. Addressing these constraints will be crucial to advance from proof-of-concept trials to fully commercial deployments.
Additionally, the reliance on stable 5G coverage underscores potential barriers in cities where such infrastructures are not yet mature.

Author Contributions

Conceptualization, A.d.R., Á.L., S.O.-A., M.B., A.M., L.M.C. and D.C.; Methodology, A.d.R., Á.L. and S.O.-A.; Software, A.d.R. and Á.L.; Formal analysis, A.d.R., Á.L., S.O.-A., G.P. and D.C.; Writing—original draft, A.d.R., Á.L., A.M. and D.C.; Writing—review & editing, M.B. and G.P.; Project administration, A.d.R., L.M.C. and D.C. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is based on the “NEMO” (“Next Generation Meta Operating System”) project. The NEMO project has received funding from the EU Horizon Europe Research and Innovation Programme under Grant Agreement No. 101070118.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Authors Maria Belesioti and George Pappas were employed by the company Hellenic Telecommunications Organization, S.A. Author Luis M. Contreras was employed by the company Telefónica Innovación Digital. Author Dimitris Christopoulos was employed by the company Foundation of the Hellenic World. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI	Artificial Intelligence
APN	Access Point Name
CI/CD	Continuous Integration/Continuous Delivery
CPU	Central Processing Unit
CF-DRL	Cybersecure Federated Deep Reinforcement Learning
GPS	Global Positioning System
KPI	Key Performance Indicator
MO	Meta-Orchestrator
Meta-OS	Meta Operating System
ML	Machine Learning
mNCC	Meta Network Cluster Controller
MOS	Mean Opinion Score
NEMO	Next Generation Meta Operating System
PCA	Principal Component Analysis
QoE	Quality of Experience
QoS	Quality of Service
RTMP	Real-Time Messaging Protocol
SLOs	Service Level Objectives
UDP	User Datagram Protocol
UGC	User-Generated Content
VPN	Virtual Private Network
YOLO	You Only Look Once

References

  1. Ericsson. The Latest Social Media Trend: Live Streaming. 2016. Available online: https://www.ericsson.com/en/reports-and-papers/mobility-report/articles/latest-social-media-trend-live-streaming (accessed on 16 September 2025).
  2. Nokia. How 5G Will Transform Live Events. 2023. Available online: https://www.nokia.com/thought-leadership/articles/how-5g-will-transform-live-events/ (accessed on 16 September 2025).
  3. Ericsson. 5G: Meeting Consumer Demands at Big Events. Available online: https://www.ericsson.com/en/reports-and-papers/consumerlab/reports/5g-meeting-consumer-demands-at-big-events (accessed on 16 September 2025).
  4. Zhang, Y.; Wang, J.; Zhu, Y.; Xie, R. Subjective and objective quality evaluation of UGC video after encoding and decoding. Displays 2024, 83, 102719. [Google Scholar] [CrossRef]
  5. Orive, A.; Agirre, A.; Truong, H.L.; Sarachaga, I.; Marcos, M. Quality of Service Aware Orchestration for Cloud–Edge Continuum Applications. Sensors 2022, 22, 1755. [Google Scholar] [CrossRef] [PubMed]
  6. Li, Y.; Deng, G.; Bai, C.; Yang, J.; Wang, G.; Zhang, H.; Bai, J.; Yuan, H.; Xu, M.; Wang, S. Demystifying the QoS and QoE of Edge-hosted Video Streaming Applications in the Wild with SNESet. Proc. ACM Manag. Data 2023, 1, 236. [Google Scholar] [CrossRef]
  7. Ericsson. 5G Elevates Connectivity at 2024’s Biggest Events. 2024. Available online: https://www.ericsson.com/en/press-releases/3/2024/consumerlab-5g-elevates-connectivity-experiences (accessed on 16 September 2025).
  8. European Commission. The Next Generation Internet of Things|Shaping Europe’s Digital Future. Available online: https://digital-strategy.ec.europa.eu/en/policies/next-generation-internet-things (accessed on 16 September 2025).
  9. Chochliouros, I.P.; Pages-Montanera, E.; Alcázar-Fernández, A.; Zahariadis, T.; Velivassaki, T.H.; Skianis, C.; Rossini, R.; Belesioti, M.; Drosos, N.; Bakiris, E.; et al. NEMO: Building the Next Generation Meta Operating System. In Proceedings of the 3rd Eclipse Security, AI, Architecture and Modelling Conference on Cloud to Edge Continuum, eSAAM ’23, Ludwigsburg, Germany, 17 October 2023; pp. 1–9.
  10. Belesioti, M.; Chochliouros, I.P.; Dimas, P.; Sofianopoulos, M.; Zahariadis, T.; Skianis, C.; Montanera, E.P. Putting Intelligence into Things: An Overview of Current Architectures. In Proceedings of the AIAI 2023 IFIP WG 12.5 International Workshops on Artificial Intelligence Applications and Innovations, León, Spain, 14–17 June 2023; Springer: Cham, Switzerland, 2023; pp. 106–117.
  11. Segou, O.; Skias, D.S.; Velivassaki, T.H.; Zahariadis, T.; Pages, E.; Ramiro, R.; Rossini, R.; Karkazis, P.A.; Muniz, A.; Contreras, L.; et al. NExt generation Meta Operating systems (NEMO) and Data Space: Envisioning the future. In Proceedings of the 4th Eclipse Security, AI, Architecture and Modelling Conference on Data Space, eSAAM ’24, Mainz, Germany, 22–24 October 2024; pp. 41–49.
  12. Chen, J.; Jung, S.; Cai, L. A critical review of technology-facilitated event engagement: Current landscape and pathway forward. Int. J. Contemp. Hosp. Manag. 2025, 37, 169–189.
  13. Chang, S.; Suh, J. The Impact of Digital Storytelling on Presence, Immersion, Enjoyment, and Continued Usage Intention in VR-Based Museum Exhibitions. Sensors 2025, 25, 2914.
  14. Hu, M.; Luo, Z.; Pasdar, A.; Lee, Y.C.; Zhou, Y.; Wu, D. Edge-Based Video Analytics: A Survey. arXiv 2023, arXiv:2303.14329.
  15. Jiang, X.; Yu, F.R.; Song, T.; Leung, V.C.M. A Survey on Multi-Access Edge Computing Applied to Video Streaming: Some Research Issues and Challenges. IEEE Commun. Surv. Tutor. 2021, 23, 871–903.
  16. Bisicchia, G.; Forti, S.; Pimentel, E.; Brogi, A. Continuous QoS-compliant orchestration in the Cloud-Edge continuum. Software Pract. Exp. 2024, 54, 2191–2213.
  17. Sodagi, S.; Mangalwede, S.; Hariharan, S. Race AI: A Deep Learning Approach to Marathon Bib Detection and Recognition. In Proceedings of the 2025 4th International Conference on Advances in Computing, Communication, Embedded and Secure Systems (ACCESS), Ernakulam, India, 11–13 June 2025; pp. 752–757.
  18. Keltsch, M.; Prokesch, S.; Gordo, O.P.; Serrano, J.; Phan, T.K.; Fritzsch, I. Remote Production and Mobile Contribution Over 5G Networks: Scenarios, Requirements and Approaches for Broadcast Quality Media Streaming. In Proceedings of the 2018 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), Valencia, Spain, 6–8 June 2018; pp. 1–7.
  19. Anitha, R.; Esther Jyothi, V.; Regulagadda, R.; Macherla, S.; Naga Malleswari, D.; Siva Nageswara Rao, G.; Nagesh, P. Cloud Computing and Multimedia IoT. In Multimedia Technologies in the Internet of Things Environment; Springer Nature: Singapore, 2025; Volume 4, pp. 227–242.
  20. Liu, S.; Wang, S.; Ye, F.; Wu, Q. Cloud-Edge Collaborative Transcoding for Adaptive Video Streaming: Enhancing QoE in Wireless Networks. IEEE Trans. Green Commun. Netw. 2025.
  21. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
  22. Nasrabadi, M.A.; Beauregard, Y.; Ekhlassi, A. The implication of user-generated content in new product development process: A systematic literature review and future research agenda. Technol. Forecast. Soc. Change 2024, 206, 123551.
  23. Ettrich, O.; Stahlmann, S.; Leopold, H.; Barrot, C. Automatically identifying customer needs in user-generated content using token classification. Decis. Support Syst. 2024, 178, 114107.
  24. Zhao, H.; Tang, Z.; Li, Z.; Dong, Y.; Si, Y.; Lu, M.; Panoutsos, G. Real-Time Object Detection and Robotic Manipulation for Agriculture Using a YOLO-Based Learning Approach. In Proceedings of the 2024 IEEE International Conference on Industrial Technology (ICIT), Bristol, UK, 25–27 March 2024; pp. 1–6.
  25. Yue, S.; Zhang, Z.; Shi, Y.; Cai, Y. WGS-YOLO: A real-time object detector based on YOLO framework for autonomous driving. Comput. Vis. Image Underst. 2024, 249, 104200.
  26. Midoglu, C.; Sabet, S.S.; Sarkhoosh, M.H.; Majidi, M.; Gautam, S.; Solberg, H.M.; Kupka, T.; Halvorsen, P. AI-Based Sports Highlight Generation for Social Media. In Proceedings of the 3rd Mile-High Video Conference, MHV ’24, Denver, CO, USA, 11–14 February 2024; pp. 7–13.
  27. Vamsikeshwaran, M. AI Powered Video Content Moderation Governed by Intensity Based Custom Rules with Remedial Pipelines. In Proceedings of the 2024 International Conference on Computer Vision and Image Processing, Chennai, India, 19–21 December 2024; Springer: Cham, Switzerland, 2024; pp. 390–403.
  28. Bai, T.; Zhao, H.; Huang, L.; Wang, Z.; Kim, D.I.; Nallanathan, A. A Decade of Video Analytics at Edge: Training, Deployment, Orchestration, and Platforms. IEEE Commun. Surv. Tutor. 2025.
  29. Ortiz-Arce, S.; Llorente, A.; Rio, A.D.; Alvarez, F. An Enhanced Method for Objective QoE Analysis in Adaptive Streaming Services. IEEE Access 2025, 13, 159273–159285.
  30. Martinez, R.; Llorente, A.; del Rio, A.; Serrano, J.; Jimenez, D. Performance Evaluation of YOLOv8-Based Bib Number Detection in Media Streaming Race. IEEE Trans. Broadcast. 2024, 70, 1126–1138.
  31. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://docs.ultralytics.com/es/models/yolov8/#yolov8-usage-examples (accessed on 30 April 2025).
  32. Ben-Ami, I.; Basha, T.; Avidan, S. Racing Bib Numbers Recognition. In Proceedings of the BMVC 2012, Guildford, UK, 3–7 September 2012; pp. 1–10.
  33. Hernandez-Carrascosa, P.; Penate-Sanchez, A.; Lorenzo-Navarro, J.; Freire-Obregon, D.; Castrillon-Santana, M. TGCRBNW: A Dataset for Runner Bib Number Detection (and Recognition) in the Wild. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 9445–9451.
  34. HCMUS. Bib Detection Big Data Dataset. 2023. Available online: https://universe.roboflow.com/hcmus-3p8wh/bib-detection-big-data (accessed on 12 February 2024).
  35. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading Digits in Natural Images with Unsupervised Feature Learning. In Proceedings of the NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning, Granada, Spain, 16–17 December 2011.
  36. Recommendation ITU-R BT.500-11: Methodology for the Subjective Assessment of the Quality of Television Pictures; ITU: Geneva, Switzerland, 2002.
  37. Leszczuk, M.; Hanusiak, M.; Farias, M.C.Q.; Wyckens, E.; Heston, G. Recent developments in visual quality monitoring by key performance indicators. Multimed. Tools Appl. 2016, 75, 10745–10767.
  38. del Rio, A.; Serrano, J.; Jimenez, D.; Contreras, L.M.; Alvarez, F. Multisite gaming streaming optimization over virtualized 5G environment using Deep Reinforcement Learning techniques. Comput. Netw. 2024, 244, 110334.
  39. Red Hat. What is GitOps? Available online: https://www.redhat.com/en/topics/devops/what-is-gitops (accessed on 16 September 2025).
  40. European Broadcasting Union (EBU). EBU—Recommendation R132: Signal Quality in HDTV Production and Broadcast Services; Technical report, Guidelines for technical, operational & creative staff on how to achieve and maintain sufficient technical quality along the production chain; European Broadcasting Union: Geneva, Switzerland, 2011.
Figure 1. End-to-end architecture for the Smart Media City use case.
Figure 2. User interface of the Race Stream App developed as part of the NEMO project.
Figure 3. The contributions gallery example in the Race Stream App.
Figure 4. Remote production control interface (Voctogui), showing multiple incoming feeds (left) and the final mixed stream (right).
Figure 5. Aerial view of the Egaleo Municipal Stadium in Athens, the site of the trial.
Figure 6. Sample frames captured during the trial. (a) Video capture from camera 1, right side view; (b) Video capture from camera 2, front-side view; (c) Video capture from camera 3, left side view; (d) Video capture from camera 4, right side view with wide focus.
Figure 7. Mean Opinion Score (MOS) over time.
Figure 8. Aggregated performance of the AI Engine, showing True Positives (TPs) and False Negatives (FNs) for each detected bib number.
Table 1. Aggregated performance metrics for key video quality indicators.
Metric               Mean      Median    Min       Max       Std. Dev.
MOS                  3.27      3.29      2.55      3.85      0.34
Blockiness           0.95      0.95      0.92      0.98      0.02
Blur                 3.63      3.63      3.10      4.22      0.28
Block loss           3.25      2.65      0.73      8.78      1.92
Temporal activity    23.93     22.79     7.15      42.37     8.07
Spatial activity     96.30     95.17     78.18     121.67    8.69
Exposure             128.89    129.04    123.35    131.72    1.89
Contrast             44.81     46.46     36.11     51.70     4.78
Noise                0.91      0.73      0.39      3.32      0.52
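The statistics in Table 1 are plain aggregates of the per-interval measurements reported by the Video Quality Probe. As a minimal sketch, assuming the probe output is available as a simple numeric series (the values below are illustrative, not the trial data), the summary statistics and the share of intervals with MOS above 3 (reported as 72.92% in this trial) can be computed as follows:

import numpy as np

# Illustrative per-interval MOS series; the real trial series is not reproduced here.
mos = np.array([3.45, 3.12, 2.98, 3.60, 2.75, 3.31, 3.84, 3.05])

summary = {
    "mean": mos.mean(),
    "median": np.median(mos),
    "min": mos.min(),
    "max": mos.max(),
    "std_dev": mos.std(ddof=0),  # population standard deviation
}
share_above_3 = 100.0 * np.mean(mos > 3.0)  # percentage of intervals with MOS > 3

print({k: round(float(v), 2) for k, v in summary.items()})
print(f"MOS above 3 for {share_above_3:.2f}% of intervals")
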
Table 2. AI Engine aggregated performance metrics.
Metric                   Value
True Positives (TPs)     221
False Positives (FPs)    15
False Negatives (FNs)    54
Precision                0.9364
Recall                   0.8036
F1-Score                 0.8650
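As a consistency check, the Precision, Recall, and F1-Score in Table 2 follow directly from the reported counts via the standard definitions; the minimal Python sketch below (variable names are ours) reproduces the tabulated values:

# Counts from Table 2
tp, fp, fn = 221, 15, 54

precision = tp / (tp + fp)   # 221 / 236 ≈ 0.9364
recall = tp / (tp + fn)      # 221 / 275 ≈ 0.8036
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.8650

print(f"Precision = {precision:.4f}, Recall = {recall:.4f}, F1-Score = {f1:.4f}")
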
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
