Review

A Survey on Video Big Data Analytics: Architecture, Technologies, and Open Research Challenges

by Thi-Thu-Trang Do 1,2, Quyet-Thang Huynh 1, Kyungbaek Kim 3 and Van-Quyet Nguyen 2,*
1 School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi 100000, Vietnam
2 Faculty of Information Technology, Hung Yen University of Technology and Education, Hung Yen 160000, Vietnam
3 Department of Artificial Intelligence Convergence, Chonnam National University, Gwangju 61186, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 8089; https://doi.org/10.3390/app15148089
Submission received: 9 June 2025 / Revised: 14 July 2025 / Accepted: 16 July 2025 / Published: 21 July 2025

Abstract

The exponential growth of video data across domains such as surveillance, transportation, and healthcare has raised critical challenges in scalability, real-time processing, and privacy preservation. While existing studies have addressed individual aspects of Video Big Data Analytics (VBDA), an integrated, up-to-date perspective remains limited. This paper presents a comprehensive survey of system architectures and enabling technologies in VBDA. It categorizes system architectures into four primary types: centralized, cloud-based, edge computing, and hybrid cloud–edge. It also analyzes key enabling technologies, including real-time streaming, scalable distributed processing, intelligent AI models, and advanced storage for managing large-scale multimodal video data. In addition, the study provides a functional taxonomy of core video processing tasks, including object detection, anomaly recognition, and semantic retrieval, and maps these tasks to real-world applications. Based on the survey findings, the paper proposes ViMindXAI, a hybrid AI-driven platform that combines edge and cloud orchestration, adaptive storage, and privacy-aware learning to support scalable and trustworthy video analytics. Our analysis highlights emerging trends such as the shift toward hybrid cloud–edge architectures, the growing importance of explainable AI and federated learning, and the urgent need for secure and efficient video data management. These findings point to key directions for designing next-generation VBDA platforms that enhance real-time, data-driven decision-making in domains such as public safety, transportation, and healthcare. Such platforms facilitate timely insights, rapid response, and regulatory alignment through scalable and explainable analytics. This work provides a robust conceptual foundation for future research on adaptive and efficient decision-support systems in video-intensive environments.

1. Introduction

The explosive growth of video data has transformed multiple sectors, including transportation, healthcare, surveillance, and retail. According to Statista [1], video accounted for over 80% of global internet traffic in 2023, driven largely by the proliferation of smart cameras, IoT devices, autonomous systems, and online video platforms. This growth demands scalable, real-time, and privacy-preserving Video Big Data Analytics (VBDA) systems to support intelligent decision-making across distributed environments.
VBDA has become an integral part of modern urban infrastructure, as seen in smart city projects such as Singapore’s traffic monitoring and electronic road pricing (ERP) systems, helping to reduce congestion and optimize signal flow [2]. In the healthcare domain, AI-driven video surveillance supports early detection of critical events in hospital environments, contributing to improved clinical outcomes [3]. In the retail sector, systems like Amazon Go leverage VBDA to monitor customer behavior, automate checkouts, and deliver personalized services [4].
Despite the promising real-world applications of VBDA, its development remains challenged by the high volume, heterogeneity, and complexity of video data. Video streams are inherently multimodal and high-dimensional, encompassing spatial, temporal, and auditory characteristics, and are often encoded in varied formats such as MP4, AVI, and MKV [5]. These characteristics introduce difficulties in decoding, indexing, and storage management [6,7]. Real-time video analytics faces latency, bandwidth, and storage limitations, particularly in safety-critical applications such as autonomous driving and emergency response [4,8]. Furthermore, privacy and data protection regulations such as GDPR and HIPAA require the incorporation of secure and explainable AI methodologies [3,9].
Although distributed data processing frameworks such as Apache Spark, Kafka, and Flink [10,11] have proven to be effective for scalability, they often fail to deliver low-latency responses and video-specific optimizations. Recent advances in edge AI and Federated Learning (FL) have led to a shift toward decentralized analytics closer to the data source [5,12,13]. FL, in particular, enables collaborative training between edge devices without transmitting raw data, which in turn improves privacy and reduces the load on the network [9,14]. However, challenges such as model synchronization, non-independent and identically distributed (non-IID) data distributions, and edge-device constraints remain open research issues [15,16].
This paper presents a comprehensive synthesis of the current landscape in VBDA, focusing on system architectures, enabling technologies, and application-oriented scenarios. Existing VBDA platforms are systematically categorized into four representative system architectures: centralized, cloud-based, edge computing, and hybrid cloud–edge models. Concurrently, the study analyzes a broad spectrum of enabling technologies, including real-time stream processing frameworks, scalable distributed processing engines, deep learning and federated learning models for intelligent analytics, and advanced storage solutions for managing multimodal, high-volume video data. Beyond architectural and technological classification, the study highlights critical challenges that persist throughout the VBDA analytical process, particularly regarding latency constraints, storage efficiency, and privacy preservation in distributed learning settings. To address these challenges, this paper proposes ViMindXAI, a hybrid AI-driven platform that incorporates edge and cloud orchestration, adaptive video storage mechanisms, privacy-aware learning processes, and intelligent semantic-level video retrieval capabilities.
Our work, as described in this paper, makes three main contributions. First, this work presents a consolidated and up-to-date evaluation of VBDA platform architectures and technologies, together with a functional taxonomy of core video analytics tasks and their real-world applications. Second, the paper identifies and synthesizes key open research challenges related to real-time responsiveness, scalability, AI integration, and privacy constraints in large-scale video analytics. Third, this study proposes ViMindXAI as an integrated architectural framework that serves as a reference design for the development of scalable, intelligent, and trustworthy next-generation VBDA platforms. It is important to note that ViMindXAI is currently introduced as a conceptual reference architecture, developed through in-depth analysis of underlying technologies and their capacity to address the practical requirements of modern video big data analytics platforms. Its prototyping and comprehensive performance evaluation are planned as part of future research efforts.
To conduct this study, we adopt a survey-based methodology, integrating insights from a broad spectrum of the scholarly literature and real-world implementations across both academic and industrial contexts. An initial corpus of over 350 publications was retrieved from reputable databases, including Scopus, Web of Science, IEEE Xplore, and the ACM Digital Library. A systematic screening process was applied to remove non-peer-reviewed works, studies not directly addressing system architectures or enabling technologies in VBDA, and publications based on outdated technological approaches no longer relevant to current developments. As a result, 157 publications were retained for comprehensive analysis. The majority of these works, focusing on VBDA architectures and core technologies, were published between 2018 and 2025. A substantial portion appeared in high-impact journals and conferences, demonstrating their technical depth, alignment with emerging research directions, and significant contributions to the advancement of VBDA platforms. This systematically selected set of studies ensures that the survey captures both foundational and cutting-edge contributions to the field.
The remainder of this paper is organized as follows. Section 2 provides a comprehensive overview of VBDA, detailing architectural paradigms, enabling technologies, and a functional taxonomy of core video processing tasks. Section 3 examines in greater depth VBDA system architectures and foundational technologies, tracing their development and emerging trends from 2018 to 2025. Section 4 introduces the proposed ViMindXAI platform, emphasizing its hybrid architecture, modular design, and AI-driven capabilities for scalable, privacy-preserving video analytics. This section also presents three representative case studies demonstrating the platform’s adaptability to diverse real-world scenarios. Section 5 discusses open challenges and outlines future research directions, followed by the conclusion in Section 6.

2. Literature Overview in VBDA

The growing complexity and scale of video data have driven significant research into scalable, low-latency, and privacy-preserving VBDA systems. While previous surveys have explored key aspects individually, such as edge computing [17], FL [9,18], and video surveillance architectures [19], these works often consider each technology in isolation and lack a unified, system-level perspective. For example, Elharrouss et al. [19] provide a detailed review of conventional surveillance architectures but do not address recent advances in FL or explainable AI. Xu et al. [17] focus on the architectural challenges of edge-based video analytics but overlook federated or collaborative learning methods. Similarly, surveys on FL [9,18] offer high-level overviews without contextualizing their insights within end-to-end video analytics pipelines or practical deployment frameworks. This section aims to bridge these gaps by presenting a unified and structured overview of the VBDA research landscape. It begins with the historical evolution of system architectures and continues with a review of enabling technologies and current architectural paradigms. This synthesis lays the groundwork for the ViMindXAI platform, which will be detailed in the following sections.

2.1. Evolution and Trends in VBDA

The evolution of VBDA represents a significant paradigm shift toward distributed, AI-driven architectures, prompted by the growing demands for scalability, responsiveness, and data privacy. As illustrated in Figure 1, this transformation has been driven by advances in networked video systems, scalable data infrastructures, and intelligent analytics. Since the early 2010s, coinciding with the growth of digital surveillance systems such as city-wide CCTV infrastructures and smart building monitoring, and of big data platforms such as Apache Hadoop and Spark, VBDA has progressed through five distinct architectural stages. Each phase contributed specific capabilities to address critical challenges, including scalability, low-latency processing, data privacy, and contextual video understanding [7,20,21]. Understanding this historical progression provides essential context for the subsequent overview of the current research landscape presented in this section.
(1) 
Centralized and On-Premises VBDA (2010–2012)
The initial phase of VBDA was defined by centralized, on-premises architectures, where video acquisition, storage, and processing were handled entirely on local servers. These systems employed RAID or SAN/NAS storage and used relational databases such as MySQL v5.5 and PostgreSQL v9.0, with occasional support from early NoSQL platforms like MongoDB v2.4 for metadata handling [17,19]. With no support for distributed processing or elastic scalability, they were ill-suited for high-throughput or real-time analytics. Video understanding was based on classical machine learning techniques using handcrafted features. Among these, Histogram of Oriented Gradients (HOG) and Scale-Invariant Feature Transform (SIFT) were the most widely adopted for object detection and recognition due to their robustness to scale and orientation changes [19]. Additionally, LBP and Haar-like features were frequently used in face detection pipelines and embedded systems, given their computational efficiency [17,19]. Although tools like OpenCV 2.0 (2010) introduced basic GPU acceleration through CUDA, performance remained constrained by memory bandwidth, power consumption, and the high cost of large-scale real-time video processing, even in technologically advanced regions such as the United States, Germany, and Japan. A major shift occurred in 2012 with the introduction of AlexNet, which demonstrated the power of deep learning for visual recognition and revealed the shortcomings of rule-based methods. This breakthrough established the foundation for transitioning to scalable, AI-enhanced cloud architectures, which characterized the subsequent batch-processing era [22,23].
(2) 
Cloud-Based Batch Processing VBDA (2013–2016)
As video data grew in volume and complexity, VBDA systems shifted from on-premises architectures to cloud-based infrastructures. Distributed frameworks like Apache Hadoop (https://hadoop.apache.org, accessed on 12 May 2025) and Apache Spark (https://spark.apache.org, accessed on 12 May 2025) facilitated large-scale batch analytics, allowing efficient processing of video archives within elastic cloud infrastructures [24,25]. Simultaneously, cloud object storage platforms like Amazon S3 and Google Cloud Storage offered scalable, cost-effective alternatives to traditional systems, supporting flexible data ingestion and on-demand compute provisioning [26].
This period also marked the integration of deep learning models such as VGGNet and ResNet into VBDA pipelines, improving object detection and scene interpretation through cloud-native AI services [27]. Elasticsearch, widely adopted during this phase, enabled scalable indexing and fast metadata retrieval across distributed video repositories [25]. However, batch-based processing introduced significant latency and lacked responsiveness. These limitations rendered it less suitable for real-time scenarios such as surveillance and traffic analytics [24]. These limitations spurred the shift toward more agile, edge-driven architectures in the next phase.
(3) 
Streaming Video Analytics and Edge Computing (2017–2019)
To overcome the latency and scalability limitations of cloud-only batch processing, VBDA systems began incorporating edge computing and streaming analytics. This evolution led to hybrid cloud–edge architectures, where edge devices performed real-time inference, while cloud platforms handled model training and archival storage. Stream processing tools such as Apache Kafka (https://kafka.apache.org, accessed on 15 May 2025) and Apache Flink (https://flink.apache.org, accessed on 13 May 2025) became foundational for enabling low-latency video ingestion and analytics pipelines [28]. At the same time, hardware accelerators like NVIDIA Jetson (NVIDIA Corporation, Santa Clara, CA, USA) and Intel Movidius (Intel Corporation, Santa Clara, CA, USA) enabled efficient on-device video processing [2].
The introduction of YOLOv3 in 2018 further enhanced real-time object detection capabilities, reinforcing the practicality of edge-first analytics [29]. Complementary tools, such as Elasticsearch for semantic indexing and graph databases for contextual correlation, advanced video content organization and retrieval. Emerging technologies like ONNX promoted AI model portability across heterogeneous hardware, while early 5G adoption helped reduce communication latency. Nevertheless, coordinating distributed inference, handling heterogeneous edge hardware, and achieving seamless edge–cloud orchestration continued to present formidable challenges.
(4) 
Hybrid Cloud-Edge AI and FL (2020–2022)
This phase marked a turning point in VBDA system design, defined by the rise of hybrid architectures that integrated real-time edge inference with cloud-based training and coordination [11]. FL gained traction as a privacy-preserving paradigm, enabling model training directly on distributed edge devices without transferring raw video data [14].
During this time, lakehouse platforms like Delta Lake (https://delta.io, accessed on 18 May 2025) were increasingly adopted to manage large-scale, structured video data, offering both scalability and transactional consistency. Semantic indexing and entity modeling were enhanced by the use of graph databases such as Neo4j [30], while Apache NiFi played a key role in orchestrating complex, high-throughput video processing workflows [11]. Despite these advancements, the coordination of distributed models and the optimization of video analytics across heterogeneous infrastructure continued to pose significant challenges.
(5) 
AI-Driven Autonomous Video Processing (2023–2025)
The current stage of VBDA marks a shift toward autonomous, AI-native systems capable of handling complex video analytics with minimal human intervention. Advanced generative models, including GANs, ViT, and diffusion architectures, are increasingly applied in tasks such as anomaly detection, semantic summarization, and predictive surveillance [31,32]. Although these models offer promising performance, their applicability in real-time systems is constrained by substantial computational overhead.
Concurrently, blockchain technologies are being leveraged to support secure metadata logging, transparent access control, and tamper-resistant video sharing [33]. The integration of privacy-preserving mechanisms such as Zero-Knowledge Proofs and decentralized consensus further reinforces compliance with data protection standards. Future advancements, such as 6G networks and quantum-enhanced video processing, are expected to drive the emergence of scalable, resilient, and privacy-conscious VBDA systems [5].
In summary, VBDA has evolved from centralized, rule-based systems to hybrid, AI-driven architectures that effectively address latency, scalability, and data privacy challenges. Key milestones, ranging from cloud-based batch processing to edge computing and FL, have laid the foundation for more adaptive and resilient video analytics frameworks. Modern platforms now integrate generative models, semantic indexing, and decentralized training to enable real-time, autonomous video understanding. Looking forward, advances in explainable AI, privacy-preserving computation, and next-generation infrastructure such as 6G networks and quantum-enhanced processing are expected to drive the development of secure, intelligent, and context-aware VBDA systems for critical real-time applications [9,20,32].

2.2. Evolution of Architectural Paradigms in VBDA

Over the past decade, VBDA architectures have progressively transitioned from rigid centralized pipelines to distributed and collaborative models. Traditional systems were monolithic and tightly coupled to specific applications, typically running on standalone servers and using tools such as FFmpeg or GStreamer for basic media processing [24,34,35]. Although suitable for offline tasks, these early setups presented significant limitations in scalability, modularity, and integration with big data ecosystems, rendering them ineffective for high-throughput or real-time video analytics.
This evolution phase was marked by the introduction of distributed batch processing frameworks, including Apache Hadoop and Apache Spark, which enabled greater parallelism and scalability [36,37]. These architectures enabled large-scale video indexing and archive-based analysis, proving effective for retrospective tasks. However, their batch-oriented nature came with high latency and reliance on centralized coordination, making them unsuitable for time-sensitive domains like smart surveillance or autonomous systems.
To support real-time inference and more dynamic workloads, cloud-centric architectures emerged as a flexible and scalable alternative. These systems utilized elastic cloud resources for deep learning tasks such as object detection, summarization, and large-scale storage [11,28]. Despite offering scalability, centralized cloud designs introduce latency, bandwidth overhead, and regulatory concerns associated with transmitting raw video data [38], thereby necessitating decentralized alternatives.
Edge computing addressed these limitations by pushing inference and preprocessing closer to data sources, including IP cameras and sensors. This approach reduced network load and latency, enabling faster decision-making in critical applications like autonomous driving and real-time surveillance [4,5]. However, edge systems continue to encounter challenges related to hardware heterogeneity, constrained computational resources, and synchronization of models across distributed nodes [3,14].
In response, hybrid cloud–edge architectures have become the dominant model in modern VBDA. These systems offload real-time tasks to edge devices while leveraging the cloud for orchestration, storage, and training [4,20]. Hybrid architectures offer a balanced trade-off between latency, scalability, and data privacy, yet they introduce orchestration complexity and unresolved challenges in distributed indexing, workload balancing, and coordination [5].
Hybrid architectures are increasingly evolving to incorporate generative AI and context-aware orchestration mechanisms. Recent paradigms within the edge-to-cloud spectrum are designed to enable semantic-level video understanding through capabilities such as real-time summarization, anomaly detection, and autonomous analytics [31]. This architectural continuum facilitates seamless coordination between resource-constrained edge devices and scalable cloud infrastructures [39], thereby enabling distributed intelligence across heterogeneous computing layers. Nevertheless, the intensive computational requirements of generative and multimodal AI models introduce new challenges. These include the need for advanced task scheduling, intelligent compression techniques, and resilient synchronization frameworks to ensure efficient operation across diverse environments.
In summary, the evolution of VBDA architectures reflects ongoing efforts to balance scalability, latency, and privacy. Early monolithic and batch-based systems lacked responsiveness and flexibility. Cloud-based models improved scalability but raised concerns over latency and data privacy. Edge architectures reduced delay and preserved data locality but faced resource and coordination limits. Hybrid designs, integrating edge and cloud, offer a balanced approach with adaptive workload distribution and semantic analytics. However, challenges remain in model management, device interoperability, and orchestration. Advancing VBDA requires the integration of these architectural strategies through intelligent coordination, enabling scalable, real-time, and privacy-aware video analytics systems.

2.3. Technologies and AI Enhancements in VBDA

The foundation of VBDA has been significantly influenced by advances in distributed computing, real-time stream processing, and intelligent data infrastructure. The shift from batch-oriented processing to real-time streaming and from centralized systems to edge-based deployments reflects increasing demands for low-latency, scalable, and privacy-aware video analytics solutions.

2.3.1. Enabling Technologies in VBDA

Real-time streaming frameworks such as Apache Kafka and Apache Flink play a central role in video data ingestion and analysis. Kafka enables scalable and fault-tolerant pipelines, while Flink supports low-latency, stateful processing with fine-grained event handling [40]. Nevertheless, large-scale VBDA environments continue to face unresolved challenges in adaptive resource allocation and seamless pipeline integration.
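To illustrate the ingestion pattern described above, the following minimal Python sketch publishes per-frame metadata to a Kafka topic using the kafka-python client; the broker address, topic name, and event fields are illustrative assumptions rather than details of any surveyed system.

```python
# Minimal sketch of metadata ingestion into Kafka, assuming a broker at
# localhost:9092 and the kafka-python client. Topic and field names are
# illustrative, not taken from any surveyed system.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_frame_event(camera_id: str, frame_index: int, detections: list) -> None:
    """Publish per-frame metadata (not raw pixels) to keep bandwidth low."""
    event = {
        "camera_id": camera_id,
        "frame_index": frame_index,
        "timestamp": time.time(),
        "detections": detections,  # e.g., [{"label": "person", "score": 0.91}]
    }
    producer.send("video-frame-events", value=event)

publish_frame_event("cam-01", 1024, [{"label": "person", "score": 0.91}])
producer.flush()
```

Publishing compact metadata rather than raw frames is one common way such pipelines keep Kafka throughput high; a downstream Flink or Spark job would then consume this topic for stateful analytics.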
Distributed data processing platforms continue to serve as key infrastructure for handling massive video workloads. Hadoop offers cost-effective long-term storage for archived video but lacks native support for streaming [41]. Spark improves processing speed through in-memory execution, yet is less suited for real-time video stream analytics [42]. Apache Flink, while designed for streaming, still faces deployment limitations in edge-centric and latency-sensitive contexts [11].
Video data storage systems have also evolved to manage the volume and complexity of multimedia streams. Hadoop HDFS supports durable, large-scale storage but suffers from high query latency [6]. Cloud-based solutions like Amazon S3 provide elastic storage capacity but incur latency and bandwidth costs for real-time access [5]. NoSQL and graph-based databases offer flexible schemas for representing semantic relationships among video elements such as scenes and objects, albeit with trade-offs in consistency guarantees and query performance [43,44].
To address these limitations, hybrid storage models that combine SQL, NoSQL, and data lakes are under active investigation [45]. AI-driven storage mechanisms, including self-supervised video indexing and adaptive caching, are also being explored to enhance storage efficiency and minimize retrieval latency [46].

2.3.2. AI Capabilities and Integration

Recent advancements in artificial intelligence have substantially enhanced the functional scope of VBDA systems, particularly in tasks such as object detection, activity recognition, summarization, and semantic interpretation of large-scale video data.
AI-driven video analytics commonly employ deep learning models, including convolutional neural networks (CNNs) [47] and Transformer-based architectures [48]. State-of-the-art detectors such as YOLO, Vision Transformers (ViT), and EfficientDet have demonstrated high accuracy in dynamic and complex video environments [49]. While these models are well-suited for high-level tasks such as face recognition and behavior detection, their deployment in real-time settings is often limited by substantial computational demands [5].
Edge AI offers a promising solution for reducing latency by shifting inference closer to the data source. By processing frames locally, edge devices can minimize reliance on cloud infrastructure and reduce bandwidth usage [50]. Lightweight approaches such as TinyML enable on-device analysis of continuous video streams. Nonetheless, the performance of edge-based inference is frequently limited by hardware constraints, particularly in terms of memory capacity and computational throughput [51]. Moreover, achieving synchronized inference between edge and cloud environments remains a persistent challenge [52].
FL further supports privacy-preserving analytics by enabling decentralized model training across distributed nodes without transferring raw video data [53]. This paradigm is particularly useful in privacy-sensitive domains such as healthcare and surveillance. Nonetheless, FL introduces communication overhead, and synchronization becomes increasingly complex in the presence of non-IID data and unreliable network conditions [54].
In recent years, generative models such as GANs and diffusion-based architectures have gained traction in tasks like video augmentation, anomaly detection, and spatiotemporal prediction [55]. These models enhance semantic understanding and dataset diversity but are often computationally intensive, limiting their applicability in latency-sensitive scenarios [56].
In parallel, blockchain-based approaches have been explored to support secure provenance tracking, access control, and authentication in decentralized video systems. Smart contracts enable immutable logging of video events and transparent authorization mechanisms [57]. However, the high computational cost and latency associated with current blockchain protocols hinder their deployment in real-time video pipelines [33].
Despite notable progress, VBDA systems still face substantial barriers related to latency, model complexity, and infrastructure coordination. Edge–cloud coordination lacks robust synchronization, and privacy-preserving methods such as FL increase communication load. In addition, adaptive video indexing and semantic retrieval continue to be computationally demanding, especially in large-scale and dynamic environments.

2.4. Problem Taxonomy and Applications of Video Processing in VBDA

VBDA encompasses a wide range of processing tasks aimed at addressing the inherent complexities of large-scale video interpretation and analysis. This section introduces a problem-centric taxonomy, organized around core functions such as object detection, activity recognition, anomaly detection, and content-based retrieval. Each task is systematically aligned with practical use cases in domains such as surveillance, healthcare, transportation, and smart retail. By linking technical approaches to application needs, the taxonomy provides a structured lens to evaluate the effectiveness, scalability, and real-world applicability of existing VBDA systems.

2.4.1. Object Detection and Tracking

Object detection and tracking are foundational components of VBDA, enabling automated monitoring of entities such as people, vehicles, and objects in dynamic environments. These tasks underpin applications in surveillance, traffic control, and sports analytics [5]. Whereas early systems relied on handcrafted feature-based techniques such as HOG and SVM [58], recent approaches have embraced deep learning models like YOLO, Faster R-CNN, and SSD, which are specifically optimized for edge deployment and real-time inference [59,60]. Tracking is often handled using DeepSORT [61], FairMOT [62], or Siamese networks [63], while 3D object detectors like PointNet [64] and VoxelNet [65] are gaining traction in autonomous driving and robotics.
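As a concrete illustration of the detection stage, the sketch below runs a pretrained YOLO model over a video stream, assuming the ultralytics package and a hypothetical input clip; a tracker such as DeepSORT would consume the resulting boxes downstream.

```python
# Illustrative per-frame detection loop, assuming the ultralytics package
# and a pretrained "yolov8n.pt" checkpoint; the input clip is hypothetical.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # lightweight variant suited to edge devices
cap = cv2.VideoCapture("traffic.mp4")  # hypothetical input clip

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame, verbose=False)  # one forward pass per frame
    for box in results[0].boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        label = model.names[int(box.cls[0])]
        conf = float(box.conf[0])
        # A tracker (e.g., DeepSORT) would associate these boxes across frames.
        print(f"{label} {conf:.2f} at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f})")
cap.release()
```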

2.4.2. Activity and Behavior Analysis

Human activity recognition plays a pivotal role in contextual video understanding, particularly in applications such as fall detection, violence recognition, retail analytics, and interactive media [5,66]. The field has evolved from using handcrafted features to adopting deep learning architectures, including 3D CNNs [67], LSTMs [68], and Transformer-based models. Skeleton-based approaches [69] provide fine-grained motion analysis and are particularly effective in healthcare and smart home environments [70].

2.4.3. Anomaly Detection

Anomaly detection focuses on identifying irregular or unexpected events in video streams, particularly in security-critical domains such as industrial monitoring, smart surveillance, and emergency response [5]. Techniques range from unsupervised models like autoencoders [71] and isolation forests [72] to more recent methods leveraging graph neural networks (GNNs) [73] and Transformers [74]. Benchmark datasets, including UCSD [75], Avenue, and ShanghaiTech [76], are commonly used for evaluation.
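The reconstruction-error principle behind autoencoder-based anomaly detection can be sketched as follows in PyTorch; the architecture, input size, and threshold are illustrative, and a real deployment would train the model on normal footage only so that anomalous frames reconstruct poorly.

```python
# Minimal sketch of reconstruction-based anomaly scoring with a frame
# autoencoder (PyTorch); layer sizes and threshold are illustrative.
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = FrameAutoencoder().eval()  # assume weights trained on normal footage

def anomaly_score(frame: torch.Tensor) -> float:
    """Per-frame mean squared reconstruction error; high error suggests an anomaly."""
    with torch.no_grad():
        recon = model(frame.unsqueeze(0))
        return torch.mean((recon - frame.unsqueeze(0)) ** 2).item()

frame = torch.rand(3, 64, 64)  # stand-in for a normalized video frame
print("anomalous" if anomaly_score(frame) > 0.01 else "normal")  # threshold is illustrative
```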

2.4.4. Crowd Analysis and Management

Crowd analytics addresses the need for real-time understanding of group movement, density estimation, and behavioral anomalies in densely populated areas. These techniques are critical for ensuring event safety, optimizing public space design, and managing pedestrian movement in high-density environments [77]. Approaches have evolved from traditional optical flow and density map estimation to GNN-based and Transformer-based models for dynamic prediction. Benchmark datasets such as Mall [78], UCF-QNRF [79], and ShanghaiTech support comparative evaluation.

2.4.5. Content-Based Video Retrieval and Understanding

Semantic video retrieval enables efficient access to relevant content in massive video repositories, supporting applications in surveillance triage, content moderation, and multimedia recommendation. Recent advances in cross-modal learning have led to models like Show and Tell [80], VideoBERT [81], and Frozen in Time [82], which align visual and textual modalities to support tasks such as zero-shot captioning and search. Large-scale datasets such as YouTube-8M [83], ActivityNet [84], and MSR-VTT [85] have been pivotal for training and evaluation.
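A minimal example of cross-modal retrieval in this spirit, assuming the Hugging Face transformers implementation of CLIP, scores a text query against candidate keyframes; the placeholder frames below stand in for decoded video content.

```python
# Sketch of zero-shot text-to-frame retrieval with CLIP via Hugging Face
# transformers; the model name is the public checkpoint, the frames are
# blank placeholders standing in for decoded keyframes.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.new("RGB", (224, 224)) for _ in range(4)]  # stand-in keyframes
query = "a person running across a road"

inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
# Similarity logits between the query and every frame embedding.
scores = out.logits_per_text.squeeze(0)
best = int(scores.argmax())
print(f"best-matching frame: {best}, score: {scores[best]:.2f}")
```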

2.4.6. Video Summarization and Captioning

Video summarization transforms extended footage into succinct representations, thereby facilitating efficient review, archival, and content personalization. This is particularly valuable in applications like surveillance review, broadcast editing, and education. Recent approaches go beyond keyframe extraction to include generative models such as VST [48], ViViT, and BLIP-2 for producing text summaries or abstract visual synopses. Commonly used benchmarks include SumMe [86] and TVSum [87].
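For contrast with these generative approaches, the classical keyframe-extraction baseline they build upon can be expressed in a few lines of OpenCV, retaining a frame whenever its color histogram diverges sufficiently from the previous keyframe; the threshold and input path are illustrative.

```python
# Classical keyframe-extraction baseline using color-histogram differences
# (OpenCV); threshold and input path are illustrative assumptions.
import cv2

def extract_keyframes(path: str, threshold: float = 0.4) -> list:
    """Keep a frame whenever its histogram differs enough from the last keyframe."""
    cap = cv2.VideoCapture(path)
    keyframes, prev_hist, idx = [], None, 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is None or cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            keyframes.append(idx)
            prev_hist = hist
        idx += 1
    cap.release()
    return keyframes

print(extract_keyframes("lecture.mp4"))  # hypothetical input video
```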
A structured summary of key problem domains in VBDA is presented in Table 1, outlining core technologies, datasets, and application contexts. While these tasks have enabled practical deployments in domains such as transportation and healthcare, transitioning from research prototypes to scalable production systems remains hindered by unresolved technical challenges.
A primary barrier is scalability and system heterogeneity. Real-world VBDA must handle continuous, high-resolution video from diverse sources such as CCTV cameras, drones, and IoT devices. This introduces substantial demands on bandwidth, storage, and computational infrastructure. Coordinating processing across decentralized layers spanning edge and cloud adds complexity in terms of orchestration, resource allocation, and vendor interoperability.
Latency and responsiveness represent additional bottlenecks. Despite improvements in deep learning accuracy, many state-of-the-art models remain too resource-intensive for real-time inference on edge-constrained hardware. Meeting low-latency, high-throughput requirements calls for end-to-end optimization across both model design and system architecture.
Privacy and compliance are equally critical, especially in domains involving sensitive video data. Regulatory frameworks such as GDPR and HIPAA require secure governance practices, including FL, encrypted computation, and verifiable data workflows to prevent misuse and ensure data sovereignty.
Finally, limited labeled data remains a critical bottleneck in advancing robust VBDA systems. Most current approaches rely heavily on large-scale supervised learning, yet collecting and annotating video data is both time-consuming and costly. To address this challenge, self-supervised and semi-supervised learning techniques have been increasingly adopted as alternatives to fully supervised approaches. Nevertheless, these models frequently encounter difficulties in generalization, especially under distribution shifts or deployment in novel operational contexts [66].

3. VBDA Architectures and Core Technologies

This section offers a comprehensive analysis of VBDA system architectures and enabling technologies, synthesizing developments from 2018 to 2025. The first part provides a structured comparison of foundational components, ranging from data acquisition and storage to processing and analytics frameworks, highlighting how these elements collectively support real-time, scalable video understanding. The second part focuses on emerging trends that define the trajectory of next-generation VBDA platforms, including hybrid edge–cloud architectures, Edge AI, FL, and AI-driven semantic indexing. Together, these innovations contribute to the development of intelligent, privacy-aware, and resource-efficient systems capable of meeting the growing demands of dynamic, high-throughput video environments.

3.1. System-Level Architectural Models in VBDA

Building on the evolutionary trends outlined earlier, this section synthesizes the VBDA architecture landscape into four core system-level models: Centralized, Cloud-Centric, Edge Computing, and Hybrid Cloud–Edge. These models reflect the prevailing deployment strategies in contemporary systems, each characterized by specific trade-offs in latency, scalability, privacy, cost, and complexity. They serve as foundational blueprints for designing and optimizing VBDA systems across diverse operational contexts. Table 2 summarizes a structured comparison of these architectures, outlining their respective strengths and limitations in real-world application scenarios.
Centralized architectures offer simplicity and low cost, making them suitable for small-scale or archival VBDA systems. However, they rely on a monolithic design with single-node processing, resulting in poor scalability and high latency. A critical risk is the Single Point of Failure (SPOF), where the failure of a central node disrupts the entire system. These architectures are best for static environments with low-performance demands [6,100].
Cloud-Centric architectures leverage elastic computing and storage, supporting large-scale analytics and model training. While powerful, their dependency on remote data transfer leads to latency and privacy challenges, limiting their use in real-time scenarios. As shown by Chen et al. (2021) [101] and Xu et al. (2020) [102], they are most effective for batch analytics, archival processing, and AI training pipelines.
Edge computing architectures enable low-latency, privacy-preserving analytics by performing inference close to data sources. Ideal for real-time applications (e.g., smart surveillance), they reduce bandwidth usage and ensure timely responses. However, edge devices face hardware constraints and synchronization complexity [104,105], requiring lightweight models and careful orchestration.
Hybrid cloud–edge architectures integrate the advantages of both paradigms, supporting real-time processing at the edge while utilizing the cloud for training, storage, and system orchestration. They support FL and dynamic workload allocation, making them highly scalable and adaptive [107,109]. While complex to deploy, they offer the best trade-off for future-proof VBDA systems across diverse and dynamic environments: their adaptability, scalability, and built-in support for privacy-preserving, AI-driven video analytics make them the most promising foundation for future intelligent video systems.

3.2. VBDA Core Technologies

Next-generation VBDA systems rely on integrated technological stacks to achieve real-time performance, intelligent interpretation, and secure, scalable data management. As video data continues to grow in volume, velocity, and variety, system architectures must evolve to handle end-to-end processing needs from ingestion and distributed computation to semantic understanding and visualization. This section provides a focused analysis of five critical components underpinning modern VBDA platforms: scalable data processing frameworks, deep learning-based video analytics, FL combined with edge AI, advanced storage and retrieval infrastructures, and visualization tools for system interpretability. These technologies serve as the core foundation for intelligent VBDA pipelines, enabling low-latency inference, model coordination across devices, and privacy-conscious data management across diverse domains.

3.2.1. Scalable Big Data Processing Frameworks

The transition from batch-oriented to real-time stream processing has fundamentally reshaped the architecture and responsiveness of VBDA systems. As summarized in Table 3, Apache Kafka has established itself as a cornerstone for real-time ingestion and message queuing, offering high throughput, scalability, and robust fault tolerance through its distributed architecture [110]. Apache Spark remains a widely adopted engine, valued for its ability to unify batch and micro-batch processing, thus supporting scalable analytics across both historical and near-real-time video streams [10].
Apache Flink has gained traction for its native support of true stream processing and stateful computations, enabling low-latency analytics in applications such as anomaly detection, traffic monitoring, and intelligent surveillance [11]. Complementing these systems, Apache NiFi is increasingly used to orchestrate end-to-end VBDA pipelines, offering low-code automation for ingesting, transforming, and routing video data between edge and cloud components [113].
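To ground these pipelines, the following sketch shows a Spark Structured Streaming job consuming a hypothetical video-event topic from Kafka and aggregating events per camera; the broker address, topic, and schema are assumptions for illustration.

```python
# Minimal sketch of a Spark Structured Streaming job consuming a Kafka
# topic of video events; broker, topic, and schema are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("vbda-stream").getOrCreate()

schema = StructType([
    StructField("camera_id", StringType()),
    StructField("frame_index", LongType()),
    StructField("timestamp", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "video-frame-events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Count events per camera in near real time and print to the console sink.
query = (
    events.groupBy("camera_id").count()
    .writeStream.outputMode("complete").format("console").start()
)
query.awaitTermination()
```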
Although Hadoop MapReduce previously played a central role in batch video analytics, its relevance has declined in modern VBDA pipelines due to inherent latency and a lack of flexibility. It is now primarily used for back-end archival processing and long-term video storage, rather than front-line real-time inference [30,107].
More recently, Apache Pulsar (https://pulsar.apache.org, accessed on 18 May 2025) has emerged as a potential alternative or complement to Kafka. Its support for multi-tenancy, geo-replication, and a decoupled storage–compute architecture makes it suitable for hybrid cloud–edge deployments. Although its adoption in VBDA remains limited as of 2024, its architectural flexibility positions it as a promising component in next-generation, AI-integrated video analytics systems [10].

3.2.2. AI-Driven Video Analytics

AI models play a pivotal role in enabling spatial, temporal, and semantic understanding of complex video data. As detailed in Table 4, a wide range of deep learning architectures has been employed in VBDA from 2018 to 2025, reflecting significant evolution in both model design and application scope. Among object detection models, YOLO and R-CNN variants remain the most widely adopted. YOLO is favored in latency-sensitive applications owing to its real-time inference capabilities. In contrast, R-CNN models, though more computationally demanding, are preferred in scenarios requiring high detection precision, such as security surveillance and edge analytics [51].
For temporal modeling, LSTM-based architectures have gradually been supplanted by Transformer models, which demonstrate superior capabilities in capturing long-range dependencies, multi-modal learning, and contextual scene understanding [130]. These models are now commonly used for video summarization, action recognition, and event detection across heterogeneous VBDA workloads. In parallel, self-supervised learning (SSL) techniques such as contrastive learning are gaining prominence for their ability to reduce reliance on labeled data [132]. SSL enables systems to extract meaningful features from raw video streams, making it highly relevant for tasks such as anomaly detection, motion prediction, and behavior recognition, especially under privacy constraints where the annotation is costly or infeasible.
Generative models, including GANs and diffusion-based architectures, contribute to simulation, rare event reconstruction, and data augmentation. For instance, retrieval-augmented generation techniques have been shown to improve the realism and contextual relevance of synthesized video sequences in simulation-based training and safety-critical environments [31]. Additionally, vision–language models (VLMs) such as CLIP, VideoBERT, and BLIP are now integral to cross-modal VBDA tasks [81,135,136]. These models enable zero-shot video retrieval, caption generation, and semantic alignment between visual and textual modalities, supporting applications in forensic search, intelligent surveillance, and human-AI interaction [32].
Beyond mainstream models, the integration of multi-object tracking (MOT), SSL-enhanced anomaly detection, and hybrid spatiotemporal architectures points to emerging directions in adaptive video intelligence [5]. These techniques provide deeper behavioral insight, enabling more robust reasoning in dynamic environments.
Overall, the AI landscape in VBDA demonstrates a clear shift from conventional deep learning models to more flexible, multi-modal, and privacy-aware architectures. This transition aligns with the design principles of the proposed ViMindXAI platform, which emphasizes scalable learning, decentralized intelligence, and explainable analytics across diverse video-driven domains.

3.2.3. Federated Intelligence and Edge AI for Distributed VBDA

The integration of FL and Edge AI has become a cornerstone for building scalable, privacy-aware intelligence in distributed VBDA systems. Rather than sending raw video to centralized servers, FL allows model training to occur locally at the edge nodes where the data is captured, sharing only model updates. This decentralized strategy minimizes bandwidth usage, enhances responsiveness, and aligns with strict data privacy regulations. As a result, it is especially well-suited for sensitive, high-volume applications like smart surveillance, healthcare monitoring, and urban mobility analytics.
As summarized in Table 5, FedAvg remains the most widely adopted FL paradigm due to its simplicity and communication efficiency [137]. However, it exhibits suboptimal performance in environments characterized by non-IID data, a common scenario in real-world video deployments. To mitigate this, improved variants such as FedProx have been proposed, offering better convergence and robustness under heterogeneous data distributions. Meanwhile, FL platforms such as FedML and TensorFlow Federated (TFF) provide scalable toolkits for deploying distributed VBDA applications with modular design and cross-platform compatibility [138].
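The core aggregation step of FedAvg is simple enough to sketch directly: per-layer client weights are averaged in proportion to local sample counts, as in the toy example below, where the clients and data sizes are placeholders.

```python
# Minimal sketch of FedAvg-style aggregation: client updates are averaged
# weighted by local sample counts; clients and sizes are toy stand-ins.
import numpy as np

def fed_avg(client_weights: list, client_sizes: list) -> list:
    """Aggregate per-layer weights, weighting each client by its data size."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Three edge nodes with one weight matrix each (toy example).
clients = [[np.random.randn(4, 4)] for _ in range(3)]
sizes = [1200, 300, 500]  # non-IID: nodes hold very different amounts of video
global_weights = fed_avg(clients, sizes)
print(global_weights[0].shape)  # (4, 4)
```

Variants such as FedProx modify the local objective rather than this server-side step, adding a proximal term that keeps client updates close to the global model under heterogeneous data.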
Efficient inference at the edge requires advanced model compression techniques to accommodate resource-constrained hardware. Methods such as pruning, quantization, and knowledge distillation have proven effective in reducing model size while preserving inference accuracy [3]. Edge devices, including NVIDIA Jetson, Google Coral TPU, and Intel Movidius, are frequently employed in VBDA systems to support real-time tasks such as object detection, anomaly recognition, and behavior analysis [5].
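Two of these compression steps are available as standard PyTorch utilities, as the brief sketch below illustrates with L1 unstructured pruning followed by post-training dynamic quantization; the model and pruning ratio are illustrative.

```python
# Sketch of two compression steps named above, using standard PyTorch
# utilities; the toy model and the 30% pruning ratio are illustrative.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Prune 30% of the smallest-magnitude weights in each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Quantize Linear layers to int8 for faster CPU inference at the edge.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized(torch.randn(1, 512)).shape)  # torch.Size([1, 10])
```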
Nonetheless, VBDA systems continue to face persistent challenges, particularly in model synchronization, resource heterogeneity, and orchestration across distributed infrastructures. Furthermore, workload balancing is complicated by diverse computational capabilities, memory footprints, and energy constraints across nodes. To address these issues, recent developments have focused on lightweight orchestration frameworks and container-based deployment tools such as KubeEdge and NVIDIA FLARE, which enable dynamic scheduling, resource-aware task allocation, and seamless edge–cloud coordination [8].

3.2.4. Storage and Indexing in VBDA Systems

Efficient storage and indexing are critical to managing the scale, speed, and variety of video data in VBDA systems. Contemporary solutions favor hybrid architectures that blend data lakes with data warehouses, offering both flexibility and structured querying. In parallel, AI-driven indexing techniques are increasingly used to automate metadata generation, support compression, and enable scalable organization of video content. As shown in Table 6, these advances are designed to provide low-latency access, reduce storage costs, and support the dynamic requirements of large-scale video analytics.
Data lakes serve as the backbone of scalable video storage in VBDA systems, offering the flexibility to ingest diverse and unstructured content, including video streams, sensor logs, and metadata. Hadoop HDFS has played a central role due to its robustness and batch-processing capabilities, and it is cited in many of the systems surveyed [11]. However, its high-latency access and rigid structure have led to a gradual transition toward cloud-native object storage platforms such as Amazon S3, Google Cloud Storage, and Azure Data Lake. These services provide elastic scalability and seamless integration with big data pipelines and AI workloads [7]. In more localized or hybrid settings, lightweight solutions such as MinIO present S3-compatible APIs with lower overhead, making them well suited for edge-centric deployments [11].
For managing dynamic and schema-flexible metadata, NoSQL databases such as MongoDB, Cassandra, and HBase are commonly used to support distributed ingestion and real-time querying. In parallel, graph databases like Neo4j and TigerGraph are increasingly adopted for modeling semantic relationships between objects, scenes, and events. These models enable sophisticated reasoning and contextual analytics, especially in surveillance and smart city scenarios [145].
Traditional data warehouses continue to play a vital role in structured analytics and metadata management. PostgreSQL is favored for its support of JSONB, spatial queries, and compatibility with tools such as Grafana and Superset, while MySQL remains useful for lightweight, transactional operations [11,108].
Emerging data lakehouse architectures, including Delta Lake and Apache Iceberg, bridge the gap between flexible data lakes and structured data warehouses. They support ACID transactions, schema evolution, and real-time ingestion, while integrating tightly with processing engines such as Spark and Flink. These capabilities make them well-suited for hybrid and federated VBDA systems that require both agility and consistency [7].
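A minimal sketch of this lakehouse pattern, assuming the delta-spark package is installed, appends frame metadata to a Delta table with ACID guarantees; the table path and columns are illustrative.

```python
# Sketch of appending frame metadata to a Delta Lake table with PySpark,
# assuming the delta-spark package; path and columns are illustrative.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("vbda-lakehouse")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

rows = [("cam-01", 1024, "person", 0.91)]
df = spark.createDataFrame(rows, ["camera_id", "frame_index", "label", "score"])

# ACID append; the schema is enforced and can evolve under explicit options.
df.write.format("delta").mode("append").save("/data/vbda/frame_metadata")

spark.read.format("delta").load("/data/vbda/frame_metadata").show()
```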
AI-driven indexing frameworks are transforming video data management by enabling efficient organization, retrieval, and interpretation of content. Technologies such as CLIP, VideoBERT, BLIP, and FAISS [146] support semantic tagging, cross-modal retrieval, and zero-shot learning. When combined with platforms like Elasticsearch or RDF-based engines, these tools enable scalable, multimodal analytics across a variety of domains, including healthcare monitoring, autonomous vehicles, traffic analysis, and public safety [144,145].
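As an illustration of embedding-based indexing, the sketch below builds a FAISS inner-product index over placeholder frame embeddings and retrieves the nearest neighbors of a query vector; in practice the vectors would come from a model such as CLIP rather than a random generator.

```python
# Sketch of building a FAISS index over (hypothetical) frame embeddings for
# approximate semantic search; vectors here are random placeholders.
import faiss
import numpy as np

dim = 512  # CLIP ViT-B/32 embedding size
embeddings = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(embeddings)  # cosine similarity via inner product

index = faiss.IndexFlatIP(dim)
index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")  # stand-in text embedding
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)
print(ids[0], scores[0])  # top-5 most similar frames
```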
In summary, VBDA storage architectures have transitioned from traditional structured warehouses and standalone data lakes to integrated data lakehouse platforms supported by AI-driven indexing. Each layer, including NoSQL systems, graph databases, and cloud-native object stores, addresses specific requirements for scalability, flexibility, and semantic comprehension. This convergence equips modern VBDA systems with the capability to perform low-latency, real-time analytics while ensuring privacy-aware video intelligence across distributed and heterogeneous environments.

3.2.5. Visualization and Interpretability

Table 7 presents an overview of the key visualization tools adopted in VBDA systems from 2018 to 2025, highlighting their integration scopes, functional roles, and typical application contexts. These tools support tasks ranging from real-time monitoring and operational oversight to statistical analysis, model debugging, and interactive system feedback.
Among enterprise solutions, Tableau and Power BI are widely used to deliver interactive dashboards for performance tracking and system health assessment [149]. Alternatives such as Qlik Sense and Apache Zeppelin (https://zeppelin.apache.org, accessed on 14 May 2025) are favored in scenarios requiring associative exploration and research-oriented debugging [112]. For time-series data and system logs, Grafana has gained popularity due to its flexible dashboarding, alerting features, and integration with Prometheus and InfluxDB [100], while Kibana remains prevalent in Elasticsearch-based environments [113].
In experimental and academic settings, Python libraries like Matplotlib, Seaborn, and Plotly enable customizable visualization of detection outputs, performance metrics, and training dynamics. Emerging visualization interfaces, including AR dashboards and 3D spatial renderings, are also being explored in high-risk domains such as disaster response and urban surveillance [117]. Although still limited in adoption, these immersive tools hold promise for improving spatial awareness and situational interpretation. Moving forward, the integration of explainable AI (XAI) elements such as heatmaps and semantic overlays into real-time dashboards is expected to enhance interpretability, human-AI interaction, and trust in automated VBDA systems.
Figure 2 illustrates the most commonly adopted technologies across VBDA platforms between 2018 and 2025, highlighting a clear progression toward scalable and intelligent video analytics. At the data processing layer, Apache Kafka and Apache Spark stand out as dominant frameworks, each adopted by approximately 11% of the 24 surveyed systems, owing to their robustness in managing distributed stream and batch workloads. In the AI-driven analysis layer, YOLO (9%) and LSTM (8%) are widely utilized for object detection and activity recognition, while emerging approaches such as CLIP and Transformer-based models are increasingly applied to multimodal video understanding. At the edge–cloud orchestration level, the use of FL with FedAvg (6%) and deployment via KubeEdge (6%) reflects a growing emphasis on privacy-preserving, low-latency computation at the network edge. For data management, Elasticsearch (6%) and MongoDB (3%) are frequently employed to support scalable indexing and semantic retrieval. Visualization remains a critical component, with tools like Tableau facilitating real-time dashboards for informed decision-making. Notably, the most prevalent technologies in these layers, including Kafka, Spark, YOLO, LSTM, FedAvg, and Elasticsearch, collectively appear in more than 50% of the surveyed platforms. This reinforces their foundational role in the current VBDA ecosystem.
Synthesizing the key findings from Section 2 and Section 3, we identify several persistent challenges in the VBDA landscape:
  • Latency and Scalability: Centralized cloud-centric designs often struggle with real-time analytics under high video throughput and geographically dispersed deployments.
  • Data Heterogeneity: Multi-source video data varies in encoding formats, resolutions, frame rates, and modalities (e.g., visual, thermal, audio), posing standardization and fusion challenges.
  • Privacy and Federated Learning: FL adoption in video domains remains limited due to non-IID data distribution, communication overhead, and explainability concerns.
  • Semantic Understanding: While progress has been made in video understanding using AI models, semantic indexing, multimodal fusion, and real-time reasoning remain underdeveloped in practice.
  • Deployment and Resource Management: A lack of modular, open-source reference platforms hinders rapid deployment, especially in resource-constrained edge settings.
These challenges are synthesized from our survey of both the academic and industry literature between 2018 and 2025. To address them holistically, the following section introduces ViMindXAI, a hybrid, modular VBDA platform designed to operationalize cloud–edge coordination, semantic video analytics, and privacy-preserving learning in real-world scenarios.

4. ViMindXAI: A Scalable and Cognitive AI Platform for VBDA

4.1. Overview of the Proposed Platform Architecture

The proposed ViMindXAI platform introduces a next-generation hybrid architecture for VBDA as illustrated in Figure 3. The name ViMindXAI reflects the platform’s core design: “Vi” for video-centric data, “Mind” for semantic analysis and adaptive intelligence, “X” for extended scalability and innovation, and “AI” for the end-to-end use of artificial intelligence, from edge inference to federated learning and explainable reasoning. It seamlessly integrates edge and cloud computing to meet critical requirements in scalability, low-latency inference, and privacy-conscious processing. Unlike conventional centralized or cloud-only approaches, ViMindXAI enables intelligent workload orchestration, executing real-time analytics at the edge, utilizing the cloud for large-scale storage and deep model training, and deploying on-premise infrastructure for security-critical and context-sensitive operations.
ViMindXAI employs a unified Hybrid AI architecture that combines Edge AI, FL, and adaptive video storage to support essential VBDA tasks such as semantic understanding, anomaly detection, and privacy-aware surveillance. The platform is organized into four layers: the Data Acquisition Layer (DAL), the Data Infrastructure Layer (DIL), the Data and Application Layer (DaAL), and the User Experience Layer (UEL). Each layer incorporates mature and widely adopted technologies that align with prevailing trends in the VBDA ecosystem. Notably, foundational components such as Apache Kafka, Spark, YOLO, LSTM, Elasticsearch, and FedAvg, which appear in more than 50% of surveyed platforms, constitute the technological core of the ViMindXAI architecture, as illustrated in Figure 2. This strategic alignment ensures compatibility with modern deployment environments, reinforcing the platform’s scalability, resilience, and applicability across diverse domains, including smart cities, healthcare, and content moderation.

4.2. Layered Architecture and Functional Components

4.2.1. Data Acquisition Layer (DAL)

The DAL serves as the foundational tier of the ViMindXAI platform. It is responsible for ingesting, integrating, and preprocessing video data from diverse sources, including surveillance cameras, drones, dashcams, IoT sensors, and streaming services. Its main objective is to support low-latency, on-device filtering before forwarding data to higher layers, thereby reducing bandwidth usage and enhancing system responsiveness.
The DAL comprises two main modules: Data Collection and Edge AI Preprocessing and Optimization. The first module aggregates multi-source video and sensor inputs into a unified stream. The second module executes real-time inference, denoising, compression, and metadata extraction using lightweight models deployed on edge hardware such as NVIDIA Jetson Nano/Xavier, Intel Neural Compute Stick (NCS2), and Google Coral TPU. These embedded devices provide sufficient GPU/TPU acceleration for on-device AI tasks, enabling scalable and cost-effective inference in resource-constrained environments.
These processes are accelerated by stream engines like Apache Flink and Spark Streaming. AI functionalities in this layer enable real-time event detection, intelligent caching, and dynamic prioritization of critical segments. FL is embedded to support local model training and inference, enabling privacy-preserving analytics at the edge.
Technologies such as Apache Kafka and NiFi manage data ingestion, while Flink and Spark handle distributed stream computation. ONNX ensures interoperability across AI models. The output of the DAL, consisting of cleaned and enriched metadata, is forwarded to the Data Infrastructure Layer for indexing and persistent storage. It also feeds the FL module in the application layer to support collaborative updates.
In summary, the DAL streamlines early-stage video analytics by enabling efficient edge inference, reducing transmission overhead, and facilitating scalable ingestion from diverse video sources.
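To make this filter-and-forward pattern concrete, the following is a minimal sketch in which a lightweight ONNX detector runs on each frame at the edge and only compact metadata is published to a Kafka topic instead of raw video. The broker address, topic name, model file, stream URL, and detection output layout are illustrative assumptions, not components of the actual platform.

```python
# Minimal sketch of DAL-style edge filtering and metadata forwarding.
# Assumptions (not part of the ViMindXAI spec): a local ONNX detector with
# an [N, 6] output (confidence at index 4), a broker at broker:9092, and a
# hypothetical topic "video-metadata".
import json
import time

import cv2                       # pip install opencv-python
import numpy as np
import onnxruntime as ort        # pip install onnxruntime
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
session = ort.InferenceSession("detector.onnx")  # lightweight edge model
input_name = session.get_inputs()[0].name

cap = cv2.VideoCapture("rtsp://camera-01/stream")
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Preprocess to the model's assumed input (640x640 RGB, NCHW, [0, 1]).
    blob = cv2.resize(frame, (640, 640))[:, :, ::-1].astype(np.float32) / 255.0
    blob = np.transpose(blob, (2, 0, 1))[None]
    detections = session.run(None, {input_name: blob})[0]
    # Forward only frames with confident detections, sending compact
    # metadata rather than raw video to save uplink bandwidth.
    rows = detections.reshape(-1, detections.shape[-1])
    confident = [d for d in rows if d[4] > 0.5]
    if confident:
        producer.send("video-metadata", {
            "camera": "cam-01",
            "ts": time.time(),
            "num_objects": len(confident),
        })
producer.flush()
```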

4.2.2. Data Infrastructure Layer (DIL)

The DIL constitutes a central component of the ViMindXAI architecture, offering a scalable and intelligent foundation for storing, organizing, and retrieving large-scale video data. Tailored to the unique requirements of VBDA, this layer combines cloud and on-premises storage with real-time metadata indexing and AI-enhanced retrieval pipelines, thereby enabling high-throughput, low-latency, and semantically aware search capabilities. The hardware foundation supporting this layer includes cloud-native storage services such as AWS EC2/S3 and GCP Cloud Storage, on-premises clusters with NVMe SSD arrays for high-speed access, and GPU-accelerated nodes such as the NVIDIA A100 and T4 for semantic indexing and batch analytics. RAID configurations and hardware-assisted encryption modules ensure data redundancy, fault tolerance, and security at rest. The DIL is composed of three main modules: AI-Driven Storage Orchestration, Data Lakehouse, and Semantic Indexing.
(1) 
AI-Driven Storage Orchestration
This module is a core component of the ViMindXAI platform, designed to manage large-scale video data intelligently, efficiently, and in compliance with privacy regulations. It orchestrates data storage across edge, cloud, and on-premise systems, ensuring high availability, scalability, and regulatory alignment in dynamic VBDA environments. This module comprises three tightly integrated layers: Big Data Storage, AI-driven Storage Optimization, and Privacy-Aware Synchronization.
  • Big Data Storage: establishes a hybrid, fault-tolerant infrastructure for ingesting and managing video data across diverse environments. Cloud services like Amazon S3 offer elastic scalability, while on-premise platforms such as MinIO and Hadoop HDFS ensure secure, policy-compliant storage. At the edge, Ceph supports low-latency buffering and ingestion. Efficient video encoding formats such as MPEG-4, H.264, and H.265 are integrated into the pipeline to reduce transmission load and long-term storage costs, especially in surveillance and healthcare contexts [150]. Adaptive compression is employed to align bitrate with network and storage constraints [151]. This multi-layered storage backbone supports lifecycle management, distributed retention, and efficient data provisioning for downstream analytics within Lakehouse architectures.
  • AI-Driven Storage Optimization: leverages reinforcement learning to automate storage tiering decisions, guided by observed usage patterns and anticipated access frequency. This enables dynamic tiering across hot, warm, and cold layers, balancing performance and cost-efficiency. High-speed caching with Redis accelerates access to frequently queried data, while observability tools like Grafana, MLFlow, and Prometheus monitor system performance and guide optimization. To enable intelligent storage tiering, we integrate a Deep Q-Network (DQN)-based reinforcement learning agent [152]. The agent interacts with the environment to learn optimal retention policies, adapting dynamically to changes in data access frequency, latency requirements, and cost-performance trade-offs. This allows ViMindXAI to proactively allocate resources across hot, warm, and cold tiers under varying workload conditions (a simplified tiering-agent sketch follows after this list).
  • Privacy-Aware Synchronization: facilitates secure and regulation-compliant data transfers across distributed environments, maintaining confidentiality and integrity in multi-tenant deployments. It employs federated synchronization to preserve data locality and prevent unnecessary exposure of sensitive content. Containerized deployment via Docker and Kubernetes ensures portability and scalability, while Apache Zookeeper and secure API gateways manage access control and coordination in multi-tenant setups. This layer is instrumental in ensuring compliance with data protection regulations, including GDPR and HIPAA.
AI-Driven Storage Orchestration equips ViMindXAI with adaptive, intelligent, and regulation-compliant video data management capabilities, essential for large-scale operations in heterogeneous and privacy-sensitive settings.
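To make the tiering logic concrete, the following is a minimal sketch of such an agent, using tabular Q-learning as a simplified stand-in for the DQN described above. The discretized states, actions, and cost weights are illustrative assumptions rather than parameters of the actual platform.

```python
# Simplified tabular Q-learning stand-in for the DQN tiering agent.
# States: discretized access frequency; actions: target storage tier.
# All cost weights and the transition model are illustrative assumptions.
import random

TIERS = ["hot", "warm", "cold"]              # actions
STATES = ["rare", "occasional", "frequent"]  # discretized access frequency
ACCESS_COST = {"hot": 1.0, "warm": 3.0, "cold": 10.0}  # latency penalty/access
STORE_COST = {"hot": 5.0, "warm": 2.0, "cold": 0.5}    # cost per period
ACCESSES = {"rare": 0.1, "occasional": 1.0, "frequent": 10.0}

Q = {(s, a): 0.0 for s in STATES for a in TIERS}
alpha, gamma, eps = 0.1, 0.9, 0.2

def reward(state, tier):
    # Negative total cost: latency paid per expected access, plus storage.
    return -(ACCESSES[state] * ACCESS_COST[tier] + STORE_COST[tier])

for episode in range(5000):
    s = random.choice(STATES)
    if random.random() < eps:                       # epsilon-greedy exploration
        a = random.choice(TIERS)
    else:
        a = max(TIERS, key=lambda t: Q[(s, t)])
    r = reward(s, a)
    s_next = random.choice(STATES)                  # access patterns drift
    best_next = max(Q[(s_next, t)] for t in TIERS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

policy = {s: max(TIERS, key=lambda t: Q[(s, t)]) for s in STATES}
print(policy)  # expected: frequent -> hot, rare -> cold
```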
(2) 
Data Lakehouse
This module offers a unified architectural framework that merges the flexibility of data lakes with the performance and consistency of traditional data warehouses. It enables ViMindXAI to manage and analyze large-scale, heterogeneous video and sensor data across multiple formats and modalities. This hybrid architecture enables both streaming and batch analytics, supporting structured metadata processing, semantic enrichment, and long-term data archival. Organizing data into logical zones such as raw, staging, and curated layers supports reproducible ETL workflows, governance, and traceability.
Representative technologies such as Delta Lake and Apache Iceberg are leveraged to implement this architecture. Delta Lake, optimized for Spark-based workloads, provides ACID transactions, time travel, and schema enforcement, making it ideal for structured processing and model training pipelines. In contrast, Apache Iceberg offers native support for multiple query engines, such as Spark, Flink, and Trino, and is well-suited for federated environments with complex partitioning, schema evolution, and snapshot isolation. Together, these technologies provide ViMindXAI with a scalable, interoperable, and regulation-compliant data infrastructure designed for both real-time and historical video analytics.
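As an illustration of this pattern, the sketch below writes per-segment video metadata to a Delta table and reads back an earlier snapshot via time travel. The table path and metadata schema are illustrative assumptions, not part of the ViMindXAI specification.

```python
# Sketch of the lakehouse pattern with Delta Lake (pip install delta-spark).
# Table path and schema are illustrative assumptions.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("vbda-lakehouse")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Curated-zone table of per-segment video metadata with schema enforcement.
events = spark.createDataFrame(
    [("cam-01", "2025-01-01T00:00:00", "person", 0.93)],
    ["camera_id", "ts", "label", "confidence"],
)
events.write.format("delta").mode("append").save("/lake/curated/video_events")

# ACID reads plus time travel for reproducible training snapshots.
snapshot = (spark.read.format("delta")
            .option("versionAsOf", 0)
            .load("/lake/curated/video_events"))
snapshot.show()
```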
(3) 
Semantic Indexing
This module interfaces with Delta Lake to provide ACID-compliant storage and supports both batch and streaming workloads. In large-scale deployments, selecting technologies for semantic indexing requires balancing trade-offs among performance, scalability, and infrastructure complexity. FAISS offers fast, memory-efficient indexing for on-premise environments but lacks native support for distributed execution or elastic scaling, necessitating custom orchestration in multi-node setups. CLIP enhances semantic precision through strong vision-language alignment but incurs high computational costs for embedding generation, often requiring GPU clusters or inference acceleration frameworks in large-scale or real-time applications. Pinecone, as a cloud-native vector database, provides built-in elasticity and managed scalability, streamlining deployment. However, it introduces potential operational overhead, reliance on cloud infrastructure, and limited flexibility for hybrid or on-premise scenarios. These trade-offs inform ViMindXAI’s indexing architecture, enabling adaptation to varying workload demands, privacy constraints, and deployment environments.
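The following sketch illustrates the FAISS–CLIP combination discussed above: video frames are embedded with the public openai/clip-vit-base-patch32 checkpoint and indexed for text-to-frame retrieval. The frame file names and the example query are hypothetical.

```python
# Sketch of CLIP embedding + FAISS indexing for text-to-frame retrieval.
# Frame paths and the query string are illustrative assumptions.
import faiss                      # pip install faiss-cpu
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.open(p) for p in ["frame_001.jpg", "frame_002.jpg"]]
with torch.no_grad():
    img_inputs = processor(images=frames, return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)
img_emb = torch.nn.functional.normalize(img_emb, dim=-1).numpy()

# Cosine similarity via inner product over L2-normalized vectors.
index = faiss.IndexFlatIP(img_emb.shape[1])
index.add(img_emb)

with torch.no_grad():
    txt_inputs = processor(text=["a person loitering near an exit"],
                           return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_inputs)
txt_emb = torch.nn.functional.normalize(txt_emb, dim=-1).numpy()

scores, ids = index.search(txt_emb, k=2)  # top-2 matching frames
print(ids, scores)
```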

4.2.3. Data and Application Layer (DaAL)

The DaAL serves as the intelligent control center of the ViMindXAI architecture, responsible for real-time inference, event summarization, anomaly detection, and FL. Positioned between the infrastructure and user interface layers, it transforms indexed multimodal data into actionable insights through AI-driven orchestration. This design facilitates secure, explainable, and scalable analytics deployment across critical sectors, including healthcare, public safety, and smart city infrastructure. The hardware foundation of the DaAL includes GPU-accelerated cloud instances such as NVIDIA A100, V100, or T4, and on-premises inference servers equipped with Tensor cores or AI inference accelerators such as the Intel Habana Gaudi and AMD Instinct. Containerized orchestration is achieved via Kubernetes and KubeEdge, while distributed execution is supported through GPU clusters using Horovod or Ray. NVMe-based local storage and high-speed networking, such as RDMA and 10/40 GbE, ensure low-latency model serving and parallel computation across modalities. The DaAL consists of two core modules: Real-Time AI Video Analytics and AI-Driven Data Services (DaaS).
(1) 
Real-Time AI Video Analytics
This module performs real-time analytics using advanced deep learning models. It includes three functional groups:
  • Video Analytics Core: Executes real-time tasks such as object detection, tracking, and activity recognition using models like YOLOv11, DeepSORT, and EventNet.
  • Video Understanding and XAI: Generates contextual summaries and explainable insights through captioning and attention visualization using models such as VideoBERT, BLIP, and GradCAM.
  • FL and Privacy: Coordinates distributed model training and aggregation from edge devices, while preserving privacy using frameworks like FedAvg, TFF, and NVIDIA FLARE. To address non-IID data and client drift challenges in federated learning, an asynchronous FedAvg scheme enhanced with FedProx regularization is adopted as the coordination strategy. Model updates are managed through TFF, while edge orchestration is supported via KubeEdge. Privacy guarantees are reinforced using secure aggregation protocols and optional differential privacy, depending on the application context. While FedAvg and FedProx remain effective for scalable and moderately heterogeneous environments, we also evaluate more recent algorithms: MOON [153], which applies contrastive learning to enhance client representation consistency; SCAFFOLD [154], which uses control variates to mitigate client drift; and FedNova [155], which normalizes local updates to address training time imbalance. These methods are currently under consideration for integration into ViMindXAI’s FL stack, particularly in deployments with high skew or dynamic data heterogeneity.
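A minimal sketch of this coordination strategy is given below: local training adds a FedProx proximal term that penalizes drift from the global model, and the server performs sample-weighted FedAvg aggregation. The model, data loaders, and the value of the proximal coefficient mu are illustrative assumptions; all-float model parameters are assumed for the aggregation step.

```python
# Minimal sketch of FedAvg aggregation with a FedProx proximal term in
# the local objective. Model, data, and mu are illustrative assumptions.
import copy
import torch
import torch.nn as nn

def local_train(model, global_model, loader, mu=0.01, lr=0.01, epochs=1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    global_params = [p.detach().clone() for p in global_model.parameters()]
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            # FedProx: penalize drift from the current global model,
            # which mitigates client drift under non-IID data.
            prox = sum((p - g).pow(2).sum()
                       for p, g in zip(model.parameters(), global_params))
            (loss + 0.5 * mu * prox).backward()
            opt.step()
    n_samples = sum(len(y) for _, y in loader)
    return model.state_dict(), n_samples

def fed_avg(global_model, client_results):
    # Weighted average of client state dicts by local sample count
    # (assumes all parameters are float tensors).
    total = sum(n for _, n in client_results)
    avg = copy.deepcopy(client_results[0][0])
    for key in avg:
        avg[key] = sum(sd[key] * (n / total) for sd, n in client_results)
    global_model.load_state_dict(avg)
    return global_model
```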
(2) 
AI-Driven Data Services (DaaS)
This module delivers secure, intelligent, and regulation-compliant access to video big data, enabling scalable analytics and semantic interaction across the ViMindXAI platform. It includes three core components:
  • Semantic Video Interaction Engine (SVIE): This module supports natural language-based interaction, semantic summarization, and retrieval using foundation models such as CLIP, Whisper, GPT, BLIP, and VideoBERT. SVIE transforms video and audio streams into deep semantic representations, enabling users to query video segments through conversational language or multimodal prompts. By integrating with the semantic index, SVIE enables contextual exploration, smart captioning, and voice-assisted scene navigation, operating as the AI-logic bridge between data and user queries.
  • Semantic Index and Retrieval Manager: Responsible for embedding extraction, indexing, and query resolution, this manager leverages multimodal encoders such as CLIP and BLIP to generate dense vector embeddings from raw data streams. Indexing frameworks such as FAISS or Elasticsearch support approximate nearest neighbor search and multimodal retrieval, which are crucial for rapid access, recommendation, and cross-modal exploration. Through standard APIs (REST, GraphQL), the module enables scalable semantic video search (a query sketch appears at the end of this module).
  • Security and Access Control: Sensitive data is safeguarded through robust security mechanisms, including OAuth2, JWT, and RBAC, which enable fine-grained authentication and access control. It adopts a Zero Trust Security (ZTS) model and optionally uses hardware-based protections like Intel SGX for secure execution. Comprehensive audit logging ensures transparency, while access policies safeguard against misuse and ensure compliance.
  • Metadata and Compliance: This submodule oversees metadata tagging, contextual annotations (e.g., object type, location, identity), and regulatory adherence. It manages sensitive content through automated tagging and enforces data privacy standards such as GDPR, HIPAA, and CCPA. Tools like Apache Atlas and policy-based engines provide lineage tracking, compliance auditing, and enforcement of fine-grained access rules.
Collectively, these components form a robust application layer that converts infrastructure-level video data into secure, explainable, and semantically enriched services, bridging low-level data orchestration with real-time user interaction and AI reasoning.
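To illustrate the retrieval path managed by the Semantic Index and Retrieval Manager, the sketch below issues an approximate kNN query against an Elasticsearch 8.x index holding dense embeddings. The index name, field name, and query vector are hypothetical, and a dense_vector mapping with indexing enabled is assumed.

```python
# Sketch of an approximate kNN query against an Elasticsearch 8.x index
# of dense CLIP embeddings. Index and field names are assumptions.
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")
query_vector = [0.12, -0.03, 0.44]  # CLIP text embedding, truncated here

resp = es.search(
    index="video-segments",
    knn={
        "field": "clip_embedding",   # dense_vector mapping, index=True
        "query_vector": query_vector,
        "k": 10,
        "num_candidates": 100,
    },
    source=["camera_id", "start_ts", "caption"],
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```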
The DaAL orchestrates end-to-end video analytics by managing real-time AI workflows for tasks such as inference, anomaly detection, video summarization, and federated model training. Serving as a bridge between the infrastructure layer and user-facing components, it transforms indexed video data into actionable insights through adaptive dashboards, semantic search, and immersive AR/VR interfaces. Through the integration of XAI and privacy-preserving learning techniques, the analytics processes are rendered scalable, interpretable, and secure, aligning with the needs of high-stakes domains.

4.2.4. User Experience Layer (UEL)

The UEL serves as the principal interface between end-users and the ViMindXAI platform, facilitating access to intelligent video analytics services. It delivers multimodal, intuitive, and role-sensitive interfaces that allow users to perceive, explore, and act upon intelligent video analytics. Although the core computational logic resides in lower architectural layers, the UEL bridges these capabilities with user-facing insights, enabling informed decisions across domains such as public safety, healthcare, and operations. It is designed with modularity and cognitive adaptability in mind. To maximize usability across diverse environments, the UEL supports deployment on web, tablet, and AR/VR interfaces, requiring only minimal local inference. This ensures responsive and accessible user interaction, even in bandwidth-constrained or mobility-oriented scenarios. The UEL is structured around two core components: User Intelligence Interfaces (UII) and End-User Groups.
(1) 
User Intelligence Interfaces (UII)
This module includes three intelligent components designed to support user interaction and visualization:
  • Immersive Monitoring Interface: Provides real-time AR/VR-based visualizations for operational monitoring and training, utilizing platforms such as Unity, Meta Quest, and HoloLens. It supports environments that demand high spatial and contextual awareness.
  • Adaptive AI Dashboard: Presents personalized, role-specific dashboards with real-time updates and explainability features such as GradCAM and SHAP. Integrates with standard BI tools like Tableau, Grafana, Kibana, and Power BI.
  • Conversational Interaction: Enables natural-language access through seamless integration with the SVIE in the DaAL, allowing users to interact with analytics via conversational queries, contextual prompts, and voice-based commands. This facilitates intuitive human-AI communication across roles and contexts.
(2) 
End-User Groups
This component categorizes system users into Business, Individual, and Public Sector segments. Analytics are accessed via the UII, which dynamically adapts content presentation and functionalities according to user roles, while upholding strict privacy and security policies.
By interfacing with both the DaAL and the DIL, the UEL ensures that real-time analytics are presented in an explainable, personalized, and privacy-aware manner. This design supports forward-compatible deployment in environments requiring cognitive interaction and extended reality (XR), while ensuring transparency in AI behavior aligned with user needs.
To support real-time VBDA in cloud–edge environments, the ViMindXAI platform incorporates a carefully selected set of scalable, AI-enhanced technologies. These components are distributed across four architectural layers (Data Acquisition, Infrastructure, Application, and User Experience) to ensure optimal performance, security, and adaptability. Table 8 provides an overview of the representative technologies, chosen based on their technical maturity, community adoption, open source accessibility, and applicability to large-scale video data processing. The resulting technology stack balances innovation with deployment practicality, aligning with the platform’s scalability and real-world integration requirements.

4.3. Real-World Application Scenarios of ViMindXAI

This section introduces three case studies that demonstrate the real-world feasibility and domain adaptability of the ViMindXAI platform in smart buildings, healthcare, and industrial environments. Each scenario reflects practical requirements for low-latency processing, privacy preservation, and regulatory compliance. All case studies follow the four-layer architecture of ViMindXAI: edge data acquisition, hybrid data infrastructure, distributed analytics and learning, and user-facing outputs with explainability and compliance support. These examples highlight how the platform can be adapted to diverse operational contexts while maintaining modularity, scalability, and trustworthiness.

4.3.1. Case Study 1: Smart Building Surveillance and Access Management

ViMindXAI addresses the increasing demand for intelligent and privacy-aware surveillance in smart building environments. As IoT-enabled camera networks continue to expand, the resulting surge in real-time video streams necessitates analytics systems that are both scalable and compliant with privacy regulations. ViMindXAI adopts a modular edge–cloud architecture designed to support distributed threat detection, anomaly analysis, and semantic video retrieval across heterogeneous sources.
At the DAL, video streams are collected from IP cameras and processed locally on edge devices such as NVIDIA Jetson. Lightweight models like YOLOv8-tiny are used at this layer for early-stage filtering and entity detection, helping reduce transmission load. Preprocessing tasks, including frame selection and compression, further optimize bandwidth usage. Robust ingestion pipelines built on Apache Kafka and NiFi ensure reliable, fault-tolerant data flow to downstream components.
In the DIL, hybrid storage is orchestrated using MinIO and Delta Lake, organizing video data across logical zones to manage lifecycle stages. Semantic indexing is powered by CLIP and Whisper embeddings, stored in FAISS to support fast, vector-based retrieval. Federated synchronization strategies enable local storage autonomy while contributing to a shared global index under privacy-preserving constraints.
The DaAL executes core analytics workflows such as anomaly detection and person re-identification using advanced inference models. YOLOv11 is deployed here for high-precision object detection tasks in centralized pipelines, while Transformer and GNN architectures are employed for complex spatiotemporal reasoning. These models are updated through FL strategies like FedAvg. Explainable AI (XAI) modules provide transparent and interpretable outputs to support operational decisions.
Finally, the UEL delivers interactive dashboards via Grafana, enabling semantic search queries (e.g., “loitering near exit C”), spatial heatmaps, and real-time alerting. These tools support monitoring, incident response, and forensic analysis while enhancing trust through visual explanations.
By combining semantic intelligence, scalable infrastructure, and privacy-centric coordination, ViMindXAI delivers a practical and regulation-compliant platform for smart surveillance. Its architecture incorporates the most adopted technologies in modern VBDA systems, reinforcing its applicability in real-world deployments.

4.3.2. Case Study 2: Healthcare Monitoring and Patient Safety

In healthcare environments such as hospitals and long-term care facilities, continuous and intelligent video monitoring is essential for ensuring patient safety. Critical incidents, including falls, prolonged inactivity, and unsupervised movement, require immediate detection and response. At the same time, analytics systems must uphold strict privacy standards and operate efficiently in resource-constrained settings. ViMindXAI addresses these requirements by delivering real-time, federated, and privacy-aware video analytics tailored specifically for clinical deployments.
At the DAL, video streams from patient rooms and corridors are processed on edge devices such as Jetson Xavier and Coral TPU. Lightweight activity recognition models, including LSTM and 3D CNNs, are deployed via ONNX for hardware-agnostic inference, enabling the detection of behavioral indicators such as falls and wandering. In parallel, audio streams are preprocessed using Whisper to identify signs of vocal distress. Data ingestion is handled through Apache Kafka and Apache NiFi, ensuring consistent, low-latency, and fault-tolerant delivery from multiple endpoints.
The DIL establishes a secure and federated data backbone based on Delta Lake and MinIO. This hybrid architecture supports both local buffering and long-term cloud-integrated indexing, enabling scalable and privacy-conscious video storage. Semantic indexing leverages multimodal embeddings generated by CLIP and Whisper, capturing both visual and auditory context. These embeddings are indexed using FAISS to facilitate rapid, approximate similarity-based retrieval. Privacy-aware synchronization protocols govern data access and movement in alignment with HIPAA and GDPR.
Within the DaAL, ViMindXAI executes real-time anomaly detection and temporal pattern modeling to identify clinically significant risks, including prolonged inactivity, irregular movement, or repeated expressions of distress. These models are continuously refined using FL strategies such as FedAvg, allowing for localized model updates across hospital zones without exposing raw video data. XAI modules enhance interpretability by visualizing the spatial and temporal features that contribute to each decision, supporting transparent clinical validation and oversight.
The UEL provides HIPAA-compliant dashboards powered by Grafana and integrated with ViMindXAI’s semantic search engine. Clinicians can receive real-time alerts, issue natural-language video queries such as “falls in ward A after midnight,” and visualize mobility trends through interactive dashboards. Annotated video summaries and XAI overlays further support clinical review and facilitate timely, evidence-based decision-making.
ViMindXAI delivers a robust, semantically enriched, and privacy-conscious video analytics platform for smart healthcare monitoring. Its modular and AI-driven architecture supports real-time event detection, regulatory compliance, and transparent system behavior across all layers, making it a comprehensive solution for next-generation clinical safety systems.

4.3.3. Case Study 3: Industrial Safety and Compliance Monitoring

Ensuring occupational safety and regulatory compliance in industrial environments such as manufacturing plants and logistics hubs requires real-time detection of safety violations, including missing personal protective equipment (PPE), unauthorized access, and unsafe machine interactions. ViMindXAI offers a scalable, AI-powered video analytics platform that supports incident prevention, policy enforcement, and post-event auditing while maintaining privacy compliance.
At the DAL, video streams from fixed and wearable cameras are processed on-site using edge devices such as NVIDIA Jetson. PPE-aware object detection models, including YOLOv8-PPE, along with tracking algorithms like FairMOT, are deployed to detect violations such as helmet absence or unauthorized entry into restricted areas. Preprocessing modules running at the edge perform event tagging and filtering before transmitting selected data via Apache Kafka and NiFi pipelines.
The DIL employs a hybrid lakehouse architecture based on Delta Lake and MinIO to manage video streams, sensor metadata, and compliance logs. Semantic indexing is achieved by combining multimodal embeddings generated by CLIP and Whisper, enriched with contextual metadata such as zone identifiers, worker IDs, and violation types. These embeddings are indexed using FAISS for fast similarity-based search. Optionally, Neo4j is used to model spatiotemporal relationships between workers, machines, and incident locations. Federated synchronization ensures that data locality and privacy regulations are respected across distributed industrial sites.
Within the DaAL, advanced analytics tasks such as cross-shift pattern detection and violation prediction are performed using Transformer-based reasoning models and anomaly detectors. FL strategies, including FedAvg, allow for site-specific model personalization without exchanging raw data. Reinforcement learning techniques are used to dynamically optimize inspection resource allocation based on incident history and risk profiles. XAI modules enhance transparency by identifying the key features influencing each safety decision.
The UEL delivers interactive dashboards through tools such as Power BI, Grafana, or Kibana. These interfaces support semantic video search (e.g., “PPE violations near conveyor belt 3”), compliance heatmaps, and worker-specific safety profiles. Real-time alerts are integrated with enterprise notification systems to enable immediate response, while visual explanations generated by XAI modules assist safety officers in reviewing and validating flagged events.
Through its modular and privacy-aware architecture, ViMindXAI provides a robust and adaptable solution for industrial safety monitoring. By integrating real-time analytics, federated coordination, semantic indexing, and explainable auditing, the platform enhances operational resilience and ensures regulatory compliance in dynamic and high-risk industrial environments.

4.4. Deployment Considerations and Solutions

As VBDA systems move from prototype to production, real-world deployment introduces challenges related to latency, scalability, and data privacy. Ensuring robust performance across heterogeneous environments requires seamless coordination between AI models, storage infrastructures, and orchestration mechanisms. This section outlines the major deployment obstacles and the optimization strategies integrated into the ViMindXAI platform.

4.4.1. Practical Deployment Challenges

Deployment of scalable video analytics systems such as ViMindXAI in real-world environments faces multiple operational and architectural challenges. These span edge–cloud orchestration, FL, infrastructure compatibility, and data platform migration.
One of the most critical challenges is edge–cloud orchestration, where system responsiveness depends on minimizing inference latency while balancing computational load between edge devices and cloud servers. Application domains like surveillance, healthcare, and intelligent transportation operate under highly dynamic conditions: varying network bandwidth, heterogeneous device capabilities, and fluctuating workloads [108,117]. This necessitates the use of adaptive scheduling, resilient failover strategies, and real-time resource optimization mechanisms to maintain performance under variable constraints.
A second major issue is synchronization in FL. While FL is a privacy-preserving paradigm that avoids centralized raw data sharing, it introduces significant communication overhead, especially in multi-node environments. Additionally, the presence of non-IID (non-independent and identically distributed) data across devices leads to biased local model updates and degraded convergence [107,113]. Effective deployment requires asynchronous aggregation, model pruning, and gradient compression techniques to reduce communication bottlenecks and ensure scalable training.
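As a concrete example of the gradient compression technique just mentioned, the sketch below keeps only the largest-magnitude entries of a gradient tensor for transmission; the 1% keep ratio is an illustrative assumption.

```python
# Sketch of top-k gradient sparsification for FL communication reduction.
# The keep ratio is an illustrative assumption.
import numpy as np

def sparsify_topk(grad: np.ndarray, ratio: float = 0.01):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries."""
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of top-k entries
    return idx, flat[idx]                          # transmit sparse pairs only

def densify(idx, values, shape):
    """Server-side reconstruction of the sparse update into a dense tensor."""
    out = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    out[idx] = values
    return out.reshape(shape)

grad = np.random.randn(1024, 1024).astype(np.float32)
idx, vals = sparsify_topk(grad)            # ~10k of ~1M entries transmitted
restored = densify(idx, vals, grad.shape)  # reconstructed at the aggregator
```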
Another challenge lies in resource and bandwidth limitations at the edge. Most edge devices are not equipped to execute large-scale transformer-based models or multimodal inference in real-time. Moreover, streaming high-resolution video data imposes heavy pressure on upstream and inter-device bandwidth [7,112]. Addressing this requires careful model optimization, such as using quantized or distilled versions of core models, efficient codec settings, and adaptive frame skipping strategies.
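A minimal sketch of one such adaptive frame-skipping strategy is shown below: frames are fully decoded only at a stride derived from the current inference backlog. The thresholds and the RTSP source are illustrative assumptions.

```python
# Sketch of adaptive frame skipping at the edge: sample frames at a stride
# derived from the current inference backlog. Thresholds are assumptions.
import cv2  # pip install opencv-python

def stride_for(queue_depth: int) -> int:
    # Falling behind -> skip more frames; keeping up -> sample more densely.
    if queue_depth > 30:
        return 10
    if queue_depth > 10:
        return 5
    return 1

cap = cv2.VideoCapture("rtsp://camera-01/stream")
queue_depth, frame_idx = 0, 0
while cap.isOpened():
    if not cap.grab():              # cheap: advances stream without full decode
        break
    frame_idx += 1
    if frame_idx % stride_for(queue_depth) == 0:
        ok, frame = cap.retrieve()  # full decode only for sampled frames
        if ok:
            queue_depth += 1        # hand off to the inference queue
            # ... enqueue frame for the quantized edge model; the consumer
            # thread decrements queue_depth (omitted in this sketch) ...
```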
Finally, the shift toward modern Data Lakehouse architectures introduces migration complexities. Technologies like Delta Lake and Apache Iceberg require rethinking legacy ingestion pipelines, adopting new metadata strategies, and ensuring backward compatibility with existing BI and AI workflows. These changes, while beneficial in the long run, demand strong governance, incremental rollout, and organizational readiness for transformation.
Together, these deployment challenges highlight the importance of robust system design that accommodates dynamic environments, privacy constraints, and evolving data infrastructure, while ensuring the scalability and resilience needed for mission-critical video analytics systems.

4.4.2. Proposed Optimization Strategies

ViMindXAI addresses these challenges through a multi-tiered optimization strategy designed for real-world resilience:
  • Model Compression for Edge Inference: Techniques such as pruning, quantization, and knowledge distillation are applied to compress models without compromising accuracy. This enables real-time inference on edge hardware like NVIDIA Jetson and Intel Movidius for tasks such as anomaly detection and object tracking [100,134] (a compression sketch follows after this list).
  • Tiered and Intelligent Storage Architecture: The platform combines edge-level buffering, AI-driven caching, and cloud-based archiving. Semantic indexing using models like CLIP, FAISS, and VideoBERT enables rapid, context-aware retrieval [30,144,145], while tiered orchestration minimizes redundancy and optimizes bandwidth usage.
  • Metadata Optimization and Context-Aware Allocation: Rich metadata, including object categories, spatial context, and scene attributes, is used to prioritize content for storage and retrieval. This improves responsiveness and reduces data transfer volumes, especially in bandwidth-constrained scenarios [11,142].
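To ground the first strategy above, the following is a minimal sketch of a pruning-plus-quantization pipeline using standard PyTorch utilities on a stand-in model; the layer sizes and the 50% sparsity target are illustrative assumptions.

```python
# Sketch of pruning + dynamic quantization with standard PyTorch utilities.
# The stand-in model, layer sizes, and sparsity target are assumptions.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# 1) L1-magnitude pruning: zero out 50% of weights in each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the sparsity permanent

# 2) Dynamic int8 quantization of remaining weights for CPU/edge inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller and faster on edge CPUs
```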
Together, these strategies equip ViMindXAI with the capability to operate effectively in real-world deployments, delivering scalable, low-latency, and privacy-aware analytics. The system’s design supports modular extensibility, aligning with future advancements outlined in Section 5.

4.5. Comparative Analysis with Existing VBDA Platforms

To contextualize the architectural contributions of ViMindXAI, we conducted a comparative analysis against representative VBDA platforms: FALKON [118], SIAT [119], SurveilEdge [52], and VIDEX [133]. These systems span centralized, cloud-based, edge-centric, and hybrid deployments and are widely cited in the literature. Our comparison focuses on architectural depth, semantic capability, explainability, and integration of emerging AI paradigms such as FL, XAI, and multimodal indexing. Table 9 summarizes this evaluation across ten core criteria, categorized by the functional layers of ViMindXAI (DAL, DIL, DaAL, UEL) and mapped to architectural types introduced in Section 3.
In contrast to FALKON’s centralized batch-based architecture or SurveilEdge’s edge-focused object detection, ViMindXAI adopts a layered, hybrid cognitive framework. The DAL incorporates lightweight models (e.g., YOLOv8-tiny), preprocessing, and embedded FL at the edge, surpassing the ingestion-only approach in SIAT or the static logic in VIDEX. The DIL leverages an AI-optimized Lakehouse architecture, integrating orchestration tools (e.g., Delta Lake, Apache Iceberg), semantic indexing (e.g., CLIP, VideoBERT), and GraphDB-based reasoning, features absent in the compared platforms. In the DaAL, ViMindXAI integrates explainability (e.g., GradCAM, SHAP) with coordinated model management using TFF and NVIDIA FLARE, supporting privacy-preserving learning and accountability. The UEL enables immersive, adaptive dashboards and LLM-powered interfaces for semantic video interaction, extending far beyond the static dashboards found in prior systems.
Although Table 9 offers a qualitative comparison, the architectural design of ViMindXAI is grounded in empirical evidence from recent federated VBDA systems. Kingori et al. [156] demonstrated that a hybrid VAE–SURF model, deployed using federated learning and edge computing, can achieve 25 ms latency and 40 fps throughput with only 2.5 GFLOPs per frame, highlighting its effectiveness for real-time anomaly detection on edge devices. Similarly, Li [157] introduced Dashlet, a cross-layer video streaming system that predicts user swipe behavior to optimize prefetching, achieving sub-100 ms latency improvements in live deployments.
Drawing on these insights, ViMindXAI incorporates edge-side inference using CLIP, asynchronous federated learning via FedAvg and FedProx, and orchestration through KubeEdge and TFF. These design choices enable real-time semantic search, scalable explainability, and low-latency responsiveness while maintaining practical deployability across diverse infrastructures. Although full-scale benchmarking is planned as part of future work in Section 6, the current architecture reflects a well-founded integration of proven strategies tailored to the performance and privacy demands of modern VBDA.
While ViMindXAI is in early-stage prototyping, all integrated components are mature, open source, and production-ready. The platform is designed for heterogeneous deployment environments via an edge–fog–cloud topology: edge nodes perform inference on devices like Jetson or Coral TPU; fog nodes handle caching and policy enforcement; the cloud layer oversees training and coordination. It supports microservice and master–slave orchestration patterns, with federated agents managing updates under privacy and bandwidth constraints. Network features like 5G/6G, SDN, and QoS are embedded in task scheduling and data routing, ensuring readiness for real-world, large-scale deployments.

5. Open Research Challenges and Future Directions

While ViMindXAI provides a promising architectural foundation and illustrates its applicability through conceptual case scenarios, its transition to real-world deployment still requires addressing several unresolved challenges. These include real-time scalability, privacy-preserving learning, explainability, and the integration of emerging technologies. Overcoming these limitations is essential to enable secure, efficient, and intelligent video analytics in high-demand and sensitive domains. These challenges not only shape the roadmap of ViMindXAI but also highlight persistent research gaps across the broader landscape of video analytics.

5.1. Scalability and Real-Time Processing

The exponential rise in video data from surveillance, healthcare, and smart city infrastructure has increased the demand for scalable, low-latency analytics. Cloud-centric architectures often suffer from latency and network bottlenecks [108,112], making them less suitable for real-time inference. While hybrid edge–cloud systems offer partial solutions, key challenges remain in adaptive inference scheduling, task offloading, and dynamic resource orchestration. Emerging strategies, such as reinforcement learning-based scheduling and predictive offloading, leverage context-aware optimization using network states and device profiles, but require further refinement to meet the stringent latency demands of live video analytics.
Edge device constraints further complicate deployment. Running deep models in resource-limited environments necessitates lightweight variants. Techniques such as quantization, pruning, and knowledge distillation [30] show promise, but need to be optimized for high-dimensional video streams without sacrificing performance. While next-generation infrastructure such as 6G may alleviate bandwidth limitations, coordinated co-design of AI models and network protocols will be essential. To this end, future versions of ViMindXAI will incorporate reinforcement learning-based schedulers and predictive offloading into the DAL and DaAL layers for context-aware resource management. ONNX-compatible lightweight models enhanced with quantization and distillation will support efficient edge inference under practical constraints.
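As a simple illustration of the scheduling and offloading decisions discussed above, the sketch below compares estimated end-to-end latency at the edge and in the cloud for a single inference request; all constants are illustrative assumptions rather than measured values, and a learned scheduler would replace this static rule.

```python
# Sketch of a latency-based edge-vs-cloud offloading heuristic of the kind
# the schedulers above would learn; all constants are illustrative.
def choose_site(frame_bytes: int, edge_infer_ms: float, cloud_infer_ms: float,
                uplink_mbps: float, edge_queue: int) -> str:
    """Return 'edge' or 'cloud' for this inference request."""
    transfer_ms = frame_bytes * 8 / (uplink_mbps * 1000)  # bytes -> ms uplink
    edge_total = edge_infer_ms * (1 + edge_queue)         # queueing penalty
    cloud_total = transfer_ms + cloud_infer_ms
    return "edge" if edge_total <= cloud_total else "cloud"

# A congested edge with a slow uplink keeps work local; a fast link offloads.
print(choose_site(frame_bytes=200_000, edge_infer_ms=45.0,
                  cloud_infer_ms=8.0, uplink_mbps=20.0, edge_queue=3))
```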

5.2. Security, Privacy, and Ethical AI

With the expansion of VBDA into privacy-sensitive sectors such as healthcare and public safety, data protection and responsible AI usage have become critical. FL offers a promising approach by enabling local model training without sharing raw video data. However, FL remains vulnerable to threats like model poisoning, gradient leakage, and degraded performance under non-IID data distributions [154]. Addressing these risks requires privacy-preserving techniques such as asynchronous FL, homomorphic encryption (HE), and zero-knowledge proofs (ZKPs). While effective, these methods often introduce high computational overhead. Secure multi-party computation (SMPC) and differential privacy (DP) offer additional safeguards but still face challenges in scaling, particularly in real-time or bandwidth-constrained deployments.
To strengthen privacy and trust, ViMindXAI embeds asynchronous FL with FedProx regularization in the DaAL layer to reduce client drift under non-IID conditions. Privacy is further reinforced via secure aggregation protocols and optional DP, depending on the sensitivity of the application. For highly sensitive domains like clinical video analysis, we plan to extend support for HE and SMPC. Beyond security, ViMindXAI addresses fairness by integrating inherently interpretable models and fairness auditing pipelines, enabling explainability-by-design in tasks such as identity recognition and behavioral analysis.

5.3. Integration of Emerging Technologies

Breakthroughs in blockchain, quantum computing, SSL, and generative AI are reshaping the evolving VBDA landscape. These technologies hold the potential to redefine system capabilities if integrated thoughtfully.
Blockchain for Secure Access Control and Metadata Logging: Blockchain can offer immutable logging, decentralized identity management, and transparent access auditing for video streams [144]. Smart contracts can enforce granular access policies, yet the latency and storage demands of public blockchains limit their use in real-time systems. Research on hybrid blockchain–cloud architectures could balance performance with data integrity and transparency.
Quantum Computing for Video Analytics: Quantum-enhanced algorithms hold promise for accelerating complex video processing tasks, including encryption, compression, and content-based search. However, practical limitations such as decoherence, error rates, and hardware immaturity still hinder adoption. Future directions should explore hybrid quantum-classical pipelines and quantum-friendly video representations.
Self-Supervised Learning and Vision Transformers: Annotation remains a major bottleneck in training video models. SSL methods such as masked video modeling, contrastive learning, and temporal prediction can learn meaningful representations from unlabeled data [7]. Coupled with ViTs, these approaches support scalable and label-efficient learning pipelines, with strong performance across retrieval, summarization, and activity recognition.
Generative AI for Video Understanding: Generative models like GANs and diffusion architectures enable applications such as scene completion, synthetic data generation, and anomaly simulation [144]. While these tools enhance model robustness, they raise concerns around deepfake misuse, high training costs, and instability. Future research should prioritize training efficiency, transparent evaluation benchmarks, and the development of safeguards for ethical deployment.
To support a pragmatic transition from conceptual design to real-world deployment, we propose a phased research roadmap. In the short term, our development efforts will focus on investigating real-time scalability and privacy-preserving learning. Specifically, we intend to incorporate reinforcement learning-based scheduling, predictive task offloading, and self-supervised learning techniques to enable label-efficient model training under constrained cloud–edge environments. As an initial step toward implementation, we plan to prototype ViMindXAI through a smart building surveillance and access management scenario, where multiple edge cameras monitor common areas to detect anomalous events such as unauthorized access or falls (as presented in Case Study 1, Section 4.3). The system will preserve user privacy through federated learning, supported by ONNX-compatible lightweight models and adaptive task schedulers. This experimental deployment will serve to evaluate latency, inference accuracy under non-IID data distributions, privacy guarantees, and system responsiveness in resource-constrained edge settings. In the medium term, we aim to enhance the platform’s trustworthiness and accountability by integrating blockchain-based access control mechanisms and fairness auditing pipelines. In the long term, we will explore the potential of emerging paradigms such as quantum computing and generative AI, contingent upon advances in infrastructure maturity and algorithmic stability.

6. Conclusions

This paper presented a comprehensive survey of the evolution of VBDA, highlighting the shift from centralized pipelines to hybrid cloud–edge platforms enhanced by deep learning, federated learning, and explainable AI. We identified key challenges, such as latency, scalability, data heterogeneity, and privacy, and introduced ViMindXAI, a modular platform designed to address these limitations through emerging AI and infrastructure techniques. The layered design of the platform aligns closely with the architectural gaps outlined in this survey, offering a coherent response to system-level and user-centric demands in modern video analytics. By mapping critical challenges to targeted solutions, ViMindXAI contributes a unified framework for building scalable, intelligent, and privacy-aware video analytics systems. These findings point to key directions for developing next-generation VBDA platforms that support real-time, data-driven decision-making in domains such as public safety, transportation, and healthcare. They also provide a solid foundation for future research on adaptive and efficient decision-support systems in video-intensive environments.
In the future, to advance ViMindXAI from conceptual design to real-world deployment, we outline a phased research roadmap. In the short term, we first focus on improving real-time scalability and privacy-preserving learning through reinforcement learning-based scheduling, predictive task offloading, and self-supervised training. As an initial step, ViMindXAI will be prototyped in a smart building surveillance scenario to evaluate latency, accuracy under non-IID data, privacy, and responsiveness. We then plan to prioritize lightweight model optimization for edge devices, secure federated learning with differential privacy, and blockchain-based metadata governance. In the long term, our efforts will explore quantum-inspired models and self-supervised techniques for semantic video retrieval. Empirical validation will be conducted in real-world domains such as healthcare monitoring and industrial safety, which demand low-latency processing, robust privacy protection, and high model transparency.

Author Contributions

Conceptualization, T.-T.-T.D. and V.-Q.N.; methodology, formal analysis, and writing: original draft preparation, T.-T.-T.D.; resources, visualization, writing: review and editing, V.-Q.N., Q.-T.H. and K.K.; supervision, project administration, and funding acquisition, V.-Q.N. and K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Hung Yen University of Technology and Education under the grant number UTEHY.L.2023.02. This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP)-Innovative Human Resource Development for Local Intellectualization program grant funded by the Korea government (MSIT) (IITP-2025-RS-2022-00156287).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Statista Research Department. Global Internet Video Traffic as a Share of Total Internet Traffic Worldwide from 2017 to 2023. 2023. Available online: https://www.statista.com/statistics/871513/worldwide-video-traffic-as-share-of-internet-traffic/ (accessed on 27 March 2025).
  2. Sumalee, A.; Ho, H.W. Smarter and more connected: Future intelligent transportation system. IATSS Res. 2018, 42, 67–71. [Google Scholar] [CrossRef]
  3. Wang, Z.; Wu, F.; Yu, F.; Zhou, Y.; Hu, J.; Min, G. Federated Continual Learning for Edge-AI: A Comprehensive Survey. arXiv 2024. [Google Scholar] [CrossRef]
  4. Kwon, B.; Kim, T. Toward an online continual learning architecture for intrusion detection of video surveillance. IEEE Access 2022, 10, 89732–89744. [Google Scholar] [CrossRef]
  5. Badidi, E.; Moumane, K.; El Ghazi, F. Opportunities, applications, and challenges of edge-AI enabled video analytics in smart cities: A systematic review. IEEE Access 2023, 11, 80543–80572. [Google Scholar] [CrossRef]
  6. Wu, X.; Yan, G.; Xie, X.; Bao, Y.; Zhang, W. Construction and Application of Video Big Data Analysis Platform for Smart City Development. Adv. Math. Phys. 2022, 2022, 7592180. [Google Scholar] [CrossRef]
  7. Zhai, Y. Design and Optimization of Smart Fire IoT Cloud Platform Based on Big Data Technology. In Proceedings of the 2024 International Conference on Electrical Drives, Power Electronics & Engineering (EDPEE), Athens, Greece, 27–29 February 2024; pp. 843–846. [Google Scholar]
  8. Su, C.; Wen, J.; Kang, J.; Wang, Y.; Su, Y.; Pan, H.; Zhong, Z.; Hossain, M.S. Hybrid RAG-Empowered Multi-Modal LLM for Secure Data Management in Internet of Medical Things: A Diffusion-Based Contract Approach. IEEE Internet Things J. 2024, 12, 13428–13440. [Google Scholar] [CrossRef]
  9. Pandya, S.; Srivastava, G.; Jhaveri, R.; Babu, M.R.; Bhattacharya, S.; Maddikunta, P.K.R.; Mastorakis, S.; Piran, M.J.; Gadekallu, T.R. Federated learning for smart cities: A comprehensive survey. Sustain. Energy Technol. Assess. 2023, 55, 102987. [Google Scholar] [CrossRef]
  10. Abusalah, B.; Qadah, T.M.; Stephen, J.J.; Eugster, P. Interminable Flows: A Generic, Joint, Customizable Resiliency Model for Big-Data Streaming Platforms. IEEE Access 2023, 11, 10762–10776. [Google Scholar] [CrossRef]
  11. Alam, A.; Ullah, I.; Lee, Y.K. Video big data analytics in the cloud: A reference architecture, survey, opportunities, and open research issues. IEEE Access 2020, 8, 152377–152422. [Google Scholar] [CrossRef]
  12. Deng, S.; Zhao, H.; Fang, W.; Yin, J.; Dustdar, S.; Zomaya, A.Y. Edge intelligence: The confluence of edge computing and artificial intelligence. IEEE Internet Things J. 2020, 7, 7457–7469. [Google Scholar] [CrossRef]
  13. Das, A.; Roopaei, M.; Jamshidi, M.; Najafirad, P. Distributed ai-driven search engine on visual internet-of-things for event discovery in the cloud. In Proceedings of the 2022 17th Annual System of Systems Engineering Conference (SOSE), Rochester, NY, USA, 7–11 June 2022; pp. 514–521. [Google Scholar]
  14. Brecko, A.; Kajati, E.; Koziorek, J.; Zolotova, I. Federated learning for edge computing: A survey. Appl. Sci. 2022, 12, 9124. [Google Scholar] [CrossRef]
  15. Rahman, K.J.; Ahmed, F.; Akhter, N.; Hasan, M.; Amin, R.; Aziz, K.E.; Islam, A.M.; Mukta, M.S.H.; Islam, A.N. Challenges, applications and design aspects of federated learning: A survey. IEEE Access 2021, 9, 124682–124700. [Google Scholar] [CrossRef]
  16. Janaki, G.; Umanandhini, D. Federated Learning Approaches for Decentralized Data Processing in Edge Computing. In Proceedings of the 2024 5th International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 18–20 September 2024; pp. 513–519. [Google Scholar]
  17. Xu, R.; Razavi, S.; Zheng, R. Edge video analytics: A survey on applications, systems and enabling techniques. IEEE Commun. Surv. Tutor. 2023, 25, 2951–2982. [Google Scholar] [CrossRef]
  18. Wang, D.; Shi, S.; Zhu, Y.; Han, Z. Federated analytics: Opportunities and challenges. IEEE Netw. 2021, 36, 151–158. [Google Scholar] [CrossRef]
  19. Elharrouss, O.; Almaadeed, N.; Al-Maadeed, S. A review of video surveillance systems. J. Vis. Commun. Image Represent. 2021, 77, 103116. [Google Scholar] [CrossRef]
  20. Cao, L. AI and data science for smart emergency, crisis and disaster resilience. Int. J. Data Sci. Anal. 2023, 15, 231–246. [Google Scholar] [CrossRef] [PubMed]
  21. Alam, A.; Lee, Y.K. Tornado: Intermediate results orchestration based service-oriented data curation framework for intelligent video big data analytics in the cloud. Sensors 2020, 20, 3581. [Google Scholar] [CrossRef] [PubMed]
  22. Wefelscheid, C. Monocular Camera Path Estimation Cross-linking Images in a Graph Structure. Ph.D. Thesis, TU Berlin, Berlin, Germany, 2013. [Google Scholar]
  23. Farahbakhsh, R. D2.2 State of the Art Analysis Report; TWIRL Project, ITEA3 Programme. 2023. Available online: https://itea3.org/project/workpackage/document/download/5794/TWIRL_D2.2_SOTA_Report.pdf (accessed on 3 June 2025).
  24. Zhang, W.; Xu, L.; Duan, P.; Gong, W.; Lu, Q.; Yang, S. A video cloud platform combing online and offline cloud computing technologies. Pers. Ubiquitous Comput. 2015, 19, 1099–1110. [Google Scholar] [CrossRef]
  25. Serrano, D.; Zhang, H.; Stroulia, E. Kaleidoscope: A Cloud-Based Platform for Real-Time Video-Based Interaction. In Proceedings of the 2016 IEEE World Congress on Services (SERVICES), San Francisco, CA, USA, 27 June–2 July 2016; pp. 107–110. [Google Scholar]
  26. Ara, A.; Ara, A. Cloud for big data analytics trends. IOSR J. Comput. Eng. 2016, 18, 1–6. [Google Scholar] [CrossRef]
  27. Gao, G.; Liu, C.H.; Chen, M.; Guo, S.; Leung, K.K. Cloud-based actor identification with batch-orthogonal local-sensitive hashing and sparse representation. IEEE Trans. Multimed. 2016, 18, 1749–1761. [Google Scholar] [CrossRef]
  28. Subudhi, B.N.; Rout, D.K.; Ghosh, A. Big data analytics for video surveillance. Multimed. Tools Appl. 2019, 78, 26129–26162. [Google Scholar] [CrossRef]
  29. Geng, D.; Zhang, C.; Xia, C.; Xia, X.; Liu, Q.; Fu, X. Big data-based improved data acquisition and storage system for designing industrial data platform. IEEE Access 2019, 7, 44574–44582. [Google Scholar] [CrossRef]
  30. Dai, J.J.; Wang, Y.; Qiu, X.; Ding, D.; Zhang, Y.; Wang, Y.; Jia, X.; Zhang, C.L.; Wan, Y.; Li, Z.; et al. Bigdl: A distributed deep learning framework for big data. In Proceedings of the ACM Symposium on Cloud Computing, Santa Cruz, CA, USA, 20–23 November 2019; pp. 50–60. [Google Scholar]
  31. Zhao, P.; Zhang, H.; Yu, Q.; Wang, Z.; Geng, Y.; Fu, F.; Yang, L.; Zhang, W.; Jiang, J.; Cui, B. Retrieval-augmented generation for AI-generated content: A survey. arXiv 2024, arXiv:2402.19473. [Google Scholar]
  32. Liang, C.; Du, H.; Sun, Y.; Niyato, D.; Kang, J.; Zhao, D.; Imran, M.A. Generative AI-driven semantic communication networks: Architecture, technologies and applications. IEEE Trans. Cogn. Commun. Netw. 2024, 11, 27–47. [Google Scholar] [CrossRef]
  33. Na, D.; Park, S. Blockchain-based dashcam video management method for data sharing and integrity in v2v network. IEEE Access 2022, 10, 3307–3319. [Google Scholar] [CrossRef]
  34. Simoens, P.; Xiao, Y.; Pillai, P.; Chen, Z.; Ha, K.; Satyanarayanan, M. Scalable crowd-sourcing of video from mobile devices. In Proceedings of the 11th Annual International Conference on Mobile Systems, Applications, and Services, Taipei, Taiwan, 25–28 June 2013; pp. 139–152. [Google Scholar]
  35. Xu, H.; Wang, L.; Xie, H. Design and experiment analysis of a Hadoop-based video transcoding system for next-generation wireless sensor networks. Int. J. Distrib. Sens. Netw. 2014, 10, 151564. [Google Scholar] [CrossRef]
  36. Chen, H.; Niu, D.; Lai, K.; Xu, Y.; Ardakani, M. Separating-plane factorization models: Scalable recommendation from one-class implicit feedback. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, 24–28 October 2016; pp. 669–678. [Google Scholar]
  37. Zhang, Y.; Xu, F.; Frise, E.; Wu, S.; Yu, B.; Xu, W. DataLab: A version data management and analytics system. In Proceedings of the 2016 IEEE/ACM 2nd International Workshop on Big Data Software Engineering (BIGDSE), Austin, TX, USA, 16 May 2016; pp. 12–18. [Google Scholar]
  38. Li, L.; Ota, K.; Dong, M. Humanlike driving: Empirical decision-making system for autonomous vehicles. IEEE Trans. Veh. Technol. 2018, 67, 6814–6823. [Google Scholar] [CrossRef]
  39. Trigka, M.; Dritsas, E. Edge and Cloud Computing in Smart Cities. Future Internet 2025, 17, 118. [Google Scholar] [CrossRef]
  40. Alghamdi, A.M.; Al Shehri, W.A.; Almalki, J.; Jannah, N.; Alsubaei, F.S. An architecture for COVID-19 analysis and detection using big data, AI, and data architectures. PLoS ONE 2024, 19, e0305483. [Google Scholar] [CrossRef] [PubMed]
  41. Syafrudin, M.; Alfian, G.; Fitriyani, N.E.; Rhee, J.Y. Performance analysis of IoT-based sensor, big data processing, and machine learning model for real-time monitoring system in automotive manufacturing. Sensors 2018, 18, 2946. [Google Scholar] [CrossRef] [PubMed]
  42. Zhou, B.; Li, J.; Wang, X.; Gu, Y.; Xu, L.; Hu, Y.; Zhu, L. Online internet traffic monitoring system using spark streaming. Big Data Min. Anal. 2018, 1, 47–56. [Google Scholar] [CrossRef]
  43. Simakovic, M.; Cica, Z.; Drajic, D. Big-Data Platform for Performance Monitoring of Telecom-Service-Provider Networks. Electronics 2022, 11, 2224. [Google Scholar] [CrossRef]
  44. Zhou, J.; Zhang, K.; Wang, L.; Wu, H.; Wang, Y.; Chen, C. SQLFlow: An Extensible Toolkit Integrating DB and AI. J. Mach. Learn. Res. 2023, 24, 1–9. [Google Scholar]
  45. Kothandapani, H.P. Emerging trends and technological advancements in data lakes for the financial sector: An in-depth analysis of data processing, analytics, and infrastructure innovations. Q. J. Emerg. Technol. Innov. 2023, 8, 62–75. [Google Scholar]
  46. Ranasinghe, K.; Ryoo, M.S. Language-based action concept spaces improve video self-supervised learning. Adv. Neural Inf. Process. Syst. 2023, 36, 74980–74994. [Google Scholar]
  47. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  48. Li, H.; Ke, Q.; Gong, M.; Zhang, R. Video joint modelling based on hierarchical transformer for co-summarization. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3904–3917. [Google Scholar] [CrossRef] [PubMed]
  49. Wang, J.; Lu, T.; Li, L.; Huang, D. Enhancing personalized search with ai: A hybrid approach integrating deep learning and cloud computing. J. Adv. Comput. Syst. 2024, 4, 1–13. [Google Scholar] [CrossRef]
  50. Prangon, N.F.; Wu, J. AI and computing horizons: Cloud and edge in the modern era. J. Sens. Actuator Netw. 2024, 13, 44. [Google Scholar] [CrossRef]
  51. Sathupadi, K.; Achar, S.; Bhaskaran, S.V.; Faruqui, N.; Abdullah-Al-Wadud, M.; Uddin, J. Edge-cloud synergy for AI-enhanced sensor network data: A real-time predictive maintenance framework. Sensors 2024, 24, 7918. [Google Scholar] [CrossRef] [PubMed]
  52. Wang, S.; Yang, S.; Zhao, C. SurveilEdge: Real-time video query based on collaborative cloud-edge deep learning. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications, Toronto, ON, Canada, 6–9 July 2020; pp. 2519–2528. [Google Scholar]
  53. Kumar, Y.; Marchena, J.; Awlla, A.H.; Li, J.J.; Abdalla, H.B. The AI-Powered Evolution of Big Data. Appl. Sci. 2024, 14, 10176. [Google Scholar] [CrossRef]
  54. Hassan, A.; Prasad, V.; Bhattacharya, P.; Dutta, P.; Damaševičius, R. Federated Learning and AI for Healthcare 5.0; IGI Global: Hershey, PA, USA, 2024. [Google Scholar]
55. Feuerriegel, S.; Hartmann, J.; Janiesch, C.; Zschech, P. Generative AI. Bus. Inf. Syst. Eng. 2024, 66, 111–126. [Google Scholar] [CrossRef]
  56. Akram, F.; Sani, M. Real-Time AI Systems: Leveraging Cloud Computing and Machine Learning for Big Data Processing. 2025. Available online: https://www.researchgate.net/publication/388526113_Real-Time_AI_Systems_Leveraging_Cloud_Computing_and_Machine_Learning_for_Big_Data_Processing (accessed on 8 June 2025).
  57. Moolikagedara, K.; Nguyen, M.; Yan, W.Q.; Li, X.J. Video Blockchain: A decentralized approach for secure and sustainable networks with distributed video footage from vehicle-mounted cameras in smart cities. Electronics 2023, 12, 3621. [Google Scholar] [CrossRef]
  58. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, USA, 27–29 July 1992; pp. 144–152. [Google Scholar]
59. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed]
  60. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  61. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
62. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
  63. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. Adv. Neural Inf. Process. Syst. 1993, 6, 737–744. [Google Scholar] [CrossRef]
64. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
65. Zhou, Y.; Tuzel, O. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  66. Şengönül, E.; Samet, R.; Abu Al-Haija, Q.; Alqahtani, A.; Alturki, B.; Alsulami, A.A. An analysis of artificial intelligence techniques in surveillance video anomaly detection: A comprehensive survey. Appl. Sci. 2023, 13, 4956. [Google Scholar] [CrossRef]
  67. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
68. Graves, A. Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012; pp. 37–45. [Google Scholar]
  69. Xia, L.; Chen, C.C.; Aggarwal, J.K. View invariant human action recognition using histograms of 3D joints. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 20–27. [Google Scholar]
  70. Kong, Y.; Fu, Y. Human action recognition and prediction: A survey. Int. J. Comput. Vis. 2022, 130, 1366–1401. [Google Scholar] [CrossRef]
  71. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef] [PubMed]
  72. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar]
  73. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80. [Google Scholar] [CrossRef] [PubMed]
  74. Berroukham, A.; Housni, K.; Lahraichi, M.; Boulfrifi, I. Deep learning-based methods for anomaly detection in video surveillance: A review. Bull. Electr. Eng. Inform. 2023, 12, 314–327. [Google Scholar] [CrossRef]
  75. Wang, S.; Miao, Z. Anomaly detection in crowd scene. In Proceedings of the IEEE 10th International Conference on Signal Processing Proceedings, Beijing, China, 24–28 October 2010; pp. 1220–1223. [Google Scholar]
  76. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597. [Google Scholar]
  77. Alotaibi, S.R.; Mengash, H.A.; Maray, M.; Alotaibi, F.A.; Alkharashi, A.; Alzahrani, A.A.; Alotaibi, M.; Alnfiai, M.M. Integrating Explainable Artificial Intelligence with Advanced Deep Learning Model for Crowd Density Estimation in Real-world Surveillance Systems. IEEE Access 2025, 13, 20750–20762. [Google Scholar] [CrossRef]
  78. Chen, K.; Loy, C.C.; Gong, S.; Xiang, T. Feature mining for localised crowd counting. In Proceedings of the British Machine Vision Conference (BMVC), Surrey, UK, 3–7 September 2012; Volume 1, p. 3. [Google Scholar]
  79. Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; Shah, M. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 532–546. [Google Scholar]
  80. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
81. Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; Schmid, C. VideoBERT: A joint model for video and language representation learning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7464–7473. [Google Scholar]
  82. Bain, M.; Nagrani, A.; Varol, G.; Zisserman, A. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1728–1738. [Google Scholar]
83. Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, P.; Toderici, G.; Varadarajan, B.; Vijayanarasimhan, S. YouTube-8M: A large-scale video classification benchmark. arXiv 2016, arXiv:1609.08675. [Google Scholar]
84. Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; Carlos Niebles, J. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 961–970. [Google Scholar]
85. Xu, J.; Mei, T.; Yao, T.; Rui, Y. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5288–5296. [Google Scholar]
86. Gygli, M.; Grabner, H.; Riemenschneider, H.; Van Gool, L. Creating summaries from user videos. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part VII; Springer: Berlin/Heidelberg, Germany, 2014; pp. 505–520. [Google Scholar]
87. Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. TVSum: Summarizing web videos using titles. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5179–5187. [Google Scholar]
  88. Kushwah, J.S.; Dave, M.H.; Sharma, A.; Shrivastava, K.; Sharma, R.; Ahmed, M.N. AI-Enhanced Tracksegnet an Advanced Machine Learning Technique for Video Segmentation and Object Tracking. ICTACT J. Image Video Process. 2024, 15, 3384–3394. [Google Scholar] [CrossRef]
  89. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
90. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-object tracking by associating every detection box. In Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022, Proceedings, Part XXII; Springer: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
  91. Patel, A.S.; Vyas, R.; Vyas, O.; Ojha, M.; Tiwari, V. Motion-compensated online object tracking for activity detection and crowd behavior analysis. Vis. Comput. 2023, 39, 2127–2147. [Google Scholar] [CrossRef] [PubMed]
  92. Vora, D.; Kadam, P.; Mohite, D.D.; Kumar, N.; Kumar, N.; Radhakrishnan, P.; Bhagwat, S. AI-driven video summarization for optimizing content retrieval and management through deep learning techniques. Sci. Rep. 2025, 15, 4058. [Google Scholar] [CrossRef] [PubMed]
  93. Morshed, M.G.; Sultana, T.; Alam, A.; Lee, Y.K. Human Action Recognition: A Taxonomy-Based Survey, Updates, and Opportunities. Sensors 2023, 23, 2182. [Google Scholar] [CrossRef] [PubMed]
  94. Ahmadi, A.M.; Kiani, K.; Rastgoo, R. A Transformer-based model for abnormal activity recognition in video. J. Model. Eng. 2024, 22, 213–221. [Google Scholar]
  95. Pang, G.; Shen, C.; Cao, L.; Hengel, A.V.D. Deep learning for anomaly detection: A review. ACM Comput. Surv. (CSUR) 2021, 54, 1–38. [Google Scholar] [CrossRef]
  96. Bendali-Braham, M.; Weber, J.; Forestier, G.; Idoumghar, L.; Muller, P.A. Recent trends in crowd analysis: A review. Mach. Learn. Appl. 2021, 4, 100023. [Google Scholar] [CrossRef]
97. Liao, Y.; Xie, J.; Geiger, A. KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3292–3310. [Google Scholar] [CrossRef] [PubMed]
  98. Kadam, P.; Vora, D.; Patil, S.; Mishra, S.; Khairnar, V. Behavioral Profiling for Adaptive Video Summarization: From Generalization to Personalization. MethodsX 2024, 13, 102780. [Google Scholar] [CrossRef] [PubMed]
  99. Hussain, T.; Muhammad, K.; Ding, W.; Lloret, J.; Baik, S.W.; De Albuquerque, V.H.C. A comprehensive survey of multi-view video summarization. Pattern Recognit. 2021, 109, 107567. [Google Scholar] [CrossRef]
  100. Wang, S.; Zhong, Y.; Wang, E. An integrated GIS platform architecture for spatiotemporal big data. Future Gener. Comput. Syst. 2019, 94, 160–172. [Google Scholar] [CrossRef]
  101. Chen, H.; Zi, X.; Zhang, Q.; Zhu, Y.; Wang, J. Computer big data technology in Internet network communication video monitoring of coal preparation plant. J. Phys. Conf. Ser. 2021, 2083, 042067. [Google Scholar] [CrossRef]
  102. Xu, C.; Du, X.; Yan, Z.; Fan, X. ScienceEarth: A big data platform for remote sensing data processing. Remote Sens. 2020, 12, 607. [Google Scholar] [CrossRef]
  103. Zhu, L.; Yu, F.; Wang, Y.; Ning, B. Big data analytics in intelligent transportation systems: A survey. IEEE Trans. Intell. Transp. Syst. 2018, 20, 383–398. [Google Scholar] [CrossRef]
  104. Liu, W.; Long, Z.; Yang, G.; Xing, L. A self-powered wearable motion sensor for monitoring volleyball skill and building big sports data. Biosensors 2022, 12, 60. [Google Scholar] [CrossRef] [PubMed]
  105. Sharshar, A.; Eitta, A.H.A.; Fayez, A.; Khamis, M.A.; Zaky, A.B.; Gomaa, W. Camera coach: Activity recognition and assessment using thermal and RGB videos. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; pp. 1–8. [Google Scholar]
  106. Fenil, E.; Dinesh Jackson Samuel, R.; Manogaran, G.; Vivekananda, G.N.; Thanjaivadivel, T.; Jeeva, S.; Ahilan, A. Real time violence detection framework for football stadium comprising of big data analysis and deep learning through bidirectional LSTM. Comput. Netw. 2019, 151, 191–200. [Google Scholar]
  107. Alam, A.; Khan, M.N.; Khan, J.; Lee, Y.K. Intellibvr-intelligent large-scale video retrieval for objects and events utilizing distributed deep-learning and semantic approaches. In Proceedings of the 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), Busan, Republic of Korea, 19–22 February 2020; pp. 28–35. [Google Scholar]
  108. Gayakwad, M. Real-Time Clickstream Analytics with Apache. J. Electr. Syst. 2024, 20, 1600–1608. [Google Scholar] [CrossRef]
  109. Mouradian, C.; Ebrahimnezhad, F.; Jebbar, Y.; Ahluwalia, J.K.; Afrasiabi, S.N.; Glitho, R.H.; Moghe, A. An IoT platform-as-a-service for NFV-based hybrid cloud/fog systems. IEEE Internet Things J. 2020, 7, 6102–6115. [Google Scholar] [CrossRef]
  110. Raptis, T.P.; Cicconetti, C.; Falelakis, M.; Kalogiannis, G.; Kanellos, T.; Lobo, T.P. Engineering resource-efficient data management for smart cities with Apache Kafka. Future Internet 2023, 15, 43. [Google Scholar] [CrossRef]
  111. Hlaing, N.N.; Nyunt, T.T.S. Developing Scalable and Lightweight Data Stream Ingestion Framework for Stream Processing. In Proceedings of the 2023 IEEE Conference on Computer Applications (ICCA), Yangon, Myanmar, 27–28 February 2023; pp. 405–410. [Google Scholar]
  112. Xu, B.; Jiang, J.; Ye, J. Information intelligence system solution based on Big Data Flink technology. In Proceedings of the 4th International Conference on Big Data Engineering, Beijing, China, 26–28 May 2022; pp. 21–26. [Google Scholar]
  113. Shafiyah, S.; Ahsan, A.S.; Asmara, R. Big Data Infrastructure Design Optimizes Using Hadoop Technologies Based on Application Performance Analysis. Sist. J. Sist. Inf. 2022, 11, 55–72. [Google Scholar] [CrossRef]
  114. Yang, C.T.; Chen, T.Y.; Kristiani, E.; Wu, S.F. The implementation of data storage and analytics platform for big data lake of electricity usage with spark. J. Supercomput. 2021, 77, 5934–5959. [Google Scholar] [CrossRef]
  115. Tripathi, V.; Gangodkar, D.; Singh, D.P.; Bordoloi, D. Using Apache Spark Streaming and Kafka to Perform Face Recognition on Live Video Streams of Pedestrians. Webology 2021, 18, 3416–3423. [Google Scholar]
  116. Melenli, S.; Topkaya, A. Real-time maintaining of social distance in COVID-19 environment using image processing and big data. In Trends in Data Engineering Methods for Intelligent Systems: Proceedings of the International Conference on Artificial Intelligence and Applied Mathematics in Engineering (ICAIAME 2020), Antalya, Turkey, 18–20 April 2020; Springer: Cham, Switzerland, 2021; pp. 578–589. [Google Scholar]
  117. Mendhe, C.H.; Henderson, N.; Srivastava, G.; Mago, V. A scalable platform to collect, store, visualize, and analyze big data in real time. IEEE Trans. Comput. Soc. Syst. 2020, 8, 260–269. [Google Scholar] [CrossRef]
  118. Khan, M.N.; Alam, A.; Lee, Y.K. Falkon: Large-Scale Content-Based Video Retrieval Utilizing Deep-Features and Distributed In-Memory Computing. In Proceedings of the 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), Busan, Republic of Korea, 19–22 February 2020; pp. 36–43. [Google Scholar]
  119. Uddin, M.A.; Alam, A.; Tu, N.A.; Islam, M.S.; Lee, Y.K. SIAT: A distributed video analytics framework for intelligent video surveillance. Symmetry 2019, 11, 911. [Google Scholar] [CrossRef]
  120. Supangkat, S.H.; Hidayat, F.; Dahlan, I.A.; Hamami, F. The implementation of traffic analytics using deep learning and big data technology with Garuda Smart City Framework. In Proceedings of the 2019 IEEE Global Conference on Consumer Electronics (GCCE), Osaka, Japan, 15–18 October 2019; pp. 883–887. [Google Scholar]
  121. Zhang, W.; Sun, H.; Zhao, D.; Xu, L.; Liu, X.; Ning, H.; Zhou, J.; Guo, Y.; Yang, S. A streaming cloud platform for real-time video processing on embedded devices. IEEE Trans. Cloud Comput. 2019, 9, 868–880. [Google Scholar] [CrossRef]
  122. Lv, J.; Wu, B.; Liu, C.; Gu, X. Pf-face: A parallel framework for face classification and search from massive videos based on spark. In Proceedings of the 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), Xi’an, China, 13–16 September 2018; pp. 1–7. [Google Scholar]
  123. Munshi, A.A.; Mohamed, Y.A.R.I. Data lake lambda architecture for smart grids big data analytics. IEEE Access 2018, 6, 40463–40471. [Google Scholar] [CrossRef]
  124. Yang, Q. Application of Recommendation System Technology and Architecture in Video Streaming Platforms. Preprints 2025. [Google Scholar] [CrossRef]
  125. Rahman, M.; Provath, M.A.M.; Deb, K.; Dhar, P.K.; Shimamura, T. CAMFusion: Context-Aware Multi-Modal Fusion Framework for Detecting Sarcasm and Humor Integrating Video and Textual Cues. IEEE Access 2025, 13, 42530–42546. [Google Scholar] [CrossRef]
  126. Ahamad, R.; Mishra, K.N. Hybrid approach for suspicious object surveillance using video clips and UAV images in cloud-IoT-based computing environment. Clust. Comput. 2024, 27, 761–785. [Google Scholar] [CrossRef]
  127. Jung, J.; Park, S.; Kim, H.; Lee, C.; Hong, C. Artificial intelligence-driven video indexing for rapid surveillance footage summarization and review. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 3–9 August 2024; pp. 8687–8690. [Google Scholar]
  128. Mon, S.L.; Onizuka, T.; Tin, P.; Aikawa, M.; Kobayashi, I.; Zin, T.T. AI-enhanced real-time cattle identification system through tracking across various environments. Sci. Rep. 2024, 14, 17779. [Google Scholar] [CrossRef] [PubMed]
  129. Liu, A.; Mahapatra, R.P.; Mayuri, A. Hybrid design for sports data visualization using AI and big data analytics. Complex Intell. Syst. 2023, 9, 2969–2980. [Google Scholar] [CrossRef]
  130. Wu, K.; Xu, L. Deep Hybrid Neural Network With Attention Mechanism for Video Hash Retrieval Method. IEEE Access 2023, 11, 47956–47966. [Google Scholar] [CrossRef]
  131. Alpay, T.; Magg, S.; Broze, P.; Speck, D. Multimodal video retrieval with CLIP: A user study. Inf. Retr. J. 2023, 26, 6. [Google Scholar] [CrossRef]
  132. Ul Haq, H.B.; Asif, M.; Ahmad, M.B.; Ashraf, R.; Mahmood, T. An effective video summarization framework based on the object of interest using deep learning. Math. Probl. Eng. 2022, 2022, 7453744. [Google Scholar] [CrossRef]
  133. Apostolidis, E.; Adamantidou, E.; Metsai, A.I.; Mezaris, V.; Patras, I. Video summarization using deep neural networks: A survey. Proc. IEEE 2021, 109, 1838–1863. [Google Scholar] [CrossRef]
  134. Kul, S.; Tashiev, I.; Şentaş, A.; Sayar, A. Event-based microservices with Apache Kafka streams: A real-time vehicle detection system based on type, color, and speed attributes. IEEE Access 2021, 9, 83137–83148. [Google Scholar] [CrossRef]
135. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; Volume 139, pp. 8748–8763. [Google Scholar]
136. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; Volume 162, pp. 12888–12900. [Google Scholar]
  137. Zeng, L.; Ye, S.; Chen, X.; Zhang, X.; Ren, J.; Tang, J.; Yang, Y.; Shen, X.S. Edge Graph Intelligence: Reciprocally Empowering Edge Networks with Graph Intelligence. IEEE Commun. Surv. Tutor. 2025. early access. [Google Scholar] [CrossRef]
  138. Ramamoorthi, V. Applications of AI in Cloud Computing: Transforming Industries and Future Opportunities. Int. J. Sci. Res. Comput. Sci. Eng. Inf. Technol. 2023, 9, 472–483. [Google Scholar]
  139. Hridi, A.P.; Sahay, R.; Hosseinalipour, S.; Akram, B. Revolutionizing AI-Assisted Education with Federated Learning: A Pathway to Distributed, Privacy-Preserving, and Debiased Learning Ecosystems. In Proceedings of the AAAI Symposium Series, Stanford, CA, USA, 25–27 March 2024; Volume 3, pp. 297–303. [Google Scholar]
  140. Hakam, N.; Benfriha, K.; Meyrueis, V.; Liotard, C. Advanced Monitoring of Manufacturing Process through Video Analytics. Sensors 2024, 24, 4239. [Google Scholar] [CrossRef] [PubMed]
  141. Rocha Neto, A.; Silva, T.P.; Batista, T.; Delicato, F.C.; Pires, P.F.; Lopes, F. Leveraging edge intelligence for video analytics in smart city applications. Information 2020, 12, 14. [Google Scholar] [CrossRef]
  142. Zhang, J.; Tsai, P.H.; Tsai, M.H. Semantic2Graph: Graph-based multi-modal feature fusion for action segmentation in videos. Appl. Intell. 2024, 54, 2084–2099. [Google Scholar] [CrossRef]
  143. Divya, G.; Swetha, K.; Santhi, S. A Decentralized Fog Architecture for Video Preprocessing in Cloud-based Video Surveillance as a Service. In Proceedings of the 2024 International Conference on Cognitive Robotics and Intelligent Systems (ICC-ROBINS), Coimbatore, India, 17–19 April 2024. [Google Scholar]
  144. Abdallah, R.; Harb, H.; Taher, Y.; Benbernou, S.; Haque, R. CRIMEO: Criminal Behavioral Patterns Mining and Extraction from Video Contents. In Proceedings of the 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA), Thessaloniki, Greece, 9–13 October 2023; pp. 1–8. [Google Scholar]
  145. Gu, M.; Zhao, Z.; Jin, W.; Hong, R.; Wu, F. Graph-based multi-interaction network for video question answering. IEEE Trans. Image Process. 2021, 30, 2758–2770. [Google Scholar] [CrossRef] [PubMed]
  146. Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547. [Google Scholar] [CrossRef]
  147. Kadam, A.J.; Akhade, K. A Review on Comparative Study of Popular Data Visualization Tools. Alochana J. 2024, 13, 532–538. [Google Scholar]
  148. Kumar, A.; Shawkat Ali, A. Big Data Visualization Tools, Challenges and Web Search Popularity-An Update till Today. In Big Data Intelligence and Computing: International Conference, DataCom 2022, Denarau Island, Fiji, 8–10 December 2022, Proceedings; Springer: Singapore, 2022; pp. 305–315. [Google Scholar]
  149. Il-Agure, Z.; Dempere, J. Review of data visualization techniques in IoT data. In Proceedings of the 2022 8th International Conference on Information Technology Trends (ITT), Dubai, United Arab Emirates, 25–26 May 2022; pp. 167–171. [Google Scholar]
  150. Sullivan, G.J.; Ohm, J.R.; Han, W.J.; Wiegand, T. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668. [Google Scholar] [CrossRef]
  151. Jiang, X.; Yu, F.R.; Song, T.; Leung, V.C. A survey on multi-access edge computing applied to video streaming: Some research issues and challenges. IEEE Commun. Surv. Tutor. 2021, 23, 871–903. [Google Scholar] [CrossRef]
  152. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  153. Li, Q.; He, B.; Song, D. Model-contrastive federated learning. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10713–10722. [Google Scholar]
154. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; Suresh, A.T. SCAFFOLD: Stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 5132–5143. [Google Scholar]
  155. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450. [Google Scholar]
156. Kingori, S.W.; Nderu, L.; Njagi, D. Variational Auto-Encoder and Speeded-Up Robust Features Hybrid Model for Anomaly Detection and Localization in Video Sequence with Scale Variation. J. Comput. Commun. 2025, 13, 153–165. [Google Scholar] [CrossRef]
  157. Li, Z. Cross-Layer Optimization for Video Delivery on Wireless Networks. Ph.D. Thesis, Princeton University, Princeton, NJ, USA, 2023. [Google Scholar]
Figure 1. Evolution of the VBDA platform.
Figure 2. Key technologies in video big data platforms.
Figure 3. ViMindXAI: A hybrid AI-driven platform for scalable and privacy-aware VBDA.
Table 1. Problem taxonomy, core technologies, and applications in VBDA.

| Problem Class | Representative Technologies | Benchmark Datasets | Applications | References |
|---|---|---|---|---|
| Object Detection and Tracking | YOLO, Faster R-CNN, SSD; DeepSORT, FairMOT; PointNet, VoxelNet | COCO, KITTI, VisDrone, AICity | Surveillance, Traffic Monitoring, Autonomous Driving, Sports | [88,89,90,91] |
| Activity and Behavior Analysis | 3D CNNs, LSTM, SlowFast, Transformers; OpenPose, HRNet; Multimodal Fusion | UCF-101, HMDB-51, NTU RGB+D, Kinetics | Healthcare, Human–Robot Interaction, Smart Spaces, Sports | [70,92,93,94] |
| Anomaly Detection | Autoencoders, Isolation Forest, GANs, GNNs, Transformers | UCSD, Avenue, ShanghaiTech | Intrusion Detection, Industrial Safety, Emergency Response | [5,74,95] |
| Crowd Analysis | CNN Density Maps, Optical Flow, YOLOv8, GNNs, Transformers | UCF-QNRF, ShanghaiTech, Mall | Event Safety, Urban Planning, Transit Optimization | [77,91,96] |
| Content-Based Retrieval and Understanding | VideoBERT, CLIP, FAISS, Pinecone, Elasticsearch | YouTube-8M, ActivityNet, MSR-VTT | Surveillance Forensics, Media Search, Smart Recommenders | [82,93,97] |
| Video Summarization and Captioning | VST, ViViT, TimeSformer; BLIP-2, GPT-4V; Diffusion Models | TVSum, SumMe, Kinetics, Show and Tell | Surveillance Triage, Media Editing, Accessibility | [92,98,99] |
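To ground the anomaly-detection row of Table 1, the following minimal Python sketch scores frame-level embeddings with an Isolation Forest [72]. It is an illustration only: the 64-dimensional features are random stand-ins for embeddings that a real pipeline would extract with an autoencoder or 3D-CNN.

```python
# Minimal sketch: Isolation Forest over frame embeddings (assumed features).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_frames = rng.normal(0.0, 1.0, size=(500, 64))  # stand-in embeddings of normal frames
new_frames = rng.normal(0.0, 1.5, size=(10, 64))      # stand-in embeddings of incoming frames

detector = IsolationForest(contamination=0.05, random_state=0).fit(normal_frames)
scores = detector.decision_function(new_frames)       # lower score = more anomalous
flags = detector.predict(new_frames)                  # -1 = anomaly, +1 = normal

for i, (score, flag) in enumerate(zip(scores, flags)):
    print(f"frame {i}: score={score:.3f}, anomaly={flag == -1}")
```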
Table 2. Comparison of VBDA system architectures.

| Architecture | Key Characteristics | Advantages | Limitations | VBDA Suitability | References |
|---|---|---|---|---|---|
| Centralized | Single-node processing; local storage; no distribution. | Low deployment cost; simple management; consistent control. | High latency; poor scalability; privacy risks; single point of failure (SPOF). | Limited: suitable for small-scale or archival VBDA. | Wu et al. (2022) [6]; Wang et al. (2019) [100]; Chen et al. (2021) [101] |
| Cloud-Centric | Compute and storage offloaded to the cloud; centralized orchestration. | High scalability; deep learning support; elastic compute; lower hardware burden. | Latency from data upload; privacy concerns; higher recurring cost. | Moderate to High: batch analytics, model training. | Chen et al. (2021) [101]; Xu et al. (2020) [102]; Zhu et al. (2018) [103] |
| Edge Computing | Local inference near the data source; minimal cloud dependency. | Ultra-low latency; enhanced privacy; bandwidth-efficient; real-time response. | Limited compute; synchronization overhead; hardware heterogeneity. | High: suitable for real-time applications (e.g., smart cities, surveillance). | Liu et al. (2022) [104]; Sharshar et al. (2023) [105]; Fenil et al. (2019) [106] |
| Hybrid Cloud–Edge | Edge handles real-time inference; cloud handles training, storage, and orchestration. | Balanced latency and scalability; privacy-aware; supports FL; adaptive workload distribution. | Complex deployment; resource orchestration challenges; higher initial setup cost. | Very High: best for large-scale, real-time VBDA systems. | Alam et al. (2020) [107]; Gayakwad et al. (2024) [108]; Mouradian et al. (2020) [109] |
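At its core, the hybrid cloud–edge pattern in Table 2 reduces to a confidence-gated offload loop: the edge answers what it can and defers uncertain frames to the cloud. The sketch below is a self-contained illustration; edge_infer, cloud_infer, and the 0.8 threshold are hypothetical stand-ins, not an API from any surveyed system.

```python
# Minimal sketch of confidence-gated cloud offload in a hybrid cloud-edge setup.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float

def edge_infer(frame_id: int) -> Detection:
    # Stand-in for a compact on-device model (e.g., a YOLOv8-tiny variant).
    return Detection("person", 0.55 if frame_id % 3 == 0 else 0.92)

def cloud_infer(frame_id: int) -> Detection:
    # Stand-in for a heavier cloud model reached over the network.
    return Detection("person", 0.98)

CONFIDENCE_THRESHOLD = 0.8  # below this, the edge defers to the cloud

for frame_id in range(6):
    detection, source = edge_infer(frame_id), "edge"
    if detection.confidence < CONFIDENCE_THRESHOLD:
        detection, source = cloud_infer(frame_id), "cloud"
    print(f"frame {frame_id}: {detection.label} ({detection.confidence:.2f}) via {source}")
```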
Table 3. Scalable big data processing frameworks for VBDA. Frameworks compared: Hadoop (MapReduce), Kafka, Spark, Flink, NiFi, Hive, Impala, Pig, and Storm.

| Reference | Year |
|---|---|
| Gayakwad et al. [108] | 2024 |
| Hlaing et al. [111] | 2023 |
| Abusalah et al. [10] | 2023 |
| Raptis et al. [110] | 2023 |
| Xu et al. [112] | 2022 |
| Shafiyah et al. [113] | 2022 |
| Yang et al. [114] | 2021 |
| Tripathi et al. [115] | 2021 |
| Melenli et al. [116] | 2021 |
| Alam et al. [21] | 2020 |
| Alam et al. [107] | 2020 |
| Alam et al. [11] | 2020 |
| Mendhe et al. [117] | 2020 |
| Khan et al. [118] | 2020 |
| Uddin et al. [119] | 2019 |
| Supangkat et al. [120] | 2019 |
| Wang et al. [100] | 2019 |
| Dai et al. [30] | 2019 |
| Zhang et al. [121] | 2019 |
| Lv et al. [122] | 2018 |
| Munshi et al. [123] | 2018 |
Totals across the 21 surveyed works: Hadoop (MapReduce) 6, Kafka 17, Spark 16, Flink 4, NiFi 3, Hive 6, Impala 3, Pig 1, Storm 3.
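As the totals indicate, Kafka and Spark are by far the most common pairing. A minimal PySpark Structured Streaming sketch of that pattern follows; the broker address, topic name, and per-frame JSON schema are illustrative assumptions, and the job additionally requires the spark-sql-kafka connector package on the classpath.

```python
# Minimal sketch: consume per-frame video events from Kafka with Spark
# Structured Streaming and count events per camera (assumed topic/schema).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructType, TimestampType

spark = SparkSession.builder.appName("vbda-ingest").getOrCreate()

schema = (StructType()
          .add("camera_id", StringType())
          .add("event", StringType())
          .add("ts", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
          .option("subscribe", "video-events")                  # assumed topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

counts = events.groupBy("camera_id", "event").count()

(counts.writeStream
       .outputMode("complete")
       .format("console")
       .start()
       .awaitTermination())
```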
Table 4. AI and deep learning for video analytics in VBDA. Model families compared: YOLO, the R-CNN family, 3D-CNN/SlowFast, LSTM, Transformers, GANs, CLIP/VLMs, self-supervised learning (SSL) methods, anomaly models, and MOT trackers.

| Reference | Year |
|---|---|
| Yang et al. [124] | 2025 |
| Rahman et al. [125] | 2025 |
| Ahamad et al. [126] | 2024 |
| Liang et al. [32] | 2024 |
| Jung et al. [127] | 2024 |
| Zhao et al. [31] | 2024 |
| Su et al. [8] | 2024 |
| Sathupadi et al. [51] | 2024 |
| Mon et al. [128] | 2024 |
| Hassan et al. [54] | 2024 |
| Liu et al. [129] | 2023 |
| Wu et al. [130] | 2023 |
| Alpay et al. [131] | 2023 |
| Badidi et al. [5] | 2023 |
| Kwon et al. [4] | 2022 |
| Wu et al. [6] | 2022 |
| Na et al. [33] | 2022 |
| Ul Haq et al. [132] | 2022 |
| Apostolidis et al. [133] | 2021 |
| Kul et al. [134] | 2021 |
| Alam et al. [107] | 2020 |
| Khan et al. [118] | 2020 |
| Subudhi et al. [28] | 2019 |
| Dai et al. [30] | 2019 |
| Geng et al. [29] | 2019 |
Total 1411213854343
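To illustrate the CLIP/VLMs column of Table 4, the sketch below tags a single frame zero-shot against free-text labels with the public openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers; the all-black stand-in frame and the label set are assumptions for the example.

```python
# Minimal sketch: zero-shot frame tagging with CLIP (stand-in frame and labels).
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))  # stand-in frame
labels = ["an empty street", "a crowded platform", "a traffic accident"]

inputs = processor(text=labels, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image    # similarity of the frame to each label
probs = logits.softmax(dim=-1).squeeze()

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```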
Table 5. Federated learning and edge AI for VBDA. Columns compared: FL algorithms (FedAvg, FedProx), FL frameworks (FedML, TFF), privacy techniques, model optimization (pruning, quantization, knowledge distillation), edge AI hardware (Jetson, Coral, Movidius), and edge–cloud orchestration.

| Reference | Year |
|---|---|
| Zeng et al. [137] | 2025 |
| Hridi et al. [139] | 2024 |
| Hakam et al. [140] | 2024 |
| Su et al. [8] | 2024 |
| Wang et al. [3] | 2024 |
| Prangon et al. [50] | 2024 |
| Pandya et al. [9] | 2023 |
| Badidi et al. [5] | 2023 |
| Ramamoorthi [138] | 2023 |
| Brecko et al. [14] | 2022 |
| Das et al. [13] | 2022 |
| Kwon et al. [4] | 2022 |
| Rocha Neto et al. [141] | 2020 |
Totals across the 13 surveyed works: FedAvg 10, FedProx 3, FedML 3, TFF 4, privacy techniques 6, pruning 2, quantization 1, knowledge distillation 3, Jetson 5, Coral 2, Movidius 3, edge–cloud orchestration 9.
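FedAvg, the most widely adopted algorithm in Table 5, reduces at the server to a dataset-size-weighted average of client parameters. The NumPy sketch below shows only that aggregation step over simulated clients; production stacks such as TFF, FedML, or NVIDIA FLARE wrap it with client sampling, scheduling, and secure aggregation.

```python
# Minimal sketch of the FedAvg aggregation step with simulated edge clients.
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

rng = np.random.default_rng(0)
clients = [rng.normal(size=8) for _ in range(3)]  # stand-in model updates
sizes = [1200, 300, 500]                          # local video samples per camera

global_update = fed_avg(clients, sizes)
print(global_update)
```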
Table 6. Technologies for data storage and management. Storage back-ends compared: Hadoop (HDFS); cloud storage (Amazon S3, Google Cloud, Azure Data Lake); NoSQL databases (HBase, MongoDB, Cassandra); graph databases; SQL databases (MySQL, PostgreSQL); plus auxiliary technologies and AI-driven indexing and retrieval.

| Reference | Year | Graph DB | Auxiliary Technologies | AI-Driven Indexing and Retrieval |
|---|---|---|---|---|
| Gayakwad et al. [108] | 2024 | | | AI Search |
| Zhang et al. [142] | 2024 | | | Graph-based Retrieval |
| Divya et al. [143] | 2024 | | | |
| Zhai et al. [7] | 2024 | | Elasticsearch, Phoenix | FAISS |
| Abdallah et al. [144] | 2023 | | | CLIP |
| Raptis et al. [110] | 2023 | | Zookeeper, Elasticsearch | |
| Shafiyah et al. [113] | 2022 | | Zookeeper, Elasticsearch | |
| Gu et al. [145] | 2021 | RDF | | VideoBERT |
| Yang et al. [114] | 2021 | | Phoenix | |
| Melenli et al. [116] | 2021 | | Zookeeper, Elasticsearch | |
| Kul et al. [134] | 2021 | | Zookeeper, Docker | |
| Alam et al. [21] | 2020 | | Zookeeper, Phoenix | |
| Khan et al. [118] | 2020 | | Zookeeper, Phoenix | |
| Xu et al. [102] | 2020 | | Thrift, Elasticsearch | |
| Mendhe et al. [117] | 2020 | | Elasticsearch | |
| Alam et al. [107] | 2020 | RDF | | |
| Alam et al. [11] | 2020 | | | |
| Uddin et al. [119] | 2019 | | Zookeeper, Elasticsearch, Phoenix | |
| Dai et al. [30] | 2019 | | Thrift | BLIP |
| Wang et al. [100] | 2019 | | Docker, Elasticsearch | |
| Zhang et al. [121] | 2019 | | Zookeeper, Docker | |
| Munshi et al. [123] | 2018 | Google Knowledge Graph | | |
Totals across the 22 surveyed works: HDFS 16, Amazon S3 5, Google Cloud 5, Azure Data Lake 5, HBase 11, MongoDB 4, Cassandra 1, graph databases 3, MySQL 3, PostgreSQL 3.
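The AI-driven indexing column is typified by FAISS [146]: clip or frame embeddings are indexed once and then queried by nearest-neighbour search. The sketch below uses random embeddings and an exact IndexFlatL2 for brevity; billion-scale deployments would switch to IVF or HNSW index types.

```python
# Minimal sketch: nearest-neighbour retrieval over clip embeddings with FAISS.
import faiss
import numpy as np

d = 128                                            # embedding dimensionality (assumed)
rng = np.random.default_rng(0)
clip_embeddings = rng.random((10_000, d)).astype("float32")  # stand-in embeddings

index = faiss.IndexFlatL2(d)                       # exact L2 search
index.add(clip_embeddings)

query = rng.random((1, d)).astype("float32")       # stand-in query embedding
distances, clip_ids = index.search(query, 5)
print("nearest clips:", clip_ids[0], "distances:", distances[0])
```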
Table 7. Technologies for data visualization and interpretation. Tools compared: Tableau, Power BI, Zeppelin, Qlik Sense, Grafana, and Kibana.

| Reference | Year |
|---|---|
| Gayakwad et al. [108] | 2024 |
| Kadam et al. [147] | 2024 |
| Kumar et al. [148] | 2022 |
| Xu et al. [112] | 2022 |
| Il-Agure et al. [149] | 2022 |
| Shafiyah et al. [113] | 2022 |
| Mendhe et al. [117] | 2020 |
| Wang et al. [100] | 2019 |
| Munshi et al. [123] | 2018 |
Totals across the nine surveyed works: Tableau 6, Power BI 3, Zeppelin 2, Qlik Sense 1, Grafana 1, Kibana 5.
Table 8. Core technologies used across ViMindXAI layers.

| Components | Applicable Technologies |
|---|---|
| Edge Inference | YOLOv8-tiny, MobileNet, TensorRT, TFLite, OpenVINO, NVIDIA Jetson |
| FL Frameworks | TensorFlow Federated, NVIDIA FLARE, FedAvg, Secure Multi-Party Computation (SMPC) |
| Stream Processing | Apache Kafka, Apache Flink, Spark Streaming, Apache NiFi |
| Hybrid and AI-Optimized Storage | Hadoop HDFS, MinIO, AWS S3, Ceph, Redis |
| Structured and Semantic Storage | Delta Lake, PostgreSQL, MongoDB, HBase, GraphDB |
| Search and Indexing | FAISS, OpenAI CLIP, Pinecone, Whisper, Elasticsearch, VideoBERT |
| Privacy and Security | Homomorphic Encryption, Differential Privacy, Zero-Knowledge Proofs (ZKP), RBAC, Intel SGX |
| UI/UX and Visualization | Tableau, Kibana, Grafana, Power BI, Unity, Meta Quest, GPT, HoloLens, GradCAM, SHAP |
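To make the differential-privacy entry in the privacy layer concrete, the sketch below applies the standard clip-and-noise step to a model update before it leaves an edge device; the clipping norm and noise multiplier are illustrative, untuned values rather than parameters prescribed by ViMindXAI.

```python
# Minimal sketch: clip a client update to a norm bound, then add Gaussian noise
# calibrated to that bound (the core step of DP-SGD-style private learning).
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))  # bound sensitivity
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

raw_update = np.random.default_rng(0).normal(size=8)  # stand-in model update
print(privatize_update(raw_update))
```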
Table 9. Comparative Analysis of ViMindXAI and Existing VBDA Platforms.

| Criteria/Layer | FALKON [118] | SIAT [119] | SurveilEdge [52] | VIDEX [130] | ViMindXAI (Proposed) |
|---|---|---|---|---|---|
| Architecture Type | Centralized | Cloud-Centric | Edge Computing | Hybrid Cloud–Edge | Hybrid Cloud–Edge (Cognitive) |
| Edge Preprocessing (DAL) | Basic edge input; limited model support | Standard ingestion (Kafka/NiFi) | YOLO-based detection + task allocation | Parallel object/anomaly threads | Edge AI (YOLOv8-tiny), denoising, ONNX models, FL at acquisition layer |
| Federated Learning and Privacy (DAL/DaAL) | Minimal FL, no privacy focus | FedAvg at cloud layer only | FL partially applied | Loosely coordinated FL logic | Full FL (FedAvg/FedProx), PETs, Zero Trust sync |
| Lakehouse and Storage Orchestration (DIL) | HDFS-based batch storage | HDFS/HBase architecture | Local DB (SQLite) only | Unspecified backend storage | Delta Lake, Iceberg, Redis, hybrid S3/Ceph |
| Multimodal Semantic Indexing (DIL) | Basic temporal–spatial indexing | Partial Elasticsearch-based indexing | Edge-level metadata tagging | Metadata-based object search | CLIP, VideoBERT, FAISS, GraphDB |
| Graph-Based Reasoning (DIL) | Not supported | Not supported | Not supported | Lightweight reasoning | Integrated graph-based modeling |
| Explainable AI (DaAL) | Not integrated | Some dashboard visualizations | Cloud-side reclassification logic | Parallel logic with minimal XAI | XAI tools (GradCAM, SHAP) throughout the pipeline |
| Compliance and Security (DaAL) | Standard encryption only | Limited PETs in FL logic | Device-local privacy only | Basic authentication | GDPR/HIPAA compliance, Apache Atlas, Zero Trust |
| Immersive and Explainable UI (UEL) | Basic dashboard visualization | Grafana/Kibana UI | 2D dashboard UI | MVVM GUI with static views | Immersive monitoring, adaptive dashboards, XAI overlays |
| Natural Language Analytics (UEL) | Not supported | Not supported | Not supported | Not supported | Natural language querying via GPT and semantic video embeddings |
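The natural-language analytics row can be approximated by pairing CLIP's text encoder with a FAISS index over video-side embeddings. The sketch below wires the two together; the random video embeddings are stand-ins for CLIP features of key frames, and cosine similarity is implemented as inner product over normalized vectors.

```python
# Minimal sketch: text-to-video retrieval with CLIP text features + FAISS.
import faiss
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

d = model.config.projection_dim                    # 512 for this checkpoint
index = faiss.IndexFlatIP(d)                       # inner product = cosine on unit vectors
video_embs = torch.nn.functional.normalize(torch.randn(1000, d), dim=-1)
index.add(video_embs.numpy())                      # stand-in key-frame embeddings

query = "a person running across a parking lot at night"
inputs = processor(text=[query], return_tensors="pt", padding=True)
with torch.no_grad():
    q = model.get_text_features(**inputs)
q = torch.nn.functional.normalize(q, dim=-1)

scores, clip_ids = index.search(q.numpy(), 5)
print("top clips:", clip_ids[0], "scores:", scores[0])
```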