Internet-of-Things Edge Computing Systems for Streaming Video Analytics: Trails Behind and the Paths Ahead

Abstract: The falling cost of IoT cameras, the advancement of AI-based computer vision algorithms, and powerful hardware accelerators for deep learning have enabled the widespread deployment of surveillance cameras with the ability to automatically analyze streaming video feeds to detect events of interest. While streaming video analytics is currently largely performed in the cloud, edge computing has emerged as a pivotal component due to its advantages of low latency, reduced bandwidth, and enhanced privacy. However, a distinct gap persists between state-of-the-art computer vision algorithms and the successful practical implementation of edge-based streaming video analytics systems. This paper presents a comprehensive review of more than 30 research papers published over the last 6 years on IoT edge streaming video analytics (IE-SVA) systems. The papers are analyzed across 17 distinct dimensions. Unlike prior reviews, we examine each system holistically, identifying their strengths and weaknesses in diverse implementations. Our findings suggest that certain critical topics necessary for the practical realization of IE-SVA systems are not sufficiently addressed in current research. Based on these observations, we propose research trajectories across short-, medium-, and long-term horizons. Additionally, we explore trending topics in other computing areas that can significantly impact the evolution of IE-SVA systems.


Introduction
The falling cost of IoT video cameras [1] and the increasing capability of deep-learning-based computer vision algorithms [2,3] make it possible to use streaming video analytics (SVA) to visually sense the environment. SVA refers to the real-time or near real-time processing of video streams to determine events in the environment as they happen. Figure 1 shows an example application of SVA in the use of traffic surveillance video cameras to detect and alert pedestrians and drivers of dangerous situations as they occur [4]. In contrast, in batch video analytics, the processing of stored videos may happen at a much later time. For example, traffic engineers may analyze stored traffic video streams to identify congestion patterns for the purpose of roadway planning. From a computing perspective, video analytics is challenging due to the large sizes of the data involved and the computation-intensive algorithms needed to process it [5]. A number of other use cases of streaming video analytics exist in multiple areas including healthcare, manufacturing, environmental monitoring, and national security [6].

Need for Edge Computing
The computation could potentially be realized on the cloud by taking advantage of powerful and scalable computation resources that are available on-demand. However, bandwidth, latency, and privacy considerations necessitate the use of the edge computing paradigm for streaming video analytics [7,8]. To put this into context, a single H.265 1080p IP camera with a high video quality operating at 15 fps requires a 2.3 Mbps uplink and generates about 25 GB of data per day [9]. Half a dozen cameras in a home or at a traffic intersection could easily saturate the local uplink capacity (typically 10 Mbps). Regarding latency, the camera-to-cloud communication over the Internet is of the order of hundreds of milliseconds. For the pedestrian safety use case, a vehicle driving at a speed of 45 mph (72 kph) covers a distance of 66 ft (20 m) in a second. Detecting whether such a moving vehicle poses a threat to a pedestrian requires detecting events with tens of milliseconds of latency including computation and communication overheads. Regarding privacy, video streams are a rich source of information. Transmitting videos to a distant cloud data center could violate user expectations of privacy, and legal requirements such as GDPR [10] and the guiding principles of the government use of surveillance technologies [11]. Furthermore, if the video stream is accessed by unauthorized third parties, unintended information (for example, personal identities for the pedestrian safety use case) could be revealed. By performing video analytics at the edge close to the video cameras, communication is limited to local area networks, thus reducing the network latency. Furthermore, the aggregate bandwidth requirements are reduced due to the distributed processing at the edge, enabling the scaling of video cameras. Moreover, sensitive video data can be confined to the privacy perimeter expected by the user (for example, a home or legal jurisdiction).
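The bandwidth and latency figures in the preceding paragraph can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function names are ours, the bitrate, uplink, and speed values are the ones quoted in the text, and the use of decimal gigabytes (1 GB = 8000 Mb) is an assumption.

```python
# Back-of-the-envelope check of the bandwidth and latency figures quoted in the text.
# All function names are hypothetical; decimal units (1 GB = 8000 Mb) are assumed.

SECONDS_PER_DAY = 24 * 60 * 60
METRES_PER_MILE = 1609.344


def daily_volume_gb(bitrate_mbps: float) -> float:
    """Data generated per day, in decimal gigabytes, by a constant-bitrate stream."""
    return bitrate_mbps * SECONDS_PER_DAY / 8000.0


def cameras_to_saturate(uplink_mbps: float, bitrate_mbps: float) -> int:
    """Smallest number of cameras whose aggregate bitrate exceeds the uplink."""
    return int(uplink_mbps // bitrate_mbps) + 1


def distance_travelled_m(speed_mph: float, latency_ms: float) -> float:
    """Distance in metres covered by a vehicle during the given latency."""
    metres_per_second = speed_mph * METRES_PER_MILE / 3600.0
    return metres_per_second * latency_ms / 1000.0


if __name__ == "__main__":
    print(f"Daily volume at 2.3 Mbps: {daily_volume_gb(2.3):.2f} GB")
    print(f"Cameras to exceed a 10 Mbps uplink: {cameras_to_saturate(10.0, 2.3)}")
    for latency_ms in (1000, 300, 30):
        metres = distance_travelled_m(45, latency_ms)
        print(f"Distance at 45 mph in {latency_ms} ms: {metres:.2f} m")
```

Under these assumptions, a 2.3 Mbps stream produces a little under 25 GB per day, five such cameras already exceed a 10 Mbps uplink, and a vehicle at 45 mph moves about 20 m per second, so a round trip of a few hundred milliseconds corresponds to several meters of travel.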
Figure 1. A smart-city traffic intersection equipped with surveillance cameras. Many city departments could consume the camera feeds for multiple applications including traffic monitoring, pedestrian safety, detecting traffic law violations, public safety, and environmental monitoring. Each application may require a separate video analytic pipeline (VAP), with the possibility of sharing VAP components between the applications. The processing of video streams is implemented on edge nodes including the cameras, and on nearby edge servers such as in the traffic box.

Key Contributions
In recent years, there has been a notable surge in research focused on SVA at the edge. Despite the widespread deployment of cameras in cities (for example, London, UK, has more than half a million IoT surveillance cameras) and private establishments, as well as the significant advancements in deep learning and AI-powered computer vision, a substantial gap still persists in effectively utilizing AI to analyze these camera streams in real-time to derive actionable insights. The research question that this paper seeks to answer is: what is the state-of-the-art in IoT edge streaming video analytics (IE-SVA) systems? The goal of this paper is to thoroughly analyze IE-SVA systems as reported in the research literature, so as to provide clarity to researchers and industry practitioners on the progress to date, the techniques employed, and the additional research and development that needs to be performed to further the field.
To achieve these goals, we begin by outlining the characteristics of an ideal IE-SVA system. By using this ideal system as a framework, we analyzed existing research literature on video analytics along 17 unique dimensions. The analysis is presented in tabular format with 37 reported works listed in chronological order (2015-2023) to help the reader readily understand the research progress made by these systems along different dimensions. These systems are also classified according to their primary research focus, allowing readers to easily access works that delve into specific techniques aligned with their interests. Based on this analysis, we propose research directions for the short, medium, and long term.
In the short term (3 years), our proposed research aims to build upon existing work by incorporating different techniques reported in the literature in a straightforward manner so as to advance the state-of-the-art on end-to-end edge systems for video analytics. The medium-term research proposals (5 years) outline new areas of study that the community needs to undertake in order to address system-level challenges identified in the review. Looking ahead, the long-term research plan (10 years) envisions the future evolution of streaming video analytics at the edge and proposes exploratory research directions.
While experienced researchers may benefit from our comprehensive treatment of IE-SVA systems, this review specifically aims to provide new researchers with a broad background, and a timeline-based overview of how IE-SVA systems have developed over the years. In particular, Section 6 describes multiple techniques covering nine specific issues for implementing IoT edge streaming video analytics systems with research gaps identified that can be pursued in the short term.

Paper Organization
This paper is organized as follows. Section 2 describes related work, emphasizing how this review contributes a distinct perspective to the few available reviews on this topic. Section 3 lays out the necessary background on topics such as streaming video analytics, components of IoT edge systems, and the challenges of using edge computing for streaming video analytics. In Section 4, we outline the optimal system requirements for edge computing in the context of streaming video analytics. Section 5 offers a critical analysis of previously documented IE-SVA systems. Section 6 then drills down on the specific techniques with case studies from the reported work. Section 7 outlines our research vision, breaking it down into short-term, medium-term, and long-term objectives. Section 8 briefly examines how other advancements in computing might impact systems designed for video analytics at the edge. Finally, in Section 9, we wrap up with a summary of the paper.

Related Work
The studies closest to ours are the surveys on deep-learning-driven edge video analytics by Xu et al. [12] and edge-based video analytics by Hu et al. [13]. These surveys catalog various edge video analytics systems from the literature, categorizing them based on a range of criteria including edge architectures, profiling, algorithms, scheduling, orchestration framework, and aspects of privacy and security. As a result, they often refer to the same system multiple times under different criteria. While this approach offers the advantage of providing an extensive perspective on the techniques used, its disadvantage is that it can make it challenging to understand how these techniques are integrated into an end-to-end system.
Other related reviews include Goudarzi et al.'s [14] survey which analyzes scheduling IoT applications in edge and fog computing environments. They study factors such as the application structure, environmental architecture, optimization characteristics, scheduling framework, and performance evaluation to identify research gaps. However, their focus is limited to scheduling. Zhang et al.'s [6] survey focuses on edge video analytics specifically for public safety. While some overlap exists with our work, they mainly examine algorithms and systems from a public safety perspective. Other surveys related to edge computing or video analytics are limited in scope to their respective areas [15][16][17].
In contrast, our work adopts a more systems-oriented perspective with an emphasis on issues regarding the practical deployment of IE-SVA systems. We provide a comprehensive overview of 37 recently reported systems for edge video analytics, analyzing each system along 17 different dimensions. The information is presented in a tabular format allowing readers to readily grasp the evolution of edge video analytics systems, the mix of techniques employed in a particular system, and the areas that have received limited attention from the research community to date. Additionally, we focus on specific techniques, categorizing them by the problems they address. For each technique, we briefly dive into reported works, which in our opinion, provide good illustrations of the highlighted techniques. Furthermore, we briefly review research developments in other areas of computing that may impact edge video analytics. Based on this comprehensive analysis, we make concrete proposals for short-term, medium-term, and long-term research on edge video analytics systems. As a limitation, our review does not comprehensively cover all the reported work on IE-SVA systems. Instead, we selectively chose the works to review in order to make the content more manageable for readers.

Background
In this section, we provide a brief background of streaming video analytics, the hardware and software system components that are needed to realize video analytics at the edge, and the challenges in implementing these systems. References to more detailed, tutorial-like treatments of these topics are provided.

Streaming Video Analytics
Streaming video analytics involves the real-time processing of video frames to extract valuable information [18,19]. For instance, IoT surveillance camera streams at traffic intersections can be utilized to detect and alert pedestrians about potential hazards. It is worth noting that streaming video analytics operates within specific time constraints. In contrast, in batch video analytics, video frames are stored and queried later for useful information. For example, traffic engineers can analyze months' worth of stored traffic video streams to understand traffic patterns. In addition to generating real-time actionable insights, streaming video analytics also offers the advantage of reducing data storage requirements. By storing only the object type and position instead of the entire video, significant storage savings can be achieved. Moreover, videos contain abundant information that can potentially compromise privacy and confidentiality. Through streaming video analytics, only the relevant information needed for the specific application is extracted, allowing the videos themselves to be discarded.
Video analytics typically involves a series of operations organized as a directed acyclic graph (DAG). These operations typically include video decompression, frame pre-processing, object detection, object classification, image segmentation, object tracking, pose estimation, and action classification [20]. Algorithm 1 [21] describes an example of a video analytics pipeline (VAP) for activity detection. The different stages utilize computationally intensive deep learning algorithms. Additionally, multiple deep learning algorithms with varying performance-accuracy trade-offs exist for each operation. For instance, object detection can be accomplished using a more accurate but slower two-stage detector like Fast-RCNN or Mask-RCNN, or a faster but less accurate single-stage detector such as YOLO or CenterNet. The TensorFlow model zoo has over 40 models for object detection [22]. Moreover, each implementation offers adjustable parameters like bit-rate, frame rate, and resolution. Consequently, a single VAP can have numerous implementations with different performance-resource trade-offs. Furthermore, the performance of these operations is heavily influenced by the content of the video scene. Techniques aimed at enhancing computational efficiency, such as adjusting the decoding bit rate, dropping frames, or filtering specific parts of video frames, affect both how well the application's requirements are met and the computational resources consumed [23,24].
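To make the DAG view and the resulting configuration space concrete, the following Python sketch models a small pipeline and enumerates its implementation choices. It is a minimal illustration under stated assumptions: the stage names, their dependencies, and the knob values are hypothetical and are not taken from Algorithm 1 or from any system reviewed here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from itertools import product

# Hypothetical VAP: each stage maps to the set of stages it depends on.
VAP_DAG = {
    "decode": set(),
    "preprocess": {"decode"},
    "detect_objects": {"preprocess"},
    "track_objects": {"detect_objects"},
    "estimate_pose": {"detect_objects"},
    "classify_action": {"track_objects", "estimate_pose"},
}

# Hypothetical tuning knobs; the values are illustrative placeholders.
KNOBS = {
    "detector": ["two_stage", "single_stage"],
    "frame_rate_fps": [5, 15, 30],
    "resolution": ["480p", "720p", "1080p"],
}


def execution_order(dag):
    """Return one valid order in which the stages can be executed."""
    return list(TopologicalSorter(dag).static_order())


def enumerate_configs(knobs):
    """Return every combination of knob settings as a list of dictionaries."""
    names = list(knobs)
    value_lists = [knobs[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


if __name__ == "__main__":
    print("Stage order:", execution_order(VAP_DAG))
    print("Candidate configurations:", len(enumerate_configs(KNOBS)))
```

Even with only three knobs, this toy pipeline has 18 candidate configurations. Each one would have to be profiled for accuracy and resource use, which illustrates why the number of possible VAP implementations grows quickly.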

Application Use Cases
This section provides a brief overview of the various applications of streaming video analytics across diverse industries. It is worth noting that each of these application areas poses domain-specific constraints that influence the system design.
Transportation: Autonomous vehicles leverage multiple video cameras for environmental perception. Given the stringent latency requirements, these video streams are processed in real-time within the vehicle itself [25]. Intelligent roadways use streaming video analysis to alert both the drivers and pedestrians of potentially hazardous situations [26,27].
Public safety: The public safety sector benefits from the automatic analysis of surveillance camera feeds to detect incidents, such as crime and assault [6]. Additional applications include crowd monitoring to avoid dangerous overcrowding [28] and the early detection of fire incidents [29]. However, this application poses concerns about potential biases against certain demographic groups [30] and potential privacy infringements [31].
Healthcare: The healthcare sector employs video analytics for continuous patient monitoring in healthcare and homecare facilities, enabling the immediate notification of healthcare personnel in the event of incidents like falls [32].
Environmental monitoring: Environmental monitoring uses video analytics for tracking events such as wildfires, flash floods, illegal dumping, and wildlife movement [33]. As climate change challenges intensify, visual environmental sensing to drive intelligent interventions will become increasingly crucial.
Industrial applications: On factory floors, video analytics serve to monitor worker and site safety [34] and assembly line efficiency [35].
Retail: In retail settings, video analytics are used to monitor customer behavior in stores, gauge customer interest, and effectively deploy store staff for customer assistance. Other applications include automated checkouts and inventory management [36].
Search and rescue: The ability to deploy drones to locate individuals needing rescue during adverse events like floods, earthquakes, and wildfires is greatly enhanced by performing streaming analytics on drone footage [37,38].
National security: National security applications encompass drone-based battlefield surveillance, perimeter monitoring, and personnel search and rescue operations [39].
Augmented and virtual reality (AR/VR): The demanding latency requirements for AR/VR applications necessitate the processing of camera streams from headsets at the edge, with results relayed back to the headset for visual display [40].
Robotics: Robots, in their operation, employ multiple cameras to perform onboard streaming video analytics, thereby enriching their perception of the environment [41].

System Architecture
In a broad sense, edge computing refers to any computing performed on the edge of the network. As shown in Figure 2, the IoT edge streaming video analytics (IE-SVA) hardware hierarchy forms a tree structure with IoT cameras at the leaf nodes; single-board computers on a local area network (LAN) shared with the IoT cameras at the next higher level; workstations on the LAN a level higher; micro data centers on a wide-area network (WAN) a level further up; and finally the public/private cloud at the root node. Multiple such edge hierarchies could be geo-distributed to cover a large area. Within this hierarchical structure, computing could be organized vertically with control flowing from top to bottom, and data flowing from bottom to top of the edge tree. Alternatively, computing could be organized in a hybrid fashion, where computing is organized vertically within a subtree, and horizontally (peer-to-peer) across subtrees. Furthermore, the individual nodes in the tree could be stationary or could be mobile (for example, a drone-based camera). Also, not all levels may be present in a particular implementation. In others, additional levels of computation may be added to the tree structure as the system scales. End users connect to the system via web-based or mobile application frontends. Aside from the cameras, not all levels need be present in an implementation. The computation nodes may be organized as clusters for availability and fault tolerance. Data flows vertically up the hierarchy starting from the cameras, while control flows down the hierarchy. In some implementations, data may flow horizontally as well.
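The tree-structured hierarchy described above can be sketched as a simple data structure in which data climbs from a camera toward the root and control descends from a node to everything below it. The Python sketch below is illustrative only: the class, the function names, and the node names are hypothetical, and it models just the vertical organization, not the hybrid peer-to-peer case.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EdgeNode:
    """One node of a hypothetical IE-SVA hardware tree."""
    name: str
    tier: str
    # The parent link is excluded from repr and comparison to avoid cycles.
    parent: Optional["EdgeNode"] = field(default=None, repr=False, compare=False)
    children: List["EdgeNode"] = field(default_factory=list)

    def add_child(self, child: "EdgeNode") -> "EdgeNode":
        """Attach a child node and return it so that calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child


def data_path(node: EdgeNode) -> List[str]:
    """Names of the nodes visited as data flows upward from node to the root."""
    path = []
    current: Optional[EdgeNode] = node
    while current is not None:
        path.append(current.name)
        current = current.parent
    return path


def control_targets(node: EdgeNode) -> List[str]:
    """Names of all descendants reached as control flows downward (pre-order)."""
    targets: List[str] = []
    for child in node.children:
        targets.append(child.name)
        targets.extend(control_targets(child))
    return targets


# Example hierarchy: cloud root, one node per intermediate tier, two leaf cameras.
cloud = EdgeNode("cloud", "public/private cloud")
micro_dc = cloud.add_child(EdgeNode("micro_dc", "micro data center (WAN)"))
workstation = micro_dc.add_child(EdgeNode("workstation", "workstation (LAN)"))
sbc = workstation.add_child(EdgeNode("sbc", "single-board computer (LAN)"))
cam1 = sbc.add_child(EdgeNode("cam1", "IoT camera"))
cam2 = sbc.add_child(EdgeNode("cam2", "IoT camera"))

if __name__ == "__main__":
    print("Data path from cam1:", data_path(cam1))
    print("Control targets of cloud:", control_targets(cloud))
```

Omitting a tier, as the text allows, amounts to attaching a node directly to a higher-level ancestor; the two traversal functions are unaffected.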

Hardware
Hardware components comprising IE-SVA systems include wired and wireless video cameras (with or without onboard processing), a low-power edge computing cluster consisting of single-board computers (SBCs) incorporating embedded GPUs and TPUs, workstations equipped with consumer-grade GPUs, micro data centers housing server-class machines, powerful GPUs, FPGAs, and TPUs, and a public cloud backend with virtually limitless computing resources. Storage options range from SD cards on SBCs, to SSDs on workstations, to networked storage in the micro data center, and storage-as-a-service (for example, AWS S3 object store) in the cloud. Networking options range from wireless networks (WiFi, 4G, 5G) to wired networks (Ethernet, optical).
Cloud computing typically involves large data centers equipped with racks of homogeneous servers connected by high-speed networks (25-100 Gbps). In contrast, edge computing hardware is highly heterogeneous, connected by less reliable networks at lower speeds (1 Mbps-1 Gbps). Additionally, unlike the dedicated cloud data centers, edge resources are housed in a variety of locations, from weatherproof cases in the proximity of outdoor cameras to small server rooms in office buildings.

System Software Stack
Cloud computing entails a vast distributed system consisting of numerous servers organized in data centers. Over the past 15 years, an extensively developed open source software stack has aimed to simplify the programming and management of these large-scale distributed systems. However, it is important to note that the cloud software stack is specifically designed for a homogeneous distributed system. It assumes that there are high-speed, reliable networks connecting the system and a controlled environment within a data center. As large-scale edge computing is still in its early stages, initial attempts have involved using cloud system software at the edge. Hence, we will provide a brief overview of the cloud software stack. A comprehensive treatment of cloud system software is provided by [42][43][44].
Cloud computing has widely adopted the infrastructure-as-a-service (IaaS) paradigm, involving the virtualization of the physical computation, storage, and network. Users are able to provision these virtualized resources in a pay-as-you-go fashion, with the ability to scale the resources up and down as needed. Cloud applications typically adopt a microservice architecture, where different services primarily communicate via a REST or RPC API. The loose coupling of services enables rapid code iteration and scalability in deployments. Microservices are packaged within lightweight OS-level virtual machines known as containers, with Docker [45] being a widely used container implementation. Orchestrating these containers is Kubernetes [46], which currently serves as the de facto cloud operating system. Asynchronous communication between services is achieved through distributed broker-based messaging systems, with Kafka [47], NATS [48], and RabbitMQ [49] being prominent open source solutions. For data storage, open source options include both SQL (for example, Postgres, MySQL) and NoSQL (for example, MongoDB, Cassandra) solutions, each offering various architectures for distributed storage [50]. In recent years, serverless architectures have gained popularity, where application programmers focus solely on the business logic while the cloud provider handles server deployment [44]. Examples of the serverless paradigm include function-as-a-service and backend-as-a-service. Many of the computation, storage, and service offerings by cloud vendors today adhere to this paradigm. For instance, Amazon AWS S3 object storage seamlessly scales its capacity to accommodate any amount of data without users needing to worry about infrastructure provisioning. A comprehensive review by Schleier-Smith et al. [44] explores the state-of-the-art in serverless computing and argues that this paradigm will dominate cloud computing in the next decade. While Kubernetes has been adapted for edge computing through projects like K3s [51] supported by commercial vendors, much of the other system software native to the edge remains confined to research publications [52][53][54][55].

Edge Computing Challenges
Edge computing offers several advantages over cloud computing, including reduced latency, lower bandwidth requirements, and the ability to keep data within privacy boundaries. However, along with these benefits come a unique set of challenges. One of the primary challenges is the limited resources available at the edge.
Cloud companies have large-scale data centers equipped with thousands of servers, massive storage capacities, high-speed networks, and dedicated operational and maintenance teams. In contrast, edge computing takes place in various settings, ranging from server rooms in office buildings or cellular base stations to more constrained environments such as a traffic box at a roadway intersection. The hardware used at the edge is also diverse, ranging from comprehensive cloud-in-a-box solutions like AWS Outposts [56] to smaller clusters of low-cost single-board computers like Nvidia Jetson and Raspberry Pi. Moreover, as mentioned earlier, unlike the high-speed low-latency networks on the cloud, the edge employs a variety of networks including wired local/wide/metropolitan area networks (LANs, WANs, MANs), and wireless networks including WiFi, 4G/LTE, and 5G.
These resource limitations pose obstacles to implementing streaming video analytics applications at the edge. For instance, supporting multiple complex deep-learning-based computer vision pipelines may simply not be feasible due to resource constraints. Furthermore, from a system software perspective, these limitations make it challenging to employ the traditional cloud strategy of ensuring reliability through replication. Additionally, the hardware heterogeneity across different edge platforms makes it difficult to develop a single solution that can be applied universally.
In the context of streaming video analytics, a resource-intensive deep learning model that can efficiently execute on a cloud-in-a-box setup may experience high latency or fail to execute altogether on a single-board computer. Another challenge at the edge is the issue of shared ownership. In a smart-city scenario, for example, an IE-SVA platform, including IoT cameras, may be owned by state transportation departments. Various city departments such as traffic management, law enforcement, and environmental monitoring share their utilization of the cameras for their specific applications. Each application may have significantly different performance requirements, and some may even need to actively control the cameras (e.g., pan, zoom, and tilt) [5] for accurate event detection.
Securing the system from external attackers and malicious users is more challenging at the edge than in the cloud because of the diversity of hardware and software systems, the limited physical security of the computation equipment, and the limited engineering experience of operational teams associated with end-users such as cities, community organizations, and small businesses. From a budgetary point of view, unlike the cloud infrastructure maintained by companies with considerable resources, community-owned and non-profit edge infrastructures are often subject to tight budgetary constraints. Finally, from an environmental perspective, minimizing energy use, incorporating sustainable energy sources, and maximizing the utilization and lifetime of equipment are important considerations.

Ideal System Requirements for IoT Edge Streaming Video Analytics
In this section, we present our view of the ideal characteristics of IoT edge streaming video analytics (IE-SVA) systems. These are grouped by hardware, application support, operational ease, human factors, and sustainability.

Resource Heterogeneity
Ideally, the edge system supports different types of IoT surveillance cameras, each with its own unique specifications such as resolution, connection capabilities, and processing capability. The system allows for the number of deployed cameras to grow as needed. Moreover, the system possesses the ability to virtualize these cameras, exposing logical camera streams to applications and thus allowing for flexibility and scalability in their deployment. Additionally, the system is designed to take advantage of a wide variety of distributed computation resources, including single-board computers, edge servers, micro data centers, and private and public clouds. These computation resources are heterogeneous, supporting multiple processor, accelerator, network, and storage architectures.

Application Support
The system is ideally designed for multiple organizations to deploy numerous video analytics applications, using shared resources such as camera feeds, deep learning models, and intermediate processing results. This approach supports the camera-as-a-service concept [57], separating camera users from owners to ensure the maximum utilization of the deployed hardware. It also fosters a rich ecosystem of applications. Moreover, the system is ideally built to keep up with advancements in video analytics algorithms. It offers a mechanism to upgrade all or certain parts of existing applications with newer models, catering to the evolving demands of video analytics. For applications utilizing multiple cameras, the system facilitates localized cooperation between the cameras to enhance performance. The deployment framework provided by the system is both forward and backward compatible with video analytics applications. This feature further enriches its usability, making it a robust and adaptable platform for future advancements.

Operational Ease
The efficient operation of the IE-SVA system is critical to its success. The system is designed to be easily managed without the need for a sophisticated operational team. It incorporates built-in redundancy and fault tolerance mechanisms, enabling certain parts of the system to operate autonomously in case of failures. Moreover, the system enables operators to perform root cause analysis in the event of a failure, allowing for the swift identification and resolution of issues. It is also built with a secure-by-design approach, promptly reporting security events and facilitating the isolation of compromised parts of the system. Furthermore, the modular nature of the system encourages innovation at every level of the stack, fostering continuous improvement and adaptability to dynamic changes in resource availability and application requirements.

User Friendliness for End Users
The system prioritizes ease of use, ensuring that non-technical users can easily interact with the system, particularly when submitting queries or accessing relevant information. Additionally, the system provides multiple logical abstractions allowing application developers to choose the abstractions appropriate for their use case.

Sustainability
The sustainability of widely deployed edge systems for video analytics is both an economic and societal concern. This revolves around optimizing the system's utilization, minimizing power consumption, and tapping into renewable and possibly intermittent energy sources. It also includes maximizing the use of sunk costs, which allows for the operation of deployed hardware for many years. Fully utilizing the system's capabilities ensures that resources are efficiently employed, thus reducing e-waste and unnecessary expenditure.

Reported Systems for IoT Edge Streaming Video Analytics
In this section, we analyze research reported in the literature for realizing IoT edge streaming video analytics (IE-SVA) systems. Importantly, we exclude works solely dedicated to video analytics algorithms, or those involving cloud-only video analytics. Cloud video analytics has a longer history [19], and early works such as Gabriel [58] and Vigil [59] involved the use of video processing at the edge. However, the 2017 IEEE Computer article titled "Real-Time Video Analytics: The Killer App for Edge Computing" by Ananthanarayanan et al. [5] from Microsoft Research was influential in popularizing research on IE-SVA systems. In their article, the authors make a case for why edge computing is essential for streaming video analytics, and sketch out the core technologies needed to realize such a system. Their application emphasis is on live traffic analytics. Since their initial paper, the Microsoft Research group has been active in publishing research articles on this topic, including the release of the open source Microsoft Rocket for live video analytics [60]. Rocket is, however, tied to Microsoft products such as the Azure cloud and IoT platforms. Tables 1 and 2 show research projects reported in the literature on IE-SVA systems from 2015 to 2023. Thirty-seven papers were analyzed in chronological order along 17 different criteria. The papers were selected for their capacity to illuminate various facets of IE-SVA. Our goal is to provide the reader with a sense of the historical evolution of IE-SVA systems, the multiple system design aspects considered in these works, and to highlight topics that have not yet received sufficient attention from the research community. A more detailed analysis of these works is presented in Section 6.

Analysis Criteria
We provide a brief description of the criteria by which the research works are analyzed in Tables 1 and 2. The criteria are derived from the requirements of an ideal IE-SVA system described in Section 4.

1. Project (year): Project name and year of publication. If the project is not named by the authors, then the last name of the first author is listed.
2. Focus: The primary design goal of the paper.
3. Cross-camera inference: "Yes" indicates that the video analytics pipelines jointly consider the output of two or more cameras. "No" indicates that the analytics of each camera are independent.
4. VAP components: Describes the distinct operations implemented by the video analytics pipelines described in the work. It should be noted that, while core components such as object detection and tracking involve computation-intensive deep learning algorithms, others such as video decoding and background subtraction use classical signal and image processing techniques.
5. Performance objectives: These include both application performance objectives and system performance objectives. Application latency is the end-to-end latency from the point of capturing the video stream until the delivery of detected events to the end user. Application accuracy is typically expressed with metrics such as F1 score. System performance objectives revolve around computation, memory, bandwidth, power, and cost constraints.
6. Profiling method: Profiling involves measuring the performance and resources associated with a video analytic pipeline using benchmark videos. Profiling could be performed either offline or online.
7. Architecture: Figure 2 shows a generic edge architecture for video analytics. Within this general framework, specific edge architectures include edge-cloud, distributed edge, and multi-tiered hierarchical edge depending on the layers involved, and the communication patterns. Furthermore, an implementation could involve a combination of these architectures. For example, a scalable system without a public cloud could be composed of clusters of distributed edge nodes, with a geo-distribution-based hierarchy (indicated as DE and HE in Table 1).
8. Scheduling: Describes algorithms reported for placing VAP components on the edge nodes such that performance and resource constraints are met.
9. Runtime adaptation: Indicates whether a run-time performance adaptation technique was employed.
10. Control plane: Indicates whether the work describes the design of a control plane. The control plane consists of the system software that controls the edge infrastructure.
11. Data plane: Indicates whether the work describes the design of a data plane. The data plane consists of the system software that facilitates the flow of data between the analytics components.
12. Human interface: Indicates whether the work reports aspects of the human user interface. Users, developers, and operators are the different types of people that interact with edge video analytic systems. The human interface design seeks to make this interaction easy and intuitive. A good UI/UX is key in ensuring that the systems constructed are used to their full potential by users.
13. Security: Indicates whether the work considers the cybersecurity aspects of the system. Securing the system from malicious use is of the utmost importance, especially considering the sensitive nature of video data.
14. Fault tolerance: Indicates whether the work describes fault tolerance aspects of the system. Faults include both hardware and software failures.
15. Observability: Indicates whether the work considers observability aspects of the system. The ability to measure and analyze system operational information and application logs is critical to understanding the operational status of large-scale IE-SVA systems, as well as troubleshooting, locating, and repairing failures.
16. Evaluation: Describes the type of evaluation testbeds used in the work. Approaches include the emulation of edge nodes using virtual machines, video workloads from standard datasets, the use of simulators, and edge hardware to build experimental testbeds.

Discussion
The previous section offered an overview of IoT edge streaming video analytics (IE-SVA) systems as discussed in the existing literature. In this section, we focus on specific techniques, categorizing them by the problems they address. For each technique, we briefly dive into a few of the papers listed in Tables 1 and 2, which, in our opinion, provide good illustrations of the highlighted techniques. We also list the other papers wherein a similar technique is employed. Additionally, we highlight gaps in the research with suggestions for future work. It should be noted that the discussions are meant to provide a high-level understanding of the technique, and are specifically focused on IE-SVA systems. The readers are encouraged to examine the cited work for implementation details.

Network Bandwidth
The "big data" nature of video analytics processing places a high demand on network bandwidth both between the edge and the cloud, and between the IoT camera nodes and the edge. As a result, many papers have focused on tackling the associated bandwidth challenges.

Technique 1: Trade-Offs in Application Accuracy vs. Bandwidth
The general idea is to exploit the ability of video analytics applications to operate at reduced accuracy to achieve bandwidth savings by using data reduction techniques such as dropping frames, or encoding frames at a reduced bit rate. The papers on AWStream [66] and CASVA [88] provide two different approaches to the application of this technique.
AWStream employs a hybrid approach of offline and online training to construct a precise model that correlates an application's accuracy with its bandwidth usage via a set of tuning knobs. These knobs include resolution, framerate, and quantization. It autonomously identifies a Pareto-optimal policy to govern the timing and manner of leveraging these knobs. In real-time operation, AWStream's system continuously monitors network conditions and adjusts the data streaming rate to align with the available bandwidth. It maintains high-accuracy levels by utilizing the pre-learned Pareto-optimal configurations. During network congestion, the system employs a state machine-based adaptation algorithm to lower the accuracy, thereby reducing the data rate and preventing the buildup of a persistent queue.
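The idea of selecting among Pareto-optimal configurations can be illustrated with a minimal sketch. It is not AWStream's implementation: the `Config` fields, the profile values, and the helper names `pareto_front` and `pick_config` are hypothetical, and the state machine-based adaptation is omitted. The sketch only shows how a profile of (bandwidth, accuracy) measurements could be reduced to a Pareto front and queried against the currently available bandwidth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    resolution: int        # frame height in pixels
    fps: int               # frames per second
    bandwidth_kbps: float  # profiled bandwidth (hypothetical value)
    accuracy: float        # profiled accuracy in [0, 1] (hypothetical value)


def pareto_front(configs):
    """Keep configs that no other config beats on both bandwidth and accuracy."""
    front = []
    best_accuracy = -1.0
    # Ascending bandwidth; ties broken by descending accuracy.
    for cfg in sorted(configs, key=lambda c: (c.bandwidth_kbps, -c.accuracy)):
        if cfg.accuracy > best_accuracy:
            front.append(cfg)
            best_accuracy = cfg.accuracy
    return front


def pick_config(front, available_kbps):
    """Most accurate Pareto config that fits the available bandwidth."""
    feasible = [c for c in front if c.bandwidth_kbps <= available_kbps]
    if not feasible:
        # Nothing fits: fall back to the cheapest configuration.
        return min(front, key=lambda c: c.bandwidth_kbps)
    return max(feasible, key=lambda c: c.accuracy)


PROFILE = [
    Config(1080, 30, 4000.0, 0.92),
    Config(720, 30, 2200.0, 0.88),
    Config(720, 15, 1300.0, 0.84),
    Config(480, 30, 1500.0, 0.80),  # dominated by 720p at 15 fps
    Config(480, 10, 500.0, 0.65),
]
FRONT = pareto_front(PROFILE)
```

Under these assumed numbers, a bandwidth estimate of 2000 kbps would select the 720p, 15 fps configuration, since it is the most accurate entry on the front that still fits.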
In contrast to AWStream, CASVA aims to optimize server-side DNN inference accuracy while dynamically adjusting to fluctuating network bandwidth. It does this by manipulating configuration parameters such as resolution and frame rate. Unlike AWStream, CASVA forgoes the use of profiling-based methods to correlate configuration settings with performance. This decision is made to eliminate the computational overhead of profiling and its limitations in capturing video content variations. Instead, CASVA employs an actor-critic architecture based on deep reinforcement learning (DRL) to determine the optimal configuration for individual video segments. This approach enables the system to adapt to both granular changes in network bandwidth and variations in video content. CASVA's training process occurs offline and leverages a trace-driven simulator that accounts for diverse network conditions and video content. After completing the training, CASVA's configuration controller utilizes the learned policy to determine the optimal settings for each video segment during live streaming sessions.
Examples of other works reported in the literature that employ this technique include Vigil [59], Wang et al. [68], and Runespoor [41].

Technique 2: Hybrid Computation between Edge and Cloud
Unlike approaches that schedule the computation exclusively at either the edge or the cloud, here the computation is performed jointly between the edge and the cloud. The EdgeDuet [90] and Clownfish [79] projects employ this technique.
EdgeDuet conducts object detection through a hybrid edge-cloud architecture. Large objects are locally identified on the edge device using lightweight deep neural network (DNN) models. Conversely, small objects are detected using computationally intensive DNN models located in the cloud. To optimize the cloud-based detection of small objects, EdgeDuet employs two key techniques: region-of-interest (RoI) frame encoding and content-prioritized tile offloading. RoI frame encoding is specifically used to reduce the network bandwidth consumption. In this approach, only pixel blocks that are likely to contain small objects are transmitted at a high resolution, while the remaining portions of the frame are compressed and transmitted at low quality.
Clownfish employs a two-tiered deep learning (DL) architecture to optimize both the response speed and analytical accuracy. At the edge, it deploys a lightweight, optimized DL model for rapid data processing. In the cloud, a more comprehensive DL model ensures high-accuracy analytics. Clownfish takes advantage of the temporal correlation present in the video content to selectively transmit only a subset of video frames to the cloud, thereby conserving network bandwidth. The system then improves the quality of analytics by merging results obtained from both the edge and cloud models.
Examples of other works that employ this technique include FilterForward [71] and Runespoor [41].

Research Gaps
While offline and online profiling has received a fair amount of attention in the literature, the use of model-free methods has received considerably less attention. Only the CASVA [88] project explores the use of deep reinforcement learning (DRL) in IE-SVA systems. However, in our opinion, much more work needs to be performed in the application of DRL for exploiting the accuracy-bandwidth trade-offs. Offline learning strategies require the use of trace-driven simulators with real-world traces. The traces used in CASVA rely on a fixed broadband dataset provided by the FCC, which may not be applicable to a particular operating environment of interest. An alternative is to use online policy learning strategies, and hybrid approaches such as experience replay [96]. Another promising line of research would be the merging of the two techniques described above, with a continuum of hybrid edge-cloud processing depending on the bandwidth and resource constraints available at the edge.

Computational Efficiency
The high computational requirements of DNN models, and resource constraints at the edge have led researchers to explore techniques to effectively use DNNs at the edge.

Technique 1: Trade-Offs in Application Accuracy vs. Resource Usage
As with bandwidth reduction, techniques that sacrifice the accuracy of the application can potentially reduce the computational load. The VideoEdge [65] project highlights the use of this technique.
In the VideoEdge project, video analytics queries are processed through a pipeline of computer vision components. These components have varying resource requirements and produce results with different levels of accuracy. Additionally, each component can be adjusted through various parameters such as frame resolution and frame rate, creating a large configuration search space. VideoEdge narrows this search space by identifying Pareto optimal configurations that balance resource demand and output accuracy. The project also introduces the concept of "dominant demand", defined as the maximum ratio of demand to capacity across all resources and clusters in the system hierarchy. This metric facilitates the direct comparison of different configurations in terms of both demand and accuracy, preventing the excessive consumption of any single resource. To optimize the system further, VideoEdge employs a greedy heuristic that iteratively explores configurations within the Pareto optimal band, switching to configurations that offer increased accuracy with a minimal increase in dominant demand.
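The dominant demand metric and a greedy upgrade loop in its spirit can be sketched as follows. This is a simplified illustration rather than the VideoEdge algorithm: the data layout, the resource keys, the scoring rule (accuracy gain divided by the increase in dominant demand), and all numbers are assumptions, and the starting assignment is assumed to be feasible.

```python
def total_demand(choice, options):
    """Sum the per-resource demand of the config currently chosen for each query."""
    total = {}
    for query, idx in choice.items():
        for key, amount in options[query][idx][1].items():
            total[key] = total.get(key, 0.0) + amount
    return total


def dominant_demand(demand, capacity):
    """Maximum demand-to-capacity ratio over every (cluster, resource) key."""
    return max(demand.get(key, 0.0) / cap for key, cap in capacity.items())


def greedy_allocate(options, capacity):
    """options: {query: [(accuracy, {key: demand}), ...]} sorted by accuracy."""
    choice = {q: 0 for q in options}  # start from the cheapest config per query
    while True:
        base = dominant_demand(total_demand(choice, options), capacity)
        best = None
        for q, idx in choice.items():
            if idx + 1 >= len(options[q]):
                continue  # already at the most accurate config
            trial = {**choice, q: idx + 1}
            dd = dominant_demand(total_demand(trial, options), capacity)
            if dd > 1.0:
                continue  # would exceed the capacity of some resource
            gain = options[q][idx + 1][0] - options[q][idx][0]
            score = gain / max(dd - base, 1e-9)
            if best is None or score > best[0]:
                best = (score, q)
        if best is None:
            return choice
        choice[best[1]] += 1


# Hypothetical capacities and per-query Pareto configurations.
CAPACITY = {("edge", "gpu"): 10.0, ("edge", "uplink"): 10.0}
OPTIONS = {
    "q1": [(0.6, {("edge", "gpu"): 2.0, ("edge", "uplink"): 1.0}),
           (0.9, {("edge", "gpu"): 6.0, ("edge", "uplink"): 2.0})],
    "q2": [(0.5, {("edge", "gpu"): 2.0, ("edge", "uplink"): 1.0}),
           (0.7, {("edge", "gpu"): 7.0, ("edge", "uplink"): 2.0})],
}
RESULT = greedy_allocate(OPTIONS, CAPACITY)
```

With these assumed values, the loop upgrades `q1` first because it yields more accuracy per unit of added dominant demand, and then stops because upgrading `q2` would exceed the GPU capacity.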
Chameleon [67] is another work that employs this technique.

Technique 2: Edge Efficient DNN Models
The goal here is to come up with strategies to maximally utilize accelerators such as GPUs and TPUs on the edge nodes. GEMEL [92], DeepQuery [72], DeepRT [87], and MicroEdge [89] are examples of applications of this technique.
To address the challenge of running multiple deep-learning-based video analytics applications on resource-limited edge GPUs, the GEMEL project introduces an innovative memory management strategy. It leverages the structural similarities among various edge vision models to share layers, including their weights, thereby reducing memory consumption and minimizing model-swapping delays. The process employs a step-by-step layer merging technique, prioritizing the layers that consume the most memory. Additionally, the method uses an adaptive retraining strategy based on the success or failure of each merging attempt. The GPU scheduling is also optimized to enhance the benefits of model merging by minimizing the frequency and duration of model swaps. Before deploying the merged models to the edge, GEMEL validates that they meet predetermined accuracy standards, and it continuously monitors for data drift.
The DeepQuery project focuses on running multiple DNN models on resource-constrained edge GPUs by co-locating real-time and delay-tolerant tasks. To alleviate the resource contention caused by co-location, it uses a predictive, plan-ahead approach: the batch size of delay-tolerant tasks is set dynamically so that they can finish before the real-time tasks are to be scheduled.
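The batch-sizing idea can be made concrete with a small sketch. It assumes a linear latency model (a fixed per-batch overhead plus a per-frame cost), which is our simplification and not necessarily the predictor used in DeepQuery; the function name and parameters are hypothetical.

```python
def max_batch_size(time_until_realtime_ms, per_batch_overhead_ms,
                   per_frame_ms, max_batch):
    """Largest delay-tolerant batch predicted to finish before the next
    real-time task, under an assumed linear latency model."""
    budget = time_until_realtime_ms - per_batch_overhead_ms
    if budget < per_frame_ms:
        return 0  # not even one frame fits; leave the GPU idle for now
    return min(max_batch, int(budget // per_frame_ms))
```

For instance, with 50 ms until the next real-time task, a 10 ms overhead, and 8 ms per frame, the sketch would admit a batch of five frames.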
The DeepRT project proposes a soft real-time GPU scheduler with three parts: an admission control mechanism that uses schedulability analysis; a batching step that groups image frames from multiple requests and schedules the batches in real time using earliest deadline first (EDF); and an adaptation module that penalizes jobs that overrun their deadlines so as to avoid unpredictable deadline misses by other jobs in the system.
The MicroEdge project provides multi-tenancy support for Coral TPUs by extending K3s [51], an edge-specific distribution of Kubernetes, with an admission control algorithm that allows workloads to be allocated a fraction of a TPU.
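A minimal sketch of the two scheduling ingredients named above, admission control and EDF ordering, is given below. The utilization test used here is the classical bound for preemptive EDF on a single processor with deadlines equal to periods; it is a stand-in we chose for illustration, not DeepRT's schedulability analysis, since GPU batches are typically non-preemptive. All names and numbers are hypothetical.

```python
import heapq


def admit(tasks, new_task):
    """Admit new_task only if total utilization stays at or below 1.
    Each task is (execution_time, period) in the same time unit."""
    utilization = sum(c / p for c, p in tasks + [new_task])
    return utilization <= 1.0


def edf_order(jobs):
    """jobs: list of (deadline, name). Return names earliest deadline first."""
    heap = list(jobs)
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order
```

A request that would push the assumed utilization above 1 is rejected up front, which is what keeps already admitted jobs from missing deadlines later.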

Technique 3: Continuous Learning at the Edge
To address data drift in dynamic video environments, DNN-based video analytics models deployed at the edge require regular retraining. Traditional retraining methods, however, are both resource-intensive and slow, making them unsuitable for edge devices with limited resources. The RECL [93] project offers a solution by selectively reusing pre-trained, specialized DNNs from a historical model repository. The key innovation lies in a rapid and reliable model selection process, achieved through a lightweight DNN-based model selector. This enables the quick identification and deployment of an appropriate model. Moreover, RECL incorporates an efficient scheduler that optimizes the retraining of multiple models. It does so by monitoring real-time accuracy gains during training and dynamically reallocating GPU resources to models that show greater improvement.
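The reallocation step can be illustrated with a short sketch. The proportional rule and the minimum share per job are assumptions made for illustration only; RECL's scheduler is not described at this level of detail here.

```python
def reallocate_gpu(gains, total_share=1.0, floor=0.05):
    """gains: {model: recent accuracy gain per training round}.
    Give every retraining job a minimum share, then split the remainder
    in proportion to the observed gain (hypothetical policy)."""
    n = len(gains)
    spare = total_share - floor * n
    if spare < 0:
        raise ValueError("floor too large for the number of jobs")
    positive = {m: max(g, 0.0) for m, g in gains.items()}
    total_gain = sum(positive.values())
    if total_gain == 0:
        # No job is improving: fall back to an equal split.
        return {m: total_share / n for m in gains}
    return {m: floor + spare * g / total_gain for m, g in positive.items()}
```

Calling the function after every monitoring interval shifts GPU share toward the models whose accuracy is still improving, while the floor keeps slower jobs from starving.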
The Ekya [91] project also addresses the topic of continuous model retraining via a microprofiling approach that identifies the models that need to be retrained, and a resource scheduler for supporting both training and inference on a resource-constrained edge device.

Research Gaps
To enhance the efficiency of real-time video analytics workloads on edge devices, it is important to expand the focus beyond Nvidia GPUs to include accelerators from other vendors, such as AMD and Intel. One challenge in effectively utilizing Nvidia GPUs is the proprietary nature of their drivers, which limits the research flexibility. According to Otterness and Anderson [97], the open-source architecture of AMD GPUs facilitates better support for real-time tasks.
Addressing continuous learning on edge devices poses significant challenges due to resource constraints and the computational demands of training DNN models. Federated learning [98], a technique that maintains user privacy by keeping training data on user devices and sharing only model updates for collaborative model training, has gained considerable traction in recent years. Exploring the applicability of federated learning in a peer-to-peer context could capitalize on the diverse capabilities of edge nodes and periods of low activity (for example, late-night hours in traffic monitoring systems) to efficiently perform distributed model training.

Scheduling
The edge represents a form of distributed computing that operates on heterogeneous resources. Scheduling algorithms are critical for optimizing the use of this complex environment, prompting researchers to explore more efficient algorithmic solutions. Some of the research works on edge video analytics, wherein scheduling is not the primary focus, have used simple approaches such as round-robin [82] or bin-packing heuristics such as worst-fit [74]. More sophisticated approaches formulate the problem as a constraint optimization, and propose heuristics to solve it.

Technique: Constraint Optimization Problem Formulation
The scheduling problem is expressed as a cost optimization problem subject to constraints. As an illustrative example, in the VideoEdge [65] project, the application configuration and placement is modeled as the following binary integer program (BIP): maximize the sum of the accuracies of all queries (objective function), subject to the computation capacity (constraint), the minimum accuracy (constraint), and the configuration and placement chosen for each task (constraint). Since solving the above optimization problem has an exponential time complexity, they propose a greedy heuristic.
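To make the structure of such a formulation tangible, the following sketch solves a toy instance by exhaustive enumeration. It is not the VideoEdge formulation: it assumes a single shared compute capacity, one list of (accuracy, demand) options per query standing in for each configuration and placement pair, and hypothetical numbers. The nested enumeration also shows why the exact approach scales exponentially with the number of queries.

```python
from itertools import product


def best_assignment(queries, capacity, min_accuracy):
    """queries: {name: [(accuracy, compute_demand), ...]}.
    Pick exactly one option per query to maximize total accuracy,
    subject to the capacity and minimum-accuracy constraints."""
    names = list(queries)
    best_choice, best_total = None, -1.0
    # One index per query: the search space is the product of option counts.
    for combo in product(*(range(len(queries[n])) for n in names)):
        picked = [queries[n][i] for n, i in zip(names, combo)]
        if any(acc < min_accuracy for acc, _ in picked):
            continue
        if sum(dem for _, dem in picked) > capacity:
            continue
        total = sum(acc for acc, _ in picked)
        if total > best_total:
            best_choice, best_total = dict(zip(names, combo)), total
    return best_choice, best_total


TOY = {"q1": [(0.6, 2), (0.9, 6)], "q2": [(0.5, 2), (0.7, 7)]}
CHOICE, TOTAL = best_assignment(TOY, capacity=10, min_accuracy=0.5)
```

With k options for each of n queries the loop visits k^n combinations, which motivates replacing it with a heuristic for realistic deployments.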

Research Gaps
While the existing projects offer valuable scheduling techniques, there is a need for a comprehensive comparison of these approaches. Additionally, it is important to evaluate the suitability of distributed scheduling methods proposed for the cloud [99] when adapted to the heterogeneous nodes of IE-SVA systems.

Control and Data Plane
The control plane serves as the mechanism to implement the scheduling decisions, while the data plane allows the transparent movement of data among the video analytics pipeline components. The distributed heterogeneous nature of the edge, and the potentially large data transfers involved in streaming video analytics, make the design of the control and data plane challenging. We highlight two recent works on control and data planes for IE-SVA systems.

Technique: Distributed Hierarchical Architecture
One approach to designing control planes for edge video analytics is to take industry-standard distributed systems and adapt them to the edge, as in the open source K3s [51] and KubeEdge [100] projects. In contrast, instead of retrofitting Kubernetes, which is optimized for cloud-based, throughput-focused applications and has a centralized control plane design, the OneEdge [84] project proposes a control plane that enables autonomous scheduling at individual edge sites without the need for central coordination. This is particularly useful for applications that largely operate independently. For applications requiring global coordination (for example, a drone fleet), OneEdge incorporates a centralized component that maintains an eventually consistent state of the system across all sites. This centralized component helps in effectively deploying multi-site applications.
To ensure reliable deployment decisions, OneEdge utilizes an enhanced two-phase commit protocol that synchronizes with the edge sites involved. It also offers specialized interfaces for developers, allowing them to implement latency-sensitive and location-aware scheduling. OneEdge continuously monitors end-to-end latency, ensuring it meets the specified service-level objectives (SLOs). To support location awareness, OneEdge triggers a migration process that relocates the client (such as a connected vehicle or drone) to a geographically suitable application instance.
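For readers unfamiliar with two-phase commit, the sketch below shows the textbook protocol applied to a multi-site deployment. It is the plain protocol, not OneEdge's enhanced variant, and the `EdgeSite` class, its slot-based capacity model, and the `deploy` helper are hypothetical.

```python
class EdgeSite:
    def __init__(self, name, free_slots):
        self.name = name
        self.free_slots = free_slots
        self.reserved = 0

    def prepare(self, slots):
        """Phase 1: tentatively reserve capacity and vote yes or no."""
        if self.free_slots - self.reserved >= slots:
            self.reserved += slots
            return True
        return False

    def commit(self, slots):
        """Phase 2 on success: turn the reservation into a deployment."""
        self.reserved -= slots
        self.free_slots -= slots

    def abort(self, slots):
        """Phase 2 on failure: release the reservation."""
        self.reserved -= slots


def deploy(sites, slots):
    """Deploy on every site or on none of them."""
    prepared = []
    for site in sites:
        if site.prepare(slots):
            prepared.append(site)
        else:
            for p in prepared:
                p.abort(slots)
            return False
    for site in prepared:
        site.commit(slots)
    return True
```

The point of the two phases is that a site that cannot host the application causes every earlier reservation to be rolled back, so a multi-site application is never left partially deployed.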

Technique: Flexible Stream Processing Framework
Many projects have used an embedded queuing library such as ZeroMQ for the data plane [73,77,82]. While embedded libraries perform well, they lack features such as data persistence, encryption, and replication, which are available in comprehensive data streaming frameworks like Apache Flink [101] and Apache Storm [102]. However, these industry-supported frameworks have a limitation since they use a "stop-the-world" strategy for application reconfiguration, requiring global coordination. This is problematic at the edge, where resource constraints and dynamic conditions often lead to frequent reconfigurations and associated latency spikes.
To address this, the Shepherd [55] project implements a late-binding routing strategy. In this setup, a computation operation does not need to know in advance the location of the next operation that will consume its data. This flexibility is achieved through a separable transport layer that can be modified independently of the data processing layer. As a result, Shepherd allows for quick and flexible reconfigurations without the need for global coordination and with minimal system downtime.
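The late-binding idea can be sketched in a few lines. This is an in-process illustration with hypothetical names (`Transport`, `make_detector`, the stream name "detections"), not Shepherd's API: operators publish to a logical stream name, and the transport resolves the consumer at send time, so a route can be rebound while the operator keeps running.

```python
class Transport:
    """Maps logical stream names to the current consumer; rebindable at run time."""

    def __init__(self):
        self._routes = {}

    def bind(self, stream, consumer):
        self._routes[stream] = consumer

    def send(self, stream, item):
        # The destination is looked up per message, not fixed at start-up.
        return self._routes[stream](item)


def make_detector(transport):
    def detect(frame):
        # The operator names only its output stream, never the next operator.
        return transport.send("detections", {"frame": frame, "objects": 1})
    return detect


transport = Transport()
log = []
detect = make_detector(transport)
transport.bind("detections", lambda d: log.append(("edge-1", d["frame"])))
detect(0)
# Reconfiguration: reroute the stream without touching or restarting detect.
transport.bind("detections", lambda d: log.append(("edge-2", d["frame"])))
detect(1)
```

Because the detector never holds a reference to its consumer, moving the downstream operator to another node only requires updating the route, which is the property that avoids a global stop-and-restart.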

Research Gaps
While the OneEdge project investigated a hierarchical control plane architecture for the edge, a peer-to-peer control plane architecture that avoids the need for a central cloud-based controller needs to be explored for IE-SVA systems. On the data plane side, incorporating persistence into the strategies proposed in the Shepherd project is needed to support stateful video analytics applications.

Multi-Camera Analytics
Collaborative analytics over the output of multiple cameras can improve the accuracy of analytics and prune the search space by exploiting temporal and spatial correlations between cameras. We highlight implementations of these two techniques.

Technique 1: Multi-Camera Analysis to Improve Accuracy
In camera networks with overlapping fields of view, the ability to capture scenes from multiple angles can mitigate issues related to model limitations and object occlusions. Within the Vigil [59] project, cameras grouped into a cluster performed edge-based data fusion to enhance the surveillance capabilities. Specifically, video frames within the cluster are prioritized based on a utility metric that quantifies the number of "re-identified" objects, thereby improving the detection and tracking accuracy.

Technique 2: Cross-Camera Analytics to Improve Efficiency
In large camera networks, implementing cross-camera analytics presents significant system challenges due to its computational and network resource demands. Unlike single-camera "stateless" tasks, such as object detection in a single feed, cross-camera analytics involves identifying correlations both within and between multiple video streams. The Spatula [80] project addresses these challenges through a three-stage approach: (1) Offline profiling phase: Spatula creates a spatio-temporal correlation model using unlabeled video data that encapsulates historically observed patterns of object locations and movements across the camera network. (2) Inference phase: During real-time analytics, Spatula consults the spatio-temporal model to eliminate camera feeds that are not likely to contain information relevant to the query identity's current location. This selective filtering effectively reduces the computational load and network bandwidth usage. (3) Recovery phase: To address any missed detections, Spatula performs a rapid review of recently filtered frames stored in its memory to identify any overlooked query instances.
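The inference-phase filtering can be sketched as a lookup into an offline-learned correlation table. The table layout (transition probability plus a travel-time window per camera pair), the threshold, and the function name are assumptions made for illustration and do not reproduce Spatula's model.

```python
def cameras_to_search(correlation, source, elapsed_s, threshold=0.05):
    """correlation: {(src, dst): (probability, min_travel_s, max_travel_s)},
    assumed to be learned offline. Return the cameras likely to show the
    query identity elapsed_s seconds after it left source."""
    selected = []
    for (src, dst), (prob, t_min, t_max) in correlation.items():
        if src == source and prob >= threshold and t_min <= elapsed_s <= t_max:
            selected.append(dst)
    return sorted(selected)


# Hypothetical model for a four-camera network.
MODEL = {
    ("c1", "c2"): (0.70, 5, 30),
    ("c1", "c3"): (0.02, 5, 30),   # rarely observed transition, pruned
    ("c1", "c4"): (0.28, 40, 90),
    ("c2", "c1"): (0.50, 5, 30),
}
```

Only the feeds returned by the lookup need to be decoded and analyzed, which is where the compute and bandwidth savings come from; the recovery phase exists because this pruning can occasionally miss the target.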
The Anveshak [82] project similarly exploits cross-camera correlations in pedestrian tracking. In contrast with the Spatula project, they use a physics-based approach, where the information on the road network and the speed of person movement are used to constrain the cameras that need to be activated.
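A physics-based constraint of this kind can be sketched as a bounded shortest-path search: a camera needs to be active only if its location is reachable over the road network within the distance a person could have covered since the last sighting. The graph representation, the function name, and the numbers below are hypothetical and are not taken from Anveshak.

```python
import heapq


def reachable_cameras(road_graph, cameras, start, max_speed_mps, elapsed_s):
    """road_graph: {node: [(neighbor, length_m), ...]}; cameras: {camera_id: node}.
    Return cameras whose node lies within max_speed_mps * elapsed_s of start."""
    budget = max_speed_mps * elapsed_s
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, length in road_graph.get(node, []):
            nd = d + length
            if nd <= budget and nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return sorted(cam for cam, node in cameras.items() if node in dist)


ROADS = {
    "A": [("B", 50.0)],
    "B": [("A", 50.0), ("C", 100.0)],
    "C": [("B", 100.0)],
}
CAMERAS = {"cam1": "A", "cam2": "B", "cam3": "C"}
```

The set of active cameras grows with the time since the last sighting, so a prompt re-detection keeps the search confined to a small neighborhood.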

Research Gaps
As camera deployments grow, we believe the two techniques above, redundancy with multiple cameras and efficiency based on predictive modeling, could be combined in novel ways to further increase efficiency and accuracy. Additionally, the incorporation of pan, tilt, and zoom (PTZ) cameras into IE-SVA systems has not been explored. PTZ cameras allow a particular camera to be reoriented and zoomed (for example, to zoom or change the field of view to track a particular object), potentially based on the output generated by neighboring cameras.

Video Analytics Pipeline Components
Most of the works surveyed have used DNN-based video analytics components, primarily for object detection, object tracking, and re-identification. The DNN-based components are computationally intensive, and require GPUs for model training. While CPU-based inference is possible, the use of embedded GPUs allows applications to sustain the tens-of-frames-per-second throughput needed for real-time tracking.

Research Gaps
Recently, more complex video processing pipelines have been proposed for high-level cognitive tasks such as human activity recognition [103], and reasoning over the vision module output using large language models (LLMs) [104]. Future research should investigate edge computing approaches involving these complex models. Given the computational complexity of these models, distributed approaches may be required at the edge, even for inference purposes.

Fault Tolerance
While fault tolerance has received considerable attention in cloud computing research [105], we observed that issues specific to IE-SVA systems are addressed in only a single recent paper. The VU [78] project investigates surveillance camera failure modes from the study of a large-scale camera network. They identified 12 such failure modes and proposed an online failure detection approach.

Research Gaps
To attain the widespread adoption of edge video analytics, fault-tolerant operation is necessary to build reliable applications. Failures at the edge could include hardware, software, network, and power failures. The traditional cloud data center approaches of fault tolerance through redundancy are only partially applicable at the edge due to resource constraints. Unfortunately, we are unable to envision any straightforward approach to tackling this problem for all IE-SVA application contexts. We believe that successful solutions will be application- and customer-dependent. For example, in non-critical applications, if an edge node fails, camera streams could bypass the failed node, and directly transmit the video stream at a reduced rate to a backend cloud until the failure is fixed. This approach may allow some analytics operations to continue, albeit at a reduced accuracy.

Privacy
While much work has been performed regarding the privacy aspects of video analytics (see Section 7), some works on edge computing rely on processing and then discarding videos as the sole privacy mechanism in IE-SVA systems [81]. However, in practice, post-incident analyses of videos may be needed for forensics and other investigative purposes. The OpenFace project [63] proposes a mechanism for denaturing video streams that selectively blurs faces according to specified policies at full frame rates. However, privacy is limited to detecting and encrypting the regions of the image containing faces. The recently published PECAM [86] project proposes a video transformation technique that achieves broader visual privacy at the edge.

Technique: Reversible Video Transformations That Preserve Privacy While Allowing Analytics
The PECAM [86] project proposes a security-reinforced cycle-consistent generative adversarial network (GAN) to generate camera-specific video transformer and reconstructor pairs. Privacy-preserving video transformation is performed by the transformer so that enough information is present for analytics tasks (for example, vehicle counting) while preserving privacy (for example, license plate information). The reconstructor allows authorized parties to restore the requested frames to the original version. PECAM also proposes techniques to reduce the computational and bandwidth costs of the proposed privacy-preserving mechanisms. They evaluate the efficacy of the proposed approach under different attack scenarios.

Research Gaps
Homomorphic encryption [106] allows performing computations and analytics directly on encrypted data without requiring the data to be decrypted. The use of homomorphic encryption in video analytics is in the early stages [107]. We note that the computational costs of these methods will need to be addressed to make them amenable to edge video analytics.

Sustainability
Sustainability is a complex topic that includes the use of sustainable materials, reducing e-waste, reducing the carbon footprint by energy-efficient computing, and reducing social inequalities. In the existing literature concerning IE-SVA systems, energy efficiency is only addressed in a limited number of studies, while other topics remain unexplored.

Technique: Activate Video Analytics Only When Necessary
In the RL-CamSleep [95] project, for a smart parking application, a deep reinforcement learning-based controller automatically adapts the camera operation to parking patterns, saving energy while preserving the operational utility. For example, parking assistance is less needed in empty parking lots. Furthermore, they study the operation of the controller both in the cloud and at an edge powered by solar energy.

Research Gaps
To enhance the sustainability of IE-SVA systems, the focus on this aspect must be intensified, especially in large-scale deployments. Running applications with lower accuracy may facilitate the workload consolidation and enable the energy-efficient idling of edge nodes and IoT cameras. Additionally, system costs should be prioritized as a key design parameter to make IE-SVA systems accessible to economically disadvantaged communities.
Other works with a focus on energy-efficient IE-SVA systems include the REVAMP2T [81] and Marlin [76] projects. Both of these projects explore techniques involving the energy-efficient operation of DNN models.

Path Ahead
In this section, we present a comprehensive research and development roadmap for streaming analytics edge systems. The Gantt chart of Figure 3 shows our proposed research roadmap in the short term (3 years), medium term (5 years), and long term (10 years).

Short-Term Research
Section 6 identifies multiple research gaps in the existing state-of-the-art.Many of these could be implemented over the next three years.Additionally, we would like to highlight a few other research directions that could be pursued over the short term.
Joint compute and bandwidth optimization: In existing research, the tuning of DNN models has been studied in isolation, focusing on adjustments based on operating conditions like resource limitations and bandwidth variability.We propose that these isolated approaches should be integrated.Special consideration should be given to scenarios involving high-activity video content, where trade-offs between computation and communication are required based on the current computational load and network state.
Serverless: From a system standpoint, the application of the serverless paradigm, and more precisely, the function-as-a-service (FaaS) framework, at the edge merits further investigation to alleviate deployment burdens.Serverless computing hides the servers by providing programming abstractions for application builders that simplify development, making software easier to write [44].Serverless computing is an active research topic in the cloud with many public cloud offerings.Open source serverless frameworks like KNative [108] and OpenFaaS [109] are available for use at the edge, but their compatibility with different types of edge hardware has not yet been evaluated.Furthermore, in the context of IE-SVA systems, given the resource constraints at the edge, it is not apparent what serverless abstractions are appropriate.
Testbeds: Current testbeds employ virtual machines (VMs) to emulate edge nodes. Emulation allows the direct execution of video analytics, albeit at a slower speed. An advantageous initiative would be the creation of a library of VMs that mimic various types of edge hardware, which researchers could readily use. Furthermore, to scrutinize communication-related bandwidth and latency issues, it would be beneficial to incorporate network simulators like NS3 [110] into the emulation platform.

Medium-Term Research
The medium-term work described in this section would require launching new research projects to tackle problems for edge video analytics systems. In our view, these studies could leverage the related body of work performed in other areas of computing, but would require the non-trivial adoption of these techniques to satisfy the constraints and peculiarities of video analytics at the edge.
Security: Foremost among these is security, a topic that has not been explicitly addressed by any of the video analytic edge systems reviewed in Section 5. In the context of IE-SVA systems, security involves a combination of traditional IoT security issues (for example, DDoS attacks), adversarial attacks on DNN models, and the compromise of privacy expectations (see below). The criticality of security for edge video analytics is highlighted by a March 2021 incident where a hacker group was able to publish live video feeds from 150,000 surveillance cameras [111]. An additional threat is that the large-scale edge computing infrastructure deployed for edge video analytics could be compromised and recruited to run a botnet similar to the 2016 Mirai botnet of compromised IoT devices [112]. The security triad of confidentiality, integrity, and availability has been extensively explored for cloud computing [113]. In their review on edge computing security, Xiao et al. [114] identified weak computation power, OS and protocol heterogeneity, attack unawareness, and coarse-grained access control as key differences between the cloud and edge from a security perspective. They list six major classes of attacks applicable to edge computing: DDoS attacks, side-channel attacks, malware injection attacks, authentication and authorization attacks, man-in-the-middle attacks, and bad data injection attacks. They review solutions proposed in the literature on the first four of these attack classes as they are particularly relevant to the edge. In the case of edge video analytics, as shown by Li et al. [115], side-channel attacks can be exploited to leak sensitive video information despite encryption. Defense strategies such as implementing fine-grained access control [114], the use of deep learning and machine learning algorithms to detect attacks [116], and the use of hardware mechanisms for the isolation of software components [117] are possible. However, their applicability to resource-constrained edge nodes, and their potential impact on VAP performance, need to be systematically investigated across multiple platforms.
Adversarial attacks could be directed at the deep learning VAP components such as classification and object detection [118,119]. The goal of the adversarial attack is to insert small perturbations in the image to compromise the predictions of the deep-learning-based VAP components. Akhtar et al. [118] provide a comprehensive review of adversarial attacks and defenses in computer vision. Proposed defenses against these attacks require model robustification [120], input modification for removing perturbations [121], and adding external detectors to the model [122]. The performance impact of these defense strategies, as implemented on resource-constrained edge devices, needs to be comprehensively explored and evaluated.
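As an illustration of how a small perturbation can change a prediction, the sketch below applies one step of the fast gradient sign method, a well-known attack that is used here as an assumed example and is not attributed to the works cited above. The victim is a toy logistic classifier over three features rather than a DNN, and the weights, input, and `epsilon` are hypothetical values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fgsm(weights, bias, x, label, epsilon):
    """One fast-gradient-sign step: shift every feature by at most epsilon
    in the direction that increases the cross-entropy loss."""
    p = predict(weights, bias, x)
    # For logistic regression, d(loss)/d(x_i) = (p - label) * w_i.
    grad = [(p - label) * w for w in weights]
    signs = [(g > 0) - (g < 0) for g in grad]
    return [xi + epsilon * s for xi, s in zip(x, signs)]


# Hypothetical model and input, for illustration only.
WEIGHTS, BIAS = [2.0, -1.0, 0.5], 0.0
x_clean = [0.3, 0.2, 0.2]
x_adv = fgsm(WEIGHTS, BIAS, x_clean, label=1, epsilon=0.2)

clean_class = predict(WEIGHTS, BIAS, x_clean) > 0.5
adv_class = predict(WEIGHTS, BIAS, x_adv) > 0.5
```

With these values the bounded perturbation is enough to move the input across the decision boundary, which is the behavior the defenses listed above aim to prevent.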
Privacy: Since videos are a rich source of information, preserving privacy is of the utmost importance to prevent the leakage of unintended information. For example, an edge video analytics system for pedestrian safety might capture information regarding the identities of individuals. The REVAMP²T project [81] uses skeletal pose information to track a pedestrian identity without storing any videos. The PECAM project [86] proposes a novel generative adversarial network to perform a privacy-enhanced, securely reversible video transformation at the edge nodes. As with security-related countermeasures, the cost of implementing privacy-related computations on different types of resource-constrained edge devices needs to be evaluated. Furthermore, since a video stream might be used for multiple applications, the ability of the proposed techniques to serve multiple applications needs to be considered. A related technique would be the use of federated learning approaches [98] where the training data are used to train a local model that is then transmitted to a central coordinator, thus avoiding the need to send training data outside a specified privacy domain. The interplay between federated learning approaches and continuous learning [91,93] required at the edge to mitigate model drift needs an in-depth investigation.
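The federated learning idea described above can be sketched as follows, assuming a simple weighted-averaging coordinator and a linear model trained by gradient descent; both are illustrative choices rather than the method of [98]. The function names and the data are hypothetical. Each client returns only its locally trained weights and its sample count, so the raw training samples never leave the client.

```python
def local_update(weights, data, lr=0.1, epochs=5):
    """Train on one client's private data; return (new_weights, n_samples)."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / len(data)  # mean squared error gradient
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w, len(data)


def federated_average(updates):
    """Coordinator step: average client weights, weighted by sample count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]


def train(clients, dim, rounds=20):
    """Alternate local training and averaging for a fixed number of rounds."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        updates = [local_update(global_w, data) for data in clients]
        global_w = federated_average(updates)
    return global_w


# Two hypothetical edge nodes whose private samples follow y = 2*x0 - x1.
CLIENTS = [
    [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)],
    [((2.0, 1.0), 3.0), ((1.0, 2.0), 0.0)],
]
model = train(CLIENTS, dim=2)
```

In a continuous learning setting, `train` would be re-run as new data arrive at the edge nodes, which is where the interplay with model drift mentioned above would need to be studied.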
Usability: The success of edge video analytics critically depends on how easily different types of personnel can interact with the system. Developers should be able to readily explore different VAP designs [123] and be provided with suitable system abstractions for VAP deployments. Operators should be able to readily determine the operational status of the complex distributed edge hierarchy, possibly involving hundreds of edge nodes and thousands of cameras spread over a large geographic area. They should be quickly notified of system failures, and be able to perform root cause analysis to identify and rectify these. Users such as city personnel and law enforcement should be able to query the system in an intuitive fashion, preferably through the use of natural language.
In the cloud, the DevOps workflow uses an API-driven model that enables developers and operators to interact with infrastructure programmatically and at scale [124]. Artificial intelligence for IT operations (AIOps) [125,126], a recently introduced approach in DevOps, leverages data analytics and machine learning to improve the quality of computing platforms in a cost-effective manner using practices such as continuous integration and continuous deployment (CI/CD) [127]. Similar capabilities need to be developed to successfully implement and manage large-scale edge video analytic systems. An important consideration in applying successful cloud DevOps and AIOps practices at the edge is the challenge of moving large amounts of data (operational data and deployment images) over bandwidth-limited networks.

Long-Term Research
While predicting technology trajectories over the 5-10 year horizon is challenging given the ongoing rapid advances in all areas of computing, especially in AI, we believe that overcoming certain problems would require research projects with a long time frame, given the complexity of the problems and the many stakeholders that need to be involved.
Real-world datasets: As the deployment of edge video analytics expands, we should seek to collect and open source real-world transactional and operational data with suitable privacy guards. Transactional data refer to the queries issued on the edge system by users and operators, while operational data refer to resource metrics, application logs, query traces, and failure statuses. This would allow researchers to gain an understanding of real-world systems, and direct their efforts towards impactful solutions.
Interoperability: As edge vision systems proliferate, there is a danger of these systems lacking interoperability due to custom protocols, data formats, and lack of standardization. Furthermore, updating legacy systems may become problematic, resulting in communities where these systems are deployed getting stuck with outdated technology. Standardization and modular design are two approaches that can be used to tackle this issue; the technical and standards community would need to take strong leadership roles in this regard within the next few years before these systems see widespread deployment.
5G and 6G wireless: High-speed communication technologies, like 5G and optical fiber networks, enable the widespread deployment of edge video analytics. However, it is crucial to recognize that the availability of these technologies is not uniform across the global population. Many areas still lack access to high-speed networks due to financial constraints and limited spectrum availability [128]. The technical community is progressing towards 6G standards, with an anticipated initial rollout around 2030 [129] and projected capabilities of 1000 Gbps bandwidth and a latency of less than 100 microseconds. It therefore becomes even more critical for the edge video analytics systems community to proactively explore and understand these emerging possibilities and challenges in order to make the most of 6G advancements.

Impact of Advancements in Other Areas in Computing
In this section, we provide a brief review of important developments in other areas of computing and related societal concerns that, in our opinion, are highly relevant to edge video analytics. Since these are fast-moving areas of research, the exact nature of their impact on edge video analytics systems is not clear at this point.
Large language models: In recent years, significant progress has been observed in the realm of large language models (LLMs). These advancements have paved the way for more sophisticated and accurate AI applications with emergent abilities to tackle a wide range of tasks [130]. In recent months, these LLMs have been made available to the general public through services such as OpenAI's ChatGPT [131], Microsoft Bing AI [132], and Google's Bard [131]. More recently, Meta has made its Llama 2 LLM available for free download [133], potentially allowing the broader technical community to specialize these models for specific tasks based on training with proprietary data. We believe that IE-SVA systems could incorporate these LLMs as a part of their analytics pipelines, possibly to reason about relationships between detected events.
WebAssembly: WebAssembly (Wasm)-based sandboxing has experienced a rapid rise as a notable technology. Wasm is a binary instruction format designed for a stack-based virtual machine, functioning as a portable compilation target for various programming languages, making it suitable for deployment on the web in client and server applications [134]. The platform-neutral nature of Wasm allows a single binary to be compiled and executed on diverse architectures and operating systems, eliminating the need for dealing with platform-specific information at the container level [135]. Consequently, this enables a lightweight, portable, and highly secure alternative to the container-based implementations of microservices, offering significant advantages, especially at the edge. It should be noted that the development of the Wasm system interface (WASI) is still ongoing [136].
AI regulation: The regulatory status of AI models is still evolving, but there is a growing awareness of the need for some form of regulation. Among the recent developments are the European Union Artificial Intelligence Act [137] and the United States White House statement on responsible AI research, development, and deployment [138]. In our opinion, the edge video analytics research community should keep abreast of these and other emerging regulations, so that the systems they design are compliant with them.

Conclusions
The widespread adoption of IoT edge streaming video analytics (IE-SVA) systems is propelled by the rapid advancements in deep-learning-based computer vision algorithms. These algorithms have revolutionized the automatic analysis of streaming video feeds, enabling the detection of events of interest. To facilitate this development, edge computing has emerged as a crucial component, offering advantages such as low latency, reduced bandwidth, and enhanced privacy. However, despite its potential, a significant gap remains in the successful practical implementation of edge-based streaming video analytics systems.
This paper presents an in-depth review of more than 30 studies on edge video analytics systems published over the last 6 years, assessed across 17 dimensions. Diverging from prior reviews, our approach examines each system holistically, enabling a comprehensive assessment of the strengths and weaknesses in various implementations. Our analysis reveals that certain crucial aspects essential for the practical realization of edge video analytics systems, such as security, privacy, user support, and energy-efficient operation, have not received sufficient attention in current research.
Based on these findings, we propose research trajectories spanning short-, medium-, and long-term horizons to address the identified challenges. Moreover, we explore trending topics in other computing domains that hold considerable potential to significantly impact the field of edge video analytics. This article aims to help new researchers rapidly understand the current state of the art and inspire research initiatives that contribute to the widespread deployment of IE-SVA systems.

Figure 2. IE-SVA systems hierarchy. Aside from the cameras, not all levels need be present in an implementation. The computation nodes may be organized as clusters for availability and fault tolerance. Data flow vertically up the hierarchy starting from the cameras, while the control flows down the hierarchy. In some implementations, data may flow horizontally as well.

Figure 3. Gantt chart with a proposed research roadmap over short-, medium-, and long-term time horizons.