1. Introduction
In recent years, a massive deployment of surveillance cameras has been observed in urban, industrial, and domestic environments, driving the demand for intelligent systems capable of automatically analyzing video streams from multiple cameras. Unlike single-camera scenarios, multi-camera systems expand spatial coverage, reduce blind spots, and coordinate multiple views, enabling continuous tracking of people or objects across different locations with minimal human intervention [1,2]. In public security, these networks support large-area surveillance and the recognition of critical events, including anomalous behaviors [3,4]. In intelligent transportation, distributed cameras enable vehicle tracking, speed enforcement, and traffic analysis in real-world scenarios [5,6]. In healthcare, multi-camera systems have been proposed for fall detection and activity recognition [7,8,9]. Furthermore, recent applications include industry and mining (e.g., mine or process monitoring), smart cities and IoT, as well as precision agriculture for animal behavior monitoring [10,11,12].
Figure 1 illustrates the conceptual difference between single-camera and multi-camera detection scenarios. In a single-camera setup, object detection and tracking rely exclusively on one viewpoint, which may be affected by occlusions, a limited field of view, or unfavorable perspectives. In contrast, a multi-camera configuration provides complementary viewpoints of the same scene, improving detection robustness, spatial consistency, and identity preservation across perspectives. The availability of multiple synchronized views reduces ambiguity and enhances the reliability of object localization and tracking, particularly in dynamic and crowded environments.
Despite their advantages, multi-camera systems present significant technical challenges. A central issue is robust multi-target multi-camera tracking while preserving identity, particularly in non-overlapping camera networks with illumination variations, viewpoint changes, and occlusions, which increase the complexity of inter-camera association compared to single-camera scenarios [13,14,15]. Additionally, the massive volume of data generated by camera networks demands efficient computational architectures for real-time processing and resource allocation [16,17,18]. In this context, edge computing has become essential to reduce latency, bandwidth consumption, and exposure of sensitive data, enabling deployments in which analysis occurs close to the capture point [19,20]. Moreover, integration issues with IoT and device–object pairing in multi-camera environments introduce additional challenges related to interoperability and consistency [21].
Recent advances in computer vision and deep learning have driven more sophisticated solutions for multi-camera systems. In particular, modern approaches integrate detection, tracking, and re-identification (Re-ID), including training strategies that improve cross-camera generalization and lightweight variants for edge deployment [22,23,24]. In biometrics, multi-camera approaches have been explored for pose-robust face recognition and gait analysis, broadening the identification spectrum in surveillance contexts [25,26,27]. At the system level, the convergence of IoT and AI has fostered distributed and collaborative architectures, including coordination and trust mechanisms for multi-camera tracking in edge computing environments [28,29,30].
In this context, the rapid evolution of computer vision and deep learning techniques has further strengthened the capabilities of multi-camera systems, enabling more robust and scalable solutions for complex real-world scenarios. In particular, the evolution of object detection techniques, as comprehensively analyzed in recent surveys, has provided a strong foundation for accurate recognition and localization tasks in dynamic environments [31]. Furthermore, emerging paradigms based on transformer architectures have introduced new opportunities for modeling long-range dependencies and multi-view representations, improving cross-camera understanding and scene interpretation [32]. In parallel, the integration of edge computing has become a key enabler for real-time multi-camera analytics, reducing latency and bandwidth consumption while allowing distributed processing closer to data sources [33]. These advances are particularly relevant in large-scale surveillance and smart city applications, where distributed intelligent systems must efficiently coordinate multiple data streams [34]. Additionally, recent research in multi-target multi-camera tracking and re-identification has demonstrated the importance of robust feature representation and cross-view consistency to ensure reliable tracking across non-overlapping camera networks [35].
Although there are reviews focused on subproblems, such as multi-target multi-camera tracking [13,36] or anomaly detection in video surveillance [3], a comprehensive perspective consolidating applications and emerging trends in multi-camera systems during the 2020–2025 period is still needed. Therefore, this article presents a comprehensive review of the recent literature, organized by application domains (security and biometrics, intelligent transportation, smart cities/IoT, healthcare, agriculture/environment, industry/robotics, mobile/PTZ cameras, and other emerging areas), with the aim of identifying predominant techniques, significant advances, and knowledge gaps to guide future research [37].
In line with this perspective, this review is not limited to Industry 4.0 and cyber–physical production environments. Instead, multi-camera computer vision systems are examined as a transversal technology that enables a wide range of intelligent infrastructures, including smart cities, transportation systems, healthcare monitoring, environmental analysis, and retail analytics. Consequently, the focus is placed on the technological evolution of these systems across multiple domains rather than on purely industrial applications.
2. Methodology
This systematic review was conducted following the PRISMA 2020 framework guidelines to ensure transparency, traceability, and reproducibility in the identification, selection, and synthesis of the scientific literature [38]. The methodological process was designed to structure the analysis in a manner comparable to recent reviews on multi-camera systems and computer vision [13,36,37].
2.1. Search Strategy
The identification, screening, and synthesis of relevant studies followed the PRISMA 2020 guidelines (see Supplementary Materials). The review was not prospectively registered. Literature searches were conducted in major scientific databases relevant to engineering and computer vision research, including IEEE Xplore, Scopus, and Web of Science. The search terms consisted of combinations of keywords related to multi-camera systems and their primary application domains. Specifically, the search strategy was structured using Boolean operators as follows: (“multi camera” OR “multi-camera” OR “multiple cameras” OR “multi camera tracking” OR “multi view” OR “multi-view” OR “camera network”) AND (“surveillance” OR “smart city” OR “intelligent transportation”).
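The Boolean structure of this search string can be made explicit with a short script. The following is a minimal sketch rather than the tooling used for the review: the helper names `or_group` and `build_query` are hypothetical, straight quotes replace the typographic ones, and database-specific field tags (which differ among IEEE Xplore, Scopus, and Web of Science) are deliberately omitted.

```python
# Hypothetical helpers that assemble the Boolean search string described above.
SYSTEM_TERMS = [
    "multi camera", "multi-camera", "multiple cameras",
    "multi camera tracking", "multi view", "multi-view", "camera network",
]
DOMAIN_TERMS = ["surveillance", "smart city", "intelligent transportation"]


def or_group(terms):
    """Quote each term and join the group with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(system_terms, domain_terms):
    """Combine the two OR-groups with a single AND."""
    return f"{or_group(system_terms)} AND {or_group(domain_terms)}"


query = build_query(SYSTEM_TERMS, DOMAIN_TERMS)
print(query)
```

Keeping the two term lists separate makes it straightforward to report exactly which keywords were combined, which supports the reproducibility aim stated above.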
The search strategy was designed to identify relevant studies published between January 2020 and December 2025, and all databases were last searched in January 2026. Database-specific filters were applied to restrict the results to journal articles and conference papers within the subject areas of engineering, computer science, and automation. In addition, a backward snowballing procedure was conducted by examining the reference lists of selected studies to identify potentially relevant publications that may not have been retrieved during the initial database search, which is a common practice in technical reviews in this field [3].
2.2. Selection Process
All retrieved references were managed using a reference management tool to remove duplicates. Subsequently, a two-phase screening process was applied: (i) title and abstract evaluation to exclude studies outside the scope, and (ii) full-text review to verify compliance with the inclusion criteria.
After systematic filtering and evaluation, a total of 93 studies that met all established criteria were included (Figure 2). These studies were organized into a structured matrix documenting the application domain, main task, multi-camera configuration, key technologies, and central contribution, enabling a coherent comparative analysis across domains.
Some studies addressed multiple application domains (e.g., intelligent transportation and smart cities). In such cases, classification was performed according to the primary application context emphasized by the authors. When the contribution was clearly multidisciplinary, the study was assigned to the domain where the multi-camera system played the most central functional role in the proposed architecture.
2.3. Inclusion Criteria
Studies were considered if they:
Were peer-reviewed publications published between 2020 and 2025.
Explicitly addressed multi-camera systems or camera networks (minimum of two cameras).
Proposed technical advances in detection, tracking, re-identification, data fusion, or edge/cloud architectures.
Applied multi-camera configurations in domains such as surveillance, transportation, healthcare, agriculture, industry, or smart cities.
2.4. Exclusion Criteria
Studies were excluded if they:
2.5. Research Questions
The review was structured around the following questions:
What are the main application domains of multi-camera systems during 2020–2025?
What techniques and technologies are used to address the specific challenges of these systems?
What advances and limitations have been recently reported?
What research gaps remain, and what future directions are proposed?
These questions guided the systematic extraction of information and the organization of results by application domain, enabling a cross-domain analysis of the state of the art.
3. Results
3.1. Overview of Included Studies
After applying the methodological procedure described above, a total of 93 studies published between 2020 and 2025 were ultimately selected, covering a broad spectrum of domains and applications for multi-camera systems. The findings are organized into thematic subsections according to the predominant application domain, considering that some studies may belong to more than one category.
The most represented domains include public security and biometric surveillance, intelligent transportation and traffic, smart cities and IoT, healthcare and monitoring of vulnerable individuals, precision agriculture and environmental monitoring, industry and robotics, as well as emerging applications involving mobile cameras, drones, and pan-tilt-zoom (PTZ) systems (see Table 1).
It is important to note that the domain classification presented in Table 1 is not mutually exclusive. Several studies address multiple application contexts simultaneously due to the transversal nature of multi-camera vision technologies. Based on our analysis, approximately one-third of the reviewed studies contribute to more than one domain.
The most common overlaps occur between Public Security and Intelligent Transportation, as well as between Smart Cities and IoT-based urban monitoring applications. These intersections largely reflect the general applicability of core technologies such as multi-target multi-camera tracking (MTMCT), cross-camera re-identification, and distributed video analytics across different operational environments.
3.2. Public Security and Biometric Surveillance
Multi-camera surveillance oriented toward public security constitutes one of the most consolidated and technically mature domains in the recent literature. These systems are designed not only to prevent crime and detect suspicious behavior but also to maintain continuous and coherent tracking of individuals across heterogeneous urban environments such as city streets, transportation hubs, airports, campuses, and densely crowded public spaces. A central research task in this context is Multi-Target Multi-Camera Tracking (MTMCT), typically complemented by person re-identification (Re-ID) techniques to preserve an individual’s identity while moving across cameras with non-overlapping or partially overlapping fields of view [13,14]. Together, MTMCT and Re-ID enable long-term trajectory reconstruction, cross-camera identity consistency, and forensic-level traceability.
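The cross-camera association step at the core of MTMCT and Re-ID can be illustrated with a minimal sketch. It assumes that each local track is already summarized by an appearance embedding, and it uses greedy matching on cosine similarity with a fixed threshold; the function names, the threshold value, and the toy embeddings are illustrative assumptions and do not reproduce any specific method from the reviewed studies, which often rely on more elaborate global assignment.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def associate(tracks_a, tracks_b, threshold=0.8):
    """Greedy one-to-one matching of track IDs between two cameras.

    tracks_a / tracks_b map a local track ID to an appearance embedding.
    Pairs are accepted in order of decreasing similarity; a track whose
    best remaining pair falls below `threshold` stays unmatched.
    """
    scored = sorted(
        ((cosine_similarity(ea, eb), ida, idb)
         for ida, ea in tracks_a.items()
         for idb, eb in tracks_b.items()),
        reverse=True,
    )
    matches, used_a, used_b = {}, set(), set()
    for score, ida, idb in scored:
        if score < threshold:
            break  # all remaining pairs are even less similar
        if ida not in used_a and idb not in used_b:
            matches[ida] = idb
            used_a.add(ida)
            used_b.add(idb)
    return matches


# Toy embeddings (assumed values): two people seen by camera A, three by camera B.
cam_a = {"a1": [1.0, 0.0, 0.2], "a2": [0.0, 1.0, 0.1]}
cam_b = {"b7": [0.1, 0.9, 0.1], "b9": [0.9, 0.1, 0.2], "b3": [0.0, 0.0, 1.0]}
matches = associate(cam_a, cam_b)
print(matches)
```

In this toy case, track `b3` remains unmatched and would be treated as a new identity, which is how a person who appears only in the second camera is handled.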
Early approaches focused on inter-camera association through collaborative probabilistic models and robust visual descriptors designed to encode color, texture, and geometric information [2,39]. While these methods introduced important mechanisms for cross-view matching, they were often sensitive to illumination changes, pose variation, and occlusions. Subsequently, the integration of deep learning significantly improved identity discrimination under these same conditions [22,23]. Deep feature embeddings enabled end-to-end representation learning, strengthening robustness across viewpoints.
More recent models incorporate fine-grained spatiotemporal constraints, graph-based optimization, and distributed processing strategies to enhance global tracking consistency across extended camera networks [24,40,41]. These approaches move beyond appearance-based similarity and integrate motion dynamics, temporal coherence, and contextual reasoning, reflecting a shift toward holistic multi-view scene modeling.
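One simple form of spatiotemporal constraint is a travel-time gate: a candidate match between two cameras is discarded when the implied speed is physically implausible. The sketch below is illustrative only; the function names and the pedestrian speed bounds are assumptions, not values reported in the cited works.

```python
def plausible_transition(exit_time, entry_time, distance_m,
                         min_speed=0.5, max_speed=3.0):
    """Return True if a target could have travelled between two cameras.

    exit_time / entry_time are timestamps in seconds and distance_m is the
    path length between the two fields of view. The speed bounds (m/s) are
    assumed pedestrian values chosen for illustration.
    """
    elapsed = entry_time - exit_time
    if elapsed <= 0:
        return False  # cannot enter the second view before leaving the first
    speed = distance_m / elapsed
    return min_speed <= speed <= max_speed


def gated_similarity(appearance_sim, exit_time, entry_time, distance_m):
    """Zero out appearance similarity for physically implausible pairs."""
    if plausible_transition(exit_time, entry_time, distance_m):
        return appearance_sim
    return 0.0


# 60 m covered in 40 s (1.5 m/s) is plausible; 60 m in 2 s (30 m/s) is not.
kept = gated_similarity(0.9, exit_time=100.0, entry_time=140.0, distance_m=60.0)
rejected = gated_similarity(0.9, exit_time=100.0, entry_time=102.0, distance_m=60.0)
print(kept, rejected)
```

Applying such a gate before appearance matching reduces the number of candidate pairs and prevents visually similar but physically impossible associations.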
In real-world deployment contexts, architectural efficiency has become a decisive factor. Edge-computing-based architectures have been proposed to execute Re-ID and tracking modules directly on local gateways or AIoT devices, thereby reducing latency, bandwidth consumption, and exposure of sensitive data [16,19]. Likewise, distributed and collaborative approaches, including blockchain-based mechanisms and active perception strategies in mobile camera networks, have been introduced to improve scalability, resilience, and trust in complex urban environments [29,30].
At the architectural level, distributed tracking approaches integrating Re-ID modules in multi-camera scenarios have also been reported, along with practical implementations that combine modern detectors (e.g., YOLO variants) with tracking algorithms to build complete end-to-end pipelines [42,43,44]. These unified frameworks demonstrate operational feasibility and near-real-time performance in surveillance infrastructures.
Beyond traditional tracking, additional biometric techniques have been explored in multi-camera networks to reinforce identification reliability. Multi-camera face recognition has been employed to improve identification rates in scenarios characterized by partial views or uncontrolled environmental conditions [25,26]. In parallel, gait recognition has emerged as a non-intrusive biometric alternative capable of identifying individuals at a distance even when facial information is unavailable [27]. Studies highlight that viewing angle, clothing variability, and illumination significantly affect accuracy, motivating the deployment of multiple cameras to obtain more invariant and complementary representations.
Regarding anomaly and threat detection, the availability of multiple synchronized perspectives has been shown to improve robustness against occlusions and reduce false alarm rates. Recent reviews demonstrate the rapid growth of anomaly detection techniques in video surveillance, combining classical modeling approaches with deep spatiotemporal networks [3]. Multi-camera-specific proposals include weakly supervised frameworks such as MC-MIL [45] and deep spatiotemporal models for detecting anomalous behaviors in dense urban environments [46]. Additionally, the integration of violent activity recognition using optimized YOLO-based models in multi-camera configurations has been explored in educational environments [47].
Other works have addressed precise pedestrian detection and localization through multi-camera extrinsic calibration and three-dimensional reconstruction, strengthening spatial coherence across views and enabling metric-level consistency [48,49]. Hybrid approaches combining HOG descriptors and CNN architectures have also been developed to improve detection rates in non-overlapping camera networks [15]. Overall, the 2020–2025 literature demonstrates significant progress in multi-camera surveillance, primarily driven by deep learning, distributed architectures, and spatiotemporal fusion strategies. However, challenges remain related to scalability as the number of cameras increases substantially, the limited availability of labeled multi-camera datasets for anomaly detection, and the need to balance performance and privacy in large-scale biometric applications.
While convolutional neural networks (CNNs) remain the dominant backbone in most multi-camera vision systems due to their computational efficiency and strong performance in real-time detection tasks, recent research has begun exploring Transformer-based architectures such as Vision Transformers (ViT) and Swin Transformers. These models enable improved modeling of global contextual relationships and long-range feature dependencies across camera views, which can be particularly beneficial for tasks such as cross-camera tracking and person re-identification (Re-ID).
However, despite their promising capabilities, Transformer-based models generally require higher computational resources, which currently limits their adoption in real-time and edge-based multi-camera deployments. Consequently, lightweight CNN-based architectures continue to dominate practical implementations in surveillance and smart city environments.
3.3. Intelligent Transportation and Traffic
The domain of intelligent transportation systems (ITS) represents one of the most active application areas for multi-camera networks, as vehicle traffic management and road safety require continuous observation from multiple angles. In complex urban environments, including intersections, highways, tunnels, and parking facilities, a single camera is insufficient to cover the entire scene. Therefore, coordinated multi-camera configurations are deployed to enable continuous tracking of vehicles and pedestrians across different segments of the road network [50,51].
One of the most developed research lines is multi-camera multi-object tracking (MC-MOT) applied to vehicles. Several studies have incorporated deep-learning-based vehicle re-identification techniques to maintain vehicle identities across non-overlapping cameras [52,53]. In particular, approaches combine YOLO-type detectors with feature extraction networks (e.g., OSNet or attention-based variants) to strengthen inter-camera association and improve identity preservation under viewpoint changes [54].
The TIMS system (Traffic Informed Multi-Camera Sensing) was proposed to incorporate contextual information about vehicle flow to improve detection association between nearby cameras, thereby optimizing temporal tracking coherence [50]. Likewise, recent proposals have demonstrated real-time vehicle tracking in congested scenarios, such as drive-through environments, while maintaining identity despite prolonged occlusions [55]. These results confirm that multi-camera collaboration reduces ambiguities that would be difficult to resolve in single-camera configurations.
Regarding road safety and violation detection, multi-camera systems have been developed for average speed enforcement, dangerous driving detection, and anomalous trajectory analysis over extended highway segments [56,57]. The integration of multiple views enables the reconstruction of complete trajectories and the detection of events such as sudden braking, abrupt lane changes, or illegal maneuvers. Additionally, the use of heat maps and field-of-view overlap reasoning has improved the robustness of multi-vehicle tracking in complex environments [53].
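The arithmetic behind average speed enforcement is simple once the same vehicle has been re-identified at two cameras a known distance apart. The following sketch assumes synchronized timestamps and a hypothetical segment length; the function names, the speed limit, and the tolerance parameter are illustrative and are not taken from the cited systems.

```python
def average_speed_kmh(distance_m, t_entry_s, t_exit_s):
    """Average speed over a segment bounded by two cameras, in km/h."""
    elapsed = t_exit_s - t_entry_s
    if elapsed <= 0:
        raise ValueError("exit timestamp must be later than entry timestamp")
    return distance_m / elapsed * 3.6  # m/s to km/h


def is_violation(distance_m, t_entry_s, t_exit_s, limit_kmh, tolerance_kmh=0.0):
    """Flag a vehicle whose segment average exceeds the limit plus tolerance."""
    speed = average_speed_kmh(distance_m, t_entry_s, t_exit_s)
    return speed > limit_kmh + tolerance_kmh


# A 2000 m segment covered in 60 s corresponds to about 120 km/h.
speed = average_speed_kmh(2000.0, 0.0, 60.0)
fast = is_violation(2000.0, 0.0, 60.0, limit_kmh=100.0)
slow = is_violation(2000.0, 0.0, 90.0, limit_kmh=100.0)
print(round(speed, 1), fast, slow)
```

Because the result depends directly on the time difference, clock synchronization between the two cameras and correct cross-camera identity association are the critical sources of error in this kind of measurement.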
Detection under adverse conditions has also been investigated. For example, methods have been proposed that fuse vehicle parts detected by different cameras to improve nighttime detection and reduce false alarms [58]. In high-speed scenarios, specific techniques have been introduced to compensate for motion blur using regression models and feature fusion prior to inter-camera association [59]. These solutions illustrate how the particular challenges of the ITS domain require specialized strategies in multi-camera environments.
Beyond individual tracking, multi-camera networks enable macro-level traffic analytics. Pedestrian re-identification across cameras has been used to estimate origin–destination (O–D) matrices in transportation infrastructures, allowing the inference of mobility patterns and dwell times [60]. Similarly, graph-based frameworks have been proposed that model the road network and multi-camera detections as dynamic graphs to predict congestion and optimize traffic management [61]. These approaches integrate visual sensing with predictive analytics, extending the scope of ITS beyond simple detection.
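Once identities are linked across cameras, an O–D matrix and dwell times follow from the first and last sighting of each identity. The sketch below assumes that re-identification has already produced a list of (identity, timestamp, zone) tuples; the function name, zone labels, and sample data are hypothetical.

```python
from collections import Counter


def od_matrix(sightings):
    """Build O-D counts and dwell times from re-identified sightings.

    sightings: iterable of (identity, timestamp_s, zone) tuples in any order.
    Returns (counts, dwell), where counts maps (origin, destination) to the
    number of identities and dwell maps each identity to seconds between its
    first and last sighting.
    """
    first, last = {}, {}
    for identity, ts, zone in sightings:
        if identity not in first or ts < first[identity][0]:
            first[identity] = (ts, zone)
        if identity not in last or ts > last[identity][0]:
            last[identity] = (ts, zone)
    counts = Counter()
    dwell = {}
    for identity in first:
        counts[(first[identity][1], last[identity][1])] += 1
        dwell[identity] = last[identity][0] - first[identity][0]
    return counts, dwell


# Hypothetical sightings in a station with two gates, a hall, and a platform.
sightings = [
    ("p1", 10, "gate_A"), ("p1", 95, "platform_2"),
    ("p2", 20, "gate_A"), ("p2", 60, "hall"), ("p2", 130, "platform_2"),
    ("p3", 40, "gate_B"), ("p3", 70, "hall"),
]
counts, dwell = od_matrix(sightings)
print(dict(counts), dwell)
```

Intermediate sightings (such as `hall` for `p2`) do not affect the O–D pair but could be retained if route choice, rather than only origin and destination, is of interest.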
Another relevant application is intelligent parking management. Adaptive fusion of multiple cameras enables the estimation of parking occupancy with greater robustness to illumination variations and viewing angles [62]. These systems dynamically adjust the weight assigned to each camera according to environmental conditions, improving metrics such as IoU and reducing false positives in nighttime scenarios.
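The idea of adjusting per-camera weights can be sketched as a weighted average of per-camera occupancy probabilities for a single parking space. This is a generic illustration under stated assumptions and not the fusion scheme of [62]: the function name, the weights, and the decision threshold are hypothetical, and how the weights are derived from environmental conditions is left out.

```python
def fuse_occupancy(estimates, threshold=0.5):
    """Weighted fusion of per-camera occupancy probabilities for one space.

    estimates: list of (probability_occupied, weight) pairs, one per camera.
    Weights are assumed to reflect current viewing quality (for example,
    lower for a poorly lit view). Returns (fused_probability, is_occupied).
    """
    total_weight = sum(w for _, w in estimates)
    if total_weight <= 0:
        raise ValueError("at least one camera must have a positive weight")
    fused = sum(p * w for p, w in estimates) / total_weight
    return fused, fused >= threshold


# Night scene (assumed values): camera 1 is poorly lit, camera 2 sees the space clearly.
night = [(0.2, 0.2), (0.9, 0.8)]
fused, occupied = fuse_occupancy(night)
print(round(fused, 2), occupied)
```

With equal weights the same two estimates would average to 0.55, so down-weighting the unreliable view moves the fused estimate toward the camera that is more trustworthy under current conditions.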
Recent reviews emphasize that combining multiple cameras and complementary sensors is essential to achieving comprehensive coverage in traffic systems and autonomous vehicles [63]. Furthermore, cooperative edge–cloud architectures have been implemented to distribute the computational load of multi-camera vehicle tracking, reducing latency and bandwidth requirements [18,51]. Overall, multi-camera systems applied to intelligent transportation during 2020–2025 have demonstrated significant advances in city-scale continuous tracking, violation detection, dangerous driving monitoring, adverse-condition perception, and advanced mobility analytics. However, challenges remain related to scalability in large urban networks, interoperability across heterogeneous infrastructures, and legal implications arising from the combined use of vehicle and biometric recognition in public spaces. Additionally, previous works have explored vehicle re-identification with tracking context in highways and multi-camera environments, reinforcing cross-view association and temporal consistency [64,65,66].
3.4. Smart Cities and IoT: Distributed Multi-Camera Networks
With the consolidation of the Smart Cities paradigm, multi-camera systems have become an essential component of interconnected urban infrastructure. In this context, cameras no longer operate as isolated devices but are integrated into IoT ecosystems and edge–cloud architectures that enable ubiquitous, scalable, and resource-efficient surveillance [11,37]. Smart cities deploy distributed cameras in streets, public transportation systems, buildings, and open spaces not only for security purposes but also for urban service management, including traffic control, energy optimization, and infrastructure monitoring.
The main challenge lies in coordinating the large volumes of data generated by heterogeneous cameras in real time while ensuring low latency, energy efficiency, and data protection. A widely adopted strategy is edge computing, in which detection and tracking tasks are executed on nodes close to the capture source, reducing reliance on centralized data centers [16,17]. In this direction, cooperative cloud–edge architectures have been proposed in which primary analysis occurs locally and only relevant events or metadata are transmitted to central urban platforms [11]. To better illustrate the structural organization of multi-camera systems in smart city environments, Figure 3 presents a layered architecture integrating sensing, communication, distributed processing, and application services. This framework reflects common design patterns observed in urban surveillance systems, where data is processed across edge, fog, and cloud layers to enable scalable and real-time decision-making.
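The principle of transmitting only events or metadata from the edge can be sketched as a small filtering step that runs after local detection. The message layout, field names, and confidence threshold below are assumptions made for illustration; they do not correspond to a specific platform or protocol described in the reviewed studies.

```python
import json


def to_event_metadata(camera_id, timestamp, detections, min_confidence=0.6):
    """Reduce local detections to a compact metadata message.

    Each detection is a dict with the hypothetical keys 'label',
    'confidence', 'bbox', and 'crop' (raw pixels). Only confident detections
    are kept, and pixel data never leaves the edge node. Returns a JSON
    string, or None when nothing is worth transmitting.
    """
    events = [
        {
            "label": d["label"],
            "confidence": round(d["confidence"], 2),
            "bbox": d["bbox"],
        }
        for d in detections
        if d["confidence"] >= min_confidence
    ]
    if not events:
        return None
    return json.dumps({"camera": camera_id, "ts": timestamp, "events": events})


# Hypothetical detections produced by a local detector on one frame.
detections = [
    {"label": "person", "confidence": 0.91, "bbox": [10, 20, 50, 120], "crop": b"\x00\x01"},
    {"label": "person", "confidence": 0.30, "bbox": [200, 40, 40, 100], "crop": b"\x02\x03"},
]
payload = to_event_metadata("cam-03", 1700000000, detections)
print(payload)
```

A message of this kind is typically orders of magnitude smaller than the corresponding video frame, which is the source of the bandwidth and privacy benefits attributed to edge processing.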
A representative example is the implementation of re-identification microservices on AIoT gateways, which balance computational load and preserve privacy by avoiding the transmission of raw images to the cloud [19]. These systems employ dynamic orchestration and lightweight virtualization techniques to scale according to the number of active cameras, demonstrating feasibility in real urban deployments.
While most privacy preservation strategies in multi-camera smart city systems rely on architectural approaches such as edge computing and local processing, recent research has begun exploring algorithmic privacy-enhancing technologies (PETs). Approaches such as federated learning and differential privacy enable collaborative model training across distributed camera nodes without sharing raw visual data.
However, the adoption of these techniques in multi-camera deployments remains limited due to several practical challenges, including additional computational overhead, synchronization requirements among distributed camera nodes, and potential performance degradation in real-time detection and tracking tasks. Consequently, most current urban surveillance systems still prioritize edge-based data reduction and metadata transmission as primary privacy-preserving mechanisms.
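The aggregation step of federated learning can be illustrated with sample-weighted averaging of model parameters, in the style of the widely used FedAvg scheme. The sketch below is a simplification under stated assumptions: each camera node is represented by a flat list of parameters and a local sample count, the function name is hypothetical, and neither secure aggregation nor differential-privacy noise is included.

```python
def federated_average(client_updates):
    """Sample-weighted average of client parameters (FedAvg-style).

    client_updates: list of (parameters, num_samples) pairs, where
    parameters is a flat list of floats of equal length for all clients.
    Only parameters are shared; raw images stay on each camera node.
    """
    total = sum(n for _, n in client_updates)
    if total <= 0:
        raise ValueError("clients must contribute at least one sample")
    dim = len(client_updates[0][0])
    if any(len(p) != dim for p, _ in client_updates):
        raise ValueError("all clients must share the same parameter shape")
    return [
        sum(p[i] * n for p, n in client_updates) / total
        for i in range(dim)
    ]


# Two hypothetical camera nodes with different amounts of local data.
updates = [([1.0, 2.0], 100), ([3.0, 4.0], 300)]
global_params = federated_average(updates)
print(global_params)
```

Each aggregation round requires all participating nodes to train and report in a coordinated way, which illustrates the synchronization and computational overhead noted above.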
The convergence between multi-camera video and IoT infrastructures has also driven self-organization and distributed coordination schemes. Recent proposals explore intelligent interconnection mechanisms among cameras through spatial optimization and collaborative strategies [28]. Furthermore, distributed frameworks supported by blockchain technologies have been introduced to ensure integrity and traceability in decentralized urban surveillance environments [29].
The integration of cameras with other IoT sensors expands multimodal analytics capabilities. In complex urban scenarios, multi-camera systems can be complemented with acoustic, environmental, or traffic sensors to detect critical events with greater accuracy. In particular, recent anomaly-detection frameworks in smart cities combine multiple video views with deep spatiotemporal processing to recognize anomalous behaviors in densely populated environments [3,46].
Representative smart city applications illustrate how multi-camera systems support large-scale urban monitoring and management tasks. For instance, distributed camera networks can be used for public infrastructure monitoring, detecting anomalies such as flooding, structural damage, or vandalism in streets and public facilities [67].
Another important use case involves crowd management during large-scale public events, where fixed and mobile cameras are combined to estimate crowd density, monitor pedestrian flows, and detect potentially dangerous situations [41].
Beyond security and transportation, multi-camera systems in smart cities enable the monitoring of pedestrian flows in public spaces, the estimation of occupancy densities, and support for real-time decision-making. For example, bird’s-eye-view projection systems combine multiple views to estimate social distancing and crowd density [68]. Similarly, occupancy and flow estimation in smart buildings have relied on multi-camera detection and tracking techniques [69,70]. In parallel, scene composition and mosaicking from multiple calibrated cameras have been addressed to improve coverage and global scene understanding, especially in traffic and urban surveillance contexts [71].
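Bird’s-eye-view projection can be sketched as applying a per-camera homography to the image position of each person's feet and then measuring distances on the common ground plane. The example assumes that the homography has already been estimated from calibration; the matrix values, function names, and 2 m distance threshold are illustrative assumptions rather than parameters from [68].

```python
import math


def to_ground_plane(H, point):
    """Map an image point to ground-plane coordinates with a 3x3 homography.

    H is given as nested row-major lists; point is (x, y) in pixels.
    """
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def distancing_violations(H, foot_points, min_distance_m=2.0):
    """Return index pairs of people closer than min_distance_m on the ground."""
    ground = [to_ground_plane(H, p) for p in foot_points]
    pairs = []
    for i in range(len(ground)):
        for j in range(i + 1, len(ground)):
            if math.dist(ground[i], ground[j]) < min_distance_m:
                pairs.append((i, j))
    return pairs


# Assumed homography for illustration: a pure scaling where 100 px equals 1 m.
H = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 1.0]]
feet = [(100, 100), (200, 100), (600, 100)]
violations = distancing_violations(H, feet)
print(violations)
```

With several cameras, each view would use its own homography into the same ground-plane frame, so detections from different views can be merged before distances or densities are computed.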
Another emerging domain is energy management and the optimization of urban resources. Presence detection through multi-camera networks enables the dynamic adjustment of public lighting or HVAC systems in smart buildings, integrating computer vision with automated control systems. The literature emphasizes the importance of designing camera networks while considering integration with communication infrastructure and other intelligent devices, prioritizing scalability, efficiency, and interoperability [18].
In summary, recent developments point toward distributed, cooperative multi-camera networks empowered by edge AI, capable not only of observation but also of generating autonomous local actions. However, challenges remain related to interoperability among heterogeneous systems, the distributed updating of AI models, and data governance in large-scale urban environments, all of which are critical aspects for consolidating intelligent surveillance as a central component of future cities. Additionally, semantically guided multi-camera pedestrian detection approaches and trajectory forecasting models based on multiple cameras have been proposed, extending analytics capabilities beyond instantaneous tracking [72,73].
3.5. Healthcare and Monitoring for Vulnerable Individuals
The healthcare and assisted-care domain has progressively adopted multi-camera systems for medical emergency detection, home monitoring, and epidemiological surveillance. Between 2020 and 2025, applications such as fall detection for older adults and social distancing monitoring during the COVID-19 pandemic have been particularly prominent, reflecting the growing role of intelligent visual systems in public health and assisted-living environments.
Automatic fall detection constitutes a critical application in geriatric care facilities and hospital environments, where a rapid response can significantly reduce morbidity and mortality. Although wearable accelerometer-based devices have been widely used, they present limitations related to user comfort, incomplete spatial coverage, battery dependency, and potential non-compliance. In contrast, vision-based systems provide a non-invasive alternative; however, single-camera configurations often suffer from occlusions, blind spots, and limited coverage range. Multi-camera configurations mitigate these drawbacks by expanding spatial coverage, reducing blind areas, and enabling multi-view confirmation of critical events.
Shu and Shu [7] developed an eight-camera fall detection system deployed in a home environment, capable of recognizing different types of falls at significantly greater distances compared with single-camera approaches. By fusing multiple viewpoints, the system minimized occlusion effects and achieved high detection accuracy using local processing on low-cost hardware, demonstrating feasibility for smart-home integration. Similarly, Ezatzadeh et al. [8] proposed a multi-camera fusion framework for fall detection that integrates spatial and temporal information to improve robustness against illumination variations and viewpoint changes. These studies demonstrate how visual redundancy across cameras enhances both sensitivity and specificity while reducing false alarms.
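The benefit of visual redundancy can be illustrated with a simple multi-view confirmation rule: a fall is reported only when enough of the cameras that currently see the person agree. This voting scheme is a generic sketch and not the fusion method of [7] or [8]; the function name, score threshold, and minimum number of agreeing views are assumptions.

```python
def confirm_fall(camera_scores, score_threshold=0.5, min_views=2):
    """Confirm a fall from per-camera scores using a simple vote.

    camera_scores maps a camera ID to a fall score in [0, 1], or to None
    when the person is occluded or outside that camera's field of view.
    At least min_views visible cameras must exceed the threshold; if fewer
    cameras see the person, all visible cameras must agree.
    """
    visible = {cam: s for cam, s in camera_scores.items() if s is not None}
    if not visible:
        return False  # nobody sees the person, so nothing can be confirmed
    positive = [cam for cam, s in visible.items() if s >= score_threshold]
    required = min(min_views, len(visible))
    return len(positive) >= required


# Assumed scores: cam2 is occluded, while cam1 and cam3 both indicate a fall.
confirmed = confirm_fall({"cam1": 0.9, "cam2": None, "cam3": 0.7})
# A single high score contradicted by two other visible views is rejected.
rejected = confirm_fall({"cam1": 0.9, "cam2": 0.1, "cam3": 0.2})
print(confirmed, rejected)
```

Requiring agreement lowers the false alarm rate, while ignoring occluded views preserves sensitivity when only part of the camera network observes the event.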
Integration with the IoMT (Internet of Medical Things) paradigm has further expanded these capabilities. Hussain et al. [9] introduced a human-centric attention framework based on deep multi-scale fusion, combining information from multiple cameras with contextual data for activity recognition in medical environments. Such multimodal solutions enable the correlation of visual information with physiological or environmental variables, thereby improving the detection of critical events such as collapses or anomalous behaviors.
During the COVID-19 pandemic, multi-camera networks were widely employed for social-distancing monitoring and contact tracing. Tseng et al. [
74] proposed a deep-learning-based person retrieval approach for video surveillance, enabling the identification of prolonged proximity between individuals on campuses, in hospitals, and in public spaces. Likewise, bird’s-eye-view projection approaches combined multiple perspectives to estimate occupancy densities and detect interpersonal distance violations with higher geometric consistency [
68].
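As a minimal sketch of the bird's-eye-view idea described above, the following Python fragment maps detected foot points from image coordinates to the ground plane with a planar homography and flags pairs that stand closer than a distance threshold. The homography matrix, the pixel coordinates, and the 2 m threshold are illustrative assumptions and are not taken from the cited works; in practice the homography would come from camera calibration.

```python
from itertools import combinations
from math import hypot


def to_ground(H, u, v):
    """Map an image point (u, v) to ground-plane coordinates using a 3x3 homography H."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w  # normalize homogeneous coordinates


def distance_violations(ground_points, min_dist_m=2.0):
    """Return index pairs whose ground-plane distance is below min_dist_m."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(ground_points), 2)
        if hypot(p[0] - q[0], p[1] - q[1]) < min_dist_m
    ]


# Hypothetical homography: a pure scaling of 0.01 m per pixel (assumption).
H = [[0.01, 0.0, 0.0],
     [0.0, 0.01, 0.0],
     [0.0, 0.0, 1.0]]

# Hypothetical foot points of three detected people, in pixels.
feet_px = [(100, 200), (220, 200), (900, 700)]
ground = [to_ground(H, u, v) for u, v in feet_px]
violations = distance_violations(ground)
print(violations)
```

With several cameras, each view would use its own homography to project into the same ground frame, so that distances are measured in a common metric space rather than in per-camera pixel units.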
In hospital and emergency settings, multi-camera systems have also been implemented for detecting anomalous behaviors or risky situations involving vulnerable patients, using deep spatiotemporal models that simultaneously analyze multiple views [
46]. Recent reviews on anomaly detection highlight the increasing incorporation of multi-camera architectures in healthcare contexts [
3]. In a related livestock-monitoring setting, Salau and Krieter [
75] applied instance segmentation with Mask R-CNN in a multi-camera environment to detect and localize dairy cows, demonstrating the effectiveness of instance-level segmentation in complex scenes with frequent occlusions.
A critical aspect in this domain is privacy preservation. Since these systems operate in domestic or clinical environments, many studies prioritize local inference on edge devices, avoiding the transmission of raw video to external servers [
7]. Furthermore, visual anonymization techniques, such as silhouettes, pose maps, or skeletal representations instead of full RGB imagery, have been proposed to protect patient identity while maintaining detection capability.
Overall, the recent literature demonstrates that multi-camera systems significantly improve spatial coverage, reduce detection latency, and increase reliability in assisted-care applications. Nevertheless, challenges remain regarding ethical acceptance, regulatory compliance in the handling of sensitive health data, and the balance between high accuracy and strict privacy constraints. Continued advancements in edge hardware, lightweight deep learning models, and IoMT integration suggest that these systems will become increasingly accessible, enabling proactive remote assistance and the early detection of critical medical events.
3.6. Precision Agriculture and Environmental Monitoring
In the agricultural and environmental sectors, multi-camera systems have emerged as strategic tools to enhance productivity, animal welfare, and ecological surveillance. The digitalization of agriculture (AgTech) increasingly incorporates computer vision for livestock tracking, animal segmentation, crop monitoring, and environmental assessment, where multiple cameras enable the coverage of extensive areas or provide complementary viewpoints for more robust analysis.
A prominent application is livestock monitoring. In farming environments, multi-camera configurations allow the supervision of large pens or barns without reliance on wearable sensors, which may cause stress or require maintenance. Salau and Krieter applied instance segmentation based on Mask R-CNN in a dairy farming context using multiple cameras, demonstrating that cows can be segmented and counted even in scenes with partial overlap [
76]. The integration of multi-view segmentation enhances counting accuracy and reduces identity switching under occlusions.
More recently, multi-camera fusion with bird’s-eye-view projection has been proposed for continuous monitoring of cattle in large enclosures, integrating perimeter camera views into a unified top-down spatial representation [
12]. This approach facilitates the analysis of movement patterns, feeding behavior, and anomalous activities, contributing to the early detection of stress or disease.
Advances in detection models have also been evaluated for individual animal identification. Borwarnginn et al. [
77] conducted comparative analyses of YOLO architectures in precision livestock scenarios, highlighting the importance of well-annotated multi-camera datasets to improve robustness against illumination variability and morphological similarity among animals. These findings underscore the need for standardized benchmarks tailored to agricultural contexts.
In environmental monitoring, multi-camera networks have been applied to water-level detection and flood prevention. Borwarnginn et al. implemented a system using CCTV cameras to estimate river levels through deep learning, demonstrating the feasibility of repurposing existing infrastructure for early warning systems [
78]. Multi-point observation enhances reliability under perspective distortion and adverse weather conditions.
Additionally, aerial and fluvial datasets combining fixed cameras and unmanned aerial vehicle (UAV) platforms have been introduced for semantic segmentation of riverbeds and riparian vegetation, enabling the training of models that integrate ground and aerial viewpoints [
67]. This hybrid approach expands spatial coverage and supports continuous ecosystem monitoring.
A critical technical requirement in outdoor environments is accurate multi-camera calibration. Tripicchio et al. [
48] proposed a real-time extrinsic calibration method for distributed cameras deployed in large open areas, enabling consistent 3D reconstruction and coherent spatial tracking. Such calibration is essential in structural monitoring applications, including dam, bridge, or infrastructure displacement analysis.
Experimental evaluation of CNN-based positioning and detection systems using fixed cameras has further provided empirical evidence regarding accuracy limitations and deployment constraints in real-world conditions [
79]. In the UAV domain, photorealistic multi-camera simulators have been developed for drone-based perception research, facilitating the training and benchmarking of algorithms under complex agricultural and environmental scenarios [
80]. These simulation environments enable the modeling of interactions between mobile and fixed cameras for integrated monitoring strategies.
Overall, multi-camera applications in agriculture and environmental monitoring have demonstrated measurable benefits in livestock tracking, early disease detection, hydrological monitoring, and ecological observation. Nonetheless, challenges persist regarding hardware durability under harsh outdoor conditions, connectivity limitations in rural areas, and computational constraints in resource-limited sites. Despite these constraints, multi-camera fusion remains a promising strategy for expanding situational awareness and supporting data-driven decision-making in agricultural and environmental domains.
3.7. Industry and Robotics
In industrial and automation environments, multi-camera systems have become key components for improving operational efficiency, safety, and robotic perception within the framework of Industry 4.0. Multiple cameras are typically integrated with cyber-physical systems, autonomous mobile robots, and IIoT platforms, providing multi-angle observation for monitoring, quality control, process supervision, and navigation tasks. The redundancy and geometric diversity offered by multi-camera configurations enhance robustness in complex and dynamic industrial settings.
In mining and heavy industry, multi-camera configurations are primarily deployed to expand visual coverage in hostile and confined environments. Bai et al. [
10] developed a real-time video stitching system for surveillance in underground mines, combining multiple cameras into a continuous panoramic mosaic. Through hybrid image-registration techniques and geometric alignment, the system mitigated adverse lighting conditions and airborne particles, significantly improving situational awareness in narrow tunnels where individual camera fields of view are severely limited.
Within smart-factory environments, multi-camera networks enable distributed tracking of objects, mobile robots, and material flows across production lines. Decentralized architectures integrating edge computing with blockchain-based trust mechanisms have been proposed to securely share tracking information among industrial nodes [
29]. This approach enhances data integrity, traceability, and resilience against cyberattacks, which are critical requirements in IIoT ecosystems.
Collaboration between fixed cameras and mobile robotic platforms represents another relevant advancement. Casao et al. [
30] introduced a distributed active-perception framework in which multiple cameras, including sensors mounted on robots, cooperate to maintain persistent tracking of targets in large industrial spaces. This hybrid strategy ensures tracking continuity when an object exits the field of view of a mobile robot and is subsequently captured by fixed cameras, or vice versa, thereby reducing tracking fragmentation.
To provide a clearer understanding of how multi-camera systems interact with robotic platforms in industrial environments, Figure 4 illustrates a simplified workflow integrating distributed visual processing and industrial applications such as robot navigation and process monitoring.
In industrial navigation and logistics, the precise localization of autonomous guided vehicles (AGVs) using multi-camera networks has shown promising improvements in positioning accuracy. A recent study optimized PnP-based localization through regression modeling and multi-camera fusion, significantly reducing root-mean-square error compared with single-camera solutions [
81]. The geometric redundancy provided by multiple synchronized views enables the correction of calibration errors and enhances robustness against partial occlusions or dynamic obstacles.
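The benefit of geometric redundancy can be illustrated with a small sketch. The fragment below fuses per-camera position estimates by inverse-variance weighting and compares the root-mean-square error before and after fusion. This is a generic fusion rule chosen for illustration, not the regression-based method of the cited study, and the positions and variances are made-up values.

```python
from math import sqrt


def fuse_positions(estimates):
    """Inverse-variance weighted mean of per-camera (x, y, variance) estimates."""
    weight_sum = sum(1.0 / var for _, _, var in estimates)
    x = sum(px / var for px, _, var in estimates) / weight_sum
    y = sum(py / var for _, py, var in estimates) / weight_sum
    return x, y


def rmse(predictions, truths):
    """Root-mean-square Euclidean error between predicted and true 2D positions."""
    squared = [
        (px - tx) ** 2 + (py - ty) ** 2
        for (px, py), (tx, ty) in zip(predictions, truths)
    ]
    return sqrt(sum(squared) / len(squared))


# Hypothetical AGV ground-truth position and two per-camera estimates (assumptions).
truth = (1.0, 1.0)
per_camera = [
    (1.2, 1.0, 0.04),  # camera A: larger error, larger variance
    (0.9, 1.0, 0.01),  # camera B: smaller error, smaller variance
]

fused = fuse_positions(per_camera)
single_error = rmse([(1.2, 1.0)], [truth])  # camera A alone
fused_error = rmse([fused], [truth])
print(fused, single_error, fused_error)
```

In this toy case the fused estimate lies closer to the true position than the noisier single-camera estimate, because the more reliable view receives the larger weight.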
The optimal design of multi-camera networks in industrial environments has also been extensively studied. Camera-placement optimization algorithms have been proposed to maximize coverage while minimizing deployment cost [
82,
83]. More recent approaches incorporate energy constraints into adaptive coverage-optimization strategies, improving sustainability in large-scale facilities [
84]. Efficient planning is particularly relevant in expansive industrial complexes where infrastructure decisions directly impact operational expenditures.
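One common way to approach the coverage-versus-cost trade-off is a greedy heuristic, sketched below under simplifying assumptions: the floor plan is discretized into cells, each candidate mounting site has a known cost and a known set of visible cells, and cameras are added while the budget allows. The site names, costs, and cell sets are hypothetical, and the cited works may use different formulations (for example, integer programming or evolutionary search).

```python
def greedy_placement(candidates, budget):
    """Pick camera sites greedily by newly covered cells per unit cost.

    candidates: dict mapping site name -> (cost, set of covered cell ids)
    budget: maximum total deployment cost
    """
    chosen, covered, spent = [], set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for name, (cost, cells) in candidates.items():
            if name in chosen or spent + cost > budget:
                continue  # already installed or unaffordable
            ratio = len(cells - covered) / cost
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break  # nothing affordable adds coverage
        cost, cells = candidates[best]
        chosen.append(best)
        covered |= cells
        spent += cost
    return chosen, covered, spent


# Hypothetical candidate sites (assumptions for illustration only).
candidates = {
    "A": (2.0, {1, 2, 3, 4, 5}),
    "B": (1.0, {3, 4}),
    "C": (1.0, {6, 7}),
    "D": (4.0, {1, 2, 3, 4, 5, 6, 7}),
}
chosen, covered, spent = greedy_placement(candidates, budget=3.0)
print(chosen, sorted(covered), spent)
```

The greedy rule does not guarantee an optimal layout, but it makes the structure of the problem explicit: each additional camera is justified only by the coverage it adds beyond what is already observed.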
Intelligent control of PTZ cameras using multi-agent reinforcement learning has emerged as an active research direction. Yang et al. [
85] proposed a hierarchical reinforcement-learning framework to optimize PTZ camera trajectories and orientations in dynamic tracking tasks. Such adaptive control mechanisms are especially suitable for automated warehouses and manufacturing plants characterized by high object mobility and dynamic reconfiguration.
In industrial aerial robotics, UAVs equipped with multiple cameras are increasingly used for infrastructure inspection, inventory auditing, and high-level monitoring. The photorealistic MCS Sim simulator facilitates the training and validation of multi-camera algorithms for UAV-based inspection in complex industrial settings prior to real-world deployment [
80]. These tools reduce operational risks and accelerate development cycles.
Regarding physical system design, specialized structures for volumetric surveillance and cost-optimized deployment of multi-camera arrays have been proposed, complementing algorithmic camera-placement optimization [
86]. Overall, multi-camera systems in industry and robotics have enabled advancements in visual mosaicking, secure distributed tracking, precise localization, collaborative perception, and energy-efficient coverage. Nevertheless, challenges persist concerning robustness under adverse industrial conditions (dust, vibration, fluctuating illumination), interoperability among heterogeneous platforms, and cybersecurity protection in critical infrastructure environments.
3.8. Mobile Cameras, Drones, and Active Perception
A significant subfield within multi-camera systems involves the integration of mobile or actuated cameras, including autonomous PTZ (pan-tilt-zoom) cameras and sensors mounted on unmanned aerial vehicles (UAVs). Unlike static configurations, these systems introduce the paradigm of active perception, in which cameras dynamically adjust orientation or trajectory to optimize coverage, tracking continuity, or observation quality.
In the PTZ domain, Kumari et al. [
87] proposed a dynamic scheduling scheme for an autonomous PTZ camera integrated within a fixed camera network. The scheduling algorithm determines the optimal orientation and zoom level at each time step to maximize event detection probability while accounting for movement costs and reorientation latency. Experimental results demonstrated that a strategically controlled PTZ camera can effectively fill coverage gaps that would otherwise require additional fixed sensors.
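The trade-off between detection probability and movement cost can be sketched with a myopic scheduler that, at each time step, selects the preset with the highest expected detection probability minus a reorientation penalty. This is a simplified illustration and not the scheduling algorithm of the cited work; the preset names, probabilities, and costs are assumptions.

```python
def schedule_ptz(event_prob, move_cost, start, weight=1.0):
    """Greedy PTZ schedule over discrete presets.

    event_prob: list with one dict per time step, preset -> detection probability
    move_cost: dict mapping (from_preset, to_preset) -> reorientation cost
    start: preset the camera points at before the first step
    weight: how strongly movement cost is penalized
    """
    current, plan = start, []
    for probs in event_prob:
        # Staying on the same preset costs nothing (missing key -> 0.0).
        current = max(
            probs,
            key=lambda p: probs[p] - weight * move_cost.get((current, p), 0.0),
        )
        plan.append(current)
    return plan


# Hypothetical two-preset example (assumptions for illustration only).
move_cost = {("door", "hall"): 0.3, ("hall", "door"): 0.3}
event_prob = [
    {"door": 0.5, "hall": 0.6},
    {"door": 0.2, "hall": 0.9},
    {"door": 0.7, "hall": 0.5},
]
plan = schedule_ptz(event_prob, move_cost, start="door")
print(plan)
```

In the first step the camera stays on the door preset even though the hall has a slightly higher probability, because the gain does not offset the reorientation cost; a scheduler that plans over a longer horizon could make different choices.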
The evolution of this approach has incorporated multi-agent reinforcement learning for cooperative control of multiple mobile cameras. Yang et al. [
85] developed a hierarchical reinforcement-learning framework to optimize trajectories and orientation policies for PTZ cameras in dynamic multi-target tracking scenarios. In this framework, cameras coordinate to distribute targets efficiently, minimizing redundant field-of-view overlap while improving tracking persistence.
In UAV-based surveillance, aerial mobility significantly extends spatial coverage and adaptability. Gonchigsumlaa et al. [
88] formulated an entropy- and coverage-driven optimal-control model for cooperative multi-camera UAV systems. Their results demonstrated substantial performance gains compared with static patrol patterns, particularly in large-perimeter monitoring and critical-infrastructure inspection scenarios.
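To make the notion of an entropy-driven objective more concrete, the sketch below scores candidate viewpoints by the total binary entropy of the occupancy beliefs they would observe and selects the most informative one. It is a one-step illustration only and does not reproduce the optimal-control model of the cited work; the cell names, belief values, and footprints are assumptions.

```python
from math import log2


def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli variable with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def best_viewpoint(belief, footprints):
    """Choose the viewpoint whose footprint covers the most uncertain cells.

    belief: dict mapping cell id -> probability that a target is present
    footprints: dict mapping viewpoint name -> set of cell ids it observes
    """
    def score(viewpoint):
        return sum(binary_entropy(belief[c]) for c in footprints[viewpoint])

    return max(footprints, key=score)


# Hypothetical belief map and UAV viewpoints (assumptions for illustration only).
belief = {"c1": 0.5, "c2": 0.5, "c3": 0.99, "c4": 0.01, "c5": 0.9}
footprints = {
    "north": {"c1", "c2"},
    "south": {"c3", "c4", "c5"},
}
best = best_viewpoint(belief, footprints)
print(best)
```

Here the smaller northern footprint is preferred because its cells are maximally uncertain, whereas the southern footprint covers more cells whose state is already nearly known.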
The validation of these cooperative strategies has been supported by photorealistic simulation environments. MCS Sim provides a virtual testbed for evaluating dynamic calibration, target handoff, and cooperative tracking among drones prior to physical deployment [
80]. Simulation-based validation reduces operational risks and facilitates the analysis of occlusion handling and coordination policies in complex environments.
Hybrid integration between fixed and mobile cameras represents another important advancement. In distributed active-perception frameworks, fixed cameras detect initial events and trigger intervention by mobile sensors (PTZ units or UAVs) for closer inspection and persistent tracking [
30]. This layered model integrates continuous passive surveillance with adaptive response mechanisms, effectively closing the perception–action loop.
Active collaboration strategies supported by pose estimation have also been proposed, demonstrating that pose-informed coordination improves tracking robustness in dynamic scenes [
89]. Overall, mobile multi-camera systems enhance spatial coverage, adaptability, and response time relative to purely static networks. However, they introduce additional technical challenges, including dynamic recalibration between mobile and static cameras, energy-management constraints in UAV platforms, and secure coordination protocols to avoid redundancy or collision. Ongoing research addresses these issues through optimal-control theory, distributed learning, and cooperative multi-agent architectures.
3.9. Other Emerging Applications: Retail, Education, and Forensic Analysis
Beyond traditional domains such as security and transportation, multi-camera systems have expanded into emerging applications including retail analytics, educational environments, and forensic video analysis. In these contexts, the integration of multiple viewpoints enhances interpretability, coverage, and analytical depth.
In the retail sector, smart stores deploy multi-camera networks to analyze customer behavior patterns and optimize operational efficiency. Trajectory-based re-identification systems have been used to map customer movement paths in shopping centers without relying on explicit biometric identification [
90]. Such systems enable the estimation of flow between commercial zones and dwell-time analysis while prioritizing metadata-based processing over raw video storage.
In retail logistics and drive-through service management, multi-camera vehicle tracking has demonstrated improvements in throughput and congestion reduction [
55]. Robust cross-camera association supports the optimization of service times and the mitigation of bottlenecks in high-demand environments.
In educational settings, multi-camera systems have been applied for student detection and counting in classrooms, mitigating occlusion through complementary viewing angles [
91]. These configurations improve detection accuracy in dense environments and have been extended to safety applications, including automated recognition of violent activities using deep multi-camera models [
92]. While these implementations offer safety benefits, they also raise ethical and regulatory considerations related to surveillance in academic institutions.
In forensic analysis, multi-camera networks have enabled advanced tools for automatic video summarization and efficient search across large datasets. Veesam and Satish [
93] proposed an integrated multi-camera summarization framework combining object detection and multimodal fusion for crime-scene investigation applications. This approach significantly reduces manual review time while preserving critical evidence.
Probabilistic identification methods in multi-camera environments have also been developed for scenarios in which cameras are unsynchronized or exhibit temporal inconsistencies [
94]. These methods employ inference based on visual attributes and spatiotemporal context to accelerate re-identification in complex forensic investigations.
Recent reviews have synthesized machine-learning techniques applied to multi-camera networks, highlighting both technical advancements and practical limitations [
37]. Key challenges include computational cost, hardware requirements, and efficient integration with edge and fog architectures.
From an IoT integration perspective, efficient encoding mechanisms and device–object pairing strategies have been proposed to improve interoperability between cameras and smart infrastructure components [
21,
95]. These contributions facilitate coordinated operation in commercial and urban ecosystems.
Overall, emerging applications demonstrate the continued expansion of the multi-camera paradigm into diverse operational domains. Although each sector presents unique ethical or regulatory considerations, they share common technical challenges related to identity association, spatiotemporal fusion, scalability, and computational efficiency. The technological maturity achieved in security and transportation domains increasingly serves as the foundation for these new applications. Finally, multi-camera approaches for suspicious object localization and end-to-end event image stitching and edge detection have been explored, reinforcing the role of multi-view perception in demanding operational contexts [
96,
97] (see Table 2).
4. Discussion
Across the reviewed studies, the evaluation protocols show notable variability depending on the application domain. In surveillance and multi-target tracking research, benchmark datasets such as DukeMTMC, CityFlow, and other multi-camera tracking datasets are commonly used, with evaluation metrics including IDF1, MOTA, and tracking accuracy. In object detection-oriented applications, metrics such as mean Average Precision (mAP) and detection rate remain dominant.
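For reference, the two tracking metrics named above follow their standard definitions: MOTA penalizes false negatives, false positives, and identity switches relative to the number of ground-truth objects, while IDF1 is the harmonic mean of identity precision and identity recall. The short sketch below computes both from aggregate counts; the counts in the example are made-up values, not results from any reviewed study.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """ID F1 score: 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)


# Hypothetical aggregate counts over a sequence (assumptions for illustration only).
mota_value = mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100)
idf1_value = idf1(idtp=80, idfp=10, idfn=30)
print(mota_value, idf1_value)
```

MOTA can become negative when the total number of errors exceeds the number of ground-truth objects, whereas IDF1 is bounded between 0 and 1 and is more sensitive to how long identities are preserved, which is why it is often emphasized in multi-camera settings.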
However, in several emerging domains such as agriculture, healthcare, and industrial monitoring, the absence of standardized multi-camera datasets often leads researchers to construct custom experimental datasets, which limits reproducibility and cross-study comparison. This heterogeneity highlights the need for more standardized evaluation protocols and shared datasets for multi-camera computer vision research.
Based on the comprehensive review of the literature published between 2020 and 2025, it is evident that multi-camera systems have reached a substantial level of technical maturity across a wide spectrum of domains, including public security, transportation, healthcare, agriculture, industry, and emerging smart-city ecosystems. Nevertheless, despite these advances, several structural and cross-domain challenges persist that limit scalability, long-term robustness, and broader societal acceptance.
One of the most recurrent and technically complex issues is cross-camera data association. Despite significant progress in re-identification and multi-object tracking, maintaining the consistent identity of an individual or object across multiple spatially distributed cameras remains an open challenge [
13,
14]. Identity fragmentation, appearance ambiguity, and trajectory discontinuities become particularly problematic in uncontrolled environments characterized by variable illumination, prolonged occlusions, crowd density, and non-overlapping fields of view. Recent surveys emphasize that combining appearance descriptors, dynamic motion modeling, and spatiotemporal constraints enhances robustness; however, determining the optimal integration of these components remains an unresolved research direction [
13,
14].
Another relevant technical factor influencing the performance of multi-camera systems is the configuration of the camera network itself. The reviewed studies employ diverse deployment strategies, including overlapping and non-overlapping camera fields of view, distributed urban camera networks, and hybrid configurations combining fixed and mobile cameras.
Camera placement directly affects the complexity of cross-camera association tasks. Overlapping views simplify spatial matching through geometric constraints, whereas non-overlapping configurations require stronger appearance-based re-identification and spatiotemporal reasoning. Furthermore, temporal synchronization among cameras plays an important role in ensuring coherent trajectory reconstruction and avoiding identity fragmentation in multi-target tracking scenarios.
Geometric calibration is another critical requirement for several multi-camera applications, particularly in bird’s-eye-view projection, multi-view fusion, and trajectory estimation tasks. Calibration errors or inconsistent spatial alignment can significantly degrade tracking accuracy and cross-view consistency, especially in large-scale camera networks deployed in urban or industrial environments.
From an algorithmic perspective, deep learning has become the dominant approach for addressing detection, tracking, and re-identification tasks in multi-camera systems. Convolutional neural networks are widely used for object detection and feature extraction, while deep embedding networks enable robust person or vehicle re-identification across different viewpoints.
These models learn discriminative appearance representations that help mitigate challenges such as illumination variation, viewpoint changes, and partial occlusions. In addition, several recent works integrate spatiotemporal constraints with deep feature embeddings to improve cross-camera identity association and trajectory reconstruction in large-scale camera networks.
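A minimal sketch of how appearance embeddings and spatiotemporal constraints can be combined is given below. Tracks leaving one camera are matched to tracks entering another only if the transit time falls within a plausible window, and the remaining candidates are ranked by cosine similarity of their embeddings. The greedy matching, the thresholds, and the toy two-dimensional embeddings are assumptions for illustration; practical systems typically use high-dimensional learned features and may replace the greedy step with an optimal assignment solver.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def associate(exits, entries, min_dt, max_dt, min_sim=0.5):
    """Match tracks across two cameras.

    exits: list of (track_id, exit_time, embedding) from camera A
    entries: list of (track_id, entry_time, embedding) from camera B
    min_dt, max_dt: plausible transit-time window between the cameras
    min_sim: minimum cosine similarity to accept a match
    """
    candidates = []
    for id_a, t_exit, emb_a in exits:
        for id_b, t_entry, emb_b in entries:
            dt = t_entry - t_exit
            if not (min_dt <= dt <= max_dt):
                continue  # spatiotemporal gate
            sim = cosine(emb_a, emb_b)
            if sim >= min_sim:
                candidates.append((sim, id_a, id_b))
    candidates.sort(reverse=True)  # most similar pairs first
    matches, used_a, used_b = {}, set(), set()
    for sim, id_a, id_b in candidates:
        if id_a not in used_a and id_b not in used_b:
            matches[id_a] = id_b
            used_a.add(id_a)
            used_b.add(id_b)
    return matches


# Hypothetical tracks with toy 2D embeddings (assumptions for illustration only).
exits = [("a1", 10.0, [1.0, 0.0]), ("a2", 12.0, [0.0, 1.0])]
entries = [
    ("b1", 20.0, [0.9, 0.1]),
    ("b2", 21.0, [0.1, 0.9]),
    ("b3", 100.0, [1.0, 0.0]),  # identical appearance, implausible transit time
]
matches = associate(exits, entries, min_dt=5.0, max_dt=30.0)
print(matches)
```

The track b3 has the same appearance as a1 but is rejected by the time window, which illustrates how spatiotemporal gating removes candidates that appearance alone cannot disambiguate.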
In this context, formulations based on spatiotemporal reasoning and robust cross-camera association have been explored to improve global matching performance in realistic deployment scenarios [
40,
41]. Complementarily, in intelligent transportation systems, learned representations combined with edge AI architectures have enabled cooperative vehicle tracking across distributed camera networks [
51]. These efforts illustrate the trend toward integrating perception algorithms with distributed architectural intelligence.
Another critical axis concerns scalability and computational efficiency. As the number of cameras increases and video resolutions continue to rise, real-time processing becomes a significant bottleneck. Although edge and fog computing architectures, along with resource allocation strategies, have been proposed to distribute computational load [
16,
18,
37], efficient orchestration of heterogeneous resources remains a non-trivial challenge. The need to balance latency, energy consumption, and bandwidth constraints becomes especially relevant in large-scale urban and industrial deployments. In this direction, encoding and compression mechanisms tailored for IoT-based surveillance have also been introduced to accelerate transmission and processing [
95].
5. Conclusions
Between 2020 and 2025, multi-camera vision systems consolidated their position as a key technology across numerous domains, providing environmental perception capabilities that are difficult to achieve with isolated cameras. This review identified and analyzed the main applications, ranging from public security surveillance, traffic monitoring in intelligent transportation systems, and smart city management to emerging fields such as assisted healthcare, precision agriculture, Industry 4.0, and retail environments, demonstrating that the versatility of camera networks spans a wide range of sectors [
36,
37]. Each domain leverages intrinsic advantages of multiple viewpoints: enhanced spatial coverage, reduced blind spots, continuous tracking of objects across different scenarios, and redundancy against occlusions [
13,
98].
The predominant techniques and approaches enabling these advances have also been summarized. The rise of deep learning has been a cross-cutting factor, with detectors and re-identification models significantly improving detection accuracy and identity association in complex environments [
22,
23,
24]. Complementarily, optimization and planning algorithms for mobile cameras, edge computing to distribute processing loads, and information fusion strategies to enhance situational awareness have been integrated into modern systems [
16,
85,
87]. Overall, many systems combine multi-object tracking, biometrics (e.g., face or gait recognition), and behavioral analysis within hybrid architectures tailored to specific contexts [
13,
27].
Despite these achievements, the review reveals persistent challenges. Robust cross-camera object association in arbitrary situations remains complex, especially under prolonged occlusions, dense crowds, or abrupt appearance changes, keeping the MTMCT problem open in uncontrolled scenarios [
13,
14]. Scaling systems to dozens or hundreds of cameras while maintaining real-time processing is another major obstacle: distributed solutions and edge computing mitigate some limitations but introduce orchestration complexity, network requirements, and the need for efficient resource allocation [
16,
17,
18]. Furthermore, privacy and security are critical concerns; large-scale deployment requires technical safeguards (local processing, data minimization, integrity, and traceability) as well as robust protection mechanisms against threats [
19,
29]. Finally, knowledge gaps remain in less explored subareas (e.g., multi-camera anomaly detection and extreme scenarios), requiring further research attention and suitable datasets for evaluation [
3,
45].
Looking ahead, multi-camera systems are expected to become increasingly integrated with IoT/IIoT infrastructures and cooperative edge–cloud architectures, leading to more intelligent and collaborative networks capable of delivering comprehensive services in cities and factories [
11,
18]. A trend toward active perception and hybrid networks combining fixed cameras with mobile cameras (PTZ, robots, or UAVs) is also evident, aiming to improve coverage and event response through intelligent control and simulation support [
30,
80,
88].
In summary, during the reviewed period, multi-camera systems have evolved from promising research topics into practical implementations across diverse domains, although technical and data governance limitations remain. Addressing the research questions posed: (1) the main identified domains include security surveillance, transportation and traffic, smart cities, healthcare, agriculture/livestock, industry/robotics, and commercial environments; (2) key techniques include deep learning for detection, tracking, and re-identification, distributed edge computing architectures, active camera control, and IoT-based fusion; (3) recent achievements show improvements in accuracy and coverage, yet limitations persist in cross-camera association, computational cost, and privacy; and (4) future opportunities include more robust association algorithms, efficient scalability through edge AI, integrated security and privacy mechanisms, and new active perception schemes [
13,
37].
In conclusion, multi-camera systems are on track to become fundamental components of intelligent infrastructures, enhancing safety, efficiency, and situational awareness. Their future development will require a multidisciplinary approach combining technical innovation with ethical considerations and public policy frameworks, ensuring that these networks become more autonomous, collaborative, and privacy-aware.