A Social Distance Estimation and Crowd Monitoring System for Surveillance Cameras

Social distancing is crucial to restraining the spread of diseases such as COVID-19, but complete adherence to safety guidelines is not guaranteed. Monitoring social distancing through mass surveillance is paramount to developing appropriate mitigation plans and exit strategies. Nevertheless, it is a labor-intensive task that is prone to human error and tainted with possible breaches of privacy. This paper presents a privacy-preserving adaptive social distance estimation and crowd monitoring solution for camera surveillance systems. We develop a novel person localization strategy through pose estimation, build a privacy-preserving adaptive smoothing and tracking model to mitigate occlusions and noisy/missing measurements, compute inter-personal distances in the real-world coordinates, detect social distance infractions, and identify overcrowded regions in a scene. Performance evaluation is carried out by testing the system's ability in person detection, localization, density estimation, anomaly recognition, and high-risk area identification. We compare the proposed system to the latest techniques and examine the performance gain delivered by the localization and smoothing/tracking algorithms. Experimental results indicate a considerable improvement, across different metrics, when utilizing the developed system. In addition, they show its potential and functionality for applications other than social distancing.


Introduction
The rapid outbreak of the Coronavirus Disease 2019 (COVID-19) has imposed restrictions on people's movement and daily life [1]. Reducing the spread of the virus mandates constraining social interactions, traveling, and access to public areas and events [1]. These limitations mainly serve to enforce social distancing: the practice of increasing physical space among people to minimize virus transmission [2]. Monitoring and maintaining social distancing is carried out by governmental bodies and agencies using mass surveillance systems and closed-circuit television (CCTV) cameras [3]. Nonetheless, this task is cumbersome and suffers from subjective interpretations and human error due to fatigue; hence, computer vision and machine learning tools are convenient for automation [4]. In addition, they enable crowd behavior to be monitored and anomalies such as congested regions, curfew infractions, and illegal gatherings to be recognized. The widespread adoption of mass surveillance and its integration with Machine Learning are hindered by ethical concerns, including possible breach of privacy and potential abuse [3]. Therefore, privacy-preserving surveillance and Machine Learning solutions are paramount to their ethical adoption and application [5].
The design of a vision-based social distance estimation and crowd monitoring system deals with the following challenges [4]: (1) geometry understanding, in terms of ground plane identification and homography estimation; (2) multiple people detection and localization; and (3) statistical/temporal characterization of social distance infractions, e.g., short-term violations are irrelevant. Currently, Machine Learning-based solutions identify social distance infringements using off-the-shelf person detection and tracking models [4]. In general, the models' performance comes at the expense of privacy; they yield high performance by carrying and processing person-specific information to develop robustness against occlusions and missing data [4]. In addition, they localize human subjects via bounding boxes that can be over-sized or incomplete, which results in significant distance estimation errors [6]. Therefore, we propose a privacy-preserving adaptive social distance estimation and crowd monitoring system that can be implemented on top of any existing CCTV infrastructure. The main contributions of the paper are as follows: (1) Developing a robust person localization strategy using pose estimation techniques; (2) Forming an adaptive smoothing and tracking paradigm to mitigate the problem of occlusions and missing data without compromising privacy; (3) Designing a real-time privacy-preserving social distance estimation and crowd monitoring solution with potential to cover other application areas and tasks.
The rest of this paper is organized as follows: Section 2 overviews the related work and Section 3 describes our methodologies to build and evaluate the proposed system. Afterwards, we present and discuss the system outcome and performance in Section 4. Finally, Section 5 concludes the paper and suggests topics for future research.

Related Work
This section reviews state-of-the-art Machine Learning-based social distance estimation and monitoring solutions and summarizes their advantages and limitations. First, we analyze various person detection and localization strategies within the scope of social distancing. After that, we review different approaches to recognize social distancing abnormalities. Finally, we discuss the latest vision-based crowd monitoring techniques.

Person Detection and Localization
Several methods exist in the literature and fall under two main categories: object detectors and pose estimation techniques. The former identifies objects by drawing a bounding box around them, while the latter detects the human joints and connects them, resulting in pose estimates [7]. On the one hand, object detectors, such as YOLO models [8], are more general-purpose than pose estimation techniques, but their utility for identifying human subjects may require pruning and/or retraining. In addition, they do not offer further information about the detected objects, and their bounding boxes can be over-sized or incomplete [6]. On the other hand, pose estimators are specialized models; hence, they are more suitable for detecting people in a scene. Specifically, they account for various body orientations/actions, such as standing, sitting, riding, and bending, better than object detectors [9]. Moreover, their ability to work in dense crowds was verified in [10,11], which is precisely the setting that social distance monitoring deals with. Nonetheless, pose estimators are computationally more expensive than object detectors, and their high entropy output requires further processing [7].
In [12], a visual analysis technique is proposed to quantify and monitor contact tracing for COVID-19. The detection and tracking of human subjects are performed by a YOLO architecture and a Simple Online and Real-time Tracking model, respectively. In addition, each subject is localized by its bounding box bottom mid-point. Similar detection and tracking approaches are proposed in [13,14], but the latter localizes the subjects by their bounding box centroid. The aforementioned solutions, although accurate, are not suitable because they carry person-specific information, which hinders their adoption for privacy-preserving applications. Nonetheless, privacy-preserving techniques are developed in [6,15] to monitor the evolution of social distancing patterns using CCTV cameras. The first work utilizes YOLO-v3 to detect pedestrians and the bounding box centroid for localization. Moreover, the second work explores two person detectors and one end-to-end model, and provides evidence that the latter does not necessarily improve performance and that the bounding box bottom mid-point is the best choice for localization. Many variants of the YOLO model and other neural network architectures are used to detect humans in videos, and the bounding box centroid, top left edge, or bottom center, is used for localization [16][17][18][19][20][21][22][23][24][25]. Lastly, the social distancing problem is tackled in [26] using a pose estimation model to detect human subjects in videos and to infer their location using the predicted feet joints. The same approach is employed in [27] to measure inter-personal distances but for still images. This has motivated us to use pose estimation techniques to detect people because they offer rich information about the localized subjects and mitigate the pitfalls of bounding boxes.

Anomaly Recognition
The scope of the social distancing problem defines an anomaly in a surveillance video by the presence of social distance violations [4]. This task requires estimating interpersonal distances among the localized subjects and comparing them to a predefined safety threshold [4]. In [13,15], the localized subjects' pair-wise distances are calculated in the real-world coordinates and social distance violations are identified by a 2 m safety threshold; however, the problem of occlusion is not tackled in [15]. Furthermore, in [12,18,20,23,24], the localization results are morphed to the real-world top-view coordinates to calculate the pair-wise distances. The social distance violations are identified by 1, 1.8, and 2 m safety distances. However, the reported results focus on the person detection performance, and they illustrate identifying infractions by only a few qualitative examples. Moreover, the developed systems in [18,20,23,24] do not mitigate the problem of occlusions or missing detections. These are major limitations, and tracking with privacy preservation is an essential remedy [28]. In [21], a centroid tracking algorithm is used to resolve occlusions [29], pair-wise distances are computed, and violations are identified by a 1.8 m safety threshold. However, the performance evaluation is carried out on a single video with only two people in it. This restricts generalizing the system's efficacy and its applicability to real-life scenarios. Moreover, inter-personal distances are computed in [6] and the violations are identified at three safety levels; 1, 1.8, and 3.6 m. The study concludes that incomplete or over-sized bounding boxes introduce significant errors to the distance calculation; hence, selecting an appropriate person detector is paramount to the system's feasibility and success. Finally, in [26], pair-wise distances are approximated through the estimated body joints and social distance infractions are identified by a 2 m threshold.
The reviewed literature shows a discrepancy in the safety distance selection for detecting social distance violations. This inconsistency hinders fair comparisons, but it has motivated us to test the proposed system applicability across a wide-range of safety distances and to utilize various performance measures.

Crowd Monitoring
Crowd monitoring aims to attain a high-level understanding of crowd behavior by processing the scene in a global or local manner [30]. Macroscopic methods such as crowd density, crowd counting, and flow estimation, neglect the local features and focus on the scene as a whole [31,32]. In contrast, microscopic techniques start by detecting individual subjects and then group their statistics to summarize the crowd state [33]. These two approaches are complementary in terms of the efficiency/accuracy trade-off. In other words, macroscopic techniques are efficient in handling high-density crowds, while microscopic methods are accurate for sparse groups [31].
An approach to analyze the crowd and social distancing behavior from UAV-captured videos is proposed in [31]. Discrete density maps are generated to classify the crowd state in each aerial frame patch as dense, medium, sparse, or empty. In addition, a microscopic technique is employed to detect, track, and compute inter-personal distances. In [34], crowd counting and action recognition techniques are reviewed in the scope of social distancing. The study suggests that density-based approaches are preferred due to their inherent error suppression, in which the contribution of faulty counts or missing detections is insignificant to the long-term-averaged density map. Moreover, pedestrians' spatial patterns are captured in [6] by long-term occupancy and crowd density maps. The former describes the spatial signature exerted by the subjects in the surveilled scene, while the latter encodes the spatial impression of social distance infractions [35]. Similarly, heatmaps are generated in [13,26,36] to represent the regions in which social distance violations are frequent. These studies demonstrate that short and long-term occupancy/crowd density maps are important to identify high-risk regions in the scene. In addition, they allow a quantification of the pedestrians' compliance with social distancing guidelines [6].

Methodology
The proposed social distance estimation and crowd monitoring system is depicted in Figure 1. The model comprises the following stages:


1. Read a frame from the surveillance camera. This component can be adjusted to skip/drop frames when high-resolution and/or high-frame-rate cameras are used.
2. Detect human subjects in the input frame and compute their positions. The position of each detected subject is estimated as a single point.
3. Discard any localized positions outside a selected region of interest (ROI). The ROI is defined by the user beforehand and typically encloses the ground plane.
4. Transform the localized positions from the image-pixel coordinates to the real-world coordinates. This provides a top-view depiction of the subjects' positions.
5. Smooth the noisy top-view positions and compensate for missing data due to occlusion with tracking.
6. Estimate the inter-personal distances among the detected subjects and the occupancy/crowd density maps.
7. Recognize social distance violations and identify congested or overcrowded regions in the scene.
8. Integrate the smoothed/tracked positions, estimated parameters, and detected anomalies with the video frame.
9. View the integrated video frame and generate a dynamic top-view map for the scene. This component allows adjusting the type and amount of appended information.
The proposed system design process is governed by the following requirements:
• High accuracy and reliability, in terms of robustness to noise and missing data.
• Light weight for implementation and deployment.
• Modularity to facilitate maintenance, upgrades, and decentralization, and to avoid resource allocation bottlenecks.
• Privacy preservation by neither carrying nor processing person-specific features.
• Robustness against different vertical pose states and actions, e.g., standing, sitting, bowing, bending, walking, and cycling.
The remaining subsections discuss and detail each stage in the proposed system. We use an example video frame from the EPFL-MPV dataset to illustrate the outcome of each stage; see Section 4.1 for more details on the dataset.

Person Detection and Localization
Given an input video frame, we detect and localize human subjects using a pose estimation technique, because object detection models can yield incomplete or over-sized bounding boxes and they do not offer rich information [6].

Detection
We utilize OpenPose to detect and estimate human poses in the input video frame. Specifically, OpenPose estimates and connects the body joints using part affinity fields [37]. Let N and M be the total number of true and detected subjects in the video frame, respectively, and let J_m, m ∈ [1, M], be the set of estimated joints for each detected subject, where J_m = {j_j^m : j ∈ [1, 25]}. Ideally, OpenPose yields 25 joints for each detected subject, but we recognize that some might not be detected due to various reasons. This results in some empty entries in J_m, but does not change the indexing scheme. Moreover, to model a realistic scenario, we assume that N and M are not necessarily equal, i.e., the number of detected subjects can be less than, equal to, or more than the true number of people in a frame. Finally, note that we select OpenPose due to its simplicity and availability, but it can be replaced with any other pose estimation model given the same body joints indexing scheme. Figure 3 shows the pose estimation outcome for an example input frame with five people moving freely in a room. OpenPose yields five detections shown in gray, red, orange, green, and blue with 13, 22, 20, 17, and 8 connected joints, respectively. The gray and blue poses are incomplete because of partial occlusion and missing data.

Localization
We select the midpoint of the feet of each subject as the anchor to localize their position, also known as the ground position. The selected point offers reliable estimation because: (1) it is independent of the subject's height, width, and orientation; (2) it lies on the ground; thus, homography transformation is possible; (3) it has a clear definition when compared to bounding boxes; (4) it carries no person-specific information; hence, privacy is preserved.
In [26], given the non-empty set of feet joints {j_12^m, j_25^m} and the condition #J_m ≥ 13, the ground position of subject m is estimated as follows:

  [u_m, v_m] = Σ{j_12^m, j_25^m} / #{j_12^m, j_25^m},  (1)

  subject to {j_12^m, j_25^m} ≠ ∅ ∧ #J_m ≥ 13,  (2)

where j ∈ [1, 25] indexes the joints and # denotes the number of non-empty elements in the set. We call this approach the basic ground position estimation and argue that it is inadequate because the constraints are quite restrictive. For instance, Equation (1) assumes human subjects with perfect vertical orientation, which may not be the case. In addition, in Equation (2), the sole reliance on detecting any foot joint and the required minimum number of joints limit its applicability in real-life scenarios. In fact, this approach estimates the ground position only when information is abundant. Therefore, we propose a localization strategy that eliminates the basic position pitfalls and relaxes its restrictions and constraints.
Algorithm 1 explains the proposed localization strategy. First, we eliminate the conditions mandated by the basic approach and expand the search space to include the subject's feet, knees, hips, and torso. In particular, for the horizontal coordinate u_m, we leverage the joints' left/right symmetry by averaging the horizontal positions of two opposing joints. For instance, u_2 and u_3 in Figure 2 are computed by the first case, while u_1 is found by the seventh case using the hip joints, i.e., u_10^1 and u_13^1. Moreover, for the vertical coordinate v_m, we relax the requirement for detecting the feet joints by exploiting average human skeletal characteristics. More specifically, we use the ratio between the torso and lower body lengths, (0.85/0.6), to infer the ground position's vertical coordinate [26]. Finally, regardless of the approach, we discard any estimated positions outside the user-defined ROI; see Figure 2.
Algorithm 1 evaluates the following cases, in order, to estimate the horizontal coordinate u_m: (1) both feet joints are available; (2) the left foot and right knee joints are available; (3) the left knee and right foot joints are available; (4) both knees' joints are available; (5)-(6) a hip joint and an opposing lower-body joint are available; (7) both hip joints are available; (8) the torso's joints are available; or (9) any available feet joints are considered. In each case, u_m is set to the average horizontal position of the selected joints, and the flag F_u^m records how the estimate was obtained; otherwise, u_m = ∅ and F_u^m = 0. The vertical coordinate v_m is estimated from any available feet joints or, failing that, inferred from the torso's joints using the torso-to-lower-body lengths ratio; otherwise, v_m = ∅ and F_v^m = 0.
The proposed localization strategy is driven by the argument that noisy measurements with known error states are more valuable than no measurements at all. In other words, if we predict the subject's ground position and supplement it with the state of available information, we can append each prediction with a flag describing its integrity, or confidence level. In this work, we coin this concept by forming the error state flags F_u^m and F_v^m in the following manner:
• F_u^m (F_v^m) = 0: u_m (v_m) is not available, regardless of the reason.
• F_u^m (F_v^m) = 1: subject is detected and u_m (v_m) is measured directly.
• F_u^m (F_v^m) = 2: subject is detected and u_m (v_m) is predicted using other joints.
Similarly, an overall localization error flag F_m is constructed for each detected subject m by combining F_u^m and F_v^m. (3)
Figure 4a demonstrates the basic and proposed localization results using the estimated poses in Figure 3. In addition, it shows the selected ROI in cyan, which encloses the floor plane in the scene. By examining Figure 4a, one notes that both localization strategies yield valid estimates when supplied with a sufficient number of connected joints. However, the proposed approach is more accurate since it does not assume perfect vertical orientation. Moreover, it mitigates partial occlusion by inferring the position's vertical coordinate using the torso-to-lower-body lengths ratio; see the estimated position in gray. Nonetheless, both strategies are limited, because they cannot resolve the ground position when information is scarce or completely missing. For instance, neither can localize the fifth subject, the one with the blue pose in Figure 3, because only a few of its joints were detected.
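The case cascade above can be sketched in code; this is a minimal sketch, not the paper's exact implementation. The feet (12, 25) and hip (10, 13) indices follow the 1-indexed scheme used in the text, while the knee and neck indices, the reduced case set, and the vertical extrapolation rule are illustrative assumptions.

```python
import numpy as np

# 1-indexed joint ids: feet (12, 25) and hips (10, 13) follow the text;
# the knee (11, 14) and neck (2) ids are illustrative assumptions.
FEET, KNEES, HIPS, NECK = (12, 25), (11, 14), (10, 13), 2
TORSO_RATIO = 0.85 / 0.6  # torso-to-lower-body lengths ratio from the text


def localize(joints):
    """Estimate one subject's ground position with error state flags.

    joints: dict {joint_id: (u, v)} with missing joints simply absent.
    Returns (u, v, F_u, F_v); a flag is 0 (unresolved), 1 (direct), or
    2 (inferred from other joints)."""
    def get(ids):
        return [joints[i] for i in ids if i in joints]

    # Horizontal coordinate: exploit left/right symmetry, feet first.
    u, f_u = None, 0
    for pair, flag in ((FEET, 1), (KNEES, 2), (HIPS, 2)):
        pts = get(pair)
        if pts:
            u, f_u = float(np.mean([p[0] for p in pts])), flag
            break

    # Vertical coordinate: lowest foot joint if available; otherwise
    # extrapolate below the hips with the torso-to-lower-body ratio.
    v, f_v = None, 0
    feet, hips = get(FEET), get(HIPS)
    if feet:
        v, f_v = max(p[1] for p in feet), 1
    elif hips and NECK in joints:
        hip_v = float(np.mean([p[1] for p in hips]))
        torso = hip_v - joints[NECK][1]  # torso length in pixels
        v, f_v = hip_v + TORSO_RATIO * torso, 2
    return u, v, f_u, f_v
```

Averaging two opposing joints recovers u_m under left/right symmetry, and the hip-plus-torso fallback supplies v_m when both feet are occluded; the resulting flags feed the adaptive smoothing stage.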

Top-View Transformation
Let us assume that the surveillance camera is placed at height h and oriented with pan and tilt angles θ_p and θ_t, respectively. The transformation from a three-dimensional position in the real-world coordinates to its corresponding two-dimensional (2D) position in the image-pixel coordinates, [x, y, z] → [u, v], is expressed as follows:

  α_s [u, v, 1]^T = K [R T_0] [x, y, z, 1]^T,  (4)

where α_s is the image-to-real distance scale, K ∈ R^(3×3) is the camera intrinsic parameter matrix which maps the camera coordinates to the image coordinates, [R T_0] maps the real-world coordinates to the camera coordinates, R ∈ R^(3×3) is a rotation matrix that compensates for the camera orientation (θ_p and θ_t), and T_0 ∈ R^(3×1) is a translation vector which deals with the camera position and height. Since we are only concerned with transforming the subjects' ground positions from the image coordinates [u_m, v_m] to the real-world ground plane [x_m, y_m], Equation (4) simplifies to:

  α_s [u_m, v_m, 1]^T = H [x_m, y_m, 1]^T,  (5)

where H ∈ R^(3×3) is the camera homography matrix. This transformation results in a top-view depiction of the subjects' real-world positions; see Figure 4e.
In this work, we assume the homography matrix H and the image-to-real distance scale α_s to be known for simplicity; however, they can be obtained by GPS and accelerometers [38][39][40], determined by calibration [41,42], inferred from the computed poses [26,27], or estimated by a four-point perspective transformation [43].
Figure 4. The color notation of Figure 3 is dropped in (c,d,g,h) to preserve privacy and to emphasize the recognition of a social distance infringement; red/green indicates the presence/absence of subjects violating the defined social safety distance.
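Since H and α_s are assumed known, the mapping in Equation (5) can be inverted to recover the top-view coordinates. A minimal sketch, where the homography values are arbitrary placeholders:

```python
import numpy as np

def to_top_view(points_uv, H_inv, alpha_s=1.0):
    """Map image-pixel ground positions to real-world ground-plane
    coordinates by inverting Equation (5): [x, y, 1]^T ~ H^-1 [u, v, 1]^T.

    points_uv: (M, 2) array of [u, v] pixels; returns (M, 2) [x, y]."""
    uv1 = np.hstack([points_uv, np.ones((len(points_uv), 1))])
    xyw = uv1 @ H_inv.T
    xy = xyw[:, :2] / xyw[:, 2:3]   # perspective division
    return alpha_s * xy             # apply the image-to-real scale

# Placeholder homography (pure scaling, no perspective), for illustration:
H = np.diag([2.0, 2.0, 1.0])
top = to_top_view(np.array([[100.0, 50.0]]), np.linalg.inv(H))
```

The perspective division handles the general projective case; with a real camera homography, the third homogeneous component varies across the image.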

Smoothing and Tracking
The top-view transformed ground positions are noisy and suffer from missing values. The former is due to uncertainties and errors in the localization technique, while the latter comes from occlusions. In this section, we formulate the estimated positions' temporal evolution by a constant velocity model. Afterwards, we compensate for localization errors and missing measurements by a linear Kalman filter (KF) and a global nearest neighbor (GNN) tracker.

State and Measurement Models
Let x_m,t = [x_m,t, ẋ_m,t, y_m,t, ẏ_m,t]^T be the state vector of subject m that defines its ground position and velocity at frame t. Assuming constant velocity, x_m,t and its measured counterpart y_m,t are expressed as follows [44]:

  x_m,t = F x_m,t−1 + ω_m,t,

  y_m,t = H x_m,t + ν_m,t,

where F is a constant state transition matrix from x_m,t−1 to x_m,t, H is a constant state-to-measurement matrix, ω_m,t ∼ N(0, Q_m,t), and ν_m,t ∼ N(0, R_m,t).
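For concreteness, the constant-velocity model corresponds to the following F and H; the unit frame interval Δt = 1 is an assumption for illustration.

```python
import numpy as np

dt = 1.0  # frame interval; a unit time step is assumed for illustration

# State x = [x, x_dot, y, y_dot]^T evolves as x_t = F x_{t-1} + w_t,
# and only the positions are measured: y_t = H x_t + v_t.
F = np.array([[1, dt, 0, 0],
              [0,  1, 0, 0],
              [0,  0, 1, dt],
              [0,  0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)

x = np.array([0.0, 2.0, 5.0, -1.0])  # at (0, 5), moving (2, -1) per frame
x_next = F @ x                       # state after one frame
z = H @ x_next                       # measured position after one frame
```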

The Linear Kalman Filter
The KF offers an optimal estimate for x_m,t given the measurement y_m,t by following the process depicted in Figure 5. First, given a previous (or initial) posterior estimate x̂_m,t−1 with error covariance P_m,t−1, the KF predicts a prior estimate x̂⁻_m,t and computes its error covariance P⁻_m,t. Afterwards, it calculates the posterior estimate x̂_m,t with error covariance P_m,t using a Kalman filter gain K_m,t. Finally, the process repeats using x̂_m,t and P_m,t as inputs to the state prediction stage.

State Prediction

(1) Project the state vector: x̂⁻_m,t = F x̂_m,t−1.

(2) Project the error covariance: P⁻_m,t = F P_m,t−1 F^T + Q_m,t.

By examining the Kalman gain equation in the measurement correction stage in Figure 5, one notes that increasing/decreasing R_m,t decreases/increases the reliance of x̂_m,t on the measurement y_m,t. In this work, we control this mechanism by adjusting the variance σ²_m,t in R_m,t according to the overall localization error flag F_m,t [45]. In other words, the measurement error variance is adapted to smooth the estimated positions according to their appended quality. Consequently, the KF reduces the localization noise and can offer posterior estimates when the measurement is missing [45]. Nevertheless, the KF equations require knowing the correspondence between the detections/predictions at consecutive frames. This is generally tackled via multiple object tracking (MOT) approaches such as the global nearest neighbor (GNN) algorithm.
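The predict/correct cycle with flag-adapted measurement noise can be sketched as follows. The tenfold variance inflation for predicted (flag = 2) measurements is an illustrative choice, not the adaptation rule optimized in the paper.

```python
import numpy as np

def kf_step(x, P, y, flag, F, H, Q, sigma2=1.0):
    """One adaptive Kalman filter iteration for a single track.

    flag is the overall localization error flag: 1 (direct measurement),
    2 (inferred, hence less trusted), 0 (missing: prediction only)."""
    x_prior = F @ x                       # project the state vector
    P_prior = F @ P @ F.T + Q             # project the error covariance
    if flag == 0 or y is None:            # no measurement: keep the prior
        return x_prior, P_prior
    # Adapt the measurement noise to the flag (10x inflation is assumed).
    R = np.eye(2) * (sigma2 if flag == 1 else 10.0 * sigma2)
    S = H @ P_prior @ H.T + R             # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_post = x_prior + K @ (y - H @ x_prior)
    P_post = (np.eye(len(x)) - K @ H) @ P_prior
    return x_post, P_post

# Constant-velocity model with a unit frame interval.
F = np.array([[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]], float)
H = np.array([[1, 0, 0, 0], [0, 0, 1, 0]], float)
Q = 0.01 * np.eye(4)
x, P = kf_step(np.zeros(4), np.eye(4), np.array([1.0, 1.0]), 1, F, H, Q)
```

With a trusted measurement (flag = 1) the posterior is pulled toward it; with flag = 0 the filter simply propagates the prior, which is how missing detections are carried over.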

Global Nearest Neighbor Tracking
GNN is a real-time light-weight MOT solution that tracks objects by assigning detections/predictions to tracks and by maintaining its track record [46]. It solves the assignment task by minimizing the following cost function:

  min Σ_{m=1}^{M} Σ_{q=1}^{Q} α_m,q C_m,q, subject to Σ_{q=1}^{Q} α_m,q ≤ 1 and Σ_{m=1}^{M} α_m,q ≤ 1,  (9)

where M is the number of detected subjects, Q is the number of maintained (or initiated) tracks, C_m,q is the cost of assigning detection m to track q, and α_m,q ∈ {0, 1} such that if detection m is assigned to track q, then α_m,q = 1, otherwise α_m,q = 0. The constraints in Equation (9) ensure that each detection can be assigned to only one track and vice versa. The GNN defines the assignment cost C_m,q in Equation (9) as follows:

  C_m,q = D(y_m,t, ŷ_q,t)² + log|H P⁻_q,t H^T + R_q,t|, for D(y_m,t, ŷ_q,t)² ≤ γ_g,  (10)

where ŷ_q,t = H x̂⁻_q,t is the estimated measurement with error covariance H P⁻_q,t H^T + R_q,t, D(y_m,t, ŷ_q,t) is the Mahalanobis distance between y_m,t and ŷ_q,t, log|X| is the natural logarithm of the determinant of X, and γ_g is a gating threshold that reduces unnecessary computations; it selects only detections that are close to predictions. In this work, we solve the GNN assignment problem in Equation (9) using the optimal Munkres algorithm [47,48].

Figures 4b and 4f present the smoothed/tracked ground positions in the image-pixel and real-world coordinates, respectively. In addition, we overlay the plots with the original localization results in Figures 4a and 4e to visualize the role of smoothing and tracking. By examining the results, one notes that the KF corrects the predicted position in gray and makes it closer to the subject's actual location. In addition, the fifth subject's unresolved position, caused by missing information, is now compensated for by the GNN; see the predicted position in blue. In summary, the smoothing and tracking stage lowers the localization error through the KF and corrects for the missing measurements with the GNN. Note that this stage preserves privacy and is intended for data correction rather than conventional tracking; hence, we are not concerned with the re-identification problem nor the subjects' particular identities.
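The assignment step can be sketched with a Munkres-style solver; here SciPy's linear_sum_assignment stands in for the Munkres algorithm, and the gate value is an illustrative chi-square-style threshold.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def gnn_assign(detections, track_preds, track_covs, gate=9.0):
    """Assign detections to tracks by minimizing the summed GNN cost.

    detections: (M, 2) measurements; track_preds: (Q, 2) predicted
    measurements; track_covs: (Q, 2, 2) innovation covariances.
    Returns a list of (detection_index, track_index) pairs."""
    M, Q = len(detections), len(track_preds)
    cost = np.full((M, Q), np.inf)
    for q in range(Q):
        S_inv = np.linalg.inv(track_covs[q])
        logdet = np.log(np.linalg.det(track_covs[q]))
        for m in range(M):
            d = detections[m] - track_preds[q]
            maha2 = d @ S_inv @ d      # squared Mahalanobis distance
            if maha2 <= gate:          # gating prunes unlikely pairs
                cost[m, q] = maha2 + logdet
    # Make the matrix finite for the solver, then drop gated-out pairs.
    big = 1e9
    rows, cols = linear_sum_assignment(np.where(np.isinf(cost), big, cost))
    return [(m, q) for m, q in zip(rows, cols) if cost[m, q] < big]
```

Detections left unassigned after gating would initiate new tracks, and tracks without detections would coast on their KF prior.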

Parameter Estimation
The crowd state, in terms of social distancing behavior and congestion, is estimated by computing the inter-personal distances and the occupancy/crowd density maps.

Inter-Personal Distance
The instantaneous pair-wise Euclidean distance between subjects i and j is expressed as:

  d_i,j,t = √[(x_i,t − x_j,t)² + (y_i,t − y_j,t)²].

Given a social safety distance r, the instantaneous number of violations is computed by:

  V_t = Σ_{i=1}^{N_t} 1[min_{j≠i} d_i,j,t ≤ r],  (13)

where N_t is the number of estimated/tracked people in frame t, 1[·] is the indicator function, and V_t counts the number of subjects that are r or less apart from each other; see Figures 4d and 4h.
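A vectorized sketch of the distance computation and violation count; note that, as in the text, subjects are counted rather than pairs.

```python
import numpy as np

def count_violations(positions, r):
    """Count subjects whose nearest neighbor is r or less away, on the
    top-view ground plane. positions: (N, 2) real-world coordinates."""
    n = len(positions)
    if n < 2:
        return 0
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))  # pair-wise Euclidean distances
    np.fill_diagonal(dist, np.inf)       # ignore self-distances
    return int((dist.min(axis=1) <= r).sum())

pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
violations = count_violations(pts, r=2.0)  # the close pair, not the loner
```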

Occupancy and Crowd Density Maps
The occupancy density map (ODM) encodes the spatial patterns exerted by the subjects in the surveilled environment [6]. It is formed by summing and averaging Gaussian functions centered at the subjects' ground positions, i.e.:

  O(x, y) = (1/T) Σ_{t=1}^{T} Σ_{i=1}^{N_t} G(x − x_i,t, y − y_i,t),

  G(x, y) = 1/(2πδ²) exp(−(x² + y²)/(2δ²)),

where O(x, y) is the averaged ODM, T is the current frame number (or total number of frames), G(x, y) is a 2D symmetric Gaussian function, and δ controls the spatial resolution of the map. Similarly, the crowd density map (CDM) offers a spatial signature for the social distance infringements in the scene [35]. It is formulated by imposing the safety distance constraint as follows:

  C(x, y) = (1/T) Σ_{t=1}^{T} Σ_{i=1}^{N_t} ψ_i,t G(x − x_i,t, y − y_i,t),

where C(x, y) is the averaged CDM and ψ_i,t is a binary mask that is 1 or 0 if subject i violates or follows the social safety distance r, respectively. Figures 4c and 4g show the instantaneous ODM in the image-pixel and real-world coordinates, respectively. In addition, we superimpose the smoothed/tracked localization results and the computed inter-personal distances in both domains. Moreover, Figures 4d and 4h illustrate the instantaneous CDM in the image-pixel and real-world coordinates, respectively.
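Both maps reduce to accumulating normalized Gaussians over the top-view grid; a minimal sketch, where the grid size and δ are illustrative:

```python
import numpy as np

def density_map(positions, shape, delta=2.0):
    """Accumulate unit-volume Gaussians centered at top-view positions:
    the instantaneous ODM. Passing only the positions of subjects in
    violation yields the instantaneous CDM instead.

    positions: iterable of (x, y) grid coordinates; shape: (rows, cols)."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    out = np.zeros(shape, dtype=float)
    for x0, y0 in positions:
        g = np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * delta ** 2))
        out += g / (2 * np.pi * delta ** 2)  # normalize to unit mass
    return out
```

Averaging the per-frame maps over T frames yields the long-term O(x, y) and C(x, y).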

Anomaly Recognition
We define an irregularity in the surveillance video by the presence of social distance infractions and overcrowded, or congested, regions. We treat the first task as a classification problem by forming the binary label S_t as follows:

  S_t = 1 if V_t > 0, and S_t = 0 otherwise.

Moreover, we consider the second task as a segmentation problem where we identify overcrowded areas in the scene by thresholding the averaged CDM as follows:

  R(x, y) = 1 if C(x, y) ≥ γ_m, and R(x, y) = 0 otherwise,  (19)

where γ_m is selected to keep 50% of the energy in C(x, y).
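Both tests can be sketched as follows; interpreting the 50% "energy" criterion as the map's summed mass is an assumption.

```python
import numpy as np

def frame_label(violations_count):
    """Binary anomaly label S_t: 1 if any violation occurs in frame t."""
    return int(violations_count > 0)

def overcrowded_mask(cdm, energy=0.5):
    """Threshold the averaged CDM so the kept region retains the given
    fraction of the map's total mass; gamma_m is the smallest kept value."""
    vals = np.sort(cdm.ravel())[::-1]  # map values, descending
    cum = np.cumsum(vals)
    gamma_m = vals[np.searchsorted(cum, energy * cum[-1])]
    return cdm >= gamma_m
```

Sorting the map values and cutting the cumulative sum at the target fraction keeps the densest cells first, which matches the intent of highlighting the highest-risk regions.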

Performance Evaluation
The social distance estimation and crowd monitoring system is evaluated in terms of its ability to detect human subjects, localize their positions, recognize social distance violations, estimate crowd density maps, and identify overcrowded regions in surveillance videos.
Let N_t and N̂_t be the true and estimated/tracked number of people in frame t. The averaged person detection rate (PDR) compares N̂_t with N_t, while the localization relative error (Equation (21)) compares the estimated ground coordinates (x̂_i,t, ŷ_i,t) with their true counterparts (x_i,t, y_i,t) for each subject i at frame t. We associate the estimated positions with their true counterparts using the optimal Munkres algorithm [47,48]. Moreover, given the true and predicted binary outputs S_t and Ŝ_t, respectively, we assess the detection of social distance violations by accuracy, precision, recall, and the F1-score:

  Accuracy = (TP + TN)/(TP + TN + FP + FN), Precision = TP/(TP + FP),

  Recall = TP/(TP + FN), F1 = 2 × Precision × Recall/(Precision + Recall),

where TP, TN, FP, and FN are the true positives, true negatives, false positives, and false negatives, respectively. Furthermore, we complement the former evaluations by computing the averaged violations count rate (VCR) from the true and predicted counts V_t and V̂_t, respectively; see Equation (13). Finally, we evaluate the quality of the averaged CDM by Pearson's correlation coefficient (CORR) and assess the identified overcrowded regions using the intersection over union (IOU):

  IOU = |R ∩ R̂| / |R ∪ R̂|,

where R(x, y) and R̂(x, y) are the true and predicted thresholded averaged CDM, respectively; see Equation (19).
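The classification and segmentation measures follow their standard definitions; a sketch (the PDR and VCR formulas are specific to the paper and omitted here):

```python
import numpy as np

def classification_metrics(s_true, s_pred):
    """Accuracy, precision, recall, and F1 for binary violation labels."""
    s_true, s_pred = np.asarray(s_true), np.asarray(s_pred)
    tp = int(np.sum((s_true == 1) & (s_pred == 1)))
    tn = int(np.sum((s_true == 0) & (s_pred == 0)))
    fp = int(np.sum((s_true == 0) & (s_pred == 1)))
    fn = int(np.sum((s_true == 1) & (s_pred == 0)))
    acc = (tp + tn) / len(s_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

def iou(mask_true, mask_pred):
    """Intersection over union of the thresholded averaged CDMs."""
    inter = np.logical_and(mask_true, mask_pred).sum()
    union = np.logical_or(mask_true, mask_pred).sum()
    return inter / union if union else 1.0
```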

Dataset
We utilize the EPFL-MPV, EPFL-Wildtrack, and OxTown public datasets along with the pose estimations prepared in [26]. The EPFL-MPV is comprised of four sequences, named 6p-c0, 6p-c1, 6p-c2, and 6p-c3, of six people moving freely in a room [49]. The sequences are synchronized and view the same environment but from different perspectives. Each sequence is recorded at 25 frames per second (fps) and has 2954 frames. The EPFL-Wildtrack contains seven synchronized sequences, named C1-C7, with approximately 20 people moving outdoors [50]. The sequences view walking pedestrians outside the main building of ETH Zurich in Switzerland. They are shot using seven cameras positioned at different locations, and each has a total of 400 frames. Lastly, the OxTown is a street surveillance video with 4501 frames shot with a single camera at 25 fps. It oversees, on average, 16 people walking down a street in Oxford, England [51].

Preprocessing and Settings
The utilized datasets offer annotations in terms of bounding boxes that localize people in the scene. Additionally, they provide the homography matrix and the image-to-real distance scale of each recording camera. The EPFL-MPV and OxTown bounding boxes are vertically over-sized and enclose more than the areas occupied by the human subjects. Therefore, their bottom mid-points are lower than the subjects' actual ground positions. In this work, we correct for this by shifting the mid-points up by a percentage of the bounding box total height. Specifically, we apply a 10% and 2% uplift to the EPFL-MPV and OxTown localization data, respectively. Moreover, the OxTown dataset annotation includes bounding boxes for babies in strollers/prams accompanied by adults. This is outside the scope of our work; hence, we discard them (this corresponds to subject IDs 24, 42, 44, 45, and 47). Finally, the ROI for each dataset/sequence is manually selected, in the image-pixel domain, to cover the floor of the scene. The ROIs include most annotated positions, but we discard the remaining few that are outside the selected area. This corresponds to excluding 2.38% (960 out of 40,393), 6.67% (4767 out of 71,460), and 15% (6403 out of 42,721) of the EPFL-MPV, EPFL-Wildtrack, and OxTown annotations, respectively. The proposed system's smoothing and tracking parameters are found for every dataset/sequence by minimizing the localization error in Equation (21) using the Bayesian optimization algorithm in MATLAB; see Table 1. The optimization is executed for 500 iterations using the expected improvement plus acquisition function and repeated five times for verification [52].
Figure 6 illustrates three examples of integrating the proposed system outputs and displaying them on the user interface unit. These examples offer complementary interpretations of the scene and serve different purposes depending on the intended application or required analysis.
For instance, in Figure 6a, the input video frame, depicted in Figure 4, is overlaid with the localization and averaged ODM results. This type of display is important when monitoring crowds in public areas or for analyzing customers' browsing habits and preferences in shops. Moreover, we show in Figure 6b that the former information can be replaced with the detected social distance violations and the averaged thresholded CDM. This example is directly intended for social distance monitoring applications and can be used to oversee critical waiting areas, e.g., in airports and hospitals. Furthermore, Figure 6c demonstrates a dynamic top-view map of the scene by plotting the localization, inter-personal distances, and the averaged CDM in real-world coordinates. This figure serves as a footprint for redesigning congested areas and facilitates the development of physical interaction protocols and guidelines. Finally, apart from these applications, one can merge and/or adjust the type and amount of displayed information. In addition, the user is able to view one or multiple integrated frames, or top-view maps, simultaneously, thus obtaining valuable information about the scene and crowd state. The supplementary material of this paper includes videos of the system integration outcome for other video sequences.

Figure 7 demonstrates the social distance violation detection performance of the basic and proposed approaches in terms of accuracy, F1-score, and VCR. In addition, it shows their IOU for identifying the overcrowded regions in the scene. The results are computed for a range of safety distances and averaged across all video sequences. We vary the safety distance from 1 to 2.5 m in 0.05 m steps to cover a wide range of guidelines.
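Two of the geometric steps underlying this evaluation, the bounding-box mid-point uplift from the preprocessing stage and the homography-based inter-personal distance check, can be sketched as follows. This is a minimal illustration under our own naming, not the authors' implementation; the function names and the identity-homography default are ours.

```python
import numpy as np
from itertools import combinations

def ground_point(box, uplift=0.10):
    """Approximate a subject's ground position from an over-sized bounding box.

    box: (x, y, w, h) with (x, y) the top-left corner in image pixels.
    uplift: fraction of the box height by which the bottom mid-point is
            raised (10% for EPFL-MPV and 2% for OxTown in the text).
    """
    x, y, w, h = box
    # Bottom mid-point, shifted up to compensate for the vertical over-sizing.
    return np.array([x + w / 2.0, y + h - uplift * h])

def to_ground_plane(points_px, H, scale=1.0):
    """Project image-pixel positions to real-world ground-plane coordinates
    using the dataset-provided homography H and image-to-real scale."""
    pts = np.hstack([np.asarray(points_px, float), np.ones((len(points_px), 1))])
    mapped = (H @ pts.T).T
    return scale * mapped[:, :2] / mapped[:, 2:3]  # normalize homogeneous coords

def violations(points_m, safety_distance=2.0):
    """Return index pairs whose inter-personal distance falls below the
    safety distance."""
    return [(i, j) for i, j in combinations(range(len(points_m)), 2)
            if np.linalg.norm(points_m[i] - points_m[j]) < safety_distance]
```

Sweeping `safety_distance` from 1 to 2.5 m in 0.05 m steps reproduces the kind of evaluation range described above.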
Moreover, Table 2 illustrates the system's capacity to detect human subjects, localize their positions, recognize social distance violations, estimate crowd density maps, and identify high-risk areas in each video sequence; it summarizes the PDR, localization error, accuracy, F1-score, precision, recall, VCR, CORR, and IOU. The results are averaged across the range of safety distances, and we assess the performance gain delivered by the smoothing and tracking stage.

The trends in Figure 7 indicate that the accuracy, F1-score, and IOU increase with the safety distance, whereas the VCR is stable for the proposed approach and decreases for the basic method. Additionally, they depict the performance gain delivered by the proposed system. Specifically, the boost in accuracy, F1-score, VCR, and IOU is up to 5.8%, 9.5%, 7.6%, and 10.7%, respectively. Furthermore, examining the results in Table 2 reveals a clear advantage for the proposed system, as it yields the best overall performance across all measures except precision, which is traded off slightly to balance precision and recall. Specifically, it offers the highest person detection rates and lowest localization errors for all video sequences, with gains of up to 43% and 38.3%, respectively. Similarly, it results in better social distance violation recognition and raises the conventional method's accuracy, F1-score, and VCR by 17%, 9.6%, and 39%, respectively. Moreover, the quality of the estimated crowd density maps, in terms of correlation, is high for both techniques because the contribution of faulty detections is insignificant to the long-term averaged estimation. However, this is not the case when identifying high-risk regions: the results show an improvement of up to 12.4% in the IOU of the proposed method; hence, it is more reliable. Finally, Table 2 emphasizes the role of the smoothing and tracking stage, which offers a considerable improvement owing to its treatment of occlusions and missing data.
In particular, it balances the system's efficacy, by reducing the difference between precision and recall, and expands its functionality to cover various tasks and application domains.

Table 3 shows a comparison between the proposed system, the basic pose-based approach from [26], and an object detection-based system developed in [15]. The comparison focuses on the systems' ability to detect social distance violations in the OxTown dataset with a 2 m social safety distance. Note that since the compared solutions do not utilize tracking, we demonstrate the proposed system's results with and without the smoothing/tracking stage. In addition, we illustrate example results in Figure 8 to visualize the proposed system outcomes. The results in Table 3 verify the proposed system's applicability and the adequacy of pose-based techniques for detecting social distance infractions. They indicate a 4.6% and a 3% gain in accuracy and F1-score, respectively, when compared to the object detection-based method in [15]. In addition, they affirm the role of the smoothing and tracking stage, which raises the proposed system's accuracy and F1-score by 0.9% and 0.5%, respectively.

Table 3. Social distance violation detection performance for the OxTown dataset with a 2 m safety distance; the results of [15] are taken from Table 6 in [15]. The proposed approach is compared with and without the smoothing/tracking stage (S/T). Best results are in bold to ease interpretation, and results used in the discussion are in brackets.
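The IOU used throughout to score high-risk region identification can be illustrated with a short sketch: a long-term averaged crowd density map is thresholded into a binary high-risk mask and compared against the reference mask. The function name and threshold are illustrative assumptions, not the paper's code.

```python
import numpy as np

def high_risk_iou(cdm_est, cdm_ref, threshold):
    """Threshold estimated/reference averaged crowd density maps into binary
    high-risk masks and return their intersection-over-union."""
    est, ref = cdm_est >= threshold, cdm_ref >= threshold
    union = np.logical_or(est, ref).sum()
    # IOU is 1.0 by convention when neither map contains a high-risk region.
    return np.logical_and(est, ref).sum() / union if union else 1.0
```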

Computational Complexity Analysis
The complexity of the proposed system is measured by its frame rate (the number of video frames processed per second) and its processing rate (the processing time per frame). The assessment is conducted via Monte-Carlo simulations, where we run the model depicted in Figure 1 on all video frames and repeat the process ten times for validation. Note that we exclude the complexity of OpenPose since we use the poses precomputed in [26]. Nevertheless, OpenPose's real-time operation on both CPU and GPU machines was verified in [37,53]. In addition, we select OpenPose due to its simplicity and availability, but it can be replaced with any other pose estimation model that follows the body joints indexing scheme described in Section 3.1.1. We use a desktop equipped with two Intel Xeon E5-2697V2 x64-based processors, 192 GB of memory, and MATLAB R2020b. Figure 9 demonstrates the developed system's frame and processing rates with respect to the number of detected/tracked subjects. The averaged results suggest that the system is capable of running in real time despite the additional complexity of the smoothing/tracking stage. Specifically, it runs at 106.5 fps (9.9 ms/frame) when relying solely on the proposed localization strategy and at 33.6 fps (44.5 ms/frame) when accommodating the tracking algorithm. Moreover, the results indicate that the complexity of the localization approach depends on the amount of occlusion present in the video frame; see Figure 9a. This is shown by the drop in frame rate when 2-6 people are present and by its slower decline when more than seven people are in the scene. The first drop is caused by the EPFL-MPV dataset, where six subjects move in a highly confined environment resulting in many occlusions, while the second is due to the general increase in the number of people, which escalates the chances of occlusion.
Furthermore, the complexity introduced by the smoothing/tracking stage is demonstrated by the rapid decay in frame rate as the number of subjects increases; see Figure 9b. The trends reveal the system's limited ability to resolve highly dense crowds. In particular, the average frame rate drops below 25 fps (40 ms/frame) and 12 fps (83 ms/frame) when there are more than 10 and 17 people, respectively. These findings highlight the need to distribute the computational load across the surveillance infrastructure. For instance, stages 1-4 in Figure 1 can be performed locally by the camera or on edge devices, while stages 5-9 require more resources.
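The frame-rate and processing-rate measurements above reduce to a repeated timing loop over all frames. A generic sketch of such a benchmark (in Python rather than the MATLAB used in the evaluation, with hypothetical names) is:

```python
import time
import statistics

def benchmark(process_frame, frames, repeats=10):
    """Time `process_frame` over all frames, repeated `repeats` times, and
    return (mean processing rate in ms/frame, mean frame rate in fps)."""
    per_frame_ms = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        for frame in frames:
            process_frame(frame)
        elapsed = time.perf_counter() - t0
        per_frame_ms.append(1000.0 * elapsed / len(frames))
    ms = statistics.mean(per_frame_ms)  # processing rate (ms/frame)
    return ms, 1000.0 / ms              # frame rate (fps)
```

By construction, the two reported quantities are reciprocals, e.g., 33.6 fps corresponds to roughly 1000/33.6 ms per frame.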

Conclusions
The COVID-19 pandemic has made social distancing a critical first line of defense against the virus's spread; nevertheless, safety distance guidelines are not always followed. Monitoring social distancing is important to draw realistic mitigation plans and to structure exit strategies. However, it is a labor-intensive task and suffers from subjective interpretation; therefore, combining computer vision and machine learning models with mass surveillance is a natural route to automation, but it must preserve privacy to ensure ethical adoption and application.
This work presented a privacy-preserving adaptive social distance estimation and crowd monitoring system for surveillance cameras. We evaluated the system's ability to detect human subjects, localize their positions, recognize social distance violations, estimate crowd density maps, and identify high-risk areas. Additionally, we analyzed its computational complexity in terms of processing time. The results indicated a clear advantage for the proposed localization approach when compared to the latest techniques. In addition, they showed a considerable improvement delivered by the adaptive smoothing and tracking stage. Specifically, the system improves the PDR, localization error, accuracy, F1-score, VCR, and IOU by up to 43%, 38.3%, 17%, 9.6%, 39%, and 12.4%, respectively. In addition, it runs at 33.6 fps (44.5 ms/frame), making it a real-time solution for low- to medium-density crowds. The proposed system's occupancy/crowd density map functionality extends its application domain beyond the COVID-19 pandemic to cover other areas. For instance, it can help re-configure or re-design common physical layouts and relocate facilities in businesses to optimally reduce congestion. Additionally, it is capable of facilitating the analysis of customers' browsing habits in shops and quantifying the effectiveness of marketing kiosks.
The developed system, although advantageous, is still limited and can be extended in various ways, such as: (1) estimating the body orientation to relax the assumption of vertically oriented subjects; (2) fusing detections and estimations from multi-view cameras to assess the environment state rather than the camera-specific scenery; (3) developing an automatic online training paradigm for the tracking algorithm parameters; (4) embedding regression techniques to estimate the crowd density maps; and (5) detecting other anomalies such as fire, smoke, unattended objects in public places, and abnormal individual or crowd behavior. These will be the topics of our future research.