DepTSol: An Improved Deep-Learning- and Time-of-Flight-Based Real-Time Social Distance Monitoring Approach under Various Low-Light Conditions

Social distancing is one of the most reliable practices for minimising the spread of coronavirus disease (COVID-19). As new variants of COVID-19 emerge, healthcare organisations are concerned with controlling death and infection rates. Different COVID-19 vaccines have been developed and administered worldwide; however, the quantity of vaccine produced so far is not sufficient to fulfil the needs of the world's population, so precautionary measures still rely on personal preventive strategies. The sharp rise in infections has forced governments to reimpose restrictions, requiring people to maintain at least 6 feet (ft) of safe physical distance. In summer, low-light conditions become particularly challenging, especially in the cities of underdeveloped countries, where poorly ventilated and congested homes cause people to gather in open spaces such as parks, streets, and markets, and where large gatherings of friends and family mostly take place at night. It is necessary to take precautionary measures to avoid more drastic outcomes in such situations. To support law-and-order bodies in maintaining social distancing using the Social Internet of Things (SIoT), the world is considering automated systems. To identify violations of a social distancing Standard Operating Procedure (SOP) in low-light environments via smart, automated cyber-physical solutions, we propose an effective social distance monitoring approach named DepTSol. We propose a low-cost and easy-to-maintain motionless monocular time-of-flight (ToF) camera and deep-learning-based object detection algorithms for real-time social distance monitoring. The proposed approach detects people in low-light environments and calculates their inter-personal distance in terms of pixels. We convert the predicted pixel distance into real-world units and compare it with the specified safety threshold value. The system highlights people violating the safe distance.
The proposed technique is evaluated using COCO evaluation metrics and achieves a good speed–accuracy trade-off, with 51.2 frames per second (fps) and a 99.7% mean average precision (mAP) score. Besides providing an effective social distance monitoring approach, we perform a comparative analysis between one-stage object detectors and evaluate their performance in low-light environments. This evaluation will pave the way for researchers to study the field further and will highlight the efficiency of deep-learning algorithms in timely responsive real-world applications.


Introduction
COVID-19, caused by SARS-CoV-2, originated in Wuhan, China, and created a catastrophe in 219 countries [1]. On 11 March 2020, the World Health Organization (WHO) declared it a pandemic when it had spread to 114 countries with 0.5 million active daily cases [2][3][4].

• We develop an efficient deep-learning-based physical distance monitoring approach in collaboration with ToF technology to monitor physical distancing under various low-light conditions.
• In comparison to the social distance monitoring solution provided by Adina et al. [8], the DepTSol model addresses the limitation of monitoring people at a fixed camera distance in a given environment by monitoring people at varying camera distances.
• In this article, we evaluate the performance of the newly released scaled-YOLOv4 algorithm under various low-light environments and perform a comparative analysis between seven different one-stage object detectors in low-light scenarios without applying any image cleansing or visibility enhancement techniques. In the literature, no other studies analyse the performance of deep-learning algorithms in the context of low-light scenarios. Based on this comparative analysis, we choose the algorithm that is best in terms of both speed and accuracy for the implementation of our real-time social distance monitoring framework.
• The proposed technique is not limited to monitoring social distancing at night; it is also implementable in generic low-light environments for the detection and tracking of people, as violations of safety measures are likely to occur at night.

Literature Review
Researchers have made remarkable contributions and presented effective solutions to deal with the COVID-19 pandemic. Notable work has been done in the literature on social distance monitoring after it was declared an effective measure for the prevention of the disease. Prem et al. [9] used synthetic location-specific contact patterns in Wuhan to monitor the effect of population mixing on the outbreak. They simulated the outbreak trajectory using the susceptible-exposed-infected-removed (SEIR) model. Their study showed that the mixing of people of different age groups produces different effects on the spread of the disease: young people were found to be less infected than older people, and physical distancing was shown to be a highly reliable practice for reducing the epidemic peak in Wuhan, China. Adolph et al. [10] analysed the effects of the outbreak in the USA and further evaluated the decisions of different politicians and policymakers concerning social distancing. The results were contradictory, which delayed lockdowns and resulted in the spread of COVID-19. The pandemic has triggered a dire need for technology-oriented digital healthcare solutions. To promote social distancing, many governments have utilised social IoT systems comprising infrared thermometers and self-temperature scanners. To educate people about the importance of social distancing, the Qatari government [9] employed security robots in various residential and public areas. In Singapore, a temperature screening system was introduced based on the AI-powered thermal scanner SPOTON [11]. The Kuwaiti government [11] introduced an application named 'Shlonik' to monitor people in quarantine. Indonesia launched a robot medical assistance system to limit contact between patients and medical staff; the robots can carry up to 50 kg of items such as medicine, clothes, and food to a patient's room [11].
Similarly, Iran developed a mobile application for electronic self-tests of COVID infection [11]. Kyrgyzstan created a website for its citizens [11]. The people who need food assistance can register their needs online and obtain food at their doors. The Ministry of Health and Education provided a free program named 'MASK' [11]. This application enables people to see contaminated areas on the map based on the places highly visited by infected people.
Over the past few decades, the detection of humanoid forms using deep-learning algorithms has been widely practised. In the literature, different deep-learning-based research studies have been conducted to automate social distance monitoring by detecting and monitoring people with high accuracy. Punn et al. [12] presented a deep-learning-based framework and implemented it with surveillance cameras for the automation of physical distance monitoring. The YOLOv3 [13] algorithm was utilised in collaboration with the Deep SORT technique for real-time object detection and tracking. In the same vein, Sahraoui et al. [14] used Social Internet of Vehicles (SIoV) technology with a Faster R-CNN [15] algorithm for physical distance monitoring and alert generation. In this study, every vehicle is equipped with cameras that capture images, objects in the images are detected by Faster R-CNN, and notifications regarding violations are sent through an advertisement board. The model's efficiency was evaluated through vehicle-to-infrastructure communication and found to be very effective. Similarly, Bouhlel et al. [16] introduced two different methods for measuring physical distancing: in the first, they estimated crowd density by classifying aerial frame patches, whereas in the second they used deep learning for detection and tracking. They tested their model on three different datasets and achieved good accuracy. Recently, Adina et al. [8] presented a real-time social distance monitoring strategy combining deep learning and ToF technology. The authors utilised the YOLOv4 [17] algorithm for real-time people detection and suggested a camera calibration approach for social distance monitoring at a fixed camera distance. The authors mainly focused on low-light scenarios. The model can observe people and show their relative distance in real-world units with high accuracy and a minimal error rate.
In the drastic situation of COVID-19, social IoT, deep learning, and computer vision have played a vital role. Researchers have made contributions and provided efficacious, deep-learning-based social distance monitoring solutions, as discussed above, but low-light conditions are yet to receive due attention. We previously focused on low-light scenarios and presented an efficient social distance monitoring approach with a good speed–accuracy trade-off, but the technique was limited to monitoring people at a fixed camera distance in a given environment [8]. By considering this research gap, in this article, a real-time physical distance monitoring approach is introduced that maintains optimal performance in terms of both speed and accuracy. The proposed approach maintains high privacy standards: instead of targeting individuals when a safety breach is detected, we propose general voice warnings via speakers.

Overview of Scaled-YOLOv4 Algorithm
We have seen a vast number of applications of computer vision and deep-learning-based algorithms in the current era, such as fraud detection [18][19][20], face recognition [21], theft detection [22,23], pedestrian detection [24][25][26], traffic monitoring [27][28][29], and business analytics [30,31]. All of these applications need to be trained on large datasets for effective results, and these vast datasets require massive computing capabilities such as GPUs, cloud computing facilities, single embedded devices, and large clusters for training. Model scaling plays a vital role in the design of an effective object detector with optimal speed-accuracy features. To make training easier and suitable on different devices, the most common practice is to change the number of convolutional filters, i.e., the width of the backbone, and the number of convolutional layers, i.e., the depth of the backbone, in convolutional neural networks (CNNs). Following the same practice, on 22 February 2021, Wang et al. [32] introduced the scaled-YOLOv4 model, where they showed that a YOLOv4 object detector based on a cross-stage-partial (CSP) framework can be easily scaled up or down and applied to both small and large networks while maintaining a good speed-accuracy trade-off.
After the successful execution of model scaling, the next phase is to monitor the quantitative and qualitative elements that will change. These elements incorporate cost, inference time, and accuracy. The qualitative elements have different effects than the quantitative elements, depending on the user's database or equipment. During the design of effective model scaling strategies, it is ensured that, whether the model is scaled up or down, the quantitative cost can be managed accordingly. The authors of the scaled-YOLOv4 model analysed different CNN models (ResNet [33], ResNeXt [34], and Darknet [13]) and monitored their quantitative cost under upscaling and downscaling. From the experiments, they found that changing the number of layers, network size, and width increases the computational cost, whereas their proposed approach of converting CNNs to CSPNet can reduce the floating-point operations (FLOPs) of ResNet, ResNeXt, and Darknet by 23.5%, 46.7%, and 50.0%, respectively, proving to be the overall best model scaling approach so far. The architecture of the scaled-YOLOv4 model is shown in Figure 1.

CSP-ized YOLOv4
The authors of scaled-YOLOv4 designed scaling algorithms for general, high-end, and low-end GPUs, as YOLOv4 [17] is the only general GPU-based real-time object detector. Because the downsampling convolution is not present in the design of the CSPDarknet53 residual block, the computation of every stage of CSPDarknet is reduced by whb²(9/4 + 3/4 + 5k/2). This reduction formula shows that CSPDarknet is beneficial over the plain Darknet backbone only when k is greater than 1. The stages of CSPDarknet have [1-2-8-8-4] residual layers. To attain optimal performance, the authors placed the first CSP stage into the original Darknet residual layer [17].
The path aggregation network (PANet, abbreviated PAN) is used for image segmentation by conserving spatial information, which improves localisation. The PAN in YOLOv4 is CSP-ized in scaled-YOLOv4 to lessen the computational cost by 40%. In previous object detection algorithms, the Spatial Pyramid Pooling (SPP) module sits in the centre of the first computational list of the neck [35]; the designers of scaled-YOLOv4 likewise placed it in the centre of the first computational list of CSPPAN. The architecture of the proposed computational list is shown in Figure 2.

YOLOv4-Tiny
Model size affects the inference time and computational cost and requires powerful hardware resources for the best performance. Therefore, during tiny model scaling for low-end devices, some factors such as memory access cost (MAC), the traffic of dynamic random-access memory (DRAM), and memory bandwidth need to be fully examined.
In lightweight models, a higher parameter-utilisation efficiency is required to acquire high accuracy with minimal computation. The authors analysed networks with the computational load of DenseNet and OSANet at growth rate (g) and found that OSANet was the better choice for tiny model scaling because of its low computational complexity, which is less than O(whkb²). Similarly, to attain the best computing speed, the authors introduced a new concept and performed gradient truncation between the computational layers of CSPOSANet. Power consumption is the most significant factor considered when the computational cost of low-end devices is evaluated, and memory access cost (MAC) is the biggest factor affecting power consumption. MAC is calculated by Equation (1):

MAC = hw(C_in + C_out) + K C_in C_out (1)

where h and w represent the height and width of the feature map, C_in represents the number of input channels, C_out the number of output channels, and K the kernel size of the convolutional filter. According to the authors, the smallest MAC value is obtained when C_in = C_out. DRAM traffic can be minimised by minimising the convolutional input/output (CIO). The authors evaluated the CIO of OSA, CSP, and their designed CSPOSANet, as shown in Equation (2), and found that the proposed CSPOSANet achieves the best CIO results when kg > b/2.
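The MAC relation in Equation (1) can be sketched directly in code. The snippet below is a minimal illustration (function and variable names are ours, and the feature-map sizes are arbitrary examples); it also demonstrates the authors' observation that, for a fixed product of channel counts, MAC is smallest when C_in = C_out:

```python
def mac(h, w, c_in, c_out, k):
    """Memory access cost of a convolutional layer, per Equation (1):
    MAC = h*w*(C_in + C_out) + K*C_in*C_out."""
    return h * w * (c_in + c_out) + k * c_in * c_out

# For the same C_in * C_out product (same second term), balanced channels
# give a smaller first term, hence a smaller MAC overall:
balanced = mac(56, 56, 64, 64, 9)   # C_in == C_out
skewed = mac(56, 56, 32, 128, 9)    # same C_in * C_out, unequal channels
print(balanced < skewed)  # True
```

This is simply the arithmetic-geometric mean inequality applied to the hw(C_in + C_out) term.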

YOLOv4-Large
While scaling for high-end devices, accuracy and inference speed can be improved by adjusting the detector's input, backbone, and neck. The prediction capability of the model depends upon the receptive fields of the feature vector. In neural networks, the stage is directly related to the receptive fields, and the feature pyramid network (FPN) indicates that a higher number of stages helps in the prediction of larger objects. YOLOv4-large is designed for the training of large models on distributed cloud-based GPUs. A fully CSP-ized YOLO-P5 is designed and is scaled up in YOLOv4-P6 and YOLOv4-P7. The authors performed compound scaling on {size^input, #stage}, set the depth scale of each stage to 2^(d_si), set d_s to [1, 3, 15, 15, 7, 7, 7], and found the best results.

Training Dataset
To observe people in low-light environments, we utilised the ExDark [36] dataset, which contains images of 10 different low-light conditions. It is the first available dataset entirely based on low-light scenarios. The dataset contains images of 12 different object classes; we extracted the person class and trained our models on it.

Testing Dataset
DepTSol was tested on a custom dataset collected in Pakistan at night during the days of COVID-19. Pakistan is one of the most urbanised countries in South Asia. Its large population and congested streets make it a riskier place for the growth of COVID-19, and it is very difficult to maintain a safe distance in such narrow places. Hence, the monitoring system needs high accuracy in terms of the detection and localisation of people. The test dataset is a collection of 323 RGB frames collected under different low-light conditions from both crowded and less crowded places. In this study, 186 frames were collected from images depicting a crowd in a market of Rawalpindi, Pakistan, which helps in assessing the performance of object detectors in low-light conditions; the remaining frames were collected from various outdoor environments. We obtained signed consent forms from the participants of the study, and the identities of those captured in crowded areas have been removed. All frames were captured by the ToF camera of a Samsung Galaxy Note 10+, where the camera height G_h is 4.5 ft and the focal length FL is 35 mm [37]. The dataset is publicly available [38].

Problem Articulation
We define a scene as a five-tuple S = {V_f, G_h, TH_ud, A_n, BB_c}, where V_f (height × width × 3) is an RGB video frame with V_f ∈ ℝ+, G_h is the camera height from the ground in feet, TH_ud is the least physical distance that should be maintained to stay safe, A_n is a binary control signal for sending a voice warning if the monitored inter-personal distance is less than TH_ud, and BB_c is the colour of the detected bounding boxes. In a given S, we are interested in finding the inter-personal pixel distances D_px = {pd_(1,2), pd_(1,3), ..., pd_(1,n), pd_(2,3), pd_(2,4), ..., pd_(2,n), ..., pd_(n−1,n)} at varying CFD values, where CFD ∈ a and a is a multiple of the specified safe physical distance, in our case 180 cm ≈ 6 ft; therefore, a = {180, 360, 540, 720, ..., n}. After finding D_px, we convert it into real-world units, centimetres (cm), denoted UD_(i+n). We compare the result with TH_ud to highlight safety distance violations (UD_(i+n) < TH_ud | UD_(i+n) ≥ TH_ud) in the given ROI. In the end, if a safety breach is detected, BB_c becomes red and a voice warning is sent to the people violating the safe physical distance by setting A_n = 1; otherwise, BB_c remains green and A_n = 0.

Real-Time People Detection
In this study, from the scaled-YOLOv4 family, the CSP-ized YOLOv4 algorithm was utilised for the detection of humans in V_f, as it improves prediction accuracy at a high inference speed. A detailed discussion of the model is presented in the Data Model section. The output of the model is the bounding boxes of detected people bb_i = {bb_(i,1), bb_(i,2), bb_(i,3), ..., bb_(i,n)}, their confidence scores bc_i, and the class labels bl_i. bb_(i,j) = {x_(i,j), y_(i,j)} gives the pixel indices of the bounding boxes in V_f, where j indexes the four corners: bottom-left, bottom-right, top-left, and top-right. The aim was to develop a robust real-time people detection model with minimal localisation and classification errors, capable of delivering high precision under various challenges such as variations in clothes, height, poses, and partial visibility. Figure 3 demonstrates the structure of the YOLO-based person detection module.
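The later distance computations operate on the centre point of each detected box. From the four-corner representation bb_(i,j) described above, a centroid can be derived as in this small sketch (the corner keys and example coordinates are our own illustration):

```python
def centroid(bb):
    """Centre point of a detected person's box, given the four corners
    (bottom-left, bottom-right, top-left, top-right) as (x, y) pixel pairs."""
    xs = [x for x, _ in bb.values()]
    ys = [y for _, y in bb.values()]
    return (sum(xs) / 4.0, sum(ys) / 4.0)

bb = {"bl": (10, 50), "br": (30, 50), "tl": (10, 10), "tr": (30, 10)}
print(centroid(bb))  # (20.0, 30.0)
```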

Camera-to-People Distance Estimation
We propose a motionless monocular ToF camera [38] for real-time video surveillance. The built-in accuracy of ToF cameras is good, as they combine the advantages of active sensors and camera-based approaches. Stereo vision cameras commonly suffer from bad lighting conditions and texture mixing and are computationally expensive, whereas ToF cameras have proven to be best in such scenarios. In comparison with 3D vision systems, ToF cameras are compact and straightforward, as they have a built-in illumination ability with no moving parts, and they yield efficient results with low processing power. In contrast to laser scanners, ToF cameras can measure up to 100 fps in one shot, which is much faster than laser technology. ToF technology has a variety of applications, including path planning for manipulators [39,40], obstacle avoidance [41,42], wheelchair assistance [42], medical respiratory motion detection [43], semantic scene analysis [43], simultaneous localisation and mapping (SLAM) [44], and human-machine interaction [45][46][47][48].
A ToF camera helps us measure the camera-to-person distance with high accuracy, which allows us to obtain optimal performance in our people-monitoring approach. In the ToF camera unit, the camera's light source blinks, and a modulated light pulse travels from the illumination source to the object. The distance between the camera and the object is calculated from the time taken by the light pulse to return to the source after striking the target object. The transmitted light is delayed according to the distance it covers to reach the object and return, which means that the farther the object, the more time the pulse takes to return to the source. The time delay T_D that the illumination experiences is expressed in Equation (3):

T_D = 2 D_o / v_o (3)

where D_o represents the object distance in meters (m), and v_o is the velocity of the light in meters per second (m/s). The maximum range that the camera can cover is determined by the pulse width of the illumination and is calculated by Equation (4), where T_o is the length of the pulse; the distance between the camera and the object is half of the total distance travelled by the light pulse:

D_max = (v_o T_o) / 2 (4)

The camera-to-object distance is then calculated by Equation (5):

D_o = (1/2) v_o T_o (a_1 / (a_1 + a_2)) (5)

where a_2 is the signal that is generated while the light pulse is emitted, and a_1 represents the signal measured when no light emission is encountered.
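The standard pulsed-ToF relations described in the text translate directly into code. This is a minimal sketch with our own function names; the pulse width and signal values are arbitrary examples:

```python
C = 299_792_458.0  # speed of light in m/s (v_o)

def time_delay(d_o, v_o=C):
    """Round-trip delay T_D = 2 * D_o / v_o for an object at D_o metres."""
    return 2.0 * d_o / v_o

def max_range(t_o, v_o=C):
    """Maximum range set by the pulse width T_o: D_max = v_o * T_o / 2."""
    return v_o * t_o / 2.0

def object_distance(t_o, a1, a2, v_o=C):
    """Camera-to-object distance D_o = (v_o * T_o / 2) * a1 / (a1 + a2),
    where a2 is integrated while the pulse is emitted and a1 afterwards."""
    return (v_o * t_o / 2.0) * a1 / (a1 + a2)

# a 50 ns pulse limits the measurable range to about 7.5 m
print(round(max_range(50e-9), 2))  # 7.49
```

Note that with equal signals a1 = a2, the object sits at exactly half the maximum range, as the half-total-distance argument in the text suggests.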

Threshold Specification and People Inter-Distance Estimation
To initiate the monitoring process, we calibrated the camera in the real-world environment by specifying intrinsic and extrinsic camera parameters. For the intrinsic parameters, we assumed a fixed focal length (FL), set according to the area where the surveillance system is installed and the required field of view (FoV). To specify the extrinsic parameters, we divided S into three different camera ranges: CFD-near, CFD-far, and CFD_R. To start the monitoring process, we defined a threshold distance in V_f in the form of pixels. For the specification of this threshold distance in V_f, we made arrangements in a real-world environment: we took four target objects, T1, T2, T3, and T4, of which two (T1, T2) were placed at the camera-to-frame distance CFD-near, and the other two (T3, T4) were placed at CFD-far. The two frame ranges, CFD-near and CFD-far, with the four target objects are shown in Figure 4, where CFD is the distance between the ToF camera and V_f, (h_m, h_f) is the total height and (l_m, l_f) the total length of the near and far V_f, (ix_mT1, ix_mT2) and (ix_fT3, ix_fT4) show the length, and (iy_mT1, iy_mT2) and (iy_fT3, iy_fT4) the height of the objects in the near and far frames. The pixel size of objects projected at CFD-near is different from that of the target objects at CFD-far and decreases as the value of CFD increases. Figure 4 shows that, as the frame moves away from the camera, the pixel size of the objects present in the frames, as well as that of all other parameters, changes accordingly. We execute Algorithms 1 and 2 to specify the threshold value and monitor people at CFD-near and CFD-far, whereas Algorithm 3 is executed to monitor people at distances beyond CFD-far up to the specified maximum camera range CFD_R; people outside CFD_R are not monitored.

Monitoring People at CFD-near
To initiate the procedure, we need to know the threshold distance between the target objects T1 and T2 in real-world units; in our case, TH_ud = 180 cm ≈ 6 ft, the minimum safe distance specified by the WHO. We then initialise the camera-to-frame distance CFD-near, which determines how far from the camera the monitoring process starts. E_px represents the extra pixels that are required because, as CFD increases, the number of pixels starts to decrease; E_px is initialised to 0 because, at the beginning of the procedure, no pixel loss has been encountered. In Step 8, we calculate the Euclidean distance between the centroids of T1 and T2, which yields the threshold distance in pixels, TH_pd, equivalent to TH_ud. In Step 10, to convert pixel distance into units, the proportion of TH_ud to TH_pd yields the unit distance equivalent to one pixel, where k is a constant that maps pixel distance to unit distance (cm). In Steps 11 and 12, we calculate the Euclidean distance between the centre points of all detected bb_i at CFD-near and convert it to the unit distance in cm; UD_(i+1) denotes the distance between all detected persons at CFD-near in cm. In Steps 13 to 18, we compare the monitored unit distance with TH_ud; people violating TH_ud are highlighted by red bounding boxes and notified by a general voice warning.
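The calibration and monitoring steps described above (Steps 8-18 of Algorithm 1) can be sketched as follows. This is an illustrative reconstruction under our own naming, with made-up target positions where the two calibration targets, 180 cm apart, appear 90 px apart, giving k = 2 cm/px:

```python
import math
from itertools import combinations

def calibrate_near(t1, t2, th_ud=180.0):
    """Steps 8-10: pixel threshold TH_pd from the calibration targets at
    CFD-near, and the pixel-to-cm constant k = TH_ud / TH_pd."""
    th_pd = math.dist(t1, t2)
    return th_pd, th_ud / th_pd

def monitor_near(centroids, k, th_ud=180.0):
    """Steps 11-18: pairwise unit distances and red/green box colours."""
    colours = {}
    for i, j in combinations(range(len(centroids)), 2):
        ud = math.dist(centroids[i], centroids[j]) * k  # UD in cm
        if ud < th_ud:                                  # safety breach
            colours[i] = colours[j] = "red"
    for i in range(len(centroids)):
        colours.setdefault(i, "green")
    return colours

_, k = calibrate_near((100, 300), (190, 300))  # targets 90 px apart
print(monitor_near([(0, 0), (60, 0), (400, 0)], k))
# {0: 'red', 1: 'red', 2: 'green'}
```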

Monitoring People at CFD-far
In Algorithm 2, we change the camera-to-frame distance from CFD-near to CFD-far. We place T3 and T4 at CFD-far, where their mutual distance is the same threshold value TH_ud = 180 cm. We then calculate the Euclidean distance between the centre points of T3 and T4. The value of E_px is updated this time: as the objects are now at CFD-far, the Euclidean distance between their centre points is not the same as that of the objects at CFD-near, although TH_ud is the same at both CFD values.
To recover the pixels lost at CFD-far, we calculate the difference between the Euclidean distance of T1 and T2 at CFD-near and that of T3 and T4 at CFD-far and multiply it by c, where the initial value of c is 1 and c increases as CFD increases. After calculating the difference, we update the value of E_px and add the lost pixels stored in E_px to UD_(i+2) by multiplying E_px by k, which converts the recovered pixels into cm; UD_(i+2) denotes the distance between all detected persons at CFD-far in cm.
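The pixel-recovery update can be sketched numerically. In this illustration (our own names and values), targets 180 cm apart span 90 px at CFD-near but only 60 px at CFD-far; adding back the recovered pixels restores the correct 180 cm reading with k = 2 cm/px:

```python
def recover_pixels(near_px, far_px, c=1):
    """E_px update: pixels lost between CFD-near and CFD-far, scaled by c
    (c starts at 1 and grows as CFD increases)."""
    return c * (near_px - far_px)

def unit_distance_far(pd, e_px, k):
    """UD at CFD-far: measured pixel distance plus recovered pixels, in cm."""
    return (pd + e_px) * k

e_px = recover_pixels(90, 60)                 # 30 px lost at CFD-far
print(unit_distance_far(60, e_px, k=2.0))     # 180.0, matching TH_ud
```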

Monitoring People up to CFD_R
In Step 1 of Algorithm 3, we start a loop to monitor people beyond CFD-far up to the maximum specified camera range CFD_R. We initialise CFD with a_3, where a_3 ∈ a. In Step 2, we check whether more than one object is present at CFD. If so, Steps 5-17 of the algorithm are executed: we increment the value of c to recover the lost pixels at each CFD, convert the monitored Euclidean distance between the centre points of the detected objects into cm, and compare it with TH_ud. Steps 16 and 17 are executed if a single object or no object is detected at CFD. The workflow of the proposed DepTSol model is shown in Figure 5.
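The outer loop of Algorithm 3 can be sketched as below. This is our own compact reconstruction, not the paper's listing: each CFD step beyond CFD-far increments c so that more lost pixels are recovered, and only frames with at least two detections contribute distance pairs:

```python
def monitor_up_to_cfdr(pd_by_cfd, near_px, far_px, k, th_ud=180.0):
    """Walk the CFD values beyond CFD-far up to CFD_R and collect the
    (CFD, distance-in-cm) pairs that violate TH_ud. pd_by_cfd maps each CFD
    value to the pairwise pixel distances measured in that frame."""
    violations, c = [], 1
    for cfd in sorted(pd_by_cfd):
        c += 1                              # one extra recovery step per CFD
        e_px = c * (near_px - far_px)       # E_px at this CFD
        for pd in pd_by_cfd[cfd]:           # empty list => nothing to check
            ud = (pd + e_px) * k            # pixel distance converted to cm
            if ud < th_ud:
                violations.append((cfd, ud))
    return violations

# at CFD = 540 cm, two pairs are measured at 20 px and 40 px
print(monitor_up_to_cfdr({540: [20.0, 40.0]}, near_px=90, far_px=60, k=2.0))
# [(540, 160.0)]
```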


Experimental Setup
We performed transfer learning from the MS COCO dataset [49] to train a custom object detector and attain the highest model accuracy. The hyper-parameters selected for training the one-stage object detectors on the ExDark dataset were as follows: the network size was 512 × 512; the initial learning rate was 0.01; the initial batch size was 64 with 16 subdivisions. To accelerate gradient vectors in the right directions, stochastic gradient descent (SGD) momentum was used with an initial value of 0.937 and a weight decay of 0.0005. For bounding box regression, the generalised intersection over union (GIoU) loss was adopted with an initial gain of 0.05. The initial class loss gain was 0.5, and the class binary cross-entropy (BCE) loss positive gain was 1.0. The object loss gain and object BCE loss positive gain were 1.0. The adopted intersection over union (IoU) target-anchor training threshold was 0.2, and the anchor threshold was 4.0. To handle the class imbalance problem by assigning more weight to hard or easily misclassified examples, the focal loss gamma was available, with an initial value of 0.0.
For data augmentation, the following parameters were adopted: to train the model on varying image colours, the chosen fractions of hue, saturation, and value augmentation were 0.015, 0.7, and 0.4, respectively. To add non-linearity, the Mish activation function was used. To make the model localise people in different portions of the frame, the mosaic data augmentation technique was utilised. All experiments were performed on a Tesla T4 GPU. The PyYAML version used was 5.4.1, the torch version was 1.8.0 with cu101, and the mish version was 0.0.3. The architectural configuration of CSP-ized YOLOv4 is shown in Figure 6.
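Gathered in one place, the training settings above might look like the following YOLO-style hyper-parameter dictionary. The key names here are illustrative (they are not the exact keys of any particular framework's config file); the values are those stated in the text:

```python
# Hyper-parameters from the experimental setup, as an illustrative config dict.
hyp = {
    "img_size": 512, "lr0": 0.01, "batch_size": 64, "subdivisions": 16,
    "momentum": 0.937, "weight_decay": 0.0005,
    "giou_gain": 0.05, "cls_gain": 0.5, "cls_bce_pos_weight": 1.0,
    "obj_gain": 1.0, "obj_bce_pos_weight": 1.0,
    "iou_t": 0.2, "anchor_t": 4.0, "fl_gamma": 0.0,
    "hsv_h": 0.015, "hsv_s": 0.7, "hsv_v": 0.4,  # colour augmentation
}
print(len(hyp))  # 17
```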

Evaluation Measures
We used the common performance evaluation metrics precision and recall to perform a comparative analysis between different one-stage object detectors and chose the best-performing one for our real-time social distance monitoring solution [50].
Precision is the proportion of true positives (TP) among all positive predictions, whereas recall is the proportion of TP among all actual objects. Precision and recall are calculated by Equations (6) and (7). In object detection, IoU is a threshold value that determines whether a predicted result counts as a TP or a false positive (FP).

Precision = TP / (TP + FP) (6)

Recall = TP / (TP + FN) (7)

Average precision (AP) depends on the precision-recall (PR) curve and is defined as the precision score averaged over all distinctive recall levels, as shown in Equation (8), whereas average recall (AR) is calculated by Equation (9).
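These definitions can be checked with a small sketch (our own function names; the counts and PR curve are made-up examples). AP is approximated here as the area under a recall-sorted PR curve, in the spirit of averaging precision over recall levels:

```python
def precision_recall(tp, fp, fn):
    """Equations (6) and (7): precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(pr_curve):
    """AP as the area under a PR curve given as (recall, precision) points
    sorted by increasing recall."""
    ap, prev_r = 0.0, 0.0
    for r, p in pr_curve:
        ap += (r - prev_r) * p
        prev_r = r
    return ap

print(precision_recall(tp=90, fp=10, fn=30))        # (0.9, 0.75)
print(average_precision([(0.5, 1.0), (1.0, 0.5)]))  # 0.75
```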
In this article, the COCO evaluation metric [51] was used for performance evaluation because of its versatility. The standard evaluation metric is Pascal VOC [52], but it defines the mAP score at only 0.5 IoU. The COCO evaluation metric, in contrast, contains an mAP score at three different IoU settings: the primary challenge metric, which averages the mAP score over 10 IoU thresholds from 0.50 to 0.95 with a step size of 0.05; a standard metric, the same as Pascal VOC, which considers only the single threshold value of 0.5; and a strict metric, where the IoU threshold is 0.75. Besides this, COCO provides an mAP score for small (area < 32²), medium (32² < area < 96²), and large objects (area > 96²). As demonstrated in Equation (10), the mAP score is the mean of the AP values over N classes.
Similar to the mAP score, the mAR score also has two sets of variations. In the first set, mAR is reported for a varying number of detections per frame; e.g., mAR max=1 considers only one detection per frame, mAR max=10 considers 10 detections per frame, and mAR max=100 considers 100 detections per frame. In the second set, mAR is calculated based on the size of the detected objects: small (area < 32²), medium (32² < area < 96²), and large (area > 96²). The mean average recall (mAR) is the mean of all AR values over N classes, as shown in Equation (11).
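Given per-class AP values at each IoU threshold, the COCO primary challenge metric described above averages first over the 10 thresholds and then over classes. A minimal sketch (the table layout is illustrative, not the cocoapi data structure):

```python
def coco_map(ap_table):
    # ap_table: {class_name: {iou_threshold: AP}}.
    # The primary COCO metric averages AP over the 10 thresholds
    # 0.50, 0.55, ..., 0.95, then takes the mean over all classes.
    thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]
    per_class = []
    for aps in ap_table.values():
        per_class.append(sum(aps[t] for t in thresholds) / len(thresholds))
    return sum(per_class) / len(per_class)
```

Restricting `thresholds` to [0.5] recovers the Pascal VOC-style standard metric, and to [0.75] the strict metric.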
For the final evaluation of the DepTSol model, we used the mean absolute error (MAE) [53] score, which is the mean of the absolute differences between the observed and actual distance values, as shown in Equation (12), where ŷ_i denotes the observed distance and y_i represents the actual distance.
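The MAE score of Equation (12) reduces to a one-line computation:

```python
def mae(observed, actual):
    # Mean absolute error between observed (predicted) and actual
    # distance values, e.g. in centimetres.
    assert len(observed) == len(actual)
    return sum(abs(o - a) for o, a in zip(observed, actual)) / len(observed)
```

For example, predictions of 182 cm and 358 cm against ground truths of 180 cm and 360 cm give an MAE of 2 cm.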

Results
We performed various experiments to evaluate our social distance monitoring approach, DepTSol. Besides evaluating the performance of CSP-ized YOLOv4, we evaluated one-stage object detection models on the ExDark dataset and compared their results with CSP-ized YOLOv4 in terms of both speed and accuracy. As the literature shows, low-light environments have received little attention in the field of object detection, so the direct evaluation of object detection models in low-light scenarios will pave the way for researchers to further study the field. The comparative analysis between seven different object detection models, including the Single-Shot Detector (SSD) [54], RetinaNet [55], the Enriched Feature Guided Refinement Network (EFGRNet) [56], YOLOv3, YOLOv3 Spatial Pyramid Pooling (YOLOv3-SPP) [35], YOLOv4, and CSP-ized YOLOv4, is shown in Tables 1 and 2. The results show that CSP-ized YOLOv4 performs best in terms of both speed and accuracy. The training convergence of CSP-ized YOLOv4 on GIoU loss, objectness loss, classification, precision, recall, and mAP is shown in Figure 7 with a network size of 512 × 512. SSD attained the second position in terms of speed but achieved only the sixth rank at the various mAP scores and remained last in terms of mAP for small-area objects. YOLOv4 achieved the second rank in terms of accuracy and the third rank in terms of speed. YOLOv3-SPP stands at the third level in terms of accuracy, YOLOv3 achieves the fourth rank in terms of speed, and YOLOv3-SPP and YOLOv3 achieve almost the same fps score, with a difference of 0.7 fps. EFGRNet achieved the third rank in terms of fps score and the fourth in terms of mAR and mAP. RetinaNet reported the lowest speed and mAP score of all the models, while performing better than SSD, EFGRNet, and YOLOv3 for small-size objects.
Based on the comparative analysis, we selected the trained CSP-ized YOLOv4 model for its high performance. We obtained object detection results from CSP-ized YOLOv4 and applied the social distance monitoring algorithms to the obtained image coordinates for inter-distance estimation. We tested our DepTSol model on 230 different RGB frames. Some of the test results from the qualitative evaluation are shown in Figure 8, and Table 3 shows the quantitative results in terms of the predicted unit distance U_D, the actual unit distance AU_D, and their relevant pixel values. Figure 9 depicts further qualitative results of the DepTSol model on the testing dataset. To start the monitoring process, we initialised CF_D-near with a_1, CF_D-far with a_2, and CF_DR with a_3, where (a_1, a_2, a_3) ∈ a. We monitored people at each CF_D, calculated the error rate between U_D and AU_D at each level, and summarised it with an MAE score.

Figure 7. The graphic depiction of CSP-ized YOLOv4 convergence over GIoU, objectness, classification, precision, recall, and mAP score.
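The per-frame monitoring step described above, computing the Euclidean distance between the centroids of detected bounding boxes, converting pixels to centimetres, and flagging pairs below the safety threshold, can be sketched as follows. The pixels-per-centimetre ratio `ppcm` stands in for the calibration output at the monitored CF_D; the names and the 180 cm default are illustrative, not the paper's exact implementation:

```python
import math
from itertools import combinations

def centroid(box):
    # Centre point of a [x1, y1, x2, y2] bounding box
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def violations(boxes, ppcm, threshold_cm=180.0):
    # Return index pairs of people whose real-world distance falls below
    # the safety threshold. ppcm is the pixels-per-centimetre ratio at
    # the monitored camera-to-frame distance (from calibration).
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(boxes), 2):
        (xa, ya), (xb, yb) = centroid(a), centroid(b)
        pixel_dist = math.hypot(xa - xb, ya - yb)
        if pixel_dist / ppcm < threshold_cm:
            pairs.append((i, j))
    return pairs
```

Any non-empty result would trigger the general voice warning described later in the discussion.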

Limitations and Discussion
Low-light environments play a vital role in the spread of disease, and the provision of effective social distance monitoring approaches is required to serve that motive. The detection of people in low-light environments is itself a challenging task. Applying image processing techniques to enhance dark images and subsequently applying object detection algorithms results in a slow response time and requires high-power machines to execute multiple tasks. To make the system highly responsive, we directly applied object detection algorithms in low-light scenarios and evaluated their performance. We tested seven different one-stage object detection algorithms on the ExDark dataset and evaluated the models in terms of both accuracy and speed. Figures 10 and 11 depict the empirical results of the performed experiments. To summarise the compared models' performance, we explored the testing results of each model under the COCO evaluation metrics on a Tesla T4 GPU with a network size of 512 × 512. CSP-ized YOLOv4 achieved the best performance compared to the six other one-stage detectors, obtaining an fps value of 51.2 and an mAP [0.5] of 99.7%. Due to its high performance compared to the other one-stage object detectors, we utilised it for our social distance monitoring task to control FPs and FNs and to support real-time monitoring. The analysis shows that the direct application of object detection algorithms in low-light environments for human detection and monitoring purposes is very effective. Additionally, directly applying deep-learning-based object detection algorithms to low-light datasets yields effective results at very low cost: the cost otherwise incurred on powerful devices to perform image cleansing and visibility enhancement can be saved.
Furthermore, the fps score of the models can be further enhanced by utilising GPUs such as the Tesla V100, Volta, and Titan Volta, whereas training the models on a higher network size results in a higher mAP score. The proposed CSP-ized YOLOv4 and ToF-based real-time social distance monitoring approach has shown effective results, with an overall MAE of 2.23 cm. Figure 12 presents visualisations of U_D and AU_D. The approach considers individuals' privacy concerns: instead of targeting people individually, we use general voice warnings that alert all people present at the location. The proposed general warning system is highly feasible in outdoor environments, such as night-time outdoor gatherings. For indoor environments such as offices, homes, libraries, and hospitals, we can instead use non-intrusive audiovisual cues that target and notify only certain people without distracting others in the surrounding area. The proposed camera calibration technique addresses the limitation of the previous study, which monitored people at a single fixed camera distance C_D in a given environment, by dividing the scene into multiple safety threshold distance values (e.g., CF_D-near and CF_D-far, up to the maximum specified camera range CF_DR). The proposed approach can thus effectively monitor people at multiple camera distances in a given environment and generate voice warnings. Moreover, in contrast to the previous study, the DepTSol model improved the mAP score by 1.86%, while not a single FP or FN was detected. Despite these improvements, the approach is limited to giving feasible results at the specified CF_D values (e.g., if we start the monitoring process 180 cm away from the camera and initialise a CF_D-near of 180 cm and then a CF_D-far of 360 cm, monitoring can be performed at those distances, but the approach does not yield correct results for people located between a CF_D of 180 cm and one of 360 cm).
Furthermore, in the proposed camera calibration approach, four target objects must be placed in the real-world environment before the monitoring process can start. In addition, the installation of the system in a given environment depends on the intrinsic camera parameters (i.e., FL and FoV), which means that the pixel threshold value TH_pd relevant to the unit threshold value TH_ud needs to be recalculated every time a change in these parameters is encountered. These limitations can be tackled by introducing a new camera calibration approach that can monitor people independently of specific CF_D values.
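The dependency of TH_pd on FL and FoV can be illustrated with a simple pinhole camera model: a real-world span of TH_ud centimetres viewed at camera distance Z projects to roughly TH_ud · f / Z pixels, where the focal length f in pixels follows from the horizontal FoV and the image width. This is a sketch under those assumptions, not the paper's exact calibration procedure:

```python
import math

def focal_length_px(image_width_px, hfov_deg):
    # Pinhole model: focal length in pixels from horizontal field of view
    return (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

def pixel_threshold(th_ud_cm, camera_distance_cm, image_width_px, hfov_deg):
    # TH_pd ~ TH_ud * f / Z: the pixel span of a real-world distance
    # th_ud_cm observed at camera distance camera_distance_cm.
    # Illustrative approximation only.
    f = focal_length_px(image_width_px, hfov_deg)
    return th_ud_cm * f / camera_distance_cm
```

The sketch makes the limitation concrete: changing either the FoV or the image width changes f, so TH_pd must be recomputed whenever these parameters change.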

Conclusions and Future Directions
Social distancing is a highly recommended personal preventive strategy for mitigating the effects of COVID-19. We propose an approach named DepTSol that focuses mainly on low-light scenarios, as such scenarios can play a vital role in the escalation of death and infection rates. We propose a smart implementation of SIoT utilising computer vision and deep learning algorithms in collaboration with ToF technology, and present a cost-efficient, fast, automated social distance monitoring solution. We use a ToF camera to capture people in a real-world environment, and the people in the images are detected by CSP-ized YOLOv4. In the proposed approach, we calculate the Euclidean distance between the centroids of the bounding boxes detected around people and convert that distance into centimetres. Based on the achieved unit distance, we highlight violations, and a general voice warning is issued to those present in the environment. We evaluated the technique both quantitatively and qualitatively, performed a comparative analysis between different one-stage object detectors, and found that CSP-ized YOLOv4 outperformed all other techniques. Furthermore, the proposed technique achieves outstanding performance in terms of both speed and accuracy, with 51.2 fps and a 99.7% mAP score. The speed and accuracy obtained by DepTSol are higher than those obtained by Adina et al. [8] in their research work, which were 46.2 fps and a 97.84% mAP score, respectively.
In the future, we aim to introduce a new camera calibration technique to resolve the limitations of this study. Furthermore, we aim to extend this approach by adding a facemask detection feature to identify people who are not wearing a mask, or not wearing one correctly, at night. Besides this, we will monitor people inside cars and on motorbikes: we will check whether car windows are closed or the occupants are wearing facemasks, and, for bikers, ensure that they are wearing a facemask or helmet. This is particularly relevant in underdeveloped cities, where congested streets and roads with minimal distance between traffic can also boost the infection rate.