Article
Peer-Review Record

Holistic Interpretation of Public Scenes Using Computer Vision and Temporal Graphs to Identify Social Distancing Violations

Appl. Sci. 2022, 12(17), 8428; https://doi.org/10.3390/app12178428
by Gihan Jayatilaka 1,2,†, Jameel Hassan 1,†, Suren Sritharan 3,†, Janith Bandara Senanayaka 1, Harshana Weligampola 1,*, Roshan Godaliyadda 1, Parakrama Ekanayake 1, Vijitha Herath 1, Janaka Ekanayake 1,4 and Samath Dharmaratne 5
Reviewer 1: Anonymous
Submission received: 6 July 2022 / Revised: 14 August 2022 / Accepted: 15 August 2022 / Published: 24 August 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

This study attempts to provide a unified framework for monitoring social distancing protocols, interpersonal interactions such as handshakes, and mask-wearing from CCTV footage, using computer vision and graph theory.

The threat level is demonstrated through subject actions such as handshakes, mask-wearing, and social-distancing variability, determined from the increments and decrements across the frame sequence.

I think the authors need to suggest standard protocols for the control parameters of the threat-level assessment score.

What about the duration of interactions? If it is not considered, I suggest stating an explicit assumption.

The results presented rely strongly on frame-level quantification; I think the authors need to strengthen this quantification.

Overall, the presentation is beneficial, but the reliability of the results should be improved by including different public scenarios, considering how social distancing is practiced at different kinds of events.

Comments for author File: Comments.pdf

Author Response

Please see the attached file.

Author Response File: Author Response.pdf

Reviewer 2 Report

This manuscript describes a system for monitoring social distancing violations, which is in turn composed of multiple models, each handling a relevant subtask (Figure 1; e.g., tracking of people, estimation of distances/groups, identification of physical interaction, detection of masking). The predictions/outputs of these individual models are then collated into a temporal graph for analysis. A further challenge is the possibly non-optimal placement of cameras, such that there might not exist a video stream that is directly overhead (which would enable straightforward assessment of distance) or that covers all directions (to enable mask detection regardless of person orientation).

 

Person/mask object detection in video is performed with YOLO (Sections 2.1 and 2.4), and distance estimation by a geometric transform of estimated person-foot locations onto the "floor plane" (Section 2.2). Groups of people are estimated by clustering on a distance metric (Section 2.3). All of this information is represented as a temporal graph (Section 2.5), which then enables threat quantification via various metrics (Section 2.6; Table 1).
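The floor-plane distance estimation summarized above can be sketched as a planar homography applied to each detected person's foot point, after which distances are measured in floor-plane units. This is a minimal illustrative sketch, not the authors' calibration; the homography matrix and coordinates are assumptions.

```python
# Illustrative sketch: project an image-space foot point onto the floor
# plane via a 3x3 homography H, then measure inter-person distance there.
# H and the points below are hypothetical, not the paper's calibration.

def project_to_floor(H, point):
    """Apply a 3x3 homography H (nested lists) to an image point (x, y)."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (
        (H[0][0] * x + H[0][1] * y + H[0][2]) / w,
        (H[1][0] * x + H[1][1] * y + H[1][2]) / w,
    )

def floor_distance(H, p1, p2):
    """Euclidean distance between two image points after floor projection."""
    (x1, y1), (x2, y2) = project_to_floor(H, p1), project_to_floor(H, p2)
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

# With the identity homography, floor distance equals image distance.
H_identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

In practice the homography would be estimated from known floor landmarks (e.g., with a calibration routine) rather than written by hand.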

 

1. About Line 204, there appears to be overlapping text. This might be addressed.

 

2. In Section 2.3, it is stated that a person is considered to be a member of a group if they are [close] to at least one member of the group. This appears to imply that the entire population might be collected into a single group, if dense enough. Details/parameters of the spectral clustering used might thus be clarified further.
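The chaining effect this comment raises can be made concrete: if membership only requires proximity to one existing member, groups are the connected components of the "close-to" graph, and a chain of people links everyone into a single group. The positions and threshold below are hypothetical, for illustration only.

```python
# Sketch of the chaining concern: connected components under a pairwise
# distance threshold. A dense chain merges the whole population into one
# group even when the endpoints are far apart.

def group_by_threshold(positions, threshold):
    """Group indices whose pairwise distance is <= threshold (union-find)."""
    n = len(positions)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            (xi, yi), (xj, yj) = positions[i], positions[j]
            if ((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5 <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Five people in a line, each 1 m from the next: with a 1 m threshold they
# all chain into one group, although the endpoints are 4 m apart.
chain = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
```

Spectral clustering with a fixed number of clusters or a tuned affinity scale would behave differently, which is why the parameters matter here.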

 

3. In Section 2.6 (Equations 15 & 16), the threat level appears defined for each frame. It might be clarified as to whether such frame-level metrics are summed over the total time of contact, to produce a probably more relevant threat assessment, and whether the threat assessment relates only to the entire scene, and not to individuals.
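The aggregation this comment asks about can be sketched as integrating a per-frame threat value over a contact episode, so that longer contacts at the same instantaneous threat score higher. The function name and per-frame values are hypothetical, not the metrics of Table 1.

```python
# Sketch: accumulate a frame-level threat T(t) over time (threat-seconds),
# so contact duration affects the assessment. Values are illustrative.

def cumulative_threat(per_frame_threat, fps):
    """Integrate per-frame threat values over time, given the frame rate."""
    return sum(per_frame_threat) / fps

# Two contacts with equal instantaneous threat but different durations:
short_contact = [0.8] * 30    # 1 s at 30 fps
long_contact = [0.8] * 300    # 10 s at 30 fps
```

Under this kind of aggregation the 10 s contact scores ten times the 1 s contact, which a purely frame-level metric would not distinguish.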

 

4. In Section 3.3.3, it is stated that the UOP dataset was required (i.e., it contained subjects with masks). The number of subjects and the length of the relevant videos might be stated. In general, the dataset descriptions in Section 3.1 might be more detailed and include such information as the number of subjects, the length/number of video clips, etc.

 

5. In Section 3.3.4, it is stated that the threat level T(t) was compared against expert human input. The standards/guidelines followed by the human expert(s) might be stated. Moreover, were the metrics in Table 1 designed with knowledge of such standards/guidelines?

 

6. In Section 4.2, it might be clarified whether (near-full) occlusion affects distance estimation too (if the feet are occluded).

 

7. In Section 4.4, it is claimed that the system performs significantly better in identifying masked faces than unmasked faces. This appears unintuitive and might be clarified further.

 

8. Related to the above, it seems that masks could not be detected if a masked person rotates away from the camera. It might be clarified as to whether the temporal graph maintains a record/memory of mask status in such cases.

 

9. For Section 4.6, the meaning of "full system performance" might be clarified, since features such as masks were said not to exist in certain datasets.

 

Author Response

Please see the attached file.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors have responded well and made comprehensive corrections according to the previous comments. The corrected paper is ready for publication.

Author Response

Please find the PDF attached herewith.

Author Response File: Author Response.pdf

Reviewer 2 Report

We thank the authors for largely addressing our previous comments. The stability of the frame-level threat level metric might be briefly described - is it generally robust?

Author Response

Please find the attached file.

Author Response File: Author Response.pdf
