Holistic Interpretation of Public Scenes Using Computer Vision and Temporal Graphs to Identify Social Distancing Violations
Round 1
Reviewer 1 Report
This study attempts to provide a unified framework for monitoring social distancing protocols, interpersonal interactions such as handshakes, and mask-wearing, using computer vision and graph theory applied to CCTV footage.
The threat level is demonstrated through subject actions such as handshakes, mask-wearing, and social distancing variability, determined from the increments and decrements across the frame sequence.
I think the authors need to propose standard protocols for the control parameters of the threat level assessment score.
What about the duration of interactions? If it is not considered, I suggest stating an explicit assumption.
The results presented rely heavily on the interpretation of frame-level quantification; I think the authors need to strengthen the quantification part.
Overall, the presentation is beneficial, but the reliability of the results must be improved by including different public scenarios, considering how social distancing is practiced at different kinds of events.
Comments for author File: Comments.pdf
Author Response
Please see the attached file.
Author Response File: Author Response.pdf
Reviewer 2 Report
This manuscript describes a system for monitoring social distancing violations, which is in turn composed of multiple models each handling a required subtask (Figure 1; e.g. tracking of people, estimation of distances/groups, identification of physical interaction, detection of masking). The predictions/outputs of these individual models are then collated into a temporal graph for analysis. A further challenge is the possibly non-optimal placement of cameras, such that there might not exist a video stream that is directly overhead (which would enable straightforward assessment of distance) or covering all directions (to enable mask detection regardless of person orientation).
Person/mask object detection in video is performed with YOLO (Sections 2.1 and 2.4), and distance estimation by a geometric transform of estimated person-foot locations onto the "floor plane" (Section 2.2). Groups of people are estimated by clustering on a distance metric (Section 2.3). All this information is represented as a temporal graph (Section 2.5), which then enables threat quantification as estimated by various metrics (Section 2.6; Table 1).
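For concreteness, the distance-estimation step summarized above (projecting detected foot pixels onto the floor plane) can be sketched as a planar homography. The calibration matrix `H` and the function names below are illustrative assumptions, not the paper's actual implementation; in practice `H` would be calibrated per camera.

```python
import numpy as np

# Hypothetical 3x3 homography mapping image-plane foot pixels (u, v)
# to floor-plane coordinates in metres. Values are made up for
# illustration; a real system would calibrate this per camera view.
H = np.array([
    [0.01, 0.0,   -3.2],
    [0.0,  0.02,  -4.8],
    [0.0,  0.001,  1.0],
])

def to_floor(foot_px):
    """Project an image-space foot point (u, v) onto the floor plane."""
    u, v = foot_px
    x, y, w = H @ np.array([u, v, 1.0])
    return np.array([x / w, y / w])  # dehomogenize

def pairwise_distance(foot_a, foot_b):
    """Euclidean distance on the floor plane between two detected people."""
    return float(np.linalg.norm(to_floor(foot_a) - to_floor(foot_b)))

# Two people whose feet appear 200 px apart in the image:
print(round(pairwise_distance((400, 300), (600, 300)), 3))
```

Note that this mapping is only valid for points actually lying on the floor plane, which is why foot locations (rather than, say, head locations) are projected; occluded feet break this assumption, a point raised in comment 6 below.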
1. About Line 204, there appears to be overlapping text. This might be addressed.
2. In Section 2.3, it is stated that a person is considered to be a member of a group if they are [close] to at least one member of the group. This appears to imply that the entire population might be collected into a single group, if dense enough. Details/parameters of the spectral clustering used might thus be clarified further.
3. In Section 2.6 (Equations 15 & 16), the threat level appears to be defined per frame. It might be clarified whether such frame-level metrics are summed over the total time of contact, to produce a probably more relevant threat assessment, and whether the threat assessment relates only to the entire scene and not to individuals.
4. In Section 3.3.3, it is stated that the UOP dataset was required (i.e. it contained subjects with masks). The number of subjects and the length of the relevant videos might be stated. In general, the dataset descriptions in Section 3.1 might be more detailed and include such information as the number of subjects, length/number of video clips, etc.
5. In Section 3.3.4, it is stated that the threat level T(t) was compared against expert human input. The standards/guidelines followed by the human expert(s) might be stated. Moreover, were the metrics in Table 1 designed with knowledge of such standards/guidelines?
6. In Section 4.2, it might be clarified whether (near-full) occlusion affects distance estimation too (i.e. if the feet are occluded).
7. In Section 4.4, it is claimed that the system performs significantly better at identifying masked faces than unmasked faces. This appears unintuitive and might be clarified further.
8. Related to the above, it seems that masks could not be detected if a masked person rotates away from the camera. It might be clarified whether the temporal graph maintains a record/memory of mask status in such cases.
9. For Section 4.6, the meaning of "full system performance" might be clarified, since features such as masks were said not to exist in certain datasets.
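The degenerate case behind comment 2 (transitive "close to at least one member" grouping collapsing a dense crowd into one group) can be shown with a minimal sketch. The function name, 1-D positions, and threshold below are hypothetical illustrations, not the manuscript's spectral clustering.

```python
def chain_groups(positions, threshold):
    """Transitive grouping: i and j share a group if a chain of
    pairwise-close members connects them (union-find)."""
    parent = list(range(len(positions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Merge every pair within the distance threshold.
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if abs(positions[i] - positions[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(positions)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Five people spaced 1 m apart with a 1.5 m closeness threshold:
# the people at either end are 4 m apart, yet transitivity merges
# everyone into a single group.
print(chain_groups([0.0, 1.0, 2.0, 3.0, 4.0], 1.5))  # → [[0, 1, 2, 3, 4]]
```

This is why the reviewer asks for the clustering parameters: without a cap on group diameter or a spectral cut criterion, single-linkage behaviour dominates in dense scenes.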
Author Response
Please see the attached file.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The authors have responded well and made comprehensive corrections according to the previous comments. The corrected paper is ready for publication.
Author Response
Please find the PDF attached herewith.
Author Response File: Author Response.pdf
Reviewer 2 Report
We thank the authors for largely addressing our previous comments. The stability of the frame-level threat level metric might be briefly described - is it generally robust?
Author Response
Please find the attached file.
Author Response File: Author Response.pdf