Real-Time Mask Recognition
Round 1
Reviewer 1 Report
The paper presents, reviews, and benchmarks mask detection algorithms based on classic ML networks. The overall contribution and quality of the paper are good. However, I have many questions concerning the paper.
- Is it difficult, even with data augmentation (Instagram scraping), to obtain a balanced dataset (i.e., one where the "no-mask" class does not heavily dominate the "mask" class)?
- Accuracy is not an appropriate evaluation metric in the case of unbalanced datasets. I suggest the authors consider other performance metrics (such as the F1-score).
- The authors did not motivate the choice of MobileNet-v2 and ResNet-18 for mask detection tasks. I would suggest the use of other, more robust computer vision algorithms.
- What is the output probability the authors refer to in Section 2.2.1?
- How do the authors explain that, in Figures 3 and 4, the misclassifications for "not wearing a mask" lie along given directions (Az = 60 and -60), while for the "wearing mask" case the misclassifications are spread all over the scatter plot?
- Did the authors make the code used for the annotation available to the data science community (through GitHub, for example)?
- The authors did not motivate why this computer vision paper is relevant for publication in an IoT journal.
- What are "Y" and "N" in Tables 5 and 6?
- Figures 10, 11, 12, and 13 take up a lot of space and are not illustrative. I suggest the authors summarize them in a table.
- How do the authors define the confidence score?
- Did the authors get the proper authorizations from people to use their images to build the dataset, for both the Instagram-scraped images and the campus live recordings?
- The paper title begins with "Real-Time...". I did not find in the paper any deep analysis of the impact of the CV algorithms on edge AI devices (a Raspberry Pi, for example). The authors should tackle the implementation feasibility of the proposed solution.
- Open question: Nowadays, to enter a building in many countries, people must show proof of vaccination (usually in the form of a QR code). How can this work be extended to detect whether the vaccination of a given person is valid (supposing that the person shows the QR code to the camera)?
Author Response
Reviewers, thank you for your time and feedback as part of this process. To help highlight updates to our paper, we have appended a tracked-changes version of the paper and also attempt to offer summaries of the changes below:
Reviewer 1: The paper presents, reviews, and benchmarks mask detection algorithms based on classic ML networks. The overall contribution and quality of the paper are good. However, I have many questions concerning the paper.
– Is it difficult, even with data augmentation (Instagram scraping), to obtain a balanced dataset (i.e., one where the "no-mask" class does not heavily dominate the "mask" class)?
– Yes. The internet has been around for long before COVID-19, and even though "mask" photos started becoming more prevalent when the dataset was created, "no-mask" photos still comprised the vast majority of available images. In our presentation attack analysis, the data is more balanced, as participants were encouraged to attempt images both with and without masks. The real-life captures on campus, being unscripted, are subject to the behaviors of the random people captured on camera.
– Accuracy is not an appropriate evaluation metric in the case of unbalanced datasets. I suggest the authors consider other performance metrics (such as the F1-score).
– We agree and have added a section explaining this in the paper. While an area for future improvement, we have worked to dissect that accuracy analysis according to the different underlying contributors to errors, such that any bias in the "masked" dataset is highlighted rather than lost.
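To make the point concrete, here is a minimal sketch (not the authors' released code) of how overall accuracy can be dissected into per-class accuracy and an F1-score with scikit-learn, so that imbalance is made visible rather than hidden:

```python
# Minimal sketch: report per-class metrics alongside overall accuracy
# for a binary mask / no-mask classifier (label 1 = "mask").
from sklearn.metrics import confusion_matrix, f1_score

def dissect(y_true, y_pred):
    # Rows = true class, columns = predicted class.
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    per_class_acc = cm.diagonal() / cm.sum(axis=1)
    return {
        "overall_accuracy": cm.diagonal().sum() / cm.sum(),
        "no_mask_accuracy": per_class_acc[0],   # majority class
        "mask_accuracy": per_class_acc[1],      # minority class
        "f1_mask": f1_score(y_true, y_pred, pos_label=1),
    }
```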
– The authors did not motivate the choice of MobileNet-v2 and ResNet-18 for mask detection tasks. I would suggest the use of other, more robust computer vision algorithms.
– Those two models were used because they are well-known and recognized in the machine learning community and because trained models are easily available. Further, MobileNet-v2 was specifically designed for mobile applications, which is in line with our goal of deploying a real-time IoT-scale solution. We concur that employing more robust computer vision algorithms would have higher performance potential, yet the resources required in many cases to deploy those algorithms would prevent their use on an IoT device.
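For reference, the transfer-learning setup alluded to here typically amounts to loading a pretrained backbone and swapping its final classification layer; a sketch using torchvision's model API (the two-class head and any training hyperparameters are illustrative assumptions, not the authors' exact configuration):

```python
# Illustrative transfer-learning setup for binary mask / no-mask output.
import torch.nn as nn
from torchvision import models

def build_mask_classifier(arch: str = "mobilenet_v2", num_classes: int = 2):
    if arch == "mobilenet_v2":
        model = models.mobilenet_v2(pretrained=True)
        # Replace the ImageNet 1000-class head with a 2-class head.
        model.classifier[1] = nn.Linear(model.last_channel, num_classes)
    elif arch == "resnet18":
        model = models.resnet18(pretrained=True)
        model.fc = nn.Linear(model.fc.in_features, num_classes)
    else:
        raise ValueError(f"unsupported architecture: {arch}")
    return model
```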
– What is the output probability the authors refer to in Section 2.2.1?
– This is the confidence score, i.e., the probability of correct classification. It is explained in more detail in the appendix.
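As one common formulation (the appendix may define it differently in detail), the confidence score is the softmax-normalized class probability taken over the model's raw outputs:

```python
# Confidence score as the maximum softmax probability over two raw
# model outputs (logits); class order [no-mask, mask] is an assumption.
import torch
import torch.nn.functional as F

logits = torch.tensor([1.2, 3.4])      # example raw model outputs
probs = F.softmax(logits, dim=0)       # normalized, sums to 1.0
confidence, predicted = torch.max(probs, dim=0)
print(f"class {predicted.item()} with confidence {confidence.item():.3f}")
```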
– How do the authors explain that, in Figures 3 and 4, the misclassifications for "not wearing a mask" lie along given directions (Az = 60 and -60), while for the "wearing mask" case the misclassifications are spread all over the scatter plot?
– Our goal in that analysis was to build an understanding of the variations in performance as facial images vary from boresight (which is a foundational, yet non-real-world, assumption of the previous papers). While not universally true, we found that the trained model is generally more robust to changes in azimuth for images of individuals wearing a mask. In practical deployment, this would encourage use of a visual / subconscious beacon that attracts the attention of the person being observed and thus more closely approximates a boresight image. This also suggests the need for a future expansion of the dataset to include a wider range of elevation angles.
– Did the authors make the code used for the annotation available to the data science community (through GitHub, for example)?
– No; frankly, the code is pretty simple and there are likely much better tools for annotation. Additionally, while all participants consented to participation in the study, we are still reviewing with our Institutional Review Board what level of detail may be released; we anticipate this will be part of future work. A note on this was added to the paper.
– The authors did not motivate why this computer vision paper is relevant for publication in an IoT journal.
– We have updated the introductory sections to improve communication of this motivation; the fundamental focus is on addressing real-world use cases at IoT scale, including the inherent limitations that arise from bounded control over the images (low quality, off boresight) and bounded processing.
– What are "Y" and "N" in Tables 5 and 6?
– "Y" and "N" are abbreviations for "Yes" and "No", respectively. This information has been added to the captions for Tables 5 and 6.
– Figures 10, 11, 12, and 13 take up a lot of space and are not illustrative. I suggest the authors summarize them in a table.
– An appendix was added containing Figures 10-13 (now Figures A1-A4) and their analysis, which does indeed save room and focus the discussion.
– How do the authors define the confidence score?
– This explanation has been added to the appendix.
– Did the authors get the proper authorizations from people to use their images to build the dataset, for both the Instagram-scraped images and the campus live recordings?
– All images from Instagram are publicly accessible and legal to use. All live recordings on campus, both the outdoor webcam captures and the Zoom-based presentation attacks, were pre-approved by Virginia Tech's Institutional Review Board (IRB) under protocol 20-736. Further, all participants received advance descriptions of what data would be collected, how the images would be used, etc. We agree this is a very valid concern, which is why we have not made the augmented dataset or video recordings from campus freely available. We aim to do that in future work that builds upon this foundation.
– The paper title begins with "Real-Time...". I did not find in the paper any deep analysis of the impact of the CV algorithms on edge AI devices (a Raspberry Pi, for example). The authors should tackle the implementation feasibility of the proposed solution.
– A more in-depth discussion on the implementation feasibility has been added to the conclusion.
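As a hedged illustration of how such feasibility might be gauged, one can time single-frame CPU inference on the target device; the model choice and input size below are assumptions for the sketch, not the paper's measured configuration:

```python
# Rough latency probe: average single-frame CPU inference time,
# as one would run on a Raspberry Pi-class device.
import time
import torch
from torchvision import models

model = models.mobilenet_v2(pretrained=True).eval()
frame = torch.randn(1, 3, 224, 224)  # assumed input size; the pipeline may differ

with torch.no_grad():
    for _ in range(5):               # warm-up iterations
        model(frame)
    n = 50
    start = time.perf_counter()
    for _ in range(n):
        model(frame)
    elapsed = time.perf_counter() - start
print(f"mean inference: {elapsed / n * 1e3:.1f} ms/frame")
```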
– Open question: Nowadays, to enter a building in many countries, people must show proof of vaccination (usually in the form of a QR code). How can this work be extended to detect whether the vaccination of a given person is valid (supposing that the person shows the QR code to the camera)?
– Because our research is being performed in the United States, QR code vaccination verifications are not in use, although masking requirements (independent of vaccination status) are. Therefore, we had not considered the new QR code vaccination proof; thanks for bringing it up! A commentary on this is added in the conclusion as a case for future work. In short, we believe this represents a data fusion challenge, where the QR code is a more intentional action by the user to have their data read by an existing QR scanner (the user will need to pause to make their QR code readable), in which case the variation in facial angles / capture times may be simplified, since the user could then be required to pause for a mask verification. As for the mask detection, we believe the unscripted variation as presented in the paper represents a more difficult IoT-scale verification. We will need to do a broader search of international protocols in future work.
Author Response File: Author Response.pdf
Reviewer 2 Report
The authors present a solution for mask recognition focused on a low-cost, real-time environment. They evaluate two transfer learning-based approaches with the MobileNetv2 and ResNet18 models. In addition, they include images with diversity in terms of ethnicity, gender, quality, and others.
However, I have a major concern regarding the research presented in the paper, specifically with respect to the metric selected to evaluate classifier performance. The authors presented the results in terms of accuracy (Table 2 and Table 4), which has been widely discouraged for unbalanced classification problems, as in the current case, where the amount of data in one class is significantly larger than in the other. For example, in the test set there are 416 images for the "mask" class and 13,621 images for the "no-mask" class (i.e., a ratio of 1:33). Let us assume that all images belonging to the "no-mask" class, together with 320 images belonging to the "mask" class, are correctly classified. Then this corresponds to an accuracy of 0.993 (similar to the results reported by the authors), but the classifier only performed well on 77% of the images belonging to the "mask" class (i.e., 320 out of 416). That is not a high value, and furthermore, the reported accuracy value of 99.3% does not reveal the mid-level performance of the classifier on the "mask" class.
I suggest reviewing the article entitled "A Survey of Predictive Modelling under Imbalanced Distributions", in which the authors state: "Considering a user preference bias towards the minority (positive) class examples, accuracy is not suitable because the impact of the least represented, but more important examples, is reduced when compared to that of the majority class. For instance, if we consider a problem where only 1% of the examples belong to the minority class, a high accuracy of 99% is achievable by predicting the majority class for all examples. Yet, all minority class examples, the rare and more interesting cases for the user, are misclassified. This is worthless when the goal is the identification of the rare cases." Additionally, they present classifier evaluation metrics for unbalanced problems.
Author Response
Reviewer 2: The authors present a solution for mask recognition focused on a low-cost, real-time environment. They evaluate two transfer learning-based approaches with the MobileNetv2 and ResNet18 models. In addition, they include images with diversity in terms of ethnicity, gender, quality, and others. However, I have a major concern regarding the research presented in the paper, specifically with respect to the metric selected to evaluate classifier performance. The authors presented the results in terms of accuracy (Table 2 and Table 4), which has been widely discouraged for unbalanced classification problems, as in the current case, where the amount of data in one class is significantly larger than in the other. For example, in the test set there are 416 images for the "mask" class and 13,621 images for the "no-mask" class (i.e., a ratio of 1:33). Let us assume that all images belonging to the "no-mask" class, together with 320 images belonging to the "mask" class, are correctly classified. Then this corresponds to an accuracy of 0.993 (similar to the results reported by the authors), but the classifier only performed well on 77% of the images belonging to the "mask" class (i.e., 320 out of 416). That is not a high value, and furthermore, the reported accuracy value of 99.3% does not reveal the mid-level performance of the classifier on the "mask" class. I suggest reviewing the article entitled "A Survey of Predictive Modelling under Imbalanced Distributions", in which the authors state: "Considering a user preference bias towards the minority (positive) class examples, accuracy is not suitable because the impact of the least represented, but more important examples, is reduced when compared to that of the majority class. For instance, if we consider a problem where only 1% of the examples belong to the minority class, a high accuracy of 99% is achievable by predicting the majority class for all examples. Yet, all minority class examples, the rare and more interesting cases for the user, are misclassified. This is worthless when the goal is the identification of the rare cases." Additionally, they present classifier evaluation metrics for unbalanced problems.
– Thank you for your in-depth analysis. We concur with the comments about using accuracy as a primary metric when evaluating a biased dataset and have added a section on this in the paper. The augmented data collected and used within our experiment both represents a more real-world analysis and contains less bias than the original datasets, so we have aimed to construct the improved models and make "accuracy" a more representative measure. As described in response to Reviewer 1's comments, the segmentation of "accuracy" results across a wide range of underlying contributors, mask vs. no-mask being dominant, hopefully helps clarify that we have not blindly applied "accuracy" and enabled it to bias our results in the same fashion that the underlying dataset is biased. In general, the accuracy performance of the bifurcated "mask" and "no-mask" use cases is comparable, with the underlying dataset being more of a 70/30 split than a 99/1 split, hopefully mitigating the concerns.
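For concreteness, the reviewer's worked example can be verified with a few lines of arithmetic, which also show how recall and F1 expose what accuracy hides:

```python
# Reproducing the reviewer's example: all 13,621 "no-mask" images correct,
# 320 of 416 "mask" images correct ("mask" treated as the positive class).
tp, fn = 320, 416 - 320
tn, fp = 13_621, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)                   # ~0.9932
mask_recall = tp / (tp + fn)                                 # ~0.769
precision = tp / (tp + fp)                                   # 1.0 here (fp = 0)
f1 = 2 * precision * mask_recall / (precision + mask_recall) # ~0.870

print(f"accuracy={accuracy:.4f}  mask recall={mask_recall:.3f}  F1={f1:.3f}")
```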
Author Response File: Author Response.pdf
Reviewer 3 Report
The paper "Real-Time Mask Recognition", by Rachel M. Billings and Alan J. Michaels, highlights a modality to analyze whether individuals are wearing face masks, while at the same time characterizing the distribution across race, gender, image luminosity, and so on, based on face recognition software. The manuscript respects the journal's template.
- Line 37: the word "lieu" is usually used in French, with the meaning of "place". Even if it appears rarely in English, I recommend changing the expression by replacing the word "lieu".
- Lines 188-191: in the following paragraph, the authors say: "...A few examples of edge cases that are shown to impact classification performance include people wearing masks with animal faces (such as a dog nose), human faces, or other distracting patterns printed on them, and people with beards or mustaches that extend past the mask". My question is: where exactly are these results presented in the manuscript?
- The authors built the article using some images, but they do not present them. Instead, they say something about the implementation of the algorithms for image processing, but without showing any lines of code, formulas, etc. They have inserted some references, but, as we already know, one does not use the entire routine from a reference; it has to be adapted to one's own case. Please insert some lines of code or the formulas used, and some steps of the image while it is processed.
- Line 341: I think after "240 x 240" you should write "pixels".
Author Response
Reviewer 3: The paper "Real-Time Mask Recognition", by Rachel M. Billings and Alan J. Michaels, highlights a modality to analyze whether individuals are wearing face masks, while at the same time characterizing the distribution across race, gender, image luminosity, and so on, based on face recognition software. The manuscript respects the journal's template.
– Line 37: the word "lieu" is usually used in French, with the meaning of "place". Even if it appears rarely in English, I recommend changing the expression by replacing the word "lieu".
– Thanks for noticing this! I meant to say 'light' but made a mistake while typing.
– Lines 188-191: in the following paragraph, the authors say: "...A few examples of edge cases that are shown to impact classification performance include people wearing masks with animal faces (such as a dog nose), human faces, or other distracting patterns printed on them, and people with beards or mustaches that extend past the mask". My question is: where exactly are these results presented in the manuscript?
– We have added additional detail in the paper to address this concern. In general, many of the presentation attacks (e.g., a dog wearing a mask) occur in such small quantities, and are only considered during the evaluation phase (i.e., they are not included in the training of the model), that we have limited ability to draw concrete conclusions as to why errors occur (only with what accuracy), and we have little concern about them affecting the primary use of the model since they are not included in the training phases. Our primary goal is to document the breaking point(s) of the model when presented with data outside its primary training.
– The authors built the article using some images, but they do not present them. Instead, they say something about the implementation of the algorithms for image processing, but without showing any lines of code, formulas, etc. They have inserted some references, but, as we already know, one does not use the entire routine from a reference; it has to be adapted to one's own case. Please insert some lines of code or the formulas used, and some steps of the image while it is processed.
– We have added a pseudo-code algorithm near Figure 1 to help clarify the image classification process. Open-sourcing the actual dataset is under discussion with our IRB, yet it will likely require collection of images under a different / future protocol.
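As a rough illustration of the kind of pipeline such pseudo-code describes, here is a hedged sketch of a per-frame classification loop; the Haar-cascade face detector, 224 x 224 input size, and preprocessing are assumptions for illustration, not the authors' exact implementation:

```python
# Hedged sketch of a real-time mask classification loop:
# capture frame -> detect faces -> crop -> classify each crop.
import cv2
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),   # assumed model input size
    transforms.ToTensor(),
])
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def classify_frame(frame, model):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    results = []
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 4):
        crop = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
        with torch.no_grad():
            probs = torch.softmax(model(preprocess(crop).unsqueeze(0)), dim=1)
        conf, label = torch.max(probs, dim=1)
        results.append(((x, y, w, h), label.item(), conf.item()))
    return results
```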
– Line 341: I think after "240 x 240" you should write "pixels".
– Done!
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The authors successfully addressed my questions. I congratulate the authors on this interesting work.
Reviewer 2 Report
--