Article
Peer-Review Record

Swin Transformer-Based Object Detection Model Using Explainable Meta-Learning Mining

Appl. Sci. 2023, 13(5), 3213; https://doi.org/10.3390/app13053213
by Ji-Won Baek 1 and Kyungyong Chung 2,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 23 December 2022 / Revised: 1 March 2023 / Accepted: 1 March 2023 / Published: 2 March 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

This manuscript presents a Swin Transformer-based object detection model to detect fires. The method combines the Swin Transformer and YOLOv3 models and applies meta-learning to construct an explainable object detection model. Additionally, few-shot learning is applied to enable effective learning with small data. I believe that this work is relatively novel and meaningful, but there are significant shortcomings in the background justification and experimental design in the manuscript. Revisions are needed before considering acceptance of this manuscript.

 

1. In the Related Work section of the manuscript, the author provides summaries of Object Detection Technology Based on Vision Transformer and Trends in Meta-Learning Technology. However, there are many kinds of target detection algorithms in the field of fire detection, and the author does not summarize the development of these algorithms or their advantages and disadvantages. There is a lack of comparison between the proposed algorithm and previous algorithms in this field.

 

2.     In the manuscript, YOLOv3 is chosen as the detection algorithm, but the YOLO series has already updated many algorithms. What are the advantages of the YOLOv3 algorithm compared to other YOLO series algorithms? Perhaps using a more advanced algorithm from the YOLO series would lead to more improvement? The author does not provide a comparison or explanation.

 

3.     In the manuscript, Grad-CAM is applied to provide a visualization explanation of the proposed algorithm, but what is the contribution of Grad-CAM to this algorithm? Did the author make any adjustments to the network or improvements to the algorithm based on the visualization results?

 

4. Where is formula (1) in the manuscript? Formula (2) does not conform to the writing requirements for formulas, and the characters in the formula are not explained or interpreted.

 

5. Why were YOLACT and YOLACT++ chosen for comparison with the proposed algorithm in the manuscript? Are they the SOTA algorithms in this field? A justification and explanation are needed. It is also suggested to compare with other types of object detection algorithms.

 

6. There are few example images of detection results displayed in the manuscript. It is suggested to add more example images of correct and incorrect detection results, as well as to increase the number of example images of detection results from different algorithms for comparison. For example, comparing the detection results from different algorithms in Tables 1 and 2 with example images would more clearly demonstrate the advantages of the algorithms.

 

7.     While the manuscript concludes with a summary of the advantages of the proposed algorithm, it is also necessary to summarize the weaknesses and reasons for these weaknesses, and suggest directions for improvement and future research priorities.

Author Response

Reviewer#1:

This manuscript presents a Swin Transformer-based object detection model to detect fires. The method combines the Swin Transformer and YOLOv3 models and applies meta-learning to construct an explainable object detection model. Additionally, few-shot learning is applied to enable effective learning with small data. I believe that this work is relatively novel and meaningful, but there are significant shortcomings in the background justification and experimental design in the manuscript. Revisions are needed before considering acceptance of this manuscript.

1. In the Related Work section of the manuscript, the author provides summaries of Object Detection Technology Based on Vision Transformer and Trends in Meta-Learning Technology. However, there are many kinds of target detection algorithms in the field of fire detection, and the author does not summarize the development of these algorithms or their advantages and disadvantages. There is a lack of comparison between the proposed algorithm and previous algorithms in this field.

Author response: Thank you for your kind comment.

Author action: I have added descriptions of studies related to fire detection in Chapter 2 (Related Work).

2.3. Fire Prediction Technology using Object Detection

The National Fire Information System [1] of the National Fire Agency is a pan-national system that informs the public of fire damage and promotes fire prevention by providing statistical information on the mechanisms from fire occurrence to evacuation and fire suppression. It helps each institution establish fire prevention policies by providing real-time fire risk information to the public and related agencies through real-time anomaly detection and monitoring functions. Accordingly, the risk of fire can be recognized in advance and used as fire prevention and response data. However, it is inefficient due to the high cost of manpower for real-time monitoring, reporting in the event of a flame, and problems with device malfunction. To solve this problem, an accurate and effective fire prediction technology using deep learning is needed. Therefore, F. Saeed et al. [22] proposed a machine learning-based approach for multimedia surveillance in the event of a fire emergency. This led to the efficient prediction of fires using sensor data such as smoke, heat, and gas for the training of the Adobe-MLP model. Accordingly, it is a CNN model that can detect a fire immediately when one occurs. However, sensor data can reduce prediction performance in case of malfunction. Z. Tang et al. [23] proposed a deep learning-based wildfire event object detection model for aerial image data. This creates a labeled dataset to be shared publicly from data collected from aviation. It is also a coarse-to-fine framework for automatically detecting rare, small, irregularly shaped wildfires. The coarse detector adaptively selects sub-areas that are likely to contain objects of interest, and the fine detector performs additional scrutiny by passing only the details of sub-areas, not the entire area. Accordingly, the learning time for prediction may be reduced and accuracy may be improved. However, there is the disadvantage that it can only be used if a flame is detected. Also, S. Mohana Kumar et al. [24] proposed a forest fire prediction method using image processing and machine learning. It aims to predict forest fires from an input stream of images. The proposed method uses image processing, background removal, and special wavelet analysis. An SVM model is used to classify candidate areas as real fire or non-fire. Fast fire detection is possible by using a Faster R-CNN object detection model for full-image convolution operation. However, the larger the amount of data, the slower the training speed.

[22] Saeed, F.; Paul, A.; Hong, W. H.; Seo, H. Machine learning based approach for multimedia surveillance during fire emer-gencies. Multimed. Tools Appl., 2020, 79, 16201-16217.

[23] Tang, Z.; Liu, X.; Chen, H.; Hupy, J.; Yang, B. Deep learning based wildfire event object detection from 4K aerial images acquired by UAS. AI, 2020, 1, 166-179.

[24] Mohana Kumar, S.; Sowmya, B. J.; Priyanka, S.; Ruchita Sharma, S. T. Forest Fire Prediction Using Image Processing and Machine Learning. Nat. Volatiles Essent., 2021, 13116-13134.

 

2. In the manuscript, YOLOv3 is chosen as the detection algorithm, but the YOLO series has already updated many algorithms. What are the advantages of the YOLOv3 algorithm compared to other YOLO series algorithms? Perhaps using a more advanced algorithm from the YOLO series would lead to more improvement? The author does not provide a comparison or explanation.

Author response: Thank you for your kind comment.

Author action: The reasons for using YOLOv3 are as follows.

YOLOv3 can predict bounding boxes at various scales, so even small objects in the data can be detected. In addition, the sigmoid function is applied to all classes at the last layer to enable binary classification for each class. It was also judged to be suitable for real-time detection because its detection speed is fast. Accordingly, it is easy to classify normal and abnormal states through the binary classification results in real-time fire risk prediction.
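To make the binary-classification point concrete, here is a minimal sketch, using purely hypothetical logits, of how per-class sigmoid scores (as in YOLOv3's last layer) allow an independent normal/abnormal decision for each class, unlike a softmax that forces a single winner:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw class logits from the detection head for one box
# (illustrative class order: black smoke, gray smoke, white smoke, flame).
logits = np.array([2.1, -0.3, -1.5, 0.8])

# Sigmoid scores each class independently, so several classes can
# exceed the decision threshold at the same time.
scores = sigmoid(logits)
detected = scores > 0.5

print(scores)    # [0.89 0.43 0.18 0.69] (rounded)
print(detected)  # [ True False False  True]
```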

 

3. In the manuscript, Grad-CAM is applied to provide a visualization explanation of the proposed algorithm, but what is the contribution of Grad-CAM to this algorithm? Did the author make any adjustments to the network or improvements to the algorithm based on the visualization results?

Author response: Thank you for your kind comment.

Author action: Grad-CAM was used as the visualization method among the explainability techniques. In this paper, the contribution of Grad-CAM is to show which parts the proposed model focuses on and judges by when performing classification.
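For reference, the standard Grad-CAM computation can be sketched as below. This is an illustrative implementation, not the authors' code; `model`, `feature_layer`, `image`, and `class_idx` are assumed inputs (a network returning class scores, one of its convolutional layers, an input tensor, and a target class index):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, feature_layer, image, class_idx):
    """Weight the feature maps by the pooled gradients of the target
    class score, sum over channels, and keep the positive part."""
    activations, gradients = [], []
    h1 = feature_layer.register_forward_hook(
        lambda mod, inp, out: activations.append(out))
    h2 = feature_layer.register_full_backward_hook(
        lambda mod, gin, gout: gradients.append(gout[0]))

    score = model(image.unsqueeze(0))[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()

    acts, grads = activations[0], gradients[0]        # (1, C, H, W)
    weights = grads.mean(dim=(2, 3), keepdim=True)    # pooled gradients
    cam = F.relu((weights * acts).sum(dim=1))         # (1, H, W)
    return (cam / (cam.max() + 1e-8)).squeeze(0).detach()
```

The resulting map is up-sampled to the input resolution and overlaid on the image to show which regions drove the classification.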

 

4. Where is formula (1) in the manuscript? Formula (2) does not conform to the writing requirements for formulas, and the characters in the formula are not explained or interpreted.

Author response: Thank you for your kind comment.

Author action: Descriptions of the variables in the formulas have been added.

In this study, AP values for multiple classes are needed. Therefore, the AP value for each class is calculated, and their mean value is extracted. To do that, Mean Average Precision (mAP) is applied. mAP extracts the mean of the per-class AP values, with each AP computed by interpolating all points. Equation 1 presents the interpolation of all points.

 

$AP = \sum_{n} (r_{n+1} - r_{n})\, p_{\mathrm{interp}}(r_{n+1}), \qquad p_{\mathrm{interp}}(r) = \max_{\tilde{r} \ge r} p(\tilde{r})$                        (1)

 

The higher the mAP value, the higher the accuracy; the lower the value, the lower the accuracy. Equation 2 presents mAP. $AP_{i}$ represents the AP value of the $i$-th class, and $N$ means the total number of classes to evaluate.

 

$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i}$                              (2)
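As a worked sketch of Equations (1) and (2), assuming hypothetical per-class precision/recall points: the all-point interpolation first makes precision monotonically non-increasing, then integrates it over recall, and mAP averages the per-class results.

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP, Equation (1)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])        # interpolate all points
    idx = np.where(r[1:] != r[:-1])[0]    # steps where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """Equation (2): the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical precision/recall points for one class.
ap_flame = average_precision(np.array([0.2, 0.6, 1.0]),
                             np.array([1.0, 0.8, 0.5]))
print(ap_flame)                                         # 0.72
print(mean_average_precision([ap_flame, 0.42, 0.55]))   # ~0.56
```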

 

 

5. Why were YOLACT and YOLACT++ chosen for comparison with the proposed algorithm in the manuscript? Are they the SOTA algorithms in this field? A justification and explanation are needed. It is also suggested to compare with other types of object detection algorithms.

Author response: Thank you for your kind comment.

Author action: The reasons for using YOLACT and YOLACT++ as comparison models are described as follows, and a performance evaluation against YOLOS, a basic baseline model, was additionally conducted.

The network used in the proposed method has a feature pyramid structure. Therefore, the performance of object detection models based on the feature pyramid structure is compared. The object detection models used for comparison are the YOLACT, YOLACT++, and YOLOS models.

Object Detection Model                               mAP
YOLACT [40]                                          30.45
YOLACT++ [41]                                        34.65
YOLOS [15]                                           30.04
Swin Transformer + YOLO + few-shot learning (ours)   51.52

 

6. There are few example images of detection results displayed in the manuscript. It is suggested to add more example images of correct and incorrect detection results, as well as to increase the number of example images of detection results from different algorithms for comparison. For example, comparing the detection results from different algorithms in Tables 1 and 2 with example images would more clearly demonstrate the advantages of the algorithms.

Author response: Thank you for your kind comment.

Author action: An example was added.

Figure 7 (a) and (b) show that Grad-CAM normally assigns the highest importance in fire prediction to the fire. In addition, low importance indicates the smoke occurring around the fire. However, there is a disadvantage that smoke far from the fire could not be detected. In addition, (c) in Figure 8 shows incorrectly detected results. It detected the entire part, not the fire part, rather than the fire part. It is judged that the focus is more on other objects. Thus, the generalization performance for the dataset is poor.

 

7. While the manuscript concludes with a summary of the advantages of the proposed algorithm, it is also necessary to summarize the weaknesses and reasons for these weaknesses, and suggest directions for improvement and future research priorities.

Author response: Thank you for your kind comment.

Author action: The conclusion has been revised.

This study proposed the Swin Transformer-based object detection model using explainable meta-learning mining. The proposed method is based on three steps. In the first step, prediction data were collected. In the second step, Swin Transformer-based object detection was carried out. Since it ignores modeling for image local structures like lines and edges in the course of transforming images into patches, it causes information loss. To solve the problem, the shifted window of Swin Transformer was applied. The feature pyramid network of YOLOv3 was also used to detect objects of various sizes. Nevertheless, only with YOLOv3, it is hard to detect small objects. Accordingly, Swin Transformer was combined with the approach in order to detect objects at various scales. In the third step, meta-learning-based object detection mining was applied. This approach enables efficient learning with a small amount of data and prevents information loss. Few-shot learning, one of the meta-learning methods, was used. It generated a feature extractor on the basis of distance and improved object detection. In order to find the cause for the object classification result from the proposed method, explainable visualization was applied. With the use of Grad-CAM, it was possible to keep the high resolution, reduce noise, and find the cause for classification easily. The performance of the proposed model was evaluated. In addition, the proposed one was compared with a conventional object detection algorithm. As a result, the proposed method was evaluated to have excellent performance. Therefore, the proposed method makes it possible to monitor CCTVs continuously and detect abnormality effectively. However, we do not consider the generalization of the data. Therefore, there are limitations to the data used. To address this, future studies will improve generalization performance on various datasets.

Author Response File: Author Response.docx

Reviewer 2 Report

Please check attachment.

Comments for author File: Comments.pdf

Author Response

Reviewer#2:

1. The research results can be detailed in the abstract with more sentences to help readers understand the main contributions quickly.

Author response: Thank you for your kind comment.

 

2. Please compare with more studies to clarify the optimization of this proposed research.

Author response: Thank you for your kind comment.

Author action: A performance evaluation with YOLOS, a basic baseline model, was additionally conducted.

Object Detection Model                               mAP
YOLACT [40]                                          30.45
YOLACT++ [41]                                        34.65
YOLOS [15]                                           30.04
Swin Transformer + YOLO + few-shot learning (ours)   51.52

 

3. There is no detailed description of the activation function for each layer. Please state what activation functions are chosen and why these activation functions are chosen in this research.

Author response: Thank you for your kind comment.

Author action: This is described in the explanation of Figure 5.

In Figure 5, the architecture consists of the Swin Transformer section and the YOLOv3 section. Swin Transformer transforms input image data into patches and merges them by gradually increasing the patch size from a small one. Window-Multi-Head Self-Attention executes the self-attention operation between patches found in the current window. It can reduce the cost of operations due to the high association of adjacent pixels. Shifted-Window-Multi-Head Self-Attention shifts the window and applies a mask in order to prevent self-attention between patches that are not truly adjacent. That is because it is meaningless for a patch that has moved to the opposite side through the shift to execute the self-attention operation. After the mask operation, the original arrangement is recovered, thereby preserving the connectivity between windows. The object detector is based on YOLOv3. DarkNet53 is used as the backbone network. Initial, intermediate, and final feature maps are used in the feature pyramid network. In the feature pyramid network, each feature map is brought to the same size through up-sampling and concatenated. Accordingly, binary classification of the image is possible through the sigmoid function.
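A minimal sketch of the feature-pyramid step described above, with illustrative YOLOv3-style grid sizes and channel counts rather than the authors' exact network: the intermediate and final feature maps are up-sampled to the size of the initial map, concatenated, and passed to a sigmoid-activated head.

```python
import torch
import torch.nn.functional as F

# Illustrative backbone outputs (batch, channels, height, width).
c3 = torch.randn(1, 256, 52, 52)    # initial feature map
c4 = torch.randn(1, 512, 26, 26)    # intermediate feature map
c5 = torch.randn(1, 1024, 13, 13)   # final feature map

def upsample_to(x, size):
    # Nearest-neighbour up-sampling, as commonly used in YOLOv3's neck.
    return F.interpolate(x, size=size, mode="nearest")

# Bring every map to the same spatial size and concatenate channels.
fused = torch.cat(
    [c3, upsample_to(c4, (52, 52)), upsample_to(c5, (52, 52))], dim=1)

# 1x1 head producing per-class scores; sigmoid enables independent
# binary classification per class, as described above.
head = torch.nn.Conv2d(fused.shape[1], 4, kernel_size=1)
scores = torch.sigmoid(head(fused))

print(fused.shape, scores.shape)
# torch.Size([1, 1792, 52, 52]) torch.Size([1, 4, 52, 52])
```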

 

4. There is no detailed description of what percentages of the training set and testing set are used for each dataset. Please state what percentages of the training set and testing set are chosen and why these percentages are chosen in this research. Please also show the validation performance of the proposed method and other models with figures.

Author response: Thank you for your kind comment.

Author action: In Chapter 4.2, the split of the dataset is described as follows:

In this paper, 35,159 fire prediction image data provided by AI hub are used for performance evaluation. Accordingly, to prevent overfitting, the training dataset is 50%, the validation dataset is 10%, and the test dataset is 40%.
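A sketch of this 50%/10%/40% split, assuming a generic list of image paths and labels in place of the actual AI-hub annotations:

```python
from sklearn.model_selection import train_test_split

# Stand-ins for the 35,159 fire images and their class labels.
images = [f"img_{i:05d}.jpg" for i in range(35159)]
labels = [i % 4 for i in range(35159)]  # 4 illustrative classes

# First take 50% for training; 10% of the whole set is then 0.2 of
# the remaining half, leaving 40% for testing.
train_x, rest_x, train_y, rest_y = train_test_split(
    images, labels, train_size=0.5, random_state=0)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, train_size=0.2, random_state=0)

print(len(train_x), len(val_x), len(test_x))  # 17579 3516 14064
```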

 

5. There is no discussion of what computing specifications and software are applied in your experiments. Please list them, such as computer model, CPU, GPU, RAM, etc.

Author response: Thank you for your kind comment.

Author action: In Chapter 4.2, the experimental environment is described as follows.

In addition, the implementation environment uses an AMD Ryzen 9 5900X 12-core processor at 3.70 GHz, 96 GB RAM, an NVIDIA GeForce RTX 3090, and Python 3.6.

 

6. I suggest providing more explanation of the conclusions from the corresponding studies.

Author response: Thank you for your kind comment.

Author action: A discussion of the performance evaluation results was added.

In the evaluation results of Table 1, the mAP of the standard Swin Transformer is evaluated as 46.5. Also, in the case of Swin Transformer + YOLO, the mAP is 50.54, which is higher than the standard Swin Transformer [39]. This is considered to be because Swin Transformer + YOLO can detect object information and pixel information at various scales. In addition, in the case of the proposed Swin Transformer + YOLO + few-shot learning (ours), the mAP is the highest at 51.2.

Table 2 compares the mAP of the YOLACT, YOLACT++, and YOLOS models and the proposed model. The YOLACT and YOLACT++ models have the advantage of real-time detection. As a result of the performance evaluation, the proposed method scores higher by about 17 mAP points. Therefore, through the proposed method, it is possible to predict and monitor fire occurrence in real time.

The performance of the proposed model was evaluated. In addition, the proposed one was compared with a conventional object detection algorithm. As a result, the proposed method was evaluated to have excellent performance. Therefore, the proposed method makes it possible to monitor CCTVs continuously and detect abnormality effectively. However, we do not consider the generalization of the data. Therefore, there are limitations to the data used. To address this, future studies will improve generalization performance on various datasets.

 

7. Figure 6 appears twice in this manuscript. Please revise this issue.

Author response: Thank you for your kind comment.

Author action: The modifications have been reflected as follows.

Figure 6 → Figure 7

 

8. Where is the equation (1)?

Author response: Thank you for your kind comment.

Author action: Sorry, there was an error. Therefore, we modified it as follows.

Equation (2) → Equation (1), and Equation (3) → Equation (2)

 

9. The APi was not included in equation (3).

Author response: Thank you for your kind comment.

Author action: Sorry, there was an error. Therefore, we modified it as follows.

[Before/After equation images: Equation (2) was revised to include APi.]

 

 

10. Please refine the Figure 1.

Author response: Thank you for your kind comment.

Author action: The checked part has been corrected.

11. Please refine the Figure 2.

Author response: Thank you for your kind comment.

Author action: The checked part has been corrected.

 

12. What’s wrong with “time cost []” at line 35?

Author response: Thank you for your kind comment.

Author action: Sorry, there was an error. Therefore, we modified it as follows:

Before: In fact, the human monitoring of numerous CCTV devices is problematic in terms of accuracy, efficiency, and time cost [].

After: In fact, the human monitoring of numerous CCTV devices is problematic in terms of accuracy, efficiency, and time cost.

 

 

13. A total of 12 pages is not enough to describe this topic in detail. Please expand the contents of this manuscript.

Author response: Thank you for your kind comment.

Author action: We have incorporated as many of the reviewers' suggested revisions as possible.

Author Response File: Author Response.docx

Reviewer 3 Report

From the point of innovation and framework of the paper, this paper should be rejected for the following three reasons: first, the advantages and disadvantages of the previous work are not clearly expounded; in other words, the motivation for writing the paper is not explained.

 

Second, the paper lacks innovation; it is only a simple combination of the Swin Transformer with meta-learning using explainable AI, without explaining the adequacy and advantages of the combination of these models.

 

Third, please find a native speaker to improve the presentation. In line no. 35, the reference number is missing.

Fourth, sentence formation needs to be improved.

- "Of many object detection methods, YOLO-based objection detection has a One-stage design. So, it features fast detection speed and low accuracy" (lines 40 and 41) is meaningless.

 

Fifth, Section 2 needs modification to match the flow of the paper.

Section 4 (results and performance evaluation) should be expanded and improved to support the proposed model.

 

Author Response

Reviewer#3:

From the point of innovation and framework of the paper, this paper should be rejected for the following three reasons:

1. First, the advantages and disadvantages of the previous work are not clearly expounded; in other words, the motivation for writing the paper is not explained.

Author response: Thank you for your kind comment.

Author action: By adding trends in fire detection technology, the disadvantages of the fire detection methods in previous studies and the necessity of the proposed method are described.

2.3. Fire Prediction Technology using Object Detection

(The full text of this added section is quoted in the response to Reviewer 1, comment 1, above.)

 

2. Second, the paper lacks innovation; it is only a simple combination of the Swin Transformer with meta-learning using explainable AI, without explaining the adequacy and advantages of the combination of these models.

Author response: Thank you for your kind comment.

Author action: The advantages and contributions of the proposed method are written in the introduction.

  • Based on Swin Transformer, it solves the problem of information loss caused by ignoring the modeling of image local structures like patch lines and edges.
  • In the combination of YOLOv3 and Swin Transformer, Swin Transformer makes it possible to detect small objects of various sizes, the ones which are hard to detect with YOLOv3.
  • Grad-CAM as an explainable visualization technique is used to keep data at high resolution, reduce noise, and find the causes of classification accurately.

 

3. Third, please find a native speaker to improve the presentation. In line no. 35, the reference number is missing.

Author response: Thank you for your kind comment.

Author action: Sorry, there was an error. Therefore, we modified it as follows:

Before: In fact, the human monitoring of numerous CCTV devices is problematic in terms of accuracy, efficiency, and time cost [].

After: In fact, the human monitoring of numerous CCTV devices is problematic in terms of accuracy, efficiency, and time cost.

 

4. Fourth, sentence formation needs to be improved. - "Of many object detection methods, YOLO-based objection detection has a One-stage design. So, it features fast detection speed and low accuracy" (lines 40 and 41) is meaningless.

Author response: Thank you for your kind comment.

Author action: The grammar of the sentence was modified as follows.

Before: Of many object detection methods, YOLO-based objection detection has a One-stage design. So, it features fast detection speed and low accuracy.

After: Among several object detection methods, object detection using YOLO is designed as a one-stage, so it has a fast detection speed but low accuracy.

 

5. Fifth, Section 2 needs modification to match the flow of the paper. Section 4 (results and performance evaluation) should be expanded and improved to support the proposed model.

Author response: Thank you for your kind comment.

Author action: The point was corrected.

 

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The manuscript revised some of the problems raised before, such as adding the research background in the field of fire detection. However, the experimental design and the demonstration of the experimental results still need to be overhauled. I think this manuscript should be rejected. The following are suggestions for the revision of this manuscript.

1. In Section 4.2 (Performance Evaluation) of the manuscript, the detection models compared with the proposed model are all of the same type. The proposed model should be compared with more different types of detection models to further discuss its progressiveness.

2. After the revision, there are still few figures showing the detection results in the manuscript. It is suggested to add a comparison illustration of the detection results of different models mentioned in Table 2. A detailed analysis of the results should be conducted to more effectively highlight the advantages of the proposed model.

3. There is no Figure 8 in the manuscript. Please modify the sentence ‘(c) in Figure 8 shows incorrectly detected results.' (Page 10, line 367). The author should thoroughly review the formatting and grammar of the manuscript.

 

 

 

Author Response

Reviewer#1:

The manuscript revised some of the problems raised before, such as adding the research background in the field of fire detection. However, the experimental design and the demonstration of the experimental results still need to be overhauled. I think this manuscript should be rejected. The following are suggestions for the revision of this manuscript.

1. In Section 4.2 (Performance Evaluation) of the manuscript, the detection models compared with the proposed model are all of the same type. The proposed model should be compared with more different types of detection models to further discuss its progressiveness.

Author response and action: Thank you for your kind comment. However, the excellence of the proposed method can be evaluated when models of the same type are compared under the same evaluation.

2. After the revision, there are still few figures showing the detection results in the manuscript. It is suggested to add a comparison illustration of the detection results of different models mentioned in Table 2. A detailed analysis of the results should be conducted to more effectively highlight the advantages of the proposed model.

Author response and action: Thank you for your kind comment. Instead of adding a picture, I conducted an additional performance evaluation that could indicate the excellence of the proposed method.

Table 2. The results of the performance comparison between object detection algorithms.

Object Detection Model                               mAP
YOLACT [40]                                          30.45
YOLACT++ [41]                                        34.65
YOLOS [15]                                           30.04
Swin Transformer + YOLO + few-shot learning (ours)   51.52

 

Table 3. The performance evaluation results based on the confusion matrix.

Object Detection Model                               Accuracy   F-measure
YOLACT [40]                                          66.12      67.88
YOLACT++ [41]                                        67.54      69.24
YOLOS [15]                                           64.52      66.18
Swin Transformer + YOLO + few-shot learning (ours)   70.84      69.84

3. There is no Figure 8 in the manuscript. Please modify the sentence ‘(c) in Figure 8 shows incorrectly detected results.' (Page 10, line 367). The author should thoroughly review the formatting and grammar of the manuscript.

Author response and action: Thank you for your kind comment. We revised the sentence as follows. Also, we reviewed the format and grammar again.

Before: In addition, (c) in Figure 8 shows incorrectly detected results. It detected the entire part, not the fire part, rather than the fire part

After: In addition, (c) in Figure 7 shows incorrectly detected results. It detected the entire part, not the fire part, rather than the fire part.

Author Response File: Author Response.docx

Reviewer 2 Report

Thanks for your responses.
Please note the following comments:

1. The authors did not respond well to my comments, such as comments 1, 2, 4, and 13.

2. A total of 12 pages (excluding references) is still not enough to describe this topic in detail. Please try to expand the contents of this manuscript, given the high standards of this journal.

Author Response

Reviewer#2:

Thanks for your responses.

Please note the following comments:

 

1. The authors did not respond well to my comments, such as comments 1, 2, 4, and 13.

Author response and action: Thank you for your kind comment. I checked the previous review version, and I will reply again as follows.

  • The research results can be detailed in the abstract with more sentences to help readers understand the main contributions quickly.

Author action: The contributions of this paper were written in more detail as follows.

Before:

  • Based on Swin Transformer, it solves the problem of information loss caused by ignoring the modeling of image local structures like patch lines and edges.

  • In the combination of YOLOv3 and Swin Transformer, Swin Transformer makes it possible to detect small objects of various sizes, the ones which are hard to detect with YOLOv3.

  • Grad-CAM as an explainable visualization technique is used to keep data at high resolution, reduce noise, and find the causes of classification accurately.

After:

  • General transformers have the potential to cause information loss on image local structures such as lines and edges in the process of generating images as patches. Therefore, based on Swin Transformer, it solves the problem of information loss caused by ignoring the modeling of image local structures like patch lines and edges.

  • There are objects of various sizes in the image data. In order to detect all of these, it is necessary to be able to detect objects of various sizes. Therefore, in the combination of YOLOv3 and Swin Transformer, Swin Transformer makes it possible to detect small objects of various sizes, the ones which are hard to detect with YOLOv3.

  • Grad-CAM as an explainable visualization technique is used to keep data at high resolution, reduce noise, and find the causes of classification accurately.

 

  • Please compare with more studies to clarify the optimization of this proposed research.

Author action: A performance evaluation with YOLOS, a basic baseline model, was additionally conducted.

Table 2. The results of the performance comparison between object detection algorithms.

Object Detection Model                               mAP
YOLACT [40]                                          30.45
YOLACT++ [41]                                        34.65
YOLOS [15]                                           30.04
Swin Transformer + YOLO + few-shot learning (ours)   51.52

 

In addition, to evaluate the classification results, accuracy and F-measure were computed based on the confusion matrix.

In addition, accuracy and F-measure are compared based on the confusion matrix to evaluate the classification results of each model. Equation 3 shows the accuracy based on the confusion matrix.

 

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$                        (3)

 

Also, Equation 4 shows the F-measure.

 

$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$                      (4)

 

In Equations (3) and (4), TP (True Positive) indicates the case where the predicted value is true and the actual value is true. TN (True Negative) represents a case where the predicted value is false and the actual value is false. FP (False Positive) indicates the case where the predicted value is true but the actual value is false. FN (False Negative) represents a case where the predicted value is false but the actual value is true. Table 3 shows the performance evaluation results based on the confusion matrix.
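As a worked check of Equations (3) and (4), a minimal sketch with hypothetical confusion-matrix counts (the actual results follow in Table 3):

```python
def accuracy(tp, tn, fp, fn):
    # Equation (3): correct predictions over all predictions.
    return (tp + tn) / (tp + tn + fp + fn)

def f_measure(tp, fp, fn):
    # Equation (4): harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one model on a test set.
print(accuracy(7000, 100, 2000, 900))  # 0.71
print(f_measure(7000, 2000, 900))      # ~0.83
```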

 

Table 3. The performance evaluation results based on the confusion matrix.

Object Detection Model                               Accuracy   F-measure
YOLACT [40]                                          66.12      67.88
YOLACT++ [41]                                        67.54      69.24
YOLOS [15]                                           64.52      66.18
Swin Transformer + YOLO + few-shot learning (ours)   70.84      69.84

 

As a result of the evaluation in Table 3, the proposed method is evaluated as about 2-4% better than the other methods. YOLACT and YOLACT++ also detect smoke in occluded parts through Mask R-CNN. YOLOS loses information while composing images into patches, so its accuracy is evaluated as low. However, the proposed method performs better than the other methods by detecting objects at various scales and minimizing information loss during patch construction.

 

  • There is no detailed description of what percentages of the training set and testing set are used for each dataset. Please state what percentages of the training set and testing set are chosen and why these percentages are chosen in this research. Please also show the validation performance of the proposed method and other models with figures.

Author action: The dataset was divided in the ratio 5:1:4 into the training set, validation set, and test set, respectively, to prevent overfitting. In addition, performance comparisons with other methods are shown in Tables 2 and 3.

In this paper, 35,159 fire prediction image data provided by AI hub are used for performance evaluation. Accordingly, to prevent overfitting, the training dataset is 50%, the validation dataset is 10%, and the test dataset is 40%.

(See Table 2 above.)

 

2. A total of 12 pages (excluding references) is still not enough to describe this topic in detail. Please try to expand the contents of this manuscript, given the high standards of this journal.

Author response and action: Thank you for your kind comment. The number of pages was increased based on reviewers' reviews.

 

Author Response File: Author Response.docx

Reviewer 3 Report

The authors are requested to have their individual responses to the review comments in blue and to change the font colour to red where changes have been made in the text of the manuscript.

 1. Abstract should be rewritten.

 2. Line nos. 240-241: “The fire scene is divided into black smoke, grey smoke, white smoke, and fire (flame) classes. class is composed.” - The sentence is to be reframed to provide meaningful information.

 3.  In Section 3.1. Collect and preprocess risk prediction data- The author discussed the decision class labels, but should explain in detail the preprocessing technique involved. Also, since the smoke is involved in the processed images, how the image clarity is achieved using swin transformer as a feature extractor to help in the quick detection process.

 4. It would be nice to make a picture of the algorithm of the realized research.

 5. As the authors mentioned that “there is a disadvantage that smoke far from the fire could not be detected. In addition, (c) in Figure 8 shows incorrectly detected results. It detected the entire part, not the fire part, rather than the fire part. It is judged that the focus is more on other objects. Thus, the generalization performance for the dataset is poor.” - the possibility of improving the performance on a generalized dataset can be explained as future scope along with the conclusion.

 6. Mean Average Precision (mAP) was indicated as the performance metric. Author has to check against other specific objective metrics to check the performance of the proposed model.

 7. Comparative analysis should be included with other benchmarking techniques to check the performance of the proposed model.

 8. The Conclusion section should be expanded and improved. 

 9. Overall the article should be checked for English language and grammatical mistakes.

Good luck,

Regards

 

 



Author Response

Reviewer#3:

The authors are requested to have their individual responses for the review comments to be in blue and to change the font colour to red where changes have been made in the text of the manuscript.

  1. Abstract should be rewritten.

Author response and action: Thank you for your kind comment. The abstract has been rewritten.

Before: In order to minimize damage in the event of a fire, the ignition point must be detected and dealt with before the fire spreads. However, the method of detecting fire by heat or fire is more damaging because it can be detected after the fire has spread. Therefore, this study proposes a swin transformer-based object detection model using explainable meta-learning mining. The proposed method merges Swin Transformer and YOLOv3 model and applies meta-learning so as to build an explainable object detection model. In order for efficient learning with small data in the course of learning, it applies Few-Shot Learning. To find the causes of the object detection results, Grad-CAM as an explainable visualization method is used. In this study, with the use of Mean Average Precision (mAP), performance evaluation is carried out in two ways. First, the performance of the proposed object detection model is evaluated. Secondly, the performance of the proposed method is compared with a conventional object detection method’s performance. Given the results of the evaluation, the proposed method supports accurate and real-time monitoring and analysis.

After: In order to minimize damage in the event of a fire, the ignition point must be detected and dealt with before the fire spreads. However, the method of detecting fire by heat or fire is more damaging because it can be detected after the fire has spread. Therefore, this study proposes a swin transformer-based object detection model using explainable meta-learning mining. The proposed method merges Swin Transformer and YOLOv3 model and applies meta-learning so as to build an explainable object detection model. In order for efficient learning with small data in the course of learning, it applies Few-Shot Learning. To find the causes of the object detection results, Grad-CAM as an explainable visualization method is used. It detects small objects of smoke in the fire image data and classifies them according to the color of the smoke generated when a fire breaks out. Accordingly, it is possible to predict and classify the risk of fire occurrence to minimize damage caused by fire. In this study, with the use of Mean Average Precision (mAP), performance evaluation is carried out in two ways. First, the performance of the proposed object detection model is evaluated. Secondly, the performance of the proposed method is compared with a conventional object detection method’s performance. Given the results of the evaluation, the proposed method supports accurate and real-time monitoring and analysis.

 

2. Line nos. 240-241: “The fire scene is divided into black smoke, grey smoke, white smoke, and fire (flame) classes. Class is composed.” - The sentence is to be reframed to provide meaningful information.

Author response and action: Thank you for your kind comment. The composition of the sentence has been modified as follows.

Before: It is classified into three types: fire scene, similar scene, and unrelated scene. The fire scene is divided into black smoke, gray smoke, white smoke, and fire (flame) classes. Class is composed.

After: It is classified into three types: fire scene, fire scene, fire scene, black smoke, gray smoke, white smoke, fire (fire) class, and fire scene consists of fog, lighting, sunlight, and leaves.

 

3. In Section 3.1 (Collect and preprocess risk prediction data), the author discussed the decision class labels but should explain in detail the preprocessing technique involved. Also, since smoke is involved in the processed images, how is image clarity achieved using the Swin Transformer as a feature extractor to help in the quick detection process?

Author response and action: Thank you for your kind comment. The composition and preprocessing of the data collected in this paper are as follows.

The data used in this paper is the fire prediction image data provided by AI Hub [25]. The fire occurrence prediction image data is image data obtained by photographing smoke occurring before a fire occurs. It is classified into three types: fire scene, similar scene, and unrelated scene. The fire scene is divided into black smoke, gray smoke, white smoke, and fire (flame) classes. In addition, unrelated scenes include objects unrelated to smoke generation. When predicting a fire, fire is predicted by the color of the smoke. Therefore, in this paper, 13,159 train data and 22,000 validation data from the fire scene are used. The data all have the same size of 1920×1080.

In addition, the JSON file of meta information such as data location information and class information is reconstructed similarly to the JSON file format of the COCO dataset.
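A minimal sketch of what such a COCO-style reconstruction could look like; the file names, box coordinates, and category IDs below are hypothetical placeholders, not the actual AI Hub annotations:

```python
import json

coco = {
    "images": [
        {"id": 1, "file_name": "fire_00001.jpg", "width": 1920, "height": 1080},
    ],
    "annotations": [
        # bbox follows the COCO convention: [x, y, width, height].
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [640, 310, 120, 220]},
    ],
    "categories": [
        {"id": 1, "name": "black_smoke"},
        {"id": 2, "name": "gray_smoke"},
        {"id": 3, "name": "white_smoke"},
        {"id": 4, "name": "flame"},
    ],
}

with open("fire_coco.json", "w") as f:
    json.dump(coco, f, indent=2)
```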

 

4. It would be nice to make a picture of the algorithm of the realized research.

Author response and action: Thank you for your kind comment. The configuration diagram of the implemented model is shown in Figures 5 and 6.

 

5. As the authors mentioned that “there is a disadvantage that smoke far from the fire could not be detected. In addition, (c) in Figure 8 shows incorrectly detected results. It detected the entire part, not the fire part, rather than the fire part. It is judged that the focus is more on other objects. Thus, the generalization performance for the dataset is poor.” - the possibility of improving the performance on a generalized dataset can be explained as future scope along with the conclusion.

Author response and action: Thank you for your kind comment.

Before: This study proposed the Swin Transformer-based object detection model using explainable meta-learning mining. The proposed method is based on three steps. In the first step, prediction data were collected. In the second step, Swin Transformer-based object detection was carried out. Since it ignores modeling for image local structures like lines and edges in the course of transforming images into patches, it causes information loss. To solve the problem, the shifted window of Swin Transformer was applied. The feature pyramid network of YOLOv3 was also used to detect objects of various sizes. Nevertheless, only with YOLOv3, it is hard to detect small objects. Accordingly, Swin Transformer was combined with the approach in order to detect objects at various scales. In the third step, meta-learning-based object detection mining was applied. This approach enables efficient learning with a small amount of data and prevents information loss. Few-shot learning, one of the meta-learning methods, was used. It generated a feature extractor on the basis of distance and improved object detection. In order to find the cause for the object classification result from the proposed method, explainable visualization was applied. With the use of Grad-CAM, it was possible to keep the high resolution, reduce noise, and find the cause for classification easily. The performance of the proposed model was evaluated. In addition, the proposed one was compared with a conventional object detection algorithm. As a result, the proposed method was evaluated to have excellent performance. Therefore, the proposed method makes it possible to monitor CCTVs continuously and detect abnormality effectively. However, we do not consider the generalization of the data. Therefore, there are limitations to the data used. To address this, future studies will improve generalization performance on various datasets.

After: This study proposed the Swin Transformer-based object detection model using explainable meta-learning mining. The proposed method is based on three steps. In the first step, prediction data were collected. In the second step, Swin Transformer-based object detection was carried out. Since it ignores modeling for image local structures like lines and edges in the course of transforming images into patches, it causes information loss. To solve the problem, the shifted window of Swin Transformer was applied. The feature pyramid network of YOLOv3 was also used to detect objects of various sizes. Nevertheless, only with YOLOv3, it is hard to detect small objects. Accordingly, Swin Transformer was combined with the approach in order to detect objects at various scales. In the third step, meta-learning-based object detection mining was applied. This approach enables efficient learning with a small amount of data and prevents information loss. Few-shot learning, one of the meta-learning methods, was used. It generated a feature extractor on the basis of distance and improved object detection. In order to find the cause for the object classification result from the proposed method, explainable visualization was applied. With the use of Grad-CAM, it was possible to keep the high resolution, reduce noise, and find the cause for classification easily. The performance of the proposed model was evaluated. In addition, the proposed one was compared with a conventional object detection algorithm. As a result, the proposed method was evaluated to have excellent performance. Therefore, the proposed method makes it possible to monitor CCTVs continuously and detect abnormality effectively. However, the generalization of the data was not considered. Therefore, there are limitations on the data to be used. In order to solve this problem, future research plans to improve the generalization performance by adjusting parameters for various datasets and combining each model.

 

6. Mean Average Precision (mAP) was indicated as the performance metric. The author has to check other specific objective metrics to verify the performance of the proposed model.

Author response and action: Thank you for your kind comment. The performance comparison between the proposed method and the existing methods was conducted through mAP and through accuracy and F-measure based on the confusion matrix. The equations are shown in Equations (1) to (4). The results are shown in Tables 2 and 3.

 

$AP = \sum_{n} (r_{n+1} - r_{n})\, p_{\mathrm{interp}}(r_{n+1}), \qquad p_{\mathrm{interp}}(r) = \max_{\tilde{r} \ge r} p(\tilde{r})$                        (1)

$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i}$                              (2)

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$                        (3)

$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$                      (4)

 

Table 2. The results of the performance comparison between object detection algorithms.

Object Detection Model                               mAP
YOLACT [40]                                          30.45
YOLACT++ [41]                                        34.65
YOLOS [15]                                           30.04
Swin Transformer + YOLO + few-shot learning (ours)   51.52

 

Table 3. The performance evaluation results based on the confusion matrix.

Object Detection Model                               Accuracy   F-measure
YOLACT [40]                                          66.12      67.88
YOLACT++ [41]                                        67.54      69.24
YOLOS [15]                                           64.52      66.18
Swin Transformer + YOLO + few-shot learning (ours)   70.84      69.84

 

7. Comparative analysis should be included with other benchmarking techniques to check the performance of the proposed model.

Author response and action: Thank you for your kind comment. The performance comparison between the proposed method and the existing methods was conducted through mAP and through accuracy and F-measure based on the confusion matrix. The results are shown in Tables 2 and 3.

(See Tables 2 and 3 above.)

 

8. The Conclusion section should be expanded and improved.

Author response and action: Thank you for your kind comment. The Conclusions section has been revised as follows.

This study proposed the Swin Transformer-based object detection model using explainable meta-learning mining. The proposed method is based on three steps. In the first step, prediction data were collected. In the second step, Swin Transformer-based object detection was carried out. Since it ignores modeling for image local structures like lines and edges in the course of transforming images into patches, it causes information loss. To solve the problem, the shifted window of Swin Transformer was applied. The feature pyramid network of YOLOv3 was also used to detect objects of various sizes. Nevertheless, only with YOLOv3, it is hard to detect small objects. Accordingly, Swin Transformer was combined with the approach in order to detect objects at various scales. In the third step, meta-learning-based object detection mining was applied. This approach enables efficient learning with a small amount of data and prevents information loss. Few-shot learning, one of the meta-learning methods, was used. It generated a feature extractor on the basis of distance and improved object detection. In order to find the cause for the object classification result from the proposed method, explainable visualization was applied. With the use of Grad-CAM, it was possible to keep the high resolution, reduce noise, and find the cause for classification easily. For the performance evaluation, the performance of the proposed method was evaluated step by step. In addition, accuracy and F-measure evaluations were conducted based on the mAP and confusion matrix of the proposed and existing methods. As a result of the evaluation, the proposed method was evaluated as excellent. This is because it is possible to detect even small areas by solving the disadvantages of the existing vision transformer. Therefore, the proposed method makes it possible to monitor CCTVs continuously and detect abnormality effectively. However, the generalization of the data was not considered. Therefore, there are limitations on the data to be used. In order to solve this problem, future research plans to improve the generalization performance by adjusting parameters for various datasets and combining each model.

9. Overall, the article should be checked for English language and grammatical mistakes.

Author response and action: Thank you for your kind comment. The English grammar was checked and errors were corrected.

Author Response File: Author Response.docx

Round 3

Reviewer 1 Report

The author responded to the questions raised last time. The following are comments on this manuscript.

 

1. The abstract of the manuscript should be revised according to the content after the last revision. The author added new parameters to evaluate the proposed model in Section 4.2, so the content of the corresponding abstract needs to be modified.

 

2. On page 9, "First" in line 344 and "Second" in line 350 are in a sequential relationship and should not be broken apart here.

 

3. The manuscript emphasizes that the proposed method is accurate and real-time. It is recommended to add a comparison of the detection FPS of the different methods.

Author Response

Reviewer#1:

The author responded to the questions raised last time. The following are comments on this manuscript.

1. The abstract of the manuscript should be revised according to the content after the last revision. The author added new parameters to evaluate the proposed model in Section 4.2, so the content of the corresponding abstract needs to be modified.

Author response and action: Thank you for your kind comment. The Abstract was modified in relation to the indicators used for performance evaluation as follows.

Before: In order to minimize damage in the event of a fire, the ignition point must be detected and dealt with before the fire spreads. However, the method of detecting fire by heat or fire is more damaging because it can be detected after the fire has spread. Therefore, this study proposes a swin transformer-based object detection model using explainable meta-learning mining. The proposed method merges Swin Transformer and YOLOv3 model and applies meta-learning so as to build an explainable object detection model. In order for efficient learning with small data in the course of learning, it applies Few-Shot Learning. To find the causes of the object detection results, Grad-CAM as an explainable visualization method is used. It detects small objects of smoke in the fire image data and classifies them according to the color of the smoke generated when a fire breaks out. Accordingly, it is possible to predict and classify the risk of fire occurrence to minimize damage caused by fire. In this study, with the use of Mean Average Precision (mAP), performance evaluation is carried out in two ways. First, the performance of the proposed object detection model is evaluated. Secondly, the performance of the proposed method is compared with a conventional object detection method’s performance. Given the results of the evaluation, the proposed method supports accurate and real-time monitoring and analysis.

After: In order to minimize damage in the event of a fire, the ignition point must be detected and dealt with before the fire spreads. However, the method of detecting fire by heat or fire is more damaging because it can be detected after the fire has spread. Therefore, this study proposes a swin transformer-based object detection model using explainable meta-learning mining. The proposed method merges Swin Transformer and YOLOv3 model and applies meta-learning so as to build an explainable object detection model. In order for efficient learning with small data in the course of learning, it applies Few-Shot Learning. To find the causes of the object detection results, Grad-CAM as an explainable visualization method is used. It detects small objects of smoke in the fire image data and classifies them according to the color of the smoke generated when a fire breaks out. Accordingly, it is possible to predict and classify the risk of fire occurrence to minimize damage caused by fire. In this study, with the use of Mean Average Precision (mAP), performance evaluation is carried out in two ways. First, the performance of the proposed object detection model is evaluated. Secondly, the performance of the proposed method is compared with a conventional object detection method’s performance. In addition, the accuracy comparison using the confusion matrix and the suitability of real-time object detection using FPS are judged. Given the results of the evaluation, the proposed method supports accurate and real-time monitoring and analysis.

 

 

2. On page 9, "First" in line 344 and "Second" in line 350 are in a sequential relationship and should not be broken apart here.

Author response and action: Thank you for your kind comment. The "Secondly" sentence was modified to connect directly with the "Firstly" sentence.

Before: Firstly, when the score of target detection is extracted for explainable object detection, detect box can be different from the existing target. For this reason, the overlapping score between the correct target and the detect box is multiplied. Based on the final score, the proposed object detection model determines a level of prediction reliability for an object, and the accuracy of its position.

Secondly, combines feature maps using gradient signals that do not require network changes. This assigns importance using gradient information that proceeds to a specific layer of the CNN. Therefore, it is possible to solve the limited problem of the model structure and to explain the structure and results of various models. Figure 7 shows the results drawn from the vision transformer-based explainable object detection using meta-learning mining.

After: Firstly, when the score of target detection is extracted for explainable object detection, detect box can be different from the existing target. For this reason, the overlapping score between the correct target and the detect box is multiplied. Based on the final score, the proposed object detection model determines a level of prediction reliability for an object, and the accuracy of its position. Secondly, combines feature maps using gradient signals that do not require network changes. This assigns importance using gradient information that proceeds to a specific layer of the CNN. Therefore, it is possible to solve the limited problem of the model structure and to explain the structure and results of various models. Figure 7 shows the results drawn from the vision transformer-based explainable object detection using meta-learning mining.

 

3. The manuscript emphasizes that the proposed method is accurate and real-time. It is recommended to add a comparison of the detection FPS of the different methods.

Author response and action: Thank you for your kind comment. We added the results of comparing FPS with other real-time object detection models in Table 4 as follows.

The last performance evaluation measures FPS to assess the suitability of the proposed method for real-time detection. FPS (Frames Per Second) in object detection represents the detection rate per second. Table 4 shows the FPS performance evaluation results.

 

Table 4. The FPS performance evaluation results.

Object Detection Model                               FPS
YOLACT [40]                                          30.1
YOLACT++ [41]                                        31.2
YOLOS [15]                                           30.4
Swin Transformer + YOLO + few-shot learning (ours)   30.35

 

In the FPS performance evaluation results in Table 4, the FPS of the proposed method is not judged to be the best. However, since real-time video is generally uninterrupted and natural at 30 FPS, the proposed method can be judged to be suitable for real-time use.
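Detection FPS of this kind is typically measured by averaging wall-clock inference time per frame. A rough sketch with a stand-in network (the actual detector, input pipeline, and hardware would differ):

```python
import time
import torch

model = torch.nn.Conv2d(3, 4, 3, padding=1).eval()  # stand-in detector
frame = torch.randn(1, 3, 1080, 1920)               # one 1920x1080 frame

with torch.no_grad():
    for _ in range(10):                  # warm-up runs
        model(frame)
    n = 100
    start = time.perf_counter()
    for _ in range(n):                   # timed runs
        model(frame)
    elapsed = time.perf_counter() - start

print(f"FPS: {n / elapsed:.1f}")
```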

Author Response File: Author Response.docx

Reviewer 2 Report

Thanks for your responses.

Author Response

Reviewer#2:

Thanks for your responses.

Author response and action: Thank you for your kind comment.

Reviewer 3 Report

 

The authors have provided their responses according to the comments raised.

Only one concern, in line no. 239:

It is classified into three types: fire scene, fire scene, fire scene, black smoke, grey smoke, white smoke, fire (fire) class, and fire scene consists of fog, lighting, sunlight, and leaves. (It is provided in the manuscript)

 

In the response letter, it is mentioned as 

" It is classified into three types: fire scene, similar scene, and unrelated scene." 

 

The authors are requested to make sure of the meaning of the sentence provided.

Good luck, regards

Author Response

Reviewer#3:

The authors have provided their responses according to the comments raised.

 

  1. Only one concern in Line no : 239. It is classified into three types: fire scene, fire scene, fire scene, black smoke, grey smoke, white smoke, fire (fire) class, and fire scene consists of fog, lighting, sunlight, and leaves. (It is provided in the manuscript) In the response letter, it is mentioned as " It is classified into three types: fire scene, similar scene, and unrelated scene." Authors are requested to make sure, the meaning of the sentence provided. Good luck, regards

Author response and action: Thank you for your kind comment. We revised the sentence as follows.

Before: It is classified into three types: fire scene, fire scene, fire scene, black smoke, gray smoke, white smoke, fire (fire) class, and fire scene consists of fog, lighting, sunlight, and leaves. In addition, unrelated scenes include objects unrelated to smoke generation.

After: It is classified into three types: fire scene, similar scene, and unrelated scene, and the fire scene is divided into black smoke, gray smoke, white smoke, and fire (flame) classes. Similar scenes are composed of classes such as fog, lighting, sunlight, and leaves. In addition, unrelated scenes include objects unrelated to smoke generation.

Author Response File: Author Response.docx
