Using Hybrid Algorithms of Human Detection Technique for Detecting Indoor Disaster Victims

When an indoor disaster occurs, the disaster site can become very difficult to escape from, depending on the scenario and the building. Most people evacuate when a disaster situation occurs, but some disaster victims cannot evacuate and become isolated. Isolated disaster victims often cannot move quickly because they lack the necessary information about the disaster, and secondary damage can occur. Rescue workers must rescue disaster victims quickly, before secondary damage occurs, but it is not always easy to locate isolated victims within a disaster site. In addition, rescue workers can themselves suffer secondary damage because they are exposed to the disaster situation. We present a hybrid human detection (HHD) technique that can detect isolated victims in indoor disasters relatively quickly, even when they are covered by fire smoke, by merging the one-stage detectors YOLO and RetinaNet. Compared with other techniques, HHD achieves a high human detection rate while retaining a 1-stage detection method. Therefore, the HHD technique presented in this paper can be beneficial in future indoor disaster situations.


Introduction
Currently, people spend 80-90% of their time in buildings, which have become larger and more complex due to the development of architectural technology [1]. People can be exposed to disaster situations such as gas leaks and fires when they spend more time indoors. In addition, in recent years, the risk of disasters is also increasing with the development of building technologies such as residential and commercial complexes, new apartments, and government offices. There are various reasons why disasters occur, but those due to fire, in particular, are frequent.
When a disaster such as a fire occurs indoors, the "golden" time is five minutes [2]. In an indoor fire situation, smoke is more dangerous for the victims than flames; more than 60% of deaths due to fire are caused by suffocation from gas and smoke [3]. When an indoor disaster occurs, most people recognise the situation and evacuate. However, victims often cannot escape in time due to late situational awareness or personal reasons.
Disaster victims who have not yet evacuated often do not grasp the severity of the situation because smoke from the fire makes it difficult to see. Therefore, rescue activities should be carried out as quickly as possible to prevent secondary damage. Rescue workers also struggle with low visibility, so they often have to carry out rescue operations guided by the cries of disaster victims. As a result, rescue workers are themselves exposed to the disaster situation and can be injured by secondary damage, because they must enter the building to carry out rescue operations without knowing the location of the disaster victims.
Most previous studies on this subject guide those who were not able to evacuate the disaster site early along a rescue route from their current location [4][5][6] to an escape route.

Related Works
In general, 2-stage methods use anchors to propose object regions before classification and regression [33,34], whereas 1-stage methods [30,35,36] proceed directly to classification (i.e., the anchor box is refined without a separate object-proposal step).
In this section, the following three studies related to human detection in indoor disaster situations are discussed: detecting people in fire smoke, detecting disaster victims using CNNs, and detecting disaster victims using YOLO.

Human Detection in an Area with Fire Smoke
The first study proposes a novel method combining a situational awareness framework and automatic visual smoke detection [33]. The detection work was carried out by learning information about scenarios with smoke and fire, along with information about people. Seventy percent of the dataset was used to train a k-nearest neighbour (KNN) [37] classifier. Figure 1 shows two examples of detecting a person obscured by smoke using the KNN classifier and the system classifier [34]. An adaptive background subtraction algorithm was used to identify moving objects based on the dynamic characteristics of smoke. Two features, colour and fuzziness, are applied to filter regions without smoke motion. Only an area that satisfies certain colour analysis and fuzzy characteristics is selected as a candidate area. There are still problems determining the existence of humans for various reasons, including the smoke itself. However, it is possible to determine a person's presence by detecting only a part of the body as a feature point, if the person is partially covered by smoke.

Victim Detection Using Convolutional Neural Networks
Another study describes the detection of people and pets in any location by providing an infrared (IR) image with location information during combustion to a convolutional neural network [35]. Two methods are proposed to develop a CNN model for detecting people and pets at high temperatures. The first method consists of a feed-forward design that categorises objects displayed in the IR image into three classes. The second method consists of a cascading two-step CNN design that separates the classification decisions at each step.
IR images are captured at the combustion site and transmitted to the base station via an autonomous embedded system vehicle [30]. The CNN model indicates whether a person or pet is detected in the IR image on the primary computer. Next, it analyses each IR image to determine one of three classes: "people", "pet", or "no victims". The proposed CNN model design improves the safety and performance of firefighters when evacuating victims from fires by setting priorities for rescue protocols.
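The cascading two-step design's decision logic (first decide whether any victim is present, then separate person from pet) can be sketched as follows; the stage classifiers here are hypothetical stand-ins based on mean pixel intensity, not the paper's trained CNNs:

```python
# Sketch of the cascaded two-step decision logic described above.
# The stage functions are hypothetical stand-ins for the trained CNNs;
# each is a callable returning a (label, score) pair.

def cascade_classify(ir_image, stage1, stage2):
    """Stage 1 decides victim vs. no victim; stage 2 separates person/pet."""
    label, score = stage1(ir_image)
    if label == "no victims":
        return "no victims", score
    return stage2(ir_image)  # -> ("people", s) or ("pet", s)

# Toy stand-in classifiers based on mean pixel intensity (illustrative only).
def stage1(img):
    mean = sum(img) / len(img)
    return ("victim", mean) if mean > 0.5 else ("no victims", 1 - mean)

def stage2(img):
    mean = sum(img) / len(img)
    return ("people", mean) if mean > 0.8 else ("pet", mean)
```

The cascade separates the two decisions so each stage can be trained and tuned on its own sub-problem.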
However, since the above study uses IR images, the object's shape is not precise enough. Given that it is a study aimed at searching for disaster victims, the details for classifying disaster victims are insufficient.


Detection of Natural Disaster Victims Using YOLO
Finally, studies were conducted using the YOLO method to take images of victims using drones that help locate victims in complex or vulnerable locations, to direct human access. These use image processing to design natural disaster victim detection systems [38].
As shown in Figure 2, when an image is input to the network model, the output is calculated according to the parameters and structure of the model. This output includes the image category, the coordinates of the bounding box, and other corresponding information. The process predicts the coordinates and position of the bounding box containing the object, the probability that the bounding box contains an object, and the probability that the object in the bounding box belongs to each specified class. Next, the CNN output is filtered to determine more specific objects. The output video contains the name of each detected object.
The data set used contains 200 images: 100 for training and 100 for testing. The training data were trained for 3000 iterations, and the experiment achieved an accuracy of 89%. However, one disadvantage is that several factors, such as the background of the object in the image, as well as its position, height, and distance, affect the detection result; this can significantly reduce the accuracy. In addition, detecting disaster victims using YOLO alone, without considering the exact disaster situation, carries a very high probability that the detection accuracy will be low when the disaster victim is in a different situation.


Hybrid Human Detection Method
The detection method design for this study is described next. YOLO can predict the type and location of an object just by looking at the image once [39]. In addition, because the background is not treated as a class and only objects are designated as candidates, it is a fast and simple process with a relatively high mAP (mean average precision). However, it has low accuracy for small objects.
For RetinaNet, the background and object classes are separate, so when the image contains far more background area than object area, a loss function is used to increase accuracy [40]. Figure 3 shows a flowchart of our proposed disaster victim detection task. False positive (FP) and false negative (FN) classification results are obtained when the first detection operation is performed using YOLOv3; the secondary detection operation is then performed using RetinaNet. The secondary detection operation excludes the results classified as true positive (TP) and true negative (TN) in the primary detection operation.

When fire causes an indoor disaster situation, a person at the scene can be obscured by smoke. To detect a person in such a situation, an image is selected and learned through a machine learning module. The accuracy of human detection varies with smoke concentration, and the optimal IoU value is found by trial and error; this is the most crucial factor in saving lives in an indoor disaster. For this reason, we designed a hybrid human detection (HHD) method that focuses on finding the optimal IoU value using both YOLOv3 and RetinaNet.
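The overall two-pass flow can be sketched as follows; `yolo_detect` and `retinanet_detect` are hypothetical stand-ins for the trained detectors, each returning (box, label) pairs where the label records the TP/TN/FP/FN outcome against the ground truth:

```python
# Sketch of the two-pass HHD flow: TP/TN results from the primary YOLO
# pass are kept, while FP/FN results are handed to RetinaNet.

def hhd_pipeline(image, yolo_detect, retinanet_detect):
    primary = yolo_detect(image)
    # TP/TN results from the primary pass are kept as-is.
    kept = [(box, lab) for box, lab in primary if lab in ("TP", "TN")]
    # FP/FN results are re-detected with RetinaNet in a secondary pass.
    retry = [(box, lab) for box, lab in primary if lab in ("FP", "FN")]
    secondary = retinanet_detect(image, retry)
    return kept + secondary
```

The second pass only ever sees the cases the first pass got wrong, which is what lets the hybrid keep the speed of a 1-stage detector.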
The proposed HHD task is divided into four main steps, as shown in Figure 4. First, the candidate group of objects in the input image is set. The input image is divided into an S × S grid, and each grid cell predicts B bounding boxes along with a confidence score. Each bounding box predicts the x- and y-coordinates, the height h, and the width w. The confidence reflects the IoU between the predicted box and the ground-truth box, from which an optimal IoU value is derived. One class is predicted per cell. Second, in post-processing, the bounding box with the highest confidence is retained for each object, and the remaining bounding boxes are removed. Through this series of processes, the objects classified as FN and FP are identified.
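The second step, keeping only the most reliable box per object and removing the rest, is standard non-maximum suppression; a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) tuples:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-confidence box, drop boxes overlapping it heavily."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

For example, two near-identical boxes with scores 0.9 and 0.8 collapse to the single higher-scoring one, while a distant box survives.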
Third, re-detection is performed using RetinaNet for the objects classified as FN and FP, and anchor boxes are created. RetinaNet is divided into a subnet for object classification and a subnet for bounding-box coordinates. By separating the background and the object, the detection operation focuses on the object and is performed more accurately.
If the process stopped at this point, a large loss would occur: because only people are detected among the objects, an imbalance between foreground and background arises. To resolve this imbalance, focal loss is applied in the last step. Focal loss assigns a small loss value to examples that are already well classified, and represents the detection result more accurately by concentrating the loss weight on the misclassified data.
The creation of bounding boxes in the first step is shown in Figure 5. p_h and p_w represent the height and width of the anchor box; t_x, t_y, t_w, and t_h represent prediction values; and b_x, b_y, b_w, and b_h represent the post-processed box. b is the predicted offset of the bounding box relative to the anchor box, and c indicates the offset of the upper-left corner of each grid cell. The object is detected using the final value of b and its IoU with the ground truth, calculated as IoU = area(B_gt ∩ B_p) / area(B_gt ∪ B_p). IoU is a method for evaluating two boxes, B_gt (ground truth), the bounding box of an actual object, and B_p (prediction), the predicted bounding box, through their overlapping area. A larger overlapping area indicates a better evaluation. The existence of an object is determined by the degree of overlap between the ground-truth and predicted boxes. In previous studies, the IoU value changes dynamically because it is obtained from a video. However, this makes it difficult to accurately determine the existence of an object, because a person obscured by smoke must be found in single images. Therefore, it is necessary to find the optimal IoU value.
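Under the standard YOLOv3 parameterisation, which the symbols above match, the post-processed box b is obtained from the predictions t, the grid-cell offset c, and the anchor dimensions (p_w, p_h); a sketch, together with the IoU computation:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv3-style decoding of raw predictions t into a box b.

    (cx, cy) is the upper-left offset of the grid cell; (pw, ph) are the
    anchor-box width and height, as described above."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes: intersection area over union area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if inter else 0.0
```

With zero predictions, the box centre sits at the cell offset plus 0.5 and the size equals the anchor size, which is the anchor's role as a prior.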
In addition, our goal was to detect people obscured by smoke in an indoor disaster situation, in particular a fire, rather than detecting a person in a generic situation. Thus, occlusion occurs for the humans being detected, and a correct box may be wrongly deleted. The following formula prevents the box from being deleted; its threshold term is also the classification score. If the centre distance is large even though the IoU is large, the candidate may be a different object, so the correct box is not deleted.
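The formula itself is not reproduced in this excerpt. The behaviour described, keeping a box whose centre is far from the highest-confidence box even when the IoU is large, matches a DIoU-NMS-style criterion; the sketch below is an assumed reconstruction along those lines, with a hypothetical threshold `eps`:

```python
def diou_suppress_score(box_m, box_i):
    """IoU minus a normalised centre-distance penalty (DIoU-NMS style).

    Assumed reconstruction: the excerpt above omits the exact formula.
    A box is suppressed only when this score reaches the threshold, so a
    large centre distance keeps an overlapping box from being deleted."""
    # IoU term
    ix1, iy1 = max(box_m[0], box_i[0]), max(box_m[1], box_i[1])
    ix2, iy2 = min(box_m[2], box_i[2]), min(box_m[3], box_i[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_m = (box_m[2] - box_m[0]) * (box_m[3] - box_m[1])
    area_i = (box_i[2] - box_i[0]) * (box_i[3] - box_i[1])
    iou = inter / (area_m + area_i - inter) if inter else 0.0
    # Squared centre distance d^2 over squared diagonal c^2 of the
    # smallest box enclosing both boxes.
    d2 = ((box_m[0] + box_m[2]) / 2 - (box_i[0] + box_i[2]) / 2) ** 2 + \
         ((box_m[1] + box_m[3]) / 2 - (box_i[1] + box_i[3]) / 2) ** 2
    ex1, ey1 = min(box_m[0], box_i[0]), min(box_m[1], box_i[1])
    ex2, ey2 = max(box_m[2], box_i[2]), max(box_m[3], box_i[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return iou - d2 / c2 if c2 else iou

def keep_box(box_m, box_i, eps=0.5):
    """Keep box_i (do not delete it) when the score stays below eps."""
    return diou_suppress_score(box_m, box_i) < eps
```

Two boxes that overlap but have distant centres are treated as two objects and both survive, which is the occlusion-robust behaviour described above.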
YOLO predicts multiple bounding boxes for each grid cell. To compute the loss for true positives, we need to select the one box that best contains the detected object. The formula below performs this selection.
where S is the grid size (the image is divided into S × S cells), and B is the number of bounding boxes predicted by each grid cell. A 7 × 7 grid predicts two bounding boxes and optimises by computing a loss only when an object is within a grid cell. Specifically, YOLO divides the image into 7 × 7 grid cells and predicts two candidates for objects of various sizes centred on each grid cell. Whereas a 2-stage method proposes more than 1000 candidates, YOLO proposes only 7 × 7 × 2 = 98 candidates, so its detection performance is worse. Furthermore, the detection accuracy is significantly lower when several objects surround one object, i.e., if there are several objects in one cell, the detection accuracy decreases.
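The selection of the responsible predictor can be sketched as follows; `responsible_box` is a hypothetical helper that picks, among the B boxes a cell predicts, the one with the highest IoU against the ground truth:

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if inter else 0.0

def responsible_box(predicted, ground_truth):
    """Of the B boxes a grid cell predicts, return the index of the one
    with the highest IoU against the ground-truth box: the 'responsible'
    predictor whose loss is counted."""
    return max(range(len(predicted)),
               key=lambda i: iou(predicted[i], ground_truth))

# Candidate count for the grid described above.
S, B = 7, 2
candidates = S * S * B  # 98
```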
After performing the primary detection task with YOLO, the secondary detection task is performed using RetinaNet for the objects classified as FN and FP. Each missed object is processed by a subnet for object information and a subnet for the bounding-box coordinates; the bounding box is then classified, and the distance between the bounding box and the ground-truth object box is predicted. RetinaNet can also collect background information while focusing on the object, which increases accuracy.
If the background and foreground become unbalanced, the detection accuracy decreases, and a loss function should be applied to reduce this imbalance. The following equations calculate the loss. In Equation (4), the standard cross-entropy CE(p, y) = −log(p) if y = 1 and −log(1 − p) otherwise, y is 1 or −1 and represents the ground-truth class, while p is between 0 and 1 and represents the class probability predicted by the model. Equation (5) defines a slightly more convenient form by introducing p_t = p if y = 1 and p_t = 1 − p otherwise, so that CE(p, y) = CE(p_t) = −log(p_t). When p_t ≥ 0.5, the example is easy to classify; each such easy example contributes only a slight loss, but easy examples dominate the total loss as their number increases. Therefore, the following formula reduces the effect of easy examples on the loss. Equation (6) is the focal loss, FL(p_t) = −(1 − p_t)^γ log(p_t). Focal loss is a loss function that down-weights cases that are easy to classify; it learns by focusing on difficult classification problems. The modulating factor (1 − p_t)^γ and the tuneable focusing parameter γ are added to CE. As γ grows above 0, the difference in loss values between well-detected and poorly detected objects becomes more evident.

Figure 6 shows a representative image data set in which a smoke filter is applied to an image of people indoors. The concentration of the smoke filter was adjusted for transparency, starting at 75% and increasing to 85% in 1% increments. The same image filter and transparency were applied to all test image datasets.
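The focal loss described above, the modulating factor (1 − p_t)^γ applied to cross-entropy, can be sketched as:

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t), with
    p_t = p if y = 1 else 1 - p. Setting gamma = 0 recovers
    plain cross-entropy."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# An easy example (p_t = 0.9) is down-weighted far more than a hard
# one (p_t = 0.1), so hard examples dominate the training signal.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```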

Applications and Results of the Hybrid Human Detection Method
For smoke concentrations greater than 90%, comparisons between techniques are meaningless because the smoke is too thick to identify a person. We performed and compared detection using YOLO only, RetinaNet only, and HHD. Figure 7 shows how the object candidate group is set while detecting disaster victims covered by smoke using YOLOv3. Figure 7a shows the generation of bounding boxes surrounding predicted objects. Figure 7b shows the areas where humans are predicted to be. Following this step, and after preventing the correct bounding box from being deleted due to occlusion of a human by smoke (Equation (2)), only the most reliable bounding box is left; the final result is shown in Figure 8.
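The smoke filter used to build the test images (a transparency blend at concentrations from 75% to 85%) is not fully specified here; a plain alpha-compositing approximation on greyscale pixel values can be sketched as:

```python
def apply_smoke_filter(pixels, concentration, smoke_value=255):
    """Blend a flat 'smoke' value over greyscale image pixels.

    `concentration` is the smoke opacity (e.g. 0.75 for the 75% setting).
    The exact filter used in the experiments is not specified in this
    excerpt, so this is an assumed alpha-compositing approximation."""
    a = concentration
    return [round(a * smoke_value + (1 - a) * v) for v in pixels]

# Sweep the concentrations used in the experiments: 75% to 85% in 1% steps.
levels = [c / 100 for c in range(75, 86)]
```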
Some objects predicted to exist within a ground-truth box could not be extracted, resulting in an FN classification, as shown in Figure 9; in other cases, an object is predicted where there is none in the ground truth, resulting in an FP classification (Figure 10). Secondary detection is performed using RetinaNet only for the objects classified as FN and FP.
RetinaNet puts an anchor box on each point of every feature map. For objects classified as FN and FP, the subnet of object information and that of the bounding-box coordinate information are divided, and then the bounding box is classified. Next, the distance between the bounding box and the ground-truth box is predicted. If there are more objects in the background than in the foreground, class imbalance occurs; to prevent this, the focal loss function can be used. Figure 11 shows the results obtained for the objects classified as FN and FP using this method. Figure 12 shows the relative accuracies of the YOLO, RetinaNet, and HHD methods when applying IoU values of 0.3, 0.5, and 0.7. For a smoke concentration of 70%, the accuracy of the three methods was similar, but at 75% and higher, the HHD method was more accurate.

Applications and Results of the Hybrid Human Detection Method
For objects classified as FN and FP, the subnet of object information and that of the bounding box coordinate information are divided, and then the bounding box is classified. Next, the distance between the bounding box and the ground truth box is predicted. If there are more objects in the background than in the foreground, class imbalance occurs. To prevent this phenomenon, the focal loss function can be used. Figure 11 shows the results obtained for objects classified as FN and FP using this method.  the accuracy of the three methods was similar, but for 75% and higher, the HHD method was more accurate. A value of IoU = 0.3 gave the highest overall detection accuracy, however, as shown in Figure 13, a person was detected in a part of the image where there was no person. Therefore, IoU = 0.3 is not ideal for this task. IoU = 0.7 gave more accurate results than IoU = 0.5 for 70-79% smoke concentration. However, the accuracy is lower for a smoke concentration of 80% and higher. As the smoke thickened and obscured more people, the accuracy decreased. A value of IoU = 0.3 gave the highest overall detection accuracy, however, as shown in Figure 13, a person was detected in a part of the image where there was no person. Therefore, IoU = 0.3 is not ideal for this task. IoU = 0.7 gave more accurate results than IoU = 0.5 for 70-79% smoke concentration. However, the accuracy is lower for a smoke concentration of 80% and higher. As the smoke thickened and obscured more people, the accuracy decreased.
A value of IoU = 0.3 gave the highest overall detection accuracy, however, as shown in Figure 13, a person was detected in a part of the image where there was no person. Therefore, IoU = 0.3 is not ideal for this task. IoU = 0.7 gave more accurate results than IoU = 0.5 for 70-79% smoke concentration. However, the accuracy is lower for a smoke concentration of 80% and higher. As the smoke thickened and obscured more people, the accuracy decreased.   Even with IoU = 0.5, the detection accuracy was not high for scenarios where only a part of the human body was visible. To minimise the scenarios in which non-human objects are mistakenly recognised as human or cannot be detected when a part of the human body is covered, we found the optimum value of IoU with the highest detection accuracy for each smoke concentration. Figure 14 shows the results when IoU = 0.3, 0.5, 0.7, and optimal IoU were assigned. Table 1 gives the numerical values of the graph in Figure 14. The blue highlighted figures in the table represent the optimal IoU values for each smoke concentration.
Computation 2022, 10, x FOR PEER REVIEW 11 of 15 human in addition to the actual humans. For IoU = 0.7, when the smoke was thick (higher concentration), the shape of the person was not visible and could not be detected. Even with IoU = 0.5, the detection accuracy was not high for scenarios where only a part of the human body was visible. To minimise the scenarios in which non-human objects are mistakenly recognised as human or cannot be detected when a part of the human body is covered, we found the optimum value of IoU with the highest detection accuracy for each smoke concentration. Figure 14 shows the results when IoU = 0.3, 0.5, 0.7, and optimal IoU were assigned. Table 1 gives the numerical values of the graph in Figure 14. The blue highlighted figures in the table represent the optimal IoU values for each smoke concentration.    In Table 1 and Figure 14, IoU = 0.3 showed higher average accuracy than the optimal IoU for smoke concentrations of 79%, 80%, and 90%; objects covered by the smoke were regarded as persons, as shown in Figure 13a above. Table 2 shows the average detection rates of the YOLO, RetinaNet, and HHD methods for different smoke concentrations. On average, YOLO was the fastest with an average speed of 1 s, and RetinaNet had an average speed of 2-3 s. HHD was slightly slower than YOLO but faster than RetinaNet by 1 s, on average.  Figure 15 shows the precision and recall by smoke concentration when IoU values of 0.3, 0.5, 0.7, and optimal values are applied to the HHD method. Precision is defined by the percentage of correct detections among all detection results, and recall is defined by the percentage of correctly detected objects in the ground truth box. For the recall, the reason the value exceeds 1 when the value of IoU is 0.3 is that objects were incorrectly recognised as human.  Since a smoke filter was applied to an image rather than a video, it was possible to derive an optimal IoU value for each smoke concentration. 
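For reference, the IoU measure used in all of these threshold comparisons can be computed as in the following minimal Python sketch (illustrative only, not the code used in this study), assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero so disjoint boxes give an empty intersection.
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection is then counted as correct when the IoU between the predicted box and the ground truth box exceeds the chosen threshold (0.3, 0.5, 0.7, or the optimal value for the smoke concentration).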
At IoU = 0.3, as mentioned above, non-human objects were recognised as human. Furthermore, compared with fixed values of IoU = 0.5 and 0.7, applying the optimal IoU value proposed here improved the detection accuracy by between 1% and 6%.
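The precision and recall reported in Figure 15 follow the standard definitions, sketched below (a minimal illustration, not the evaluation code used here). Note that under these definitions recall cannot exceed 1; the values above 1 at IoU = 0.3 arise from non-human objects being counted as detected persons.

```python
def precision(tp, fp):
    """Fraction of correct detections among all detection results."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    """Fraction of ground truth objects that were correctly detected."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0
```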
These results mean that victims covered by smoke in urgent indoor disaster situations can be detected more accurately. This information can be processed and delivered to rescuers to minimise the number of victims of indoor disasters.
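The focal loss used above to counter class imbalance in RetinaNet's secondary detection can be sketched for a single binary prediction as follows (a minimal illustration of the standard formulation FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), not the training code used in this study):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one prediction.

    p: predicted probability of the foreground (person) class.
    y: ground truth label, 1 = object, 0 = background.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t)**gamma factor down-weights easy, well-classified
    # examples, so the many easy background anchors no longer dominate.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 0 this reduces to ordinary alpha-weighted cross-entropy; increasing gamma shrinks the contribution of easy examples while keeping hard ones influential.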

Discussion
In a previous study of human detection in smoke, the classification of persons in smoke was biased because it included smoke in the detection criteria [33]. However, in the present study, even when a person was covered by smoke, it was possible to detect the person by detecting a part of that person's body as a feature point. In another study, IR cameras were used for detection; class imbalances due to a large number of human data sets were balanced by oversampling [35]. In the present study, we avoided class imbalance by using the focal loss method. We also developed a hybrid human detection method that merged the learning methods of YOLO and RetinaNet; we improved the accuracy of the new method by reducing loss to increase the detection rate and by developing an approach to determine an optimal IoU value through the dynamic assignment of multiple IoU values.
However, because we applied the smoke filter to images rather than videos, our approach determines the IoU value separately for each smoke concentration. Therefore, when our approach is applied to videos, different optimal IoU values might be obtained.
In addition, because we tested our approach on random non-disaster situations, rather than real disaster situations, the applicability of our approach to real disaster situations is limited; it requires further training and optimisation.

Summary and Conclusions
In this paper, we propose a method to more accurately detect disaster victims who have not yet evacuated from an indoor disaster situation, especially those isolated by fire smoke. The work uses the 1-stage detector approach, and an HHD method combining YOLO and RetinaNet was proposed. Because images of parts of a person's body are learned, a person could be detected accurately even when their entire shape was not visible, and detection accuracy was further increased by using CCTV images. In addition, the detection work was carried out in the unique environment of a disaster.
Because a person's body can be partially or completely obscured by smoke, detection accuracy is significantly lower than in a non-smoky situation. In this study, disaster victims hidden by smoke were detected using the proposed HHD method. HHD uses the YOLO and RetinaNet methods together, improves detection accuracy by repeating the search over FN and FP classifications, and finds the optimal IoU value to produce better results.
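The two-stage logic summarised above can be illustrated roughly as follows. This is a placeholder sketch only: the detector interfaces, the 0.5 confidence cut-off, and the per-box recheck are all assumptions for illustration, not the actual implementation.

```python
def hybrid_detect(image, yolo_detect, retina_detect, confidence=0.5):
    """Placeholder sketch of the HHD idea (not the authors' implementation).

    yolo_detect(image)       -> list of (box, score) for the person class.
    retina_detect(image, box) -> True if the slower detector also finds
                                 a person at that candidate box.
    """
    primary = yolo_detect(image)                        # fast first pass (YOLO)
    confident = [d for d in primary if d[1] >= confidence]
    uncertain = [d for d in primary if d[1] < confidence]
    # Re-examine only the uncertain candidates with RetinaNet, whose
    # focal loss copes better with background/foreground imbalance.
    rechecked = [d for d in uncertain if retina_detect(image, d[0])]
    return confident + rechecked
```

Running the expensive second detector only on uncertain candidates is what keeps the hybrid close to YOLO's speed while recovering detections YOLO alone would miss.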
For YOLO, detection accuracy was high when the human body was completely visible. However, when the smoke thickened and only a part of the body was visible, or the body was blurred, accuracy was low; detection could also fail because of overlapping objects or class imbalance. For RetinaNet, because the classes are divided into background and objects, detection accuracy was higher than that of YOLO, but RetinaNet was significantly slower, and some objects were still missed completely.
When HHD was used, detection accuracy was higher than that of YOLO, and compared with RetinaNet, accuracy was not significantly different but results were obtained more quickly. Comparing the three methods, the largest deviation between HHD and YOLO occurred at IoU = 0.5, where HHD was 3% to 20% more accurate; the largest deviation between HHD and RetinaNet occurred at IoU = 0.7, where HHD was 1% to 9% more accurate. On average, the largest deviations appeared from a smoke concentration of 80% upwards. Furthermore, the differences between IoU = 0.3, 0.5, and 0.7 were calculated for HHD, and the optimal IoU value was found and applied.
Our results show that the HHD method proposed in this paper produces better results than YOLO or RetinaNet alone. Using videos, and scenes in which people are only partially covered by smoke, might further improve the accuracy of the HHD method, but this requires further testing. Our method could become an essential tool for identifying victims in indoor disaster situations, especially where there is fire and smoke, and could help prevent additional casualties by enabling rescue work to be conducted more quickly.