Smart Drone Surveillance System Based on AI and on IoT Communication in Case of Intrusion and Fire Accident

This research develops a smart security system based on Artificial Intelligence (AI) with an unmanned aerial vehicle (UAV) to detect and monitor alert situations, such as fire accidents and theft/intruders in a building or factory, over an Internet of Things (IoT) network. The system includes a Passive Pyroelectric Infrared Detector for human detection and an analog flame sensor to sense the appearance of the concerned objects and then transmit the signal to the workstation via Wi-Fi based on the Espressif32 (ESP32) microcontroller. The computer vision models YOLOv8 (You Only Look Once version 8) and Cascade Classifier are trained and implemented on the workstation, which is able to identify people, some potentially dangerous objects, and fire. The drone is controlled by three algorithms (distance maintenance, automatic yaw rotation, and potentially dangerous object avoidance) with the support of a proportional-integral-derivative (PID) controller. The Smart Drone Surveillance System automatically tracks and streams video of these specific circumstances and then transfers the data to the involved parties, such as security personnel or staff.


Introduction
Currently, unmanned aerial vehicles (UAVs) are utilized in a wide range of applications [1–5], especially in surveillance systems [6,7]. Drone surveillance involves visually monitoring an individual, a group, items, or a situation to prevent potential threats. The establishment of an efficient surveillance system with drone fleets necessitates the smooth integration of dependable hardware and sophisticated automation software. In buildings and factories, there is a high demand for smart security systems with drone applications. Drones are significantly faster than patrol vehicles or security personnel, enabling them to promptly arrive at the location of an incident, thereby facilitating a swift remedial response.
Research [8] uses a drone with a USB camera interfaced with a Raspberry Pi, which is capable of autonomous flight monitoring in campus, office, and industrial areas. However, the system does not include an object identification function and mainly focuses on video streaming; therefore, it cannot track a specific intruder automatically. Another article [9] studies a human motion tracking algorithm with a drone based on the MediaPipe framework [10,11]. A drone-based method for 3D human motion capture has been developed by researchers [12], in which a drone circles a human subject, records a video, and then reconstructs 3D full-body postures. These studies try to reconstruct the 3D posture sequence of a subject rather than concentrating on a methodology for autonomous human motion tracking.
Another paper discusses a deep convolutional neural network (CNN) model (EfficientNet-B3) [13] for plant disease identification from images captured by a drone. Nevertheless, EfficientNet's performance can degrade on hardware accelerators such as GPUs, which are designed for large computational models where data movement is a relatively small component of the overall runtime; EfficientNet requires significantly less computation but more data movement than comparable networks.
In this work, a Smart Drone Surveillance System (SDSS) is developed for the case of intruders and fire accidents. Sensors are responsible for detecting the intrusion or the accidental flame and then sending the signal to the workstation via Wi-Fi. The drone is located in advance in a hidden place at a proper distance inside the building. After receiving the sensor signal, the workstation controls the drone to reach the sensor location. On its trajectory, the drone can also track suspicious objects. The drone turns around the sensor place to identify a person or fire. Once it captures the concerned object, the drone starts tracking at a safe distance. The object identification is trained with YOLOv8 [14] and Cascade Classifier [15]. The drone-control software is developed in Python [16]; it sets the drone's trajectory to the sensor locations and transfers the camera video to the workstation. After obtaining the object video, the first 5 or 10 s of the video are sent to the PCs or mobile phones of the emergency contacts, such as the security or police. Then, the drone keeps following the detected object and stores the video on the workstation. In this way, the critical circumstances are under security monitoring, and all the evidence is saved.
In addition, three algorithms are implemented in the flight control system: distance maintenance, automatic yaw rotation, and potentially dangerous object avoidance. Distance maintenance is regulated by a proportional-integral-derivative (PID) controller [17] based on the prediction box area from the Artificial Intelligence (AI) object identifier. The box area must stay within a specific threshold to guarantee a safe distance between the object and the drone. The automatic yaw rotation is also adjusted by the PID controller, depending on the difference between the box center x-coordinate and the frame center. The drone rotates its yaw angle, following the right/left motion of the object, to keep the object at the center of the frame. The third algorithm helps the drone avoid dangerous objects that can be thrown, such as knives, scissors, cups, and bottles. Once the drone detects such an object close to it (based on the object's area size in the image), the drone moves to one side and rotates its heading to the opposite side, so the tracking of the intruder continues.
The devices in the SDSS communicate with each other under the IoT protocol [18,19]. The IoT refers to the interconnection of things over the Internet, enabling smart devices to engage in communication, data exchange, and interaction between electrical components, software systems, networks, and sensors, thus facilitating effective communication processes. Effective communication among smart devices plays a crucial role in the IoT, as it facilitates the gathering and sharing of data. This capability significantly contributes to the overall success of IoT products and projects.
This paper contributes the following points to scientific research:
• Both YOLOv8 and Cascade Classifier are successfully implemented together in the flight system to support each other in object detection, achieving high accuracy and speed for surveillance purposes.
• The distance maintenance and yaw rotation algorithms based on the PID controller are described in detail, giving the reader a deep comprehension of drone control supported by AI techniques.
• An algorithm for potentially dangerous object avoidance is proposed, which uses a straightforward strategy to dodge an approaching object based on the trained model.
• The strong point of this paper is the combination of the computer vision models and the UAV algorithms into one smart system. There is a highly effective connection between the two sides of the implementation: the drone control algorithms are driven by the AI models. Thus, this paper not only describes the robust flight control methods in detail but also the automatic operation connected to the trained AI object identifier models.
The paper is organized as follows: the first part describes the utilized components, the computer vision models, and the control algorithms for the drone; the second part contains the experimental setup and result analysis; finally, the conclusions and future work are outlined.

Related Work
The SDSS utilizes AI models in computer vision, such as YOLOv8 and Cascade Classifier, to identify the suspected objects. A human detection method based on YOLOv5 was proposed in [20], which successfully identified the human object. When comparing YOLOv8 against YOLOv5, YOLOv8 demonstrates faster speed and improved throughput with a similar number of parameters due to the implementation of hardware-efficient and architecturally reformed approaches [21]. On the other hand, the Cascade Classifier, based on the AdaBoost algorithm, is speedy, robust, and accurate. Meanwhile, deep learning methods like convolutional neural networks (CNNs) [22] require significant data and processing power.
A system for autonomous human-following drones used 3D pose estimation to gather data on human direction for following a moving subject from a certain direction [23]. This system uses PID controls with input values such as the longitudinal movement speed, vertical movement speed, and rotational angular speed, with the neck and middle hip as the reference points. In our paper, the PID controller works based on the output bounding box area from the computer vision models, comparing it with the desired threshold. A larger box area corresponds to a closer distance and vice versa.
The same research [23] also proposes a system that calculates the human direction from the 3D joint points of the left and right shoulders, from which the left-right movement speed of the drone is changed to approach the target value. Unlike that control system, our manuscript utilizes the PID controller to regulate motor speed by calculating the difference between the prediction box center and the image center in pixels. This method can constantly track the person's direction change and adapt, aiming to keep the human-object box around the frame center.
Typically, to avoid obstacles or approaching objects, a drone needs to be equipped with a light detection and ranging (LiDAR) sensor to perceive obstacles in the surrounding environment [24] or with ultrasonic sensors for distance estimation [25]. To minimize the extra cost and high power consumption, our drone system implements an algorithm that dodges to the right/left by detecting the approaching object's position with respect to the image frame.
Another study uses a tracking camera with a path-planning algorithm for collision avoidance in horizontal space [26]. The tracking camera on the drone is used to track its position. The idea is to drive the drone to the target along three possible trajectories; if there is an obstacle, the drone switches to another planned path. However, this method requires an external sensor system, such as the OptiTrack motion tracker, to track obstacle positions [27]. Our developed system tracks the obstacles directly based on the YOLOv8 object identifier. Hence, the SDSS has good response times with self-detection and an optimized construction cost for the drone.
Sensors are essential in the Internet of Things (IoT) [28]; they acquire and transmit data from their surroundings to the center of wireless communication. In the proposed system, the PIR sensor [29] for human appearance detection and the flame detection sensor [30] are mounted in the monitored places to transmit alerts to the workstation based on the ESP32 [31]. The smart system receives an alert and then automatically activates the whole flight system, implementing the robust algorithms with the AI models inside. The overall SDSS is shown in Figure 1.
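For illustration, the alert path from sensor node to workstation can be sketched in a few lines of Python. The UDP transport, port number, message format, and the launch_mission() stub below are assumptions made for this sketch, not the exact protocol of the paper:

```python
import socket

ALERT_PORT = 4210  # assumed port shared with the sensor nodes

def launch_mission(kind: str, location: str) -> None:
    # Placeholder: the real system would command the drone to the sensor
    # location here and start the AI tracking loop.
    print(f"Alert '{kind}' at '{location}' -> dispatching drone")

def listen_for_alerts() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", ALERT_PORT))   # listen on all interfaces
    while True:
        data, _addr = sock.recvfrom(64)  # sensor packets are tiny
        kind, _, location = data.decode().partition(":")
        if kind in ("PIR", "FLAME"):
            launch_mission(kind, location)

if __name__ == "__main__":
    listen_for_alerts()
```

A connectionless datagram is a natural fit here: each alert is a tiny, self-contained message, and a lost packet is simply repeated by the sensor on its next detection cycle.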

Drone Components
The drone produces a continuous low humming sound during flight. As shown in Figure 2, a quadcopter drone has four propellers and motors, a power distribution board, a frame, electronic speed controllers (ESCs), a flight controller, a battery, a receiver, a camera, and sensors. A pressure sensor measures the altitude, i.e., the distance between the ground and the drone. The inertial measurement unit (IMU) sensors [32–34] measure the accelerations and angles of the drone.

Drone Working Principle
The drone can work in 4 degrees of freedom (DOF): translation in 3 directions and rotation about 1 axis. As illustrated in Figure 3, two propellers rotate clockwise and the other two rotate anticlockwise, generating zero net angular momentum while creating the lift that flies the drone. This characteristic keeps the drone stationary rather than rotating in one direction while hovering. The translations are up-down, left-right, and forward-backward. To rotate the drone yaw counterclockwise, the speed of the clockwise-rotating motors must be increased and the speed of the anticlockwise-rotating motors reduced, and vice versa.
The speed of the drone motors is commanded in cm/s through a Python library, DJITelloPy [35].
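As a minimal sketch of this interface (assuming a Tello reachable on the local Wi-Fi), the velocity-style command used throughout the control algorithms looks as follows:

```python
from djitellopy import Tello
import time

tello = Tello()
tello.connect()                      # join the drone's Wi-Fi control link
print("Battery:", tello.get_battery(), "%")

tello.takeoff()
# send_rc_control(left_right, forward_backward, up_down, yaw): each value is
# a velocity in roughly cm/s, clamped to [-100, 100]. Here: hover in place
# while yawing slowly to the right for two seconds.
tello.send_rc_control(0, 0, 0, 20)
time.sleep(2)
tello.send_rc_control(0, 0, 0, 0)    # stop rotating
tello.land()
```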

YOLOv8
YOLO is a widely adopted family of object detection models utilized for real-time object identification and classification within computer vision. The primary characteristic of YOLO is its single-stage detection methodology, which was specifically devised to identify objects rapidly and accurately in real time. YOLOv8 provides the most significant advantages in both detection accuracy and speed among all YOLO versions [36]. As illustrated in Figure 4, the trained models output the prediction boxes with bounding heights and widths. YOLOv8 incorporates the C2f module, which effectively combines two parallel branches of gradient flow, enhancing the resilience and effectiveness of gradient information propagation. The integration of advanced characteristics with contextual information enhances the precision of detection. The detection module utilizes a combination of convolutional and linear layers to effectively transform the high-dimensional information into the desired output bounding boxes and object classifications. The backbone changed with C2f replacing the C3 module (composed of 3 convolutional layers) of YOLOv5 [37]. C2f concatenates the outputs of all bottlenecks (a bottleneck is a combination of two 3 × 3 convolutional layers with residual connections), while C3 only utilizes the output of the last bottleneck. Every convolution filter is responsible for extracting a particular characteristic from the image. Figures 5 and 6 depict the bottleneck and C2f structures, respectively.
Note: CBS = Conv + BN + SiLU, where Conv is a convolutional layer, BN (batch normalization) normalizes the previous layer's output using the current batch's mean and variance, and SiLU (Sigmoid Linear Unit) is an activation function for neural networks.
Overall, the YOLOv8 structure mainly contains an input segment, a backbone, a neck, and an output segment with a detection head.
The input segment implements mosaic data augmentation, adaptive anchor computation, and adaptive grayscale padding on the input picture.
In the backbone network, the input picture undergoes processing by several convolutional (Conv) and C2f modules in order to extract feature maps at various scales. The output feature maps are then processed by the spatial pyramid pooling fast (SPPF) module, which utilizes pooling with different kernel sizes to merge the feature maps; the combined outputs are subsequently transmitted to the neck layer. The SPPF module, with its three sequentially connected maximum pooling layers, reduces the computational effort and the latency.
The neck layer of YOLOv8 utilizes the Feature Pyramid Network (FPN) [38] and Path Aggregation Network (PAN) [39] architecture to augment the model's capacity to fuse features. This architectural design integrates high-level and low-level feature maps by utilizing upsampling and downsampling techniques, enabling the effective transmission of both semantic and localization cues. By employing this methodology, the network gains an improved capability to integrate characteristics from objects with diverse scales, hence augmenting its detection efficacy on objects with variable scales.
The detection component of YOLOv8 adheres to the conventional approach of segregating the classification component from the detection component. The process involves the computation of loss and target detection box filtering. The computation of loss has two main components, namely classification and regression, with the exclusion of the object branch. The classification branch employs the Binary Cross-Entropy (BCE) loss function, whereas the regression branch utilizes the Distribution Focal Loss (DFL) [40] and CIoU loss functions [41]. The formation of prediction boxes in YOLOv8 involves the utilization of decoupled heads, which have the capability to predict classification scores and regression coordinates concurrently. The classification scores are represented by a two-dimensional matrix, which signifies the existence of an object at each individual pixel. The regression coordinates can be denoted by a four-dimensional matrix, which represents the displacement of the object's center with respect to each pixel. YOLOv8 utilizes a task-aligned assigner to calculate a task alignment measure from the classification scores and regression coordinates. The task alignment measure integrates the classification scores with the Intersection over Union (IoU) value, facilitating the joint optimization of classification and localization while mitigating the impact of low-quality prediction boxes. The IoU metric is extensively utilized in the field of object identification. It plays a crucial role in identifying positive and negative samples and estimating the distance between predicted boxes and ground truth. An object is commonly categorized as detected when the IoU surpasses a threshold of 0.5.
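The IoU computation described above can be written compactly; the helper below is a generic illustration with boxes given as (x1, y1, x2, y2) pixel corners:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union else 0.0

# A prediction is commonly counted as a detection when iou(pred, truth) > 0.5.
```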
Although YOLOv8 is a robust model, it still has limits in detecting small objects or objects with low contrast. Thus, another OpenCV-based [42] model is implemented for fire detection, since it is essential to discover the flame from the ignition state, when it is still small.
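For reference, a minimal Ultralytics inference sketch is shown below; the weights filename sdss_person.pt is illustrative, not the authors' artifact:

```python
from ultralytics import YOLO

model = YOLO("sdss_person.pt")   # illustrative path to the trained weights

def detect_boxes(frame):
    """Return (x1, y1, x2, y2, confidence, class_id) tuples for one frame."""
    result = model(frame, verbose=False)[0]
    boxes = []
    for b in result.boxes:
        x1, y1, x2, y2 = b.xyxy[0].tolist()  # corner coordinates in pixels
        boxes.append((x1, y1, x2, y2, float(b.conf[0]), int(b.cls[0])))
    return boxes
```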

Cascade Classifier
The Cascade Classifier technique works based on the Haar feature-based Cascade Classifier, an effective object detection method proposed in [43]. In this case, about 600 positive and 400 negative image samples are used to train the classifiers for fire detection; positive images contain fire, and negative images do not. As illustrated in Figure 7, the training proceeds in the following four steps (a usage sketch follows the list):
• Step 1: Gathering the Haar features. In a detection window, a Haar feature is effectively the result of computations on neighboring rectangular sections: the pixel intensities in each region are summed, and the difference between the sums is taken. Figure 8 shows the Haar feature types.
• Step 2: Creating integral images. The calculation of these Haar features is sped up using integral images. Rather than computing every pixel, the method constructs sub-rectangles and array references for each of them, from which the Haar features are then computed.
• Step 3: AdaBoost training. AdaBoost selects the top features and trains the classifiers that utilize them. It combines weak classifiers to produce a robust classifier for finding objects. Weak learners are produced by sliding a window across the input image and computing the Haar features for each area, which are compared against a learned threshold that distinguishes objects from non-objects. Individually these are weak classifiers, whereas an accurate strong classifier requires many Haar features; the last phase merges these weak learners into strong ones using cascading classifiers.
• Step 4: Implementing cascading classifiers. The Cascade Classifier comprises several stages, each containing a group of weak learners trained with boosting, which yields a highly accurate classifier from the averaged prediction of all weak learners. Based on this prediction, the classifier either moves on to the next region (negative) or reports that an object was identified (positive). Because the majority of windows contain nothing of interest, the stages are designed to discard negative samples as quickly as possible.
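Once trained, the cascade is applied with OpenCV as sketched below; the XML filename and the detectMultiScale parameters are illustrative:

```python
import cv2

fire_cascade = cv2.CascadeClassifier("fire_cascade.xml")  # illustrative name

def detect_fire(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # scaleFactor and minNeighbors trade detection speed against false
    # positives; the result is a list of (x, y, w, h) rectangles.
    return fire_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```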

Evaluation Metrics
Precision (P), recall (R), and average precision (AP) are the evaluation metrics used to validate the object detection models. The AP is the average accuracy of the model [44]. P and R are calculated as follows:

P = TP/(TP + FP)
R = TP/(TP + FN)

where TP is true positive, FP is false positive, and FN is false negative.
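As a worked example of these formulas: with TP = 90, FP = 10, and FN = 20, precision is 0.90 and recall is roughly 0.82:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

print(precision(90, 10), recall(90, 20))  # 0.9 0.8181...
```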

Distance Maintenance
To track a person or fire, the drone needs to identify these objects using AI in computer vision. A rectangular boundary covers the detected object, as shown in Figure 4.
Then, the safety distance must be adjusted in real time, which is carried out based on an area threshold in pixels. The area of the prediction box must be kept within a specific range [A, B] following the algorithm below:
• If the box area < A → the drone is too far away → the two front motor speeds are increased → the drone moves forward.
• If the box area > B → the drone is too close → the two back motor speeds are increased → the drone moves backward.
• If the box area ∈ [A, B] → the drone is at the proper distance from the object → the drone maintains motor speed.
For instance, with an image width and height of 640 and 480 pixels, the threshold can be about [33,200, 33,800] for person detection and about [600, 1800] for fire detection.
The PID controls the speed variation to maintain the proper distance between the drone and the target, as shown in Figure 9. Kp should be large enough that a significant error yields a proportionately high control output. Kd should be set higher if the change is rapid. Ki should be suitable to eliminate the residual error through the historic cumulative value of the error.

• The P-controller is an essential element in control systems. It offers a direct control action proportional to the error between the target setpoint and the measured process variable. The drone's controller continuously modifies the motor speed depending on the difference between the desired and predicted box areas to ensure the drone maintains the appropriate distance from the object. The back motors are slowed to gently bring the drone back if it is nearer than intended, and vice versa. The difference between the required and measured rectangle areas determines how much correction is made; larger differences yield stronger adjustments.
• The D-controller aids in system control by monitoring the rate of change, focusing on the rate of change of the error between the target value and the measured value. When the drone reaches the proper distance, the D-controller helps keep the drone steady by looking at how quickly the drone's speed is changing. If the drone is moving backward or forward too fast, the D-controller adjusts to slow it down. This feature helps the drone stay at the desired distance smoothly, ensuring stability and precise speed control.
• The I-controller operates by continuously summing the error signal over time and utilizing the resultant integrated value to apply suitable modifications to the control inputs. If the drone deviates from its setpoint, the integral controller accounts for the duration and magnitude of the accumulated error and applies corrective actions proportionally. The P and D controllers can make quick adjustments but struggle to remove small, persistent errors that occur over time, leading to steady-state errors.
As shown in Figure 10, the overall control function is

u(t) = Kp·e(t) + Ki·∫e(τ)dτ + Kd·de(t)/dt

where Kp, Ki, and Kd are the proportional, integral, and derivative coefficients, respectively, and e(t) is the error between the desired and current values.
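A compact sketch of this loop is given below; the gains, the area setpoint, and the fixed time step are illustrative values that would need tuning on the real drone:

```python
class PID:
    """Textbook PID: u(t) = Kp*e + Ki*integral(e) + Kd*de/dt."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

AREA_SETPOINT = 33500                         # target box area in pixels
dist_pid = PID(kp=0.0004, ki=0.0, kd=0.0001)  # illustrative gains

def forward_speed(box_area, dt=1 / 30):
    # Positive error (box smaller than setpoint -> object far) commands
    # forward motion; the result is clamped to the Tello's rc range.
    u = dist_pid.step(AREA_SETPOINT - box_area, dt)
    return max(-100, min(100, int(u)))
```

The returned value can be passed directly as the forward/backward component of send_rc_control.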

Yaw Rotation for Object Position Adaption
When the object starts moving to the side, the drone has to rotate its yaw to keep the object in the image center, following the target's motion to the left or right. If the person moves to the left side, the drone has to rotate to the left to bring the object back to the frame center; thus, the speed of the left motors should be adjusted to be higher than that of the right motors, and vice versa, as demonstrated in Figure 11.
The PID is also applied in this case to avoid overshooting issues. When the drone's rotation approaches the object's bearing, its speed must be decreased gradually. The PID adjusts the motor speed depending on how far the drone's yaw is from the object's actual position, as summarized below (a control sketch follows the list):

• If the x-coordinate of the rectangle center < the x-coordinate of the image center → the target moves to the left → the PID adjusts the drone yaw by increasing the left motor speed.
• If the x-coordinate of the rectangle center > the x-coordinate of the image center → the target moves to the right → the PID adjusts the drone yaw by increasing the right motor speed.
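The control sketch referenced above follows; the proportional and derivative gains are illustrative:

```python
FRAME_CX = 640 // 2           # image width is 640 pixels
KP_YAW, KD_YAW = 0.25, 0.05   # illustrative gains
_prev_err = 0.0

def yaw_speed(box_x1, box_x2):
    global _prev_err
    err = (box_x1 + box_x2) / 2 - FRAME_CX  # negative -> target is left
    u = KP_YAW * err + KD_YAW * (err - _prev_err)
    _prev_err = err
    # Negative rc yaw values rotate the Tello to the left, positive to the
    # right, so the sign of the error maps directly onto the rotation.
    return max(-100, min(100, int(u)))

# e.g., tello.send_rc_control(0, forward, 0, yaw_speed(x1, x2))
```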

Potentially Dangerous Object Avoidance
Some other objects are learned by the YOLOv8 model to support drone safety, such as knives, bottles, cups, cell phones, and scissors. If the intruder throws one of these objects and the drone detects it within a specific range, the drone is programmed to move to one side while rotating its yaw to the opposite side to keep tracking the person. For instance, if a knife is thrown at the drone, the drone moves to the left and then rotates its yaw to the right. Another area threshold (in pixels) is set to detect whether an object approaches close to the drone: when the detected object's area on camera exceeds this threshold, the drone avoids it. Figure 12 describes the object at the frame center (the drone moves 25 cm to the right and then turns its yaw 32° to the left); Figure 13 shows the cases when the approaching object is at the left and right (e.g., object at the right → the drone moves 20 cm to the left and then turns its yaw 30° to the right).
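The dodge logic can be summarized as below, where tello is a connected djitellopy.Tello object; the danger classes echo the list above, while the pixel-area threshold is an illustrative assumption:

```python
DANGER_CLASSES = {"knife", "bottle", "cup", "cell phone", "scissors"}
DANGER_AREA = 15000   # illustrative pixel-area threshold for "too close"
FRAME_CX = 640 // 2

def avoid_if_needed(tello, label, box):
    """Dodge sideways and re-aim the yaw when a thrown object gets close."""
    x1, y1, x2, y2 = box
    if label not in DANGER_CLASSES or (x2 - x1) * (y2 - y1) < DANGER_AREA:
        return False
    if (x1 + x2) / 2 >= FRAME_CX:
        tello.move_left(20)              # object at the right -> dodge left...
        tello.rotate_clockwise(30)       # ...then yaw right to keep tracking
    else:
        tello.move_right(25)             # object center/left -> dodge right
        tello.rotate_counter_clockwise(32)
    return True
```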

Sensor Utilization
As shown in Figure 14, the "Passive Pyroelectric Infrared Detector" (PIR HAT) is an M5StickC-compatible human body induction sensor that detects infrared radiation from the body. The sensor outputs HIGH when infrared is detected and holds this output for two seconds until the next detection cycle. The sensor detects infrared radiation from objects in its field of vision. It is referred to as "passive" since it does not produce any energy of its own and instead monitors changes in the amount of infrared radiation around it. The main component of a PIR sensor is a pyroelectric substance, which produces an electric charge when subjected to temperature variations; this substance typically has a crystalline structure and permanent electric polarization. When a person moves within the sensor's range, it causes a rapid change in the infrared radiation levels falling on the pyroelectric material, and the material produces an electric signal in response to this radiation change.
As shown in Figure 15, the flame sensor DFR0076 [8] is an analog flame sensor consisting of an amplifier circuit, a lens, and a photodiode/phototransistor. When a flame is present, it emits light across a broad band of wavelengths, including ultraviolet and infrared. The sensor lens directs this light onto the photodiode/phototransistor, which generates an electrical current according to the light's brightness. The amplifier circuit amplifies this signal, and the resulting analog voltage output indicates the flame's presence and intensity. The flame sensor can sense fire or other light at wavelengths of 760–1100 nm. The flame sensor probe is positioned at an angle of 60 degrees, which allows for enhanced sensitivity to the unique spectral characteristics of flames. The operational temperature range of the flame sensor is from −25 to 85 °C.
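On the sensor side, the node firmware can be approximated as follows. This is a MicroPython sketch for the ESP32 (the authors' actual firmware may differ); the GPIO pin, ADC threshold, and workstation address are illustrative assumptions:

```python
import socket
import time
from machine import ADC, Pin

flame = ADC(Pin(34))            # DFR0076 analog output wired to GPIO34
flame.atten(ADC.ATTN_11DB)      # use the full 0-3.3 V input range

WORKSTATION = ("192.168.1.10", 4210)   # illustrative address and port
THRESHOLD = 1500                       # raw ADC level treated as "flame"

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
while True:
    if flame.read() > THRESHOLD:
        sock.sendto(b"FLAME:room1", WORKSTATION)
        time.sleep(5)           # simple debounce between repeated alerts
    time.sleep(0.1)
```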

Experimental Setup
In this research, the Tello drone [45], shown in Figure 16, is used in the experiment. It can shoot up to 720p video at 30 frames per second.
The AI models were trained on a host machine with an NVIDIA Quadro P620 (a 512-CUDA-core Pascal GPU with 2 GB of GDDR5), an Intel Core i7-10850H vPro processor (2.70 GHz), and 32 GB of RAM. The image annotation was carried out using OpenCV [46]. YOLOv8 was trained based on Ultralytics [47], and the Cascade Classifier was trained with OpenCV [48,49]. Python was used as the development language for the project. The image width and height for the AI models are 640 and 480 pixels, respectively.
The flame sensor is connected to the ESP32, as shown in Figure 17, and the PIR sensor with the M5StickC is set up as shown in Figure 18.

Results and Analysis
Computer Vision Test Performance
About 1000 images were utilized for each model, with 70% used for training and 30% for testing. Table 1 reports the performance of the two computer vision models, YOLOv8 and Cascade Classifier, for person and flame detection, respectively. Although the flame detection model has slightly superior metrics, both models achieve high accuracy, guaranteeing good efficiency in object tracking. In addition, two other popular YOLO models, YOLOv5 [50] and YOLOv7 [51], were trained for comparison with YOLOv8; the metric evaluation shows that YOLOv8 obtains the best precision, recall, and AP.

Person Detection
When the PIR sensor detects a human appearance, it sends the signal to the workstation via Wi-Fi thanks to the ESP32 microcontroller embedded inside the M5StickC. The workstation then controls the drone to fly to the sensor location, detect that person, and keep following, as illustrated in Figure 19.

Evaluation on the Distance Maintenance
The distance and direction evaluation tests were carried out. The prediction box area is kept within a specific range [33,200, 33,500]. In order to evaluate the effectiveness of the proposed system in terms of distance, we conducted an experiment in which the drone was tasked with tracking a participant who walked and periodically stopped along a linear path. The person stood for about 10 s, then walked fast for one step and stopped for another 10 s, then walked again and stopped. In this case, when the distance between the drone and the person is maintained, the bounding box area of the person detection is supposed to stay approximately the same.
The result in Figure 20 shows the drone's success in following the person. When the person moves very fast, the distance between the drone and the object increases quickly, corresponding to the sudden drop of the bounding box area. However, the drone recovers the distance by moving forward immediately. Here, the person moves away only one short step, so the drone covers the distance quickly, and the AI system continues to detect the person with a bounding box.

Evaluation of Direction Rotation
To evaluate the drone's rotation ability, the person moves to the left and then to the right side of the drone at the same distance with normal speed. The purpose is to observe whether the drone yaw can rotate according to the motion. Figure 21 demonstrates that the drone has a good capability of adapting its yaw rotation, following the direction of the tracked object.

Potentially Dangerous Object Detection
The drone is able to detect potentially dangerous objects, as shown in Figure 22. Once these identified objects are thrown at the drone, it can avoid them by moving to one side and adjusting its yaw orientation to keep tracking the person, thanks to the automatic yaw rotation algorithm. The maximum moving speed is about 100 cm/s. For other weapon types, such as guns, the system needs further training to gain the ability to detect diverse weapons.


Fire Detection
Similarly, when the flame sensor detects the firelight, it sends the signal to the workstation via Wi-Fi. Then, the drone begins its action and tracks the fire, as shown in Figure 23. After 5 or 10 s of recording, the video is delivered to the security, the building administrator, or the house owner via mail or other telecommunication tools.

The Cascade Classifier is quick, since each classifier in the cascade only needs to process a small portion of the data. This enables faster object recognition by reducing the amount of data to be processed. In addition, objects may still be detected by the chain of classifiers despite noise and other distortions. Because each classifier in the cascade is trained on a fraction of the data, the Cascade Classifier technique can provide high accuracy in object detection. In future work, more fire scenarios will be tested in diverse circumstances.

Table 2 reports the main pros and cons of the developed system.

Table 2. Advantages and limits of the SDSS.

Advantages:
• The drone is automatically controlled to arrive at and track the concerned object via wireless communication from the sensors.
• The drone implements practical algorithms that enhance its security capability: maintaining a safe distance, following the object's direction, and recognizing dangerous objects.
• The system has fast object tracking thanks to robust computer vision models.

Limits:
• The drone should stay at a proper distance from the sensor positions for easy activation.
• The SDSS is robust but complex, requiring a good workstation to operate.

Conclusions
An SDSS was successfully developed to detect and track the concerned objects in cases of intrusion and fire accidents with the support of AI models and IoT communication. Both of the computer vision models we used, YOLOv8 and Cascade Classifier, were trained and implemented on the workstation for object classification. Furthermore, three algorithms for drone control were implemented for the automation and optimization of drone functions such as target following and dangerous object avoidance. The entire system is capable of collecting the alerts from the IoT sensors and manipulating the drone to acquire the data, monitoring the stream to store the data, and transmitting the data to other responsible electronic devices via Wi-Fi. In the future, the developed system will be applied to more circumstances, such as particular factories, for further experiments to collect more information about the system's pros and cons. From this stage, more drone control algorithms and AI models can be employed in the smart flight system.

Figure 1. Smart system diagram based on IoT.
Figure 4. Image frame and prediction box in pixel coordinates.
Figure 9. Drone tracks the object at the proper distance.
Figure 12. Object in the frame.
Figure 18. PIR sensor with M5StickC setup for person detection.
Figure 22. Potentially dangerous object detection.
Table 1. Performance results on the test-set images.