Target Detection and Recognition for Traffic Congestion in Smart Cities Using Deep Learning-Enabled UAVs: A Review and Analysis

Abstract: In smart cities, target detection is one of the major issues in avoiding traffic congestion. It is also a key topic for military, traffic, civilian, sports, and numerous other applications. In daily life, target detection is a challenging and serious task in traffic congestion due to various factors such as background motion, small target size, unclear object characteristics, and drastic occlusion. For target examination, unmanned aerial vehicles (UAVs) are becoming an attractive solution due to their mobility, low cost, wide field of view, availability of trained operators, low threat to people’s lives, and ease of use. Because of these benefits, along with good tracking effectiveness and resolution, UAVs have received much attention in transportation technology for tracking and analyzing targets. However, objects in UAV images are usually small, so after processing by a neural network, a large amount of detailed information about the objects may be lost, which results in deficient performance of actual recognition models. To tackle these issues, many deep learning (DL)-based approaches have been proposed. In this review paper, we study an end-to-end target detection paradigm based on different DL approaches, including one-stage and two-stage detectors, applied to UAV images to observe targets in traffic congestion under complex circumstances. Moreover, we analyze the evaluation work aimed at enhancing accuracy, reducing computational cost, and optimizing the design. Furthermore, we provide a comparison of the various technologies for target detection, followed by future research trends.


Introduction
In smart cities, intelligent transportation networks have gained much attention in computer vision as a means of avoiding traffic congestion and accidents. Traffic congestion occurs when the volume of traffic increases and the speed of moving objects drops. It causes [...]
UAVs have been integrated in different fields, for example, computer vision [5], search and rescue operations [6], and communication systems [7][8][9]. UAVs play an important role in military and non-military applications such as surveillance, filmmaking, attack, cultivation, scientific analysis, cargo transport, and many more due to their low cost, mobility, ease of use, and accessibility [10]. The advantages of drones/UAVs lie in saving time and investment, boosting the reliability of measurements, increasing the security of data logging, improving performance in complex scenarios, and making investigations more systematic. Despite these advantages, UAVs also face several significant challenges in terms of full deployment, which are given below:
• In terms of heavy mobility, UAVs have insufficient ability to evaluate traffic motion at intersections due to their dimensions, underneath stimulation, and braking capabilities.
• In dense urban regions, UAV delivery can be impractical due to a limited maximum payload (mass and volume), comparatively short range, insufficient low-altitude airspace, limited sensor capability, and limited battery capacity, which can cause difficulties and security risks.
• UAVs have insufficient auto-navigation capability and face transmission and energy problems when detecting different kinds of disasters such as floods, fires, and air accidents.
• UAVs have insufficient information about patrolling regions and are unable to examine and track the source of pollution.
• Moreover, target detection in the form of pedestrians, vehicles, and traffic signs is one of the critical issues of UAVs in traffic congestion due to complicated environments, small and heavy target size, drastic occlusion, weather and lighting variations, and unclear object characteristics such as appearance and color, as demonstrated in Figure 2.
To overcome the above-mentioned issues in UAV-based target detection, researchers have proposed many DL-based approaches. Recent DL approaches to UAV-based target detection have progressed rapidly owing to their strong performance. DL is a class of artificial intelligence and machine learning methods that can gather, learn from, and analyze massive quantities of data. DL uses both graph and transformation techniques to create multi-level learning models.
In this article, we review DL approaches that address the problems of target detection in traffic congestion. In recent years, many researchers have tried to resolve this issue with the help of different approaches. To handle the tracking and detection of moving targets, several works have relied on UAV-mounted cameras using traditional approaches [11,12]. In 2016, Zhang et al. [13] presented the pixel-based adaptive segmenter algorithm for target detection. In [14,15], the fast Fourier transform and a kernel function were employed in a discriminative correlation filter-based detector to perform the complex computation in the frequency domain rather than in the spatial domain, which optimized and enhanced the performance of the detection model. In 2018, the Kanade-Lucas-Tomasi tracking approach was presented by Ke et al. for target detection and tracking [16]. However, traditional approaches are less precise because they generalize poorly.
For robust tracking, DeepSORT was used, which combines DL features with a Kalman filter [17]. Apart from that, in deep image feature extraction, CNNs unify feature extraction, selection, and classification, which makes them superior to traditional techniques. Their effectiveness is higher, so the scalability and correctness of target detection are also much better than with traditional techniques. Furthermore, different object detection approaches based on CNNs have been presented, for instance, the Region-based Convolutional Neural Network (RCNN) and the Single Shot MultiBox Detector (SSD), to detect small targets and operate at low altitudes. However, their target detection accuracy is still not good enough [18][19][20]. In contrast, single-stage detectors such as YOLOv3 and YOLOv4, presented by Redmon and Farhadi [21] and Bochkovskiy et al. [22], perform better in target detection and offer excellent real-time performance and high precision. However, they require considerable computational resources. Besides this, in [21,23,24], UAV-based detection and tracking methods have been presented that merge vision, KCF (kernel correlation filter)-based detectors, and YOLOv3. Moreover, many review articles and books on detecting targets in traffic congestion have been published. For example, Butilȃ [25] presented a review on the applications of UAVs associated with traffic monitoring. Srivastava et al. [26] studied important parameters to highlight small-size issues. Osco et al. [27] reviewed both UAV remote sensing images and DL. Alzahrani et al. [28] presented a review of existing UAV-assisted systems. Kanistras et al. [29] surveyed traffic monitoring based on UAV systems, while Outay et al. [30] reviewed vision processing techniques. Park et al. [31] reviewed information and communication technology approaches based on DL applications to analyze targets in traffic in real time using UAVs. Zhang et al. [32] presented correlation filtering and other tracking algorithms to solve the target occlusion problem in a review paper. From the above body of work, it is worth noting that all of the above-mentioned works consider either only UAV-based target detection or one specific kind of DL for target detection. None of the above-mentioned reviews considered applications of DL and UAVs in target detection at the same time. The purpose of this review paper is to discuss several issues rather than one specific issue, using different methods in contrast to other review papers, and to provide future directions toward preferable methods for handling those issues in the coming years. It also presents a brief overview of DL-based approaches for UAV-based target detection.
The main contribution of this review paper is to address the target detection problem under various circumstances with the help of DL approaches, instead of exploring one specific DL technique for target detection, and to identify the best-performing models in current research. The material in this paper is drawn from different sources, for instance, conference proceedings, journals, and workshops, to offer readers an advanced-level view of the relationship between DL techniques and target detection through UAVs in traffic congestion.
The rest of the paper is organized as follows: Section 2 describes the concepts and research tasks associated with UAVs in the area of target detection in traffic congestion, along with major metrics and crucial issues. Next, Section 3 introduces the DL approaches used for handling crucial issues in target detection. Section 4 examines existing implementations of DL techniques in target detection. Section 5 presents a comprehensive discussion on the use of DL techniques in target detection and outlines their limitations to identify future research trends. Finally, Section 6 concludes this review paper. For more clarity, the organization of the paper is presented in Figure 3.

Target Detection Prototypes and Their Related Metrics and Crucial Issues
This section reviews research tasks that use UAV videos and images to improve target detection in traffic congestion and management. The following subsections give detailed studies based on different methods and metrics.

A. Target Detection Prototypes
The dominant features of UAVs in target detection are elaborated in Table 1.

1. Mobile UAV Trajectories Based on Road Traffic Monitoring Approach: Within a town, due to the insufficient number of UAVs, it is not feasible to comprehensively track all targets such as vehicles, pedestrians, and motorbikes. To handle this issue, a mobile UAV-based road traffic monitoring system was proposed by Elloumi et al. [33] in 2019.
• Advantages: This approach aims to track the detection rate and the statistics of monitored vehicles to reduce accidents and overspeeding. Moreover, it generates UAV trajectories to observe targets within a town from a long distance to avoid traffic congestion.
• Limitation: It is necessary to enhance the detection rate of an isolated vehicle in a congestion event at low speed by sharing dispersion-related information through UAVs.

2. DeepSORT Approach: For electric vehicles and target detection in metropolitan surroundings, a DeepSORT approach based on DL was formulated by Liu and Zhang [34] in 2021. This approach combines YOLOv4 with various tracking methods and fuses them with the target detection network to handle the state estimation of the tracked target under non-uniform motion and to determine the target position through UAVs.
• Advantages: The combination used in the presented model aims to significantly enhance the performance, robustness, and positioning of multi-target detection and tracking in complicated metropolitan surroundings.
• Limitation: The performance is still deficient when the UAVs fly at high altitude, which may cause problems in detecting small ground objects.

3. SAHER System Approach Based on UAVs: In traffic congestion, road accidents are caused by energy and coverage issues. To handle this issue, Ali et al. [35] proposed a SAHER system based on UAVs using 5G data processing in 2020.
• Advantages: In real-time scenarios, this approach detects speeding and other traffic violations to reduce the number of crashes.
• Limitation: The rate of fatalities and injuries is still extremely high.

4. Traditional and UAV Vision Approach: In 2022, Cheng et al. [36] presented a model based on traditional and UAV approaches, which mainly include YOLOv3, Mean-Shift, the Gaussian background difference approach, and Kalman filter algorithms, to observe the unauthorized behavior of the target.
• Advantages: The aim of this approach was to compare the results of UAV and traditional approaches on four features: manual time, divergence results, recognition speed, and accuracy. Therein, target detection based on the UAV approach performs better than the traditional approach. This is because the traditional approach has a high computational cost and poor robustness. Furthermore, it cannot fulfill the substantial application demands in

B. Metrics
In recent years, various researchers have concentrated on miscellaneous features of UAVs for target detection, which include cost, safety, privacy, etc. A few dominant metrics are shown in Figure 5 and reviewed below.

1. Power Consumption: For UAVs, it is difficult to detect small targets under different weather conditions due to limited battery capacity. However, it is not practical to increase the size of the battery, because doing so increases the mass of the UAV, which is another critical consideration. Several studies have addressed the battery demands of UAVs through the wireless power transfer (WPT) approach [37,38]. Through an effective WPT link, UAVs can be charged to increase their flight time and range for target inspection, observation, and other surveillance assignments. Moreover, this can overcome many restrictions of current inspection approaches, for instance, costly tasks and dangerous operations. In [39], the researchers charged the battery of a drone and tested the presented schemes over various distances and uneven terrain with the help of magnetic resonance coupling WPT. Besides this, machine learning and DL algorithms for drones also provide a better solution in terms of energy consumption for data collection and processing in compute-hungry settings [40].

2. Security and Privacy: Security and privacy are key factors when detecting targets through UAVs. This is because attackers can sometimes steal all of the information accumulated by UAVs through scareware, viruses, and keystroke logging with the help of malicious software. Stolen or tampered data can lead to false detection of traffic congestion, convey corrupt statistics that mislead the ground command stations, and be used in illegal activities, for instance, against military operations. To protect the system from such attacks and deliver correct information about the required target, a model named "Privacy by Design" was presented in [41], which provides a solution for security and privacy violations. Recently, blockchain, machine learning, DL, and watermarking have played an important role in securing UAV applications. These approaches aim to supply reliable, safe, and accurate information and to secure programmatically updated data. Further details of these approaches are presented in [42].

3. Cost: To identify targets, it is essential to develop low-cost UAV detectors. Light weight and low cost are the features that allow UAVs to deliver high-quality inspection at very high temporal and multidimensional resolutions without endangering human lives. In recent years, traditional approaches have been used to detect targets through UAVs, but traditional approaches such as those based on scale-invariant feature transform features [43] have drawbacks such as high computational cost, bulky deployment, and unpredictable risks, and they cannot identify targets in real-time scenarios. In contrast to traditional approaches, DL approaches such as R-FCN [44] have an acceptable computational cost and meet real-time demands. However, there is still a need to better balance model complexity against detection reliability and to validate the models with unified data from multiple sources [45].

4. Resource Employment: It is exhausting for a human crew to find targets from aerial images alone, which consumes extensive human resources and time. Intelligent automatic target recognition and monitoring technology can make UAVs more capable in rescue, disaster response, goods transportation, target strikes, enemy reconnaissance, and other tasks, which can greatly decrease the consumption of human resources and further stimulate the development of the UAV domain. Moreover, implementations differ in resource utilization and time complexity depending on the algorithm; for example, Faster R-CNN consumes substantial computing resources. Therefore, in 2016, Jifeng Dai et al. [44] presented the R-FCN approach to overcome this issue. Traditional manual approaches to screening image content lead to omissions and misjudgments. Therefore, it is impossible to depend entirely on human resources for capturing, displaying, and processing large-scale video data. Large-scale video data gathered in real time through UAVs can be screened by DL [46] and big data algorithms, which transforms target detection from a weak manual process into a smart, real-time, structured one and produces useful information that meets users' demands, saves manpower and data resources, reduces the cost of monitoring, and improves efficiency.

C. Issues in Target Detection
To tackle the target detection challenges and upgrade the above-stated metrics through UAVs, it is necessary to sort out the following problems.

1. Small-Size Objects: In real applications, the shooting altitude is high, the target is much smaller than the image, and the target has weak features. In addition, the target suffers a certain degree of distortion caused by the shooting angle, and the relative movement between the UAV and the target causes the target to change substantially against the background. Besides this, in some datasets such as MS COCO, small targets have limited discriminative features, which leads to missed and false target detections [47].

2. Target Occlusion: Target occlusion occurs due to target blockage and the effect of the surrounding illumination, which severely degrades the tracking and identification of targets. Further, occlusion also causes false and missed detections. Many researchers have tried to solve the target occlusion issue with different methods designed for various occlusion conditions. Some of the advanced DL-based methods are discussed in the next section.

3. Joint Issues: An increasing number of applications are multi-faceted. For example, some tasks are simultaneously time-sensitive and resource-intensive, such as multi-scale appearance, spot, missed target and victim recognition [48,49], and enhancement of realistic applications [50]. Researchers have begun to tackle such combined problems in target detection, which are called joint issues.

DL for Target Detection
DL has recently demonstrated excellent results in solving numerous robotic functions in the areas of perception, planning, segmentation, and control. It has an excellent ability to learn representations from complex data obtained in real-world contexts, which makes it ideal for several types of autonomous applications. We concentrate on two categories of DL approaches used in target detection for traffic congestion, i.e., one-stage detectors and two-stage detectors, which are shown in Figure 6.

A. One-Stage Detectors
The one-stage detector, also known as a regression-based approach, directly computes the object's coordinates and class probabilities and produces the results in a single detection pass, which greatly increases the detection speed. Some frequently used one-stage detectors for target detection through UAVs are stated below.
• YOLO was introduced by Joseph Redmon [21] in 2015. The major aim of this approach is to detect small objects at high speed. Through an artificial neural network (ANN), the algorithm extracts the image features and then uses a regression algorithm to perform detection. With the help of the neural network, it directly outputs the class and location of each bounding box. Darknet-19 and GoogLeNet are used as backbones in the training network, while confidence loss is used as a loss function. Each grid cell is responsible for target detection. This algorithm has strong generalization capabilities because it learns highly general features that transfer to other domains.
• YOLOv3, an updated version of YOLO, was presented by Joseph Redmon and Farhadi [21] in 2018. As a backbone, this version uses the Darknet-53 classifier together with a multi-scale detector. Feature extraction is carried out by Darknet-53, which has 53 convolutional layers and is trained on ImageNet. The feature map is downsampled by convolutional layers with a stride of two, and detection is performed at three different scales. Meanwhile, batch normalization is applied after the convolutional layers to normalize the inputs of deep layers. In contrast to Darknet-19, Darknet-53 shows superior accuracy. Besides this, Leaky ReLU can be used to mitigate over-fitting. Through extra convolutional layers, this version captures low-level features, which improves the detection of small targets and other issues and also enhances its speed. Moreover, the loss value is obtained by comparing the prediction results with the ground-truth labels of the sample class, and the network parameters are updated through back propagation to obtain the optimized network model for target detection. The loss function is the sum of three distinct losses: (i) confidence loss, (ii) classification loss, and (iii) location regression loss, as shown in Equation (1).
• YOLOv4 is the latest version of the YOLO family, which was presented by Bochkovskiy et al. [22] in April 2020. This version improves on and outperforms YOLOv3. The model is divided into three parts: (i) backbone network, (ii) neck network, and (iii) head network. The CSPDarknet53 classifier, which is the combination of Darknet53 and CSPNet [51], is used as the backbone. With extra modules, convolutional and batch normalization layers are attached after the backbone. Further, the Mish activation function and spatial pyramid pooling are used to enhance the accuracy of the feature output and the generalization capacity of the network [52]. Moreover, the main advantage of these layers is improved performance in difficult multi-target detection. In the neck, a path aggregation network and an FPN (feature pyramid network) are used as the parameter aggregation method to shorten the information path for the different detector levels [53]. The head still employs the head of YOLOv3. Additionally, GIoU (Generalized Intersection over Union) is used as a loss function to improve the evaluation of the target and to optimize the model with respect to various factors such as illumination conditions, height, dimensions of the object, degree of occlusion, etc. Moreover, GIoU measures the overlap ratio between the ground truth box and the prior and predicted bounding boxes. The loss function is expressed in Equation (2), where P represents the smallest box that encloses both the ground truth and the predicted bounding boxes, which determine the estimated ground target.
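As a rough illustration of the GIoU-based loss described above, the following Python sketch computes GIoU for two axis-aligned boxes using the commonly cited definition, GIoU = IoU - (|P| - |A ∪ B|) / |P|, where P is the smallest enclosing box, and takes the loss as 1 - GIoU. The function names and the (x1, y1, x2, y2) box layout are assumptions made for this example and are not taken from [22]; the exact form used in Equation (2) may differ.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def giou(box_a, box_b):
    """Generalized IoU of two boxes in (x1, y1, x2, y2) format (assumed layout)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (zero area if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_area(box_a) + box_area(box_b) - inter
    if union <= 0.0:
        return 0.0  # degenerate boxes: no meaningful overlap measure

    iou = inter / union

    # P: smallest box enclosing both the ground truth and the prediction.
    enclosing = (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2))
    area_p = box_area(enclosing)

    # Penalize the part of P that is covered by neither box.
    return iou - (area_p - union) / area_p


def giou_loss(pred_box, gt_box):
    """Loss that is 0 for a perfect match and grows as the boxes separate."""
    return 1.0 - giou(pred_box, gt_box)
```

Unlike plain IoU, this quantity still varies when the two boxes do not overlap at all, which is one reason it is often preferred as a regression loss for small targets.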

B. Two-Stage Detectors
The two-stage detectors are also called region proposal-based approaches. In this approach, detection and classification are achieved by extracting candidate regions on the feature map and applying DL to them. Some frequently used two-stage detectors for target detection through UAVs are explained in the following.

• In 2015, Ross Girshick et al. [54] presented Faster R-CNN, which is another famous target detector. R-CNN stands for "region-based convolutional neural network"; the method works well on natural images and has been applied to UAV images. It can predict the location and class of numerous bounding boxes at the same time. Its main advantage over other similar models is its high accuracy. The Faster R-CNN algorithm first introduces a region proposal network (RPN). Candidate region generation and classification are carried out within the same network, which yields strong detection results during training [19]. FPN is a module that integrates features from various levels to make the final regression and classification more effective [55].
• Cascade R-CNN, an extension of Faster R-CNN, was presented in 2018 [56]. It is a cascaded structure that consists of numerous repeated stages linked sequentially [19,57]. This algorithm is composed of three parts: (1) feature extraction unit, (2) RPN unit, and (3) multi-stage cascade detection unit. The Cascade R-CNN framework is a multi-stage extension that trains successive detectors at different IoU thresholds [56]. For the next phase of training, the bounding boxes produced by the preceding R-CNN stage are used as input. By using a sequence of specialized regressors, Cascade R-CNN achieves high-quality detection by eliminating noisy bounding boxes while retaining useful, close positive examples.
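To make the idea of training successive detectors at different IoU thresholds more concrete, the sketch below shows only the label-assignment step: the same set of proposal IoUs yields fewer positives as the threshold rises. The thresholds 0.5, 0.6, and 0.7 are values commonly reported for Cascade R-CNN and are an assumption here, since the text above does not state them; the function names are hypothetical. In a real cascade, each stage would also refine the boxes, so the IoUs seen by later stages would be higher than those shown.

```python
def assign_labels(ious, threshold):
    """Mark a proposal as positive (1) if its IoU with the ground truth
    reaches the stage threshold, otherwise negative (0)."""
    return [1 if iou >= threshold else 0 for iou in ious]


def cascade_label_schedule(ious, thresholds=(0.5, 0.6, 0.7)):
    """Return one label list per cascade stage.

    Assumption: the IoUs are held fixed across stages for illustration;
    an actual Cascade R-CNN re-computes them after each stage's regressor
    has refined the boxes.
    """
    return [assign_labels(ious, t) for t in thresholds]


# Example: four hypothetical proposals with increasing overlap quality.
proposal_ious = [0.45, 0.55, 0.65, 0.75]
stage_labels = cascade_label_schedule(proposal_ious)
```

Later stages therefore train on a smaller but cleaner set of positives, which matches the description of eliminating noisy boxes while keeping close positive examples.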

A. One-Stage Detectors in Target Detection
In this subsection, we review the useful work on one-stage detectors in resolving the issues of target detection through UAVs in traffic congestion, which is outlined in Table 2.
1. One-Stage Detectors for Small-Size Objects: In 2016, YOLO [18] performed real-time target detection by simultaneously predicting the location confidence and the probabilities of all target classes. In 2021, Sun et al. [69] adopted the YOLO approach because it is simple and easy to use. The target detection problem is solved entirely by regression. The images captured from the UAVs have high-resolution characteristics. For target recognition and detection, the VGG16 network was used as the backbone, and adaptive moment estimation (Adam) [58] was used as the optimizer during training to speed up model convergence. When training data were insufficient for the feature extraction network, they introduced transfer learning to enhance the recognition accuracy. Moreover, YOLO merges target location prediction and class prediction into a single neural network model to attain quick target recognition and detection with high reliability. In target detection, the detection rate of the YOLO network is high at 69% and the detection speed is 40 FPS, although its detection reliability is lower than that of other DL networks. Li et al. [59] proposed a global context cross (YOLO-GCC) model, which combines the design of YOLOv3 and GCNet to handle the blurring and fuzzy features of small objects. To extract several multi-dimensional feature maps, this model used DarkNet-53 [21] as a backbone. A global context attention module was added to DarkNet-53 to form a new backbone, called GC-DarkNet, which extracts more accurate and compelling features. The H-Swish activation function was used to decrease the computing cost.
In addition, an approach to intelligent traffic signal planning called Traffic Deep-Q Network (Traffic-DQN) was introduced, which is based on deep reinforcement learning and uses the traffic flow information obtained from YOLO-GCC as the basis for transportation planning. The Traffic-DQN system shows a clear benefit in convergence speed, and each evaluation indicator is better than the corresponding one in the other approaches. The experiments were performed on four well-known UAV datasets: (i) the UCAS-AOD dataset [70], (ii) the VisDrone2019 dataset [71], (iii) the TSD-MAX dataset [72], and (iv) the UA-DETRAC dataset [73], which comprise different classes such as car, bus, van, and others. The experimental results show that the proposed method clearly has the potential to identify small traffic flow elements and that it is better than the YOLOv3 algorithm. Moreover, with small targets and mixed backgrounds, the position of the bounding box is more precise, which is essential for target detection in UAV images. In 2021, Benjdira et al. [60] proposed a Traffic Analysis from UAVs (TAU) approach to detect all of the existing targets inside one scene and generated a UAV image-based dataset that is divided into five classes: pedestrian, motorcycle/bicycle, car, truck, and bus. To further track the detected targets, an online multi-object tracking approach called DeepSORT was used [74]. This approach decreases time consumption, improves safety on the highway, and somewhat reduces the computing cost. However, YOLOv3 still entails a high extraction cost. In addition, in its current form, the TAU approach has a few limitations:
• The metrics of the x and y axes are inconsistent when the pixel indicator is a multiple of the height and width of the rectifying frames.
• Owing to the high image resolution, the approach is not well suited to online processing.
It is therefore necessary to address these limitations with new generic algorithms in the future.
In 2020, Feng et al. [75] presented a design composed of four parts: (i) vehicle detection, (ii) background registration, (iii) trajectory construction and compensation, and (iv) trajectory denoising. The design aims to accurately extract the trajectories of road users, which include pedestrians, motor vehicles (MV), and non-motor vehicles (NMV) such as bicycles, tricycles, and motorcycles, and to track small target trajectories. The YOLOv3 algorithm was applied in the first part for detection accuracy and to obtain the target bounding boxes. To obtain the image motion in the second step, Shi-Tomasi corner features are used. This is an extremely popular corner detector that is used extensively, owing to its high accuracy and fast speed, in numerous real-time applications for monitoring and tracking target features [76,77]. Trajectory construction and compensation is the third step of this design and has three main phases: (i) data association, (ii) trajectory classification, and (iii) trajectory compensation. The purpose of these phases is to identify irregular vehicle trajectories on the basis of speed constraints, connect the broken trajectories, and perform compensation to restore the missing segments. Moreover, the ensemble empirical mode decomposition approach is employed in the last step to remove noise and errors from arbitrary and unbalanced signals and to enhance trajectory reliability [78]. The experimental results reveal that the presented design attains high accuracy in trajectory extraction and detection. Figure 7 shows the recall of the target classes in three test videos recorded by high-definition cameras; similarly, Figure 8 shows the precision of the target classes in the same test videos.
In 2022, Tian et al. [63] presented a YOLOv4-based approach to detect small targets, namely cars and pedestrians, on the VisDrone dataset. To enhance the performance of the presented model, the KCF algorithm and the average peak-to-correlation energy scheme were used to stabilize the model and track small targets over long distances.
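The Shi-Tomasi corner measure mentioned above for the background registration step of Feng et al. [75] can be sketched in a few lines. The score of a pixel is the smaller eigenvalue of the 2x2 structure tensor accumulated over a small window; corners score high, while flat regions and straight edges score near zero. The sketch below is a minimal pure-Python illustration under simplifying assumptions (central-difference gradients, an unweighted square window, a synthetic test image); the function names are hypothetical, and it is not the implementation used in [75].

```python
import math


def shi_tomasi_response(img, y, x, half_window=1):
    """Smaller eigenvalue of the structure tensor around pixel (y, x).

    img is a 2D list of floats; the window plus a one-pixel border for the
    gradients must lie inside the image (no bounds handling in this sketch).
    """
    sxx = syy = sxy = 0.0
    for v in range(y - half_window, y + half_window + 1):
        for u in range(x - half_window, x + half_window + 1):
            # Central-difference image gradients.
            ix = (img[v][u + 1] - img[v][u - 1]) / 2.0
            iy = (img[v + 1][u] - img[v - 1][u]) / 2.0
            sxx += ix * ix
            syy += iy * iy
            sxy += ix * iy
    # Closed-form smaller eigenvalue of [[sxx, sxy], [sxy, syy]].
    mean = (sxx + syy) / 2.0
    diff = (sxx - syy) / 2.0
    return mean - math.sqrt(diff * diff + sxy * sxy)


def make_square_image(size=10, start=5):
    """Synthetic image: a bright square occupying the lower-right quadrant."""
    return [[1.0 if (x >= start and y >= start) else 0.0 for x in range(size)]
            for y in range(size)]


image = make_square_image()
corner_score = shi_tomasi_response(image, 5, 5)  # at the square's corner
edge_score = shi_tomasi_response(image, 7, 5)    # on its vertical edge
flat_score = shi_tomasi_response(image, 2, 2)    # in the dark background
```

Taking the smaller eigenvalue means a point is only kept when the image changes in both directions, which is what makes such points stable to track between frames.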

2. One-Stage Detectors for Target Occlusion: In 2020, Luo et al. [64] presented a framework combining YOLOv3, soft non-maximum suppression (Soft-NMS), and K-means++ to handle moderately occluded targets in UAV imagery. The K-means++ method is used in the YOLOv3 algorithm to optimize the selection of the initial prior boxes and improve the AP of the network [79]. The Soft-NMS method is then applied to address the problem of standard NMS discarding overlapping boxes, which further improves the AP of the network [80]. During training, overfitting occurs on some training samples. To mitigate this issue, data augmentation was applied, comprising color jitter, random rotation, and image flipping. For validation, three generic datasets with various image characteristics were selected. The experimental results show that the upgraded YOLOv3 method achieved high accuracy and a fast detection rate. The results on the three datasets in terms of average precision (AP), precision, and recall are presented in Figure 9.
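To illustrate why Soft-NMS helps with occluded targets, the sketch below implements the Gaussian variant in plain Python: instead of deleting a box that overlaps a higher-scoring one, it decays that box's score by exp(−IoU²/σ). The parameter values (`sigma`, `score_thresh`) and the box format (x1, y1, x2, y2) are assumptions made for illustration, not settings reported in the cited work.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay, rather than delete, overlapping detections."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Pick the highest-scoring remaining detection.
        best_idx = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best_idx)
        kept.append((best_box, best_score))
        # Decay the scores of the others according to their overlap with it.
        decayed = []
        for box, score in candidates:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_thresh:
                decayed.append((box, score))
        candidates = decayed
    return kept


# Two heavily overlapping boxes (e.g., adjacent vehicles) and one isolated box.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
result = soft_nms(boxes, scores)
```

With hard NMS at a typical IoU threshold of 0.5, the second box would be removed outright; here it survives with a reduced score, so a genuinely separate but occluded target can still be reported.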

3. One-Stage Detectors for Complex Background: In [65], Feng et al. presented a gradient classification prediction branch in the head network of YOLO to produce angular data and used the circular rectify class to overcome the gradient classification loss and detect targets under complex circumstances. To augment the UAV images, they applied data augmentation consisting of rotation, random flipping, translation, and HSV augmentation. They then presented the cross-stage partial bottleneck transformer (CSP BoT) segment, a hybrid approach that combines multi-head self-attention with convolutions to capture the latent long-range spatial correlations of the targets in the UAV images and to strengthen critical information. Finally, they combined features at various resolutions and predicted the spatial disparity in ambiguity through weighted cross-scale connections. An adaptive spatial feature fusion head (ASFF-Head) block was also presented. Extensive experimental results on the UCAS-AOD and UAV-ROD datasets show the presented model's superiority, low design complexity, and cost-effectiveness.

4. One-Stage Detectors for Joint Issues: Sun et al. [66] proposed a YOLOv4 approach based on K-means clustering to realize a multi-resolution detection scheme. The drone collected data from a low-altitude aerial viewpoint, covering a variety of heights, sizes, and positions. They also applied data enhancement to improve the robustness of the target model. This approach has two main parts: color transformation, such as brightness, hue, and contrast changes, and geometric transformation, such as rotation, random cropping, flipping, and splicing. The Darknet framework is used for training. Mean average precision (mAP), recall, and precision are the evaluation metrics. In the experiments, the target model's accuracy reached up to 95%, while on the same dataset the accuracy reached up to 96% in cloudy weather. The detection of dark targets, for example black vehicles, still needs to be improved. The approach performs best on the occlusion issue, but it still shows some errors when large regions are occluded, and these need to be minimized. In 2021, Tan et al. [67] presented a model named YOLOv4_Drone, which applies the YOLOv4 algorithm to UAV images. The detection accuracy of the standalone YOLOv4 algorithm is relatively low and error-prone. Therefore, to improve the extraction of small targets, the receptive field block (RFB) segment [81] is included in the feature extraction phase of YOLOv4 to process the feature map and extract features at various scales. The detection model with the RFB segment added is termed YOLOv4_r. This approach avoids duplicated gradient information during network optimization, providing high accuracy while decreasing the computational complexity.
To address small targets and complex backgrounds in UAV images, the ultra-lightweight subspace attention mechanism (ULSAM) has been incorporated into the YOLOv4_r model [82]. This segment derives a separate attention map for each feature-map subspace to represent multi-scale features. The detection model with the ULSAM segment attached is called YOLOv4_u. Moreover, the soft non-maximum suppression (Soft-NMS) approach is used to reduce the missed targets caused by occlusion [80]. Conventional NMS deletes the lower-scoring box of two nearby targets when their overlap is large, which causes such misses, whereas Soft-NMS lowers the score of the overlapping box instead. Additionally, it markedly decreases the number of detection boxes. The VisDrone dataset is used in the experimental testing. The dataset was collected in 14 areas of China under various lighting and weather conditions and comprises 10 kinds of targets, as outlined in Figure 10. mAP is used as the evaluation metric. The testing shows that the YOLOv4_Drone target detection approach achieved an accuracy of 45.67% across all weather conditions, higher than previous target detection approaches such as RetinaNet (35.95%), SMPNet (35.98%), and DPNetV3 (37.37%). The standalone YOLOv4 reached 40.99%, about 5 percentage points less than the YOLOv4_Drone approach. Still, the presented approach is not sufficiently stable owing to its moderate runtime compared with other target detection approaches. In 2022, Luo et al. [68] presented the YOLO-DRONE (YOLOD) model for UAV images, which builds on YOLOv4 to handle small-size objects and cluttered backgrounds. To decrease the complexity of the model and obtain the best detection results, the Mish [83] and HardSwish [84] activation functions were used in the backbone.
To improve localization and accelerate convergence, they adopted the EIoU loss function for bounding-box regression [85]. Moreover, they used the pyramid pooling module in place of the SPP segment and compared the model with the YOLOv4 algorithm. The pyramid pooling module is intended to enlarge the receptive field and improve the performance of the detection model [86]. To strengthen multiscale feature fusion and detect targets at various scales, an adaptive spatial feature fusion segment was introduced at the end of the model [87]. The testing used three datasets, namely the forklift, VEDAI, and PASCAL VOC datasets, where the forklift dataset is the first known dataset deployed on UAV images. The investigation showed that the presented YOLOD model achieves higher accuracy than YOLOv4 on all datasets and copes well with complex backgrounds and small targets, as shown in Figure 11. However, the performance of the model still needs to be improved as the number of images grows.

Figure 11. mAP of the target model based on three datasets.
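For reference, the EIoU bounding-box regression loss is usually written as L = 1 − IoU + ρ²(b, b_gt)/c² + ρ²(w, w_gt)/C_w² + ρ²(h, h_gt)/C_h², where c, C_w, and C_h are the diagonal, width, and height of the smallest box enclosing the prediction and the ground truth. The sketch below follows that commonly cited formulation for a single pair of boxes in (x1, y1, x2, y2) format; whether YOLOD uses exactly this form, and the `eps` stabilizer, are assumptions of this example.

```python
def eiou_loss(pred, target, eps=1e-9):
    """EIoU loss for one predicted and one ground-truth box (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # IoU term.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Smallest enclosing box: width, height, and squared diagonal.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    diag2 = cw ** 2 + ch ** 2

    # Squared distance between the two box centers.
    center_dist2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 + (
        (py1 + py2) / 2 - (ty1 + ty2) / 2
    ) ** 2

    return (
        1.0
        - iou
        + center_dist2 / (diag2 + eps)        # center-distance penalty
        + (pw - tw) ** 2 / (cw ** 2 + eps)    # width-difference penalty
        + (ph - th) ** 2 / (ch ** 2 + eps)    # height-difference penalty
    )


gt = (0.0, 0.0, 10.0, 10.0)
loss_same = eiou_loss(gt, gt)
loss_shifted = eiou_loss((2.0, 0.0, 12.0, 10.0), gt)
loss_disjoint = eiou_loss((20.0, 20.0, 30.0, 30.0), gt)
```

Unlike a plain IoU loss, the extra terms keep providing a useful signal when the boxes do not overlap at all, which is relevant for small targets whose predicted boxes often miss the ground truth entirely.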

B. Two-Stage Detectors in Target Detection
This subsection presents related work on the adoption of two-stage detectors for resolving target detection issues with UAVs in traffic congestion. The main work is summarized in Table 3.

1. Two-Stage Detectors for Small-Size Objects: Zhu et al. [88] presented a Faster R-CNN-based approach to detect small-size targets when the number of matching anchors is insufficient or when the targets are adjacent or slightly overlapped. This is because an insufficient number of matching anchors increases the computational complexity of the network. The proposed model is therefore divided into five phases:
(a) ResNet101 is used as the feature extractor to mitigate the vanishing-gradient problem while increasing the network depth [97].
(b) The RPN is used to produce anchors of different sizes and shapes.
(c) A 1 × 1 convolution layer is used to achieve feature interconnection and cross-channel integration and to reduce the number of feature-map channels.
(d) The RoIAlign algorithm is used to prevent the loss of margin pixels. These pixels help to track and differentiate between small targets and adjacent or slightly overlapped targets. It also corrects the misalignment introduced by RoIPool.
(e) A classification and regression network is used to analyze the targets and environmental situations in the image domain and to refine the target bounding boxes.
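The RoIAlign step can be illustrated with a small sketch. RoIPool rounds region boundaries to integer feature-map cells, which discards sub-pixel information that matters for small targets; RoIAlign keeps fractional coordinates and reads the feature map by bilinear interpolation. The code below is a simplified illustration that takes a single sample at the center of each output bin and treats integer coordinates as cell centers; production implementations typically average several samples per bin, and none of the names here come from the cited work.

```python
import math


def bilinear_sample(feature, x, y):
    """Interpolate a 2D feature map (list of rows) at fractional (x, y)."""
    h, w = len(feature), len(feature[0])
    # Clamp to the valid range so border samples stay defined.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2):
    """Simplified RoIAlign: one bilinear sample at the center of each bin."""
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    return [
        [
            bilinear_sample(feature, x1 + (j + 0.5) * bin_w, y1 + (i + 0.5) * bin_h)
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


# A 4 x 4 feature map whose value is 4 * y + x, so exact answers are easy to check.
feature = [[4 * y + x for x in range(4)] for y in range(4)]
pooled = roi_align(feature, (0.25, 0.25, 2.25, 2.25))
```

Because the region keeps its fractional boundaries, shifting it by a quarter of a cell changes the pooled values accordingly, whereas a rounded RoIPool region would return the same output for both positions.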
Experimental evaluation was performed on the COCO dataset, where horizontal image flipping was used as augmentation, and the model achieved an accuracy of 79.77%. To detect small-size objects under different weather conditions in complex scenes such as highways and roads, in 2022, Cheng et al. [89] presented GC-Faster-RCNN, where GC stands for "group convolution", obtained by improving the Faster R-CNN algorithm in several respects. These include a clustering approach to analyze the datasets and the use of the ResNeXt50 network in place of the original feature extractor. Moreover, to enrich the feature information passed into the network, an improved channel attention mechanism is integrated. The testing showed that the detection accuracy increased slightly, to 94.8%, while the detection speed was slow. In addition, very deep network structures are not well suited to small-target detection. Further, this approach increases the computing cost when a large number of categorized datasets are used.
To handle the multi-scale issue in target detection from UAV images, in 2021, Lin et al. [91] presented the ECascade-RCNN target detection framework, an improved version of Cascade-RCNN. Trident-FPN is employed as the backbone to extract features, and a new attention mechanism is introduced to improve the performance of the detectors. In addition, the K-means technique is used to create anchors, which refines the model's detections and improves regression precision. Testing on the VisDrone dataset showed that the model achieved better accuracy on both large and small objects.
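Anchor creation with K-means, as mentioned above, is often done by clustering the (width, height) pairs of the training boxes with 1 − IoU as the distance, so that large boxes do not dominate the clustering the way they would under Euclidean distance. The sketch below shows that common recipe in plain Python; the distance choice, the random initialization, and the sample box sizes are assumptions of this example, not details reported for ECascade-RCNN.

```python
import random


def wh_iou(a, b):
    """IoU of two (width, height) pairs aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(sizes, k, iters=50, seed=0):
    """Cluster box sizes into k anchors using IoU as the similarity measure."""
    rng = random.Random(seed)
    centers = rng.sample(sizes, k)
    for _ in range(iters):
        # Assign every box to the center it overlaps most (smallest 1 - IoU).
        clusters = [[] for _ in range(k)]
        for wh in sizes:
            best = max(range(k), key=lambda i: wh_iou(wh, centers[i]))
            clusters[best].append(wh)
        # Move each center to the mean width and height of its members.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centers.append((mean_w, mean_h))
            else:
                new_centers.append(center)  # Keep an empty cluster's center.
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers, key=lambda wh: wh[0] * wh[1])


# Hypothetical box sizes in pixels: three small targets and three large ones.
sizes = [(10, 12), (11, 13), (12, 11), (50, 60), (55, 58), (52, 62)]
anchors = kmeans_anchors(sizes, k=2)
```

On UAV datasets, where most targets are small, anchors derived this way tend to be much smaller than the defaults tuned for ground-level imagery, which is one reason the technique helps regression precision.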

2. Two-Stage Detectors for Target Occlusion: Occlusion causes missed detections, so it is necessary to reduce this issue. In 2020, Wang et al. [92] proposed a Faster R-CNN-based algorithm to handle it. The presented approach uses different anchor combinations to choose the best anchor number and scale, which improves the network's perceptual capability. To improve it further, they added multi-layer feature fusion. The experimental testing was performed on UAV image datasets, with AP, precision, and recall as the evaluation indicators. The results for these indicators are shown in Figure 12. However, the variety of scenarios covered by this approach is limited, and changing the focus of the UAV may cause misidentification.
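Since precision, recall, and AP recur as evaluation indicators throughout this section, a short sketch of how they are typically computed may be useful. It assumes that each detection has already been matched against the ground truth (for example, with an IoU threshold) and labeled as a true or false positive; AP is computed here as the area under the precision-recall curve with all-point interpolation, which is one of several conventions in use. The sample numbers are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(detections, num_gt):
    """AP for one class.

    detections: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth objects of this class.
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) after each detection, by descending score
    for _, is_tp in ordered:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_gt, tp / (tp + fp)))

    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        # Interpolated precision: best precision at this recall or higher.
        best_precision = max(p for _, p in points[i:])
        ap += (recall - prev_recall) * best_precision
        prev_recall = recall
    return ap


p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```

The mAP values quoted elsewhere in this review are the mean of such per-class AP values over all target classes.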

3. Two-Stage Detectors for Complex Background: Liu et al. [93] presented a framework composed of four different methods: (i) Faster R-CNN, (ii) YOLOv3, (iii) histogram of oriented gradients (HOG) + support vector machine (SVM), which is a machine learning approach [98], and (iv) the visual background extractor (ViBe) method for background subtraction. ViBe is a well-known stochastic approach for estimating the background in video sequences [99]. The proposed framework handles complicated backgrounds under various factors, for instance, wind, dozens of small and large moving targets, unrestricted speed, and large scenes. The analysis shows that the ViBe and HOG+SVM approaches do not yield satisfactory results because of their limited ability to perceive coherent features and trajectories, while YOLOv3 and Faster R-CNN perform best. Compared with YOLOv3, Faster R-CNN yields better recall and precision. However, Faster R-CNN has a serious drawback in its heavy hardware demand when deployed in real-world settings.

4. Two-Stage Detectors for Joint Issues: In 2021, Avola et al. [94] presented MS-Faster R-CNN, where MS stands for multi-stream, which has three stages. First, for a given frame, the multi-stream CNN extracts features of the target at multiple scales by exploiting its architecture. Second, following the Faster R-CNN method, the bounding boxes around the target are obtained from the extracted feature maps, with the multi-stream CNN serving as the backbone so that the region-of-interest pooling layer and the region proposal network can turn the classifier output into the required bounding boxes. Finally, once the targets are detected accurately by MS-Faster R-CNN within the image sequence, the DeepSORT [17] technique is used to achieve real-time tracking from the UAV's perspective.
The evaluation was performed on four datasets: (i) UAV20L [100], (ii) UMCD [101], (iii) UAV123 [100], and (iv) UAVDT [102]. The data recorded from the UAVs cover different conditions such as weather situations, small-size objects, lighting variations, partial and full occlusion, background clutter, and low resolution. The model shows a slight improvement, but the design is still not fast and the detection speed is limited. In 2022, Huang et al. [95] presented an improved approach based on the Cascade R-CNN network, in which a superclass detection framework is developed to supply more precise regions of interest to the subsequent detector and improve recognition of the corresponding classes. With the help of the regression confidence, the final confidence better reflects the quality of the detection result and effectively improves localization accuracy. At the same time, the loss function is used to improve the detection of small-size targets in complex backgrounds with heavy target occlusion and to reduce false alarms. Moreover, this approach aims to improve the accuracy and speed of target detection. The testing was performed on the VisDrone dataset, which consists of 10 kinds of target classes, as shown in Figure 13.

Discussion and Future Research Trends
Current studies show that researchers are paying considerable attention to DL approaches for addressing problems in UAV-based target detection, for instance, small-size objects, target occlusion, complex backgrounds, and joint issues. However, open questions remain in overcoming these problems. This section reflects the current state of DL for target detection and indicates future research directions.
A. Discussion
Figure 14 shows the percentage breakdown of the DL approaches listed in Figure 6 for solving target detection problems. Figure 15 describes all problems, DL approaches, and the number of articles related to each problem in target detection. In addition, we summarize the major research findings obtained from Section 4.

1. One-Stage Detectors: As presented in Figure 14, one-stage detectors were used in 55% of the papers, more than any other DL approach. Figure 16 shows that small-size objects are the problem most frequently tackled by one-stage detectors and that YOLOv3 is the most commonly used one-stage detector in target detection. Figure 16 also presents the classification of one-stage detectors by target detection problem.

2. Two-Stage Detectors: Figure 14 shows that 45% of the papers used two-stage detectors to tackle target detection problems. Two-stage detectors have been most commonly employed for small-size objects and joint issues, as shown in Figure 17, which also presents the classification of two-stage detectors by target detection problem.

B. Remarks
Based on the preceding discussion, we offer further details on which DL approaches suit which types of problems:
• In Faster R-CNN, information about small targets gradually disappears as the network deepens. Very deep network structures are therefore not well suited to small-target detection.
• Two-stage detectors are slower than one-stage detectors, which is their major drawback in target detection. YOLOv3, on the other hand, achieved high accuracy and a fast detection rate compared with other YOLO versions for small-size objects, but it still has high computational complexity in real-world use.
• One-stage detectors perform best on the occlusion issue, although the models still make some errors when large regions are occluded.
• In the YOLO algorithm [18], groups of small targets are hard to handle because the model generalizes poorly, and loss function issues easily cause noticeable localization errors.
• Faster R-CNN outperforms YOLOv3 on complex backgrounds, but its heavy hardware requirement is a serious issue. How to run Faster R-CNN on moderate hardware in particular complex situations therefore remains an open issue.
• Overall, achieving real-time detection is a crucial challenge for two-stage detectors because, although their accuracy is high, their computational cost is still large.
• For joint issues, the YOLOv4 algorithm is more accurate than other approaches, but its speed is moderately reduced.

C. Future Research Trends
• UAV vision-based approaches to target detection are highly effective in contrast to traditional approaches. Recognition and detection encryption are important parts of UAV vision-based target detection, so future work should identify more appropriate recognition and detection encryption. It is also necessary to improve the accuracy of UAV vision-based models to obtain the desired outcome and overcome the existing conflicts.
• To reduce the number of false alarms in tiny-target detection for railway surveillance, various visual sensors and multi-perspective stereo vision should be deployed with the help of multi-sensor collaboration.
• In terms of power consumption, efficient approaches and advanced task offloading methods are needed, together with lightweight batteries that can notably increase the flight duration of UAVs, sustain longer distances, and decrease the total power consumed in carrying out predefined functions.
• The proposed algorithms should be optimized with laser data, depth maps, and similar sources to capture more useful image features and detect small and dense obstacles.
• The YOLOv4_Drone approach yields 5% to 15% higher detection accuracy than other models such as CenterNet, which has 29.85% detection accuracy, but its real-time implementation and speed need to be improved in the future.
• In the TAU approach, the bounding box associated with the pedestrian category is extracted from the final measurement, and the current design has some limitations. In the future, the current YOLOv3 design should be replaced with the newer YOLOv4, and the online DeepSORT target tracker with a multi-object tracker, to solve these issues efficiently.
• In aerial photography, developing an extensive and versatile dataset is a major challenge, so high-quality datasets must be gathered to ease it. In addition, researchers need to develop effective and highly automated approaches to label the training data.
• One-stage and two-stage detectors should be merged to achieve the best outcome because each has its own benefits. For instance, one-stage algorithms are fast, while two-stage algorithms offer strong accuracy.
• Compared with YOLO-based algorithms, the detection speed of Faster R-CNN still needs to be improved in upcoming research.
• To train a high-quality framework, small high-quality datasets will need to be expanded into large-scale ones.
• To cover every aspect of the use of UAVs in target detection, we plan to extend the study, include more work, and evaluate the latest approaches in the future.
• Other YOLO versions also exist, such as YOLOv5, which was introduced in May 2020, two months after YOLOv4. Recently, a few papers based on YOLOv5 have been published on custom datasets because of its memory efficiency during training. Meanwhile, the other versions, YOLOv6, YOLOv7, and YOLOv8, are still being improved in terms of training speed and computational cost. Therefore, no paper has been published yet based on those versions. In the coming years, our focus will be on these versions to handle different issues.

Conclusions
In this study, we discussed the application of DL approaches to four crucial problems in target detection through UAVs: small-size objects, target occlusion, complex background, and joint issues. We introduced the basic ideas of DL together with the key problems and metrics and then concentrated on two groups of DL approaches used in target detection: one-stage detectors and two-stage detectors. Subsequently, different methods built on DL approaches were examined in the context of UAV-based target detection. Further, data on the current status of DL for target detection were presented on the basis of the studies gathered in this review article. We observed that one-stage detectors were widely used in target detection because of their fast detection rate, while two-stage detectors were used less extensively because of their low detection rate. Finally, we outlined notable challenges and upcoming research directions for DL and target detection.

Acknowledgments: The authors would like to acknowledge the support of Prince Sultan University for paying the Article Processing Charges (APC) of this publication.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: