1. Introduction
The past decade has seen intensive development of unmanned aerial vehicles (UAVs) and drones [1,2]. They have found numerous civil and military applications, including urban planning, precision agriculture, battlefield monitoring, and so on [3,4,5]. Modern UAVs and drones are usually equipped with one or several sensors able to acquire images or video of the sensed territories [6]. Optical sensors are the most popular and widespread ones, providing information in a commonly perceived way. Object localization and classification are typical operations in analyzing such images [7,8,9]. Convolutional neural networks (CNNs) are the typical tools applied for solving these tasks; according to [7], the existing CNNs can be divided into two groups, two-stage [10,11] and one-stage [12] CNNs, where each group has its own advantages and drawbacks.
The quality of images acquired by UAV-based sensors is not perfect [13,14,15]. Blur, adverse weather conditions, noise, and other factors can significantly reduce image quality and degrade the performance of object localization and classification [16,17,18]. In particular, noise can arise due to low-light conditions and the low quality of cheap cameras installed on board drones. The negative influence of noise on object localization and classification has been clearly demonstrated in [18]. Note that the sensitivity of different CNNs to noise is not the same; for example, SSD Lite [12], Faster R-CNN [11], and RetinaNet [19] are quite sensitive.
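The degradation discussed here, additive white Gaussian noise (AWGN) of a given standard deviation (STD), and the PSNR used to quantify it, can be reproduced with a few lines of NumPy (a minimal sketch; the function names are ours):

```python
import numpy as np

def add_awgn(image, std, seed=0):
    """Corrupt an 8-bit image with additive white Gaussian noise of a given STD."""
    rng = np.random.default_rng(seed)
    noisy = image.astype(np.float64) + rng.normal(0.0, std, image.shape)
    return np.clip(noisy, 0.0, 255.0)  # keep values in the valid 8-bit range

def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((reference.astype(np.float64) - distorted) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# A mid-gray test frame: for STD = 10 the measured PSNR is close to the
# theoretical 20*log10(255/10) ≈ 28.1 dB, since clipping is negligible here.
frame = np.full((256, 256, 3), 128.0)
noisy = add_awgn(frame, std=10)
```

For real camera noise the clipping at 0 and 255 is not negligible in dark or saturated regions, which is one reason measured PSNR can deviate from the theoretical value.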
Image/video denoising can be helpful for improving the performance of UAV-based imaging systems [20,21,22]. Wang et al. [20] proposed to apply noise reduction using Improved Generative Adversarial Networks; they demonstrated that image quality becomes better in terms of the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) metrics, which is favorable for image matching based on local features. Jin et al. [17] have shown that their denoising method based on a Guided Pixel Aggregation Network is able to significantly improve maritime image quality and can be useful for maritime target detection. A wavelet-based denoising technique has been proposed by Niu et al. and successfully tested on UAV images [20]. A rather complex denoising technique has been designed by Lu et al. [21] and applied to gas and oil pipeline inspection.
In addition, numerous filters have been proposed for other color imaging and remote sensing applications. Most of them are based either on the non-local processing principle and orthogonal transforms [23,24,25] or on trained neural networks (see [26,27] and references therein). The use of denoising in remote sensing (RS) data processing has some peculiar features. Alongside characterizing denoising efficiency in terms of standard metrics such as the mean square error, PSNR, and SSIM, it is also necessary to consider metrics (criteria) characterizing the final goal of RS data use, such as classification accuracy [28,29], target detection reliability [30], segmentation characteristics [31], and so on, depending on the application. In particular, the authors of [28,29] have shown that efficient pre-filtering is able to considerably improve RS data classification. Good denoising is able to improve target detection [30] and image segmentation [31] characteristics. Although it is intuitively clear that more efficient filtering, on average, results in better performance of image processing operations at further stages, no strict dependence has been established yet. There are classes for which texture feature preservation is important. Most misclassifications are observed in the neighborhood of edges and fine details [28], so edge/detail preservation [32], associated with visual quality [33], is also important. Note that SSIM [34] is obviously not the best visual quality metric [35].
Concerning object localization and classification, special criteria are usually employed, including the Intersection over Union (IoU) [36] and the F1 score [37]. Our motivation is to examine how noise intensity influences performance according to these criteria. Moreover, image filtering efficiency should also be assessed in terms of these criteria; this is one novel aspect of our research. As possible filtering approaches, we consider the block-matching 3D (BM3D) filter [24,25], one of the best representatives of non-local transform-based techniques, and the DRUNet neural network (NN)-based filter [38] as a useful NN denoiser. Their performance is compared in terms of several performance criteria, which constitutes another novelty of this paper. Eleven modern CNNs that can be applied to the considered task are studied; their performance, including computational efficiency, is analyzed and compared. This is the third novel aspect. We pay attention not only to detection characteristics for all types of objects but also, especially, to the localization and classification of small-sized objects. This is another specific feature and novel aspect of our paper.
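As an illustration of the criteria just named, IoU for two axis-aligned boxes and F1 from true/false positive and false negative counts can be computed as follows (a self-contained sketch; the corner-box convention (x1, y1, x2, y2) is our assumption for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

# Two 10x10 boxes overlapping in a 5x5 region: IoU = 25 / (100 + 100 - 25).
example_iou = iou((0, 0, 10, 10), (5, 5, 15, 15))
```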
The paper is organized as follows. First, image and noise models are introduced, some aspects of CNN training are discussed, and quantitative criteria of localization and classification accuracy are considered. Then, Section 3 deals with the experiment description, whilst Section 4 is devoted to the analysis of the obtained results. Computational aspects are discussed in Section 5. Finally, the conclusions follow.
3. Experiment Description
3.1. Used Dataset
To carry out experiments, we need a dataset with the following properties:
- (1) It should contain images (image fragments) typical of UAV-based imaging.
- (2) The objects in this dataset have to belong to several typical classes of interest.
- (3) These objects have to be of different sizes (to provide an opportunity to analyze the influence of this feature on object localization and classification); they have to be annotated in advance to ensure easy training and the determination of quantitative performance criteria.
- (4) The number of objects of each class should be large enough to provide appropriate training and verification for obtaining reliable statistics of the applied performance indicators.
- (5) The objects have to be placed on a quite complex and diverse background to correspond to possible practical situations.
One good candidate for solving our tasks is the VisDrone dataset [46]. This dataset can be used for localization and classification tasks, as well as for object tracking. The images were captured in different regions, with different traffic and people densities, as well as in various environments and shooting conditions. In total, the dataset consists of 263 videos and 10,209 images that do not overlap with each other; the total number of frames, including video frames and still images, is 179,264. Considering that the dataset contains 10,209 images with low redundancy, in our opinion, it is a good option for our research task.
The images in the dataset were annotated into regions divided into 10 classes: “pedestrian”, “person”, “bicycle”, “car”, “van”, “truck”, “tricycle”, “tricycle with awning”, “bus”, and “motorcycle”. The distribution of classes in the dataset is quite uneven. In general, the classes can be combined into several more abstract categories: “person”, “car”, “truck”, “bus”, “tricycle”, and “bicycle”.
In general, the dataset contains almost 340 thousand annotated regions, which is sufficient for high-quality training of neural networks. These regions are annotated on images of different types, captured under different weather conditions, different camera inclinations relative to the earth’s surface, and other varying factors. The markup was performed for all objects in the images, including small-sized objects. An example of the markup is given in Figure 2.
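To give an idea of how such markup is stored, the sketch below parses one detection-annotation line in the comma-separated form distributed with VisDrone (eight fields: left, top, width, height, score, category index, truncation, occlusion). The field order and the class-index mapping are stated here to the best of our knowledge and should be checked against the official dataset toolkit:

```python
# Category indices as used in the VisDrone detection annotations (index 0
# marks "ignored" regions); names follow the class list given above.
CLASS_NAMES = {1: "pedestrian", 2: "person", 3: "bicycle", 4: "car", 5: "van",
               6: "truck", 7: "tricycle", 8: "tricycle with awning", 9: "bus",
               10: "motorcycle"}

def parse_annotation(line):
    """Parse one annotation line into a corner-form box and a class name."""
    left, top, width, height, score, category, trunc, occ = (
        int(v) for v in line.strip().strip(",").split(",")[:8])
    return {"box": (left, top, left + width, top + height),
            "class": CLASS_NAMES.get(category, "ignored/other"),
            "score": score}

region = parse_annotation("684,8,273,116,0,1,0,0")  # a hypothetical line
```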
The main factors in choosing VisDrone are its size and the quality of its annotations. Another important factor that influenced the choice is the expanded number of classes and the possibility of combining them into more abstract categories. The variety of shooting parameters and the large size of the images are also important advantages of this dataset.
The VisDrone dataset has been used in our previous studies [47,48]. Paper [48] gives details on CNN training, whilst paper [47] presents data showing that objects smaller than 150–200 pixels are localized and classified worse than larger objects.
To improve the quality and stability of the obtained results, the AU-AIR dataset [41] was additionally used to test the performance of the neural networks under various conditions. The structure of this dataset is quite similar to that of the training dataset and therefore allows its use without additional processing. This approach allows for a better determination of the accuracy of the neural networks, prevents memorization, and increases the amount of data used to calculate the accuracy metrics of the methods.
3.2. Preliminary Results
All CNNs mentioned in Section 2.2 have been trained on noise-free images (12,900 images) and then applied to the images in the verification set (3200 images), both noise-free and corrupted by noise. Recall that the number of images in the verification set here is significantly larger than for the data presented in [18].
Let us start with the analysis of F1. The results are presented in Figure 3. The analysis shows the following:
- (1) For almost all types of CNNs, a larger STD (more intensive noise) leads to F1 reduction; for STD ≤ 10, the reduction is not observed or is negligible.
- (2) This reduction differs between networks, i.e., some CNNs are more robust with respect to noise; for example, Faster R-CNN (ResNet50) is the most robust.
- (3) For some CNNs, F1 is reduced by almost 0.1 for STD = 25 (e.g., for SSD MobileNetV2), i.e., such a CNN performs well for almost noise-free images but is certainly not the best choice for intensive noise.
IoU is an important performance indicator (criterion); the corresponding data are presented in Figure 4. Surprisingly, the IoU values are practically the same for all considered values of σ, i.e., noise does not essentially influence the accuracy of localization. According to this criterion, the results are practically at the same level for all considered CNNs.
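For completeness, the way detection-level counts are typically obtained from IoU is sketched below: each prediction is greedily matched to the best still-unmatched ground-truth box and counted as a true positive when the IoU reaches a threshold (0.5 here). This is a common convention, not necessarily the exact matching rule of every evaluation toolkit:

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def match_detections(gt_boxes, pred_boxes, iou_thr=0.5):
    """Greedy one-to-one matching; returns (TP, FP, FN) counts."""
    matched, tp = set(), 0
    for pred in pred_boxes:
        best_iou, best_idx = 0.0, None
        for i, gt in enumerate(gt_boxes):
            if i not in matched and box_iou(pred, gt) > best_iou:
                best_iou, best_idx = box_iou(pred, gt), i
        if best_idx is not None and best_iou >= iou_thr:
            matched.add(best_idx)
            tp += 1
    return tp, len(pred_boxes) - tp, len(gt_boxes) - tp

# One hit, one spurious detection, one missed ground-truth object:
counts = match_detections([(0, 0, 10, 10), (20, 20, 30, 30)],
                          [(1, 1, 10, 10), (100, 100, 110, 110)])
```

Note that under such a scheme the IoU statistic is averaged only over matched (detected) boxes, which helps explain why mean IoU can stay nearly constant while F1 drops: missed objects reduce F1 but simply leave the IoU average.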
Finally, Figure 5 presents data for PoCPR. As seen, for many CNNs, performance according to this criterion does not depend on noise intensity; only for YOLOv5m is the performance reduction essential.
Thus, denoising seems expedient if the noise STD > 5, i.e., if the noise is visible. Moreover, denoising is expedient if (1) it improves F1; (2) it does not make other performance characteristics worse; and (3) it is fast enough and does not require too many resources.
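The visibility threshold quoted here in terms of STD maps directly to input PSNR: for 8-bit data corrupted by AWGN with standard deviation σ, PSNR = 20·log10(255/σ). A one-line helper makes the conversion explicit:

```python
import math

def awgn_psnr(std, peak=255.0):
    """Theoretical input PSNR (dB) of an 8-bit image under AWGN with the given STD."""
    return 20.0 * math.log10(peak / std)

# STD = 5  -> ~34.2 dB (noise barely visible)
# STD = 8  -> ~30.1 dB
# STD = 25 -> ~20.2 dB (noise clearly visible)
```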
One can wonder why IoU remains practically the same (Figure 4) for different STDs whilst the CNN performance becomes worse according to other criteria. To understand this, we have carried out special experiments, examining many particular images and the results of their processing. Figure 6 shows one particular case. An annotation sample for the original (noise-free) image is presented in Figure 6a. There are three objects, all of the same class, “car”. As one can see, the red rectangles obtained by the markup of the considered dataset are considerably larger than the green rectangles generated by the SSD (VGG16) CNN.
This takes place for many other images, although there are also images (e.g., Figure 2) where objects and their positions are marked more carefully and accurately. In our opinion, this is the main reason why IoU is only slightly sensitive to noise. Figure 6b demonstrates an example where one object (the leftmost car) is not detected due to noise.
Figure 7 shows the image processing results for another CNN, YOLOv5m. Again, the green rectangles are considerably smaller than the corresponding red ones, and this influences IoU. In this case, the leftmost object marked as “car” is classified as two objects (“truck” and “car”, see Figure 7a,c). Due to intensive noise, the leftmost object is not correctly detected (Figure 7b).
5. Computational Complexity and Discussion
Each of the studied neural networks has its own positive features. For example, RetinaNet provides the best metrics for the percentage of correctly predicted regions, while YOLO has high classification accuracy. In general, the choice of a particular neural network reduces to a trade-off between accuracy and speed (or computational load). For the neural networks under consideration, parameters were calculated to characterize the computational load, namely the number of floating-point operations (FLOPs) and the number of parameters (Params). The results are shown in Table 5. Based on these indicators, the SSD MobileNetV2 and YOLO neural networks have the lowest computational load, while the Faster R-CNN networks require the most resources.
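Params and FLOPs figures of this kind are obtained with standard layer-wise counting; for a single convolutional layer the closed-form expressions are shown below (a sketch of the counting rule, not the exact profiler used here; FLOPs counts one multiply and one add per multiply-accumulate):

```python
def conv2d_params(c_in, c_out, kernel, bias=True):
    """Learnable parameters of a 2-D convolution layer."""
    return (kernel * kernel * c_in + (1 if bias else 0)) * c_out

def conv2d_flops(c_in, c_out, kernel, h_out, w_out):
    """Floating-point operations for one forward pass of the layer
    (2 ops per multiply-accumulate; stride/padding are folded into h_out, w_out)."""
    return 2 * kernel * kernel * c_in * c_out * h_out * w_out

# A typical first layer, 3 -> 64 channels with a 3x3 kernel:
n_params = conv2d_params(3, 64, 3)  # (3*3*3 + 1) * 64 = 1792
```

Summing such per-layer counts over a whole network reproduces the orders of magnitude by which lightweight backbones (e.g., MobileNetV2) and heavy two-stage detectors differ.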
The filtering methods used in this work also affect the processing time of a single image and the computational load of the pipeline as a whole. To determine this impact, the time consumption of each of the methods was evaluated on an image of size 1920 × 1080 pixels. The time measurements were performed on the same device with an Intel Core i9-10900 processor. Considering that DRUNet is a neural network, the number of floating-point operations (FLOPs) and the number of parameters were also determined for it. The results are presented in Table 6. The analysis shows that DRUNet is faster than BM3D, although its image processing time is still fairly long.
To study the impact of filtering on the total processing time of localization and classification, the image processing time was measured from image loading until the localization result was obtained. The experiments were conducted on the available devices, so a GPU (NVIDIA GeForce RTX 4060 Ti) was used to run DRUNet, as was the case for all neural networks. An Intel Core i9-10900 was used to run the BM3D filtering algorithm, as it was not possible to run this algorithm directly on the GPU. The results are presented in Table 7. It is noticeable that the processing time of one image using BM3D is significantly longer than with DRUNet. Comparing the values obtained with and without filtering, it can be concluded that the filtering algorithms significantly slow down the overall processing (increase time expenses).
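Per-image timings of this kind can be reproduced with simple wall-clock measurement; the sketch below times an arbitrary filtering callable on a Full HD frame (the placeholder filter is ours — a real BM3D or DRUNet call would be substituted):

```python
import time
import numpy as np

def time_filter(filter_fn, image, runs=3):
    """Average wall-clock time (seconds) of one filtering call after a warm-up."""
    filter_fn(image)  # warm-up run (caches, JIT compilation, GPU initialization)
    start = time.perf_counter()
    for _ in range(runs):
        filter_fn(image)
    return (time.perf_counter() - start) / runs

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)       # 1920 x 1080 test frame
identity = lambda img: img.astype(np.float64) / 255.0   # placeholder "filter"
seconds = time_filter(identity, frame)
```

Averaging over several runs after a warm-up matters especially for GPU-based denoisers, whose first call includes one-time initialization costs.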
Considering the obtained accuracy estimates of the neural networks when processing noisy images as well as images after filtering, and taking into account the processing speed of each filtering method, it can be concluded that the DRUNet neural network is the preferable method for improving image quality. Noise pre-filtering to improve localization and classification accuracy is effective and expedient for STD of about 10 and larger. On the one hand, denoising can be accelerated both on board and on land; on the other hand, the use of denoising seems more reasonable for on-land processing, where computational facilities are usually considerably better than on board.
Recall that in all cases (for original images, images with different intensities of AWGN, and filtered images), the CNNs trained on noise-free images have been applied. Ref. [51] shows that the use of noisy and/or filtered images in classifier training can improve classification results. It can therefore be expected that similar effects might take place for the considered task of object localization and classification. In other words, training the CNNs with AWGN-augmented images can help improve performance, especially for on-board processing.
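A training-time augmentation of the kind suggested here would simply inject AWGN with a randomly drawn STD into each training image; a minimal sketch (the STD range is our assumption, chosen to match the noise levels studied above):

```python
import numpy as np

def awgn_augment(image, std_range=(0.0, 25.0), rng=None):
    """Return a copy of an 8-bit image corrupted by AWGN with a random STD."""
    rng = rng or np.random.default_rng()
    std = rng.uniform(*std_range)
    noisy = image.astype(np.float64) + rng.normal(0.0, std, image.shape)
    return np.clip(noisy, 0, 255).astype(image.dtype)

sample = np.full((64, 64, 3), 100, dtype=np.uint8)
augmented = awgn_augment(sample, rng=np.random.default_rng(0))
```

Drawing the STD anew for every image exposes the detector to the whole range of noise intensities rather than to a single fixed level, which is the usual way to narrow the train/test domain gap.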
The task of small-sized object localization and classification is, as shown, more complicated than the task of object localization and classification in general. It requires special attention and, probably, special approaches, including the determination of the CNN type and parameters best suited to providing high performance, specific approaches to training, and input image pre-processing.
Above, we have considered object localization and classification in general, without paying attention to a particular class or classes that can be of interest for a given application, e.g., the detection of forest fires or polluted regions. For such cases, the analysis has to be modified to better fit the considered application.
These results already allow for improved urban monitoring systems, specifically for maintaining order on streets, monitoring complex road sections, identifying citizens in need of assistance, etc. It should also be noted that the obtained data are highly important for the authors’ future work on identifying small and camouflaged objects: they will allow for better noise filtering in images without the loss of important information, enabling object detection without reducing the probability of detection.
6. Conclusions
The task of object localization and classification in noisy color images acquired from UAVs has been considered. It is demonstrated that noise has a negative impact on most performance characteristics (especially F1), but mainly starting from the moment it becomes visible, i.e., when the input PSNR drops below about 30 dB. This means that the noise intensity (or PSNR) has to be controlled, which might complicate image processing.
It should be noted that the F1-score demonstrated the most pronounced degradation with increasing noise levels compared to other metrics (IoU, PoCPR). This is explained by its high sensitivity to classification errors: in the presence of noise, even a slight increase in the number of false positives or false negatives leads to a significant decrease in the final value. Furthermore, the uneven distribution of classes in the dataset used amplifies this effect, since noise has a stronger impact on small categories, reducing the Recall score and, consequently, the F1-score. Another important factor is the fact that the CNN was trained exclusively on clean images, which causes a mismatch between the training and test sets (domain gap) and further degrades classification results. Thus, the F1-score is the most rigorous indicator of classification quality in the presence of noisy UAV images, and its suboptimal values indicate the need to use preprocessing methods (e.g., DRUNet) or expand the training sets with data containing synthetic noise.
Different CNN architectures have different robustness with respect to noise; in particular, YOLOv5m is quite sensitive. The use of pre-filtering turns out to be expedient if the input PSNR is less than 30 dB. Both considered filters are, in general, able to improve performance, especially if the noise is very intensive. Meanwhile, on average, the use of the DRUNet filter is preferable.
It is also shown that the localization and classification of small-sized objects, which might correspond to such classes as “person” or “pedestrian”, is an even more complex task than the localization of objects having a size of one thousand pixels or more. Special efforts are needed to improve CNN performance for such classes.
We have studied only the simplest noise model, AWGN. The use of more adequate models of noise and other degradations is desirable in the future; in particular, signal-dependent and/or spatially correlated noise has to be considered. It is also worth studying different models of the YOLO family of CNNs, which develop quickly with continuous improvement of their properties.