Applied Sciences
  • Article
  • Open Access

9 December 2023

BA-YOLO for Object Detection in Satellite Remote Sensing Images

College of Communication Engineering, Jilin University, Changchun 130022, China
* Author to whom correspondence should be addressed.

Abstract

In recent years, there has been significant progress in object detection within the domain of natural images. However, satellite remote sensing images, with their significant scale variations and complex background interference, have consistently posed challenges, and directly applying conventional object detection models to them has proven difficult. To address these challenges, this paper introduces BA-YOLO, an improved version of the YOLOv8 object detection model, which incorporates several notable enhancements. Firstly, to fuse multi-scale features more effectively, we adopt the design concept of the higher-performing Bi-directional Feature Pyramid Network (BiFPN). Secondly, to retain sufficient global contextual information, we integrate a module into BA-YOLO that combines multi-head self-attention and convolutional networks. Finally, we employ various data augmentation techniques, such as Mixup, Cutout, Mosaic, and multi-scale training, to enhance the model’s accuracy and robustness. Experimental results on the DOTA dataset demonstrate that BA-YOLO outperforms state-of-the-art detectors, achieving a mean average precision (mAP) of 0.722.

1. Introduction

Remote sensing object detection is a critical task in the field of remote sensing, and its detection performance directly affects the performance of downstream tasks. This task finds significant applications in various civilian and military domains, including search and rescue operations [1,2], military reconnaissance, and intelligence gathering [3,4,5,6,7,8,9]. Unlike typical object detection tasks, remote sensing object detection faces a unique set of challenges, including variations in scale, differences in viewpoint, changing lighting conditions, background interference, noise interference, low imaging resolution, and significant variations in the morphology and distribution of targets [10,11]. In addressing these challenges, remote sensing object detection requires the development of algorithms that are robust and adaptive, enabling the effective identification and localization of targets of various scales and types. This may involve the use of multi-scale detection strategies, consideration of variations in target appearance under different viewpoints and lighting conditions, mitigation of the impact of background and noise on detection results, and handling targets in low-resolution images. In summary, remote sensing object detection is a challenging yet highly promising field that demands ongoing research and innovation to address the complexities of various environments and scenarios effectively.
Recently, the development of deep learning technologies has continuously broken through the bottlenecks in the field of image processing, achieving groundbreaking progress in areas such as object detection and tracking, image segmentation, image recognition, and retrieval. Natural image datasets like ImageNet [12] and MS COCO [13] have propelled advancements in object detection. However, remote sensing images have unique characteristics where objects often exhibit arbitrary orientations, dense arrangements, and complex background information that introduces interference. Therefore, directly applying object detection models designed for relatively simple background images to remote sensing object detection often falls short of practical requirements. All these factors make object detection in remote sensing images, which have complex backgrounds, more challenging. Remote sensing object detection necessitates specially designed algorithms and models to cope with the diversity, density, and complex environmental backgrounds of targets, ensuring efficient and accurate target identification. Research in this field is continuously advancing to meet the practical needs of remote sensing applications.
In this research, we focused on enhancing existing natural image object detection frameworks to achieve better performance in satellite remote sensing image object detection tasks. YOLOv8 [14] is the latest object detection framework in the YOLO family, and compared to YOLOv5 [15], it demonstrates the best performance in object detection tasks on natural image datasets such as COCO. However, achieving satisfactory results in satellite image object detection tasks directly with YOLOv8 proves challenging.
We updated YOLOv8 primarily in two respects. Firstly, the neck of YOLOv8 uses the PANet [16] structure. To fuse features at different scales more effectively, this study adopts the BiFPN structure to design a new neck for feature fusion. Compared to PANet, BiFPN introduces residual connections to enhance feature representation, removes nodes with a single input edge to reduce computational complexity, and introduces weighted fusion. Weighted fusion assigns learnable weights to the features of different scales being fused, thereby adjusting the contribution of each scale to the fused result. Secondly, YOLOv8 builds its backbone with a Convolutional Neural Network (CNN). CNNs excel at capturing local information, while transformers can enhance a model’s global modeling capability. ACmix combines the advantages of CNNs and transformers: convolutional layers capture local features, while self-attention captures global dependencies, so ACmix can handle both local and global semantic information. We therefore replaced the C2f module in the YOLOv8 backbone with C2fAC, improving performance on remote sensing image object detection tasks. The modified YOLOv8 is named BA-YOLO. Experimental results demonstrate that our proposed BA-YOLO surpasses standard YOLOv8 in detecting targets in the DOTA dataset.
Contributions:
  • Designed a new feature fusion layer that combines the advantages of BiFPN.
  • Added the C2fAC module to enhance the model’s feature extraction capability.
  • Implemented effective training strategies, including multi-scale training and testing, and data augmentation.
  • Achieved a mean average precision (mAP) of 0.722 on the DOTA dataset.

3. Materials and Methods

3.1. Overview of YOLOv8

YOLOv8 is a state-of-the-art (SOTA) model that builds upon the success of previous YOLO versions while introducing new features and improvements to further enhance performance and flexibility. Specifically, YOLOv8 introduces a new backbone network. The most significant change is in the head section, which replaces the coupled head of YOLOv5 with a decoupled head and transitions from YOLOv5’s anchor-based approach to an anchor-free one. YOLOv8 also adopts a new loss design, incorporating the Task-Aligned Assigner for positive sample assignment and introducing Distribution Focal Loss. The YOLOv8 model can run on a wide range of hardware platforms, from CPUs to GPUs. Furthermore, in the backbone and neck, YOLOv8 replaces YOLOv5’s C3 structure with a C2f structure that offers a richer gradient flow, with channel numbers adjusted for models of different scales; this change significantly improves model performance. When tested on the COCO dataset, YOLOv8 achieves higher mAP than YOLOv5 and YOLOv7 [40] at comparable parameter counts and FLOPs. To make YOLOv8 more suitable for object detection tasks in remote sensing images, we made several modifications to it, as described below.
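To make the C2f structure referenced above concrete, the following is a minimal PyTorch sketch of a C2f-like block: a 1 × 1 convolution splits the channels, a chain of bottleneck blocks produces additional branches, and all branches are concatenated before a final 1 × 1 convolution, giving a richer gradient flow than C3. This is a simplified illustration rather than the Ultralytics implementation; class names such as C2fLike and Bottleneck are ours.

```python
# Simplified C2f-style block (illustrative, not the Ultralytics code).
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c), nn.SiLU(),
            nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c), nn.SiLU(),
        )

    def forward(self, x):
        return x + self.block(x)  # residual shortcut

class C2fLike(nn.Module):
    def __init__(self, c_in: int, c_out: int, n: int = 2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1)                 # split projection
        self.cv2 = nn.Conv2d((n + 2) * self.c, c_out, 1)          # fuse all branches
        self.blocks = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # two initial branches
        for m in self.blocks:
            y.append(m(y[-1]))                  # each bottleneck adds a branch
        return self.cv2(torch.cat(y, dim=1))    # concatenate and project
```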

3.2. BA-YOLO

The architecture of our proposed BA-YOLO for remote sensing image object detection is depicted in Figure 1.
Figure 1. The architecture of BA-YOLO. The dashed lines in the figure represent weighted feature fusion. The details of each module in BA-YOLO are shown in the lower part of Figure 1.

3.2.1. C2fAC

Recently, with the emergence of vision transformers, modules based on self-attention have shown comparable or even better performance than their CNN counterparts in many visual tasks. In the hybrid attention mechanism ACmix proposed by Pan et al. [35], the input features are first mapped by convolution to obtain a rich set of intermediate features. These intermediate features are then reused and aggregated following two different paradigms (self-attention and convolution), fully exploiting the inherent relationship between convolution and self-attention. The ACmix structure is depicted in Figure 2. First, the input features are projected using three 1 × 1 convolutions. Next, the intermediate features are processed along two paths, and the outputs of both paths are added together to form the final output. In this paper, we incorporate ACmix into the C2f module so that it can capture richer features, resulting in C2fAC. Unlike SPH-YOLOv5 and TPH-YOLOv5, which fuse transformers and Swin transformers into the C3 module and place them in the model’s neck as detection heads, we place the self-attention modules in the backbone, applying transformer layers early in the network. When the network is still shallow and the feature maps are relatively large, a purely convolutional model may lose meaningful contextual information, and the early attention layers help retain it.
Figure 2. The overview of ACmix.
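As a concrete illustration of this idea, the following is a minimal PyTorch sketch of an ACmix-style block: shared 1 × 1 projections feed both a global self-attention path and a convolutional path, and the two outputs are summed with learnable scalars. The class and parameter names are ours, and the block is a simplified stand-in for the ACmix used in C2fAC rather than the authors’ implementation; full global attention is shown only for clarity.

```python
# Simplified ACmix-style block: shared 1x1 projections, one attention path,
# one convolutional path, learnable scalars to balance them (illustrative).
import torch
import torch.nn as nn

class ACmixLikeBlock(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        assert channels % heads == 0, "channels must be divisible by heads"
        self.heads, self.head_dim = heads, channels // heads
        # Shared 1x1 projections producing query/key/value-style intermediates.
        self.proj_q = nn.Conv2d(channels, channels, 1)
        self.proj_k = nn.Conv2d(channels, channels, 1)
        self.proj_v = nn.Conv2d(channels, channels, 1)
        # Convolutional path: reuse the same intermediates with a 3x3 mixing conv.
        self.conv_path = nn.Conv2d(3 * channels, channels, 3, padding=1)
        # Learnable scalars balancing the two paths.
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.proj_q(x), self.proj_k(x), self.proj_v(x)

        # Self-attention path: attention over all spatial positions.
        def to_heads(t):
            return t.view(b, self.heads, self.head_dim, h * w)
        qh, kh, vh = to_heads(q), to_heads(k), to_heads(v)
        attn = torch.softmax(qh.transpose(-2, -1) @ kh / self.head_dim ** 0.5, dim=-1)
        att_out = (vh @ attn.transpose(-2, -1)).reshape(b, c, h, w)

        # Convolutional path: aggregate the same intermediates locally.
        conv_out = self.conv_path(torch.cat([q, k, v], dim=1))

        return self.alpha * att_out + self.beta * conv_out
```

One plausible reading of C2fAC is that blocks of this kind replace or augment the bottlenecks inside the C2f structure sketched earlier; the paper does not spell out the exact wiring, so this sketch only captures the convolution-plus-attention principle.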

3.2.2. BiFPN

In remote sensing image object detection, there can be significant variations in the sizes of target objects, such as small vehicles and football fields. By propagating high-level features down to complement the semantics of lower-level features, it is possible to obtain high-resolution and semantically rich features, which are advantageous for the detection of small objects.
PANet further explores cross-scale connections, as illustrated in Figure 3a. In this work, as depicted in Figure 3b, a simple yet efficient Weighted Bi-Directional Feature Pyramid Network (BiFPN) was introduced to optimize cross-scale connections. The introduction of BiFPN significantly improved the performance of EfficientDet [41], enabling the model to better handle objects of different scales and improving detection performance. BiFPN achieves effective integration of multi-scale features through the introduction of bidirectional cross-scale connections and fast normalized fusion with learnable weights.
Figure 3. Feature network design: (a) PANet adds an additional bottom-up pathway on top of FPN; (b) BiFPN implements two optimizations for cross-scale connections.
Firstly, we eliminate nodes that have only one input edge. Furthermore, if an original input node is at the same level as an output node, we introduce an additional connection from the original input node to the output node, fusing more features at little extra cost.
Secondly, BiFPN introduces learnable weights when merging low-level and high-level features to learn the importance of different input features, rather than simply summing or concatenating them. Formally, given a list of multi-scale features $P^{in} = (P^{in}_{l_1}, P^{in}_{l_2}, \ldots)$, where $P^{in}_{l_i}$ represents the feature at level $l_i$, and a list of intermediate features on the top-down pathway $P^{td} = (P^{td}_{l_1}, P^{td}_{l_2}, \ldots)$, our goal is to find a transformation $f: P^{out} = f(P^{in})$ that effectively aggregates the different features and produces a set of new features as output. Figure 3a illustrates the conventional top-down and bottom-up PANet. It takes level 3–7 input features, $P^{in} = (P^{in}_3, \ldots, P^{in}_7)$, where $P^{in}_i$ represents a feature level with a resolution of $1/2^i$ of the input. The traditional PANet aggregates multi-scale features by simple summation, for example, at level 4:
$$P_4^{td} = \mathrm{Conv}\left(P_4^{in} + \mathrm{Resize}(P_5^{td})\right)$$
$$P_4^{out} = \mathrm{Conv}\left(P_4^{td} + \mathrm{Resize}(P_3^{out})\right)$$
where Resize is usually an upsampling or downsampling operation used to match resolutions, and Conv is typically a convolution operation used for feature processing. BiFPN integrates bidirectional cross-scale connections along with fast normalized fusion with learnable weights. As a specific example, we describe the fusion of features at level 4 in BiFPN, as shown in Figure 3b:
$$P_4^{td} = \mathrm{Conv}\left(\frac{w_1 \cdot P_4^{in} + w_2 \cdot \mathrm{Resize}(P_5^{td})}{w_1 + w_2 + \varepsilon}\right)$$
$$P_4^{out} = \mathrm{Conv}\left(\frac{w_1' \cdot P_4^{in} + w_2' \cdot P_4^{td} + w_3' \cdot \mathrm{Resize}(P_3^{out})}{w_1' + w_2' + w_3' + \varepsilon}\right)$$
In the above formulas, $P_4^{td}$ represents the intermediate feature at level 4 on the top-down pathway, and $P_4^{out}$ represents the output feature at level 4 on the bottom-up pathway; all other features are constructed in a similar manner. The key difference from the original BiFPN proposed in EfficientDet is that, in this paper, the neck takes three outputs from the backbone as inputs instead of five, and the three BiFPN outputs are fed into the final detection heads. The purpose of this is to improve the detection of objects at various scales.
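To make the fast normalized fusion above concrete, the following is a minimal PyTorch sketch of a single weighted fusion node. The WeightedFusion class and the resize placeholders in the usage comment are ours; a full BiFPN layer additionally uses depthwise-separable convolutions and careful channel matching, which are omitted here.

```python
# Fast normalized (weighted) fusion of same-shaped feature maps, as in the
# formulas above (illustrative sketch, not the exact BiFPN layer used here).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuses n same-shaped feature maps with learnable non-negative weights."""
    def __init__(self, n_inputs: int, channels: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats):
        w = F.relu(self.w)               # keep weights non-negative
        w = w / (w.sum() + self.eps)     # fast normalized fusion
        fused = sum(wi * f for wi, f in zip(w, feats))
        return self.conv(fused)

# Usage for level 4 (shapes assumed already matched by a Resize step):
# p4_td  = fuse_td([p4_in, resize(p5_td)])            # top-down intermediate
# p4_out = fuse_out([p4_in, p4_td, resize(p3_out)])   # bottom-up output
```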
Directly connecting the outputs of BiFPN to the classification layer would result in a huge number of parameters. While BiFPN achieves efficient bidirectional cross-scale connections and weighted feature fusion, resizing operations implemented with convolutional layers are required so that all scales have the same dimensions during weighted fusion, which also makes the output of each BiFPN layer large. In this paper, to enable efficient feature fusion while minimizing the increase in parameters, we modified the BiFPN structure by introducing an additional set of unweighted feature fusion paths, as illustrated in Figure 4.
Figure 4. Modified BiFPN structure. Weighted processing is being applied to feature fusion within the dashed box.

4. Experiments

4.1. Datasets and Evaluation Metrics

The DOTA dataset comprises 400,000 annotated object instances from 16 different categories. The images in this dataset are primarily sourced from the China Center for Resources Satellite Data and Application (CRESDA), Google Earth, the JL-1 satellite, and the GF-2 satellite. It includes 2806 aerial images captured by various sensors and platforms, with each image having a size of up to 4000 × 4000 pixels. The dataset encompasses a wide range of object categories: plane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, large vehicle, small vehicle, helicopter, roundabout, soccer ball field, swimming pool, and container crane. These diverse categories make the DOTA dataset valuable for object detection and related computer vision tasks. As shown in Table 1, objects between 10 and 50 pixels in size are referred to as small objects, objects larger than 300 pixels are categorized as large objects, and objects between 50 and 300 pixels are considered medium-sized. Because the original images are too large to feed into the network directly, we cropped the DOTA dataset into 40,000 images of 1024 × 1024 pixels (a sketch of this tiling step is given after Table 1). We chose precision (P), recall (R), and mean average precision (mAP) as evaluation metrics.
Table 1. The distribution of instance sizes in the DOTA dataset.
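As a sketch of the cropping step described above, the following shows one plausible way to tile large scenes into fixed-size patches. The overlap value and function name are illustrative assumptions, not the exact preprocessing used in this paper, and object annotations would additionally have to be shifted into each patch’s coordinate frame and clipped.

```python
# Tile a large remote sensing image into fixed-size, optionally overlapping
# patches (illustrative preprocessing sketch).
import numpy as np

def tile_image(image: np.ndarray, tile: int = 1024, overlap: int = 200):
    """Yield (x0, y0, patch) crops covering the whole image with overlap."""
    h, w = image.shape[:2]
    stride = tile - overlap
    for y0 in range(0, max(h - overlap, 1), stride):
        for x0 in range(0, max(w - overlap, 1), stride):
            y1, x1 = min(y0 + tile, h), min(x0 + tile, w)
            # Zero-pad border patches so every crop has the same size.
            patch = np.zeros((tile, tile, image.shape[2]), dtype=image.dtype)
            patch[: y1 - y0, : x1 - x0] = image[y0:y1, x0:x1]
            yield x0, y0, patch
```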
The definitions of P and R involve the numbers of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Precision is the proportion of true positives among all samples predicted as positive.
$$P = \frac{TP}{TP + FP}$$
Recall is the proportion of actual positive instances that the model successfully detects.
$$R = \frac{TP}{TP + FN}$$
Mean average precision (mAP) is the average of the per-class average precision (AP) values.
$$mAP = \frac{1}{N}\sum_{n=1}^{N} AP_n$$
In the given formula, $N$ denotes the total number of classes, and $AP_n$ represents the average precision for class $n$, which is numerically equal to the area under the precision–recall (P–R) curve:
$$AP = \int_0^1 P(R)\, dR$$
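The following minimal NumPy sketch illustrates these metrics. It uses simple trapezoidal integration of the P–R curve rather than the interpolated AP used by standard benchmarks, and the function name and inputs are assumptions for illustration: per-detection true-positive flags sorted by descending confidence, plus the number of ground-truth boxes for the class.

```python
# Precision, recall, and AP as the area under the P-R curve (simplified sketch).
import numpy as np

def precision_recall_ap(tp_flags: np.ndarray, num_gt: int):
    tp_cum = np.cumsum(tp_flags)              # true positives so far
    fp_cum = np.cumsum(1 - tp_flags)          # false positives so far
    recall = tp_cum / max(num_gt, 1)          # R = TP / (TP + FN)
    precision = tp_cum / (tp_cum + fp_cum)    # P = TP / (TP + FP)
    # AP: integrate P(R) over recall in [0, 1] (trapezoidal, non-interpolated).
    ap = np.trapz(np.concatenate(([1.0], precision)),
                  np.concatenate(([0.0], recall)))
    return precision, recall, ap

# mAP is then the mean of the per-class APs:
# map_value = float(np.mean([ap_per_class[c] for c in classes]))
```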

4.2. Implementation Details

In this paper, BA-YOLO is implemented in the PyTorch framework and trained and tested on an NVIDIA Tesla A100 GPU with 40 GB of memory. To save training time, we adopted a transfer learning strategy, initializing the backbone with a YOLOv8 model pre-trained on the COCO dataset. We used the SGD optimizer with an initial learning rate of 0.01. The number of training epochs was set to 150, and the first 3 epochs used a learning rate warm-up strategy, a common technique for stabilizing the training of deep models; during these initial 3 epochs, the learning rate gradually decreased from 0.1 to 0.01. The batch size was set to 8, the probability of Mosaic data augmentation was set to 1.0, and the probability of Mixup data augmentation was set to 0.7. Additionally, Mosaic augmentation was disabled during the final few epochs of training, which contributed to improving the model’s performance. Other detection models, such as Faster R-CNN, RetinaNet, SSD, YOLOv3, and YOLOv5, were validated using the default settings provided in their respective literature.
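For reference, the training setup described above could be expressed roughly as follows with the Ultralytics YOLOv8 interface. The dataset YAML path and the close_mosaic value are placeholders, the multi-scale settings are omitted, and the BA-YOLO architectural changes would have to be registered in a custom model definition, so this is a sketch of the hyperparameters rather than our exact training script.

```python
# Hedged sketch of the training configuration using the Ultralytics interface.
from ultralytics import YOLO

model = YOLO("yolov8l.pt")          # COCO-pretrained weights for transfer learning
model.train(
    data="dota_1024.yaml",          # placeholder dataset config (16 classes, 1024x1024 crops)
    epochs=150,
    batch=8,
    imgsz=1024,
    optimizer="SGD",
    lr0=0.01,                       # initial learning rate
    warmup_epochs=3,                # warm-up over the first 3 epochs
    mosaic=1.0,                     # Mosaic augmentation probability
    mixup=0.7,                      # Mixup augmentation probability
    close_mosaic=10,                # disable Mosaic for the final epochs (value illustrative)
)
```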

4.3. Experimental Results

We tested BA-YOLO on the DOTA dataset and compared it with other representative models, as shown in Table 2. On the DOTA dataset, our method achieves an mAP of 0.722, which is 0.036 higher than that of YOLOv8. This result demonstrates the effectiveness of BA-YOLO in satellite image object detection.
Table 2. Test results on the DOTA test dataset for different detection models.
In order to better assess the model’s performance, we plotted P–R curves for each category, as shown in Figure 5, and calculated the area under each curve, known as the average precision (AP); a higher AP value indicates better detection performance. Two further parameters for evaluating detection models are the Intersection-over-Union (IoU) threshold and the confidence threshold. In Figure 6, we computed the confusion matrix for BA-YOLO on the DOTA test dataset using an IoU threshold of 0.5 and a confidence threshold of 0.25. The confusion matrix is a tool for summarizing and visualizing the predictions of a classification problem: it counts the number of correct and incorrect predictions for each class and illustrates the model’s performance on different classes, including potential sources of confusion. As shown in Figure 6, each row represents the categories predicted by the model, each column represents the actual categories, and the values on the diagonal represent the proportion of correctly classified instances in each category. The confusion matrix helps us understand how the model performs on different classes, including which classes are prone to confusion or misclassification, thus aiding further analysis and model improvement.
Figure 5. The P–R curve from the BA-YOLO results on the DOTA test set.
Figure 6. The confusion matrix from the BA-YOLO results on the DOTA test set, with an IoU threshold of 0.5 and a confidence threshold of 0.25.
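For clarity, the sketch below shows one plausible way such a detection confusion matrix could be accumulated once detections have been filtered at the 0.25 confidence threshold and matched to ground truth at an IoU of 0.5; the function and variable names are illustrative, and per-column normalization would yield the proportions shown on the diagonal.

```python
# Accumulate a detection confusion matrix with a "background" row/column for
# missed objects and spurious boxes (illustrative sketch).
import numpy as np

def update_confusion(conf_mat, matches, unmatched_gt, unmatched_pred, n_classes):
    """conf_mat is (n_classes + 1) x (n_classes + 1); last index = background."""
    for pred_cls, gt_cls in matches:       # matched pairs (IoU >= 0.5)
        conf_mat[pred_cls, gt_cls] += 1
    for gt_cls in unmatched_gt:            # missed objects (false negatives)
        conf_mat[n_classes, gt_cls] += 1
    for pred_cls in unmatched_pred:        # spurious boxes (false positives)
        conf_mat[pred_cls, n_classes] += 1
    return conf_mat
```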
The high false negatives (FNs) for the container crane category in the DOTA dataset can be attributed to the significantly lower number of training samples available for this category compared to other classes. The lack of training samples leads to limited feature extraction, resulting in a high FN rate. The high FN rate for the container crane category implies that a majority of objects belonging to this category are being missed by the model, and, consequently, the corresponding average precision (AP) for this category is much lower compared to others. On the other hand, it is evident that the category of small vehicles is prone to being missed. This is because, although there are sufficient training samples for small vehicles in the dataset, they are considered tiny objects with significant variations in scale, size, and orientation. Additionally, small vehicles in the images may be situated in dark shadows, obscured by buildings, or densely parked in parking lots, making their detection challenging in such environments.
Figure 7 showcases some visual results from our BA-YOLO on the DOTA dataset. The detected bounding boxes are color-coded according to category. Figure 8 showcases some visual results from the baseline YOLOv8 detection model on the DOTA dataset. The confidence values shown in Figure 7 and Figure 8 are calculated as follows:
$$C_i = \Pr(\mathrm{Class}_i \mid \mathrm{Object}) \times \Pr(\mathrm{Object}) \times IoU(\mathrm{pred}, \mathrm{truth})$$
The term $\Pr(\mathrm{Class}_i \mid \mathrm{Object})$ represents the probability of each detected category given that an object is present, $\Pr(\mathrm{Object})$ indicates whether the center of an object falls within the grid cell, and $IoU(\mathrm{pred}, \mathrm{truth})$ is the Intersection-over-Union between the predicted box and the ground truth.
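As a small worked illustration of this formula, the following sketch computes the confidence from the class probability, the objectness score, and the IoU of axis-aligned boxes; the function names are hypothetical.

```python
# Confidence = class probability * objectness * IoU (illustrative sketch).
def box_iou(a, b) -> float:
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def detection_confidence(p_class: float, p_object: float, pred_box, gt_box) -> float:
    # C_i = Pr(Class_i | Object) * Pr(Object) * IoU(pred, truth)
    return p_class * p_object * box_iou(pred_box, gt_box)
```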
Figure 7. Some visualization results from our BA-YOLO on the DOTA dataset. Bounding boxes are colored according to category, with the category and confidence shown above each box.
Figure 8. Some visualization results from the baseline YOLOv8 on the DOTA dataset. Bounding boxes are colored according to category, with the category and confidence shown above each box.
We tested the inference speed of several algorithms in the same environment, as shown in Table 3. Our proposed model, BA-YOLO, is more complex than the other models, but it still holds a speed advantage over TPH-YOLOv5 and SPH-YOLOv5. However, it is somewhat slower than YOLOv5 and the baseline model YOLOv8. In the future, we will explore ways to simplify the model and improve inference speed while maintaining accuracy.
Table 3. Inference time on the DOTA test dataset for different detection models.

4.4. Ablation Experiments

In this section, we conduct a comprehensive ablation study on the DOTA test subset to analyze our algorithm, as shown in Table 4. Based on the baseline model YOLOv8-L, we introduced a simple yet effective weighted path aggregation method to replace PANet. We also embedded the C2fAC module (C2f combined with ACmix) into the original YOLOv8 model. The multi-scale training technique also contributes to improving the detection performance of BA-YOLO. Table 4 presents the comparison of results among the four models.
Table 4. The effects of combining different modules with YOLOv8 on the DOTA dataset.

5. Conclusions

Satellite images possess several prominent characteristics, including significant scale variations, complex backgrounds filled with distractions, and diverse perspectives, which pose tremendous challenges for general object detectors based on common convolutional networks. In this study, we adapted the latest natural-image detection model, YOLOv8, into the BA-YOLO model tailored for satellite images, improving the YOLOv8 network structure according to the characteristics of remote sensing images. In BA-YOLO, we introduced the BiFPN structure as the feature fusion network and employed learnable weights for fast normalized fusion, effectively reducing the loss of target feature information and enhancing detection performance in satellite images. Additionally, we introduced the C2fAC module, which combines the advantages of multi-head attention and convolutional networks, further enhancing the model’s detection accuracy. The BA-YOLO model was tested on the widely used DOTA dataset, achieving an mAP of 0.722 and surpassing the other models, which demonstrates the effectiveness of BA-YOLO for satellite image object detection. Although BA-YOLO achieves satisfactory detection accuracy in satellite images, its performance on small targets is not yet ideal, and its inference speed leaves room for improvement. In future work, we plan to improve detection speed while preserving accuracy and to incorporate image super-resolution techniques to further advance target detection in satellite images.

Author Contributions

Conceptualization, Z.L. and K.W.; formal analysis, K.W.; investigation, K.W.; software, K.W.; writing—original draft, K.W.; writing—review and editing, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Acknowledgments

We are very grateful for the support and help from Ming Zhu of the University of Chinese Academy of Sciences.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Han, W.; Chen, J.; Wang, L.; Feng, R.; Li, F.; Wu, L.; Tian, T.; Yan, J. Methods for small, weak object detection in optical high-resolution remote sensing images: A survey of advances and challenges. IEEE Geosci. Remote Sens. Mag. 2021, 9, 8–34. [Google Scholar] [CrossRef]
  2. Liu, T.; Wang, L.; Zuo, S.; Yang, C. Remote Sensing Dynamic Monitoring System for Agricultural Disaster in Henan Province Based on Multi-source Satellite Data. Agric. Sci. Technol. 2013, 14, 155–161. [Google Scholar]
  3. Zhao, H.; Zhang, H.; Zhao, Y. YOLOv7-sea: Object detection of maritime UAV images based on improved YOLOv7. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 233–238. [Google Scholar]
  4. Zhu, C.; Zhou, H.; Wang, R.; Guo, J. A Novel Hierarchical Method of Ship Detection from Spaceborne Optical Image Based on Shape and Texture Features. IEEE Trans. Geosci. Remote Sens. 2010, 48, 3446–3456. [Google Scholar] [CrossRef]
  5. Proia, N.; Page, V. Characterization of a Bayesian Ship Detection Method in Optical Satellite Images. IEEE Geosci. Remote Sens. Lett. 2010, 7, 226–230. [Google Scholar] [CrossRef]
  6. Xu, J.; Sun, X.; Zhang, D.; Fu, K. Automatic Detection of Inshore Ships in High Resolution Remote Sensing Images Using Robust Invariant Generalized Hough Transform. IEEE Geosci. Remote Sens. Lett. 2014, 11, 2070–2074. [Google Scholar]
  7. Feng, Y.; Xu, Q.; Feng, G.; Hu, L. Ship detection from optical satellite images based on visual search mechanism. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015. [Google Scholar]
  8. Yuan, Y.; Jiang, Z.; Zhang, H.; Wang, M.; Meng, G. Ship detection in panchromatic images: A new method and its DSP implementation. In Proceedings of the ISPRS International Conference on Computer Vision in Remote Sensing, Xiamen, China, 28–30 April 2015; International Society for Optics and Photonics: Bellingham, WA, USA, 2016. [Google Scholar]
  9. Lienhart, R.; Maydt, J. An extended set of Haar-like features for rapid object detection. In Proceedings of the International Conference on Image Processing, Rochester, NY, USA, 22–25 September 2002. [Google Scholar]
  10. Gorelick, N.; Hancher, M.; Dixon, M.; Ilyushchenko, S.; Thau, D.; Moore, R. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sens. Environ. 2017, 202, 18–27. [Google Scholar] [CrossRef]
  11. Li, D.; Ke, Y.; Gong, H.; Li, X. Object-based urban tree species classification using bi-temporal WorldView-2 and WorldView-3 images. Remote Sens. 2015, 7, 16917–16937. [Google Scholar] [CrossRef]
  12. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  13. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8693, pp. 740–755. [Google Scholar]
  14. Terven, J.; Cordova-Esparza, D. A comprehensive review of YOLO: From YOLOv1 to YOLOv8 and beyond. arXiv 2023, arXiv:2304.00501. [Google Scholar]
  15. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R.; et al. ultralytics/yolov5: v3.0; Zenodo: Geneva, Switzerland, 2020. [Google Scholar]
  16. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 9197–9206. [Google Scholar]
  17. Leninisha, S.; Vani, K. Water flow based geometric active deformable model for road network. ISPRS J. Photogramm. Remote Sens. 2015, 102, 140–147. [Google Scholar] [CrossRef]
  18. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  19. McKeown, D.M., Jr.; Denlinger, J.L. Cooperative methods for road tracking in aerial imagery. In Proceedings of the 1988 DARPA IUS Workshop, Cambridge, MA, USA, 6–8 April 1988; pp. 662–672. [Google Scholar]
  20. Rosenfeld, A. The max Roberts operator is a Hueckel-type edge detector. IEEE Trans. Pattern Anal. Mach. Intell. 1981, 1, 101–103. [Google Scholar] [CrossRef] [PubMed]
  21. Zhang, Y.; Han, X.; Zhang, H.; Zhao, L. Edge Detection Algorithm of Image Fusion Based on Improved Sobel Operator. In Proceedings of the 2017 IEEE 3rd Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 3–5 October 2017. [Google Scholar]
  22. Ulupinar, F.; Medioni, G. Refining edges detected by a LoG operator. Comput. Vis. Graph. Image Process. 1990, 51, 275–298. [Google Scholar] [CrossRef]
  23. Harris, C.G.; Stephens, M. A combined corner and edge detector. In Proceedings of the AVC, Manchester, UK, 31 August–2 September 1988; pp. 10–5244. [Google Scholar]
  24. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  25. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  26. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  27. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  28. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 6569–6578. [Google Scholar]
  29. Zhu, C.; He, Y.; Savvides, M. Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 840–849. [Google Scholar]
  30. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  31. Ghiasi, G.; Lin, T.Y.; Le, Q.V. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
  32. Xu, H.; Yao, L.; Zhang, W.; Liang, X.; Li, Z. Auto-FPN: Automatic network architecture adaptation for object detection beyond classification. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6649–6658. [Google Scholar]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  35. Pan, X.; Ge, C.; Lu, R.; Song, S.; Chen, G.; Huang, Z.; Huang, G. On the integration of self-attention and convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 815–825. [Google Scholar]
  36. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  37. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 6023–6032. [Google Scholar]
  38. Su, M.S.; Hwang, W.L.; Cheng, K.Y. Analysis on multiresolution mosaic images. IEEE Trans. Image Process. 2004, 13, 952–959. [Google Scholar] [CrossRef] [PubMed]
  39. Romeny, B.M.H. Front-End Vision and Multi-Scale Image Analysis: Multi-Scale Computer Vision Theory and Applications; Written in Mathematica; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  40. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  41. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]