Applied Sciences
  • Article
  • Open Access

8 October 2024

Enhancing a You Only Look Once-Plated Detector via Auxiliary Textual Coding for Multi-Scale Rotating Remote Sensing Objects in Transportation Monitoring Applications

1 School of Mongolian Studies, Inner Mongolia University, Hohhot 010021, China
2 National Engineering Research Center for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing 100048, China
3 School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing 100048, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Deep Learning in Satellite Remote Sensing Applications

Abstract

With the rapid development of intelligent information technologies, remote sensing object detection has come to play an important role in many application fields. In recent years in particular, it has attracted widespread attention for assisting food safety supervision, where it still faces a challenging trade-off between oversized model parameters and low detection performance. Hence, this article proposes a novel remote sensing detection framework for multi-scale objects with rotated orientations and mutual occlusion, defined as EYMR-Net. The proposed approach is built on the YOLO-v7 architecture with a Swin Transformer backbone, which offers multi-scale receptive fields for mining rich features. An enhanced attention module is then added to exploit the spatial and channel-dimension interrelationships among different local characteristics. Subsequently, an effective rotating-frame regression mechanism based on circular smooth labels is introduced into the EYMR-Net structure, addressing the problem that horizontal YOLO (You Only Look Once) frames ignore orientation changes. Extensive experiments on the DOTA dataset demonstrated the outstanding performance of EYMR-Net, which achieved an impressive mAP0.5 of up to 74.3%. Further ablation experiments verified that our proposed approach strikes a balance between performance and efficiency, which is beneficial for practical remote sensing applications in transportation monitoring and supply chain management.

1. Introduction

In the global supply chains of agricultural production, remote sensing technology has proven valuable in various applications, such as in vehicle and boat detection, which plays an essential role in this field. It can help farmers, agricultural experts, and government agencies to identify and monitor crop growth, the effects of pests and diseases [1], soil moisture, and other important information about farmland management and agricultural production. By accurately locating and classifying agricultural objectives, these technologies can provide important decision support for agricultural production, helping farmers to optimize resource use, increase yields, and reduce environmental impacts [2]. Ensuring the efficiency and reliability of the supply chain has become particularly important as the global demand for food grows. The accurate detection and analysis of targets in remotely sensed imagery can provide vital information about vehicles, ports, and buildings, thus helping to plan and optimize all aspects of the supply chain [3,4]. In addition, the technology can be used to monitor the transparency of the supply chain during food distribution. Potential safety issues and illegal activities, such as counterfeit food and other illegal additives, can be detected by analyzing remote sensing images of vehicles, logistics nodes, and trading places. This helps to protect consumer rights, maintain a level playing field in the agriculture market, and improve the supply sustainability of products [5].
Taking the grain supply chain management system (GSMS) as an example, remote sensing images can monitor the entire process, from grain planting and production to storage, transportation, international trade, and consumption. With the strong support of remote sensing technology, the GSMS integrates various sources of information, including traffic management [6], road infrastructure, weather and land conditions [7], and user consumption behavior, to achieve more intelligent and efficient management decisions. For example, by analyzing remote sensing images of a certain city, critical information such as the ocean transportation density, aircraft flight frequency, truck and high-speed rail capacity, and the distribution of cold chain facilities can be learned. This is of great significance for resource allocation and quality assurance in international grain trade and provides an effective method to achieve transportation assurance and food quality safety [8].
Remote sensing is a technology that collects electromagnetic radiation information from artificial satellites, airplanes, or other vehicles about physical targets to identify the Earth’s environment and resources. Any object has different electromagnetic wave reflection or radiation characteristics. Aerospace remote sensing technology uses remote sensors mounted on aircraft to sense the electromagnetic radiation characteristics of physical targets and record the characteristics for identification and judgment. Remote sensing images are acquired using imaging equipment on aircraft to capture ground objects. These images are valuable sources of dynamic and comprehensive data, which are crucial for land cover monitoring, urban information planning, and marine environment detection.
Currently, remote sensing technology is widely used in international agricultural trade, transportation supervision, supply chain monitoring, and other scenarios [9]. For example, in a certain port, accurately identifying the number of ships, planes, trucks, and other transportation equipment from remote sensing images is essential in various practical applications of supply chain regulation. It can effectively analyze the trade situations of different agricultural productions by recognizing their types and estimating the total amount of goods in circulation as well as the operation situation of each supply chain [10]. Thus, determining how to accurately distinguish the category, quantity, location, and other information of various transportation vehicles is vital for improving the supply ability in international trade. In addition, remote sensing images have long been plagued by some issues, such as large-scale ranging, low pixel quality, mutual occlusion, and a dense distribution of individual objects, making accurately identifying objects in remote sensing images a challenge; therefore, this topic has significant value and is worth studying to guarantee the efficiency of agricultural product circulation.
From a specific technical point of view, remote sensing target detection is a crucial aspect of computer vision research. It involves identifying and localizing specific targets in remote sensing images by determining their category and rectangular enclosing box coordinates. The detection process typically consists of three steps: selecting the region of interest, extracting features from the selected region, and detecting and classifying the extracted features. Remote sensing images differ from natural images in several ways. They are primarily captured from altitudes ranging from several thousand meters to tens of thousands of meters, resulting in significant variations in target scales and perspectives. Furthermore, remote sensing images are captured under different weather conditions and angles, leading to diverse factors such as lighting changes, seasonal variations, and varying imaging quality. As a result, target detection based on remote sensing images presents challenges that include complex background issues, dense small target problems, and multi-directional complexities [11,12]. These challenges necessitate innovative approaches to address them effectively.
Traditional methods for target detection in remote sensing images can be categorized into four types: those based on statistical features of gray-scale information, visual saliency, template matching, and classification learning [13]. While these methods are computationally efficient, they often focus on specific scenes. In contrast, remote sensing photographs have a wide field of view, and high-resolution remote sensing photographs capture detailed surface features from multiple scenes. Traditional remote sensing image detection methods are insufficient for a fine-grained analysis of complex remote sensing scenes and lack robustness. Therefore, their applicability in target detection for high-resolution remote sensing photographs is limited [14]. With the rapid advancement of deep learning technology, a growing number of studies have adopted deep learning networks for target detection in remote sensing images across different practical applications. These approaches enable the effective integration of multi-scale features extracted from different dimensionalities to improve detection accuracy. Moreover, they can incorporate various learning strategies and skills, including attention mechanisms, weighted summation strategies, dynamic parameter fine-tuning, etc., to further improve the identification ability and localization accuracy for each small object in remote sensing images [15]. Therefore, some deep neural network models designed for target detection in general image processing, such as the YOLO-series and Faster R-CNN models, also migrate well to remote sensing applications.
However, the current state of deep learning-based algorithms still leaves significant room for advancement, specifically in detecting small targets within remote sensing images. Aerially captured imagery often portrays objects from an overhead shooting angle, highlighting unique top contour features that differentiate them from common objects. However, the high perspective of remote sensing images presents a notable hurdle: the targets within these images tend to be small and densely arranged. Consequently, the feature information of such minute targets is prone to being lost during the model's feature extraction phase, further exacerbating the detection challenge. Taking the cargo ship as an example, as one of the most important transport carriers in agricultural product supply, it often occupies only a very small number of pixels in large-scale remote sensing images [16]. Moreover, many ship targets present a dense aggregation with obvious directionality, which means the real area labeled by a horizontal box contains a great deal of background information, and different targets overlap and intersect severely. The labeled box of a single ship target may even contain more than one object, which introduces considerable interference into model training and leads to more frequent misdetections and omissions during the training process [17].
These aforementioned problems are mixed together, so existing deep learning methods cannot accurately identify the transport carrier targets in remote sensing images, such as cargo ships, trucks, aircraft, etc. Thus, in the actual application of agricultural product supply, determining how to build a better intelligent method to directly process remote sensing data is a significant issue that needs to be solved urgently. This paper addresses the challenges associated with target detection in remote sensing images by proposing a novel rotating target detector named EYMR-Net, which builds upon the single-stage YOLOv7 detection framework. The primary contributions of this study encompass the following two aspects:
(1)
To address the missed detection of densely distributed small-scale targets in remote sensing images, this study designed a feature extraction module on the backbone network of YOLOv7 by incorporating the Swin Transformer. The network's perceptual ability is enhanced for multi-scale targets, enabling the extraction of more comprehensive feature information. Additionally, the CBAM (Convolutional Block Attention Module) is integrated into the neck module to augment the model's capacity to analyze multi-scale targets, particularly capturing fine-grained details of dense small targets. These network enhancements contribute to improving the detection performance for dense small targets and suppress omissions and failures in detecting transport equipment in remote sensing images.
(2)
To improve the position and orientation information of various objects, this study introduces rotating box annotations to the horizontal regression model of the YOLO framework. This modification effectively addresses two main issues. On one hand, it resolves the problem of ignoring the directionality of the YOLO algorithm when detecting targets such as ships and vehicles, ensuring accurate identification. On the other hand, it mitigates the issue of excessive redundant pixels in the detection box caused by objects with large aspect ratios detected using the horizontal box approach. By incorporating rotating box annotations, the proposed model can efficiently meet the demands of real-time monitoring applications of remote sensing in the field of agricultural supply.
This study is structured as follows: Section 2 introduces the detection solutions in related works and their shortcomings. Then, we introduce the proposed EYMR-Net in Section 3 and describe its principle and implementation process. Subsequently, experiments are conducted to validate the superiority of EYMR-Net, providing further evidence of the feasibility and effectiveness of the proposed method in this study. Finally, this study summarizes its contributions and outlines directions for future research.

3. Materials and Methods

In this study, we propose an enhanced YOLO architecture detector called EYMR-Net for application in food transportation regulation. EYMR-Net is designed to detect multi-scale and rotated remote sensing objects. Firstly, we present the overall model framework of EYMR-Net and discuss the modifications we made to each module based on YOLOv7. Then, the details of the modifications made to each part of the overall network framework are elaborated on in Section 3.2, Section 3.3 and Section 3.4.

3.1. Proposed Remote Sensing Detection Framework

The EYMR-Net model in this paper mainly contains four parts, the input, backbone, neck, and prediction head, and its network structure is shown in Figure 1. Firstly, the backbone network module consists of several layers, including the CBS layer, ELAN layer, MP-1 layer, and the Swin Transformer layer introduced in this paper. The CBS layer consists of a convolutional layer, a BN layer, and a LeakyReLU activation function for extracting image features at different scales. The ELAN layer in this paper is used to enhance the learning ability of the network by incorporating computational modules from different feature groups. This integration allows the layer to learn diverse feature information without compromising the original gradient path. The MP-1 layer uses a maximum pooling operation to expand the receptive field of the current feature layer, which is then fused with the feature information from normal convolutional processing to improve the feature extraction capability of the model. In this paper, we add the Swin Transformer [44] module to the feature extraction backbone network, located behind the second and third ELAN modules, in order to enhance the network’s ability to collect fine-grained information about multi-scale targets and to obtain richer gradient fusion information, especially detailed information about small targets.
Figure 1. Proposed EYMR-Net framework.
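As a concrete illustration of the CBS building block described above, the following is a minimal PyTorch sketch (convolution, batch normalization, and LeakyReLU); the kernel size, negative slope, and example tensor shapes are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + LeakyReLU block, used repeatedly in the backbone."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        padding = kernel_size // 2  # keeps spatial size when stride == 1
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=padding, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Example: downsample a feature map while doubling its channels.
x = torch.randn(1, 64, 160, 160)
y = CBS(64, 128, kernel_size=3, stride=2)(x)
print(y.shape)  # torch.Size([1, 128, 80, 80])
```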
The neck module in this study adopts the structure of a Path Aggregation Feature Pyramid Network (PAFPN) [45], which incorporates bottom-up paths to facilitate information flow from lower layers to higher layers, thus efficiently fusing features at different levels. To further enhance information fusion, we add the CBAM [46] attention mechanism into the neck of YOLOv7. CBAM’s Channel Attention Module (CAM) strengthens channel-wise feature interactions, while the Spatial Attention Module (SAM) focuses on capturing target positions. By encoding feature map information along channel and spatial dimensions, CAM and SAM enable the output to be more focused on relevant details. In this paper, we replace the CBS module, originally present behind the four ELAN-W modules for feature analysis in the neck of YOLOv7, with the CBAM attention mechanism.
Furthermore, an angle classification prediction branch is added to the YOLOv7 prediction head. Hence, multi-task losses are calculated during training, including the classification loss, horizontal bounding box regression loss, confidence loss, and angle loss. In the testing phase, prediction boxes are generated by rotating the horizontal boxes according to the predicted angles and then applying non-maximum suppression (NMS).

3.2. Swin Transformer Backbone for Feature Extraction

A Transformer network is a purely attention-based network that has achieved significant success in natural language processing tasks. The Transformer network consists of two parts: an encoder and a decoder. The Vision Transformer mainly utilizes the Transformer encoder for feature extraction. In image processing tasks, the image is first sliced into small patches and then fed into the Transformer encoder, which outputs correlation information between image patches. The multi-head attention mechanism is the core of the Transformer encoder; it uses multiple self-attention computations to obtain the correlation information between different sequences. The Swin Transformer, in contrast to the Vision Transformer, considers the computational resources of the network and has shown remarkable performance in various computer vision tasks such as detection, segmentation, and classification. It adopts a layered structure similar to a CNN, dividing the network into four stages. Each stage contains more feature maps with a reduced spatial size, allowing for the extraction of global features through self-attention and multi-head attention mechanisms while preserving important features at each scale.
As shown in Figure 2, a single STR block is presented in detail. Firstly, the W-MSA (Window Multi-head Self-Attention) module is introduced to restrict the self-attention computation to a local window; compared with global attention computation, this local window restriction reduces the computation of the whole network. The W-MSA module performs the self-attention computation inside each window by dividing the image into windows, which allows positions within a window to exchange information and capture local correlations. This localized intra-window attention computation preserves the details inside the window and provides a feature representation with local contextual information.
Figure 2. Swin Transformer backbone framework.
However, local window attention computation alone cannot capture the global information of the entire image. To compensate for this shortcoming, the Swin Transformer introduces the SW-MSA module. The SW-MSA module interacts with information between non-overlapping windows through window shift operations. It facilitates attention computation between different windows by cyclically shifting the original feature map so that the windows can be moved to different locations. This attention computation mechanism within moving windows allows for global information exchange between windows, capturing global correlations and dependencies across the image.
For example, each shifted window of the Swin Transformer in layer i + 1 draws on the same area as four windows in layer i; that is, each window in layer i + 1 aggregates the information of four windows in layer i. This overcomes the problem that information cannot be exchanged between different windows. Used in conjunction, the i-th and (i + 1)-th layers reduce the number of parameters required by the traditional self-attention mechanism and alleviate the reduced receptive field of the plain window self-attention mechanism.
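To make the W-MSA/SW-MSA mechanics more tangible, the sketch below shows, under simplified assumptions (a toy 8 × 8 feature map, window size 4, and a half-window cyclic shift), how a feature map is partitioned into non-overlapping windows and shifted with torch.roll before re-partitioning; it is not the full Swin Transformer block.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows
    of shape (num_windows * B, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

# Toy feature map: batch 1, 8 x 8 spatial, 96 channels, window size 4.
x = torch.randn(1, 8, 8, 96)
windows = window_partition(x, window_size=4)           # W-MSA windows
print(windows.shape)                                    # (4, 4, 4, 96)

# SW-MSA: cyclically shift the map by half a window before partitioning,
# so the new windows straddle the old window boundaries and exchange information.
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
shifted_windows = window_partition(shifted, window_size=4)
print(shifted_windows.shape)                            # (4, 4, 4, 96)
```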
In remote sensing image target detection, due to the high image resolution and the large differences in target size and type, it is often difficult for a traditional backbone network to balance the detection of targets at different scales. To improve the multi-scale feature extraction performance of the model on remote sensing images, this study introduces the Swin Transformer structure and integrates it into the feature extraction backbone network. This effectively enhances the model's ability to perceive remote sensing images, allowing it to capture the dependencies between different regions of the input image through its unique window partitioning strategy, thereby improving the overall accuracy and robustness for multi-scale targets.

3.3. Convolutional Block Attention Module

The attention mechanism fundamentally imitates humans’ visual processing of images by assigning weights to different positions on the feature map to represent varying levels of attention. Remote sensing images often contain complex backgrounds, and after feature extraction through convolutional layers, there is less information about the small targets to be detected, while more information exists about the background and undetectable objects within it. These non-interest regions can interfere with the detection of small targets, leading to missed detections or false positives.
In this study, we introduce the CBAM to address the challenges of false positives and missed detections for small targets in remote sensing images. The CBAM combines global average pooling (GAP) and global maximum pooling (GMP) strategies to prevent information loss and preserve more relevant target information, and we incorporate this attention mechanism into our proposed model.
The feature map dimensions contain different types of information; the channel dimension expresses abstract features, while the spatial dimension emphasizes object position information. The CBAM architecture comprises two primary components: the Channel Attention Module (CAM) and the Spatial Attention Module (SAM). These modules are combined sequentially to generate attention feature maps in both the channel and spatial dimensions. The architecture of the Channel Attention Module is depicted in Figure 3.
Figure 3. CBAM network architecture.
As illustrated, the CBAM procedure can be outlined as follows: first, the input feature map F is adaptively refined by the Channel Attention Module, yielding an intermediate feature map F′; then, the Spatial Attention Module refines F′, producing the final feature map F″ after CBAM processing. The computational procedure can be expressed as follows, where the symbol "⊗" denotes element-wise multiplication, the initial feature map is $F \in \mathbb{R}^{C \times H \times W}$, the channel attention map is $M_c \in \mathbb{R}^{C \times 1 \times 1}$, and the spatial attention map is $M_s \in \mathbb{R}^{1 \times H \times W}$:

$$F' = M_c(F) \otimes F$$

$$F'' = M_s(F') \otimes F'$$
In order to compute the channel attention features effectively, a fusion of global average pooling and global maximum pooling is used to reduce the spatial dimensions of the feature map. This reduction generates two C × 1 × 1 feature maps, which are passed through a shared Multilayer Perceptron (MLP) for subsequent processing.
The MLP is a two-layer network. The first layer reduces the dimensionality of the feature maps; its size, in terms of the number of neurons, is C/r, where r is the dimensionality reduction factor. The two channel attention vectors produced by the MLP are then summed, and the Sigmoid activation function is applied to obtain the final channel attention vector, denoted as $M_c$. Here, $F^c_{avg}$ and $F^c_{max}$ denote the features obtained after the average and max pooling operations, respectively.
$$M_c(F) = \sigma\big(MLP(AvgPool(F)) + MLP(MaxPool(F))\big) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$$
To complement the Channel Attention Module, the Spatial Attention Module operates on the feature map F′, which is obtained by the element-wise multiplication of the channel attention vector $M_c$ and the input feature map F. In the Spatial Attention Module, F′ undergoes average and maximum pooling along the channel dimension, resulting in two 1 × H × W feature maps. These maps are concatenated to generate a 2 × H × W feature map. A 7 × 7 convolution is then applied to enlarge the receptive field while reducing the dimensionality back to 1 × H × W. Lastly, the Sigmoid activation function yields the spatial attention vector $M_s$.
$$M_s(F') = \sigma\big(f^{7 \times 7}([AvgPool(F'); MaxPool(F')])\big) = \sigma\big(f^{7 \times 7}([F^s_{avg}; F^s_{max}])\big)$$
To enhance the detection performance of small objects while disregarding irrelevant background information in images, the CAM and SAM from CBAM were integrated into the Neck structure of YOLOv7 in this paper. The CAM enables the network to focus on important features by emphasizing relevant channels through channel-wise attention. On the other hand, the SAM enhances the network’s ability to capture spatial details by considering the relationships between different spatial locations using spatial attention. By integrating these attention mechanisms into the YOLOv7 framework, the network achieves improved detection performance for small objects.
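A minimal PyTorch sketch of the CBAM described by the equations above is given below; the reduction ratio r = 16 and the 7 × 7 spatial kernel follow common CBAM defaults and are assumptions rather than the exact settings used in this paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention (Woo et al., 2018)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel Attention Module: shared MLP over GAP and GMP descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial Attention Module: 7x7 conv over channel-pooled maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # M_s(F') = sigmoid(conv7x7([AvgPool(F'); MaxPool(F')])) over channels
        avg_s = torch.mean(x, dim=1, keepdim=True)
        max_s = torch.amax(x, dim=1, keepdim=True)
        x = x * torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1)))
        return x

feat = torch.randn(1, 256, 40, 40)
print(CBAM(256)(feat).shape)  # torch.Size([1, 256, 40, 40])
```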

3.4. Rotated Rectangular Box Optimization

(1)
Rotated circular smoothing labels
To account for the diverse orientations of objects in remote sensing images, the eight-parameter method is adopted to represent rotated bounding boxes (RBO). This method represents a rotated rectangle by its four vertices, (x1, y1), (x2, y2), (x3, y3), and (x4, y4), listed in counterclockwise order starting from the top-left corner. These eight-parameter annotations serve as the model's input and are fed into the network, where a decoding process extracts the rotation angle. The decoding process can be described as follows:
$$\theta = \arctan\left(\frac{x_1 + x_4 - x_3 - x_2}{y_1 + y_4 - y_3 - y_2}\right)$$
The decoding process transforms the bounding box annotation into an angle-based representation denoted by $(x, y, w, h, \theta)$. However, this introduces issues such as angle periodicity and boundary discontinuity. The angle regression approach is converted to a classification form to address angle periodicity, avoiding angles beyond their valid range. However, converting the regression problem to a classification problem results in a significant loss of accuracy. To measure the distance between angle labels, we introduce circular smooth labels (CSLs) in this study. CSLs mitigate the issue of boundary discontinuity by defining a window function g(x) to measure the angular distance between the predicted and ground truth labels. The prediction loss is smaller when the detected value is closer to the ground truth within a specific range. Compared to previous methods, CSLs exhibit higher tolerance to adjacent angles, alleviating the problem of unbalanced angles. The expression for CSLs is given as follows:
$$CSL(x) = \begin{cases} g(x), & \theta - r < x < \theta + r \\ 0, & \text{otherwise} \end{cases}$$
To improve the detection of rotated rectangular bounding boxes, we introduce circular smooth labels in this study. The original label is denoted by x, while g(x) is a window function with radius r; the extent of the window determines the tolerance range around the angle of the current bounding box. Using the window function, our model can measure the angular distance between the predicted and ground truth labels. This approach extends the original YOLOv7 model, allowing it to detect rotated rectangular bounding boxes more accurately. Consequently, as the predicted value approaches the true value within the window range, the loss decreases. We also address angular periodicity by incorporating it into the labels, so that angles such as 89 degrees and −90 degrees are considered neighbors. The regression process for rotating bounding boxes with circular smooth labels is illustrated in Figure 4. Through these improvements, we aim to achieve more accurate detection of rotated bounding boxes.
Figure 4. Rotating box regression via circular smoothing labels.
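The following NumPy sketch shows one way to construct a circular smooth label vector with a Gaussian window g(x); the choice of 180 one-degree bins, a window radius of 6, and the Gaussian shape are illustrative assumptions.

```python
import numpy as np

def circular_smooth_label(theta, num_bins=180, radius=6):
    """Build a CSL vector for an angle theta in [-90, 90):
    a Gaussian window g(x) centred on theta, wrapped circularly so that
    angles such as 89 deg and -90 deg are treated as neighbours."""
    bins = np.arange(num_bins) - 90            # bin centres: -90 ... 89 degrees
    # circular (wrapped) distance between each bin and the target angle
    d = np.abs(bins - theta)
    d = np.minimum(d, num_bins - d)
    label = np.exp(-(d ** 2) / (2 * (radius / 3) ** 2))
    label[d > radius] = 0.0                    # zero outside the window
    return label

csl = circular_smooth_label(89.0)
print(csl[179], csl[0])  # bin for 89 deg is 1.0; the wrapped neighbour at -90 deg is ~0.88
```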
(2)
Loss function design
For the new algorithm proposed in this paper (EYMR-Net), which supports the learning of rotated bounding boxes, we add the designed angle classification loss $Loss_{theta}$ to the original loss function, and the total loss is formulated as follows:
$$Loss = Loss_{obj} + Loss_{(x,y,w,h)} + Loss_{cls} + Loss_{theta}$$
In this context, $Loss_{obj}$ denotes the confidence loss, $Loss_{(x,y,w,h)}$ the horizontal bounding box localization loss, $Loss_{cls}$ the target category loss, and $Loss_{theta}$ the angle classification loss. The confidence, category, and horizontal localization losses are kept identical to those of YOLOv7, since the θ variable is not involved in them. The newly added angle classification loss is defined as follows:
$$Loss_{theta} = -\sum_{i=0}^{s^2} I_{ij}^{obj} \sum_{x \in [-90, 90)} \Big[ \hat{p}_i(CSL(x)) \log\big(p_i(CSL(x))\big) + \big(1 - \hat{p}_i(CSL(x))\big) \log\big(1 - p_i(CSL(x))\big) \Big]$$
To address the angle regression problem, we transform it into a classification problem using CSLs. We represent the angle classification loss function using binary cross-entropy. In this approach, we use the function CSL(x) to convert the original angle theta into CSLs, incorporating periodicity and symmetry. The input range for x is within [−90, 90), representing the angle values.
We can enhance angle classification by using CSL, considering their periodic characteristics. During training, the binary cross-entropy loss function is employed to optimize the angle classification task. This approach results in improved algorithm performance and more precise angle prediction.
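A hedged PyTorch sketch of the angle classification term is shown below: it applies binary cross-entropy between predicted angle-bin logits and CSL targets and averages over object-containing cells. The tensor shapes, the helper name, and the masking scheme are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def angle_classification_loss(pred_logits, csl_targets, obj_mask):
    """Binary cross-entropy between predicted angle-bin logits and CSL targets.

    pred_logits : (N, 180) raw angle-bin scores for N candidate boxes
    csl_targets : (N, 180) circular-smooth-label targets in [0, 1]
    obj_mask    : (N,) 1 for cells that contain an object, 0 otherwise
    """
    per_bin = F.binary_cross_entropy_with_logits(
        pred_logits, csl_targets, reduction="none")   # (N, 180)
    per_box = per_bin.sum(dim=1)                       # sum over angle bins
    return (per_box * obj_mask).sum() / obj_mask.sum().clamp(min=1)

# Toy usage with random predictions and two positive cells.
logits = torch.randn(4, 180)
targets = torch.zeros(4, 180); targets[:, 90] = 1.0    # e.g. the bin for 0 deg
mask = torch.tensor([1.0, 1.0, 0.0, 0.0])
print(angle_classification_loss(logits, targets, mask))
```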

4. Results and Discussion

4.1. Experimental Datasets and Settings

To evaluate the efficacy of our proposed image object detection algorithm, we utilized the DOTAv1.0 dataset of aerial remote sensing images, which was provided by Wuhan University. This dataset includes 15 categories, namely small vehicle (SV), baseball diamond (BD), plane (PL), basketball court (BC), ground track field (GTF), harbor (HA), swimming pool (SP), bridge (BR), roundabout (RA), large vehicle (LV), tennis court (TC), helicopter (HC), ship (SH), soccer ball field (SBF), and storage tank (ST), and contains a total of 188,282 objects.
The DOTA dataset is characterized by a significant number of small targets, with approximately 80,000 and 110,000 instances of small vehicles and ships, respectively. In general, the objects in this dataset are relatively small in size, accounting for less than 5% of the image area. The images in the DOTA dataset are high-resolution aerial photographs, with pixel dimensions ranging from 800 × 800 to 4000 × 4000. However, due to the large sizes of most images in the dataset, direct training with the model after uniform scaling would pose challenges in detecting numerous small objects. Therefore, we perform preprocessing on the training images.
To preprocess the images, we first crop the original images to a uniform size of 1024 × 1024, maintaining a certain overlap between adjacent crops to avoid losing object information at the cut boundaries. After cropping, the resulting dataset comprises 21,046 images, which we divide into training, validation, and testing sets at a 3:1:1 ratio. Figure 5 visualizes the number of instances per target category in the dataset as a bar graph, with category names on the horizontal axis and instance counts on the vertical axis.
Figure 5. DOTA dataset description.
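The sketch below illustrates one possible tiling procedure consistent with the preprocessing described above: large images are cropped into overlapping 1024 × 1024 patches. The 200-pixel overlap, the helper names, and the example file name are assumptions for illustration.

```python
from PIL import Image

def tile_positions(size, tile, stride):
    """Start offsets so that tiles cover the full extent, including the border."""
    if size <= tile:
        return [0]
    positions = list(range(0, size - tile, stride))
    if positions[-1] != size - tile:
        positions.append(size - tile)      # last tile is flush with the border
    return positions

def tile_image(path, tile=1024, overlap=200):
    """Crop a large aerial image into overlapping tile x tile patches."""
    img = Image.open(path)
    W, H = img.size
    stride = tile - overlap
    patches = []
    for top in tile_positions(H, tile, stride):
        for left in tile_positions(W, tile, stride):
            patches.append((left, top, img.crop((left, top, left + tile, top + tile))))
    return patches

# Example with a hypothetical DOTA scene file:
# patches = tile_image("P0000.png")
# print(len(patches), patches[0][2].size)   # e.g. 16 (1024, 1024)
```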

4.2. Evaluation Indicators

To validate the effectiveness of our model, we employed several evaluation metrics, including Precision, Recall, Average Precision (AP), and mean Average Precision (mAP). Precision measures the proportion of true positives among all samples predicted as positive, and Recall measures the proportion of actual positive samples that are correctly predicted. They are calculated as follows:
$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$
Here, TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. The P-R curve plots the precision rate against the recall rate for a particular category, with recall on the horizontal axis and precision on the vertical axis. The area under this curve is the AP value of that category:
$$AP = \int_0^1 P(R)\, dR$$
The mean Average Precision (mAP) is a comprehensive evaluation metric for target detection algorithms on a particular dataset. It is calculated as the average of the AP values of all categories and is currently the most important metric for evaluating the performance of the model. The mAP is denoted as follows:
$$mAP = \frac{1}{C} \sum_{i=1}^{C} AP_i$$
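For reference, the following NumPy sketch computes per-class AP as the area under the precision-recall curve and averages class APs into mAP; the all-point integration and the toy detection lists are illustrative assumptions.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP as the area under the precision-recall curve for one category.

    scores : confidence of each detection
    is_tp  : 1 if the detection matches a ground-truth box (e.g. IoU >= 0.5), else 0
    num_gt : number of ground-truth objects of this category
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    recall = np.cumsum(tp) / max(num_gt, 1)
    precision = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp))
    # integrate precision over recall (all-point interpolation)
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

ap_sv = average_precision([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 1], num_gt=4)
ap_lv = average_precision([0.95, 0.5], [1, 1], num_gt=2)
print(ap_sv, (ap_sv + ap_lv) / 2)   # per-class AP and a two-class mAP
```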

4.3. Comparative Experimental Results

To further validate the effectiveness of the enhanced EYMR-Net algorithm, we compare it with five classical target detection algorithms. As shown in Table 1, the experimental results of the proposed EYMR-Net and other state-of-the-art object detection models are compared on the DOTA dataset.
Table 1. A comparison of the detecting results with the DOTA dataset.
In the remote sensing image detection task, AP0.5 (%) and mAP0.5 (%) are chosen to evaluate the recognition performance of each model on remote sensing targets. EYMR-Net obtains a higher mAP0.5, which indicates that the optimization approaches are effective. In detail, the proposed EYMR-Net achieves an mAP0.5 of 74.30%, outperforming CenterNet, YOLOv5, RetinaNet, Faster R-CNN, and CornerNet. Among all models, ours achieves the best AP0.5 for ships (SHs), small vehicles (SVs), and large vehicles (LVs), which are small-scale dense targets, reaching 89.9%, 69.0%, and 87.8%, respectively. This demonstrates that our detector, EYMR-Net, greatly improves the detection of small targets compared to classical target detectors in the complex context of remotely sensed images. It also achieves excellent detection metrics in several other categories, such as airplanes (PLs) and harbors (HAs). To further illustrate these advantages, we plot the detection performance of the compared methods, including CornerNet, YOLOv5, RetinaNet, Faster R-CNN, and CenterNet, as shown in Figure 6.
Figure 6. Performance comparison of different models.
At the same time, we conduct tests on different detectors: for group 1, we choose remote sensing images of large vehicles (LVs) and small vehicles (SVs) densely arranged in oriented rows in a parking lot; for group 2, we choose remote sensing images containing a ship (SH), small vehicles (SVs), and a harbor (HA) in a complex background; for group 3, we choose remote sensing images of small and large vehicles at the same time; for group 4, we select remote sensing images of helicopters (HCs) and passenger planes (PLs) in an airport scene; and for group 5, we select remote sensing images of harbors (HAs), ships (SHs), and vehicles on shore in a complex scene. The test samples are shown in Figure 7. Faster R-CNN, SSD, RetinaNet, and YOLOv5 all suffer from missed detections of small targets to varying degrees; for example, in group 1, where large and small vehicles appear at the same time, only our detector EYMR-Net detected five of the small vehicles (SVs) in the upper-left corner of the image. The red circles in the figure mark the missed detection areas of the different models in small object detection.
Figure 7. Remote sensing detection results of comparative models.
By comparison, our rotating detection box greatly alleviates the problem of excessive redundant background pixels within the frame caused by the horizontal boxes of traditional target detectors, achieving fine localization of slender targets. For group 2, regarding the small vehicles in the upper-left corner, our detector EYMR-Net detects 11 vehicles, while the rest of the detectors still miss a large number of them; our detector also improves the recognition of ports. Due to the complexity of the remote sensing marine scene, multiple ships stick together near the port and on the ground, causing confusion; these problems make it relatively difficult for detectors such as RetinaNet to detect the port. Although its recognition of ships is still good, it does not detect the port near the ships. For the group 3 remote sensing images, our EYMR-Net model has the lowest false detection rate for small targets, while the rest of the models suffer from serious missed detections. For the airport scenes in group 4, SSD misses helicopters, whereas our detector, EYMR-Net, has no false detections; relative to the other four detectors, EYMR-Net also generates the smallest IOU between neighboring detected boxes and has the highest localization accuracy. The group 5 sample likewise demonstrates the superiority of our detector in recognizing ports. These examples, covering multiple categories in remote sensing scenarios, show that our proposed EYMR-Net detector has superior detection performance.
The detection errors of each model for different object types are shown in Figure 8. We select six target types for the violin plots, including small vehicles (SVs), large vehicles (LVs), ships (SHs), harbors (HAs), basketball courts (BCs), and airplanes (PLs). As shown in Figure 8, we plotted the errors for the last 50 training rounds of the six image detectors. Taken together, our EYMR-Net exhibits the best performance, with relatively low errors in all categories. For small-scale targets, such as vehicles and ships, the violin plots of the EYMR-Net model show a more concentrated distribution and a smaller error range, indicating that our model detects and localizes targets in these categories more accurately, with higher robustness and stability. It also shows fewer than five outliers on medium- and large-scale targets, including harbors (HAs) and basketball courts (BCs). This proves that our detector has a lower error rate and better performance on targets of different scales.
Figure 8. Detection error distribution of different object types.

4.4. Ablation Experiment Results

We performed several ablation tests to comprehensively evaluate the enhanced EYMR-Net algorithm’s performance. During the training phase, we used the YOLOv7 model as the baseline. Throughout the process, each model was trained for 300 epochs using a batch size of 16. We introduced Swin Transformer (STR) modules and the CBAM attention mechanism to enhance the model’s performance. Furthermore, we improved the bounding box regression mechanism of YOLOv7 to handle rotated boxes based on circular smooth labels. To assess the combined effects of the EYMR-Net variants, we conducted corresponding ablation experiments, and the results are shown in Table 2. It can be seen that the three proposed modules all play a role in detection accuracy and efficiency. The organic combination of STR, CBAM, and RBO makes the proposed EYMR-Net achieve better robustness to cope with remote sensing detection and identification tasks under multi-scale changes.
Table 2. Ablation studies on the DOTA dataset.
Considering that our method is optimized on top of the advanced YOLOv7, we drew precision, recall, and mAP0.5 curves to visually compare EYMR-Net with the YOLOv7 series. As shown in Figure 9, in the final training round the original YOLOv7 reaches 74.32%, 63.87%, and 70.72% on these three metrics, while our EYMR-Net reaches 79.76%, 72.18%, and 74.30%. The comparison reveals that EYMR-Net reaches its steady-state values faster during the learning process, with better convergence and accuracy, outperforming the YOLOv7 series approaches.
Figure 9. Results of curve analysis.
To test the detection ability of EYMR-Net in real remote sensing scenes, five representative image samples are selected for visualization and comparison in this experiment: group 1 includes densely arranged, oriented large vehicles (LVs) in a parking lot; group 2 comprises airplanes (PLs) and helicopters (HCs) in an airport scene; group 3 includes small vehicles (SVs) as small targets in a parking lot scene; group 4 comprises a coastal scene with multiple similar-looking ships (SHs) and harbors (HAs); and group 5 represents multiple categories appearing simultaneously in a complex urban remote sensing scene, including tennis courts (TCs), basketball courts (BCs), and small vehicles (SVs). The confidence threshold is set to 0.25, and the detection results are shown in Figure 10.
Figure 10. Remote sensing detection results of EYMR-Net vs. YOLOv7 series.
A thermal effect map is plotted based on Grad-cam to further reflect the feature information extraction capability of the detector, as shown in Figure 11. In both figures, the first column comprises the original pictures, the second column lists the detection results of YOLOv7 after adding the Swin Transformer module (STR), and the third column lists the results of YOLOv7 after adding the Rotation Detection Frame (RBO). The last column presents the detection results obtained using our EYMR-Net.
Figure 11. Detection and thermal diagrams of EYMR-Net vs. YOLOv7 series.
Upon comparison, it was found that the improved algorithm in this paper dramatically improves the detection of multiple categories of targets in remote sensing scenes, such as the large vehicles (LVs), airplanes (PLs), and helicopters (HCs) in the images of groups 1 and 2. When these categories are detected with the original YOLOv7, the detection boxes contain a large number of redundant background pixels, and the IOU between neighboring detection boxes is too large, which prevents the precise localization of each target and the delineation of a single target's contour. In contrast, the EYMR-Net presented in this paper adopts rotating box annotations that closely fit the objects' shapes, and the IOU between adjacent detection boxes is much lower than that of YOLOv7.
Densely arranged targets with considerable overlaps are best located using the rotating box. The improved multi-scale rotating target detector, EYMR-Net, presented in this paper solves the problem of excessive redundant pixels in the detection box brought about by horizontal-box detection, realizing target detection that better fits the shape of the object and improving the detection of densely arranged targets in remote sensing scenes. For example, for the harbors (HAs) and ships (SHs) in the group 4 image, the horizontal boxes of the original YOLOv7 cannot accurately delineate the areas of the ships moored near the harbor, which is not conducive to the detection of ship targets in marine remote sensing scenes. For the very small ship targets moored on the shore in the lower-right part of the image, the improved algorithm presented in this paper detects 16 ships, while the original YOLOv7 detects 14. This indicates that the new backbone and neck feature extraction networks, constructed in this paper around the Swin Transformer and CBAM attention, effectively alleviate the missed detection of small-scale targets in complex remote sensing scenes.
For the remote sensing images in group 5, the background of the urban scene is highly complex and contains multiple categories at different scales: tennis courts (TCs), basketball courts (BCs), and small vehicles (SVs). The large-scale targets in the scene, the tennis courts and basketball courts, appear as angled rectangles. The original algorithm does not detect these obliquely oriented targets well: its boxes contain too many redundant background pixels, and the overlap between neighboring boxes exceeds 50%. The improved EYMR-Net handles this problem efficiently and clearly differentiates the contour areas of the different targets.
For further visual analysis, we utilized a confusion matrix to evaluate the accuracy of the EYMR-Net model’s predictions. This matrix’s vertical axis represents the actual target categories, while the horizontal axis represents the predicted categories. Figure 12 provides a visual representation of this matrix. Our analysis of the confusion matrix revealed that the proposed EYMR-Net model achieved high accuracy rates for small vehicles, large vehicles, planes, ships, harbors, tennis courts, and basketball courts, with correct prediction rates of 71%, 86%, 93%, 88%, 86%, 93%, and 71%, respectively. These results demonstrate the model’s ability to detect targets of various scales accurately, with a particularly impressive performance in detecting small objects.
Figure 12. Confusion matrix of EYMR-Net detector.

5. Conclusions

Sustainability in agricultural supply is crucial for ensuring food safety and improving people’s lives, and detecting various logistics targets through remote sensing imagery is meaningful for improving the efficiency of logistics supervision. However, traditional detection methods have problems with horizontal labels ignoring the multi-scale variation and dense distribution of transportation equipment in the detection process of remote sensing imagery. To address the identified issues, this paper proposes EYMR-Net, an enhanced YOLO architecture detector tailored for multi-scale rotating remote sensing objects. The EYMR-Net incorporates several key advancements. Firstly, it features a novel backbone based on the Swin Transformer, which captures richer gradient fusion information of multi-scale targets, with a particular emphasis on enhancing the detection of small targets. Additionally, the model integrates the Convolutional Block Attention Module (CBAM) within the neck component, thereby improving feature analysis capabilities and reducing missed detections and omissions of small targets. Moreover, EYMR-Net introduces an innovative rotating frame regression mechanism by utilizing circular smoothing labels, effectively addressing the limitations of traditional horizontal regression models that often overlook target directionality. Extensive comparative and ablation experiments conducted on the DOTA dataset demonstrate the superior performance of EYMR-Net over competing models, achieving Precision, Recall, and mAP0.5 scores of 79.76%, 72.18%, and 74.30%, respectively. These results highlight EYMR-Net’s innovative approach to multi-scale target detection, making it particularly well suited for practical monitoring applications of remote sensing technology.
In the future, we will further explore the potential of the EYMR-Net model in other research fields, such as agricultural or pharmaceutical logistics. Meanwhile, we can further optimize related technologies and modeling strategies to enhance their performance in different remote sensing scenarios.

Author Contributions

Conceptualization, S.B.; methodology, S.B. and M.Z.; software, R.X.; validation, R.X.; formal analysis, S.B. and M.Z.; investigation and data curation, M.Z.; writing—original draft preparation, S.B.; writing—review and editing, J.K.; visualization, resource, S.B. and D.H.; supervision, project administration, J.K. and D.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (Nos. 62473008, 62433002, 62476014 and 62173007), Project of ALL China Federation of Supply and Marketing Cooperatives (No.202407), the Project of Beijing Municipal University Teacher Team Construction Support Plan (No. BPHR20220104), and the Beijing Scholars Program (No. 099).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The dataset presented in this study is available at the following website link: https://captain-whu.github.io/DOTA/dataset.html accessed on 26 August 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zheng, Q.; Huang, W.; Xia, Q.; Dong, Y.; Ye, H.; Jiang, H.; Chen, S.; Huang, S. Remote Sensing Monitoring of Rice Diseases and Pests from Different Data Sources: A Review. Agronomy 2023, 13, 1851. [Google Scholar] [CrossRef]
  2. Mzid, N.; Pignatti, S.; Huang, W.; Casa, R. An analysis of bare soil occurrence in arable croplands for remote sensing topsoil applications. Remote Sens. 2021, 13, 474. [Google Scholar] [CrossRef]
  3. Tuomisto, H.; Hodge, I.; Riordan, P.; Macdonald, D.W. Does organic farming reduce environmental impacts? A meta-analysis of European research. J. Environ. Manag. 2012, 112, 309–320. [Google Scholar] [CrossRef]
  4. Liu, Y.; Wu, L. Geological disaster recognition on optical remote sensing images using deep learning. Procedia Comput. Sci. 2016, 91, 566–575. [Google Scholar] [CrossRef]
  5. Pang, Z.B.; Chen, Q.; Han, W.L.; Zheng, L.R. Value-centric design of the internet-of-things solution for food supply chain: Value creation, sensor portfolio and information fusion. Inf. Syst. Front. 2015, 17, 289–319. [Google Scholar] [CrossRef]
  6. Lenhart, D.; Hinz, S.; Leitloff, J.; Stilla, U. Automatic traffic monitoring based on aerial image sequences. Pattern Recognit. Image Anal. 2008, 18, 400–405. [Google Scholar] [CrossRef]
  7. Frohn, R.C.; Lopez, R.D. Remote Sensing for Landscape Ecology: New Metric Indicators; CRC Press: Boca Raton, FL, USA, 2017. [Google Scholar]
  8. Aung, M.; Chang, Y. Traceability in a food supply chain: Safety and quality perspectives—ScienceDirect. Food Control 2014, 39, 172–184. [Google Scholar] [CrossRef]
  9. Li, Y.; Fortner, L.; Kong, F. Development of a Gastric Simulation Model (GSM) incorporating gastric geometry and peristalsis for food digestion study. Food Res. Int. 2019, 11, 125. [Google Scholar] [CrossRef]
  10. Jiang, X.; Chen, W.X.; Nie, H.T. Real time ship target detection based on aerial remote sensing images. Opt. Precis. Eng 2020, 28, 2360–2369. [Google Scholar] [CrossRef]
  11. Lan, X.; Guo, Z.; Li, C. Attention and Feature Fusion for Aircraft Target Detection in Optical Remote Sensing Images. Chin. J. Liq. Cryst. Disp. 2021, 36, 1506–1515. [Google Scholar] [CrossRef]
  12. Wang, H.L.; Zhu, M.; Lin, C.B.; Chen, D.B.; Yang, H. Ship Detection of Complex Sea Background in Optical Remote Sensing Images. Opt. Precis. Eng. 2018, 26, 723–732. [Google Scholar] [CrossRef]
  13. Li, Z.M.; Cheng, L.; Zhu, D.M.; Yan, Z.J.; Ji, C.; Duan, Z.X.; Jing, M. Deep Learning and Spatial Analysis Based Port Detection. Laser Optoelectron. Prog. 2021, 58, 2028002. [Google Scholar]
  14. Zhang, C.G.; Xiong, B.L.; Kuang, G.Y. A Survey of Ship Detection in Optical Satellite Remote Sensing Images. Chin. J. Radio Sci. 2020, 35, 637–647. [Google Scholar]
  15. Kong, J.L.; Fan, X.M.; Jin, X.B.; Su, T.L.; Bai, Y.T.; Ma, H.J.; Zuo, M. BMAE-Net: A Data-Driven Weather Prediction Network for Smart Agriculture. Agronomy 2023, 13, 625. [Google Scholar] [CrossRef]
  16. Jin, X.B.; Wang, Z.Y.; Kong, J.L.; Bai, Y.T.; Su, T.L.; Ma, H.J.; Chakrabarti, P. Deep Spatio-Temporal Graph Network with Self-Optimization for Air Quality Prediction. Entropy 2023, 25, 247. [Google Scholar] [CrossRef]
  17. Jin, X.B.; Wang, Z.Y.; Gong, W.T.; Kong, J.L.; Bai, Y.T.; Su, T.L.; Ma, H.J.; Chakrabarti, P. Variational Bayesian Network with Information Interpretability Filtering for Air Quality Forecasting. Mathematics 2023, 11, 837. [Google Scholar] [CrossRef]
  18. Kong, J.L.; Fan, X.M.; Jin, X.B.; Lin, S.; Zuo, M. A Variational Bayesian Inference-Based En-Decoder Framework for Traffic Flow Prediction. IEEE Trans. Intell. Transp. Syst. 2023, 25, 2966–2975. [Google Scholar] [CrossRef]
  19. Kong, J.L.; Wang, H.X.; Wang, X.Y.; Jin, X.B.; Fang, X.; Lin, S. Multi-stream hybrid architecture based on cross-level fusion strategy for fine-grained crop species recognition in precision agriculture. Comput. Electron. Agric. 2021, 185, 106134. [Google Scholar] [CrossRef]
  20. Zheng, Y.Y.; Kong, J.L.; Jin, X.B.; Wang, X.Y.; Zuo, M. CropDeep: The crop vision dataset for deep-learning-based classification and detection in precision agriculture. Sensors 2019, 19, 1058. [Google Scholar] [CrossRef]
  21. Kong, J.L.; Wang, H.X.; Yang, C.C.; Jin, X.B.; Zuo, M.; Zhang, X. A spatial feature-enhanced attention neural network with high-order pooling representation for application in pest and disease recognition. Agriculture 2022, 12, 500. [Google Scholar] [CrossRef]
  22. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, USA, 8–14 December 2001; Volume 1, p. I-I. [Google Scholar]
  23. Viola, P.; Jones, M. Robust real-time face detection. Int. J. Comput. Vis. 2004, 57, 137–154. [Google Scholar] [CrossRef]
  24. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  25. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  26. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  27. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  28. Fan, Z.; Xia, W.; Liu, X.; Li, H. Detection and segmentation of underwater objects from forward-looking sonar based on a modified Mask RCNN. Signal Image Video Process. 2021, 16, 320–335. [Google Scholar] [CrossRef]
  29. Berg, A.C.; Fu, C.Y.; Szegedy, C.; Anguelov, D.; Erhan, D.; Reed, S.; Liu, W. SSD: Single Shot MultiBox Detector. arXiv 2015, arXiv:1512.02325. [Google Scholar]
  30. Wang, Y.; Li, W.; Li, X.; Sun, X. Ship detection by modified retinanet. In Proceedings of the 2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS), Beijing, China, 19–20 August 2018; pp. 1–5. [Google Scholar]
  31. Duan, K.; Bai, S.; Xie, L.; Qi, H. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  32. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  33. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  34. Xu, X.; Feng, Z.; Cao, C.; Li, M.; Wu, J.; Wu, Z.; Shang, Y.; Ye, S. An improved Swin Transformer-based model for remote sensing object detection and instance segmentation. Remote Sens. 2021, 13, 4779. [Google Scholar] [CrossRef]
  35. Zhao, W.Q.; Kang, Y.J.; Zhao, Z.B.; Zhai, Y.J. A remote sensing image object detection algorithm with improved YOLOv5s. CAAI Trans. Intell. Syst. 2023, 18, 86–95. [Google Scholar]
  36. Chen, Z.; Liu, C.; Filaretov, V.F.; Yukhimets, D.A. Multi-Scale Ship Detection Algorithm Based on YOLOv7 for Complex Scene SAR Images. Remote Sens. 2023, 15, 2071. [Google Scholar] [CrossRef]
  37. Guo, H.; Bai, H.; Yuan, Y.; Qin, W. Fully deformable convolutional network for ship detection in remote sensing imagery. Remote Sens. 2022, 14, 1850. [Google Scholar] [CrossRef]
  38. Zhu, H.; Xie, Y.; Huang, H.; Jing, C.; Rong, Y.; Wang, C. DB-YOLO: A duplicate bilateral YOLO network for multi-scale ship detection in SAR images. Sensors 2021, 21, 8146. [Google Scholar] [CrossRef]
  39. Liu, Y.; Wang, X. SAR ship detection based on improved YOLOv7-tiny. In Proceedings of the 2022 IEEE 8th International Conference on Computer and Communications, Chengdu, China, 9–12 December 2022; pp. 2166–2170. [Google Scholar]
  40. Baek, W.K.; Kim, E.; Jeon, H.K.; Lee, K.J.; Kim, S.W.; Lee, Y.K.; Ryu, J.H. Monitoring Maritime Ship Characteristics Using Satellite Remote Sensing Data from Different Sensors. Ocean Sci. J. 2024, 59, 8. [Google Scholar] [CrossRef]
  41. Liu, Y.; Zhang, R.; Deng, R.; Zhao, J. Ship detection and classification based on cascaded detection of hull and wake from optical satellite remote sensing imagery. GIScience Remote Sens. 2023, 60, 2196159. [Google Scholar] [CrossRef]
  42. Reggiannini, M.; Salerno, E.; Bacciu, C.; D’Errico, A.; Lo Duca, A.; Marchetti, A.; Martinelli, M.; Mercurio, C.; Mistretta, A.; Righi, M.; et al. Remote Sensing for Maritime Traffic Understanding. Remote Sens. 2024, 16, 557. [Google Scholar] [CrossRef]
  43. Niu, R.; Zhi, X.; Jiang, S.; Gong, J.; Zhang, W.; Yu, L. Aircraft Target Detection in Low Signal-to-Noise Ratio Visible Remote Sensing Images. Remote Sens. 2023, 15, 1971. [Google Scholar] [CrossRef]
  44. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  45. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  46. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
