MVPOD: A Dataset and Benchmark for Multi-Vertical-Perspective Object Detection in Multi-Platform Remote Sensing Images

Jin, Haiyan; Chen, Jintao; Zhang, Yuanlin; Su, Haonan; Wang, Bin

doi:10.3390/rs17173029



by Haiyan Jin, Jintao Chen, Yuanlin Zhang *, Haonan Su and Bin Wang

School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, China

* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(17), 3029; https://doi.org/10.3390/rs17173029

Submission received: 2 July 2025 / Revised: 22 August 2025 / Accepted: 28 August 2025 / Published: 1 September 2025


Abstract

Deep learning-based object detection has achieved remarkable maturity after years of intensive research. However, as multi-platform data acquisition becomes increasingly prevalent, spanning satellite, UAV, and ground-based platforms, a critical challenge emerges involving significant vertical perspective variations in captured images. The current object detection literature largely neglects this perspective dimension, particularly the robustness evaluation of single models across diverse viewing angles. To bridge this gap, we first conduct a systematic review categorizing existing approaches into standard and rotated object detection paradigms. Second, we build the Multi-Vertical-Perspective Object Detection (MVPOD) dataset; this dataset is the first comprehensive benchmark integrating spaceborne (nadir), airborne (oblique) and ground-level (horizontal) imagery with dual annotation schemes. Third, rigorous cross-perspective evaluation protocols reveal that vertical viewpoint discrepancies cause measurable performance degradation. Finally, representative methods are benchmarked on the MVPOD dataset, establishing baselines for future research.

Keywords: multi-source image analysis; rotated object detection; vertical perspective; MVPOD dataset


1. Introduction

With the rapid development of intelligent technology in satellite remote sensing [1,2], Unmanned Aerial Vehicles (UAVs) [3], and ground vehicles [4], computer vision technologies have been widely applied in social life. Object detection plays a crucial role in all of these technologies [5]. As a core task in the field of computer vision, object detection enables the recognition and localization of objects, providing critical perception and decision support for intelligent systems [6,7].

In recent years, significant progress has been made in object detection algorithms based on deep learning. However, in practical applications object detection still faces challenges such as diverse scenarios with different imaging perspectives [8,9,10], as shown in Figure 1. Object detection has been extensively researched in a number of scenarios. (1) In the field of spaceborne nadir image analysis, high-altitude images taken by satellites and aircraft provide a nadir-downward perspective with broad scenes. In this case, the objects in images appear as top-view shapes with a distinct orientation, as shown in Figure 2a. For these directional objects, traditional axis-aligned bounding boxes cannot accurately describe the orientation, which limits detection performance [11]. Cheng et al. [5] proposed a rotation-invariant layer to address the object rotation problem, while Xia et al. [12] addressed this problem by introducing oriented bounding boxes, enabling more precise localization of objects with large aspect ratios. Recent studies [13,14,15,16,17] in remote-sensing rotated object detection have achieved remarkable progress, demonstrating superior performance on large-scale benchmarks [12,18,19,20]. (2) In the field of airborne oblique image analysis, significant variations in flight altitude cause the perspectives of detected objects to range from nadir-downward to oblique-downward, as shown in Figure 2b. The shape, scale, and proportion of objects can change significantly with the perspective, placing higher demands on the perspective robustness of object detection algorithms. To meet this challenge, Unmanned Aerial Vehicle Object Detection (UAV-OD) [8,9] has been proposed. This technique improves the performance of object detection in UAV images by implementing multi-scale feature fusion and incorporating additional detection layers to better handle small objects [10,21,22,23]. 
(3) In the field of ground-based horizontal image analysis, objects are usually at a horizontal perspective, which is closer to human visual experience (shown in Figure 2c). Thanks to the widespread application of large-scale object detection datasets such as MSCOCO [24] and Pascal VOC [25], current horizontal perspective object detection algorithms have demonstrated high performance [26,27,28,29,30,31,32].

Clearly, object detection in multi-platform remote sensing images is often accompanied by significant variations in vertical perspective, especially the transition from nadir-downward or oblique-downward to horizontal perspectives. However, the impact of vertical perspective variations on object detection algorithms has not received sufficient attention in previous studies.

To explore this influence, in this paper we carry out a systematic review of normal object detection methods and rotated object detection methods. Due to the lack of datasets annotated with vertical perspectives, we present a publicly available Multi-Vertical-Perspective Object Detection (MVPOD) dataset (https://github.com/iskiku/MVPOD (accessed on 25 August 2025)). Our MVPOD dataset covers images taken from spaceborne platforms, airborne platforms, and ground-based platforms. The MVPOD dataset consists of 10,470 images with eight object categories. In terms of data annotation, MVPOD provides both horizontal and oriented bounding boxes along with vertical perspective labels for each object, covering common perspectives (nadir-downward, oblique-downward, and horizontal) as well as rarer ones (oblique-upward and nadir-upward). Finally, a comprehensive benchmark is conducted on the MVPOD dataset with advanced methods for both normal and rotated object detection, establishing baselines for future research. In particular, the dataset is divided by vertical perspective into three sub-datasets: nadir-downward, oblique-downward, and horizontal. Cross-training experiments on these sub-datasets are employed to assess the influence of vertical perspective variations on detection performance. The main contributions of this paper are presented below:

(1) A comprehensive review of deep learning-based object detection algorithms: A systematic review of the latest developments in deep learning-based normal object detection and rotated object detection methods is presented. These methods are widely applied with different vertical perspectives from diverse imaging platforms.

(2) Construction of the MVPOD dataset with vertical perspective labels: To study the impact of vertical perspective variations on object detection, we construct the publicly available MVPOD dataset. To the best of our knowledge, MVPOD is the first object detection dataset to specifically annotate vertical perspective labels.

(3) Benchmarking performance on the MVPOD dataset: Representative methods in horizontal and rotated object detection are benchmarked on the MVPOD dataset and their performance is comprehensively compared using multiple evaluation metrics.

(4) Cross-experiments assessing the impact of vertical perspective variations: Extensive cross-training experiments are conducted on sub-datasets of MVPOD with different vertical perspectives in order to investigate the robustness of existing object detection methods under different vertical perspectives and the influence of vertical perspective variations on object detection performance. The results of these experiments provide valuable insights for future research.

The rest of this paper is structured as follows: in Section 2, previous works about normal object detection are reviewed; Section 3 reviews previous works on rotated object detection; Section 4 introduces the proposed MVPOD dataset; in Section 5, we conduct experiments on the proposed MVPOD dataset and provide an analysis covering the influence of perspective on object detection; Section 6 looks forward to future work and research directions; finally, Section 7 concludes the paper.

2. Review of Normal Object Detection

Normal object detection represents a pivotal task in the computer vision domain. It aims to automatically identify foreground objects from images or videos while simultaneously providing location information (typically represented by bounding box coordinates) and category labels for these objects. It serves as a foundation for numerous other computer vision tasks such as object tracking, and plays a critical role in practical applications.

Figure 3 shows the development of normal object detection methods over the past decade. These normal object detection methods can be classified into two categories based on processing flow, namely, two-stage detection methods and one-stage detection methods [6]. Two-stage methods initially generate candidate object regions, then classify and refine these candidates [33]. Conversely, one-stage methods accomplish the object detection task within a single forward pass, directly predicting the object’s category and location through dense grids [34]. As shown in Figure 4, the input image is first processed by a feature extraction module to generate feature maps. For two-stage detectors (red path), a Region Proposal Network (RPN) [26] is used to generate region proposals, which are then pooled and forwarded for classification and regression. In contrast, one-stage detectors (blue path) perform classification and regression directly on the feature maps using predefined anchors, and have no proposal generation stage.

2.1. Two-Stage Object Detection Methods

The R-CNN series methods represent seminal contributions to the development of two-stage object detection algorithms. The original R-CNN [33] employed selective search to generate region proposals, independently extracting convolutional features for each region proposal before classification and regression. However, this approach suffered from low computational efficiency. Fast R-CNN [35] addressed this issue by sharing convolutional feature maps, performing a single convolution operation on the entire image, and using RoI Pooling to extract features from region proposals in the feature map, significantly improving computational efficiency. Faster R-CNN [26] further introduced an RPN to replace the traditional selective search method. This achieved end-to-end joint optimization of region proposal generation and object detection, resulting in notable improvements in detection speed and accuracy.

Subsequent research has made various improvements based on the R-CNN series. Mask R-CNN [36] introduced RoI Align as a replacement for RoI Pooling, resolving alignment errors during the feature map quantization process and enhancing the accuracy of both object detection and instance segmentation. Cascade R-CNN [27] utilizes a cascade structure that progressively optimizes the quality of candidate boxes through multiple cascaded detectors. Each detector is trained with different IoU thresholds, significantly boosting detection accuracy. Dynamic R-CNN [37] proposed a dynamic adjustment strategy that can adaptively adjust label assignment criteria and parameters of the regression loss function based on the statistical characteristics of candidate boxes during training. This enables the model to better fit high-quality samples. Sparse R-CNN [38] abandoned traditional complex components such as RPN, anchor mechanisms, and Non-Maximum Suppression (NMS) [33], instead adopting learnable proposal boxes and a cross-attention mechanism to achieve end-to-end object detection, which greatly simplifies the model structure and enhances efficiency.

2.2. One-Stage Object Detection Methods

You Only Look Once (YOLO) [34] stands as a representative work among one-stage object detection algorithms. The core idea of YOLO is to treat the object detection task as a regression problem, simultaneously predicting the bounding boxes and category probabilities of objects through a single forward pass. Since the introduction of YOLOv1 [34] in 2015, the YOLO series of algorithms has been continuously updated with the release of YOLOv3 [39], YOLOv5 [28], YOLOX [40], RTMDet [41], and YOLOv8 [29]. While maintaining the advantages of the YOLO series, these works have introduced new techniques and optimization methods that significantly improve detection accuracy and speed, keeping the YOLO series at the forefront of real-time detection. RetinaNet [42] effectively alleviates the common category-imbalance issue in object detection by incorporating the focal loss function. Focal loss reduces the weight of easily classified samples, allowing the model to pay more attention to difficult samples.
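The down-weighting of easy samples described above can be illustrated with a minimal sketch of the binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The values alpha = 0.25 and gamma = 2 are illustrative assumptions and are not settings reported in this paper; the function names are hypothetical.

```python
import math

def cross_entropy(p: float, y: int) -> float:
    """Standard binary cross-entropy for a single prediction."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive class, in (0, 1).
    y: ground-truth label, 1 (object) or 0 (background).
    alpha, gamma: illustrative values (assumption, not from the text).
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t)^gamma shrinks towards 0 for well-classified samples.
    return alpha_t * (1.0 - p_t) ** gamma * cross_entropy(p, y)

easy = focal_loss(0.95, 1)  # confident, correct prediction
hard = focal_loss(0.10, 1)  # confident, wrong prediction
```

Relative to cross-entropy, the easy sample is scaled down far more strongly than the hard one, which is the effect the paragraph attributes to focal loss.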

Anchor-free detection algorithms, represented by CornerNet [43], CenterNet [44], and FCOS [45], further simplify the model structure. CornerNet achieves object localization by detecting the top-left and the bottom-right key points of the object bounding box [43]. CenterNet directly predicts the center point of the object along with its width and height offsets [44]. FCOS adopts a pixel-by-pixel prediction approach, calculating the distance from each pixel to the four sides of the object bounding box and introducing a centerness score to suppress low-quality prediction boxes [45]. FCOS provides a simple and efficient detection framework.
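As a rough illustration of the FCOS-style per-pixel targets described above, the following sketch computes the four side distances for a pixel inside a box together with the centerness score, sqrt(min(l, r)/max(l, r) * min(t, b)/max(t, b)). The function names and the (x1, y1, x2, y2) box layout are assumptions made for this example.

```python
import math

def side_distances(px, py, box):
    """Distances from pixel (px, py) to the left, top, right, and bottom
    sides of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return px - x1, py - y1, x2 - px, y2 - py

def centerness(l, t, r, b):
    """Centerness score in (0, 1]; equals 1 at the box centre and decays
    towards the box border. All four distances must be positive."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

box = (0.0, 0.0, 10.0, 20.0)
center_score = centerness(*side_distances(5.0, 10.0, box))  # pixel at the centre
edge_score = centerness(*side_distances(1.0, 10.0, box))    # pixel near the left side
```

Pixels far from the centre receive a low score, so multiplying it into the classification score suppresses their predictions.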

RepPoints [46] proposed an object representation method based on representative points. It uses a set of learnable points for more precise positioning and better feature extraction of objects. In this way, RepPoints breaks through the limitations of traditional bounding box representations to provide more flexible representation for object detection tasks.

Recently, the DETR [30] series methods have emerged. DETR [30] transforms the object detection problem into a sequence prediction task. It globally encodes the input image through a transformer [47], then decodes the transformer features to predict the object boxes. This end-to-end process avoids the complex design of anchors and postprocessing steps. Follow-up studies have proposed various DETR variants to further improve performance. Deformable-DETR [48] introduces a deformable attention module, allowing the model to focus more efficiently on key feature points in the object area. Group-DETR [49] proposes a grouped one-to-many label assignment strategy that reduces training costs. DINO [50] utilizes contrastive denoising training and a hybrid query selection method for anchor initialization. RT-DETR [31] combines the key technologies of the DETR series and incorporates a minimum uncertainty query selection strategy. It surpasses many YOLO series models of the same period in terms of accuracy and speed, demonstrating the great potential of transformers in the field of object detection.

3. Review of Rotated Object Detection

In recent years, increasing demand for applications such as remote sensing image analysis [12] and scene text detection [51] has led to the emergence of Rotated Object Detection (ROD) methods. These methods utilize oriented bounding boxes to delineate the contours of objects. Compared to traditional object detection methods, these ROD methods offer advantages such as more precise object localization, less background noise, richer object information, and better scene adaptability.

ROD methods can be classified into two categories based on how they represent objects: bounding-box-based methods, which use either rotated rectangular boxes or quadrilateral bounding boxes, and keypoint-set methods [52]. The three representations are described below.

(1) As shown in Figure 5 (left), a rotated rectangular box represents an object by adding a rotation angle to the normal bounding box. Depending on the range of values for the rotation angle, rotated rectangular box methods can be further subdivided into the OpenCV representation method, the long-side definition method, and more.

(2) As Figure 5 (middle) shows, quadrilateral bounding box methods directly list the coordinates of the four vertices of the oriented bounding box, typically represented as

$$\mathrm{bbox} = [(x_{1}, y_{1}), (x_{2}, y_{2}), (x_{3}, y_{3}), (x_{4}, y_{4})], \tag{1}$$

where $(x_{i}, y_{i})$, $i = 1, 2, 3, 4$, denote the coordinates of the four vertices of the rotated rectangle.

(3) As shown in Figure 5 (right), there are also methods that represent objects without relying on bounding boxes. Instead, they characterize objects through a set of keypoints or sampled points that capture the underlying geometric structure and key semantic features of the object. This approach provides greater flexibility in adapting to the detection of rotated objects.
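The relation between the first two representations can be sketched in a few lines. The example below converts a rotated rectangle (cx, cy, w, h, theta) into its four vertices, and normalises a box so that w is the long side. The angle range [-pi/2, pi/2) used for the long-side form is an assumption chosen for illustration, since exact ranges differ between toolkits, and the function names are hypothetical.

```python
import math

def rbox_to_quad(cx, cy, w, h, theta):
    """Four vertices of a rotated rectangle. theta is in radians and
    rotates the x-axis towards the y-axis; in image coordinates
    (y pointing down) this appears clockwise on screen."""
    c, s = math.cos(theta), math.sin(theta)
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in corners]

def to_long_side(cx, cy, w, h, theta):
    """Re-express the same rectangle with w as the long side and
    theta wrapped into [-pi/2, pi/2) (assumed range)."""
    if w < h:
        w, h = h, w
        theta += math.pi / 2  # swapping the sides rotates the frame by 90 degrees
    theta = (theta + math.pi / 2) % math.pi - math.pi / 2
    return cx, cy, w, h, theta

def same_vertices(quad_a, quad_b, tol=1e-9):
    """True if both quads contain the same vertices, in any order."""
    return all(any(math.dist(p, q) < tol for q in quad_b) for p in quad_a)

box = (5.0, 5.0, 2.0, 6.0, 0.3)
normalised = to_long_side(*box)
```

Both parameterisations describe the same rectangle, so `same_vertices(rbox_to_quad(*box), rbox_to_quad(*normalised))` holds; only the vertex order and the (w, h, theta) values differ, which is why a fixed angle convention matters when regressing these parameters.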

3.1. Methods Based on Rotated Rectangular Boxes

These methods are typically built on traditional horizontal object detectors. RoI Transformer [53] applies spatial transformations to Regions of Interest (RoIs) and learns the transformation parameters under the supervision of oriented bounding box (OBB) annotations. CSL [54] transforms the angle prediction problem into a classification task and designs circular smooth labels, significantly enhancing the classification fault tolerance between adjacent angles and effectively mitigating the boundary discontinuity issue in angle prediction. KLD [55] models the rotated object bounding box as a 2D Gaussian distribution, optimizing the bounding box parameters by calculating the KL divergence between Gaussian distributions and achieving joint optimization of the shape, size, and direction of rotated objects. Similarly, GWD [56] converts the rotated object bounding box into a 2D Gaussian distribution, then uses the Wasserstein distance to calculate the loss; this provides a more unified representation of the object’s geometric characteristics. R³Det [57] includes a feature refinement module that improves detection performance by obtaining more accurate features. The key idea of the feature refinement module is to re-encode the position information of the current refined bounding box to the corresponding feature points through pixel-wise feature interpolation, helping to realize feature reconstruction and alignment. To deal with the problem of misalignment between anchor boxes and axis-aligned convolutional features, S²A-Net [2] proposed a Feature Alignment Module (FAM) and Orientation Detection Module (ODM). The FAM generates high-quality anchors and adaptively aligns the convolutional features according to the anchor boxes through an alignment convolution. 
The ODM first encodes the orientation information, then produces orientation-sensitive and orientation-invariant features to alleviate the inconsistency between classification score and localization accuracy. ReDet [58] incorporates rotation-equivariant networks into the detector in order to extract rotation-equivariant features, from which the orientation can be accurately predicted. Based on these rotation-equivariant features, ReDet incorporates Rotation-invariant RoI Align (RiRoI Align), which adaptively extracts rotation-invariant features from equivariant features according to the orientation of the RoI. KFIoU [59] significantly improves detection accuracy by incorporating a loss function based on Gaussian modeling and Kalman filtering, effectively approximating SkewIoU through the center point loss and a distance-independent term without introducing additional hyperparameters. SASM [60] introduced a shape-adaptive selection and measurement strategy that dynamically selects samples based on the object’s shape information and feature distribution, allowing it to assess the quality of positive samples. PSC [61] maps different periods of rotational cycles to different frequencies of phases, providing a unified framework for addressing periodic ambiguity in rotated object detection.
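The Gaussian modeling shared by KLD and GWD can be sketched as follows. A rotated box (cx, cy, w, h, theta) is mapped to a 2D Gaussian with mean (cx, cy) and covariance R diag(w²/4, h²/4) Rᵀ, and the squared 2-Wasserstein distance between two such Gaussians is computed with a closed form that is valid for 2x2 covariance matrices. This is only a sketch of the underlying distance: the published losses apply further nonlinear transforms and normalisation that are omitted here, and the function names are hypothetical.

```python
import math

def rbox_to_gaussian(cx, cy, w, h, theta):
    """Mean and covariance entries (sxx, sxy, syy) of the Gaussian
    associated with a rotated box; theta in radians."""
    a, b = w * w / 4.0, h * h / 4.0
    c, s = math.cos(theta), math.sin(theta)
    sxx = a * c * c + b * s * s
    sxy = (a - b) * c * s
    syy = a * s * s + b * c * c
    return (cx, cy), (sxx, sxy, syy)

def gwd_squared(box1, box2):
    """Squared 2-Wasserstein distance between the Gaussians of two boxes."""
    (m1x, m1y), (a1, b1, d1) = rbox_to_gaussian(*box1)
    (m2x, m2y), (a2, b2, d2) = rbox_to_gaussian(*box2)
    center_term = (m1x - m2x) ** 2 + (m1y - m2y) ** 2
    trace_prod = a1 * a2 + 2.0 * b1 * b2 + d1 * d2          # Tr(S1 S2)
    det_prod = (a1 * d1 - b1 * b1) * (a2 * d2 - b2 * b2)    # det(S1) det(S2)
    # For 2x2 PSD matrices, Tr(sqrt(M)) = sqrt(Tr(M) + 2 sqrt(det(M))).
    cross = math.sqrt(max(trace_prod + 2.0 * math.sqrt(max(det_prod, 0.0)), 0.0))
    return center_term + (a1 + d1) + (a2 + d2) - 2.0 * cross

same = gwd_squared((0, 0, 4, 2, 0.3), (0, 0, 4, 2, 0.3 + math.pi))  # periodic in pi
shifted = gwd_squared((0, 0, 4, 2, 0.3), (3, 4, 4, 2, 0.3))         # pure translation
```

Because the covariance is unchanged when theta is shifted by pi, or when w and h are swapped together with a 90 degree rotation, the representation avoids the boundary discontinuity of direct angle regression. A square box yields the same Gaussian at every angle, so its orientation cannot be recovered from this representation alone.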

3.2. Methods Based on Quadrilateral Bounding Boxes

Gliding Vertex [62] first predicts an initial horizontal bounding box, then “glides” the vertices of the horizontal bounding box to the corresponding vertices of the oriented bounding box by predicting offsets, thereby obtaining the precise location of the rotated object. RSDet [63] utilizes a cross-product-based vertex sorting algorithm that ensures a consistent ordering of the four vertices, thereby avoiding detection errors caused by vertex order confusion. Additionally, RSDet incorporates a special loss calculation method that effectively mitigates the problem of loss discontinuity caused by angle periodicity.

3.3. Methods Based on Point Set Representation

CFA [64] represents the scope of object detection using convex hulls and adapts to different orientations and densely arranged object layouts by dynamically adjusting convex hull features, thereby improving the accuracy and robustness of object detection. Oriented RepPoints [65] converts the learned keypoint layout into an oriented bounding box through a direction transformation function; in addition, it utilizes an adaptive quality evaluation to learn the points and a sample allocation method to comprehensively measure the quality of the resulting point set. Point RCNN [66] employs a coarse-to-fine conversion approach driven by RepPoints [46] to generate precise Rotated Regions of Interest (RRoIs), then regresses and refines the corners of each RRoI to achieve accurate rotated object detection. Oriented DETR [67] introduces the “point-axis representation”, which separates the shape description and direction information of objects by using a point set to describe the spatial scope and an axis to define the direction. In addition, it incorporates the maximum projection loss and cross-axis loss to improve the accuracy of shape and direction prediction.

4. Proposed MVPOD Dataset

Numerous mature and excellent datasets have been built for use in the object detection field, as shown in Table 1. These datasets provide strong support for investigation of deep learning models. However, existing object detection datasets generally lack annotation for vertical object perspectives; thus, previous datasets cannot support research into the influence of vertical perspective variation on object detection algorithms. To fill this gap, we propose a novel object detection dataset called MVPOD.

4.1. Category Information

Selecting appropriate object categories is crucial for building a dataset. Considering the task requirements, several criteria are set for choosing object categories: (1) the objects should be presented in at least two different vertical perspectives among the five options, namely, nadir-downward, oblique-downward, horizontal, oblique-upward, and nadir-upward; (2) the object categories should include objects of various sizes and shapes in order to enhance the model’s ability to detect objects at different scales; and (3) the object categories should be closely related to current technological development trends, especially in areas such as intelligent transportation, autonomous driving, UAV-based monitoring, and automated inspection.

Ultimately, eight categories of objects were selected to construct the MVPOD dataset: airplane, car, bus, truck, carrier, cargoship, warship, and bridge. The quantity information for different object categories is provided in Table 2. These categories not only exhibit significant vertical perspective variations but also cover a range of practical application scenarios.

4.2. Data Collection

To ensure the diversity and generalization capability of the dataset, images were collected from satellite imagery, UAV photography, and handheld device captures, representing observation perspectives from spaceborne, airborne, and ground-based platforms. The specific acquisition methods included selection from the HRRSD dataset [18], imaging with a DJI UAV (model: DJI-MINI3-PRO), and collection from Google’s website (https://www.google.com/imghp (accessed on 25 May 2025)). Figure 6 shows images acquired from the different platforms along with the categories of the objects contained in the images.

To ensure data quality and consistency, the original images were rigorously screened, cropped, and padded as needed, resulting in 10,470 images. After preprocessing, all images in MVPOD were limited to a specific size range while ensuring that object positions in the images are completely random. This enhances the dataset’s diversity and generalization ability to a certain extent, providing more reliable and effective data support for subsequent model training and testing.

4.3. Annotation Types

In the MVPOD dataset, various types of annotations are provided for each object: horizontal bounding box label, oriented bounding box label, and vertical perspective classification label. The open-source tool LabelImg [75] was adopted for horizontal bounding box annotations, while roLabelImg [76] was used for oriented bounding box annotations. The quantities of different category objects are shown in Figure 7.

For vertical perspective classification annotations, perspectives are divided into five types: nadir-downward, oblique-downward, horizontal, oblique-upward, and nadir-upward. Based on practical need, attention is focused on the first three perspectives: nadir-downward, oblique-downward, and horizontal. The vertical perspective annotation quantities for each object category are shown in Table 2. As can be seen, no category other than airplane contains oblique-upward or nadir-upward instances, which is consistent with the situation in reality. The car and bridge categories do not include images with horizontal perspective due to their shape and size.

By annotating objects with horizontal bounding boxes, oriented bounding boxes, and vertical perspectives, the proposed MVPOD dataset enables both normal object detection and rotated object detection, thereby supporting experimental research about the impact of vertical perspective variation on object detection performance.
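To make the three annotation types concrete, the sketch below shows one possible in-memory record for a single object and a helper that groups objects by vertical perspective. The schema is hypothetical: the field names, the (x_min, y_min, x_max, y_max) and (cx, cy, w, h, angle) layouts, and the helper are assumptions for illustration and do not describe the file format actually released with MVPOD.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple

class Perspective(Enum):
    """The five vertical perspective labels named in the text."""
    NADIR_DOWNWARD = "nadir-downward"
    OBLIQUE_DOWNWARD = "oblique-downward"
    HORIZONTAL = "horizontal"
    OBLIQUE_UPWARD = "oblique-upward"
    NADIR_UPWARD = "nadir-upward"

@dataclass
class ObjectAnnotation:
    category: str                                   # e.g. "airplane", "car"
    hbb: Tuple[float, float, float, float]          # assumed (x_min, y_min, x_max, y_max)
    obb: Tuple[float, float, float, float, float]   # assumed (cx, cy, w, h, angle)
    perspective: Perspective

def group_by_perspective(
    annotations: List[ObjectAnnotation],
) -> Dict[Perspective, List[ObjectAnnotation]]:
    """Bucket object annotations by their vertical perspective label."""
    groups: Dict[Perspective, List[ObjectAnnotation]] = {p: [] for p in Perspective}
    for ann in annotations:
        groups[ann.perspective].append(ann)
    return groups

example = [
    ObjectAnnotation("airplane", (10, 10, 60, 40), (35, 25, 50, 30, 0.0), Perspective.NADIR_DOWNWARD),
    ObjectAnnotation("bus", (5, 5, 45, 25), (25, 15, 40, 20, 0.2), Perspective.OBLIQUE_DOWNWARD),
    ObjectAnnotation("airplane", (0, 0, 80, 30), (40, 15, 80, 30, 0.0), Perspective.HORIZONTAL),
]
groups = group_by_perspective(example)
```

Grouping of this kind is what allows perspective-specific subsets to be formed, although whether such subsets are built per object or per image is not specified here.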

4.4. Dataset Characteristics

As shown in Table 1, when compared with other object detection datasets, our MVPOD dataset has several notable characteristics and advantages:

(1) Annotation Diversity: MVPOD innovatively annotates vertical perspective information for objects, providing support for future research. It also includes both horizontal and oriented bounding box annotations. Horizontal bounding box annotations are suitable for most general object detection tasks, while oriented bounding box annotations are better suited for direction-sensitive objects.

(2) Data Diversity: The MVPOD dataset comprises spaceborne platform images, airborne platform images, and ground-based platform images. These diverse imaging platforms cover different imaging perspectives, providing rich scene information for object recognition tasks. This rich information can aid in studying the performance differences of objects under varying spatial resolutions and observation perspectives, enhancing the model’s generalization ability to adapt to complex real-world scenarios.

(3) High Inter-class Similarity and Intra-class Diversity: The object categories in the MVPOD dataset mainly include transportation vehicles and ships. There is high semantic overlap between categories such as car, bus, and truck, all of which are transportation vehicles, as well as carrier, cargoship, and warship, which are all ships. When searching for images containing relevant objects, considerable effort was expended to ensure diversity, including different perspectives, shapes, colors, etc. For instance, the airplane category includes both ground-level and remote-sensing images, covering various airplane models such as the C919 and Boeing 747.

5. Experiments and Analysis

Representative normal object detection methods and rotated object detection methods were selected to conduct benchmark testing on the dataset, providing an overview of state-of-the-art performance for future research. Additionally, to analyze the influence of vertical perspective variations on object detection performance, the MVPOD dataset was divided into three sub-datasets with different vertical perspectives in order to conduct contrast analysis experiments.

5.1. Implementation Details

The dataset was split into training, validation, and test sets in a ratio of 8:1:1. The performance results were obtained on the test set. All object detection methods were trained on an NVIDIA GeForce RTX 4060 Ti GPU. Normal object detection experiments were conducted using the deep learning frameworks MMDetection [77] and Ultralytics [78]. Rotated object detection experiments were performed using the deep learning frameworks MMRotate [79] and JDet [80]. For a fair comparison, the experiments used backbone networks with similar parameter counts. Moreover, the settings for hyperparameters, loss functions, and data augmentation were kept the same as those in MMDetection [77], Ultralytics [78], MMRotate [79], and JDet [80].
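An 8:1:1 split of this kind can be sketched as follows. The random shuffling, the fixed seed, and the function name are assumptions made for illustration; the text does not state how images were assigned to the three sets.

```python
import random

def split_dataset(image_ids, ratios=(8, 1, 1), seed=0):
    """Shuffle image ids and split them into train/val/test by ratio.
    Any remainder from integer division goes to the test set."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility (assumption)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

# 10,470 is the image count reported for MVPOD.
train, val, test = split_dataset(range(10470))
```

With 10,470 images this gives 8376 training, 1047 validation, and 1047 test images.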

5.2. Evaluation Metrics

In model evaluation, metrics spanning multiple dimensions were considered in order to fully assess model performance. Specifically, the following four main evaluation metrics were adopted: Params, GFLOPs, FPS, and AP.

Params (Parameters) is used to quantify the complexity of the model, reflecting its storage and computing resource requirements. A smaller number of parameters generally means higher computational efficiency and lower storage requirements, which is particularly important in resource-constrained environments.

GFLOPs (giga floating-point operations) measures the computational complexity of a model, representing the number of floating-point operations, in billions, required for one forward pass. It should not be confused with GFLOPS, a hardware throughput figure expressed in operations per second. A lower GFLOPs value indicates lower computational cost, which is crucial for real-time applications and edge computing scenarios.
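As a rough guide to how the two complexity figures arise, the sketch below counts the parameters of a single 2D convolution layer and the floating-point operations it needs for one forward pass. Counting one multiply-accumulate as two operations is an assumption, since some profiling tools report multiply-accumulates directly; the layer sizes and function names are illustrative only.

```python
def conv2d_params(k: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Learnable parameters of a k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def conv2d_macs(k: int, c_in: int, c_out: int, h_out: int, w_out: int) -> int:
    """Multiply-accumulate operations for one forward pass: every output
    element needs k * k * c_in multiply-accumulates."""
    return k * k * c_in * c_out * h_out * w_out

def macs_to_gflops(macs: int) -> float:
    """Convert to GFLOPs, counting 1 MAC as 2 FLOPs (assumed convention)."""
    return 2 * macs / 1e9

# Illustrative layer: 3x3 convolution, 64 -> 128 channels, 56 x 56 output.
params = conv2d_params(3, 64, 128)
macs = conv2d_macs(3, 64, 128, 56, 56)
gflops = macs_to_gflops(macs)
```

Summing these per-layer counts over a network gives totals of the kind reported as Params and GFLOPs. The operation count depends on the input resolution, whereas the parameter count does not.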

FPS (Frames Per Second) is an intuitive metric that measures the processing speed of the model, and is especially suitable for real-time performance application scenarios such as autonomous driving and video monitoring. A high FPS value indicates that the model can process input data faster, providing a better user experience.
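One common way to obtain an FPS figure is to time repeated inference calls after a short warm-up, as in the minimal sketch below. The `dummy_detector` function is a stand-in for a real model and the warm-up length is an arbitrary choice; neither reflects the measurement procedure used for the reported results.

```python
import time

def measure_fps(infer, inputs, warmup: int = 5) -> float:
    """Average frames per second of `infer` over `inputs`.
    A few warm-up calls are made first and excluded from timing."""
    for x in inputs[:warmup]:
        infer(x)
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed

def dummy_detector(image):
    """Hypothetical placeholder for a detector's forward pass."""
    return sum(image)

frames = [list(range(1000)) for _ in range(200)]
fps = measure_fps(dummy_detector, frames)
```

For GPU inference the device must be synchronised before each clock reading, otherwise asynchronous kernel launches make the measured time too short.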

AP (Average Precision) is the core evaluation metric for object detection tasks. It considers both Precision and Recall and evaluates the performance of the model under different IoU (Intersection over Union) thresholds. A high AP value means that the model detects objects with high accuracy and completeness and can more reliably identify objects in images. Other metrics, such as AP50, AP75, AP95, and mAP, follow the definitions used for the COCO dataset [24].
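The interaction between IoU thresholds and AP can be illustrated with a small single-class, single-image sketch. Detections are matched greedily to ground-truth boxes in descending score order, and AP is taken as the area under the monotone precision-recall envelope. This all-point interpolation is a simplification chosen for brevity: the COCO protocol samples precision at fixed recall levels and aggregates over images, categories, and IoU thresholds. Function names and the example boxes are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(dets, gts, iou_thr=0.5):
    """dets: list of (score, box); gts: list of boxes."""
    if not gts:
        return 0.0
    matched, flags = set(), []
    for _, box in sorted(dets, key=lambda d: -d[0]):
        best_j, best_iou = -1, iou_thr
        for j, g in enumerate(gts):
            v = iou(box, g)
            if j not in matched and v >= best_iou:
                best_j, best_iou = j, v
        flags.append(best_j >= 0)        # True positive if a free GT matched
        if best_j >= 0:
            matched.add(best_j)
    precisions, recalls, tp = [], [], 0
    for k, is_tp in enumerate(flags, start=1):
        tp += is_tp
        precisions.append(tp / k)
        recalls.append(tp / len(gts))
    for k in range(len(precisions) - 2, -1, -1):   # monotone envelope
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 29))]
ap50 = average_precision(dets, gts, iou_thr=0.5)
ap95 = average_precision(dets, gts, iou_thr=0.95)
```

The third detection overlaps its ground truth with an IoU of 0.9, so it counts as a true positive at a threshold of 0.5 but as a false positive at 0.95, which lowers AP at the stricter threshold.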

In summary, the four evaluation metrics of Params, GFLOPs, FPS, and AP together provide a comprehensive evaluation of an object detection model, covering model complexity, computational cost, inference speed, and average precision.

5.3. Object Detection Benchmark

5.3.1. Experimental Results

For normal object detection, sixteen methods introduced in Section 2 were selected for our experiments. Specifically, these included five two-stage object detection methods (Faster R-CNN, Cascade R-CNN, Sparse R-CNN, OLCN, and RFLA) and eleven one-stage object detection methods (YOLOv5, YOLOv8, RTMDet, Drone-YOLO, FFCA-YOLO, FBRT-YOLO, ESG-TOD, LTDNet, DINO, RT-DETR, and UAV-DETR).

Table 3 presents the quantitative results of the normal object detection experiments. The table lists the number of parameters (Params), computational complexity (GFLOPs), inference speed (FPS), and detection accuracy (AP) of each method as evaluated on the test set. AP includes the mean Average Precision (mAP), AP at an IoU threshold of 0.5 (AP50), AP at an IoU threshold of 0.75 (AP75), and mAP for each category. According to the results in the table, the two-stage detection methods demonstrate strong performance in terms of accuracy. Among them, OLCN achieves the best overall results, with an mAP of 0.809, AP50 of 0.952, and AP75 of 0.902, outperforming other models across most categories (airplane, car, bus, truck, and carrier). This indicates its superior capability in high-precision localization and complex object recognition. RFLA’s overall performance is comparable to that of OLCN, but its FPS is slightly lower (only 14). Cascade R-CNN also maintains high accuracy at higher IoU thresholds (AP75 exceeding 0.90), though its performance declines slightly in challenging categories such as cargoship, warship, and bridge. Sparse R-CNN exhibits slightly lower mAP but offers advantages in inference speed and computational complexity. Among the one-stage detection methods, Drone-YOLO, FBRT-YOLO, and RTMDet exhibit superior overall performance. In particular, Drone-YOLO achieves the highest mAP (0.869) along with leading performance across multiple categories such as airplane, car, truck, and carrier, illustrating its strong adaptability to diverse object types. FBRT-YOLO attains the best results in AP50 (0.962), AP75 (0.923), and several categories, including bus and bridge, reflecting its enhanced feature representation capability. Drone-YOLO achieves state-of-the-art performance on vehicle-related categories, notably car (0.927), bus (0.963), and truck (0.935), highlighting its effectiveness in fine-grained vehicle detection. 
RTMDet also delivers competitive performance, demonstrating robustness across multiple categories. Furthermore, transformer-based models such as DINO and UAV-DETR achieve remarkable accuracy on categories such as cargoship and airplane, underscoring the advantage of transformer architectures in capturing long-range dependencies. Overall, Drone-YOLO and FBRT-YOLO emerge as the most balanced detectors across diverse categories, while RTMDet and DINO reveal distinctive strengths in ship-centric detection tasks.
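To make the IoU thresholds behind AP50 and AP75 concrete, the following minimal sketch computes the IoU of two axis-aligned boxes and checks whether a prediction would count as a match at a given threshold. The function names and the `(x1, y1, x2, y2)` box layout are illustrative assumptions, and the threshold list assumes the COCO convention of averaging over IoU values from 0.50 to 0.95; this is not the evaluation code used for Table 3.

```python
from typing import List, Sequence


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so that disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Sequence[float], gt: Sequence[float], iou_threshold: float) -> bool:
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return box_iou(pred, gt) >= iou_threshold


# Assumption: COCO-style thresholds 0.50, 0.55, ..., 0.95 for the averaged metric.
COCO_IOU_THRESHOLDS: List[float] = [0.5 + 0.05 * i for i in range(10)]

pred_box = (0.0, 0.0, 10.0, 10.0)
gt_box = (2.0, 0.0, 12.0, 10.0)
iou = box_iou(pred_box, gt_box)  # intersection 80, union 120
```

In this example the prediction overlaps the ground truth with an IoU of 2/3, so it counts as a match at the AP50 threshold but not at the stricter AP75 threshold, which is why AP75 is the more demanding localization measure.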

5.3.2. Visualization

Figure 8 compares the normal object detection methods across several metrics. Figure 8a shows the distribution of methods in terms of detection accuracy (AP) and inference speed (FPS). As can be seen from the figure, YOLOv5 performs well in inference speed, whereas two-stage and transformer-based methods usually incur large computational overhead. Overall, FBRT-YOLO, YOLOv8, and FFCA-YOLO achieve good tradeoffs between accuracy and speed. Figure 8b further shows that Drone-YOLO, FBRT-YOLO, YOLOv8, and RTMDet achieve high accuracy with few parameters.
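One way to formalize a "good tradeoff" between accuracy and speed is Pareto dominance: a method is on the Pareto front if no other method is at least as fast and at least as accurate while being strictly better in one of the two. The sketch below illustrates this idea. The function name and the method names and values in the example are hypothetical placeholders, not results from Table 3.

```python
from typing import Dict, List, Tuple


def pareto_front(methods: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of methods not dominated in (FPS, AP); higher is better for both."""
    front = []
    for name, (fps, ap) in methods.items():
        dominated = any(
            o_fps >= fps and o_ap >= ap and (o_fps > fps or o_ap > ap)
            for other, (o_fps, o_ap) in methods.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical placeholder values (FPS, AP), chosen only to exercise the function.
placeholder_methods = {
    "method_a": (100.0, 0.70),
    "method_b": (50.0, 0.80),
    "method_c": (40.0, 0.75),
    "method_d": (100.0, 0.65),
}
front = pareto_front(placeholder_methods)
```

Here `method_c` is dominated by `method_b` and `method_d` by `method_a`, so only `method_a` and `method_b` remain on the front. The same check applies to the parameter-count comparison if the parameter count is negated so that higher is better.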

Figure 8c presents a comparison of the accuracy of various methods across different categories. Overall, these methods perform better on road vehicles (car, bus, truck) than on various types of ships (carrier, cargoship, warship). The detection precision for the bridge category is the lowest among all categories. This is probably because road vehicles typically have distinct appearance features and structural differences, making it easier for the model to identify them. In contrast, there are considerable similarities in appearance among ship categories, especially when they are in similar environments (such as on the sea), making it more difficult to distinguish between them. As for bridges, their large quadrilateral structures occupy a significant portion of the image, which tends to cause the model to make repeated detections.

In Appendix A, Figure A1, we visualize some predictions of these object detection methods on the MVPOD validation set.

5.4. Rotated Object Detection Benchmark

5.4.1. Experimental Results

For rotated object detection, we chose thirteen methods introduced in Section 3 for our experiments. These included nine methods based on rotated rectangular boxes (RoI Transformer, KLD, GWD, R³Det, S²A-Net, ReDet, SASM, KFIoU, and PSC), two methods based on quadrilateral bounding boxes (Gliding Vertex and RSDet), and two methods based on point set representation (CFA and Oriented RepPoints).
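The relationship between the rotated rectangular box and the quadrilateral representation can be illustrated with a short sketch that converts a box given as center, size, and angle into its four corner points. The function names are illustrative, and the angle convention (radians, measured from the x-axis in standard mathematical coordinates) is an assumption; detection toolkits differ in their angle definitions, so this is not tied to any particular framework.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotated_box_to_corners(cx: float, cy: float, w: float, h: float,
                           angle_rad: float) -> List[Point]:
    """Convert (cx, cy, w, h, angle) to four corner points of the rotated rectangle."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    # Corner offsets of the unrotated box, listed in a consistent order around the box.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each offset about the center, then translate by the center.
    return [(cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
            for dx, dy in offsets]


def polygon_area(points: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


corners = rotated_box_to_corners(5.0, 3.0, 4.0, 2.0, math.pi / 6)
area = polygon_area(corners)  # rotation preserves area, so this equals w * h
```

The conversion is one-directional in general: every rotated rectangle is a quadrilateral, but an arbitrary quadrilateral or point set has more degrees of freedom than the five parameters of a rotated rectangle.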

Table 4 shows the quantitative results of the rotated object detection experiments. Among the rotated rectangular box methods, R³Det, S²A-Net, and PSC achieve competitive performance. R³Det attains an AP50 of 0.862 and demonstrates superior performance on categories such as bus, truck, and carrier, indicating that its multi-stage regression and feature refinement are effective for rotated object localization. S²A-Net achieves the highest accuracy on the carrier category (0.929), suggesting that its angle-sensitive attention mechanism offers advantages for high aspect ratio object detection. PSC exhibits stable performance across most categories (AP50 = 0.764), reflecting a balanced detection capability. In contrast, methods such as KLD and GWD show lower accuracy in certain categories (e.g., cargoship, warship, and bridge), likely due to the limited adaptability of their loss functions to complex object orientations. Among the quadrilateral bounding box methods, RSDet achieves the best results with an AP50 of 0.773, significantly outperforming Gliding Vertex on the bus (0.949) and truck (0.915) categories, which validates the effectiveness of its multi-resolution regression strategy. Among the point set representation methods, Oriented RepPoints achieves superior accuracy (AP50 = 0.823) compared to CFA across most categories, particularly on the car, carrier, and bridge categories. This indicates that learnable regularized point sets provide strong adaptability for detecting objects with complex poses and deformations. Overall, the rotated rectangular box methods demonstrate stable general accuracy, while the quadrilateral bounding box and point set representation methods offer stronger detection advantages for specific shapes and orientations, making them well-suited for specialized detection scenarios.

5.4.2. Visualization

Figure 9 shows comparison charts of the different rotated object detection methods in terms of various metrics.

Figure 9a,b show a comprehensive comparison of the different methods in terms of their detection accuracy, inference speed, and number of parameters. As shown in the figure, ReDet achieves the highest AP50, but its inference speed is relatively low; on the other hand, KFIoU, Oriented RepPoints, and CFA achieve a better balance between accuracy, speed, and number of parameters.

It can be observed from Figure 9c that most methods perform well in categories such as airplane, car, and bus, with AP50 maintained at a high level (close to 0.9). However, detection accuracy drops significantly on the cargoship, warship, and bridge categories. ReDet performs the best overall, maintaining high AP in multiple categories and strong stability. Other methods, such as Oriented RepPoints, CFA, and KFIoU, also maintain good accuracy in most categories. Overall, the detection of some categories still faces great challenges, especially in the bridge and warship categories, where accuracy is generally low.

We visualize some predictions of these rotated object detection methods on the MVPOD validation set in Appendix A, Figure A2.

5.5. Vertical Perspective Contrast Experiment

Next, we divided the MVPOD dataset based on vertical perspectives, obtaining a nadir-downward sub-dataset with 4576 images, an oblique-downward sub-dataset with 3443 images, and a horizontal sub-dataset with 2226 images. Then, we adopted a representative one-stage object detection method (YOLOv5) and two-stage object detection method (Faster R-CNN) for training and testing on each sub-dataset. The experiments consisted of two main parts: first, evaluating the model accuracy of YOLOv5 and Faster R-CNN on each sub-dataset, and second, exploring the impact of vertical perspective variations on model performance by applying models trained on specific sub-datasets to other perspective sub-datasets using a cross-testing approach. Detailed experimental results are provided in Table 5.
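The cross-testing protocol described above can be summarized as a three-by-three train/test matrix. The sketch below shows one way to organize it; `train_fn` and `eval_fn` are hypothetical callables standing in for the actual training and evaluation pipelines, which are not specified here, and the function name is illustrative.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

PERSPECTIVES = ("nadir-downward", "oblique-downward", "horizontal")


def cross_perspective_matrix(
    train_fn: Callable[[str], Any],
    eval_fn: Callable[[Any, str], float],
    perspectives: Sequence[str] = PERSPECTIVES,
) -> Dict[Tuple[str, str], float]:
    """Train one model per perspective and evaluate it on every perspective.

    Keys are (train_perspective, test_perspective). Diagonal entries are the
    same-perspective results; off-diagonal entries are the cross-tests.
    """
    results: Dict[Tuple[str, str], float] = {}
    for train_view in perspectives:
        model = train_fn(train_view)  # hypothetical: train on one sub-dataset
        for test_view in perspectives:
            results[(train_view, test_view)] = eval_fn(model, test_view)
    return results


# Stub callables used only to show the shape of the output.
def stub_train(view: str) -> str:
    return view


def stub_eval(model: str, view: str) -> float:
    return 1.0 if model == view else 0.0


matrix = cross_perspective_matrix(stub_train, stub_eval)
```

Each model is trained once and reused for all three test sub-datasets, so the protocol requires three training runs and nine evaluations per detector.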

The experimental results show that the performance of YOLOv5 and Faster R-CNN on different perspective sub-datasets is very similar. The following analysis takes YOLOv5 as an example.

When each model is trained and tested on the same perspective, the overall accuracy (ALL) on the nadir-downward, oblique-downward, and horizontal sub-datasets is 0.689, 0.755, and 0.793, respectively, demonstrating good adaptability. Specifically, on the nadir-downward sub-dataset, the detection accuracy for the airplane and car categories is high (0.858 and 0.857, respectively); on the oblique-downward sub-dataset, the accuracy on the car and bus categories is high (0.929 and 0.846, respectively); and on the horizontal sub-dataset, the accuracy on the bus and truck categories is high (0.882 and 0.885, respectively).

However, when the models are tested on cross-perspective datasets, their performance drops significantly. For instance, the overall accuracy of the nadir-downward trained model on the oblique-downward and horizontal sub-datasets decreases to 0.391 and 0.0368, respectively. The accuracy of the oblique-downward trained model on the nadir-downward and horizontal sub-datasets is 0.475 and 0.556, respectively. The accuracy of the horizontal trained model on the nadir-downward and oblique-downward sub-datasets is 0.00392 and 0.295, respectively. These results demonstrate the significant impact of vertical perspective variations on the performance of object detection models.
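The size of these drops can be expressed relative to each model's same-perspective accuracy. The short calculation below uses only the YOLOv5 overall accuracy values quoted in the text; the relative-drop measure and the variable names are an illustrative choice rather than a metric defined in this paper.

```python
# Same-perspective overall accuracy of YOLOv5, as quoted in the text.
SAME_VIEW = {
    "nadir-downward": 0.689,
    "oblique-downward": 0.755,
    "horizontal": 0.793,
}

# Cross-perspective overall accuracy, keyed by (train perspective, test perspective).
CROSS_VIEW = {
    ("nadir-downward", "oblique-downward"): 0.391,
    ("nadir-downward", "horizontal"): 0.0368,
    ("oblique-downward", "nadir-downward"): 0.475,
    ("oblique-downward", "horizontal"): 0.556,
    ("horizontal", "nadir-downward"): 0.00392,
    ("horizontal", "oblique-downward"): 0.295,
}


def relative_drop(same: float, cross: float) -> float:
    """Fraction of same-perspective accuracy lost when testing on another perspective."""
    return (same - cross) / same


drops = {
    pair: relative_drop(SAME_VIEW[pair[0]], accuracy)
    for pair, accuracy in CROSS_VIEW.items()
}
largest = max(drops, key=drops.get)
smallest = min(drops, key=drops.get)
```

Under this measure, the nadir-downward model loses about 43% of its accuracy on oblique-downward images and roughly 95% on horizontal images, while the horizontal model loses over 99% on nadir-downward images. The smallest relative drop, about 26%, occurs for the oblique-downward model tested on horizontal images.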

Figure 10 illustrates the impact of perspective differences on model accuracy. When the training set and test set have the same perspective category (i.e., the three line graphs on the main diagonal), YOLOv5 and Faster R-CNN both show high accuracy, demonstrating the robustness of these two methods under different vertical perspectives. When the training set and the test set have different perspective categories, however, the model accuracy is significantly reduced. Particularly when the perspective difference is large (for example, when the training set is nadir-downward and the test set is horizontal, or vice versa), the model accuracy is very poor. Therefore, we can conclude that variation in the object's vertical perspective has a great impact on the accuracy of object detection, and that vertical perspective information should be explicitly taken into account.

6. Future Work

Our experimental results clearly show that detection accuracy drops substantially when cross-testing is performed across different vertical perspectives (nadir-downward, oblique-downward, and horizontal). This finding identifies an important blind spot in current object detection research. Building on the foundation of this study, we envision future work in the following directions:

(1) Developing perspective-robust feature representations and model architectures: The core contribution of this work is the discovery and verification of the above problem, while the primary task for future research is to address it. Conventional CNNs are not explicitly designed to account for the geometric and morphological distortions introduced by variations in perspective angle. Therefore, developing novel architectures with perspective-invariant or perspective-adaptive capabilities will be crucial.

(2) Cross-view domain adaptation and generalization: In practical applications, there may be abundant annotated data from a particular perspective (e.g., satellite nadir perspective), whereas data from other perspectives (e.g., UAV oblique perspective or ground-level horizontal perspective) can be extremely limited. In such cases, it becomes essential to investigate how to transfer models from a data-rich source domain to a data-scarce target domain in a robust manner.

(3) Collaborative detection through multi-perspective information fusion: In many real-world scenarios, imagery of the same region can be captured simultaneously from multiple platforms (satellite, aerial, and ground). Effectively fusing these multi-source and multi-perspective data to achieve more accurate and robust detection than single-perspective approaches represents a promising research direction. This requires addressing a series of challenges, including cross-platform image registration, target association, and multimodal feature fusion.

In summary, overcoming the performance bottlenecks in object detection caused by vertical perspective variations is a challenging yet highly impactful research topic.

7. Conclusions

This paper investigates the impact of vertical perspective on the performance of object detection methods, motivated by the practical applications of object detection. First, we review the latest research progress in both horizontal and rotated object detection within the deep learning framework. Then, a benchmark dataset annotated with vertical perspective information is constructed to facilitate the study of object detection from various vertical viewpoints. Several representative horizontal and rotated object detection algorithms are evaluated on this dataset. Additionally, the dataset is divided into multiple sub-datasets based on different vertical perspectives to facilitate cross-experimental performance comparisons using representative object detection methods. The experimental results demonstrate the robustness of these methods under different vertical perspectives as well as the impact of vertical perspective variations on the performance of object detection methods. Finally, we outline directions for future research. We hope that the MVPOD dataset and the baseline benchmarks established in this work can serve as a solid starting point for the research community, inspiring innovative studies that drive the advancement of object detection technologies in complex real-world scenarios.

Author Contributions

All authors contributed to this manuscript. Conceptualization, H.J.; writing original draft, J.C.; review and editing, Y.Z.; experimental results analysis, H.S. and B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China under Grants 62201472 and 62272383.

Data Availability Statement

Experimental data is contained within the article. Our proposed MVPOD dataset is publicly available and can be downloaded at https://github.com/iskiku/MVPOD (accessed on 25 August 2025).

Acknowledgments

The authors would like to thank the editors and the anonymous reviewers for their comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure A1. Visualization results for normal object detection methods.

Figure A2. Visualization results for rotated object detection methods.

References

Lang, C.; Cheng, G.; Wu, J.; Li, Z.; Xie, X.; Li, J.; Han, J. Toward Open-World Remote Sensing Imagery Interpretation: Past, present, and future. IEEE Geosci. Remote Sens. Mag. 2024, 2–38. [Google Scholar] [CrossRef]
Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602511. [Google Scholar] [CrossRef]
Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2021, 10, 91–124. [Google Scholar] [CrossRef]
Ma, X.; Ouyang, W.; Simonelli, A.; Ricci, E. 3d object detection from images for autonomous driving: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3537–3556. [Google Scholar] [CrossRef] [PubMed]
Cheng, G.; Han, J.; Zhou, P.; Xu, D. Learning rotation-invariant and fisher discriminative convolutional neural networks for object detection. IEEE Trans. Image Process. 2018, 28, 265–278. [Google Scholar] [CrossRef]
Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
Zhao, Z.Q.; Zheng, P.; Xu, S.t.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef]
Faraji, H.; Chen, B. Drone-yolo: Improved yolo for small object detection in uav. In Proceedings of the 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 27–29 July 2023; pp. 93–100. [Google Scholar]
Zhang, H.; Liu, K.; Gan, Z.; Zhu, G.N. UAV-DETR: Efficient End-to-End Object Detection for Unmanned Aerial Vehicle Imagery. arXiv 2025, arXiv:2501.01855. [Google Scholar]
Li, X.; Diao, W.; Mao, Y.; Li, X.; Sun, X. SCLNet: A Scale-Robust Complementary Learning Network for Object Detection in UAV Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5651119. [Google Scholar] [CrossRef]
Wen, L.; Cheng, Y.; Fang, Y.; Li, X. A comprehensive survey of oriented object detection in remote sensing images. Expert Syst. Appl. 2023, 224, 119960. [Google Scholar] [CrossRef]
Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
Zhang, S.; Long, J.; Xu, Y.; Mei, S. PMHO: Point-Supervised Oriented Object Detection Based on Segmentation-Driven Proposal Generation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5638118. [Google Scholar] [CrossRef]
Li, Z.; Hou, B.; Wu, Z.; Ren, B.; Ren, Z.; Jiao, L. Gaussian Synthesis for High-Precision Location in Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5619612. [Google Scholar] [CrossRef]
Zhou, J.; Li, W.; Cao, Y.; Cai, H.; Huang, T.; Xia, G.S.; Li, X. Few-Shot Oriented Object Detection in Remote Sensing Images via Memorable Contrastive Learning. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5630814. [Google Scholar] [CrossRef]
Zhou, S.; Liu, Z.; Luo, H.; Qi, G.; Liu, Y.; Zuo, H.; Zhang, J.; Wei, Y. GCA2Net: Global-Consolidation and Angle-Adaptive Network for Oriented Object Detection in Aerial Imagery. Remote Sens. 2025, 17, 1077. [Google Scholar] [CrossRef]
Wang, X.; Han, C.; Huang, L.; Nie, T.; Liu, X.; Liu, H.; Li, M. AG-Yolo: Attention-Guided Yolo for Efficient Remote Sensing Oriented Object Detection. Remote Sens. 2025, 17, 1027. [Google Scholar] [CrossRef]
Zhang, Y.; Yuan, Y.; Feng, Y.; Lu, X. Hierarchical and Robust Convolutional Neural Network for Very High-Resolution Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5535–5548. [Google Scholar] [CrossRef]
Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate Object Localization in Remote Sensing Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
Qu, S.; Dang, C.; Chen, W.; Liu, Y. SMA-YOLO: An Improved YOLOv8 Algorithm Based on Parameter-Free Attention Mechanism and Multi-Scale Feature Fusion for Small Object Detection in UAV Images. Remote Sens. 2025, 17, 2421. [Google Scholar] [CrossRef]
Mao, Y.; Zhang, H.; Li, R.; Zhu, F.; Sun, R.; Ji, P. HSF-DETR: Hyper Scale Fusion Detection Transformer for Multi-Perspective UAV Object Detection. Remote Sens. 2025, 17, 1997. [Google Scholar] [CrossRef]
Chen, Y.; Ye, Z.; Sun, H.; Gong, T.; Xiong, S.; Lu, X. Global–Local Fusion With Semantic Information Guidance for Accurate Small Object Detection in UAV Aerial Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4701115. [Google Scholar] [CrossRef]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
Ren, S. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef] [PubMed]
Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
Jocher, G. YOLOv5 by Ultralytics. 2020. Available online: https://docs.ultralytics.com/models/yolov5/ (accessed on 15 June 2025).
Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO. 2023. Available online: https://docs.ultralytics.com/zh/models/yolov8/ (accessed on 16 June 2025).
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 16965–16974. [Google Scholar]
Xiao, Y.; Xu, T.; Xin, Y.; Li, J. FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Singapore, 20–27 January 2025; Volume 39, pp. 8673–8681. [Google Scholar]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
Girshick, R. Fast r-cnn. arXiv 2015, arXiv:1504.08083. [Google Scholar] [CrossRef]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 29–29 October 2017; pp. 2961–2969. [Google Scholar]
Zhang, H.; Chang, H.; Ma, B.; Wang, N.; Chen, X. Dynamic R-CNN: Towards high quality object detection via dynamic training. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 260–275. [Google Scholar]
Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar]
Redmon, J. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An Empirical Study of Designing Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar] [CrossRef]
Lin, T. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002. [Google Scholar]
Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 765–781. [Google Scholar]
Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. arXiv 2019, arXiv:1904.01355. [Google Scholar] [CrossRef]
Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. Reppoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9657–9666. [Google Scholar]
Vaswani, A. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2010, arXiv:2010.04159. [Google Scholar]
Chen, Q.; Chen, X.; Wang, J.; Zhang, S.; Yao, K.; Feng, H.; Han, J.; Ding, E.; Zeng, G.; Wang, J. Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023. [Google Scholar]
Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122. [Google Scholar] [CrossRef]
Wang, K.; Wang, Z.; Li, Z.; Su, A.; Teng, X.; Liu, M.; Yu, Q. Oriented object detection in optical remote sensing images using deep learning: A survey. arXiv 2023, arXiv:2302.10473. [Google Scholar] [CrossRef]
Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 2849–2858. [Google Scholar]
Yang, X.; Yan, J. Arbitrary-oriented object detection with circular smooth label. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 677–694. [Google Scholar]
Yang, X.; Yang, X.; Yang, J.; Ming, Q.; Wang, W.; Tian, Q.; Yan, J. Learning High-Precision Bounding Box for Rotated Object Detection via Kullback-Leibler Divergence. Adv. Neural Inf. Process. Syst. 2021, 34, 18381–18394. [Google Scholar]
Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with gaussian wasserstein distance loss. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11830–11841. [Google Scholar]
Yang, X.; Yan, J.; Feng, Z.; He, T. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
Han, J.; Ding, J.; Xue, N.; Xia, G.S. Redet: A rotation-equivariant detector for aerial object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2786–2795. [Google Scholar]
Yang, X.; Zhou, Y.; Zhang, G.; Yang, J.; Wang, W.; Yan, J.; Zhang, X.; Tian, Q. The KFIoU loss for rotated object detection. arXiv 2022, arXiv:2201.12558. [Google Scholar]
Hou, L.; Lu, K.; Xue, J.; Li, Y. Shape-Adaptive Selection and Measurement for Oriented Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022. [Google Scholar]
Yu, Y.; Da, F. Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S.; Bai, X. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 1452–1459. [Google Scholar] [CrossRef] [PubMed]
Qian, W.; Yang, X.; Peng, S.; Yan, J.; Guo, Y. Learning modulated loss for rotated object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; Volume 35, pp. 2458–2466. [Google Scholar]
Guo, Z.; Liu, C.; Zhang, X.; Jiao, J.; Ji, X.; Ye, Q. Beyond Bounding-Box: Convex-hull Feature Adaptation for Oriented and Densely Packed Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented reppoints for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1829–1838. [Google Scholar]
Zhou, Q.; Yu, C. Point rcnn: An angle-free framework for rotated object detection. Remote Sens. 2022, 14, 2605. [Google Scholar] [CrossRef]
Zhao, Z.; Xue, Q.; He, Y.; Bai, Y.; Wei, X.; Gong, Y. Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation. arXiv 2024, arXiv:2407.08489. [Google Scholar] [CrossRef]
Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional neural network. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 3735–3739. [Google Scholar]
Liu, Z.; Wang, H.; Weng, L.; Yang, Y. Ship Rotated Bounding Box Space for Ship Extraction From High-Resolution Optical Satellite Images With Complex Backgrounds. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1074–1078. [Google Scholar] [CrossRef]
Lam, D.; Kuzma, R.; McGee, K.; Dooley, S.; Laielli, M.; Klaric, M.; Bulatov, Y.; McCord, B. xview: Objects in context in overhead imagery. arXiv 2018, arXiv:1802.07856. [Google Scholar] [CrossRef]
Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges. arXiv 2021, arXiv:2102.12219. [Google Scholar] [CrossRef]
Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399. [Google Scholar] [CrossRef]
Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
Tzutalin. LabelImg. 2015. Available online: https://github.com/tzutalin/labelImg (accessed on 30 May 2025).
cgvict. roLabelImg. 2020. Available online: https://github.com/cgvict/roLabelImg (accessed on 30 May 2025).
Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar] [CrossRef]
glenn jocher, G.J. Ultralytics. 2022. Available online: https://github.com/ultralytics/ultralytics (accessed on 18 June 2025).
Zhou, Y.; Yang, X.; Zhang, G.; Wang, J.; Liu, Y.; Hou, L.; Jiang, X.; Liu, X.; Yan, J.; Lyu, C.; et al. MMRotate: A Rotated Object Detection Benchmark using PyTorch. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 7331–7334. [Google Scholar]
Hu, S.M.; Liang, D.; Yang, G.Y.; Yang, G.W.; Zhou, W.Y. Jittor: A novel deep learning framework with meta-operators and unified graph execution. Sci. China Inf. Sci. 2020, 63, 222103. [Google Scholar] [CrossRef]
Yuan, Y.; Zhang, Y. OLCN: An Optimized Low Coupling Network for Small Objects Detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8022005. [Google Scholar] [CrossRef]
Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. RFLA: Gaussian receptive field based label assignment for tiny object detection. In Proceedings of the European Conference on Computer Vision. Springer, Tel Aviv, Israel, 23–27 October 2022; pp. 526–543. [Google Scholar]
Zhang, Z. Drone-YOLO: An Efficient Neural Network Method for Target Detection in Drone Images. Drones 2023, 7, 526. [Google Scholar] [CrossRef]
Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215. [Google Scholar] [CrossRef]
Liu, D.; Zhang, J.; Qi, Y.; Wu, Y.; Zhang, Y. A Tiny Object Detection Method Based on Explicit Semantic Guidance for Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6005405. [Google Scholar] [CrossRef]
Liu, D.; Zhang, J.; Qi, Y.; Xi, Y.; Jin, J. Exploring Lightweight Structures for Tiny Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5623215. [Google Scholar] [CrossRef]

Figure 1. Perspective variations from different imaging platforms.

Figure 2. Comparison of images taken at different platforms. (a) Spaceborne nadir image. (b) Airborne oblique image. (c) Ground-based horizontal image.

Figure 3. The development journey of normal object detection based on deep learning.

Figure 4. The overall architecture of one-stage and two-stage object detection methods.

Figure 5. Different object representations in rotated object detection methods (left: rotated rectangular box; middle: quadrilateral bounding box; right: point set).

Figure 6. Example images from the MVPOD dataset. The dataset contains images taken from spaceborne, airborne, and ground-based platforms. Object categories from spaceborne platform images include airplane, carrier, warship, cargoship, and bridge. Object categories from airborne platform images include car, bus, and truck. Object categories from ground-based platform images include airplane, bus, truck, carrier, cargoship, and warship.

Figure 7. Number of objects in each category.

Figure 8. Comparison of different normal object detection methods in terms of speed, accuracy, and parameters. (a) Comparison of FPS and mAP. The horizontal axis represents FPS and the vertical axis represents mAP; points closer to the top-right corner correspond to methods with better performance. (b) Comparison of Params and mAP. The horizontal axis represents Params; points closer to the top-left corner correspond to methods with better performance. (c) Comparison of AP in each category. C1: airplane, C2: car, C3: bus, C4: truck, C5: carrier, C6: cargoship, C7: warship, and C8: bridge.

Figure 9. Comparison of different rotated object detection methods in terms of inference speed, detection accuracy, and parameters. (a) Comparison of FPS and AP50; points closer to the top-right corner correspond to methods with better performance. (b) Comparison of Params and AP50; points closer to the top-left corner correspond to methods with better performance. (c) Comparison of AP50 in each category. C1: airplane, C2: car, C3: bus, C4: truck, C5: carrier, C6: cargoship, C7: warship, and C8: bridge.

Figure 10. Detection accuracy of YOLOv5 and Faster R-CNN in the cross-perspective comparison experiments. (a) Results for YOLOv5. (b) Results for Faster R-CNN. The results on the main diagonal demonstrate that YOLOv5 and Faster R-CNN exhibit good robustness under different vertical perspectives, while the off-diagonal results indicate that vertical perspective variations have an impact on their performance. C1: airplane, C2: car, C3: bus, C4: truck, C5: carrier, C6: cargoship, C7: warship, and C8: bridge.

Table 1. Comparison between the proposed MVPOD dataset and other popular object detection datasets. Our MVPOD dataset provides not only horizontal box and rotated bounding box annotations but also innovative vertical perspective annotations.

Platform	Dataset	Categories	Images	Instances	Bounding Box Type	Vertical Perspective	Year
Spaceborne	NWPU VHR-10 [68]	10	800	3775	horizontal	NO	2014
Spaceborne	UCAS-AOD [69]	2	910	6029	horizontal	NO	2015
Spaceborne	HRSC2016 [70]	1	1070	2976	oriented	NO	2016
Spaceborne	RSOD [20]	4	976	6950	horizontal	NO	2017
Spaceborne	HRRSD [18]	13	21,761	55,740	horizontal	NO	2017
Spaceborne	DOTA [12]	15	2806	188,282	oriented	NO	2017
Spaceborne	DIOR [19]	20	23,463	192,472	horizontal	NO	2018
Spaceborne	xView [71]	60	1127	1,000,000	horizontal	NO	2021
Spaceborne	ODAI [72]	18	11,268	1,793,658	oriented	NO	2021
Airborne	VisDrone2021 [73]	10	10,209	54,200	horizontal	NO	2021
Airborne	UAVDT [74]	4	77,819	835,879	horizontal	NO	2018
Ground-based	COCO [24]	80	123,287	886,266	horizontal	NO	2014
Ground-based	VOC [25]	20	21,503	52,090	horizontal	NO	2012
Multi-Platform	MVPOD	8	10,470	15,467	horizontal & oriented	YES	2025

Table 2. The number of objects in each vertical perspective category. In the MVPOD dataset, only the airplane category includes the oblique-upward and nadir-upward perspectives, which aligns with common sense. Additionally, considering the shape and size of the objects, the car and bridge categories contain no horizontal-perspective instances.

Vertical Perspective	Airplane	Car	Bus	Truck	Carrier	Cargoship	Warship	Bridge
nadir-downward	1489	292	25	30	185	1867	1455	1663
oblique-downward	288	3380	408	801	413	291	187	202
horizontal	165	0	693	671	106	319	308	0
oblique-upward	206	0	0	0	0	0	0	0
nadir-upward	23	0	0	0	0	0	0	0
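
As a quick consistency check, the counts in Table 2 can be totaled per perspective and per category and compared with the instance count reported for MVPOD in Table 1. The Python sketch below is illustrative only: the variable names are ours, and the numbers are copied directly from Table 2.

```python
# Object counts copied from Table 2.
# Rows: vertical perspective; columns: object category (in table order).
CATEGORIES = ["airplane", "car", "bus", "truck",
              "carrier", "cargoship", "warship", "bridge"]
COUNTS = {
    "nadir-downward":   [1489, 292, 25, 30, 185, 1867, 1455, 1663],
    "oblique-downward": [288, 3380, 408, 801, 413, 291, 187, 202],
    "horizontal":       [165, 0, 693, 671, 106, 319, 308, 0],
    "oblique-upward":   [206, 0, 0, 0, 0, 0, 0, 0],
    "nadir-upward":     [23, 0, 0, 0, 0, 0, 0, 0],
}

# Row sums: number of objects seen from each vertical perspective.
per_perspective = {name: sum(row) for name, row in COUNTS.items()}

# Column sums: number of objects in each category across all perspectives.
per_category = {
    cat: sum(row[i] for row in COUNTS.values())
    for i, cat in enumerate(CATEGORIES)
}

total = sum(per_perspective.values())

for name, n in per_perspective.items():
    print(f"{name:17s} {n:6d}  ({n / total:.1%})")
print("total instances:", total)
```

The total comes to 15,467, which agrees with the number of instances listed for MVPOD in Table 1.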

Table 3. Normal object detection performance results. The results with red color denote the best results, while those with blue color present the second-best and third-best results in each column. C1: airplane, C2: car, C3: bus, C4: truck, C5: carrier, C6: cargoship, C7: warship, C8: bridge. The precision of each category is defined by mAP.

Model	Backbone	Epochs	Par/Mb	GFLOPs	FPS	mAP	AP₅₀	AP₇₅	C1	C2	C3	C4	C5	C6	C7	C8
Two-Stage Object Detection Methods
Faster R-CNN [26]	R50	12	41	188	20	0.791	0.947	0.885	0.884	0.917	0.837	0.859	0.839	0.760	0.731	0.538
Casc. R-CNN [27]	R50	12	69	216	12	0.837	0.957	0.917	0.870	0.945	0.921	0.902	0.872	0.825	0.781	0.580
Sparse R-CNN [38]	R50	12	106	135	25	0.768	0.910	0.842	0.789	0.843	0.869	0.859	0.871	0.788	0.627	0.494
OLCN [81]	R50	12	55	199	20	0.809	0.952	0.902	0.860	0.927	0.846	0.874	0.865	0.806	0.755	0.539
RFLA [82]	R50	12	69	196	14	0.808	0.945	0.892	0.849	0.918	0.855	0.871	0.887	0.801	0.737	0.545
One-Stage Object Detection Methods
YOLOv5s [28]	CSPDark53	200	7	23	87	0.821	0.954	0.902	0.873	0.946	0.875	0.895	0.890	0.784	0.749	0.560
YOLOv8s [29]	CSPDark53	200	10	16	58	0.826	0.960	0.919	0.871	0.934	0.881	0.909	0.879	0.783	0.761	0.588
RTMDET [41]	CSPNeXt	200	9	15	37	0.864	0.959	0.920	0.913	0.941	0.925	0.930	0.923	0.851	0.832	0.602
Drone-YOLO [83]	Darknet53	200	11	37	39	0.869	0.961	0.920	0.927	0.963	0.935	0.935	0.931	0.849	0.824	0.584
FFCA-YOLO [84]	CSPDark53	200	5	37	86	0.785	0.953	0.915	0.808	0.905	0.884	0.860	0.818	0.740	0.705	0.559
FBRT-YOLO [32]	CSPDark53	200	3	23	70	0.865	0.962	0.923	0.906	0.957	0.949	0.934	0.914	0.844	0.805	0.609
ESG-TOD [85]	R50	36	33	387	19	0.747	0.913	0.850	0.816	0.875	0.800	0.792	0.813	0.727	0.673	0.477
LTDNet [86]	RepVit	36	5	29	45	0.746	0.915	0.849	0.818	0.872	0.799	0.791	0.818	0.725	0.671	0.477
DINO [50]	R50	24	48	249	14	0.858	0.952	0.900	0.919	0.955	0.933	0.916	0.925	0.861	0.788	0.567
RT-DETR [31]	R50	200	42	125	31	0.840	0.938	0.894	0.912	0.934	0.894	0.920	0.914	0.798	0.770	0.577
UAV-DETR [9]	R50	200	34	103	35	0.850	0.938	0.899	0.919	0.959	0.918	0.927	0.924	0.775	0.781	0.598
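
The speed and accuracy trade-off visualized in Figure 8a (points nearer the top-right corner are better) can also be read off Table 3 programmatically. The sketch below is one possible way to do this, not part of the benchmark protocol: it selects the methods that no other method beats on both FPS and mAP. The function name `pareto_front` is ours, and the values are copied from the FPS and mAP columns of Table 3.

```python
# (model, FPS, mAP) copied from Table 3.
RESULTS = [
    ("Faster R-CNN", 20, 0.791), ("Casc. R-CNN", 12, 0.837),
    ("Sparse R-CNN", 25, 0.768), ("OLCN", 20, 0.809),
    ("RFLA", 14, 0.808), ("YOLOv5s", 87, 0.821),
    ("YOLOv8s", 58, 0.826), ("RTMDET", 37, 0.864),
    ("Drone-YOLO", 39, 0.869), ("FFCA-YOLO", 86, 0.785),
    ("FBRT-YOLO", 70, 0.865), ("ESG-TOD", 19, 0.747),
    ("LTDNet", 45, 0.746), ("DINO", 14, 0.858),
    ("RT-DETR", 31, 0.840), ("UAV-DETR", 35, 0.850),
]


def pareto_front(rows):
    """Return the rows that are not dominated in (FPS, mAP).

    A row is dominated if some other row is at least as good on both
    metrics and strictly better on at least one of them.
    """
    front = []
    for name, fps, m_ap in rows:
        dominated = any(
            (f >= fps and a >= m_ap) and (f > fps or a > m_ap)
            for _, f, a in rows
        )
        if not dominated:
            front.append((name, fps, m_ap))
    return sorted(front, key=lambda r: r[1])  # sort by FPS, ascending


front = pareto_front(RESULTS)
for name, fps, m_ap in front:
    print(f"{name:12s} FPS={fps:3d}  mAP={m_ap:.3f}")
```

On the Table 3 numbers, the non-dominated methods are Drone-YOLO, FBRT-YOLO, and YOLOv5s. Parameter count and GFLOPs are ignored here; adding them as further objectives would enlarge the front.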

Table 4. Rotated object detection performance results. The results with red color denote the best results, while those with blue color present the second-best and third-best results in each column. C1: airplane, C2: car, C3: bus, C4: truck, C5: carrier, C6: cargoship, C7: warship, C8: bridge. In the table, “-” indicates that no results were obtained. Each class performance is AP50.

Model	Backbone	Epochs	Par/Mb	GFLOPs	FPS	AP₅₀	C1	C2	C3	C4	C5	C6	C7	C8
Rotated Rectangular Box Methods
RoITransf. [53]	R50	12	55.3	-	13	0.853	0.909	0.908	0.895	0.876	0.882	0.807	0.794	0.757
KLD [55]	R50	12	36	213	19	0.763	0.906	0.896	0.864	0.827	0.834	0.635	0.536	0.605
GWD [56]	R50	12	36	213	20	0.762	0.905	0.897	0.869	0.824	0.836	0.616	0.546	0.603
R³Det [57]	R50	12	42	332	17	0.782	0.905	0.908	0.879	0.836	0.797	0.681	0.614	0.633
S²A-Net [2]	R50	12	36	197	20	0.845	0.907	0.908	0.893	0.868	0.929	0.775	0.768	0.713
ReDet [58]	Re50	12	32	-	12	0.857	0.908	0.908	0.899	0.863	0.893	0.785	0.798	0.799
SASM [60]	R50	12	37	194	18	0.755	0.893	0.895	0.737	0.652	0.806	0.621	0.692	0.740
KFIoU [59]	R50	12	36	213	20	0.816	0.904	0.898	0.883	0.861	0.878	0.760	0.688	0.654
PSC [61]	R50	12	36	215	19	0.764	0.906	0.908	0.857	0.837	0.848	0.616	0.545	0.597
Quadrilateral Bounding Box Methods
GlidVertex [62]	R50	12	41	-	12	0.716	0.889	0.800	0.876	0.745	0.729	0.661	0.545	0.486
RSDet [63]	R50	12	36	-	14	0.773	0.839	0.759	0.949	0.915	0.880	0.632	0.564	0.646
Point Set Representation Methods
CFA [64]	R50	12	37	194	17	0.797	0.906	0.906	0.861	0.822	0.870	0.714	0.669	0.627
OrRepPoints [65]	R50	12	37	194	18	0.823	0.899	0.905	0.868	0.807	0.847	0.791	0.729	0.735

Table 5. Experimental results using YOLOv5 and Faster R-CNN on each sub-dataset. The results with red color denote the best results of YOLOv5 on each test set. The results with blue color denote the best results of Faster R-CNN on each test set. The best accuracy is almost always obtained when the test set perspective category is the same as the training set. In this table, “-” means that there was no such object category in the MVPOD test set.

Test Sets	AP	Airplane	Car	Bus	Truck	Carrier	Cargoship	Warship	Bridge
YOLOv5
Training on the nadir-downward perspective dataset
nadir-downward	0.689	0.858	0.857	0.533	0.617	0.716	0.711	0.633	0.591
oblique-downward	0.391	0.765	0.77	0.328	0.261	0.111	0.266	0.0645	0.562
horizontal	0.037	0.006	-	0.052	0.109	0.007	0.040	0.008	-
Training on the oblique-downward perspective dataset
nadir-downward	0.475	0.742	0.919	0.501	0.779	0.1	0.289	0.065	0.405
oblique-downward	0.755	0.685	0.929	0.846	0.833	0.858	0.772	0.669	0.444
horizontal	0.556	0.378	-	0.515	0.493	0.756	0.587	0.61	-
Training on the horizontal perspective dataset
nadir-downward	0.00392	0.003	-	0.000	0.003	0.002	0.016	0.000	-
oblique-downward	0.295	0.047	-	0.035	0.039	0.634	0.455	0.560	-
horizontal	0.793	0.634	-	0.882	0.885	0.789	0.787	0.783	-
Faster R-CNN
Training on the nadir-downward perspective dataset
nadir-downward	0.644	0.854	0.895	0.279	0.513	0.672	0.723	0.684	0.536
oblique-downward	0.391	0.760	0.741	0.320	0.197	0.199	0.289	0.093	0.529
horizontal	0.052	0.005	-	0.065	0.105	0.008	0.133	0.002	-
Training on the oblique-downward perspective dataset
nadir-downward	0.484	0.708	0.924	0.684	0.746	0.102	0.307	0.070	0.333
oblique-downward	0.750	0.691	0.922	0.792	0.829	0.830	0.761	0.749	0.426
horizontal	0.621	0.484	-	0.566	0.536	0.837	0.635	0.668	-
Training on the horizontal perspective dataset
nadir-downward	0.007	0.005	-	0.001	0.019	0.001	0.002	0.001	-
oblique-downward	0.367	0.036	-	0.122	0.164	0.681	0.532	0.669	-
horizontal	0.816	0.648	-	0.893	0.874	0.858	0.821	0.800	-
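
To summarize Table 5 in a single pair of numbers per detector, the overall AP values can be arranged as a train-perspective by test-perspective matrix and the matched (main diagonal) entries compared with the mismatched (off-diagonal) ones. The following Python sketch is a simple illustration under that framing; the helper name `matched_vs_mismatched` is ours, and the values are copied from the AP column of Table 5.

```python
PERSPECTIVES = ["nadir-downward", "oblique-downward", "horizontal"]

# AP[model][i][j]: trained on PERSPECTIVES[i], tested on PERSPECTIVES[j].
# Values copied from the AP column of Table 5.
AP = {
    "YOLOv5": [
        [0.689, 0.391, 0.037],
        [0.475, 0.755, 0.556],
        [0.00392, 0.295, 0.793],
    ],
    "Faster R-CNN": [
        [0.644, 0.391, 0.052],
        [0.484, 0.750, 0.621],
        [0.007, 0.367, 0.816],
    ],
}


def matched_vs_mismatched(matrix):
    """Return (mean diagonal AP, mean off-diagonal AP) of a square matrix."""
    n = len(matrix)
    matched = [matrix[i][i] for i in range(n)]
    mismatched = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(matched) / len(matched), sum(mismatched) / len(mismatched)


def best_training_perspective(matrix, test_index):
    """Index of the training perspective with the highest AP on one test set."""
    column = [row[test_index] for row in matrix]
    return column.index(max(column))


summary = {model: matched_vs_mismatched(m) for model, m in AP.items()}
for model, (matched, mismatched) in summary.items():
    print(f"{model:13s} matched={matched:.3f}  mismatched={mismatched:.3f}")
```

For overall AP, the best training perspective for each test set is the matching one for both detectors. The caption's "almost always" refers to the per-category columns, which this sketch does not cover.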

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Jin, H.; Chen, J.; Zhang, Y.; Su, H.; Wang, B. MVPOD: A Dataset and Benchmark for Multi-Vertical-Perspective Object Detection in Multi-Platform Remote Sensing Images. Remote Sens. 2025, 17, 3029. https://doi.org/10.3390/rs17173029

