1. Introduction
With the development of optical satellites, visible-light imaging has reached sub-meter resolution, providing a stronger foundation for the precise interpretation of aircraft. Increasingly large fine-grained datasets have been built, such as DOTA [
1], MAR20 [
2], and FAIR1M [
3]. High-resolution optical remote sensing images intuitively depict the appearance of aircraft targets. In recent years, deep learning algorithms, including detectors based on convolutional neural networks (CNNs) such as SSD [
4], YOLO series [
5,
6,
7,
8,
9], R-CNN [
10,
11,
12], as well as the Vision Transformer (ViT) and its variants [
13,
14,
15,
16], have been applied to aircraft detection and recognition tasks in optical imagery, and have achieved promising results.
However, aircraft follow similar design principles and therefore look alike, particularly aircraft of the same type or aircraft that perform similar missions. Target classification based purely on appearance features thus remains challenging, as shown in 
Figure 1. In contrast, human interpreters comprehensively use quantitative properties of the aircraft in addition to its appearance, which enables more refined identification. Consequently, researchers have proposed using structural information, keypoints, or templates for fine-grained aircraft recognition, which has improved performance in optical remote sensing applications, as shown in 
Figure 2.
To address challenges such as background noise, occlusion, and weather variations, researchers developed an aircraft identification algorithm employing Harris–Laplace corners, Zernike moments, and color-invariant moments [
17]. This work used corners as discriminative features for aircraft characterization, although traditional feature extraction methods of this kind are susceptible to data variations. Nevertheless, it established corners as viable distinguishing features for aircraft recognition in remote sensing imagery. During the same period, with the advancement of CNNs, keypoint-based facial recognition [
18,
19,
20,
21] achieved remarkable success, enabling both efficient and accurate facial recognition with recognition accuracy exceeding 
.
Aircraft recognition shares similar problem characteristics with facial recognition. Several deep learning-based approaches have incorporated keypoints as integral components of their aircraft recognition solutions since 2017. In 2017, researchers proposed an accurate and efficient landmark-based aircraft recognition method [
17]. They proposed an 8-point landmark design to characterize aircraft, coupled with a vanilla network architecture for comprehensive landmark regression to localize aircraft keypoints. Based on these network-extracted landmarks, they introduced a template vector-based matching algorithm to measure similarity between candidate aircraft and templates. In 2018, they further proposed an aircraft segmentation network to obtain refined segmentation results that provide critical details for distinguishing different aircraft types [
22]. By integrating keypoint results from another processing branch, they performed comprehensive matching with templates using Intersection over Union (IoU) as the similarity metric between segmentation results and reference templates. Similarly, a Conditional Generative Adversarial Networks-based recognition algorithm was proposed in [
23], which uses the keypoints as the condition of the generative adversarial network (GAN). In addition to keypoint extraction, a region-of-interest (ROI) feature extraction method was designed to extract multi-scale features from the GAN in the aircraft regions. A linear support vector machine (SVM) classifier then classified each sample using these features. In 2021, a novel aircraft detection and recognition framework integrating component-level analysis [
24] was proposed. The method comprises three core stages: data preprocessing, a foundational detection network, and dedicated structural component detection. Common yet distinguishable aircraft structures are explicitly detected as independent targets to support subsequent category inference. By leveraging component classification alongside structural component detection, the approach capitalizes on discriminative differences to enhance overall classification performance. In 2024, an integrated approach for aircraft model recognition was proposed, combining target segmentation and keypoint detection [
25]. The methodology integrates multi-task deep neural networks with conditional random fields and template matching algorithms. The implementation involves three core phases: multi-task feature extraction, geometric refinement, and template-based recognition.
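To make the IoU-based template matching mentioned above concrete, the following is a minimal sketch rather than the implementation used in the cited works. It assumes that the predicted segmentation mask and every template mask are binary arrays of the same shape that have already been aligned in pose and scale. The function names `mask_iou` and `best_template` are hypothetical.

```python
import numpy as np


def mask_iou(pred_mask, template_mask):
    """Intersection over Union of two binary masks with identical shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    tmpl = np.asarray(template_mask, dtype=bool)
    union = np.logical_or(pred, tmpl).sum()
    if union == 0:
        # Both masks are empty; treat the similarity as zero.
        return 0.0
    return float(np.logical_and(pred, tmpl).sum() / union)


def best_template(pred_mask, templates):
    """Return (name, iou) of the template mask with the highest IoU.

    `templates` maps a hypothetical aircraft type name to its binary mask.
    """
    scores = {name: mask_iou(pred_mask, m) for name, m in templates.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]
```

In practice the alignment step matters as much as the score itself, since IoU drops quickly under small rotation or scale errors; this is one reason the cited methods use keypoints to register the candidate before matching.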
Component recognition and part-to-whole reasoning methods offer another possible solution for aircraft recognition. The Aircraft Reasoning Network (ARNet) [
26], designed for aircraft detection and fine-grained recognition in remote sensing images (RSIs), incorporates prior knowledge employed in expert interpretation. With an Aircraft Component Discrimination Module (ACDM) that recognizes aircraft based on component features, classification performance is improved for both few-shot and easily confused categories. In 2024, a knowledge-driven deep learning method [
27], called the Explainable Aircraft Recognition Framework Based on Part Parsing Prior (APPEAR), was introduced. It explicitly models the rigid structure of an aircraft as a pixel-level part parsing prior, dividing it into the nose, left wing, right wing, fuselage, and tail. This fine-grained prior provides reliable part locations to delineate aircraft architecture and imposes spatial constraints among parts, effectively reducing the search space for model optimization while identifying subtle inter-class differences. Furthermore, the Multiscale Rotation Invariant Prototype Network (MSRIP-Net) [
28] simulates the intuitive human reasoning process of identifying objects by segmenting them into multiple components. It automatically recognizes rigid components of aircraft targets without relying on additional part annotations, using only image-level class labels. In [
29], the detector DFDet simultaneously focuses on mining contextual knowledge and mitigating angle sensitivity, constructing features containing multi-range contexts with low computational cost and aggregating them into a compact yet informative representation, thereby enhancing the model’s robust inference capabilities.
Recently, the advancement of DeepSeek-like algorithms [
30,
31,
32] has demonstrated that object recognition is fundamentally grounded in reasoning. Many tasks, especially fine-grained aircraft recognition, remain challenging for purely end-to-end approaches; such systems must therefore draw inspiration from human cognitive processes involving part-based recognition, keypoint detection, and advanced reasoning with domain knowledge. This necessitates systematic community efforts to analyze aircraft-specific characteristics and develop their practical applications.
Current methodologies primarily face the following limitations.
1. Ill-defined parts. Part-based approaches generalize poorly as aircraft designs evolve: blended wing–body and flying-wing configurations are increasingly replacing the traditional separation of fuselage and wing, making explicit part definitions increasingly untenable.
2. Underutilized priors. Existing keypoint extraction algorithms predominantly rely on regression or classification paradigms without incorporating domain-specific constraints from remote sensing data and aircraft properties. Prior knowledge regarding structural invariances and imaging geometry characteristics remains underutilized.
3. Insufficient fine-grained matching. Current aircraft keypoint matching algorithms neglect the impact of aircraft roll angles on the relative positional variations among keypoints, which compromises accuracy. This necessitates analyzing stable and representative keypoints to enhance matching precision.
4. Limitations of monolithic approaches. Many aircraft possess highly analogous platform designs, and keypoint extraction carries inherent errors; together, these factors make sole reliance on keypoint matching inadequate for precise target differentiation. This necessitates the development of more comprehensive low-dimensional features to enhance discriminability.
These limitations highlight the urgent demand for a unified interpretation framework that incorporates remote sensing imaging characteristics, reasoning-based cognitive requirements, and human-inspired cognition with both qualitative and quantitative analytical capabilities.
Keypoints serve as a fundamental set of descriptive features applicable to aircraft target recognition algorithms based on attributes such as components, colors, or aspect ratios. They also represent a universal depiction method that captures aircraft outlines regardless of configuration. However, keypoint design is susceptible to evolving aircraft configurations and cannot maintain fixed definitions like facial keypoints [
33]. Therefore, while keypoints effectively describe basic target features, further research on adaptive extraction algorithms remains essential.
In common keypoint detection and recognition algorithms, the task of predicting target keypoints is often treated as a regression task [
34,
35,
36,
37,
38]. This approach is versatile for applications like human pose estimation and gesture recognition where viewpoints vary significantly. However, considering optical remote sensing tasks where the observation perspective is predominantly nadir imaging, and the presence of roll angles generally does not disrupt the target’s relative structure, the prediction of keypoints for aircraft targets should additionally incorporate constraints related to topological structure. This can help reduce the occurrence of outliers during keypoint prediction and further improve keypoint detection accuracy.
Furthermore, deficiencies remain in the current field of optical remote sensing aircraft target recognition, although many publicly available datasets for aircraft detection, recognition, or fine-grained recognition have been published and some studies have approached the problem from a keypoint perspective. On the one hand, keypoint design often lacks systematic methodology. On the other hand, there remains a scarcity of publicly available optical remote sensing image datasets specifically annotated with aircraft keypoints.
Since 2017, several types of keypoint deployment schemes have been proposed in different works primarily for aircraft pose estimation, as shown in 
Figure 2. In [
25], the five keypoints defining aircraft geometry are sequentially designated as the aircraft nose, fuselage center, tail section, port-side wing extremity, and starboard-side wing extremity. This ordered set establishes the fundamental reference frame for aerodynamic analysis. Furthermore, eight landmarks for an aircraft are designed in [
22], numbered from 0 to 7 in an anticlockwise direction with the aircraft head designated as point 0. In [
23], a similar 8-point design scheme is utilized to generate polygon masks for aircraft targets. Additional keypoints generally improve matching accuracy.
In [
22], a vanilla network is proposed to detect keypoints by implicitly encoding geometric constraints among landmarks through simultaneous regression of all landmarks. The Euclidean distance between ground truth and predictions is normalized by wingspan to formulate the loss function. During inference, target crops are rotated three times, generating four aircraft crops with different poses. Final keypoint detection results are generated by averaging the four landmark sets. In [
23], a coarse segmentation network is proposed to segment aircraft from backgrounds. Furthermore, a fully connected CRF module is used to refine the coarse segmentation results. Finally, keypoints are extracted from the binary segmentation masks. In [
25], Mask R-CNN is used to construct a multi-task network with three branches: a detection head, a segmentation head, and a keypoint detection head. In the keypoint prediction branch, a one-hot binary mask of size 
 is treated as ground truth, with cross-entropy loss regulating the prediction. In summary, keypoint detection research for aircraft targets and the success of facial keypoint recognition demonstrate that aircraft keypoint detection is worth developing and represents an important solution.
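The wingspan-normalized regression loss described above for the landmark network can be sketched as follows. This is an illustrative reading under stated assumptions, not the cited authors' code: the error is averaged over landmarks (the source does not say whether a mean or a sum is used), the wingspan is taken as the distance between two ground-truth wingtip landmarks, and the default indices `left_tip=2` and `right_tip=6` are hypothetical placeholders for whichever landmarks mark the wingtips in a given scheme.

```python
import numpy as np


def wingspan_normalized_error(pred, gt, left_tip=2, right_tip=6):
    """Mean Euclidean landmark error divided by the ground-truth wingspan.

    pred, gt: arrays of shape (K, 2) holding (x, y) for K landmarks.
    left_tip, right_tip: assumed indices of the two wingtip landmarks.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    wingspan = np.linalg.norm(gt[left_tip] - gt[right_tip])
    if wingspan == 0:
        raise ValueError("wingtip landmarks coincide; wingspan is zero")
    # Per-landmark Euclidean distance between prediction and ground truth.
    per_point = np.linalg.norm(pred - gt, axis=1)
    return float(per_point.mean() / wingspan)
```

Dividing by the wingspan makes the error comparable across aircraft of different sizes and across images with different ground sampling distances, so a small aircraft is not penalized less simply because its pixel errors are smaller.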
In [
17], DT Nets are adopted to describe the distribution of feature points. By extracting multi-scale Harris–Laplace corners from the image, similarity is calculated based on triangle correspondences within the DT network. In [
25], based on keypoint detection results, the normalized sum of squared differences (SSD) matching method is adopted to calculate the similarity between mask templates and predictions. The differences between the candidate target and all templates in the template library are computed and compared, and the template with the best match is taken as the model of the aircraft under test.
In the field of face matching, a feature vector is constructed from the extracted keypoints. The Euclidean distance between this facial feature vector and each template feature vector is then used to measure their similarity.
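The keypoint-vector matching described above can be illustrated with a short sketch. It is a simplified example under stated assumptions rather than any cited method: keypoints are flattened into a vector after removing translation and scale (this normalization is an assumption added here), rotation is assumed to have been corrected beforehand, and the names `keypoint_feature` and `nearest_template` are hypothetical.

```python
import numpy as np


def keypoint_feature(keypoints):
    """Flatten (K, 2) keypoints into a translation- and scale-normalized vector."""
    pts = np.asarray(keypoints, dtype=float)
    centered = pts - pts.mean(axis=0)   # remove translation
    scale = np.linalg.norm(centered)    # overall size of the point set
    if scale == 0:
        raise ValueError("all keypoints coincide")
    return (centered / scale).ravel()


def nearest_template(keypoints, templates):
    """Return (name, distance) of the template closest in Euclidean distance.

    `templates` maps a hypothetical type name to its (K, 2) keypoint array,
    listed in the same keypoint order as the query.
    """
    query = keypoint_feature(keypoints)
    dists = {
        name: float(np.linalg.norm(query - keypoint_feature(t)))
        for name, t in templates.items()
    }
    name = min(dists, key=dists.get)
    return name, dists[name]
```

Because every keypoint contributes equally to the distance, a single badly localized keypoint can dominate the score, which is one motivation for selecting stable keypoints before matching.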
The nadir imaging perspective in remote sensing provides inherent stability for aircraft targets, analogous to facial recognition conditions. To enhance keypoint recognition accuracy, algorithms must be adaptively optimized to leverage remote sensing advantages while accommodating its constraints.
Operational factors—including orbital mechanics and emergency observation requirements—frequently produce non-zero side-looking angles. This oblique imaging perspective alters topological relationships between aircraft keypoints. Consequently, viewpoint-invariant keypoints must be identified that maintain consistent structural relationships across imaging conditions to ensure robust representation. Simultaneously, shared aircraft engineering principles create high visual similarity—especially in top-down views. Effective keypoint selection strategies must therefore identify distinctive keypoint groups that maximize inter-class differentiation.
In this manuscript, a novel comprehensive solution for aircraft keypoint detection in optical remote sensing imagery is proposed, as illustrated in 
Figure 3. The main contributions of this work are fourfold:
1. A large-scale aircraft keypoint dataset with thousands of aircraft is built, where 21 types of aircraft are carefully labeled. This provides a common foundation for all aircraft keypoint detection research in remote sensing.
2. Considering the characteristics of remote sensing imagery, a keypoint extraction algorithm for aircraft targets incorporating structural prior constraints is proposed, further enhancing the accuracy of aircraft keypoint extraction.
3. A simple yet effective selection method is proposed to identify representative and stable keypoints, laying a solid foundation for the design of target recognition algorithms.
4. Based on the keypoint set for aircraft targets, an algorithm measuring keypoint set consistency between candidates and templates is proposed, ensuring the precision of target matching and recognition by comprehensively utilizing point-to-point topological relationships.