LRF-SRNet: Large-Scale Super-Resolution Network for Estimating Aircraft Pose on the Airport Surface

The introduction of various deep neural network architectures has greatly advanced aircraft pose estimation using high-resolution images. However, real airport surface surveillance cameras typically capture low-resolution (LR) images because of long-range capture, and pose estimation results on such images are far from acceptably accurate. To fill this gap, we propose a novel end-to-end low-resolution aircraft pose estimation network (LRF-SRNet) to address the problem of estimating aircraft pose from poor-quality airport surface surveillance images. The method combines pose estimation with the super-resolution (SR) technique. Specifically, a super-resolution network (SRNet) is created to reconstruct high-resolution aircraft images. In addition, an essential component termed the large receptive field block (LRF block) helps estimate the aircraft’s pose: by broadening the neural network’s receptive field, it enables the network to perceive the aircraft’s overall structure. Experimental results demonstrate that, on the airport surface surveillance dataset, our method performs significantly better than the most widely used baseline methods, with AP exceeding Baseline and HRNet by 3.1% and 4.5%, respectively.


Introduction
Aircraft pose estimation is a fundamental and challenging vision task that is crucial for many downstream tasks, such as intelligent airport security monitoring [1], reducing aircraft collisions on the airport surface [2][3][4], assisting subsequent airport control decisions [5,6], and establishing digital twin airports [7,8].
The pose of an aircraft usually refers to its six degrees of freedom (6D): three translations (x, y, z) and three rotations (α, β, γ) around its three axes. However, the altitude (z), roll (α), and pitch (β) carry no information for an aircraft that is taxiing or parked on the airport surface, because their values are zero [9,10]. Thus, the 2D position (x, y) and yaw angle (γ) are used to describe the pose of the aircraft on the airport surface.
The majority of current methods [10][11][12] focus on precisely estimating the aircraft pose from high-resolution (HR) aircraft images (e.g., 512 × 384 or 256 × 192). Since their synthetic datasets cannot completely simulate realistic and complicated airport field environments, there is a large gap in research on estimating aircraft pose in low-resolution (LR) situations (e.g., 128 × 96), as shown in Figure 1a. Although high-resolution images provide more detail, in real airport scenes with long-distance capture, most aircraft images can only be acquired as poor-quality LR images, as shown in Figure 1b. Therefore, it is necessary to use super-resolution methods to help aircraft pose estimation.
The main purpose of the method is to estimate, from the image, the 2D locations (x, y) of the aircraft joints that make up the aircraft geometry, as illustrated in Figure 2b. The pose of the aircraft must match the geometry of the aircraft. As shown in Figure 2c, the final aircraft pose is represented by the aircraft skeleton, which is made up of the geometrical keypoints of the aircraft and their connections. The majority of deep neural network based pose estimation methods [13][14][15][16][17] currently generate Gaussian heatmaps to represent the positions of the actual keypoints, where each pixel in the heatmap indicates the probability of belonging to a particular type of keypoint. This is because it is currently difficult for these methods to directly regress the 2D positions of keypoints on the aircraft surface in the images. Because progressive downsampling inevitably loses resolution, the heatmap's resolution is typically lower than that of the input image. For a low-resolution input image, the generated heatmap may therefore be quite small (e.g., 32 × 24), which causes serious quantization problems when the aircraft keypoint heatmap is recovered to the same size as the input image. Precise aircraft keypoint localization is therefore essential.
In this work, we explore how to precisely estimate the 2D aircraft pose at low resolution. Super-resolution methods have recently been applied as preprocessing and have significantly improved downstream tasks, such as object detection [18][19][20]. Inspired by this, we extend the SR concept by designing an SRNet as an upstream subnetwork for aircraft pose estimation. Additionally, as the upsampling of SRNet may lead to local texture blurring of the aircraft, we propose a large receptive field block (LRF block) to expand the receptive field to cover the global features of the aircraft.
We assess our method using a challenging real-world airport surface surveillance dataset that includes images of parked aircraft, inbound and outgoing terminal traffic, and the runway being used for taxiing. We then compare the results to other state-of-the-art baseline methods.
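The quantization problem described above can be illustrated with a small sketch. It assumes a simplified decoding rule (take the heatmap cell index and multiply by the stride), which is not necessarily the decoding used in this work; the function name `max_quantization_error` is hypothetical.

```python
def max_quantization_error(input_width, heatmap_width):
    """Worst-case decoding error, in input pixels, along one axis.

    Assumption: a keypoint at x_true falls into heatmap cell floor(x_true / stride)
    and is decoded back as cell * stride (no sub-pixel refinement).
    """
    stride = input_width / heatmap_width
    worst = 0.0
    for x_true in range(input_width):
        cell = int(x_true // stride)      # heatmap cell that contains the keypoint
        x_decoded = cell * stride         # cell index mapped back to input pixels
        worst = max(worst, abs(x_true - x_decoded))
    return worst


# Same stride of 4, but the error is a larger fraction of a low-resolution image.
for width, heatmap in [(128, 32), (512, 128)]:
    err = max_quantization_error(width, heatmap)
    print(f"input {width}px, heatmap {heatmap}px: max error {err}px "
          f"({100 * err / width:.2f}% of image width)")
```

Under this assumption the absolute error is the same for both resolutions, but relative to the image (and hence to the aircraft) it is four times larger for the 128-pixel input, which is why low-resolution inputs are more sensitive to heatmap quantization.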

Figure 1. (a) Synthetic SR image; (b) realistic LR image.

We make the following contributions in this work:

1. The end-to-end low-resolution aircraft 2D pose estimation network LRF-SRNet is proposed, which combines SR methods with pose estimation methods to precisely estimate the pose of an aircraft from low-resolution images.

2. A large receptive field block (LRF block) is created as a core component to assist the network in extending its effective receptive field and identifying the overall characteristics of the aircraft.

3. The results of our experiments demonstrate that, when applied to the real-world airport surface surveillance dataset, our approach successfully helps the pose estimation network improve its performance.
The remainder of the paper is structured as follows. Section 2 reviews related work. The structure of our network model is presented in Section 3, and the experimental findings are presented in Section 4. Section 5 concludes the paper.

Related Work
Recent methods for a variety of aircraft vision tasks involving low-resolution images fall into two categories, depending on the approach: adding external super-resolution subnetworks to recover high-quality images, and adding internal multiscale receptive field modules to enhance the network's ability to extract aircraft features.

Super-Resolution-Based Method
Due to the limitations of the imaging equipment and less-than-ideal atmospheric conditions, among other factors, the surveillance video captured by digital imaging payloads in real-world situations is almost always blurry and degraded. As a result, there has been extensive research on high-resolution reconstruction from low-resolution imagery. He et al. [21] enhanced the spatial resolution of video taken by small drones using image fusion technology. Li et al. [22] suggested a technique for combining selected high-resolution multispectral remote sensing images to create low-resolution super-resolution (SR) virtual scenes. An efficient deep belief network (DBN) [23] has been used to detect tiny, blurred airplanes in complicated airport images by rebuilding high-resolution features from multiple input images, including grayscale images and two locally thresholded images. Tang et al. [24] proposed a joint super-resolution and aircraft recognition network (Joint-SRARNet) that enhances aircraft recognition performance by generating high-resolution aircraft images from low-resolution remote sensing images. However, aircraft pose estimation at low resolution remains understudied and requires further research.

Because convolutional receptive fields are well suited to sensing aircraft features, aircraft pose estimation based on deep convolutional neural networks has recently become a research hotspot in aviation. The network's ability to aggregate local features depends on the size and shape of the receptive field, which has a significant impact on the model's performance. Since images often present complex backgrounds and hazy conditions, many studies have looked at how to obtain larger receptive fields at shallower network depths. Zhao et al. [25] suggested a multiscale information augmentation framework (MS-IAF), which accurately identifies multiscale aircraft and their vital parts by stacking receptive fields of various scales in a multipath way. Li et al. [26] developed a new core component, the CBL module, to increase the receptive field range of the neural network in order to address aircraft detection in airport field video images, which is made difficult by long shooting distances, small aircraft targets, and mutual occlusion. Wu et al. [27] enhanced aircraft detection in high-resolution remote sensing images with dense targets and complex backgrounds by improving Mask R-CNN [28] with atrous convolution. However, the majority of these methods use atrous convolution or multilevel small convolution architectures to implement the large-scale receptive field. Large kernels have recently been proven efficient for enlarging effective receptive fields (ERFs) [29] and have enabled state-of-the-art pure convolutional network architectures on ImageNet classification [30], ADE20K semantic segmentation [31], and COCO object detection [32]. Inspired by this, we create a new core component, the LRF block, to expand the receptive field of contemporary convolutional neural networks for effective capture of aircraft features.

Methodology
To solve the aircraft 2D pose estimation problem at low resolution, as shown in Figure 3, we propose an aircraft super-resolution reconstruction network (Aircraft SRNet) and an aircraft pose estimation network (Aircraft PoseNet). Aircraft SRNet reconstructs the aircraft's high-resolution information using alternating, serially connected up- and downsampling blocks. The high-resolution aircraft image is then fed into Aircraft PoseNet to predict all of the aircraft's geometric keypoints and generate the aircraft skeleton pose, as shown in Figure 2.

Aircraft Keypoint Heatmap
To properly represent the actual positions of aircraft geometric keypoints in space, a Gaussian-heatmap-based approach is used to describe the positions of aircraft geometric keypoints in the two-dimensional plane as soft annotations. Figure 4 illustrates how a Gaussian kernel with standard deviation σ covers the left tail endpoint of the aircraft; the value $p_i$ at any location in the Gaussian heatmap represents the confidence that this location belongs to the left tail endpoint. The detailed form is given in Equation (1), where the Gaussian heatmap $Hm(k)$ with standard deviation σ indicates the aircraft's kth geometric keypoint. The confidence is higher for points $P_i$ close to the aircraft's keypoint and decreases with distance; with a threshold (thred), it is set to zero for locations far from the keypoint.
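As a rough illustration of the Gaussian heatmap described above, the following sketch builds one keypoint heatmap in plain Python. The exact form of Equation (1) is not reproduced here; the unnormalized Gaussian, the way the threshold is applied to the confidence value, and the default values of `sigma` and `thred` are all assumptions made for the example.

```python
import math


def gaussian_heatmap(width, height, keypoint, sigma=2.0, thred=1e-3):
    """Return a height x width heatmap (list of rows) for one keypoint.

    Assumptions: unnormalized Gaussian with peak value 1 at the keypoint,
    and confidences below `thred` are clipped to exactly zero.
    """
    kx, ky = keypoint
    heatmap = []
    for y in range(height):
        row = []
        for x in range(width):
            dist_sq = (x - kx) ** 2 + (y - ky) ** 2
            confidence = math.exp(-dist_sq / (2.0 * sigma ** 2))
            row.append(confidence if confidence >= thred else 0.0)
        heatmap.append(row)
    return heatmap


# A 32 x 24 heatmap for a hypothetical keypoint at column 10, row 8.
hm = gaussian_heatmap(32, 24, keypoint=(10, 8))
print(hm[8][10])   # confidence at the keypoint itself
print(hm[0][31])   # confidence far away from the keypoint
```

One heatmap of this kind would be generated per keypoint type and used as the soft training target.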

Loss Function
The loss function for our aircraft pose estimation is built from $L_{heatmap}$, the L2 loss of the model-predicted aircraft keypoint heatmap. The ten keypoints we choose cover the principal inflection points and endpoints of the aircraft structure: the nose of the aircraft, the left wing tip, the right wing tip, the right horizontal tail tip, the left horizontal tail tip, the point where the left horizontal tail attaches to the fuselage, the point where the right horizontal tail attaches to the fuselage, the point where the left wing attaches to the fuselage, the point where the right wing attaches to the fuselage, and the midpoint of these two wing attachment points. Their distribution preserves the symmetry of the aircraft as a rigid body, and because the relative positions of the keypoints are fixed, they are easy to detect. These n = 10 aircraft geometrical keypoint heatmap losses constitute the final aircraft pose estimation loss ($L_{pose}$).
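A minimal sketch of how the per-keypoint heatmap losses might be combined is given below. The source states only that $L_{heatmap}$ is an L2 loss and that the n = 10 keypoint losses make up $L_{pose}$; the use of a mean (rather than a sum) over pixels and over keypoints is an assumption, and the function names are hypothetical.

```python
def heatmap_l2_loss(pred, target):
    """Mean squared error between one predicted and one target heatmap."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def pose_loss(pred_heatmaps, target_heatmaps):
    """Combine the per-keypoint heatmap losses (assumed here: simple average)."""
    losses = [heatmap_l2_loss(p, t) for p, t in zip(pred_heatmaps, target_heatmaps)]
    return sum(losses) / len(losses)


# Ten tiny 2 x 2 heatmaps stand in for the ten aircraft keypoints.
n_keypoints = 10
prediction = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(n_keypoints)]
target = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(n_keypoints)]
print(pose_loss(prediction, target))
```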

Aircraft Super-Resolution Network
In Aircraft SRNet, we arrange upsampling and downsampling blocks [33] in a cross-cascade architecture so that features interact between the high-resolution semantic space and the low-resolution semantic space in order to reconstruct the high-resolution features of the aircraft. The network first reconstructs the high-resolution aircraft features with upsampling blocks, then learns the deep semantic features of the aircraft by down-projecting the high-resolution features to low resolution, and finally up-projects the deep semantic features to recover the high-resolution feature maps.
Figure 5a depicts the structure of the upsampling block. Taking the feature output tensor from the previous stage, $[F_1, \cdots, F_n]$, as input, the upsampling block produces high-resolution feature maps, where $G_5^t$ is the feature map of the final scale-up of the upsampling block and $G_2^t$ is the feature tensor of the first scale-up at stage $t$. As shown in Equation (4), a scale-down is carried out between the two scale-ups, and the reconstructed residuals are propagated forward and self-corrected backward.

As seen in Figure 5b, the downsampling block takes the output feature maps from the previous upsampling, $[H_1, \cdots, H_n]$, and produces low-resolution feature maps:

$$[L_1, \cdots, L_n] = P_2^t + P_5^t \tag{5}$$

$P_2^t$ is the feature tensor of the first scale-down, and $P_5^t$ is the feature map of the final scale-down. Following the two scale-downs, a scale-up and residual cascade are carried out, as illustrated in Equation (6).
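The up- and down-projection pattern described above (scale, project back, compute a residual, scale the residual, and add) can be sketched on a 1D signal. This is only an illustration under stated assumptions: the learned layers are replaced by fixed nearest-neighbour upsampling and average pooling, the `gain=0.9` factor simulates an imperfect first projection, and the sign convention of the residual is chosen for readability. None of these details come from the paper.

```python
def upsample(signal, gain=1.0):
    """2x nearest-neighbour upsampling (stand-in for a learned deconvolution)."""
    return [gain * v for v in signal for _ in range(2)]


def downsample(signal, gain=1.0):
    """2x average pooling (stand-in for a learned strided convolution)."""
    return [gain * (signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]


def up_projection(low):
    first_up = upsample(low, gain=0.9)                  # first scale-up
    back_down = downsample(first_up)                    # scale-down in between
    residual = [l - b for l, b in zip(low, back_down)]  # reconstruction residual
    second_up = upsample(residual)                      # final scale-up (of the residual)
    return [a + b for a, b in zip(first_up, second_up)]


def down_projection(high):
    first_down = downsample(high, gain=0.9)             # first scale-down
    back_up = upsample(first_down)                      # scale-up in between
    residual = [h - b for h, b in zip(high, back_up)]   # reconstruction residual
    second_down = downsample(residual)                  # final scale-down (of the residual)
    return [a + b for a, b in zip(first_down, second_down)]


high = up_projection([1.0, 2.0])
low = down_projection(high)
print(high, low)
```

In this toy setting the residual branch corrects the error introduced by the imperfect first projection; in a trained network the corresponding layers are learned rather than fixed.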

Aircraft Pose Estimation Network
Although Aircraft SRNet helps us recover high-resolution aircraft images, upsampling interpolation cannot prevent the loss of high-frequency information, which softens the aircraft texture in the high-resolution images. It is easier to recognize the geometric keypoints of the aircraft when the network concentrates on its overall structure rather than its texture. We therefore use a large receptive field block (LRF block) to enable the CNN to broaden its receptive field and perceive the aircraft as a whole. The LRF block performs the receptive field expansion at each stage of downsampling in our Aircraft PoseNet, as shown in Figure 6.

Large Receptive Field Block
The global structure of the aircraft is better captured by an expanded CNN receptive field. The receptive field is calculated with Equation (7), where $R_t$ denotes the convolutional receptive field size at stage $t$. We take 7 × 7 as an example, as shown in Figure 7. The theoretical receptive field of three 3 × 3 convolutions in series, and that of a 3 × 3 atrous convolution with a dilation rate of 3, are both equal to that of a single 7 × 7 convolution. However, their effective receptive fields (ERFs) [34] are actually different, as the effective receptive field Equation (8) demonstrates: the ERF of a CNN is proportional to the convolutional kernel size (k) and proportional to the square root of the network depth (n). This means that increasing the network depth, which is how many small convolutions placed in series expand their ERF, is less effective than increasing the convolution kernel size. Second, because the atrous convolution kernel samples the input discontinuously, it negatively affects the pixel-wise aircraft prediction task. ReLU has a gradient of 0 for the majority of pixels in the region away from the Gaussian kernel, as seen in Equation (1) or Figure 4, which prevents the corresponding gradients from updating and converging to the vicinity of the aircraft keypoints. In contrast, GeLU, given in Equation (9), keeps the gradient continuous and non-zero near 0, which aids the heatmap's convergence toward the location of the aircraft keypoint.
In Equation (9), $x$ represents the pixel points in the feature map, and $\mathrm{erf}(x)$ denotes the Gaussian error function, $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-y^2}\,dy$.
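The difference between the ReLU and GeLU gradients near zero can be checked numerically. The sketch below uses the standard GELU definition, $\mathrm{GELU}(x) = \frac{x}{2}\left(1 + \mathrm{erf}(x/\sqrt{2})\right)$; it is assumed, not confirmed, that Equation (9) has this form.

```python
import math


def gelu(x):
    """Standard GELU: x * Phi(x), written with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x):
    """Analytic derivative of GELU: Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0


# For slightly negative pre-activations ReLU passes no gradient, GELU still does.
for x in (-1.0, -0.5, -0.1, 0.0, 0.1):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.3f}  gelu'={gelu_grad(x):.3f}")
```

Pixels whose pre-activation is slightly below zero therefore still receive a training signal under GELU, which is consistent with the argument above about heatmap regions away from the Gaussian peak.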

Aircraft Dataset
The airport surface surveillance dataset comprises over 10,000 images containing more than 30,000 aircraft instances. The images are taken from public surveillance footage of civilian airport scenes and from video recorded at dozens of camera positions. The majority of the pictures were taken from an oblique, elevated angle. As seen in Figure 8, this dataset is considered difficult because it contains many blurry target images. To precisely characterize the structure of the aircraft, we labeled each aircraft pose with nine important joints with well-defined boundaries plus a center point where the wings intersect the fuselage, as shown in Figure 2.

Evaluation Metric
The positions of the aircraft's geometric keypoints determine the skeleton; thus, we use the widely accepted COCO evaluation criterion [32] to evaluate the aircraft pose: the Object Keypoint Similarity (OKS). Here, $d_i$ is the Euclidean distance between each detected keypoint and its corresponding ground truth, $v_i$ is the visibility flag of the ground truth, $s$ represents the bounding box area at the object scale, and the constant $k_i$ controls the falloff for each keypoint. The primary metric is the average precision (AP) over 10 OKS thresholds.
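For reference, a minimal sketch of the OKS computation is shown below. It follows the COCO keypoint convention, in which the squared object scale $s^2$ is taken as the object area and only labeled keypoints ($v_i > 0$) are counted; how exactly $s$ is defined in this work is an assumption, and the per-keypoint constants `k` used in the example are hypothetical values, not the ones used for the aircraft dataset.

```python
import math


def oks(pred, gt, visible, area, k):
    """Object Keypoint Similarity for one object (COCO-style).

    pred, gt : lists of (x, y) keypoints
    visible  : visibility flags v_i (keypoints with v_i == 0 are ignored)
    area     : object area, used as s^2
    k        : per-keypoint falloff constants k_i
    """
    numerator, labeled = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visible, k):
        if v > 0:
            dist_sq = (px - gx) ** 2 + (py - gy) ** 2
            numerator += math.exp(-dist_sq / (2.0 * area * ki ** 2))
            labeled += 1
    return numerator / labeled if labeled else 0.0


# COCO-style AP averages precision over 10 OKS thresholds: 0.50, 0.55, ..., 0.95.
OKS_THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

gt = [(10.0, 10.0), (20.0, 12.0), (30.0, 15.0)]
pred = [(11.0, 10.0), (20.0, 13.0), (50.0, 40.0)]
score = oks(pred, gt, visible=[2, 2, 0], area=40.0 * 30.0, k=[0.05, 0.05, 0.05])
print(score, [score >= t for t in OKS_THRESHOLDS])
```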

Implementation Details
To fairly assess the proposed method on low-quality airport field aircraft pose estimation tasks, all experiments are executed on a GIGABYTE 3090 Ti GPU with the Faster R-CNN [35] aircraft detector in an Ubuntu 18.04, PyTorch [36] environment. We used the same standard data augmentation and training strategies in all ablation studies to keep the experiments consistent and avoid CNN overfitting. Specifically, the augmentation consists of random scaling (±30%), random rotation (±40°), and random horizontal flipping. The initial learning rate is $1 \times 10^{-3}$, decreasing to $1 \times 10^{-4}$ and $1 \times 10^{-5}$ at the 90th and 130th epochs, respectively, for a total of 150 epochs. The mini-batch size is 128. We use the Adam [37] optimizer with a momentum of 0.9. The following discussion assumes an input low-resolution image size of 128 × 96.
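The training schedule and augmentation ranges above can be summarized in a short framework-independent sketch. The zero-based epoch indexing, the flip probability of 0.5, and the function names are assumptions made for illustration; the actual experiments use PyTorch.

```python
import random

TOTAL_EPOCHS = 150
BATCH_SIZE = 128


def learning_rate(epoch):
    """Step schedule: 1e-3, then 1e-4 from epoch 90, then 1e-5 from epoch 130.

    Assumption: epochs are counted from 0 and the drop applies from the
    stated epoch onward.
    """
    if epoch < 90:
        return 1e-3
    if epoch < 130:
        return 1e-4
    return 1e-5


def sample_augmentation(rng):
    """Draw one set of augmentation parameters within the stated ranges."""
    return {
        "scale": 1.0 + rng.uniform(-0.30, 0.30),     # random scaling, +/-30%
        "rotation_deg": rng.uniform(-40.0, 40.0),    # random rotation, +/-40 degrees
        "flip": rng.random() < 0.5,                  # horizontal flip (assumed p = 0.5)
    }


rng = random.Random(0)
print([learning_rate(e) for e in (0, 89, 90, 129, 130, TOTAL_EPOCHS - 1)])
print(sample_augmentation(rng))
```

Note that a horizontal flip also requires swapping the left and right keypoint labels (for example, left and right wing tips) so that the annotations stay consistent.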

Comparison with State-of-the-Art Baseline Methods
To evaluate the effectiveness of our proposed method, we compared it with two state-of-the-art heatmap-based baseline methods, SimpleBaseline [38] and HRNet [39], on the low-resolution realistic airport surface surveillance dataset. The same resolution of 128 × 96 is used for both model training and testing. Table 1 demonstrates that our method, which uses HRNet-W48 [39] as the backbone, outperforms the other state-of-the-art baseline approaches.

With Baseline as the backbone, we found that the gain from SRNet preprocessing is much smaller than the gain from the LRF block. On the other hand, when HRNet is employed as the backbone of Aircraft PoseNet, the SRNet result is noticeably better than that of LRF. These two cases suggest that high-resolution features are more significant for HRNet, with its multibranch multiscale information interaction, whereas a single-branch architecture is more sensitive to receptive field expansion. Figure 9 displays the Baseline (ResNet-101), HRNet-W48, and our aircraft pose results to more clearly demonstrate the efficacy of our method. The degree of overlap with the ground truth visually reflects the accuracy of each method.

Ablation Studies
In this section, we first perform ablation studies on the impact of different super-resolution subnetworks on the aircraft pose estimation task to demonstrate the advantages of our method. Table 2 shows that our super-resolution subnetwork achieves the best results with a relatively low increase in GFLOPs (+3.77), performing significantly better than FSRCNN [41], RDN [42], and SOF-VSR [43]. Our method improves AP by 1.2% at only slightly higher computational cost than FeNet [44], which is a lightweight SR network. Then, we investigated how various receptive field expansion methods affect aircraft pose estimation. Table 3 lists several ways of obtaining a 7 × 7 receptive field: a 3 × 3 atrous convolution with a dilation rate of 3, three 3 × 3 small convolutions in series, and a single 7 × 7 large convolution. The findings show that the discontinuous sampling of atrous convolution is detrimental to aircraft pose estimation, and that small convolutions in series are less efficient than a single large convolution kernel used directly.
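The three configurations compared above share the same theoretical receptive field, which can be verified with the usual receptive field recurrence for stride-1 layers. The sketch below also evaluates the ERF scaling stated earlier (proportional to the kernel size k and to the square root of the depth n); the proportionality constant is omitted, so only the ratio between configurations is meaningful. Function names are hypothetical.

```python
import math


def receptive_field(layers):
    """Theoretical receptive field of stacked stride-1 convolutions.

    layers: list of (kernel_size, dilation) tuples. Each layer adds
    dilation * (kernel_size - 1) pixels to the receptive field.
    """
    rf = 1
    for kernel, dilation in layers:
        rf += dilation * (kernel - 1)
    return rf


def erf_scale(kernel, depth):
    """Relative ERF size, assuming ERF is proportional to k * sqrt(n)."""
    return kernel * math.sqrt(depth)


configs = {
    "3x3 atrous, dilation 3": [(3, 3)],
    "three 3x3 in series": [(3, 1), (3, 1), (3, 1)],
    "single 7x7": [(7, 1)],
}
for name, layers in configs.items():
    print(f"{name}: theoretical receptive field = {receptive_field(layers)}")

# Effective receptive field: stacking small kernels grows only with sqrt(depth).
print("three 3x3:", round(erf_scale(3, 3), 2), " single 7x7:", round(erf_scale(7, 1), 2))
```

All three options reach a 7 × 7 theoretical receptive field, but under the stated scaling the single 7 × 7 kernel has the larger effective receptive field.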
Finally, we examine how different activation functions and convolutional kernel sizes affect the LRF block. As shown in Table 4, we perform ablation tests on two types of backbone with different convolutional kernel sizes and different activation functions. The outcomes demonstrate that the GeLU activation function and an 11 × 11 convolution kernel are the best options for this task.

Figure 9. Aircraft pose visualization results (blue) and ground truth skeletons (red) for Baseline, HRNet-W48, and ours, shown for a better view of the aircraft skeleton.

Conclusions
In this paper, we propose a novel end-to-end 2D aircraft pose estimation approach to deal with aircraft pose estimation on the airport surface at low resolution. The method uses a subnetwork, SRNet, to recover high-resolution details of the aircraft, as well as a core component, the LRF block, to focus on the aircraft as a whole and overcome the local texture blurring introduced by SRNet. Through extensive experiments on the airport surface surveillance dataset, we establish the necessity of high-resolution reconstruction for the low-resolution aircraft pose estimation problem. We also demonstrate the potential of large convolution kernels for extending the receptive field. Finally, ablation studies show that different PoseNet backbones do not all benefit equally from resolution and receptive field. Compared with the other most widely used baseline methods, our proposed method is more precise and efficient.
Author Contributions: X.Y. proposed the network architecture design; X.Y. performed the experiments and analyzed the results; X.Y. wrote the paper. X.Y., D.F. and S.H. revised the paper and provided valuable advice. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments:
The authors would also like to thank the anonymous referees for their valuable comments and helpful suggestions.

Conflicts of Interest:
The authors declare no conflict of interest.
