LARS: Remote Sensing Small Object Detection Network Based on Adaptive Channel Attention and Large Kernel Adaptation

Abstract: In the field of object detection, small object detection in remote sensing images is an important and challenging task. Due to limitations in size and resolution, most existing methods suffer from localization blurring. To address this problem, this paper proposes a remote sensing small object detection network based on adaptive channel attention and large kernel adaptation. The approach enhances multi-channel information mining and multi-scale feature extraction to alleviate localization blurring. To enhance the model’s focus on the features of small objects at varying scales, this paper introduces an adaptive channel attention block. This block applies adaptive attention weighting based on the input feature dimensions, guiding the model to better focus on local information. To mitigate the loss of local information caused by large kernel convolutions, a large kernel adaptive block is designed. The block dynamically adjusts the surrounding spatial receptive field based on the context around the detection area, improving the model’s ability to extract information around small objects. To address recognition confusion during sample classification, a layer batch normalization method is proposed. This method enhances the consistency analysis capabilities of adaptive learning, thereby reducing the decline in classification accuracy caused by sample misclassification. Experiments on the DOTA-v2.0, SODA-A and VisDrone datasets show that the proposed method achieves state-of-the-art performance.


Introduction
Object detection in remote sensing images is an important research direction in computer vision, and its development is an important way to promote military and civilian remote sensing applications, with broad market prospects. One of the main difficulties in this task is that the way remote sensing images are acquired, and the distance at which they are taken, result in objects that are small and have less distinctive features [1,2]. As shown in Table 1, the detected objects can be classified by the size of their area according to Ref. [3]. This characteristic of small objects limits the amount of information available about the object and increases the difficulty of feature extraction, and it has therefore received extensive academic attention [4,5].
Remote sensing image detection predominantly employs anchor-based and anchor-free networks. Anchor-based networks use predefined anchor boxes during detection, predicting each object's position and size relative to the anchor boxes to accomplish classification and localization [6][7][8]. Anchor-free models, on the other hand, directly regress the objects' positions and avoid reliance on predefined anchors, thus enhancing the network's adaptability to various object shapes and sizes [9][10][11]. Although both types have achieved notable results in general object detection, each has strengths and weaknesses when processing small objects in remote sensing images. Anchor-based methods require the careful design and adjustment of the anchor boxes' size and quantity, making the design and tuning of the network more complex. In contrast, anchor-free structures avoid potential errors in the anchor matching process by directly regressing the object's position and size. Consequently, anchor-free networks are favored for the detection of small objects. However, due to the variations in the size and proportion of objects in remote sensing images and the anchor-free networks' reliance on local feature points, existing networks often fall short of expectations when processing remote sensing images. Therefore, a new type of detection network suitable for small objects in remote sensing imagery is needed. It should guide the model to focus more on the features of small objects and reduce the loss of local information around them, thereby alleviating localization blurring.
To enhance the model's focus on the features of small objects in remote sensing images, an attention mechanism is introduced. This approach has seen considerable application in the field of remote sensing image detection [12]. The attention mechanism helps the model to focus on target regions, addressing challenges related to the varying scales, shapes, and orientations in remote sensing images, thereby improving the detection and recognition accuracy and robustness. While traditional attention mechanisms can enhance the model performance, they often overlook the inter-channel positional correlations [13]. Additionally, the use of fixed-size convolution operations in standard attention mechanisms to capture feature correlations can lead to the loss of local information for small objects, resulting in localization blurring. While it is essential to guide the model to focus on small objects, it is also important to consider a broader receptive field to extract more comprehensive features.
To extract a broader range of input features and capture more extensive contextual information, several researchers have employed larger convolution kernels [14][15][16]. By increasing the receptive field, these approaches consider more local features of the object; thus, they have achieved significant application and success in remote sensing image processing [17,18]. However, these methods often overlook the ranging context issue, which involves the local and global correlation relationships among objects of different sizes in remote sensing images. This oversight can lead to the loss of detailed information, resulting in the erroneous detection of small objects in remote sensing images.
Additionally, since batch normalization (BN) [19] relies on batch data during both training and testing, it can lead to changes in the batch statistical information during testing, affecting the model's detection performance. In contrast, layer normalization (LN) [20] focuses more on the independence of individual samples. LN covers all feature channels for each sample independently, emphasizing the features of each sample. However, because LN does not consider the correlations between the samples within a batch, it may overlook some common features and statistical information shared among the batch samples. This can result in unstable normalization effects and cause classification confusion.
To address the above limitations, this paper proposes a remote sensing small object detection network based on adaptive channel attention and large kernel adaptation (LARS). The classic and representative anchor-free network YOLO is selected as the network architecture, which not only helps to validate the effectiveness of the proposed methods but also demonstrates their broad applicability in limited experiments. To address the issue of insufficient attention to small object features at different scales of remote sensing images, an adaptive channel attention (ACA) block is proposed. This block adjusts the convolutional kernel size based on the input feature channels and applies adaptive attention weighting, guiding the model to focus on the local information. To overcome the information loss associated with large kernel convolutions when processing small objects, the large kernel adaptive (LKA) block is proposed. The LKA block decomposes a large kernel into several smaller convolutions, retaining a broad receptive field while preserving more detailed feature information. The ACA block enables the network to dynamically adjust the attention paid to different channels to improve the model's sensitivity to small objects; the extracted attention information is then used to assign weights to the LKA block to better focus on small object information. The combination of the two ensures that each small object feature is enhanced by the attention mechanism and then processed with the extended receptive field. This allows the model to better extract the contextual information surrounding small objects, which enables the network to accurately detect small objects over a wider range while reducing localization ambiguities. Considering the issue of sample correlation in traditional normalization methods, layer batch normalization (LBN) is proposed for normalization computation and is integrated into the ACA and LKA blocks. Finally, extensive experiments are conducted on the DOTA-v2.0 [21], SODA-A [3], and VisDrone [22] datasets, demonstrating the effectiveness of the LARS model and the design of each block.
This paper makes three contributions, as follows.
1. To address the issue of insufficient attention to remote sensing small object features when dealing with features of different scales, the ACA block is proposed. This block applies adaptive attention weighting based on the input feature dimensions, guiding the model to better focus on the local information.
2. The LKA block is designed to address the problem of the incorrect detection of remote sensing small objects caused by the loss of local information in remote sensing images due to large kernel convolutions. This block dynamically adjusts the surrounding spatial receptive field according to the ranging background around the detection area, and it is guided by the weight information extracted by the ACA block, enhancing the model's ability to extract the contextual information around small objects.
3. The LBN method is designed to resolve the issue of classification confusion caused by the correlation between samples. This method improves the consistency analysis capabilities during adaptive learning, alleviating the decline in the model's classification accuracy caused by sample misclassification.

Related Works
Small object detection in remote sensing images involves the task of detecting and localizing small-sized objects, which is often hindered by factors such as low object resolutions and noise. The current models for small object detection in remote sensing typically improve the performance by enhancing the search strategies, using region proposal methods, and performing loss function regression. Additionally, attention mechanisms and large-sized convolution kernels have been introduced to further improve the detection accuracy and robustness.
Search-based methods generate anchor boxes during the detection process through a sliding window approach, moving the window to the right or down by a certain step size until the entire image is covered. Jiang et al. [23] addressed the automatic detection of transmission towers in drone images by proposing a model that detects transmission towers, enhancing the detection robustness. However, this approach neglects the global contextual information surrounding the object, making it difficult to accurately distinguish small objects, and is prone to missed or false detection.
Region proposal-based methods generate candidate regions by segmenting and merging similar regions at the base level, followed by object detection and localization using deep learning models. Ren et al. [24] improved R-CNN to adapt it to the task of small object detection in optical remote sensing images. Lim et al. [25] proposed a context-aware object detection method to address the challenge of limited small object information. Although these methods have improved the detection performance to some extent, they tend to produce a large number of inaccurate candidate regions when processing remote sensing images. This leads to localization blurring and recognition errors for small objects.
Methods based on loss function regression extract image features using a feature extractor and directly predict the object's bounding box position and size by optimizing the loss function. This approach eliminates the need for additional candidate region generation and classification calculations, providing a more direct and efficient localization method. Yan et al. [26] proposed a multi-level feature fusion network to address the issues of insufficient information and background noise in the detection of dim small objects in remote sensing images. Fan et al. [27] introduced an anchor-free, efficient single-stage object detection method for optical remote sensing images to tackle the challenges of multi-scale objects and complex backgrounds in remote sensing imagery. However, the bounding boxes generated by these methods often lack precise localization, leading to instability in the regression model's prediction of the bounding box positions and sizes.
Subsequently, attention mechanisms were introduced to enhance the model's focus on important features, thereby improving the recognition of small objects in remote sensing images. Du et al. [28] addressed the issue of small object sizes and dense distributions in remote sensing imagery by designing an enhanced multi-scale feature fusion network based on spatial attention mechanisms. Paoletti et al. [29] proposed a new multi-attention guided network that uses detailed feature extractors and attention mechanisms to identify the most representative visual part of an image to improve the feature processing for remote sensing hyperspectral image classification. Yan et al. [30] explored the potential of low-cost sparse annotations and introduced an end-to-end RSI-SOD method that relies entirely on scribble annotations. Liu et al. [31] addressed the challenges posed by traditional feature pyramid networks in handling various scale variations in remote sensing images by proposing an attention-based multi-scale feature enhancement and fusion module algorithm. Although these methods can pay attention to some important feature information, they usually use fixed-size convolutions to calculate the correlations between the channels. Moreover, for high-resolution and multi-band remote sensing images, standard channel attention methods fail to adequately focus on small object features when processing features of different sizes, which can lead to the loss of local information and, in turn, to blurred localization.
Several studies have found that large convolution kernels can cover a broader range of input features, which helps to capture a broader context. Wang et al. [16] proposed a large kernel convolutional object detection network based on feature capture enhancement and wide receptive field attention to address the issue whereby critical information in a small receptive field is not prominently highlighted. Dong et al. [32] introduced a novel Transformer with a large convolutional kernel decoding network to tackle the problems of blurred semantic information and inaccurate detail and boundary predictions in remote sensing images. Sharshar et al. [33] investigated an object detection model integrating the LSKNet backbone with the DiffusionDet header to solve the problems of small object detection, dense element management, and different orientation considerations in aerial images. Li et al. [17] proposed a lightweight large selective kernel network to address the issue of extracting prior knowledge in remote sensing scenes. Although these methods have achieved good results by considering a larger receptive field, the use of large kernel convolutions fails to effectively leverage the local and global ranging contexts of objects of different sizes in remote sensing images. This can lead to the loss of detailed information, resulting in the incorrect detection of small objects in remote sensing images.
In summary, although the current remote sensing small object detection methods can obtain good results, challenges remain. Due to the large differences in the size and scale of small objects in remote sensing images, it is necessary to address the localization blurring problem caused by low attention to the features of small objects and the lack of localization accuracy. Therefore, it is necessary to explore a new processing method applicable to the above problems.

Network Overview
The overall model structure is divided into three parts, the backbone, neck, and head, as shown in Figure 1. In the backbone section, the ACA block is used to capture specific semantic features, such as color and texture, contained in different channels of the image. The adaptive weighting set within this block allows the model to focus more on the local information, guiding the model to focus on the small object area. Then, the LKA block analyzes the local and global correlation between this area and the surrounding receptive field, accurately extracting high-level feature representations of the input image for subsequent object detection tasks. The neck section uses a feature pyramid network (FPN) architecture for feature fusion and upsampling to further process the features extracted by the backbone, enhancing the model's sensitivity to objects of different scales. The head section primarily extracts the category and location information of objects of different sizes, using three anchor-free detection heads for information fusion.

ACA Block
Remote sensing images are typically composed of spectral data from multiple bands, each corresponding to a different spectral range. The channel information provides rich spectral features, which are crucial in distinguishing various land cover types, materials, and vegetation conditions. In contrast, an overemphasis on spatial features can lead to redundant information, causing the model to learn incorrect features. The ACA block proposed in this paper addresses the issue of focusing on regional features, and its structure is shown in Figure 2.
During training, the input is a tensor x of shape (B, C, H, W), where B is the batch size, C is the number of channels, and H and W are the height and width, respectively. Convolution operations are used for the initial feature extraction from the input image. To enhance the connections between the channels, all channels share the same learning parameters, where W is a C × C parameter matrix. For each pixel $x_i$, this paper only considers the receptive field within a range of k units, where $R_i^k$ denotes the set of k adjacent channels of $y_i$. To capture appropriate channel interaction information, the kernel size k of the convolution could be adjusted manually for each receptive field, but this is overly cumbersome. Therefore, an adaptive method is designed to automatically adjust the convolution kernel size k based on the number of input channels, enabling the adaptive convolution of features of different dimensions. A mapping τ is set between k and C, in which $|\cdot|_{odd}$ denotes the nearest odd number to its argument, and η and b are the parameters of the linear mapping. Because the overall mapping τ from C to k is nonlinear, high-dimensional channels have longer-range interactions, while low-dimensional channels have shorter-range interactions.
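As a rough illustration of the adaptive kernel selection described above, the following Python sketch assumes that the mapping takes the ECA-style form k = |log2(C)/η + b/η|_odd. That form, the default values η = 2 and b = 1, the truncate-then-round-up-to-odd rule, and the names `adaptive_kernel_size` and `channel_attention` are all assumptions made for illustration and are not confirmed by the text. The second function applies one shared 1D kernel across neighboring channel descriptors, which is one plausible reading of "all channels share the same learning parameters".

```python
import math


def adaptive_kernel_size(channels: int, eta: float = 2.0, b: float = 1.0) -> int:
    """Map a channel count C to an odd 1D kernel size k.

    Assumed form: k = |log2(C)/eta + b/eta|_odd, where the value is
    truncated to an integer and bumped up by one if it is even.
    """
    t = int(abs((math.log2(channels) + b) / eta))
    return t if t % 2 == 1 else t + 1


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(descriptors, weights):
    """Weight each channel from its k neighboring channel descriptors.

    descriptors: one pooled value per channel (length C).
    weights: a shared 1D kernel of odd length k (hypothetical values).
    Zero padding keeps the output length equal to C.
    """
    k = len(weights)
    pad = k // 2
    padded = [0.0] * pad + list(descriptors) + [0.0] * pad
    return [
        sigmoid(sum(w * padded[i + j] for j, w in enumerate(weights)))
        for i in range(len(descriptors))
    ]


# Wider layers get a larger interaction range under the assumed mapping.
for c in (64, 256, 512):
    print(c, adaptive_kernel_size(c))  # prints 64 3, 256 5, 512 5
```

Under these assumptions the kernel grows only logarithmically with the channel count, so the cost of the attention step stays small even for wide layers.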

LKA Block
Using larger convolution kernels can increase the receptive field, allowing for the capture of more image information and thereby obtaining richer feature representations.
However, in the detection of small objects in remote sensing images, such a large receptive field can lead to overly mixed information. This makes it difficult to accurately capture the details of small objects, resulting in blurred and lost information. Therefore, this paper proposes the LKA block, which decomposes the original large kernel convolution using matrix multiplication in the KA block, thereby increasing the receptive field and computing a series of long-range receptive fields. The structure of the LKA block is shown in Figure 3, and the internal KA block structure is given in Algorithm 1.
The determination of the number of decomposed kernels is a key aspect of the LKA block. An increase in the kernel size and dilation rate ensures that the receptive field expands sufficiently quickly. Therefore, this paper defines rules for the kernel size k, dilation rate d, and receptive field R of the i-th convolution. Based on these rules, this paper splits a convolution with a kernel size of 23 into smaller convolutions with kernel sizes of 5 and 7, which allows for more detailed feature extraction. A series of convolutions with different receptive fields realizes these operations, where F represents depthwise convolution using a kernel $k_i$. Assuming that there are N decomposed convolution kernels, a 1 × 1 convolution layer is used for further processing after the depthwise convolution. The channels of each decomposed convolution kernel are reduced to 1/N of the original size and then concatenated. This enhances the model's ability to capture features with different receptive field sizes, allowing the model to consider information at multiple scales simultaneously. Subsequently, max pooling and average pooling operations are performed separately on each channel of the features, enhancing the model's perception of the feature information within each channel and making the model's processing of each channel's information more flexible.
Here, $S_{avg}$ represents the spatial features obtained through average pooling, and $S_{max}$ represents the spatial features obtained through max pooling. To reflect the information interaction between different descriptors, these two spatial pooling features are concatenated. Then, a sigmoid activation function is applied to obtain an individual spatial selection mask for each decomposed kernel. This allows the model to adaptively select the required features of different sizes:
$S = \sigma(\mathrm{Concat}(S_{avg}, S_{max}))$
Here, $\sigma(\cdot)$ represents the sigmoid function. The decomposed features are then weighted by their corresponding spatial selection masks and fused through a convolution layer $F(\cdot)$.
Finally, the learned weight information is introduced to enhance the attention to the input features. The output is the element-wise product between the input features X and Z.
Algorithm 1: KA Block: Core of LKA
Input: Image tensor x
Output: Tensor after processing X
1 Initialize convolutional layers;
2 Generate features $F_1 = \mathrm{Conv}_0(x)$, $F_2 = \mathrm{Conv}(F_1)$;
3 Obtain concatenated features $F_{con} = \mathrm{Concat}(F_1, F_2)$;
4 Calculate statistics $S_{avg} = P_{avg}(F_{con})$, $S_{max} = P_{max}(F_{con})$;
5 Obtain aggregate weights $S = \sigma(\mathrm{Concat}(S_{avg}, S_{max}))$;
6 Obtain the weight matrix $W = S[0] * F_1 + S[1] * F_2$;
7 Calculate the weighted sum $X = W * x$;
8 Return X;
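Steps 3 to 7 of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: tensors are nested lists shaped [C][H][W] for a single sample, the two convolutions of step 2 are not implemented (the caller passes hypothetical feature maps `f1` and `f2` in their place), the pooling in step 4 is taken across the channel axis, and the function name `ka_select` is invented for the example.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def ka_select(x, f1, f2):
    """Apply steps 3-7 of Algorithm 1 to [C][H][W] nested lists.

    x  : input features.
    f1 : stand-in for Conv_0(x) (assumed to be computed elsewhere).
    f2 : stand-in for Conv(f1)  (assumed to be computed elsewhere).
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    f_con = f1 + f2  # step 3: concatenate along the channel axis
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for i in range(height):
        for j in range(width):
            column = [fm[i][j] for fm in f_con]
            s_avg = sum(column) / len(column)  # step 4: average pooling
            s_max = max(column)                # step 4: max pooling
            # step 5: sigmoid over the two stacked descriptors
            s0, s1 = sigmoid(s_avg), sigmoid(s_max)
            for c in range(channels):
                # step 6: spatial masks weight the two branches
                w = s0 * f1[c][i][j] + s1 * f2[c][i][j]
                # step 7: element-wise product with the input
                out[c][i][j] = w * x[c][i][j]
    return out


# Tiny usage example with one channel and a 1 x 2 spatial grid.
x = [[[2.0, 1.0]]]
f1 = [[[1.0, 0.0]]]
f2 = [[[1.0, 0.0]]]
print(ka_select(x, f1, f2))
```

At the second position both branches are zero, so the weight and therefore the output vanish there, which shows how the mask suppresses locations where neither receptive field responds.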

LBN Method
In the detection of small objects in remote sensing images, the features are often subtle and sparse, and the traditional normalization methods are mainly batch normalization (BN) and layer normalization (LN). BN normalizes the input data using the mean and variance of each feature channel over all batch samples, so it relies mainly on the statistical information of the batch. As a result, BN may not perform well with small batches or single samples: it may introduce noise and lead to inter-sample interactions and a dependence on the overall statistical information. In contrast, LN performs normalization over all channels of each sample and can maintain the inter-sample independence. However, because of this, LN ignores global information, resulting in less stable statistical information.
In this paper, we combine the two types of normalization so that the network can take into account the statistical information of the batch samples while dealing with small objects, and we propose the LBN normalization method. LBN first normalizes each sample's channels to ensure independence between the samples; then, it calculates the mean and variance of the output of the current layer and normalizes the whole batch using these statistics to take advantage of the correlation within the batch. This dual normalization strategy can maintain stability when dealing with small batches and single samples while using the batch statistical information to improve the overall model performance. It ensures feature independence and utilizes global statistical data, and it is particularly suitable for small object detection tasks. Assuming that the dimension of feature X is (N, C, H, W), BN normalizes each channel across the entire batch, while LN normalizes each sample across all channels. Normalization along the (N, H, W) dimensions takes the standard form
$\mu = \frac{1}{B}\sum_{i=1}^{B} x_i, \quad \sigma^2 = \frac{1}{B}\sum_{i=1}^{B} (x_i - \mu)^2, \quad \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
The variable B represents the number of samples, and ϵ is a small positive smoothing term added to prevent division by zero, ensuring the stability of the BN calculations. If we replace the variable B in the formula with the number of channels C, we obtain the calculation result of LN. Then, we introduce a learnable parameter λ to balance the normalization outputs along both directions.
Here, y is the output of the normalization method. In this paper, LBN is embedded into both the ACA and LKA blocks, accelerating the model training process.
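The balancing step can be illustrated with a small Python sketch. It assumes that the two normalized outputs are mixed as y = λ · BN(x) + (1 − λ) · LN(x); the text only states that λ balances the two directions, so this convex combination is an assumption, as are the fixed (non-learned) value of λ, the omission of affine scale and shift terms, and the name `layer_batch_norm`. For brevity the features are shaped [N][C], i.e. H = W = 1.

```python
import math


def _normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a flat list."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def layer_batch_norm(x, lam=0.5, eps=1e-5):
    """Blend batch-wise and sample-wise normalization (assumed form).

    x   : features as a nested list shaped [N][C].
    lam : balance parameter; learnable in the paper, fixed here.
    """
    n, c = len(x), len(x[0])
    # LN direction: normalize each sample over its channels.
    ln = [_normalize(row, eps) for row in x]
    # BN direction: normalize each channel over the batch.
    bn_cols = [_normalize([x[i][j] for i in range(n)], eps) for j in range(c)]
    return [
        [lam * bn_cols[j][i] + (1.0 - lam) * ln[i][j] for j in range(c)]
        for i in range(n)
    ]


x = [[1.0, 2.0], [3.0, 4.0]]
for row in layer_batch_norm(x, lam=0.5):
    print([round(v, 3) for v in row])
```

Setting `lam=1.0` recovers plain BN and `lam=0.0` recovers plain LN in this sketch, which makes the role of the balance parameter easy to inspect.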

Experiments
This section describes extensive experiments to evaluate the model's effectiveness and performance in remote sensing small object detection. Firstly, the datasets used in the experiments are introduced, followed by explanations of the experimental settings and evaluation metrics. Finally, the results of the ablation studies and comparative experiments are presented, and the observed phenomena and trends are discussed.

Datasets
DOTA-v2.0. DOTA-v2.0 is a benchmark dataset released by Wuhan University that is widely used for object detection in remote sensing images. The dataset contains 11,268 high-resolution aerial and satellite images and 1,793,658 annotated instances covering 18 object classes, such as aircraft and harbors. The high-resolution images and varied object classes of the DOTA-v2.0 dataset provide rich test samples for evaluating the performance of different detection algorithms. As a publicly available benchmark dataset, DOTA-v2.0 provides a unified evaluation tool that facilitates direct comparisons with existing methods and ensures reproducible and comparable research results. The details of each category in DOTA-v2.0 are shown in Table 2.
SODA-A. The SODA-A dataset is designed for small object detection and was released by Northwestern Polytechnical University. The dataset contains 2513 high-resolution aerial images, in which 872,069 objects are labeled with oriented bounding boxes, covering nine categories, such as airplanes, helicopters, and ships. The high-density and multidirectional small object annotations in the SODA-A dataset provide ideal test samples for the evaluation of small object detection algorithms for remote sensing. The details of each category in SODA-A are shown in Table 3.
VisDrone. The VisDrone dataset is a benchmark for UAV vision tasks and is published by Tianjin University. The dataset consists of 10,209 high-resolution images and video frames covering 79,658 labeled instances distributed across 10 object classes, including pedestrians, vehicles, and traffic lights. The scene diversity and rich object classes of the VisDrone dataset allow for the analysis of model performance in complex urban environments and dynamic scenes. In addition, the multi-angle and multi-scale features of the objects in the VisDrone dataset can be exploited to verify the robustness and generalization abilities of the model in practical applications. Table 4 presents the detailed information of each category in the VisDrone dataset.

Implementation Details
This paper reports experimental results obtained on the DOTA-v2.0 and VisDrone datasets to evaluate the model's performance. To ensure fairness, a unified data processing method was employed: the original images were cropped into 1024 × 1024 patches with a pixel overlap of 150 between adjacent patches. All experiments were conducted using a single NVIDIA 4090 GPU with a batch size of 6 for model training and testing.
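The cropping scheme above implies a stride of 1024 − 150 = 874 pixels between adjacent patches. The following sketch computes patch origins along one image axis. How the final, partial patch is handled is not stated in the text; shifting the last patch back so that it ends flush with the image border is an assumption, as is the function name `patch_origins`.

```python
def patch_origins(length: int, patch: int = 1024, overlap: int = 150):
    """Return the start coordinates of patches along one image axis.

    Adjacent patches share `overlap` pixels. The last patch is shifted
    back to end at the image border (an assumed convention), so it may
    overlap its neighbor by more than `overlap` pixels.
    """
    if length <= patch:
        return [0]  # the image fits in a single patch
    stride = patch - overlap
    origins = list(range(0, length - patch, stride))
    origins.append(length - patch)
    return origins


print(patch_origins(2048))  # prints [0, 874, 1024]
print(patch_origins(4000))  # prints [0, 874, 1748, 2622, 2976]
```

Applying the function to both the width and the height and taking the Cartesian product of the results yields the top-left corners of all patches of an image.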
A stochastic gradient descent (SGD) optimizer was used for training, with a learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005. The classification loss was computed using BCE, and the bounding box regression loss was computed using the CIoU and DFL.
Pre-training was conducted on the ImageNet dataset for 400 epochs. For the ablation studies, the model was trained for 20 epochs to ensure that the proposed methods could achieve good results within a limited number of iterations. For the comparative experiments on the DOTA-v2.0 and VisDrone datasets, the model was trained for 50 epochs. The performance on DOTA-v2.0 was evaluated using the mAP50 of each class and the mAP50 over all classes, and the performance on VisDrone was evaluated using the mAP50 and the mAP95.
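For readers unfamiliar with the CIoU term, the sketch below computes the commonly used CIoU loss for a single pair of axis-aligned boxes given as (x1, y1, x2, y2). It is only an illustration of the general formula: the experiments on DOTA-v2.0 use oriented boxes, which this simplified version does not handle, and the BCE and DFL terms are not shown. The function name `ciou_loss` and the `eps` guard are choices made for the example.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss, 1 - (IoU - rho^2/c^2 - alpha*v), for axis-aligned boxes."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    iou = inter / (w * h + gw * gh - inter + eps)

    # Squared center distance over squared diagonal of the enclosing box.
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((x1 + x2 - gx1 - gx2) ** 2 + (y1 + y2 - gy1 - gy2) ** 2) / 4.0

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - (iou - rho2 / c2 - alpha * v)


print(ciou_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)))  # close to 0
print(ciou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)))  # greater than 1
```

Unlike plain IoU, the loss still provides a useful signal for non-overlapping boxes through the center-distance term, which matters for small objects whose predicted boxes often miss the target entirely early in training.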

Comparative Experiments
Results on the DOTA dataset. The proposed method achieved state-of-the-art performance, with a 63.01% mAP50 on the DOTA-v2.0 OBB benchmark.
From Table 5, it can be observed that, compared to previous methods, LARS achieved significant improvements in detection, achieving higher average precision and more accurate localization. The detection results for each category, as well as the overall detection accuracy, are visually presented in Figures 4 and 5 using line and bar charts.
Results on the SODA-A dataset. This paper further highlights the performance of LARS on the SODA-A dataset. The experimental results demonstrate the performance of our method compared with other multi-stage and single-stage detection methods on the SODA-A dataset.
As shown in Table 6, our method achieves significant performance improvements in all metrics, especially in small object detection. $AP_{eS}$, $AP_{rS}$, $AP_{gS}$, and $AP_{N}$ represent the detection accuracy for extremely small, relatively small, generally small, and normal objects, respectively [3]. Our model outperforms all other comparative methods in all four metrics, which indicates that our method has higher accuracy in detecting small objects. In addition, our method also performs well in terms of the overall average precision (AP) and high-confidence detection ($AP_{75}$), reaching 49.4 and 59.3, respectively, proving our method's accuracy in detecting small objects in complex aerial photography scenes.
Results on the VisDrone dataset. This paper further examines the performance of LARS on the VisDrone dataset. The VisDrone dataset has richer scenarios and more challenges, which enables us to evaluate the performance and generalization ability of the model more comprehensively. The experimental results on the VisDrone dataset, shown in Table 7, are analyzed next to further validate the validity and generalization of the proposed approach.
Figure 6 lists the results of the comparison tests on the VisDrone dataset, showing that LARS performs well in dealing with various challenging scenarios. Compared with other methods, LARS achieves higher values in the mAP evaluation metric, which indicates that the model not only covers real objects more effectively but also identifies the object boundaries more accurately.
Table 7. Results on the VisDrone dataset.
Method              mAP50    mAP95
…                   35.32    20.04
DCFL [45]           32.14    -
IOD [55]            42.93    24.62
HIC-YOLOv5 [56]     44.32    25.99
QueryDet [57]       48.15    28.71
CEASC [58]          50.74    28.46
DSH-Net [59]        51.81    30.94
SAHI [60]           43.59    -
EdgeYOLO [61]       44.85    -
Ours                52.87    33.92
Overall, the experimental results show that the method proposed in this paper not only achieves significant performance improvements on the DOTA-v2.0 dataset but also achieves excellent detection performance on the VisDrone dataset, which proves the versatility and effectiveness of the method. In addition, the results prove that the proposed method has good generalization.

Ablation Experiments
This section reports the results of the ablation experiments on the DOTA-v2.0 dataset to investigate the method's effectiveness.
Different decomposition strategies. With the theoretical receptive field set to 23, the results of the ablation study on the number of large kernel decompositions are shown in Table 8, and the visualization results are illustrated in Figure 7. According to the experimental findings, decomposing the large kernel into a convolution with a kernel size of 5 and a dilation rate of 1, followed by another convolution with a kernel size of 7 and a dilation rate of 3, achieves the best performance.

Different insertion blocks. In this experiment, the ACA block, the LKA block, and the LBN method were added to the model one at a time, after which all three were used in combination. The same dataset and training configurations were used, and the performance was evaluated on the validation set. As shown in Table 9, the accuracy is further improved when ACA, LKA, and LBN are added at the same time. The visual comparison of the detection results is depicted in Figure 8. Incorporating all blocks enables more accurate localization of the objects, reducing both missed detections and false positives. Additionally, in areas with densely distributed objects, using all blocks rather than a subset reduces the overlap between the detection boxes, thereby distinguishing individual objects more accurately (Figure 9). This indicates that the blocks complement one another and jointly enhance the model's performance.
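For the decomposition ablation above, the stated theoretical receptive field of 23 can be checked with a few lines of arithmetic. The sketch below assumes the two convolutions are applied sequentially with stride 1, so each layer widens the field by (kernel - 1) × dilation; the function name `receptive_field` is hypothetical and is not taken from the paper's code.

```python
def receptive_field(layers):
    """Theoretical receptive field of a stack of stride-1 convolutions.

    `layers` is a sequence of (kernel_size, dilation) pairs, matching the
    (kernel, dilation) format used in the decomposition ablation.
    Assumption: all layers have stride 1 and are applied one after another.
    """
    rf = 1
    for kernel, dilation in layers:
        # Each stride-1 layer widens the field by (kernel - 1) * dilation.
        rf += (kernel - 1) * dilation
    return rf


# The best-performing decomposition reported in the text.
best = [(5, 1), (7, 3)]
print(receptive_field(best))        # 1 + 4*1 + 6*3 = 23

# A single undecomposed 23x23 kernel covers the same field.
print(receptive_field([(23, 1)]))   # 1 + 22*1 = 23
```

Under these assumptions, the (5, 1) and (7, 3) pair reaches the same field as one 23 × 23 kernel while using far fewer weights per channel (25 + 49 instead of 529 for depth-wise kernels).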

Results Analysis
The experimental results on the DOTA-v2.0 dataset are analyzed in this section. Figure 10 illustrates various evaluation metrics, including the loss function, mAP, recall, and precision. The loss decreases continuously during the experiments, indicating the gradual optimization of the model's bounding box prediction accuracy during training, which enables the accurate localization of the objects. The sustained increase in the mAP50 and mAP95 indicates the model's good performance in focusing on key features and expanding the receptive field, leading to significant performance improvements at different IoU thresholds and demonstrating strong generalization abilities. During the early stages of training, the precision metric fluctuates significantly because the model has not yet learned adequate parameters and features. In the later stages of training, however, the ACA block extracts critical information and the LKA block assigns corresponding receptive fields to objects of different sizes, so the precision gradually stabilizes, converges to an optimal state, and remains consistently good across different samples. The continuous increase in the recall metric reflects the enhanced ability of the model to recognize positive samples, resulting in fewer missed detections. Through the proposed ACA and LKA blocks, the model can more accurately focus on critical features and better understand and capture the contextual information of objects, thereby further improving the recognition accuracy and completeness.

In the normalized confusion matrix (Figure 11), the model shows a strong discrimination ability in the categories PL and TC, with accuracy of more than 90%. This indicates that the model can accurately recognize these categories and distinguish the objects from the background. However, some of the PL samples were misclassified as HC, which may be due to the similarity in the features of PL and HC, making it challenging for the model to differentiate between them. Similarly, the low accuracy for categories such as CC and AP could be due to insufficient training samples, which prevented the model from learning enough features for accurate classification.
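The per-category accuracies discussed here are read from a normalized confusion matrix. As a minimal sketch of how such a matrix can be obtained, the code below divides each row (true category) by its total so that the diagonal gives the fraction of each category classified correctly. The counts are hypothetical and chosen only for illustration, and normalizing by row is an assumption based on the description of rows as true categories; neither is taken from the paper.

```python
def normalize_rows(matrix):
    """Divide each row by its sum so entries become per-category rates."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Guard against categories with no samples.
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Hypothetical counts (rows: true category, columns: predicted category).
labels = ["PL", "HC", "BG"]
counts = [
    [90, 8, 2],   # true PL: 8 samples confused with HC
    [5, 40, 5],   # true HC
    [3, 2, 95],   # true background
]

norm = normalize_rows(counts)
for name, row in zip(labels, norm):
    print(name, [round(v, 2) for v in row])
```

Off-diagonal entries such as `norm[0][1]` then show directly how often one category is mistaken for another, which is the kind of PL/HC confusion described above.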
The PR curve and the F1-confidence curve are important tools for evaluating the performance of object detection models. The PR curve illustrates the relationship between the precision and recall at different thresholds. Typically, the area under the curve (AUC) is used to quantify the model's performance, where a larger area indicates better performance. In the PR curve (Figure 12, left), most class curves protrude towards the upper right corner, indicating that the model maintains high precision while also improving the recall. This is attributed to the discriminative feature representations provided by the LKA block and the enhanced focus on objects by the ACA block, resulting in more accurate localization and recognition by the combined model. In the F1-confidence curve (Figure 12, right), the horizontal axis represents the confidence threshold, while the vertical axis represents the F1 score, which is the harmonic mean of the precision and recall. The calculation formula is as follows:

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

At low confidence levels, the F1 score of the model is relatively low. However, as the confidence threshold increases, the features extracted by the LKA block are fully utilized, and the ACA block effectively adjusts the importance of the feature channels. As a result, the F1 score gradually increases and reaches its highest value of 0.62 at a confidence level of 0.405. At this operating point, the precision and recall are well balanced, reducing false positives and false negatives and thereby achieving more accurate object localization.
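To make the F1-confidence relationship concrete, the following sketch computes the F1 score as the harmonic mean of precision and recall and picks the confidence threshold that maximizes it. The (confidence, precision, recall) triples are hypothetical values for illustration, not measurements from the paper, and the helper names are assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_threshold(curve):
    """Return (confidence, f1) for the point with the highest F1.

    `curve` is a list of (confidence, precision, recall) triples.
    """
    scored = [(conf, f1_score(p, r)) for conf, p, r in curve]
    return max(scored, key=lambda item: item[1])


# Hypothetical operating points: raising the threshold trades recall
# for precision, so F1 peaks somewhere in between.
curve = [
    (0.1, 0.30, 0.80),
    (0.4, 0.60, 0.64),
    (0.7, 0.80, 0.35),
]

conf, f1 = best_threshold(curve)
print(conf, round(f1, 3))
```

The peak of this curve is the kind of operating point reported in the text (an F1 of 0.62 at a confidence of 0.405).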
Figure 13 illustrates the performance of the proposed model on the two datasets. It can be observed that the model achieves high detection accuracy for small objects in remote sensing images and performs well in precise multi-scale object localization. On the DOTA-v2.0 dataset, the model accurately identifies objects of different scales, such as PL and HB, indicating that LARS can recognize not only normal-sized objects but also small objects. The test results on the VisDrone dataset likewise demonstrate the accurate identification of objects of different scales, such as cars, bicycles, and pedestrians. These experimental results demonstrate the effectiveness and feasibility of the proposed method in the task of detecting small objects in remote sensing images.

Conclusions
To address the localization blurring issue in small object detection in remote sensing images, a remote sensing small object detection network based on adaptive channel attention and large kernel adaptation was proposed. An adaptive channel attention block was proposed to enhance the attention mechanism and channel features for small objects in remote sensing images. This block guides the model to focus better on local information. To alleviate the loss of local information when processing small objects in remote sensing images with large kernel convolutions, a large kernel adaptive block was designed to dynamically adjust the spatial receptive field of the objects, improving the model's ability to extract the associated information around small objects. We also designed a layer batch normalization method to alleviate the decrease in classification accuracy caused by sample misclassification and to address the issue of inter-sample correlation. Extensive experiments and analyses demonstrated the improvements brought by the proposed model.
Although the network proposed in this paper has obtained satisfactory results in mitigating the localization blurring problem during small object detection in remote sensing images, there are still several directions that can be further explored, as well as some limitations.
(1) To address the lack of interpretability of the model, we are exploring mathematical formulations that explain how the model works, in order to understand its inner workings and decision-making process more clearly.
(2) The model has a large number of parameters, leaving room for improvement in terms of a lightweight design. Future research can focus on reducing the model parameters and computational complexity through compression techniques to meet the requirements of applications in resource-constrained environments.

Figure 1. The overall architecture of LARS.

Figure 2. Structure of the ACA block.

Figure 3. Structure of the LKA block.

Figure 4. Average detection accuracy per category on the DOTA-v2.0 dataset. Each point represents the accuracy of a comparison model in a given category, the horizontal axis represents different models, and the vertical axis represents the AP50 value for each category.

Figure 5. Average detection accuracy for all categories on the DOTA-v2.0 dataset. LARS, represented by the red bar, achieves the highest detection accuracy of 63.01 on this dataset.

Figure 6. Comparison of mAP50 and mAP95 metrics on the VisDrone dataset.

Figure 7. Comparison of results from different strategies for the decomposition of large kernels, given in (kernel, dilation) format.

Figure 8. Comparison of detection performance with different blocks added.

Figure 9. Visualization of detection performance after adding different blocks, with hard-to-spot differences in the detection results circled in ellipses.

Figure 10. Evaluation metric analysis for the proposed model.

Figure 11 illustrates the normalized confusion matrix without the LBN block (left panel) and with the LBN block (right panel), where the rows represent the true categories and the columns represent the categories predicted by the model. The diagonal elements from the top left to the bottom right represent the probability that the model correctly classifies each category. It can be seen that after adding the LBN block, the model's classification performance in each category is improved, especially in the categories BC, GTF, BR, and AP, where the number of misclassifications is significantly reduced, indicating that the LBN block effectively reduces the classification confusion and improves the overall detection accuracy. The ablation experiments in Table 9 also demonstrate the improvement after adding the LBN block. Specifically, the mAP50 and mAP95 improved by 1.3% and 1.66%, respectively, after adding the LBN block. This indicates that the LBN block reduces misclassification and improves the overall detection performance.

Figure 11. Comparison of normalized confusion matrices without (left) and with (right) the LBN block on the DOTA-v2.0 dataset, where BG represents the background.

Figure 12. Precision-recall curve and F1-confidence curve for the proposed model.

Table 1. Classification and corresponding area ranges of objects.

Table 2. Instance counts of each category in the DOTA-v2.0 dataset.

Table 3. Instance counts of each category in the SODA-A dataset.

Table 4. Number of instances per category in the VisDrone dataset.

Table 5. Comparative experimental results on the DOTA-v2.0 dataset. Each column represents a category. Results highlighted in red and blue represent the best and second-best performance in each column, respectively.

Table 6. Comparative experimental results on the SODA-A dataset. Results highlighted in red and blue represent the best and second-best performance in each column, respectively.

Table 7. Main results of the comparison test on the VisDrone dataset. Results highlighted in red and blue represent the best and second-best performance in each column, respectively.

Table 8. The impact of the number of large kernel decompositions on the evaluation metrics, assuming a theoretical receptive field of 23. The best metrics are indicated in bold.

Table 9. Effectiveness experiments on the individual proposed blocks. √ indicates the inclusion of the block, and bold indicates the best result.