MSCAC: A Multi-Scale Swin–CNN Framework for Progressive Remote Sensing Scene Classification

Abstract: Recent advancements in deep learning have significantly improved the performance of remote sensing scene classification, a critical task in remote sensing applications. This study presents a new aerial scene classification model, the Multi-Scale Swin–CNN Aerial Classifier (MSCAC), which employs the Swin Transformer, an advanced architecture that has demonstrated exceptional performance in a range of computer vision applications. The Swin Transformer leverages shifted window mechanisms to efficiently model long-range dependencies and local features in images, making it particularly suitable for the complex and varied textures in aerial imagery. The model is designed to capture intricate spatial hierarchies and diverse scene characteristics at multiple scales. A framework is developed that integrates the Swin Transformer with a multi-scale strategy, enabling the extraction of robust features from aerial images of different resolutions and contexts. This approach allows the model to effectively learn from both global structures and fine-grained details, which is crucial for accurate scene classification. The model's performance is evaluated on several benchmark datasets, including UC-Merced, WHU-RS19, RSSCN7, and AID, where it demonstrates superior or comparable accuracy to state-of-the-art models. The MSCAC model's adaptability to varying amounts of training data and its ability to improve with increased data make it a promising tool for real-world remote sensing applications. This study highlights the potential of integrating advanced deep-learning architectures like the Swin Transformer into aerial scene classification, paving the way for more sophisticated and accurate remote sensing systems. The findings suggest that the proposed model has significant potential for various remote sensing applications, including land cover mapping, urban planning, and environmental monitoring.


Introduction
Scene classification, which aims to derive contextual details and allocate a categorical label to a specific image, has garnered considerable interest in the domain of intelligent interpretation in remote sensing (RS). However, the considerable variation within classes and the minimal distinction between classes in RS images create difficulties for precise scene classification. Current methods for scene classification in RS can be broadly categorized into two groups: (1) handcrafted feature-based methods and (2) deep-learning-based methods. Methods based on handcrafted features depend on manually crafted features like Gabor filters, local binary patterns (LBPs), and the bag of visual words (BoVW). These approaches often experience suboptimal classification results owing to their constrained representational ability. Deep-learning-based methods, on the other hand, have demonstrated a superior ability to automatically learn discriminative features from large image datasets.
Despite their promise, deep-learning-based methods face several challenges. One significant issue is the considerable variation within classes, where scenes of the same category can look very different due to changes in lighting, weather, and seasonal effects.
Another challenge is the minimal distinction between classes; different scene categories can appear visually similar, making it difficult for models to differentiate between them. Additionally, the scale variability of objects and the complexity of spatial arrangements in aerial images further complicate the classification task.
The shift from traditional handcrafted feature-based methods to CNNs and Transformer-based approaches has significantly advanced models' capabilities to extract complex spatial patterns and comprehend large-scale aerial images. The Transformer, initially developed for sequence modeling and transduction tasks and recognized for its implementation of attention mechanisms, has shown outstanding capabilities in the field of computer vision [1]. Dosovitskiy et al. [2] utilized the Transformer architecture on distinct, non-overlapping image segments for image categorization, and the Vision Transformer (ViT) demonstrated superior performance in image classification relative to CNNs. Bazi et al. [3] adapted the Transformer architecture to enhance the accuracy of scene classification in RS. Unlike CNNs, the Transformer demonstrates a superior ability to capture the long-range relationships among local features in RS imagery.
Recent research has highlighted the effectiveness of Transformer models, particularly the Swin Transformer, in image classification [4]. This paradigm shift involves the utilization of deep-learning architectures capable of capturing global semantic information, essential for accurately categorizing aerial scenes. The Swin Transformer's innovative architecture allows for a more effective and detailed interpretation of aerial images by addressing the challenges posed by the scale variability of objects and the complexity of spatial arrangements found in such imagery.
Early works in aerial scene classification primarily focused on utilizing handcrafted features to represent the content of scene images. These features were designed based on engineering skills and domain expertise, capturing various characteristics such as color, texture, shape, spatial, and spectral information. Some of the most representative handcrafted features used in early works include color histograms [5], texture descriptors [6][7][8], the global image similarity transformation (GIST) [9], the scale-invariant feature transform (SIFT) [10], and the histogram of oriented gradients (HOG) for scene classification.
As the field progressed, the focus shifted towards developing more generalizable and automated feature extraction methods, leading to the adoption of deep-learning techniques. Deep-learning models, with their ability to learn hierarchical feature representations directly from data, have significantly advanced the state of the art in aerial scene classification, overcoming many limitations of early handcrafted features. Notably, in 2006, Hinton and Salakhutdinov achieved a significant advancement in deep feature learning [11]. Since then, researchers have sought to replace handcrafted features with trainable multi-layer networks, demonstrating a remarkable capability in feature representation for various applications, including scene classification in remote sensing images. These deep-learning features, automatically extracted from data using deep-architecture neural networks, offer a significant advantage over traditional handcrafted features that require extensive engineering skills and domain knowledge. With multiple processing layers, deep-learning models capture robust and diverse abstractions of data representation, proving highly effective in uncovering complex patterns and distinguishing characteristics in high-dimensional data. Currently, various deep-learning models are available, including deep belief networks (DBNs) [12], deep Boltzmann machines (DBMs) [13], the multiscale convolutional autoencoder [14], the stacked autoencoder (SAE) [15], CNNs [16][17][18][19][20], the bag of convolutional features (BoCF) [21], and so on.
The field of aerial scene classification has evolved significantly, transitioning from handcrafted feature-based methods to advanced deep-learning techniques. Table 1 summarizes notable deep-learning-based studies in aerial scene classification, highlighting the authors, methodologies, and datasets used in their research.

Table 1. Notable deep-learning-based studies in aerial scene classification.

Authors | Methodology | Dataset Used
Sheppard and Rahnemoonfar [22] | Deep CNN for UAV imagery | Custom UAV dataset
Yu and Liu [23] | Two-stream deep feature fusion model | UCM, AID, and NWPU-RESISC45
Ye et al. [24] | Parallel multi-stage (PMS) architecture | UCM and AID
Sen and Keles [25] | Hierarchically designed CNN | NWPU-RESISC45

Sheppard and Rahnemoonfar [22] focused on the instantaneous interpretation of UAV imagery through deep CNNs, achieving high accuracy in real-time applications. Yu and Liu [23] proposed texture- and saliency-coded two-stream deep architectures, enhancing classification accuracy. Ye et al. [24] proposed hybrid CNN features for aerial scene classification, combined with an ensemble extreme-learning machine classifier, achieving remarkable performance. Sen and Keles [25] developed a hierarchically designed CNN model, achieving high accuracy on the NWPU-RESISC45 dataset. Anwer et al. [26] explored the significance of color within deep-learning frameworks for aerial scene classification, demonstrating that the fusion of several deep color models significantly improves recognition performance. Huang et al. [27] introduced a Task-Adaptive Embedding Network (TAE-Net) for few-shot remote sensing scene classification, designed to adapt to different tasks with limited labeled samples. Wang et al. [28] proposed the Channel-Spatial Depthwise Separable (CSDS) network, incorporating a channel-spatial attention mechanism, and El-Khamy et al. [29] developed a CNN model using wavelet transform pooling for multi-label RS scene classification.
Recent advancements in remote sensing scene classification have leveraged transformer-based architectures and multi-scale feature integration for enhanced performance. Zhao and Li [30] introduced the Remote Sensing Transformer (TRS), which integrates self-attention into ResNet and employs pure Transformer encoders for improved classification performance. Alhichri et al. [31] utilized an EfficientNet-B3 CNN with attention, showing strong capabilities in classifying RS scenes. Guo et al. [32] proposed a GAN-based semi-supervised scene classification method. Wang et al. [33] developed a two-stream Swin Transformer network that uses both original and edge stream features to enhance classification accuracy. Hu and Liu [34] proposed the triplet-metric-guided multi-scale attention (TMGMA) method, enhancing salient features while suppressing redundant ones. Zhou and Huang [35] proposed a lightweight dual-branch Swin Transformer combining ViT and CNN branches to improve scene feature discrimination and reduce computational consumption. Thapa et al. [36] reviewed CNN-based, Vision Transformer (ViT)-based, and Generative Adversarial Network (GAN)-based architectures. Chen et al. [37] developed BiShuffleNeXt. Wang et al. [38] introduced the frequency and spatial-based multi-layer attention network (FSCNet). Sivasubramanian et al. [39] proposed a transformer-based convolutional neural network, evaluated on multiple datasets. Shang and Ye [40] improved Swin-Transformer-based models for object detection and segmentation, showing significant improvements in small object detection and edge detail segmentation.
This literature survey highlights significant advancements in deep-learning-based aerial scene classification, showcasing diverse techniques and their continuous evolution. Approaches like deep color model fusion, texture- and saliency-coded two-stream architectures, and CNN-based models have notably improved accuracy. Hybrid CNN features and few-shot classification networks like TAE-Net have shown remarkable results. Real-time UAV image interpretations with deep CNNs have also achieved high accuracy. Techniques incorporating channel-spatial attention mechanisms and wavelet transform pooling layers have enhanced multi-label classification. Transformer-based architectures and multi-scale feature integration have further advanced the field. Recent contributions include EfficientNet-B3 with attention, GAN-based semi-supervised classification, two-stream Swin Transformer networks, and triplet-metric-guided multi-scale attention methods. Lightweight dual-branch Swin Transformers combining ViT and CNN branches have also been introduced to improve feature discrimination and reduce computational consumption. These advancements pave the way for our proposed methodology, which combines CNNs and Swin Transformers to address existing classification challenges effectively.

Overview of the Proposed Model
In this study, a novel deep-learning-based aerial scene classification model, the Multi-Scale Swin-CNN Aerial Classifier (MSCAC), is proposed. This model leverages the Swin Transformer for global feature extraction [4] and a CNN for multi-scale local feature extraction [16]. By combining the strengths of the Swin Transformer and a feature pyramid CNN, MSCAC accurately classifies aerial scenes, considering the diverse scales and intricate spatial arrangements inherent in such imagery. This approach merges the benefits of multilevel convolutional models with the Transformer architecture, enabling efficient handling of both local and global feature extraction. This capability is particularly valuable in aerial image classification, where understanding both local and global contexts is crucial. The novelty of our MSCAC model lies in this unique integration of CNNs and Swin Transformers, an approach not extensively explored in aerial image classification and not addressed in the existing literature. Figure 1 provides an overview of the proposed model, illustrating its architectural components and the integration of the Swin Transformer with the deep-learning framework.

Materials
This study tested the deep-learning-based aerial scene categorization algorithm on four public datasets. Each dataset contains a diverse set of scene classes, samples per class, image sizes, and spatial resolutions, along with specific challenging factors that make aerial scene classification a complex task. The datasets used in our study are presented in Table 2.


Research Challenges in Aerial Image Classification
In aerial image classification, significant research gaps exist due to various challenges. Advanced techniques are required to handle variability in spatial resolution, ensuring accurate feature extraction and recognition across different datasets. Current models need to adapt better to diverse imaging conditions, such as changes in lighting, weather, and seasons, necessitating the development of robust algorithms for consistent accuracy. Addressing scale and orientation variations is another critical gap; there is a need for algorithms capable of normalizing or adapting to these variations for uniform classification performance. Furthermore, managing intra-class diversity, particularly in datasets like AID with high variability in images from different locations, seasons, and conditions, remains a substantial challenge. Developing models that can generalize across such diverse images is essential for enhancing the accuracy and applicability of aerial image classification systems.



Local Feature Extraction
During the model's local feature extraction phase, CNNs are employed to derive fine-grained features from the input aerial imagery. This process involves applying a series of convolutional layers to the input image, each consisting of a convolution operation followed by a non-linear activation function and often a pooling operation. The convolution operation involves sliding a set of learnable filters (or kernels) over the input image to capture local patterns such as edges, textures, and shapes. Non-linearity from the activation function lets the model learn more complicated characteristics, while the pooling operation reduces the spatial dimensions of the feature maps, leading to a more compact representation.
Mathematically, the local feature extraction process can be described by Equation (1):

F_i^l = Pool(ReLU(W_i^l * X + b_i^l))    (1)

where F_i^l represents the feature map from the ith convolutional block at layer l, W_i^l and b_i^l are the weights and biases of the convolutional layer, X is the input image or the feature map from the previous layer, ReLU is the Rectified Linear Unit activation function, Pool is the pooling operation, and * denotes the convolution operation.
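Equation (1) can be sketched as a single PyTorch module. This is an illustrative block, not the authors' exact layer configuration: the kernel size, channel counts, and pooling choice here are assumptions.

```python
# One convolutional block implementing Equation (1):
#   F_i^l = Pool(ReLU(W_i^l * X + b_i^l))
# Kernel size 3, max pooling, and the channel counts are illustrative choices.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # W_i^l, b_i^l
        self.pool = nn.MaxPool2d(2)  # halves the spatial dimensions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(torch.relu(self.conv(x)))  # Pool(ReLU(conv(x)))

x = torch.randn(1, 3, 224, 224)   # a single RGB aerial image
f1 = ConvBlock(3, 64)(x)          # first-level feature map F_1^1
print(f1.shape)                   # torch.Size([1, 64, 112, 112])
```

Stacking several such blocks yields the feature maps at different levels that are fused in the next stage.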

Multilevel Feature Fusion
This stage combines CNN features from different levels. The fusion process is designed to capture information at various scales and spatial resolutions. By combining features from multiple levels, the model creates a more comprehensive representation, enhancing its ability to accurately classify aerial images. The fusion of features from each convolutional block is achieved using a weighted sum, expressed mathematically as Equation (2):

F_fused = Σ_{i=1}^{n} α_i F_i^l    (2)

where F_fused is the fused feature representation, n is the number of convolutional blocks, and α_i are the fusion weights for each feature map F_i^l. After fusion, the features are subjected to global average pooling to produce a feature vector, as represented by Equation (3):

F_GAP = (1 / (H × W)) Σ_{h=1}^{H} Σ_{w=1}^{W} F_fused(h, w)    (3)

where F_GAP is the feature vector after global average pooling, H and W are the height and width of the feature map, and F_fused(h, w) is the value of the fused feature map at position (h, w).
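A minimal numpy sketch of Equations (2) and (3). It assumes the per-level feature maps have already been brought to a common spatial size (e.g. by upsampling), and the fusion weights α_i used here are hypothetical values.

```python
# Weighted-sum fusion (Eq. 2) followed by global average pooling (Eq. 3).
import numpy as np

def fuse_and_pool(feature_maps, alphas):
    # Equation (2): F_fused = sum_i alpha_i * F_i^l (maps share shape (C, H, W))
    f_fused = sum(a * f for a, f in zip(alphas, feature_maps))
    # Equation (3): average over the H x W spatial grid -> one value per channel
    h, w = f_fused.shape[-2:]
    f_gap = f_fused.sum(axis=(-2, -1)) / (h * w)
    return f_fused, f_gap

maps = [np.ones((8, 4, 4)), 2 * np.ones((8, 4, 4))]  # two levels, 8 channels each
f_fused, f_gap = fuse_and_pool(maps, alphas=[0.5, 0.5])
print(f_gap.shape)  # (8,) -- one pooled value per channel
```

With equal weights 0.5 the fused map is 1.5 everywhere, so every channel of F_GAP equals 1.5, which makes the two steps easy to verify by hand.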

Global Feature Extraction
In the global feature extraction stage of the MSCAC, the model employs a Swin-Transformer-based encoder to capture the global context and dependencies within aerial images. This stage is crucial for understanding the overall structure and relationships between different parts of the scene.
Given an input image X ∈ R^(H×W×C), where H, W, and C are the height, width, and number of channels, respectively, the image is first partitioned into non-overlapping patches x_p ∈ R^(N×(P²·C)). Here, N = (H/P) × (W/P) represents the number of patches, and P is the patch size. Each patch is then linearly embedded into a D-dimensional feature vector x_embed.
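The patch partition and linear embedding can be sketched with plain numpy reshapes. The patch size P and embedding dimension D below are illustrative choices, and the embedding matrix is random rather than learned.

```python
# Partition an H x W x C image into N = (H/P)*(W/P) patches, then embed to D dims.
import numpy as np

H, W, C, P, D = 224, 224, 3, 4, 96
rng = np.random.default_rng(0)
img = rng.standard_normal((H, W, C))

# Non-overlapping P x P patches, each flattened to P*P*C values.
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)   # shape (N, P^2 * C)
N = (H // P) * (W // P)

# Linear embedding of each patch to a D-dimensional vector x_embed.
E = rng.standard_normal((P * P * C, D))    # stands in for the learned projection
x_embed = patches @ E                      # shape (N, D)
print(N, x_embed.shape)                    # 3136 (3136, 96)
```

In practice this projection is a learned layer (often implemented as a strided convolution), but the tensor shapes are exactly as above.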
The embedded patches are then passed through L layers of the Swin Transformer, where each layer comprises a Multi-Head Self-Attention (MHSA) block and a Multilayer Perceptron (MLP) block. These layers are equipped with residual connections and Layer Normalization (LN), and can be represented by Equation (4):

x̂_l = MHSA(LN(x_l)) + x_l,  x_(l+1) = MLP(LN(x̂_l)) + x̂_l    (4)

where x_l and x_(l+1) denote the outputs of the l-th and (l + 1)-th layers, respectively, and x̂_l is the intermediate output after the attention block. The MHSA block allows the model to focus on different parts of the input sequence, capturing various aspects of the relationships between elements by computing attention weights and producing the output through a weighted sum of values; attending through multiple heads enhances the model's ability to capture complex dependencies and interactions within the input data. The MLP block further refines these representations through two linear transformations with a Gaussian Error Linear Unit (GeLU) activation function, allowing for non-linear transformations of the input features. The MHSA mechanism is crucial for understanding the global context and relationships in aerial imagery, making it a key component of the Swin-Transformer-based encoder in the MSCAC model; the process at the l-th layer is illustrated in Figure 3.
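A hedged sketch of one pre-norm layer from Equation (4), using PyTorch's built-in multi-head attention. The actual Swin layer restricts attention to (shifted) local windows; that windowing is omitted here for brevity, so the block below is the generic layer form, applied to one 7×7 window of tokens.

```python
# One pre-norm Transformer layer: x_hat = MHSA(LN(x)) + x; out = MLP(LN(x_hat)) + x_hat
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, dim: int = 96, heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))  # two linear maps + GeLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = self.mhsa(h, h, h, need_weights=False)[0] + x  # attention + residual
        return self.mlp(self.ln2(x)) + x                   # MLP + residual

tokens = torch.randn(1, 49, 96)        # one 7x7 window of patch embeddings
out = TransformerLayer()(tokens)
print(out.shape)                       # torch.Size([1, 49, 96])
```

The residual connections mean the layer preserves the token shape, so L such layers can be stacked directly.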
After processing through L layers, the final output x_L represents the global feature representation F_global of the input image. Hence, the global feature extraction using the Swin Transformer can be summarized as presented by Equation (5):

F_global = x_L = SwinEncoder(x_embed)    (5)

This equation encapsulates the entire process of global feature extraction using the Swin Transformer in the MSCAC model, which is crucial for understanding the complex spatial relationships and structures present in aerial imagery.

Feature Fusion and Classification
The final step in our model involves fusing the local and global features to create a comprehensive feature representation, which is then used for classification. The local and global features are fused as presented by Equation (6):

F_final = Concat(F_GAP, F_global)    (6)

where F_final is the final fused feature vector, and Concat denotes the concatenation operation.
The fused feature vector is then passed through a fully connected dense layer for classification as presented by Equation (7):

y = Softmax(W_c · F_final + b_c)    (7)

where y is the output classification vector, W_c and b_c are the weights and biases of the classifier head, and Softmax is the activation function applied to the output of the linear transformation of the fused feature vector.
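Equations (6) and (7) can be sketched in a few lines of PyTorch. The feature dimensions below are illustrative, not the model's actual sizes; 21 output classes matches the UC-Merced dataset used later.

```python
# Concatenate local (F_GAP) and global (F_global) features, then classify.
import torch
import torch.nn as nn

f_gap = torch.randn(1, 256)                    # local feature vector (Eq. 3)
f_global = torch.randn(1, 768)                 # global feature vector (Eq. 5)

f_final = torch.cat([f_gap, f_global], dim=1)  # Eq. (6): concatenation
classifier = nn.Linear(f_final.shape[1], 21)   # dense layer with W_c, b_c
y = torch.softmax(classifier(f_final), dim=1)  # Eq. (7): class probabilities

print(y.shape)                                 # torch.Size([1, 21])
```

Because of the Softmax, the 21 entries of y are non-negative and sum to one, so the predicted class is simply the argmax.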
The proposed model, MSCAC, is designed to overcome the challenges in aerial image classification by leveraging the strengths of both local and global feature extraction techniques. The integration of CNNs allows for effective local feature extraction, crucial for handling spatial resolution variability, enabling the model to adapt to different resolutions by capturing detailed texture and structural information at various scales. The Swin Transformer, a key component of our model, excels in global feature extraction, capturing long-range dependencies and contextual information, making the model robust to changes in lighting, weather, and seasonal conditions. The multi-scale approach ensures that features are extracted at different levels, allowing the model to normalize or adapt to scale and orientation variations through the fusion of multilevel features, enhancing its ability to recognize aerial scenes regardless of their scale or orientation. This comprehensive feature extraction approach allows the model to generalize across diverse images within the same class, providing a balanced and effective combination of local and global feature extraction, leading to superior classification performance.

Results and Discussion
This section presents the experimental results from the MSCAC on various aerial image datasets. The efficacy of the suggested model is evaluated against that of current leading-edge models through the use of confusion matrices and overall accuracy (OA). Additionally, the implications of the findings are discussed, highlighting the strengths and potential limitations of the MSCAC model in the context of aerial scene classification.

Experimental Setup
To increase the variety of the training dataset, data augmentation methods such as rotation, flipping, and scaling were utilized. The MSCAC model was implemented using the PyTorch 2.0 framework and trained on a GPU-enabled system. Training was conducted using the Adam optimizer with an initial learning rate of 1 × 10⁻⁴ and a weight decay of 1 × 10⁻⁵. The learning rate was adjusted according to the validation loss, and early stopping was implemented to avoid overfitting.
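The training configuration described above can be sketched as follows. The placeholder model, the stand-in validation losses, and the early-stopping patience are assumptions; only the optimizer, learning rate, weight decay, and validation-loss-driven scheduling come from the text.

```python
# Adam with lr 1e-4 and weight decay 1e-5, lr reduced on validation-loss
# plateau, plus a simple early-stopping counter.
import torch
import torch.nn as nn

model = nn.Linear(10, 4)  # placeholder for the MSCAC network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(3):                 # stand-in training loop
    val_loss = 1.0 / (epoch + 1)       # stand-in validation loss
    scheduler.step(val_loss)           # lower lr when val loss plateaus
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:         # early stopping
        break
```

In a real run, `val_loss` would come from evaluating the model on a held-out split each epoch.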

Evaluation Metrics
In this study, the same data-splitting ratios for training and testing as Xia et al. [44] were followed to ensure fair comparisons with previous results. The performance of the classification was assessed using the overall accuracy (OA) and the confusion matrix. The OA measures the proportion of correctly identified images across the entire dataset, while the confusion matrix provides detailed insights into the classification accuracy for each category. Specific training proportions were applied across the different datasets: 20% and 50% for the RSSCN7 and AID datasets, respectively, and 50% and 80% for the UC-Merced dataset. The WHU-RS19 dataset was partitioned into 40% and 60% splits. To improve the reliability of the results, the division of the dataset was randomized ten times, with the average OA and its standard deviation being reported.
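A minimal numpy illustration of the two metrics on toy labels: the confusion matrix counts predictions per true class, and OA is the fraction on its diagonal.

```python
# Overall accuracy (OA) and confusion matrix for a small toy example.
import numpy as np

def evaluate(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                   # rows: true class, cols: predicted class
    oa = np.trace(cm) / cm.sum()        # proportion classified correctly
    return oa, cm

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
oa, cm = evaluate(y_true, y_pred, n_classes=3)
print(oa)  # 4 of 6 correct -> 0.666...
```

Repeating this over the ten randomized splits and averaging the OA values gives the mean ± standard deviation figures reported in the tables.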

Comparison with State-of-the-Art Models
In the evaluation, the MSCAC was benchmarked against several state-of-the-art models on the AID dataset, including SIFT, BoVW(SIFT), CaffeNet, GoogLeNet, and VGG-VD-16, as referenced from Xia et al. [44]. The proposed MSCAC model demonstrated superior performance, surpassing these models in overall accuracy.

UC-Merced Dataset
The UC-Merced Land Use Dataset consists of 21 classes representing diverse urban and natural landscapes. This dataset poses challenges in classification due to the subtle differences in texture, color, and density among similar classes. Visualization of these classes is crucial for understanding the complexity of land use classification. Figure 4 provides a visual representation of sample images from each category, illustrating the dataset's diversity.
The results presented in Table 3 show the OA of various deep-learning models on the UC-Merced dataset with different training proportions. The proposed MSCAC demonstrates competitive performance when compared to existing state-of-the-art models. With 50% of the data used for training, the MSCAC model achieves an OA of 94.01%, slightly lower than the best-performing VGG-VD-16 model at 94.14%. However, it is important to note that the MSCAC model outperforms other well-known architectures such as GoogLeNet and CaffeNet, which achieve OAs of 92.70% and 93.98%, respectively. This indicates the effectiveness of the MSCAC model in capturing both local and global features of aerial images. When the training proportion is increased to 80%, the MSCAC model achieves an OA of 94.67%, closer to the performance of the VGG-VD-16 model at 95.21%. This suggests that the MSCAC model benefits from additional training data, further narrowing the gap with the best-performing model. It is noteworthy that traditional feature extraction methods like SIFT and BoVW(SIFT) lag significantly behind deep-learning-based models, with OAs of 32.10% and 74.12%, respectively, at 80% training. This highlights the superiority of deep-learning approaches in handling the complexity and diversity of aerial imagery.

Method              50% Training     80% Training
SIFT [44]           28.92 ± 0.95     32.10 ± 1.95
BoVW(SIFT) [44]     71.90 ± 0.79     74.12 ± 3.30
CaffeNet [44]       93.98 ± 0.67     95.02 ± 0.81
GoogleLeNet [44]    92.70 ± 0.60     94.31 ± 0.89
VGG-VD-16 [44]      94.14 ± 0.69     95.21 ± 1.20
MSCAC (proposed)    94.01 ± 0.93     94.67 ± 0.89

The confusion matrix in Figure 5 provides further insights into the classification accuracy of the MSCAC model. It reveals the model's ability to correctly classify images across various classes, with particular strengths and weaknesses in specific categories. High values along the main diagonal indicate a high true positive rate; for instance, the "Agricultural" and "Airplane" classes have a correct classification rate of 95%, while "Beach" is perfectly classified with a 100% rate. High off-diagonal values indicate classes where the model is most frequently incorrect; for example, "Buildings" has been misclassified as "Dense Residential" and "Tennis Courts" 5% and 10% of the time, respectively. Several classes, such as "Dense Residential," "Freeway," and "Golf Course," have high true positive rates of 95%, 90%, and 90%, respectively, showing the model's strong capability to discern these categories. However, "Tennis Courts" stands out with a lower true positive rate of 70%, with confusion mainly with the "Buildings" and "Dense Residential" classes, possibly indicating similarities that the model struggles to distinguish. The matrix also highlights misclassifications due to subtle differences between classes, such as "Sparse Residential" being mistaken for "Dense Residential" 10% of the time, or "River" being confused with "Forest" 5% of the time. These misclassifications are crucial for understanding the limitations of the current model and suggest areas where additional training data, feature engineering, or model adjustments could lead to improved accuracy.
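The per-class rates discussed above are simply the normalized rows of the confusion matrix (rows are the true class, columns the predicted class). As a minimal sketch, with hypothetical counts rather than the paper's actual UC-Merced matrix, the true positive rate of each class can be computed as:

```python
# Per-class true positive rate from a confusion matrix expressed as nested
# dicts: matrix[true_class][predicted_class] = image count.
# The counts below are illustrative only (20 test images per class assumed).
confusion = {
    "Agricultural":  {"Agricultural": 19, "Beach": 1},
    "Beach":         {"Beach": 20},
    "Tennis Courts": {"Tennis Courts": 14, "Buildings": 3,
                      "Dense Residential": 3},
}

def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly."""
    rates = {}
    for true_cls, preds in matrix.items():
        total = sum(preds.values())
        rates[true_cls] = preds.get(true_cls, 0) / total
    return rates

rates = per_class_accuracy(confusion)
print(rates["Beach"])          # 1.0  (perfectly classified)
print(rates["Tennis Courts"])  # 0.7  (confused with similar classes)
```

Diagonally dominant rows correspond to well-separated classes; mass in off-diagonal cells points to the visually similar class pairs discussed above.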

WHU-RS19 Dataset
The WHU-RS19 dataset is a collection of high-resolution satellite images covering 19 scene categories. It provides a wide range of urban and natural landscapes, making it challenging to differentiate between overlapping features. Figure 6 displays sample images from each category, highlighting the diversity and challenges in classification.

The results presented in Table 4 show the OA of various deep-learning models on the WHU-RS19 dataset with different training proportions. The proposed MSCAC demonstrates impressive performance, outperforming all other models at 60% training. With 40% of the data used for training, the MSCAC model achieves an OA of 94.99%, which is slightly lower than the best-performing VGG-VD-16 model at 95.44%. It surpasses GoogleLeNet (93.12%) and is comparable to CaffeNet (95.11%). This indicates the MSCAC model's capability to effectively capture the complex features of high-resolution satellite images in the WHU-RS19 dataset. When the training proportion is increased to 60%, the MSCAC model showcases its strength by achieving an OA of 96.57%, the highest among all the compared models. This is a significant improvement over traditional feature extraction methods like SIFT and BoVW(SIFT), which have OAs of 27.21% and 80.13%, respectively, at 60% training. The performance of the MSCAC model surpasses even the well-regarded VGG-VD-16 model, which has an OA of 96.05%. These results highlight the effectiveness of the MSCAC model in handling the diverse and challenging landscapes present in the WHU-RS19 dataset. Its superior performance at 60% training suggests that the model is capable of leveraging additional training data to enhance its classification accuracy.
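The overall accuracy (OA) reported throughout Tables 3–6 is the fraction of all test images classified correctly, i.e. the diagonal mass of the (unnormalized) confusion matrix divided by the total count. A minimal sketch with an illustrative 3-class matrix:

```python
# Overall accuracy from a raw-count confusion matrix.
# matrix[i][j] = number of images of true class i predicted as class j.
# The counts are illustrative, not taken from the paper.
def overall_accuracy(matrix):
    correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal hits
    total = sum(sum(row) for row in matrix)                  # all test images
    return correct / total

m = [
    [48, 1, 1],   # class 0: 48/50 correct
    [2, 45, 3],   # class 1: 45/50 correct
    [0, 4, 46],   # class 2: 46/50 correct
]
print(f"OA = {overall_accuracy(m):.2%}")  # OA = 92.67%
```

Note that OA weights every image equally, so classes with more test images dominate the score; the per-class confusion matrices complement it by exposing weak categories.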

The confusion matrix obtained on WHU-RS19 (Figure 7) shows generally high correct classification rates, with minor confusion between some classes, such as "Parking" and "Residential Area". "Industrial Area" has a 90% accuracy, occasionally misidentified as "Parking" or "Viaduct". Such misclassifications are informative, suggesting similarities between classes that the model confuses, and providing insights into potential areas for model refinement. For instance, "Railway Station" was misclassified as "Airport" in 5% of cases. Nevertheless, these misclassification rates are generally low, indicating robust model performance.

RSSCN7 Dataset
The RSSCN7 dataset comprises aerial images categorized into seven scene types. It includes diverse geographical and environmental variations, posing challenges in classification due to fluctuating lighting conditions, seasonal changes, and differing perspectives. Figure 8 showcases sample images from each category, demonstrating the dataset's variety.

The results presented in Table 5 show the OA of various deep-learning models on the RSSCN7 dataset with different training proportions. The proposed MSCAC demonstrates competitive performance, particularly at 20% training. With 20% of the data used for training, the MSCAC model achieves an OA of 84.01%, which is marginally higher than the VGG-VD-16 model at 83.98%. This suggests that the MSCAC model is adept at learning efficiently from a restricted dataset, outperforming GoogleLeNet (82.55%), although CaffeNet leads at this proportion with 85.57%. The performance of MSCAC is particularly noteworthy given that traditional feature extraction methods like SIFT and BoVW(SIFT) have significantly lower OAs of 28.45% and 76.33%, respectively. When the training proportion is increased to 50%, the MSCAC model achieves an OA of 87.37%, slightly higher than the VGG-VD-16 model at 87.18%. This suggests that the MSCAC model continues to perform well with additional training data, maintaining its competitive edge. However, it is important to note that CaffeNet achieves the highest OA of 88.25% in this scenario, indicating that there is still room for improvement for the MSCAC model. The results suggest that the MSCAC model is a strong contender in aerial scene classification, particularly in datasets like RSSCN7 that include diverse geographical and environmental variations.

Method              20% Training     50% Training
BoVW(SIFT) [44]     76.33 ± 0.88     81.34 ± 0.55
CaffeNet [44]       85.57 ± 0.95     88.25 ± 0.62
GoogleLeNet [44]    82.55 ± 1.11     85.84 ± 0.92
VGG-VD-16 [44]      83.98 ± 0.87     87.18 ± 0.94
MSCAC (proposed)    84.01 ± 0.34     87.37 ± 0.67

The confusion matrix presented in Figure 9 demonstrates the model's strong ability to accurately distinguish most classes, with a particularly high accuracy for Grass, Field, Industry, River/Lake, Forest, and Parking, all achieving correct classification rates above 0.88. However, the misclassifications observed, especially between visually similar categories such as Grass with Field and Industry with Parking, point to areas where the model's discriminative power could be further honed.

Aerial Image Dataset (AID)
The Aerial Image Dataset (AID) is a benchmark dataset for aerial scene classification, containing thousands of images across 30 classes, including urban areas, agricultural lands, forests, rivers, industrial zones, residential areas, airports, harbors, beaches, stadiums, and parks, among others. Each class captures distinct features for applications in urban planning and environmental monitoring.
The results presented in Table 6 show the OA of various deep-learning models on the AID dataset with different training proportions. The proposed MSCAC demonstrates competitive performance, especially at 50% training. With 20% of the data used for training, the MSCAC model achieves an OA of 87.45%, which is higher than both the CaffeNet and VGG-VD-16 models, which have OAs of 86.86% and 86.59%, respectively. This demonstrates that the MSCAC model can effectively learn from a small dataset, outperforming other well-known architectures. The performance of MSCAC is particularly noteworthy given that traditional feature extraction methods like SIFT and BoVW(SIFT) have significantly lower OAs of 13.50% and 61.40%, respectively. When the training proportion is increased to 50%, the MSCAC model achieves an OA of 89.90%, which is slightly higher than the VGG-VD-16 model at 89.64% but lower than the CNNs model by Yu and Liu (2018a) at 94.17%. This suggests that, while the MSCAC model continues to perform well with additional training data, there is still room for improvement to reach the performance level of the best-performing model in this dataset.

Method              20% Training     50% Training
SIFT [44]           13.50 ± 0.67     16.76 ± 0.65
BoVW(SIFT) [44]     61.40 ± 0.41     67.65 ± 0.49
CaffeNet [44]       86.86 ± 0.47     89.53 ± 0.31
GoogleLeNet [44]    83.44 ± 0.40     86.39 ± 0.55
VGG-VD-16 [44]      86.59 ± 0.29     89.64 ± 0.36
CNNs [45]           —                94.17 ± 0.32
MSCAC (proposed)    87.45 ± 0.92     89.90 ± 0.78

Figure 11 shows the confusion matrix obtained by the MSCAC model on the AID dataset. The class "Airport" has a high classification accuracy of 89%, but there are instances where it is confused with other categories such as "Bare Land" and "Commercial". "Baseball Field" and "Beach" show high accuracies at 96% and 98%, respectively, with minor confusion with other classes. "Bridge" shows a 95% accuracy, with a small percentage of confusion with "Center", "Desert", "Dense Residential", and "Square". Several classes have a high classification accuracy, such as "Farmland", "Forest", and "Park", all scoring above 95%. These high accuracies suggest that the model is effective at distinguishing features specific to these categories. Some classes have notable confusion with others; for instance, "Center" is occasionally mistaken for "Church", "Commercial", and "Square". "Industrial" has been confused with "Meadow" and "Dense Residential". Classes with a lower accuracy, such as "River", "School", "Sparse Residential", and "Square", are often confused with multiple other categories. This could indicate that these classes have similar features or that the model lacks the nuanced differentiation to accurately classify them, suggesting a need for further model training or feature engineering for these specific classes. Classes like "Stadium" and "Storage Tanks" show a high accuracy with occasional confusion, which may be due to unique features that are not always distinct. The class "Viaduct" has a near-perfect classification accuracy, suggesting the model is adept at identifying its specific characteristics.

Overall, the experimental results demonstrate that the proposed MSCAC model is a competitive and effective tool for aerial scene classification, showing superior or comparable performance to existing state-of-the-art models across different datasets. The MSCAC model's strength lies in its ability to capture both local and global features of aerial images, which is crucial for accurately classifying diverse and complex scenes. This is evidenced by its performance on the UC-Merced, WHU-RS19, RSSCN7, and AID datasets, where it achieved a high overall accuracy and demonstrated improvements over traditional feature extraction methods and other deep-learning architectures. Its robustness and adaptability make the MSCAC model suitable for disaster response, urban planning, agricultural monitoring, environmental observation, and security surveillance. However, its high computational complexity may require significant resources for training and inference, limiting its use in resource-constrained environments.

Conclusions
In this study, we proposed the Multi-Scale Swin-CNN Aerial Classifier (MSCAC) for aerial image classification and compared its performance with existing models, including CaffeNet, VGG-VD-16, and GoogleLeNet. The results demonstrate the effectiveness of our model across various datasets. MSCAC achieved an accuracy of 94.01 ± 0.93 on the UC-Merced dataset (50% training), performing on par with VGG-VD-16 (94.14 ± 0.69) and surpassing GoogleLeNet (92.70 ± 0.60). On the WHU-RS19 dataset, MSCAC reached 96.57% at 60% training, the highest among the compared models, outperforming VGG-VD-16 (96.05%) and GoogleLeNet. For the RSSCN7 dataset, MSCAC achieved 87.37 ± 0.67 at 50% training, surpassing both VGG-VD-16 (87.18 ± 0.94) and GoogleLeNet (85.84 ± 0.92). Lastly, on the AID dataset, MSCAC achieved 89.90 ± 0.78 at 50% training, slightly ahead of VGG-VD-16 (89.64 ± 0.36) and significantly outperforming GoogleLeNet (86.39 ± 0.55). These results indicate that MSCAC is competitive with state-of-the-art models and, in some cases, provides superior performance. The integration of CNNs for local feature extraction and the Swin Transformer for global feature extraction allows MSCAC to capture a rich representation of aerial imagery, leading to improved classification accuracy. The model's ability to handle spatial resolution variability, diverse imaging conditions, scale and orientation variations, and intra-class diversity makes it a promising tool for aerial image classification tasks.

Figure 1 .
Figure 1. Overview of the proposed framework for remote sensing scene classification.

2.4. Multi-Scale Swin-CNN Aerial Classifier (MSCAC)
In this section, the architecture of the proposed deep-learning model, the Multi-Scale Swin-CNN Aerial Classifier (MSCAC), designed for the classification of aerial images, is detailed. The model architecture is meticulously crafted to leverage the strengths of both local and global feature extraction techniques, ensuring a comprehensive understanding of the aerial scenes. By integrating CNNs for local feature extraction with the advanced Swin Transformer for global feature extraction, the aim is to capture a rich representation of aerial imagery. The following subsections provide a mathematical description of the model's components, including the extraction of local features, the fusion of multilevel features, the extraction of global features, and the final classification process. For a visual representation of the detailed architecture, refer to Figure 2.
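The data flow described above can be sketched at the shape level. The sketch below uses random arrays as stand-ins for the real CNN and Swin Transformer outputs, and every dimension (the stage channel counts, the 768-dimensional global vector, the 21-class head) is an illustrative assumption, not the paper's actual configuration:

```python
import numpy as np

# Shape-level sketch of the MSCAC pipeline: multi-scale local feature maps
# are pooled and fused with a global (Swin-style) feature vector before
# classification. Random arrays stand in for the learned feature extractors.
rng = np.random.default_rng(0)

# Multi-scale local features from three CNN stages: (channels, height, width)
local_maps = [rng.standard_normal(s) for s in [(64, 56, 56),
                                               (128, 28, 28),
                                               (256, 14, 14)]]

# Global-average-pool each scale to a vector, then concatenate (multilevel fusion)
local_vec = np.concatenate([m.mean(axis=(1, 2)) for m in local_maps])  # (448,)

# Global feature vector from the Swin Transformer branch (dimension assumed)
global_vec = rng.standard_normal(768)

# Fuse local and global representations and classify over 21 scene classes
fused = np.concatenate([local_vec, global_vec])    # (1216,)
W = rng.standard_normal((21, fused.size)) * 0.01   # stand-in linear classifier
logits = W @ fused
print(fused.shape, logits.shape)  # (1216,) (21,)
```

The key design point is that pooling each CNN stage before concatenation lets features at different spatial resolutions contribute on equal footing to the fused representation.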

Figure 2 .
Figure 2. Architecture of the Multi-Scale Swin-CNN Aerial Classifier for scene classification.

Figure 3.
Figure 3. Flowchart for the Multi-Head Self-Attention (MHSA) block mechanism.

2.4.4. Feature Fusion and Classification
The final step in our model involves fusing the local and global features to create a comprehensive feature representation, which is then used for classification, as presented by Equation (6).
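The MHSA block flowcharted in Figure 3 is, at its core, scaled dot-product attention applied per head over the tokens of a window. A minimal sketch follows, with simplifying assumptions: identity projections instead of learned Q/K/V weights, no relative position bias or window shifting, and illustrative dimensions:

```python
import numpy as np

# Minimal single-window multi-head self-attention in the spirit of Figure 3.
# Real Swin blocks add learned Q/K/V/output projections, relative position
# bias, and shifted windows; this sketch keeps only the attention core.
def mhsa(x, num_heads):
    """x: (tokens, dim) -> (tokens, dim) via scaled dot-product attention."""
    n, d = x.shape
    dh = d // num_heads
    out = np.empty_like(x)
    for h in range(num_heads):
        q = k = v = x[:, h * dh:(h + 1) * dh]  # identity projections (sketch)
        scores = q @ k.T / np.sqrt(dh)         # (n, n) attention logits
        a = np.exp(scores - scores.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)     # softmax over keys
        out[:, h * dh:(h + 1) * dh] = a @ v    # weighted sum of values
    return out

tokens = np.random.default_rng(1).standard_normal((49, 96))  # 7x7 window, C=96
y = mhsa(tokens, num_heads=3)
print(y.shape)  # (49, 96)
```

Because attention is computed within a window of 49 tokens rather than over the whole image, the cost stays linear in image size, which is what makes the Swin design practical for large aerial scenes.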

Figure 4 .
Figure 4. Representative samples from the UC-Merced Land Use Dataset, illustrating the dataset's diversity.

Figure 5 .
Figure 5. Confusion matrix obtained by the proposed MSCAC model on the UC-Merced dataset.

Figure 7 .
Figure 7. Confusion matrix obtained by the proposed MSCAC model on the WHU-RS19 dataset.

Figure 8 .
Figure 8. Sample images from the RSSCN7 dataset.

Misclassifications involving Resident areas also hint at potential overlaps in features with the Industry and Parking classes. Despite a training set that is only half the total dataset, the model demonstrates a strong ability to differentiate between the classes, but there is a clear opportunity for model improvement, possibly by enhancing the training with additional data augmentation or advanced feature extraction techniques.

Geographies 2024, 4, FOR PEER REVIEW

Figure 9 .
Figure 9. Confusion matrix obtained by the proposed MSCAC model on the RSSCN7 dataset.

3.3.4. Aerial Image Dataset (AID)
Figure 10 displays sample images from each class, highlighting the dataset's diversity and the challenges in classification due to variations in lighting and camera angles.

Figure 10 .
Figure 10. Sample images from the AID dataset.

Figure 11 .
Figure 11. Confusion matrix obtained by the proposed MSCAC model on the AID dataset.

Table 1 .
Summary of notable deep-learning approaches in aerial scene classification.

Table 2 .
Detailed information on the aerial image datasets used.

Table 3 .
Overall accuracy (OA) of various deep-learning models on the UC-Merced dataset.

Table 4 .
Overall accuracy (OA) of various deep-learning models on the WHU-RS19 dataset.

Table 5 .
Overall accuracy (OA) of various deep-learning models on the RSSCN7 dataset.

Table 6 .
Overall accuracy (OA) of various deep-learning models on the AID dataset.
