Few-Shot Image Classification Based on Swin Transformer + CSAM + EMD

Abstract: In few-shot image classification (FSIC), the feature extraction module of traditional convolutional neural networks is constrained by the local nature of the convolutional kernel, making it difficult to model global information and long-distance dependencies effectively. To address this problem, this paper proposes an innovative FSIC method, STCE, which integrates the Swin Transformer, a channel and spatial attention module (CSAM), and the Earth Mover's Distance (EMD). We use the Swin Transformer network to extract image features, weight the output feature map with the CSAM attention mechanism, and adopt the EMD algorithm to generate the optimal matching flow between structural units, minimizing the matching cost. This yields a more precise measure of the classification distance between images. Extensive experiments validate the effectiveness of the algorithm. On three commonly used few-shot datasets, namely mini-ImageNet, tiered-ImageNet, and FC100, the one-shot and five-shot accuracy reaches the state of the art (SOTA) in FSIC: mini-ImageNet achieves 98.65 ± 0.1% for one-shot and 99.6 ± 0.2% for five-shot tasks, tiered-ImageNet achieves 91.6 ± 0.1% for one-shot and 96.55 ± 0.27% for five-shot tasks, and FC100 achieves 64.1 ± 0.3% for one-shot and 79.8 ± 0.69% for five-shot tasks. On two further few-shot datasets, CUB achieves 83.1 ± 0.4% for one-shot and 92.88 ± 0.4% for five-shot tasks, while CIFAR-FS achieves 86.95 ± 0.2% for one-shot and 94 ± 0.4% for five-shot tasks.


Introduction
In the age of abundant data, deep learning algorithms have demonstrated remarkable outcomes in numerous domains associated with visual computing [1,2]. However, a deep learning model's ability to achieve high accuracy often depends on the availability of a large amount of training data and extensive manual labeling. To save these costs, researchers consider how to accurately classify objects with little data. For instance, if a young child who has never seen a rabbit is given a card with a rabbit on it, then when he sees a real rabbit he will immediately recall the picture and recognize it. Even if the rabbit's body size, fur color, and posture differ considerably from the picture he has seen, he can still identify it accurately. Inspired by this view of human learning, the concept of FSL was put forward. The key problem of FSL is the scarcity of labeled data, and it often relies on data augmentation [3-6]. This approach expands the dataset by generating data in various ways, thereby addressing the issue of data sparseness in the FSL process.
In recent years, FSL has developed rapidly. FSL methods based on metric learning [7-10] partially address the issue of data scarcity: they directly measure the distance between test images and training images rather than relying on large datasets. However, in complex situations, such as a cluttered background, high similarity between categories, and significant variation within categories, this approach may place images of the same category far apart in the embedding space after feature extraction, inevitably reducing classification accuracy. Existing metric-based FSIC methods usually rely on direct image comparison and often overlook the significance of local image features. In fact, different parts of an image may have varying levels of importance, so a more adaptable metric learning approach is required. Specifically, Zhang et al. [11] assign less weight to features that contribute little to classification, while assigning greater weight to regions that contain rich image features and high-level semantics. This weight distribution is more aligned with the actual situation. In this study, the FSL problem is formalized as an optimal matching problem, and the EMD is used as the metric: a measure function computes the structural distance between test images, and these distances are then used to predict image categories. This allows us to construct a classifier that can efficiently and precisely determine the category, particularly for datasets with limited examples.
The main contributions of this paper are as follows: (1) The Swin Transformer is employed for feature extraction, capturing both local and global details of the image, and the CSAM attention mechanism is applied to weight the output feature map. (2) The EMD measurement module is employed to measure distance. The main idea is to use block-level measurement together with a cross-reference weighting mechanism to mitigate the influence of significant intra-class variation and cluttered backgrounds. (3) Numerous experiments were conducted on widely used benchmark datasets for FSIC, and the findings demonstrate the significant improvement achieved by the proposed model, which reaches SOTA classification accuracy for few-shot images.

Related Work
In 2017, Snell et al. [12] proposed the prototypical network. The researchers used deep neural networks to map images into feature vectors, representing each category as a prototype, or category center point, in the vector space, obtained by averaging the feature vectors belonging to that category. In the prototypical network, the training objective is to optimize the parameters of the embedding function by minimizing the loss, enabling the network to learn a prototype for each category. In 2018, Sung et al. [13] proposed the Relation Network, which comprises two modules: a relation module and a feature extraction module. Li et al. [14] proposed the Deep Nearest Neighbor Neural Network (DN4), a model designed specifically for FSL and image classification. Its primary distinction lies in its feature representation and the way similarity is computed: whereas conventional approaches typically rely on image-level feature measurement, DN4 replaces it with local descriptors of images within categories. In 2021, Rizve et al. [15] explored the complementary advantages of invariant and equivariant representations for FSL, aiming to obtain the characteristics needed for input transformation and improved discrimination. They found that features that prioritize transformation discrimination may not be ideal for class discrimination; however, such features can aid in learning the equivariant properties of data structures, leading to improved transferability. In 2021, Wu et al. [16] proposed mining parts in a task-aware manner (TPMN) by incorporating automatic part mining into FSL's metric-based model. TPMN designs a meta filter learner in a meta-learning [17,18] fashion to produce task-aware part filters based on task embeddings. The task-aware part filters can adapt to any individual task and automatically mine local parts relevant to the task, even unseen ones. Gori et al. [19] proposed the Graph Neural Network (GNN) model, wherein individual nodes represent samples and edges denote the relationships between samples. Compared to conventional neural networks, GNNs take into account both inter-sample and intra-sample information. Kim et al. [20] proposed EGNN (Edge-Labeling Graph Neural Network) to incorporate edge labeling into the GNN. Traditional GNNs typically focus on node characteristics and the connectivity between nodes while disregarding the label information associated with edges. EGNN utilizes edge labels to depict the associations between samples and integrates them into model learning. This type of edge label can enhance the model's comprehension and use of the similarities and distinctions among samples, thereby improving the efficacy of FSL. Chen et al. [21] presented a novel approach that integrates spatial and frequency representations to enhance FSL. By combining information from both the spatial and frequency domains, the approach extracts a multi-scale feature representation; additionally, it leverages transductive reasoning to effectively utilize the label information of the test set, substantially enhancing few-shot learning performance.

Problem Description
FSL divides the task into two parts. The training set, also called the support set, contains N categories, each consisting of K samples; this is referred to as the N-way K-shot problem. The test set is also called the query set, and the categories in the query set belong to the categories in the support set. To solve the N-way K-shot FSIC problem, prior knowledge is first learned from an auxiliary dataset [13], and the learned prior knowledge is then utilized for image classification and prediction on the target dataset with limited labeling.
In an FSL task, the dataset is divided into a base training set D_base = {(x_i, y_i)}, from which prior knowledge is learned, and a novel set D_novel = {(x_j, y_j)} on which the few-shot tasks are evaluated; the label spaces of the two sets are disjoint.
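The N-way K-shot episode construction described above can be sketched in a few lines of Python; the dataset layout and names below are purely illustrative:

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=15, seed=None):
    """Sample one N-way K-shot episode from a {class_name: [samples]} dict."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)           # pick N classes
    support, query = [], []
    for label, cls in enumerate(classes):
        samples = rng.sample(dataset[cls], k_shot + q_queries)
        support += [(s, label) for s in samples[:k_shot]]  # K labeled shots
        query += [(s, label) for s in samples[k_shot:]]    # held-out queries
    return support, query

# toy "dataset": 8 classes with 25 samples each
data = {f"class_{c}": [f"img_{c}_{i}" for i in range(25)] for c in range(8)}
sup, qry = sample_episode(data, n_way=5, k_shot=1, q_queries=15, seed=0)
```

During meta-training, episodes are drawn from D_base; during evaluation, they are drawn from D_novel.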

Feature Extraction Module Based on Swin Transformer Network
In this study, the Swin Transformer [22] is used as the feature extraction module for FSIC, and the distance between the output image features is measured with the EMD. The architecture of the Swin Transformer is depicted in Figure 2, comprising a convolutional layer, a linear embedding layer, Patch Merging, Swin Transformer Blocks, a global adaptive pooling layer, and a fully connected layer. The processing procedure is as follows. Initially, the input image undergoes a convolution operation that partitions it into non-overlapping 4 × 4 image patches. The patches are then transformed into a sequence by the linear embedding layer. In each block, the self-attention mechanism is employed to extract image features. Subsequently, the feature map undergoes a Patch Merging operation, resulting in downsampling: the width and height of the feature map are reduced while the number of channels increases. Deep features are extracted from the image by stacking multiple Block and Patch Merging operations.

The structure of the Swin Transformer Block is depicted in Figure 3. The LN module standardizes the input features so that the features across different channels exhibit a similar distribution, which enhances the stability of the model and accelerates convergence during training. The W-MSA module conducts multi-head self-attention calculations within each window; by computing attention weights, the model combines features and facilitates interaction and information exchange among features at different positions, helping the model comprehend the interconnections among various components. The MLP module is a fully connected feedforward network capable of performing complex nonlinear transformations on features. This is achieved through multiple fully connected layers and activation functions, which capture more comprehensive and intricate feature representations; the inclusion of higher-level feature information aids the model's learning process. The SW-MSA module conducts shifted-window multi-head self-attention calculations. It enhances the capture of contextual information by performing feature translation, reorganization, and integration within a local area, while maintaining the relative positional relationships between windows, thereby facilitating the efficient extraction of multi-scale features. Through the sequential operation of the aforementioned four modules, the features within each window are modeled and integrated.
Electronics 2024, 13, x FOR PEER REVIEW
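The Patch Merging downsampling step described above can be sketched with plain array slicing. The shapes below (a 56 × 56 × 96 stage-1 feature map) are illustrative, and the linear 4C → 2C projection that follows in the real Swin Transformer is noted but omitted:

```python
import numpy as np

def patch_merging(x):
    """Swin-style patch merging: (B, H, W, C) -> (B, H/2, W/2, 4C).

    Concatenates each 2x2 spatial neighborhood along the channel axis;
    in the actual Swin Transformer a linear layer then projects 4C -> 2C.
    """
    B, H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0
    x0 = x[:, 0::2, 0::2, :]   # top-left pixel of each 2x2 block
    x1 = x[:, 1::2, 0::2, :]   # bottom-left
    x2 = x[:, 0::2, 1::2, :]   # top-right
    x3 = x[:, 1::2, 1::2, :]   # bottom-right
    return np.concatenate([x0, x1, x2, x3], axis=-1)

feat = np.random.rand(1, 56, 56, 96)   # illustrative stage-1 feature map
merged = patch_merging(feat)           # -> (1, 28, 28, 384)
```

No information is lost in this step: the operation is a pure rearrangement, halving the spatial resolution while quadrupling the channel count.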

CSAM Module
The feature map F enters the channel attention path and the spatial attention path separately, producing the channel attention map M_c(F) ∈ R^C and the spatial attention map M_p(F) ∈ R^(H×W), respectively. CSAM is shown in Figure 4. In order to compute channel attention effectively, the spatial dimension of the input feature map needs to be compressed. The common method for aggregating spatial information is average pooling, while maximum pooling collects distinctive object features and can infer attention on finer channels. Therefore, the features obtained after average pooling and maximum pooling are used simultaneously. The channel attention path M_c(F) can be expressed as follows:

M_c(F) = σ(FC(GlobalAvgPool(F)) + FC(GlobalMaxPool(F)))    (1)

where GlobalAvgPool represents global average pooling, GlobalMaxPool represents global maximum pooling, FC indicates the fully connected layer, and σ indicates the sigmoid activation function. A channel attention diagram is shown in Figure 5.
Formula (1) indicates that the feature map F is passed simultaneously through the average pooling layer and the maximum pooling layer, and then through the fully connected layer; the two results are added element-wise, and the sigmoid activation function is applied for nonlinear mapping to obtain the channel-dimension output. A spatial attention diagram is shown in Figure 6. As shown in Figure 6, the spatial attention path M_p(F) can be expressed as follows:

M_p(F) = σ(Conv(F))    (2)

where Conv represents a series of convolution operations with 1 × 1 and 3 × 3 kernels, and σ indicates the sigmoid activation function. The feature map F is processed by these convolutions to obtain M_p(F), which is then added element-wise to M_c(F) and passed through the sigmoid activation function. To facilitate the forward and backward propagation of information, the low-level features are transmitted directly to the higher-level network: the original feature map F is added element-wise to obtain the final module output F_CSAM. The calculation process can be expressed as follows:

F_CSAM = F + σ(M_c(F) + M_p(F))    (3)

From the structure of this channel and spatial fusion attention module, it can be seen that feature extraction along the channel dimension alone allows the model to better learn color information, while learning along the spatial dimension makes the network focus more on texture features; finally, the two are fused to obtain an output whose shape is consistent with the original feature map. The processed image output is then measured by EMD.
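The channel-attention path of Formula (1) can be sketched in numpy. The single shared FC weight matrix below is an assumption for illustration (attention modules of this kind often share the projection between the two pooled branches); shapes are toy values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W_fc):
    """Channel path of CSAM, Formula (1): sigmoid(FC(avgpool) + FC(maxpool)).

    F: feature map of shape (H, W, C); W_fc: assumed shared (C, C) FC weight.
    Returns a (C,) attention vector M_c(F).
    """
    avg = F.mean(axis=(0, 1))          # GlobalAvgPool -> (C,)
    mx = F.max(axis=(0, 1))            # GlobalMaxPool -> (C,)
    return sigmoid(avg @ W_fc + mx @ W_fc)

rng = np.random.default_rng(0)
F = rng.random((7, 7, 16))             # toy feature map
W = rng.standard_normal((16, 16)) * 0.1
m_c = channel_attention(F, W)          # (16,) weights, each in (0, 1)
weighted = F * m_c                     # reweight channels by broadcasting
```

The sigmoid keeps every channel weight in (0, 1), so the reweighting scales channels without changing the feature map's shape.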

Problem Description
EMD originates from the optimal transportation problem in linear programming: given a quantity of earth piled up in two different configurations, EMD computes the minimum total cost of moving one pile into the other. Suppose a group of suppliers A = {a_i | i = 1, 2, . . . , m} needs to transport goods to a specified group of destinations B = {b_j | j = 1, 2, . . . , k}, where a_i represents the i-th supplier and b_j represents the j-th destination. The unit cost of transport from supplier i to destination j is c_ij, and the quantity transported is x_ij. The goal of the transportation problem is to find the cheapest flow of goods X = {x_ij | i = 1, . . . , m, j = 1, . . . , k} from the suppliers to the demanders:

minimize Σ_ij c_ij x_ij  subject to  x_ij ≥ 0,  Σ_j x_ij = a_i,  Σ_i x_ij = b_j    (4)
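The transportation problem in Formula (4) is a linear program and can be solved with an off-the-shelf LP solver; the sketch below uses scipy, with toy supplier weights, demander weights, and unit costs:

```python
import numpy as np
from scipy.optimize import linprog

def emd(a, b, C):
    """Solve min sum_ij C_ij x_ij s.t. sum_j x_ij = a_i, sum_i x_ij = b_j, x >= 0."""
    m, k = C.shape
    A_eq, b_eq = [], []
    for i in range(m):                       # supply constraints (rows of X)
        row = np.zeros((m, k)); row[i, :] = 1
        A_eq.append(row.ravel()); b_eq.append(a[i])
    for j in range(k):                       # demand constraints (columns of X)
        col = np.zeros((m, k)); col[:, j] = 1
        A_eq.append(col.ravel()); b_eq.append(b[j])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(m, k)

a = np.array([0.4, 0.6])                     # supplier weights
b = np.array([0.5, 0.5])                     # demander weights
C = np.array([[0.0, 1.0], [1.0, 0.0]])       # unit costs
cost, flow = emd(a, b, C)                    # optimal matching cost and flow
```

With these toy values the optimum keeps as much mass as possible on the zero-cost diagonal and moves only the 0.1 surplus across, giving a total cost of 0.1.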

Application of EMD in FSIC
Zhang et al. [11] put forward DeepEMD, which addresses the few-shot challenge with EMD in a dynamic round-based training process. Specifically, the Swin Transformer is utilized to generate image embeddings U ∈ R^(H×W×C), where H and W correspond to the spatial dimensions of the feature map and C represents the dimensionality of the features. Each image representation comprises a collection of local feature vectors [u_1, u_2, . . . , u_HW], and each vector u_i represents a node within the set; v_j is defined in the same way, with u_i drawn from the query set and v_j from the support set. The overall framework of Swin Transformer + CSAM + EMD is shown in Figure 7. The similarity of two images can therefore be quantified by the optimal matching cost between the two sets of vectors. Following the original EMD formulation in Formula (4), the unit cost is obtained by computing the pairwise distance between embedded nodes u_i and v_j derived from the two image features:

c_ij = 1 − u_i^T v_j / (‖u_i‖ ‖v_j‖)    (5)

Nodes with similar representations tend to produce lower matching costs between one another. The weights a_i and b_j are elaborated in Section 3.4.4. Once the optimal matching flow X̃ has been obtained, the similarity score between image representations can be expressed as follows:

s(U, V) = Σ_ij (1 − c_ij) x̃_ij    (6)
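A minimal numpy sketch of the unit cost and similarity score, assuming the cosine-based cost used in DeepEMD [11]; the optimal flow X̃ is taken as given here, since it comes from solving the linear program of Formula (4):

```python
import numpy as np

def cost_matrix(U, V):
    """Pairwise matching cost between node sets: c_ij = 1 - cos(u_i, v_j).

    U: (HW, C) query-image nodes; V: (HW, C) support-image nodes.
    """
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    return 1.0 - Un @ Vn.T

def similarity(C, X):
    """Similarity score given an optimal matching flow X: sum (1 - c_ij) x_ij."""
    return float(((1.0 - C) * X).sum())

rng = np.random.default_rng(1)
U, V = rng.random((9, 32)), rng.random((9, 32))   # toy 3x3 feature maps
C = cost_matrix(U, V)                             # (9, 9) unit costs
```

Identical nodes have zero cost, and the similarity score rewards flow routed between well-matched node pairs.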

End-to-End Training
Applying the implicit function theorem [23-25] to the optimality (KKT) conditions yields the Jacobian of the solution. For completeness, the problem can be expressed in condensed matrix form, starting from Equation (4):

minimize c(θ)^T x  subject to  G(θ)x ≤ I(θ),  E(θ)x = f(θ)    (7)

where θ represents the problem parameters, connected to the earlier layers in a differentiable manner; Ex = f represents the equality constraints and Gx ≤ I the inequality constraints of Formula (4). The Lagrangian of the linear programming problem in Equation (7) can therefore be written as:

L(θ, x, ν, λ) = c^T x + λ^T (Gx − I) + ν^T (Ex − f)    (8)

where ν is the dual variable for the equality constraints and λ ≥ 0 is the dual variable for the inequality constraints. Following the KKT conditions, with some notational convenience, a primal-dual interior point method is used to solve g(θ, x, ν, λ) = 0 for the optimal solution (x̃, ν̃, λ̃). Solving for the optimal x̃ establishes the relationship between x̃ and θ, which enables efficient backpropagation.
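The derivative used for backpropagation can be made explicit. Writing z = (x, ν, λ) for the primal and dual variables, applying the implicit function theorem to the stationary KKT system g(θ, z) = 0 gives the standard identity:

```latex
g(\theta, z) = 0, \qquad z = (x, \nu, \lambda)
\quad\Longrightarrow\quad
\frac{\partial g}{\partial \theta} + \frac{\partial g}{\partial z}\,\frac{\partial z}{\partial \theta} = 0
\quad\Longrightarrow\quad
\frac{\partial z}{\partial \theta} = -\left(\frac{\partial g}{\partial z}\right)^{-1}\frac{\partial g}{\partial \theta}
```

Thus the gradient of the matching loss with respect to θ, and hence with respect to the feature extractor, is obtained without unrolling the LP solver.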

Weight Generation
In the task of FSIC, it is difficult to assess the significance of a local feature representation from a single image alone. In order to reduce the weight of high-variance background regions in the two images and increase the weight of co-occurring object regions, a cross-reference mechanism is used: correlation scores, obtained as the dot product between a node feature and the average node feature of the other structure, serve as weight values.
b_j can be obtained in the same way; finally, all the weights in a structure are normalized:

â_i = a_i · HW / Σ_p a_p

How to Set K-Shot?
We discussed the 1-shot case above; we now consider how to handle the K-shot setting. A fully connected layer finds a prototype vector for each category and uses distance measurement to classify images. Similarly, for K-shot we learn a structured fully connected layer (SFCL), in which each category is embedded into a set of vectors instead of a single vector. The trained 1-shot model is utilized as a fixed feature extractor, while the parameters of the SFCL are learned from the support set, as shown in Figure 8.
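The cross-reference weight generation and normalization described in the Weight Generation subsection can be sketched as follows, assuming the DeepEMD-style formulation a_i = max(u_i · mean(V), 0) with weights normalized to sum to HW; all shapes are illustrative:

```python
import numpy as np

def cross_reference_weights(U, V):
    """Cross-reference node weights: a_i = max(u_i . mean(V), 0),
    normalized so that the HW weights of a structure sum to HW.

    U: (HW, C) nodes to weight; V: (HW, C) nodes of the other structure.
    """
    HW = U.shape[0]
    a = np.maximum(U @ V.mean(axis=0), 0.0)  # score against the other image's mean node
    return a * HW / (a.sum() + 1e-12)        # normalization: a_i * HW / sum_p a_p

rng = np.random.default_rng(2)
U, V = rng.random((9, 32)), rng.random((9, 32))
a = cross_reference_weights(U, V)   # weights for the query-side nodes
b = cross_reference_weights(V, U)   # weights for the support-side nodes
```

Background nodes that correlate poorly with the other image's mean feature receive small (or zero) weight, so little flow mass is assigned to them in the matching.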

Experimental Section
We initially present the details of the datasets used and highlight key aspects of our network design. Finally, a comparison is made between our model and SOTA methods on widely recognized datasets.

Dataset Description
The algorithm's accuracy was confirmed using the following five datasets. Mini-ImageNet [26]: The mini-ImageNet dataset, which is commonly utilized in the field of FSL, is considered a coarse-grained dataset. In 2016, the Google DeepMind team derived the mini-ImageNet dataset from ImageNet [27]. It consists of 60,000 color images carefully chosen to represent 100 distinct categories. Each category contains 600 images, and the categories are divided into subsets for different purposes (specifically, 64 for meta training, 16 for meta validation, and 20 for meta testing).
CIFAR-FS [28]: A modified version of CIFAR-100 [29], consisting of 100 categories with 600 images per category. It is common practice to divide the dataset into 64 classes for training, while 16 classes are allocated for validation and another 20 classes are designated for testing.
CUB-200 [30]: The CUB dataset was initially introduced for classifying birds at a fine-grained level. It is divided into three subsets for meta training, meta validation, and meta testing, containing 100, 50, and 50 classes, respectively. One of the main difficulties of this dataset lies in the subtle distinctions between the different bird species.
Tiered-ImageNet [31]: A dataset that shares similarities with mini-ImageNet and is considered coarse-grained. It consists of 608 classes, with 351 classes in the training set, 97 in the validation set, and 160 in the test set. Additionally, this dataset offers a larger number of images for both training and evaluation, with a total of 779,165 images available.
FC100 [32]: A classification dataset designed for FSL, derived from the CIFAR-100 dataset. It follows the split proposed in [32]: the 20 original super classes are reorganized so that 12 super classes (60 classes) are used for meta training, four super classes (20 classes) for meta validation, and another four super classes (20 classes) for meta testing.


Experimental Environment
The experimental environment is an Ubuntu 20.04 system; the CPU is a Xeon(R) Platinum 8352V, and the GPU setup uses two RTX 4090 graphics cards, each with 24 GB of memory (48 GB in total). The model training platform adopts the PyTorch deep learning framework; the specific versions are PyTorch 2.0.0, Python 3.8, and CUDA 11.8.

Implementation Details
In the experiments, all images in the mini-ImageNet dataset were resized to 224 × 224. Simultaneously, common data augmentation strategies were adopted, such as random horizontal flipping and color jittering, which introduce randomness into brightness, contrast, and saturation. Each training sample undergoes slight changes in these attributes with a certain probability, increasing the diversity of the data and helping the model adapt to different brightness, contrast, and color conditions. After conversion to tensors, the image pixel values were normalized using the dataset mean and standard deviation. The images in the other four datasets were resized to 84 × 84 without color jittering; the other steps were the same.
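The normalization and flipping steps above can be sketched as follows. The per-channel statistics shown are the commonly used ImageNet values, taken here purely for illustration; the paper does not state its exact statistics:

```python
import numpy as np

def normalize(img, mean, std):
    """Scale uint8 pixels to [0, 1], then standardize each channel
    with the dataset mean and standard deviation."""
    x = img.astype(np.float32) / 255.0
    return (x - mean) / std

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Illustrative channel statistics (the widely used ImageNet values).
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

out = normalize(img, mean, std)

# Random horizontal flip, applied with probability 0.5 during training.
if rng.random() < 0.5:
    out = out[:, ::-1, :]
```

In practice this whole pipeline would be expressed with torchvision transforms (RandomHorizontalFlip, ColorJitter, ToTensor, Normalize); the sketch only makes the arithmetic explicit.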
We initialized the pretraining phase using the Swin-T pretrained model, with 28,288,354 parameters and 4.5 G FLOPs. The Swin Transformer model's parameters were adjusted through training. The final trained model produced a feature map of size 7 × 7 × 768, which was average-pooled to 5 × 5 × 768; CSAM attention feature weighting was then applied to the output feature map for ease of downstream EMD fine-tuning.
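The 7 × 7 → 5 × 5 average pooling can be done with adaptive pooling windows; in PyTorch this would be `F.adaptive_avg_pool2d`, and the following numpy sketch reproduces the usual adaptive-window rule so the shape change is explicit:

```python
import math
import numpy as np

def adaptive_avg_pool(fmap, out_h, out_w):
    """Average-pool an (H, W, C) feature map to (out_h, out_w, C) using
    adaptive windows: output cell i covers rows floor(i*H/out_h) to
    ceil((i+1)*H/out_h), and likewise for columns."""
    H, W, C = fmap.shape
    out = np.empty((out_h, out_w, C), dtype=fmap.dtype)
    for i in range(out_h):
        r0, r1 = (i * H) // out_h, math.ceil((i + 1) * H / out_h)
        for j in range(out_w):
            c0, c1 = (j * W) // out_w, math.ceil((j + 1) * W / out_w)
            out[i, j] = fmap[r0:r1, c0:c1].mean(axis=(0, 1))
    return out

fmap = np.random.default_rng(0).normal(size=(7, 7, 768)).astype(np.float32)
pooled = adaptive_avg_pool(fmap, 5, 5)   # shape (5, 5, 768)
```

Each of the 5 × 5 output cells averages an overlapping 2 × 2 (or larger) window of the 7 × 7 input, so channel dimensionality and overall spatial layout are preserved.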

Analysis of Ablation Experiment
The module ablated in this experiment is CSAM; Swin + EMD alone already performs well, and adding the CSAM module improves the accuracy on the five datasets by about 0.2-0.3%.
Regarding the choice of learning rate, this study attempted values of 0.1, 0.01, 0.0001, 0.00001, and 0.00005 during pretraining. A learning rate of 0.00005 achieved higher accuracy during pretraining, while a learning rate of 0.0005 achieved higher accuracy during meta training. The effect of the learning rate varied across the five datasets: for example, increasing it to 0.0005 improved accuracy on the CUB dataset by 1%, but slightly decreased accuracy on mini-ImageNet by around 0.5%. Fine-tuning this hyperparameter may yield slight improvements on other datasets, but the mini-ImageNet dataset was chosen as the reference for determining the optimal learning rate.
With the Swin-T pretrained model parameters used in this experiment, training is relatively fast: pretraining takes about 2 h and fine-tuning about 1 h. In the same configuration environment and on the same mini-ImageNet dataset, P > M > F [33] requires about 56 h of training, whereas this experiment is expected to reach its highest accuracy in about 2 h.
The momentum is 0.9 and the weight decay is 0.05. The number of epochs is 100, although, typically, 20 epochs already yield good performance on these five datasets.
The hyperparameter settings in this experiment follow those of Hu et al. [33] and Zhang et al. [11]; the authors of this paper selected relatively good parameters after a large number of experiments, yielding very high accuracy.

Analysis of Experimental Results
In the tables, all ResNets are denoted by R; compared with previous SOTA methods, we call our pipeline STCE.
From Table 1, we can see that the STCE method proposed in this paper exceeded the accuracy of the best-performing P > M > F method on the few-shot dataset mini-ImageNet: specifically, the accuracy was 3.3% higher on one-shot and 1.2% higher on five-shot. On the tiered-ImageNet dataset, it outperformed TRIDENT, the previously most accurate method, by 4.6% on one-shot, reaching SOTA in the few-shot field on both datasets. From Table 2, we can see that the STCE method achieved higher accuracy than the SOTA BAVARDAGE method on the few-shot dataset FC100: it outperformed BAVARDAGE by 6.8% in the one-shot scenario and by 9% in the five-shot scenario, demonstrating that STCE achieves SOTA accuracy on FC100. From Table 3, we can see that STCE exhibited outstanding performance on the few-shot dataset CIFAR-FS: it attained an accuracy of 86.95% in the one-shot scenario, lower than the PT + MAP + SF + SOT method by 3%, but in the five-shot scenario it surpassed PT + MAP + SF + SOT by 1.2%, establishing a SOTA result. From Table 4, we can see that STCE achieved 83.1% on the CUB dataset in the one-shot scenario and 92.88% in the five-shot scenario.
The experiment was divided into three stages: pretraining, meta training, and testing. The pretraining models trained on ImageNet-1K include Swin-T, Swin-S, and Swin-B, with the following parameter counts and layer configurations: 28,288,354 parameters and [2, 2, 6, 2] layers for Swin-T, 49,606,258 parameters and [2, 2, 18, 2] layers for Swin-S, and 87,768,224 parameters and [2, 2, 18, 2] layers for Swin-B. We tried various model architectures, including Swin-T, Swin-B, Swin-L, and the upgraded Swin Transformer V2 [56], and found the pretrained parameters of the Swin-T model to be the most suitable. Through an extensive series of experiments, we found that the larger the model, the more parameters it has, which may lead to problems such as over-fitting and thus low accuracy in FSIC. To sum up, the experimental findings demonstrate that transfer learning with pretrained models yields significantly higher accuracy in FSL than more intricate methodologies.

Conclusions
This study introduces a novel approach that combines Swin Transformer, CSAM, and EMD for the classification of few-shot images. By leveraging the strong feature representation capabilities of the Swin Transformer, the model is able to capture more comprehensive and global image information. To address accuracy issues caused by variations within the same category and cluttered backgrounds, an EMD measurement module is incorporated. The proposed model, STCE, is evaluated on five datasets. The experimental results suggest that the STCE algorithm demonstrates better performance than other methods, achieving SOTA performance in the field of FSL on the mini-ImageNet, tiered-ImageNet, and FC100 datasets. Additionally, the model also achieves impressive results on the other two datasets, improving the accuracy of FSIC.
A limitation of the current approach is that it does not perform particularly well on fine-grained tasks; however, we believe that transfer learning can greatly improve accuracy in few-shot domains, and that pretraining plus fine-tuning forms a class of methods with short training times and very good prospects for future applications in few-shot domains.
Given a base set $D_{base} = \{(x_i, y_i)\}_{i=1}^{m_{base}}$, $y_i \in C_{base}$, and a test set $D_{novel} = \{(x_i, y_i)\}_{i=1}^{m_{novel}}$, $y_i \in C_{novel}$, where $m_{base}$ and $m_{novel}$ are the sample numbers in $D_{base}$ and $D_{novel}$, and $C_{base}$ and $C_{novel}$ are the class sets corresponding to $D_{base}$ and $D_{novel}$, respectively, with $C_{base} \cap C_{novel} = \emptyset$. FSL aims to learn a model on $D_{base}$ that generalizes well to the unseen test set $D_{novel}$. Episodic training samples a series of tasks from $D_{base}$ and $D_{novel}$ as training samples and test samples in the FSL framework, where each classification task consists of a support set and a query set. The support set $S = \{(x_i, y_i)\}_{i=1}^{N \times K}$ contains $K$ labeled samples from each of $N$ classes (the N-way K-shot setting), and the query set $Q = \{(x_i, y_i)\}_{i=N \times K + 1}^{N \times K + N \times B}$ contains $B$ unlabeled samples from the same $N$ classes. Let $p(T)$ be the task distribution; episodic training draws a series of training tasks $T_{train} \sim p(T)$ as training-phase samples. For a specific training task $T_{train}$, a portion $D_{T_{train}}$ of the data for each of the $N$ classes $C_{T_{train}} = \{C_1, C_2, \ldots, C_N\} \subset C_{base}$ is used; for this set of $N$ classes, both the support set and the query set of the training task are drawn from $D_{T_{train}}$. The data sampling in each task is shown in Figure 1.
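The N-way K-shot sampling procedure described above can be sketched with plain Python; the dataset below is a synthetic placeholder and all names are illustrative:

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot task: K labeled support samples and
    B = n_query query samples per class, drawn from N random classes."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):  # relabel classes 0..N-1 per episode
        items = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query

# Synthetic base set: 20 classes with 30 samples each.
data_by_class = {c: [f"img_{c}_{i}" for i in range(30)] for c in range(20)}
rng = random.Random(0)
support, query = sample_episode(data_by_class, n_way=5, k_shot=1, n_query=15, rng=rng)
```

Each call produces one training episode; repeating this draw over many episodes realizes the task distribution $p(T)$ used in episodic training.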

Electronics 2024, 15
Figure 8.
Figure 8. (a) illustrates the FC, while (b) depicts the SFCL utilized for K-shot. The fully connected layer learns a collection of vectors that serve as prototypes for each class; these vectors are then utilized in conjunction with EMD to generate category scores.


Table 4 .
Results on the CUB dataset.