
Class-Patch Similarity Weighted Embedding for Few-Shot Infrared Image Classification

Zhen Huang, Jinfu Gong, Xiaoyu Wang, Dongjie Wu and Yong Zhang
1 Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(2), 290; https://doi.org/10.3390/electronics14020290
Submission received: 13 December 2024 / Revised: 8 January 2025 / Accepted: 10 January 2025 / Published: 13 January 2025
(This article belongs to the Section Artificial Intelligence)

Abstract
Infrared imaging plays a vital role in critical surveillance, military reconnaissance, and industrial inspection applications due to its advantages such as strong concealment and the ability to operate around the clock. However, the combination of low infrared image resolution and complex background scenarios poses significant challenges for traditional deep learning models in accurately extracting the most discriminative features for classification. These models are often disrupted by irrelevant features, especially when data for new classes is scarce. Current few-shot learning approaches heavily rely on comparing image patches, but the scarcity of data can significantly degrade the performance of recognition algorithms. To address these challenges, we propose the Class Patch Similarity Weighted Embedding (CPSWE) framework for few-shot infrared target classification. The CPSWE framework employs a ViT architecture for feature extraction. By introducing class embeddings and calculating similarity-based weights for each patch, CPSWE reweights the patch features to enhance their relevance to the target class. This approach improves the discriminability of class-related features, leading to better generalization in few-shot settings. Furthermore, we introduce an infrared dataset specifically designed for few-shot learning, combining multiple open-source datasets to support research in this area. Extensive experiments on the few-shot learning benchmark dataset miniImageNet and the infrared dataset miniIRNet show that CPSWE outperforms existing few-shot learning methods, achieving significant improvements in classification accuracy on infrared image datasets with limited labeled samples.

1. Introduction

Infrared imaging detection offers advantages such as high concealment, strong anti-interference capabilities, and all-weather operability, making it widely applicable in early warning systems and environmental monitoring. Unlike visible spectrum images, infrared images capture thermal radiation, offering unique insights into object detection, especially in low-light or challenging environmental conditions [1]. Infrared image classification is an important part of infrared image processing. Image classification methods can be broadly divided into two categories: traditional approaches and deep learning-based techniques. Traditional methods, such as those based on handcrafted features, rely on the manual extraction of image descriptors like texture, shape, and color. While these methods can be effective in certain contexts, they are typically limited by their inability to generalize well across different datasets, especially when faced with complex image variations and noise.
In contrast, deep learning methods have revolutionized image classification by automatically learning hierarchical representations from raw data. Through deep neural networks, particularly convolutional neural networks (CNNs) and transformers, these methods can extract increasingly abstract features, enabling them to identify complex patterns and relationships in images [2]. A significant advantage of deep learning approaches lies in their capacity to utilize extensive labeled datasets during training, which has led to significant advancements in the classification of visible spectrum images. However, owing to factors like military confidentiality and the substantial expense of imaging devices, the dependence on large quantities of labeled data presents difficulties in fields such as infrared imaging, where obtaining large-scale annotated datasets is often impractical. Infrared images are often characterized by low resolution, high noise, and limited texture, which complicate the extraction of meaningful features for classification. These challenges are exacerbated in few-shot learning scenarios, where labeled data is scarce and models must generalize effectively to new classes with minimal training samples. Put another way, the scarcity of training samples restricts efforts to improve the precision of infrared image classification [3].
Few-shot learning (FSL) draws inspiration from humans’ remarkable ability to learn quickly, offering a way to overcome the previously highlighted challenges. Unlike traditional deep learning methods, few-shot learning enables efficient learning and rapid adaptation to new tasks, even with limited sample sizes [4]. As a result, few-shot learning has gained considerable traction in recent years, giving rise to numerous groundbreaking approaches, with metric learning being a prominent example. Metric learning focuses on developing embedding functions that transform the input space into distinct embedding representations and then performing classification within these spaces using similarity metrics. However, despite the success of metric learning in few-shot scenarios, several challenges arise when these methods are applied to infrared images. Infrared images are often characterized by lower resolution, reduced texture, and significant noise, which complicate the process of extracting meaningful features. Traditional metric learning methods, which rely on well-defined visual features, may struggle to capture the subtle patterns unique to infrared data. Moreover, the significant variability within the same class and the high resemblance between different classes in infrared imagery frequently produce embeddings that lack distinction, posing challenges for conventional embedding functions to identify discriminative features and consequently reducing classification performance.
This paper introduces an innovative method to tackle the issues associated with few-shot infrared target classification, namely the Class-Patch Similarity Weighted Embedding (CPSWE) method, which enhances the extraction of class-relevant features in infrared images. Our method employs a Vision Transformer (ViT) architecture as the feature extractor, utilizing self-supervised pretraining through Masked Image Modeling (MIM) to generate semantically meaningful patch embeddings. The input image is divided into non-overlapping patches, and we introduce a class embedding that interacts with the patch embeddings within the feature extractor. By calculating the similarity between the class feature and each patch embedding, we derive similarity weights for each patch. These weights are then used to re-weight the patch embeddings, which are combined with the class embedding. Finally, a similarity score matrix is constructed to compare class-relevant patch embeddings from different images, allowing us to quantify the similarity between image pairs. Furthermore, in this work, we introduce an infrared dataset specifically designed for few-shot target classification, which was constructed by integrating multiple open-source infrared datasets [5,6,7,8,9,10,11,12,13]. As far as we are aware, this is the first dataset developed specifically for few-shot learning in the domain of infrared target classification. By selecting and preprocessing relevant samples from diverse sources, we aim to tackle some of the challenges associated with few-shot learning techniques in infrared imaging tasks, offering a resource that could support future research in this area.
Finally, we summarize our contributions. (1) We propose a novel framework, CPSWE, which leverages a ViT architecture to enhance the extraction of class-relevant features in infrared images. By introducing class embeddings and similarity-based weighting of image patches, CPSWE improves the discriminability of features, thereby boosting performance in few-shot learning scenarios. (2) We establish a new dataset, miniIRNet, specifically designed for infrared few-shot classification to address the scarcity of such datasets in the field. To the best of our knowledge, this is the first dataset dedicated to few-shot classification in infrared imaging. (3) We conduct extensive experiments on both the public miniImageNet dataset and the newly developed miniIRNet dataset, demonstrating that CPSWE outperforms existing state-of-the-art methods.
The remainder of this paper is organized as follows: Section 2 reviews related work in few-shot learning and infrared image few-shot classification. Section 3 details the definition of few-shot learning and the proposed CPSWE methodology. Section 4 describes the experimental setup, the datasets, the results on both datasets, and the ablation studies. Section 5 discusses conclusions and outlines future directions.

2. Related Works

2.1. Few-Shot Learning

Few-shot learning has gained widespread interest due to its alignment with the demands of many real-world scenarios. Current methods for few-shot learning are typically divided into two main categories. The first type is based on meta-learning, which focuses on enabling models to learn how to learn. In particular, meta-learning focuses on accumulating transferable knowledge to enable rapid adaptation to new tasks within a few-shot learning framework.
Model-Agnostic Meta-Learning (MAML) [14] is the most representative work in this category. It updates model parameters by using the combined gradients of multiple tasks and learns a neural network initialization from which a single gradient descent step enables the model to adapt to unseen tasks and achieve good generalization performance. The Meta-Transfer Learning (MTL) [15] algorithm applies deep neural networks to tackle tasks associated with few-shot learning. MetaOptNet [16] employs support vector machines as base learners to construct representations for few-shot learning, achieving a superior balance between feature dimensionality and accuracy on few-shot recognition benchmarks. Wang et al. [17] proposed a novel approach to conditional meta-learning focused on structured prediction for task-specific learning, deriving Task-Adaptive Structured Meta-Learning (TASML), which generates objective functions tailored to specific tasks by assigning weights to the meta-training data relevant to the target task. COMLN [18] designs an efficient algorithm based on forward-mode differentiation, where the memory requirements do not increase with the length of the learning trajectory, allowing for longer adaptation times with constant memory usage.
The second type consists of metric-based learning methods, which embed both support images and query images into the same feature space and use an appropriate distance function to measure similarity. For example, ProtoNet [19], MatchingNet [20], and RelationNet [21] use Euclidean distance, cosine similarity, and a learnable network, respectively, as metrics to quantify the distance between global representations of images within the feature space. CAN [22] designs a cross-attention module to capture semantic relationships between class and query features, dynamically identifying relevant regions and producing more distinct features. DeepBDC [23] employs Brownian distance covariance to represent patch-level features through a learned covariance matrix and uses the inner product as the metric for comparison. ReNet [24] leverages intra- and inter-image relational patterns through self-correlation representations and cross-correlation attention.
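To ground the metric-based recipe these methods share, the following is a minimal ProtoNet-style sketch in PyTorch: support embeddings are averaged into class prototypes, and queries are scored by negative squared Euclidean distance to each prototype. The function and argument names are illustrative, not taken from any of the cited implementations.

```python
import torch

def prototype_logits(support_emb, support_labels, query_emb, n_way):
    """Metric-based few-shot classification in a nutshell (ProtoNet style).
    support_emb:    (N*K, D) embedded support images
    support_labels: (N*K,)   episode labels in [0, n_way)
    query_emb:      (M, D)   embedded query images
    Returns (M, N) logits: negative squared Euclidean distances."""
    # Average each class's support embeddings into a single prototype.
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(n_way)
    ])                                               # (N, D)
    return -torch.cdist(query_emb, prototypes) ** 2  # closer => larger logit
```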
In some recent studies, several approaches have been proposed to address the challenges of FSL. One such method is the Frequency-Guided Few-shot Learning (FGFL) [25] framework, which leverages task-specific frequency components to enhance feature representations, using a multi-level metric learning strategy to improve performance across various FSL scenarios. Another notable work, Meta-AdaM [26], introduces a meta-learned adaptive optimizer with momentum, designed to improve convergence in few-shot learning by incorporating weight-update history and momentum into the optimization process. Additionally, the MetaDiff [27] approach models the gradient descent algorithm as a diffusion process, alleviating memory burdens and mitigating the risk of vanishing gradients, which often hinder gradient-based meta-learning methods. A novel prototype-based label propagation method [28] has been proposed to address the challenges of inductive and transductive FSL, particularly in improving graph construction and prototype estimation, leading to better performance on standard FSL benchmarks.
Recently, researchers have begun applying Vision Transformers (ViT) to few-shot learning scenarios. FewTURE [29] demonstrated that an entirely ViT-based design could be effectively adapted for small-scale image datasets. They divide the input samples into patches and use Vision Transformers to encode them, learning a representative embedding space beyond just label information. HCTransformers [30] employs ViT as a meta-feature extractor for few-shot learning, incorporating hierarchically cascaded transformers that exploit intrinsic image structures through spectral token pooling while refining learnable parameters via latent attribute proxies. CPEA [31] also utilizes a pre-trained ViT model, integrating patch embeddings with class-aware embeddings to ensure their relevance to specific classes. The CPEA approach improves few-shot learning by merging patch and class-aware embeddings, enhancing their relevance to specific classes and boosting performance on small datasets. However, it computes the similarity between each image patch and class embedding uniformly, without considering the varying importance of patches. This equal weighting may overlook key image regions, limiting the model’s capacity to capture the most discriminative features.

2.2. Few-Shot Classification in Infrared Imaging

In recent years, few-shot classification has garnered significant attention in various domains due to its potential to address data scarcity issues. However, its application in the infrared domain remains relatively limited. While there have been notable efforts to leverage FSL techniques for infrared image classification, the body of research in this specific area is still in its infancy. Several works have explored the combination of few-shot learning and infrared imagery. Chen et al. [32] proposed a meta-learning-based method incorporating multi-scale feature fusion to resolve issues in few-shot infrared aerial target classification. Li et al. [33] introduced the Deep-Shallow Learning Graph Model (D-SLGM), a cross-domain object recognition method designed to address feature representation challenges in unsupervised few-shot scenarios; this approach was applied to classify a custom-built dataset of infrared aerial targets. Yang et al. [34] developed a virtual prototype generation method that accommodates both base and novel categories in the context of few-shot incremental learning for infrared target recognition, aiming to improve classification accuracy for both class types. Tan et al. [4] proposed a few-shot classification approach for infrared images that leverages conceptual features derived from target components, flexibly selecting local features from the target and integrating them into a conceptual feature space, ultimately achieving infrared target classification through metric learning. Despite these advancements, a major limitation of the current body of research is the lack of standardized, publicly available infrared datasets for comparison. Most of the studies rely on private, undisclosed datasets, which introduces challenges in benchmarking and evaluating the effectiveness of different methods. The absence of a uniform benchmark for infrared few-shot learning makes it difficult to assess the progress and real-world applicability of these approaches across different use cases.

3. Method

3.1. Problem Definition

Few-shot classification aims to generalize the knowledge acquired during training on a dataset $D_{train}$ to unseen test data $D_{test}$, where the class sets $C_{train}$ and $C_{test}$ are disjoint ($C_{train} \cap C_{test} = \emptyset$), and each test class is represented by only a small number of labeled samples. Specifically, given a support set $S$ containing $N$ classes with $K$ labeled samples each, the objective is to correctly assign $M$ unseen samples from a query set $Q$ to the corresponding $N$ classes. Such a classification task is typically referred to as an $N$-way $K$-shot task $T$, and tasks are drawn randomly from the test dataset to evaluate the model’s performance. Following previous research [20], we adopt the meta-learning protocol, where few-shot classification problems are formulated through episodic training and testing. The objective is to develop a model trained on the training classes that can successfully adapt to new scenarios derived from unseen test classes within an inductive framework.
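To make the episodic protocol concrete, here is a minimal sketch of how a single $N$-way $K$-shot task might be sampled; the `sample_episode` function and its arguments are illustrative rather than part of our implementation.

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, m_query=15):
    """Draw one N-way K-shot episode from a list of (image, label) pairs
    whose labels come from the disjoint test class set C_test."""
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append(image)
    # Randomly pick N classes, then K support and M query samples per class.
    classes = random.sample(sorted(by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        samples = random.sample(by_class[cls], k_shot + m_query)
        support += [(img, episode_label) for img in samples[:k_shot]]
        query += [(img, episode_label) for img in samples[k_shot:]]
    return support, query
```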
Our overall framework is illustrated in Figure 1. The CPSWE model leverages a ViT as the feature extractor. Input images are initially segmented into distinct, non-overlapping patches, which are then processed through a pre-trained ViT to produce both class embeddings and patch embeddings. The class embedding is positioned at the beginning of the patch embedding sequence before being input into a conventional Transformer encoder. The class embedding captures global image information, while the patch embeddings capture local details specific to each patch. To enhance the relevance of patch embeddings to the class token, we compute similarity weights between the class token and each patch embedding. These similarity weights are then used to re-weight the patch embeddings, which are combined with the class token to form class-relevant patch embeddings. Finally, to quantify the similarity between paired images, an MLP layer computes a similarity score from the similarity matrix.

3.2. Class-Patch Similarity Weighted Embedding

The CPSWE method incorporates a weighting strategy derived from the resemblance between the class token and the patch embeddings. Within this framework, the class embedding is independent of any specific class before being processed by the ViT: initially, the class token serves as a generic representation, but through continuous interaction with the patch embeddings during the forward pass, it evolves to become more class-relevant. In CPSWE, after extracting the class token and patch embeddings from both query and support features, a class-patch similarity weight is calculated. We choose a ViT pretrained with MIM as our backbone because of its proven robustness and strong generalization ability in previous studies, which is crucial for extracting discriminative features from limited infrared samples. The CPSWE method is particularly effective for few-shot infrared image classification because it selectively emphasizes the most relevant image patches. Rather than treating all patches equally, CPSWE calculates the similarity between each patch and the class embedding, assigning higher weights to patches that are more relevant to the target class. This allows the model to focus on the most discriminative features, improving classification accuracy by reducing the impact of noisy or irrelevant details, which is especially important in infrared images.
Specifically, the similarity between the class token and individual patch embedding is determined via a dot product, followed by a softmax normalization to generate similarity weights. The calculation method is shown in Formula (1). For the support set:
$$W_S = \frac{\exp\left(P_S C_S^T\right)}{\sum_{j=1}^{L} \exp\left((P_S)_j C_S^T\right)} \in \mathbb{R}^{KS \times L \times 1} \tag{1}$$
where $P_S$ denotes the patch embeddings of the support samples and $C_S$ their class token embeddings. $KS$ is the total number of support samples, with $K$ the number of classes and $S$ the number of support samples per class. $L$ is the number of patch embeddings per sample, and $W_S$ gives the similarity weights for the support patch embeddings. Similarly, for query sets:
$$W_Q = \frac{\exp\left(P_Q C_Q^T\right)}{\sum_{j=1}^{L} \exp\left((P_Q)_j C_Q^T\right)} \in \mathbb{R}^{Q \times L \times 1} \tag{2}$$
where $P_Q$ denotes the patch embeddings of the query samples, $C_Q$ their class token embeddings, $Q$ the number of query samples, and $W_Q$ the similarity weights for the query patch embeddings.
These weights are then applied to re-weight the patch embeddings, enhancing the contribution of patches more semantically aligned with the class token. This process is performed for both the query and support features. The re-weighted patch embeddings are combined with the class token to generate the final class weighted embeddings, which are normalized and adjusted for subsequent similarity calculation. This interaction between the class and patch improves the alignment of patch embeddings with the target class, leading to more discriminative image representations. The relationship between class embedding and patch embedding is shown in Formulas (3) and (4):
$$\tilde{F}_Q = P_Q \odot W_Q + \lambda C_Q \tag{3}$$
$$\tilde{F}_S = P_S \odot W_S + \lambda C_S \tag{4}$$
where $\tilde{F}_Q$ and $\tilde{F}_S$ are the similarity-weighted combinations of the patch embeddings and the class token embeddings for the query and support samples, respectively, $\odot$ denotes the Hadamard product, and $\lambda$ is a scaling factor that controls the strength of the correlation between the patch embeddings and the class embeddings.
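Formulas (1)–(4) reduce to a few tensor operations. The following PyTorch sketch assumes the ViT has already produced per-sample patch embeddings and a class token; the function name, batch layout, and default value of $\lambda$ are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def class_patch_weighted_embedding(patches, cls_token, lam=2.0):
    """Re-weight patch embeddings by their similarity to the class token.
    patches:   (B, L, D) -- L patch embeddings per sample
    cls_token: (B, D)    -- one class embedding per sample
    Returns class-weighted embeddings of shape (B, L, D)."""
    # Dot-product similarity of each patch with its class token,
    # softmax-normalized over the L patches (Formulas (1)-(2)).
    sim = torch.einsum("bld,bd->bl", patches, cls_token)
    weights = F.softmax(sim, dim=1).unsqueeze(-1)        # (B, L, 1)
    # Hadamard re-weighting plus the scaled class embedding (Formulas (3)-(4)).
    return patches * weights + lam * cls_token.unsqueeze(1)
```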

3.3. Similarity Measure

To compute the similarity between images in the support set and those in the query set, a similarity matrix $R$ is defined, with its elements representing the scores between the adapted patch embeddings across images, expressed as follows:
$$R(x_q, x_s) = d\left(\tilde{F}_Q(x_q), \tilde{F}_S(x_s)\right)^2 \tag{5}$$
where $\tilde{F}_Q(x_q)$ and $\tilde{F}_S(x_s)$ denote the adapted patch embeddings of the query image and the support image, respectively, and $d(\cdot,\cdot)$ denotes the cosine similarity. The matrix $R$ is then flattened and fed directly into a multi-layer perceptron to output a similarity score. Finally, as shown in Formula (6), cross entropy $\mathcal{L}$ is employed as the loss function:
$$\mathcal{L} = -\frac{1}{Q} \sum_{q=1}^{Q} \sum_{k=1}^{K} y_{qk} \log(\hat{y}_{qk}) \tag{6}$$
here, $Q$ refers to the number of query samples, while $K$ signifies the total count of classes. $\hat{y}_{qk}$ represents the predicted probability that query sample $q$ falls under class $k$, calculated using Formula (7):
$$\hat{y}_{qk} = \frac{\exp(s_{qk})}{\sum_{j=1}^{K} \exp(s_{qj})} \tag{7}$$
where $s_{qk}$ is the similarity score for query $q$ and class $k$.
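Formulas (5)–(7) can be combined into a single episode loss. The sketch below assumes one class-weighted embedding per support class (e.g., shot-averaged in the 5-shot case) and an MLP head that maps the flattened $L \times L$ similarity matrix to a scalar score; these interface details are our assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def episode_loss(query_emb, support_emb, query_labels, mlp):
    """Cross-entropy over MLP scores of squared cosine-similarity matrices.
    query_emb:   (Q, L, D) class-weighted query patch embeddings
    support_emb: (K, L, D) one class-weighted embedding per support class
    mlp:         maps a flattened L*L similarity matrix to a scalar score."""
    q = F.normalize(query_emb, dim=-1)
    s = F.normalize(support_emb, dim=-1)
    # (Q, K, L, L): squared cosine similarity between every patch pair,
    # matching the d(.,.)**2 measure of Formula (5).
    R = torch.einsum("qld,kmd->qklm", q, s) ** 2
    logits = mlp(R.flatten(2)).squeeze(-1)   # (Q, K) similarity scores s_qk
    # F.cross_entropy applies the softmax of Formula (7) internally.
    return F.cross_entropy(logits, query_labels)
```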

4. Experiments

4.1. Datasets

Our method is assessed on the widely used miniImageNet benchmark and our self-constructed infrared dataset miniIRNet for few-shot classification. Figure 2 shows samples from the miniImageNet and miniIRNet datasets. The field of few-shot learning in visible light imagery has seen considerable progress, yet areas like infrared imaging and medical diagnostics, where few-shot datasets are even more essential, remain underserved. Our review of the literature revealed no publicly available infrared datasets specifically designed for few-shot learning. To fill this gap, we introduce miniIRNet, a benchmark infrared dataset specifically constructed for few-shot learning by carefully curating and synthesizing data from several existing public infrared datasets [5,6,7,8,9,10,11,12,13]. The miniIRNet dataset consists of 60 categories, including images of animals, buildings, cars, airplanes, and more. Each category contains approximately 300–400 infrared images, resulting in a total of 21,012 images. In the standard configuration, the dataset is randomly divided into training, validation, and test sets, with 40 categories in the training set and 10 categories each in the validation and test sets.
To better evaluate the efficacy of our proposed method, experiments were additionally performed on the commonly utilized few-shot learning dataset, miniImageNet. Derived from ImageNet, miniImageNet was initially presented within the framework of matching networks and has subsequently gained recognition as a standard benchmark for few-shot learning tasks. This dataset includes 100 classes, with each class comprising 600 images, amounting to a total of 60,000 images. It is divided randomly into three subsets: 64 classes for training, 16 for validation, and 20 for testing.

4.2. Implementation Details

All experiments are conducted using the ViT as the backbone network. Specifically, we employ the ViT-Small architecture with a patch size of 16. During the pretraining phase, we follow the same strategy as in [29] to pretrain our ViT-Small backbone, largely adhering to the hyperparameter settings reported in their work. Each image is resized to 224 × 224, and 196 patch-level features are extracted from each image using the same encoder architecture as in CPEA [31]. In the meta-training phase, for the miniImageNet dataset, we use the AdamW optimizer with default settings, an initial learning rate of 1 × 10−5, and a learning rate decay strategy. The model is trained for 80 epochs, with each epoch consisting of 600 episodes. For the miniIRNet dataset, the AdamW optimizer is employed with a starting learning rate of 1 × 10−6, alongside an identical learning rate decay schedule, running for 40 epochs. Additionally, for both datasets, standard data augmentation methods are utilized, such as random cropping, horizontal flipping, and color enhancement.
In the evaluation stage, the reliability of our findings is ensured by reporting the mean accuracy and 95% confidence interval, computed across 2000 randomly generated tasks on the test sets of both datasets. Each task contains 15 query images, and the performance of CPSWE is evaluated under both 5-way 1-shot and 5-way 5-shot settings. To ensure result reproducibility, a random seed of 1 is fixed, and all experiments are conducted using this seed. All experiments and evaluations are carried out using PyTorch and a single NVIDIA 4080 GPU.
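For reference, the following is a minimal sketch of the meta-training loop implied by these settings; the exact decay schedule is not specified above, so the `StepLR` choice and the `model`/`episode_loader` interfaces are assumptions.

```python
import torch

# model: the CPSWE network; episode_loader: yields (support, query, labels)
# per episode. Both are hypothetical interfaces for this sketch.
torch.manual_seed(1)  # fixed seed, matching the evaluation protocol
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # 1e-6 for miniIRNet
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(80):          # 40 epochs for miniIRNet
    for _ in range(600):         # 600 episodes per epoch
        support, query, labels = next(episode_loader)
        loss = model(support, query, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```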

4.3. Experimental Results

4.3.1. Results on miniImageNet

Following the convention in few-shot learning, the performance of the CPSWE method is initially evaluated on the widely adopted miniImageNet dataset, one of the most common benchmarks in this field. Table 1 presents a performance comparison between CPSWE and other few-shot classification methods on the miniImageNet dataset, with the best results highlighted in bold. The comparison focuses on few-shot learning models using different backbones, including ResNet12, WRN-28-10, and ViT-Small. These few-shot learning methods emphasize either metric-based learning or attention mechanisms, and all models are evaluated under the 5-way 1-shot and 5-way 5-shot settings.
Among the compared methods, Prototypical Networks [19] is one of the earliest and most classic approaches to solving few-shot classification using metric learning. FEAT [38] applies self-attention to mean prototypes, endowing them with task-specificity and discriminability. CAN [22] captures the semantic correlations between class and query features, drawing attention to important regions in the query feature map. ReNet [24] integrates self-correlation representations and cross-correlation attention, proposing a relational embedding network for few-shot classification. DeepBDC [23] proposes the deep Brownian distance covariance method for few-shot classification. The central concept involves learning image representations by quantifying the discrepancy between the joint characteristic function of the embedded features and the product of the marginals; however, it is computationally expensive and time-consuming. PSST [39] proposes Pareto self-supervised training, a multi-objective optimization solution to the conflicting objectives often encountered in few-shot learning. FewTURE [29] combines a Transformer-only architecture with self-supervised pretraining, marking a successful application of this approach to few-shot learning. Notably, after fine-tuning on the training classes, our method requires no further adjustments when generalizing to new test classes, whereas FewTURE requires the support set images and their labels to be optimized online during inference to learn the importance of each individual patch, making our method significantly faster in terms of inference speed. CPEA [31] combines patch embeddings with class-aware embeddings, using a class-aware patch embedding adaptation method to address the few-shot classification problem. Beyond the aforementioned methods, CPSWE was also compared with recent innovative few-shot learning approaches from different categories.
Compared to the current state-of-the-art (SOTA) methods, CPSWE achieves the best results. Specifically, on the miniImageNet 5-way 1-shot classification task, using the same ViT-Small backbone, CPSWE outperforms the second-best method, CPEA, by 1.34% in accuracy. On the 5-way 5-shot classification task, CPSWE’s accuracy surpasses CPEA’s by 1.49%.

4.3.2. Results on miniIRNet

Table 2 presents the infrared image classification results of the proposed CPSWE method, comparing its performance with several other few-shot learning methods on the miniIRNet dataset. To ensure fairness and consistency, all experiments were reproduced using the publicly available code from the original papers, with all methods evaluated under the same experimental settings as those used for CPSWE. The results for each method reflect performance under the 5-way 1-shot and 5-way 5-shot settings. ProtoNet, which adopts a metric learning framework based on the prototypical network and uses the ResNet-12 backbone, demonstrates solid performance. MetaBaseline, an advanced ProtoNet extension leveraging meta-learning principles, outperforms the other ResNet-12-based methods in the 5-way 1-shot task, achieving a 1-shot accuracy of 76.10% and a 5-shot accuracy of 89.26%. DeepBDC, which focuses on dense image features and utilizes covariance-based dense image descriptors, strengthens feature representation and achieves the highest 5-shot accuracy among the ResNet-12-based methods. Methods using the ViT-Small backbone, such as FewTURE and CPEA, achieve competitive results: FewTURE reaches 75.84% (1-shot) and 91.23% (5-shot), while CPEA records 74.93% (1-shot) and 90.00% (5-shot). In contrast, the CPSWE method with the ViT-Small backbone outperforms all these methods, with a 1-shot accuracy of 77.71% and a 5-shot accuracy of 92.06%. Compared to the baseline method CPEA, which also uses ViT-Small for class-aware embedding, CPSWE improves 1-shot performance by 2.78% and 5-shot performance by 2.06%. This demonstrates the effectiveness of its class-patch similarity weighting mechanism in selecting the most relevant patch embeddings.
For infrared images, CPSWE offers a significant advantage by enhancing the discriminative power of class-related features, which is crucial given the inherent challenges posed by low-resolution and noisy data in the infrared domain. Its ability to focus on localized discriminative features allows it to capture the differences that exist between classes. These results confirm that, in addition to achieving state-of-the-art performance, CPSWE excels in extracting robust and discriminative class-specific information, establishing its suitability for few-shot classification in the infrared domain. The class-patch similarity weighted embeddings effectively isolate relevant features in infrared images, enhancing the model’s capacity to differentiate between classes even under challenging conditions such as noise and low resolution. Furthermore, the integration of class embeddings with patch re-weighting significantly improves the model’s generalization ability when working with limited labeled data, thereby addressing a key limitation of traditional few-shot learning approaches in infrared image classification.

4.4. Ablation Study

This subsection presents ablation experiments aimed at analyzing the contribution of each component to the performance of the proposed approach. The experiments utilize miniIRNet with a pre-trained ViT-Small backbone to ensure consistency and facilitate comparison. By systematically removing or modifying specific components, we analyze the resulting performance changes and thereby quantify the contribution of each component to the overall efficacy of the CPSWE method. These detailed experiments provide deeper insights into the mechanisms driving our approach’s performance. Specifically, we examine the impact of the following modifications:
Similarity-weighted embeddings. We perform ablation experiments comparing the class similarity weighted embedding module with a baseline that omits similarity weighting. For the baseline comparison, we designed a model that does not incorporate similarity-weighted embeddings; instead, it uses the original query and support embeddings directly, bypassing the computation of similarity weights. This approach excludes the weighting mechanism that combines class tokens and patch embeddings, relying solely on unweighted features for downstream tasks. Table 3 illustrates the results of the ablation experiments. The findings demonstrate that the model incorporating class similarity weighted embedding consistently outperforms the baseline model that lacks similarity weighting. Specifically, the accuracy improvement is evident across both the 1-shot and 5-shot settings, highlighting the effectiveness of the similarity weighting mechanism.
The observed improvement can be attributed to the unique characteristics of infrared images. Infrared data often exhibits a high level of redundancy in pixel information and a lack of fine-grained details in certain regions. The similarity-weighted embedding mechanism addresses this challenge by emphasizing the most relevant patch features through a learned weighting process. By aligning patch embeddings with class token information, the model effectively suppresses irrelevant noise and enhances the discriminative features that are critical for classification. By concentrating on specific aspects, the network is able to derive more robust and meaningful representations, which enhances its performance. Furthermore, the weighting process reduces the reliance on less informative regions, which is particularly advantageous in the context of the low-contrast and sparse feature distribution typical of infrared imagery. The results indicate that accounting for the significance of different patch-level features is both logical and effective for CPSWE.
Scaling factor. The scaling factor $\lambda$ is crucial for calibrating the relevance between the class token and the patch embeddings, which has a direct impact on the model’s performance. Table 4 displays the results of testing $\lambda$ values from 0 to 8, revealing that tuning this parameter enhances the alignment of patch embeddings with target categories and significantly improves performance. $\lambda = 0$ means that the class token no longer contributes to the feature representation and only the weighted sum of patch embeddings is considered, so global feature information is neglected. Table 4 shows that the adapted patch embeddings indeed lead to a significant performance improvement, with the best accuracy achieved at $\lambda = 2$. Beyond this value, both 1-shot and 5-shot performance saturate, and further increases in $\lambda$ do not yield substantial gains. This saturation can be attributed to the fact that increasing $\lambda$ past a certain threshold overemphasizes the class token’s influence, reducing sensitivity to the diversity within the patch embeddings. As a result, the model’s capacity to differentiate between categories diminishes, and performance becomes less responsive to higher values of $\lambda$. Thus, the default scaling factor is set to 2, where the model achieves the optimal balance between class token influence and feature embedding alignment without overfitting or diminishing returns.
Similarity measures. Table 5 compares different choices of similarity measure in Equation (5) for few-shot classification on miniIRNet. The squared measure $d(\cdot,\cdot)^2$ consistently achieves the highest accuracy in both 1-shot and 5-shot settings, outperforming the raw similarity $d(\cdot,\cdot)$, its absolute value $|d(\cdot,\cdot)|$, and the scaled absolute value $2 \times |d(\cdot,\cdot)|$. These improvements suggest that squaring the similarity enhances intra-class similarity and benefits generalization, leading to better results.
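The four variants in Table 5 amount to different elementwise transforms of the cosine-similarity matrix; a small sketch follows, with the `mode` labels being our own naming:

```python
import torch

def transform_similarity(d: torch.Tensor, mode: str = "squared") -> torch.Tensor:
    """Apply one of the Table 5 variants elementwise to a cosine-similarity
    matrix d with values in [-1, 1]."""
    if mode == "raw":
        return d              # d(.,.)
    if mode == "abs":
        return d.abs()        # |d(.,.)|
    if mode == "scaled_abs":
        return 2 * d.abs()    # 2 x |d(.,.)|
    return d ** 2             # d(.,.)^2, the best-performing choice
```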
Distance function. Table 6 presents the few-shot classification accuracy on the miniIRNet dataset, comparing the performance of different distance functions, namely Euclidean distance and cosine similarity, under both 1-shot and 5-shot settings. As shown, cosine similarity outperforms Euclidean distance in both cases.
Visualization analysis. Figure 3 illustrates the feature visualization of patch embeddings for four randomly selected test tasks in a 5-way 1-shot classification setting, comparing the results with and without CPSWE. As shown, the integration of CPSWE significantly improves the clustering of patch embeddings. When CPSWE is applied, patch embeddings of the same category exhibit a more compact distribution, forming distinct and well-defined clusters, while embeddings from different categories are more separable. This demonstrates that CPSWE effectively enhances the category relevance of patch embeddings, facilitating better distinction between image categories and enabling improved task performance.
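Such plots can be reproduced by projecting the class-weighted patch embeddings into two dimensions; the projection method is not named above, so the t-SNE choice in this sketch is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_patch_embeddings(embeddings, labels):
    """Project patch embeddings of shape (N, D) to 2-D and color by class.
    t-SNE is assumed here; any 2-D projection would serve the same purpose."""
    points = TSNE(n_components=2, random_state=1).fit_transform(embeddings)
    plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=8)
    plt.title("Patch embeddings colored by class")
    plt.show()
```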

5. Conclusions

In this paper, we have proposed a novel approach for addressing the challenges associated with few-shot infrared image classification. The Class-Patch Similarity Weighted Embedding (CPSWE) method enhances feature extraction by leveraging a ViT architecture and introducing a novel mechanism for dynamically weighting patch embeddings based on their similarity to class embeddings. This approach allows the model to better capture class-relevant features despite the inherent challenges posed by low resolution, noise, and high intra-class variability in infrared imagery. By incorporating class-patch interactions, CPSWE improves the discriminative power of the learned representations, making it more effective at distinguishing between classes with limited labeled data. Additionally, the proposed framework utilizes self-supervised pretraining through MIM to generate meaningful patch embeddings, thereby minimizing the need for large-scale labeled datasets in the infrared domain. This is a significant advantage, as the scarcity of labeled infrared data often limits the effectiveness of deep learning models in such applications. Furthermore, we have introduced the miniIRNet dataset, designed for few-shot learning in infrared target classification, which may support future research and help address challenges associated with few-shot regimes in this domain. Through comprehensive experiments, we demonstrate that the CPSWE method achieves superior performance compared to traditional and other state-of-the-art few-shot learning methods on both the public benchmark and our infrared dataset. In future work, we plan to extend CPSWE to address challenges such as cross-domain few-shot infrared recognition, robustness to domain shifts, and scalability to larger and more diverse datasets. These efforts will further enhance the framework’s applicability and effectiveness in real-world infrared classification tasks.

Author Contributions

All of the authors contributed to this study. Conceptualization, Z.H.; methodology, Z.H.; software, Z.H.; formal analysis, J.G.; writing—original draft preparation, Z.H.; writing—review and editing, Z.H., X.W. and D.W.; supervision, J.G.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The miniIRNet dataset used in the experiments described in this paper is now publicly available at https://drive.google.com/file/d/1PLNi3yNgpdVT_A0hy8Jh01WnJqV-QypE/view (accessed on 12 December 2024). This dataset has been made accessible to support reproducibility and further research.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hu, Y.; Wang, K.; Chen, L.; Li, N.; Lei, Y. Visualization of Invisible Near-Infrared Light. Innov. Mater. 2024, 2, 100067. [Google Scholar] [CrossRef]
  2. Xu, Y.; Liu, X.; Cao, X.; Huang, C.; Liu, E.; Qian, S.; Liu, X.; Wu, Y.; Dong, F.; Qiu, C.-W.; et al. Artificial Intelligence: A Powerful Paradigm for Scientific Research. Innovation 2021, 2, 100179. [Google Scholar] [CrossRef] [PubMed]
  3. Zhang, C.; Yue, J.; Qin, Q. Global Prototypical Network for Few-Shot Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4748–4759. [Google Scholar] [CrossRef]
  4. Tan, J.; Zhang, R.; Zhang, Q.; Cao, Z.; Xu, L. Few-Shot Infrared Image Classification with Partial Concept Feature. In Proceedings of the 6th Pattern Recognition and Computer Vision, PRCV, Xiamen, China, 13–15 October 2023; Springer: Singapore, 2024; pp. 343–354. [Google Scholar]
  5. Zhang, H.; Luo, C.; Wang, Q.; Kitchin, M.; Parmley, A.; Monge-Alvarez, J.; Casaseca-de-la-Higuera, P. A Novel Infrared Video Surveillance System Using Deep Learning Based Techniques. Multimed. Tools Appl. 2018, 77, 26657–26676. [Google Scholar] [CrossRef]
  6. Berg, A.; Ahlberg, J.; Felsberg, M. A Thermal Object Tracking Benchmark. In Proceedings of the 2015 12th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Karlsruhe, Germany, 25–28 August 2015; pp. 1–6. [Google Scholar]
  7. Davis, J.; Keck, M. A two-stage approach to person detection in thermal imagery. In Proceedings of the 2005 Seventh IEEE Workshops on Applications of Computer Vision (WACV/MOTION’05), Breckenridge, CO, USA, 5–7 January 2005. [Google Scholar]
  8. Ariffin, S.M.Z.S.Z.; Jamil, N.; Rahman, P.N.M.A. DIAST Variability Illuminated Thermal and Visible Ear Images Datasets. In Proceedings of the 2016 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, Poland, 21–23 September 2016; pp. 191–195. [Google Scholar]
  9. Mantecón, T.; del-Blanco, C.R.; Jaureguizar, F.; García, N. Hand Gesture Recognition Using Infrared Imagery Provided by Leap Motion Controller. In Proceedings of the Advanced Concepts for Intelligent Vision Systems, 7th International Conference (ACVIS), Antwerp, Belgium, 20–23 September 2005; Springer: Cham, Switzerland, 2016; pp. 47–57. [Google Scholar]
  10. Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A Visible-Infrared Paired Dataset for Low-Light Vision. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3496–3504. [Google Scholar]
  11. Liu, Q.; Li, X.; Yuan, D.; Yang, C.; Chang, X.; He, Z. LSOTB-TIR: A Large-Scale High-Diversity Thermal Infrared Single Object Tracking Benchmark. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 9844–9857. [Google Scholar] [CrossRef]
  12. Liu, Q.; He, Z.; Li, X.; Zheng, Y. PTB-TIR: A Thermal Infrared Pedestrian Tracking Benchmark. IEEE Trans. Multimed. 2020, 22, 666–675. [Google Scholar] [CrossRef]
  13. Dai, X.; Yuan, X.; Wei, X. TIRNet: Object Detection in Thermal Infrared Images for Autonomous Driving. Appl. Intell. 2021, 51, 1244–1261. [Google Scholar] [CrossRef]
  14. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning (PMLR), Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
  15. Sun, Q.; Liu, Y.; Chua, T.-S.; Schiele, B. Meta-Transfer Learning for Few-Shot Learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 403–412. [Google Scholar]
  16. Lee, K.; Maji, S.; Ravichandran, A.; Soatto, S. Meta-Learning with Differentiable Convex Optimization. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10657–10665. [Google Scholar]
  17. Wang, R.; Demiris, Y.; Ciliberto, C. Structured Prediction for Conditional Meta-Learning. In Proceedings of the Advances in Neural Information Processing Systems 33, Northern Ireland, UK, 6–12 December 2020; Volume 33, pp. 2587–2598. [Google Scholar]
  18. Deleu, T.; Kanaa, D.; Feng, L.; Kerg, G.; Bengio, Y.; Lajoie, G.; Bacon, P.-L. Continuous-Time Meta-Learning with Forward Mode Differentiation. In Proceedings of the 9th International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  19. Snell, J.; Swersky, K.; Zemel, R. Prototypical Networks for Few-Shot Learning. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  20. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching Networks for One Shot Learning. In Proceedings of the Advances in Neural Information Processing Systems 29, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  21. Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation Networks for Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–22 June 2018; pp. 3588–3597. [Google Scholar]
  22. Hou, R.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Cross Attention Network for Few-Shot Classification. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  23. Xie, J.; Long, F.; Lv, J.; Wang, Q.; Li, P. Joint Distribution Matters: Deep Brownian Distance Covariance for Few-Shot Classification. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 7962–7971. [Google Scholar]
  24. Kang, D.; Kwon, H.; Min, J.; Cho, M. Relational Embedding for Few-Shot Classification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 8802–8813. [Google Scholar]
  25. Cheng, H.; Yang, S.; Zhou, J.T.; Guo, L.; Wen, B. Frequency Guidance Matters in Few-Shot Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023, Paris, France, 2–6 October 2023; pp. 11814–11824. [Google Scholar]
  26. Sun, S.; Gao, H. Meta-AdaM: A Meta-Learned Adaptive Optimizer with Momentum for Few-Shot Learning. Adv. Neural Inf. Process. Syst. 2023, 36, 65441–65455. [Google Scholar]
  27. Zhang, B.; Luo, C.; Yu, D.; Li, X.; Lin, H.; Ye, Y.; Zhang, B. MetaDiff: Meta-Learning with Conditional Diffusion for Few-Shot Learning. Proc. AAAI Conf. Artif. Intell. 2024, 38, 16687–16695. [Google Scholar] [CrossRef]
  28. Zhu, H.; Koniusz, P. Transductive Few-Shot Learning With Prototype-Based Label Propagation by Iterative Graph Refinement. arXiv 2023, arXiv:2304.11598. [Google Scholar]
  29. Hiller, M.; Ma, R.; Harandi, M.; Drummond, T. Rethinking Generalization in Few-Shot Classification. Adv. Neural Inf. Process. Syst. 2022, 35, 3582–3595. [Google Scholar]
  30. He, Y.; Liang, W.; Zhao, D.; Zhou, H.-Y.; Ge, W.; Yu, Y.; Zhang, W. Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-Shot Learning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9109–9119. [Google Scholar]
  31. Hao, F.; He, F.; Liu, L.; Wu, F.; Tao, D.; Cheng, J. Class-Aware Patch Embedding Adaptation for Few-Shot Image Classification. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 18859–18869. [Google Scholar]
  32. Chen, R.; Liu, S.; Li, F. Infrared Aircraft Few-Shot Classification Method Based on Meta Learning. J. Infrared Millim. Waves 2021, 40, 554–560. [Google Scholar]
  33. Li, Y.-Z.; Zhang, Y.; Chen, Y.; Yang, C.-L. An Unsupervised Few-Shot Infrared Aerial Object Recognition Network Based on Deep-Shallow Learning Graph Model. J. Infrared Millim. Waves 2023, 42, 916–923. [Google Scholar] [CrossRef]
  34. Yang, B.; Zhang, R.; Liu, Y.; Liu, G.; Cao, Z.; Yang, Z.; Yu, H.; Xu, L. CTL-I: Infrared Few-Shot Learning via Omnidirectional Compatible Class-Incremental. In Proceedings of the 13th International Conference on Big Data Technologies and Applications, BDTA 2023, Edinburgh, UK, 23–24 August 2023; Springer: Cham, Switzerland, 2024; pp. 3–17. [Google Scholar]
  35. Huang, X.; Choi, S.H. SAPENet: Self-Attention Based Prototype Enhancement Network for Few-Shot Learning. Pattern Recognit. 2023, 135, 109170. [Google Scholar] [CrossRef]
  36. Sim, C.; Kim, G. Cross-Attention Based Dual-Similarity Network for Few-Shot Learning. Pattern Recognit. Lett. 2024, 186, 1–6. [Google Scholar] [CrossRef]
  37. Huang, Y.; Hao, H.; Ge, W.; Cao, Y.; Wu, M.; Zhang, C.; Guo, J. Relation Fusion Propagation Network for Transductive Few-Shot Learning. Pattern Recognit. 2024, 151, 110367. [Google Scholar] [CrossRef]
  38. Ye, H.-J.; Hu, H.; Zhan, D.-C.; Sha, F. Few-Shot Learning via Embedding Adaptation With Set-to-Set Functions. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8805–8814. [Google Scholar]
  39. Chen, Z.; Ge, J.; Zhan, H.; Huang, S.; Wang, D. Pareto Self-Supervised Training for Few-Shot Learning. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13658–13667. [Google Scholar]
  40. Zhang, X.; Meng, D.; Gouk, H.; Hospedales, T. Shallow Bayesian Meta Learning for Real-World Few-Shot Recognition. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 631–640. [Google Scholar]
  41. Chen, W.-Y.; Liu, Y.-C.; Kira, Z.; Wang, Y.-C.F.; Huang, J.-B. A Closer Look at Few-Shot Classification. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019; p. 3. [Google Scholar]
  42. Chen, Y.; Liu, Z.; Xu, H.; Darrell, T.; Wang, X. Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9062–9071. [Google Scholar]
  43. Liu, Y.; Zhang, W.; Xiang, C.; Zheng, T.; Cai, D.; He, X. Learning to Affiliate: Mutual Centralized Learning for Few-Shot Classification. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 14391–14400. [Google Scholar]
Figure 1. The overall framework of the proposed CPSWE method. The framework takes support images and query images, splits them into patches, and processes them through a pretrained ViT to extract patch embeddings. Within the ViT, class embeddings are iteratively updated through continuous interaction with the patch embeddings. The final patch embeddings and class embeddings are combined to produce weighted patch embeddings. These weighted patch embeddings are used to compute a similarity matrix, which measures relationships between patches of support and query images. Finally, the similarity matrix is fed into a Multi-Layer Perceptron (MLP) to generate the similarity score.
Figure 2. Some samples of the datasets. (a) miniImageNet, (b) miniIRNet.
Figure 3. Visualization of patch embeddings for four distinct 5-way 1-shot classification tests, each with one query image per class. Subfigures (a–d) present results prior to applying CPSWE, whereas (e–h) illustrate outcomes after its application. CPSWE effectively refines the patch embeddings, clustering them by class and enhancing their class relevance.
Table 1. Few-shot classification results for 5-way 1-shot and 5-way 5-shot tasks on miniImageNet.

| Model | Backbone | ≈Params | 1-Shot | 5-Shot |
|---|---|---|---|---|
| ProtoNet [19] | ResNet-12 | 12.4 M | 60.76 ± 0.47 | 78.51 ± 0.34 |
| CAN [22] | ResNet-12 | 12.4 M | 63.85 ± 0.48 | 79.44 ± 0.34 |
| SAPENet [35] | ResNet-12 | 12.4 M | 66.41 ± 0.20 | 82.76 ± 0.14 |
| DeepBDC [23] | ResNet-12 | 12.4 M | 67.34 ± 0.43 | 84.46 ± 0.28 |
| ReNet [24] | ResNet-12 | 12.4 M | 67.60 ± 0.44 | 82.58 ± 0.30 |
| DSN [36] | ResNet-12 | 12.4 M | 70.37 ± 0.41 | 85.25 ± 0.30 |
| RFPN [37] | ResNet-12 | 12.4 M | 67.43 ± 0.51 | 83.69 ± 0.43 |
| FEAT [38] | WRN-28-10 | 36.5 M | 65.10 ± 0.20 | 81.11 ± 0.14 |
| PSST [39] | WRN-28-10 | 36.5 M | 64.16 ± 0.44 | 80.64 ± 0.32 |
| MetaQDA [40] | WRN-28-10 | 36.5 M | 67.83 ± 0.64 | 84.28 ± 0.69 |
| FewTURE [29] | ViT-Small | 22 M | 68.02 ± 0.88 | 84.51 ± 0.53 |
| CPEA [31] | ViT-Small | 22 M | 71.97 ± 0.65 | 87.06 ± 0.38 |
| CPSWE | ViT-Small | 22 M | **73.31 ± 0.65** | **88.55 ± 0.35** |
Table 2. Few-shot classification results for 5-way 1-shot and 5-way 5-shot tasks on miniIRNet.

| Model | Backbone | ≈Params | 1-Shot | 5-Shot |
|---|---|---|---|---|
| ProtoNet [19] | ResNet-12 | 12.4 M | 70.81 ± 0.41 | 83.03 ± 0.30 |
| Baseline [41] | ResNet-12 | 12.4 M | 71.20 ± 0.42 | 89.74 ± 0.58 |
| MetaBaseline [42] | ResNet-12 | 12.4 M | 76.10 ± 0.72 | 89.26 ± 0.72 |
| MCL [43] | ResNet-12 | 12.4 M | 71.15 ± 0.74 | 81.95 ± 0.59 |
| DeepBDC [23] | ResNet-12 | 12.4 M | 75.42 ± 0.86 | 91.17 ± 0.30 |
| ReNet [24] | ResNet-12 | 12.4 M | 71.19 ± 0.99 | 85.85 ± 0.71 |
| FewTURE [29] | ViT-Small | 22 M | 75.84 ± 0.92 | 91.23 ± 0.34 |
| CPEA [31] | ViT-Small | 22 M | 74.93 ± 0.73 | 90.00 ± 0.43 |
| CPSWE | ViT-Small | 22 M | **77.71 ± 0.67** | **92.06 ± 0.39** |
Table 3. Results of the class similarity weighted embeddings for few-shot classification on miniIRNet.

| Weighted Embedding | 1-Shot | 5-Shot |
|---|---|---|
| × | 75.22 ± 0.68 | 89.94 ± 0.46 |
| ✓ | **77.71 ± 0.67** | **92.06 ± 0.39** |
Table 4. Effect of the scaling factor for few-shot classification on miniIRNet.

| Scaling Factor | 1-Shot | 5-Shot |
|---|---|---|
| λ = 0 | 73.50 ± 0.73 | 90.45 ± 0.42 |
| λ = 0.5 | 77.20 ± 0.68 | 91.69 ± 0.40 |
| λ = 1 | 77.25 ± 0.70 | 91.03 ± 0.39 |
| λ = 2 | **77.71 ± 0.67** | **92.06 ± 0.39** |
| λ = 4 | 77.23 ± 0.69 | 91.63 ± 0.37 |
| λ = 6 | 77.12 ± 0.69 | 90.86 ± 0.41 |
| λ = 8 | 77.04 ± 0.64 | 90.64 ± 0.44 |
Table 5. Impact of different similarity measures in Equation (5) for few-shot classification on miniIRNet.

| Similarity Measure | 1-Shot | 5-Shot |
|---|---|---|
| $d(\cdot,\cdot)$ | 76.79 ± 0.67 | 90.82 ± 0.43 |
| $\lvert d(\cdot,\cdot)\rvert$ | 76.68 ± 0.68 | 91.40 ± 0.40 |
| $2\times\lvert d(\cdot,\cdot)\rvert$ | 76.59 ± 0.71 | 91.48 ± 0.42 |
| $d(\cdot,\cdot)^2$ | **77.71 ± 0.67** | **92.06 ± 0.39** |
Table 6. Few-shot classification accuracy on miniIRNet with different distance functions.

| Distance Function | 1-Shot | 5-Shot |
|---|---|---|
| Euclidean distance | 75.55 ± 0.75 | 90.60 ± 0.42 |
| Cosine similarity | **77.71 ± 0.67** | **92.06 ± 0.39** |
