Article

CGAP-HBSA: A Source Camera Identification Framework Under Few-Shot Conditions

College of Computer and Artificial Intelligence, Hunan University of Technology, Zhuzhou 412007, China
*
Author to whom correspondence should be addressed.
Symmetry 2026, 18(1), 71; https://doi.org/10.3390/sym18010071
Submission received: 4 December 2025 / Revised: 25 December 2025 / Accepted: 26 December 2025 / Published: 31 December 2025

Abstract

Source camera identification relies on sensor noise features to distinguish between devices, but large-scale sample labeling is time-consuming and labor-intensive, making it difficult to apply in real-world settings. The noise residuals generated by different camera sensors exhibit statistical asymmetry, and the structured patterns within these residuals also show local symmetric relationships; together, these properties form the theoretical foundation for camera source identification. To address the problem of limited labeled data under few-shot conditions, this paper proposes a Cross-correlation Guided Augmentation and Prediction with Hybrid Bidirectional State-Space Model Attention (CGAP-HBSA) framework built on this symmetry-related foundation. The method extracts symmetric correlation structures from unlabeled samples and converts them into reliable pseudo-labeled samples. Furthermore, the HBSA network jointly models symmetric structures and asymmetric variations in camera fingerprints using a bidirectional SSM module and a hybrid attention mechanism, thereby enhancing long-range spatial modeling capability and recognition robustness. On the Dresden dataset, the proposed method achieves an identification accuracy in the 5-shot camera source identification task that is only 0.02% lower than the current best-performing few-shot method, MDM-CPS, and outperforms other classical few-shot camera source identification methods; in the 10-shot task, it improves on MDM-CPS by at least 0.3%. On the Vision dataset, the method improves identification accuracy in the 5-shot task by at least 6% over MDM-CPS, and in the 10-shot task by at least 3%. Experimental results demonstrate that the proposed method achieves competitive or superior performance in both 5-shot and 10-shot settings.
Additional robustness experiments further confirm that the HBSA network maintains strong performance even under image compression and noise contamination conditions.

1. Introduction

Source Camera Identification (SCI) technology was initially based on digital watermarking and on Photo-Response Non-Uniformity (PRNU) [1], a camera device fingerprint that exploits sensor noise characteristics to identify the imaging device. Traditional SCI methods, particularly those based on Convolutional Neural Networks (CNNs) [2], rely heavily on large volumes of labeled training data. A lack of sufficient training data can lead to poor generalization, overfitting, and inadequate performance in tasks such as deepfake detection. These methods are particularly dependent on high-quality labeled data: if the training data does not cover sufficient diversity across source cameras, the model may fail to correctly identify the camera source of new images in real-world applications. Labeling a large number of images for SCI tasks is time-consuming and labor-intensive, making it impractical in many real-world scenarios; labeling a smaller number of images, however, is feasible, which makes few-shot source camera identification more applicable in practice. Thus, this paper investigates source camera identification under few-shot conditions.
In the context of few-shot learning, the problem of camera source identification shares similarities with image recognition under limited data conditions. A key difference, however, lies in the complexity of camera fingerprint features, which makes pseudo-labeling for data augmentation prone to misclassification. According to our literature survey, only a small number of researchers have studied camera source identification in few-shot settings. Existing methods often fail to provide an effective metric for camera fingerprints, and the recognition techniques are largely based on traditional approaches such as CNNs and Support Vector Machines (SVMs). To improve the identification accuracy of deep learning models for source camera identification, Wu et al. [3] designed an adaptive attention dense network by introducing adaptive weighting factors to enable parameter-wise self-adaptive optimization, thereby obtaining attention weights that better match different data characteristics and enhancing feature learning and representation. Long et al. [4] proposed a camera forensics approach based on multi-scale feature fusion, which extracts local features from feature maps at different scales and then fuses them into a comprehensive representation to train a classification network for source camera identification under non-few-shot settings. Lu et al. [5] employed a masked autoencoder for image augmentation to increase sample complexity, and further designed a multi-scale attention module to emphasize class-relevant features, aiming to alleviate the large class-prototype bias commonly observed in few-shot image recognition.
In the SCI task, sensor noise patterns not only exhibit significant differences between different cameras, but also demonstrate statistical asymmetry, which serves as an important criterion for distinguishing different devices. At the same time, noise residuals contain certain local symmetric structures, such as repetitive patterns and correlated structures in the spatial domain. Effectively modeling these symmetric and asymmetric features plays a key role in improving recognition stability and robustness. However, existing SCI methods often struggle to fully capture these structural characteristics under few-shot conditions, limiting the generalization ability of the model.
The cross-correlation coefficient, a statistical measure of the linear relationship between two signals, has been widely applied in audio matching, image matching, edge detection, template matching, and object tracking. In this work, we extend its application to the augmentation of few-shot datasets by proposing a pseudo-labeling method based on cross-correlation coefficients. Furthermore, to enhance the effectiveness of the recognition model and prevent overfitting, we introduce a Hybrid Bidirectional State-Space Model Attention (HBSA) block. By stacking HBSA blocks into HBSA layers and integrating them with pooling and other modules, we construct the HBSA network, which improves global modeling capacity while reducing reliance on large sample sizes. Compared with existing convolution- and attention-based source camera identification methods, HBSA offers the advantage of jointly addressing “selective enhancement” and “global propagation modeling.” On the one hand, spatial and channel attention dynamically highlight informative regions and channels while suppressing irrelevant responses, thereby reducing the interference of scene content and background noise on camera fingerprint features. On the other hand, HBSA incorporates a bidirectional state-space model to capture long-range dependencies, enabling globally consistent yet weak and spatially dispersed noise/imaging-pipeline differences to be effectively aggregated and represented. Meanwhile, the hierarchical stacking of HBSA blocks supports progressive multi-scale and multi-level noise feature extraction; as network depth increases, the effective receptive field is expanded, cross-scale robustness is improved, and dependence on large labeled datasets is reduced. Consequently, HBSA is better suited to few-shot conditions and real-world scenarios involving image-processing degradations such as compression, noise, and blur.
The proposed CGAP-HBSA framework explicitly integrates symmetry-related information in its algorithm design. On one hand, the cross-correlation coefficient can measure the structural similarity between noise signals, making it an effective tool for characterizing symmetric correlation patterns. This property is used in the paper to convert high-structural-consistency unlabeled samples into pseudo-labeled ones, thereby augmenting the training data. On the other hand, the HBSA network jointly models symmetric structures and asymmetric variations in noise fingerprints through a bidirectional SSM module and a hybrid attention mechanism, enabling the network to capture both global correlations and local differences. These designs allow the proposed method to maintain good recognition performance under few-shot conditions and exhibit high robustness even in the presence of compression and noise interference.
The contributions of this paper are summarized as follows:
(1)
We propose a pseudo-label augmentation method based on cross-correlation coefficients. Compared with classical pseudo-labeling strategies such as self-training, consistency regularization, and ensemble voting, our method achieves higher interpretability and stronger adaptability to few-shot camera image data.
(2)
We design the HBSA block and use multiple blocks to construct HBSA layers, forming the HBSA network. This architecture enables effective information propagation within blocks and progressive transmission across blocks, thereby enhancing global modeling ability and reducing dependence on data scale.
(3)
We conduct comparative experiments and ablation studies on two datasets to validate the effectiveness of the proposed approach.
The remainder of this paper is organized as follows. Section 2 reviews existing methods for few-shot image recognition and summarizes related work on SCI under few-shot scenarios. Section 3 presents our proposed framework, Cross-correlation Guided Augmentation and Prediction with HBSA (CGAP-HBSA), in detail. Section 4 describes the experimental design and discusses the results. Finally, Section 5 concludes the paper with a summary of our findings.

2. Related Works

In the fields of image forensics and SCI, symmetry-related features have gradually gained attention in recent years. Image noise residuals often contain certain spatial or statistical symmetric structures, such as correlation patterns, repetitive frequency components, or structural consistency, which reflect the inherent characteristics of the sensor during the sampling and readout processes. Some studies have employed correlation analysis, transform-domain structures, or symmetry discriminative functions to extract stable camera fingerprint features, demonstrating that these structures play a positive role in enhancing recognition performance. However, most existing methods rely on large-scale labeled data to train complex models or focus only on shallow symmetric structure modeling, making it difficult to fully exploit symmetric and asymmetric features under few-shot conditions. In the study of source camera identification under few-shot conditions, most existing approaches focus on feature augmentation, where the dataset is expanded through the generation of virtual samples or pseudo-labels. Methods that generate pseudo-labels for unlabeled samples based on labeled data fall within the scope of semi-supervised learning, which has shown particular promise in addressing few-shot source camera identification tasks.
For instance, Tan et al. [6] proposed an ensemble projection (EP) method based on semi-supervised learning, in which multiple prototype sets were constructed to augment few-shot camera datasets, followed by the use of CNN for camera source classification. Wu et al. [7] employed a Mega-Trend-Diffusion (MTD)-based ensemble learning framework to generate virtual samples, thereby expanding limited camera datasets. These augmented datasets, together with the original few-shot samples, were used to train multiple SVM classifiers, whose outputs were then combined through ensemble learning for camera source identification.
Wang et al. [8] introduced the Multi-PCEP framework, which first leveraged semi-supervised learning to construct prototype sets. These prototypes were then used to retrain SVM classifiers, with the posterior probability of each image sample belonging to each class serving as the final projection vector. The classification results were obtained via ensemble voting. Similarly, Wang et al. [9] developed the Multi-DS method, which expands labeled few-shot camera datasets using multiple distance metrics and employs an SVM self-correction mechanism to refine pseudo-labels, thereby enhancing model performance. Building on this line of work, Wang et al. [10] further proposed the MDM-CPS method, which combines multiple distance metrics with coordinate-based pseudo-label selection to improve few-shot camera source recognition.
In recent years, vision architectures built on State-Space Models (SSM), such as Vision Mamba, have demonstrated strong long-range dependency modeling capability and computational efficiency for tasks including image classification, detection, and segmentation, largely due to mechanisms such as selective scanning and gating. However, directly transferring these models to SCI often fails to yield comparable benefits, primarily because SCI differs substantially in both task objective and the nature of discriminative cues. Specifically, SCI relies more on “camera fingerprint” signals—such as CFA interpolation traces, sensor noise, compression residuals, and color processing pipelines—that are typically low-amplitude, spatially dispersed, and easily overwhelmed by scene content. In the absence of explicit feature selection, Vision Mamba’s sequence-based global modeling can therefore be biased toward semantic content rather than camera-specific artifacts. Moreover, transforming 2D feature maps into 1D sequences for scanning-based propagation may weaken local geometric consistency and introduce mixing effects, which can smooth out or dilute the local noise statistics critical for SCI. In addition, camera-induced differences exhibit strong channel dependencies; without dynamic channel reweighting, the model may struggle to emphasize channel responses associated with the imaging pipeline. Under few-shot settings, these issues are further amplified, increasing the risk of overfitting and degrading cross-scene generalization. Motivated by these observations, we integrate the global modeling strength of a bidirectional SSM with spatial attention (to highlight informative regions and suppress background interference) and channel attention (to enhance discriminative channels while suppressing redundant ones), forming the HBSA block. 
By stacking HBSA blocks hierarchically to enable progressive extraction of noise-related features, this design provides a well-grounded motivation for the proposed HBSA network. In contrast to the aforementioned work, this paper reexamines the feature construction process of SCI from the perspective of symmetry. By using a cross-correlation mechanism, it captures the symmetric correlation structures between noise signals. Additionally, a deep network is employed to further learn the joint representation of symmetric structures and asymmetric variations, thereby achieving higher robustness and generalization ability in few-shot scenarios.

3. The Proposed CGAP-HBSA Framework

3.1. Overview

The overall framework is illustrated in Figure 1. This framework, termed Cross-correlation Guided Augmentation and Prediction with HBSA (CGAP-HBSA), consists of three main components: Feature Extraction, Pseudo-labeled sample augmentation, and the HBSA network.
(1)
Feature Extraction. The primary role of this stage is to extract discriminative features from image samples, which are subsequently used for pseudo-label augmentation in the few-shot dataset. We employ Photo-Response Non-Uniformity (PRNU) features as the basis for camera sample augmentation, leveraging their four key advantages: uniqueness, high discriminability, stability, and robustness.
(2)
Pseudo-labeled Sample Augmentation. This component expands the few-shot dataset by leveraging both the labeled small-sample set and a large number of unlabeled samples. Through the proposed cross-correlation guided pseudo-labeling method, we generate augmented labeled samples, forming an expanded training dataset. By enriching the dataset in this way, the risk of model overfitting is effectively reduced.
(3)
HBSA Network. Finally, the extracted features and the augmented dataset are used to construct and train the HBSA network, which serves as the classification model. The HBSA network is composed of six layers, with the HBSA layer at its core. The HBSA layer repeatedly downsamples the feature maps while expanding the channel dimensions, thereby capturing higher-dimensional representations. The classification model is trained on the expanded dataset, and during inference, test samples are input into the trained HBSA network to produce the final predictions.
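To make the Feature Extraction stage concrete, the following is a simplified, hypothetical sketch of PRNU-style residual extraction in Python/NumPy. It uses a 3×3 box filter as a stand-in for the wavelet-domain denoiser typically used in the PRNU literature, and the function names are our own, not those of the paper's implementation:

```python
import numpy as np

def denoise_box3(img):
    """3x3 box-filter denoiser: a simple stand-in for the wavelet-based
    filter commonly used in PRNU extraction."""
    h, w = img.shape
    p = np.pad(img, 1, mode="edge")
    # Average the 9 shifted views that make up each 3x3 neighborhood.
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def prnu_residual(img):
    """Noise residual W = I - F(I): the image minus its denoised version."""
    img = np.asarray(img, dtype=np.float64)
    return img - denoise_box3(img)

def camera_fingerprint(images):
    """Estimate a camera fingerprint as the mean residual over several
    images captured by the same device."""
    return np.mean([prnu_residual(im) for im in images], axis=0)
```

In practice, averaging residuals over many same-device images suppresses scene content while reinforcing the device-specific PRNU pattern.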

3.2. Cross-Correlation–Based Pseudo-Label Expansion Algorithm

In SCI tasks, traditional pseudo-label augmentation methods face several inherent limitations. For example, in self-training approaches, the initial model typically exhibits weak generalization ability, which easily leads to the generation of incorrect pseudo-labels. This, in turn, creates a cycle of “error amplification,” while the model also tends to assign pseudo-labels preferentially to majority classes, thereby exacerbating class imbalance.
Moreover, since SCI relies heavily on subtle low-frequency or noise patterns in images, consistency-based augmentation methods are problematic: strong augmentations such as blurring or geometric transformations may disrupt these critical features. In particular, image enhancement operations can interfere with PRNU noise, reducing its correlation and ultimately causing feature confusion. Finally, ensemble-based pseudo-labeling methods require training multiple models, which is computationally expensive and impractical for few-shot scenarios. To address these shortcomings, we propose a cross-correlation-guided pseudo-label expansion algorithm based on the Normalized Cross-Correlation (NCC), which addresses the limitations above in several ways. First, it avoids the incorrect labels produced by weak generalization in self-training, reducing the "error amplification" phenomenon. Second, cross-correlation-based pseudo-label generation helps mitigate the class imbalance issue, avoiding model bias toward the majority class. Moreover, the NCC measure preserves key low-frequency noise features while avoiding the distortion caused by traditional augmentations such as blurring or geometric transformations. Unlike traditional methods, it relies neither on model retraining nor on complex image augmentations, which reduces computational complexity and makes it particularly suitable for few-shot learning scenarios. Finally, the NCC coefficient is robust to common image processing operations such as noise addition, compression, and cropping, ensuring that pseudo-labeling is not disrupted by such transformations, making it well-suited to SCI tasks.
The cross-correlation coefficient is a statistical measure used to quantify the similarity between two signals, images, or vectors, and it can be employed to detect the degree of alignment of patterns across different data. In camera source identification tasks, to ensure standardized outputs and mitigate the influence of brightness and contrast variations, the cross-correlation coefficient is normalized to obtain the Normalized Cross-Correlation (NCC) [11], denoted as Ncc. Its value satisfies −1 ≤ Ncc ≤ 1, where values closer to 1 indicate higher similarity. Since mean removal and variance normalization are applied, the NCC coefficient is invariant to global linear transformations of brightness and contrast. Moreover, the pseudo-label augmentation process based on NCC does not rely on model training or image augmentation, thereby reducing the overall complexity of the augmentation procedure. In addition, the NCC coefficient between two images can maintain high matching accuracy even under common image processing operations such as noise addition, mild compression, and cropping. This robustness ensures that the measure is not disrupted by image content, making it well-suited for camera source identification tasks.
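As an illustration, a minimal NumPy implementation of the NCC measure described above might look as follows (a sketch; the function name and the flattening of residuals into vectors are our assumptions):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two flattened signals.

    Mean removal and variance normalization make the score invariant to
    global linear changes in brightness and contrast; the result lies
    in [-1, 1], with values near 1 indicating high similarity.
    """
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # degenerate case: at least one constant signal
    return float(np.dot(a, b) / denom)
```

Note that `ncc(x, 2*x + 3)` returns 1.0: scaling and shifting a signal does not change the score, which is exactly the brightness/contrast invariance described above.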
Let D = {(X1, Y1), (X2, Y2), …, (XN, YN)} denote a labeled few-shot dataset, where each image is represented as Xi (for 1 ≤ i ≤ N) with corresponding label Yi; the dataset thus contains N labeled samples. Let Q = {X′1, X′2, …, X′M} represent the unlabeled dataset, which contains M images (M ≫ N). Using the normalized cross-correlation (NCC) coefficient, we design a pseudo-label sample selection algorithm, as illustrated in Algorithm 1. In Algorithm 1, Ū denotes the pseudo-labeled augmented sample set, which is initialized as an empty set, and T represents the predefined threshold for the NCC coefficient.
Algorithm 1 Pseudo-Label Expansion Based on Normalized Cross-Correlation
          Input: Labeled few-shot dataset D; unlabeled dataset Q
          Output: Pseudo-labeled augmented dataset Ū
           1:  Initialize Ū = ∅; set parameters k, T
           2:  for each Xi ∈ D (1 ≤ i ≤ N) do
           3:          if |Q| < k then break end if
           4:          Stage 1: Randomly sample k instances from Q: Φ = {X′j}, j = 1…k (k < M)
           5:          Stage 2: Use Φ to construct the temporary pair set Ů = {(Xi, X′j)}, j = 1…k; compute the NCC value NCCj for each pair (Xi, X′j)
           6:          Stage 3: Identify the pair with the maximum NCC value: (Xi, Xb), where NCCb = max{NCCj, 1 ≤ j ≤ k}
           7:          if NCCb > T then
                              Stage 4: Assign the label Yi of Xi to Xb; add (Xb, Yi) to Ū; remove Xb from Q
                       end if
           8:  end for
During each augmentation iteration, an image Xi is sequentially selected from the labeled few-shot dataset D. For each Xi, k unlabeled samples are randomly drawn from Q to form a temporary candidate set Φ. Together, Φ and Xi constitute a temporary paired sample set Ů. The NCC value is computed for every pair in Ů, and the pair with the highest value, denoted (Xi, Xb), is identified; its NCC value is denoted NCCb. If NCCb exceeds the threshold T, the label of Xi is assigned to Xb, and Xb is subsequently removed from Q; otherwise, the algorithm proceeds to the next image Xi+1. The pseudo-label expansion process terminates when fewer than k images remain in Q.
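The iteration just described can be sketched as follows. This is a minimal Python/NumPy rendering of Algorithm 1 under our own assumptions: samples are NumPy noise residuals, the labeled set is a list of (residual, label) pairs, the unlabeled list is consumed in place, and all function names are ours:

```python
import random
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two flattened signals (in [-1, 1])."""
    a = np.asarray(a, float).ravel(); a = a - a.mean()
    b = np.asarray(b, float).ravel(); b = b - b.mean()
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / d) if d else 0.0

def expand_pseudo_labels(labeled, unlabeled, k=100, T=0.5, seed=0):
    """Sketch of Algorithm 1: NCC-guided pseudo-label expansion.

    labeled:   list of (residual, label) pairs -- the few-shot set D
    unlabeled: list of residuals -- the set Q, mutated as samples are claimed
    Returns the pseudo-labeled set (the paper's U-bar) as (residual, label) pairs.
    """
    rng = random.Random(seed)
    pseudo = []
    for x_i, y_i in labeled:
        if len(unlabeled) < k:                              # termination condition
            break
        idx = rng.sample(range(len(unlabeled)), k)          # Stage 1: draw k candidates
        scores = [ncc(x_i, unlabeled[j]) for j in idx]      # Stage 2: NCC per pair
        best = max(range(k), key=lambda t: scores[t])       # Stage 3: best match
        if scores[best] > T:                                # Stage 4: accept and remove
            pseudo.append((unlabeled[idx[best]], y_i))
            del unlabeled[idx[best]]
    return pseudo
```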
Let P denote the feature length (the number of pixels) used for NCC computation. For each labeled sample Xi (i = 1, …, N), Algorithm 1 randomly draws k candidates from the unlabeled set Q and computes NCC for the resulting k pairs to select the maximum. The dominant cost per iteration is therefore O(kP), and the overall time complexity is O(NkP). Importantly, this complexity is independent of the unlabeled set size M (as long as M ≥ k), since only k candidates are evaluated each time. In contrast, a full matching strategy would require O(NMP) NCC computations, which becomes prohibitive when M ≫ N. Hence, the proposed sampling-based NCC expansion offers a clear efficiency advantage for large-scale unlabeled data by decoupling the computation from M. The additional memory overhead is O(k), needed to store the temporary candidate set and the corresponding NCC scores.
The parameter k controls not only the computational cost but also the probability of including a true same-source candidate in the temporary set. Let p be the proportion of unlabeled samples in Q that share the same camera source as Xi. Under random sampling, the probability of hitting at least one same-source sample is:
Pr(hit) = 1 − (1 − p)^k,
which rapidly increases with k. With k = 100, Pr(hit) ≈ 0.994 for p = 0.05 and ≈ 0.952 for p = 0.03, providing a strong coverage guarantee without resorting to exhaustive search. Meanwhile, increasing k also raises the chance of observing spuriously high NCC values from mismatched samples (an extreme-value effect), so an overly large k may yield diminishing returns or even destabilize pseudo-label quality. Therefore, k = 100 is selected as a setting that balances same-source sample coverage and computational efficiency. The choice of the threshold T is discussed experimentally in Section 4.
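The coverage probabilities above follow directly from the formula and can be checked with a few lines of plain Python (the function name is ours):

```python
def hit_probability(p, k):
    """Probability of drawing at least one same-source candidate among k
    uniform random draws, where p is the same-source proportion in Q:
    Pr(hit) = 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k

# For k = 100: p = 0.05 gives a value close to 0.994,
# while p = 0.03 gives a value just above 0.95.
print(hit_probability(0.05, 100))
print(hit_probability(0.03, 100))
```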

3.3. HBSA Network and HBSA Block

SCI requires the analysis of high-dimensional image features, such as illumination conditions, image noise, and color distribution. Classical CNNs are capable of extracting high-dimensional feature information and have thus been widely applied to this task. However, several limitations remain:
(1)
Insufficient global modeling capacity. The convolution operation in classical CNNs is inherently local. Although stacking multiple convolutional layers can expand the receptive field, the ability to capture long-range dependencies, such as global information within an image, remains limited.
(2)
Underutilization of contextual information. Feature extraction is confined to local regions, without explicitly modeling global contextual relationships across layers (e.g., cross-region dependencies within an image).
State-Space Models (SSMs) [12] with efficient hardware-aware designs have demonstrated significant potential in long-sequence modeling. As a novel approach that integrates classical signal modeling with deep learning architectures, SSMs are capable of efficiently capturing sequential dependencies, thereby improving both accuracy and speed in image recognition tasks. Building on this idea, Liu et al. [13] employed a bidirectional SSM for global modeling, introducing efficient selective scanning and gating mechanisms to extract global image features, and further proposed the Vision Mamba network, which has been successfully applied to image classification, object detection, and semantic segmentation. Existing studies have shown that the Spatial Attention Module (SAM) [14] assigns importance weights along the spatial dimension of feature maps. It enables the model to dynamically adjust its focus on specific regions across different positions, enhancing attention to key areas such as edges, corners, and camera-specific patterns, while effectively reducing the influence of background noise.
Similarly, the Channel Attention Module (CAM) [14] focuses on assigning importance to different channels in the feature maps of a neural network. By automatically learning the significance of each channel, CAM strengthens those channels that are most useful for source camera identification and suppresses redundant or irrelevant features. In particular, when dealing with variations in image characteristics such as color, brightness, and noise, CAM helps the model extract channel-specific features that are most beneficial for recognition.
Inspired by this, we attempted to apply the Vision Mamba network to the task of camera source identification under few-shot conditions. However, its performance proved unsatisfactory. A likely reason is that differences in camera-specific imaging characteristics (e.g., color distributions, noise patterns) were not adequately accounted for by the bidirectional SSM, resulting in limited ability to distinguish between images captured by different devices. Furthermore, the bidirectional SSM lacks deep adaptability to camera-specific features, which hinders its effectiveness when processing images from diverse sources. To address the problem of low recognition accuracy in few-shot camera source identification, we integrate the bidirectional SSM Block, spatial attention module, and channel attention module into a unified HBSA Block. Multiple HBSA Blocks are then serially connected to form an HBSA layer. This design not only facilitates more effective information propagation within each HBSA Block but also enables progressive information transmission across blocks, thereby extracting richer hierarchical noise features from images. To further qualitatively verify the ability of the proposed HBSA network to preserve key image features, comparative feature heatmap visualizations are provided in Appendix A (Figure A1).
Finally, the HBSA layers are combined with convolutional layers, pooling layers, fully connected layers, and a SoftMax layer to construct the complete HBSA network, which enhances global modeling capability while reducing dependence on large sample sizes. As illustrated in Figure 2, the HBSA network is composed of a 7 × 7 convolutional layer, a max-pooling layer, HBSA layers, an average-pooling layer, a fully connected layer, and a SoftMax layer. The input to the HBSA network is the augmented training dataset. Through training, the network produces a classification model, which is then applied to the test set to generate the final classification results. The 7 × 7 convolutional layer is designed to capture large-scale features within the image, thereby enhancing the network’s perceptual ability. The max-pooling layer downsamples the feature maps, reducing their spatial dimensions to lower computational cost and mitigate overfitting, while retaining the most salient image features. At the core of the HBSA network lies the HBSA layer, which is composed of n stacked HBSA Blocks connected in series. Its primary role is to progressively downsample the feature maps and expand their channel dimensions, enabling the modeling of long-range dependencies and enhancing the network’s global representation capability. Combining HBSA layers with convolutional layers is intended to fully exploit their complementary strengths in feature modeling. Convolutional layers possess strong local perception capability and can efficiently extract low-level features such as edges, textures, and local noise patterns, providing stable and reliable feature representations for subsequent processing. However, due to their inherently limited receptive fields, conventional convolution operations are insufficient for capturing long-range dependencies and global contextual information. 
The HBSA layers address this limitation by introducing spatial attention, channel attention, and bidirectional SSM modules, which significantly enhance the network’s ability to model global dependencies across spatial regions and feature channels. In this hierarchical design, convolutional layers are responsible for extracting local and mid-level features, while HBSA layers further perform global modeling and feature reweighting, enabling the network to progressively transition from low-level texture representations to high-level semantic and global statistical features. Moreover, convolutional layers offer high computational efficiency and training stability, which help reduce the modeling complexity of HBSA layers and facilitate network convergence. Through this combination, the network achieves a well-balanced trade-off between local detail preservation and global information modeling, making it particularly suitable for complex and data-limited SCI tasks.
The HBSA Blocks are arranged in a serial configuration to enable hierarchical feature extraction, where each block builds upon the refined representations of the preceding one. This design allows the network to progressively abstract information from low-level textures to high-level semantics. By repeatedly applying HBSA Blocks, the model can reinforce important signals and suppress redundant noise across multiple scales and attention dimensions, thereby achieving richer multi-scale semantic representations.
Moreover, the stacked arrangement of HBSA Blocks gradually enlarges the effective receptive field as network depth increases. This expansion facilitates the modeling of long-range contextual dependencies, enabling more comprehensive representation of large objects and spatial relationships over extended distances. As illustrated in Figure 3, the designed HBSA Block consists of two 3 × 3 convolutional layers, a feature fusion layer, a ReLU activation function, and batch normalization layers. At its core lies the feature fusion layer, which will be described in detail below.
The feature fusion layer integrates a spatial attention module, a channel attention module, and a bidirectional SSM Block in a parallel-additive manner, producing high-dimensional feature maps as its output (see Figure 3). The spatial–channel attention branch is constructed by serially combining the channel attention and spatial attention modules: the input feature map is first refined by the channel attention module and subsequently modulated by the spatial attention module, producing an attention-enhanced representation. These two modules guide the model to focus on informative channels and regions, respectively, while suppressing irrelevant or redundant features, endowing the HBSA Block with a larger receptive field than conventional convolutional blocks. In parallel, the same input feature map is processed by the Bidirectional SSM Block to model long-range dependencies along both forward and backward directions; the outputs of the forward and backward SSM branches are fused by element-wise summation, yielding a bidirectionally contextualized representation. Finally, the attention-enhanced features and the bidirectional SSM features are combined through element-wise addition without introducing additional learnable weighting coefficients. This unweighted parallel fusion maintains training stability and avoids excessive parameter growth, while allowing the complementary strengths of local attention-based feature selection and global sequential modeling to be fully exploited. Such a design ensures consistent information flow across branches and enhances the robustness and global representation capability of the HBSA Block.
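A minimal sketch of the parallel-additive fusion follows. Since the paper does not spell out the attention internals here, SE-style channel gating and a CBAM-style spatial map are used as common stand-ins, and `nn.Identity()` stands in for the bidirectional SSM branch; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel gating (a common stand-in)."""
    def __init__(self, c, r=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(), nn.Linear(c // r, c))

    def forward(self, x):
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))))  # global pooling -> gate
        return x * w[:, :, None, None]

class SpatialAttention(nn.Module):
    """CBAM-style 7x7 conv over channel-pooled maps."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class FeatureFusion(nn.Module):
    """Parallel-additive fusion: a serial (channel -> spatial) attention
    branch plus a bidirectional-SSM branch, summed without learnable weights."""
    def __init__(self, c, bi_ssm):
        super().__init__()
        self.ca, self.sa, self.bi_ssm = ChannelAttention(c), SpatialAttention(), bi_ssm

    def forward(self, x):
        attn = self.sa(self.ca(x))   # channel attention, then spatial attention
        context = self.bi_ssm(x)     # global bidirectional sequence modeling
        return attn + context        # unweighted element-wise addition

fuse = FeatureFusion(32, bi_ssm=nn.Identity())  # Identity stands in for the BiSSM branch
y = fuse(torch.randn(2, 32, 16, 16))
```

The unweighted sum keeps the branch outputs at the same scale and adds no extra parameters, mirroring the stability argument above.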
The Bidirectional SSM Block, illustrated in Figure 3, operates as follows. For each channel feature map, a linear layer first performs spatial reshaping, converting the 2D feature map into a 1D sequence. On one branch, the 1D sequence is passed through a 1D convolution, followed by an SSM Block, to generate a new sequence, referred to as the forward sequence. On the other branch, the projected 1D sequence is reversed, processed by a 1D convolution, and then passed through an SSM Block to yield the backward sequence (the implementation details of the SSM Block can be found in [12]). Finally, the forward and backward sequences are fused through parallel addition, and a feature map reconstruction operation restores the result to the original 2D feature map, completing the output of the bidirectional SSM Block. In the Bidirectional SSM Block, the local dependency modeling preceding the state-space propagation is implemented by a depthwise one-dimensional convolution. Specifically, after spatial reshaping, the projected feature sequence is processed by a Conv1D layer with kernel size d_conv = 4, stride set to 1, and dilation set to 1. To preserve the original sequence length, a padding of d_conv − 1 is applied, followed by truncation of the extra positions introduced by padding. Moreover, grouped convolution is employed with the number of groups equal to the number of channels, ensuring that each channel is convolved independently. This design allows the model to capture short-range contextual patterns within each channel without introducing inter-channel interference, which is particularly important for preserving camera-specific noise characteristics. The convolution output is then passed through a SiLU activation function and fed into the subsequent state-space model, enabling a smooth integration of local pattern enhancement and long-range dependency modeling.
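The reshaping, depthwise convolution, and bidirectional fusion steps can be sketched as follows. The selective SSM of [12] is abstracted behind an `ssm` argument (defaulting to `nn.Identity()`), and because PyTorch's `Conv1d` pads both ends, the output is truncated back to the original sequence length, as described above; this is a sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BiSSMBlock(nn.Module):
    """Bidirectional SSM block sketch. `ssm` is any sequence model mapping
    (B, C, L) -> (B, C, L); nn.Identity() stands in for the selective SSM
    of [12]."""
    def __init__(self, channels, d_conv=4, ssm=None):
        super().__init__()
        self.proj = nn.Linear(channels, channels)  # linear projection around reshaping
        # Depthwise Conv1d: kernel 4, stride 1, dilation 1, groups == channels,
        # padding d_conv - 1 with truncation to keep the sequence length.
        self.conv_f = nn.Conv1d(channels, channels, d_conv, padding=d_conv - 1, groups=channels)
        self.conv_b = nn.Conv1d(channels, channels, d_conv, padding=d_conv - 1, groups=channels)
        self.act = nn.SiLU()
        self.ssm_f = ssm or nn.Identity()
        self.ssm_b = ssm or nn.Identity()

    def forward(self, x):                           # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2)                          # spatial reshape -> (B, C, L)
        seq = self.proj(seq.transpose(1, 2)).transpose(1, 2)
        L = seq.size(-1)
        # Forward branch: conv (truncated to length L) -> SiLU -> SSM.
        fwd = self.ssm_f(self.act(self.conv_f(seq)[..., :L]))
        # Backward branch: reverse, conv, SiLU, SSM, then re-align.
        rev = torch.flip(seq, dims=[-1])
        bwd = torch.flip(self.ssm_b(self.act(self.conv_b(rev)[..., :L])), dims=[-1])
        return (fwd + bwd).view(b, c, h, w)         # parallel addition + 2D reconstruction

blk = BiSSMBlock(channels=16)
out = blk(torch.randn(2, 16, 8, 8))
```

Reversing the backward branch's output before summation ensures that both branches are aligned position-by-position when fused.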
This approach transforms spatial information into sequential representations, thereby establishing spatial ordering relationships among pixels while preserving local pattern features. The forward sequence encodes causal dependencies from past to future, whereas the backward sequence captures dependencies from future to past. By summing the two, the fused sequence ensures that each pixel position gains access to the full global context of its channel feature map.
A reconstruction operation is then applied to the fused sequence of each channel, restoring it to the original 2D feature map size and producing the output of the bidirectional SSM Block. In this manner, the block achieves long-range dependency modeling and global information flow by integrating both intra-block and inter-block information within the image sequence. The serial connection of the Spatial attention module and the Channel attention module enables localized enhancement or suppression of image features across both spatial and channel dimensions. This design directs the network’s focus toward salient regions and fine-grained details, thereby improving recognition accuracy.
Meanwhile, placing the Spatial–channel attention branch in parallel with the Bidirectional SSM Block allows the model to remain sensitive to information at different scales. This parallel design enhances the overall robustness of the framework, while also helping to control parameter growth and maintain training stability. It is also possible to integrate the spatial–channel attention and the bidirectional SSM Block in a serial configuration. However, in such a design, the spatial and channel attention modules enhance image features in a point-to-point manner, while the bidirectional SSM Block performs sequential propagation modeling. The inconsistency in information flow between these two mechanisms may lead to difficulties in gradient propagation or misaligned optimization, thereby limiting the effectiveness of global modeling.
Within its feature fusion layer, the HBSA Block leverages the complementary strengths of the spatial attention module, channel attention module, and bidirectional SSM Block. By doing so, it addresses two major limitations of conventional convolutional neural networks: insufficient attention to critical image information and inadequate adaptability to image depth. This integration enhances the network’s global modeling capability, enabling it to extract key features from images more effectively.

4. Experiments and Analysis

4.1. Datasets and Experimental Setup

The experiments were conducted on two benchmark datasets: the Dresden camera dataset [15] and the VISION mobile phone dataset [16]. To facilitate comparison with existing methods, 14 different camera models from various brands were selected from the Dresden dataset, comprising a total of 2791 image samples. Similarly, 11 different smartphone models from various brands were selected from the VISION dataset, containing 2163 image samples. Table 1 and Table 2 summarize the relevant information for each dataset, including camera brand, model, number of samples, and image resolution.
For each dataset, 1, 5, and 10 images per category were randomly selected as the few-shot training samples, while all remaining images were used as the testing set. These experimental configurations are referred to as 1-shot, 5-shot, and 10-shot, respectively. Since the images in these datasets have varying resolutions, we cropped a 224 × 224 central region from each image. The primary motivation for this choice is that camera lenses often introduce distortion during image capture, and the degree of distortion varies across different regions of the image. By consistently selecting the central region, we reduce the influence of distortion and other uncertainties on the experimental results. Experimental Setup: The experiments were conducted using Python 3.9.19 and the PyTorch 2.4.1 deep learning framework. A single RTX 4090D GPU (Manufacturer: NVIDIA, Country: United States, City: Santa Clara) was employed for training. All models were trained with the Adam optimizer, with a learning rate of 0.001 and a total of 100 epochs. For performance evaluation, we adopted Accuracy (ACC) as the metric, which is computed as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
Here, True Positives (TP) represent the number of samples correctly predicted as positive, while True Negatives (TN) denote the number of samples correctly predicted as negative. False Positives (FP) correspond to the number of samples incorrectly predicted as positive, and False Negatives (FN) indicate the number of samples incorrectly predicted as negative. Accuracy is adopted as the evaluation measure because it provides a clear and intuitive assessment of the model’s overall classification performance by directly reflecting the proportion of correctly classified samples. In the source camera identification task considered in this work, the classes are balanced and each test sample belongs to a single, well-defined category, making accuracy an appropriate and reliable metric. Moreover, accuracy allows for straightforward comparison with existing methods commonly reported in the literature, facilitating fair and consistent performance evaluation. Under few-shot learning settings, accuracy also effectively reflects the model’s generalization ability on unseen data, which is a primary objective of this study.
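For a balanced multi-class task such as this one, the formula above reduces to the fraction of correctly classified samples, which can be computed directly:

```python
def accuracy(y_true, y_pred):
    """ACC = (TP + TN) / (TP + TN + FP + FN); for multi-class source camera
    identification this is simply the fraction of correct predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy example with hypothetical labels: 4 of 5 predictions are correct.
acc = accuracy([0, 1, 2, 2, 1], [0, 1, 2, 0, 1])
```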

4.2. Hyperparameter Selection Experiments

The HBSA layer consists of n HBSA Blocks. The parameter n has a significant impact on the performance of the HBSA layer. If n is too small, the HBSA layer may fail to adequately extract high-order features from the images. This issue becomes more pronounced in complex datasets, where the lack of sufficient hierarchical feature representations can weaken the model’s expressive capability and reduce classification accuracy. Conversely, when n is too large, the HBSA layer tends to memorize noise present in the training data, leading to overfitting and reduced generalization ability of the network.
To determine an appropriate value for n, we conducted a series of experiments in which n was set to 1, 2, 3, 4, and 5, respectively. The experimental results on the Dresden and VISION datasets are presented in Table 3 and Table 4, respectively. As shown in Table 3, the accuracy consistently reaches its highest value when n = 3, regardless of the experimental setting. In particular, under the 10-shot configuration, the accuracy exceeds 92%. Similarly, in Table 4, the best performance is also achieved when n = 3, with the accuracy surpassing 91% under the 10-shot setting. These results indicate that, across different datasets and experimental conditions, setting n = 3 yields the best performance. Therefore, the number of HBSA Blocks in the HBSA layer is fixed to 3 in the subsequent experiments.
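The selection of n amounts to a small grid search over candidate depths. In the sketch below, `train_and_eval` is a placeholder for a full training-and-evaluation run, and the mock scores are illustrative values, not the paper's measured accuracies.

```python
def select_num_blocks(candidates, train_and_eval):
    """Grid search over the number of HBSA Blocks n: train a model for
    each candidate and keep the one with the highest accuracy.
    `train_and_eval(n)` is a user-supplied callable returning ACC."""
    scores = {n: train_and_eval(n) for n in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Illustrative stand-in for real training runs (hypothetical numbers):
mock_acc = {1: 0.85, 2: 0.90, 3: 0.92, 4: 0.91, 5: 0.89}
best_n, scores = select_num_blocks([1, 2, 3, 4, 5], mock_acc.get)
```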
In Algorithm 1, the threshold T has a significant impact on the effectiveness of pseudo-label sample expansion. As T increases, the confidence requirement for selecting pseudo-labeled samples becomes stricter, resulting in fewer high-confidence pseudo-labeled samples being generated. Consequently, the number of samples participating in model training decreases, which may lead to overfitting and a decline in recognition accuracy. Conversely, if T is set too low, although a sufficient number of pseudo-labeled samples can be produced, the proportion of incorrect pseudo-labels increases. The inclusion of these erroneous pseudo-labeled samples in the expanded training set can also degrade the model's classification performance.
Therefore, it is assumed that there exists an optimal value of the parameter T at which the model achieves the highest classification accuracy. To ensure strong classification performance on the test set when training on the expanded few-shot dataset, experiments were conducted to determine the optimal value of T. To reduce computational cost and training time, the parameter k in Algorithm 1 was fixed to 100. Using different values of T, pseudo-labeled samples were generated through Algorithm 1, resulting in corresponding pseudo-labeled sample sets. The HBSA network was then trained on these pseudo-labeled datasets to obtain the classification model, and the classification accuracy was evaluated using the test set.
In the 1-shot, 5-shot, and 10-shot experimental configurations, the threshold T was varied within the range of −0.5 to 0.5. The classification accuracies obtained on both datasets under different values of T are illustrated in Figure 4. Specifically, Figure 4 presents the T–ACC curves for the VISION dataset and the Dresden dataset. In Figure 4a,b, the black curve represents the classification accuracy of the model under the 1-shot setting, the red curve corresponds to the 5-shot setting, and the blue curve indicates the 10-shot setting. From Figure 4a, it can be observed that, for the VISION dataset, the recognition accuracy reaches its peak when T = 0.1 across all three configurations: 75.48%, 89.75%, and 91.24% for the 1-shot, 5-shot, and 10-shot settings, respectively. Similarly, in the Dresden dataset (see Figure 4b), the highest accuracies are also achieved at T = 0.1, with values of 68.37%, 88.47%, and 92.73%, respectively. Therefore, it can be concluded that the classification accuracy curve attains its maximum at T = 0.1. When T is smaller, a larger number of pseudo-labeled samples are selected; however, this also increases the likelihood of introducing incorrect pseudo-labels, which deteriorates the model's classification accuracy. Conversely, when T is larger, fewer pseudo-labeled samples are retained, but with higher confidence and label accuracy. Nevertheless, an insufficient number of pseudo-labeled samples cannot effectively alleviate the model's overfitting problem. In summary, by balancing the trade-off between the size of the expanded pseudo-labeled dataset and the classification accuracy, the optimal threshold parameter is set to T = 0.1 for subsequent experiments.
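Algorithm 1 itself is not reproduced in this section, but the role of the threshold T can be illustrated with a generic sketch of correlation-based pseudo-label selection: an unlabeled noise residual is accepted only if its normalized cross-correlation with the best-matching reference fingerprint exceeds T. The function names, shapes, and synthetic data are assumptions for illustration.

```python
import numpy as np

def select_pseudo_labels(residuals, fingerprints, T=0.1):
    """Accept an unlabeled residual for class c only if its normalized
    cross-correlation with fingerprint c is the maximum over all classes
    and exceeds the threshold T. A generic sketch, not Algorithm 1 verbatim."""
    def ncc(a, b):
        a, b = a - a.mean(), b - b.mean()
        return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    accepted = []
    for i, r in enumerate(residuals):
        corrs = [ncc(r, f) for f in fingerprints]
        c = int(np.argmax(corrs))
        if corrs[c] > T:
            accepted.append((i, c))   # (sample index, pseudo-label)
    return accepted

rng = np.random.default_rng(0)
fps = [rng.standard_normal((32, 32)) for _ in range(3)]   # toy fingerprints
# Unlabeled residuals: fingerprint plus noise, so correlations stay above T.
res = [fps[c] + 0.5 * rng.standard_normal((32, 32)) for c in (0, 1, 2)]
pairs = select_pseudo_labels(res, fps, T=0.1)
```

Raising T shrinks the accepted set (higher label purity, fewer samples); lowering it admits more samples at the cost of more label noise, which is exactly the trade-off tuned above.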

4.3. Comparative Experiments

To evaluate the effectiveness of the proposed method, we compared it against five classical approaches, which are summarized as follows:
(1)
MTD-EM [7]: An ensemble learning method based on Mega-Trend Diffusion. In this approach, virtual samples generated for dataset expansion are combined with the few-shot dataset to train multiple SVM classifiers, whose outputs are then ensembled for source camera classification.
(2)
Multi-PCEP [8]: A prototype construction framework based on ensemble projection. Semi-supervised learning is first employed to construct prototype sets. These prototypes are then used to retrain SVM classifiers, where the posterior probability of each image sample belonging to each class is taken as the final projection vector. The classification results are obtained through ensemble voting.
(3)
Multi-DS [9]: An ensemble learning method based on multi-distance metrics. This approach employs an SVM self-correction mechanism to iteratively refine pseudo-labels for the augmented dataset. The expanded dataset is then used to train the classification model for camera source identification.
(4)
MDM-CPS [10]: A method combining multiple distance metrics with coordinate-based pseudo-label selection. By incorporating collaborative attention Blocks, this method filters pseudo-labeled samples to expand few-shot datasets, which are then used to train classification models for source camera recognition.
(5)
Vision Mamba (Vim) [13]: A recent sequence-based vision model. Images are flattened into sequences and globally modeled using state-space models (SSMs). With efficient selective scanning and gating mechanisms, Vim captures global image features and has been successfully applied to image classification, object detection, and semantic segmentation tasks.
The comparative results under the 5-shot and 10-shot settings are presented in Table 5 and Table 6, respectively. The performance of the baseline methods was obtained from the reported results in the corresponding literature. As shown in Table 5, on the Dresden and VISION datasets, the proposed HBSA method achieved accuracies of 88.47% and 89.75%, respectively. On the Dresden dataset, the HBSA method performed nearly on par with the current best method, MDM-CPS, with only a 0.02% decrease in accuracy, while still outperforming the remaining four methods. On the VISION dataset, HBSA outperformed all other methods in the 5-shot classification task, reaching an accuracy of 89.75%. Notably, compared with MDM-CPS, the current state-of-the-art, HBSA improved classification accuracy on the VISION dataset by at least 6% in the 5-shot setting. As shown in Table 6, under the 10-shot setting, the proposed HBSA method achieved classification accuracies of 92.73% on the Dresden dataset and 91.24% on the VISION dataset, both surpassing the five baseline methods. Compared with MDM-CPS, the current best-performing approach, HBSA achieved improvements of at least 0.3% on Dresden and 3.5% on VISION.
The primary reason for the superior performance lies in our proposed few-shot data augmentation strategy, which employs the cross-correlation coefficient of PRNU as the basis for expansion. Since the pseudo-labeled data are derived from real captured images, they better reflect the intrinsic imaging characteristics of actual devices. In contrast, methods based on virtual sample generation are only weakly associated with device-specific fingerprint features, which may lead to inconsistencies between the generated samples and the true target distribution of the task. Moreover, synthetically generated images typically lack authentic sensor noise patterns and can only simulate low-dimensional structural information, without modeling the physical properties of the device. Compared with traditional SVM classifiers and conventional neural networks, the proposed HBSA network further strengthens performance through its architectural design. Within each HBSA Block, the bidirectional SSM branches perform forward and backward sequence modeling, enabling the capture of long-range dependencies in images. At the same time, the spatial and channel attention modules adaptively adjust the network’s focus across spatial and channel dimensions, allowing deeper adaptation to camera-specific features. This combination enhances the network’s global modeling capacity and enables more effective extraction of critical features from the input data.

4.4. Ablation Experiments

To evaluate the effectiveness of each component within the feature fusion layer of the HBSA Block, we conducted ablation experiments on its functional modules. In the ablation study, we evaluated three network variants of the HBSA Block:
HBSA_none: the HBSA Block with the entire feature fusion layer removed.
HBSA_BiSSM: the HBSA Block with the spatial and channel attention modules removed from the feature fusion layer, retaining only the bidirectional SSM branch.
HBSA_SC: the HBSA Block with the bidirectional SSM branch removed, retaining only the spatial–channel attention modules.
Using the proposed pseudo-label augmentation strategy, the few-shot datasets were expanded into augmented training sets. These were then used to train classification models with the three HBSA variants and the full HBSA network, respectively. The resulting models were tested on the Dresden and VISION datasets, and classification accuracies were recorded. The results are presented in Table 7 and Table 8, where the full HBSA network achieves the highest accuracy among the four configurations on both datasets. This superior performance can be attributed to the design of the feature fusion layer. The bidirectional SSM Block explicitly models long-range dependencies in pixel sequences of camera images, enabling global perception of long-distance information. This reduces the attenuation of small-object features during inter-layer propagation and allows the network to extract richer and more accurate representations for source camera identification.
Meanwhile, the spatial and channel attention mechanisms strengthen the network’s ability to adapt to camera-specific features by directing attention toward critical regions (e.g., edges, corners, or device-specific patterns). These modules complement the bidirectional SSM Block by compensating for its limited adaptability to camera datasets, thereby improving both robustness and generalization.
By stacking multiple HBSA Blocks in series, the HBSA network achieves hierarchical feature extraction, progressively capturing broader contextual dependencies and providing more comprehensive modeling of long-range spatial relationships.

4.5. Anti-Image Processing Experiments

In robust watermarking and source camera identification tasks, images are often subjected to various signal-level and geometric degradations during transmission, storage, and platform processing, which poses significant challenges to feature stability. Previous studies have emphasized the importance of learning robust feature representations that are insensitive to local distortions and appearance variations while preserving task-relevant discriminative information, as discussed in [17]. Moreover, in the context of robust watermarking, works such as [18] highlight that resistance to geometric and signal-level attacks relies on extracting and maintaining information from feature domains that are invariant or weakly sensitive to common image processing operations. Motivated by these insights, we conduct a series of anti-image processing experiments to systematically evaluate the robustness of the proposed CGAP-HBSA method under Gaussian noise, JPEG compression, and Gaussian blur, thereby assessing its performance in complex real-world scenarios.
During the collection of unlabeled image samples, various processing operations—such as image compression, blurring, or the addition of noise—may occur due to factors like user editing, automatic platform handling, or network transmission. These operations can degrade image quality and potentially lead to recognition errors. Therefore, to evaluate the robustness of the proposed method, we conducted three sets of performance tests focusing on resistance to image compression, blurring, and noise. By simulating these scenarios, we can more comprehensively assess the recognition performance of the HBSA network under complex real-world conditions.
During image transmission, storage, or acquisition, unlabeled image samples are often contaminated by varying levels of noise due to limitations in device performance or environmental conditions. Among various types of noise, Gaussian noise—also known as Gaussian white noise—is one of the most common. It frequently occurs in different stages of image acquisition and transmission. The intensity of Gaussian noise is determined by its variance V; a larger variance indicates more severe noise interference and greater degradation of image quality. Typically, its mean value μ is set to 0, as Gaussian noise is random and symmetrically distributed, conforming to the statistical property of zero mean.
To evaluate the robustness of the proposed method, we constructed unlabeled image datasets affected by Gaussian noise of different intensities. Specifically, for each image in the dataset, additive Gaussian noise with a mean of μ = 0 and varying variances V was added to generate corresponding low-quality image datasets. Subsequently, both the proposed CGAP-HBSA method and the comparison method VIM were trained and tested on the Dresden dataset for 1-shot, 5-shot, and 10-shot image classification tasks, with performance measured in terms of ACC (accuracy).
In the experimental setup, the noise variance V was set to 3, 5, 7, and 9, corresponding to mild, moderate, severe, and extremely severe noise interference conditions, respectively. In addition, images with V = 0 (i.e., noise-free) were included as a control group for comparative evaluation.
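The noise degradation can be reproduced as follows; treating V as a variance in 8-bit pixel-value units is an assumption of this sketch, since the paper does not state the scale explicitly.

```python
import numpy as np

def add_gaussian_noise(img, variance, mean=0.0, seed=None):
    """Add zero-mean additive Gaussian noise of the given variance V to a
    uint8 image, clipping back to the valid [0, 255] range."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(mean, np.sqrt(variance), size=img.shape)
    return np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)

# V = 0 is the noise-free control; 3/5/7/9 span mild to extremely severe.
clean = np.full((224, 224, 3), 128, dtype=np.uint8)
degraded = {V: add_gaussian_noise(clean, V, seed=0) for V in (3, 5, 7, 9)}
```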
As shown in Figure 5a, the dashed lines represent the classification accuracies of the CGAP-HBSA method under different Gaussian noise conditions for the 1-shot, 5-shot, and 10-shot classification tasks, while the solid lines denote the corresponding results obtained using the VIM method. With the increase in variance V, the recognition accuracies of both CGAP-HBSA and VIM gradually decline. Even under extremely severe Gaussian noise conditions, the proposed CGAP-HBSA method maintains relatively high classification accuracies of at least 60%, 78%, and 84% for the 1-shot, 5-shot, and 10-shot tasks, respectively. In contrast, the performance of the VIM method degrades more significantly under high-noise environments, with accuracies dropping from 63%, 82%, and 86% to 52%, 70%, and 78%, respectively, under the same extreme noise level. These results demonstrate that CGAP-HBSA is less affected by Gaussian noise interference than VIM, exhibiting stronger robustness and higher resistance to noise-induced degradation.
JPEG is a widely used lossy image compression standard, and its compression level is controlled by the Quality Factor (QF), which ranges from 1 to 100. A lower QF value corresponds to higher compression and poorer image quality. In practical applications, unlabeled test samples are often transmitted in compressed JPEG format; therefore, it is necessary to consider the impact of JPEG compression on the performance of image recognition tasks.
To simulate the degradation caused by compression during network transmission, each image in the dataset was compressed with different QF values to generate multiple datasets representing various image quality levels. Subsequently, the proposed CGAP-HBSA method and the VIM method (which incorporates a pseudo-label expansion strategy) were trained and evaluated on the Dresden dataset under 1-shot, 5-shot, and 10-shot image classification tasks. The classification accuracy (ACC) was used as the evaluation metric.
By comparing the ACC performance of CGAP-HBSA with that of the VIM method under different compression conditions, we verified the robustness of the proposed approach against JPEG compression degradation. In the experiments, the QF values were set to 100, 95, 90, 85, and 80, corresponding to image quality levels ranging from no compression to increasing degrees of compression. Among them, QF = 100 indicates uncompressed images, which serve as the control group for comparison.
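The JPEG degradation can be simulated by round-tripping each image through an in-memory JPEG encode/decode at the chosen quality factor, for example with Pillow (a plausible implementation; the paper does not name its codec):

```python
import io

import numpy as np
from PIL import Image

def jpeg_compress(img_array, qf):
    """Round-trip a uint8 RGB image through JPEG at quality factor qf
    (1-100), simulating transmission-time lossy compression."""
    buf = io.BytesIO()
    Image.fromarray(img_array).save(buf, format="JPEG", quality=qf)
    buf.seek(0)
    return np.array(Image.open(buf))

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
# QF = 100 is the (nearly) uncompressed control group.
variants = {qf: jpeg_compress(img, qf) for qf in (100, 95, 90, 85, 80)}
```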
As shown in Figure 5b, the dashed lines represent the classification accuracies of the CGAP-HBSA method under different Quality Factor (QF) settings for the 1-shot, 5-shot, and 10-shot classification tasks, while the solid lines denote the corresponding results of the VIM method. As the QF value decreases, the ACC of both methods declines significantly.
However, the CGAP-HBSA curves remain consistently higher than those of VIM, and the rate of decline is notably smaller. Even when QF = 80, the CGAP-HBSA method still achieves classification accuracies of approximately 51%, 74%, and 88% for the 1-shot, 5-shot, and 10-shot tasks, respectively. Overall, these results demonstrate that the proposed CGAP-HBSA method exhibits superior robustness against compression-induced degradation, maintaining strong performance under varying JPEG compression levels.
When images are transmitted through online social network platforms, edge details and fine structures are often degraded, resulting in noticeable blurring. This type of degradation can be effectively simulated using Gaussian blur. Mathematically, the Gaussian blurring process can be regarded as a convolution between the image and a normal (Gaussian) distribution, characterized by two parameters: the kernel radius K and the standard deviation σ of the Gaussian distribution. The kernel radius K determines the range and intensity of the blur—the larger the kernel, the stronger the blurring effect.
To simulate the blurring degradation that unlabeled samples may experience during network transmission, each image in the unlabeled dataset was processed with Gaussian blur using different kernel radii K, thereby generating low-quality blurred image datasets. The proposed CGAP-HBSA method and the VIM method (enhanced with pseudo-label augmentation) were then trained and evaluated on the Dresden dataset under 1-shot, 5-shot, and 10-shot image classification tasks, with ACC (classification accuracy) as the evaluation metric.
In the experiments, the kernel radius K was set to 3, 5, and 7, corresponding to mild, moderate, and severe blur levels, respectively. The case of K = 0 represents the performance on images without any blurring, serving as the control group.
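The blur degradation can be simulated with Pillow's `GaussianBlur` filter; using its `radius` parameter as the kernel radius K is an approximation of the paper's setup, not its confirmed implementation.

```python
import numpy as np
from PIL import Image, ImageFilter

def gaussian_blur(img_array, radius):
    """Apply Gaussian blur of the given radius to a uint8 RGB image;
    Pillow's `radius` stands in for the kernel radius K here."""
    img = Image.fromarray(img_array).filter(ImageFilter.GaussianBlur(radius=radius))
    return np.array(img)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
# K = 3 / 5 / 7 correspond to mild, moderate, and severe blur.
blurred = {K: gaussian_blur(img, K) for K in (3, 5, 7)}
```

Blurring a noisy image smooths pixel-to-pixel variation, so the per-image variance drops sharply as K grows, matching the "stronger blurring" interpretation above.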
As shown in Figure 5c, the dashed lines represent the classification accuracies of the CGAP-HBSA method under different Gaussian blur conditions for the 1-shot, 5-shot, and 10-shot classification tasks, while the solid lines denote the corresponding results obtained by the VIM method. As the kernel radius K increases, the recognition accuracies of both methods gradually decline.
Even under severe blurring conditions, the CGAP-HBSA method maintains classification accuracies of approximately 58%, 78%, and 84% for the 1-shot, 5-shot, and 10-shot tasks, respectively. Compared with VIM, CGAP-HBSA is less affected by Gaussian blur, demonstrating a stronger resistance to blur-induced degradation. In contrast, the VIM method experiences a more pronounced performance drop, with accuracies decreasing from 63%, 82%, and 86% to 52%, 70%, and 78% for the 1-shot, 5-shot, and 10-shot tasks under severe Gaussian blur conditions.
Overall, the CGAP-HBSA method exhibits superior robustness to Gaussian blurring across different levels of blur, consistently outperforming the VIM method under the same experimental settings.

5. Conclusions

In this paper, we addressed the challenge of camera source identification under few-shot conditions by proposing the CGAP-HBSA framework. The introduction of a cross-correlation–based pseudo-label augmentation method effectively alleviates the problem of insufficient training data in few-shot scenarios. Meanwhile, the designed HBSA network, incorporating bidirectional SSM structures and hybrid attention mechanisms, enhances the model's ability to capture camera fingerprint information and improves its adaptability to complex and noisy imaging environments. Extensive experiments conducted on the Dresden and VISION datasets validate the effectiveness of the proposed approach.
Comparative experiments demonstrate that our method achieves clear advantages in recognition accuracy over existing classical methods. On the Dresden dataset, the proposed approach achieved an accuracy only 0.02% lower than the best-performing MDM-CPS method in the 5-shot task, while consistently outperforming other baselines. In the 10-shot task, our method surpassed MDM-CPS by at least 0.3%. On the VISION dataset, HBSA improved accuracy by at least 6% in the 5-shot task and by at least 3% in the 10-shot task, compared with MDM-CPS. The results of the ablation studies further confirm the critical role of each component within the HBSA network.

Author Contributions

Methodology, Y.H.; Validation, Y.H.; Resources, Z.W.; Data curation, Y.H., A.C. and L.W.; Writing—original draft, Y.H.; Writing—review and editing, Z.W.; Supervision, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hunan Province of China (Grant No. 2024JJ7149) and by the Education Department of Hunan Province of China (Grant No. 22A0414).

Data Availability Statement

The data presented in this study are openly available in [Dresden and VISION datasets] at https://www.kaggle.com/datasets/micscodes/dresden-image-database (accessed on 10 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Comparative Experiments on Key Image Features

Feature heatmaps are a visualization technique that intuitively displays which regions or features a neural network focuses on when processing an input, and how important they are. The color intensity at each location in a feature heatmap represents the activation strength or importance weight the network assigns to that location or feature.
Three sets of heatmap comparisons were conducted to verify the ability of the HBSA network to retain key image features. In each set, a random image was selected from the dataset, passed through the network to extract features, and the result was visualized as a feature heatmap.
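The heatmap construction can be sketched as follows. This is a common recipe rather than the paper's exact visualization pipeline: a (C, H, W) feature map is collapsed by the channel-wise mean of absolute activations, min-max scaled to [0, 1], and upsampled to the input resolution for overlay.

```python
import numpy as np

def activation_heatmap(feature_map):
    """Collapse a (C, H, W) feature map into a [0, 1] heatmap:
    average absolute activations over channels, then min-max scale."""
    heat = np.abs(feature_map).mean(axis=0)
    lo, hi = heat.min(), heat.max()
    return (heat - lo) / (hi - lo) if hi > lo else np.zeros_like(heat)

def upsample_nearest(heat, out_h, out_w):
    """Nearest-neighbour upsampling so the heatmap overlays the input image."""
    rows = np.arange(out_h) * heat.shape[0] // out_h
    cols = np.arange(out_w) * heat.shape[1] // out_w
    return heat[np.ix_(rows, cols)]

# Toy feature map: 4 channels on an 8x8 grid with one strongly activated cell
fmap = 0.1 * np.ones((4, 8, 8))
fmap[:, 2, 3] = 5.0
heat = activation_heatmap(fmap)
overlay = upsample_nearest(heat, 64, 64)  # scaled to a 64x64 "image"
```

The bright cell in `heat` marks where the network's activations concentrate, which is exactly what the color intensity in Figure A1 conveys.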
Three network variants were compared in each experiment. The first is the full feature extraction network proposed in this paper, denoted HBSA. The second retains only the spatial attention and channel attention modules of the proposed HBSA Block, and is referred to as CBAM. The third removes the HBSA Block from the proposed network entirely, and is denoted None. The feature heatmaps generated by these three networks are shown in Figure A1.
Figure A1a shows the three original images. The main content of Image 1 is a combination of buildings and snow-covered ground. The feature heatmap obtained from the network without the HBSA block (Figure A1b, first row) retains only vague outlines, making the image appear abstract. The feature heatmap extracted by the network with only the CBAM block (Figure A1c, first row) highlights the key features of the building more clearly, but the snow region remains unclear. In contrast, the feature heatmap extracted by the HBSA network (Figure A1d, first row) shows much clearer contours of both the building and the snow than the previous two.
Image 2 features a solitary lighthouse. In the network without the HBSA block, the feature heatmap (Figure A1b, second row) shows only a blurred outline. The network with the CBAM block (Figure A1c, second row) captures more distinct contours of the building, but details such as the clock at the base of the lighthouse still receive insufficient attention. The feature heatmap from the HBSA network (Figure A1d, second row), however, retains important details, such as the clock at the bottom of the lighthouse and the small holes below it, further validating the HBSA block’s ability to preserve critical image features.
Figure A1. Comparison of feature heatmaps extracted from three images by three different networks.
Image 3 contains a globe and a doll in the lower left corner. In the network without the HBSA block, the feature heatmap (Figure A1b, third row) only shows a blurry outline of the globe, which is quite abstract. With the CBAM block (Figure A1c, third row), the globe’s contours are more defined, but the doll in the lower left corner still does not attract sufficient attention, with its features remaining unclear. After adding the HBSA block, the feature heatmap (Figure A1d, third row) reveals clearer contours of both the globe and the doll compared to the previous two.
The results of these three sets of experiments clearly demonstrate that the HBSA network, with the HBSA block added, has superior ability in focusing on the key regions of images compared to the other two approaches. Furthermore, the HBSA network achieves better results in global feature extraction for multi-subject images.

References

  1. Akshatha, K.; Karunakar, A.; Anitha, H.; Raghavendra, U.; Shetty, D. Digital camera identification using PRNU: A feature based approach. Digit. Investig. 2016, 19, 69–77. [Google Scholar] [CrossRef]
  2. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef] [PubMed]
  3. Wu, H.; Wen, Z. Research on Adaptive Attention Dense Network in Camera Source Recognition Method. J. Hunan Univ. Technol. 2026, 40, 85–91. [Google Scholar] [CrossRef]
  4. Long, C.; Jianlin, Z.; Hao, P.; Meihui, L.; Zhiyong, X.; Yuxing, W. Few-shot image classification via multi-scale attention and domain adaptation. Opto-Electron. Eng. 2023, 50, 220232. [Google Scholar] [CrossRef]
  5. Lu, J.; Li, C.; Huang, X.; Cui, C.; Emam, M. Source Camera Identification Algorithm Based on Multi-Scale Feature Fusion. Comput. Mater. Contin. 2024, 80, 3047–3065. [Google Scholar] [CrossRef]
  6. Tan, Y.; Wang, B.; Li, M.; Guo, Y.; Kong, X.; Shi, Y. Camera Source Identification with Limited Labeled Training Set. In Proceedings of the 14th International Workshop, IWDW 2015, Tokyo, Japan, 7–10 October 2015; pp. 18–27. [Google Scholar]
  7. Wu, S.; Wang, B.; Zhao, J.; Zhao, M.; Zhong, K.; Guo, Y. Virtual sample generation and ensemble learning based image source identification with few-shot training samples. Int. J. Digit. Crime Forensics 2021, 13, 34–46. [Google Scholar] [CrossRef]
  8. Wang, B.; Yu, F.; Ma, Y.; Zhao, H.; Hou, J.; Zheng, W. Pcep: Few-shot model-based source camera identification. Mathematics 2023, 11, 803. [Google Scholar] [CrossRef]
  9. Wang, B.; Hou, J.; Ma, Y.; Wang, F.; Wei, F. Multi-DS strategy for source camera identification in few-shot sample data sets. Secur. Commun. Netw. 2022, 2022, 8716884. [Google Scholar] [CrossRef]
  10. Wang, B.; Hou, J.; Wei, F.; Yu, F.; Zheng, W. MDM-CPS: A few-shot sample approach for source camera identification. Expert Syst. Appl. 2023, 229, 120315. [Google Scholar] [CrossRef]
  11. Yoo, J.C.; Han, T.H. Fast normalized cross-correlation. Circuits Syst. Signal Process. 2009, 28, 819–843. [Google Scholar] [CrossRef]
  12. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  13. Liu, X.; Zhang, C.; Huang, F.; Xia, S.; Wang, G.; Zhang, L. Vision mamba: A comprehensive survey and taxonomy. IEEE Trans. Neural Netw. Learn. Syst. 2025, 1–21. [Google Scholar] [CrossRef] [PubMed]
  14. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar]
  15. Gloe, T.; Böhme, R. The Dresden Image Database for Benchmarking Digital Image Forensics. In Proceedings of the 2010 ACM Symposium on Applied Computing, Sierre, Switzerland, 22–26 March 2010; pp. 1584–1590. [Google Scholar] [CrossRef]
  16. Zheng, Y.-Y.; Kong, J.-L.; Jin, X.-B.; Wang, X.-Y.; Su, T.-L.; Zuo, M. CropDeep: The crop vision dataset for deep-learning-based classification and detection in precision agriculture. Sensors 2019, 19, 1058. [Google Scholar] [CrossRef] [PubMed]
  17. Liu, Y.; Wang, C.; Lu, M.; Yang, J.; Gui, J.; Zhang, S. From Simple to Complex Scenes: Learning Robust Feature Representations for Accurate Human Parsing. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5449–5462. [Google Scholar] [CrossRef] [PubMed]
  18. Wang, C.; Zhang, Q.; Wang, X.; Zhou, L.; Li, Q.; Xia, Z.; Ma, B.; Shi, Y.-Q. Light-Field Image Multiple Reversible Robust Watermarking Against Geometric Attacks. IEEE Trans. Dependable Secur. Comput. 2025, 22, 5861–5875. [Google Scholar] [CrossRef]
Figure 1. Overall Framework Diagram.
Figure 2. HBSA Network.
Figure 3. HBSA Block.
Figure 4. T-ACC Curve Plot; (a) VISION dataset; (b) Dresden dataset.
Figure 5. ACC curves of the two methods under different image-processing attacks: (a) Gaussian noise; (b) JPEG compression; (c) Gaussian blur.
Table 1. Sample Information of Dresden Dataset Used in Experiments.

No. | Camera Model | Manufacturer | Country | City | Abbreviation | Number of Samples | Size
1 | Agfa_DC-504 | Agfa | Belgium | Mortsel | A1 | 167 | 4032 × 3024
2 | Canon_PowerShotA640 | Canon | Japan | Tokyo | C1 | 188 | 3648 × 2736
3 | Casio_EX-Z150 | Casio | Japan | Tokyo | C2 | 181 | 3264 × 2448
4 | FujiFilm_FinePixJ50 | Fujifilm | Japan | Tokyo | F1 | 209 | 3264 × 2448
5 | Kodak_M1063 | Kodak | USA | Rochester | K1 | 463 | 3664 × 2748
6 | Nikon_CoolPixS710 | Nikon | Japan | Tokyo | N1 | 186 | 4352 × 3264
7 | Olympus_mju_1050SW | Olympus | Japan | Tokyo | O1 | 202 | 3648 × 2736
8 | Praktica_DCZ5.9 | Praktica | Germany | Dresden | P1 | 209 | 2560 × 1920
9 | Pentax_OptioA40 | Pentax | Japan | Tokyo | P2 | 169 | 4000 × 3000
10 | Panasonic_DMC-FZ50 | Panasonic | Japan | Osaka | P3 | 262 | 3648 × 2736
11 | Ricoh_GX100 | Ricoh | Japan | Tokyo | R1 | 192 | 3648 × 2736
12 | Rollei_RCP-7325XS | Rollei | Germany | Hamburg | R2 | 198 | 3072 × 2304
13 | Samsung_L74wide | Samsung | Republic of Korea | Seoul | S1 | 231 | 3072 × 2304
14 | Sony_DSC-H50 | Sony | Japan | Tokyo | S2 | 284 | 3456 × 2592
Table 2. Sample Information of VISION Dataset Used in Experiments.

No. | Camera Model | Manufacturer | Country | City | Abbreviation | Number of Samples | Size
1 | Apple_iPad2 | Apple | USA | Cupertino | A1 | 171 | 960 × 720
2 | Asus_Zenfone2Laser | Asus | Taiwan | Taipei | A2 | 209 | 3264 × 1836
3 | Huawei_Ascend | Huawei | China | Shenzhen | H1 | 155 | 3264 × 2448
4 | Lenovo_P70A | Lenovo | China | Beijing | L1 | 216 | 4784 × 2704
5 | LG_D290 | LG | Republic of Korea | Seoul | L2 | 227 | 3264 × 2448
6 | Microsoft_Lumia640LTE | Microsoft | USA | Redmond | M1 | 187 | 3264 × 1840
7 | OnePlus_A3000 | OnePlus | China | Shenzhen | O1 | 287 | 4640 × 3480
8 | Samsung_GalaxyS3 | Samsung | Republic of Korea | Seoul | S1 | 207 | 3264 × 2448
9 | Sony_XperiaZ1Compact | Sony | Japan | Tokyo | S2 | 215 | 5248 × 3936
10 | Wiko_Ridge4G | Wiko | France | Aix-en-Provence | W1 | 253 | 3264 × 2448
11 | Xiaomi_RedmiNote3 | Xiaomi | China | Beijing | X1 | 311 | 4608 × 2592
Table 3. Classification Results on Dresden Dataset under Different Values of n.

n | 1-Shot | 5-Shot | 10-Shot
1 | 56.24% | 72.17% | 77.28%
2 | 60.13% | 78.25% | 86.63%
3 | 68.37% | 88.47% | 92.73%
4 | 65.16% | 81.27% | 94.34%
5 | 56.54% | 78.97% | 81.25%
Table 4. Classification Results on VISION Dataset under Different Values of n.

n | 1-Shot | 5-Shot | 10-Shot
1 | 52.14% | 65.52% | 70.28%
2 | 72.75% | 78.26% | 82.94%
3 | 75.48% | 89.75% | 91.24%
4 | 67.24% | 85.47% | 89.45%
5 | 54.73% | 71.52% | 77.98%
Table 5. Experimental Test Accuracy of 6 Methods (5-shot, %).

Method | Dresden | VISION
MTD-EM [7] | 53.93 | 76.03
Multi-PCEP [8] | 77.15 | 74.94
Multi-DS [9] | 73.75 | 76.03
MDM-CPS [10] | 88.49 | 83.72
Vim [13] | 62.15 | 69.83
CGAP-HBSA (ours) | 88.47 | 89.75
Table 6. Experimental Test Accuracy of 6 Methods (10-shot, %).

Method | Dresden | VISION
MTD-EM [7] | 75.16 | 80.49
Multi-PCEP [8] | 87.06 | 84.84
Multi-DS [9] | 86.08 | 85.56
MDM-CPS [10] | 92.43 | 87.74
Vim [13] | 86.42 | 84.75
CGAP-HBSA (ours) | 92.73 | 91.24
Table 7. Ablation Experiment Results on Dresden Dataset.

Method | 1-Shot | 5-Shot | 10-Shot
HBSA_none | 63.28% | 84.35% | 86.96%
HBSA_SC | 62.52% | 86.17% | 90.80%
HBSA_BiSSM | 65.33% | 85.72% | 91.10%
HBSA | 68.37% | 88.47% | 92.73%
Table 8. Ablation Experiment Results on VISION Dataset.

Method | 1-Shot | 5-Shot | 10-Shot
HBSA_none | 66.32% | 80.84% | 86.13%
HBSA_SC | 74.87% | 82.16% | 89.28%
HBSA_BiSSM | 73.30% | 85.40% | 90.80%
HBSA | 75.48% | 89.75% | 91.24%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hu, Y.; Wen, Z.; Chen, A.; Wu, L. CGAP-HBSA: A Source Camera Identification Framework Under Few-Shot Conditions. Symmetry 2026, 18, 71. https://doi.org/10.3390/sym18010071


