In this experiment, we implemented the proposed algorithm using Python 3.10.5 on a workstation equipped with an Intel Core i5-11400F desktop processor (4.1 GHz) and 16 GB of RAM.
This section details the parameter configurations and databases employed. Section 4.1 and Section 4.2 evaluate the robustness and discriminative capability of our image hashing scheme, respectively; Section 4.3 investigates the impact of block size selection, and Section 4.4 compares our scheme with existing methods. In the following experiments, the parameters are set as follows. For saliency map processing, the image is divided into non-overlapping blocks of size 64 × 64 for subsequent feature averaging. The key parameters for feature extraction are chosen to capture both local texture and global structure: the LBP operator uses 8 sampling points with a radius of 1, while the Zernike moments are computed up to degree 8 with a radius of 5. This configuration results in a final hash length of 89 bits. The experiments are conducted on three open image databases, i.e., the Kodak standard image database [52], the USC-SIPI Image Database [53], and the UCID database [54].
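As a concrete illustration of the LBP setting above, 8 sampling points at radius 1 reduce to the classic 3 × 3-neighborhood LBP; a minimal pure-Python sketch (the helper name `lbp_code` and the toy array are ours, not the paper's implementation):

```python
def lbp_code(img, y, x):
    """8-bit LBP code of interior pixel (y, x), sampling the 8 neighbors
    at radius 1 clockwise from the top-left corner."""
    center = img[y][x]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code

toy = [[10, 20, 30],
       [40, 25, 10],
       [ 5, 60, 90]]
print(lbp_code(toy, 1, 1))  # -> 180
```

In the full scheme, such per-pixel codes are histogrammed per 64 × 64 block and combined with the Zernike-moment features before binarization.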
The Hamming distance is adopted as the similarity metric due to its suitability for comparing binary hash sequences [
3,
5]. It efficiently measures the number of differing bits, providing a direct indication of perceptual similarity. The Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC) are used to evaluate the trade-off between robustness and discriminability [
55]. These metrics are widely accepted in image hashing literature for their ability to visualize classification performance under varying thresholds.
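The similarity metric itself is straightforward; a minimal sketch with hashes represented as 0/1 strings (illustrative values, not hashes produced by our scheme):

```python
def hamming(h1, h2):
    """Number of positions at which two equal-length bit strings differ."""
    if len(h1) != len(h2):
        raise ValueError("hashes must have equal length")
    return sum(a != b for a, b in zip(h1, h2))

# two illustrative 89-bit hashes differing in the last 9 bits
a = "1" * 89
b = "1" * 80 + "0" * 9
print(hamming(a, b))  # -> 9
```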
4.1. Robustness Evaluation
To verify the robustness of the proposed algorithm, the Kodak standard image database was selected for testing. This database contains 24 color images, with representative samples presented in
Figure 2. To construct variant samples that are highly similar to the original images in terms of visual features, a series of robustness test attacks were conducted using MATLAB (R2024b, MathWorks, Inc., Natick, MA, USA), Adobe Photoshop (v25.12.0, Adobe Inc., San Jose, CA, USA), and the StirMark4.0 image tool [
56]. The tested operations include brightness adjustment, contrast correction, gamma correction, Gaussian low-pass filtering, salt-and-pepper noise contamination, speckle noise interference, JPEG compression, watermark embedding, image scaling, and the combined rotation-cropping-rescaling attack. It is worth noting that the rotation-cropping-rescaling attack is a composite operation, which follows a specific process: first rotating the image, cropping out the padded pixels generated during rotation, and then rescaling the cropped image to the original size.
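The only non-obvious step of the composite attack is sizing the crop so that no rotation padding survives; a minimal sketch of that computation (the helper name `safe_crop_size` is ours; the actual attacks were generated with the tools listed above):

```python
import math

def safe_crop_size(w, h, angle_deg):
    """Dimensions of the largest centered, axis-aligned window with the
    original aspect ratio that fits inside a w x h image rotated by
    angle_deg, i.e. a crop guaranteed free of rotation padding."""
    t = math.radians(abs(angle_deg))
    s = min(w / (w * math.cos(t) + h * math.sin(t)),
            h / (w * math.sin(t) + h * math.cos(t)))
    return int(w * s), int(h * s)

# a 512 x 512 image rotated by 45 degrees keeps a 362 x 362 safe window,
# which is then rescaled back to 512 x 512 to complete the attack
print(safe_crop_size(512, 512, 45))  # -> (362, 362)
```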
For each attack operation, different parameter combinations were set in the experiment (specific parameter ranges are listed in
Table 3), and a total of 74 distinct operation schemes were determined. This means that each original image in the database corresponds to 74 visually similar variant samples, resulting in 24 × 74 = 1776 pairs of visually similar image pairs. By extracting hash features from each pair of similar images, the Hamming distance was used to quantify feature similarity. The mean Hamming distances for each operation under different parameter settings were calculated, and the results are shown in
Figure 3. Analysis reveals that the mean values of all operations except the rotation-cropping-rescaling operation are far below 15, while the maximum mean value of the composite operation is approximately 20. This is because composite operations introduce greater cumulative distortion than single operations: the superposition of multiple distortion factors leads to larger feature deviations. In addition, the block-based approach used for local feature extraction cannot resist rotation operations.
Further statistical analysis of the maximum, minimum, mean, and standard deviation of Hamming distances across different operations (as presented in
Table 4) reveals that the mean Hamming distance for the rotation-cropping-rescaling operation is 19.7, whereas the mean values for all other operations are below 15. This discrepancy can be attributed to a recognized limitation of the block-based LBP feature extraction process. While Zernike moments are themselves rotation-invariant, dividing the saliency map into fixed, non-overlapping blocks means that the spatial layout of the extracted LBP features is not preserved under rotation; this spatial misalignment produces the observed changes in the resulting binary sequence. Additionally, the standard deviations of all digital operations are small, indicating that a threshold T = 37 can be selected to mitigate interference from most of the tested operations. With this threshold, the correct detection rate reaches 86.23% when rotated variant images are excluded and 76.5% when they are included. This suggests a direction for future work: incorporating a rotation-invariant LBP variant or applying a block alignment correction step prior to feature extraction could markedly improve performance under such geometric attacks.
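The threshold-based detection rule above can be sketched as follows; the distance values here are toy examples, not the measured data from Table 4:

```python
def correct_detection_rate(distances, threshold):
    """Fraction of similar-pair Hamming distances at or below the threshold,
    i.e. the share of attacked images still recognized as similar."""
    return sum(d <= threshold for d in distances) / len(distances)

toy_distances = [5, 12, 20, 38, 40]   # illustrative values only
print(correct_detection_rate(toy_distances, 37))  # -> 0.6
```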
4.2. Discrimination Analysis
The discriminative capability of the proposed hashing method was evaluated using the UCID database and the USC-SIPI Image Database, from which a representative subset of 1000 color images was randomly selected. Hamming distances were computed between the hash codes of all unique image pairs, yielding 499,500 pairwise distances (the number of unique pairs among 1000 images). The distribution of these distances is presented in
Figure 4.
Statistical analysis indicates that the Hamming distance spans from 9 to 64, with an average value of 39.06 and a standard deviation of 5.36. Notably, this average distance between distinct images is significantly higher than the maximum average distance found among similar images (19.74, derived from the robustness test). This considerable disparity effectively showcases the strong discriminative ability of the proposed hashing approach.
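The pair count and the all-pairs distance computation can be sketched as follows (the hash strings are toy examples):

```python
from itertools import combinations
from math import comb

def pairwise_hamming(hashes):
    """Hamming distances over every unique pair of binary hash strings."""
    return [sum(a != b for a, b in zip(h1, h2))
            for h1, h2 in combinations(hashes, 2)]

print(comb(1000, 2))                            # -> 499500 unique pairs
print(pairwise_hamming(["000", "011", "111"]))  # -> [2, 3, 1]
```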
The balance between discrimination (false detection rate) and robustness (correct detection rate) depends inherently on the threshold. The total error rate, which balances false positives and negatives, is minimized at a threshold of 18, making this the optimal choice for balancing discriminative power and robustness. However, the threshold can be adjusted based on application needs, prioritizing either criterion.
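Selecting the threshold by minimizing the total error rate can be sketched as follows (toy distance lists, not the experimental data; with the paper's measured distances this procedure yields the reported optimum of 18):

```python
def total_error(t, similar_d, different_d):
    """False-negative rate (similar pairs above t) plus false-positive
    rate (dissimilar pairs at or below t)."""
    fn = sum(d > t for d in similar_d) / len(similar_d)
    fp = sum(d <= t for d in different_d) / len(different_d)
    return fn + fp

def best_threshold(similar_d, different_d, candidates):
    """Candidate threshold minimizing the total error rate."""
    return min(candidates, key=lambda t: total_error(t, similar_d, different_d))

sim = [3, 5, 8, 12, 19]      # toy similar-pair distances
dif = [25, 30, 33, 40, 45]   # toy dissimilar-pair distances
print(best_threshold(sim, dif, range(50)))  # -> 19
```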
4.3. Block Size Selection
To examine performance under different block size selections, the Receiver Operating Characteristic (ROC) graph [55] is used for experimental analysis. In ROC curve analysis, the following definitions need to be clarified. Let N1 represent the total number of similar image pairs (i.e., pairs derived from the same underlying image), among which N_TP (true positives) denotes the number of pairs correctly identified as similar. Let N2 denote the total number of dissimilar image pairs (i.e., pairs of different images), among which N_FP (false positives) denotes the number of dissimilar pairs incorrectly classified as similar. Two core evaluation metrics are derived from these:
Correct Detection Rate (CDR), calculated as CDR = N_TP / N1.
False Detection Rate (FDR), defined as FDR = N_FP / N2.
In the ROC graph, the position of the curve directly reflects the algorithm’s performance: the closer the curve is to the upper-left corner, the stronger the algorithm’s ability to correctly identify similar images and effectively distinguish dissimilar ones, indicating superior overall discrimination performance.
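These definitions translate directly into code; a minimal sketch computing (FDR, CDR) operating points over a threshold sweep and the trapezoidal area under the curve (toy distances, not our measurements):

```python
def roc_points(similar_d, different_d, thresholds):
    """(FDR, CDR) operating points: at threshold t, CDR is the share of
    similar pairs with distance <= t and FDR the share of dissimilar
    pairs with distance <= t."""
    pts = []
    for t in thresholds:
        cdr = sum(d <= t for d in similar_d) / len(similar_d)
        fdr = sum(d <= t for d in different_d) / len(different_d)
        pts.append((fdr, cdr))
    return sorted(pts)

def auc(points):
    """Area under the ROC curve by the trapezoidal rule, anchored at
    (0, 0) and (1, 1)."""
    pts = [(0.0, 0.0)] + points + [(1.0, 1.0)]
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

sim = [2, 4, 6, 8]           # toy similar-pair distances
dif = [20, 30, 40, 50]       # toy dissimilar-pair distances
print(auc(roc_points(sim, dif, range(0, 60, 5))))  # perfectly separable -> 1.0
```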
To validate the parameter selection rationale outlined in
Section 3.5, we evaluated the performance under different saliency block sizes. The datasets are consistent with those used in Section 4.1 and Section 4.2. Different block size selections in feature extraction are examined, i.e., different values of d: the block size is chosen from {16, 64, 256, 1024}, with all other parameter settings unchanged. As demonstrated in
Figure 5 and
Table 5, the AUC values for block sizes of 16, 64, 256, and 1024 are 0.9508, 0.9987, 0.9854, and 0.9931, respectively, while the differences in running time are slight. A block size of 64 × 64 thus achieves the best balance, yielding the highest AUC (0.9987) while maintaining a compact hash length (89 bits) and minimal computational overhead. These results empirically confirm that our pre-defined choice, based on a theoretical trade-off, is well founded. Therefore, the proposed robust hashing achieves preferable performance with a block size of 64.
4.4. Performance Comparisons
To demonstrate advantages, some popular robust hashing methods are employed for comparison, including the SVD-CSLBP method [
19], GF-LVQ [
44], and the HCVA-DCT method [
30]. The compared schemes have been recently published in reputable journals or conferences. Moreover, the SVD-CSLBP method also uses LBP as a local feature, while HCVA-DCT selects the color vector angle for preprocessing, enabling a meaningful evaluation of our feature design. To ensure a fair comparison, all parameter settings and similarity metrics were adopted directly from the original publications, and all images were resized to 512 × 512 before being input to these methods.
The image databases described in
Section 4.1 and
Section 4.2 were employed to evaluate the classification performance of our scheme, utilizing 1776 similar image pairs for robustness assessment and 499,500 image pairs for discrimination analysis. ROC curves were again used for visual comparison, with all evaluated schemes plotted together in
Figure 6 to facilitate direct juxtaposition. To enhance clarity, local details of these curves are enlarged in an inset within the same figure. Evidently, our scheme's curve lies closer to the upper-left corner than those of the comparative methods, visually confirming that our approach achieves superior classification performance. In addition, we calculated the AUC to quantitatively evaluate the trade-off performance, where a larger AUC value indicates a better balance. The results show that the AUC values of HCVA-DCT hashing, GF-LVQ hashing, and SVD-CSLBP hashing are 0.9747, 0.9902, and 0.9964, respectively. In contrast, our proposed method achieves an AUC of 0.9987, outperforming all compared algorithms. Both the ROC curves and AUC values confirm that our method exhibits a superior trade-off between robustness and discrimination compared with the competing approaches. This advantage originates from our integration of features: Zernike moments effectively capture global structural characteristics, color vector matrices highlight region-of-interest information, and LBP further refines local textural details from the saliency maps. Notably, the color vector angle was adopted for its superior properties over intensity-based measures. Unlike luminance, it is inherently invariant to changes in illumination intensity and shadows, as it depends on the direction of chromaticity in RGB space rather than its magnitude [
43]. This makes it highly sensitive to genuine changes in color content while being robust to common photometric distortions.
Despite its overall strong performance, the proposed method has certain limitations. It remains sensitive to severe geometric attacks that cause spatial misalignment, such as large-angle rotation with cropping. Furthermore, while robust to mild noise, LBP performance can degrade under strong noise or blurring, and Zernike features lack inherent scale invariance beyond the initial normalization step. Notwithstanding these limitations, our method offers distinct advantages over existing competitors. Unlike SVD-CSLBP, which uses DCT global features, our use of Zernike moments provides better inherent rotation invariance. Compared with GF-LVQ, our saliency-guided LBP extraction focuses computation on relevant regions, improving discriminability. Compared with HCVA-DCT, our fusion of Zernike moments and LBP captures a more diverse set of features (geometry plus texture) than color angles alone.
These theoretical advantages are further substantiated by comparing practical performance metrics. As summarized in
Table 6, our method achieves a favorable balance between performance and cost. Although the hash generation time (0.0923 s) is slightly higher than some alternatives, the difference is marginal for practical applications. More significantly, our algorithm requires only 89 bits of storage, substantially lower than the floating-point representations needed by HCVA-DCT (20 floats) and SVD-CSLBP (64 floats), and also lower than the 120 bits required by GF-LVQ.
In conclusion, the proposed global-local fusion hashing method achieves balanced discriminability, robustness, and compact storage (89 bits). To address the identified limitations, future work will focus on enhancing LBP’s noise resistance and optimizing the computational efficiency of Zernike moment calculation.