Adaptive Bit Selection via Deep Reinforcement Learning for Large-Scale Image Hashing

Rezaei, Mitra; Alaoui Mhamdi, Mohammed Ayoub; Allili, Madjid

doi:10.3390/electronics15081735


by Mitra Rezaei, Mohammed Ayoub Alaoui Mhamdi * and Madjid Allili

Computer Science, Faculty of Natural Sciences and Mathematics, Bishop’s University, Sherbrooke, QC J1M 1Z7, Canada

* Author to whom correspondence should be addressed.

Electronics 2026, 15(8), 1735; https://doi.org/10.3390/electronics15081735

Submission received: 27 February 2026 / Revised: 8 April 2026 / Accepted: 13 April 2026 / Published: 20 April 2026

(This article belongs to the Special Issue Advanced Machine Learning Technologies and Their Applications in Intelligent Imaging and Image Processing)


Abstract

Image hashing enables efficient large-scale image retrieval by encoding high-dimensional visual data into compact binary representations. However, existing deep hashing methods typically learn fixed-length hash codes in a fully supervised manner, often generating redundant bits that limit discriminative capability and increase storage overhead. In this paper, we propose a deep reinforcement learning-based adaptive bit selection framework for compact image hashing. We formulate hash refinement as a Markov Decision Process (MDP) and employ a Proximal Policy Optimization (PPO) agent to selectively retain the most informative hash bits while discarding redundant ones, directly optimizing retrieval performance through mean Average Precision (mAP). The proposed approach integrates CNN-based hash extraction with reinforcement-driven adaptive regeneration, producing compact yet highly discriminative binary codes. Extensive experiments on standard image retrieval benchmarks demonstrate consistent improvements over state-of-the-art deep hashing methods in terms of retrieval accuracy and efficiency, highlighting the effectiveness of reinforcement learning for adaptive representation learning in intelligent large-scale retrieval systems.

Keywords: deep reinforcement learning; image hashing; adaptive bit selection; large-scale image retrieval; Proximal Policy Optimization (PPO)

1. Introduction

The exponential growth of visual data generated by smartphones, surveillance systems, social media platforms, and intelligent sensing devices has made efficient large-scale image retrieval a fundamental challenge in modern intelligent imaging systems [1,2]. As image collections grow to millions or billions of instances, traditional nearest neighbor search in high-dimensional feature spaces becomes computationally prohibitive. To address this issue, image hashing techniques have been widely adopted to map high-dimensional visual features into compact binary codes while preserving semantic similarity [3,4,5]. By operating in Hamming space, hashing significantly reduces memory requirements and enables fast similarity computation via bitwise operations.

Early hashing methods, including Locality-Sensitive Hashing (LSH) [6], relied on data-independent random projections and offered theoretical guarantees for approximate nearest neighbor search. However, their retrieval performance is often limited due to the lack of semantic awareness. Data-dependent and supervised hashing approaches subsequently improved retrieval accuracy by leveraging label information and optimizing similarity-preserving objectives [7,8].

More recently, deep learning has revolutionized image hashing by jointly learning feature representations and binary embeddings in an end-to-end framework [9,10]. Deep supervised hashing methods integrate convolutional neural networks (CNNs) with quantization layers to directly generate compact binary codes, and recent advances have further incorporated transformer architectures and multiscale feature fusion mechanisms to enhance semantic discrimination and representation robustness [11,12]. Despite their effectiveness, most existing deep hashing techniques learn fixed-length hash codes in a fully supervised manner. This often leads to redundant or suboptimal bits, since all dimensions of the binary representation are treated equally during optimization. Furthermore, the use of non-differentiable sign activation functions introduces training inconsistencies that may degrade retrieval performance [13]. Even with recent improvements in representation modeling, adaptive redundancy reduction within learned binary codes remains insufficiently explored.

To overcome these limitations, recent research has explored reinforcement learning (RL) as a mechanism for learning discrete representations and sequential decision-making policies [14,15]. Unlike conventional supervised learning, RL optimizes long-term reward through interaction with an environment, making it particularly suitable for adaptive selection problems. However, the integration of deep reinforcement learning into compact binary representation learning for large-scale image retrieval remains underexplored.

In this paper, we propose a Deep Reinforcement Learning–Based Adaptive Bit Selection framework for Compact Image Hashing. We reformulate hash refinement as a sequential decision-making problem and model it as an MDP. Specifically, an agent learns to selectively retain informative hash bits and discard redundant ones using PPO [15]. By directly optimizing retrieval performance measured through mAP, the proposed framework adaptively improves discriminative capability while maintaining compactness.

The proposed method consists of two complementary stages: (1) CNN-based binary hash extraction for semantic representation learning, and (2) reinforcement-driven adaptive bit selection for redundancy reduction. This hybrid design bridges supervised deep hashing and policy-based reinforcement learning, enabling reward-driven optimization in Hamming space. Unlike conventional deep hashing approaches that rely solely on differentiable surrogate losses, our framework explicitly models bit selection as a discrete optimization process, thereby improving representation efficiency and retrieval robustness.

The main contributions of this work are threefold. First, we formulate adaptive hash bit refinement as a Markov Decision Process, enabling reinforcement-driven optimization of compact binary representations. Second, we propose a PPO-based adaptive bit selection mechanism that directly optimizes retrieval-oriented performance metrics, thereby improving discriminative capability while reducing redundancy. Third, we integrate CNN-based feature extraction with reinforcement learning-based redundancy reduction in a unified framework, and demonstrate consistent performance improvements over state-of-the-art deep hashing methods across multiple large-scale image retrieval benchmarks.

The remainder of this paper is organized as follows: Section 2 reviews related work in image hashing and deep representation learning. Section 3 presents the proposed deep reinforcement learning-based adaptive bit selection framework, including the Markov Decision Process formulation and PPO strategy. Section 4 describes the experimental setup and reports comparative results on large-scale image retrieval benchmarks. Finally, Section 5 concludes the paper and outlines potential directions for future research.

2. Related Work

Image hashing has evolved significantly over the past two decades, transitioning from handcrafted projection-based methods to end-to-end deep representation learning frameworks [3,6]. Existing approaches can be broadly categorized into traditional data-independent hashing [6], supervised data-dependent hashing [5,7,8], deep supervised hashing [9,10], and more recently, transformer-based and graph-enhanced methods [11,12,16]. In this section, we briefly review the main developments in learning-based image hashing and highlight recent advances that motivate the need for adaptive and reinforcement-driven optimization strategies.

2.1. Learning to Hash for Large-Scale Retrieval

Hashing for approximate nearest neighbor search maps high-dimensional data into compact binary codes, enabling fast retrieval in Hamming space with low storage overhead [3,6]. Early data-independent approaches such as locality-sensitive hashing provide theoretical guarantees but often lack semantic discrimination for complex visual data [6]. Data-dependent and supervised hashing methods improve retrieval accuracy by leveraging label information or similarity constraints [5,7,8]. These classical formulations established the principle that well-structured binary embeddings can support scalable image retrieval.

2.2. Deep Supervised Hashing

Deep hashing methods learn image representations and binary codes jointly in an end-to-end manner, typically using CNN backbones coupled with quantization-aware objectives [9,10]. A key challenge is the discrete nature of binary constraints, which motivates relaxation strategies and continuation-based training to mitigate optimization instability and the train–test quantization gap [9]. Beyond pairwise/triplet supervision, recent supervised deep hashing research also explores more global structures in Hamming space to reduce sampling bias and improve class-level separability.

2.3. Recent Advances: Transformers, Hash Centers, and Graph/Contrastive Objectives

Recent work has extended deep hashing beyond standard CNN pipelines by integrating transformer architectures and stronger global discrimination mechanisms. HashFormer adopts a Vision Transformer (ViT) backbone for deep hashing, demonstrating the effectiveness of self-attention for learning compact retrieval-oriented binary representations [11]. In parallel, Wang et al. [12] propose deep hashing with minimal-distance-separated hash centers, introducing a coding-theoretic constraint to enforce separation between class centers and improve retrieval robustness at scale. Graph-based designs have also gained traction: Deep Collaborative Graph Hashing (DCGH) incorporates dual-stream encoding and graph structure mining to learn discriminative hash codes for large-scale image retrieval [16]. For more challenging real-world conditions, Contrastive Transformer Masked Image Hashing (CTMIH) addresses degraded-image retrieval using a ViT-based architecture with contrastive alignment and mask reconstruction objectives [17]. These studies indicate a shift toward (i) attention-based backbones, (ii) globally structured supervision (e.g., hash centers), and (iii) graph/contrastive learning to enhance semantic discrimination in Hamming space.

2.4. Reinforcement Learning for Discrete Representation Optimization

Reinforcement learning has proven effective for discrete and sequential decision-making, particularly when actions are non-differentiable or selection-based [14,15]. PPO is widely used due to its stable clipped objective and reliable policy updates [15]. Motivated by these strengths, our work leverages policy-based RL to explicitly model adaptive bit selection as a discrete optimization process, targeting redundancy reduction and retrieval-driven refinement of binary hash codes.

Although recent advances have significantly improved deep hashing performance through transformer-based backbones [18,19], structured hash center separation [20], graph-enhanced modeling [21], and contrastive objectives [22], most existing approaches still rely on fixed-length binary codes learned via differentiable surrogate losses. These methods primarily focus on enhancing feature representation or global class separation, while treating all hash bits uniformly during optimization. In contrast, the proposed approach introduces a fundamentally different perspective by explicitly addressing redundancy within learned binary codes through a reinforcement learning formulation. Specifically, we model hash bit refinement as a discrete sequential decision-making problem operating directly in the Hamming space, thereby avoiding continuous relaxations commonly used in deep hashing.

The novelty of this work does not lie in the PPO algorithm itself, but in its integration with a mask-based adaptive bit selection mechanism that selectively retains informative bits and discards redundant ones. Furthermore, the proposed framework directly optimizes retrieval performance using a reward signal based on evaluation metrics, ensuring alignment between the learning objective and the final retrieval task. This reward-driven, policy-based refinement strategy provides a principled and effective approach to improving compactness, discriminative capability, and robustness in large-scale image retrieval systems.

3. Proposed Methodology

This section presents the proposed Deep Reinforcement Learning-Based Adaptive Bit Selection framework for compact image hashing. The method consists of two sequential stages: (i) CNN-based binary hash extraction for semantic representation learning, and (ii) reinforcement-driven adaptive bit refinement for redundancy reduction.

3.1. Deep Reinforcement Learning to Hash

Binary embedding, also known as hashing, enables efficient large-scale retrieval by mapping high-dimensional feature vectors into compact binary representations. Let $x \in \mathbb{R}^{d}$ denote the feature representation of an image. The objective of hashing is to learn a function

$$h : \mathbb{R}^{d} \to \{-1, 1\}^{K} \tag{1}$$

where $K \ll d$ and $K$ denotes the hash length. Binary representations significantly reduce storage requirements and allow fast similarity computation via Hamming distance.

Traditional hashing methods typically decouple feature extraction from binary embedding. In contrast, deep hashing jointly optimizes representation learning and binary code generation. However, due to the non-differentiability of the sign function,

$$\operatorname{sgn}(z) = \begin{cases} +1, & z \geq 0 \\ -1, & z < 0 \end{cases} \tag{2}$$

most methods relax the discrete constraint during training and apply binarization only at test time, leading to training–testing inconsistency.

To address this issue and explicitly handle discrete optimization, we reformulate hash refinement as a sequential decision-making problem and adopt a deep reinforcement learning framework.

3.2. Markov Decision Process Formulation

We model adaptive hash refinement as an MDP, defined as $M = (S, A, T, R, \gamma)$, where $S$ is the state space, $A$ is the action space, $T$ is the transition function, $R$ is the reward function, and $\gamma \in [0, 1]$ is the discount factor.

At time step $t$, the agent observes state $s_{t}$ and selects action $a_{t}$, resulting in a new state $s_{t+1}$ according to the transition probability:

$$T(s_{t}, a_{t}, s_{t+1}) = P(S_{t+1} = s_{t+1} \mid S_{t} = s_{t}, A_{t} = a_{t}). \tag{3}$$

The agent receives reward:

$$r_{t} = \mathbb{E}[R_{t+1} \mid S_{t} = s_{t}, A_{t} = a_{t}]. \tag{4}$$

The objective is to learn a policy $\pi(a \mid s)$ that maximizes expected cumulative reward:

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]. \tag{5}$$
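To make the quantity inside the expectation of Equation (5) concrete, the following minimal Python sketch computes the discounted sum for a single finite episode. It does not estimate the expectation over the policy, and the reward values and the helper name `discounted_return` are hypothetical choices made only for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite episode."""
    total = 0.0
    # Accumulate from the last step backwards (Horner's rule), so each
    # step costs a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical per-step rewards for a three-step episode.
rewards = [1.0, 0.0, 2.0]
g = discounted_return(rewards, gamma=0.5)  # 1.0 + 0.5*0.0 + 0.25*2.0
```

With $\gamma = 1$ the function reduces to a plain sum of rewards, and with $\gamma = 0$ only the first reward counts.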

3.3. Proposed Approach Components

The proposed Deep Reinforcement Learning-Based Image Hashing framework consists of two complementary components: (i) CNN-based binary hash code extraction for semantic representation learning, and (ii) reinforcement-driven regeneration of binary hash codes through adaptive bit selection. The first component generates discriminative fixed-length binary embeddings that preserve semantic similarity between images. The second component refines these embeddings by eliminating redundant or less informative bits using a reward-driven decision mechanism. Together, these stages form a unified framework that balances similarity preservation, compactness, and discriminative robustness.

3.3.1. CNN-Based Binary Hash Code Extraction

This stage aims to generate binary hash codes that globally preserve semantic similarity relationships and then learn a deep mapping from images to those codes using a CNN model. The process consists of two major parts: (i) similarity-preserving hash code extraction in binary space, and (ii) CNN-based mapping via multi-binary classification.

Training Data Representation

Let the training dataset be represented as

$$X = \{x_{i}\}_{i=1}^{n} \in \mathbb{R}^{n \times D} \tag{6}$$

where $x_{i}$ denotes either raw image pixels or extracted image features.

Each image is annotated with semantic labels

$$Y = \{y_{i}\}_{i=1}^{n} \in \{0, 1\}^{n \times m} \tag{7}$$

where $m$ is the total number of semantic categories. If $y_{ij} = 1$, then the $i$-th image belongs to the $j$-th category. An image may belong to multiple categories.

Semantic Similarity Matrix

To formally encode pairwise semantic relationships among training samples, we construct a similarity matrix

$$S \in \{-1, +1\}^{n \times n} \tag{8}$$

whose entries quantify whether two images should be considered semantically similar or dissimilar.

Specifically, the element $S_{ij}$ is defined as

$$S_{ij} = \begin{cases} +1, & \text{if images } i \text{ and } j \text{ share at least one semantic label}, \\ -1, & \text{otherwise}. \end{cases} \tag{9}$$

This definition ensures that semantically related images are encouraged to have similar binary codes, while unrelated images are pushed apart in Hamming space.

Using the label matrix $Y$, the similarity matrix can be computed compactly as

$$S = \min(Y Y^{\top}, 1) \times 2 - 1 \tag{10}$$

Here, $Y Y^{\top}$ computes pairwise label co-occurrence counts between images. The element-wise $\min(\cdot, 1)$ operation converts all positive co-occurrence values to 1, ensuring that only the existence of shared labels (rather than the number of shared labels) determines similarity. Multiplying by 2 and subtracting 1 transforms the resulting matrix from $\{0, 1\}$ to $\{-1, +1\}$.
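As a small illustration of Equation (10), the sketch below builds $S$ from a toy label matrix using plain Python lists. The label values and the function name `similarity_from_labels` are assumptions introduced for this example only.

```python
def similarity_from_labels(Y):
    """Return S with S[i][j] = +1 if rows i and j of Y share a label, else -1."""
    n = len(Y)
    S = []
    for i in range(n):
        row = []
        for j in range(n):
            # (Y Y^T)_{ij}: number of labels shared by images i and j.
            shared = sum(a * b for a, b in zip(Y[i], Y[j]))
            # min(., 1) keeps only "shares a label or not"; *2 - 1 maps
            # {0, 1} to {-1, +1}.
            row.append(min(shared, 1) * 2 - 1)
        S.append(row)
    return S


# Toy multi-label matrix: 3 images, 3 categories (illustrative values).
Y = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
S = similarity_from_labels(Y)
```

Images 0 and 1 share the second category and images 1 and 2 share the third, so those pairs receive $+1$, whereas images 0 and 2 share nothing and receive $-1$. Note that under this formula an image with no labels at all would obtain $S_{ii} = -1$.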

The use of $\{-1, +1\}$ instead of $\{0, 1\}$ is deliberate. In binary hashing, inner products between $\{-1, +1\}$ codes directly reflect Hamming affinity. Specifically, writing $k$ for the code length and $Z$ for the matrix of binary codes (both defined formally in the next subsection), two identical codes have inner product $k$, whereas maximally different codes produce $-k$. Thus, aligning $Z Z^{\top}$ with $k S$ naturally enforces similarity preservation.

Geometrically, the similarity matrix $S$ defines a desired structure in Hamming space: similar samples should lie close to each other, and dissimilar samples should be separated. Therefore, $S$ serves as the supervision signal guiding the global arrangement of binary hash codes.

Hash Function Definition

The objective of deep hashing is to learn a nonlinear transformation that maps high-dimensional image representations into a compact binary space while preserving semantic similarity relationships. Formally, we aim to learn a mapping function

$$\operatorname{Hash}(X) = Z, \tag{11}$$

where

$$Z = \{z_{i}\}_{i=1}^{n} \in \{-1, +1\}^{n \times k} \tag{12}$$

and $k$ denotes the length of the hash code (i.e., the number of bits).

More precisely, for each input image $x_{i}$, the hash function produces a binary vector

$$z_{i} = \operatorname{Hash}(x_{i}) \in \{-1, +1\}^{k} \tag{13}$$

The dimensionality $k$ is typically much smaller than the original feature dimension $D$, ensuring that storage requirements are significantly reduced. This compact representation is particularly suitable for large-scale image retrieval systems, where millions of samples must be stored and compared efficiently.

The choice of the binary space $\{-1, +1\}^{k}$ offers two main advantages. First, binary representations enable extremely fast similarity computation using Hamming distance, which can be implemented via bitwise XOR operations followed by bit counting. Second, binary codes require minimal memory compared to floating-point embeddings, making them ideal for large-scale databases.
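The XOR-and-count computation mentioned above can be sketched as follows. The packing convention (mapping $+1$ to bit 1 and $-1$ to bit 0) and the helper names `pack_bits` and `hamming_packed` are assumptions made for this illustration; a production system would typically rely on hardware popcount instructions instead.

```python
def pack_bits(code):
    """Pack a {-1,+1} code into an int (assumed convention: +1 -> 1, -1 -> 0)."""
    word = 0
    for bit in code:
        word = (word << 1) | (1 if bit > 0 else 0)
    return word


def hamming_packed(a, b):
    """Hamming distance of two packed codes: XOR, then count the set bits."""
    return bin(a ^ b).count("1")


# Two illustrative 8-bit codes that differ in positions 1, 3 and 6.
z1 = [+1, -1, +1, +1, -1, -1, +1, -1]
z2 = [+1, +1, +1, -1, -1, -1, -1, -1]
d = hamming_packed(pack_bits(z1), pack_bits(z2))
```

Each packed 8-bit code occupies a single byte, compared with 32 bytes for eight 32-bit floating-point values, which is the memory advantage referred to in the text.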

The learned hash function must simultaneously enforce similarity preservation and discriminability. In particular, semantically similar images should be mapped to binary codes with small Hamming distance, ensuring that related samples remain close in the Hamming space, while semantically dissimilar images should be assigned well-separated codes to enhance retrieval precision and avoid false matches. Consequently, the objective reduces to learning a mapping $\operatorname{Hash}(\cdot)$ that projects images into a structured binary space whose geometry is consistent with the semantic similarity matrix $S$ introduced earlier, thereby aligning the Hamming distances between codes with the underlying semantic relationships.

Hamming Distance and Inner Product Relation

Given two binary hash codes

$$z_{i}, z_{j} \in \{-1, +1\}^{k} \tag{14}$$

their similarity in Hamming space is measured by the Hamming distance, defined as the number of bit positions at which the two codes differ.

A key property of $\{-1, +1\}^{k}$ binary vectors is that their Hamming distance can be directly expressed in terms of their inner product. Specifically,

$$H_{\mathrm{dist}}(z_{i}, z_{j}) = \frac{1}{2}\left(k - z_{i} z_{j}^{\top}\right). \tag{15}$$

This relationship can be understood as follows. For each bit position $\ell$:

$$z_{i\ell} z_{j\ell} = \begin{cases} +1, & \text{if } z_{i\ell} = z_{j\ell}, \\ -1, & \text{if } z_{i\ell} \neq z_{j\ell}. \end{cases} \tag{16}$$

Therefore, the inner product

$$z_{i} z_{j}^{\top} = \sum_{\ell=1}^{k} z_{i\ell} z_{j\ell} \tag{17}$$

counts agreements as $+1$ and disagreements as $-1$.

Let $d$ denote the Hamming distance between $z_{i}$ and $z_{j}$. Then the number of matching bits is $k - d$, and the number of differing bits is $d$. Thus,

$$z_{i} z_{j}^{\top} = (k - d)(+1) + d(-1) = k - 2d \tag{18}$$

Solving for $d$ yields

$$d = \frac{1}{2}\left(k - z_{i} z_{j}^{\top}\right) \tag{19}$$

which proves Equation (15).

This equivalence is fundamental for deep hashing. Instead of directly optimizing Hamming distances, which are discrete and non-differentiable, we can optimize inner products in Euclidean space. Maximizing the inner product between two binary codes minimizes their Hamming distance, while minimizing the inner product increases their separation. Thus, preserving semantic similarity can be reformulated as aligning the inner-product matrix $Z Z^{\top}$ with the scaled semantic similarity matrix $k S$. This transformation enables continuous optimization techniques to be applied to a problem that is inherently combinatorial in Hamming space.
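The identity in Equation (15) can also be checked exhaustively for a small code length. The sketch below enumerates every pair of codes in $\{-1, +1\}^{4}$. The choice $k = 4$ and the helper names are arbitrary and serve only as an illustration.

```python
from itertools import product


def hamming(zi, zj):
    """Number of positions at which two codes differ."""
    return sum(1 for a, b in zip(zi, zj) if a != b)


def inner(zi, zj):
    """Inner product of two {-1,+1} codes."""
    return sum(a * b for a, b in zip(zi, zj))


k = 4  # small enough to enumerate all 2**k codes
codes = list(product([-1, 1], repeat=k))

# Equation (15) rearranged to avoid fractions: 2 * d = k - <z_i, z_j>.
identity_holds = all(
    2 * hamming(zi, zj) == k - inner(zi, zj)
    for zi in codes
    for zj in codes
)
```

The check covers all $16 \times 16$ pairs, including the two extremes: identical codes (inner product $k$, distance 0) and opposite codes (inner product $-k$, distance $k$).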

Global Similarity Reconstruction Objective

The primary goal of the hashing stage is to learn binary codes whose pairwise similarities reflect the semantic relationships encoded in the similarity matrix $S$. From the previous subsection, we know that the inner product between two binary codes is directly related to their Hamming distance. Therefore, instead of directly minimizing Hamming distances, we formulate the objective in terms of inner products.

Let $Z \in \{-1, +1\}^{n \times k}$ denote the binary hash code matrix, where each row $z_{i}$ corresponds to the $k$-bit representation of image $i$. The matrix product $Z Z^{\top}$ produces an $n \times n$ matrix whose $(i, j)$-th entry equals the inner product $z_{i} z_{j}^{\top}$.

To enforce global similarity preservation, we minimize

$$\min_{Z} L_{1} = \left\| Z Z^{\top} - k S \right\|_{2}^{2}. \tag{20}$$

This objective aims to align the inner-product similarity matrix $Z Z^{\top}$ with the scaled semantic similarity matrix $k S$.

The multiplication by $k$ is necessary because, for binary codes in $\{-1, +1\}^{k}$, the maximum possible inner product between two identical codes is $k$, while the minimum (for completely opposite codes) is $-k$. Thus, scaling $S$ by $k$ ensures that the target similarity matrix lies in the same numerical range as $Z Z^{\top}$. The norm $\| \cdot \|_{2}^{2}$ denotes the squared Frobenius norm, defined as

$$\| A \|_{2}^{2} = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}^{2} \tag{21}$$

Minimizing the Frobenius norm of the difference enforces that all pairwise similarities are reconstructed simultaneously. This is why the formulation is referred to as global similarity reconstruction: it does not rely on mini-batch sampling but instead attempts to preserve the entire similarity structure. More explicitly, Equation (20) expands to

$$L_{1} = \sum_{i=1}^{n} \sum_{j=1}^{n} \left(z_{i} z_{j}^{\top} - k S_{ij}\right)^{2} \tag{22}$$

For semantically similar pairs ($S_{ij} = +1$), the objective encourages

$$z_{i} z_{j}^{\top} \approx k \tag{23}$$

which implies that their Hamming distance approaches zero. For semantically dissimilar pairs ($S_{ij} = -1$), the objective encourages

$$z_{i} z_{j}^{\top} \approx -k \tag{24}$$

which corresponds to maximal Hamming separation. Hence, the optimization problem directly shapes the geometry of the binary embedding space to reflect semantic structure. However, since $Z$ is constrained to take discrete values in $\{-1, +1\}$, solving Equation (20) requires searching over $2^{nk}$ possible configurations, which is computationally intractable. This motivates introducing a continuous relaxation in the next subsection.
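To illustrate how the expanded objective in Equation (22) scores a candidate set of codes, the following sketch evaluates it on a toy problem with three images and $k = 2$. The similarity matrix, the two candidate code matrices, and the function names are hypothetical values chosen for the example.

```python
def gram(Z):
    """Return Z Z^T as a list of lists."""
    return [[sum(a * b for a, b in zip(zi, zj)) for zj in Z] for zi in Z]


def similarity_loss(Z, S, k):
    """Sum over all pairs (i, j) of (z_i . z_j - k * S_ij)**2, as in Eq. (22)."""
    G = gram(Z)
    n = len(Z)
    return sum((G[i][j] - k * S[i][j]) ** 2 for i in range(n) for j in range(n))


# Images 0 and 1 are similar to each other; image 2 is dissimilar to both.
S = [[1, 1, -1], [1, 1, -1], [-1, -1, 1]]
k = 2

Z_good = [[1, 1], [1, 1], [-1, -1]]   # respects S exactly
Z_bad = [[1, 1], [1, -1], [-1, -1]]   # image 1 sits halfway between 0 and 2

loss_good = similarity_loss(Z_good, S, k)
loss_bad = similarity_loss(Z_bad, S, k)
```

Here `Z_good` reproduces $kS$ exactly and attains zero loss, while flipping a single bit of image 1 leaves four entries of $Z Z^{\top}$ off by 2 each, for a loss of 16.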

Continuous Relaxation

The optimization problem in Equation (20) is defined over the discrete binary domain

$$Z \in \{-1, +1\}^{n \times k} \tag{25}$$

Because each element of $Z$ can take only two possible values, the search space contains $2^{nk}$ possible configurations. Such a combinatorial optimization problem is computationally intractable for realistic dataset sizes. To enable gradient-based optimization, we relax the discrete constraint and introduce a continuous surrogate matrix

$$\tilde{Z} \in \mathbb{R}^{n \times k} \tag{26}$$

The objective becomes

$$\min_{\tilde{Z}} L_{1} = \left\| \tilde{Z} \tilde{Z}^{\top} - k S \right\|_{2}^{2}. \tag{27}$$

In this relaxed formulation, $\tilde{Z}$ is optimized in continuous space using gradient descent. After convergence, binary codes are obtained by applying a sign function:

$$Z = \operatorname{sign}(\tilde{Z}), \qquad \operatorname{sign}(x) = \begin{cases} -1, & x < 0, \\ +1, & x \geq 0. \end{cases} \tag{28}$$

The relaxation step allows the similarity-preserving objective to be optimized efficiently. However, it introduces a new issue: the continuous solution $\tilde{Z}$ may not lie close to $\{-1, +1\}$ values. The difference between $\tilde{Z}$ and its binarized version $Z$ is referred to as the quantization error. If $\tilde{Z}$ contains values far from $\pm 1$, applying the sign function may cause large distortions, potentially degrading retrieval performance.

Quantization Regularization

Although the continuous relaxation in Equation (27) enables tractable optimization, it introduces a discrepancy between the relaxed embeddings $\tilde{Z}$ and the desired discrete binary codes $Z$. This discrepancy, commonly referred to as quantization error, arises because the optimization is performed in continuous space, while the final representation must lie strictly in the binary set $\{-1, +1\}$.

If entries of $\tilde{Z}$ deviate significantly from $\pm 1$, applying the sign function in Equation (28) may produce unstable or inaccurate binary codes. For example, values near zero are highly sensitive to small perturbations and may flip sign after binarization, leading to inconsistencies between training and inference.

To mitigate this issue, we introduce a regularization term that encourages each element of $\tilde{Z}$ to approach either $+1$ or $-1$. The regularized objective becomes

$$\min_{\tilde{Z}} L_{1} = \left\| \tilde{Z} \tilde{Z}^{\top} - k S \right\|_{2}^{2} + \left\| |\tilde{Z}| - 1 \right\|. \tag{29}$$

The first term preserves semantic similarity, while the second term penalizes deviations from binary magnitude. More precisely, the expression

$$|\tilde{Z}| - 1 \tag{30}$$

computes the element-wise difference between the absolute value of $\tilde{Z}$ and 1. Minimizing its norm forces $|\tilde{Z}_{ij}| \approx 1$, which implies that $\tilde{Z}_{ij}$ approaches either $+1$ or $-1$.

Geometrically, this regularization pushes each continuous embedding vector toward the vertices of the $k$-dimensional hypercube $\{-1, +1\}^{k}$. Thus, the optimization simultaneously enforces structural consistency, ensuring that the learned representations preserve the semantic similarity relationships encoded in the dataset, and binary fidelity, which aims to minimize the discrepancy between continuous feature embeddings and their corresponding discrete binary codes.

By reducing quantization error during training, the binarization step in Equation (28) becomes more stable and less destructive, leading to improved retrieval accuracy and robustness in Hamming space.
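The behaviour of the second term in Equation (29) can be illustrated numerically. Since the text does not specify which norm is applied to $|\tilde{Z}| - 1$, the sketch below assumes an entry-wise L1 norm (sum of absolute values); this choice, the sample matrices, and the function name are assumptions made for illustration only.

```python
def quantization_penalty(Z_tilde):
    """Sum over entries of | |z| - 1 | (an assumed L1 reading of the regularizer)."""
    return sum(abs(abs(z) - 1.0) for row in Z_tilde for z in row)


# Entries already close to +1 or -1: small penalty.
near_binary = [[0.9, -1.1], [-1.0, 1.0]]

# Entries near zero or far beyond 1 in magnitude: large penalty.
far_from_binary = [[0.1, -0.2], [2.0, 0.0]]

p_near = quantization_penalty(near_binary)
p_far = quantization_penalty(far_from_binary)
```

The penalty is zero exactly when every entry equals $+1$ or $-1$, and it grows both for entries that shrink toward zero and for entries whose magnitude exceeds 1.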

Quantization Error Reduction

The introduction of continuous relaxation enables efficient optimization but inevitably creates a mismatch between the optimized embeddings $\tilde{Z}$ and the desired discrete binary codes $Z \in \{-1, +1\}^{n \times k}$. This mismatch is referred to as quantization error.

Formally, after optimization of the relaxed objective in Equation (27), the binary codes are obtained through

$$Z = \operatorname{sign}(\tilde{Z}) \tag{31}$$

The quantization error measures the discrepancy between the continuous embedding and its binarized version, and can be conceptually expressed as

$$\left\| \tilde{Z} - Z \right\| \tag{32}$$

When entries of $\tilde{Z}$ are close to zero, small perturbations may change their sign, leading to unstable binary codes. This instability can significantly affect Hamming distances and consequently degrade retrieval performance.

Therefore, it is essential to reduce the magnitude of quantization error during training rather than correcting it post hoc.

To achieve this, we incorporate a regularization term that encourages each entry of $\tilde{Z}$ to move toward either $+1$ or $-1$. This leads to the regularized objective:

$$\min_{\tilde{Z}} L_{1} = \left\| \tilde{Z} \tilde{Z}^{\top} - k S \right\|_{2}^{2} + \left\| |\tilde{Z}| - 1 \right\|. \tag{33}$$

The second term penalizes deviations of $|\tilde{Z}_{ij}|$ from unity, effectively constraining the solution toward the discrete hypercube vertices.

From a geometric perspective, this regularization shrinks the feasible region from unconstrained Euclidean space toward the corners of the binary hypercube. As a result, when the sign function is applied, the transformation from $\tilde{Z}$ to $Z$ introduces minimal distortion.

Reducing quantization error therefore ensures that the learned continuous embeddings remain structurally aligned with the imposed binary constraints, that the subsequent sign operation yields stable and reliable hash codes with minimal information loss, and that retrieval performance remains consistent between the training phase and the inference stage, thereby preserving semantic coherence in the Hamming space.
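The quantization error of Equation (32) and the sign instability discussed above can be made concrete with a short sketch. The norm in Equation (32) is not specified in the text, so a Frobenius (Euclidean) norm is assumed here; the sample embeddings and the helper names are likewise hypothetical.

```python
def sign(x):
    """Sign convention of Eq. (28): zero maps to +1."""
    return 1.0 if x >= 0 else -1.0


def quantization_error(Z_tilde):
    """Assumed Frobenius distance between Z_tilde and sign(Z_tilde)."""
    return sum((z - sign(z)) ** 2 for row in Z_tilde for z in row) ** 0.5


def min_margin(Z_tilde):
    """Smallest |z|: the perturbation size needed to flip at least one bit."""
    return min(abs(z) for row in Z_tilde for z in row)


# Same embedding twice, except that one entry sits close to zero.
stable = [[0.9, -1.1], [-0.8, 1.2]]
fragile = [[0.05, -1.1], [-0.8, 1.2]]

err_stable = quantization_error(stable)
err_fragile = quantization_error(fragile)
```

Entry by entry, $|z - \operatorname{sign}(z)|$ equals $\big||z| - 1\big|$, so the quantity penalized by the regularizer in Equation (33) is exactly the per-entry quantization error. In the example, a perturbation of only 0.05 is enough to flip a bit of `fragile`, whereas `stable` tolerates perturbations up to 0.8.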

Substituting $S$ with $Y$

The similarity matrix $S$ introduced earlier explicitly encodes pairwise semantic relationships between images. However, storing and computing $S$ directly may be inefficient when the number of training samples $n$ is large. Therefore, instead of constructing $S$ independently, we derive it directly from the label matrix $Y$.

Recall that

$$Y \in \{0, 1\}^{n \times m} \tag{34}$$

contains semantic annotations, where each row $y_{i}$ corresponds to the multi-label vector of image $i$. The matrix product

$$Y Y^{\top} \tag{35}$$

produces an $n \times n$ matrix whose $(i, j)$-th entry equals the number of shared labels between images $i$ and $j$. More precisely,

$$\left(Y Y^{\top}\right)_{ij} = \sum_{\ell=1}^{m} y_{i\ell} y_{j\ell} \tag{36}$$

If this value is greater than zero, the two images share at least one category and should be considered semantically similar. To convert this co-occurrence matrix into a binary similarity indicator, we apply the element-wise operation

$$\min(Y Y^{\top}, 1) \tag{37}$$

which maps all positive entries to 1 while preserving zeros. This ensures that similarity depends only on the existence of shared labels rather than the number of shared labels.

Finally, to match the inner-product range of binary codes in $\{-1, +1\}$, we transform the matrix into the $\{-1, +1\}$ domain:

$$S = \min(Y Y^{\top}, 1) \times 2 - 1 \tag{38}$$

Multiplying by 2 converts $\{0, 1\}$ to $\{0, 2\}$, and subtracting 1 shifts the values to $\{-1, +1\}$. Thus, $S_{ij} = +1$ if images $i$ and $j$ share at least one label, and $S_{ij} = -1$ otherwise.

Substituting Equation (38) into the relaxed similarity reconstruction objective yields

$$\min_{\tilde{Z}} L_{1} = \left\| \tilde{Z} \tilde{Z}^{\top} - k \left(\min(Y Y^{\top}, 1) \times 2 - 1\right) \right\|_{2}^{2} + \left\| |\tilde{Z}| - 1 \right\| \tag{39}$$

This substitution offers two key advantages: it removes the necessity of explicitly constructing and storing a separate similarity matrix, thereby reducing computational and memory overhead, and it directly integrates semantic label information into the optimization objective, ensuring stronger supervision during training. Consequently, the learning process inherently enforces that the learned binary embeddings faithfully capture and reflect the semantic structure derived from the labels.

Block-Wise Optimization Strategy

The procedure for extracting block-wise hash codes is outlined in Algorithm 1. By leveraging the hash codes derived from the semantic labels and the similarity matrix, the training images can be effectively associated with their corresponding codes. Although Equation (39) eliminates the need to explicitly store a precomputed similarity matrix, the computation of

$$\tilde{Z} \tilde{Z}^{\top} \tag{40}$$

still generates an $n \times n$ matrix. When the number of training samples $n$ is large, this operation requires $O(n^{2})$ memory and computation, which becomes impractical for large-scale datasets.

To overcome this limitation, we partition the similarity reconstruction into smaller blocks and compute the loss incrementally. Let $h$ and $w$ denote the block height and width, respectively. Assuming that $n$ is divisible by both $h$ and $w$, the dataset can be divided into $\frac{n}{h}$ row partitions and $\frac{n}{w}$ column partitions.

The global objective can then be expressed as a summation over these blocks:

$$\min_{\tilde{Z}} \sum_{r = 0}^{\frac{n}{h} - 1} \sum_{c = 0}^{\frac{n}{w} - 1} \left\| \tilde{Z}_{r : r + h - 1} \tilde{Z}_{c : c + w - 1}^{\top} - k \left( \min\left( Y_{r : r + h - 1} Y_{c : c + w - 1}^{\top}, 1 \right) \times 2 - 1 \right) \right\|_{2}^{2} + \left\| |\tilde{Z}| - 1 \right\| \tag{41}$$

In this formulation, $\tilde{Z}_{r : r + h - 1}$ represents the subset of rows from index $r$ to $r + h - 1$, while $\tilde{Z}_{c : c + w - 1}$ denotes the corresponding column block. Similarly, $Y_{r : r + h - 1}$ and $Y_{c : c + w - 1}$ are the corresponding label submatrices. Each block computes the similarity reconstruction between two subsets of samples, thereby approximating the full pairwise similarity matrix without materializing it entirely in memory.

For each block $(r, c)$, the block-specific loss is computed and its gradient with respect to $\tilde{Z}$ is evaluated. The gradients obtained from all blocks are accumulated into a single global gradient matrix. After iterating through all blocks, one parameter update step is performed using the accumulated gradient.

This strategy reduces memory requirements from $O(n^{2})$ to approximately $O(h w)$ per block while preserving the global similarity structure. All pairwise relationships are still considered across iterations, but they are processed in segments to ensure computational feasibility. As a result, the block-wise approach enables scalable optimization of the similarity reconstruction objective on large datasets.

Algorithm 1 Hash Code Extraction using Block-wise Similarity Calculation

1: Input: Labels $Y$, hash length $k$, epochs $t_{1}$, block size $h \times w$
2: for $t = 1$ to $t_{1}$ do
3:     for $r = 0$ to $\frac{n}{h} - 1$ do
4:         for $c = 0$ to $\frac{n}{w} - 1$ do
5:             Compute block loss and gradient using Equation (41)
6:             Accumulate gradient
7:         end for
8:     end for
9:     Update $\tilde{Z}$
10: end for
11: $Z = \mathrm{sign}(\tilde{Z})$
12: $B = \frac{1}{2} (Z + 1)$
13: Output: Binary codes $B$
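The gradient accumulation in Algorithm 1 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: it covers only the reconstruction term of Equation (41) (the quantization term is omitted), it reads block index $r$ as covering rows $rh$ to $rh + h - 1$, and the names `block_similarity` and `blockwise_loss_and_grad` are hypothetical. The check at the end compares the accumulated result against the full $n \times n$ computation on a toy problem.

```python
import numpy as np

def block_similarity(Y_rows: np.ndarray, Y_cols: np.ndarray) -> np.ndarray:
    """Similarity block in {-1, +1}, computed from label submatrices only."""
    return np.minimum(Y_rows @ Y_cols.T, 1) * 2 - 1

def blockwise_loss_and_grad(Z, Y, k, h, w):
    """Accumulate the reconstruction loss and its gradient block by block."""
    n = Z.shape[0]
    assert n % h == 0 and n % w == 0, "n must be divisible by h and w"
    loss = 0.0
    grad = np.zeros_like(Z)
    for r in range(n // h):
        rows = slice(r * h, (r + 1) * h)
        for c in range(n // w):
            cols = slice(c * w, (c + 1) * w)
            # Residual of one h-by-w block; the full n-by-n matrix is never formed.
            E = Z[rows] @ Z[cols].T - k * block_similarity(Y[rows], Y[cols])
            loss += np.sum(E ** 2)
            grad[rows] += 2 * E @ Z[cols]    # contribution to the row block
            grad[cols] += 2 * E.T @ Z[rows]  # contribution to the column block
    return loss, grad

# Toy check against the full (non-blocked) computation.
rng = np.random.default_rng(0)
n, m, k = 6, 3, 4
Y = rng.integers(0, 2, size=(n, m))
Z = rng.standard_normal((n, k))
loss_blk, grad_blk = blockwise_loss_and_grad(Z, Y, k, h=2, w=3)
E_full = Z @ Z.T - k * block_similarity(Y, Y)
loss_full = np.sum(E_full ** 2)
grad_full = 4 * E_full @ Z  # valid because E_full is symmetric
print(np.isclose(loss_blk, loss_full), np.allclose(grad_blk, grad_full))
```

Only one $h \times w$ residual is held in memory at a time, while the accumulated gradient matches the one obtained from the full matrix.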

CNN Mapping via Multi-Binary Classification

After extracting the binary hash codes $B = \{b_{i}\}_{i = 1}^{n} \in \{0, 1\}^{n \times k}$ through similarity reconstruction, the next objective is to learn a parametric mapping that directly transforms an input image into its corresponding binary representation. To achieve this, we train a convolutional neural network (AlexNet) to approximate the extracted hash codes.

Let the network output for image $x_{i}$ be

$$F(x_{i}; \Theta) \in \mathbb{R}^{k} \tag{42}$$

where $\Theta$ denotes the learnable parameters of the CNN and $k$ is the hash length. Each dimension of $F(x_{i}; \Theta)$ corresponds to one bit of the hash code.

Since each hash bit represents a binary decision, the mapping problem can be interpreted as learning k independent binary classifiers simultaneously. Therefore, the learning objective naturally becomes a multi-binary classification problem.

To convert the real-valued network outputs into probabilities, we apply the sigmoid function element-wise:

$$\sigma(z) = \frac{1}{1 + e^{-z}} \tag{43}$$

which maps each component into the interval $(0, 1)$. The output $\sigma(F(x_{i}; \Theta))$ can thus be interpreted as the predicted probability that a particular hash bit equals 1.

The multi-binary cross-entropy loss is defined as

$$\min_{\Theta} L_{2} = - \sum_{i = 1}^{n} \left[ b_{i} \log\left( \sigma(F(x_{i}; \Theta)) \right) + (1 - b_{i}) \log\left( 1 - \sigma(F(x_{i}; \Theta)) \right) \right] \tag{44}$$

This loss measures the discrepancy between predicted probabilities and the target binary codes. For each image $i$ and each bit $j$:

$$\begin{cases} \text{if } b_{i j} = 1, & \text{the loss penalizes a low predicted probability,} \\ \text{if } b_{i j} = 0, & \text{the loss penalizes a high predicted probability.} \end{cases} \tag{45}$$

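Minimizing this objective encourages the network to produce outputs close to the extracted hash codes. Since $\sigma'(z) = \sigma(z) (1 - \sigma(z))$, the gradient of the loss with respect to the network output for image $x_{i}$ is

$$\frac{\partial L_{2}}{\partial F(x_{i}; \Theta)} = (1 - b_{i})\, \sigma(F(x_{i}; \Theta)) - b_{i} \left( 1 - \sigma(F(x_{i}; \Theta)) \right) = \sigma(F(x_{i}; \Theta)) - b_{i}. \tag{46}$$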

This gradient reflects the difference between predicted probabilities and ground-truth binary labels. It drives the optimization process during backpropagation using stochastic gradient descent.
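For the sigmoid cross-entropy loss in Equation (44), the derivative with respect to each pre-activation reduces to the predicted probability minus the target bit. The following NumPy sketch checks this numerically with central finite differences; the helper names and the random toy data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(F, B):
    """Multi-binary cross-entropy of Equation (44), summed over images and bits."""
    P = sigmoid(F)
    return -np.sum(B * np.log(P) + (1 - B) * np.log(1 - P))

def bce_grad(F, B):
    """Analytic gradient with respect to the network outputs: sigma(F) - B."""
    return sigmoid(F) - B

# Toy data (hypothetical): 4 images, 8-bit target codes.
rng = np.random.default_rng(0)
F = rng.standard_normal((4, 8))
B = rng.integers(0, 2, size=(4, 8)).astype(float)

# Central finite differences, one output entry at a time.
eps = 1e-6
numeric = np.zeros_like(F)
for idx in np.ndindex(F.shape):
    F_plus, F_minus = F.copy(), F.copy()
    F_plus[idx] += eps
    F_minus[idx] -= eps
    numeric[idx] = (bce_loss(F_plus, B) - bce_loss(F_minus, B)) / (2 * eps)

max_err = np.max(np.abs(numeric - bce_grad(F, B)))
print(max_err)
```

In practice, deep learning frameworks provide numerically stable fused versions of this loss that operate directly on the logits.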

After training, the final binary hash code for a new image is obtained by thresholding the sigmoid outputs at 0.5:

$$h(x_{i}; \Theta) = I\left( \sigma(F(x_{i}; \Theta)) \geq 0.5 \right), \tag{47}$$

where $I(\cdot)$ is the indicator function. This step converts probabilistic outputs into discrete binary values, yielding the final hash code. We further demonstrate that the learned representation can be projected into a hash code space, where the resulting codes are well aligned with the hash codes of labeled images. Algorithm 2 illustrates the procedure used to transform the training images into their corresponding hash codes.

Consequently, the CNN learns a direct mapping from image space to Hamming space, enabling efficient hash code generation for unseen query images during retrieval.
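Because $\sigma(z) \geq 0.5$ exactly when $z \geq 0$, the thresholding in Equation (47) can be applied to the raw network outputs without evaluating the sigmoid. A minimal sketch, with a hypothetical function name and toy logits:

```python
import numpy as np

def hash_from_logits(F: np.ndarray) -> np.ndarray:
    """Binary code from network outputs: bit = 1 iff sigma(F) >= 0.5, i.e. F >= 0."""
    return (F >= 0).astype(np.uint8)

logits = np.array([[2.3, -0.7, 0.0, -4.1]])  # toy outputs for one image, k = 4
codes = hash_from_logits(logits)
print(codes)
```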

Algorithm 2 Hash Code Mapping via Multi-Binary Classification

1: Input: Extracted codes $B$, epochs $t_{2}$
2: for $t = 1$ to $t_{2}$ do
3:     for $i = 1$ to $n$ do
4:         Forward $x_{i}$
5:         Compute loss (Equation (44))
6:         Backpropagate and update $\Theta$
7:     end for
8: end for
9: Output: Hash function and binary code database

3.3.2. Regeneration of Binary Hash Codes by Retaining Valuable Bits

Although the CNN-based hashing stage produces compact binary codes, not all bits contribute equally to retrieval performance. Some bits may be redundant, weakly discriminative, or noisy. Therefore, instead of treating all hash dimensions uniformly, we introduce a regeneration mechanism that selectively retains informative bits while discarding less valuable ones.

This refinement process is formulated as a sequential decision-making problem and optimized using deep reinforcement learning. For that purpose, let

$$C = \{c_{i}\}_{i = 1}^{n} \in \{0, 1\}^{n \times k} \tag{48}$$

denote the binary codes generated by the CNN mapping stage. While these codes preserve semantic similarity globally, they are learned through supervised optimization that treats all bit positions symmetrically. However, the contribution of each bit to retrieval accuracy (e.g., mAP) may differ significantly.

The objective of this stage is to learn a binary mask

$$m \in \{0, 1\}^{k} \tag{49}$$

that selects valuable bits and suppresses redundant ones. The refined hash code becomes

$$c_{i}^{*} = c_{i} \odot m \tag{50}$$

where $\odot$ denotes element-wise multiplication.
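The masking in Equation (50) can be illustrated on toy codes. Masked positions are zero in every code, so they never contribute to a Hamming distance; the sketch also shows that dropping those columns altogether gives the same pairwise distances. The codes, the mask, and the `hamming` helper are illustrative assumptions.

```python
import numpy as np

# Toy codes (hypothetical): 3 images, k = 4 bits.
C = np.array([[1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=np.uint8)
m = np.array([1, 0, 1, 0], dtype=np.uint8)  # keep bits 0 and 2

C_star = C * m            # Equation (50), broadcast over all rows
C_compact = C[:, m == 1]  # same information with the discarded columns removed

def hamming(a, b) -> int:
    """Number of positions where two binary codes differ."""
    return int(np.count_nonzero(a != b))

print(C_star)
print(hamming(C_star[0], C_star[1]), hamming(C_compact[0], C_compact[1]))
```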

We model the bit selection process as an MDP defined by the tuple $(S, A, T, R, \gamma)$.

State. At step $t$, the state $s_{t}$ represents the current selection configuration of bits. It encodes which bits have been retained and which remain candidates:

$$s_{t} \in \{0, 1\}^{k} \tag{51}$$

Action. At each step, the agent decides whether to retain or discard a specific bit:

$$a_{t} \in \{0, 1\} \tag{52}$$

Here, $a_{t} = 1$ means keeping the bit, and $a_{t} = 0$ means discarding it.

Transition. The state transitions deterministically according to

$$s_{t + 1} = \mathrm{Update}(s_{t}, a_{t}, t) \tag{53}$$

At each step $t$, the state $s_{t} \in \{0, 1\}^{k}$ represents the current binary selection mask over hash bits. The action $a_{t} \in \{0, 1\}$ determines whether the $t$-th bit is retained ($a_{t} = 1$) or discarded ($a_{t} = 0$).

The state is updated as:

$$s_{t + 1}^{(i)} = \begin{cases} a_{t}, & \text{if } i = t, \\ s_{t}^{(i)}, & \text{otherwise.} \end{cases} \tag{54}$$

That is, the action $a_{t}$ updates only the $t$-th component of the state vector, while the remaining components remain unchanged.
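The transition rule in Equations (53) and (54) can be sketched as a short rollout over the $k$ bit positions. The all-zero initial state, the zero-based step index, and the toy `keep_even` policy are assumptions made for illustration; the paper's policy is a learned network.

```python
from typing import Callable, List

def update(state: List[int], action: int, t: int) -> List[int]:
    """Equation (54): overwrite component t with the action, copy the rest."""
    next_state = list(state)
    next_state[t] = action
    return next_state

def rollout(k: int, policy: Callable[[List[int], int], int]) -> List[int]:
    """Visit each bit position once and return the final selection mask."""
    state = [0] * k  # assumption: no bit is selected before the episode starts
    for t in range(k):
        action = policy(state, t)  # 1 = retain bit t, 0 = discard it
        state = update(state, action, t)
    return state

# Hypothetical stand-in policy: keep the even-indexed bits.
def keep_even(state: List[int], t: int) -> int:
    return 1 if t % 2 == 0 else 0

mask = rollout(6, keep_even)
print(mask)
```

After $k$ steps the final state is itself the mask $m$, which is why one episode has exactly as many decisions as there are hash bits.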

Reward. After selecting a subset of bits, retrieval performance is evaluated using mAP. The reward is defined as

$$R = \mathrm{mAP}(C \odot m) \tag{55}$$

This reward directly reflects the quality of the selected bit subset. The goal of reinforcement learning is to maximize expected cumulative reward:

$$J(\theta) = \mathbb{E}_{\pi_{\theta}} [R] \tag{56}$$

where $\pi_{\theta}$ denotes the policy parameterized by $\theta$.

Unlike supervised hashing, this objective directly optimizes retrieval metrics rather than surrogate similarity losses. We adopt PPO to update the policy network. The PPO objective is defined as

$$L^{\mathrm{PPO}}(\theta) = \mathbb{E}\left[ \min\left( r_{t}(\theta) A_{t},\ \mathrm{clip}\left( r_{t}(\theta), 1 - \epsilon, 1 + \epsilon \right) A_{t} \right) \right], \tag{57}$$

where

$$r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t} \mid s_{t})}, \tag{58}$$

and $A_{t}$ denotes the advantage estimate.

The clipping mechanism stabilizes training by preventing large policy updates.
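The clipped surrogate in Equations (57) and (58) can be evaluated on a small batch as follows. This is a NumPy sketch of the objective value only, with no policy network or gradient step; the function name and the toy probabilities and advantages are assumptions.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate of Equation (57); larger is better for the policy."""
    ratio = np.exp(logp_new - logp_old)  # r_t(theta), Equation (58)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

# Toy batch (hypothetical): probability ratios 1.5, 0.5 and 1.0.
logp_new = np.log(np.array([0.6, 0.2, 0.5]))
logp_old = np.log(np.array([0.4, 0.4, 0.5]))
advantages = np.array([1.0, -1.0, 2.0])
value = ppo_clip_objective(logp_new, logp_old, advantages)
print(value)
```

With $\epsilon = 0.2$, the first term is capped at $1.2 \times 1.0$, the second is evaluated at the clipped ratio as $0.8 \times (-1.0)$, and the third is unchanged at $2.0$, so the mean is $0.8$. In both clipped cases the policy gains nothing from moving the ratio further from 1.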

After convergence, the learned policy produces an optimal bit-selection mask $m^{*}$. The regenerated hash codes are

$$C^{*} = C \odot m^{*} \tag{59}$$

This regeneration step removes redundant dimensions and enhances discriminative power while preserving compactness.

Advantages of Bit Regeneration

The reinforcement-driven regeneration mechanism provides several key advantages over conventional deep hashing strategies. First, it formulates hash bit selection as an explicit discrete optimization problem rather than relying on implicit weighting through continuous surrogate losses. Second, the optimization objective is directly aligned with retrieval performance metrics, such as mAP, instead of proxy similarity reconstruction losses, thereby reducing the gap between training and evaluation criteria. Third, the regeneration process operates on the learned binary codes without requiring retraining of the underlying CNN, making it computationally efficient and modular. Finally, by selectively retaining informative bits and discarding redundant or weakly discriminative ones, the approach improves representation compactness, robustness, and generalization in Hamming space. Consequently, the regeneration stage complements supervised deep hashing by introducing adaptive, reward-driven refinement of binary representations.

CNN Freezing Strategy

During the reinforcement learning stage, the CNN backbone is kept fixed to ensure stable semantic representations and to avoid the challenges associated with backpropagation through discrete decision processes. This design decouples feature learning from bit selection, allowing the RL agent to operate directly in the Hamming space. Joint optimization of the CNN and RL policy could potentially improve performance but would introduce significant training complexity due to the non-differentiability of the bit selection process. This remains an interesting direction for future work.

4. Experimental Results

This section evaluates the proposed reinforcement-driven adaptive hashing framework on standard large-scale image retrieval benchmarks. We compare our method against representative shallow and deep hashing approaches and provide quantitative and qualitative analyses to validate the effectiveness of adaptive bit regeneration. All experiments are conducted using fixed training configurations and dataset splits to ensure reproducibility. The proposed framework exhibits largely deterministic behavior, as the CNN training follows standard supervised learning procedures and the reinforcement learning stage operates on fixed binary hash codes with a deterministic reward based on retrieval performance.

4.1. Setup and Datasets

To evaluate the proposed reinforcement-driven adaptive hashing framework, experiments are conducted on three widely used benchmark datasets: CIFAR-10 [23] (https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 12 April 2026)), NUS-WIDE [24] (https://github.com/NExTplusplus/NUS-WIDE (accessed on 12 April 2026)), and MS-COCO [25] (https://cocodataset.org/ (accessed on 12 April 2026)). These datasets are commonly adopted in image retrieval research and provide diverse visual and semantic characteristics for comprehensive evaluation.

CIFAR-10 consists of 60,000 color images of size $32 \times 32$, evenly distributed across 10 object categories. Following standard experimental settings, we randomly select 1000 images (100 per category) as the query set. A total of 5000 images (500 per category) are used for training, and the remaining images form the retrieval database.

NUS-WIDE contains 269,648 web images collected from Flickr and annotated with 81 semantic concepts. To ensure balanced evaluation, we retain the 21 most frequent concepts. From these, 2100 images (100 per concept) are randomly selected as queries. The remaining images constitute the retrieval database, from which 10,500 images are randomly sampled for training.

MS-COCO (Microsoft Common Objects in Context) is a large-scale multi-label dataset containing 122,218 annotated images with up to 80 semantic categories. We randomly select 5000 labeled images as queries and treat the remaining images as the retrieval database. Additionally, 10,000 images are randomly sampled from the database to construct the training set.

Following conventional evaluation protocols in supervised hashing, two images $i$ and $j$ are considered semantically similar if they share at least one common label. Formally, the similarity indicator is defined as:

$$s_{i j} = \begin{cases} 1, & \text{if } y_{i}^{\top} y_{j} > 0, \\ 0, & \text{otherwise,} \end{cases} \tag{60}$$

where $y_{i}$ and $y_{j}$ denote the semantic label vectors of images $i$ and $j$, respectively.

4.2. Hyperparameter Selection and Justification

The PPO hyperparameters used in this work are summarized in Table 1. These values are selected following standard practices in reinforcement learning to ensure stable and efficient policy optimization. The discount factor $\gamma = 0.99$ encourages the agent to consider long-term rewards, which is essential for sequential bit selection where each decision impacts the overall retrieval performance. The clipping parameter $\epsilon = 0.2$ is chosen to constrain policy updates and prevent instability due to large policy shifts, as commonly adopted in PPO-based methods. The learning rate is set to $1 \times 10^{-4}$ to balance convergence speed and training stability.

Regarding the training setup, using the full dataset within each episode allows the agent to evaluate the global impact of bit selection decisions with respect to the retrieval objective, thereby reducing the variance of the reward signal. In addition, the optimization process is structured such that each decision step corresponds to a specific bit position in the hash code, enabling the policy to learn bit-wise selection strategies while accounting for their collective contribution to the overall retrieval performance. The constrained discrete action space further reduces the risk of overfitting and promotes stable policy learning.

4.3. Compared Methods

To comprehensively assess the effectiveness of the proposed reinforcement-driven adaptive hashing framework, we compare it with several representative shallow, deep, and reinforcement learning–based hashing approaches. Specifically, we include SDH (Supervised Discrete Hashing), a classical supervised method that jointly learns discrete binary codes and linear classifiers; CNNH (Convolutional Neural Network Hashing) [26], one of the earliest frameworks integrating CNN-based feature learning with similarity-preserving hashing in a two-stage pipeline; DNNH (Deep Neural Network Hashing) [27], which extends CNNH by improving feature representation and quantization mechanisms; HashNet [9], an end-to-end deep supervised hashing model that introduces a continuation strategy to alleviate the non-differentiability of the sign function and enhance convergence stability; HashGAN [28], which incorporates adversarial learning to improve the compactness and discriminability of binary codes; and DRLIH [29], a deep reinforcement learning-based hashing approach that models hash bit generation as a sequential decision process optimized via reward signals. These baselines collectively represent major methodological categories in supervised hashing research, including discrete optimization, deep representation learning, adversarial training, and reinforcement learning. For fair comparison, all methods are evaluated under identical dataset splits and evaluation protocols, and when available, official implementations are used; otherwise, models are re-implemented according to the configurations described in the original publications.

4.4. Evaluation Metrics

Retrieval performance is evaluated using standard ranking-based metrics widely adopted in image hashing research, including Average Precision (AP), mAP, Precision–Recall (PR) curves, precision within a fixed Hamming radius, and Top-N precision. For a given query image, AP measures the quality of the ranked retrieval list by considering both precision and recall across different cutoff positions. Formally, AP is defined as

$$\mathrm{AP} = \frac{1}{R} \sum_{k = 1}^{N} \frac{R_{k}}{k} \cdot \mathrm{rel}_{k}, \tag{61}$$

where $R$ denotes the total number of relevant images in the database for the query, $R_{k}$ represents the number of relevant images among the top-$k$ retrieved samples, $\mathrm{rel}_{k} \in \{0, 1\}$ indicates whether the $k$-th retrieved image is relevant, and $N$ is the total number of retrieved samples. Here $k$ is a rank position, not the hash length. mAP is then computed as the average AP over all query images:

$$\mathrm{mAP} = \frac{1}{Q} \sum_{i = 1}^{Q} \mathrm{AP}_{i}, \tag{62}$$

where $Q$ is the total number of queries and $\mathrm{AP}_{i}$ is the average precision of the $i$-th query. mAP is adopted as the primary evaluation metric because it reflects global ranking quality across the entire retrieval list. In addition, PR curves are used to visualize retrieval behavior across different recall levels, precision within Hamming radius 2 (P@H ≤ 2) is reported to evaluate performance under strict Hamming distance constraints relevant to fast retrieval scenarios, and Top-N precision (P@N) is measured to assess accuracy among the first N retrieved results. Together, these metrics provide a comprehensive evaluation of both ranking effectiveness and compact binary representation quality.
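Equations (61) and (62) can be sketched with Hamming ranking as follows. The sketch assumes that the whole database is ranked, so that $R$ equals the number of relevant entries in the ranked list, that a query with no relevant item scores 0, and that ties in distance are broken by database order; the function names and toy codes are illustrative.

```python
import numpy as np

def average_precision(relevance) -> float:
    """Equation (61) for one ranked list; relevance[k-1] is rel_k in {0, 1}."""
    rel = np.asarray(relevance, dtype=float)
    R = rel.sum()
    if R == 0:
        return 0.0  # assumption: no relevant item means AP = 0
    hits = np.cumsum(rel)               # R_k: relevant items in the top k
    ranks = np.arange(1, len(rel) + 1)  # k
    return float(np.sum(hits / ranks * rel) / R)

def mean_average_precision(query_codes, db_codes, relevant) -> float:
    """Equation (62): rank the database by Hamming distance for every query."""
    aps = []
    for q, rel_row in zip(query_codes, relevant):
        dist = np.count_nonzero(db_codes != q, axis=1)
        order = np.argsort(dist, kind="stable")  # ties keep database order
        aps.append(average_precision(rel_row[order]))
    return float(np.mean(aps))

# Toy data (hypothetical): 4 database codes, 2 queries, ground-truth relevance.
db = np.array([[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 1, 0]])
queries = np.array([[0, 0, 0, 0], [1, 1, 1, 1]])
relevant = np.array([[1, 0, 1, 0], [1, 1, 0, 0]])

ap_example = average_precision([1, 0, 1, 0])
map_value = mean_average_precision(queries, db, relevant)
print(ap_example, map_value)
```

For the ranked list `[1, 0, 1, 0]`, the precision at the two relevant positions is $1$ and $2/3$, giving $\mathrm{AP} = 5/6$. The same routine, applied to masked codes, is one way a reward of the form $\mathrm{mAP}(C \odot m)$ could be evaluated.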

4.5. Implementation Details

All experiments are implemented in PyTorch 2.0 and conducted on a single NVIDIA GPU. For the CNN-based hash learning stage, AlexNet [30] is adopted as the backbone network. The convolutional layers are initialized with ImageNet-pretrained weights and fine-tuned during training, while the final fully connected layer responsible for hash prediction is initialized randomly. A smaller learning rate of $10^{-5}$ is used for pretrained layers to preserve learned visual representations, whereas the final hashing layer is trained with a learning rate of $10^{-4}$. The mini-batch size is set to 32, and the Adam optimizer is employed for parameter updates. For similarity reconstruction in the block-wise optimization stage, Adam is also used with a learning rate of 1.0 to accelerate convergence. In the reinforcement learning stage, PPO is adopted to learn the adaptive bit-selection policy. The discount factor is set to $\gamma = 0.99$, and all data are processed within a single episode to stabilize reward estimation. The batch size of the policy network equals the target hash length $k$. During inference, the sigmoid outputs of the CNN are thresholded at 0.5 to produce binary hash codes. For fair comparison with DRLIH, experiments are additionally conducted using VGG-19 [31] as the backbone under identical training and evaluation settings, following the configuration described in the original DRLIH implementation.

4.6. Main Quantitative Results

Table 2 reports the mAP results of all compared methods under different hash lengths using AlexNet as the backbone network.

As shown in Table 2, the proposed method achieves competitive mAP across all datasets and hash lengths, although it is outperformed by certain recent methods in specific settings. In particular, DCDH [16] achieves higher scores on CIFAR-10 at 32 bits (0.9167 vs. 0.790) and 48 bits (0.9142 vs. 0.788), benefiting from a more powerful end-to-end optimization strategy. Similarly, CTMIH [17] reports superior performance on MS-COCO at 64 bits (0.846 vs. 0.776), likely due to its enhanced feature representation capabilities.

It is important to emphasize that the proposed method is designed to be deep learning architecture-agnostic and operates as a refinement module on top of learned hash codes. In this work, AlexNet is intentionally adopted as the backbone to provide a controlled and widely used evaluation setting, allowing us to isolate the contribution of the reinforcement-driven adaptive bit selection mechanism. Consequently, the performance gains observed are primarily due to improved hash code quality rather than enhanced feature extraction.

Compared to classical and earlier deep hashing methods such as SDH [32], CNNH [26], DNNH [27], HashNet [9], and HashGAN [28], the proposed approach consistently achieves comparable or superior performance, particularly on NUS-WIDE and MS-COCO, demonstrating its robustness in multi-label and large-scale retrieval scenarios. Moreover, the improvements are more pronounced for shorter hash lengths (16–48 bits), indicating that the proposed method effectively removes redundant or weakly informative bits and produces more compact and discriminative representations.

Since the proposed framework is independent of the underlying feature extractor, it can be seamlessly integrated with more powerful architectures such as ResNet or Vision Transformers. Therefore, the current results obtained with AlexNet can be considered a conservative estimate of the method’s full potential, and further performance improvements are expected when combined with modern deep architectures.

4.7. Precision–Recall Curves

Figure 1 presents the Precision–Recall (PR) curves at 64-bit hash length on CIFAR-10, NUS-WIDE, and MS-COCO. Across all datasets, the proposed method consistently achieves higher precision throughout the entire recall range compared with competing approaches.

The superiority is particularly evident at high recall levels, where maintaining precision becomes increasingly challenging. This indicates that the regenerated hash codes preserve global semantic similarity more effectively, resulting in improved ranking stability. In contrast, baseline methods exhibit sharper precision degradation as recall increases, suggesting that redundant or weakly discriminative bits negatively impact ranking consistency. These results demonstrate that reinforcement-driven adaptive bit selection enhances the structural coherence of the learned Hamming space, leading to improved retrieval quality across varying recall thresholds.

4.8. Precision Within Hamming Radius 2

Figure 2 reports the precision within Hamming radius 2 (P@H ≤ 2) for different hash lengths. This metric is particularly important in real-time retrieval scenarios, as Hamming radius search enables constant-time lookup. The proposed method achieves the highest precision under this strict constraint across all datasets. The improvement is more pronounced for shorter hash lengths (16 and 32 bits), where eliminating redundant dimensions significantly enhances local neighborhood consistency. The results confirm that adaptive bit regeneration produces more compact and semantically consistent binary codes, improving retrieval performance within tight Hamming neighborhoods.
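P@H ≤ 2 for a single query can be sketched as the fraction of relevant items among the database codes that fall inside the Hamming ball of radius 2. Scoring an empty ball as zero precision is an assumption of this sketch, as are the function name and toy data.

```python
import numpy as np

def precision_within_radius(query, db_codes, relevant, radius=2) -> float:
    """Fraction of relevant items among codes at Hamming distance <= radius."""
    dist = np.count_nonzero(db_codes != query, axis=1)
    inside = dist <= radius
    if not inside.any():
        return 0.0  # assumption: an empty Hamming ball counts as zero precision
    return float(relevant[inside].mean())

# Toy data (hypothetical): distances to the query are 0, 2, 4 and 1.
db = np.array([[0, 0, 0, 0], [0, 0, 1, 1], [1, 1, 1, 1], [0, 1, 0, 0]])
query = np.array([0, 0, 0, 0])
relevant = np.array([1, 0, 1, 1])

p_h2 = precision_within_radius(query, db, relevant)
print(p_h2)
```

Three codes lie within radius 2 and two of them are relevant, so the precision is $2/3$. The dataset-level figure is the average of this quantity over all queries.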

4.9. Top-N Retrieval Performance

Figure 3 illustrates the Top-N precision (P@N) curves on CIFAR-10, NUS-WIDE, and MS-COCO. The proposed method consistently maintains higher precision as the number of retrieved samples increases. In fact, the performance gap remains stable even for larger N, indicating that the regenerated binary codes preserve ranking quality beyond the top few results. This stability reflects improved global similarity alignment and reduced bit redundancy. Overall, the P@N analysis further validates the robustness and discriminative capability of the reinforcement-refined hash representations.

4.10. Comparison with DRLIH

To further assess the effectiveness of reinforcement learning in hashing, we compare our method with DRLIH under identical experimental settings using VGG-19 as the backbone network.

As shown in Table 3, the proposed method consistently outperforms DRLIH across all code lengths on both datasets. The average improvements reach approximately 3.9% on CIFAR-10 and 2.7% on NUS-WIDE. Unlike DRLIH, which relaxes the binary constraint during policy optimization, our framework preserves strict binary representations and performs discrete bit regeneration directly in Hamming space. This reduces the train–test discrepancy and leads to more stable retrieval performance.

4.11. Qualitative Results and Discussion

Figure 4 presents representative top-5 retrieval results for selected query images. The retrieved images are highlighted using color-coded borders to facilitate interpretation: images with green borders denote correctly retrieved samples that share at least one semantic label with the query image, while images with red borders indicate incorrect retrievals that do not match the query semantics.

As can be observed, most retrieved samples exhibit strong semantic consistency with the queries, demonstrating that the proposed method retrieves a high proportion of relevant images among the top-ranked results. The relatively small number of incorrect retrievals highlights the robustness and discriminative capability of the learned hash representations. The dominance of green-bordered images across different datasets further indicates that the proposed method effectively preserves semantic similarity in the Hamming space, even under compact hash representations.

Failure cases mainly arise from noisy or incomplete annotations in NUS-WIDE, low-resolution images in CIFAR-10, or complex scene ambiguity in MS-COCO. Nevertheless, the overall retrieval quality remains robust across diverse visual scenarios.

From a broader perspective, the experimental results demonstrate three key advantages of the proposed framework: (1) block-wise similarity preservation improves global structural alignment in binary space; (2) reinforcement-driven bit regeneration enhances the utilization of hash bits by reducing redundancy; and (3) direct optimization of retrieval metrics reduces the discrepancy between training objectives and evaluation criteria. Collectively, these findings validate the effectiveness and robustness of adaptive reinforcement-based hash refinement for large-scale image retrieval.

4.12. Stability of the Reinforcement Learning Component

Although reinforcement learning algorithms are generally sensitive to hyperparameters, the proposed formulation operates in a constrained and low-dimensional setting. The RL agent acts on fixed binary hash codes rather than raw image features, which simplifies the state space and reduces variability. We adopted PPO due to its stable optimization behavior, enforced through a clipped objective that prevents large policy updates. Furthermore, the reward is directly based on retrieval performance (mAP), providing a deterministic and task-aligned signal. The consistent improvements observed across datasets and hash lengths indicate that the learned policy is stable and not overly dependent on initialization conditions.

4.13. Backbone Selection and Generalization

In this work, AlexNet is adopted as the backbone network to ensure fair comparison with existing deep hashing methods and to isolate the contribution of the proposed reinforcement-driven bit selection mechanism. While more recent architectures such as ResNet or Vision Transformers provide stronger feature representations, the objective of this study is to improve hash code quality rather than feature extraction.

Since the proposed method operates on binary hash codes, it is independent of the underlying feature extractor and can be directly applied to more advanced architectures. Therefore, the results obtained with AlexNet represent a conservative estimate of performance, and further improvements can be expected when integrating the proposed framework with modern deep networks.

4.14. Computational Complexity and Efficiency

The proposed framework involves a two-stage training process consisting of CNN-based hash learning followed by reinforcement-driven adaptive bit selection. The CNN training phase follows standard deep hashing procedures and represents the dominant computational cost, which is consistent with existing deep hashing approaches. The reinforcement learning stage introduces additional overhead during training due to policy optimization; however, it operates on binary hash codes and exhibits linear complexity with respect to the hash length, making it significantly more efficient than feature learning.

As shown in Table 4, both CNN and RL training stages are performed offline. The RL-based bit selection is executed only once during training and does not introduce any additional overhead during inference. During retrieval, images are encoded into compact binary codes and compared using Hamming distance, resulting in $O(k)$ complexity for similarity computation, identical to conventional hashing methods.
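The cost of comparing two $k$-bit codes can be illustrated with a packed-integer representation, where the Hamming distance is the number of set bits in the XOR of the two codes. This is a pure-Python sketch with hypothetical helper names; production systems typically rely on hardware popcount instructions over packed machine words.

```python
from typing import Sequence

def pack_bits(bits: Sequence[int]) -> int:
    """Pack a sequence of 0/1 values into one integer, first bit most significant."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def hamming_distance(a: int, b: int) -> int:
    """Count the differing bit positions via XOR followed by a popcount."""
    return bin(a ^ b).count("1")

q = pack_bits([1, 0, 1, 1, 0, 0, 1, 0])  # toy 8-bit query code
d = pack_bits([1, 1, 1, 0, 0, 0, 1, 1])  # toy 8-bit database code
print(hamming_distance(q, d))
```

The two toy codes differ at positions 1, 3, and 7, so the distance is 3.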

Therefore, the proposed method improves retrieval performance while preserving efficient inference, at the cost of additional but offline training computation, ensuring scalability and practical applicability in large-scale image retrieval systems. In Table 4, $N$ denotes the number of training samples, $d$ the feature dimension, and $k$ the hash code length.

5. Conclusions and Future Works

This paper presented a deep reinforcement learning-based adaptive bit selection framework for compact image hashing in large-scale retrieval systems. In contrast to conventional deep hashing approaches that learn fixed-length binary representations through surrogate similarity losses, the proposed method formulates hash refinement as a discrete optimization problem within a Markov Decision Process framework. By integrating CNN-based semantic feature extraction with reinforcement-driven bit regeneration, the model directly optimizes retrieval performance in Hamming space while preserving strict binary constraints. Extensive experiments on CIFAR-10, NUS-WIDE, and MS-COCO demonstrate consistent improvements over representative shallow, deep, adversarial, and reinforcement-based hashing methods. The results confirm that adaptive bit regeneration effectively improves the utilization of hash bits by reducing redundancy and enhancing discriminative capacity, leading to improved compactness, ranking stability, and robustness—particularly for shorter hash lengths and under strict Hamming radius constraints.

While the proposed method introduces an additional reinforcement learning stage during training, this step is performed offline and does not affect inference efficiency. The retrieval process relies on compact binary codes and Hamming distance computation, resulting in the same query-time complexity as standard hashing methods. Therefore, the method improves retrieval performance while preserving efficient inference, although the overall training process may incur additional computational cost due to the RL-based refinement stage.

Although AlexNet and VGG-19 were employed for fair comparison with prior work, the proposed reinforcement-driven refinement mechanism is architecture-agnostic and can be seamlessly extended to modern Transformer-based visual encoders. Vision Transformers and hybrid CNN–Transformer architectures provide stronger global context modeling capabilities, which could further enhance the semantic consistency and discriminative power of learned binary embeddings. Integrating adaptive bit selection with Transformer backbones is therefore expected to yield additional performance gains.

Overall, the proposed framework establishes a principled bridge between supervised deep hashing and reinforcement learning, offering a scalable and extensible paradigm for adaptive binary representation learning in intelligent imaging systems.

Author Contributions

Conceptualization, M.R., M.A.A.M. and M.A.; methodology, M.R., M.A.A.M. and M.A.; software, M.R.; validation, M.R., M.A.A.M. and M.A.; formal analysis, M.R. and M.A.A.M.; investigation, M.R.; resources, M.A.A.M. and M.A.; data curation, M.R.; writing—original draft preparation, M.R.; writing—review and editing, M.A.A.M. and M.A.; visualization, M.R.; supervision, M.A.A.M. and M.A.; project administration, M.A.A.M.; funding acquisition, M.A.A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study are publicly available and have been widely adopted in prior research for benchmarking large-scale image retrieval and deep hashing methods. Specifically, CIFAR-10, NUS-WIDE, and MS-COCO were used to evaluate the proposed deep reinforcement learning-based adaptive bit selection framework. The datasets can be accessed through their official repositories (all accessed on 12 April 2026): CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html; NUS-WIDE: https://github.com/NExTplusplus/NUS-WIDE; MS-COCO: https://cocodataset.org/. No new data were created or collected for this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Datta, R.; Joshi, D.; Li, J.; Wang, J.Z. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv. 2008, 40, 5.
2. Jégou, H.; Douze, M.; Schmid, C. Product Quantization for Nearest Neighbor Search. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 117–128.
3. Wang, J.; Zhang, T.; Song, J.; Sebe, N.; Shen, H.T. A Survey on Learning to Hash. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 769–790.
4. Luo, X.; Wang, H.; Wu, D.; Chen, C.; Deng, M.; Huang, J.; Hua, X.-S. A survey on deep hashing methods. ACM Trans. Knowl. Discov. Data 2023, 17, 15.
5. Liu, W.; Wang, J.; Ji, R.; Jiang, Y.-G.; Chang, S.-F. Supervised hashing with kernels. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 2074–2081.
6. Gionis, A.; Indyk, P.; Motwani, R. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), Edinburgh, Scotland, UK, 7–10 September 1999; pp. 518–529.
7. Kulis, B.; Darrell, T. Learning to hash with binary reconstructive embeddings. In Advances in Neural Information Processing Systems 22 (NeurIPS 2009), Vancouver, BC, Canada, 7–12 December 2009; Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., Culotta, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2009.
8. Norouzi, M.; Fleet, D.J. Minimal loss hashing for compact binary codes. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), Bellevue, WA, USA, 28 June–2 July 2011; pp. 353–360.
9. Cao, Z.; Long, M.; Wang, J.; Yu, P.S. HashNet: Deep learning to hash by continuation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; pp. 5609–5618.
10. Li, Q.; Sun, Z.; He, R.; Tan, T. Deep supervised discrete hashing. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017.
11. Li, T.; Zhang, Z.; Pei, L.; Gan, Y. HashFormer: Vision transformer based deep hashing for image retrieval. IEEE Signal Process. Lett. 2022, 29, 827–831.
12. Wang, L.; Pan, Y.; Liu, C.; Lai, H.; Yin, J.; Liu, Y. Deep hashing with minimal-distance-separated hash centers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada, 18–22 June 2023; pp. 23455–23464.
13. Chen, Z.; Yuan, X.; Lu, J.; Tian, Q.; Zhou, J. Deep hashing via discrepancy minimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6838–6847.
14. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
15. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
16. Zhang, Z.; Wang, J.; Zhu, L.; Luo, Y.; Lu, G. Deep collaborative graph hashing for discriminative image retrieval. Pattern Recognit. 2023, 139, 109462.
17. Shen, X.; Cai, H.; Gong, X.; Zheng, Y. Contrastive transformer masked image hashing for degraded image retrieval. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024), Jeju, Republic of Korea, 3–9 August 2024; pp. 1218–1226.
18. Chen, Y.; Zhang, S.; Liu, F.; Chang, Z.; Ye, M.; Qi, Z. TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval. In Proceedings of the 2022 International Conference on Multimedia Retrieval (ICMR); Association for Computing Machinery: New York, NY, USA, 2022; pp. 127–136.
19. Dubey, A.; Dubey, S.R.; Singh, S.K.; Chu, W.-T. Transformer-based Clipped Contrastive Quantization Learning for Unsupervised Image Retrieval. arXiv 2024, arXiv:2401.15362.
20. Chen, Y.; Lu, Z.; Zheng, Y.; Li, P.; Luo, W.; Kang, S. Deep hashing with mutual information: A comprehensive strategy for image retrieval. Expert Syst. Appl. 2025, 264, 125880.
21. Yao, D.; Li, Z.; Li, B.; Zhang, C.; Ma, H. Similarity Graph-correlation Reconstruction Network for unsupervised cross-modal hashing. Expert Syst. Appl. 2024, 237, 121516.
22. Xu, Y.; Yang, Z.; Ting, K.M. Contrastive Multi-View Graph Hashing. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management; Association for Computing Machinery: New York, NY, USA, 2025; pp. 3666–3676.
23. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
24. Chua, T.-S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; Zheng, Y. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR 2009), Santorini, Greece, 8–10 July 2009; Association for Computing Machinery: New York, NY, USA, 2009; p. 48.
25. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2014; Volume 8693, pp. 740–755.
26. Xia, R.; Pan, Y.; Lai, H.; Liu, C.; Yan, S. Supervised hashing for image retrieval via image representation learning. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2014), Québec City, QC, Canada, 27–31 July 2014; AAAI Press: Palo Alto, CA, USA, 2014; pp. 2156–2162.
27. Lai, H.; Pan, Y.; Liu, Y.; Yan, S. Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015; pp. 3270–3278.
28. Cao, Y.; Liu, B.; Long, M.; Wang, J. HashGAN: Deep learning to hash with pair conditional Wasserstein GAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–23 June 2018; pp. 1287–1296.
29. Peng, Y.; Zhang, J.; Ye, Z. Deep reinforcement learning for image hashing. IEEE Trans. Multimed. 2020, 22, 2061–2073.
30. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NeurIPS 2012), Lake Tahoe, NV, USA, 3–6 December 2012; Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; pp. 1097–1105.
31. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015.
32. Shen, F.; Shen, C.; Liu, W.; Shen, H.T. Supervised discrete hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015; pp. 37–45.

Figure 1. Precision–recall curves at 64 bits for the proposed approach and the compared methods on CIFAR-10, NUS-WIDE, and MS-COCO.

Figure 2. Precision within Hamming radius 2 for the proposed approach and the compared methods on CIFAR-10, NUS-WIDE, and MS-COCO.

Figure 3. Precision curves for top-N retrieval at 64 bits for the proposed approach and the compared methods on CIFAR-10, NUS-WIDE, and MS-COCO.

Figure 4. Examples of successful and failed top-5 retrieval results on the three benchmark datasets.

Table 1. PPO hyperparameter configuration used for adaptive bit selection.

Component	Hyperparameter	Value
RL Core	Discount factor $\gamma$	$0.99$
RL Core	Clipping parameter $\epsilon$	$0.2$
RL Core	Learning rate	$1 \times 10^{-4}$
Training Setup	Batch size	$k$ (hash length)
Training Setup	Episode structure	Full dataset per episode
Training Setup	Optimization epochs	Multiple (PPO standard)
Policy Update	Optimizer	Adam
Policy Update	Gradient update	Mini-batch based
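To make the role of the clipping parameter in Table 1 concrete, the following is a minimal sketch of the standard PPO clipped surrogate objective as described by Schulman et al. [15]. It is a generic NumPy illustration rather than the authors' training code; the function name `ppo_clipped_objective` and the sample inputs are assumptions, and only the default `clip_eps=0.2` is taken from Table 1.

```python
import numpy as np


def ppo_clipped_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to be maximized).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the previous policy; advantages: advantage estimates.
    """
    new_logp = np.asarray(new_logp, dtype=float)
    old_logp = np.asarray(old_logp, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Probability ratio between the current and the previous policy.
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    # Clipping bounds how far a single update can move the policy.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The element-wise minimum gives a pessimistic estimate of the gain.
    return float(np.mean(np.minimum(unclipped, clipped)))


# Illustrative inputs only: ratios of 1.5 and 0.5 with advantages +1 and -1.
value = ppo_clipped_objective(
    new_logp=np.log([1.5, 0.5]),
    old_logp=np.log([1.0, 1.0]),
    advantages=[1.0, -1.0],
)
```

With $\epsilon = 0.2$, a ratio of 1.5 on a positive advantage is capped at 1.2, while a ratio of 0.5 on a negative advantage is evaluated at 0.8, so neither sample can push the policy far from its previous version in one step.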

Table 2. mAP of Hamming ranking for different code lengths on CIFAR-10, NUS-WIDE, and MS-COCO using AlexNet.

	CIFAR-10				NUS-WIDE				MS-COCO			
Method	16 bits	32 bits	48 bits	64 bits	16 bits	32 bits	48 bits	64 bits	16 bits	32 bits	48 bits	64 bits
SDH [32]	0.461	0.520	0.553	0.568	0.588	0.611	0.638	0.667	0.555	0.564	0.572	0.580
CNNH [26]	0.476	0.472	0.489	0.501	0.570	0.583	0.593	0.600	0.564	0.574	0.571	0.567
DNNH [27]	0.559	0.558	0.581	0.583	0.598	0.616	0.635	0.639	0.593	0.603	0.605	0.610
HashNet [9]	0.643	0.667	0.675	0.687	0.662	0.699	0.711	0.716	0.687	0.718	0.730	0.736
HashGAN [28]	0.668	0.731	0.735	0.749	0.715	0.737	0.744	0.748	0.697	0.725	0.741	0.744
HashFormer [11]	0.9121	0.9167	-	0.9236	0.7317	0.7418	-	0.7597	-	-	-	-
DCDH [16]	-	0.9192	0.9142	-	-	0.8870	0.8922	-	-	-	-	-
CTMIH [17]	-	-	-	-	0.795	0.816	-	0.826	0.809	0.834	-	0.846
Proposed method	0.727	0.790	0.788	0.780	0.790	0.798	0.805	0.807	0.748	0.767	0.790	0.776

Table 3. mAP comparison with DRLIH using VGG-19.

	CIFAR-10				NUS-WIDE			
Method	12 bits	24 bits	32 bits	48 bits	12 bits	24 bits	32 bits	48 bits
DRLIH	0.816	0.843	0.855	0.853	0.823	0.846	0.845	0.853
Proposed method	0.857	0.876	0.881	0.883	0.839	0.862	0.868	0.871

Table 4. Computational complexity analysis of the proposed framework.

Stage	Operation	Complexity	Frequency
CNN Training	Feature learning + hash prediction	$O (N \cdot d \cdot k)$	Offline (once)
RL Training (PPO)	Bit selection policy optimization	$O (N \cdot k)$	Offline (once)
Inference	Hash encoding + Hamming distance	$O (k)$	Online (per query)
