Our HQD-EM framework is composed of two synergistic parts: the HQD module, which employs ensemble-based bias modeling to capture complex, layered language biases via hierarchical question decomposition, and the EM loss, which leverages these learned biases to adaptively adjust decision margins and perform effective debiasing (see
Figure 2 for an overview). In
Section 3.1, we first describe our VQA baseline.
Section 3.2 presents the HQD question decomposition strategy.
Section 3.3, illustrated in
Figure 2a,b, details how the decomposed questions are used to train the bias model and how to apply debiasing, and finally,
Section 3.4, depicted in
Figure 2c, explains how the EM loss uses the trained bias models to modulate margins for robust generalization.
3.1. VQA Baseline
We adopt the Bottom-Up Top-Down (UpDn) architecture [
31] as our VQA baseline. Given an input image
v, a pre-trained Faster R-CNN [
32] extracts
n object-level feature vectors
. Concurrently, the question
q is encoded by GloVe [
33] embeddings followed by a GRU [
34] to produce
. An attention-based fusion module
, then computes weights over the visual features conditioned on the question embedding, yielding a joint multimodal representation of
x where
This feature
x is then passed through an MLP classifier
to produce logits
y where
which are transformed into answer probabilities
. We denote this combined network of
as
for ease of notation, where necessary. We optimize this baseline using the standard cross-entropy loss against the one-hot ground-truth labels.
3.2. Hierarchical Question Decomposition Strategy
In this work, we refer to our multi-scale question decomposing strategy as Hierarchical Question Decomposition (HQD). HQD systematically splits each question into three levels—low, middle, and high—by first fixing the question token that encodes the question type and then randomly sampling the remaining tokens at different ratios. In low-level decomposition, we randomly select one-third of the question tokens (without replacement) and then form
by keeping the selected tokens in their original left-to-right order (i.e., no reordering or shuffling). In middle-level decomposition, we similarly retain roughly half of the tokens in order. High-level decomposition makes no changes and retains the full token sequence. For example, the question
“How many donuts have light purple sprinkles on them?” is transformed into a low-level variant
“How many donuts purple?” (
), a middle-level variant
“How many have purple sprinkles on?” (
), and a high-level variant
“How many donuts have light purple sprinkles on them?” (
). We choose the minimum ratio, one-third, for
based on the token statistics of VQA-CP: the maximum question length used in the model is 14 tokens, with most questions being much shorter. Further decomposition could leave too few tokens. In addition, as we deterministically keep the question-type token, this could leave only one to two additional content tokens, creating a sentence that is too sparse to preserve meaningful context. We find that using a decomposition rate of one-third typically preserves 4–5 tokens while still enforcing substantial compression. We also present a variant that fixes both the question-type token and noun tokens (sampling over the remaining tokens to align with meaningful units), but empirically find that fixing only the question-type token yielded stronger performance and more robust debiasing; therefore, we adopt this setting (see
Section 4.7). By exposing the model to these compressed and expanded variants, HQD enables learning of subtle distributed language biases at multiple granularities without sacrificing overall context.
3.3. Hierarchical Question Decomposition Training and Debiasing
As we also utilize an ensemble-based method, we create 2 models: a
target model, which is the main model that is used for inference and testing, and a
bias model, which is a similar model that we utilize to capture the biases and use for debiasing the target model. In our case, to capture both the dataset priors and the intrinsic biases of the target model, we adopt the same architectural structure as the target model. As previously mentioned, the target model produces logits:
, where the bias model would produce the logits:
, given a multimodal intermediary feature
x. However, for the bias model, following GenB [
12], we feed a Gaussian noise tensor
z through a simple generator to create generated image features so the bias model learns bias representations that remain architecturally aligned with the backbone while stochastically capturing biases from the question.
z is then used to synthesize a pseudo-visual feature
,
where
is a small generator network, and
denote the number of region features and their dimensionality.
Our HQD bias model
directly receives the decomposed questions
,
, and
, along with the pseudo-visual feature
.
Inside
, each question variant is first encoded by its own subnetwork:
We empirically found that element-wise summation of the three subnetwork outputs yields the best performance, resulting in the fused question feature
.
The fused question feature
and the pseudo-visual feature
are then passed through the attention-based fusion module and an MLP classifier to generate the final logits
.
Binary Cross-Entropy Loss. This loss forces
to overfit to dataset priors by matching its sigmoid outputs to the ground-truth answer distribution
:
Adversarial Loss. A discriminator
D [
35] is trained to distinguish true VQA logits
y, the output of the VQA model given image features
v and question embedding
q, from HQD bias logits
, which capture spurious correlations learned by the bias model. Meanwhile, the bias model
and generator
G are optimized to fool
D by producing logits that resemble the true VQA outputs. The resulting minimax game is:
This encourages
to mimic the target model’s output distribution and capture its biases.
Knowledge Distillation Loss. To align
more closely with the target VQA model, we minimize the Kullback–Leibler divergence [
36] between the target logits and bias logits:
This distillation transfers fine-grained prediction patterns into the bias model.
Overall Loss of Training Bias. The overall objective for bias model training combines these terms:
where
and
are hyperparameters that balance each component’s contribution.
Debiasing the Target Model. After training
, we generate bias logits
by feeding the real image features
v and each decomposed question into the bias model, following the design choices in GenB [
12]. Let the target model’s logits be
y and its predicted probabilities
. We then construct pseudo-labels
that reduce overly biased predictions while preserving bias intensity:
where
is the one-hot ground truth and
the
i-th bias logit. Finally, the target model is optimized with a cross-entropy loss against these pseudo-labels:
This debiasing loss leverages the bias model’s confidence to steer the target model away from spurious correlations without discarding useful bias strength.
3.4. Ensemble Adaptive Angular Margin Loss
To further enhance robustness, inspired by previous work RMLVQA [
19], we calibrate the loss using dataset-level statistics together with a bias signal learned via standard cross-entropy. In contrast, we tackle the limitation of such a CE (Cross Entropy) trained model, which is a largely static bias estimate by introducing an
Ensemble adaptive angular Margin (EM) that dynamically modulates class-specific margins using both dataset priors and instance-level confidence of our HQD bias model, yielding a more effective debiasing model.
First, inspired by RMLVQA [
19], for each question type
, we compute
frequency-based margins to penalize models less on common answers and more on rare ones:
where
is the count of answer
under
and
avoids division by zero. Here,
reflects how frequent an answer is, and
becomes a larger margin for rarer answers.
Next, we inject
controlled randomness into these margins to prevent over-correction:
Sampling
from a Gaussian centered on
with standard deviation
smooths margin extremes and improves generalization. In parallel, we derive an
instance-level margin from the dedicated bias classifier’s output logits. We do this by training a separate classifier
with standard cross-entropy loss to capture language biases. Then, we normalize the cross-entropy logits
of the stand-alone bias model.
where
is a temperature hyperparameter. Unlike previous works, here we then compute the per-instance margin as
Here,
denotes the output of HQD, and
is a hyperparameter. By injecting HQD’s output into the instance-specific margin
, we organically couple HQD and EM. The well-trained HQD bias model provides instance-level confidence to modulate the base
, complementing the standard CE-based bias estimation and capturing residual priors that the model might miss. We propose that this synergy yields a more context-aware margin and, ultimately, better debiasing.
We then merge the
randomized and
instance-level margins for the ground-truth class
:
where
is annealed from 1 to 0 over training, gradually shifting emphasis from global frequency priors to per-sample bias signals.
Finally, we impose these combined margins in an
angular margin loss, which maximizes the angular distance between feature
x and classifier weight
:
where
s scales features and
is the angle to
.
To train the bias-capturing classifier
(which outputs logits
), we include its cross-entropy loss:
To further enhance feature discrimination of our model, we add an additional Supervised Contrastive (SC) loss,
, as contrastive losses [
37,
38] have been shown to aid models in learning more representative features. Specifically, given a mini-batch of size
B with feature vectors
and corresponding labels
, we define for each anchor
j:
The SC loss [
37] is then:
where we set the temperature
for all experiments.
The full training objective is
This design leverages HQD’s bias signals to tailor margins at both global and instance levels, yielding fine-grained debiasing and robust generalization under distribution shifts.