Language Bias-Driven Self-Knowledge Distillation with Generalization Uncertainty for Reducing Language Bias in Visual Question Answering

To answer questions, visual question answering (VQA) systems rely on language bias but ignore the information in the images, which has a negative impact on their generalization. The mainstream debiased methods focus on removing language priors during inference. However, the image samples are distributed unevenly in the dataset, so the feature sets acquired by the model often cannot cover the features (views) of the tail samples; therefore, language bias occurs. This paper proposes a language bias-driven self-knowledge distillation framework that implicitly learns multi-view feature sets so as to reduce language bias. Moreover, to measure the performance of the student models, the authors of this paper use a generalization uncertainty index to help the student models learn unbiased visual knowledge and to force them to focus more on the questions that cannot be answered by language bias alone. In addition, the authors analyze the theory of the proposed method and verify the positive correlation between generalization uncertainty and expected test error. The method's effectiveness is validated on the VQA-CP v2, VQA-CP v1 and VQA v2 datasets through extensive ablation experiments.


Introduction
Visual Question Answering (VQA) [1,2] is a cross-domain task of computer vision and natural language processing, and it has become increasingly important in the research and application of multimodal machine learning. In the past few decades, significant advances have been made in computer vision and natural language processing, with an explosion of visual and textual data to acquire and process. The most common form of VQA consists of an image and a question to be answered by the machine. Compared with other computer vision tasks, the VQA model must answer in real time rather than being prepared in advance. Moreover, the VQA model is required to comprehend the multimodal information of images and texts in a more artificially intelligent [3] way, leading to an in-depth understanding of vision and language.
VQA remains a challenging and open research topic. Recent research has focused on how to solve language bias. Language bias [4][5][6][7][8] threatens the implementation of VQA and indicates that current VQA models have an inadequate understanding of multimodal information. Language bias appears to be caused by the uneven distribution of datasets, a common problem in the real world. For example, if 90 percent of the bananas in the training set are yellow, then when asked "what color is the banana", the model would answer "yellow" all the time, based on language bias. As shown in Figure 1, many VQA models tend to answer "yes" or "no" directly. Take another typical example: for the question "what color is the banana in the image?", although the banana is green, the model still tends to predict "yellow". With language bias, the model overly relies on the correlation between the question and the answer while ignoring the information in the image. In essence, language bias arises from data imbalance, which leads to over-fitting of the model; that is, the model fits the head samples in the dataset [8]. Over-fitting is an inherent problem of the model itself. An individual deep neural network has variance, and the variance can be reduced by an ensemble or by knowledge distillation [9][10][11][12] (over-fitting to the label imbalance may leave some models poorly trained). At the feature level, the variance is caused by the incompleteness of the feature subgraph [12], so over-fitting occurs easily. Recently, knowledge distillation [13,14] and self-knowledge distillation [15,16] have been proven able to learn multi-view features and reduce over-fitting [12,17,18].
Neural networks analyze information from multiple views; for the same object, its different views have semantic consistency [19][20][21], and the multi-view structure is ubiquitous at both the dataset and the feature level [12]. Therefore, the model can give a prediction based on a learned subgraph. However, if the subgraph is not comprehensive, the prediction can be biased. As shown in Figure 2, for the same question, "what color is the banana?", the model learns the feature of yellow bananas while it ignores the feature of green bananas, which are less frequent in the training set. In other words, the model ignores the view of green bananas, causing visual bias, which further leads to language bias. Therefore, the model needs to focus on the features of the less distributed samples in the training set and learn a comprehensive set of multi-view features so as to overcome VQA language bias and over-fitting.
This paper discusses how to reduce the language bias of the VQA model via self-knowledge distillation and proposes a new online learning framework, "language bias-driven self-knowledge distillation (LBSD)", for implicit learning of multi-view visual features. Self-knowledge distillation enables the model to acquire more dark knowledge and improves its generalization ability. In short, with self-knowledge distillation, the model can have a more comprehensive understanding of view features. Online knowledge distillation no longer uses teacher models but allows student models to learn from each other by using KL divergence to uniformly constrain their outputs. It is worth mentioning that in this setting the student network is actually equal to the teacher network; the two networks play both roles. However, the learning degree of student models cannot be described by KL divergence alone [22,23]. Therefore, the authors of this paper put forward the concept of generalization uncertainty to help the model learn unbiased knowledge. The VQA-CP v2 dataset contains images with multi-view features. The authors of this paper visualize the features at the same layer in the neural network: for the same question, although the images, views and features are different, the semantic information is the same. Broadly speaking, this "multi-view" structure [12] exists both in the original data and in the feature sets extracted from the middle layers.
LBSD enables two debiased models to distill knowledge from each other to learn more complete visual features. It distinguishes between debiased students and biased students by calculating the generalization uncertainty of the prediction of student models and reinforces the mutual learning of the two models about unbiased knowledge. The paper also finds that heterogeneous student models can be used to reduce language bias. LBSD enables the model to learn a more complete set of visual features and to focus on the features of the less distributed samples in the training set by utilizing generalization uncertainty, thus reducing the language bias of the model and improving the robustness of the VQA model.
Contribution. In summary, the contributions of this paper are as follows: (1) The authors of this paper propose a training framework (LBSD) based on online self-knowledge distillation, which can considerably reduce VQA language bias. Moreover, the authors explore different cases of student models (heterogeneous networks), verify the effectiveness of the LBSD method and analyze the theory behind it.
(2) The authors of this paper propose a method to measure generalization uncertainty based on Top-k information entropy and use it to distinguish between debiased students and biased students, so as to force the model to focus on the samples in the VQA datasets that cannot be directly answered by language bias. The authors also prove the proportional relationship between the generalization uncertainty and the expected test error.

Language Bias in VQA
The language bias [8] in VQA has a negative impact on the general application of the model in real-world scenarios. The reason behind it is that there is often a strong correlation between questions and answers. Moreover, the questions tend to concern conspicuous objects in the image. In VQA v1 [1] and v2 [7], a positive answer or a question-related answer tends to have higher accuracy. When the questions and answers in the training set and the test set are distributed inconsistently, this language bias is obvious. Therefore, the VQA-CP v2 dataset was recently proposed to evaluate language bias.

Knowledge Distillation
In recent years, knowledge distillation [45][46][47][48] has been widely used in deep learning to transfer knowledge between different models. Hinton et al. [13] used knowledge distillation for model compression; that is, moving knowledge from powerful but complex models (teacher models) to simple models (student models). By minimizing the Kullback-Leibler (KL) divergence loss of the categorical output probability, the student can imitate the output of the teacher model. In addition, some new knowledge transfer goals have been proposed, such as intermediate feature maps [49], attention maps [50], second-order statistics [46], contrastive features [51,52] or structured knowledge [53][54][55].
However, these methods require a distinction between the roles of the teacher and the student and typically distill offline. Online knowledge distillation eliminates the cumbersome teacher model and instead trains a cohort of (generally two) student models. Based on the Kullback-Leibler divergence, Zhang et al. [16] proposed deep mutual learning (DML), in which pair-wise students learn from each other using a mimicry loss. By adding the distillation loss only after enough update steps, co-distillation [15] (similar to DML) enables the student networks to sustain their diversity for a longer time. However, KL divergence alone cannot capture the learning degree of student models. The authors of this paper put forward the notion of generalization uncertainty as a way for the model to learn unbiased knowledge.

Methods
In order to reduce VQA language bias, the authors of this paper make the model focus on the less distributed samples in the training set to learn a more complete set of multi-view features. To this end, the authors propose a new online self-knowledge distillation learning framework (LBSD) for implicit learning of multi-view visual feature sets to alleviate language bias. The method is divided into: (1) language bias-driven self-knowledge distillation and (2) using generalization uncertainty to help student models learn unbiased visual knowledge. In the following sections, the authors explain the workflow of LBSD and analyze the theory behind it. The block diagram of the method is shown in Figure 3, and the procedure is summarized in Algorithm 1.

Algorithm 1: Language Bias-Driven Self-Distillation
Input: training set $I, Q$ ($X$), label set $A$ ($Y$), learning rates $\gamma_{1,t}$ and $\gamma_{2,t}$.
Initialize: debiased models $N_1$ and $N_2$ (different initial conditions or models).
Repeat: randomly sample data $I_i, Q_i$ from $I, Q$.
1: Compute the predictions $p_1$ and $p_2$ of $I_i, Q_i$ for the current mini-batch.
2: Compute the stochastic gradient and update $N_1$ (parameters $\Theta_1$) by Equation (13).
3: Compute the stochastic gradient and update $N_2$ (parameters $\Theta_2$).
Until convergence.
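To make Algorithm 1 concrete, the following is a minimal PyTorch-style sketch of one mutual-distillation step, assuming two interchangeable debiased VQA models that map an image-question batch to answer logits. All names are illustrative rather than the authors' released code, and single-label cross-entropy stands in for the soft-score VQA loss used in practice.

```python
import torch
import torch.nn.functional as F

def mutual_distillation_step(model1, model2, opt1, opt2,
                             images, questions, labels, kl_coef=2.0):
    """One step of Algorithm 1: each student fits the ground-truth labels
    (cross-entropy) while mimicking the other student's prediction (KL)."""
    # --- update student 1, treating student 2's prediction as a fixed target ---
    with torch.no_grad():
        p2 = F.softmax(model2(images, questions), dim=1)
    logits1 = model1(images, questions)
    loss1 = F.cross_entropy(logits1, labels) + kl_coef * F.kl_div(
        F.log_softmax(logits1, dim=1), p2, reduction="batchmean")  # KL(p2 || p1)
    opt1.zero_grad(); loss1.backward(); opt1.step()

    # --- update student 2 against the freshly updated student 1 ---
    with torch.no_grad():
        p1 = F.softmax(model1(images, questions), dim=1)
    logits2 = model2(images, questions)
    loss2 = F.cross_entropy(logits2, labels) + kl_coef * F.kl_div(
        F.log_softmax(logits2, dim=1), p1, reduction="batchmean")  # KL(p1 || p2)
    opt2.zero_grad(); loss2.backward(); opt2.step()
    return loss1.item(), loss2.item()
```

The default `kl_coef` of 2 mirrors the KL coefficient of 2 or 3 reported in the experimental details below.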

Preliminaries
To tackle the multi-class classification problem in the VQA field, the general form of VQA is as follows: a dataset $D = \{I_i, Q_i, a_i\}_{i=1}^{N}$ is given, containing $N$ triplets of images $I_i \in I$, questions $Q_i \in Q$ and answers $a_i \in A$.
The aim of the VQA task is to learn a mapping function $f_{vqa}: I \times Q \rightarrow [0, 1]^{|A|}$, which generates the answer distributions for any given image-question pair. The subscript $i$ is omitted in the following.
For each question $Q$ and image $I$, the Bottom-Up Top-Down (UpDn) [56] model uses a question encoder $e_q$ and an object detector $i_q$ separately to extract a set of word embeddings $Q$ and a set of visual object embeddings $V$. The model is fed both $V$ and $Q$ to obtain the joint feature $mm(V, Q)$. Then, the joint features are fed into the classifier $C$ to get the final predictions.
For fair comparisons, the authors of this paper use the Bottom-Up Top-Down (UpDn) model [56], which is mainly used by many researchers as the backbone network.
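The following sketch illustrates the UpDn-style pipeline described above: a question encoder $e_q$, pre-extracted object features $V$, a joint feature $mm(V, Q)$ and a classifier $C$. It is a simplification under assumed dimensions; in particular, the real UpDn model applies question-guided attention over objects rather than mean pooling, and all names are illustrative.

```python
import torch
import torch.nn as nn

class UpDnSketch(nn.Module):
    """Simplified UpDn-style VQA model: encode the question with a GRU,
    pool pre-extracted object features, fuse both, and classify answers.
    (The real UpDn attends over objects with question-guided attention.)"""
    def __init__(self, vocab_size, num_answers,
                 q_dim=512, v_dim=2048, joint_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)   # e.g., GloVe-initialized
        self.gru = nn.GRU(300, q_dim, batch_first=True)
        self.q_proj = nn.Linear(q_dim, joint_dim)
        self.v_proj = nn.Linear(v_dim, joint_dim)
        self.classifier = nn.Sequential(
            nn.Linear(joint_dim, joint_dim), nn.ReLU(),
            nn.Linear(joint_dim, num_answers))

    def forward(self, obj_feats, question_tokens):
        # obj_feats: (batch, num_objects, v_dim) from a pre-trained Faster R-CNN
        _, q_hidden = self.gru(self.embed(question_tokens))
        q = self.q_proj(q_hidden.squeeze(0))   # question embedding Q
        v = self.v_proj(obj_feats.mean(dim=1)) # pooled visual embedding V
        joint = q * v                          # joint feature mm(V, Q)
        return self.classifier(joint)          # answer logits over |A| classes
```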

Language Bias-Driven Self-Distillation
The method aims to learn unbiased visual knowledge via the mutual learning of two debiased models so as to reduce VQA language bias. The training strategy, which can be integrated with the current debiased methods, consists of the mutual learning of two debiased models. Given a dataset $D = \{I_i, Q_i, a_i\}_{i=1}^{N}$ containing $N$ triplets of images $I_i \in I$, questions $Q_i \in Q$ and answers $a_i \in A$, the data are input into two identical models with different random initializations, $N_1$ and $N_2$, and each model predicts a probability vector $p$ from its softmax output over the logits $z$:

$$p_1^{(m)} = \frac{\exp(z_1^{(m)})}{\sum_{j=1}^{M} \exp(z_1^{(j)})},$$

where $M$ represents the number of outputs or classes of the neural network. At the same time, the VQA model is generally defined as multi-class. Therefore, the objective function of the training network $N_1$ is defined as the cross-entropy error $L_C$ between the prediction and the correct label, where $K$ denotes the number of samples and $M$ the number of classes:

$$L_{C_1} = -\sum_{i=1}^{K} \sum_{m=1}^{M} a_i^{(m)} \log p_1^{(m)}(I_i, Q_i).$$

In order to allow the two student models to learn unbiased visual features from each other (similar to self-knowledge distillation), the authors of this paper use KL divergence to constrain the predictions, thus distilling the unbiased knowledge of the two models. The KL divergence between $N_1$ and $N_2$ is

$$D_{KL}(p_2 \,\|\, p_1) = \sum_{i=1}^{K} \sum_{m=1}^{M} p_2^{(m)}(I_i, Q_i) \log \frac{p_2^{(m)}(I_i, Q_i)}{p_1^{(m)}(I_i, Q_i)}.$$

The two student models start parameter optimization simultaneously, with losses

$$L_{N_1} = L_{C_1} + D_{KL}(p_2 \,\|\, p_1), \qquad L_{N_2} = L_{C_2} + D_{KL}(p_1 \,\|\, p_2).$$

The consistency constraint on the predictions of the two models realizes the mutual learning of unbiased knowledge between them.
Since KL divergence is asymmetric, it can be replaced by the Jensen-Shannon (JS) divergence (a symmetrized variant of KL divergence) to ensure the consistency constraint between the two student models. This replacement does not affect the final precision of the model.
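A minimal sketch of the symmetric JS alternative mentioned above, assuming `p` and `q` are softmax probability tensors (not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

def js_divergence(p, q, eps=1e-8):
    """Jensen-Shannon divergence between probability tensors p and q:
    JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2.
    Unlike KL, JS is symmetric, so neither student acts as an implicit teacher."""
    m = 0.5 * (p + q)
    kl_pm = F.kl_div((m + eps).log(), p, reduction="batchmean")  # KL(p || m)
    kl_qm = F.kl_div((m + eps).log(), q, reduction="batchmean")  # KL(q || m)
    return 0.5 * (kl_pm + kl_qm)
```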
Moreover, all the current self-knowledge distillation models use student models with different random initializations. The strategy is effective because the model learns more complete sets of multi-view features. The authors of this paper also explore the case where two heterogeneous student networks serve as the student models. The heterogeneous networks have the same feature extraction structure, but they have different loss functions and network branches.

Debiased Mutual Students
As mentioned above, the language bias of datasets is, in essence, the distribution bias of image samples. For the same input image/text sample pair, the two student networks may have different outputs because of an inconsistent random seed, order of data reading or even network structure.
As shown in Figure 4, for the more distributed image/text sample pairs in the dataset, the model can simply answer the question through language bias, and the confidence of the answer is very high; different student models tend to give the same answer. For the gradient update of the neural networks, the cross-entropy loss and KL divergence loss of image/text samples whose questions can be answered by language bias are minimal. However, for the less distributed samples, the models are more likely to give different answers. Therefore, these different answers can be measured and analyzed to help the model focus more on the samples that cannot be directly answered by language bias, so as to reduce language bias.
In general, current self-distillation methods only use KL divergence for the mutual distillation of knowledge. As KL divergence is not symmetric, it cannot be understood as a "distance" that measures the information loss between two distributions. Simply constraining the KL divergence of the two student models cannot capture the difference between their outputs or help the two models learn from each other with more precision. As shown in Figure 4, KL divergence for different distributions and consistency constraints is not always consistent with our expectations. For this reason, the authors of this paper use information entropy to evaluate the output uncertainty of the two models and evaluate the output difference based on that uncertainty.
As shown in Figure 4, although information entropy $H$ is a common way to measure uncertainty, its output is not always consistent with our understanding. For $p_a = [0.5, 0.25, 0.25]$ and $p_b = [0.5, 0.5, 0]$, the formula gives $H(p_a) = 1.5$ bits $> H(p_b) = 1.0$ bits. For general classification scenarios, however, it is clear that $p_b$ is less certain than $p_a$: its two leading classes are tied, so the confidence of the prediction is extremely low. Therefore, in order to describe the prediction uncertainty, the authors of this paper adopt a simple and improved version: Top-k information entropy.
Suppose that $p_1, p_2, \ldots, p_k$ are the $k$ values with the highest probability and that $\hat{p}_i = p_i / \sum_{j=1}^{k} p_j$ denotes their renormalization; the Top-k information entropy is

$$C = -\frac{1}{\log k} \sum_{i=1}^{k} \hat{p}_i \log \hat{p}_i.$$

This formula yields a result in the range of 0 to 1, and $C$ is taken as the final uncertainty measure.
In order to measure the output difference between the two student models with uncertainties $C_1$ and $C_2$, the output difference is defined as $|C_1 - C_2|$. In order to enhance the mutual learning of the two student models in the case of an output difference (questions that cannot be answered directly with language bias), the authors of this paper define a generalization uncertainty index $GU = e^{|C_1 - C_2|}$ to represent the intensity of mutual learning. The final loss function weighted by the generalization uncertainty index is then

$$L_{N_1} = L_{C_1} + GU \cdot D_{KL}(p_2 \,\|\, p_1).$$

Next, the authors demonstrate that the generalization uncertainty index $GU$ between the two student models can be used to estimate the test error of the model on image-text sample pairs; thus, generalization uncertainty can be used during training to help the students learn unbiased knowledge. Following the research of Nakkiran and Bansal [57], Jiang et al. [58] and others [59][60][61], the authors use class-segregated (class-wise) calibration [58,62-66] to prove the proportional relationship between the generalization uncertainty and the test error:

$$\mathbb{E}_{\Omega, \Omega'}\big[\mathrm{GUErr}_{D_{vqa}}(n, n')\big] \propto \mathbb{E}_{\Omega}\big[\mathrm{TE}_{D_{vqa}}(n)\big].$$

Proof. The expected test error is denoted TE, and the Top-K error of the two-student model with generalization uncertainty is denoted GUE. By simplifying the two errors, the proportional relationship (GUP) between them can be obtained. Since $K$ was previously used for another quantity, $J$ denotes the $K$ in Top-K, and $i$ denotes the corresponding prediction at the sample of the $J$-th value.
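A sketch of the Top-k uncertainty measure $C$ and the GU weight, assuming the Top-k probabilities are renormalized and the entropy is scaled by $1/\log k$ so that $C$ falls in $[0, 1]$; the paper's exact normalization may differ, and all names here are illustrative.

```python
import math
import torch

def topk_uncertainty(probs, k=3):
    """Top-k information entropy C in [0, 1]: take the k largest class
    probabilities, renormalize them, and compute their normalized entropy.
    Assumes k >= 2 and probs of shape (batch, num_classes)."""
    top, _ = probs.topk(k, dim=-1)
    top = top / top.sum(dim=-1, keepdim=True)    # renormalize to sum to 1
    ent = -(top * (top + 1e-8).log()).sum(dim=-1)
    return ent / math.log(k)                     # scale into [0, 1]

def generalization_uncertainty(p1, p2, k=3):
    """GU = exp(|C1 - C2|): larger when the two students' confidences diverge,
    i.e., on samples that language bias alone cannot answer."""
    return torch.exp((topk_uncertainty(p1, k) - topk_uncertainty(p2, k)).abs())

# Hypothetical usage: weight the per-sample distillation term by GU.
# loss1 = ce_loss1 + (generalization_uncertainty(p1, p2) * kl_per_sample).mean()
```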
The detailed proof of generalization uncertainty can be found in Appendix A.

Theoretical Analysis of Debiased Self-Distillation
In this section, the authors of this paper demonstrate that self-knowledge distillation and generalization uncertainty can enable models to learn more complete multi-view feature sets and reduce language bias in VQA. The analysis follows the research of Allen-Zhu and Li [12,19].

Notation 2.
Consider a model whose dataset contains $K$ categories and $P$ input patches and that uses the ReLU activation function. The model input is $(I_i, Q_i)$ and the label is $a_i$. To simplify the problem, the authors of this paper assume that each category contains related features that are orthogonal to each other and define these features as vectors $vqa_{j,1}, vqa_{j,2}$.
Following the settings of Allen-Zhu and Li [12], the set of all features satisfies $vqa_{j,\ell} \perp vqa_{j',\ell'}$ whenever $(j, \ell) \neq (j', \ell')$. The logits function of the single model (with learning rate $\eta \leq \frac{1}{\mathrm{poly}(k)}$ and $T = \frac{\mathrm{poly}(k)}{\eta}$ iterations) and the logits function of the model using knowledge distillation are defined as in [12]. Comparing Theorems 2 and 3 shows that the prediction error of the distilled model decreases, which means that the LBSD method can reduce language bias. The detailed proof can be found in Appendix A.

Settings, Results and Discussion
In this section, the authors of this paper evaluate the effectiveness of all the LBSD methods in the three mainstream datasets (VQA-CP v2, VQA-CP v1 and VQA v2), carry out an ablation experiment with the typical debiased method and compare the performance of the LBSD methods and that of the latest method. Table 1 shows the statistics of all the datasets.

Datasets and Backbone
The paper uses the standard VQA evaluation metric [1] to evaluate the performance of the model on the VQA-CP v2 [67], VQA-CP v1 [67] and VQA v2 [7] datasets. For fair comparisons, all the methods are based on the UpDn model, and their best-recorded performance is compared. The experiment trains and tests the models on two Titan Xp GPUs.
Currently, for the VQA language bias issue, researchers evaluate the performance of the proposed models on the VQA-CP v2 dataset and conduct auxiliary verification on the VQA v2 dataset. Most findings test the models on VQA-CP v2 and VQA v2 and calculate the gap index [36] as an auxiliary index to verify the robustness of the model.
VQA-CP v2. The researchers propose the VQA-CP v2 dataset, which is derived from the re-classification of the samples in the VQA v2 dataset, to measure language bias. The VQA-CP v2 and VQA-CP v1 datasets are the only open-source datasets for language bias evaluation. The questions and answers in the training and testing sets are distributed in considerably different ways. In other words, for the same type of questions, the answers in the training set and testing set are distributed very differently. Therefore, the VQA-CP v2 dataset is suitable for measuring the language bias of the models. The training set consists of 121 K images, 438 K questions and 4.4 million answers, and the testing set consists of 98 K images, 220 K questions and 2.2 million answers.
VQA-CP v1. The VQA-CP v1 dataset, the first version of the VQA-CP dataset, is the first-ever dataset for language bias evaluation. It is derived from the re-classification of the VQA v1 [1] dataset. The VQA-CP v1 training set consists of 118 K images and 245 K questions.

Experimental Details
For LBSD, the $k$ in generalization uncertainty is set at 3, and the KL divergence coefficient is set at 2 or 3. The basic VQA network UpDn uses a pre-trained Faster R-CNN to extract image features, the pre-trained GloVe model (300 dimensions) to extract text features and a single-layer GRU to obtain question-embedding vectors (512 dimensions); the joint embedding has 2048 dimensions. The batch size is set at 512, and the models are trained and tested on two Titan Xp GPUs.
Because VQA-CP v1 and v2 lack validation sets, results on VQA v2 are generally reported on its validation set. In order to select the parameters of the model, the authors of this paper split off 10% of the samples from the test set or the validation set to act as a validation set, select the model parameters on that validation set and then test the precision of the model on the test set. The reported results of other methods are taken from the original published papers; for experiments that were not performed in the original papers, we reproduced them using the official code and mark the reproduced results with an asterisk in the upper right-hand corner. With regard to run time, the proposed method runs 30 epochs in 15 h in an environment with 256 GB of memory and two Titan Xp GPUs. For the statistical analysis, we performed multiple runs and report results at a 95% confidence level; for reliability, we selected the median of the results of the multiple runs as the final result, and the final precision and the precision of each index are filled in the tables.
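For reference, the settings stated in this section can be collected in a single configuration sketch (the dictionary and its keys are illustrative, not from the released code):

```python
# Hyperparameters reported in this section, collected for reference
# (dictionary name and keys are illustrative, not from the released code).
LBSD_CONFIG = {
    "topk_k": 3,                  # k in the generalization uncertainty
    "kl_coefficient": 2,          # set at 2 or 3 depending on the setting
    "image_features": "Faster R-CNN (pre-trained)",
    "word_embeddings": "GloVe, 300-d",
    "question_encoder": "single-layer GRU, 512-d",
    "joint_embedding_dim": 2048,
    "batch_size": 512,
    "epochs": 30,                 # about 15 h on two Titan Xp GPUs
}
```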

Ablation Studies
To verify the effectiveness of LBSD, the authors of this paper conduct an ablation experiment on every aspect. For fair comparisons, the authors of this paper select the mainstream VQA network UpDn as the skeleton and carry out ablation experiments on typical debiased methods such as Bias product, Reweight and LMH. In these tables, * indicates the results of our reimplementation from the official code.

Architecture Agnostic
Since LBSD is agnostic to the model, it can be integrated into various VQA networks. To evaluate the performance of LBSD on debiased methods, the authors of this paper combine it with the baseline and other typical methods, including UpDn, Bias Product (Product of Experts), Reweight and LMH. Reweight, a non-ensemble method, encourages the model to focus on the samples that are predicted erroneously by the language-bias model, while Bias Product and LMH are ensemble methods. Compared with these, the LBSD-integrated models have higher precision.
The authors of this paper conduct ablation experiments on the VQA-CP v2 and VQA-CP v1 datasets. As shown in Table 2, for typical debiased methods, including ensemble and non-ensemble methods, LBSD improves the precision of the model on the VQA-CP v2 dataset. For example, the performance of Reweight (non-ensemble) and LMH (ensemble) improves by 1.26% and 2.22%, respectively. Even for UpDn without any debiased method, LBSD improves the precision by 0.25%, which demonstrates that LBSD reduces language bias from the perspective of feature learning. As shown in Table 3, for Reweight (non-ensemble) and Bias Product (ensemble), LBSD improves the performance by 2.2% and 0.58% (the "NUM" index improves by 7.51%), respectively, on the VQA-CP v1 dataset.

Effectiveness of GU
To verify the effectiveness of generalization uncertainty in the reduction of language bias, the authors of this paper conduct ablation experiments on VQA-CP v2. Two debiased methods, Reweight (non-ensemble) and LMH (ensemble), are selected for verification. As shown in Table 4, compared with LBSD without the generalization uncertainty constraint, LBSD with the constraint improves the performance by 0.27% and 0.63% on Reweight and LMH, respectively. For the question types "YES/NO" and "Other", which are highly dependent on language bias, adding the generalization uncertainty constraint reduces the language bias of these question types.

Table 4. VQA-CP v2: Ablation experiments of the generalization uncertainty method on the VQA-CP v2 dataset. * indicates the results from our reimplementation using officially released codes.


Heterogeneous Student Networks
Generally, the student models of self-knowledge distillation have identical network structures. The authors of this paper also explore heterogeneous student networks, where the two student models are not identical. The authors of this paper select two debiased methods based on the UpDn model to verify the effectiveness of heterogeneous student networks. As shown in Table 5, heterogeneous student networks can have similar effects to homogeneous student networks. Moreover, the precision of the two heterogeneous student models is improved.

Comparisons with State-of-the-Arts
To evaluate the performance of LBSD, the authors of this paper carry out an experiment on VQA-CP v2, VQA-CP v1, and VQA v2 and compare it with the state-of-the-art method. In these tables, * indicates the results of our reimplementation from the official code.

Performance on VQA-CP v2
Setting. The authors of this paper combine LBSD with LMH and name it LBSD-LMH. For fair comparisons, the authors choose debiased methods based on UpDn. According to the principles of reducing language bias, the methods are divided into groups: (1) strengthening visual information [24,25]; (2) weakening language priors [29,31,32]; (3) using various data augmentation and data balancing techniques [36,68].
Since LBSD improves the performance by enabling the model to focus more on visual information and difficult samples (the model cannot answer based on language bias), the authors of this paper compare other methods with those in the first and second groups. Moreover, according to the experiment settings of CSS [36], the authors of this paper test and calculate the gap index as an auxiliary index on VQA v2 to verify the robustness of the model.
Results. Comparisons are reported in Table 6. As shown in Table 6, compared with other methods that use UpDn as the standard VQA model, LBSD improves the performance on VQA-CP v2. The gap index is also improved ("All" and "Other"). The results show that the proposed LBSD can reduce language bias in VQA. For individual items such as Num, Yes/No and Other, CFVQA is slightly higher than our method on the Num index; it is an ensemble method based on causal inference. Similar to boosting, CFVQA uses more ensemble networks as additional information than ours, so it is unfair to compare directly on small indices.

Performance on VQA-CP v1

Settings. The authors of this paper compare LBSD-LMH with the state-of-the-art methods on VQA-CP v1. According to the principle and method of reducing language bias, the methods are divided into groups: (1) strengthening visual information [24,25]; (2) weakening language priors [29,31,32]; (3) using various data augmentation and data balancing techniques [36,68]. Moreover, the authors conducted additional experiments based on the official codes of the methods, as the results of some methods on VQA-CP v1 were not reported.
Results. As shown in Table 7, compared with the methods in group 1 and group 2, LBSD realizes the best performance on VQA-CP v1. In particular, LBSD improves the performance of LMH and Reweight by 0.66% and 2.2%, respectively. The results show that the proposed method is effective for different datasets and is effective for different types of debiased methods. The results verify the effectiveness of LBSD.

Qualitative Examples
In order to better present the results, the authors of this paper conduct a qualitative visualization analysis of some representative predictions of the model and compare them with other methods. Figure 5 shows that our method is superior to the baseline method.

Conclusions
This paper discusses how to reduce the language bias of the VQA model via self-knowledge distillation and proposes a new online learning framework, "language bias-driven self-knowledge distillation (LBSD)", for implicit learning of multi-view visual features. Moreover, in order to help student models learn unbiased visual knowledge, the authors of this paper propose generalization uncertainty to measure the learning results of student models and use KL divergence to reinforce the debiased mutual learning of student models. In this way, the student models can learn unbiased knowledge from each other through the output of Top-K information entropy. In addition, the paper also discusses the effect of heterogeneous student models on the reduction of language bias. The experiments prove that even heterogeneous student models can improve the unbiased learning ability through the LBSD method. Extensive experiments and ablation experiments on the VQA-CP v2, VQA-CP v1 and VQA v2 datasets verify the effectiveness of the proposed method. In the future, we will continue to explore how to better define the concept of unbiased knowledge, for example, by using multimodal knowledge graphs to help the model understand the types of knowledge in the dataset, and how to optimize the loss function to enable the model to distinguish biased from unbiased knowledge, so as to further reduce language bias.

Appendix A

Proof. Predicting generalization and calibration. It is found that the distribution over predicted classes and the ground-truth labels match each other within a set of confidence levels; a measure of disagreement between an ensemble and the ensemble itself boils down to measuring disagreement against the ground truth.
Recall Theorem 1. The authors of this paper can express the expected disagreement rate between two debiased students as an integral over the confidence values.
In order to simplify the expected Top-K disagreement rate (GU), the authors of this paper first simplify the expected test error (TE). Dealing with the integrals in (A1) and defining $\tilde{n}_k(I, Q)$ as $q_k$ and $(1 - \tilde{n}_k(I_i, Q_i))$ as $f_k$, a simplified form of TE is obtained.
Proof. Referring to the research work of Allen-Zhu and Li, the authors of this paper use the same lottery-winning theory and related lemmas; refer to [12] for the details of the lemmas. The authors extend the analysis to VQA with GU. For the single model, for every $t < T$, a bound follows from the noise lower bound and the multi-view error claim. Assume that the distributions of $\sum_{p \in P_{vqa}(I, Q)} z_p^q$ for $vqa \in \{vqa_{a,1}, vqa_{a,2}\}$ are the same. Similarly to the single model, for every $(I_i, Q_i, a_i) \in S_{dm}$, the corresponding bound for the distilled model is obtained.