Uncertainty Flow Facilitates Zero-Shot Multi-Label Learning in Affective Facial Analysis

Featured Application: The proposed Uncertainty Flow framework may benefit facial analysis through improved discriminability in multi-label affective classification tasks. The framework also allows efficient model training and knowledge transfer between tasks. Applications that rely heavily on continuous prediction of emotional valence, e.g., monitoring prisoners’ emotional stability in jail, can benefit directly from our framework.
Abstract: Lowering the single-label dependency of affective facial analysis calls for multi-label affective learning. The main impediment to the practical implementation of existing multi-label algorithms is the scarcity of scalable multi-label training datasets. To resolve this, an inductive transfer learning based framework, i.e., Uncertainty Flow, is put forward in this research to allow knowledge transfer from a single-label emotion recognition task to a multi-label affective recognition task. That is, the model uncertainty, which can be quantified in Uncertainty Flow, is distilled from a single-label learning task. The distilled model uncertainty then enables efficient zero-shot multi-label affective learning. From the theoretical perspective, the feasibility of applying weakly informative priors, e.g., uniform and Cauchy priors, within the Uncertainty Flow framework is fully explored in this research. More importantly, based on the derived weight uncertainty, three prediction related uncertainty indexes, i.e., soft-max uncertainty, pure uncertainty, and uncertainty plus, are proposed to produce reliable and accurate multi-label predictions. Validated on our manually annotated evaluation dataset, i.e., the multi-label annotated FER2013, the proposed Uncertainty Flow exhibited superiority over conventional multi-label learning algorithms and multi-label compatible neural networks in multi-label facial expression analysis. The success of the proposed Uncertainty Flow provides a glimpse of the future of continuous, uncertain, and multi-label affective computing.


Challenges in Affective Facial Analysis
Affective facial analysis, which is regarded as one of the most primitive functions in vivo, has yet to be successfully implemented in machines. Previous attempts at this goal focused on improving the accuracy of emotion classification tasks, and less attention was paid to what makes affective classification unique: the intrinsic ambiguity of emotion per se. The same facial expression may be interpreted differently depending on its associated context and its spatial and temporal cues [1,2]. Hence, affective classification is, in a nutshell, ambiguous [3]. Past efforts to resolve this ambiguity have aimed at lowering the single-label dependency in producing emotion categories [4]. To lower the single-label dependency further, one stream of research aims at 'softening' the label space by producing soft labels, which allows affective prediction along a continuous axis. That is, a discrete label is relaxed into a partially continuous one. As in Bai et al. [5], pseudo soft labels can be crafted by a continuous approximation to the original labels. However, this relaxation trick merely provides a provisional resolution of the ambiguity (cf. [5]).
Instead of the foregoing proximal solution (hard label relaxation), we suggest a distal approach: extending single-label discrimination to the multi-label domain. Research on multi-label affective discrimination is also in line with the finding that the decision boundaries among classes are less distinct in affective analysis than in other categorisation problems, e.g., object classification [2,3]. Given previous research on multi-label classification in general, it appears straightforward to extend affective computing in this direction. However, one difficulty hinders the successful application of multi-label affective recognition: collecting multi-label training data is laborious and expensive [6].

Uncertainty Flow in Zero-Shot Multi-Label Learning
To combat the scarcity of multi-label training data, unlike conventional approaches, we resort to inductive transfer learning [7], which allows knowledge to be distilled from a source task, i.e., a single-label affective learning task, and applied to a similar but more complex target task, i.e., a multi-label affective discrimination task. Instead of transferring mere knowledge, i.e., the model parameters, between source and target tasks, we propose the Uncertainty Flow framework to transfer the model uncertainty between tasks. The crux of the proposed Uncertainty Flow lies in the quality of uncertainty quantification. To measure this quantity, Bayesian neural networks, rather than non-Bayesian ones, are employed. A Bayesian neural network, i.e., a neural network recast under direct probabilistic modelling, replaces the single point estimate of each model parameter with a distribution over that parameter. It thereby allows the production of real probabilistic outputs, i.e., model uncertainty [8]. In contrast with conventional implementations of Bayesian neural networks, we further suggest the use of weakly informative priors, e.g., uniform and Cauchy priors, to improve the final production of model uncertainty.
The article is organised as follows. We first sketch the proposed Uncertainty Flow framework and describe its four core components, namely the Bayesian neural networks (more precisely, two hierarchical Bayesian neural networks); our suggested weakly informative priors; the quantification of model uncertainty; and three prediction related uncertainty indexes, i.e., soft-max uncertainty, pure uncertainty, and uncertainty plus. To demonstrate the effectiveness of the proposed framework, we then present the results of a large-scale comparative experiment. This experiment contains three levels of comparison, i.e., the comparison among models, the comparison among different priors, and the comparison among the three uncertainty indexes. The observed gain in discriminability, i.e., a 20 to 30 percent performance enhancement, demonstrates the effectiveness of the proposed Uncertainty Flow framework.
The contributions of this research are as follows: (1) We develop a novel inductive transfer learning [3] based computational framework that allows multi-label affective prediction from a single evoked source. (2) Unlike conventional inductive transfer learning, the proposed Uncertainty Flow focuses on model uncertainty rather than on the mere model weights in knowledge distillation. (3) To obtain the model uncertainty, we propose the use of weakly informative priors, e.g., uniform and Cauchy priors, rather than the conventionally used informative priors. (4) To further improve the discriminability of Uncertainty Flow, two advanced prediction related uncertainty indexes, i.e., pure uncertainty and uncertainty plus, are also suggested in this research.

Previous Works on Affective Learning
Past work on neural network based affective computing has focused on segmenting a single facial expression into finer sub-components, which can be achieved via an added principal component analysis (PCA) [9] or complex feature pre-processing, e.g., the introduction of Sobel filters [10]. However, the complexity of emotional representation demands that affective analysis move beyond single-label categorisation. Research on multi-label learning has been divided into two streams: problem transformation and algorithm adaptation [11]. The former approach reduces a multi-label learning problem to a single-label one. Two widely applied problem transformation algorithms are binary relevance [12] and the hierarchical multi-label classifier, also known as ML-ARAM [13]. The latter approach tackles multi-label learning directly via a reconstructed loss function. Within this scope, the representative models range from the k-nearest neighbour related ML-KNN [14] to label relevance based multi-label neural networks [15].
Despite the bulk of research on multi-label learning in general, its applications to affective computing are rarely documented. To fill this research gap, Mower et al. [16,17] proposed a feature-agglomerate extraction method to encompass all distinctive emotions that appear in a single prediction. Their approach coincides with the foregoing ML-ARAM model in ensuring structured multi-label predictions. However, their claimed confidence rating, i.e., the computed Euclidean distance between the input space and the feature hyperplane, is merely a metric that indexes the importance of a feature. Another study that applied multi-label learning to affective classification relied on a novel regularisation to further penalise the max margin loss [18]. In spite of its claimed effectiveness in extracting multi-label affective features, the success of the proposed Group LASSO regulariser depended heavily on a manual and recursive feature extraction process.

Previous Works on Bayesian Neural Networks
Previous efforts in developing Bayesian neural networks need to be mentioned here. Deep neural networks suffer from an inability to produce authentic probabilistic outputs [19]. In the literature, the history of probabilistic neural networks can be dated back to the early proposal of using the 'soft-max' function to transform a real-valued prediction into a probabilistic one [20,21]. The mathematical role of this added 'soft-max' function is to normalise all real-valued outputs into the [0, 1] range. However, the 'soft-max' function alone is not sufficient to provide a real probabilistic account of each classification prediction [8,22]. Therefore, the production of real probabilistic outputs demands binding a neural network with a direct probabilistic model. The resultant model is a Bayesian neural network. Bayesian neural networks, however, have long been criticised for their imprecise prior-to-posterior inferences and unreliable posterior sampling in practice. Thanks to recent advances in variational inference, i.e., rapid and precise variational methods that tackle intractable posterior inference, scalable training of Bayesian neural networks has become possible [23].
Since an exhaustive review of Bayesian neural networks is beyond our scope, we focus on the priors in Bayesian neural networks. In general, a prior can be classified as either informative or non-informative. Despite fruitful research on informative priors, e.g., Gaussian and Laplace priors [24], the work on non-informative or weakly informative priors is still at an early stage [25]. Early research on non-informative priors, e.g., the Jeffreys prior and the reference prior, emphasised the invariance properties of non-informative priors (i.e., the prior is invariant under any differentiable transformation of the input variables) [26,27]. However, these non-informative priors are neither applicable to multi-parameter modelling nor asymptotically consistent in deriving the posterior [28]. To bridge the gap between informative and non-informative priors, the use of weakly informative priors, i.e., semi-flat priors, has been proposed in the literature [29]. The practical advantages of weakly informative priors over informative ones have been observed in other Bayesian models, e.g., the generalised linear model [29].

Uncertainty Flow Framework
As sketched in Figure 1, the proposed Uncertainty Flow framework consists of four components, i.e., the dual Bayesian neural networks, the weakly informative priors, the derived model uncertainty, and the prediction related uncertainty indexes. The pipeline of Uncertainty Flow starts with the standard supervised training of a source Bayesian neural network (BNN) with a weakly informative prior, e.g., a uniform or Cauchy prior. The weight posterior is then computed to obtain the model uncertainty of the source BNN, and this distilled model uncertainty is transferred to a target BNN, which is specialised in producing multi-label predictions. Finally, three distinct prediction related uncertainty indexes are introduced to refine the final outputs of the target BNN.

I. Bayesian Neural Network
The core of the proposed Uncertainty Flow is the dual BNNs. Under Bayesian learning, a deep neural network, i.e., a stack of multiple non-linear transformations of affine computations, is perceived as a sequence of layer-wise prior-to-posterior inferences. To make the model flexible enough, we resort to a hierarchical architecture. In a standard classification set-up, where both input and output variables are observable, i.e., $\{(x_n, y_n)\}$, and each input comprises D features, i.e., $x_n \in \mathbb{R}^D$, the likelihood functions for our dual BNNs are specified in Equations (1) to (3). For simplicity, only the tanh non-linearity is considered as the activation function in the BNNs. Note that we tailor the target BNN to the different task demand: to allow a Bayesian neural network to produce multiple outputs, the soft-max function in the source BNN is replaced with a real-valued function, e.g., the sigmoid function, in the target BNN.
Armed with the foregoing likelihood functions, a full hierarchical Bayesian neural network is derived in Equations (4) to (6). Here, we restrict our discussion to the simplest version of a hierarchical Bayesian neural network, which contains one hyper-parameter, i.e., σ. This hyper-parameter directly controls the variance of the prior in producing the weight posterior.
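The difference between the two output layers can be illustrated with a minimal sketch. It is a deterministic forward pass only (no priors, no posterior inference), and the layer sizes, the `params` dictionary, and the function names are hypothetical choices made for illustration rather than the authors' implementation.

```python
import math

def dense(inputs, weights, biases):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def forward(inputs, params, head):
    """One tanh hidden layer followed by a task-specific output layer.

    head="source": soft-max, outputs compete and sum to one (single label).
    head="target": element-wise sigmoid, each label is scored independently.
    """
    hidden = [math.tanh(h) for h in dense(inputs, params["w1"], params["b1"])]
    logits = dense(hidden, params["w2"], params["b2"])
    if head == "source":
        return softmax(logits)
    if head == "target":
        return [sigmoid(v) for v in logits]
    raise ValueError("head must be 'source' or 'target'")

# Hypothetical toy parameters: 2 input features, 2 hidden units, 3 labels.
params = {
    "w1": [[0.5, -0.2], [0.1, 0.8]], "b1": [0.0, 0.1],
    "w2": [[1.0, -1.0], [0.3, 0.7], [-0.5, 0.2]], "b2": [0.0, 0.0, 0.0],
}
x = [1.0, 2.0]
source_out = forward(x, params, "source")
target_out = forward(x, params, "target")
```

Because both heads share the same weights, the ranking of the labels is identical; only the normalisation differs, which is what lets the target network assign a high score to several labels at once.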

II. Weakly Informative Priors
The prior, which determines the first and second order statistics of the model parameters, is de facto the driving force in Bayesian learning. Hence, the proper specification of a model is closely tied to the choice of an applicable prior for a given task. Unfortunately, the majority of work on Bayesian learning pays overwhelming attention to priors that are informative and conjugate, because of their analytical convenience, while the family of uninformative and weakly informative priors has been largely ignored.
As argued in [30], in contrast to the conventional implementation of Bayesian neural networks with a commonly used informative prior, the use of a weakly informative prior, i.e., a semi-flat prior, yielded superior predictive performance in single-label discrimination. It is therefore rational to extend this finding to multi-label learning. The formal definitions of informative and weakly informative priors are rendered below, and their corresponding probability density curves are plotted in Figure 2. Their different effects on a simple simulation are shown in Figure 3.
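As a rough illustration of how the three prior families differ, the sketch below evaluates their log-densities for a single weight. The location, scale, and bound values are hypothetical defaults, not the settings used in the experiment.

```python
import math

def normal_logpdf(w, mu=0.0, sigma=1.0):
    """Informative prior: Gaussian log-density."""
    z = (w - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def cauchy_logpdf(w, loc=0.0, scale=1.0):
    """Weakly informative prior: Cauchy log-density (heavy tails)."""
    z = (w - loc) / scale
    return -math.log(math.pi * scale) - math.log1p(z * z)

def uniform_logpdf(w, lower=-1.0, upper=1.0):
    """Weakly informative prior: flat inside [lower, upper], zero mass outside.
    The bounds here are assumed for illustration only."""
    if lower <= w <= upper:
        return -math.log(upper - lower)
    return float("-inf")
```

At a weight far from zero, e.g., `w = 5.0`, the Cauchy log-density is much larger than the Gaussian one, which is the sense in which the Cauchy prior is 'semi-flat': it penalises large weights far less severely. The uniform prior does not penalise at all inside its support but excludes everything outside it.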

III. Model Uncertainty
To obtain the model uncertainty, we rely on a quantitative analysis of the posterior predictive distribution under Bayesian neural networks. As producing the posterior predictive distribution entails computing the intractable parameter posterior, the common way is to approximate it by minimising the KL-divergence [23,31], $\mathrm{KL}(q \| p) = \sum_{\theta} q(\theta) \log \frac{q(\theta)}{p(\theta|x)}$, between the approximating variational distribution, i.e., $q(\theta)$, and the true posterior, i.e., $p(\theta|x)$. Solving for the optimal posterior, i.e., $q^{*}(\theta)$, thus becomes the minimisation of this divergence over the variational family. As the KL-divergence cannot be computed directly, the common approach is to optimise an alternative objective, i.e., to maximise the evidence lower bound (ELBO). The ELBO can be seen as the sum of the expected log likelihood of the data and the negative divergence between the variational distribution and the model prior. It is then customary to use the mean-field variational family, which is defined separately for each latent model parameter θ, to complete the specification of the above-mentioned optimisation. Hence, finding the intractable posterior reduces to a coordinate ascent optimisation that obtains the optimal $\theta^{*}$ by maximising the ELBO (cf. Algorithm 1 in [23] for a detailed review). The learned optimal parameter posterior of the source BNN, i.e., $p(\theta_s|x)$, serves as a surrogate for the parameter posterior of the target BNN, i.e., $p(\theta_t|x)$ is set to $p(\theta_s|x)$. The model uncertainty, which is distilled from the source task, thereby flows to the target task.
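For reference, the standard textbook forms of the quantities described above can be written as follows (cf. [23]). The symbols $\mathcal{Q}$ for the variational family and $J$ for the number of latent parameters are notational assumptions introduced here; the numbering and exact notation of the original equations may differ.

```latex
\begin{align}
q^{*}(\theta) &= \arg\min_{q \in \mathcal{Q}}
    \mathrm{KL}\big(q(\theta)\,\|\,p(\theta \mid x)\big), \\
\log p(x) &= \mathrm{ELBO}(q)
    + \mathrm{KL}\big(q(\theta)\,\|\,p(\theta \mid x)\big), \\
\mathrm{ELBO}(q) &= \mathbb{E}_{q}\big[\log p(x \mid \theta)\big]
    - \mathrm{KL}\big(q(\theta)\,\|\,p(\theta)\big), \\
q(\theta) &= \prod_{j=1}^{J} q_{j}(\theta_{j}).
\end{align}
```

The second line shows why maximising the ELBO is equivalent to minimising the KL-divergence to the true posterior: the log evidence $\log p(x)$ does not depend on $q$, so any increase in the ELBO is matched by an equal decrease in the divergence.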

IV. Prediction Related Uncertainty Indexes
With the flowed parameter posterior in the target task, i.e., $p(\theta_t|x)$, it is feasible to form the posterior predictive distribution for each novel observation, i.e., $x_{new}$. Armed with this posterior predictive distribution, prediction related uncertainty indexes can be produced. As discussed at length in the previous literature [20,32], one prevailing claim is that probabilistic outputs can be produced by the soft-max function (which is often placed in the final layer of a neural network to 'push' the real-valued predictions into pseudo-probabilistic outputs), averaged over repeated forward propagations of the new observation in either Bayesian or non-Bayesian neural networks. This type of probabilistic output merely tells the most probable output given the input, not how uncertain the prediction is. For comparative purposes, we refer to this type of uncertainty index as soft-max uncertainty. It has previously been approximated by averaging T forward propagations of the input through the model [33], as expressed in Equations (17) and (18). From the Bayesian learning perspective, soft-max uncertainty reflects the belief that the predictive mean can index the prediction uncertainty. Numerically, this type of uncertainty index captures merely the class type in multi-class discrimination and is equivalent to the class-type probability in non-Bayesian neural networks. However, as the predictive mean does not capture the full picture of the parameter posterior distribution, we draw our attention to the predictive variance instead.
We argue that the resulting predictive variance reflects the degree of uncertainty associated with each prediction. As a result, based on the approximated weight posterior, i.e., $p(\theta|x)$, a better prediction related uncertainty index is expressed in Equation (19). We denote this prediction uncertainty index as pure uncertainty.
Going one step further, rather than treating pure uncertainty and soft-max uncertainty as a dichotomy, the two indexes can be fused, which allows the uncertainty index to reflect both the class-type probabilistic prediction and the model uncertainty associated with each prediction. Consistent with the previous naming convention, this type of uncertainty index is called uncertainty plus. Under uncertainty plus, each class-type probabilistic prediction is proportionately adjusted according to its associated prediction uncertainty, as expressed in Equation (20). For illustrative purposes, the influence of each of the three uncertainty indexes, i.e., soft-max uncertainty, pure uncertainty, and uncertainty plus, on a simple binary classifier is demonstrated in Figure 4.
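The three indexes can be sketched on a toy one-layer model as follows. This is an illustration under stated assumptions, not the authors' formulation: the weight posterior is replaced by a hypothetical factorised Gaussian, sigmoid outputs stand in for the target BNN, and, because Equation (20) is not reproduced here, the fusion rule in `uncertainty_plus` (shrinking each predictive mean by its variance relative to the maximum possible variance of 0.25) is an assumed form chosen only to show the idea of a proportional adjustment.

```python
import math
import random

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def sample_predictions(x, weight_mean, weight_std, n_samples, rng):
    """Run n_samples stochastic forward passes.

    Each pass draws one weight per (label, feature) from a factorised
    Gaussian, a hypothetical stand-in for the learned weight posterior.
    """
    draws = []
    for _ in range(n_samples):
        outputs = []
        for mu_row, sd_row in zip(weight_mean, weight_std):
            logit = sum(rng.gauss(mu, sd) * xi
                        for mu, sd, xi in zip(mu_row, sd_row, x))
            outputs.append(sigmoid(logit))
        draws.append(outputs)
    return draws

def predictive_mean(draws):
    """Per-label average over the passes (predictive mean)."""
    n = len(draws)
    return [sum(d[k] for d in draws) / n for k in range(len(draws[0]))]

def predictive_variance(draws):
    """Per-label variance over the passes (predictive variance)."""
    means = predictive_mean(draws)
    n = len(draws)
    return [sum((d[k] - means[k]) ** 2 for d in draws) / n
            for k in range(len(means))]

def uncertainty_plus(means, variances):
    """ASSUMED fusion rule, not taken from the paper: scale each mean by
    (1 - variance / 0.25); 0.25 is the largest variance a value in [0, 1]
    can have, so the factor stays within [0, 1]."""
    return [m * (1.0 - v / 0.25) for m, v in zip(means, variances)]

rng = random.Random(0)
draws = sample_predictions(
    x=[1.0, -0.5],
    weight_mean=[[2.0, 0.0], [0.0, 0.0]],   # label 0 confident, label 1 not
    weight_std=[[0.1, 0.1], [1.0, 1.0]],
    n_samples=200,
    rng=rng,
)
means = predictive_mean(draws)
variances = predictive_variance(draws)
fused = uncertainty_plus(means, variances)
```

In this toy setting the second label has a wide weight posterior, so its predictive variance is much larger and its fused score is shrunk more strongly than that of the first label.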

Experiment
Relying on the transferred model uncertainty, the proposed Uncertainty Flow framework allows a learner to output multi-label predictions under a single-label training curriculum. The empirical validation of our framework addresses two questions: first, whether the proposed Uncertainty Flow is superior to conventional multi-label learning algorithms in facilitating zero-shot multi-label learning; and second, which uncertainty index leads to the most significant performance gain. In addition, to investigate the role of the suggested weakly informative priors, we instantiate Uncertainty Flow with three types of priors. Hence, in total, there are three levels of comparison in our experiment, i.e., model comparison, prior comparison, and uncertainty comparison. The entire experiment is written in Python, using the Theano [34] and PyMC3 [35] libraries. Part of the code used to produce this study is available from: https://github.com/LeonBai/Uncertainty-Flow.

Training Dataset
We selected the first 1500 images from FER2013 [36] as our training dataset. The size of the training dataset was intentionally reduced in order to enforce similar model complexity between the source BNN and the target BNN (cf. Figure 1). We leave the relaxation of this restriction to future research. FER2013 is a well-researched public dataset, which comprises facial expression images for pictorial sentiment discrimination. Prior to our implementation, all images in this truncated version of FER2013 had gone through standard preprocessing, e.g., the faces were centred and properly registered, and the image size was standardised to 48 by 48 pixels. We then normalised the pixel values of the input images. The original FER2013 images are each labelled with one of seven emotion categories, i.e., angry, disgust, fear, happy, sad, surprise, and neutral.

Testing Dataset
To evaluate the multi-label predictions, it is imperative to rely on existing benchmark annotations. Unfortunately, there are currently no reliable multi-label annotations for FER2013 facial expressions. We therefore conducted a small-scale experiment, i.e., on 200 images, to manually annotate a multi-label version of FER2013. The descriptive statistics of this annotated multi-label testing dataset are summarised in Appendix A, and the raw data can be found at https://github.com/LeonBai/Uncertainty-Flow. A preliminary statistical test revealed high similarity between the original single-label annotations and the resulting multi-label annotations: treating the original single-label FER2013 annotations as ground truth, the overlap between the multi-label annotations and the ground truth reached 75%. The Fleiss' kappa coefficient [37], a measure of inter-rater reliability that ranges from −1 to 1 (higher is more reliable), was 0.25, which we consider acceptable. The annotated multi-label version of FER2013 can therefore serve as our testing dataset.
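For readers unfamiliar with the inter-rater statistic, the following sketch computes the textbook Fleiss' kappa from a table of rating counts. How the multi-label annotations were encoded into such a table is not specified in this article, so the input format below is an assumption.

```python
def fleiss_kappa(counts):
    """Textbook Fleiss' kappa.

    counts[i][j] is the number of raters who assigned subject i to
    category j. Every subject must be rated by the same number of raters
    (at least two), and the ratings must not all fall into one category.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of raters")
    n_categories = len(counts[0])

    # Observed agreement per subject, then averaged over subjects.
    p_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_subject) / n_subjects

    # Agreement expected by chance from the overall category proportions.
    p_category = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in p_category)

    return (p_observed - p_expected) / (1.0 - p_expected)
```

For example, three raters who fully agree on two subjects (`[[3, 0], [0, 3]]`) give a kappa of 1, whereas a 2-to-1 split on every subject (`[[2, 1], [1, 2]]`) gives a negative value, i.e., agreement below chance level.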

Models
An experiment that contains the above-mentioned three levels of comparison, i.e., model comparison, prior comparison, and uncertainty comparison, demands an explicit specification of all models involved. In the model comparison, four widely used multi-label learning algorithms are included, ranging from adaptation algorithms, e.g., Multi-Label k-Nearest Neighbour (MLkNN) and the Multi-label Neurofuzzy Classifier (ML-ARAM), to problem transformation algorithms, e.g., Binary Relevance (BR) and Label Powerset (LP). In addition, two multi-label compatible neural networks, i.e., a multi-label feedforward neural network (ML-FNN) and a multi-label convolutional neural network (ML-CNN), are also included in the model comparison.

Evaluation Metrics
Unlike single-label classification, which relies on one uniform metric, i.e., classification accuracy, multi-label classification has been evaluated with a diverse set of metrics. In line with the rough classification in [11], we adhere to the dichotomy of bipartition based and ranking based evaluation metrics. For illustrative purposes, assume that a multi-label evaluation dataset consists of inputs, i.e., $x_i$, and sets of true labels, i.e., $Y_i$, where $i = 1, \ldots, m$ and $Y_i \subseteq L$, with L being the set of all possible labels. Under this notation, the sets of predicted labels are denoted as $Z_i$, where $i = 1, \ldots, m$, while the rank predicted by a learning method for a label λ is denoted as $r_i(\lambda)$. The most relevant label receives the highest rank (1), while the least relevant one receives the lowest rank (q, with $q = |L|$).

Bipartition Based
Derived from single-label metrics, bipartition based metrics capture the differences between the actual and predicted sets of labels over the whole evaluation dataset. These differences can be computed in various ways, by averaging either over all samples or over all label sets.

Hamming Loss
Hamming loss is defined as $\mathrm{HL} = \frac{1}{m} \sum_{i=1}^{m} \frac{|Y_i \,\Delta\, Z_i|}{|L|}$, where Δ represents the symmetric difference of two sets, i.e., the predicted and true label sets. In contrast with over-strict measures of multi-label classification accuracy that have a low tolerance for partial label misclassification, e.g., the subset accuracy $\frac{1}{m} \sum_{i=1}^{m} I(Y_i = Z_i)$, the Hamming loss, which lies between 0 and 1, offers a milder criterion for a wider range of applications.
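The contrast between the two measures can be checked with a small sketch. Label sets are represented as Python sets; the function names and the example labels are illustrative.

```python
def hamming_loss(true_sets, pred_sets, n_labels):
    """Average size of the symmetric difference, normalised by |L|."""
    mismatches = sum(len(y ^ z) for y, z in zip(true_sets, pred_sets))
    return mismatches / (len(true_sets) * n_labels)

def subset_accuracy(true_sets, pred_sets):
    """Fraction of samples whose predicted set matches the true set exactly."""
    exact = sum(1 for y, z in zip(true_sets, pred_sets) if y == z)
    return exact / len(true_sets)

# Two samples over the seven FER2013 categories; the first prediction
# misses one of its two true labels.
true_sets = [{"happy", "surprise"}, {"sad"}]
pred_sets = [{"happy"}, {"sad"}]
loss = hamming_loss(true_sets, pred_sets, n_labels=7)
accuracy = subset_accuracy(true_sets, pred_sets)
```

Here the partially correct first prediction costs a full sample under subset accuracy (0.5) but only one label out of fourteen label decisions under Hamming loss (1/14).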

Micro-Averaged F-Score & Average Precision
Inherited from classic binary evaluation in information retrieval tasks, the F-score and average precision, which both combine precision and recall through averaging, are two readily applicable metrics in multi-label learning. Among the various averaging operations, e.g., macro, weighted, and micro, the preferred operation is the micro-average, as it gives each sample-class pair an equal contribution to the overall metric. Consider a binary evaluation measure $B(t_p, t_n, f_p, f_n)$ that is computed from the number of true positives $t_p$, true negatives $t_n$, false positives $f_p$, and false negatives $f_n$, and let $P_n$ and $R_n$ be the precision and recall at the nth threshold. With the counts summed over all labels, the micro-averaged F-score and the average precision score (AP) are given by $F_{micro} = \frac{2 \sum_{\lambda} t_{p\lambda}}{2 \sum_{\lambda} t_{p\lambda} + \sum_{\lambda} f_{p\lambda} + \sum_{\lambda} f_{n\lambda}}$ and $\mathrm{AP} = \sum_{n} (R_n - R_{n-1}) P_n$.
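A minimal sketch of both metrics follows, assuming the common definitions (micro-averaging pools the counts over all sample-label pairs; average precision sums precision weighted by the recall increment at each cut). The input formats are illustrative, and tied scores are not treated specially.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F-score: pool tp, fp, fn over every sample and label."""
    tp = sum(len(y & z) for y, z in zip(true_sets, pred_sets))
    fp = sum(len(z - y) for y, z in zip(true_sets, pred_sets))
    fn = sum(len(y - z) for y, z in zip(true_sets, pred_sets))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def average_precision(scored_pairs):
    """Micro-averaged AP over (score, is_relevant) sample-label pairs.

    Pairs are ranked by descending score; each relevant pair adds the
    precision at its rank, weighted by a recall step of 1 / n_relevant.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    n_relevant = sum(1 for _, relevant in ranked if relevant)
    if n_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / n_relevant

f_score = micro_f1([{"a", "b"}, {"c"}], [{"a"}, {"c", "d"}])
ap = average_precision([(0.9, True), (0.8, False), (0.7, True), (0.1, False)])
```

In the example, two true positives, one false positive, and one false negative give an F-score of 2/3, and the two relevant pairs at ranks 1 and 3 give an average precision of (1 + 2/3) / 2.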

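Coverage
To measure how far down the ranked list of labels one needs to go in order to cover all true labels of a sample, i.e., $Y_i$, we resort to the coverage error metric (reported below as Converge-Loss). It can be defined as $\mathrm{Coverage} = \frac{1}{m} \sum_{i=1}^{m} \max_{\lambda \in Y_i} r_i(\lambda) - 1$.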
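A short sketch of this metric, assuming the standard definition from [11] in which each sample contributes the worst (largest) rank among its true labels, minus one. Rankings are given as hypothetical dictionaries from label to rank, with rank 1 the most relevant, and every sample is assumed to have at least one true label.

```python
def coverage(true_sets, rankings):
    """Average number of extra steps down the ranking needed to reach
    every true label (0 means the true labels are all ranked first)."""
    depth = sum(max(ranks[label] for label in relevant) - 1
                for relevant, ranks in zip(true_sets, rankings))
    return depth / len(true_sets)

true_sets = [{"happy", "sad"}, {"sad"}]
rankings = [
    {"happy": 1, "fear": 2, "sad": 3},   # 'sad' is only reached at rank 3
    {"sad": 1, "happy": 2, "fear": 3},   # the true label is ranked first
]
score = coverage(true_sets, rankings)
```

The first sample needs two extra steps and the second none, so the coverage is 1.0.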

Ranking Loss
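The ranking loss targets the incorrect ordering of the predicted labels. Let $\bar{Y}_i$ denote the complement of $Y_i$ in L; the ranking loss can then be defined as $\mathrm{RL} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{|Y_i||\bar{Y}_i|} \left|\{(\lambda_a, \lambda_b) \in Y_i \times \bar{Y}_i : r_i(\lambda_a) > r_i(\lambda_b)\}\right|$.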
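The pairwise counting behind this metric can be sketched as follows, assuming the standard definition from [11]. Rankings are hypothetical dictionaries from label to rank (1 is the most relevant), and samples without any relevant or any irrelevant label are treated as contributing zero.

```python
def ranking_loss(true_sets, rankings):
    """Average fraction of (relevant, irrelevant) label pairs in which the
    irrelevant label is ranked ahead of the relevant one."""
    total = 0.0
    for relevant, ranks in zip(true_sets, rankings):
        irrelevant = set(ranks) - relevant
        if not relevant or not irrelevant:
            continue  # no pair to compare, so this sample adds nothing
        misordered = sum(1 for a in relevant for b in irrelevant
                         if ranks[a] > ranks[b])
        total += misordered / (len(relevant) * len(irrelevant))
    return total / len(true_sets)

true_sets = [{"happy", "sad"}, {"sad"}]
rankings = [
    {"happy": 1, "fear": 2, "sad": 3},   # 'fear' wrongly outranks 'sad'
    {"sad": 1, "happy": 2, "fear": 3},   # perfectly ordered
]
score = ranking_loss(true_sets, rankings)
```

In the first sample one of the two relevant-irrelevant pairs is misordered (0.5) and in the second none is, giving a ranking loss of 0.25.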

Results & Discussion
The overall results of our large-scale comparative experiment are presented in Table 1. For illustrative purposes, we grouped the results to highlight the comparison among different models. As we used various evaluation metrics to assess the performance of the corresponding models, it is difficult to reach a clear judgement based on a single metric. For example, when we pitted our approach, i.e., Uncertainty Flow, against the MLkNN approach among the conventional multi-label models, our approach, including all nine variations, was inferior to MLkNN on the Hamming-Loss metric. However, when we assessed the models on Average Precision, our approach largely outperformed MLkNN. Moreover, as we incorporated nine variations of Uncertainty Flow, an in-depth analysis of the prior types and uncertainty indexes is required. We therefore divide the discussion of the overall results into three parts, i.e., the results on model comparison, the results on prior comparison, and the results on uncertainty comparison.
Setting aside the factors of prior type and prediction related uncertainty index, the empirical comparison between the Uncertainty Flow framework and the alternatives showed mixed results, as illustrated in Figure 5. Previous findings from [11] and [6] stated that it is difficult in practice to find a single model or algorithm that is competitive enough to beat the others on every multi-label evaluation metric. Hence, it is imperative to investigate each loss metric independently. Focusing on Hamming-Loss (Hamming-ACC = 1 − Hamming loss), the conventional multi-label learning models are, interestingly, particularly good at minimising this type of loss. However, as indicated by Converge-Loss and Ranking-Loss, both Uncertainty Flow and the multi-label compatible neural networks, i.e., ML-FNN and ML-CNN, are superior to the conventional multi-label learning alternatives.
Under the two precision related metrics, i.e., F-score and average precision, with the help of a weakly informative prior, e.g., a uniform or Cauchy prior, and an advanced prediction related uncertainty index, e.g., pure uncertainty or uncertainty plus, both BNN and BCNN exhibited a clear performance advantage over their alternatives, i.e., ML-FNN, ML-CNN, and the conventional multi-label models. On average precision in particular, the nontrivial performance enhancement, i.e., an increase of over 20%, demonstrated the superior discriminability of the proposed Uncertainty Flow. Moreover, comparing the performance of BNN and BCNN, the convolutional architecture, i.e., BCNN, should be credited with an overall performance improvement.

Prior Comparison
To identify the most applicable prior for the proposed Uncertainty Flow, the comparison among the three candidate priors deserves a full investigation. As shown in Figure 6b, despite some similarities in shape, each prior clearly has its own effect in shaping the corresponding posterior distribution of the weights. Specifically, the uniform prior acts as a restriction on the approximated posterior weights, i.e., the posterior weights have to be higher than a fixed value, e.g., 1 in our implementation. This restriction may lower the discriminability of the model with the uniform prior, as shown in Figure 7. Interestingly, the posterior distributions of the weights under the normal and Cauchy priors have nearly identical shapes, as shown in Figure 6a,c. The minute difference between the two is the enlarged variance of the posterior distribution induced by the Cauchy prior. Although seemingly trivial, this difference in variance led to a discrepancy in discriminative performance, as shown in Figure 7. Overall, based on the final discriminability, the Cauchy prior is considered the most applicable prior for the proposed Uncertainty Flow.
The performance enhancement that is reflected in the above-mentioned 'clustered' effect in the weight posterior was observed when examining the discriminability of the three implemented priors. As plotted in Figure 7, focusing on the average precision metric, regardless of the variation in Bayesian neural networks, i.e., BNN or BCNN, the use of the Cauchy prior, as one kind of weakly informative prior, led to competitive multi-label affective classification. However, the uniform prior, the other implemented weakly informative prior, was inferior to the informative prior used here, i.e., the normal prior. This observed attenuation in discriminability under the uniform prior may be due to its above-mentioned spike-and-slab effect on the weight posterior, which requires extra training epochs to stabilise the prior-to-posterior inference.

Uncertainty Comparison
The most pronounced performance improvement pertains to the inclusion of the advanced prediction related uncertainty indexes, i.e., pure uncertainty and uncertainty plus.
To recall the foregoing definitions of the prediction related uncertainty indexes, soft-max uncertainty is a mere indication of the multi-class prediction type, which is equivalent to the predictions of the non-Bayesian alternatives. Pure uncertainty, on the contrary, depends heavily on the weight posterior and can be produced exclusively within the proposed Uncertainty Flow framework. As reflected in Figure 8, when the feedforward architecture was adopted, soft-max uncertainty was inferior to pure uncertainty in producing multi-label predictions. Interestingly, the convolutional architecture told a different story, i.e., the discriminability of pure uncertainty became inferior to that of soft-max uncertainty.
Not surprisingly, the combination of soft-max uncertainty and pure uncertainty, i.e., uncertainty plus, allows a set of multi-label predictions to be tuned according to its uncertainty value. As shown in Figure 8, the predictions that benefit from uncertainty plus are clearly superior to those from the other two uncertainty indexes, i.e., the improvement in average precision is over 20% compared with the other two indexes. Combining the most applicable weakly informative prior with the advanced uncertainty index, the two most efficient variants in the feedforward and convolutional architectures are BNN-cauchy-plus and BCNN-cauchy-plus, respectively. We leave the sensitivity analysis of the proposed advanced uncertainty indexes to future research.

Conclusions
Over-reliance on single-label affective learning hinders the realisation of automatic affective analysis. To escape this restriction, we resort to multi-label affective learning. However, current multi-label learning algorithms are not scalable enough due to the scarcity of multi-label training samples. To tackle this issue, we propose an inductive transfer learning based framework, i.e., Uncertainty Flow. Under this framework, we argue that the model uncertainty can be distilled from a source single-label recognition task. The distilled knowledge is then fed to a to-be-learned multi-label affective recognition task. For prediction, three types of uncertainty indexes, i.e., soft-max uncertainty, pure uncertainty, and uncertainty plus, are further proposed. For empirical validation, we conducted a large-scale comparative experiment on the manually annotated multi-label FER2013 dataset across three levels of comparison, i.e., model comparison, prior comparison, and uncertainty comparison. The observed performance superiority of Uncertainty Flow demonstrates the feasibility of applying this framework to zero-shot multi-label affective learning.
However, even with the permitted computational resources, running full Bayesian posterior inference remains a daunting task. How to speed up the posterior inference is an open research question. In terms of future research, two streams are worth exploring further. One focuses on improving the discriminability of the proposed Uncertainty Flow framework, which entails revising the mainstream mean-field based variational inference [23]. The other is to extend the current inductive transfer based framework to the transductive transfer domain [7], which has already been demonstrated in vivo [42].