Progressively Discriminative Transfer Network for Cross-Corpus Speech Emotion Recognition

Cross-corpus speech emotion recognition (SER) is a challenging task whose difficulty lies in the mismatch between the feature distributions of the training (source domain) and testing (target domain) data, which leads to performance degradation when the model deals with data from a new domain. Previous works explore domain adaptation (DA) to eliminate the domain shift between the source and target domains and have achieved promising performance in SER. However, these methods mainly treat cross-corpus tasks simply as a DA problem, directly aligning the distributions across domains in a common feature space. In this case, excessively narrowing the domain distance impairs the emotion discrimination of speech features, since it is difficult to maintain the completeness of the emotion space with an emotion classifier alone. To overcome this issue, we propose a progressively discriminative transfer network (PDTN) for cross-corpus SER, which enhances the emotion discrimination ability of speech features while eliminating the mismatch between the source and target corpora. In detail, we design two special losses in the feature layers of PDTN, i.e., an emotion discriminant loss $L_d$ and a distribution alignment loss $L_a$. By incorporating prior knowledge of speech emotion into feature learning (i.e., high- and low-valence speech emotion features have their respective cluster centers), we integrate a valence-aware center loss $L_v$ and an emotion-aware center loss $L_c$ into $L_d$ to guarantee discriminative learning of speech emotions beyond the emotion classifier alone. Furthermore, a multi-layer distribution alignment loss $L_a$ is adopted to more precisely eliminate the discrepancy between the feature distributions of the source and target domains. Finally, by optimizing PDTN with the combination of three losses, i.e., the cross-entropy loss $L_{ce}$, $L_d$, and $L_a$, we gradually eliminate the domain mismatch between the source and target corpora while maintaining the emotion discrimination of speech features. Extensive experimental results on six cross-corpus tasks over three datasets, i.e., Emo-DB, eNTERFACE, and CASIA, reveal that our proposed PDTN outperforms the state-of-the-art methods.


Introduction
Emotions reflect the psychological state of human beings and are usually manifested in physiological and psychological signals [1][2][3][4][5], e.g., facial expression, speech, and electroencephalogram (EEG). As a commonly used means of communication, speech contains rich emotional information. Therefore, making machines recognize the emotional states of speech, known as the speech emotion recognition (SER) task, is crucial for human-computer interaction (HCI). Generally, the task setting of SER assumes that the training and testing data come from the same corpus, which causes a model trained on one corpus to perform poorly on a new corpus. In recent years, the cross-corpus SER task, in which the training and testing data come from different corpora, has therefore attracted increasing attention; existing subspace learning methods address it by exploiting label information to construct a discriminative emotion feature space [6,10,13]. However, the linear mapping of subspace learning limits the representation ability of the features, which is one of its disadvantages. In addition, existing deep learning methods for cross-corpus SER still only consider eliminating the distribution shift between the source and target domains, while ignoring the preservation of emotion discrimination in speech features.
To address the above issues, we jointly consider preserving the emotion discrimination of speech features and eliminating the distribution discrepancy between the source and target domains, and integrate both objectives into a deep feature extractor. The benefit of this approach is that it enhances the emotion discriminativeness of speech while narrowing the distribution discrepancy between the two domains, such that emotion-discriminative and domain-invariant speech features can be obtained by training a deep end-to-end network.
Therefore, in this paper, we propose a progressively discriminative transfer network (PDTN) for cross-corpus SER. In PDTN, we adopt two special losses, i.e., the emotion discriminant loss $L_d$ and the distribution alignment loss $L_a$, in the high-level feature layers (i.e., the fc layers), where $L_d$ is combined with the emotion classification loss $L_{ce}$ to enhance the emotion discrimination of speech features, and $L_a$ decreases the distance between the feature distributions of the source and target domains. Specifically, $L_d$ contains a valence-aware center loss $L_v$ and an emotion-aware center loss $L_c$, which are inspired by the prior knowledge of speech emotions, i.e., speech emotion features of high and low valence have their respective cluster centers. Further, we utilize a multi-layer MMD in $L_a$ to measure the domain shift of the marginal distributions between the two domains. The proposed PDTN integrates the three losses, i.e., $L_a$, $L_d$, and $L_{ce}$, to progressively eliminate the inter-domain discrepancy and improve the emotion discriminativeness of speech features through end-to-end network training. Experimental results on three datasets, i.e., Emo-DB, eNTERFACE, and CASIA, demonstrate the superiority of our proposed PDTN over the comparison methods.
Overall, the contributions of this paper can be summarized in the following three points:
• This paper proposes a novel progressively discriminative transfer network for cross-corpus SER, which jointly considers eliminating the distribution discrepancy between the source and target domains and enhancing the emotion discrimination of speech features during deep feature learning. Thus, it avoids the dilemma of previous methods that consider only one of these two aspects.
• To the best of our knowledge, this is the first work to introduce the prior knowledge of speech emotions, i.e., that speech emotion features of high and low valence have their respective cluster centers, into deep feature learning to enhance the emotion discrimination of speech representations.
• We adopt the high-level features of the fc layers to perform a practical distribution discrepancy measurement over multi-layer features between the source and target domains through a multi-layer MMD metric.
The rest of this paper is organized as follows: Section 2 illustrates the proposed method in detail. Then, we conduct our experiments and discuss the results in Section 3. Finally, Section 4 concludes the paper and gives some points for future research.

The Proposed Method
In this section, we will illustrate the framework of PDTN in detail, shown in Figure 1, which can be divided into three parts, i.e., deep feature extraction, emotion discrimination preservation, and distribution discrepancy elimination.
Figure 1. The overview of the progressively discriminative transfer network (PDTN) for cross-corpus SER. PDTN first extracts the high-dimensional fc layer features (i.e., $f_1$, $f_2$, and $f_3$) of the source domain data $x^s$ and the target domain data $x^t$ through a DNN (i.e., AlexNet or VGGNet). Then, it uses the fc features to calculate the valence-aware center loss $L_v$ and the emotion-aware center loss $L_c$ in the emotion discriminant loss $L_d$, as well as the distribution alignment loss $L_a$. Finally, it predicts the emotion labels of the source samples for the emotion classification cross-entropy loss $L_{ce}$.

Deep Feature Extraction
Compared with traditional methods, deep learning has performed well in speech processing, e.g., SER, speech recognition, and speech enhancement. Especially for SER, DNNs (e.g., CNNs and RNNs) can extract high-level speech features with more discriminative emotion information. Therefore, following [18,19], we adopt a deep CNN (DCNN) as the backbone network of our proposed PDTN for deep feature extraction of speech emotion. Moreover, as a time-frequency representation of speech, the spectrogram is commonly used as the input of DCNNs instead of hand-crafted features.
To clearly illustrate the process of deep feature extraction, we first formalize the labeled training dataset as $D_s = \{x_i^s, l_i\}_{i=1}^{n_s}$ and the unlabeled testing dataset as $D_t = \{x_j^t\}_{j=1}^{n_t}$, where $x_i^s$ and $x_j^t$ denote the spectrograms of the $i$th speech sample in the source data and the $j$th speech sample in the target data, $l_i$ represents the emotion label of the $i$th speech sample in the source dataset, and $n_s$ and $n_t$ are the numbers of source and target samples, respectively. Notably, since our proposed method is based on unsupervised domain adaptation (UDA) in transfer learning (TL), the target speech samples have no labels.
Then, the spectrogram features of the source and target datasets are fed into the DCNNs to extract high-level representations of speech emotions. In this paper, we select AlexNet and VGGNet as the alternative backbones of the proposed PDTN to evaluate the method's performance. Through the backbones, the spectrograms $x$ are encoded in the time and frequency domains by a series of stacked convolutional layers, and then pass through several fully connected (fc) layers to obtain the high-dimensional emotional semantic features $\{f_k\}_{k=1}^{n_l}$, where $f_k^s$ and $f_k^t$ represent the features of the source and target datasets in the $k$th fc layer, and $n_l$ is the number of fc layers. Eventually, the extraction of the high-level emotion feature $f_k = [f_k^s, f_k^t]$ in the $k$th fc layer of the backbone network $G_f(\cdot)$ can be formalized as

$$f_k = G_f(x; \theta_f), \quad k = 1, \dots, n_l, \tag{1}$$

where $\theta_f$ denotes the parameters of the feature extraction network $G_f(\cdot)$. The number of fc layers in both AlexNet and VGGNet is set to 3 in this paper.
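To make this step concrete, the following minimal PyTorch sketch shows how the three fc-layer features $f_1$, $f_2$, and $f_3$ might be exposed from a torchvision VGG-11 backbone; the wrapper class, its rebuilt fc head, and all names are our own illustration under the paper's stated layer sizes, not the authors' released code (an AlexNet variant would follow the same pattern).

```python
# A hedged sketch of the deep feature extractor G_f: the torchvision
# classifier head is rebuilt so that each fc activation can be returned.
import torch
import torch.nn as nn
from torchvision.models import vgg11

class FcFeatureExtractor(nn.Module):
    """Wraps VGG-11 so a forward pass returns the features of all three
    fully connected layers, i.e., f_1, f_2, and f_3 (the class logits)."""
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = vgg11(weights=None)
        self.conv = backbone.features                # stacked conv blocks
        self.pool = backbone.avgpool
        self.fc1 = nn.Sequential(nn.Linear(512 * 7 * 7, 4096), nn.ReLU(), nn.Dropout())
        self.fc2 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout())
        self.fc3 = nn.Linear(4096, num_classes)      # logits serve as f_3

    def forward(self, x: torch.Tensor):
        h = self.pool(self.conv(x)).flatten(1)
        f1 = self.fc1(h)
        f2 = self.fc2(f1)
        f3 = self.fc3(f2)
        return f1, f2, f3

# Spectrograms resized to 224 x 224 with 3 channels to match the VGG stem.
x = torch.randn(8, 3, 224, 224)
f1, f2, f3 = FcFeatureExtractor(num_classes=5)(x)
```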

Emotion Discrimination Preservation
In cross-corpus SER, after extracting the deep speech emotion features $f_k$, the common practice is either to input these high-dimensional features into a fully connected network-based classifier for emotion recognition or to align the feature distributions of the source and target domains [14,15,18]. However, since speech emotion is easily disturbed by other factors, e.g., background noise, speaker identity, and language, the emotion features are often entangled with the features of these factors [10][11][12]. Therefore, in cross-corpus SER, when the feature distributions between domains are aligned, a single emotion classifier alone cannot effectively disentangle the emotion information from the confounding features in a sufficiently complete feature subspace, which damages the emotion discrimination of speech features during feature generalization learning. To address this issue, we introduce emotion discrimination preservation learning of speech features into the distribution alignment process, which can decouple independent emotion features in the common feature space using the prior knowledge of emotions.
As is well known, emotions can be represented on the two-dimensional arousal-valence emotion wheel [20][21][22], shown in Figure 2. Each of the seven emotions (i.e., angry, disgust, fear, happy, neutral, sad, and surprise) has a specific position on the arousal and valence axes of the emotion wheel. According to these positions, the prior knowledge of emotion categories can be stated as follows: on the valence axis, the seven emotions are divided into the negative-valence group (i.e., angry, disgust, fear, sad, and surprise), the neutral-valence group (i.e., neutral), and the positive-valence group (i.e., happy). Under this grouping, emotions in the same group are naturally near each other on the valence axis, indicating that their class centers are relatively close; conversely, emotions in different groups have distant class centers. Therefore, we introduce this prior knowledge of emotion categories into deep feature learning to maintain the emotion discrimination of speech features. Specifically, we design a valence-aware center loss $L_v$ to model the emotion similarity inside groups and dissimilarity across groups, which can be denoted as

$$L_v = \frac{1}{n_b} \sum_{i=1}^{n_b} \max\left( \left\| f_k^{s,i} - v_{l,i} \right\|_2^2 - \alpha_1, 0 \right) + \max\left( \alpha_2 - \left\| v_b^n - v_b^p \right\|_2^2, 0 \right), \tag{2}$$

where $n_b$ is the mini-batch size; $f_k^{s,i}$ represents the $k$th fc layer feature of the $i$th speech sample in the source dataset; $v_b^n$ and $v_b^p$ are the mini-batch feature centers of the negative-valence emotion group $N = \{\text{angry, disgust, fear, sad, surprise}\}$ and the positive-valence emotion group $P = \{\text{neutral, happy}\}$, respectively; $v_{l,i}$ is the feature center of the emotion group to which the $l$th class of the $i$th speech sample belongs; and $\alpha_1$ and $\alpha_2$ are the thresholds that adjust the feature distances within and between groups, respectively. The feature center $v_{l,i}$ can be obtained as follows:

$$v_{l,i} = \begin{cases} v_n, & l_i \in N \\ v_p, & l_i \in P \end{cases}, \tag{3}$$

where $v_n$ and $v_p$ are the global centers of the negative-valence and positive-valence emotion groups over the whole source data, which are calculated during the parameter updating in Algorithm 1. Moreover, the negative-valence feature center $v_b^n$ and the positive-valence feature center $v_b^p$ in each mini-batch can be denoted as

$$v_b^n = \frac{1}{n_b^N} \sum_{i:\, l_i \in N} f_k^{s,i}, \tag{4}$$

$$v_b^p = \frac{1}{n_b^P} \sum_{i:\, l_i \in P} f_k^{s,i}, \tag{5}$$

where $n_b^N$ and $n_b^P$ are the numbers of speech samples belonging to $N$ and $P$ in a mini-batch, respectively.

In addition to the valence-based center loss, we also construct a fine-grained emotion discrimination preservation strategy that makes full use of the prior information of each emotion category. Specifically, we design a novel emotion-aware center loss $L_c$, which decreases the intra-class distance and increases the inter-class distance in the source data, represented as follows:

$$L_c = \frac{1}{n_b} \sum_{i=1}^{n_b} \max\left( \left\| f_k^{s,i} - c_i \right\|_2^2 - \alpha_1, 0 \right) + \sum_{p \neq q} \max\left( \alpha_2 - \left\| c_b^p - c_b^q \right\|_2^2, 0 \right), \tag{6}$$

where $c_i$ is the feature center of the emotion category corresponding to the $i$th speech sample over the whole source data, implemented in detail in Algorithm 1; $\alpha_1$ and $\alpha_2$ are again the thresholds that adjust the distances; and $c_b^p$ and $c_b^q$ are the mini-batch feature centers of the $p$th and $q$th emotion categories, where $c_b^q$ can be formalized as

$$c_b^q = \frac{1}{n_b^q} \sum_{i:\, l_i = q} f_k^{s,i}, \tag{7}$$

where $n_b^q$ is the number of speech samples in a mini-batch corresponding to the $q$th emotion category. The formalization of $c_b^p$ is similar to that of $c_b^q$.

Consequently, we combine $L_v$ and $L_c$ in deep feature learning to ensure the discrimination of emotions from coarse to fine during distribution discrepancy elimination. The emotion discrimination preservation loss can therefore be represented as

$$L_d = \lambda L_v + \gamma L_c, \tag{8}$$

where $\lambda$ and $\gamma$ are the tradeoff parameters that balance the two losses.
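To make the two losses concrete, the sketch below gives one plausible PyTorch implementation of $L_v$ and $L_c$ as reconstructed from Equations (2)-(7); the hinge formulation, helper names, and default thresholds are our reading of the text, not the authors' verified code.

```python
# Hedged sketches of the valence-aware loss L_v and emotion-aware loss L_c.
import torch
import torch.nn.functional as F

def valence_center_loss(f, labels, neg_ids, pos_ids, alpha1=0.01, alpha2=0.1):
    """L_v: pull each feature toward its valence-group center (within
    alpha1) while pushing the two group centers apart (beyond alpha2)."""
    neg_mask = torch.isin(labels, neg_ids)
    pos_mask = torch.isin(labels, pos_ids)
    v_n = f[neg_mask].mean(dim=0)                     # mini-batch negative center
    v_p = f[pos_mask].mean(dim=0)                     # mini-batch positive center
    centers = torch.where(neg_mask.unsqueeze(1), v_n, v_p)
    intra = F.relu((f - centers).pow(2).sum(dim=1) - alpha1).mean()
    inter = F.relu(alpha2 - (v_n - v_p).pow(2).sum())
    return intra + inter

def emotion_center_loss(f, labels, alpha1=0.01, alpha2=0.1):
    """L_c: per-class mini-batch centers, compact within each class and
    separated across classes."""
    classes = labels.unique()
    centers = torch.stack([f[labels == q].mean(dim=0) for q in classes])
    idx = (labels.unsqueeze(1) == classes.unsqueeze(0)).float().argmax(dim=1)
    intra = F.relu((f - centers[idx]).pow(2).sum(dim=1) - alpha1).mean()
    d2 = (centers.unsqueeze(0) - centers.unsqueeze(1)).pow(2).sum(dim=-1)
    off_diag = d2[~torch.eye(len(classes), dtype=torch.bool)]
    inter = F.relu(alpha2 - off_diag).mean()
    return intra + inter
```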

Algorithm 1: Parameter optimization of PDTN.

Input: the spectrograms of the source and target data $\{x_i^s\}_{i=1}^{n_s}$ and $\{x_j^t\}_{j=1}^{n_t}$; the training labels of the source data $\{l_i\}_{i=1}^{n_s}$; the fc layers $[fc_1, fc_2, fc_3]$; the learning rate $l_r$; and the trade-off parameters $\lambda$, $\gamma$, and $\mu$.
Initialize: $\theta_f$, $\theta_c$ randomly.
Output: the optimized parameters $\hat{\theta}_f$, $\hat{\theta}_c$.
while the total loss $L_{total}$ has not converged and $iter_n < maxIter$ do
(1) Generate a mini-batch of source and target data;
(2) Extract the high-level features of the source and target data;
(3) Calculate the mini-batch valence-group centers $v_b^n$ and $v_b^p$ by Equations (4) and (5);
(4) Calculate the mini-batch feature center $c_b^q$ of the $q$th class by Equation (7);
(5) if this is the first iteration: initialize the global centers $v_n$, $v_p$, and $c_q$ (or $c_p$) over the whole source data using steps (3) and (4); else: update them;
(6) Calculate $L_v$, $L_c$, $L_a$, $L_{ce}$, and $L_{total}$ using Equations (2), (6), and (10)-(12), respectively;
(7) Update the parameters $\theta_f$ and $\theta_c$ by gradient descent with learning rate $l_r$.
end while

Distribution Discrepancy Elimination
Besides learning discriminative features of emotional speech, another challenge in cross-corpus SER is how to eliminate the domain shift between the source and target data, caused by factors such as background noise, speaker identity, language, etc. To address this challenge, moment matching-based methods [12,13] and adversarial learning-based methods [14,15] have been widely investigated and have achieved great success. Adversarial learning adopts a domain discriminator to confuse the domain information of features toward a discriminative representation of emotional speech, but it is prone to convergence problems [14]. Moment matching seeks a suitable metric function to measure the discrepancy between domains, e.g., MMD [13], the $\ell_2$ distance [23], and Deep CORAL [24]; these are non-parametric methods and easy to implement. Therefore, previous works on cross-corpus SER mainly integrated MMD into subspace learning. Nevertheless, the speech emotion features generated by subspace learning are low-level, so they cannot accurately represent the feature distributions of the source and target data, which introduces errors into the distance measurement. Thus, in this paper, we utilize the high-level features in the fc layers to measure the distribution distance precisely. In addition, since the features of each fc layer correspond to a specific level of discrimination, inspired by [25][26][27], we also extend the feature alignment from a single layer to a multi-layer adaptation to obtain a more accurate measurement of the domain shift.
Firstly, we formulate the distribution discrepancy of the single-layer high-level features in the $k$th fc layer, namely $D_k$, which can be formalized as

$$D_k = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} \phi\left(f_k^{s,i}\right) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi\left(f_k^{t,j}\right) \right\|_{\mathcal{H}}^2, \tag{9}$$

where $k \in [1, n_l]$ and $K(\cdot, \cdot) = \langle \phi(\cdot), \phi(\cdot) \rangle$ is the kernel function in the high-dimensional reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, denoted as the inner product of the source and target features' mapping function $\phi$.
Further, $D_k$ can be extended to a multi-layer feature distribution distance by integrating the MMD over the domain features of several fc layers, matching the discrepancy between the source and target domains more accurately. Therefore, we obtain the multi-layer distribution discrepancy distance and take it as the distribution alignment loss $L_a$, which constrains the model to gradually eliminate the domain shift during feature learning. $L_a$ can be represented as follows:

$$L_a = \sum_{k=1}^{n_l} D_k, \tag{10}$$

where each $D_k$ is computed with the kernel function $K_k$ corresponding to the features in the $k$th fc layer.
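As a concrete reference, the sketch below computes a multi-kernel Gaussian MMD between the source and target features of one fc layer, which can then be summed over the layers as in Equation (10); the median-heuristic bandwidth and the five kernel scales are common illustrative choices rather than the exact settings of [34].

```python
# A minimal multi-kernel MMD sketch for one fc layer of L_a.
import torch

def mk_mmd(fs: torch.Tensor, ft: torch.Tensor, n_kernels: int = 5) -> torch.Tensor:
    """Squared MMD between source features fs (n_s x d) and target
    features ft (n_t x d) under a mixture of Gaussian (RBF) kernels."""
    z = torch.cat([fs, ft], dim=0)
    sq = (z * z).sum(dim=1)
    d2 = (sq.unsqueeze(1) + sq.unsqueeze(0) - 2.0 * z @ z.t()).clamp_min(0.0)
    bw = d2.detach().median() + 1e-8                  # median-heuristic bandwidth
    scales = [2.0 ** k for k in range(-(n_kernels // 2), n_kernels // 2 + 1)]
    K = sum(torch.exp(-d2 / (bw * s)) for s in scales) / n_kernels
    ns = fs.size(0)
    return K[:ns, :ns].mean() + K[ns:, ns:].mean() - 2.0 * K[:ns, ns:].mean()
```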

PDTN for Cross-Corpus SER
In cross-corpus SER, the spectrograms $x^s$ and $x^t$ of the source and target data are fed into the backbone network (e.g., AlexNet, VGGNet) to extract the high-level emotion semantic features $f_k^s$ and $f_k^t$ in the $k$th fc layer. After this step, the high-level features in the first fc layer of the source and target data are used to calculate the valence-aware center loss $L_v$, and the features in the second fc layer are used to compute the emotion-aware center loss $L_c$. The combination of $L_v$ and $L_c$ is regarded as the emotion discrimination preservation loss $L_d$, which maintains the emotion information of speech features from coarse to fine. Furthermore, the source feature $f_3^s$ in the final fc layer is adopted to predict emotion labels for the cross-entropy loss $L_{ce}$ of the emotion classifier $G_c(\cdot)$, which can be represented as

$$L_{ce} = \frac{1}{n_s} \sum_{i=1}^{n_s} J\left(G_c\left(f_3^{s,i}; \theta_c\right), l_i\right), \tag{11}$$

where $\theta_c$ denotes the parameters of the emotion classifier $G_c(\cdot)$ and $J(\cdot)$ is the cross-entropy function. Then, the high-level features in the three fc layers of the source and target data are adopted to produce the distribution alignment loss $L_a$ for eliminating the domain shift between the source and target domains. Consequently, we can obtain a corpus-invariant and discriminative emotion representation through the total loss $L_{total}$, which can be denoted as

$$L_{total} = L_{ce} + \lambda L_v + \gamma L_c + \mu L_a, \tag{12}$$

where $L_{ce}$ is the cross-entropy loss of the emotion classifier, and $\lambda$, $\gamma$, and $\mu$ are the tradeoff parameters that balance the different losses. According to the aforementioned pipeline, the proposed PDTN is optimized by $L_{total}$ to update the parameters of the backbone network and the classifier; the detailed optimization procedure is illustrated in Algorithm 1. In this paper, we utilize three fc layers in both the AlexNet and VGGNet backbones, i.e., $fc_1$, $fc_2$, and $fc_3$. Specifically, the features in $fc_1$ and $fc_2$ are utilized to calculate $L_v$ and $L_c$, respectively, and $L_a$ is obtained by integrating the features of all three fc layers into the alignment loss.
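Putting the pieces together, the hedged sketch below wires the helper functions from the previous sketches into the objective of Equation (12), following the loop structure of Algorithm 1; the mini-batches, the valence index split, and the trade-off values are illustrative placeholders rather than the authors' settings.

```python
# An illustrative end-to-end training step for PDTN (see Algorithm 1).
import math
import torch

model = FcFeatureExtractor(num_classes=5)             # from the earlier sketch
opt = torch.optim.Adam(model.parameters(), lr=2e-4)
ce = torch.nn.CrossEntropyLoss()
lam, gamma, delta, max_iter = 0.01, 0.01, 10.0, 1000  # assumed trade-offs

# Hypothetical label indices 0..4 = (anger, disgust, fear, happy, sad).
NEG_IDS, POS_IDS = torch.tensor([0, 1, 2, 4]), torch.tensor([3])

for it in range(max_iter):
    # Placeholder mini-batches; replace with real source/target loaders.
    xs, ls = torch.randn(8, 3, 224, 224), torch.arange(8) % 5
    xt = torch.randn(8, 3, 224, 224)                  # unlabeled target batch
    fs, ft = model(xs), model(xt)                     # (f_1, f_2, f_3) per domain
    L_v = valence_center_loss(fs[0], ls, NEG_IDS, POS_IDS)   # fc_1, Eq. (2)
    L_c = emotion_center_loss(fs[1], ls)                     # fc_2, Eq. (6)
    L_a = sum(mk_mmd(a, b) for a, b in zip(fs, ft))          # all fc layers, Eq. (10)
    L_ce = ce(fs[2], ls)                                     # source logits, Eq. (11)
    mu = 2.0 / (1.0 + math.exp(-delta * it / max_iter)) - 1.0
    loss = L_ce + lam * L_v + gamma * L_c + mu * L_a         # Eq. (12)
    opt.zero_grad(); loss.backward(); opt.step()
```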

Experiments
In this section, several experiments are implemented to evaluate our proposed method, and the results are also discussed to illustrate its applicability for cross-corpus SER.

Dataset
• eNTERFACE [28] is a public English multi-modal emotion dataset containing 1290 audio-visual samples with a sample rate of 48 kHz. In this dataset, six emotions, i.e., anger, disgust, fear, happiness, sadness, and surprise, were induced by pre-prepared performance content. Forty-three volunteers from different countries, both male and female, participated in the recording of the dataset.
• CASIA [29] includes 7200 emotional speech sentences in Chinese. Each sample was recorded in one of six emotions, i.e., anger, fear, happiness, neutral, sadness, and surprise, from acted content by four actors (two male and two female). We utilize the 1200 public speech samples, with a sample rate of 16 kHz, for the experiments.
• Emo-DB [30] is a German emotional speech dataset of 535 speech samples recorded by ten native speakers, including five males and five females. In Emo-DB, each sentence was recorded at 16 kHz in one of seven emotions, i.e., anger, boredom, disgust, fear, happiness, neutral, and sadness.
In this paper, to perform cross-corpus SER conveniently, we pick the emotion categories shared by the two datasets adopted in each cross-corpus task. We design six tasks from the three datasets; the detailed settings are shown in Table 1, in which e, c, and b represent the eNTERFACE, CASIA, and Emo-DB datasets, respectively.

Table 1. Data statistics of the six cross-corpus SER tasks on three public datasets, where e, c, and b represent eNTERFACE, CASIA, and Emo-DB, respectively.

Experimental Setting
In order to obtain the input of the proposed PDTN, we transform the speech signals into spectrogram features through the short-time Fourier transform (STFT) with a Hamming window, in which the frame length is set to 350 samples and the number of FFT points to 1024. Note that all speech samples are converted to single-channel data and resampled to a sample rate of 16 kHz.
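A minimal front-end sketch of this step, assuming librosa, is shown below; the hop length is not stated in the text, so the value used here is an assumption, and the resulting spectrogram is later resized to 224 × 224 to match the backbone input.

```python
# Hedged sketch of the spectrogram front-end (STFT with a Hamming window).
import numpy as np
import librosa

def speech_to_spectrogram(path: str) -> np.ndarray:
    y, _ = librosa.load(path, sr=16000, mono=True)   # single channel, 16 kHz
    spec = librosa.stft(y, n_fft=1024, win_length=350,
                        hop_length=175,              # assumed: half the frame length
                        window="hamming")
    return np.log1p(np.abs(spec))                    # log-magnitude spectrogram
```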
In PDTN, we select AlexNet [31] and VGGNet (i.e., VGGNet-11) [32] as the backbone networks to evaluate PDTN's performance on different networks. In each backbone, the three fc layers with dimensions 4096, 4096, and the class number, i.e., $fc_1$, $fc_2$, and $fc_3$, are adopted to calculate the emotion discrimination preservation loss $L_d$ and the distribution alignment loss $L_a$. Moreover, to match the input size of the backbone networks, the spectrogram features are resized to 224 × 224. Our proposed PDTN is implemented in the deep learning framework PyTorch on NVIDIA GeForce RTX 3090 GPUs and optimized by the Adam optimizer [33] with a batch size of 32. The initial learning rate is set to 0.0002 with a decay weight of 0.9, and the number of training epochs is set to 500.
We also describe the other parameters in detail as follows. We utilize Gaussian kernels in the MMD of $L_a$, with bandwidths set according to [34]. For the trade-off parameters, we set $\gamma$ and $\lambda$ by a grid search over the set [0.001, 0.003, 0.005, 0.01, 0.03, 0.05, 0.1, 0.5]. $\mu$ is set by an adjustment strategy, formalized as $\mu = \frac{2}{1 + e^{-\delta p}} - 1$, where $\delta$ is fixed to 10 and $p$ is defined as the ratio of the current iteration number to the total number of iterations.
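As a quick sanity check of this schedule, the snippet below evaluates $\mu$ at a few points of training progress with $\delta = 10$, showing how the alignment loss is eased in smoothly.

```python
# The mu schedule from the text: mu ramps from 0 toward 1 over training.
import math

def mu_schedule(p: float, delta: float = 10.0) -> float:
    return 2.0 / (1.0 + math.exp(-delta * p)) - 1.0

for p in (0.0, 0.1, 0.5, 1.0):
    print(f"p={p:.1f} -> mu={mu_schedule(p):.3f}")
# p=0.0 -> mu=0.000; p=0.1 -> mu=0.462; p=0.5 -> mu=0.987; p=1.0 -> mu=1.000
```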
In addition, we adopt the cross-corpus SER setting of training PDTN on one dataset (e.g., eNTERFACE) and testing the model on another (e.g., CASIA). The six cross-corpus tasks generated from the three datasets are summarized in the task column of Table 1. Furthermore, two widely used criteria of recognition accuracy are adopted to evaluate the performance of our proposed PDTN, i.e., the weighted average recall (WAR) and the unweighted average recall (UAR). WAR is the ratio of the number of correctly predicted samples to the total number of samples, and UAR is the average of the per-class recall rates. UAR has an advantage over WAR in measuring a model's performance on class-imbalanced databases. Therefore, combining WAR and UAR evaluates the performance of PDTN against the state-of-the-art methods more comprehensively.
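For reference, the two criteria can be computed as in the short sketch below; the toy arrays are purely illustrative.

```python
# WAR = overall accuracy; UAR = mean of per-class recalls (imbalance-robust).
import numpy as np

def war_uar(y_true: np.ndarray, y_pred: np.ndarray):
    war = float((y_true == y_pred).mean())
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return war, float(np.mean(recalls))

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0])
print(war_uar(y_true, y_pred))   # WAR = 4/6 ~ 0.667, UAR = mean(2/3, 1, 0) ~ 0.556
```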

Comparison Methods
To effectively estimate the performance of our proposed PDTN on the cross-corpus SER tasks, we choose several state-of-the-art methods for comparison, described as follows:
• Baseline methods: the two backbone networks used to extract the high-level features for the experiments.
AlexNet [31]: includes five convolution blocks with the kernel of 5 × 5 or 3 × 3 and three fc layers with the dimensions of 4096, 4096, and class number.
VGGNet-11 [32]: consists of eight convolution blocks with the kernel of 3 × 3 and three fc layers with the dimensions of 4096, 4096, and class number.
• DA-based methods: domain adaptation-based methods for cross-corpus SER tasks, all run with our own implementations.
DAN [27]: contains a deep feature extractor and a domain alignment layer with the MMD in multiple fc layers.
DANN [35]: utilizes the domain adversarial training strategy by a domain discriminator to obtain the task-specific and domain-invariant representation.
Deep CORAL [24]: integrates the CORAL loss, based on second-order statistics (i.e., covariances), into a deep neural network for an end-to-end unsupervised domain adaptation framework.
DSAN [34]: proposes a non-adversarial sub-domain adaptation method that aligns local distribution discrepancies using a local MMD.
Note that our proposed PDTN introduces no additional parameters, because the calculation of $L_{ce}$, $L_d$, and $L_a$ does not require parameter updating. Therefore, the parameter count of PDTN depends on the backbone network, i.e., PDTN (AlexNet) has a parameter count similar to AlexNet (60 million) [31], and the parameter count of PDTN (VGGNet-11) is the same as VGGNet-11 (133 million) [32]. Likewise, the parameter counts of the other comparison methods, e.g., DAN, Deep CORAL, and DSAN, also depend on the backbone networks; however, DANN has more parameters than the others because of its additional domain discriminator [35]. In addition, compared with AlexNet, VGGNet-11, and DANN, the proposed PDTN, DAN, Deep CORAL, and DSAN all introduce novel losses that result in extra computational complexity. Specifically, PDTN, DAN, and DSAN are based on MMD ($O(n^2)$) and Deep CORAL is based on second-order covariances ($O(n^4)$), where $n$ is the larger of the source sample number $n_s$ and the target sample number $n_t$.

Results and Discussions
The experimental results of the six cross-corpus SER tasks are reported in Table 2 with WAR and UAR. The comparison results reveal that our proposed PDTN, based on the two backbone networks, i.e., AlexNet and VGGNet-11, achieves the best performance over the other state-of-the-art methods. In detail, the DA-based methods are superior to the baseline methods in average accuracy over all six cross-corpus SER tasks, and they also surpass the baselines on most individual tasks. Significantly, the discrepancy-based methods, i.e., DAN and Deep CORAL, achieve recognition rates comparable to the adversarial-based method, i.e., DANN, demonstrating that distribution alignment, whether by distance measurement or adversarial training, can promote corpus-invariant emotion features. Furthermore, DSAN performs better than these three DA-based methods owing to the sub-domain alignment strategy it adopts. Our proposed method goes beyond all of the mentioned DA-based methods, because the PDTN framework not only adapts the marginal distributions across multiple layers but also maintains the emotion discrimination of speech features. From the results in Table 2, we can also observe that the tasks b → e, e → c, and c → e perform worse than the other tasks (i.e., e → b, b → c, and c → b). This indicates that variations between training and test datasets may affect the generalization performance of all cross-corpus methods. In addition, it is interesting that the accuracies of b → e are lower than those of e → b, which may be because Emo-DB is small, so sufficiently robust speech emotion features cannot be learned from it. Furthermore, neither c → e nor e → c performs promisingly. This is very likely because CASIA and eNTERFACE are based on different languages, as CASIA is a Chinese dataset and eNTERFACE an English one. Such disparities across languages lead to emotion variations in speech, which is itself a research hotspot in the field of SER. Nevertheless, our proposed PDTN outperforms the other methods in both the average accuracies and each individual task, demonstrating its superiority.

Ablation Experiments
To verify the effects of the different components in the proposed PDTN, we conduct an ablation study through extensive experiments. The ablation results with WAR and UAR are reported in Table 3, in which PDTN_S and PDTN_M represent the single-layer and multi-layer distribution alignment strategies in the PDTN framework, according to Section 2.3. Furthermore, we select several key components of PDTN to explore their functions for cross-corpus SER. For instance, PDTN_M w/o $L_c$ & $L_v$ and PDTN_M w/o $L_v$ denote the model under the PDTN framework without the $L_c$ and $L_v$ losses, and the one without the $L_v$ loss, respectively. For convenient comparison, we adopt VGGNet-11 as the backbone network of PDTN for the ablation study; thus, the PDTN_M herein is the proposed PDTN (VGGNet-11) in Table 2. From the ablation results in Table 3, firstly, it is clear that PDTN_M w/o $L_c$ & $L_v$ outperforms PDTN_S w/o $L_c$ & $L_v$ in terms of average accuracy, which indicates that multi-layer alignment can obtain more domain-invariant speech emotion features. Secondly, the comparison between PDTN_M w/o $L_c$ & $L_v$ and PDTN_M w/o $L_v$ demonstrates that the emotion-aware loss $L_c$ designed in the PDTN framework facilitates learning speech emotion features with more discrimination. Thirdly, the full PDTN achieves the best average accuracy compared with the other ablation variants, and it shows superior recognition rates in most cross-corpus SER tasks except c → e. These comparison results all demonstrate that our proposed emotion discrimination preservation loss $L_d$, including the valence-aware loss $L_v$ and the emotion-aware loss $L_c$, together with the distribution alignment loss $L_a$, can obtain more discriminative and corpus-invariant representations of emotional speech.

Visualization for Feature Distribution
The key to coping with cross-corpus SER is to extract discriminative speech emotion features. Therefore, to demonstrate the superiority of the proposed method in emotion discrimination preservation, we choose the features under the task e → b for visualization. The feature distributions of different emotions are visualized in Figure 3, in which the features are generated by the fc layers (i.e., $fc_1$, $fc_2$, and $fc_3$) of the PDTN based on VGGNet-11. The distributions are shown via t-SNE [36], and the points of different colors represent the corresponding emotions, i.e., anger, disgust, fear, happiness, and sadness. The sub-figures (a)-(c) of Figure 3(1)-(3) illustrate that the deeper the fc layer, the more compact the distribution margin of each emotion, indicating that deeper fc layer features carry stronger emotion discrimination. In addition, from Figure 3(1)-(3), we can also observe that, with the integration of the distribution alignment loss $L_a$, the emotion-aware loss $L_c$, and the valence-aware loss $L_v$, the features in the three fc layers become more separated across different emotions and more compact within the same emotion. These visualizations all demonstrate that our proposed PDTN framework is adept at maintaining the emotion discrimination of speech features while eliminating the distribution shift between the training and testing data.
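A brief sketch of how such a visualization can be produced from extracted fc-layer features is given below, assuming scikit-learn's t-SNE; the feature and label arrays are placeholders for the real extracted features.

```python
# Illustrative t-SNE projection of fc-layer features (cf. Figure 3).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

feats = np.random.randn(200, 4096)             # placeholder fc_1 features (e -> b)
labels = np.random.randint(0, 5, size=200)     # anger, disgust, fear, happy, sad
emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE of fc-layer features (illustrative)")
plt.show()
```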

Conclusions
In this paper, we propose a progressively discriminative transfer network (PDTN) for cross-corpus SER, aiming to preserve the emotion discrimination of speech features while eliminating the distribution discrepancy between the training and testing data. In PDTN, we design a special discriminant loss $L_d$ based on the prior knowledge of speech emotions, including the valence-aware loss $L_v$ and the emotion-aware loss $L_c$, to assist the emotion classifier in enhancing the discrimination of speech features during deep feature learning. We also adopt a multi-layer distribution alignment based on MMD to reduce the domain shift between the source and target data. The experimental results of six cross-corpus SER tasks on three public datasets (i.e., Emo-DB, eNTERFACE, and CASIA) show that our proposed PDTN obtains more discriminative and domain-invariant representations of emotional speech than the state-of-the-art methods. The distance metric we adopt is based only on the marginal distribution; therefore, in future work we will explore integrating the conditional distribution to obtain a finer-grained measure of the domain shift.