A Triple Adversary Network Driven by Hybrid High-Order Attention for Domain Adaptation

Bridging the knowledge gap between an annotated source domain and an unlabeled target domain is a fundamental challenge in domain adaptation. Existing approaches relieve this gap by aligning features across domains; however, aligning non-transferable features may lead to negative transfer that confuses knowledge learning on the target domain. In this paper, a triple adversary network built on a high-order attention mechanism is proposed to solve this problem. The proposed architecture achieves detailed feature alignment through a hybrid high-order attention computed with a fast iteration algorithm. In addition, an orthogonal loss over two complementary modules is applied to enforce the mutual exclusion of foreground and background features. Finally, a triple adversarial strategy is introduced to further improve training convergence for the composed architecture. Numerical experiments on the Digits, Office-31 and Office-Home datasets show that the proposed network effectively improves on state-of-the-art domain adaptation methods with superior transfer performance.


Introduction
Supervised learning has achieved great success in many applications by utilizing fully annotated training data, such as image recognition [1][2][3], speech recognition [4,5], etc. In these scenarios, a practical difficulty is the manual collection of huge amounts of training data and annotations. To solve this problem, existing solutions usually resort to the rich knowledge of easily labeled datasets, namely, the source domain, to promote effective adaptation learning in domains with scarce labels, namely, the target domains; this is known as domain adaptation (DA). Generally, DA can be categorized into supervised adaptation and unsupervised adaptation. The former assumes that a small amount of labeled target data can be collected for training [6][7][8], and the latter focuses on cases involving fully unlabeled target examples [3,[9][10][11]. Though the latter case is more common and significant progress has been made on it in recent years [12,13], challenges still remain in several specific aspects.
Unsupervised domain adaptation (UDA) specifically focuses on how to reduce the domain shift between the fully labeled source domain and the unlabeled target domain, also known as domain inconsistency, which is caused by many practical factors, such as variations in capture angle, lighting quality, and image resolution across different scenes [14,15]. To address it, deep domain confusion (DDC) [13] represents domain-invariant knowledge by introducing an adaptation layer and a maximum mean discrepancy (MMD)-based domain confusion loss. In addition, the deep adaptation network (DAN) [12] integrates task-specific layers into the reproducing kernel Hilbert space to enhance the transferability of domain representations; and geodesic flow kernel (GFK) [16] uses Kullback-Leibler (KL) divergence to estimate domain differences. (2) The proposed architecture is further driven by reverse-phase modules and an orthogonal loss to constrain the interaction of foreground and background features, thereby ultimately addressing the domain inconsistency in feature transfer. (3) Moreover, a triple-player-based adversarial learning strategy is introduced into the proposed network to improve the iterative convergence of the complex network parameters. Numerical experimental results on the MNIST [25], USPS [26], SVHN [27], Office-31 [1] and Office-home [28] datasets verify that the proposed network has superior performance compared with other state-of-the-art benchmarks.
The remainder of this paper is arranged as follows. Section 2 reviews the related work, including adversarial domain adaptation and attention mechanisms. The proposed domain adversarial architecture is detailed in Section 3. The next section presents the experimental settings, comparison results and discussions. The last section summarizes this paper.

Unsupervised Domain Adversarial Adaptation
Recent research on domain adaptation mainly focuses on the following aspects. One is to utilize metric methods to measure the shift across domains, such as maximum mean discrepancy (MMD) [10,13], second-order statistics correlation alignment (CORAL) [3,9] and center moment discrepancy (CMD) [29]; a Wasserstein distance-based discriminator has also been adopted to bring the two distributions closer. These methods explicitly minimize the domain distribution discrepancy to exploit transferable domain features. The existing domain adversarial networks [17,18] have been verified to achieve excellent performance in transfer scenarios where the distributions of the source and target domains are complex and prior knowledge is scarce. Generally, adversarial approaches solve domain-shift difficulties by globally matching example features across domains. However, not all spatial representations should be transferred for domain adaptation, and negative transfer may arise because of confused knowledge being transferred to the target domain.
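As a concrete illustration of the metric-based family, a minimal MMD estimate between a source and a target sample can be sketched in a few lines of NumPy. This is only an illustrative sketch: the RBF kernel and the bandwidth value below are assumptions for the example, not the settings used by the cited methods.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # Pairwise RBF kernel matrix between the rows of x and y.
    d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def mmd2(xs, xt, gamma=1.0):
    """Biased estimate of the squared MMD between two samples.
    It equals the squared distance between the kernel mean
    embeddings, so it is always non-negative."""
    k_ss = rbf_kernel(xs, xs, gamma).mean()
    k_tt = rbf_kernel(xt, xt, gamma).mean()
    k_st = rbf_kernel(xs, xt, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

rng = np.random.default_rng(0)
# Same distribution -> small MMD; mean-shifted target -> large MMD.
same = mmd2(rng.normal(size=(100, 5)), rng.normal(size=(100, 5)))
shifted = mmd2(rng.normal(size=(100, 5)), rng.normal(3.0, 1.0, size=(100, 5)))
```

Minimizing such a statistic with respect to the feature extractor is what drives the discrepancy-based methods toward transferable features.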
Adversarial learning is another way to convey domain information. Specifically, the domain-adversarial neural network (DANN) [17] first leverages adversarial learning between the domain classifier and the feature generator to learn domain-invariant representations by adding a simple gradient reversal layer (GRL). Further, to address the mode collapse issue, multi-adversarial domain adaptation (MADA) [30] presents a multi-adversarial approach with the help of multiple domain classifiers. Adversarial discriminative domain adaptation (ADDA) [18] uses label learning in the source domain to distinguish representations, and then uses an asymmetric mapping (without weight sharing) learned by a standard generative adversarial network (GAN) loss to map the target data onto a separate code in the same space. The cyclic consistency loss designed in cycle-consistent adversarial domain adaptation (CyCADA) [31] strengthens the consistency of structure and semantics during adversarial domain adaptation (ADA). MADA [30] captures multi-modal structures to achieve fine-grained alignment of different data distributions based on multiple domain discriminators. Co-regularized domain alignment (Co-DA) [32] constructs a number of different feature spaces and aligns the source and target distributions in each of them. Compared with these existing adversarial-learning-based DA methods, the key improvement of the proposed model is its hybrid high-order attention, which jointly captures the fine features of complex non-Gaussian distributions and their discriminative structures. This helps to achieve satisfactory performance when the domain gap is large.
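The gradient reversal layer at the heart of DANN can be sketched in PyTorch as follows. This is a minimal illustration of the idea (identity in the forward pass, gradient multiplied by −λ in the backward pass), not the authors' implementation:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming
    gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # None corresponds to the (non-tensor) lambd argument.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# The reversed gradient makes the feature generator maximize the
# domain loss while the discriminator minimizes it.
x = torch.ones(3, requires_grad=True)
grad_reverse(x, lambd=0.5).sum().backward()
```

Placed between the feature generator and the domain classifier, this single layer turns ordinary back-propagation into the min-max game described above.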

Attention Mechanism in Deep Architectures
Recently, the attention mechanism has made significant progress in various tasks, such as speech recognition [33] and domain adaptation [22]. It can be divided into two categories: spatial attention and channel attention. For the first category, a feature map of size C × H × W can be considered as an H × W image in which every pixel has a C-dimensional representation, and the attention model learns to re-weight every pixel; this model is used in [34] to refine spatial attention. For the second category, attention is paid to each channel of a feature map, viewed as a process of selecting semantic attributes. Its most famous application is SENet [35], the foundation of the top classification submission to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2017. The convolutional block attention module (CBAM) [36] combines spatial attention with channel attention to independently refine convolutional features. Although studies on attention for adaptation are few, it is worth noting that in [22], the domain-adapted transferable attention TADA was proposed, which focuses on two complementary, transferable local and global attentions. There are few studies on combining channel attention and spatial attention in adversarial adaptation. Therefore, in our study, channel attention and spatial attention are combined in adversarial adaptation to learn different levels of feature information from different perspectives.
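A minimal sketch of the channel-attention idea (re-weighting each channel from globally pooled descriptors) might look as follows in PyTorch. The reduction ratio and layer sizes here are illustrative assumptions; also note that this sketch sums the pooled descriptors before a shared MLP, whereas the original CBAM applies the shared MLP to each descriptor before summing:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel mask M_c = Sigmoid(MLP(AvgPool(F) + MaxPool(F))),
    applied to re-weight each channel of the feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f):                        # f: (B, C, H, W)
        avg = f.mean(dim=(2, 3))                 # AvgPool -> (B, C)
        mx = f.amax(dim=(2, 3))                  # MaxPool -> (B, C)
        m_c = torch.sigmoid(self.mlp(avg + mx))  # (B, C), values in (0, 1)
        return f * m_c[:, :, None, None]         # re-weighted features

att = ChannelAttention(16)
out = att(torch.randn(2, 16, 8, 8))
```

Spatial attention works analogously but pools over the channel axis and produces a single H × W mask.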

High-Order Statistics for Spatial Representations
In the study of deep-learning-based feature statistics, statistics above the first order [37][38][39] can successfully represent the significant details of an image. In the field of image and video recognition particularly, second-order convolutional neural networks (SO-CNNs) [40] extract a second-order covariance statistics matrix from the convolutional activations to construct covariance descriptor units. Bilinear CNN (B-CNN) [39] calculates the outer products of the convolutional description vectors and combines them to obtain the image descriptor. In general, second-order statistics such as covariance and Gaussian descriptors perform better than descriptors using zero-order or first-order statistics. However, when the feature distribution is non-Gaussian, second-order or lower statistical information may not be enough [39]. Therefore, many researchers have turned to exploring higher-order information. For example, a video recognition task [41] combines an effective dot-product attention mechanism with temporal reasoning to dynamically discover high-order object interactions. A convolutional neural network (CNN) can be fully parameterized by a high-order tensor [42] to jointly capture the complete structure of the network. These high-order statistical representations capture more discriminative information than first-order ones and obtain promising improvements. Therefore, this paper attempts to combine high-order statistics with spatial attention for the first time, exploring high-order moment tensors for comprehensive domain alignment.
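As a rough sketch of how such second-order descriptors are formed, the B-CNN-style pooled outer product can be written in NumPy as follows. The signed square-root and L2 normalization steps are common companions of B-CNN, included here as illustrative assumptions rather than the exact pipeline of the cited works:

```python
import numpy as np

def bilinear_pool(feat):
    """B-CNN-style pooled outer product.
    feat: (C, N) convolutional descriptors at N spatial locations.
    Returns the C x C matrix (1/N) * sum_i f_i f_i^T, followed by
    signed square-root and L2 normalization."""
    c, n = feat.shape
    g = feat @ feat.T / n                       # second-order pooling
    g = np.sign(g) * np.sqrt(np.abs(g))         # signed square-root
    return g / (np.linalg.norm(g) + 1e-12)      # L2 normalization

rng = np.random.default_rng(1)
desc = bilinear_pool(rng.normal(size=(8, 49)))  # e.g. a 7x7 feature map
```

The resulting symmetric C × C matrix captures pairwise channel interactions, which is exactly the kind of information first-order pooling discards.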

The Proposed Domain Adaption Architecture
In this section, a hybrid architecture is presented to achieve detailed feature transfer for domain adaptation. Given n_s labeled examples from a source domain D_s = {(x_i^s, y_i^s)}_{i=1}^{n_s} and n_t unlabeled examples from a target domain D_t = {x_j^t}_{j=1}^{n_t}, let P(x^s, y^s) and Q(x^t, y^t) be the joint distributions of the source and target domains respectively. The i.i.d. assumption is violated, as P ≠ Q, while the two domains are assumed to share an identical set of categories. This paper aims to formulate a hybrid adversarial transfer architecture that can be pre-trained on D_s and then generalize well to D_t. The overall procedure is illustrated in Figure 1.
At present, adversarial domain adaptation [17,19] has been verified as one of the basic transfer schemes for aligning representations of two domains that follow different probability distributions. In these solutions, features generated through insufficient training might still deceive the domain discriminator, implying that the intrinsic mechanism of feature transfer and discrimination cannot be fully captured by adversarial learning alone. Thus, attention-driven modules have been integrated into generative adversarial frameworks, as in self-attention generative adversarial networks (SAGANs) [21], which help model long-range and multi-level dependencies across image regions. On this basis, an extended attention mechanism can be formulated by mixing channel and spatial masks with high-order representations in order to perform precise spatial alignments; meanwhile, a triple-game adversarial strategy based on spatial orthogonal losses is integrated to enhance parameter convergence during training.

The Mixed High-Order Attention for Feature Alignments
Given a feature map F ∈ R^{C×H×W} on intermediate layers as the input of the transfer modules, the mixed attention module sequentially infers a 1D channel attention map M_c ∈ R^{C×1×1} and a 2D spatial attention map M_s ∈ R^{1×H×W} as

F_c = M_c(F) ⊗ F,  F* = M_s(F_c) ⊗ F_c,  (1)

where F* stands for the final output of the mixed channel and spatial attention, F_c for the output of the channel attention and ⊗ for element-wise multiplication. To exploit the inter-channel relationship of features, the spatial dimension of the feature F is firstly aggregated by two different spatial descriptors, AvgPool(F) and MaxPool(F). Both descriptors are merged using element-wise summation, and then forwarded to a shared layer, a multi-layer perceptron (MLP) with sigmoid activation, to generate the channel attention mask:

M_c(F) = Sigmoid(MLP(AvgPool(F) + MaxPool(F))).  (2)

In terms of the perception of feature maps, the channel attention is applied globally, while the spatial attention often works locally. However, since conventional spatial masks can only be represented by low-order statistics, which are inefficient for accurately capturing the spatial semantic details, a high-order spatial attention M_s(F) ∈ R^{1×H×W} based on detailed high-order statistics is adopted for the feature alignments. Firstly, a linear polynomial predictor is defined on top of the high-order statistics of f ∈ R^C, a local descriptor at a specific spatial location:

m(f) = Σ_{r=1}^{R} ⟨w^r, ⊗_r f⟩,  (3)

where ⟨·, ·⟩ denotes the inner product of two equally-sized tensors, R the number of orders, ⊗_r f the r-th order outer product of f that comprises each degree-r monomial in f, and w^r the r-th order tensor to be learned that contains the weights of the degree-r variable combinations in f. Suppose that when r > 1, w^r can be approximated by D_r rank-1 tensors through tensor decomposition; then Equation (3) can be expanded as

m(f) = ⟨w^1, f⟩ + Σ_{r=2}^{R} Σ_{d=1}^{D_r} α_{r,d} ⟨u_{r,d}^1 ⊗ ··· ⊗ u_{r,d}^r, ⊗_r f⟩,  (4)

where u_{r,d}^1 ∈ R^C, ..., u_{r,d}^r ∈ R^C are rank-1 factor vectors, ⊗ is the outer product and α_{r,d} is the weight for the d-th rank-1 tensor.
Then, according to tensor algebra, the above formula is reformulated as

m(f) = ⟨w^1, f⟩ + Σ_{r=2}^{R} ⟨α^r, z^r⟩,  (5)

where α^r = [α_{r,1}, ..., α_{r,D_r}]^T is the weight vector and z^r = [z_{r,1}, ..., z_{r,D_r}]^T with z_{r,d} = ∏_{s=1}^{r} ⟨u_{r,d}^s, f⟩. For convenience in the calculation of the high-order statistics, let D_r = D for r = 1, 2, ..., R; then z^r can be obtained with the fast iteration

z^r = z^{r-1} ⊙ [⟨u_{r,1}^r, f⟩, ..., ⟨u_{r,D}^r, f⟩]^T,  (6)

indicating that the previous order z^{r-1} should be multiplied element-wise by one new inner product per component to obtain the current order statistics. Equation (5) still contains two terms, so for clarity it is formulated into a more general case. Suppose that w^1 can likewise be approximated by the product of a factor matrix v^1 ∈ R^{C×D_1} and a weight vector α^1 ∈ R^{D_1×1}, i.e., w^1 = v^1 α^1, so that ⟨w^1, f⟩ = ⟨α^1, z^1⟩; then an overall equation is obtained as

m(f) = Σ_{r=1}^{R} ⟨α^r, z^r⟩.  (7)

In Equation (7), since m(f) is capable of modeling and using the high-order statistics of the local descriptor f, the high-order attention mask can be obtained by applying the Sigmoid function to Equation (7):

M_s(f_ij) = Sigmoid(m(f_ij)),  (8)

where M_s(f_ij) ∈ R^C and the value of each element in M_s(f_ij) lies within the interval [0, 1]. The high-order spatial attention module is shown in Figure 2. In this way, the spatial attention mechanism is modeled by combining complex high-order statistics to capture more advanced information between precise parts, so that the feature extractor generates more transferable high-level information for distinguishing fine features. Furthermore, the adopted fast iteration has time complexity O(R), compared with O(R²/2) for the previous version [23], so that high-level transferable information is generated and fine features are distinguished more efficiently.
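The O(R) recursion behind Equation (5) can be sketched and checked against a naive evaluation in NumPy. The sharing of factor vectors across orders is an assumption made here so that the recursion is exact; all names below are illustrative:

```python
import numpy as np

def high_order_score(f, U, alphas):
    """m(f) = sum_r <alpha_r, z_r> via the O(R) recursion
    z_r = z_{r-1} * (U_r^T f) (element-wise), with z_0 = 1.
    U: list of R matrices of shape (C, D); alphas: R vectors of shape (D,)."""
    z = np.ones(U[0].shape[1])
    m = 0.0
    for u_r, a_r in zip(U, alphas):
        z = z * (u_r.T @ f)   # raise the order by one inner product
        m += a_r @ z
    return m

rng = np.random.default_rng(2)
C, D, R = 6, 4, 3
U = [rng.normal(size=(C, D)) for _ in range(R)]
alphas = [rng.normal(size=D) for _ in range(R)]
f = rng.normal(size=C)
m_fast = high_order_score(f, U, alphas)

# Naive check: accumulate z_{r,d} = prod_{s<=r} <u_{s,d}, f> explicitly.
m_naive = 0.0
for r in range(1, R + 1):
    z_r = np.array([np.prod([U[s][:, d] @ f for s in range(r)])
                    for d in range(D)])
    m_naive += alphas[r - 1] @ z_r
```

Each step of the loop raises the order by a single matrix-vector product, which is where the O(R) cost comes from.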

The Orthogonal Loss in Reverse Phase Modules
Previous studies [43] indicate that orthogonality helps the optimization of deep neural networks (DNNs) by preventing the explosion or vanishing of back-propagated gradients. Rodríguez et al. [44] proposed an orthogonal regularization that enforces feature orthogonality locally based on the cosine similarities of filters. Jia et al. [45] proposed the algorithms of orthogonal deep neural networks (OrthDNNs) in response to the recent interest in spectrally regularized deep learning methods. A cheap orthogonal constraint was proposed based on parameterizations from exponential maps [46]. To overcome redundancy and improve feature diversity, orthogonality regularization [47] was proposed. Presently, few studies focus on the effectiveness of background discrimination in unsupervised adversarial adaptation tasks. Thus, in our research, the proposed attention-based adaptation framework further integrates the influence of both foreground and background accuracy on the overall transfer effect.
To obtain accurate spatial features of foregrounds in vision scenes and eliminate backgrounds and other interfering factors, a pair of complementary modules with an orthogonal loss is integrated. The hybrid high-order attention focuses on the important target features of the image, while the complementary attention generated by the complementary module captures the background and other factors. The features of the two are expected to be orthogonal in the image space, so that their overlap is approximately zero. Thus, an orthogonal loss is applied to constrain the magnitude of the spatial overlap between the two sets of features.
It accumulates, over the image, the products between the hybrid high-order attention foreground features and the background features and other interfering factors. This idea further helps the model enhance cross-domain transfer by distinguishing fine spatial features, and remedies the drawback that the domain classifier may still be deceived by insignificant extracted features.
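Since the exact form of the loss is not reproduced above, the following NumPy snippet is only a hedged sketch of one plausible formulation: penalizing the squared per-location inner product between foreground and background attention features, which vanishes exactly when the two are orthogonal at every spatial position.

```python
import numpy as np

def orthogonal_loss(fg, bg):
    """Illustrative orthogonality penalty (an assumption, not the
    paper's exact loss). fg, bg: (C, H, W) attended feature maps.
    Sums the squared per-location inner products, which is zero
    iff the two feature vectors are orthogonal everywhere."""
    inner = (fg * bg).sum(axis=0)   # (H, W) inner products
    return float((inner ** 2).sum())

# Toy check: disjoint channels -> orthogonal -> zero loss.
fg = np.zeros((2, 2, 2)); fg[0] = 1.0   # foreground on channel 0
bg = np.zeros((2, 2, 2)); bg[1] = 1.0   # background on channel 1
loss_orth = orthogonal_loss(fg, bg)
loss_same = orthogonal_loss(fg, fg)     # identical features -> positive
```

Minimizing such a penalty pushes the foreground and background modules toward mutually exclusive spatial features, as described above.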

The Training Based on Triple Adversarial Strategy
According to adversarial-based domain adaptation, the distribution shift between the source and the target domains is reduced by generating globally transferable features [17,30]. The existing adversarial strategy can be regarded as a two-player game, where the first player is the domain discriminator D, which distinguishes the source domain from the target domain, and the second player is the feature generator G, simultaneously trained to confuse the discrimination results of D. To obtain domain-invariant features F, the trainable parameters θ_g, θ_d and θ_c are optimized by minimizing the losses of the three modules G, D and C alternately. Then, the objective of the domain adversarial network [17] can be denoted by

C_0(θ_g, θ_c, θ_d) = (1/n_s) Σ_{x_i ∈ D_s} L_y(C(G(x_i)), y_i) − (α/n) Σ_{x_i ∈ D_s ∪ D_t} L_d(D(G(x_i)), d_i),  (11)

where the losses L_y and L_d can be assigned as cross-entropy, d_i is the domain label of example x_i, n = n_s + n_t, and α is a trade-off coefficient between the two objectives formulating the feature generation during training. After training convergence, the parameters θ̂_g, θ̂_d and θ̂_c will deliver a saddle point of Equation (11). However, the challenge lies in the fact that local convergence of the model often arises even after the two players have reached a training balance, particularly when extended attention modules with more trainable parameters are integrated in this game. In this section, an expanded adversarial procedure involving three games is proposed to relieve the local convergence. The extended formula can be illuminated as

C(θ_g, θ_ḡ, θ_c, θ_d) = (1/n_s) Σ_{x_i ∈ D_s} L_y(C(G(x_i)), y_i) − (α/n) Σ_{x_i ∈ D_s ∪ D_t} L_d(D(G(x_i)), d_i) − (β/n) Σ_{x_i ∈ D_s ∪ D_t} L_d(D(Ḡ(x_i)), d_i),  (12)

where the modules G/Ḡ, D and C are the three given players formulating the feature alignments during training, and α and β are the trade-off coefficients. After training convergence, the parameters θ̂_g, θ̂_ḡ, θ̂_d and θ̂_c will deliver a saddle point of Equation (12), reached alternately by the adversarial iterations

(θ̂_g, θ̂_ḡ, θ̂_c) = argmin_{θ_g, θ_ḡ, θ_c} C(θ_g, θ_ḡ, θ_c, θ_d),  θ̂_d = argmax_{θ_d} C(θ_g, θ_ḡ, θ_c, θ_d).

According to the above equations, the training with the triple adversarial strategy can be detailed in the following stages. Firstly, the adversarial game is applied to G, D and C.
The features from the source domain are fed into two branches, namely, the binary domain discriminator D and the label classifier C to predict input labels in a supervised manner. The features generated from the target domain are only fed into the domain discriminator D to promote an adversarial training across the domains. The parameters of these modules are alternately frozen to update the other module in the adversarial iterations. The second adversarial game is performed between D andḠ. The feature extractor G trained in the first game focuses on the foregrounds, while the backgrounds are captured by the complementary moduleḠ as a game player to accelerate the convergence of the feature generation.
To ensure that the discriminator receives transferable information for dividing the domain samples, the hybrid high-order attention is also updated at this stage to generate accurate spatial alignments.
Finally, the updating of both G andḠ should be achieved by a supplementary game. These two complementary modules respectively align the foreground and the background details to support the subsequent domain discrimination. Thus, the discrimination modules D and C are frozen to accelerate the feature generations with salient spatial details in this game. In short, the role of this stage can be considered as both the cooperation and competition between the two complementary modules. In the triple adversarial training, this three-player game can learn more transferable and distinguishable fine spatial representations, and prevent the module parameters from falling into local convergence in an end-to-end optimization of complex architectures.
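The three alternating games above might be sketched as the following PyTorch skeleton. This is a toy sketch with hypothetical stand-in modules: the gradient-reversal handling of the adversarial term and the orthogonal loss are omitted for brevity, and the module shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical minimal stand-ins for the paper's G, G-bar, D and C.
G, G_bar = nn.Linear(8, 4), nn.Linear(8, 4)   # foreground / background extractors
D, C = nn.Linear(4, 2), nn.Linear(4, 3)       # domain discriminator / label classifier

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

opt = torch.optim.SGD(
    [p for m in (G, G_bar, D, C) for p in m.parameters()], lr=0.01)
ce = nn.CrossEntropyLoss()
xs, ys = torch.randn(5, 8), torch.randint(0, 3, (5,))   # labeled source batch
xt = torch.randn(5, 8)                                  # unlabeled target batch
dom = torch.cat([torch.zeros(5), torch.ones(5)]).long() # domain labels

# Game 1: G vs (D, C) -- G-bar frozen. (In the full model a gradient
# reversal layer flips the domain-loss gradient with respect to G.)
set_trainable(G_bar, False)
loss1 = ce(C(G(xs)), ys) + ce(D(torch.cat([G(xs), G(xt)])), dom)
opt.zero_grad(); loss1.backward(); opt.step()
set_trainable(G_bar, True)

# Game 2: G-bar vs D -- the background features also confront D.
set_trainable(G, False)
loss2 = ce(D(torch.cat([G_bar(xs), G_bar(xt)])), dom)
opt.zero_grad(); loss2.backward(); opt.step()
set_trainable(G, True)

# Game 3: G and G-bar updated with D and C frozen
# (plus the orthogonal loss in the full model).
set_trainable(D, False); set_trainable(C, False)
loss3 = ce(C(G(xs)), ys)
opt.zero_grad(); loss3.backward(); opt.step()
set_trainable(D, True); set_trainable(C, True)
```

Freezing and unfreezing parameter groups per game is what realizes the alternating three-player optimization described in this section.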

Results and Analysis
Labeled source images and unlabeled target images were used for training, and tests were then conducted on the remaining data. The proposed hybrid high-order triple-adversary network (HTAN) was evaluated against state-of-the-art approaches on three standard unsupervised domain adaptation datasets: Digits [27], Office-31 [1] and Office-home [28].
Specifically, three digits datasets are investigated, including MNIST [25], USPS [26] and SVHN [27]. These datasets each contain digits of 10 classes ranging from 0 to 9. In particular, MNIST and USPS contain 28 × 28 and 16 × 16 gray images respectively, and SVHN consists of 32 × 32 color images, which might contain more than one digit per image. An evaluation protocol with three transfer tasks is adopted: USPS to MNIST, MNIST to USPS and SVHN to MNIST. Office-31 is the most widely used dataset for visual domain adaptation, with 4652 images and 31 categories collected from three distinct domains: Amazon (A), Webcam (W) and DSLR (D). All methods are evaluated on six transfer tasks: A → W, D → W, W → D, A → D, D → A and W → A. Office-Home is a better organized but more difficult dataset than Office-31, for it consists of 15,500 images with 65 object classes in office and home settings, forming four extremely dissimilar domains: Art (Ar) with 2427 paintings, sketches or artistic depiction images; Clipart (Cl) with 4365 images; Product (Pr) with 4439 images; and Real-World (Rw) with 4357 regularly captured images. All 12 transfer tasks are performed on this dataset: Ar → Cl, Ar → Pr, Ar → Rw, Cl → Ar, Cl → Pr, Cl → Rw, Pr → Ar, Pr → Cl, Pr → Rw, Rw → Ar, Rw → Cl, Rw → Pr. Figure 3 shows some sample images from the three datasets.
For the Office-31 and Office-Home datasets, the deep adaptation methods are implemented in the PyTorch framework on the basis of a residual network (ResNet-50) [54] pre-trained on the ImageNet [55] dataset. Back-propagation is applied to fine-tune the model with labeled source-domain samples and completely unlabeled target-domain samples. The average accuracy is used to assess the statistical significance of the different methods. To train the model, the batch size in all experiments is set to 64, and the network is optimized by mini-batch stochastic gradient descent (SGD) with a learning rate of 0.001 and a momentum of 0.9. To guarantee fair comparison, all of the methods are re-implemented, and each method is trained five times with the average taken as the final result. The comparisons observe the standard protocol for unsupervised domain adaptation, with labeled source data and unlabeled target data applied to all transfer tasks, as in [26]. To distinguish the individual contributions of the transferable hybrid high-order attention, the transferable complementary orthogonal attention module and the three-player game training method, HTAN denotes the hybrid high-order attention applied to the adversarial DANN model, and HTAN-k denotes the k-th order hybrid high-order attention model. Lor + (HTAN-k) denotes the k-th order hybrid high-order attention combined with the complementary orthogonal attention, and Lor + Tri + (HTAN-k) denotes that combination trained with the three-player game method over the entire module.
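The stated optimization settings correspond to a standard PyTorch setup along the following lines. The model here is a hypothetical stand-in, not ResNet-50; only the batch size, learning rate and momentum come from the text:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the ResNet-50 backbone plus classifier head.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 31))

# Mini-batch SGD with the hyper-parameters stated in the text:
# batch size 64, learning rate 0.001, momentum 0.9.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

batch = torch.randn(64, 10)                       # one mini-batch
labels = torch.randint(0, 31, (64,))              # 31 Office-31 classes
loss = nn.CrossEntropyLoss()(model(batch), labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```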

Results on Dataset Digits
According to our experimental scheme, in the MNIST → USPS and USPS → MNIST experiments, USPS images are resized to 28 × 28 pixels, and MNIST keeps its original size of 28 × 28 pixels. In the SVHN → MNIST experiment, the SVHN dataset contains images with colored backgrounds, multiple digits and extremely blurry digits, whereas MNIST is composed of binary black-and-white handwritten digit images, implying a significant domain difference between the two datasets. As the image size of MNIST is much smaller than that of SVHN, MNIST images are resized to SVHN's 32 × 32 with three channels. The proposed method shows competitive performance in all three transfer tasks. The results are recorded in Table 1.
In the experiments, the performance of our method on MNIST → USPS, USPS → MNIST and SVHN → MNIST was significantly better than that of the basic model DANN, by 14.9%, 9% and 19.7%, respectively. The results also show that our method outperforms the latest method 3CATN: the accuracy improved by 0.9%, 0.4% and 1.1% on MNIST → USPS, USPS → MNIST and SVHN → MNIST respectively, and the average increase in accuracy over the state-of-the-art 3CATN method reached 1.2%. The HTAN model extracts the fine features of key foreground targets in the image, uses the orthogonal loss to suppress the insignificant background, and employs the triple adversarial training method to learn more refined feature representations that are both transferable and discriminative. From the comparison results, some conclusions can be drawn. First of all, in the transfer tasks between digit datasets, the adversarial-based DANN, 3CATN and Lor + Tri + (HTAN-6) perform better than the discrepancy-based DDC, DAN and D-CORAL. Secondly, on average, the proposed method performs the best across the three transfer schemes, showing that it is the most competitive; this also proves that domain adaptation with mixed high-order attention can obtain a better adaptive effect on the digit datasets than other recent domain adaptation methods, such as 3CATN. Finally, these results confirm that it is beneficial for a UDA model to take both domain adversarial learning and attention into account.
The confusion matrices shown in Figure 4 intuitively illustrate the effectiveness of our method. From Figure 4a, it can be seen that most samples of digit category "8" are incorrectly predicted as "3", so misclassification is likely in some cases, especially when testing similar digits such as 7 versus 2 and 9 versus 4, which reveal large differences between domains. As the order increases, HTAN-6 improves, especially on "3" and "4", which are otherwise likely to be misclassified, while Lor + (HTAN-6) reduces the differences of background edges under the constraint of the orthogonal loss. In contrast, Lor + Tri + (HTAN-6) shows more accurate predictions on the diagonal, proving that the proposed method can effectively alleviate domain and category differences.

Results on Dataset Office-31
Six transfer tasks were performed in the context of domain adaptation: A → W, D → W, D → A, W → A, W → D and A → D. The results are recorded in Table 2. From the transfer tasks on the Office-31 dataset, it can be seen that DDC, DAN and JDDA are better than the traditional shallow transfer methods TCA and GFK. However, DANN, 3CATN and Lor + Tri + (HTAN-6) perform better than DDC, DAN and JDDA. The calculated average accuracy rate across the tasks is 89.63%, which is 0.73% higher than that of the latest 3CATN model. It is worth noting that our method achieves the best results on the two transfer tasks A → W and A → D, and performs better than the basic model DANN on all transfer tasks. In particular, the HTAN method achieves higher classification accuracy on the difficult transfer tasks A → W and A → D, in which the domain difference between the source and the target, in terms of shooting angle and object attributes such as color, is significantly larger. HTAN can improve the adaptation tasks with larger domain differences, such as A → W, A → D, D → A and W → A, and achieve comparable classification accuracy on adaptation tasks with small domain differences. From the results, the following conclusions can be drawn. First of all, DANN trains an additional domain classifier to minimize the domain difference, making its performance about 9.5% better than that of RTN; this improvement also shows that adversarial learning is useful for minimizing the difference between the source and the target data. Secondly, the proposed method is significantly better than a series of metric-based methods, such as DDC, DAN and CORAL, which all minimize the distribution discrepancy at the fully connected layers; this indicates that blindly aligning the content and background parts may have a negative impact on the final result.
Thirdly, combining high-order mixed attention with adversarial learning improves the average accuracy rate by 6.2%. Finally, adding the new training method to HTAN-6 raises the average accuracy by a further 0.51%, indicating that our model is superior to the baseline methods in reducing domain differences. Figure 5 compares the convergence performance of ResNet-50, DAN and HTAN. It can be seen that the proposed HTAN-6 enjoys faster convergence than DAN, while Lor + (HTAN-6) performs better than HTAN-6. It is worth noting that Lor + (HTAN-6) has similarly stable convergence to Lor + Tri + (HTAN-6) at the beginning of adversarial training, while Lor + Tri + (HTAN-6) remarkably outperforms Lor + (HTAN-6) over the whole convergence procedure. Thus, as training progresses, more fine-grained features are gradually learned between the source and target domains, and Lor + Tri + (HTAN-6) becomes better than the other approaches. These findings confirm that our model can reach the minimum test error smoothly and quickly, resulting in better domain transfer.

Results on Dataset Office-Home
In the context of domain adaptation, 12 transfer tasks were performed across the four domains of the Office-Home dataset. The results are listed in Table 3. The previous best average accuracy rate, achieved by TADA, was 67.6%, while the average accuracy rate of the proposed HTAN was 68.66%, a new best performance. The average accuracy rate achieved by our method was 11.06% higher than that of the baseline method DANN. It is encouraging that HTAN adapts well to some difficult transfer tasks (for example, Cl → Pr), meaning that HTAN can learn more transferable features for effective transfer learning.
From the results, the following important observations can be made. On the one hand, combining the hybrid high-order attention with adversarial learning increases the average accuracy by 3.63%, showing that our method can accurately select the foreground for alignment and retain fine-grained feature information. On the other hand, adding the new triple adversarial training strategy to the hybrid high-order attention model HTAN-6 further increases the average accuracy by 0.57%, showing that our model reduces the domain discrepancy better than the baseline adversarial approach.
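To make the triple adversarial training concrete, the following is a minimal, illustrative sketch (not the paper's exact architecture): a feature extractor, a label classifier and a domain discriminator act as the three players, with a gradient reversal layer turning the discriminator's loss into an adversarial signal for the extractor. All layer sizes, the 31-class output and the reversal weight are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Three players: feature extractor, label classifier, domain discriminator.
feat_net = nn.Sequential(nn.Linear(256, 128), nn.ReLU())
clf = nn.Linear(128, 31)                                   # e.g., 31 Office-31 classes
disc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

xs = torch.randn(8, 256)               # source batch (labeled)
ys = torch.randint(0, 31, (8,))
xt = torch.randn(8, 256)               # target batch (unlabeled)

fs, ft = feat_net(xs), feat_net(xt)
cls_loss = nn.CrossEntropyLoss()(clf(fs), ys)              # classifier vs. extractor
feats = GradReverse.apply(torch.cat([fs, ft]), 1.0)        # reversed grads reach feat_net
dom_labels = torch.cat([torch.zeros(8), torch.ones(8)]).long()
dom_loss = nn.CrossEntropyLoss()(disc(feats), dom_labels)  # discriminator's objective
total = cls_loss + dom_loss
total.backward()                       # one step couples all three players
```

In this sketch the extractor simultaneously minimizes the classification loss and, through the reversal layer, maximizes the domain loss, which is the standard mechanism behind domain-adversarial objectives.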

Effectiveness Verification
Ablation study: From Table 4, we analyze the individual contributions of the hybrid high-order attention, the orthogonal loss and the triple adversarial training. The basic DANN model without high-order attention yields only a modest improvement in recognition rate, indicating that it does not learn the complex multi-modal structural distribution. Adding the orthogonal loss and the triple adversarial training improves HTAN to a certain extent, indicating that they reduce the influence of background regions and learn domain-invariant features under the orthogonality constraint. Under a fair comparison, the results of the proposed Lor + Tri + (HTAN-6) uniformly outperform those of the other variants in these ablation experiments, which certifies its remarkable effect on cross-domain feature matching and adversarial domain adaptation.
Visualization: We visualize the network activations from the feature extractors of HTAN, HTAN-6, Lor + (HTAN-6) and Lor + Tri + (HTAN-6) on the transfer task A → W by t-SNE [56] in Figure 6. With HTAN features, the source and target domains are not well aligned. From HTAN (left) to Lor + Tri + (HTAN-6) (right), the two domains become increasingly indistinguishable. With the feature representations of Lor + Tri + (HTAN-6), the source and target domains are well aligned while different classes remain well discriminated. These observations intuitively demonstrate that the proposed method learns more domain-invariant features through the hybrid high-order attention mechanism and triple adversarial learning.
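The orthogonality constraint between the foreground and background modules can be sketched as a squared Frobenius norm penalty on the product of the two feature matrices; this is a common form of orthogonal loss and is an assumption here, not necessarily the paper's exact formulation. The loss is positive for overlapping features and vanishes when the two feature sets occupy orthogonal subspaces:

```python
import numpy as np

def orthogonal_loss(fg, bg):
    """Squared Frobenius norm of fg^T @ bg: penalizes overlap between the
    foreground and background feature matrices (zero when orthogonal)."""
    # fg, bg: (batch, dim) features from the two complementary modules
    return np.sum((fg.T @ bg) ** 2)

rng = np.random.default_rng(0)
fg = rng.standard_normal((16, 8))
bg = rng.standard_normal((16, 8))
loss_random = orthogonal_loss(fg, bg)      # positive for random, overlapping features

# Projecting bg onto the orthogonal complement of fg's column space
# drives the loss to (numerically) zero:
q, _ = np.linalg.qr(fg)                    # orthonormal basis of fg's column space
bg_perp = bg - q @ (q.T @ bg)              # remove the fg components from bg
loss_orth = orthogonal_loss(fg, bg_perp)   # ~0
```

During training, minimizing such a term pushes the background branch away from the subspace spanned by the foreground features, which is the mutual-exclusion effect the ablation measures.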

High-Order Performance Analysis and Convergence Performance
Effects of HTAN: First, denote the order of HTAN by "HTAN-k", where the orders are R = {1, 2, 3, ..., k}. We then quantitatively compare HTAN variants of different orders; the results are shown in Figure 7. It can be seen that the proposed HTAN significantly improves the recognition rate on the MNIST → USPS transfer task. Specifically, as the order of HTAN rises, the recognition rate improves further. For example, on MNIST → USPS, when the order rises from 1 to 6, the performance of HTAN increases from 91.2% to 94.7%. This shows that the hybrid high-order attention model helps to capture the interactions of complex, high-order statistical information. To further suppress background features, the orthogonal module is added to constrain them; with increasing order, the recognition rate of the Lor + (HTAN-k) model also rises, from 92.2% to 96%, showing that our model can better concentrate on the fine-grained features of the foreground. Finally, the domain-adversarial training method based on a three-player game is added; the Lor + Tri + (HTAN-k) model improves the recognition rate further on top of the first two components. The performance of Lor + Tri + (HTAN-k) on the three benchmarks is much better than that of all baseline models, proving that our method better represents fine-grained features and is effective for learning domain-invariant features. However, when the order of the Lor + Tri + (HTAN-k) model is increased further, e.g., R = 7, the performance remains almost unchanged (data not reported here). Parameter sensitivity: From Figure 7 we can see that when the order hyperparameter is greater than 6, the performance remains almost unchanged.
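As a rough intuition for how the order k enters such an attention module, the toy sketch below mixes element-wise powers of the feature map into a sigmoid attention map, so that the order-r term contributes r-th order statistics of each activation. This is purely illustrative; the paper's actual module and weighting scheme are more elaborate, and the weights here are arbitrary assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def high_order_attention(feat, weights):
    """Toy hybrid high-order attention of order k = len(weights).
    The attention map mixes element-wise powers of the feature map,
    so higher orders capture higher-order activation statistics."""
    # feat: (channels, h, w); weights[r-1] scales the order-r term
    z = np.zeros_like(feat)
    for r, w in enumerate(weights, start=1):
        z += w * feat ** r
    z = np.clip(z, -30.0, 30.0)    # avoid overflow in exp for high orders
    att = sigmoid(z)               # per-position, per-channel attention in (0, 1)
    return att * feat              # re-weight features by the attention map

rng = np.random.default_rng(1)
feat = rng.standard_normal((4, 3, 3))
out1 = high_order_attention(feat, [0.5])        # first-order (cf. HTAN-1)
out6 = high_order_attention(feat, [0.5] * 6)    # sixth-order (cf. HTAN-6)
```

Because the attention values lie in (0, 1), the module can only attenuate activations, never amplify them, which matches the idea of suppressing non-transferable responses while keeping the feature map's shape.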
The convergence analysis above confirms that our model can smoothly and quickly achieve the smallest test error, thereby achieving better domain transfer. We recommend the HTAN-6 model, which generates multiple modules of different orders so that diverse and complementary high-order information can be used explicitly, encouraging the richness of the learned features and preventing the learning of partial or biased visual information. With order 6, the model learns the most discriminative features. Additionally, we use adversarial learning to generate diversified high-order attention maps with the order hyperparameter constrained, which does not degrade benchmark performance.

Conclusions
This paper proposes the hybrid high-order triple-adversary network (HTAN), a new adversarial learning method with a hybrid high-order attention mechanism. It differs from previous methods, which only match low-order feature representations across domains and thus often lead to negative feature transfer. The proposed network uses the hybrid high-order attention mechanism to weight the extracted features, effectively eliminating the influence of non-transferable features. By considering the transitivity of different regions or images, complex multi-modal structural information is further exploited to achieve more precise feature matching. In addition, an orthogonal loss is integrated to further constrain the background features, and a triple-adversary strategy is adopted to improve the training convergence of the hybrid network.
In the experiments, comprehensive evaluations on three benchmark datasets verified the superior performance of the proposed network compared with the related adaptive models. Building on this paper, further studies might proceed in several directions. First, task-specific decision boundaries between categories should be considered, to avoid generating confusing features close to category boundaries; secondly, the problem that the time cost depends excessively on the model order should be relieved; and finally, specific challenging fields should be explored, such as cross-domain person re-identification, in order to introduce more distinctive high-order attentions.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: