Transformer-Based Unsupervised Cross-Sensor Domain Adaptation for Electromechanical Actuator Fault Diagnosis

Abstract: There have been several successful attempts to develop data-driven fault diagnosis methods in recent years. A common assumption in most studies is that the data of the source and target domains are obtained from the same sensor. Nevertheless, because electromechanical actuators may have complex motion trajectories and mechanical structures, it may not always be possible to acquire data from a particular sensor position. When the sensor locations of electromechanical actuators are changed, the fault diagnosis problem becomes further complicated because the feature space is significantly distorted. The literature on this subject is relatively underdeveloped despite its critical importance. This paper introduces a Transformer-based end-to-end cross-sensor domain fault diagnosis method for electromechanical actuators to overcome these obstacles. An enhanced Transformer model is developed to obtain domain-stable features at various sensor locations, and a convolutional embedding method is proposed to improve the model's ability to integrate local contextual information. Further, the joint distribution discrepancy between two sensor domains is minimized using Joint Maximum Mean Discrepancy. Finally, the proposed method is validated on an electromechanical actuator dataset. Twenty-four transfer tasks are designed to validate cross-sensor domain adaptation fault diagnosis, covering all combinations of three sensor locations under different operating conditions. The results show that the proposed method significantly outperforms the comparative methods when sensor locations vary.


Introduction
In recent years, the use of next-generation aerospace equipment with fly-by-wire flight control systems has increased substantially [1]. As an important type of fly-by-wire flight control actuator, Electromechanical Actuators (EMAs) are gaining increasing attention in the aerospace industry due to their many advantages, including higher reliability, lower total weight, and better maintainability. As an important safety component on a spacecraft, the EMA may fail for several reasons due to the complicated and uncertain operating environment of the spacecraft. The safety and reliability of the EMA are therefore of critical importance, and undetected EMA failures may lead to catastrophic consequences. Consequently, it is important to carry out fault diagnosis research on the EMA of spacecraft.
There has been increasing research in Prognostic and Health Management (PHM) in EMA over the past several years. There are two categories of approaches suggested in these studies: model-based approaches and data-driven approaches.
The model-based approach requires the development of an accurate mathematical model [2,3] to predict the input and output correlations of the EMA. Analyzing the estimated parameters of an EMA against its measurements will reveal its health state. Model-based approaches can provide insights into the operational state of individual components within an EMA, enabling faults to be identified directly from a physical perspective. Arriola et al. [4] constructed five sets of signal-based monitoring functions to detect faults based on a detailed model of the EMA. Ossmann et al. [5] designed a residual filter of the EMA based on a linear model, which implemented fault diagnosis by monitoring three sensors of the EMA. An advantage of the model-based approach is that the fault modes are correlated with the parameters of the model. However, this approach relies on high-fidelity models, which are often designed for specific devices. It is generally difficult or expensive to measure and calculate the model's complete parameters outside of a laboratory.
In data-driven approaches, monitoring data and signal processing techniques are used to learn EMA fault patterns from normal and fault data. EMA fault patterns can be directly learned from sensor data when data-driven approaches are used. Data-driven approaches are becoming increasingly popular as monitoring techniques and computational power improve. Chirico et al. [6] determined the frequency domain features of vibration signals derived from EMAs using the power spectral density method, and further reduced the dimensionality of the features through principal component analysis. Fault diagnosis was then achieved using Bayesian classifiers. This type of traditional data-driven approach normally relies on predefined domain knowledge to extract features manually. With deep learning, it is possible to automatically learn hierarchical representations of large-scale data [7], which is critical for fault diagnosis applications. Reddy et al. [8] employed a Deep Auto-Encoder (DAE) network to transform high-dimensional sensor data into a low-dimensional feature space to detect anomalies. Yang et al. [9] developed a sliding window enhanced EMA fault detection and isolation method using an extended Long Short-Term Memory (LSTM) model. A Convolutional Neural Network (CNN)-based EMA fault diagnosis method was presented by Riaz et al. [10]. In EMA fault diagnosis, deep learning-based approaches have achieved significant performance improvements, as they can take advantage of non-linear feature mapping and end-to-end data mining. However, for a new diagnosis task, these deep neural networks often perform poorly on unseen target samples, regardless of how well they were trained on the initial dataset. This is because the source and target domains differ in distribution, which can be attributed to a variety of factors, including working conditions, mechanical characteristics, and sensor differences [11].
To address this problem, several transfer learning techniques have been proposed to transfer fault diagnostic knowledge from one domain to another [12]. In recent years, intelligent fault diagnosis methods based on cross-domain adaptation and knowledge transfer have proven effective for classifying fault types under varying working conditions [13][14][15][16], across different machines [17][18][19], and with imbalanced instances [20,21], etc.
Despite promising results from these studies, most existing research assumes that data collection is conducted at the same location on each machine. In real-world settings, this assumption is often difficult to satisfy. The source domain (training data) distribution and the target domain (testing data) distribution may change with the sensor or its location. In this situation, it is difficult for knowledge acquired from the source domain to be applied successfully in the target domain. Less attention has been given to scenarios where training and testing data are collected at different locations.
In addition, vibration signals captured on the EMA are strongly affected by sensor location as a result of the EMA's complex construction. Vibration data collected at a different location suffer in both the quantity and quality of information. It may be argued that cross-sensor domain adaptation therefore presents a challenge to fault diagnosis in EMAs. To overcome this problem, a cross-sensor fault classifier needs to be built using learned source domain knowledge. Unfortunately, little research has been conducted on this topic. Li et al. [22] proposed a Generative Adversarial Network (GAN)-based approach for marginal domain fusion of bearing data, combined with parallel unsupervised data, to converge on conditional distribution alignments. However, GAN is not optimal for discriminative tasks, and it is sometimes limited to smaller shifts on these tasks [23]. A CNN and Maximum Mean Discrepancy (MMD) [24] were utilized by Pandhare et al. [25] for fault diagnosis across different sensor locations of a ball screw. CNN treats all data equally and lacks both pertinence and relevance, making it difficult to discover relationships between targets [26]. Moreover, CNNs have a limited ability to capture long-range information because their receptive fields are localized [27]. Though cross-sensor domain fault diagnosis is exceedingly common in real industrial scenarios, various aspects of these problems have not been adequately addressed.
A number of attention mechanisms have been successfully employed in the areas of Computer Vision (CV), Natural Language Processing (NLP), and fault detection [28][29][30]. In a seminal study, Vaswani et al. [31] presented the Transformer model, which relies solely on self-attention. A representation of each input position is computed from its relations to all other positions, making it possible to capture global dependencies between input and output effectively and simultaneously over time. In contrast to CNN, the kernel size is not constrained, enabling a complete receptive field for each time step. In contrast to Recurrent Neural Network (RNN)-based methods, the Transformer enables full parallel computation through its dot-product self-attention. Diagnostic tasks often require processing signal sequences and determining their internal correlations, and the Transformer was found to have positive effects in this area [32]. The Transformer has made tremendous progress in CV and NLP; in the area of fault diagnosis, it has not yet been widely adopted.
This paper proposes an end-to-end cross-sensor domain adaptation fault diagnosis method based on an enhanced Transformer and convolutional embedding. The canonical Transformer can capture long-term dependencies in parallel and can easily be tailored to different input sequence lengths. However, a canonical Transformer does not take much account of local contexts when extracting high-level features [33]. The convolution-based input embedding method further emphasizes the contribution of local contexts to learning. Moreover, the enhanced Transformer model is developed for EMA fault diagnosis, which takes the attention mechanism as its core and avoids the weaknesses of the CNN and LSTM models. Furthermore, Joint Maximum Mean Discrepancy (JMMD) [34] is explored for achieving feature alignment and satisfying the need for domain-invariant features. Domain adversarial training influences the learned representations of Transformer-based models without having much impact on their performance, which indicates that Transformer-based models are already robust across domains [35]. Therefore, adversarial learning is not considered in this study. Finally, a validation experiment is carried out on a real-world EMA fault diagnosis dataset. The experiment results indicate that the proposed method is well-suited for cross-sensor domain fault diagnosis.
The following are the main contributions of this study:

1. An end-to-end cross-sensor domain fault diagnosis model is proposed. The domain adaptation is based on a source-only supervised method.

2. The proposed method takes advantage of a new input embedding technique, which is investigated to incorporate local features into the attention mechanism. An enhanced Transformer is introduced as the backbone to extract effective information from local features with an attention mechanism.

3. A benchmark dataset is grouped into twenty-four transfer tasks to validate the effectiveness of the proposed method. Experimental results on EMA fault diagnosis show the excellent performance of the proposed method in terms of fault diagnosis accuracy and sensor generalization capability under different working conditions.

Problem Formulation
In this paper, the problem of cross-sensor domain fault diagnosis for EMAs is addressed. The proposed model will be trained on labelled data from a sensor at one location and transferred to unlabeled data from a sensor at another location. Thus, the source and target domains will have different feature spaces. Labels are available in the source domain, which can be described as follows:

$$\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$$

where $\mathcal{D}_s$ denotes the source domain with $n_s$ labeled samples, $x_i^s \in \mathbb{R}^{M_s}$ represents the $i$th source domain sample of $M_s$ dimensions, and $y_i^s$ represents the label for the $i$th source domain sample.
Many existing studies based on transfer learning use semi-supervised adaptation and assume that a small number of labelled target samples are available. In this case, domain-adaptive fault diagnosis is easier to implement. However, it is typically not possible to access labelled target domain data in most industrial scenarios. This study therefore focuses on unsupervised cross-sensor domain adaptation for fault diagnosis. For unsupervised learning, labelled target domain data are not available, so the target domain can be defined as follows:

$$\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$$

where $\mathcal{D}_t$ denotes the target domain with $n_t$ unlabeled samples and $x_j^t \in \mathbb{R}^{M_t}$ represents the $j$th target domain sample of $M_t$ dimensions.
Moreover, when testing the model, $y_j^t$ represents the label of the $j$th testing sample. The two label spaces in this paper are assumed to be the same, which implies:

$$\mathcal{Y}_s = \mathcal{Y}_t, \qquad P(x^s, y^s) \neq Q(x^t, y^t)$$

where $\mathcal{D}_s$ and $\mathcal{D}_t$ denote the sets of samples from distributions $P$ and $Q$, respectively. A model that can learn transferable features is designed to bridge this discrepancy across domains, and this model is used to classify unlabelled samples in the target domain:

$$\hat{y}_j^t = f(x_j^t)$$

where $f(\cdot)$ denotes the function of the fault diagnosis model and $\hat{y}_j^t$ is the predicted label of $x_j^t$. As a result, the target risk, $\varepsilon_t(f) = \Pr_{(x,y) \sim Q}[f(x) \neq y]$, is minimized by employing source domain supervision:

$$\min_{\theta}\ \varepsilon_t(f(\cdot\,;\theta))$$

where $\theta$ denotes the model parameters.
A fault diagnostic classification model will be developed for cross-sensor domain adaptation in this study. The vibration data from sensors at one location are utilized to train a classifier that can diagnose the health status of the EMA, including normal and fault states, using data from sensors at other locations.

Proposed Method
This paper presents EMA fault diagnosis with cross-sensor domain adaptation using a convolutional embedding method, an enhanced Transformer model, and the JMMD algorithm. Figure 1 demonstrates the analytical pipeline for the proposed model. During the local feature extraction process, raw sensor data are mapped into local feature representations, and the feature dimension is adapted accordingly. Following this, the local features are fed into the enhanced Transformer. The architecture of the proposed enhanced Transformer is illustrated in Figure 2 and explained in Section 3.2. The total loss is calculated using the cross-entropy loss and JMMD loss. Based on the total loss, several training strategies are utilized to train the entire network. Finally, the trained model is employed to predict the health state of unlabeled EMA data samples in the target domain.

Local Feature Extraction
The proposed local feature network comprises an input embedding layer and a positional encoding layer. It maps raw sensor data into local feature representations.
Though the proposed method can learn features automatically, some data normalization steps can improve its performance. Data normalization is fundamental to fault diagnosis, as it keeps input values within a specific range. Z-score normalization is used to unify data magnitudes and reduce their differences. Let $x = [x_1, x_2, \ldots, x_n]$ denote the input sequence of the vibration signal with length $n$. The Z-score normalization method can be implemented by:

$$\tilde{x}_i = \frac{x_i - \mu_x}{\sigma_x}$$

where $\mu_x$ and $\sigma_x$ are the mean and standard deviation of $x$, respectively. Adjacent time steps in time-series data may have stronger dependencies. In this approach, the input embedding layer consists of a convolution sublayer and a learnable linear mapping sublayer, which provides informative local features to the enhanced Transformer.
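As a concrete illustration, the Z-score step above can be sketched in a few lines of NumPy (a minimal sketch; the array values are toy data standing in for a vibration signal window, not the EMA signals):

```python
import numpy as np

def zscore_normalize(x):
    """Z-score normalization: subtract the mean and divide by the
    standard deviation, giving zero mean and unit variance."""
    return (x - x.mean()) / x.std()

# Toy sequence standing in for a vibration signal window
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_norm = zscore_normalize(x)
```

After this step, signals from different sensor locations share the same magnitude scale before embedding.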
At each time step, the convolution sublayer extracts the local features, $h_i^c$, from the input vector, $g$, within a window of size $k_c$ and padding $p_c$:

$$h_i^c = \mathrm{Conv}(g;\, k_c, p_c)_i$$

Linear mapping is then used to match the dimensions of the local features and the Transformer. In addition, a dropout layer is used to prevent overfitting. The linear mapping output, $h_i$, is defined as:

$$h_i = \mathrm{Dropout}(W_{lm} h_i^c + b_{lm})$$

where $W_{lm} \in \mathbb{R}^{m \times d_{model}}$ and $b_{lm} \in \mathbb{R}^{d_{model}}$ are network parameters and $d_{model}$ is the dimension of the enhanced Transformer.
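The convolution sublayer can be illustrated with a minimal NumPy sketch (assumed shapes for illustration only: the kernel `W` maps a window of `k_c` samples to `m` feature channels; in the actual model these weights are learned):

```python
import numpy as np

def conv1d_local_features(g, W, b, p_c):
    """Sliding-window 1D convolution with zero padding p_c: each local
    feature h_i^c is a linear map of the k_c samples around time step i."""
    k_c = W.shape[0]
    g_pad = np.pad(g, (p_c, p_c))
    return np.array([g_pad[i:i + k_c] @ W + b for i in range(len(g))])

rng = np.random.default_rng(0)
g = np.sin(np.linspace(0.0, 6.28, 20))       # toy input sequence
W = rng.standard_normal((3, 4))              # k_c = 3 window -> m = 4 channels
b = np.zeros(4)
h_c = conv1d_local_features(g, W, b, p_c=1)  # one feature vector per time step
```

With `p_c = (k_c - 1) / 2`, the output keeps one local feature vector per input time step, which is what the subsequent linear mapping and attention layers consume.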
To account for the order of the data, a positional encoding function is applied after the linear mapping layer. As the Transformer does not contain any recurrence or convolution operations, it is necessary to inject relative position tokens into the input to fully exploit the positional information. Sine and cosine functions with different frequencies are used to encode position:
$$p_{(i,\,2j)} = \sin\!\left(\frac{i}{10000^{2j/d_{model}}}\right), \qquad p_{(i,\,2j+1)} = \cos\!\left(\frac{i}{10000^{2j/d_{model}}}\right)$$

where $p_i$ represents the positional encoding at time step $i$, $i$ denotes the position, and $j$ denotes the dimension. Each dimension of the positional encoding corresponds to a sinusoid.
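The sinusoidal encoding can be sketched as follows (a minimal NumPy version of the standard formula; `n_positions` and `d_model` are illustrative values):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Sine/cosine positional encoding: even dimensions use sine, odd
    dimensions use cosine, with geometrically increasing wavelengths."""
    pe = np.zeros((n_positions, d_model))
    position = np.arange(n_positions)[:, None]                   # (n, 1)
    div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(position / div_term)
    pe[:, 1::2] = np.cos(position / div_term)
    return pe

pe = sinusoidal_positional_encoding(50, 16)
```

Each row of `pe` is added element-wise to the embedded feature vector at the same time step, injecting order information without recurrence.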
Finally, the output of the local feature extraction, i.e., the final output of data pre-processing, is denoted as $X_i$:

$$X_i = h_i + p_i$$

Enhanced Transformer Based Feature Extraction
An encoder and a decoder comprise this sequence-to-sequence architecture. The encoder maps input sequences into hidden feature spaces, and the decoder uses the feature vectors to generate output sequences. The encoder may be viewed as an extractor of features in this sense. The encoder is composed of N-stacked encoder modules. Each encoder module is composed of a multi-head attention layer and a position-wise feedforward layer. A residual connection [36] and layer normalization [37] are applied to each module. The stacked multiple encoder modules share the same structure, but their parameters differ.
This enhanced encoder introduces two major modifications: reordering of the layer normalization layers, and the use of Gaussian Error Linear Unit (GeLU) activation instead of the standard ReLU activation.
The concept of attention can be represented as a query of an entry against key-value pairs. For input data $X$, $W^Q$, $W^K$, and $W^V$ are learnable projection matrices used to generate the corresponding $Q$ (Query), $K$ (Key), and $V$ (Value):

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

The attention of the encoder uses scaled dot-product attention, which is defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $\sqrt{d_k}$ serves as a normalization factor. For any input vector, $v$, the softmax function rescales the elements of the vector so that they lie in the range $[0, 1]$ and sum to 1:
$$\mathrm{softmax}(v)_i = \frac{e^{v_i}}{\sum_j e^{v_j}}$$

where $v_i$ is the $i$th element of vector $v$. The Transformer is distinguished by its multi-head attention design. With multi-head attention, the input data are transformed into multiple queries, keys, and values $H$ times. Consequently, multi-head attention enables the model to attend to information from a variety of representation subspaces at the same time. The function of multi-head attention is defined as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)W^O$$

where

$$\mathrm{head}_h = \mathrm{Attention}(QW_h^Q, KW_h^K, VW_h^V)$$

The number of attention heads in a Transformer is constrained: $d_{model}$ must be divisible by $H$ without remainder. For each head, the dimensions are:

$$d_k = d_q = d_v = d_{model}/H$$

After the input data, $X$, is passed through the multi-head attention layer, a layer normalization layer and a residual connection combine the input and output of the multi-head attention function.
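The attention equations above can be sketched in NumPy (a minimal single-sequence sketch without batching; the random matrices stand in for learned projections, and heads are formed by slicing the full projections, as in common implementations):

```python
import numpy as np

def softmax(v, axis=-1):
    """Numerically stable softmax: rescales elements to [0, 1] summing to 1."""
    e = np.exp(v - v.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O, H):
    """Split the projected queries/keys/values into H heads of size
    d_model // H, attend per head, concatenate, and project with W_O."""
    d_model = X.shape[-1]
    d_k = d_model // H  # d_model must be divisible by H
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = [scaled_dot_product_attention(Q[:, h*d_k:(h+1)*d_k],
                                          K[:, h*d_k:(h+1)*d_k],
                                          V[:, h*d_k:(h+1)*d_k])
             for h in range(H)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))                 # 6 time steps, d_model = 8
W_Q, W_K, W_V, W_O = (rng.standard_normal((8, 8)) for _ in range(4))
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O, H=2)
```

Each row of the softmaxed score matrix is a full-sequence attention distribution, which is what gives every time step a complete receptive field.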
For any vector, $v$, the layer normalization is computed as:

$$\mathrm{LN}(v) = \gamma \odot \frac{v - \mu_v}{\sigma_v} + \beta$$

where $\mu_v$ and $\sigma_v$ are the mean and standard deviation of the elements in $v$, and the scale $\gamma$ and bias vector $\beta$ are learnable parameters. A residual connection and layer normalization are used to combine the input and output of the multi-head attention layer:

$$Y = \mathrm{LN}(X + Y_{MA})$$

where $Y_{MA}$ is the output of the multi-head attention layer. In deep neural networks, residual connections are essential for alleviating information decay. However, the canonical encoder applies a series of layer normalization operations that non-linearly transform the state encoding. Inspired by [38], layer normalization is moved to the input streams of the submodules. This creates a smooth gradient path that flows directly from the output to the input without any transformation.
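Layer normalization itself is a short computation (a minimal sketch; `gamma` and `beta` are scalars here, though in practice they are learned vectors of size `d_model`):

```python
import numpy as np

def layer_norm(v, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize over the last (feature) axis, then scale and shift."""
    mu = v.mean(axis=-1, keepdims=True)
    sigma = v.std(axis=-1, keepdims=True)
    return gamma * (v - mu) / (sigma + eps) + beta

v = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm(v)
```

Because the statistics are computed per time step rather than per batch, the operation behaves identically at training and inference time.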
In this relation, the output of the multi-head attention module is changed to:

$$Y_{MA} = X + \mathrm{MultiHead}(\mathrm{LN}(X))$$

In addition, each encoder module includes a two-layer Feed-Forward Network (FFN), whose activation function in the canonical Transformer is the Rectified Linear Unit (ReLU) [39]. For any input $v$, ReLU is defined as:

$$\mathrm{ReLU}(v) = \max(0, v)$$

As the term indicates, position-wise refers to applying the network to each position separately. The computation of the position-wise feed-forward network is defined as:

$$\mathrm{FFN}(v) = W_2\,\mathrm{ReLU}(W_1 v + b_1) + b_2$$

where $W_1$, $W_2$ and $b_1$, $b_2$ are the weights and biases of the two sub-layers, respectively. In the feed-forward network, the ReLU function first performs a non-linear dimension-raising operation on the input, and then the linear layer performs a linear dimension-reduction operation.
The position-wise feed-forward module connects input and output through a residual connection and a layer normalization layer:

$$Y = \mathrm{LN}(Y_{MA} + Y_{FF})$$

where $Y_{FF}$ is the output of the position-wise feed-forward layer.
Reordering the layer normalization layer also smooths the gradient path. Hence, the output of the position-wise feed-forward module is changed to:

$$Y = Y_{MA} + \mathrm{FFN}(\mathrm{LN}(Y_{MA}))$$

To enhance the convergence of the encoder layers, Gaussian Error Linear Unit (GeLU) [40] activation is used instead of ReLU activation. An input, $v$, and a stochastic mask, $u \sim \mathcal{N}(0, 1)$, are combined to define the GeLU function:

$$\mathrm{GeLU}(v) = v \cdot \Pr(u \leq v) = v\,\Phi(v)$$

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution $\mathcal{N}(0, 1)$.
$\Phi(v)$ can be written in terms of the Gaussian error function, erf:

$$\Phi(v) = \frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{v}{\sqrt{2}}\right)\right]$$

Thus, GeLU can be formulated as:

$$\mathrm{GeLU}(v) = \frac{v}{2}\left[1 + \mathrm{erf}\!\left(\frac{v}{\sqrt{2}}\right)\right]$$

As a continuous, differentiable function, GeLU is smoother than ReLU around $v = 0$. Sublayer inputs typically follow a normal distribution, particularly when layer normalization is applied. In this setting, the probability of the input being masked increases as $v$ decreases; therefore, GeLU provides a stochastic yet input-dependent gating.
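The exact erf-based form of GeLU can be sketched as follows (a minimal NumPy/`math.erf` version for illustration; deep learning frameworks provide this activation built in):

```python
import numpy as np
from math import erf, sqrt

def gelu(v):
    """GeLU(v) = v * Phi(v): Phi is the standard normal CDF,
    computed exactly via the Gaussian error function."""
    return v * 0.5 * (1.0 + np.vectorize(erf)(v / sqrt(2.0)))

def relu(v):
    """ReLU for comparison: hard-gates all negative inputs to zero."""
    return np.maximum(0.0, v)

v = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
g = gelu(v)
```

Unlike ReLU, GeLU passes a small negative signal for moderately negative inputs instead of cutting them off abruptly, which is the smoothness property discussed above.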
In this relation, the computation of the position-wise feed-forward network is changed to:

$$\mathrm{FFN}(v) = W_2\,\mathrm{GeLU}(W_1 v + b_1) + b_2$$

Finally, the proactive use of residual connections, GeLU activation, and the reordering of layer normalizations contribute to a faster and more stable convergence process.
The output of the encoder is converted to a one-dimensional vector after a flatten layer. After that, the vector is mapped to higher-level features using a Fully Connected (FC) layer. As a result, high-level features are derived from the hidden features learned by the encoder. The fully connected network consists of N E neurons and a GeLU activation function; in addition, the dropout technique is employed to reduce overfitting.
$$E = \mathrm{GeLU}(W_f\,\mathrm{Flatten}(Y_{FF}^{end}) + b_f)$$

where $Y_{FF}^{end}$ is the output of the last encoder module, $\mathrm{Flatten}(\cdot)$ is the function of the flatten layer, and $W_f \in \mathbb{R}^{d_{model} d_{ff} \times N_E}$ and $b_f \in \mathbb{R}^{N_E}$ are the weights and biases of the fully connected layer, respectively.
Finally, the encoder maps the input data, $X$, to high-level feature representations, $E$. In the next step, these high-level features are used for classification tasks.

Feature Transfer and Classification
A backbone and a bottleneck structure are used for transfer learning-based classification, as shown in Figure 2. In this scenario, the enhanced encoder serves as a backbone network whose output is fed into a bottleneck network. The bottleneck is used to reduce the distribution discrepancy between cross-sensor domain features and to learn transferable features. The bottleneck contains $L$ fully connected layers $\{FC_1, \ldots, FC_L\}$ that are transferable for domain adaptation, and their output features are handled with transfer learning strategies.
Each fully connected layer of the bottleneck consists of a fully connected sublayer with $N_L$ neurons, a GeLU activation function, and a dropout sublayer. The output features of $FC_l$ are represented as:

$$Z_l = \mathrm{Dropout}(\mathrm{GeLU}(W_l Z_{l-1} + b_l))$$

where $W_l \in \mathbb{R}^{N_L \times N_L}$ and $b_l \in \mathbb{R}^{N_L}$ are the weights and biases of the $l$th fully connected layer, respectively. When $l = 1$, $Z_0$ equals $E$. The output of the bottleneck is passed through a softmax layer to derive a probability distribution over the output class labels. The softmax layer consists of a linear layer with $N_c$ neurons and a basic softmax activation function:
$$Y_i = \mathrm{softmax}(W_c Z_L + b_c)$$

where $Y_i$ is the predicted EMA health state corresponding to $X_i$, and $W_c \in \mathbb{R}^{N_L \times N_c}$ and $b_c \in \mathbb{R}^{N_c}$ are the weights and biases of the output layer, respectively.
In other words, the function of the softmax layer is to map the transfer features to EMA health states.
The classification optimization objective includes supervision of the source domain and minimization of domain discrepancy.
In the first step, the conventional supervised machine learning paradigm is implemented, in which the categorical cross-entropy loss on the source domain is minimized.
$$\mathcal{L}_c = -\frac{1}{n_s}\sum_{i=1}^{n_s}\sum_{c=1}^{N_c} \mathbb{1}(y_i^s = c)\,\log Y_{i,c}^s$$

where $\mathcal{L}_c$ is the cross-entropy loss, $\mathbb{1}(\cdot)$ is the indicator function, and $N_c$ is the number of EMA health states. The total loss of the classifier model can be formulated as:

$$\mathcal{L} = \mathcal{L}_c + \lambda \mathcal{L}_t$$

where $\lambda$ is the trade-off factor and $\mathcal{L}_t$ is the partial loss that closes the discrepancy between the source and target domains. In a classifier model, the cross-entropy loss function alone is effective only if the training and testing data come from the same sensor location. The domain shift is aligned by upgrading the optimization objective to include JMMD in the loss function.
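A minimal NumPy sketch of the categorical cross-entropy above (the probabilities are toy softmax outputs chosen for illustration, not model predictions):

```python
import numpy as np

def cross_entropy_loss(probs, labels, n_classes):
    """Categorical cross-entropy over softmax outputs:
    L_c = -(1/n) * sum_i sum_c 1(y_i = c) * log p_{i,c}."""
    one_hot = np.eye(n_classes)[labels]            # indicator 1(y_i = c)
    return -np.mean(np.sum(one_hot * np.log(probs + 1e-12), axis=1))

# Two samples, three health states; the first prediction is confident and correct
probs = np.array([[0.9, 0.05, 0.05],
                  [0.2, 0.5, 0.3]])
labels = np.array([0, 1])
loss = cross_entropy_loss(probs, labels, n_classes=3)
```

Only the probability assigned to the true class contributes, so confident correct predictions drive the loss toward zero.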
JMMD is used to address the domain shift in the joint distributions. JMMD measures the discrepancy between two joint distributions, $P(x^s, y^s)$ and $Q(x^t, y^t)$, based on their Hilbert space embeddings. The joint distribution discrepancy between the source and target domains is calculated as follows:

$$\mathcal{L}_J = \left\| \mathbb{E}_P\!\left[\otimes_{l=1}^{L} \phi_l(Z_l^s)\right] - \mathbb{E}_Q\!\left[\otimes_{l=1}^{L} \phi_l(Z_l^t)\right] \right\|_{\otimes_{l=1}^{L} \mathcal{H}_l}^2$$

where $\mathbb{E}(\cdot)$ denotes the mathematical expectation, $\mathcal{H}_l$ is a Reproducing Kernel Hilbert Space (RKHS), $\phi_l(\cdot)$ is the mapping to the RKHS, $L$ represents the number of fully connected layers in the bottleneck, $\otimes_{l=1}^{L} \phi_l(Z_l)$ is the feature map in the tensor product Hilbert space, and $Z_l^s$ and $Z_l^t$ are the source and target domain features in the $l$th fully connected layer of the bottleneck, respectively. $\otimes_{l=1}^{L} \phi_l(Z_l)$ can be formulated as:

$$\otimes_{l=1}^{L} \phi_l(Z_l) = \phi_1(Z_1) \otimes \phi_2(Z_2) \otimes \cdots \otimes \phi_L(Z_L)$$

To address the domain shift issue, the cross-entropy loss and the discrepancy loss are integrated into the optimization objective. In this way, the total loss can be calculated as follows:

$$\mathcal{L} = \mathcal{L}_c + \lambda \mathcal{L}_J$$

During network training, the parameters are updated at each epoch as follows:

$$\theta \leftarrow \theta - \delta \frac{\partial \mathcal{L}}{\partial \theta}$$

where $\delta$ denotes the learning rate.
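The empirical JMMD statistic can be sketched with Gaussian kernels (a minimal, biased V-statistic estimate for illustration; the feature arrays and the single bandwidth `sigma` are assumptions, and practical implementations typically use multi-bandwidth kernels over mini-batches):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """RBF kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def jmmd(source_layers, target_layers, sigma=1.0):
    """Empirical JMMD: per-layer kernels are combined by elementwise
    product (the kernel of the tensor-product Hilbert space), then the
    standard biased MMD^2 statistic is computed on the joint kernel."""
    K_ss = K_tt = K_st = 1.0
    for Zs, Zt in zip(source_layers, target_layers):
        K_ss = K_ss * gaussian_kernel(Zs, Zs, sigma)
        K_tt = K_tt * gaussian_kernel(Zt, Zt, sigma)
        K_st = K_st * gaussian_kernel(Zs, Zt, sigma)
    return K_ss.mean() + K_tt.mean() - 2 * K_st.mean()

rng = np.random.default_rng(2)
# Features from two bottleneck layers per domain (illustrative shapes)
Zs = [rng.standard_normal((16, 4)), rng.standard_normal((16, 3))]
Zt = [rng.standard_normal((16, 4)) + 1.0, rng.standard_normal((16, 3))]
gap = jmmd(Zs, Zt)    # shifted target -> positive discrepancy
same = jmmd(Zs, Zs)   # identical domains -> zero
```

The elementwise product of the per-layer kernel matrices realizes the tensor-product feature map in the equations above without ever computing $\phi_l$ explicitly.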

Dataset Description

To verify the effectiveness of the proposed method for cross-sensor domain fault diagnosis of EMA, we use the Flyable Electromechanical Actuator (FLEA) dataset provided by the National Aeronautics and Space Administration (NASA), Ames Research Center [41]. As shown in Figure 3, the FLEA contains three different actuators:
• Actuator A - the fault-injected test actuator;
• Actuator B - the nominal test actuator;
• Actuator C - the load actuator.
By switching the load actuator from actuator B to actuator A, the FLEA enables fault injection without changing the operating state of the EMA system.

Additionally, two nut accelerometers were attached to actuators A and B. The accelerometers were used to collect vibration data at a sampling frequency of 20 kHz. As shown in Figure 4, the accelerometer measured in three different directions (X is in line with the actuator motion, Y is vertical, and Z is horizontal).

In this study, there were four classes of data: the normal state, ball screw return channel jam, screw surface spall, and motor failure. Detailed information on the dataset is provided in Table 1.
The acceleration signals of the four states from the three directions in experiment 3 are taken as an example in Figure 5. The signals in the three directions have different amplitudes and shapes.

Transfer Task Description
In practice, EMAs work under a variety of conditions and handle complex transmission chains. Therefore, the transfer tasks should include a variety of scenarios that vary in the driving waveforms, load profiles, and output directions.
Four experiments were conducted, as described in Table 1, to examine the effects of varying working conditions upon transfer task performance. Only the data from actuator A was used in the source and target domains. EMA fault diagnosis is performed by utilizing all possible sensor location combinations under the four working conditions. The detailed information for the six-class cross-sensor fault diagnosis tasks is presented in Table 2.

Generally, location X is more sensitive to faults than locations Y and Z. However, locations Y and Z are more convenient for installing sensors than location X.
Tasks T1 and T2 are aimed at determining the efficiency of the feature transfer process from fault-critical sensor locations (location X) to locations that can be implemented easily (locations Y and Z).
Tasks T3 and T4 evaluate the transferability from locations Y and Z to location X to demonstrate the effectiveness of the transferability from signals located at locations with less health information to those located at locations with more health information.
Tasks T5 and T6 examine the transferability of measurement axes between the Y and Z locations to analyze different health information.
In addition, four subtasks (a–d), corresponding to the four working conditions, were designed for each task for a thorough evaluation. The source and target sensor locations of the six tasks are summarized below:

Task        Source domain   Target domain
T1 (a–d)    Location X      Location Y
T2 (a–d)    Location X      Location Z
T3 (a–d)    Location Y      Location X
T4 (a–d)    Location Z      Location X
T5 (a–d)    Location Y      Location Z
T6 (a–d)    Location Z      Location Y
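The twenty-four transfer tasks arise from the six ordered pairs of the three sensor locations crossed with the four working conditions. A minimal sketch of this enumeration (the task/location naming follows the text; the data structures themselves are illustrative, not from the paper):

```python
from itertools import permutations

# T1: X->Y, T2: X->Z, T3: Y->X, T4: Z->X, T5: Y->Z, T6: Z->Y,
# as described in the task-design paragraphs above.
pair_to_task = {
    ("X", "Y"): "T1", ("X", "Z"): "T2",
    ("Y", "X"): "T3", ("Z", "X"): "T4",
    ("Y", "Z"): "T5", ("Z", "Y"): "T6",
}
conditions = ["a", "b", "c", "d"]  # the four experiments in Table 1

tasks = [
    (pair_to_task[(src, tgt)] + cond, src, tgt)
    for (src, tgt) in permutations(["X", "Y", "Z"], 2)
    for cond in conditions
]
print(len(tasks))  # 24 transfer tasks in total
```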

Compared Approaches
To validate the effectiveness of the proposed method, four approaches are compared with it under a similar experimental setting.
(1) 1D-CNN (CNN). The 1D CNN method establishes a baseline for detecting faults across sensor domains without transfer learning.
(2) Basic-Transformer (BT). This method uses the traditional, unimproved Transformer encoder and only the classification loss L_c as its loss function.
(3) Enhanced-Transformer (ET). The proposed enhanced encoder is trained on the source domain only, so source supervision is the sole optimization objective. The trained encoder is then used to recognize the health states represented by the target-domain EMA data.
(4) CNN-JMMD (CJ). This method combines the CNN with JMMD, which is used to reduce the distribution discrepancy in the bottleneck layer.
(5) Enhanced-Transformer-JMMD (ETJ). The proposed method jointly uses the local feature extraction network, the enhanced encoder, and JMMD; both L_c and the JMMD loss L_J are considered for optimization.
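For reference, the core of a JMMD-style criterion takes the element-wise product of per-layer kernels (e.g. over bottleneck features and classifier outputs) and evaluates the squared MMD under that joint kernel. A minimal NumPy sketch with a single-bandwidth Gaussian kernel (the paper's implementation likely uses multi-kernel variants; the function names and fixed bandwidth are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Pairwise Gaussian kernel matrix between rows of x and y."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def jmmd(src_layers, tgt_layers, sigma=1.0):
    """JMMD-style estimate: the joint kernel is the element-wise
    product of per-layer kernels, and the discrepancy is the
    (biased) squared MMD under that joint kernel."""
    n, m = src_layers[0].shape[0], tgt_layers[0].shape[0]
    k_ss, k_tt, k_st = np.ones((n, n)), np.ones((m, m)), np.ones((n, m))
    for s, t in zip(src_layers, tgt_layers):
        k_ss *= gaussian_kernel(s, s, sigma)
        k_tt *= gaussian_kernel(t, t, sigma)
        k_st *= gaussian_kernel(s, t, sigma)
    return k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean()

# Identical distributions give zero discrepancy; a shift raises it.
rng = np.random.default_rng(0)
src = [rng.normal(size=(32, 8)), rng.normal(size=(32, 4))]
tgt = [a + 2.0 for a in src]
print(jmmd(src, src), jmmd(src, tgt))
```

In practice the two layer lists would hold the bottleneck features and the (pseudo-)label probability outputs of the source and target batches.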

Model Parameters
Raw time-domain data with a dimension of 128 are fed directly into the proposed network. To compare network performance, the CNN and the enhanced encoder were configured with similar dimensions and numbers of layers. Table 3 summarizes the architecture and parameters of the Encoder-JMMD, and Table 4 shows those of the CNN. The CNN-JMMD shares the bottleneck of the Encoder-JMMD.

Training Strategies
Raw vibration data collected from the FLEA is divided into three sets. The first set is the labelled data from the source domain. The second dataset, composed of 80% of the unlabeled data from the target domain, is used for model training to align the domains. The rest of the unlabeled data constitute the third dataset and are used to evaluate the trained model. It is important to note that there is no overlap between the second and third datasets.
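The three-set split described above can be sketched as follows (a minimal sketch; the names and dataset size are illustrative, not from the paper):

```python
import numpy as np

def split_target(n_target, train_frac=0.8, seed=0):
    """Split unlabeled target-domain sample indices into a subset used
    only for domain alignment and a disjoint subset used for evaluation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_target)
    n_train = int(train_frac * n_target)
    return idx[:n_train], idx[n_train:]

align_idx, eval_idx = split_target(1000)
print(len(align_idx), len(eval_idx))  # 800 200
```

Because the permutation is partitioned, the second (alignment) and third (evaluation) sets can never overlap, matching the requirement stated above.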
To initialize the network for model training, the Xavier normal initializer is employed. In addition to the back-propagation method, the Adam optimization method is used for updating all parameters. The fault diagnosis model is trained for 150 epochs.
It is necessary to predict pseudo labels for the target domain to perform the JMMD calculation. However, the pseudo labels predicted in the initial iterations may not be accurate for the target domain. Hence, the model is trained with source samples only for the first 50 epochs, so pseudo labels are predicted after 50 epochs, at which point the transfer learning strategy is employed. A minibatch Adam optimizer with a step learning rate annealing strategy is used. The learning rate is set to 0.001 at the beginning of training and, to avoid converging to a local optimum, is reduced to 0.0001 after 100 epochs.
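The epoch-dependent schedule just described can be expressed compactly (a sketch; the threshold values come from the text, while the function names are ours):

```python
def learning_rate(epoch):
    # Step annealing: 0.001 for the first 100 epochs, 0.0001 afterwards.
    return 0.001 if epoch < 100 else 0.0001

def use_transfer_loss(epoch, warmup=50):
    # Source-only supervision for the first 50 epochs; pseudo labels
    # and the JMMD term are enabled only after the warm-up.
    return epoch >= warmup

print(learning_rate(0), learning_rate(120), use_transfer_loss(60))
```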
Furthermore, a progressive training method is used to increase the trade-off parameter from 0 to 1 as the training epoch κ increases from 50 to 150 [42].
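Assuming the standard sigmoid ramp of [42] (our assumption; the paper's exact formula is not reproduced here), a trade-off parameter that rises from 0 toward 1 over this epoch range can be sketched as:

```python
import math

def trade_off(kappa, start=50, end=150):
    """Progressive trade-off parameter: 0 at epoch 50, approaching 1
    at epoch 150 (sigmoid ramp assumed, following [42])."""
    p = (kappa - start) / (end - start)   # training progress in [0, 1]
    return 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0

print(trade_off(50))   # 0.0
print(trade_off(150))  # close to 1
```

This schedule keeps the JMMD term suppressed while the pseudo labels are still unreliable and ramps it up smoothly as training progresses.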

Overall Results
The statistical results of the twenty-four tasks are shown in Table 5. The typical 1D CNN method has an average accuracy of less than 70% and struggles to diagnose faults across the multiple tasks. This demonstrates that cross-sensor domain adaptation poses significant challenges and cannot be directly addressed by existing deep learning approaches.
The basic Transformer improves average accuracy over the 1D CNN by 11.15 percent, which is a result of better model characterization.
The enhanced Transformer provides a 5.12% improvement in average accuracy over the basic Transformer. The standard deviation of the basic Transformer is greater than that of the enhanced Transformer, indicating that the proposed model is more stable. This suggests that the proposed model has the potential to be generalized and to diagnose faults effectively.
When combined with JMMD, the CNN method achieves an average diagnosis accuracy of over 90%, and the average accuracy of the enhanced Transformer-JMMD exceeds 97%. By aligning the conditional distributions, the transfer strategies achieve significantly better diagnosis results; this is because the vibration data collected at the two sensor locations have vastly different distributions.
Among these five methods, the proposed enhanced Transformer-JMMD method is the most accurate in terms of both mean and standard deviation of accuracy for all tasks. Furthermore, all the methods have great difficulties in performing the transfer tasks specified in T1 and T2, which are designed to determine whether a method can transfer knowledge from an optimal to a suboptimal sensor location. Fortunately, the average accuracy of T1 and T2 is greatly improved by the method proposed in this paper.

Visualization
Figure 6 presents the test accuracies of Table 5 as a histogram to provide a more intuitive visual comparison.
The proposed method demonstrates significantly higher accuracy compared to other methods. Particularly, the proposed method exhibits better results than the commonly used CJ algorithm in each task. It appears that the proposed method is beneficial in dealing with cross-sensor problems.
A t-distributed Stochastic Neighbor Embedding (t-SNE) approach is also used in Figure 7 to display the high-level features for tasks T3d and T1c. Because there were too many test samples, 200 samples were randomly selected for feature visualization.
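This visualization step can be sketched with scikit-learn's TSNE (assumed tooling; the random features below merely stand in for the trained encoder's outputs):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the high-level features of the 200 randomly selected
# test samples; the real features come from the trained encoder.
features = rng.normal(size=(200, 64))

# Project to 2-D for plotting, as done for Figure 7.
embedded = TSNE(n_components=2, perplexity=30,
                random_state=0).fit_transform(features)
print(embedded.shape)  # (200, 2)
```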
A comparison of the CNN and ETJ methods in task T3d is presented in Figure 7a,b. With the CNN method, even for the simplest tasks, there are significant discrepancies between the source and target domains: data from the two domains under the same health state are projected into different places, so the CNN method attains low diagnostic accuracy in the target domain. This finding suggests that well-established approaches cannot directly generalize the knowledge learned under source supervision to target domains. Figure 7b demonstrates that, with the proposed method, the data for the two domains under the same health state are accurately mapped into close high-level feature spaces. Most features with the same label across the two domains are clustered together, and only a few samples are incorrectly classified.
The CJ and ETJ methods were compared for the most challenging task, T1c; their visualization results are presented in Figure 7c,d. Although JMMD minimizes the discrepancy between the source and target domain distributions, the CJ method is still generally inefficient, with a high rate of misclassification. Utilizing the enhanced Transformer in the same task significantly improves the performance. However, the three fault classes still exhibit domain gaps, even after a satisfactory clustering phenomenon has been achieved in the four classes. Consequently, negative testing performances are seen for these three classes, as illustrated by Figure 7d. Comparatively, the other methods are significantly less capable of achieving cross-sensor domain adaptation for task T1c, resulting in very low numerical test accuracy, as shown in Table 5.
Figure 8 illustrates the confusion matrices corresponding to tasks T3d and T1c, comparing the predicted results with the ground truth. In Figure 8b, the proposed method achieves high diagnostic accuracy in all health states with few misclassifications. Figure 8d illustrates that the proposed model can distinguish most health states for task T1c. The normal state is likely to have the most similar representations across domains, making it easier to align the distributions, which in turn results in optimal performance. The proposed method is also significantly more accurate than the CJ method in classifying faults.

Conclusions
In this paper, we propose an end-to-end cross-sensor domain fault diagnosis method for electromechanical actuators. An enhanced Transformer model is designed to obtain stable features. The enhanced Transformer reorders the layer normalization layers and replaces the standard ReLU activation with GeLU activation. With these improvements, the proposed model offers more stable optimization and greater robustness than the canonical architecture. Furthermore, in addition to the enhanced Transformer's ability to obtain global information, we propose a convolution-based embedding technique to improve the model's ability to incorporate local contexts. The Joint Maximum Mean Discrepancy metric is employed in conjunction with the enhanced Transformer to optimize the distribution of the source and target domains corresponding to the various sensor locations. A real-world dataset of electromechanical actuators with three sensor locations and four health states is used to validate the proposed method. As demonstrated by the experimental results, the proposed method achieves outstanding results for a variety of sensor position transfer tasks under different working conditions. Therefore, the method proposed in this paper can effectively solve the problem of cross-sensor domain fault diagnosis of electromechanical actuators, which is a consequence of their complex construction, driving waveform, and load profile. In the future, we will research more EMA fault classes and more complex sensor locations.

Conflicts of Interest:
The authors declare no conflict of interest.