Graph Adaptation Network with Domain-Specific Word Alignment for Cross-Domain Relation Extraction

Cross-domain relation extraction has become an essential approach when the target domain lacks labeled data. Most existing works adapt relation extraction models from the source domain to the target domain by aligning sequential features, but they fail to transfer non-local and non-sequential features, such as word co-occurrence, which are also critical for cross-domain relation extraction. To address this issue, we propose a novel tripartite graph architecture that adapts non-local features when there is no labeled data in the target domain. The graph uses domain words as nodes to model the co-occurrence relations between domain-specific words and domain-independent words. Through graph convolutions on the tripartite graph, the information of domain-specific words is propagated so that the word representations can be fine-tuned to align domain-specific features. In addition, unlike a traditional graph structure, the edge weights innovatively combine fixed weights and dynamic weights to capture global non-local features while avoiding the introduction of noise into the word representations. Experiments on three domains of the ACE2005 dataset show that our method outperforms state-of-the-art models by a large margin.


Introduction
The Internet of Things (IoT) is a large-scale paradigm in which devices communicate with each other regardless of their owner. Communication occurs not only between machines and people, but also between machines [1] and between machines and intelligent objects [2]. The IoT of connected devices spans different fields, such as healthcare, agriculture, smart cities, smart homes, smart grids, automated vehicles, asset monitoring, environmental monitoring, education, industry, etc. [3]. Each field contains a large amount of data, and the data distribution differs across fields. At the same time, each of the above domains contains a large number of sensors, actuators, gateways, servers and related end-user applications [1,3,4]. Data collected from interconnected objects or things are used to produce results in different IoT applications. The basic modules of any IoT solution are connected devices, communication networks, services, management, security and applications [1,5]. In this article, the data from different IoT fields correspond to the source domain and target domain information. Cross-domain relation extraction in the IoT will therefore aid the collection, mining and classification of such data.
Collecting traffic information from social networks and using it for travel safety are two challenging issues in Intelligent Transportation Systems (ITSs). The transportation network can be monitored through sensor devices and social network data; currently, ITSs mainly rely on sensor devices. However, word co-occurrence information depends strongly on the corpus and inevitably introduces some noise; therefore, we propose a method that calculates dynamic edge weights using an attention mechanism. The fixed weights and dynamic weights are combined as the final edge weights. The word information is propagated through graph convolutions on the tripartite graph so that the domain-specific word representations are aligned. After adapting the non-local features, i.e., obtaining aligned word representations from the graph adaptation network, we perform adversarial training to extract the local features shared between domains. Because the word representations are aligned before being fed into the downstream modules, the model can extract shared features effectively. Finally, the shared features are fed to a fully connected layer to perform relation extraction. Because we assume that only the source domain data have labels, the relation classifier is trained using only source-labelled data. In addition, to prevent the transfer of irrelevant information between words, we keep valuable edges and remove irrelevant edges based on their fixed weights. In this way, the model not only preserves the information useful for adapting non-local features but also speeds up computation.
The major contributions of our work are as follows:
• We propose a novel graph adaptation network to align domain-specific features. Local features and non-local features are transferred simultaneously for cross-domain relation extraction. To our knowledge, this is the first work to adapt non-local features between domains.
• Unlike a traditional graph convolutional network, our network combines fixed weights and dynamic weights as the edge weights of the graph. In addition, rather than using a fully connected graph, we keep only valuable edges based on their edge weights. These strategies transfer useful information effectively and avoid introducing irrelevant noise.
• Experiments show that non-local features such as word co-occurrences are also important for cross-domain relation extraction. The proposed method for calculating weights and selecting edges captures more non-local features and avoids noise interference better than other methods.
The rest of this paper is organized as follows. Section 2 presents the work carried out in the field of cross-domain relation extraction. Section 3 introduces the task definition of the relation extraction. Section 4 describes a graph adaptation network for cross-domain relation extraction. Section 5 presents experiments that analyze the effectiveness of our model. Section 6 concludes and presents future work.

Related Work
Data collection and data query are important applications in wireless sensor networks (WSNs) and the IoT, and both are usually information-centric. We note that a WSN is a special network in which each sensor may produce sensor data in different fields (for example, smart cities, healthcare, agriculture, etc.). Each field contains a large amount of data, and the data distribution differs across fields. In the last decade, deep learning (DL) has made breakthroughs in natural language processing (NLP), image processing and reinforcement learning, and has come to occupy a dominant position in artificial intelligence (AI). Presently, much research has contributed to information mining and the utilization of massive data [23]. In addition, Lei et al. used multi-sensor data for fault detection of gearboxes [24]. Safizadeh et al. studied multi-sensor data fusion to improve the performance of fault recognition for rolling element bearings [25]. Jing et al. combined deep neural networks and multi-sensor data fusion for fault detection of planetary gearboxes [26]. Compared with these fields, DL methods for cross-domain analysis of sensor data are relatively scarce [27]. At the same time, few papers have studied cross-domain deep feature learning and fusion models. However, the representativeness of common features clearly affects the ability to mine and classify cross-domain information. These problems encourage researchers to find new methods for adaptively extracting cross-domain relationships, which in turn helps to mine and classify information from different fields.
Cross-domain relation extraction aims to solve the problem of the training set and test set having different data distributions. Reference [11] was the first work to adapt a relation extraction model to other domains; it used generalized approaches such as word clustering to extract shared features. The authors in [13,28] combined hand-crafted features, such as dependency paths, with learned word embeddings for cross-domain relation extraction. These methods rely on manually crafted features, so they are limited and lose some information. The deep learning approach was applied to cross-domain relation extraction in [29], which combined feature-based methods and neural networks to exploit their respective advantages. Reference [15] used adversarial training to extract shared features by introducing a gradient reversal layer (GRL) [30], but it simply projected source features and target features into one unified space, inevitably introducing some domain-specific features that harmed the performance of relation extraction in the target domain. Reference [16] proposed a genre separation network to extract shared features and specific features separately. Reference [31] applied cross-view training [32] to a domain adversarial neural network (DANN) [15] and adapted shared features in different views; this is a highly fine-grained method.
Many domain adaptation methods have also been applied to other tasks. Reference [20] used spectral clustering to unify specific word representations in the context of text classification. Reference [33] used some labelled data from the target domain to learn domain-specific information. Reference [34] aligned the cells of sequence models across domains to perform domain adaptation; this is another fine-grained method. In the image field, [18,19] aligned deep specific features using distance metrics such as the maximum mean discrepancy (MMD). These methods inspired us to align domain-specific features for the task of cross-domain relation extraction, but they only transferred local features while ignoring non-local features. Although [20] aligned specific features by using non-local features such as word co-occurrence information, complex feature engineering was needed.
Recently, graph structures were widely used in natural language processing tasks to capture non-local features. Reference [21] applied graph convolutions to pruned dependency trees and automatically captured the dependence information. Reference [35] used a graph convolutional network to model the co-referent and identical mentions between words. Reference [22] combined linear and dependency structures to improve the extraction of overlapping relations. Reference [36] proposed an entity-relation graph to perform joint type inference on entities and relations and used the entity-relation bipartite graph in a highly efficient and interpretable way. Reference [37] proposed a graph-based method to improve word embeddings. Reference [38] used graph neural networks with generated parameters to improve multi-hop reasoning. To specify the weights of neighbors automatically without requiring any kind of costly matrix operation or depending on knowing the graph structure upfront [39], the authors in [39,40] introduced an attention mechanism for the graph structure. These studies inspire us to use a suitable graph structure to model cross-domain relation extraction problems.

Relation Extraction
Given a labeled corpus D = {(s_1, e_{11}, e_{12}, r_1), . . . , (s_n, e_{n1}, e_{n2}, r_n)}, where e_{i1} and e_{i2} (i = 1, 2, . . . , n) denote the first and second candidate entities, respectively, r_i denotes the relation type and s_i denotes a sentence, relation extraction can be regarded as a classification task that applies a classifier f trained on D to a test dataset D' = {(s'_1, e'_{11}, e'_{12}), . . . , (s'_n, e'_{n1}, e'_{n2})}. In other words, with X as the input space and Y as the set of relation labels, the goal of the learning algorithm is to build a classifier f : X → Y with a low expected loss L(D') = E_{(s', e'_1, e'_2, r') ∼ D'} P( f(s', e'_1, e'_2) ≠ r' ).

Cross-Domain Relation Extraction
Given a labeled source corpus D_s = {(s_1, e_{11}, e_{12}, r_1), . . . , (s_n, e_{n1}, e_{n2}, r_n)} and an unlabeled target corpus D_t = {(s'_1, e'_{11}, e'_{12}), . . . , (s'_n, e'_{n1}, e'_{n2})}, where the symbols have the same meanings as in Section 3.1. Note that we assume there are no labels in the target domain data. Cross-domain relation extraction can then be regarded as a classification task that uses the labeled source data D_s and the unlabeled target data D_t to train a classifier f, which is then applied to the target domain. The goal of the learning algorithm is to build a classifier f : X → Y with a low expected loss L(D_t) = E_{(s', e'_1, e'_2, r') ∼ D_t} P( f(s', e'_1, e'_2) ≠ r' ).

Our Methodology
In our work, we take D s and D t as inputs to design an algorithm that can improve the extraction of relations from the target domain. First, D s and D t are fed into an embedding layer to obtain an embedding matrix, and then the graph convolutions work on the embedding matrix to align domain-specific word representations. A feature extractor takes the adjusted embedding matrix as input to obtain the shared features. To force the feature extractor to extract these shared features, a domain discriminator is added after the feature extractor. Finally, the shared features are fed into a relation classifier to perform classification.
In brief, our model consists of four modules: an adaptation module, an embedding layer, a shared feature extractor and a relation classifier. The adaptation module contains a domain discriminator and a graph convolutional network (GCN) layer, which are responsible for local shared feature extraction and non-local feature alignment, respectively. Figure 2 shows the overall architecture of our model. We introduce these modules in detail below.

Adaptation Module
The adaptation module mainly contributes to extracting the shared features between domains and aligning domain-specific features. The traditional approach adapts only sequential features and ignores non-sequential or non-local features. Our adaptation module consists of two processes: local information adaptation and non-local information adaptation. Local information adaptation is applied at the sentence level; in other words, we extract the shared features of the source and target sentences. Non-local information adaptation, in contrast, is applied at the word level: it uses a GCN to align domain-specific word representations. The remainder of this section introduces the adaptation layer in detail.
(1) Local information adaptation (sentence-level).
To make the shared feature extractor capture domain-invariant features, a domain discriminator is added after the shared feature extractor. It takes s_s and s_t as inputs, where s_s and s_t represent the source and target features extracted by the shared feature extractor, respectively. The domain discriminator is implemented by a simple neural network with one hidden layer and performs binary classification to predict the domain that a sample comes from. The domain discriminator loss is defined as a cross-entropy loss: L_dom = −(1/(N_s + N_t)) Σ_i [ y_i log p_i + (1 − y_i) log(1 − p_i) ]. In this equation, N_s and N_t denote the total numbers of source and target domain samples, p_i is the probability that sample i belongs to the source domain and y_i ∈ {0, 1} indicates whether the sample comes from the source domain (1) or the target domain (0).
To confuse the domain discriminator, a gradient reversal layer (GRL) [30] is placed between the shared feature extractor and the domain discriminator. The GRL acts as the identity R(x) = x in the forward pass, while in back propagation it multiplies the gradient by −λ, i.e., ∂R/∂x = −λI. By reversing the gradient before the domain discriminator, the parameters of the domain discriminator are optimized to reduce the domain discriminator loss L_dom, while the parameters of the shared feature extractor act to increase L_dom. The adversarial training eventually converges so that the discriminator cannot distinguish which domain a sample comes from; in other words, the shared feature extractor captures domain-invariant features.
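The GRL behaviour described above (identity forward, reversed and scaled gradient backward) can be sketched without any deep-learning framework; the scaling factor `lam` and the class name are our own choices, not from the paper:

```python
import numpy as np

class GradientReversalLayer:
    """Identity in the forward pass; negates (and scales by lam) the incoming
    gradient in the backward pass, so upstream parameters are trained to
    *increase* the domain-discriminator loss."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reversed, scaled gradient
```

In a real model this layer would be implemented as a custom autograd function of the chosen framework; the sketch only illustrates the forward/backward contract.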
Vectorized word representations, such as word2vec [41] and GloVe [42], have greatly improved downstream applications. However, in cross-domain relation extraction, the representations of domain-specific words differ significantly between domains, which causes poor performance when applying a model to other domains. Most previous works focused only on aligning domain features at the sentence level [15,16,19] while ignoring word-level alignment. Inspired by [20], rather than using feature-based methods, we use a GCN to model the word co-occurrences of different domains. Through this alignment, the word representation gap between the source domain and target domain can be reduced, thereby enabling the downstream module to extract the shared features in a fine-grained way. Figure 3 shows the architecture of the GCN layer.
Word co-occurrence tripartite graph construction: The key idea of non-local information adaptation is that, in the tripartite graph, if two domain-specific words are connected to more common domain-independent words, they tend to be aligned together with higher probability, i.e., to have similar word representations [20]. Given the source domain sentence set D_s = {S_1, S_2, . . . , S_n} and the target domain sentence set D_t = {S'_1, S'_2, . . . , S'_n}, we construct a graph G = (V_s ∪ V_i ∪ V_t; E_si ∪ E_ti) for any two sentences S_i = {w_1, w_2, . . . , w_n} and S'_j = {w'_1, w'_2, . . . , w'_n}. Here, V_s, V_i and V_t denote the graph vertices corresponding to the domain-specific words in S_i, the domain-independent words in S_i ∪ S'_j and the domain-specific words in S'_j, respectively; E_si represents the graph edges between V_s and V_i, and E_ti represents the edges between V_t and V_i. See Figure 3 for details.
We use two types of weights for the graph edges: fixed weights and dynamic weights. First, the pointwise mutual information (PMI) of two words is used as the fixed weight. PMI measures the correlation between two variables; here, we use it to measure the co-occurrence relation of two words:

PMI(w_i, w_j) = log( p(w_i, w_j) / (p(w_i) p(w_j)) ), with p(w_i, w_j) = #win(w_i, w_j) / #win and p(w_i) = #win(w_i) / #win,

where #win(w_i, w_j) is the number of sliding windows in which w_i and w_j appear together, #win(w_i) is the number of windows containing w_i and #win is the total number of sliding windows, all counted over the whole corpus. A higher PMI value means that w_i and w_j appear together more often in the corpus and thus are more strongly correlated; a small PMI value means there is little correlation between w_i and w_j because they seldom appear together. After obtaining the weight matrix A ∈ R^{m×m} (m is the size of the dictionary), the weights are min-max normalized as A'_{ij} = (A_{ij} − a) / (b − a), where a is the minimum and b the maximum of the off-diagonal elements A_{i≠j}. The normalized values define the fixed weights f_ij. The graph keeps only the edges with f_ij > α, where α > 0 is a hyperparameter; as α increases, the graph contains fewer edges. The effects of different values of α are discussed in the experiments later in this paper.

Fixed weights capture word co-occurrence features, but they have difficulty aligning domain-specific words, for two reasons: (1) some English stop words, such as "is" and "the", are often domain-independent and frequently co-occur with domain-specific words, so the fixed weights between them are large; however, these stop words carry little semantic meaning and harm the word representations.
(2) Some domain-specific words are rare and seldom co-occur with domain-independent words, so their fixed weights are almost 0; however, these words should probably also be aligned, which fixed weights cannot achieve.
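The fixed-weight computation above can be sketched as follows; this is a minimal illustration of PMI over sliding windows (normalization and the α threshold are omitted, and the function name is our own):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_weights(sentences, window=3):
    """Fixed edge weights from PMI over sliding windows:
    PMI(wi, wj) = log(p(wi, wj) / (p(wi) p(wj))), with probabilities
    estimated from window counts over the whole corpus."""
    windows = []
    for sent in sentences:
        if len(sent) <= window:
            windows.append(sent)
        else:
            windows.extend(sent[k:k + window] for k in range(len(sent) - window + 1))
    n_win = len(windows)
    single, pair = Counter(), Counter()
    for win in windows:
        seen = sorted(set(win))
        single.update(seen)                   # one count per window per word
        pair.update(combinations(seen, 2))    # word pairs co-occurring in a window
    weights = {}
    for (wi, wj), cnt in pair.items():
        val = math.log(cnt * n_win / (single[wi] * single[wj]))
        if val > 0:                           # keep only positively correlated pairs
            weights[(wi, wj)] = val
    return weights
```

In the full model these raw PMI values would be min-max normalized and thresholded by α before becoming edge weights.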
To compensate for the limitations of fixed weights, inspired by [39], we use attention-based dynamic weights. First, to increase the expressive power of the features, a linear transformation is applied to every node in the graph: h'_i = W_l h_i + b_l, where h_i ∈ R^n is the vector representation of node i, and W_l ∈ R^{n×n} and b_l ∈ R^n are the parameters of the linear transformation. For a node i in the graph, N(i) is defined as the set of nodes directly connected to node i with f_ij > β for all j ∈ N(i). We set β = 0.3 to balance computational efficiency and model effectiveness.
Then, we calculate the attention weight over N(i) for every node i: e_ij = LeakyReLU(h'_i^T W_att h'_j), where W_att ∈ R^{n×n} is the attention parameter, j ∈ N(i), e_ij is the attention score indicating how important node j is to node i, and LeakyReLU is an activation function. The dynamic weights of node i are normalized by a softmax function: d_ij = exp(e_ij) / Σ_{k∈N(i)} exp(e_ik). Dynamic weights are adjusted during training, which provides a flexible way to train the model parameters. In addition, each fixed weight can be seen as a global weight because it is calculated from statistics of the whole corpus, whereas each dynamic weight can be seen as a local weight because it uses only two sentences per calculation. Finally, we combine the fixed weights and the dynamic weights to obtain the final graph weights W_ij. The process of calculating the edge weights is illustrated in Figure 4 (left).

Graph convolutions on the tripartite graph: In this section, we first introduce the graph convolution operation and the edge-wise gating mechanism and then elaborate on how these methods are used in our model.
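The dynamic-weight computation can be sketched as follows; the bilinear score LeakyReLU(h_i^T W_att h_j) is our reading of the text (the paper's exact score function is not fully specified), and all names are illustrative:

```python
import numpy as np

def dynamic_weights(H, W_att, neighbors):
    """Dynamic edge weights: score each neighbor j of node i with
    e_ij = LeakyReLU(h_i^T W_att h_j), then softmax over N(i)."""
    def leaky_relu(x, slope=0.2):
        return x if x > 0 else slope * x
    d = {}
    for i, n_i in neighbors.items():
        scores = np.array([leaky_relu(float(H[i] @ W_att @ H[j])) for j in n_i])
        exp = np.exp(scores - scores.max())  # numerically stable softmax
        d[i] = dict(zip(n_i, exp / exp.sum()))
    return d
```

Because the softmax is taken only over N(i), the dynamic weights of each node always sum to one, regardless of how many neighbors survive the β threshold.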
A GCN [43] is used to capture non-local and non-sequential information. Specifically, given a graph G = (V, E), where V is the node set and E is the edge set, the graph convolution operation is applied to every node and propagates information to the other nodes along the edges. In a 1-layer GCN, information is transferred only to neighboring nodes; in an n-layer GCN, information can reach more distant nodes as n increases. The information propagation from layer k to layer k + 1 can be formulated as:

h^{(k+1)}_i = ReLU( Σ_{j∈N(i)} (W_ij / d(i)) (w^{(k)}_g h^{(k)}_j + b^{(k)}_g) ),

where d(i) = Σ_{j∈N(i)} W_ij is the sum of all the weights between node i and its neighbors, w^{(k)}_g ∈ R^{m×m} and b^{(k)}_g ∈ R^m are layer-specific parameters and h^{(k)}_j ∈ R^m is the vector representation of node j. Figure 4 (right) gives a visual representation of the information propagation.
The edge-wise gate [44] is proposed to control how much information is transferred from each neighbor. The scalar gate value of each neighbor is calculated as g^{(k)}_{ij} = σ(h^{(k)}_j · w^{(k)}_e + b^{(k)}_e), where w^{(k)}_e ∈ R^m and b^{(k)}_e ∈ R are layer-specific parameters and σ is a non-linear activation function. Following [37], we integrate the edge-wise gating mechanism into the graph convolutional network, giving the final propagation function:

h^{(k+1)}_i = ReLU( Σ_{j∈N(i)} g^{(k)}_{ij} (W_ij / d(i)) (w^{(k)}_g h^{(k)}_j + b^{(k)}_g) ),

where h^{(0)}_i is the word's original representation. For an n-layer GCN, h^{(n)}_i is the word's final representation after the non-local information adaptation.
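One gated propagation step can be sketched as below; this is a reconstruction under our reading of the propagation rule (normalized weighted aggregation, sigmoid edge gate, ReLU output), with illustrative parameter names:

```python
import numpy as np

def gated_gcn_layer(H, W, Wg, bg, we, be):
    """One gated graph-convolution step: node i aggregates its neighbours'
    linearly transformed representations, scaled by the combined edge weight
    W[i, j] / d(i) and a scalar gate sigmoid(h_j . we + be), then ReLU."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    out = np.zeros_like(H)
    for i in range(len(H)):
        d_i = W[i].sum()                       # d(i): total edge weight at node i
        if d_i == 0:
            continue                           # isolated node: keep zeros
        msg = np.zeros(H.shape[1])
        for j in np.nonzero(W[i])[0]:
            gate = sigmoid(float(H[j] @ we + be))
            msg += gate * (W[i, j] / d_i) * (Wg @ H[j] + bg)
        out[i] = np.maximum(msg, 0.0)          # ReLU
    return out
```

Stacking this function n times corresponds to the n-layer GCN described above, with h^{(0)} being the input word representations.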

Embedding Layer
External knowledge, such as entity positions and dependency trees, is important for relation extraction [45][46][47]. Furthermore, when adapting a model from a source domain to a target domain, external knowledge can be seen as general knowledge that improves cross-domain task performance. Following previous works [15,28,33], we use the following five types of external knowledge: Real-valued word embedding vector. We obtain each word's embedding vector e_i from the word embedding matrix, which is pretrained as in [41]. This process yields continuous vector representations of words by training the CBOW or skip-gram model on very large datasets, and the resulting vectors encode the words' semantic information. Words that do not appear in the word embedding matrix are randomly initialized.
Words' relative distances from candidate entities. Let i and j denote the positions of the two entities in a sentence; for each word x_k with index k, its relative distances are k − i and k − j, respectively. The relative distances inform the model of the entities' positions. Every word has two relative-distance vectors, p_1 and p_2.
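The relative-distance feature is simple enough to state directly; as a sketch (function name is ours):

```python
def position_features(n, i, j):
    """Relative distances (k - i, k - j) of every word index k in a sentence
    of length n to the two candidate-entity positions i and j."""
    return [(k - i, k - j) for k in range(n)]
```

Each pair of integers is then mapped to a pair of learned position-embedding vectors before being concatenated into the word representation.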
Entity type. Entity types are predefined, and every entity belongs to one of them. In the sentence "He will blow a city off the earth in a minute if he can get the hold of the means to do it", the bold words are candidate entities, and their entity types are GPE and LOC. Entity types are essential knowledge for relation extraction; in some cases, the relation between two candidate entities can be inferred from the entity types alone. In our setting, the entity type vectors are randomly initialized, and all non-entity words share the same vector. Every word has two entity type vectors, t_1 and t_2, because there are two candidate entities per sentence.
Semantic chunks. A chunk is an indivisible fixed phrase in a sentence, and we instruct the model to regard each chunk as a whole so that the semantic information of the chunk is not disrupted. We use the B-I-O format to indicate chunks, and every word obtains a chunk vector c_i. Shortest dependency path between two entities. The shortest dependency path refers to the shortest path between two entities in the dependency tree; see Figure 5 for an example. In relation extraction, the information required to assert a relationship between two entities is mostly captured by the words on the shortest dependency path between them [45]. Therefore, the shortest dependency path can help the model to distinguish between valuable information and noise. We use a vector d_i to indicate whether a word is on the shortest dependency path between the two entities. Figure 5. Dependency tree of the sentence "As we all know, Steve Jobs was the co-founder of Apple Inc. which is a great company in America". Bold words are the two candidate entities; red lines indicate the shortest dependency path between them.
After obtaining all of the above embedding vectors, we transform every word into a real-valued vector v_i by concatenating them: v_i = [gcn(e_i); p_{i1}; p_{i2}; t_{i1}; t_{i2}; c_i; d_i], where gcn(·) is the transformation of the GCN layer described in Section 4.1 (see Figure 3 for details). The whole sentence of length n can then be represented as v = [v_1, v_2, . . . , v_n].

Shared Feature Extractor
We use the simple CNN architecture proposed in [48] as the shared feature extractor. Let v_i ∈ R^d, and let a convolution kernel w ∈ R^{rd} be applied to v, where r is the number of words the kernel spans and v is the word representation matrix after the GCN layer. A feature c_i is generated from the words v_{i:i+r−1}: c_i = ReLU(w · v_{i:i+r−1} + b), where ReLU is the activation function and b ∈ R is a bias. The kernel moves one step at a time along the word sequence, producing n − r + 1 features in total: c = [c_1, c_2, . . . , c_{n−r+1}]. We then perform a max-over-time pooling operation [49] on c, i.e., ĉ = max{c}, to obtain the most important feature. To capture various features, we use multiple kernel sizes (keeping d unchanged and varying r), and each kernel size has multiple feature maps. For a fixed kernel size i, different feature maps produce different pooled values, so the feature vector corresponding to kernel size i is ĉ_i = [ĉ_{i1}, ĉ_{i2}, . . . , ĉ_{im}], where m is the number of feature maps. Finally, we concatenate all ĉ_i to obtain the shared feature extractor's output s = [ĉ_1; ĉ_2; . . . ; ĉ_k], where k is the number of kernel sizes.
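The convolution-plus-pooling step can be sketched as follows; this is a minimal single-feature-map-per-kernel illustration (names are ours), not the full multi-map extractor:

```python
import numpy as np

def cnn_features(v, kernels):
    """Shared feature extractor sketch: each kernel (w, b) with w of shape
    (r*d,) slides over r-word windows of v (shape (n, d)); ReLU is applied,
    then max-over-time pooling; pooled features are concatenated."""
    n, d = v.shape
    pooled = []
    for w, b in kernels:
        r = w.size // d                         # words spanned by this kernel
        c = [max(0.0, float(w @ v[i:i + r].reshape(-1)) + b)
             for i in range(n - r + 1)]         # n - r + 1 features
        pooled.append(max(c))                   # max-over-time pooling
    return np.array(pooled)
```

With multiple feature maps per kernel size, each map contributes one pooled value, and all pooled values are concatenated into the shared feature vector s.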

Relation Classifier
Since labels exist only in the source domain, the relation classifier takes only the shared features s_s from the source domain as input to perform relation classification. The relation classifier is a 2-layer fully connected neural network h with tanh as the activation function, followed by a softmax layer: p_i = softmax(h(s^i_s; θ_s)), where θ_s denotes the parameters of the hidden layer, p_i ∈ R^r is the relation distribution of the i-th source domain sample and r is the number of relation types. The relation classification loss L_rel is defined as:

L_rel = −(1/N_s) Σ_{i=1}^{N_s} Σ_{j=1}^{r} y_ij log p_ij,

where N_s is the total number of source domain samples, y_ij ∈ {0, 1} indicates whether example i has relation j and p_ij, obtained through the softmax layer, is the probability that example i contains relation j.
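The classification loss above reduces, for one-hot labels, to the mean negative log-likelihood of the gold relation; a minimal sketch (function name is ours):

```python
import numpy as np

def relation_loss(logits, labels):
    """Softmax over relation logits, then the averaged negative
    log-likelihood of the gold labels (cross-entropy with one-hot y)."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-np.log(p[np.arange(len(labels)), labels]).mean())
```

A confident correct prediction yields a loss near zero, while a confident wrong one yields a large loss, which is what drives the classifier's training signal.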
During training, we combine all of the losses above into the final loss function and optimize them jointly: L = L_rel + γ L_dom, where γ is a hyperparameter; we set γ = 0.1 via the validation dataset. In the test stage, since source domain data are unavailable, a heuristic algorithm is designed to select the domain-specific words of the source domain. First, for every shared word w_i appearing in a target domain sentence, we find the source domain-specific word w_j that has the highest PMI value with w_i. All such w_j form the top-PMI set w = {w_1, w_2, . . . , w_n}, where n is the number of shared words appearing in the target domain sentence. We then sort the elements of w in descending order of PMI. Finally, the top m words are selected to form the source domain sentence; we set m = 10 based on performance on the validation dataset.
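The test-time heuristic can be sketched as follows; the function and argument names are ours, and `pmi` is assumed to map (shared word, source-specific word) pairs to PMI values:

```python
def pseudo_source_sentence(target_sentence, shared_words, source_specific, pmi, m=10):
    """For each shared word in the target sentence, pick the source-specific
    word with the highest PMI; sort the picks by PMI descending and keep the
    top m to form a pseudo source sentence."""
    picks = []
    for w in target_sentence:
        if w not in shared_words:
            continue
        scored = [(pmi.get((w, s), float("-inf")), s) for s in source_specific]
        best_val, best_word = max(scored)
        if best_val > float("-inf"):         # skip words with no PMI entry
            picks.append((best_val, best_word))
    picks.sort(reverse=True)                 # descending PMI
    return [word for _, word in picks[:m]]
```

The returned words stand in for the missing source sentence so that the tripartite graph can still be constructed at test time.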
From Table 1, we can see that negative examples account for a large proportion of the data, so correctly distinguishing between positive and negative examples is an important indicator for measuring the effectiveness of the model. Figure 6 explicitly displays the differences between the domains' data distributions. These large differences also pose a challenge for our model. Following previous works [15,16,33], we use bn+nw as the source domain, tune the hyperparameters on half of the bc domain, and use the remaining half of bc, as well as all of cts and wl, as the target domains to evaluate the performance of our model.
(2) Evaluation. Precision, recall and macro-F1 are used as the evaluation metrics in our experiments. Specifically, we calculate the precision (P_i) and recall (R_i) for every relation type and obtain the overall precision (P) and recall (R) by averaging over the r relation types, where n_i is the number of samples belonging to the i-th relation. The macro-F1 is then calculated as F1 = 2PR / (P + R).
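The evaluation metrics can be sketched as below, assuming unweighted (macro) averaging over relation types; the function name and count-based inputs are our own:

```python
def macro_scores(tp, fp, fn):
    """Per-class precision P_i and recall R_i averaged over the r relation
    types, then F1 = 2PR / (P + R). Inputs are per-class tp/fp/fn counts."""
    r = len(tp)
    P = sum((tp[i] / (tp[i] + fp[i])) if tp[i] + fp[i] else 0.0
            for i in range(r)) / r
    R = sum((tp[i] / (tp[i] + fn[i])) if tp[i] + fn[i] else 0.0
            for i in range(r)) / r
    f1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, f1
```

A weighted variant would multiply each P_i and R_i by n_i / Σ n_i instead of averaging uniformly.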

Parameter Setting
We use pretrained, 300-dimensional word embeddings generated by word2vec [41]. The dependency entity positions (eps) are obtained from ace-data-prep (https://github.com/mgormley/ace-data-prep). Apart from these pretrained word embeddings, all embeddings are randomly initialized and optimized during training, because we found no improvement when pretraining all embeddings. All the sentences are padded or cut to 155 characters. The learning rate is set to 0.001 and halved every 4 epochs. We use Adam [50] as the optimizer and apply gradient clipping during optimization. To avoid overfitting, the dropout technique [51] is applied in the embedding layer and the GCN layer. The details of the parameter settings are shown in Table 2.

Baseline Models
We use the following baseline models for evaluation purposes: NNM & log-linear model: Basic neural network models (NNM), such as CNNs and RNNs, were used in [29] to improve relation extraction. In addition, the authors achieved state-of-the-art performance by stacking these models. We use their single-model bidirectional RNN (BRNN), CNN, log-linear model, and combined model (called the hybrid-voting system (HVS)) as our baseline models.
FCM & hybrid FCM: The feature-rich compositional embedding model (FCM) was proposed in [28]. The key idea is to combine (unlexicalized) hand-crafted features with learned word embeddings. The hybrid FCM (HFCM) combines the basic FCM and existing log-linear models.
LRFCM: The low-rank approximation of the FCM (LRFCM) [13] is an improvement of the FCM. It replaces manual features with feature embeddings so that it can easily scale to a large number of features.
DANN: The domain adversarial neural network (DANN) [15] was the first to introduce adversarial training into cross-domain relation extraction. It simply projects source domain features and target domain features into one unified space and uses adversarial training to extract domain-independent features. We build our graph adaptation network on this model.
GSN: The genre separation network (GSN) [16] uses a domain separation network [17] to extract domain-independent features and domain-specific features separately, thereby avoiding the introduction of domain-specific features into the shared feature space.
CVAN: The cross-view adaptation network (CVAN) [33] uses cross-view training to extract shared features from different views and constructs various input views that have proven useful for cross-domain relation extraction.
AGGCN: The attention-guided graph convolutional network (AGGCN) [52] is a novel model that directly takes full dependency trees as inputs. It can be understood as a soft-pruning approach that automatically learns how to selectively attend to the relevant substructures useful for the relation extraction task.
MAPDA: A novel model based on a multi-adversarial module for partial domain adaptation (MAPDA) was proposed in [10]. It designs a weight mechanism to mitigate the impact of noisy samples and outlier categories and embeds several adversarial networks to realize various category alignments between domains.

Results Analysis
(1) Performance comparison with existing methods. Table 3 provides a comparison between existing models and our method. Note that models marked with * are reimplemented versions because their precision and recall values have not been reported. A model marked with + contains multiple models, and we only report the best results among them. The results of our models are obtained under the parameter settings that achieve the best results on the development dataset (GCN layers = 3, α = 0.4). From the table, we can see that a model without adversarial training that uses only fixed weights (i.e., CNN+GCN) outperforms the CNN by 2% in terms of macro-F1 and obtains results comparable to those of the DANN. This suggests that local feature adaptation and non-local feature adaptation are equally important for cross-domain relation extraction.
After adding adversarial training, DANN+CNN achieves results comparable to those of the state-of-the-art model (CVAN) in terms of macro-F1. DANN+GCN2 denotes the model that uses only dynamic attention weights; it also improves on the baseline DANN but only achieves results similar to those of DANN+GCN. We assume that dynamic weights cannot capture all word co-occurrence information due to a lack of global statistical knowledge. The ensemble model (HVS) performs best among the methods from previous works, especially in the bc domain. To illustrate the performance of our model more convincingly, we also compare it with this ensemble model. Our combined model (DANN+GCN+DA) outperforms all the existing models, including HVS, in terms of macro-F1, and achieves nearly all of the best precision and recall values across the three domains.
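The macro-F1 metric used throughout this comparison averages per-class F1 scores, so rare relation classes count as much as frequent ones. A minimal self-contained sketch on hypothetical labels:

```python
# Macro-F1: compute precision, recall, and F1 per relation class,
# then average the F1 scores across classes. Labels are hypothetical.

def macro_f1(y_true, y_pred, classes):
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["PHYS", "PER-SOC", "PHYS", "ORG-AFF"]
y_pred = ["PHYS", "PER-SOC", "ORG-AFF", "ORG-AFF"]
print(macro_f1(y_true, y_pred, ["PHYS", "PER-SOC", "ORG-AFF"]))  # ≈ 0.778
```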
(2) Data distribution of source and target domains.
In this paper, the source-domain dataset bn+nw and the target-domain dataset cts are taken as examples. To see the change in the data distribution more intuitively, we used t-distributed stochastic neighbor embedding (t-SNE), a dimensionality-reduction algorithm for high-dimensional data, to map the data distribution into two-dimensional space. The data distributions before and after applying our method are shown in Figure 7. As can be seen from Figure 7a,b, after applying the graph adaptation network, the overlap in the cts data is effectively reduced and the data distribution is better than before. The t-SNE visualization thus also indirectly demonstrates the effectiveness of our model.
(3) Effect of the number of GCN layers.
From Equation (14), we know that the number of GCN layers plays an important role in the process of information propagation. In a 1-layer GCN, information only flows between neighbors, so for our tripartite graph, the number of GCN layers must be at least 2 for information to flow from the source domain to the target domain. To verify this intuition and further illustrate the interpretability of our model, we fix the threshold (α) described in Equation (3) at 0.4 and draw Figure 8 to display the relation between the number of GCN layers and the performance (macro-F1) of our model. We report the macro-F1 values for the three domains under different numbers of GCN layers L within [1,2,3,4,5,6]. Note that we only draw the line for L ≤ 6 because no improvement is obtained when L > 6. From Figure 8, we can see that the 1-layer GCN performs relatively poorly on all three domains, but when the number of layers is 3 or 4, the best performance is achieved for all three domains; this illustrates that a 1-layer GCN cannot transfer information from the source domain to the target domain, whereas a GCN with at least 2 layers can capture the word co-occurrences between the source and target domains.
When L > 4, the macro-F1 declines to different degrees on all three domains; we assume that the word representations become distorted by excessive information propagation.
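The claim that at least 2 layers are needed can be checked on a toy tripartite fragment: a source-specific node and a target-specific node connected only through a shared node. The sketch below uses symmetric normalization and shows propagation only; the real GCN layer also applies weight matrices and nonlinearities, and the values here are illustrative.

```python
import numpy as np

# Nodes: 0 = source-specific word, 1 = shared word, 2 = target-specific word.
# Edges: source<->shared and shared<->target, plus a self-loop on each node;
# there is NO direct source<->target edge, as in the tripartite graph.
A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt      # symmetrically normalized adjacency

H = np.array([[1.], [0.], [0.]])          # signal starts at the source node

H1 = A_hat @ H                            # 1 layer: target entry is still 0
H2 = A_hat @ H1                           # 2 layers: signal reaches the target
print(H1[2, 0], H2[2, 0])
```

After one layer the target node has received nothing from the source node; after two layers the signal has crossed via the shared node, matching the behavior observed in Figure 8.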
(4) Effect of the threshold α.
The threshold (α) controls the number of edges whose fixed weights are >0. Because the fixed weights of edges are normalized by Equation (6), α varies within the interval [0,1]. From Table 4, we can see that the smaller α is, the more edges with fixed weights >0 there are in the graph, and vice versa. In particular, when α = 1, only self-loop edges have fixed weights >0. Figure 9 shows the macro-F1 values for different values of α under the setting "GCN layers = 3". We can clearly see that when α decreases, the macro-F1 suffers a significant drop. The reason for this phenomenon is that the number of edges with fixed weights >0 increases sharply and therefore inevitably introduces more irrelevant information. The model obtains the best results when α = 0.4 or 0.5, for which the corresponding numbers of edges are optimal. When α > 0.5, the performance on all three domains decreases dramatically. We argue that this is because almost all edges' fixed weights become 0, and the dynamic weights cannot capture complete word co-occurrence information due to a lack of global statistical knowledge. When α is in the interval [0.6, 1], the macro-F1 value remains stable because the graph has only a few edges with fixed weights >0. The model's performance under different values of α illustrates that simply increasing the number of edges indefinitely is not ideal; using a proper number of edges not only yields better performance but also speeds up computation.
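The thresholding step can be sketched as follows. The co-occurrence statistic and the max-normalization below are illustrative stand-ins, not the paper's Equation (6); they only show how α trades off edge count against noise.

```python
import numpy as np

def prune_edges(cooc, alpha):
    """Normalize a co-occurrence matrix into [0, 1] and zero out edges
    whose normalized fixed weight falls below the threshold alpha."""
    w = cooc / cooc.max()
    return np.where(w >= alpha, w, 0.0)

# Toy co-occurrence counts between three words (hypothetical values)
cooc = np.array([[8., 2., 0.],
                 [2., 6., 4.],
                 [0., 4., 8.]])
for alpha in (0.2, 0.5, 0.8):
    kept = np.count_nonzero(prune_edges(cooc, alpha))
    print(alpha, kept)   # smaller alpha keeps more edges
```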
(5) Effects of the GCN and dynamic weights.
To illustrate the effectiveness of the GCN and of the dynamic attention weights, we draw the precision-recall curves for every relation type in Figure 10. The precision and recall values are averaged across all domains. CNN+adv is the reimplemented version of the DANN [11], +cv refers to adding cross-view training, and DA refers to the dynamic attention weights we propose. From Figure 10, we can see the following: (1) Using only the CNN for cross-domain relation extraction is far from sufficient; it achieves almost the worst performance on all 6 relations, indicating that domain adaptation is worthwhile on this dataset. (2) CNN+adv and CNN+adv+cv perform better than the CNN alone, but CNN+adv+cv is significantly worse than the CNN on the relation GAN-AFF, and CNN+adv is also worse than the CNN on the relation ORG-AFF. We find that the GAN-AFF and ORG-AFF relations are similar and have the most subtypes, so these two relations pose a strong challenge for our model. This illustrates the effect of the GCN layer. (3) When adding DA to the GCN layer, the improvement is substantial, especially for the PER-SOC relation; specifically, there is a 10% improvement in precision when recall >0.7 in Figure 10e. This verifies that DA can compensate for the weakness of fixed weights and consider the connections among many words. For a quantitative analysis of the effect of each model, we average the AUCs over the different relations for every method, as shown in Table 5. It can be clearly seen that GCN+DA improves the baseline by a large margin (by 4% compared with CNN+adv and by 7% compared with the CNN).
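The per-relation AUC values averaged in Table 5 can be obtained by sweeping a decision threshold over the prediction scores, tracing the precision-recall curve, and integrating it with the trapezoid rule. A self-contained numpy sketch on hypothetical scores and labels:

```python
import numpy as np

def pr_auc(scores, labels):
    """Area under the precision-recall curve via a threshold sweep."""
    order = np.argsort(-scores)            # descending by score
    labels = labels[order]
    tp = np.cumsum(labels)                 # true positives at each cutoff
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    p = np.r_[1.0, precision]              # start curve at (recall=0, precision=1)
    r = np.r_[0.0, recall]
    # trapezoid-rule integration of precision over recall
    return float(0.5 * np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1])))

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])  # toy relation scores
labels = np.array([1,   1,   0,   1,   0,   0])    # 1 = relation present
print(round(pr_auc(scores, labels), 3))  # → 0.903
```

Averaging this quantity over the 6 relation types yields one number per method, as reported in Table 5.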

Case Study
An example of graph weight visualization is shown in Figure 11. The x-axis denotes the words shared across domain sentences, the red part of the y-axis denotes source-specific words, and the blue part denotes target-specific words. Each pixel corresponds to the weight W_ij between the i-th shared word and the j-th specific word described in Equation (11); the deeper the color, the greater W_ij. All the weights corresponding to the i-th shared word are normalized. The colors between the word new and some target-specific words such as Mexico and Kansas are deepest because their fixed weights are large according to Equation (7). In addition, some specific words with strong domain relevance, such as Funeral and School, also have deep colors, while the colors of words with weak domain relevance, such as functions and apointed, are relatively shallow. These facts illustrate that the dynamic weight mechanism pays more attention to aligning specific words with stronger domain relevance, which the sole use of fixed weights cannot achieve.
We also present some typical examples in Table 6 that demonstrate the effectiveness of our model. In some cases, such as the first two samples, traditional models and our model can all correctly predict the labels. However, when the domain-specific words have strong domain relevance or account for a high proportion of the sentence, traditional models mistake all the labels for "None" (as in the last three examples) because negative samples account for a large proportion of the data (Table 1). Our model correctly distinguishes between negative and positive samples and predicts the true labels in all of the above cases. We argue that traditional models only capture shared features based on local information, i.e., words in sequential order; therefore, when the proportion of domain-specific words increases, the number of shared features decreases.
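The per-shared-word normalization behind the Figure 11 heatmap can be sketched as a row-wise softmax over each shared word's dynamic scores, so every row sums to 1 and rows are comparable across the x-axis. This is one plausible normalization; the paper's Equation (11) may normalize differently, and the scores below are hypothetical.

```python
import numpy as np

def normalize_rows(scores):
    """Numerically stable row-wise softmax: each row sums to 1."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Rows: shared words; columns: domain-specific words (toy scores).
# Row 0 might be the shared word "new" scored against Mexico/Kansas/etc.
scores = np.array([[2.0, 0.5, 0.1],
                   [0.3, 1.2, 0.9]])
W = normalize_rows(scores)   # each row of W is one heatmap row
```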
In addition, the shared feature extractor may be unable to capture shared features because these specific words have strong domain relevance; in other words, the domain discriminator can still determine which domain a given sample comes from. Through the GCN layer we propose, these target-specific words are aligned with source-specific words, so the interference of domain-specific features is reduced.
Table 6. Predictions of traditional models and our model on some typical samples. We use DANN and CVAN as the traditional models. Domain-specific words are in bold.

Predicted Label
Traditional   Ours        Example
ART √         ART √       In which, the state of Israel buys many expensive military weapons, which is used to oppress and to kill a lot of Palestinians.
ORG-AFF √     ORG-AFF √   It could be done so easily because most small towns are protected by a small town police force.
None ×        PHYS √      well, my sister usually comes in from Ohio because it's not that far like-two hundred fifty miles or something, and she'll come in for Thanksgiving.
None ×        GAN-AFF √   That's in Fairmont, West Virginia, it's like-oh, between Charleston and Pittsburgh.
None ×        PER-SOC √   My brother-in-law still lives in the city but we were Long Island people.

Conclusions
In this article, a novel graph adaptation network for cross-domain relation extraction is proposed. First, a source-shared-target word tripartite graph is constructed, and the model aligns domain-specific features by applying graph convolutions on this graph. Second, unlike traditional methods that adapt only local or sequential features, our model adapts local and non-local features jointly to improve the performance of cross-domain relation extraction. Finally, a cross-domain relation extraction model is built by feeding all the fused features into a softmax classifier.
In addition, unlike a traditional graph convolutional network, our model combines a dynamic attention weight with the fixed weight to compensate for the limitations of fixed weights. To transmit useful information more effectively, we retain only valuable edges based on their edge weights. Experiments on the three domains of the ACE2005 dataset verify the effectiveness of the GCN layer and the dynamic attention weights, which achieve state-of-the-art performance on all three domains.
In future work, we will explore cross-domain relation extraction in settings where a few labeled examples are available or new relations appear in the target domain. In addition, we hope our work can be applied to other domain adaptation tasks.