We use precision, recall, F-score, and error rate as the criteria to measure the performance of our domain-independent ontology learning model. To measure the effect of transfer learning, we use the transfer effect (TE). If TE < 0, the contribution of the transferred knowledge is negative; the higher TE is, the more the transferred knowledge contributes. Because we use a transfer learning framework, only a small set of labeled data is assumed to be available in the target domain. Therefore, for a specific target domain, we use 20% of the labeled data as the training set and the remaining 80% as the testing set.
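As a minimal sketch of this evaluation setup: the exact formula for TE is not restated in this section, so we assume here that TE is the error rate of the non-transfer model minus the error rate of the transfer model (consistent with Figures 6 and 7 below, where negative transfer means the transfer error exceeds the non-transfer error). The 20%/80% split is from the text; the helper names are illustrative.

```python
from sklearn.model_selection import train_test_split

def transfer_effect(err_non_transfer: float, err_transfer: float) -> float:
    """Assumed definition: TE > 0 when transfer reduces the error rate."""
    return err_non_transfer - err_transfer

def split_target_domain(X, y, seed=42):
    # Only 20% of the labeled target-domain data is used for training.
    return train_test_split(X, y, train_size=0.2, random_state=seed)
```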
6.2.1. Effects of the Correlation Coefficient
According to Section 4.3, when calculating correlation coefficients between domains, TF-Mnt relies on an implicit factor: the proportion of labels. That is, differences in the label proportions of the training sets reflect differences in the distribution of knowledge that can be provided to the model. For the TF-Mnt model, the weights of the feature functions reflect the knowledge learned by the model.
We first conducted experiments on the distribution of label proportions and the weights of feature functions. The experimental results on four domains are shown in Figure 4, in which f() is a feature function of word length, y is the category, C stands for concepts, and I denotes instances. Figure 4 therefore contains eight visualizations of the distributions. The horizontal coordinate is the weight of the feature function (WOF) and the vertical coordinate is the proportion of labels in the training set (POT) corresponding to that feature function. The results show that the label proportions and the feature function weights follow a distribution that is neither scattered nor unbounded. There are a few outliers, but most correspond to small-probability events. Further analysis shows that the weights classified as concept are basically distributed on the positive semi-axis, which means that texts with concept labels share a similar word-length feature. In addition, we can observe that in the company domain the feature function weights are evenly distributed over the positive and negative halves, while in the other domains they lie basically in the positive half. This is because some company names are long while concept names in the other domains are short, so some of the feature weights in the company domain fall on the negative semi-axis.
We then analyzed the relationship between weight distributions and correlation coefficients through experiments. As Table 2 and Table 3 show, the correlation coefficients between domains with similar distributions of feature function weights are relatively large and positive, i.e., the domains are positively correlated; the correlation coefficients between domains whose weight distributions differ are relatively small and negative, i.e., the domains are negatively correlated. It is therefore feasible to use correlation coefficients to measure the similarity of feature function weights between domains. This provides a basis for using the correlation coefficient to filter the feature function weights of the source domain and to select the weights most similar to those of the target domain for knowledge transfer.
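For illustration, the between-domain similarity described above can be computed as a Pearson correlation over the two domains' feature-function weight vectors. The alignment of weights by shared feature functions is an assumption of this sketch; the paper's exact pairing procedure is defined in Section 4.3.

```python
import numpy as np

def domain_correlation(w_source: np.ndarray, w_target: np.ndarray) -> float:
    """Pearson correlation between two aligned feature-function weight vectors.

    Assumes w_source[i] and w_target[i] refer to the same feature function
    in the source and target domains, respectively.
    """
    return float(np.corrcoef(w_source, w_target)[0, 1])
```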
Specifically, first, from Table 2 we can see that the correlation coefficient between domains with large differences in distribution (profiles, journals, conferences, and companies) is negative, while between domains with small differences in distribution it is positive. The correlation coefficient between the journal domain and the conference domain is positive and relatively large, while that with the company domain is positive but smaller. The correlation coefficient between the conference domain and the company domain becomes negative. Second, from Table 3, the correlation coefficients between the profile, conference, and company domains are negative. It is not difficult to see that the distributions of the journal domain and the profile domain are relatively close, and that the distributions of the company domain and the conference domain are closer. This again verifies that the correlation coefficient between two domains is positive if their distributions of feature function weights and training set label proportions are relatively close, and negative if the distributions differ.
We then conducted an experiment to verify how the correlation coefficient affects the transfer effect. Figure 5 shows the results of three models: one using the original correlation coefficients (containing both positive and negative values), one using only positive correlation coefficients, and one using only negative correlation coefficients. The x-axis shows the domain selection in the form A-B, which means the source domain is A and the target domain is B. The y-axis shows the transfer effect. From Figure 5, in most situations the model using only positive correlation coefficients shows positive transfer. Once negative correlation coefficients are added into the model (using the original correlation coefficients), the transfer effect decreases. The model using only negative correlation coefficients has the worst result, showing negative transfer in every case. Therefore, we can conclude that positive correlation coefficients contribute positively to classification, and negative ones contribute negatively. This phenomenon can be explained as follows: the correlation coefficient reflects the linear dependence between two distributions, and if it is negative, the two distributions are negatively related, which means that domain-independent knowledge of a specific dimension is important in one domain but less important in the other. A negative correlation coefficient is therefore a factor leading to negative transfer. In the following experiments, unless stated otherwise, we set all negative correlation coefficients to 0.
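A minimal sketch of this filtering step, assuming the correlation coefficients are held in a NumPy array:

```python
import numpy as np

def filter_negative(rho: np.ndarray) -> np.ndarray:
    """Keep only positive correlation coefficients and zero out the rest,
    so negatively correlated dimensions contribute no transfer knowledge."""
    return np.maximum(rho, 0.0)
```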
However, there are three exceptions: CF-P, J-P, and J-CM. Even when using only positive correlation coefficients, their transfer effects are all negative, i.e., these are all cases of negative transfer. J and CF are negative sources for P, and J is a negative source for CM. Besides negative correlation coefficients, we must therefore look for other factors that lead to negative transfer.
6.2.2. Effects of the Training Set Proportion
The purpose of transfer learning is to use a large number of labeled samples from the source domain, combined with a small number of labeled samples from the target domain, to enhance classification in the target domain. In traditional machine learning, the training and test sets are generally divided 2:1, i.e., about 67% training and 33% testing. However, since labels are very valuable, in transfer learning the training set of the target domain is very small and the domain knowledge it provides is very limited; this is when the importance of transferred knowledge becomes apparent. Conversely, if the training set of the target domain is large, it can provide a large amount of domain knowledge, and the transferred knowledge is weakened to some extent. This experiment investigates the effect of the proportion of the target-domain training set on the transfer effect.
Figure 6 illustrates the relationship between training set partitioning and error rate under positive transfer, and Figure 7 illustrates the same relationship under negative transfer. The horizontal axis represents the proportion of the training set partition in the target domain (POT in target); the vertical axis has different meanings for different curves: for the non-transfer and transfer curves it represents the error rate, while for the transfer effect curve it represents the magnitude of the transfer effect. The green curve is the error rate of the non-transfer learning model, the blue curve is the error rate of the transfer learning model, and the red curve is the transfer effect. When the source domain is selected appropriately and positive transfer occurs, transfer learning performs better as long as the training set proportion is smaller than the transfer threshold; when the proportion is larger than the threshold, the difference between the transfer and non-transfer models is small. If negative transfer occurs, non-transfer learning is always better than transfer learning.
Figure 6 shows the relationship between the training set proportion and the transfer effect in the positive transfer case. From Figure 6a, it can be seen that when the training set proportion is less than 0.12, the transfer learning model achieves better results; when it is greater than 0.12, the error rates of both the transfer and non-transfer models change little as the proportion increases, and sometimes positive transfer occurs while sometimes negative transfer occurs. This is because the model has reached a limit in learning domain knowledge, and once this limit is reached, no new knowledge can be learned by adding training samples. We call this point the transfer threshold of the model, i.e., the point where the transfer effect curve first intersects the horizontal axis.
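The transfer threshold can be located numerically as the first zero crossing of the transfer effect curve. Below is a minimal sketch, assuming the curve has been sampled at increasing training set proportions:

```python
import numpy as np

def transfer_threshold(proportions: np.ndarray, te: np.ndarray) -> float:
    """First point where the transfer effect (TE) curve crosses zero.

    `proportions` holds increasing training-set proportions and `te` the
    corresponding transfer effect values. Returns a linearly interpolated
    crossing point, or NaN if TE never changes sign.
    """
    for i in range(len(te) - 1):
        if te[i] > 0 >= te[i + 1]:
            # Linear interpolation between the two samples around the crossing.
            frac = te[i] / (te[i] - te[i + 1])
            return float(proportions[i] + frac * (proportions[i + 1] - proportions[i]))
    return float("nan")
```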
Figure 7 shows the relationship between the training set proportion and the transfer effect in the negative transfer case. It can be seen from Figure 7 that when negative transfer occurs, the error rate of transfer learning is greater than that of non-transfer learning throughout, and the transfer effect of the model is always less than 0.
6.2.3. Effects of the Transfer Weight
In TF-Mnt, transfer knowledge and domain knowledge are weighted by a and b, respectively. This experiment analyzes how these weights affect transfer performance. Here we fix b = 1: because a and b are linear weights, we can rescale them so that b = 1, except in the case b = 0. The case b = 0 means the model is constructed using only transfer knowledge, which is equivalent to b = 1 with a → ∞. The case a = 0 means the model is constructed using only domain knowledge.
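For illustration, here is a minimal sketch of the weighted combination, under the assumption (not spelled out in this section) that TF-Mnt combines the two knowledge sources as a linear combination of feature-function weight vectors:

```python
import numpy as np

def combine_knowledge(w_transfer: np.ndarray, w_domain: np.ndarray,
                      a: float, b: float = 1.0) -> np.ndarray:
    """Assumed linear combination: a weights the transferred knowledge and
    b weights the target-domain knowledge. With b fixed at 1, a -> infinity
    recovers the transfer-only model and a = 0 the domain-only model."""
    return a * w_transfer + b * w_domain
```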
Figure 8 shows the result of this experiment. The x-axis shows the value of a and the y-axis shows the error rate of the corresponding model. It is clear from the figure that the curves exhibit different tendencies in positive transfer (Figure 8a–d) and negative transfer (Figure 8e,f).
Figure 8a–d show four positive transfer situations. In positive transfer, the error rate first decreases (below the error rate at a = 0, hence positive transfer) and then increases as a becomes larger, finally approaching an asymptote that represents the error rate of using only transfer knowledge. In most situations, using only domain knowledge outperforms using only transfer knowledge, because domain knowledge reflects the target domain directly even though it is learned from few labels. However, Figure 8b is an exception, in which the model with only transfer knowledge performs better than the model with only domain knowledge. From Figure 8a–d, we can also find that when the model reaches its best performance, a lies in (0, +∞), i.e., at a finite positive value, which means that neither domain knowledge alone nor transfer knowledge alone is better than combining them with suitable weights. Therefore, both transfer knowledge and domain knowledge are fully exploited by the TF-Mnt model.
Figure 8e,f show two negative transfer situations. In negative transfer, the error rate increases monotonically before approaching an asymptote, which again represents the error rate of the model with only transfer knowledge. Because these are negative transfers, the error rate is always larger than that of using only domain knowledge (a = 0). In both positive and negative transfer, the curves eventually approach an asymptote.
6.2.4. Model Comparison
Figure 9 shows the result of the model comparison. We choose four well-known baseline models: decision tree (C4.5), naïve Bayes, support vector machine (SVM), and Mnt. In this experiment, for a specific target domain, the TF-Mnt model selects the source domain with the largest correlation coefficient. The five models were compared on precision, recall, F-score, and error rate over the four domains; the horizontal coordinates represent the four domains. From Figure 9, it can be concluded that the proposed TF-Mnt model has the highest F-score and the lowest error rate in all four domains.
According to the precision in Figure 9, in the profile domain the highest precision is obtained by the SVM model, while in the remaining three domains the highest precision is obtained by the proposed transfer learning model. In all domains, the lowest precision is obtained by the decision tree model.
According to the recall in Figure 9, in the journal domain the highest recall is obtained by the naïve Bayes model, the second highest by the Mnt model, and the transfer learning model and the SVM model are tied for third place. In the remaining three domains, the transfer learning model has the highest recall.
According to the F-score in Figure 9, our TF-Mnt model obtains the highest F-score in all four domains, while the decision tree obtains the lowest. In domains P and C, the SVM model also achieves good results, almost matching TF-Mnt, and the naïve Bayes model performs well in CM and J. The original Mnt model also performs well; although it is not the best, it is very stable.
According to the error rate in Figure 9, our model again has the best performance, obtaining the lowest error rate in all situations, while the decision tree performs the worst. Naïve Bayes, SVM, and Mnt are very close, with some fluctuations.
In conclusion, according to the F-score and error rate results, the TF-Mnt model does learn knowledge from auxiliary data to improve learning and performs well in all four domains, although it costs some extra time to learn the auxiliary knowledge. The decision tree is not suitable for this classification task, as it obtains the lowest F-score and the highest error rate in all situations. Naïve Bayes, SVM, and Mnt can also achieve good results, but TF-Mnt outperforms them.
6.2.5. Ontology Learning
This experiment focuses on the learning of relations. Unlike term recognition, relations are transitive: suppose that in an ontology concept A is a subclass of B and B is a subclass of C; it is then insufficient if our method only finds that A is a subclass of C. Due to this property, we cannot use a binary value to judge the effect of relation learning. To validate our result more precisely, we introduce two distinct values α and β, where α is the value for the subclass-of relation and β for the is-a relation. More specifically, if an ontology contains a relation path C1 → C2 → … → Cn → Ik (here C1 → C2 means C2 is a subclass of C1, and Cn → Ik means Ik is an instance of Cn), the score between (Ci, Cj), where 1 ≤ i < j ≤ n, is α^(j−i), and the score between (Ci, Ik) is α^(n−i)·β. An ontology can be represented by the triple (C, C/I, subclass-of/is-a), while in ontology validation we must construct the tuple (C, C/I, subclass-of/is-a, S), with an extra element S that reflects the score of the relation. We set α = β = 0.8 in this experiment.
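A minimal sketch of this scoring scheme follows, under the reading reconstructed above: each subclass-of hop multiplies the score by α and the final is-a hop by β (the exponent form is our reconstruction of the garbled formula, not verbatim from the paper).

```python
ALPHA = 0.8  # score of one direct subclass-of hop
BETA = 0.8   # score of one direct is-a hop

def concept_pair_score(i: int, j: int) -> float:
    """Score between concepts (Ci, Cj) on the path, 1 <= i < j <= n:
    j - i subclass-of hops, each worth ALPHA."""
    assert 1 <= i < j
    return ALPHA ** (j - i)

def concept_instance_score(i: int, n: int) -> float:
    """Score between (Ci, Ik): n - i subclass-of hops plus one is-a hop."""
    assert 1 <= i <= n
    return (ALPHA ** (n - i)) * BETA
```

For example, with α = β = 0.8, a direct subclass-of relation scores 0.8, while a two-hop (grandparent) relation scores 0.64.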
Table 5 shows the precision, recall, and F-score of ontology learning, containing the results of is-a relation learning (the second column), subclass-of relation learning (the third column), and overall ontology learning, which is the average of term recognition and relation learning (the fourth column). From Table 5, the results of is-a relation learning are relatively good, because most instances appear in lists in HTML and are easy to capture; our method reaches a 0.902 F-score on average. However, subclass-of relation learning performs less well, reaching only a 0.722 F-score on average: unlike is-a relations, most subclass-of relations carry less structured information. Combining term recognition and relation learning, our domain-independent ontology learning method can learn an ontology for a Web page in a new domain with only a small set of labeled data (5% of the dataset in that domain), achieving a 0.859 F-score on average.
Figure 10 illustrates a real-world example of an ontology about academic conferences, learned automatically by the TF-Mnt model. The source domain is computer science researchers, and the target domain is computer science conferences. In Figure 10, the left part is the concept hierarchy, and the right part shows the relations between concepts, including hierarchy and other relations. We can see that our model extracts ontology knowledge efficiently and correctly.