To further improve the accuracy of the model and avoid the performance degradation caused by randomly combining SDIs, FILPA introduces the AdaBoost meta-learning strategy to identify the FISDI composed of the best SDIs selected by NB. Because new user link prediction must cope with an extremely sparse local network structure, a single model cannot identify the FISDI with high accuracy. Through the AdaBoost meta-learning strategy, weak recognition models with low accuracy are boosted into strong recognition models with high accuracy. In this strategy, Discriminant Analysis (DA) serves as the base classifier and linear regression serves as the meta classifier, and multiple base classifiers are combined to improve the classification accuracy. On the basis of the learning results of the base classifiers, the meta classifier relearns how to obtain the final results, so that low-level learning is fully exploited in the high-level induction process.
Figure 3 shows an example of the AdaBoost meta-learning strategy.
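For intuition, the minimal sketch below illustrates the two-level structure just described: DA base classifiers at the bottom, with a linear-regression meta classifier relearning the final label from their outputs. All data, sizes, and names are hypothetical placeholders, and realizing "linear regression as meta classifier" via one-hot targets with an argmax is our assumption, not the paper's specification.

```python
# Minimal sketch of the described stack: DA base classifiers whose class
# scores are relearned by a linear-regression meta classifier.
# Hypothetical data; not the paper's pipeline.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))    # similarity features (hypothetical)
y = rng.integers(0, 5, size=200)  # the five labels of Section 4.3.2

# Level 0: DA base classifiers trained on bootstrap resamples.
meta_features = []
for t in range(5):
    idx = rng.choice(len(X), size=len(X), replace=True)
    da = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
    meta_features.append(da.predict_proba(X))  # class-membership scores

# Level 1: linear regression relearns the label from the base outputs
# (one-hot targets, argmax for the predicted class -- an assumption).
Z = np.hstack(meta_features)
meta = LinearRegression().fit(Z, np.eye(5)[y])
pred = meta.predict(Z).argmax(axis=1)
```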
4.3.2. DA Base Classifier
To improve the efficiency of the algorithm, we limit the composite index to be composed of three SDIs. The basic idea of the DA base classifier is to judge whether the SDIs can constitute a fully integrated index according to the similarity between the scores of any two SDIs, which is calculated by Canberra, SAD, RAV, and Max–min.
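As a concrete illustration, the sketch below gives one plausible implementation of the four score-similarity measures. Canberra and SAD have standard definitions; the precise RAV and Max–min formulas are not spelled out in this section, so the `rav` and `max_min` functions here are assumptions, not the paper's definitions.

```python
# Hedged sketch of the four score-similarity measures named above.
import numpy as np

def canberra(a, b):
    # Canberra distance: sum of |a-b| / (|a|+|b|), skipping zero denominators.
    d = np.abs(a) + np.abs(b)
    mask = d > 0
    return np.sum(np.abs(a - b)[mask] / d[mask])

def sad(a, b):
    # Sum of Absolute Differences between the two score vectors.
    return np.sum(np.abs(a - b))

def rav(a, b):
    # ASSUMPTION: Relative Average Deviation -- mean absolute difference
    # scaled by the mean magnitude of the two score vectors.
    return np.mean(np.abs(a - b)) / np.mean((np.abs(a) + np.abs(b)) / 2)

def max_min(a, b):
    # ASSUMPTION: Max-min (Ruzicka-style) similarity, sum(min)/sum(max);
    # assumes non-negative scores.
    return np.sum(np.minimum(a, b)) / np.sum(np.maximum(a, b))
```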
Meta-learning is used in the process of training the DA base classifier. Assume that the total training sample set is $S = \{(x_i, y_i) \mid i = 1, 2, \ldots, m\}$, where the vector $x_i$ represents the similarities of any two of the three SDIs calculated by the above four algorithms, and $y_i \in \{0, 1, 2, 3, 4\}$, where 0 indicates that the three SDIs cannot form a fully integrated index, while 1, 2, 3, and 4 denote the way the SDIs are combined to form a fully integrated index: 1 indicates that all indexes are additive; 2 represents an index made up of a number of blocks composed of the combination of addition and subtraction of three indexes; 3 means an index comprised of a number of blocks composed of the combination of subtraction and addition of three indexes; and 4 expresses that one index is subtracted from the other indexes. Based on meta-learning, the weight $w$ calculated in Section 4.2 is used to combine the three SDIs under the above linear combinations so as to comprehensively mine the characteristics of the implied social distance, and $y_i$ is the label corresponding to the fully integrated index with the largest AUC value.
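To make the labeling concrete, the sketch below forms the four combination modes of a weighted three-SDI composite and returns the label whose combination yields the largest AUC on the known links (label 0 when none is valid). The exact sign patterns for labels 2–4 and the `min_auc` cutoff are illustrative assumptions, not the paper's exact rules.

```python
# Hedged sketch: label a training sample by the combination mode whose
# weighted composite score achieves the largest AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def best_combination_label(s1, s2, s3, w, link_truth, min_auc=0.5):
    # s1, s2, s3: scores of the three SDIs for each candidate link;
    # w = (w1, w2, w3): weights from Section 4.2; link_truth: 0/1 labels.
    patterns = {
        1: (+1, +1, +1),   # all indexes additive
        2: (+1, +1, -1),   # addition and subtraction (assumed pattern)
        3: (+1, -1, +1),   # subtraction and addition (assumed pattern)
        4: (+1, -1, -1),   # one index minus the others (assumed pattern)
    }
    best_label, best_auc = 0, min_auc  # label 0: no valid FISDI
    for label, (a, b, c) in patterns.items():
        score = a * w[0] * s1 + b * w[1] * s2 + c * w[2] * s3
        auc = roc_auc_score(link_truth, score)
        if auc > best_auc:
            best_label, best_auc = label, auc
    return best_label
```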
The DA base classifier is denoted as $h: X \times Y \rightarrow [0, 1]$, where $X$ is the space of similarities between algorithm scores and $Y$ is the set of category labels (namely $Y = \{0, 1, 2, 3, 4\}$), and its output value $h(x, y)$ is the probability that $x$ belongs to class $y$. Suppose that the i-th training input sample is $(x_i, y_i)$, $y \neq y_i$ represents any class other than $y_i$, and we also define the operator $[\![r]\!]$: when $r$ is true, $[\![r]\!] = 1$; when $r$ is false, $[\![r]\!] = 0$. When comparing $h(x_i, y_i)$ with $h(x_i, y)$ for an incorrect label $y \neq y_i$, DA makes three judgments on the sample: $h(x_i, y_i) > h(x_i, y)$, $h(x_i, y_i) < h(x_i, y)$, or $h(x_i, y_i) = h(x_i, y)$. There are thus three situations when $x_i$ is judged and classified: (1) when $h(x_i, y_i) = 1$ and $h(x_i, y) = 0$, then $x_i$ is correctly classified as $y_i$; (2) when $h(x_i, y_i) = 0$ and $h(x_i, y) = 1$, then $x_i$ is wrongly classified as $y$; (3) when $h(x_i, y_i) = h(x_i, y)$, the possibility of classifying $x_i$ as $y_i$ is the same as that of classifying it as $y$, so one of them is chosen at random. Therefore, the probability that $x_i$ is wrongly classified as $y$ is shown in Formula (17) [36].
For the above five kinds of labels, each sample has four different incorrect labels $y \neq y_i$, and because each incorrect label may have a different importance in different situations, each $y \neq y_i$ is given a specified weight $q(i, y)$ ($\sum_{y \neq y_i} q(i, y) = 1$). Therefore, Formula (17) is modified to Formula (18).
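Under the AdaBoost.M2 formulation of [36], Formulas (17) and (18) plausibly take the following form; this is a reconstruction consistent with the notation above, not a quotation of the paper's own displayed formulas.

```latex
% (17): probability that x_i is wrongly classified as a fixed label y != y_i
P_{\mathrm{mis}}(x_i, y) = \frac{1}{2}\left(1 - h(x_i, y_i) + h(x_i, y)\right)

% (18): the same probability weighted over all incorrect labels by q(i, y)
P_{\mathrm{mis}}(x_i) = \frac{1}{2}\left(1 - h(x_i, y_i)
    + \sum_{y \neq y_i} q(i, y)\, h(x_i, y)\right)
```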
4.3.3. AdaBoost Framework
The AdaBoost meta-learning strategy delivers outstanding performance on multi-classification problems, so it is selected to identify the FISDI composed of the best SDIs. According to Formula (18), its pseudo-error can be expressed as Formula (19).
where $D_t(i)$ represents the weight of the i-th sample; the larger its value, the more likely the i-th sample is to be misjudged. The label weighting function $q_t(i, y)$ indicates the probability of wrongly classifying $x_i$ into class $y$; the larger its value, the more easily the sample is misclassified, and the more attention it needs in the next iteration of learning. $D_t(i)$ changes over the iterations, so as to obtain the final global classification model and achieve a better classification effect.
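As a sanity check on Formula (19), the following sketch computes the pseudo-error from a matrix of base-classifier outputs, following the AdaBoost.M2 pseudo-loss of [36]; the array names are hypothetical.

```python
# Hedged sketch of the pseudo-error of Formula (19).
# H[i, y] = h_t(x_i, y); D[i] = sample distribution weight;
# q[i, y] = label weighting function (q[i, y_true[i]] is ignored).
import numpy as np

def pseudo_error(H, y_true, D, q):
    m, k = H.shape
    eps = 0.0
    for i in range(m):
        wrong = [y for y in range(k) if y != y_true[i]]
        eps += D[i] * (1.0 - H[i, y_true[i]]
                       + sum(q[i, y] * H[i, y] for y in wrong))
    return 0.5 * eps
```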
The main steps of the AdaBoost meta-learning strategy proposed in this paper are as follows (a consolidated code sketch of Steps 2–5 is given after Step 5):
Step 1: Generate the raw data S. For each sample in the training network set, the optimal indexes identified by the NB model are first eliminated from all L indexes, and the remaining L-1 indexes are then combined in pairs with the optimal indexes to form composite indexes, with the weight $w$ calculated in Section 4.2 taken as the weight of the corresponding SDI. For each group of indexes, the similarity between each pair of scores is calculated according to Canberra, SAD, RAV, and Max–min, and the label is judged based on the AUC value;
Step 2: Input. The total training sample set is $S$, and the number of iterations is T = 100. In each iteration, samples with a total size of $\lambda m$ are selected according to the sample distribution weights obtained from the previous iteration, where $\lambda$ represents the proportion of selected samples. The algorithm ranks the sample distribution weight vector in descending order and selects the first $\lambda m$ samples;
Step 3: Initialize variables. Let $t = 1$, and set the weight of an error label $y$ in the i-th sample to $w_{i,y}^{1} = 1/(m(k-1))$, where $y \neq y_i$ and $k = 5$ is the number of class labels;
Step 4: Over T iterations, generate T DA base classifiers. At the t-th iteration ($t = 1, 2, \ldots, T$), cycle through the following steps:
a. Calculate the label weight $q_t(i, y)$ according to Formula (20) and compute the sample distribution weight $D_t(i)$ of the i-th sample based on Formula (21);
b. Train DA on the new sample set obtained from the sample distribution $D_t$ to obtain the classifier $h_t$;
c. Calculate the pseudo-error $\varepsilon_t$ of $h_t$ according to Formula (19); if $\varepsilon_t \geq 1/2$, jump to Step 5;
d. Calculate the proportion $\beta_t$ of the current base classifier and update the weight vector, as shown in Formula (22).
Step 5: At the end of iteration T, the base classifiers $h_1, h_2, \ldots, h_T$ are linearly combined with different weights to obtain the final meta classifier $H$, as shown in Formula (23).
$H$ is then applied to the test samples. According to the similarity between the scores of each pair of indexes calculated by Canberra, SAD, RAV, and Max–min, the fully integrated index composed of the most suitable SDIs selected by NB is identified. The final classification result is obtained by the weighted voting rule, as shown in Formula (24).
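The consolidated sketch referenced above implements Steps 2–5 in the style of AdaBoost.M2 [36], with scikit-learn's LinearDiscriminantAnalysis standing in for the DA base classifier. Weighted bootstrap resampling replaces the top-$\lambda m$ subsampling of Step 2, and the update exponent in Formula (22) follows the standard AdaBoost.M2 form; both are assumptions rather than the paper's exact choices.

```python
# Hedged sketch of Steps 2-5 (AdaBoost.M2 style) with DA base classifiers.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def adaboost_m2_da(X, y, T=100, k=5, rng=np.random.default_rng(0)):
    m = len(X)
    # Step 3: error-label weights w_{i,y}^1 = 1/(m(k-1)), zero on true labels.
    w = np.full((m, k), 1.0 / (m * (k - 1)))
    w[np.arange(m), y] = 0.0
    classifiers, betas = [], []
    for t in range(T):                       # Step 4
        W = w.sum(axis=1)                    # per-sample weight mass
        q = w / W[:, None]                   # label weighting (Formula 20)
        D = W / W.sum()                      # sample distribution (Formula 21)
        # Step 4b: resample by D and train DA (assumes every class is drawn).
        idx = rng.choice(m, size=m, replace=True, p=D)
        h = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
        P = h.predict_proba(X)               # P[i, y] = h_t(x_i, y)
        # Step 4c: pseudo-error, Formula (19); q is zero on true labels.
        eps = 0.5 * np.sum(D * (1 - P[np.arange(m), y]
                                + np.sum(q * P, axis=1)))
        if eps >= 0.5:                       # stop condition of Step 4c
            break
        eps = max(eps, 1e-12)                # numeric guard for eps == 0
        beta = eps / (1 - eps)               # Step 4d, Formula (22)
        # Weight update with exponent (1 + h(x_i,y_i) - h(x_i,y)) / 2.
        expo = 0.5 * (1 + P[np.arange(m), y][:, None] - P)
        w = w * beta ** expo
        w[np.arange(m), y] = 0.0
        classifiers.append(h)
        betas.append(beta)
    return classifiers, betas

def predict(classifiers, betas, X, k=5):
    # Step 5, Formulas (23)/(24): weighted vote with weights ln(1/beta_t).
    votes = np.zeros((len(X), k))
    for h, b in zip(classifiers, betas):
        votes += np.log(1.0 / b) * h.predict_proba(X)
    return votes.argmax(axis=1)
```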