Evolutionary Multilabel Classification Algorithm Based on Cultural Algorithm

Abstract: As one of the common methods of constructing classifiers, naïve Bayes has become one of the most popular classification methods owing to its solid theoretical basis, its ability to learn prior knowledge, its unique form of knowledge representation, and its high classification accuracy. This classification method exhibits a symmetry phenomenon in the process of data classification. Although the naïve Bayes classifier achieves high classification performance on single-label classification problems, whether it remains effective on multilabel classification problems is worth studying. In this paper, taking the naïve Bayes classifier as the basic research object, and in view of the shortcomings of its conditional independence assumption and its label class selection strategy, we present a weighted naïve Bayes framework for building a better multilabel classifier and introduce a cultural algorithm to search for and determine the optimal attribute weights, yielding a weighted naïve Bayes multilabel classification algorithm. Experimental results show that the algorithm proposed in this paper is superior to other algorithms in classification performance.


Introduction
The multilabel learning problem draws its origins from the text classification problem [1][2][3]. For example, a text may simultaneously belong to several predetermined topics, such as hygiene and governance. Today, problems of this type are extremely widespread in everyday applications. For example, in video indexing, audio clips can be divided according to emotion-related labels such as "happiness" and "joy" [4]. In functional genomics, multiple function labels can be assigned to each gene, such as "a large and tall body" and "fair skin" [5]. In image recognition, an image can simultaneously contain several scene labels, such as "big tree" and "tall building" [6]. Because multilabel classification is becoming increasingly widespread in real applications, an in-depth study of this subject can be significantly beneficial for our everyday lives [7][8][9][10][11][12].
Many methods are available to construct multilabel classifiers, such as naïve Bayes [13], decision tree [14], the k-nearest neighbors [15], support vector machines (SVMs) [16], instance-based learning [17], artificial neural networks [18], and genetic algorithm-based methods [19]. The naïve Bayes classifier (NBC) is a learning method incorporating supervision and guidance mechanisms and is simple and efficient [20]. These features have aided the NBC in becoming highly popular for classifier learning. However, the NBC is based on a simple albeit unrealistic assumption that the attributes are mutually independent. This begs the question: when the NBC is used to construct classifiers, is it feasible to improve the accuracy of the resulting classifiers by making corrections to this assumption?
In 2004, Gao et al. proposed a multiclass (MC) classification approach to text categorization (TC) [21]. McCallum et al. proposed the use of conditional random fields (CRF) to predict the classification of unlabeled test data [22]. Zhang and Zhou proposed the multilabel K-nearest neighbors (ML-KNNs) algorithm for the classic multilabel classification problem [23]. Zhang et al. converted the NBC model, which is meant for single-label datasets, into a multilabel naïve Bayes (MLNB) algorithm that is suitable for multilabel datasets [13]. Xu et al. proposed an ensemble based on the conditional random field (En-CRF) method for multilabel image/video annotation [24]. Qu et al. proposed the application of Bayes' theorem to the multilabel classification problem [25]. Wu et al. proposed a weighted naïve Bayes based on differential evolution (DE-WNB) algorithm and estimated a naïve Bayes based on self-adaptive differential evolution (SAMNB) algorithm for classifying single-label datasets [26,27]. In 2014, Sucar et al. proposed the use of Bayesian network-based chain classifiers for multilabel classification [28].
For data mining researchers, methods for improving the accuracy of multilabel classifiers have become an important subject in studies on the multilabel classification problem. The problem with the NBC model is that it is exceptionally challenging for the attributes of real datasets to be mutually independent. The assumption of mutual independence will significantly affect classification accuracy in datasets sensitive to feature combinations and when the dimensionality of class labels is very large. There are two problems that must be considered when constructing a multilabel classifier: (1) the processing of the relationships between the different labels, label sets, and attribute sets and the different attributes, and (2) the selection of the final label set for predicting the classification of real data. The available strategies for solving the label selection problem in NBC-based multilabel classification generally overlook the interdependencies between the labels. This is because they rely only on the maximization of posterior probability to perform label selections.
As naïve Bayes multilabel classification can be considered an optimization problem, many researchers have attempted to apply intelligent optimization algorithms to it [29][30][31][32][33][34][35]. Intelligent optimization algorithms have found wide application. Cultural algorithms are a type of intelligent search algorithm; compared to the conventional genetic algorithm, cultural algorithms add a so-called "belief space" to the population component. This component stores the experience and knowledge learned by the individuals during the population's evolution process, and the evolution of the individuals in the population space is then guided by this knowledge. Cultural algorithms have been established as particularly well suited to the optimization of multimodal problems [58,59]. Based on the characteristics of the samples in this work, a cultural algorithm was used to search for the optimal naïve Bayes multilabel classifier, which was then used to predict the class labels of test samples.

Bayesian Multilabel Classification
The naïve Bayes approach has become highly popular for constructing classifiers. This is owing to its robust theoretical foundations, the capability of NBCs to learn prior knowledge, the unique knowledge representation of NBCs, and the accuracy of NBCs. Although NBCs are capable of remarkable classification performance, questions remain with regard to their performance in multilabel classification. Furthermore, there are a few additional questions with regard to naïve Bayes multilabel classifiers compared to naïve Bayes single-label classifiers: First, do different attributes exert different levels of influence on the prediction of each class label? Secondly, is it feasible to extract the interdependencies among the various labels of the label set and, thus, optimize classifier performance?
The multilabel classification problem can be converted into m single-label binary classification problems, where m is the dimensionality of the class label set. The naïve Bayes algorithm is then used to solve the m single-label binary classification problems, thereby solving the multilabel classification problem. This approach is called the naïve Bayes multilabel classification (NBMLC) algorithm. As binary classifiers are used in this algorithm, a class label may take a value of zero or one. For example, if a sample belongs to a class label Ck, the class label acquires a value of one in the sample instance; this is expressed as Ck = 1 (or Ck^1). Conversely, if a sample does not belong to the class label Ck, the class label acquires a value of zero in the sample instance; this is expressed as Ck = 0 (or Ck^0).
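The binary-relevance decomposition described above can be sketched as follows. The toy frequency-based naïve Bayes classifier, the helper names, and the four-instance dataset are all illustrative, not the paper's own implementation:

```python
# Binary relevance: decompose an m-label problem into m independent
# binary problems and train one naive Bayes classifier per label.
from collections import defaultdict

class TinyNB:
    """Minimal naive Bayes for discrete attributes and a binary label."""
    def fit(self, X, y):
        self.prior = {c: y.count(c) / len(y) for c in (0, 1)}
        # cond[c][j][v] = P(attribute j takes value v | class c)
        self.cond = {c: defaultdict(lambda: defaultdict(float)) for c in (0, 1)}
        for c in (0, 1):
            rows = [x for x, lab in zip(X, y) if lab == c]
            for j in range(len(X[0])):
                for x in rows:
                    self.cond[c][j][x[j]] += 1 / max(len(rows), 1)
        return self

    def predict(self, x):
        def score(c):
            s = self.prior[c]
            for j, v in enumerate(x):
                s *= self.cond[c][j][v]
            return s
        return 1 if score(1) >= score(0) else 0

def nbmlc(X, Y, x_test):
    """Train one binary NB per label column of Y; predict all labels of x_test."""
    m = len(Y[0])
    return [TinyNB().fit(X, [row[k] for row in Y]).predict(x_test) for k in range(m)]

X = [(0, 1), (1, 1), (1, 0), (0, 0)]          # four training instances
Y = [(1, 0), (1, 1), (0, 1), (0, 0)]          # two binary labels per instance
print(nbmlc(X, Y, (1, 1)))
```

Each label column is treated as an independent single-label problem, which is exactly the Assumption C that the rest of the paper sets out to correct.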
The training dataset of a multilabel classification problem is presented in Table 1. Here, A = {A1, A2, A3, A4} represents the attribute set of the training dataset, whereas C = {C1, C2} represents the label set of the training dataset. This model has four training instances.

Table 1. Training dataset of a multilabel data model.

The testing instance is as follows: given Y = <A1 = y1, A2 = y2, A3 = y3, A4 = y4>, the objective is to solve for the values corresponding to the class labels C1 and C2.
The problem is solved as follows: First, construct a naïve Bayes network classifier (Figure 1). The nodes C1 and C2 in Figure 1 represent the class attributes C1 and C2. The four other nodes (A1, A2, A3, A4) represent the four attribute values A1, A2, A3, and A4. The class nodes C1 and C2 are the parent nodes of the attribute nodes A1, A2, A3, and A4. The following three assumptions have been adopted in the abovementioned NBMLC algorithm:
A. All attribute nodes exhibit an equal level of importance for the selection of a class node.
B. The attribute nodes (A1, A2, A3, and A4) are mutually independent and completely unrelated to each other.
C. The class nodes C1 and C2 are unrelated and independent.
However, these assumptions tend to be untrue in real problems. In the case of Assumption A, it is feasible for different attributes to contribute differently to the selection of a class label; different conditional attributes may not necessarily exhibit an equal level of importance in the classification of decision attributes. For example, in real data, if the attribute A1 has a value larger than 0.5, the instance must belong to C1, and the value of the attribute A2 has no bearing on whether the instance belongs to C1 or C2. Hence, the value of A2 does not significantly affect the selection of the class label.

Cultural Algorithms
Cultural algorithms (CAs) are inspired by the processes of cultural evolution that occur in the natural world. The effectiveness of the CA framework has already been established in many applications. Figure 2 illustrates the general architecture of a CA. A CA consists of a population space (POP) and a belief space (BLF), each with its own independent evolution process. In a CA, these spaces are connected by a communication protocol consisting of two interface functions, which enables the spaces to cooperatively drive the evolution and optimization of the individuals in the population. The interface functions of a CA are the "accept function" and the "influence function". Because the POP space can host different evolutionary processes, a different hybrid cultural algorithm can be developed by embedding a different evolutionary algorithm in the POP space. Theoretically, any evolutionary algorithm can be incorporated within the POP space as an evolutionary rule. However, a systematic theoretical foundation has yet to be established for applying CAs as an intelligent optimization algorithm.

Weighted Bayes Multilabel Classification Algorithm
In Assumption A of the NBMLC algorithm, all the attribute nodes exhibit an equal level of importance for the selection of a class label node. In single-label problems, many researchers incorporate feature weighting in the NBMLC algorithm to correct this assumption. This has been demonstrated to improve classification accuracy [26,60,61]. In this work, we apply the weighting approach to the multilabel classification problem and, thus, obtain the weighted naïve Bayes multilabel classifier (WNBMLC). Here, wj represents the weight of the attribute xj, i.e., the importance of xj for the class label set. Equation (1) shows the mathematical expression of the WNBMLC algorithm.
Here, it is illustrated that the key to solving the multilabel classification problem lies in the weighting of sample features. First, we constructed a WNBMLC (see Figure 3), where the nodes C1 and C2 correspond to the class attributes C1 and C2. The nodes A1, A2, A3, and A4 represent the four attributes A1, A2, A3, and A4. The class nodes C1 and C2 are the parent nodes of the attribute nodes A1, A2, A3, and A4. The weights of the conditional attributes A1, A2, A3, and A4 for the selection of a class label from the class label set C = {C1, C2} are w1, w2, w3, and w4, respectively. In this work, a CA was used to iteratively optimize the selection of feature weights.

An NBC occasionally calculates probabilities that are zero or very close to zero. Therefore, it is necessary to consider cases where the denominator becomes zero and to prevent underflows caused by the multiplication of small probabilities. Furthermore, while calculating conditional probabilities, an extreme situation can occasionally arise where all the training instances either belong to or do not belong to a class label, in which case the class label has a value of one or zero, respectively, in every instance. This can occur if there are very few samples in the training set or when the dimensionalities of the attributes and class labels are very large. Consequently, the sample instances in the training set do not fully cover the relationship between the attributes and class labels, and it becomes infeasible to classify the records of the test set using the NBC. For example, it is feasible for the training set class label C1 = 1 to have a probability of zero. The denominator then becomes zero in the equations for calculating the mean and variance of the conditional probability, resulting in erroneous calculations. We circumvented this problem by using the M-estimate while calculating prior probabilities. Equation (2) shows the definition of the M-estimate.
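The weighted posterior of Equation (1) scores each label value as the prior times the product of per-attribute likelihoods raised to their weights. A minimal log-domain sketch follows; the probability tables, the weight vector, and the function name are illustrative assumptions, not values from the paper:

```python
# Weighted naive Bayes score, in the form of Equation (1):
#   score(c) = log P(c) + sum_j w_j * log P(x_j | c)
# The weight w_j raises or lowers the influence of attribute j.
import math

def weighted_nb_score(prior, cond, x, w):
    """Log-domain weighted posterior score for one class label value."""
    s = math.log(prior)
    for j, v in enumerate(x):
        s += w[j] * math.log(cond[j][v])
    return s

prior = {1: 0.5, 0: 0.5}
cond = {1: [{0: 0.2, 1: 0.8}, {0: 0.6, 1: 0.4}],   # P(attr_j = v | C = 1)
        0: [{0: 0.7, 1: 0.3}, {0: 0.5, 1: 0.5}]}   # P(attr_j = v | C = 0)
w = [0.9, 0.1]          # attribute 1 matters far more than attribute 2
x = (1, 0)
pred = max((0, 1), key=lambda c: weighted_nb_score(prior[c], cond[c], x, w))
print(pred)
```

Setting every wj to 1 recovers the unweighted NBMLC score, which is why the weighting is a strict generalization of Assumption A.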

P(Xi | Yj) = (nc + m·p) / (n + m)    (2)
In this equation, n is the total number of instances that belong to the class label Yj in the training set; nc is the number of sample instances in the training set that belong to the class label Yj and have an attribute value of Xi; m is a deterministic parameter called the equivalent sample size; p is a self-defined parameter. According to this equation, if the training set is absent or the number of sample instances in the training set is zero, nc = n = 0 and P(Xi | Yj) = p. Therefore, p may be considered the prior probability for the appearance of the attribute Xi in the class Yj. In addition, the balance between the prior probability (p) and the observed frequency (nc/n) is controlled by the equivalent sample size (m). In this work, m is defined as one, and the value of p is 1/|C|, where |C| is the total number of class labels, i.e., the dimensionality of the class label set. Therefore, Equation (3) shows the equation for calculating prior probability.
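The M-estimate of Equation (2) can be computed directly; the counts below are illustrative, and p = 0.5 corresponds to the paper's choice p = 1/|C| for a two-label set:

```python
# M-estimate smoothing as defined by Equation (2):
#   P(X_i | Y_j) = (n_c + m * p) / (n + m)
# With n = n_c = 0 (no usable training data) the estimate falls back to p.

def m_estimate(nc, n, m=1, p=0.5):
    return (nc + m * p) / (n + m)

print(m_estimate(3, 10, m=1, p=0.5))   # count-based estimate pulled toward p
print(m_estimate(0, 0, m=1, p=0.5))    # degenerate case: returns p exactly
```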

Correction of the Conditional Probability Formula
Correction for Assumption A in the conditional probability formula: Suppose that the instances of the training set adhere to a Gaussian density function. If all the training set instances belong to (or do not belong to) a certain class label, the number of elements in the average and variance formulae that correspond to the class label will be zero. Therefore, the denominator becomes zero. In this scenario, the calculated conditional probabilities are meaningless. In this work, it is assumed that for each class label, there is a minimum of one instance in the training set that belongs to that class label and also a minimum of one instance in the training set that does not belong to that class label. Therefore, if a class label was not selected in the sample instances of a training set, the algorithm will still assume that it was selected in one instance. This does not affect the results of the classification because it is a low-probability event compared to the number of training set instances. However, it ensures that the denominator in the conditional probability formula will not become zero. That is, regardless of the number of sample instances wherein a class label was evaluated as zero under a certain specified set of conditions, the selection count will always be incremented by one to ensure that this class label is selected in a minimum of one instance or not selected in one instance.
Correction for Assumption B: Suppose that conditional probability is being calculated by discretizing the continuous attribute values. If all the sample instances of a training set belong to (or do not belong to) a class label, Nj(Ck) = 0, and the denominator of P(Xi = xik | Ck) = Nj(xik, Ck) / Nj(Ck) becomes zero. This data would then be considered invalid by the classifier. To resolve this issue, we used the M-estimate to smooth the conditional probability formula, as in Equation (4).
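The two corrections can be combined in one helper. This is a sketch under the assumptions stated above (count floored at one instance, then M-estimate smoothing in the spirit of Equation (4)); the counts and the function name are illustrative:

```python
# Corrections for zero counts in the conditional probability formula:
# (a) every class label value is assumed to occur in at least one training
#     instance, so the class count is floored at 1 (Assumption A correction);
# (b) the ratio itself is M-estimate smoothed (Assumption B correction).

def smoothed_cond_prob(n_attr_and_class, n_class, m=1, p=0.5):
    n_class = max(n_class, 1)        # correction (a): never a zero denominator
    return (n_attr_and_class + m * p) / (n_class + m)

print(smoothed_cond_prob(0, 0))      # degenerate training set, still well-defined
```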

Correction of the Posterior Probability Formula
In this work, the logarithm summation (log-sum) method was used to prevent underflows caused by the multiplication of small probabilities. In the product P(C | X) ∝ P(C) ∏j P(xj | C), even if none of the factors in the product is zero, if n is large, the final result for P(C | X) can evaluate to zero, or an underflow may occur and prevent the evaluation of P(C | X). In this scenario, it is infeasible to classify the test samples via stringent pairwise probability comparisons. It is, therefore, necessary to convert the probabilities through Equations (5) and (6).
The product calculation is transformed into a log-sum to solve this problem. This solves the underflow problem effectively, improves the accuracy of the calculation, and facilitates stringent pairwise comparisons. To ensure accurate calculations in this work, M-estimate-smoothed equations were used to calculate prior probability and conditional probability, whereas the log-sum method was used to calculate posterior probability in all the experiments described in this paper.
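The underflow and its log-sum fix can be demonstrated in a few lines; the factor count and magnitude are illustrative:

```python
# Log-sum trick in the spirit of Equations (5) and (6): replace the product
# of many small probabilities with a sum of their logarithms, so the
# comparison between class labels stays numerically meaningful.
import math

probs = [1e-20] * 30                        # 30 tiny likelihood factors
product = 1.0
for p in probs:
    product *= p                            # underflows to exactly 0.0
log_sum = sum(math.log(p) for p in probs)   # stays finite and comparable

print(product)
print(log_sum)
```

Because log is monotonic, comparing log-sums between two class labels yields the same decision the raw products would have yielded, without the underflow.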

Improved Cultural Algorithm
In the proposed algorithm, the individuals (chromosomes) in the POP space are designed using real-number coding. The variables of the individuals are randomly initialized in the (0.0, 1.0) range of real numbers so that each chromosome consists of a set of real numbers. The dimensionality of the chromosome is equal to the dimensionality of the conditional attributes in the sample data, and each real number corresponds to a conditional attribute in the dataset. Suppose that the population size is N and that the attribute dimensionality of an individual in the population is n. Then, each individual in the population, Wi, may be expressed as an n-dimensional vector such that Wi = {w1, w2, …, wj, …, wn}, where wj is the weight of the j-th attribute of individual Wi and lies within (0.0, 1.0). The structure of each chromosome is shown in Figure 4, and the structure of the POP space is shown in Figure 5.

Definition and Update Rules of the Belief Space
The BLF space in our algorithm uses the <S, N> structure. Here, S is situational knowledge (SK), which is mainly used to record the exemplar individuals in the evolutionary process. The structure of SK may be expressed as SK = {S1, S2, …, S|S|}, where |S| represents the capacity of SK, and each individual in the SK set has the structure Si = {xji | f(Si)}, where Si is the i-th exemplar individual in the SK set and f(Si) is the fitness of that individual in the population. The structure of SK is shown in Figure 6, and the update rules for SK are shown in Equation (7). N is normative knowledge (NK); its structure is shown in Figure 7, and the update rules for NK are shown in Equation (8).
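A minimal sketch of the SK update in the spirit of Equation (7) is to keep the fittest individuals seen so far as the exemplars; the capacity, the fitness function, and the one-dimensional individuals below are all illustrative assumptions:

```python
# Situational-knowledge (SK) update sketch: merge the current SK set with
# the population and retain the |S| fittest individuals as exemplars.

def update_sk(sk, population, fitness, capacity=3):
    merged = sorted(sk + population, key=fitness, reverse=True)
    return merged[:capacity]

sk = update_sk([], [[0.1], [0.9], [0.4], [0.7]], fitness=lambda w: w[0])
print(sk)
```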

Fitness Function
In this work, the fitness function is defined as the accuracy of the class labels predicted by the algorithm. Therefore, the fitness value of the i-th individual in the t-th generation, f(Xi^t), is equal to the classification accuracy obtained by substituting the weights of individual Xi^t into the weighted naïve Bayes posterior probability formula. Substituting the weight of each dimension of individual Xi^t into the weighted naïve Bayes posterior probability formula (Equations (1)-(3)) yields the theoretical class label. This theoretical class label, J'ik, is then compared to the real class label, Jik. A score of one is assigned if they are equal, and zero otherwise. If there are n test instances and m class label dimensions, the equation for calculating the fitness of an individual is shown in Equation (9).
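The scoring rule just described amounts to counting label-value matches over all n instances and m label dimensions; the toy label matrices below are illustrative:

```python
# Fitness in the form of Equation (9): the fraction of correctly predicted
# label values over n test instances and m label dimensions.

def fitness(predicted, actual):
    n, m = len(actual), len(actual[0])
    hits = sum(1 for i in range(n) for k in range(m)
               if predicted[i][k] == actual[i][k])
    return hits / (n * m)

print(fitness([(1, 0), (1, 1)], [(1, 0), (0, 1)]))   # 3 of 4 label values correct
```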

Influence Function
The influence function (Equation (10)) is the channel by which the BLF space guides the evolution of individuals in the POP space. That is, the influence function enables the various knowledge categories of the BLF space to influence the evolution of individuals in the POP space. To adapt the CA to the multilabel classification problem, NK was used to adjust the step length of the individuals' evolution, whereas SK was used to direct the evolution of the individuals.

Selection Function
In the CA-based weighted naïve Bayes multilabel classification (CA-WNB) algorithm, a greedy search strategy is used to determine whether a newly generated trial individual, vj,i(t + 1), will replace the parent individual of the t-th generation, xj,i(t), to form a new individual of the (t + 1)-th generation, xj,i(t + 1). The algorithm compares the fitness value of vj,i(t + 1), f(vj,i(t + 1)), to that of xj,i(t), f(xj,i(t)). vj,i(t + 1) is selected for the next generation only if f(vj,i(t + 1)) is strictly superior to f(xj,i(t)); otherwise, xj,i(t) is retained in the (t + 1)-th generation. This approach systematically selects superior individuals for retention in the next generation. The implementation of this approach can be mathematically expressed as Equation (11).
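The greedy rule of Equation (11) is a one-liner; the stand-in fitness function (a plain sum) is an illustrative assumption, not the paper's accuracy-based fitness:

```python
# Greedy selection in the form of Equation (11): the trial individual
# replaces its parent only when it is strictly fitter.

def select(parent, trial, fit):
    return trial if fit(trial) > fit(parent) else parent

fit = sum                                                  # stand-in fitness
assert select([0.2, 0.3], [0.5, 0.4], fit) == [0.5, 0.4]   # fitter trial wins
assert select([0.5, 0.4], [0.5, 0.4], fit) == [0.5, 0.4]   # tie keeps the parent
print("ok")
```

Requiring a strict improvement means the best fitness in the population can never decrease from one generation to the next.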
The process of our improved CA is shown in Figure 8. In each iteration, the whole population is evaluated by the posterior probability; the belief space B(t + 1) is updated from B(t) through the accept function; and the new population P(t + 1) is generated from P(t) by the influence function, until the termination condition is met.

Figure 8. Flowchart of the improved cultural algorithm.

CA-Based Evolutionary Multilabel Classification Algorithm
In the CA-WNB algorithm, the main purpose of the CA is to determine the weights of the attributes for label selection, as the weight-searching process is effectively a label-learning process. Once the optimal weights have been determined, the attribute weights may be used to classify the test set's instances. The architecture of the CA-WNB algorithm is shown in Figure 9. The training of the CA-WNB algorithm is performed according to the following procedure:
Step 1: The data is preprocessed using stratified sampling. In this step, 70% of the sample dataset is randomly imported into the training set, and the other 30% is imported into the test set. The prior and conditional probabilities of the sample data in the training set (with M-estimate smoothing applied) are then calculated.
Step 2: Initialize the POP space. The individuals in the POP are randomly initialized, with each individual corresponding to a set of feature weights. The size of the population is NP.
Step 3: Evaluate the POP. Let wi = xi. In addition, normalize wi (the sum of all the attribute weights should be equal to one so that the sum of all the variables in the chromosome is one). The weights of each individual are substituted into the weighted naïve Bayes posterior probability formula (which uses the log-sum method) to predict the class labels of the sample data in the training set. The resulting classification accuracies are then considered the fitness values of the individuals. The evaluation of the POP space is thus completed, and the best individual in the POP is stored.
Step 4: Initialize the BLF. NK and SK are obtained by selecting the range of the BLF according to the settings of the accept function and the individuals in this range.
Step 5: Update the POP. Based on the features of NK and SK of the BLF, new individuals are generated in the POP according to the influence rules of the influence function. In the selection function, the exemplar individuals are selected from the parents and children according to the greedy selection rules, thus forming the next generation of the POP.
Step 6: Update the BLF. If the new individuals are superior to the individuals of the BLF, the BLF is updated. Otherwise, Step 5 is repeated until the algorithm attains the maximum number of iterations or until the results have converged. The algorithm is then terminated.
CA optimization is used to obtain the optimal combination of weights based on the training set. The weighted naïve Bayes posterior probability formula is then used to predict the class labels of the unlabeled test set instances. The predictions are then scored: a point is scored if the prediction is equal to the theoretical value; no point is scored otherwise. This ultimately yields the average classification accuracy of the test set's instances.
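Steps 2-6 above can be sketched as the following loop. The fitness function here is a synthetic stand-in for the weighted naïve Bayes classification accuracy, the influence step is simplified to an SK-guided perturbation, and all parameter values are illustrative assumptions rather than the paper's configuration:

```python
# Skeleton of the CA-WNB training loop (Steps 2-6).
import random

random.seed(1)
n_attr, NP, max_iter = 4, 10, 50

def fitness(w):
    # Stand-in for classification accuracy; peaks when every weight is 0.6.
    return -sum((wi - 0.6) ** 2 for wi in w)

# Step 2: initialize the POP with weight vectors in (0.0, 1.0).
pop = [[0.001 + 0.998 * random.random() for _ in range(n_attr)]
       for _ in range(NP)]
# Steps 3-4: evaluate the POP and seed the belief space (SK exemplar).
best = max(pop, key=fitness)
init_fit = fitness(best)

for _ in range(max_iter):
    new_pop = []
    for ind in pop:
        # Step 5 (influence): move each weight toward the SK exemplar, with noise.
        trial = [min(0.999, max(0.001, wi + 0.5 * (bi - wi) + random.gauss(0, 0.05)))
                 for wi, bi in zip(ind, best)]
        # Step 5 (selection): greedy rule keeps the strictly fitter individual.
        new_pop.append(trial if fitness(trial) > fitness(ind) else ind)
    pop = new_pop
    # Step 6 (accept): update the belief space only on strict improvement.
    cand = max(pop, key=fitness)
    if fitness(cand) > fitness(best):
        best = cand

print([round(w, 2) for w in best])
```

By construction, the belief-space exemplar's fitness is non-decreasing across iterations, mirroring the convergence criterion of Step 6.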

Experimental Datasets
In most cases, the efficacy of multilabel classification is strongly correlated with the characteristics of the dataset. A few of these datasets may have gaps, noise, or nonuniform distributions. The dataset's attributes may also be strongly correlated. Moreover, the data may be discrete, continuous, or a mix of both. The datasets used in this work have been normalized. That is, the attribute data have been scaled so that their values fall within a small, specified value interval, which is generally 0.0-1.0.
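The normalization just described is plain min-max scaling per attribute column; the sample column and the constant-column guard are illustrative:

```python
# Min-max normalization: scale an attribute column into the 0.0-1.0 interval.

def min_max(col):
    lo, hi = min(col), max(col)
    if hi == lo:                       # constant column: map everything to 0.0
        return [0.0] * len(col)
    return [(v - lo) / (hi - lo) for v in col]

print(min_max([2.0, 4.0, 10.0]))
```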
Single instance multilabel datasets were selected for this experiment. In datasets of this type, a task can be represented by an instance. However, this instance belongs to multiple class labels simultaneously. Given a multilabel dataset D and class label set C = {C1, C2, C3, C4}, if there is a tuple, Xi, whose class labels are Yi = {C1, C3}, the sample instance is represented as {Xi |1, 0, 1, 0}. That is, all the sample instances in the dataset must include all the labels in the label set. If an instance belongs to the class label Ci, the value of Ci is one; if an instance does not belong to Ci, the value of Ci is zero.
As the proposed multilabel classification algorithm is mainly aimed at text data, to ensure that the results of the validation experiments are universally comparable, the simulation experiments were performed using four widely accepted and preprocessed multilabel datasets that were obtained from the following multilabel datasets website: http://mulan.sourceforge.net/datasets.html. The CAL500 and emotions datasets concern music and emotion, the yeast dataset is from bioinformatics, and the scene dataset comprises natural scenes; the features of these datasets are described in Table 2.

Classification Evaluation Criteria
Multilabel data are generally mined to achieve two objectives: to classify multilabel data and to sort large numbers of labels. In this work, we focus only on multilabel classification. Therefore, the evaluation criteria used in this work are based on evaluation criteria for classification methods. According to the characteristics of the experimental data, if we suppose that there is a multilabel dataset D, a class label set C = {C1, C2, C3, C4}, and a tuple Xi whose class labels are Yi = {C1, C3}, the representation of the sample instance is then {Xi | 1, 0, 1, 0}. In a multilabel classification problem, the label set, Zi, that was predicted by the multilabel classifier for Xi may differ from the actual label set, Yi. Suppose that the class label set obtained by the algorithm is {1, 0, 0, 0}. As this set partially matches the real set, {1, 0, 1, 0}, the prediction accuracy of this instance is Accuracyi = (3/4) × 100% = 75%. That is, three of the four class labels were predicted correctly. Each correct prediction earns one point, while a wrong prediction earns zero points; prediction accuracy is given by the number of points divided by the dimensionality of the class label set. It is thus shown that the prediction accuracy of each test instance ranges within [0, 1]. If F(Cik) and Cik represent the theoretical (i.e., algorithm-predicted) and real values, respectively, of the class label in the k-th dimension of the i-th sample instance in the test set, N is the total number of sample instances in the test set, and m is the dimensionality of the class label set, then the equation for calculating the classification accuracy of each to-be-classified sample instance in the test set is Equation (12).
The T (F(Cik), Cik) function returns a value of one if the values of F(Cik) and Cik are equal, and zero otherwise, as per Equation (13).
The average criterion of the algorithm is the average classification accuracy of all the sample instances in the test set. It is calculated using Equation (14).
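Equations (12)-(14) can be computed directly; the worked example is the one from the text ({1, 0, 0, 0} predicted against the real {1, 0, 1, 0}), and the second test instance is an illustrative addition:

```python
# Evaluation criteria in the form of Equations (12)-(14): per-instance
# accuracy is the fraction of correctly predicted label dimensions, and
# the average criterion is the mean over all test instances.

def instance_accuracy(pred, real):
    return sum(p == r for p, r in zip(pred, real)) / len(real)

def average_accuracy(preds, reals):
    return sum(instance_accuracy(p, r) for p, r in zip(preds, reals)) / len(reals)

print(instance_accuracy([1, 0, 0, 0], [1, 0, 1, 0]))   # the text's 75% example
```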

Classification Prediction Methods
The CA produces a set of feature weights at the end of its iterations. The two classification methods described in [31] are used to predict the classification accuracy of each individual for the test sets. As these methods use topological rankings to predict an algorithm's classification accuracy, the population size (NP) of the CA must be at least 30 to ensure that these prediction methods remain effective.

Analysis of the Results of the NBMLC Experiment
In the NBMLC experiment, stratified sampling was used to randomly import 70% of the sample instances into the training set; the remaining 30% of the samples were imported into the test set. The attributes of the experimental datasets are continuous, and there are many methods by which an NBC can compute their conditional probabilities. Therefore, three common fitting methods (Gaussian distribution, Cauchy distribution, and data discretization) were used in the experimental validation, and the calculated results were compared to analyze the strengths and weaknesses of each fitting method. In the data discretization experiment, data fitting was performed by specifying the initial number of discrete intervals, which is 10 in this experiment. Ten independent trials were then performed by applying the NBMLC algorithm to each of the four experimental datasets, and the maximum (MAX), minimum (MIN), and average (AVE) values of the 10 trials were recorded. The experimental results are presented in Table 3. The analysis of the results of the three fitting experiments is presented in Table 4. In this table, Gau-Cau represents the difference between the average classification accuracies of the Gaussian and Cauchy distribution experiments; Dis-Gau represents the difference between the average classification accuracies of the data discretization and Gaussian distribution experiments; and Dis-Cau represents the difference between the average classification accuracies of the data discretization and Cauchy distribution experiments. The time complexity of each fitting method is O(N × n × m), where N is the size of the dataset, n is the dimensionality of the attributes, and m is the dimensionality of the class labels. Figure 10 shows the total computation times of each distribution method over their 10 trial runs, which were performed using identical computational hardware. The horizontal axis indicates the type of fitting method, whereas the vertical axis indicates the computation time consumed by each method.
Table 4 compares the classification accuracies of the NBMLCs with the three distribution methods. It is demonstrated that the classification accuracy of an NBMLC is the highest when data discretization is used to fit the conditional probabilities, and that the data discretization approach yields higher classification efficacy in highly concentrated datasets. The use of Gaussian and Cauchy distributions to fit the conditional probabilities of the dataset resulted in significantly poorer results than the discretization approach, and the classification accuracies obtained with the Gaussian and Cauchy distributions are similar. Further analysis revealed that the effects of the different distribution methods on classification accuracy are significantly more pronounced in the "emotions" dataset than in the CAL500, "scene", or "yeast" datasets. In the "emotions" dataset, the classification accuracy of the discretization approach is nearly 13% higher than that of the other approaches. In the "scene", "yeast", and CAL500 datasets, the discretization approach outperformed the other approaches by 4%, 3%, and 1%, respectively. An analysis of the characteristics of the datasets revealed that the "emotions" dataset has the smallest class label dimensionality, followed by the "scene" dataset and the "yeast" dataset, and that the CAL500 dataset has the largest. Therefore, it may be concluded that the classification accuracies of these fitting methods become more similar as the number of class label dimensions increases. Although the algorithmic time complexities of these fitting methods are on an identical level of magnitude, the attribute values of the test data must be divided into intervals in the discretization approach. This requirement resulted in higher computation times than the Gaussian and Cauchy distribution approaches.

Analysis of Results of the CA-WNB Experiment
In the CA-WNB algorithm, three parameters are relevant for the CA: the maximum number of iterations, the population size, and the initial acceptance ratio of the accept function. The configuration of these parameters is presented in Table 5. In the CA-WNB algorithm, the CA is used to optimize the attribute weights of the WNBMLC, and the resulting weights are validated under each of the three conditional-probability fitting methods. The results obtained by the CA-WNB algorithm were compared with those obtained by the ordinary NBMLC algorithm, based on the abovementioned design rules and experimental evaluation criteria for classification functions. Because the Gaussian and Cauchy approaches model the continuous attributes of each dataset by fitting their probability curves, and the NBMLC results associated with these approaches are similar, we used Prediction Methods 1 and 2 to compare the experimental results of the CA-WNB and NBMLC algorithms with Gaussian and Cauchy fitting. These comparisons were also conducted between the CA-WNB and NBMLC algorithms with the discretization approach, with varying numbers of discretization intervals (num = 10 and num = 20).
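A minimal sketch of how a CA could search for the attribute weights is given below, assuming real-coded individuals in [0, 2], an accept function that retains the top fraction of the population, and a normative belief space of per-dimension bounds. The parameter names and the weight interval are assumptions for illustration; in the paper, the fitness would be the WNBMLC classification accuracy on the training set.

```python
import random

def cultural_algorithm(fitness, dim, np_=20, maxgen=50, accept_ratio=0.2, seed=0):
    # Minimal cultural-algorithm sketch for searching attribute weights.
    # `fitness` is any callable mapping a weight vector to a score to maximize.
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 2) for _ in range(dim)] for _ in range(np_)]
    # Belief space: normative knowledge, i.e. per-dimension bounds on the
    # weights carried by accepted (elite) individuals.
    belief = [(0.0, 2.0)] * dim
    n_accept = max(1, int(accept_ratio * np_))
    for _ in range(maxgen):
        scored = sorted(pop, key=fitness, reverse=True)
        elite = scored[:n_accept]
        # Accept function: update the belief space from the elite individuals.
        belief = [(min(e[d] for e in elite), max(e[d] for e in elite))
                  for d in range(dim)]
        # Influence function: new individuals are sampled within the
        # belief-space bounds, while the elite are carried over unchanged.
        pop = elite + [[rng.uniform(*belief[d]) for d in range(dim)]
                       for _ in range(np_ - n_accept)]
    return max(pop, key=fitness)
```

Because the elite are carried over, the best fitness is monotonically non-decreasing across generations, while the belief-space bounds progressively narrow the sampling region around promising weight combinations.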

Gaussian and Cauchy Distribution
Table 6 [62] shows the experimental results of the best individuals produced by the CA-WNB and NBMLC algorithms with Gaussian and Cauchy distribution, according to Prediction Methods 1 and 2. The CA-WNB and NBMLC algorithms were applied to the four experimental datasets, and 10 trials were performed for each dataset; the MAX, MIN, and AVE of these trials were recorded. In Table 6, CA-WNB-P1 and CA-WNB-P2 represent the classification accuracies predicted by Prediction Methods 1 and 2, respectively, for the individuals produced by the CA-WNB algorithm.

Tables 10 and 11 present the average classification accuracy of the weighting combinations corresponding to the final-generation individuals whose fitness values ranked in the top 10, top 20, and top 30, as yielded by Prediction Methods 1 and 2, in the classification of the four experimental datasets; Gaussian and Cauchy distributions were used to model the conditional probabilities. These tables also show the percentage by which the CA-WNB algorithm improves upon the classification accuracy of the NBMLC algorithm, according to Prediction Methods 1 and 2. The bolded entries indicate the classification accuracy obtained by the best individual, and the bottom rows present the average classification accuracies of the three algorithms; there, CA-WNB-P1 and CA-WNB-P2 are the average classification accuracies obtained using the CA-WNB algorithm, according to Prediction Methods 1 and 2. It is apparent that the average accuracy obtained by the CA-WNB algorithm is superior to that of the NBMLC algorithm. However, this is obtained at the expense of computation time, as the CA-WNB algorithm iteratively optimizes the attribute weights prior to the prediction of class labels so as to weaken the effects of the naïve conditional independence assumption.
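The weighting mechanism can be illustrated as follows: each conditional probability is raised to the power of its attribute weight, so that setting every weight to 1 recovers ordinary naïve Bayes. The per-label present/absent decision rule in the sketch below is an illustrative assumption and is not the paper's Prediction Method 1 or 2.

```python
import math

def wnb_score(prior, cond_probs, weights):
    # Weighted naive Bayes log-score for one class:
    # log P(c) + sum_i w_i * log p(x_i | c).
    # Raising each conditional probability to its attribute weight w_i
    # relaxes the equal-importance (conditional independence) assumption.
    return math.log(prior) + sum(w * math.log(p) for w, p in zip(weights, cond_probs))

def predict_labels(priors, cond_probs_per_label, weights):
    # Illustrative multilabel strategy: treat each label independently and
    # assign it when the score for "label present" exceeds "label absent".
    # priors[k] = (P(label k present), P(label k absent));
    # cond_probs_per_label[k] = (probs given present, probs given absent).
    labels = []
    for k, (p_yes, p_no) in enumerate(priors):
        s_yes = wnb_score(p_yes, cond_probs_per_label[k][0], weights)
        s_no = wnb_score(p_no, cond_probs_per_label[k][1], weights)
        if s_yes > s_no:
            labels.append(k)
    return labels
```

Under this formulation, the CA's search space is exactly the weight vector passed to `wnb_score`, which is why an individual's fitness can be computed by substituting its weights into the classifier and measuring training-set accuracy.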
Even if one overlooks the effects of having multiple prediction methods and different training and test dataset sizes on the computation time, the time complexity of the CA-WNB algorithm is still NP × MAXGEN times higher than that of the NBMLC algorithm (NP is the population size, and MAXGEN is the maximum number of evolutions).
The running times of the CA-WNB and NBMLC algorithms (under the same conditions and environment as in the abovementioned experiments) are shown in Figure 11. NBMLC-Gau and NBMLC-Cau are the running times of the NBMLC algorithm with Gaussian and Cauchy distribution, respectively, and CA-WNB-Gau and CA-WNB-Cau are the running times of the CA-WNB algorithm with Gaussian and Cauchy distribution, respectively. The experimental results demonstrate the following: (1) As Figure 11 shows, the CA-WNB algorithm consumes significantly more computation time than the NBMLC algorithm. (2) If one disregards computational cost, the CA-WNB algorithm is evidently superior to the simple NBMLC algorithm in terms of classification performance. The percentage improvement in classification accuracy with the CA-WNB algorithm was most pronounced in the "emotions" dataset, followed by the "scene", "yeast", and CAL500 datasets. An analysis of the characteristics of these datasets revealed that using stratified sampling to shuffle the "emotions" and "scene" datasets did not significantly affect their classification accuracies, whereas the effects of this operation on the CAL500 and "yeast" datasets are more pronounced. This indicates that the "emotions" and "scene" datasets have relatively uniform data distributions, whereas the CAL500 and "yeast" datasets have highly nonuniform data distributions. If the training dataset is strongly representative of the overall dataset, the classification efficacy of the algorithm will be substantially improved by the training process; otherwise, the improvement will not be very significant. Based on this observation, weighted naïve Bayes can be used to optimize classification efficacy if the distribution of the dataset is relatively uniform and the time requirements of the experiment are not particularly stringent.
(3) A comparison of Tables 10 and 11 reveals that the improvement in average classification accuracy yielded by Prediction Method 2 (6.79% and 7.43%) is always higher than that yielded by Prediction Method 1 (6.54% and 7.20%), regardless of whether Gaussian or Cauchy fitting is used. Furthermore, the improvement in average classification accuracy is always higher with Gaussian fitting (7.20% and 7.43%) than with Cauchy fitting (6.54% and 6.79%). A comparison of the Cauchy and Gaussian results reveals that the results of the CA-WNB algorithm with Cauchy fitting varied substantially, and the weights obtained by the CA also exert a stronger influence on the Cauchy fitting; therefore, the results with Cauchy fitting are unstable to a certain degree. Nonetheless, the average classification accuracy of the CA-WNB algorithm with Cauchy fitting is superior to that with Gaussian fitting. (4) Tables 10 and 11 reveal that the highest average classification accuracies for the four experimental datasets were obtained by individuals ranked 10-20, according to Prediction Method 2, whereas the worst classification accuracies were obtained by individuals ranked 20-30. This shows that the weights that best fit the training set instances are not necessarily the optimal weights for classifying test set instances: after a certain number of evolutions of the population, overfitting may occur because the fitting curves are adjusted too finely, resulting in less-than-ideal classification accuracies on the test set. Consequently, the fitting curves produced by the weights of individuals ranked 10-20 were better for classifying the instances of the test set.

Conclusions
In this paper, we studied the multilabel classification problem. We presented the algorithmic framework of naïve Bayes multilabel classification and analyzed and compared the effects of three common fitting methods for continuous attributes on the classification performance of the naïve Bayes multilabel classification algorithm, from the perspectives of average classification accuracy and algorithmic time cost. On this basis, we presented the framework of weighted naïve Bayes multilabel classification, treated the determination of the weights as an optimization problem, introduced the cultural algorithm to search for and determine the optimal weights, and proposed the weighted naïve Bayes multilabel classification algorithm based on the cultural algorithm. In this algorithm, the classification accuracy obtained by substituting the current individual into the weighted naïve Bayes multilabel classifier is taken as the objective function; the individual dimensionality equals the attribute dimensionality, with each one-dimensional variable in an individual representing the weight of the corresponding attribute; and real-number encoding was tested and verified to obtain a prediction strategy better suited to the algorithm. Experimental results show that, when time cost is not considered, the algorithm proposed in this paper is superior to similar algorithms in classification performance.