Capacity Random Forest for Correlative Multiple Criteria Decision Pattern Learning

: The Choquet capacity and integral is an eminent scheme to represent the interaction knowledge among multiple decision criteria and deal with the independent multiple sources preference information. In this paper, we enhance this scheme’s decision pattern learning ability by combining it with another powerful machine learning tool, the random forest of decision trees. We ﬁrst use the capacity ﬁtting method to train the Choquet capacity and integral-based decision trees and then compose them into the capacity random forest (CRF) to better learn and explain the given decision pattern. The CRF algorithms of solving the correlative multiple criteria based ranking and sorting decision problems are both constructed and discussed. Two illustrative examples are given to show the feasibilities of the proposed algorithms. It is shown that on the one hand, CRF method can provide more detailed explanation information and a more reliable collective prediction result than the main existing capacity ﬁtting methods; on the other hand, CRF extends the applicability of the traditional random forest method into solving the multiple criteria ranking and sorting problems with a relatively small pool of decision learning data.


Introduction
The multiple criteria involved in most decision or evaluation problems are usually not independent, and a variety of interactions, from negative to positive, can exist among them [1][2][3]. The capacity [4], also called fuzzy measure [5], combined with the nonlinear or fuzzy integral [6], especially the most commonly accepted type, the Choquet integral [4], can effectively deal with the complex interaction situations among interdependent criteria and aggregate the multiple-source partial preferences of decision alternatives into their overall assessments, based on which the decision maker can accordingly get the final ranking orders or sorting classifications [7,8].
Given adequate decision instances, generally not a very lot of data, and their overall evaluations, ranking orders, or classifications as the learning set, the scheme of capacity plus Choquet integral can learn, simulate, and explain the correlative multiple criteria decision pattern, wherein the decision and explanation knowledge will be stored in the power set of decision criteria as the capacity values of all decision subsets. This fitting or training process is generally carried out by the linear or nonlinear optimization models [2,9,10], called capacity identification methods [11], among which the least-squares principle method [12][13][14], the least absolute deviation criterion method [15,16], and the maximum split method [13,17] are the three most widely adopted ones.
However, in the capacity identification method, it is supposed that the decision criteria set is well predetermined by the decision maker and is kept fixed and unchanged during the whole learning and fitting process. Actually, the determination of decision criteria is really a vital precondition of high quality decision making and also a time-consuming task for the decision maker. Furthermore, as we can imagine, a given unchanged decision criteria set will definitely confine the model's possibility and ability to explore other criteria subsets that enable better performance on learning and explaining the decision pattern behind the training data. Additionally, by using any static decision criteria set, the model does not have the ability to avoid the overfitting problem and also lacks adequately mutual comparison validation with other optional criteria sets. Hence, the traditional capacity identification method can not always ensure the competent generalization ability to the new decision instances, and there is also lack of proof and evidence to verify the reliability and rationality of the prediction result.
The decision tree (DT) [18,19] and random forest (RF) [20,21] framework can help the Choquet capacity and integral decision scheme to improve its abilities of dynamic explanation and reliable prediction. RF is basically an ensemble method of a number of DTs for the purpose of learning and solving classification and regression problems. The outcome of RF is the mean or weighted sum of all selected DTs for the regressive case or their majority vote for the classification case [21]. Different DTs provide different learning and explanation perspectives of the given pattern, and the collective outcome of all DTs with the consideration of their performance will highly increase the credibility and generalization ability of RF's predictions.
RF's classification and regression performance highly depend on the given training data. For the aspect of classification, if with a lot of training data and a small number of desired output categories, RF usually can provide competent performance; but if there is a lack of enough data for each category, like in most situations of the ranking decision problem in which each category only has one instance or record, it is hard for RF to get an acceptable performance. For the aspect of regression, since the interactions exist among decision criteria, the multicriteria decision is basically a nonlinear problem, which is really a challenge for RF to learn and simulate by using some linear regression models. Furthermore, the RF or DT can neither explicitly explain the interaction phenomenon among decision criteria nor specifically express the decision maker's additional judgment and preference for decision criteria and alternatives.
Hence, it is necessary to combine the capacity plus Choquet integral with the DT and RF to better learn and explain the correlative multicriteria decision pattern. With the given decision data and potential decision criteria, q suitable capacity identification method can be used to generate many capacity based DTs (CDTs) and then a capacity-based RF (CRF) can be established by taking account of those trees' performances; see Algorithm 1 for more details.
In summary, there are two direct advantages of CRF. On the one hand, CRF can provide more dynamic explanations and insights of the multicriteria decision pattern, and on the other hand, CRF can have better generalization ability and give a more creditable prediction result using a collective vote of all CDTs. That is, CRF enriches the explanation capability of traditional capacity identification methods, and meanwhile extends the application range of traditional RF into the sorting and ranking multicriteria decision making with a relatively small-scale training dataset. This paper is organized as follows. After the introduction, we briefly present background knowledge of the capacity, Choquet integral, and capacity identification methods, and of DT and RF in Section 2. In Section 3, we discuss the basic steps of RF and construct the CRF-based decision algorithm and strategies for transforming between ranking and sorting decision problems. In Section 4, we use two illustrative examples to present and analyze the proposed CRF model and algorithm in detail. Finally, we conclude the paper in Section 5.

Capacity and Its Identification Methods
Let N = {1, 2, . . . , n}, n ≥ 2, be the decision criteria set, P (N) the power set of N, and |S| the cardinality of subset S ⊆ N. Definition 1. [4,5,11] A capacity on N is a set function µ : P (N) → [0, 1] such that The capacity value is generally considered as the importance of criteria subset to the decision problem, and monotonicity means that any new participation of other criteria can not decrease the importance of original coalition [7,11]. Two famous characteristics that stem from the monotonicity with respect to inclusion subsets are the nonadditivity with respect to disjoint subsets and the nonmodularity with respect to any arbitrary two subsets. Nonadditivity and nonmodularity are commonly accepted as the explicit representations of interaction situations among decision criteria [22][23][24].
The following is one type of probabilistic nonadditivity index (bipartition interaction index), called the Shapley nonadditivity index [23].
More properties of nonadditivity indices can be found in [2,22,23,25]. In brief, the Shapley nonadditivity index of a single criterion can be taken as its overall importance index to the decision problem, and the Shapley nonadditivity index of non-singleton nonempty set can be regarded as its comprehensive interaction index, which reflects the interaction kind and tensity among all criteria in it.
A nonlinear or fuzzy integral is a universal notion of the aggregation functions with respect to capacity, among which the Choquet integral is one of the most widely accepted types [4,11]. Definition 3. For a given x ∈ [0, +∞] n , the discrete Choquet integral of x with respect to capacity µ on N is defined as: where x (.) is a non-decreasing permutation induced by x i , i = 1, . . . , n, i.e., x (1) ≤ . . . ≤ x (n) , and x (0) = 0 by convention. The Choquet integral can also be represented in terms of capacity without previously ordering the partial values of x as [7,26]: where the basis functions arex A = max(0, min i∈A x i − max i∈N\A x i ), ∀A ⊆ N.
The Choquet integral is an extension of weighted arithmetic mean (WAM) and ordered weight average (OWA) and has some good aggregation properties [6].
The capacity plus Choquet integral can learn and simulate the multiple criteria decision pattern, wherein the decision knowledge is stored as the capacity values of all decision subsets. The knowledge learning and fitting process is usually carried out by some optimization algorithms and models, called the capacity identification methods, which generally take history or typical decision instances and their desired overall evaluations, ranking orders, or sorting classifications as the learning or training dataset. In the following, we introduce three main capacity identification methods.
(1) The least square method In this method, the decision maker needs to provide the expected overall evaluations of all training alternatives [11]. Suppose the training set as L, and the desired overall evaluation of an alternative x ∈ L is given as y(x); then the least square method can be constructed as follows: the boundary and monotonicity conditions of capacity. (1) The objective function is to minimize the square distance between the Choquet integrals with respect to capacity and the given expected overall evaluations of all alternatives in set L. The constraints include the boundary and monotonicity conditions given in the definition of capacity; see Definition 1. It can be figured out that all the constraints are linear but the objective is a quadratic equation-quadratic programming maybe with multiple optimal solutions.
(2) The least absolute deviation method This method transforms the least square method's nonlinear model into a linear programming by introducing the goal deviation variables [15]. The model is given as: x the boundary and monotonicity condition of capacity, where L and y(x) are the same as in Equation (1), d + x , d − x ≥ 0 are the positive and negative deviation variables and d + . Basically, this model aims to minimize the absolute distance between the obtained Choquet integrals and the desired overall evaluations of all alternatives. This model is linear programming and can be easily solved by most mathematics software, like many linear programming solvers of R packages. (

3) The maximum split method
In this method, the decision maker needs to provide a partial weak order to all training alternatives. Denoted the given preference partial order on L as O(L), the maximum split method can be expressed as follows [27]: max z = the boundary and monotonicity conditions of capacity, The above model aims to maximize the distances among all neighbor alternatives in the given order O(L). It obviously uses linear programming and has some advantages in construction and solving. The main drawback is that it contradicts partial order, e.g., a b, b c, but c a will lead to infeasibility and some inconsistency checking works should be preprocessed; see [8,28,29] for more details of inconsistency recognizing and adjustment.
The above three methods are suitable for solving the alternative ranking decision. As for the sorting decision, we can transform the sorting classification results into the representative overall evaluations of each ordered category or the consistent dominance partial order of all alternatives, and then adopt the above three methods to get the satisfactory capacities; see Section 3.2 for the two strategies.

The Traditional DT and RF
The DT is a well known machine learning method for solving multiple decision attributes/ criteria-based classification and regression problems [18,19] and RF is basically an ensemble methodology of DTs to enhance the prediction correctness rate and the robustness against overfitting [20,21].
With a given learning dataset, the first step is to randomly generate a number of decision trees according to some specific algorithms [30]; then for a new instance, RF assembles all train DTs' outcomes to get a collective prediction, usually a weighted sum for the regressive case or a majority vote for classification case [21]. The collective vote scheme can overcome the low prediction rate of single DT and have a good generalization ability. For a given test dataset, also called out-of-bag (OOB) samples, the generalization error of RF is estimated by OOB error rate in most RF software.
With adequate learning instances, RF can deal with high dimensional classification and regression problems very well and have competent performance compared with other machine learning methods, such as discriminant analysis, support vector machines, and neural networks [20]. However, with little data, the DT and the RF will unavoidably run into the situation of underfitting and cannot provide acceptable and reliable performance.
The multiple criteria decision making problems usually have small learning instance datasets, which are rather inadequate to well train RF and other machine learning methods if part of dataset, usually at least one third of it, needs to be further separated as testing data. However, this kind of decision dataset has two basic characteristics: the first is monotonicity; i.e., if an alternative has larger evaluations on all criteria than another alternative, then its overall evaluation can not be smaller than another one's; the second is nonlinearity; i.e., the overall evaluation of an alternative is generally not a linear representation of its partial evaluations on decision criteria.
As mentioned previously, the capacity identification methods, essentially some optimal models, are competent at fitting this kind of multicriteria decision making pattern. Furthermore, with an importance and interaction index, such as the Shapely nonadditivity index, the relative importance of criteria and interaction situations can be depicted and explained explicitly. Therefore, the capacity plus Choquet integral scheme can help the traditional DT and RF to well deal with the challenges from the monotonicity and nonlinearity lying in the small scale of the decision dataset.

The CRF Decision Method
In this section, we combine the random forest framework with the capacity plus Choquet integral scheme to establish the capacity random forest method to better solve and explain the decision ranking and sorting problems.

The CRF Algorithm for the Ranking Decision Problem
The ranking decision dataset generally includes the partial evaluations of alternatives on all criteria and their overall evaluations or their ranking orders. As mentioned before, the least square method, Model (1), and least absolute deviation method, Model (2), are competent for learning the overall evaluation type decision data, and maximum split method is suitable for the ranking order-type decision data. Combining these capacity identification methods with the RF framework, we can have the CRF algorithm; see Algorithm 1.
In CRF Algorithm 1, the condition |S| ≤ k, k can be empirically set as 6 on account of the exponential complexity inherent in the construction of capacity; more precisely, there are 2 n − 2 coefficients involved in the capacity and Choquet integral for n decision criteria. When n ≤ 6, we can just set k = n as well.

Algorithm 1 Capacity random forest (CRF) algorithm for ranking decisions.
Input: Decision criteria set N, learning set L, the tree number t of the forest, the largest number k of criteria in a tree. Output: The capacity random forest and the prediction of new instance.
/* Training of the capacity random forest. */ for S in P (N) do if |S| ≥ 2 and |S| ≤ k then Use capacity identification method to get the optimal capacity on S, denoted as µ S . Calculate the performance of µ S , denoted as f (µ S ). end Based on performances f (µ S ), identify the appearance frequency of each tree q(µ S ) and randomly generate t trees to get the capacity random forest. end /* Predicting by the capacity random forest. */ for every new unknow instance do Calculate the outcomes of all the decision trees of given new instance. Aggregate these outcome into the collective prediction evaluation. end Get the final ranking orders of all new instances according to their collective predictions.
In CRF Algorithm 1, the performance of each CDT about S, or simply of the optimal capacity µ S , should deeply depend on the objective function value of the adopted capacity identification method, which is denoted as z µ . Since the Models (1) and (2) are able to minimize their objective functions, we can set their CDT's performance function as: where z 0 = min(η 1 , ) is the adjusted objective function value of CDT whose model's optimal objective function is of zero; η 1 is the expected best objective function value of all CDTs; η 2 is the least ratio between the minimum of nonzero objective function values of all CDTs and best objective function value. Since Model 3 aims to maximize, we can set the performance function of CDT with the positive objective function value as Then the appearance frequency of each CDT in the CRF is defined as: Remark 1. In Equation (4), z 0 is adopted to reasonably confine the appearance frequencies of CDTs with zero objective function, which is connected to the diversity of CDTs and can help to maintain the good generalization ability of CRF and avoid the case of overfitting. Empirically, we can set η 1 = 10 −4 and η 2 = 10 2 . For example, if min s∈S,z µ =0 z µ > 10 −2 , e.g., 0.02, then z 0 = min(10 −4 , 0.02/100) = 10 −4 ; if min s∈S,z µ =0 z µ < 10 −2 , e.g., 0.002, then z 0 = min(10 −4 , 0.002/100) = 2 × 10 −5 . The Equation (5) only involves the CDTs whose objective functions are positive, mainly because zero or negative objective function values mean Model (3) fails on splitting the decision alternatives and then we just abandon those incompetent CDTs.
In CRF Algorithm 1, the number of decision trees, t, should be lager than 2 k at least and in general with a scale of hundreds. The prediction of overall evaluation and rank can adopt the simple arithmetic average of the outcomes of those trees, because the training process of the capacity random forest has already considered those trees' performances through their appearance frequencies.

The CRF Algorithm for the Sorting Decision Problem
The sorting decision problem aims to classify all decision alternatives into a set of ordered classes. Since a lack of training instances likely causes the traditional RF an underfitting case, we need to transform the sorting decision into the overall evaluation or ranking order-types of decision problem.
Mathematically, supposing there are m sorted classes C = {C 1 , ..., C m }, where if x ∈ C j , x ∈ C j+1 , j = 1, ..., m − 1, then x x , we can have the following two strategies: (a) Construct a series of partial orders: and then adopt model (3) to solve this transformed ranking decision problem.
Or alternatively, (b) Set the representative overall evaluations of m classes as e 1 , ..., e m , where e j ≥ e j+1 + δ, δ > 0, j = 1, ..., m − 1, and adopt model (1) or (2) to identify the desired capacity; With the above two transformation strategies, the CRF Algorithm 1 can be applied smoothly to solve the sorting decision problem.
Strategy (a) is to transform the sorting decision problem into the ranking-order decision. In strategy (a), it is necessary to define all dominant relationships between the decision alternatives in all neighbor classes, and the threshold δ is a small positive number, e.g., 0.05 to differentiate the ordered classes. One can figure out, in strategy (a), the total number of partial order constraints depends on the number of classes and the number of elements in each class; more specially, the total number of constraints is ∑ m−1 i=1 |C i | × |C i+1 |. Some reduction methods for these types of constraints can be found in [13,27]. Another potential issue for strategy (a) is that some inconsistencies may possibly exist in these hard constraints, and in this case, the inconsistency check and adjustment methods need be applied before identifying the desired capacity; more technologies about inconsistency check and adjustment strategies can be found in [25,28,31].
Strategy (b) aims to transform the sorting decision problem into the overall evaluation decision, where δ acts the same role as in strategy (a). In strategy (b), the critical task is to find the suitable representative overall evaluation of each class, which needs the decision maker to have some background knowledge about the real decision problem and also a few rounds of trial and error. Fortunately, this task has much good freedom, and it is not difficult to obtain an acceptable result because of the inherent property of multiple criteria decision making and the characteristics of capacity identification models. On the one hand, the monotonicity or nondecreasing property of the aggregation function in decision context can ensure that the positive threshold between the desired representative evaluations of different classes can have relatively large allowable ranges; see the illustrative example in Section 4.3. On the other hand, in the corresponding capacity identification models, see Models (1) and (2), these representative values are only involved in objective function or soft goal constraints; as a result, the rigorous inconsistency checking can be omitted accordingly. Meanwhile, the number of constraints should be rather less than that of strategy (a); see the analysis in Section 4.3.
With these partial and desired overall evaluations of seven students as the training set, we ran the Algorithm 1 by adopting the least absolute deviation method, i.e., Model (2), as the capacity identification method. Table 2 shows the results of k = 2, 3, 4, 5 and t = 20, 200, 2000, 20, 000, wherein first column holds the values of k, the second column has values of t, the third column is the running time (seconds) of the algorithm on a common laptop with CPU i3-4030U 1.90 GHz and RAM 4.00 GB, the fourth to tenth columns are the prediction values of seven students by the corresponding capacity random forests, and the last column titled by "Obj" shows the sums of absolute deviations, i.e., absolute distances, between the predicted evaluates and the desired evaluations. According to the last column, one can see the 200-tree forests seem to be good enough for the random forest with any k for this example. When k = 4 and 5, the "Obj" values are all 0; the main reason for this fact is that by criteria subsets {1, 3, 4, 5} and {1, 2, 3, 4, 5}, their corresponding models' objective function values are 0, so according to Equation (4), these two subsets CDTs' appearance frequencies in CRF will be overwhelmingly superior to other CDTs and the final outcome of CRF will be mainly determined by these two superior CDTs. One can figure out that the above result was not successful at simulating the decision pattern, especially since the student g has a slight dominance over students e and f , which is somewhat unacceptable. Compared to the result in Table 2, one can conclude that the CRF is more competent than the traditional RF for fitting this decision example.
Now, we consider the partial order case decision. Suppose that the decision maker provides a partial weak order of the seven students as: which is consistent with the previously given desired overall evaluations of seven students and the partial order generated by last eight rows in Table 2. By taking data in Table 1 which is also the same ranking orders by Algorithm 1 with k = 4 and 5. Furthermore, we can get more explanations and decision information from the CRF algorithm. For example, the decision subset {1, 3, 4, 5} is basically good enough to fit the above two kinds of ranking decision problems, which is very useful if the decision maker wants to carry out a reduction on the decision criteria. The optimal capacity on {1, 3, 4, 5} and its Shapely nonadditivity index are listed in Table 3. One can see that criterion 1 has the largest importance, 0.3869; criteria 3 and 5 have similar importance (0.2560 and 0.2440); and criteria 4 has the least importance, 0.1131, in this decision process of student evaluation. As for the interaction among criteria, almost all criteria have positive interactions (see, e.g., the nonadditivity of these four criteria, 0.2755), but criteria {1, 3} have a slightly negative interaction; see its nonadditivity index of -0.0476. Actually, according to the partial order (8), we can divide the seven students into seven classes such that each class has only one student, and in this sense, the ranking decision essentially turns into a special sorting decision. See the following quality evaluation problem in the next subsection; it is a more general case and each class has more than one decision alternative.

Remark 3.
For the purpose of comparison, we also ran the traditional RF algorithm with R package "randomForest" version 4.6-14 by taking each student as a different class; i.e., take the learning result as a "factor" type datum with seven elements. Unfortunately, the results of the traditional random forest are very messy; the OOB (out of bag) estimate of the error rate was 100% with 500 decision trees and default parameter values of classification. This means the traditional RF fails in fitting this decision problem and the main reason should have been the extreme lack of training data in each class.

Quality Evaluation Example
This example was adapted from [27]. The decision maker plans to sort 20 alternatives (see Table 4) into three classes (good, medium and bad) according to eight criteria, and also gives some reference information (i.e., the training dataset) as follows: • CenConf, Loe, OM, EII belong to class good; • Zhang, ECR, PDI belong to class medium; • Sup, Surp, Kappa belong to class bad.  Hence, we can adopt the strategy (a) given in Section 3.2 to transform the above references information into partial order constraints as the alternatives in class good dominate the alternatives in class medium and those in class medium dominate those in class Bad: CenConf Zhang, CenConf ECR, CenConf PDI, · · · , PDI Kappa.
Additionally, while taking the above dominant constraints and adopting the maximum split model, Model (3), we ran CRF Algorithm 1 and found out that the subsets in Table 5 have non zero splits, which means with these subsets' CTDs, the capacity plus Choquet integral patterns, can correctly differentiate the given alternatives. For example, CDT of {3, 4, 6, 8} can split all these learning stances very well with the largest threshold 0.5; the importance and interaction indices of the criteria subsets are listed in Table 6. The prediction results of 20 alternatives of CRF with different levels k from 4 to 8 and t = 200 are given in Table 7. According to the results of k = 8, we give the following classification rules: the alternative is good if its prediction value is larger than 0.9; medium if less than 0.9 and larger than 0.45; bad if less than 0.45; see the last second column "CRF", which stands for "capacity random forest", in Table 7 for their predicted classes.

Remark 4.
For the purpose of comparison, we ran the traditional random forest by R with "randomForest" version 4.6-14 for the quality evaluation example. With 500 DTs and other parameters as defaults, the RF's OOB estimate of classification error rate was 50%. The prediction results of all alternatives are given in last column "TRF", which stands for "traditional random forest", in Table 7, which are similar to the results of CRF; see the second to last column in Table 7, and italics in last column show two different results between CRF and TRF; see third and ninth rows. Besides, we should point out that the running times of CRF, Algorithm 1, with k = 4, 5, 6, 7, 8 in Table 7, were 235.45, 703.91, 1512.15, 2346.31, and 2716.71 s, respectively (still by the laptop with CPU i3-4030U 1.90 GHz and RAM 4.00 GB). That means the efficiency of CRF is not comparable with that of traditional RF. The main reason for such a time consuming situation is the exponential complexity associated with capacity, or equally the construction process of each CDT.

Further Discussions on Targeted Evaluations of Ordered Classes for Sorting Decisions
Previously, we have applied strategy (a) mentioned in Section 3.2 to transform the learning datasets of two decision examples into ranking order-types of data; see Equations (8) and (10). Next, we further discuss the application of the strategy (b) and mainly investigate how to set the desired evaluations of the ordered classes.
A basic fact or axiom in multiple criteria decision analysis is that if alternative x is no less than alternative y on every criterion, then the overall evaluation (e.g., the Choquet integral value) of x should be no less than that of y, which is also called the nondecreasing or monotonicity property of aggregation function. From this point of view, we can empirically select a series of different numbers with a certain threshold from the common range of alternatives, and then sort them in an ascending/descending order as the targeted evaluations of the ordered classes.
For example, we can take the target values in Equation (7) and those in Equation (9) as the values of seven students desiring to run the CRF algorithm with the least square method or least absolute deviation method. Additionally, it can be verified that these target values have relatively large freedoms: if we divide the seven students into three groups as {a, b}, {c, d, e} and { f , g} and take {18, 14, 11} as their targeted evaluations, the capacity of the random forest can fit this ranking problem with zero deviation, so do well with targeted evaluation of {16, 14, 12} or even {18, 17, 16}.
Furthermore, for the quality evaluation of 20 alternatives, we give three cases with {1, 0.5, 0}, {2, 1, 0} and {0.5, 0.3, 0.1} as their targeted evaluations; the corresponding results by the capacity random forest algorithm are given in Table 8, wherein first three rows of the "Intervals" columns are the allowable values that belong to the three classes; e.g., the element of first row and third column, [0.6, 1], means if the prediction value falls in this interval, the alternative will belong to "good." Additionally, these intervals are also given empirically. The italic words in Table 8 mean the infliction with result of the last column in Table 7.
One can figure out that the strategy (b) has a rather equivalent performance to strategy (a), which greatly comes from the flexibility and nonlinearity of capacity and Choquet integral decision pattern. Finally, we exemplify that strategy (b) has some advantages over strategy (a) in this application:

•
It needs relatively fewer constraints, e.g., 7 constraints of strategy (b) comparing with the 4 × 3 + 3 × 3 = 21 constraints of strategy (a) for the quality evaluation problem; • Strategy (b) basically uses the goal constraints which are soft equations and does not further concern the inconsistency among these constraints or the infeasibility issue in application; however, the strategy (a) has to face the strict constraints and carefully deals with the unavoidable contradiction among them (some inconsistency checks and adjustment strategies can be found in references [8,25,32]).

Conclusions
In this paper, we combined the traditional RF framework with the capacity and Choquet integral decision model to develop the CRF scheme, in which the capacity plus Choquet integral is taken as CDT while using the capacity values to store the decision knowledge and the capacity identification or fitting methods as the suitable knowledge learning and model training tools. With the CRF algorithm, more decision aids and explanation information, such as the fitting abilities of decision criteria subsets, the proper criteria reduction suggestion, and the importance and interactions situation among specific criteria subsets, can be obtained accordingly. Meanwhile, CRF has good generalization ability and its final prediction result becomes relatively persuadable and confidential as a collective vote outcome. It was proven that, compared to traditional RF, the CRF is competent to deal with the small scale of a decision learning dataset because of the flexibility and nonlinearity inherent in the capacity and Choquet integral scheme.
It should be admitted that if with an adequate or large amount of decision learning data, especially for a sorting decision problem with a small number of classifications, the traditional DT and RF can have much better performance than CDT and CRF. The exponential complexity inherent in the structure of capacity will limit its efficiency in dealing with a large scale dataset. Hence, in the future, we plan to adopt more special families of capacities, e.g., the k-additive capacity [33], the k-maxitive and minitive capacity [26,34], the k-interactive capacity [35], and the k-order representative capacity [36], to reduce the construction complexity, and also we will enrich the types of CDTs and CRFs to enhance their fitting ability and flexibility by adopting more forms of nonlinear or fuzzy integrals, such as the Sugeno integral [5], the pan integral [37], and the inclusion-exclusion integral [38].