1. Introduction
The multiple criteria involved in most decision or evaluation problems are usually not independent, and a variety of interactions, from negative to positive, can exist among them [1,2,3]. The capacity [4], also called the fuzzy measure [5], combined with the nonlinear or fuzzy integral [6], especially its most commonly accepted type, the Choquet integral [4], can effectively deal with complex interaction situations among interdependent criteria and aggregate the multiple-source partial preferences of decision alternatives into their overall assessments, from which the decision maker can derive the final ranking orders or sorting classifications [7,8].
Given adequate decision instances, generally not a very large amount of data, together with their overall evaluations, ranking orders, or classifications as the learning set, the scheme of capacity plus Choquet integral can learn, simulate, and explain the correlative multiple criteria decision pattern, wherein the decision and explanation knowledge is stored over the power set of decision criteria as the capacity values of all criteria subsets. This fitting or training process is generally carried out by linear or nonlinear optimization models [2,9,10], called capacity identification methods [11], among which the least squares method [12,13,14], the least absolute deviation method [15,16], and the maximum split method [13,17] are the three most widely adopted ones.
However, a capacity identification method supposes that the decision criteria set is well predetermined by the decision maker and kept fixed during the whole learning and fitting process. Actually, the determination of decision criteria is a vital precondition of high quality decision making and also a time-consuming task for the decision maker. Furthermore, a fixed decision criteria set definitely confines the model's ability to explore other criteria subsets that may enable better performance in learning and explaining the decision pattern behind the training data. Additionally, with a static decision criteria set, the model has no means of avoiding the overfitting problem and lacks adequate mutual comparison and validation against other optional criteria sets. Hence, the traditional capacity identification method cannot always ensure competent generalization ability on new decision instances, and there is also a lack of proof and evidence to verify the reliability and rationality of the prediction result.
The decision tree (DT) [18,19] and random forest (RF) [20,21] framework can help the Choquet capacity and integral decision scheme improve its abilities of dynamic explanation and reliable prediction. RF is basically an ensemble of a number of DTs for learning and solving classification and regression problems. The outcome of RF is the mean or weighted sum of all selected DTs in the regression case or their majority vote in the classification case [21]. Different DTs provide different learning and explanation perspectives on the given pattern, and the collective outcome of all DTs, weighted by their performance, greatly increases the credibility and generalization ability of RF's predictions.
RF's classification and regression performance depends heavily on the given training data. For classification, given a lot of training data and a small number of desired output categories, RF usually performs competently; but if there are not enough data for each category, as in most ranking decision problems in which each category has only one instance or record, it is hard for RF to achieve acceptable performance. For regression, since interactions exist among decision criteria, the multicriteria decision is basically a nonlinear problem, which is a real challenge for RF to learn and simulate using linear regression models. Furthermore, RF and DT can neither explicitly explain the interaction phenomena among decision criteria nor specifically express the decision maker's additional judgments and preferences about decision criteria and alternatives.
Hence, it is necessary to combine the capacity plus Choquet integral with the DT and RF to better learn and explain the correlative multicriteria decision pattern. With the given decision data and potential decision criteria, a suitable capacity identification method can be used to generate many capacity-based DTs (CDTs), and then a capacity-based RF (CRF) can be established by taking those trees' performances into account; see Algorithm 1 for more details.
In summary, there are two direct advantages of CRF. On the one hand, CRF can provide more dynamic explanations and insights into the multicriteria decision pattern; on the other hand, CRF has better generalization ability and gives a more credible prediction result through a collective vote of all CDTs. That is, CRF enriches the explanation capability of traditional capacity identification methods and meanwhile extends the application range of traditional RF to sorting and ranking multicriteria decision making with relatively small-scale training datasets.
This paper is organized as follows. After the introduction, we briefly present background knowledge of the capacity, Choquet integral, and capacity identification methods, as well as of DT and RF, in Section 2. In Section 3, we discuss the basic steps of RF and construct the CRF-based decision algorithm and the strategies for transforming between ranking and sorting decision problems. In Section 4, we use two illustrative examples to present and analyze the proposed CRF model and algorithm in detail. Finally, we conclude the paper in Section 5.
2. Preliminaries
2.1. Capacity and Its Identification Methods
Let $N = \{1, 2, \ldots, n\}$ be the decision criteria set, $\mathcal{P}(N)$ the power set of $N$, and $|A|$ the cardinality of subset $A \subseteq N$.
Definition 1. [4,5,11] A capacity on $N$ is a set function $\mu: \mathcal{P}(N) \rightarrow [0, 1]$ such that
- (i) $\mu(\emptyset) = 0$, $\mu(N) = 1$ (boundary condition);
- (ii) $A \subseteq B \subseteq N$ implies $\mu(A) \leq \mu(B)$ (monotonicity condition).
The capacity value $\mu(A)$ is generally considered as the importance of criteria subset $A$ to the decision problem, and monotonicity means that the participation of new criteria cannot decrease the importance of the original coalition [7,11]. Two famous characteristics that stem from the monotonicity with respect to inclusion of subsets are nonadditivity with respect to disjoint subsets and nonmodularity with respect to arbitrary pairs of subsets. Nonadditivity and nonmodularity are commonly accepted as explicit representations of the interaction situations among decision criteria [22,23,24].
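To make Definition 1 concrete, the following is a minimal sketch, not taken from the paper, of how a capacity on a small criteria set can be stored as a Python dict over frozensets and checked against the boundary and monotonicity conditions:

```python
# A minimal sketch (not from the paper): a capacity on N = {1, 2} stored as
# a dict over frozensets, checked against Definition 1.
from itertools import combinations

def power_set(criteria):
    """All subsets of the criteria set, as frozensets."""
    return [frozenset(c) for r in range(len(criteria) + 1)
            for c in combinations(criteria, r)]

def is_capacity(mu, criteria):
    """Check the boundary and monotonicity conditions of Definition 1."""
    if mu[frozenset()] != 0.0 or mu[frozenset(criteria)] != 1.0:
        return False                      # boundary condition violated
    subsets = power_set(criteria)
    return all(mu[a] <= mu[b]             # monotonicity: A <= B => mu(A) <= mu(B)
               for a in subsets for b in subsets if a <= b)

# A superadditive capacity expressing a positive interaction between 1 and 2:
# mu({1, 2}) > mu({1}) + mu({2}).
mu = {frozenset(): 0.0, frozenset({1}): 0.3,
      frozenset({2}): 0.4, frozenset({1, 2}): 1.0}
print(is_capacity(mu, {1, 2}))            # True
```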
The following is one type of probabilistic nonadditivity index (bipartition interaction index), called the Shapley nonadditivity index [23].
Definition 2. The Shapley nonadditivity index of a subset $A \subseteq N$ is defined in terms of $Q(A) = \{\{B, A \setminus B\} \mid \emptyset \neq B \subsetneq A\}$, which is called the set of proper bipartitions of $A$. More properties of nonadditivity indices can be found in [2,22,23,25]. In brief, the Shapley nonadditivity index of a single criterion can be taken as its overall importance index for the decision problem, and the Shapley nonadditivity index of a non-singleton nonempty set can be regarded as its comprehensive interaction index, which reflects the kind and intensity of interaction among all criteria in it.
A nonlinear or fuzzy integral is a universal notion of aggregation function with respect to a capacity, among which the Choquet integral is one of the most widely accepted types [4,11].
Definition 3. For a given $x = (x_1, \ldots, x_n) \in [0, 1]^n$, the discrete Choquet integral of $x$ with respect to capacity $\mu$ on $N$ is defined as:

$C_\mu(x) = \sum_{i=1}^{n} \left(x_{(i)} - x_{(i-1)}\right) \mu(A_{(i)}),$

where $(\cdot)$ is a non-decreasing permutation induced by $x$, i.e., $x_{(1)} \leq \cdots \leq x_{(n)}$, $A_{(i)} = \{(i), \ldots, (n)\}$, and $x_{(0)} = 0$ by convention. The Choquet integral can also be represented in terms of the capacity without previously ordering the partial values of $x$ as [7,26]:

$C_\mu(x) = \sum_{A \subseteq N} \mu(A) \, b_A(x),$

where the basis functions are $b_A(x) = \max\left(0, \min_{i \in A} x_i - \max_{i \in N \setminus A} x_i\right)$, with $\max_{i \in \emptyset} x_i = 0$ by convention. The Choquet integral is an extension of the weighted arithmetic mean (WAM) and the ordered weighted average (OWA) and has some good aggregation properties [6].
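As an illustration, here is a minimal sketch, not the authors' code, of the first formula in Definition 3, with the capacity stored as a dict over frozensets as in the previous sketch:

```python
# A minimal sketch of the discrete Choquet integral of Definition 3;
# x maps each criterion to its partial evaluation, mu is a capacity dict.
def choquet(x, mu):
    """Choquet integral of the partial evaluations x w.r.t. capacity mu."""
    order = sorted(x, key=x.get)          # non-decreasing permutation of x
    total, prev = 0.0, 0.0                # x_(0) = 0 by convention
    for i, c in enumerate(order):
        coalition = frozenset(order[i:])  # A_(i) = {(i), ..., (n)}
        total += (x[c] - prev) * mu[coalition]
        prev = x[c]
    return total

# With the capacity mu from the previous sketch:
# 0.6 * mu({1, 2}) + (0.8 - 0.6) * mu({2}) = 0.6 + 0.08 = 0.68
print(choquet({1: 0.6, 2: 0.8}, mu))
```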
The capacity plus Choquet integral can learn and simulate the multiple criteria decision pattern, wherein the decision knowledge is stored as the capacity values of all criteria subsets. The knowledge learning and fitting process is usually carried out by optimization models, called capacity identification methods, which generally take historical or typical decision instances and their desired overall evaluations, ranking orders, or sorting classifications as the learning or training dataset. In the following, we introduce three main capacity identification methods.
(1) The least squares method
In this method, the decision maker needs to provide the expected overall evaluations of all training alternatives [11]. Denote the training set by $L$ and the desired overall evaluation of an alternative $x \in L$ by $y_x$; then the least squares method can be constructed as follows:

$\min \sum_{x \in L} \left(C_\mu(x) - y_x\right)^2 \quad \text{s.t. the boundary and monotonicity conditions of Definition 1,} \qquad (1)$

The objective function minimizes the squared distance between the Choquet integrals with respect to the capacity and the given expected overall evaluations of all alternatives in set $L$. The constraints comprise the boundary and monotonicity conditions given in the definition of capacity; see Definition 1. All the constraints are linear, but the objective is quadratic, so the model is a quadratic program, possibly with multiple optimal solutions.
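The following is a hedged sketch of Model (1) on a three-criteria toy dataset, assuming the free variables are the capacity values of all nonempty proper subsets ($\mu(\emptyset) = 0$ and $\mu(N) = 1$ are fixed) and exploiting the fact that, for a fixed ordering of $x$, the Choquet integral is linear in those values; the toy evaluations are invented for illustration:

```python
# A hedged sketch of the least squares identification model (1); the toy
# training set L is invented for illustration.
import numpy as np
from itertools import combinations
from scipy.optimize import minimize

criteria = (1, 2, 3)
subsets = [frozenset(c) for r in range(1, len(criteria))
           for c in combinations(criteria, r)]   # free capacity values
index = {s: i for i, s in enumerate(subsets)}
N = frozenset(criteria)

def mu_of(v, s):
    """Capacity value of subset s under the variable vector v."""
    return 0.0 if not s else 1.0 if s == N else v[index[s]]

def choquet_row(x):
    """Choquet integral of x as a linear function of v: row . v + const."""
    order = sorted(x, key=x.get)
    row, const, prev = np.zeros(len(subsets)), 0.0, 0.0
    for i, c in enumerate(order):
        coal, gap = frozenset(order[i:]), x[c] - prev
        if coal == N:
            const += gap                          # mu(N) = 1 is fixed
        else:
            row[index[coal]] += gap
        prev = x[c]
    return row, const

# Toy training set: partial evaluations and desired overall evaluations y_x.
L = [({1: .9, 2: .4, 3: .5}, .55), ({1: .3, 2: .8, 3: .6}, .60),
     ({1: .7, 2: .7, 3: .2}, .50), ({1: .5, 2: .5, 3: .9}, .65)]
rows, consts, ys = zip(*[(*choquet_row(x), yx) for x, yx in L])
A, c0, y = np.array(rows), np.array(consts), np.array(ys)

def sse(v):                                       # objective of Model (1)
    return np.sum((A @ v + c0 - y) ** 2)

# Monotonicity: mu(S) <= mu(S + {j}) for every S and every j not in S.
cons = [{'type': 'ineq',
         'fun': lambda v, a=s, b=(s | {j}): mu_of(v, b) - mu_of(v, a)}
        for s in [frozenset()] + subsets for j in criteria if j not in s]

res = minimize(sse, np.full(len(subsets), 0.5), constraints=cons)
mu_hat = {tuple(sorted(s)): round(mu_of(res.x, s), 3) for s in subsets}
print(res.fun, mu_hat)
```

Because the objective is quadratic and the constraints are linear, a dedicated quadratic programming solver could equally be used.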
(2) The least absolute deviation method
This method transforms the nonlinear least squares model into a linear program by introducing goal deviation variables [15]. The model is given as:

$\min \sum_{x \in L} \left(d_x^{+} + d_x^{-}\right) \quad \text{s.t.} \quad C_\mu(x) - y_x = d_x^{+} - d_x^{-}, \; d_x^{+}, d_x^{-} \geq 0, \; \forall x \in L, \text{ and the conditions of Definition 1,} \qquad (2)$

where $L$ and $y_x$ are the same as in Equation (1), and $d_x^{+}, d_x^{-}$ are the positive and negative deviation variables, of which the minimization implies $d_x^{+} d_x^{-} = 0$ at optimality. Basically, this model minimizes the absolute distance between the obtained Choquet integrals and the desired overall evaluations of all alternatives. The model is a linear program and can be easily solved by most mathematical software, such as the many linear programming solvers available as R packages.
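A hedged sketch of Model (2) with SciPy's `linprog`, reusing `criteria`, `subsets`, `index`, `N`, `choquet_row`, and the toy arrays `A`, `c0`, `y` from the least squares sketch; the variable vector stacks the free capacity values with the deviation variables:

```python
# A hedged sketch of the least absolute deviation model (2) as a linear
# program; the variable vector is z = [v (capacity values) | d+ | d-].
from scipy.optimize import linprog

m, nv = len(L), len(subsets)
obj = np.concatenate([np.zeros(nv), np.ones(2 * m)])   # min sum(d+ + d-)
A_eq = np.hstack([A, -np.eye(m), np.eye(m)])           # C(x) - d+ + d- = y_x
b_eq = y - c0

# Monotonicity mu(S) <= mu(S + {j}) rewritten as A_ub z <= b_ub.
ub_rows, b_ub = [], []
for s in subsets:
    for j in criteria:
        if j in s:
            continue
        row, t = np.zeros(nv + 2 * m), s | {j}
        row[index[s]] = 1.0                            # +mu(S)
        if t == N:
            b_ub.append(1.0)                           # mu(S) <= mu(N) = 1
        else:
            row[index[t]] = -1.0                       # -mu(S + {j})
            b_ub.append(0.0)
        ub_rows.append(row)

res = linprog(obj, A_ub=np.array(ub_rows), b_ub=np.array(b_ub),
              A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * nv + [(0, None)] * (2 * m))
print(res.fun, res.x[:nv].round(3))                    # fitted capacity values
```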
(3) The maximum split method
In this method, the decision maker needs to provide a partial weak order over all training alternatives. Denoting the given preference partial order on $L$ as $\succeq$, the maximum split method can be expressed as follows [27]:

$\max \; z \quad \text{s.t.} \quad C_\mu(x) - C_\mu(x') \geq z, \; \forall x \succ x', \text{ and the conditions of Definition 1,} \qquad (3)$

The above model maximizes the distances among all neighboring alternatives in the given order $\succeq$. It is a linear program and has some advantages in construction and solving. The main drawback is that a contradictory partial order, e.g., $a \succ b$ and $b \succ c$ but $c \succ a$, will lead to infeasibility, so some inconsistency checking should be carried out as preprocessing; see [8,28,29] for more details of inconsistency recognition and adjustment.
The above three methods are suitable for solving the alternative ranking decision. As for the sorting decision, we can transform the sorting classification results into representative overall evaluations of each ordered category or into a consistent dominance partial order of all alternatives, and then adopt the above three methods to get satisfactory capacities; see Section 3.2 for the two strategies.
2.2. The Traditional DT and RF
The DT is a well known machine learning method for solving classification and regression problems over multiple decision attributes/criteria [18,19], and RF is basically an ensemble methodology of DTs that enhances the prediction correctness rate and the robustness against overfitting [20,21].
With a given learning dataset, the first step is to randomly generate a number of decision trees according to some specific algorithms [30]; then, for a new instance, RF assembles all trained DTs' outcomes to get a collective prediction, usually a weighted sum in the regression case or a majority vote in the classification case [21]. The collective vote scheme can overcome the low prediction rate of a single DT and has good generalization ability. For a given test dataset, also called the out-of-bag (OOB) samples, the generalization error of RF is estimated by the OOB error rate in most RF software.
With adequate learning instances, RF can deal with high dimensional classification and regression problems very well and performs competitively with other machine learning methods, such as discriminant analysis, support vector machines, and neural networks [20]. However, with little data, the DT and the RF will unavoidably run into underfitting and cannot provide acceptable and reliable performance.
Multiple criteria decision making problems usually have small learning datasets, which are rather inadequate for training RF and other machine learning methods well, especially if part of the dataset, usually at least one third of it, needs to be further separated out as testing data. However, this kind of decision dataset has two basic characteristics. The first is monotonicity; i.e., if an alternative has larger evaluations on all criteria than another alternative, then its overall evaluation cannot be smaller than the other's. The second is nonlinearity; i.e., the overall evaluation of an alternative is generally not a linear function of its partial evaluations on the decision criteria.
As mentioned previously, the capacity identification methods, essentially optimization models, are competent at fitting this kind of multicriteria decision making pattern. Furthermore, with an importance and interaction index, such as the Shapley nonadditivity index, the relative importance of criteria and the interaction situations can be depicted and explained explicitly. Therefore, the capacity plus Choquet integral scheme can help the traditional DT and RF deal well with the challenges of monotonicity and nonlinearity in small scale decision datasets.
3. The CRF Decision Method
In this section, we combine the random forest framework with the capacity plus Choquet integral scheme to establish the capacity random forest method to better solve and explain the decision ranking and sorting problems.
3.1. The CRF Algorithm for the Ranking Decision Problem
The ranking decision dataset generally includes the partial evaluations of alternatives on all criteria and their overall evaluations or ranking orders. As mentioned before, the least squares method, Model (1), and the least absolute deviation method, Model (2), are competent for learning overall evaluation-type decision data, and the maximum split method, Model (3), is suitable for ranking order-type decision data. Combining these capacity identification methods with the RF framework, we obtain the CRF algorithm; see Algorithm 1.
In CRF Algorithm 1, in the condition $|S| \leq k$ on the randomly selected criteria subset $S$, $k$ can be empirically set as 6 on account of the exponential complexity inherent in the construction of a capacity; more precisely, there are $2^n$ coefficients involved in the capacity and Choquet integral for $n$ decision criteria. When $n \leq k$, we can just set $S = N$ as well.
Algorithm 1: Capacity random forest (CRF) algorithm for ranking decisions.
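Since the pseudocode box of Algorithm 1 does not reproduce well here, the following is only a schematic sketch of the loop it describes, under stated assumptions: `identify_capacity` is a hypothetical helper that solves one of Models (1)–(3) on a criteria subset and returns the optimal capacity and objective value, `perf` stands in for the performance functions of Equations (4) and (5), and `choquet` is the integral sketched in Section 2:

```python
# A schematic sketch of the CRF training loop; identify_capacity and perf
# are hypothetical helpers standing in for Models (1)-(3) and Equations
# (4)/(5), and choquet is the Choquet integral sketched earlier.
import random

def train_crf(data, criteria, t=300, k=6):
    """Grow t capacity decision trees (CDTs) on random criteria subsets."""
    forest = []
    for _ in range(t):
        size = random.randint(2, min(k, len(criteria)))
        S = frozenset(random.sample(sorted(criteria), size))
        mu_S, obj_S = identify_capacity(data, S)  # one CDT: capacity on S
        p = perf(obj_S)                           # CDT performance
        if p > 0:                                 # abandon incompetent CDTs
            forest.append((S, mu_S, p))
    total = sum(p for _, _, p in forest)
    # appearance frequency of each CDT, proportional to its performance
    return [(S, mu_S, p / total) for S, mu_S, p in forest]

def predict(forest, x):
    """Frequency-weighted collective vote of all CDTs on a new instance x."""
    return sum(f * choquet({i: x[i] for i in S}, mu_S)
               for S, mu_S, f in forest)
```

Weighting by appearance frequency here is equivalent, in expectation, to replicating each CDT according to its frequency and then taking a simple arithmetic average, as the text below describes.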
In CRF Algorithm 1, the performance of each CDT on its criteria subset $S$, or simply of the optimal capacity $\mu_S$, deeply depends on the objective function value of the adopted capacity identification method, denoted as $obj_S$. Since Models (1) and (2) aim to minimize their objective functions, we can set their CDTs' performance function, Equation (4), in terms of three quantities: the adjusted objective function value of a CDT whose model's optimal objective function value is zero; the expected best objective function value of all CDTs; and the least ratio between the minimum of the nonzero objective function values of all CDTs and the best objective function value. Since Model (3) aims to maximize, we can set the performance function of a CDT with a positive objective function value accordingly, Equation (5). The appearance frequency of each CDT in the CRF is then defined as its performance normalized over all generated CDTs.
Remark 1. In Equation (4), the adjusted objective function value is adopted to reasonably confine the appearance frequencies of CDTs with zero objective function values, which is connected to the diversity of CDTs and helps maintain the good generalization ability of CRF and avoid overfitting. Equation (5) only involves the CDTs whose objective function values are positive, mainly because a zero or negative objective function value means Model (3) fails to split the decision alternatives, and we simply abandon those incompetent CDTs. In CRF Algorithm 1, the number of decision trees, $t$, should in general be on the scale of hundreds. The prediction of the overall evaluation and rank can adopt the simple arithmetic average of the outcomes of those trees, because the training process of the capacity random forest has already considered the trees' performances through their appearance frequencies.
3.2. The CRF Algorithm for the Sorting Decision Problem
The sorting decision problem aims to classify all decision alternatives into a set of ordered classes. Since a lack of training instances is likely to cause underfitting in the traditional RF, we transform the sorting decision into the overall evaluation or ranking order types of decision problems.
Mathematically, supposing there are $m$ sorted classes $C_1, \ldots, C_m$, ordered such that if $1 \leq i < j \leq m$, then every alternative in class $C_j$ is preferred to every alternative in class $C_i$, we can adopt the following two strategies:
- (a) Construct a series of partial orders between the alternatives of neighboring classes, $x \succ x'$ for all $x \in C_{i+1}$ and $x' \in C_i$, $i = 1, \ldots, m - 1$, and then adopt Model (3) to solve the transformed ranking decision problem.
Or alternatively,
- (b) Set the representative overall evaluations of the $m$ classes as $y_1 < y_2 < \cdots < y_m$, with neighboring values separated by at least the threshold $\delta$, and adopt Model (1) or (2) to identify the desired capacity.
With the above two transformation strategies, the CRF Algorithm 1 can be applied smoothly to solve the sorting decision problem.
Strategy (a) transforms the sorting decision problem into a ranking order decision. In strategy (a), it is necessary to define all dominance relationships between the decision alternatives in all neighboring classes, and the threshold $\delta$ is a small positive number, e.g., 0.05, used to differentiate the ordered classes. One can figure out that, in strategy (a), the total number of partial order constraints depends on the number of classes and the number of elements in each class; more specifically, the total number of constraints is $\sum_{i=1}^{m-1} |C_i| \cdot |C_{i+1}|$. Some reduction methods for these types of constraints can be found in [13,27]. Another potential issue for strategy (a) is that some inconsistencies may exist in these hard constraints, in which case inconsistency check and adjustment methods need to be applied before identifying the desired capacity; more techniques for inconsistency checking and adjustment can be found in [25,28,31].
Strategy (b) transforms the sorting decision problem into an overall evaluation decision, where $\delta$ plays the same role as in strategy (a). In strategy (b), the critical task is to find a suitable representative overall evaluation for each class, which requires the decision maker to have some background knowledge about the real decision problem and also a few rounds of trial and error. Fortunately, this task allows considerable freedom, and it is not difficult to obtain an acceptable result because of the inherent properties of multiple criteria decision making and the characteristics of the capacity identification models. On the one hand, the monotonicity or nondecreasing property of the aggregation function in the decision context ensures that the positive threshold between the desired representative evaluations of different classes can have relatively large allowable ranges; see the illustrative example in Section 4.3. On the other hand, in the corresponding capacity identification models, Models (1) and (2), these representative values are only involved in the objective function or soft goal constraints; as a result, the rigorous inconsistency checking can be omitted. Meanwhile, the number of constraints is rather smaller than that of strategy (a); see the analysis in Section 4.3.
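To make the two transformations concrete, here is a minimal sketch, with invented class data, of generating the Model (3) order constraints of strategy (a) and the representative evaluations of strategy (b):

```python
# A minimal sketch of the two sorting-to-ranking transformations; classes[i]
# lists the alternatives of ordered class C_{i+1}, from worst to best class.
def strategy_a(classes):
    """(a) Pairwise order constraints between neighboring classes for Model (3)."""
    return [(x, xp) for lo, hi in zip(classes, classes[1:])
            for x in hi for xp in lo]      # x is preferred to xp

def strategy_b(classes, delta=0.05, base=0.2):
    """(b) Representative overall evaluations per class for Models (1)/(2)."""
    return {x: base + i * delta            # any values increasing by >= delta
            for i, cls in enumerate(classes) for x in cls}

classes = [['a1', 'a2'], ['a3'], ['a4', 'a5']]   # C_1 < C_2 < C_3
print(len(strategy_a(classes)))   # sum of |C_i| * |C_{i+1}| = 2*1 + 1*2 = 4
print(strategy_b(classes))
```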
5. Conclusions
In this paper, we combined the traditional RF framework with the capacity and Choquet integral decision model to develop the CRF scheme, in which the capacity plus Choquet integral is taken as the CDT, with the capacity values storing the decision knowledge and the capacity identification or fitting methods serving as the knowledge learning and model training tools. With the CRF algorithm, more decision aids and explanation information, such as the fitting abilities of decision criteria subsets, proper criteria reduction suggestions, and the importance and interaction situations among specific criteria subsets, can be obtained accordingly. Meanwhile, CRF has good generalization ability, and its final prediction result, as a collective vote outcome, is relatively persuasive and credible. It was shown that, compared to traditional RF, CRF is competent to deal with small scale decision learning datasets because of the flexibility and nonlinearity inherent in the capacity and Choquet integral scheme.
It should be admitted that, given an adequate or large amount of decision learning data, especially for a sorting decision problem with a small number of classes, the traditional DT and RF can have much better performance than CDT and CRF. The exponential complexity inherent in the structure of the capacity limits its efficiency in dealing with large scale datasets. Hence, in the future, we plan to adopt more special families of capacities, e.g., the k-additive capacity [33], the k-maxitive and k-minitive capacities [26,34], the k-interactive capacity [35], and the k-order representative capacity [36], to reduce the construction complexity; we will also enrich the types of CDTs and CRFs to enhance their fitting ability and flexibility by adopting more forms of nonlinear or fuzzy integrals, such as the Sugeno integral [5], the pan integral [37], and the inclusion-exclusion integral [38].