One Aggregated Approach in Multidisciplinary Based Modeling to Predict Further Students’ Education

In this paper, a multidisciplinary, aggregated model is proposed and verified. Throughout the process of determining the relevance of model attributes for solving multicriteria decision problems, this model uses traditional techniques on the one hand and machine learning algorithms as modern techniques on the other. The main goal of this model is to take advantage of both approaches and obtain better results than when either technique is used alone. In addition, the proposed model uses feature selection to reduce the number of attributes, thus increasing the accuracy of the model. We have used the traditional method of regression analysis combined with the well-known mathematical method Analytic Hierarchy Process (AHP). This approach has been combined with the application of the ReliefF ranking method of machine learning. Last but not least, the decision tree classifier J48 has been used for aggregation purposes. Information on the grades of first-year graduate students at the Academy of Criminalistics and Police Studies, Belgrade, after they chose and finished one of the three possible study modules, was used for the evaluation of the proposed model.


Introduction
In order to clearly present the content of the research covered in this paper, the introduction is divided into two subchapters: 'Background', where the problem, its importance and the main goal of the solution are presented, and 'Related Studies', where the current state of research in the field related to the subject of this paper, as well as the key publications, are given.
The rest of this paper is organized as follows. Section 1, 'Introduction', gives a description of the considered problem, the background, the objectives and the existing research gap, as well as our contribution and the organization of the paper. It also gives the authors' review of the state-of-the-art literature dealing with the topic of interest, together with the authors' motivation to work on this paper. After that, we present Section 2, 'Materials and Methods', the part of the paper that introduces the experiment.
The presence of irrelevant and redundant attributes harms the performance of supervised learning; hence, the optimum set of attributes for learning consists of fairly relevant, nonredundant attributes and strongly relevant attributes. Given the need to analyze attribute selection methods, the subject of the research in this paper is the possibility of applying regression analysis, as a classical statistical technique, and AHP, as a method of operational research, in the process of supervised learning using the modern ReliefF ranking method. A subset of attributes is determined based on accuracy estimation by applying decision tree classifiers after applying wrapper methods. The WEKA [7] data mining tool was used for the application of the selected classifiers.
Evaluation of the proposed methodology was conducted using a case study predicting the choice of one of three possible modules for further study based on the grades in the subjects of the joint first year of study. Information on the grades of first-year graduate students at the Academy of Criminalistics and Police Studies in Belgrade was used as the dataset in this paper. Modules are presented as the attribute set. Analysis of the aforementioned sets should determine the prediction of the further orientation of students after the first year of studies. For each orientation, the relevance of the given attributes was determined by applying traditional and smart methods to obtain an optimal subset that is relevant for students' further orientation. We use two traditional methods in our proposed model, AHP and regression analysis, because we propose aggregation as the methodology in general, and because we first aggregate these two methods, which, as is known, belong to two different types of methodology, i.e., subjective and objective. A comparison of the accuracy of the prediction estimates is presented in the paper. It was performed by applying decision tree classifiers using wrapper methods and a set of attributes acquired by applying the ReliefF classifier. Attributes whose combination yields a satisfactory prediction model were extracted by the method of negative elimination; thus, these models were represented by the relevant attributes present in both sets. Thereby, this paper presents an algorithm for extracting a minimal set of attributes that represent the data relevant for further orientation in this case study.

Related Studies
The use of statistical regression methods to predict student achievement based on their success in graduation or entrance exams can be found in the literature [8,9]. Moreover, statistical methods are used to determine the parameters for assessing whether a student will withdraw from their studies or continue, based on their first results [10]. Regression models for predicting student academic performance in an engineering dynamics course are considered in [11]. Another paper [12] deals with academic progress using models based on linear and logistic regression, employing prior success and demographic factors as predictors.
The multi-criteria AHP method is used in higher education for the selection of candidates for teaching positions [13], where the AHP method is used to determine joint action as well as the priority of individual selection criteria. The selection of doctoral studies, depending on the goals that an applicant wants in their career, was done with the help of AHP [14]. Perspectives are set as a pseudo-level in the hierarchy, and for each of them, doctoral programs are offered and selected by the candidates depending on their preferences. The paper [15] presents a knowledge management system that recommends an orientation program to students who need to choose one, and therefore helps them be most efficient in their further studies. In the paper [16], comparative analyses of DEA and AHP were performed, as well as their aggregation according to the importance of the first-year study subjects for selecting an appropriate study program, on the example of the Faculty of Organizational Sciences in Belgrade. The significance of the weights of individual subjects was used to rank the benefits of study programs for a particular student, based on the success achieved in subjects that are common to all students in the first year of study. Moreover, very interesting papers are [17,18], in which the effects of a first-year course on student retention and the prediction of student success based on the achieved results are considered, respectively. The time dependence of predictions and an ANOVA implementation in fuzzy AHP are discussed in [19].
Machine learning algorithms [20,21] and data mining techniques [22,23] are considered in overviews of student performance prediction modeling for further education in both pairs of these papers. The application of machine learning in predicting performance for computer engineering students is the subject of the manuscripts [24,25], and data mining techniques are applied in predicting further education for students studying a medical curriculum [26]. Additionally, a supervised learning application is proposed in one model for student performance prediction [27]. Moreover, a comprehensive consideration of the application of supervised machine learning techniques was conducted for the prediction of students' performance using different parameters, including demographics and social interests [28].
A machine learning approach that uses two techniques, logistic regression and decision trees, to predict student dropout at the Karlsruhe Institute of Technology is considered in [29]. A similar approach can be found in [30], where a decision tree algorithm combined with linear regression for data classification is applied for student evaluation purposes in Turkey.

Materials and Methods
These two subchapters are written to allow readers to replicate and build on the published results. The proposed model is described in detail in the subchapter 'Methods', while a detailed description of the dataset used for its development and evaluation is given first, in the subchapter 'Materials'.

Materials
The dataset used in this case study for the evaluation of the proposed model was taken from the student records database at the Academy of Criminalistics and Police Studies (CPA) in Belgrade, covering the period from 2006 to 2016. The analysis encompassed students who graduated from three departments that share mutual modules in year one. Out of 290 graduated students at the Academy, 83 graduated from the Department of Safety (labeled SD), 114 from the Criminology Department (labeled CD), and 93 from the Police Department (labeled PD). For each student, we compiled information on grades from 10 modules (i.e., subjects) in year one. These numbers, along with their labels, are presented in Table 1. As can be seen, descriptive statistics with the average grades of CPA students in the above modules in year one, observed for each department, are also presented. For these purposes, the group of statistical methods based on descriptive statistical analysis of the observed data was used. The values of the minimum (MIN), maximum (MAX), mean grade (MEAN), median grade (MEDIAN), mode grade (MODE), standard deviation (STDEV), sample variance (VAR), kurtosis (KURT) and skewness (SKEW) are set in the rows of Tables 2-4. As can be seen from the presented tables, the minimum and maximum grades of students in all subjects are MIN = 6 and MAX = 10. The next three rows show the so-called dominant characteristics (MEAN, MEDIAN and MODE). Obviously, the students showed the best success in the subject 'English language' (module S2), while other subjects have lower values of these statistical indicators. The next two rows give the values of the so-called variability measures, the standard deviation (STDEV) and variance (VAR).
In all three sets of observed data, i.e., for all three departments above, only slight differences in the variability of the observed subjects can be seen. The last two rows of Tables 2-4 show the values of kurtosis, i.e., the elongation coefficient (KURT), as well as the skewness coefficient (SKEW). One can notice that most of the observed data (except, for instance, subject S3 in the Police Department) have negative values (KURT < 0). This means that the observed data sets have so-called platykurtic properties, i.e., they are 'flatter' compared to a normal distribution. Finally, the obtained values of the skewness coefficient are, in most cases, positive (SKEW > 0), which means a higher frequency (and consequently mode values) of lower student grades is present.
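As an illustration, the descriptive indicators listed above can be computed as follows (a minimal sketch on hypothetical grades, not the CPA data; the moment-based skewness and excess-kurtosis estimators shown are one common choice):

```python
# Illustrative sketch on hypothetical grades (6-10 scale), not the CPA dataset.
import statistics as st

grades = [6, 7, 7, 8, 8, 8, 9, 10]  # hypothetical student grades

n = len(grades)
mean = st.mean(grades)
median = st.median(grades)
mode = st.mode(grades)
stdev = st.stdev(grades)      # sample standard deviation (STDEV)
var = st.variance(grades)     # sample variance (VAR)

# Central moments for skewness and excess kurtosis:
m2 = sum((x - mean) ** 2 for x in grades) / n
m3 = sum((x - mean) ** 3 for x in grades) / n
m4 = sum((x - mean) ** 4 for x in grades) / n
skew = m3 / m2 ** 1.5         # SKEW > 0: more frequent lower grades
kurt = m4 / m2 ** 2 - 3       # KURT < 0: platykurtic ('flatter') distribution
```

For this hypothetical sample the distribution is right-skewed (skew > 0) and platykurtic (kurt < 0), matching the pattern described for the observed data.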

Methods
The model proposed in this paper uses the traditional method of regression analysis, which is first aggregated with the AHP method; the resulting method is then combined with the ReliefF ranking method of machine learning, and the decision tree classifier J48 is used for the aggregation purpose. All of these methods and the proposed aggregated model are described in detail in the four subsections of this subchapter.

Traditional Techniques
The problem of predicting the choice of study direction is observed in this paper from the point of view of the mathematical approach. The solution is to determine the impact of the individual first-year subjects, as criteria that affect the final success of the student (the observed goal), and then to create a model that, based on the success in individual first-year subjects, can predict the choice of the appropriate study direction. Practically, in the traditional mathematical and statistical approach, this problem reduces to the problem, well known in the literature, of determining the weights of criteria in a multicriteria problem and to a linear regression model.
The classification of mathematical methods for determining the weights of criteria is not uniform. Namely, the division between these methods is made following the authors' concept and the need to solve a practical problem. There are different divisions of methods, such as: statistical and algebraic, direct and indirect, holistic and decomposed, and compensatory and non-compensatory, which can be seen more in [31].
The most important groups of objective and subjective methodologies for determining the weight of the criteria are discussed in [32], as follows:

•	Standard methods of statistical analysis, including the most commonly used regression and correlation analysis as well as variance and factor analysis. These methods require a statistically relevant sample, meeting a strict condition on the ratio between the sample size and the number of criteria, and imply a normal distribution; the latter requirement is relaxed by using so-called non-parametric statistical methods;
•	Methods of operational research, most often methods of multicriteria analysis, which are mostly algebraic but also statistically-based methods for determining the weights of the criteria and whose main task is, in fact, to choose the optimal number of alternatives and optimal solutions. Methods of operational research, being less demanding than statistical methods, as well as non-parametric statistical methods, are used where the condition of a normal distribution in the sample is not met, or where the required sample size and its ratio to the number of considered criteria are not achieved, as well as when it is necessary to choose alternatives and rank and quantitatively position them.
Methods of operational research can be divided into two basic groups [32]: subjective methods, which depend on the decision makers' influence on the weights of the criteria, and objective methods, which are based on the application of an objective mathematical, primarily statistical, apparatus to the information contained in the decision-making matrix [31,33].
Objective methods determine the weight of each evaluated criterion from its essential information, according to two principles [34-36]:

•	The criterion that has the least variation across the considered cases, i.e., alternatives, has the least impact, i.e., weight; this is the principle of contrast intensity, within which special methods most often use entropy, standard deviation, variation, etc. as a measure of contrast intensity;
•	The criterion that is in conflict with a large number of others is more important; this is the principle of conflict character, in which the best-known measure is the correlation coefficient between pairs of criteria.
In the literature, one can find the application of statistical regression methods to predict students' success [16] based on their success in graduation or the entrance exam [8,19]. Moreover, statistical methods are used to determine the parameters for assessing whether a student will withdraw from their studies or continue their studies based on the first results [10].
Supervised learning is considered through its main aim to obtain the best prediction results in [37,38] and as one overview in [39]. In [24,40], the applications of machine learning in predicting performance for computer engineering students and disease prediction were considered, respectively.

Regression Analysis
Regression analysis is a method for examining the influence of several different independent variables on one dependent variable, in order to determine the analytic form of this connection, i.e., the model which will be used in analytical and predictive applications. In the deterministic model, each combination of values of the independent variables corresponds to exactly one value of the dependent variable, and the model can be given in the form:

y = a + b_1 x_1 + b_2 x_2 + . . . + b_n x_n, (1)

where b_i are the partial regression coefficients.
In the case when there is unexplained variation of the dependent variable, expressed as an error term e that captures the influence of random effects or of independent variables not included in the model, we can present the multiple regression in the form:

y = a + b_1 x_1 + b_2 x_2 + . . . + b_n x_n + e. (2)

The parameters a and b_i in Equation (2) can be calculated with the method of least squares, by minimizing the sum of squared residuals over the m observations:

min Σ_j (y_j − a − b_1 x_1j − . . . − b_n x_nj)^2, j = 1, . . . , m.

In practice, an algebraic algorithm for solving the arising system of normal equations is rarely used compared with the well-known Gaussian elimination method or with the analysis of variance (ANOVA), which is mostly used in empirical research.
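The least-squares fit described above can be sketched for the one-predictor special case (hypothetical data; the closed-form formulas below are the standard simple-regression solution):

```python
# Minimal sketch: fitting y = a + b*x by ordinary least squares,
# i.e., minimizing the residual sum of squares (one-predictor case).
def least_squares(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # b = cov(x, y) / var(x); a = mean(y) - b * mean(x)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x: the fit recovers a = 1, b = 2.
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
```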
One particularly important problem that may occur in the multiple regression model is multicollinearity. It manifests as an intercorrelation between explanatory variables and usually leads to erroneous results of regression analysis [41,42]. In order to check for multicollinearity in the regression model obtained here, two diagnostic tools were used, described in detail below (please see Section 3, 'Results and Findings'). Firstly, the correlation coefficients between all predictor variables were computed, and only weak or moderate correlation was noticeable. In addition, for the purpose of double-checking, another measure of multicollinearity, the well-known variance inflation factor (VIF), was also used.
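As a small illustration of the VIF diagnostic (hypothetical data, not the paper's predictors): with exactly two predictors, the R^2 of one predictor regressed on the other equals the squared correlation, so VIF = 1/(1 − r^2) for both.

```python
# Hypothetical two-predictor multicollinearity check via Pearson r and VIF.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x1 = [6, 7, 8, 9, 10]
x2 = [7, 6, 9, 8, 10]   # moderately correlated with x1
r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)  # VIF well above 10 is a common warning sign
```

Here r = 0.8 gives VIF ≈ 2.78, i.e., only moderate correlation, as in the weak-to-moderate case described above.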

Multi-Criteria Decision Making
Multi-criteria decision making is a field that has received great attention in the last two decades, since each decision process requires the contemplation of numerous criteria that are often in conflict or stated in different measurement units. From the 1960s onward, a number of multi-criteria analysis methods have been developed, and they can be classified on several grounds. One of the most important classifications of multi-criteria decision-making methods was conducted by Hwang and Yoon [43], who classified 17 different methods by the type and relevant features of the information provided by the decision makers.
According to the type of information, all stated methods are divided into two groups:
1.	Methods without information on attributes;
2.	Methods requesting certain attribute information.
There are several ways to enable attribute transformation and adjust attributes to the models of multi-criteria decision making: conversion of the attributes to an interval scale, normalization of the attributes and assignment of an appropriate weight set. The assignment of an appropriate weight set is mostly used in situations when it is necessary to determine the relative importance of certain attributes. For N criteria, the weight set is W = {w_1, w_2, . . . , w_N}, with the weights usually normalized so that w_1 + w_2 + . . . + w_N = 1. Among the numerous techniques for estimating the relative importance of certain abstracted attributes are the eigenvector method, the least-squares weighting method, the entropy method, etc. The algorithm of the AHP method can be described as a structural analysis of a complex decision-making problem that contains several criteria, several alternatives and possibly several decision makers (a decision-making group), determining the relative weights of criteria and alternatives per level and forming a final ranking of the alternatives. It is one of the best-known methods of multi-criteria decision making and is mostly used in cases when the relevant criteria can be hierarchically structured. This method was created in the 1970s by Thomas Saaty [44].
It is important to note that, since the main part of the AHP method is the pairwise comparison of hierarchy elements and the formation of appropriate local reciprocal numerical matrices, from which the weights of the compared elements are determined, these matrices with the calculated element weights carry measurable information on consistency, expressed by the consistency ratio (CR). A model that provides CR < 0.1 is considered a good one.
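One AHP step can be sketched as follows (a hypothetical 3 × 3 pairwise comparison matrix; the geometric-mean approximation of the weights and Saaty's random index RI = 0.58 for n = 3 are standard choices, not values from this paper):

```python
# Hedged AHP sketch: weights from a hypothetical Saaty-scale comparison matrix,
# plus the consistency ratio CR (CR < 0.1 indicates acceptable consistency).
import math

A = [[1,   3,   5],
     [1/3, 1,   3],
     [1/5, 1/3, 1]]
n = len(A)

# Geometric mean of each row, normalized, approximates the weight vector.
gm = [math.prod(row) ** (1 / n) for row in A]
w = [g / sum(gm) for g in gm]

# lambda_max estimated from (A w) / w, then CI and CR (RI = 0.58 for n = 3).
Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
lam = sum(Aw[i] / w[i] for i in range(n)) / n
CI = (lam - n) / (n - 1)
CR = CI / 0.58
```

For this matrix the weights come out roughly (0.64, 0.26, 0.10) with CR well below 0.1, so the hypothetical comparisons would be judged consistent.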

Knowledgeable Data Analysis
The basic property distinguishing deep (or intelligent) data analysis from traditional, 'ordinary' analysis is the application of machine learning. The adjective 'intelligent' emphasizes that this data analysis is based on artificial intelligence procedures. Attribute selection is a field that was developed within the frameworks of pattern recognition in mathematical statistics [45], knowledge discovery [46], machine learning [47,48], especially neural networks, and many other fields. The fundamental task of attribute selection is the reduction of spatial dimensionality and the removal of redundant, irrelevant and noisy data, which speeds up the operation of the learning algorithm, improves the data quality and increases the accuracy of the learned knowledge. When selecting attributes, there are two approaches that differ in what they aim to achieve. The first approach is finding and ranking the subset of attributes useful for the construction of a quality model. The other approach is finding or ranking all potentially important attributes. Using a subset of potentially important attributes for model construction is presented as a sub-optimal approach for the construction of some models [49]. Furthermore, the issue of usefulness versus the importance of the attributes can be very interesting [50].

Attribute Selection Using ReliefF Algorithm
The complexity of group correlation analysis is a result of the huge number of attribute combinations whose relations should be considered, of order O(2^N), where N is the number of attributes in the model [51]. Due to this usually great complexity, approximation is applied: for example, only partial analysis of individual attributes, with complexity O(N), or analysis of only some of the possible combinations (interactions of 2 or 3 attributes) is performed. Entropy is a commonly used measure in information theory [52], which characterizes the purity of an arbitrary collection of examples and is considered a measure of the system's unpredictability. The entropy of Y is:

H(Y) = −Σ_y p(y) log p(y),

where p(y) is the marginal probability density function for the random variable Y. If the observed values of Y in the training data set S are partitioned according to the values of a second feature X, and the entropy of Y with respect to the partitions induced by X is less than the entropy of Y before partitioning, then there is a relationship between features Y and X. The entropy of Y after observing X is then:

H(Y|X) = −Σ_x p(x) Σ_y p(y|x) log p(y|x),

where p(y|x) is the conditional probability of y for given x.
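The two entropy formulas above can be illustrated on a hypothetical tiny data set (here X perfectly determines Y, so the entropy reduction, i.e., the information gain, equals H(Y)):

```python
# Sketch of H(Y), H(Y|X) and their difference on hypothetical (x, y) pairs.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical data: x = 'a' always gives 'yes', x = 'b' always gives 'no'.
data = [('a', 'yes'), ('a', 'yes'), ('b', 'no'), ('b', 'no')]
ys = [y for _, y in data]

h_y = entropy(ys)                          # H(Y) before partitioning
h_y_given_x = sum(                         # H(Y|X), weighted by p(x)
    (len(part) / len(data)) * entropy([y for _, y in part])
    for part in ([d for d in data if d[0] == v] for v in {'a', 'b'})
)
gain = h_y - h_y_given_x                   # reduction in unpredictability
```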
ReliefF is a filtering method with an attribute-ranking procedure. It is based on the procedure of the k-nearest neighbors (k-NN). Figure 1 shows the ReliefF algorithm. This algorithm estimates and ranks each attribute with a global grade in the interval [−1, 1]. The weight calculation can be performed based on the probability of the nearest neighbors from two different classes having different values of the attribute, as well as on the probability of two neighbors from the same class having the same value of the attribute. The function diff(Attribute; Instance1; Instance2) calculates the difference between the values of the attribute for two instances.
When considering discrete attributes, the difference is either 1 (when the values are different) or 0 (when the values are the same), while in the case of continuous attributes, the difference is the actual difference normalized to the interval [0, 1]. Kononenko notes in [53] that the higher the value of m (the number of sampled instances), the more reliable ReliefF's estimates are. Of course, it should be noted that increasing m raises the running time.
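The diff function and the weight update can be sketched with the basic two-class Relief procedure (a simplified, hypothetical sketch using one nearest hit and one nearest miss per instance, not the full ReliefF with k neighbors): weights rise for attributes that differ between classes and drop for attributes that differ within a class.

```python
# Simplified two-class Relief sketch (continuous attributes only).
def diff(a, inst1, inst2, spans):
    # Continuous attributes: actual difference normalized to [0, 1].
    return abs(inst1[a] - inst2[a]) / spans[a]

def relief(data, labels):
    n_attr = len(data[0])
    spans = [max(r[a] for r in data) - min(r[a] for r in data) or 1
             for a in range(n_attr)]
    dist = lambda u, v: sum(diff(a, u, v, spans) for a in range(n_attr))
    w = [0.0] * n_attr
    m = len(data)
    for i, r in enumerate(data):
        same = [d for j, d in enumerate(data) if j != i and labels[j] == labels[i]]
        other = [d for j, d in enumerate(data) if labels[j] != labels[i]]
        hit = min(same, key=lambda d: dist(r, d))    # nearest hit
        miss = min(other, key=lambda d: dist(r, d))  # nearest miss
        for a in range(n_attr):
            w[a] += (diff(a, r, miss, spans) - diff(a, r, hit, spans)) / m
    return w

# Hypothetical data: attribute 0 separates the classes; attribute 1 is noise.
w = relief([[0.0, 0.5], [0.1, 0.9], [1.0, 0.6], [0.9, 1.0]], [0, 0, 1, 1])
```

The discriminative attribute ends up with a positive weight and the noisy one with a negative weight, matching the [−1, 1] grading described above.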

Classification Technique and Algorithms
Classification is the task of data mining that administers the separation of data set examples into previously determined classes of output variables based on the values of input variables [54]. Classification of an object is based on finding similarities with previously determined objects belonging to different classes, whereas the similarity of two objects is determined by analyzing their characteristics. The task is to design a model based on the characteristics of objects with a previously known classification. That model then represents the ground for performing the classification of new objects. In the classification problem, the classes are known in advance and limited in number. During testing, the classifier classifies the test set examples into the previously determined classes. If the classifier makes errors on the test data, i.e., if there is a high percentage of wrongly classified examples, the conclusion is that a wrong and unstable model has been created. In that case, it is necessary to improve the model by modifying the applied classification process. Up-to-date research shows that the most applied classifiers include Bayesian networks, decision trees, neural networks, support vector machines, and k-nearest neighbors [55].
Decision trees are well-known classification techniques, considering that they include several ways of creating easily interpreted trees used for the classification of categorical and numerical values of the attributes. These classification methods perform the division of data into nodes and leaves until the entire data set has been analyzed. The best-known algorithms are ID3 [56] and C4.5 [57]. The C4.5 algorithm for decision tree induction was developed on the basis of the ID3 algorithm, with multiple significant improvements compared to the basic algorithm: handling of continuous attributes and missing attribute values, a new gain-ratio quality estimate, and simplification (pruning) of the learned tree to increase the classification accuracy on new examples. It is available as an independent program and as an object module (library MLC++) for usage within other systems for supervised learning and intelligent data analysis. The J48 algorithm of the Weka software is a popular machine learning algorithm based upon Quinlan's C4.5 algorithm [58].


Classifiers Estimation
Classifier estimation enables the prediction of performance, the selection of the best classifier and the behavioral assessment of multiple different classifiers. Cross-validation is a statistical method of assessing and comparing learning algorithms by dividing data into two segments: one is used to learn or train a model, and the other is used for model validation.
In standard cross-validation, the training and validation sets must cross over in successive rounds such that each data point has a chance of being validated.
The basic form of cross-validation is k-fold cross-validation. In k-fold cross-validation, the whole data set is partitioned into k equally (or nearly equally) sized segments, or folds. Afterward, k iterations of training and validation are completed such that, within each iteration, a different fold of the data is held out for validation while the remaining k − 1 folds are used for learning. In most cases in data mining and machine learning, 10-fold cross-validation (k = 10) is the most common. The J48 classifier with the corresponding parameters is used and evaluated with this most frequently used type of cross-validation.
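The k-fold partitioning itself can be sketched as follows (a minimal illustration of the fold bookkeeping only; the classifier trained in each round, J48 in this paper, is omitted here):

```python
# Sketch of k-fold cross-validation index bookkeeping: the data indices are
# split into k nearly equal, non-overlapping folds, and each round holds out
# one fold for validation while the remaining k - 1 folds are used for training.
def k_fold_indices(n, k):
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

folds = k_fold_indices(10, 3)   # fold sizes 4, 3, 3

# (train indices, validation indices) for each of the k rounds.
rounds = [(sorted(set(range(10)) - set(f)), f) for f in folds]
```

Every data point appears in exactly one validation fold, which is the property cross-validation relies on.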
The largest number of measures for estimating classification models relates to classification problems with two classes. That does not represent a particular limitation for the application of those measures, considering that problems with a higher number of classes can be expressed as a sequence of two-class problems, each of which separately singles out one of the classes as the target one. For a two-class problem, the classified members of a data set are divided into true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN), where N = TP + FP + FN + TN is the total number of members in the considered set. It is necessary to notice that these numbers are counts, i.e., integers, not ratios or fractions.
Figure 2 presents the so-called 2 × 2 confusion matrix which enables the calculation formulas. The accuracy, precision, recall and F1 measure can respectively be calculated as:

Accuracy = (TP + TN)/N,
Precision = TP/(TP + FP),
Recall (Sensitivity) = TP/(TP + FN),
F1 = 2 × Precision × Recall/(Precision + Recall).

An evaluation of the prediction performance of a classifier is carried out using Receiver Operating Characteristic (ROC) curves, which represent the rate of false-positive cases on the OX axis and the rate of true-positive cases on the OY axis. ROC curves and the AUC (area under the curve) parameter enable the evaluation of each classification model, as in Figure 3 [59].
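The four measures can be computed from hypothetical confusion-matrix counts as follows (the counts below are illustrative, not results from this paper):

```python
# Accuracy, precision, recall and F1 from hypothetical 2x2 confusion-matrix
# counts; all four inputs are integers, as noted in the text.
TP, FP, FN, TN = 40, 10, 5, 45
N = TP + FP + FN + TN

accuracy = (TP + TN) / N                   # share of correctly classified members
precision = TP / (TP + FP)                 # correctness of positive predictions
recall = TP / (TP + FN)                    # sensitivity / true-positive rate
f1 = 2 * precision * recall / (precision + recall)
```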
In theory, a classification with an AUC of at least 70% is considered good, with the highest possible value being 100%.
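The four measures above can be computed directly from the confusion-matrix counts; a small illustrative helper (not part of the original experiment):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from the four integer counts
    (TP, TN, FP, FN) of a 2x2 confusion matrix."""
    n = tp + tn + fp + fn              # total classified members, N
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # a.k.a. sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For example, TP = 40, TN = 45, FP = 5, FN = 10 gives N = 100 and an accuracy of 0.85; note that F1 is the harmonic mean of precision and recall.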

Proposed Model for Selection of the Relevant Attributes
In the data preparation phase, the attribute Department was added to the data set for the three departments, and its type was set as binomial, with the values (yes, no). The attribute Department is marked as the class, i.e., the response variable. Educational data are mostly clean, since automatically collected values (log files, the database of grades, the database of the learning system) do not contain erroneous entries. As the data with the average first-year grades at the CPA were in numerical form for each graduated student in all three data sets, it was not necessary to apply any of the standard discretization filters during data preparation.
After the data were acquired and merged into one unified set, a data mining algorithm was applied to estimate and select the attributes. The aim is to remove irrelevant and redundant attributes from the learning data set. In order to keep only the desirable attributes in the learning set, a measure of attribute relevance with respect to the classification problem is necessary. Filtering methods encompass techniques for assessing attribute values that rely on heuristics based on general features of the data.
The proposed model selects the relevant attributes by combining the aggregation results with those obtained using the ReliefF algorithm, as shown in Figure 4.
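For illustration only, a drastically simplified, pure-Python sketch of the Relief idea underlying ReliefF (the actual experiments use WEKA's ReliefF, which averages over k nearest hits/misses and handles more than two classes): a feature is rewarded when it separates an instance from its nearest miss and penalized when it differs from its nearest hit.

```python
def relief_weights(X, y):
    """Simplified Relief-style feature weighting for a two-class data set.
    X is a list of numeric feature rows, y the list of class labels."""
    n, m = len(X), len(X[0])
    # per-feature value ranges, used to normalize differences to [0, 1]
    rng = [max(r[f] for r in X) - min(r[f] for r in X) or 1.0 for f in range(m)]
    diff = lambda f, a, b: abs(a[f] - b[f]) / rng[f]
    dist = lambda a, b: sum(diff(f, a, b) for f in range(m))
    w = [0.0] * m
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        h = min(hits, key=lambda j: dist(X[i], X[j]))      # nearest hit
        mss = min(misses, key=lambda j: dist(X[i], X[j]))  # nearest miss
        for f in range(m):
            w[f] += diff(f, X[i], X[mss]) - diff(f, X[i], X[h])
    return w
```

On toy data where only the first feature separates the two classes, the first weight comes out clearly positive while an uninformative constant feature keeps weight zero, which is exactly the ranking behavior the model relies on.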

A selection of the attributes using various filtering techniques that rank them according to an importance estimate is carried out in the WEKA system, so that the attributes are ranked and assessed on the current training set. For the importance assessment of the attributes, the ReliefF algorithm with the ranker method was used. The extraction of the relevant attribute set from the whole set was performed by the negative-elimination process [60,61]: in each step, the attributes are re-ranked and the weakest-ranked attribute is removed. After each removal, the estimation accuracy of the selected learning method, evaluated in the same way in all steps of this procedure, was recorded in the WEKA software.
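The negative-elimination loop can be sketched as follows; here `evaluate` is a placeholder for whatever accuracy estimator is used (in the paper, 10-fold cross-validated J48 accuracy in WEKA):

```python
def backward_elimination(ranked, evaluate, min_keep=1):
    """Sketch of the negative-elimination procedure: repeatedly drop the
    weakest-ranked attribute and record the accuracy on each subset.
    `ranked` is a list of attribute names ordered best-first; `evaluate`
    is any callable mapping an attribute subset to an accuracy estimate."""
    history = []
    subset = list(ranked)
    while len(subset) >= min_keep:
        history.append((list(subset), evaluate(subset)))
        subset.pop()          # remove the currently worst-ranked attribute
    return history
```

The returned history of (subset, accuracy) pairs is what the later comparison of PrecA(k) and PrecR(k) operates on.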

Results and Findings
For the determination of the relevant attributes for the selection of the study program at the CPA in Belgrade, evaluation aggregation was used as a combination of the regression analysis and the subjective AHP method, in order to objectify the gained grade and eliminate the weak sides of both applied technologies. In order to calculate the relevant attributes, the following procedure was used. Step 1: The AHP analysis starts from the data in Tables 2-4, the average grades per subject for each department individually: safety, criminology and police, respectively. The subjects are compared in pairs. The result of the pairwise comparison is a reciprocal matrix formed in accordance with the preferences defined by the Saaty scale. The Saaty scale is of the Likert type and contains divisions from 1 to 9, where 1 denotes equal importance of the criteria and 9 denotes absolute importance of one criterion in relation to the other (even numbers are intermediate divisions on the scale). Moreover, the scale contains reciprocal values; for example, if P1:P2 = 5, then P2:P1 = 1/5. What importance is given to a certain ratio of average grades depends on the subjective opinion of the decision maker. Logically, the highest possible grade is 10 and the lowest positive grade is 6. Hence, the ratio of the significance of subjects whose grades are P1 = 10 and P2 = 6 can be determined as 10:6 = 1.66667; on the Saaty scale, this is the absolute significance of P1 in relation to P2 and is entered as 9. Conversely, the ratio P2:P1 = 6:10 = 0.6 is the lowest possible preference. Therefore, it is necessary to distribute the preferences over the interval from 0.6 to 1.66667, where equal importance of the criteria must be marked with 1. Precisely because it is necessary to distinguish between similar estimates (approximate values), the divisions are distributed so that the extreme values occupy a larger interval, while the values from 0.88 to 1.10 are divided into several smaller intervals.
In that way, reciprocal matrices were formed for each department, and the further procedure of applying the AHP method involved summing each column of the reciprocal matrix; the next matrices were then constituted from the normalized values of the reciprocal matrix (each coefficient divided by the sum of the corresponding column). In the end, the normalized coefficients are summed along each row, and the weighting coefficient is obtained as the quotient of the row sum and the number of criteria used in the model. The AHP analysis was implemented using VCO (the multi-criteria decision-making software package of the Faculty of Organizational Sciences in Belgrade) with the hierarchy and AHP analysis model [62]. The obtained results are shown in Table 5.
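The column-normalization step and the consistency check can be sketched in pure Python (an illustration of the standard AHP approximation, not of the VCO package used in the study):

```python
def ahp_weights(A):
    """Approximate the AHP priority vector: divide each coefficient by
    its column sum, then average the normalized coefficients per row."""
    n = len(A)
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    norm = [[A[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return [sum(row) / n for row in norm]

def consistency_ratio(A, w):
    """CR = CI / RI, where CI = (lambda_max - n) / (n - 1) and RI is
    Saaty's random index; CR < 0.1 indicates acceptable consistency."""
    n = len(A)
    Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(Aw[i] / w[i] for i in range(n)) / n   # estimate of lambda_max
    ci = (lam - n) / (n - 1)
    ri = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32,
          8: 1.41, 9: 1.45, 10: 1.49}[n]
    return ci / ri
```

For a perfectly consistent matrix built from underlying priorities v (A[i][j] = v[i]/v[j]), this procedure recovers the normalized priorities exactly and yields CR = 0.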
The reciprocal AHP numerical matrices for all departments yield CR < 0.1, so the model can be considered a good one.
Step 2: The regression analysis was implemented using model (1) of linear regression in the SPSS software package [63,64], and the results are shown in Table 6. One can notice that the last row contains the estimated values of the multiple coefficient of determination (R²). It represents a widely used quantitative measure of the agreement of the theoretical, fitted model with the given set of empirical data. The estimated values of R² confirm that the obtained regression model can be an adequate model of dependency for the given subjects S1, . . . , S10.
As already pointed out in the previous section, one of the most important problems that arise here is the potential multicollinearity of the predictors in the above regression models. In order to determine the intercorrelation between the predictors Si and Sj, where i, j = 1, . . . , 10, Tables 7-9 show the estimated values of their correlation coefficients (Rij); the subject labels in Table 6 correspond to the ones in Table 1. As can be seen, students' grades at the Department of Safety satisfy the condition max Rij < 0.7 for i ≠ j. In the case of the other two departments, the inequality max Rij < 0.6 holds. Thus, there is only a weak or (in the 'worst case') moderate correlation between students' grades in different subjects in all three departments.
Multicollinearity can also be expressed using the so-called variance inflation factor (VIF), which represents a measure of the severity of multicollinearity in regression analysis. Mathematically, for a multiple regression model given by Equation (1), the VIF of the i-th predictor can be expressed using the coefficient of determination Ri² of the auxiliary multiple regression model with the explanatory variable xi as the response variable and the other variables xj (j ≠ i) as its explanatory variables:

VIFi = 1/(1 − Ri²)

The calculated VIF values for each of the three previously obtained regression models are shown in Table 10.
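As a small illustration of the formula on hypothetical data (not the study's grade data): with only two predictors, the auxiliary regression's R² reduces to the squared Pearson correlation between them, so the VIF can be computed directly.

```python
def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """VIF_i = 1 / (1 - R_i^2); with two predictors, the auxiliary
    regression of x1 on x2 has R^2 equal to the squared correlation."""
    r2 = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r2)
```

Uncorrelated predictors give VIF = 1, while strongly correlated ones inflate the value far above the max(VIF) < 4 threshold used below.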
It can be seen that for all three data sets, i.e., the grades of students from the different departments, the condition max(VIF) < 4 is satisfied. This result is fully consistent with both practical and theoretical interpretations of this coefficient [65,66]. In other words, the VIF values obtained in this way indicate that the multicollinearity between the predictors in the previously obtained regression models is not pronounced and can therefore be neglected.

Step 3: Evaluation aggregation. The aggregated evaluation measures are acquired as the simple arithmetic mean of the AHP results and the regression-analysis results, in order to objectify the acquired grade and eliminate the weak sides of both applied methodologies. The results are shown in Table 11. To clarify, the aggregated evaluation is given as follows:

aggr_ik = (regression_analysis_ik + AHP_ik)/2 (13)

Step 4: According to the proposed model for aggregating the traditional methods of regression and AHP with a classification method from the group of machine learning algorithms, given in Section 2.2.4, we apply the ReliefF algorithm with the ranker method for the importance assessment of the attributes. After that, we evaluate the accuracy of both groups of methods using the J48 decision tree classifier and, as the relevant attributes, we take the number of the most important attributes for which the deviation of accuracy between these two groups of methods is the smallest [67-70]. The results for the Department of Safety at the CPA are given in Tables 12-14.

Step 5: By aggregation evaluation for the same department, we have acquired the measurement values shown in Table 14.
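Equation (13) amounts to a one-line computation per attribute; a sketch with hypothetical score dictionaries (the attribute names and values are placeholders, not the study's results):

```python
def aggregate_scores(regression, ahp):
    """Equation (13): aggr_ik = (regression_analysis_ik + AHP_ik) / 2 --
    the arithmetic mean of the two importance scores per attribute."""
    return {s: (regression[s] + ahp[s]) / 2.0 for s in regression}
```

The averaging deliberately gives both methodologies equal influence, which is how the model objectifies the subjective AHP grades with the data-driven regression estimates.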
The number of important factors should be 4, because the deviation of accuracy for the Department of Safety data set with attributes ranked by traditional aggregation versus by classification is ε = |0.854 − 0.876| = 0.022, which is smaller than when 3 of them are taken as important factors, where ε = |0.883 − 0.819| = 0.064.
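The selection criterion, minimizing ε_k = |PrecA(k) − PrecR(k)| over the candidate attribute counts k, can be sketched as:

```python
def best_attribute_count(prec_a, prec_r):
    """Pick the attribute count k minimizing eps_k = |PrecA(k) - PrecR(k)|,
    i.e., the count at which the accuracies under the aggregation ranking
    and the ReliefF ranking agree most closely."""
    eps = {k: abs(prec_a[k] - prec_r[k]) for k in prec_a}
    k_best = min(eps, key=eps.get)
    return k_best, eps[k_best]
```

Plugging in the Department of Safety values quoted above (PrecA = 0.883 and 0.854, PrecR = 0.819 and 0.876 for k = 3 and k = 4) reproduces the choice of 4 attributes with ε = 0.022.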

Discussion and Suggestions
By applying the attribute ranking gained by aggregation, it can be noticed that the set of 4 attributes gave the highest mean estimation values for the Department of Safety. The attributes found relevant on the basis of this criterion are S2, S10, S8 and S6. The acquired mean accuracy values for the Department of Safety, with the set of attributes decreased by negative elimination of the lowest-ranked attribute under both techniques, are shown in Figure 5.
The presented accuracy results for both cases show that this procedure enables the discovery of approximately equal accuracy values, on the grounds of which we could determine an equal number of attributes in both sets. To achieve this, a comparison was performed of the differences in accuracy between the set of attributes ranked by the aggregation technique, PrecA(k), and the set of attributes ranked by applying the ReliefF algorithm, PrecR(k). When considering the accuracy difference ε, the values of sets with sequences of 1 to 10 values were not taken into consideration.

On the basis of the acquired results (Figure 5), the accuracy difference under both ranking methods is visible, and it is smallest for the set of 4 attributes. For the other two departments, the closest accuracy under both methods was acquired with the set of 8 attributes for the Criminology Department and the set of 9 attributes for the Police Department, shown in Figures 6 and 7, respectively. The sets with the appropriate number of attributes and their ranking coefficients, acquired on the basis of the previous figures for all departments, are shown in Table 15.

The knowledge discovered by analyzing the acquired results can be stated as follows: when estimating the accuracy on the same data by applying the decision tree for all three departments, the application of the ReliefF classifier, PrecR(k), produced better results than the aggregation method, PrecA(k): the number of attributes k was smaller, and the estimation accuracy Prec(k) was better. However, if we look at the smallest difference ε achieved in the mean estimation accuracy between the attribute ranking by aggregation, PrecA(k), and the application of the ReliefF classifier, PrecR(k), we obtain the following deviations: for the Department of Safety, ε = 0.022 (for the set of 4 attributes); for the Criminology Department, ε = 0.001 (for the set of 8 attributes); and for the Police Department, ε = 0.000 (for the set of 9 attributes). By comparing both sets of attributes across all three departments, we can extract the set of common attributes that represent the relevant modules when selecting a department at the CPA.
In Table 15 (marked in gray), it is clearly visible that S8-Constitutional law, S2-English language (excluding the case of applying the ReliefF classifier to the Safety Department) and S5-Introduction to the law (excluding the case of applying aggregation to the Safety Department) are important for all three departments, and that the modules S3-Criminal law, general part, S7-Police equipment and S10-IT are important for the Police Department and the Criminology Department. For the Safety Department and the Police Department (excluding the case of applying aggregation to the Safety Department), the module S1-Fundamentals of economics is also important.

Conclusions
The main hypothesis of the authors was that it is possible to develop an aggregated model with better characteristics than the integrated techniques used independently. The techniques considered in this paper solve the problem of determining and predicting the importance of the attributes that describe a multidimensional, multicriteria process. The evaluation was carried out on a case study of determining students' future behavior regarding the choice of one of the three offered modules at the Criminalistics and Police University, Belgrade. The feature selection method was used to reduce the number of attributes, thus increasing the accuracy of the model. As traditional techniques, the authors demonstrated the use of regression analysis combined with the AHP mathematical method. The ReliefF classification ranking method of machine learning was applied together with the J48 decision tree classifier for aggregation purposes. The evaluation results have shown the advantage and supremacy of the proposed model. The authors claim that the proposed model has no significant limitations, and the inclusion of n-modular redundancy into it will be considered in future work related to this topic.
It is important to mention that the obtained analysis of the students' personal contribution, via the results achieved in the first-year subject modules, can help in the process of identifying relevant attributes. In this case, these attributes are related to determining the robustness of the first-year modules that could be prearranged by the experts at the Ministry of Interior of the Republic of Serbia, as well as providing adequate support in determining the weight given to the first-year modules through the subject accreditation process, with a corresponding number of European Credit Transfer and Accumulation System (ECTS) credits.