A Decision Support System Using Text Mining Based Grey Relational Method for the Evaluation of Written Exams

Abstract: Grey relational analysis (GRA) is part of Grey system theory (GST). It is appropriate for solving problems with complicated interrelationships between multiple factors, parameters, and variables. It solves multiple-criteria decision-making problems by combining the entire range of performance attribute values considered for every alternative into one single value, thus reducing the main problem to a single-objective decision-making problem. In this study, we developed a decision support system for the evaluation of written exams with the help of GRA using contextual text mining techniques. The answers obtained from a written exam taken by 50 students in a computer laboratory and the answer key prepared by the instructor constituted the data set of the study. A symmetrical perspective allows us to perform a relational analysis between the students' answers and the instructor's answer key in order to contribute to measurement and evaluation. Text mining methods and GRA were applied to the data set through a decision support system employing the SQL Server database management system and the C# and Java programming languages. According to the results, the exam papers were successfully ranked and graded based on their word similarities with the answer key.


Introduction
The concept of measurement, which aims to determine and grade differences, is defined as the observation of characteristics and the expression of observation results [1]. It is accepted as a critical component of the education system and contains various techniques to define the level of education status and differences [2]. Between these techniques used to determine the levels of success, measurement tools such as tests, essay exams, written exams, and oral exams are available. Written exams are considered as essential tools when it comes to measuring the original and creative thinking power of students, written expression skills, subject evaluation, and application of knowledge and skills [2][3][4]. However, the validity and reliability of written exams are reduced due to some factors such as non-objective assessment, not to pay attention and care to the scoring criteria, making unnecessary verbiage in answers, scoring difficulties, and time-consuming situations [2][3][4][5].
In recent years, increasing attention to artificial intelligence has encouraged the progress of data mining and analytics in the field of education. Data mining is a knowledge-intensive task concerned with deriving higher-level insights from data. It is defined as "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" [6,7]. Assessment today is used to determine the intended learning outcomes at each level of education and points to a learning-oriented approach rather than a simple pass/fail evaluation [27]. According to [28], when deciding on the level of learning, the interests, motivation, preparation, participation, perseverance, and achievement of the students should be considered. Various measuring instruments or methods are used to evaluate achievement across these elements; commonly used instruments are oral exams, multiple-choice exams (tests), and written exams [29]. Oral exams have low validity and reliability: they cannot go beyond measuring isolated information, the emotional and mental state of the student affects the result, and even the appearance of the student (e.g., dressing style) can affect the evaluation. For these reasons, oral exams are a measurement tool that produces highly inconsistent results. In multiple-choice exams, the student chooses the most appropriate answer to the question. However, these exams reduce expression skills, decision-making skills, and interpretation power, and cannot measure mental skills, so they are not seen as robust measurement tools. Contrary to oral and multiple-choice exams, there is a consensus in the literature that written exams offer effective results in measurement and assessment, although they are laborious in practice [1][2][3][4][5]. The advantages and disadvantages of these exam types are presented in Table 1, Table 2, and Table 3, respectively.

Table 1. The advantages and disadvantages of oral exams.

Advantages:
• The most appropriate method for measuring verbal expression skills.
• The preparation of the questions is easy and takes less time.
• It allows observing how the student has reached an answer.
• When the number of students is very small, their knowledge level and academic standing can be determined in more detail.
• Students are not likely to copy in the exam.
• Instant feedback is received.

Disadvantages:
• There is no standardization.
• Factors such as personality traits, momentary emotions, speech effectiveness, and oral expression skills are mixed into the scoring process and affect the performance of the exam.
• Since the application of the exam in crowded classes is laborious and time-consuming, very few questions are asked. This significantly reduces the content validity. Besides, it is necessary to find different questions for different students.
• There is a continuous interaction between the student and the teacher, which may affect the quality of the exam; the teacher's approach may affect the student's performance.
• Since the scoring is done with a general impression, its objectivity is low.

Table 2. The advantages and disadvantages of multiple-choice exams.

Advantages:
• It is easy to apply and score.
• All kinds of information can be measured.
• Answering a question takes a short time.
• It is suitable for exams with high participation.
• It is easily applicable at all levels and stages of education.
• Reliability and content validity are high since it is possible to ask many questions in the exam.
• When the answers are not marked on the exam paper, the paper can be used repeatedly.
• The exam results are objective since they do not vary from evaluator to evaluator.
• It lends itself to a variety of statistical applications; statistical procedures can be carried out with the data obtained from the exam.

Disadvantages:
• There is a chance of success by guessing.
• It does not improve the ability of expression.
• Most of the test time is spent on reading the choices and finding the right answer.
• It measures knowledge and remembering and is limited in measuring information at the level of synthesis and evaluation.
• The forming, organizing, and writing of the questions require expertise and experience.
• It is not used much in the measurement of advanced behaviors.

Written Exams
When it comes to putting together ideas and organizing, evaluating, and applying knowledge and skills, written exams are the most appropriate method of measurement. Written exams, which aim to reveal the student's original and creative thinking power, are therefore a beneficial technique for measuring writing skills and a widely applied method of measurement at all levels of the education system, starting from primary school [3][4][5]. Written exams, also called classical exams or open-ended exams, are answered by the students by hand or in an electronic environment. The answers, which reflect the students' knowledge and thinking skills about the subject, are usually short texts [25][26][27].
Written exams, which are suitable for measuring high-level mental skills, have some advantages over the other exam types. The most important feature of written exams is the freedom to answer: the student is free to decide how to approach the questions, which information to use, how to arrange the answer, and how much emphasis to place on every aspect of the answer. Written exams therefore enable students to produce, integrate, and express ideas and to develop the skill of stating their thoughts in written form. Furthermore, unlike other types of tests, these exams increase the validity of the results because the chance factor is eliminated and copying is unlikely, providing a more accurate determination of outcomes and performance [26][27][28][29].
Besides their advantages, written exams have some restrictions. According to educators, scoring is the most significant limitation in the evaluation of these exams [1][2][3]. The scoring process can be influenced by the paper layout, imperfections, in-class behavior, and the students' family characteristics (i.e., interests and attitudes); therefore, the same answer sheet can receive different scores. Consequently, there may be situations in which the results distort the measured success and reduce reliability [3][4][5][25]. In [4], the authors emphasized that evaluators should pay attention to the learning outcomes and evaluate answers carefully; in this context, the evaluation process was described as "three blind men and an elephant". On the other hand, students who do not know the exact answer can write answers consisting of unnecessary statements to show that they have knowledge. In this case, it is challenging to identify the information to be scored, and the educator loses objectivity, reducing the validity and reliability of the exam. Finally, the excessive course load on students can extend the period in which the educator evaluates learning success and reduce the efficiency of the exams [2][3][4][28][29][30]. Because of the positive effects of the results obtained from written exams on measurement and evaluation, we focused on the analysis of these exams and developed a decision support system as a solution to the problems experienced in scoring them.

Data Mining
With the increasing growth of data used in many application areas (e.g., bioinformatics, business intelligence, finance, retail, health, telecommunication, Web search engines, social media, and digital libraries), data mining meets the needs for flexible, scalable, and efficient data analysis in today's information age. Data mining (also known as knowledge discovery from data) can be considered a natural evolution of information technology and a confluence of several disciplines (i.e., machine learning, statistics, information retrieval, pattern recognition, and bioinformatics) and application fields. It is the process of discovering interesting patterns (patterns that represent knowledge and are novel, potentially useful, and easily understandable by humans) from large amounts of data [31,32]. Data mining, as a knowledge discovery process, usually includes an iterative sequence of the following steps: data cleaning, data integration, data selection, data transformation, pattern discovery, pattern evaluation, and knowledge presentation. It can be performed on any kind of data (e.g., transactional data, data warehouse data, database data, and advanced data types with versatile forms, structures, and rather different semantic meanings, such as Web data, engineering design data, data streams, hypertext and multimedia data, graph and networked data, spatial and spatiotemporal data, and time-related or sequence data) as long as the data are meaningful for the target application [33,34].
There are several techniques to specify the kinds of patterns to be discovered in data mining tasks. In general, such data mining tasks are classified into two major groups: descriptive and predictive. Descriptive tasks characterize the properties of data in a target dataset. Predictive tasks carry out induction on the available data to make predictions [17]. The main techniques used in data mining tasks are classification and regression; characterization and discrimination; cluster analysis; the mining of frequent patterns, associations, and correlations; association rules; outlier detection; sequence analysis; time series analysis; sentiment analysis; social network analysis; prediction; and text mining [31,32]. As a highly application-driven discipline, data mining incorporates many techniques from the other research areas such as artificial intelligence, machine learning, database systems, data warehouses, information retrieval, high-performance computing, statistics, pattern recognition, and visualization [33,34]. Consequently, the interdisciplinary nature of data mining development significantly contributes to the success of data mining and its extensive application domains.

Text Mining
The analysis of structured data is possible with data mining approaches. However, the vast majority of the data we encounter in daily life, such as documents, texts, and pictures, is unstructured. Since data mining approaches are insufficient for analyzing such data, analyses involving textual operations and language constructs are required. In this context, text mining can be defined as the process of extracting previously unknown, implicit information from textual data [35]. The first study on the analysis of textual data was done by [36]: aiming to extract keywords from project descriptions recorded in a database in text form using NLP, the authors performed a cluster analysis of the documents based on those keywords. The process of converting textual data into a structured form was first described by [37]; in that study, the textual data in documents were converted into a structured form using IR, and meaningful concepts were then determined using the text categorization approach of [36]. In recent years, the use of text mining, an interdisciplinary research area related to IR, NLP, and Named Entity Recognition (NER), has increased rapidly and become a focus of interest in almost all application domains.
To be able to analyze textual data, they must first be represented numerically. For the numerical representation of a document, the Bag of Words (BoW), the Binary model, and the Vector Space Model (VSM) are employed [18]. In the BoW method, the frequency of each word in the document is determined, and the document can thereby be represented numerically [38,39]. In the Binary model, a set of terms in the document is referenced, and the values "0" and "1" encode whether or not a specific term is related to the document: "1" is used when the term appears in the text, and "0" otherwise. The VSM is used to vectorize the word counts in the text [40]; in other words, it can be considered a matrix based on the BoW that shows the word frequencies in the document. If more than one document is compared, each word is weighted depending on its frequency in the other documents. In general, the TF-IDF method is used for this weighting operation.
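The three representations can be illustrated with a minimal Python sketch (the function names and the toy document below are our own, not from the system described in this paper):

```python
from collections import Counter

def bow(tokens):
    """Bag of Words: map each term to its frequency in the document."""
    return dict(Counter(tokens))

def binary(tokens, vocabulary):
    """Binary model: 1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def tf_vector(tokens, vocabulary):
    """Vector Space Model: the BoW laid out as a frequency vector."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

doc = "grey relational analysis ranks grey alternatives".split()
vocab = sorted(set(doc))
representation = tf_vector(doc, vocab)  # one row of the document-term matrix
```

Stacking one such vector per document yields the document-term matrix on which the weighting (e.g., TF-IDF) is later applied.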
There are two techniques used in text mining: contextual analysis and lexical analysis [31,32]. The contextual analysis is aimed at determining words and word associations based on the frequency values of the words in the text. The BoW and VSM methods are employed in contextual analysis to represent the documents. Lexical analysis is carried out in the context of NLP for the meaningful examination of texts [39,41]. Within the meaningful analyses, applications such as opinion mining, sentiment analysis, evaluation mining, and thought mining are performed and used to extract social, political, commercial, and cultural meanings from textual data. Besides, the main methods used in text mining include clustering, classification, and summarization [19,42,43].

• Clustering: It is the method of organizing similar contents from unstructured sources such as documents, news, images, paragraphs, sentences, comments, or terms in order to enhance retrieval and support browsing. It groups the contents based on fuzzy information, such as words or word phrases in a set of documents. The similarity is computed using a similarity function; measures such as Cosine distance, Manhattan distance, and Euclidean distance are used for the clustering process. Besides, there is a wide variety of clustering algorithms, such as hierarchical algorithms, partitioning algorithms, and standard parametric modeling-based methods. These algorithms are grouped along different dimensions based either on the underlying methodology, leading to agglomerative or partitional approaches, or on the structure of the final solution, leading to hierarchical or non-hierarchical solutions [35,38,43].

• Classification: It is a data mining approach that groups data according to specified characteristics. It is carried out in two stages, learning and classification. In the first stage, part of the data set is used for training to determine how data characteristics will be classified; in the second stage, the whole dataset is classified by these rules. Supervised machine learning algorithms are used for classification, including decision trees, naïve Bayes, nearest neighbor, classification and regression trees, support vector machines, and genetic algorithms [31,35,43].

• Summarization: It is used for documents and aims to determine the meanings, words, and phrases that can represent a document. It is carried out based on the language-specific rules of the textual data. NLP approaches, which are among the complex processes in terms of computer systems, are used in applications such as speech recognition, language translation, automated response systems, text summarization, and sentiment analysis [31,35,43].
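The similarity function mentioned for clustering can be sketched briefly (a Python illustration of cosine similarity over term-frequency vectors; the system described in this paper is itself implemented in C# and Java):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors.

    Returns 1.0 for identical directions, 0.0 for orthogonal (no shared
    terms) vectors, and 0.0 by convention when either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Because cosine similarity depends only on vector direction, it is insensitive to document length, which is why it is a common choice for comparing texts of different sizes.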

Educational Data Mining (EDM)
EDM is a research area in which education system issues (e.g., learning models, programs, activities, teaching methods and materials, resources, and vocational qualifications of teachers) are studied based on data mining. It aims to develop education models and techniques and analyze data related to the attitudes and behaviors of students, teachers, administrators, teaching methods and materials [13]. In the literature, we observed that data mining techniques such as clustering, classification, and prediction are typically used in EDM studies [16,17]. With EDM applications, educators, administrators, and researchers make effective decisions by using tools such as student modeling, performance assessment, educational outcome prediction, course activity analysis, management activity analysis, and learning behaviors analysis [44]. In this context, EDM analyses that employ statistical methods and machine learning techniques are decision support systems performed with educational data.
The fact that educational data consist of multiple factors and that not all data can be transferred to the electronic environment creates problems in educational analysis processes [45]. Therefore, many educators, managers, and policymakers have hesitated to use data mining methods. In recent years, the transfer of historical data to electronic media and the more regular recording of data have attracted the attention of researchers to data mining. Besides, experience is essential in acquiring general and field-specific vocational competences in education. In this context, it is indisputable that the analysis of data based on past experience will contribute significantly to the education process. However, the inclusion of historical data in the analysis poses more challenging processes for EDM, and traditional data mining techniques used in EDM are inadequate for complex data analysis. Thus, it is inevitable that EDM analyses make use of artificial intelligence techniques such as machine learning, NLP, expert systems, deep learning, neural networks, and big data. The analysis of educational data (the evaluation of relations, patterns, and associations based on historical data) through these techniques will enable more precise assessments related to education and ensure that future decisions are made more accurately by administrators, supervisors, and teachers.

Grey Relational Analysis (GRA)
The Grey system theory is used to solve problems with a small sample and missing information and produce results in the case of uncertainty and incomplete information [46]. While the exact unknown information in the Grey system theory is represented by black, precisely known information is represented by white, and the incomplete information between these two endpoints is represented by grey [47][48][49].
GRA is a relational ranking, classification, and decision-making technique developed using Grey system theory [47]. This method aims to determine the degree of relation between a reference series and each factor in a grey system; the relationship level between the factors is called the Grey relational degree. GRA is used to make the most appropriate selection among options based on various criteria. Since it aims at selecting the best alternative in a multi-criteria decision-making process, it can be regarded as an optimization method. The comparison and ranking of the alternatives is carried out in six steps: the creation of the decision matrix, the comparison matrix, the normalization matrix, the absolute value table, the grey relational coefficient matrix, and the grey relational degrees [50,51].
Step 1: The first operation in GRA is to create a decision matrix consisting of alternatives a and decision criteria c. Equation (2) shows the decision matrix X, where X_a represents each alternative and X_ac represents the criterion value of each alternative. The number of alternatives in the decision matrix is denoted by m, and the number of criteria by n.
Step 2: Based on the values in the decision matrix, the reference series is added to the matrix, and the comparison matrix X_0 is obtained (Equation (3)). The value of the reference series is selected based on the maximum or minimum values of the decision criteria among the alternatives, or as the optimal criterion value that the ideal alternative can obtain.

Step 3: The process of scaling the data into small intervals is called normalization. After normalization, it becomes possible to compare the series with each other; thus, normalization prevents the analysis from being adversely affected by discrete values [50][51][52]. In GRA, the normalization depends on whether the criterion values in the series are to be maximized or minimized. If the maximum value of a series contributes positively to the decision (utility-based), Equation (4) is used. If the minimum value contributes positively (cost-based), Equation (5) is used. If a desired optimal value contributes positively (optimum-based), Equation (6) is used to normalize the matrix X_0. The normalization matrix X* is then obtained by Equation (7).

Step 4: Equation (8) indicates the absolute subtraction of the reference series and the alternatives. The absolute value table is created by taking the absolute difference of each alternative series in X* and the reference series, and is obtained by Equation (9). For example, the absolute value of the second alternative's third criterion is calculated as ∆_23 = |X*_03 − X*_23|.
Step 5: The absolute value table (∆) in Equation (9) is used to calculate the grey relational coefficients indicating the degree of closeness of the decision matrix to the comparison matrix. The grey relational coefficients are calculated by Equation (12).
∆_min = min_a min_c ∆_ac (11)

In Equation (12), ρ denotes the distinguishing coefficient, ρ ∈ [0, 1]. The purpose of ρ is to expand or compress the range of the coefficients. In cases where there is a large difference in the data, ρ is chosen close to 0; otherwise, it is chosen close to 1. In the literature, it is stated that ρ does not affect the ranking and is generally taken as ρ = 0.5 [53].
Step 6: The grey relational degrees determining the relationship between the reference series and the alternatives can be calculated in two ways. If the decision criteria have equal significance, Equation (13) is used; otherwise, Equation (14) is used, where w_ac denotes the weight of the decision criteria. A value of δ_a, calculated as the arithmetic mean of the γ_ac, that is close to 1 indicates a strong relationship with the reference series.
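The six steps can be sketched end-to-end as follows (an illustrative Python sketch assuming utility-based normalization, equal criterion weights, and ρ = 0.5; it is not the authors' C# implementation):

```python
def grey_relational_degrees(decision, reference, rho=0.5):
    """GRA for utility-based (larger-is-better) criteria.

    decision:  m alternatives x n criteria (Step 1, decision matrix)
    reference: the reference series (ideal alternative), length n (Step 2)
    Returns one grey relational degree per alternative (Step 6).
    """
    n = len(reference)
    rows = [reference] + decision              # Step 2: comparison matrix X0
    # Step 3: utility-based normalization, column by column
    norm = []
    for row in rows:
        norm_row = []
        for c in range(n):
            col = [r[c] for r in rows]
            lo, hi = min(col), max(col)
            norm_row.append((row[c] - lo) / (hi - lo) if hi != lo else 1.0)
        norm.append(norm_row)
    ref_norm, alts = norm[0], norm[1:]
    # Step 4: absolute value table Δ
    delta = [[abs(ref_norm[c] - alt[c]) for c in range(n)] for alt in alts]
    d_min = min(min(row) for row in delta)
    d_max = max(max(row) for row in delta)
    # Step 5: grey relational coefficients γ
    gamma = [[(d_min + rho * d_max) / (d + rho * d_max) for d in row]
             for row in delta]
    # Step 6: grey relational degrees δ (equal criterion weights)
    return [sum(row) / n for row in gamma]

# Toy example: the alternative closer to the reference gets the higher degree.
degrees = grey_relational_degrees([[0.9, 0.8], [0.5, 0.4]], [1.0, 1.0])
```

Ranking the alternatives by their δ values then gives the relational ordering used later in this paper to grade the exam answers against the answer key.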

Proposed Method
Text mining is a complex process consisting of four steps (Figure 1), in which textual data are converted into structured data and analyzed by data mining techniques. In this study, we developed a decision support system based on text mining and GRA to evaluate written exams. Figure 2 depicts our system model. At the university level, 50 students in the same class took an examination, and the instructor evaluated the answer sheets. In this context, we analyzed the relationship level between the students' written exams and the answer key. In our system, we used the C# programming language for data collection, data preparation, text processing, and GRA operations. We also found word roots using the Java programming language and performed data operations through the Microsoft SQL Server database management system. The database tables used in the analysis process are shown in Figure 3.

Data Collection
The exam was announced to the students a week before the exam date. On a volunteer basis, 50 students participated in the exam in the computer laboratory. Five questions were asked, and the answer sheets were received in an electronic format (as a Word document). The answer key for the first question is shown in Appendix A (a,b). To describe the GRA-based system model used in this research as briefly and clearly as possible, we created a sample table of five student answers and explained the GRA calculations according to this table. The answers given by the five students are presented in Appendix A (c). The answers, the scores, and the answer key were saved into the database with the application programming interface we developed.

Data Preprocessing
The efficiency of a text mining technique is directly related to data preprocessing, which directly affects the soundness of the analysis results. It consists of three steps, applied in order: tokenization, filtering, and stemming. In the tokenization phase, performed by the text processing functions of the software we developed, punctuation marks such as periods, commas, and exclamation points are first removed from the text. Then, the text is split into words, and each word is stored in the database. These steps are repeated for each answer of all students and for the answer key. The second phase is filtering, which is NLP-based and removes prepositions and conjunctions. This is carried out by deleting the words from the database through the programming language. Some of the words deleted in the filtering process are, in Turkish: fakat (but), ancak (only), bile (even), ile (with), çünkü (because), eğer (if), ve (and), veya (or), etc. The final phase is stemming. As the process of finding word roots, stemming removes the prefixes and suffixes that change the meaning of words. In this study, we employed Zemberek 0.11.1, a Java-based NLP library designed for Turkish NLP operations, to find the roots of the words. The roots of the words stored in the database were identified by the Java application we developed. After removing the prefixes and suffixes, the roots were recorded in the database again.
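The three preprocessing phases can be sketched as follows (a Python illustration; the stop-word list comes from the examples above, while the naive suffix-stripping stemmer is only a stand-in for the Zemberek library used in the actual system):

```python
import re

# Filtering list: conjunctions/prepositions named in the text above.
STOPWORDS = {"fakat", "ancak", "bile", "ile", "çünkü", "eğer", "ve", "veya"}

# Placeholder stemmer: the real system uses the Java Zemberek library for
# Turkish morphology; this toy suffix list is for illustration only.
SUFFIXES = ("ler", "lar", "de", "da")

def stem(word):
    """Strip one known suffix if the remaining root is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # 1) Tokenization: drop punctuation, split into lowercase words
    tokens = re.findall(r"\w+", text.lower())
    # 2) Filtering: remove conjunctions and prepositions
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 3) Stemming: reduce each word to its root
    return [stem(t) for t in tokens]
```

In the actual pipeline, each intermediate result (tokens, filtered words, roots) is persisted to the SQL Server database rather than kept in memory.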

Feature Extraction
Converting data to a structured form is a transformation process to apply data mining techniques to the text data. BoW, binary, and VSM methods are typically used to extract the word counts of the text, and these values are considered as features of the document. In other words, the data conversion task is the feature extraction of the document (textual data) at the same time [54,55].

Bag of Words (BoW)
The BoW approach is used to convert the document into a structured form. In other words, it assumes that the number of words in the text document can represent the document. In our study, we generated two BoW series (BoW1: the word frequency number of answer key; BoW2: the word frequency number of student answers) for each answer. In the BoW generation process given by Algorithm 1, BoW values of both answer key and student answers were placed in the BoW array. A value of 0 for the variable "answer" in the algorithm indicated the answer key BoW.
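The idea of Algorithm 1 can be sketched as follows (a hypothetical Python rendering, not the algorithm's original code; as in the paper, the value 0 of the answer index denotes the answer key):

```python
from collections import Counter

def build_bows(answers):
    """Build one BoW (term -> frequency) per answer.

    answers[0] holds the tokenized answer key (BoW1); answers[1:] hold the
    tokenized student answers (BoW2 series), mirroring the answer = 0
    convention described for Algorithm 1.
    """
    return [dict(Counter(tokens)) for tokens in answers]

answers = [
    ["grey", "relational", "analysis", "grey"],  # answer key (index 0)
    ["grey", "analysis"],                        # student 1
]
bows = build_bows(answers)  # bows[0] is BoW1, bows[1:] are BoW2 entries
```

Each BoW pair (answer key vs. student answer) is what the GRA decision matrix is later built from.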

Vector Space Model (VSM)
VSM is the most common method of representing textual data numerically. In VSM, each document is represented by a vector created using the frequency counts of its terms. In document comparisons, the weight values of the words in the documents to be compared can be determined by the Term Frequency (TF) or the Term Frequency-Inverse Document Frequency (TF-IDF). The TF denotes the number of occurrences of a term in the document. The TF-IDF is a weighting method for the terms in the document collection [19], and makes it possible to compare multiple documents. For the term weight calculation, we used the TF-IDF method as follows:

q(w) = f_d(w) · log(N / f_D(w)) (15)

In Equation (15), q(w) denotes the weight q of the term w. The value of q(w) is directly proportional to the number of occurrences of w in document d, f_d(w), and inversely proportional to the number of occurrences of w in the other documents D. N is the number of all documents, and f_D(w) indicates the number of documents in which the term occurs. This approach, which is used effectively in text mining applications, is not a measure of similarity; rather, it indicates the importance of a term in the documents. Document similarities can be determined using similarity measures such as Cosine and Euclidean distance [55,56].
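Equation (15) can be sketched directly (a minimal Python illustration of the TF-IDF weighting as described; the exact variant used in the paper's system may differ):

```python
import math

def tf_idf(term, doc, docs):
    """q(w) = f_d(w) * log(N / f_D(w)): weight of `term` in `doc`.

    doc:  list of tokens for one document d
    docs: list of all tokenized documents (the collection D, of size N)
    """
    tf = doc.count(term)                       # f_d(w)
    df = sum(1 for d in docs if term in d)     # f_D(w), document frequency
    return tf * math.log(len(docs) / df) if df else 0.0
```

Note that a term appearing in every document gets weight 0 regardless of its frequency, which is exactly the behavior discussed below for short texts.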
Besides the advantages of simple calculation and support for comparing multiple documents, the TF-IDF has some limitations. Insufficient information, known as sparsity in short texts, is one of them. Although the TF-IDF yields reliable results in the analysis of long texts, it may not give effective results in the analysis of short texts. Moreover, evaluating a text according to the weight gradations of its terms makes a term with a higher weight degree more important than the others; the TF-IDF therefore indicates the frequency-based importance of a term in the text [20,56]. According to [21], although this method can be used to separate terms, it is not sufficient to distinguish between documents. Because of these shortcomings, the TF-IDF can be problematic for short texts, where terms with a low weight scale and terms with a higher weight may carry comparable information. The use of TF-IDF may therefore not be an appropriate choice when written exams consist of short texts. Furthermore, exam answers should be evaluated by their similarity to the answer key used as the reference, not weighted according to the other answers. In this context, it is suitable to analyze exam answers represented by the VSM with GRA, which is designed for cases of insufficient information, so that written exam evaluations can be carried out by text mining.

Analysis
The structured text, converted into a numerical vector, is analyzed in detail by employing clustering, classification, and summarization. Moreover, statistical methods and machine learning techniques from data mining can also be used as analysis methods.

BoW Findings
The number of frequencies of the word roots in the answer key was determined, and the bag of words BoW 1 representing the answer key was created for each question. Thus, a separate BoW 1 series is defined for each of the n questions in the answer key.

VSM Findings
The roots of words and numbers that match the answer key need to be specified for the analysis. The VSM table showing word distribution based on the answers of the first question responded by the five students is presented in Table 4. The VSM table, created by the answers of all students to the first question, is given in Appendix B (b).
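A row of such a VSM table can be sketched as follows: for each answer, a term-frequency vector is built over the word roots of the answer key, so that all answers to a question share the same columns. The sample roots and answer are hypothetical, not from Table 4:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Vsm {
    // Build a term-frequency vector for a pre-stemmed answer over the
    // answer key's word roots (one VSM row; columns = key roots).
    static int[] vector(List<String> keyRoots, String stemmedAnswer) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : stemmedAnswer.split("\\s+")) counts.merge(w, 1, Integer::sum);
        int[] v = new int[keyRoots.size()];
        for (int j = 0; j < v.length; j++) v[j] = counts.getOrDefault(keyRoots.get(j), 0);
        return v;
    }

    public static void main(String[] args) {
        List<String> keyRoots = List.of("veri", "tablo", "analiz"); // hypothetical roots
        System.out.println(Arrays.toString(vector(keyRoots, "veri analiz veri"))); // [2, 0, 1]
    }
}
```

Words in an answer that do not match any answer-key root fall outside the columns and are ignored, which matches the requirement that only roots matching the answer key be specified for the analysis.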


Grey Relational Degree Findings
As illustrated in Figure 5, there is a two-stage decision-making process in our analysis. The first stage identifies the level of relationship between the words in each student's answer and the words in the answer key for each question; that is, the level of relationship between BoW 1 and BoW 2 was determined using the VSM prepared for each question. In the first stage, the VSM used as a decision matrix in the GRA is extended with a reference series constructed from the utility-based values of each criterion among the alternatives. Normalization was carried out by applying Equations (4) and (6): if the frequency of a word in BoW 1 is less than that in BoW 2 , Equation (4) is used; if the frequency in BoW 1 is greater than or equal to that in BoW 2 , Equation (6) is used. The absolute differences between the normalized VSM series and the reference series were calculated, and then the Grey relational coefficients were computed with ρ = 0.5, chosen in accordance with the literature. The decision matrix formed in the first stage is presented in Appendix C (a), the normalization matrix in Appendix C (b), the absolute differences in Appendix C (c), and the Grey relational coefficients in Appendix C (d). Equation (13) was used to calculate the Grey relational degrees under the assumption that the words in each question have equal importance. The Grey relational degrees, showing how closely each student's answer relates to the answer key for each question, are presented in Table 5.
Table 5 shows the relationship between BoW 1 and BoW 2 for the questions answered by each student and feeds the second decision-making stage, which ranks each student's exam paper against the answer key. In this context, Table 5 was used as the decision matrix of the second-stage GRA. The reference series was created from the maximum degree values among the alternatives for each question. Since a higher alternative degree corresponds to a higher score, normalization follows the utility-based approach of Equation (4), and the absolute-difference table is then created. The Grey relational coefficients were calculated with Equation (12) using ρ = 0.5 and are presented in Appendix D (a). Since each question has equal weight, the Grey relational degrees presented in Table 6 were calculated with Equation (13). Table 6 shows the relevance level of the student answers to the answer key.
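The GRA steps above (utility-based normalization, absolute differences from the reference series, Grey relational coefficients with ρ = 0.5, and equal-weight degrees per Equation (13)) can be sketched in Java as follows. The decision matrix here is a hypothetical second-stage example (rows = students, columns = questions), not data from Tables 5 or 6, and the case split between Equations (4) and (6) is collapsed into the standard utility normalization for brevity:

```java
import java.util.Arrays;

public class GreyRelation {
    static final double RHO = 0.5; // distinguishing coefficient, as in the study

    // Utility-based normalization of a decision matrix (columns = criteria):
    // x*(i,k) = (x(i,k) - min_k) / (max_k - min_k)
    static double[][] normalize(double[][] m) {
        int rows = m.length, cols = m[0].length;
        double[][] out = new double[rows][cols];
        for (int k = 0; k < cols; k++) {
            double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
            for (double[] row : m) { min = Math.min(min, row[k]); max = Math.max(max, row[k]); }
            for (int i = 0; i < rows; i++)
                out[i][k] = (max == min) ? 1.0 : (m[i][k] - min) / (max - min);
        }
        return out;
    }

    // Grey relational degree of each alternative against the reference series,
    // which is all ones after utility-based normalization (column maxima).
    static double[] degrees(double[][] m) {
        double[][] n = normalize(m);
        int rows = n.length, cols = n[0].length;
        double dMin = Double.MAX_VALUE, dMax = -Double.MAX_VALUE;
        double[][] delta = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int k = 0; k < cols; k++) {
                delta[i][k] = Math.abs(1.0 - n[i][k]); // absolute difference from reference
                dMin = Math.min(dMin, delta[i][k]);
                dMax = Math.max(dMax, delta[i][k]);
            }
        double[] deg = new double[rows];
        for (int i = 0; i < rows; i++) {
            double sum = 0;
            for (int k = 0; k < cols; k++) // Grey relational coefficient per criterion
                sum += (dMin + RHO * dMax) / (delta[i][k] + RHO * dMax);
            deg[i] = sum / cols; // equal criterion weights, as assumed for Equation (13)
        }
        return deg;
    }

    public static void main(String[] args) {
        // Hypothetical first-stage degrees: rows = 3 students, columns = 2 questions
        double[][] m = { {0.9, 0.6}, {0.5, 0.8}, {0.7, 0.7} };
        System.out.println(Arrays.toString(degrees(m)));
    }
}
```

In the two-stage design, this same routine is applied twice: first to each question's VSM to obtain per-question degrees, then to the matrix of those degrees to rank the exam papers.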
For example, according to the table, the answers closest to the answer key for the first question belong to student-1 and student-3. Similarly, student-5 gave the best answer to question 2. When all the questions are considered, the highest score should be given to student-5. Table 6. Second-order GRA degrees for the five students. The analysis results, containing the instructor's scores and the Grey relational degrees of the students' answers, are presented in Table 7 (see Appendix D (b) for all students). Table 7. Comparison of evaluator scores and GRA degrees for the five students.

Discussion and Suggestions
In this study, we developed a decision support system using text mining and Grey system theory. The exam evaluations, treated as multi-criteria decision-making problems, are based on the similarities between the words in the student answers and those in the answer key. To our knowledge, no previous study has used Grey system theory, which is rarely encountered in the text mining literature, for the evaluation of written exams. In this regard, our study is a pioneer in the fields of text mining and educational data mining.
It should be noted that the system does not assign scores directly; rather, it performs an evaluation based on ranking. In this respect, it is a decision support system that helps the decision-maker evaluate student exam results. When the results presented in Table 7 are compared, it can be seen that the GRA results and those of the classical method are close to each other. The decision support system developed in this way contributes to the assessment of exams both in classical written form and in online distance learning, and it facilitates the checking of exam scores.
Since the study is based on the similarity of word roots, the preprocessing part of the text mining analysis becomes an important task. At this stage, we observed that the Zemberek library has some inadequacies, such as over-shortening some word roots. However, since the stemming process was applied to both the answer key and the answer sheets in the same way, this inadequacy was assumed not to affect the analysis results. Furthermore, since the study is based on contextual text mining, semantic changes in word roots were neglected in the stemming process.
We observed some limitations in the root-finding (stemming) processes involved in preprocessing. We assumed that the analysis is not affected, since stemming is applied to both the answer key and the student answers in the same way and the study is contextual. In lexical analysis, however, it cannot be assumed that this would have no effect on the results. In this context, the development of NLP tools will gain importance, and more effective results can then be obtained for lexical text mining studies.
In future work, repeating the study with lexical text mining methods and performing semantic analyses using artificial intelligence would contribute positively to the literature. However, the inadequacies of NLP implementations for the language in question limit semantic approaches. In particular, phrases, sentences, expressions composed of more than one word, and shortcomings in handling idioms make lexical text mining analysis difficult. Turkish NLP libraries currently under development, such as Zemberek, ITU Turkish NLP, and Tspell, will play a vital role in text mining as their efficiency and coverage increase. Moreover, the fact that the data set used in this study is in Turkish does not mean that the proposed model cannot be applied to other languages. On the contrary, thanks to the powerful processing capabilities of English NLP libraries, our model may achieve even better results on English data sets.

Conclusions
In this paper, we presented a novel approach to written exam evaluation. We developed a decision support system employing text mining and GRA to help decision-makers evaluate written exams and demonstrated how GRA can be used in a two-stage decision system. We carried out the analysis under the assumption that all words and all questions have equal weight. The application process showed that the analysis can also be performed easily with GRA when the factors affecting the decision have different weights. Therefore, analyzing decision-making processes with differently weighted criteria using GRA will be an interesting topic for future research.
The assessment of written exam questions should be based on a referenced answer key. The words used in an answer should be compared with the answer key prepared for each question to determine word similarity. The evaluation of written exams is thus a multi-criteria decision process. Moreover, the BoW, TF, and TF-IDF methods used in contextual text mining studies can be insufficient for this task. This study shows that a rating can instead be made based on the word similarities in the written exams: by employing a two-stage GRA, we demonstrated that Grey relational degrees obtained from word similarities can be used to determine the similarity between the exam papers and the answer key.
In our study, the text data in the exam papers of 50 students were structured by text mining methods, and the first-order GRA was applied to the decision matrix created by the VSM. The ranking of the student exam papers was then carried out by applying the second-order GRA over all questions. The decision support system uses the GRA similarity values to determine the range in which a student's assessment score should fall. Instead of an absolute successful/unsuccessful judgment, the success level of a student is determined relationally, based on the success levels of the other students, by taking advantage of the GRA. Consequently, we demonstrated that the exam papers can be evaluated using Grey relational degrees. Furthermore, we compared the GRA-based ranking with the instructor's ranking and observed no significant differences between the two assessments.
Author Contributions: The authors contributed equally to this work.
Funding: This research received no external funding.