Random Forest with Sampling Techniques for Handling Imbalanced Prediction of University Student Depression

Abstract: In this work, we propose a combined sampling technique to improve the performance of imbalanced classification of university student depression data. The experimental results show that combining random oversampling with the Tomek links undersampling method generated a relatively balanced depression dataset without losing significant information. The random oversampling technique was used to oversample the minority class and balance the number of samples between the classes. The Tomek links technique was then used to undersample the data by removing depression samples considered less relevant and noisy. The relatively balanced dataset was classified by random forest. The results show that the overall accuracy in the prediction of adolescent depression data was 94.17%, outperforming the individual sampling techniques. Moreover, our proposed method was tested with another dataset for its external validity. This dataset's predictive accuracy was found to be 93.33%.


Introduction
Depression is an important public health problem, both now and in the future. It is one of the leading causes of disability worldwide and the most common mental disorder among college students [1]. A study in Thailand found that the prevalence of depression in adolescents was 14.9% [2]. Depression screening is the most important first step in assessing the symptoms and severity of depression.
Depression is difficult to observe and measure because affected individuals are often sensitive about their condition or unaware that they are depressed. The diagnosis of depression in adolescents is therefore necessary; studies have found that adolescent depression is under-recognized by general doctors. Short and concise depression screening tools are thus important to help doctors detect more students with depression at university.
The Patient Health Questionnaire-9 (PHQ-9) is a self-administered assessment that is easy for respondents to understand. The tool is sensitive, its results are straightforward to analyze and interpret, and it has few questions. Moreover, the questionnaire assesses both mental and physical symptoms [3]. Al-Busaidi et al. used the PHQ-9 to screen for the prevalence of depressive symptoms among university students in Oman [4]. Levis et al. found that the PHQ-9 was sensitive and accurate in screening to detect major depression [5].
Research related to prediction and analysis of the relevant factors of depression using data mining techniques has been carried out. For example, Chang et al. used ontologies and Bayesian networks to build the terminology of depression and applied Bayesian networks to infer the probability of depression [6]. Thanathamathee applied the AdaBoost decision tree and a feature selection technique to predict adolescent depression; the results were useful for screening depression [7]. Ghafoor et al. used association rules and a frequent pattern tree to extract the results of depression [8]. Hou et al. proposed many data mining techniques to predict depression in university students on the basis of their reading habits; they found that logistic regression achieved the highest prediction accuracy at a lower relative error [9]. Li et al. employed a feature selection method called GreedyStepwise (GSW) on the basis of correlation feature selection (CFS) and K-nearest neighbor (KNN) for mild depression detection [10]. Kim et al. proposed a simple unobtrusive sensing system to monitor the activities of elderly people living alone; the results showed that a neural network effectively detected normal and mild depression [11]. Gao et al. focused on machine learning to predict major depressive disorder utilizing features derived from magnetic resonance imaging (MRI) data [12]. Alzyoud et al. developed a hybrid machine learning model to predict the level of depression for young smokers [13]; they found that age is an important attribute for predicting depression. Blessie and George employed naïve Bayes and KNN to extract data patterns to prevent mental illness and assist in mental health services [14]. Priya et al. applied different data mining techniques to predict anxiety, depression, and stress [15]; random forest was the best-performing classifier for stress, and naïve Bayes was the best for depression.
Imbalanced data problems are often encountered in the real world, especially with medical data: many patients are admitted to the hospital, but the number of patients diagnosed with a given disease is small compared to the total number of patients. When these data are used to predict outcomes by machine learning and data mining, the learning of the algorithm is affected: the classifier assumes that the data are drawn from the same distribution as the training data, so presenting imbalanced data to the classifier produces unacceptable results [16]. Sampling methods and synthetic data generation provide a balanced class distribution by adding or removing samples. Naïve random oversampling generates new samples by randomly sampling the current samples of the minority class with replacement. With regard to synthetic sampling, Chawla et al. proposed the synthetic minority oversampling technique (SMOTE) [17]. SMOTE generates synthetic data for the minority class by selecting some of the nearest minority neighbors of each minority sample and generating synthetic minority data along the lines between the minority samples and their nearest minority neighbors [18]. Borderline-SMOTE, developed by Han et al., generates synthetic data along the lines between the borderline samples of the decision regions [19]. Random undersampling undersamples the majority class by randomly removing samples from it. Tomek proposed the Tomek links method to find all examples of the majority class closest to the minority class; these examples are then removed from the majority class, providing better borderline decisions for the classifier [20]. Wilson introduced the edited nearest-neighbors method, which applies a nearest-neighbor algorithm to delete samples of the majority class that lie near or around the borderline between classes [21].
Moreover, Tomek proposed a new repeated edited nearest-neighbor method, extending the edited nearest-neighbor method by repeating it multiple times [22].
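To make the simplest of these techniques concrete, naïve random oversampling can be sketched in a few lines of plain Python. This is a minimal illustration with made-up toy data (the function name and dataset are hypothetical), not the implementation used in any of the cited works:

```python
import random

def random_oversample(X, y, target_class, seed=0):
    """Naive random oversampling: resample the given minority class with
    replacement until it matches the size of the largest class."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == target_class]
    largest = max(y.count(c) for c in set(y))
    extra = largest - len(minority)
    new_X = list(X) + [rng.choice(minority) for _ in range(extra)]
    new_y = list(y) + [target_class] * extra
    return new_X, new_y

X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 1, 1]                     # class 1 is the minority (2 vs. 4)
Xb, yb = random_oversample(X, y, target_class=1)
print(yb.count(0), yb.count(1))            # 4 4 -> classes are now balanced
```

Because samples are drawn with replacement, the added rows are exact duplicates of existing minority samples, which is why oversampling alone can lead to overfitting.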
In predicting highly imbalanced disease data, Khalilia et al. presented a method to predict disease risks from highly imbalanced data, using random forest and combined repeated, random subsampling [16]. Bektas et al. applied feature selection and an oversampling method to predict imbalanced cardiovascular diseases [23]. Sharma and Verbeke handled imbalanced Dutch depression data by undersampling, oversampling, over-undersampling and random oversampling example (ROSE) techniques, and then applied extreme gradient boosting to classify mental illness [24].
According to the literature review, the following research questions are proposed: RQ1. Which questions in PHQ-9 were involved in the depression prediction performance of university students who responded to this questionnaire? Is it necessary to use all 9 questions? RQ2. Which oversampling or undersampling techniques should be used to achieve better depression prediction performance of university students who responded to this questionnaire?
RQ3. Is using only one dataset of depression enough to predict university student depression effectively?
This paper is organized as follows: Section 2 presents the proposed method; Section 3 summarizes the performance evaluation; the deployment and discussion are presented in Section 4; lastly, the conclusion is provided in Section 5.

Feature Selection
Feature selection is a process of extracting the most relevant features from the dataset; it generates a subset of the original feature set to improve predictive performance with less processing time. Some of the original features that were not important are eliminated. In this paper, the filter approach method was applied. This is a way to calculate the weight or the relative value of each feature by selecting only the features that are important to keep.

Chi-Square
The chi-square (χ²) test is used for categorical features in a dataset. The chi-square statistic between each feature and the target is calculated, and features are ranked in descending order of weight [25]:

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ,

where Oᵢ represents the observed value(s) and Eᵢ represents the expected value(s).
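The statistic can be computed for one categorical feature against the class target, with expected counts taken under independence. The following is a minimal plain-Python sketch (the function name and toy data are illustrative, not the implementation used in the paper):

```python
from collections import Counter

def chi_square_weight(feature, target):
    """Chi-square statistic between one categorical feature and the target:
    sum over contingency-table cells of (observed - expected)^2 / expected."""
    n = len(feature)
    f_counts = Counter(feature)
    t_counts = Counter(target)
    observed = Counter(zip(feature, target))
    chi2 = 0.0
    for f in f_counts:
        for t in t_counts:
            expected = f_counts[f] * t_counts[t] / n   # counts under independence
            chi2 += (observed[(f, t)] - expected) ** 2 / expected
    return chi2

# A perfectly predictive feature receives the maximum weight for a 2x2 table.
print(chi_square_weight([0, 0, 1, 1], ["no", "no", "yes", "yes"]))  # 4.0
```

A feature independent of the target scores 0, so ranking by this weight keeps only the features most associated with the class.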

Gini Index
The Gini index is a measure of the impurity of a dataset; a higher attribute weight denotes greater relevance. Suppose S is a dataset whose samples belong to k different classes (Cᵢ, i = 1, ..., k). S is divided into k subsets (Sᵢ, i = 1, ..., k), where Sᵢ is the sample set belonging to class Cᵢ and sᵢ is the number of samples in Sᵢ. The Gini index of set S is then defined as follows [26]:

Gini(S) = 1 − Σᵢ Pᵢ²,

where Pᵢ is the probability that a sample belongs to class Cᵢ, estimated using sᵢ/s. The minimum value of Gini(S) is 0, indicating that the maximum useful information can be obtained.
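The definition above translates directly into code. A minimal sketch with illustrative label sets (not the paper's implementation):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label set: 1 - sum_i (s_i / s)^2."""
    s = len(labels)
    return 1.0 - sum((count / s) ** 2 for count in Counter(labels).values())

print(gini([1, 1, 1, 1]))   # 0.0 -> pure set, maximum useful information
print(gini([0, 1]))         # 0.5 -> evenly split two-class set
```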

Information Gain
Information gain is calculated using the entropy value, which measures the difference or dispersion of the data: if the data are very different, the entropy is high; if the data are very similar, the entropy is low. The information gain of a feature x is defined as follows [27]:

IG(x) = Σₓ Σⱼ p(j, x) log₂ [ p(j, x) / (p(j) p(x)) ],

where p(j, x) is the joint distribution of class j and feature x.
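One common formulation computes this quantity as the mutual information between a feature and the class, estimated from empirical counts. A minimal plain-Python sketch with illustrative data (not the paper's implementation):

```python
from collections import Counter
from math import log2

def information_gain(feature, target):
    """Mutual information between a feature and the class, computed from the
    joint distribution p(j, x): sum of p(j,x) * log2(p(j,x) / (p(j) * p(x)))."""
    n = len(feature)
    p_x = Counter(feature)                 # marginal counts of feature values
    p_j = Counter(target)                  # marginal counts of class labels
    p_jx = Counter(zip(target, feature))   # joint counts
    return sum(
        (c / n) * log2((c / n) / ((p_j[j] / n) * (p_x[x] / n)))
        for (j, x), c in p_jx.items()
    )

# A feature that fully determines the class carries one full bit of information.
print(information_gain([0, 0, 1, 1], ["no", "no", "yes", "yes"]))  # 1.0
```

A feature independent of the class yields a gain of 0, so the measure ranks features by how much knowing the feature reduces uncertainty about the class.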

Proposed Method
In this paper, we proposed seven major steps to handle the problem of imbalanced depression data, as outlined in Figure 1.
1. The depression data were divided into training data, for generating the depression prediction model, and unseen testing data, for evaluating the performance of the model. For the training data, 10-fold cross-validation was used to divide the data into 10 equally sized sets. Each set was in turn used as the test set, while the random forest classifier was trained on the other nine sets [28].
2. Chi-square, the Gini index, and information gain were used to extract the principal feature items. These techniques were selected to automatically identify meaningful smaller subsets of feature items while still exhibiting high performance in the prediction of depression [29].
3. The reduced feature items of the training data were oversampled in the minority class using (1) random oversampling, (2) SMOTE, and (3) Borderline-SMOTE. The oversampling technique that balanced the data with the best prediction performance was then selected.
4. The reduced feature items of the training data were also undersampled by (1) Tomek links, (2) edited nearest neighbors, and (3) repeated edited nearest neighbors. The undersampling technique with the best prediction performance was then selected.
5. The best-performing oversampling and undersampling techniques from steps 3 and 4 were combined as a hybrid sampling approach. First, the random oversampling method was applied to generate new samples by randomly sampling the current training data with replacement; the minority class was oversampled, and a new balanced training dataset was created. Then, the Tomek links undersampling method was applied to reduce the number of samples in each class by removing unwanted overlap between classes: Tomek links were used to remove the majority-class samples closest to real or synthetic minority-class samples. The majority samples were deleted until a better boundary was provided for classifier decisions.
6. The new training data from step 5 were used to generate the depression prediction model with the random forest classifier.
7. The unseen testing data were used to evaluate the performance of the depression prediction model.
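Step 5, the combined sampling, can be sketched in plain Python. This is a toy illustration with hypothetical one-dimensional data and a brute-force 1-NN search, not the imbalanced-learn implementation used in the experiments:

```python
import random

def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def nearest(idx, X):
    """Index of the nearest other sample (brute-force 1-NN)."""
    return min((i for i in range(len(X)) if i != idx),
               key=lambda i: euclidean(X[idx], X[i]))

def oversample_then_tomek(X, y, minority, majority, seed=0):
    """Random oversampling of the minority class, followed by removal of
    majority samples that form Tomek links (mutual nearest neighbors of
    opposite classes)."""
    rng = random.Random(seed)
    X, y = list(X), list(y)
    # 1) Oversample the minority class with replacement until classes balance.
    pool = [x for x, label in zip(X, y) if label == minority]
    while y.count(minority) < y.count(majority):
        X.append(rng.choice(pool))
        y.append(minority)
    # 2) Drop the majority member of each Tomek link to clean the boundary.
    drop = set()
    for i in range(len(X)):
        j = nearest(i, X)
        if y[i] != y[j] and nearest(j, X) == i:
            drop.add(i if y[i] == majority else j)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

# The borderline majority point [2] sits in a Tomek link with [2.5] and is removed.
Xb, yb = oversample_then_tomek([[0], [1], [2], [2.5], [6], [7]],
                               [0, 0, 0, 1, 1, 1], minority=1, majority=0)
print(Xb, yb)   # [[0], [1], [2.5], [6], [7]] [0, 0, 1, 1, 1]
```

The sketch mirrors the pipeline's intent: oversampling equalizes the class counts, and the Tomek links pass removes only majority samples that overlap the minority region, sharpening the decision boundary for the subsequent random forest.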

Dataset Description
The data used in this paper were obtained from the PHQ-9, a self-assessment questionnaire developed from the Diagnostic and Statistical Manual of Mental Disorders, fourth edition (DSM-IV) [3]. It assesses depression by asking questions related to symptoms over the past 2 weeks, as shown in Table 1. The score for each question has four levels: none (0 points), some days or rarely (1 point), quite often (2 points), and almost every day (3 points). The total score over all questions ranges from 0 to 27 points. There are four depression symptom levels, and the detailed scores are described in Table 2. The data included 1549 examples collected from male and female undergraduate students of a university in the south of Thailand [29]. There were four classes of depression symptoms: class 0 (no symptoms), class 1 (mild symptoms), class 2 (moderate symptoms), and class 3 (severe symptoms).
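The total score described above can be computed directly from the nine item scores. The following sketch is illustrative only: the severity cut-offs in `severity_class` are assumptions for demonstration, since the paper's actual class boundaries are defined in its Table 2 and may differ:

```python
def phq9_total(answers):
    """Sum the nine PHQ-9 item scores (each 0-3), giving a total of 0-27."""
    assert len(answers) == 9 and all(0 <= a <= 3 for a in answers)
    return sum(answers)

def severity_class(total):
    """Map a total score to the four classes (0 = none, 1 = mild,
    2 = moderate, 3 = severe). The thresholds below are hypothetical,
    not the paper's Table 2 boundaries."""
    if total < 7:
        return 0
    if total < 13:
        return 1
    if total < 19:
        return 2
    return 3

print(phq9_total([1, 0, 2, 1, 0, 1, 1, 0, 0]))   # 6
```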

Ethical Consideration
Permission for the study was obtained from the Ethics Committee of Walailak University, Thailand (Protocol Number WUEC-18-060-01).

Assessment Metrics
In this paper, the performance of depression prediction was evaluated in terms of accuracy, precision, and recall. The confusion matrix in Table 3 summarizes the prediction results for the classification problem; the numbers of correct and incorrect predictions of each class are given as count values. {P, N} represent the positive and negative testing data, and {Y, N} represent the predicted results given by the classifier for the positive and negative classes [30]:

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F-measure = (1 + β²) × Precision × Recall / (β² × Precision + Recall),

where β is a coefficient used to adjust the relative importance of precision versus recall, usually set to 1. The F-measure is high when both recall and precision are high, indicating the goodness of a learning algorithm for the class of interest [28,31].
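These measures follow directly from the four confusion counts. A minimal sketch (TP, FP, FN, and TN denote the counts for the class of interest; the numbers used are arbitrary):

```python
def metrics(tp, fp, fn, tn, beta=1.0):
    """Accuracy, precision, recall, and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return accuracy, precision, recall, f

acc, p, r, f = metrics(tp=8, fp=2, fn=2, tn=8)
print(round(acc, 4), round(p, 4), round(r, 4), round(f, 4))  # 0.8 0.8 0.8 0.8
```

With β = 1 the F-measure reduces to the harmonic mean of precision and recall, which is why it stays low unless both are high.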

Results
In this paper, all sampling techniques were taken from the imbalanced-learn Python package [32]. To achieve the best prediction performance, the proposed method was applied several times to determine the best parameters for all sampling techniques, with k_neighbors = 5. For the random forest classifier, the number of trees was set to 100, and each tree was built to a maximum depth of 10.
The results achieved when using random forest to classify the dataset with only seven feature items [29] are shown in Table 4. Questions no. 1 and 9 were less important according to all three statistical weight measures of the feature selection methods. Table 5 shows the performance of depression prediction without any technique for handling imbalance, where the accuracy achieved was 91.26%. It can be seen that the precision of the model in predicting class 3 (severe depression) was only 57.14% and the recall was only 66.67%. Furthermore, although the precision of this model in predicting class 2 (moderate depression) was 84.62%, its recall for this class was only 42.31%. This shows that the model could not effectively predict classes 2 and 3. Therefore, in this paper, oversampling and undersampling techniques were applied to handle the imbalanced data in order to improve the predictive performance for the minority classes. The best oversampling and undersampling techniques were then applied together, following the research of Batista et al. [33], which indicated that, although oversampling minority class examples can balance class distributions, some skewed class distributions are not solved, which can lead to classifier overfitting. Thus, the undersampling technique was applied as a data-cleaning method: instead of removing only majority class examples, examples from both the minority and majority classes were removed. The oversampling techniques used in this paper were random oversampling, SMOTE, and Borderline-SMOTE. For each technique, Table 6 presents the results in terms of accuracy, precision, and recall. The results indicate that random oversampling was the best oversampling technique for these data, with classes 2 and 3 predicted more efficiently in terms of precision and recall than with the other oversampling methods, and an accuracy of 93.53%.
Moreover, we used undersampling techniques to balance the class distribution via elimination of majority class examples. The undersampling techniques used were Tomek links, edited nearest neighbors, and repeated edited nearest neighbors. Table 7 shows the prediction performance of all three techniques. The results show that Tomek links produced predictions with the best accuracy of 91.59%. Therefore, we applied random oversampling together with the Tomek links technique to improve the efficiency of depression prediction for both the majority and minority classes. Table 8 shows the number of samples after using the combined sampling techniques, displaying an almost balanced class distribution. To clarify the performance of the proposed prediction, a comparative study with random oversampling and the Tomek links technique is summarized in Table 9. The results show that our proposed method achieved the best depression prediction performance as measured by the accuracy, precision, recall, and F-measure evaluation indicators. In addition, an independent t-test was applied to the precision, recall, and F-measure results to evaluate and confirm the statistical significance of our performance results. Differences in performance were considered statistically significant when p < 0.05. The evaluation indicates that our proposed method performed significantly better than random forest with random oversampling and random forest with Tomek links in terms of precision and F-measure, as shown in Table 10. We applied feature selection including chi-square, the Gini index, and information gain [29] to identify significant features. As shown in Table 4, only seven important questions were found to influence the prediction of university student depression. Furthermore, the question with the least relevance to the prediction of depression was Q9 (weight of 0), consistent with Lotrakul et al. [34].
First, our research was conducted using oversampling. It was found that random oversampling was an effective technique for predicting the depression data, achieving an accuracy of 93.53%, as shown in Table 6. Next, we applied undersampling techniques to our dataset. The results in Table 7 indicate that the Tomek links technique produced the best prediction among the undersampling methods, with an accuracy of 91.59%. However, the Tomek links technique was less effective in predicting class 2 and class 3 depression, producing lower precision and recall values than the random oversampling technique. Although the random oversampling technique showed good predictive performance, using only random oversampling for the minority classes encountered problems of a skewed class distribution, leading to overfitting [35]. For this reason, our research applied a combination of random oversampling and Tomek links to handle our imbalanced depression data; Tomek links were thus also applied to remove majority class examples. The proposed method showed improved predictive performance for all classes, as shown in Tables 9 and 10. When considering the p-value, it was found that our model showed the best performance in terms of precision and F-measure among the sampling techniques. To assess external validity, our proposed method was also tested with another dataset, described in Table 11. The prediction performance is shown in Table 12. It can be seen that our proposed method was capable of predicting unseen depression data with an accuracy of up to 93.33%. The precision and recall remained satisfactory, and the model could also precisely predict severe depression. It can be concluded that our proposed method is effective in correctly predicting each class. Through the feature selection methods, we found that feature items Q8, Q4, Q6, and Q7 were the most important items in predicting depression. These features are related to cognition and physical change [33], whereby negative self-consciousness affects physical expression.
It was found that changes in physical symptoms always appear at the severe depression level [36]. When delving into items Q8, Q4, Q6, and Q7, it was evident that severe depression had a score level of 3 (almost every day) for these four items. These four items must therefore be considered or given special importance when screening for and predicting depression in university students. Moreover, the questions identified were consistent with the research of Lotrakul et al. [34], which showed that the most important item affecting depression was related to low energy, i.e., Q4 and Q8 in our paper. This is also consistent with the results of [25], whereby depressed patients mostly emphasized somatic and emotional symptoms, corresponding to Q6 and Q7 in our research.
Predicting university student depression is very important and must be the first step of primary care. Empirical research was conducted using data collected from a university in the south of Thailand. The results show that our proposed method can be applied to the correct identification of depression in a screening process. It can also reduce the time consumed and increase the accuracy of the assessment, while allowing practitioners to choose depression assessments that are sensitive and effective for screening university students.

Conclusions
Sampling techniques are effective methods for resolving the class imbalance problem. The concept of the proposed method involved applying random oversampling to minority class examples and Tomek links to majority class examples, thereby removing unwanted overlap between classes. The proposed method generated a relatively balanced dataset, without losing significant information. This balanced dataset was classified using random forest. The predictive depression performance of the proposed method was evaluated in terms of accuracy, precision, recall, and F-measure. The experimental results show that our proposed method outperformed the individual sampling techniques. Moreover, our proposed method can be used for effectively predicting university student depression, which is the primary step in screening for depression symptoms.
The limitation of this work is related to the depression prediction model using the PHQ-9 questionnaire. This research involved predicting depression among university students, discovering that only seven questions were important in depression prediction. Another dataset, e.g., involving patients of working age or the elderly, may use different prediction models for depression. Therefore, this model is only suitable for university students using PHQ-9.
Future studies will be carried out using other assessment methods; we also intend to use additional information provided by university students' social media posts to help improve the performance of depression prediction.