Realizing an Integrated Multistage Support Vector Machine Model for Augmented Recognition of Unipolar Depression

Unipolar depression (UD), also referred to as clinical depression, is a widespread mental disorder around the world. It is a serious health condition that influences a person’s daily routine. It also affects the person’s frame of mind, behavior, and several bodily functions such as sleep and appetite, and it can lead to a situation in which a person harms himself/herself or others. In several cases UD is arduous to detect because it is a comorbid condition. For that reason, this research proposes a more convenient approach for physicians to detect clinical depression at an early phase using an integrated multistage support vector machine model. Initially, the dataset is preprocessed using the multiple imputation by chained equations (MICE) technique. Then, to select the appropriate features, support vector machine-based recursive feature elimination (SVM RFE) is deployed. Subsequently, the integrated multistage support vector machine classifier is built by employing the bagging random sampling technique. Finally, the experimental outcomes indicate that the proposed integrated multistage support vector machine model surpasses methods such as logistic regression, multilayer perceptron, random forest, and bagging SVM (majority voting) in terms of overall performance.


Introduction
In recent years, depression has become a very prevalent disorder around the globe, affecting approximately 264 million individuals. Psychiatrists usually emphasize that this disorder is distinct from mood swings or ephemeral emotional reactions. When a depressive condition persists for a long duration, it can become a serious health condition. Its causes and effects are severe, and it critically impairs the day-to-day functioning of individuals. In the worst scenario, it might stimulate suicidal tendencies in an individual.
Among millennials (born 1981-1996), depression is found to be on the rise, and the reason is not apparent. Research shows that depression is greater among the younger millennials, which leads to risk factors such as substance abuse and behavioral failures [1]. The prevalence of depression symptoms rose from 9 percent to 15 percent between 2005 and 2015 [1]. The three main parts of the brain affected by depression are the hypothalamus, the prefrontal cortex, and the amygdala. Some of the common causes of depression are hormonal imbalance, stress, and genetic factors [2]. The symptoms of depression include prolonged feelings of regret, sadness, and hopelessness, irregular appetite, weight gain or weight loss, and many others. These days, mental health issues are increasing more rapidly than physical health issues [3]. It seems that almost everyone is affected by stress, anxiety, or depression [4].
Mental health is even more essential than physical health, as it directly affects physical health too. Care would be easier with proper techniques for assessing mental health as well [5]. There are a few significant issues in diagnosing and treating an individual affected by unipolar depression: it is not easy for a depressed individual to seek expert help because of motivation and cost, and in some cases the individual fails to take mental health seriously [6].
In order to treat depressed individuals better, we propose a machine learning classification algorithm, the integrated multistage support vector machine model. It is an ensemble-based classification algorithm in which support vector machine (SVM) classifiers are integrated with the help of the SVR-based weighted voting method to produce the outcome. Machine learning techniques are well suited to identifying patterns in a dataset and predicting the outcome. We gathered data with the help of a questionnaire and preprocessed it to handle the missing values. The preprocessed data were then passed through a feature selection technique to select the relevant features.
The key contributions of this work include the following:
• The multiple imputation by chained equations (MICE) method is deployed for preprocessing and cleaning the gathered dataset.
• The feature selection process is accomplished by employing the support vector machine-based recursive feature elimination (SVM RFE).
• The UD classification is performed using the proposed integrated multistage support vector machine classifier, which is built by employing the bagging random sampling approach.
The significant motivation of this research is to devise a random sampling-based integrated multistage SVM model for classifying the unipolar depression dataset, and we also attempt to enhance the overall performance of the proposed model.
The rest of this research work is organized as follows. Section 2 elucidates the methodology formulation process and provides a detailed outlook into the individual modules of the proposed integrated multistage SVM model. Section 3 focuses on the experimental results. Section 4 provides information regarding the conclusion and future work.

Utilized Dataset
The dataset used in this study was collected from various individuals with an average age of 30. We framed a questionnaire based on the "Hamilton Depression Rating Scale" [7] and prepared a self-rating report. The collected dataset had 3040 samples with 22 features, including the target variable. The features are demographic attributes and symptom scores. For processing, we split the dataset into training and testing sets (a 75-25 split). The model was trained with the training set, then tested with the test set and evaluated with specific performance metrics. Essential features are portrayed in Table 1.

Data Cleaning and Preprocessing
The data cleaning and preprocessing were performed by utilizing the multivariate imputation by chained equations (MICE) [8]. MICE is a flexible, advanced method for handling missing values [9]. This technique handles the missing values by imputing multiple values [10]. The primary assumption in MICE is that the imputation variables were from the observed values, not from the unobserved values [11]. The process of chained equations involves the following steps.
Step 1: For every missing value in the dataset, the mean imputation technique was performed. These mean imputations were considered as placeholders.
Step 2: The mean imputation placeholders for any one of the variables, say "var", were set back to null.
Step 3: The observed values of "var" from Step 2 were regressed on the other variables in the imputation model, which might or might not include all the variables from the dataset. In simpler terms, in this regression model, "var" was considered to be the dependent variable, and the other variables were considered to be independent.
Step 4: The values of "var" that had been set to null were now replaced with the imputations, that is, the predictions from the regression model. In the later stages, when "var" was used as an independent variable in the regression models of other variables, both the observed values and the imputed values were used.
Step 5: Steps 2-4 were repeated for every variable with missing values. Completing this for all such variables constituted one iteration, or one cycle. At the end of the first cycle, all the missing values had been replaced with predictions from regression models fitted to the observed data.
Step 6: Steps 2-4 were repeated for several cycles; the number of cycles depends on one's requirement. The imputed values were updated at the end of each cycle. The final imputed values were retained at the end, which formed an imputed dataset. The most common number of cycles used was ten.
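Steps 1-6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: it uses NumPy with ordinary least-squares regression for every variable, replaces placeholders with deterministic predictions instead of random draws, and returns a single imputed dataset. The function name `chained_imputation` and the toy data are hypothetical.

```python
import numpy as np


def chained_imputation(data, n_cycles=10):
    """Impute NaNs column by column with linear regression (Steps 1-6)."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)          # remember which cells were missing
    filled = data.copy()

    # Step 1: mean imputation as temporary placeholders.
    col_means = np.nanmean(data, axis=0)
    for j in range(data.shape[1]):
        filled[missing[:, j], j] = col_means[j]

    for _ in range(n_cycles):                  # Step 6: repeat for several cycles
        for j in range(data.shape[1]):         # Step 5: visit every variable
            rows = missing[:, j]
            if not rows.any():
                continue
            # Step 2: the placeholders of "var" (column j) are ignored,
            # so only its observed rows are used for fitting.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(filled)), others])
            # Step 3: regress the observed "var" on the other variables.
            coef, *_ = np.linalg.lstsq(design[~rows], filled[~rows, j], rcond=None)
            # Step 4: replace the placeholders with the predictions.
            filled[rows, j] = design[rows] @ coef
    return filled


# Toy data (hypothetical): the third column is an exact linear function
# of the first two, and three of its values are missing.
rng = np.random.default_rng(0)
a = rng.normal(size=40)
b = rng.normal(size=40)
complete = np.column_stack([a, b, 2.0 * a - b + 1.0])
with_gaps = complete.copy()
with_gaps[[3, 10, 25], 2] = np.nan

imputed = chained_imputation(with_gaps)
```

Because the third column is exactly linear in the other two in this toy case, the regression recovers the missing entries; with real questionnaire data the imputed values would only be estimates.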

Selection of Features
The dataset we collected from the Hamilton Depression Rating Scale-based self-rating report had 22 features, including the target variable, and 3040 samples. We found that some of the features interacted with each other. Features that depended on each other directly affected the accuracy of the model. In order to reduce the interaction between the features and remove the irrelevant or redundant variables, we implemented a wrapper-based feature selection algorithm, support vector machine-based recursive feature elimination (SVM RFE).
Using this approach, nine features were selected. It should be noted, however, that choosing extra features does not guarantee higher accuracy in classification scenarios. Table 2 demonstrates the selected features and their indices for UD classification.
Logistic regression (LR) is a statistical approach borrowed by machine learning for predictive analysis. This approach is mainly used when the dependent or target variable is categorical. In logistic regression, the dependent variable must be dichotomous (i.e., binary, yes or no) [12]. The main assumptions made in logistic regression are that there are no outliers in the data and that there is no multicollinearity between the predictor variables. Logistic regression is an extension of linear regression for the case in which the target variable is categorical [13]. In this work, the penalized logistic regression uses the glmnet package in RStudio for predicting unipolar depression. Table 3 presents the parameter settings for the logistic regression approach. Logistic regression estimates the probability of event occurrence with the help of the following logistic function.
$$P = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j=1}^{n} \beta_j x_j\right)}}$$
where $P$ is the probability of event occurrence and $\beta_j$ is the coefficient of the predictor $x_j$, for j = 1, …, n.
Usually, as the complexity of a problem increases, so does the complexity of understanding it theoretically, and traditional statistical approaches are then sought after. Current studies show that neural networks, the multilayer perceptron (MLP) in particular, seem to be replacing traditional statistical approaches. Unlike the statistical models, the multilayer perceptron does not make any prior assumptions about the data distribution, and it can model even highly non-linear functions and be trained to generalize accurately to new, unseen data [14]. The multilayer perceptron is a model of interconnected nodes, or neurons, which are joined by weighted connection links that carry the output signals [14]. We implemented the MLP in RStudio by deploying the RSNNS package. Table 4 shows the parameter settings for the multilayer perceptron approach. The input and the output signals are connected with the help of these neurons and connection links. The net input is calculated by
$$N = \sum_{i=1}^{n} W_i I_i + B$$
where $N$ is the preactivation function or net input; $W_i$ is the weight associated with the connection link; $I_i$ are the inputs ($I_1, I_2, \ldots, I_n$); and $B$ is the bias.
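As a small numeric illustration of the logistic function and the net input described in this part, the sketch below evaluates both in plain Python. The function names and the example weights, inputs, and bias are hypothetical values chosen for illustration; they are not taken from Table 3 or Table 4, and passing the net input through the logistic function is shown only as one possible use.

```python
import math


def logistic(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def net_input(weights, inputs, bias):
    """Net input of a neuron: weighted sum of the inputs plus the bias."""
    return sum(w * i for w, i in zip(weights, inputs)) + bias


# Hypothetical example values.
weights = [0.5, -0.25]
inputs = [2.0, 4.0]
bias = 0.1

n = net_input(weights, inputs, bias)   # 0.5*2.0 + (-0.25)*4.0 + 0.1
p = logistic(n)                        # squashes the net input into (0, 1)
```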
Based on the error rate at every iteration, the weights of the neurons can be adjusted. The perceptron weight adjustment is calculated by
$$\Delta W = \eta \, d \, I$$
where $\Delta W$ is the change in the weights of the neurons; $\eta$ is the learning rate; $d$ is the predicted or desired output; and $I$ is the input.
In this work, we utilized the tuneRanger package in RStudio for the quick deployment of the random forests. Table 5 presents the hyperparameter settings of the random forest (RF) approach in this work. Random forest is an ensemble approach; it uses a recursive partitioning method to produce numerous trees, which are then aggregated to get the results [15][16][17]. Every tree in the random forest was constructed independently from a bootstrap sample of the training data. Each tree was constructed using two-thirds of the training data, and the remaining one-third was used for testing the tree. The error rate of the forest depends on the strength of the individual trees and the correlation between the trees. The main advantage of random forest is that there is no need for a separate cross-validation method, as the approach has a built-in estimate, the out-of-bag error, that determines the test set error in an unbiased manner. Compared with a single decision tree, random forest seems to have better accuracy, to be less dependent on the training set, and to be more tolerant to noise.
Support vector machine (SVM) is a machine learning algorithm that can be used for both regression and classification problems, but it is mostly used for binary classification [18,19]. In this work, we utilized the e1071 package in RStudio for the deployment of the SVM classifier. Table 6 illustrates the hyperparameter settings for the SVM classifier in this work.
When labeled training data are given as input, the model gives an optimal hyperplane as output, which categorizes the samples. It is easy to maintain a linear hyperplane between two classes that are clearly separated. However, when there is no clear boundary between the data points of the classes, a linear separation is not possible [20]. For such situations, SVM has a strategy called the kernel. Kernel techniques map a non-separable space into a separable space and are used in non-linear separation models. Some of the commonly used kernels are the Gaussian kernel, the polynomial kernel, and many more [21,22].
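The effect of a kernel can be seen on a tiny example. The sketch below assumes scikit-learn is available (the study itself used the e1071 package in R) and fits a linear SVM and a Gaussian (RBF) kernel SVM to the four XOR points, a standard case that no single straight line can separate. The values C = 10 and gamma = 1 are illustrative choices, not the settings of Table 6.

```python
import numpy as np
from sklearn.svm import SVC

# XOR layout: opposite corners share a label, so no straight line separates them.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([0, 0, 1, 1])

linear_svm = SVC(kernel="linear", C=10.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)

linear_acc = linear_svm.score(X, y)   # training accuracy with a linear hyperplane
rbf_acc = rbf_svm.score(X, y)         # training accuracy with the Gaussian kernel
```

The linear model cannot classify all four points correctly, whereas the Gaussian kernel separates them in the implicit feature space.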

Integrated MultiStage Support Vector Machine Classification Model
The proposed integrated multistage support vector machine classification model comprises two segments: the first is the design of the SVM classifier, and the second is the ranking and selection of the UD features.

Design of Integrated Multistage Support Vector Machine Classifier
In the proposed model, we combined the individual SVM classifiers into a stronger and more accurate model to improve the robustness and the generality of the SVM classifier. The deployment of this integration model depends on two factors: (i) an efficient way to build the member classifiers in line with the integration technique, and (ii) how to fuse all the member classifiers into a robust classifier. Therefore, to form a group of member classifiers, a random sampling method based on bagging is applied repeatedly [23]. For every individual SVM member classifier, around 75% of the original data sample is selected randomly for the training set, and the rest of the samples are used as a test or validation set to evaluate the performance of the model. A grid search over the ranges C = {1, 2, 3, 4, 5, …, 30} and γ = {0.1, 0.2, 0.3, 0.4, 0.5, …, 5} is carried out to determine the optimum values of C and γ. Without optimizing the number of members in the integrated classifier, we implemented 10 different SVM classifiers with data from 10 random samplings and validated them using 10-fold cross-validation. This technique uses SVM RFE as the base learner. Thus, we constructed an SVM classifier with ten members in this study. In the SVM RFE, the features are selected by the member classifiers based on their rankings under the support vector ratio-based ranking criterion. As the member classifiers are built with different random samples, they tend to behave differently from each other, and they can produce different classification outcomes for the same data. As the final step, to integrate the decisions of the individual classifiers into the ensemble SVM classifier, the SVR-based weighted voting technique is implemented. The overall design of the proposed method is shown in Figure 1.
Once the integrated classifier is built, it can be used for any classification task, as shown in Figure 1. Figure 1 also shows that, once a member classifier was trained, the remaining samples, 25% of the training set, were used as a temporary validation set for evaluating the performance of the model. In order to maintain the diversity of the classifiers and the simplicity of the integrated model, we used m = 10 member classifiers in this study.
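The construction described in this section can be sketched as follows. This is only an illustration under several assumptions: it uses scikit-learn instead of the e1071 package, synthetic two-class data instead of the questionnaire dataset, fixed values of C and γ in place of the grid search, and, because the text does not spell out the weighting formula, a hypothetical weight of one minus the support vector ratio for each member's vote.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# Synthetic stand-in data (assumption): two well-separated classes.
X, y = make_blobs(n_samples=400, centers=[(-3, -3), (3, 3)],
                  cluster_std=1.0, random_state=0)

M = 10                      # number of member classifiers, as in the text
members, weights, valid_acc = [], [], []
for _ in range(M):
    # Random sampling: about 75% for training, the rest for validation.
    idx = rng.permutation(len(X))
    cut = int(0.75 * len(X))
    train, valid = idx[:cut], idx[cut:]

    # Fixed C and gamma (assumption; the study tunes them by grid search).
    clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X[train], y[train])

    # Support vector ratio: support vectors / training samples.
    svr = clf.n_support_.sum() / len(train)
    members.append(clf)
    weights.append(1.0 - svr)                      # hypothetical weighting rule
    valid_acc.append(clf.score(X[valid], y[valid]))  # per-member validation check

weights = np.array(weights) / np.sum(weights)      # normalize the vote weights


def ensemble_predict(samples):
    """Weighted vote over the member predictions (labels are 0 or 1)."""
    votes = np.array([m.predict(samples) for m in members])  # shape (M, n)
    score = weights @ votes            # weighted share of votes for class 1
    return (score >= 0.5).astype(int)


ensemble_acc = np.mean(ensemble_predict(X) == y)
```

Replacing the weights with equal values would turn this into plain majority voting, which corresponds to the bagging SVM baseline mentioned in the comparison.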

Ranking and Selection of Features
The essential step in implementing the integrated multistage SVM classification model is selecting the feature subset, which eventually enhances the performance of the member classifiers. Figure 2 represents the flow diagram of the support vector ratio-based support vector machine recursive feature ranking. Irrelevant variables and variables that interact with each other usually slow down the model in terms of computation and storage during training or prediction. Sometimes, irrelevant features can also have drastic effects on the learning phase of the model. To improve the performance of the SVM classifier, we implemented an effective feature ranking and feature selection method to remove the irrelevant features from the 22 available features in the dataset, which can be seen in Table 1. The commonly used feature selection algorithms fall into two categories, filter methods and wrapper methods [24]. Although filter methods are simple and computationally efficient, they are often not considered because they do not take into account the interaction between the features, which reduces the optimality of the feature subset. On the other hand, wrapper methods evaluate the features jointly and iteratively, which captures the interaction between the features effectively [25].
Due to the above-mentioned advantage, we used a wrapper method for feature selection in constructing the ensemble-based SVM classifier. Among the existing wrapper-based feature selection methods, SVM RFE is considered the most effective [26]. In this study, we implemented RFE as a part of the RBF SVM classification with the help of the support vector rate (SVR) metric for ranking all the 22 features shown in Table 1. The SVR is given by
$$\mathrm{SVR} = \frac{N_{SV}}{N_{train}}$$
where $N_{SV}$ is the number of support vectors and $N_{train}$ is the number of training samples. The features are the support vectors in SVM; it is known that some of the support vectors help in minimizing the computational load of SVM and also improve its efficiency during training. The ranking process is illustrated in Figure 2. The algorithm for the ranking process is as follows.
Step 1: Initialize the feature set S with all the 22 features from the dataset.
Step 2: Let R denote the ranked feature set.
Step 3: Eliminate one feature from the set and train the SVM model with the remaining 21 features. The classifier is initialized with empirical parameters in order to calculate the SVR, which allows us to find out the contribution of the removed feature.
Step 4: Repeat Step 3 for all the 22 features in the dataset. The feature with the higher SVR after removal is placed in the ranked set R. It implies that the feature is not a support vector and is far away from the hyperplane.
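The leave-one-feature-out ranking in Steps 1-4 can be sketched as follows. The sketch assumes scikit-learn, takes the SVR as the number of support vectors divided by the number of training samples (an assumption about the metric's definition), and uses a small synthetic dataset with four features, of which only the first is informative, instead of the 22 questionnaire features. The parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): feature 0 carries the class signal,
# features 1-3 are pure noise.
n = 200
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 4))
X[:, 0] += 4.0 * (2 * y - 1)

S = list(range(X.shape[1]))   # Step 1: full feature set
R = []                        # Step 2: ranked feature set

svr_after_removal = {}
for f in S:                   # Steps 3-4: remove each feature in turn
    kept = [g for g in S if g != f]
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X[:, kept], y)
    # SVR: share of training samples that end up as support vectors.
    svr_after_removal[f] = clf.n_support_.sum() / n

# Order the features by the SVR obtained after their removal (highest first).
R = sorted(S, key=lambda f: svr_after_removal[f], reverse=True)
```

In this toy setting, removing the informative feature leaves only noise, so far more training samples become support vectors and that feature appears first in R.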

Results and Discussion
The collected dataset had 3040 samples with 22 features, including the outcome variable. We preprocessed the dataset to handle the missing values using the MICE technique. Once the missing values were handled, we applied a wrapper-based feature selection technique, SVM RFE, to eliminate the less relevant and low-performing features from the set. The algorithm removed the features iteratively and ranked them based on the SVR score. From the total of 22 features, the algorithm selected nine as the most important ones. These nine features did not depend on each other, and there was no interaction among them. The dataset was then divided into training and testing sets; the model was trained with the training set and evaluated with the testing set. The composition was 75-25 for the training and testing sets, respectively, with 10-fold cross-validation. In the numerical implementation, we built the proposed method with 10 member SVM classifiers and then integrated them with the help of the SVR-based weighted voting technique, as explained in the previous section.
To evaluate the proposed model, we used the confusion matrix [27]. The confusion matrix was used to validate the performance of the model on test data whose true values were known. The terms involved in the confusion matrix are the true positive TP (model prediction positive, actual outcome positive), true negative TN (model prediction negative, actual outcome negative), false positive FP (model prediction positive, actual outcome negative), and false negative FN (model prediction negative, actual outcome positive). From the confusion matrix, different performance metrics can be calculated, such as accuracy, specificity, precision, sensitivity, and F-measure [27]. The respective formulas for the metrics can be seen in Table 7. The results are tabulated in Table 8; the proposed model was compared with other methods such as logistic regression (LR), multilayer perceptron (MLP), random forest (RF), and bagging SVM (majority voting). Figure 3 represents the confusion matrices for LR, MLP, RF, bagging SVM (majority voting), and the proposed model, respectively. A comparison of the evaluation metrics of the proposed model with those of the other approaches is illustrated in Figure 4. It can be observed that the proposed model surpasses all other compared approaches in overall performance and accuracy. The receiver operating characteristic (ROC) curves for LR, MLP, RF, bagging SVM (majority voting), and the proposed model are depicted in Figures 5, 6, 7, 8, and 9, respectively. A stability comparison between the integrated SVM classifier and the member classifiers is shown in Figure 10.
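The metrics derived from the confusion matrix can be computed directly from the four counts. The sketch below uses the standard textbook definitions, which are assumed to match Table 7, and made-up counts that are not results from this study; the function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return standard metrics computed from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # true positive rate (recall)
    specificity = tn / (tn + fp)                 # true negative rate
    precision = tp / (tp + fp)                   # positive predictive value
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "specificity": specificity,
        "precision": precision,
        "sensitivity": sensitivity,
        "f_measure": f_measure,
    }


# Made-up counts for illustration only (not results from the study).
metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
```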

Conclusion
In this study, we proposed an effective ensemble-based classification model, the integrated multistage support vector machine classification model, for enhancing the prediction accuracy of UD. As the first step, we cleaned the data with MICE to handle the missing values. Then we implemented SVM RFE, a wrapper-based feature selection technique, to reduce the feature dimension and select the necessary features that do not depend on each other, which eventually improves the accuracy of the model. The original dataset had 22 features, on which the feature selection technique was applied. We used a 75-25 composition for the training and testing datasets. The results indicate that the proposed methodology improved the prediction accuracy of UD when compared with other classification models. The proposed model was better than all other compared approaches in terms of overall performance and also offered greater accuracy.