Assessing the Reliability of Machine Learning Models Applied to the Mental Health Domain Using Explainable AI

Abstract: Machine learning is increasingly and ubiquitously being used in the medical domain. Evaluation metrics like accuracy, precision, and recall may indicate the performance of the models but not necessarily the reliability of their outcomes. This paper assesses the effectiveness of a number of machine learning algorithms applied to an important dataset in the medical domain, specifically mental health, by employing explainability methodologies. Using multiple machine learning algorithms and model explainability techniques, this work provides insights into the models' workings to help determine the reliability of their predictions. The results are not intuitive: the models were found to focus significantly on less relevant features and, at times, to rely on an unsound ranking of the features to make their predictions. This paper therefore argues that research in applied machine learning should provide insights into the explainability of models in addition to other performance metrics like accuracy. This is particularly important for applications in critical domains such as healthcare.


Introduction
Mental health is a serious issue. For the past few years, a non-profit organization, Open Sourcing Mental Health (OSMH), has been publishing the results of its survey on the mental health of employees, primarily in the tech/IT industry, under the Creative Commons Attribution-ShareAlike 4.0 International license. A literature survey shows that the dataset [1] is popular among researchers. Multiple articles have used machine learning models on the dataset(s) to draw valuable insights. The models have been treated as black boxes, assessed using conventional evaluation metrics, and concluded to have performed significantly well. However, no attempts have been made to gauge the reliability of the results. LIME [2] and SHAP [3] have become popular tools for enhancing the transparency and accountability of machine learning models, enabling users to gain insights into their decision-making processes and build trust in their predictions. Both approaches aim to unravel the underlying logic driving the models' predictions, particularly for individual data points. We therefore attempted to use them to see how reliable the models were in their prognosis.
Our experiments in trying to justify the results using LIME [2] and SHAP [3] show that the class predictions relied significantly on unsound weights for features in the dataset. These experiments demonstrate the need to supplement conventional metrics with explanations of the models' behavior and justification of the results. This work aims to answer the following research questions:
• RQ1: How reliable are evaluation metrics such as accuracy in assessing machine learning model performance in making important predictions?
• RQ2: How well do the explainable AI techniques SHAP and LIME complement conventional evaluation metrics?
• RQ3: How do the various machine learning algorithms compare when it comes to the explainability of their outcomes?
A brief review of the current literature follows. A search for "Mental Health in Tech Survey" in Google Scholar shows scores of articles that apparently have used the OSMH datasets. A few of them [4,5] applied machine learning to predict the mental health of employees mainly working in the tech/IT sector. One of them [6] attempted to interpret the model using SHAP and Permutation Importance, but no conclusions were drawn. Even in their analysis, for a past mental health disorder, "whether it would interfere with work if treatment is not effective" weighed more than "whether it has been professionally diagnosed", which does not make much sense. Generalized Linear Models applied to the dataset show a high correlation between mental health issues and working for a primarily technology-oriented company combined with certain demographics [7].
Prior work on linear regression, Multi-Layer Perceptron, and Echo State Networks shows that there is a close relationship between the SHAP values and differences in the performance of the models [8]. Since the models differ in complexity, the paper concludes that model complexity is an important consideration concerning explainability. The metrics used in that work are monotonicity, trendability, and prognosability. There has been an attempt to measure the effectiveness of explainable techniques like SHAP and LIME [9]. The authors created a new methodology to evaluate precision, generality, and consistency in attribution methods. They found that both SHAP and LIME lacked all three of these qualities, leading to the conclusion that more investigation is needed in the area of explainability. In this context, it is also worth mentioning the explainable AI (XAI) toolkit [10] that is built on top of DARPA's efforts in XAI. The toolkit is purported to be a resource for everyone wanting to use AI responsibly.
Explainability and fairness have been recommended for AI models used in healthcare in what the authors of [11] call "the FAIR principles". Understandably, the recent literature has abundant work in these growing areas. Classification and explainability were attempted on publicly available English-language datasets related to mental health [12]. The focus of that work was on analyzing the spectrum of language behavior evidenced on social media. Interestingly, t-SNE and UMAP were also considered explainable artificial intelligence (XAI) methods in a survey of the current literature on the explainability of AI models in medical health [13]. Explainable AI (XAI) continues to be a hot topic in research, particularly in the health domain, as can be observed from the recent literature [14,15]. Machine learning models deployed in the health domain can suffer from fairness issues too, and some of the methods to address fairness can degrade the model's performance in terms of both accuracy and fairness [16].
In an interesting scenario, SHAP explanations were used as features to improve machine learning classification [17]. The article discusses the cost of wrong predictions in finance and how SHAP explanations can be used to reduce this cost. The authors propose a two-step classification process where the first step uses a base classifier and the second step uses a classifier trained on SHAP explanations as new features. They test their method on nine datasets and find that it improves classification performance, particularly in reducing false negatives.
However, SHAP and LIME are vulnerable to adversarial attacks. In an article about fooling post hoc explanation methods for black box models [18], the authors propose a new framework to create "scaffolded" classifiers that hide the biases of the original model. They show that these new classifiers can fool SHAP and LIME into generating inaccurate explanations. A recent detailed survey [19] on explainable AI (XAI) methods highlighted their limitations, including those of LIME. In yet another recent survey on XAI methods, including LIME and SHAP, for use in the detection of Alzheimer's disease [20], the authors also highlighted the limitations and open challenges with XAI methods. A number of open issues with XAI have been discussed under nine categories, providing research directions, in a recently published work [21].

Contribution
Our literature survey highlighted the advantages and disadvantages of explainability techniques. Despite the limitations of SHAP and LIME described above, this work successfully uses them to show how the black box approach of using machine learning algorithms can be dangerously misleading. For instance, our experiments show that, with significant accuracy and a 100% prediction probability, the SGD Classifier model may predict that a person does not have a mental health condition even if they answered "sometimes" to the survey question "If you have a mental health condition, do you feel that it interferes with your work?" In doing so, the model relies more on the fact that the person did not answer "often" or "rarely" to the same question than on the fact that the person did answer "sometimes". The very fact that the person answered this question implies that the person accepts they have a mental health condition, yet the SGD Classifier predicts that the person does not. The prediction is therefore highly misleading. To the best of our knowledge, this work is unique in highlighting the inadequacy of conventional evaluation metrics in assessing machine learning model performance in the mental health domain, and it argues strongly that the metrics need to be corroborated with a deeper understanding of the relationships between features and outcomes.

Paper Organization
The remainder of this paper is organized as follows. Section 2 describes the approach, detailing the dataset, tools, and experiments. The results from the experiments are presented in Section 3. Section 4 discusses the results and presents the analysis. Finally, the conclusion is presented in Section 5.

Materials and Methods
After performing some intuitive data preprocessing steps on the dataset, such as dropping irrelevant columns like timestamps and comments, smoothing the outliers, and renaming the columns for consistency, class prediction was performed using a host of machine learning algorithms, all of which are popular in the literature. The algorithms used were logistic regression [22], K-NN [23], decision tree [24], random forest [25], Gradient Boosting [26], AdaBoost [27], SGD Classifier [28], Naive Bayes [29], Support Vector Machines [30], XGBoost [31], and LightGBM [32]. For brevity, the algorithms are not described here; citations are provided instead. The selection of the algorithms was based on the existing literature [4].
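As a minimal sketch of this setup, several of the classifiers named above can be fitted and scored with scikit-learn. The synthetic data, random seeds, and default hyperparameters below are illustrative stand-ins, not the paper's actual settings.

```python
# Illustrative sketch: fit a few of the classifiers listed above on a
# synthetic stand-in for the preprocessed survey data and report
# accuracy and F1 (the metrics used in the paper).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
# Same 0.7:0.3 train/test ratio as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "sgd": SGDClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = (accuracy_score(y_test, pred), f1_score(y_test, pred))

for name, (acc, f1) in scores.items():
    print(f"{name}: accuracy={acc:.3f}, f1={f1:.3f}")
```

The remaining algorithms (e.g., XGBoost, LightGBM) plug into the same loop, since they expose the same fit/predict interface.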
LIME (Local Interpretable Model-Agnostic Explanations) [2] and SHAP (Shapley Additive Explanations) [3] are two prominent techniques used to demystify the complex workings of machine learning models. These are much better than merely computing feature importance using, say, information gain [24]. For instance, features with more unique values tend to have higher information gain simply due to the increased opportunity for splits, even if their true predictive power is limited. Also, information gain only considers individual features and does not capture how features might interact with each other to influence the outcome. On the other hand, attribution methods like SHAP and LIME capture complex relationships between features and can identify synergistic effects. We therefore attempted to justify the results using the explainability algorithms SHAP and LIME. Both SHAP and LIME are used for complementary insights: LIME excels at local explanations [2], while SHAP provides global feature importance and theoretical guarantees [3]. Details are provided below.

Dataset
The dataset used for the experiments in this work is called the "Mental Health in Tech Survey" [1]. The data are from a survey conducted in 2014 pertaining to the prevalence of mental health issues among individuals working in the technology sector. Some of the questions asked in the survey include whether the respondent has a family history of mental illness, how often they seek treatment for mental health conditions, and whether their employer provides mental health benefits, alongside questions about demographics, work environment, job satisfaction, and other mental health-related factors. The class was determined by the answer to the survey question "Have you ever been diagnosed with a mental health disorder?" and was used as the target variable for the experiments. For data preprocessing, we dropped the timestamp, country, state, and comments columns because they contained a number of missing values and are not as relevant for the prediction of mental health. Also, the numerical column "Age" contained many outliers and missing values; we therefore replaced those values with the median for our experiments. The data were then split into training and test sets in a ratio of 0.7:0.3.
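The preprocessing described above can be sketched in pandas. The tiny hand-made DataFrame below is a stand-in for the survey data; apart from "Age", the column names and the use of "treatment" as a stand-in target are illustrative assumptions.

```python
# Illustrative sketch of the preprocessing steps: drop sparse/irrelevant
# columns, replace Age outliers and missing values with the median, and
# split 0.7:0.3.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Timestamp": ["2014-08-27"] * 6,
    "Age": [29, 35, -1, 99999, 41, None],  # contains outliers and a missing value
    "family_history": ["Yes", "No", "No", "Yes", "No", "Yes"],
    "treatment": ["Yes", "No", "No", "Yes", "Yes", "No"],
    "comments": [None] * 6,
})

# Drop columns judged irrelevant or too sparse for prediction.
df = df.drop(columns=["Timestamp", "comments"])

# Replace implausible or missing ages with the median of the plausible ones.
plausible = df["Age"].between(18, 100)
median_age = df.loc[plausible, "Age"].median()
df.loc[~plausible, "Age"] = median_age

# One-hot encode categoricals and split 0.7:0.3.
X = pd.get_dummies(df.drop(columns=["treatment"]))
y = (df["treatment"] == "Yes").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```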

LIME: Local Interpretable Model-Agnostic Explanations
LIME [2] is a technique used to explain individual predictions made by machine learning models that are often used as "black boxes". It is model-agnostic and works with any type of model, regardless of its internal structure or training process. LIME provides explanations specific to a single prediction, rather than trying to interpret the model globally. This allows for capturing nuanced relationships between features and outcomes for individual data points. LIME explanations are formulated using human-understandable concepts, such as feature weights or contributions. This enables users to grasp why the model made a particular prediction without needing deep expertise in the model's internal workings.
The core principle of LIME relies on surrogate models that are simple and interpretable, such as linear regression. Such a surrogate model is fitted to approximate the black box model's behavior around the data point of interest. This is achieved through the following steps.

• Generate perturbations: A set of perturbed versions of the original data point is created by slightly modifying its features. This simulates how changes in input features might affect the model's prediction.
• Query the black box: The model's predictions for each perturbed data point are obtained.
• Train the surrogate model: A local surrogate model, such as a linear regression, is fit to the generated data and corresponding predictions. This model aims to mimic the black box's behavior in the vicinity of the original data point.
• Explain the prediction: The weights or coefficients of the trained surrogate model represent the contributions of each feature to the model's output. These weights are interpreted as the explanation for the original prediction.
The surrogate model can often be represented as

f(x) = w_0 + Σ_i w_i x_i

where
• f(x) is the prediction of the surrogate model for an input data point x.
• w_0 is the intercept.
• w_i are the coefficients for each feature x_i.

The weights w_i then explain the model's prediction, indicating how much each feature influenced the outcome. LIME uses various techniques to ensure the faithfulness of the explanation to the original model, such as regularization and the weighting of perturbed points based on their proximity to the original data point.
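The four steps above can be sketched from scratch without the lime library: perturb an instance, query a black box, fit a proximity-weighted linear surrogate, and read off its coefficients. The random forest "black box", the Gaussian perturbation scale, and the RBF proximity kernel width below are all illustrative choices, not LIME's exact defaults.

```python
# From-scratch sketch of LIME's core loop: perturb, query, fit a
# weighted linear surrogate, interpret its coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

x0 = X[0]  # the instance whose prediction we want to explain

# Step 1: generate perturbations around x0.
Z = x0 + rng.normal(scale=0.5, size=(1000, x0.size))

# Step 2: query the black box for its predicted probability of class 1.
p = black_box.predict_proba(Z)[:, 1]

# Step 3: fit a local surrogate, weighting samples by proximity to x0
# (RBF kernel, so nearby perturbations count more).
dist = np.linalg.norm(Z - x0, axis=1)
weights = np.exp(-(dist ** 2) / 0.5)
surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=weights)

# Step 4: the coefficients w_i are the local explanation.
explanation = dict(enumerate(surrogate.coef_))
print(explanation)
```

The lime library automates these steps (and handles categorical features and discretization), but the fitted coefficients play the same role as the bars in the LIME plots discussed later.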

SHAP: Shapley Additive Explanations
SHAP [3] is also a model-agnostic technique for explaining machine learning models by assigning each feature an importance value for a particular prediction. It provides both individual prediction explanations (local) and overall feature importance insights across the dataset (global). It applies Shapley values [33] from coalitional game theory to calculate feature importance. In a game with players (features), a Shapley value represents a player's average marginal contribution over all possible coalitions (subsets of features). SHAP analyzes how each feature's marginal contribution impacts the model's output, considering all possible feature combinations.
For a model f and an input instance x with n features F = {x_1, x_2, ..., x_n}, the SHAP value for a feature x_i is defined as

φ_i = Σ_{S ⊆ F \ {x_i}} [ |S|! (n − |S| − 1)! / n! ] ( f(S ∪ {x_i}) − f(S) )

where
• f(S) is the model's prediction using only the features in S.
• n is the number of features.
• |S|! (n − |S| − 1)! / n! can be considered a weighting factor.

As can be seen from the above equation, the marginal importance of each feature is computed by including and excluding the feature in the prediction. SHAP assumes an additive feature attribution model, but not necessarily in the units of the prediction. Therefore, a "link function" g (for example, the logit when the model outputs probabilities) is used to make the attributions additive: g(f(x)) = φ_0 + Σ_i φ_i.

The methodology for the experiments is further described in Figure 1. After preprocessing the data, various machine learning models are used to make predictions and are evaluated using a number of metrics. The models are then passed as parameters to LIME and SHAP explainers to generate interpretable plots.

Results
All the machine learning algorithms used for prediction and the attribution methods SHAP and LIME were used via their Python implementations, with their limited hyperparameters fine-tuned by grid search. The classification results obtained are similar to the ones in the literature [4] and are tabulated in Table 1 and Figure 2. To assess any impact due to class imbalance, we also computed the F1-score in addition to the other metrics. The machine learning algorithms used for this work are the same as the ones used in [4], which serves as the baseline for comparison with our work. The hyperparameters used with each of the models are summarized in Table 2. The specific findings from the application of SHAP and LIME to each of the models are discussed in the following subsections.
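The grid search mentioned above can be sketched with scikit-learn's GridSearchCV. The parameter grid, scoring choice, and synthetic data below are illustrative, not the search space actually used in the paper (which is summarized in Table 2).

```python
# Illustrative grid search over a couple of random forest hyperparameters,
# scored by F1 with 3-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="f1",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```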

Logistic Regression
The evaluation metrics precision, recall, F1 score, and accuracy obtained from the application of logistic regression are tabulated in Table 1.

Analysis of Results for SHAP and LIME
Based on the outcomes derived from employing logistic regression in conjunction with the SHAP and LIME methods, it is evident that the variable "Work Interference(Sometimes)" significantly influences the determination of mental health conditions among study participants.
In Figure 3a, at the top, there are three charts. The leftmost is the predicted outcome; in this specific instance, the outcome is "no mental health issue with 90% certainty", indicated by the blue-colored bar. The rightmost chart lists the feature-value pairs for the specific instance. Each feature in the middle plot is represented by a color-coded bar whose length (positive or negative) indicates its overall contribution to the prediction. Higher positive values imply the feature pushes the prediction toward the outcome of the corresponding color, while higher negative values imply it opposes it. For instance, the biggest factor that pushed the prediction to a "no" is in blue, called "Work_Interfere_Sometimes", which corresponds to the survey question "If you have a mental health condition, do you feel that it interferes with your work?" Interestingly, it played a more significant role in the model predicting "no" than "Work_Interfere_Rarely", which is a fallacy. One would expect an answer of "rarely" (or, even more so, an answer implying the condition never interferes) to be a better predictor of the complete absence of a mental health condition than an answer of "sometimes", since a condition that interferes "sometimes" evidently still interferes once in a while.
LIME is a local interpretation of a specific instance. SHAP, on the other hand, offers both individual prediction explanations and overall feature importance insights. Individual plots tell us why a specific prediction was made, while summary plots show the average impact of each feature across the entire dataset. Since individual plots were examined using LIME, Figure 3b at the bottom shows a summary plot for a more comprehensive analysis. As can be seen, answering "sometimes" to the survey question "If you have a mental health condition, do you feel that it interferes with your work?" weighs far more in the logistic regression model's predictions than answering "often" or "rarely", which is counter-intuitive. Similarly, an employer providing mental health benefits weighs more than respondent demographics like age and gender. Despite these apparent anomalies, logistic regression achieves superlative evaluation metrics, as can be seen from Table 1.

Explanations Based on the K-Nearest Neighbors Algorithm
Despite fine-tuning the hyperparameters, the K-NN algorithm does not perform as well as logistic regression. The evaluation metrics precision, recall, F1 score, and accuracy obtained from the application of the K-Nearest Neighbors algorithm are tabulated in Table 1.

SHAP and LIME Analysis of K-NN Model Performance
The analysis shows that K-NN differs significantly from logistic regression in terms of the explainability of its outcomes. Based on the results derived from employing the K-NN algorithm in conjunction with the SHAP and LIME methods, it is evident that the variables "Age" and "Work Interference(Sometimes)" significantly influence the determination of mental health conditions among study participants. The SHAP plot at the top in Figure 4 is considerably different from the one for logistic regression in terms of the relative importance attached to the features. As with logistic regression, answering "sometimes" to the survey question "If you have a mental health condition, do you feel that it interferes with your work?" weighs far more in the K-NN model's predictions than answering "often" or "rarely", which is counter-intuitive. But the SHAP values are more sensible than before because demographics like age, gender, and family history are given more weight. The results from LIME are also more sensible because, for the specific instance at the bottom in Figure 4, the person is predicted to have a mental health issue based on age, among other factors, while answering "no" to the survey question "If you have a mental health condition, do you feel that it interferes with your work?" and not having a family history pulled the outcome the other way (blue). It is also intuitive that the employer providing mental health benefits and the employee knowing about the care options that the employer provides contributed to the prediction of a mental health issue. However, the better sensibility of the explanations does not correlate with the not-so-appreciable evaluation metrics.

Explanations Based on the Decision Tree Algorithm
The decision tree model's performance is much better than that of K-NN. The evaluation metrics precision, recall, F1 score, and accuracy obtained from the application of the decision tree algorithm are tabulated in Table 1.

SHAP and LIME Analysis of the Decision Tree Model's Performance
Decision trees are inherently interpretable. Here, too, based on the outcomes derived from employing the decision tree algorithm in conjunction with the SHAP and LIME methods, it is evident that the variable "Work Interference(Sometimes)" significantly influences the determination of mental health conditions among study participants. This can be verified from Figure 5. Like before, "Work Interference(Sometimes)" ranks higher than "Work Interference(Often)" and "Work Interference(Rarely)". The ranking of the other features is more sensible than that for logistic regression. The instance examined by LIME is predicted to not have any mental health issues based on the response of "no" to family history and work interference, which is a reasonable conclusion.

Explanations Based on the Random Forest Algorithm
The random forest model performs reasonably well on the dataset. The evaluation metrics precision, recall, F1 score, and accuracy obtained from the application of the random forest algorithm are tabulated in Table 1.

SHAP and LIME Analysis of the Random Forest Model's Performance
As can be seen from Figure 6, based on the outcomes derived from employing the random forest algorithm in conjunction with the SHAP and LIME methods, the explanations are not entirely consistent with those from the decision tree model but are similar.

Explanations Based on the Gradient Boosting Algorithm
The evaluation metrics precision, recall, F1 score, and accuracy obtained from the application of the Gradient Boosting algorithm are tabulated in Table 1. The model performs well on the dataset.

SHAP and LIME Analysis of the Gradient Boosting Model's Performance
Like for the other models, based on the outcomes derived from employing the Gradient Boosting algorithm in conjunction with the SHAP and LIME methods, it is evident that the variable "Work Interference(Sometimes)" significantly influences the determination of mental health conditions among study participants. As can be seen from Figure 7, there is nothing significantly different for this model.

Explanations Based on the AdaBoost Algorithm
The evaluation metrics precision, recall, F1 score, and accuracy obtained from the application of the AdaBoost algorithm are tabulated in Table 1. The metrics are similar to those for most of the other models.

SHAP and LIME Analysis of the AdaBoost Model's Performance
Based on the outcomes derived from employing the AdaBoost algorithm in conjunction with the SHAP and LIME methods, it is evident that the variables "Work Interference(Sometimes)" and "Work Interference(Often)" significantly influence the determination of mental health conditions among study participants. Interestingly, from Figure 8, in the case chosen for analysis using LIME, none of the factors play a significant role in predicting a "no". The model is almost evenly balanced for this instance of the data.

Explanations Based on the Stochastic Gradient Descent Classifier Algorithm
The evaluation metrics precision, recall, F1 score, and accuracy obtained from the application of the SGD Classifier algorithm are tabulated in Table 1. The numbers are lower than those for the other algorithms.

SHAP and LIME Analysis of the SGD Classifier Model Performance
From Figure 9, based on the outcomes derived from employing the Stochastic Gradient Descent algorithm in conjunction with the SHAP and LIME methods, it is evident that the variables "Work Interference(Sometimes)" and "Work Interference(Often)" significantly influence the determination of mental health conditions among study participants. But as pointed out earlier, the instance picked for analysis using LIME shows an erroneous outcome. The SGD Classifier predicts that a person does not have a mental health condition even though they answered "sometimes" to the survey question "If you have a mental health condition, do you feel that it interferes with your work?" In doing so, the model relies more on the fact that the person did not answer "often" or "rarely" to the same question than on the fact that the person did answer "sometimes". The prediction is therefore highly misleading.

Explanations Based on the Naive Bayes Algorithm
The evaluation metrics precision, recall, F1 score, and accuracy obtained from the application of the Naive Bayes algorithm are tabulated in Table 1. In terms of the metrics, the algorithm does not perform as well as decision tree or logistic regression.

SHAP and LIME Analysis of the Naive Bayes Model's Performance
As can be seen from Figure 10, the SHAP values are more logical than for the other models. Based on the outcomes derived from employing the Naive Bayes algorithm in conjunction with the SHAP and LIME methods, it is evident that the variable "Work Interference(Often)" significantly influences the determination of mental health conditions among study participants. The LIME analysis is also more reasonable than for the other models.

Explanations Based on the Support Vector Machine Algorithm
The evaluation metrics precision, recall, F1 score, and accuracy obtained from the application of the SVM algorithm are tabulated in Table 1. The performance of the model in terms of these metrics is on par with the other top-performing models.

SHAP and LIME Analysis of the SVM Model's Performance
From Figure 11, based on the outcomes derived from employing the Support Vector Machine algorithm in conjunction with the SHAP and LIME methods, it is evident that the variable "Work Interference(Sometimes)" significantly influences the determination of mental health conditions among study participants. However, strangely, none of the answers to the other questions matter much to the model in making predictions. This holds consistently for both SHAP and LIME, locally and globally.

Explanations Based on the XGBoost Algorithm
The evaluation metrics precision, recall, F1 score, and accuracy obtained from the application of the XGBoost algorithm are tabulated in Table 1. The algorithm performs reasonably well in terms of these metrics.

SHAP and LIME Analysis of the XGBoost Model's Performance
From Figure 12, based on the outcomes derived from employing the XGBoost algorithm in conjunction with the SHAP and LIME methods, it is evident that the variable "Work Interference(Sometimes)" significantly influences the determination of mental health conditions among study participants. These findings are not too different from those for most other models.

Explanations Based on the LightGBM Algorithm
The evaluation metrics precision, recall, F1 score, and accuracy obtained from the application of the LightGBM algorithm are tabulated in Table 1. The numbers are similar to those of the other best-performing models.

SHAP and LIME Analysis of the LightGBM Model's Performance
As can be seen from Figure 13, based on the outcomes derived from employing the LightGBM algorithm in conjunction with the SHAP and LIME methods, it is evident that the variable "Work Interference(Sometimes)" significantly influences the determination of mental health conditions among study participants. Here, too, there are no major surprises.

Discussion
These experiments show that the models behave differently than what is commonly expected. Given that the survey is for understanding the prevalence of mental health issues in the tech sector, the factors that are supposed to impact the prediction the most should relate to the tech sector. In fact, the literature confirms a correlation between mental health prognosis and the environment in the tech sector. Using the same dataset, researchers [34] state that "When comparing those who work in tech to those who do not work in tech, there was a clear majority of those who do work in tech". It can therefore be expected that answers to questions such as "Is your employer primarily a tech company/organization?" should have ranked higher in the models' predictions. However, the factors that came out at the top did not have anything specific to do with the tech sector. In fact, for some of the models, like Support Vector Machine and decision tree, tech sector-related features did not matter at all. Machine learning models are not expected to imply causality. Although the organization OSMH, which conducts the survey year after year, targets workers in the tech sector, it is clear from the results that no causality is established. Despite the high accuracy of the models in predicting mental health, they do not in any way imply that working in the tech industry is a cause of mental health issues. The explainability techniques used for this work also do not indicate any causality. It is therefore important to understand the limitations and potential pitfalls of relying solely on machine learning models for drawing major conclusions based on an intuitive interpretation of the results.
The authors of [35] also assess the effectiveness of machine learning algorithms in predicting mental health outcomes. Our experiments demonstrate that although the evaluation metrics for the models are superlative, the justification of the outcomes from these models is not intuitive or reasonable. It is also apparent that models differ in their explainability. Accordingly, our research questions are answered as follows.
• RQ1: How reliable are evaluation metrics such as accuracy in assessing the machine learning model performance in making important predictions?
The experiments demonstrate that the evaluation metrics fall short in their trustworthiness.The metrics are high even when the models relied on an unsound ranking of the features.• RQ2: How well do the explainable AI techniques SHAP and LIME complement conventional evaluation metrics?
The attribution techniques SHAP and LIME can add corroboratory evidence of the model's performance.The results from the experiment show the complementary nature of the plots from the explainability methods and the evaluation metrics.LIME is substantially impacted by the choice of hyperparameters and may not fully comply with some legal requirements [36].However, for the experiments described in this paper, the profile of LIME explanations was mostly consistent with the explanations from SHAP.A summary metric for the explainability aspects may further enhance the utility of these methods.• RQ3: How do the various machine learning algorithms compare when it comes to the explainability of their outcomes?
The majority of the machine learning algorithms behave quite similarly with respect to the explainability of their outcomes. However, some algorithms, like the SGD Classifier, perform poorly in terms of the explainability of their outcomes, while Naive Bayes performs slightly better.
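The local-surrogate idea behind LIME, discussed under RQ2 above, can be sketched in a few lines: perturb the instance, query the black-box model, weight the perturbations by their proximity to the instance, and fit a weighted linear model whose coefficients serve as the local explanation. The sketch below is an illustrative simplification (the function name, noise scale, and kernel width are our own choices, and the actual LIME library additionally handles discretization and feature selection), not the library implementation:

```python
import numpy as np

def lime_like_explanation(predict_fn, x, n_samples=1000, kernel_width=0.75, seed=0):
    """Crude LIME-style local surrogate: perturb x, weight samples by
    proximity to x, and fit a weighted linear model to the black-box outputs."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # Perturb the instance with Gaussian noise around x.
    X = x + rng.normal(scale=0.5, size=(n_samples, d))
    y = predict_fn(X)  # query the black-box model
    # Exponential kernel: perturbations closer to x get higher weight.
    dist = np.linalg.norm(X - x, axis=1)
    w = np.exp(-(dist ** 2) / (kernel_width ** 2))
    # Weighted least squares with an intercept column.
    A = np.hstack([np.ones((n_samples, 1)), X])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]  # local feature weights (intercept dropped)

# Toy black box in which only feature 0 matters:
f = lambda X: 3.0 * X[:, 0]
weights = lime_like_explanation(f, np.array([1.0, 2.0]))
# weights[0] recovers the slope 3.0; weights[1] is ~0.
```

Because the toy model is exactly linear, the surrogate recovers it perfectly; for a real model the coefficients are only locally faithful, which is why LIME's explanations can be sensitive to its hyperparameters.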
It is evident from these experiments that focusing only on achieving high performance metrics from machine learning models, particularly those used for critical applications like mental health prediction, can be misleading and ethically concerning. This work emphasizes the crucial role of explainability techniques in providing insights into how models arrive at their predictions, fostering trust and transparency. Our results also underscore the broader applicability of XAI across the various domains in which machine learning is applied. A promising direction for research is developing metrics and frameworks that evaluate models not just on accuracy-based performance but also on their interpretability and trustworthiness.
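As one illustration of what such a summary metric might look like (this is our own sketch, not an established measure), the concentration of a model's feature attributions can be collapsed into a single score: a model that spreads attribution thinly and evenly across many features scores low, while one that commits to a few dominant features scores high.

```python
from math import log

def attribution_concentration(attributions):
    """Score in [0, 1] for how concentrated a set of feature attributions is:
    1.0 means one feature carries all the weight, 0.0 means the weight is
    spread perfectly evenly across all features (normalized entropy)."""
    mags = [abs(a) for a in attributions]
    total = sum(mags)
    if total == 0:
        return 0.0
    p = [m / total for m in mags]
    entropy = -sum(pi * log(pi) for pi in p if pi > 0)
    max_entropy = log(len(p))
    return 1.0 - entropy / max_entropy if max_entropy > 0 else 1.0

print(attribution_concentration([0.9, 0.05, 0.05]))       # concentrated: high score
print(attribution_concentration([0.25, 0.25, 0.25, 0.25]))  # evenly spread: 0.0
```

A score like this could be reported alongside accuracy, although how concentration should be weighed against domain plausibility of the top features remains an open question.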

Conclusions
Machine learning is increasingly being applied to critical predictions, such as those concerning mental health. The literature contains several bold claims about the effectiveness of various machine learning algorithms in predicting mental health outcomes, and several of those papers use the same dataset as this work. The experiments described in this paper show that such claims need to be corroborated with results from explainability techniques to gain insight into the models' workings and the justification of their outcomes. Our findings generalize readily to other domains and datasets, as nothing in the dataset or the experiments detailed in this paper, nor any of our assumptions, limits their generalizability. This work shows that merely achieving superlative evaluation metrics can be dangerously misleading and may raise ethical concerns. A future direction is to investigate methods for quantifying the effectiveness of machine learning models in terms of the insights from their explainability.

• g is a link function, such as a sigmoid for binary classification.

Positive SHAP values indicate a feature pushing the prediction toward a higher value; negative SHAP values indicate a feature pushing the prediction toward a lower value. Larger absolute SHAP values indicate greater feature importance for that prediction.
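For intuition, the Shapley values that SHAP approximates can be computed exactly when the number of features is small, by enumerating every coalition of features and averaging each feature's marginal contribution. The sketch below is our own illustration (the function names and the toy additive model are assumptions, and the enumeration is exponential in the number of features, which is why SHAP relies on approximations in practice):

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(S) -> model output
    when only the features in frozenset S are 'present'.
    Exponential in n_features, so only feasible for small n."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                S = frozenset(subset)
                # Weight of this coalition in the Shapley average.
                weight = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                          / factorial(n_features))
                # Marginal contribution of feature i to coalition S.
                phi[i] += weight * (value_fn(S | {i}) - value_fn(S))
    return phi

# Toy additive model f(x) = 2*x0 + 1*x1; absent features contribute 0.
x = [1.0, 1.0]
coefs = [2.0, 1.0]
v = lambda S: sum(coefs[j] * x[j] for j in S)
phi = shapley_values(v, 2)  # [2.0, 1.0] for this additive model
```

By the efficiency property, the values sum to the difference between the full-coalition output and the empty-coalition output, so they fully attribute the prediction across the features.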

Figure 2 .
Figure 2. Bar graph comparing the results from various machine learning algorithms.

Figure 3 .
Figure 3. Insights into what drives the logistic regression predictions, to understand how and why the predictions were made. (Top) Visualization of how features nudge the logistic regression predictions up or down using SHAP values. (Bottom) Interpretable Explanations using LIME's localized approach. LIME applies weighted perturbations to features, revealing their impact on predictions.

Figure 4 .
Figure 4. (Top) SHAP bar chart showing overall feature importance when the K-NN model is used. Features with larger absolute SHAP values (both positive and negative) have a stronger influence on the prediction. (Bottom) Interpretable Explanations of the K-NN model using LIME's localized approach.

Figure 5 .
Figure 5. (Top) Quantifying feature influence on the decision tree, treated as a game of fair contributions, using SHAP. (Bottom) Understanding why the decision tree makes a specific prediction using LIME's localized approach.

Figure 6 .
Figure 6. (Top) SHAP bar chart showing overall feature importance when the random forest model is used. Features with larger absolute SHAP values (both positive and negative) have a stronger influence on the prediction. (Bottom) Interpretable Explanations of the random forest model using LIME's localized approach.

Figure 7 .
Figure 7. (Top) SHAP bar chart showing overall feature importance when the Gradient Boosting model is used. (Bottom) Interpretable Explanations of the Gradient Boosting model's predictions using LIME's localized approach.

Figure 8 .
Figure 8. (Top) SHAP value chart when the AdaBoost algorithm is used. (Bottom) LIME's localized approach shows that the AdaBoost algorithm is nearly evenly balanced in its prediction for this instance.

Figure 10 .
Figure 10. (Top) SHAP bar chart showing overall feature importance when the Naive Bayes algorithm is used. (Bottom) Interpretable Explanations of the Naive Bayes algorithm using LIME's localized approach.

Figure 11 .
Figure 11. (Top) SHAP bar chart showing overall feature importance when the SVM model is used. (Bottom) Interpretable Explanations of the SVM model using LIME's localized approach.

Table 1 .
Summary of results from applying various machine learning algorithms on the dataset.

Table 2 .
Summary of fine-tuned hyperparameters for each model using grid search.