Explainable AI to Predict Male Fertility Using Extreme Gradient Boosting Algorithm with SMOTE

Infertility is a common problem across the world. Male factors account for 40% to 50% of infertility cases. Existing artificial intelligence (AI) systems are often not human interpretable. Further, clinicians are often unaware of how data analytical tools make decisions, and as a result, these tools have seen limited use in healthcare. Using explainable AI tools makes AI systems transparent and traceable, enhancing users’ trust and confidence in decision-making. The main contribution of this study is to introduce an explainable model for male fertility prediction. Nine features related to lifestyle and environmental factors are utilized to develop a male fertility prediction model. Five AI tools, namely support vector machine, adaptive boosting, conventional extreme gradient boost (XGB), random forest, and extra tree algorithms, are deployed with a balanced and imbalanced dataset. To make our model trustworthy, explainable AI is applied. The techniques are (1) local interpretable model-agnostic explanations (LIME) and (2) Shapley additive explanations (SHAP). Additionally, ELI5 is utilized to inspect feature importance. Finally, XGB outperformed the other classifiers and obtained an AUC of 0.98, which compares favorably with existing AI systems.


Introduction
Pregnancy is a natural process that succeeds within three months to the end of 1 year after conception [1]. In previous decades, couples failed to conceive due to impaired reproduction, singly or with a partner. The World Health Organization (WHO) suggested that about 48.5 million couples are affected by infertility, and 40 to 50 percent of cases occur due to male-related factors [2,3]. Environmental, occupational, and lifestyle factors are profound causes that may be the reason for increasing male infertility [4]. Key lifestyle factors include smoking, liquor intake, advanced paternal age, stress, food habit, caffeine consumption, mobile usage, and lack of sleep. These factors are easily modified and are the utmost reason for concern [5]. Several studies have been performed to examine the effects of these factors, and their negative impact may well be mostly overcome. Hence, more awareness and early prediction are potential solutions for male reproductive disorders.
AI systems have significantly contributed to different healthcare areas such as genetics, urology, radiology, oncology, and many more. Reproductive medicine is no different, and predictive systems are helpful for problem analysis there as well [6,7]. However, the sheer volume of iterations and hyperparameters makes AI systems challenging to evaluate and grasp. They are initially evaluated on their performance on the dataset under study, which is not, on its own, compelling for medical professionals and patients. As a result, their use is restricted to the technical community. Because of this, we refer to such tools as 'black boxes' regardless of their performance quality compared to decision-making processes used in a medical setting. In other words, without medical implications, trust does not exist [8,9]. For these reasons, AI systems need to be comprehensible and interpretable to analyze data and make judgments that would enhance patient care [10,11]. With the help of these systems, we can enhance our ability to discover new biological phenomena [12,13].
In the last few decades, various studies have used different types of data, either image or behavioral/lifestyle, and environmental factors. For example, Ma et al. [14] recently performed a study to develop a model for predicting seminal quality where data balancing and classification were a prime concern. The ESLSMOTE approach is used to handle data balancing along with BPNN, ADA, and SVM classifiers. The maximum accuracy of 97% was reported (using ADA). Yibre and Kocer [15] developed a model based on a feed-forward neural network to predict male fertility. The authors reported that the proposed model had achieved accuracy and AUC of 97.50% and 97%, respectively. In addition, SMOTE technique is used to overcome data imbalance issues. Dash and Ray [16] performed the comparative study using eight classifiers: soft voting, DT, NB, LR, DT bagged, RF, and ET. The maximum accuracy of 90.02% was achieved by ET. Ahmed and Imtiaz [17] used the NB classifier for predicting male fertility and reported an accuracy of 87.75%. Engy et al. [18] conducted a comparative analysis using five AI techniques: ANN, ANN-GA, DT, SVM, and ANN-SWA for male fertility detection, and their reported accuracies were 90%, 95%, 88%, 95%, and 99.96%, respectively. Candemir et al. [19] used four Machine Learning (ML) techniques, such as MLP, SVM, DT, and FRBF, to detect male fertility. Of all, the FRBF classifier outperformed, and 90% accuracy was reported. Soltanzadeh et al. [20] performed a comparative analysis among four algorithms: NB, NN, LR, and Fuzzy C-means. The authors have used a filtering method and compared performance before and after that application. Simfukwe et al. [21] used NB and ANN to predict male fertility. In total, a 97% training accuracy was reported for both algorithms during the training phase.
Palechor et al. [22] used classification and clustering methods to identify male fertility. J48, SMO, NB, and lazy IBK algorithms were considered for classification, whereas the simple K-means algorithm was used for clustering. The authors used TP and FP to identify model performance. Rhemimet et al. [23] developed individual systems for detecting male fertility using regression, classification, and clustering methods. DT and NB have been used for classification with 61.36% and 88.63% accuracy. Similarly, K-means and O-means have been applied to make a cluster, and the reported accuracies of 100% and 50% were noted. Finally, GLM and SVM were used for regression, and 12.34% and 15.08% accuracy were reported. Bidgoli et al. [24] used four methods: optimized MLP, NB, DT, and SVM, and the models obtained an accuracy of 93.3%, 73.10%, 83.82%, and 80.88%, respectively. Sahoo and Kumar [25] experimented to identify the importance of the feature selection method along with five algorithms. DT, MLP, SVM, SVM-PSO, and NB were used as classifiers, and they performed relatively similarly. The maximum accuracies of 89%, 92%, 91%, 94%, and 89% were reported using feature selection. Girela et al. [26] used the ANN model to detect seminal quality, and 97% accuracy was reported on the training dataset. Gil et al. [27] used three ML algorithms, such as SVM, MLP, and DT, to identify sperm concentration and morphology. The authors reported 86%, 86%, and 84% accuracies for sperm concentration, whereas 69%, 69%, and 67% accuracies were reported for sperm morphology. Wang et al. [28] used a clustering-based decision forest to predict the seminal quality. The authors compared the proposed model performance with SVM, MLP, and LR. Of all, the CBDF algorithm outperformed with an AUC of 80.5%. Roy and Alvi [29] used four AI tools, KNN, SVM, LR, and DT, to detect male fertility. They reported that KNN performed well, with an accuracy of 90%.
In the literature mentioned above, authors have worked on male fertility detection by considering only accuracy improvement and imbalanced dataset handling [14–29]. Although the previously developed systems for predicting male fertility performed well, articles that discuss system explainability have yet to be identified. As a result, most AI systems are not accepted in the healthcare setting. Very briefly, many authors used not only modifiable factors but also sperm images. For both cases, sperm concentration and morphology are gold standards in male fertility prediction. It is prudent to comprehend which factors are critical and determine their impact on changes in developed systems. For this reason, explainability is a fundamental tool to gain insights into the model. This paper presents an explainable AI (XAI) system for prediction analysis of male fertility for early-stage diagnosis using modifiable factors (lifestyle/environment). Our prime objective is to build a fertility risk predictive system for males. In this study, the explainability of the AI system enables the discovery of risk factors, which contributes to transparent and traceable explanations for AI-based decision-making criteria. To the best of our knowledge, this is the first article where explainability techniques are used for predictive analysis of male fertility. This explanation provides significant benefits for patients and clinicians to understand the ways in which each feature contributes to the prediction. In line with the above-mentioned literature review, the key contributions of this research are summarized as follows:

• To detect male fertility, a conventional XGB-SMOTE-based generalized AI system is proposed;
• Hold-out and five-fold cross-validation schemes are utilized for system testing;
• Benchmarking of the interpretability of the proposed system is performed via implemented XAI tools;
• To assess the performance of the proposed system, a comparative analysis is performed with existing AI systems.
The remainder of the article is structured as follows: Section 2 describes the model development process for male fertility prediction. It also includes an overview of the XGB and XAI approaches. Section 3 describes the experimental setting, concentrating on the dataset and assessing the effectiveness of the suggested strategy. The model's testing and comparison to alternative strategies are shown in Section 4 of the article. The final concluding remarks are provided in Section 5.

The Model Development
In this article, an ML-assisted XGB-SMOTE algorithm for predicting male fertility has been proposed.
The output or target class is represented as $y_m \in \{0, 1\}$. This means that it is a binary classification problem in which a system, $y = f(x)$, is learned from the training data points. For prediction, we apply the system to the test data points, $\hat{y}_k = f(x_k)$, where the predicted output $\hat{y}_k$ is expected to match the true label $y_k$. The first challenge in AI model design is the data processing and classifier selection. We are dealing with imbalanced data and a small number of samples. Imbalanced data is a common issue, particularly in medical datasets. Hence, our prime objective is to balance our dataset and then predict patient classes with an ML classifier that can predict male fertility. Figure 1 depicts the prediction of male fertility using the proposed XGB-SMOTE method. Additionally, the proposed method incorporates explainability of processing, resulting in reliable tools.
Numerous approaches are listed in the literature for dealing with imbalanced data and classifier selection. Each has unique advantages and disadvantages, and none appears to be the best all-around. We looked for a suitable approach for our specific application of male fertility analysis and screening simulation. The techniques finally chosen are discussed below.

Synthetic Minority Oversampling Technique (SMOTE)
Oversampling techniques, such as SMOTE, are utilized to increase the number of samples in the minority class. In previous research, Ma et al. [14] and Yibre and Kocer [15] used the ESLSMOTE and SMOTE techniques, respectively, to increase the number of samples in the minority class.
We assume that $p_c$ and $k_c$ represent the number of majority and minority class instances, respectively. The imbalance ratio of $p_c$ to $k_c$ is represented as $z_c$. After the oversampling technique is applied, the total number of data points is the number of original instances plus the number of synthetically generated minority instances.
In this study, the oversampling technique (i.e., SMOTE) is utilized because it has been reported to perform well in the field of healthcare [30–32].
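The interpolation idea behind SMOTE can be sketched as follows. This is a minimal pure-Python illustration rather than the implementation used in this study: the function name `smote`, the toy two-feature data, the choice of k = 5 neighbours, and the definition of the imbalance ratio as majority count divided by minority count are all assumptions made for the example. The class counts (70 majority, 10 minority) mirror the training split reported later in the text.

```python
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between a minority sample
    and one of its k nearest minority-class neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical training split: 70 normal (majority), 10 altered (minority).
data_rng = random.Random(1)
majority = [[data_rng.random(), data_rng.random()] for _ in range(70)]
minority = [[data_rng.random(), data_rng.random()] for _ in range(10)]

p_c, k_c = len(majority), len(minority)
z_c = p_c / k_c                          # imbalance ratio (assumed definition)
new_points = smote(minority, p_c - k_c)  # generate enough points to balance
balanced_minority = minority + new_points
print(z_c, len(majority), len(balanced_minority))
```

Because every synthetic point lies on a segment between two existing minority samples, the new data stay inside the region already occupied by the minority class.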

XGB Algorithm
XGB is an ML classifier that is scalable and efficient. It is used to solve classification, regression, and ranking problems. In 2016, Chen and Guestrin popularized this algorithm [33]. The original XGB model is built on the gradient-boosting decision tree, a boosted ensemble of many decision trees. Each tree is built using gradient boosting to lower the residual of the preceding model. The term "residual" refers to the discrepancy between the actual and predicted values [34]. The model is trained until the preset number of decision trees has been reached. This method accommodates both regression and classification [35]. In 2015, XGB was used in 17 of the 29 challenge-winning solutions published on Kaggle's blog.
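The residual-fitting loop described above can be illustrated with a deliberately simplified sketch. It uses one-split regression stumps and a squared-error loss on a single feature; XGB itself additionally uses second-order gradient information and regularization, which are omitted here. All names (`fit_stump`, `boost`), the toy data, and the hyperparameter values are assumptions for illustration only.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals.
    Assumes at least two distinct feature values."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)            # initial prediction: the mean target
    preds = [base] * len(ys)
    trees, history = [], []
    for _ in range(n_trees):            # stop at the preset number of trees
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, trees, history

# Toy one-feature data with a noisy step in the labels.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
base, trees, mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])
```

The training error recorded in `mse_history` does not increase from one tree to the next, which is the behaviour the residual-lowering description implies.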

XAI
Black box AI and ML models do not provide explainable decisions. The explainable AI (XAI) approach is used to convert this black box into a glass box system that helps users understand the AI system's prediction report [36]. Explainability is the degree to which humans can understand the AI decision [37], gain insights into the AI system, and discuss the logic behind the decision. Application of XAI has three main benefits: (a) it provides a transparent interpretation and boosts trust in the designed model; (b) it enables model troubleshooting; (c) it specifies the source of the system basis. Explainability and efficacy are two distinct aspects that should both be maintained while designing AI systems: an effective system still needs to explain its decisions coherently, and an explainable system still needs to be effective. XAI techniques are classified into global and local methods. Global explainability helps comprehend system behaviour and feature effects on prediction labels. Local explainability provides transparency in the decision of the model for individual instances [38–40]. An interpretable system is critical to translate the output decision into human-understandable language, especially in healthcare.

Experimental Setting
To facilitate the fast development of our model, we use the Python language on the Windows operating system. This section describes the dataset, feature importance, and performance evaluation metrics.

Dataset
Based on WHO guidelines, Gil et al. [27] investigated the factors that may impair male seminal quality. The study was carried out at the University of Alicante in Spain. A total of 123 young, healthy volunteers between the ages of 18 and 36 took part in the study and were required to refrain from having sex for a period of 3 to 6 days. Due to incomplete information from some participants, only 100 people were available at the end. The sperm examination was completed within 60 min of the sample collection. This dataset, available in the UCI databases, is used in this study. The dataset contains nine input attributes related to lifestyle and environmental factors. Table 1 summarizes the dataset information, where the data values are normalized under the following rules. Of the 100 samples, 12 belong to the altered class and 88 to the normal class. Because of this unequal distribution, the dataset is imbalanced, with most instances being fertile.

Feature Importance
A dataset is a collection of information typically presented in a tabular manner. A dataset consists of input features and an output or target class. Each feature has its own relevance and is essential for the development of any ML model. These features directly affect the model's performance. Two main processes, feature selection and feature engineering, are considered the primary steps to build an effective model. The benefits of feature selection are an optimized dataset, low memory consumption, and improved model performance. Similarly, feature engineering helps to reduce overfitting and underfitting issues. Under feature selection, most researchers have used Pearson's correlation to identify the significant features from the dataset which have a high impact on input features. Similarly, under feature engineering, normalization and standardization approaches are commonly considered. The Pearson's correlation is calculated by $r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$, where $x_i$ and $y_i$ are the paired values of the two variables, $\bar{x}$ and $\bar{y}$ are their means, and $n$ is the number of samples.
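As a small worked illustration of Pearson's correlation coefficient, the following sketch computes r directly from its standard definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the toy values are assumptions for the example and are not taken from the fertility dataset.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Toy values: a perfectly increasing and a perfectly decreasing relationship.
hours = [1.0, 2.0, 3.0, 4.0]
print(pearson_r(hours, [2.0, 4.0, 6.0, 8.0]))
print(pearson_r(hours, [8.0, 6.0, 4.0, 2.0]))
```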

Performance Evaluation
The accuracy, sensitivity, specificity, and f1-score of a designed model are typically used to estimate its performance.These measures are determined using fundamental terms.
True positive (TP) refers to the altered instances the classifier labels correctly.
True negative (TN) refers to the normal instances the classifier labels correctly.
False positive (FP) refers to normal instances that have been mislabeled as altered.
False negative (FN) refers to altered instances incorrectly labelled as normal.
Accuracy (ACC) is the proportion of all samples that are correctly predicted, and it is calculated as $\frac{TP+TN}{TP+TN+FP+FN}$. Sensitivity (SEN) is the ratio of correctly predicted ill cases to the total number of ill cases, and it is calculated as $\frac{TP}{TP+FN}$. Specificity (SPEC) is defined as the ratio of correctly predicted healthy cases to total healthy cases, and it is calculated as $\frac{TN}{TN+FP}$. F1-Score is calculated as $\frac{2 \cdot PREC \cdot SEN}{PREC+SEN}$, where precision (PREC) is defined as the ratio of correctly predicted ill cases to the total predicted ill cases, and it is calculated as $\frac{TP}{TP+FP}$.
The area under the ROC curve is defined as the Area Under Curve (AUC). AUC is an aggregate measure of performance across all classification thresholds.
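The metrics defined above can be computed from the four confusion-matrix counts as in the sketch below. The counts in the example are hypothetical (a 20-sample test set with 18 normal and 2 altered cases, of which one altered case is missed and one normal case is flagged); they are not results from this study. The AUC helper uses the rank interpretation of the area under the ROC curve, namely the probability that a randomly chosen positive is scored above a randomly chosen negative.

```python
def classification_metrics(tp, tn, fp, fn):
    """ACC, SEN, SPEC, PREC and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * sen / (prec + sen) if prec + sen else 0.0
    return {"ACC": acc, "SEN": sen, "SPEC": spec, "PREC": prec, "F1": f1}

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    tied scores count as half a correct pair."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts: 2 altered (positive) and 18 normal (negative) test cases.
m = classification_metrics(tp=1, tn=17, fp=1, fn=1)
print(m)

# Hypothetical scores for four cases (1 = altered, 0 = normal).
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In this hypothetical case the accuracy is 0.90 while the sensitivity is only 0.50, which illustrates why accuracy alone can be misleading on an imbalanced test set.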

Numerical Results and Analysis
The results are shown in both visual and numerical formats. We used the correlation function for data analysis (see Section 4.1). Then, we presented the results of all classifiers with imbalanced and balanced datasets (see Section 4.2). Following that, we discussed the explainability of the proposed AI system (see Section 4.3). Python packages and libraries were used to design our model because they are free and open-source.

Analysis for Dataset
This study uses normalization to scale the data from 0 to 1. The normalized values of the dataset are presented in Table 1. After normalization, we used Pearson's correlation coefficient to comprehend the strength of the linear relationship between two variables. The sign of the coefficient indicates whether the relationship is positive or negative. Figure 2 depicts the correlation matrix heat map for the male fertility dataset. The coefficient r lies between −1 and 1, where −1 indicates a perfect negative correlation and +1 signifies a perfect positive correlation. In Figure 2, the orange color indicates positively correlated features, and dark violet indicates negatively correlated features.
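Scaling a feature to the range 0 to 1 is commonly done with min-max normalization, sketched below. The exact normalization rules used for each attribute are given in Table 1 and are not reproduced here, so the function name `min_max_scale` and the example ages (taken from the 18 to 36 age range of the volunteers) are illustrative assumptions.

```python
def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                 # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 27, 36]              # hypothetical ages within the study's range
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```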

Performance Evaluation
The total experiment was performed in multiple steps. In the first step, we deal with the original dataset, and the hold-out validation approach is considered. The original dataset consists of 100 samples (normal-88 and altered-12). This dataset is split into training and testing modules, where 80% of the data is used for training and 20% for testing. Many ML algorithms are currently being used for disease diagnosis. Here, we have selected the conventional XGB algorithm, and 80% of the data is used to train the model. After training, the remaining 20% of the data is used to evaluate the model performance. In the next step, SMOTE is applied to the training dataset, and we generate synthetic data to oversample the minority class (altered). Now, the training module contains 140 samples, which are balanced (normal-70 and altered-70), whereas the testing module has 20 samples (normal-18 and altered-2). Finally, Table 2 provides the performance metrics, including ACC, SEN, SPEC, F1 score, and AUC using XGB with hold-out. Table 2 shows that 90.00% and 94.05% accuracies are obtained by XGB with original and balanced datasets, respectively. Therefore, the algorithm performed well in terms of accuracy and provided a good AUC value. At that point, XGB is chosen to proceed with further investigation. In this stage, a comparative study is performed along with the most popular algorithms, SVM and ADA. Based on the literature survey, these two algorithms provided optimal performance with SMOTE technique. We followed the same experimental strategy discussed above. All the performance metric values are listed in Table 3. The maximum accuracy of 94.05% is obtained by XGB, which is better than SVM and ADA. Hence, XGB performed best under SMOTE technique and hold-out cv (after comparison with Table 2). For a fine selection of the final algorithm, we again selected two ensemble algorithms, RF and ET, previously used in male fertility analysis. The experimental results, obtained with the same experimental strategy, are shown in Table 4. From Table 4, we can observe that RF and ET have achieved accuracies of 91.17% and 84.09%, respectively, for imbalanced datasets. Similarly, 92.45% and 85.87% accuracies were obtained by RF and ET with SMOTE. In both cases, RF and ET provide satisfactory accuracy, whereas AUC is fair at 71% to 86%. From Table 2, we obtained the AUC for XGB, which is better than ET and RF. After performing all these experiments in different stages, we selected XGB as our primary classifier and proceeded to the next level of the experiment. After selecting the primary classifier, we have applied k-fold cv and evaluated the XGB model with the default parameter setting. In this situation, the data is divided into training and testing samples by k value. Based on the previous literature, most researchers have taken the value of k as 5. Hence, in our study, we have chosen a five-fold cv technique. With the help of this validation, we can understand the statistical robustness of our proposed system.
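The hold-out split described above can be sketched as follows. The text does not state whether the split was stratified; the sketch assumes that it was, because a stratified 80/20 split of 88 normal and 12 altered samples reproduces the reported counts (70 and 10 for training before SMOTE, 18 and 2 for testing). The function name `stratified_holdout` and the label encoding (0 = normal, 1 = altered) are assumptions.

```python
import random

def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test while preserving class proportions."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

labels = [0] * 88 + [1] * 12      # 0 = normal, 1 = altered
train_idx, test_idx = stratified_holdout(labels)

train_altered = sum(labels[i] for i in train_idx)
test_altered = sum(labels[i] for i in test_idx)
print(len(train_idx) - train_altered, train_altered)  # normal, altered (train)
print(len(test_idx) - test_altered, test_altered)     # normal, altered (test)
```

Oversampling would then be applied to the training indices only, so that the test set keeps the original class distribution.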
Table 5 depicts the outcome of our model in terms of ACC, SEN, SPEC, F1-score, and AUC in each fold with their mean (µ) and standard deviation (σ). Our proposed model achieved a mean accuracy of 93.22% using five-fold cross-validation, which is lower than the hold-out validation outcome. In addition, the obtained AUC value is 0.98, which compares favorably with other existing AI systems. Hence, our AI system provides satisfactory results for male fertility prediction.
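The five-fold procedure and the per-fold mean and standard deviation can be sketched as below. To avoid inventing results, the "model" in this sketch is a placeholder that always predicts the majority class (normal), so the printed accuracies are those of a trivial baseline and not of XGB. The stratified fold assignment, the function name `stratified_kfold`, and the use of the sample standard deviation are assumptions.

```python
import random
import statistics

def stratified_kfold(labels, k=5, seed=0):
    """Assign sample indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

labels = [0] * 88 + [1] * 12      # 0 = normal, 1 = altered
folds = stratified_kfold(labels)

fold_acc = []
for test_fold in folds:
    # The other four folds would be used to train a real classifier here.
    train = [i for f in folds if f is not test_fold for i in f]
    # Placeholder baseline: predict "normal" (0) for every test sample.
    correct = sum(1 for i in test_fold if labels[i] == 0)
    fold_acc.append(correct / len(test_fold))

mu = statistics.mean(fold_acc)
sigma = statistics.stdev(fold_acc)
print(fold_acc, mu, sigma)
```

Even this trivial baseline reaches a mean accuracy close to 0.88 on the imbalanced data, which is why sensitivity, specificity, and AUC are reported alongside accuracy.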

Explainability of Male Fertility Prediction
The XAI technique is applied to unbox the AI system. Each XAI method was examined for local and global explainability. To test our system's explainability, we located 11 libraries. These libraries have good documentation and explanations and support tabular data for XAI. However, we have chosen the ELI5 and SHAP libraries to explain the proposed model's global explainability. Similarly, for local explainability, LIME and SHAP were identified.

• Global explainability
Numerous libraries using various feature significance metrics were located. These techniques rely on scoring input features according to the predictive value they contribute to the system. Our overview focused on global system explainability and showcasing explainability outcomes. Two libraries, ELI5 and SHAP, have been utilized for in-depth analysis.
The ELI5 tool is used to identify important features via permutation importance. It allows for extraction and visualization of features, which helps to identify their contribution towards the system as global explanations. A tabular list view of the features and their weights provided a framework for the visualization. Table 6 depicts the ELI5 library's implementation of feature importance by score. This tabular data shows that high fever in the last year ($f_6$) is the most critical feature. On the other hand, no features have a negative score, which signifies that each feature positively influences the proposed system.
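Permutation importance, the idea ELI5 is used for here, can be sketched without the library itself: shuffle one feature column at a time and record how much the model's accuracy drops. The sketch below is not ELI5's API; the function names, the toy threshold model, and the random data are all assumptions for illustration.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(1 for row, t in zip(X, y) if model(row) == t) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical classifier that depends on feature 0 only."""
    return 1 if row[0] > 0.5 else 0

data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
print(importances)
```

A feature the model ignores receives a score of exactly zero, while a feature the model relies on receives a clearly positive score. Scores can also be negative in general, when shuffling a feature happens to improve accuracy.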
The SHAP library allows for global model analysis and provides an interpretation of the AI system. The system interpretation has been performed by calculating feature importance, that is, the influence of each feature on a prediction for the input data (shown in Figures 3 and 4). The importance of features is shown along the x-axis; the most significant features are listed at the top. From Figures 3 and 4, we can conclude that age ($f_2$) is the most important determinant, and childhood diseases (polio, measles, and pox) have little impact on male seminal quality. Similarly, SHAP provides a global interpretation for specific classes. The contributions of specific features, which can positively or negatively affect the prediction of this class, are plotted along the x-axis. Each data point stacked vertically within this visualization represents the contribution of a specific instance. The color gradient encodes the raw feature values, with blue and red representing the lowest and highest values, respectively.
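The quantity that SHAP approximates, the Shapley value of each feature, can be computed exactly for a very small model by enumerating all feature subsets. The sketch below does this for a hypothetical three-feature model; it is not the SHAP library's API, and the model, the instance, and the all-zero baseline are assumptions chosen so that the result can be checked by hand.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every subset of the other features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                with_i = value_fn(set(subset) | {i})
                without_i = value_fn(set(subset))
                phi[i] += weight * (with_i - without_i)
    return phi

def toy_model(x):
    """Hypothetical model: one additive term and one interaction term."""
    return 0.2 + 0.5 * x[0] + 0.3 * x[1] * x[2]

instance = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

def value_fn(subset):
    # Features in `subset` take the instance's value; the rest stay at baseline.
    x = [instance[j] if j in subset else baseline[j] for j in range(3)]
    return toy_model(x)

phi = shapley_values(value_fn, 3)
print(phi, sum(phi), toy_model(instance) - toy_model(baseline))
```

The additive feature receives its full effect, the two interacting features share theirs equally, and the values sum to the difference between the prediction for the instance and the prediction for the baseline. This additivity property is what a SHAP force plot visualizes.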

• Local explainability
Global explainability elaborates in detail on the ways in which an AI system works as a whole, whereas local explainability focuses on individual predictions. It shows the way in which the system outcome changes as the values of certain features change within given intervals. In this way, the user gains trust in individual predictions as well as in the whole system. Additionally, globally essential features may be insignificant locally and vice versa. The two most frequent approaches, LIME and SHAP, are utilized for local interpretation. LIME is an approach that describes any AI system's prediction process and provides insights into the prediction strategy and feature correlations. On the other hand, SHAP offers a force plot for the local interpretation of the model. This plot helps to visualize each feature's effect on the model's prediction for a given observation.
Figure 5 shows the local feature importance using SHAP values, which explains an individual prediction. Red color indicates a higher value of a feature, and blue indicates a lower value of a feature, along with its magnitude. Five features, the number of sitting hours per day, age, alcohol intake, fever, and smoking, have higher values. Four features, accident, surgical intervention, season, and child diseases, have lower values. Other features, such as smoking habit and child disease, have the same positive value. Figure 6 provides the local explanation of black box prediction outcomes where the most significant features, such as age, alcohol intake, and the number of sitting hours in a day, have been moved towards the positive class. The remaining three features, accident, surgical intervention, and season, move towards the negative class. All features compete against each other; finally, the highlighted value is the prediction for that observation.
We have applied LIME to enhance the interpretability of our proposed system. Figure 7 shows the outcome of the LIME approach for the XGB-SMOTE system. The result reflects the contribution of each feature to the instance prediction. The LIME visualization is divided into three parts: a class description with the prediction probability for each class, a plot showing the impact of features, and a table with actual values in the instance. The prediction probabilities are displayed in the section on the left. There are two colors for two-class classification tasks: orange (altered semen quality/infertile) and blue (normal semen quality/fertile). The middle part summarizes the crucial aspects. The impact of features assists the user in determining which feature values support class prediction (left side). Figure 7 shows that features such as surgical intervention, child disease, and smoking habit support (positive) the prediction of the instance as normal. At the same time, the orange bar indicates the contradicting (negative) scores in relation to the prediction.
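The core idea behind LIME, fitting a simple weighted linear model to a black box in the neighbourhood of one instance, can be sketched as follows. This is not the `lime` package's API: the Gaussian perturbation, the exponential proximity kernel, the kernel width, and the toy black box are all assumptions made for illustration, and the linear system is solved with a small hand-written Gaussian elimination to keep the example self-contained.

```python
import math
import random

def solve(A, b):
    """Solve a small square linear system by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def local_surrogate(predict, instance, n_samples=500, scale=0.1,
                    kernel_width=0.25, seed=0):
    """Fit a proximity-weighted linear model around `instance`."""
    rng = random.Random(seed)
    d = len(instance)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [v + rng.gauss(0, scale) for v in instance]   # perturbed sample
        dist2 = sum((a - b) ** 2 for a, b in zip(z, instance))
        rows.append([1.0] + z)                            # leading 1 = intercept
        targets.append(predict(z))
        weights.append(math.exp(-dist2 / kernel_width ** 2))
    # Weighted normal equations: (X^T W X) beta = X^T W y
    A = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
          for j in range(d + 1)] for i in range(d + 1)]
    b = [sum(w * r[i] * t for r, t, w in zip(rows, targets, weights))
         for i in range(d + 1)]
    beta = solve(A, b)
    return beta[0], beta[1:]

def black_box(z):
    """Hypothetical non-linear model standing in for the trained classifier."""
    return 1.0 / (1.0 + math.exp(-(4.0 * z[0] - 3.0 * z[1])))

intercept, coefs = local_surrogate(black_box, [0.5, 0.5])
print(coefs)
```

The sign of each surrogate coefficient indicates whether the feature pushes this particular prediction towards or away from the positive class, which corresponds to the supporting and contradicting bars in a LIME plot.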
In this study, we have used the SHAP, LIME, and ELI5 libraries under the XAI approach. Since there are three distinct XAI processes, a comparison is needed to understand each of them in terms of global and local explainability. Three criteria can be used in global explainability: computational overhead, implemented visualization, and interactivity.
From the perspective of computational overhead, ELI5 provides the most lightweight solution, along with implemented visualization and interactivity. Feature importance alongside other implemented functionality of ELI5 can be convenient during the model development process. In contrast, the increased interactivity and visualization options of the SHAP library come with additional computational overhead.
Next, in terms of local explainability, the SHAP and LIME approaches yield different results. SHAP outperformed the others regarding computational resources and provided an interactive way to explore the various model predictions. The SHAP force plot is advantageous if the focus is on exploring the features, providing more refined explainability and analytical experience. On the other hand, LIME has a speed advantage and provides an intuitive example explanation.

Comparison with Existing Systems
Table 7 compares our results with existing approaches in the literature for further assessment of the performance of the proposed XGB-SMOTE system. In previous research, multilayer perceptron (MLP), SVM, support vector machine-particle swarm optimization (SVM-PSO), decision tree (DT), Naive Bayes (NB), clustering-based decision forest (CBDF), ESLSMOTE-BPNN, ESLSMOTE-ADA, and ESLSMOTE-SVM were used for male fertility prediction. Table 7 shows that feature selection and data balancing approaches significantly impact accuracy. Based on this comparison, XGB-SMOTE is a better solution for detecting male fertility.

Figure 1. Methodology of the proposed system via XAI.

Figure 3. Global visualization of the impact of different features in the proposed model.

Figure 4. Global visual explanation of proposed system via SHAP.

Figure 5. Local visual explanation of the proposed system using SHAP.

Figure 6. Local visual explanation of the proposed system.

Figure 7. Local visualization of the proposed system.

Table 1. Features with their detailing.

Table 2. Performance analysis using XGB using hold-out cv.

Table 3. Performance of popular AI systems using hold-out cv.

Table 4. Performance of existing ensemble AI systems using hold-out cv.

Table 6. Feature weights and their impact on the proposed model.

Table 7. Comparative study between proposed AI system to existing predictive systems on UCI dataset (according to year).