A Proposed Framework for Early Prediction of Schistosomiasis

Schistosomiasis is a neglected tropical disease that remains a leading cause of illness and mortality around the globe. The causative parasites penetrate the skin on contact with contaminated water and enter the human body. Failure to diagnose Schistosomiasis can result in various medical complications, such as ascites, portal hypertension, esophageal varices, splenomegaly, and growth retardation. Early prediction and identification of risk factors may aid in treating the disease before it becomes incurable. We aimed to create a framework that incorporates the most significant features to predict Schistosomiasis using machine learning techniques. A dataset of advanced Schistosomiasis comprising 4136 individuals, covering both recovery and death cases, has been employed in this research. The dataset contains demographic, socioeconomic, and clinical factors together with lab reports. Data preprocessing techniques (missing value imputation, outlier removal, data normalisation, and data transformation) have been employed for better results. Feature selection techniques, including correlation-based feature selection, information gain, gain ratio, ReliefF, and OneR, have been utilised to reduce the large number of features. Data resampling algorithms, including random undersampling, random oversampling, Cluster Centroid, NearMiss, and SMOTE, are applied to address the data imbalance problem. We applied four machine learning algorithms to construct the model: Gradient Boosting, Light Gradient Boosting, Extreme Gradient Boosting, and CatBoost. The performance of the proposed framework has been evaluated based on Accuracy, Precision, Recall, and F1-Score. The results show that the CatBoost model performed best, with the highest accuracy (87.1%), compared with Gradient Boosting (86%), Light Gradient Boosting (86.7%), and Extreme Gradient Boosting (86.9%).
Our proposed framework will assist doctors and healthcare professionals in the early diagnosis of Schistosomiasis.


Introduction
Schistosomiasis, also known as snail fever or bilharzia, is one of the most lethal and contagious of the world's neglected tropical diseases. The causative parasites penetrate the skin on contact with contaminated water and enter the human body. They move into the veins, where they lay eggs, giving rise to the two stages of Schistosomiasis, acute and chronic [1,2]. The most severe form of late-stage disease is advanced Schistosomiasis.

The remainder of this work is structured as follows. Section 2 reviews the literature, including past work on Schistosomiasis and traditional statistical approaches applied to high-dimensional medical data. Section 3 describes the methodology, covering the dataset description, data preprocessing, feature selection, resampling, data transformation approaches, and modelling. Section 4 details the framework's experimental setup and outcomes. Section 5 discusses the ramifications of the results in terms of their relevance. Section 6 concludes the report by highlighting future work in this field.

Literature Survey
Li et al. divided Schistosomiasis cases into two groups: favourable and poor prognoses [1]. Cases in which patient health improved are referred to as favourable prognoses, while death cases are classified as poor prognoses. Various machine learning models were employed for advanced Schistosomiasis prediction, among which ANN outperformed the other models. Holing et al. [17] proposed a method for predicting a 1-year unfavourable prognosis for advanced Schistosomiasis. Demographics, clinical factors, medical exams, and test results were used to choose candidate predictors. To build the one-year prognostic model, five machine learning techniques were used: Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Artificial Neural Network (ANN), and Extreme Gradient Boosting (XGBoost). The models' performance was assessed using the area under the receiver operating characteristic curve (AUROC), on which XGBoost outperformed the other machine learning models.
To estimate Schistosomiasis susceptibility and vulnerability computationally, Olanloye et al. [18] compared six support vector machine (SVM) models: Linear, Quadratic, Cubic, Fine Gaussian, Medium Gaussian, and Coarse Gaussian. Receiver Operating Characteristic (ROC) curves and Parallel Coordinate Plots (PCP) were used to evaluate the experiments in terms of accuracy, processing speed, and execution time, and the Medium Gaussian model was determined to be the best of the six. Asarnow et al. [19] applied the Asarnow-Singh algorithm to a set of images and extracted features by defining a threshold to identify Schistosoma mansoni on images with infected foreground and parasitic background areas; effective results were obtained by using SVM for training and testing the images. Kasse et al. [20] developed an IoT-based monitoring system that can help control and predict the disease, with data gathered through a network of wireless sensors. Multiple data mining algorithms were applied for disease detection and transmission, and SVM detected irregularities better than the other models. Chicco et al. [21] collected a dataset of 324 patients, of which 96 were mesothelioma patients. The imbalanced class problem arises because of the ratio between the positive and negative instances. A perceptron-based neural network was used to check the effectiveness of ANN, and RF feature selection (RFFS) was used to identify the most relevant features for diagnostic prediction.
According to the previous literature, the problem of class imbalance in medical datasets exists, and several articles address this matter, with a substantial body of work examining it both in clinical science and in other fields. Different methods to solve class imbalance have been described [22]. These techniques are categorised into different levels: data, algorithmic, cost-sensitive, feature selection, and ensemble level. Understanding which preprocessing strategy should be chosen for a given problem is challenging. A dataset is considered imbalanced when the number of instances in the two classes is not equal. The class imbalance is therefore mostly removed at the data level, where multiple undersampling- and oversampling-based approaches can be used.
Undersampling was not preferred for dealing with class imbalance by some scholars [23,24], while others do not prefer oversampling [25]. On the other hand, other studies [26] show that undersampling, especially multiple undersampling, performs better. Data preprocessing techniques can remove missing values, redundancy, and high sparsity from the dataset. Classifier performance can be increased by optimising hyper-parameters, or by ranking and discarding less relevant features [27].

Methodology
This section details the proposed framework. Firstly, a dataset was obtained and then standardised to remove any biases among the various features. Then, several feature selection approaches are used to extract significant features from the high-dimensional data [28,29]. In addition, several resampling approaches have been employed to address the data imbalance problem. After data preprocessing, various machine learning models were used, and their results were compared, as shown in Figure 1.


Dataset
A health record of Schistosomiasis has been analysed from the disease database compiled by the Hubei Institute of Schistosomiasis Prevention and Control (HISPC), China. Information on patients with advanced Schistosomiasis was collected by surveying sociodemographic and epidemiological factors in Hubei. The national diagnostic criteria (WS261-2006) were used to assess Schistosomiasis. Advanced Schistosomiasis treatment varies from patient to patient depending upon the disease condition; if the patient's condition is stable for six months, then praziquantel (PZD) treatment can be utilised. The dataset has been categorised into two groups containing 4136 individuals, both males and females. The 1232 cases in which patient health improved are referred to as favourable prognoses, coded as 0, while the 2904 death cases are classified as poor prognoses, coded as 1. Eighteen features are recorded for each participant, as shown in Table 1 [1]. Data was gathered on socioeconomic and demographic factors, hospitalisation expenditures, clinical characteristics, and surgical procedures [30]. The data also provide evidence that areas with greater exposure to water, such as lakeside areas containing marshy land, have a higher prevalence of Schistosomiasis.

Missing Data Imputation
An incomplete dataset can reduce the effectiveness of classification and prediction. A dataset with more training data is more desirable for prediction because models train better as more data is provided. Handling missing values properly can therefore increase the model's performance. Our dataset contains a high percentage of empty or blank cells, and it was unclear whether they should be replaced with zeros or remain as missing values. Common methods for the imputation problem use the mean attribute value from the original set [31]. In this study, the missing values in the dataset are replaced by the median of the class, as presented in Equations (1) and (2).
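As an illustration, class-conditional median imputation of this kind can be sketched as follows (a minimal sketch, not the study's actual code; the helper name `impute_median_by_class` and the use of `None` for missing entries are assumptions):

```python
import statistics

def impute_median_by_class(rows, labels):
    """Replace None entries in each column with the median of that
    column computed within the sample's own class (in the spirit of
    the class-median imputation of Equations (1)-(2))."""
    n_cols = len(rows[0])
    classes = set(labels)
    # Per-class, per-column medians over the observed (non-missing) values.
    medians = {
        c: [statistics.median([r[j] for r, l in zip(rows, labels)
                               if l == c and r[j] is not None])
            for j in range(n_cols)]
        for c in classes
    }
    return [[v if v is not None else medians[l][j]
             for j, v in enumerate(r)]
            for r, l in zip(rows, labels)]
```

Imputing within each prognosis class keeps the filled-in values consistent with the outcome group the patient belongs to.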

Data Normalization
Data normalisation is a technique in which values are scaled and shifted onto a common scale [32]. Normalisation is useful because the features come in different forms and units. To normalise the attributes in our dataset, Z-score normalisation has been used; it rescales each feature to zero mean and unit standard deviation using the feature's mean and standard deviation, as shown in Equation (3).
where X_i^(Z-score) is the normalised value and X_i is the raw value in the ith row of column E. The standard deviation std(E) and mean of E are calculated using Equations (4) and (5).
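The Z-score transformation of Equation (3) can be sketched per column as follows (an illustrative sketch; the population, n-denominator, standard deviation is an assumption, as the paper does not state which variant it uses):

```python
import math

def z_score_normalise(column):
    """Z-score normalisation (Equation (3)): subtract the column mean
    and divide by the column standard deviation, so the result has
    zero mean and unit standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column]
```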

Data Transformation
Data transformation is the process of converting data from one format, structure, or value to another [33]. Our dataset contains numerical features that have been converted into categorical values: following the literature [34,35], continuous features are converted into Boolean and categorical values, as shown in Table 1.

Feature Selection
Data are commonly represented as an n-dimensional vector x, with each element x_i of x representing a feature. The dataset has multiple features, but not all of them are likely to have a favourable effect on the target class [36]. Based on the input x and output y, filter methods compute a score for each feature f using a scoring function s(f). Filters are faster and more resistant to overfitting than wrapper methods. Information gain [37], the correlation-based technique [38], ReliefF [39], OneR [40], and the gain ratio are the filters used in the scope of our study.

Information Gain
The Information Gain (IG) approach is based on the information theory notion of entropy. This method ranks features (or input variables) according to the rate at which a variable reduces the target class's uncertainty (entropy). Shannon's entropy E(s) measures how much uncertainty there is in a distribution and is calculated using Equation (6).
P(j) is the probability of outcome j, and m is the number of possible outcomes. To retrieve the information for all data with c classes, we employed Shannon's entropy, as shown in Equation (7).
P_c(b_i) denotes the observed probability of each class b_i. The data C is divided by feature y into L partitions {C_1, C_2, ..., C_L}. Equation (8) gives the information for a partition C_k.
Info_y(C_k) = 0 when P_j is zero, since lim_{P_j -> 0} P_j log_2 P_j = 0. The information gain due to feature y is calculated using Equation (9).
where |C| and |C_k| represent the number of instances in C and C_k, respectively. Information gain favours features with many distinct values. Considering Equation (9), more distinct values do not affect Info(C), but they do affect the term -sum_{j=1}^{c} P_c(b_j) log_2 P_c(b_j). For example, patient identification is a feature for which each value corresponds to exactly one instance, so Info_id(C_k) = 0 for each patientId, maximising the information gained. In other words, patientId separates the classes perfectly but is useless for generalisation. Quinlan et al. [41] proposed that information gain be normalised using the split information, as presented in Equation (10).
F is the dataset, which is divided into subsets F_j by feature Y. The gain ratio is the normalised information gain measure, as shown in Equation (11).
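The entropy, information gain, and gain ratio of Equations (6)-(11) can be sketched for discrete features as follows (a minimal sketch; the test data below, including the patientId-style feature, are illustrative):

```python
import math

def entropy(labels):
    """Shannon entropy E(s) = -sum_j p_j log2 p_j (Equation (6))."""
    n = len(labels)
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(feature, labels):
    """Information gain of a discrete feature: entropy of the whole
    data minus the weighted entropy of each partition C_k (Eq. (9))."""
    n = len(labels)
    parts = {}
    for f, l in zip(feature, labels):
        parts.setdefault(f, []).append(l)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def gain_ratio(feature, labels):
    """Gain ratio: information gain normalised by the split
    information of the feature itself (Equations (10)-(11))."""
    split_info = entropy(feature)
    return info_gain(feature, labels) / split_info if split_info else 0.0
```

Note how an identifier-like feature maximises information gain yet is penalised by the gain ratio, which is exactly the patientId pathology discussed above.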

Correlation-Based Feature Selection (CFS)
The relationship between the predictive factors and the target variable, often called the class, is the foundation of CFS. CFS identifies a subset of features that are more correlated with the class than with one another. As a result, CFS finds and removes redundant, irrelevant, and noisy features, leaving a subset of useful features. The subset evaluation function is defined in Equation (12).
where Eval(A_f) is the merit of a feature subset A with f features, and b_cf is the average of all feature-class correlations, defined in Equation (13).
b_cfj indicates the correlation between feature j and the class, and b_ff is the average of the feature-feature correlations, with b_fjfk indicating the degree of correlation between features j and k, as defined in Equations (14) and (15).
f_p is the number of feature pairs in subset A, and an information-theoretic measure can be used to compute the correlation matrix.
Here, m in Equation (16) is the number of instances in the dataset, whereas sigma_x and sigma_z are the standard deviations of x and z, respectively. A weighted Pearson's correlation is calculated when one feature is continuous and the other is categorical, as shown in Equations (17) and (18).
In each of the estimated correlations, the weight is the likelihood that Y takes the jth value in the full training set.
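The CFS merit of Equation (12) can be sketched with plain Pearson correlations as follows (a simplified sketch assuming numeric features; the standard merit form f * b_cf / sqrt(f + f(f-1) b_ff) is used, with absolute correlations):

```python
import math

def pearson(x, z):
    """Pearson correlation between two numeric sequences (Eq. (16))."""
    m = len(x)
    mx, mz = sum(x) / m, sum(z) / m
    cov = sum((a - mx) * (b - mz) for a, b in zip(x, z)) / m
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / m)
    sz = math.sqrt(sum((b - mz) ** 2 for b in z) / m)
    return cov / (sx * sz)

def cfs_merit(features, target):
    """Merit of a feature subset A (Equation (12)): b_cf is the mean
    absolute feature-class correlation and b_ff the mean absolute
    feature-feature correlation."""
    f = len(features)
    b_cf = sum(abs(pearson(feat, target)) for feat in features) / f
    pairs = [(i, j) for i in range(f) for j in range(i + 1, f)]
    b_ff = (sum(abs(pearson(features[i], features[j])) for i, j in pairs)
            / len(pairs)) if pairs else 0.0
    return f * b_cf / math.sqrt(f + f * (f - 1) * b_ff)
```

Adding a weakly relevant feature lowers the merit, which is how CFS discards irrelevant and noisy features.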

ReliefF
The ReliefF algorithm is favourable when dealing with real-time and noisy data. It selects instances s_j at random and then determines the k nearest neighbours in the same and in the other classes [42]; these instances are referred to as T_i (nearest hits) and R_i (nearest misses). The Manhattan distance is commonly used to find the T_i and R_i instances. Using s_j, T_i, and R_i, the quality estimate Q[E] of every attribute E is updated: when s_j and its nearest hit T_i differ on attribute E (the attribute separates two instances of the same class), Q[E] is reduced, and when s_j and its nearest miss R_i differ on E (the attribute separates two instances of different classes), Q[E] is increased. The entire procedure is performed n times, with n being a user-defined value. Q[E] is updated in this procedure by applying Equations (19)-(21).
P(B) stands for the prior probability of class B, A stands for the distance between the examples, and bl(s_j) stands for the class of the jth sample s_j. The eight most significant features selected through the various feature selection techniques are presented in Table 2, and the score values of all features are shown in Table 3.
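The hit/miss weight update described above can be sketched in its simplest form, Relief with k = 1 and binary classes, as follows (a didactic approximation of full ReliefF, without its class-prior weighting):

```python
def relief_weights(X, y):
    """Simplified Relief sketch (binary classes, k = 1 neighbour):
    for each sample, find the nearest hit (same class) and nearest
    miss (other class) by Manhattan distance, then decrease each
    attribute's quality Q[E] by its hit difference and increase it
    by its miss difference."""
    n, d = len(X), len(X[0])
    manhattan = lambda a, b: sum(abs(u - v) for u, v in zip(a, b))
    Q = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        h = min(hits, key=lambda j: manhattan(X[i], X[j]))
        m = min(misses, key=lambda j: manhattan(X[i], X[j]))
        for e in range(d):
            Q[e] += abs(X[i][e] - X[m][e]) - abs(X[i][e] - X[h][e])
    return Q
```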

Data Resampling
A dataset is imbalanced when the number of instances in one class does not equal that of the other classes. Resampling methods are widely used to balance the dataset, and a model trained on a balanced dataset usually achieves better results. Most medical datasets are imbalanced and need resampling approaches, so multiple undersampling and oversampling techniques have been used in this study.

Random Undersampling
In this approach, majority class instances are randomly removed from the training set so that the ratio between the two classes becomes equal. The problem with this technique is that one does not know which information has been eliminated; information that is important for the study may be removed, as expressed in Equation (22) [43].
where S_i represents the set of instances with label i; the undersampled set is denoted by S'_0, where 0 means it is drawn from class 0. R(0) and R(1) show that one class has more instances than the other, and combining both gives Equation (23).
When the dataset is rebalanced exactly, the case |S'_0| = |S_1| applies, but in the scenario where the classes are balanced only up to a certain factor, the inequality remains valid. On the other hand, since the points to remove are selected at random, the conditional data distribution within each class is assumed to be unchanged, as shown in Equation (24).
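Random undersampling to an exactly balanced dataset can be sketched as follows (a minimal sketch with binary 0/1 labels; the seeded generator is only for reproducibility):

```python
import random

def random_undersample(X, y, seed=0):
    """Randomly drop majority-class instances until both classes are
    the same size. Which samples are lost is left to chance, which is
    the technique's main drawback."""
    rng = random.Random(seed)
    idx0 = [i for i, l in enumerate(y) if l == 0]
    idx1 = [i for i, l in enumerate(y) if l == 1]
    minority, majority = sorted((idx0, idx1), key=len)
    kept = rng.sample(majority, len(minority)) + minority
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]
```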

Near Miss
Zhang and Mani [44] proposed a method to prevent information loss while utilising the undersampling technique. Instead of resampling the minority class, this strategy undersamples the majority class to make it equal in size to the minority class. The method is based on the average distance between a particular majority class point and points of the other class, and it has three distinct versions: NearMiss-1, 2, and 3. NearMiss-1 keeps the proportion of the majority class whose average distance to the three closest minority class points is smallest. NearMiss-2 keeps the proportion of the majority class whose average distance to the three farthest minority class points is smallest. NearMiss-3 selects the nearest majority class points for each minority class point. In this study, the NearMiss-1 method has been used.
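NearMiss-1 can be sketched as follows (an illustrative sketch; the minority class is labelled 1 here purely for the example, and Euclidean distance is assumed):

```python
def near_miss_1(X, y, k=3):
    """NearMiss-1 sketch: keep the majority samples whose average
    distance to their k closest minority samples is smallest,
    retaining as many majority as minority samples."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    minority = [i for i, l in enumerate(y) if l == 1]
    majority = [i for i, l in enumerate(y) if l == 0]

    def avg_closest(i):
        d = sorted(dist(X[i], X[j]) for j in minority)
        return sum(d[:k]) / min(k, len(d))

    kept = sorted(majority, key=avg_closest)[:len(minority)] + minority
    return [X[i] for i in kept], [y[i] for i in kept]
```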

Cluster Centroid
Cluster Centroids was proposed to minimise the loss of information from the majority class [45]. Based on the desired ratio between the two classes, Cluster Centroids substitutes the majority class samples with cluster centroids obtained with the K-means method. The loss of information is minimised because the centroids summarise groups of similar samples: K-means is applied to the majority class at the chosen level of undersampling, and the majority samples are then replaced by the cluster centroids, which provide a condensed representation of the majority class.
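The idea can be sketched as follows (a didactic sketch with a plain k-means; choosing k equal to the minority class size, so the result is exactly balanced, is an illustrative assumption):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means used to summarise the majority class."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                    for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(cl) for col in zip(*cl)]
    return centroids

def cluster_centroid_undersample(X, y, majority_label=1):
    """Replace the majority class with the centroids of k clusters,
    where k equals the minority class size (binary 0/1 labels)."""
    maj = [X[i] for i, l in enumerate(y) if l == majority_label]
    mino = [X[i] for i, l in enumerate(y) if l != majority_label]
    cents = kmeans(maj, k=len(mino))
    Xr = cents + mino
    yr = [majority_label] * len(cents) + [1 - majority_label] * len(mino)
    return Xr, yr
```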

Synthetic Minority Oversampling Technique
Chawla et al. [46] introduced SMOTE, one of the simplest and most successful techniques for addressing the problem of class imbalance. It creates new minority class instances synthetically rather than duplicating existing minority class instances [47]. Class overlap is a critical aspect that makes learning a good classifier hypothesis on an imbalanced dataset difficult. SMOTE oversamples by taking the k nearest neighbours of a minority sample, calculating the difference between the sample and a neighbour, multiplying the difference by a random number, and adding it to the feature vector, as shown in Equation (25).
y^_i is a k-nearest neighbour of y_i, and alpha in [0, 1] is a random number. Borderline-SMOTE is a further variant of SMOTE that addresses one of its limitations: cases closer to the borderline (i.e., to majority class examples) are more difficult to categorise correctly, so this approach prioritises those examples to increase oversampling performance. It is similar to SMOTE in that it generates synthetic examples from the minority class samples on the borderline. Borderline-SMOTE [48] first locates samples on the class boundaries (borderline samples) and then oversamples them to improve prediction, since classification techniques learn these boundaries during training. The algorithm is largely based on SMOTE and proceeds as follows:
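The interpolation of Equation (25) can be sketched as follows (a minimal sketch of plain SMOTE, without the borderline filtering step):

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """SMOTE sketch (Equation (25)): a synthetic point is
    y_new = y_i + alpha * (y_hat - y_i), where y_hat is one of the
    k nearest minority neighbours of y_i and alpha ~ U[0, 1]."""
    rng = random.Random(seed)
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    synthetic = []
    for _ in range(n_new):
        y_i = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not y_i),
                            key=lambda p: dist(p, y_i))[:k]
        y_hat = rng.choice(neighbours)
        alpha = rng.random()
        synthetic.append([a + alpha * (b - a) for a, b in zip(y_i, y_hat)])
    return synthetic
```

Because each synthetic point lies on a segment between two minority samples, the new points stay inside the region already occupied by the minority class.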

1. For each sample y_i in S_min in the minority class, obtain the set of its k-nearest neighbours, S_knn.
2. For each y_i, determine the number of nearest neighbours that belong to the majority class, |S_knn and S_maj|, and select the observations that satisfy Equation (26).
3. Run the selected observations through the standard SMOTE method to generate synthetic points involving only the minority and majority classes.
Clustering, filtering, and oversampling are the three steps of k-means SMOTE [49], proposed by Douzas et al. [50]. The input samples are grouped into k clusters using k-means clustering; the filtering stage then chooses clusters with a high proportion of minority class samples for oversampling. The number of synthetic samples to be generated is then distributed across the clusters, with more samples assigned to clusters where minority samples are sparsely distributed; finally, SMOTE is used to achieve the required ratio of samples.
Each cluster's proportion of minority and majority instances is used to identify clusters for oversampling. By default, any cluster containing at least 50% minority samples is selected; this behaviour can be changed through the imbalance ratio threshold, a k-means SMOTE hyperparameter that defaults to 1. The imbalance ratio of a cluster is defined as (majoritycount + 1)/(minoritycount + 1). Raising the threshold makes cluster selection more selective, requiring a higher proportion of minority samples; lowering it loosens the selection criteria, making it possible to choose clusters with a bigger majority share. Selected clusters are assigned sampling weights between 0 and 1, calculated as follows:

1. The Euclidean distance matrix is calculated for each selected cluster s, ignoring the majority class. The mean distance of each cluster is obtained by summing all non-diagonal entries of the distance matrix and dividing by the number of non-diagonal elements.
2. Divide the number of minority instances in each cluster by the average minority distance raised to the power of the number of features n to get the density, i.e., density(s) = minoritycount(s)/averageMinorityDistance(s)^n.
3. Take the inverse of the density to obtain each cluster's sparsity factor, i.e., sparsity(s) = 1/density(s).
4. Calculate the cluster's sampling weight by dividing its sparsity factor by the sum of all clusters' sparsity factors.
The resulting sampling weights sum to 1, so samplingWeights(s) * x, where x is the total number of samples to generate, gives the number of samples generated in cluster s by SMOTE. The hyperparameter k nearest neighbours (knn) in SMOTE determines how many surrounding minority samples of a point are randomly chosen from. When a cluster includes fewer than knn + 1 minority samples, the value of knn may need to be adjusted downward, depending on the SMOTE implementation. After SMOTE has been employed, the minority and majority classes are the same size. Figure 2 depicts the dataset before and after resampling. Table 4 presents the distribution of the dataset before and after resampling.

Gradient Boosting
Gradient Boosting (GB) is an iterative machine learning approach for solving classification problems. The method is based on an ensemble learning model trained using the previous iteration's mistakes as input. By fitting new learners to the ensemble residual, the gap between the target outputs and the ensemble's current predictions, GB corrects misclassified data. GB aims to maximise the ensemble's predictive power while minimising bias. The benefit of Boosting is its significant impact on accuracy, but this comes at the expense of slower training, because the learners are trained sequentially.
GB is a Boosting-like algorithm [51]. Given a training dataset D_t = {(a_i, b_i)}, i = 1, ..., M, the goal of GB is to find an approximation A^ of the function A*(a), which maps instances a to their output values b, by minimising the expected value of a given loss function Lf(b, A(a)). GB builds an additive approximation of A*(a) as a weighted sum of functions, calculated via Equation (27).
A_n(a) = A_{n-1}(a) + w_n g_n(a), (27)
where w_n is the weight of the nth function and the g_n(a) are the members of the ensemble. First, a constant approximation of A*(a) is obtained, as shown in Equation (28).
Subsequent models are fitted by minimisation as in Equation (29). However, instead of solving the optimisation problem directly, each g_n can be seen as a greedy step in a gradient descent optimisation of A*. For that, each model is trained on a new dataset D_t = {(a_i, pr_ni)}, i = 1, ..., M, where the pseudo-residuals pr_ni are calculated via Equation (29).
(w_n, g_n(a)) = argmin_{w,g} sum_{i=1}^{M} Lf(b_i, A_{n-1}(a_i) + w g(a_i)). After that, a line search optimisation problem is used to calculate the value of w_n. If the iterative procedure is not adequately regularised, this approach may suffer from over-fitting. If the model g_n fully matches the pseudo-residuals for some loss function (e.g., quadratic loss), the pseudo-residuals become zero in the following iteration, and the process finishes prematurely. Several regularisation hyper-parameters are examined to govern the additive process of GB. The natural way to regularise GB is to apply shrinkage, reducing each gradient descent step to A_n(a) = A_{n-1}(a) + u * w_n g_n(a) with u in (0, 1.0]. Typically, the value of u is set to 0.1. Furthermore, more regularisation may be obtained by reducing the complexity of the trained models: in the case of decision trees, we can restrict the depth of the trees or the minimum number of instances required to split a node.
In contrast to a random forest, the default settings for these hyper-parameters in GB are designed to severely limit the expressive power of the trees (for example, the depth is often limited to 3 to 5). Finally, various versions of GB incorporate hyper-parameters that randomise the base learners, such as random subsampling without replacement, which can increase the generalisation of the ensemble [52]. Algorithm 1 lists all the steps of gradient boosting.
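The additive procedure can be illustrated with a minimal squared-loss implementation on one-dimensional data using regression stumps (a didactic sketch, not the library implementations evaluated in this study; with squared loss the pseudo-residuals are simply the current residuals):

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on a 1-D feature:
    predicts the mean residual on each side of a threshold."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_rounds=20, shrinkage=0.1):
    """GB with squared loss: start from the constant mean (Eq. (28)),
    then repeatedly fit a stump g_n to the pseudo-residuals and add
    it with shrinkage u, A_n = A_{n-1} + u * g_n."""
    f0 = sum(y) / len(y)
    learners = []
    pred = [f0] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # -gradient
        g = fit_stump(x, residuals)
        learners.append(g)
        pred = [pi + shrinkage * g(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + shrinkage * sum(g(xi) for g in learners)

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = gradient_boost(x, y, n_rounds=50)
```

The shrinkage factor plays the role of the regularisation parameter u discussed above: each stump corrects only a tenth of the remaining residual, so many weak learners are combined gradually.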

Algorithm 1: (1) initialise the constant approximation; (2) calculate the gradient at each training point; (3) fit a new base learner to these target values and find the best gradient step; (4) update the estimated function with the chosen (w_n, g_n(a)).
XGBoost [53] is a highly scalable DT ensemble method based on GB. Like GB, XGBoost minimises a loss function; since it is only interested in DTs as base classifiers, the loss is as shown in Equation (30).
With the leaves Lt of a tree in Equation (31) and the output scores v, the split criterion can be incorporated into the decision trees' loss function, resulting in a pre-pruning approach. In XGBoost, shrinkage is an additional regularisation hyper-parameter that reduces the step size in the additive expansion. Finally, other tactics, such as limiting tree depth, reduce tree complexity, reduce storage space, and increase training speed. To train individual trees, random subsampling and column subsampling at the tree and tree-node levels are among the randomisation approaches used in XGBoost. Furthermore, first- and second-order gradients are obtained by constructing a function and passing it through the "objective" hyper-parameter.
XGBoost, in particular, focuses on minimising the computational cost of determining the optimal split in DT building algorithms. In most split-finding algorithms, all feasible candidate splits are enumerated, and the one with the biggest gain is chosen; a full scan of each sorted attribute is required to discover the optimal split for each node. XGBoost employs a compressed column-based structure containing the data in presorted form to avoid sorting the data at each node. XGBoost also uses an approach based on data percentiles, in which just a sample of candidate splits is examined and their gain is calculated using aggregated statistics. This study tunes the following XGBoost parameters: learning rate, minimum loss (gamma), max_depth, sampling at each split, and the subsampling rate.
Light Gradient Boosting (LGBoost) [54] implements GB and proposes several variations focused on computational efficiency, similar to XGBoost, based on feature histograms; it also provides hyper-parameters that allow it to be used in a range of scenarios and, like plain GB, runs on both GPU and CPU. Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) are two new features proposed by LGBoost. GOSS is a subsampling strategy for creating the training sets of the ensemble's base trees. Like AdaBoost, this strategy tries to increase the relevance of certain cases, here the instances with a larger gradient.
When the GOSS option is enabled, training retains the fraction x of instances with the largest gradients together with a random sample (a fraction y) of the remaining instances; when computing the information gain, the sampled instances are weighted by (1 - x)/y to compensate for the change in distribution. The Exclusive Feature Bundling (EFB) method combines many sparse features into a single feature. GOSS and EFB both improve training speed.

CatBoost
CatBoost [55] is a GB library that reduces the prediction shift that occurs during training. This distribution shift is the difference between A(a_i) | a_i for a training instance a_i and A(a) | a for a test instance a. GB estimates the gradients during training using the same instances on which the models were fitted. CatBoost proposes instead that the gradient of each training instance be estimated by a model that was not trained on that instance. CatBoost does this by introducing a random permutation of the training cases. Its logic is to create i = 1, ..., N base models for each of the M Boosting iterations: the gradient of the (i + 1)th instance for the (m + 1)th Boosting iteration is estimated using the ith model of the mth iteration, trained on the first i instances of the permutation. This method is repeated with j other random permutations in order not to depend on the first random permutation [56-58].
CatBoost is implemented so that a single model per iteration handles all permutations and models. Symmetric trees, formed by extending all leaf nodes level-wise with the same split condition, serve as the base models. CatBoost handles categorical features by replacing them with a numeric feature that estimates the expected target value for each category. Ideally, this numeric feature should be generated from a separate dataset to minimise overfitting the training data; however, this is not always achievable. CatBoost therefore proposes a strategy for computing this additional feature identical to the one used to generate the models: for a given random permutation of the instances, the information of the instances preceding instance i is utilised to determine the feature value of instance i. The feature value obtained for each instance is then averaged over many permutations. Computing target statistics for categorical features, like EFB in the LGBoost model, is a preprocessing approach. CatBoost is a large library with many features, such as GPU training, different parameters for different scenarios, and standard Boosting options.
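The ordered target-statistic idea, where the encoding of instance i uses only the instances that precede it in the permutation, can be sketched as follows (an illustrative sketch, not CatBoost's implementation; the `prior` and `weight` smoothing parameters are assumptions, and the inputs are assumed to be already permuted):

```python
def ordered_target_statistic(categories, targets, prior=0.5, weight=1.0):
    """For each instance, encode its category as the smoothed mean
    target of the *preceding* instances with the same category, so an
    instance's own target never leaks into its encoding."""
    sums, counts, encoded = {}, {}, []
    for c, t in zip(categories, targets):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded.append((s + weight * prior) / (n + weight))
        sums[c] = s + t      # only now does instance i join the history
        counts[c] = n + 1
    return encoded
```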

Results
The model has been trained on the training data, and the test set is then used to assess the model's generalisation, using a 70:30 split. Performance assessment is crucial when evaluating classifiers, and a confusion matrix has been used to evaluate each classifier's performance. The confusion matrix is a table that reports, for each class, how often instances were correctly or incorrectly predicted, contrasting actual and predicted classes. Evaluating models is a necessity in any predictive modelling task, and it is all the more important here, where the performance of a variety of classifiers must be assessed comprehensively. Each of the assessment measures is based on one of four outcomes: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The experiments and results show that our model achieved state-of-the-art performance. We evaluate our model by comparing it with other machine learning models. The results are compared in terms of accuracy, precision, recall, and F1-Score, which can be calculated using these parameters as defined in [59][60][61][62]. Accuracy is the ratio of all correctly predicted samples to the total number of predictions, as shown in Equation (32).
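The four confusion-matrix outcomes can be counted directly from the true and predicted labels. A minimal binary-class sketch (the helper name `confusion_counts` is illustrative):

```python
import numpy as np

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary classifier, given actual and
    predicted labels and the label treated as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    tn = int(np.sum((y_pred != positive) & (y_true != positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    return tp, tn, fp, fn
```

All four evaluation measures used in this study can be derived from these counts.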
Precision: It identifies whether the positive predictions are correctly determined and is the ratio of TP to the sum of TP and FP, as shown in Equation (33).
Recall: It identifies the total relevant results correctly predicted by the model and is the ratio of TP to the sum of TP and FN, as shown in Equation (34).

F1-Score:
It is characterised as the harmonic mean of the model's precision and recall, combining the two into a single measure, as stated in Equation (35).
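Following the definitions of Equations (32)-(35) above, the four metrics can be computed from the TP/TN/FP/FN counts in a few lines (a sketch; the function name is illustrative, and division-by-zero handling is omitted for brevity):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts,
    per the standard definitions (Equations (32)-(35) in the text)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)      # Eq. (32)
    precision = tp / (tp + fp)                      # Eq. (33)
    recall = tp / (tp + fn)                         # Eq. (34)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (35)
    return accuracy, precision, recall, f1
```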
Before resampling, different feature selection techniques have also been used to reduce the data dimensionality. We have applied five feature selection techniques: Correlation, Information gain, Gain ratio, ReliefF, and OneR, and selected eight features using them. Each feature selection approach scores, and thereby ranks, all features, which enables the selection of the k most important features. Formally, for every j ∈ J, where J is a set of integers with j ∈ {1, 2, . . . , n − 1, n}, j specifies the number of best features to be assessed in that experiment; in the best case, J includes all numbers in the range, since feature selection strives to choose the best features with respect to the scorer. The assessment of a feature selection technique depends on the score of the underlying approach, and for each method the experiment was run numerous times. Furthermore, we utilised undersampling and oversampling techniques to balance the dataset. The GB algorithm was used with the following hyperparameters: learning-rate = 0.1, number of trees (n-estimators = 100), max-depth = 3, min-samples-split = 2, min-samples-leaf = 1, and subsample = 1. The parameters used by LGBoost were max-depth = 20, num-of-iterations (estimators) = 100, num-leaves = 31, and subsample = 1; the other parameters, L1 and L2 regularisation (the alpha value), were 0. For XGBoost, we set learning-rate = 0.1, n-estimators = 100, subsample = 1, and max-depth = 3. When tuning XGBoost, three parameters matter most for achieving better results: the number of trees, the tree depth, and the step size.
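To make the score-then-rank-then-take-top-k procedure concrete, the following sketch implements one of the listed scorers, Information gain, for discrete features in pure NumPy (the helper names are illustrative; the other scorers would plug into `top_k_features` the same way):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(Y) of a discrete label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - sum_v P(X = v) * H(Y | X = v) for a discrete feature."""
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

def top_k_features(X, y, k):
    """Score every column of X, rank the columns, and return the top-k indices."""
    scores = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]
```

Running each scorer for every j in J and comparing the resulting classifier performance is what the repeated experiments described above amount to.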
The parameters used for the CatBoost algorithm are: iterations = 1000, leaf-estimation-iterations = 100, depth = 7, L2 regularisation = 5, and learning-rate = 0.03; other parameters such as bagging-temperature, random-strength, leaf-estimation, eval-metric, bootstrap-type, and loss-function are optional. As our task was binary classification, parameters like loss-function and eval-metric were not specified, but in the multiclass case they must be. Table 5 shows the result of the random undersampling technique with multiple feature selection techniques. The GB model performed poorly, with an accuracy of 68.1% for ReliefF and 68.6% for the OneR feature selection technique [63][64][65]. The XGBoost model performed very well for the correlation and information gain feature selection techniques, with an accuracy of 82.8%, while the CatBoost model performed very well for the Gain ratio feature selection technique, leading in recall with a value of 82.7%. Tables 6 and 7 show the results of the Cluster Centroid and Near Miss techniques. The GB model performed well for the Gain ratio, with an accuracy of 78.9% (Table 6), and the CatBoost model performed well overall in both tables, with accuracies of 80% and 78.9%. Still, the results obtained from both undersampling techniques were lower than those of the previous study by Li G et al. [1]. CatBoost classified both classes very well when undersampling techniques were used. However, with undersampling the accuracy decreased to 78.9% when the Cluster Centroid technique was used and 79.2% for Near Miss, which is poor given that Li G et al. [1] achieved an accuracy of 80% on the imbalanced dataset. Even though the accuracy achieved by CatBoost using random undersampling was 82.7%, greater than that of Li G et al. [1], we cannot neglect oversampling techniques. Figure 3 plots the highest result for each feature selection and resampling technique.
Table 8 shows the result of the random oversampling technique for multiple feature selection techniques. All models performed very poorly with OneR and ReliefF, with highest accuracies of 68.3% and 73.3%, respectively, whereas they performed well with the Gain ratio, with an accuracy of 82.9%. Both the XGBoost and CatBoost models obtained the highest result of 82.9% with the correlation and Information gain feature selection techniques. Table 9 shows the outcome of the SMOTE oversampling approach across the feature selection strategies. The results produced with this resampling technique were lower than those obtained with random oversampling. With OneR and ReliefF, the models underperformed with values of 75.9% and 71.0%, but did well with the Gain ratio, where LGBoost reached 82.9%. With the correlation and information gain feature selection strategies, both the XGBoost and CatBoost models came out on top, but XGBoost took the lead with a value of 83.3%. Table 10 presents the results of the K-means SMOTE oversampling strategy with the various feature selection procedures. The results obtained with this resampling approach were the best of all the resampling techniques across the feature selection strategies. Models employing the Information gain, ReliefF, and OneR feature selection techniques were unsatisfactory; however, models using the Gain ratio and Correlation feature selection strategies did extremely well, with a value of 87.1%, and the CatBoost model outperformed the other models. In summary, when the K-means SMOTE approach was utilised, the results were most efficient with the correlation-based feature selection technique.
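The core idea behind SMOTE and its variants, generating synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours, can be sketched in a few lines of NumPy (an illustrative sketch, not the imblearn implementation; the function name and defaults are our own):

```python
import numpy as np

def smote(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority-class samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (the essence of SMOTE oversampling)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    synthetic = np.empty((n_synthetic, minority.shape[1]))
    for i in range(n_synthetic):
        idx = rng.integers(n)
        # Euclidean distances from the chosen sample to all minority samples
        d = np.linalg.norm(minority - minority[idx], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        nb = minority[rng.choice(neighbours)]
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic[i] = minority[idx] + gap * (nb - minority[idx])
    return synthetic
```

K-means SMOTE differs by first clustering the data and applying this interpolation within selected clusters, which tends to avoid generating samples in noisy regions.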
Table 10, discussed above, summarises the performance evaluation of the classifiers using the feature subset selected by each feature selection approach. The CatBoost approach's performance was also compared to that of typical machine learning models, and the CatBoost method outperformed them. The findings show that the CatBoost model produced the best results, with an accuracy of 87.1 percent, as also shown in Figure 4.

Discussion
Because machine learning and data mining approaches can automatically discover complex patterns in the healthcare domain, such as COVID-19 [66,67], Skin Cancer [59], Breast Cancer [16], Malignant Mesothelioma [68][69][70], and Cervical Cancer [7], researchers are motivated to use these techniques in the early prediction of Schistosomiasis. In recent years, many machine learning methods, including LR, DT, RF, and ANN, have been widely applied in disease detection and prediction [59]. Before resampling, different feature selection techniques have also been used to reduce the data dimensionality. We have applied five feature selection techniques: Correlation, Information gain, Gain ratio, ReliefF, and OneR, and selected eight features by performing multiple experiments, as explained above. Furthermore, we utilised undersampling and oversampling techniques to balance the dataset. We used four algorithms, GB, LGBoost, XGBoost, and CatBoost, with the combinations of hyperparameters discussed in the Results section.
The results of the random undersampling approach with the several feature selection strategies are shown in Table 5. The GB model fared badly, with an accuracy of 68.1% for ReliefF and 68.6% for OneR. The XGBoost model fared extremely well for the correlation and information gain feature selection approaches, with an accuracy of 82.8%, while the CatBoost model did very well for the Gain ratio feature selection technique, with an accuracy of 82.7%. The outcomes of the Cluster Centroid and Near Miss approaches are shown in Tables 6 and 7. The GB model fared well in the Gain ratio case, with an accuracy of 78.9% in Table 6, while the CatBoost model did well overall in both tables, with accuracies of 80% and 78.9%.
Nonetheless, the findings from both undersampling strategies were lower than those of the prior study by Li G et al. [1]. When undersampling techniques were utilised, CatBoost performed admirably in categorising both groups. However, the accuracy dropped to 78.9% when the Cluster Centroid approach was applied and to 79.2% for Near Miss, which was disappointing considering that Li G et al. [1] obtained an accuracy of 80% on the unbalanced dataset. Even so, CatBoost's accuracy using random undersampling was 82.7%, higher than that of Li G et al. [1]; Figure 3 displays the best outcome for each feature selection and resampling procedure.
The results of the random oversampling approach for the different feature selection strategies are shown in Table 8. All models fared badly with the OneR and ReliefF techniques, with greatest accuracies of 68.3% and 73.3%, respectively; however, they did well with the Gain ratio, with an accuracy of 82.9%. For the correlation and information gain feature selection strategies, both the XGBoost and CatBoost models achieved 82.9%. Table 9 compares the SMOTE oversampling strategy across the feature selection algorithms. The results achieved via this resampling procedure were lower than those acquired by random oversampling. The models underperformed at 71.0% in the OneR and ReliefF tests, while XGBoost took the lead for the correlation and information gain feature selection algorithms, with a value of 83.3%. Table 10 shows the outcomes of the K-means SMOTE oversampling technique with the several feature selection procedures; of all the resampling procedures, this methodology gave the best results. Models utilising the Information gain, ReliefF, and OneR feature selection procedures were poor, but models adopting the Gain ratio and Correlation feature selection strategies performed very well, with a value of 87.1%, and the CatBoost model outperformed the other models. When the K-means SMOTE strategy was applied, the results were most efficient with the correlation-based feature selection technique. Table 10 summarises the performance evaluation of the classifiers. The performance of the CatBoost technique was also compared to that of standard machine learning models, and the CatBoost method beat the traditional models; as Table 10 reveals, the CatBoost model gave the best results, with an accuracy of 87.1%, as shown in Figure 4.
In our research, CatBoost was more accurate than the other methods, as it has been on other medical conditions. In the CatBoost model, we discovered that ascitic fluid volume was the most useful predictor. The most prevalent sign of advanced hepatic illness is ascites. The ascitic subtype of advanced Schistosomiasis is the most dangerous, accounting for 65-90 percent of cases [60]. Granulomatous inflammation can be caused by venous blockage and portal hypertension, and leads to continuous fibrosis and a drop in plasma colloid osmotic pressure [61]. Ascites is the most common consequence of liver injury, and its severity directly influences the overall prognosis. In advanced Schistosomiasis patients, the occurrence of severe ascites is one of the best indicators of a high level of impairment [62]. This study has demonstrated that CatBoost is a novel and promising modelling framework for the unfavourable prognosis of advanced Schistosomiasis. A comparison between our model and the previous studies is shown in Table 11.

Conclusions
Machine learning methods have been used in various fields in combination with techniques for unbalanced data. This study aims to use supervised learning algorithms to train multiple Schistosomiasis disease prediction systems. Because balancing the dataset is important for improving a model's performance in classification challenges, several resampling strategies were utilised to balance the dataset. We began by employing exploratory data analysis approaches, such as data standardisation, to analyse the dataset. We then conducted several tests to validate the performance of the feature selection strategies against one another and narrow down the characteristics that can properly diagnose the disease; these variables can be addressed early, before the disease threatens human life. We then compared the outcomes of several GB models, namely LGBoost, XGBoost, and CatBoost, against each other and against earlier studies. The CatBoost model has a greater prediction accuracy rate than typical machine learning-based models. Using the K-means SMOTE resampling approach, the CatBoost method had the greatest accuracy of 87.1%. In the future, we will utilise more datasets linked to this disease in order to include the features categorised as important by the dataset collector but not included in this dataset. Finally, we will release this framework on the web so medical professionals can benefit from it.

Data Availability Statement: The data will be available upon request.