Diagnosis and Prediction of Large-For-Gestational-Age Fetus Using the Stacked Generalization Method

Faheem Akhtar 1,2 , Jianqiang Li 1,* , Yan Pei 3 , Azhar Imran 1 , Asif Rajput 2 , Muhammad Azeem 1 and Qing Wang 4 1 School of Software Engineering, Beijing University of Technology, Beijing Engineering Research Center for IoT Software and Systems, Beijing 100124, China; fahim.akhtar@iba-suk.edu.pk (F.A.); azharimran63@gmail.com (A.I.); mazeem.qau@hotmail.com (M.A.) 2 Department of Computer Science, Sukkur IBA University, Sukkur 65200, Pakistan; asifali@iba-suk.edu.pk 3 School of Computer Science and Engineering, the University of Aizu, Aizuwakamatsu 965-8580, Japan; peiyan@u-aizu.ac.jp 4 Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China; qing.wang@tsinghua.edu.cn * Correspondence: lijianqiang@bjut.edu.cn; Tel.: +86-106-7396-842


Introduction
During the last several decades, an increase in the incidence of large-for-gestational-age (LGA) neonates has been reported in developed countries, and the increase is even wider in developing countries [1,2]. An LGA fetus is defined as one whose weight is above the 90th percentile for fetuses of the same gestational age and sex [3]. LGA is associated with serious maternal and neonatal complications, including shoulder dystocia [4,5], insulin resistance [6,7], metabolic syndrome [6], prolonged labor [5], cesarean section [8], postpartum bleeding [8], serious adverse consequences before and after delivery including breast cancer [9,10], and an increased infant mortality rate [11]. Because these complications bear directly on the health of the newborn, LGA is a topic of keen interest for paediatricians and related health-care officials.
On the basis of these concerns, the primary motivation behind this research is to develop an accurate LGA classification model that can classify an LGA fetus before birth using maternal biochemical indicators. To improve LGA classification performance, a master feature vector (MFV) is created from the dataset of the National Pre-Pregnancy and Examination Program of China (2010-2013) [12] to formalize the LGA dataset; in this step, feature values are discretized and missing values are handled. Principal feature subsets are then created using the proposed GridSearch-based RFECV + IG feature selection scheme followed by stacking, which selects, extracts, and ranks features to enhance the performance of the proposed classification schemes with minimal generalization error. The experimental results show that the top ten features selected by each feature selection process performed best, and that the Support Vector Machine (SVM) with linear kernel produced the highest performance metric scores. Moreover, for comparative analysis, the proposed scheme is compared with previously published research on the same LGA dataset.
The rest of the paper is organized as follows: Section 2 presents the related work. Section 3 defines the methodology of this research with complete details of data pre-processing, experimental flow, and performance metrics. Section 4 presents the experimental results of the various experimental processes. Section 5 discusses the experimental results and compares them with existing baseline schemes to demonstrate the importance of the proposed scheme. Finally, Section 6 concludes the paper and presents future work.

Related Work
Previously, different practices were used to identify an LGA fetus. The most common methods used estimated fetal weight (EFW), abdominal circumference (AC), ultrasound surveillance of obese women, maternal BMI, gestational weight, gestational diabetes mellitus (GDM), etc. For example, Shen used the sonographic estimated fetal weight (EFW) of Chinese women to classify a fetus as LGA or non-LGA and achieved specificity and sensitivity of 48.1% and 97.3%, respectively [13]. Blue used AC with EFW for LGA classification [14]. Harper proposed ultrasound surveillance of obese women before 32 weeks of gestation to classify a fetus as LGA or non-LGA [15]. Chen used maternal BMI with gestational weight for LGA classification [16]. Moore conducted a cohort analysis and demonstrated that an LGA fetus exhibits dichotomous risks at term [17]. Luangkwan used linear modelling to observe the risk of parental complications in pregnant women with an LGA fetus [18]. In addition, some studies proposed monitoring variations in fetal biochemical indicators during different physical checkups to control their consequences [19-21]. Overall, most of these were observational or retrospective studies that used simple logistic regression to extract a discriminant feature subset for the establishment of an LGA prognosis process.
To the best of our knowledge, our previous work was the first to exploit machine learning (ML) techniques for the establishment of an efficient LGA prognosis process. In [22], we used the information gain (IG) feature selection scheme for LGA prognosis and achieved precision and Area Under the Curve (AUC) scores of 0.71 and 0.70, respectively. In [23], we used IG with an ensemble scheme to improve classification performance through the extraction of useful features, and achieved precision and AUC scores of 0.84 and 0.72, respectively. Furthermore, in [24], by incorporating expert knowledge, we obtained precision and AUC scores of 0.95 and 0.86, respectively. In that research, we identified twenty ranked in-practice features for the establishment of an efficient LGA prognosis process. However, there is still room to improve the prediction performance of an LGA classification system. Therefore, a master feature vector is created, and a GridSearch-based Recursive Feature Elimination with Cross-Validation (RFECV) scheme followed by stacked generalization is introduced to select, rank, and extract a suitable feature subset with higher classification performance and reduced generalization errors. RFECV and stacked generalization have previously performed well in various related application domains [25,26].

Materials and Methods
This research proposes two schemes for LGA classification. In the first scheme, a master feature vector is created, features are selected with the GridSearch-based Recursive Feature Elimination with Cross-Validation (RFECV) scheme using machine learning models tuned with GridSearch, and the resulting feature subsets are ranked with the Information Gain (IG) feature selection scheme and given to four influential machine learning classifiers. Scheme 1 illustrates the methodology of this first LGA classification scheme. The second scheme is intended to enhance LGA classification performance while minimizing generalization errors. Its objective is to improve LGA classification performance with an ensemble of stacked classifiers based on the meta-level features extracted at level-0 of the stacking procedure. The features extracted at level-0 are then given to level-1 of the stacking process to establish a state-of-the-art LGA classification model. Scheme 2 illustrates the methodology of this second scheme. In both schemes, the classifiers are constructed and tested with ten-fold cross-validation to diagnose an infant as LGA or non-LGA. Ten-fold cross-validation is used to minimize generalization errors and to arrive at a standardized LGA classification framework.

Dataset Collection
The benchmark LGA dataset used in this research was collected from the National Pre-Pregnancy and Examination Program of China [12]. The program was initiated to eliminate birth deficiencies of Chinese citizens across China (2010 to 2013). The project covered all of the provincial and municipal hospitals of China. The examination checklist was suggested and finalized by the mutual consensus of a panel of experts from various related domains (i.e., obstetrics, paediatrics, andrology, internal medicine, etc.). The checklist includes pre-pregnancy items (i.e., eating habits (male (m)/female (f)), smoking (m/f), drinking (m/f), height (m/f), occupation (m/f), etc.), pregnancy items that include the parents' clinical measures, reproductive system measures, abnormalities in pregnancy, etc., and socio-economic and demographic factors.
The obtained dataset comprises 371 features and 215,568 records. The records are labelled using a widely used LGA classification scheme proposed by Zhu et al. [27], which is presented in Table 1. Based on this scheme, each record is classified as either LGA or non-LGA: 26,226 records are labelled as LGA, and the remaining 189,342 are labelled as non-LGA.

Preparation of the Master Feature Vector
To improve the performance of the LGA classifiers, the master feature vector of the LGA dataset must be accurate, robust, and correctly identified. As mentioned previously, the LGA dataset was obtained from an official project that was launched across almost every related hospital in China, and a project of this size inevitably contains a certain amount of missing fields. The reason is not always human error; at times, a paediatrician did not see the need to prescribe or record a specific test mentioned in the proposed recording guidelines. Handling these missing fields is a significant and necessary step in the classification task; otherwise, they adversely affect the classification results. Therefore, to establish a better LGA classification model, the following algorithm is proposed to address these issues.
Algorithm 1 can be explained as follows. Suppose $L_0$ is the basic LGA dataset with $f_0$ features and $n_0$ records. To create the MFV, a classification column is first added to each record according to the classification criteria of [27]. Each feature value is discretized with the help of the literature and paediatricians' expertise; a threshold is set to delete 10% from controls (LGA's record) and 15% from cases (non-LGA) records, and the remaining records are imputed with the mode of the corresponding feature values. As a result of this process, a Master Feature Vector (MFV) is extracted from the complete LGA dataset. The details of the resultant MFV are presented in Figure 1. where, Figure 1a

Preparation of the Principal Feature Vector
An accurate and robust classification system requires discriminative features with reduced dimensions. Irrelevant and unnecessary features not only affect classifier performance but also demand excessive computational resources and time for the classification task [28-31]. Various researchers have proposed feature selection, extraction, and reduction schemes to deal with irrelevant features and the curse of dimensionality in a classification system [23,28,32-34]. In this article, we recommend using an ensemble of feature selection and extraction schemes to build an accurate and state-of-the-art LGA prediction model. The development of the Principal Feature Vector (PFV) comprises two aspects. In the first aspect, the GridSearch-based RFECV + IG feature selection scheme is applied to select and rank features and to remove noisy features from the LGA dataset. In the second aspect, stacked generalization is used to further extract features from those obtained with the GridSearch-based RFECV + IG scheme (i.e., for the sake of dimensionality reduction to reduce generalization errors), which further improves the classification performance of the proposed scheme with less computational overhead. These schemes are discussed in the subsequent subsections.

GridSearch-Based RFECV + IG Feature Selection Scheme
Recursive Feature Elimination (RFE) is a scheme that excludes features based on their irrelevance and low data integrity with respect to a specified class distribution [25,35]. The elimination process continues until a complete deterministic feature subset is reached. The process takes a classification model and, based on the classification weights, keeps the highly weighted features and eliminates the noisy ones. For a better elimination and selection process, the parameters of the classification model need to be tuned. In this research, GridSearch, a popular technique for parameter tuning, is combined with RFE to improve classification performance. Moreover, in Recursive Feature Elimination with Cross-Validation (RFECV), fitting is accompanied by testing: it uses training and test splits determined by a given folding parameter, which helps in minimizing generalization errors. Support Vector Machine (SVM) with linear and rbf kernels [36,37], Logistic Regression (LR) [38], and Decision Tree (DT) [39] classifiers are used in the RFECV feature selection scheme, with their parameters tuned by GridSearch with five-fold cross-validation. During GridSearch tuning, for the SVM (with linear kernel) and LR classifiers, 'C' is tuned in the range of $2^{-8}$ to $2^{8}$; for SVM with rbf kernel, 'C' is tuned in the range of $2^{-8}$ to $2^{8}$ (see Table 2). All of the corresponding feature subsets are given to the specified machine learning classifiers to establish the first experimental setup. Furthermore, the IG feature selection scheme is used as part of an ensemble of feature selection processes to rank the feature subsets induced above, as discussed below.
Information Gain (IG): is an extensively used feature selection scheme in a variety of machine learning problems, especially in the medical domain. The IG feature selection scheme works with the objective of reducing uncertainty in a feature vector.
Once the level of uncertainty is known, the more a feature reduces that uncertainty, the more information it brings to the classification system. Thus, a larger information gain is brought to the system for the development of an efficient LGA classification model. In the IG feature selection scheme, "information entropy" is used to measure the amount of information, and the information gain is calculated as the difference between the information entropy of dataset $B$ without and with the LGA feature $x_i$. For a training dataset $B$ with $I$ class labels, $H(B)$ represents the information entropy of the LGA class distribution in $B$, which can be expressed as follows,

$$H(B) = -\sum_{i=1}^{I} P_i \log_2 P_i$$

where $P_i$ represents the probability of the $i$-th class in the training dataset $B$. Moreover, a feature $x_i$ with $D$ distinct values can be used to partition dataset $B$ into $D$ distinct groups. Then, the entropy of each group $B_d$ $(d = 1, ..., D)$ is calculated as,

$$H(B_d) = -\sum_{i=1}^{I} p_{di} \log_2 p_{di}$$

where $p_{di}$ is the probability of the $i$-th class in the training data subset $B_d$. Because each subset may contain a different number of samples, i.e., each subset $B_d$ contains $Z_d$ samples $(d = 1, ..., D)$ out of the $Z$ samples in $B$, its weight is set to $Z_d/Z$. The information gain obtained by using feature $x_i$ to partition dataset $B$ can be written as

$$IG(x_i) = H(B) - \sum_{d=1}^{D} \frac{Z_d}{Z} H(B_d)$$

Based on the calculated information gain of every attribute, the attributes are ranked in descending order of IG for the further experiments.
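As a small illustration of the ranking step described above, the following sketch computes the information gain of discrete features with the standard Shannon entropy (base-2 logarithm, which is an assumption here) and sorts them in descending order. The function names and the toy data are hypothetical and are not taken from the LGA dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the size-weighted entropy of each partition."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Toy, made-up records (not from the LGA dataset).
labels = ["LGA", "LGA", "non-LGA", "non-LGA"]
features = {
    "feature_a": [1, 1, 0, 0],  # separates the two classes perfectly
    "feature_b": [1, 0, 1, 0],  # carries no information about the class
}

# Rank features by information gain, highest first.
ranking = sorted(
    features,
    key=lambda name: information_gain(features[name], labels),
    reverse=True,
)
```

Here each distinct feature value plays the role of one partition of the dataset, so continuous features would need to be discretized first, as is done when the master feature vector is built.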

Feature Extraction and Dimension Reduction with Stacked Generalization
Stacked generalization (SG): also known as stacking, is a process that combines multiple classifiers to form an efficient classification system. It was introduced by Wolpert in 1992 [40]. In stacking, the outputs generated by level-0 (base-level) classifiers are used as the input to a level-1 (meta-level) classifier to improve classification performance. Using cross-validation, the process is as follows. Suppose that $L$ is the obtained LGA dataset with attributes $a_i$ and associated class labels $y_i$. Thus, $L = \{(a_i, y_i), i = 1, ..., n\}$ refers to the level-0 LGA dataset. With K-fold cross-validation, $L$ is divided into $K$ disjoint parts $L_1, L_2, ..., L_K$, where at the $k$-th fold $L_k$ is used as the test part and $L^{(-k)} = L - L_k$ is used as the training part. Then, $N$ learning algorithms $A_1, A_2, ..., A_N$ are applied to the training part $L^{(-k)}$ to build $N$ level-0 classifiers $C_1, C_2, ..., C_N$. The concatenated predictions of the $N$ level-0 classifiers on $L_k$ at each $k$-th fold, together with the actual class label, form the meta-level vector $ML_k$. It is used during the establishment of the level-1 classification.
The complete meta-level vector, also called level-1 data, is obtained as the union of the $ML_k$, where $k = 1, 2, 3, ..., K$, during the cross-validation process. We then apply the algorithm $A_m$ to this data to form the meta-level classifier $C_m$. During the development of $C_m$, $A_m$ could be any of $A_1, A_2, ..., A_N$ or a different one. After the meta-level data are formed, the learning algorithms $A_1, A_2, ..., A_N$ are trained on the entire dataset to build the final base-level classifiers $C_1, C_2, ..., C_N$.
Ting et al. [41] proposed using class probabilities instead of just class labels to form the meta-level feature vector, as this can better improve classification performance with an improved learning rate. Therefore, to classify a new instance, the predicted probabilities and predicted class labels of all level-0 classifiers are concatenated to form a meta-level feature vector with $N$ components. Based on this meta-level feature vector, the level-1 classifier assigns the final class label to the input instance $x$.
Feature Extraction and Dimension Reduction with Stacking: represents the above-defined process, where the outputs of level-0 classification are combined to form the complete meta-level features. It is, in fact, a process to extract discriminant features rather than just classification predictions. This combination strategy has previously helped different researchers reduce generalization errors during the classification task. On this basis, considering the best classification schemes on the GridSearch-based RFECV + IG feature subsets, we propose to use Logistic Regression (LR) and Random Forest (RF) classifiers to create the discriminant meta-level feature subset, followed by a Support Vector Machine (SVM) classifier at level-1, to establish a state-of-the-art LGA classification system. LR, RF, and SVM were chosen because of their efficient performance during the sensitivity analysis process.
Furthermore, before starting the feature extraction process, we propose to use the technique defined below to reduce the size of the data, in order to improve classification speed and performance without further deletion of valuable records. We follow [26], where the authors trained classifiers on hyperspectral data (shape features data and magnitude features data) and combined their results with stacking to form a new feature subset extracted from the level-0 classifiers' prediction probabilities and the actual and predicted outputs. Based on this, we subdivided the whole LGA dataset $L$ obtained as a result of MFV creation, in which a total of 36,172 records (LGA = 14,658, non-LGA = 21,514) are selected. We divided both the LGA and the non-LGA records in half and formed two subsets with an equal number of records, which we call subsets 1 and 2 of the LGA dataset $L$. Each subset of $L$ contains 18,086 records (LGA = 7329, non-LGA = 10,757). These two subsets are used at level-0 of the stacking process, which is discussed in detail above. The complete process of the feature extraction and classification task is presented in Figure 3, where at level-0 of the stacking process RF and LR classifiers with ten-fold cross-validation are used for the feature extraction task, and SVM with the linear kernel is used at level-1 of the stacking process for the classification prediction task. These classifiers were selected because of their efficient performance on the dataset $L$ in the first group of experiments.
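The two-level procedure described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the exact pipeline of this study: the data are synthetic (`make_classification`), only out-of-fold class probabilities are used as meta-level features (predicted labels and the two-subset split are omitted), and the hyperparameters are arbitrary placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the selected feature subset (assumption, not LGA data).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Level-0: LR and RF, as named in the text; settings are illustrative.
level0 = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

# Out-of-fold class probabilities: each row is predicted by models that
# never saw that row during training, mirroring the K-fold construction.
meta_columns = [
    cross_val_predict(clf, X, y, cv=folds, method="predict_proba")
    for clf in level0
]
meta_features = np.hstack(meta_columns)  # two probability columns per classifier

# Level-1: linear-kernel SVM trained and scored on the meta-level features.
level1 = SVC(kernel="linear")
level1_scores = cross_val_score(level1, meta_features, y, cv=folds)
```

scikit-learn also provides `StackingClassifier`, which wraps the same idea in a single estimator; the explicit version is shown here because it exposes the meta-level feature matrix that the text treats as the extracted features.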

LGA Classification Tools and Schemes
IBM SPSS Statistics 22.0 and Python are used for primary data processing. The LGA classification schemes are coded in Python using the scikit-learn toolkit. Based on recent studies, four machine learning classifiers are selected. Logistic Regression (LR), Support Vector Machine (SVM) [42,43], and Random Forest (RF) classifiers are selected because of their outstanding performance in previously reported literature [22-24], and the Decision Tree (DT) classifier is selected because of its simplicity, implicit feature screening, ease of interpretation, and because it does not require any assumption of linearity in the data [39]. In addition, SVM with RBF kernel is also used to observe the efficiency of SVM with the kernel trick; other kernels are not used because of their high computational time and cost.

Performance Evaluation Metrics
To evaluate the performance of the proposed GridSearch-based RFECV feature selection scheme followed by the IG feature selection scheme with stacked generalization, we selected precision, recall, accuracy, AUC, specificity, and F1 scores as performance evaluation measures [44]. The possible outcomes of the proposed GridSearch-based RFECV + IG scheme and the GridSearch-based RFECV + IG scheme followed by stacking can be described as true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The metrics are derived from these outcomes as follows.
Precision measures how often the classifier is correct when it predicts the positive class; it is the number of true positives divided by the sum of true positives and false positives, $TP / (TP + FP)$.
Recall is the fraction of actual positives in the dataset that are identified, i.e., the ability of the system to extract all relevant cases from the dataset; it is the number of true positives divided by the sum of true positives and false negatives, $TP / (TP + FN)$.
Accuracy is the correctness of the LGA classifiers in predicting LGA or non-LGA; it is the sum of true positives and true negatives divided by the sum of true positives, true negatives, false positives, and false negatives, $(TP + TN) / (TP + TN + FP + FN)$.
AUC is used to analyse the correctness of the classification system in predicting a specific class. It represents the area under the curve computed class-wise for a specific class.
Specificity is the proportion of actual negatives that are correctly identified as such. It represents the true negative rate, i.e., the number of true negatives divided by the sum of true negatives and false positives, $TN / (TN + FP)$.
F1 Score is the harmonic mean of recall (true positive rate) and precision, $F1 = 2 \cdot Precision \cdot Recall / (Precision + Recall)$.
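The count-based metrics defined above can be computed directly from the four confusion-matrix outcomes, as in the following minimal sketch. The function name and the example counts are hypothetical; AUC is not included because it depends on the ranking of predicted scores rather than on the four counts alone, and the sketch assumes all denominators are non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, specificity, accuracy and F1 from raw counts.

    Assumes every denominator is non-zero (at least one predicted positive,
    one actual positive and one actual negative).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Made-up counts for illustration only (not results from this study).
metrics = classification_metrics(tp=80, fp=20, tn=90, fn=10)
```

With an imbalanced class distribution such as LGA versus non-LGA, reporting recall and specificity alongside accuracy shows how the errors are split between the two classes.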

Experiment Results
The experimental process is consolidated into two main processes. In the first process, two groups of experiments are performed. In the first group, the feature subsets produced by the RFECV feature selection scheme, whose parameters are tuned with GridSearch, are given to four widely used ML classifiers (Decision Tree (DT), Logistic Regression (LR), and Support Vector Machine (SVM) with linear and RBF kernels) to classify LGA infants. In the second group, we apply the Information Gain (IG) feature selection scheme to the feature subsets previously identified by the RFECV feature selection scheme. This forms an ensemble of feature selection processes in which two feature selection schemes are applied in sequence to remove noisy features and identify ranked feature subsets. In the second process, stacking is used to extract features from the previous ensemble feature subsets, adding another ensemble layer on top of the feature subsets to reduce the classifiers' generalization errors and improve classification performance. The results of the experiments are presented in the following subsections.
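The first experimental process (tune the estimator with GridSearch, select features with RFECV, then rank the survivors) can be sketched with scikit-learn as below. This is an illustrative sketch under several assumptions: the data are synthetic, only LR is shown, the grid of `C` values is a coarse subset of the $2^{-8}$ to $2^{8}$ range, and `mutual_info_classif` is used as a stand-in for the IG ranking rather than the exact IG computation of this study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the master feature vector (assumption, not LGA data).
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=2, random_state=0)

# Step 1: tune C over powers of two with five-fold cross-validation.
param_grid = {"C": [2.0 ** e for e in range(-8, 9, 2)]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

# Step 2: recursive feature elimination with cross-validation,
# driven by the coefficients of the tuned estimator.
selector = RFECV(search.best_estimator_, step=1,
                 cv=StratifiedKFold(n_splits=5), scoring="accuracy")
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)  # indices of retained features

# Step 3: rank the retained features (mutual information as an IG stand-in)
# and keep at most the top ten.
scores = mutual_info_classif(X[:, selected], y, random_state=0)
ranked = selected[np.argsort(scores)[::-1]]
top_features = ranked[:10]
```

Tuning before elimination keeps the two steps separate and easy to inspect; nesting the grid search inside each elimination round is another possible design, at a higher computational cost.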

Results of GridSearch Based RFECV + IG Feature Selection Scheme for LGA Prediction
To highlight the importance of the proposed GridSearch-based RFECV feature selection scheme followed by the IG feature selection scheme, we executed the initial experiments using all features of the created MFV and the features selected by the GridSearch + RFECV feature selection scheme. Table 3 gives the details of the results. The results show that the GridSearch + RFECV feature selection scheme improved the LGA classification scores and performed best with the SVM classifier (using linear and RBF kernels). Furthermore, the classification performance of all of the classifiers on the GridSearch + RFECV feature subsets also improved compared to the results on the MFV feature subset. Based on this improvement, and considering the primary objective of this research, which is to identify a principal feature subset for a better LGA prognosis, we executed the first proposed experimental process. Figure 4 presents the results of the initial experimental process, where an ensemble of feature selection schemes is created using the GridSearch + RFECV and IG feature selection schemes. The results show that all of the proposed classifiers performed best with the ten principal features. SVM (with the linear kernel) outperformed all others with precision, recall, accuracy, AUC, specificity, and F1 scores of 0.97, 0.61, 0.83, 0.87, 0.999, and 0.74, respectively, followed by SVM (with RBF kernel), LR, and DT. The SVM (with RBF kernel) and LR classifiers produced similar performance metric scores, whereas the DT classifier remained weak.
Table 3. Results of all features subsets selected by GridSearch-based RFECV features selection scheme using well known ML classifiers with 10-fold cross validation.

Results of GridSearch Based RFECV + IG Feature Selection Scheme with Stacking for LGA Prediction
To improve classification performance by reducing the generalization errors of the selected classifiers, we executed the second experimental process. The objective is to reduce or eliminate generalization errors and improve classification performance using stacking, where level-0 of stacking is used for principal feature extraction with the intention of dimension reduction, and level-1 is used to reduce generalization errors and improve classification performance. Figure 5 presents the complete results of the proposed scheme. The results show that the performance metric scores improved markedly, most noticeably with the ten principal features. SVM (linear kernel) remained best with precision, recall, accuracy, AUC, specificity, and F1 scores of 0.92, 0.87, 0.92, 0.95, 0.95, and 0.89, respectively, followed by SVM (RBF kernel), LR, and DT. The SVM (RBF kernel) and LR classifiers produced similar performance metric scores, whereas the DT classifier remained weak.

Discussions and Comparative Analysis with Existing State-of-the-Art LGA Classifications
The proposed scheme for the classification of an LGA fetus using stacked generalization with an ensemble feature selection scheme performed best in selecting a useful feature subset that can accurately identify a fetus from its gestational parameters. The results also show that the ten ranked principal features selected by every feature selection scheme remained best among all feature subsets and produced the highest prediction performance metric scores. Table 4 presents the comparative best results of all three groups of experiments with the proposed ensemble of feature selection and extraction techniques with stacked generalization. Among the three sets of experiments, the SVM (linear kernel) classifier produced the highest prediction performance metric scores, with precision, recall, accuracy, AUC, specificity, and F1 scores of 0.92, 0.87, 0.92, 0.95, 0.95, and 0.89, respectively. The reason is the formation of a maximum-margin hyperplane between the LGA and non-LGA classes. The creation of this hyperplane is possible because of the easily separable feature subsets induced by applying the proposed ensemble of feature selection and extraction schemes with stacked generalization. The SVM (RBF kernel) classifier is also suitable for the LGA classification task because of its impressive results, but it is not recommended because of its computational complexity. Furthermore, the LR classifier can also be used for this classification task, but the DT classifier is not recommended due to its low performance. The weak performance of the DT classifier might be due to its inadequacy in applying regression and the possibility of duplication of the same sub-tree on different paths while predicting values.
The significance of the proposed scheme is highlighted by comparing its results with existing state-of-the-art LGA classification schemes. Table 5 presents the comparative best results of recently published schemes on the same dataset alongside the proposed scheme. The results reveal that the highest prediction performance metric scores (i.e., precision = 0.92, AUC = 0.95, recall = 0.87, accuracy = 0.92, specificity = 0.95, and F1 = 0.89) are obtained by the proposed scheme with SVM (linear kernel) using the ten principal features. Table 6 presents the ten ranked principal features of the GridSearch-based RFECV + IG feature selection scheme with four ML classifiers using ten-fold cross-validation. The comparative analysis also indicates that the feature engineering and classification schemes of this research are well suited to establishing a state-of-the-art LGA prognosis process with improved classification performance and less computational overhead. The improvement in classification performance results from the extraction of a reduced number of discriminant features, which reduces the complexity of the LGA classifiers and their generalization errors and thereby improves LGA classification accuracy.
Table 6. The ranked ten principal feature subsets of GridSearch-based RFECV + IG feature selection scheme with four ML classifiers using ten-fold cross validation.
Moreover, Friedman and Bonferroni-Dunn tests are used to rank the classifiers and to highlight the significant differences between the results reported in Figures 4 and 5. Initially, the Friedman test (p < 0.05) is employed to rank the classifiers based on the results of these experiments. The longitudinal axis in Figures 6 and 7 represents the average mean ranking calculated using the Friedman test on all groups of experiments.
The results show that SVM with linear kernel outperformed the others in almost every group of experiments. In addition, the Bonferroni-Dunn test is applied to the results of the Friedman test at the significance levels of α < 0.05, α < 0.01, and α < 0.001. Equation (10) is used to calculate the Critical Distance (CD) used in the Bonferroni-Dunn test. Based on the guidelines provided by the author of [45], for Figure 6 we selected N = 6 and k = 4, with q α (0.05) = 3.4077, q α (0.01) = 4.089, and q α (0.001) = 4.9198, whereas for Figure 7 we selected N = 5 and k = 4, with q α (0.05) = 3.3045, q α (0.01) = 4.004, and q α (0.001) = 4.8444. These figures show that SVM has the largest difference between the pair-wise means of the control group and the critical values, which validates the earlier conclusion that the ten ranked features with the SVM (linear kernel) classifier are an important means of diagnosing infants as LGA or non-LGA.
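The critical distance can be computed as in the sketch below. It assumes that Equation (10), which is not reproduced in this text, is the commonly used critical-distance formula $CD = q_\alpha \sqrt{k(k+1)/(6N)}$, with k the number of compared algorithms and N the number of measurements; the q values are the ones quoted above for Figure 6, and the function and variable names are hypothetical.

```python
from math import sqrt


def critical_distance(q_alpha, k, n):
    """Critical distance CD = q_alpha * sqrt(k * (k + 1) / (6 * n)).

    k: number of compared algorithms, n: number of measurements.
    The formula is assumed to correspond to Equation (10) of the text.
    """
    return q_alpha * sqrt(k * (k + 1) / (6.0 * n))


# q values quoted in the text for Figure 6 (N = 6, k = 4).
Q_FIGURE_6 = {0.05: 3.4077, 0.01: 4.089, 0.001: 4.9198}

cd_figure_6 = {
    alpha: critical_distance(q, k=4, n=6) for alpha, q in Q_FIGURE_6.items()
}
```

Two classifiers are then regarded as significantly different at a given level when the difference between their average Friedman ranks exceeds the corresponding critical distance.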

Table 6 column headers: Number; GridSearch + RFECV + IG + SVM (Linear); GridSearch + RFECV + IG + SVM (RBF); GridSearch + RFECV + IG + LR; GridSearch + RFECV + IG + DT.
Furthermore, the proposed scheme has the potential to classify various disease classes accurately using gestational parameters as suggested by the panel of experts from different domains. The limitation of the proposed scheme is that it has been tested only on the LGA dataset. However, it has the potential to produce accurate results for Small for Gestational Age (SGA) infants as well, which we will explore in our future work. In addition, as previously discussed, machine learning techniques have never been applied extensively to LGA, so this research presents an extensive study that can help paediatricians and researchers extend their research in this area. Moreover, in our future work, deep learning techniques such as the standard deep neural network (NN) [46], hierarchical deep learning (HDL) [47], and random multimodel deep learning (RMDL) or deep perceptron [48] will also be explored to add more scientific results to the related domain.
Figure caption: Results ranked with the Friedman test and Bonferroni-Dunn test for four ML algorithms (SVM (Linear kernel), SVM (RBF kernel), LR, and DT) with precision, recall, accuracy, AUC, specificity, and F1 score at significance levels of α < 0.05, α < 0.01, and α < 0.001, taking DT as the control algorithm, for the Figure 5 results.

Conclusions and Future Work
In this research, an LGA classification model is developed to classify a fetus as LGA or non-LGA. It is composed of the GridSearch-based RFECV + IG feature selection scheme followed by stacking to select, rank, and extract significant features from the LGA dataset. The proposed LGA classification scheme, which uses stacking with an ensemble of feature selection and extraction schemes, yielded better performance in terms of precision, AUC, recall, accuracy, specificity, and F1 scores than existing state-of-the-art schemes. This study provides a comprehensive comparison of the performance of various decision models on the LGA dataset and concludes that the GridSearch-based RFECV + IG feature selection scheme with stacking using SVM (linear kernel) best suits this classification process, followed by the SVM (RBF kernel) and LR classifiers. The DT classifier is not suggested because of its low performance. Almost every classification scheme performed best with the ten principal features. The results indicate that the proposed scheme has the potential to classify an LGA fetus accurately and efficiently. In addition, the promising results indicate that paediatricians and experts can use the proposed model as a second opinion when establishing an efficient LGA classification system, which can assist them in establishing a proper LGA prognosis process with a ranked feature subset. In the future, the proposed scheme will be extended to the classification and identification of Small for Gestational Age (SGA) infants with better performance metric scores, and deep learning techniques will also be explored to improve classification performance.