Machine Learning Approach for Prediction of the Test Results of Gonadotropin-Releasing Hormone Stimulation: Model Building and Implementation

Precocious puberty in girls is defined as the onset of pubertal changes before 8 years of age, and gonadotropin-releasing hormone (GnRH) agonist treatment is available for central precocious puberty (CPP). The gold standard for diagnosing CPP is the GnRH stimulation test. However, the GnRH stimulation test is time-consuming, costly, and requires repeated blood sampling. We aimed to develop an artificial intelligence (AI) prediction model to assist pediatric endocrinologists in decision making regarding the optimal timing to perform the GnRH stimulation test. We reviewed the medical charts of 161 girls who received the GnRH stimulation test from 1 August 2010 to 31 August 2021, and we selected 15 clinically relevant features for machine learning modeling. We chose the models with the highest area under the receiver operating characteristic curve (AUC) to integrate into our computerized physician order entry (CPOE) system. The AUC values for the CPP diagnosis prediction model (LH ≥ 5 IU/L) were 0.884 with logistic regression, 0.912 with random forest, 0.942 with LightGBM, and 0.942 with XGBoost. For the Taiwan National Health Insurance treatment coverage prediction model (LH ≥ 10 IU/L), the AUC values were 0.909, 0.941, 0.934, and 0.881, respectively. In conclusion, our AI predictive system can assist pediatric endocrinologists when they are deciding whether a girl with suspected CPP should receive a GnRH stimulation test. With proper use, this prediction model may help avoid unnecessary invasive blood sampling for GnRH stimulation tests.


Introduction
Precocious puberty in girls is traditionally defined as the onset of pubertal changes before 8 years of age. If left untreated, it can lead to compromised final adult height and early menarche [1]. Furthermore, negative emotional and behavioral consequences have been reported, such as substance abuse, peer pressure, self-image concerns, social isolation, early sexual behavior, conduct issues, truancy, and having multiple sexual partners [1,2]. Precocious puberty can be classified into two types: central (gonadotropin-dependent) and peripheral (gonadotropin-independent). Central precocious puberty (CPP) results from earlier maturation and activation of the hypothalamic-pituitary-gonadal axis. It is usually idiopathic in girls, though it can also be caused by pathological conditions, such as central nervous system (CNS) tumors, CNS injury, or genetic syndromes (neurofibromatosis type 1, tuberous sclerosis, Sturge-Weber syndrome, etc.). Peripheral

Study Design, Setting, and Samples
We retrospectively reviewed the medical charts of all pediatric female patients who received the GnRH stimulation test in Chi Mei Medical Center between 1 August 2010 and 31 August 2021. We obtained approval from the Institutional Review Board of the hospital before data collection (IRB No.: 11011-001).
We excluded patients with menarche because a very high chance of a positive GnRH stimulation test result can be assumed in these cases without assistance from a prediction model. We also excluded patients whose medical records had missing items. Two girls received the GnRH stimulation test due to secondary amenorrhea, and two other girls received it due to delayed puberty. We excluded these four girls from our study because their tests were not performed for suspected CPP. The overall study flow is described in detail in Figure 1.

Features and Outcome Variables
We collected data from electronic medical records of the following variables generally available for data collection in clinical practice: chronological age (CA), age of thelarche, height, height SDs, weight, weight SDs, BMI, paternal height, maternal height, mid-parental height (MPH), predicted adult height (PAH), annual growth rate, bone age (BA), Tanner stage (for breast and pubic hair development), lab result (random serum LH, FSH, E2), and birth history (gestational age, birth body weight). We calculated MPH via the following formula: (paternal height + maternal height − 13)/2. We calculated the z-scores for height and weight according to the data published by Chen et al. in 2010 [8]. We interpreted the BA using the Greulich and Pyle method [9]. PAH was calculated using BA and published data (Chen et al., 2010) based on Taiwanese children and adolescents [8]. The GnRH stimulation tests were performed with an intravenous bolus of 0.1 mg gonadorelin acetate. Blood samples were obtained for the baseline and at 30, 60, and 90 min after gonadorelin injection. Serum LH, FSH, and E2 were measured using the Abbott Architect i2000SR immunoassay analyzer (Abbott Laboratories, Irving, TX, USA). We chose two cutoff values of stimulated LH as the prediction outcomes for establishing our AI models: LH ≥ 5 IU/L (level for CPP diagnosis) and ≥10 IU/L (level for NHI reimbursement for CPP treatment in Taiwan). We compared the demographics of the stimulation test results of the positive group (LH ≥ cutoff after stimulation) to that of the negative group (LH < cutoff after stimulation).
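As a minimal sketch of the derived quantities described above, the following Python computes MPH with the stated formula and turns a set of stimulation-test LH values into the two binary outcomes. The function names, the dictionary layout, and the patient values are hypothetical; treating only the post-injection samples as "stimulated" is an assumption, not something the text specifies.

```python
def mid_parental_height(paternal_cm, maternal_cm):
    """MPH for a girl, as given in the text: (father + mother - 13) / 2."""
    return (paternal_cm + maternal_cm - 13) / 2


def peak_stimulated_lh(lh_by_minute):
    """Highest LH (IU/L) among the post-injection samples (30, 60, 90 min).

    Assumption: the baseline sample (minute 0) does not count as stimulated.
    """
    return max(value for minute, value in lh_by_minute.items() if minute > 0)


def outcome_labels(peak_lh):
    """Binary targets for the two prediction models (cutoffs from the text)."""
    return {"lh_ge_5": peak_lh >= 5.0, "lh_ge_10": peak_lh >= 10.0}


# Hypothetical patient; all values are invented for illustration only.
mph = mid_parental_height(172.0, 160.0)
peak = peak_stimulated_lh({0: 0.3, 30: 6.2, 60: 7.8, 90: 5.1})
labels = outcome_labels(peak)
print(mph, peak, labels)
```

In this invented example the test would count as positive for the diagnostic cutoff but negative for the reimbursement cutoff.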

Statistical Analysis and Model Building
For practical implementation, we selected features based on statistical significance and clinical experts' opinion. We then randomly stratified the data into a training dataset (70% data) for model building and a testing dataset (30% data) for model validation. Because the number of negative test result cases was lower than that of positive test result cases, we used the SMOTE method (synthetic minority over-sampling technique) to improve the data imbalance in the training dataset [10]. We paired each outcome with 4 machine learning algorithms to build the predictive models. These algorithms were as follows: (1) logistic regression (LR), (2) random forest (RF), (3) LightGBM, and (4) XGBoost. We used Python and scikit-learn machine learning tools. We used a grid search with 5-fold cross validation to tune the hyper-parameters to build the best models based on the training dataset.
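The split and oversampling steps can be illustrated with a simplified, standard-library sketch. It is not the authors' code: `stratified_split` and `smote_like` are hypothetical helpers, the two-dimensional features are random placeholders, and the interpolation step only mimics the core idea of SMOTE (new minority points placed between a minority sample and one of its nearest minority neighbours). A real pipeline would more likely rely on a library implementation, such as the `SMOTE` class in imbalanced-learn.

```python
import math
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Class counts for the 5 IU/L cutoff (24 negative, 137 positive tests).
labels = [0] * 24 + [1] * 137
train_idx, test_idx = stratified_split(labels)

# Placeholder features; real inputs would be the selected clinical features.
feature_rng = random.Random(1)
features = [[feature_rng.gauss(0, 1), feature_rng.gauss(0, 1)] for _ in labels]

# Oversample only the training minority class until the classes are balanced.
minority = [features[i] for i in train_idx if labels[i] == 0]
n_majority = sum(1 for i in train_idx if labels[i] == 1)
synthetic = smote_like(minority, n_new=n_majority - len(minority))
print(len(train_idx), len(test_idx), len(minority), len(synthetic))
```

Oversampling is applied to the training portion only, so the testing dataset keeps the original class ratio.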

Model Evaluation and Practical Implementation
After building the models, we used the testing dataset to validate the models according to well-defined model performance indicators: accuracy, sensitivity, specificity, and AUC (area under the receiver operating characteristic curve). We regarded the model with the highest AUC value as the best model and used it to implement a prediction system for practical use. We built the models with Python and the scikit-learn library. We developed a web-based prediction system with Microsoft Visual Studio® 2017 and integrated it into the existing hospital information system (HIS). The prediction system can immediately capture feature values from the HIS and display the predicted probabilities of stimulated LH ≥ 5 IU/L and LH ≥ 10 IU/L.
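The four performance indicators can be sketched in plain Python as follows. The labels and predicted probabilities are invented toy values, the function names are hypothetical, and the AUC is computed with the pairwise (Mann-Whitney) definition rather than any particular library routine.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, FP, TN, FN when predicting positive at prob >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def basic_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc(y_true, y_prob):
    """Probability that a random positive scores above a random negative
    (ties count as one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.55, 0.2]
scores = basic_metrics(y_true, y_prob)
auc_value = auc(y_true, y_prob)
print(scores, auc_value)
```

Unlike accuracy, sensitivity, and specificity, the AUC does not depend on a classification threshold, which is one reason it is a convenient criterion for comparing algorithms.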

Enrollment and Baseline Statistical Tests
As presented in Figure 1, during the study period of 1 August 2010 to 31 August 2021, 161 GnRH stimulation tests were performed for 140 pediatric female patients with early thelarche and suspected CPP. A total of 15 patients received the GnRH test twice due to negative results from the first test. Three patients received the GnRH test three times due to negative results from the first two tests. Using the peak LH cutoff level of 5 IU/L, we allocated 24 tests to the "Peak LH < 5 IU/L group" and 137 tests to the "Peak LH ≥ 5 IU/L group". For the LH cutoff level of 10 IU/L, we allocated 65 tests to the "Peak LH < 10 IU/L group" and 96 tests to the "Peak LH ≥ 10 IU/L group". The detailed baseline characteristics and grouping based on the LH cutoff values of 5 and 10 are shown in Tables 1 and 2.
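The counts in this paragraph can be cross-checked with a few lines of arithmetic. The variable names are hypothetical; the numbers are taken directly from the text.

```python
patients = 140
tested_twice = 15        # each contributes 1 extra test
tested_three_times = 3   # each contributes 2 extra tests

total_tests = patients + tested_twice * 1 + tested_three_times * 2

groups_cutoff_5 = {"negative": 24, "positive": 137}
groups_cutoff_10 = {"negative": 65, "positive": 96}

# Share of negative tests at each cutoff, i.e. the minority class size.
negative_share_5 = groups_cutoff_5["negative"] / total_tests
negative_share_10 = groups_cutoff_10["negative"] / total_tests

print(total_tests, sum(groups_cutoff_5.values()), sum(groups_cutoff_10.values()))
print(round(negative_share_5, 3), round(negative_share_10, 3))
```

Both groupings sum to the same 161 tests, and the negative group is the minority class at either cutoff, more markedly so at 5 IU/L.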

Characteristics and Features
When choosing the cutoff level of ≥5 IU/L for stimulated peak LH, there was a borderline difference in BMI classification between the negative test result and the positive test result groups (p = 0.046). The positive test result group (stimulated peak LH ≥ 5 IU/L) had a higher frequency of underweight BMIs (0.73% vs. 0%), normal BMIs (73.72% vs. 62.50%), and overweight BMIs (16.79% vs. 8.33%), and we observed a lower frequency for obesity ( 10.47, yr, p = 0.047). However, we found no difference in BA advancement (BA-CA) or in ∆BA/∆CA between the two bone age films.

Model Building and Evaluation
We used four machine learning algorithms (LR, RF, LightGBM, and XGBoost) to build the prediction models with the training dataset. We used a grid search with fivefold cross-validation to tune the hyper-parameters for each algorithm to obtain the best model. We tested the prediction models with the testing dataset and evaluated their accuracy, sensitivity, specificity, and AUC. Among the four algorithms, LightGBM and XGBoost had the highest AUC (0.942) for the peak LH cutoff of 5 IU/L, and RF had the highest AUC (0.941) for the peak LH cutoff of 10 IU/L (see Table 3, Figure 2).
We further compared our machine learning algorithms to a practical scoring system based on breast Tanner stage, basal LH, and basal FSH, which was proposed in a recent study from Taiwan [11] (see Table 4). The AUC of every machine learning algorithm was higher than that of the scoring system. Accuracy and specificity were also better when the models were adjusted to the same sensitivity.
Table 3. Performance of the two outcome predictive models (peak LH cutoff of 5 IU/L and peak LH cutoff of 10 IU/L).

Prediction System Development and User Evaluation
The LightGBM prediction model for the LH cutoff value of 5 IU/L was the best model because it had the highest AUC and accuracy, and the random forest prediction model was the best model for the LH cutoff value of 10 IU/L. We therefore used these best models for the development and deployment of a clinical prediction system. The AI Center and Department of Information Systems of Chi Mei Medical Center embedded the best models in a web-based AI system for predicting the test results of gonadotropin-releasing hormone stimulation tests (see Figure 3). It launched in March 2022.
We showed the AI system to three clinicians in the pediatric endocrinology department, and it was well received. The system is a potentially useful tool in the decision-making process of whether the GnRH stimulation test should be performed. The graphic display of AI-calculated probability can also aid clinicians with explanations to patients and parents.

Case Scenarios
As shown in Figure 4A, a physician performed an AI prediction for a patient in the outpatient clinic. The result showed a low probability for both stimulated LH > 5 IU/L (30.09%) and stimulated LH > 10 IU/L (4.29%). With such results, the clinician may need to re-evaluate the patient and reconsider arranging the GnRH stimulation test. In Figure 4B, another patient showed a high probability for both stimulated LH > 5 IU/L (74.07%) and stimulated LH > 10 IU/L (60%). The result was consistent with the clinician's evaluation, and the patient would likely benefit from arranging the GnRH stimulation test. As shown in Figure 4C, the prediction results for the third patient showed a high probability for stimulated LH > 5 IU/L (67.91%). However, the prediction result for stimulated LH > 10 IU/L was low (6.73%). This prediction result suggests that the patient is likely to have ongoing central precocious puberty but may not yet fulfill the cutoff for the NHI reimbursement of GnRH analog treatment. The patient would likely benefit from close follow-up and the arrangement of a GnRH stimulation test after a short period of time. Although the AI prediction system cannot replace the clinician's clinical evaluation, the above three scenarios show that it can be used to check for consistency with that evaluation. When the prediction result is inconsistent with the clinical evaluation, a reminder is presented to the clinician to reassess the patient and consider the option of further close follow-up rather than immediate arrangement of the GnRH stimulation test.
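The way the two probabilities are read together in these scenarios can be sketched as a small decision function. This is an illustrative assumption, not the deployed system's logic: the function name is hypothetical, the 0.5 threshold is the default evaluation threshold mentioned in the Discussion, and the "inconsistent" branch is an added safeguard for the case in which only the 10 IU/L model is positive.

```python
def expected_lh_range(p_ge_5, p_ge_10, threshold=0.5):
    """Map the two model probabilities to an expected peak LH range."""
    above_5 = p_ge_5 >= threshold
    above_10 = p_ge_10 >= threshold
    if above_10 and not above_5:
        # The two models disagree; flagged for manual review (assumption).
        return "inconsistent"
    if above_10:
        return ">= 10 IU/L"
    if above_5:
        return "5 to < 10 IU/L"
    return "< 5 IU/L"


# Probabilities reported for the three scenarios in Figure 4A-C.
scenarios = {
    "A": (0.3009, 0.0429),
    "B": (0.7407, 0.60),
    "C": (0.6791, 0.0673),
}
for name, (p5, p10) in scenarios.items():
    print(name, expected_lh_range(p5, p10))
```

Under these assumptions, scenario A maps to a likely negative test, scenario B to a result meeting the reimbursement cutoff, and scenario C to the intermediate range in which close follow-up is suggested.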

Screenshot (translated into English) of the interactive AI prediction system (interface page to capture/fill in necessary feature values). AI, artificial intelligence; BA, bone age; BMI, body mass index; CA, chronological age; E2, estradiol; FSH, follicle-stimulating hormone; GnRH, gonadotropin-releasing hormone; ID, identity; LH, luteinizing hormone; MPH, mid-parental height; PAH, predicted adult height; SD, standard deviation.
Diagnostics 2023, 13, x FOR PEER REVIEW

Discussion
In this study, we established an AI prediction system and integrated it into our computerized physician order entry (CPOE) system to predict GnRH stimulation test results. We demonstrated that, if the GnRH stimulation test is still clinically needed, an AI prediction system is an ideal method to save time, reduce costs, and avoid unnecessary repeated blood samplings.
Many studies have attempted to simplify or substitute the GnRH stimulation test. Several studies have suggested different cutoff values for basal LH, with variable sensitivity [11–16]. Other studies have tried to reduce the number of blood samplings required for the GnRH stimulation test, suggesting that a single blood sampling at 30 min or 40 min post GnRH injection may be adequate [11,17,18]. The subcutaneous administration of GnRH was also proposed, but multiple blood samplings are still required under such a protocol [19]. Despite all these attempts, the GnRH stimulation test is still considered the gold standard test and remains more frequently used.
Aside from trying to simplify or replace the GnRH test, using scoring systems or AI models to predict the probability of a positive test result may provide another solution to this problem. Researchers of a recent study from Taiwan proposed a practical scoring system using breast Tanner stage, basal LH, and basal FSH [11]. The scoring system had 76% sensitivity and 72% specificity. The stimulated LH cutoff level was set at 10 IU/L in that study because of the NHI reimbursement criteria for GnRH analog treatment. Researchers of another study from Shanghai proposed a CPP risk score model, which classifies patients into low-, medium-, and high-risk groups [20]. They used a stimulated peak LH ≥ 5 IU/L and a peak LH/FSH ratio ≥ 0.6 for the diagnosis of CPP. Similar to our study, researchers of a study from Guangzhou proposed prediction models using machine learning algorithms [21]. However, their diagnostic criteria included either peak LH levels ≥ 10 IU/L or peak LH levels ≥ 5 IU/L combined with a peak LH/FSH ratio ≥ 0.6.
In this study, the features we selected were routinely evaluated for suspected CPP patients in clinical practice. Via history taking, the onset age of thelarche and the age of the patients during the visit were routinely recorded. Further inquiry about the height of parents can help calculate the mid-parental height (MPH). Via physical examination, height, weight, BMI, annual growth rate, and Tanner staging were obtained. Because patients presented at different ages, we used height SDs and weight SDs in our models to better represent the heights and weights of the patients compared with girls of the same age. Laboratory workup, including bone age, LH, FSH, and LH/FSH ratio, can also be easily obtained at the outpatient clinic.
Compared with the non-CPP group (stimulated peak LH < 5 IU/L), the CPP group (stimulated peak LH ≥ 5 IU/L) had a significantly higher frequency of underweight, normal, and overweight BMIs, and the frequency of obesity was lower. This can be explained by the fact that obese girls presenting breast lumps do not necessarily have activation of the hypothalamic-pituitary-gonadal axis [3]. However, these girls may be considered candidates for the GnRH stimulation test because of the earlier onset of thelarche. The results of the GnRH stimulation tests for these individuals would likely be negative. Therefore, the clinician should be careful not to overestimate breast development in obese girls [7]. On the other hand, obese children may have blunted LH response during the GnRH stimulation test as a result of LH suppression due to androgen/estrogen excess [22].
The Tanner stage is commonly used as the standard for breast and pubic hair development ratings in clinical practice [23]. In our study, the Tanner staging for breast development had a significantly more advanced distribution in the positive test result group, regardless of the stimulated peak LH cutoff level chosen. However, we found no significant difference in Tanner staging for pubic hair development and the presence of axillary hair. This can be explained by the finding that breast bud appearance is the first pubertal sign in girls. However, the appearance of pubic hair can occur before, after, or together with puberty onset [24]. Adrenal-derived androgens can cause the appearance of pubic hair and axillary hair. Premature adrenarche may present with an advanced bone age but without breast development [25]. The hypothalamic-pituitary-gonadal axis is not activated in such circumstances.
A higher growth velocity and advanced bone age were shown as factors for predicting positive results in the GnRH stimulation test, with a rapid growth velocity suggested as the most useful predictive factor [26]. In our study, annual growth rate was significantly higher in the positive test result group, regardless of the stimulated peak LH cutoff level chosen. In addition to bone age, we also included factors such as BA-CA and ∆BA/∆CA. However, we found no significant difference in these factors between the positive and negative test result groups. Intraoperator and interoperator variability in bone age determination using an atlas-based method is not uncommon [27]. This provides an explanation for our results because bone age reports are generated by different radiologists in our hospital.
In the laboratory features, our results show a higher basal FSH, basal LH, and basal LH/FSH ratio in the positive GnRH stimulation test result group. This is consistent with previous studies, in which basal LH, FSH, and LH/FSH ratio were found to be significantly higher in individuals with a positive GnRH stimulation test result [11,26].
A strength of our study is that the models can be used to predict the probability of either stimulated LH ≥ 5 IU/L or stimulated LH ≥ 10 IU/L. A physician may have different considerations while arranging the GnRH stimulation test. The prediction model for CPP diagnosis (stimulated LH ≥ 5 IU/L) can help decide whether the GnRH stimulation test is helpful in confirming the diagnosis for a girl with suspected CPP. However, if the reimbursement of the GnRH agonist treatment is also a consideration, the prediction model for treatment coverage (stimulated LH ≥ 10 IU/L) can be used together with the prediction model for CPP diagnosis (stimulated LH ≥ 5 IU/L). When used together, the two models can help predict whether the test result would reveal that the patient does not have CPP (LH < 5 IU/L), that the patient has CPP but does not meet NHI treatment criteria (≥5 IU/L but <10 IU/L), or that the patient has CPP and meets the NHI treatment criteria (LH ≥ 10 IU/L).
However, our model still has some limitations. First, this study has a retrospective design. Potential problems, such as missing or inaccurate data, could have occurred. However, all the features used in our model are basic information obtained via history taking and physical exams. The lab data, including LH, FSH, and E2, were also routinely obtained in girls who received the GnRH stimulation test. Therefore, we did not encounter any missing data during data collection. Second, we interpreted bone age reports from different radiologists, and interoperator variability could have occurred, even when using the same Greulich and Pyle method. We did not include pelvic echo information for the same reason because such measurements can be even more operator-dependent. Third, we collected our data from a single medical center in Southern Taiwan. The study population is mostly composed of members of the Han Chinese ethnic group, with few ethnically Southeast Asian individuals. Therefore, our model may not be suitable in other populations or regions. Finally, although our model shows a good AUC, as clinicians, we also need to take sensitivity into account, since we would not want to miss the diagnosis in any patient. To address this, we can lower the classification threshold to improve model sensitivity, but false alarms (false positives) may increase and affect model performance. Therefore, we would suggest using our model to assist in predicting the result of the GnRH stimulation test and to seek better timing for test arrangement with an appropriate evaluation threshold (0.5 by default), rather than using it to rule out the diagnosis. The diagnosis should be based on the overall clinical picture instead. Furthermore, though the model shows a good AUC, we did not prospectively validate the model. We plan to address this issue in our future work via real-time prediction using the AI system launched in March 2022.
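The trade-off between sensitivity and false positives when the threshold is lowered can be illustrated with a short sketch. The data are invented toy values and the helper names are hypothetical; the procedure simply picks the highest threshold that still reaches a target sensitivity and reports the specificity that results.

```python
def sens_spec(y_true, y_prob, threshold):
    """Sensitivity and specificity when predicting positive at prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def threshold_for_sensitivity(y_true, y_prob, target):
    """Highest observed score usable as a threshold with sensitivity >= target."""
    for t in sorted(set(y_prob), reverse=True):
        sens, spec = sens_spec(y_true, y_prob, t)
        if sens >= target:
            return t, sens, spec
    raise ValueError("target sensitivity cannot be reached")


# Toy data, invented for illustration only.
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.45, 0.35, 0.4, 0.2, 0.1]

default = sens_spec(y_true, y_prob, 0.5)
strict = threshold_for_sensitivity(y_true, y_prob, target=1.0)
print(default, strict)
```

In this toy example, moving from the default threshold of 0.5 to the threshold needed for full sensitivity recovers the two missed positives at the cost of one false positive.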

Conclusions
Our machine learning models can provide valuable information to pediatric endocrinologists when they need to decide whether a girl with suspected CPP should receive a GnRH stimulation test. The two models may be used alone or together for predicting whether the stimulated LH level would be <5 IU/L, between 5 and 10 IU/L, or ≥10 IU/L. With the assistance of these prediction models, pediatric endocrinologists can choose the optimal timing for arranging GnRH stimulation tests. To the best of our knowledge, very few studies have used machine learning approaches to build prediction models regarding the optimal timing for arranging GnRH stimulation tests, and even fewer have implemented real-time AI system prediction in a clinical setting. Thus, the results of our study have profound academic and practical novelty and value. We call for future researchers to consider including more parameters to improve prediction performance. We also encourage broadening the retrospective data to include multiple centers.

Informed Consent Statement:
Patient informed consent was waived because of the retrospective nature of the study and because the research involved no more than minimal risk to the participants.

Data Availability Statement:
Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.