Machine Learning Models to Forecast Outcomes of Pituitary Surgery: A Systematic Review in Quality of Reporting and Current Evidence

Background: The complex nature and heterogeneity of pituitary surgery results have increased interest in machine learning (ML) applications for prediction of outcomes over the last decade. This study aims to systematically review the characteristics of ML models involving pituitary surgery outcome prediction and assess their reporting quality. Methods: We searched the PubMed, Scopus, and Web of Science databases for publications on the use of ML to predict pituitary surgery outcomes. We used the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) to assess report quality. Our search strategy was based on the terms “artificial intelligence”, “machine learning”, and “pituitary”. Results: 20 studies were included in this review. The principal outcomes predicted in each article were post-surgical endocrine outcomes (n = 10), tumor management (n = 3), and intra- and postoperative complications (n = 7). Overall, the included studies adhered to a median of 65% (IQR = 60–72%) of TRIPOD criteria, ranging from 43% to 83%. The median reported AUC was 0.84 (IQR = 0.80–0.91). The most popular algorithms were support vector machine (n = 5) and random forest (n = 5). Only two studies reported external validation and adherence to any reporting guideline. Calibration methods were not reported in 15 studies. No model achieved the phase of actual clinical applicability. Conclusion: Applications of ML in the prediction of pituitary outcomes are still nascent, as evidenced by the lack of any model validated for clinical practice. Although studies have demonstrated promising results, greater transparency in model development and reporting is needed to enable their use in clinical practice. Further adherence to reporting guidelines can help increase AI’s real-world utility and improve clinical practice.


Introduction
Pituitary adenomas (PAs) comprise 10-15% of all intracranial tumors [1]. Medical management and radiation therapy are treatment options in selected cases, but transsphenoidal surgery remains the primary treatment modality for most patients with symptomatic nonfunctioning and functioning pituitary tumors, with overall low rates of morbidity and mortality [2,3]. Surgical outcomes, such as disease remission, extent of resection, and complications, are influenced by different factors, including tumor size and invasiveness, previous treatments, and patient age and comorbidities [4-7].
Machine learning (ML) is a type of artificial intelligence (AI) that uses input data to generate outputs by learning patterns, and it has been successfully applied across different areas of medicine [8-10]. The increasing volume of health care data provides inputs for innovative methods of data gathering, selection, and analysis [11]. ML is especially useful in these settings because of its capacity to deal with large swaths of data [12].
ML models have shown promising results in neurosurgery. For example, ML-based imaging analysis is promising for radiological identification of glioblastoma molecular subtypes [13]; ML models have also been used to predict outcomes of radiosurgery for cerebral arteriovenous malformations [14] and outcomes of chronic subdural hematoma [15]. Therefore, ML holds promise as a tool to augment clinical decision making [16]. Recent studies on pituitary adenomas and transsphenoidal surgery have also explored methodological designs based on ML models; radiological diagnosis and prediction of clinical outcomes and complications have been evaluated with promising initial results [16-18]. Table 1 presents a glossary of the most common terms from the literature, and Table 2 describes the most common ML algorithms used in healthcare.

Table 1. Definitions of important concepts in machine learning and artificial intelligence areas.

Artificial Intelligence: A broad area of computer applications with the ability to perform tasks that conventionally require human intelligence.
Machine Learning: An application of artificial intelligence (AI) that provides systems with the ability to automatically learn and improve from experience without being explicitly programmed.
Deep Learning: A subset of ML formally defined as computational models composed of multiple processing layers that learn representations of data with multiple levels of abstraction.
Supervised learning: A model trained on input data to predict a target output that was manually labeled a priori (e.g., diagnosis or prognosis).
Unsupervised learning: ML models that can perform tasks without labels set by a human (e.g., clustering data).
Structured data: Data pre-defined to be displayed in rows and columns (e.g., electronic medical records, administrative data). A more quantitative form of data.
Unstructured data: Data without any predefined structure (e.g., image analysis, text). A more qualitative form of data.
Missing values: Values absent for one or more features of an observation in the dataset; they must be handled (e.g., by exclusion or imputation) before model training.
Complete case analysis: Exclusion of a row with missing data among its features.
Feature: Data science term for a predictor/independent variable.
Label/Target: Data science term for the outcome/dependent variable.
Parameter: Inherent weights of a given model, which are learned from the data during training.
Hyperparameters: An ensemble of settings that specify how a model learns. They need to be set by the developer before training and are iteratively tuned to find the model that performs best.
Overfitting: When a model performs well on the training data (seen patients) but poorly on the testing data (unseen patients). Regularization is often used to minimize overfitting and optimize the generalizability of machine learning algorithms.
Discrimination: Describes the model's ability to correctly identify, among random pairs from the data on which it was trained, who will develop the target condition. Usually evaluated through the model's AUC/C-statistic.
Area Under Curve (AUC)/C-statistic: The most used discriminative statistic. An area of 1.0 represents a perfect test; an area of 0.5 represents a worthless test. It enables assessment of predictive ability and identification of an optimal threshold to distinguish between classes.
Accuracy: Proportion of all predictions that are correct.
Sensitivity/Recall: Proportion of actual positives that are correctly predicted.
Specificity: Proportion of actual negatives that are correctly predicted.
PPV/Precision: Proportion of positive predictions that are correct.
NPV: Proportion of negative predictions that are correct.
F1 score: Composite metric defined as the harmonic mean of precision (PPV) and recall (sensitivity).
Internal Validation: Assessment of a model's performance with the same data or population (if prospective) used in the development process.
External Validation: Assessment of a model's performance in a dataset that differs geographically or temporally from the one used in its development.
Cross Validation: Internal validation technique in which the dataset is randomly split into k groups of similar size; the model is trained on k-1 groups and its performance evaluated on the remaining group, with the whole process repeated k times; model performance is taken as the average over the k iterations.
Bootstrapping: Internal validation approach similar to cross validation but relying on random sampling with replacement; each sample is the same size as the model development dataset.
Split Sample: Internal validation approach in which the available development dataset is divided into two datasets: one to develop the model and the other to validate it; the division can be random or non-random.
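To make the metric and validation concepts in Table 1 concrete, the following minimal sketch (our illustration, not drawn from any reviewed study) computes the listed discrimination metrics on a split sample and then estimates the AUC by k-fold cross validation; the synthetic dataset, the scikit-learn usage, and the logistic regression model are all assumptions made for demonstration.

```python
# Illustrative sketch only: synthetic data and a simple classifier are used to
# demonstrate the Table 1 metrics and internal validation strategies.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split-sample internal validation: develop on one part, validate on the other.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Accuracy:   ", (tp + tn) / (tp + tn + fp + fn))
print("Sensitivity:", tp / (tp + fn))          # recall
print("Specificity:", tn / (tn + fp))
print("PPV:        ", tp / (tp + fp))          # precision
print("NPV:        ", tn / (tn + fn))
print("F1 score:   ", f1_score(y_test, y_pred))
print("AUC:        ", roc_auc_score(y_test, y_prob))

# k-fold cross validation: average performance over k train/evaluate rounds.
cv_auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="roc_auc")
print("5-fold CV AUC: %.2f (+/- %.2f)" % (cv_auc.mean(), cv_auc.std()))
```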

Table 2. Examples and conceptualization of the most utilized machine learning algorithms for binary outcome prediction.

Neural Networks (NN or ANN): Artificial neural networks are non-linear algorithms loosely inspired by human brain synapses. Convolutional neural networks, the most commonly applied, comprise input nodes, output nodes, and intervening or hidden layers of nodes, which may number up to 100. Each node within a layer receives two or more inputs and applies an activation and weighting function to produce an output, which serves as the input data for the next layer of nodes.
Support Vector Machine (SVM): SVM is based on the idea of computing a hyperplane that best separates the features into different domains. Its objective is to find a decision boundary (the hyperplane) with the maximum separation between the nearest points of each class, i.e., the support vectors. Kernel functions are used when the data are not linearly separable; by mapping the low-dimensional input data into a higher-dimensional space, the algorithm can operate on non-linear relationships.
k-Nearest Neighbors (k-NN): k-NN assigns classes based on a distance criterion. It compares multidimensional vectors of feature values, computing the distance between the subject of interest and the training examples, and defines the k (number of neighbors) most similar examples as "neighbors", whose labels determine the classification.
Decision Trees (DTs): DT algorithms are organized under a tree-structured modeling approach with conditional control statements establishing a framework of sequential decisions. Internal nodes represent a "test" on an attribute, branches represent the results of that test, and "leaves" represent the decision taken after computing all attributes.
Random Forest (RF): RF is essentially an ensemble of DTs, although it differs from usual DTs by using randomly selected inputs or combinations of inputs at each node to grow each tree rather than a consistent set. This randomness is intended to avoid the overfitting usually present in deep DTs. The random distribution of inputs provides, when averaged, lower error rates in the final output and reduced variance.
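As a hedged illustration of how the Table 2 algorithm families are typically applied to a binary outcome, the sketch below fits each of them to the same synthetic dataset and compares cross-validated AUCs; the dataset and hyperparameter values are arbitrary assumptions, not settings from the reviewed studies.

```python
# Illustrative comparison of the Table 2 algorithm families on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
models = {
    "NN":   MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    "SVM":  SVC(kernel="rbf", random_state=0),    # RBF kernel handles non-linearity
    "k-NN": KNeighborsClassifier(n_neighbors=5),  # 5 nearest neighbors vote
    "DT":   DecisionTreeClassifier(max_depth=4, random_state=0),
    "RF":   RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, clf in models.items():
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: 5-fold CV AUC = {auc.mean():.2f}")
```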
The popularization of studies based on AI methodology led to the development of guidelines to specifically address such reports in medicine [19,20]. A version of the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) Statement with a focus on ML-based studies was recently proposed [21,22]. Goals of such guidelines include the assurance that studies are properly reported, providing the information necessary for replicability, ensuring critical appraisal of ML models, and improving the quality of reporting [22,23].
In the present study, we review the current evidence on the use of ML to predict outcomes after pituitary surgery. Additionally, we assess the completeness of model reporting of the reviewed papers according to the TRIPOD Statement.

Materials and Methods
This systematic review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The review protocol was registered within the International Prospective Register of Systematic Reviews (PROSPERO) database, maintained by the University of York (York, UK) (registration number CRD42021253264).

Literature Search and Studies Selection
The PubMed, Scopus, and Web of Science databases were searched to identify all potentially relevant studies. The following search terms were used: "((machine learning) or (artificial intelligence)) and (pituitary)". Original articles that described using a machine learning approach to study pituitary surgery outcomes published between 1 January 2010 and 31 December 2021 were included.
Subsequently, three authors (M.M.R, A.W. and L.M.F) independently screened each article's titles and abstracts. Disagreements were resolved through a discussion involving all three authors. For all studies deemed relevant, the full papers were reviewed.

Inclusion and Exclusion Criteria
During the full article review process, articles were included based on the following criteria: (1) specific focus on the development or validation of ML models for prediction; (2) specific focus of the model on predicting pituitary surgery outcomes; and (3) presented a ML model as its main prediction tool. Exclusion criteria were the following: (1) review articles; (2) other applications of artificial intelligence; and (3) studies using ML as a diagnostic tool. References from previous studies were also evaluated for the inclusion of additional studies.

Data Extraction
The data extraction protocol, as well as the form used to conduct it, is described in Online Resources 1. Outcomes were stratified into three categories: endocrine outcomes; tumor management or recurrence; or complications. If a study reported more than one model or assessed different outcomes of pituitary surgery in a single publication, data extraction and stratification of this paper in the results section were performed using the model with the highest Area Under the receiver operating characteristic Curve (AUC), as illustrated below.
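A minimal sketch of this stratification rule follows; the DataFrame layout, study identifiers, and AUC values are hypothetical.

```python
# Hypothetical example of the extraction rule: when a study reports several
# models, keep the one with the highest AUC for stratification.
import pandas as pd

extracted = pd.DataFrame({
    "study":   ["A", "A", "B", "C", "C"],
    "outcome": ["endocrine", "complications", "tumor", "endocrine", "endocrine"],
    "auc":     [0.81, 0.84, 0.78, 0.90, 0.88],
})
best_per_study = extracted.loc[extracted.groupby("study")["auc"].idxmax()]
print(best_per_study)
```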

Report Assessment
The TRIPOD Statement, launched in 2015, is a widely accepted EQUATOR Network guideline for prediction model reporting [21,24]. It consists of 22 items considered essential for informative reporting of prediction model studies. It was primarily developed to evaluate regression-based models, but it has also been successfully used to assess and guide the production of reports based on ML models [22]. It is important to mention, however, that differences in terminology are pointed out as one of the barriers to adherence to TRIPOD during the report of ML studies [25].
In this study, we utilized the TRIPOD Adherence Form as well as the instructions for its respective description, with two terms adjusted for the specification of ML models as suggested by Wang et al., to assess the completeness of reporting of ML prediction models [22,26].
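As a sketch of how adherence can be scored from the TRIPOD Adherence Form, the following toy computation (with invented 0/1 item data, not our actual extraction) derives each study's percentage of adhered items and the median and IQR summaries used in the Results.

```python
# Toy adherence scoring: rows are studies, columns are TRIPOD items
# (1 = adhered, 0 = not adhered); values are invented for illustration.
import numpy as np

adherence = np.array([
    [1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 1, 1, 1],
])
per_study = adherence.mean(axis=1) * 100   # % of items adhered per study
print("Median adherence: %.0f%%" % np.median(per_study))
print("IQR: %.0f-%.0f%%" % tuple(np.percentile(per_study, [25, 75])))
```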

Results

Study Selection
A total of 191 studies were retrieved from PubMed, 89 studies from Web of Science, and 145 studies from Scopus, giving a total of 425 articles. In total, 219 duplicate studies were excluded. After abstract and title screening, 53 studies were considered potentially relevant. Seven additional studies from other sources were included at this time. After full-text article screening, 20 studies were selected for data extraction (Figure 1).

Report Assessment
Overall, adherence to TRIPOD among the studies had a median of 65% (IQR = 60-72%), ranging from 43% to 83%. Figure 2 presents the proportions of adhered items across the included studies. The overall reporting of TRIPOD items was particularly low regarding abstract completeness of report, where no article fulfilled the criteria of the TRIPOD Adherence Form. Items concerning the report of title and performance measures (considered as adhered when discrimination with confidence intervals, a calibration measure, and complementary metrics, such as accuracy, were provided) followed as the most underreported aspects, both with 12% average adherence.

Models' Assessment
All models presented AUC measures to assess discrimination. The median reported AUC was 0.84 (IQR = 0.80-0.91). Figure 3 shows the AUC values reported for each of the subgroups included in this review. Moreover, calibration methods were not reported in 15 studies. When reported, the calibration methods used were the Hosmer-Lemeshow test (three studies) [36,45], calibration plot (one study) [36], calibration slope (two studies) [18,38], calibration intercept (two studies) [18,38], and the Brier Score (one study) [39].
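A minimal sketch of the calibration measures named here (the Brier score and the calibration slope/intercept via logistic recalibration of the predicted log-odds) is given below; the data are synthetic, and the intercept shown comes from a joint fit rather than the slope-fixed-at-1 fit some authors report.

```python
# Illustrative calibration assessment on synthetic data: Brier score plus
# calibration slope/intercept from a logistic recalibration model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print("Brier score:", brier_score_loss(y_te, prob))   # 0 = perfect calibration

# Regress observed outcomes on predicted log-odds; a well-calibrated model
# has slope close to 1 and intercept close to 0.
p = np.clip(prob, 1e-6, 1 - 1e-6)        # avoid log(0) at extreme predictions
logit = np.log(p / (1 - p)).reshape(-1, 1)
recal = LogisticRegression(max_iter=1000).fit(logit, y_te)
print("Calibration slope:    ", recal.coef_[0][0])
print("Calibration intercept:", recal.intercept_[0])
```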

Tumor Management and Recurrence
Two studies assessed tumor recurrence as the main outcome [32,44]. Both studies used only radiomics features to build the models. AUCs were 0.78 [32] and 0.96 [44]; however, confidence intervals were not reported for these measures. Sample sizes ranged from 27 [44] to 50 [32] patients. Only one study reported how hyperparameters were defined [44]. Both models used a k-fold CV approach for internal validation. Neither study reported calibration measures. Both studies were conducted in patients with nonfunctioning pituitary adenomas (NFPA).
The use of radiomics approaches was prominent among studies predicting management and recurrence of pituitary tumors, exclusively inputting raw imaging data [32,44]. Zhang et al. described three important features extracted from preoperative MRI and selected by an SVM classifier to compose their ML model to predict post-surgery recurrence in NFPA [32]. Machado et al. also evaluated the prognostic value of MRI radiomics in an ML model to predict recurrence of NFPA after surgery [44]. The most important features selected by a k-NN algorithm to integrate the model were related to parameters of energy, total energy, and non-uniformity, which cannot be detected by the naked eye but represent valuable information to be accessed for prediction purposes [44]. Gross-total resection (GTR) of the tumor after pituitary surgery was the outcome predicted in one study [27] and presented as a secondary outcome in two other studies [38,39] based on structured information (i.e., tabular/spreadsheet data). The algorithms utilized were NN [27], k-NN [39], and a generalized linear model (GLM) [38]. Staartjes and colleagues presented a polarity correlation plot and found that GTR was prominently correlated with the Knosp grade and the ratio between the maximum adenoma diameter and the intercarotid distance in the C4 horizontal segment [27].
Regarding clinical variables, Zhang et al. found that visual disturbance, extrasellar extension, hypopituitarism, and symptoms of sexual hormones were related to persistent/recurrent disease in NFPA [33]. Furthermore, prior surgery was the most important predictor of GTR, while age and Hardy grading were predictors of biochemical remission and cerebrospinal fluid (CSF) leak, respectively, in a study by Zanier et al. [38].
AUC values were 0.96, 0.98, and 0.68, respectively. Sample sizes were 140 [27], 151 [39], and 307 [38] participants. Two of the studies used a k-fold CV [27,38] and the other performed a random split sample to obtain an internal validation group [39]. Calibration was reported by two of the studies (Brier Score [39], calibration slope [38], and calibration intercept [38]). Two studies reported the method to handle missing values (single imputation predictive mean matching [27] and k-NN [38]), although neither reported the missingness distribution across features. Confidence intervals were reported by two of the articles [27,38]. The approach to define hyperparameters was mentioned in one of the studies [39].
Endocrine Outcomes

Definition of endocrine outcomes varied across studies. To forecast the response of functioning pituitary adenomas (FPA) to surgery, Qiao et al. considered acromegaly remission as off-medication GH levels (nadir GH < 0.4 µg/L during an oral glucose tolerance test, and/or random GH < 1.0 µg/L) or normalized IGF-1 (<1) at 6-month follow-up after surgery [18]. Fan et al. defined the endocrine outcome, postoperative remission of GH-secreting FPAs, as random serum GH < 1 ng/mL or a GH nadir < 0.4 ng/mL during an oral glucose tolerance test at 12 weeks after surgical treatment [45]. Two studies investigated CD remission, defining it as morning serum cortisol values falling below 5 µg/dL (138 nmol/L) or 24-h urinary free cortisol (UFC) levels falling below 20 µg (56 nmol) in the 7-day postoperative follow-ups [40,42]. Zoli et al. defined CD postsurgical remission as demonstrated hypersecretion normalization at 1 to 3-6 months after surgery (the first surgery in case of repeated procedures) [39]. Kocak et al. defined response to somatostatin analogues (SAs) in acromegaly after surgery considering patients resistant if GH or age-adjusted IGF-1 levels were still elevated after 6 months of therapy with octreotide (40 mg per 28 days) or lanreotide (120 mg per 28 days) [37]. Finally, Nadezhdina et al. defined their endpoint, CD recurrence, as one of the following: increased evening salivary cortisol level; no suppression of serum cortisol below 50 nmol/L (1.8 µg/dL) during the 1-mg dexamethasone suppression test; increased 24 h urine free cortisol level; increased concentrations and abnormal secretory rhythms of ACTH and cortisol; or clinical recurrence of hypercorticism [41].
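To show how such biochemical criteria translate into a binary label for model training, a small hypothetical helper follows; the thresholds mirror the Qiao et al. criteria summarized above, while the function name and argument units are our assumptions.

```python
# Hypothetical label-generating function for 6-month acromegaly remission,
# following the thresholds summarized above (Qiao et al. [18]).
def acromegaly_remission(nadir_gh_ug_l, random_gh_ug_l, igf1_normalized):
    """True if any biochemical remission criterion is met at 6 months."""
    return (nadir_gh_ug_l < 0.4) or (random_gh_ug_l < 1.0) or (igf1_normalized < 1.0)

# Example: remission via the OGTT nadir criterion despite elevated IGF-1.
print(acromegaly_remission(nadir_gh_ug_l=0.3, random_gh_ug_l=1.5,
                           igf1_normalized=1.2))  # True
```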
Tumor invasiveness, usually presented using Knosp grade, was reported as being among the top three most important variables in the majority of the studies on endocrinological outcomes [35,36,39,40,45]. Tumor size was also of main importance for two studies [39,40]. The post-operative levels of GH were the second most cited among the main important variables reported in the studies [18,35,36]. In addition, ACTH and cortisol were among the most important variables of one study [42].
Regarding clinical variables, Fan et al. found that age, hypertension, ophthalmic disorders, IGF-1, elevated GH, Knosp grade and maximal tumor diameter were associated with endocrine response after surgery in patients with acromegaly [36]. In patients with CD, Zhang et al. found the highest AUC with four variables including cavernous sinus invasion in MRI, first operation, preoperative ACTH, and tumor size [40]; in another study by Liu et al., top predictors for recurrence in this subset of patients were post-operative morning serum cortisol and ACTH nadir, and age [42]. The relevance of cortisol and ACTH levels in prediction models was also confirmed by Nadezhdina et al. [41].
Intra- and Postoperative Complications

Two studies adopted broad criteria defining early complications from pituitary surgery, aiming to predict at least one among a list of several events [28,29]. One of these analyzed more than 15 potential complications (e.g., extended length of stay or stroke) and found that sodium disturbances, age, and body mass index (BMI) were the most influential variables in their model [28]. Muhlestein et al. proposed the prediction of any complication as a secondary analysis, aiming primarily to predict hospitals' total charges in an administrative dataset of almost 15,000 patients [29]. Their model revealed that age, fluid or electrolyte abnormalities, and admission type were the most important variables to predict complications in that sample [29].
Staartjes et al. proposed an ML model to estimate the risk of intraoperative CSF leakage using an NN algorithm [34]. They reported a high suprasellar Hardy grade, prior transsphenoidal surgery, and age as contributing most to the outcome prediction [34]. In an effort to predict suboptimal outcomes (defined as hormonal non-remission or MRI evidence of recurrence/progression despite adjuvant treatment), Shahrestani et al. built an NN model and inputted clinical variables that were significant in a multivariate statistical analysis [46]. The authors found that additional surgery, preoperative visual deficit not improved after surgical intervention, and transient diabetes insipidus increased the odds of suboptimal outcomes [46].
Five studies reported methods to handle missing values. The models were developed on general samples of PA patients (four studies) [28-30,34], on a sample of mixed types of FPAs [46], and on a sample of acromegaly patients [38]. Methods for selection of hyperparameters were reported by three studies [28,30,34]. Calibration techniques were mentioned by one of these studies (calibration slope and calibration intercept) [38]. The median TRIPOD completeness was 62.1% (IQR = 52-63%).

Discussion
This systematic review addressed the quality and breadth of studies using ML methodology to predict outcomes of pituitary surgery. Heterogeneity in model reporting may impact the full understanding of ML's role in outcome prediction for patients with pituitary tumors and makes it challenging to conduct a meta-analysis of existing studies. Nonetheless, interest in the topic has substantially increased in the last decade, which highlights the importance of adequate reporting to maximize the usefulness of this approach in clinical research and patient care.

Clinical Findings
Regarding the prediction of pituitary surgery outcomes by ML methods, an important step toward ensuring use in clinical practice is variable importance analysis. In this review, aspects of tumor invasiveness were mentioned among the top predictors in the majority of the studies, regardless of the classifying system adopted (Table 5). These results agree with a previous review which found that cavernous sinus invasion is the best single predictor of tumor remission [48]. Knosp grade is also mentioned as a good predictor of GTR in previous studies [49,50]. Despite the existence of other tumor invasiveness scales, such as the Hardy grade, these are less used in the actual clinical context [51]. Nevertheless, those tools present limitations, such as allocating patients into large risk groups rather than tailoring to individual characteristics, as well as poor interrater reliability [52].
In addition to measures of invasiveness, endocrinological parameters integrated most of the models (Table 5). Age was the most common demographic variable utilized in the models and the one demographic variable reported with high importance across different studies (Table 5). Externally validated ML algorithms can play a major role in precise risk stratification and in identifying patients who will not likely benefit from surgery or adjuvant therapy [16,49].
Furthermore, the analysis of clinical images through ML algorithms is prominent in ML models to predict pituitary surgery outcomes (Table 5). ML algorithms are trained to mine quantitative imaging features from medical images, looking for patterns between the images and the outcome of interest [53,54]. Fan et al. and Niu et al. presented direct comparisons of their results using ML models inputted with radiomics and clinical features against the predictive power of Knosp grade alone [35,55]. In both cases, the ML-based approaches outperformed the traditional tool. Indeed, the studies from our review that combined radiomic signatures with clinical features and other types of structured data presented better performance than either form (radiomics or structured data) alone.

The open-source availability of any reported model is a good practice in research and contributes to transparency as well as to the presentation of the real value of the developed model for clinical practice. The description of nomograms is one way to make a model useful and valuable in practice. In our review, nomograms were presented in two papers, both carried out by Fan et al. [35,45]. In one of them, the authors presented a nomogram that uses the radiomic signatures obtained using the ML algorithm and the Knosp grade [45]. In the other study, the nomogram was composed of radiomics signature, random GH, IGF-1 standard deviation score, GH inhibition ratio, tumor volume, Knosp grade, tumor consistency, and P53 value [35]. Three studies provided access to their models, deploying them as web-based applications.

Report Assessment
As measured by the TRIPOD, the rates of report completeness were suboptimal for several items of the overall assessment. However, certain TRIPOD items are significantly more important to ensure research utility and quality than others. For instance, although only one article showed completeness of reporting in the Title and Abstract (Items 1 and 2, respectively), the lack of information on how missing data were handled and how the models were calibrated has a greater impact on reviewers' ability to assess the quality of these studies.
Calibration measures were reported by only three studies, which demonstrates a potential for improvement in future projects. Calibration is used to assess the reliability of a given model's risk predictions. Good calibration implies that the risk predicted for a person with specific feature values matches the proportion of all people in the population with similar feature values who experienced the event [11]. Therefore, good discriminative performance as described by the AUC is not, by itself, enough to provide a critical appraisal of the model and, consequently, not enough to properly guide clinical decisions. To make this possible, both a discrimination measure (e.g., AUC) and a calibration measure (e.g., Brier score) should be presented [56]. The lack of information about the latter can lead to misinterpretation of a given ML model, lower clinical usefulness, compromised potential for external validation by others, and unnecessary risk to patients.
Information on how the hyperparameters of the final models were defined was mentioned in 10 studies [28,30,34-36,39,40,42,44,45]. Hyperparameter settings significantly interfere with the final performance of the prediction model [57]. The most common approach utilized in the studies for hyperparameter selection was Grid Search CV, a method that iteratively tests all potential values for hyperparameters and chooses those yielding the highest value of the metric of interest (e.g., AUC, F1-Score, or accuracy). This is also the method most commonly reported in the literature, although it is not always an ideal choice, given the chances of overfitting training datasets [58]. In addition, even the same model algorithm often needs different hyperparameter settings when training on different datasets during out-of-sample validations. For instance, in deep learning (DL) models, hyperparameters such as the number of layers or the dropout rate can dramatically affect the performance of an NN algorithm [57,58]. Publishing the algorithm code, including the exact hyperparameters utilized, allows for a rigorous assessment of the model and prevents redundant research from being undertaken, as sketched below.
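A minimal sketch of the Grid Search CV procedure described here, assuming scikit-learn, synthetic data, and an arbitrary random forest grid:

```python
# Grid Search CV sketch: every combination in the (arbitrary) grid is
# evaluated by cross validation and the best-AUC setting is retained.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
grid = {"n_estimators": [100, 300], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print("Best hyperparameters:", search.best_params_)
print("Best CV AUC: %.2f" % search.best_score_)
```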
Only two studies presented external validation [18,38]. External data are critically important to assess real-world performance, since they can measure performance losses and provide insight about biases in some step of the model's development. External validation is recommended to be performed at a different time (temporal validation) or location (geographical validation) from the original dataset which derived the initial ML model. Every model with only internal validation is marked by the idiosyncrasies of the original population and may thus perform poorly in others. This is true for a wide range of factors, including changes in policies, practice, and demographics [59,60].

Methods for handling missing values were fully reported by three of the studies [29,42,46]. When the handling of missing data was mentioned but not fully reported, it was usually due to not reporting the number of missing values, the variable where the imputation was performed, or the number of imputations, an important factor for the reliability of a model. Of the studies that explicitly described the method used to replace the missing data, only 10 reported the approach satisfactorily [27-30,34,36,40-42,46]. When data are missing at random, multiple imputation is usually considered superior to single imputation and to complete case analysis: it preserves the natural variability of the missing values and retains more useful information, respectively [61,62]. Within our results, only one study reported a form of multiple imputation [46], a strategy contrasted in the sketch below.
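The following hedged sketch contrasts the three missing-data strategies just discussed, approximating multiple imputation with several stochastic runs of scikit-learn's IterativeImputer; the synthetic data, column names, and missingness pattern are assumptions.

```python
# Contrast of missing-data strategies on synthetic data: complete case
# analysis, single (mean) imputation, and a multiple-imputation-style
# approach using stochastic IterativeImputer runs.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["age", "gh", "igf1"])
df.loc[rng.choice(100, 15, replace=False), "gh"] = np.nan  # inject missingness

complete_case = df.dropna()                                # discards 15 rows
single = SimpleImputer(strategy="mean").fit_transform(df)  # one fixed value

# Several stochastic imputations approximate multiple imputation and
# preserve the uncertainty about the missing values.
imputations = [IterativeImputer(sample_posterior=True, random_state=s)
               .fit_transform(df) for s in range(5)]
first_missing_row = np.where(df["gh"].isna())[0][0]
print("Rows kept by complete case analysis:", len(complete_case))
print("Across-imputation SD of one imputed GH value:",
      np.std([imp[first_missing_row, 1] for imp in imputations]))
```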

ML versus Traditional Statistical Methods
Despite the exponential growth of AI research in medical areas during the last two decades, the real advantage of ML over traditional statistical methods such as regression analyses remains in question. A systematic review conducted by Christodoulou et al. showed that discriminative measures of ML models to predict clinical risk, compared with logistic regression, were significantly higher only in comparisons with a high risk of bias and similar in comparisons with a low risk [63]. A common rationale for the development of ML models among the studies reviewed above was the capability of ML to identify and handle nonlinear interactions, which traditional methods handle less well. Other authors report unsupervised ML's potential to analyze large, unorganized, and highly complex amounts of information, channeling the potential of big data to create prediction models [64].
There is more evidence of ML outperforming traditional models in neurosurgery, as reviewed by Azimi et al. regarding applications of NNs [65]. Specifically, this advantage was also reported in studies about pituitary-related ML applications [17,47]. When reporting the performance of prediction models on sellar diseases, Qiao reported a higher predictive power of ML algorithms compared to conventional regression methods but acknowledged concerns about the models, such as the fact that ML methods are more time- and data-consuming compared to traditional statistics and less effective in several cases [47].
Another important difference between ML and traditional statistics lies in the interpretability of each predictor and of the final model. While traditional statistics can offer concrete mathematical rationales between inputs and outputs, and consequently optimal explicability, ML is often labeled as a "black box". That is, even with full knowledge of all of a model's inputs and outputs, the internal decision-making process cannot feasibly be reconstructed. Some authors describe this phenomenon as a trade-off between performance and explicability, where one important aspect is sacrificed to obtain an optimal outcome in the other [66]. In 2018, the European Union pioneered inserting in its General Data Protection Regulation that "meaningful information about the logic involved" in all decisions made by artificially intelligent systems should be provided [67]. This "right to explanation" has grounded a movement in favor of explainable AI models, which advocates that, even with extremely high metrics, when choosing between inherently complex models and simpler ones (e.g., Decision Trees or Random Forest) that provide interpretability, the latter should be taken [68].
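As a brief illustration of this contrast, a shallow decision tree can be printed as human-readable rules (a minimal sketch assuming scikit-learn and synthetic features), something a deep ensemble or neural network cannot offer directly:

```python
# A shallow decision tree exposes its full decision logic as readable rules.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["f0", "f1", "f2", "f3"]))
```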
Some solutions have been proposed to address the explicability issue in ML. An innovative approach for assessing variable importance robustly, which has recently reached wide use, is the Shapley additive explanation (SHAP) method, reported as an explainer for ML models by Lundberg and colleagues [69]. Originally developed in the context of game theory as a way to find theoretically optimal solutions for cooperative games, SHAP values assign theoretically optimized distributions of the total risk of a given outcome to the individual model features [70]. In game theory, this is analogous to assigning each player on a team a ranked value for their contributions towards the team's overall outcome. Nevertheless, even with potential solutions to the interpretability issue inherent to ML, there is no current consensus about a reliable metric or tool to assess the quality or accuracy of these explanations [68].
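A minimal usage sketch of SHAP follows, assuming the third-party shap package (with matplotlib) is installed and using a random forest on synthetic data:

```python
# SHAP sketch: assign each feature an additive contribution to every
# individual prediction, then summarize global importance.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)    # tailored to tree ensembles
shap_values = explainer.shap_values(X)   # per-feature, per-sample contributions
shap.summary_plot(shap_values, X)        # global importance ranking (needs matplotlib)
```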

ML-Specific Reporting Guidelines
It is expected that a best-practices culture regarding all the steps towards clinical implementation of ML models will be promoted and encouraged by adherence to ML-specific protocols and statements. To illustrate the potential of guidelines, the reporting of clinical trials improved significantly in quality after the release of CONSORT and SPIRIT, particularly once adherence to them became mandatory among peer-reviewed journals [71,72]. Moreover, a crucial milestone for successfully implementing "good practices in ML modelling" also depends on establishing those standards as a mandatory requirement for further ML-model publication by peer-reviewed journals in the medical area.

Future Perspectives
To date, pituitary surgery has received less exploration than other neurosurgical entities regarding ML modeling. Other potentially relevant approaches may be pursued, particularly concerning the use of radiomics as a part of the development of new algorithms. Innovative applications such as the use of intraoperative MRI may present a pathway to clinical significance. Particular conditions, e.g., acromegaly, may benefit from future original studies and reviews scrutinizing surgical outcome predictions and aspects such as diagnosis (e.g., facial recognition) or response to medical therapy.

Strengths and Limitations
This systematic review has inherent limitations. First, the data are substantially heterogeneous across the studies, limiting further comparison between the studies or meta-analytic approaches. Second, this review focused only on ML models predicting pituitary surgery outcomes and analyzed the reporting quality of the respective studies; thus, our review cannot comment on the performance of traditional statistical methods. Overall, the evidence is limited by the lack of transparency in the reporting of the studies. This body of literature could also benefit from a formal assessment of the risk of bias of published studies, for example, with the use of PROBAST (Prediction model Risk Of Bias Assessment Tool) [73]. The use of TRIPOD-AI guidelines may facilitate a more comprehensive reporting of ML model development methods in future publications. This review also has several strengths. Firstly, the review was performed under two guidelines, the PRISMA checklist and the TRIPOD Adherence Form, aiming for consistency and transparency. We provided the rationales for, and the importance of, some of the most poorly reported TRIPOD items, which can enhance and provide insight for further reviews as well as for future development and validation of ML models. To the best of our knowledge, this is the first systematic review to include assessment of report completeness in regard to ML in neurosurgery. Finally, this review provides a comprehensive account of the use of ML methods to predict patient outcomes after pituitary surgery.

Conclusions
Applications of ML in the prediction of pituitary outcomes are still nascent. Even though the articles presented in this review have a broad range of applications on pituitary surgery, current data suggest that there is an area of opportunity for improving the quality of ML model reporting. The use of reporting guidelines should be encouraged, mainly by peer-reviewed journals. The release of TRIPOD-AI is expected to address this need and contribute to ML research applied to healthcare predictions.

Institutional Review Board Statement: Ethical review and approval were waived because this study is a systematic review.

Informed Consent Statement: Not applicable.
Data Availability Statement: Data available on request due to privacy and ethical restrictions.

Conflicts of Interest:
The authors have no financial or proprietary interest in any material discussed in this article.