Application of Machine Learning to Ranking Predictors of Anti-VEGF Response

Abstract Age-related macular degeneration (AMD) is a heterogeneous disease of the macula and a cause of irreversible vision loss. Patients with neovascular AMD (nAMD) are candidates for anti-vascular endothelial growth factor (anti-VEGF) treatment, which is designed to regress the growth of abnormal blood vessels in the eye. Some patients, however, fail to maintain vision despite treatment. This study aimed to develop a prediction model based on features weighted in order of importance with respect to their impact on visual acuity (VA). Evaluations included an assessment of clinical, lifestyle, and demographic factors from patients who were treated over a period of two years. The methods included mixed-effects and relative importance modelling, and models were tested against model selection criteria, diagnostic and assumption checks, and forecasting errors. The most important predictors of an anti-VEGF response were the baseline VA of the treated eye, the time (in weeks), the treatment quantity, and the treated eye. The model also ranked the impact of other variables, such as intra-retinal fluid, haemorrhage, pigment epithelium detachment, the treatment drug, the baseline VA of the untreated eye, and various lifestyle and demographic factors. The results identified variables that could be targeted for further investigation in support of personalised treatments based on patient data.


Introduction
Research in age-related macular degeneration (AMD) can be traced back as far as 1855, according to published accounts [1,2]. For example, Donders described one of the earliest cases of AMD using microscopy and post-mortem data [2]. He noticed obliquely orientated rods accommodating small drusen and discovered that the rods and cones were missing above the drusen. These drusen were rarely absent in the eyes of aged individuals, especially those from 70 to 80 years of age. Despite many years of research into possible treatments, AMD remains a progressive, chronic, and degenerative eye disease that is most prevalent in the aging population (i.e., 50 years or older) [3,4]. It is not only one of the leading causes of central and irreversible vision loss, but affected patients are at risk of developing legal blindness [5][6][7]. AMD manifests as a result of a sub-clinical inflammatory process [8] that is characterised by damage to or loss of the photoreceptors (i.e., cells which respond to light) and the retinal pigment epithelium (RPE; i.e., a support system for photoreceptor cells that delivers essential nutrients, such as oxygen, and clears cellular debris) within the macular region [9,10].
Due to the rapid growth of the aging population, the prevalence of AMD is increasing at a significant rate [11] and is predicted to reach 288 million by 2040 [6,12,13]. Visual impairment poses a considerable global health and economic burden due to increasing life expectancy and a growing cohort of older adults, and estimates of the global cost of AMD-related vision loss are substantial. Machine learning models have been applied in ophthalmology but have limited explainability [37][38][39]. A challenge to machine learning is to develop models that are not black box in nature but incorporate explainability in their predictions. In this study, a machine learning approach was developed that incorporates statistical features and metrics to produce a degree of explainability.
Potential predictor variables can be ranked by weights in order of importance using the relative importance of variables (RIV) method. Predictors with larger weights are considered the most important, while those with smaller weights are considered the least important [40]. This paper has two objectives: (1) to apply machine learning to develop mathematical models to predict vision outcomes for anti-VEGF-treated AMD patients; and (2) to rank the variables that are available to the ophthalmologist in order of importance (i.e., from the largest to the smallest weights). The best models were selected based on model selection techniques, along with diagnostic and forecasting evaluations. The aim was to develop a prediction model that includes the features most responsible for treatment response and optimises prediction accuracy.

Study Design
A retrospective analysis was conducted as a case study using anonymised data from patients who attended the retina clinics at the Royal Victorian Eye and Ear Hospital (RVEEH). The study was conducted in accordance with the International Conference on Harmonisation Guidelines for Good Clinical Practice and the tenets of the Declaration of Helsinki. Ethics approval was provided by the Human Research Ethics Committee of the RVEEH (HREC Project No. 95/283H/15). Written informed consent was obtained from all participants.

Patient Data
The patient dataset consisted of 150 treatment-naïve eyes from patients >50 years of age who were diagnosed with subfoveal CNV secondary to AMD and who had attended the RVEEH between 2006 and 2010. Clinical diagnoses were based on a retinal examination, fundus photography, fundus fluorescein angiography, and time-domain optical coherence tomography (OCT) with Stratus OCT version 5.0.1 (Carl Zeiss Meditec, Dublin, CA, USA) or Cirrus HD-OCT version 6.0.0.599 (Carl Zeiss Meditec). VA scores were obtained using the early treatment diabetic retinopathy study (ETDRS) chart performed at 4 m. The presence of intra-retinal fluid (IRF), sub-retinal fluid (SRF), macular thickness, macular scar, atrophy, and haemorrhage was analysed using OCT. Results were collated at baseline and at the three-, six-, twelve-, and twenty-four-month treatment intervals.
Patients with CNV secondary to non-AMD conditions, such as angioid streaks, severe myopia, central serous retinopathy, or hereditary retinal disorders, and those who received any previous treatment for nAMD, such as an anti-VEGF, photodynamic therapy, or laser photocoagulation were excluded.

Data Format
The time-series data followed the treatment schedule and clinical manifestations of all 150 eyes over the course of a two-year treatment. The dataset included general demographic information, such as age, gender, and ethnicity, along with several clinical variables (Table S1, Supplementary Materials). We identified whether each variable was binary, categorical, or continuous as part of our exploratory analysis.
The data were initially presented in the "wide" format, which contained approximately 156 variables across all 150 eyes. The data were converted into a "long" format, amalgamating variables recorded at multiple time points into a single variable. For example, rather than having five variables for the VA (at baseline and at three, six, twelve, and twenty-four months), a single VA variable with a time variable as a reference was used.
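As an illustration of this wide-to-long reshaping, a minimal Python sketch might look like the following (the study itself used R; the column names, IDs, and VA values here are hypothetical):

```python
# Each wide record holds one eye's VA at five time points; the long format
# emits one row per (eye, time) pair with a single "va" column and an
# explicit time variable as the reference.

WIDE = [
    {"id": 1, "va_base": 55, "va_3m": 60, "va_6m": 62, "va_12m": 64, "va_24m": 63},
    {"id": 2, "va_base": 48, "va_3m": 50, "va_6m": 49, "va_12m": 51, "va_24m": 52},
]

# Map each wide column to the time (in months) it represents.
TIME_MAP = {"va_base": 0, "va_3m": 3, "va_6m": 6, "va_12m": 12, "va_24m": 24}

def to_long(wide_rows):
    """Amalgamate the per-time-point VA columns into one VA variable."""
    long_rows = []
    for row in wide_rows:
        for col, month in TIME_MAP.items():
            long_rows.append({"id": row["id"], "month": month, "va": row[col]})
    return long_rows

long_data = to_long(WIDE)  # 2 eyes x 5 time points -> 10 long rows
```

The same reshaping is performed in R with functions such as `reshape` or `melt`; the point is simply that time becomes an explicit variable rather than being encoded in the column names.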

Treatment Protocol
Patients were treated with either ranibizumab or bevacizumab, with most receiving ranibizumab; bevacizumab was used occasionally for the first injection whilst awaiting approval for the subsidised use of ranibizumab (aflibercept was not available at the time). A total of 140 patients were treated in either the left eye (LE) or the right eye (RE), and five patients had both eyes treated. All patients received 3 initial monthly injections followed by a flexible (as required) period. The decision to re-treat in the flexible period was at the discretion of the treating retinal specialist at each follow-up visit, on the basis of re-treatment criteria including a VA loss of 5 letters, an increase in central retinal thickness of 100 µm, the presence of retinal fluid (intraretinal or subretinal) on OCT, or ophthalmic examination findings of new or persistent haemorrhage. A 2-week extension to the subsequent clinic visit was considered if the clinical situation was stable and the OCT was free of intra-retinal or sub-retinal fluid. This evolved into a treat-and-extend protocol in the latter half of the time period, whereby, if the patient showed no signs of activity, the time between injections was extended by two weeks. Individuals with persistent signs of activity continued to receive monthly injections.

Statistical Analysis
All statistical analyses were run using the statistical software R version 3.2.2 [41]. The null hypothesis for the RIV analysis was that the parameter estimates for all variables were identical and had the same level of importance in their contribution to vision outcomes in anti-VEGF-treated AMD patients.

Modelling Mixed-Effects
Mixed-effects models are used to describe relationships between response and predictor variables in data that are grouped based on one or more classifications [42]. Mixed-effects models explicitly specify the mean and covariance structure, incorporating two types of parameters: fixed and random effects [43,44]. Fixed effects refer to predictors that affect a response variable. Random effects, however, refer to effects on a response variable generated by variation within and among the levels of a predictor variable [43]. Population structure is the fixed effect in a mixed-effects model, while relatedness among individuals is incorporated as a variance-covariance structure of the random effect [45]. Mixed-effects models have gained considerable popularity and are considered useful for the analysis of longitudinal data, the modelling of complex clustered data, penalised log-likelihood methods, and more [31,46]. There are advantages to using mixed models in medical applications.
A medical study may be carried out at multiple locations, clinics, or hospitals, and therefore, medical data may often be clustered. The design of a medical study may be described as hierarchical and wider inferences can be made by fitting the clustering effect as a random effect. Repeated measurements are also common in medical studies, and it is not uncommon for several observations to be missing. The advantage of using a mixed-effects model is that it makes allowance for missing data and hierarchical clustering [47].
The RVEEH dataset is from a longitudinal study and consists of repeated observations of individual subjects over a time series. The research interest lies in the effects that are common and different among all individuals in the study [48]. The mixed-effects model allows the capture of among-subject variation. The benefit of mixed-effects modelling is that it assists in explaining variability in the patient response to anti-VEGF treatment and helps to identify other factors that may contribute to the treatment response.
Linear mixed-effects models are an extension of regular linear models. Traditional linear models use only a single random term, the residual error. A linear mixed-effects model allows the specification of more than one random term [49], a useful feature, as it is more accurate to think of an effect coming from a specific normal distribution rather than that of a fixed value [50].
With N independent sampling units (i.e., the patients), the linear mixed-effects model for the ith person may be written as follows:

Y_i = X_i β + Z_i u_i + ε_i,

where Y_i represents the response variable for the ith person, X_i is an n_i × p design matrix for the p-vector of fixed effects β, and Z_i is an n_i × q design matrix associated with the q-vector of random effects u_i that represent subject-specific regression coefficients. The error term, ε_i, is assumed to be normally distributed with a mean of zero and to be independent of the random effects [51]. The use of linear mixed-effects models counters multiple drawbacks that are normally associated with traditional random-effects modelling, such as [52]: (a) deficiencies in statistical power with the use of repeated observations; (b) lack of adaptability in dealing with missing data; (c) disparate methods for treating continuous and categorical responses; (d) unproven methods for modelling heteroscedasticity and non-spherical error variance.
There are multiple measurements for each subject; thus, we need to incorporate random effects into the model to account for the variation in outcomes. To account for within-subject dependencies, a subject-specific latent variable (i.e., a random effect) must be included in the model. Typically, an additional random effect is included for each regression coefficient that is expected to vary among the subjects. For example, in dose-response settings, one may account for baseline heterogeneity through a random intercept and for heterogeneity in susceptibility through a random slope, with these two factors potentially correlated [53]. To account for this heterogeneity, the random effect used across all our tested models included time (in weeks) and subject, represented in the analysis as (time|subject). The random effect for subject accounts for the random intercept, and the random effect for time accounts for the random slope. Software for data analytics was developed for this project, and functionality for linear mixed-effects models was sourced from the work of Bates and Maechler, as maintained by Ben Bolker [54].
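The structure Y_i = X_i β + Z_i u_i + ε_i can be illustrated numerically. The following Python sketch (all values hypothetical; the study's models were fitted in R) builds X_i and Z_i for a single subject with a random intercept and a random slope on time, mirroring the (time|subject) specification:

```python
# Minimal numeric sketch of Y_i = X_i*beta + Z_i*u_i + eps_i for one subject
# with n_i = 3 visits, p = 2 fixed effects (intercept, time) and q = 2 random
# effects (random intercept and random slope on time). Numbers are hypothetical.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

weeks = [0.0, 12.0, 24.0]
X_i = [[1.0, t] for t in weeks]   # fixed-effects design: intercept + time
Z_i = [[1.0, t] for t in weeks]   # random-effects design: (time | subject)

beta = [50.0, 0.25]               # population intercept and slope (letters/week)
u_i = [4.0, -0.05]                # this subject's deviations from beta
eps_i = [0.5, -0.3, 0.1]          # residual errors, assumed N(0, sigma^2)

# The subject's observed responses combine population trend, subject-specific
# deviation, and residual error at each visit.
Y_i = [f + r + e for f, r, e in zip(matvec(X_i, beta), matvec(Z_i, u_i), eps_i)]
```

Here the random intercept (4.0) shifts this subject's whole trajectory, while the random slope (−0.05) lets the subject's rate of change differ from the population slope.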

Measure of Outcome
For our response variable, we preferred the use of the follow-up VA measurements for both the LE and RE as the outcome/dependent variable (i.e., Y_i). All remaining variables, including the baseline VAs, were considered potential predictors. The responses were additionally divided into LE and RE.
Change scores (i.e., post-treatment outcomes minus pre-treatment measurements) are sometimes used in place of follow-up scores as a way of accounting for chance imbalances at baseline between treatment groups. Baseline imbalances can include factors such as age or disease severity; they can occur due to (i) a true biological variability within the individual, (ii) a measurement error, or a combination of the two [55,56], and these imbalances are subject to regression to the mean [57]. While it may seem intuitive to use change scores to control for any chance imbalances at baseline, as outcomes may occur due to regression to the mean, we opted to use follow-up scores instead.

Model Selection
Information criteria, such as the Akaike Information Criterion (AIC) and the Schwartz or Bayesian Information Criterion (BIC), were used in the model selection process. Although a plethora of information criteria are available for model comparison, they are modifications or generalisations of the AIC or BIC [58]. The AIC and BIC criteria are defined as [59]:

AIC = −2 log L(θ̂) + 2p,

BIC = −2 log L(θ̂) + p log(n),

where log L(θ̂) is the maximised log-likelihood of a model with p parameters fitted to n observations. For two nested models, 2(ℓ(θ̂₂) − ℓ(θ̂₁)) is the likelihood ratio test statistic, which is asymptotically distributed as χ² with p₂ − p₁ degrees of freedom. AIC and BIC theories have the same objective: to find the best model via comparison. However, each theory has a different motivation. While AIC compares models using a measure of similarity in the expected predictive performance, BIC compares the probabilities that each of the models tested is the true model [58].
The main idea behind the selection criteria is to compare models based on their maximised log-likelihood value while penalising for the number of parameters. The model with the smallest AIC or BIC value is deemed the best [60]. In addition to having the smallest AIC and BIC values, the chosen model needs to provide a good fit to the data, assessed using R², also known as the coefficient of determination, which relates to the impact of the predictor variable X [61]. Values for R² range from 0 ≤ R² ≤ 1, with values closer to 1 indicating a better fit.
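The two criteria are straightforward to compute from a fitted model's log-likelihood. A small Python sketch with hypothetical log-likelihoods shows the penalisation at work:

```python
import math

def aic(log_lik, p):
    """Akaike Information Criterion: -2 log L + 2p (smaller is better)."""
    return -2.0 * log_lik + 2.0 * p

def bic(log_lik, p, n):
    """Bayesian Information Criterion: -2 log L + p log(n) (smaller is better)."""
    return -2.0 * log_lik + p * math.log(n)

# Hypothetical maximised log-likelihoods: model B adds three parameters
# for almost no gain in likelihood, so both criteria prefer model A.
n = 150
aic_a, bic_a = aic(-512.4, 5), bic(-512.4, 5, n)
aic_b, bic_b = aic(-511.9, 8), bic(-511.9, 8, n)
```

Note that BIC's penalty, p log(n), exceeds AIC's 2p whenever n > e² ≈ 7.4, so BIC tends to favour more parsimonious models on datasets of this size.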
For mixed-effects models, R² can be categorised into two types: marginal R² and conditional R². Marginal R² accounts for the variance explained by the fixed factors:

R²_marginal = σ²_f / (σ²_f + Σ_{l=1}^{u} σ²_l + σ²_e + σ²_d),

and conditional R² is concerned with the variance explained by both the fixed and random factors [27]:

R²_conditional = (σ²_f + Σ_{l=1}^{u} σ²_l) / (σ²_f + Σ_{l=1}^{u} σ²_l + σ²_e + σ²_d),

where σ²_f = the variance calculated from the fixed-effects component; u = the number of random factors in the model; σ²_l = the variance component of the lth random factor; and σ²_e + σ²_d = the sum of an additive dispersion component and the distribution-specific variance.
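Given the variance components, both quantities reduce to simple ratios. A minimal sketch with hypothetical variance components (folding σ²_e + σ²_d into a single residual term):

```python
def r2_marginal(var_fixed, var_random_sum, var_resid):
    """Variance explained by the fixed effects alone."""
    total = var_fixed + var_random_sum + var_resid
    return var_fixed / total

def r2_conditional(var_fixed, var_random_sum, var_resid):
    """Variance explained by fixed and random effects together."""
    total = var_fixed + var_random_sum + var_resid
    return (var_fixed + var_random_sum) / total

# Hypothetical components: sigma2_f = 4, sum of random-factor variances = 2,
# additive dispersion + distribution-specific variance = 2.
m = r2_marginal(4.0, 2.0, 2.0)     # 4 / 8 = 0.50
c = r2_conditional(4.0, 2.0, 2.0)  # 6 / 8 = 0.75
```

The gap between the conditional and marginal values (here 0.25) is the share of variance attributable to the random effects alone.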

Model Diagnostics
Once a suitable model has been identified and fitted, the key assumptions of the model can be tested. These assumptions include (i) linearity, (ii) homoscedasticity or constancy of the error variance, and (iii) normality of the errors. Discrepancies between the assumed model and data can be identified by studying the residuals (also known as the error component). The residuals represent the differences between observed and predicted values for the assumed model. Visual aids, such as residual plots, help identify whether the assumptions of the model have been satisfied. Typically, a good residual plot would be one with an even horizontal distribution of residuals or symmetry; whereas those that contain distinguishable patterns, such as being clustered to one side of the plot, usually indicate a violation of the model assumption and warrant a further review of the model (e.g., appropriate transformation of dependent or independent variables) [62]. Normal probability plots additionally allowed us to determine the fit of our model.
Using both residual plots and normal probability plots, we could identify any unusual or outlying observations based on large deviations in the observed Y values from that of the fitted line. Inferences drawn from the model can be potentially influenced by only a few cases in the data. The fitted model may reflect the unusual characteristics of those cases rather than the overall relationship between the dependent and independent variables [63].
Influence analysis consists of investigating whether observations (or a group of observations) are given disproportionate importance in the model estimation. The simple inclusion or exclusion of an influential case may lead to substantially different regression estimates [64]. DFBETAS is a standardised measure that indicates the level of influence observations have on single parameter estimates [65]. For mixed-effects models, this relates to the influence of a higher-level unit on the parameter estimates. DFBETAS is calculated as the difference in the parameter estimate between the model including and the model excluding the higher-level case, with this difference divided by the standard error of the parameter estimate excluding the higher-level case [66]:

DFBETAS_ij = (γ̂_i − γ̂_i(−j)) / se(γ̂_i(−j)), (6)

in which i refers to the parameter estimate and j to the higher-level group, so γ̂_i represents the original parameter estimate i, and γ̂_i(−j) represents the estimate of parameter i after the higher-level group j has been excluded from the data. We used the influence.ME package in R to run these analyses [66]. As a rule of thumb, the cut-off value for DFBETAS is given as [67]:

2 / √n,

in which n is the number of observations under evaluation. Values exceeding this cut-off are regarded as potentially influencing the regression outcomes for that specific estimate.
As DFBETAS provides a value for each parameter and for each higher-level unit that is evaluated, this can result in a large number of values to review. An alternative method for identifying influence is Cook's distance, which provides a summary measure of the influence that a higher-level unit exerts on all parameter estimates simultaneously. A formula for Cook's distance is [66]:

D_j = (1/p) (γ̂ − γ̂(−j))′ Σ̂⁻¹ (γ̂ − γ̂(−j)),

where γ̂ represents the vector of the original parameter estimates, γ̂(−j) the parameter estimates of the model excluding the higher-level unit j, Σ̂ the covariance matrix, and p the number of parameter estimates. As a rule of thumb, cases are regarded as potentially influential if the associated value for Cook's distance exceeds the cut-off value of [68]:

4 / n,

where n refers to the number of groups in the grouping factor under evaluation.
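To make the delete-and-refit logic concrete, here is a Python sketch that computes DFBETAS for the slope of a simple linear regression. The study itself used influence.ME on mixed models with higher-level units; this simplified ordinary-regression version, with hypothetical data, only illustrates the leave-one-out calculation and the 2/√n cut-off:

```python
import math

def slope_and_se(xs, ys):
    """OLS slope and its standard error for simple regression y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_b = math.sqrt(sse / (n - 2) / sxx)
    return b, se_b

def dfbetas_slope(xs, ys):
    """DFBETAS for the slope: (b - b(-j)) / se(b(-j)) for each deleted case j."""
    b_full, _ = slope_and_se(xs, ys)
    scores = []
    for j in range(len(xs)):
        xs_j = xs[:j] + xs[j + 1:]
        ys_j = ys[:j] + ys[j + 1:]
        b_j, se_j = slope_and_se(xs_j, ys_j)
        scores.append((b_full - b_j) / se_j)
    return scores

# Hypothetical data: y is roughly 2x, except case 9 is a gross outlier.
xs = [float(i) for i in range(10)]
ys = [2 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]
ys[9] = 2 * 9 + 15  # outlier

scores = dfbetas_slope(xs, ys)
cutoff = 2 / math.sqrt(len(xs))  # rule-of-thumb threshold, 2/sqrt(n)
influential = [j for j, s in enumerate(scores) if abs(s) > cutoff]
```

Deleting the outlier changes the slope estimate far more, relative to its standard error, than deleting any well-behaved case, so only it is flagged as overly influential.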
To test for changes in statistical significance, we employed the sigtest() function. This is used to test for changing levels of significance after the deletion of each of the potentially influential data points identified using DFBETAS. For the Cook's distance, we carried out similar functions using the exclude.influence() function. While there could be many potentially influential points, those that created statistically significant changes upon deletion were considered overly influential.

Prediction Accuracy
Past data allow the identification of a pattern that can be extrapolated or extended into the future in order to prepare a prediction or forecast. Forecasting techniques rely on the assumption that the patterns identified in the past will continue in the future; good predictions cannot be expected unless this assumption is valid. Forecasting is also subject to uncertainty: an irregular component may be substantial and cause fluctuations in the data. Hence, we reviewed the forecasting errors in an attempt to ascertain whether an irregular component was so large as to completely invalidate any forecasting technique, or whether the forecasting technique used was not capable of accurately predicting the trend, seasonal, or cyclical components of the data, thus rendering the technique inappropriate [69].
The first metric used to assess forecast quality is the mean error (ME), which is simply the average of the past errors between the n observed and forecast values:

ME = (1/n) Σ_{t=1}^{n} e_t,

where we used the following notation [70]: e_t = Y_t − Ŷ_t is the forecast error at time t; Ŷ_t = the forecast value generated in period t (i.e., the fitted/predicted value); and Y_t = the observed value at time t.
The ME metric reveals whether the forecasting process, on average, tends to under-forecast (i.e., ME would be positive) or over-forecast (i.e., ME would be negative); it is, in fact, a metric of bias. We, therefore, needed other metrics of forecast accuracy that could capture the proximity between the predictions produced using our model and the actual observed values.
The first metric of forecast accuracy is the mean absolute deviation (MAD). MAD uses the absolute error to ensure that negative and positive errors do not cancel when averaged:

MAD = (1/n) Σ_{t=1}^{n} |e_t|.

The second metric of forecast accuracy is the root mean square error (RMSE); this measure squares the errors so that positive and negative errors do not offset when summed. The RMSE is similar to the standard deviation (except that the deviations are not around the mean value):

RMSE = √((1/n) Σ_{t=1}^{n} e_t²).

The previous metrics are measured in the same units as the data and are not scale-independent. The normalisation of accuracy requires expression as a proportion or percentage. The metrics which accommodate this are the mean percentage error (MPE) and the mean absolute percentage error (MAPE), which measure percentage bias and percentage accuracy, respectively.
Our objective was to find a model that would have a prediction error rate of less than 10% (i.e., our prediction accuracy was not off by more than 10%):

MPE = (100/n) Σ_{t=1}^{n} (e_t / Y_t),

MAPE = (100/n) Σ_{t=1}^{n} |e_t / Y_t|.

As these measures are percentages, no further scaling is required and interpretation is straightforward [69].
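All five metrics follow directly from the forecast errors e_t = Y_t − Ŷ_t. A short Python sketch with hypothetical observed and forecast values:

```python
import math

def forecast_metrics(observed, predicted):
    """ME, MAD, RMSE, MPE and MAPE for paired observed/forecast values."""
    n = len(observed)
    errors = [y - f for y, f in zip(observed, predicted)]     # e_t = Y_t - Yhat_t
    me = sum(errors) / n                                      # bias
    mad = sum(abs(e) for e in errors) / n                     # absolute accuracy
    rmse = math.sqrt(sum(e * e for e in errors) / n)          # penalises large errors
    mpe = 100.0 * sum(e / y for e, y in zip(errors, observed)) / n        # % bias
    mape = 100.0 * sum(abs(e / y) for e, y in zip(errors, observed)) / n  # % accuracy
    return {"ME": me, "MAD": mad, "RMSE": rmse, "MPE": mpe, "MAPE": mape}

# Hypothetical observed values and model forecasts.
obs = [100.0, 200.0, 400.0]
fit = [110.0, 190.0, 440.0]
metrics = forecast_metrics(obs, fit)
```

On these numbers, the MAPE is about 8.3%, which would satisfy the study's target of a prediction error rate below 10%, while the negative MPE indicates a tendency to over-forecast.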

The RIV Method
The RIV method ranks predictor variables by weights, where larger predictor weights are considered more important, while those with lower weights are considered less important [71]. The advantage of this method is that it ensures that the variables are not evaluated as if all are equally important. By appropriate variable weighting, our model can determine which factors will have the most influence on the outcome. The ranking and weighting of variables improves the model accuracy, as the weighting reflects the contribution of each parameter to the outcome. A package for AIC determination was used to identify the level of importance for each variable using the RIV method [72].
To estimate the RIV of a variable x_j, the Akaike weights (derived from the AIC) are summed across all the models in the set in which x_j occurs; this sum, w+(j), reflects the importance of the variable. The sum is a numerical value between 0 and 1: the larger w+(j) is (i.e., the closer to 1), the more important the variable is relative to the other variables tested. Using w+(j), all the variables can be ranked in order of their importance.
The effect size is based on model-averaged estimates. It is, therefore, important to ensure a balance in the number of models which include the variable j. In other words, to ensure an accurate reflection of the importance of one variable versus another, a combination of models is required, which contain all prospective variables in equal proportion across all models, allowing each variable to be tested on an equal footing. Otherwise, if one variable were to be found more frequently across our test models, as compared to another, it may inadvertently give the more frequently occurring variable the advantage.
Typically, to calculate the Akaike weights, the following formulae are used:

Δ_k = AIC_k − AIC_min,

w_k = exp(−Δ_k/2) / Σ_{i=1}^{K} exp(−Δ_i/2),

where AIC_min is the smallest AIC in the set of K models. The weights can then be viewed as proportions of evidence: the importance of a predictor x_j is the sum of the model weights for the subset of the models that contain x_j,

w+(j) = Σ_{k: x_j ∈ model k} w_k,

and the sum of the weights for the subset of all the models that do not contain the predictor variable x_j is:

w−(j) = Σ_{k: x_j ∉ model k} w_k.

Hence, the importance of the predictor x_j is associated with the contrast between w+(j) and w−(j), with w+(j) + w−(j) = 1. The larger the w+(j) value is, the more important the predictor x_j.
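The weighting and summation can be sketched in a few lines of Python. The model set and AIC values below are hypothetical and deliberately balanced (each variable appears in the same number of models), in line with the balance requirement discussed above:

```python
import math

def akaike_weights(aics):
    """w_k = exp(-Delta_k/2) / sum_i exp(-Delta_i/2), Delta_k = AIC_k - min AIC."""
    best = min(aics)
    raw = [math.exp(-(a - best) / 2.0) for a in aics]
    total = sum(raw)
    return [r / total for r in raw]

def riv(models, aics):
    """Relative importance w+(j): sum of the weights of models containing j."""
    w = akaike_weights(aics)
    variables = set().union(*models)
    return {v: sum(wk for m, wk in zip(models, w) if v in m) for v in variables}

# Hypothetical balanced model set over two variables: each of "A" and "B"
# appears in exactly two of the four models (including the null model).
models = [{"A"}, {"B"}, {"A", "B"}, set()]
aics = [100.0, 102.0, 104.0, 106.0]

importance = riv(models, aics)  # here w+("A") > w+("B")
```

Because the weights sum to one over the whole model set, w+(j) and w−(j) are automatically complementary for each variable.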

Treatment of Missing Data
Generally, mixed-effects models are more flexible in the treatment of missing data than fixed-effects models. It is reasonable to assume that a mixed model is capable of handling the imbalance caused by missing observations, provided that the data points are missing at random. When data cannot be considered to be missing at random, ad hoc approaches, such as the "last value carried forward" (i.e., where the last observed value of the response variable is substituted for every subsequent missing observation), are used [47].
For the selection of the mixed-effects model, we opted to use two methods to correct for missing data: the multiple imputation (MI) method to identify potential predictor variables and the stacked MI method to validate (or possibly further investigate) our original findings. For the RIV analysis, we simply used the MI method. Both methods aimed to restore the dataset from its incomplete state to that of completeness by substituting reasonable estimates for each missing data point.
The MI method, which was proposed by Rubin in 1978, rectifies the major disadvantage of single imputation-the under-representation of uncertainty [73][74][75]. While MI has the appeal of restoring the full dataset, we realise that there is no way to recover the actual unknown missing values. It is, therefore, important to note that imputed datasets are not to be treated as substitutes for true completed datasets but rather designed to produce valid overall inferences from the original incomplete dataset [76].

The Multiple Imputation (MI) Method
(a) Generate m copies of the incomplete dataset, using an appropriate procedure to impute the missing values in each copy. As the true values are unknown, the imputed values differ between copies: the first components of the vectors of imputed values create the first completed dataset, the second components create the second completed dataset, and so on. The repetition m times accounts for the variability due to the unknown values [78,79]. We opted to produce m = 5 imputed datasets, producing five separate (and complete) datasets, each with 150 rows of data.
(b) Each completed dataset is analysed using standard complete-data methods, and the parameter estimates of interest are stored [77].
(c) Using "Rubin's rules", a combined estimate of the parameter is generated as the average of the m separate estimates [76].
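Rubin's rules also combine the per-imputation variances into a total variance that reflects both within- and between-imputation uncertainty. A minimal Python sketch with hypothetical estimates from m = 5 imputed datasets:

```python
import math

def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates via Rubin's rules:
    pooled estimate = mean of estimates; total variance = W + (1 + 1/m) * B."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                   # pooled estimate
    w = sum(variances) / m                                       # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)       # between-imputation variance
    t = w + (1.0 + 1.0 / m) * b                                  # total variance
    return q_bar, math.sqrt(t)                                   # estimate and pooled SE

# Hypothetical slope estimates (and their variances) from the five imputations.
est = [0.21, 0.25, 0.19, 0.23, 0.22]
var = [0.0040, 0.0042, 0.0038, 0.0041, 0.0039]

q, se = pool_rubin(est, var)  # q is the mean of the five estimates
```

The between-imputation term B is what restores the uncertainty that single imputation under-represents: the more the five estimates disagree, the larger the pooled standard error.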
Step (a), the imputation step, predicts or fills in the missing values multiple times using the conditional distribution of the observed data. Although several imputation methods exist, such as predictive mean matching, Markov Chain Monte Carlo (MCMC), or chained equations, the preferred method is one that matches the missing data pattern [80].
In the process of model selection, the MI method generally yields different predictor variables across each dataset. Three strategies have been proposed to combine the results and identify the single most suitable model across all imputed datasets [81]: (a) select predictors that appear in any model; (b) select predictors that appear in at least half of the models; (c) select predictors that appear in all of the models.
In this study, it was found that the second of the proposed methods was preferred, as it allowed us to find commonalities between each imputed dataset and provided the flexibility to assess the discrepancies in variables that appeared infrequently across all the datasets.
We additionally used the stacked weighted regression method to validate the model findings obtained using the MI method. Rather than reviewing each imputed dataset separately, the five imputed datasets were "stacked" to create one large dataset of length m × n (m imputed datasets for n individuals). While fitting models to single stacked data yields valid parameter estimates, the standard errors may end up being too small. To correct this issue, we scaled the log-likelihood for the stacked data using weights in our regression models, which additionally accounted for the degree of missing information in the dataset:

w = (1 − f)/m,

where f is the fraction of missing data across all variables, i.e., the total number of missing values divided by np, with n being the number of individuals (150) and p the number of predictor variables (19) [81,82]. For both our MI and stacked MI methods, we used the R package Amelia [83]. Amelia resamples the original data using a bootstrap algorithm while implementing an expectation-maximisation (EM) algorithm, an iterative method for maximum likelihood or maximum a posteriori estimation [84]. Amelia uses all observed data to estimate the missing values, then creates several complete datasets that include the original data points plus slightly different imputed points to account for uncertainty (Figure 1). For stacked MIs, the same method of imputation takes place, with the addition of the argument separate = FALSE to ensure the imputed datasets are not separated and are kept as one (Figure 2).

Summary Statistics
Our cohort of 150 eyes consisted of 85 eyes (56.7%) from females and 65 eyes (43.3%) from males (Table 1). The mean age, with standard deviation (SD), at baseline was 78.9 ± 7.3 years. The mean baseline VA for the LE was 53.5 ± 24.0 letters, while that for the RE was 48.4 ± 24.3 letters. At baseline, ranibizumab was injected 122 times (81.3%) and bevacizumab 28 times (18.7%). Some variables contained only a small proportion of missing data, including the treatment quantity (5.2%). Finally, the variables that contained a substantial amount of missing data included the OCT-derived SRF (18.13%), IRF (18.27%), CMT (19.47%), PED (20.67%), haemorrhage (24.8%), and the treatment drug (35.6%). We assumed that greater variability in our outcomes would be found in this last set of variables and anticipated consistent results for all other variables. In the missingness map, cells marked dark red represent observed and available data, while light pink represents missing data. Most of the missing information is found in the OCT-derived variables; the treatment drug had the most missing values (35.6%), followed by haemorrhage (24.8%) and PED (20.67%).

Identifying Predictor Variables
We tested all possible combinations of the 19 predictor variables (i.e., 2^19 = 524,288 models, including the null model) for each imputed dataset and for the stacked imputed dataset. Possible combinations were tested in the following format:
1. Inspect the ith combination of predictor variables;
2. Add the ith combination into a mixed-effects formula, which includes the random-effects variables for time and subject;
3. Store the AIC;
4. Store the BIC;
5. Once all possible combinations have been tested, list the combinations that produce the smallest AIC and BIC values.
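The enumeration loop above can be sketched with ordinary least squares in place of the mixed-effects fit (the random-effects terms are omitted for brevity, and the synthetic data are illustrative):

```python
import itertools
import numpy as np

rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 3))
# y truly depends on columns 0 and 2; column 1 is pure noise.
y = 2.0 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(scale=0.1, size=n)

def aic_bic(subset):
    """Fit OLS on the chosen columns and return (AIC, BIC)."""
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    k = Xs.shape[1] + 1  # coefficients plus the error variance
    ll_part = n * np.log(rss / n)
    return ll_part + 2 * k, ll_part + k * np.log(n)

# Enumerate every non-empty combination, store both criteria, keep the best.
results = {}
for r in range(1, 4):
    for subset in itertools.combinations(range(3), r):
        results[subset] = aic_bic(subset)
best_aic = min(results, key=lambda s: results[s][0])
best_bic = min(results, key=lambda s: results[s][1])
print(best_aic, best_bic)
```

With 19 predictors the same loop runs over all 524,288 subsets; the structure of the search is identical.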
Each tested model followed the format below:

Response = ith combination of predictor variables + random effects    (22)

Life 2022, 12,1926 14 of 30

Using the MI method with five separate datasets (Table S2, Supplementary Materials), we initially identified the following predictors as producing the models with the lowest AIC/BIC for both the LE and RE.
In our methods (Treatment of Missing Data), our process of selecting the most appropriate predictors for a model included finding variables that appeared in at least half of the imputed dataset outcomes, with the flexibility to explore other predictor variables that occurred less frequently.
We then proceeded to repeat our analysis using the single stacked imputed dataset (Table S3, Supplementary Materials) in place of five separate imputed datasets.
Following the results from both methods, we proceeded to test models that included any of the predictors listed in Tables S2 and S3 in the Supplementary Materials. The final model choice was additionally based on: (1) diagnostic outcomes and (2) prediction accuracy. The following model, for both the LE and RE, provided the most consistent prediction outcomes in line with the model assumptions:

Yi = β0 + β1X1 + β2X2 + … + β10X10 + random effects (time and subject) + εi,

where Yi = VA at time t (LE or RE); X1 = LE baseline VA; X2 = RE baseline VA; X3 = OCT IRF; X4 = OCT CMT; X5 = time (in weeks); X6 = treatment quantity; X7 = treatment drug; X8 = treated eye; X9 = OCT haemorrhage; and X10 = OCT PED. While other potential variables, such as age, hypertension, and OCT SRF, were also tested, the addition of these variables neither added to nor subtracted from the accuracy of the model. The preference was for an efficient model, with the fewest variables needed to produce an accurate outcome and to guard against over-fitting; the ten selected variables formed the basis of the final model.

Model Diagnostics
Residual versus fitted plots for both the LE and RE models (Figure 4) demonstrated a relatively even distribution. Some data points located considerably further out than most others could be considered potential outliers. The normal probability plots (Figure 5) for both models were generally normally distributed, with some deviation noted at the tail ends. While these plots suggested that the models were a good fit for the data, we must consider the possibility of influential data points.
Using DFBETAS plots for both the LE (Figure 6) and RE (Figure 7) models, several data points in both models exceeded the cut-off value of 2/√n = 0.17. Using sigtest(), which identifies the statistical changes in the model that may be caused by the removal of a potentially influential data point, we found that removing the DFBETAS points that exceeded the cut-off value did not change the outcomes for either the LE or RE model.
Using Cook's distance plots for both the LE (Figure 8) and RE (Figure 9) models, several plot points exceeded the cut-off of 4/n = 0.027. We reviewed these points by momentarily excluding them using exclude.influence() and re-assessing our models; we found that the exclusion of these points did not affect or change our model outcomes.
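The two influence cut-offs used above can be sketched as a simple flagging step (the influence values below are hypothetical; note that 2/√150 is closer to 0.163 than the 0.17 reported in the text):

```python
import numpy as np

n = 150  # eyes in the cohort
dfbetas_cutoff = 2 / np.sqrt(n)  # approx. 0.163
cooks_cutoff = 4 / n             # approx. 0.027

# Hypothetical influence measures for six data points.
dfbetas = np.array([0.02, 0.21, -0.05, 0.18, -0.30, 0.01])
cooks_d = np.array([0.001, 0.050, 0.010, 0.002, 0.031, 0.004])

# Indices of points exceeding each cut-off.
flag_dfbetas = np.flatnonzero(np.abs(dfbetas) > dfbetas_cutoff)
flag_cooks = np.flatnonzero(cooks_d > cooks_cutoff)
print(flag_dfbetas, flag_cooks)
```

Flagged points would then be temporarily excluded and the model refit, mirroring the sigtest()/exclude.influence() workflow described in the text.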
These results suggest that, while there are several potentially influential data points, no data points appeared to be overly influential on our models. Additionally, we noticed that the original outliers we had noted in the residual versus fitted plots (Figure 4) appeared in our potentially influential analysis. However, similar to all the other potential data points, we noticed that the originally identified outliers had no bearing on the model (or prediction) accuracies. While we opted not to delete outlier points for posterity, we modified the dimensions of the residual versus fitted plots to demonstrate that, sans the outliers, we could clearly see evenly distributed and well-spaced data points of our residual plot (Figure 10), further validating that our model assumptions had been met.
Figure 6. DFBETAS for the LE models for all variables. Using the cut-off value of 2/√n, the plot suggests that there are several potentially influential points (indicated in red). However, using sigtest(), we found that the removal of the DFBETAS points had no bearing on the model outcomes.
Figure 7. DFBETAS for the RE models for all variables. Using the cut-off value of 2/√n, the plots suggested that there are several potentially influential points (indicated in red). However, using sigtest(), we found that the removal of the DFBETAS points had no bearing on the model outcomes.
Figure 9. Cook's distance for the RE models for all variables. Using the cut-off value of 4/n, the plot reveals potentially influential points (indicated in red), but the tests revealed that there was no significant impact.

Prediction Accuracy
The forecasting accuracy of the prediction model was evaluated for both the LE and RE models (Table 2). Very low ME results were evident in both the LE and RE models. Both sets of MAD results were quite low, with the LE model having a MAD of 1.70-1.87 and the RE model a MAD of 1.48-1.55. The RMSE ranged from 3.54 to 3.95 for the LE model and from 3.54 to 3.95 for the RE model.
With respect to the MPE and MAPE, the aim was to identify models with a MAPE of less than 10%. The MAPE for the LE model ranged from 5.56 to 6.39%, and for the RE model, from 7.02 to 7.41%. Both models met the MAPE objective. Both the LE and RE model MPE results were very low, being −0.02 and −0.03, respectively.
Finally, for goodness-of-fit, which included both the marginal and conditional R², both models had values close to 1, suggesting that the models were a good fit to the data. Figures 11 and 12 provide a visual demonstration of the proximity between the observed and predicted values. The forecasting errors, along with the visual aids, suggest that the models, in general, have very good prediction accuracy and that the approach is suitable for predicting VA outcomes during anti-VEGF treatment for AMD patients.
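The error metrics reported in Table 2 follow standard definitions; a minimal sketch with hypothetical VA letter scores (the function name and example values are illustrative):

```python
import numpy as np

def forecast_errors(observed, predicted):
    """Standard forecast-error metrics; errors are observed minus predicted."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    e = obs - pred
    return {
        "ME": e.mean(),                           # mean error (bias)
        "MAD": np.abs(e).mean(),                  # mean absolute deviation
        "RMSE": np.sqrt((e ** 2).mean()),         # root mean squared error
        "MPE": (e / obs).mean() * 100,            # mean percentage error
        "MAPE": (np.abs(e) / obs).mean() * 100,   # mean absolute percentage error
    }

# Hypothetical observed and predicted VA scores at three follow-up visits.
metrics = forecast_errors([50, 60, 40], [48, 63, 40])
print({k: round(v, 3) for k, v in metrics.items()})
```

The study's MAPE-below-10% criterion would then be a simple check on the returned dictionary.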


Relative Variables of Importance
We computed two sets of RIV analyses: (1) for all nineteen variables that were available (i.e., the clinical variables available to ophthalmologists) and (2) for only the ten predictor variables found in our LE and RE models. We ran the analyses across the five imputed datasets produced using Amelia. RIVs were weighted for both the LE (Table 3 for the full list of variables; Table 4 for the model-only variables) and RE (Table 5 for the full list of variables; Table 6 for the model-only variables), with the outcome set as the follow-up VA scores over the course of 24 months. Generally, the results across all the imputed datasets were consistently similar. We did, however, note a single anomaly in the LE outcomes (Table 3): the IRF in the fifth imputed dataset had a w+ of 0.77 and a w− of 0.23, which contrasted with the previous four imputed dataset outcomes. We repeated our analysis for this measure, and the weight scores remained the same. To account for any uncertainties, we averaged the results across all five imputed sets for each variable.
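A sketch of the weighting assumed here, consistent with the summed model weights described in the Discussion: Akaike weights are computed across the candidate models, and a variable's w+ is the total weight of every model that contains it (the model list and AIC values are illustrative):

```python
import numpy as np

# Illustrative candidate models: (set of variables in model, AIC).
models = [
    ({"baseline_VA", "time"}, 1000.0),
    ({"baseline_VA"}, 1004.0),
    ({"time"}, 1010.0),
]

aics = np.array([aic for _, aic in models])
delta = aics - aics.min()              # AIC differences from the best model
akaike_w = np.exp(-delta / 2)
akaike_w /= akaike_w.sum()             # model weights now sum to 1

def importance(variable):
    """w+ : total Akaike weight of the models that include `variable`."""
    return sum(w for (vars_, _), w in zip(models, akaike_w) if variable in vars_)

w_plus = importance("baseline_VA")
print(round(w_plus, 3), round(1 - w_plus, 3))  # w+ and w- for baseline_VA
```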
Once averaged, the weights were identified for each eye, and the variables were then ranked based on their average weighted scores across both eyes (Table 7 for the full list of variables; Table 8 for the model-only variables). The top four variables, always classified as "Highly Important" with average w+ scores of at least 0.9, were: the treated eye, the baseline VA of the treated eye, the time (measured in weeks), and the number of injections received throughout the 24 months. No variables were classified as "Important", a category covering weight scores of ≥0.7 and <0.9.
For the full list of variables, four variables were identified as "Moderate" based on a weighted score of ≥0.5 and <0.7; these were: age, smoking status, the treatment drug, and CMT. It is worth noting that the moderate score for the treatment drug may purely be due to the use of either ranibizumab or bevacizumab in our studies; both anti-VEGFs were categorised as having similar treatment profiles. Diabetes and the baseline VA of the untreated eye were classified as "Low to Moderate" in importance based on weight scores of ≥0.4 and <0.5. Finally, the variables with the lowest ranks (i.e., w+ < 0.4) were gender, IRF, SRF, haemorrhage, PED, smokerpacks, hypertension, and ethnicity (both maternal and paternal). For the model-only variables, those identified as "Moderate" were CMT and PED; those in the "Low to Moderate" category were the baseline VA of the untreated eye and IRF; and the treatment drug and haemorrhage were in this instance noted as being "Low". When comparing the rank of the variables between the full list of available variables and those of our model, we noticed that, for the most part, the rank/order of the variables was similar. Minor differences were evident; however, this is unsurprising, given that the RIV method ranks variables relative to the presence of other variables. Overall, the rank/order generally remained the same across the board.
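The category thresholds used above (Highly Important ≥ 0.9; Important ≥ 0.7 and < 0.9; Moderate ≥ 0.5 and < 0.7; Low to Moderate ≥ 0.4 and < 0.5; Low < 0.4) can be expressed as a small helper; the example variable names and weights are hypothetical:

```python
def classify_importance(w_plus):
    """Map an averaged w+ score to the importance category used in the text."""
    if w_plus >= 0.9:
        return "Highly Important"
    if w_plus >= 0.7:
        return "Important"
    if w_plus >= 0.5:
        return "Moderate"
    if w_plus >= 0.4:
        return "Low to Moderate"
    return "Low"

# Hypothetical averaged w+ scores for a few variables.
for name, w in [("treated_eye", 0.95), ("CMT", 0.62), ("IRF", 0.45), ("gender", 0.2)]:
    print(f"{name}: {classify_importance(w)}")
```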

Discussion
Many AMD patients have variable responses to anti-VEGF injections due to medical issues, lifestyle, and demographic factors. A machine learning approach was developed for the prediction of VA outcomes that accounted for these modifying factors and also ranked the predictors in order of importance. The prediction model included the treated eye's baseline VA, the time of treatment, the treatment quantity, the treated eye, the baseline VA of the untreated eye, the treatment drug, CMT, IRF, PED, and haemorrhage.
The analytic approach combined a mixed-effects (ME) model and RIV methods, together with the treatment of missing values with the multiple imputation (MI) method and various statistical diagnostic tests to confirm the validity of the model assumptions, such as the normality of residuals.
The variables with the highest rankings included the baseline VA of the treated eye, the time of treatment, treatment quantity, and the treated eye. Given that these variables are important aspects of the anti-VEGF response, their high rankings are unsurprising. The presence of variables, such as age, hypertension, and SRF, had a less significant impact on the accuracy of the model predictions of VA. The analytic approach had a number of strengths and weaknesses, which are described as follows.

Strengths of the Study
Incorporating mixed-effects modelling as part of a machine learning approach is consistent with the analysis of biological and medical data [31], as it provides flexible and powerful statistical tools for controlling stratification, relatedness, and confounding factors [32][33][34]. Features that support statistical confidence in the methodology include the use of the ME and RIV methods to aid in the assessment of predictor importance and the multiple imputation (MI) treatment of missing values. Statistical diagnostics produced very good support for the model with respect to the analysis of residuals and outliers, using methods such as Q-Q plots and Cook's distance.
There were two noteworthy features of the machine learning approach described in this investigation. First, the use of time as an explicit variable in the model is often absent in other machine learning approaches, especially in classification studies comparing training data with test data. This means that no assumptions were necessary on the issue of non-stationarity in the time-series statistics for function approximation, and there was no confounding of the time in either the training or test data, thus reducing error and uncertainty.
The second feature of note is that the weighting and ranking of predictors, as described by the methods in this study, provides information on the relative impact of each predictor on visual acuity and, therefore, adds a degree of explainability to the results. In machine learning research, there is currently a strong interest in improving explainability in order to reveal the reasoning used in decision-making and to avoid a black-box analysis by AI algorithms [38]. In the case of explainable AI research, there is a class of approaches commonly referred to as 'attribution' methods, which assign to each input feature a score representing its contribution to the response function [85,86]. The machine learning method in this study is an example of such an attribution approach.

Limitations of the Study
The study also has several limitations with respect to the collection of clinical data. The data were collected retrospectively, and the treatment protocol varied according to the clinician's choice. The cohort was collected early in the history of anti-VEGF treatment, and as such, individual clinicians' treatment protocols may have evolved in more recent cases. Similarly, the OCT quality was lower compared to current advances in spectral-domain OCT technology. As such, the ability to judge the presence of SRF and IRF, and of scarring, was not as accurate as it could have been had the cohort been collected more recently. Missing data, particularly relating to retinal characteristics identified by OCT, were most likely due to poor-quality OCT images.
To account for the missing data, we created both separately imputed and stacked imputed versions of the original dataset, with the latter created for model validation purposes. The objective of the multiply imputed datasets was to generate imputed values that not only mimicked the distribution of the original data but also differed slightly across the imputed datasets, to account for any potential uncertainty. Our second limitation was the use of RIV itself. We assessed the variables relative to each other; their values may have changed had they been tested against other, stronger predictor variables.
We believe that the model weights wi, summed over all the models that included a given variable, provided a better weight of evidence for the importance of each variable in the context of the set of models considered. Using the predictor variables considered of interest, the rank of the aforementioned predictors (Tables 5 and 8) provided a good indication of the relative importance of the variables in determining the treatment response. However, with improved imaging technology, new variables, and new data, it is feasible that the relative importance of some variables may need updating, which can be accomplished using the proposed approach.

Conclusions
This study developed a methodology and prediction model for the visual acuity (VA) response following anti-VEGF therapy in nAMD patients. The analysis provided an approach for targeting and prioritising contextual factors that may have an impact on the degree of success in the treatment of wet AMD with anti-VEGF agents. The evaluation of visual responses included the assessment of clinical, lifestyle, and demographic factors. The approach combined mixed-effects modelling with relative importance of variables (RIV) modelling, together with statistical learning approaches and data processing with diagnostic tests. The most important predictors were confirmed as the baseline VA of the treated eye, the time of treatment, the treatment quantity, and the treated eye. There were also impacts from OCT features, such as CMT, IRF, PED, and the presence of haemorrhage, together with lifestyle and demographic factors, such as age and ethnicity.
There are several noteworthy features of the study. The incorporation of mixed-effects modelling as part of the machine learning approach is compatible with the analysis of biological and medical data. The approach provided powerful statistical tools for controlling stratification, relatedness, and confounding factors. Statistical confidence in the methodology is highlighted by the use of mixed-effects modelling and RIV methods for the assessment of predictor importance and the multiple imputation (MI) treatment of missing values. Statistical diagnostics underpinned the model performance with respect to the analysis of residuals and outliers, using methods such as Q-Q plots and Cook's distance.
The study provided support for the use of machine learning in personalised medicine. The machine learning approach investigated had some notable attributes. First, the use of time as an explicit variable avoids issues of non-stationarity and confounding in statistics that may be a problem in classification studies. Second, the approach had a degree of explainability because of its inclusion of attribution analysis.
The flexibility of the approach allowed for extending the model to investigate other potential predictors from personal electronic health records and also updating weights with new training data.