Statistical Models for High-Risk Intestinal Metaplasia with DNA Methylation Profiling

We consider the newly developed multinomial mixed-link models for a high-risk intestinal metaplasia (IM) study with DNA methylation data. Different from the traditional multinomial logistic models commonly used for categorical responses, the mixed-link models allow us to select the most appropriate link function for each category. We show that the selected multinomial mixed-link model (Model 1) using the total number of stem cell divisions (TNSC) based on DNA methylation data outperforms the traditional logistic models in terms of cross-entropy loss from ten-fold cross-validations with significant p-values 8.12×10−4 and 6.94×10−5. Based on our selected model, the significance of TNSC’s effect in predicting the risk of IM is justified with a p-value less than 10−6. We also select the most appropriate mixed-link models (Models 2 and 3) when an additional covariate, the status of gastric atrophy, is available. When the status is negative, mild, or moderate, we recommend Model 2; otherwise, we prefer Model 3. Both Models 2 and 3 can predict the risk of IM significantly better than Model 1, which justifies that the status of gastric atrophy is informative in predicting the risk of IM.


Introduction
Gastric intestinal metaplasia (IM) is a precancerous change in the mucosa of the stomach with intestinal epithelium [1], which increases the risk of gastric cancer [2], the third leading cause of cancer death worldwide and the fifth most common malignancy in the world [3].Intestinal-type gastric cancer is more common and is associated with chronic inflammation, atrophy, and IM of the stomach, often relevant to Helicobacter pylori infection [4].The exact mechanism of how IM leads to gastric cancer is not fully understood, but it may involve genetic and epigenetic alterations that affect the expression and function of key genes, including DNA methylations [5].There has been increasing evidence that DNA methylation changes in normal tissue are correlated with cancer risk [6][7][8][9][10][11][12], including gastric cancer [5,13].The DNA methylation levels observed in IM tissue samples are significantly higher than normal gastric samples, which indicates that the DNA methylation profiles may help with predicting IM and gastric cancer [5].
In this study, we utilize the DNA methylation data of 124 samples obtained from the Gastric Cancer Epidemiology Program (GCEP) and deposited in NCBI (GSE103186) by [5].We aim to build the most appropriate statistical model to predict the risk level of IM, including Normal (normal gastric samples), MIM (mild IM or low-risk samples, type I), and IM (high-risk samples, type II or type III), using the total number of stem cell divisions per stem cell (TNSC) estimated by the epiTOC2 (Epigenetic Timer of Cancer-2, [12]) model from the measured DNA methylation profile, along with other clinical information such as the status of gastric atrophy [5].
For categorical responses with three or more categories, such as {Normal, MIM, IM} in this study, multinomial logistic models have been widely used in the statistical literature, including the baseline-category, cumulative, adjacent-categories, and continuation-ratio logit models [14][15][16][17].Among the four classes of logit models, the baseline-category logit model, also known as the (multiclass) logistic regression model, has been extended with a probit link and is known as the multinomial probit model [18][19][20]; the cumulative logit model has been extended to cumulative link models [19,21,22]; and the continuation-ratio logit model has been extended with a complementary log-log link [23].It should be noted that all these models assume the same link function for all categories.
In this study, we adopt the multinomial mixed-link model (see Section 2.2), proposed by [24] recently, because it not only covers all the models mentioned above but also allows us to choose different link functions across categories.By choosing the multinomial mixedlink model, we find out that the cumulative mixed-link model with proportional odds (po) assumption and g 1 = loglog, g 2 = logit link functions outperforms the traditional models, in terms of predicting the risk level of IM using DNA methylation profiles (see Section 3.1).Based on ten-fold cross-validations, the improvement is statistically significant.Our results also show that by incorporating the status of gastric atrophy can further improve the prediction accuracy significantly.Having run our model selection procedure again, we determine that an adjacent-categories logit model with po (see Section 3.2) is most appropriate when the status of gastric atrophy is negative, mild or moderate, whereas an adjacent-categories probit model with po (see Section 3.3) works the best when the status is marked or unknown.For readers' reference, we provide the predictive probabilities for each tissue sample in the Supplementary Materials, as well as the sample IDs and the corresponding covariates.

epiTOC2 Model and TNSC Covariate
The mitotic age of tissues is relevant to the total number of cell divisions, which can be estimated by the DNA methylation changes in the stem cell.Recent studies have shown the correlation between the mitotic age of tissue and the neoplastic transformation [25][26][27].Many models for estimating mitotic age have been proposed based on DNA methylation data, including the epiTOC model [28], the solo-WCGWs model [29], and the epiTOC2 model [12].In this study, we adopt the epiTOC2 model, which shows good robustness and is better for discriminating preneoplastic lesions [12].The epiTOC2 model estimates the total number of stem cell divisions directly (TNSC) and is based on CpG sites marked by polycomb repressive complex-2 (PRC2).These sites are generally unmethylated across fetal tissues and become methylated during ontogeny and aging.The epiTOC2 model was fitted using the Illumina Infinium 450k data from [30], who selected n c = 163 CpG sites in their model based on the rate of increase in DNA methylation rates.A simplified epiTOC2 model can be rewritten as a weighted average of DNA methylation beta values over the n c CpGs in a sample s as follows: where δ i is a model parameter representing the probability of de novo methylation of parent and daughter strands (see [12] for more details).
In this study, we first use TNSC as the only covariate representing the DNA methylation profile to predict the risk level of IM (see Section 3.1).

Multinomial Mixed-Link Models
In general, we consider d covariates or predictors with m distinct settings x i = (x i1 , . . ., x id ) T , for i = 1, . . ., m.At the ith setting, n i categorical responses are collected and summarized into a multinomial response , where Y ij is the number of observations with the jth response category, and π ij is the probability that the response falls into the jth category, j = 1, . . ., J. Assuming all π ij ∈ (0, 1), there are four classes of multinomial logit models that have ever been used in the literature (see [16] and the references therein): where β j = (β j1 , . . ., β jd ) T , i = 1, . . ., m, and j = 1, . . ., J − 1.In the statistical literature (see, for example, [16]), the four logit models, (1)-( 4), are also called nonproportional odds (npo) models, which allow β j 's to be different across j = 1, . . ., J − 1.If we further assume β j ≡ β = (β 1 , . . ., β d ) T , then the four models are known as proportional odds (po) models.For more general odds structures for multinomial logistic models, that is, partial proportional odds (ppo) models, please see [16,17].
In the form of npo models, the multinomial mixed-link model [24] can be written as follows where , for baseline-category mixed-link models , for adjacent-categories mixed-link models , for continuation-ratio mixed-link models (6) where g j is a predetermined link function, i = 1, . . ., m, and j = 1, . . ., J − 1.It can be verified that if , that is, the logit link, then the multinomial mixed-link model ( 5) plus (6) leads to the four multinomial logit models (1)- (4).In this study, we also consider some other link functions that have been commonly used in the literature, namely, probit (g j (ρ ij ) = Φ −1 (ρ ij ), where Φ is the cumulative distribution function of standard normal distribution), log-log (or loglog, g j (ρ ij ) = − log(− log(ρ ij )), and complementary log-log (or cloglog, g j (ρ ij ) = log(− log(1 − ρ ij )).For more options of link functions, please see Table 1 in [24].
Following the notation in [24], the multinomial mixed-link model ( 5) plus ( 6) can be written into its matrix form: where g = (g 1 , . . ., g J−1 ) T , L and R are (J − 1) × (J − 1) constant matrices, b is a constant vector of length J − 1, ) is a d × (J − 1) matrix of parameters.Note that the vector g of link functions in (7) applies to the ratio of two vectors component-wise.That is, if we denote L = (L 1 , . . ., L J−1 ) T , R = (R 1 , . . ., R J−1 ) T and b = (b 1 , . . ., b J−1 ) T , then the multinomial mixed-link model ( 7) can be written in its equation form: In other words, ρ ij in ( 5) and ( 6) can be written as In this study, we consider the four classes of mixed-link models listed in (6).For baselinecategory mixed-link models, L = R = I J−1 , the identity matrix of order J − 1, and b = 1 J−1 , the vector of ones with length J − 1; for cumulative mixed-link models, and for continuation-ratio mixed-link models, and b = 1 J−1 .
In this study, we implement the algorithms described in Section 4 of [24] to find the maximum likelihood estimate (MLE) θ for either the npo model's parameter vector

Model Selection and Evaluation
In this study, we use the multinomial mixed-link model ( 5) plus (6) to predict the risk level of IM in three ordered categories, namely, Normal, MIM, and IM.In terms of the structure of ρ ij as defined in (6), we have four options, namely, baseline-category, cumulative, adjacent-categories, and continuation-ratio mixed-link models.In this study, the number of response categories is J = 3.For each j = 1, . . ., J − 1, we consider four possible link functions, namely, logit, probit, loglog, and cloglog.From the right-hand side of (5), we still have two options, an npo model (β 0j + β T j x i ) or a po model (β 0j + β T x i ).As a summary, we have 4 × 4 J−1 × 2 candidate models.
In the statistical literature, the Akaike Information Criterion (AIC, [31,32]) and Bayesian Information Criterion (BIC, [33]) have been widely used for model selection, given that a statistical model is assumed.In our case, the maximized likelihood l( θ) is obtained along with the MLE θ after fitting the model.In our notation, where n = ∑ m i=1 n i stands for the total number of observations or the sample size, p = (d + 1) × (J − 1) for npo models or d + J − 1 for po models in our study.Smaller AIC or BIC values imply better models.Since in this study the sample size n = 124 (see Section 3) is not large, we recommend AIC against BIC if their results of model selection are not consistent (see, for example, [34], for more discussions on AIC and BIC).
To show if the selected model is significantly better than commonly used models in the literature, we use a ten-fold cross-validation to estimate the prediction errors of the models under comparison.Different from five-fold cross-validations chosen by [17], we choose ten-fold cross-validations in this study because our sample size n = 124 is relatively smaller (for more discussion on ten-fold versus five-fold cross-validations, see [34]).
Different from many machine learning techniques, the multinomial mixed-link model provides a stochastic classification answer [35] to each tissue sample.That is, given the covariate or predictor setting x i , we obtain by the fitted multinomial mixed-link model predictive probabilities πij for Normal (j = 1), MIM (j = 2), and IM (j = 3), respectively, which is much more informative than a deterministic classification answer [35].Following [17], we use the cross-entropy loss to evaluate the performance of statistical models under comparison.Given a random partition B of the index set [n] = {1, . . ., n}, which divides [n] into ten non-overlapped subsets (called blocks) of roughly the same size, the (average) cross-entropy (CE) loss for a specified model is where n = 124 is the sample size, y i is the observed response label of the ith tissue sample, and k(i) is the block label to which the ith sample belongs.More details about calculating CE can be found in Section 2.4 of [17] except that we use a ten-fold instead of five-fold cross-validation.A smaller CE value implies a better model.To check whether the improvement of one model against another is statistically significant, in this study we randomly generate partitions and use a one-sided paired t-test to check whether the improvement is significant.

Statistical Model Selection for Predicting IM Based on TNSC
In this study, we first match the DNA methylation data downloaded from NCBI (https://www.ncbi.nlm.nih.gov/geo/,GSE103186, accessed on 23 January 2024) with the tissue samples listed in Table S3 in [5] (https://www.cell.com/cancer-cell/,accessed on 18 January 2024).Among the 134 tissue samples collected at the antrum site [5], there are 10 samples lacking DNA methylation profiles.We use the remaining 124 samples for our analysis.We then compute the TNSC values for the 124 samples using their DNA methylation data, as described in Section 2.1.The R codes for computing TNSC are accessible online (https://zenodo.org/records/2632938,epiTOC2.R, accessed on 15 January 2024) as indicated by [12].In this section, we consider the multinomial mixed-link model as described in Section 2.2, and use the computed TNSC as the only covariate to predict the risk level of IM in three categories (Normal, MIM, and IM).For each of 4 × 2 models, the optimal link functions for j = 1, 2, respectively, along with their corresponding AIC and BIC values, are listed in Table 1 (see Appendix A for the AIC and BIC values of all link combinations).Note: The best model overall, along with its links and values, is highlighted in bold.
According to Table 1, the best multinomial mixed-link model with the lowest AIC overall in this case, called Model 1, is a cumulative po model with loglog and logit links for j = 1 (Normal) and j = 2 (MIM), respectively.Note that by default j = 3 (IM) is treated as the baseline category.The fitted Model 1 is provided in (8), where x TNSC,i is the computed TNSC value for the ith tissue sample.
In ( 8), the estimated coefficient of x TNSC,i is −4.228 × 10 −4 , which is fairly small.To test whether the effect of TNSC is significant in predicting IM, we obtain its 95% confidence interval (−4.167 × 10 −4 , −4.290 × 10 −4 ), which does not contain zero.Actually, the corresponding p-value of its significance test is less than 10 −6 .As a conclusion, the effect of TNSC is statistically significant in predicting the risk level of IM.
To further check whether Model 1 outperforms the traditional statistical models, as described in Section 2.3, we run a ten-fold cross-validation and compare its cross-entropy loss against other models.For illustration purposes, we choose the baseline-category logit model with npo (also known as the multiclass logistic regression model) and the cumulative logit model with npo (one of the most popular models for ordinal responses) as the alternative models.As for other models, including multinomial logit models and probit models, the conclusions are similar (see Appendix A).To avoid misleading conclusions relying on a particular partition, we randomly generate ten partitions and compute their corresponding CE values.The boxplots of the resulting ten CE values are provided in Figure 1, which shows that the CE values of Model 1 seem to be much lower than those values of the other two models.Although we only run ten random partitions due to computational intensity, our one-sided paired t-tests based on the ten CE values show that the improvements of Model 1 are significant.The p-values of the t-tests for comparing Model 1 against the baseline npo model and the cumulative npo model displayed in Figure 1 are 8.12 × 10 −4 and 6.94 × 10 −5 , respectively.That is, the recommended cumulative po model with loglog and logit links significantly outperforms the two multinomial logistic models that are commonly used in practice.To show how well Model 1 works, we plot in Figure 2 the predictive probabilities πij against the true response labels, j = 1, 2, 3, respectively.According to Figure 2, the recommended Model 1 works reasonably well.For examples, in the left panel, we plot πi1 , which is the predictive probability that the ith tissue sample belongs to Normal, against its true response label.If the true label is Normal, the left boxplot in the left panel of Figure 2, which is apparently higher than the other two boxplots in the same panel, indicates that the corresponding tissue sample tends to be predicted as Normal as well.Similarly, in the right panel, πi3 , the predictive probability that the sample belongs to IM, is plotted, and the significantly higher boxplot to the right indicates that the sample with true label IM tends to be predicted as IM as well.Nevertheless, the middle panel, which plots the predictive probabilities for MIM, indicates that the MIM class is not so different from Normal or IM, and thus is more difficult to predict correctly.

Statistical Model Selection for Predicting IM Based on TNSC and Gastric Atrophy
In this section, we show that when additional information, such as the status of gastric atrophy, is available, the prediction accuracy of the IM risk level can be significantly improved.
In this study, the status of gastric atrophy is a 5-class categorical variable (see Table S3 in [5]), namely, Marked, Moderate, Mild, Negative, and Unknown.In our regression analysis involving the status of gastric atrophy, we replace it with four dummy variables: x mild,i , x moderate,i , x negative,i , and x unknown,i .Each dummy variable is binary, taking a value of either 1 or 0, with at most one variable allowed to be 1 for any given sample.For instance, a configuration of (x mild,i , x moderate,i , x negative,i , x unknown,i ) = (1, 0, 0, 0) indicates a mild gastric atrophy status for the ith sample, (0, 1, 0, 0) indicates a moderate gastric atrophy status, whereas (0, 0, 0, 0) indicates a marked status, that is, the baseline status.Similarly to Table 1, we list the optimal link functions for j = 1, 2, respectively, along with their AIC and BIC values, in Table 2.With the presence of gastric atrophy, the best multinomial mixed-link model, called Model 2, is an adjacent-categories logit model with po, which is different from the type of Model 1 with TNSC only (see Section 3.1).Since its AIC value, 108.89, is much less than 144.29 in Table 1, Model 2 is expected to outperform Model 1 significantly in terms of prediction accuracy (see [36] for more discussion on AIC differences).The fitted Model 2 is provided in (9) Similarly to Figure 1, we compare in Figure 3 the cross-entropy loss of two recommended models shown in (8) (Model 1) and ( 9) (Model 2).It is not surprising that Model 2 with both TNSC and gastric atrophy as predictors has a significantly smaller cross-entropy loss, which implies that the status of gastric atrophy is informative in predicting the risk level of IM.Similarly to Figure 2, we plot the predictive probabilities based on Model 1 and Model 2 against the true IM labels in Figures 4-6.When the true IM label matches the predictive label, such as the left panel in Figure 4, the middle panel in Figure 5, and the right panel of Figure 6, Model 2 tends to provide a higher predictive probability than Model 1, which shows that overall Model 2 outperforms Model 1.

Statistical Model Selection after Removing Unknown and Marked Categories
Among the 124 samples considered in this study, there are only 3 cases with "Marked" status of gastric atrophy, and there are 23 cases with "Unknown" status, which is not informative.In this section, we consider the best multinomial mixed-link model for the 98 cases after removing the samples that belong to Marked or Unknown categories.
In this section, the status of gastric atrophy is a three-class categorical variable restricted to the 98 samples.Similarly to Model 2 in Section 3.2, we replace the status of gastric atrophy with two dummy variables (x mild,i , x moderate,i ).More specifically, (x mild,i , x moderate,i ) = (1,0) stands for mild status, (0,1) for moderate status, and (0,0) for negative status representing the baseline.Similarly to Tables 1 and 2, we provide in Table 3 the optimal choices of link functions for each type of multinomial model.According to Table 3, the best multinomial mixed-link model for this scenario is an adjacent-categories po model with probit links for both j = 1, 2. We call it Model 3 and list its fitted model in (10).Note: The best model overall, along with its links and values, is highlighted in bold.
To compare the performance of Model 3 with Model 1 and Model 2, we use the crossentropy loss based on ten-fold cross-validations similarly to Sections 3.1 and 3.2.Since Model 3 cannot be applied to cases with marked or unknown status of gastric atrophy, we compare the performance of the three models on samples with mild, moderate, or negative status of gastric atrophy only.Their boxplots of cross-entropy loss based on ten random partitions for ten-fold cross-validations are displayed in Figure 7.According to Figure 7, Model 3 has a significantly smaller (average) cross-entropy loss compared with Model 1 and Model 2, in terms of predicting IM for individuals whose gastric atrophy statuses are negative, mild or moderate.Nevertheless, Models 1 and 2 are still useful since they can be applied to cases with marked or unknown status of gastric atrophy as well.

Discussion
In Section 3, we presented three models for different scenarios.When only the TNSC (or the DNA methylation profile) is available, we recommend Model 1, a cumulative mixedlink model with po, which works reasonably well with TNSC as the only input.When the status of gastric atrophy is also available, there are two different scenarios.If the status is negative, mild, or moderate, we recommend Model 2, an adjacent-categories logit model with po, which belongs to the traditional multinomial logit models.If the status is marked or unknown, we recommend Model 3 instead, which is an adjacent-categories probit model with po.Each of the three models has its own advantages.For example, although both Model 2 and Model 3 outperform Model 1 in terms of prediction accuracy, Model 1 is still useful when the status of gastric atrophy is not available.
To further compare the performance of Models 1 and 2 on cases with marked or unknown status of gastric atrophy, we display in Figure 8 the (average) cross-entropy loss on predicting those 26 cases with marked or unknown status of gastric atrophy only.According to Figure 8, Model 2 still outperforms Model 1 in predicting the risk level of IM for those 26 cases, which suggests that Model 2 be recommended against Models 1 and 3 for cases with marked or unknown status of gastric atrophy.In practice, more covariates or predictors may be added to the multinomial mixed-link model as well, given their availability.For example, it is known that Helicobacter pylori (Hp) infection is an important factor for both IM and gastric cancer development [5,37].When the Hp status, in terms of Hp serology test result [38], histological examination result [39], or Hp sequence reads [5], is available, one may add it into the model and use AIC, BIC, or cross-validation to determine whether the model with the newly added covariate works significantly better (see Section 2.3).
It should be noted that when using model selection techniques described in Section 2.3, sometimes the differences between the best models are not significant.For example, when selecting Model 3, two other models, a cumulative probit model with po and a continuationratio probit model with po, have similar AIC values (see Table 3) that are not significantly smaller than Model 3's [36].In this case, one may use any of them for prediction purposes.That is saying, with the current data or a finite sample size, those models are comparable or not significantly different from each other.
With an increased sample size, if there is a true statistical model associated with the response and available predictors, then the true model is expected to be among the best models asymptotically [40].Nevertheless, it does not necessarily mean that the true model is asymptotically identifiable (see [40] for more discussion on asymptotic consistency related to model selections for multinomial models).
In a previous study [5], DNA methylation alteration has been reported as significantly correlated with IM regression at the univariate level.Nevertheless, the significance vanishes when mutation burden and Hp density are incorporated into a multivariate logistic regression analysis [5].It is worthy of further exploration using the recommended multinomial mixed-link model with the most appropriate link functions selected.

Conclusions
In this study, we recommend the newly developed multinomial mixed-link models for predicting Intestinal Metaplasia using DNA methylation profiling.Using model selection techniques, such as AIC, BIC, and cross-validations, we show that the selected multinomial mixed-link model (Model 1) outperforms the traditional multinomial models that assume the same link function for all categories.We also show that when additional information, such as new covariates or predictors, is added to the model, the selection procedure needs to be rerun and the best mixed-link model may change.
When four or more response categories are involved, models other than multinomial mixed-link models have been proposed as well, including two-group models, which can deal with NA or unknown response categories, and po-npo mixture models, which are more flexible than npo, po, or ppo (partial proportional odds) models (see [24] for more examples).Model selection techniques described in Section 2.3 can still be applied, just to a much larger set of candidate models.

Figure 1 .
Figure 1.Cross-entropy loss based on ten-fold cross-validations with ten random partitions.

Figure 3 .
Figure 3. Boxplots of cross-entropy loss of Model 1 and Model 2 based on ten-fold cross-validations with ten random partitions.

Figure 4 .
Figure 4. Predictive probabilities for the normal category based on Model 1 and Model 2.

Figure 5 .
Figure 5. Predictive probabilities for the MIM category based on Model 1 and Model 2.

Figure 6 .
Figure 6.Predictive probabilities for IM category based on Model 1 and Model 2.

Figure 7 .
Figure 7. Boxplots of cross-entropy loss (on 98 Samples only) of Models 1, 2, and 3 based on ten-fold cross-validations with ten random partitions.

Figure 8 .
Figure 8. Boxplots of cross-entropy loss (on 26 Samples only) of Models 1 and 2 based on ten-fold cross-validations with ten random partitions.

Table 1 .
Best mixed-link models for predicting IM based on TNSC.

Table 2 .
Best mixed-link models for predicting IM based on TNSC and gastric atrophy.
Note: The best model overall, along with its links and values, is highlighted in bold.

Table 3 .
Best mixed-link models for predicting IM based on TNSC and 3-class gastric atrophy.

Table A1 .
Baseline-category mixed-link models with npo.

Table A2 .
Cumulative mixed-link models with npo.
Note: The AIC/BIC values, associated with the best pair of links, are highlighted in bold.

Table A3 .
Adjacent-categories mixed-link models with npo.
Note: The AIC/BIC values, associated with the best pair of links, are highlighted in bold.

Table A4 .
Continuation-ratio mixed-link models with npo.: The AIC/BIC values, associated with the best pair of links, are highlighted in bold. Note

Table A5 .
Baseline-category mixed-link models with po.
Note: The AIC/BIC values, associated with the best pair of links, are highlighted in bold.

Table A6 .
Cumulative mixed-link models with po.: The AIC/BIC values, associated with the best pair of links, are highlighted in bold. Note

Table A7 .
Adjacent-categories mixed-link models with po.: The AIC/BIC values, associated with the best pair of links, are highlighted in bold. Note

Table A8 .
Continuation-ratio mixed-link models with po.
Note: The AIC/BIC values, associated with the best pair of links, are highlighted in bold.