# Sample Size Analysis for Machine Learning Clinical Validation Studies

(This article belongs to the Section Biomedical Engineering and Materials)

## Abstract

## 1. Introduction

## 2. Materials and Methods

**STEP (1)** Specify performance metrics, including measures of model discrimination (the ability to distinguish cases from controls) and calibration (how well the model’s risk predictions match observed case rates). For discrimination, we used the area under the receiver operating characteristic curve (AUC) for the binary classification tasks (ST and COVA) and Harrell’s C-index (a generalization of the AUC) for the survival model task (BAI). For calibration, we employed the calibration slope and calibration-in-the-large (CIL) [13].

**STEP (2)** Specify the required precision (relative width of the confidence interval, or RWD) and accuracy (percent bias, or BIAS). We used cut-offs of ≤0.5 for precision and ±5% for accuracy. Let CI be the difference between the upper and lower limits of the confidence interval, and let trueValue be the true value of the quantity being estimated in the population. Then, RWD is given by:

$$RWD=\frac{CI}{trueValue} \tag{1}$$

For a given trial attempting to approximate trueValue, let the approximation be called estimate. The accuracy, or BIAS, is then given by:

$$BIAS=\frac{estimate-trueValue}{estimate} \tag{2}$$

**STEP (3)** Specify the required confidence (the probability that the confidence interval includes the true value, i.e., the “coverage probability” or COVP). We recommend 95%. With CI here denoting the interval itself rather than its width, COVP is given by:

$$COVP=Prob\left[trueValue\in CI\right] \tag{3}$$

**STEP (4)** For increasing sample sizes, calculate the expected precision (RWD) and accuracy (BIAS) that are achievable, subject to the coverage probability requirement (COVP > 95%). This should be calculated for each metric (slope, AUC (or C-index), and CIL).

**STEP (5)** Choose the minimum sample size that meets the requirements. Thus, at the minimal sample size, all three metrics (slope, AUC/C-index, CIL) must satisfy these inequalities:

$$RWD<0.5 \tag{4}$$

$$\left|BIAS\right|<5\% \tag{5}$$

$$COVP>95\% \tag{6}$$

Let N represent a given number of samples. The sample size required for each metric (slope, AUC/C-index, CIL) to satisfy Equations (4)–(6) is:

$${S}_{slope}=\underset{RWD<0.5,\ \left|BIAS\right|<5\%,\ COVP>95\%}{\mathrm{min}}N \tag{7}$$

$${S}_{AUC}=\underset{RWD<0.5,\ \left|BIAS\right|<5\%,\ COVP>95\%}{\mathrm{min}}N \tag{8}$$

$${S}_{CIL}=\underset{RWD<0.5,\ \left|BIAS\right|<5\%,\ COVP>95\%}{\mathrm{min}}N \tag{9}$$

Taking the maximum of the three per-metric sample sizes ensures that all metrics meet all criteria. Therefore, $S_{overall}$ represents the ideal sample size from SSAML:

$${S}_{overall}=\underset{i=slope,AUC,CIL}{\mathrm{max}}{S}_{i} \tag{10}$$
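The definitions of RWD, BIAS, and COVP above can be written as a few small functions. The following is a minimal sketch, not the SSAML implementation: the function names, the `Interval` type, and the example numbers are all hypothetical and chosen only to illustrate the formulas and the three thresholds.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (lower limit, upper limit)


def rwd(interval: Interval, true_value: float) -> float:
    """Relative width of a confidence interval: (upper - lower) / trueValue."""
    lower, upper = interval
    return (upper - lower) / true_value


def bias(estimate: float, true_value: float) -> float:
    """BIAS as defined in the text: (estimate - trueValue) / estimate."""
    return (estimate - true_value) / estimate


def covp(intervals: Sequence[Interval], true_value: float) -> float:
    """Coverage probability: fraction of intervals that contain trueValue."""
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return hits / len(intervals)


def meets_requirements(rwd_value: float, bias_value: float, covp_value: float) -> bool:
    """Check the three thresholds: RWD < 0.5, |BIAS| < 5%, COVP > 95%."""
    return rwd_value < 0.5 and abs(bias_value) < 0.05 and covp_value > 0.95


# Hypothetical numbers for illustration only (not taken from the study).
true_auc = 0.80
trial_intervals = [(0.70, 0.90), (0.75, 0.85), (0.81, 0.95), (0.72, 0.88)]

print(rwd(trial_intervals[0], true_auc))   # approximately 0.25
print(bias(0.82, true_auc))                # approximately 0.024
print(covp(trial_intervals, true_auc))     # 3 of 4 intervals contain 0.80 -> 0.75
```

In practice, the intervals passed to `covp` would come from many repeated trials at a fixed sample size, so that the fraction is a stable estimate of the coverage probability.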

## 3. Results

## 4. Discussion

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Rajkomar, A.; Dean, J.; Kohane, I. Machine Learning in Medicine. N. Engl. J. Med. **2019**, 380, 1347–1358.
2. Steyerberg, E.W.; Moons, K.G.M.; van der Windt, D.A.; Hayden, J.A.; Perel, P.; Schroter, S.; Riley, R.D.; Hemingway, H.; Altman, D.G.; The Progress Group. Guidelines and Guidance Prognosis Research Strategy (PROGRESS) 3: Prognostic Model Research. PLoS Med. **2013**, 10, e10011381.
3. Leisman, D.E.; Harhay, M.O.; Lederer, D.J.; Abramson, M.; Adjei, A.A.; Bakker, J.; Ballas, Z.K.; Barreiro, E.; Bell, S.C.; Bellomo, R.; et al. Development and Reporting of Prediction Models: Guidance for Authors from Editors of Respiratory, Sleep, and Critical Care Journals. Crit. Care Med. **2020**, 48, 623–633.
4. Collins, G.S.; Ogundimu, E.O.; Altman, D.G. Sample size considerations for the external validation of a multivariable prognostic model: A resampling study. Stat. Med. **2016**, 35, 214–226.
5. Riley, R.D.; Collins, G.S.; Ensor, J.; Archer, L.; Booth, S.; Mozumder, S.I.; Rutherford, M.J.; Smeden, M.; Lambert, P.C.; Snell, K.I.E. Minimum sample size calculations for external validation of a clinical prediction model with a time-to-event outcome. Stat. Med. **2022**, 41, 1280–1295.
6. Riley, R.D.; Debray, T.P.A.; Collins, G.S.; Archer, L.; Ensor, J.; Smeden, M.; Snell, K.I.E. Minimum sample size for external validation of a clinical prediction model with a binary outcome. Stat. Med. **2021**, 40, 4230–4251.
7. Archer, L.; Snell, K.I.E.; Ensor, J.; Hudda, M.T.; Collins, G.S.; Riley, R.D. Minimum sample size for external validation of a clinical prediction model with a continuous outcome. Stat. Med. **2021**, 40, 133–146.
8. Sun, H.; Paixao, L.; Oliva, J.T.; Goparaju, B.; Carvalho, D.; van Leeuwen, K.G.; Akeju, O.; Thomas, R.J.; Cash, S.S.; Bianchi, M.T.; et al. Brain age from the electroencephalogram of sleep. Neurobiol. Aging **2019**, 74, 112–120.
9. Quan, S.F.; Howard, B.V.; Iber, C.; Kiley, J.P.; Nieto, F.J.; O’Connor, G.T.; Rapoport, D.M.; Redline, S.; Robbins, J.; Samet, J.M.; et al. The Sleep Heart Health Study: Design, rationale, and methods. Sleep **1997**, 20, 1077–1085.
10. Paixao, L.; Sikka, P.; Sun, H.; Jain, A.; Hogan, J.; Thomas, R.; Westover, M.B. Excess brain age in the sleep electroencephalogram predicts reduced life expectancy. Neurobiol. Aging **2020**, 88, 150–155.
11. Sun, H.; Jain, A.; Leone, M.J.; AlAbsi, H.S.; Brenner, L.N.; Ye, E.; Ge, W.; Shao, Y.-P.; Boutros, C.L.; Wang, R.; et al. CoVA: An Acuity Score for Outpatient Screening that Predicts Coronavirus Disease 2019 Prognosis. J. Infect. Dis. **2021**, 223, 38–46.
12. Goldenholz, D.M.; Goldenholz, S.R.; Romero, J.; Moss, R.; Sun, H.; Westover, B. Development and Validation of Forecasting Next Reported Seizure Using e-Diaries. Ann. Neurol. **2020**, 88, 588–595.
13. Van Calster, B.; McLernon, D.J.; van Smeden, M.; Wynants, L.; Steyerberg, E.W.; Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative. Calibration: The Achilles heel of predictive analytics. BMC Med. **2019**, 17, 230.
14. Martin, M.A. On the Double Bootstrap. In Computing Science and Statistics; Page, C., LaPage, R., Eds.; Springer: New York, NY, USA, 1992; pp. 73–78.
15. Goldenholz, D.M.; Sun, H.; Ganglberger, W.; Westover, M.B. Sample Size Analysis for Machine Learning Clinical Validation Studies. medRxiv **2021**.

**Figure 1.** Narrowing confidence regions with increasing numbers of patients/events. Shown here are three example datasets, one per column: Brain Age Index (BAI) [8,10], COVID-19 risk Assessment (COVA) [11], and Seizure Tracker™ (ST) [12]. Each row shows one model performance metric: calibration slope (slope), area under the receiver operating characteristic curve or Harrell’s C-index (C-index), and calibration-in-the-large (CIL). In each subplot, the confidence interval narrowed as the number of patients or events (N) increased. The smallest N at which the desired performance level was reached represents the minimum adequately powered study.

**Figure 2.** We generated four simulated datasets (one per column) representing a binary classification problem with different numbers of features (10 and 100), positive vs. negative class imbalance ratios (1:1 and 1:10), and noise levels (label flipping rates of 10% and 20%). Each dataset was fit using logistic regression. The fitted data were then fed into the SSAML algorithm. The purpose was to explore the behavior of the confidence intervals of the calibration slope (top row), AUC (middle row), and CIL (bottom row) under known conditions. Several simple observations can be made: (1) the confidence intervals narrowed with increasing N in all conditions; (2) the calibration slope was sensitive to the different conditions; (3) increased noise or class imbalance decreased the AUC; (4) CIL was not sensitive to the different conditions tested; and (5) evaluating all three metrics (slope, AUC, and CIL) was important.
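The three knobs described in the Figure 2 caption (number of features, class imbalance ratio, and label flipping rate) can be illustrated with a small generator. This is a sketch under stated assumptions: the caption does not specify how the features were generated, so the Gaussian features with a class-dependent mean shift, the function name `simulate_dataset`, and all parameter names are hypothetical. The logistic regression fit and the SSAML step are not shown.

```python
import random
from typing import List, Tuple


def simulate_dataset(
    n_samples: int,
    n_features: int,
    neg_per_pos: float,
    flip_rate: float,
    seed: int = 0,
) -> Tuple[List[List[float]], List[int]]:
    """Simulate a binary classification dataset.

    neg_per_pos: negatives per positive (1.0 for 1:1, 10.0 for 1:10).
    flip_rate:   probability that an observed label is flipped (label noise).
    """
    rng = random.Random(seed)
    p_positive = 1.0 / (1.0 + neg_per_pos)
    features: List[List[float]] = []
    labels: List[int] = []
    for _ in range(n_samples):
        label = 1 if rng.random() < p_positive else 0
        # Assumption: positives are shifted by 0.5 in every feature.
        shift = 0.5 if label == 1 else 0.0
        features.append([rng.gauss(shift, 1.0) for _ in range(n_features)])
        # Label noise: flip the observed label with probability flip_rate.
        if rng.random() < flip_rate:
            label = 1 - label
        labels.append(label)
    return features, labels


# Example condition: 10 features, 1:10 imbalance, 10% label flipping.
X, y = simulate_dataset(n_samples=1000, n_features=10, neg_per_pos=10.0, flip_rate=0.10)
print(len(X), len(X[0]), sum(y) / len(y))
```

Note that label flipping changes the observed class balance as well as the noise level, so the observed positive fraction under 1:10 imbalance with 10% flipping is higher than 1/11.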

**Table 1.** Characteristics of the three example datasets used. BAI = Brain Age Index. COVA = COVID-19 risk study. ST = Seizure Tracker™. The number of patients and the number of events are listed for all three datasets; an event-based analysis was used only for COVA, whereas BAI and ST used a patient-based analysis.

Dataset | Machine Learning | # of Patients/# of Events | Disease Model | Outcome | Repeated Measures | Survival Analysis | Event-Based Analysis |
---|---|---|---|---|---|---|---|
BAI [8,10] | Transformed linear regression | 4070 patients, 3359 events | Aging | Estimate of brain age (used to forecast life expectancy) | N | Y | N |
COVA [11] | Ordinal regression | 2205 patients, 1479 events | COVID-19 | Risk of hospitalization, critical illness, or death | N | N | Y |
ST [12] | Deep learning | 1613 patients, 98,119 events | Epilepsy | Risk of seizure within 24 h | Y | N | N |

**Table 2.** SSAML result tables for each example dataset (BAI, COVA, ST). BAI = Brain Age Index [8,10], COVA = COVID-19 risk Assessment [11], ST = Seizure Tracker™ [12]. Highlighted in bold are numbers that satisfy the requirements: RWD < 0.5, |BIAS| < 0.05, and COVP > 0.95. The numbers of patients/events that satisfy the requirements for all categories are also highlighted in bold. Conf. int. = confidence interval, Slope = calibration slope, AUC = area under the receiver operating characteristic curve, C-index = Harrell’s C-index, CIL = calibration-in-the-large, RWD = relative width of confidence interval, BIAS = bias in estimate compared with “true” value, COVP = probability of confidence interval covering “true” value. Note: for the confidence interval used for ST with 20 patients, the actual value was 0.9999; however, due to rounding to three decimal places, it is listed as 1.000.

BAI (number of participants)

Measure | Metric | 500 | 1000 | **1500** | **2000** |
---|---|---|---|---|---|
Conf. int. | | 0.997 | 0.997 | 0.955 | 0.955 |
RWD | slope | 1.602 | 0.848 | **0.444** | **0.382** |
RWD | C-index | **0.157** | **0.110** | **0.061** | **0.052** |
RWD | CIL | **0.196** | **0.134** | **0.073** | **0.063** |
BIAS | slope | −0.053 | **−0.024** | **−0.008** | **−0.008** |
BIAS | C-index | **−0.001** | **−0.001** | **−0.002** | **−0.001** |
BIAS | CIL | **0.005** | **0.002** | **0.002** | **0.001** |
COVP | slope | **0.981** | **0.989** | **0.953** | **0.958** |
COVP | C-index | **0.998** | **0.993** | **0.955** | **0.958** |
COVP | CIL | **0.994** | **0.994** | **0.959** | **0.951** |

COVA (number of events)

Measure | Metric | 125 | **150** | **175** | **200** |
---|---|---|---|---|---|
Conf. int. | | 0.955 | 0.955 | 0.955 | 0.955 |
RWD | slope | 0.557 | **0.486** | **0.421** | **0.378** |
RWD | AUC | **0.149** | **0.132** | **0.116** | **0.104** |
RWD | CIL | **0.285** | **0.248** | **0.218** | **0.197** |
BIAS | slope | **−0.005** | **0.000** | **0.000** | **0.002** |
BIAS | AUC | **−0.001** | **−0.001** | **0.000** | **0.001** |
BIAS | CIL | **−0.011** | **−0.005** | **−0.004** | **−0.007** |
COVP | slope | **0.966** | **0.964** | **0.956** | **0.973** |
COVP | AUC | **0.968** | **0.979** | **0.972** | **0.974** |
COVP | CIL | **0.977** | **0.971** | **0.966** | **0.977** |

ST (number of patients)

Measure | Metric | 20 | **40** | **60** | **80** |
---|---|---|---|---|---|
Conf. int. | | 1.000 | 0.997 | 0.997 | 0.997 |
RWD | slope | 1.001 | **0.324** | **0.187** | **0.131** |
RWD | AUC | **0.378** | **0.205** | **0.168** | **0.146** |
RWD | CIL | **0.370** | **0.165** | **0.129** | **0.107** |
BIAS | slope | 0.076 | **0.022** | **0.010** | **0.007** |
BIAS | AUC | **0.046** | **0.023** | **0.015** | **0.011** |
BIAS | CIL | **−0.026** | **−0.010** | **−0.008** | **−0.005** |
COVP | slope | **0.990** | **0.996** | **0.996** | **0.991** |
COVP | AUC | **0.977** | **0.965** | **0.972** | **0.981** |
COVP | CIL | **0.997** | **0.992** | **0.996** | **0.997** |
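As a worked example of the selection rule in STEP (5), the sketch below applies the three thresholds to the COVA values reported in Table 2 and then takes the maximum of the per-metric minimum sample sizes. The data structure and function names are hypothetical, and the minimum is taken only over the four tabulated sample sizes, so it says nothing about untested values in between.

```python
from typing import Dict, Optional, Tuple

Row = Tuple[float, float, float]  # (RWD, BIAS, COVP) at one sample size

# COVA values copied from Table 2, keyed by metric and number of events.
COVA: Dict[str, Dict[int, Row]] = {
    "slope": {125: (0.557, -0.005, 0.966), 150: (0.486, 0.000, 0.964),
              175: (0.421, 0.000, 0.956), 200: (0.378, 0.002, 0.973)},
    "AUC":   {125: (0.149, -0.001, 0.968), 150: (0.132, -0.001, 0.979),
              175: (0.116, 0.000, 0.972), 200: (0.104, 0.001, 0.974)},
    "CIL":   {125: (0.285, -0.011, 0.977), 150: (0.248, -0.005, 0.971),
              175: (0.218, -0.004, 0.966), 200: (0.197, -0.007, 0.977)},
}


def passes(row: Row) -> bool:
    """Check RWD < 0.5, |BIAS| < 0.05, and COVP > 0.95."""
    rwd_value, bias_value, covp_value = row
    return rwd_value < 0.5 and abs(bias_value) < 0.05 and covp_value > 0.95


def min_sample_size(by_n: Dict[int, Row]) -> Optional[int]:
    """Smallest tabulated N meeting all three requirements, or None."""
    passing = [n for n, row in by_n.items() if passes(row)]
    return min(passing) if passing else None


def overall_sample_size(by_metric: Dict[str, Dict[int, Row]]) -> Optional[int]:
    """Maximum of the per-metric minimum sample sizes, or None if any metric fails."""
    sizes = [min_sample_size(by_n) for by_n in by_metric.values()]
    if any(size is None for size in sizes):
        return None
    return max(sizes)


per_metric = {name: min_sample_size(by_n) for name, by_n in COVA.items()}
print(per_metric)                  # {'slope': 150, 'AUC': 125, 'CIL': 125}
print(overall_sample_size(COVA))   # 150
```

Here the calibration slope is the limiting metric: AUC and CIL already meet the requirements at 125 events, but the slope RWD of 0.557 at 125 events exceeds the 0.5 threshold.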

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Goldenholz, D.M.; Sun, H.; Ganglberger, W.; Westover, M.B.
Sample Size Analysis for Machine Learning Clinical Validation Studies. *Biomedicines* **2023**, *11*, 685.
https://doi.org/10.3390/biomedicines11030685
