# Dataset Growth in Medical Image Analysis Research


## Abstract


## 1. Introduction

## 2. Methods

## 3. Descriptive Results

## 4. Statistical Analysis and Prediction—Phase I (2011–2018 Data)

- For MRI, the median numbers of subjects for 2011–2014 (26) and 2015–2018 (55.5) were statistically significantly different, U = 19.115, p < 0.001.
- For CT, the median numbers of subjects for 2011–2014 (18.5) and 2015–2018 (36) were statistically significantly different, U = 8.311, p = 0.006.
- For fMRI, the median numbers of subjects for 2011–2014 (37) and 2015–2018 (77) were statistically significantly different, U = 10.493, p = 0.003.
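The comparisons above are rank-based two-sample tests. As a minimal sketch of the idea, the following computes the classical rank-sum form of the Mann-Whitney U statistic. The function names and the sample values are hypothetical, and the source does not state which software or which form of the statistic produced the reported values, so this sketch is not expected to reproduce them.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) computed from the rank sum of the pooled samples."""
    pooled = list(chain(sample_a, sample_b))
    ranks = average_ranks(pooled)
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical dataset sizes (number of subjects), NOT the MICCAI data.
early_period = [12, 20, 26, 31, 40]
late_period = [35, 48, 55, 56, 90]
u_early, u_late = mann_whitney_u(early_period, late_period)
print(u_early, u_late)  # prints: 1.0 24.0
```

A small `u_early` means that almost every size in the later period exceeds almost every size in the earlier period; a p-value would additionally require the null distribution of U, which is omitted here.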

For MRI, the year accounted for 6.2% (adjusted $R^2$) of the variance in the natural logarithm of dataset sizes. The year was statistically significant (B = 0.189, CI = (0.129, 0.249), p < 0.001), where B denotes the slope and CI is its confidence interval. Because the regression is linear in the natural logarithm of dataset size, this slope corresponds to growth of the predicted geometric mean Ĝ(N) by a factor of $e^{0.189}$ per year, that is, an annual growth rate of about 21% ($e^{0.189} - 1$). Figure 4 shows Ĝ(N) with its confidence interval for each of the years 2011–2019. The empirical geometric means, taken from Table 3, are shown (in green) for comparison. We predicted the geometric mean of MRI dataset sizes in MICCAI 2019 to be 87.5, with a confidence interval of (65.5, 116.9). The empirical geometric mean for 2019 (in purple) became available later, in Phase II of this research.

For CT, the year accounted for 8.4% (adjusted $R^2$) of the variance in the natural logarithm of dataset sizes. The year was statistically significant (B = 0.213, CI = (0.122, 0.305), p < 0.001).

For fMRI, the year accounted for 18.5% (adjusted $R^2$) of the variance in the natural logarithm of dataset sizes. The year was statistically significant (B = 0.271, CI = (0.168, 0.374), p < 0.001).
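Since the slopes B are estimated on the natural-logarithm scale, each one can be read as a yearly multiplicative growth factor $e^{B}$. The sketch below converts the Phase I slopes and their confidence-interval bounds into annual growth rates. The helper name `annual_growth` is an assumption made for illustration; only the numeric slopes and bounds come from the text.

```python
import math


def annual_growth(slope):
    """Convert a slope on the natural-log scale into a yearly growth rate."""
    return math.exp(slope) - 1.0


# (B, lower CI bound, upper CI bound) as reported for Phase I.
phase_1 = {
    "MRI": (0.189, 0.129, 0.249),
    "CT": (0.213, 0.122, 0.305),
    "fMRI": (0.271, 0.168, 0.374),
}

growth = {
    modality: tuple(annual_growth(b) for b in values)
    for modality, values in phase_1.items()
}

for modality, (rate, low, high) in growth.items():
    print(f"{modality}: {rate:.1%} per year (from CI bounds: {low:.1%} to {high:.1%})")
```

Transforming the interval bounds this way is valid because the exponential is monotonic, so the ordering of the bounds is preserved.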

The adjusted $R^2$ values in this section (6.2% for MRI, 8.4% for CT and 18.5% for fMRI) and in the next section require clarification. The regression tasks in this research are unusual because for each imaging modality and for each value of the independent variable (year), there are many (10–187, see Table 2) disparate values of the dependent variable (dataset size). The values themselves are radically different from each other, as dataset sizes encountered in MICCAI articles can be as small as one or as large as many thousands (where large external datasets are used). This implies a huge inherent variance of the dependent variable at each value of the independent variable. No single-valued hypothetical regression function, regardless of linearity or of any other property, can provide a single prediction at a specific value of the independent variable that simultaneously “explains” hugely different observations of the dependent variable at that point. This is the reason for the inevitably low adjusted $R^2$ values. The higher adjusted $R^2$ value for fMRI, compared to MRI and CT, follows from the scarcity of large external fMRI datasets, implying a smaller inherent variance that needs to be “explained”. Nevertheless, our models are statistically significant, and the predicted geometric means are close to the empirical ones where the latter are available.
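The argument can be illustrated with a small simulation: when many widely scattered values share each year, a least-squares fit can recover the trend while still explaining only a small fraction of the variance. The sketch below is an illustration on synthetic data only. The true slope, the within-year spread, the number of points per year and the helper `fit_line` are all assumptions, not values or code from this study.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r2), where r2 is the plain coefficient of
    determination.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_SLOPE = 0.2         # assumed yearly growth of ln(dataset size)
LOG_SPREAD = 1.5         # assumed within-year std. dev. of ln(dataset size)
POINTS_PER_YEAR = 60     # assumed number of synthetic articles per year

years, log_sizes = [], []
for year in range(2011, 2019):
    for _ in range(POINTS_PER_YEAR):
        years.append(year)
        noise = rng.gauss(0.0, LOG_SPREAD)
        log_sizes.append(3.0 + TRUE_SLOPE * (year - 2011) + noise)

slope, intercept, r2 = fit_line(years, log_sizes)
n = len(years)
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)  # one predictor

print(f"estimated slope = {slope:.3f} (true value {TRUE_SLOPE})")
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Under these assumptions the expected $R^2$ is below 10%, because the variance contributed by the trend is small compared with the within-year variance, yet the slope estimate stays near its true value.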

## 5. Statistical Analysis and Prediction—Phase II (2011–2019 Data)

For MRI, the year again explained only part of the variance in the natural logarithm of dataset sizes (adjusted $R^2$) and was statistically significant (B = 0.240, CI = (0.194, 0.286), p < 0.001), where B denotes the slope and CI is its confidence interval. The Phase II slope corresponds to growth of the predicted geometric mean Ĝ(N) by a factor of $e^{0.240}$ per year, that is, an annual growth rate of about 27% ($e^{0.240} - 1$). Figure 7 shows Ĝ(N) with its confidence interval for each of the years 2011–2021. The empirical geometric means for 2011–2019, taken from Table 3, are shown (in green) for comparison. We predict the geometric mean of MRI dataset sizes in MICCAI 2020 to be 147.2, with a confidence interval of (116.9, 185.5). We predict the geometric mean of MRI dataset sizes in MICCAI 2021 to be 187.1, with a confidence interval of (142.8, 245.2).
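As a quick consistency check, a model that is linear in the logarithm implies that predictions for consecutive years differ by the constant factor $e^{B}$. The sketch below compares that factor with the ratio of the two MRI predictions quoted above. The variable names are illustrative; the numbers are those reported in the text.

```python
import math

PHASE_2_MRI_SLOPE = 0.240               # B, as reported
PREDICTED = {2020: 147.2, 2021: 187.1}  # predicted geometric means, as reported

implied_factor = math.exp(PHASE_2_MRI_SLOPE)
observed_ratio = PREDICTED[2021] / PREDICTED[2020]

print(f"e^B         = {implied_factor:.4f}")
print(f"2021 / 2020 = {observed_ratio:.4f}")
```

The two values agree to about three decimal places, which is what rounding of the published slope and predictions would allow.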

For CT, the year again explained only part of the variance in the natural logarithm of dataset sizes (adjusted $R^2$) and was statistically significant (B = 0.266, CI = (0.200, 0.331), p < 0.001).

For fMRI, the year again explained only part of the variance in the natural logarithm of dataset sizes (adjusted $R^2$) and was statistically significant (B = 0.277, CI = (0.193, 0.361), p < 0.001).

## 6. Discussion



**Figure 1.** Median size of datasets used in MICCAI articles related to MRI in each of the years 2011–2019.

**Figure 2.** Median size of datasets used in MICCAI articles related to CT in each of the years 2011–2019.

**Figure 3.** Median size of datasets used in MICCAI articles related to fMRI in each of the years 2011–2019.

**Figure 4.** Phase I: Predicted geometric mean (black dots) and confidence intervals of dataset sizes in MICCAI articles involving MRI for the years 2011–2019, based on the whole ensemble of 2011–2018 MRI dataset sizes. The empirical geometric means (2011–2018) are shown (in green) for comparison (x marks). The empirical geometric mean for 2019 (in purple) became available later, in Phase II of this research.

**Figure 5.** Phase I: Predicted geometric mean (black dots) and confidence intervals of dataset sizes in MICCAI articles involving CT for the years 2011–2019, based on the whole ensemble of 2011–2018 CT dataset sizes. The empirical geometric means (2011–2018) are shown (in green) for comparison (x marks). The empirical geometric mean for 2019 (in purple) became available later, in Phase II of this research.

**Figure 6.** Phase I: Predicted geometric mean (black dots) and confidence intervals of dataset sizes in MICCAI articles involving fMRI for the years 2011–2019, based on the whole ensemble of 2011–2018 fMRI dataset sizes. The empirical geometric means (2011–2018) are shown (in green) for comparison (x marks). The empirical geometric mean for 2019 (in purple) became available later, in Phase II of this research.

**Figure 7.** Phase II: Predicted geometric mean (black dots) and confidence intervals of dataset sizes in MICCAI articles involving MRI for the years 2011–2021, based on the whole ensemble of 2011–2019 MRI dataset sizes. The empirical geometric means (2011–2019) are shown (in green) for comparison (x marks). The empirical geometric means for 2020 and 2021 are not known at the time of writing.

**Figure 8.** Phase II: Predicted geometric mean (black dots) and confidence intervals of dataset sizes in MICCAI articles involving CT for the years 2011–2021, based on the whole ensemble of 2011–2019 CT dataset sizes. The empirical geometric means (2011–2019) are shown (in green) for comparison (x marks). The empirical geometric means for 2020 and 2021 are not known at the time of writing.

**Figure 9.** Phase II: Predicted geometric mean (black dots) and confidence intervals of dataset sizes in MICCAI articles involving fMRI for the years 2011–2021, based on the whole ensemble of 2011–2019 fMRI dataset sizes. The empirical geometric means (2011–2019) are shown (in green) for comparison (x marks). The empirical geometric means for 2020 and 2021 are not known at the time of writing.

**Table 1.** Submitted papers, accepted papers and acceptance rates for MICCAI in each of the years 2011–2019.

| Year | Submitted Papers | Accepted Papers | Acceptance Rate |
|---|---|---|---|
| 2011 | 819 | 251 | 30% |
| 2012 | 781 | 252 | 32% |
| 2013 | 798 | 262 | 33% |
| 2014 | 862 | 253 | 29% |
| 2015 | 810 | 263 | 32% |
| 2016 | 756 | 228 | 30% |
| 2017 | 800 | 255 | 32% |
| 2018 | 1068 | 372 | 35% |
| 2019 | 1809 | 540 | 30% |
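The acceptance-rate column can be recomputed from the two count columns. The sketch below does so with one decimal place. The dictionary name and the helper are illustrative, and the rounding rule behind the tabulated percentages is not stated in the source, so a recomputed value may differ from the table by a percentage point.

```python
# (submitted, accepted) per year, copied from the table above.
PAPERS = {
    2011: (819, 251),
    2012: (781, 252),
    2013: (798, 262),
    2014: (862, 253),
    2015: (810, 263),
    2016: (756, 228),
    2017: (800, 255),
    2018: (1068, 372),
    2019: (1809, 540),
}


def acceptance_rate(submitted, accepted):
    """Return the acceptance rate in percent."""
    return 100.0 * accepted / submitted


rates = {year: acceptance_rate(*counts) for year, counts in PAPERS.items()}

for year, rate in rates.items():
    print(f"{year}: {rate:.1f}%")
```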

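**Table 2.** Number of dataset-size values (data points) available for each imaging modality in each of the years 2011–2019.

| Modality | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|
| MRI | 62 | 63 | 76 | 63 | 66 | 69 | 75 | 94 | 187 |
| CT | 36 | 24 | 20 | 40 | 23 | 14 | 30 | 36 | 97 |
| fMRI | 11 | 10 | 14 | 10 | 14 | 16 | 15 | 26 | 24 |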

**Table 3.** Average, geometric mean and median MRI dataset sizes (number of subjects) used in MICCAI articles in each of the years 2011–2019.

| MRI | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|
| average | 74.3 | 52.2 | 79.9 | 139.2 | 65.6 | 163.6 | 178.1 | 250.6 | 650.2 |
| geom. mean | 21.7 | 19.2 | 28.5 | 43.0 | 27.4 | 64.2 | 62.5 | 68.6 | 141.1 |
| median | 23 | 20 | 21 | 54 | 33 | 64 | 80 | 67 | 152 |
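The gap between the average and the other two summaries reflects how a few very large datasets dominate an arithmetic mean. As a minimal sketch, the following compares the three statistics on a hypothetical list of dataset sizes with one large outlier. The values are invented for illustration and are not taken from the MICCAI data.

```python
import math
import statistics


def geometric_mean(values):
    """Geometric mean, computed as exp of the mean natural logarithm."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical dataset sizes (number of subjects); one large external dataset.
sizes = [8, 15, 20, 25, 40, 60, 1500]

average = statistics.mean(sizes)
median = statistics.median(sizes)
geo_mean = geometric_mean(sizes)

print(f"average        = {average:.1f}")
print(f"geometric mean = {geo_mean:.1f}")
print(f"median         = {median}")
```

The geometric mean equals the exponential of the mean log size, which is why a regression on the natural logarithm of dataset size yields predictions of the geometric mean rather than of the average.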

**Table 4.** Average, geometric mean and median CT dataset sizes (number of subjects) used in MICCAI articles in each of the years 2011–2019.

| CT | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|
| average | 54.4 | 26.8 | 40.4 | 48.0 | 71.6 | 71.3 | 143.9 | 504.0 | 509.9 |
| geom. mean | 19.1 | 15.6 | 26.4 | 20.3 | 29.3 | 39.7 | 35.7 | 102.0 | 126.2 |
| median | 17 | 16 | 33 | 20 | 29 | 24 | 28 | 72 | 128 |

**Table 5.** Average, geometric mean and median fMRI dataset sizes (number of subjects) used in MICCAI articles in each of the years 2011–2019.

| fMRI | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|
| average | 21.3 | 31.9 | 32.3 | 67.5 | 86.6 | 111.7 | 151.6 | 264.4 | 316.5 |
| geom. mean | 17.3 | 27.9 | 25.2 | 65.1 | 59.8 | 93.7 | 68.0 | 131.6 | 180.5 |
| median | 15 | 29 | 25 | 64 | 53 | 86 | 46 | 191 | 174 |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kiryati, N.; Landau, Y.
Dataset Growth in Medical Image Analysis Research. *J. Imaging* **2021**, *7*, 155.
https://doi.org/10.3390/jimaging7080155
