From Measurements to Patients: Data Aggregation in Supervised Classification of X-Ray Diffraction Datasets
Abstract
1. Introduction
2. Materials and Methods
2.1. Description of the Datasets
2.2. Aggregation Strategies
2.3. Classification Methods
- Measurement-level data from approximately 70% of patients, selected at random, were used to train the model. The measurements of the remaining ~30% of patients were used for testing. Splitting the training and testing sets by patient, rather than by measurement, ensured that no patient contributed to both sets and therefore prevented data leakage.
- The training data were used to optimize the hyperparameters of the classifiers (Logistic Regression and Random Forest). The optimization was performed over a grid of values using GridSearchCV with 5-fold cross-validation, with model performance scored by ROC-AUC. The cross-validation folds were grouped by patient ID.
- The predicted probabilities were calibrated with the Platt method using the training data. The optimal decision threshold was then determined from the calibrated probabilities.
- Subsequently, the performance metrics were computed, taking into account the data aggregation method.
- The entire procedure was repeated 100 times with different random_state values, which control the selection of patients for training. This yielded 100 ROC-AUC values and 100 balanced accuracy values for each aggregation method.
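
The steps listed above can be sketched with scikit-learn roughly as follows. This is a minimal illustration on synthetic data, not the authors' code: the toy data generator, the helper names (`make_toy_data`, `split_by_patient`, `mean_per_patient`, `run_once`), the hyperparameter grid, and the use of Youden's J to pick the threshold are all assumptions made for the example. Only Logistic Regression is shown, and the patient-level score is the mean of the calibrated probabilities.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, GroupKFold, GroupShuffleSplit


def make_toy_data(n_patients=100, n_features=5, seed=0):
    """Synthetic stand-in for the real data: 3-7 measurements per patient."""
    rng = np.random.default_rng(seed)
    patient_label = np.arange(n_patients) % 2           # balanced classes
    counts = rng.integers(3, 8, size=n_patients)        # measurements per patient
    groups = np.repeat(np.arange(n_patients), counts)   # patient ID of each measurement
    y = patient_label[groups]
    X = rng.normal(size=(groups.size, n_features)) + 0.8 * y[:, None]
    return X, y, groups


def split_by_patient(X, y, groups, random_state):
    """Assign ~70% of patients (not measurements) to training, the rest to testing."""
    splitter = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=random_state)
    return next(splitter.split(X, y, groups))


def mean_per_patient(values, groups):
    """Mean of a measurement-level quantity within each patient (sorted by patient ID)."""
    _, inverse = np.unique(groups, return_inverse=True)
    return np.bincount(inverse, weights=values) / np.bincount(inverse)


def run_once(X, y, groups, random_state):
    train, test = split_by_patient(X, y, groups, random_state)
    X_tr, y_tr, g_tr = X[train], y[train], groups[train]

    # Grid search with 5-fold CV grouped by patient, scored by ROC-AUC.
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
        scoring="roc_auc",
        cv=GroupKFold(n_splits=5),
    )
    search.fit(X_tr, y_tr, groups=g_tr)

    # Platt (sigmoid) calibration on the training data, with patient-grouped folds.
    folds = list(GroupKFold(n_splits=5).split(X_tr, y_tr, g_tr))
    calibrated = CalibratedClassifierCV(search.best_estimator_, method="sigmoid", cv=folds)
    calibrated.fit(X_tr, y_tr)

    # Decision threshold from the calibrated training probabilities.
    # Assumption: Youden's J (maximum of TPR - FPR); the source does not name the criterion.
    fpr, tpr, thresholds = roc_curve(y_tr, calibrated.predict_proba(X_tr)[:, 1])
    threshold = thresholds[np.argmax(tpr - fpr)]

    # Patient-level evaluation, here with mean-probability aggregation.
    p_patient = mean_per_patient(calibrated.predict_proba(X[test])[:, 1], groups[test])
    y_patient = mean_per_patient(y[test], groups[test]).round().astype(int)
    auc = roc_auc_score(y_patient, p_patient)
    ba = balanced_accuracy_score(y_patient, (p_patient >= threshold).astype(int))
    return auc, ba


if __name__ == "__main__":
    X, y, groups = make_toy_data()
    # The procedure above uses 100 repeats; 5 keep this sketch quick.
    scores = np.array([run_once(X, y, groups, random_state=seed) for seed in range(5)])
    print("mean ROC-AUC:", scores[:, 0].mean(), "mean BA:", scores[:, 1].mean())
```

Passing `groups` to both the outer split and the inner cross-validation is what keeps all measurements of a patient on the same side of every split; a plain `train_test_split` or `KFold` on measurements would not give that guarantee.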
3. Results
3.1. Human Breast Samples
3.1.1. Aggregation Before Modeling
3.1.2. Aggregation After Modeling
3.2. Canine Claw Samples
3.2.1. Aggregation Before Modeling
3.2.2. Aggregation After Modeling
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest









| | Breast Cancer Data, WAXS | Breast Cancer Data, SAXS | Canine Claw Data |
|---|---|---|---|
| Patients (no cancer/cancer) | 282 (123/159) | 282 (123/159) | 249 (149/100) |
| Samples | 564 | 564 | 920 |
| Measurements | 4067 | 4074 | 920 |
| Number of patients in the train set | 200 | 200 | 170 |
| Number of patients with a given number of measurements (measurements: patients) | 8: 40 | 8: 40 | 3: 77 |
| | 9: 14 | 9: 12 | 4: 171 |
| | 10: 21 | 10: 21 | 5: 1 |
| | 13: 23 | 13: 25 | |
| | 14: 50 | 14: 50 | |
| | | 17: 1 | |
| | 18: 134 | 18: 133 | |

| Abbreviation | Comments/Definitions |
|---|---|
| Waxs | A combination of mean, standard deviation, minimum, maximum, skewness, and kurtosis from WAXS data |
| Saxs | A combination of mean, standard deviation, minimum, maximum, skewness, and kurtosis from SAXS data |
| waxs_mean | Mean value of coefficients from WAXS data |
| waxs_std | Standard deviation for coefficients from WAXS data |
| waxs_min | Minimal value of coefficients from WAXS data |
| waxs_max | Maximal value of coefficients from WAXS data |
| waxs_median | Median for coefficients from WAXS data |
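
As an illustration of the per-patient statistics defined in the table above, the following sketch collapses measurement-level coefficients into one row per patient using pandas. The column names (`patient_id`, `coef_0`, `coef_1`), the function name, and the output naming scheme (for example `waxs_mean_coef_0`) are hypothetical choices for this example and are not taken from the source; only the five individually listed statistics are computed.

```python
import pandas as pd

STATS = ["mean", "std", "min", "max", "median"]


def aggregate_before_modeling(measurements, id_col="patient_id", prefix="waxs"):
    """Collapse measurement-level features into one row of statistics per patient."""
    features = measurements.drop(columns=[id_col])
    per_patient = features.groupby(measurements[id_col]).agg(STATS)
    # Flatten the (feature, statistic) column pairs into names such as "waxs_mean_coef_0".
    per_patient.columns = [f"{prefix}_{stat}_{feat}" for feat, stat in per_patient.columns]
    return per_patient


# Toy input: patient A has three measurements, patient B has two.
toy = pd.DataFrame(
    {
        "patient_id": ["A", "A", "A", "B", "B"],
        "coef_0": [1.0, 2.0, 3.0, 4.0, 6.0],
        "coef_1": [0.5, 0.5, 0.5, 1.0, 3.0],
    }
)
table = aggregate_before_modeling(toy)
# Feature set for a model trained on per-patient means only.
mean_features = table.filter(like="waxs_mean_")
print(mean_features)
```

The resulting table has one row per patient, so a classifier trained on it predicts at the patient level directly and no further aggregation of its outputs is needed.
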
| Abbreviation | Comments/Definitions |
|---|---|
| p_mean | Mean of probabilities per patient |
| p_median | Median of probabilities per patient |
| p_std | Standard deviation per patient |
| p_min | Min probability per patient |
| p_max | Max probability per patient |
| dist_mean | Mean distance from optimal decision threshold, $d_i$ |
| w_p_mean | Mean of weighted probabilities, $d_i p_i$ |
| c_w_p_mean | Mean of cubic-weighted probabilities, |
| logit_mean | Mean of $\ln(p_i/(1 - p_i))$ |
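
The probability-based aggregates defined in the table above can be computed for a single patient along the following lines. This is a sketch under stated assumptions: the distance to the threshold is taken as the absolute difference between each probability and the threshold, the standard deviation uses the population form, and probabilities are clipped before taking logits to avoid infinities. None of these choices is specified in the source. `c_w_p_mean` is left out because its formula is not given in the table. The function name `aggregate_probabilities` is hypothetical.

```python
import numpy as np


def aggregate_probabilities(p, threshold, eps=1e-6):
    """Patient-level aggregates of the measurement-level probabilities p of one patient."""
    p = np.asarray(p, dtype=float)
    d = np.abs(p - threshold)             # assumption: unsigned distance to the threshold
    p_safe = np.clip(p, eps, 1.0 - eps)   # assumption: clip so the logit stays finite
    return {
        "p_mean": p.mean(),
        "p_median": np.median(p),
        "p_std": p.std(),                 # assumption: population form (ddof=0)
        "p_min": p.min(),
        "p_max": p.max(),
        "dist_mean": d.mean(),
        "w_p_mean": (d * p).mean(),
        "logit_mean": np.log(p_safe / (1.0 - p_safe)).mean(),
    }


# Two measurements of one hypothetical patient, with a decision threshold of 0.5.
example = aggregate_probabilities([0.2, 0.8], threshold=0.5)
print(example)
```

Each aggregate yields one score per patient, which can then be compared with a threshold or passed to a ranking metric such as ROC-AUC.
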
| Data | Aggregation Approach | Best Aggregation Method (or Model) | Mean ROC-AUC | Mean BA |
|---|---|---|---|---|
| Biopsy WAXS | No aggregation | RF | 0.840 | 0.774 |
| Biopsy WAXS | 1 | Mean + RF | 0.894 | 0.854 |
| Biopsy WAXS | 2 | Logit_mean + RF | 0.907 | 0.863 |
| Biopsy WAXS | 3 | n = 5, RF | | 0.814 |
| Biopsy SAXS | No aggregation | LR | 0.825 | 0.758 |
| Biopsy SAXS | 1 | Not studied | | |
| Biopsy SAXS | 2 | C_w_p_mean + LR | 0.897 | 0.857 |
| Biopsy SAXS | 3 | n = 6, LR | | 0.806 |
| Canines’ nails | No aggregation | RF | 0.788 | 0.732 |
| Canines’ nails | 1 | Mean + RF | 0.844 | 0.788 |
| Canines’ nails | 2 | Logit_mean + RF | 0.853 | 0.804 |
| Canines’ nails | 3 | n = 2, LR | | 0.787 |

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Alekseev, A.; Rogers, K.; Mourokh, L.; Lazarev, P. From Measurements to Patients: Data Aggregation in Supervised Classification of X-Ray Diffraction Datasets. Int. J. Transl. Med. 2026, 6, 22. https://doi.org/10.3390/ijtm6020022

