Radiomics and Deep Features: Robust Classification of Brain Hemorrhages and Reproducibility Analysis Using a 3D Autoencoder Neural Network

This study evaluates the reproducibility of machine learning models that integrate radiomics and deep features (features extracted from a 3D autoencoder neural network) to classify various brain hemorrhages effectively. Using a dataset of 720 patients, we extracted 215 radiomics features (RFs) and 15,680 deep features (DFs) from CT brain images. With rigorous screening based on Intraclass Correlation Coefficient thresholds (>0.75), we identified 135 RFs and 1054 DFs for analysis. Feature selection techniques such as Boruta, Recursive Feature Elimination (RFE), XGBoost, and ExtraTreesClassifier were utilized alongside 11 classifiers, including AdaBoost, CatBoost, Decision Trees, LightGBM, Logistic Regression, Naive Bayes, Neural Networks, Random Forest, Support Vector Machines (SVM), and k-Nearest Neighbors (k-NN). Evaluation metrics included Area Under the Curve (AUC), Accuracy (ACC), Sensitivity (SEN), and F1-score. The model evaluation involved hyperparameter optimization, a 70:30 train–test split, and bootstrapping, further validated with the Wilcoxon signed-rank test and q-values. Notably, DFs showed higher accuracy. In the case of RFs, the Boruta + SVM combination emerged as the optimal model for AUC, ACC, and SEN, while XGBoost + Random Forest excelled in F1-score. Specifically, RFs achieved AUC, ACC, SEN, and F1-scores of 0.89, 0.85, 0.82, and 0.80, respectively. Among DFs, the ExtraTreesClassifier + Naive Bayes combination demonstrated remarkable performance, attaining an AUC of 0.96, ACC of 0.93, SEN of 0.92, and an F1-score of 0.92. Distinguished models in the RF category included SVM with Boruta, Logistic Regression with XGBoost, SVM with ExtraTreesClassifier, CatBoost with XGBoost, and Random Forest with XGBoost, each yielding significant q-values of 42. In the DFs realm, ExtraTreesClassifier + Naive Bayes, ExtraTreesClassifier + Random Forest, and Boruta + k-NN exhibited robustness, with 43, 43, and 41 significant q-values, respectively. This investigation underscores the potential of synergizing DFs with machine learning models to serve as valuable screening tools, thereby enhancing the interpretation of head CT scans for patients with brain hemorrhages.


Introduction
Brain hemorrhage is a critical and life-threatening medical condition that demands prompt and accurate evaluation and management.With a mortality rate of approximately 40%, early detection and precise classification of brain hemorrhage on non-contrast computed tomography (CT) scans are crucial for improving patient outcomes and minimizing neurological deficits [1][2][3].However, the growing number of CT scans received by medical facilities can often lead to delays in diagnosis due to limited access to specialized radiologists, especially in academic institutions [4,5].As a potential solution to address this challenge, an automatic notification system employing artificial intelligence (AI) methods has been proposed for efficient and timely brain hemorrhage detection [6][7][8][9][10].
An accurate and early distinction between different subtypes of brain hemorrhage is essential for appropriate patient management and treatment planning.However, visual inspection of brain CT images by human observers can be challenging and subject to interpretation variability.To overcome these limitations, radiomics and deep learning features have emerged as powerful quantitative imaging research methods.Radiomics allows for the non-invasive assessment of regional tissue heterogeneity at a millimeter scale, while deep learning techniques enable the extraction of complex and high-dimensional features from medical images.These advanced imaging approaches have shown promising applications in brain imaging and have the potential to revolutionize medical diagnosis and personalized treatment strategies [13][14][15][16].
In recent years, machine learning methods and radiomics analyses have been utilized to differentiate brain hemorrhage subtypes on CT images.This cutting-edge approach bridges the gap between conventional medical imaging practices and the emerging era of personalized medicine.Radiomics features (RF), extracted using various software packages, have been demonstrated to provide valuable insights into disease characterization and prediction.However, some RFs are sensitive to variations in image acquisition, reconstruction, and processing procedures, leading to potential inconsistencies and reduced reproducibility across studies.Ensuring robustness and reliability in feature extraction is essential to address these technical challenges and foster the widespread adoption of radiomics in clinical applications [17,18].Moreover, while radiomics frameworks have shown promising results in predicting and diagnosing brain hemorrhage subtypes, the reproducibility of these results remains a paramount concern.Numerous studies have been conducted in this field; however, discrepancies in reported outcomes and a lack of standardized methodologies hinder the seamless translation of radiomics into clinical practice.Therefore, addressing the reproducibility of radiomics frameworks is a crucial step toward establishing this innovative imaging approach as a reliable and indispensable tool in the clinical setting [15,[19][20][21][22][23][24].
The Intraclass Correlation Coefficient (ICC) was first introduced by Fisher in 1954 as a modification of the Pearson Correlation Coefficient.Currently, the ICC is determined using mean squares, which estimate population variances based on the variability within a given set of measurements, derived from analysis of variance.ICC is widely employed in conservative care medicine to evaluate interrater, test-retest, and intrarater reliability.Ensuring the reproducibility of radiomics, particularly the robustness of its measurements, is crucial for its application in clinical practice and medical research.The ICC, a widely accepted measure in various scientific fields, including radiology and medical imaging, offers several compelling advantages that justify its adoption.Historically, the ICC has been a reliable measure for assessing the agreement and consistency of measurements [23,24].Its use is well documented in radiomics, where precision and reliability are paramount.Moreover, ICC's capacity to account for various sources of variability, such as those from diverse imaging protocols, equipment, and patient populations, makes it a robust tool for radiomics reproducibility assessment [24][25][26][27][28][29][30][31].This aligns with the overarching goal of radiomics, which is to consistently and accurately derive quantitative information from medical images.Furthermore, ICC's adaptability and versatility make it a valuable asset in radiomics research [25].It transcends the constraints of measurement scales and data types, enabling its application to a wide range of radiomic features and datasets [26,28,[30][31][32][33][34].This adaptability is invaluable in handling the inherent diversity of radiomics data.Additionally, ICC produces interpretable results that facilitate meaningful comparisons and inference.Researchers and practitioners in the field are adept at interpreting ICC values, which allows for effective communication of the reliability of radiomics measurements.Overall, the ICC coefficient effectively reduces errors stemming from individual variability or the extraction methodology.
In light of these considerations, this study aims to investigate the identification of robust radiomics and deep features extracted from a 3D autoencoder neural network (DFs) in CT brain imaging using the ICC.By employing the ICC metric, we aim to quantify the agreement and consistency of feature extraction across image samples.Furthermore, we intend to apply selected RFs and DFs (with ICC values exceeding 0.75) extracted from each CT image to various state-of-the-art classifiers for the accurate prediction and classification of brain hemorrhage subtypes.To the best of our knowledge, this study represents the first comprehensive investigation into the reproducibility of RFs and DFs in CT imaging for classifying brain hemorrhages into six subtypes (SAH, EDH, CC, SAH, IPH, and IVH).

Research Participant Selection and CT Imaging Approach
A general diagram showing the major steps of the proposed methodology is presented in Figure 1.The experimental procedures undertaken in this retrospective study received approval from the institutional committee of Ahvaz Jondishapur University of Medical Sciences (AJUMS) University.It is worth noting that our study was conducted retrospectively, and the AJUMS University institutional committee waived the requirement for informed consent.All methods were carried out following relevant guidelines and regulations.
Through a meticulous and systematic exploration of our institutional data repository, our goal was to identify patients who exhibited either non-traumatic or traumatic cerebral hemorrhage, as evident in their CT scans.To ensure the consistency and reliability of our results, we established a set of exclusion criteria.These included (1) the presence of multiple hemorrhagic lesions, which might introduce complexities in the interpretation of results; (2) inadequate imaging quality or improper imaging settings, which could compromise the accuracy of our study; (3) undetermined origin of the ailment, since the certainty of etiology is crucial in developing a precise diagnostic model; and (4) hematomas detected at fewer or equal to three layers, as our focus is primarily on substantial intracranial hemorrhages.
The development of our AI diagnostic support models necessitated the use of a comprehensive and varied dataset.Hence, we integrated data from an array of patients presenting different types of hemorrhages.The distribution of the patient data, according to the type of hemorrhage, was carefully selected to reflect the varied occurrence rates in the general population, a critical factor in developing a broadly applicable diagnostic tool.The dataset included 120 patients showing signs of subdural hemorrhage (SDH) represented by 3840 slices (16.67%), 60 patients with epidural hematoma (EDH) represented by 1920 slices (8.33%), 180 patients with cerebral contusions (CC) represented by 5760 slices (25%), 120 patients with subarachnoid hemorrhage (SAH) represented by 3864 slices (16.67%), 150 patients with intraparenchymal hemorrhage (IPH) represented by 4804 slices (20.83%), and finally, 90 patients without intraventricular hemorrhage (IVH) represented by 2887 slices (12.5%).The dataset was meticulously compiled from four different institutions, each contributing imaging data acquired over a period spanning from August 2016 to December 2022.This deliberate inclusion of data from multiple centers ensures a substantial temporal spread and variability in patient populations, imaging protocols, and equipment.The choice of a multicenter dataset is motivated by the recognition that the reproducibility of radiomics and deep learning techniques is a critical aspect of their clinical utility.Imaging was conducted using a 64-channel CT scanner (LightSpeed VCT, GE Healthcare), with parameters meticulously selected to ensure high-quality scans.These included detector collimation set at 64 × 0.625 mm, rotation time of 0.5 s, pitch of 1.375, and tube voltage at 120 kV.The tube current was modulated between 250 and 400 mA.Of the total dataset employed for the development (720 patients), a significant proportion (690 patients or 95.83%) possessed a slice thickness of 5 mm, with the remaining slices varying between 3.0 mm and 10 mm.Through a meticulous and systematic exploration of our institutional data repository, our goal was to identify patients who exhibited either non-traumatic or traumatic cerebral hemorrhage, as evident in their CT scans.To ensure the consistency and reliability of our results, we established a set of exclusion criteria.These included (1) the presence of multiple hemorrhagic lesions, which might introduce complexities in the interpretation of results; (2) inadequate imaging quality or improper imaging settings, which could compromise the accuracy of our study; (3) undetermined origin of the ailment, since the certainty

Segmentation of Intracranial Hemorrhage
In pursuit of determining the volume of interest (VOI) for subsequent analysis, we delineated all the contiguous slices of the entire hemorrhage.Furthermore, our inclusion criteria specified that lesions should be visible in at least three slices.This criterion ensures that the radiomics and deep analysis are conducted on lesions with adequate spatial representation, aligning with the three-dimensional nature of our imaging data.Limiting our analysis to substantial intracranial hemorrhages observed in three or more layers enhances the robustness and reliability of our radiomics features extraction methodology.Contours were sketched marginally within the periphery of the hemorrhagic masses for SDH, EDH, CC, SAH, IPH, and IVH.These VOIs were delineated by two experienced medical professionals, with Reader 1 having ten years of experience in detecting hemorrhage in CT images and Reader 2 boasting five years of similar experience.Importantly, this segmentation process was repeated twice weekly to ensure that the same radiomic features were calculated at least two times, meeting the criteria for ICC calculation.The segmentation, conducted independently by Reader 1 and Reader 2, was performed with diligence and adherence to established protocols.This rigorous approach involved the assessment of the same set of images on two separate occasions during each week, allowing for the calculation of ICC to evaluate intraobserver reliability.The rationale behind involving two experienced radiologists was to enhance the robustness of the segmentation process and minimize potential bias.Both physicians were not privy to the clinical and pathological information of the patients.An illustration of this manual segmentation process is provided in Figure 2.

Radiomics and Deep Features (Features Extracted from a 3D Autoencoder) Ext
This research endeavor employed a comprehensive methodology for t

Radiomics and Deep Features (Features Extracted from a 3D Autoencoder) Extraction
This research endeavor employed a comprehensive methodology for the extraction of features, encompassing both radiomics-based features (RFs) and DFs.The RFs were obtained utilizing the standardized Software Environment for Radiomic Analysis (SERA), a framework renowned for its robustness and effectiveness in feature extraction from medical images.Additionally, DFs were extracted through a 3D autoencoder.
To extract deep features (DFs), a 3D autoencoder neural network architecture was meticulously implemented.An autoencoder consists of an encoder network, which maps input images to a latent representation or bottleneck, and a decoder network, which reconstructs the original images from the latent representation, as shown in Figure 3.For this study, the encoder network adhered to a standard convolutional architecture, consisting of three 3 × 3 convolutional layers, each followed by a leaky rectified linear unit (LeakyReLU) activation and a 2 × 2 max-pooling operation for parameter reduction.The decoder path comprised three 3 × 3 convolutional layers, each followed by a LeakyReLU activation and an up-sampling operation.

Assessment of Reproducibility, Feature Selection, and Classification
The reproducibility of the DFs and RFs was evaluated using the Intracla Coefficient (ICC) with carefully selected parameters.These parameters incl random effects, absolute agreement, and multiple raters/measurements. widely used reliability index for measuring agreement between continuou reproducible studies.It is a ratio ranging from 0 to 1. Based on the ICC va can be categorized as Poor (ICC < 0.5), Moderate (0.5 ≤ ICC < 0.75), Good (0. or Excellent (ICC ≥ 0.9).Features with an ICC value of 0.75 or greater were further analysis.To compute the ICC, an in-house-developed Python cod Additionally, Pearson's Correlation Coefficient was employed to identify r Training the proposed autoencoder involved the following steps and parameters:

•
Loss Function: The binary cross-entropy loss function was minimized to train the autoencoder.

•
Optimization Algorithm: The Adam optimization algorithm was used for its adaptive learning rate capabilities.

•
Learning Rate: The learning rate was carefully set to 0.001 to balance convergence speed and optimization stability.The reproducibility of the DFs and RFs was evaluated using the Intraclass Correlation Coefficient (ICC) with carefully selected parameters.These parameters included two-way random effects, absolute agreement, and multiple raters/measurements.The ICC is a widely used reliability index for measuring agreement between continuous variables in reproducible studies.It is a ratio ranging from 0 to 1. Based on the ICC value, reliability can be categorized as Poor (ICC < 0.5), Moderate (0.5 ≤ ICC < 0.75), Good (0.75 ≤ ICC < 0.9), or Excellent (ICC ≥ 0.9).Features with an ICC value of 0.75 or greater were considered for further analysis.To compute the ICC, an in-house-developed Python code was utilized.Additionally, Pearson's Correlation Coefficient was employed to identify redundant features (correlation coefficient > 0.9).Hence, we computed the average of features with high reliability and non-redundancy for further analysis.Next, we utilized Boruta, Recursive Feature Elimination (RFE), and XGBoost techniques, along with the ExtraTreesClassifier, to identify the most relevant features.Subsequently, we employed 11 classifiers: AdaBoost, CatBoost, Decision Trees, K-Means Clustering, LightGBM, Logistic Regression, Naive Bayes, Neural Networks, Random Forest, Support Vector Machines (SVM), and k-Nearest Neighbors (k-NN) for the classification of Intracranial Hemorrhage.Next, we conducted hyperparameter optimization using a train-test split approach with a ratio of 70:30.The grid search technique was applied to explore the optimal hyperparameters on the training dataset.This process aims to fine-tune the model's performance by systematically searching through a predefined hyperparameter space and identifying the combination that yields the best results.Subsequently, we executed 1000 bootstraps with a ratio of 0.9 on the test dataset.This comprehensive approach resulted in a total of 44 models for each of the six brain hemorrhage types, comprising four feature selection methods and eleven distinct machine learning models.The evaluation of these models was carried out using four well-established metrics, namely, Area Under the Curve (AUC), Accuracy (ACC), Sensitivity (SEN), and F1-score.The AUC values for all the models were subjected to a comparative analysis using the Wilcoxon signed-rank test, alongside the application of q-values to control for multiple testing.Multiple testing issues arise when conducting numerous statistical tests simultaneously, leading to an increased risk of obtaining false positives.To address this concern, we employed q-values as a robust tool in comparing different machine learning models.Q-values offer a controlled approach to the false discovery rate (FDR) and play a pivotal role in mitigating the impact of multiple comparisons.The method involves assigning a q-value to each observed p-value, representing the minimum FDR at which the corresponding test may be deemed significant.By utilizing q-values, we ensure a more stringent control over false positives, enabling a more accurate assessment of observed differences in model performance.This approach is particularly relevant in our study, where the comparison of machine learning models necessitates careful consideration of the potential for inflated significance levels due to multiple testing.A threshold of 0.05 was set to identify statistically significant results.The implementation of all the models and analyses was performed in R programming language version 4.0, utilizing the mlr library version 2.18 on the Ubuntu 18.04 operating system.

II. Moderate Reliability Category:
This category encompasses 37 features.Here, the distribution of features is unbalanced similar to the Poor category.RLM and IH remain prominent, with 11 and 6 features, respectively.Additionally, MORPH, CM, and SZM each have 4 features.STAT, DZM, and NGL have representation in this category, with 3, 2, and 2 features, respectively.However, some feature categories like NGT and LOC do not have any occurrences in this reliability category.

III. Good Reliability Category:
The Good category consists of 45 features.In this category, CM retains its significance with 8 features.Other features like IH, MORPH, STAT, and DZM are represented with 7, 7, 5, and 5 features, respectively.

IV. Excellent Reliability Category:
The Excellent category comprises 90 features.Here, we observe that CM stands out with a substantial presence of 30 features.MORPH, NGL, and RLM also remain significant with 10, 10, and 8 features, respectively, and DZM, IH, SZM, and STAT each have 7, 7, 6, and 5 features.Other features like LOC, IH, IVH, and NGT have limited representation.
Overall, the analysis presented in Table 2 allows us to gain valuable insights into the distribution and reliability of RFs across various feature categories.By categorizing the RFs based on ICC, we have provided a comprehensive overview of their performance, thereby facilitating a better understanding of their robustness and applicability for our study.
Furthermore, Figures 4 and 5 illustrate distinct distributions of features related to the defined reliability and feature categories in comparison to their ICC values, respectively.As depicted in Figures 4 and 5 and Table 2, it can be observed that 41.86% of the features, particularly those pertaining to NGL, NGT, and CM, fall within the excellent reliability category.Additionally, 20.94% of the features, notably DZM, LOC, and IH, exhibit good reliability.Meanwhile, 17.2% of the features, especially RLM, SZM, and IH, demonstrate moderate reliability.In contrast, a subgroup of features amounting to 20%, particularly those encompassing RLM, STAT, and MORPH, is categorized under poor reliability.In general, 62.79% of all features had a suitable ICC (good and excellent) and others excluded from study.
defined reliability and feature categories in comparison to their ICC values, respectively.As depicted in Figures 4 and 5 and Table 2, it can be observed that 41.86% of the features, particularly those pertaining to NGL, NGT, and CM, fall within the excellent reliability category.Additionally, 20.94% of the features, notably DZM, LOC, and IH, exhibit good reliability.Meanwhile, 17.2% of the features, especially RLM, SZM, and IH, demonstrate moderate reliability.In contrast, a subgroup of features amounting to 20%, particularly those encompassing RLM, STAT, and MORPH, is categorized under poor reliability.In general, 62.79% of all features had a suitable ICC (good and excellent) and others excluded from study.In the second phase of our study, we embarked on an exploration involving 135 dependable RFs, all of which demonstrated an ICC surpassing 0.75.These 135 features exhibited ICC values that exceeded 0.75, signifying a strong level of agreement both between different observers and within the same observer for these specific attributes.Employing a univariate correlation analysis, we unearthed that 112 radiomics features displayed sig- In the second phase of our study, we embarked on an exploration involving 135 dependable RFs, all of which demonstrated an ICC surpassing 0.75.These 135 features exhibited ICC values that exceeded 0.75, signifying a strong level of agreement both between different observers and within the same observer for these specific attributes.Employing a univariate correlation analysis, we unearthed that 112 radiomics features displayed significant variations across different types of brain hemorrhagic conditions.Subsequently, these features underwent a sequential assessment using Boruta, ExtraTreesClassifier, RFE, and XGBoost methodologies.This comprehensive approach led to the identification of the most crucial features (as depicted in Figures 6-9).Only confirmed features for which the importance was significantly larger than that of the shadow variables were chosen as the final selected features for constructing the radiomics model.A total of 69 radiomics features were selected to construct the radiomics model.

Classifiers Performance for RFs
The current study delves into the critical domain of differential diagnosis encompassing SDH, EDH, CC, SAH, IPH, and IVH.The investigation entails a meticulous evaluation employing an ensemble of eleven classifiers coupled with four distinct feature selection methodologies.The metrics under scrutiny include the AUC, ACC, SEN, and F1-Score.The visual representations in Figure 10 encapsulate the overarching performance metrics.It is elucidated that the best-performing classifier-feature selection combinations are unveiled for each of the metrics.For AUC, the combination of SVM with Boruta feature selection and SVM with ExtraTreesClassifier achieved an AUC value of 0.89.Similarly, for

Classifiers Performance for RFs
The current study delves into the critical domain of differential diagnosis encompassing SDH, EDH, CC, SAH, IPH, and IVH.The investigation entails a meticulous evaluation employing an ensemble of eleven classifiers coupled with four distinct feature selection methodologies.The metrics under scrutiny include the AUC, ACC, SEN, and F1-Score.The visual representations in Figure 10 encapsulate the overarching performance metrics.It is elucidated that the best-performing classifier-feature selection combinations are unveiled for each of the metrics.For AUC, the combination of SVM with Boruta feature selection and SVM with ExtraTreesClassifier achieved an AUC value of 0.89.Similarly, for ACC, SVM with Boruta and SVM with ExtraTreesClassifier reached an ACC of 0.85.The metric of SEN was maximized at 0.82, through the deployment of SVM with Boruta and SVM with ExtraTreesClassifier.However, for F1-Score, the combination of Random Forest + XGBoost and Naive Bayes + XGBoost achieved an F1-Score of 0.8.Moreover, the results encompass several noteworthy alternatives.Notably, for AUC, commendable outcomes were attained with SVM combined with XGBoost, Random Forest combined with XGBoost, Neural Networks combined with XGBoost, and Logistic Regression combined with XGBoost, all exhibiting an AUC of 0.88.The pivotal role of XGBoost in feature selection for AUC is evident from the results.In congruence, ACC exhibited promising results through Random Forest combined with XGBoost, Naive Bayes combined with XGBoost, and Logistic Regression combined with XGBoost, all registering an ACC of 0.82.Here again, XGBoost demonstrated its efficacy in feature selection.SEN values mirrored these trends, with Random Forest combined with XGBoost, Naive Bayes combined with XGBoost, and Logistic Regression combined with XGBoost yielding a sensitivity of 0.81.Subsequently, F1-Score indicated favorable outcomes through Logistic Regression combined with XGBoost, LightGBM combined with XGBoost, KNN combined with ExtraTreesClassifier, and KNN combined with Boruta, all culminating in an F1-Score of 0.79.

Analyzing Deep Features and Classifiers Performance
A comprehensive collection of 15,680 deep attributes was systematically derived from each brain CT scan image through the utilization of the 3D CNN.Notably, a subset of 1000 attributes exhibited ICCs surpassing the threshold of 0.75, signifying commendable levels of inter-observer consensus for these specific attributes.Employing a univariate correlation analysis, the deep attributes showcased noteworthy distinctions across various classifications of brain hemorrhagic conditions.In harmony with the approach taken for radiomics attributes, graphical representations serve as illuminating depictions of performance metrics associated with the differentiation of SDH, EDH, CC, SAH, IPH, and IVH.The focal point of this visual narrative is encapsulated within Figure 11, encapsulating the holistic performance metrics of the diagnostic endeavor.What is revealed is that the best combinations of classifier and feature selection methods have been identified for each specific measure.Specifically, in terms of the AUC, combining Naive Bayes with ExtraTree-sClassifier feature selection and Neural Networks with ExtraTreesClassifier resulted in a

Analyzing Deep Features and Classifiers Performance
A comprehensive collection of 15,680 deep attributes was systematically derived from each brain CT scan image through the utilization of the 3D CNN.Notably, a subset of 1000 attributes exhibited ICCs surpassing the threshold of 0.75, signifying commendable levels of inter-observer consensus for these specific attributes.Employing a univariate correlation analysis, the deep attributes showcased noteworthy distinctions across various classifications of brain hemorrhagic conditions.In harmony with the approach taken for radiomics attributes, graphical representations serve as illuminating depictions of performance metrics associated with the differentiation of SDH, EDH, CC, SAH, IPH, and IVH.The focal point of this visual narrative is encapsulated within Figure 11, encapsulating the holistic performance metrics of the diagnostic endeavor.What is revealed is that the best combinations of classifier and feature selection methods have been identified for each specific measure.Specifically, in terms of the AUC, combining Naive Bayes with ExtraTreesClassifier feature selection and Neural Networks with ExtraTreesClassifier resulted in a high AUC of 0.96.Similarly, when considering ACC, notable results were achieved by combining Naive Bayes with ExtraTreesClassifier, Logistic Regression with ExtraTreesClassifier, and Neural Networks with ExtraTreesClassifier, leading to an ACC of 0.93.The highest SEN was achieved at 0.93 through a strategic approach involving Naive Bayes with ExtraTreesClassifier, Logistic Regression with ExtraTreesClassifier, Neural Networks with ExtraTreesClassifier, and Neural Networks with Boruta.Likewise, the F1-Score metric reached a peak of 0.92, facilitated by coordinating Naive Bayes with ExtraTreesClassifier, Logistic Regression with ExtraTreesClassifier, Neural Networks with ExtraTreesClassifier, and Neural Networks with Boruta.

Wilcoxon Signed-Rank Test for RFs and DFs
The strength of the proposed models was further validated by subjecting them to the Wilcoxon signed-rank test.This statistical analysis enabled a thorough comparison of model performance across each of the endpoints in relation to the remaining 43 models.The insights obtained from this examination are depicted in Figures 12 and 13.The uniqueness of the models is highlighted through color-coded annotations indicating significant q-values.Among the RFs, noteworthy performers include SVM with Boruta, Logistic Regression with XGBoost, SVM with Extra TreesClassifier, CatBoost with XGBoost, and Random Forest with XGBoost.These models exhibited 42, 42, 41, 41, and 41 significant q-values, respectively, showcasing their robust discriminatory capabilities.However, in the case of DFs, remarkable performers comprise ExtraTreesClassifier + Naive Bayes, ExtraTreesClassifier + Random Forest, and Boruta + k-NN, each with 43, 43, and 41 significant q-values.The insights obtained from this examination are depicted in Figures 12 and 13.The uniqueness of the models is highlighted through color-coded annotations indicating significant q-values.Among the RFs, noteworthy performers include SVM with Boruta, Logistic Regression with XGBoost, SVM with Extra TreesClassifier, CatBoost with XGBoost, and Random Forest with XGBoost.These models exhibited 42, 42, 41, 41, and 41 significant q-values, respectively, showcasing their robust discriminatory capabilities.However, in the case of DFs, remarkable performers comprise ExtraTreesClassifier + Naive Bayes, Ex-traTreesClassifier + Random Forest, and Boruta + k-NN, each with 43, 43, and 41 significant q-values.

Discussion
This study employed a comprehensive and systematic approach to tackle the challenge of differentially diagnosing various types of brain hemorrhage.This was achieved through the utilization of RFs and DFs.The research methodology encompassed several key steps, including participant selection, CT imaging protocols, segmentation of intracranial hemorrhage, extraction of features, and classification using a diverse range of machine learning techniques.An essential focal point of the study involved evaluating the reliability of both RFs and DFs.The ICC was employed to categorize features based on their reliability, shedding light on their distribution across distinct categories.Notably, approximately 41.86% of these features exhibited excellent reliability, underscoring their potential importance within the diagnostic framework.To identify the most pertinent

Discussion
This study employed a comprehensive and systematic approach to tackle the challenge of differentially diagnosing various types of brain hemorrhage.This was achieved through the utilization of RFs and DFs.The research methodology encompassed several key steps, including participant selection, CT imaging protocols, segmentation of intracranial hemorrhage, extraction of features, and classification using a diverse range of machine learning techniques.An essential focal point of the study involved evaluating the reliability of both RFs and DFs.The ICC was employed to categorize features based on their reliability, shedding light on their distribution across distinct categories.Notably, approximately 41.86% of these features exhibited excellent reliability, underscoring their potential importance within the diagnostic framework.To identify the most pertinent features for classification, various feature selection techniques were applied, including Boruta, ExtraTreesClassifier, RFE, and XGBoost.The classification phase encompassed a variety of classifiers, and the ensuing performance metrics (AUC, ACC, SEN, and F1-Score) underscored the effectiveness of different classifier feature selection combinations.In a similar vein, DFs extracted using a 3D autoencoder underwent analogous classification procedures.The reliability of these DFs demonstrated parity with radiomics-based features and further highlighted their significance in the context of differentiating brain hemorrhage types.The fusion of classifiers and feature selection methodologies yielded impressive performance metrics, thereby showcasing the potential efficacy of deep features in achieving accurate classification.The corroborative Wilcoxon signed-rank test reinforced the robustness of the models, revealing substantial variations in model performance and bolstering the credibility of both radiomics-based and deep feature-based approaches.The general review (35)(36)(37)(38)(39)(40)(41)(42)(43) of other articles is shown in Table 3.  [8] systematic review brought together a compilation of studies involving AI algorithms for ICH detection and CMBs on medical images.Their comprehensive analysis revealed pooled estimates of AI tool performance, demonstrating remarkable sensitivity, specificity, and accuracy.Their results underscore the potential of AI-driven solutions to significantly enhance the diagnostic process for ICH and CMBs.While our study specifically investigates RFs and DFs reliability and classifier performance, our results parallel their findings by showcasing the high potential for AI-driven models to excel in diagnosing brain hemorrhage conditions.Seyam et al.'s [32] research focused on implementing an AI-based detection tool for ICH in a real-time clinical workflow.They demonstrated both the diagnostic accuracy and the potential workflow improvements that AI tools can offer.Although they encountered challenges in detecting specific subtypes of ICH, their work emphasizes the practical integration of AI algorithms into real-world clinical settings.Our study, in comparison, provides a multicenter dataset of patients with brain hemorrhage, the reliability of radiomics and deep features, and the performance of various classifiers.While our focus differs, the overarching theme of AI's potential to enhance clinical practice aligns with Seyam et al.'s findings.Lee et al.'s [33] work introduced a deep-learning algorithm for ICH detection, presenting an alternative approach to conventional CNNs.Their study emphasized diagnostic performance and its practical implications, highlighting the algorithm's potential to expedite diagnosis in emergency situations.While Lee et al.'s [33] work showcases an algorithmic approach, our study complements this by demonstrating the broader spectrum of analysis possible in the field of AI and medicine.In the study conducted by Angkurawaranon et al. [34], the focus lies on the critical realm of ICH stemming from traumatic brain injury (TBI).Their investigation underscores the urgency of timely radiological assessment and physician recognition in such scenarios.The advent of CT scanning, chosen as the investigation method of preference for TBI cases, gains prominence due to the scarcity of trained radiology personnel.Against this backdrop, Angkurawaranon et al. [34] delved into the potential of deep learning models as a promising avenue to expedite the generation of accurate radiology reports.Their study not only evaluates the diagnostic proficiency of a deep learning model but also contrasts its performance with the abilities of radiology, emergency medicine, and neurosurgery residents in detecting, localizing, and classifying traumatic ICHs.Remarkably, the deep learning model achieved a commendable level of accuracy at 0.89, surpassing the residents in sensitivity at 0.82.However, it is noteworthy that the specificity of 0.90 demonstrated room for improvement.Overall, the research highlights the potential utility of the deep learning model as a screening tool to support the interpretation of head CT scans in the context of traumatic brain injury patients.In comparing these findings with our study, both studies collectively emphasize the substantial promise of AI-driven tools in enhancing diagnostic accuracy and clinical decision-making for various brain hemorrhage conditions.
In the context of these studies, our research further expands the understanding of AI's potential in diagnosing brain hemorrhage conditions.By delving into various machine learning techniques and using RFs and DFs reliability, we provide a more holistic perspective on AI's applications.The robustness of our proposed models, as indicated by the Wilcoxon signed-rank test, adds to the mounting evidence of AI's proficiency in medical diagnosis.Collectively, our study's results converge with the findings of other authors, supporting the notion that AI-driven tools hold substantial promise in enhancing diagnostic accuracy, clinical workflows, and emergency care in the domain of brain hemorrhage detection.Our research contributes a distinct analytical dimension, which, when juxtaposed with these existing studies, portrays a more comprehensive landscape of AI's role in advancing medical practice.Our study advocates for a judicious approach, emphasizing the need for rigorous validation studies that span different clinical settings.These studies are crucial not only for gaining confidence in the robustness of radiomics and deep learning techniques but also for fostering their adaptability across a spectrum of real-world scenarios.

Conclusions
Our study primarily focused on assessing the reproducibility of a machine learning model incorporating RFs and DFs for the effective classification of various types of brain hemorrhages.The ICC was employed as a metric to gauge the agreement and consistency of feature extraction, aligning with the study's second aim.Specifically, our meticulous screening process, guided by ICC thresholds (>0.75), led to the identification of 135 RFs and 1054 DFs deemed robust for further analysis.This step directly addresses the aim to quantify the agreement and consistency of feature extraction across image samples, demonstrating our commitment to methodological rigor.In conclusion, the study's aims were systematically integrated into the research methodology, and the results presented in the conclusion section succinctly encapsulate the outcomes achieved in the context of investigating robust radiomics and deep features using the ICC.We believe that our approach not only aligns with the stated aims but also contributes valuable insights to the field of machine learning in the context of brain hemorrhage classification.
The outcomes of this study not only contribute to enhancing the accuracy of brain hemorrhage classification but also shed light on the intricate interplay between demographic characteristics, imaging features, and advanced artificial intelligence techniques in the medical domain.This research serves as a crucial stepping stone toward improved clinical decision-making and patient care, as well as a potential framework for further advancements in the fusion of artificial intelligence and medical diagnostics.

Figure 1 .
Figure 1.Flowchart of the proposed approach.

Figure 2 .
Figure 2. Visualization of manual segmentation for hemorrhagic types, including CC, EDH, IPH, IVH, SAH, and SDH, labeled as (a-f) respectively, with hemorrhages marked in red.

oengineering 2024 ,
11, x FOR PEER REVIEW decoder path comprised three 3 × 3 convolutional layers, each followed by activation and an up-sampling operation.Training the proposed autoencoder involved the following steps and p  Loss Function: The binary cross-entropy loss function was minimize autoencoder. Optimization Algorithm: The Adam optimization algorithm was use tive learning rate capabilities. Learning Rate: The learning rate was carefully set to 0.001 to balanc speed and optimization stability. Number of Epochs: The training process spanned 20 epochs. Batch Size: A batch size of 8 was employed during training. Dataset: The dataset utilized for training and validation included all av images, ensuring comprehensive representation.The data were rando training and testing sets with a 70:30 ratio.Finally, the bootstrapping utilized . Feature Extraction: A total of 15,680 features were extracted from the b of the 3D autoencoder model .

Figure 5 .
Figure 5.This box plot displays the Intraclass Correlation Coefficient (ICC) values for various reliability factors (RFs) related to every group of features.The X-axis represents the specific RFs, while the Y-axis indicates the corresponding ICC values.The abbreviations used are as follows: MORPH: morphology, LOC: location, STAT: statistic, IH: intensity histogram, IVH: intensity-volume histogram, CM: co-occurrence matrix, RLM: run length matrix, SZM: size zone matrix, DZM: distance zone matrix, NGT: neighboring gray tone, NGL: neighboring gray level.

Figure 5 .
Figure 5.This box plot displays the Intraclass Correlation Coefficient (ICC) values for various reliability factors (RFs) related to every group of features.The X-axis represents the specific RFs, while the Y-axis indicates the corresponding ICC values.The abbreviations used are as follows: MORPH: morphology, LOC: location, STAT: statistic, IH: intensity histogram, IVH: intensity-volume histogram, CM: co-occurrence matrix, RLM: run length matrix, SZM: size zone matrix, DZM: distance zone matrix, NGT: neighboring gray tone, NGL: neighboring gray level.

Bioengineering 2024 , 25 Figure 6 .
Figure 6.Boruta's feature selection method illustrating the significance of various features in the differential diagnosis of six types of brain hemorrhages: SDH, EDH, CC, SAH, IPH, and IVH.The three blue boxes represent the minimal, average, and maximal importance of the shadow attributes.Only confirmed features for which the importance was significantly larger than that of the shadow variables were chosen as the final selected features for constructing the radiomics model.A total of 69 radiomics features were selected to construct the radiomics model.

Figure 6 .
Figure 6.Boruta's feature selection method illustrating the significance of various features in the differential diagnosis of six types of brain hemorrhages: SDH, EDH, CC, SAH, IPH, and IVH.The three blue boxes represent the minimal, average, and maximal importance of the shadow attributes.Only confirmed features for which the importance was significantly larger than that of the shadow variables were chosen as the final selected features for constructing the radiomics model.A total of 69 radiomics features were selected to construct the radiomics model.

Figure 7 .
Figure 7.The ExtraTreesClassifier method showcasing the significance of various features in the differential diagnosis of six types of brain hemorrhage, namely, SDH, EDH, CC, SAH, IPH, and IVH.The ExtraTreesClassifier represents an ensemble tree-based machine learning approach that leverages randomization to mitigate variance and computational expenses, as compared to the Random Forest method.The importance values of features were ranked with respect to their role in the differential diagnosis of the six brain hemorrhage types.The figure highlights the most crucial features, which have been selected based on SHAP analysis.

Figure 7 . 25 Figure 8 .
Figure 7.The ExtraTreesClassifier method showcasing the significance of various features in the differential diagnosis of six types of brain hemorrhage, namely, SDH, EDH, CC, SAH, IPH, and IVH.The ExtraTreesClassifier represents an ensemble tree-based machine learning approach that leverages randomization to mitigate variance and computational expenses, as compared to the Random Forest method.The importance values of features were ranked with respect to their role in the differential diagnosis of the six brain hemorrhage types.The figure highlights the most crucial features, which have been selected based on SHAP analysis.

Figure 8 .
Figure 8.The RFE technique showcasing the significance of various features in the differential diagnosis of six types of brain hemorrhages, namely, subdural hematoma (SDH), epidural hematoma (EDH), cerebral contusion (CC), subarachnoid hemorrhage (SAH), intraventricular hemorrhage (IVH), and intraparenchymal hemorrhage (IPH).The selection process systematically eliminates less relevant features one by one until reaching the optimal number required for peak performance.The features were ranked based on their importance values in the context of differentiating among the six types of brain hemorrhages.The figure highlights the most crucial features selected using the SHAP (SHapley Additive exPlanations) method.

Figure 9 .
Figure 9.The importance of various features in the differential diagnosis of six types of brain hemorrhage, namely, SDH, EDH, CC, SAH, IPH, and IVH.In essence, this depiction showcases the scores that signify the utility and value of each individual feature during the creation of the boosted decision trees within the model.The ranking of features is determined based on their importance values concerning the differentiation of the six brain hemorrhage types.Notably, the figure presents the most crucial features selected through the utilization of SHAP (SHapley Additive exPlanations).

Figure 9 .
Figure 9.The importance of various features in the differential diagnosis of six types of brain hemorrhage, namely, SDH, EDH, CC, SAH, IPH, and IVH.In essence, this depiction showcases the scores that signify the utility and value of each individual feature during the creation of the boosted decision trees within the model.The ranking of features is determined based on their importance values concerning the differentiation of the six brain hemorrhage types.Notably, the figure presents the most crucial features selected through the utilization of SHAP (SHapley Additive exPlanations).

Bioengineering 2024 , 25 Figure 10 .
Figure 10.Illustration of the AUC, ACC, SEN, and F1-Score for differential diagnosis of SDH, EDH, CC, SAH, IPH, and IVH using 4 different feature selection methods and 11 classifiers based on RFs.

Figure 10 .
Figure 10.Illustration of the AUC, ACC, SEN, and F1-Score for differential diagnosis of SDH, EDH, CC, SAH, IPH, and IVH using 4 different feature selection methods and 11 classifiers based on RFs.

Bioengineering 2024 ,
11, x FOR PEER REVIEW 18 of 25 high AUC of 0.96.Similarly, when considering ACC, notable results were achieved by combining Naive Bayes with ExtraTreesClassifier, Logistic Regression with ExtraTree-sClassifier, and Neural Networks with ExtraTreesClassifier, leading to an ACC of 0.93.The highest SEN was achieved at 0.93 through a strategic approach involving Naive Bayes with ExtraTreesClassifier, Logistic Regression with ExtraTreesClassifier, Neural Networks with ExtraTreesClassifier, and Neural Networks with Boruta.Likewise, the F1-Score metric reached a peak of 0.92, facilitated by coordinating Naive Bayes with Ex-traTreesClassifier, Logistic Regression with ExtraTreesClassifier, Neural Networks with ExtraTreesClassifier, and Neural Networks with Boruta.

Figure 11 .
Figure 11.Illustration of the AUC, ACC, SEN, and F1-Score for differential diagnosis of SDH, EDH, CC, SAH, IPH, and IVH using 4 different feature selection methods and 11 classifiers based on DFs.3.4.Wilcoxon Signed-Rank Test for RFs and DFsThe strength of the proposed models was further validated by subjecting them to the Wilcoxon signed-rank test.This statistical analysis enabled a thorough comparison of model performance across each of the endpoints in relation to the remaining 43 models.

Figure 11 .
Figure 11.Illustration of the AUC, ACC, SEN, and F1-Score for differential diagnosis of SDH, EDH, CC, SAH, IPH, and IVH using 4 different feature selection methods and 11 classifiers based on DFs.

Figure 12 .
Figure 12.Wilcoxon signed-rank test is used to compare the AUC of RFs models compared with all other 43 models.(Q-value > 0.05 = red, q-value ≤ 0.05 = blue).

Figure 12 .
Figure 12.Wilcoxon signed-rank test is used to compare the AUC of RFs models compared with all other 43 models.(Q-value > 0.05 = red, q-value ≤ 0.05 = blue).

Figure 13 .
Figure 13.Wilcoxon signed-rank test is used to compare the AUC of DFs models compared with all other 43 models.(Q-value > 0.05 = red, q-value ≤ 0.05 = blue).

Figure 13 .
Figure 13.Wilcoxon signed-rank test is used to compare the AUC of DFs models compared with all other 43 models.(Q-value > 0.05 = red, q-value ≤ 0.05 = blue).

•
Dataset: The dataset utilized for training and validation included all available 3D CT images, ensuring comprehensive representation.The data were randomly split into training and testing sets with a 70:30 ratio.Finally, the bootstrapping technique was utilized.•FeatureExtraction: A total of 15,680 features were extracted from the bottleneck layer of the 3D autoencoder model.

Table 1 .
Comparison of demographic features and CT findings among various types of brain hemorrhage patients.
Table 2 presents a comprehensive breakdown of the various features associated with each reliability category.The rows of the table are labeled as Poor, Moderate, Good, and Excellent, each representing a specific reliability category.The columns, on the other hand, indicate the feature categories under examination.

Table 2 .
Distribution of features by reliability categories in RF analysis.

Table 3 .
Comparing the proposed method for the differential diagnosis of cerebral hemorrhages with other developed approaches.