Article

Deepening the Diagnosis: Detection of Midline Shift Using an Advanced Deep Learning Architecture

by Tuğrul Hakan Gençtürk 1, İsmail Kaya 2 and Fidan Kaya Gülağız 1,*
1 Department of Computer Engineering, Faculty of Engineering, Kocaeli University, İzmit 41001, Turkey
2 Department of Neurosurgery, Faculty of Medicine, Niğde Ömer Halisdemir University, Niğde 51240, Turkey
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(2), 890; https://doi.org/10.3390/app16020890
Submission received: 13 December 2025 / Revised: 6 January 2026 / Accepted: 13 January 2026 / Published: 15 January 2026
(This article belongs to the Special Issue Artificial Intelligence in Medicine and Healthcare—2nd Edition)

Abstract

Midline shift (MLS) is one of the conditions that strongly affects mortality and prognosis in critical neurological emergencies such as traumatic brain injury (TBI). In particular, an MLS exceeding 5 mm requires urgent diagnosis and treatment. Despite the widespread availability of tomography imaging, the shortage of radiologists capable of interpreting the images delays the diagnostic process. There is therefore a need for AI-supported diagnostic systems specifically tailored to MLS detection. However, the lack of open, disorder-specific datasets in the literature has limited research in the field and hindered comparisons against a reliable reference point; as a result, the current state of deep learning (DL) methods in this area has not been sufficiently assessed. Within the scope of this study, a DL architecture is proposed for MLS detection as a classification task, with millimeter-scale MLS measurements used for evaluation and stratified analysis. The study thereby also provides a comprehensive assessment of MLS detection with contemporary DL architectures. Furthermore, to address the lack of open datasets in the literature, two publicly available datasets originally collected with a primary focus on TBI have been annotated for MLS detection. The proposed model was tested on two different open datasets and, in detecting MLS above 5 mm, achieved mean sensitivity values of 0.9467–0.9600 on the Radiological Society of North America (RSNA) dataset and 0.8623–0.8984 on the CQ500 dataset across two different scenarios. It achieved mean Area Under the Curve-Receiver Operating Characteristic (AUC-ROC) values of 0.9219–0.9816 on the RSNA dataset and 0.9443–0.9690 on the CQ500 dataset. Since the study aims to detect not only emergency cases but also small MLSs for patient follow-up, the overall performance of the proposed model (MLS present/absent) was also calculated without an MLS quantity threshold. In this setting, mean F1 score values of 0.7403 (RSNA) and 0.7271 (CQ500) were obtained, along with mean AUC-ROC values of 0.8941 (RSNA) and 0.9301 (CQ500). The study presents a clinically applicable, optimized, fast, reliable, and up-to-date DL solution for the rapid diagnosis of MLS, intervention in emergencies, and monitoring of small MLSs. It also contributes to the literature by enabling a high level of reproducibility in the scientific community through labeled open data.

1. Introduction

Midline shift (MLS) is the displacement of the anatomical midline, which is defined by the boundaries of the falx cerebri formed by the folding of the dura mater surrounding the cerebral hemispheres, caused by the mass effect of an underlying pathology [1]. Generally, ≥5 mm MLS is considered clinically significant [2]. This finding is closely related to increased intracranial pressure and impaired cerebral perfusion.
Mechanisms leading to MLS include pathologies that cause volume increase, such as hematoma, contusion, diffuse cerebral edema, tumor, or infection. Displacement of midline structures may cause brainstem compression and transtentorial herniation [2,3]. This process is one of the fundamental pathophysiological mechanisms of neurological deterioration.
The MLS score is a strong prognostic indicator for mortality and functional outcomes in patients with traumatic brain injury (TBI) [3]. Increased MLS levels are associated with decreased consciousness, pupil abnormalities, motor deficits, and brainstem dysfunction. Furthermore, the literature has shown that MLS can also predict 6-month recovery scores [3].
Computed tomography (CT) is the gold standard for diagnosing MLS [4]. In addition, non-invasive assessment can be performed with transcranial sonography in intensive care, but it provides subjective images [4]. Although MRI provides more detailed tissue assessment, CT is preferred in emergency patients due to the need for immobility and the long acquisition time [4]. CT devices are now widely available in secondary and higher-level hospitals, placing substantial responsibility on emergency physicians, especially where specialists are lacking and patient care is largely provided by general practitioners. Because it is not feasible for non-radiologist clinicians to interpret all trauma-related CT scans, this gap is increasingly addressed through telemedicine-based remote radiology services.
Therefore, especially for medically significant MLSs over 5 mm, there is a need for alternative solutions due to the subjective and time-consuming nature of traditional methods, as well as the limited availability of experts for rapid diagnosis and consistent clinical assessment of MLS severity. At this point, the concepts of artificial intelligence (AI) and deep learning (DL), which can extract information from CT images, come into play.
According to the basic definition established in 1995, AI is defined as “the study of agents that receive percepts from the environment and perform actions” [5,6]. Although the concept first emerged in the mid-1900s, it has seen a significant increase in popularity recently. The driving force behind this popularity is DL, the cornerstone of the field; DL is a specific subfield of machine learning (ML), which is in turn a subset of AI. At the core of DL lies a key topic in ML: multi-layer artificial neural networks (ANNs). One definition that illustrates the relationship between AI, ML, and DL is as follows: “DL is a subset of ML based on ANN with representation learning” [7]. This definition clearly emphasizes the learning that lies at the core of DL. Unlike traditional ML methods, DL methods optimize the feature extraction process and can accept not only structured but also semi-structured and unstructured data as input. This innovation is particularly significant for the healthcare sector, which deals extensively with semi-structured and unstructured data. In addition to the tabular data used for diagnosis, imaging methods such as CT and magnetic resonance imaging (MRI) are also of great importance in healthcare, and with DL these data have become processable and interpretable by models. For this reason, AI applications based on medical imaging, particularly those designed to support physicians, have grown and advanced significantly in recent years. Neurosurgery is one of the medical fields where shortages of physicians and labeled data are particularly evident. In this field, both data scarcity and limited research stand out, especially for AI systems supporting MLS detection. The next section details the limited number of recent studies conducted on MLS detection.

1.1. Related Work

An examination of studies conducted from 2010 to the present reveals only a limited number of studies performing MLS diagnosis with AI support. Due to this scarcity, the literature review was conducted over a broad time span. Early studies were primarily image processing-based, while later work increasingly shifted toward feature-based ML and, subsequently, DL-based MLS detection and measurement approaches.
Image processing-based studies are divided into two distinct categories: symmetry-based and landmark-based. Symmetry-based methods focus on measuring the symmetrical disruption in midline structures caused by various pathologies, based on the brain’s symmetrical structure in a healthy state [8]. Wang et al. [9] proposed a new symmetry-based method for MLS detection. The study was conducted on a limited dataset consisting of CT scans from 41 patients. At the end of the study, it was stated that the proposed method was both fast and comparable to experts in terms of accuracy (accuracy with automated CT image calibration: 90.24%; accuracy with manual head rotation: 92.68%). In another symmetry-based study conducted by Liao et al. [10], MLS measurements were performed on CT scans of 81 patients with varying degrees of midline deformation. MLS measurements could be performed automatically in 65 patients (approximately 80% of the total number of patients). Ninety-five percent of the measurements were considered accurate by experts. The accuracy of the proposed method was stated to be clinically valid, but its high processing intensity was noted as a disadvantage. Landmark-based methods, on the other hand, aim to detect MLS by referencing points in the brain such as falx cerebri connection points, ventricle centers, the septum pellucidum (SP), and the pineal gland [8]. Liu et al. [11] stated that the method proposed in their study using this technique was more successful when the ventricles used as key points were unclear. The study was conducted using CT scans from 565 patients. However, it was noted that the method had the disadvantage of containing many parameters and being sensitive to errors in the detection of key points. In the study conducted by Xiao et al. [12], SP and the frontal horns of the lateral ventricle were used as reference points, and MLS measurements were performed on images from 80 patients.
Again, this study emphasized the importance of accurately identifying key points for the method. Image processing-based MLS detection and measurement studies have been found to be inadequate compared to current techniques due to anatomical diversity in the human brain and the fact that the studies have been conducted on a small number of patients.
For this reason, in recent years, the focus has been on ML and DL-based methods that will improve MLS detection and measurement. While ML-based methods work with ML algorithms applied to features obtained through image processing, DL-based methods focus more on processing by automatically extracting features directly from CT scans. Chen et al. [13] used regression methods in their study conducted on CT scans of 17 patients to measure the amount of MLS. They provided features obtained through image processing techniques, such as tissue density, lateral ventricle shape, and hematoma volume, as well as the patients’ clinical data as inputs for the method. The study clearly demonstrates the effect of feature selection and quantity on the success of ML algorithms. The study conducted by Chilamkurthy et al. [14] to detect abnormalities in head CT scans is one of the most comprehensive studies in this DL-based field, which is also of significant importance in terms of MLS detection. The study created two datasets named Qure25k (number of scans: 21,095, number of scans containing MLS: 666) and CQ500 (number of scans: 491, number of scans containing MLS: 65) from the collected data, and the CQ500 dataset was made publicly available in the literature. The study demonstrated that DL algorithms are effective in detecting abnormalities in head CT scans. Additionally, MLS detection was achieved with an Area Under the Curve-Receiver Operating Characteristic (AUC-ROC) of 0.93 on the Qure25k dataset and an AUC-ROC of 0.97 on the CQ500 dataset. Wu et al. [15] developed a DL-based framework using 3D CT images to determine the midline boundaries of the brain. The success of the study has been demonstrated both through a dataset compiled specifically for the study (519 CT scans) and through the CQ500 dataset (491 CT scans). In another study conducted by Wei et al. [16] to determine the midline in deformed brains, a fully convolutional neural network (CNN)-based method was proposed. 
The CQ500 dataset [14] was used for training in the study, while a closed dataset was used for testing. In the designed pipeline architecture, the MLS amount was quantitatively measured using a regression-based approach after midline detection. Yan et al. [17] proposed a method called Key-point R-CNN, which combines the Residual Network (ResNet) Feature Pyramid Network (FPN)-50 backbone and landmark-based detection for MLS detection and measurement. A dataset of 300 consecutive non-contrast CT scans was used. A sensitivity of 87.5% and a specificity of 96.7% were achieved, particularly in the detection of MLS larger than 5 mm. Nag et al. [18] proposed a DL-based method that can accurately measure MLS even when the key regions used in MLS detection are not evident. In a study conducted with scans from 45 patients, it was stated that the algorithm was able to track the deformed midline with low error values. In another study conducted by Xia et al. [19], the applicability of CNN to measure the amount and volume of MLS in non-contrast CT scans was evaluated. The study was conducted using CT scans from 140 patients. In patients grouped according to coma score, the AUC-ROC values obtained for measuring MLS quantity were 0.799 and 0.736. Agrawal et al. [20] proposed a 3D CNN architecture for detecting MLS in CT images of patients with TBI. In the study conducted using 176 CT scans, the proposed model achieved 55% accuracy, 70% specificity, and a moderate sensitivity of 40%. In another DL-based study by Wu et al. for MLS detection [21], mortality rate prediction was also performed by combining MLS and clinical information. The study used data from 10 healthy individuals and 69 patients with MLS for MLS detection and measurement. The proposed model performed the correct slice selection from CT scans with an accuracy of 0.966. It was also able to measure the MLS amount with a small error. An AUC-ROC value of 0.8 was obtained for patient mortality prediction.
Again, in one of the most recent studies, Nag et al. [22] proposed a lightweight DL architecture designed for MLS measurement. The study was trained and tested on a subset of the Radiological Society of North America (RSNA) 2019 hemorrhage dataset [23] (55,000 axial slices; 90% for training and 10% for validation, 5000 axial slices for testing). The measurement of the MLS amount was performed with an error of 0.09 mm (mean absolute error (MAE)). In addition to all these studies, there are also review studies that focus on or include MLS detection and measurement [8,24,25,26,27].
In recent years, DL-based studies in particular have made progress in MLS detection, but when the studies are evaluated as a whole, most are performed on limited amounts of data collected specifically for each study, and these datasets are closed to access. Therefore, one of the main challenges in the field is the lack of publicly accessible data. Furthermore, the few accessible open datasets lack ground truth values covering both MLS detection and MLS quantity measurement. For these reasons, studies are conducted with a limited number of patients, and the proposed methods cannot be compared under equal conditions.

1.2. Contributions of the Study

This study was primarily conducted to detect brain conditions requiring early diagnosis, such as those presenting with MLS, using current DL architectures. One of the main objectives of the study is to introduce a new problem-specific architecture designed to achieve robust cross-dataset generalization across heterogeneous data sources. Alongside the study, a medium-sized open dataset was labeled to address the lack of domain-specific labeled open data in the literature, making it suitable for both classification and quantitative measurement of MLS. This study comprehensively reviews the past literature on MLS detection and contributes the following original insights through its architectural and data labeling process:
  • The open-access RSNA Brain Hemorrhage dataset was used for training purposes. This dataset was created for hemorrhage detection and does not include labels for the presence/absence and amount of MLS. In this study, new quantitative MLS annotations were created on a subset of the RSNA dataset, including presence/absence labels and millimeter-scale measurements used solely for evaluation. This labeling provides comparability and a data source for future studies in the field.
  • A novel attention-augmented CNN architecture tailored to the problem has been proposed, aiming to rapidly detect MLSs above the critical threshold requiring urgent intervention, as well as to detect small MLSs that are important for disease monitoring.
  • The proposed method is designed to be faster and less resource-intensive than standard attention-augmented architectures, with a view to clinical integration.
  • To demonstrate the clinical applicability of the proposed model, its performance and generalization ability were also tested on an open dataset, CQ500.
  • In the CQ500 dataset, which only has presence/absence labeling, MLS quantities were annotated by an expert to enable a deeper assessment of MLS detection by presenting separate accuracy analyses for MLS above and below 5 mm.
  • The study was conducted using open datasets (RSNA and CQ500). Therefore, the proposed method and the results obtained are highly reproducible and verifiable by the scientific community.
The remainder of this paper is organized as follows: Section 2 describes the materials and methods, including the selection and preparation of the datasets, the proposed model, details of the model implementation, and the evaluation metrics. Section 3 presents the experimental results. Section 4 presents a detailed discussion, including interpretation of the results, comparison with state-of-the-art methods, clinical implications, limitations of the study, and recommendations for future studies. Section 5 concludes the paper.

2. Materials and Methods

This section of the study details the materials used and methods adopted in the development and validation process of the proposed MLS detection system.

2.1. Selection and Preparation of Datasets

There is one publicly available dataset in the literature for MLS detection. This dataset, named CQ500 [14], was collected by the Centre for Advanced Research in Imaging, Neuroscience and Genomics (CARING) in New Delhi, India, and contains a total of 491 CT scans in two batches, of which only 65 have a diagnosis of MLS. Data labeling was performed by consensus (majority vote) of three independent radiologists [14]. The dataset was compiled primarily to detect critical conditions in head CT scans; therefore, it includes scans involving hemorrhage, and the number of scans containing MLS is limited. The original source of this dataset, the Qure25k dataset, is a comprehensive large-scale dataset but is not publicly available. Due to its limited number of MLS scans, CQ500 has been used mostly for testing purposes in previous studies. The article published in 2018 on this dataset states that MLSs larger than 5 mm were considered positive [14]. In the original CQ500 dataset, MLS annotations are provided at the scan level without quantitative MLS measurements. Moreover, the scans within the dataset are shared in folders, and a single folder may contain different CT scans taken at different times that belong to the same patient. In our evaluation, all available scans were independently evaluated for MLS detection. As a result, although 65 unique patients were labeled as MLS positive in the original dataset, the inclusion of all relevant scans resulted in a total of 92 MLS-positive scans in our analysis, regardless of MLS size. The CQ500 dataset was used for testing purposes in our study. However, since it did not include MLS amounts as labels, MLS amounts were measured by a neurosurgeon using Digital Imaging and Communications in Medicine (DICOM) files. Thus, our study was able to report detection results for MLS quantities both above and below 5 mm.
Another open dataset included in the study is RSNA 2019 Brain CT Hemorrhage [23]. The dataset was created in collaboration between RSNA and the American Society of Neuroradiology (ASNR). It is a comprehensive dataset where the training data has been labeled by a single expert, while the test data has been labeled by consensus among three experts [23]. As the name suggests, the dataset was created for detecting hemorrhage in brain CT scans and does not contain any MLS-related labels or measurements. However, considering that MLS is thought to arise from conditions such as TBI, stroke, brain tumor, or hematoma, it is clear that the RSNA 2019 dataset may contain scans with varying amounts of MLS depending on the size of the hemorrhage. Therefore, 1344 scans were randomly selected from the dataset for this study. Since the dataset was pre-divided into train and test sets, 988 of these scans were randomly selected from the train folder and 356 from the test folder. Of the 988 scans in the train set, 405 scans contain MLSs of varying sizes, while the remaining scans do not contain MLSs. Of the 356 scans in the test set, 104 scans contain MLSs of varying sizes, while the remaining scans do not contain MLSs. The measurements for this subset of data, also created within RSNA, were performed via DICOM by an expert neurosurgeon for both classification labeling and evaluation of MLS quantity.
Table 1 details the CT scan and slice counts of the datasets used, as well as the number of scans included in the study. Scanner and institution information is also provided to highlight the heterogeneity of data sources and acquisition conditions, supporting the evaluation of cross-dataset generalization rather than reliance on dataset-specific characteristics. Table 2 shows sample images with and without MLS from both datasets. (The presence of MLS was defined on axial CT images based on a visually appreciable displacement of midline anatomical structures (e.g., the septum pellucidum or the third ventricle) from the ideal midline. In the absence of a measurable displacement, MLS was considered absent.) Within the scope of this study, two datasets were created that enable fair comparisons with the literature: one by updating CQ500 to include MLS amounts, and the other by labeling a medium-sized subset of the RSNA dataset for use in MLS detection. The accuracy of the proposed model was evaluated on these two datasets from different sources to demonstrate cross-dataset generalization under heterogeneous acquisition conditions. During the training phase, only the train sub-dataset created from the RSNA dataset was used. Both datasets consist of DICOM images; while MLS quantities were measured on the DICOM images, the images were converted to PNG for the training phase.
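The paper does not specify the intensity windowing applied in the DICOM-to-PNG conversion. As a hedged illustration only, the following sketch shows a typical mapping from Hounsfield units to 8-bit intensities; the brain-window parameters (center 40 HU, width 80 HU) are common defaults and an assumption here, not values taken from the study:

```python
def window_to_uint8(hu_values, center=40.0, width=80.0):
    """Map a flat list of Hounsfield-unit values to 0-255 intensities.
    center/width are assumed brain-window defaults (WL=40, WW=80)."""
    lo, hi = center - width / 2.0, center + width / 2.0
    out = []
    for hu in hu_values:
        clipped = min(max(hu, lo), hi)          # clip to the window
        out.append(int(round((clipped - lo) / (hi - lo) * 255)))
    return out
```

In a real pipeline the raw pixel data would first be rescaled to Hounsfield units using the DICOM `RescaleSlope`/`RescaleIntercept` attributes before windowing.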
All quantitative MLS annotations were performed manually by a neurosurgeon with 14 years of clinical experience. Measurements were conducted on original DICOM images using RadiAnt DICOM Viewer. For each scan, the axial slice containing the transition level from the lateral ventricles to the third ventricle—where midline displacement is most reliably assessed—was selected for measurement.
The ideal midline was defined by identifying the anatomical midline landmarks, primarily based on the falx cerebri and ventricular symmetry. The current (displaced) midline was determined at the level of the septum pellucidum. MLS magnitude was then measured as the perpendicular distance between the ideal midline and the displaced midline using the built-in measurement tools of the DICOM viewer and recorded in millimeters.
A single axial slice per scan was used to ensure consistency across annotations, acknowledging that the total number of slices may vary depending on slice thickness, patient head size, and acquisition parameters. Presence/absence labels were assigned based on visually appreciable midline displacement, while millimeter-scale values (e.g., 1 mm, 2 mm, 3 mm) were recorded to enable magnitude-stratified analysis.
Given the use of fixed anatomical reference points and objective geometric measurement criteria, inter-observer variability is expected to be minimal. Potential discrepancies primarily arise from reference point localization rather than measurement itself. To minimize such variability, all measurements were performed with millimetric precision, which is sufficient for clinical decision-making, particularly considering the widely accepted critical threshold of 5 mm MLS.
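The perpendicular-distance measurement described above can be sketched as a simple geometric computation. The landmark coordinates and the `pixel_spacing_mm` value below are hypothetical illustrations; in practice they come from the DICOM viewer and the scan's `PixelSpacing` header:

```python
import math

def mls_mm(midline_a, midline_b, septum, pixel_spacing_mm):
    """Perpendicular distance (mm) from the displaced septum pellucidum point
    to the ideal midline defined by two falx landmarks (pixel coordinates)."""
    (ax, ay), (bx, by), (px, py) = midline_a, midline_b, septum
    # Point-to-line distance: |cross(B - A, P - A)| / |B - A|
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    dist_px = abs(cross) / math.hypot(bx - ax, by - ay)
    return dist_px * pixel_spacing_mm
```

For example, with a vertical ideal midline through (256, 0) and (256, 512), a septum point at (266, 256), and 0.5 mm pixel spacing, the shift is 5.0 mm, exactly at the clinical threshold.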
To prevent data leakage and ensure unbiased evaluation, dataset splits were handled at the scan/study level following the original dataset structures. For the RSNA dataset, the original training and test partitions provided by the dataset creators were preserved. The derived training and test subsets used in this study were sampled exclusively from the original RSNA training and RSNA test sets, respectively; therefore, scans from the same patient were never mixed across training and test splits. The CQ500 dataset was used solely for external testing and was not involved in any stage of model training. Consequently, even though CQ500 may contain multiple scans from the same patient, these data are entirely independent of the RSNA training data, eliminating any possibility of cross-dataset leakage or metric inflation.
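The scan/study-level split discipline described above can also be enforced programmatically. The sketch below is a generic leakage check, not code from the study; the `patient_of` mapping is a stand-in for the dataset-specific grouping (for CQ500 it would be the folder name):

```python
def check_split_leakage(train_scans, test_scans, patient_of):
    """Raise if any patient/folder contributes scans to both splits.
    patient_of: hypothetical mapping from a scan ID to its patient ID."""
    shared = {patient_of(s) for s in train_scans} & {patient_of(s) for s in test_scans}
    if shared:
        raise ValueError(f"patients present in both splits: {sorted(shared)}")
```

Running such a check after sampling the RSNA subsets would mechanically confirm that no patient's scans straddle the train/test boundary.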

2.2. DL Architectures for MLS Detection

DL, especially CNN-based architectures and, more recently, transformer-based models, has achieved groundbreaking successes in medical image analysis in recent years [28,29]. The main feature distinguishing these approaches from traditional ML is their ability to learn hierarchical features automatically from raw image data. This capability becomes particularly valuable in situations where subtle anatomical differences need to be captured, such as MLS detection.
In this study, multiple model families representing different architectural approaches to MLS detection were systematically evaluated. Architectural depth, computational efficiency, number of parameters, and global context capturing capability were considered as selection criteria. The ResNet [30] family was adopted as the reference model, as it overcomes fundamental challenges in deep network training with residual connections. The EfficientNet [31] family was included as a modern CNN representative that optimizes the efficiency-performance balance with a compound scaling strategy. Swin Transformer [32] was included as a transformer-based approach capable of capturing both local and global context with attention mechanisms. In addition, MobileNetV3 [33] was incorporated as a lightweight CNN architecture emphasizing computational efficiency and low memory footprint, while MaxViT [34] was included as a hybrid CNN–Transformer model combining convolutional inductive biases with global self-attention mechanisms. In the subsections below, the theoretical foundations of these architectures and their suitability for MLS detection are discussed in detail.

2.2.1. ResNet

ResNet is a revolutionary architecture proposed by He et al. in 2016 that aims to overcome fundamental difficulties experienced in deep network training [30]. In traditional deep networks, as the number of layers increases, vanishing and exploding gradient problems emerge, preventing the network from being trained effectively. ResNet solves this problem with the “residual learning” approach and “skip connections”.
As shown in Figure 1, in the residual learning approach, a layer block learns the residual function defined as F(x) = H(x) − x instead of learning the H(x) function directly. In the residual block structure detailed inside the yellow box in the figure, the input data passes through the convolution layer (Conv 3 × 3), batch normalization, and Rectified Linear Unit (ReLU) activation, respectively. It later passes through a second convolution and batch normalization layer. The output on this main path is added to the original input via the skip connection shown with the blue dashed line in the figure. This addition process is expressed with the “+” symbol in the figure, and the final output of the block is calculated with the formula H(x) = F(x) + x. Finally, this sum is passed through ReLU activation and transferred to the next block.
On the right side of Figure 1, the general ResNet architecture is shown. The architecture starts with a 7 × 7 initial convolution layer and maximum pooling, then passes through consecutive residual blocks, and ends with global average pooling and a fully connected classifier layer. In this study, models at two different depths from the ResNet family were evaluated: ResNet-18 contains eight residual blocks, while ResNet-50 has a deeper structure with sixteen bottleneck blocks. In terms of parameter counts, ResNet-18 contains approximately 11.2 million and ResNet-50 approximately 23.5 million parameters. Both models were evaluated at an input size of 512 × 512 pixels and require computational costs of approximately 9.5 and 21.6 giga floating-point operations (GFLOPs), respectively.
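The identity H(x) = F(x) + x can be illustrated with a minimal, framework-free sketch. Element-wise vectors stand in for feature maps here; this is a conceptual illustration, not the paper's implementation:

```python
def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, x) for x in v]

def residual_block(x, f):
    """y = ReLU(F(x) + x): the block learns the residual F, and the skip
    connection adds the input back before the final activation."""
    fx = f(x)
    return relu([a + b for a, b in zip(fx, x)])
```

A useful consequence: if F learns the zero mapping, the block reduces to the identity for non-negative inputs, which is what makes very deep stacks trainable.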
ResNet architectures are widely used in many fields including medical imaging due to their success on ImageNet. However, specific to MLS detection, the lack of channel-based attention mechanisms in these architectures and their inability to establish the local-global context balance specifically in the detection of small anatomical MLSs are evaluated as potential limitations.

2.2.2. EfficientNet Family

EfficientNet is a modern CNN architecture proposed by Tan and Le in 2019 that brings a systematic approach to neural network scaling [31]. While traditional approaches increase network depth, width, or input resolution independently, EfficientNet presents a “compound scaling” strategy that scales these three dimensions in a balanced way.
In Figure 2, the basic components of the EfficientNet architecture are shown. The building block of the architecture, the MBConv (Mobile Inverted Bottleneck Convolution) block, is detailed inside the yellow box on the left side of the figure. This block starts with a depthwise separable convolution layer providing computational efficiency, then passes through the Squeeze-and-Excitation (SE) block highlighted with blue in the figure. The SE block provides the prominence of important features by applying a channel-based attention mechanism. Finally, the number of channels is adjusted with pointwise convolution (1 × 1 conv).
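The squeeze-excite-rescale sequence of the SE block can be sketched in plain Python, with per-channel feature maps as flat lists. The weight matrices are illustrative, and the bottleneck reduction ratio of the real block is omitted for brevity:

```python
import math

def se_block(channels, w1, w2):
    """Squeeze-and-Excitation sketch. channels: list of C flat feature maps;
    w1, w2: illustrative C x C weight matrices (real blocks use a reduced
    bottleneck dimension between the two FC layers)."""
    # Squeeze: global average pooling, one scalar per channel
    z = [sum(c) / len(c) for c in channels]
    # Excitation: FC -> ReLU -> FC -> sigmoid
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h)))) for row in w2]
    # Rescale: weight each channel map by its attention score
    return [[sc * v for v in c] for sc, c in zip(s, channels)]
```

Note that the score s is applied uniformly across each channel's spatial positions, which is exactly why the SE block provides channel attention but no spatial attention.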
The compound scaling strategy shown in the middle part of the figure constitutes the main innovation of EfficientNet. In this approach, network depth (d), width (w), and input resolution (r) are scaled simultaneously using a single compound coefficient φ with the formulas d = αᵠ, w = βᵠ, r = γᵠ. This balanced scaling provides a better performance-efficiency balance compared to single dimensional increments.
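For the formulas above, the original EfficientNet paper fixes α = 1.2, β = 1.1, γ = 1.15 (found by a small grid search under the constraint α·β²·γ² ≈ 2), so the depth, width, and resolution multipliers for a given φ can be computed directly:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """EfficientNet compound scaling: the depth (d), width (w) and
    resolution (r) multipliers grow jointly with one coefficient phi.
    Defaults are the coefficients reported for the B0 baseline."""
    return alpha ** phi, beta ** phi, gamma ** phi
```

For example, φ = 0 recovers the B0 baseline (all multipliers 1.0), while larger φ values yield the B1-B7 variants with jointly increased depth, width, and input resolution.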
The EfficientNet family has significant advantages for MLS detection. The channel attention mechanism in SE blocks provides feature selection that can be useful in distinguishing anatomical structures. Also, its low computational cost offers a critical advantage for real-time applications in clinical environments. However, because the SE block provides only channel-based attention and lacks a spatial attention mechanism, it creates a potential limitation in tasks where spatial location information is critical, such as MLS detection.

2.2.3. Swin Transformer

Swin Transformer is a hierarchical transformer architecture proposed in 2021 [32]. It addresses the computational complexity problem of the Vision Transformer (ViT) architecture [32]. Unlike the original ViT, Swin Transformer builds multi-scale representations like CNNs by generating hierarchical feature maps, and its computational complexity scales linearly with image size thanks to the shifted window mechanism.
In Figure 3, the basic components of the Swin Transformer architecture are shown. As seen on the left side of the figure, the input image is firstly divided into 4 × 4-pixel patches, and these patches are converted into tokens with linear projection. Then, Swin Transformer blocks detailed inside the green box in the figure are applied. Each block consists of two consecutive attention mechanisms: Window Multi-head Self-Attention (W-MSA) calculates attention inside local windows, while Shifted Window MSA (SW-MSA) provides information flow between neighboring windows by shifting windows by half the window size. This shifted window strategy performs the transition from local attention to global context efficiently.
The hierarchical structure shown on the right side of the figure allows Swin Transformer to produce multi-scale feature maps similar to CNN architectures. In each stage, while spatial resolution is reduced by half with the patch merging process, the number of channels is doubled. In this way, feature maps starting with H/4 × W/4 resolution in Stage 1 reach H/32 × W/32 resolution in Stage 4.
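The resolution/channel schedule described above can be sketched as a small helper; the embedding dimension of 96 corresponds to the Swin-T variant and is used here purely for illustration.

```python
def swin_stage_shapes(h: int, w: int, embed_dim: int = 96, stages: int = 4):
    """Feature-map shape (height, width, channels) per Swin stage: 4x4 patch
    embedding first, then patch merging halves resolution and doubles channels
    at each stage boundary."""
    shapes = []
    ph, pw, c = h // 4, w // 4, embed_dim  # Stage 1 starts at H/4 x W/4
    for _ in range(stages):
        shapes.append((ph, pw, c))
        ph, pw, c = ph // 2, pw // 2, c * 2
    return shapes

# For a 224x224 input this runs from (56, 56, 96) down to (7, 7, 768),
# i.e., H/4 x W/4 in Stage 1 to H/32 x W/32 in Stage 4.
stage_shapes = swin_stage_shapes(224, 224)
```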
Swin Transformer was evaluated as a strong candidate for MLS detection owing to its capacity to capture global context and its ability to model spatial relationships. However, its computational cost, emphasized with the red box in the figure, is a significant disadvantage.

2.2.4. MobileNetV3

MobileNetV3 is a lightweight CNN architecture optimized for mobile devices, proposed by Howard et al. in 2019 [33]. This architecture aims to achieve high accuracy while prioritizing computational efficiency and low latency requirements. MobileNetV3 was developed by combining the strengths of previous MobileNet versions with neural architecture search (NAS) and NetAdapt algorithms.
In Figure 4, the fundamental components of the MobileNetV3 architecture are shown. The Inverted Residual Block, the building block of the architecture, is detailed in the yellow box on the left side of the figure. This block follows a narrow → wide → narrow channel expansion strategy: it first increases the number of channels by the expansion factor using 1 × 1 convolution, then applies depthwise separable convolution to reduce computational cost. The SE block, shown in the section highlighted with blue in the figure, provides a channel-wise attention mechanism. Finally, the number of channels is reduced with 1 × 1 pointwise convolution, and the result is added to the input via a skip connection.
One of the important innovations of MobileNetV3 is the activation function optimization shown in the middle part of the figure. In the later layers of the network, the computationally expensive swish activation is replaced by the hard-swish (h-swish) function, which offers similar performance with improved efficiency. This activation function can be computed faster while preserving the nonlinear properties of the swish function.
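The h-swish function replaces swish's sigmoid with a piecewise-linear ReLU6 approximation; a minimal scalar sketch:

```python
def hard_swish(x: float) -> float:
    """h-swish(x) = x * ReLU6(x + 3) / 6, the piecewise-linear approximation
    of swish used in the later layers of MobileNetV3 (Howard et al., 2019)."""
    relu6 = min(max(x + 3.0, 0.0), 6.0)  # ReLU6 clamps to [0, 6]
    return x * relu6 / 6.0
```

For example, h-swish(3) = 3 (the saturated region acts as the identity) and h-swish(−3) = 0 (fully gated), closely tracking swish while avoiding the exponential inside the sigmoid.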
On the right side of the figure, the general MobileNetV3 architecture is shown. The architecture starts with 3 × 3 convolution and then passes through stages containing multiple Inverted Residual Blocks. At each stage, the spatial resolution is halved while the number of channels increases. The final stage contains global average pooling and a fully connected classifier layer. The MobileNetV3 family offers important advantages for MLS detection. Thanks to low parameter count and computational cost, real-time inference can be performed in clinical settings. The channel attention mechanism in SE blocks provides useful feature selection for distinguishing anatomical structures. However, the absence of a spatial attention mechanism and relatively shallow feature representation may create potential limitations in the detection of small MLSs.

2.2.5. MaxViT

MaxViT is a hybrid architecture proposed by Tu et al. in 2022 that combines the multi-axis attention mechanism with CNN’s local feature extraction capacity [34]. This architecture stands out with its ability to efficiently capture both local and global context.
In Figure 5, the fundamental components of the MaxViT architecture are shown. The MaxViT block, which forms the core of the architecture, is detailed in the green box on the left side of the figure. This block consists of four main components: MBConv, Block Attention, Grid Attention, and feed-forward network (FFN). This sequential structure combines local and global information within a single block.
The MBConv component contains depthwise separable convolution and SE attention mechanism as in MobileNetV3. This layer provides strong local feature extraction and brings CNN’s inductive biases (locality, translation equivariance) to the architecture.
In the middle part of the figure, the Multi-Axis Attention mechanism is detailed. Block Attention divides the input feature map into small blocks (for example, 7 × 7 pixels) and applies full self-attention within each block. This approach captures fine details in the local context. Grid Attention uses the dilated sampling strategy highlighted in red in the figure. Points are selected at fixed intervals across the image to form a global grid, and attention is calculated between these points. This approach enables obtaining global context with O(n) linear computational complexity.
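The difference between the two partitioning schemes can be sketched with plain array reshapes. The window/grid size of 4 on an 8 × 8 feature map is chosen here only so the example divides evenly; MaxViT itself typically uses 7 × 7.

```python
import numpy as np

def block_partition(x: np.ndarray, p: int) -> np.ndarray:
    """Block attention grouping: (H, W, C) -> (num_blocks, p*p, C),
    where each group is a dense, contiguous p x p local window."""
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p, c)

def grid_partition(x: np.ndarray, g: int) -> np.ndarray:
    """Grid attention grouping: (H, W, C) -> (num_groups, g*g, C),
    where each group is a dilated g x g grid of points spaced H//g apart."""
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c).transpose(1, 3, 0, 2, 4)
    return x.reshape(-1, g * g, c)
```

Self-attention is then computed independently within each group: block partitioning mixes nearby pixels (local detail), while grid partitioning mixes pixels spread across the whole map (global context) at the same O(n) cost.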
On the right side of the figure, the general MaxViT architecture is shown. The architecture has a hierarchical structure, and at each stage, the resolution decreases while the number of channels increases. The input image first passes through the stem convolution, then passes through stages containing multiple MaxViT blocks (Stage 1–4). At each stage, downsampling is performed with patch embedding. The final stage contains global average pooling and a classifier layer.
MaxViT has been evaluated as a strong candidate for MLS detection. The grid attention mechanism can capture global anatomical relationships in the brain (for example, symmetry assessment along the entire falx cerebri), while block attention can detect local structural changes (septum pellucidum shift). However, the computational cost highlighted with the red box in the figure is higher compared to EfficientNet. This situation may create a disadvantage in terms of integration in resource-constrained clinical settings.

2.2.6. CBAM

CBAM (Convolutional Block Attention Module) is a lightweight and effective attention mechanism, proposed by Woo et al. in 2018, that can be added to CNN architectures [35]. Unlike the SE block in EfficientNet, which focuses only on channel attention, CBAM learns “which features” and “where” to focus by applying channel and spatial attention mechanisms consecutively.
In Figure 6, the structure of the CBAM module is shown in detail. The input feature map F ∈ ℝ^(C × H × W) is first passed through the channel attention module shown inside the yellow box on the left side of the figure. In this module, spatial information is compressed by passing the feature map through global average pooling (AvgPool) and global maximum pooling (MaxPool) operations in parallel. The two obtained descriptors are summed element-wise after passing through a shared multi-layer perceptron (Shared MLP), and the M_c channel attention map is obtained by applying sigmoid activation. This map is multiplied element-wise (⊗) with the original input to produce the refined intermediate feature F′ = M_c(F) ⊗ F on a channel basis.
The spatial attention module shown inside the green box on the right side of the figure takes the F′ feature passed through channel attention as input. In this module, two 2D spatial descriptors are obtained by applying average and maximum pooling along the channel dimension. These descriptors are concatenated and passed through a 7 × 7 convolution layer, and the M_s spatial attention map is produced with sigmoid activation. The final output is calculated with the formula F″ = M_s(F′) ⊗ F′.
As specified in the bottom part of the figure, while the channel attention module determines “which” features are important, the spatial attention module learns “where” to focus. In this study, the reduction ratio was determined as r = 16 and spatial convolution kernel size as k = 7 for CBAM. These parameters preserve the effectiveness of the attention mechanism while minimizing computational cost.
The most important advantage CBAM provides for MLS detection is its spatial attention mechanism. Midline structures such as the septum pellucidum (SP), lateral ventricles, and falx cerebri are critical anatomical reference points in MLS evaluation. The spatial attention component of CBAM lets the model automatically focus on these structures and highlight the regions relevant to determining MLS direction and amount. This is a significant advantage for MLS detection over the SE block of EfficientNet, which applies only channel attention.
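The two attention stages described above translate directly into a compact PyTorch module. The sketch below uses the hyperparameters reported in this study (r = 16, k = 7); class and variable names are ours, not from the authors' code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c: shared MLP over parallel global avg- and max-pooled descriptors."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # AvgPool branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # MaxPool branch
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """M_s: 7x7 conv over channel-wise avg- and max-pooled maps."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)       # F'  = M_c(F)  ⊗ F
        return x * self.sa(x)    # F'' = M_s(F') ⊗ F'
```

Because both attention maps pass through a sigmoid, the module only rescales features into (0, 1) weights and preserves the input tensor shape, so it can be dropped after any backbone stage.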

2.3. Proposed Model

In this study, a hybrid DL architecture combining the EfficientNet-B2 backbone and the CBAM attention mechanism is proposed for MLS detection from CT images. The proposed approach aims to preserve computational efficiency while combining the strengths of the architectures analyzed in Section 2.2. Figure 7 shows the end-to-end operation flow of the proposed system.
The system consists of seven basic stages: (1) CT images obtained from RSNA and CQ500 datasets; (2) preprocessing pipeline including DICOM to Portable Network Graphics (PNG) conversion, resizing, Contrast Limited Adaptive Histogram Equalization (CLAHE) contrast enhancement, and ImageNet normalization; (3) data augmentation strategy designed considering laterality protection; (4) model architecture consisting of EfficientNet-B2 backbone, CBAM attention modules, and fully connected classifier; (5) two-phase training strategy; (6) evaluation with multiple metrics; and (7) results obtained in both test sets.
The proposed architecture integrates three main components in a hierarchical manner. The first component is the EfficientNet-B2 backbone, which operates in feature extraction mode using pretrained ImageNet weights. This backbone processes the input CT images through multiple hierarchical stages, producing multi-scale feature maps at progressively decreasing spatial resolutions while increasing channel depth. The compound scaling strategy inherent to EfficientNet-B2 enables high representation capacity while maintaining computational efficiency, making it particularly suitable for medical imaging applications.
The second component consists of CBAM attention modules integrated at each feature extraction stage. Unlike the Squeeze-and-Excitation blocks native to EfficientNet that provide only channel-wise attention, CBAM applies both channel and spatial attention mechanisms sequentially. The channel attention module identifies “which” features are important by analyzing the inter-channel relationships, while the spatial attention module determines “where” to focus by examining the spatial dependencies. This dual attention mechanism is particularly critical for MLS detection, as it enables the model to automatically focus on relevant midline anatomical structures such as the septum pellucidum, lateral ventricles, and falx cerebri.
The third component is the fully connected classifier head, which processes the attention-refined features through a regularization pipeline designed to prevent overfitting. The feature map from the final stage undergoes global average pooling, and the resulting feature vector passes through multiple fully connected layers with dropout and batch normalization applied at each stage. The final output layer produces a probability score for binary MLS classification through sigmoid activation.
The basic design decisions of the proposed architecture can be summarized as follows: EfficientNet-B2 offers high representation capacity with a low parameter count (~8.1 M) thanks to the compound scaling strategy. The CBAM module goes beyond the channel attention provided by SE blocks by adding a spatial attention mechanism critical for the positional detection of MLS. Disabling horizontal flip augmentation preserves the left-right anatomical orientation of brain CT images. As a result, the proposed system aims for competitive performance at approximately 14 times lower computational cost (~3.3 GFLOPs) than Swin Transformer (~88 M parameters, ~47 GFLOPs).
In the following subsections, the preprocessing pipeline (Section 2.3.1), architectural details (Section 2.3.2), and laterality protection strategy (Section 2.3.3) are explained in detail, respectively.

2.3.1. Preprocessing Pipeline

Appropriate preprocessing of CT images before being given to the DL model has critical importance in terms of model performance. In this study, Hounsfield unit (HU) normalization, CLAHE-based contrast enhancement, and normalization according to ImageNet statistics are applied, respectively, on raw CT data in DICOM format.
CT images were originally kept in HU. These values typically range between −1000 and +1000 HU for brain tissue. In the first stage, these values are normalized to the [0, 255] range using the brain window (window level: 40, window width: 80), and images are converted to PNG format.
The CLAHE algorithm shown in Figure 8 constitutes the basic component of the contrast enhancement stage. Contrary to the standard histogram equalization method, CLAHE divides the image into small regions (tiles) and applies local histogram equalization in each region. This approach ensures soft tissue differences in CT images become more distinct. As shown in the middle part of the figure, the algorithm consists of three basic steps: dividing the image into 8 × 8 size regions (tiles), applying local histogram equalization in each region, and applying clip limit to prevent excessive contrast increase.
The basic reason CLAHE is preferred in medical imaging is that it prevents the excessive contrast increase (over-amplification) that standard histogram equalization can cause. The clip limit parameter (set to 2.0 in this study) prevents noise amplification by limiting the number of pixels in each histogram bin. This is important for preserving fine anatomical details, such as the SP and ventricular structures, which are critical in MLS detection.
In the final stage, images improved with CLAHE are normalized according to ImageNet statistics (mean: [0.485, 0.456, 0.406], standard deviation: [0.229, 0.224, 0.225]). This normalization ensures the effective use of EfficientNet-B2 weights pretrained on ImageNet.
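The windowing and normalization steps of this pipeline can be sketched as follows. The function names are ours; the CLAHE step between them would use OpenCV's real API (`cv2.createCLAHE`), noted in the comment rather than executed here.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def apply_brain_window(hu: np.ndarray, level: float = 40, width: float = 80) -> np.ndarray:
    """Map raw Hounsfield units to [0, 255] with the brain window (WL 40 / WW 80)."""
    lo, hi = level - width / 2, level + width / 2
    return ((np.clip(hu, lo, hi) - lo) / (hi - lo) * 255).astype(np.uint8)

def normalize_imagenet(img_u8: np.ndarray) -> np.ndarray:
    """Replicate a grayscale slice to 3 channels and apply ImageNet statistics."""
    rgb = np.repeat(img_u8[..., None], 3, axis=-1) / 255.0
    return (rgb - IMAGENET_MEAN) / IMAGENET_STD

# Between these two steps, CLAHE would be applied, e.g. with OpenCV:
#   clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
#   img = clahe.apply(img)
```

With the brain window, all HU values below 0 map to black and all above 80 to white, concentrating the full 8-bit range on soft-tissue contrast.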

2.3.2. Architecture Overview

The proposed architecture consists of three basic components: EfficientNet-B2 backbone pretrained on ImageNet, CBAM attention modules integrated into each feature stage, and a multi-layer fully connected classifier. Figure 9 shows the relationship of these components with each other in detail.
EfficientNet-B2 Backbone: In the proposed architecture, EfficientNet-B2 is used in feature extractor mode. In this configuration, the original classification layer is removed, and intermediate-layer feature maps are obtained. EfficientNet-B2 consists of five main stages, each producing feature maps at a different resolution and channel depth. As shown in the figure, the 512 × 512 input image is converted into feature maps of sizes 256 × 256 × 16, 128 × 128 × 24, 64 × 64 × 48, 32 × 32 × 120, and 16 × 16 × 352, respectively. The MBConv blocks in each stage contain depthwise separable convolution and an SE attention mechanism.
CBAM Integration: The CBAM module explained in Section 2.2.6 is integrated into the output of each feature stage of EfficientNet-B2. This approach ensures applying both channel and spatial attention to features at different scales. All CBAM modules are configured with the same hyperparameters: reduction ratio r = 16 in channel attention module and kernel size k = 7 in spatial attention module. As a result of this configuration, the attention applied feature map (16 × 16 × 352) obtained from the last stage is reduced to 1 × 1 × 352 size with global average pooling operation.
Classifier Layer: Global average pooling output is processed by a multi-layer fully connected network. The classifier structure is equipped with regularization techniques to prevent overfitting. First, the 352-dimensional feature vector is expanded to 1024 dimensions with a 40% dropout rate, then ReLU activation and batch normalization are applied. Before the second fully connected layer, 40% dropout is applied again, and then transition from 1024 to 512 dimensions, ReLU, and batch normalization operations are performed. Before the last layer, the 512-dimensional vector is converted to a single logit value with 30% dropout. Sigmoid activation is applied at the output layer, and an MLS probability value in the [0, 1] range is obtained.
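The classifier head just described can be sketched in PyTorch as follows; the exact layer ordering (dropout before each linear layer) is inferred from the text above.

```python
import torch
import torch.nn as nn

# Sketch of the fully connected classifier head: 352-d pooled features -> 1 logit.
classifier = nn.Sequential(
    nn.Dropout(0.4),
    nn.Linear(352, 1024),
    nn.ReLU(inplace=True),
    nn.BatchNorm1d(1024),
    nn.Dropout(0.4),
    nn.Linear(1024, 512),
    nn.ReLU(inplace=True),
    nn.BatchNorm1d(512),
    nn.Dropout(0.3),
    nn.Linear(512, 1),  # single logit; sigmoid applied afterwards
)

# A batch of 4 pooled feature vectors from the final CBAM stage.
pooled = torch.randn(4, 352)
logits = classifier(pooled)
probs = torch.sigmoid(logits)  # MLS probability in [0, 1]
```

Keeping the final layer as a raw logit (rather than baking the sigmoid into the network) also matches the BCEWithLogitsLoss training objective described in Section 2.4.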
Parameter Analysis: The total parameter number of the proposed architecture is approximately 8.1 million, and the vast majority of this belongs to the EfficientNet-B2 backbone. CBAM modules add only a few thousand extra parameters at each stage. For the 512 × 512 input size, computational cost was measured as approximately 3.3 GFLOPs.

2.3.3. Laterality Preservation Strategy

Preserving the left-right anatomical orientation of brain CT images in MLS detection has critical importance in terms of diagnostic accuracy. Therefore, the data augmentation strategy in the proposed system was designed specifically to preserve laterality integrity.
Laterality Concept in MLS: MLS is a directional pathological finding. The MLS of midline structures to the left or right points to different clinical scenarios. A leftward MLS usually indicates mass effect (hematoma, tumor, edema) in the right hemisphere, and a rightward MLS indicates pathology in the left hemisphere. This directional information has critical importance in terms of surgical planning, correlation with neurological examination findings, and prognosis evaluation. Figure 10 shows allowed and disabled data augmentation techniques from a laterality preservation perspective.
Disabling Horizontal Flip: Horizontal flip augmentation, which is widely used in image classification tasks, causes serious problems in MLS detection. When this operation is applied, a midline structure with MLS to the left side appears as MLS to the right side, and vice versa. Presenting such contradictory examples during model training prevents the network from learning the MLS direction and reduces diagnostic accuracy. As emphasized in the figure, the horizontal flip operation destroys critical diagnostic information by converting left-sided MLS to right-sided MLS.
Allowing Vertical Flip: Contrary to horizontal flip, vertical flip augmentation can be used safely in MLS detection. The flipping operation in the vertical axis in brain CT images does not change the left-right anatomical relationship. This augmentation increases the robustness of the model against different patient positions and imaging variations. In the proposed system, vertical flip is applied with 50% probability.
Other Safe Augmentations: Other data augmentation techniques that do not disrupt laterality integrity were used to improve model generalization. Rotation (±20°) simulates different patient head positions by adding slight angular variations. Color jitter provides robustness against brightness and contrast differences in CT images. Random affine transformations add translation and scaling variations. Random perspective models slight perspective distortions. Mixup (α = 0.2) provides regularization by linearly combining two images and their labels. Random erasing increases occlusion robustness by deleting random regions of the image.

2.4. Implementation Details

In this section, the training process of the proposed model, the hyperparameters, and the experimental setup are explained in detail. All experiments were performed with reproducibility in mind, and the hyperparameters used are summarized in Section 3.
Hardware and Software Environment: All experiments were performed on the Google Colab Pro+ platform on an NVIDIA A100 (40 GB) GPU. As a software environment, Python 3.10, PyTorch 2.0, and timm (PyTorch Image Models) 0.9.2 libraries were used. For image preprocessing, OpenCV 4.8 and PIL libraries, and for metric calculations, scikit-learn 1.3 were used.
Training Configuration: Model training was performed for maximum 40 epochs with a mini-batch size of 12 images. Due to the 512 × 512-pixel input size and addition of CBAM modules, batch size was optimized according to GPU memory capacity. In DataLoaders, 4 parallel workers were used, and the pin_memory feature was activated for memory efficiency.
Transfer Learning Strategy: As shown in Figure 11, the training process consists of two phases. In the first phase (Epoch 1–3), the EfficientNet-B2 backbone pretrained on ImageNet was kept frozen, and only CBAM modules and the classifier layer were trained. This approach ensures randomly initialized attention and classifier weights reach a stable starting point without corrupting previously learned feature representations. In the second phase (Epoch 4–40), backbone weights were released (unfrozen), and the whole network was fine-tuned end-to-end for the MLS detection task.
Optimization Parameters: The AdamW optimization algorithm was used for weight updates. AdamW addresses the standard Adam algorithm’s deficiency in applying L2 regularization by applying weight decay directly to the weights, thereby providing better generalization performance. The learning rate was set to 5 × 10⁻⁵; this value was kept low to balance the preservation of pretrained weights against adaptation to the new task. The weight decay coefficient was set to 1 × 10⁻⁴.
Learning Rate Scheduler: The Cosine Annealing Warm Restarts scheduler was used for dynamic adjustment of the learning rate. This scheduler reduces the learning rate following a cosine curve and restarts it periodically. The T_0 = 10 parameter indicates that the first cycle lasts 10 epochs, and T_mult = 2 indicates that every subsequent cycle is twice the length of the previous one. This strategy helps the model escape local minima and reach better optima.
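The two-phase schedule and optimizer configuration described here can be sketched as follows. `MLSNet` is a hypothetical stand-in exposing a `.backbone` attribute; only the hyperparameter values come from the text.

```python
import torch
import torch.nn as nn

class MLSNet(nn.Module):
    """Stand-in mirroring the EfficientNet-B2 + CBAM + classifier structure."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)    # placeholder for EfficientNet-B2
        self.classifier = nn.Linear(8, 1)  # placeholder for CBAM + head

model = MLSNet()

def set_backbone_trainable(m: nn.Module, trainable: bool) -> None:
    for p in m.backbone.parameters():
        p.requires_grad = trainable

# Hyperparameters as quoted in the text.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2
)

set_backbone_trainable(model, False)  # Phase 1 (epochs 1-3): backbone frozen
# ... train CBAM modules and classifier, calling scheduler.step() per epoch ...
set_backbone_trainable(model, True)   # Phase 2 (epochs 4-40): full fine-tuning
```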
Loss Function and Class Balancing: The Binary Cross-Entropy with Logits (BCEWithLogitsLoss) loss function was used for the binary classification task. This function provides numerical stability by combining sigmoid activation and binary cross-entropy loss in a single operation. To eliminate class imbalance in the RSNA training set, positive class weight (pos_weight) was applied. This weight was calculated as the ratio of negative sample number to positive sample number (583/405 ≈ 1.44) and integrated into the loss function. This approach ensures the model gives more importance to the minority class (MLS positive).
Gradient Clipping: Gradient clipping was applied to ensure training stability and prevent gradient explosion problems. Maximum gradient norm was limited to 1.0. This technique is important for the stability of the training process, especially in deep networks containing attention mechanisms.
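The class-balanced loss and gradient clipping described above combine into a few lines; the logits/targets below are random stand-ins for one mini-batch of 12 (the batch size used in training).

```python
import torch
import torch.nn as nn

# Class balancing: pos_weight = negatives / positives (583 / 405 ≈ 1.44).
pos_weight = torch.tensor([583.0 / 405.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# One illustrative step on a random mini-batch of 12 samples.
logits = torch.randn(12, 1, requires_grad=True)
targets = torch.randint(0, 2, (12, 1)).float()
loss = criterion(logits, targets)  # sigmoid + BCE in one numerically stable op
loss.backward()

# Clip the gradient to a maximum L2 norm of 1.0, as in the training setup.
torch.nn.utils.clip_grad_norm_([logits], max_norm=1.0)
```

In the actual training loop the clipping would be applied to `model.parameters()` between `loss.backward()` and `optimizer.step()`.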
Early Stopping and Model Selection: The early stopping mechanism was applied to prevent overfitting and select the model with the best generalization performance. The AUC-ROC value on the RSNA test set was determined as the monitoring metric. If no improvement is observed in AUC-ROC value for 30 consecutive epochs, training is terminated. Model weights producing the highest AUC-ROC value during training were saved, and these weights were used for final evaluation.
Reproducibility: To ensure reproducibility of experimental results, random seed values were fixed. All PyTorch, NumPy, and Python random number generators were initialized with the same seed value. Also, the CUDA deterministic mode was activated, and consistency was ensured in GPU calculations.
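The seeding described here amounts to a short helper; the seed value 42 below is illustrative (the study used five distinct seeds).

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Fix all random number generators used in the experiments."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)          # no-op on CPU-only machines
    torch.backends.cudnn.deterministic = True  # deterministic GPU kernels
    torch.backends.cudnn.benchmark = False

set_seed(42)
```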

2.5. Evaluation Metrics

Within the scope of the study, binary classification was performed to detect MLS presence. The metrics used to evaluate the success of the proposed method are explained in detail in this subsection. The study focuses on the following main evaluation metrics:
  • Accuracy and F1 Score, measuring the overall success of the model;
  • Sensitivity, which is critical to prevent missed MLS cases;
  • Specificity, which measures the ability to reduce unnecessary examinations;
  • Area Under the ROC Curve (AUC), which shows how well the model can separate classes.
The metrics mentioned above are derived from the confusion matrix, which compares the model’s predictions with the actual labels. In binary classification problems, the confusion matrix summarizes the outcomes for both classes (within this study, the positive class is defined as “MLS” and the negative class as “no MLS”) through four basic quantities: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Within the scope of this study, the positive class was defined in two different ways: the first classifies patients with an MLS of 5 mm or greater as positive, while the second classifies all patients with MLS as positive, regardless of its amount. The definitions below follow the second convention, but it should be noted that both conventions were used during the evaluation phase. In this case
  • TP: the number of patients with MLS correctly predicted as having MLS;
  • FN: the number of patients with MLS incorrectly predicted as “no MLS”;
  • TN: the number of patients without MLS correctly predicted as “no MLS”;
  • FP: the number of patients without MLS incorrectly predicted as having MLS.
In all these definitions, Accuracy refers to the rate at which the model makes correct predictions when considering both positive and negative classes, as shown in Equation (1).
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Although accuracy shows the rate at which the model makes correct predictions, performance must also be evaluated on a class-by-class basis, especially in datasets with imbalanced class distributions. The F1 Score serves this purpose, together with the Precision and Recall metrics on which it is based.
Precision = TP / (TP + FP)
Recall / Sensitivity / True Positive Rate = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
The Precision metric (Equation (2)) indicates how many of the scans predicted to have MLS actually had MLS. Recall (Equation (3)) indicates what percentage of CT scans that actually contained MLS the model correctly identified as having MLS. The F1 Score (Equation (4)) is the harmonic mean of Precision and Recall. Because Precision and Recall typically trade off against each other, the F1 Score provides a single value that evaluates them together. Therefore, using Accuracy and F1 Score together allows the model’s success to be assessed with both classes in mind. The other two metrics used are sensitivity and specificity. Sensitivity is equivalent to recall; because some studies in the literature refer to it only as sensitivity, the same name is used in this study.
Specificity = TN / (TN + FP)
False Positive Rate = 1 − Specificity = FP / (FP + TN)
The formula for the specificity metric is given in Equation (5). This metric shows what percentage of no-MLS CT scans the model correctly identifies as “no MLS.” It is equivalent to the True Negative Rate, and both names appear in the literature. Another metric used in the study is AUC-ROC, which graphically expresses the relationship between the model’s True Positive Rate (TPR) (Sensitivity/Recall) and False Positive Rate (FPR). The FPR (Equation (6)) quantifies how many false alarms the model produces, i.e., how many scans without MLS are labeled “MLS” by the model. To calculate the AUC value, the ROC curve must be drawn: the x-axis represents the FPR and the y-axis the TPR across different classification thresholds. The AUC value is the area under this curve; the larger the area, the more successful the model. The curve thus summarizes the overall performance of the model for both the positive and negative classes.
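A minimal toy example of the ROC/AUC computation with scikit-learn follows; the label and probability values are illustrative, not study data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy ground-truth MLS labels (1 = MLS present) and model probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.10, 0.65, 0.80, 0.70, 0.90, 0.30, 0.60, 0.20])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # ROC curve points
auc = roc_auc_score(y_true, y_prob)               # area under that curve
```

Here one negative scan (probability 0.65) outranks one positive scan (0.60), so 15 of the 16 positive-negative pairs are ordered correctly and the AUC is 15/16 = 0.9375.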
Binary cross-entropy (Log loss) = −(1/N) Σᵢ₌₁ᴺ [yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ)]
Another metric included in the study, binary cross-entropy (log-loss), is shown in Equation (7). In the equation, N represents the total number of CT scans, yᵢ the actual label of the i-th CT scan (“MLS”/“No MLS”), and ŷᵢ the predicted probability that the i-th CT scan is positive, i.e., belongs to the MLS class. Since the equation represents a loss, lower values indicate a more successful model. The metric also measures how confidently the model predicts the correct class; it was therefore included in the study as a confidence measure.
In addition to reporting mean performance values, the statistical significance of performance differences between the proposed model and baseline architectures was assessed using the Wilcoxon signed-rank test. The test was applied to determine whether the observed performance differences were statistically meaningful rather than attributable to random variation. Given the limited number of runs (n = 5), median differences and effect sizes were reported to better characterize the magnitude and consistency of performance differences. Effect sizes (r) were interpreted according to the conventional thresholds proposed by Cohen [36], where values of approximately 0.1, 0.3, and ≥0.5 indicate small, medium, and large effects, respectively.
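The paired-test-plus-effect-size procedure can be sketched as below. The per-seed AUC values are hypothetical placeholders, and the effect size uses the common normal approximation r = |Z| / √n of the W statistic.

```python
import math
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-seed AUC-ROC values over the five runs (n = 5); not real results.
proposed = np.array([0.970, 0.955, 0.980, 0.970, 0.965])
baseline = np.array([0.950, 0.940, 0.950, 0.960, 0.940])

stat, p_value = wilcoxon(proposed, baseline)  # paired, two-sided by default
median_diff = float(np.median(proposed - baseline))

# Effect size r = |Z| / sqrt(n), via the normal approximation of W.
n = len(proposed)
mu = n * (n + 1) / 4
sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
r = abs((stat - mu) / sigma) / math.sqrt(n)
```

With n = 5, the smallest attainable exact two-sided p-value is 2/2⁵ = 0.0625, which is one reason median differences and effect sizes are reported alongside the test.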
Computational efficiency was evaluated by measuring average inference time per image and average peak GPU memory allocated during inference, normalized by batch size to obtain per-image estimates. All measurements were performed during forward-pass execution with the model in evaluation mode and gradients disabled. An initial warm-up iteration and GPU synchronization were applied to ensure stable timing. Inference time was computed as per-image latency. GPU memory usage was assessed as the peak memory allocated during each inference batch, normalized by batch size and then averaged across batches. Together, these metrics provide a practical assessment of the model’s deployability in resource-constrained clinical environments.
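The measurement protocol just described (eval mode, no gradients, warm-up, synchronization, per-image normalization) can be sketched as follows; the function name is ours, and the CUDA calls are guarded so the sketch also runs on CPU.

```python
import time
import torch
import torch.nn as nn

def measure_latency(model: nn.Module, batch: torch.Tensor, n_iters: int = 10):
    """Return (avg per-image latency in seconds, peak GPU memory in MB per image)."""
    model.eval()
    with torch.no_grad():
        model(batch)  # warm-up iteration, as in the protocol above
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # ensure all kernels finished before timing
        elapsed = time.perf_counter() - start
    per_image_s = elapsed / (n_iters * batch.shape[0])
    peak_mb = (torch.cuda.max_memory_allocated() / 2**20 / batch.shape[0]
               if torch.cuda.is_available() else float("nan"))
    return per_image_s, peak_mb

# Tiny demonstration model; the study would pass the full MLS network here.
latency, memory = measure_latency(nn.Linear(16, 1), torch.randn(4, 16))
```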
In addition to the above metrics, the Params and Floating-Point Operations (FLOPs) metrics, which are used to determine the size and computational cost of the models used, have also been included in the study. The Params metric is defined as the number of parameters in a DL model [37]. This metric indicates the size, capacity, and amount of memory required for the model. Models with more parameters are more capable of learning complex patterns/relationships, but they may also be more prone to overfitting [38]. FLOPs can be defined as the total number of floating-point arithmetic operations performed by a DL model when making an inference on a single input (e.g., an image for this study) [37,39]. The more FLOPs the model contains, the slower it will run [40]. As a result, params represent the memory usage of the model, while FLOPs represent the computational power requirement. When models are evaluated in terms of these two metrics, the expected result is to create models that provide high accuracy but have both a low number of parameters and low FLOPs [41].

3. Experiments and Results

The main objective of the study is to classify MLS as present or absent based on CT images. To this end, the performance of methods commonly used in the literature was first evaluated. For the comparative analysis, multiple model families representing different architectural approaches were selected, considering their depth, efficiency, and ability to capture global context. The ResNet family (ResNet-18/50) served as the baseline reference, demonstrating the performance of deep networks built from Residual Blocks [30]. The EfficientNet family (B0–B3) was chosen because its compound scaling [31] is expected to balance high accuracy with the computational efficiency that is critical for clinical applications. In addition, MobileNetV3-Large was included as a lightweight CNN architecture optimized for low computational cost and reduced memory footprint [33], reflecting practical deployment constraints in clinical environments. Furthermore, MaxViT was incorporated as a hybrid CNN–Transformer architecture that integrates convolutional inductive biases with global self-attention mechanisms [34]. Finally, the Swin Transformer was included because its window-based attention mechanism and shifted-window strategy allow it to model both local and global context [32], a property that may help it detect small midline shifts.
Following the selection of the baseline architectures, numerous trials were initially conducted for the proposed model only, without explicitly controlling sources of randomness. During these trials, the initialization of model weights, dropout masks, data ordering, and data augmentation parameters were determined entirely at random. The highest performance value obtained from this process was reported as a representative result of the proposed architecture. To verify the statistical reliability of the results across all selected models and the proposed model, the same model configurations were retrained using five different seed values. At this stage, all sources of randomness were controlled using the torch.manual_seed(), torch.cuda.manual_seed(), and np.random.seed() functions. The results were reported in mean ± standard deviation (mean ± std) format to demonstrate the performance stability of the models. Table 3 summarizes the mean ± standard deviation of each performance metric over five independent runs, and Table 4 details each model’s input size, Params, and FLOPs. The data in Table 3 and Table 4 were also used to select the fundamental architecture that forms the backbone of the proposed model. Unless otherwise stated, all quantitative results reported in this section correspond to mean values computed over five independent runs.
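The seeding procedure named in the text can be collected into a single helper; this sketch also seeds Python's own `random` module for completeness, which the text does not mention but which augmentation pipelines commonly use.

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Fix the sources of randomness listed in the text:
    torch.manual_seed(), torch.cuda.manual_seed(), np.random.seed().
    Python's `random` is seeded as well (an addition, not from the paper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
```

Calling `set_seed(s)` once per run before model construction and data loading makes weight initialization, dropout masks, and shuffling reproducible for each of the five seed values.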
Examining the results in Table 3, the Swin Transformer and EfficientNet architectures stand out, especially considering the F1-Score and AUC-ROC values obtained on both datasets. The Swin Transformer model achieved an F1-Score of 0.7423 on the RSNA dataset, falling behind both EfficientNet-B3 (0.7556) and B1 (0.7482), while on the CQ500 dataset it was the best with an F1-Score of 0.7309. In terms of the AUC-ROC metric, although the Swin Transformer achieved higher results than the EfficientNet architectures on both datasets, there was no significant difference from the EfficientNet-B1, EfficientNet-B2, and MaxViT-Tiny models in particular. When comparing the EfficientNet-B1, B2, and B3 architectures based on the Accuracy, F1-Score, and AUC-ROC metrics, the B2 model is more resilient to test-set variations and yields higher accuracy, especially on the CQ500 dataset. At the same time, according to the data in Table 4, the Swin Transformer architecture has both a higher number of parameters (~88 million vs. ~7.7 million) and greater complexity (~47 GFLOPs vs. ~1.0 GFLOPs) compared to EfficientNet-B2, meaning that integrating it into hospitals would require more computational resources and longer inference times. Considering that clinical environments have limited resources, the EfficientNet-B2 model was chosen for its lower computational cost and fewer parameters, while still delivering consistent and near-optimal performance. This choice was guided not only by performance metrics but also by the need to achieve a robust balance between detection accuracy, model compactness, and computational efficiency compared with alternative CNN-based and transformer-based architectures. To best adapt the selected EfficientNet-B2 model to the MLS detection task and maximize its performance, an attention mechanism was added to the architecture, and the resulting EfficientNet-B2 + CBAM model was optimized for the problem.
The hyperparameters of the proposed model are listed in detail in Table 5.
The MLS detection capability of the proposed EfficientNet-B2 + CBAM architecture was tested across three different scenarios. The first is binary classification of MLS as present/absent; the second is the detection of MLSs above 5 mm, which clinically indicates surgical status; and the third is the ability to distinguish MLSs above 5 mm from healthy cases without MLS.
Table 6 presents the binary classification results for the first scenario, in which all MLS cases are distinguished from no-MLS cases without applying any MLS-amount threshold; these results therefore cover both clinically critical and subtle displacements. The accuracy of the model is 0.8539 on the RSNA test set and reaches 0.8884 on the CQ500 test set. High specificity values of 0.9135 (RSNA) and 0.9154 (CQ500) were obtained on both sets. The sensitivity of 0.7739 and AUC-ROC of 0.9301 obtained on the CQ500 test set demonstrate strong generalization performance on external validation data. A lower sensitivity of 0.7096 was obtained on the RSNA test set, which can be attributed to the increased difficulty of detecting subtle MLS cases in the RSNA dataset. Notably, the proposed model exhibits among the lowest standard deviations in sensitivity and AUC-ROC across both the RSNA and CQ500 datasets, and the F1-Scores further confirm a balanced trade-off between sensitivity and precision on both datasets. The table also includes the results of the EfficientNet-B2 model forming the basis of the proposed model and of the Swin Transformer model, which gave the most successful AUC-ROC results in Table 3. When compared in terms of the AUC-ROC metric, the proposed model demonstrates competitive discriminative performance, with 0.8941 on the RSNA dataset and 0.9301 on the CQ500 dataset.
As summarized in Table 6, the proposed model demonstrates mean performance competitive with the EfficientNet-B2 and Swin Transformer models on both the RSNA and CQ500 datasets. To assess whether the observed differences in mean performance are statistically significant, Wilcoxon signed-rank tests were conducted using paired results from five independent runs. Table 7 presents the results of this statistical comparison between the proposed model and the selected baselines for Scenario 1. According to the table, the proposed model performed similarly to the EfficientNet-B2 and Swin Transformer models in terms of accuracy, F1-Score, and AUC-ROC on the RSNA dataset. The negative differences obtained in the log-loss metric indicate that the proposed model is better calibrated and provides more reliable probability estimates than the models it is benchmarked against. On the CQ500 dataset, the proposed model achieved positive differences in most metrics: an improvement in AUC-ROC was observed relative to EfficientNet-B2, while results similar to the Swin Transformer were obtained. When the positive or negligible differences in the discrimination metrics are evaluated together with the consistent decreases in log-loss, the proposed model offers a balanced performance between discriminative power and probability calibration.
Table 8 reports the case counts stratified by MLS magnitude for Scenario 1 on the RSNA and CQ500 datasets, together with the corresponding true positive (TP) and false negative (FN) results from a representative single run. Although Scenario 1 is emphasized to support early detection and follow-up, the relatively lower sensitivity observed on the RSNA dataset highlights the challenge of identifying low-magnitude or borderline MLS cases, even in expert-annotated CT data. As shown in Table 8, the majority of FNs in Scenario 1 correspond to low-magnitude MLS cases (≤5 mm), particularly in the RSNA dataset. While MLS cases above 5 mm are detected with high sensitivity, subtle displacements—especially those in the 1–3 mm range—remain challenging and account for a substantial portion of missed cases. This breakdown frames the lower overall sensitivity observed in Scenario 1 as a deliberate clinical trade-off, whereby the model prioritizes high sensitivity for surgically actionable MLS (>5 mm) at the expense of limited performance for very small displacements.
The magnitude-stratified results summarized in Table 8 reveal systematic performance differences across MLS sizes, indicating the presence of dataset-related bias rather than uniform model behavior. Both the RSNA and CQ500 datasets were originally constructed for hemorrhage or critical abnormality detection rather than for midline shift analysis, which inherently biases the data distribution toward cases with more pronounced mass effect. As a result, subtle MLS cases are relatively underrepresented, making their detection more challenging compared to clinically obvious displacements. Furthermore, differences in imaging protocols, scanners, and patient populations across datasets may further contribute to domain-related bias, partially explaining the sensitivity gap observed on the RSNA dataset. These sources of data bias should therefore be considered when interpreting the reported results, particularly for small MLS cases.
Figure 12 shows the confusion matrices for the binary classification (MLS present/absent) results obtained by the proposed model on the RSNA test and CQ500 external test sets in a representative single run. These matrices summarize the model’s performance on the overall classification task numerically. Of a total of 104 scans containing MLS in the RSNA dataset, 76 were correctly detected; likewise, 79 of 92 MLS scans in the CQ500 dataset were detected. The matrices also show that the number of scans falsely flagged as containing MLS is low for both datasets (RSNA: 22 of 252 scans; CQ500: 29 of 390 scans).
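The per-run metrics follow directly from these confusion-matrix counts. The sketch below derives sensitivity, specificity, and accuracy from the representative-run numbers quoted above; note that these single-run values naturally differ somewhat from the five-run means reported in Table 6.

```python
def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Representative-run counts read from the Figure 12 matrices:
# RSNA: 76 of 104 MLS scans detected, 22 false positives among 252 no-MLS scans.
rsna = binary_metrics(tp=76, fn=104 - 76, tn=252 - 22, fp=22)
# CQ500: 79 of 92 MLS scans detected, 29 false positives among 390 no-MLS scans.
cq500 = binary_metrics(tp=79, fn=92 - 79, tn=390 - 29, fp=29)
```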
Figure 13 presents the AUC-ROC curves of the proposed model obtained from both datasets in a representative single run. The red dot in each graph marks the optimal threshold, i.e., the balance point at which the model’s sensitivity and specificity values are jointly at their best. This threshold was determined as 0.251 for the RSNA dataset and 0.250 for the CQ500 dataset. According to the graphs, the model demonstrated strong classification performance, with an AUC-ROC of 0.9008 on the RSNA dataset and 0.9497 on the external CQ500 dataset.
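One common way to locate such a balance point, sketched below, is to maximize Youden's J statistic (sensitivity + specificity − 1) over candidate thresholds. The paper does not state which criterion it used, so this is an assumption offered for illustration.

```python
def youden_optimal_threshold(labels, scores):
    """Return the score threshold maximizing Youden's J = sens + spec - 1,
    scanning each observed score as a candidate cut-off
    (predict positive when score >= threshold)."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        j = tp / pos + (neg - fp) / neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

On a toy example with perfectly separable scores, the selected threshold sits exactly at the boundary between the two classes.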
In clinical practice, MLS generally requires surgery when its amount exceeds a certain threshold (above 5 mm). Therefore, in addition to detecting the presence of MLS, providing an emergency-intervention warning is also an important capability that supports physicians. For this purpose, the proposed model’s ability to distinguish MLSs above 5 mm was also tested through two scenarios. Two scenarios were designed because studies in the literature do not make clear how the dataset is partitioned when classifying MLS above 5 mm. First, images in the test set containing MLS of 5 mm and below were included in the negative class, leaving only images containing MLS above 5 mm in the positive class. The classification results obtained under this scenario are given in Table 9.
The accuracy of the model was 0.8028 on the RSNA test set and reached 0.8793 on the CQ500 test set. While the model fell behind the Swin Transformer, EfficientNet-B1, and B3 models with a specificity of 0.7896 on the RSNA dataset, a result close to the most successful models was obtained with 0.8817 on the CQ500 dataset. The model stands out in terms of sensitivity for distinguishing MLS above 5 mm: 0.9467 on the RSNA dataset and 0.8623 on CQ500. The moderate F1-Scores observed on both datasets reflect the intentional emphasis on high sensitivity in this scenario, where avoiding missed MLS cases above 5 mm is prioritized even at the expense of increased false positives. In terms of the AUC-ROC metric, the proposed model achieved results similar to the most successful models with 0.9219 on the RSNA dataset, while standing out as the most successful model with an AUC-ROC of 0.9443 on the CQ500 dataset. In particular, the AUC-ROC value obtained on the external dataset shows the generalization ability of the model, while the AUC-ROC and sensitivity metrics together clearly express its potential to capture emergency cases.
Table 10 presents the results of the statistical comparison based on the Wilcoxon signed-rank test between the proposed model and the selected baselines for Scenario 2. On the RSNA dataset, the proposed model showed positive differences relative to EfficientNet-B2 in terms of accuracy and F1-Score, and only limited negative differences relative to the Swin Transformer. AUC-ROC values remained at a similar level to both reference models, while the log-loss metric yielded lower values than both. On the CQ500 dataset, the proposed model demonstrated positive differences across all metrics: accuracy, F1-Score, and AUC-ROC. In terms of log-loss, significant reductions were achieved relative to both reference models. Overall, these results confirm that the model maintains its discriminative power and improves calibration on RSNA, and delivers consistent and robust performance in distinguishing clinically critical >5 mm MLS cases on CQ500.
Figure 14 shows the confusion matrix from a representative single run for the scenario in which scans with MLS ≤ 5 mm are assigned to the negative class. According to the figure, 28 of a total of 30 scans containing MLS above 5 mm were correctly detected in the RSNA dataset; likewise, 58 of 61 such scans were detected in the CQ500 dataset. The matrices also show that the number of scans flagged as positive despite not containing MLS above 5 mm was 70 of 326 scans in the RSNA dataset and 50 of 421 scans in CQ500, corresponding to approximately 21.5 percent in RSNA and 11.88 percent in CQ500. Another point that should not be overlooked is that the training set on which the models were trained carries only MLS present/absent labels, whereas the test sets contain the MLS amounts for MLS-positive cases in both RSNA and CQ500. The reported values were therefore achieved without training on a set labeled for MLS above 5 mm; if the model were trained with such data, these rates would likely improve further.
Figure 15 presents the AUC-ROC curves of the proposed model for the second scenario on both datasets, obtained from a representative single run. According to the graphs, the optimal threshold was determined as 0.6961 for the RSNA dataset and 0.5391 for the CQ500 dataset, and the model demonstrated strong classification performance, with an AUC-ROC of 0.9208 on the RSNA dataset and 0.9574 on the external CQ500 dataset. Notably, these values are higher than the model’s MLS present/absent classification results (RSNA AUC-ROC: 0.9008, CQ500 AUC-ROC: 0.9497).
The goal of the third scenario in the study is to answer the question of how well the model can distinguish MLSs of above 5 mm from healthy scans. For this purpose, the positive class of the dataset was set to contain only MLSs above 5 mm and the negative class was left in its original state to contain cases without MLS. Table 11 presents success metrics for both internal validation (RSNA) and external validation (CQ500) datasets through this scenario.
According to the results, high sensitivity values of 0.9600 on the RSNA dataset and 0.8984 on the CQ500 dataset were achieved. These findings indicate that the proposed model correctly distinguishes a large proportion of clinically critical cases with MLS greater than 5 mm from no-MLS cases. The overall classification accuracy was 0.9113 for both the RSNA and CQ500 datasets, indicating consistent performance across datasets. In terms of specificity, the model achieved values of 0.9056 on RSNA and 0.9133 on CQ500, reflecting its ability to correctly identify no-MLS cases. In addition, the model achieved F1-Scores of 0.7031 (RSNA) and 0.7337 (CQ500), demonstrating a balanced trade-off between sensitivity and precision in this clinically critical classification task. The proposed model also exhibited strong discriminative performance, achieving AUC-ROC values of 0.9816 on RSNA and 0.9690 on CQ500. Finally, the relatively low standard deviations across repeated runs indicate stable and consistent model performance.
Table 12 presents the results of the statistical comparison based on the Wilcoxon signed-rank test between the proposed model and selected baselines for Scenario 3. The results indicate that, in Scenario 3, the proposed model achieves performance comparable to the reference models in terms of accuracy and F1 score on the RSNA dataset, while providing advantages in AUC-ROC and log-loss metrics. On the CQ500 dataset, the proposed model yields positive differences across all evaluated metrics when compared to both reference models. In particular, the AUC-ROC and log-loss results demonstrate a consistent advantage in terms of discriminative ability and probability calibration.
Figure 16 presents the confusion matrices for the third scenario, obtained from a representative single run. According to the figure, 28 out of a total of 30 scans containing MLS above 5 mm were correctly detected in the RSNA dataset. Again, 58 out of 61 MLS scans containing MLS above 5 mm could be detected in the CQ500 dataset. When looking at the matrix, it was determined that the number of scans detected as having MLS despite not containing MLS above 5 mm was 22 out of 252 scans in the RSNA dataset, and 29 out of 390 scans in CQ500. This ratio corresponds to approximately 8.73 percent in RSNA, while it is approximately 7.44 percent in the CQ500 dataset.
Figure 17 illustrates the ROC curves and corresponding AUC metrics for the third scenario from a representative single run. Figure 17a focuses on the RSNA test set, while Figure 17b focuses on the CQ500 test set, and Figure 17c shows the comparison of ROC curves belonging to RSNA test and CQ500 external validation sets. The AUC-ROC values obtained as 0.9783 for RSNA and 0.9754 for CQ500 test sets demonstrate that the model exhibits near-perfect classification performance in distinguishing critical MLS and no-MLS cases. Optimal point thresholds were determined as 0.6961 for RSNA and 0.5391 for CQ500. At this threshold point, the model could correctly distinguish almost all the clinically most critical cases from healthy scans.
Table 13 presents a comparison of EfficientNet-B2, Swin Transformer, and the proposed model in terms of inference time and average peak GPU memory usage during inference under three different scenarios. The results show that inference costs calculated independently of the scenario are determined by the model architecture. When median differences are examined, the Swin Transformer has a longer inference time and significantly higher average peak GPU memory usage compared to the proposed model in all three scenarios. In contrast, EfficientNet-B2 stands out with lower average peak GPU memory consumption and shorter inference time. The proposed model offers a computational cost very close to EfficientNet-B2, while still operating with a reasonable increase in average peak GPU memory and time despite the addition of the CBAM module to its architecture.
To quantify the contribution of the attention mechanism, an ablation study was conducted by comparing the EfficientNet-B2 backbone with and without the CBAM module across all three scenarios and both datasets. As shown in Table 14, adding CBAM consistently improves performance, particularly in terms of F1 score and AUC-ROC. The observed gains are more pronounced in clinically critical scenarios, highlighting the role of the attention mechanism in improving sensitivity to relevant anatomical structures.
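For readers unfamiliar with the module ablated here, the sketch below reproduces the standard CBAM design (channel attention from average- and max-pooled descriptors, followed by spatial attention from channelwise statistics). The exact placement within EfficientNet-B2 and the hyperparameters (reduction ratio, kernel size) are assumptions; the paper's architecture details govern the actual model.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP applied to both average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        # 2-channel input: channelwise mean and max maps.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel then spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        return self.sa(self.ca(x))
```

Because both attention maps are sigmoid-gated, CBAM rescales (never amplifies) feature magnitudes, which is consistent with its role of emphasizing midline-relevant regions rather than altering the backbone's representation space.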
In addition to the primary experiments, the robustness of the proposed model against image noise was evaluated by adding Gaussian noise to the test data only, without modifying the training process. Noise was applied at three different levels (σ = 0.01, 0.02, and 0.03); noise-related effects on DL models in medical imaging have been investigated in the literature [42,43,44]. The pattern of noise levels and its effect on a sample CT image are shown in Figure 18. Model performance on noisy test data was evaluated on both the RSNA and CQ500 datasets under the three scenarios. Changes in accuracy, sensitivity, and AUC-ROC metrics, using the noise-free case (σ = 0.00) as a reference, are reported in detail in Table 15.
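The perturbation itself is simple to reproduce; a minimal sketch, assuming intensities normalized to [0, 1] and clipping back to the valid range (the paper does not specify its normalization or clipping convention), is:

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng=None):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to an
    intensity-normalized image in [0, 1], clipping back to the valid range.
    Applied to test images only, as in the robustness experiment."""
    rng = rng or np.random.default_rng(0)
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)
```

Sweeping `sigma` over 0.01, 0.02, and 0.03 on the test split reproduces the three perturbation levels of Table 15 without touching the trained model.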
The results presented in Table 15 show that a gradual decline in performance metrics was observed as the noise level increased in all scenarios. At a low noise level (σ = 0.01), changes in AUC-ROC values remained limited. While slight decreases were observed in the RSNA dataset, small increases were also reported in the CQ500 dataset. In contrast, at a higher noise level (σ = 0.03), significant decreases in AUC-ROC values were observed in both datasets.
Figure 19, Figure 20 and Figure 21 show the AUC-ROC curves for Scenarios 1, 2, and 3, respectively, at different noise levels. When comparing the scenarios, it was observed that in all three scenarios, the model largely maintained its discrimination performance under low and medium noise levels, while at the highest noise level, more pronounced deterioration was observed in all metrics. The AUC-ROC (drop) column in Table 15 quantitatively demonstrates the change in performance loss depending on noise levels.

4. Discussion

This section of the study aims to analyze and interpret in detail the experimental findings obtained using the proposed hybrid model for MLS detection. The results regarding the success of the proposed model have been presented numerically in the experimental study section. This section will primarily focus on interpreting and deeply comparing the results achieved by the proposed architecture in key metrics for medical diagnosis such as sensitivity, specificity, and AUC-ROC with more complex and parameter-intensive approaches in the literature. Finally, the methodological constraints of the study, dataset-specific limitations, and issues regarding the generalizability of the obtained results will be comprehensively addressed. Furthermore, this section serves to consolidate the unique and impactful contribution of our study to the field of MLS detection, outlining a roadmap for future research.

4.1. Interpretation of MLS Detection Results

The results obtained in the experimental study section were presented across three different scenarios, so this subsection addresses them in the same order. The first scenario was the binary classification of MLS as present/absent. Here, the results of the proposed model on the RSNA dataset show that it achieved a slightly higher mean accuracy than the Swin Transformer (Swin: 0.8478, proposed model: 0.8539) while demonstrating almost the same discriminative power in terms of AUC-ROC (Swin: 0.9011, proposed model: 0.8941). This demonstrates that, in terms of AUC-ROC, one of the most important metrics in medical classification, the model can compete with an advanced architecture such as the Swin Transformer despite having far fewer parameters. In the first scenario, the model thus achieved the low-complexity/low-parameter target on the RSNA data, establishing a competitive, and in some metrics numerically superior, balance against a computationally demanding model like the Swin Transformer. Evaluated on the CQ500 dataset, the advantage of the model emerges more distinctly. Across both datasets, the proposed model demonstrated competitive overall classification performance compared to the evaluated baselines and showed strong capability in distinguishing true negative cases, contributing to improved decision reliability. Although the Swin Transformer achieved higher average sensitivity, its larger standard deviation across five independent runs, especially on the RSNA dataset, indicates greater susceptibility to training stochasticity. In contrast, the proposed model exhibited more stable sensitivity across runs, suggesting improved training stability.
In Figure 22, examples of FP, i.e., scans predicted as MLS despite not having MLS, are shown for both datasets. In Figure 23, examples of FN, i.e., scans predicted as no-MLS but containing MLS, are shown. Here, both incorrect predictions containing small and large amounts of MLSs have been exemplified. All FP and FN examples shown in Figure 22 and Figure 23 correspond to the same representative single run used for the AUC-ROC curve and confusion matrix analyses in Section 3.
When the images in Figure 22 are examined, the main source of error appears to be a local mass effect or an unnatural asymmetry of the brain (e.g., a prior infarct or an arachnoid cyst). The upper-left and lower-left RSNA images show mild asymmetry or slight compression of the brain sulci. Particularly in the lower-row CQ500 images, large infarct (stroke) or hemorrhage areas are noticeable. These lesions may be too small to shift the midline, or opposing forces may prevent MLS, but due to intense asymmetry and edema the model may have perceived them as a “threat” (potential MLS). FPs generally contain larger and more prominent pathologies, indicating that the model learned to associate large lesions with MLS and that this assumption produces errors in the rare cases where MLS does not occur. The model tends to perceive the intense shadowing and asymmetry created by a local lesion as a ‘positive’ signal even when MLS is absent.
The (a) RSNA (MLS = 3 mm) and (b) CQ500 (MLS = 3 mm) samples in Figure 23 show small MLSs close to the critical threshold. For these examples, the model’s inability to find the true midline, owing to natural anatomical variation between individuals and the absence of obvious mass effects, drove its decision that no MLS was present. The scan containing 6 mm MLS in Figure 23c shows a cortical (near-surface) cyst. The location of such a cyst can sometimes exert a less prominent pushing force on the midline than deeper lesions, or compression of other brain structures may make it difficult for the model to detect the midline. Missing a prominent 6 mm MLS indicates that the model could not learn the distinguishing features of this lesion type, a consequence of the scarcity of such lesions in the training set. The missed 8 mm case in Figure 23d is the most critical error observed. Missing such a large MLS occurred because the ventricular structure shrank in a way that made it difficult to distinguish, as a result of pressure exerted on surrounding tissues by the hemorrhage; the model could not clearly detect the main reference points used to measure MLS. The missed 5 mm MLSs in both datasets are likewise explained by the inability to clearly distinguish ventricular reference points due to tissue edema. When interpreting all these results, it should be noted that only 3 of the 13 images labeled as FN in the CQ500 dataset, and only 2 of the 28 FN scans in the RSNA dataset, were above the 5 mm MLS threshold requiring urgent intervention.
When the results obtained from the second scenario (MLSs above 5 mm as positive and 0–5 mm MLSs as negative) are evaluated, the proposed model demonstrates a balanced performance on the RSNA dataset. While its accuracy and specificity are slightly lower than those of some baseline models, it achieves a high sensitivity, which is critical for minimizing missed surgically relevant MLS cases. On the CQ500 external test set, the proposed model attains performance values that are comparable to or exceed those of other evaluated architectures across most evaluation metrics. Notably, it achieves the highest AUC-ROC value, outperforming Swin Transformer and other baseline models, indicating superior overall discriminative capability under this evaluation setting.
The primary objective of this scenario is to identify MLS cases exceeding the 5 mm surgical threshold with high sensitivity while maintaining reliable class separation, as reflected by AUC-ROC. When the results from both datasets are considered together, the proposed model exhibits a favorable balance between sensitivity and overall discriminative power, supporting its suitability for clinically relevant MLS detection.
The third scenario tests the model’s ability to separate high-risk MLS (above 5 mm) from healthy cases without MLS. The proposed model offers strong overall discriminative performance (AUC-ROC) and clinical sensitivity in this critical scenario, meaning that it performs competitively in not missing serious injuries, which is the primary priority in clinical settings. Specifically, on the CQ500 dataset, the model achieved a moderate F1-Score, demonstrating that it balances high sensitivity with reasonable precision. The model’s high sensitivity and AUC-ROC were achieved at the expense of relatively low precision; that is, the model accepts labeling a few more healthy cases as FP in order to capture MLSs above 5 mm to the maximum extent. This is generally acceptable for a clinical diagnostic tool, particularly in emergency settings, because referring a patient unnecessarily to the hospital is considered a better outcome than missing a serious injury. In conclusion, the proposed model is a reliable and competitive model for distinguishing MLSs above 5 mm from healthy cases, and its accuracy, sensitivity, and AUC-ROC values support its suitability for clinical application.
Tests conducted by adding different noise levels to the above scenarios revealed that the proposed model did not exhibit sudden breaks in performance under increasing noise levels, but rather showed gradual, predictable changes. The proportional and predictable decrease in model performance with increasing noise can support clinical interpretation by preventing unexpected behavior in suboptimal images.
In Table 16, the EfficientNet-B2 model forming the basis of the proposed model and the Swin Transformer model, which demonstrated strong performance across the three evaluated scenarios, are compared in terms of parameter count and FLOPs. The parameter count reflects model size and memory requirements, whereas FLOPs provide an estimate of computational complexity and inference cost. Despite operating at a higher input resolution, the proposed model achieves high overall discriminative performance (AUC-ROC) and sensitivity in the first and third scenarios on both the RSNA and CQ500 datasets, while remaining substantially more compact (8.1 M vs. 88 M parameters) and computationally efficient (3.3 G vs. 47 G FLOPs) than transformer-based architectures such as the Swin Transformer. Importantly, this reduction in computational cost translates into faster inference and lower hardware requirements, which are critical for real-time and resource-constrained clinical environments. In the second scenario, the proposed model also achieves the highest AUC-ROC on the CQ500 dataset and comparable performance on RSNA. Overall, these results demonstrate that the proposed approach provides a favorable trade-off between detection performance and computational efficiency, supporting its practical clinical applicability.
Table 17 shows the FLOP counts obtained when each model’s input size is set to 512 × 512. Since the proposed model takes 512 × 512 image input, this additional comparison answers the question of what the other models’ complexity would be at the same input size. The Swin Transformer and MaxViT models could not be run with 512 × 512 input due to size constraints. According to Table 17, the proposed model is competitive with MobileNetV3-Large and the EfficientNet-B0, B1, and B2 models, which have low parameter and FLOP counts, despite containing an attention mechanism. It produces results many times faster than the Swin Transformer architecture while requiring far less memory.
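The motivation for re-measuring FLOPs at a common 512 × 512 input is that convolutional cost grows with spatial area: a stride-1 convolution’s multiply–accumulate count is linear in H × W, so doubling the side length quadruples the cost. A minimal sketch (the layer sizes below are illustrative, not taken from the actual architectures):

```python
def conv2d_macs(h, w, c_in, c_out, k):
    """Multiply-accumulate count of a stride-1, 'same'-padded k x k convolution."""
    return h * w * c_in * c_out * k * k

base = conv2d_macs(256, 256, 32, 64, 3)  # cost at 256 x 256 input
high = conv2d_macs(512, 512, 32, 64, 3)  # cost at 512 x 512 input
# high / base == 4.0: doubling the resolution quadruples the per-layer cost.
```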
The fact that the proposed model takes a 512 × 512-pixel input also contributes to MLS detection. Increasing the input resolution allows the model to preserve fine-grained spatial details and global anatomical context, which is particularly important for detecting subtle or borderline midline shifts. This benefit becomes more pronounced when combined with attention mechanisms, enabling the model to focus on clinically relevant regions while maintaining overall structural consistency. Furthermore, the improvement in AUC-ROC reflects enhanced discriminative capability, indicating a more reliable separation between MLS classes in both Scenario 1 and Scenario 2, where subtle and threshold-adjacent cases are included.

4.2. Comparison with State-of-the-Art Methods

In Table 18, the proposed model is compared with studies in the literature that perform MLS detection above 5 mm. Although the literature states that values above 5 mm are taken as positive, the representation of the negative class is not clearly specified. Therefore, both scenarios investigated in our study for MLS detection above 5 mm have been included in the comparison.
The sensitivity metric is crucial for clinical reliability, as it reflects the model’s ability to correctly identify critical MLS cases. In Scenario 3, the model achieved mean sensitivity values of 0.9600 (RSNA) and 0.8984 (CQ500), with the representative single run reaching 0.9333 and 0.9508, respectively. These results indicate that the proposed model successfully detects the vast majority of clinically significant MLS cases requiring surgical intervention. Compared to Chilamkurthy et al. [14], who reported sensitivity values of 0.9385 and 0.9077 on the CQ500 dataset, our model achieves competitive performance in Scenarios 2 and 3, where the representative single runs (0.9508) exceed their reported values. In terms of the AUC-ROC metric, the proposed model demonstrates strong discriminative performance. In Scenario 2, mean AUC-ROC values of 0.9219 on RSNA and 0.9443 on CQ500 were obtained, while the representative single runs yielded 0.9208 and 0.9574, respectively. In Scenario 3, the model achieved mean AUC-ROC values of 0.9816 (RSNA) and 0.9690 (CQ500), with representative single-run results of 0.9783 and 0.9754, respectively. Across both scenarios, the CQ500 AUC-ROC value of 0.9754 is comparable to the 0.9697 reported by Chilamkurthy et al. [14], indicating that the proposed model maintains strong generalization on external validation data. Compared to Wang et al. [9], who reported an AUC-ROC of 0.9577 on their own dataset, the proposed model achieves superior discriminative performance in Scenario 3 and competitive performance in Scenario 2 across both open datasets. Compared to Yan et al. [17], who reported a specificity of 0.967, our model shows a trade-off between sensitivity and specificity: while the representative single-run specificity values in Scenarios 2 and 3 are slightly lower, this is compensated by substantially higher sensitivity.
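For reference, the two headline metrics can be computed directly from predictions. The snippet below is a generic sketch (the toy labels and scores in the usage note are illustrative, not values from our experiments), with AUC-ROC computed via the Mann–Whitney rank formulation:

```python
def sensitivity(y_true, y_pred):
    """True-positive rate: TP / (TP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

def auc_roc(y_true, scores):
    """Probability that a random positive outranks a random negative
    (Mann-Whitney U formulation); ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For a perfectly ranked toy example such as `auc_roc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])`, the function returns 1.0, while uninformative scores yield 0.5.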
From a clinical perspective, this trade-off is acceptable and even preferable, as missing a critical MLS case has far greater consequences than unnecessary further examination. The high sensitivity ensures that patients requiring urgent surgical intervention are reliably identified, which aligns with the primary clinical objective of MLS detection.
The consistent performance across two independent datasets (RSNA for internal validation and CQ500 for external validation) demonstrates the generalization capability of the proposed model, supporting its applicability in diverse clinical settings.
The literature contains only a limited number of comparable studies that perform present/absent MLS detection. Table 19 compares the first scenario with the existing literature. The proposed model clearly outperforms the study of Agrawal et al. [20] across all metrics on both datasets, even in this simple MLS presence/absence detection scenario. Notably, the high mean AUC-ROC values (0.8941 and 0.9301) demonstrate that our model’s predictions are far from random, highlighting its strong potential as a clinical triage tool.

4.3. Clinical Implications and Translational Impact

CT, whose development Sir Godfrey Hounsfield began in 1967, has become a fundamental imaging method, especially for the rapid diagnosis of intracranial emergencies such as intracranial hemorrhage and MLS [45,46,47,48]. The ability of CT to acquire images rapidly is particularly important in situations involving motion or limited patient stability [49]. Owing to its relatively low cost and the widespread availability of scanners, CT has found a broad range of applications [50,51,52,53,54]. This widespread use has also given rise to several challenges. In centers without specialist radiologists, CT images must often be interpreted by general practitioners. In particular, physicians working in emergency departments are required to evaluate whole-body CT scans, yet this skill is not even included in the curriculum of emergency medicine specialists within the current education system [55,56]. Additionally, the legal requirement that CT images be separately reported by a radiologist can delay diagnosis and treatment, especially where immediate access cannot be provided through telemedicine.
MLS is an important indicator of increased intracranial pressure (ICP) and poor neurological outcome in conditions such as TBI and spontaneous intracerebral hemorrhage [57,58]. Trauma guidelines play a key role in emergency decisions, such as the indication for surgical intervention (e.g., decompressive craniectomy) in patients with MLS > 5 mm [57] or the initiation of hyperosmolar treatment to lower ICP when MLS is detected [58]. Therefore, rapid and accurate interpretation of CT is of great importance for starting appropriate treatment without delay, and the use of AI-supported automated systems for MLS detection and threshold determination will make an important contribution to reducing mortality and morbidity.
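In software terms, the guideline-driven thresholding described above reduces to a simple decision rule. The sketch below is schematic only: the tier labels are illustrative, and any deployed rule would be defined and validated by clinicians, but it shows how an automated MLS measurement could feed triage, following the >5 mm surgical-intervention threshold cited above:

```python
def mls_triage_tier(mls_mm, surgical_threshold_mm=5.0):
    """Map an estimated midline shift (in mm) to a schematic triage tier.

    Illustrative only: tier names are hypothetical, and thresholds would be
    set by clinical guidelines, not by this sketch."""
    if mls_mm > surgical_threshold_mm:
        return "urgent-surgical-evaluation"
    if mls_mm > 0.0:
        return "monitor-consider-icp-treatment"
    return "no-mls-detected"
```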
Considering the complexity of neurological structures and lesions that vary from individual to individual, the proposed AI-supported MLS system can make rapid and accurate inferences from massive datasets, combining a large and diverse existing and prospective sample pool with the power of AI. Large volumes of imaging data can be rapidly pre-analyzed and prioritized by AI-based systems, potentially reducing cognitive workload and supporting error mitigation in radiological workflows [59]. However, false-positive (FP) and false-negative (FN) predictions may have important clinical implications. FPs may lead to prolonged observation periods and unnecessary follow-up imaging, increasing clinical workload, particularly in high-volume emergency settings. More critically, FNs may result in patients with a clinically significant midline shift being overlooked during emergency triage, potentially delaying intervention and leading to adverse outcomes. These considerations underscore that the proposed model is intended to function as a clinical decision-support tool rather than a standalone diagnostic system, with operating thresholds that should be adapted to the specific clinical context and risk tolerance. Such systems can provide invaluable support in the early and accurate diagnosis of urgent and critical neurological conditions, especially for inexperienced physicians or non-specialist clinicians.

4.4. Limitations and Future Work

An important contribution of this study is the relabeling of existing open datasets, originally collected with a TBI focus, to make them suitable for MLS detection, thereby addressing a deficiency in the literature. However, these datasets were not collected solely for MLS detection and therefore may not adequately cover all etiologies that can cause MLS. Especially given that brain structure varies from person to person, collecting data from a much larger and more diverse patient pool will enable more accurate and consistent MLS detection results in the future.
While high accuracy was achieved for MLS above the clinically critical threshold of 5 mm, a more detailed performance analysis for small MLS (≤5 mm) remains limited by the scarcity of publicly available, magnitude-annotated datasets in this domain. Detection of low-magnitude MLS is inherently challenging due to slice-level variability, partial volume effects and anatomical asymmetry. These factors partially explain the reduced sensitivity observed in Scenario 1, as well as the observed false positive and false negative patterns for small MLS cases.
Although millimeter-level MLS annotations enable detailed performance analysis, the current framework focuses on classification-based detection and does not perform direct regression or automated estimation of MLS magnitude. As future work, we first plan to develop DL architectures that move beyond classification-based detection and directly predict the MLS magnitude with millimetric precision. Subsequently, to establish the clinical applicability of the model, we plan to conduct multi-center, independent validation studies using patient data collected from diverse geographies and various imaging devices. Following this, we plan to test the proposed model as a real-time (or near-real-time) decision-support tool by integrating it with Picture Archiving and Communication Systems (PACS)/Radiology Information Systems (RIS) or directly into CT devices. Comparing the diagnostic performance of the model with physicians of different experience levels, and measuring the model’s effect on physician decisions, are also among the planned future studies. Further directions include adding attention mechanisms that learn problem-specific refined features to reduce the FP rate, and employing geometry-based auxiliary loss functions. Whether performance can be improved by integrating image preprocessing techniques that highlight midline structures, or segmentation-based approaches, should also be investigated. Finally, incorporating explainable AI techniques into the process will enhance physician trust in the proposed methodology.

5. Conclusions

In this study, a computationally efficient, novel hybrid DL architecture combining an EfficientNet-B2 backbone with the CBAM attention mechanism has been proposed for detecting MLS, a life-threatening condition, in CT images.
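To make the attention component concrete, CBAM can be sketched in NumPy as below. This is a schematic illustration only: the shared MLP weights `w1`/`w2` are stand-ins for learned parameters, and the real module applies a 7 × 7 convolution over the concatenated per-pixel mean/max maps for spatial attention, which is simplified here to an element-wise sum:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam_sketch(feature, w1, w2):
    """Schematic CBAM on a (C, H, W) feature map: channel attention from a
    shared two-layer MLP over avg/max-pooled descriptors, then a simplified
    spatial attention from per-pixel mean and max (the 7x7 conv is omitted)."""
    c, _, _ = feature.shape
    avg_desc = feature.mean(axis=(1, 2))          # (C,) average-pooled descriptor
    max_desc = feature.max(axis=(1, 2))           # (C,) max-pooled descriptor
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # shared MLP with ReLU bottleneck
    channel_att = sigmoid(mlp(avg_desc) + mlp(max_desc)).reshape(c, 1, 1)
    x = feature * channel_att                     # channel-refined feature
    spatial_att = sigmoid(x.mean(axis=0) + x.max(axis=0))  # (H, W) attention map
    return x * spatial_att                        # sequentially refined output
```

In the proposed architecture, this sequential channel-then-spatial refinement is what lets the network emphasize midline-relevant regions of the EfficientNet-B2 feature maps.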
The proposed model was tested across three different scenarios, and the results obtained in all of them demonstrate the model’s success and generalizability. Especially in the scenario of distinguishing MLS cases above 5 mm, the emergency surgical intervention threshold, from healthy cases without MLS, it demonstrated high clinical sensitivity and strong overall discriminative performance. These results indicate that the model avoids missed critical cases at a level comparable to previously reported methods in the literature. In addition, the proposed model requires fewer parameters and incurs lower computational cost than high-performance but parameter-intensive architectures such as Swin Transformer. This efficiency makes the proposed system well suited for integration into resource-limited clinical environments and PACS/RIS workflows. At the same time, this study adds new quantitative labels, covering MLS presence/absence and millimetric-precision measurements, to the open RSNA and CQ500 datasets, addressing the lack of labeled open data that is one of the fundamental challenges in MLS detection. In this way, the path to reproducible evaluation and fair comparison has been opened for the scientific community.
Overall, the conclusions drawn in this study are directly supported by the experimental results and analyses presented. The findings presented in this study indicate that the proposed EfficientNet-B2 + CBAM architecture can achieve high sensitivity and discriminative performance for clinically critical MLS detection while maintaining low computational cost. The magnitude-stratified analyses show that the model is particularly effective in detecting surgically relevant MLS cases (>5 mm), while smaller displacements remain challenging. In addition, the newly introduced quantitative MLS annotations for the RSNA and CQ500 datasets enable reproducible evaluation and fair comparison in future studies.

Author Contributions

Conceptualization, T.H.G., İ.K. and F.K.G.; methodology, T.H.G., İ.K. and F.K.G.; software, T.H.G.; validation, İ.K. and F.K.G.; formal analysis, T.H.G., İ.K. and F.K.G.; investigation, T.H.G., İ.K. and F.K.G.; resources, T.H.G., İ.K. and F.K.G.; writing—original draft preparation, T.H.G., İ.K. and F.K.G.; writing—review and editing, T.H.G., İ.K. and F.K.G.; visualization, T.H.G., İ.K. and F.K.G.; supervision, İ.K. and F.K.G.; project administration, İ.K. and F.K.G.; funding acquisition, T.H.G., İ.K. and F.K.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study because the data were obtained from publicly available, fully anonymized patient datasets.

Informed Consent Statement

Informed consent was waived because the study used publicly available, fully anonymized data.

Data Availability Statement

All datasets used and analyzed during this study are freely available on https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection (accessed on 1 September 2025) and https://www.kaggle.com/datasets/crawford/qureai-headct (accessed on 1 September 2025). The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ko, S.B. Multimodality monitoring in the neurointensive care unit: A special perspective for patients with stroke. J. Stroke 2013, 15, 99–108.
2. Narayan, R.K.; Kishore, P.R.; Becker, D.P.; Ward, J.D.; Enas, G.G.; Greenberg, R.P.; Da Silva, A.D.; Lipper, M.H.; Choi, S.C.; Mayhall, C.G.; et al. Intracranial pressure: To monitor or not to monitor? A review of our experience with severe head injury. J. Neurosurg. 1982, 56, 650–659.
3. Cooper, D.J.; Rosenfeld, J.V.; Murray, L.; Arabi, Y.M.; Davies, A.R.; D’Urso, P.; Kossmann, T.; Ponsford, J.; Seppelt, I.; Reilly, P.; et al. Decompressive craniectomy in diffuse traumatic brain injury. N. Engl. J. Med. 2011, 364, 1493–1502.
4. Robba, C.; Cardim, D.; Sekhon, M.; Budohoski, K.; Czosnyka, M. Transcranial Doppler: A stethoscope for the brain-neurocritical care use. J. Neurosci. Res. 2018, 96, 720–730.
5. Russell, S.J.; Norvig, P. Artificial Intelligence: A Modern Approach; Prentice Hall Series in Artificial Intelligence; Prentice Hall: Englewood Cliffs, NJ, USA, 1995.
6. Sulis, E.; Taveter, K. Agents and Organization Studies. In Agent-Based Business Process Simulation; Springer: Cham, Switzerland, 2022.
7. Heydaran Daroogheh Amnyieh, Z.; Rastegar Fatemi, S.M.J.; Rastgarpour, M.; Ghazvini, G.A. CNN-RDM: A new image processing model for improving the structure of deep learning based on representational dissimilarity matrix. J. Supercomput. 2023, 79, 4266–4290.
8. Gençtürk, T.H.; Gülağız, F.K.; Kaya, İ. Artificial intelligence and computed tomography imaging for midline shift detection. Eur. Phys. J. Spec. Top. 2025, 234, 4539–4566.
9. Wang, H.C.; Ho, S.H.; Xiao, F.; Chou, J.H. A simple, fast and fully automated approach for midline shift measurement on brain computed tomography. arXiv 2017, arXiv:1703.00797.
10. Liao, C.C.; Xiao, F.; Wong, J.M.; Chiang, I.J. Automatic recognition of midline shift on brain CT images. Comput. Biol. Med. 2010, 40, 331–339.
11. Liu, R.; Li, S.; Su, B.; Tan, C.L.; Leong, T.Y.; Pang, B.C.; Lim, C.C.T.; Lee, C.K. Automatic detection and quantification of brain midline shift using anatomical marker model. Comput. Med. Imaging Graph. 2014, 38, 1–14.
12. Xiao, F.; Chiang, I.J.; Wong, J.M.; Tsai, Y.H.; Huang, K.C.; Liao, C.C. Automatic measurement of midline shift on deformed brains using multiresolution binary level set method and Hough transform. Comput. Biol. Med. 2011, 41, 756–762.
13. Chen, W.; Belle, A.; Cockrell, C.; Ward, K.R.; Najarian, K. Automated midline shift and intracranial pressure estimation based on brain CT images. J. Vis. Exp. 2013, 74, e3871.
14. Chilamkurthy, S.; Ghosh, R.; Tanamala, S.; Biviji, M.; Campeau, N.G.; Venugopal, V.K.; Mahajan, V.; Rao, P.; Warier, P. Deep learning algorithms for detection of critical findings in head CT scans: A retrospective study. Lancet 2018, 392, 2388–2396.
15. Wu, D.; Li, H.; Chang, J.; Qin, C.; Chen, Y.; Liu, Y.; Zhang, Q.; Huang, B.; Feng, M.; Wang, R.; et al. Automatic brain midline surface delineation on 3D CT images with intracranial hemorrhage. IEEE Trans. Med. Imaging 2022, 41, 2217–2227.
16. Wei, H.; Tang, X.; Zhang, M.; Li, Q.; Xing, X.; Sean Zhou, X.; Xue, Z.; Zhu, W.; Chen, Z.; Shi, F. The delineation of largely deformed brain midline using regression-based line detection network. Med. Phys. 2020, 47, 5531–5542.
17. Yan, J.L.; Chen, Y.L.; Chen, M.Y.; Chen, B.A.; Chang, J.X.; Kao, C.C.; Hsieh, M.-C.; Peng, Y.T.; Huang, K.C.; Chen, P.Y. A robust, fully automatic detection method and calculation technique of midline shift in intracranial hemorrhage and its clinical application. Diagnostics 2022, 12, 693.
18. Nag, M.K.; Gupta, A.; Hariharasudhan, A.S.; Sadhu, A.K.; Das, A.; Ghosh, N. Quantitative analysis of brain herniation from non-contrast CT images using deep learning. J. Neurosci. Methods 2021, 349, 109033.
19. Xia, X.; Zhang, X.; Huang, Z.; Ren, Q.; Li, H.; Li, Y.; Liang, K.; Wang, H.; Han, K.; Meng, X. Automated detection of 3D midline shift in spontaneous supratentorial intracerebral haemorrhage with non-contrast computed tomography using deep convolutional neural networks. Am. J. Transl. Res. 2021, 13, 11513–11521.
20. Agrawal, D.; Joshi, S.; Bahel, V.; Poonamallee, L.; Agrawal, A. Three-dimensional convolutional neural network-based automated detection of midline shift in traumatic brain injury cases from head computed tomography scans. J. Neurosci. Rural Pract. 2024, 15, 293–299.
21. Wu, A.R.; Hsieh, S.Y.; Chou, H.H.; Lai, C.S.; Hung, J.Y.; Wang, B.; Tsai, Y.S. Deep learning-based prediction of mortality using brain midline shift and clinical information. Heliyon 2025, 11, e41271.
22. Nag, M.K.; Sadhu, A.K.; Kumar, C.; Choudhary, S. Efficient automated quantification of midline shift in intracerebral hemorrhage using a binarized deep learning model on non-contrast head CT. Neuroradiology 2025, Epub ahead of print.
23. Flanders, A.E.; Prevedello, L.M.; Shih, G.; Halabi, S.S.; Kalpathy-Cramer, J.; Ball, R.; Mongan, J.T.; Stein, A.; Kitamura, F.C.; Lungren, M.P.; et al. Construction of a machine learning dataset through collaboration: The RSNA 2019 brain CT hemorrhage challenge. Radiol. Artif. Intell. 2020, 2, e190211.
24. Vidhya, V.; Gudigar, A.; Raghavendra, U.; Hegde, A.; Menon, G.R.; Molinari, F.; Ciaccio, E.J.; Acharya, U.R. Automated detection and screening of traumatic brain injury (TBI) using computed tomography images: A comprehensive review and future perspectives. Int. J. Environ. Res. Public Health 2021, 18, 6499.
25. Hibi, A.; Jaberipour, M.; Cusimano, M.D.; Bilbily, A.; Krishnan, R.G.; Aviv, R.I.; Tyrrell, P.N. Automated identification and quantification of traumatic brain injury from CT scans: Are we there yet? Medicine 2022, 101, e31848.
26. Liao, C.C.; Chen, Y.F.; Xiao, F. Brain midline shift measurement and its automation: A review of techniques and algorithms. Int. J. Biomed. Imaging 2018, 2018, 4303161.
27. Buchlak, Q.D.; Milne, M.R.; Seah, J.; Johnson, A.; Samarasinghe, G.; Hachey, B.; Esmaili, N.; Tran, A.; Leveque, J.-C.; Farrokhi, F.; et al. Charting the potential of brain computed tomography deep learning systems. J. Clin. Neurosci. 2022, 99, 217–223.
28. Yao, W.; Bai, J.; Liao, W.; Chen, Y.; Liu, M.; Xie, Y. From CNN to transformer: A review of medical image segmentation models. J. Imaging Inform. Med. 2024, 37, 1529–1547.
29. Mienye, I.D.; Swart, T.G.; Obaido, G.; Jordan, M.; Ilono, P. Deep convolutional neural networks in medical image analysis: A review. Information 2025, 16, 195.
30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
31. Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. Available online: http://proceedings.mlr.press/v97/tan19a.html (accessed on 12 January 2026).
32. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
33. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.-C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
34. Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. MaxViT: Multi-axis vision transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 459–479.
35. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
36. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Routledge: New York, NY, USA, 2013.
37. Guo, M.H.; Lu, C.Z.; Liu, Z.N.; Cheng, M.M.; Hu, S.M. Visual attention network. Comput. Vis. Media 2023, 9, 733–752.
38. Neyshabur, B.; Tomioka, R.; Salakhutdinov, R.; Srebro, N. Geometry of optimization and implicit regularization in deep learning. arXiv 2017.
39. Bai, J.; Zhu, W.; Nie, Z.; Yang, X.; Xu, Q.; Li, D. HFC-YOLO11: A lightweight model for the accurate recognition of tiny remote sensing targets. Computers 2025, 14, 195.
40. Taruneshwaran, T.; Chidambaram, S.; Manoj, P.; Sakthi Swaroopan, S.; Divya, S.; Sowmya, V.; Ravi, V. Lightweight Generative Model for Synthetic Biomedical Images with Enhanced Quality. In Machine Learning and Deep Learning Modeling and Algorithms with Applications in Medical and Health Care; Springer Nature: Cham, Switzerland, 2025; pp. 57–81.
41. Yeom, S.K.; Shim, K.H.; Hwang, J.H. Toward compact deep neural networks via energy-aware pruning. arXiv 2021.
42. Bhonsle, D.; Pillai, A.G.; Rizvi, T.; Mishra, R.; Sahu, A.K.; Mishra, R. White Gaussian noise removal from computed tomography images using Python. In Proceedings of the 2024 Fourth International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies, ICAECT 2024, Bhilai, India, 11–12 January 2024; pp. 1–5.
43. Chithra, K.; Santhanam, T. Hybrid denoising technique for suppressing Gaussian noise in medical images. In Proceedings of the 2017 IEEE International Conference on Power, Control, Signals and Instrumentation Engineering, ICPCSI 2017, Chennai, India, 21–22 September 2017; pp. 1460–1463.
44. Inkinen, S.I.; Mäkelä, T.; Kaasalainen, T.; Peltonen, J.; Kangasniemi, M.; Kortesniemi, M. Automatic head computed tomography image noise quantification with deep learning. Phys. Medica 2022, 99, 102–112.
45. Richmond, C. Sir Godfrey Hounsfield. BMJ 2004, 329, 687.
46. Kuno, H.; Sekiya, K.; Chapman, M.N.; Sakai, O. Miscellaneous and emerging applications of dual-energy computed tomography for the evaluation of intracranial pathology. Neuroimaging Clin. 2017, 27, 411–427.
47. Tran, A.T.; Desser, D.; Zeevi, T.; Abou Karam, G.; Zietz, J.; Dell’Orco, A.; Chen, M.-C.; Malhotra, A.; Qureshi, A.I.; Murthy, S.B.; et al. Optimizing automated hematoma expansion classification from baseline and follow-up head computed tomography. Appl. Sci. 2024, 15, 111.
48. Abramova, V.; Oliver, A.; Salvi, J.; Terceño, M.; Silva, Y.; Lladó, X. An end-to-end deep learning framework for predicting hematoma expansion in hemorrhagic stroke patients from CT images. Appl. Sci. 2024, 14, 2708.
49. Hori, K.; Fujimoto, T.; Kawanishi, K. Development of ultra-fast X-ray computed tomography scanner system. IEEE Trans. Nucl. Sci. 1998, 45, 2089–2094.
50. Howell, J.D. Early clinical use of the x-ray. Trans. Am. Clin. Clim. Assoc. 2016, 127, 341–349.
51. Brenner, D.J.; Hall, E.J. Computed tomography—An increasing source of radiation exposure. N. Engl. J. Med. 2007, 357, 2277–2284.
52. Shuaib, A.; Jeerakathil, T. The mobile stroke unit and management of acute stroke in rural settings. CMAJ 2018, 190, E855–E858.
53. 2024 Health Statistics Yearbook for Türkiye. Available online: https://www.saglik.gov.tr/TR-114952/saglik-istatistikleri-yilligi-2024.html (accessed on 1 December 2025).
54. Gençtürk, T.H.; Gülağiz, F.K.; Kaya, İ. Detection and segmentation of subdural hemorrhage on head CT images. IEEE Access 2024, 12, 82235–82246.
55. Türkiye Emergency Medicine Specialty Training Curriculum. Available online: https://dosyamerkez.saglik.gov.tr/Eklenti/41090/0/aciltipmufredat-v241pdf.pdf (accessed on 1 December 2025).
56. Subramaniam, R.M.; Kim, C.; Scally, P. Medical student radiology teaching in Australia and New Zealand. Australas. Radiol. 2007, 51, 358–361.
57. Greenberg, M.S. Handbook of Neurosurgery, 9th ed.; Thieme: New York, NY, USA, 2020; pp. 408–453.
58. Dixon, J.; Comstock, G.; Whitfield, J.; Richards, D.; Burkholder, T.W.; Leifer, N.; Mould-Millman, N.-K.; Hynes, E.J.C. Emergency department management of traumatic brain injuries: A resource tiered review. Afr. J. Emerg. Med. 2020, 10, 159–166.
59. Coelho, S.; Fernandes, A.; Freitas, M.; Fernandes, R.J. Artificial intelligence in computed tomography radiology: A systematic review on risk reduction potential. Appl. Sci. 2025, 15, 9659.
Figure 1. ResNet: Residual learning framework.
Figure 1. ResNet: Residual learning framework.
Applsci 16 00890 g001
Figure 2. EfficientNet: Compound scaling strategy architecture.
Figure 2. EfficientNet: Compound scaling strategy architecture.
Applsci 16 00890 g002
Figure 3. Swin Transformer: Hierarchical ViT.
Figure 3. Swin Transformer: Hierarchical ViT.
Applsci 16 00890 g003
Figure 4. MobileNetV3: Lightweight CNN Architecture.
Figure 4. MobileNetV3: Lightweight CNN Architecture.
Applsci 16 00890 g004
Figure 5. MaxViT: Multi-Axis Vision Transformer architecture.
Figure 5. MaxViT: Multi-Axis Vision Transformer architecture.
Applsci 16 00890 g005
Figure 6. General CBAM architecture.
Figure 6. General CBAM architecture.
Applsci 16 00890 g006
Figure 7. End-to-end operational flow visualization of the proposed system.
Figure 7. End-to-end operational flow visualization of the proposed system.
Applsci 16 00890 g007
Figure 8. CLAHE algorithm processing schema.
Figure 8. CLAHE algorithm processing schema.
Applsci 16 00890 g008
Figure 9. Proposed EfficientNet-B2 + CBAM architecture.
Figure 9. Proposed EfficientNet-B2 + CBAM architecture.
Applsci 16 00890 g009
Figure 10. Visualization of allowed and disabled data augmentation techniques in MLS detection from a laterality preservation perspective.
Figure 10. Visualization of allowed and disabled data augmentation techniques in MLS detection from a laterality preservation perspective.
Applsci 16 00890 g010
Figure 11. Complete training pipeline of the proposed system.
Figure 11. Complete training pipeline of the proposed system.
Applsci 16 00890 g011
Figure 12. Confusion matrix for binary MLS classification under Scenario 1 (MLS vs. no-MLS), obtained from a representative single run: (a) RSNA dataset; (b) CQ500 dataset.
Figure 12. Confusion matrix for binary MLS classification under Scenario 1 (MLS vs. no-MLS), obtained from a representative single run: (a) RSNA dataset; (b) CQ500 dataset.
Applsci 16 00890 g012
Figure 13. AUC-ROC Curve for binary MLS classification under Scenario 1 (MLS vs. no-MLS), obtained from a representative single run: (a) RSNA dataset; (b) CQ500 dataset; (c) RSNA and CQ500 together.
Figure 13. AUC-ROC Curve for binary MLS classification under Scenario 1 (MLS vs. no-MLS), obtained from a representative single run: (a) RSNA dataset; (b) CQ500 dataset; (c) RSNA and CQ500 together.
Applsci 16 00890 g013
Figure 14. Confusion Matrices for Binary MLS Classification under Scenario 2 (MLS > 5 mm as Positive Class (P), MLS ≤ 5 mm as Negative Class (N)), obtained from a representative single run: (a) RSNA dataset; (b) CQ500 dataset.
Figure 14. Confusion Matrices for Binary MLS Classification under Scenario 2 (MLS > 5 mm as Positive Class (P), MLS ≤ 5 mm as Negative Class (N)), obtained from a representative single run: (a) RSNA dataset; (b) CQ500 dataset.
Applsci 16 00890 g014
Figure 15. AUC-ROC Curve for binary MLS classification under Scenario 2 (MLS > 5 mm as Positive Class (P), MLS ≤ 5 mm as Negative Class (N)), obtained from a representative single run: (a) RSNA dataset; (b) CQ500 dataset; (c) RSNA and CQ500 together.
Figure 16. Confusion Matrices for Binary MLS Classification under Scenario 3 (MLS > 5 mm as Positive Class (P), no-MLS as Negative Class (N)), obtained from a representative single run: (a) RSNA dataset; (b) CQ500 dataset.
Figure 17. AUC-ROC Curve for binary MLS classification under Scenario 3 (MLS > 5 mm as Positive Class (P), no-MLS as Negative Class (N)), obtained from a representative single run: (a) RSNA dataset; (b) CQ500 dataset; (c) RSNA and CQ500 together.
Figure 18. Effect of Gaussian noise on a normalized medical image.
Figure 19. Robustness of the EfficientNet-B2 + CBAM model under increasing Gaussian noise levels (σ = 0.01, 0.02 and 0.03) for Scenario 1 (MLS vs. no-MLS).
Figure 20. Robustness of the EfficientNet-B2 + CBAM model under increasing Gaussian noise levels (σ = 0.01, 0.02 and 0.03) for Scenario 2 (MLS > 5 mm as Positive Class (P), MLS ≤ 5 mm as Negative Class (N)).
Figure 21. Robustness of the EfficientNet-B2 + CBAM model under increasing Gaussian noise levels (σ = 0.01, 0.02 and 0.03) for Scenario 3 (MLS > 5 mm as Positive Class (P), no-MLS as Negative Class (N)).
Figure 22. Examples of false-positive (FP) cases in binary MLS classification (MLS vs. no-MLS) for the RSNA and CQ500 datasets, obtained from a single representative run.
Figure 23. Examples of false-negative (FN) cases in binary MLS classification (MLS vs. no-MLS) for the RSNA and CQ500 datasets, obtained from a single representative run: (a) RSNA dataset (MLS = 3 mm); (b) CQ500 dataset (MLS = 3 mm); (c) RSNA dataset (MLS = 6 mm, MLS = 5 mm); (d) CQ500 dataset (MLS = 8 mm, MLS = 5 mm).
Table 1. Scan and slice statistics of the CQ500 and RSNA datasets used in this study, along with scanner-related information and data source institutions.
| | CQ500 [14] | RSNA [23] |
|---|---|---|
| Number of scans | 491 | ~25,000 (train: 21,784; test: 3528) |
| Number of slices | 171,390 | ~874,035 (train: 752,803; test: 121,232) |
| Number of scans with MLS | 65 | 509 * (train: 405; test: 104) |
| Number of scans without MLS | 354 | 835 * (train: 583; test: 252) |
| CT scanner information | GE BrightSpeed, GE Discovery CT750 HD, GE LightSpeed, GE Optima CT660, Philips MX 16-slice, Philips Access-32 CT | Multiple CT scanner manufacturers, resulting in heterogeneous acquisition conditions; detailed scanner-specific parameters are not reported. |
| Data source institutions | Six radiology centers in New Delhi, India (2012–2018) | Stanford University, Universidade Federal de São Paulo, Thomas Jefferson University Hospital |

* These data do not represent the entire RSNA dataset; they are a subset of the RSNA randomly selected for this study.
Table 2. Representative CT image examples with and without MLS, including quantitative MLS measurement illustrations, from the RSNA and CQ500 datasets.
Table 3. Comparative performance results of different DL models for MLS classification (MLS vs. no-MLS) on RSNA and CQ500 datasets (mean ± standard deviation over five runs).
| Dataset | Model | Accuracy | Precision | Sensitivity | F1 Score | Specificity | AUC-ROC |
|---|---|---|---|---|---|---|---|
| RSNA | ResNet-18 | 0.7764 ± 0.0092 | 0.6084 ± 0.0135 | 0.6577 ± 0.0232 | 0.6320 ± 0.0174 | 0.8254 ± 0.0056 | 0.8290 ± 0.0070 |
| RSNA | ResNet-50 | 0.7674 ± 0.0100 | 0.5939 ± 0.0282 | 0.6673 ± 0.0828 | 0.6244 ± 0.0274 | 0.8087 ± 0.0418 | 0.8118 ± 0.0079 |
| RSNA | MobileNetV3-Large | 0.7916 ± 0.0078 | 0.6186 ± 0.0202 | 0.7558 ± 0.0406 | 0.6792 ± 0.0062 | 0.8063 ± 0.0268 | 0.8654 ± 0.0027 |
| RSNA | EfficientNet-B0 | 0.8264 ± 0.0114 | 0.6980 ± 0.0284 | 0.7192 ± 0.0196 | 0.7078 ± 0.0135 | 0.8706 ± 0.0194 | 0.8671 ± 0.0017 |
| RSNA | EfficientNet-B1 | 0.8601 ± 0.0167 | 0.7949 ± 0.0580 | 0.7096 ± 0.0288 | 0.7482 ± 0.0249 | 0.9222 ± 0.0266 | 0.8947 ± 0.0073 |
| RSNA | EfficientNet-B2 | 0.8517 ± 0.0084 | 0.7612 ± 0.0371 | 0.7231 ± 0.0318 | 0.7402 ± 0.0097 | 0.9048 ± 0.0219 | 0.8952 ± 0.0060 |
| RSNA | EfficientNet-B3 | 0.8629 ± 0.0048 | 0.7888 ± 0.0191 | 0.7269 ± 0.0387 | 0.7556 ± 0.0142 | 0.9190 ± 0.0137 | 0.8994 ± 0.0063 |
| RSNA | Swin Transformer | 0.8478 ± 0.0086 | 0.7387 ± 0.0446 | 0.7558 ± 0.0766 | 0.7423 ± 0.0220 | 0.8857 ± 0.0362 | 0.9011 ± 0.0079 |
| RSNA | MaxVit-Tiny | 0.8208 ± 0.0308 | 0.6672 ± 0.0739 | 0.8212 ± 0.0607 | 0.7300 ± 0.0198 | 0.8206 ± 0.0685 | 0.8955 ± 0.0056 |
| CQ500 | ResNet-18 | 0.8402 ± 0.0175 | 0.5723 ± 0.0407 | 0.6674 ± 0.0334 | 0.6152 ± 0.0305 | 0.8810 ± 0.0206 | 0.8636 ± 0.0140 |
| CQ500 | ResNet-50 | 0.7739 ± 0.0192 | 0.4472 ± 0.0229 | 0.7304 ± 0.0946 | 0.5510 ± 0.0175 | 0.7841 ± 0.0445 | 0.8320 ± 0.0200 |
| CQ500 | MobileNetV3-Large | 0.8137 ± 0.0094 | 0.5081 ± 0.0154 | 0.8543 ± 0.0341 | 0.6365 ± 0.0051 | 0.8041 ± 0.0192 | 0.9084 ± 0.0048 |
| CQ500 | EfficientNet-B0 | 0.8266 ± 0.0217 | 0.5340 ± 0.0371 | 0.8000 ± 0.0280 | 0.6390 ± 0.0232 | 0.8328 ± 0.0318 | 0.8882 ± 0.0085 |
| CQ500 | EfficientNet-B1 | 0.8531 ± 0.0130 | 0.5787 ± 0.0260 | 0.8565 ± 0.0270 | 0.6903 ± 0.0222 | 0.8523 ± 0.0149 | 0.9217 ± 0.0098 |
| CQ500 | EfficientNet-B2 | 0.8801 ± 0.0165 | 0.6645 ± 0.0628 | 0.7783 ± 0.0454 | 0.7135 ± 0.0252 | 0.9041 ± 0.0275 | 0.9227 ± 0.0128 |
| CQ500 | EfficientNet-B3 | 0.8697 ± 0.0260 | 0.6635 ± 0.0896 | 0.7087 ± 0.0624 | 0.6777 ± 0.0299 | 0.9077 ± 0.0455 | 0.9027 ± 0.0101 |
| CQ500 | Swin Transformer | 0.8722 ± 0.0329 | 0.6305 ± 0.0865 | 0.8891 ± 0.0543 | 0.7309 ± 0.0399 | 0.8682 ± 0.0527 | 0.9463 ± 0.0088 |
| CQ500 | MaxVit-Tiny | 0.8411 ± 0.0331 | 0.5720 ± 0.0785 | 0.8391 ± 0.0745 | 0.6720 ± 0.0285 | 0.8415 ± 0.0582 | 0.9209 ± 0.0042 |
Table 4. Params (M) and FLOPs (G) metrics of selected classification models. (Metrics were computed at each model's original input dimensions.)
| Model | Input Size | ~Params (M) | ~FLOPs (G) |
|---|---|---|---|
| ResNet-18 | 224 × 224 | 11.2 | 1.8 |
| ResNet-50 | 224 × 224 | 23.5 | 4.1 |
| MobileNetV3-Large | 224 × 224 | 5.4 | 0.22 |
| EfficientNet-B0 | 224 × 224 | 4.0 | 0.4 |
| EfficientNet-B1 | 240 × 240 | 6.5 | 0.7 |
| EfficientNet-B2 | 260 × 260 | 7.7 | 1.0 |
| EfficientNet-B3 | 300 × 300 | 10.7 | 1.8 |
| Swin Transformer | 384 × 384 | 88 | 47 |
| MaxVit-Tiny | 224 × 224 | 30.9 | 5.6 |
Table 5. Hyperparameters of the proposed model.
| Hyperparameter | Value |
|---|---|
| **Model Architecture** | |
| Backbone Network | EfficientNet-B2 |
| Attention Mechanism | CBAM (r = 16, k = 7) |
| Classifier | FC(1024) → FC(512) → FC(1) |
| Dropout Rates | 0.4/0.4/0.3 |
| **Input Features** | |
| Image Size | 512 × 512 pixels |
| Pre-processing | CLAHE (clipLimit = 2.0) |
| Normalization | ImageNet statistics |
| **Optimization** | |
| Optimizer | AdamW |
| Learning Rate (η) | 5 × 10⁻⁵ |
| Weight Decay (λ) | 1 × 10⁻⁴ |
| LR Scheduler | CosineAnnealingWarmRestarts |
| Loss Function | Weighted BCE (w = 1.44) |
| Gradient Clipping | 1.0 |
| **Training Configuration** | |
| Batch Size | 12 |
| Epoch Number | 40 |
| Backbone Freezing | First 3 epochs |
| Early Stopping Patience | 30 epochs |
| **Data Augmentation** | |
| Mixup α | 0.2 |
| Random Rotation | ±20° |
| Vertical Flip | p = 0.5 |
| Horizontal Flip | Deactivated * |
| Color Jittering | brightness = 0.3, contrast = 0.3 |
| Random Affine | translate = 0.15, scale = [0.85, 1.15] |
| Random Perspective | p = 0.3, distortion = 0.2 |
| Random Erasing | p = 0.3, scale = [0.02, 0.1] |

* Critical for the preservation of laterality in MLS detection.
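The classifier head and optimization settings in Table 5 can be sketched in PyTorch. This is a minimal sketch, not the authors' exact code: the 1408-dimensional input assumes EfficientNet-B2's pooled feature size, and the ReLU activations, dropout placement, and restart period T_0 are our assumptions.

```python
import torch
import torch.nn as nn

class MLSHead(nn.Module):
    """FC(1024) -> FC(512) -> FC(1) head with dropout 0.4/0.4/0.3 (Table 5)."""
    def __init__(self, in_features: int = 1408):  # EfficientNet-B2 feature dim
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(0.4), nn.Linear(in_features, 1024), nn.ReLU(inplace=True),
            nn.Dropout(0.4), nn.Linear(1024, 512), nn.ReLU(inplace=True),
            nn.Dropout(0.3), nn.Linear(512, 1),  # single logit: MLS present?
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

head = MLSHead()
logits = head(torch.randn(4, 1408))  # batch of 4 pooled feature vectors

# Optimization settings from Table 5: weighted BCE (w = 1.44), AdamW with
# lr = 5e-5 and weight decay 1e-4, cosine annealing with warm restarts.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1.44]))
optimizer = torch.optim.AdamW(head.parameters(), lr=5e-5, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)
```

The single-logit output pairs with `BCEWithLogitsLoss`, whose `pos_weight` implements the class weighting of the weighted BCE loss.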
Table 6. Comparative analysis of binary classification performance for MLS detection on the RSNA and CQ500 datasets under Scenario 1 (MLS vs. no-MLS), reported as mean ± standard deviation over five independent runs.
| Dataset | Model | Accuracy | Precision | Sensitivity | F1 Score | Specificity | AUC-ROC |
|---|---|---|---|---|---|---|---|
| RSNA | EfficientNet-B2 | 0.8517 ± 0.0084 | 0.7612 ± 0.0371 | 0.7231 ± 0.0318 | 0.7402 ± 0.0097 | 0.9048 ± 0.0219 | 0.8952 ± 0.0060 |
| RSNA | Swin Transformer | 0.8478 ± 0.0086 | 0.7387 ± 0.0446 | 0.7558 ± 0.0766 | 0.7423 ± 0.0220 | 0.8857 ± 0.0362 | 0.9011 ± 0.0079 |
| RSNA | Proposed Model | 0.8539 ± 0.0171 | 0.7774 ± 0.0556 | 0.7096 ± 0.0186 | 0.7403 ± 0.0183 | 0.9135 ± 0.0308 | 0.8941 ± 0.0056 |
| CQ500 | EfficientNet-B2 | 0.8801 ± 0.0165 | 0.6645 ± 0.0628 | 0.7783 ± 0.0454 | 0.7135 ± 0.0252 | 0.9041 ± 0.0275 | 0.9227 ± 0.0128 |
| CQ500 | Swin Transformer | 0.8722 ± 0.0329 | 0.6305 ± 0.0865 | 0.8891 ± 0.0543 | 0.7309 ± 0.0399 | 0.8682 ± 0.0527 | 0.9463 ± 0.0088 |
| CQ500 | Proposed Model | 0.8884 ± 0.0205 | 0.6905 ± 0.0613 | 0.7739 ± 0.0437 | 0.7271 ± 0.0350 | 0.9154 ± 0.0279 | 0.9301 ± 0.0081 |
Table 7. Wilcoxon signed-rank test results comparing the proposed model with selected baselines on the RSNA and CQ500 datasets under Scenario 1 (MLS vs. no-MLS).
| Dataset | Metric | Baseline | Proposed | Median Difference | Effect Size r |
|---|---|---|---|---|---|
| RSNA | Accuracy | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0056 | 0.0603 (negligible) |
| RSNA | Accuracy | Swin Transformer | EfficientNet-B2 + CBAM | +0.0084 | 0.3015 |
| RSNA | F1 Score | EfficientNet-B2 | EfficientNet-B2 + CBAM | −0.0049 | 0.0603 (negligible) |
| RSNA | F1 Score | Swin Transformer | EfficientNet-B2 + CBAM | −0.0064 | 0.0603 (negligible) |
| RSNA | AUC-ROC | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0007 | 0.0603 (negligible) |
| RSNA | AUC-ROC | Swin Transformer | EfficientNet-B2 + CBAM | −0.0112 | 0.5427 |
| RSNA | Log-loss | EfficientNet-B2 | EfficientNet-B2 + CBAM | −0.2411 | 0.9045 |
| RSNA | Log-loss | Swin Transformer | EfficientNet-B2 + CBAM | −0.1416 | 0.9045 |
| CQ500 | Accuracy | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0124 | 0.3618 |
| CQ500 | Accuracy | Swin Transformer | EfficientNet-B2 + CBAM | +0.0270 | 0.6633 |
| CQ500 | F1 Score | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0112 | 0.4221 |
| CQ500 | F1 Score | Swin Transformer | EfficientNet-B2 + CBAM | +0.0005 | 0.0603 (negligible) |
| CQ500 | AUC-ROC | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0092 | 0.6633 |
| CQ500 | AUC-ROC | Swin Transformer | EfficientNet-B2 + CBAM | −0.0185 | 0.9045 |
| CQ500 | Log-loss | EfficientNet-B2 | EfficientNet-B2 + CBAM | −0.1128 | 0.7839 |
| CQ500 | Log-loss | Swin Transformer | EfficientNet-B2 + CBAM | −0.1713 | 0.9045 |
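The effect sizes r in Table 7 are consistent with the normal approximation of the Wilcoxon signed-rank test, r = |Z|/√n with n = 5 paired runs; with only five runs the largest attainable value is about 0.9045, which is why that figure recurs in the strongest comparisons. A small sketch (our own implementation; it assumes no zero differences and no tied absolute values):

```python
import math

def wilcoxon_effect_size_r(diffs):
    """Effect size r = |Z| / sqrt(n) for a Wilcoxon signed-rank test,
    using the normal approximation of the W+ statistic."""
    n = len(diffs)
    ranked = sorted((abs(d), d > 0) for d in diffs)
    w_plus = sum(rank for rank, (_, pos) in enumerate(ranked, start=1) if pos)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    return abs(z) / math.sqrt(n)

# Five paired runs where the proposed model wins every time: the maximum
# attainable effect size for n = 5.
print(round(wilcoxon_effect_size_r([0.02, 0.01, 0.03, 0.015, 0.025]), 4))  # 0.9045

# One loss out of five (the smallest difference flips sign).
print(round(wilcoxon_effect_size_r([0.02, -0.01, 0.03, 0.015, 0.025]), 4))  # 0.7839
```

The hypothetical difference values here are illustrative, not the study's per-run results.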
Table 8. Case distribution by MLS magnitude for Scenario 1 on the RSNA and CQ500 datasets with corresponding TP and FN results from a representative single run.
| | RSNA TP | RSNA FN | CQ500 TP | CQ500 FN |
|---|---|---|---|---|
| Total count | 76 | 28 | 79 | 13 |
| MLS > 5 mm | 28 | 2 | 58 | 3 |
| MLS = 5 mm | 10 | 2 | 5 | 5 |
| MLS = 4 mm | 9 | 5 | 9 | 2 |
| MLS = 3 mm | 11 | 9 | 5 | 2 |
| MLS = 2 mm | 10 | 7 | 2 | 1 |
| MLS = 1 mm | 8 | 3 | – | – |
Table 9. Comparative analysis of binary classification performance for MLS detection on the RSNA and CQ500 datasets under Scenario 2 (MLS > 5 mm as Positive Class (P), MLS ≤ 5 mm as Negative Class (N)), reported as mean ± standard deviation over five independent runs.
| Dataset | Model | Accuracy | Precision | Sensitivity | F1 Score | Specificity | AUC-ROC |
|---|---|---|---|---|---|---|---|
| RSNA | ResNet-18 | 0.7118 ± 0.0802 | 0.2262 ± 0.0441 | 0.9133 ± 0.0748 | 0.3582 ± 0.0495 | 0.6933 ± 0.0931 | 0.8945 ± 0.0101 |
| RSNA | ResNet-50 | 0.7292 ± 0.0438 | 0.2261 ± 0.0269 | 0.8867 ± 0.0452 | 0.3590 ± 0.0325 | 0.7147 ± 0.0503 | 0.8880 ± 0.0073 |
| RSNA | MobileNetV3-Large | 0.7045 ± 0.0384 | 0.2224 ± 0.0213 | 0.9867 ± 0.0163 | 0.3624 ± 0.0287 | 0.6785 ± 0.0421 | 0.8991 ± 0.0045 |
| RSNA | EfficientNet-B0 | 0.7792 ± 0.0206 | 0.2681 ± 0.0179 | 0.9267 ± 0.0133 | 0.4155 ± 0.0211 | 0.7656 ± 0.0231 | 0.8994 ± 0.0038 |
| RSNA | EfficientNet-B1 | 0.8169 ± 0.0466 | 0.3123 ± 0.0563 | 0.8667 ± 0.1011 | 0.4512 ± 0.0429 | 0.8123 ± 0.0589 | 0.9206 ± 0.0043 |
| RSNA | EfficientNet-B2 | 0.7921 ± 0.0256 | 0.2789 ± 0.0220 | 0.9067 ± 0.0389 | 0.4255 ± 0.0215 | 0.7816 ± 0.0312 | 0.9188 ± 0.0031 |
| RSNA | EfficientNet-B3 | 0.8157 ± 0.0219 | 0.3143 ± 0.0271 | 0.9867 ± 0.0267 | 0.4762 ± 0.0319 | 0.8000 ± 0.0238 | 0.9244 ± 0.0069 |
| RSNA | Swin Transformer | 0.8135 ± 0.0505 | 0.3095 ± 0.0470 | 0.9067 ± 0.0442 | 0.4585 ± 0.0521 | 0.8049 ± 0.0580 | 0.9260 ± 0.0034 |
| RSNA | MaxVit-Tiny | 0.8034 ± 0.0286 | 0.2976 ± 0.0280 | 0.9533 ± 0.0452 | 0.4522 ± 0.0287 | 0.7896 ± 0.0345 | 0.9350 ± 0.0076 |
| RSNA | Proposed Model | 0.8028 ± 0.0215 | 0.2943 ± 0.0227 | 0.9467 ± 0.0340 | 0.4486 ± 0.0288 | 0.7896 ± 0.0228 | 0.9219 ± 0.0030 |
| CQ500 | ResNet-18 | 0.8174 ± 0.0407 | 0.3913 ± 0.0560 | 0.6820 ± 0.0907 | 0.4897 ± 0.0340 | 0.8371 ± 0.0576 | 0.8674 ± 0.0119 |
| CQ500 | ResNet-50 | 0.7548 ± 0.0372 | 0.3090 ± 0.0342 | 0.7279 ± 0.0794 | 0.4308 ± 0.0282 | 0.7587 ± 0.0518 | 0.8422 ± 0.0177 |
| CQ500 | MobileNetV3-Large | 0.7639 ± 0.0409 | 0.3393 ± 0.0371 | 0.8689 ± 0.0427 | 0.4857 ± 0.0337 | 0.7487 ± 0.0515 | 0.8882 ± 0.0061 |
| CQ500 | EfficientNet-B0 | 0.8257 ± 0.0209 | 0.4096 ± 0.0331 | 0.8164 ± 0.0480 | 0.5438 ± 0.0254 | 0.8271 ± 0.0283 | 0.9005 ± 0.0146 |
| CQ500 | EfficientNet-B1 | 0.8544 ± 0.0304 | 0.4842 ± 0.0917 | 0.8197 ± 0.1422 | 0.5891 ± 0.0183 | 0.8594 ± 0.0545 | 0.9184 ± 0.0197 |
| CQ500 | EfficientNet-B2 | 0.8523 ± 0.0257 | 0.4639 ± 0.0571 | 0.8623 ± 0.0471 | 0.5995 ± 0.0371 | 0.8508 ± 0.0344 | 0.9261 ± 0.0202 |
| CQ500 | EfficientNet-B3 | 0.8639 ± 0.0378 | 0.4938 ± 0.0771 | 0.8066 ± 0.0608 | 0.6063 ± 0.0552 | 0.8722 ± 0.0474 | 0.9164 ± 0.0190 |
| CQ500 | Swin Transformer | 0.8622 ± 0.0399 | 0.4925 ± 0.0714 | 0.8393 ± 0.0729 | 0.6135 ± 0.0471 | 0.8656 ± 0.0550 | 0.9353 ± 0.0075 |
| CQ500 | MaxVit-Tiny | 0.8768 ± 0.0098 | 0.5126 ± 0.0295 | 0.8033 ± 0.0892 | 0.6218 ± 0.0143 | 0.8874 ± 0.0232 | 0.9194 ± 0.0072 |
| CQ500 | Proposed Model | 0.8793 ± 0.0210 | 0.5216 ± 0.0495 | 0.8623 ± 0.0525 | 0.6463 ± 0.0251 | 0.8817 ± 0.0316 | 0.9443 ± 0.0054 |
Table 10. Wilcoxon signed-rank test results comparing the proposed model with selected baselines on the RSNA and CQ500 datasets under Scenario 2 (MLS > 5 mm as Positive Class (P), MLS ≤ 5 mm as Negative Class (N)).
| Dataset | Metric | Baseline | Proposed | Median Difference | Effect Size r |
|---|---|---|---|---|---|
| RSNA | Accuracy | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0140 | 0.3015 |
| RSNA | Accuracy | Swin Transformer | EfficientNet-B2 + CBAM | −0.0169 | 0.3015 |
| RSNA | F1 Score | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0170 | 0.4221 |
| RSNA | F1 Score | Swin Transformer | EfficientNet-B2 + CBAM | −0.0147 | 0.4221 |
| RSNA | AUC-ROC | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0042 | 0.5427 |
| RSNA | AUC-ROC | Swin Transformer | EfficientNet-B2 + CBAM | −0.0031 | 0.9045 |
| RSNA | Log-loss | EfficientNet-B2 | EfficientNet-B2 + CBAM | −0.2931 | 0.5427 |
| RSNA | Log-loss | Swin Transformer | EfficientNet-B2 + CBAM | −0.1222 | 0.3015 |
| CQ500 | Accuracy | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0311 | 0.7839 |
| CQ500 | Accuracy | Swin Transformer | EfficientNet-B2 + CBAM | +0.0104 | 0.7236 |
| CQ500 | F1 Score | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0615 | 0.9045 |
| CQ500 | F1 Score | Swin Transformer | EfficientNet-B2 + CBAM | +0.0360 | 0.7839 |
| CQ500 | AUC-ROC | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0120 | 0.9045 |
| CQ500 | AUC-ROC | Swin Transformer | EfficientNet-B2 + CBAM | +0.0088 | 0.6633 |
| CQ500 | Log-loss | EfficientNet-B2 | EfficientNet-B2 + CBAM | −0.2640 | 0.9045 |
| CQ500 | Log-loss | Swin Transformer | EfficientNet-B2 + CBAM | −0.2164 | 0.9045 |
Table 11. Comparative analysis of binary classification performance for MLS detection on the RSNA and CQ500 datasets under Scenario 3 (MLS > 5 mm as the positive class and no-MLS as the negative class), reported as mean ± standard deviation over five independent runs.
| Dataset | Model | Accuracy | Precision | Sensitivity | F1 Score | Specificity | AUC-ROC |
|---|---|---|---|---|---|---|---|
| RSNA | ResNet-18 | 0.8340 ± 0.0365 | 0.3936 ± 0.0606 | 0.8933 ± 0.0827 | 0.5393 ± 0.0378 | 0.8270 ± 0.0502 | 0.9306 ± 0.0039 |
| RSNA | ResNet-50 | 0.8064 ± 0.0321 | 0.3494 ± 0.0341 | 0.9067 ± 0.0533 | 0.5021 ± 0.0288 | 0.7944 ± 0.0418 | 0.9278 ± 0.0068 |
| RSNA | MobileNetV3-Large | 0.8277 ± 0.0205 | 0.3817 ± 0.0301 | 0.9733 ± 0.0249 | 0.5474 ± 0.0290 | 0.8103 ± 0.0245 | 0.9558 ± 0.0037 |
| RSNA | EfficientNet-B0 | 0.8745 ± 0.0163 | 0.4590 ± 0.0347 | 0.9333 ± 0.0211 | 0.6143 ± 0.0289 | 0.8675 ± 0.0195 | 0.9573 ± 0.0022 |
| RSNA | EfficientNet-B1 | 0.9376 ± 0.0268 | 0.6820 ± 0.1220 | 0.8800 ± 0.0581 | 0.7583 ± 0.0681 | 0.9444 ± 0.0333 | 0.9790 ± 0.0048 |
| RSNA | EfficientNet-B2 | 0.9000 ± 0.0207 | 0.5266 ± 0.0609 | 0.9200 ± 0.0499 | 0.6655 ± 0.0361 | 0.8976 ± 0.0285 | 0.9722 ± 0.0038 |
| RSNA | EfficientNet-B3 | 0.9021 ± 0.0472 | 0.5446 ± 0.0956 | 0.9800 ± 0.0163 | 0.6942 ± 0.0856 | 0.8929 ± 0.0539 | 0.9851 ± 0.0029 |
| RSNA | Swin Transformer | 0.9170 ± 0.0194 | 0.5747 ± 0.0632 | 0.9400 ± 0.0249 | 0.7103 ± 0.0431 | 0.9143 ± 0.0240 | 0.9807 ± 0.0023 |
| RSNA | MaxVit-Tiny | 0.9163 ± 0.0130 | 0.5689 ± 0.0485 | 0.9467 ± 0.0400 | 0.7082 ± 0.0276 | 0.9127 ± 0.0186 | 0.9809 ± 0.0023 |
| RSNA | Proposed Model | 0.9113 ± 0.0281 | 0.5610 ± 0.0964 | 0.9600 ± 0.0365 | 0.7031 ± 0.0669 | 0.9056 ± 0.0338 | 0.9816 ± 0.0063 |
| CQ500 | ResNet-18 | 0.8585 ± 0.0110 | 0.4852 ± 0.0298 | 0.6426 ± 0.1263 | 0.5462 ± 0.0545 | 0.8923 ± 0.0268 | 0.8994 ± 0.0094 |
| CQ500 | ResNet-50 | 0.7867 ± 0.0341 | 0.3678 ± 0.0434 | 0.7508 ± 0.0980 | 0.4892 ± 0.0328 | 0.7923 ± 0.0513 | 0.8676 ± 0.0207 |
| CQ500 | MobileNetV3-Large | 0.8155 ± 0.0194 | 0.4174 ± 0.0273 | 0.8918 ± 0.0304 | 0.5678 ± 0.0229 | 0.8036 ± 0.0250 | 0.9170 ± 0.0055 |
| CQ500 | EfficientNet-B0 | 0.8408 ± 0.0290 | 0.4593 ± 0.0482 | 0.8459 ± 0.0459 | 0.5927 ± 0.0368 | 0.8400 ± 0.0381 | 0.9147 ± 0.0144 |
| CQ500 | EfficientNet-B1 | 0.9104 ± 0.0280 | 0.6398 ± 0.1029 | 0.8590 ± 0.0514 | 0.7271 ± 0.0635 | 0.9185 ± 0.0345 | 0.9452 ± 0.0164 |
| CQ500 | EfficientNet-B2 | 0.8847 ± 0.0262 | 0.5560 ± 0.0772 | 0.8918 ± 0.0435 | 0.6806 ± 0.0497 | 0.8836 ± 0.0332 | 0.9432 ± 0.0197 |
| CQ500 | EfficientNet-B3 | 0.8732 ± 0.0688 | 0.5621 ± 0.1195 | 0.8328 ± 0.0445 | 0.6597 ± 0.0915 | 0.8795 ± 0.0849 | 0.9327 ± 0.0107 |
| CQ500 | Swin Transformer | 0.9086 ± 0.0143 | 0.6178 ± 0.0550 | 0.8951 ± 0.0677 | 0.7270 ± 0.0280 | 0.9108 ± 0.0227 | 0.9642 ± 0.0062 |
| CQ500 | MaxVit-Tiny | 0.9122 ± 0.0081 | 0.6490 ± 0.0413 | 0.7836 ± 0.0706 | 0.7064 ± 0.0247 | 0.9323 ± 0.0154 | 0.9438 ± 0.0097 |
| CQ500 | Proposed Model | 0.9113 ± 0.0135 | 0.6209 ± 0.0420 | 0.8984 ± 0.0123 | 0.7337 ± 0.0322 | 0.9133 ± 0.0143 | 0.9690 ± 0.0027 |
Table 12. Wilcoxon signed-rank test results comparing the proposed model with selected baselines on the RSNA and CQ500 datasets under Scenario 3 (MLS > 5 mm as the positive class and no-MLS as the negative class).
| Dataset | Metric | Baseline | Proposed | Median Difference | Effect Size r |
|---|---|---|---|---|---|
| RSNA | Accuracy | EfficientNet-B2 | EfficientNet-B2 + CBAM | −0.0035 | 0.1809 |
| RSNA | Accuracy | Swin Transformer | EfficientNet-B2 + CBAM | +0.0071 | 0.1809 |
| RSNA | F1 Score | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0078 | 0.3015 |
| RSNA | F1 Score | Swin Transformer | EfficientNet-B2 + CBAM | +0.0227 | 0.1809 |
| RSNA | AUC-ROC | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0140 | 0.7839 |
| RSNA | AUC-ROC | Swin Transformer | EfficientNet-B2 + CBAM | +0.0019 | 0.3015 |
| RSNA | Log-loss | EfficientNet-B2 | EfficientNet-B2 + CBAM | −0.0578 | 0.5427 |
| RSNA | Log-loss | Swin Transformer | EfficientNet-B2 + CBAM | −0.0947 | 0.7839 |
| CQ500 | Accuracy | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0421 | 0.6633 |
| CQ500 | Accuracy | Swin Transformer | EfficientNet-B2 + CBAM | +0.0044 | 0.6030 |
| CQ500 | F1 Score | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0681 | 0.7839 |
| CQ500 | F1 Score | Swin Transformer | EfficientNet-B2 + CBAM | +0.0245 | 0.1809 |
| CQ500 | AUC-ROC | EfficientNet-B2 | EfficientNet-B2 + CBAM | +0.0153 | 0.9045 |
| CQ500 | AUC-ROC | Swin Transformer | EfficientNet-B2 + CBAM | +0.0066 | 0.3015 |
| CQ500 | Log-loss | EfficientNet-B2 | EfficientNet-B2 + CBAM | −0.1062 | 0.9045 |
| CQ500 | Log-loss | Swin Transformer | EfficientNet-B2 + CBAM | −0.1368 | 0.7839 |
Table 13. Scenario-wise comparison of inference time and average peak GPU memory usage during inference for EfficientNet-B2, Swin Transformer, and the proposed model.
| Scenario | Model | Inference Time (Median Difference) | Avg. Peak GPU Memory (Median Difference) |
|---|---|---|---|
| 1 | EfficientNet-B2 | +4.6980 | +48.9126 |
| 1 | Swin Transformer | −16.2239 | −220.2933 |
| 2 | EfficientNet-B2 | +4.6975 | +38.3109 |
| 2 | Swin Transformer | −15.9482 | −229.6997 |
| 3 | EfficientNet-B2 | +4.6957 | +38.3109 |
| 3 | Swin Transformer | −16.6172 | −230.9575 |
Table 14. Ablation analysis of the proposed EfficientNet-B2 + CBAM architecture, comparing performance with and without the CBAM module across three scenarios on RSNA and CQ500 datasets.
| Scenario | Dataset | Backbone | CBAM | Accuracy | F1 Score | AUC-ROC |
|---|---|---|---|---|---|---|
| 1 | RSNA | EfficientNet-B2 | No | 0.8517 ± 0.0084 | 0.7402 ± 0.0097 | 0.8952 ± 0.0060 |
| 1 | RSNA | EfficientNet-B2 | Yes | 0.8539 ± 0.0171 | 0.7403 ± 0.0183 | 0.8941 ± 0.0056 |
| 1 | CQ500 | EfficientNet-B2 | No | 0.8801 ± 0.0165 | 0.7135 ± 0.0252 | 0.9227 ± 0.0128 |
| 1 | CQ500 | EfficientNet-B2 | Yes | 0.8884 ± 0.0205 | 0.7271 ± 0.0350 | 0.9301 ± 0.0081 |
| 2 | RSNA | EfficientNet-B2 | No | 0.7921 ± 0.0256 | 0.4255 ± 0.0215 | 0.9188 ± 0.0031 |
| 2 | RSNA | EfficientNet-B2 | Yes | 0.8028 ± 0.0215 | 0.4486 ± 0.0288 | 0.9219 ± 0.0030 |
| 2 | CQ500 | EfficientNet-B2 | No | 0.8523 ± 0.0257 | 0.5995 ± 0.0371 | 0.9261 ± 0.0202 |
| 2 | CQ500 | EfficientNet-B2 | Yes | 0.8793 ± 0.0210 | 0.6463 ± 0.0251 | 0.9443 ± 0.0054 |
| 3 | RSNA | EfficientNet-B2 | No | 0.9000 ± 0.0207 | 0.6655 ± 0.0361 | 0.9722 ± 0.0038 |
| 3 | RSNA | EfficientNet-B2 | Yes | 0.9113 ± 0.0281 | 0.7031 ± 0.0669 | 0.9816 ± 0.0063 |
| 3 | CQ500 | EfficientNet-B2 | No | 0.8847 ± 0.0262 | 0.6806 ± 0.0497 | 0.9432 ± 0.0197 |
| 3 | CQ500 | EfficientNet-B2 | Yes | 0.9113 ± 0.0135 | 0.7337 ± 0.0322 | 0.9690 ± 0.0027 |
Table 15. Robustness analysis of the proposed model under additive Gaussian noise across three experimental scenarios on the RSNA and CQ500 datasets.
| Scenario | Dataset | Noise Level | Accuracy | Sensitivity | AUC-ROC | AUC-ROC (Drop) |
|---|---|---|---|---|---|---|
| 1 | RSNA | σ = 0.00 | 0.8596 | 0.7308 | 0.9008 | – |
| 1 | RSNA | σ = 0.01 | 0.8455 | 0.8173 | 0.8967 | 0.0040 |
| 1 | RSNA | σ = 0.02 | 0.8202 | 0.7692 | 0.8642 | 0.0366 |
| 1 | RSNA | σ = 0.03 | 0.7584 | 0.5673 | 0.8044 | 0.0963 |
| 1 | CQ500 | σ = 0.00 | 0.9129 | 0.8587 | 0.9497 | – |
| 1 | CQ500 | σ = 0.01 | 0.8942 | 0.9130 | 0.9539 | −0.0043 |
| 1 | CQ500 | σ = 0.02 | 0.8444 | 0.8696 | 0.9275 | 0.0222 |
| 1 | CQ500 | σ = 0.03 | 0.7801 | 0.6087 | 0.8223 | 0.1273 |
| 2 | RSNA | σ = 0.00 | 0.7978 | 0.9333 | 0.9208 | – |
| 2 | RSNA | σ = 0.01 | 0.7472 | 0.9667 | 0.9168 | 0.0040 |
| 2 | RSNA | σ = 0.02 | 0.7303 | 0.9333 | 0.8800 | 0.0408 |
| 2 | RSNA | σ = 0.03 | 0.7781 | 0.7000 | 0.8346 | 0.0862 |
| 2 | CQ500 | σ = 0.00 | 0.8900 | 0.9508 | 0.9574 | – |
| 2 | CQ500 | σ = 0.01 | 0.8568 | 0.9836 | 0.9607 | −0.0033 |
| 2 | CQ500 | σ = 0.02 | 0.8050 | 0.9180 | 0.9361 | 0.0213 |
| 2 | CQ500 | σ = 0.03 | 0.7842 | 0.7213 | 0.8398 | 0.1176 |
| 3 | RSNA | σ = 0.00 | 0.9149 | 0.9333 | 0.9783 | – |
| 3 | RSNA | σ = 0.01 | 0.8688 | 0.9667 | 0.9753 | 0.0030 |
| 3 | RSNA | σ = 0.02 | 0.8511 | 0.9333 | 0.9519 | 0.0265 |
| 3 | RSNA | σ = 0.03 | 0.8227 | 0.7000 | 0.8902 | 0.0881 |
| 3 | CQ500 | σ = 0.00 | 0.9290 | 0.9508 | 0.9754 | – |
| 3 | CQ500 | σ = 0.01 | 0.9024 | 0.9836 | 0.9782 | −0.0028 |
| 3 | CQ500 | σ = 0.02 | 0.8514 | 0.9344 | 0.9595 | 0.0159 |
| 3 | CQ500 | σ = 0.03 | 0.8071 | 0.7213 | 0.8554 | 0.1200 |
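The perturbation behind Table 15 (and Figure 18) can be reproduced with a few lines of NumPy. This is a sketch: the [0, 1] intensity range and the clipping step are our assumptions about how the normalized images were handled.

```python
import numpy as np

def add_gaussian_noise(img, sigma, seed=None):
    """Additive zero-mean Gaussian noise on a normalized image, with the
    result clipped back to the valid [0, 1] intensity range."""
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

# A flat mid-gray 512 x 512 "slice" at the strongest noise level tested.
img = np.full((512, 512), 0.5)
noisy = add_gaussian_noise(img, sigma=0.03, seed=0)
print(noisy.shape, float(noisy.std()))  # std close to sigma = 0.03
```

Applying the same function with σ = 0.01, 0.02, and 0.03 to the test images reproduces the three corruption levels evaluated in the robustness study.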
Table 16. Comparison of model complexity between baseline architectures, the proposed model, and a high-capacity reference model.
| Model | Input Size | ~Params (M) | ~FLOPs (G) |
|---|---|---|---|
| EfficientNet-B2 | 260 × 260 | 7.7 | 1.0 |
| Swin Transformer | 384 × 384 | 88 | 47 |
| Proposed Model | 512 × 512 | 8.1 | 3.3 |
Table 17. Params (M) and FLOPs (G) comparison of DL models at fixed 512 × 512 input size.
| Model | Input Size | ~Params (M) | ~FLOPs (G) |
|---|---|---|---|
| ResNet-18 | 512 × 512 | 11.2 | 9.5 |
| ResNet-50 | 512 × 512 | 23.5 | 21.6 |
| MobileNetV3-Large | 512 × 512 | 5.4 | 1.2 |
| EfficientNet-B0 | 512 × 512 | 4.0 | 2.0 |
| EfficientNet-B1 | 512 × 512 | 6.5 | 3.0 |
| EfficientNet-B2 | 512 × 512 | 7.7 | 3.4 |
| EfficientNet-B3 | 512 × 512 | 10.7 | 5.0 |
| Swin Transformer | 384 × 384 | 88 | – |
| MaxVit-Tiny | 512 × 512 | 30.9 | – |
| Proposed Model | 512 × 512 | 8.1 | 3.3 |
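The jump from Table 4 to Table 17 is dominated by the roughly quadratic dependence of convolutional FLOPs on spatial resolution: for EfficientNet-B2, (512/260)² ≈ 3.9, close to the reported increase from ~1.0 G to ~3.4 G (the remaining gap is plausibly due to rounding, strides, and resolution-independent layers). A toy per-layer count, a sketch using the common convention of 2 FLOPs per multiply-accumulate:

```python
def conv2d_flops(c_in, c_out, k, h_out, w_out):
    """FLOPs of one 2-D convolution: 2 ops per MAC, over every output pixel."""
    return 2 * c_in * c_out * k * k * h_out * w_out

# Doubling the spatial resolution quadruples a conv layer's FLOPs,
# since the output feature map has four times as many positions.
ratio = conv2d_flops(3, 32, 3, 512, 512) / conv2d_flops(3, 32, 3, 256, 256)
print(ratio)  # 4.0
```

Parameter counts, by contrast, are independent of input size, which is why the ~Params column is unchanged between the two tables.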
Table 18. Comparison of the proposed model with existing literature for critical MLSs (>5 mm). (For fair comparison with prior studies that report single-run results, representative runs are additionally provided. Mean ± standard deviation values are reported to demonstrate reproducibility.)
| Study | Year | Dataset | Acc. | Sensitivity | Specificity | AUC-ROC | Positive/Negative Definition |
|---|---|---|---|---|---|---|---|
| Wang et al. [9] | 2017 | Specific | 82.93 (automatic); 87.80 (manual calibration) | – | – | 95.77 | >5 mm positive |
| Chilamkurthy et al. [14] | 2018 | CQ500 | – | 0.9385; 0.9077 | 0.8944; 0.9108 | 0.9697 | >5 mm positive |
| Yan et al. [17] | 2022 | Specific | – | 0.875 | 0.967 | – | >5 mm positive |
| Our study (mean ± std) | 2025 | RSNA | 0.8028 ± 0.0215 | 0.9467 ± 0.0340 | 0.7896 ± 0.0228 | 0.9219 ± 0.0030 | >5 mm positive, 0–5 mm negative |
| | | CQ500 | 0.8793 ± 0.0210 | 0.8623 ± 0.0525 | 0.8817 ± 0.0316 | 0.9443 ± 0.0054 | >5 mm positive, 0–5 mm negative |
| Our study (representative single run) | 2025 | RSNA | 0.7978 | 0.9333 | 0.7853 | 0.9208 | >5 mm positive, 0–5 mm negative |
| | | CQ500 | 0.8900 | 0.9508 | 0.8812 | 0.9574 | >5 mm positive, 0–5 mm negative |
| Our study (mean ± std) | 2025 | RSNA | 0.9113 ± 0.0281 | 0.9600 ± 0.0365 | 0.9056 ± 0.0338 | 0.9816 ± 0.0063 | >5 mm positive, 0 mm negative |
| | | CQ500 | 0.9113 ± 0.0135 | 0.8984 ± 0.0123 | 0.9133 ± 0.0143 | 0.9690 ± 0.0027 | >5 mm positive, 0 mm negative |
| Our study (representative single run) | 2025 | RSNA | 0.9149 | 0.9333 | 0.9127 | 0.9783 | >5 mm positive, 0 mm negative |
| | | CQ500 | 0.9290 | 0.9508 | 0.9256 | 0.9754 | >5 mm positive, 0 mm negative |
Table 19. Comparison of the proposed model with existing literature for binary MLS classification (MLS vs. no-MLS). (For fair comparison with prior studies that report single-run results, representative runs are additionally provided. Mean ± standard deviation values are reported to demonstrate reproducibility.)
| Study | Year | Dataset | Acc. | Sensitivity | Specificity | AUC-ROC | Positive/Negative Definition |
|---|---|---|---|---|---|---|---|
| Agrawal et al. [20] | 2024 | Specific | 55.0 | 0.40 | 0.70 | – | MLS positive, no-MLS negative |
| Our study (mean ± std) | 2025 | RSNA | 0.8539 ± 0.0171 | 0.7096 ± 0.0186 | 0.9135 ± 0.0308 | 0.8941 ± 0.0056 | MLS positive, no-MLS negative |
| | | CQ500 | 0.8884 ± 0.0205 | 0.7739 ± 0.0437 | 0.9154 ± 0.0279 | 0.9301 ± 0.0081 | MLS positive, no-MLS negative |
| Our study (representative single run) | 2025 | RSNA | 0.8596 | 0.7308 | 0.9127 | 0.9008 | MLS positive, no-MLS negative |
| | | CQ500 | 0.9129 | 0.8587 | 0.9256 | 0.9497 | MLS positive, no-MLS negative |
Share and Cite

MDPI and ACS Style

Gençtürk, T.H.; Kaya, İ.; Kaya Gülağız, F. Deepening the Diagnosis: Detection of Midline Shift Using an Advanced Deep Learning Architecture. Appl. Sci. 2026, 16, 890. https://doi.org/10.3390/app16020890
