This section details the comprehensive methodology employed to design, optimize, and validate the proposed Cloud-to-Edge deployment framework for 3D ischemic stroke lesion segmentation. The study is structured across five key pillars: (i) the characterization of the multi-center ISLES 2022 dataset and the selection of clinical MRI modalities; (ii) the architectural configuration and cloud-based training of the nnU-Net v2 framework; (iii) the multi-stage optimization pipeline used to transform research-grade models into hardware-aware inference engines via ONNX and TensorRT; (iv) the technical specifications and software environment of the embedded NVIDIA Jetson deployment platform; and (v) the dual-evaluation approach encompassing both voxel-level segmentation accuracy and hardware-level performance metrics. By integrating these components, the methodology ensures a rigorous assessment of the transition from high-performance computing environments to resource-constrained clinical edge devices.
3.1. Dataset
The experiments conducted in this study are based on the ISLES 2022 dataset, introduced by Hernandez Petzsche et al. [17], which provides a multi-center benchmark for automatic segmentation of acute and sub-acute ischemic stroke lesions from multimodal magnetic resonance imaging (MRI). The dataset was specifically designed to evaluate segmentation algorithms under heterogeneous clinical acquisition conditions and to promote the development of robust and generalizable methods.
ISLES 2022 comprises a total of 400 MRI cases collected from three independent stroke centers: the University Hospital of the Technical University of Munich (Germany), the University Hospital of Bern (Switzerland), and the University Medical Center Hamburg-Eppendorf (Germany). In the official challenge design, the dataset is divided into 250 publicly available training cases and 150 hidden test cases reserved for external challenge evaluation. Since the official hidden test labels are not publicly available, the present study uses only the publicly available training cohort. This cohort was further divided into 200 cases for training, 25 cases for validation, and 25 cases for internal testing, ensuring that the same held-out internal test subset was used consistently for all Cloud and Jetson inference comparisons.
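The 200/25/25 partition of the 250 public cases can be illustrated with a short sketch. The text does not state how the split was drawn, so the seeded shuffle, the seed value, and the `caseNNN` identifiers below are assumptions made purely for illustration.

```python
import random

def split_cases(case_ids, n_train=200, n_val=25, n_test=25, seed=0):
    """Shuffle case IDs once with a fixed seed and cut them into three disjoint subsets."""
    if n_train + n_val + n_test != len(case_ids):
        raise ValueError("subset sizes must sum to the number of cases")
    shuffled = sorted(case_ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Hypothetical identifiers for the 250 public training cases.
case_ids = [f"case{i:03d}" for i in range(1, 251)]
train, val, test = split_cases(case_ids)
print(len(train), len(val), len(test))
```

Fixing the seed and storing the resulting test list is one simple way to guarantee that every inference configuration is evaluated on the same 25 held-out cases.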
All subjects were adults who underwent MRI as part of routine clinical stroke evaluation. Image acquisition was performed using scanners from multiple vendors and field strengths (1.5 T and 3 T), resulting in considerable variability in spatial resolution and acquisition parameters. In-plane voxel spacing ranges approximately between and , while slice thickness varies from to depending on center and modality. Differences in repetition time (TR), echo time (TE), and inversion time (TI) further increase inter-site heterogeneity. This variability reflects real-world clinical practice rather than standardized research protocols.
Each case includes three MRI modalities routinely used in hyperacute stroke assessment: (i) Diffusion-Weighted Imaging (DWI), (ii) Apparent Diffusion Coefficient (ADC) maps and (iii) Fluid-Attenuated Inversion Recovery (FLAIR).
DWI highlights regions of restricted diffusion associated with acute ischemic injury. ADC maps provide quantitative confirmation of diffusion restriction and help distinguish true infarction from T2 shine-through effects. Although FLAIR sequences are available for every case, this study uses only the diffusion modalities (DWI and ADC) as model inputs because of their higher sensitivity in the hyperacute phase of ischemic stroke.
Lesion annotations were generated following a structured multi-stage labeling protocol. After anonymization and format standardization, preliminary lesion masks were produced using a previously trained 3D U-Net model. These initial segmentations were manually corrected by trained annotators, reviewed by neuroradiology residents, and ultimately validated by senior attending neuroradiologists with extensive experience in stroke imaging. In cases where automated pre-segmentation was insufficient, lesions were delineated manually. All annotations were performed by jointly inspecting DWI, ADC, and FLAIR sequences to ensure accurate infarct boundary definition. Lesions were primarily identified as hyperintense regions on DWI with corresponding hypointensity on ADC, consistent with restricted diffusion. FLAIR images were used during the annotation process to provide contextual confirmation and to differentiate acute lesions from chronic findings or imaging artifacts. The final ground-truth masks are binary volumetric segmentations representing infarcted tissue [17].
The dataset encompasses a broad range of infarct sizes, vascular territories, and anatomical locations, including middle cerebral artery (MCA), anterior cerebral artery (ACA), posterior cerebral artery (PCA), and infratentorial strokes. Posterior circulation cases were intentionally represented in higher proportion due to their increased segmentation difficulty. This composition, combined with the inherent multi-center variability, makes the dataset particularly suitable for evaluating models intended for deployment in heterogeneous clinical environments.
3.2. Model Architecture and Cloud Training
The segmentation model used in this work is based on the 3D full-resolution configuration of the nnU-Net v2 framework. Rather than relying on manually designed architectures or fixed hyperparameters, nnU-Net v2 analyzes the properties of the dataset and automatically derives an optimized network configuration and training protocol [8]. This design philosophy is particularly advantageous for heterogeneous clinical datasets such as ISLES 2022, where spatial resolution and intensity distributions vary across centers.
Only Diffusion-Weighted Imaging (DWI) and Apparent Diffusion Coefficient (ADC) maps were used as input modalities in this study. This choice is motivated by the clinical relevance of diffusion imaging in the hyperacute phase of ischemic stroke, where diffusion restriction represents the primary biomarker of infarcted tissue. Acute infarcts typically appear hyperintense in DWI and hypointense in ADC, providing complementary information that facilitates lesion identification. FLAIR, although available in the dataset, was not included because FLAIR signal alterations often appear later in stroke evolution and the sequence may remain negative in the hyperacute phase. Focusing on DWI and ADC therefore prioritizes the most informative modalities for early infarct detection while reducing input complexity for the segmentation model.
To enable multimodal learning, the network was configured with two input channels corresponding to the selected diffusion modalities. Following the nnU-Net v2 data structure convention, each case was stored using a channel-wise indexing scheme (caseXXX_0000 for the DWI volume and caseXXX_0001 for the ADC map). The two volumes were combined into a two-channel 3D input tensor, where DWI was assigned to channel 0 and ADC to channel 1, allowing the model to jointly exploit their complementary diffusion patterns associated with acute ischemic injury.
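As an illustration of the channel-wise naming scheme, the sketch below generates the per-channel file names and a minimal `dataset.json`-style dictionary. The `.nii.gz` file ending, the label name `lesion`, and the helper function names are assumptions; only the channel order (DWI = 0, ADC = 1) is taken from the text.

```python
import json

CHANNELS = {0: "DWI", 1: "ADC"}  # channel order stated in the text

def channel_filename(case_id, channel, file_ending=".nii.gz"):
    """Return an nnU-Net v2 style file name '<case>_<4-digit channel><ending>'."""
    return f"{case_id}_{channel:04d}{file_ending}"

def build_dataset_json(num_training):
    """Assemble a minimal dataset.json-like dictionary for a two-channel binary task."""
    return {
        "channel_names": {str(idx): name for idx, name in CHANNELS.items()},
        "labels": {"background": 0, "lesion": 1},  # label name is an assumption
        "numTraining": num_training,
        "file_ending": ".nii.gz",
    }

print(channel_filename("case001", 0))
print(channel_filename("case001", 1))
print(json.dumps(build_dataset_json(200), indent=2))
```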
Figure 2 illustrates representative axial slices from the ISLES 2022 dataset, showing the different imaging modalities available in the dataset, including DWI, ADC, and FLAIR, together with the corresponding ground-truth lesion annotation. The model is trained using a subset of these modalities.
All volumes were processed using the preprocessing configuration generated by nnU-Net v2. DWI and ADC volumes were independently normalized using z-score normalization and resampled to an isotropic spacing of 1.0 × 1.0 × 1.0 mm. After foreground cropping and padding when required, the experiment planner selected a full-resolution 3D setup using cubic patches of 128 × 128 × 128 voxels and a batch size of 2.
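The independent per-modality z-score normalization can be sketched as follows. This is a simplified stand-in rather than the nnU-Net v2 implementation: it computes the statistics over the whole volume, and the synthetic intensities and the function name are assumptions.

```python
import numpy as np

def zscore_per_channel(volume, eps=1e-8):
    """Normalize each channel of a (C, D, H, W) array to zero mean and unit variance."""
    out = np.empty_like(volume, dtype=np.float32)
    for c in range(volume.shape[0]):
        channel = volume[c].astype(np.float32)
        # Guard against division by zero for constant-valued channels.
        out[c] = (channel - channel.mean()) / max(float(channel.std()), eps)
    return out

rng = np.random.default_rng(0)
# Synthetic stand-in for a DWI/ADC pair; the small size keeps the example fast.
volume = np.stack([
    rng.normal(300.0, 50.0, size=(16, 16, 16)),   # "DWI"-like intensities
    rng.normal(800.0, 120.0, size=(16, 16, 16)),  # "ADC"-like intensities
])
normalized = zscore_per_channel(volume)
print(normalized.shape, normalized.dtype)
```

Normalizing each channel separately keeps the very different intensity scales of DWI and ADC from dominating one another at the network input.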
The fixed 128 × 128 × 128 input size corresponds to the preprocessed nnU-Net space used for training, validation, and controlled deployment experiments. This fixed-shape representation was also used during ONNX export and TensorRT engine generation to ensure compatibility with hardware-specific optimization. Therefore, the same spatial input dimensions were preserved across PyTorch, ONNX, and TensorRT inference.
To avoid introducing preprocessing-related differences between inference environments, all configurations were evaluated using the same preprocessed input tensors and the same postprocessing procedure. Consequently, differences between the cloud FP32 baseline and the embedded TensorRT configurations cannot be attributed to different preprocessing pipelines.
Segmentation labels were handled using a label-preserving procedure. When interpolation generated non-binary intermediate values, the resulting masks were converted back to binary labels using an argmax or fixed-threshold operation before metric computation. Therefore, all reported segmentation metrics were computed on binary volumetric masks.
Although this fixed-shape strategy enabled reproducible TensorRT deployment, it may limit direct applicability to larger or differently cropped full clinical volumes. In particular, the possibility that fixed-shape preprocessing may affect peripheral lesions or unusual lesion locations cannot be fully excluded and is therefore acknowledged as a limitation of the present deployment study.
The resulting architecture follows a six-stage 3D encoder–decoder structure. In the encoder pathway, spatial resolution is progressively reduced through strided 3D convolutions, while the number of feature channels increases to capture increasingly abstract contextual information. The progression of feature maps across resolution levels is , providing sufficient representational capacity without excessive memory growth. Each stage consists of two consecutive convolutional layers followed by Instance Normalization and LeakyReLU activation. Downsampling is achieved via convolutional strides of after the first level.
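To make the six-stage layout concrete, the following sketch computes the feature-map edge length at each encoder stage for a 128-voxel patch. It assumes that the first stage keeps full resolution and that each subsequent stage halves every axis; that stride value is an assumption of the sketch, not a figure quoted from the text.

```python
def stage_sizes(patch_size=128, num_stages=6, stride=2):
    """Edge length of the feature map at each encoder stage.

    Assumes the first stage keeps full resolution and every later stage
    divides each axis by `stride` (an assumption of this sketch).
    """
    sizes = [patch_size]
    for _ in range(num_stages - 1):
        if sizes[-1] % stride != 0:
            raise ValueError("patch size is not divisible by the cumulative stride")
        sizes.append(sizes[-1] // stride)
    return sizes

sizes = stage_sizes()
print(sizes)  # [128, 64, 32, 16, 8, 4]
```

Under these assumptions the bottleneck operates on 4 × 4 × 4 feature maps, which is why the patch size must be divisible by the cumulative downsampling factor.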
The decoder mirrors the encoder structure, restoring spatial resolution through transposed convolutions and integrating encoder features via skip connections. These skip connections preserve fine-grained spatial information, which is essential for accurate delineation of ischemic lesions that may present irregular and poorly contrasted boundaries. Each decoder level applies two convolutional layers to refine reconstructed feature maps before the final segmentation layer.
From the 250 publicly available training cases provided by the ISLES 2022 dataset, a further internal split was performed to allow controlled experimentation. Specifically, 200 cases were used for network training, 25 cases were reserved for validation during optimization, and 25 cases were held out as an internal test subset, as summarized in Table 2. The official ISLES 2022 hidden test set was not used because its ground-truth annotations are not publicly available for independent metric computation. The 25 internal test volumes were never exposed to the model during training or validation. The purpose of this subset was not to replace the official ISLES hidden test set or to establish external clinical generalization, but to enable a controlled paired comparison in which the cloud-based PyTorch FP32 baseline and the embedded TensorRT FP32 and FP16 configurations were evaluated under identical input conditions.
Preprocessing was automatically determined by nnU-Net v2. Both modalities were independently normalized using z-score normalization. Image volumes were resampled using third-order interpolation, while segmentation masks were resampled using first-order interpolation to maintain label consistency. During training, on-the-fly data augmentation was applied, including spatial transformations and intensity perturbations, improving robustness to inter-center variability.
Training was conducted in a cloud-based high-performance computing environment equipped with an NVIDIA A100 GPU (40 GB VRAM). The availability of large GPU memory enabled full-resolution 3D training without reducing patch size or network depth. Optimization followed the standard nnU-Net v2 protocol using stochastic gradient descent (SGD) with Nesterov momentum and a polynomial learning-rate decay schedule. A compound Dice and cross-entropy loss function was employed to address class imbalance, as ischemic lesions typically occupy a small fraction of the total brain volume. Training was performed for 120 epochs, after which validation performance plateaued and further improvements became negligible. No manual modifications were introduced to the default nnU-Net v2 3D full-resolution architecture beyond the configuration automatically derived from the dataset fingerprint.
For the deployment experiments, the model trained on fold 0 was selected as the reference for all inference configurations. The use of a single predefined fold was a deliberate design choice to enable a controlled paired comparison across different hardware and optimization backends. By keeping the trained weights, the fixed input shape, and the postprocessing procedure strictly identical, we isolated the impact of the Cloud-to-Edge transition from potential variations arising from multi-fold ensemble averaging.
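The polynomial learning-rate decay mentioned for training can be written as a one-line schedule. The initial learning rate of 0.01 and the exponent of 0.9 used below are assumed values that are not stated in the text; only the 120-epoch horizon comes from the description above.

```python
def poly_lr(epoch, max_epochs=120, initial_lr=1e-2, exponent=0.9):
    """Polynomial decay: initial_lr * (1 - epoch / max_epochs) ** exponent."""
    if not 0 <= epoch <= max_epochs:
        raise ValueError("epoch must lie in [0, max_epochs]")
    return initial_lr * (1.0 - epoch / max_epochs) ** exponent

# Learning rate at the start, every 30 epochs, and at the final epoch.
schedule = [poly_lr(epoch) for epoch in range(0, 121, 30)]
for epoch, lr in zip(range(0, 121, 30), schedule):
    print(f"epoch {epoch:3d}: lr = {lr:.6f}")
```

With an exponent below 1 the rate falls almost linearly for most of training and drops to zero only at the final epoch.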
The combined loss function used during training can be expressed as:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{Dice}} + \mathcal{L}_{\mathrm{CE}}$$
where $\mathcal{L}_{\mathrm{Dice}}$ denotes the soft Dice loss and $\mathcal{L}_{\mathrm{CE}}$ represents the voxel-wise cross-entropy loss. The Dice component directly optimizes overlap between predicted and ground-truth segmentations, while the cross-entropy term stabilizes training by providing voxel-level supervision.
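A compound loss of this kind, a soft Dice term plus a voxel-wise cross-entropy term, can be sketched in NumPy for the binary case. The equal weighting of the two terms, the smoothing constants, and the toy lesion are assumptions of this sketch; it is not the nnU-Net v2 implementation.

```python
import numpy as np

def soft_dice_loss(prob, target, eps=1e-5):
    """1 - soft Dice between foreground probabilities and a binary target."""
    intersection = float(np.sum(prob * target))
    denominator = float(np.sum(prob) + np.sum(target))
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

def cross_entropy_loss(prob, target, eps=1e-7):
    """Mean voxel-wise binary cross-entropy."""
    p = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def combined_loss(prob, target):
    """Unweighted sum of the two terms (equal weighting is an assumption)."""
    return soft_dice_loss(prob, target) + cross_entropy_loss(prob, target)

target = np.zeros((8, 8, 8), dtype=np.float32)
target[2:4, 2:4, 2:4] = 1.0               # small lesion: 8 of 512 voxels
good = np.where(target == 1, 0.9, 0.1)    # confident and mostly correct
bad = np.where(target == 1, 0.1, 0.9)     # confident and mostly wrong
print(combined_loss(good, target), combined_loss(bad, target))
```

Because the lesion occupies under 2% of the toy volume, the cross-entropy term alone would be dominated by background voxels; the Dice term keeps the small foreground region influential.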
During development, nnU-Net v2’s standard cross-validation scheme was available. However, a single predefined split (fold 0) was used in this study to maintain a fixed and reproducible test subset for all deployment configurations. This design allowed paired comparison of PyTorch FP32, TensorRT FP32, and TensorRT FP16 inference on exactly the same cases. Nevertheless, the use of a single split provides a less robust estimate of generalization than full cross-validation and is therefore explicitly acknowledged as a limitation.
This cloud-based training stage ensured that the architecture was fully adapted to the spatial characteristics of the ISLES 2022 dataset while leveraging high-memory GPU resources. The trained network then served as the starting point for the subsequent optimization and deployment steps within the proposed Cloud-to-Edge workflow.
3.3. Optimization Pipeline and Cloud-to-Edge Deployment
Once the nnU-Net v2 model was fully trained in the cloud environment, the next stage consisted of transforming the research-grade model into a hardware-aware deployment artifact suitable for execution on resource-constrained embedded devices. This transformation required a structured optimization pipeline comprising three main stages: (i) export to an intermediate representation (ONNX), (ii) TensorRT-based engine optimization, and (iii) deployment and validation on NVIDIA Jetson platforms. The overall cloud-to-edge transformation workflow adopted in this study is illustrated in Figure 3.
3.3.1. Model Export to ONNX
The trained PyTorch nnU-Net v2 model was first converted into the Open Neural Network Exchange (ONNX) format using the native PyTorch exporter (torch.onnx.export). ONNX provides a framework-agnostic intermediate representation that decouples model definition from the original deep learning framework, enabling subsequent hardware-specific optimizations. During export, the model input shape was defined using a fixed volumetric size of 128 × 128 × 128 voxels, corresponding to the patch size automatically configured by nnU-Net v2 during training. Although ONNX supports dynamic input dimensions, fixed shapes were deliberately used in this work to simplify the deployment pipeline and maximize compatibility with hardware-specific optimizers.
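A fixed-shape export of this kind might look like the sketch below. It assumes that the trained network is already available as a `torch.nn.Module`; the batch size of 1, the opset version, the tensor names, and the helper names are illustrative assumptions, and loading the nnU-Net checkpoint is not shown.

```python
INPUT_SHAPE = (1, 2, 128, 128, 128)  # (batch, channels, depth, height, width)

def export_to_onnx(model, onnx_path, opset_version=17):
    """Export a PyTorch module to ONNX with a fixed input shape.

    `model` is assumed to be the trained network loaded as a torch.nn.Module.
    """
    import torch  # imported lazily so the shape helpers work without PyTorch

    model.eval()
    dummy_input = torch.zeros(INPUT_SHAPE, dtype=torch.float32)
    with torch.no_grad():
        torch.onnx.export(
            model,
            dummy_input,
            onnx_path,
            input_names=["input"],
            output_names=["logits"],
            opset_version=opset_version,
            # No dynamic_axes argument is passed, so every dimension stays fixed.
        )
    return onnx_path

def num_input_values(shape=INPUT_SHAPE):
    """Total number of scalar values in one fixed-shape input tensor."""
    total = 1
    for dim in shape:
        total *= dim
    return total

print(num_input_values())  # 4194304
```

Omitting `dynamic_axes` is what pins the spatial dimensions in the exported graph, which matches the fixed-shape strategy described above.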
All convolutional layers, normalization operations, activation functions, and skip connections were preserved in the computational graph. Particular attention was given to maintaining numerical consistency between the PyTorch implementation and the exported ONNX graph to avoid discrepancies in inference outputs. After export, the ONNX model was validated through runtime inference checks using representative input volumes, confirming that the exported model produced consistent segmentation outputs and preserved the functional behavior of the original PyTorch implementation.
Importantly, the TensorRT inference engine was subsequently built using the same fixed input dimensions (128 × 128 × 128). Building the engine for a fixed spatial resolution allows TensorRT to perform more aggressive kernel selection, memory planning, and layer fusion, which significantly improves execution efficiency on resource-constrained embedded hardware such as the NVIDIA Jetson Xavier NX.
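One way to quantify the consistency between two inference backends is to compare their raw outputs on the same input. The sketch below reports the largest absolute logit difference and the fraction of voxels that receive the same label; the function name and the synthetic arrays standing in for PyTorch and ONNX outputs are assumptions.

```python
import numpy as np

def compare_outputs(reference, candidate):
    """Summarize agreement between two logit arrays of shape (classes, D, H, W)."""
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    # Label agreement: fraction of voxels whose argmax class is identical.
    same_label = np.argmax(reference, axis=0) == np.argmax(candidate, axis=0)
    return {"max_abs_diff": max_abs_diff, "label_agreement": float(np.mean(same_label))}

rng = np.random.default_rng(0)
reference = rng.normal(size=(2, 8, 8, 8)).astype(np.float32)  # stand-in for PyTorch logits
candidate = reference + 1e-5                                   # stand-in for ONNX logits
report = compare_outputs(reference, candidate)
print(report)
```

Reporting label agreement alongside the numeric difference is useful because small logit deviations only matter clinically when they change the predicted mask.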
3.3.2. TensorRT Optimization
The ONNX (Open Neural Network Exchange, Microsoft Corporation, Redmond, WA, USA) model was subsequently optimized using NVIDIA TensorRT (NVIDIA Corporation, Santa Clara, CA, USA), a high-performance inference engine designed to maximize throughput and minimize latency on NVIDIA GPUs. TensorRT performs graph-level optimizations, kernel auto-tuning, and memory planning to generate an execution engine tailored to the target hardware architecture.
Two precision modes were evaluated on the embedded platform: (i) FP32 (single precision), where the model is executed using 32-bit floating-point arithmetic and TensorRT applies graph optimizations such as layer fusion while preserving full numerical precision; (ii) FP16 (half precision), where model weights and activations are converted to 16-bit floating-point representation to reduce memory footprint and increase throughput by leveraging hardware-specific Tensor Cores.
In both configurations, TensorRT performs automatic layer fusion, combining adjacent convolution, normalization, and activation operations into single optimized kernels. This reduces memory transfers between intermediate feature maps and improves cache utilization. Additionally, TensorRT optimizes memory allocation by precomputing buffer reuse strategies, minimizing dynamic memory overhead during inference. The FP16 configuration further reduces the memory footprint of intermediate tensors and model parameters, enabling faster execution and lower energy consumption. Importantly, since FP16 maintains floating-point representation (unlike integer quantization), segmentation accuracy degradation is typically negligible for medical imaging tasks. The impact of reduced precision is experimentally evaluated in Section 4.
Engine building was performed directly on the target Jetson device to ensure that the generated inference engine was fully optimized for the specific GPU architecture, including CUDA cores and Tensor Core availability.
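As a hedged illustration of building one engine per precision mode on the device, the sketch below assembles command lines for NVIDIA's `trtexec` tool. The text does not say whether `trtexec` or the TensorRT builder API was used, so the tool choice, the file names, and the helper function are assumptions.

```python
def build_trtexec_command(onnx_path, engine_path, fp16=False):
    """Return the argument list for a trtexec engine build from an ONNX file."""
    command = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        command.append("--fp16")  # allow half-precision kernels; omit for an FP32 engine
    return command

# Hypothetical file names; one engine per precision mode.
fp32_cmd = build_trtexec_command("model.onnx", "model_fp32.engine")
fp16_cmd = build_trtexec_command("model.onnx", "model_fp16.engine", fp16=True)
print(" ".join(fp32_cmd))
print(" ".join(fp16_cmd))
# On the Jetson itself, a list like this could be passed to subprocess.run(cmd, check=True).
```

Because a serialized engine is tied to the GPU and TensorRT version it was built with, running the build on the target board avoids shipping an engine tuned for different hardware.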
The NVIDIA Jetson Xavier NX (NVIDIA Corporation, Santa Clara, CA, USA) employs a unified memory architecture in which the CPU and GPU share the same 8 GB LPDDR4x memory pool. In this work, explicit manual memory management was not required, as TensorRT internally handles buffer allocation and memory reuse during engine construction and execution. The TensorRT builder automatically plans memory usage for intermediate feature maps and activation tensors, leveraging the unified memory architecture of the Jetson platform to minimize unnecessary data transfers between host and device. This automated memory planning simplifies the deployment pipeline while ensuring efficient utilization of the limited memory resources available on embedded systems.
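A quick calculation shows why precision matters under an 8 GB shared memory pool. The sketch computes the size of a single two-channel 128 × 128 × 128 input tensor in FP32 and FP16; the batch size of 1 and the helper name are assumptions.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}

def tensor_bytes(shape, precision):
    """Size in bytes of a dense tensor with the given shape and precision."""
    elements = 1
    for dim in shape:
        elements *= dim
    return elements * BYTES_PER_ELEMENT[precision]

input_shape = (1, 2, 128, 128, 128)  # batch size 1 is an assumption
for precision in ("fp32", "fp16"):
    mib = tensor_bytes(input_shape, precision) / 2**20
    print(f"{precision}: {mib:.0f} MiB")
# fp32: 16 MiB
# fp16: 8 MiB
```

The input itself is small; intermediate activations dominate. A hypothetical full-resolution feature map with 32 channels would already occupy 256 MiB in FP32, which is why buffer reuse and half precision are relevant on this platform.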
3.4. Hardware Setup
The embedded inference experiments were conducted primarily on the NVIDIA Jetson Xavier NX Developer Kit, which is built around a compact System-on-Module (SoM) designed for edge AI applications. Owing to its high performance-per-watt ratio, the Xavier NX is well suited for portable medical devices and point-of-care diagnostic systems, where computational capability must be balanced against strict power and thermal constraints. The Jetson Xavier NX integrates a 384-core NVIDIA Volta GPU with 48 Tensor Cores, enabling hardware acceleration for both FP32 and FP16 operations. It is equipped with a 6-core NVIDIA Carmel ARMv8.2 64-bit CPU and 8 GB of LPDDR4x memory with a memory bandwidth of 51.2 GB/s. Detailed hardware components and the interconnection bus of the embedded SoM used for the inference experiments are shown in Figure 4.
The device operates within a configurable power envelope of 10 W to 20 W, making it suitable for portable or point-of-care medical scenarios where thermal constraints and energy consumption are critical factors. In contrast to data-center GPUs used during training (e.g., NVIDIA A100), the Xavier NX must manage limited memory resources and reduced computational throughput, which makes hardware-aware optimization essential for volumetric 3D models such as nnU-Net v2. The Volta GPU architecture in the Xavier NX supports mixed-precision computation through Tensor Cores, allowing efficient execution of FP16 operations with minimal accuracy degradation. This capability is particularly relevant in this study, as FP16 TensorRT optimization was leveraged to reduce memory footprint and inference latency while maintaining segmentation fidelity.
For contextual comparison, representative platforms within the NVIDIA Jetson family are summarized in Table 3. The NVIDIA Jetson Nano represents an entry-level platform within the Jetson ecosystem. It integrates a 128-core Maxwell GPU and typically operates within a 5–10 W power range. However, it lacks Tensor Cores and provides limited memory (4 GB), making it significantly less suitable for full-resolution 3D inference of deep volumetric networks. At the higher end of the family, the NVIDIA Jetson Orin series offers substantial performance improvements, incorporating Ampere-based GPUs with a significantly larger number of CUDA cores and Tensor Cores, as well as increased memory capacity. While Orin platforms provide higher throughput and are better suited for computationally demanding AI workloads, their cost and power consumption are correspondingly higher.
The Xavier NX Developer Kit represents an intermediate solution, offering sufficient computational capability to execute optimized 3D nnU-Net v2 inference while maintaining a compact form factor and moderate energy requirements. This balance makes it particularly appropriate for medical edge applications such as mobile stroke units, rural diagnostic systems, or embedded modules integrated into imaging equipment. All inference benchmarks were conducted directly on the Jetson Xavier NX under its maximum performance configuration (20 W, 6-core mode). In this setting, all CPU cores are enabled and GPU frequency scaling is configured to allow sustained peak performance. This configuration provides the highest available computational throughput of the device and was selected to evaluate the upper-bound inference capability of the embedded platform.
The experimental Jetson Xavier NX system was equipped with an NVMe solid-state drive used as the primary storage device for the operating system, model files, and inference scripts. It used the standard active cooling solution supplied with the Developer Kit, with the integrated fan operated under the default automatic control mode of the Jetson platform during the experiments. Active cooling helps mitigate thermal throttling during sustained GPU workloads, allowing the device to maintain stable performance during volumetric inference benchmarks.
Although lower power modes (e.g., 10 W or 15 W) are available to reduce energy consumption, they were not considered in this study, as the objective was to assess the feasibility of real-time volumetric inference under the most favorable hardware conditions. The device ran Ubuntu-based JetPack software with CUDA, cuDNN, and TensorRT libraries configured according to NVIDIA’s embedded AI deployment guidelines. Power consumption measurements were obtained using the integrated monitoring utilities provided by the Jetson platform (e.g., tegrastats) to ensure reproducible benchmarking conditions.
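Readings from `tegrastats` arrive as free-form text lines, so some parsing is needed before they can be averaged. The sketch below extracts RAM usage and the board input power with regular expressions. The exact field layout differs between JetPack releases, and the sample line is a hypothetical example in the general `tegrastats` style, not a measurement from this study.

```python
import re

RAM_PATTERN = re.compile(r"RAM (\d+)/(\d+)MB")
# Current/average input power in mW; some releases append an explicit "mW" suffix.
POWER_PATTERN = re.compile(r"VDD_IN (\d+)(?:mW)?/(\d+)(?:mW)?")

def parse_tegrastats_line(line):
    """Extract RAM usage (MB) and instantaneous board power (mW) from one log line."""
    ram = RAM_PATTERN.search(line)
    power = POWER_PATTERN.search(line)
    return {
        "ram_used_mb": int(ram.group(1)) if ram else None,
        "ram_total_mb": int(ram.group(2)) if ram else None,
        "power_mw": int(power.group(1)) if power else None,
    }

# Hypothetical line; the values are invented for illustration only.
sample = "RAM 3120/7765MB (lfb 200x4MB) CPU [35%@1420,30%@1420] GR3D_FREQ 99% VDD_IN 11250/10980"
stats = parse_tegrastats_line(sample)
print(stats)
```

Logging such lines at a fixed interval during inference and averaging `power_mw` over the run gives a mean power figure that can be paired with the measured latency.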
To ensure reproducibility of the embedded inference experiments, Table 4 summarizes the key software versions used in the Jetson deployment pipeline.
This software configuration corresponds to the official NVIDIA JetPack distribution for the Jetson Xavier NX and ensures compatibility between CUDA, cuDNN, TensorRT, and the PyTorch runtime. By selecting the Xavier NX as the primary deployment target, this study evaluates the feasibility of executing advanced 3D volumetric segmentation models within realistic embedded constraints, bridging the gap between cloud-based training environments and clinical edge deployment.
3.5. Evaluation Metrics
Segmentation performance was quantitatively assessed using a comprehensive set of voxel-level metrics, including Precision, Recall, Accuracy, F1-score, Dice coefficient, Intersection over Union (IoU), and mean Average Precision (mAP). All segmentation metrics were computed by comparing the predicted lesion masks against the corresponding ground-truth annotations across the 25 internal held-out test cases drawn from the public ISLES 2022 training cohort.
To evaluate the spatial agreement between predicted and reference segmentations, two overlap-based metrics widely used in medical image analysis were employed: the Dice Similarity Coefficient (DSC) and the Intersection over Union (IoU). These metrics are defined as:
$$\mathrm{DSC}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}, \qquad \mathrm{IoU}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$
where $X$ denotes the predicted lesion mask and $Y$ represents the corresponding ground-truth segmentation. The Dice coefficient measures the degree of spatial overlap between predicted and reference regions, while IoU applies a stricter penalization to mismatched areas and therefore never exceeds the Dice value for the same prediction.
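The two overlap metrics, Dice as twice the intersection over the summed mask sizes and IoU as intersection over union, can be computed directly from binary masks. The sketch below is a minimal NumPy version; the function name, the toy masks, and the convention for two empty masks are assumptions.

```python
import numpy as np

def dice_and_iou(pred, truth):
    """Dice and IoU for two binary masks of identical shape."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    total = pred.sum() + truth.sum()
    # Convention (an assumption here): two empty masks count as perfect agreement.
    dice = 2.0 * intersection / total if total > 0 else 1.0
    iou = intersection / union if union > 0 else 1.0
    return float(dice), float(iou)

truth = np.zeros((8, 8, 8), dtype=np.uint8)
truth[2:6, 2:6, 2:6] = 1   # 64 lesion voxels
pred = np.zeros_like(truth)
pred[3:7, 2:6, 2:6] = 1    # same size, shifted by one voxel along the first axis
dice, iou = dice_and_iou(pred, truth)
print(dice, iou)  # 0.75 0.6
```

The toy result also illustrates the fixed relationship IoU = DSC / (2 − DSC): 0.75 / 1.25 = 0.6, so the two metrics always rank predictions identically.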
Precision, Recall, Accuracy, and F1-score were computed from voxel-wise confusion matrix statistics, while mean Average Precision (mAP) was obtained by treating voxel-wise softmax probabilities as confidence scores and integrating the precision–recall curve across multiple decision thresholds. Beyond segmentation accuracy, computational performance and resource efficiency were evaluated to characterize the feasibility of edge deployment. Throughput was measured as the number of processed volumes per unit time, reflecting real-time applicability in clinical scenarios. GPU memory usage was monitored to quantify the graphics memory footprint during inference, while system RAM usage was recorded to assess overall memory consumption. Power consumption was also measured to evaluate the energy efficiency of the embedded platform, which is particularly relevant for portable or point-of-care medical applications. Together, these metrics provide a holistic assessment encompassing both segmentation accuracy and hardware-level performance across cloud and edge inference environments.
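Throughput and energy per volume follow from latency and average power. The sketch below times repeated calls after a few untimed warm-up runs and converts mean power and latency into joules per volume. The warm-up and run counts, the function names, and the stand-in workload are assumptions, not the benchmarking protocol of the study.

```python
import time

def benchmark(infer_fn, sample, warmup=3, runs=10):
    """Time repeated calls of infer_fn(sample) after a few untimed warm-up runs."""
    for _ in range(warmup):
        infer_fn(sample)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer_fn(sample)
        latencies.append(time.perf_counter() - start)
    mean_latency = sum(latencies) / len(latencies)
    return {"mean_latency_s": mean_latency, "throughput_vol_per_s": 1.0 / mean_latency}

def energy_per_volume_j(mean_power_w, mean_latency_s):
    """Energy per processed volume in joules: average power times latency."""
    return mean_power_w * mean_latency_s

# Stand-in workload; on the device infer_fn would wrap the TensorRT or PyTorch call.
result = benchmark(lambda values: sum(values), list(range(10000)))
print(result)
```

When timing GPU inference, the wrapped call should block until the device has finished (for example by synchronizing the stream) before the timer stops; otherwise only the kernel launch is measured.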