Improved Repeatability of Mouse Tibia Volume Segmentation in Murine Myelofibrosis Model Using Deep Learning

A murine model of myelofibrosis in tibia was used in a co-clinical trial to evaluate segmentation methods for application of image-based biomarkers to assess disease status. The dataset (32 mice with 157 3D MRI scans including 49 test–retest pairs scanned on consecutive days) was split into approximately 70% training, 10% validation, and 20% test subsets. Two expert annotators (EA1 and EA2) performed manual segmentations of the mouse tibia (EA1: all data; EA2: test and validation). Attention U-net (A-U-net) model performance was assessed for accuracy with respect to the EA1 reference using the average Jaccard index (AJI), volume intersection ratio (AVI), volume error (AVE), and Hausdorff distance (AHD) for four training scenarios: full training, two half-splits, and a single-mouse subset. The repeatability of computer versus expert segmentations for tibia volume of test–retest pairs was assessed by the within-subject coefficient of variation (wCV%). A-U-net models trained on full and half-split training sets achieved similar average accuracy (with respect to EA1 annotations) for the test set: AJI = 83–84%, AVI = 89–90%, AVE = 2–3%, and AHD = 0.5–0.7 mm, exceeding EA2 accuracy: AJI = 81%, AVI = 83%, AVE = 14%, and AHD = 0.3 mm. The A-U-net model repeatability wCV [95% CI]: 3 [2, 5]% was notably better than that of expert annotators EA1: 5 [4, 9]% and EA2: 8 [6, 13]%. The developed deep learning model effectively automates murine bone marrow segmentation with accuracy comparable to human annotators and substantially improved repeatability.


Introduction
Myelofibrosis (MF) is a chronic, ultimately fatal myeloproliferative neoplasm caused by genetic mutations in hematopoietic stem cells that cause systemic inflammation and progressive fibrosis, disrupting normal architecture and composition of the bone marrow [1,2]. Bone marrow biopsy remains the standard of care to assess the primary site of disease in MF [2,3]. These biopsies are painful and inherently suffer from sampling error, as the technique analyzes only millimeter amounts of tissue from one anatomic site (iliac crest) and does not assess heterogeneity of disease [4]. Development of quantitative imaging biomarkers (QIBs) of MF would facilitate comprehensive monitoring of disease progression and therapy response in patients in place of needle biopsies.
The goal of our ongoing co-clinical MF study [5,6] is to develop robust imaging protocols and analysis methods for quantitative imaging biomarkers of myelofibrosis. Investigation of the murine model of MF disease for preclinical testing of new therapeutic interventions [7] allows direct exploration of correlations between MRI and bone marrow pathology as full bone marrow can be extracted and studied under microscope [6]. To determine quantitative thresholds for significant changes in MRI biomarkers related to MF pathology, their measurement errors must be evaluated from repeatability studies [8,9] that include all the acquisition and analysis steps. Establishing QIB confidence intervals allows

Materials and Methods
The current study sought to determine whether A-U-Net DCNN model segmentation of the mouse tibia would be more repeatable than manual segmentation without compromising accuracy. We performed test-retest repeatability and relative accuracy analysis on tibial bone volumes (scanned on two consecutive days) obtained by a deep learning (DL) segmentation model with respect to two human experts. This study design assumes that the tibia volume of a given mouse does not change significantly over the period of 1 day so that repeatability of the segmentation process can be assessed objectively without knowledge of the true tibia volume.

Experimental Design and Dataset
The studied dataset consisted of 3D gradient echo (3D-GRE) MR images for 32 mice (26 diseased and 6 normal controls) with a total of 157 scans. The scans were acquired over a three-month period after induction of myelofibrosis by bone-marrow ablation for female mice with JK1 mutation. Forty-nine test-retest scan pairs were collected on consecutive days to assess repeatability [8,9]. The MRI acquisition parameters are detailed in Appendix A. Conversion from the acquired scanner native 3D magnitude-valued 2dseq-image format to the meta-image header (MHD) ITK format was performed using custom routines in MATLAB (R2016b+, Mathworks, Natick MA, USA). The resulting MHD 3D volumes had a dimension of 128 × 64 × 256 voxels with voxel size of 0.09 × 0.075 × 0.094 mm³ (voxel volume of 6.35 × 10⁻⁴ mm³) and gray-scale intensity bit-depth of 14; indexing from each dimension provided different cross-sectional views (Figure 1, inset).
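The voxel-volume arithmetic above, and the way a tibia volume follows from a segmentation mask, can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the 20,000-voxel mask size is an invented example value, not a measurement from the study.

```python
# Voxel dimensions in mm, as stated in the text.
VOXEL_SIZE_MM = (0.09, 0.075, 0.094)


def voxel_volume_mm3(size=VOXEL_SIZE_MM):
    """Volume of a single voxel in cubic millimeters."""
    dx, dy, dz = size
    return dx * dy * dz


def mask_volume_mm3(n_voxels, size=VOXEL_SIZE_MM):
    """Volume of a binary mask, given its count of foreground voxels."""
    return n_voxels * voxel_volume_mm3(size)


if __name__ == "__main__":
    # About 6.35e-4 mm^3 per voxel, matching the value quoted in the text.
    print(f"voxel volume: {voxel_volume_mm3():.3e} mm^3")
    # Hypothetical mask of 20,000 voxels (illustrative number only).
    print(f"mask volume:  {mask_volume_mm3(20_000):.2f} mm^3")
```

Because the volume is a simple voxel count scaled by a constant, any disagreement between two segmentations of the same scan translates directly into a volume difference.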
The MHD images were used for DL model training and data analysis. The experimental design is shown in Figure 1. The available dataset was split into training (18 diseased mice, five controls: 107 scans), validation (two diseased mice, one control: 17 scans), and test (six diseased mice: 33 scans) subsets to approximately balance by scan number (70%/10%/20%). The validation and test subsets were not altered during evaluation. Due to the small number of normal mice and our intended application for disease segmentation, their scans were included only in the training (21 scans) and validation (six scans) sets. The full training set included 32 test-retest pairs, the validation set contained four pairs, and the test set had 13 pairs. To assess model sensitivity to the training set (TS) size, three training subsets were derived from the full TS (Figure 1). Training split 1 (TS1) and training split 2 (TS2) each contained roughly half of the training data. Each split was balanced on the basis of test-retest pairs. The training single mouse (TSM) subset consisted of a single randomly selected mouse with a long time series (six biweekly scans) and four test-retest pairs.
Tomography 2023, 9, FOR PEER REVIEW 3
Figure 1. Experimental design diagram showing data content (including repeatability scan pairs) for different A-U-net "training" split scenarios (lower left branches). Data subsets that had manual segmentations by Expert Annotator 1 and 2 (EA1 and EA2) available are marked in the middle branches. The inset shows an example of 3D MRI image for mouse tibia with three orthogonal view projection planes.
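As a quick consistency check on the split described above, the following sketch tallies the stated mouse, scan, and test-retest pair counts and computes the scan fractions. The dictionary layout and function names are illustrative assumptions; only the counts come from the text.

```python
# Counts taken from the text: mice (diseased + controls), scans, test-retest pairs.
SPLITS = {
    "training": {"mice": 18 + 5, "scans": 107, "pairs": 32},
    "validation": {"mice": 2 + 1, "scans": 17, "pairs": 4},
    "test": {"mice": 6, "scans": 33, "pairs": 13},
}


def totals(splits):
    """Sum each count over all subsets."""
    keys = ("mice", "scans", "pairs")
    return {k: sum(subset[k] for subset in splits.values()) for k in keys}


def scan_fractions(splits):
    """Fraction of all scans that falls in each subset."""
    n_scans = totals(splits)["scans"]
    return {name: subset["scans"] / n_scans for name, subset in splits.items()}


if __name__ == "__main__":
    print(totals(SPLITS))
    for name, frac in scan_fractions(SPLITS).items():
        print(f"{name}: {100 * frac:.1f}% of scans")
```

The totals recover the 32 mice, 157 scans, and 49 pairs reported for the whole dataset, and the scan fractions come out near the nominal 70%/10%/20%.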
Two expert annotators provided the tibia segmentation contours for the test and validation subsets. The masks contoured by Expert Annotator 1 (EA1), who had 3 years of experience, were used for training and as a reference for model accuracy assessment. Expert Annotator 2 (EA2) was instructed by EA1 (on a single-mouse example) to perform segmentations of the test and validation subsets. Scan and rescan annotations were performed independently. Prior to instruction, EA2 also performed the (unbiased) data subset selection for DL model training, validation, and testing. The reference tibia contours were drawn on the coronal view, typically spanning 15-30 slices; thus, the coronal view was selected for segmentation by the DL model. The dimension of each grayscale MHD image in coronal view was 128 × 256, thus providing 64 2D images per mouse scan.

DL Model Architecture
We selected the A-U-net DL model for tibia segmentation based on the improved performance observed in our preliminary comparison study [24]. Our U-Net was implemented using the PyTorch library (version 1.13.0). The base U-Net model architecture is detailed in [19,25]. Attention U-Net adds attention gates to the U-Net architecture [19,23] to increase the weights for automatically detected important image features, as illustrated in Figure 2. Each MRI slice was treated as an independent image, and each batch of N slices was randomly sampled from all available slices. The size of the input was (N,1,H,W) for a batch size of N, image height H, and width W. Attention U-Net consists of four encoder blocks and four decoder blocks (Figure 2A). The encoder network (contracting path) halves the spatial dimensions and doubles the number of filters after each stage, whereas the decoder network (expanding path) doubles the spatial dimensions and halves the number of filters.
The attention gate (Figure 2B) was described in [23]. Briefly, it takes the input signal (shape: (N,F/2,2H,2W)) and the gating signal (shape: (N,F,H,W)) from two different stages, so the two signals have different sizes. The convolutions applied to the input signal (filter size = 1 × 1, stride = 2, output size = F) and the gating signal (filter size = 1 × 1, stride = 1, output size = F) equalize their sizes. The results are then added, and the rectified linear unit (ReLU) activation function is applied, followed by convolution (filter size = 1 × 1, stride = 1, pad = 1, output size = 1) and sigmoid activation. Next, bilinear interpolation is applied to double the height and width to match that of the input signal, producing an attention coefficient array (N,1,2H,2W). The dot product of the attention coefficients with the input signal is the output from the attention gate that emphasizes the important features for image segmentation in the expanding path of the network (Figure 2A).
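The attention gate steps described above can be written compactly as follows. This is a sketch under stated assumptions rather than the formulation in [23]: the symbols W_x, W_g, and ψ (the three 1 × 1 convolutions), the stride subscript on the convolution operator, and Up_2 (bilinear upsampling by a factor of 2) are notation introduced here; bias terms are omitted; and the final "dot product" is read as an element-wise product broadcast over the channel dimension.

```latex
\begin{aligned}
q &= \mathrm{ReLU}\big(W_x *_{2}\, x + W_g *_{1}\, g\big),
  & q &\in \mathbb{R}^{N \times F \times H \times W},\\
\alpha &= \sigma\big(\psi *_{1}\, q\big),
  & \alpha &\in \mathbb{R}^{N \times 1 \times H \times W},\\
\tilde{\alpha} &= \mathrm{Up}_{2}(\alpha),
  & \tilde{\alpha} &\in \mathbb{R}^{N \times 1 \times 2H \times 2W},\\
\hat{x} &= \tilde{\alpha} \odot x,
  & \hat{x} &\in \mathbb{R}^{N \times F/2 \times 2H \times 2W}.
\end{aligned}
```

Here x is the input signal of shape (N,F/2,2H,2W), g is the gating signal of shape (N,F,H,W), and the subscript on the convolution operator denotes its stride. The stride-2 convolution on x is what brings the two signals to a common size before they are added.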
We experimented with training the Attention U-Net on the complete subset (Figure 1, TS), 50% splits (Figure 1, TS1 and TS2), and a single-mouse subset (Figure 1, TSM) with respect to the EA1 segmentations as reference. The model was trained to minimize the binary cross-entropy loss. We fine-tuned the hyperparameters and selected the best model using the validation dataset for the different training scenarios. The model training typically took up to 120 epochs with 10,500 iterations for TS and a batch size of 40 (6 h for TS, 3 h for TS1 or TS2, and 20 min for TSM using an NVIDIA 1080Ti GPU). The best models were then deployed to the held-out test set, and the automated segmentations were compared to the manual EA1 and EA2 segmentations.
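For readers unfamiliar with the loss named above, the following standard-library sketch shows the usual per-pixel binary cross-entropy averaged over a set of pixels. It illustrates the textbook definition only; the function name and the clamping constant `eps` are assumptions, and the study itself would have relied on the PyTorch implementation rather than code like this.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels.

    probs   -- predicted foreground probabilities, one per pixel
    targets -- reference labels (0 = background, 1 = tibia), one per pixel
    eps     -- clamp to avoid log(0); an illustrative choice
    """
    if len(probs) != len(targets) or not probs:
        raise ValueError("probs and targets must be non-empty and equal length")
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    print(binary_cross_entropy([0.9, 0.8, 0.1, 0.2], labels))  # confident, correct
    print(binary_cross_entropy([0.5, 0.5, 0.5, 0.5], labels))  # uninformative
```

The loss falls toward zero as the predicted probabilities approach the reference labels, which is why minimizing it drives the model output toward the EA1 masks.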

Accuracy and Repeatability Evaluation
The DL model performance was assessed by segmentation accuracy with respect to the EA1 reference and by repeatability from test-retest experiments. Since true tibia volumes were unknown, only "relative accuracy" was assessed by this procedure. To evaluate the relative model accuracy, we used four common performance metrics [19], namely, average Jaccard index (AJI), average volume intersection ratio (AVI), average volume error (AVE), and average Hausdorff distance (AHD), to compare the performance of the models and expert annotators (EA1 and EA2). The metrics are defined in Appendix B. The average of a given metric is the average over the entire set. Student's t-test with Bonferroni correction was applied to compare accuracy metrics (with respect to the EA1 reference) of test-set mouse segmentations for different training scenarios (Figure 1, full training set, TS1, TS2, and TSM), and with EA2. The differences were considered significant for p-value < 0.009.
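Of the four metrics, the Hausdorff distance is the least intuitive, so a small sketch may help. The code below implements the standard symmetric Hausdorff distance between two point sets (for example, contour points in millimeters). It is an assumption that this textbook form matches the study's definition; the function names and the brute-force search are illustrative only.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest neighbor in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


if __name__ == "__main__":
    # Toy contours: the second one extends 3 units beyond the first.
    reference = [(0.0, 0.0), (1.0, 0.0)]
    predicted = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
    print(hausdorff(reference, predicted))  # 3.0
```

Unlike the overlap-based metrics, this distance is driven by the single worst boundary mismatch, so one missed region near the knee can dominate it even when the overall overlap is high.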
The relative accuracy of the model volume segmentations for the test subset ( Figure 1) was also assessed by comparison to the EA1 reference via Bland-Altman (BA) analysis [9] and the agreement was quantified by Pearson correlation, R. BA analysis was also used to evaluate test-retest repeatability for volume segmentations by individual expert annotators and DL models trained on different data subsets. Bland-Altman plots provided a graphic view of agreement.
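The Bland-Altman quantities used here (bias and 95% limits of agreement) reduce to a mean and standard deviation of paired differences. The sketch below shows one conventional way to compute them, assuming limits of bias ± 1.96 SD; the function name and the toy volumes are hypothetical and are not data from the study.

```python
import statistics


def bland_altman(x, y):
    """Return (bias, lower LOA, upper LOA) for paired measurements x and y.

    bias -- mean of the paired differences x - y
    LOA  -- bias +/- 1.96 * sample SD of the differences (95% limits)
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - b for a, b in zip(x, y)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


if __name__ == "__main__":
    # Hypothetical tibia volumes (mm^3): model versus reference.
    model = [10.0, 12.0, 11.0, 13.0]
    reference = [9.0, 12.0, 12.0, 12.0]
    bias, lower, upper = bland_altman(model, reference)
    print(f"bias = {bias:.2f}, LOA = [{lower:.2f}, {upper:.2f}]")
```

The same function applies to test-retest pairs by passing the test volumes as `x` and the retest volumes as `y`.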
For the scan pairs acquired on consecutive days, the volume of the mouse tibia is not expected to change significantly; thus, the differences in the segmented volumes should ideally be zero. The repeatability and comparability of volumes estimated from the test-retest pairs were assessed using the within-subject coefficient of variation (wCV%) and BA analysis for limits of agreement (LOA) [9,26]. wCV is a relative (dimensionless) repeatability metric with confidence intervals (CIs) defined in Appendix B. The bias and LOA together determine the similarity between the test and retest volume pair. Small bias and narrow LOA indicate that the two measurements are essentially equivalent.
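A minimal sketch of a wCV calculation for test-retest pairs follows. It assumes one common formulation: the per-pair variance (v1 − v2)²/2 is normalized by the squared pair mean, averaged over pairs, and square-rooted, with a chi-square-based 95% CI. Whether this matches the study's exact definition is an assumption, although for wCV = 3% and 13 pairs it reproduces the rounded interval [2, 5]% quoted in the abstract. The chi-square quantiles for 13 degrees of freedom are hard-coded from standard tables to keep the example dependency-free.

```python
import math

# Chi-square quantiles for 13 degrees of freedom (2.5% and 97.5%),
# taken from standard tables; valid only for n = 13 pairs.
CHI2_13_LOW, CHI2_13_HIGH = 5.009, 24.736


def wcv_percent(pairs):
    """Within-subject coefficient of variation (%) from (test, retest) pairs."""
    terms = []
    for v1, v2 in pairs:
        pair_mean = (v1 + v2) / 2.0
        pair_var = (v1 - v2) ** 2 / 2.0  # variance estimate from two replicates
        terms.append(pair_var / pair_mean ** 2)
    return 100.0 * math.sqrt(sum(terms) / len(terms))


def wcv_ci(wcv, n_pairs, chi2_low, chi2_high):
    """Chi-square-based confidence interval for wCV (assumed formulation)."""
    return (wcv * math.sqrt(n_pairs / chi2_high),
            wcv * math.sqrt(n_pairs / chi2_low))


if __name__ == "__main__":
    low, high = wcv_ci(3.0, 13, CHI2_13_LOW, CHI2_13_HIGH)
    print(f"wCV = 3% with 13 pairs: 95% CI about [{low:.1f}, {high:.1f}]%")
```

The CI width shrinks with the number of pairs, which is why subsets with only four pairs yield much broader intervals than the 13-pair test set.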

DL Model Segmentation Accuracy with Respect to Reference
An example of the tibia segmentation from the validation set is presented in Figure 3. A darker tibia bone boundary is clearly visible. All the models accurately predicted the tibia bone contour with only minimal variations from the expert annotations, even in regions of signal heterogeneity along the tibia length (Figure 3). The change in the tibia volumes over a period of 6 weeks (Figure 3E) was consistent with minor changes in the contours. As expected, the DL model performance metrics (Table 1) reflect the highest relative accuracy (AJI: 86%-88%; AVI: 93%-97%) for the training sets. As the training set size is reduced, the standard deviations (SDs) widen (e.g., from 5% to 10%) for the test subset segmentations. However, the average performance metrics are sufficiently close (within SD) for the TS1, TS2, and full training subsets, and similar accuracy is observed for the test and validation subsets when the models were trained on half of the training data. Compared to training with the full set, the accuracies and performance metrics for the test subset were only slightly lower (<1.5%) when the models were trained on TS1 or TS2, a difference that was not significant after Bonferroni correction (p > 0.015). For the test subset, the relative accuracy of the DL model trained on a single mouse was significantly lower compared to the other training scenarios. The AHDs were below 1 mm for all training scenarios, increasing with decreasing training set size.
The lower value of AJI% for EA2 vs. EA1 reflects the interobserver variability in the way EA1 and EA2 generate the masks. There is large over-segmentation by EA2 relative to EA1 as observed from the AVEs of 14-18%. The interobserver variability in mouse tibia segmentation by the human experts is much larger than that of the A-U-Net models relative to EA1 (<3%). For the test set, the relative accuracy of EA2 segmentations with respect to the EA1 reference (AJI = 81%, AVI = 83%, AVE = 14%) was notably below that achieved by A-U-net models trained on full or TS1 and TS2 subsets (AJI = 82.5-83.5%, AVI = 89-90%, AVE = 2-3%), with significant improvement for AVI (6%) and AVE (−11%).
In the test example shown in Figure 4, the DL model segmentations derived after training with full and half datasets (Figure 1, TS1 and TS2) apparently produced slightly more accurate tibia segmentation at the distal end compared to the somewhat under-segmented EA1 "reference" (Figure 4B, yellow arrow). Since expert segmentation may have variabilities, the agreement or disagreement with respect to the reference annotations may not reflect the true model accuracy (Table 1). The predictions made by all the models (Attention U-Net trained on multiple mice) are fairly robust except for the one trained on a single mouse (Figure 4G). The tibia mask predicted by the model trained on the TSM subset misses a large portion of bone contour near the knee (indicated by the yellow arrow in Figure 4G). Thus, the model learned from a small training set shows declining performance and is more susceptible to errors in reference annotations of the training samples. The small data size fails to include all the diversity and features that would be encountered while making the test predictions.
Figure 5. Tibia volume (including test-retest) accuracy for the test subset with respect to EA1 reference for different training scenarios (color-coded in the legend). Solid horizontal lines mark the bias (average volume difference), and dashed lines correspond to 95% limits of agreement.
Table 2 further summarizes the observed repeatability trends in volume estimations for the test-retest pairs in all data subsets and volumes. For single-mouse training and all validation results that included only four test-retest pairs (Figure 1), the CI range is quite broad (from 6% to 35%), likely resulting from insufficient training to handle the large variations in image properties of unknown validation and test cases and, thus, causing inconsistent segmentation of the tibia volumes. As expected, for the 16-32 test-retest pairs in the training subsets, the model-derived tibia volume repeatability is close to that of the EA1 reference with an average wCV = 7.7% and CI = 5.5-12.3%, as the A-U-Net models were specifically trained to follow the EA1 segmentations. Similar precision performance is observed for EA2 on the test set with wCV = 8%, while EA1 with wCV = 5.3% is notably better, but there is substantial CI overlap (3.8-8.7% versus 5.7-13.2%), reflecting inter- and intra-observer variabilities for manually segmented tibia volumes. In contrast, for DL models trained on the full set, TS1, or TS2, the average repeatability (wCV ≈ 3%) with tight CI = 1.9-5.3% is about twofold better than that for single-mouse training or for EA1 and EA2 segmentations of the test-retest pairs in the test set.

DL Model Segmentation Precision
The repeatability of tibia volume measurements for all DL models and expert annotators is compared for the test-retest pairs in the test subset in Figure 6. The models show negligible (positive) volume bias between the scan and rescan segmentations, while EA2 shows a notable positive bias (0.4 mm³), consistent with the tendency of EA2 to slightly over-segment for the repeated scan (similar to over-segmentation relative to EA1 as observed in Figure 5). The LOAs for tibia volumes generated by DL models trained on all or half the data are small, with the overall precision within 0.5 mm³. The precision of EA1 is about half that of the DL models, with a spread of up to 1 mm³. For EA2 and the DL model trained on a single mouse, the precision is threefold lower, with an overall spread of about 1.5 mm³.

Discussion
The main finding of our study is that DL-based segmentation shows strong promise in improving repeatability of tibia volume measurements while the accuracy is comparable to manual annotations. For the test subset, the accuracy of DL segmentation trained on more than 10 animals (>52 scans) notably exceeded that of the second expert annotator (by 6% for volume intersection ratio and by −11% for volume error), with repeatability errors reduced to 3% from 5%-8% (observed for manual segmentations). Thus, the developed model has strong potential to enhance precision of the corresponding volume-based bone marrow QIBs of myelofibrosis [6] with substantial saving of human effort. The test-retest performance metric used here is generally more objective than relative accuracy assessment and would be useful for routine evaluation of DL model-based segmentation precision.
The variability in volumes from manual segmentation is a known major factor that limits precision in quantitative imaging [12,13,27]. The manual image segmentation is a tedious time-consuming task with repeatability dependent on the level of experience, attention span and fatigue of an annotator. Variabilities among the experts are also known to contribute to reproducibility errors [15,16]. Since human annotations are generally not free from subjective judgments, only relative segmentation accuracy can be assessed for the DL models with respect to the expert-provided reference. In fact, we noticed that the DL model trained on multiple mice can provide more accurate segmentations compared to DL model trained on a single mouse, while lower performance measures could just indicate discrepancy with the expert reference. In contrast, the segmentation repeatability is a more objective measure of model performance since the tibia volumes should not change between test and retest scans.
For clinical QIB precision assessments, QIBA recommends a test-retest sample size of >35 [8] to evaluate wCV with nominal CIs. Our preclinical validation sample size was insufficient for confident wCV measurement, but the samples in the test subset provided moderate CIs that allowed confident detection of twofold improved repeatability of DL model-based mouse tibia segmentations (3%) compared to expert annotations (5.3% to 8%) with relatively high accuracy (>82%) when the model was trained on the dataset of the size exceeding test dataset size. Apparently, training on 53-54 scans for 11-12 animals enabled relative accuracy and repeatability for test set segmentation equivalent (within confidence intervals) to training on 107 scans for 23 animals. Small image datasets with limited expert annotations are typical for biomedical image segmentation applications [28,29]. The training sample size required for robust model training is expected to depend on the complexity of the task and the DL architectures used. Therefore, estimating reasonable size for DL model training for a given task is of practical value to save human effort and improve segmentation consistency [30,31].
To date DL segmentation of murine MRI data has predominantly focused on application to soft-tissue regions and lesions [32,33]. In contrast to prior DL segmentation studies of murine bone primarily performed for microcomputed tomography (µCT) images [30,34,35], our current study provides the first example of DL bone segmentation in high-resolution MRI data. We studied the use of 2D images and the dependence on training sample sizes for training A-U-Net models. The relative accuracy of 80-90% and AHD <1 mm achieved for bone segmentation are similar to those reported for the µCT studies [30,34,35]. Although prior DL segmentation work compared relative model performance for multiple annotators [34] and different CT scanners [35], none used scan-rescan analysis for objective precision assessment as reported in our study.
Our study had limitations. First, the test and training subsets (of relatively small sizes) came from the same dataset acquired on a single scanner with fixed scan protocol for a single mouse mutation model; second, the reference annotations were provided by a single expert. These would likely limit direct generalization of the model for different acquisition protocols and should be tested independently for different murine models. However, the implementation of the automated segmentation provided valuable insights into practical DL model-based segmentation workflow, which could be utilized for other quantitative MRI acquisition protocols and image datasets.
The mouse tibia anatomy is largely consistent among the animals, allowing the use of highly uniform imaging protocols. Thus, mouse tibia segmentation may represent a relatively straightforward task for DCNN model, providing robust example for development of performance evaluation workflow. Apparently, the Attention U-Net trained on more than 50 scans successfully handled the challenges of both heterogeneous intensities and small murine bone sizes. The developed DL model for murine tibia bone marrow segmentation can be directly applied to future mouse scans acquired with similar MRI protocols. In future work, iterative semi-supervised DL training could be implemented for murine bone segmentation. In such an approach, manual segmentation is performed for a small subset of mouse scans for model training, followed by automatic segmentation and quality check for a small batch of new cases and adding manually corrected (challenging) cases to retrain the model on the enlarged annotated set. The process can be iterated for a number of times until the performance of the trained model is acceptable. This approach may substantially reduce the effort of manual segmentation.
Our upcoming myelofibrosis studies will focus on generalizing the DL model workflow for human bone marrow segmentation. Human bones are much larger, and expert annotations are even more time-consuming. Additionally, there is intrinsically higher variability of orientations, anatomic details, and scan parameters compared to mice, which might require a larger training dataset and/or transfer learning. Overall, improved processing throughput is vital for development, validation, and implementation of MRI-based quantitative biomarkers for advancing experimental therapeutics in myeloproliferative neoplasm mouse models with the overarching goal of improving patient outcomes.

Conclusions
DL-based segmentation shows strong promise of improved repeatability of murine tibia volume measurements from MRI scans with accuracy comparable to manual annotations and µCT. Attention U-Net was found to be an efficient alternative to human expert segmentations of tibia, providing high accuracy and precision for quantitative imaging of myelofibrosis.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. MRI Acquisition Parameters
All mice were scanned using a fixed 3D FLASH coronal-plane imaging protocol on a 7 T, 30 cm bore Agilent system magnet with a Small Receive CryoProbe™ 4-Element Array RF Coil Kit, cryogenically cooled to 20–30 K. The anesthetized mouse leg was held in place between a 3D-printed, leg-shaped mold on the posterior side and the CryoProbe™ on the anterior side for a 15 min scan. The imaging protocol parameters were defined in Bruker BioSpec® MRI Console Paravision 7.0.0 software and included single-echo; TR/TE = 111 ms/2.99 ms; flash-spoiling; flip angle = 9°; 1 NSA; imaging matrix size: 256 × 128 × 64; voxel size: 0.09 × 0.075 × 0.094 mm³; field of view (FOV): 23 × 9.6 × 6 mm³.
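As a quick consistency check of the parameters above, the voxel size follows from dividing the FOV by the matrix size along each axis, and a segmented volume in mm³ follows from the voxel count. The sketch below assumes the three FOV and matrix values are listed in the same axis order; the helper name `mask_volume_mm3` is illustrative only.

```python
import math

fov_mm = (23.0, 9.6, 6.0)   # field of view per axis, from the protocol above
matrix = (256, 128, 64)     # imaging matrix size per axis

# Voxel edge length per axis (mm), assuming matching axis order.
voxel_mm = tuple(f / n for f, n in zip(fov_mm, matrix))

# Volume of a single voxel (mm^3).
voxel_volume_mm3 = math.prod(voxel_mm)


def mask_volume_mm3(n_voxels):
    """Convert a count of segmented voxels to a volume in mm^3."""
    return n_voxels * voxel_volume_mm3


print([round(v, 3) for v in voxel_mm])
```

The computed edge lengths agree with the reported voxel size after rounding.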

Appendix B. Performance Metrics Definitions
Jaccard index: It is defined as the ratio of the intersection of the segmentation volumes to the union of the volumes of the two 3D masks. The Jaccard index for a single scan is calculated as
$$ \mathrm{JI} = \frac{V_h \cap V_c}{V_h \cup V_c}, $$
where $V_h$ and $V_c$ are the volumes of the segmentation mask drawn by the human as the reference standard and predicted by the computer, respectively, for the same scan.
Volume intersection ratio: The volume intersection ratio for the segmentations of a single scan is calculated as
$$ \mathrm{VI} = \frac{V_h \cap V_c}{V_h}, $$
where $V_h$ and $V_c$ are defined as above.
Volume error: The volume error for the segmentations of a single scan is calculated as
$$ \mathrm{VE} = \frac{V_h - V_c}{V_h}, $$
where $V_h$ and $V_c$ are defined as above. A positive value of volume error indicates under-segmentation, and a negative value indicates over-segmentation by the computer.
Averaging these per-scan values over all scans gives the AJI, AVI, and AVE, respectively.
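The three overlap metrics above can be computed directly from two binary 3D masks. The following is a minimal NumPy sketch, assuming that both the volume intersection ratio and the volume error are normalized by the reference volume V_h and that the sign convention follows the text (positive error means under-segmentation). The function name `overlap_metrics` is illustrative.

```python
import numpy as np


def overlap_metrics(human_mask, computer_mask):
    """Return (Jaccard index, volume intersection ratio, volume error) as fractions."""
    h = np.asarray(human_mask, dtype=bool)     # reference mask (V_h)
    c = np.asarray(computer_mask, dtype=bool)  # predicted mask (V_c)
    v_h = int(h.sum())
    v_c = int(c.sum())
    if v_h == 0:
        raise ValueError("reference mask is empty")
    v_int = int(np.logical_and(h, c).sum())    # V_h intersect V_c
    v_uni = int(np.logical_or(h, c).sum())     # V_h union V_c
    jaccard = v_int / v_uni
    intersection_ratio = v_int / v_h           # assumed: normalized by V_h
    volume_error = (v_h - v_c) / v_h           # positive: under-segmentation
    return jaccard, intersection_ratio, volume_error


# Toy example: the computer mask covers half of an 8-voxel reference block.
human = np.zeros((4, 4, 4), dtype=bool)
human[0:2, 0:2, 0:2] = True
computer = np.zeros((4, 4, 4), dtype=bool)
computer[0:2, 0:2, 0:1] = True
print(overlap_metrics(human, computer))
```

Voxel counts are used in place of physical volumes because all three metrics are ratios, so the voxel volume cancels.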
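Hausdorff distance: The Hausdorff distance measures the maximum of the distances between the closest points of two segmentation contours,
$$ \mathrm{HD}(r_1, r_2) = \max\left\{ \max_{x \in r_1} \min_{y \in r_2} d(x, y),\ \max_{y \in r_2} \min_{x \in r_1} d(x, y) \right\}, $$
where $r_1$ and $r_2$ are the sets of 3D points of the two contours, respectively, and $d(x, y)$ is the Euclidean distance. Averaging the per-scan values over all scans gives the AHD.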
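A brute-force sketch of the symmetric Hausdorff distance between two point sets is given below. It assumes the contour points are supplied as N × 3 arrays already scaled to physical units (mm); the function name `hausdorff_distance` is illustrative, and the pairwise distance matrix makes it suitable only for modest numbers of points.

```python
import numpy as np


def hausdorff_distance(r1, r2):
    """Symmetric Hausdorff distance between two sets of 3D points."""
    r1 = np.asarray(r1, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    # d[i, j] is the Euclidean distance between r1[i] and r2[j].
    d = np.linalg.norm(r1[:, None, :] - r2[None, :, :], axis=-1)
    forward = d.min(axis=1).max()    # farthest r1 point from its nearest r2 point
    backward = d.min(axis=0).max()   # farthest r2 point from its nearest r1 point
    return float(max(forward, backward))


contour_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
contour_b = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(hausdorff_distance(contour_a, contour_b))
```

For large contours, `scipy.spatial.distance.directed_hausdorff` computes each directed distance without building the full distance matrix.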
Within-subject coefficient of variation (wCV): The sum and difference between the paired observations are conveniently used to calculate the pairwise mean $M_i$ and pairwise variance $V_i$ for each $i$th test–retest pair as follows:
$$ M_i = \frac{V_1 + V_2}{2}, \qquad V_i = \frac{(V_1 - V_2)^2}{2}, $$
where $V_1$ and $V_2$ (in mm³) denote the volumes of the same tibia in scan 1 and scan 2, respectively, which are thus measured twice on different days, yielding a single paired observation. The square of the within-subject coefficient of variation (wCV²) is first obtained by taking the mean of the ratio of the variance $V_i$ to the square of the mean $M_i^2$ over all $N$ pairs and is then used to calculate the within-subject coefficient of variation as a percentage (wCV%) as follows:
$$ \mathrm{wCV}^2 = \frac{1}{N} \sum_{i=1}^{N} \frac{V_i}{M_i^2}, \qquad \mathrm{wCV\%} = 100 \cdot \sqrt{\mathrm{wCV}^2}. $$
The corresponding 95% confidence interval (α = 0.05) for wCV is obtained from the multiplicative lower-bound (LB) and upper-bound (UB) factors given by the ChiSqr function for $N - 1$ degrees of freedom as
$$ 95\%\ \text{CI of wCV} = \mathrm{wCV\%} \cdot [\mathrm{LB},\ \mathrm{UB}]. $$
Bland-Altman analysis (BA): The BA agreement was computed for the pairwise volume differences $(V_2 - V_1)_i$ against their means $M_i$. The mean of all $N$ paired differences in tibia volume measures the bias between test and retest volumes. Ideally, the bias should be close to zero, supporting the assumption that the volume remains constant between test and retest measurements.
The corresponding 95% limits of agreement (LOA) along the difference axis $(V_2 - V_1)$ in Bland-Altman plots provide a graphical indication of measurement precision. To find the precision, the standard deviation (SD) of the differences between test and retest volumes is obtained, and then the 95% LOA of $(V_2 - V_1)$ is calculated as follows:
$$ \text{Precision (LOA)} = \mathrm{Bias} \pm 1.96 \cdot \mathrm{SD}(V_2 - V_1). $$
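The test–retest statistics defined above (wCV and the Bland-Altman bias and limits of agreement) can be sketched in a few lines of Python. Two points are assumptions rather than details given in this appendix: the confidence-interval factors are taken to be the standard chi-square factors for a standard deviation, sqrt(dof / χ²) with dof = N − 1, and the SD of the differences is the sample SD (N − 1 denominator). Function names are illustrative.

```python
import math
import statistics


def wcv_percent(pairs):
    """Within-subject coefficient of variation (%) from (V1, V2) volume pairs."""
    ratios = []
    for v1, v2 in pairs:
        mean_i = (v1 + v2) / 2.0       # pairwise mean M_i
        var_i = (v1 - v2) ** 2 / 2.0   # pairwise variance V_i
        ratios.append(var_i / mean_i ** 2)
    return 100.0 * math.sqrt(sum(ratios) / len(ratios))


def wcv_ci95(wcv, n_pairs, alpha=0.05):
    """95% CI of wCV. ASSUMPTION: LB/UB = sqrt(dof / chi-square quantile), dof = N - 1."""
    # Imported here so the other functions run without SciPy installed.
    from scipy.stats import chi2
    dof = n_pairs - 1
    lower = wcv * math.sqrt(dof / chi2.ppf(1.0 - alpha / 2.0, dof))
    upper = wcv * math.sqrt(dof / chi2.ppf(alpha / 2.0, dof))
    return lower, upper


def bland_altman(pairs):
    """Return (bias, lower LOA, upper LOA) for the differences V2 - V1."""
    diffs = [v2 - v1 for v1, v2 in pairs]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)       # ASSUMPTION: sample SD (N - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

Each function takes a list of `(V1, V2)` tuples in mm³, one per test–retest pair, so the same input serves both analyses.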