1. Introduction
Accurate measurement of urinary bladder (UB) volume is critical to patient care in numerous clinical contexts, including the assessment of urinary retention, evaluation of post-void residual (PVR) urine, and guidance for catheterization [1,2]. Incorrect estimation may lead to unnecessary catheterization or missed diagnoses, both of which carry substantial risks such as urinary tract infection, bladder overdistension, and compromised patient outcomes [3].
Multiple techniques are available for UB volume assessment, ranging from catheterization (considered the gold standard but invasive) to non-invasive imaging modalities such as ultrasonography. Among these, point-of-care ultrasound (POCUS) has emerged as a preferred method, offering rapid, bedside assessment with minimal discomfort [1]. Conventional approaches include two-dimensional (2D) ultrasound, in which bladder dimensions are measured in orthogonal planes and volume is estimated via geometric formulas. Additionally, automated ultrasound bladder scanners are widely used to estimate urinary bladder volume without providing direct visualization of the bladder. These devices typically present only a numerical volume reading or a basic schematic representation, rather than a true ultrasound image. As a result, anechoic structures such as abdominal or pelvic fluid collections can potentially distort the measurements [4]. Multiple studies have shown that the reproducibility and accuracy of such non-visual scanners are inferior to those of imaging-based 2D scanning. This has been observed in specific patient groups, including children [5,6], postpartum women [7], women undergoing uroflowmetry [8], and cases with complicated conditions such as pelvic organ prolapse [9].
Three-dimensional (3D) ultrasound is an alternative method that could enable more accurate volumetric reconstruction and reduce reliance on manual caliper placement [10,11]. Comparative studies have consistently demonstrated that 3D methods improve both accuracy and inter-operator reproducibility over 2D techniques [12]. However, these studies often involve cart-based or semi-automated 3D systems that require significant operator expertise and specific reconstruction software to determine bladder boundaries [11,12,13]. Recent advances in handheld ultrasound technology have introduced compact, portable devices capable of performing fully automated 3D bladder volume measurements with minimal user input. Early evaluations suggest these devices may deliver clinically acceptable accuracy while requiring little operator training [14,15,16,17,18]. Despite these promising findings, existing studies often involve small sample sizes and lack operator reproducibility testing. Emerging trends in urology also include a growing interest in complementary and alternative medicine approaches for managing bladder health and capacity, underscoring the need for accurate volumetric assessment tools to evaluate their impacts [19].
The present study aimed to directly compare the accuracy and reproducibility of bladder volume measurement between the conventional 2D ultrasound method and a semi-automated 3D method on a handheld ultrasound device. We hypothesized that the 3D method would demonstrate improved accuracy and inter-operator reproducibility over the conventional 2D method, while maintaining clinical feasibility.
2. Materials and Methods
2.1. Study Design
This cross-sectional study included healthy male medical college students at Prince Sattam bin Abdulaziz University recruited through convenience sampling using personal relationships and word-of-mouth to obtain the largest feasible sample. Volunteers were eligible if they were over 18 years old and had no previous history of urinary bladder operations. The study was approved by the university ethics committee (approval number: REC-HSD-116-224), and informed consent was obtained from all participants prior to enrollment.
2.2. Acquisition Methods
The study compared two sonographic methods for bladder volume estimation in terms of accuracy and agreement. The first method (2D method) employed traditional 2D bladder volume measurement, in which bladder height, width, and depth were measured using the HI VISION Avius ultrasound system (Hitachi Medical Corporation, Tokyo, Japan) equipped with an EUP-C715 abdominal convex transducer operating over a 1–5 MHz frequency range. This device automatically calculates bladder volume using a built-in calculation tool that assumes a spheroidal shape according to the formula:
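The expression itself is not reproduced in the text. As an assumption (the device documentation is not quoted here), a spheroidal model corresponds to the standard ellipsoid volume formula, with H, W, and D denoting the measured bladder height, width, and depth:

```latex
V \;=\; \frac{\pi}{6}\, H \times W \times D \;\approx\; 0.52 \times H \times W \times D
```

With dimensions in cm, V is in mL. The coefficient actually applied by the built-in tool may differ from this generic form.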
The second method (3D method) utilized the handheld Butterfly iQ ultrasound device (Butterfly Network Inc., Guilford, CT, USA), which employs capacitive micromachined ultrasonic transducers (CMUT) operating over a 1–10 MHz frequency range [20,21]. This device acquires 3D bladder volumes by electronically sweeping the ultrasound beam across the bladder area using its proprietary 3D software: “Butterfly Auto Bladder Volume Tool”. The manufacturer reports a volume accuracy of ±7.5% using this tool [22]. The device was connected to a 12.9″ tablet (iPad Pro, Apple Inc., Cupertino, CA, USA) for image control and acquisition.
2.3. Study Protocol
Two novice operators (S.K.S.A and A.K.A) with six months of ultrasound imaging training acquired all measurements using both systems. We intentionally chose a small number of operators to simulate the typical users of handheld devices and to minimize bladder filling during repeated scans, which can confound accuracy assessments. The operators had sufficient training in the standard 2D method but no previous experience with the 3D method. An expert consultant in ultrasonography (A.M.A) therefore gave them a single brief demonstration of how to acquire valid results. A third research team member (A.I.A) recorded the bladder volume measurements while maintaining operator blinding to prevent measurement bias.
Participants were instructed to drink 1 L of water at least 30 min prior to their scheduled appointments to ensure adequate bladder filling. They were then asked to inform a research team member when they felt a full bladder sensation. Once this was reported, the first scan began after the presence of at least 150 mL in the bladder was confirmed using the conventional 2D method.
The measurement protocol involved Operator A performing two consecutive bladder volume measurements using the standard 2D method, immediately followed by two consecutive measurements using the 3D method. Operator B then repeated this identical sequence. Each scan was performed independently, and the transducer was removed between repeated measurements to simulate separate acquisitions. Pilot testing with three participants confirmed that repeated acquisitions could be completed within less than one and a half minutes. Although the precise interval was not recorded during the main study, the short acquisition time was considered insufficient for meaningful bladder filling to occur. All measurements were recorded by an independent research team member, with operators remaining unaware of their values throughout the procedure.
Following completion of all ultrasound measurements, participants completely voided their bladders into a urine collection container. The total voided volume was determined by weighing the collected urine and converting mass to volume using an assumed urine density of 1.0 g/mL; this volume served as the reference standard for accuracy assessment. A pilot study involving three participants was conducted to verify protocol feasibility and estimate examination duration prior to the main study.
2.4. Statistical Methods
Statistical analysis was conducted to compare the accuracy, reproducibility, and agreement of bladder volume measurements obtained using the different techniques. All statistical analyses were performed using JASP software (version 0.95), except where otherwise stated [23]. Given that the data did not meet the assumption of normality, as confirmed by histograms and the Kolmogorov–Smirnov test (p < 0.05), non-parametric tests were employed where appropriate.
Descriptive statistics, including median, interquartile range (IQR), and mean differences ± standard deviation were calculated for bladder volume measurements obtained by each operator using both techniques. Bootstrap methods were used to calculate 95% confidence intervals (CI) for medians using SPSS (version 28).
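As an illustration of the bootstrap step, the sketch below computes a percentile 95% CI for a median using only the Python standard library. It is a minimal sketch under stated assumptions, not the SPSS procedure used in the study: the percentile method, the number of resamples, the seed, and the function name `bootstrap_median_ci` are all assumptions, and the example volumes are invented for demonstration.

```python
import random
import statistics


def bootstrap_median_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median (hypothetical helper).

    values : sequence of measurements (e.g., volumes in mL)
    n_boot : number of bootstrap resamples (assumed, not from the study)
    alpha  : 1 - confidence level (0.05 gives a 95% CI)
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = max(lo_idx, int((1 - alpha / 2) * n_boot) - 1)
    return medians[lo_idx], medians[hi_idx]


if __name__ == "__main__":
    # Invented example volumes in mL, for illustration only.
    volumes = [310.0, 420.0, 455.0, 485.0, 530.0, 610.0, 700.0]
    low, high = bootstrap_median_ci(volumes)
    print(f"median = {statistics.median(volumes):.1f} mL, 95% CI {low:.1f} to {high:.1f}")
```

Other bootstrap variants (for example, bias-corrected intervals) would give somewhat different limits, so the choice of method should be reported alongside the interval.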
Accuracy was evaluated by comparing ultrasound measurements to the reference standard (measured voided volume). Absolute and percentage differences were calculated for each method. Systematic bias was assessed using Wilcoxon signed-rank tests to determine if median differences significantly differed from zero. Bland–Altman analysis was performed to assess agreement between each ultrasound method and the reference standard, calculating bias (mean difference), limits of agreement (mean ± 1.96 × SD), and their respective 95% confidence intervals.
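The Bland–Altman quantities described above can be sketched as follows. This is an illustrative sketch, not the JASP implementation: the function name `bland_altman` is hypothetical, the confidence intervals use a normal quantile rather than Student's t, and the standard error of each limit uses the common approximation sqrt(3·SD²/n); both simplifications are assumptions.

```python
import math
import statistics


def bland_altman(measured, reference, level=0.95):
    """Bias, 95% limits of agreement, and approximate CIs (hypothetical helper).

    measured, reference : paired sequences in the same units (e.g., mL)
    """
    if len(measured) != len(reference) or len(measured) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")
    diffs = [m - r for m, r in zip(measured, reference)]
    n = len(diffs)
    bias = statistics.mean(diffs)   # mean difference
    sd = statistics.stdev(diffs)    # sample SD of the differences
    loa_lower = bias - 1.96 * sd
    loa_upper = bias + 1.96 * sd

    # Approximate standard errors; normal quantile used in place of t (assumption).
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    se_bias = sd / math.sqrt(n)
    se_loa = sd * math.sqrt(3.0 / n)
    return {
        "bias": bias,
        "sd": sd,
        "loa_lower": loa_lower,
        "loa_upper": loa_upper,
        "bias_ci": (bias - z * se_bias, bias + z * se_bias),
        "loa_lower_ci": (loa_lower - z * se_loa, loa_lower + z * se_loa),
        "loa_upper_ci": (loa_upper - z * se_loa, loa_upper + z * se_loa),
    }


if __name__ == "__main__":
    # Invented paired values in mL, for illustration only.
    ultrasound = [300.0, 410.0, 520.0, 250.0, 640.0]
    voided = [350.0, 480.0, 560.0, 330.0, 700.0]
    result = bland_altman(ultrasound, voided)
    print(f"bias = {result['bias']:.1f} mL, "
          f"LoA = {result['loa_lower']:.1f} to {result['loa_upper']:.1f} mL")
```

A negative bias in this convention (ultrasound minus reference) indicates underestimation by the ultrasound method.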
Intra-operator and inter-operator reproducibility were assessed using Intraclass Correlation Coefficients (ICCs) (two-way mixed effects, absolute agreement, single rater/measurement) with 95% confidence intervals. ICC values were interpreted as follows: <0.5 = poor, 0.5–0.75 = moderate, 0.75–0.9 = good, and >0.9 = excellent reliability. Agreement between the 2D and 3D methods was assessed both within (intra) and across (inter) operators using ICC and Bland–Altman analysis. For intra-operator comparisons, a two-way mixed-effects, absolute agreement ICC was used. For inter-operator inter-method comparisons, a two-way random-effects ICC was used to reflect the random pairing of different operators and methods. Bland–Altman plots were constructed for all reproducibility and agreement assessments, including the mean difference (bias) and 95% limits of agreement (LoA).
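For readers who want to see how a single-measurement, absolute-agreement ICC is obtained, the sketch below computes it from two-way ANOVA mean squares and applies the interpretation bands given above. It is a minimal sketch, not the JASP routine: the function names `icc_a1` and `interpret_icc` are hypothetical, confidence intervals are omitted, and the handling of values falling exactly on a band boundary is an assumption.

```python
def icc_a1(ratings):
    """Two-way, absolute-agreement, single-measurement ICC (hypothetical helper).

    ratings : list of rows, one row per subject, one column per rater or
              repeated measurement. Raises ZeroDivisionError if all values
              are identical (no variance to partition).
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 rows and >= 2 columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for the two-way layout without replication.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))


def interpret_icc(icc):
    """Map an ICC to the bands used in the text (boundary handling assumed)."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


if __name__ == "__main__":
    # Invented repeated volumes in mL (two columns = two operators).
    table = [[310.0, 325.0], [420.0, 410.0], [515.0, 540.0], [260.0, 255.0]]
    value = icc_a1(table)
    print(f"ICC(A,1) = {value:.3f} ({interpret_icc(value)})")
```

Because the absolute-agreement form penalizes a constant offset between raters, two operators who differ by a fixed amount score below 1 even when their measurements are perfectly correlated.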
This study was designed as an exploratory methods-comparison with an emphasis on estimation (bias, limits of agreement, and ICC) rather than null-hypothesis testing; therefore, no formal a priori sample size calculation was performed. We targeted a sample of approximately 50 participants based on feasibility and to achieve acceptable precision for Bland–Altman limits of agreement and ICC estimates. The final sample (n = 53) yielded narrow 95% confidence intervals for both bias/LoA and ICCs. As a reference, n = 53 provides >90% power at α = 0.05 to detect a paired difference of ~0.4 SD.
3. Results
A total of 53 participants were enrolled in this study. The volunteers had a mean age of 19.6 ± 2.0 years (range: 18–26 years), reflecting a young adult cohort. Participants had a mean height of 173.0 ± 5.7 cm (range: 162.0–191.0 cm) and mean weight of 84.0 ± 21.7 kg (range: 48.1–136.8 kg). The mean body mass index was 28.1 ± 7.3 kg/m² (range: 16.2–50.3 kg/m²), indicating a diverse range of body habitus from underweight to obese categories.
3.1. Accuracy Assessment
Urinary bladder volume estimation was compared across two methods, standard 2D ultrasound and 3D ultrasound (Butterfly iQ), each performed by two different operators. The median true voided volume was 485.0 mL (95% CI: 441.0–580.0; IQR: 329.4 mL), demonstrating the wide range of bladder volumes in our study population. All ultrasound methods systematically underestimated bladder volume compared to the reference standard of measured voided volume (Table 1). Additionally, all methods significantly differed from the true voided volume based on Wilcoxon signed-rank tests (all p < 0.001). Figure 1 presents boxplots comparing all four method-operator combinations to the measured voided volume, illustrating the systematic underestimation by the standard method and the closer agreement of the 3D measurements with the voided volume.
Operator A’s standard method significantly underestimated bladder volume with a median estimate of 309.7 mL (95% CI: 257.3–352.0), yielding a mean difference of −191.6 mL (SD = 88.3) and a relative error of −36.7%. In contrast, the 3D method produced higher accuracy, with a median estimate of 436.0 mL (95% CI: 394.5–524.0), corresponding to a smaller mean difference of −64.9 mL (SD = 83.7) and −11.2% error.
Operator B demonstrated similar trends. The standard method yielded a median estimate of 349.8 mL (95% CI: 306.8–403.4), with a mean difference of −137.4 mL (SD = 69.4) and −27.6% error. The 3D method had a median estimate of 430.0 mL (95% CI: 381.0–520.5), a mean difference of −72.3 mL (SD = 90.6), and −12.0% error.
Bland–Altman analysis revealed distinct agreement patterns between methods (Table 2; Figure 2). The standard method demonstrated tighter limits of agreement but a substantial negative bias, whereas the 3D method exhibited wider limits of agreement but a smaller systematic bias.
3.2. Operators’ Reproducibility
After confirming normal distribution of the differences between repeated measurements, reproducibility was assessed using intraclass correlation coefficients (ICCs) and Bland–Altman analysis (Table 3; Figure S1). Intra-operator reproducibility was excellent for both operators across both methods, with ICC values exceeding 0.96. Operator A showed near-identical reproducibility for the standard (ICC = 0.977) and 3D (ICC = 0.976) techniques, while Operator B showed slightly higher reproducibility with the 3D method (ICC = 0.983) compared to the standard method (ICC = 0.962).
Bland–Altman analyses revealed minimal mean differences between repeated scans for both operators. The 3D method showed a near-zero bias for Operator A (−0.3 mL) and a small positive bias for Operator B (5.8 mL), indicating minimal systematic error. Limits of agreement were narrower for the 3D method than for the standard method, suggesting greater consistency.
Inter-operator agreement was also strong, with a higher ICC for the 3D method (0.977) than for the standard method (0.927). While the standard method showed a notable negative bias (−54.2 mL) between operators, the 3D method demonstrated a much smaller mean difference (7.3 mL), reinforcing its potential for more reliable cross-operator use.
3.3. Methods Agreement
The agreement between the standard and 3D ultrasound methods was assessed within and across operators using intraclass correlation coefficients (ICCs) and Bland–Altman analysis (Table 4; Figure S2). Within-operator agreement was good, with ICC values of 0.798 for Operator A and 0.890 for Operator B. Despite this, Bland–Altman plots showed considerable mean differences between the two methods, particularly for Operator A, who exhibited a bias of −126.7 mL (95% CI: −152.5 to −100.9) with wide limits of agreement (−310.3 to 56.9 mL). Operator B demonstrated a smaller bias of −65.2 mL (95% CI: −93.0 to −37.4), but agreement limits remained relatively broad (−262.8 to 132.5 mL).
Cross-operator comparisons between different methods also varied. Agreement between Operator A’s standard method and Operator B’s 3D method yielded an ICC of 0.735 and a bias of −119.4 mL, while the reverse comparison of Operator A’s 3D method versus Operator B’s standard method had the highest inter-method ICC (0.901) and a smaller bias of 72.5 mL. Notably, the limits of agreement in all comparisons were relatively wide, indicating variability in individual measurements despite good overall correlation. These findings suggest that while 3D ultrasound shows better consistency across operators, considerable variability still exists between methods.
3.4. Feasibility Assessment
All scans were successfully completed with no missing data or technical difficulties during data acquisition. Examples of the acquired ultrasound images are presented in Figure 3 and Figure 4. While both devices were usable under all conditions, a limitation was observed with the 3D method. During acquisition, the Butterfly iQ device occasionally displayed a warning message stating “Bladder extends off view” (Figure 3). This occurred in 7 participants (13% of the cohort), all of whom had large bladder volumes (≥700 mL) or elongated bladder shapes. The issue is likely related to constraints in the probe’s electronic beam steering capabilities, which can prevent the full bladder from being captured within the field of view. Despite this warning, the device continued to generate numerical volume estimates, and exclusion of these cases did not materially change the overall accuracy or reproducibility results.
Both operators nonetheless noted the ease of use and efficiency of the handheld 3D device, which provided automated volume estimation with minimal operator input once the probe was positioned over the center of the bladder. Overall, both systems generated volume readings promptly, though the 3D method was consistently faster in acquisition.
4. Discussion
This study demonstrates that the automated 3D method using the handheld Butterfly iQ ultrasound device provides higher accuracy than the standard 2D method for bladder volume estimation. The 3D method achieved substantially improved accuracy, with percentage errors of 11–12% compared to 28–37% for the 2D method across both operators. Both ultrasound methods systematically underestimated bladder volume compared to the reference standard; however, the 3D technique exhibited substantially less bias, with mean differences of 65–72 mL compared to 137–192 mL for the 2D method.
Both methods demonstrated excellent intra-operator reproducibility, with ICC values exceeding 0.96, indicating that individual operators can achieve highly consistent measurements with minimal training. However, the 3D method showed superior inter-operator agreement (ICC: 0.977 vs. 0.927) and markedly reduced systematic bias between operators (7.3 mL vs. 54.2 mL difference). This improved consistency across different users represents a significant advantage for clinical implementation, as it reduces operator-dependent variability that can compromise measurement reliability in practice. The study also revealed that both methods were technically feasible with 100% successful completion rates. However, the 3D method demonstrated faster acquisition times and required minimal operator input beyond probe positioning, while the 2D method required manual caliper placement and dimensional measurements.
One notable observation was the limitation of the 3D method with large bladder volumes (≥700 mL), where “bladder extends off view” warnings indicated field-of-view constraints. The literature supports our finding that bladder volume estimation is influenced by bladder size and shape, though most studies focus on accuracy rather than technical limitations. Vinod et al. demonstrated that both 3D ultrasound and BladderScan significantly underestimated bladder volumes, with 3D ultrasound showing a 30.1% error rate that improved to 20.7% after applying correction factors [13]. They also found that large, irregularly shaped bladders produced greater underestimation errors. Bih et al. reported that bladder shape significantly affects volume estimation accuracy, with different correction coefficients needed for cuboidal (0.89), ellipsoid (0.81), and triangular prism-shaped (0.66) bladders [24]. The field-of-view constraint we observed likely represents a technical limitation of the Butterfly iQ handheld ultrasound device when encountering extremely distended bladders, which may have more irregular shapes that exceed the electronic beam steering capabilities. This suggests that while 3D automated methods offer superior accuracy for most clinical scenarios, alternative techniques may be necessary for patients with very large bladder capacities, particularly those with neurogenic bladder dysfunction who are more likely to develop both large volumes and irregular shapes.
Very few studies have investigated the semi-automated Butterfly iQ technology for urinary bladder volume assessment, and most have been conducted with smaller sample sizes or in settings or populations different from the present work [15,16,17,18]. In an emergency department setting, Ho-Gotshall et al. compared a nursing bladder scanner, cart-based ultrasound, and Butterfly iQ against post-measurement catheterization and found that the cart-based ultrasound demonstrated the highest agreement with the gold standard, whereas Butterfly significantly overestimated catheterized volume; nonetheless, both the nursing scanner and Butterfly were rated more convenient than the cart-based system [18]. This divergence from our results likely reflects methodological differences, specifically their use of catheterization in older, acutely unwell patients versus our use of voided volume in healthy young adults, as well as differing operator experience levels and timing of measurements, all of which can influence bias and variability.
From an implementation perspective, Nunan et al. reported that after a 20 min teaching session, ward staff confidence in using the Butterfly semi-automated visual method was significantly greater than when using automated non-visual bladder scanners, and that overall uptake was high [17]. Moreover, Jalfon et al. examined both operator- and patient-acquired postvoid residual measurements using Butterfly iQ and found excellent repeatability (ICCs 0.95–0.98) for Butterfly and standard scanners alike [15]. This aligns closely with our observation that even novice operators achieved excellent reproducibility with minimal training. The visual confirmation inherent in the Butterfly system likely reduces gross targeting errors and increases user trust, which may help explain the consistency of repeated measures observed in our cohort. Jalfon et al. also reported that Bland–Altman limits of agreement between devices exceeded their pre-specified ±50 mL threshold [15]. We found a similar pattern, where high ICC values were accompanied by wide limits of agreement, highlighting that good reliability does not necessarily equate to close absolute agreement at the individual-measurement level. The observed limits, exceeding 100–200 mL in some cases, may be clinically significant, particularly around common thresholds for catheterization (300–500 mL) or postoperative monitoring. While the handheld 3D method reduced systematic bias and improved reproducibility compared to 2D, the wide limits suggest that results should be interpreted with caution in borderline cases. We therefore recommend handheld 3D ultrasound as a more reliable adjunct to conventional methods, but not as a sole determinant when precise volume thresholds are critical to management.
In another study, Wright et al. compared Butterfly iQ, Clarius C3, and a dedicated bladder scanner for prostate and bladder volumes and found that in the bladder subset, ICCs for voided volume prediction were better for Butterfly iQ (0.82) compared to Clarius (0.72) and the bladder scanner (0.69); they also reported shorter scan times and lower device cost for Butterfly [16]. This reliability agrees with our reproducibility results and supports our feasibility findings, where both operators noted the simplicity and speed of acquisition with the 3D method. Taken together, the emerging literature suggests that Butterfly’s visual and semi-automated 3D acquisition is feasible, fast, and associated with high reproducibility. In contrast, the conventional 2D method requires manual placement of six calipers to calculate the volume, and operators with minimal sonographic skills may find it challenging to confidently determine the exact location of each measurement [12,25,26].
Our results have significant clinical implications. The accuracy differences observed between the two methods can directly influence bladder management decisions. The 28–37% underestimation by the 2D ultrasound method could lead to misclassification of bladder filling states, potentially resulting in inappropriate catheterization, inadequate bladder training protocols, or misinterpretation of post-void residual volumes. In contrast, the 11–12% systematic error of the 3D method approaches clinically acceptable thresholds, although the presence of systematic underestimation still warrants consideration in volume-dependent clinical interventions. Similar findings have been reported in previous comparative studies of 2D and 3D ultrasound, where 3D approaches consistently reduced error rates and improved agreement with true bladder volumes [26].
The improved inter-operator consistency of the 3D method addresses a critical limitation in point-of-care ultrasound applications. In clinical settings where multiple healthcare providers may perform bladder assessments, the reduced operator-dependent variability (7.3 mL vs. 54.2 mL systematic difference) could enhance measurement standardization and reduce training requirements. Prior research has emphasized that operator skill level and caliper placement in 2D methods are major contributors to variability [26]. Coelho et al. developed and validated a 23-item training checklist to assess the skill of nurses in measuring UB volume using POCUS [27]. This is particularly relevant for emergency departments, urology clinics, and nursing units where rapid, reliable bladder volume assessment is essential for patient care decisions.
The automated acquisition process of the 3D method offers additional clinical benefits by reducing training requirements and measurement time. Minimal user input is required beyond correct probe placement, which could facilitate broader implementation in resource-limited settings or situations where sonographic expertise is scarce. Other authors have suggested that such automated approaches could mitigate some of the pitfalls of conventional non-visual bladder scanners, which may misinterpret adjacent pelvic or abdominal fluid collections as bladder contents [4,26]. Nevertheless, our observation of field-of-view limitations in cases of extreme bladder distension (≥700 mL) aligns with reports of incomplete volume capture in both handheld and stationary 3D systems [11,13], indicating that hybrid imaging strategies may be required for optimal accuracy across all clinical scenarios. Beyond technical performance, considerations of cost and workflow are also important. Handheld 3D ultrasound devices are generally more affordable than conventional cart-based systems and bladder scanners, while offering multipurpose imaging capabilities. Combined with faster, semi-automated acquisition, these features suggest practical advantages in clinical workflows. Nonetheless, formal cost-effectiveness studies are warranted across diverse healthcare settings.
Finally, the systematic underestimation identified in both methods, which was more pronounced with the 2D technique, highlights the potential role of calibration factors or correction algorithms for precise volume estimation. Similar recommendations have been made in prior work evaluating correction coefficients for bladder scanners and 3D ultrasound devices [13]. By understanding and adjusting for these biases, clinicians can interpret measurements more accurately, particularly in contexts where precise thresholds guide intervention, such as postoperative bladder monitoring, urinary retention diagnosis, or pediatric bladder management.
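To make the idea of a calibration factor concrete, the sketch below fits a single multiplicative correction by least squares through the origin and applies it to a new reading. This is purely illustrative and is not a method used or validated in this study: the function names `fit_correction_factor` and `apply_correction` are hypothetical, the single-factor model is an assumption, and any such factor would need to be derived and tested on independent data before clinical use.

```python
def fit_correction_factor(estimates, references):
    """Least-squares slope through the origin: reference ~ k * estimate.

    estimates  : ultrasound volume estimates (mL)
    references : paired reference volumes (mL), e.g., measured voided volume
    """
    if len(estimates) != len(references) or not estimates:
        raise ValueError("need two equal-length, non-empty sequences")
    numerator = sum(e * r for e, r in zip(estimates, references))
    denominator = sum(e * e for e in estimates)
    if denominator == 0:
        raise ValueError("estimates must not all be zero")
    return numerator / denominator


def apply_correction(estimate, factor):
    """Scale a raw ultrasound estimate by a previously fitted factor."""
    return factor * estimate


if __name__ == "__main__":
    # Invented paired volumes in mL, for illustration only.
    raw = [100.0, 200.0, 300.0]
    ref = [120.0, 240.0, 360.0]
    k = fit_correction_factor(raw, ref)
    print(f"correction factor = {k:.2f}")
    print(f"corrected 250 mL reading = {apply_correction(250.0, k):.0f} mL")
```

A single factor only removes proportional bias; it does not narrow the limits of agreement, so individual measurements can remain far from the reference even after correction.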
Our study adds several original contributions to the literature on bladder volume assessment. To our knowledge, it is the first to directly compare the Butterfly 3D method with the conventional 2D approach in the same cohort, using true voided urine volume as the reference. Unlike prior feasibility studies, we also incorporated systematic intra- and inter-operator reproducibility testing, which provides novel insight into consistency across novice users. These aspects set our work apart and support its relevance for clinical practice. The study included only novice operators. This choice was intentional to evaluate usability in real-world contexts, where bladder volume assessments are often performed by non-specialists. Despite this, intra- and inter-operator ICCs for the 3D method exceeded 0.96, underscoring that the handheld 3D system can provide reliable results even with minimal training. Future research could also include experienced sonographers to examine potential performance differences across training levels.
Nevertheless, our work has several limitations. First, bladder filling during the course of repeated measurements is an inherent challenge. The rate of filling varies substantially between individuals and can be influenced by factors such as hydration status, fluid intake timing, renal function, and detrusor muscle activity [2]. We assumed a constant bladder volume between scans; however, in reality, some degree of filling likely occurred between repeated measurements. Given that the acquisition time for all scans was short, no adjustments for filling rate were applied, as the potential change in volume during this window was considered negligible. Second, our sample did not include elderly participants or individuals with known bladder pathology. Older adults often have more irregular bladder morphology and thicker bladder walls due to chronic conditions such as outlet obstruction, which can make the measurements more challenging. In such cases, the semi-automated 3D acquisition may offer greater advantages over conventional techniques, but this hypothesis requires direct investigation. Third, although catheterization remains the most definitive reference standard, we used voided volume because it is non-invasive. Post-void residuals were measured and found to be minimal (average <25 mL), and accounting for them did not alter the study conclusions. These small volumes may also reflect bladder refilling during repeated scans. Nonetheless, future studies in clinical populations should consider catheterization when performed as part of routine care to fully eliminate residual urine bias. Finally, the study was conducted only on healthy young male volunteers. Although both genders were invited to participate, only male students volunteered due to cultural considerations, as the scanning procedure can be slightly revealing and was conducted by male sonographers. This restriction may limit the generalizability of the findings to other populations, including females, pediatric patients, and those in acute or postoperative settings.
Future studies should expand to include elderly participants, females, and patients with bladder pathology (e.g., neurogenic bladder, outlet obstruction, pelvic organ prolapse), as these groups represent the populations where bladder volume measurement is most clinically relevant. Beyond accuracy, practical considerations also influence adoption. Handheld 3D ultrasound devices are portable, require minimal training compared to conventional ultrasound, and can be more readily integrated into bedside workflows. Their multipurpose use may further support cost-effectiveness, although formal health-economic analyses are warranted. To consolidate the evidence base, multicenter studies with diverse populations and systematic reviews or meta-analyses of handheld devices are also needed to confirm generalizability and guide clinical adoption.