1. Introduction
In acute trauma care, internal hemorrhage is one of the leading causes of mortality [1]. More than 60% of blunt internal trauma cases occur in a pre-hospital setting, such as traffic accidents and combat injuries [2]. Consequently, the accuracy of rapid point-of-care assessment directly influences clinical decision-making during the critical “golden hour” of resuscitation [3]. Computed Tomography (CT) is traditionally regarded as the gold standard for diagnosing abdominothoracic injuries [4]. However, the inherent limitations of CT imaging, including site requirements and radiation exposure, often impede timely evaluation in patients with suspected internal hemorrhage [5]. Within contemporary trauma resuscitation protocols, point-of-care ultrasound has become the first-line imaging modality for rapid detection of hemoperitoneum in emergency and critical care settings [6]. The Focused Assessment with Sonography for Trauma (FAST) [7] examination systematically detects free fluid in the pericardial sac and peritoneal cavity. When combined with clinical judgment, this bedside diagnostic technique helps refine diagnoses in trauma patients and guides decisions regarding the need for emergent surgical intervention or transfer to trauma centers. Moreover, ultrasound imaging enables clinicians to rapidly identify life-threatening injuries at the bedside or in resource-limited environments without exposing patients to radiation [8].
Although FAST examination has been widely adopted in some hospitals, its further development remains constrained by significant limitations. A major challenge lies in the fact that ultrasound imaging is highly operator-dependent, requiring substantial training and clinical experience [9]. Yet, across various clinical settings, including emergency scenes and resource-limited areas, there is a widespread shortage of adequately trained and experienced medics [10]. In acute trauma care, variability in operator skill levels can lead to missed life-threatening injuries. Even medics who have received point-of-care ultrasound training may not always be able to accurately interpret ultrasound images promptly in every situation. This issue is exacerbated in emergencies, where the need for rapid assessment increases the pressure on the operator.
Given these challenges, deep learning (DL) technology demonstrates considerable potential. During FAST examination conducted by paramedics under urgent conditions, AI-assisted ultrasonography can improve diagnostic performance by enabling rapid identification and localization of free fluid, thereby shortening examination time and accelerating clinical intervention. Furthermore, this technology could potentially allow medics without specialized diagnostic training to make effective clinical decisions using ultrasound devices. Currently, deep learning models have been widely applied in FAST examination [11,12,13,14,15,16,17,18]. However, despite achieving promising accuracy, most of these models adopt generic object detection architectures and incur considerable computational costs. Point-of-care and pre-hospital ultrasound devices, by contrast, are severely resource-constrained, which poses a major obstacle to the clinical application of intelligent diagnosis. Consequently, existing AI-assisted free fluid detection methods have remained largely in the research domain, with no practical deployment pathway for point-of-care use. Furthermore, the datasets used in these studies primarily consist of effusions (e.g., pleural or ascitic fluid), which do not necessarily represent the dynamic, irregular morphology of free fluid resulting from active internal hemorrhage.
To bridge this gap, we propose a lightweight detection network specifically designed for free fluid identification in FAST examinations, targeting resource-constrained bedside devices. The network is built upon the efficient YOLOx [19] architecture, with two key structural innovations: a Dual-Stream Fusion (DSF) backbone that decomposes features into high- and low-frequency components to preserve spatial details while reducing computation, and a Global Fusion Feedback (GFF) neck that integrates multi-scale features with bottom-up refinement to improve fusion efficiency.
Another distinctive aspect of our work is the training dataset. Unlike prior studies that used static effusion data, we collected and annotated ultrasound images from rabbits with actively hemorrhaging liver tissue. This allows our model to learn the morphological characteristics of free fluid during active hemorrhage, which is a more clinically relevant and challenging scenario.
Experiments show that our lightweight design achieves competitive detection accuracy while reducing computational complexity by an order of magnitude compared with mainstream detectors. This makes accurate, automated internal hemorrhage assessment feasible on portable ultrasound devices without requiring cloud connectivity or high-end hardware, thereby providing a practical technical solution for point-of-care intelligent internal hemorrhage detection. The code is available at https://github.com/Mingyi-Yang-hub/lightweight-detection (accessed on 9 May 2026).
The remainder of this paper is organized as follows. Section 2 reviews the related work, covering free fluid detection and lightweight object detection methods. Section 3 elaborates on the proposed method in detail, including network architecture design, dataset construction, and model training strategy. Section 4 presents the experimental results and analysis, involving comparative evaluation with mainstream methods as well as ablation experiments. Section 5 provides relevant discussions on this work. Finally, the overall conclusions are drawn in Section 6.
3. Methods
This section provides a systematic exposition of the proposed method. First, Section 3.1 details the model structure, explaining its core components and design principles. Subsequently, Section 3.2 describes the acquisition process of the dataset employed in this study. Then, Section 3.3 outlines the specific implementation strategy for training the network. Finally, Section 3.4 introduces the quantitative metrics used to evaluate the model’s performance.
3.1. Model Structure
The overall structure of the proposed network is illustrated in Figure 1. Built upon the YOLOx baseline, an efficient framework for object detection, we introduce targeted optimizations in the backbone and neck components to better adapt to the characteristics of free fluid in ultrasound images. These modifications aim to reduce network complexity while preserving detection accuracy, thereby enhancing overall performance in this medical imaging context. To fully extract features of free fluid regions with lower computational cost, the backbone module (detailed in Section 3.1.1) decomposes features into two parallel pathways, i.e., high-frequency and low-frequency. The low-frequency branch operates at a smaller resolution to reduce computational load, while the high-frequency branch supplements detailed information, thereby balancing efficiency and representational capacity. Simultaneously, a feature cross-fusion module is introduced to facilitate complementary interaction between the two branches, which enhances feature expressiveness and mitigates information loss. Furthermore, to enhance multi-scale feature fusion while maintaining low complexity, the neck module (detailed in Section 3.1.2) introduces a structure based on a global fusion feature. This architecture enables top-down propagation of high-level semantic information through the global fusion feature while reducing redundancy along the feature transmission path. Simultaneously, the fused global information is fed back to each feature level, strengthening bottom-up detail representation and improving the robustness of multi-scale features.
3.1.1. Dual-Stream Fusion Backbone (DSF-Backbone)
In ultrasound imaging, free fluid typically appears as hypoechoic areas with continuous boundaries due to the low attenuation of sound waves in fluids. This characteristic exhibits a distinct pattern in the spatial frequency domain: the overall grayscale distribution and regional continuity are primarily contained in the low-frequency components of the image, while boundary details and textural features are carried by the high-frequency components. To observe the information carried by the different frequency bands more clearly, we performed a frequency-domain decomposition of ultrasound images containing free fluid and visualized the result. The corresponding low-frequency and high-frequency components are shown in Figure 2. The low-frequency components retain the overall echo-intensity distribution of the region, clearly displaying the macroscopic morphology and average grayscale level of the free fluid area, which is consistent with the hypoechoic appearance of fluid. As shown in Figure 2, the location and extent of the hypoechoic area can be largely identified using only the low-frequency reconstructed image. The high-frequency components primarily represent detailed image information, including key diagnostic features such as boundary sharpness and internal texture. Boundary sharpness aids in accurately defining the extent of free fluid areas, while the textural features displayed in the high-frequency components effectively distinguish parenchymal tissue areas. Another important basis for identifying free fluid is that it is often surrounded by parenchymal tissue structures. Therefore, high-frequency information holds significant value in assisting localization and differential diagnosis.
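As a rough illustration of this kind of decomposition, the sketch below splits a toy image into a low-frequency part and a high-frequency residual. The box-blur low-pass filter, its radius, and the synthetic 8 × 8 image are assumptions made for illustration only; the text does not state which filter was used to produce Figure 2.

```python
def box_blur(img, radius=1):
    """Low-pass filter: mean over a (2r+1) x (2r+1) window, clipped at the borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [img[j][i] for j in ys for i in xs]
            out[y][x] = sum(vals) / len(vals)
    return out


def split_frequencies(img, radius=1):
    """Return (low, high) such that low + high reproduces the input image."""
    low = box_blur(img, radius)
    high = [[img[y][x] - low[y][x] for x in range(len(img[0]))]
            for y in range(len(img))]
    return low, high


# Toy image (hypothetical values): bright tissue (200) around a dark 4 x 4 fluid block (20).
img = [[200.0] * 8 for _ in range(8)]
for y in range(2, 6):
    for x in range(2, 6):
        img[y][x] = 20.0

low, high = split_frequencies(img)
```

In this toy case the high-frequency residual is zero inside flat regions and non-zero only near the boundary of the dark block, while the low-frequency image still shows where the dark region lies.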
Based on the above analysis, we designed a dual-frequency branch multi-scale feature extraction backbone network, whose structure is illustrated in the “Backbone” part of Figure 1. The network first employs the stem convolutions (Stem Conv) in YOLOx to perform preliminary feature extraction on the original ultrasound images. The resulting features are then split into two independent branches along the channel dimension in proportions $\alpha$ and $1-\alpha$. One branch focuses on low-frequency feature extraction, capturing the macroscopic morphology and overall grayscale distribution in the image. Because the feature maps in the low-frequency branch have a lower resolution, this branch significantly reduces the model’s computational complexity and memory usage, thereby improving the network’s computational efficiency. The other branch specializes in high-frequency feature extraction, capturing fine structural details such as edges and internal textures, which serve as supplementary information, enabling the network to obtain more comprehensive and accurate image features.
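To give a feel for why a lower-resolution branch saves computation, the sketch below estimates the cost of one dual-branch convolution relative to a standard full-resolution convolution. It assumes an octave-convolution-style layer in which a fraction `alpha` of the channels (a hypothetical name) forms the low-frequency branch at half the spatial resolution in each dimension; the actual split ratio and resolution of the DSF backbone may differ.

```python
def relative_cost(alpha):
    """Multiply-accumulate cost relative to a full-resolution convolution.

    alpha: fraction of channels in the low-frequency branch (assumed to run
    at half resolution per axis, i.e. a quarter of the pixels).
    """
    high_to_high = (1 - alpha) ** 2          # full resolution
    cross = 2 * alpha * (1 - alpha) / 4      # high->low and low->high, quarter of the pixels
    low_to_low = alpha ** 2 / 4              # quarter of the pixels
    return high_to_high + cross + low_to_low


for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"alpha={a:.2f}  relative cost={relative_cost(a):.4f}")
```

Under these assumptions the expression simplifies to $1 - \tfrac{3}{4}\alpha(2-\alpha)$, so moving half of the channels to the low-frequency branch already removes more than half of the convolution cost.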
For a single branch, as the feature extraction module deepens, features on that branch gradually lose some information and become overly simplified. To address this issue, we introduce a feature cross-fusion module after each scale of feature extraction, enabling information exchange and fusion between the high-frequency and low-frequency features of the two branches. The structure of the cross-fusion module is based on octave convolution [32], with the specific architecture illustrated in Figure 3. The dual-frequency branches process the high-frequency features ($X^{H}$) and low-frequency features ($X^{L}$), respectively. After interaction and fusion, the generated new high-frequency features ($Y^{H}$) and low-frequency features ($Y^{L}$) are given by

$$Y^{H} = Y^{H \to H} + Y^{L \to H}, \qquad Y^{L} = Y^{L \to L} + Y^{H \to L},$$

respectively, where $Y^{A \to B}$ denotes the feature maps obtained by the convolutional update from feature map group A to group B. Specifically, $Y^{H \to H}$ and $Y^{L \to L}$ represent the feature maps obtained through intra-frequency updates, while $Y^{H \to L}$ and $Y^{L \to H}$ represent the feature maps obtained through inter-frequency communication. The intra-frequency update is implemented via a convolution operation; the inter-frequency communication corresponding to $Y^{H \to L}$ is completed through convolution and pooling-based downsampling operations (Conv & pool), and $Y^{L \to H}$ is realized through convolution and upsampling operations (Conv & Up). By leveraging octave convolution, cross-frequency communication is effectively achieved, promoting complementary information integration between the two branches.
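The data flow of such a cross-fusion step can be sketched on single-channel toy feature maps. In this minimal sketch each learned convolution is replaced by a scalar multiplication (the hypothetical weights `w_hh`, `w_ll`, `w_hl`, `w_lh`), the pooling is 2 × 2 average pooling, the upsampling is nearest-neighbour, and the order of pooling and convolution follows octave convolution; all of these are illustrative assumptions rather than the configuration used in the paper.

```python
def scale(x, w):
    """Stand-in for a learned convolution: multiply every value by w."""
    return [[w * v for v in row] for row in x]


def avg_pool2(x):
    """2 x 2 average pooling (halves height and width)."""
    return [[(x[2 * i][2 * j] + x[2 * i][2 * j + 1]
              + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(len(x[0]) // 2)]
            for i in range(len(x) // 2)]


def upsample2(x):
    """Nearest-neighbour upsampling (doubles height and width)."""
    return [[x[i // 2][j // 2] for j in range(2 * len(x[0]))]
            for i in range(2 * len(x))]


def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def cross_fusion(x_h, x_l, w_hh=1.0, w_ll=1.0, w_hl=1.0, w_lh=1.0):
    """Fuse a high-frequency map x_h with a half-resolution low-frequency map x_l."""
    y_hh = scale(x_h, w_hh)                 # intra-frequency update, high branch
    y_ll = scale(x_l, w_ll)                 # intra-frequency update, low branch
    y_hl = scale(avg_pool2(x_h), w_hl)      # high -> low: pool, then "conv"
    y_lh = upsample2(scale(x_l, w_lh))      # low -> high: "conv", then upsample
    return add(y_hh, y_lh), add(y_ll, y_hl)


x_h = [[1.0] * 4 for _ in range(4)]   # 4 x 4 high-frequency map
x_l = [[2.0] * 2 for _ in range(2)]   # 2 x 2 low-frequency map
y_h, y_l = cross_fusion(x_h, x_l, w_hl=0.5)
```

Each output keeps the resolution of its own branch while receiving a contribution from the other branch, which is the property the cross-fusion module relies on.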
3.1.2. Global Fusion Feedback Neck (GFF-Neck)
In object detection networks, the neck module plays a crucial role in feature fusion, with its core objective being the comprehensive integration of multi-scale features extracted from the backbone module. Typically, the outputs from the last three hierarchical levels of the backbone module are selected for fusion, such as the lower-level C3, mid-level C4, and highest-level C5, as shown in Figure 1. These feature maps possess varying spatial resolutions, containing rich positional details, mid-level semantic information, and higher-level abstract features, respectively, collectively forming the foundation of multi-scale representation for detection tasks. Currently, widely used neck modules, such as the Feature Pyramid Network (FPN) [33] and its extensions like the Path Aggregation Network (PANet) [34], employ bidirectional top-down and bottom-up fusion mechanisms across layers to achieve thorough interaction among cross-scale features. This effectively enhances the model’s ability to recognize objects at multiple scales. However, such methods often introduce significant computational and parameter overhead, and their feature propagation paths may contain redundancies, which makes it difficult to further improve the trade-off between efficiency and performance. We therefore explore a more efficient fusion mechanism that integrates multi-scale features effectively while reducing redundant operations.
To reduce redundancy in feature propagation paths, we first propose a global fusion mechanism that directly integrates multi-scale features into a unified global fusion (GF) feature. This GF feature is subsequently leveraged to enhance the expressive capacity of all feature levels, thereby streamlining propagation paths and reducing redundancy. The generation of the GF feature consists of two main steps: first, compressing high-level semantic features (C4/C5) to extract lightweight contextual descriptors; then, fusing them with the detail-rich low-level feature C3, as illustrated in Figure 4. Specifically, for the C4/C5 features, a Pyramid Pooling Module (PPM) [35] is applied to generate multi-receptive-field contextual feature maps through adaptive average pooling at four different scales. The outputs from the PPM undergo convolution and upsampling operations, followed by concatenation and fusion along the channel dimension. The resulting features are then processed through convolution and upsampling and finally combined with the C3 feature to form the GF feature. This approach offers several advantages:
- ➀ The pooling operations at different scales in the PPM capture a range of contextual information. By integrating multi-scale features, the model’s receptive field coverage is effectively expanded, which aids in improving the network’s ability to identify target regions.
- ➁ The pooling operations themselves are parameter-free, significantly reducing computational overhead.
- ➂ This design avoids the traditional stepwise propagation structure in FPN, such as C5 → C4 → C3, reducing the number of feature transmissions and mitigating information degradation during propagation.
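The pooling stage described above can be sketched on a single-channel toy feature map. The bin sizes `(1, 2, 3, 6)` are placeholders borrowed from the original PPM design and are not necessarily those used in this work; the convolutions between pooling and upsampling are omitted, and nearest-neighbour upsampling is an assumption.

```python
def adaptive_avg_pool(x, bins):
    """Average-pool a 2-D map to bins x bins cells (floor start, ceil end per cell)."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(bins):
        y0, y1 = (i * h) // bins, -(-((i + 1) * h) // bins)
        row = []
        for j in range(bins):
            x0, x1 = (j * w) // bins, -(-((j + 1) * w) // bins)
            vals = [x[yy][xx] for yy in range(y0, y1) for xx in range(x0, x1)]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def upsample_nearest(x, h, w):
    """Resize a 2-D map to h x w by nearest-neighbour lookup."""
    bh, bw = len(x), len(x[0])
    return [[x[(yy * bh) // h][(xx * bw) // w] for xx in range(w)]
            for yy in range(h)]


def pyramid_context(feature, bin_sizes=(1, 2, 3, 6)):
    """Return one context map per bin size, each resized back to the input size.

    In a real network these maps would be concatenated along the channel axis.
    """
    h, w = len(feature), len(feature[0])
    return [upsample_nearest(adaptive_avg_pool(feature, b), h, w)
            for b in bin_sizes]


# Toy 6 x 6 feature map with values 0..35.
feature = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
context = pyramid_context(feature)
```

The pooling itself has no learnable parameters: the coarsest map holds a single global average, while the finest map reproduces the input.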
The GF feature is then further fed back to each feature level from the bottom up and integrated with their original features, as illustrated in the “Neck” part of Figure 1. The fused features are then input into their corresponding detection heads to accomplish the final detection tasks. This design effectively combines global contextual information with local detail features while maintaining a bidirectional information flow (top-down and bottom-up): on one hand, the global information fusion mechanism enables the top-down propagation of high-level semantic information; on the other hand, by feeding GF features back to each level, it further enhances the bottom-up expression of detailed information. Although this process is simple, it enriches and reinforces the feature representation at each level, achieving efficient multi-scale feature fusion.
3.2. Data Collection
Because sufficient data on active internal hemorrhage in humans are difficult to obtain, and in order to accurately simulate the ultrasound characteristics of free fluid during internal hemorrhage, this study established an active liver bleeding model in rabbits. This animal model was used to collect ultrasound images under controlled and standardized conditions, creating a dedicated dataset to validate the effectiveness of the proposed detection method. The dataset, derived from real active hemorrhage processes, better reflects the ultrasonographic features of traumatic internal hemorrhage and lays the foundation for future translation of the algorithm to human applications.
In the experiments, 18 healthy adult rabbits were selected. Under ultrasound guidance, the anatomical position of the liver was precisely located, and an active hemorrhage model was constructed by puncturing the liver parenchyma with an 11-gauge sharp knife while avoiding major blood vessels and bile ducts, as shown in Figure 5. Ultrasound image acquisition commenced immediately after inducing hemorrhage, and the ultrasound images of the liver and the kidney-surrounding area were recorded from multiple perspectives. A total of 677 frames were extracted at fixed intervals from the recorded ultrasound videos to form the free fluid dataset. During the acquisition process, free fluid was absent in some images due to probe movement or positioning deviations. Such images were retained during dataset construction and annotated as target-free samples to enhance the model’s ability to identify negative samples and improve the overall robustness of the network. All data were professionally annotated by physicians with over five years of clinical experience to obtain accurate labels for the free fluid.
3.3. Model Training
All models in this study were implemented using the PyTorch framework and trained on an NVIDIA RTX 3090 GPU. The model was optimized using Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay coefficient of to improve generalization. The training process employed a multi-stage learning rate scheduling strategy. Specifically, the initial value of the learning rate was set to and gradually increased to this value over the first 5 epochs using a quadratic formula warm-up strategy. Subsequently, the learning rate was reduced to and at the 60% and 85% marks of the total training epochs, respectively, to facilitate convergence toward a better local optimum in the later training stages. Training was performed with a batch size of 16. All input images were resized to pixels, and data augmentation techniques were applied to enhance model robustness.
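The shape of the learning-rate schedule described above (quadratic warm-up over the first 5 epochs, then two step reductions at 60% and 85% of training) can be sketched as follows. The base learning rate `base_lr` and the two decay factors in `decay` are hypothetical placeholders chosen only to make the sketch runnable; they are not the values used in the experiments.

```python
def learning_rate(epoch, total_epochs, base_lr=0.01, warmup_epochs=5,
                  decay=(0.1, 0.01)):
    """Learning rate at a (possibly fractional) epoch.

    base_lr and decay are placeholder values, not taken from the paper.
    """
    if epoch < warmup_epochs:
        # Quadratic warm-up from 0 up to base_lr.
        return base_lr * (epoch / warmup_epochs) ** 2
    if epoch < 0.60 * total_epochs:
        return base_lr
    if epoch < 0.85 * total_epochs:
        return base_lr * decay[0]      # first reduction at 60% of training
    return base_lr * decay[1]          # second reduction at 85% of training


total = 100
for e in (0, 2.5, 5, 59, 60, 84, 90):
    print(f"epoch {e:>4}: lr = {learning_rate(e, total):.6f}")
```

Evaluating the function per iteration (with a fractional `epoch`) rather than per epoch gives a smooth warm-up curve; whether the original training updates the rate per iteration or per epoch is not stated here.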
The loss function is designed following YOLOx’s unified framework, which integrates object detection and classification into a single objective. The overall loss can be expressed as

$$L = \lambda_{reg} L_{reg} + \lambda_{cls} L_{cls} + \lambda_{obj} L_{obj},$$

where $\lambda_{reg}$, $\lambda_{cls}$, and $\lambda_{obj}$ are balancing coefficients, which are assigned values of 5, 1, and 1, respectively, in our experiments according to the default settings of YOLOx. Each loss component is defined as follows: (1) the bounding box regression loss ($L_{reg}$) uses IoU loss to improve localization precision; (2) the class prediction loss ($L_{cls}$) and the object confidence loss ($L_{obj}$) are both implemented using cross-entropy, with a weighted positive–negative sample strategy to alleviate class imbalance. For label assignment, SimOTA is used instead of static assignment: it dynamically selects positive samples based on the matching cost between predictions and ground truths. The matching cost for the $i$-th prediction and $j$-th ground truth is defined as

$$c_{ij} = \gamma_{cls} L_{cls}^{ij} + \gamma_{reg} L_{reg}^{ij},$$

where $\gamma_{cls}$ and $\gamma_{reg}$ are adjustable weights, which are assigned values of 1 and 3, respectively, in our experiments according to the default settings of YOLOx. Based on $c_{ij}$, the top-$k$ predictions with the lowest cost are assigned as positive samples for each ground truth, which improves both training efficiency and final detection performance.
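The cost-based selection of positive samples can be illustrated with a simplified, single-class sketch. It uses the weights 1 and 3 mentioned above, a negative-log classification term, and a negative-log-IoU regression term; the fixed `k`, the helper names, and the toy boxes are assumptions, and the full SimOTA procedure (dynamic `k`, centre prior) is not reproduced.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def matching_cost(pred_box, pred_score, gt_box, w_cls=1.0, w_reg=3.0):
    """Weighted sum of a classification cost and a localization cost."""
    cls_cost = -math.log(max(pred_score, 1e-9))
    reg_cost = -math.log(max(iou(pred_box, gt_box), 1e-9))
    return w_cls * cls_cost + w_reg * reg_cost


def assign_positives(preds, gt_box, k=2):
    """Return indices of the k predictions with the lowest matching cost."""
    costs = [(matching_cost(box, score, gt_box), idx)
             for idx, (box, score) in enumerate(preds)]
    return [idx for _, idx in sorted(costs)[:k]]


gt = (0.0, 0.0, 10.0, 10.0)
preds = [((0.0, 0.0, 10.0, 10.0), 0.90),     # perfect overlap
         ((1.0, 1.0, 11.0, 11.0), 0.80),     # slightly shifted
         ((20.0, 20.0, 30.0, 30.0), 0.95)]   # confident but far away
positives = assign_positives(preds, gt, k=2)
```

In this toy case the confident but non-overlapping prediction is rejected because its localization cost dominates, which is the intended effect of weighting the regression term more heavily.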
Given the limited scale of the dataset, data augmentation was employed during training to enhance the model’s generalization ability. We adopt the data augmentation methods used in YOLOx, including Mosaic, MixUp, and RandomFlip. It should be noted that RandomFlip is applied only horizontally, not vertically. This is because ultrasound images exhibit a distinct physical orientation during acquisition: the upper part of the image corresponds to the epidermal tissue near the probe, while the lower part typically represents acoustic shadows or artifact regions. Vertical flipping would disrupt the alignment between the image structure and the actual anatomical position, violating the physical principles of ultrasound imaging. Therefore, vertical flipping is not adopted in this study.
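A horizontal-only flip that keeps the depth axis intact might look like the following sketch. The image is a plain 2-D list, boxes are `(x1, y1, x2, y2)` in pixel-edge coordinates, and the function names and flip probability are illustrative assumptions rather than the YOLOx implementation.

```python
import random


def hflip(image, boxes):
    """Mirror an image left-right and update its boxes; rows (depth) are untouched."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    new_boxes = [(width - x2, y1, width - x1, y2) for (x1, y1, x2, y2) in boxes]
    return flipped, new_boxes


def random_hflip(image, boxes, p=0.5, rng=random):
    """Apply hflip with probability p; never flips vertically."""
    if rng.random() < p:
        return hflip(image, boxes)
    return image, boxes


image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
boxes = [(0, 0, 1, 2)]          # leftmost column
flipped, flipped_boxes = hflip(image, boxes)
```

Because only the column order changes, the top row of the flipped image still corresponds to tissue nearest the probe.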
3.4. Evaluation Metrics
To comprehensively evaluate the model and compare it with other mainstream object detection models, we employ precision, recall, mean average precision (mAP), and F1-score as performance metrics. Higher recall indicates a lower risk of missed detection, while higher precision corresponds to a lower false alarm rate; the F1-score provides a balanced measure of both. In addition, model complexity is assessed using the number of parameters (Params) and the number of floating-point operations (FLOPs), reported in units of $10^{9}$ operations (G).
Given that the free fluid detection task places relatively relaxed demands on bounding box localization accuracy while prioritizing the avoidance of missed and false detections, this study reports not only the mAP@50:95 commonly used in object detection but also fine-grained metrics under different IoU strictness levels. Specifically, the following metrics are reported:
- mAP@50:95: The mean average precision computed over IoU thresholds from 0.5 to 0.95 with a step size of 0.05, reflecting the model’s overall robustness in localization quality across varying matching strictness.
- Metrics at IoU = 0.5: precision, recall, and F1-score under a looser matching condition.
- Metrics at IoU = 0.75: precision, recall, and F1-score under a stricter matching condition.
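The effect of the IoU threshold on precision, recall, and F1-score can be seen in a small single-image sketch. The greedy one-to-one matching, the assumption that predictions are already sorted by confidence, and the toy boxes are illustrative choices, not the evaluation code used in the experiments.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def detection_metrics(preds, gts, iou_thr):
    """Precision, recall, F1 with greedy matching; preds sorted by confidence."""
    matched, tp = set(), 0
    for p in preds:
        best, best_j = 0.0, None
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best:
                best, best_j = v, j
        if best_j is not None and best >= iou_thr:
            matched.add(best_j)
            tp += 1
    fp, fn = len(preds) - tp, len(gts) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# IoU thresholds averaged by mAP@50:95.
map_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

gts = [(0.0, 0.0, 10.0, 10.0)]
preds = [(1.0, 1.0, 11.0, 11.0)]            # IoU with the ground truth is about 0.68
loose = detection_metrics(preds, gts, 0.5)
strict = detection_metrics(preds, gts, 0.75)
```

The same prediction counts as a hit at IoU = 0.5 and as both a false alarm and a miss at IoU = 0.75, which is why the two operating points are reported separately.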
Due to the limited sample size of the dataset, a 5-fold cross-validation strategy was adopted to fully evaluate model performance and enhance result stability. All data were randomly divided into five mutually exclusive subsets. In each validation round, one subset was used as the test set while the remaining four formed the training set, resulting in five independent training-test cycles. To ensure fairness and consistency across comparative experiments, the same set of data partitions was used in all comparisons, thereby eliminating performance variations caused by random data splits. All reported performance metrics are the arithmetic mean of the results from the five folds. This approach provides a more robust estimate of the model’s generalization ability under data-limited conditions and reduces evaluation bias introduced by single random partitioning.
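A fixed, reusable 5-fold partition of the kind described above can be sketched as follows. The seed value and function names are assumptions; the point is only that a single seeded shuffle yields the same mutually exclusive folds for every model being compared.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs from one seeded shuffle."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # mutually exclusive subsets
    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(x for j, fold in enumerate(folds) if j != i for x in fold)
        splits.append((train, test))
    return splits


def mean_over_folds(scores):
    """Arithmetic mean of a per-fold metric."""
    return sum(scores) / len(scores)


splits = k_fold_indices(677)    # 677 frames, as in the dataset described earlier
```

Calling `k_fold_indices` with the same seed for every detector under comparison reproduces identical partitions, so differences in the averaged metrics cannot be attributed to the split.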
5. Discussion
This paper proposes a lightweight network for detecting free fluid in FAST examinations to screen for internal hemorrhage. Built upon YOLOx, the model introduces a dual-stream backbone that splits features into high-frequency and low-frequency branches according to ultrasound characteristics. The low-frequency branch captures the overall echo-intensity distribution and macro-shape of the fluid region with reduced computational cost, while the high-frequency branch provides details such as boundary sharpness and internal texture. To preserve information during feature extraction, a feature cross-fusion module is introduced after each stage to integrate both branches. For the neck module, a global fusion feedback mechanism aggregates multi-scale features into a unified global representation and feeds it back bottom-up, enhancing feature representation while reducing redundancy along the propagation path.
Experimental results demonstrate that the proposed method outperforms mainstream object detection methods in terms of detection accuracy, including precision, recall, and F1-score, while achieving the lowest computational complexity. YOLOx-tiny-Lite, constructed by incorporating the proposed lightweight strategy into the YOLOx-tiny baseline, requires only 3.543 G FLOPs and 4.592 M parameters, yet achieves the best mAP@50:95 of 71.02%. At an IoU threshold of 0.5, its precision, recall, and F1-score all exceed 98%; at IoU = 0.75, these metrics remain above 83%, indicating that the model can accurately detect free fluid and locate its position. YOLOx-nano-Lite, derived from the YOLOx-nano baseline with the proposed lightweight method, requires only 0.554 G FLOPs and 0.82 M parameters, with its mAP@50:95 also higher than 70%. Further ablation studies show that the proposed DSF backbone reduces computational cost while improving detection accuracy, with the feature cross-fusion module playing a key role in preventing information loss during dual-branch feature extraction. The GFF neck enhances the perception of free fluid regions through global feature fusion, further reducing computation while ensuring efficient multi-scale information fusion.
Although the above experiments have verified the effectiveness of the proposed lightweight method, deployment of the model on edge hardware has not been implemented in this work. In future research, we will focus on operator optimization and software–hardware co-design, port the algorithm to portable ultrasound edge computing platforms such as FPGA/ARM, and carry out real-time inference tests and verification in clinical scenarios so as to further improve the practical engineering value of the proposed method.
The proposed model is primarily developed for intelligent FAST examination. Given the challenges in acquiring a sufficient number of ultrasound images of human internal hemorrhage and to more accurately capture the morphology of free fluid under active hemorrhage conditions, the model is trained using rabbit-derived ultrasound images from liver tissues with active hemorrhage. Rabbits exhibit high similarity to humans in terms of acoustic characteristics of free fluid in ultrasound imaging; training on this dataset, therefore, enables the model to more effectively extract sonographic features associated with active hemorrhage. Nevertheless, certain differences in organ morphology persist between rabbits and humans. While the dataset can be used to algorithmically validate the effectiveness of the proposed model, future research should focus on exploring effective strategies for model transfer and adaptation so as to facilitate its application in the actual diagnosis and treatment of human patients.