1. Introduction
Transthoracic echocardiography (TTE) is an essential diagnostic tool in cardiovascular medicine that enables the real-time assessment of cardiac motion and hemodynamics without causing pain or exposing patients to radiation. When TTE is performed by non-cardiology specialists for rapid diagnosis and initial treatment decision-making, it is termed Focus Cardiac Ultrasound (FoCUS). FoCUS is increasingly recognized as an essential skill for physicians, particularly in the emergency department, the intensive care unit (ICU), and primary care settings [1,2,3,4,5]. Globally, there is a growing emphasis on integrating ultrasound education, including TTE, into undergraduate medical training, with widespread adoption expected among medical students [6].
FoCUS is a brief bedside screening examination mainly conducted by non-cardiology specialists. Its primary goal is to prevent the oversight of common or potentially life-threatening conditions by focusing on a limited number of essential findings using simple and fundamental techniques [7]. Unlike comprehensive echocardiography conducted by ultrasound specialists, FoCUS is not designed to provide a detailed evaluation of the entire heart or to quantify parameters such as left ventricular ejection fraction or transvalvular flow velocities.
A significant challenge for beginners in FoCUS training is determining whether the image they obtain represents an appropriate cross-sectional view. Typically, learners acquire this skill through hands-on sessions with ultrasound instructors, who provide direct feedback on image quality. However, few tools are currently available for evaluating image appropriateness, making it difficult for beginners to train effectively without the supervision of experienced practitioners. In Japan, many physicians receive FoCUS training during their residency; nevertheless, the limited availability of instructors or technicians in some institutions has led to junior residents completing their programs without acquiring adequate FoCUS skills [7].
Recently, advancements in artificial intelligence (AI) have improved image recognition technology. Notably, research has focused on tasks such as standard view classification, which involves identifying commonly depicted views on TTE, such as the parasternal long-axis view, parasternal short-axis view at the mitral valve level, and apical four-chamber view [8,9,10,11,12,13,14,15]. Additionally, ultrasound devices equipped with AI-guided scanning assistance have become available, helping users acquire appropriate cross-sectional views. However, the underlying algorithms differ among manufacturers and mostly depend on specific ultrasound systems.
There are two essential elements for acquiring optimal TTE images in FoCUS. The first is the position of the probe required to obtain the intended view. The second is whether the anatomical structures necessary for FoCUS are adequately visualized. Factors such as respiratory motion, imaging artifacts, insufficient gel application, and structures not being displayed at their expected locations can impair the second element. In this study, these two elements are referred to as “position” and “quality,” respectively. Deficiencies in either element can hinder rapid and accurate diagnosis in clinical settings. Skilled sonographers achieve optimal imaging by understanding the target view and continuously adjusting the position and orientation of the probe based on real-time image feedback. To acquire this level of skill, trainees must learn to recognize optimal views and to infer the necessary probe adjustments from the images they observe.
In this study, an AI for assessing TTE images was developed as an educational tool to help beginners acquire FoCUS skills. A two-step framework was established for this purpose. Step 1 involves classifying the TTE images into three standard views. Step 2 involves evaluating whether each image corresponds to an optimal cross-sectional view and is of sufficient quality. Completing both steps enables the identification of images that are suitable for FoCUS.
Various studies have addressed Step 1 (the prerequisite task), which involves classifying standard echocardiographic views [8,9,10,11,12,13,14,15,16]. From a methodological perspective, deep learning approaches have become the standard for automated view classification, particularly convolutional neural networks (CNNs) such as ResNet, VGG, and Inception, as well as neural architecture search (NAS). These models achieve high classification accuracies, often exceeding 95%, and have demonstrated robustness across diverse clinical scenarios, including point-of-care ultrasound (POCUS), contrast-enhanced echocardiography, and handling cardiac motion via spatial-temporal or graph-based constraints.
Despite these strengths, a key limitation of existing tools is their reliance on datasets composed primarily of clinically optimal, properly aligned standard views. Their primary objective is to identify which standard view is depicted, rather than to detect how the probe is misaligned. Consequently, they cannot evaluate the typical suboptimal images produced by novice trainees during hands-on practice. Some studies have also investigated the use of neural networks to evaluate image quality in TTE as part of Step 2 (the primary task) [16,17,18]. However, estimating the position of the probe from an image requires a unique dataset that includes intentionally misaligned or suboptimal images. Because existing automated tools were not designed or trained to output fine-grained probe deviation classes (e.g., distinguishing between specific sliding, tilting, or rotating errors from a reference view), a direct empirical comparison between our implementation and other existing tools using the same data is not feasible; the classification tasks and output labels are fundamentally different.
Three types of image-classification tasks were conducted in this study. The first task addressed Step 1: a view classifier identified three cross sections, namely the parasternal long-axis (PLAX) view, the parasternal short-axis (PSAX) view at the mitral valve level, and the apical four-chamber (A4C) view. The second task involved estimating positional deviations from the correct probe position using echocardiographic images. This position classification task used a position classifier to categorize 15 types of deviations in the parasternal long-axis view, 19 in the parasternal short-axis view, and 18 in the apical four-chamber view. The third task evaluated the visual quality of the displayed images: a quality classifier categorized them into four levels (best, acceptable, poor, and bad). The position and quality classifiers were designed to achieve Step 2, and each was trained using the view classifier as the pretrained model. Subsequently, the predictions of the position and quality classifiers were combined or threshold-processed, using the quality classification as a weighting factor, to enhance the performance of the position classification. The classification results were evaluated using Venn diagrams under two conditions: one where all images, regardless of quality, were defined as targeted images and the other where only high-quality images (best or acceptable) were defined as targeted images. The latter metric was intended to identify images suitable for FoCUS. Finally, Step 3 is discussed based on the results of the position classifier for non-optimal cross sections.
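The flow of the two steps described above can be sketched as follows. This is only an illustrative outline, not the implementation used in the study: the classifiers are modeled as arbitrary callables, and the names `assess_frame`, `Assessment`, and the position label `"optimal"` are hypothetical placeholders introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

VIEWS = ("PLAX", "PSAX", "A4C")
GOOD_QUALITY = {"best", "acceptable"}

# Assumption: each classifier is any callable that maps a frame to a label string.
Classifier = Callable[[object], str]


@dataclass
class Assessment:
    view: str      # Step 1 output
    position: str  # Step 2 output: "optimal" (hypothetical name) or a deviation class
    quality: str   # Step 2 output: best / acceptable / poor / bad

    @property
    def focus_suitable(self) -> bool:
        # Suitable only when the probe position is optimal AND quality is best/acceptable.
        return self.position == "optimal" and self.quality in GOOD_QUALITY


def assess_frame(
    frame: object,
    view_clf: Classifier,
    position_clfs: Dict[str, Classifier],
    quality_clfs: Dict[str, Classifier],
) -> Assessment:
    """Run Step 1 (view), then the view-specific Step 2 classifiers."""
    view = view_clf(frame)
    if view not in VIEWS:
        raise ValueError(f"unexpected view label: {view!r}")
    position = position_clfs[view](frame)
    quality = quality_clfs[view](frame)
    return Assessment(view=view, position=position, quality=quality)
```

Keeping one position classifier and one quality classifier per view mirrors the per-view models reported in the Results; how their outputs are merged (for example by intersection or union) is a separate design choice.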
We engaged in an engineering optimization task aimed at the practical implementation of an AI system that assesses TTE image quality based on the FoCUS criteria without requiring supervision from expert sonographers or technicians. While we utilized an existing AI architecture for system development, the novelty of this study lies in the construction of a framework that integrates a unique dataset with distinct evaluation models. In particular, images where the probe position deviated from the optimal cross-sectional plane were collected. Classifiers based on these images were constructed to estimate the probe position and assess diagnostic image quality.
3. Results
3.1. Participant Characteristics
Table 3 shows the characteristics of the participants whose data were used to train the position, quality, and view classification models. Overall, 281,534 frames from 14 participants (maximum: 22,202 frames; minimum: 1918 frames) were used exclusively for training. An additional 425,781 frames from 15 participants (an average of 28,385 frames per subject) were used for testing.
Among the 75 participants initially enrolled for the position and quality evaluation models, seven were excluded owing to factors such as pectus excavatum or insufficient image acquisition. The remaining 1,886,225 frames from 68 participants were grouped as follows: 595,571 frames from 53 participants (average: 11,237 frames per participant after undersampling) were used for training and validation, and 425,781 frames from 15 participants (average: 28,385 frames per participant) were used for testing.
The undersampled data used for training were as follows (Figure 4): view classification model, 82,940 frames per class; position evaluation model, 14,089 frames per class for the parasternal long-axis view, 11,690 for the parasternal short-axis view, and 9007 for the apical four-chamber view; quality evaluation model, 95,420 frames per class for the parasternal long-axis view, 39,247 for the parasternal short-axis view, and 45,823 for the apical four-chamber view.
For testing, the view classification model used 425,781 frames (average: 28,385 frames per participant), while the position and quality evaluation models used 141,927 frames (average: 9461 frames per participant). The distribution of quality labels according to their position in the test data is shown in Table 4.
3.2. Results of the View Classification Model
Figure 6 shows a confusion matrix for classifying echocardiographic images into three standard views: the parasternal long-axis, parasternal short-axis, and apical four-chamber views. Numbers within each square indicate the number of images in each category.
Of the images with the true label “PLAX,” 121,873 were correctly classified, whereas 5176 were misclassified as “PSAX” and 229 as “A4C.” Of the images with the true label “PSAX,” 157,972 were correctly classified, whereas 9044 were misclassified as “PLAX” and 3836 as “A4C.” Of the images with the true label “A4C,” 126,793 were correctly classified, whereas 244 were misclassified as “PLAX” and 614 as “PSAX.” The rows indicate true labels, and the columns indicate predicted labels. Color intensity corresponds to the number of images, as indicated by the color bars.
The view classification model, designed to categorize the three standard views, achieved an accuracy of 0.955, a recall of 0.959, a precision of 0.954, and an F1-score of 0.956 (Figure 6). Among the 425,781 images in the test dataset (15 participants), 19,143 images (4.5%) were misclassified. Given the high F1-score, the view classification model was considered suitable as a pretrained model and was subsequently used to develop the position and quality evaluation models.
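As a cross-check, the summary metrics can be recomputed from the confusion-matrix counts reported above. The sketch below assumes macro averaging (the unweighted mean of the per-class values), which the text does not state explicitly; under that assumption the accuracy, precision, and F1-score round to the reported values, and the recall comes out at about 0.958, within 0.001 of the reported 0.959. The function name `macro_metrics` is illustrative.

```python
# Rows are true labels and columns are predicted labels, in the order PLAX, PSAX, A4C.
LABELS = ["PLAX", "PSAX", "A4C"]
CONFUSION = [
    [121873, 5176, 229],
    [9044, 157972, 3836],
    [244, 614, 126793],
]


def macro_metrics(cm):
    """Accuracy plus macro-averaged recall, precision, and F1 from a square matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    recalls, precisions, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                        # rest of the true-label row
        fp = sum(cm[r][i] for r in range(n)) - tp   # rest of the predicted-label column
        recalls.append(tp / (tp + fn))
        precisions.append(tp / (tp + fp))
        f1s.append(2 * tp / (2 * tp + fp + fn))
    return {
        "accuracy": correct / total,
        "recall": sum(recalls) / n,
        "precision": sum(precisions) / n,
        "f1": sum(f1s) / n,
        "errors": total - correct,
    }


view_metrics = macro_metrics(CONFUSION)
```

The `errors` entry is the sum of the off-diagonal cells, which corresponds to the 19,143 misclassified images.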
3.3. Results of the Position Evaluation Model
The performance of the position evaluation models, which assess whether the probe is correctly positioned to capture optimal views, including the PLAX view, PSAX view, and A4C view, was evaluated using the test dataset (Table 5, Figure 7).
Figure 7. Inference Results of the Position Evaluation Model: The vertical axis represents the actual labels, and the horizontal axis represents the predicted labels. Each cell color indicates the number of samples. (a) Position evaluation model for parasternal long-axis view. (b) Position evaluation model for parasternal short-axis view. (c) Position evaluation model for apical four-chamber view. Inference Results of the Quality Evaluation Model: The vertical axis represents the actual labels, and the horizontal axis represents the predicted labels. Each cell color indicates the number of samples. (d) Quality evaluation model for parasternal long-axis view. (e) Quality evaluation model for parasternal short-axis view. (f) Quality evaluation model for apical four-chamber view. Three colored outlines (green, red, and orange) enclose parts of the confusion matrix and indicate the combinations corresponding to the regions in the Venn diagram in Figure 8.
Figure 8.
Venn Diagrams of the Predicted Appropriate Sections by Position and Quality Evaluation Models Compared with the Ground Truth: The green dashed line, red solid line, and purple solid line represent the regions of the ground truth labels for appropriate sections, the predicted regions by the position evaluation model, and the predicted regions by the quality evaluation model, respectively. (a) Parasternal long-axis view. (b) Parasternal short-axis view. (c) Apical four-chamber view. Venn Diagrams of the Predicted Appropriate Sections with Best or Acceptable Quality by Position and Quality Evaluation Models Compared with the Ground Truth: The ochre dashed line, red solid line, and purple solid line represent the regions of the ground truth labels for appropriate sections with acceptable quality (best or acceptable), the predicted regions by the position evaluation model, and the predicted regions by the quality evaluation model, respectively. (d) Parasternal long-axis view. (e) Parasternal short-axis view. (f) Apical four-chamber view.
Figure 7 shows the confusion matrices for the position evaluation model applied separately to each standard view: (a) PLAX, (b) PSAX, and (c) A4C views. The matrix compares the number of images correctly classified as “optimal view” versus those classified as “other” positions for each view.
For the PLAX model, 32,109 images with the true label “PLAX” were correctly identified as optimal, while 19,788 were misclassified as “others.” Conversely, 10,631 images from other positions were misclassified as “PLAX,” and 64,750 were correctly classified as “others.” The model achieved an accuracy of 0.761, a recall of 0.619, a precision of 0.751, and an F1-score of 0.679 when evaluating the proper sections. When including optimal and non-optimal sections (full range of positional classes), the model achieved a recall of 0.492, a precision of 0.477, and an F1-score of 0.474 (see Supplementary Figure S1(I) online).
For the PSAX model, 49,932 images with the true label “PSAX” were correctly identified, while 3989 were misclassified as “others.” Among the images from other positions, 11,687 were incorrectly classified as “PSAX” and 105,244 were correctly identified as “others.” The accuracy, recall, precision, and F1-score for the proper sections were 0.908, 0.926, 0.810, and 0.864, respectively. When including optimal and non-optimal sections, the recall, precision, and F1-score were 0.656, 0.680, and 0.660, respectively (see Supplementary Figure S1(II)).
For the A4C model, 31,441 images with the true label “A4C” were correctly classified, while 7292 were misclassified as “others.” Conversely, 5459 images from other positions were incorrectly classified as “A4C,” and 83,459 were correctly classified as “others.” The model achieved an accuracy of 0.900, a recall of 0.812, a precision of 0.852, and an F1-score of 0.831 for the proper sections. When including both optimal and non-optimal sections, the recall, precision, and F1-score were 0.585, 0.614, and 0.573, respectively (see Supplementary Figure S1(III)).
In all matrices, the rows correspond to the true labels and the columns to the predicted labels. Color intensity represents the number of images, as shown by the color scale above each matrix.
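The binary (optimal versus others) metrics reported for the three position evaluation models follow directly from the four counts in each confusion matrix. The short sketch below recomputes them; the container `BinaryCounts` and the helper `binary_metrics` are names chosen for illustration, and the optimal view is treated as the positive class.

```python
from typing import Dict, NamedTuple


class BinaryCounts(NamedTuple):
    tp: int  # optimal view predicted as optimal
    fn: int  # optimal view predicted as "others"
    fp: int  # other position predicted as optimal
    tn: int  # other position predicted as "others"


# Counts as reported for the position evaluation models on the test set.
POSITION_COUNTS: Dict[str, BinaryCounts] = {
    "PLAX": BinaryCounts(tp=32109, fn=19788, fp=10631, tn=64750),
    "PSAX": BinaryCounts(tp=49932, fn=3989, fp=11687, tn=105244),
    "A4C": BinaryCounts(tp=31441, fn=7292, fp=5459, tn=83459),
}


def binary_metrics(c: BinaryCounts) -> Dict[str, float]:
    """Accuracy, recall, precision, and F1 for the positive (optimal) class."""
    accuracy = (c.tp + c.tn) / (c.tp + c.fn + c.fp + c.tn)
    recall = c.tp / (c.tp + c.fn)
    precision = c.tp / (c.tp + c.fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": round(accuracy, 3),
        "recall": round(recall, 3),
        "precision": round(precision, 3),
        "f1": round(f1, 3),
    }


position_metrics = {view: binary_metrics(c) for view, c in POSITION_COUNTS.items()}
```

The same helper applies unchanged to the “Best/Acceptable” versus “Poor/Bad” counts of the quality evaluation model.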
3.4. Results of the Quality Evaluation Model
The performance of the quality evaluation model, which assesses whether key anatomical structures are visible without prominent artifacts, was tested using four quality levels (best, acceptable, poor, and bad) on the test dataset (Table 6, Figure 7).
Figure 7 shows the confusion matrices for the quality evaluation model applied separately to the (d) PLAX, (e) PSAX, and (f) A4C views. In each matrix, true labels are categorized as “Best/Acceptable” or “Poor/Bad,” with predictions shown in the same categories.
For the PLAX view, 34,128 images with the true label “Best/Acceptable” were correctly classified, while 23,070 were misclassified as “Poor/Bad.” Among the images labeled “Poor/Bad,” 10,008 were misclassified as “Best/Acceptable,” while 60,072 were correctly classified. The model achieved an accuracy of 0.740, a recall of 0.597, a precision of 0.773, and an F1-score of 0.674 when assessing combined “best” and “acceptable” quality.
For the PSAX view, 57,411 images with the true label “Best/Acceptable” were correctly identified, while 13,196 were misclassified as “Poor/Bad.” Among the images labeled “Poor/Bad,” 7898 were incorrectly classified as “Best/Acceptable,” and 92,347 were correctly identified. The corresponding scores were an accuracy of 0.877, a recall of 0.813, a precision of 0.879, and an F1-score of 0.845.
For the A4C view, 23,355 images with the true label “Best/Acceptable” were correctly classified, while 7965 were misclassified as “Poor/Bad.” Among the “Poor/Bad” images, 7966 were incorrectly labeled as “Best/Acceptable,” and 88,365 were correctly classified. The accuracy was 0.875, and the recall, precision, and F1-score were each 0.746.
The rows indicate true labels, whereas the columns indicate predicted labels. Color intensity corresponds to the number of images, as shown by the scale bar above each matrix.
3.5. Analysis of Position-Quality Model Combination to Improve Position Evaluation
The combination of the position and quality evaluation models was compared with the standalone position evaluation model (Table 7, Figure 8a–c). The Venn diagrams show the overlap between the ground truth labels (green dashed circles), regions predicted by the position evaluation model (red solid circles), and regions predicted by the quality evaluation model (purple solid circles) for each standard view: (a) PLAX, (b) PSAX, and (c) A4C. The corresponding areas in the confusion matrices in Figure 7 are aligned with the Venn diagrams in Figure 8a–c. The two confusion matrices at the bottom of the figure represent the results of the position evaluation model (left) and quality evaluation model (right).
For the PLAX view, 51,897 images were labeled as PLAX. The position evaluation model predicted 42,740 images as PLAX, while the quality evaluation model predicted 44,136 images as “Best/Acceptable.” The intersection of the two models produced 28,722 images, of which 24,419 were correctly labeled as PLAX. Using this intersection, the recall, precision, and F1-score were 0.471, 0.850, and 0.606, respectively. The union produced 58,154 images, and the corresponding metrics were 0.757, 0.676, and 0.714, representing an improvement in the F1-score compared to the position model alone (0.679).
For the PSAX view, 53,921 images were labeled as PSAX. The position evaluation model predicted 61,619 as PSAX, while the quality evaluation model predicted 65,309 as “Best/Acceptable.” The intersection yielded 51,265 images with a recall, a precision, and an F1-score of 0.815, 0.857, and 0.835, respectively. The union yielded 75,663 images with a recall, a precision, and an F1-score of 0.964, 0.687, and 0.802, respectively. Both F1-scores were lower than the 0.864 achieved by the position model alone.
For the A4C view, 38,733 images were labeled as A4C. The position evaluation model predicted 36,900 as A4C, while the quality evaluation model predicted 31,321 as “Best/Acceptable.” The intersection yielded 24,667 images with a recall, a precision, and an F1-score of 0.602, 0.946, and 0.736, respectively. The union yielded 43,554 images with a recall, a precision, and an F1-score of 0.869, 0.773, and 0.818, respectively. However, both approaches showed lower F1-scores than that of the position model alone (0.831).
Table 7 shows the estimation results. Collectively, these findings indicate that combining the position and quality evaluation models enhanced the F1-score for the PLAX view, especially when using the union approach, while for the PSAX and A4C views, the position model alone outperformed the combined approach.
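The intersection and union combinations compared above can be expressed with ordinary set operations on frame identifiers. The following sketch uses a small made-up example rather than the study data, and the function names `prf` and `combine` are hypothetical; it only illustrates how the two combination rules trade precision against recall.

```python
from typing import Dict, Set, Tuple


def prf(predicted: Set[int], truth: Set[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of a predicted set against the ground truth."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def combine(
    position_pred: Set[int], quality_pred: Set[int], truth: Set[int]
) -> Dict[str, Tuple[float, float, float]]:
    """Score the position model alone and its two combinations with the quality model."""
    return {
        "position_only": prf(position_pred, truth),
        "intersection": prf(position_pred & quality_pred, truth),  # both models agree
        "union": prf(position_pred | quality_pred, truth),         # either model accepts
    }


# Toy frame IDs (illustrative only, not taken from the dataset).
truth = {0, 1, 2, 3, 4, 5}
position_pred = {0, 1, 2, 6}
quality_pred = {1, 2, 3, 4, 7}
toy_results = combine(position_pred, quality_pred, truth)
```

In this toy case the intersection raises precision to 1.0 at the cost of recall, whereas the union raises recall and gives the highest F1-score (10/13, about 0.77), the same qualitative pattern reported for the PLAX view. The reported set sizes are also consistent with inclusion-exclusion: for PLAX, 42,740 + 44,136 − 28,722 = 58,154.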
3.6. Evaluation of Position-Quality Model Combinations for FoCUS
To identify images suitable for FoCUS, position and quality evaluation models were combined to classify images meeting the appropriate position and quality criteria (labeled as either “best” or “acceptable”). The inference results are shown in Table 8 and Figure 8d–f.
Figure 8d–f shows the results of combining the position and quality evaluation models to identify the images suitable for FoCUS. The ochre dashed circles represent the ground truth labels of images with both an appropriate position and “Best/Acceptable” quality, the red solid circles indicate the regions predicted by the position evaluation model, and the purple solid circles indicate the regions predicted by the quality evaluation model.
For the PLAX view, 45,235 images were labeled as PLAX with Best/Acceptable quality. The position evaluation model predicted 42,740 images as PLAX, whereas the quality evaluation model predicted 44,136 images as best or acceptable. The intersection of the two models produced 28,722 images, of which 22,308 were correctly labeled as PLAX with the Best/Acceptable quality. The union resulted in 58,154 images, of which 35,647 were correctly labeled as PLAX with the Best/Acceptable quality. The calculated metrics were as follows: recall 0.493, precision 0.777, and F1-score 0.603 for the intersection, and recall 0.788, precision 0.613, and F1-score 0.690 for the union.
For the PSAX view, 51,222 images were labeled as PSAX with Best/Acceptable quality. The position evaluation model predicted 61,619 images as PSAX, whereas the quality evaluation model predicted 65,309 images as Best/Acceptable. The intersection of the two models produced 51,265 images, of which 43,199 were correctly labeled as PSAX with Best/Acceptable quality. The union resulted in 75,663 images, of which 49,714 were correctly labeled as PSAX with Best/Acceptable quality. The calculated metrics were as follows: recall 0.843, precision 0.843, and F1-score 0.843 for the intersection, and recall 0.971, precision 0.657, and F1-score 0.784 for the union.
For the A4C view, 28,039 images were labeled as A4C with Best/Acceptable quality. The position evaluation model predicted 36,900 images as A4C, whereas the quality evaluation model predicted 31,321 images as Best/Acceptable. The intersection of the two models yielded 24,667 images, of which 20,134 were correctly labeled as A4C with Best/Acceptable quality. The union resulted in 43,554 images, of which 26,278 were correctly labeled as A4C with Best/Acceptable quality. The calculated metrics were recall 0.718, precision 0.816, and F1-score 0.764 for the intersection, and recall 0.937, precision 0.603, and F1-score 0.734 for the union.
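The FoCUS-suitability metrics above depend on only three numbers per combination: the number of ground-truth images, the number of predicted images, and the number of correct predictions. The sketch below recomputes the reported values from those counts and picks, for each view, the combination with the higher F1-score. The helper names are illustrative, and selecting by F1-score is an assumption made for this example, not a rule stated in the study.

```python
from typing import Dict, Tuple

# (ground-truth images, predicted images, correctly predicted images), as reported.
FOCUS_COUNTS: Dict[Tuple[str, str], Tuple[int, int, int]] = {
    ("PLAX", "intersection"): (45235, 28722, 22308),
    ("PLAX", "union"): (45235, 58154, 35647),
    ("PSAX", "intersection"): (51222, 51265, 43199),
    ("PSAX", "union"): (51222, 75663, 49714),
    ("A4C", "intersection"): (28039, 24667, 20134),
    ("A4C", "union"): (28039, 43554, 26278),
}


def metrics_from_counts(n_truth: int, n_pred: int, n_correct: int) -> Tuple[float, float, float]:
    """Return (recall, precision, F1), each rounded to three decimals."""
    recall = n_correct / n_truth
    precision = n_correct / n_pred
    f1 = 2 * n_correct / (n_truth + n_pred)  # algebraically equal to 2PR / (P + R)
    return round(recall, 3), round(precision, 3), round(f1, 3)


def best_strategy(view: str) -> str:
    """Pick the combination rule with the higher F1-score for a given view."""
    candidates = {
        rule: metrics_from_counts(*counts)[2]
        for (v, rule), counts in FOCUS_COUNTS.items()
        if v == view
    }
    return max(candidates, key=candidates.get)


focus_metrics = {key: metrics_from_counts(*counts) for key, counts in FOCUS_COUNTS.items()}
```

Under this F1-based selection, the union is preferred for the PLAX view and the intersection for the PSAX and A4C views.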
4. Discussion
Securing adequate practice time is essential to improve TTE skills. However, owing to the limited availability of clinical instructors, the demand for educational applications that support independent learning is increasing. The core component of these applications is an AI system capable of evaluating the quality of echocardiographic images and the positioning of the ultrasound probe. In this study, we developed and evaluated an image-assessment AI system using a dataset collected prospectively from healthy volunteers. A two-step framework was used to evaluate the performance of the AI system.
In the first step, a view classification model was developed to classify the images into three standard views: PLAX, PSAX, and A4C. The model achieved a high F1-score of 0.956. In the second step, the system evaluated whether the images depicted optimal cross sections of these three standard views and assessed their quality. The F1-scores of the position evaluation models for each standard view were 0.679 for PLAX, 0.864 for PSAX, and 0.831 for A4C. The F1-scores of the quality evaluation models were 0.674 for PLAX, 0.845 for PSAX, and 0.746 for A4C. Inference methods that integrate the position evaluation model with the quality evaluation model were also explored. In the PLAX view, the union of the position and quality evaluation models yielded a higher F1-score (0.714) than the position model alone (0.679). Conversely, in the PSAX and A4C views, the position evaluation model alone outperformed both the union and intersection combinations of the two models. Notably, the F1-score for the PLAX view using the position evaluation model alone (0.679) was more than 0.15 lower than the scores for the PSAX and A4C views (0.864 and 0.831, respectively). These results indicate that combining the position and quality evaluation models improved the F1-score of the PLAX view, thereby compensating for its relatively lower baseline performance.
The position and quality evaluation models were also integrated to identify images usable for FoCUS. When the target was every optimally positioned image regardless of quality, the best-performing combinations yielded F1-scores of 0.714 for the PLAX view (union), 0.835 for the PSAX view (intersection), and 0.818 for the A4C view (union). When the target was restricted to optimally positioned images of best or acceptable quality, the corresponding best F1-scores were 0.690 for the PLAX view (union), 0.843 for the PSAX view (intersection), and 0.764 for the A4C view (intersection). The proposed position and quality evaluation models and their integration showed adequate accuracy in assessing the appropriate sections.
The aim of this study was to estimate the probe position from echocardiographic images; however, the accuracy of classifying non-optimal cross-sectional views remained relatively low. One contributing factor was the use of multiclass classification within each standard view, involving 15 to 19 different classes. Grouping similar classes based on the degree of deviation, for instance, by combining PLAX_cc1 and PLAX_cc2, may improve the classification performance for non-optimal sections.
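The grouping idea mentioned above can be explored on existing predictions before any retraining. The sketch below assumes that deviation classes are named with a trailing degree index, as in PLAX_cc1 and PLAX_cc2, and merges classes that differ only in that index; the label `PLAX_optimal` and the function names are hypothetical. Scoring merged labels on existing predictions only gives an indication of the possible gain and is not equivalent to training a classifier on the grouped classes.

```python
import re
from typing import Sequence, Tuple


def coarse_label(label: str) -> str:
    """Drop a trailing degree-of-deviation index, e.g. 'PLAX_cc1' -> 'PLAX_cc'."""
    return re.sub(r"\d+$", "", label)


def fine_and_coarse_accuracy(
    true_labels: Sequence[str], pred_labels: Sequence[str]
) -> Tuple[float, float]:
    """Accuracy with the original classes and with classes merged by coarse_label."""
    if len(true_labels) != len(pred_labels) or not true_labels:
        raise ValueError("label sequences must be non-empty and of equal length")
    pairs = list(zip(true_labels, pred_labels))
    fine = sum(t == p for t, p in pairs) / len(pairs)
    coarse = sum(coarse_label(t) == coarse_label(p) for t, p in pairs) / len(pairs)
    return fine, coarse


# Toy labels (illustrative only): one cc1 frame is predicted as cc2.
toy_true = ["PLAX_cc1", "PLAX_cc2", "PLAX_cc1", "PLAX_optimal"]
toy_pred = ["PLAX_cc2", "PLAX_cc2", "PLAX_cc1", "PLAX_optimal"]
toy_fine, toy_coarse = fine_and_coarse_accuracy(toy_true, toy_pred)
```

In the toy example, the single confusion between the two neighboring deviation degrees no longer counts as an error once the classes are merged.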
Additionally, the quality evaluation was based on the depiction of anatomical structures, some of which were very small in the images. This unique characteristic of TTE images, where small anatomical structures significantly influence quality ratings, poses challenges for developing quality evaluation models. The limited training data compared to other studies also restricted the performance. Data augmentation was used to enhance diversity; nonetheless, its impact was insufficient.
The proposed models were designed for use on tablets and smartphones using MobileViTv2_075 for efficient operation in resource-constrained environments. Recent advancements in image recognition, including Vision–Language Models (VLMs) [23,24], may further improve classification accuracy. Among these, Bootstrapping Language–Image Pre-training (BLIP) is a vision–language pre-training framework that supports both understanding and generation tasks [25]. BLIP enhances supervision by generating synthetic captions and filtering out low-quality samples, thereby improving model robustness. Future work may explore the application of VLMs such as BLIP to further improve image assessment performance.
This study makes three major contributions to the literature. First, it is novel because a unique dataset collected from healthy young adults, reflecting practice scenarios among beginner-level trainees, was used for the development of the AI model. Second, the model was trained using a dataset that included a large number of suboptimal images that deviated from optimal cross-sectional views, similar to those beginners often encounter during image acquisition. This deliberate inclusion of non-ideal images has enabled the development of position classifiers capable of evaluating subtle deviations from optimal views, an aspect often overlooked in previous studies. Furthermore, a quality evaluation was conducted based on anatomical visibility to address the unique imaging characteristics of TTE. Third, the proposed framework, which integrates the position and quality models, improved performance for the PLAX view, where the position model alone performed least well.
The performance of the models was rigorously evaluated. Position classifiers achieved strong performance for the PSAX and A4C views, and combining them with quality classifiers improved the results for the PLAX view, which had lower baseline performance. The integration of the position and quality models enabled the identification of images suitable for FoCUS, even without the supervision of an expert. These findings highlight the potential of AI-based systems to support independent simulation-based ultrasound training, particularly in environments with limited access to instructors.