1. Introduction
In recent years, large language models (LLMs) such as OpenAI’s ChatGPT have rapidly expanded the application of natural language processing in the medical field. They have been employed in tasks such as medical record summarization, dialog support, and the automatic generation of radiological findings. With the emergence of GPT-4o, interest in multimodal diagnostic support has grown, particularly in its ability to integrate visual and textual information. For example, in a meta-analysis of 330 studies, Ye et al. concluded that multimodal LLMs, including GPT-4o, have been widely applied for medical report generation and diagnostic support, with notable progress in visual-text integration [
1]. Furthermore, a recent scoping review by Zhou et al. found that vision-language models (VLMs) such as GPT-4V, LLaVA, and Flamingo have been applied to diagnostic tasks by combining radiological images with clinical text, and several studies have reported encouraging results [
2].
By contrast, the application of LLMs and VLMs in dentistry remains in its infancy. Practical investigations using GPT-4o have only emerged since late 2024. On text-based assessments, GPT-4-based models have shown promising results. For example, GPT-4o outperformed GPT-4 on text-based questions from Japan’s 117th National Dental Examination and achieved high performance in academic written examinations, in some cases exceeding the average scores of dental students [
3,
4]. Jaworski et al. reported that GPT-4o achieved 70.85% overall accuracy on 200 text-based questions from the Polish National Dental Examination, though performance was much lower (36.36%) on clinical case-based items compared to general knowledge questions (72.87%) [
5]. However, when GPT-4V was evaluated on image-based questions from the 116th Japanese National Dental Examination—including intraoral and extraoral photographs, panoramic and dental radiographs, CT, MRI, and pathological images—the overall correct response rate was only 35.0%, with particularly low performance (28.6%) on clinical practical questions requiring integration of multiple imaging modalities [
6]. These studies indicate that while current LLMs show promise in text-based dental examinations, VLMs still lack the visual recognition accuracy required for reliable dental image interpretation.
Recently, several multimodal dental imaging applications have been explored. For instance, Dasanayaka et al. proposed methods using GPT-4o for generating findings from panoramic radiographs, showing potential in detecting caries and impacted teeth [
7]. Several studies have explored VLM-based approaches for dental imaging tasks such as detection or classification of dental conditions [
8,
9,
10,
11]. However, these studies highlight the limitations of general-purpose models in visual inference. In the specific domain of cyst diagnosis in panoramic radiographs, which is the focus of this study, prior work has been limited to text-based approaches, such as Silva et al.’s study using GPT-3.5 with textual clinical inputs [
12]. While Dasanayaka et al. [
7] explored finding generation for panoramic radiographs, their study targeted different pathologies (caries and impacted teeth) from those addressed in the present study. Other studies such as those by Pham [
8], Aşar et al. [
9], Liu et al. [
10], and Camlet et al. [
11] have primarily focused on detection or classification of dental conditions using VLMs. This represents a significant gap and challenge in the current literature.
Another challenge in VLMs lies in hallucinations and factual inconsistencies, which remain critical concerns for medical AI applications. For example, Wang et al. conducted a comprehensive investigation into the factuality of LLMs and concluded that strategies such as Chain-of-Thought (CoT) prompting and retrieval-augmented generation (RAG) contribute to suppressing hallucinations, highlighting the importance of reliability in specialized medical domains [
13]. Alkaissi and McFarlane documented erroneous citations and fabricated facts generated by ChatGPT (version not specified in the original study), underscoring the need for careful verification in healthcare contexts [
14]. Chang et al. recently introduced the “MedHEval” benchmark for systematic evaluation and mitigation of hallucination in multimodal medical models, emphasizing that robust output-verification mechanisms are essential for safe deployment of VLMs [
15].
As such, although GPT-based models have demonstrated a certain level of effectiveness in dental diagnostic support, they still face limitations in ensuring spatial consistency, factual accuracy, and comprehensive domain knowledge, and a substantial risk of erroneous outputs remains. Therefore, to enhance the reliability and practicality of VLMs in dentistry, it is necessary to establish a framework that integrates consistency checks with structured information, visual data preprocessing, and hallucination-suppression techniques. In this preliminary study, we developed an integrated framework that functions as an external validation mechanism by harnessing the multimodal capabilities of a GPT-based VLM to support the interpretation of cysts in dental panoramic radiographs. This framework, called Self-correction Loop with Structured Output (SLSO), incorporates both structured-data generation and finding-text generation. The aim of this study is to investigate the feasibility of the proposed SLSO framework as a structured self-correction mechanism designed to improve the consistency and reliability of AI-generated radiological findings.
2. Materials and Methods
2.1. Pilot Experiments
As a preliminary investigation, a series of pilot experiments were conducted to refine the system design. Four phases were examined: (1) direct finding generation from resized panoramic images, (2) manual ROI (Region of Interest) extraction around cysts, (3) incorporation of tooth segmentation and tooth number annotation, and (4) introduction of CoT prompts. These preliminary studies revealed major limitations, including difficulties in tooth number identification, frequent vague expressions, and hallucinations. The insights gained from these investigations motivated the design of the proposed SLSO framework with a self-correction loop, described in detail in the following section. Detailed procedures, example outputs, and phase-specific results are provided in
Supplementary Note S1 of the Supplementary Information.
2.2. Dataset and Ground-Truth Creation
For this study, dental panoramic radiographs of 22 jaw cysts (cases_001–022) were collected at the Asahi University Medical and Dental Center (Mizuho, Gifu, Japan). Each image was annotated with the jaw cyst and tooth margins as well as the Fédération Dentaire Internationale (FDI) number for each tooth. A dental radiologist (T.K.) manually performed all annotations. None of the cases included in this study were histopathologically confirmed. Instead, all diagnoses were established based on radiographic findings and clinical assessment by an experienced oral and maxillofacial radiologist.
For this preliminary study, ground truth annotations were created by a single expert reviewer (T.K., an experienced oral and maxillofacial radiologist). Future studies involving larger and more diverse datasets will implement multi-rater consensus approaches to ensure robust reference standards.
The corresponding ground-truth findings were created by the same radiologist and saved as text files. The number of teeth affected by jaw cysts (number of affected teeth) ranged from one to six, with an average of 2.8.
For each case, the ground-truth findings were standardized to describe the lesion location and extent in the format “from tooth # to tooth #,” and to include the following three mandatory radiographic features: radiolucency, margination, and multilocularity. Furthermore, anatomical effects such as root resorption, tooth displacement, and their relationship with the cortical bone were also described, and the names of likely diseases in the differential diagnosis were stated when necessary. The minimum number of words in the ground-truth findings was 68, the maximum was 201, and the average was 127.3 ± 31.2 words.
All patient data were collected and de-identified at Asahi University with the approval of the Institutional Review Board (Approval No. 32040). Subsequent data analysis was conducted at Gifu University with the approval of the Institutional Review Board (Approval No. 2020-250). All the procedures complied with the principles of the Declaration of Helsinki.
2.3. Structured Output Schema and Image Annotation
Free-form written findings often contain variability and ambiguity, with synonyms such as “clear border,” “clear margin,” and “sharp contour” used interchangeably, and descriptions are prone to omission or hallucination (the inclusion of nonexistent findings). To mitigate these issues, we adopted a structured approach that constrains outputs to predefined options, thereby reducing variability, enabling machine-readable knowledge extraction, enhancing verifiability, and accumulating reusable structured data. Guided by collaboration with a co-author dental radiologist, we defined an explicit schema for the interpretation of items related to jaw cysts. In this study, a schema refers to a structured template that specifies the organization of interpretation categories and constraints on allowable outputs, ensuring consistency and reproducibility. Using this scheme, we classified the interpretation items into seven categories and assigned predefined options with labels, as summarized in
Table 1. Notably, the three-level categorization (‘no/mild/severe’) for root resorption and tooth displacement was adopted to provide a clinically intuitive and reproducible framework, reflecting commonly used radiological assessment practices while maintaining simplicity for structured evaluation.
For the image input, we employed annotated ROI images, including tooth margins and tooth numbers, produced as described in
“Supplementary Note S1 of the Supplementary Information” (see
Phase 3 and Figure S1d for details). These annotations were designed to guide the model in recognizing tooth boundaries and numbers while also providing standardized inputs that ensured consistency between schema-based interpretation and visual data. Representative examples of annotated input images used in this process are shown in
Supplementary Figure S1.
2.4. Overall Flow of Proposed Framework
The image interpretation support framework developed in this study used dental panoramic radiographs as the input, generates structured data and natural language findings regarding jaw cysts in a step-by-step manner, and includes a process for evaluating and correcting the consistency of each output (see
Figure 1 for an overview of the processing flow). A brief outline of the ten sequential steps is provided here, and detailed descriptions, prompt examples, and schema definitions are available in
Supplementary Note S2 of Supplementary Information.
Input: ROI images with annotated tooth margins and numbers, together with interpretation instructions, are provided.
GPT-4o Image Analysis: Multimodal analysis of the ROI image is initiated.
Structured Data Generation: Schema-based structured outputs are generated in JSON format.
Tooth Number Extraction: Affected tooth numbers are extracted directly from the image.
Tooth Number Consistency Check: Structured data and extracted tooth numbers are compared for consistency.
Regenerate Structured Data and Re-extract Tooth Numbers: Structured data and tooth numbers are regenerated when mismatches are detected.
Finding Generation: Radiology findings in natural language are generated from the structured data.
Regenerate Structured Data from Findings: Generated findings are converted back into structured data for verification.
Structured Data Consistency Check: Consistency between restructured and original structured data is checked.
Regenerate Findings: Findings are regenerated if inconsistencies remain, yielding the final output.
2.5. Experimental Setup
We compared the proposed SLSO framework with the conventional CoT method in terms of accuracy and consistency of the radiological findings generated for dental panoramic radiographs. To clarify the statistical evaluation framework, we defined the following null hypothesis: H0: The application of the proposed SLSO framework does not improve the diagnostic performance of the GPT-based VLM compared with the conventional CoT approach. Because this study focuses primarily on methodological feasibility using a limited dataset, this statistical formulation was introduced to support comparative evaluation rather than to perform a definitive hypothesis-driven validation.
2.5.1. Evaluation Criteria
To evaluate the accuracy of the generated findings, all structured items defined in the schema (e.g., location, transparency, internal structure, border, and affected teeth) were used as evaluation criteria. Each item was assigned a score of 0/1 (perfect match or not) for quantitative evaluation.
Score 1; Semantically consistent with the ground-truth findings (same meaning, no discrepancies in description).
Score 0; Inconsistent or missing (misinterpretation of meaning, omission, incorrect description, etc.).
This scoring allowed us to evaluate the accuracy of each schema item for each case and compare the overall performance by calculating the average accuracy. Although this binary exact-match scoring ensured objectivity and reproducibility, it may underestimate outputs that are partially correct or semantically close to the ground truth. More fine-grained evaluation methods such as semantic similarity metrics or expert-based rating scales should be incorporated into future studies to better capture its practical clinical utility.
2.5.2. Evaluation Procedure
Both methods (proposed SLSO framework and direct generation using CoT) were applied to all 22 cases. The score was calculated for each structured item in the generated output. Additionally, a case-based analysis of representative cases was conducted to examine whether the consistency check and regeneration process contributed to the suppression of incorrect descriptions and hallucinations.
2.5.3. Model and API Configuration
The model used in this study was the GPT-4o (gpt-4o-2024-11-20; OpenAI, San Francisco, CA, USA), a multimodal LLM provided by OpenAI. It was adopted because it supports both image and text input and can integrate image interpretation and natural language generation for dental panoramic radiographs.
For inference using the API, a Base64-encoded image together with a text prompt was simultaneously input via the gpt-4o endpoint. The output was structured data in JSON format or radiological findings, depending on the prompt design.
To stabilize the behavior of the model and maintain consistent outputs from the same image, the following parameter settings were used:
Temperature; 0.2 (low to increase output certainty);
Top_p; 1.0 (considering all top values in the probability distribution);
Max_tokens; 2048 (maximum tokens generated);
Frequency_penalty; 0.0 (no suppression of word repetition);
Presence_penalty; 0.0 (no induction of new words).
These settings minimize output fluctuations while achieving high reproducibility and accuracy for structured data and sentences.
2.5.4. Prompt Design
In this study, multiple prompt designs were used depending on the task to improve the accuracy and consistency of finding generation.
- (1)
Prompts for Structured Data Generation
In the structured output-based approach, we predefined a schema containing the interpretation items required for radiological diagnosis and designed English prompts using the Pydantic schema to request precise output in JSON format. The prompt explicitly included criteria for each interpretation item, such as:
“For the cyst in the image, please determine transparency, borders, internal structure, etc., and output the results in the specified JSON format.”
The prompt also clearly stated that the task was a “professional interpretation task for dental radiology diagnosis,” guiding GPT-4o to recognize the medical context before outputting.
- (2)
Prompts for Generating Findings
When generating findings from the structured data, we used Japanese prompts to ensure clinically usable language, such as:
“Based on the following structured data, please output dental radiology findings in a natural, medically sounding format.”
The output format followed the ground-truth style created by dental radiologists, naturally expressing location (e.g., “from no. 45 to no. 47”) and characteristic descriptions (e.g., “a clearly demarcated unilocular translucency”).
- (3)
Output Format
Structured data were output in English using a unified JSON format with values stored as strings or lists. In contrast, the findings were output as natural Japanese sentences because the ground truth was prepared in Japanese. This style is concise and clinically applicable, with sentences ending in a period as the standard.
3. Results
3.1. Overall Performance Comparison
Table 2 summarizes the performance of the two methods for each interpretation category, showing the mean correct answer rate for each item, together with the absolute and relative differences between the SLSO and CoT methods and the corresponding
p-values.
Compared with the CoT method, the structured output method yielded higher accuracy in several interpretation items, most notably tooth number (relative difference of 66.9%), tooth displacement (33.3%), root resorption (28.6%), and relationship with other structures (9.0%). In contrast, clear improvements were not observed for radiolucency, internal structure, and boundary. For these items, correct answer rates were already high (0.864–1.000) for both methods, resulting in similar levels of accuracy.
Statistical tests were performed using the Shapiro–Wilk test to assess normality, followed by paired tests. A p-value < 0.05 was considered statistically significant, and 0.05 ≤ p < 0.10 was regarded as a trend toward significance. Owing to the limited sample size (22 cases), statistical significance was not achieved for any of the evaluation items.
An item-by-item analysis of
Table 2 yielded the following three observations. For high-accuracy items (mean agreement rate > 0.8), namely radiolucency, internal structure, and boundary, both methods showed similarly high performance. For moderate-accuracy items (mean agreement rate 0.4–0.8), namely root resorption and relationship with other structures, the SLSO method consistently showed improvement. For tooth number identification, a low-accuracy item (mean agreement rate < 0.4), the structured method demonstrated a 66.9% relative improvement. This pattern is consistent with the contribution of the consistency-checking mechanism incorporated in the SLSO framework.
3.2. Comparison in Representative Cases
In this section, a detailed analysis of representative cases classified as successful and unsuccessful examples is provided to illustrate the effectiveness and limitations of the proposed method. A successful case was defined as one in which the structured output method achieved a substantial improvement in the average score for each interpretation item compared to the CoT method. Conversely, an unsuccessful case was defined as one in which the SLSO method showed a decrease in the average score for each interpretation item compared with the CoT method. For each case, the following five aspects were evaluated: (1) average score by interpretation item; (2) accuracy of tooth number identification; (3) conciseness and clinical appropriateness of the descriptions; (4) number of regeneration cycles (reflecting the effect of the self-correction loop); and (5) main improvements or common issues.
3.2.1. Successful Case
Figure 2 shows the input image for a successful case.
Table 3 lists the ground-truth finding, the report generated by the CoT method, the report generated by the SLSO method, the item-by-item scores for the two methods, the main improvements, and the number of regeneration cycles.
3.2.2. Failure Case
Figure 3 presents the input image for a failure case, and
Table 4 provides a comparison of the reports and evaluation outcomes for this case.
4. Discussion
This study demonstrated that the proposed SLSO framework improved the consistency and reliability of AI-generated radiological findings for jaw cysts in dental panoramic radiographs.
4.1. Effectiveness of Proposed Method
As noted in the Introduction, Liu et al. [
10] reported substantial performance gaps between detection of radiopaque structures and radiolucent lesions using GPT-4o. Within this context, the present study explored whether a structured integrated processing framework could mitigate known weaknesses of current multimodal models. Given the known instability of direct structured-output prompting in GPT-4o due to medical refusal behavior, the Chain-of-Thought (CoT) method was selected as a realistic and reproducible baseline for comparison.
The proposed SLSO framework demonstrated consistent performance differences compared with the conventional CoT method in the generation of structured findings for jaw cysts on dental panoramic radiographs. Three characteristic technical effects were observed:
- (1)
Improved Tooth Number Accuracy
The most notable difference between the two methods was observed in tooth number accuracy, which increased from 0.136 with the CoT method to 0.227 with the proposed method (+66.9%). This increase suggested a favorable trend associated with the consistency-checking mechanism at the core of the proposed framework. Unlike the CoT approach, which often yielded vague location expressions (e.g., “lower left mandibular molar region”), the structured output method enabled more precise identification (e.g., “tooth #47 and #48”). This effect could be attributed to the self-correction process implemented in Steps 4 (tooth number extraction) and 5 (consistency check). In one representative case, a perfect score of 1.0 was achieved after five regeneration cycles, illustrating the potential value of iterative correction. Although overall accuracy remained limited, these observations suggested that structured, self-correcting approaches might contribute to more reliable localization in dental radiographs.
- (2)
Enforced Negative Findings and Improved Comprehensiveness
The structured schema required explicit “present/absent” judgments for each interpretation category, which promoted documentation of negative findings. This addressed a key weakness of the CoT approach, which frequently included vague expressions (e.g., “appears to…” or “may be…”) and often omitted negative findings (e.g., “no evidence of resorption”). In contrast, the proposed framework enabled explicit and reproducible documentation of negative findings, such as “No evidence of pathological effects such as root resorption or tooth displacement.” This mechanism resulted in higher completeness and internal consistency of the generated outputs, which was reflected in improved scores for root resorption, tooth displacement, and relationships with other structures.
- (3)
Hallucination Suppression Effect
Constraints imposed by the structured format reduced references to nonexistent anatomical structures and logically inconsistent statements. From a technical perspective, this effect reflected the role of schema-based constraints in limiting the generative freedom of LLMs and enforcing clinically valid output spaces. For example, verbose CoT expressions such as “presents a smooth contour” were replaced with clinically appropriate descriptions such as “The internal structure of the lesion is unilocular, with a well-defined, round boundary.”
These findings aligned with and extended the observations by Liu et al. [
10]. While their study evaluated GPT-4o’s detection capabilities using different tooth numbering systems, the present study addressed the complementary challenge of generating internally consistent and verifiable structured findings for radiolucent lesions. Importantly, the observed effects were achieved through an external processing framework rather than through modifications of the underlying vision-language model.
A distinguishing feature of the approach was its design as an integrated processing framework rather than a model modification strategy. By implementing structured output constraints using Pydantic schemas and multi-stage validation loops external to the model, the SLSO framework represented a model-agnostic technical strategy for improving the reliability of generative outputs in radiological reporting.
For example, as shown in
Table 3, the tooth number consistency check detected a discrepancy between the tooth number item in the structured output and the tooth number extracted directly from the image, triggering five automatic regeneration cycles and ultimately correcting the hallucinated description to achieve a perfect category-wise score of 1.00. In contrast, as shown in
Table 4, although one regeneration cycle was triggered by the same mechanism, the correction remained incomplete due to the complexity of the lesion spanning multiple teeth.
4.2. Study Limitations and Challenges
This study had several intrinsic limitations that should be acknowledged. This section addresses both study-level limitations and model/method-level limitations of the proposed framework.
- (1)
Dataset Size and Generalizability
As a preliminary investigation, the dataset comprised only 22 cases from a single institution, limiting its statistical power and external validity. The limited availability of jaw cyst cases with confirmed histopathological diagnoses and expert radiologist annotations constrained further expansion at this stage. Despite these constraints, the dataset provided preliminary evidence suggesting potential improvements through the SLSO framework and demonstrated the technical feasibility of this integrated processing approach for future validation studies. Verification across multiple institutions, imaging devices, and patient populations would be essential to confirm the generalizability and robustness of the proposed approach in larger-scale studies. In addition, because annotated ROI images were used as inputs in this study, the applicability of the framework to full panoramic radiographs without manual localization remains an important topic for future investigation. Future studies may integrate automated lesion detection or tooth localization models with the proposed framework to enable a fully automated interpretation pipeline. Several deep learning-based approaches have been proposed for automated jaw cyst detection [
16,
17] and segmentation [
18,
19] in dental radiographs, as well as for tooth number identification [
20], and integration with such methods represents a promising direction for future work.
- (2)
Complex Anatomical and Pathological Cases
In the case of maxillary anterior failure (
Table 4), the structured approach underperformed compared with CoT (0.43 vs. 0.57). This case involved extensive lesions spanning multiple teeth, complex anatomical relationships (e.g., nasal cavity floor and incisive canal), and subtle changes (e.g., mild resorption). Structural schema options (e.g., unilocular/multilocular) were insufficient to capture this complexity, and rigid structuring sometimes hindered appropriate descriptions. Future work will involve larger-scale analysis of failure patterns using expanded datasets to better characterize systematic sources of error, such as tooth number misidentification, difficulties in interpreting lesions spanning multiple teeth, and challenges in detecting subtle anatomical changes. Future improvements may involve integrating the proposed framework with models that provide stronger spatial localization capabilities. For example, combining the VLM-based interpretation framework with conventional deep-learning–based lesion detection or tooth localization models, or replacing the underlying VLM with newer models that offer improved spatial reasoning, may help address these limitations.
- (3)
Limits of Visual Recognition
For persistently low-scoring items, such as tooth number (0.227) and tooth displacement (0.364), the inherent visual recognition capabilities of the GPT-4o were considered a limiting factor. In one failure case involving an extensive lesion spanning multiple teeth (
Figure 3,
Table 4), the AI output produced a more restricted range (“around teeth #11 and #12 in the maxillary anterior”) compared to the ground truth (“apical region from right maxillary canine #3 to left maxillary canine #3”). While mild root resorption was detectable in other cases, neither method described subtle changes such as “mild resorption is also suspected” or “nasal cavity floor elevation is also suspected.” These findings suggested that, in cases involving multiple complex anatomical structures, detecting subtle individual changes became even more challenging.
4.3. Experimental Technical Constraints
In addition to the study-level limitations, several technical issues arose during the comparative experiments.
- (1)
Difficulty in Direct Structured Output Generation
In the initial phase of this study, we aimed to directly compare the proposed method with a simple approach that generates the same structured output format without a self-correction loop. However, unexpected technical challenges were encountered during these experiments. When GPT-4o was prompted to produce structured outputs directly (e.g., “Evaluate the cyst and output in JSON format”), it frequently generated refusal responses (e.g., “I cannot provide a diagnosis because I am not a doctor”), making it difficult to obtain stable comparative data. This behavior appeared to stem from GPT-4o’s safety mechanisms that automatically reject medical diagnosis requests. Although reframing the context (e.g., “professional dental radiology interpretation task”) or replacing “diagnosis” with “image analysis” could reduce refusals, such adjustments deviated from the notion of a “simple” structured output baseline.
- (2)
Choice of Comparison Method
For the reasons stated above, the CoT method was adopted for comparison. CoT was a well-established approach to medical AI and its contrast to the SLSO framework offered valuable insights. Future research should explore minimally modified strategies to circumvent refusal errors and conduct more direct comparative experiments. This would enable a clearer separation of the effects of the structured output from the additional benefits of the self-correction loop.
- (3)
Evaluation Method
The binary 0/1 scoring scheme was adopted to ensure objective and reproducible comparison by focusing on exact semantic consistency with ground-truth findings. However, this approach may underestimate partially correct outputs or clinically acceptable interpretations. Future studies may incorporate more nuanced evaluation approaches, such as semantic similarity metrics and multi-expert clinical scoring, to better reflect the practical utility of AI-generated radiological interpretations.
4.4. Implications for Clinical Application and Future Perspectives
- (1)
Role in Collaboration with Specialists
This study suggested that AI-assisted diagnostic systems should function as supportive tools rather than as a replacement for specialists. Final confirmation by experts remained essential, particularly for complex anatomical interpretations, subtle or rare findings, and judgments requiring a clinical context. A stepwise introduction was considered effective, starting with support for clear and typical cases, moving to preliminary reading assistance under specialist confirmation, and eventually extending to educational use such as training residents. The structured nature of the proposed framework might facilitate such collaboration by providing transparent and verifiable intermediate outputs that could be reviewed by clinicians. Although the current performance remains limited for some categories, the proposed framework demonstrates potential for improving the reliability and consistency of AI-assisted radiological reporting rather than replacing expert interpretation.
- (2)
Study Comparison and Originality
This preliminary study was among the first to address cyst diagnostic support using a multimodal approach with direct image input, extending beyond text-based approaches reported previously [
12]. Rather than improving model performance through fine-tuning or architectural modification, the present work emphasized a framework-level contribution that complemented existing hallucination-suppression strategies such as RAG and CoT. Although the iterative design shared similarities with the Self-Refine framework [
16], the inclusion of structured consistency checks distinguished the proposed method and aligned it with the strict requirements of medical imaging. Unlike the Self-Refine framework [
21], which performs iterative refinement primarily at the text level, the proposed SLSO framework introduces structured intermediate representations that integrate image interpretation, structured data generation, and natural language findings, enabling consistency checks across different representation formats.
- (3)
Future Technical Improvements and Expansion
Future work should involve larger datasets, specialist-in-the-loop evaluations, and semantic-level metrics to better capture clinical utility. Beyond jaw cysts, the proposed framework could be extended to other dental diseases and broader medical imaging domains, particularly when structured outputs could be aligned with domain-specific diagnostic criteria. The remaining challenges included reducing costs and processing time, and integration into clinical workflows.
- (4)
Data Privacy and Deployment Considerations
In the present study, GPT-4o was accessed through a cloud-based API environment. For actual clinical deployment, however, the handling of medical images through external cloud services raises important issues related to patient privacy, data governance, and regulatory compliance. In many jurisdictions, the transmission of medical images to external servers may fall under regulations related to the protection of personal medical information, and strict institutional policies are often required.
In addition, in maxillofacial imaging, anatomical structures themselves may potentially serve as biometric identifiers, making complete anonymization challenging. Therefore, future clinical implementations will likely require locally deployable vision–language models operating within secure on-premises environments inside medical institutions. Such architectures would enable inference processing without transmitting medical images outside institutional networks, thereby addressing privacy and legal concerns.
The present study focused on methodological feasibility rather than clinical deployment; however, these regulatory and privacy considerations will be important topics for future research and system development.
- (5)
Model-Agnostic Design and Robustness to Future Model Evolution
Although GPT-4o was used in the present study, the primary objective was not to evaluate a specific model version but to demonstrate a generalized framework for improving the reliability of AI-generated radiological findings. Handler et al. recently discussed in Nature Medicine that even advanced generative models such as GPT-5 remained inherently fragile in medical applications, lacking intrinsic mechanisms for error detection and self-verification [
22]. These limitations were unlikely to be resolved by model scaling alone in the near term. In this context, the proposed SLSO framework provided a practical pathway for enhancing robustness independently of model scaling and might serve as a foundation for future agent-based AI systems incorporating multiple reasoning and verification components.