Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with a Structured Output (SLSO) Framework

Hosokawa, Nanaka; Takahashi, Ryo; Kitano, Tomoya; Iida, Yukihiro; Muramatsu, Chisako; Hayashi, Tatsuro; Seino, Yuta; Zhou, Xiangrong; Hara, Takeshi; Katsumata, Akitoshi; Fujita, Hiroshi

doi:10.3390/diagnostics16071096

Open AccessArticle

Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with a Structured Output (SLSO) Framework

by

Nanaka Hosokawa

^1,*,

Ryo Takahashi

²

,

Tomoya Kitano

³

,

Yukihiro Iida

³

,

Chisako Muramatsu

⁴,

Tatsuro Hayashi

²,

Yuta Seino

⁵

,

Xiangrong Zhou

⁶

,

Takeshi Hara

⁶

,

Akitoshi Katsumata

³ and

Hiroshi Fujita

^6,*

¹

Graduate School of Natural Science and Technology, Gifu University, Gifu 501-1193, Japan

²

EyeTech Co., Ltd., Tokyo 113-0033, Japan

³

School of Dentistry, Asahi University, Mizuho 501-0296, Japan

⁴

Faculty of Data Science, Shiga University, Hikone 522-8522, Japan

⁵

Division for Oral Dental Informatics, The University of Osaka Dental Hospital, Osaka 565-0871, Japan

⁶

Faculty of Engineering, Gifu University, Gifu 501-1193, Japan

^*

Authors to whom correspondence should be addressed.

Diagnostics 2026, 16(7), 1096; https://doi.org/10.3390/diagnostics16071096

Submission received: 17 February 2026 / Revised: 30 March 2026 / Accepted: 3 April 2026 / Published: 5 April 2026

(This article belongs to the Special Issue Application of Artificial Intelligence to Oral Diseases)

Download

Browse Figures

Versions Notes

Abstract

Background/Objectives: Vision-language models (VLMs) such as GPT (Generative Pre-Trained Transformer) have shown potential for medical image interpretation; however, challenges remain in generating reliable radiological findings in clinical practice, as exemplified by dental pathologies. This study proposes a Self-correction Loop with Structured Output (SLSO) framework as an integrated processing methodology to enhance the accuracy and reliability of AI-generated findings for jaw cysts in dental panoramic radiographs. Methods: Dental panoramic radiographs with jaw cysts were used to implement a 10-step integrated processing framework incorporating image analysis, structured data generation, tooth number extraction, consistency checking, and iterative regeneration. The framework functioned as an external validation mechanism for GPT outputs. Performance was compared against the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth displacement, relationships with other structures, and tooth number. Results: The SLSO framework improved output accuracy for multiple items compared to the CoT method, with the most notable improvements observed in tooth number identification, tooth displacement detection, and root resorption assessment. In successful cases, consistently structured outputs were achieved after up to five regenerations. The framework enforced explicit negative finding descriptions and suppressed hallucinations, although accurate identification of extensive lesions spanning multiple teeth remained limited. Conclusions: This investigation established the feasibility of the proposed integrated processing methodology and provided a foundation for future validation studies with larger, more diverse datasets.

Keywords:

dental panoramic radiograph; vision-language model; GPT; structured data generation; iterative self-correction; hallucination suppression

Graphical Abstract

1. Introduction

In recent years, large language models (LLMs) such as OpenAI’s ChatGPT have rapidly expanded the application of natural language processing in the medical field. They have been employed in tasks such as medical record summarization, dialog support, and the automatic generation of radiological findings. With the emergence of GPT-4o, interest in multimodal diagnostic support has grown, particularly in its ability to integrate visual and textual information. For example, in a meta-analysis of 330 studies, Ye et al. concluded that multimodal LLMs, including GPT-4o, have been widely applied for medical report generation and diagnostic support, with notable progress in visual-text integration [1]. Furthermore, a recent scoping review by Zhou et al. found that vision-language models (VLMs) such as GPT-4V, LLaVA, and Flamingo have been applied to diagnostic tasks by combining radiological images with clinical text, and several studies have reported encouraging results [2].

By contrast, the application of LLMs and VLMs in dentistry remains in its infancy. Practical investigations using GPT-4o have only emerged since late 2024. On text-based assessments, GPT-4-based models have shown promising results. For example, GPT-4o outperformed GPT-4 on text-based questions from Japan’s 117th National Dental Examination and achieved high performance in academic written examinations, in some cases exceeding the average scores of dental students [3,4]. Jaworski et al. reported that GPT-4o achieved 70.85% overall accuracy on 200 text-based questions from the Polish National Dental Examination, though performance was much lower (36.36%) on clinical case-based items compared to general knowledge questions (72.87%) [5]. However, when GPT-4V was evaluated on image-based questions from the 116th Japanese National Dental Examination—including intraoral and extraoral photographs, panoramic and dental radiographs, CT, MRI, and pathological images—the overall correct response rate was only 35.0%, with particularly low performance (28.6%) on clinical practical questions requiring integration of multiple imaging modalities [6]. These studies indicate that while current LLMs show promise in text-based dental examinations, VLMs still lack the visual recognition accuracy required for reliable dental image interpretation.

Recently, several multimodal dental imaging applications have been explored. For instance, Dasanayaka et al. proposed methods using GPT-4o for generating findings from panoramic radiographs, showing potential in detecting caries and impacted teeth [7]. Several studies have explored VLM-based approaches for dental imaging tasks such as detection or classification of dental conditions [8,9,10,11]. However, these studies highlight the limitations of general-purpose models in visual inference. In the specific domain of cyst diagnosis in panoramic radiographs, which is the focus of this study, prior work has been limited to text-based approaches, such as Silva et al.’s study using GPT-3.5 with textual clinical inputs [12]. While Dasanayaka et al. [7] explored finding generation for panoramic radiographs, their study targeted different pathologies (caries and impacted teeth) from those addressed in the present study. Other studies such as those by Pham [8], Aşar et al. [9], Liu et al. [10], and Camlet et al. [11] have primarily focused on detection or classification of dental conditions using VLMs. This represents a significant gap and challenge in the current literature.

Another challenge in VLMs lies in hallucinations and factual inconsistencies, which remain critical concerns for medical AI applications. For example, Wang et al. conducted a comprehensive investigation into the factuality of LLMs and concluded that strategies such as Chain-of-Thought (CoT) prompting and retrieval-augmented generation (RAG) contribute to suppressing hallucinations, highlighting the importance of reliability in specialized medical domains [13]. Alkaissi and McFarlane documented erroneous citations and fabricated facts generated by ChatGPT (version not specified in the original study), underscoring the need for careful verification in healthcare contexts [14]. Chang et al. recently introduced the “MedHEval” benchmark for systematic evaluation and mitigation of hallucination in multimodal medical models, emphasizing that robust output-verification mechanisms are essential for safe deployment of VLMs [15].

As such, although GPT-based models have demonstrated a certain level of effectiveness in dental diagnostic support, they still face limitations in ensuring spatial consistency, factual accuracy, and comprehensive domain knowledge, and a substantial risk of erroneous outputs remains. Therefore, to enhance the reliability and practicality of VLMs in dentistry, it is necessary to establish a framework that integrates consistency checks with structured information, visual data preprocessing, and hallucination-suppression techniques. In this preliminary study, we developed an integrated framework that functions as an external validation mechanism by harnessing the multimodal capabilities of a GPT-based VLM to support the interpretation of cysts in dental panoramic radiographs. This framework, called Self-correction Loop with Structured Output (SLSO), incorporates both structured-data generation and finding-text generation. The aim of this study is to investigate the feasibility of the proposed SLSO framework as a structured self-correction mechanism designed to improve the consistency and reliability of AI-generated radiological findings.

2. Materials and Methods

2.1. Pilot Experiments

As a preliminary investigation, a series of pilot experiments were conducted to refine the system design. Four phases were examined: (1) direct finding generation from resized panoramic images, (2) manual ROI (Region of Interest) extraction around cysts, (3) incorporation of tooth segmentation and tooth number annotation, and (4) introduction of CoT prompts. These preliminary studies revealed major limitations, including difficulties in tooth number identification, frequent vague expressions, and hallucinations. The insights gained from these investigations motivated the design of the proposed SLSO framework with a self-correction loop, described in detail in the following section. Detailed procedures, example outputs, and phase-specific results are provided in Supplementary Note S1 of the Supplementary Information.

2.2. Dataset and Ground-Truth Creation

For this study, dental panoramic radiographs of 22 jaw cysts (cases_001–022) were collected at the Asahi University Medical and Dental Center (Mizuho, Gifu, Japan). Each image was annotated with the jaw cyst and tooth margins as well as the Fédération Dentaire Internationale (FDI) number for each tooth. A dental radiologist (T.K.) manually performed all annotations. None of the cases included in this study were histopathologically confirmed. Instead, all diagnoses were established based on radiographic findings and clinical assessment by an experienced oral and maxillofacial radiologist.

For this preliminary study, ground truth annotations were created by a single expert reviewer (T.K., an experienced oral and maxillofacial radiologist). Future studies involving larger and more diverse datasets will implement multi-rater consensus approaches to ensure robust reference standards.

The corresponding ground-truth findings were created by the same radiologist and saved as text files. The number of teeth affected by jaw cysts (number of affected teeth) ranged from one to six, with an average of 2.8.

For each case, the ground-truth findings were standardized to describe the lesion location and extent in the format “from tooth # to tooth #,” and to include the following three mandatory radiographic features: radiolucency, margination, and multilocularity. Furthermore, anatomical effects such as root resorption, tooth displacement, and their relationship with the cortical bone were also described, and the names of likely diseases in the differential diagnosis were stated when necessary. The minimum number of words in the ground-truth findings was 68, the maximum was 201, and the average was 127.3 ± 31.2 words.

All patient data were collected and de-identified at Asahi University with the approval of the Institutional Review Board (Approval No. 32040). Subsequent data analysis was conducted at Gifu University with the approval of the Institutional Review Board (Approval No. 2020-250). All the procedures complied with the principles of the Declaration of Helsinki.

2.3. Structured Output Schema and Image Annotation

Free-form written findings often contain variability and ambiguity, with synonyms such as “clear border,” “clear margin,” and “sharp contour” used interchangeably, and descriptions are prone to omission or hallucination (the inclusion of nonexistent findings). To mitigate these issues, we adopted a structured approach that constrains outputs to predefined options, thereby reducing variability, enabling machine-readable knowledge extraction, enhancing verifiability, and accumulating reusable structured data. Guided by collaboration with a co-author dental radiologist, we defined an explicit schema for the interpretation of items related to jaw cysts. In this study, a schema refers to a structured template that specifies the organization of interpretation categories and constraints on allowable outputs, ensuring consistency and reproducibility. Using this scheme, we classified the interpretation items into seven categories and assigned predefined options with labels, as summarized in Table 1. Notably, the three-level categorization (‘no/mild/severe’) for root resorption and tooth displacement was adopted to provide a clinically intuitive and reproducible framework, reflecting commonly used radiological assessment practices while maintaining simplicity for structured evaluation.

For the image input, we employed annotated ROI images, including tooth margins and tooth numbers, produced as described in “Supplementary Note S1 of the Supplementary Information” (see Phase 3 and Figure S1d for details). These annotations were designed to guide the model in recognizing tooth boundaries and numbers while also providing standardized inputs that ensured consistency between schema-based interpretation and visual data. Representative examples of annotated input images used in this process are shown in Supplementary Figure S1.

2.4. Overall Flow of Proposed Framework

The image interpretation support framework developed in this study used dental panoramic radiographs as the input, generates structured data and natural language findings regarding jaw cysts in a step-by-step manner, and includes a process for evaluating and correcting the consistency of each output (see Figure 1 for an overview of the processing flow). A brief outline of the ten sequential steps is provided here, and detailed descriptions, prompt examples, and schema definitions are available in Supplementary Note S2 of Supplementary Information.

Input: ROI images with annotated tooth margins and numbers, together with interpretation instructions, are provided.
GPT-4o Image Analysis: Multimodal analysis of the ROI image is initiated.
Structured Data Generation: Schema-based structured outputs are generated in JSON format.
Tooth Number Extraction: Affected tooth numbers are extracted directly from the image.
Tooth Number Consistency Check: Structured data and extracted tooth numbers are compared for consistency.
Regenerate Structured Data and Re-extract Tooth Numbers: Structured data and tooth numbers are regenerated when mismatches are detected.
Finding Generation: Radiology findings in natural language are generated from the structured data.
Regenerate Structured Data from Findings: Generated findings are converted back into structured data for verification.
Structured Data Consistency Check: Consistency between restructured and original structured data is checked.
Regenerate Findings: Findings are regenerated if inconsistencies remain, yielding the final output.

2.5. Experimental Setup

We compared the proposed SLSO framework with the conventional CoT method in terms of accuracy and consistency of the radiological findings generated for dental panoramic radiographs. To clarify the statistical evaluation framework, we defined the following null hypothesis: H0: The application of the proposed SLSO framework does not improve the diagnostic performance of the GPT-based VLM compared with the conventional CoT approach. Because this study focuses primarily on methodological feasibility using a limited dataset, this statistical formulation was introduced to support comparative evaluation rather than to perform a definitive hypothesis-driven validation.

2.5.1. Evaluation Criteria

To evaluate the accuracy of the generated findings, all structured items defined in the schema (e.g., location, transparency, internal structure, border, and affected teeth) were used as evaluation criteria. Each item was assigned a score of 0/1 (perfect match or not) for quantitative evaluation.

Score 1; Semantically consistent with the ground-truth findings (same meaning, no discrepancies in description).
Score 0; Inconsistent or missing (misinterpretation of meaning, omission, incorrect description, etc.).

This scoring allowed us to evaluate the accuracy of each schema item for each case and compare the overall performance by calculating the average accuracy. Although this binary exact-match scoring ensured objectivity and reproducibility, it may underestimate outputs that are partially correct or semantically close to the ground truth. More fine-grained evaluation methods such as semantic similarity metrics or expert-based rating scales should be incorporated into future studies to better capture its practical clinical utility.

2.5.2. Evaluation Procedure

Both methods (proposed SLSO framework and direct generation using CoT) were applied to all 22 cases. The score was calculated for each structured item in the generated output. Additionally, a case-based analysis of representative cases was conducted to examine whether the consistency check and regeneration process contributed to the suppression of incorrect descriptions and hallucinations.

2.5.3. Model and API Configuration

The model used in this study was the GPT-4o (gpt-4o-2024-11-20; OpenAI, San Francisco, CA, USA), a multimodal LLM provided by OpenAI. It was adopted because it supports both image and text input and can integrate image interpretation and natural language generation for dental panoramic radiographs.

For inference using the API, a Base64-encoded image together with a text prompt was simultaneously input via the gpt-4o endpoint. The output was structured data in JSON format or radiological findings, depending on the prompt design.

To stabilize the behavior of the model and maintain consistent outputs from the same image, the following parameter settings were used:

Temperature; 0.2 (low to increase output certainty);
Top_p; 1.0 (considering all top values in the probability distribution);
Max_tokens; 2048 (maximum tokens generated);
Frequency_penalty; 0.0 (no suppression of word repetition);
Presence_penalty; 0.0 (no induction of new words).

These settings minimize output fluctuations while achieving high reproducibility and accuracy for structured data and sentences.

2.5.4. Prompt Design

In this study, multiple prompt designs were used depending on the task to improve the accuracy and consistency of finding generation.

(1): Prompts for Structured Data Generation

In the structured output-based approach, we predefined a schema containing the interpretation items required for radiological diagnosis and designed English prompts using the Pydantic schema to request precise output in JSON format. The prompt explicitly included criteria for each interpretation item, such as:

“For the cyst in the image, please determine transparency, borders, internal structure, etc., and output the results in the specified JSON format.”

The prompt also clearly stated that the task was a “professional interpretation task for dental radiology diagnosis,” guiding GPT-4o to recognize the medical context before outputting.

(2): Prompts for Generating Findings

When generating findings from the structured data, we used Japanese prompts to ensure clinically usable language, such as:

“Based on the following structured data, please output dental radiology findings in a natural, medically sounding format.”

The output format followed the ground-truth style created by dental radiologists, naturally expressing location (e.g., “from no. 45 to no. 47”) and characteristic descriptions (e.g., “a clearly demarcated unilocular translucency”).

(3): Output Format

Structured data were output in English using a unified JSON format with values stored as strings or lists. In contrast, the findings were output as natural Japanese sentences because the ground truth was prepared in Japanese. This style is concise and clinically applicable, with sentences ending in a period as the standard.

3. Results

3.1. Overall Performance Comparison

Table 2 summarizes the performance of the two methods for each interpretation category, showing the mean correct answer rate for each item, together with the absolute and relative differences between the SLSO and CoT methods and the corresponding p-values.

Compared with the CoT method, the structured output method yielded higher accuracy in several interpretation items, most notably tooth number (relative difference of 66.9%), tooth displacement (33.3%), root resorption (28.6%), and relationship with other structures (9.0%). In contrast, clear improvements were not observed for radiolucency, internal structure, and boundary. For these items, correct answer rates were already high (0.864–1.000) for both methods, resulting in similar levels of accuracy.

Statistical tests were performed using the Shapiro–Wilk test to assess normality, followed by paired tests. A p-value < 0.05 was considered statistically significant, and 0.05 ≤ p < 0.10 was regarded as a trend toward significance. Owing to the limited sample size (22 cases), statistical significance was not achieved for any of the evaluation items.

An item-by-item analysis of Table 2 yielded the following three observations. For high-accuracy items (mean agreement rate > 0.8), namely radiolucency, internal structure, and boundary, both methods showed similarly high performance. For moderate-accuracy items (mean agreement rate 0.4–0.8), namely root resorption and relationship with other structures, the SLSO method consistently showed improvement. For tooth number identification, a low-accuracy item (mean agreement rate < 0.4), the structured method demonstrated a 66.9% relative improvement. This pattern is consistent with the contribution of the consistency-checking mechanism incorporated in the SLSO framework.

3.2. Comparison in Representative Cases

In this section, a detailed analysis of representative cases classified as successful and unsuccessful examples is provided to illustrate the effectiveness and limitations of the proposed method. A successful case was defined as one in which the structured output method achieved a substantial improvement in the average score for each interpretation item compared to the CoT method. Conversely, an unsuccessful case was defined as one in which the SLSO method showed a decrease in the average score for each interpretation item compared with the CoT method. For each case, the following five aspects were evaluated: (1) average score by interpretation item; (2) accuracy of tooth number identification; (3) conciseness and clinical appropriateness of the descriptions; (4) number of regeneration cycles (reflecting the effect of the self-correction loop); and (5) main improvements or common issues.

3.2.1. Successful Case

Figure 2 shows the input image for a successful case. Table 3 lists the ground-truth finding, the report generated by the CoT method, the report generated by the SLSO method, the item-by-item scores for the two methods, the main improvements, and the number of regeneration cycles.

3.2.2. Failure Case

Figure 3 presents the input image for a failure case, and Table 4 provides a comparison of the reports and evaluation outcomes for this case.

4. Discussion

This study demonstrated that the proposed SLSO framework improved the consistency and reliability of AI-generated radiological findings for jaw cysts in dental panoramic radiographs.

4.1. Effectiveness of Proposed Method

As noted in the Introduction, Liu et al. [10] reported substantial performance gaps between detection of radiopaque structures and radiolucent lesions using GPT-4o. Within this context, the present study explored whether a structured integrated processing framework could mitigate known weaknesses of current multimodal models. Given the known instability of direct structured-output prompting in GPT-4o due to medical refusal behavior, the Chain-of-Thought (CoT) method was selected as a realistic and reproducible baseline for comparison.

The proposed SLSO framework demonstrated consistent performance differences compared with the conventional CoT method in the generation of structured findings for jaw cysts on dental panoramic radiographs. Three characteristic technical effects were observed:

(1): Improved Tooth Number Accuracy

The most notable difference between the two methods was observed in tooth number accuracy, which increased from 0.136 with the CoT method to 0.227 with the proposed method (+66.9%). This increase suggested a favorable trend associated with the consistency-checking mechanism at the core of the proposed framework. Unlike the CoT approach, which often yielded vague location expressions (e.g., “lower left mandibular molar region”), the structured output method enabled more precise identification (e.g., “tooth #47 and #48”). This effect could be attributed to the self-correction process implemented in Steps 4 (tooth number extraction) and 5 (consistency check). In one representative case, a perfect score of 1.0 was achieved after five regeneration cycles, illustrating the potential value of iterative correction. Although overall accuracy remained limited, these observations suggested that structured, self-correcting approaches might contribute to more reliable localization in dental radiographs.

(2): Enforced Negative Findings and Improved Comprehensiveness

The structured schema required explicit “present/absent” judgments for each interpretation category, which promoted documentation of negative findings. This addressed a key weakness of the CoT approach, which frequently included vague expressions (e.g., “appears to…” or “may be…”) and often omitted negative findings (e.g., “no evidence of resorption”). In contrast, the proposed framework enabled explicit and reproducible documentation of negative findings, such as “No evidence of pathological effects such as root resorption or tooth displacement.” This mechanism resulted in higher completeness and internal consistency of the generated outputs, which was reflected in improved scores for root resorption, tooth displacement, and relationships with other structures.

(3): Hallucination Suppression Effect

Constraints imposed by the structured format reduced references to nonexistent anatomical structures and logically inconsistent statements. From a technical perspective, this effect reflected the role of schema-based constraints in limiting the generative freedom of LLMs and enforcing clinically valid output spaces. For example, verbose CoT expressions such as “presents a smooth contour” were replaced with clinically appropriate descriptions such as “The internal structure of the lesion is unilocular, with a well-defined, round boundary.”

These findings aligned with and extended the observations by Liu et al. [10]. While their study evaluated GPT-4o’s detection capabilities using different tooth numbering systems, the present study addressed the complementary challenge of generating internally consistent and verifiable structured findings for radiolucent lesions. Importantly, the observed effects were achieved through an external processing framework rather than through modifications of the underlying vision-language model.

A distinguishing feature of the approach was its design as an integrated processing framework rather than a model modification strategy. By implementing structured output constraints using Pydantic schemas and multi-stage validation loops external to the model, the SLSO framework represented a model-agnostic technical strategy for improving the reliability of generative outputs in radiological reporting.

For example, as shown in Table 3, the tooth number consistency check detected a discrepancy between the tooth number item in the structured output and the tooth number extracted directly from the image, triggering five automatic regeneration cycles and ultimately correcting the hallucinated description to achieve a perfect category-wise score of 1.00. In contrast, as shown in Table 4, although one regeneration cycle was triggered by the same mechanism, the correction remained incomplete due to the complexity of the lesion spanning multiple teeth.

4.2. Study Limitations and Challenges

This study had several intrinsic limitations that should be acknowledged. This section addresses both study-level limitations and model/method-level limitations of the proposed framework.

(1): Dataset Size and Generalizability

As a preliminary investigation, the dataset comprised only 22 cases from a single institution, limiting its statistical power and external validity. The limited availability of jaw cyst cases with confirmed histopathological diagnoses and expert radiologist annotations constrained further expansion at this stage. Despite these constraints, the dataset provided preliminary evidence suggesting potential improvements through the SLSO framework and demonstrated the technical feasibility of this integrated processing approach for future validation studies. Verification across multiple institutions, imaging devices, and patient populations would be essential to confirm the generalizability and robustness of the proposed approach in larger-scale studies. In addition, because annotated ROI images were used as inputs in this study, the applicability of the framework to full panoramic radiographs without manual localization remains an important topic for future investigation. Future studies may integrate automated lesion detection or tooth localization models with the proposed framework to enable a fully automated interpretation pipeline. Several deep learning-based approaches have been proposed for automated jaw cyst detection [16,17] and segmentation [18,19] in dental radiographs, as well as for tooth number identification [20], and integration with such methods represents a promising direction for future work.

(2): Complex Anatomical and Pathological Cases

In the case of maxillary anterior failure (Table 4), the structured approach underperformed compared with CoT (0.43 vs. 0.57). This case involved extensive lesions spanning multiple teeth, complex anatomical relationships (e.g., nasal cavity floor and incisive canal), and subtle changes (e.g., mild resorption). Structural schema options (e.g., unilocular/multilocular) were insufficient to capture this complexity, and rigid structuring sometimes hindered appropriate descriptions. Future work will involve larger-scale analysis of failure patterns using expanded datasets to better characterize systematic sources of error, such as tooth number misidentification, difficulties in interpreting lesions spanning multiple teeth, and challenges in detecting subtle anatomical changes. Future improvements may involve integrating the proposed framework with models that provide stronger spatial localization capabilities. For example, combining the VLM-based interpretation framework with conventional deep-learning–based lesion detection or tooth localization models, or replacing the underlying VLM with newer models that offer improved spatial reasoning, may help address these limitations.

(3): Limits of Visual Recognition

For persistently low-scoring items, such as tooth number (0.227) and tooth displacement (0.364), the inherent visual recognition capabilities of the GPT-4o were considered a limiting factor. In one failure case involving an extensive lesion spanning multiple teeth (Figure 3, Table 4), the AI output produced a more restricted range (“around teeth #11 and #12 in the maxillary anterior”) compared to the ground truth (“apical region from right maxillary canine #3 to left maxillary canine #3”). While mild root resorption was detectable in other cases, neither method described subtle changes such as “mild resorption is also suspected” or “nasal cavity floor elevation is also suspected.” These findings suggested that, in cases involving multiple complex anatomical structures, detecting subtle individual changes became even more challenging.

4.3. Experimental Technical Constraints

In addition to the study-level limitations, several technical issues arose during the comparative experiments.

(1): Difficulty in Direct Structured Output Generation

In the initial phase of this study, we aimed to directly compare the proposed method with a simple approach that generates the same structured output format without a self-correction loop. However, unexpected technical challenges were encountered during these experiments. When GPT-4o was prompted to produce structured outputs directly (e.g., “Evaluate the cyst and output in JSON format”), it frequently generated refusal responses (e.g., “I cannot provide a diagnosis because I am not a doctor”), making it difficult to obtain stable comparative data. This behavior appeared to stem from GPT-4o’s safety mechanisms that automatically reject medical diagnosis requests. Although reframing the context (e.g., “professional dental radiology interpretation task”) or replacing “diagnosis” with “image analysis” could reduce refusals, such adjustments deviated from the notion of a “simple” structured output baseline.

(2): Choice of Comparison Method

For the reasons stated above, the CoT method was adopted for comparison. CoT was a well-established approach to medical AI and its contrast to the SLSO framework offered valuable insights. Future research should explore minimally modified strategies to circumvent refusal errors and conduct more direct comparative experiments. This would enable a clearer separation of the effects of the structured output from the additional benefits of the self-correction loop.

(3): Evaluation Method

The binary 0/1 scoring scheme was adopted to ensure objective and reproducible comparison by focusing on exact semantic consistency with ground-truth findings. However, this approach may underestimate partially correct outputs or clinically acceptable interpretations. Future studies may incorporate more nuanced evaluation approaches, such as semantic similarity metrics and multi-expert clinical scoring, to better reflect the practical utility of AI-generated radiological interpretations.

4.4. Implications for Clinical Application and Future Perspectives

(1): Role in Collaboration with Specialists

This study suggested that AI-assisted diagnostic systems should function as supportive tools rather than as a replacement for specialists. Final confirmation by experts remained essential, particularly for complex anatomical interpretations, subtle or rare findings, and judgments requiring a clinical context. A stepwise introduction was considered effective, starting with support for clear and typical cases, moving to preliminary reading assistance under specialist confirmation, and eventually extending to educational use such as training residents. The structured nature of the proposed framework might facilitate such collaboration by providing transparent and verifiable intermediate outputs that could be reviewed by clinicians. Although the current performance remains limited for some categories, the proposed framework demonstrates potential for improving the reliability and consistency of AI-assisted radiological reporting rather than replacing expert interpretation.

(2): Study Comparison and Originality

This preliminary study was among the first to address cyst diagnostic support using a multimodal approach with direct image input, extending beyond text-based approaches reported previously [12]. Rather than improving model performance through fine-tuning or architectural modification, the present work emphasized a framework-level contribution that complemented existing hallucination-suppression strategies such as RAG and CoT. Although the iterative design shared similarities with the Self-Refine framework [16], the inclusion of structured consistency checks distinguished the proposed method and aligned it with the strict requirements of medical imaging. Unlike the Self-Refine framework [21], which performs iterative refinement primarily at the text level, the proposed SLSO framework introduces structured intermediate representations that integrate image interpretation, structured data generation, and natural language findings, enabling consistency checks across different representation formats.

(3): Future Technical Improvements and Expansion

Future work should involve larger datasets, specialist-in-the-loop evaluations, and semantic-level metrics to better capture clinical utility. Beyond jaw cysts, the proposed framework could be extended to other dental diseases and broader medical imaging domains, particularly when structured outputs could be aligned with domain-specific diagnostic criteria. The remaining challenges included reducing costs and processing time, and integration into clinical workflows.

(4): Data Privacy and Deployment Considerations

In the present study, GPT-4o was accessed through a cloud-based API environment. For actual clinical deployment, however, the handling of medical images through external cloud services raises important issues related to patient privacy, data governance, and regulatory compliance. In many jurisdictions, the transmission of medical images to external servers may fall under regulations related to the protection of personal medical information, and strict institutional policies are often required.

In addition, in maxillofacial imaging, anatomical structures themselves may potentially serve as biometric identifiers, making complete anonymization challenging. Therefore, future clinical implementations will likely require locally deployable vision–language models operating within secure on-premises environments inside medical institutions. Such architectures would enable inference processing without transmitting medical images outside institutional networks, thereby addressing privacy and legal concerns.

The present study focused on methodological feasibility rather than clinical deployment; however, these regulatory and privacy considerations will be important topics for future research and system development.

(5): Model-Agnostic Design and Robustness to Future Model Evolution

Although GPT-4o was used in the present study, the primary objective was not to evaluate a specific model version but to demonstrate a generalized framework for improving the reliability of AI-generated radiological findings. Handler et al. recently discussed in Nature Medicine that even advanced generative models such as GPT-5 remained inherently fragile in medical applications, lacking intrinsic mechanisms for error detection and self-verification [22]. These limitations were unlikely to be resolved by model scaling alone in the near term. In this context, the proposed SLSO framework provided a practical pathway for enhancing robustness independently of model scaling and might serve as a foundation for future agent-based AI systems incorporating multiple reasoning and verification components.

5. Conclusions

This preliminary study demonstrated that the novel integrated processing approach, the Self-correction Loop with Structured Output (SLSO) framework, could achieve a modest yet meaningful improvement in the practical utility of GPT-4o for dental image interpretation. By enforcing structured documentation and incorporating an external validation mechanism, the proposed framework showed promising trends in improving the reliability of AI-generated outputs, particularly in the explicit documentation of negative findings and the suppression of hallucinations. The stepwise and iterative nature of the proposed framework enables agent-like processing behaviors, while maintaining a human-in-the-loop design suitable for clinical applications. Because the SLSO framework functions independently of model architecture, its underlying principles are not limited to dental imaging and may have potential applicability to other medical imaging domains and multimodal diagnostic tasks. While further validation using larger and more diverse datasets is warranted, the present findings provide technical insights into framework-level strategies for developing more reliable AI-assisted radiological reporting systems.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/diagnostics16071096/s1, Supplementary Note S1: Preliminary Experiments; Supplementary Note S2: Detailed Description of the Proposed Framework.

Author Contributions

Conceptualization, N.H. and H.F.; methodology, N.H.; validation, N.H., R.T. and T.K.; formal analysis, N.H.; investigation, N.H. and T.K.; resources, T.K., Y.I. and A.K.; data curation, N.H. and T.K.; writing—original draft preparation, N.H.; writing—review and editing, H.F., C.M., T.H. (Tatsuro Hayashi)., Y.S., X.Z. and T.H. (Takeshi Hara).; visualization, N.H.; supervision, H.F.; project administration, H.F.; funding acquisition, H.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by EyeTech Co., Ltd., through a collaborative research project with Gifu University (grant number: N/A).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Asahi University (protocol code 32040) and Gifu University (protocol code 2020-250) on 27 February 2023.

Informed Consent Statement

Patient consent was waived due to the retrospective nature of the study using de-identified radiographic images. Information about the study was disclosed to patients through an institutional opt-out procedure at Asahi University Hospital.

Data Availability Statement

Data used in this study are not publicly available to preserve individuals’ privacy.

Conflicts of Interest

Author Ryo Takahashi and Tatsuro Hayashi are employees of EyeTech Co., Ltd. Author Hiroshi Fujita received research funding from EyeTech Co., Ltd. The other authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. The employee authors contributed to this study as individual researchers, and their contributions were independent of the funder’s role.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
API	Application Programming Interface
CBCT	Cone-Beam Computed Tomography
CNN	Convolutional Neural Network
CoT	Chain-of-Thought
FDI	Fédération Dentaire Internationale
JSON	JavaScript Object Notation
LLM	Large Language Model
RAG	Retrieval-Augmented Generation
ROI	Region of Interest
SLSO	Self-correction Loop with Structured Output
VLM	Vision-Language Model

References

Ye, J.; Tang, H. Multimodal Large Language Models for Medicine: A Comprehensive Survey. arXiv 2025, arXiv:2504.21051. [Google Scholar] [CrossRef]
Zhou, S.; Xu, Z.; Zhang, M.; Xu, C.; Guo, Y.; Zhan, Z.; Fang, Y.; Ding, S.; Wang, J.; Xu, K.; et al. Large Language Models for Disease Diagnosis: A Scoping Review. NPJ Artif. Intell. 2025, 1, 9. [Google Scholar] [CrossRef] [PubMed]
Morishita, M.; Fukuda, H.; Yamaguchi, S.; Muraoka, K.; Nakamura, T.; Hayashi, M.; Yoshioka, I.; Ono, K.; Awano, S. An Exploratory Assessment of GPT-4o and GPT-4 Performance on the Japanese National Dental Examination. Saudi Dent. J. 2024, 36, 1577–1581. [Google Scholar] [CrossRef] [PubMed]
Ramlogan, S.; Raman, V.; Ramlogan, S. A Pilot Study of the Performance of ChatGPT and Other Large Language Models on a Written Final Year Periodontology Exam. BMC Med. Educ. 2025, 25, 727. [Google Scholar] [CrossRef] [PubMed]
Jaworski, A.; Jasiński, D.; Sławińska, B.; Błecha, Z.; Jaworski, W.; Kruplewicz, M.; Jasińska, N.; Sysło, O.; Latkowska, A.; Jung, M. GPT-4o vs. Human Candidates: Performance Analysis in the Polish Final Dentistry Examination. Cureus 2024, 16, e68813. [Google Scholar] [CrossRef]
Morishita, M.; Fukuda, H.; Muraoka, K.; Nakamura, T.; Hayashi, M.; Yoshioka, I.; Ono, K.; Awano, S. Evaluating GPT-4V’s Performance in the Japanese National Dental Examination: A Challenge Explored. J. Dent. Sci. 2024, 19, 1595–1600. [Google Scholar] [CrossRef]
Dasanayaka, C.; Dandeniya, K.; Dissanayake, M.B.; Gunasena, C.; Jayasinghe, R. Multimodal AI and Large Language Models for Orthopantomography Radiology Report Generation and Q&A. Appl. Syst. Innov. 2025, 8, 39. [Google Scholar] [CrossRef]
Pham, T.D. A Deep Learning Vision-Language Model for Diagnosing Pediatric Dental Diseases. Intell. Med. 2026, 13, 100364. [Google Scholar] [CrossRef]
Aşar, E.M.; İpek, İ.; Bilge, K. Customized GPT-4V(ision) for Radiographic Diagnosis: Can Large Language Model Detect Supernumerary Teeth? BMC Oral Health 2025, 25, 756. [Google Scholar] [CrossRef]
Liu, Z.; Ai, Q.Y.H.; Yeung, A.W.K.; Tanaka, R.; Nalley, A.; Hung, K.F. Performance of a Vision-Language Model in Detecting Common Dental Conditions on Panoramic Radiographs Using Different Tooth Numbering Systems. Diagnostics 2025, 15, 2315. [Google Scholar] [CrossRef]
Camlet, A.; Kusiak, A.; Ossowska, A.; Świetlik, D. Advances in Periodontal Diagnostics: Application of MultiModal Language Models in Visual Interpretation of Panoramic Radiographs. Diagnostics 2025, 15, 1851. [Google Scholar] [CrossRef]
Silva, T.P.; Andrade-Bortoletto, M.F.S.; Ocampo, T.S.C.; Alencar-Palha, C.; Bornstein, M.M.; Oliveira-Santos, C.; Oliveira, M.L. Performance of a Commercially Available Generative Pre-Trained Transformer (GPT) in Describing Radiolucent Lesions in Panoramic Radiographs and Establishing Differential Diagnoses. Clin. Oral Investig. 2024, 28, 204. [Google Scholar] [CrossRef]
Wang, Y.; Zhao, Y.; Parnell, L.A. Factuality of Large Language Models: A Survey. arXiv 2024, arXiv:2402.02420. [Google Scholar] [CrossRef]
Alkaissi, H.; McFarlane, S.I. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus 2023, 15, e35179. [Google Scholar] [CrossRef] [PubMed]
Chang, A.; Huang, L.; Bhatia, P.; Kass-Hout, T.; Ma, F.; Xiao, C. MedHEval: Benchmarking Hallucinations and Mitigation Strategies in Medical Large Vision-Language Models. arXiv 2025, arXiv:2503.02157. [Google Scholar] [CrossRef]
Kwon, O.; Kim, J.Y.; Kim, J.W.; Nam, W.; Cha, I.H.; Kim, H.J. Automatic Diagnosis for Cysts and Tumors of Both Jaws on Panoramic Radiographs Using a Deep Convolution Neural Network. Dentomaxillofac. Radiol. 2020, 49, 20200185. [Google Scholar] [CrossRef]
Yang, H.; Jo, E.; Kim, H.J.; Cha, I.H.; Jung, Y.S.; Nam, W.; Kim, J.Y.; Kim, J.W.; Kim, J.H.; Lee, J.H.; et al. Deep Learning for Automated Detection of Cyst and Tumors of the Jaw in Panoramic Radiographs. J. Clin. Med. 2020, 9, 1839. [Google Scholar] [CrossRef] [PubMed]
Huang, Z.; Li, B.; Cheng, Y.; Kim, J. Odontogenic Cystic Lesion Segmentation on Cone-Beam CT Using an Auto-Adapting Multi-Scaled UNet. Front. Oncol. 2024, 14, 1379624. [Google Scholar] [CrossRef]
Lee, H.S.; Ha, E.G.; Kim, J.E.; Huh, K.H.; Yi, W.J.; Heo, M.S.; Lee, S.S.; Choi, S.C. Automatic Detection and Classification of Nasopalatine Duct Cyst and Periapical Cyst on Panoramic Radiographs Using Deep Convolutional Neural Networks. Oral Surg. Oral Med. Oral Pathol. Oral Radiol. 2024, 138, 184–195. [Google Scholar] [CrossRef]
Sakai, K.; Kitano, T.; Iida, Y.; Katsumata, A. Detection and Dual-Label Classification of Tooth Number and Condition in Dental Panoramic Radiographs Including Deciduous Teeth. Radiol. Phys. Technol. 2025, 18, 821–832. [Google Scholar] [CrossRef]
Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-Refine: Iterative Refinement with Self-Feedback. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023); Neural Information Processing Systems Foundation: San Diego, CA, USA, 2023; pp. 46534–46594. [Google Scholar]
Handler, R.; Sharma, S.; Hernandez-Boussard, T. The Fragile Intelligence of GPT-5 in Medicine. Nat. Med. 2025, 31, 3968–3970. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Overview of proposed Self-correction Loop with Structured Output (SLSO) framework.

Figure 2. ROI image around cyst used as input for the successful case.

Figure 3. ROI image around cyst used as input in the failure case.

Table 1. Structured category and output values based on schema design.

Category	Output Value
X-ray transparency	radiolucent
X-ray transparency	radiopaque
Internal structure	unilocular
Internal structure	multilocular
Border	well-defined
Border	ill-defined
Root resorption	no
	mild
	severe
Tooth displacement	no
	mild
	severe
Relationship to mandibular canal/maxillary sinus	contact
	displacement
	invasion
Affected tooth number	11, 12, 13, … *

* FDI notation is used for tooth numbering.

Table 2. Comparison of correct answer rates between the CoT and SLSO methods across interpretation categories. Mean correct answer rates (mean ± SE) for each interpretation category are shown for both methods, together with the absolute and relative differences between the two methods and the corresponding p-values.

Interpretation Category	CoT Method (Mean ± SE)	SLSO Method (Mean ± SE)	Absolute Difference	Relative Difference (%)	p-Value
Radiolucency	1.000 ± 0.000	1.000 ± 0.000	+0.000	+0.0%	n/a
Internal structure	0.909 ± 0.063	0.909 ± 0.063	+0.000	+0.0%	n/a
Boundary	0.864 ± 0.075	0.864 ± 0.075	+0.000	+0.0%	n/a
Root resorption	0.318 ± 0.102	0.409 ± 0.107	+0.091	+28.6%	0.248
Tooth displacement	0.273 ± 0.097	0.364 ± 0.105	+0.091	+33.3%	0.248
Relationship with other structures	0.500 ± 0.109	0.545 ± 0.109	+0.045	+9.0%	0.564
Tooth number	0.136 ± 0.075	0.227 ± 0.091	+0.091	+66.9%	0.180

Note: SE; Standard error; Absolute difference = mean score of SLSO method—mean score of CoT method; Relative difference = (absolute difference/mean score of CoT method) × 100 (%).

Table 3. Generated outputs and evaluation results for the successful case.

Category	Content
Ground-truth Findings	The lower-right 8th tooth is horizontally impacted, with a well-defined, round radiolucent lesion surrounding the crown. The lesion extends to the inferior border cortical bone of the mandible. The mandibular canal appears to overlap with the lesion. A radiolucent area is observed in the crown of the lower right 7th tooth. The root is in contact with the lesion but no resorption is observed. Lower right 8th tooth: suspected dentigerous cyst.
CoT-Method Output	A well-defined, unilocular radiolucent lesion is observed in the mandibular molar region. The lesion presents a smooth contour, and radiographic findings suggest a cystic lesion (likely cyst). No obvious resorption is observed in the roots of teeth #47 and #48 (root resorption: none), with slight mesial displacement of the roots (tooth displacement: mild). Expansion of the inferior border cortical bone (cortical bone expansion: present) is noted, with bone expansion observed. The lesion overlaps with the inferior alveolar canal but shows no obvious invasion. There is no relationship with the maxillary sinus (unrelated), and no association with this anatomical structure is recognized. The teeth involved in the lesion are #47 and #48 in FDI notation.
CoT Category-wise Score	0.71
SLSO-Method Output	A radiolucent lesion is confirmed around teeth #47 and #48 on the right side of the mandible. The internal structure of the lesion is unilocular, with well-defined and round borders. These characteristics suggest a lesion with high probability of being a cyst. Although the lesion is adjacent to the inferior alveolar nerve canal, no pathological effects such as root resorption or tooth displacement are observed, and no cortical bone expansion is seen. Additionally, no association with the maxillary sinus is identified.
SLSO-Method Category-wise Score	1.00
Main observations	✓ Accurate identification of tooth numbers (“teeth #47 and #48”). ✓ Conciseness and clinical appropriateness of description. ✓ Consistent description across all categories.
Number of Regenerations	5 times (automatic correction via tooth number consistency check)

Note: # indicates FDI tooth numbers.

Table 4. Generated results and evaluation outcomes for the failure case.

Category	Content
Ground-truth Findings	A cystic lesion is observed in the maxillary anterior region (from upper right 3 to upper left 3 apical area). It partially overlaps with the root apices, and mild resorption is also suspected. No left-right asymmetry is observed, but elevation of the nasal cavity floor is also suspected. No obvious calcifications or foreign bodies are observed internally. Suspected nasopalatine duct cyst.
CoT-Method Output	Step 1: Internal structure—The observed lesion shows radiolucency compared to surrounding structures, with high X-ray transparency. The morphology is unilocular, presenting a uniform radiolucent appearance. Step 2: Lesion characteristics—The lesion boundary is well-defined, with a round to oval shape. Such findings suggest a suspected cyst (likely cyst). Step 3: Effects on surrounding anatomical structures—No root resorption is observed in adjacent teeth (none). Tooth displacement is none to mild, with possible slight mobility particularly around tooth #21. No cortical bone expansion is observed (false). Step 4: Relationship with anatomical structures—The lesion has no relationship with the inferior alveolar canal or maxillary sinus (unrelated). Step 5: Involved tooth numbers (FDI system)—The lesion is located around tooth #21, with possible mild extension to teeth #11 and #22, but structural involvement is minimal. Therefore, the involved FDI tooth number is #21.
CoT Category-wise Score	0.57
SLSO-Method Output	A radiolucent lesion is observed around teeth #11 and #12 in the maxillary anterior region. The internal structure is unilocular, presenting a well-defined, round morphology. Based on these radiological findings, it is judged to have a high probability of being a cystic lesion. Regarding this lesion, no effects such as root resorption, tooth displacement, or cortical bone expansion were observed. Additionally, no relationship with the inferior alveolar canal or maxillary sinus is identified.
SLSO-Method Category-wise Score	0.43
Main observations	✓ Difficulty in accurately identifying the extent of lesions spanning multiple teeth. ✓ Difficulty in evaluating relationships with anatomical structures (nasal cavity floor, incisive canal). ✓ Difficulty in detecting subtle findings (mild resorption).
Number of Regenerations	Once (automatic correction via tooth number consistency check).

Note: # indicates FDI tooth numbers.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hosokawa, N.; Takahashi, R.; Kitano, T.; Iida, Y.; Muramatsu, C.; Hayashi, T.; Seino, Y.; Zhou, X.; Hara, T.; Katsumata, A.; et al. Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with a Structured Output (SLSO) Framework. Diagnostics 2026, 16, 1096. https://doi.org/10.3390/diagnostics16071096

AMA Style

Hosokawa N, Takahashi R, Kitano T, Iida Y, Muramatsu C, Hayashi T, Seino Y, Zhou X, Hara T, Katsumata A, et al. Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with a Structured Output (SLSO) Framework. Diagnostics. 2026; 16(7):1096. https://doi.org/10.3390/diagnostics16071096

Chicago/Turabian Style

Hosokawa, Nanaka, Ryo Takahashi, Tomoya Kitano, Yukihiro Iida, Chisako Muramatsu, Tatsuro Hayashi, Yuta Seino, Xiangrong Zhou, Takeshi Hara, Akitoshi Katsumata, and et al. 2026. "Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with a Structured Output (SLSO) Framework" Diagnostics 16, no. 7: 1096. https://doi.org/10.3390/diagnostics16071096

APA Style

Hosokawa, N., Takahashi, R., Kitano, T., Iida, Y., Muramatsu, C., Hayashi, T., Seino, Y., Zhou, X., Hara, T., Katsumata, A., & Fujita, H. (2026). Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with a Structured Output (SLSO) Framework. Diagnostics, 16(7), 1096. https://doi.org/10.3390/diagnostics16071096

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using a GPT-Based VLM: A Preliminary Study on Building a Two-Stage Self-Correction Loop with a Structured Output (SLSO) Framework

Abstract

1. Introduction

2. Materials and Methods

2.1. Pilot Experiments

2.2. Dataset and Ground-Truth Creation

2.3. Structured Output Schema and Image Annotation

2.4. Overall Flow of Proposed Framework

2.5. Experimental Setup

2.5.1. Evaluation Criteria

2.5.2. Evaluation Procedure

2.5.3. Model and API Configuration

2.5.4. Prompt Design

3. Results

3.1. Overall Performance Comparison

3.2. Comparison in Representative Cases

3.2.1. Successful Case

3.2.2. Failure Case

4. Discussion

4.1. Effectiveness of Proposed Method

4.2. Study Limitations and Challenges

4.3. Experimental Technical Constraints

4.4. Implications for Clinical Application and Future Perspectives

5. Conclusions

Supplementary Materials

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI