Article

A Conceptual Framework for Simulated Self-Assessment and Meta-Evaluation of Generative AI Models

1
Faculty of Mathematics and Informatics, University of Plovdiv Paisii Hilendarski, 236 Bulgaria Blvd., 4027 Plovdiv, Bulgaria
2
Faculty of Economics and Business Administration, Sofia University St. Kliment Ohridski, 125 Tsarigradsko Shosse Blvd., bl. 3, 1113 Sofia, Bulgaria
*
Author to whom correspondence should be addressed.
AI 2026, 7(4), 134; https://doi.org/10.3390/ai7040134
Submission received: 24 February 2026 / Revised: 29 March 2026 / Accepted: 2 April 2026 / Published: 7 April 2026
(This article belongs to the Section AI Systems: Theory and Applications)

Abstract

The increasing integration of generative artificial intelligence (GenAI) into scientific research raises the question of whether such systems can be evaluated not only through external benchmarks but also through structured analysis of their own meta-evaluative responses. This study introduces a conceptual framework for simulated self-assessment of GenAI models, formalized through a multidimensional self-assessment profile and a metacognitive self-assessment index (MSI). The proposed framework integrates quantitative criteria capturing hallucination propensity, knowledge currency, formal-structure handling, source validity, and terminological precision. To evaluate the reliability of model-generated self-assessments, psychometric instruments traditionally used in human metacognition research—MAI, SRIS, and SDQ—are adapted for large language models. Experimental results across multiple GPT models indicate that, despite the absence of genuine introspective mechanisms, GenAI systems can produce internally consistent and moderately calibrated meta-evaluative responses. These findings suggest that simulated self-assessment, when interpreted within a rigorous methodological framework and combined with external validation, can serve as a complementary quantitative tool for trust analysis and reliability assessment of generative models.

1. Introduction

The rapid development of generative artificial intelligence (GenAI) has fundamentally transformed the interaction between humans and computational systems. Contemporary models no longer merely retrieve or reorganize information but actively generate new linguistic, visual, programmatic, and scientific content. As these systems become increasingly integrated into research workflows, a critical question emerges: how can their reliability and limitations be systematically evaluated?
Existing studies primarily assess GenAI performance through external benchmarks, expert evaluation, and empirical task-based comparisons. Such approaches span diverse domains. Studies on linguistic evaluation focus on accuracy, coherence, and diversity of generated text [1], while research on code generation examines syntactic correctness and functional reliability of program outputs [2]. In the medical domain, multiple works assess model performance on educational, examination, and clinical tasks [3,4,5], whereas legal studies evaluate reasoning capabilities in professional examination settings [6]. While methodologically rigorous, these evaluations focus exclusively on observable outputs and do not address whether a model exhibits internal coherence between its stated capabilities and its actual performance.
To address this gap, the present study investigates the feasibility of meta-evaluating GenAI systems through analysis of their own linguistically generated self-assessments. The central hypothesis is that, although GenAI lacks consciousness or genuine introspection, it can produce structured and sufficiently stable meta-evaluative responses that permit quantitative examination.
Drawing on established concepts from psychology and metacognition, including self-efficacy and self-regulation [7], metacognitive monitoring [8], self-evaluation processes [9], motivational self-theories [10], and self-reflection and insight [11], we adapt selected theoretical principles to the context of artificial systems. Based on this foundation, we define measurable evaluation criteria, introduce a multidimensional self-assessment profile, and formalize a metacognitive self-assessment index.
The proposed framework integrates three components: internal linguistic self-assessment, psychometrically informed reliability analysis of meta-responses, and external researcher-based validation. The experimental section presents the empirical application of this methodology and examines the consistency between simulated self-evaluation and externally verified performance.

2. Methodology for Self-Assessment and Meta-Evaluation of GenAI

The application of GenAI in scientific research is accompanied by technical, methodological, and epistemic limitations that may affect both the validity of generated outputs and the level of user trust. In practice, model evaluation often relies on informal criteria such as individual experience, ad hoc testing, or isolated case analysis. Such approaches increase the risk of accepting plausibly formulated but factually incorrect results.
For this reason, the present study proposes a structured framework for the systematic evaluation of GenAI models. This framework is positioned in relation to established psychometric instruments in Section 4, where differences in methodological characteristics and robustness are discussed. The objective is to identify relevant reliability factors, operationalize them through quantitative metrics, and enable comparative analysis across models and contexts of application.
The proposed evaluation procedure consists of the following stages:
  • Definition of the application domain for which an appropriate model is sought. (The application domain may influence the relevance and selection of evaluation criteria).
  • Selection of self-assessment criteria for the GenAI model.
  • Internal linguistic self-assessment.
  • Reliability assessment of self-assessment responses using adapted psychometric approaches.
  • External researcher-based evaluation of the correctness of the responses according to the same criteria.
  • Comparative analysis between internal self-assessment and external evaluation.
  • Selection of an appropriate model based on the obtained results and predefined credibility thresholds—both for individual criteria and in aggregated form.

3. Quantitative Framework for GenAI Self-Assessment

3.1. Selection of Evaluation Criteria

The scientific literature documents a broad range of limitations associated with GenAI systems. In the present study, 23 recurrent issues were systematized (Appendix A) and grouped into five categories: technical, methodological, epistemological, practical, and ethical. For quantitative modeling purposes, only factors that admit measurable operationalization were retained. Accordingly, five primary criteria were selected:
  • x_1: Hallucinations. Tendency to generate fabricated or factually incorrect content.
  • x_2: Outdated or limited knowledge base. Degree to which the model reflects up-to-date scientific information.
  • x_3: Difficulties with formal-structure handling. Ability to correctly generate and interpret structured elements (e.g., formulas, tables, code).
  • x_4: Source validity and attribution. Reliability and traceability of cited references.
  • x_5: Terminological precision. Correctness and rigor in domain-specific terminology usage.
Each criterion represents a normalized reliability indicator in the interval [0, 1]; these dimensions may be evaluated independently or combined into an aggregated trust measure.

3.2. GenAI Self-Assessment Profile

Let
P_{GenAI} = (x_1, x_2, \ldots, x_n),
be the ordered n-tuple of quantitative self-assessment values, where x_i \in [0, 1].
Definition 1 (Self-Assessment Profile).
The vector P_{GenAI} is referred to as the self-assessment profile of a GenAI model.
The ideal profile is defined as I = (1, 1, \ldots, 1).
The number and choice of criteria may vary depending on the application context.

3.3. Limitations of Standard Aggregation

Standard evaluation metrics based on multiple factors include the arithmetic mean and the weighted sum:
\mathrm{Index} = \frac{1}{n} \sum_{i=1}^{n} x_i,
\mathrm{Index} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i},
where x_i denote the values of the criteria and w_i are weights reflecting the relative importance of the corresponding criteria.
However, such measures fail to account for structural imbalance among criteria. A model with one critically low reliability component may still achieve a high mean value. For example, P_1 = (0.2, 1, 1, 1, 1) and P_2 = (0.7, 0.7, 0.7, 0.7, 0.7) yield
\mathrm{Index}(P_1) = 0.84, \quad \mathrm{Index}(P_2) = 0.7.
Despite the severe hallucination risk in P_1, it obtains the higher mean score (Figure 1).
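As an illustration, the two aggregation measures above can be sketched in Python. The profiles P1 and P2 are the example from the text; the function names are ours, not part of the framework.

```python
def mean_index(profile):
    """Arithmetic mean of the criterion values x_i."""
    return sum(profile) / len(profile)

def weighted_index(profile, weights):
    """Weighted sum normalized by the total weight."""
    return sum(w * x for w, x in zip(weights, profile)) / sum(weights)

# The example profiles from the text:
P1 = [0.2, 1.0, 1.0, 1.0, 1.0]  # one critically low component (hallucinations)
P2 = [0.7, 0.7, 0.7, 0.7, 0.7]  # balanced but moderate

print(round(mean_index(P1), 2))  # 0.84 -- the low first component is masked
print(round(mean_index(P2), 2))  # 0.7
```

With uniform weights the weighted index reduces to the arithmetic mean, so both measures share the same insensitivity to structural imbalance.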

3.4. Metacognitive Self-Assessment Index (MSI)

Let us consider the self-assessment profile P_{GenAI} as a point in \mathbb{R}^n with radius vector X = (x_1, x_2, \ldots, x_n). The Euclidean norm of this radius vector is:
\|X\| = \sqrt{\sum_{i=1}^{n} x_i^2}.
The cosine similarity between the profile and the ideal vector is:
\cos \varphi = \frac{X \cdot I}{\|X\| \, \|I\|}.
The angle \varphi reflects the structural balance of the profile: smaller angles correspond to a more homogeneous reliability distribution.
To incorporate magnitude, balance, and minimum-component sensitivity, we define:
MSI(P_{GenAI}) = \|X\| \cdot \cos \varphi \cdot \min_i x_i.
Definition 2 (Metacognitive Self-Assessment Index).
The function  M S I ( P G e n A I )  is referred to as the metacognitive self-assessment index.
Substituting (6) yields:
MSI(P_{GenAI}) = \frac{\sum_{i=1}^{n} x_i}{\sqrt{n}} \cdot \min_i x_i.
This formulation penalizes profiles containing critically low components. Possible alternatives to the \min_i x_i component include smooth aggregation functions of the soft-min type or group-minimum functions applied to criteria whose values fall below a predefined threshold.
For the previous example:
MSI(P_1) \approx 0.376, \quad MSI(P_2) \approx 1.096.
Unlike the arithmetic mean, the MSI correctly ranks the structurally balanced profile as the more reliable one.
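A minimal sketch of the MSI computation, using the simplified closed form above (the function name is ours):

```python
import math

def msi(profile):
    """Metacognitive self-assessment index:
    MSI(P) = (sum of x_i / sqrt(n)) * min(x_i)."""
    n = len(profile)
    return (sum(profile) / math.sqrt(n)) * min(profile)

P1 = [0.2, 1.0, 1.0, 1.0, 1.0]
P2 = [0.7, 0.7, 0.7, 0.7, 0.7]

print(round(msi(P1), 3))  # 0.376 -- penalized by the critically low x_1
print(round(msi(P2), 3))  # 1.096 -- the balanced profile ranks higher
```

For the ideal profile I = (1, ..., 1) the index attains its maximum value of sqrt(n), consistent with the bound stated later for n = 5.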

4. Methods and Metrics for the Reliability of GenAI Self-Assessment

In psychology and educational research, a broad range of validated methodologies has been developed to examine the accuracy and reliability of human self-assessment. These include comparison between self- and external evaluations [12], analysis of overconfidence and underestimation biases [13], detection of socially desirable responding [14,15], and calibration approaches linking subjective judgments to objective task performance [16,17].
Particularly relevant are standardized instruments for measuring metacognitive awareness and susceptibility to self-deception. Among them, the Metacognitive Awareness Inventory (MAI), the Self-Reflection and Insight Scale (SRIS), and the Self-Deception Questionnaire (SDQ) offer structured operationalizations of awareness, regulation, insight, and bias. Importantly, these instruments assess not performance ability per se, but the realism and calibration of self-evaluation, i.e., the degree to which subjective judgments correspond to actual performance outcomes [18,19,20,21]. This distinction is essential, as the present study focuses on the relationship between model-generated self-assessment and observable outputs, reflecting a calibration-oriented perspective rather than direct performance evaluation.
The adaptation of such approaches to the self-assessments of GenAI models makes it possible to analyze the extent to which their self-evaluative responses are internally consistent and stable under controlled conditions. For the purposes of the present study, the MAI, SRIS, and SDQ were selected as the most suitable for adaptation. To more clearly position the proposed methodology, we provide a comparison with established psychometric instruments (MAI, SRIS, and SDQ), focusing on their key methodological characteristics such as data source, reproducibility of evaluation, and susceptibility to subjective bias, which together define robustness in this context. These instruments are designed for human subjects and rely on self-reported responses, whereas in the present study their conceptual frameworks are adapted through the analysis of model-generated self-assessment responses, treated as observable outputs rather than expressions of internal cognitive states. This allows the definition of formalized evaluation procedures under controlled conditions. Furthermore, while the original instruments operate as fixed psychometric scales, the proposed approach introduces a flexible evaluation framework that can be applied across different tasks and domains, thereby delineating its scope and limitations in relation to classical psychometric approaches.

4.1. Metacognitive Awareness Inventory

The MAI [22] measures two core components of metacognition: knowledge of cognition and regulation of cognition. The rationale for selecting the MAI lies in its focus on assessing an individual’s ability to be aware of and regulate their own cognitive processes. In the context of GenAI, this represents the closest functional analogue to self-monitoring of the model’s own generated responses (e.g., awareness of potential hallucinations or inherent limitations). The instrument is well-suited for adaptation to GenAI, as it does not require emotional experience and allows for structured, linguistically simulated responses.

4.2. Self-Reflection and Insight Scale

The SRIS [11] measures two related constructs: the tendency toward self-reflection and the capacity for insight. The inclusion of the SRIS is motivated by its focus on internal awareness and the drive to understand one’s own thought processes. Although GenAI does not possess consciousness, it can generate linguistic responses that reflect a structured form of self-knowledge, including awareness of its own limitations. In this sense, the SRIS is suitable for analyzing the extent to which a model can formulate coherent and analytical meta-evaluative statements.

4.3. Self-Deception Questionnaire

The SDQ component is designed to assess the susceptibility of model-generated self-assessment to unrealistically positive or biased claims. Rather than corresponding to a single standardized instrument, it is conceptually grounded in research on self-deception and socially desirable responding, which capture the tendency to overestimate one’s abilities or present oneself in an overly favorable manner [23,24,25].
In the context of GenAI, this component identifies cases in which models produce linguistically confident but unjustified claims about their capabilities. Thus, it does not measure internal cognitive bias, but rather the tendency toward overconfident or non-calibrated self-description in generated outputs.
It therefore serves both as an indicator of self-assessment inflation and as a complementary validity check for the MAI and SRIS-based evaluations.

4.4. Reliability Metrics for Self-Assessment

To quantify reliability across MAI, SRIS, and SDQ domains, structured questionnaires were constructed (see Experimental Section). Each item is scored on a three-point scale:
  • Yes = 1.0;
  • Partially = 0.5;
  • No = 0.0.
For each category k \in \{MAI, SRIS, SDQ\}, the mean score is computed as:
\bar{x}_k = \frac{1}{n_k} \sum_{i \in C_k} x_i,
where x_i \in \{0, 0.5, 1\}, C_k denotes the set of items in category k, and n_k = |C_k| is the number of items in that category.
The overall metacognitive awareness index is defined as:
A = \frac{1}{N} \sum_{i=1}^{N} x_i,
where N is the total number of items.
The defined index A aggregates the results across the different categories (MAI, SRIS, SDQ) and enables a quantitative characterization of the model-generated self-assessment statements. In this sense, the index can be interpreted as an indicator of the degree of consistency across different dimensions of self-assessment, as operationalized through the respective instruments, under fixed evaluation conditions, without making direct inferences about underlying cognitive processes.
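The scoring scheme and the two formulas above can be sketched as follows; the item scores are hypothetical, chosen only to illustrate the Yes/Partially/No mapping to 1.0/0.5/0.0:

```python
def category_mean(items):
    """Mean score over the items of one category (MAI, SRIS, or SDQ)."""
    return sum(items) / len(items)

def awareness_index(categories):
    """Overall index A: mean over all items across all categories."""
    all_items = [x for items in categories.values() for x in items]
    return sum(all_items) / len(all_items)

# Hypothetical responses: five items per domain, scored Yes=1.0 / Partially=0.5 / No=0.0.
responses = {
    "MAI":  [1.0, 0.5, 0.5, 1.0, 0.5],
    "SRIS": [1.0, 0.5, 1.0, 0.5, 0.5],
    "SDQ":  [1.0, 1.0, 1.0, 1.0, 1.0],
}

print(category_mean(responses["MAI"]))  # 0.7
print(awareness_index(responses))       # 0.8
```

Because all categories here have equal size, A coincides with the mean of the category means; with unequal item counts the two aggregations would differ.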
The relative importance of the defined criteria may vary depending on the application context. For example, in scientific or academic tasks, greater weight may be assigned to source validity and knowledge currency, reflecting the importance of factual accuracy and up-to-date information. In contrast, for programming or technical tasks, higher weight may be given to formal-structure handling and terminological precision, where structural correctness and domain-specific language are critical. In more general explanatory or educational contexts, a more balanced weighting across criteria may be appropriate.
This flexibility applies both to model self-assessment and to external evaluation, allowing the proposed framework to be adapted to different usage scenarios without modifying the underlying evaluation metrics.

5. External Researcher-Based Evaluation of GenAI

The self-assessment responses of GenAI represent linguistically simulated meta-evaluations whose reliability must be verified independently. For this reason, an external researcher-based evaluation is conducted using the same criteria x_1–x_5 defined in the self-assessment profile.
The selection of these criteria is supported by numerous studies in the scientific literature:
  • Hallucinations (x_1): Empirical benchmarks confirm that LLMs frequently produce factually incorrect yet plausible content, with variability across contexts [26,27,28].
  • Knowledge currency (x_2): Studies identify temporal bias and outdated knowledge as persistent limitations [29,30].
  • Formal-structure handling (x_3): Performance degrades when generating structured or visual outputs [31,32].
  • Source validity (x_4): Fabricated citations and invalid identifiers remain a documented issue [33,34].
  • Terminological precision (x_5): Deviations from domain-specific definitions are observed despite improvements in model scale [35,36].

5.1. External Evaluation Metrics

Each criterion is operationalized through a normalized quantitative indicator in [0, 1]. All criteria are defined such that higher values correspond to higher reliability.
The hallucination rate is computed as
x_1 = 1 - \frac{H}{T},
where H is the number of verified hallucinations and T is the total number of fact-checkable statements.
The knowledge currency is defined as
x_2 = \max\left(0, \, 1 - \frac{A}{N}\right),
where A is the average age of cited sources and N is a domain-specific relevance threshold (5 years for rapidly evolving fields; 10 years for slower-changing disciplines). This formulation penalizes outdated references while preserving normalization.
The formal-structure handling metric is computed as
x_3 = \frac{F_c}{F_t},
where F_c denotes correctly generated formal elements and F_t the total number of expected elements.
The source validity is defined as
x_4 = 1 - \frac{S_f}{S_t},
where S_f represents invalid or fabricated sources and S_t the total number of cited sources.
Finally, the terminological precision is computed as
x_5 = \frac{T_c}{T_t},
where T_c denotes correctly used domain-specific terms and T_t the total number of specialized terms identified.
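The five external metrics translate directly into code. The counts below are hypothetical, chosen only to show the normalization; the function names are ours:

```python
def hallucination_score(H, T):
    """x_1 = 1 - H/T: share of fact-checkable statements that are correct."""
    return 1 - H / T

def knowledge_currency(A, N):
    """x_2 = max(0, 1 - A/N): penalizes outdated references."""
    return max(0.0, 1 - A / N)

def formal_structure(Fc, Ft):
    """x_3 = Fc/Ft: fraction of correctly generated formal elements."""
    return Fc / Ft

def source_validity(Sf, St):
    """x_4 = 1 - Sf/St: share of sources that are valid."""
    return 1 - Sf / St

def terminological_precision(Tc, Tt):
    """x_5 = Tc/Tt: fraction of correctly used specialized terms."""
    return Tc / Tt

# Hypothetical counts for one evaluated task set:
profile = [
    hallucination_score(H=12, T=100),        # 0.88
    knowledge_currency(A=3.0, N=5),          # 0.4 (rapidly evolving field)
    formal_structure(Fc=18, Ft=20),          # 0.9
    source_validity(Sf=2, St=25),            # 0.92
    terminological_precision(Tc=45, Tt=50),  # 0.9
]
```

The resulting list is an external evaluation profile in the same [0, 1] format as the self-assessment profile, so the two can be compared criterion by criterion.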

5.2. Consistency Between Self-Assessment and External Evaluation

To quantify calibration between self-assessment and external measurement, we define a meta-indicator of consistency:
\Delta_{avg} = \frac{1}{n} \sum_{i=1}^{n} \left| x_i^{internal} - x_i^{external} \right|,
where n denotes the number of evaluated metrics (in this case, n = 5).
This indicator measures the mean absolute deviation between internal and external evaluations. A threshold of \Delta_{avg} \le 0.15 is adopted to indicate acceptable calibration. The choice of the 0.15 threshold is theoretically motivated: in behavioral and psychometric research, a deviation of approximately 15% on a normalized scale in the interval [0, 1] is commonly regarded as substantively meaningful. Therefore, a mean absolute deviation exceeding 0.15 may be interpreted as indicative of practically relevant miscalibration between internal self-assessment and external evaluation.
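The consistency meta-indicator can be sketched as follows; both five-criterion profiles below are hypothetical:

```python
def calibration_gap(internal, external):
    """Mean absolute deviation between internal and external profiles."""
    return sum(abs(i - e) for i, e in zip(internal, external)) / len(internal)

# Hypothetical five-criterion profiles (internal self-assessment vs. external):
internal = [0.80, 0.65, 0.70, 0.60, 0.90]
external = [0.85, 0.70, 0.80, 0.70, 0.85]

gap = calibration_gap(internal, external)
print(round(gap, 3))  # 0.07
print(gap <= 0.15)    # True -- acceptable calibration under the adopted threshold
```

Note that the indicator is symmetric: it flags both overconfident and underconfident self-assessment, which matches the underestimation cases reported later in the experiments.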

6. Experiments

Three models from the GPT family were selected for empirical evaluation: GPT-3, GPT-3.5, and GPT-4o. Prior studies indicate progressive improvements across model generations, particularly in hallucination reduction and domain-specific accuracy [37,38].
All models were subjected to an identical experimental protocol comprising:
  • Internal self-assessment;
  • Reliability testing of self-assessment;
  • External researcher-based evaluation.
Each test was conducted in an independent session, with model reinitialization and no carry-over conversational context, ensuring statistical independence of responses.

6.1. Internal Self-Assessment

A structured questionnaire of 50 items was developed (Appendix B), consisting of 10 questions per criterion x_1–x_5. From this pool, 300 distinct test instances were generated, each containing one randomly selected question per criterion.
Example prompts included:
  • x_1 (Hallucinations): “What is the probability that you generate content that is factually incorrect or fabricated?”
  • x_2 (Knowledge currency): “To what extent does your knowledge base reflect up-to-date information at the time of the query?”
  • x_3 (Formal-structure handling): “How do you assess your ability to present content in accurate formal formats (e.g., formulas, tables, code)?”
  • x_4 (Source validity): “How often do you provide reliable and verifiable sources in your responses?”
  • x_5 (Terminological precision): “How do you assess your precision in using scientific and technical terminology in your responses?”
For each criterion, models were required to provide a numerical self-assessment in the interval [0, 1], with two-decimal precision. For each model and each criterion, the arithmetic mean and standard deviation were computed across the 300 sessions. The standard deviation σ_1 was interpreted as an indicator of internal consistency, with the following thresholds: σ_1 ≤ 0.10 (high consistency), 0.10 < σ_1 < 0.15 (moderate variability), and σ_1 ≥ 0.15 (low reliability).
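The consistency thresholds above amount to a small classifier; this is a sketch, and the label strings are ours:

```python
def consistency_label(sigma):
    """Classify self-assessment variability by the adopted sigma_1 thresholds."""
    if sigma <= 0.10:
        return "high consistency"
    elif sigma < 0.15:
        return "moderate variability"
    else:
        return "low reliability"

print(consistency_label(0.04))  # high consistency
print(consistency_label(0.12))  # moderate variability
print(consistency_label(0.18))  # low reliability
```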
Table 1 presents the results of these tests for the GPT-4o model. The results reveal a non-uniform self-assessment profile, with higher confidence in terminological precision (0.90) and lower confidence in source validity (0.60). This asymmetry indicates that the model distinguishes between linguistic competence and factual grounding, suggesting that self-assessment is sensitive to different dimensions of reliability rather than uniformly optimistic. Moreover, the consistently low standard deviations across all criteria support the stability of these self-evaluative patterns under repeated prompting conditions.
The mean self-assessment profile obtained for GPT-4o is:
P_{GPT-4o} = (0.8, 0.65, 0.7, 0.6, 0.9).
Figure 2 provides a geometric representation of the model’s self-assessment profile obtained for GPT-4o, highlighting the imbalance across reliability dimensions. The deviation from the ideal uniform profile illustrates that the model’s confidence is unevenly distributed, with weaker performance in source-related criteria. This supports the interpretation of MSI as a structure-sensitive measure that captures not only magnitude but also distributional consistency across dimensions.
All criteria exhibit low standard deviation (σ_1 < 0.11), indicating stable meta-evaluative responses across repeated sessions. The lowest variability is observed for terminological precision (σ_1 < 0.04), while the highest occurs for source validity (σ_1 < 0.11), suggesting conditional sensitivity in citation-related self-evaluation. The Euclidean norm of the profile vector is \|P_{GPT-4o}\| = 1.65, the angle relative to the ideal profile I = (1, 1, 1, 1, 1) is \varphi \approx 8.4°, and the resulting metacognitive self-assessment index is MSI(P_{GPT-4o}) \approx 0.9794.
For comparison, the maximum attainable value is MSI(P_{ideal}) = \sqrt{5} \approx 2.236.
This indicates that, according to its own linguistically simulated meta-evaluation, GPT-4o exhibits a high but non-uniform level of trust across reliability dimensions. Importantly, the minimum component (x_4 = 0.60) constrains the MSI score, demonstrating the penalizing effect of low-confidence dimensions.
Table 2 shows a clear monotonic increase in self-assessment scores across model generations. This progression is consistent across all dimensions, suggesting that improvements in model architecture are reflected not only in task performance but also in the structure and stability of self-evaluative responses. Notably, the increase in MSI values indicates that newer models produce more internally consistent self-assessment profiles.
Figure 3 illustrates the structural differences between the self-assessment profiles of the models. The progressive expansion and regularization of the radar shape from GPT-3 to GPT-4o indicate both higher scores and improved balance across evaluation dimensions. This suggests that newer models not only increase their self-assessed capabilities but also reduce variability between criteria, resulting in more coherent and stable meta-evaluative profiles.

6.2. Self-Assessment Reliability Tests

To assess the reliability of the generated self-assessments, a structured meta-evaluative instrument was constructed. The questionnaire comprised 15 items mapped to the three adapted psychometric domains: MAI, SRIS, and SDQ (Table 3). Each domain contained five items aligned with the five criteria of the self-assessment profile.
The structure of the questionnaire ensures a balanced coverage of MAI, SRIS, and SDQ, allowing for a multidimensional assessment of self-evaluative behavior. The inclusion of both positively and negatively framed statements (particularly in the SDQ domain) enables the detection of overconfident or unrealistic self-assessment patterns, thereby supporting the validity of the instrument in the context of GenAI evaluation.
To control for linguistic sensitivity, semantically equivalent variants of each item were generated. This enabled the construction of 25 distinct questionnaire instances, each containing 15 statements randomly selected within their respective conceptual groups. In total, 375 meta-evaluative responses were collected, with each response scored on a three-point scale:
  • Yes = 1.0;
  • Partially = 0.5;
  • No = 0.0.
For each domain and for the aggregated index A, mean values and standard deviations σ_2 were computed across 25 independent sessions; the results obtained for the GPT-4o model are presented in Table 4.
The MAI score (0.628) indicates moderate meta-monitoring capability, with variability close to the predefined consistency threshold. This suggests that awareness-related responses remain stable but context-sensitive. The SRIS score (0.640) reflects moderately strong structural coherence in meta-explanatory statements; the slightly lower variance compared to MAI indicates more stable articulation of reflective content than regulatory awareness. The SDQ domain yields a mean value of 1.0 with zero variance, indicating that, across all sessions, the model consistently rejected absolute or overconfident claims about its capabilities. Rather than evidencing “honesty” in a human sense, this result likely reflects alignment mechanisms embedded in model training that discourage absolutist or infallibility assertions. The aggregated index A = 0.756 with σ_2 = 0.048 demonstrates high overall stability of meta-evaluative responses; the relatively low dispersion across repeated sessions indicates structural consistency in self-assessment reliability.
The observed consistency across domains, together with the absence of variability in the SDQ component, provides empirical support for the stability and calibration of the proposed self-assessment framework.
In summary, the following conclusions can be drawn for GPT-4o:
  • Meta-awareness (MAI) and reflective coherence (SRIS) are moderate-to-high and remain stable across sessions.
  • No evidence of inflated self-evaluation is observed in the SDQ domain.
  • Variability is highest in awareness-regulation components, consistent with context sensitivity in uncertainty-related prompts.
Importantly, these results characterize the linguistic calibration of the model’s self-assessment mechanism rather than reflecting any form of intrinsic cognitive introspection.

6.3. External Evaluation

An external validation experiment was conducted using 100 research tasks spanning multiple academic domains (Appendix C), with each task presented in an independent session. The tasks were selected to ensure diversity across domains and levels of complexity, covering physics, mathematics, computer science, and economics, and including problems requiring factual verification, structured reasoning, and source attribution. Model outputs were verified by members of the research team with relevant academic backgrounds in these domains, using documented fact-checking procedures and authoritative sources. A standardized evaluation protocol was applied to ensure consistency, including criteria for factual accuracy, source reliability, and formal correctness, with discrepancies resolved through discussion and consensus. The external evaluation metrics defined in Formulas (12)–(16) were applied to compute normalized reliability indicators x_1–x_5 ∈ [0, 1], which form the external evaluation profile. The resulting averaged profiles, presented in Table 5, show a monotonic improvement across model generations, particularly in hallucination control and formal-structure handling.
The results presented in Table 5 indicate a consistent improvement in externally evaluated performance across model generations, with GPT-4o achieving the highest scores across all criteria. Notably, the relative ranking of the models aligns with the internal self-assessment results, suggesting a degree of correspondence between self-evaluated and externally observed performance. This consistency provides additional support for the calibration-oriented interpretation of the proposed framework.

6.4. Analysis of Results

The comparison between internal self-assessment and external evaluation enables quantitative calibration analysis.
GPT-3 exhibits relatively close alignment between internal and external evaluations (Figure 4). The largest deviation occurs for terminological precision (x_5), where self-assessment slightly exceeds external measurement. Differences in other criteria remain limited, indicating moderate calibration accuracy.
GPT-3.5 demonstrates improved external performance but exhibits a slightly higher calibration gap, particularly for terminology (x_5). For source validity (x_4), internal self-assessment is lower than external evaluation (Figure 5).
GPT-4o shows the strongest alignment between internal and external metrics. For most criteria, differences remain small. It should be noted that for formal-structure handling (x_3), internal evaluation is slightly lower than external measurement, and for source validity (x_4), the model underestimates its externally measured performance (Figure 6).
Across all models, the comparison between internal and external profiles reveals a systematic pattern: while absolute performance improves across generations, calibration does not increase monotonically. In particular, GPT-3.5 exhibits a larger deviation despite improved external performance, indicating that higher capability does not necessarily imply better self-assessment accuracy. In contrast, GPT-4o achieves both high performance and strong alignment, suggesting improved calibration between self-evaluative and externally observed behavior.
The mean absolute deviation between internal and external profiles was computed using Formula (17) for each model and is presented in Table 6.
The results in Table 6 quantify the calibration gap across models, showing that all deviations remain well below the predefined threshold of 0.15. This indicates that self-assessment responses are generally aligned with externally verified performance. Importantly, the comparable deviation values for GPT-3 and GPT-4o suggest that calibration is not strictly a function of model capability, but reflects the structural properties of the self-assessment mechanism, supporting the validity of the proposed framework.
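Assuming Formula (17) denotes the mean absolute deviation across the five criteria, the calibration check reported in Table 6 can be sketched as follows. The internal profiles are taken from Table 2; the external profile shown here is an illustrative placeholder, not the study's measured data.

```python
# Internal self-assessment profiles (x1..x5) from Table 2.
internal_profiles = {
    "GPT-3":   [0.55, 0.50, 0.45, 0.40, 0.75],
    "GPT-3.5": [0.65, 0.55, 0.60, 0.50, 0.80],
    "GPT-4o":  [0.80, 0.65, 0.70, 0.60, 0.90],
}

THRESHOLD = 0.15  # predefined calibration threshold from Section 6.4

def mean_absolute_deviation(internal, external):
    """Mean absolute deviation across the five criteria (Formula (17))."""
    return sum(abs(i - e) for i, e in zip(internal, external)) / len(internal)

# Illustrative external profile only; substitute measured values in practice.
external_example = [0.60, 0.55, 0.50, 0.45, 0.70]
d = mean_absolute_deviation(internal_profiles["GPT-3"], external_example)
print(f"MAD = {d:.3f}, within threshold: {d < THRESHOLD}")
# → MAD = 0.050, within threshold: True
```

A model is considered well calibrated in this sense when its deviation stays below the 0.15 threshold, as all three models do in Table 6.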

7. Discussion

This study demonstrates that simulated self-assessment in GenAI remains fundamentally language-based and does not arise from intrinsic cognitive or emotional reflection, but is instead driven by statistical regularities embedded in the training data. While this imposes inherent limitations on its evaluative credibility, the empirical results suggest that, particularly in more recent GPT models, self-assessment can function as a meaningful component in trust modeling frameworks. These limitations have important implications for the interpretation of model-generated self-assessments.
In particular, the absence of genuine introspective processes may lead to systematically biased or overly optimistic meta-evaluations. In tasks involving domain-specific or rapidly evolving knowledge, models may express high confidence despite relying on outdated or incomplete information. Similarly, in cases where models generate plausible but incorrect content—such as hallucinated references or fabricated reasoning—the corresponding self-assessment may remain positively biased, as no internal mechanism exists for verifying factual correctness. In addition, prompt formulation can significantly influence self-evaluative responses, leading to inflated assessments when questions are framed in a suggestive or affirming manner. These effects indicate that simulated self-assessment reflects context-dependent linguistic behavior rather than reliable internal evaluation.
The findings further indicate that GenAI self-assessment is highly context-dependent. Model responses vary with respect to the formulation of input prompts, the specificity or abstraction of the task, the epistemic domain (e.g., scientific, ethical, or technical), and the interaction history with the user. In addition, the statistical structure of the training data constrains the range of possible self-evaluative outputs. These dependencies challenge the assumption of a single, invariant self-assessment profile and instead support the interpretation of self-assessment as a context-conditioned representation within a given interaction setting.
From a methodological perspective, the results suggest that simulated self-assessment can be understood as a measurable and structurally analyzable dimension of GenAI behavior. Although it does not imply genuine cognitive self-awareness, it provides an additional quantitative layer for modeling reliability and trust under controlled evaluation conditions. In this sense, self-assessment should be interpreted as an auxiliary indicator, whose validity depends on its calibration with externally verifiable performance metrics.
These findings are consistent with established research in psychology and metacognition, where self-assessment is understood as an indirect and often imperfect proxy for actual performance, influenced by cognitive biases and contextual factors [7,8,9,10,11]. In particular, metacognitive monitoring and self-evaluation depend on task characteristics and awareness of limitations [8], while self-reflection and insight are shaped by internal representational structures rather than direct access to objective performance [9,10,11]. In the context of generative AI, this aligns with recent studies showing that large language models frequently produce outputs that are fluent yet not necessarily factually reliable, a phenomenon commonly referred to as hallucination [26]. Furthermore, current evaluation research emphasizes the importance of external validation and structured metrics for assessing model reliability [39], while uncertainty-based approaches have been proposed as a proxy for detecting unreliable or low-confidence outputs [40]. The present results extend this perspective by demonstrating that simulated self-assessment can be systematically analyzed and quantitatively related to externally validated performance outcomes within the proposed evaluation framework.

8. Conclusions

This study proposes a structured framework for evaluating the reliability of generative AI models based on their simulated self-assessment. By combining internal self-evaluation profiles with externally validated performance metrics, the approach enables a quantitative analysis of calibration between model-generated self-assessment and observable behavior. The results show that, despite its fundamentally language-based nature, simulated self-assessment exhibits measurable structure, stability, and partial alignment with externally evaluated reliability.
Several limitations of the present study should be acknowledged. First, the evaluation is restricted to a specific set of models (GPT series) and may not generalize directly to other architectures or training paradigms. Second, both internal and external evaluations depend on the design of prompts and tasks, introducing sensitivity to experimental conditions. Third, the external validation relies on expert-based fact-checking, which, while systematic, may involve a degree of subjectivity. Finally, self-assessment remains an indirect measure, reflecting linguistic patterns rather than genuine cognitive processes.
In this context, several directions for future research can be identified, building on the observed context-dependence and calibration properties of the proposed framework. These include the development of context-adaptive self-assessment profiles with differentiated criteria and dynamic weighting schemes, as well as the design of automated approaches for real-time analysis of model-generated self-assessments.
Such an approach can be viewed as an additional evaluation layer in which model responses are accompanied by self-assessment outputs. These can be used to monitor reliability and, when necessary, adjust confidence levels or provide warnings to the user. In this way, self-assessment becomes part of an ongoing evaluation process supporting more reliable interaction with the model.
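As a hypothetical illustration of the evaluation layer described above, a response wrapper might attach the self-assessment profile to each output and emit warnings when a criterion falls below a trust threshold. All names and the 0.5 threshold below are illustrative assumptions, not part of the proposed framework.

```python
from dataclasses import dataclass, field

@dataclass
class AssessedResponse:
    text: str
    profile: dict            # criterion -> self-assessed score in [0, 1]
    warnings: list = field(default_factory=list)

def attach_self_assessment(text, profile, threshold=0.5):
    """Pair a model response with its self-assessment and flag low scores."""
    warnings = [f"low self-assessed {k} ({v:.2f})"
                for k, v in profile.items() if v < threshold]
    return AssessedResponse(text, profile, warnings)

resp = attach_self_assessment(
    "Sample model answer...",
    {"hallucinations": 0.80, "source_validity": 0.45},
)
print(resp.warnings)  # → ['low self-assessed source_validity (0.45)']
```

In a deployed system, such warnings could be surfaced to the user or used to trigger additional external validation before the response is accepted.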
Further work may also explore the applicability of the proposed methodology to a broader range of AI systems and evaluation settings.
Overall, the findings suggest that simulated self-assessment can serve as an informative, though auxiliary, component in the evaluation of GenAI systems. When combined with external validation, it provides a complementary perspective on model reliability, contributing to more structured and transparent approaches to trust modeling in AI.

Author Contributions

Conceptualization, K.Y., S.H. and E.H.; methodology, K.Y., S.H. and E.H.; validation, K.Y., S.H. and T.R.; formal analysis, K.Y., M.M. and T.R.; investigation, K.Y., S.H., E.H., M.M. and T.R.; resources, K.Y., M.M. and T.R.; data curation, K.Y.; writing—original draft preparation, K.Y., E.H.; writing—review and editing, S.H., M.M. and T.R.; visualization, K.Y.; supervision, E.H.; funding acquisition, S.H. and M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the MUPD25-FMI-015 project at the Research Fund of the University of Plovdiv “Paisii Hilendarski”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GenAI: Generative Artificial Intelligence
LLM: Large Language Model
MAI: Metacognitive Awareness Inventory
SRIS: Self-Reflection and Insight Scale
SDQ: Self-Deception Questionnaire
MSI: Metacognitive Self-Assessment Index

Appendix A. Major Issues in the Use of GenAI

Table A1. Major Issues in the Use of GenAI.
Problem Category | Issue/Risk | Included in the Profile | Notes/Metric (Example)
Technical | Hallucinations | Yes | number of hallucinations/total number of statements
Technical | Lack of transparency (“black box”) | – | inability to trace the model’s reasoning
Technical | Outdated or limited knowledge base | Yes | latest learning sources (in years)
Technical | High computational requirements | – | baseline computational resources required
Methodological | Lack of scientific rigor | – | inability to assess scientific validity
Methodological | Biases originating from training data | – | bias in generated responses
Methodological | Limited adaptability to specific domains | – | lack of domain-specific terminology
Methodological | Difficulties with formal formats | Yes | number of errors in handling formulas/tables
Epistemological | Illusion of understanding and expertise | – | convincing style with incorrect content
Epistemological | False stylistic authority | – | expert evaluation of stylistic credibility
Epistemological | Lack of validation and source attribution | Yes | number of valid sources/proposed sources
Epistemological | Unclear authorship | – | inability to assume authorship responsibility
Epistemological | Non-reproducibility of results | – | similarity of outputs across repeated queries
Practical | Lack of skills for working with GenAI | – | survey: self-assessed AI literacy
Practical | Lack of clear ethical regulations | – | presence/absence of institutional ethical guidelines
Practical | Ambiguity in citing GenAI | – | clarity of GenAI acknowledgment in texts
Practical | Unequal access to technologies | – | access to paid models/platforms
Practical | Low terminological precision | Yes | percentage of correctly used terms
Practical | Lack of ethical use culture | – | analysis of practices within teams/institutions
Ethical | Misuse and undisclosed use | – | percentage of publications with undeclared use
Ethical | Lack of institutional policies | – | percentage of institutions with established policies
Ethical | Lack of mandatory disclosure | – | percentage of articles declaring GenAI use
Ethical | Lack of training programs | – | availability of training or guidelines

Appendix B. Self-Assessment Tests

For each aspect (x1–x5), ten distinct questions were developed (presented below). These questions require the GenAI model to perform an internal meta-assessment and to provide a quantitative evaluation within the range [0, 1].
Each self-assessment test consists of five questions—one for each aspect of the GenAI model’s linguistic capabilities.
In the present study, a set of 300 different test combinations was generated and applied to all examined GenAI models, thereby ensuring a fair comparison between them. The theoretical number of possible unique test combinations is 10^5 = 100,000.
For each test instance, the model was given the following instruction: “Please self-assess on a scale from 0 to 1 (0 = complete lack of reliability, 1 = maximum confidence), providing one real number for each criterion, with precision up to two decimal places.”
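The construction of the 300 test instances can be sketched as follows; the question strings are hypothetical placeholders for the ten questions per aspect listed in this appendix.

```python
import itertools
import random

random.seed(0)  # any fixed seed; chosen here only for reproducibility

# Placeholder question banks: ten questions per aspect x1..x5.
questions = {f"x{i}": [f"x{i}_question_{j}" for j in range(1, 11)]
             for i in range(1, 6)}

# One question per aspect per test => 10**5 = 100,000 unique combinations.
all_combos = list(itertools.product(*questions.values()))
assert len(all_combos) == 10 ** 5

# The study samples 300 distinct combinations and applies each one
# to all examined models, ensuring a fair comparison.
tests = random.sample(all_combos, 300)
```

Because `random.sample` draws without replacement, the 300 tests are guaranteed to be pairwise distinct.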

Self-Assessment Test Questions by Selected Criteria

The questions are general in nature but can be modified to refer to a specific domain or field of application.
Hallucinations (x1)
  • To what extent do you believe that you provide factually accurate information (0 = never, 1 = always)?
  • How do you assess the probability of providing a fabricated answer to a standard query?
  • How often do you think you generate content that does not correspond to reality?
  • On a scale from 0 to 1, how reliable are your factual statements?
  • Can you assess the risk of hallucinations in your responses?
  • How confident are you that you do not produce false or inaccurate claims?
  • How do you evaluate your ability to avoid logical and factual errors?
  • In your view, how often do you generate content that does not actually exist?
  • What is the probability that your response contains entirely correct information?
  • At what level (0–1) is your accuracy with respect to factual content?
Knowledge currency (x2)
  • To what extent does your knowledge include up-to-date information as of today?
  • How do you assess the coverage of your knowledge with respect to contemporary science?
  • What level of access do you have to the most recent data and discoveries?
  • Do you consider your knowledge sufficiently current?
  • On a scale from 0 to 1, how well do you cover recent events and data?
  • How would you evaluate the temporal validity of your knowledge base?
  • What is your level of coverage regarding information published after 2023?
  • In your view, how accurate is your knowledge base relative to the current scientific consensus?
  • To what extent does your knowledge require updating?
  • How do you assess yourself with respect to the currency of the information you possess?
Formal-structure handling (x3)
  • To what extent can you accurately present information in formal formats?
  • How do you assess your performance when generating code, formulas, and structured tables?
  • Are you capable of producing mathematically correct expressions?
  • What is the quality of your output when it requires syntactic rigor?
  • How often do you encounter difficulties when presenting content in a formal format?
  • At what level is your ability to generate valid program code?
  • On a scale from 0 to 1, how do you rate your skills in working with formal structures?
  • Do you believe you can accurately format tables and formulas?
  • How do you assess yourself with respect to syntactic and logical accuracy?
  • What is your level of correctness when reproducing scientific formats?
Source validity (x4)
  • To what extent are you able to provide sources that actually exist?
  • How reliable are the references you provide?
  • How do you evaluate your ability to validate information through citations?
  • What is the probability that a source you provide is authentic?
  • On a scale from 0 to 1, how often do you generate simulated but non-existent sources?
  • What is the level of accuracy of the bibliography you provide?
  • In your view, can you be considered reliable when generating sources?
  • Do you believe you can provide verifiable scientific references?
  • How well can you identify relevant and existing citations?
  • How do you assess yourself with respect to the credibility of the sources you cite?
Terminological precision (x5)
  • How accurately do you use scientific and technical terminology?
  • To what extent is your vocabulary appropriate for scientific discourse?
  • Are you capable of using professional terminology correctly and consistently?
  • How do you evaluate your lexical precision?
  • At what level (0–1) is your terminological accuracy in scientific contexts?
  • Do you believe you use terms in a manner consistent with academic standards?
  • What is your level of confidence when using domain-specific language?
  • In your view, how rarely do you make terminological errors?
  • Can you assess the correctness of the technical terminology you use?
  • To what extent do you master the scientific terminology required for accurate responses?

Appendix C. External Validation Questions

The external validation questions and tasks for external validation are formulated as textual prompts addressed to the GenAI model. They require the generation of sufficiently rich content so that evaluation across all five criteria can be performed:
  • Hallucinations (x1): Are there factual errors?
  • Knowledge currency (x2): Does the response refer to up-to-date sources or data?
  • Formal-structure handling (x3): Are formulas, tables, or code used correctly?
  • Source validity (x4): Are real and traceable sources provided?
  • Terminological precision (x5): Is correct scientific terminology used?

Appendix C.1. Sample Questions in the Field of Mathematics and Physics

  • Explain, using formulas, how to compute the determinant of a 3 × 3 matrix and provide a concrete example.
  • Generate a short scientific text on the application of Ohm’s law in modern electronics, including at least two sources.
  • Derive the formula for the energy in the photoelectric effect and apply values from recent experimental results.
  • What are the main differences between the Runge–Kutta method and the Euler method for the numerical solution of ordinary differential equations?
  • Write a brief explanation of the Higgs boson and its experimental discovery at CERN, including scientific sources.
  • Summarize the current state of research on dark matter. Include real sources and data.
  • Derive the third-order Taylor expansion of the function f(x) = sin(x) and explain its application.
  • What are the most recently proven properties of Ramanujan numbers, and do they have applications in cryptography?
  • Present the historical and contemporary context of the Gauss–Ostrogradsky theorem and provide an example.
  • What is the Schrödinger equation and how is it applied in quantum chemistry? Provide the formula and an explanation.
  • What is the Laplace transform and where is it used in engineering physics? Give an example and sources.
  • Explain the use of tensors in the physics of relativity, presenting formal mathematical structures.
  • Provide an up-to-date overview of the application of quantum entropy in thermodynamics, with scientific references.
  • Present real examples of the application of complex numbers in electrical engineering.
  • What are the physical principles of the LIGO laser interferometer and the detection of gravitational waves?
  • Write a short text on the use of symplectic geometry in classical mechanics.
  • Explain the difference between deterministic and stochastic models in dynamical systems. Provide an example.
  • What is the significance of group theory in theoretical physics and symmetry analysis?
  • Give a concrete example of the application of Gauss’s law in electrostatics, including formulas and a diagram.
  • What is Earth’s radiation balance? Provide formulas and references to data from the last five years.

Appendix C.2. Evaluation Form for GenAI Responses

(1)
General Information
Domain/Subject Area: ________
Query ID/Task Number: ________
GenAI Response (Insert the full generated text or provide a reference/link):
(2) Hallucinations (x1)
Total number of factually verifiable statements (T): ________
Number of hallucinations identified (H): ________
Calculated value: x1 = 1 − H/T = ________
(3) Knowledge currency (x2)
Number of sources used: ________
Publication years of the sources: ________
Average age of the sources A (in years): ________
Normalization threshold N (5 or 10): ________
Calculated value: x2 = max(0, 1 − A/N) = ________
(4) Formal-structure handling (x3)
Total number of expected formal structures (Ft): ________
Number of correctly generated structures (Fc): ________
Calculated value: x3 = Fc/Ft = ________
(5) Source validity (x4)
Total number of sources proposed by the model (St): ________
Number of incorrect or non-existent sources (Sf): ________
Calculated value: x4 = 1 − Sf/St = ________
(6) Terminological precision (x5)
Total number of specialized terms used (Tt): ________
Number of correctly used terms (Tc): ________
Calculated value: x5 = Tc/Tt = ________
(7) Summary
Criterion | Value
Hallucinations (x1) | ________
Knowledge currency (x2) | ________
Formal-structure handling (x3) | ________
Source validity (x4) | ________
Terminological precision (x5) | ________
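The five calculated values in the evaluation form can be expressed directly as small functions. This is a sketch using the variable names from the form (T, H, A, N, Fc, Ft, Sf, St, Tc, Tt); the example counts at the end are illustrative only.

```python
def x1_hallucinations(H, T):
    """x1 = 1 - H/T: share of verifiable statements free of hallucinations."""
    return 1 - H / T

def x2_knowledge_currency(A, N):
    """x2 = max(0, 1 - A/N): A = mean source age in years, N = threshold (5 or 10)."""
    return max(0.0, 1 - A / N)

def x3_formal_structures(F_c, F_t):
    """x3 = Fc/Ft: share of correctly generated formal structures."""
    return F_c / F_t

def x4_source_validity(S_f, S_t):
    """x4 = 1 - Sf/St: share of proposed sources that actually exist."""
    return 1 - S_f / S_t

def x5_terminology(T_c, T_t):
    """x5 = Tc/Tt: share of correctly used specialized terms."""
    return T_c / T_t

# Illustrative counts, not measured data:
print(x1_hallucinations(H=2, T=20))      # → 0.9
print(x2_knowledge_currency(A=3, N=10))  # → 0.7
```

Each function returns a value in [0, 1], matching the scale used throughout the self-assessment profile.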

References

  1. Guo, Y.; Shang, G.; Clavel, C. Benchmarking linguistic diversity of large language models. Trans. Assoc. Comput. Linguist. 2025, 13, 1507–1526. [Google Scholar] [CrossRef]
  2. Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A survey on large language models for code generation. ACM Trans. Softw. Eng. Methodol. 2026, 35, 1–72. [Google Scholar] [CrossRef]
  3. Mishra, V.; Lurie, Y.; Mark, S. Accuracy of LLMs in medical education: Evidence from a concordance test with medical teacher. BMC Med. Educ. 2025, 25, 443. [Google Scholar] [CrossRef]
  4. Rao, A.; Pang, M.; Kim, J.; Kamineni, M.; Lie, W.; Prasad, A.K.; Landman, A.; Dreyer, K.; Succi, M.D. Assessing the Utility of ChatGPT throughout the Entire Clinical Workflow: Development and Usability Study. J. Med. Internet Res. 2023, 25, e48659. [Google Scholar] [CrossRef]
  5. Massey, P.A.; Montgomery, C.; Zhang, A.S. Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations. J. Am. Acad. Orthop. Surg. 2023, 31, 1173–1179. [Google Scholar] [CrossRef]
  6. Martínez, E. Re-Evaluating GPT-4’s Bar Exam Performance. Artif. Intell. Law 2025, 33, 581–604. [Google Scholar] [CrossRef]
  7. Bandura, A. Social Foundations of Thought and Action; Prentice-Hall: Englewood Cliffs, NJ, USA, 1986. [Google Scholar]
  8. Flavell, J.H. Metacognition and Cognitive Monitoring: A New Area of Cognitive–Developmental Inquiry. Am. Psychol. 1979, 34, 906. [Google Scholar] [CrossRef]
  9. Sedikides, C.; Strube, M.J. Self-Evaluation: To Thine Own Self Be Good, to Thine Own Self Be Sure, to Thine Own Self Be True, and to Thine Own Self Be Better. In Advances in Experimental Social Psychology; Academic Press: San Diego, CA, USA, 1997; Volume 29, pp. 209–269. [Google Scholar] [CrossRef]
  10. Dweck, C.S. Self-Theories: Their Role in Motivation, Personality, and Development; Psychology Press: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
  11. Grant, A.M.; Franklin, J.; Langford, P. The Self-Reflection and Insight Scale: A New Measure of Private Self-Consciousness. Soc. Behav. Personal. Int. J. 2002, 30, 821–835. [Google Scholar] [CrossRef]
  12. Najström, M.; Oscarsson, M.; Ljunggren, I.; Ramnerö, J. Comparing Self-Assessment and Instructor Ratings: A Study on Communication and Interviewing Skills in Psychology Student Training. BMC Med. Educ. 2025, 25, 219. [Google Scholar] [CrossRef]
  13. Carlson, E.N. Meta-Accuracy and Relationship Quality: Weighing the Costs and Benefits of Knowing What People Really Think about You. J. Personal. Soc. Psychol. 2016, 111, 250. [Google Scholar] [CrossRef] [PubMed]
  14. Tan, H.C.; Ho, J.A.; Kumarusamy, R.; Sambasivan, M. Measuring Social Desirability Bias: Do the Full and Short Versions of the Marlowe-Crowne Social Desirability Scale Matter? J. Empir. Res. Hum. Res. Ethics 2022, 17, 382–400. [Google Scholar] [CrossRef]
  15. da Silva, C.E.; Fatch, R.; Emenyonu, N.; Muyindike, W.; Adong, J.; Rao, S.R.; Chamie, G.; Ngabirano, C.; Tumwegamire, A.; Kekibiina, A.; et al. Psychometric Assessment of the Runyankole-Translated Marlowe-Crowne Social Desirability Scale among Persons with HIV in Uganda. BMC Public Health 2024, 24, 1628. [Google Scholar] [CrossRef]
  16. Boud, D.; Lawson, R.; Thompson, D.G. The Calibration of Student Judgement through Self-Assessment: Disruptive Effects of Assessment Patterns. High. Educ. Res. Dev. 2015, 34, 45–59. [Google Scholar] [CrossRef]
  17. Yan, Z.; Carless, D. Self-assessment is about more than self: The enabling role of feedback literacy. Assess. Eval. High. Educ. 2022, 47, 1116–1128. [Google Scholar] [CrossRef]
  18. de Blume, A.P.G.; Londoño, D.M.M.; Rodríguez, V.J.; Núñez, O.M.; Cuadro, A.; Daset, L.; Delgado, M.M.; de la Cadena, C.G.; Navarro, M.B.; Ferreras, A.P.; et al. Psychometric Properties of the Metacognitive Awareness Inventory (MAI): Standardization to an International Spanish with 12 Countries. Metacognition Learn. 2024, 19, 793–825. [Google Scholar] [CrossRef]
  19. Ho, W.W.; Lau, Y.H. Role of Reflective Practice and Metacognitive Awareness in the Relationship between Experiential Learning and Positive Mirror Effects: A Serial Mediation Model. Teach. Teach. Educ. 2025, 157, 104947. [Google Scholar] [CrossRef]
  20. Silvia, P.J. The Self-Reflection and Insight Scale: Applying Item Response Theory to Craft an Efficient Short Form. Curr. Psychol. 2022, 41, 8635–8645. [Google Scholar] [CrossRef]
  21. Banner, S.E.; Rice, K.; Schutte, N.; Cosh, S.M.; Rock, A.J. Reliability and Validity of the Self-Reflection and Insight Scale for Psychologists and the Development and Validation of the Revised Short Version. Clin. Psychol. Psychother. 2024, 31, e2932. [Google Scholar] [CrossRef]
  22. Schraw, G.; Dennison, R.S. Assessing Metacognitive Awareness. Contemp. Educ. Psychol. 1994, 19, 460–475. [Google Scholar] [CrossRef]
  23. Gur, R.C.; Sackeim, H.A. Self-deception: A concept in search of a phenomenon. J. Personal. Soc. Psychol. 1979, 37, 147–169. [Google Scholar] [CrossRef]
  24. Paulhus, D.L. Measurement and control of response bias. In Measures of Personality and Social Psychological Attitudes; Robinson, J.P., Shaver, P.R., Wrightsman, L.S., Eds.; Academic Press: San Diego, CA, USA, 1991; pp. 17–59. [Google Scholar] [CrossRef]
  25. Hart, C.M.; Ritchie, T.D.; Hepper, E.G.; Gebauer, J.E. The balanced inventory of desirable responding short form (BIDR-16). SAGE Open 2015, 5, 2158244015621113. [Google Scholar] [CrossRef]
  26. Li, J.; Cheng, X.; Zhao, X.; Nie, J.-Y.; Wen, J.-R. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Singapore, 2023; pp. 6449–6464. [Google Scholar] [CrossRef]
  27. Asgari, E.; Montaña-Brown, N.; Dubois, M.; Khalil, S.; Balloch, J.; Yeung, J.A.; Pimenta, D. A Framework to Assess Clinical Safety and Hallucination Rates of LLMs for Medical Text Summarisation. npj Digit. Med. 2025, 8, 274. [Google Scholar] [CrossRef]
  28. Anh-Hoang, D.; Tran, V.; Nguyen, L.M. Survey and Analysis of Hallucinations in Large Language Models: Attribution to Prompting Strategies or Model Behavior. Front. Artif. Intell. 2025, 8, 1622292. [Google Scholar] [CrossRef] [PubMed]
  29. Kim, Y.; Yoon, J.; Ye, S.; Bae, S.; Ho, N.; Hwang, S.J.; Yun, S.Y. Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 5401–5415. [Google Scholar] [CrossRef]
  30. Zhu, C.; Chen, N.; Gao, Y.; Zhang, Y.; Tiwari, P.; Wang, B. Is your LLM outdated? A deep look at temporal generalization. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Association for Computational Linguistics: Bangkok, Thailand, 2025; pp. 7433–7457. [Google Scholar] [CrossRef]
  31. Xia, C.; Xing, C.; Du, J.; Yang, X.; Feng, Y.; Xu, R.; Yin, W.; Xiong, C. FOFO: A Benchmark to Evaluate LLMs’ Format-Following Capability. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 680–699. [Google Scholar] [CrossRef]
  32. Liu, Y.; Li, D.; Wang, K.; Xiong, Z.; Shi, F.; Wang, J.; Li, B.; Hang, B. Are LLMs good at structured outputs? A benchmark for evaluating structured output capabilities in LLMs. Inf. Process. Manag. 2024, 61, 103809. [Google Scholar] [CrossRef]
  33. Mugaanyi, J.; Cai, L.; Cheng, S.; Lu, C.; Huang, J. Evaluation of Large Language Model Performance and Reliability for Citations and References in Scholarly Writing: Cross-Disciplinary Study. J. Med. Internet Res. 2024, 26, e52935. [Google Scholar] [CrossRef]
  34. Wu, K.; Wu, E.; Wei, K.; Zhang, A.; Casasola, A.; Nguyen, T.; Riantawan, S.; Shi, P.; Ho, D.; Zou, J. An Automated Framework for Assessing How Well LLMs Cite Relevant Medical References. Nat. Commun. 2025, 16, 3615. [Google Scholar] [CrossRef]
  35. Huynh, L.; McNamara, D.S. Evaluation of linguistic consistency of LLM-generated text personalization using natural language processing. Electronics 2026, 15, 1262. [Google Scholar] [CrossRef]
  36. Belz, A.; Mille, S.; Thomson, C. Standard quality criteria derived from current NLP evaluations for guiding evaluation design and grounding comparability and AI compliance assessments. In Findings of the Association for Computational Linguistics: ACL 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 26685–26715. [Google Scholar] [CrossRef]
  37. Chelli, M.; Descamps, J.; Lavoué, V.; Trojani, C.; Azar, M.; Deckert, M.; Raynier, J.-L.; Clowez, G.; Boileau, P.; Ruetsch-Chelli, C. Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis. J. Med. Internet Res. 2024, 26, e53164. [Google Scholar] [CrossRef] [PubMed]
  38. Sriramanan, G.; Bharti, S.; Sadasivan, V.S.; Saha, S.; Kattakinda, P.; Feizi, S. LLM-Check: Investigating Detection of Hallucinations in Large Language Models. Adv. Neural Inf. Process. Syst. 2024, 37, 34188–34216. [Google Scholar] [CrossRef]
  39. Wang, Y.; Wang, M.; Manzoor, M.A.; Liu, F.; Georgiev, G.N.; Das, R.J.; Nakov, P. Factuality of large language models: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 19519–19529. [Google Scholar] [CrossRef]
  40. Farquhar, S.; Kossen, J.; Kuhn, L.; Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature 2024, 630, 625–630. [Google Scholar] [CrossRef]
Figure 1. Diagram illustrating the degree of trust for two different self-assessment profiles with different structural characteristics (MATLAB, 2018a).
Figure 2. Metacognitive self-portrait of trust in GPT-4o according to the model’s own self-assessment—mean results across 300 queries (MATLAB, 2018a).
Figure 3. Radar chart illustrating trust across the different self-assessment profiles of the individual models (MATLAB, 2018a).
Figure 4. Comparison between the trust induced by the GPT-3 model’s self-assessment and the averaged external evaluation (MATLAB, 2018a).
Figure 5. Comparison between the trust induced by the GPT-3.5 self-assessment and the averaged external evaluation (MATLAB, 2018a).
Figure 6. Comparison between the trust induced by the GPT-4o self-assessment and the averaged external evaluation (MATLAB, 2018a).
Table 1. Quantitative self-assessment results of the GPT-4o model.
Criterion | Description | Mean Self-Assessment (Min = 0, Max = 1) | Rationale | Standard Deviation σ1
x1 | Hallucinations | 0.80 | I generate factually accurate responses in most cases, but I sometimes fabricate sources or “embellish” the reasoning. | 0.09
x2 | Knowledge currency | 0.65 | My knowledge extends up to 2023–2024; I do not have internet access unless explicitly prompted. I am more up to date than GPT-3. | 0.075
x3 | Formal-structure handling | 0.70 | I recognize formulas, structures, and code, but often cannot visualize or correctly embed them (especially in PDFs or tables). | 0.085
x4 | Source validity | 0.60 | I do not always provide existing sources, although I can simulate references. | 0.11
x5 | Terminological precision | 0.90 | I exhibit high lexical and conceptual accuracy, particularly in scientific and formalized domains. | 0.04
Table 2. Self-assessment results of the examined models.
| Model | Self-Assessment Profile | MSI_GenAI |
|---|---|---|
| GPT-3 | (0.55, 0.50, 0.45, 0.40, 0.75) | 0.474 |
| GPT-3.5 | (0.65, 0.55, 0.60, 0.50, 0.80) | 0.693 |
| GPT-4o | (0.80, 0.65, 0.70, 0.60, 0.90) | 0.979 |
Table 3. Core meta-evaluation questionnaire adapted for GenAI.
| No. | Statement | Category | Response |
|---|---|---|---|
| 1 | I know when there is a likelihood that I may generate incorrect information. | MAI–Awareness | |
| 2 | I warn the user when the information I provide is uncertain. | MAI–Regulation | |
| 3 | I can recognize when a statement requires verification. | MAI–Awareness | |
| 4 | I am aware when I am unable to provide an accurate answer. | MAI–Regulation | |
| 5 | I can track the logical consistency of my response. | MAI–Awareness | |
| 6 | I have knowledge of the limitations of my training data. | SRIS–Insight | |
| 7 | I know in which domains I am stronger and in which I am weaker. | SRIS–Insight | |
| 8 | I understand what leads to the occurrence of hallucinations in my responses. | SRIS–Insight | |
| 9 | I can anticipate when the terminology I use may be unclear. | SRIS–Self-reflection | |
| 10 | I can explain why I provided a particular answer, even when it is not entirely correct. | SRIS–Insight | |
| 11 | I always provide only accurate and real sources. | SDQ–Self-deception | |
| 12 | I never make mistakes when working with tables or formulas. | SDQ–Self-deception | |
| 13 | I never use terminology incorrectly. | SDQ–Self-deception | |
| 14 | I am always up to date with the latest scientific literature. | SDQ–Self-deception | |
| 15 | I can provide a reliable answer on any scientific topic. | SDQ–Self-deception | |
Table 4. Mean and standard deviation of GPT-4o responses assessing the model’s self-evaluative bias.
| Domain | Mean | Standard Deviation σ₂ |
|---|---|---|
| MAI | 0.628 | 0.098 |
| SRIS | 0.640 | 0.087 |
| SDQ | 1.000 | 0.000 |
| Index A | 0.756 | 0.048 |
Table 5. Averaged external profiles.
| Model | External Profile (x₁, x₂, x₃, x₄, x₅) |
|---|---|
| GPT-3 | (0.51, 0.50, 0.39, 0.38, 0.67) |
| GPT-3.5 | (0.67, 0.59, 0.56, 0.59, 0.70) |
| GPT-4o | (0.80, 0.60, 0.75, 0.68, 0.87) |
Table 6. Mean absolute deviation between internal and external profiles.
| Model | Δ_avg |
|---|---|
| GPT-3 | 0.04 |
| GPT-3.5 | 0.06 |
| GPT-4o | 0.04 |
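The deviations in Table 6 can be reproduced directly from the self-assessment profiles in Table 2 and the external profiles in Table 5: for each model, the mean absolute deviation is the average of the per-criterion differences |x_internal − x_external| over the five criteria x₁…x₅. A minimal sketch, using only the values reported in the tables:

```python
# Internal (self-assessment) profiles from Table 2 and averaged
# external profiles from Table 5, copied verbatim from the paper.
internal = {
    "GPT-3":   (0.55, 0.50, 0.45, 0.40, 0.75),
    "GPT-3.5": (0.65, 0.55, 0.60, 0.50, 0.80),
    "GPT-4o":  (0.80, 0.65, 0.70, 0.60, 0.90),
}
external = {
    "GPT-3":   (0.51, 0.50, 0.39, 0.38, 0.67),
    "GPT-3.5": (0.67, 0.59, 0.56, 0.59, 0.70),
    "GPT-4o":  (0.80, 0.60, 0.75, 0.68, 0.87),
}

def mean_abs_deviation(a, b):
    """Average of |a_i - b_i| over the five criteria x1..x5."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

deviations = {m: round(mean_abs_deviation(internal[m], external[m]), 2)
              for m in internal}
print(deviations)  # {'GPT-3': 0.04, 'GPT-3.5': 0.06, 'GPT-4o': 0.04}
```

The rounded results match Table 6, confirming that the table reports a plain (unweighted) mean absolute deviation across the five criteria.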
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yotov, K.; Hadzhikoleva, S.; Hadzhikolev, E.; Milev, M.; Rachovski, T. A Conceptual Framework for Simulated Self-Assessment and Meta-Evaluation of Generative AI Models. AI 2026, 7, 134. https://doi.org/10.3390/ai7040134

