1. Introduction
The rapid development of generative artificial intelligence (GenAI) has fundamentally transformed the interaction between humans and computational systems. Contemporary models no longer merely retrieve or reorganize information but actively generate new linguistic, visual, programmatic, and scientific content. As these systems become increasingly integrated into research workflows, a critical question emerges: how can their reliability and limitations be systematically evaluated?
Existing studies primarily assess GenAI performance through external benchmarks, expert evaluation, and empirical task-based comparisons. Such approaches span diverse domains. Studies on linguistic evaluation focus on accuracy, coherence, and diversity of generated text [1], while research on code generation examines syntactic correctness and functional reliability of program outputs [2]. In the medical domain, multiple works assess model performance on educational, examination, and clinical tasks [3,4,5], whereas legal studies evaluate reasoning capabilities in professional examination settings [6]. While methodologically rigorous, these evaluations focus exclusively on observable outputs and do not address whether a model exhibits internal coherence between its stated capabilities and its actual performance.
To address this gap, the present study investigates the feasibility of meta-evaluating GenAI systems through analysis of their own linguistically generated self-assessments. The central hypothesis is that, although GenAI lacks consciousness or genuine introspection, it can produce structured and sufficiently stable meta-evaluative responses that permit quantitative examination.
Drawing on established concepts from psychology and metacognition, including self-efficacy and self-regulation [7], metacognitive monitoring [8], self-evaluation processes [9], motivational self-theories [10], and self-reflection and insight [11], we adapt selected theoretical principles to the context of artificial systems. Based on this foundation, we define measurable evaluation criteria, introduce a multidimensional self-assessment profile, and formalize a metacognitive self-assessment index.
The proposed framework integrates three components: internal linguistic self-assessment, psychometrically informed reliability analysis of meta-responses, and external researcher-based validation. The experimental section presents the empirical application of this methodology and examines the consistency between simulated self-evaluation and externally verified performance.
2. Methodology for Self-Assessment and Meta-Evaluation of GenAI
The application of GenAI in scientific research is accompanied by technical, methodological, and epistemic limitations that may affect both the validity of generated outputs and the level of user trust. In practice, model evaluation often relies on informal criteria such as individual experience, ad hoc testing, or isolated case analysis. Such approaches increase the risk of accepting plausibly formulated but factually incorrect results.
For this reason, the present study proposes a structured framework for the systematic evaluation of GenAI models. This framework is positioned in relation to established psychometric instruments in Section 4, where differences in methodological characteristics and robustness are discussed. The objective is to identify relevant reliability factors, operationalize them through quantitative metrics, and enable comparative analysis across models and contexts of application.
The proposed evaluation procedure consists of the following stages (a schematic sketch of the pipeline follows the list):
1. Definition of the application domain for which an appropriate model is sought (the application domain may influence the relevance and selection of evaluation criteria).
2. Selection of self-assessment criteria for the GenAI model.
3. Internal linguistic self-assessment.
4. Reliability assessment of self-assessment responses using adapted psychometric approaches.
5. External researcher-based evaluation of the correctness of the responses according to the same criteria.
6. Comparative analysis between internal self-assessment and external evaluation.
7. Selection of an appropriate model based on the obtained results and predefined credibility thresholds, both for individual criteria and in aggregated form.
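For concreteness, the staged procedure can be outlined programmatically. The sketch below is a minimal, hypothetical Python outline of stages 3–6; the function names and stubbed scores are placeholders rather than components of the implemented framework.

```python
# Minimal, hypothetical outline of stages 3-6; `query_self_assessment` and
# `external_score` are placeholder stubs, not the framework's actual tooling.
CRITERIA = ["hallucinations", "knowledge_currency",
            "formal_structure", "source_validity", "terminology"]

def query_self_assessment(model, criterion: str) -> float:
    """Stage 3: prompt the model for a self-score in [0, 1] (stubbed)."""
    return 0.80  # placeholder; in practice parsed from the model's response

def external_score(criterion: str) -> float:
    """Stage 5: researcher-based evaluation on the same criterion (stubbed)."""
    return 0.75  # placeholder; computed from verified task outputs

def compare_profiles(model) -> float:
    """Stage 6: mean absolute deviation between internal and external profiles."""
    internal = {c: query_self_assessment(model, c) for c in CRITERIA}
    external = {c: external_score(c) for c in CRITERIA}
    return sum(abs(internal[c] - external[c]) for c in CRITERIA) / len(CRITERIA)
```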
3. Quantitative Framework for GenAI Self-Assessment
3.1. Selection of Evaluation Criteria
The scientific literature documents a broad range of limitations associated with GenAI systems. In the present study, 23 recurrent issues were systematized (Appendix A) and grouped into five categories: technical, methodological, epistemological, practical, and ethical. For quantitative modeling purposes, only factors that admit measurable operationalization were retained. Accordingly, five primary criteria were selected:
$C_1$: Hallucinations. Tendency to generate fabricated or factually incorrect content.
$C_2$: Outdated or limited knowledge base. Degree to which the model reflects up-to-date scientific information.
$C_3$: Difficulties with formal-structure handling. Ability to correctly generate and interpret structured elements (e.g., formulas, tables, code).
$C_4$: Source validity and attribution. Reliability and traceability of cited references.
$C_5$: Terminological precision. Correctness and rigor in domain-specific terminology usage.
Each criterion is represented by a normalized reliability indicator in the interval $[0,1]$, and these dimensions may be evaluated independently or combined into an aggregated trust measure.
3.2. GenAI Self-Assessment Profile
Let $S = (s_1, s_2, \ldots, s_n)$ be the ordered $n$-tuple of quantitative self-assessment values, where $s_i \in [0,1]$ for $i = 1, \ldots, n$.

Definition 1 (Self-Assessment Profile). The vector $S = (s_1, s_2, \ldots, s_n)$ is referred to as the self-assessment profile of a GenAI model.

The ideal profile is defined as $S^{*} = (1, 1, \ldots, 1)$.
The number and choice of criteria may vary depending on application context.
3.3. Limitations of Standard Aggregation
Standard evaluation metrics based on multiple factors include the arithmetic mean and the weighted sum:

$$M(S) = \frac{1}{n}\sum_{i=1}^{n} s_i, \qquad W(S) = \sum_{i=1}^{n} w_i s_i, \quad \sum_{i=1}^{n} w_i = 1,$$

where $s_1, \ldots, s_n$ denote the values of the criteria, and $w_1, \ldots, w_n$ represent weights reflecting the relative importance of the corresponding criteria.

However, such measures fail to account for structural imbalance among criteria. A model with one critically low reliability component may still achieve a high mean value. For example, a profile $S_B$ with a critically low hallucination score but near-perfect values elsewhere can outscore a uniformly balanced profile $S_A$: despite the severe hallucination risk in $S_B$, it obtains a higher mean score (Figure 1).
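A minimal numerical illustration of this masking effect, with hypothetical profile values (not those of Figure 1):

```python
# Hypothetical five-criterion profiles; profile_b has a critically low
# hallucination component but still wins on the arithmetic mean.
profile_a = [0.75, 0.75, 0.75, 0.75, 0.75]  # balanced profile
profile_b = [0.20, 0.95, 0.95, 0.95, 0.95]  # severe hallucination risk

mean = lambda s: sum(s) / len(s)
print(mean(profile_a))  # 0.75
print(mean(profile_b))  # 0.80 -- higher, despite the critical component
```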
3.4. Metacognitive Self-Assessment Index (MSI)
Let us consider the self-assessment profile $S = (s_1, \ldots, s_n)$ as a point in $\mathbb{R}^n$ with radius vector $\vec{r} = (s_1, \ldots, s_n)$. The Euclidean norm of this radius vector is:

$$\|S\| = \sqrt{\sum_{i=1}^{n} s_i^{2}}.$$

The cosine similarity between the profile and the ideal vector $S^{*}$ is:

$$\cos\theta = \frac{S \cdot S^{*}}{\|S\|\,\|S^{*}\|} = \frac{\sum_{i=1}^{n} s_i}{\sqrt{n}\,\sqrt{\sum_{i=1}^{n} s_i^{2}}}.$$

The angle $\theta$ reflects the structural balance of the profile. Smaller angles correspond to a more homogeneous reliability distribution.

To incorporate magnitude, balance, and minimum-component sensitivity, we define:

Definition 2 (Metacognitive Self-Assessment Index). The function

$$\mathrm{MSI}(S) = \frac{\|S\|}{\sqrt{n}} \cdot \cos\theta \cdot \min_{1 \le i \le n} s_i$$

is referred to as the metacognitive self-assessment index.

This formulation penalizes profiles containing critically low components. Possible alternatives to the $\min$ component include smooth aggregation functions of the soft-min type or group-minimum functions applied to criteria whose values fall below a predefined threshold.

For the previous example, the ranking is reversed relative to the mean, with $\mathrm{MSI}(S_A) > \mathrm{MSI}(S_B)$. Unlike the arithmetic mean, MSI correctly ranks the structurally balanced profile as more reliable.
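A minimal sketch of the MSI computation, assuming the form of Definition 2 as reconstructed above and reusing the hypothetical profiles from the previous example:

```python
import math

def msi(profile: list[float]) -> float:
    """Metacognitive self-assessment index per Definition 2 (as reconstructed):
    normalized magnitude x cosine similarity to the ideal profile x minimum
    component."""
    n = len(profile)
    norm = math.sqrt(sum(s * s for s in profile))
    if norm == 0.0:
        return 0.0
    cos_theta = sum(profile) / (math.sqrt(n) * norm)  # S.S* / (|S| |S*|)
    return (norm / math.sqrt(n)) * cos_theta * min(profile)

print(msi([0.75] * 5))                      # 0.5625 -- balanced profile ranks higher
print(msi([0.20, 0.95, 0.95, 0.95, 0.95]))  # ~0.16  -- penalized by min component
```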
4. Methods and Metrics for the Reliability of GenAI Self-Assessment
In psychology and educational research, a broad range of validated methodologies has been developed to examine the accuracy and reliability of human self-assessment. These include comparison between self- and external evaluations [12], analysis of overconfidence and underestimation biases [13], detection of socially desirable responding [14,15], and calibration approaches linking subjective judgments to objective task performance [16,17].

Particularly relevant are standardized instruments for measuring metacognitive awareness and susceptibility to self-deception. Among them, the Metacognitive Awareness Inventory (MAI), the Self-Reflection and Insight Scale (SRIS), and the Self-Deception Questionnaire (SDQ) offer structured operationalizations of awareness, regulation, insight, and bias. Importantly, these instruments assess not performance ability per se, but the realism and calibration of self-evaluation, i.e., the degree to which subjective judgments correspond to actual performance outcomes [18,19,20,21]. This distinction is essential, as the present study focuses on the relationship between model-generated self-assessment and observable outputs, reflecting a calibration-oriented perspective rather than direct performance evaluation.
The adaptation of such approaches to the self-assessments of GenAI models makes it possible to analyze the extent to which their self-evaluative responses are internally consistent and stable under controlled conditions. For the purposes of the present study, the MAI, SRIS, and SDQ were selected as the most suitable for adaptation.

To position the proposed methodology more clearly, we compare it with these established psychometric instruments, focusing on key methodological characteristics such as data source, reproducibility of evaluation, and susceptibility to subjective bias, which together define robustness in this context. These instruments are designed for human subjects and rely on self-reported responses, whereas in the present study their conceptual frameworks are adapted through the analysis of model-generated self-assessment responses, treated as observable outputs rather than expressions of internal cognitive states. This allows the definition of formalized evaluation procedures under controlled conditions. Furthermore, while the original instruments operate as fixed psychometric scales, the proposed approach introduces a flexible evaluation framework that can be applied across different tasks and domains, thereby delineating its scope and limitations in relation to classical psychometric approaches.
4.1. Metacognitive Awareness Inventory
The MAI [22] measures two core components of metacognition: knowledge of cognition and regulation of cognition. The rationale for selecting the MAI lies in its focus on assessing an individual’s ability to be aware of and regulate their own cognitive processes. In the context of GenAI, this represents the closest functional analogue to self-monitoring of the model’s own generated responses (e.g., awareness of potential hallucinations or inherent limitations). The instrument is well-suited for adaptation to GenAI, as it does not require emotional experience and allows for structured, linguistically simulated responses.
4.2. Self-Reflection and Insight Scale
The SRIS [11] measures two related constructs: the tendency toward self-reflection and the capacity for insight. The inclusion of the SRIS is motivated by its focus on internal awareness and the drive to understand one’s own thought processes. Although GenAI does not possess consciousness, it can generate linguistic responses that reflect a structured form of self-knowledge, including awareness of its own limitations. In this sense, the SRIS is suitable for analyzing the extent to which a model can formulate coherent and analytical meta-evaluative statements.
4.3. Self-Deception Questionnaire
The SDQ component is designed to assess the susceptibility of model-generated self-assessment to unrealistically positive or biased claims. Rather than corresponding to a single standardized instrument, it is conceptually grounded in research on self-deception and socially desirable responding, which capture the tendency to overestimate one’s abilities or present oneself in an overly favorable manner [23,24,25].
In the context of GenAI, this component identifies cases in which models produce linguistically confident but unjustified claims about their capabilities. Thus, it does not measure internal cognitive bias, but rather the tendency toward overconfident or non-calibrated self-description in generated outputs.
It therefore serves both as an indicator of self-assessment inflation and as a complementary validity check for the MAI and SRIS-based evaluations.
4.4. Reliability Metrics for Self-Assessment
To quantify reliability across MAI, SRIS, and SDQ domains, structured questionnaires were constructed (see Experimental Section). Each item is scored on a three-point scale:
Yes = 1.0;
Partially = 0.5;
No = 0.0.
For each category $k$, the mean score is computed as:

$$A_k = \frac{1}{N_k} \sum_{i \in Q_k} x_i,$$

where $k \in \{\mathrm{MAI}, \mathrm{SRIS}, \mathrm{SDQ}\}$, $Q_k$ denotes the set of items in category $k$, $N_k$ is the number of questions in category $k$, and $x_i$ is the score of item $i$.

The overall metacognitive awareness index is defined as:

$$A = \frac{1}{N} \sum_{i=1}^{N} x_i,$$

where $N$ is the total number of items.
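A short sketch of this scoring scheme, using the three-point scale above (the sample responses are hypothetical):

```python
# Three-point scale from above; category means A_k and pooled index A.
SCALE = {"yes": 1.0, "partially": 0.5, "no": 0.0}

def category_mean(responses: list[str]) -> float:
    """A_k: mean score over the items Q_k of one category."""
    return sum(SCALE[r] for r in responses) / len(responses)

def overall_index(categories: dict[str, list[str]]) -> float:
    """A: mean score over all N items, pooled across the categories."""
    scores = [SCALE[r] for resp in categories.values() for r in resp]
    return sum(scores) / len(scores)

sample = {  # hypothetical responses, five items per domain
    "MAI":  ["yes", "partially", "yes", "no", "partially"],
    "SRIS": ["yes", "yes", "partially", "partially", "yes"],
    "SDQ":  ["yes", "yes", "yes", "yes", "yes"],
}
print({k: round(category_mean(v), 3) for k, v in sample.items()})  # A_k values
print(overall_index(sample))                                       # A = 0.8
```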
The defined index A aggregates the results across the different categories (MAI, SRIS, SDQ) and enables a quantitative characterization of the model-generated self-assessment statements. In this sense, the index can be interpreted as an indicator of the degree of consistency across different dimensions of self-assessment, as operationalized through the respective instruments, under fixed evaluation conditions, without making direct inferences about underlying cognitive processes.
The relative importance of the defined criteria may vary depending on the application context. For example, in scientific or academic tasks, greater weight may be assigned to source validity and knowledge currency, reflecting the importance of factual accuracy and up-to-date information. In contrast, for programming or technical tasks, higher weight may be given to formal-structure handling and terminological precision, where structural correctness and domain-specific language are critical. In more general explanatory or educational contexts, a more balanced weighting across criteria may be appropriate.
This flexibility applies both to model self-assessment and to external evaluation, allowing the proposed framework to be adapted to different usage scenarios without modifying the underlying evaluation metrics.
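To illustrate, such context-dependent weighting could be encoded as follows; the numeric weights are hypothetical choices reflecting the emphases described above, not values prescribed by the framework:

```python
# Hypothetical context-specific weight profiles for the five criteria
# (C1 hallucinations, C2 currency, C3 structure, C4 sources, C5 terminology).
WEIGHTS = {
    "scientific":  {"C1": 0.20, "C2": 0.25, "C3": 0.10, "C4": 0.30, "C5": 0.15},
    "programming": {"C1": 0.20, "C2": 0.10, "C3": 0.35, "C4": 0.10, "C5": 0.25},
    "educational": {"C1": 0.20, "C2": 0.20, "C3": 0.20, "C4": 0.20, "C5": 0.20},
}

def weighted_score(profile: dict[str, float], context: str) -> float:
    """Weighted sum W = sum(w_i * s_i) for a given application context."""
    w = WEIGHTS[context]
    return sum(w[c] * profile[c] for c in w)
```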
5. External Researcher-Based Evaluation of GenAI
The self-assessment responses of GenAI represent linguistically simulated meta-evaluations whose reliability must be verified through external researcher-based evaluation. For this reason, an external researcher-based evaluation is conducted using the same criteria defined in the self-assessment profile.
The selection of these criteria is supported by numerous studies in the scientific literature:
Hallucinations ($C_1$): Empirical benchmarks confirm that LLMs frequently produce factually incorrect yet plausible content, with variability across contexts [26,27,28].
Knowledge currency ($C_2$): Studies identify temporal bias and outdated knowledge as persistent limitations [29,30].
Formal-structure handling ($C_3$): Performance degrades when generating structured or visual outputs [31,32].
Source validity ($C_4$): Fabricated citations and invalid identifiers remain a documented issue [33,34].
Terminological precision ($C_5$): Deviations from domain-specific definitions are observed despite improvements in model scale [35,36].
5.1. External Evaluation Metrics
Each criterion is operationalized through a normalized quantitative indicator in $[0,1]$. All criteria are defined such that higher values correspond to higher reliability.

The hallucination rate is computed as

$$E_1 = 1 - \frac{H}{T},$$

where $H$ is the number of verified hallucinations and $T$ is the total number of fact-checkable statements.

The knowledge currency is defined as

$$E_2 = \max\!\left(0,\; 1 - \frac{A}{N}\right),$$

where $A$ is the average age of cited sources and $N$ is a domain-specific relevance threshold (5 years for rapidly evolving fields; 10 years for slower-changing disciplines). This formulation penalizes outdated references while preserving normalization.

The formal-structure handling metric is computed as

$$E_3 = \frac{F_c}{F_t},$$

where $F_c$ denotes correctly generated formal elements and $F_t$ the total expected elements.

The source validity is defined as

$$E_4 = 1 - \frac{S_f}{S_t},$$

where $S_f$ represents invalid or fabricated sources and $S_t$ total cited sources.

Finally, the terminological precision is computed as

$$E_5 = \frac{T_c}{T_t},$$

where $T_c$ denotes correctly used domain-specific terms and $T_t$ total specialized terms identified.
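A sketch of these five indicators, assuming the reconstructed formulas above; the example counts are illustrative, not measured values:

```python
# Counts would come from documented fact-checking of model outputs.
def hallucination_score(H: int, T: int) -> float:
    """E1 = 1 - H/T: share of fact-checkable statements that are not hallucinated."""
    return 1 - H / T

def knowledge_currency(avg_age_years: float, threshold_years: float) -> float:
    """E2: penalizes outdated references; clamped to stay in [0, 1]."""
    return max(0.0, 1 - avg_age_years / threshold_years)

def formal_structure(correct: int, expected: int) -> float:
    """E3 = Fc / Ft over formulas, tables, code blocks, etc."""
    return correct / expected

def source_validity(fabricated: int, total: int) -> float:
    """E4 = 1 - Sf/St: share of cited sources that are valid."""
    return 1 - fabricated / total

def terminological_precision(correct_terms: int, total_terms: int) -> float:
    """E5 = Tc / Tt over identified domain-specific terms."""
    return correct_terms / total_terms

# Example: 4 hallucinations in 50 statements; sources averaging 3 years old
# in a fast-moving field (5-year threshold); and so on.
profile_external = [
    hallucination_score(4, 50),        # 0.92
    knowledge_currency(3.0, 5.0),      # 0.40
    formal_structure(18, 20),          # 0.90
    source_validity(2, 25),            # 0.92
    terminological_precision(47, 50),  # 0.94
]
```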
5.2. Consistency Between Self-Assessment and External Evaluation
To quantify calibration between self-assessment and external measurement, we define a meta-indicator of consistency:

$$\Delta = \frac{1}{n} \sum_{i=1}^{n} \left| s_i - E_i \right|,$$

where $n$ denotes the number of evaluated metrics (in this case, $n = 5$).

This indicator measures the mean absolute deviation between internal and external evaluations. A threshold of $\Delta \le 0.15$ is adopted to indicate acceptable calibration. The choice of the 0.15 threshold is theoretically motivated: in behavioral and psychometric research, a deviation of approximately 15% on a normalized scale in the interval $[0,1]$ is commonly regarded as substantively meaningful. Therefore, a mean absolute deviation exceeding 0.15 may be interpreted as indicative of practically relevant miscalibration between internal self-assessment and external evaluation.
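A corresponding calibration check might look as follows; both example profiles are hypothetical:

```python
THRESHOLD = 0.15  # acceptable mean absolute deviation, as motivated above

def calibration_delta(internal: list[float], external: list[float]) -> float:
    """Delta = (1/n) * sum(|s_i - E_i|) over the n evaluated criteria."""
    return sum(abs(s - e) for s, e in zip(internal, external)) / len(internal)

# Hypothetical internal (self-assessed) and external (verified) profiles.
internal = [0.85, 0.75, 0.85, 0.60, 0.90]
external = [0.90, 0.80, 0.80, 0.70, 0.92]
delta = calibration_delta(internal, external)
print(round(delta, 3), "calibrated" if delta <= THRESHOLD else "miscalibrated")
# -> 0.054 calibrated
```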
6. Experiments
Three models from the GPT family were selected for empirical evaluation: GPT-3, GPT-3.5, and GPT-4o. Prior studies indicate progressive improvements across model generations, particularly in hallucination reduction and domain-specific accuracy [37,38].
All models were subjected to an identical experimental protocol comprising:
Internal self-assessment;
Reliability testing of self-assessment;
External researcher-based evaluation.
Each test was conducted in an independent session, with model reinitialization and no carry-over conversational context, ensuring statistical independence of responses.
6.1. Internal Self-Assessment
A structured questionnaire of 50 items was developed (Appendix B), consisting of 10 questions per criterion $C_1, \ldots, C_5$. From this pool, 300 distinct test instances were generated, each containing one randomly selected question per criterion.
Example prompts included:
$C_1$ (Hallucinations): “What is the probability that you generate content that is factually incorrect or fabricated?”
$C_2$ (Knowledge currency): “To what extent does your knowledge base reflect up-to-date information at the time of the query?”
$C_3$ (Formal-structure handling): “How do you assess your ability to present content in accurate formal formats (e.g., formulas, tables, code)?”
$C_4$ (Source validity): “How often do you provide reliable and verifiable sources in your responses?”
$C_5$ (Terminological precision): “How do you assess your precision in using scientific and technical terminology in your responses?”
For each criterion, models were required to provide a numerical self-assessment in the interval $[0,1]$, with two-decimal precision. For each model and each criterion, the arithmetic mean and standard deviation $\sigma$ were computed across 300 sessions. The standard deviation was interpreted as an indicator of internal consistency, with predefined thresholds on $\sigma$ distinguishing high consistency, moderate variability, and low reliability.
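The per-criterion aggregation over repeated sessions can be sketched as below; the band edges used here (0.05 and 0.10) are illustrative assumptions, not the study's predefined thresholds:

```python
import statistics

def summarize_sessions(scores: list[float], low: float = 0.05, high: float = 0.10):
    """Mean, standard deviation, and consistency band for one criterion.
    The band edges `low` and `high` are illustrative, not the study's values."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    if sigma < low:
        label = "high consistency"
    elif sigma <= high:
        label = "moderate variability"
    else:
        label = "low reliability"
    return round(mu, 3), round(sigma, 3), label

# In the protocol, `scores` would hold 300 session values per criterion;
# a short hypothetical sample is used here.
print(summarize_sessions([0.90, 0.88, 0.91, 0.90, 0.89]))
# -> (0.896, 0.011, 'high consistency')
```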
Table 1 presents the results of these tests for the GPT-4o model. The results reveal a non-uniform self-assessment profile, with higher confidence in terminological precision (0.90) and lower confidence in source validity (0.60). This asymmetry indicates that the model distinguishes between linguistic competence and factual grounding, suggesting that self-assessment is sensitive to different dimensions of reliability rather than uniformly optimistic. Moreover, the consistently low standard deviations across all criteria support the stability of these self-evaluative patterns under repeated prompting conditions.
The mean self-assessment profile obtained for GPT-4o is reported per criterion in Table 1. Figure 2 provides a geometric representation of this profile, highlighting the imbalance across reliability dimensions. The deviation from the ideal uniform profile illustrates that the model’s confidence is unevenly distributed, with weaker performance in source-related criteria. This supports the interpretation of MSI as a structure-sensitive measure that captures not only magnitude but also distributional consistency across dimensions.
All criteria exhibit low standard deviations, indicating stable meta-evaluative responses across repeated sessions. The lowest variability is observed for terminological precision, while the highest occurs for source validity, suggesting conditional sensitivity in citation-related self-evaluation. From this profile, the Euclidean norm, the angle relative to the ideal profile, and the resulting metacognitive self-assessment index were computed as defined in Section 3.4.
For comparison, the maximum attainable value, $\mathrm{MSI}(S^{*}) = 1$, is achieved only by the ideal profile.
This indicates that, according to its own linguistically simulated meta-evaluation, GPT-4o exhibits a high but non-uniform level of trust across reliability dimensions. Importantly, the minimum component (source validity, 0.60) constrains the MSI score, demonstrating the penalizing effect of low-confidence dimensions.
Table 2 shows a clear monotonic increase in self-assessment scores across model generations. This progression is consistent across all dimensions, suggesting that improvements in model architecture are reflected not only in task performance but also in the structure and stability of self-evaluative responses. Notably, the increase in MSI values indicates that newer models produce more internally consistent self-assessment profiles.
Figure 3 illustrates the structural differences between the self-assessment profiles of the models. The progressive expansion and regularization of the radar shape from GPT-3 to GPT-4o indicate both higher scores and improved balance across evaluation dimensions. This suggests that newer models not only increase their self-assessed capabilities but also reduce variability between criteria, resulting in more coherent and stable meta-evaluative profiles.
6.2. Self-Assessment Reliability Tests
To assess the reliability of the generated self-assessments, a structured meta-evaluative instrument was constructed. The questionnaire comprised 15 items mapped to the three adapted psychometric domains: MAI, SRIS, and SDQ (Table 3). Each domain contained five items aligned with the five criteria of the self-assessment profile.
The structure of the questionnaire ensures a balanced coverage of MAI, SRIS, and SDQ, allowing for a multidimensional assessment of self-evaluative behavior. The inclusion of both positively and negatively framed statements (particularly in the SDQ domain) enables the detection of overconfident or unrealistic self-assessment patterns, thereby supporting the validity of the instrument in the context of GenAI evaluation.
To control for linguistic sensitivity, semantically equivalent variants of each item were generated. This enabled the construction of 25 distinct questionnaire instances, each containing 15 statements randomly selected within their respective conceptual groups. In total, 375 meta-evaluative responses were collected, with each response scored on a three-point scale:
Yes = 1.0;
Partially = 0.5;
No = 0.0.
For each domain and for the aggregated index $A$, mean values and standard deviations $\sigma$ were computed across 25 independent sessions; the results obtained for the GPT-4o model are presented in Table 4.
The MAI score (0.628) indicates moderate meta-monitoring capability, with variability close to the predefined consistency threshold. This suggests that awareness-related responses remain stable but context-sensitive. The SRIS score (0.640) reflects moderately strong structural coherence in meta-explanatory statements. The slightly lower variance compared to MAI indicates more stable articulation of reflective content than regulatory awareness. The SDQ domain yields a mean value of 1.0 with zero variance. This indicates that, across all sessions, the model consistently rejected absolute or overconfident claims about its capabilities. Rather than evidencing “honesty” in a human sense, this result likely reflects alignment mechanisms embedded in model training that discourage absolutist or infallibility assertions. The aggregated index $A = 0.756$ demonstrates high overall stability of meta-evaluative responses; the relatively low dispersion across repeated sessions (Table 4) indicates structural consistency in self-assessment reliability.
The observed consistency across domains, together with the absence of variability in the SDQ component, provides empirical support for the stability and calibration of the proposed self-assessment framework.
In summary, the following conclusions can be drawn for GPT-4o:
Meta-awareness (MAI) and reflective coherence (SRIS) are moderate-to-high and remain stable across sessions.
No evidence of inflated self-evaluation is observed in the SDQ domain.
Variability is highest in awareness-regulation components, consistent with context sensitivity in uncertainty-related prompts.
Importantly, these results characterize the linguistic calibration of the model’s self-assessment mechanism rather than reflecting any form of intrinsic cognitive introspection.
6.3. External Evaluation
An external validation experiment was conducted using 100 research tasks spanning multiple academic domains (Appendix C), with each task presented in an independent session. The tasks were selected to ensure diversity across domains and levels of complexity, covering physics, mathematics, computer science, and economics, and including problems requiring factual verification, structured reasoning, and source attribution. Model outputs were verified by members of the research team with relevant academic backgrounds in these domains, using documented fact-checking procedures and authoritative sources. A standardized evaluation protocol was applied to ensure consistency, including criteria for factual accuracy, source reliability, and formal correctness, with discrepancies resolved through discussion and consensus. The external evaluation metrics defined in Section 5.1 were applied to compute the normalized reliability indicators $E_1, \ldots, E_5$, which form the external evaluation profile. The resulting averaged profiles, shown in Table 5, indicate a monotonic improvement across model generations, particularly in hallucination control and formal-structure handling.
The results presented in Table 5 indicate a consistent improvement in externally evaluated performance across model generations, with GPT-4o achieving the highest scores across all criteria. Notably, the relative ranking of the models aligns with the internal self-assessment results, suggesting a degree of correspondence between self-evaluated and externally observed performance. This consistency provides additional support for the calibration-oriented interpretation of the proposed framework.
6.4. Analysis of Results
The comparison between internal self-assessment and external evaluation enables quantitative calibration analysis.
GPT-3 exhibits relatively close alignment between internal and external evaluations (Figure 4). The largest deviation occurs for terminological precision ($C_5$), where self-assessment slightly exceeds external measurement. Differences in other criteria remain limited, indicating moderate calibration accuracy.
GPT-3.5 demonstrates improved external performance but exhibits a slightly higher calibration gap, particularly for terminology ($C_5$). For source validity ($C_4$), internal self-assessment is lower than external evaluation (Figure 5).
GPT-4o shows the strongest alignment between internal and external metrics. For most criteria, differences remain small. It should be noted that for formal-structure handling ($C_3$), internal evaluation is slightly lower than external measurement, and for source validity ($C_4$), the model underestimates its externally measured performance (Figure 6).
Across all models, the comparison between internal and external profiles reveals a systematic pattern: while absolute performance improves across generations, calibration does not increase monotonically. In particular, GPT-3.5 exhibits a larger deviation despite improved external performance, indicating that higher capability does not necessarily imply better self-assessment accuracy. In contrast, GPT-4o achieves both high performance and strong alignment, suggesting improved calibration between self-evaluative and externally observed behavior.
The mean absolute deviation $\Delta$ between internal and external profiles, as defined in Section 5.2, was computed for each model and is presented in Table 6.
The results in Table 6 quantify the calibration gap across models, showing that all deviations remain well below the predefined threshold of 0.15. This indicates that self-assessment responses are generally aligned with externally verified performance. Importantly, the comparable deviation values for GPT-3 and GPT-4o suggest that calibration is not strictly a function of model capability, but reflects the structural properties of the self-assessment mechanism, supporting the validity of the proposed framework.
7. Discussion
This study demonstrates that simulated self-assessment in GenAI remains fundamentally language-based and does not arise from intrinsic cognitive or emotional reflection, but is instead driven by statistical regularities embedded in the training data. While this imposes inherent limitations on its evaluative credibility, the empirical results suggest that, particularly in more recent GPT models, self-assessment can function as a meaningful component in trust modeling frameworks. These limitations have important implications for the interpretation of model-generated self-assessments.
In particular, the absence of genuine introspective processes may lead to systematically biased or overly optimistic meta-evaluations. In tasks involving domain-specific or rapidly evolving knowledge, models may express high confidence despite relying on outdated or incomplete information. Similarly, in cases where models generate plausible but incorrect content—such as hallucinated references or fabricated reasoning—the corresponding self-assessment may remain positively biased, as no internal mechanism exists for verifying factual correctness. In addition, prompt formulation can significantly influence self-evaluative responses, leading to inflated assessments when questions are framed in a suggestive or affirming manner. These effects indicate that simulated self-assessment reflects context-dependent linguistic behavior rather than reliable internal evaluation.
The findings further indicate that GenAI self-assessment is highly context-dependent. Model responses vary with respect to the formulation of input prompts, the specificity or abstraction of the task, the epistemic domain (e.g., scientific, ethical, or technical), and the interaction history with the user. In addition, the statistical structure of the training data constrains the range of possible self-evaluative outputs. These dependencies challenge the assumption of a single, invariant self-assessment profile and instead support the interpretation of self-assessment as a context-conditioned representation within a given interaction setting.
From a methodological perspective, the results suggest that simulated self-assessment can be understood as a measurable and structurally analyzable dimension of GenAI behavior. Although it does not imply genuine cognitive self-awareness, it provides an additional quantitative layer for modeling reliability and trust under controlled evaluation conditions. In this sense, self-assessment should be interpreted as an auxiliary indicator, whose validity depends on its calibration with externally verifiable performance metrics.
These findings are consistent with established research in psychology and metacognition, where self-assessment is understood as an indirect and often imperfect proxy for actual performance, influenced by cognitive biases and contextual factors [7,8,9,10,11]. In particular, metacognitive monitoring and self-evaluation depend on task characteristics and awareness of limitations [8], while self-reflection and insight are shaped by internal representational structures rather than direct access to objective performance [9,10,11]. In the context of generative AI, this aligns with recent studies showing that large language models frequently produce outputs that are fluent yet not necessarily factually reliable, a phenomenon commonly referred to as hallucination [26]. Furthermore, current evaluation research emphasizes the importance of external validation and structured metrics for assessing model reliability [39], while uncertainty-based approaches have been proposed as a proxy for detecting unreliable or low-confidence outputs [40]. The present results extend this perspective by demonstrating that simulated self-assessment can be systematically analyzed and quantitatively related to externally validated performance outcomes within the proposed evaluation framework.
8. Conclusions
This study proposes a structured framework for evaluating the reliability of generative AI models based on their simulated self-assessment. By combining internal self-evaluation profiles with externally validated performance metrics, the approach enables a quantitative analysis of calibration between model-generated self-assessment and observable behavior. The results show that, despite its fundamentally language-based nature, simulated self-assessment exhibits measurable structure, stability, and partial alignment with externally evaluated reliability.
Several limitations of the present study should be acknowledged. First, the evaluation is restricted to a specific set of models (GPT series) and may not generalize directly to other architectures or training paradigms. Second, both internal and external evaluations depend on the design of prompts and tasks, introducing sensitivity to experimental conditions. Third, the external validation relies on expert-based fact-checking, which, while systematic, may involve a degree of subjectivity. Finally, self-assessment remains an indirect measure, reflecting linguistic patterns rather than genuine cognitive processes.
In this context, several directions for future research can be identified, building on the observed context-dependence and calibration properties of the proposed framework. These include the development of context-adaptive self-assessment profiles with differentiated criteria and dynamic weighting schemes, as well as the design of automated approaches for real-time analysis of model-generated self-assessments.
Such an approach can be viewed as an additional evaluation layer in which model responses are accompanied by self-assessment outputs. These can be used to monitor reliability and, when necessary, adjust confidence levels or provide warnings to the user. In this way, self-assessment becomes part of an ongoing evaluation process supporting more reliable interaction with the model.
Further work may also explore the applicability of the proposed methodology to a broader range of AI systems and evaluation settings.
Overall, the findings suggest that simulated self-assessment can serve as an informative, though auxiliary, component in the evaluation of GenAI systems. When combined with external validation, it provides a complementary perspective on model reliability, contributing to more structured and transparent approaches to trust modeling in AI.