1. Introduction
With the rapid progress of artificial intelligence, automated evaluation methods based on natural language processing and deep learning have been widely adopted across various domains [1]. These methods typically rely on large-scale data modeling and end-to-end generation, enabling systems to produce comprehensive assessments from heterogeneous and unstructured information. Despite these advantages in automation and efficiency, current evaluation systems still face several critical challenges. First, most existing methods operate under a single-step, passive inference paradigm, generating assessment outputs directly from inputs without revealing intermediate reasoning processes, resulting in limited interpretability. Second, when confronted with incomplete, ambiguous, or conflicting information, passive generation models lack mechanisms to actively obtain additional evidence or adjust their inference trajectory, making them susceptible to noise, logical inconsistencies, and cumulative errors in complex evaluation tasks. Moreover, in scenarios where sensitive attributes or latent sources of bias exist, one-shot inference is insufficient for effective bias detection and mitigation, thereby compromising the reliability and fairness of evaluation outcomes [2,3].
Against this backdrop, active reasoning has emerged as a promising paradigm for enhancing the transparency, controllability, and trustworthiness of AI-driven evaluation. In contrast to passive generative approaches, active reasoning enables a model to take deliberate actions during inference, such as retrieving supplemental information, generating intermediate hypotheses, performing consistency checks, or soliciting human feedback when necessary, based on its current uncertainty and knowledge state. Recent advances in multi-step reasoning, retrieval-augmented generation, and free-energy-based decision mechanisms have demonstrated significant improvements in logical coherence, robustness, and interpretability across many tasks, establishing a solid methodological foundation for building more trustworthy evaluation frameworks [4]. Furthermore, the development of metaverse-related technologies has accelerated multimodal semantic fusion within intelligent sensing architectures, enabling richer representations and deeper understanding of complex information environments [5,6].
Artificial intelligence (AI) technologies, particularly deep learning-based natural language processing (NLP) and machine learning (ML) methods, have been widely applied in the evaluation and analysis of various groups, including talents, students, employees, and others. These technologies mainly focus on automating the analysis of large datasets, mining potential patterns and features, and assisting in evaluation tasks. Specifically, language models based on Transformer architectures (e.g., GPT variants) demonstrate strong capabilities in text understanding and generation, efficiently handling complex textual information and generating valuable reports. Variational autoencoders (VAEs) are commonly used for dimensionality reduction and data generation, helping to simplify the processing of complex data. Text-based diffusion models also show significant advantages in text generation and semantic enhancement [7,8].
However, existing AI-based evaluation methods still face several challenges in practical applications. First, most existing technological approaches rely on static data analysis, meaning they primarily handle historical data and lack adaptability to dynamic changes. These methods fail to capture the potential and future development of talents, students, or employees in different contexts. For example, existing technologies cannot accurately simulate an individual’s performance changes in different academic, work, or life environments, nor can they generate long-term predictions of individual potential. Furthermore, these methods often lack personalized growth pathways and development suggestions, failing to provide targeted development support for individuals [3]. Second, current AI models are often reliant on large volumes of historical data, especially for training on standardized tasks, making them susceptible to data bias. For instance, certain emerging fields or materials from less-represented languages may be underrepresented in datasets, leading to potential unfairness in evaluations [9]. Particularly in diverse environments such as higher education, existing models fail to adequately handle the differences between fields, potentially introducing bias toward specific fields or groups. Moreover, existing technological methods struggle to model complex semantic relationships and to understand multidimensional, multilayered information. These methods primarily rely on static corpora and standardized datasets for evaluation, which limits their performance when handling complex, cross-domain tasks [10].
In response to these issues, this study proposes an evaluation framework formulated within the principles of active inference, providing a unified probabilistic treatment of temporal evidence, latent structure, and policy selection. The approach models observed textual sequences as emissions of an underlying latent dynamical system, allowing the evolution of hidden states to be represented explicitly through a parametric state–space formulation. A representation and augmentation module first constructs a semantically coherent embedding space, ensuring that the generative and inference components operate on continuous variables that satisfy the assumptions of variational state estimation. These embeddings are then coupled with a transition model and a preference model, together defining a complete generative density over future trajectories. Within this formalism, alternative analytical policies correspond to distinct predictive distributions, whose plausibility is quantified via the free energy of the expected future—an objective that naturally balances epistemic value, predictive sufficiency, and consistency with encoded preferences. A diffusion–based generative mechanism is finally employed to map inferred latent trajectories to structured textual evaluations, thereby closing the perception–inference–generation loop implied by the active inference principle. This formulation establishes a mathematically coherent pipeline in which representation learning, latent dynamics, and generative outputs arise as components of a single variational inference framework.
The main contributions of this study are summarized as follows:
A unified active-inference formulation for evaluation. We recast the evaluation problem as variational inference in a latent dynamical system, thereby establishing a principled probabilistic framework that simultaneously handles temporal structure, uncertainty propagation, and policy selection.
A multi-stage representational and dynamical modelling pipeline. We introduce a learning procedure that combines VAE-based corpus augmentation, contrastive semantic representation, and a parametric state–space model, enabling coherent integration of heterogeneous textual evidence into a continuous latent process.
A policy-driven generative mechanism based on expected free energy minimization. We derive a policy evaluation criterion grounded in the free energy of the expected future and couple it with a latent diffusion generator, allowing structured evaluation outputs to be produced from latent trajectories inferred through the active inference principle.
The remainder of this paper is organized as follows:
Section 2 reviews related work on AI in talent evaluation.
Section 3 elaborates on the proposed methodology.
Section 4 presents experimental results and analysis.
Section 5 concludes the paper.
2. Related Work
Research related to this study spans several areas, including text-based evaluation and talent analytics, clustering and feature selection for high-dimensional data, sequential and self-supervised modelling, inference frameworks for intelligent systems, and, more recently, active inference as a unified paradigm for perception, prediction, and policy selection. This section reviews these lines of work and highlights the gaps that motivate the proposed Active Inference–Driven Evaluation (AIDE) framework.
2.1. Text-Based Evaluation and Talent Analytics
A substantial body of work has explored how textual and behavioural data can be used to support evaluation and decision-making in organisational and educational settings. Arora and Damarla provide a review of generative AI–assisted talent management, emphasising applications such as employee engagement and retention strategies built on top of large language models and generative models [1]. Complementary to this, Ooi et al. survey the broader impact of generative AI across disciplines, including management, education, and information systems, and discuss the challenges associated with reliability, transparency, and institutional deployment [2]. Qin et al. offer a comprehensive survey of AI techniques for talent analytics, covering supervised, unsupervised, and emerging generative approaches for recruitment, performance prediction, and career path modelling [3]. These studies illustrate the growing interest in using AI to support complex evaluation tasks, but they mostly frame the problem in terms of discriminative prediction or task-specific pipelines rather than a unified probabilistic formulation.
In parallel, there has been rapid progress in generative AI and large language models (LLMs). Hagos et al. review recent advances in generative AI and LLMs, with an emphasis on model architectures, training paradigms, and deployment challenges [7]. Zhao et al. survey explainability techniques for LLMs, discussing attribution, probing, and mechanistic interpretability approaches that aim to make complex language models more transparent and controllable [8]. Wu et al. propose a tag-enriched multi-attention framework with LLMs for cross-domain sequential recommendation, showing that large language models can act as powerful sequence modellers when augmented with additional semantic signals [11]. Nezami et al. study fairness issues in predictive modelling for college student success, analysing how different imputation techniques affect both performance and fairness metrics [9]. Together, these works demonstrate the richness of the modelling toolkit available for evaluation tasks, but they typically treat representation, prediction, and policy design as loosely coupled components.
2.2. Clustering, Feature Selection, and Multisource Educational Analytics
Beyond direct text classification or regression, clustering and feature selection have been widely studied as tools for profiling and evaluation. Dhelim et al. present a survey of personality-aware recommendation systems, arguing that explicit modelling of user traits can mitigate cold-start and data sparsity problems, and can lead to more personalised educational or content recommendations [12]. Qu et al. propose sparse multiple-kernel k-means algorithms that introduce norm-based sparsity into the partition matrix to improve clustering robustness and interpretability [13,14]. These contributions provide effective tools for unsupervised structure discovery but operate largely in a static setting, where the temporal evolution of entities is not explicitly modelled.
Xiong et al. develop a multi-feature fusion and selection method based on binary particle swarm optimisation, incorporating chaos-based initialisation and improved operators to accelerate convergence in high-dimensional spaces [15]. Such methods are well suited for selecting relevant descriptors in complex evaluation systems, but they do not provide a generative account of how observations arise from latent states, nor do they support explicit reasoning about possible futures.
In the context of education, Liu highlights the role of multi-source information fusion in cultivating autonomous learning ability, showing that the integration of heterogeneous indicators (motivation, learning time, behavioural logs) can yield more nuanced evaluation of learning progress [16]. These studies underline the importance of integrating heterogeneous signals for evaluation, but they typically rely on fixed aggregation rules rather than a fully probabilistic latent-variable model.
Social network analysis offers another perspective on evaluation and profiling. Zhang et al. propose methods based on predictive social networks to analyse group relationships among college students, demonstrating that network structure can provide additional context for academic performance and collaboration patterns [17]. These graph-based methods are highly informative for relational aspects but are not directly integrated with generative sequence models or active policy selection.
2.3. Sequential Modelling, Cognitive Factors, and Digital Evaluation
Sequential modelling and self-supervised learning have become central for capturing long-term temporal dependencies in behavioural data. Xu et al. propose a recommendation algorithm based on a self-supervised pretrained transformer, which first learns general sequence patterns and then fine-tunes on downstream tasks, enhancing accuracy in modelling user trajectories and actions [18]. Their work demonstrates the value of unsupervised pretraining for sequence data, but it focuses on prediction rather than on interpretable, policy-driven evaluation.
Cognitive and physical health factors have also been incorporated into evaluation frameworks. Hao et al. use data-driven methods and a priori analysis to model cognitive interventions for college students’ sports health, highlighting the influence of physical activity and cognitive interventions on academic outcomes [19]. These studies emphasise the multi-dimensional nature of evaluation but do not explicitly link these signals to latent state-space models or to normative criteria such as expected utility or free energy.
Digitalisation has reshaped talent cultivation and performance evaluation. Zhang and Yu investigate how digital marketing evaluation can be used to design applied undergraduate training programmes, emphasising the alignment between educational outcomes and industry demands [20]. Zhang et al. further propose mixed-methods frameworks that integrate importance-performance analysis (IPA) and Kaiser–Meyer–Olkin (KMO) measures to assess applied talent cultivation in data science education [21]. These works formalise evaluation as multi-criteria assessment but remain largely metric-based and do not employ explicit latent generative models.
In other domains, Li et al. develop the TLI-YOLO framework for rice disease detection, demonstrating how advanced deep-learning architectures can be adapted for mobile deployment in complex visual scenes [22]. Zhang et al. study stochastic-process-based degradation modelling from Brownian to fractional Brownian motion, providing tools for remaining useful life (RUL) prediction in engineering systems [23]. Both works highlight the importance of temporal modelling and uncertainty, but they focus on physical or visual signals rather than structured textual sequences.
2.4. Inference Frameworks, Generative Evaluation, and System-Level Considerations
Data-driven talent management and evaluation have also been studied from a systems and organisational perspective. Li and Zheng investigate human resource management models supported by wireless communication and association-rule mining, showing how data-driven patterns can inform staffing and retention strategies [24]. Arora and Damarla emphasise the role of generative AI in talent management strategies [1], while Ooi et al. discuss the cross-disciplinary potential of generative AI for organisational decision processes [2]. However, these approaches still rely on separate modules for data processing, scoring, and policy selection rather than a single variational objective.
Tuitt et al. propose a generative AI and multi-agent–based approach to psychometric evaluation, using generative models to design, administer, and interpret assessments [25]. Their work moves towards more flexible, data-driven evaluation frameworks, but it does not explicitly exploit latent-state dynamics or active inference principles.
In parallel, large-scale IoT and cloud–fog architectures have been investigated as supporting infrastructures for intelligent services. He et al. propose proactive personalised services in fog–cloud computing for healthcare [26], and Fu et al. design an intelligent cloud-computing framework for logistics alliances based on blockchain and big data [27]. These works are relevant from a systems-design perspective, but they do not specify the internal inference and evaluation mechanisms used by high-level decision modules.
2.5. Active Inference: Theory and Applications
Active inference has emerged as a powerful theoretical framework for modelling sentient behaviour as approximate Bayesian inference in generative models. Pezzulo, Parr, and Friston provide a comprehensive review of active inference as a theory of sentient behaviour, tracing its conceptual roots from Helmholtzian ideas on unconscious inference through predictive coding and hierarchical generative models to the modern formulation centred on minimising expected free energy [28]. The review emphasises that action, perception, and policy selection can be cast under a single objective, with hierarchically deep models supporting rich forms of inference and planning.
Beyond its neurobiological origin, active inference has been applied to engineering and AI problems. He et al. propose an active-inference-based approach for offloading LLM inference tasks and allocating resources in cloud–edge computing, showing that active inference can outperform deep reinforcement learning methods in data efficiency and adaptability to variable workloads [29]. Engström et al. model adaptive human driving behaviour as active inference, demonstrating that human-like trade-offs between progress and caution can emerge from policy selection driven by expected free energy minimisation [30]. Ren et al. formulate model trading strategies for connected and autonomous vehicles in Web3 as an active-inference problem, and introduce an intelligence-based reinforcement learning (IRL) scheme that leverages active inference to construct higher-level cognition without explicit reward functions [31].
These works show that active inference offers a principled alternative to traditional reinforcement learning or heuristic control, especially when explicit reward design is difficult or undesirable. However, active inference has been used primarily for sensorimotor control, resource allocation, and decision-making over physical or continuous state spaces. Its potential for text-based evaluation, where the observations are symbolic and high dimensional and where outputs themselves may be structured textual reports, remains under-explored.
2.6. Summary and Open Challenges
In summary, prior research provides strong building blocks in at least four dimensions: (i) advanced representation learning and clustering for textual and behavioural data; (ii) sequential modelling and self-supervised learning for temporal dynamics; (iii) generative and Bayesian frameworks for evaluation and scientific inference; and (iv) active inference as a general theory of perception, prediction, and action. Despite this rich landscape, at least two important gaps remain:
1. Lack of a unified variational framework for text-based evaluation. Existing methods typically treat semantic representation, temporal modelling, scoring, and policy selection as separate modules. There is limited work that formulates the entire evaluation process, from evidence accumulation through latent-state dynamics to structured textual reporting, as a single variational inference problem.
2. Limited application of active inference to symbolic, text-centric settings. Most active inference applications operate on low-level sensory streams or continuous control variables. How to adapt expected free energy minimisation to high-dimensional textual observations, and how to couple it with modern generative mechanisms such as diffusion models to produce interpretable reports, remain open questions.
The AIDE framework proposed in this paper addresses these gaps by integrating variational representation learning, latent dynamical modelling, active inference–based policy selection, and diffusion-based generative reporting within a single probabilistic architecture.
3. Methodology
This section presents a unified evaluation framework based on active inference, which integrates probabilistic representation learning, latent dynamical modelling, and generative reporting into a coherent computational architecture. The method operates by first transforming heterogeneous textual records into stable semantic embeddings through a representation and augmentation module, then modelling their temporal evolution via a latent state–space formulation that enables predictive reasoning over hypothetical futures. A generative evaluation mechanism subsequently selects analysis policies by minimising the free energy of the expected future and produces structured, human-interpretable outputs using a latent diffusion generator conditioned on inferred latent trajectories. The overall architecture of the proposed framework is illustrated in Figure 1.
Given an input sequence $x_{1:t}$ (ordinal number ① in Figure 1), the model first infers latent states via state–space modeling (ordinal number ②). Candidate policies are then evaluated by minimising the expected free energy (ordinal number ③), and the selected latent trajectory is finally decoded into a textual evaluation using a diffusion-based generator (ordinal number ④).
3.1. Problem Formulation
We consider a generic information system that generates heterogeneous textual records over time. At each discrete time step $t$, the system produces an observation $x_t$, such as a project description, a progress record, or a feedback entry. Given a finite history $x_{1:t}$, the goal is to construct a structured evaluation output $y_t$ and generate predictions for possible future trajectories $x_{t+1:t+H}$.
Rather than designing an evaluation rule directly in the observation space, we introduce a latent sequence $s_{1:t}$ and cast the problem as probabilistic inference in a latent dynamical system. The evaluation process is formulated in the framework of active inference, where the model maintains probabilistic beliefs over future trajectories and chooses internal analysis policies that minimise a free-energy functional of the expected future.
This formulation is domain–agnostic and does not rely on a particular application scenario; evaluation in academic information systems is only one example of its potential use.
3.2. Overview of the Active Inference–Driven Evaluation Framework
The proposed Active Inference–Driven Evaluation Framework (AIDE) consists of three tightly interacting components:
Representation and augmentation layer: raw textual records are denoised, optionally augmented using a variational autoencoder (VAE), and projected into a continuous semantic space through a sentence–level encoder.
Latent dynamical layer: a state–space model describes the evolution of latent states and the generation of observations; it supports predictive distributions over hypothetical futures.
Generative evaluation layer: each candidate policy is scored by the free energy of the expected future. A conditional text generator, implemented as a latent diffusion model, produces human–readable evaluation outputs based on latent trajectories with low expected free energy.
3.3. Generative Model
Section 3.3 introduces the generative model underlying AIDE. This section is organized into three components: (1) representation and augmentation of textual inputs (Section 3.3.1), (2) latent state dynamics describing temporal evolution (Section 3.3.2), and (3) the observation and preference model (Section 3.3.3). Together, these components define the complete generative process on which active inference is performed.
3.3.1. Representation and Augmentation
For sentence-level representation, we employ a transformer-based encoder based on the BERT-base architecture, which produces 768-dimensional embeddings for each text segment.
Let $x$ denote an observed text segment. To increase robustness in data-sparse regimes, we first construct an augmented corpus using a VAE. The encoder $q_\phi(z \mid x)$ maps $x$ to a latent code $z$, and the decoder $p_\theta(x \mid z)$ reconstructs the text. The VAE parameters $(\theta, \phi)$ are obtained by maximising the evidence lower bound
$$\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big[q_\phi(z \mid x) \,\|\, p(z)\big],$$
with standard Gaussian prior $p(z) = \mathcal{N}(0, I)$.
Each text segment is next encoded as a semantic vector $e = f_\eta(x)$ using a sentence-level encoder. A contrastive objective encourages discriminative representation learning:
$$\mathcal{L}_{\mathrm{con}} = -\sum_{(i,j) \in \mathcal{P}} \log \frac{\exp\big(\mathrm{sim}(e_i, e_j)/\tau\big)}{\sum_{k \neq i} \exp\big(\mathrm{sim}(e_i, e_k)/\tau\big)},$$
where $\mathcal{P}$ denotes positive semantic pairs, $\mathrm{sim}(\cdot, \cdot)$ a cosine similarity, and $\tau$ a temperature parameter.
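For concreteness, the following minimal PyTorch sketch illustrates the two objectives above, assuming a mean-field Gaussian posterior with bag-of-tokens reconstruction targets for the VAE and in-batch negatives for the contrastive term; the module layout and names are illustrative assumptions rather than the exact training code.

```python
# Minimal sketch of the ELBO and contrastive objectives of Section 3.3.1.
# Assumptions: mean-field Gaussian q_phi(z|x), bag-of-tokens targets for the
# decoder, and in-batch negatives for the InfoNCE-style contrastive loss.
import torch
import torch.nn.functional as F

def vae_elbo(x, x_recon_logits, mu, logvar):
    """ELBO with standard Gaussian prior p(z) = N(0, I); maximise this value.

    x: (B, V) reconstruction targets; mu, logvar: (B, d_z) posterior parameters.
    """
    recon = -F.binary_cross_entropy_with_logits(x_recon_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (recon - kl) / x.size(0)

def contrastive_loss(e_anchor, e_positive, tau=0.07):
    """InfoNCE with cosine similarity; other in-batch items act as negatives."""
    a = F.normalize(e_anchor, dim=-1)
    p = F.normalize(e_positive, dim=-1)
    logits = a @ p.t() / tau                          # (B, B) scaled similarities
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)            # diagonal pairs are positive
```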
3.3.2. Latent State Dynamics
Let $s_t$ denote the latent state at time $t$. We assume a first-order Markov process with Gaussian transitions
$$p(s_{t+1} \mid s_t) = \mathcal{N}\big(s_{t+1};\, f_\mu(s_t),\, f_\Sigma(s_t)\big),$$
where $f_\mu$ and $f_\Sigma$ are parametrised functions.
The likelihood of an embedding $e_t$ given state $s_t$ is
$$p(e_t \mid s_t) = \mathcal{N}\big(e_t;\, g(s_t),\, R\big),$$
with $g$ a decoding function and $R$ a diagonal noise matrix.
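A compact PyTorch sketch of these densities is shown below; the GRU-based parameterisation follows the implementation details of Section 4.1.3, while the specific module layout and dimensions are assumptions.

```python
# Sketch of the Gaussian transition p(s_{t+1}|s_t) and emission p(e_t|s_t)
# of Section 3.3.2, with the transition mean parameterised by a GRU cell
# as described in Section 4.1.3; the exact architecture is an assumption.
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    def __init__(self, state_dim=64, embed_dim=768):
        super().__init__()
        self.cell = nn.GRUCell(state_dim, state_dim)     # f_mu: transition mean
        self.logvar = nn.Linear(state_dim, state_dim)    # f_Sigma: diagonal covariance
        self.decode = nn.Linear(state_dim, embed_dim)    # g: state -> embedding mean
        self.log_r = nn.Parameter(torch.zeros(embed_dim))  # diagonal noise R

    def transition(self, s_t):
        mean = self.cell(s_t, s_t)
        std = torch.exp(0.5 * self.logvar(s_t))
        return torch.distributions.Normal(mean, std)     # p(s_{t+1} | s_t)

    def emission(self, s_t):
        std = torch.exp(0.5 * self.log_r)
        return torch.distributions.Normal(self.decode(s_t), std)  # p(e_t | s_t)
```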
3.3.3. Observation and Preference Model
We introduce evaluation quantities $y_t$ (e.g., scalars or low-dimensional descriptors) together with embeddings $e_t$, forming observations $o_t = (e_t, y_t)$. Over a horizon $T$, the joint model is
$$p(o_{1:T}, s_{1:T}) = p(s_1) \prod_{t=2}^{T} p(s_t \mid s_{t-1}) \prod_{t=1}^{T} p(e_t \mid s_t)\, \tilde{p}(y_t \mid s_t).$$
The term $\tilde{p}(y_t \mid s_t)$ encodes preferred behaviour or preferred trajectories in a domain-specific manner.
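Under this factorisation, the log-density of a sampled trajectory can be accumulated term by term, as in the sketch below; a standard-normal prior over the initial state and a fixed preference distribution over the evaluation quantities are assumptions made for illustration.

```python
# Log-density of the joint model of Section 3.3.3 for one sampled trajectory.
# `transition` and `emission` map a state to a Normal (see the sketch above);
# `pref` is a fixed torch.distributions.Normal over y -- both are assumptions.
import torch

def joint_log_prob(transition, emission, pref, s, e, y):
    """s: (T, d_s) states; e: (T, d_e) embeddings; y: (T, d_y) quantities."""
    logp = torch.distributions.Normal(0.0, 1.0).log_prob(s[0]).sum()  # p(s_1)
    logp = logp + emission(s[0]).log_prob(e[0]).sum() + pref.log_prob(y[0]).sum()
    for t in range(1, s.shape[0]):
        logp = logp + transition(s[t - 1]).log_prob(s[t]).sum()  # p(s_t | s_{t-1})
        logp = logp + emission(s[t]).log_prob(e[t]).sum()        # p(e_t | s_t)
        logp = logp + pref.log_prob(y[t]).sum()                  # preference factor
    return logp
```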
3.4. Variational Inference and Policy Evaluation
Section 3.4 presents the variational inference framework used to approximate posterior beliefs and evaluate alternative policies. The discussion proceeds in three steps: (1) policies and predictive beliefs (Section 3.4.1), (2) the definition and decomposition of expected free energy (Section 3.4.2), and (3) numerical approximations used in practice (Section 3.4.3). These subsections together describe how AIDE infers future trajectories and selects actions.
3.4.1. Policies and Predictive Beliefs
A policy $\pi$ denotes a configuration of internal analysis choices over a finite horizon $H$. For each policy, a variational density approximates the predictive distribution:
$$q(o_{t+1:t+H}, s_{t+1:t+H} \mid \pi) \approx p(o_{t+1:t+H}, s_{t+1:t+H} \mid \pi, x_{1:t}).$$
3.4.2. Free Energy of the Expected Future
The free energy of the expected future is defined as
$$\mathcal{F}(\pi) = D_{\mathrm{KL}}\big[\,q(o_{t+1:t+H}, s_{t+1:t+H} \mid \pi)\;\big\|\;\tilde{p}(o_{t+1:t+H}, s_{t+1:t+H})\,\big].$$
Optimal policy posteriors satisfy
$$q(\pi) \propto \exp\big(-\mathcal{F}(\pi)\big).$$
A decomposition yields
$$\mathcal{F}(\pi) \approx -\,\mathbb{E}_{q(o \mid \pi)}\Big[D_{\mathrm{KL}}\big[q(s \mid o, \pi) \,\|\, q(s \mid \pi)\big]\Big] + D_{\mathrm{KL}}\big[q(y \mid \pi) \,\|\, \tilde{p}(y)\big],$$
where the first term measures expected information gain, and the second measures how well predicted evaluation quantities align with preference distributions.
3.4.3. Numerical Approximation
Exact computation is intractable; Monte Carlo rollouts from the dynamical model are used instead. Parameter-related information gain is estimated via entropy differences, e.g.,
$$\mathrm{IG}(\pi) \approx H\big[q(s_\tau \mid \pi)\big] - \mathbb{E}_{q(o_\tau \mid \pi)}\, H\big[q(s_\tau \mid o_\tau, \pi)\big].$$
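The sketch below illustrates one way to assemble such an estimate, assuming an amortised posterior network for the entropy-difference term and a Gaussian read-out from states to evaluation quantities; the helper names (`head`, `post`) are illustrative and not components named above.

```python
# Monte Carlo sketch of the expected-free-energy estimate of Section 3.4.3.
# Epistemic term: entropy difference H[q(s|pi)] - E_o H[q(s|o,pi)];
# pragmatic term: KL from predicted evaluation quantities to the preference
# model. All helper interfaces are assumptions for illustration.
import torch

def expected_free_energy(transition, emission, head, post, pref, s_t,
                         horizon=5, n_samples=32):
    """transition/emission: state -> Normal; head: state -> Normal over y;
    post: embedding -> Normal over s (amortised posterior); pref: Normal over y;
    s_t: (1, d_s) current state estimate."""
    efe = torch.tensor(0.0)
    s = s_t.expand(n_samples, -1)
    for _ in range(horizon):
        prior = transition(s)                     # q(s_tau | pi)
        s = prior.rsample()
        e = emission(s).rsample()                 # imagined observations
        info_gain = (prior.entropy().sum(-1).mean()
                     - post(e).entropy().sum(-1).mean())  # entropy difference
        pragmatic = torch.distributions.kl_divergence(head(s), pref).sum(-1).mean()
        efe = efe + pragmatic - info_gain
    return efe

# Policy weights then follow q(pi_j) proportional to exp(-F(pi_j)),
# e.g. torch.softmax(-torch.stack(efe_values), dim=0).
```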
3.5. Generative Evaluation via Latent Diffusion
A latent diffusion model produces evaluation outputs conditioned on low-free-energy latent trajectories. Let $z_0$ denote a pooled latent code. The forward process applies
$$q(z_k \mid z_0) = \mathcal{N}\big(z_k;\, \sqrt{\bar{\alpha}_k}\, z_0,\, (1 - \bar{\alpha}_k) I\big),$$
with noise schedule $\bar{\alpha}_k = \prod_{i=1}^{k} (1 - \beta_i)$. The denoising network $\epsilon_\theta$ minimises
$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{z_0,\, \epsilon,\, k}\Big[\big\|\epsilon - \epsilon_\theta(z_k, k)\big\|_2^2\Big].$$
The reverse trajectory yields a clean code $\hat{z}_0$, which the decoder maps to textual form:
$$\hat{y} = \mathrm{Dec}(\hat{z}_0).$$
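A minimal sketch of this objective is given below, assuming a linear noise schedule and an epsilon-prediction network `eps_theta`; both choices are assumptions, since the schedule and network form are not fixed above.

```python
# Sketch of the latent-diffusion objective of Section 3.5: closed-form forward
# noising of the pooled code z_0 and the standard epsilon-prediction loss.
# The linear beta schedule is an assumption.
import torch

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)             # alpha_bar_k

def diffusion_loss(eps_theta, z0, alpha_bar):
    B = z0.size(0)
    k = torch.randint(0, alpha_bar.size(0), (B,))        # random step per sample
    a = alpha_bar[k].unsqueeze(-1)
    eps = torch.randn_like(z0)
    z_k = a.sqrt() * z0 + (1.0 - a).sqrt() * eps         # q(z_k | z_0)
    return ((eps - eps_theta(z_k, k)) ** 2).mean()       # match injected noise
```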
3.6. Learning Procedure
The learning procedure of the proposed framework is decomposed into three coordinated components, each formalised as an independent algorithm while remaining tightly interconnected. Algorithm 1 first establishes the representational foundation by performing variational augmentation of the raw corpus and learning discriminative semantic embeddings that serve as the input space for subsequent modelling stages. Building on these representations, Algorithm 2 trains the latent dynamical model and estimates the preference distribution, thereby capturing both the temporal structure of the underlying process and the target behavioural tendencies encoded by domain knowledge. Algorithm 3 then integrates these learned elements within an active inference loop: candidate policies are evaluated through approximations of their expected free energy, and the most plausible policy guides the diffusion-based generator to produce structured evaluation outputs. Together, the three algorithms form a coherent learning pipeline, progressing from representation to dynamical modelling and finally to policy-driven generative evaluation.
Algorithm 1 Representation Learning and Data Augmentation
Require: Raw text corpus $\mathcal{D}$, VAE parameters $(\theta, \phi)$, encoder parameters $\eta$
Ensure: Augmented corpus $\tilde{\mathcal{D}}$, trained encoder $f_\eta$
VAE-based Augmentation
1: for each mini-batch $\{x\} \subset \mathcal{D}$ do
2:   Encode $x$ using $q_\phi(z \mid x)$
3:   Reconstruct $\hat{x} \sim p_\theta(x \mid z)$
4:   Update $(\theta, \phi)$ by maximising $\mathcal{L}_{\mathrm{ELBO}}$
5:   Retain $\hat{x}$ only if semantic checks are satisfied
6: end for
7: Form $\tilde{\mathcal{D}} = \mathcal{D} \cup \{\hat{x}\}$
Semantic Representation Learning
8: for each mini-batch $\{x\} \subset \tilde{\mathcal{D}}$ do
9:   Compute embeddings $e = f_\eta(x)$
10:  Construct contrastive pairs $\mathcal{P}$
11:  Update $\eta$ by minimising $\mathcal{L}_{\mathrm{con}}$
12: end for
Algorithm 2 Learning the Dynamical and Preference Models
Require: Embedded sequences $\{e_{1:T}\}$, initial dynamical parameters $\psi$
Ensure: Trained dynamical model $p_\psi$ and preference parameters
Learning the Latent Dynamics
1: for each sequence $e_{1:T}$ do
2:   Initialise $s_1$
3:   for $t = 2$ to $T$ do
4:     Predict $s_t$ using $p(s_t \mid s_{t-1})$
5:     Evaluate likelihood $p(e_t \mid s_t)$
6:   end for
7:   Update $\psi$ by gradient descent on the sequence likelihood
8: end for
Estimating the Preference Model
9: Collect representative desirable evaluation quantities $\{y\}$
10: Fit preference parameters of $\tilde{p}(y)$ by maximum likelihood
Algorithm 3 Active Inference and Diffusion-Based Evaluation Generation
Require: Posterior $q(s_{1:t})$, dynamical model $p_\psi$, preference model $\tilde{p}$
Require: Number of candidate policies $J$, diffusion model $\epsilon_\theta$
Ensure: Evaluation output $\hat{y}$
1: Generate candidate policies $\{\pi_1, \dots, \pi_J\}$
Policy Evaluation via Expected Free Energy (Steps 2–5)
2: for $j = 1$ to $J$ do
3:   Approximate $\mathcal{F}(\pi_j)$ using Monte Carlo rollouts
4:   Compute policy weight $q(\pi_j) \propto \exp(-\mathcal{F}(\pi_j))$
5: end for
6: Select $\pi^{\ast} = \arg\max_{j} q(\pi_j)$
Diffusion-Based Generation (Steps 7–10)
7: Infer latent trajectory under $\pi^{\ast}$
8: Pool the trajectory into latent code $z_0$
9: Apply reverse diffusion to obtain $\hat{z}_0$
10: Generate output $\hat{y} = \mathrm{Dec}(\hat{z}_0)$
3.7. Computational Complexity
The computational complexity of the proposed framework primarily depends on the number of planning steps (horizon length) and the number of candidate policies. Specifically, the Monte Carlo rollouts used to estimate the expected free energy involve sampling multiple trajectories for each policy. The time complexity of each policy evaluation is therefore proportional to the number of planning steps $N$, the number of Monte Carlo samples $M$, and the number of candidate policies $J$. As a result, the overall time complexity of policy evaluation is $O(J \cdot M \cdot N)$.
In addition, the memory complexity is influenced by the need to store sampled trajectories and corresponding free energy estimates for each policy. Thus, the memory complexity is also $O(J \cdot M \cdot N)$.
While this computational cost increases with longer planning horizons and a larger number of candidate policies, the framework is scalable for practical applications, particularly when combined with approximation techniques such as importance sampling or batch processing. For larger-scale or real-time applications, optimization strategies such as parallel computing or reducing the number of candidate policies may be employed to mitigate computational overhead.
To provide a sense of the practical feasibility of our approach, we estimate that for a planning horizon of $N$ steps and $J$ candidate policies, the framework requires approximately X hours of computation on a standard CPU with Y GB of memory, assuming $M$ Monte Carlo samples per policy.
4. Experiments
4.1. Experimental Setup
Section 4.1 describes the experimental setup. It is structured as follows: the dataset is introduced in Section 4.1.1, the evaluation task formulation is described in Section 4.1.2, and implementation details are provided in Section 4.1.3. This organization clarifies the experimental context before presenting results.
4.1.1. Dataset
The experiments were conducted on a curated longitudinal textual dataset referred to as AcademicText-2025. The dataset consists of heterogeneous academic records, including project reports, feedback summaries, and process logs, which are organised into temporal sequences associated with individual entities. These records were collected from institutional academic information systems and reflect real-world evaluation scenarios.
Due to privacy, confidentiality, and institutional policy constraints, the AcademicText-2025 dataset is not publicly released. However, the dataset can be made available for research purposes upon reasonable request to the corresponding author. Access is granted after completing a data usage agreement that specifies research-only use and compliance with relevant privacy and ethical regulations. The publication of aggregated experimental results based on this dataset is permitted under the terms of this agreement. The dataset does not contain personally identifiable information. All potentially sensitive attributes were anonymised or removed prior to analysis in accordance with applicable data protection regulations.
The dataset was pre-processed following the protocol described in Section 3. All documents were converted to a unified encoding format, tokenised, normalised, and truncated or padded to a fixed maximum length. Sequences with missing timestamps, incomplete evaluation records, or insufficient temporal length were excluded to ensure the reliability of subsequent modelling and evaluation. For transparency and reproducibility, the key characteristics of the dataset and the experimental environment are summarised in Table 1.
The remaining sequences are partitioned into training, validation, and test subsets at the sequence level using random sampling, with proportions of 70%, 15%, and 15%, respectively. To avoid information leakage, all sequences associated with the same entity are assigned exclusively to a single subset. The random split is performed once using a fixed random seed to ensure reproducibility.
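A short sketch of this entity-level split is shown below; the field name `entity_id` and the dictionary-based record layout are assumptions about the data format.

```python
# Entity-level train/val/test split with a fixed seed, as described above:
# every sequence of a given entity lands in exactly one subset.
import random

def split_by_entity(sequences, seed=42, train=0.70, val=0.15):
    entities = sorted({seq["entity_id"] for seq in sequences})  # assumed field
    random.Random(seed).shuffle(entities)
    n_train = int(train * len(entities))
    n_val = int(val * len(entities))
    groups = {
        "train": set(entities[:n_train]),
        "val": set(entities[n_train:n_train + n_val]),
        "test": set(entities[n_train + n_val:]),
    }
    return {name: [s for s in sequences if s["entity_id"] in ids]
            for name, ids in groups.items()}
```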
4.1.2. Task Formulation
Given the history of textual observations for a particular entity, the task is to generate an evaluation report that summarises the current state and, when appropriate, reflects short-term expectations about future development. All models in the comparison take the same input (a history of texts) and are required to output a natural-language evaluation.
Some baselines naturally produce numerical scores or categorical labels rather than free-form text. For these methods, we convert their outputs into templated sentences so that all approaches can be compared under a uniform text-based evaluation protocol.
4.1.3. Implementation Details for AIDE
In AIDE, the representation and augmentation module employs a transformer-based sentence encoder with embedding dimension $d = 768$. The VAE consists of two fully connected layers in both encoder and decoder and assumes a standard Gaussian prior over latent codes. It is first trained in an unsupervised fashion on the entire corpus to obtain a smooth latent space. On top of this VAE, the encoder is further fine-tuned using a contrastive learning objective to enforce semantic discrimination between similar and dissimilar text pairs. The augmented corpus includes both original and quality-controlled synthetic samples, providing richer evidence for subsequent latent dynamical modelling.
The latent dynamical layer uses a state dimension in the range of 64 to 128. The state transition function and the observation mapping are parameterised by gated recurrent units (GRUs), which offer a good compromise between modelling capacity and computational efficiency. The planning horizon H and the number of candidate policies J are treated as hyperparameters of the active inference component; they are varied in sensitivity analyses but set to moderate default values in the main experiments to balance temporal coverage and computational cost.
The diffusion-based generator operates in a latent space obtained by pooling the sequence of latent states into an initial code $z_0$. A forward noising process then transforms $z_0$ into approximately isotropic Gaussian noise over $T$ discrete steps. A residual denoising network with time-step embeddings is trained to reverse this process and reconstruct a clean latent representation $\hat{z}_0$, which is finally decoded into a textual evaluation. All models are trained with the Adam optimiser. Learning rates and batch sizes are selected based on validation performance, and early stopping is used to avoid overfitting. Once the hyperparameters are fixed on the validation set, the models are retrained on the union of training and validation data and evaluated on the held-out test set.
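For reference, the following configuration sketch collects the settings described above; the concrete default values are assumptions within the stated ranges, not the tuned values used in the experiments.

```python
# Illustrative hyperparameter configuration for AIDE; defaults are assumed
# values consistent with Section 4.1.3, not the tuned experimental settings.
from dataclasses import dataclass

@dataclass
class AIDEConfig:
    embed_dim: int = 768           # sentence-encoder embedding dimension
    state_dim: int = 64            # latent state dimension (64-128 range)
    horizon_H: int = 5             # planning horizon (assumed moderate default)
    n_policies_J: int = 8          # number of candidate policies (assumed)
    diffusion_steps_T: int = 1000  # forward noising steps (assumed)
    lr: float = 1e-4               # Adam learning rate, tuned on validation
    batch_size: int = 32
    early_stopping_patience: int = 5
```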
4.2. Baseline Methods
To assess the performance of AIDE, we compare it against four representative baselines, corresponding to traditional metric-based systems, discriminative deep models, end-to-end sequence models, and generative state-space models without active inference.
The first baseline, denoted MBE (Metric-Based Evaluation), represents conventional indicator-aggregation systems. Hand-crafted quantitative indicators, such as counts of specific events, keyword frequencies, or rubric-based scores, are aggregated through linear or tree-based models to produce numerical evaluation outputs. Because MBE does not generate free-form text, we convert its scalar outputs into short templated sentences in order to evaluate it under the same metrics as the generative models.
The second baseline, DAI (Discriminative Assessment Model), is a typical deep discriminative model. It encodes the history of texts into a sequence of embedding vectors and feeds them to a feedforward network or a sequence-to-sequence decoder to predict evaluation labels or generate text directly. DAI does not include an explicit latent dynamical model and does not perform active policy selection; it relies on the fitting capacity of deep networks to capture correlations between histories and evaluation outputs [9].
The third baseline, TRANS-SEQ, is a standard transformer-based sequence-to-sequence model. It treats the historical texts as the source sequence and the evaluation as the target sequence, and learns a direct mapping via end-to-end maximum likelihood training. This model can generate fluent text but does not expose an explicit latent state nor incorporate any notion of active inference or planning [32].
The fourth baseline, SSM-GEN, is a generative state-space model without active inference. It learns a latent state evolution and uses a conditional decoder to generate evaluation texts, but employs fixed or heuristic strategies when choosing how to roll out latent trajectories and configure reports. SSM-GEN therefore isolates the contribution of latent dynamics without the additional benefits of policy selection under expected free energy [33].
4.3. Evaluation Metrics and Human Assessment
Model performance is evaluated using a combination of automatic metrics and human judgements.
From the automatic perspective, we first compute BLEU scores to quantify n-gram overlap between generated and reference evaluations, considering n from 1 to 4 and using the standard weighted average. BLEU reflects local lexical and phrase-level similarity. Second, we compute ROUGE-L, which measures the longest common subsequence between generated and reference texts; this metric emphasises sentence-level coverage and captures whether the generator reproduces key semantic fragments. Third, we use BERTScore to assess semantic similarity based on contextual embeddings, thereby going beyond surface-level token overlap.
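As an illustration, these metrics can be computed with standard open-source packages, as in the sketch below; the use of nltk, rouge-score, and bert-score with default settings is an assumption, not a statement of the exact tooling used.

```python
# Hedged example of the automatic metrics: BLEU-1..4 via nltk, ROUGE-L via
# rouge-score, and BERTScore F1 via bert-score; defaults are assumptions.
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from bert_score import score as bertscore

def automatic_metrics(generated: str, reference: str):
    bleu = sentence_bleu([reference.split()], generated.split(),
                         weights=(0.25, 0.25, 0.25, 0.25))  # weighted BLEU-1..4
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
        reference, generated)["rougeL"].fmeasure
    _, _, f1 = bertscore([generated], [reference], lang="en")
    return {"BLEU": bleu, "ROUGE-L": rouge_l, "BERTScore": f1.item()}
```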
To evaluate how well generated evaluations are aligned with subsequent developments, we design sequence consistency measures. At an evaluation time $t$, we extract key forward-looking information from the generated text (such as stated trends or categorical judgements) and compare it with features derived from the realised trajectory between $t$ and $t + H$. The consistency score reflects the degree to which the generated evaluation anticipates the direction and pattern of future observations, without requiring exact numeric prediction.
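One minimal instantiation of such a measure is sketched below, assuming the forward-looking signal is a stated trend keyword and the realised trajectory is summarised by a numeric series; the keyword lists and the binary sign test are illustrative assumptions, not the exact measure used.

```python
# Illustrative consistency score: compare the trend direction stated in the
# generated text against the sign of the realised change. Keyword lists and
# the binary sign test are assumptions for illustration only.
def consistency_score(generated: str, future_values: list) -> float:
    text = generated.lower()
    stated_up = any(w in text for w in ("improve", "increase", "progress"))
    stated_down = any(w in text for w in ("decline", "decrease", "deteriorate"))
    if stated_up == stated_down:          # no single clear stated direction
        return 0.5
    realised_up = future_values[-1] > future_values[0]
    return 1.0 if realised_up == stated_up else 0.0
```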
Human evaluation is conducted by domain experts who independently rate a stratified subset of test cases. For each sampled case, several model outputs are presented in random order without revealing their source. Experts score the reports along several dimensions, including informativeness (extent to which available evidence is used), internal coherence (logical and linguistic consistency), alignment with evidence (faithfulness to the observed history), and overall readability. Scores are standardised across raters to mitigate individual bias. Specifically, for each rater and each evaluation criterion, raw scores are normalised using z-score standardisation by subtracting the rater-specific mean and dividing by the corresponding standard deviation. The resulting normalised scores are then aggregated across raters by averaging to obtain the final human evaluation score for each generated report.
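The rater-wise standardisation can be written compactly as follows; the array layout is an assumption.

```python
# Rater-wise z-score standardisation of human ratings, then averaging across
# raters, as described above. scores: (n_raters, n_reports) for one criterion.
import numpy as np

def standardise_ratings(scores: np.ndarray) -> np.ndarray:
    mu = scores.mean(axis=1, keepdims=True)        # rater-specific mean
    sd = scores.std(axis=1, keepdims=True) + 1e-8  # rater-specific std
    return ((scores - mu) / sd).mean(axis=0)       # final per-report score
```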
4.4. Overall Quantitative Performance
The main quantitative results are summarised in Table 2. Across BLEU, ROUGE-L, and BERTScore, AIDE achieves the best overall performance. Compared with TRANS-SEQ, AIDE exhibits higher lexical overlap and semantic similarity to the human-written references, indicating that combining latent dynamics with active inference yields more reference-like evaluations than purely end-to-end sequence modelling. Relative to the metric-based baseline MBE, AIDE produces richer and more context-sensitive texts, while also achieving considerably stronger automatic scores.
In terms of sequence consistency, AIDE shows the highest alignment between generated evaluations and subsequent observed development, outperforming both SSM-GEN and DAI. This suggests that selecting policies via expected free energy encourages the model to generate evaluations that are not only descriptive of the present but also coherent with the likely future under the learned dynamical model, echoing the theoretical role of active inference.
4.5. Effect of Active Inference and Policy Selection
To quantify the contribution of active inference, we consider two ablated variants of AIDE. In the first variant, denoted AIDE-noEI, the epistemic term in the expected free energy is removed, so that policy selection only considers preference alignment and ignores the value of information gain. In the second variant, AIDE-noPI, the diffusion generator is conditioned on a latent trajectory obtained from a single greedy rollout of the dynamical model, rather than on the posterior over policies; in other words, generation is no longer policy-informed.
The performance of these variants is reported in Table 3. Removing the epistemic term leads to a noticeable decrease in sequence consistency and a modest drop in BLEU and ROUGE-L, indicating that policies selected without accounting for information gain produce evaluations that are less informative about future developments. Removing policy-informed generation reduces the diversity and structural richness of generated texts; although some automatic scores remain close to those of the full AIDE, human evaluators more frequently describe these outputs as flat or lacking nuance. These findings support the conclusion that both epistemic and pragmatic components of expected free energy are important for the overall effectiveness of AIDE.
4.6. Component-Level Ablation
Beyond the active inference module, we examine the contributions of VAE-based augmentation, latent dynamical modelling, and diffusion-based generation. A variant without VAE augmentation (AIDE-noVAE) is trained solely on the original corpus. This model shows reduced robustness on sequences with sparse or noisy observations, confirming that VAE augmentation helps smooth the representation space and partially compensates for data sparsity. A variant without latent dynamics (AIDE-noDyn) replaces the state-space model with a simple pooling mechanism over embeddings. This approach maintains reasonable performance on very short histories but degrades on longer sequences, underscoring the importance of explicit temporal modelling. A variant without diffusion (AIDE-noDiff) employs a deterministic decoder conditioned directly on the pooled latent representation. While computationally cheaper, it produces less fluent and less diverse text, and its automatic metrics generally lag behind the full model.
Table 4 summarises these component-level ablation results.
4.7. Sensitivity to Planning Horizon and Policy Set Size
The sensitivity of AIDE to the planning horizon H and the number of candidate policies J is investigated by varying one parameter at a time while keeping all others fixed. When H is very small, the model largely focuses on the immediate present, and the generated evaluations primarily restate current evidence, leading to relatively low sequence consistency. As H increases to a moderate range, both automatic metrics and consistency scores improve, since the model can account for longer-term trends in its policy selection and report generation. Beyond a certain horizon, however, further increases in H yield diminishing gains and can even destabilise training, due to the increased difficulty of long-horizon prediction and expected free energy estimation.
The effect of the number of candidate policies J exhibits a similar pattern. With very small J, exploration of the policy space is insufficient, and the posterior concentrates on suboptimal regions. Increasing J improves the coverage of the policy space and leads to better performance. Once J exceeds a moderate value, additional increases offer only marginal performance gains while substantially increasing computational cost. These trends suggest that AIDE can be operated with relatively modest values of H and J while capturing most of the benefits of active inference.
Figure 2 illustrates the performance of AIDE as a function of the planning horizon $H$, and Figure 3 shows the effect of varying $J$.
Figure 2 summarises the sensitivity of AIDE to the planning horizon $H$. As shown in Figure 2a, increasing $H$ from very short values leads to consistent improvements in BLEU, ROUGE-L, and BERTScore, indicating that a longer horizon allows the model to exploit more temporal structure when selecting policies and generating reports. However, the gains saturate once $H$ enters a moderate range, and no further systematic improvement is observed for larger horizons. A similar pattern is observed for the sequence-level consistency score in Figure 2b: short horizons yield evaluations that mainly restate the present, whereas moderate horizons produce assessments that are better aligned with subsequent developments. Beyond that point, increasing $H$ further introduces additional uncertainty into long-range predictions and expected free energy estimation, resulting in diminishing or even slightly negative returns. These results support our choice of a moderate planning horizon in the main experiments.
Figure 3 examines the effect of the number of candidate policies $J$ on the performance of AIDE. When $J$ is small, the model explores only a limited portion of the policy space and all automatic metrics remain relatively low. As $J$ increases, BLEU, ROUGE-L, and BERTScore improve steadily, indicating that a richer policy set allows the active inference procedure to identify more informative trajectories and, in turn, better evaluation reports. Beyond a moderate range, however, the curves flatten and additional policies yield only marginal gains, while the computational cost grows substantially. These observations suggest that AIDE can operate effectively with a moderately sized policy set, capturing most of the benefits of active inference without incurring excessive overhead.
4.8. Qualitative Behaviour and Case Analysis
To better understand the qualitative behaviour of AIDE, we examine generated evaluations for representative sequences. When the historical trajectory shows a clear pattern of steady improvement, AIDE typically produces reports that both summarise the current state and cautiously anticipate continued progress, often explicitly referencing recent changes and their possible implications. In contrast, when the history is volatile or ambiguous, the generated evaluations become more conservative and conditional in wording, sometimes articulating alternative possible developments. This qualitative behaviour reflects the uncertainty encoded in the latent states and is consistent with the active inference formulation, where policies are selected to balance explanatory adequacy and future plausibility.
Compared with the baselines, AIDE’s outputs are generally more structured and sensitive to temporal context. MBE tends to reuse a small set of templates and fails to respond to subtle changes in the trajectory. TRANS-SEQ can generate fluent text but occasionally hallucinates details that are not supported by the history, due to the lack of explicit constraints from a latent dynamical model. SSM-GEN benefits from latent states but, without principled policy selection, its evaluations are less forward-looking and sometimes less aligned with the implicit preferences encoded in the data.
Figure 4 presents an illustrative example in which AIDE’s evaluation better captures the trajectory’s turning point than the competing methods.
Figure 4 presents a representative case study designed to illustrate how different models respond to a sequence exhibiting a clear turning point. As shown in the upper panel, the underlying latent performance signal increases steadily before undergoing a distinct reversal. The textual summaries generated by the baseline methods reveal their limitations: MBE produces template-like statements that ignore the trend shift, and TRANS-SEQ, although fluent, fails to capture the downturn. SSM-GEN is able to follow the overall trajectory but lacks coherent forward-looking commentary. In contrast, AIDE not only recognises the change in direction but also situates it within a broader temporal context, producing an assessment that is both more faithful to the observed history and more consistent with likely future developments. This example highlights the advantage of combining latent dynamical modelling with active inference for generating context-aware evaluations.
4.9. Discussion
The experimental results demonstrate that AIDE offers a competitive and conceptually coherent solution to sequential text evaluation. Its advantages are most pronounced when long and noisy histories must be condensed into concise reports that simultaneously describe the present and anticipate plausible futures. By formulating evaluation as active inference in a latent dynamical model, AIDE unifies representation learning, temporal modelling, policy selection, and generative reporting under a single variational principle.
At the same time, the experiments also reveal several limitations. Estimating expected free energy remains computationally demanding for very long planning horizons, and the quality of generated text still depends on the underlying language model and the diversity and coverage of the training corpus. These observations suggest promising directions for future work, including more efficient approximations of expected free energy, more compact parameterisations of the policy space, and hybrid architectures that integrate AIDE with stronger general-purpose or domain-specific language models.
Active inference offers a probabilistic framework for sequential decision-making in which decisions are made by minimising the expected free energy over time. This contrasts with reinforcement learning (RL), which typically learns a policy through trial and error driven by reward feedback. Whereas RL must manage exploration–exploitation trade-offs and is strongly data-driven, active inference encodes uncertainty in a principled way and treats perception and action within a unified framework. A key difference is that active inference integrates prior knowledge and sensory data in decision-making, while RL generally relies on experience and reward signals to adjust the policy; this makes active inference particularly advantageous when prior information is available and uncertainty plays a significant role. Moreover, active inference directly models the process of belief updating, which allows it to handle dynamic, changing environments without the retraining that RL often requires. Overall, while both approaches aim to optimise sequential decision-making, active inference blends perception and action in a single probabilistic objective, offering a conceptual advantage in applications involving complex, uncertain environments.
5. Conclusions
This paper has proposed AIDE, an active inference–driven evaluation framework that formulates sequential text evaluation as variational inference in a latent dynamical system. Instead of treating representation, temporal modelling, scoring, and report generation as loosely coupled components, AIDE integrates VAE-based augmentation and contrastive semantic encoding, a parametric state–space model for latent dynamics, an expected free energy-based policy selection mechanism, and a diffusion-based text generator into a single probabilistic architecture. Within this formulation, evaluation reports arise as the outcome of active inference over future trajectories, balancing explanatory adequacy with alignment to encoded preferences. Experimental results on a longitudinal text corpus show that AIDE consistently outperforms representative metric-based, discriminative, sequence-to-sequence, and purely generative state-space baselines in terms of automatic metrics, alignment with subsequent developments, and expert judgements, and that each of its key components—especially the active inference module—contributes measurably to overall performance. At the same time, the study highlights several open challenges, including the computational cost of expected free energy estimation at long planning horizons, the dependence of generation quality on the underlying language model, and the need for more interpretable mechanisms to specify and learn preference structures. Future work will focus on more efficient approximations of active inference, compact parameterisations of policy spaces, and the integration of AIDE with stronger general-purpose or domain-specific language models, as well as on extending the framework to other sequential evaluation tasks beyond the textual domain.