Article

AIDE: An Active Inference-Driven Framework for Dynamic Evaluation via Latent State Modeling and Generative Reasoning

1 Personnel Department, Tianjin Normal University, Tianjin 300387, China
2 Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
3 School of Information Science and Engineering, Linyi University, Linyi 276012, China
4 School of Computer Science, University of Liverpool, Liverpool L69 7ZX, UK
5 School of Computer Science, Shanghai University of International Business and Economics, Shanghai 201613, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 99; https://doi.org/10.3390/electronics15010099
Submission received: 28 November 2025 / Revised: 20 December 2025 / Accepted: 23 December 2025 / Published: 24 December 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

This paper introduces AIDE, an active inference-driven evaluation framework designed to provide a unified and theoretically grounded approach for analyzing sequential textual data. AIDE formulates the evaluation problem as variational inference in a latent dynamical system, enabling joint treatment of representation, temporal structure, and predictive reasoning. The framework integrates (i) a representation and augmentation module based on variational learning and contrastive semantic encoding, (ii) a parametric state–space model that captures the evolution of latent states and supports probabilistic forecasting, and (iii) a policy-selection mechanism that minimizes the expected free energy, guiding a latent diffusion generator to produce coherent and interpretable evaluation outputs. This formulation yields a principled pipeline linking evidence accumulation, latent-state inference, and policy-driven generative reporting. Experimental studies demonstrate that AIDE provides stable inference, coherent predictions, and consistent evaluation behavior across heterogeneous textual sequences. The proposed framework offers a general probabilistic foundation for dynamic evaluation tasks and contributes a structured methodology for integrating representation learning, dynamical modeling, and generative mechanisms within a single variational paradigm.

1. Introduction

With the rapid progress of artificial intelligence, automated evaluation methods based on natural language processing and deep learning have been widely adopted across various domains [1]. These methods typically rely on large-scale data modeling and end-to-end generation, enabling systems to produce comprehensive assessments from heterogeneous and unstructured information. Despite these advantages in automation and efficiency, current evaluation systems still face several critical challenges. First, most existing methods operate under a single-step, passive inference paradigm, generating assessment outputs directly from inputs without revealing intermediate reasoning processes, resulting in limited interpretability. Second, when confronted with incomplete, ambiguous, or conflicting information, passive generation models lack mechanisms to actively obtain additional evidence or adjust their inference trajectory, making them susceptible to noise, logical inconsistencies, and cumulative errors in complex evaluation tasks. Moreover, in scenarios where sensitive attributes or latent sources of bias exist, one-shot inference is insufficient for effective bias detection and mitigation, thereby compromising the reliability and fairness of evaluation outcomes [2,3].
Against this backdrop, active reasoning has emerged as a promising paradigm for enhancing the transparency, controllability, and trustworthiness of AI-driven evaluation. In contrast to passive generative approaches, active reasoning enables a model to take deliberate actions during inference—such as retrieving supplemental information, generating intermediate hypotheses, performing consistency checks, or soliciting human feedback when necessary—based on its current uncertainty and knowledge state. Recent advances in multi-step reasoning, retrieval-augmented generation, and free-energy–based decision mechanisms have demonstrated significant improvements in logical coherence, robustness, and interpretability across many tasks, establishing a solid methodological foundation for building more trustworthy evaluation frameworks [4]. Furthermore, the development of metaverse-related technologies has accelerated multimodal semantic fusion within intelligent sensing architectures, enabling richer representations and deeper understanding of complex information environments [5,6].
Artificial intelligence (AI) technologies, particularly deep learning-based natural language processing (NLP) and machine learning (ML) methods, have been widely applied in the evaluation and analysis of various groups, including talents, students, employees, and others. These technologies mainly focus on automating the analysis of large datasets, mining potential patterns and features, and assisting in evaluation tasks. Specifically, language models based on Transformer architectures (e.g., GPT variants) demonstrate strong capabilities in text understanding and generation, efficiently handling complex textual information and generating valuable reports. Variational autoencoders (VAEs) are commonly used for dimensionality reduction and data generation, helping to simplify the processing of complex data. Text-based diffusion models also show significant advantages in text generation and semantic enhancement [7,8].
However, existing AI-based evaluation methods still face several challenges in practical applications. First, most existing approaches rely on static data analysis: they primarily handle historical data and lack adaptability to dynamic change. These methods fail to capture the potential and future development of talents, students, or employees across contexts. For example, existing technologies cannot accurately simulate an individual’s performance changes in different academic, work, or life environments, nor can they generate long-term predictions of individual potential. Furthermore, these methods often lack personalized growth pathways and development suggestions, failing to provide targeted development support for individuals [3]. Second, current AI models often rely on large volumes of historical data, especially for training on standardized tasks, making them susceptible to data bias. For instance, certain emerging fields or materials from less-represented languages may be underrepresented in datasets, leading to potential unfairness in evaluations [9]. In diverse environments such as higher education in particular, existing models fail to adequately handle differences between fields, potentially introducing bias toward specific fields or groups. Third, existing methods struggle to model complex semantic relationships and to understand multidimensional, multilayered information. They rely primarily on static corpora and standardized datasets for evaluation, which limits their performance on complex, cross-domain tasks [10].
In response to these issues, this study proposes an evaluation framework formulated within the principles of active inference, providing a unified probabilistic treatment of temporal evidence, latent structure, and policy selection. The approach models observed textual sequences as emissions of an underlying latent dynamical system, allowing the evolution of hidden states to be represented explicitly through a parametric state–space formulation. A representation and augmentation module first constructs a semantically coherent embedding space, ensuring that the generative and inference components operate on continuous variables that satisfy the assumptions of variational state estimation. These embeddings are then coupled with a transition model and a preference model, together defining a complete generative density over future trajectories. Within this formalism, alternative analytical policies correspond to distinct predictive distributions, whose plausibility is quantified via the free energy of the expected future—an objective that naturally balances epistemic value, predictive sufficiency, and consistency with encoded preferences. A diffusion–based generative mechanism is finally employed to map inferred latent trajectories to structured textual evaluations, thereby closing the perception–inference–generation loop implied by the active inference principle. This formulation establishes a mathematically coherent pipeline in which representation learning, latent dynamics, and generative outputs arise as components of a single variational inference framework.
The main contributions of this study are summarized as follows:
  • A unified active-inference formulation for evaluation. We recast the evaluation problem as variational inference in a latent dynamical system, thereby establishing a principled probabilistic framework that simultaneously handles temporal structure, uncertainty propagation, and policy selection.
  • A multi-stage representational and dynamical modelling pipeline. We introduce a learning procedure that combines VAE-based corpus augmentation, contrastive semantic representation, and a parametric state–space model, enabling coherent integration of heterogeneous textual evidence into a continuous latent process.
  • A policy-driven generative mechanism based on expected free energy minimization. We derive a policy evaluation criterion grounded in the free energy of the expected future and couple it with a latent diffusion generator, allowing structured evaluation outputs to be produced from latent trajectories inferred through the active inference principle.
The remainder of this paper is organized as follows: Section 2 reviews related work on AI in talent evaluation. Section 3 elaborates on the proposed methodology. Section 4 presents experimental results and analysis. Section 5 concludes the paper.

2. Related Work

Research related to this study spans several areas, including text-based evaluation and talent analytics, clustering and feature selection for high-dimensional data, sequential and self-supervised modelling, inference frameworks for intelligent systems, and, more recently, active inference as a unified paradigm for perception, prediction, and policy selection. This section reviews these lines of work and highlights the gaps that motivate the proposed Active Inference–Driven Evaluation (AIDE) framework.

2.1. Text-Based Evaluation and Talent Analytics

A substantial body of work has explored how textual and behavioural data can be used to support evaluation and decision-making in organisational and educational settings. Arora and Damarla provide a review of generative AI–assisted talent management, emphasising applications such as employee engagement and retention strategies built on top of large language models and generative models [1]. Complementary to this, Ooi et al. survey the broader impact of generative AI across disciplines, including management, education, and information systems, and discuss the challenges associated with reliability, transparency, and institutional deployment [2]. Qin et al. offer a comprehensive survey of AI techniques for talent analytics, covering supervised, unsupervised, and emerging generative approaches for recruitment, performance prediction, and career path modelling [3]. These studies illustrate the growing interest in using AI to support complex evaluation tasks, but they mostly frame the problem in terms of discriminative prediction or task-specific pipelines rather than a unified probabilistic formulation.
In parallel, there has been rapid progress in generative AI and large language models (LLMs). Hagos et al. review recent advances in generative AI and LLMs, with an emphasis on model architectures, training paradigms, and deployment challenges [7]. Zhao et al. survey explainability techniques for LLMs, discussing attribution, probing, and mechanistic interpretability approaches that aim to make complex language models more transparent and controllable [8]. Wu et al. propose a tag-enriched multi-attention framework with LLMs for cross-domain sequential recommendation, showing that large language models can act as powerful sequence modellers when augmented with additional semantic signals [11]. Nezami et al. study fairness issues in predictive modelling for college student success, analysing how different imputation techniques affect both performance and fairness metrics [9]. Together, these works demonstrate the richness of the modelling toolkit available for evaluation tasks, but they typically treat representation, prediction, and policy design as loosely coupled components.

2.2. Clustering, Feature Selection, and Multisource Educational Analytics

Beyond direct text classification or regression, clustering and feature selection have been widely studied as tools for profiling and evaluation. Dhelim et al. present a survey of personality-aware recommendation systems, arguing that explicit modelling of user traits can mitigate cold-start and data-sparsity problems and can lead to more personalised educational or content recommendations [12]. Qu et al. propose a sparse multiple-kernel k-means algorithm that introduces $\ell_1$-based sparsity into the partition matrix to improve clustering robustness and interpretability, and study further formulations of the approach [13,14]. These contributions provide effective tools for unsupervised structure discovery but operate largely in a static setting, where the temporal evolution of entities is not explicitly modelled.
Xiong et al. develop a multi-feature fusion and selection method based on binary particle swarm optimisation, incorporating chaos-based initialisation and improved operators to accelerate convergence in high-dimensional spaces [15]. Such methods are well suited for selecting relevant descriptors in complex evaluation systems, but they do not provide a generative account of how observations arise from latent states, nor do they support explicit reasoning about possible futures.
In the context of education, Liu highlights the role of multi-source information fusion in cultivating autonomous learning ability, showing that the integration of heterogeneous indicators (motivation, learning time, behavioural logs) can yield more nuanced evaluation of learning progress [16]. These studies underline the importance of integrating heterogeneous signals for evaluation, but they typically rely on fixed aggregation rules rather than a fully probabilistic latent-variable model.
Social network analysis offers another perspective on evaluation and profiling. Zhang et al. propose methods based on predictive social networks to analyse group relationships among college students, demonstrating that network structure can provide additional context for academic performance and collaboration patterns [17]. These graph-based methods are highly informative for relational aspects but are not directly integrated with generative sequence models or active policy selection.

2.3. Sequential Modelling, Cognitive Factors, and Digital Evaluation

Sequential modelling and self-supervised learning have become central for capturing long-term temporal dependencies in behavioural data. Xu et al. propose a recommendation algorithm based on a self-supervised pretrain transformer, which first learns general sequence patterns and then fine-tunes on downstream tasks, enhancing accuracy in modelling user trajectories and actions [18]. Their work demonstrates the value of unsupervised pretraining for sequence data, but it focuses on prediction rather than on interpretable, policy-driven evaluation.
Cognitive and physical health factors have also been incorporated into evaluation frameworks. Hao et al. use data-driven methods and a priori analysis to model cognitive interventions for college students’ sports health, highlighting the influence of physical activity and cognitive interventions on academic outcomes [19]. These studies emphasise the multi-dimensional nature of evaluation but do not explicitly link these signals to latent state-space models or to normative criteria such as expected utility or free energy.
Digitalisation has reshaped talent cultivation and performance evaluation. Zhang and Yu investigate how digital marketing evaluation can be used to design applied undergraduate training programmes, emphasising the alignment between educational outcomes and industry demands [20]. Zhang et al. further propose mixed-methods frameworks that integrate importance-performance analysis (IPA) and Kaiser–Meyer–Olkin (KMO) measures to assess applied talent cultivation in data science education [21]. These works formalise evaluation as multi-criteria assessment but remain largely metric-based and do not employ explicit latent generative models.
In other domains, Li et al. develop the TLI-YOLO framework for rice disease detection, demonstrating how advanced deep-learning architectures can be adapted for mobile deployment in complex visual scenes [22]. Zhang et al. study stochastic-process-based degradation modelling from Brownian to fractional Brownian motion, providing tools for remaining useful life (RUL) prediction in engineering systems [23]. Both works highlight the importance of temporal modelling and uncertainty, but they focus on physical or visual signals rather than structured textual sequences.

2.4. Inference Frameworks, Generative Evaluation, and System-Level Considerations

Data-driven talent management and evaluation have also been studied from a systems and organisational perspective. Li and Zheng investigate human resource management models supported by wireless communication and association-rule mining, showing how data-driven patterns can inform staffing and retention strategies [24]. Arora and Damarla emphasise the role of generative AI in talent management strategies [1], while Ooi et al. discuss the cross-disciplinary potential of generative AI for organisational decision processes [2]. However, these approaches still rely on separate modules for data processing, scoring, and policy selection rather than a single variational objective.
Tuitt et al. propose a generative AI and multi-agent–based approach to psychometric evaluation, using generative models to design, administer, and interpret assessments [25]. Their work moves towards more flexible, data-driven evaluation frameworks, but it does not explicitly exploit latent-state dynamics or active inference principles.
In parallel, large-scale IoT and cloud–fog architectures have been investigated as supporting infrastructures for intelligent services. He et al. propose proactive personalised services in fog–cloud computing for healthcare [26], and Fu et al. design an intelligent cloud-computing framework for logistics alliances based on blockchain and big data [27]. These works are relevant from a systems-design perspective, but they do not specify the internal inference and evaluation mechanisms used by high-level decision modules.

2.5. Active Inference: Theory and Applications

Active inference has emerged as a powerful theoretical framework for modelling sentient behaviour as approximate Bayesian inference in generative models. Pezzulo, Parr, and Friston provide a comprehensive review of active inference as a theory of sentient behaviour, tracing its conceptual roots from Helmholtzian ideas on unconscious inference through predictive coding and hierarchical generative models to the modern formulation centred on minimising expected free energy [28]. The review emphasises that action, perception, and policy selection can be cast under a single objective, with hierarchically deep models supporting rich forms of inference and planning.
Beyond its neurobiological origin, active inference has been applied to engineering and AI problems. He et al. propose an active-inference-based approach for offloading LLM inference tasks and allocating resources in cloud–edge computing, showing that active inference can outperform deep reinforcement learning methods in data efficiency and adaptability to variable workloads [29]. Engström et al. model adaptive human driving behaviour as active inference, demonstrating that human-like trade-offs between progress and caution can emerge from policy selection driven by expected free energy minimisation [30]. Ren et al. formulate model trading strategies for connected and autonomous vehicles in Web3 as an active-inference problem, and introduce an intelligence-based reinforcement learning (IRL) scheme that leverages active inference to construct higher-level cognition without explicit reward functions [31].
These works show that active inference offers a principled alternative to traditional reinforcement learning or heuristic control, especially when explicit reward design is difficult or undesirable. However, active inference has been used primarily for sensorimotor control, resource allocation, and decision-making over physical or continuous state spaces. Its potential for text-based evaluation, where the observations are symbolic and high dimensional and where outputs themselves may be structured textual reports, remains under-explored.

2.6. Summary and Open Challenges

In summary, prior research provides strong building blocks in at least four dimensions: (i) advanced representation learning and clustering for textual and behavioural data; (ii) sequential modelling and self-supervised learning for temporal dynamics; (iii) generative and Bayesian frameworks for evaluation and scientific inference; and (iv) active inference as a general theory of perception, prediction, and action. Despite this rich landscape, at least two important gaps remain:
1. Lack of a unified variational framework for text-based evaluation. Existing methods typically treat semantic representation, temporal modelling, scoring, and policy selection as separate modules. There is limited work that formulates the entire evaluation process—from evidence accumulation through latent-state dynamics to structured textual reporting—as a single variational inference problem.
2. Limited application of active inference to symbolic, text-centric settings. Most active inference applications operate on low-level sensory streams or continuous control variables. How to adapt expected free energy minimisation to high-dimensional textual observations, and how to couple it with modern generative mechanisms such as diffusion models to produce interpretable reports, remain open questions.
The AIDE framework proposed in this paper addresses these gaps by integrating variational representation learning, latent dynamical modelling, active inference–based policy selection, and diffusion-based generative reporting within a single probabilistic architecture.

3. Methodology

This section presents a unified evaluation framework based on active inference, which integrates probabilistic representation learning, latent dynamical modelling, and generative reporting into a coherent computational architecture. The method operates by first transforming heterogeneous textual records into stable semantic embeddings through a representation and augmentation module, then modelling their temporal evolution via a latent state–space formulation that enables predictive reasoning over hypothetical futures. A generative evaluation mechanism subsequently selects analysis policies by minimising the free energy of the expected future and produces structured, human–interpretable outputs using a latent diffusion generator conditioned on inferred latent trajectories. The overall architecture of the proposed framework is illustrated in Figure 1. Source: author’s contribution.
Given an input sequence $x_{1:t}$ (ordinal number ① in Figure 1), the model first infers latent states via state–space modeling (②). Candidate policies are then evaluated by minimising the expected free energy (③), and the selected latent trajectory is finally decoded into a textual evaluation using a diffusion-based generator (④).

3.1. Problem Formulation

We consider a generic information system that generates heterogeneous textual records over time. At each discrete time step $t$, the system produces an observation $x_t \in \mathcal{X}$, such as a project description, a progress record, or a feedback entry. Given a finite history $x_{1:t} = \{x_1, \ldots, x_t\}$, the goal is to construct a structured evaluation output $y_t \in \mathcal{Y}$ and generate predictions for possible future trajectories $x_{t+1:T}$.
Rather than designing an evaluation rule directly in the observation space, we introduce a latent sequence $s_{1:T} = \{s_1, \ldots, s_T\}$ and cast the problem as probabilistic inference in a latent dynamical system. The evaluation process is formulated in the framework of active inference, where the model maintains probabilistic beliefs over future trajectories and chooses internal analysis policies that minimise a free-energy functional of the expected future.
This formulation is domain–agnostic and does not rely on a particular application scenario; evaluation in academic information systems is only one example of its potential use.

3.2. Overview of the Active Inference–Driven Evaluation Framework

The proposed Active Inference–Driven Evaluation Framework (AIDE) consists of three tightly interacting components:
Representation and augmentation layer: raw textual records are denoised, optionally augmented using a variational autoencoder (VAE), and projected into a continuous semantic space through a sentence–level encoder.
Latent dynamical layer: a state–space model describes the evolution of latent states and the generation of observations; it supports predictive distributions over hypothetical futures.
Generative evaluation layer: each candidate policy is scored by the free energy of the expected future. A conditional text generator, implemented as a latent diffusion model, produces human–readable evaluation outputs based on latent trajectories with low expected free energy.

3.3. Generative Model

Section 3.3 introduces the generative model underlying AIDE. This section is organized into three components: (1) representation and augmentation of textual inputs (Section 3.3.1), (2) latent state dynamics describing temporal evolution (Section 3.3.2), and (3) the observation and preference model (Section 3.3.3). Together, these components define the complete generative process on which active inference is performed.

3.3.1. Representation and Augmentation

For sentence-level representation, we employ a transformer-based encoder based on the BERT-base architecture, which produces 768-dimensional embeddings for each text segment.
Let $x \in \mathcal{X}$ denote an observed text segment. To increase robustness in data-sparse regimes, we first construct an augmented corpus using a VAE. The encoder $q_\phi(z \mid x)$ maps $x$ to a latent code $z$, and the decoder $p_\theta(x \mid z)$ reconstructs the text. The VAE parameters $(\phi, \theta)$ are obtained by maximising
$$\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right),$$
with standard Gaussian prior $p(z) = \mathcal{N}(0, I)$.
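As an illustration, the objective above has a closed form when the encoder posterior is a diagonal Gaussian and the decoder is Gaussian with unit variance. The following NumPy sketch computes the negative ELBO under those assumptions; function names and values are illustrative, not part of AIDE:

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, log_var):
    """Analytic KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def negative_elbo(x, x_recon, mu, log_var):
    """Negative of L_VAE: Gaussian reconstruction error (up to an additive
    constant, unit decoder variance) plus the KL regulariser."""
    recon_nll = 0.5 * np.sum((x - x_recon) ** 2)
    return recon_nll + kl_diag_gaussian_to_standard(mu, log_var)

# Perfect reconstruction with a posterior equal to the prior gives zero loss.
x = np.array([0.5, -1.0, 2.0])
loss = negative_elbo(x, x, mu=np.zeros(2), log_var=np.zeros(2))
```

Note that the KL term vanishes exactly when the posterior collapses onto the prior, which is why augmentation quality depends on keeping the reconstruction term competitive with the regulariser.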
Each text segment is next encoded as a semantic vector
$$e = f_{\mathrm{enc}}(x; \omega) \in \mathbb{R}^d,$$
using a sentence-level encoder. A contrastive objective encourages discriminative representation learning:
$$\mathcal{L}_{\mathrm{rep}} = -\sum_{(i,j) \in \mathcal{P}} \log \frac{\exp\left(\mathrm{sim}(e_i, e_j)/\tau_{\mathrm{contrast}}\right)}{\sum_{k} \exp\left(\mathrm{sim}(e_i, e_k)/\tau_{\mathrm{contrast}}\right)},$$
where $\mathcal{P}$ denotes the set of positive semantic pairs, $\mathrm{sim}(\cdot, \cdot)$ cosine similarity, and $\tau_{\mathrm{contrast}}$ a temperature parameter.
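A minimal NumPy sketch of this contrastive (InfoNCE-style) objective, assuming in-batch negatives and cosine similarity; names and the toy embeddings are illustrative only:

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(embeddings, positive_pairs, tau=0.1):
    """L_rep: for each positive pair (i, j), contrast sim(e_i, e_j) against
    sim(e_i, e_k) over all in-batch candidates k != i."""
    total = 0.0
    n = len(embeddings)
    for i, j in positive_pairs:
        logits = np.array([cosine_sim(embeddings[i], embeddings[k])
                           for k in range(n) if k != i]) / tau
        pos_idx = j - 1 if j > i else j  # position of j among candidates k != i
        total -= logits[pos_idx] - np.log(np.sum(np.exp(logits)))
    return total

# Identical positives and an orthogonal negative give a near-zero loss.
emb = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
loss = contrastive_loss(emb, positive_pairs=[(0, 1)])
```

Lowering the temperature sharpens the softmax, so mismatched positives are penalised more aggressively.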

3.3.2. Latent State Dynamics

Let $s_t \in \mathbb{R}^{d_s}$ denote the latent state at time $t$. We assume a first-order Markov process
$$p_\Theta(s_{1:T}) = p(s_1) \prod_{t=2}^{T} p_\Theta(s_t \mid s_{t-1}),$$
with Gaussian transitions
$$p_\Theta(s_t \mid s_{t-1}) = \mathcal{N}\left(s_t;\, m_\Theta(s_{t-1}),\, \Sigma_\Theta(s_{t-1})\right),$$
where $m_\Theta(\cdot)$ and $\Sigma_\Theta(\cdot)$ are parametrised functions.
The likelihood of an embedding $e_t$ given state $s_t$ is
$$p_\Theta(e_t \mid s_t) = \mathcal{N}\left(e_t;\, g_\Theta(s_t),\, R\right),$$
with $g_\Theta(\cdot)$ a decoding function and $R$ a diagonal noise covariance matrix.
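For intuition, a linear-Gaussian instance of these dynamics, taking $m_\Theta(s) = A s$, a constant $\Sigma_\Theta$, and a linear decoder $g_\Theta(s) = C s$ (matrices chosen here purely for illustration), can be rolled forward as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(s1, A, Q_chol, C, steps):
    """Sample a latent trajectory with s_t ~ N(A s_{t-1}, Q) and return the
    noise-free emission means C s_t at every step."""
    states, emissions = [s1], [C @ s1]
    s = s1
    for _ in range(steps - 1):
        s = A @ s + Q_chol @ rng.standard_normal(s.shape)
        states.append(s)
        emissions.append(C @ s)
    return np.stack(states), np.stack(emissions)

A = 0.9 * np.eye(2)            # stable linear transition (spectral radius < 1)
Q_chol = 0.1 * np.eye(2)       # Cholesky factor of the transition noise
C = np.array([[1.0, 0.0]])     # decode only the first latent dimension
states, emissions = rollout(np.ones(2), A, Q_chol, C, steps=50)
```

The same sampling loop underlies the Monte Carlo rollouts used later for policy evaluation, with the linear maps replaced by the learned $m_\Theta$ and $g_\Theta$.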

3.3.3. Observation and Preference Model

We introduce evaluation quantities $r_t$ (e.g., scalars or low-dimensional descriptors) together with embeddings $e_t$, forming $o_t = (e_t, r_t)$. Over a horizon $T$, the joint model is
$$p_\Phi(o_{1:T}, s_{1:T}, \Theta) = p(\Theta)\, p(s_1) \prod_{t=1}^{T} p(e_t \mid s_t, \Theta)\, p(r_t \mid s_t, \Phi)\, p(s_{t+1} \mid s_t, \Theta).$$
The term $p(r_t \mid s_t, \Phi)$ encodes preferred behaviour or preferred trajectories in a domain-specific manner.

3.4. Variational Inference and Policy Evaluation

Section 3.4 presents the variational inference framework used to approximate posterior beliefs and evaluate alternative policies. The discussion proceeds in three steps: (1) policies and predictive beliefs (Section 3.4.1), (2) the definition and decomposition of expected free energy (Section 3.4.2), and (3) numerical approximations used in practice (Section 3.4.3). These subsections together describe how AIDE infers future trajectories and selects actions.

3.4.1. Policies and Predictive Beliefs

A policy $\pi$ denotes a configuration of internal analysis choices over a finite horizon $H$. For each policy, a variational density approximates the predictive distribution:
$$q(o_{t:t+H}, s_{t:t+H}, \Theta, \pi) = q(\pi)\, q(\Theta) \prod_{\tau=t}^{t+H} q(s_\tau \mid s_{\tau-1}, \Theta, \pi)\, q(o_\tau \mid s_\tau, \Theta, \pi).$$

3.4.2. Free Energy of the Expected Future

The free energy of the expected future is defined as
$$\tilde{F} = \mathrm{KL}\left(q(o_{t:t+H}, s_{t:t+H}, \Theta, \pi) \,\|\, p_\Phi(o_{t:t+H}, s_{t:t+H}, \Theta)\right).$$
For fixed $\pi$,
$$\tilde{F}_\pi = \mathrm{KL}\left(q(o_{t:t+H}, s_{t:t+H}, \Theta \mid \pi) \,\|\, p_\Phi(o_{t:t+H}, s_{t:t+H}, \Theta)\right).$$
Optimal policy posteriors satisfy
$$q(\pi) \propto \exp(-\tilde{F}_\pi).$$
A decomposition yields
$$\tilde{F}_\pi \approx -\,\mathbb{E}_{q(o_{t:t+H} \mid \pi)}\left[\mathrm{KL}\left(q(s, \Theta \mid o, \pi) \,\|\, q(s, \Theta \mid \pi)\right)\right] + \mathbb{E}_{q(s, \Theta \mid \pi)}\left[\mathrm{KL}\left(q(r \mid s, \Theta, \pi) \,\|\, p_\Phi(r)\right)\right],$$
where the first term measures expected information gain, and the second measures how closely predicted evaluation quantities align with the preference distribution.

3.4.3. Numerical Approximation

Exact computation is intractable, so Monte Carlo rollouts from the dynamical model are used. Parameter-related information gain is estimated via entropy differences, e.g.,
$$\mathbb{E}_{q(s)}\left[\mathrm{KL}\left(q(\Theta \mid s) \,\|\, q(\Theta)\right)\right] = \mathcal{H}\left[\mathbb{E}_{q(\Theta)}\, q(s \mid \Theta)\right] - \mathbb{E}_{q(\Theta)}\, \mathcal{H}\left[q(s \mid \Theta)\right].$$
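The preference-alignment term of $\tilde{F}_\pi$ can likewise be approximated by sampling. The sketch below is a deliberate simplification, assuming scalar evaluation quantities and a Gaussian preference prior (all names and the two toy "policies" are illustrative): it moment-matches Monte Carlo rollouts of $r$ to a Gaussian and compares it with $p_\Phi(r)$ in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for scalars."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def preference_divergence(sample_r, pref_mu, pref_var, n_samples=2000):
    """Monte Carlo estimate of the preference term: moment-match sampled
    evaluation quantities r to a Gaussian and compare it with p_Phi(r)."""
    r = np.array([sample_r(rng) for _ in range(n_samples)])
    return gaussian_kl(r.mean(), r.var(), pref_mu, pref_var)

# A policy whose rollouts land near the preferred value scores lower.
kl_near = preference_divergence(lambda g: 1.0 + 0.1 * g.standard_normal(), 1.0, 0.01)
kl_far = preference_divergence(lambda g: 3.0 + 0.1 * g.standard_normal(), 1.0, 0.01)
```

The moment-matching step is an extra approximation on top of the Monte Carlo estimate; with richer preference models the KL would itself be estimated from samples.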

3.5. Generative Evaluation via Latent Diffusion

A latent diffusion model produces evaluation outputs conditioned on low-free-energy latent trajectories. Let $h_0$ denote a pooled latent code. The forward process applies
$$q(h_\tau \mid h_{\tau-1}) = \mathcal{N}\left(h_\tau;\, \sqrt{1 - \beta_\tau}\, h_{\tau-1},\, \beta_\tau I\right), \quad \tau = 1, \ldots, T,$$
with noise schedule $\{\beta_\tau\}$. The denoising network $\epsilon_\psi(h_\tau, \tau, s)$ minimises
$$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}\left[\left\| \epsilon - \epsilon_\psi(h_\tau, \tau, s) \right\|^2\right].$$
The reverse trajectory yields $\hat{h}_0$, which the decoder maps to textual form:
$$y_t = f_{\mathrm{dec}}(\hat{h}_0;\, \psi_{\mathrm{dec}}).$$
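The forward noising process is straightforward to simulate. This NumPy sketch (the linear schedule and dimensions are arbitrary illustrative choices) shows how the signal component of $h_0$ is attenuated by $\prod_\tau \sqrt{1-\beta_\tau}$ while Gaussian noise accumulates:

```python
import numpy as np

rng = np.random.default_rng(2)

def forward_diffuse(h0, betas):
    """Apply q(h_tau | h_{tau-1}) step by step to a pooled latent code."""
    h = h0.copy()
    for beta in betas:
        h = np.sqrt(1.0 - beta) * h + np.sqrt(beta) * rng.standard_normal(h.shape)
    return h

betas = np.linspace(1e-4, 0.2, 100)   # illustrative linear noise schedule
h0 = np.ones(8)
hT = forward_diffuse(h0, betas)

# The surviving signal component is prod(sqrt(1 - beta_tau)); it decays
# towards zero, so h_T approaches an isotropic Gaussian.
signal_scale = np.prod(np.sqrt(1.0 - betas))
```

The reverse trajectory inverts this process with the learned denoiser $\epsilon_\psi$, conditioned on the selected latent trajectory $s$.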

3.6. Learning Procedure

The learning procedure of the proposed framework is decomposed into three coordinated components, each formalised as an independent algorithm while remaining tightly interconnected. Algorithm 1 first establishes the representational foundation by performing variational augmentation of the raw corpus and learning discriminative semantic embeddings that serve as the input space for subsequent modelling stages. Building on these representations, Algorithm 2 trains the latent dynamical model and estimates the preference distribution, thereby capturing both the temporal structure of the underlying process and the target behavioural tendencies encoded by domain knowledge. Algorithm 3 then integrates these learned elements within an active inference loop: candidate policies are evaluated through approximations of their expected free energy, and the most plausible policy guides the diffusion-based generator to produce structured evaluation outputs. Together, the three algorithms form a coherent learning pipeline, progressing from representation to dynamical modelling and finally to policy-driven generative evaluation.
Algorithm 1 Representation Learning and Data Augmentation
Require: Raw text corpus X, VAE parameters (ϕ, θ), encoder parameters ω
Ensure: Augmented corpus X′, trained (ϕ, θ, ω)
VAE-based Augmentation
 1: for each mini-batch B ⊂ X do
 2:     Encode each x ∈ B using q_ϕ(z | x)
 3:     Reconstruct x̂ ∼ p_θ(x | z)
 4:     Update (ϕ, θ) by maximising L_VAE
 5:     Retain x̂ only if semantic checks are satisfied
 6: end for
 7: Form X′ = X ∪ X̃, where X̃ denotes the retained reconstructions
Semantic Representation Learning
 8: for each mini-batch B ⊂ X′ do
 9:     Compute embeddings e = f_enc(x; ω)
10:    Construct contrastive pairs
11:    Update ω by minimising L_rep
12: end for
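A minimal sketch of the contrastive objective used in the semantic representation stage (an InfoNCE-style loss is assumed here; the paper does not fix the exact form of L_rep, and all names are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss: the cross-entropy of identifying
    the positive pair among all candidates, pulling semantically similar
    texts together and pushing dissimilar ones apart."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # log-sum-exp with max-shift for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)
```

The loss is near zero when the anchor matches its positive and far from all negatives, and large in the reverse situation.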
Algorithm 2 Learning the Dynamical and Preference Models
Require: Embedded sequences { e t } , initial dynamical parameters Θ
Ensure: Trained dynamical model Θ and preference parameters Φ
Learning the Latent Dynamics
 1: for each sequence {e_t}_{t=1}^{T} do
 2:     Initialise s_1
 3:     for t = 2 to T do
 4:         Predict s_t using p_Θ(s_t | s_{t−1})
 5:         Evaluate likelihood p_Θ(e_t | s_t)
 6:    end for
 7:    Update Θ by gradient descent on sequence likelihood
 8: end for
Estimating Preference Model
 9: Collect representative desirable evaluation quantities { r t }
10: Fit preference parameters Φ by maximum likelihood
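The sequence-likelihood computation in Algorithm 2 can be illustrated with a scalar linear-Gaussian stand-in for the GRU-parameterised dynamics; this is a didactic simplification with made-up names, not the trained model:

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def sequence_loglik(e, a, c, r_var, s1=1.0):
    """Log-likelihood of an embedded sequence e under a scalar
    linear-Gaussian stand-in for the latent dynamics:
    transition mean s_t = a * s_{t-1}, observation mean e_t = c * s_t,
    observation noise variance r_var. A deterministic mean rollout is
    used for simplicity (process noise is ignored)."""
    s, ll = s1, 0.0
    for e_t in e:
        ll += gaussian_logpdf(e_t, c * s, r_var)  # evaluate p(e_t | s_t)
        s = a * s                                 # predict next latent mean
    return ll
```

Gradient ascent on this quantity over (a, c, r_var) corresponds to the parameter update in step 7 of the algorithm.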
Algorithm 3 Active Inference and Diffusion-Based Evaluation Generation
Require: Posterior q ( s t ) , dynamical model Θ , preference model Φ
Require: Number of candidate policies J, diffusion model ( ψ , ψ dec )
Ensure: Evaluation output y t
 1: Generate candidate policies { π ( j ) } j = 1 J
Policy Evaluation via Expected Free Energy (Steps 2–5)
 2: for j = 1 to J do
 3:    Approximate F̃_{π^{(j)}} using Monte Carlo rollouts
 4:    Compute policy weight q(π^{(j)}) ∝ exp(−F̃_{π^{(j)}})
 5: end for
 6: Select π* = arg max_{π^{(j)}} q(π^{(j)})
Diffusion-Based Generation (Steps 7–10)
 7: Infer latent trajectory {s_τ}_{τ=t}^{t+H} under π*
 8: Pool trajectory into latent code h 0
 9: Apply reverse diffusion to obtain h ^ 0
10: Generate output y t = f dec ( h ^ 0 ; ψ dec )
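Steps 2–6 of Algorithm 3 reduce to a softmax over negative expected free energy followed by MAP selection, which can be sketched as follows (the EFE estimates are assumed to be given, e.g. from Monte Carlo rollouts; the helper name is our own):

```python
import math

def select_policy(efe_estimates):
    """Form the policy posterior q(pi) proportional to exp(-EFE) and
    return the index of the most plausible policy together with the
    full posterior, mirroring steps 2-6 of Algorithm 3."""
    m = min(efe_estimates)  # shift for numerical stability
    weights = [math.exp(-(f - m)) for f in efe_estimates]
    z = sum(weights)
    q = [w / z for w in weights]
    # arg max of q(pi) coincides with arg min of the EFE estimates
    best = min(range(len(efe_estimates)), key=lambda j: efe_estimates[j])
    return best, q
```

The selected index then conditions the latent trajectory used by the diffusion generator.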

3.7. Computational Complexity

The computational complexity of the proposed framework primarily depends on the number of planning steps (horizon length) and the number of candidate policies. Specifically, the Monte Carlo rollouts used to estimate the expected free energy involve sampling multiple trajectories for each policy. The time complexity for each policy evaluation is therefore proportional to the number of planning steps N, the number of Monte Carlo samples M, and the number of candidate policies J. As a result, the overall time complexity of policy evaluation is O ( N × M × J ) .
In addition, the memory complexity is influenced by the need to store sampled trajectories and corresponding free energy estimates for each policy. Thus, the memory complexity is also O ( N × M × J ) .
While this computational cost increases with longer planning horizons and a larger number of candidate policies, the framework is scalable for practical applications, particularly when combined with approximation techniques such as importance sampling or batch processing. For larger-scale or real-time applications, optimization strategies such as parallel computing or reducing the number of candidate policies may be employed to mitigate computational overhead.
To provide a sense of the practical feasibility of our approach, we estimate that for a planning horizon of N = 100 and J = 10 policies, the framework requires approximately X hours for computation on a standard CPU with Y GB of memory, assuming M = 1000 Monte Carlo samples per policy.

4. Experiments

4.1. Experimental Setup

This section describes the experimental setup. The dataset is introduced in Section 4.1.1, the evaluation task is formulated in Section 4.1.2, and implementation details are provided in Section 4.1.3.

4.1.1. Dataset

The experiments were conducted on a curated longitudinal textual dataset referred to as AcademicText-2025. The dataset consists of heterogeneous academic records, including project reports, feedback summaries, and process logs, which are organised into temporal sequences associated with individual entities. These records were collected from institutional academic information systems and reflect real-world evaluation scenarios.
Due to privacy, confidentiality, and institutional policy constraints, the AcademicText-2025 dataset is not publicly released. However, the dataset can be made available for research purposes upon reasonable request to the corresponding author. Access is granted after completing a data usage agreement that specifies research-only use and compliance with relevant privacy and ethical regulations. The publication of aggregated experimental results based on this dataset is permitted under the terms of this agreement. The dataset does not contain personally identifiable information. All potentially sensitive attributes were anonymised or removed prior to analysis in accordance with applicable data protection regulations.
The dataset was pre-processed following the protocol described in Section 3. All documents were converted to a unified encoding format, tokenised, normalised, and truncated or padded to a fixed maximum length. Sequences with missing timestamps, incomplete evaluation records, or insufficient temporal length were excluded to ensure the reliability of subsequent modelling and evaluation. For transparency and reproducibility, the key characteristics of the dataset and the experimental environment are summarised in Table 1.
The remaining sequences are partitioned into training, validation, and test subsets at the sequence level using random sampling, with proportions of 70%, 15%, and 15%, respectively. To avoid information leakage, all sequences associated with the same entity are assigned exclusively to a single subset. The random split is performed once using a fixed random seed to ensure reproducibility.
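The leakage-free, entity-level split described above can be sketched as follows; the helper name and data layout are illustrative assumptions, while the fixed seed and 70/15/15 ratios follow the stated protocol:

```python
import random
from collections import defaultdict

def entity_level_split(sequences, seed=42, ratios=(0.7, 0.15, 0.15)):
    """Partition (entity_id, sequence) pairs into train/val/test so that
    all sequences of an entity land in exactly one subset, avoiding
    information leakage. Entity IDs are shuffled once with a fixed seed
    for reproducibility."""
    by_entity = defaultdict(list)
    for entity_id, seq in sequences:
        by_entity[entity_id].append(seq)
    ids = sorted(by_entity)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train = [s for i in ids[:n_train] for s in by_entity[i]]
    val = [s for i in ids[n_train:n_train + n_val] for s in by_entity[i]]
    test = [s for i in ids[n_train + n_val:] for s in by_entity[i]]
    return train, val, test
```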

4.1.2. Task Formulation

Given the history of textual observations x_{1:t} = {x_1, …, x_t} for a particular entity, the task is to generate an evaluation report y_t that summarises the current state and, when appropriate, reflects short-term expectations about future development. All models in the comparison take the same input (a history of texts) and are required to output a natural-language evaluation.
Some baselines naturally produce numerical scores or categorical labels rather than free-form text. For these methods, we convert their outputs into templated sentences so that all approaches can be compared under a uniform text-based evaluation protocol.
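As a minimal illustration of this conversion (the actual templates are not specified in the paper; the thresholds and wording below are assumptions):

```python
def score_to_text(score, thresholds=(0.33, 0.66)):
    """Convert a scalar baseline score in [0, 1] into a templated
    sentence so that score-producing baselines can be compared under
    the same text-based metrics as the generative models."""
    if score < thresholds[0]:
        level = "below expectations"
    elif score < thresholds[1]:
        level = "in line with expectations"
    else:
        level = "above expectations"
    return f"The current performance is {level}."
```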

4.1.3. Implementation Details for AIDE

In AIDE, the representation and augmentation module employs a transformer-based sentence encoder with embedding dimension d = 768. The VAE consists of two fully connected layers in both encoder and decoder and assumes a standard Gaussian prior over latent codes. It is first trained in an unsupervised fashion on the entire corpus to obtain a smooth latent space. On top of this VAE, the encoder is further fine-tuned using a contrastive learning objective to enforce semantic discrimination between similar and dissimilar text pairs. The augmented corpus X′ includes both original and quality-controlled synthetic samples, providing richer evidence for subsequent latent dynamical modelling.
The latent dynamical layer uses a state dimension d s in the range of 64 to 128. The state transition function m Θ ( · ) and the observation mapping g Θ ( · ) are parameterised by gated recurrent units (GRUs), which offer a good compromise between modelling capacity and computational efficiency. The planning horizon H and the number of candidate policies J are treated as hyperparameters of the active inference component; they are varied in sensitivity analyses but set to moderate default values in the main experiments to balance temporal coverage and computational cost.
The diffusion-based generator operates in a latent space obtained by pooling the sequence of latent states into an initial code h 0 . A forward noising process then transforms h 0 into approximately isotropic Gaussian noise over T discrete steps. A residual denoising network with time-step embeddings is trained to reverse this process and reconstruct a clean latent representation h ^ 0 , which is finally decoded into a textual evaluation. All models are trained with the Adam optimiser. Learning rates and batch sizes are selected based on validation performance, and early stopping is used to avoid overfitting. Once the hyperparameters are fixed on the validation set, the models are retrained on the union of training and validation data and evaluated on the held-out test set.

4.2. Baseline Methods

To assess the performance of AIDE, we compare it against four representative baselines, corresponding to traditional metric-based systems, discriminative deep models, end-to-end sequence models, and generative state-space models without active inference.
The first baseline, denoted MBE (Metric-Based Evaluation), represents conventional indicator-aggregation systems. Hand-crafted quantitative indicators, such as counts of specific events, keyword frequencies, or rubric-based scores, are aggregated through linear or tree-based models to produce numerical evaluation outputs. Because MBE does not generate free-form text, we convert its scalar outputs into short templated sentences in order to evaluate it under the same metrics as the generative models.
The second baseline, DAI (Discriminative Assessment Model), is a typical deep discriminative model. It encodes the history of texts into a sequence of embedding vectors and feeds them to a feedforward network or a sequence-to-sequence decoder to predict evaluation labels or generate text directly. DAI does not include an explicit latent dynamical model and does not perform active policy selection; it relies on the fitting capacity of deep networks to capture correlations between histories and evaluation outputs [9].
The third baseline, TRANS-SEQ, is a standard transformer-based sequence-to-sequence model. It treats the historical texts as the source sequence and the evaluation as the target sequence, and learns a direct mapping via end-to-end maximum likelihood training. This model can generate fluent text but does not expose an explicit latent state nor incorporate any notion of active inference or planning [32].
The fourth baseline, SSM-GEN, is a generative state-space model without active inference. It learns a latent state evolution and uses a conditional decoder to generate evaluation texts, but employs fixed or heuristic strategies when choosing how to roll out latent trajectories and configure reports. SSM-GEN therefore isolates the contribution of latent dynamics without the additional benefits of policy selection under expected free energy [33].

4.3. Evaluation Metrics and Human Assessment

Model performance is evaluated using a combination of automatic metrics and human judgements.
From the automatic perspective, we first compute BLEU scores to quantify n-gram overlap between generated and reference evaluations, considering n from 1 to 4 and using the standard weighted average. BLEU reflects local lexical and phrase-level similarity. Second, we compute ROUGE-L, which measures the longest common subsequence between generated and reference texts; this metric emphasises sentence-level coverage and captures whether the generator reproduces key semantic fragments. Third, we use BERTScore to assess semantic similarity based on contextual embeddings, thereby going beyond surface-level token overlap.
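For concreteness, a minimal single-reference BLEU with uniform 1–4-gram weights and the standard brevity penalty might look as follows; this is an unsmoothed sketch for illustration, not the evaluation code used in the experiments:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times the brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:  # no smoothing: any zero precision gives BLEU 0
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)
```

Production evaluations typically use a standard implementation with smoothing and multi-reference support rather than a hand-rolled version.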
To evaluate how well generated evaluations are aligned with subsequent developments, we design sequence consistency measures. At an evaluation time t, we extract key forward-looking information from the generated text (such as stated trends or categorical judgements) and compare it with features derived from the realised trajectory between t + 1 and t + H . The consistency score reflects the degree to which the generated evaluation anticipates the direction and pattern of future observations, without requiring exact numeric prediction.
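A toy version of such a consistency check, assuming the forward-looking information has already been reduced to a signed trend label (the extraction step itself is outside this sketch), could be:

```python
def consistency_score(stated_trend, future_values):
    """Compare the trend stated in a generated evaluation
    (+1 improving, -1 declining, 0 stable) with the realised direction
    over the next H observations. Returns 1.0 on agreement and 0.0
    otherwise; the sign of the net change is a toy stand-in for the
    feature extraction described in the text."""
    delta = future_values[-1] - future_values[0]
    realised = (delta > 0) - (delta < 0)  # sign of the net change
    return 1.0 if realised == stated_trend else 0.0
```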
Human evaluation is conducted by domain experts who independently rate a stratified subset of test cases. For each sampled case, several model outputs are presented in random order without revealing their source. Experts score the reports along several dimensions, including informativeness (extent to which available evidence is used), internal coherence (logical and linguistic consistency), alignment with evidence (faithfulness to the observed history), and overall readability. Scores are standardised across raters to mitigate individual bias. Specifically, for each rater and each evaluation criterion, raw scores are normalised using z-score standardisation by subtracting the rater-specific mean and dividing by the corresponding standard deviation. The resulting normalised scores are then aggregated across raters by averaging to obtain the final human evaluation score for each generated report.
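The per-rater z-score standardisation and cross-rater averaging described above can be sketched as follows (function name and data layout are our own):

```python
import math

def standardise_ratings(ratings):
    """Per-rater z-score normalisation followed by averaging across
    raters. `ratings` maps each rater to a list of raw scores, one per
    generated report (same ordering for every rater)."""
    normalised = {}
    for rater, scores in ratings.items():
        mean = sum(scores) / len(scores)
        var = sum((s - mean) ** 2 for s in scores) / len(scores)
        std = math.sqrt(var) or 1.0  # guard against zero variance
        normalised[rater] = [(s - mean) / std for s in scores]
    n_items = len(next(iter(normalised.values())))
    return [sum(normalised[r][i] for r in normalised) / len(normalised)
            for i in range(n_items)]
```

Two raters who rank the reports identically but use different scales produce the same normalised profile, which is the point of the standardisation.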

4.4. Overall Quantitative Performance

The main quantitative results are summarised in Table 2. Across BLEU, ROUGE-L, and BERTScore, AIDE achieves the best overall performance. Compared with TRANS-SEQ, AIDE exhibits higher lexical overlap and semantic similarity to the human-written references, indicating that combining latent dynamics with active inference yields more reference-like evaluations than purely end-to-end sequence modelling. Relative to the metric-based baseline MBE, AIDE produces richer and more context-sensitive texts, while also achieving considerably stronger automatic scores.
In terms of sequence consistency, AIDE shows the highest alignment between generated evaluations and subsequent observed development, outperforming both SSM-GEN and DAI. This suggests that selecting policies via expected free energy encourages the model to generate evaluations that are not only descriptive of the present but also coherent with the likely future under the learned dynamical model, echoing the theoretical role of active inference.

4.5. Effect of Active Inference and Policy Selection

To quantify the contribution of active inference, we consider two ablated variants of AIDE. In the first variant, denoted AIDE-noEI, the epistemic term in the expected free energy is removed, so that policy selection only considers preference alignment and ignores the value of information gain. In the second variant, AIDE-noPI, the diffusion generator is conditioned on a latent trajectory obtained from a single greedy rollout of the dynamical model, rather than on the posterior over policies; in other words, generation is no longer policy-informed.
The performance of these variants is reported in Table 3. Removing the epistemic term leads to a noticeable decrease in sequence consistency and a modest drop in BLEU and ROUGE-L, indicating that policies selected without accounting for information gain produce evaluations that are less informative about future developments. Removing policy-informed generation reduces the diversity and structural richness of generated texts; although some automatic scores remain close to those of the full AIDE, human evaluators more frequently describe these outputs as flat or lacking nuance. These findings support the conclusion that both epistemic and pragmatic components of expected free energy are important for the overall effectiveness of AIDE.

4.6. Component-Level Ablation

Beyond the active inference module, we examine the contributions of VAE-based augmentation, latent dynamical modelling, and diffusion-based generation. A variant without VAE augmentation (AIDE-noVAE) is trained solely on the original corpus. This model shows reduced robustness on sequences with sparse or noisy observations, confirming that VAE augmentation helps smooth the representation space and partially compensates for data sparsity. A variant without latent dynamics (AIDE-noDyn) replaces the state-space model with a simple pooling mechanism over embeddings. This approach maintains reasonable performance on very short histories but degrades on longer sequences, underscoring the importance of explicit temporal modelling. A variant without diffusion (AIDE-noDiff) employs a deterministic decoder conditioned directly on the pooled latent representation. While computationally cheaper, it produces less fluent and less diverse text, and its automatic metrics generally lag behind the full model. Table 4 summarises these component-level ablation results.

4.7. Sensitivity to Planning Horizon and Policy Set Size

The sensitivity of AIDE to the planning horizon H and the number of candidate policies J is investigated by varying one parameter at a time while keeping all others fixed. When H is very small, the model largely focuses on the immediate present, and the generated evaluations primarily restate current evidence, leading to relatively low sequence consistency. As H increases to a moderate range, both automatic metrics and consistency scores improve, since the model can account for longer-term trends in its policy selection and report generation. Beyond a certain horizon, however, further increases in H yield diminishing gains and can even destabilise training, due to the increased difficulty of long-horizon prediction and expected free energy estimation.
The effect of the number of candidate policies J exhibits a similar pattern. With very small J, exploration of the policy space is insufficient, and the posterior concentrates on suboptimal regions. Increasing J improves the coverage of the policy space and leads to better performance. Once J exceeds a moderate value, additional increases offer only marginal performance gains while substantially increasing computational cost. These trends suggest that AIDE can be operated with relatively modest values of H and J while capturing most of the benefits of active inference.
Figure 2 illustrates the performance of AIDE as a function of the planning horizon H, and Figure 3 shows the effect of varying J.
Figure 2 summarises the sensitivity of AIDE to the planning horizon H. As shown in Figure 2a, increasing H from very short values leads to consistent improvements in BLEU, ROUGE-L and BERTScore, indicating that a longer horizon allows the model to exploit more temporal structure when selecting policies and generating reports. However, the gains saturate once H enters a moderate range, and no further systematic improvement is observed for larger horizons. A similar pattern is observed for the sequence-level consistency score in Figure 2b: short horizons yield evaluations that mainly restate the present, whereas moderate horizons produce assessments that are better aligned with subsequent developments. Beyond that point, increasing H further introduces additional uncertainty into long-range predictions and expected free energy estimation, resulting in diminishing or even slightly negative returns. These results support our choice of a moderate planning horizon in the main experiments.
Figure 3 examines the effect of the number of candidate policies J on the performance of AIDE. When J is small, the model explores only a limited portion of the policy space and all automatic metrics remain relatively low. As J increases, BLEU, ROUGE-L and BERTScore improve steadily, indicating that a richer policy set allows the active inference procedure to identify more informative trajectories and, in turn, better evaluation reports. Beyond a moderate range, however, the curves flatten and additional policies yield only marginal gains, while the computational cost grows substantially. These observations suggest that AIDE can operate effectively with a moderately sized policy set, capturing most of the benefits of active inference without incurring excessive overhead.

4.8. Qualitative Behaviour and Case Analysis

To better understand the qualitative behaviour of AIDE, we examine generated evaluations for representative sequences. When the historical trajectory shows a clear pattern of steady improvement, AIDE typically produces reports that both summarise the current state and cautiously anticipate continued progress, often explicitly referencing recent changes and their possible implications. In contrast, when the history is volatile or ambiguous, the generated evaluations become more conservative and conditional in wording, sometimes articulating alternative possible developments. This qualitative behaviour reflects the uncertainty encoded in the latent states and is consistent with the active inference formulation, where policies are selected to balance explanatory adequacy and future plausibility.
Compared with the baselines, AIDE’s outputs are generally more structured and sensitive to temporal context. MBE tends to reuse a small set of templates and fails to respond to subtle changes in the trajectory. TRANS-SEQ can generate fluent text but occasionally hallucinates details that are not supported by the history, due to the lack of explicit constraints from a latent dynamical model. SSM-GEN benefits from latent states but, without principled policy selection, its evaluations are less forward-looking and sometimes less aligned with the implicit preferences encoded in the data. Figure 4 presents an illustrative example in which AIDE’s evaluation better captures the trajectory’s turning point than the competing methods.
Figure 4 presents a representative case study designed to illustrate how different models respond to a sequence exhibiting a clear turning point. As shown in the upper panel, the underlying latent performance signal increases steadily before undergoing a distinct reversal. The textual summaries generated by the baseline methods reveal their limitations: MBE produces template-like statements that ignore the trend shift, and TRANS-SEQ, although fluent, fails to capture the downturn. SSM-GEN is able to follow the overall trajectory but lacks coherent forward-looking commentary. In contrast, AIDE not only recognises the change in direction but also situates it within a broader temporal context, producing an assessment that is both more faithful to the observed history and more consistent with likely future developments. This example highlights the advantage of combining latent dynamical modelling with active inference for generating context-aware evaluations.

4.9. Discussion

The experimental results demonstrate that AIDE offers a competitive and conceptually coherent solution to sequential text evaluation. Its advantages are most pronounced when long and noisy histories must be condensed into concise reports that simultaneously describe the present and anticipate plausible futures. By formulating evaluation as active inference in a latent dynamical model, AIDE unifies representation learning, temporal modelling, policy selection, and generative reporting under a single variational principle.
At the same time, the experiments also reveal several limitations. Estimating expected free energy remains computationally demanding for very long planning horizons, and the quality of generated text still depends on the underlying language model and the diversity and coverage of the training corpus. These observations suggest promising directions for future work, including more efficient approximations of expected free energy, more compact parameterisations of the policy space, and hybrid architectures that integrate AIDE with stronger general-purpose or domain-specific language models.
Active inference offers a probabilistic framework for sequential decision-making in which actions are chosen by minimising expected free energy over time. This contrasts with reinforcement learning (RL), which typically learns a policy through trial and error driven by reward feedback. Whereas RL must manage exploration-exploitation trade-offs and is largely data-driven, active inference encodes uncertainty explicitly and treats perception and action within a single variational objective. A key difference is that active inference integrates prior knowledge with sensory evidence during decision-making, whereas RL generally adjusts its policy from experience and reward signals alone; this makes active inference particularly advantageous when informative priors are available and uncertainty plays a central role. Moreover, because active inference models belief updating directly, it can adapt to dynamic, changing environments without the retraining that RL often requires. Both paradigms aim to optimise sequential decisions, but active inference provides a more coherent probabilistic account that blends perception and action, offering a conceptual advantage in complex, uncertain settings.

5. Conclusions

This paper has proposed AIDE, an active inference–driven evaluation framework that formulates sequential text evaluation as variational inference in a latent dynamical system. Instead of treating representation, temporal modelling, scoring, and report generation as loosely coupled components, AIDE integrates VAE-based augmentation and contrastive semantic encoding, a parametric state–space model for latent dynamics, an expected free energy-based policy selection mechanism, and a diffusion-based text generator into a single probabilistic architecture. Within this formulation, evaluation reports arise as the outcome of active inference over future trajectories, balancing explanatory adequacy with alignment to encoded preferences. Experimental results on a longitudinal text corpus show that AIDE consistently outperforms representative metric-based, discriminative, sequence-to-sequence, and purely generative state-space baselines in terms of automatic metrics, alignment with subsequent developments, and expert judgements, and that each of its key components—especially the active inference module—contributes measurably to overall performance. At the same time, the study highlights several open challenges, including the computational cost of expected free energy estimation at long planning horizons, the dependence of generation quality on the underlying language model, and the need for more interpretable mechanisms to specify and learn preference structures. Future work will focus on more efficient approximations of active inference, compact parameterisations of policy spaces, and the integration of AIDE with stronger general-purpose or domain-specific language models, as well as on extending the framework to other sequential evaluation tasks beyond the textual domain.

Author Contributions

Conceptualization, X.C. and C.L.; methodology, X.C.; software, Y.W., J.C. and S.H.; validation, C.L., W.Y. and J.G.; formal analysis, W.W. and S.H.; investigation, X.C.; resources, X.C.; data curation, C.L.; writing—original draft preparation, X.C., W.W. and W.Y.; writing—review and editing, C.L., C.Z., Y.W., J.C. and J.G.; visualization, C.L.; supervision, C.L. and J.G.; project administration, C.L.; funding acquisition, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by The Study on the Classification and Evaluation Model of Continuing Education Personnel and Courses Based on Artificial Intelligence in the Big Data Environment under Grant No. J2023008.

Institutional Review Board Statement

Not applicable, as no human participants or identifiable personal data were involved.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and code used in this paper are solely for research and educational purposes. They can be obtained by contacting the corresponding author and signing a usage agreement.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; or in the writing of the manuscript.

References

  1. Arora, R.; Damarla, R.B. A Review on Generative AI Powered Talent Management, Employee Engagement and Retention Strategies: Applications, Benefits, and Challenges. Procedia Comput. Sci. 2025, 260, 683–691. [Google Scholar] [CrossRef]
  2. Ooi, K.-B.; Tan, G.W.-H.; Al-Emran, M.; Al-Sharafi, M.A.; Capatina, A.; Chakraborty, A.; Dwivedi, Y.K.; Huang, T.-L.; Kar, A.K.; Lee, V.H.; et al. The potential of generative artificial intelligence across disciplines: Perspectives and future directions. J. Comput. Inf. Syst. 2025, 65, 76–107. [Google Scholar] [CrossRef]
  3. Qin, C.; Zhang, L.; Cheng, Y.; Zha, R.; Shen, D.; Zhang, Q.; Chen, X.; Sun, Y.; Zhu, C.; Zhu, H. A comprehensive survey of artificial intelligence techniques for talent analytics. Proc. IEEE 2025, 113, 125–171. [Google Scholar] [CrossRef]
  4. Tschantz, A.; Millidge, B.; Seth, A.K.; Buckley, C.L. Reinforcement learning through active inference. arXiv 2020, arXiv:2002.12636. [Google Scholar] [CrossRef]
  5. Fadhel, M.A.; Duhaim, A.M.; Albahri, A.S.; Al-Qaysi, Z.T.; Aktham, M.A.; Chyad, M.A.; Abd-Alaziz, W.; Albahri, O.S.; Alamoodi, A.H.; Alzubaidi, L.; et al. Navigating the metaverse: Unraveling the impact of artificial intelligence—A comprehensive review and gap analysis. Artif. Intell. Rev. 2024, 57, 264. [Google Scholar] [CrossRef]
  6. Wang, H.; Ning, H.; Lin, Y.; Zhang, X.; Dhelim, S.; Farha, F. A survey on the metaverse: The state-of-the-art, technologies, applications, and challenges. IEEE Internet Things J. 2023, 10, 14671–14688. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed Active Inference–Driven Evaluation (AIDE) framework.
Figure 2. Sensitivity of AIDE to the planning horizon H. Panel (a) reports representative automatic metrics (BLEU, ROUGE-L, and BERTScore) as H varies, while panel (b) shows the sequence-level consistency score. Moderate horizons yield the best trade-off between performance and stability, with diminishing gains for excessively long planning horizons.
Figure 3. Sensitivity of AIDE to the number of candidate policies J. Performance improves with J up to a moderate range, after which the gains saturate, illustrating diminishing returns for very large policy sets.
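The saturation effect in Figure 3 follows from how active inference scores candidate policies: each of the J candidates receives an expected free energy combining a pragmatic (risk) term and an epistemic (ambiguity) term, and the minimiser is selected. The sketch below is a toy illustration of that selection rule only; all names, distributions, and values are hypothetical and do not reflect the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_free_energy(pred_mean, pred_var, preferred, obs_noise=0.1):
    """Toy expected free energy for one candidate policy:
    risk      = squared deviation of predicted outcomes from preferred outcomes,
    ambiguity = accumulated predictive uncertainty (epistemic term)."""
    risk = np.sum((pred_mean - preferred) ** 2)
    ambiguity = np.sum(pred_var + obs_noise)
    return risk + ambiguity

J, H = 8, 5                 # number of candidate policies, planning horizon
preferred = np.ones(H)      # preferred (goal) outcome trajectory

# Each hypothetical policy yields a predicted mean trajectory and variance.
policies = [(rng.normal(1.0, 0.3, H), rng.uniform(0.01, 0.2, H)) for _ in range(J)]

G = np.array([expected_free_energy(m, v, preferred) for m, v in policies])
best = int(np.argmin(G))    # select the policy with lowest expected free energy
print(best, G[best])
```

With more candidates, the minimum of G improves quickly at first and then flattens, which is consistent with the diminishing returns reported for large J.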
Figure 4. Representative case study illustrating model behaviour on a sequence with a pronounced turning point. The upper plot shows the underlying latent performance signal with a clear shift in trend. The text summaries below compare the evaluations generated by different methods. MBE yields template-like outputs and fails to notice the turning point; TRANS-SEQ produces fluent text but misses the downturn; SSM-GEN captures the general trend but offers limited forward-looking insight. By contrast, AIDE recognises the shift and generates a more context-aware and future-oriented assessment.
Table 1. Summary of the AcademicText-2025 dataset and experimental setup.
Item | Description
Dataset name | AcademicText-2025
Data type | Longitudinal textual records
Data source | Institutional academic information systems
Availability | Private; available upon request
Access procedure | Request via corresponding author; data usage agreement required
Usage license | Research-only use under signed agreement
Privacy handling | Anonymised; no personally identifiable information
Experimental hardware | NVIDIA RTX 3090 GPU (24 GB), Intel Xeon CPU, 128 GB RAM
Operating system | Ubuntu 20.04 LTS
Deep learning framework | PyTorch 2.0
Supporting libraries | HuggingFace Transformers, Diffusers
CUDA version | CUDA 11.8
Table 2. Overall performance comparison on the test set. Higher is better for all metrics.
Method | BLEU | ROUGE-L | BERTScore | Consistency | Human Score
MBE | 0.34 | 0.48 | 0.72 | 0.62 | 0.55
DAI | 0.42 | 0.54 | 0.75 | 0.67 | 0.60
TRANS-SEQ | 0.52 | 0.58 | 0.79 | 0.71 | 0.70
SSM-GEN | 0.55 | 0.60 | 0.81 | 0.74 | 0.73
AIDE | 0.60 | 0.63 | 0.83 | 0.76 | 0.77
Table 3. Ablation study on active inference components. AIDE-noEI removes the epistemic term in the expected free energy; AIDE-noPI removes policy-informed generation.
Method | BLEU | ROUGE-L | BERTScore | Consistency | Human Score
AIDE-noEI | 0.56 | 0.60 | 0.81 | 0.71 | 0.73
AIDE-noPI | 0.58 | 0.61 | 0.82 | 0.73 | 0.75
AIDE | 0.60 | 0.63 | 0.83 | 0.76 | 0.77
Table 4. Component-level ablation study for AIDE. “noVAE” removes VAE augmentation, “noDyn” removes latent dynamics, and “noDiff” replaces diffusion with a deterministic decoder.
Method | BLEU | ROUGE-L | BERTScore | Consistency | Human Score
AIDE-noVAE | 0.55 | 0.59 | 0.80 | 0.72 | 0.72
AIDE-noDyn | 0.52 | 0.56 | 0.78 | 0.69 | 0.68
AIDE-noDiff | 0.54 | 0.58 | 0.79 | 0.71 | 0.70
AIDE | 0.60 | 0.63 | 0.83 | 0.76 | 0.77
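The component-level ablations in Table 4 can be summarised by the average metric drop each removal causes relative to the full model. The snippet below uses the table's own numbers; the uniform averaging across metrics is our illustrative choice, not an analysis performed in the paper.

```python
# Scores taken directly from Table 4 ("Human" abbreviates Human Score).
full = {"BLEU": 0.60, "ROUGE-L": 0.63, "BERTScore": 0.83, "Consistency": 0.76, "Human": 0.77}
ablations = {
    "AIDE-noVAE":  {"BLEU": 0.55, "ROUGE-L": 0.59, "BERTScore": 0.80, "Consistency": 0.72, "Human": 0.72},
    "AIDE-noDyn":  {"BLEU": 0.52, "ROUGE-L": 0.56, "BERTScore": 0.78, "Consistency": 0.69, "Human": 0.68},
    "AIDE-noDiff": {"BLEU": 0.54, "ROUGE-L": 0.58, "BERTScore": 0.79, "Consistency": 0.71, "Human": 0.70},
}

# Mean drop across the five metrics when each component is removed.
drop = {name: round(sum(full[m] - v[m] for m in full) / len(full), 3)
        for name, v in ablations.items()}
worst = max(drop, key=drop.get)  # component whose removal hurts most
print(drop, worst)
```

Under this simple average, removing the latent dynamics (AIDE-noDyn) is the most damaging ablation, matching the qualitative ordering visible in the table.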

Chen, X.; Liu, C.; Zhang, C.; Wang, Y.; Chang, J.; He, S.; Wu, W.; Yu, W.; Guo, J. AIDE: An Active Inference-Driven Framework for Dynamic Evaluation via Latent State Modeling and Generative Reasoning. Electronics 2026, 15, 99. https://doi.org/10.3390/electronics15010099
