1. Introduction
With the rapid advancement of intelligent education, AI-driven efficient, personalized, and interpretable teaching feedback has become a key research focus in educational technology [1]. Traditional feedback relies heavily on manual teacher evaluation and experience, supplemented by inputs from students, peers, and supervisors, which is time-consuming and often inadequate for large-scale personalized teaching needs [2]. Advances in deep learning, especially large-scale models, enable the integration and analysis of multimodal data such as images and text, facilitating better understanding of teaching behaviors and intelligent feedback generation [3,4]. Li et al. demonstrated that intelligent tutoring systems can analyze student behaviors, emotions, progress, and answer accuracy in real time, delivering precise personalized feedback to support teaching strategy adjustments [5]. Combining image analysis with natural language processing allows examination of non-verbal cues (e.g., facial expressions, posture) alongside verbal outputs, enhancing feedback comprehensiveness and accuracy [5]. The adoption of large-model AI technologies improves classroom efficiency and effectiveness, driving transformative changes in educational technology [6,7].
Most current teaching feedback systems primarily handle structured text inputs. Abdi et al. developed DTLP, an automated deep learning system analyzing student feedback with features including word embeddings, affective knowledge, emotion rules, statistics, and linguistic characteristics [8]. Shaik et al. integrated NLP methods to annotate and respond to student feedback, enhancing AI's educational impact [9]. However, these systems show limitations in jointly modeling student work images (handwritten assignments, lab reports, sketches) and textual descriptions (reasoning, reflections), restricting comprehensive understanding of learning outputs. The rise of multimedia data has increased interest in semantic alignment across modalities [10]. Despite progress, deeper linguistic understanding remains a challenge for generating effective feedback. Joint image–text modeling offers a holistic view of student work, enabling precise personalized feedback and addressing shortcomings of traditional approaches [11]. Deploying large language models (LLMs) in education also faces critical challenges. At the semantic level, LLMs have notable limitations in parsing complex linguistic phenomena: Weissweiler et al. [12] showed that mainstream models such as GPT and Gemini fail to capture additional semantics in constructions such as causative motion (e.g., "sneeze causing object movement"), revealing gaps in deep semantic reasoning. These semantic weaknesses affect text generation and introduce linguistic patterns with privacy and security implications. Muñoz-Ortiz et al. [13] found that LLM-generated news texts exhibit simpler syntax, emotional imbalance (overemphasis on "joy"), and overuse of pronouns, numerals, and auxiliaries; such linguistic traits have been linked to privacy risks such as entity reference obfuscation and attribute inference [14].
Recent work suggests that injecting explicit linguistic knowledge can mitigate these issues. Zhang et al. [15] proposed LINGOLLM, which integrates grammar books and dictionaries into GPT-4, enabling efficient processing of endangered languages and achieving a BLEU score of 10.5, thus opening paths for low-resource language applications. This underscores the value of structured linguistic knowledge in enhancing multimodal and multilingual educational AI. Deep learning advancements in multimodal learning and joint image–text modeling have spurred interest in integrated analysis and feedback systems. Kumar et al. developed a system generating human-like feedback from paired text and image inputs [16]. Xie et al. highlighted multimodal interaction's advantages in education, combining speech, text, and visuals for richer outputs [17]. Liu et al. tackled personalized multimodal feedback generation, enabling customized responses across disciplines for assignments involving images, audio, and text [18]. To address these challenges, we propose a multimodal Transformer-based framework for teaching feedback generation, combining student-submitted images and textual descriptions with cross-modal attention. The primary contributions of this study are as follows:
A multimodal Transformer framework for automated teaching feedback generation is proposed, jointly modeling student-submitted images and textual content.
Three innovative modules are introduced: the image–text semantic alignment module, the personalized feedback generation module, and the cognitive weakness highlighter module.
Experimental validation demonstrates that the proposed method achieves high accuracy and strong practicality in various educational tasks, such as essay scoring, design feedback, and answer correction.
3. Materials and Method
3.1. Data Collection
To address the practical requirements of research on open-ended expression and structured critical feedback in law and economics education, the dataset was constructed under multimodal conditions, with broader coverage and a larger sample size to enhance model generalization and robustness across diverse instructional tasks, as shown in
Table 1. Data collection was conducted using undergraduate courses from three comprehensive universities, encompassing both core and elective subjects such as Principles of Civil Law, Public Finance, Legal Writing Practice, Principles of Economics, International Economic Law, and Public Finance Management. The image data were sourced from scanned copies of paper-based assignments, scanned classroom experiment and research reports, student-drawn argumentation diagrams, economic statistical charts, case analysis flowcharts, and photographs of classroom blackboard writing. Paper-based submissions were collected by course instructors and digitized using high-resolution scanners at 600 dpi. During acquisition, edge cropping, color correction, and contrast enhancement were applied to ensure the preservation of essential details and structural features. Certain assignments were completed directly on tablets or graphic tablets and submitted in PDF or JPEG format, maintaining an original resolution of no less than
pixels to facilitate subsequent image segmentation and feature extraction.
Text data were obtained from written responses and analytical reports submitted by students for assignments and projects, covering content types such as judgment reasoning, case commentary, economic policy review, statistical result interpretation, solution proposals, and reflective summaries. The original file formats included Word documents, PDF files, and text exported from online assignment platforms. All text data were collected with the original formatting, paragraphing, and punctuation preserved, along with metadata including task type, submission time, and assignment background, to support contextual modeling and feature analysis. In law-related tasks, the text typically contained statutory citations, case names, and logical argumentation structures, whereas in economics-related tasks, it more frequently included formula derivations, data interpretations, and chart references. Upon completion of collection, each image and text entry was paired via a unique identifier, forming one-to-one multimodal records to ensure consistency in cross-modal analysis.
Annotation guidelines were formulated by three senior law and economics educators, each with more than ten years of teaching and grading experience. The annotation dimensions included logical structure completeness, legal or economic theory application, accuracy of expression and use of technical terminology, and personalized improvement suggestions with reflective prompts. Each sample was independently annotated by at least two experts, with cross-checking performed to ensure consistency, and in cases of disagreement, a third annotator acted as arbiter to guarantee the authority and uniformity of the final annotations. The resulting dataset comprises 5000 multimodal assignment samples, with case analysis tasks accounting for the largest proportion, followed by statute application and economic chart interpretation, while policy commentary tasks were fewer but demonstrated diverse viewpoints and prominent critical thinking features. These data provide a solid foundation for achieving precise feedback generation in multidisciplinary and multi-expression-form educational scenarios.
3.2. Dataset Construction
Given the structural complexity, domain-specific characteristics, and expressive diversity of multimodal assignment data in the fields of law and economics, a systematic preprocessing and augmentation pipeline was designed prior to model training. This pipeline comprised three main stages: image processing, text processing, and data augmentation.
3.2.1. Image Processing
In the image processing stage, variations in resolution and clarity among assignment images from different sources were mitigated while key informational regions were preserved. All scanned images and handwritten sketches were first standardized in resolution by scaling them to a predefined size $H \times W$:

$$\hat{I} = \mathcal{R}_{\mathrm{bicubic}}(I, H, W),$$

where $I$ denotes the original image, $\mathcal{R}_{\mathrm{bicubic}}(\cdot)$ represents the bicubic interpolation resampling function, and $H$ and $W$ are the target image height and width, respectively. After standardization, handwriting, structural lines, and edge features were enhanced by applying a sharpening operator $\mathcal{S}(\cdot)$ followed by an edge enhancement filter $\mathcal{E}(\cdot)$, yielding:

$$I_{\mathrm{enh}} = \mathcal{E}\big(\mathcal{S}(\hat{I})\big).$$

To facilitate subsequent cross-modal alignment, a patchifying method was employed to divide the image into non-overlapping blocks of size $P \times P$. In combination with a diagram detection model $\mathcal{D}$, blocks containing key diagrams or structural graphics were automatically identified and retained, thereby reducing redundant background noise.
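For illustration, a minimal Python sketch of this preprocessing chain (bicubic resizing, sharpening, edge enhancement, and patchifying) is given below. It assumes Pillow and NumPy; the 384 × 384 target size and 32 × 32 patch size are placeholder values rather than the settings used in this study, and the diagram-detection step is omitted.

```python
# Illustrative sketch of the image preprocessing pipeline described above
# (resize -> sharpen -> edge enhance -> patchify). The target size and patch
# size are placeholders, not the paper's actual settings.
from PIL import Image, ImageFilter
import numpy as np

def preprocess_image(path, target_hw=(384, 384), patch=32):
    img = Image.open(path).convert("RGB")
    # Bicubic resampling to the predefined size (H, W)
    img = img.resize(target_hw[::-1], resample=Image.BICUBIC)
    # Sharpening operator followed by an edge enhancement filter
    img = img.filter(ImageFilter.SHARPEN)
    img = img.filter(ImageFilter.EDGE_ENHANCE)
    # Patchify into non-overlapping P x P blocks
    arr = np.asarray(img)                      # (H, W, 3)
    h, w, c = arr.shape
    patches = (arr.reshape(h // patch, patch, w // patch, patch, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch, patch, c))
    return patches                             # (num_patches, P, P, 3)
```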
3.2.2. Text Processing
In the text processing stage, a hierarchical procedure was implemented to accommodate the high density of domain-specific terminology, statute references, and policy entities in law and economics assignments, ensuring structural regularity and semantic accuracy. Initially, a domain-specific lexicon was applied for terminology recognition and normalization:

$$T_1 = \mathcal{N}_{\mathrm{term}}(T),$$

where $T$ is the original text and $\mathcal{N}_{\mathrm{term}}(\cdot)$ denotes the terminology normalization function based on the domain lexicon. Subsequently, a normalization parser $\mathcal{P}_{\mathrm{statute}}(\cdot)$ was used to encode statute references in a consistent format, while a policy entity recognition function $\mathcal{E}_{\mathrm{policy}}(\cdot)$ extracted named entities relevant to economic and legal policies, forming a structured element set:

$$S = \mathcal{P}_{\mathrm{statute}}(T_1) \cup \mathcal{E}_{\mathrm{policy}}(T_1).$$

Based on this, a citation cleaning function $\mathcal{C}(\cdot)$ was applied to remove extraneous symbols, standardize inconsistent citation formats, and eliminate irrelevant noise. Finally, a syntactic restoration operator $\mathcal{G}(\cdot)$ and a spell-checking function $\mathcal{S}_{\mathrm{spell}}(\cdot)$ were applied to reconstruct the original grammatical structure and reduce the impact of syntactic flaws on model comprehension, producing a structurally complete and semantically clear text sequence $T^{\ast}$:

$$T^{\ast} = \mathcal{S}_{\mathrm{spell}}\big(\mathcal{G}(\mathcal{C}(T_1))\big).$$
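The following sketch illustrates the shape of this hierarchical pipeline under simplifying assumptions: the toy lexicon, regular expressions, and function names stand in for the domain-specific tools described above and are not the actual implementation.

```python
# Minimal sketch of the hierarchical text-processing stage. Lexicon entries
# and patterns are illustrative placeholders only.
import re

DOMAIN_LEXICON = {"tort liability": "tort_liability", "GDP": "gross domestic product"}

def normalize_terms(text):                      # lexicon-based terminology normalization
    for term, canonical in DOMAIN_LEXICON.items():
        text = re.sub(re.escape(term), canonical, text, flags=re.IGNORECASE)
    return text

def extract_structured_elements(text):          # statute parsing and policy entity recognition (toy patterns)
    statutes = re.findall(r"Article\s+\d+", text)
    policies = re.findall(r"\b[A-Z][a-z]+ Policy\b", text)
    return {"statutes": statutes, "policies": policies}

def clean_citations(text):                      # strip stray symbols, unify citation format
    return re.sub(r"\s*\[\s*(\d+)\s*\]\s*", r" [\1] ", text).strip()

def preprocess_text(raw):
    t1 = normalize_terms(raw)
    elements = extract_structured_elements(t1)
    t_star = clean_citations(t1)                # syntactic restoration and spell checking omitted here
    return t_star, elements
```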
3.2.3. Data Augmentation
In the data augmentation stage, a feedback contrastive learning strategy was introduced to construct positive and negative feedback sample pairs $(x^{+}, x^{-})$ for each anchor sample $x$ under the same task context, where the content was similar but the quality of reasoning or viewpoint differed significantly. Semantic discrimination was achieved by minimizing the embedding distance between positive and anchor samples while maximizing the distance to negative samples:

$$\mathcal{L}_{\mathrm{con}} = -\log \frac{\exp\big(\mathrm{sim}(f(x), f(x^{+}))/\tau\big)}{\exp\big(\mathrm{sim}(f(x), f(x^{+}))/\tau\big) + \exp\big(\mathrm{sim}(f(x), f(x^{-}))/\tau\big)}.$$

Here, $f(\cdot)$ is the feature encoding function, $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, and $\tau$ is the temperature parameter. Furthermore, a cross-task synonymous structure perturbation strategy was developed, introducing structurally equivalent but lexically varied replacements across different tasks to improve model robustness and generalization in ambiguous assignment contexts. This preprocessing and augmentation pipeline not only improved the quality and consistency of the input data but also established a stable feature foundation for subsequent cross-modal alignment and personalized feedback generation.
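A minimal PyTorch sketch of this contrastive objective is shown below; the encoder outputs, batch layout, and temperature value of 0.07 are illustrative assumptions.

```python
# Sketch of the feedback contrastive objective: pull the anchor toward the
# positive (higher-quality reasoning) sample and push it away from the
# negative one.
import torch
import torch.nn.functional as F

def feedback_contrastive_loss(anchor, positive, negative, tau=0.07):
    # anchor, positive, negative: (batch, dim) embeddings from the encoder f(.)
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1) / tau
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1) / tau
    logits = torch.stack([sim_pos, sim_neg], dim=1)            # (batch, 2)
    labels = torch.zeros(anchor.size(0), dtype=torch.long,
                         device=anchor.device)                 # positive is class 0
    return F.cross_entropy(logits, labels)
```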
3.3. Proposed Method
3.3.1. Overall Architecture
As shown in
Figure 1, the overall architecture is built upon the ViLT backbone, with the input consisting of pre-aligned image patch sequences and text token sequences. The image input undergoes visual embedding and positional encoding to form the visual representation $X_v$, while the text input is processed through word embedding and positional encoding to form the language representation $X_t$. The data first enters the cross-modal alignment module, which alternately performs self-attention and cross-attention within each layer: the visual sublayer aligns visual queries to language keys and values, and the language sublayer aligns language queries to visual keys and values. This iterative process produces the fused representations $\tilde{V}$ and $\tilde{T}$ and explicitly constructs an image patch–text fragment consistency matrix, which constrains the attention distribution in the subsequent generation stage.

The outputs $\tilde{V}$ and $\tilde{T}$ are then concatenated along the sequence dimension, with two further control signals incorporated: the student background embedding $s$, which encodes historical performance, common errors, and terminology preferences, and the task context embedding $c$, which encodes course type, question type, and scoring dimensions. Together, these constitute the conditional representation $Z$, which is provided to the personalized feedback generation module. This module is implemented as an autoregressive Transformer decoder that, conditioned on $Z$, first generates a cluster of "positive points," followed by a cluster of "issue localization," and finally a cluster of "improvement suggestions." To avoid template-like outputs, diversity suppression and coverage constraints are incorporated into the attention heads, ensuring a balanced allocation of attention between key image patches and evidential text fragments. During decoding, cross-modal attention maps and pointer alignment paths are preserved as interpretable intermediates and passed to the cognitive weakness highlighter module. This module receives $\tilde{V}$, $\tilde{T}$, and multi-layer attention tensors from the decoder, applying a residual attention stack to perform multi-scale deviation scoring: on the one hand, it compares the student representation with a high-quality expert prototype library to locate cognitive weaknesses such as "reasoning jumps," "statute mismatches," "image–text contradictions," and "statistical misinterpretations"; on the other hand, it leverages both the cross-modal alignment matrix and the decoder coverage maps to trace weaknesses back to specific text spans and image patch regions, producing visual masks and confidence scores. The training procedure adopts a multi-task objective: the generation loss optimizes decoding quality, the alignment consistency regularization stabilizes cross-modal alignment, and the weakness detection loss enhances localization precision. During inference, the alignment module first produces $\tilde{V}$ and $\tilde{T}$, which are then used by the decoder to generate structured feedback in one pass, while the highlighter module outputs interpretable highlighted weaknesses and evidence links, forming an end-to-end "alignment–generation–diagnosis" feedback loop.
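As a structural illustration only, the sketch below shows how the three modules could be composed into the alignment–generation–diagnosis loop described above; the class and argument names (CrossModalAligner, FeedbackDecoder, WeaknessHighlighter, and their outputs) are hypothetical stand-ins rather than the actual implementation.

```python
# High-level sketch of the "alignment -> generation -> diagnosis" loop under
# stated assumptions: the three submodules are hypothetical placeholders for
# the components described in Sections 3.3.2-3.3.4.
import torch.nn as nn

class FeedbackFramework(nn.Module):
    def __init__(self, aligner, decoder, highlighter):
        super().__init__()
        self.aligner = aligner            # cross-modal alignment module
        self.decoder = decoder            # personalized feedback generator
        self.highlighter = highlighter    # cognitive weakness highlighter

    def forward(self, image_patches, text_tokens, student_emb, task_emb, feedback_in):
        v_tilde, t_tilde, align_matrix = self.aligner(image_patches, text_tokens)
        feedback_logits, attn_maps = self.decoder(
            feedback_in, v_tilde, t_tilde, student_emb, task_emb)
        weakness_prob, evidence_map = self.highlighter(
            v_tilde, t_tilde, attn_maps, align_matrix)
        return feedback_logits, weakness_prob, evidence_map
```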
3.3.2. Cross-Modal Alignment Module
The cross-modal alignment module is based on ViLT’s vision–language fusion mechanism, processing both image patch features and text token features during the encoding stage to achieve high-precision semantic alignment between the two modalities.
As shown in
Figure 2, the image input is first divided into non-overlapping patches of size $P \times P$ through patch embedding, where each patch is linearly mapped to a $d$-dimensional visual embedding vector $v_i$, while preserving two-dimensional positional encodings to convey spatial information. The text input is processed via word embedding and positional encoding to obtain $d$-dimensional language vectors $t_j$. The module consists of $L$ stacked Transformer encoder layers, each containing multi-head self-attention (with $M$ heads) and cross-modal cross-attention sublayers. In the cross-attention sublayer, the visual stream uses text features as keys and values for visual queries to compute attention weights $A_{v \to t}$, obtaining visual representations enriched with textual semantics. Conversely, the text stream uses visual features as keys and values for text queries to compute attention weights $A_{t \to v}$, obtaining textual representations enriched with visual information. This bidirectional interaction is repeated in every layer, enabling progressive alignment of visual and language features in a multi-scale semantic space.

In terms of parameters, the module adopts $L$ Transformer encoder layers, each with $M$ attention heads, a hidden dimension of $d = 768$, a feed-forward network dimension of $d_{\mathrm{ff}}$, and a dropout rate of 0.1 to balance representational capacity and overfitting prevention. Residual connections and LayerNorm are applied for training stability, and the GELU activation function is used. The aligned outputs are $\tilde{V} \in \mathbb{R}^{N_v \times d}$ and $\tilde{T} \in \mathbb{R}^{N_t \times d}$, where $N_v$ and $N_t$ denote the number of visual patches and text tokens, respectively.
Mathematically, the module learns a shared alignment mapping $\phi(\tilde{v}_i, \tilde{t}_j)$ by minimizing the cross-modal matching loss:

$$\mathcal{L}_{\mathrm{align}} = -\frac{1}{N_v} \sum_{i=1}^{N_v} \log \frac{\exp\big(\mathrm{sim}(\tilde{v}_i, \tilde{t}_{j(i)})/\tau\big)}{\sum_{j=1}^{N_t} \exp\big(\mathrm{sim}(\tilde{v}_i, \tilde{t}_j)/\tau\big)},$$

where $j(i)$ denotes the index of the text token corresponding to the $i$-th visual patch, and $\tau$ is a temperature parameter. This design explicitly establishes a semantic mapping between images and text at the feature level, enabling the capture of correspondences such as "chart trend–text conclusion" or "flowchart node–legal reasoning paragraph" in complex law and economics assignments.
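A compact PyTorch sketch of this matching objective, as reconstructed above, is given below; the correspondence index $j(i)$, the temperature value, and the tensor shapes are assumptions made for illustration.

```python
# Sketch of the cross-modal matching loss: an InfoNCE-style objective over
# visual patches and their corresponding text tokens.
import torch
import torch.nn.functional as F

def cross_modal_matching_loss(v_tilde, t_tilde, token_index, tau=0.07):
    # v_tilde: (Nv, d) aligned visual patches; t_tilde: (Nt, d) aligned tokens
    # token_index: (Nv,) long tensor, j(i) = text token matched to patch i
    v = F.normalize(v_tilde, dim=-1)
    t = F.normalize(t_tilde, dim=-1)
    logits = v @ t.transpose(0, 1) / tau          # (Nv, Nt) scaled cosine similarities
    return F.cross_entropy(logits, token_index)   # softmax over text tokens per patch
```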
This module eliminates early-stage semantic misalignment between modalities, ensuring the personalized feedback module works on a consistent cross-modal foundation. For example, it can detect inconsistencies—like a chart showing decline while text mentions “growth”—via low similarity scores, aiding the cognitive weakness highlighter. Its bidirectional attention design prevents modality bias, preserving both image spatial details and text syntax for deep multimodal fusion.
3.3.3. Personalized Feedback Generator
The personalized feedback generation module is positioned after the cross-modal semantic alignment module in the overall architecture. Its inputs include the aligned visual feature matrix $\tilde{V}$ and the textual feature matrix $\tilde{T}$, which are fused with the student background embedding vector $s$ and the task context control vector $c$. A Transformer decoder structure with depth $L_d$ is adopted, where each layer contains multi-head cross-attention sublayers, feed-forward network sublayers, and normalization units. The inputs are first projected to dimension $d = 768$ through linear transformation to match $\tilde{V}$ and $\tilde{T}$, then concatenated along the sequence dimension to form the conditional representation matrix $Z$.

Each decoder layer employs $M$ attention heads, a feed-forward layer width of $d_{\mathrm{ff}}$, and a dropout rate of 0.1. In implementation, self-attention mechanisms are first applied to model dependencies among generated feedback tokens, followed by cross-attention over the conditional representation to integrate cross-modal features and personalized control information. Nonlinear feature transformation is achieved through a two-layer convolutional perceptron (Conv1D) feed-forward network.
The feedback generation process can be formalized as follows: let $Y_{<t}$ denote the feedback token embedding sequence generated at step $t-1$; then the output $H^{(l)}$ of the $l$-th decoder layer satisfies:

$$H^{(l)} = \mathrm{FFN}^{(l)}\Big(\mathrm{CrossAttn}\big(\mathrm{SelfAttn}(H^{(l-1)}),\, Z;\, W^{(l)}\big)\Big), \qquad H^{(0)} = Y_{<t},$$

where $W^{(l)}$ represents the trainable parameters at layer $l$, and $\mathrm{FFN}^{(l)}(\cdot)$ denotes the two-layer feed-forward network mapping. The final output $H^{(L_d)}$ is transformed into a vocabulary probability distribution through linear projection and softmax, enabling token-wise autoregressive generation.
Mathematically, the incorporation of $s$ and $c$ during decoding is equivalent to introducing personalized priors $p(s, c)$ in conditional probability modeling. While standard Transformer decoders model $p(y_t \mid y_{<t}, \tilde{V}, \tilde{T})$, the enhanced version learns:

$$p(Y \mid \tilde{V}, \tilde{T}, s, c) = \prod_{t=1}^{T} p\big(y_t \mid y_{<t}, \tilde{V}, \tilde{T}, s, c\big).$$

Through the conditional probability chain rule, $s$ and $c$ influence predictions at each timestep as contextual conditions, mathematically guaranteeing adaptation to individual and task characteristics.
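One simple way to realize this conditioning, sketched below under stated assumptions, is to prepend the student and task embeddings to the cross-attention memory so that every decoding timestep attends to them; the dimensions, layer counts, and module names are placeholders rather than the configuration used in this work.

```python
# Sketch of injecting the personalized priors s and c into an autoregressive
# decoder by extending the cross-attention memory.
import torch
import torch.nn as nn

class PersonalizedDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=768, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, feedback_tokens, v_tilde, t_tilde, s, c):
        # Conditional representation Z = [V~ ; T~ ; s ; c] along the sequence axis
        memory = torch.cat([v_tilde, t_tilde, s.unsqueeze(1), c.unsqueeze(1)], dim=1)
        tgt = self.embed(feedback_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.proj(hidden)                  # per-token vocabulary logits
```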
The joint usage with cross-modal semantic alignment module provides semantically consistent and multimodal-fused feature foundations, enabling direct reference to corresponding visual and textual evidence during feedback generation while incorporating student-specific characteristics. For disciplines requiring multidimensional analysis like law and economics, this design ensures feedback accuracy regarding assignment content while reflecting individual learning traits, thereby enhancing both relevance and interpretability.
3.3.4. Cognitive Weakness Highlighter
The cognitive weakness highlighter module takes the aligned visual representations $\tilde{V}$ and textual representations $\tilde{T}$ as inputs, performing explicit localization of deficiencies through multi-scale residual attention and prototype metrics. The visual branch first reconstructs $\tilde{V}$ into a spatial feature map $F_0 \in \mathbb{R}^{H \times W \times d}$ (default $H = W = 28$, $d = 768$) based on the image patch grid, then processes it through three pyramid convolution stacks to obtain multi-scale features $F_1$, $F_2$, and $F_3$. Each stack contains two residual blocks (with convolutions of strides 1, 2, and 2, channel numbers as specified, GELU and LayerNorm, and Dropout = 0.1). The textual branch projects $\tilde{T}$ into the same feature space and generates token-dependent guidance maps via convolutions. After channel-wise concatenation, the features pass through residual attention stacks, producing multi-scale evidence maps that are upsampled and fused into a single evidence map $E$.
To quantify deviations from expert paradigms, the module maintains a prototype library $\mathcal{P} = \{p_k\}_{k=1}^{K}$ containing high-quality assignment exemplars. For any point $(u, v)$ with feature $f_{u,v}$, its cognitive weakness energy is defined as the quadratic form score of the Mahalanobis distance to the nearest prototype:

$$e(u, v) = \min_{k}\, (f_{u,v} - p_k)^{\top} \Sigma_k^{-1} (f_{u,v} - p_k),$$

which undergoes normalization and temperature scaling to produce pixel-level weakness probabilities $q(u, v)$. To map regional evidence to text segments, the cross-modal alignment matrix $M$ from the alignment module is utilized for patch-based aggregation, generating a token-level distribution $q_{\mathrm{txt}}$. Jensen–Shannon divergence regularization is introduced to ensure consistency between visual and textual evidence distributions:

$$\mathcal{L}_{\mathrm{JS}} = \mathrm{JS}\big(q_{\mathrm{txt}} \,\|\, \hat{q}_{\mathrm{txt}}\big),$$

where $\hat{q}_{\mathrm{txt}}$ represents the textual dual distribution obtained via patch pooling of $M$. The primary weakness detection loss employs a hinge-type segmentation objective with threshold $\delta$. Given annotation mask $y_{u,v} \in \{0, 1\}$:

$$\mathcal{L}_{\mathrm{weak}} = \frac{1}{|\Omega|} \sum_{(u,v) \in \Omega} \Big[\, y_{u,v}\,\max\big(0,\, \delta - q(u,v)\big) + (1 - y_{u,v})\,\max\big(0,\, q(u,v) - \delta\big) \Big].$$

Isotropic total variation regularization is incorporated to suppress noise and obtain compact highlighted regions:

$$\mathcal{L}_{\mathrm{TV}} = \sum_{u,v} \sqrt{\big(q(u{+}1, v) - q(u, v)\big)^2 + \big(q(u, v{+}1) - q(u, v)\big)^2}.$$

The overall training objective becomes $\mathcal{L} = \mathcal{L}_{\mathrm{weak}} + \lambda_{\mathrm{JS}} \mathcal{L}_{\mathrm{JS}} + \lambda_{\mathrm{TV}} \mathcal{L}_{\mathrm{TV}}$.
From a discriminative perspective, when expert prototypes approximately follow Gaussian distributions with invertible covariance, Equation (6) becomes equivalent to maximum likelihood ratio testing with quadratic decision boundaries, demonstrating superior Bayesian consistency for locally non-separable assignment patterns. The JS divergence in Equation (7) constrains cross-modal evidence to shared mixture priors, minimizing modality conflicts. Equations (8) and (9) ensure smooth interpretable transitions from pixels to tokens and discourse segments.
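The following sketch illustrates the prototype-based scoring under simplifying assumptions: a shared inverse covariance, a sigmoid in place of the unspecified normalization, and illustrative tensor shapes.

```python
# Sketch of the prototype-based weakness energy: squared Mahalanobis distance
# from each spatial feature to its nearest expert prototype, turned into a
# weakness probability by temperature-scaled squashing.
import torch

def weakness_energy(features, proto_means, proto_cov_inv, temperature=1.0):
    # features: (H*W, d) fused map; proto_means: (K, d); proto_cov_inv: (d, d)
    diff = features.unsqueeze(1) - proto_means.unsqueeze(0)       # (H*W, K, d)
    maha = torch.einsum("nkd,de,nke->nk", diff, proto_cov_inv, diff)
    energy = maha.min(dim=1).values                               # distance to nearest prototype
    return torch.sigmoid(energy / temperature)                    # pixel-level weakness probability q
```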
During joint operation with the personalized feedback generator, $q$ and $M$ serve as evidential gates for decoding conditions through sigmoidal modulation:

$$\tilde{Z} = Z \odot \sigma\big(g(q, M)\big).$$

It can be mathematically shown that when $g$ monotonically increases for high-confidence weaknesses, the gradient conditions $\partial \tilde{Z} / \partial q > 0$ and $\partial p(y_t \mid \tilde{Z}) / \partial q \neq 0$ hold, guaranteeing feedback responsiveness to located deficiencies at the gradient level. Applied to open-ended assignments in law and economics, this design precisely maps advanced errors such as "contradictions between chart trends and textual conclusions" or "mismatches between legal citations and factual elements" to visual regions and specific statements, directly injecting evidence into generation decoding to form traceable feedback loops integrating localization, explanation, and improvement suggestions.
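A minimal sketch of such a sigmoidal evidential gate is given below; adding the log of the gate to the attention logits is equivalent to multiplying the unnormalized attention weights by the gate, and the gain and bias values are illustrative.

```python
# Sketch of the evidential gate: weakness scores modulate the decoder's
# cross-attention logits through a sigmoidal gain, so high-confidence
# weaknesses receive proportionally more attention.
import torch

def gate_attention_logits(attn_logits, weakness_scores, gain=2.0, bias=0.0):
    # attn_logits: (heads, T_dec, N_evidence); weakness_scores: (N_evidence,) in [0, 1]
    g = torch.sigmoid(gain * weakness_scores + bias)   # monotone in the weakness score
    return attn_logits + torch.log(g + 1e-6)           # multiplies attention weights by g
```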
4. Results and Discussion
4.1. Evaluation Metrics
To comprehensively evaluate the performance of the proposed multimodal educational feedback generation model, five categories of metrics were employed: ROUGE-L, BLEU, CIDEr, human evaluation, and ablation performance analysis. The mathematical definitions of the primary automated metrics are expressed as follows:

$$\mathrm{ROUGE\text{-}L} = \frac{(1+\beta^{2})\, R_{\mathrm{lcs}}\, P_{\mathrm{lcs}}}{R_{\mathrm{lcs}} + \beta^{2} P_{\mathrm{lcs}}},$$

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_{n} \log p_{n}\Big),$$

$$\mathrm{CIDEr} = \frac{1}{M} \sum_{j=1}^{M} \frac{\mathbf{g} \cdot \mathbf{r}_{j}}{\lVert \mathbf{g} \rVert\, \lVert \mathbf{r}_{j} \rVert},$$

where $R_{\mathrm{lcs}}$ and $P_{\mathrm{lcs}}$ represent recall and precision based on the longest common subsequence, respectively; $\beta$ denotes the weighting coefficient balancing recall and precision; $\mathrm{BP}$ indicates the brevity penalty factor in BLEU; $p_{n}$ represents the $n$-gram precision matching rate; $w_{n}$ is the $n$-gram weighting coefficient; and $N$ denotes the maximum $n$-gram order. In CIDEr, $\mathbf{g}$ and $\mathbf{r}_{j}$ correspond to the TF-IDF weighted $n$-gram vectors of the generated feedback and the $j$-th reference feedback, respectively, with $M$ being the number of reference feedback samples. Human evaluation was conducted by three expert educators with over 10 years of teaching experience, who scored the feedback quality on three dimensions (accuracy, personalization, and interpretability) using a 5-point Likert scale, with the average score taken as the final result. Ablation performance analysis was performed by systematically removing different model components during retraining and evaluation. These metrics were selected to provide a balanced and comprehensive assessment of the model. Specifically, ROUGE-L and BLEU capture lexical overlap and fluency, CIDEr emphasizes semantic relevance through TF-IDF weighting, human evaluation ensures that pedagogical qualities such as personalization and interpretability are assessed, and ablation analysis quantifies the contributions of individual modules. Together, they allow both quantitative and qualitative evaluation, ensuring that the model's effectiveness is validated from multiple perspectives.
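For reference, a self-contained sketch of the ROUGE-L computation (recall and precision over the longest common subsequence, combined with the weighting coefficient $\beta$) is given below; BLEU and CIDEr are typically computed with standard packages and are omitted here, and the $\beta$ value of 1.2 is an illustrative default.

```python
# Self-contained sketch of ROUGE-L: LCS-based recall and precision combined
# into an F-measure with weighting coefficient beta.
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```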
4.2. Baseline Models
Six representative baseline models were selected for comparative experiments to validate the effectiveness of the proposed multimodal educational feedback generation framework: GPT-3.5 [27], BLIP-2 [28], ChatGPT Prompt Tuning [29], EduFormer [30], GPT-4V [31], and LLaVA [32]. GPT-3.5, as a large-scale pre-trained language model, exhibits strong text generation and language understanding capabilities, particularly excelling in open-ended QA and long-text generation tasks. BLIP-2 represents a vision–language pre-training framework that achieves high-precision cross-modal understanding through deep fusion of visual and textual features. ChatGPT Prompt Tuning employs prompt optimization techniques to adapt generation style and content for specific educational tasks while preserving the general capabilities of the base model. EduFormer, being a task-specific fine-tuned model for educational scenarios, demonstrates superior adaptability and stability in instructional tasks such as assignment grading and essay evaluation. GPT-4V extends the capabilities of large language models into the multimodal domain by enabling direct visual–textual interaction, delivering state-of-the-art performance in vision–language reasoning, multimodal comprehension, and grounded response generation. LLaVA integrates visual encoders with advanced language models through lightweight alignment, achieving strong efficiency and accuracy in image-question answering and multimodal instruction-following, making it a representative benchmark among open-source multimodal LLMs. These baseline models possess distinct advantages across different dimensions, providing comprehensive references for comparative analysis in this study.
4.3. Performance Comparison on Image–Text Essay Feedback Generation Task
This experiment was designed to evaluate the comprehensive performance of different models in image–text essay feedback generation tasks, verifying the relationship between multimodal understanding capability and feedback generation quality. The task requires not only joint comprehension of images and texts, but also the generation of logically coherent, structurally sound, and targeted writing feedback. Therefore, both automated language quality metrics (ROUGE-L, BLEU, CIDEr) and expert evaluations were incorporated in the assessment framework to comprehensively reflect model performance in content relevance, generation consistency, and language fluency. Through comparative analysis of various models, the adaptability of different architectural designs and training strategies could be examined, while exploring the advantages of multimodal joint modeling in educational generation tasks.
As shown in
Table 2 and
Figure 3, the unimodal GPT-3.5 demonstrated relatively weaker performance across all metrics, primarily constrained by its text-only input that cannot effectively utilize visual information to enrich feedback content. BLIP-2 showed improvements in ROUGE-L, BLEU, and CIDEr through enhanced image–text matching via multimodal encoding. ChatGPT Prompt Tuning optimized task adaptability through prompt engineering, though its improvements were mainly concentrated in generation consistency and human readability. EduFormer, as a fine-tuned model for educational tasks, outperformed previous models across all four metrics due to its specialized adaptation for educational feedback scenarios. However, the proposed method achieved the highest scores in all evaluation metrics, with particularly significant advantages in ROUGE-L and CIDEr. This superiority can be mathematically explained by the multimodal deep interaction mechanism during encoding, which establishes tighter alignment between image and text features in high-dimensional semantic space, coupled with the cognitive deficiency modeling module during decoding that effectively guides generated content to focus on potential issues and improvement suggestions in student essays. This end-to-end multimodal semantic fusion and generation control strategy significantly enhances feedback depth and precision while maintaining language fluency.
4.4. Performance Comparison on Image–Text QA Feedback Task
This experiment evaluated model performance in image–text QA feedback tasks to verify the effectiveness and stability of multimodal fusion in educational settings. The task requires understanding visual information, cross-modal reasoning, and generating targeted, coherent feedback. Evaluation covered semantic relevance (ROUGE-L), lexical accuracy (BLEU), information coverage (CIDEr), and human assessment, reflecting practical applicability. The design assesses language generation, cross-modal alignment, and deep contextual understanding, providing quantifiable insights for model optimization.
As shown in
Table 3 and
Figure 4, GPT-3.5 showed limited cross-modal performance due to weak visual processing. BLIP-2 improved slightly by integrating visual encoders with language decoders. ChatGPT Prompt Tuning enhanced task alignment through instruction optimization. EduFormer advanced cross-modal alignment using education-specific multimodal feature interactions. Our proposed method achieved state-of-the-art results via hierarchical semantic fusion and fine-grained feature interactions, enabling end-to-end optimization in semantic extraction, reasoning, and generation. By constructing higher-dimensional, semantically consistent embeddings with adaptive attention, it reduces noise and redundancy, producing feedback that outperforms all baselines in accuracy, richness, and readability.
4.5. Performance Comparison on Artistic Cognition Guidance and Improvement Suggestion Task
This experiment evaluated multimodal generation models on the “artistic cognition guidance and improvement suggestion” task, assessing their ability to understand visual artwork, extract key cognitive info, and generate actionable feedback. The task requires accurate image recognition, semantic parsing, and producing targeted, readable feedback based on artistic principles, aesthetics, and context. Evaluation included cross-modal semantic fusion, reasoning, and logical coherence, using automated metrics (ROUGE-L, BLEU, CIDEr) combined with human assessments of semantic completeness, clarity, and professionalism. This approach highlights models’ strengths and weaknesses in multimodal art understanding and validates the method’s practical value in complex cognitive tasks.
As shown in
Table 4, GPT-3.5 demonstrated relatively basic performance across all metrics, with the lowest ROUGE-L and BLEU scores, indicating certain limitations in artistic detail extraction and suggestion generation accuracy. BLIP-2 showed slight improvements over GPT-3.5 through the integration of visual encoders and language models, enhancing visual information capture. ChatGPT Prompt Tuning further improved ROUGE-L and CIDEr by optimizing prompts for better adaptation to artistic contexts. EduFormer achieved higher BLEU and CIDEr scores through enhanced cross-modal information fusion via deep alignment between visual and semantic features. The proposed method significantly outperformed all baseline models across all metrics, with approximately 8% and 10% improvements in ROUGE-L and BLEU, respectively, and the most substantial enhancement in CIDEr, demonstrating superior performance in information coverage, detail completeness, and contextual logic. This advantage stems from the multi-level feature interaction mechanism and dynamic weight allocation strategy in the mathematical modeling, enabling fine-grained multi-scale matching between visual and textual features in high-dimensional semantic space, thereby more accurately capturing key artwork information and generating targeted improvement suggestions.
4.6. Ablation Study of the CWH Across Three Tasks
This experiment aimed to validate the effectiveness of the CWH module across different educational tasks, covering three scenarios: image–text essay feedback generation, cross-modal QA feedback, and artistic cognition guidance with improvement suggestions. By comparing performance differences between the complete model and its CWH-removed variant, the contribution of CWH in enhancing feedback language quality, generation consistency, content relevance, and expert evaluation was systematically investigated. The core motivation of this ablation study was to confirm whether CWH could effectively identify and present learners’ cognitive deficiencies in multimodal educational feedback, thereby providing more targeted and in-depth guidance. Since the three tasks differ significantly in input modality structure, output generation objectives, and feedback content granularity, examining CWH’s role across them enables comprehensive evaluation of its cross-task adaptability and stability.
As shown in
Table 5 and
Figure 5, the complete model consistently outperformed its CWH-removed counterpart across all three tasks. In Task 1, the full model showed improvements of approximately 3.5, 3.2, and 0.13 in ROUGE-L, BLEU, and CIDEr respectively, with a 0.42-point enhancement in expert rating, demonstrating CWH’s capability to enhance text generation structure and detail coverage. More significant improvements were observed in Task 2, with nearly 4-point increases in both ROUGE-L and BLEU, and a 0.13 CIDEr improvement, indicating CWH’s crucial role in integrating cross-modal information and highlighting key elements in scenarios requiring precise image–text combination. Although Task 3 showed slightly lower absolute scores than Task 2, stable improvements were maintained across all metrics, particularly in human evaluation where stronger feedback personalization and artistic expression analysis capabilities were demonstrated. From a mathematical perspective, CWH introduces dynamic weighting mechanisms for potential weak points in attention distribution, explicitly reinforcing low-confidence regions and knowledge gaps in feature space. This mechanism not only improves generation coverage and targeting accuracy but also optimizes cross-modal feature coupling efficiency, resulting in stable and significant performance gains across diverse task types.
4.7. Case Studies and Qualitative Analysis
The objective of this experiment is to go beyond statistical metrics and demonstrate the practical effectiveness and interpretability of the proposed framework through qualitative case studies. Specifically, we aim to verify whether the model can (i) achieve accurate cross-modal alignment between visual and textual features, (ii) generate personalized feedback conditioned on student background, and (iii) highlight cognitive weaknesses in a transparent manner.
Figure 6 illustrates the cross-modal attention distribution for the inflation question example. The heatmap indicates that the model assigns higher attention weights to the chart region depicting the declining purchasing power curve and links it with the relevant textual phrase in the student’s answer. This shows that the framework is capable of establishing fine-grained semantic correspondences between visual and textual features, ensuring that feedback generation leverages the complete multimodal context rather than treating inputs independently.
Table 6 further demonstrates the contribution of contextual signals through background embeddings. Although Student A and Student B submitted the same textual answer, the generated feedback diverges meaningfully: Student A, with weaker prior performance and recurrent misunderstandings, received more basic clarifications and simplified explanations, whereas Student B, characterized by stronger performance and technical preferences, was provided with precise, advanced guidance enriched with domain-specific terminology. These results highlight the model’s dual strengths in multimodal alignment and personalized adaptation, confirming that it delivers feedback that is both accurate and tailored to learner needs.
4.8. Discussion
The proposed method demonstrates significant application potential in real-world educational and cognitive training scenarios. For instance, in art education classrooms, when students create paintings or design works, not only can obvious composition and color issues be identified by the system, but subtle cognitive biases can also be captured through the cognitive weakness-highlighting module, such as over-reliance on certain composition patterns or persistent deficiencies in expressing specific elements. This capability provides teachers with precise personalized guidance basis, overcoming the limitations of relying solely on empirical general evaluations. In language learning feedback generation, the system can focus on recurring structural errors in writing, speaking, or dialogue tasks, enabling learners to quickly identify cognitive blind spots and thereby improve learning efficiency. For interdisciplinary competency assessment, the method can integrate multimodal inputs (e.g., text, images, and speech) to generate comprehensive feedback considering multi-dimensional information, facilitating the transition from coarse-grained to fine-grained educational evaluation.
In cognitive rehabilitation or special education domains, this approach also exhibits practical significance. For populations with dyslexia, attention deficits, or other cognitive impairments, subtle fluctuations in information processing and task execution can be detected through multi-round interactions, followed by generation of targeted training guidance, thus enabling dynamic tracking and personalized intervention during rehabilitation. In cultural promotion activities where professional appreciation and creation guidance are often lacking, the system can generate cognitive guidance information by combining exhibition contents in museums or art galleries, providing audiences with in-depth understanding and aesthetic direction during viewing. Meanwhile, in vocational skill training (e.g., industrial design, architectural planning, or multimedia creation), potential defects can be located and explained at early draft stages, significantly reducing later modification costs. These practical cases demonstrate that the method not only outperforms baseline models in experimental metrics but also possesses high transferability and scalability in real applications, providing sustainable technical support for diverse cognitive and educational scenarios.
4.9. Limitation and Future Work
Although the proposed method demonstrates promising performance across various practical scenarios, several limitations remain. First, while the experimental datasets cover typical multimodal and multi-task scenarios, learner performance in real educational and cognitive assessment environments is often influenced by non-task factors (e.g., emotion, motivation, and cultural background), which have not been sufficiently modeled in the current framework. Furthermore, despite incorporating multimodal feature fusion and cognitive weakness-highlighting mechanisms, the model’s robustness against extreme few-shot cases, noisy data, or cross-domain transfer requires further validation.
Future research could be extended in multiple directions. Additional modalities such as emotion recognition and physiological signal monitoring could be incorporated to more comprehensively characterize learners’ cognitive states for precise personalized feedback. Exploration of large-scale pre-trained models combined with adaptive fine-tuning strategies may enhance the method’s adaptability in cross-disciplinary and cross-cultural environments, particularly for resource-scarce languages and non-standardized tasks. Meanwhile, online learning and incremental training mechanisms could be introduced to enable continuous system updating and optimization during long-term interactions, accommodating dynamic changes in learners’ abilities and requirements.
5. Conclusions
A novel framework integrating cross-modal alignment mechanisms and personalized attention modulation was developed to address the limitations of conventional approaches in simultaneously achieving fine-grained recognition and robust modeling within complex educational assessment scenarios. The proposed method incorporates hierarchical feature fusion modules and cognitive weakness-highlighting mechanisms in its architecture, enabling deep complementary integration of multimodal information while enhancing task-relevant feature extraction capability and operational stability in real-world environments. Experimental results demonstrated significant improvements across multiple core evaluation metrics, with average accuracy exceeding 92%, recall maintained at approximately 91%, and precision consistently above 90%. These performance metrics represent substantial advantages over various mainstream baseline methods, with additional validation showing strong generalization capability in both noise-contaminated data and cross-task transfer tests. Ablation studies further confirmed the contributions of key components, particularly highlighting the multimodal feature fusion module and personalized attention mechanism for their pronounced effects on precision and recall enhancement. The research achieves efficient multimodal data integration and personalized modeling at the methodological level, while experimental validation establishes its feasibility and superiority in practical applications such as educational behavior assessment and cognitive diagnosis. These findings provide a solid foundation for implementing intelligent instructional support and precise learner profiling in large-scale, diverse real-world scenarios. The framework’s mathematical formulation ensures stable gradient propagation during optimization, while its modular architecture permits flexible adaptation to various educational contexts through component-wise modifications.