1. Introduction
With the rapid advancement of intelligent education, AI-driven efficient, personalized, and interpretable teaching feedback has become a key research focus in educational technology [1]. Traditional feedback relies heavily on manual teacher evaluation and experience, supplemented by inputs from students, peers, and supervisors, which is time-consuming and often inadequate for large-scale personalized teaching needs [2]. Advances in deep learning, especially large-scale models, enable the integration and analysis of multimodal data such as images and text, facilitating better understanding of teaching behaviors and intelligent feedback generation [3,4]. Li et al. demonstrated that intelligent tutoring systems can analyze student behaviors, emotions, progress, and answer accuracy in real time, delivering precise personalized feedback to support teaching strategy adjustments [5]. Combining image analysis with natural language processing allows examination of non-verbal cues (e.g., facial expressions, posture) alongside verbal outputs, enhancing feedback comprehensiveness and accuracy [5]. The adoption of large-model AI technologies improves classroom efficiency and effectiveness, driving transformative changes in educational technology [6,7].
Most current teaching feedback systems primarily handle structured text inputs. Abdi et al. developed DTLP, an automated deep learning system analyzing student feedback with features including word embeddings, affective knowledge, emotion rules, statistics, and linguistic characteristics [8]. Shaik et al. integrated NLP methods to annotate and respond to student feedback, enhancing AI's educational impact [9]. However, these systems show limitations in jointly modeling student work images (handwritten assignments, lab reports, sketches) and textual descriptions (reasoning, reflections), restricting comprehensive understanding of learning outputs. The rise of multimedia data has increased interest in semantic alignment across modalities [10]. Despite progress, deeper linguistic understanding remains a challenge for generating effective feedback. Joint image–text modeling offers a holistic view of student work, enabling precise personalized feedback and addressing shortcomings of traditional approaches [11]. Deploying large language models (LLMs) in education also faces critical challenges. At the semantic level, LLMs have notable limitations in parsing complex linguistic phenomena: Weissweiler et al. [12] showed that mainstream models such as GPT and Gemini fail to capture additional semantics in constructions such as causative motion (e.g., "sneeze causing object movement"), revealing gaps in deep semantic reasoning. These semantic weaknesses affect text generation and introduce linguistic patterns with privacy and security implications. Muñoz-Ortiz et al. [13] found that LLM-generated news texts exhibit simpler syntax, emotional imbalance (overemphasis on "joy"), and overuse of pronouns, numerals, and auxiliaries; such linguistic traits have been linked to privacy risks such as entity reference obfuscation and attribute inference [14].
Recent work suggests that injecting explicit linguistic knowledge can mitigate these issues. Zhang et al. [15] proposed LINGOLLM, which integrates grammar books and dictionaries into GPT-4, enabling efficient processing of endangered languages and achieving a BLEU score of 10.5, thus opening paths for low-resource language applications. This underscores the value of structured linguistic knowledge in enhancing multimodal and multilingual educational AI. Deep learning advancements in multimodal learning and joint image–text modeling have spurred interest in integrated analysis and feedback systems. Kumar et al. developed a system generating human-like feedback from paired text and image inputs [16]. Xie et al. highlighted multimodal interaction's advantages in education, combining speech, text, and visuals for richer outputs [17]. Liu et al. tackled personalized multimodal feedback generation, enabling customized responses across disciplines for assignments involving images, audio, and text [18]. To address these challenges, we propose a multimodal Transformer-based framework for teaching feedback generation, combining student-submitted images and textual descriptions with cross-modal attention. The primary contributions of this study are as follows:
A multimodal Transformer framework for automated teaching feedback generation is proposed, jointly modeling student-submitted images and textual content.
Three innovative modules are introduced: the image–text semantic alignment module, the personalized feedback generation module, and the cognitive weakness highlighter module.
Experimental validation demonstrates that the proposed method achieves high accuracy and strong practicality in various educational tasks, such as essay scoring, design feedback, and answer correction.
3. Materials and Method
3.1. Data Collection
To address the practical requirements of research on open-ended expression and structured critical feedback in law and economics education, the dataset was constructed under multimodal conditions, with broader coverage and a larger sample size to enhance model generalization and robustness across diverse instructional tasks, as shown in
Table 1. Data collection was conducted using undergraduate courses from three comprehensive universities, encompassing both core and elective subjects such as Principles of Civil Law, Public Finance, Legal Writing Practice, Principles of Economics, International Economic Law, and Public Finance Management. The image data were sourced from scanned copies of paper-based assignments, scanned classroom experiment and research reports, student-drawn argumentation diagrams, economic statistical charts, case analysis flowcharts, and photographs of classroom blackboard writing. Paper-based submissions were collected by course instructors and digitized using high-resolution scanners at 600 dpi. During acquisition, edge cropping, color correction, and contrast enhancement were applied to ensure the preservation of essential details and structural features. Certain assignments were completed directly on tablets or graphic tablets and submitted in PDF or JPEG format, maintaining an original resolution of no less than
pixels to facilitate subsequent image segmentation and feature extraction.
Text data were obtained from written responses and analytical reports submitted by students for assignments and projects, covering content types such as judgment reasoning, case commentary, economic policy review, statistical result interpretation, solution proposals, and reflective summaries. The original file formats included Word documents, PDF files, and text exported from online assignment platforms. All text data were collected with the original formatting, paragraphing, and punctuation preserved, along with metadata including task type, submission time, and assignment background, to support contextual modeling and feature analysis. In law-related tasks, the text typically contained statutory citations, case names, and logical argumentation structures, whereas in economics-related tasks, it more frequently included formula derivations, data interpretations, and chart references. Upon completion of collection, each image and text entry was paired via a unique identifier, forming one-to-one multimodal records to ensure consistency in cross-modal analysis.
Annotation guidelines were formulated by three senior law and economics educators, each with more than ten years of teaching and grading experience. The annotation dimensions included logical structure completeness, legal or economic theory application, accuracy of expression and use of technical terminology, and personalized improvement suggestions with reflective prompts. Each sample was independently annotated by at least two experts, with cross-checking performed to ensure consistency, and in cases of disagreement, a third annotator acted as arbiter to guarantee the authority and uniformity of the final annotations. The resulting dataset comprises 5000 multimodal assignment samples, with case analysis tasks accounting for the largest proportion, followed by statute application and economic chart interpretation, while policy commentary tasks were fewer but demonstrated diverse viewpoints and prominent critical thinking features. These data provide a solid foundation for achieving precise feedback generation in multidisciplinary and multi-expression-form educational scenarios.
3.2. Dataset Construction
Given the structural complexity, domain-specific characteristics, and expressive diversity of multimodal assignment data in the fields of law and economics, a systematic preprocessing and augmentation pipeline was designed prior to model training. This pipeline comprised three main stages: image processing, text processing, and data augmentation.
3.2.1. Image Processing
In the image processing stage, variations in resolution and clarity among assignment images from different sources were mitigated while key informational regions were preserved. All scanned images and handwritten sketches were first standardized in resolution by scaling them to a predefined size $H \times W$:

$$\hat{I} = \mathcal{R}_{\mathrm{bicubic}}(I, H, W),$$

where $I$ denotes the original image, $\mathcal{R}_{\mathrm{bicubic}}(\cdot)$ represents the bicubic interpolation resampling function, and $H$ and $W$ are the target image height and width, respectively. After standardization, handwriting, structural lines, and edge features were enhanced by applying a sharpening operator $\mathcal{S}(\cdot)$ followed by an edge enhancement filter $\mathcal{E}(\cdot)$, yielding:

$$I_{\mathrm{enh}} = \mathcal{E}\big(\mathcal{S}(\hat{I})\big).$$

To facilitate subsequent cross-modal alignment, a patchifying method was employed to divide the image into non-overlapping blocks of size $P \times P$. In combination with a diagram detection model $\mathcal{D}$, blocks containing key diagrams or structural graphics were automatically identified and retained, thereby reducing redundant background noise.
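For illustration, a minimal Python sketch of this preprocessing chain (bicubic resizing, sharpening, edge enhancement, and patchifying) is given below. It assumes Pillow and NumPy; the 384 × 384 target size and 32 × 32 patch size are placeholder values rather than the settings used in this study, and the diagram-detection step is omitted.

```python
# Illustrative sketch of the image preprocessing pipeline described above
# (resize -> sharpen -> edge enhance -> patchify). The target size and patch
# size are placeholders, not the paper's actual settings.
from PIL import Image, ImageFilter
import numpy as np

def preprocess_image(path, target_hw=(384, 384), patch=32):
    img = Image.open(path).convert("RGB")
    # Bicubic resampling to the predefined size (H, W)
    img = img.resize(target_hw[::-1], resample=Image.BICUBIC)
    # Sharpening operator followed by an edge enhancement filter
    img = img.filter(ImageFilter.SHARPEN)
    img = img.filter(ImageFilter.EDGE_ENHANCE)
    # Patchify into non-overlapping P x P blocks
    arr = np.asarray(img)                      # (H, W, 3)
    h, w, c = arr.shape
    patches = (arr.reshape(h // patch, patch, w // patch, patch, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch, patch, c))
    return patches                             # (num_patches, P, P, 3)
```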
3.2.2. Text Processing
In the text processing stage, a hierarchical procedure was implemented to accommodate the high density of domain-specific terminology, statute references, and policy entities in law and economics assignments, ensuring structural regularity and semantic accuracy. Initially, a domain-specific lexicon was applied for terminology recognition and normalization:

$$T_1 = \mathcal{N}_{\mathrm{term}}(T),$$

where $T$ is the original text and $\mathcal{N}_{\mathrm{term}}(\cdot)$ denotes the terminology normalization function based on the domain lexicon. Subsequently, a normalization parser $\mathcal{P}_{\mathrm{statute}}(\cdot)$ was used to encode statute references in a consistent format, while a policy entity recognition function $\mathcal{E}_{\mathrm{policy}}(\cdot)$ extracted named entities relevant to economic and legal policies, forming a structured element set:

$$S = \mathcal{P}_{\mathrm{statute}}(T_1) \cup \mathcal{E}_{\mathrm{policy}}(T_1).$$

Based on this, a citation cleaning function $\mathcal{C}(\cdot)$ was applied to remove extraneous symbols, standardize inconsistent citation formats, and eliminate irrelevant noise. Finally, a syntactic restoration operator $\mathcal{G}(\cdot)$ and a spell-checking function $\mathcal{S}_{\mathrm{spell}}(\cdot)$ were applied to reconstruct the original grammatical structure and reduce the impact of syntactic flaws on model comprehension, producing a structurally complete and semantically clear text sequence $T^{\ast}$:

$$T^{\ast} = \mathcal{S}_{\mathrm{spell}}\big(\mathcal{G}(\mathcal{C}(T_1))\big).$$
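The following sketch illustrates the shape of this hierarchical pipeline under simplifying assumptions: the toy lexicon, regular expressions, and function names stand in for the domain-specific tools described above and are not the actual implementation.

```python
# Minimal sketch of the hierarchical text-processing stage. Lexicon entries
# and patterns are illustrative placeholders only.
import re

DOMAIN_LEXICON = {"tort liability": "tort_liability", "GDP": "gross domestic product"}

def normalize_terms(text):                      # lexicon-based terminology normalization
    for term, canonical in DOMAIN_LEXICON.items():
        text = re.sub(re.escape(term), canonical, text, flags=re.IGNORECASE)
    return text

def extract_structured_elements(text):          # statute parsing and policy entity recognition (toy patterns)
    statutes = re.findall(r"Article\s+\d+", text)
    policies = re.findall(r"\b[A-Z][a-z]+ Policy\b", text)
    return {"statutes": statutes, "policies": policies}

def clean_citations(text):                      # strip stray symbols, unify citation format
    return re.sub(r"\s*\[\s*(\d+)\s*\]\s*", r" [\1] ", text).strip()

def preprocess_text(raw):
    t1 = normalize_terms(raw)
    elements = extract_structured_elements(t1)
    t_star = clean_citations(t1)                # syntactic restoration and spell checking omitted here
    return t_star, elements
```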
3.2.3. Data Augmentation
In the data augmentation stage, a feedback contrastive learning strategy was introduced to construct positive and negative feedback sample pairs $(x^{+}, x^{-})$ for each anchor sample $x$ under the same task context, where the content was similar but the quality of reasoning or viewpoint differed significantly. Semantic discrimination was achieved by minimizing the embedding distance between positive and anchor samples while maximizing the distance to negative samples:

$$\mathcal{L}_{\mathrm{con}} = -\log \frac{\exp\big(\mathrm{sim}(f(x), f(x^{+}))/\tau\big)}{\exp\big(\mathrm{sim}(f(x), f(x^{+}))/\tau\big) + \exp\big(\mathrm{sim}(f(x), f(x^{-}))/\tau\big)}.$$

Here, $f(\cdot)$ is the feature encoding function, $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, and $\tau$ is the temperature parameter. Furthermore, a cross-task synonymous structure perturbation strategy was developed, introducing structurally equivalent but lexically varied replacements across different tasks to improve model robustness and generalization in ambiguous assignment contexts. This preprocessing and augmentation pipeline not only improved the quality and consistency of the input data but also established a stable feature foundation for subsequent cross-modal alignment and personalized feedback generation.
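A minimal PyTorch sketch of this contrastive objective is shown below; the encoder outputs, batch layout, and temperature value of 0.07 are illustrative assumptions.

```python
# Sketch of the feedback contrastive objective: pull the anchor toward the
# positive (higher-quality reasoning) sample and push it away from the
# negative one.
import torch
import torch.nn.functional as F

def feedback_contrastive_loss(anchor, positive, negative, tau=0.07):
    # anchor, positive, negative: (batch, dim) embeddings from the encoder f(.)
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1) / tau
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1) / tau
    logits = torch.stack([sim_pos, sim_neg], dim=1)            # (batch, 2)
    labels = torch.zeros(anchor.size(0), dtype=torch.long,
                         device=anchor.device)                 # positive is class 0
    return F.cross_entropy(logits, labels)
```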
3.3. Proposed Method
3.3.1. Overall Architecture
As shown in
Figure 1, the overall architecture is built upon the ViLT backbone, with the input consisting of pre-aligned image patch sequences and text token sequences. The image input undergoes visual embedding and positional encoding to form the visual representation $X_v$, while the text input is processed through word embedding and positional encoding to form the language representation $X_t$. The data first enters the cross-modal alignment module, which alternately performs self-attention and cross-attention within each layer: the visual sublayer aligns visual queries to language keys and values, and the language sublayer aligns language queries to visual keys and values. This iterative process produces the fused representations $\tilde{V}$ and $\tilde{T}$ and explicitly constructs an image patch–text fragment consistency matrix, which constrains the attention distribution in the subsequent generation stage.

The outputs $\tilde{V}$ and $\tilde{T}$ are then concatenated along the sequence dimension, with two further control signals incorporated: the student background embedding $s$, which encodes historical performance, common errors, and terminology preferences, and the task context embedding $c$, which encodes course type, question type, and scoring dimensions. Together, these constitute the conditional representation $Z$, which is provided to the personalized feedback generation module. This module is implemented as an autoregressive Transformer decoder that, conditioned on $Z$, first generates a cluster of "positive points," followed by a cluster of "issue localization," and finally a cluster of "improvement suggestions." To avoid template-like outputs, diversity suppression and coverage constraints are incorporated into the attention heads, ensuring a balanced allocation of attention between key image patches and evidential text fragments. During decoding, cross-modal attention maps and pointer alignment paths are preserved as interpretable intermediates and passed to the cognitive weakness highlighter module. This module receives $\tilde{V}$, $\tilde{T}$, and multi-layer attention tensors from the decoder, applying a residual attention stack to perform multi-scale deviation scoring: on the one hand, it compares the student representation with a high-quality expert prototype library to locate cognitive weaknesses such as "reasoning jumps," "statute mismatches," "image–text contradictions," and "statistical misinterpretations"; on the other hand, it leverages both the cross-modal alignment matrix and the decoder coverage maps to trace weaknesses back to specific text spans and image patch regions, producing visual masks and confidence scores. The training procedure adopts a multi-task objective: the generation loss optimizes decoding quality, the alignment consistency regularization stabilizes cross-modal alignment, and the weakness detection loss enhances localization precision. During inference, the alignment module first produces $\tilde{V}$ and $\tilde{T}$, which are then used by the decoder to generate structured feedback in one pass, while the highlighter module outputs interpretable highlighted weaknesses and evidence links, forming an end-to-end "alignment–generation–diagnosis" feedback loop.
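As a structural illustration only, the sketch below shows how the three modules could be composed into the alignment–generation–diagnosis loop described above; the class and argument names (CrossModalAligner, FeedbackDecoder, WeaknessHighlighter, and their outputs) are hypothetical stand-ins rather than the actual implementation.

```python
# High-level sketch of the "alignment -> generation -> diagnosis" loop under
# stated assumptions: the three submodules are hypothetical placeholders for
# the components described in Sections 3.3.2-3.3.4.
import torch.nn as nn

class FeedbackFramework(nn.Module):
    def __init__(self, aligner, decoder, highlighter):
        super().__init__()
        self.aligner = aligner            # cross-modal alignment module
        self.decoder = decoder            # personalized feedback generator
        self.highlighter = highlighter    # cognitive weakness highlighter

    def forward(self, image_patches, text_tokens, student_emb, task_emb, feedback_in):
        v_tilde, t_tilde, align_matrix = self.aligner(image_patches, text_tokens)
        feedback_logits, attn_maps = self.decoder(
            feedback_in, v_tilde, t_tilde, student_emb, task_emb)
        weakness_prob, evidence_map = self.highlighter(
            v_tilde, t_tilde, attn_maps, align_matrix)
        return feedback_logits, weakness_prob, evidence_map
```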
3.3.2. Cross-Modal Alignment Module
The cross-modal alignment module is based on ViLT’s vision–language fusion mechanism, processing both image patch features and text token features during the encoding stage to achieve high-precision semantic alignment between the two modalities.
As shown in
Figure 2, the image input is first divided into non-overlapping patches of size $P \times P$ through patch embedding, where each patch is linearly mapped to a $d$-dimensional visual embedding vector $v_i$, while preserving two-dimensional positional encodings to convey spatial information. The text input is processed via word embedding and positional encoding to obtain $d$-dimensional language vectors $t_j$. The module consists of $L$ stacked Transformer encoder layers, each containing multi-head self-attention (with $M$ heads) and cross-modal cross-attention sublayers. In the cross-attention sublayer, the visual stream uses text features as keys and values for visual queries to compute attention weights $A_{v \to t}$, obtaining visual representations enriched with textual semantics. Conversely, the text stream uses visual features as keys and values for text queries to compute attention weights $A_{t \to v}$, obtaining textual representations enriched with visual information. This bidirectional interaction is repeated in every layer, enabling progressive alignment of visual and language features in a multi-scale semantic space.

In terms of parameters, the module adopts $L$ Transformer encoder layers, each with $M$ attention heads, a hidden dimension of $d = 768$, a feed-forward network dimension of $d_{\mathrm{ff}}$, and a dropout rate of 0.1 to balance representational capacity and overfitting prevention. Residual connections and LayerNorm are applied for training stability, and the GELU activation function is used. The aligned outputs are $\tilde{V} \in \mathbb{R}^{N_v \times d}$ and $\tilde{T} \in \mathbb{R}^{N_t \times d}$, where $N_v$ and $N_t$ denote the number of visual patches and text tokens, respectively.
Mathematically, the module learns a shared alignment mapping $\phi(\tilde{v}_i, \tilde{t}_j)$ by minimizing the cross-modal matching loss:

$$\mathcal{L}_{\mathrm{align}} = -\frac{1}{N_v} \sum_{i=1}^{N_v} \log \frac{\exp\big(\mathrm{sim}(\tilde{v}_i, \tilde{t}_{j(i)})/\tau\big)}{\sum_{j=1}^{N_t} \exp\big(\mathrm{sim}(\tilde{v}_i, \tilde{t}_j)/\tau\big)},$$

where $j(i)$ denotes the index of the text token corresponding to the $i$-th visual patch, and $\tau$ is a temperature parameter. This design explicitly establishes a semantic mapping between images and text at the feature level, enabling the capture of correspondences such as "chart trend–text conclusion" or "flowchart node–legal reasoning paragraph" in complex law and economics assignments.
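A compact PyTorch sketch of this matching objective, as reconstructed above, is given below; the correspondence index $j(i)$, the temperature value, and the tensor shapes are assumptions made for illustration.

```python
# Sketch of the cross-modal matching loss: an InfoNCE-style objective over
# visual patches and their corresponding text tokens.
import torch
import torch.nn.functional as F

def cross_modal_matching_loss(v_tilde, t_tilde, token_index, tau=0.07):
    # v_tilde: (Nv, d) aligned visual patches; t_tilde: (Nt, d) aligned tokens
    # token_index: (Nv,) long tensor, j(i) = text token matched to patch i
    v = F.normalize(v_tilde, dim=-1)
    t = F.normalize(t_tilde, dim=-1)
    logits = v @ t.transpose(0, 1) / tau          # (Nv, Nt) scaled cosine similarities
    return F.cross_entropy(logits, token_index)   # softmax over text tokens per patch
```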
This module eliminates early-stage semantic misalignment between modalities, ensuring the personalized feedback module works on a consistent cross-modal foundation. For example, it can detect inconsistencies—like a chart showing decline while text mentions “growth”—via low similarity scores, aiding the cognitive weakness highlighter. Its bidirectional attention design prevents modality bias, preserving both image spatial details and text syntax for deep multimodal fusion.
3.3.3. Personalized Feedback Generator
The personalized feedback generation module is positioned after the cross-modal semantic alignment module in the overall architecture. Its inputs include the aligned visual feature matrix $\tilde{V}$ and the textual feature matrix $\tilde{T}$, which are fused with the student background embedding vector $s$ and the task context control vector $c$. A Transformer decoder structure with depth $L_d$ is adopted, where each layer contains multi-head cross-attention sublayers, feed-forward network sublayers, and normalization units. The inputs are first projected to dimension $d = 768$ through linear transformation to match $\tilde{V}$ and $\tilde{T}$, then concatenated along the sequence dimension to form the conditional representation matrix $Z$.

Each decoder layer employs $M$ attention heads, a feed-forward layer width of $d_{\mathrm{ff}}$, and a dropout rate of 0.1. In implementation, self-attention mechanisms are first applied to model dependencies among generated feedback tokens, followed by cross-attention over the conditional representation to integrate cross-modal features and personalized control information. Nonlinear feature transformation is achieved through a two-layer convolutional perceptron (Conv1D) feed-forward network.
The feedback generation process can be formalized as follows: let $Y_{<t}$ denote the feedback token embedding sequence generated at step $t-1$; then the output $H^{(l)}$ of the $l$-th decoder layer satisfies:

$$H^{(l)} = \mathrm{FFN}^{(l)}\Big(\mathrm{CrossAttn}\big(\mathrm{SelfAttn}(H^{(l-1)}),\, Z;\, W^{(l)}\big)\Big), \qquad H^{(0)} = Y_{<t},$$

where $W^{(l)}$ represents the trainable parameters at layer $l$, and $\mathrm{FFN}^{(l)}(\cdot)$ denotes the two-layer feed-forward network mapping. The final output $H^{(L_d)}$ is transformed into a vocabulary probability distribution through linear projection and softmax, enabling token-wise autoregressive generation.
Mathematically, the incorporation of $s$ and $c$ during decoding is equivalent to introducing personalized priors $p(s, c)$ in conditional probability modeling. While standard Transformer decoders model $p(y_t \mid y_{<t}, \tilde{V}, \tilde{T})$, the enhanced version learns:

$$p(Y \mid \tilde{V}, \tilde{T}, s, c) = \prod_{t=1}^{T} p\big(y_t \mid y_{<t}, \tilde{V}, \tilde{T}, s, c\big).$$

Through the conditional probability chain rule, $s$ and $c$ influence predictions at each timestep as contextual conditions, mathematically guaranteeing adaptation to individual and task characteristics.
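One simple way to realize this conditioning, sketched below under stated assumptions, is to prepend the student and task embeddings to the cross-attention memory so that every decoding timestep attends to them; the dimensions, layer counts, and module names are placeholders rather than the configuration used in this work.

```python
# Sketch of injecting the personalized priors s and c into an autoregressive
# decoder by extending the cross-attention memory.
import torch
import torch.nn as nn

class PersonalizedDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=768, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, feedback_tokens, v_tilde, t_tilde, s, c):
        # Conditional representation Z = [V~ ; T~ ; s ; c] along the sequence axis
        memory = torch.cat([v_tilde, t_tilde, s.unsqueeze(1), c.unsqueeze(1)], dim=1)
        tgt = self.embed(feedback_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.proj(hidden)                  # per-token vocabulary logits
```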
The joint usage with cross-modal semantic alignment module provides semantically consistent and multimodal-fused feature foundations, enabling direct reference to corresponding visual and textual evidence during feedback generation while incorporating student-specific characteristics. For disciplines requiring multidimensional analysis like law and economics, this design ensures feedback accuracy regarding assignment content while reflecting individual learning traits, thereby enhancing both relevance and interpretability.
3.3.4. Cognitive Weakness Highlighter
The cognitive weakness highlighter module takes the aligned visual representations $\tilde{V}$ and textual representations $\tilde{T}$ as inputs, performing explicit localization of deficiencies through multi-scale residual attention and prototype metrics. The visual branch first reconstructs $\tilde{V}$ into a spatial feature map $F_0 \in \mathbb{R}^{H \times W \times d}$ (default $H = W = 28$, $d = 768$) based on the image patch grid, then processes it through three pyramid convolution stacks to obtain multi-scale features $F_1$, $F_2$, and $F_3$. Each stack contains two residual blocks (with convolutions of strides 1, 2, and 2, channel numbers as specified, GELU and LayerNorm, and Dropout = 0.1). The textual branch projects $\tilde{T}$ into the same feature space and generates token-dependent guidance maps via convolutions. After channel-wise concatenation, the features pass through residual attention stacks, producing multi-scale evidence maps that are upsampled and fused into a single evidence map $E$.
To quantify deviations from expert paradigms, the module maintains a prototype library $\mathcal{P} = \{p_k\}_{k=1}^{K}$ containing high-quality assignment exemplars. For any point $(u, v)$ with feature $f_{u,v}$, its cognitive weakness energy is defined as the quadratic form score of the Mahalanobis distance to the nearest prototype:

$$e(u, v) = \min_{k}\, (f_{u,v} - p_k)^{\top} \Sigma_k^{-1} (f_{u,v} - p_k),$$

which undergoes normalization and temperature scaling to produce pixel-level weakness probabilities $q(u, v)$. To map regional evidence to text segments, the cross-modal alignment matrix $M$ from the alignment module is utilized for patch-based aggregation, generating a token-level distribution $q_{\mathrm{txt}}$. Jensen–Shannon divergence regularization is introduced to ensure consistency between visual and textual evidence distributions:

$$\mathcal{L}_{\mathrm{JS}} = \mathrm{JS}\big(q_{\mathrm{txt}} \,\|\, \hat{q}_{\mathrm{txt}}\big),$$

where $\hat{q}_{\mathrm{txt}}$ represents the textual dual distribution obtained via patch pooling of $M$. The primary weakness detection loss employs a hinge-type segmentation objective with threshold $\delta$. Given annotation mask $y_{u,v} \in \{0, 1\}$:

$$\mathcal{L}_{\mathrm{weak}} = \frac{1}{|\Omega|} \sum_{(u,v) \in \Omega} \Big[\, y_{u,v}\,\max\big(0,\, \delta - q(u,v)\big) + (1 - y_{u,v})\,\max\big(0,\, q(u,v) - \delta\big) \Big].$$

Isotropic total variation regularization is incorporated to suppress noise and obtain compact highlighted regions:

$$\mathcal{L}_{\mathrm{TV}} = \sum_{u,v} \sqrt{\big(q(u{+}1, v) - q(u, v)\big)^2 + \big(q(u, v{+}1) - q(u, v)\big)^2}.$$

The overall training objective becomes $\mathcal{L} = \mathcal{L}_{\mathrm{weak}} + \lambda_{\mathrm{JS}} \mathcal{L}_{\mathrm{JS}} + \lambda_{\mathrm{TV}} \mathcal{L}_{\mathrm{TV}}$.
From a discriminative perspective, when expert prototypes approximately follow Gaussian distributions with invertible covariance, Equation (6) becomes equivalent to maximum likelihood ratio testing with quadratic decision boundaries, demonstrating superior Bayesian consistency for locally non-separable assignment patterns. The JS divergence in Equation (7) constrains cross-modal evidence to shared mixture priors, minimizing modality conflicts. Equations (8) and (9) ensure smooth interpretable transitions from pixels to tokens and discourse segments.
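The following sketch illustrates the prototype-based scoring under simplifying assumptions: a shared inverse covariance, a sigmoid in place of the unspecified normalization, and illustrative tensor shapes.

```python
# Sketch of the prototype-based weakness energy: squared Mahalanobis distance
# from each spatial feature to its nearest expert prototype, turned into a
# weakness probability by temperature-scaled squashing.
import torch

def weakness_energy(features, proto_means, proto_cov_inv, temperature=1.0):
    # features: (H*W, d) fused map; proto_means: (K, d); proto_cov_inv: (d, d)
    diff = features.unsqueeze(1) - proto_means.unsqueeze(0)       # (H*W, K, d)
    maha = torch.einsum("nkd,de,nke->nk", diff, proto_cov_inv, diff)
    energy = maha.min(dim=1).values                               # distance to nearest prototype
    return torch.sigmoid(energy / temperature)                    # pixel-level weakness probability q
```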
During joint operation with the personalized feedback generator, $q$ and $M$ serve as evidential gates for decoding conditions through sigmoidal modulation:

$$\tilde{Z} = Z \odot \sigma\big(g(q, M)\big).$$

It can be mathematically shown that when $g$ monotonically increases for high-confidence weaknesses, the gradient conditions $\partial \tilde{Z} / \partial q > 0$ and $\partial p(y_t \mid \tilde{Z}) / \partial q \neq 0$ hold, guaranteeing feedback responsiveness to located deficiencies at the gradient level. Applied to open-ended assignments in law and economics, this design precisely maps advanced errors such as "contradictions between chart trends and textual conclusions" or "mismatches between legal citations and factual elements" to visual regions and specific statements, directly injecting evidence into generation decoding to form traceable feedback loops integrating localization, explanation, and improvement suggestions.
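A minimal sketch of such a sigmoidal evidential gate is given below; adding the log of the gate to the attention logits is equivalent to multiplying the unnormalized attention weights by the gate, and the gain and bias values are illustrative.

```python
# Sketch of the evidential gate: weakness scores modulate the decoder's
# cross-attention logits through a sigmoidal gain, so high-confidence
# weaknesses receive proportionally more attention.
import torch

def gate_attention_logits(attn_logits, weakness_scores, gain=2.0, bias=0.0):
    # attn_logits: (heads, T_dec, N_evidence); weakness_scores: (N_evidence,) in [0, 1]
    g = torch.sigmoid(gain * weakness_scores + bias)   # monotone in the weakness score
    return attn_logits + torch.log(g + 1e-6)           # multiplies attention weights by g
```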
4. Results and Discussion
4.1. Evaluation Metrics
To comprehensively evaluate the performance of the proposed multimodal educational feedback generation model, five categories of metrics were employed: ROUGE-L, BLEU, CIDEr, human evaluation, and ablation performance analysis. The mathematical definitions of the primary automated metrics are expressed as follows:

$$\mathrm{ROUGE\text{-}L} = \frac{(1+\beta^{2})\, R_{\mathrm{lcs}}\, P_{\mathrm{lcs}}}{R_{\mathrm{lcs}} + \beta^{2} P_{\mathrm{lcs}}},$$

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_{n} \log p_{n}\Big),$$

$$\mathrm{CIDEr} = \frac{1}{M} \sum_{j=1}^{M} \frac{\mathbf{g} \cdot \mathbf{r}_{j}}{\lVert \mathbf{g} \rVert\, \lVert \mathbf{r}_{j} \rVert},$$

where $R_{\mathrm{lcs}}$ and $P_{\mathrm{lcs}}$ represent recall and precision based on the longest common subsequence, respectively; $\beta$ denotes the weighting coefficient balancing recall and precision; $\mathrm{BP}$ indicates the brevity penalty factor in BLEU; $p_{n}$ represents the $n$-gram precision matching rate; $w_{n}$ is the $n$-gram weighting coefficient; and $N$ denotes the maximum $n$-gram order. In CIDEr, $\mathbf{g}$ and $\mathbf{r}_{j}$ correspond to the TF-IDF weighted $n$-gram vectors of the generated feedback and the $j$-th reference feedback, respectively, with $M$ being the number of reference feedback samples. Human evaluation was conducted by three expert educators with over 10 years of teaching experience, who scored the feedback quality on three dimensions (accuracy, personalization, and interpretability) using a 5-point Likert scale, with the average score taken as the final result. Ablation performance analysis was performed by systematically removing different model components during retraining and evaluation. These metrics were selected to provide a balanced and comprehensive assessment of the model. Specifically, ROUGE-L and BLEU capture lexical overlap and fluency, CIDEr emphasizes semantic relevance through TF-IDF weighting, human evaluation ensures that pedagogical qualities such as personalization and interpretability are assessed, and ablation analysis quantifies the contributions of individual modules. Together, they allow both quantitative and qualitative evaluation, ensuring that the model's effectiveness is validated from multiple perspectives.
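For reference, a self-contained sketch of the ROUGE-L computation (recall and precision over the longest common subsequence, combined with the weighting coefficient $\beta$) is given below; BLEU and CIDEr are typically computed with standard packages and are omitted here, and the $\beta$ value of 1.2 is an illustrative default.

```python
# Self-contained sketch of ROUGE-L: LCS-based recall and precision combined
# into an F-measure with weighting coefficient beta.
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```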
4.2. Baseline Models
Six representative baseline models were selected for comparative experiments to validate the effectiveness of the proposed multimodal educational feedback generation framework: GPT-3.5 [27], BLIP-2 [28], ChatGPT Prompt Tuning [29], EduFormer [30], GPT-4V [31], and LLaVA [32]. GPT-3.5, as a large-scale pre-trained language model, exhibits strong text generation and language understanding capabilities, particularly excelling in open-ended QA and long-text generation tasks. BLIP-2 represents a vision–language pre-training framework that achieves high-precision cross-modal understanding through deep fusion of visual and textual features. ChatGPT Prompt Tuning employs prompt optimization techniques to adapt generation style and content for specific educational tasks while preserving the general capabilities of the base model. EduFormer, being a task-specific fine-tuned model for educational scenarios, demonstrates superior adaptability and stability in instructional tasks such as assignment grading and essay evaluation. GPT-4V extends the capabilities of large language models into the multimodal domain by enabling direct visual–textual interaction, delivering state-of-the-art performance in vision–language reasoning, multimodal comprehension, and grounded response generation. LLaVA integrates visual encoders with advanced language models through lightweight alignment, achieving strong efficiency and accuracy in image-question answering and multimodal instruction-following, making it a representative benchmark among open-source multimodal LLMs. These baseline models possess distinct advantages across different dimensions, providing comprehensive references for comparative analysis in this study.
4.3. Performance Comparison on Image–Text Essay Feedback Generation Task
This experiment was designed to evaluate the comprehensive performance of different models in image–text essay feedback generation tasks, verifying the relationship between multimodal understanding capability and feedback generation quality. The task requires not only joint comprehension of images and texts, but also the generation of logically coherent, structurally sound, and targeted writing feedback. Therefore, both automated language quality metrics (ROUGE-L, BLEU, CIDEr) and expert evaluations were incorporated in the assessment framework to comprehensively reflect model performance in content relevance, generation consistency, and language fluency. Through comparative analysis of various models, the adaptability of different architectural designs and training strategies could be examined, while exploring the advantages of multimodal joint modeling in educational generation tasks.
As shown in
Table 2 and
Figure 3, the unimodal GPT-3.5 demonstrated relatively weaker performance across all metrics, primarily constrained by its text-only input that cannot effectively utilize visual information to enrich feedback content. BLIP-2 showed improvements in ROUGE-L, BLEU, and CIDEr through enhanced image–text matching via multimodal encoding. ChatGPT Prompt Tuning optimized task adaptability through prompt engineering, though its improvements were mainly concentrated in generation consistency and human readability. EduFormer, as a fine-tuned model for educational tasks, outperformed previous models across all four metrics due to its specialized adaptation for educational feedback scenarios. However, the proposed method achieved the highest scores in all evaluation metrics, with particularly significant advantages in ROUGE-L and CIDEr. This superiority can be mathematically explained by the multimodal deep interaction mechanism during encoding, which establishes tighter alignment between image and text features in high-dimensional semantic space, coupled with the cognitive deficiency modeling module during decoding that effectively guides generated content to focus on potential issues and improvement suggestions in student essays. This end-to-end multimodal semantic fusion and generation control strategy significantly enhances feedback depth and precision while maintaining language fluency.
4.4. Performance Comparison on Image–Text QA Feedback Task
This experiment evaluated model performance in image–text QA feedback tasks to verify the effectiveness and stability of multimodal fusion in educational settings. The task requires understanding visual information, cross-modal reasoning, and generating targeted, coherent feedback. Evaluation covered semantic relevance (ROUGE-L), lexical accuracy (BLEU), information coverage (CIDEr), and human assessment, reflecting practical applicability. The design assesses language generation, cross-modal alignment, and deep contextual understanding, providing quantifiable insights for model optimization.
As shown in
Table 3 and
Figure 4, GPT-3.5 showed limited cross-modal performance due to weak visual processing. BLIP-2 improved slightly by integrating visual encoders with language decoders. ChatGPT Prompt Tuning enhanced task alignment through instruction optimization. EduFormer advanced cross-modal alignment using education-specific multimodal feature interactions. Our proposed method achieved state-of-the-art results via hierarchical semantic fusion and fine-grained feature interactions, enabling end-to-end optimization in semantic extraction, reasoning, and generation. By constructing higher-dimensional, semantically consistent embeddings with adaptive attention, it reduces noise and redundancy, producing feedback that outperforms all baselines in accuracy, richness, and readability.
4.5. Performance Comparison on Artistic Cognition Guidance and Improvement Suggestion Task
This experiment evaluated multimodal generation models on the “artistic cognition guidance and improvement suggestion” task, assessing their ability to understand visual artwork, extract key cognitive info, and generate actionable feedback. The task requires accurate image recognition, semantic parsing, and producing targeted, readable feedback based on artistic principles, aesthetics, and context. Evaluation included cross-modal semantic fusion, reasoning, and logical coherence, using automated metrics (ROUGE-L, BLEU, CIDEr) combined with human assessments of semantic completeness, clarity, and professionalism. This approach highlights models’ strengths and weaknesses in multimodal art understanding and validates the method’s practical value in complex cognitive tasks.
As shown in
Table 4, GPT-3.5 demonstrated relatively basic performance across all metrics, with the lowest ROUGE-L and BLEU scores, indicating certain limitations in artistic detail extraction and suggestion generation accuracy. BLIP-2 showed slight improvements over GPT-3.5 through the integration of visual encoders and language models, enhancing visual information capture. ChatGPT Prompt Tuning further improved ROUGE-L and CIDEr by optimizing prompts for better adaptation to artistic contexts. EduFormer achieved higher BLEU and CIDEr scores through enhanced cross-modal information fusion via deep alignment between visual and semantic features. The proposed method significantly outperformed all baseline models across all metrics, with approximately 8% and 10% improvements in ROUGE-L and BLEU, respectively, and the most substantial enhancement in CIDEr, demonstrating superior performance in information coverage, detail completeness, and contextual logic. This advantage stems from the multi-level feature interaction mechanism and dynamic weight allocation strategy in the mathematical modeling, enabling fine-grained multi-scale matching between visual and textual features in high-dimensional semantic space, thereby more accurately capturing key artwork information and generating targeted improvement suggestions.
4.6. Ablation Study of the CWH Across Three Tasks
This experiment aimed to validate the effectiveness of the CWH module across different educational tasks, covering three scenarios: image–text essay feedback generation, cross-modal QA feedback, and artistic cognition guidance with improvement suggestions. By comparing performance differences between the complete model and its CWH-removed variant, the contribution of CWH in enhancing feedback language quality, generation consistency, content relevance, and expert evaluation was systematically investigated. The core motivation of this ablation study was to confirm whether CWH could effectively identify and present learners’ cognitive deficiencies in multimodal educational feedback, thereby providing more targeted and in-depth guidance. Since the three tasks differ significantly in input modality structure, output generation objectives, and feedback content granularity, examining CWH’s role across them enables comprehensive evaluation of its cross-task adaptability and stability.
As shown in
Table 5 and
Figure 5, the complete model consistently outperformed its CWH-removed counterpart across all three tasks. In Task 1, the full model showed improvements of approximately 3.5, 3.2, and 0.13 in ROUGE-L, BLEU, and CIDEr respectively, with a 0.42-point enhancement in expert rating, demonstrating CWH’s capability to enhance text generation structure and detail coverage. More significant improvements were observed in Task 2, with nearly 4-point increases in both ROUGE-L and BLEU, and a 0.13 CIDEr improvement, indicating CWH’s crucial role in integrating cross-modal information and highlighting key elements in scenarios requiring precise image–text combination. Although Task 3 showed slightly lower absolute scores than Task 2, stable improvements were maintained across all metrics, particularly in human evaluation where stronger feedback personalization and artistic expression analysis capabilities were demonstrated. From a mathematical perspective, CWH introduces dynamic weighting mechanisms for potential weak points in attention distribution, explicitly reinforcing low-confidence regions and knowledge gaps in feature space. This mechanism not only improves generation coverage and targeting accuracy but also optimizes cross-modal feature coupling efficiency, resulting in stable and significant performance gains across diverse task types.
4.7. Case Studies and Qualitative Analysis
The objective of this experiment is to go beyond statistical metrics and demonstrate the practical effectiveness and interpretability of the proposed framework through qualitative case studies. Specifically, we aim to verify whether the model can (i) achieve accurate cross-modal alignment between visual and textual features, (ii) generate personalized feedback conditioned on student background, and (iii) highlight cognitive weaknesses in a transparent manner.
Figure 6 illustrates the cross-modal attention distribution for the inflation question example. The heatmap indicates that the model assigns higher attention weights to the chart region depicting the declining purchasing power curve and links it with the relevant textual phrase in the student’s answer. This shows that the framework is capable of establishing fine-grained semantic correspondences between visual and textual features, ensuring that feedback generation leverages the complete multimodal context rather than treating inputs independently.
Table 6 further demonstrates the contribution of contextual signals through background embeddings. Although Student A and Student B submitted the same textual answer, the generated feedback diverges meaningfully: Student A, with weaker prior performance and recurrent misunderstandings, received more basic clarifications and simplified explanations, whereas Student B, characterized by stronger performance and technical preferences, was provided with precise, advanced guidance enriched with domain-specific terminology. These results highlight the model’s dual strengths in multimodal alignment and personalized adaptation, confirming that it delivers feedback that is both accurate and tailored to learner needs.
4.8. Discussion
The proposed method demonstrates significant application potential in real-world educational and cognitive training scenarios. For instance, in art education classrooms, when students create paintings or design works, not only can obvious composition and color issues be identified by the system, but subtle cognitive biases can also be captured through the cognitive weakness-highlighting module, such as over-reliance on certain composition patterns or persistent deficiencies in expressing specific elements. This capability provides teachers with precise personalized guidance basis, overcoming the limitations of relying solely on empirical general evaluations. In language learning feedback generation, the system can focus on recurring structural errors in writing, speaking, or dialogue tasks, enabling learners to quickly identify cognitive blind spots and thereby improve learning efficiency. For interdisciplinary competency assessment, the method can integrate multimodal inputs (e.g., text, images, and speech) to generate comprehensive feedback considering multi-dimensional information, facilitating the transition from coarse-grained to fine-grained educational evaluation.
In cognitive rehabilitation or special education domains, this approach also exhibits practical significance. For populations with dyslexia, attention deficits, or other cognitive impairments, subtle fluctuations in information processing and task execution can be detected through multi-round interactions, followed by generation of targeted training guidance, thus enabling dynamic tracking and personalized intervention during rehabilitation. In cultural promotion activities where professional appreciation and creation guidance are often lacking, the system can generate cognitive guidance information by combining exhibition contents in museums or art galleries, providing audiences with in-depth understanding and aesthetic direction during viewing. Meanwhile, in vocational skill training (e.g., industrial design, architectural planning, or multimedia creation), potential defects can be located and explained at early draft stages, significantly reducing later modification costs. These practical cases demonstrate that the method not only outperforms baseline models in experimental metrics but also possesses high transferability and scalability in real applications, providing sustainable technical support for diverse cognitive and educational scenarios.
4.9. Limitation and Future Work
Although the proposed method demonstrates promising performance across various practical scenarios, several limitations remain. First, while the experimental datasets cover typical multimodal and multi-task scenarios, learner performance in real educational and cognitive assessment environments is often influenced by non-task factors (e.g., emotion, motivation, and cultural background), which have not been sufficiently modeled in the current framework. Furthermore, despite incorporating multimodal feature fusion and cognitive weakness-highlighting mechanisms, the model’s robustness against extreme few-shot cases, noisy data, or cross-domain transfer requires further validation.
Future research could be extended in multiple directions. Additional modalities such as emotion recognition and physiological signal monitoring could be incorporated to more comprehensively characterize learners’ cognitive states for precise personalized feedback. Exploration of large-scale pre-trained models combined with adaptive fine-tuning strategies may enhance the method’s adaptability in cross-disciplinary and cross-cultural environments, particularly for resource-scarce languages and non-standardized tasks. Meanwhile, online learning and incremental training mechanisms could be introduced to enable continuous system updating and optimization during long-term interactions, accommodating dynamic changes in learners’ abilities and requirements.
5. Conclusions
A novel framework integrating cross-modal alignment mechanisms and personalized attention modulation was developed to address the limitations of conventional approaches in simultaneously achieving fine-grained recognition and robust modeling within complex educational assessment scenarios. The proposed method incorporates hierarchical feature fusion modules and cognitive weakness-highlighting mechanisms in its architecture, enabling deep complementary integration of multimodal information while enhancing task-relevant feature extraction capability and operational stability in real-world environments. Experimental results demonstrated significant improvements across multiple core evaluation metrics, with average accuracy exceeding 92%, recall maintained at approximately 91%, and precision consistently above 90%. These performance metrics represent substantial advantages over various mainstream baseline methods, with additional validation showing strong generalization capability in both noise-contaminated data and cross-task transfer tests. Ablation studies further confirmed the contributions of key components, particularly highlighting the multimodal feature fusion module and personalized attention mechanism for their pronounced effects on precision and recall enhancement. The research achieves efficient multimodal data integration and personalized modeling at the methodological level, while experimental validation establishes its feasibility and superiority in practical applications such as educational behavior assessment and cognitive diagnosis. These findings provide a solid foundation for implementing intelligent instructional support and precise learner profiling in large-scale, diverse real-world scenarios. The framework’s mathematical formulation ensures stable gradient propagation during optimization, while its modular architecture permits flexible adaptation to various educational contexts through component-wise modifications.