Article

GOMFuNet: A Geometric Orthogonal Multimodal Fusion Network for Enhanced Prediction Reliability

by Yi Guo 1,* and Rui Zhong 2,*

1 School of Economics, Beijing Technology and Business University, Beijing 102401, China
2 Information Initiative Center, Hokkaido University, Sapporo 060-0808, Japan
* Authors to whom correspondence should be addressed.
Mathematics 2025, 13(11), 1791; https://doi.org/10.3390/math13111791
Submission received: 25 April 2025 / Revised: 7 May 2025 / Accepted: 21 May 2025 / Published: 27 May 2025
(This article belongs to the Special Issue Deep Neural Network: Theory, Algorithms and Applications)

Abstract
Integrating information from heterogeneous data sources poses significant mathematical challenges, particularly in ensuring the reliability and reducing the uncertainty of predictive models. This paper introduces the Geometric Orthogonal Multimodal Fusion Network (GOMFuNet), a novel mathematical framework designed to address these challenges. GOMFuNet synergistically combines two core mathematical principles: (1) It utilizes geometric deep learning, specifically Graph Convolutional Networks (GCNs), within its Cross-Modal Label Fusion Module (CLFM) to perform fusion in a high-level semantic label space, thereby preserving inter-sample topological relationships and enhancing robustness to inconsistencies. (2) It incorporates a novel Label Confidence Learning Module (LCLM) derived from optimization theory, which explicitly enhances prediction reliability by enforcing mathematical orthogonality among the predicted class probability vectors, directly minimizing output uncertainty. We demonstrate GOMFuNet’s effectiveness through comprehensive experiments, including confidence calibration analysis and robustness tests, and validate its practical utility via a case study on educational performance prediction using structured, textual, and audio data. Results show GOMFuNet achieves significantly improved performance (90.17% classification accuracy, 88.03% regression $R^2$) and enhanced reliability compared to baseline and state-of-the-art multimodal methods, validating its potential as a robust framework for reliable multimodal learning.

1. Introduction

Integrating information from heterogeneous data sources, known as multimodal learning, presents fundamental mathematical and computational challenges, particularly in ensuring the reliability and reducing the uncertainty of predictive models built upon such diverse inputs [1]. Effectively fusing data with disparate structures (e.g., geometric, sequential, tabular) while mitigating inconsistencies and noise inherent in real-world observations remains an open problem [2]. Furthermore, quantifying and actively minimizing the uncertainty associated with predictions derived from complex multimodal models is crucial for trustworthy decision-making in critical applications [3]. Traditional methods for analyzing complex phenomena often rely on single data sources or simple aggregation techniques, which struggle to capture the multifaceted nature of many real-world systems, leading to predictions with limited accuracy and applicability [4]. The advent of artificial intelligence (AI) has spurred efforts to improve prediction precision [5], but fully harnessing the rich information in diverse data streams requires sophisticated mathematical frameworks.
Existing mathematical approaches to multimodal fusion often fall into categories like early (feature-level) fusion, late (decision-level) fusion, or intermediate strategies employing techniques such as attention mechanisms or tensor factorization [1,6]. Probabilistic graphical models [7] and Bayesian methods [8] offer principled ways to model dependencies and uncertainty but can face scalability issues or restrictive distributional assumptions. Information-theoretic approaches aim to maximize mutual information or minimize redundancy across modalities [9]. While deep learning methods, particularly Transformer-based architectures [10,11], have shown promise in capturing crossmodal correlations, many standard techniques face limitations from a mathematical perspective [12].
Specifically, many existing methods struggle with several key mathematical aspects: (1) Geometric Structure Preservation: Simple concatenation or standard attention often fails to explicitly preserve the intrinsic geometric structure or topological relationships within the data manifold of each modality, potentially losing valuable information [13]. (2) Handling Inconsistency: Fusing potentially conflicting signals from different modalities in a mathematically robust manner remains difficult, often leading to suboptimal or unstable results [14]. (3) Explicit Reliability Optimization: Most standard loss functions (e.g., cross-entropy) primarily focus on matching predictions to ground truth labels, rather than directly optimizing the mathematical properties of the output distribution itself to enhance prediction confidence and separability [15]. This lack of explicit reliability control can limit the trustworthiness of predictions, especially in the presence of noise or ambiguity. Foundational mathematical concepts from graph theory [16], geometric deep learning [13], and statistical reliability provide tools to address these gaps.
To address these limitations, this paper introduces the Geometric Orthogonal Multimodal Fusion Network (GOMFuNet), a novel mathematical framework designed for reliable multimodal data integration and prediction. GOMFuNet uniquely combines two core mathematical concepts. First, it leverages principles from geometric deep learning, employing a Graph Convolutional Network (GCN) within its Cross-Modal Label Fusion Module (CLFM) to perform fusion explicitly in a high-level semantic label space. This approach mathematically models and preserves the topological relationships between samples based on their modality-specific representations, facilitating robust integration even with inconsistent signals. Second, GOMFuNet incorporates a novel Label Confidence Learning Module (LCLM) derived from optimization principles. LCLM introduces an explicit mathematical objective that enforces orthogonality among the predicted class probability vectors. By minimizing the correlation between predictions for different classes, LCLM directly enhances the separability and confidence of the model’s output, thus reducing prediction uncertainty.
The primary mathematical contributions of this work are the following: (1) the design and formalization of the GOMFuNet architecture, integrating geometric fusion in label space with output orthogonality optimization; (2) the derivation and mathematical justification of the LCLM loss function as a novel mechanism for explicit uncertainty reduction in multimodal classification/regression; and (3) rigorous empirical validation demonstrating the effectiveness of GOMFuNet in enhancing prediction accuracy and reliability compared to existing methods.
To demonstrate the practical efficacy of the proposed GOMFuNet framework in tackling complex, real-world problems involving heterogeneous data and the need for reliable predictions, we conduct a case study in the domain of educational performance prediction [17]. Using a multimodal dataset comprising structured, textual, and audio data from student interactions [18,19], we validate GOMFuNet’s ability to outperform baseline and conventional fusion techniques. This case study, involving factors influencing student success [20,21], serves to illustrate the tangible benefits of our mathematically grounded approach in a challenging application context.
The remainder of this paper is organized as follows: Section 2 reviews related work from the perspectives of mathematical fusion strategies and uncertainty handling. Section 3 provides the detailed mathematical formulation and algorithmic description of the proposed GOMFuNet. Section 4 describes the experimental setup for the case study. Section 5 presents the empirical results of the case study, including comparisons and ablation studies. Section 6 discusses the findings, implications, and limitations. Section 7 concludes the paper.

2. Related Work

2.1. Mathematical Strategies for Multimodal Fusion

Multimodal fusion aims to mathematically integrate information from multiple heterogeneous data sources (modalities) for enhanced representation or prediction [6,22]. Classical fusion strategies include early fusion, late fusion, and intermediate fusion. Early fusion, or feature-level fusion, typically involves concatenating feature vectors extracted from each modality [23]. While conceptually simple, this approach often struggles mathematically with high dimensionality and feature heterogeneity (differing scales and statistical properties), and may fail to capture complex, non-linear intermodal interactions or preserve modality-specific structures. Late fusion, or decision-level fusion, combines the outputs of modality-specific models [24]. This avoids feature-level heterogeneity issues but mathematically prevents the model from learning synergistic interactions between modalities during the representation learning phase.
Intermediate fusion strategies attempt to combine information at deeper layers of a network. Attention mechanisms, particularly within Transformer architectures [10,25], have become popular for modeling crossmodal correlations by learning adaptive weights. However, standard attention may not explicitly account for the underlying geometric manifold structure of the data within each modality. Tensor-based methods offer another mathematical framework for capturing high-order interactions but can be computationally expensive [26]. Recent advances in geometric deep learning [13] offer promising tools. Graph Neural Networks (GNNs), including GCNs, can operate on graph-structured data, allowing models to explicitly leverage relationships between data points (nodes). While GNNs have been applied to multimodal problems [27], their use for fusion directly within a high-level semantic or label space, as proposed in GOMFuNet, is less common and offers potential advantages in handling inconsistencies and focusing on semantic agreement.

2.2. Uncertainty Quantification and Reliability in Deep Learning

Ensuring the reliability of predictions from deep learning models, especially complex multimodal ones, is a critical mathematical challenge [3]. Prediction uncertainty arises from various sources, including data noise (aleatoric uncertainty) and model limitations (epistemic uncertainty). Several mathematical frameworks exist for quantifying and mitigating uncertainty. Bayesian Neural Networks (BNNs) provide a principled probabilistic approach by placing distributions over model weights, but inference can be computationally demanding [28]. Ensemble methods average predictions from multiple models trained independently or with variations, offering a practical way to estimate uncertainty but increasing computational cost [29].
Confidence calibration aims to ensure that the predicted probabilities accurately reflect the true likelihood of correctness [30]. Miscalibrated models can be overconfident or underconfident, hindering reliable decision-making. Techniques like temperature scaling or Platt scaling are post hoc calibration methods [30]. However, incorporating reliability directly into the training process is desirable. Some loss functions, like Focal Loss [31], address class imbalance but do not directly target the confidence structure itself. Other approaches modify the network architecture or training procedure [30]. The LCLM proposed in GOMFuNet offers a distinct mathematical approach by directly optimizing the geometric orthogonality of the output probability vectors, aiming to improve separability and confidence inherently during training, complementing standard task-based losses. This contrasts with methods focusing solely on individual prediction calibration or weight uncertainty.

2.3. Student Performance Prediction as an Application Context

Predicting student course performance is an important application area within educational data mining [17,32]. Traditional statistical methods [33] often yield limited accuracy due to the complexity of factors influencing learning [34,35]. AI and machine learning have been increasingly applied [36,37], leveraging diverse data sources available in modern learning environments, including structured records, textual submissions, classroom audio/video, and interaction logs [38,39]. The inherent multimodality and potential for noise and inconsistency in educational data make it a suitable and challenging testbed for evaluating advanced, reliable fusion frameworks like the proposed GOMFuNet [14,40]. Our work uses this domain not as its primary focus, but as a case study to validate the mathematical principles and practical effectiveness of GOMFuNet.

3. A Mathematical Framework for Reliable Geometric Multimodal Fusion

3.1. Mathematical Problem Statement: Prediction Under Heterogeneity and Uncertainty

Consider the problem of learning from multimodal data drawn from a joint probability distribution $P$ over a product space $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_M$ represents the combined input space from $M$ modalities and $\mathcal{Y}$ is the target space. Given an empirical sample $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N} \sim P^N$, where $x_i = (x_{i,1}, \ldots, x_{i,M})$ with $x_{i,m} \in \mathcal{X}_m$, the objective is to find a function $f : \mathcal{X} \to \mathcal{P}(\mathcal{Y})$ within a hypothesis space $\mathcal{H}$ that minimizes the expected risk:
$$ R(f) = \mathbb{E}_{(x,y) \sim P}\left[\ell(f(x), y)\right], $$
where $\ell : \mathcal{P}(\mathcal{Y}) \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ is a loss functional. Practical challenges arise from the heterogeneity of the spaces $\mathcal{X}_m$ (requiring robust fusion mechanisms) and the need for reliable uncertainty estimates from $f(x)$, which standard empirical risk minimization (ERM) often fails to guarantee. Specifically, we address the following:
(C1)
Structure-Preserving Fusion: How to mathematically fuse information from disparate modalities while preserving and leveraging the intrinsic geometric or topological structures presumed to exist within the data representation of each modality?
(C2)
Explicit Reliability Enhancement: How to mathematically formulate an objective or constraint that explicitly promotes the reliability, confidence, and separability of the output probability distribution $f(x)$, beyond simply minimizing the empirical task loss $\ell$?
This paper introduces a framework tackling (C1) using concepts from geometric deep learning applied in a semantic latent space, and (C2) via a novel output space regularization technique derived from orthogonality principles in linear algebra and information geometry.

3.2. Geometric Integration via Graph Propagation in Semantic Latent Spaces

Our approach to (C1) involves transforming the fusion problem into one of learning on graphs constructed within appropriate latent spaces, thereby explicitly encoding inter-sample relationships.

3.2.1. Graph Embedding of Modality-Specific Manifolds

Let $\phi_m : \mathcal{X}_m \to \mathcal{Z}_m \subseteq \mathbb{R}^{d_m}$ be learned embedding functions mapping raw data to a latent semantic space $\mathcal{Z}_m$. For a batch $B$ of $N$ samples, we obtain embeddings $X_m = \{x_{i,m}\}_{i \in B}$. Assuming the data points lie on or near a lower-dimensional manifold $\mathcal{M}_m \subset \mathcal{Z}_m$, we approximate $\mathcal{M}_m$ using a discrete graph structure $G_m = (V, A_m)$, where $V = B$. The weighted adjacency matrix $A_m$ captures local geometric structure. A common construction uses a symmetric kNN graph with heat kernel weights:
$$ A_{m,ij} = \begin{cases} \exp\left(-\dfrac{\|x_{i,m} - x_{j,m}\|_2^2}{2\sigma_m^2}\right) & \text{if } j \in N_k(i) \text{ or } i \in N_k(j) \\ 0 & \text{otherwise} \end{cases} $$
The choice of $k$ and $\sigma_m$ influences the scale at which the manifold geometry is captured. This graph provides a combinatorial representation of the data topology for modality $m$.
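For concreteness, the following PyTorch sketch builds such an adjacency matrix for one modality. The function name and the default values of $k$ and $\sigma_m$ are illustrative, not the paper's exact implementation:

```python
import torch

def knn_heat_kernel_adjacency(x: torch.Tensor, k: int = 10, sigma: float = 1.0) -> torch.Tensor:
    """Symmetric kNN graph with heat-kernel weights.

    x: (N, d_m) batch of modality-m embeddings; returns A_m of shape (N, N).
    """
    dist2 = torch.cdist(x, x).pow(2)                  # squared pairwise distances
    weights = torch.exp(-dist2 / (2.0 * sigma ** 2))  # heat-kernel weights
    # k nearest neighbours of each node, excluding the node itself (distance 0).
    knn_idx = dist2.topk(k + 1, largest=False).indices[:, 1:]
    mask = torch.zeros_like(dist2, dtype=torch.bool)
    mask.scatter_(1, knn_idx, True)
    mask = mask | mask.T                              # symmetrize: j in N_k(i) or i in N_k(j)
    return weights * mask                             # zero outside the kNN neighbourhoods
```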

3.2.2. Spectral Graph Convolution for Structure-Aware Learning

Learning on $G_m$ requires operators that respect its structure. Graph Convolutional Networks (GCNs) leverage the graph Laplacian $L_m = D_m - A_m$, where $D_m$ is the diagonal degree matrix ($D_{m,ii} = \sum_j A_{m,ij}$). The normalized Laplacian $\mathcal{L}_m = I_N - D_m^{-1/2} A_m D_m^{-1/2}$ has eigenvalues $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_N \le 2$, which relate to the graph's connectivity and geometry. Spectral GCNs define convolution via filtering in the graph Fourier domain:
$$ H_{\text{out}} = U_m \, g_\theta(\Lambda_m) \, (U_m^T H_{\text{in}}) $$
where $\mathcal{L}_m = U_m \Lambda_m U_m^T$ is the eigendecomposition of the Laplacian, $U_m$ contains the eigenvectors (graph Fourier modes), $\Lambda_m$ is the diagonal matrix of eigenvalues (frequencies), and $g_\theta(\Lambda_m)$ is a learnable spectral filter $g_\theta : \mathbb{R} \to \mathbb{R}$ applied element-wise to the eigenvalues. Practical GCN layers approximate this spectral filtering using computationally efficient spatial operations based on the renormalized adjacency matrix $\hat{A}_m = \tilde{D}_m^{-1/2} \tilde{A}_m \tilde{D}_m^{-1/2}$ (where $\tilde{A}_m = A_m + I_N$). The layer update rule becomes
$$ H_m^{(l+1)} = \sigma\big(\hat{A}_m H_m^{(l)} W_m^{(l)}\big) $$
This operation performs localized feature aggregation, effectively smoothing representations along the graph edges and propagating information across the approximated data manifold $\mathcal{M}_m$. Stacking $L_g$ layers yields $\hat{H}_m = H_m^{(L_g)}$, capturing multi-scale geometric information.
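A minimal PyTorch sketch of this propagation rule follows, assuming a dense adjacency matrix such as the one built above; the ReLU nonlinearity stands in for the generic $\sigma$:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One propagation step: H^{(l+1)} = sigma(A_hat @ H^{(l)} @ W^{(l)})."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # W^{(l)}

    @staticmethod
    def renormalized_adjacency(a: torch.Tensor) -> torch.Tensor:
        # A_tilde = A + I, then D_tilde^{-1/2} A_tilde D_tilde^{-1/2}
        a_tilde = a + torch.eye(a.size(0), device=a.device)
        d_inv_sqrt = a_tilde.sum(dim=1).clamp(min=1e-12).rsqrt()
        return d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]

    def forward(self, h: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        return torch.relu(a_hat @ self.linear(h))
```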

3.2.3. Global Context Aggregation Using Self-Attention

The geometry-informed representations $\{\hat{H}_m\}_{m=1}^{M}$ are concatenated into $H_{\text{concat}} \in \mathbb{R}^{N \times D}$. To model global interactions and dependencies across all samples and fused modalities, we apply a self-attention mechanism, realized via a Transformer encoder. The attention operation computes a new representation for each sample $i$ as a weighted sum of linearly transformed representations of all samples $j$:
$$ \text{AttnOutput}_i = \sum_{j=1}^{N} \alpha_{ij} \, (h_{\text{concat},j} W_V) $$
where $h_{\text{concat},j}$ is the $j$-th row of $H_{\text{concat}}$, $W_V$ is a learnable value transformation matrix, and the attention weights $\alpha_{ij}$ are computed from compatibility scores between query and key representations:
$$ \alpha_{ij} = \operatorname{softmax}_j\!\left( \frac{(h_{\text{concat},i} W_Q)(h_{\text{concat},j} W_K)^T}{\sqrt{d_k}} \right) $$
with learnable matrices $W_Q, W_K$. Multi-head attention performs this process in parallel across multiple subspaces. This allows the model to learn a complex aggregation function $f_{\text{agg}}$ that produces the final fused representation $H_{\text{fused}} = f_{\text{agg}}(H_{\text{concat}})$, capturing both local geometric structure (via GCNs) and global context (via the Transformer).
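As a sketch, this sample-wise aggregation can be realized with a standard Transformer encoder by treating the $N$ samples of a batch as a length-$N$ sequence. The model width, layer counts, and the input projection below are assumptions for illustration (the concatenated dimension 928 corresponds to the encoder output sizes in Appendix A):

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers, N = 256, 4, 2, 64        # illustrative sizes
D = 32 + 768 + 128                                   # e.g., d_s + d_t + d_a from Appendix A

proj = nn.Linear(D, d_model)                         # assumed projection to model width
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
f_agg = nn.TransformerEncoder(layer, num_layers=n_layers)

h_concat = torch.randn(N, D)                         # concatenated GCN outputs for one batch
# Attention runs over the N samples: shape (1, N, d_model) -> (1, N, d_model).
h_fused = f_agg(proj(h_concat).unsqueeze(0)).squeeze(0)   # (N, d_model)
```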

3.3. Output Space Geometry Regularization via Orthogonality Enforcement

To tackle (C2), we introduce a novel regularizer, $\mathcal{L}_{\text{LCLM}}$, that directly modifies the optimization landscape to favor solutions exhibiting higher prediction reliability by controlling the geometric configuration of the output probability distributions.

3.3.1. Information Geometry Perspective: Separability and Orthogonality

Let $P \in \Delta_C^N$ be the predicted probability matrix for a batch, where $\Delta_C = \{p \in \mathbb{R}^C \mid p_c \ge 0, \sum_c p_c = 1\}$ is the probability simplex. The $c$-th column $\mathbf{p}_c \in [0,1]^N$ represents the conditional probabilities $P(\text{sample } i \in \text{class } c)$ over the batch. From an information geometry perspective, the separability between classes can be related to the distance or divergence between these column vectors $\mathbf{p}_c$. High correlation or small angular separation between $\mathbf{p}_c$ and $\mathbf{p}_{c'}$ indicates poor discriminability between classes $c$ and $c'$ for the given batch.
We propose using vector orthogonality in the embedding space $\mathbb{R}^N$ as a strong geometric prior for class separability. The ideal condition $\langle \mathbf{p}_c, \mathbf{p}_{c'} \rangle = 0$ for $c \ne c'$ implies that the probability mass assigned to different classes is concentrated on disjoint subsets of samples within the batch. While strict orthogonality is restrictive, minimizing the pairwise cosine similarity serves as a continuous relaxation that encourages angular separation. Let $\hat{\mathbf{p}}_c = \mathbf{p}_c / \|\mathbf{p}_c\|_2$ be the L2-normalized vector. The objective is to minimize $|\langle \hat{\mathbf{p}}_c, \hat{\mathbf{p}}_{c'} \rangle|$ for $c \ne c'$.

3.3.2. The LCLM Regularizer: Maximizing Relative Self-Confidence

We formulate this objective via the Gram matrix of cosine similarities, $G_{cc'} = \langle \hat{\mathbf{p}}_c, \hat{\mathbf{p}}_{c'} \rangle$. Driving $G$ towards the identity $I_C$ is achieved by maximizing the diagonal entries relative to the off-diagonal entries. Consider the softmax normalization along each row (or column) of $G$:
$$ u_{ck}(P; \tau_L) = \frac{\exp\left(\langle \hat{\mathbf{p}}_c, \hat{\mathbf{p}}_k \rangle / \tau_L\right)}{\sum_{j=1}^{C} \exp\left(\langle \hat{\mathbf{p}}_c, \hat{\mathbf{p}}_j \rangle / \tau_L\right)} $$
The term $u_{cc}$ quantifies the relative confidence or concentration of class $c$'s probability vector compared to its alignment with the other class vectors. Maximizing the product $\prod_{c=1}^{C} u_{cc}$ promotes simultaneous high relative self-confidence for all classes. This is equivalent to minimizing the negative log-likelihood:
$$ \mathcal{L}_{\text{LCLM}}(P; \tau_L) = -\frac{1}{C} \sum_{c=1}^{C} \log u_{cc}(P; \tau_L) $$
This differentiable loss function acts as a regularizer on the output probability matrix $P$, pushing its column vectors towards mutual orthogonality.
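A direct PyTorch transcription of the two equations above is compact. This sketch assumes the batch probability matrix stores samples as rows and classes as columns; the temperature default is illustrative:

```python
import torch

def lclm_loss(p: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Orthogonality regularizer on the batch probability matrix.

    p: (N, C) predicted class probabilities (each row on the simplex).
    Returns the scalar loss  -(1/C) * sum_c log u_cc.
    """
    p_hat = p / p.norm(dim=0, keepdim=True).clamp(min=1e-12)  # L2-normalize columns
    gram = p_hat.T @ p_hat                                    # (C, C) cosine similarities
    u = torch.softmax(gram / tau, dim=1)                      # row-wise softmax of G
    return -torch.log(u.diagonal().clamp(min=1e-12)).mean()
```

During training, this regularizer is added to the task loss with weight $\lambda$, as formalized in Section 3.4.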

3.3.3. Gradient Analysis and Optimization Dynamics

The gradient of $\mathcal{L}_{\text{LCLM}}$ with respect to the logits $Z$ (where $P = \operatorname{softmax}(Z)$) can be derived using the chain rule. Let $G_{ck} = \langle \hat{\mathbf{p}}_c, \hat{\mathbf{p}}_k \rangle$. The gradient $\nabla_P \mathcal{L}_{\text{LCLM}}$ involves terms that push $\hat{\mathbf{p}}_c$ away from $\hat{\mathbf{p}}_k$ for $k \ne c$. Specifically, its components satisfy
$$ \frac{\partial \mathcal{L}_{\text{LCLM}}}{\partial P_{ic}} \;\propto\; \sum_{k \ne c} \left(u_{ck} - u_{cc}\,\delta_{ck}\right) \frac{\partial G_{ck}}{\partial P_{ic}} $$
where the terms involving $\partial G_{cc} / \partial P_{ic}$ contribute to maximizing self-correlation (implicitly handled by the normalization) and the terms involving $\partial G_{ck} / \partial P_{ic}$ for $k \ne c$ contribute to minimizing off-diagonal correlations. The derivative $\partial G_{ck} / \partial P_{ic}$ involves derivatives of the normalization $\|\mathbf{p}_c\|_2$ and of the inner product $\mathbf{p}_c^T \mathbf{p}_k$. Minimizing $\mathcal{L}_{\text{LCLM}}$ during training modifies the optimization trajectory defined by the task loss $\mathcal{L}_{\text{Task}}$, guiding the parameters $\Theta$ towards regions of the hypothesis space $\mathcal{H}$ where the model $f_\Theta$ produces predictions that are not only accurate but also satisfy the geometric orthogonality prior, thus exhibiting enhanced reliability and inter-class separability.

3.4. The GOMFuNet Optimization Problem and Algorithm

The overall GOMFuNet framework is trained by minimizing a composite objective function that balances empirical accuracy with the proposed geometric reliability regularization:
$$ \min_\Theta \; \mathbb{E}_{B \sim \mathcal{D}} \left[ \mathcal{L}_{\text{Task}}\big(f_\Theta(B), y_B\big) + \lambda \, \mathcal{L}_{\text{LCLM}}\big(\operatorname{softmax}(f_\Theta(B))\big) \right] $$
where $\Theta$ represents all trainable parameters of the model (the encoders $\phi_m$, the GCNs, the Transformer $f_{\text{agg}}$, and the output layer $W_{\text{out}}$), $\mathcal{L}_{\text{Task}}$ is a standard loss function suitable for the prediction task (e.g., cross-entropy or MSE), $\lambda \ge 0$ is the regularization coefficient controlling the strength of the orthogonality constraint, and the expectation is approximated by averaging over mini-batches $B$.
This optimization problem is typically non-convex and is solved numerically using stochastic gradient-based methods, such as Adam. The detailed step-by-step procedure, involving forward propagation through the geometric fusion and prediction stages, calculation of the composite loss, and backpropagation for parameter updates, is outlined in Algorithm 1.

3.5. Theoretical Impact Analysis of LCLM on Generalization Error and Prediction Margin

While deriving a precise, comprehensive generalization error bound for the GOMFuNet model, which integrates complex components such as Graph Convolutional Networks (GCNs) and Transformers, presents a formidable challenge, we can theoretically analyze how its core innovation—the Label Confidence Learning Module (LCLM)—positively influences the model’s generalization capabilities and prediction margin. The LCLM is designed to enhance prediction reliability by promoting orthogonality among the output class probability vectors. This objective is closely related to classical Margin Theory and principles of complexity control in learning theory.
Algorithm 1 Stochastic Training Algorithm for GOMFuNet
Require: Multimodal dataset $\mathcal{D}$, initial GOMFuNet parameters $\Theta_0$, regularization hyperparameter $\lambda$, learning rate $\eta$, optimizer state $S_0$, max epochs $E$, batch size $B_{\text{size}}$.
Ensure: Optimized GOMFuNet parameters $\Theta^*$.
1:  Initialize $\Theta \leftarrow \Theta_0$, $S \leftarrow S_0$.
2:  for epoch = 1 to $E$ do
3:      Shuffle dataset $\mathcal{D}$.
4:      for each batch $B = \{(d_{i,1}, \ldots, d_{i,M}, y_i)\}_{i \in I_B}$ from $\mathcal{D}$ do
5:          // Forward pass: compute prediction $P_B = f_\Theta(B)$
6:          Compute $X_{m,B}$ for all $m$.
7:          Compute $\hat{H}_{m,B}$ via GCN propagation.
8:          Compute $H_{\text{fused},B}$ via concatenation and Transformer aggregation.
9:          Compute $P_B = \operatorname{softmax}(H_{\text{fused},B} W_{\text{out}} + b_{\text{out}})$.
10:         // Loss computation
11:         Compute $\mathcal{L}_{\text{Task},B} = \operatorname{TaskLoss}(P_B, y_B)$.
12:         Compute $\mathcal{L}_{\text{LCLM},B}$.
13:         Compute $\mathcal{L}_{\text{Total},B} = \mathcal{L}_{\text{Task},B} + \lambda \mathcal{L}_{\text{LCLM},B}$.
14:         // Backward pass and update
15:         Compute gradient $g_t = \nabla_\Theta \mathcal{L}_{\text{Total},B}$.
16:         Update parameters and optimizer state: $(\Theta, S) \leftarrow \operatorname{OptimizerUpdate}(\Theta, S, g_t, \eta)$.
17:     end for
18:     // Optional: validation, learning rate decay, early stopping
19: end for
20: Return optimized parameters $\Theta^*$.
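Mapped to code, one inner iteration of Algorithm 1 (lines 5–16) might look as follows. Here `model` is a hypothetical module wrapping the encoders, GCN propagation, Transformer aggregation, and output layer, and `lclm_loss` is the sketch from Section 3.3.2:

```python
import torch

def train_step(model, optimizer, batch, targets, task_loss_fn, lam: float) -> float:
    """One stochastic update of the composite objective (a sketch, not the
    paper's exact code): L_Total = L_Task + lambda * L_LCLM."""
    optimizer.zero_grad()
    logits = model(batch)                    # forward pass (Algorithm 1, lines 5-9)
    p = torch.softmax(logits, dim=1)
    loss = task_loss_fn(logits, targets) + lam * lclm_loss(p)   # lines 11-13
    loss.backward()                          # line 15: g_t = grad of L_Total
    optimizer.step()                         # line 16: optimizer update
    return loss.item()
```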

3.5.1. LCLM, Output Space Geometry, and Class Margin

The LCLM loss function $\mathcal{L}_{\text{LCLM}}(P; \tau_L)$, as defined in Equation (8), aims to minimize the cosine similarity between the normalized probability vectors $\hat{\mathbf{p}}_c$ and $\hat{\mathbf{p}}_k$ for distinct classes $c \ne k$. This can be expressed as striving for
$$ \langle \hat{\mathbf{p}}_c, \hat{\mathbf{p}}_k \rangle \to 0 \quad \text{for } c \ne k. $$
Geometrically, this is equivalent to encouraging the average probability response directions, represented by these vectors in an $N$-dimensional space (where $N$ is the batch size), to become mutually orthogonal for different classes.
This pursuit of orthogonality can be interpreted as a form of margin maximization at the level of class probability distributions. In an idealized scenario, if the probability vectors $\mathbf{p}_c$ and $\mathbf{p}_k$ (representing the model's probabilistic assignment to classes $c$ and $k$ over a batch or, in expectation, over the entire data distribution) are perfectly orthogonal for any two distinct classes, then, statistically, the patterns leading to a high probability assignment for class $c$ differ significantly from those for class $k$. Such distinctiveness helps the model generate less ambiguous and more definitive class assignments for each input sample.
For an individual sample $x_i$, the predicted probability distribution is $p(x_i) = (P(y=1 \mid x_i), \ldots, P(y=C \mid x_i))$. If the model learns class-specific "signatures" (manifested through $\mathbf{p}_c$) that are globally orthogonal, then a specific sample $x_i$ is more likely to resonate strongly (i.e., yield a high probability) with only one (or a few closely related) class signatures, while maintaining low similarity (i.e., low probability) with the others. This indirectly promotes the "margin" of individual sample predictions: it encourages a larger difference between the predicted probability of the true class $y_i$, $P(y = y_i \mid x_i)$, and the highest predicted probability of any incorrect class $c \ne y_i$:
$$ \text{Margin}_i = P(y = y_i \mid x_i) - \max_{c \ne y_i} P(y = c \mid x_i). $$
LCLM’s objective contributes to increasing this margin on average.

3.5.2. LCLM as a Regularizer and Learning Theory

From a learning theory perspective, L LCLM acts as a regularizer that imposes structural constraints on the model’s output. According to Occam’s Razor and statistical learning theory, controlling model complexity (e.g., by limiting the capacity of the hypothesis space or enforcing solution smoothness through regularization) is crucial for improving generalization, i.e., reducing the gap between empirical risk and expected risk.
Classical generalization error bounds, such as those based on Rademacher complexity, often take the following form:
$$ \mathbb{E}[L(f)] \;\le\; \hat{\mathbb{E}}[L(f)] + 2\,\mathfrak{R}_N(\mathcal{F}) + \text{complexity term} $$
where $L(f)$ is the loss of hypothesis $f$ and $\mathfrak{R}_N(\mathcal{F})$ is the empirical Rademacher complexity of the function class $\mathcal{F}$ on a sample of size $N$. It is well established that classifiers with larger margins (e.g., Support Vector Machines) often correspond to lower Rademacher complexity or VC dimension.
By enforcing orthogonality among output probability vectors, LCLM guides the model towards learning a solution that is “structurally simpler” or exhibits “greater class separation”. Specifically, it penalizes model parameter configurations that lead to highly correlated probability distributions for different classes. This prior knowledge imposed on the output structure can be viewed as restricting the effective hypothesis space, biasing it towards functions that produce high-confidence, highly separable predictions. While we do not explicitly compute the VC dimension or Rademacher complexity of GOMFuNet (which is inherently complex due to its deep GCN and Transformer architecture), the introduction of LCLM aligns with the fundamental principle of improving generalization by increasing the margin or imposing structural constraints.
Furthermore, the orthogonality encouraged by LCLM among class probability vectors is analogous to a feature transformation that renders data points from different classes more linearly separable in the transformed space. If we consider each class probability vector p c as a form of “average signature” for that class in the “sample space”, LCLM pushes these signatures to be as distinct as possible. This distinctiveness allows decision boundaries to be placed in sparser regions of the output space, thereby enhancing robustness in predictions for unseen samples, which is consistent with the idea of improving generalization margin.

3.5.3. Error Probability and Its Relation to Margin

In classification tasks, the error probability is intrinsically linked to the margin of the decision boundary. For a sample $(x, y_{\text{true}})$, if the model predicts $P(y_{\text{true}} \mid x)$ to be substantially greater than $P(y_j \mid x)$ for all $j \ne y_{\text{true}}$ (i.e., it possesses a large prediction margin as in Equation (12)), the probability of misclassifying this sample is low. LCLM, by enforcing separation between class-level probability distributions on a batch basis, contributes to enhancing this prediction margin in an average sense. When the model can correctly classify most training samples with high confidence (large margin), its error probability on the test set also tends to be lower.

4. Case Study: Educational Performance Prediction

To evaluate the practical effectiveness and demonstrate the applicability of the proposed GOMFuNet framework, we conduct a case study in the domain of educational performance prediction. This domain provides a challenging real-world scenario involving heterogeneous multimodal data where prediction reliability is valuable.

4.1. Dataset and Preprocessing

The data for this case study were sourced from two cohorts of students (472 in total: 243 in the first cohort and 229 in the second) in an English course over two academic years. The dataset includes classroom performance records derived from video analysis, midterm online exam scores (including written scores, text essays, and audio recordings), and final exam scores. The objective is to predict students’ final English grades, framed as both a binary classification task (Pass/Fail, based on a score threshold of 60) and a regression task (predicting the specific percentage score). Ethical considerations were addressed: all participating students were informed of and consented to the use of their anonymized learning data for research analysis.
Classroom performance data were extracted from video recordings using a combination of automated scripts (OpenCV) and manual annotation for metrics such as attendance, seating position (row/column, 0 if absent), proportion of time looking at the instructor, estimated note-taking time proportion, and participation count (hand-raising/speaking instances). Midterm data included structured written test scores (0–100), one-minute oral recordings (WAV format), and written assignments (TXT format). Final exam data, comprising pass/fail status (0/1) and specific scores (0–100), served as the prediction targets.
Preprocessing involved anonymization, standardization, and cleaning. Structured numerical features were normalized to the [0–1] range using Min–Max scaling. Audio (WAV) data underwent noise reduction (spectral subtraction) and segmentation using Voice Activity Detection (WebRTC VAD). Text (TXT) data were cleaned, tokenized using the WordPiece tokenizer (‘bert-base-uncased’), formatted with special tokens ([CLS], [SEP], [PAD]), and converted to fixed-length sequences with attention masks. Missing values were handled appropriately (e.g., removed if critical). The distribution of final exam scores in the combined dataset (N = 472) is shown in Table 1.
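As an illustration of the structured and text preprocessing steps, the snippet below uses scikit-learn and Hugging Face Transformers; the exact scripts are not published, so the library choices and the placeholder inputs `x_struct_raw` and `essays` are assumptions consistent with the tools named above:

```python
from sklearn.preprocessing import MinMaxScaler
from transformers import BertTokenizer

# Structured features: Min-Max scaling to [0, 1].
scaler = MinMaxScaler()
x_struct = scaler.fit_transform(x_struct_raw)   # x_struct_raw: (N, D_s) array (placeholder)

# Text: WordPiece tokenization with [CLS]/[SEP]/[PAD] tokens and attention masks.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer(essays, padding="max_length", truncation=True,
                    max_length=512, return_tensors="pt")   # essays: list of strings
input_ids, attention_mask = encoded["input_ids"], encoded["attention_mask"]
```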

4.2. Experimental Setup and Evaluation Metrics

Experiments were conducted using PyTorch 1.10.0 on a system with an NVIDIA Tesla V100 GPU. The dataset (N = 472) was split into training (80%, N = 378) and testing (20%, N = 94) sets, stratified by the pass/fail status. Five-fold cross-validation on the training set was used for hyperparameter tuning (including the LCLM weight λ , GCN layers, Transformer parameters, learning rate, batch size). The final reported results are averaged over 5 independent runs with different random seeds on the held-out test set, after training the model on the full 80% training data using the best hyperparameters found. Key hyperparameters for the GOMFuNet core are detailed in Table 2, while encoder details are given in Appendix A.
The Adam optimizer was used with an initial learning rate of 0.001, standard betas, and $\epsilon = 1 \times 10^{-8}$. A ‘ReduceLROnPlateau’ learning rate scheduler (patience = 5, factor = 0.1) and early stopping (patience = 10, based on validation loss) were employed. The batch size was 32. The LCLM weighting factor $\lambda$ was tuned via cross-validation (over candidate values [0.01, 0.1, 1.0]) and set to [Specify best lambda value here].
Performance was evaluated using standard metrics. For the classification task (Pass/Fail): Accuracy, F1-score, and Area Under the ROC Curve (ROC-AUC). For the regression task (predicting the exact score): Mean Squared Error (MSE), Mean Absolute Error (MAE), and the Coefficient of Determination ($R^2$). The metrics are defined as follows:
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
$$ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2, \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|\hat{y}_i - y_i|, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} $$
ROC-AUC is the area under the Receiver Operating Characteristic curve. $TP$, $TN$, $FP$, and $FN$ denote True Positives, True Negatives, False Positives, and False Negatives; $n$ is the number of test samples; $y_i$ is the true value; $\hat{y}_i$ is the predicted value; and $\bar{y}$ is the mean of the true values.
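These metrics correspond to standard scikit-learn calls; the array names below are placeholders for the test-set labels and predictions:

```python
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             mean_squared_error, mean_absolute_error, r2_score)

# Classification: y_cls = true 0/1 labels, y_pred = thresholded predictions,
# y_prob = predicted P(Pass).
acc = accuracy_score(y_cls, y_pred)
f1 = f1_score(y_cls, y_pred)
auc = roc_auc_score(y_cls, y_prob)

# Regression: y_reg = true percentage scores, y_hat = predicted scores.
mse = mean_squared_error(y_reg, y_hat)
mae = mean_absolute_error(y_reg, y_hat)
r2 = r2_score(y_reg, y_hat)
```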

4.3. Baselines and Comparison Methods

To assess the effectiveness of GOMFuNet, we compare it against two sets of baseline methods on the case study data:
  • Single-Modality Models: Models trained on each data modality (structured, text, audio) independently using appropriate machine learning algorithms (Decision Tree, Random Forest, SVM, NN for structured; LSTM, CNN, LSTM + CNN for audio; Naive Bayes, RNN, BERT, BERT + MHA for text). This establishes the performance level achievable without multimodal fusion.
  • State-of-the-Art Multimodal Models: We compare GOMFuNet against several existing advanced multimodal fusion techniques reported in the literature, implemented on our dataset: ATV [41], MsaCNN [42], and Transmodality [43]. This comparison highlights GOMFuNet’s performance relative to other fusion paradigms.
Additionally, we conduct ablation studies by removing key components of GOMFuNet (CLFM, LCLM) or individual modalities to understand their specific contributions.

5. Results and Analysis

This section presents the empirical results of the case study evaluating the proposed GOMFuNet framework on the task of predicting student English course performance.

5.1. Comparison with Single-Modality Baselines

Table 3 compares the performance of GOMFuNet against models trained on single data modalities. The results clearly show that GOMFuNet, leveraging the fusion of structured, text, and audio data, significantly outperforms all single-modality baselines across both classification (Accuracy, F1, AUC) and regression ($R^2$, MSE, MAE) tasks. For instance, the best single-modality model (BERT + MHA on text) achieves 76.73% accuracy and 76.31% $R^2$, whereas GOMFuNet reaches 90.17% accuracy and 88.03% $R^2$. This substantial improvement underscores the necessity of multimodal integration to capture the complex factors influencing student performance, a task effectively addressed by the GOMFuNet framework. The limitations of relying on any single data source are evident, highlighting the value captured by GOMFuNet’s fusion mechanism.

5.2. Comparison with State-of-the-Art Multimodal Baselines

Table 4 compares GOMFuNet with other advanced multimodal fusion models implemented on our case study dataset. While methods like ATV, MsaCNN, and Transmodality achieve respectable performance, GOMFuNet consistently outperforms them across all reported metrics. For example, the best competitor (ATV) achieves 87.05% accuracy and 85.88% $R^2$, compared to GOMFuNet’s 90.17% and 88.03%, respectively. This superiority suggests that GOMFuNet’s specific mathematical design—combining geometric fusion in the label space via CLFM with explicit reliability optimization via LCLM—provides tangible advantages over alternative fusion strategies in handling the complexities of this multimodal dataset.

5.3. Impact of LCLM (Loss Function Comparison)

To isolate the contribution of the novel LCLM component, we compared the performance of the GOMFuNet architecture when trained using the combined loss versus using only standard loss functions (Cross-Entropy for classification, MSE for the regression variant, and Focal Loss as an advanced baseline). Table 5 shows these results. Using LCLM yields the best performance across all metrics, surpassing even Focal Loss (which is designed for class imbalance but does not optimize output geometry). For instance, accuracy improves from 89.52% (Focal Loss) to 90.17% (LCLM), and $R^2$ improves from 87.88% to 88.03%. This directly demonstrates the empirical benefit of the LCLM’s mathematical objective of enforcing label orthogonality for enhancing prediction accuracy and reliability, compared to standard methods focusing solely on label matching.

5.4. Ablation Studies

We performed ablation studies to understand the contribution of individual modalities and core GOMFuNet modules.
Table 6 shows the impact of removing each modality. Removing any single modality leads to a performance drop, confirming that each data stream provides valuable complementary information captured by GOMFuNet. Text data appear most influential in this case study, followed by structured data, and then audio data, but the best performance is achieved only when all three are fused by GOMFuNet.
Table 7 shows the impact of removing the core CLFM or LCLM modules. Removing CLFM (implying simpler fusion before LCLM and the task loss) causes a significant drop (Accuracy 88.03%, $R^2$ 85.99%), highlighting the importance of the geometric fusion approach in CLFM. Removing LCLM (using CLFM with only the task loss, akin to the Focal Loss result in Table 5) also degrades performance (Accuracy 89.52%, $R^2$ 87.88%), confirming the contribution of the LCLM’s explicit reliability optimization. Removing both modules leads to the worst performance, emphasizing the synergistic benefit of GOMFuNet’s combined mathematical design.

5.5. Visualization of Fusion Effect

To qualitatively assess the impact of the CLFM’s geometric fusion, we visualize the high-dimensional feature representations using t-SNE before and after the CLFM module. Figure 1 shows the embeddings colored by final course status (Pass/Fail) and numerical score. Before fusion (left column), features from simple concatenation show poor separation between classes and little structure related to scores. After fusion by CLFM (right column), the embeddings exhibit much clearer separation between Pass and Fail clusters and a more discernible structure corresponding to numerical scores. This visualization provides qualitative evidence that CLFM effectively integrates multimodal information into a more mathematically meaningful and discriminative feature space compared to naive concatenation, supporting its role in the GOMFuNet framework.

5.6. Interpretability via Feature Importance Analysis

To gain insights into the decision-making process of the GOMFuNet model within the case study context, particularly regarding the contribution of readily interpretable input features, we employed the SHAP (SHapley Additive exPlanations) framework. Figure 2 presents the SHAP summary plot (beeswarm) focusing on the impact of the structured input features (derived from classroom video analysis and exam scores) on the model’s prediction towards the ‘Pass’ outcome. Each point represents a student sample from the test set, where the color indicates the feature’s value (red = high, blue = low) and the horizontal position reflects the SHAP value, quantifying the feature’s contribution to shifting the prediction probability. This analysis reveals that ‘Midterm Score’ exerts the most substantial influence, with higher scores strongly driving predictions towards ‘Pass’. Other educationally relevant factors like ‘Participation Count’ and ‘Attendance Rate’ also emerge as significant positive predictors, aligning with pedagogical intuition. Conversely, features such as ‘Seat Row’ exhibit a smaller, context-dependent impact. This interpretability analysis enhances confidence in the model by demonstrating that GOMFuNet learns meaningful relationships from the input data, leveraging key indicators of student performance in a discernible manner.

5.7. Prediction Accuracy Analysis (ROC and Scatter Plots)

Figure 3 presents the ROC curve for the classification task and the scatter plot of actual versus predicted scores for the regression task, both obtained using the full GOMFuNet model. The high ROC-AUC value of 0.92 indicates strong discriminative ability between Pass and Fail classes. The scatter plot shows predicted scores closely aligning with the actual scores along the diagonal, demonstrating good predictive accuracy for the regression task, reflected also in the high $R^2$ value of 88.03%. Some deviations exist for outlier cases, but the overall trend confirms GOMFuNet’s effectiveness in both prediction tasks within the case study.

6. Discussion

This paper introduced GOMFuNet, a novel mathematical framework for multimodal data fusion designed to enhance prediction reliability. The core innovations lie in the synergistic combination of geometric deep learning for fusion in the label space (CLFM) and an explicit optimization objective based on output vector orthogonality for confidence enhancement (LCLM). The empirical results from the educational performance prediction case study provide strong validation for the proposed mathematical concepts. GOMFuNet significantly outperformed both single-modality baselines and state-of-the-art multimodal fusion methods, demonstrating the practical benefits of its design. The ablation studies confirmed the necessity of both the CLFM’s geometric fusion approach and the LCLM’s reliability optimization for achieving peak performance. Furthermore, the analysis of confidence calibration and robustness to noise lends empirical support to the mathematical claims that GOMFuNet, particularly through LCLM, improves prediction reliability and stability beyond what is typically achieved with standard loss functions. The visualization of the fused feature space also qualitatively supports the effectiveness of CLFM’s fusion strategy.
From a mathematical perspective, GOMFuNet offers several advantages. The CLFM’s use of GCNs in the label space allows explicit modeling of inter-sample relationships at a high semantic level, potentially offering better robustness to low-level feature noise and inconsistencies compared to feature-level fusion. The LCLM introduces a novel way to regularize the output space directly, promoting class separability and confidence through a geometrically motivated orthogonality constraint. This explicit focus on the output structure complements traditional loss functions that focus on point-wise prediction accuracy. The synergy between preserving input structure (via GCN) and structuring the output space (via LCLM) appears crucial to GOMFuNet’s success.
However, the proposed framework also has limitations. Mathematically, we have provided justifications and empirical validations, but formal theoretical guarantees on error probability bounds or convergence properties of the LCLM loss remain subjects for future investigation. The LCLM loss, like many deep learning objectives, is likely non-convex, and its optimization dynamics warrant further study. The specific graph construction method involves hyperparameters (like k in kNN) that might require tuning. Computationally, GOMFuNet involves multiple components (encoders, GCNs, Transformer, LCLM calculation), potentially leading to higher computational cost compared to simpler models, although this was manageable for our case study dataset size. From an application perspective, the generalizability of GOMFuNet’s performance benefits needs to be validated across diverse multimodal datasets from different domains beyond education. The effectiveness might depend on the nature of the modalities and the underlying relationships between them.
The implications of this work extend to the broader fields of multimodal learning and trustworthy AI. It highlights the potential of integrating principles from geometric deep learning with novel optimization objectives that directly target prediction reliability. GOMFuNet provides a concrete example of how moving beyond standard accuracy-focused training can lead to more robust and confidence-aware models, which is critical for high-stakes applications.
Future work should focus on several directions. Developing theoretical analyses of LCLM’s properties (e.g., convergence, relationship to generalization bounds, impact on calibration) would strengthen its mathematical foundations. Exploring alternative geometric structures or propagation mechanisms within CLFM could yield further improvements. Investigating adaptive methods for setting the LCLM weight λ or extending the LCLM concept to other tasks like structured prediction or unsupervised learning presents interesting avenues. Applying and adapting GOMFuNet to other challenging multimodal problems (e.g., medical diagnosis, affective computing, robotics) will be crucial for assessing its broader utility and identifying necessary modifications. While the educational case study demonstrated GOMFuNet’s potential, further refinement for specific educational interventions, perhaps by incorporating domain knowledge or enhancing interpretability (e.g., using techniques like SHAP, as explored in Section 5.6), remains a secondary but valuable direction.

Empirical Validation of Prediction Reliability via Calibration Analysis

A critical aspect of trustworthy prediction systems lies in the statistical consistency between the model’s expressed confidence and its empirical accuracy. To rigorously evaluate the mathematical property of enhanced reliability fostered by the LCLM component, we analyze the confidence calibration of GOMFuNet’s probabilistic outputs. A model $f$ is considered perfectly calibrated if, for any predicted probability value $p \in [0, 1]$, the conditional expectation of the true label $Y$ given the predicted probability $f(X) = p$ equals $p$, i.e., $\mathbb{E}[Y \mid f(X) = p] = p$. Deviations from this ideal condition indicate miscalibration, potentially leading to unreliable decision-making based on the model’s confidence estimates.
We quantify calibration performance using the Expected Calibration Error (ECE), a standard metric defined as the weighted average of the absolute difference between accuracy and confidence across $K$ pre-defined probability bins $B_k$:
$$ \text{ECE} = \sum_{k=1}^{K} \frac{|B_k|}{N} \left| \operatorname{acc}(B_k) - \operatorname{conf}(B_k) \right| $$
where $N$ is the total number of samples, $|B_k|$ is the number of samples whose predicted confidence falls into bin $B_k$, $\operatorname{acc}(B_k)$ is the empirical accuracy of samples in bin $B_k$, and $\operatorname{conf}(B_k)$ is their average predicted confidence. A lower ECE signifies superior calibration. Additionally, we utilize reliability diagrams, which plot $\operatorname{acc}(B_k)$ against $\operatorname{conf}(B_k)$ for each bin $k$; perfect calibration corresponds to points lying on the diagonal line $y = x$.
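A straightforward implementation of this binned estimator follows; the equal-width bin edges and tensor layout are illustrative:

```python
import torch

def expected_calibration_error(conf: torch.Tensor, correct: torch.Tensor,
                               n_bins: int = 10) -> float:
    """ECE with K equal-width confidence bins.

    conf:    (N,) predicted confidence, max_c P(c | x_i).
    correct: (N,) 1.0 if the top-1 prediction was right, else 0.0.
    """
    ece = 0.0
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            acc_bin = correct[in_bin].mean()     # acc(B_k)
            conf_bin = conf[in_bin].mean()       # conf(B_k)
            ece += in_bin.float().mean() * (acc_bin - conf_bin).abs()  # |B_k|/N * |...|
    return float(ece)
```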
Figure 4 presents the reliability diagrams comparing the GOMFuNet architecture trained with the standard task loss ($\mathcal{L}_{\text{Task}}$) alone versus the proposed configuration utilizing the combined loss ($\mathcal{L}_{\text{Task}} + \lambda \mathcal{L}_{\text{LCLM}}$). The corresponding ECE values, computed using $K = 10$ bins, are reported in Table 8. A marked improvement in calibration is observed when incorporating the LCLM objective. GOMFuNet trained with LCLM achieves a significantly lower ECE of 2.83%, compared to an ECE of 6.15% for the baseline model trained using only the standard cross-entropy loss. Furthermore, the reliability diagram associated with the LCLM-enhanced GOMFuNet exhibits points clustering more closely around the diagonal of perfect calibration, indicating a stronger alignment between predicted confidence levels and empirical probabilities of correctness across the spectrum of confidence values.
This empirical evidence provides strong quantitative support for the hypothesis that the LCLM loss, by enforcing orthogonality among class prediction vectors, directly contributes to producing mathematically more meaningful and statistically reliable confidence estimates. The substantial reduction in ECE validates LCLM’s effectiveness in optimizing the reliability structure of the predictions, moving beyond mere accuracy improvements towards generating more trustworthy probabilistic outputs.

7. Conclusions

This paper addressed the mathematical challenges of fusing heterogeneous multimodal data while ensuring the reliability of predictions. We proposed the Geometric Orthogonal Multimodal Fusion Network (GOMFuNet), a novel framework characterized by two key mathematical innovations: (1) geometric fusion in a high-level label space using Graph Convolutional Networks (CLFM) to preserve topological structure and handle inconsistencies, and (2) explicit prediction reliability enhancement via a novel Label Confidence Learning Module (LCLM) that enforces orthogonality among output class probability vectors.
Through mathematical justification and rigorous empirical evaluation, including confidence calibration, robustness analysis, and a detailed case study on educational performance prediction, we demonstrated that GOMFuNet provides a principled and effective approach to multimodal learning. GOMFuNet significantly outperformed single-modality baselines and existing state-of-the-art multimodal fusion techniques in terms of both prediction accuracy and reliability metrics. The ablation studies confirmed the synergistic contribution of both the CLFM and LCLM modules. Our findings highlight the value of integrating geometric deep learning principles with direct optimization of output space structure for building more trustworthy AI systems. GOMFuNet offers a promising direction for future research in reliable multimodal data analysis and its application to complex real-world problems. Future mathematical work includes deriving theoretical guarantees for LCLM and exploring extensions of the GOMFuNet framework.

Author Contributions

Conceptualization, Y.G.; methodology, Y.G. and R.Z.; investigation, Y.G.; resources, R.Z.; data curation, Y.G.; formal analysis, Y.G.; writing—original draft preparation, Y.G.; writing—review and editing, Y.G. and R.Z.; visualization, Y.G.; supervision, R.Z.; project administration, Y.G. All authors have read and agreed to the published version of the manuscript.

Funding

No funding received.

Data Availability Statement

Data related to the case study may be made available on reasonable request to the corresponding authors, subject to privacy and ethical considerations.

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful and constructive comments, which significantly contributed to the improvement and mathematical rigor of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Detailed Feature Encoder Architectures

This appendix provides supplementary details regarding the architectures of the modality-specific feature encoders employed within the GOMFuNet framework. These encoders serve the crucial pre-processing function of transforming raw input data from each modality into fixed-dimensional vector representations suitable for subsequent fusion by the core CLFM module. Standard, well-established architectures were selected for their proven efficacy on the respective data types.

Appendix A.1. Deep Neural Network (DNN) for Structured Data

To process the structured data modality, encompassing features such as student attendance records, seating positions, engagement metrics derived from video analysis, and numerical exam scores, a standard Deep Neural Network (DNN) was utilized. DNNs are well suited for capturing complex, non-linear patterns and feature interactions often present within tabular or structured datasets. The implemented DNN consists of an input layer accepting the D s -dimensional feature vector, followed by a sequence of fully connected hidden layers employing the Rectified Linear Unit (ReLU) activation function to introduce non-linearity. Dropout regularization was applied after each hidden layer to mitigate overfitting. The network culminates in an output layer producing the final 32-dimensional feature vector used as input to the CLFM. The specific layer configurations and parameters are summarized in Table A1.
Table A1. Architectural details of the DNN encoder for structured data.
| Parameter | Value |
| --- | --- |
| Input Dimension | $D_s$ |
| Hidden Layer 1 Units | 128 |
| Hidden Layer 1 Activation | ReLU |
| Hidden Layer 1 Dropout Rate | 0.3 |
| Hidden Layer 2 Units | 64 |
| Hidden Layer 2 Activation | ReLU |
| Hidden Layer 2 Dropout Rate | 0.3 |
| Hidden Layer 3 Units | 32 |
| Hidden Layer 3 Activation | ReLU |
| Hidden Layer 3 Dropout Rate | 0.3 |
| Output Dimension ($d_s$) | 32 |
| Output Activation | Linear |
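A minimal PyTorch rendering of the Table A1 configuration, assuming the layers stack exactly as listed:

```python
import torch.nn as nn

def make_structured_encoder(d_s: int) -> nn.Sequential:
    """DNN encoder from Table A1: D_s -> 128 -> 64 -> 32 with ReLU + dropout,
    followed by a linear 32-dimensional output layer."""
    return nn.Sequential(
        nn.Linear(d_s, 128), nn.ReLU(), nn.Dropout(0.3),
        nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.3),
        nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.3),
        nn.Linear(32, 32),  # linear output, d_s_out = 32
    )
```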

Appendix A.2. BERT with Multi-Head Attention (MHA) for Text Data

For extracting deep semantic features from the textual data modality (e.g., student essays), we employed the Bidirectional Encoder Representations from Transformers (BERT) model [44], specifically the ‘bert-base-uncased’ pre-trained variant, which was subsequently fine-tuned on our task-specific data. BERT’s Transformer-based architecture, particularly its bidirectional self-attention mechanism, allows it to capture rich contextual information at both word and sentence levels, significantly outperforming traditional context-free methods. The architecture, illustrated in Figure A1, processes tokenized input sequences. To further refine the representations obtained from BERT, a Multi-Head Attention (MHA) module was added. MHA allows the model to jointly attend to information from different representation subspaces at different positions, enhancing the focus on the most relevant textual segments from multiple perspectives. The output from MHA is processed through feed-forward layers and layer normalization, resulting in the final 768-dimensional semantic embedding, typically derived from the processed [CLS] token representation, which serves as input to the CLFM. Key parameters are detailed in Table A2.
Figure A1. Architecture of the BERT model with MHA used for text feature extraction.
Table A2. Architectural details of the text encoder (BERT + MHA).
ParameterValue
Base ModelBERT (‘bert-base-uncased’, fine-tuned)
Input Max Sequence Length512 tokens
BERT Output Dimension (CLS token)768
MHA Number of Heads4
MHA Dimension768
Feed-Forward Layers2 (with ReLU)
Layer NormalizationApplied
Final Output Dimension ( d t )768
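
A corresponding sketch for the text pathway is shown below, assuming the Hugging Face `transformers` implementation of BERT. Table A2 fixes the dimensions and layer counts; the residual connection around the MHA block and the exact placement of the feed-forward and normalization layers are assumptions of this sketch.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class TextEncoder(nn.Module):
    """Sketch of the BERT + MHA text encoder in Table A2 (wiring is illustrative)."""

    def __init__(self, dim: int = 768, num_heads: int = 4):
        super().__init__()
        # Pre-trained backbone, fine-tuned jointly with the rest of the network.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Multi-head self-attention refining BERT's token representations.
        self.mha = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads,
                                         batch_first=True)
        # Two feed-forward layers with ReLU, then layer normalization (Table A2).
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        tokens = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Padding positions are excluded from the attention computation.
        refined, _ = self.mha(tokens, tokens, tokens,
                              key_padding_mask=(attention_mask == 0))
        out = self.norm(tokens + self.ff(refined))
        # The processed [CLS] position serves as the 768-d embedding (d_t = 768).
        return out[:, 0]
```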

Appendix A.3. CNN and Bi-LSTM Combination for Audio Data

Extracting meaningful features from the audio data modality (e.g., one-minute oral recordings) requires capturing both local spectro-temporal patterns and long-range temporal dependencies. To achieve this, we designed a hybrid deep learning model combining Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (Bi-LSTM) networks [45], operating on log-Mel spectrograms derived from the audio signals. The CNN component, with its convolutional filters and pooling layers, excels at identifying localized frequency-specific features (like phonetic characteristics). The Bi-LSTM component processes the sequence of spectral coefficients over time, effectively modeling longer temporal patterns such as rhythm and intonation contours; the bidirectional nature allows consideration of both past and future context. An attention mechanism (Luong-style general attention) was applied to the Bi-LSTM outputs to allow the model to dynamically weigh the importance of different audio segments. Features extracted from both the CNN and the attention-weighted Bi-LSTM pathways are concatenated and passed through a final dense layer to produce the consolidated 128-dimensional audio feature vector for the CLFM. The overall structure is depicted in Figure A2, and the parameterization is summarized in Table A3.
Figure A2. Combined CNN and Bi-LSTM architecture used for audio feature extraction.
Table A3. Architectural details of the audio encoder (CNN + Bi-LSTM).

Parameter | Value
Input Representation | Log-Mel Spectrogram
CNN Channel
Conv Layer 1 Filters | 32
Conv Layer 1 Kernel Size | 3 × 3
Conv Layer 1 Activation | ReLU
Pooling 1 | MaxPool (2 × 2)
Conv Layer 2 Filters | 64
Conv Layer 2 Kernel Size | 3 × 3
Conv Layer 2 Activation | ReLU
Pooling 2 | MaxPool (2 × 2)
Bi-LSTM Channel
Bi-LSTM Layers | 2
Bi-LSTM Hidden Units (per direction) | 128
Bi-LSTM Dropout Rate | 0.4
Attention Mechanism | Luong (General)
Output Stage
Concatenation | CNN Features + Bi-LSTM Attention Output
Final Dense Layer Units | 128
Final Dense Activation | ReLU
Final Output Dimension (d_a) | 128
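
The audio pathway can be sketched as follows. Filter counts, kernel sizes, Bi-LSTM sizes, and output dimensions follow Table A3; the number of Mel bands, the convolution padding, the global pooling that collapses the CNN branch to a vector, and the use of the final Bi-LSTM hidden state as the attention query are assumptions made for this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Sketch of the CNN + Bi-LSTM audio encoder in Table A3."""

    def __init__(self, n_mels: int = 64):  # number of Mel bands is illustrative
        super().__init__()
        # CNN channel: two conv/pool blocks over the log-Mel spectrogram,
        # collapsed to a 64-d vector (global average pooling is an assumed choice).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        # Bi-LSTM channel: 2 layers, 128 hidden units per direction, dropout 0.4.
        self.bilstm = nn.LSTM(n_mels, 128, num_layers=2, dropout=0.4,
                              bidirectional=True, batch_first=True)
        # Luong "general" attention: score(h_t, q) = h_t^T W q, with the final
        # hidden state as the query (an assumed but common choice).
        self.W = nn.Linear(256, 256, bias=False)
        # Output stage: concatenate both channels, then a 128-unit dense layer.
        self.out = nn.Sequential(nn.Linear(64 + 256, 128), nn.ReLU())

    def forward(self, spec: torch.Tensor) -> torch.Tensor:  # spec: (B, T, n_mels)
        cnn_feat = self.cnn(spec.unsqueeze(1)).flatten(1)        # (B, 64)
        h, _ = self.bilstm(spec)                                 # (B, T, 256)
        query = h[:, -1]                                         # (B, 256)
        scores = torch.bmm(self.W(h), query.unsqueeze(2))        # (B, T, 1)
        context = (F.softmax(scores, dim=1) * h).sum(dim=1)      # (B, 256)
        return self.out(torch.cat([cnn_feat, context], dim=1))   # (B, 128), d_a = 128
```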

References

1. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443.
2. Bodaghi, M.; Hosseini, M.; Gottumukkala, R. A Multimodal Intermediate Fusion Network with Manifold Learning for Stress Detection. arXiv 2024, arXiv:2403.08077.
3. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1050–1059.
4. Bashir, A.; Bashir, S.; Rana, K.; Lambert, P.; Vernallis, A. Post-COVID-19 adaptations; the shifts towards online learning, hybrid course delivery and the implications for biosciences courses in the higher education setting. In Proceedings of the Frontiers in Education; Frontiers Media SA: Lausanne, Switzerland, 2021; Volume 6, p. 711619.
5. Villegas-Ch, W.; Román-Cañizares, M.; Palacios-Pacheco, X. Improvement of an online education model with the integration of machine learning and data analysis in an LMS. Appl. Sci. 2020, 10, 5371.
6. Gaonkar, A.; Chukkapalli, Y.; Raman, P.J.; Srikanth, S.; Gurugopinath, S. A comprehensive survey on multimodal data representation and information fusion algorithms. In Proceedings of the 2021 International Conference on Intelligent Technologies (CONIT), Hubli, India, 25–27 June 2021; pp. 1–8.
7. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009.
8. Wei, H.; Jafarian, A.; Zeidman, P.; Litvak, V.; Razi, A.; Hu, D.; Friston, K.J. Bayesian fusion and multimodal DCM for EEG and fMRI. arXiv 2019, arXiv:1906.07354.
9. Tian, Y.; Ma, Y. Information-theoretic approaches in multimodal learning: A survey. IEEE Access 2019, 7, 170181–170193.
10. Zadeh, A.; Liang, P.P.; Morency, L.P. Foundations of multimodal co-learning. Inf. Fusion 2020, 64, 188–193.
11. Khan, A.U.; Mazaheri, A.; da Vitoria Lobo, N.; Shah, M. MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering. arXiv 2020, arXiv:2010.14095.
12. Jiao, T.; Guo, C.; Feng, X.; Chen, Y.; Song, J. A Comprehensive Survey on Deep Learning Multi-Modal Fusion: Methods, Technologies and Applications. Comput. Mater. Contin. 2024, 80, 1–35.
13. Bronstein, M.M.; Bruna, J.; LeCun, Y.; Szlam, A.; Vandergheynst, P. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Process. Mag. 2017, 34, 18–42.
14. Rahate, A.; Walambe, R.; Ramanna, S.; Kotecha, K. Multimodal co-learning: Challenges, applications with datasets, recent advances and future directions. Inf. Fusion 2022, 81, 203–239.
15. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1321–1330.
16. Diestel, R. Graph Theory, 5th ed.; Springer: New York, NY, USA, 2017.
17. Namoun, A.; Alshanqiti, A. Predicting student performance using data mining and learning analytics techniques: A systematic literature review. Appl. Sci. 2020, 11, 237.
18. Emerson, A.; Cloude, E.B.; Azevedo, R.; Lester, J. Multimodal learning analytics for game-based learning. Br. J. Educ. Technol. 2020, 51, 1505–1526.
19. Sharma, K.; Giannakos, M. Multimodal data capabilities for learning: What can multimodal data tell us about learning? Br. J. Educ. Technol. 2020, 51, 1450–1484.
20. Hooda, M.; Rana, C.; Dahiya, O.; Rizwan, A.; Hossain, M.S. Artificial intelligence for assessment and feedback to enhance student success in higher education. Math. Probl. Eng. 2022, 2022, 5215722.
21. Khan, I.; Ahmad, A.R.; Jabeur, N.; Mahdi, M.N. An artificial intelligence approach to monitor student performance and devise preventive measures. Smart Learn. Environ. 2021, 8, 1–18.
22. Mai, S.; Hu, H.; Xing, S. Modality to modality translation: An adversarial representation learning and graph fusion network for multimodal fusion. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 164–172.
23. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Washington, DC, USA, 28 June–2 July 2011; pp. 689–696.
24. GeeksforGeeks. Early Fusion vs. Late Fusion in Multimodal Data Processing; GeeksforGeeks: Noida, India, 2025.
25. Li, L.; Liu, M.; Ma, L.; Han, L. Cross-Modal feature description for remote sensing image matching. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102964.
26. Liu, Z.; Shen, Y.; Jin, X.; Liu, Y.; Wang, Y.; Zhang, Y. Dual Low-Rank Multimodal Fusion. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings; Association for Computational Linguistics: Online, 16–20 November 2020; pp. 325–334.
27. Zhang, W.; Li, M.; Chen, X.; Wang, L. Multimodal Graph Neural Network for Recommendation with Dynamic De-redundancy and Modality-Guided Feature De-noisy. arXiv 2024, arXiv:2411.01561.
28. Kwon, Y.; Won, J.K.; Paik, M.W.; Lee, S.W. Uncertainty quantification using Bayesian neural networks in classification: Application to biomedical image segmentation. Comput. Stat. Data Anal. 2018, 142, 106816.
29. Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. Adv. Neural Inf. Process. Syst. 2017, 30, 6405–6416.
30. Gao, R.; Wang, L.; Zhang, H. A Comparative Study of Confidence Calibration in Deep Learning. arXiv 2022, arXiv:2206.08833.
31. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
32. Han, Z.; Wu, J.; Huang, C.; Huang, Q.; Zhao, M. A review on sentiment discovery and analysis of educational big-data. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1328.
33. Alshanqiti, A.; Namoun, A. Predicting student performance and its influential factors using hybrid regression and multi-label classification. IEEE Access 2020, 8, 203827–203844.
34. Huang, C.; Zhou, J.; Chen, J.; Yang, J.; Clawson, K.; Peng, Y. A feature weighted support vector machine and artificial neural network algorithm for academic course performance prediction. Neural Comput. Appl. 2023, 35, 11517–11529.
35. Kabathova, J.; Drlik, M. Towards predicting student's dropout in university courses using different machine learning techniques. Appl. Sci. 2021, 11, 3130.
36. López Zambrano, J.; Lara Torralbo, J.A.; Romero Morales, C. Early prediction of student learning performance through data mining: A systematic review. Psicothema 2021, 33, 456.
37. Wan, H.; Liu, K.; Yu, Q.; Gao, X. Pedagogical intervention practices: Improving learning engagement based on early prediction. IEEE Trans. Learn. Technol. 2019, 12, 278–289.
38. Lu, O.H.; Huang, A.Y.; Huang, J.C.; Lin, A.J.; Ogata, H.; Yang, S.J. Applying learning analytics for the early prediction of students' academic performance in blended learning. J. Educ. Technol. Soc. 2018, 21, 220–232.
39. Buschetto Macarini, L.A.; Cechinel, C.; Batista Machado, M.F.; Faria Culmant Ramos, V.; Munoz, R. Predicting students success in blended learning—evaluating different interactions inside learning management systems. Appl. Sci. 2019, 9, 5523.
40. Joshi, G.; Walambe, R.; Kotecha, K. A review on explainability in multimodal deep neural nets. IEEE Access 2021, 9, 59800–59821.
41. Hosseini, S.S.; Yamaghani, M.R.; Poorzaker Arabani, S. Multimodal modelling of human emotion using sound, image and text fusion. Signal Image Video Process. 2024, 18, 71–79.
42. Zhang, Y.; An, R.; Liu, S.; Cui, J.; Shang, X. Predicting and understanding student learning performance using multi-source sparse attention convolutional neural networks. IEEE Trans. Big Data 2021, 9, 118–132.
43. Wang, Z.; Wan, Z.; Wan, X. Transmodality: An end2end fusion method with transformer for multimodal sentiment analysis. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 2514–2520.
44. Koroteev, M.V. BERT: A review of applications in natural language processing and understanding. arXiv 2021, arXiv:2103.11943.
45. Shashidhar, R.; Patilkulkarni, S.; Puneeth, S. Combining audio and visual speech recognition using LSTM and deep convolutional neural network. Int. J. Inf. Technol. 2022, 14, 3425–3436.
Figure 1. t-SNE visualization of the student feature space before (left column) and after (right column) applying the CLFM fusion module within GOMFuNet. Top row: colored by final course status (Pass/Fail). Bottom row: colored by final numerical score (darker = lower, lighter = higher). CLFM yields better separation and structure.
Figure 2. SHAP summary plot (beeswarm) illustrating the impact of structured features on GOMFuNet's prediction of course success (Pass = 1) for each student sample in the test set. Features are ranked by overall importance (mean absolute SHAP value).
Figure 3. (Left): ROC curve of GOMFuNet's classification performance (AUC = 0.92). (Right): Scatter plot of actual vs. predicted scores from GOMFuNet's regression output.
Figure 4. Reliability diagrams comparing GOMFuNet trained with standard loss only (Baseline) and with the combined loss including LCLM.
Table 1. Distribution of students' final exam scores in the case study dataset (N = 472).

Classification | Score Range | Number of Students
Fail | <60 | 56
Pass | 60–69 | 66
Pass | 70–79 | 110
Pass | 80–89 | 157
Pass | 90–99 | 75
Pass | 100 | 8
Total | | 472
Table 2. Key architectural details and hyperparameters of the GOMFuNet core modules.

Component | Parameter | Value
Feature Encoders (Output Dims) | Structured (d_s) | 32
Feature Encoders (Output Dims) | Text (d_t) | 768
Feature Encoders (Output Dims) | Audio (d_a) | 128
CLFM: Graph Construction | Method | kNN + Gaussian Kernel
CLFM: GCN | Layers (L_g) | 2
CLFM: GCN | Hidden Dimension | 128
CLFM: GCN | Output Dimension (per modality, d_m) | 64
CLFM: GCN | Activation | ReLU
CLFM: Transformer Aggregator | Layers | 2
CLFM: Transformer Aggregator | Heads | 4
CLFM: Transformer Aggregator | Model Dimension (d_m) | 192 (3 × 64)
CLFM: Transformer Aggregator | Feedforward Inner Dimension | 256
CLFM: Transformer Aggregator | Dropout | 0.2
CLFM: Transformer Aggregator | Output Dimension (d_fused) | 192
LCLM | Weighting Factor (λ) | [Specify best lambda value]
LCLM | Temperature (τ_L) | 1.0
Training | Optimizer | Adam
Training | Initial Learning Rate | 0.001
Training | Batch Size | 32
Training | Epochs | 35 (with Early Stopping)
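
Table 2 states that the CLFM graph is constructed with kNN plus a Gaussian kernel. The NumPy sketch below illustrates one standard form of this construction; the neighborhood size `k` and bandwidth `sigma` are placeholders rather than values restated from the main text, and the symmetric normalization follows the usual GCN convention rather than a detail fixed by Table 2.

```python
import numpy as np

def build_knn_gaussian_adjacency(X: np.ndarray, k: int = 10,
                                 sigma: float = 1.0) -> np.ndarray:
    """Sketch of a kNN + Gaussian-kernel graph over sample features X (n, d)."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between all samples.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbors of sample i (index 0 is i itself, so skip it).
        nbrs = np.argsort(sq[i])[1:k + 1]
        # Gaussian-kernel edge weights to those neighbors.
        A[i, nbrs] = np.exp(-sq[i, nbrs] / (2.0 * sigma ** 2))
    # Symmetrize so the GCN operates on an undirected graph.
    A = np.maximum(A, A.T)
    # Standard GCN normalization: D^{-1/2} (A + I) D^{-1/2}.
    A_hat = A + np.eye(n)
    d = A_hat.sum(1)
    return A_hat / np.sqrt(np.outer(d, d))
```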
Table 3. Performance comparison of single-modality models and the proposed GOMFuNet on classification and regression tasks (case study dataset).

Model | Modality | Accuracy | F1-Score | ROC-AUC | MSE | MAE | R²
Decision Tree | Structured | 66.05% | 0.63 | 0.71 | 0.0348 | 0.1535 | 62.95%
Random Forest | Structured | 69.12% | 0.65 | 0.74 | 0.0331 | 0.1460 | 66.81%
Support Vector Machine | Structured | 67.98% | 0.64 | 0.73 | 0.0340 | 0.1501 | 64.92%
Neural Network | Structured | 71.15% | 0.67 | 0.76 | 0.0319 | 0.1405 | 69.33%
LSTM | Audio | 60.33% | 0.58 | 0.66 | 0.0405 | 0.1760 | 53.81%
CNN | Audio | 62.01% | 0.60 | 0.68 | 0.0391 | 0.1695 | 56.24%
LSTM + CNN | Audio | 63.42% | 0.61 | 0.69 | 0.0378 | 0.1632 | 58.44%
Naive Bayes | Text | 73.02% | 0.70 | 0.79 | 0.0308 | 0.1340 | 70.95%
BERT | Text | 75.88% | 0.73 | 0.81 | 0.0289 | 0.1265 | 74.89%
RNN | Text | 73.91% | 0.71 | 0.80 | 0.0299 | 0.1301 | 72.98%
BERT + MHA | Text | 76.73% | 0.74 | 0.83 | 0.0280 | 0.1240 | 76.31%
GOMFuNet (Proposed) | Multimodal | 90.17% | 0.88 | 0.92 | 0.0205 | 0.0971 | 88.03%
Table 4. Performance comparison of GOMFuNet with other advanced multimodal models (case study dataset).

Model | Accuracy | F1-Score | ROC-AUC | MSE | MAE | R²
ATV [41] | 87.05% | 0.84 | 0.89 | 0.0245 | 0.1055 | 85.88%
MsaCNN [42] | 86.31% | 0.83 | 0.88 | 0.0256 | 0.1091 | 84.41%
Transmodality [43] | 85.19% | 0.82 | 0.87 | 0.0268 | 0.1130 | 82.74%
GOMFuNet (Proposed) | 90.17% | 0.88 | 0.92 | 0.0205 | 0.0971 | 88.03%
Table 5. Performance comparison of GOMFuNet using different loss function configurations.

Loss Configuration (Within GOMFuNet) | Accuracy | F1-Score | ROC-AUC | MSE | MAE | R²
CLFM + Cross-Entropy Loss | 87.88% | 0.85 | 0.90 | 0.0229 | 0.1028 | 86.61%
CLFM + Mean Squared Error Loss | 89.01% | 0.86 | 0.91 | 0.0220 | 0.1005 | 87.32%
CLFM + Focal Loss [31] | 89.52% | 0.87 | 0.91 | 0.0214 | 0.0989 | 87.88%
CLFM + LCLM + Task Loss (Proposed) | 90.17% | 0.88 | 0.92 | 0.0205 | 0.0971 | 88.03%
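
The exact LCLM objective is defined in the main text. Purely as a schematic of the configuration compared in Table 5 (a task loss combined with an orthogonality penalty on predicted class probability vectors, weighted by λ and temperature-scaled by τ_L as in Table 2), one might write the following; the specific penalty form and its restriction to pairs of samples with different labels are assumptions of this sketch, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def orthogonality_regularized_loss(logits: torch.Tensor, targets: torch.Tensor,
                                   lam: float = 0.1, tau: float = 1.0) -> torch.Tensor:
    """Schematic task loss + orthogonality penalty (not the exact LCLM).

    `lam` (λ) is a placeholder here; Table 2 does not restate the tuned value.
    """
    task = F.cross_entropy(logits, targets)
    # Temperature-scaled class probability vectors, one per sample.
    p = F.softmax(logits / tau, dim=1)                     # (B, C)
    # Pairwise inner products between samples' probability vectors.
    gram = p @ p.t()                                       # (B, B)
    # Penalize overlap between samples of different classes, pushing their
    # probability vectors toward orthogonality (confident, low-overlap outputs).
    diff = (targets.unsqueeze(0) != targets.unsqueeze(1)).float()
    ortho = (gram * diff).sum() / diff.sum().clamp(min=1.0)
    return task + lam * ortho
```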
Table 6. Modality ablation study on GOMFuNet performance (✓ = modality included, × = modality excluded).

Structured | Text | Audio | Accuracy | F1-Score | ROC-AUC | MSE | MAE | R²
✓ | ✓ | ✓ | 90.17% | 0.88 | 0.92 | 0.0205 | 0.0971 | 88.03%
× | ✓ | ✓ | 87.41% | 0.85 | 0.89 | 0.0242 | 0.1050 | 85.11%
✓ | × | ✓ | 86.05% | 0.83 | 0.87 | 0.0260 | 0.1105 | 83.49%
✓ | ✓ | × | 88.12% | 0.86 | 0.90 | 0.0235 | 0.1029 | 86.15%
Table 7. GOMFuNet module ablation study on overall model performance (✓ = module included, × = module excluded).

CLFM | LCLM | Accuracy | F1-Score | ROC-AUC | MSE | MAE | R²
✓ | ✓ | 90.17% | 0.88 | 0.92 | 0.0205 | 0.0971 | 88.03%
× | ✓ | 88.03% | 0.86 | 0.90 | 0.0236 | 0.1030 | 85.99%
✓ | × | 89.52% | 0.87 | 0.91 | 0.0214 | 0.0989 | 87.88%
× | × | 85.39% | 0.83 | 0.88 | 0.0271 | 0.1118 | 82.15%
Table 8. Expected Calibration Error (ECE) comparison.

Model Configuration | ECE (%)
GOMFuNet (CLFM + Standard Loss) | 6.15
GOMFuNet (CLFM + LCLM + Standard Loss) | 2.83