GOMFuNet: A Geometric Orthogonal Multimodal Fusion Network for Enhanced Prediction Reliability

Guo, Yi; Zhong, Rui

doi:10.3390/math13111791

Open AccessArticle

GOMFuNet: A Geometric Orthogonal Multimodal Fusion Network for Enhanced Prediction Reliability

by

Yi Guo

^1,* and

Rui Zhong

^2,*

¹

School of Economics, Beijing Technology and Business University, Beijing 102401, China

²

Information Initiative Center, Hokkaido University, Sapporo 060-0808, Japan

^*

Authors to whom correspondence should be addressed.

Mathematics 2025, 13(11), 1791; https://doi.org/10.3390/math13111791

Submission received: 25 April 2025 / Revised: 7 May 2025 / Accepted: 21 May 2025 / Published: 27 May 2025

(This article belongs to the Special Issue Deep Neural Network: Theory, Algorithms and Applications)

Download

Browse Figures

Versions Notes

Abstract

:

Integrating information from heterogeneous data sources poses significant mathematical challenges, particularly in ensuring the reliability and reducing the uncertainty of predictive models. This paper introduces the Geometric Orthogonal Multimodal Fusion Network (GOMFuNet), a novel mathematical framework designed to address these challenges. GOMFuNet synergistically combines two core mathematical principles: (1) It utilizes geometric deep learning, specifically Graph Convolutional Networks (GCNs), within its Cross-Modal Label Fusion Module (CLFM) to perform fusion in a high-level semantic label space, thereby preserving inter-sample topological relationships and enhancing robustness to inconsistencies. (2) It incorporates a novel Label Confidence Learning Module (LCLM) derived from optimization theory, which explicitly enhances prediction reliability by enforcing mathematical orthogonality among the predicted class probability vectors, directly minimizing output uncertainty. We demonstrate GOMFuNet’s effectiveness through comprehensive experiments, including confidence calibration analysis and robustness tests, and validate its practical utility via a case study on educational performance prediction using structured, textual, and audio data. Results show GOMFuNet achieves significantly improved performance (90.17% classification accuracy, 88.03% R² regression) and enhanced reliability compared to baseline and state-of-the-art multimodal methods, validating its potential as a robust framework for reliable multimodal learning.

Keywords:

multimodal integration; uncertainty quantification; learning resource optimization; crossmodal mapping; confidence calibration; educational sustainability

MSC:

68T07

1. Introduction

Integrating information from heterogeneous data sources, known as multimodal learning, presents fundamental mathematical and computational challenges, particularly in ensuring the reliability and reducing the uncertainty of predictive models built upon such diverse inputs [1]. Effectively fusing data with disparate structures (e.g., geometric, sequential, tabular) while mitigating inconsistencies and noise inherent in real-world observations remains an open problem [2]. Furthermore, quantifying and actively minimizing the uncertainty associated with predictions derived from complex multimodal models is crucial for trustworthy decision-making in critical applications [3]. Traditional methods for analyzing complex phenomena often rely on single data sources or simple aggregation techniques, which struggle to capture the multifaceted nature of many real-world systems, leading to predictions with limited accuracy and applicability [4]. The advent of artificial intelligence (AI) has spurred efforts to improve prediction precision [5], but fully harnessing the rich information in diverse data streams requires sophisticated mathematical frameworks.

Existing mathematical approaches to multimodal fusion often fall into categories like early (feature-level) fusion, late (decision-level) fusion, or intermediate strategies employing techniques such as attention mechanisms or tensor factorization [1,6]. Probabilistic graphical models [7] and Bayesian methods [8] offer principled ways to model dependencies and uncertainty but can face scalability issues or restrictive distributional assumptions. Information-theoretic approaches aim to maximize mutual information or minimize redundancy across modalities [9]. While deep learning methods, particularly Transformer-based architectures [10,11], have shown promise in capturing crossmodal correlations, many standard techniques face limitations from a mathematical perspective [12].

Specifically, many existing methods struggle with several key mathematical aspects: (1) Geometric Structure Preservation: Simple concatenation or standard attention often fails to explicitly preserve the intrinsic geometric structure or topological relationships within the data manifold of each modality, potentially losing valuable information [13]. (2) Handling Inconsistency: Fusing potentially conflicting signals from different modalities in a mathematically robust manner remains difficult, often leading to suboptimal or unstable results [14]. (3) Explicit Reliability Optimization: Most standard loss functions (e.g., cross-entropy) primarily focus on matching predictions to ground truth labels, rather than directly optimizing the mathematical properties of the output distribution itself to enhance prediction confidence and separability [15]. This lack of explicit reliability control can limit the trustworthiness of predictions, especially in the presence of noise or ambiguity. Foundational mathematical concepts from graph theory [16], geometric deep learning [13], and statistical reliability provide tools to address these gaps.

To address these limitations, this paper introduces the Geometric Orthogonal Multimodal Fusion Network (GOMFuNet), a novel mathematical framework designed for reliable multimodal data integration and prediction. GOMFuNet uniquely combines two core mathematical concepts: First, it leverages principles from geometric deep learning, employing a Graph Convolutional Network (GCN) within its Crossmodal Label Fusion Module (CLFM) to perform fusion explicitly in a high-level semantic label space. This approach mathematically models and preserves the topological relationships between samples based on their modality-specific representations, facilitating robust integration even with inconsistent signals. Second, GOMFuNet incorporates a novel Label Confidence Learning Module (LCLM) derived from optimization principles. LCLM introduces an explicit mathematical objective that enforces orthogonality among the predicted class probability vectors. By minimizing the correlation between predictions for different classes, LCLM directly enhances the separability and confidence of the model’s output, thus reducing prediction uncertainty.

The primary mathematical contributions of this work are the following: (1) the design and formalization of the GOMFuNet architecture, integrating geometric fusion in label space with output orthogonality optimization; (2) the derivation and mathematical justification of the LCLM loss function as a novel mechanism for explicit uncertainty reduction in multimodal classification/regression; and (3) rigorous empirical validation demonstrating the effectiveness of GOMFuNet in enhancing prediction accuracy and reliability compared to existing methods.

To demonstrate the practical efficacy of the proposed GOMFuNet framework in tackling complex, real-world problems involving heterogeneous data and the need for reliable predictions, we conduct a case study in the domain of educational performance prediction [17]. Using a multimodal dataset comprising structured, textual, and audio data from student interactions [18,19], we validate GOMFuNet’s ability to outperform baseline and conventional fusion techniques. This case study, involving factors influencing student success [20,21], serves to illustrate the tangible benefits of our mathematically grounded approach in a challenging application context.

The remainder of this paper is organized as follows: Section 2 reviews related work from the perspectives of mathematical fusion strategies and uncertainty handling. Section 3 provides the detailed mathematical formulation and algorithmic description of the proposed GOMFuNet. Section 4 describes the experimental setup for the case study. Section 5 presents the empirical results of the case study, including comparisons and ablation studies. Section 6 discusses the findings, implications, and limitations. Section 7 concludes the paper.

2. Related Work

2.1. Mathematical Strategies for Multimodal Fusion

Multimodal fusion aims to mathematically integrate information from multiple heterogeneous data sources (modalities) for enhanced representation or prediction [6,22]. Classical fusion strategies include early fusion, late fusion, and intermediate fusion. Early fusion, or feature-level fusion, typically involves concatenating feature vectors extracted from each modality [23]. While conceptually simple, this approach often struggles mathematically with high dimensionality and feature heterogeneity (differing scales and statistical properties), and may fail to capture complex, non-linear intermodal interactions or preserve modality-specific structures. Late fusion, or decision-level fusion, combines the outputs of modality-specific models [24]. This avoids feature-level heterogeneity issues but mathematically prevents the model from learning synergistic interactions between modalities during the representation learning phase.

Intermediate fusion strategies attempt to combine information at deeper layers of a network. Attention mechanisms, particularly within Transformer architectures [10,25], have become popular for modeling crossmodal correlations by learning adaptive weights. However, standard attention may not explicitly account for the underlying geometric manifold structure of the data within each modality. Tensor-based methods offer another mathematical framework for capturing high-order interactions but can be computationally expensive [26]. Recent advances in geometric deep learning [13] offer promising tools. Graph Neural Networks (GNNs), including GCNs, can operate on graph-structured data, allowing models to explicitly leverage relationships between data points (nodes). While GNNs have been applied to multimodal problems [27], their use for fusion directly within a high-level semantic or label space, as proposed in GOMFuNet, is less common and offers potential advantages in handling inconsistencies and focusing on semantic agreement.

2.2. Uncertainty Quantification and Reliability in Deep Learning

Ensuring the reliability of predictions from deep learning models, especially complex multimodal ones, is a critical mathematical challenge [3]. Prediction uncertainty arises from various sources, including data noise (aleatoric uncertainty) and model limitations (epistemic uncertainty). Several mathematical frameworks exist for quantifying and mitigating uncertainty. Bayesian Neural Networks (BNNs) provide a principled probabilistic approach by placing distributions over model weights, but inference can be computationally demanding [28]. Ensemble methods average predictions from multiple models trained independently or with variations, offering a practical way to estimate uncertainty but increasing computational cost [29].

Confidence calibration aims to ensure that the predicted probabilities accurately reflect the true likelihood of correctness [30]. Miscalibrated models can be overconfident or underconfident, hindering reliable decision-making. Techniques like temperature scaling or Platt scaling are post hoc calibration methods [30]. However, incorporating reliability directly into the training process is desirable. Some loss functions, like Focal Loss [31], address class imbalance but do not directly target the confidence structure itself. Other approaches modify the network architecture or training procedure [30]. The LCLM proposed in GOMFuNet offers a distinct mathematical approach by directly optimizing the geometric orthogonality of the output probability vectors, aiming to improve separability and confidence inherently during training, complementing standard task-based losses. This contrasts with methods focusing solely on individual prediction calibration or weight uncertainty.

2.3. Student Performance Prediction as an Application Context

Predicting student course performance is an important application area within educational data mining [17,32]. Traditional statistical methods [33] often yield limited accuracy due to the complexity of factors influencing learning [34,35]. AI and machine learning have been increasingly applied [36,37], leveraging diverse data sources available in modern learning environments, including structured records, textual submissions, classroom audio/video, and interaction logs [38,39]. The inherent multimodality and potential for noise and inconsistency in educational data make it a suitable and challenging testbed for evaluating advanced, reliable fusion frameworks like the proposed GOMFuNet [14,40]. Our work uses this domain not as its primary focus, but as a case study to validate the mathematical principles and practical effectiveness of GOMFuNet.

3. A Mathematical Framework for Reliable Geometric Multimodal Fusion

3.1. Mathematical Problem Statement: Prediction Under Heterogeneity and Uncertainty

Consider the problem of learning from multimodal data drawn from a joint probability distribution

P

over a product space

X \times Y

, where

X = X_{1} \times \dots \times X_{M}

represents the combined input space from M modalities and

Y

is the target space. Given an empirical sample

D = {(x_{i}, y_{i})}_{i = 1}^{N} \sim P^{N}

, where

x_{i} = (x_{i, 1}, \dots, x_{i, M})

with

x_{i, m} \in X_{m}

, the objective is to find a function

f : X \to P (Y)

within a hypothesis space

H

that minimizes the expected risk:

R (f) = E_{(x, y) \sim P} [ℓ (f (x), y)],

(1)

where

ℓ : P (Y) \times Y \to R_{\geq 0}

is a loss functional. Practical challenges arise from the heterogeneity of spaces

X_{m}

(requiring robust fusion mechanisms) and the need for reliable uncertainty estimates from

f (x)

, which standard empirical risk minimization (ERM) often fails to guarantee. Specifically, we address the following:

(C1): Structure-Preserving Fusion: How to mathematically fuse information from disparate modalities while preserving and leveraging the intrinsic geometric or topological structures presumed to exist within the data representation of each modality?
(C2): Explicit Reliability Enhancement: How to mathematically formulate an objective or constraint that explicitly promotes the reliability, confidence, and separability of the output probability distribution $f (x)$ , beyond simply minimizing the empirical task loss ℓ?

This paper introduces a framework tackling (C1) using concepts from geometric deep learning applied in a semantic latent space, and (C2) via a novel output space regularization technique derived from orthogonality principles in linear algebra and information geometry.

3.2. Geometric Integration via Graph Propagation in Semantic Latent Spaces

Our approach to (C1) involves transforming the fusion problem into one of learning on graphs constructed within appropriate latent spaces, thereby explicitly encoding inter-sample relationships.

3.2.1. Graph Embedding of Modality-Specific Manifolds

Let

ϕ_{m} : X_{m} \to Z_{m} \subseteq R^{d_{m}^{'}}

be learned embedding functions mapping raw data to a latent semantic space

Z_{m}

. For a batch

B

of N samples, we obtain embeddings

X_{m}^{'} = {x_{i, m}^{'}}_{i \in B}

. Assuming the data points lie on or near a lower-dimensional manifold

M_{m} \subset Z_{m}

, we approximate

M_{m}

using a discrete graph structure

G_{m} = (V, A_{m})

, where

V = B

. The weighted adjacency matrix

A_{m}

captures local geometric structure. A common construction uses a symmetric kNN graph with heat kernel weights:

A_{m, i j} = \{\begin{matrix} exp (- \frac{∥ x_{i, m}^{'} - x_{j, m}^{'} ∥_{2}^{2}}{2 σ_{m}^{2}}) & if j \in N_{k} (i) or i \in N_{k} (j) \\ 0 & otherwise \end{matrix}

(2)

The choice of k and

σ_{m}

influences the scale at which the manifold geometry is captured. This graph provides a combinatorial representation of the data topology for modality m.

3.2.2. Spectral Graph Convolution for Structure-Aware Learning

Learning on

G_{m}

requires operators that respect its structure. Graph Convolutional Networks (GCNs) leverage the graph Laplacian,

L_{m} = D_{m} - A_{m}

, where

D_{m}

is the diagonal degree matrix (

D_{m, i i} = \sum_{j} A_{m, i j}

). The normalized Laplacian,

L_{m} = I_{N} - D_{m}^{- 1 / 2} A_{m} D_{m}^{- 1 / 2}

, has eigenvalues

0 = λ_{1} \leq λ_{2} \leq \dots \leq λ_{N} \leq 2

, which relate to the graph’s connectivity and geometry. Spectral GCNs define convolution via filtering in the graph Fourier domain:

H_{out} = U_{m} (g_{θ} (Λ_{m}) ⊙ (U_{m}^{T} H_{in}))

(3)

where

L_{m} = U_{m} Λ_{m} U_{m}^{T}

is the eigendecomposition of the Laplacian,

U_{m}

contains the eigenvectors (graph Fourier modes),

Λ_{m}

is the diagonal matrix of eigenvalues (frequencies),

g_{θ} (Λ_{m})

is a learnable spectral filter

g_{θ} : R \to R

applied element-wise to the eigenvalues, and ⊙ denotes element-wise multiplication. Practical GCN layers approximate this spectral filtering using computationally efficient spatial operations based on the renormalized adjacency matrix

{\hat{A}}_{m} = {\tilde{D}}_{m}^{- 1 / 2} {\tilde{A}}_{m} {\tilde{D}}_{m}^{- 1 / 2}

(where

{\tilde{A}}_{m} = A_{m} + I_{N}

). The layer update rule becomes

H_{m}^{(l + 1)} = σ ({\hat{A}}_{m} H_{m}^{(l)} W_{m}^{(l)})

(4)

This operation performs localized feature aggregation, effectively smoothing representations along the graph edges and propagating information across the approximated data manifold

M_{m}

. Stacking

L_{g}

layers yields

{\hat{H}}_{m} = H_{m}^{(L_{g})}

, capturing multi-scale geometric information.

3.2.3. Global Context Aggregation Using Self-Attention

The geometry-informed representations

{{\hat{H}}_{m}}_{m = 1}^{M}

are concatenated into

H_{concat} \in R^{N \times D^{″}}

. To model global interactions and dependencies across all samples and fused modalities, we apply a self-attention mechanism, realized via a Transformer encoder. The attention operation computes a new representation for each sample i as a weighted sum of linearly transformed representations of all samples j:

{AttnOutput}_{i} = \sum_{j = 1}^{N} α_{i j} (h_{concat, j} W_{V})

(5)

where

h_{concat, j}

is the j-th row of

H_{concat}

,

W_{V}

is a learnable value transformation matrix, and the attention weights

α_{i j}

are computed based on compatibility scores between query and key representations:

α_{i j} = {softmax}_{j} (\frac{(h_{concat, i} W_{Q}) {(h_{concat, j} W_{K})}^{T}}{\sqrt{d_{k}}})

(6)

with learnable matrices

W_{Q}, W_{K}

. Multi-head attention performs this process in parallel across multiple subspaces. This allows the model to learn a complex aggregation function

f_{a g g}

that produces the final fused representation

H_{fused} = f_{a g g} (H_{concat})

, capturing both local geometric structure (via GCNs) and global context (via Transformer).

3.3. Output Space Geometry Regularization via Orthogonality Enforcement

To tackle (C2), we introduce a novel regularizer,

L_{LCLM}

, that directly modifies the optimization landscape to favor solutions exhibiting higher prediction reliability by controlling the geometric configuration of the output probability distributions.

3.3.1. Information Geometry Perspective: Separability and Orthogonality

Let

P \in Δ_{C}^{N}

be the predicted probability matrix for a batch, where

Δ_{C} = {p \in R^{C} | p_{c} \geq 0, \sum_{c} p_{c} = 1}

is the probability simplex. The c-th column

p_{c} \in {[0, 1]}^{N}

represents the conditional probability distribution

P (sample i \in class c)

over the batch. From an information geometry perspective, the separability between classes can be related to the distance or divergence between these column vectors

p_{c}

. High correlation or small angular separation between

p_{c}

and

p_{c^{'}}

indicates poor discriminability between classes c and

c^{'}

for the given batch.

We propose using vector orthogonality in the embedding space

R^{N}

as a strong geometric prior for class separability. The ideal condition

〈 p_{c}, p_{c^{'}} 〉 = 0

for

c \neq c^{'}

implies that the probability mass assigned to different classes is concentrated on disjoint subsets of samples within the batch. While strict orthogonality is restrictive, minimizing the pairwise cosine similarity serves as a continuous relaxation that encourages angular separation. Let

{\hat{p}}_{c} = p_{c} / {∥ p_{c} ∥}_{2}

be the L2-normalized vector. The objective is to minimize

| 〈 {\hat{p}}_{c}, {\hat{p}}_{c^{'}} 〉 |

for

c \neq c^{'}

.

3.3.2. The LCLM Regularizer: Maximizing Relative Self-Confidence

We formulate this objective via the Gram matrix of cosine similarities,

G_{c c^{'}} = 〈 {\hat{p}}_{c}, {\hat{p}}_{c^{'}} 〉

. Driving

G

towards

I_{C}

is achieved by maximizing the diagonal entries relative to the off-diagonal entries. Consider the softmax normalization along each row (or column) of

G

:

u_{c k} (P; τ_{L}) = \frac{exp (〈 {\hat{p}}_{c}, {\hat{p}}_{k} 〉 / τ_{L})}{\sum_{j = 1}^{C} exp (〈 {\hat{p}}_{c}, {\hat{p}}_{j} 〉 / τ_{L})}

(7)

The term

u_{c c}

quantifies the relative confidence or concentration of class c’s probability vector compared to its alignment with other class vectors. Maximizing the product

\prod_{c = 1}^{C} u_{c c}

promotes simultaneous high relative self-confidence for all classes. This is equivalent to minimizing the negative log-likelihood:

L_{LCLM} (P; τ_{L}) = - \frac{1}{C} \sum_{c = 1}^{C} log u_{c c} (P; τ_{L})

(8)

This differentiable loss function acts as a regularizer on the output probability matrix

P

, pushing the column vectors towards mutual orthogonality.

3.3.3. Gradient Analysis and Optimization Dynamics

The gradient of

L_{LCLM}

with respect to the logits

Z

(where

P = softmax (Z)

) can be derived using the chain rule. Let

G_{c k} = 〈 {\hat{p}}_{c}, {\hat{p}}_{k} 〉

. The gradient involves terms that push

{\hat{p}}_{c}

away from

{\hat{p}}_{k}

for

k \neq c

. Specifically, the gradient

\nabla_{P} L_{LCLM}

has components related to the following:

\frac{\partial L_{LCLM}}{\partial P_{i c}} \propto \sum_{k \neq c} (u_{c k} - u_{c c} δ_{c k}) \frac{\partial G_{c k}}{\partial P_{i c}}

(9)

where terms involving

\partial G_{c c} / \partial P_{i c}

contribute to maximizing self-correlation (implicitly handled by normalization) and terms involving

\partial G_{c k} / \partial P_{i c}

for

k \neq c

contribute to minimizing off-diagonal correlations. The term

\partial G_{c k} / \partial P_{i c}

involves derivatives of the normalization

∥ p_{c} ∥_{2}

and the inner product

p_{c}^{T} p_{k}

. Minimizing

L_{LCLM}

during training modifies the optimization trajectory defined by the task loss

L_{Task}

, guiding the parameters

Θ

towards regions in the hypothesis space

H

where the model

f_{Θ}

produces predictions that are not only accurate but also satisfy the geometric orthogonality prior, thus exhibiting enhanced reliability and inter-class separability.

3.4. The GOMFuNet Optimization Problem and Algorithm

The overall GOMFuNet framework is trained by minimizing a composite objective function that balances empirical accuracy with the proposed geometric reliability regularization:

min_{Θ} E_{B \sim D} [L_{Task} (f_{Θ} (B), y_{B}) + λ L_{LCLM} (softmax (f_{Θ} (B)))]

(10)

where

Θ

represents all trainable parameters of the model (encoders

ϕ_{m}

, GCNs, Transformer

f_{a g g}

, output layer

W_{o u t}

),

L_{Task}

is a standard loss function suitable for the prediction task (e.g., cross-entropy, MSE),

λ \geq 0

is the regularization coefficient controlling the strength of the orthogonality constraint, and the expectation is approximated by averaging over mini-batches B.

This optimization problem is typically non-convex and is solved numerically using stochastic gradient-based methods, such as Adam. The detailed step-by-step procedure, involving forward propagation through the geometric fusion and prediction stages, calculation of the composite loss, and backpropagation for parameter updates, is outlined in Algorithm 1.

3.5. Theoretical Impact Analysis of LCLM on Generalization Error and Prediction Margin

While deriving a precise, comprehensive generalization error bound for the GOMFuNet model, which integrates complex components such as Graph Convolutional Networks (GCNs) and Transformers, presents a formidable challenge, we can theoretically analyze how its core innovation—the Label Confidence Learning Module (LCLM)—positively influences the model’s generalization capabilities and prediction margin. The LCLM is designed to enhance prediction reliability by promoting orthogonality among the output class probability vectors. This objective is closely related to classical Margin Theory and principles of complexity control in learning theory.

Algorithm 1 Stochastic Training Algorithm for GOMFuNet

Require: Multimodal dataset D, Initial GOMFuNet parameters

Θ_{0}

, Regularization hyperparameter

λ

, Learning rate

η

, Optimizer state

S_{0}

, Max epochs E, Batch size

B_{s i z e}

.
Ensure: Optimized GOMFuNet parameters

Θ

.

1:: Initialize $Θ \leftarrow Θ_{0}$ , $S \leftarrow S_{0}$ .
2:: for epoch = 1 to E do
3:: Shuffle dataset D.
4:: for each batch $B = {(d_{i, 1}, \dots, d_{i, M}, y_{i})}_{i \in I_{B}}$ from D do
5:: // Forward Pass: Compute prediction P_B = f_Theta(B)
6:: Compute $X_{m, B}^{'}$ for all m.
7:: Compute ${\hat{H}}_{m, B}$ via GCN propagation.
8:: Compute $H_{fused, B}$ via concatenation and Transformer aggregation.
9:: Compute $P_{B} = softmax (H_{fused, B} W_{o u t} + b_{o u t})$ .
10:: // Loss Computation
11:: Compute $L_{Task, B} = TaskLoss (P_{B}, y_{B})$ .
12:: Compute $L_{LCLM, B}$ .
13:: Compute $L_{Total, B} = L_{Task, B} + λ L_{LCLM, B}$ .
14:: // Backward Pass and Update
15:: Compute gradient $g_{t} = \nabla_{Θ} L_{Total, B}$ .
16:: Update parameters and optimizer state: $(Θ, S) \leftarrow OptimizerUpdate (Θ, S, g_{t}, η)$ .
17:: end for
18:: // Optional: Validation, Learning Rate Decay, Early Stopping
19:: end for
20:: Return Optimized parameters $Θ$ .

3.5.1. LCLM, Output Space Geometry, and Class Margin

The LCLM loss function,

L_{LCLM} (P; τ_{L})

, as defined in Equation (8), aims to minimize the cosine similarity between the normalized probability vectors

{\hat{p}}_{c}

and

{\hat{p}}_{k}

for distinct classes

c \neq k

. This can be expressed as striving for

〈 {\hat{p}}_{c}, {\hat{p}}_{k} 〉 \to 0 for c \neq k .

(11)

Geometrically, this is equivalent to encouraging the average probability response directions, represented by these vectors in an N-dimensional space (where N is the batch size), to become mutually orthogonal for different classes.

This pursuit of orthogonality can be interpreted as a form of margin maximization at the level of class probability distributions. In an idealized scenario, if the probability vectors

p_{c}

and

p_{k}

(representing the model’s probabilistic assignment to class c and k over a batch, or in expectation, over the entire data distribution) are perfectly orthogonal for any two distinct classes, it implies that, statistically, the patterns leading to a high probability assignment for class c are significantly different from those for class k. Such distinctiveness helps the model generate less ambiguous and more definitive class assignments for each input sample.

For an individual sample

x_{i}

, its predicted probability distribution is

p (x_{i}) = (P (y = 1 | x_{i}), \dots, P (y = C | x_{i}))

. If the model learns class-specific “signatures” (manifested through

p_{c}

) that are globally orthogonal, then a specific sample

x_{i}

is more likely to resonate strongly (i.e., yield a high probability) with only one (or a few closely related) class signature, while maintaining a low similarity (i.e., low probability) with others. This indirectly promotes the “margin” for individual sample predictions. For instance, it encourages a larger difference between the predicted probability of the true class

y_{i}

,

P (y = y_{i} | x_{i})

, and the highest predicted probability of any incorrect class

c \neq y_{i}

:

{Margin}_{i} = P (y = y_{i} | x_{i}) - max_{c \neq y_{i}} P (y = c | x_{i}) .

(12)

LCLM’s objective contributes to increasing this margin on average.

3.5.2. LCLM as a Regularizer and Learning Theory

From a learning theory perspective,

L_{LCLM}

acts as a regularizer that imposes structural constraints on the model’s output. According to Occam’s Razor and statistical learning theory, controlling model complexity (e.g., by limiting the capacity of the hypothesis space or enforcing solution smoothness through regularization) is crucial for improving generalization, i.e., reducing the gap between empirical risk and expected risk.

Classical generalization error bounds, such as those based on Rademacher complexity, often take the following form:

E [L (f)] \leq \hat{E} [L (f)] + 2 R_{N} (F) + complexity_term

(13)

where

L (f)

is the loss function,

R_{N} (F)

is the empirical Rademacher complexity of the function class

F

on a sample of size N. It is well established that classifiers with larger margins (e.g., Support Vector Machines) often correspond to lower Rademacher complexity or VC dimension.

By enforcing orthogonality among output probability vectors, LCLM guides the model towards learning a solution that is “structurally simpler” or exhibits “greater class separation”. Specifically, it penalizes model parameter configurations that lead to highly correlated probability distributions for different classes. This prior knowledge imposed on the output structure can be viewed as restricting the effective hypothesis space, biasing it towards functions that produce high-confidence, highly separable predictions. While we do not explicitly compute the VC dimension or Rademacher complexity of GOMFuNet (which is inherently complex due to its deep GCN and Transformer architecture), the introduction of LCLM aligns with the fundamental principle of improving generalization by increasing the margin or imposing structural constraints.

Furthermore, the orthogonality encouraged by LCLM among class probability vectors is analogous to a feature transformation that renders data points from different classes more linearly separable in the transformed space. If we consider each class probability vector

p_{c}

as a form of “average signature” for that class in the “sample space”, LCLM pushes these signatures to be as distinct as possible. This distinctiveness allows decision boundaries to be placed in sparser regions of the output space, thereby enhancing robustness in predictions for unseen samples, which is consistent with the idea of improving generalization margin.

3.5.3. Error Probability and Its Relation to Margin

In classification tasks, the error probability is intrinsically linked to the margin of the decision boundary. For a sample

(x, y_{true})

, if the model predicts

P (y_{true} | x)

to be substantially greater than

P (y_{j} | x)

for all

j \neq y_{true}

(i.e., it possesses a large prediction margin as in Equation (12)), the probability of misclassifying this sample is low. LCLM, by enforcing separation between class-level probability distributions on a batch basis, contributes to enhancing this prediction margin in an average sense. When the model can correctly classify most training samples with high confidence (large margin), its error probability on the test set also tends to be lower.

4. Case Study: Educational Performance Prediction

To evaluate the practical effectiveness and demonstrate the applicability of the proposed GOMFuNet framework, we conduct a case study in the domain of educational performance prediction. This domain provides a challenging real-world scenario involving heterogeneous multimodal data where prediction reliability is valuable.

4.1. Dataset and Preprocessing

The data for this case study were sourced from two cohorts of students (totaling 472 students: 243 first cohort, 229 s) in an English course over two academic years. The dataset includes classroom performance records derived from video analysis, midterm online exam scores (including written scores, text essays, and audio recordings), and final exam scores. The objective is to predict students’ final English grades, framed as both a binary classification task (Pass/Fail, based on a score threshold of 60) and a regression task (predicting the specific percentage score). Ethical considerations were addressed; all participating students were informed and consented to the use of their anonymized learning data for research analysis.

Classroom performance data were extracted from video recordings using a combination of automated scripts (OpenCV) and manual annotation for metrics such as attendance, seating position (row/column, 0 if absent), proportion of time looking at the instructor, estimated note-taking time proportion, and participation count (hand-raising/speaking instances). Midterm data included structured written test scores (0–100), one-minute oral recordings (WAV format), and written assignments (TXT format). Final exam data, comprising pass/fail status (0/1) and specific scores (0–100), served as the prediction targets.

Preprocessing involved anonymization, standardization, and cleaning. Structured numerical features were normalized to the [0–1] range using Min–Max scaling. Audio (WAV) data underwent noise reduction (spectral subtraction) and segmentation using Voice Activity Detection (WebRTC VAD). Text (TXT) data were cleaned, tokenized using the WordPiece tokenizer (‘bert-base-uncased’), formatted with special tokens ([CLS], [SEP], [PAD]), and converted to fixed-length sequences with attention masks. Missing values were handled appropriately (e.g., removed if critical). The distribution of final exam scores in the combined dataset (N = 472) is shown in Table 1.

4.2. Experimental Setup and Evaluation Metrics

Experiments were conducted using PyTorch 1.10.0 on a system with an NVIDIA Tesla V100 GPU. The dataset (N = 472) was split into training (80%, N = 378) and testing (20%, N = 94) sets, stratified by the pass/fail status. Five-fold cross-validation on the training set was used for hyperparameter tuning (including the LCLM weight

λ

, GCN layers, Transformer parameters, learning rate, batch size). The final reported results are averaged over 5 independent runs with different random seeds on the held-out test set, after training the model on the full 80% training data using the best hyperparameters found. Key hyperparameters for the GOMFuNet core are detailed in Table 2, while encoder details are given in Appendix A.

The Adam optimizer was used with an initial learning rate of 0.001, standard betas, and epsilon = 1 × 10⁻⁸. A ‘ReduceLROnPlateau’ learning rate scheduler (patience = 5, factor = 0.1) and early stopping (patience = 10 based on validation loss) were employed. The batch size was 32. The LCLM weighting factor

λ

was tuned via cross-validation (tested values e.g., [0.01, 0.1, 1.0]) and set to [Specify best lambda value here].

Performance was evaluated using standard metrics. For the classification task (Pass/Fail): Accuracy, F1-score, and Area Under the ROC Curve (ROC-AUC). For the regression task (Predicting exact score): Mean Squared Error (MSE), Mean Absolute Error (MAE), and Coefficient of Determination (R-squared,

R^{2}

). The formulas for these metrics are standard and defined as follows: Accuracy

= \frac{T P + T N}{T P + T N + F P + F N}

; F1-score

= 2 \times \frac{P r e c i s i o n \times R e c a l l}{P r e c i s i o n + R e c a l l}

; ROC-AUC is the area under the Receiver Operating Characteristic curve; MSE

= \frac{1}{n} \sum_{i = 1}^{n} {({\hat{y}}_{i} - y_{i})}^{2}

; MAE

= \frac{1}{n} \sum_{i = 1}^{n} | {\hat{y}}_{i} - y_{i} |

;

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {({\hat{y}}_{i} - y_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

. TP, TN, FP, and FN represent True Positives, True Negatives, False Positives, and False Negatives; n is the number of test samples;

y_{i}

is the true value;

{\hat{y}}_{i}

is the predicted value; and

\bar{y}

is the mean of the true values.

4.3. Baselines and Comparison Methods

To assess the effectiveness of GOMFuNet, we compare it against two sets of baseline methods on the case study data:

Single-Modality Models: Models trained on each data modality (structured, text, audio) independently using appropriate machine learning algorithms (Decision Tree, Random Forest, SVM, NN for structured; LSTM, CNN, LSTM + CNN for audio; Naive Bayes, RNN, BERT, BERT + MHA for text). This establishes the performance level achievable without multimodal fusion.
State-of-the-Art Multimodal Models: We compare GOMFuNet against several existing advanced multimodal fusion techniques reported in the literature, implemented on our dataset: ATV [41], MsaCNN [42], and Transmodality [43]. This comparison highlights GOMFuNet’s performance relative to other fusion paradigms.

Additionally, we conduct ablation studies by removing key components of GOMFuNet (CLFM, LCLM) or individual modalities to understand their specific contributions.

5. Results and Analysis

This section presents the empirical results of the case study evaluating the proposed GOMFuNet framework on the task of predicting student English course performance.

5.1. Comparison with Single-Modality Baselines

Table 3 compares the performance of GOMFuNet against models trained on single data modalities. The results clearly show that GOMFuNet, leveraging the fusion of structured, text, and audio data, significantly outperforms all single-modality baselines across both classification (Accuracy, F1, AUC) and regression (

R^{2}

, MSE, MAE) tasks. For instance, the best single-modality model (BERT + MHA on Text) achieves 76.73% accuracy and 76.31%

R^{2}

, whereas GOMFuNet reaches 90.17% accuracy and 88.03%

R^{2}

. This substantial improvement underscores the necessity of multimodal integration to capture the complex factors influencing student performance, a task effectively addressed by the GOMFuNet framework. The limitations of relying on any single data source are evident, highlighting the value captured by GOMFuNet’s fusion mechanism.

5.2. Comparison with State-of-the-Art Multimodal Baselines

Table 4 compares GOMFuNet with other advanced multimodal fusion models implemented on our case study dataset. While methods like ATV, MsaCNN, and Transmodality achieve respectable performance, GOMFuNet consistently outperforms them across all reported metrics. For example, the best competitor (ATV) achieves 87.05% accuracy and 85.88%

R^{2}

, compared to GOMFuNet’s 90.17% and 88.03%, respectively. This superiority suggests that GOMFuNet’s specific mathematical design—combining geometric fusion in the label space via CLFM with explicit reliability optimization via LCLM—provides tangible advantages over alternative fusion strategies in handling the complexities of this multimodal dataset.

5.3. Impact of LCLM (Loss Function Comparison)

To isolate the contribution of the novel LCLM component, we compared the performance of the GOMFuNet architecture when trained using the combined loss versus using only standard loss functions (Cross-Entropy for classification, MSE adapted, and Focal Loss as an advanced baseline). Table 5 shows these results. Using LCLM yields the best performance across all metrics, surpassing even Focal Loss (which is designed for imbalance but does not optimize output geometry). For instance, accuracy improves from 89.52% (Focal Loss) to 90.17% (LCLM), and

R^{2}

improves from 87.88% to 88.03%. This directly demonstrates the empirical benefit of the LCLM’s mathematical objective of enforcing label orthogonality for enhancing prediction accuracy and reliability, compared to standard methods focusing solely on label matching.

5.4. Ablation Studies

We performed ablation studies to understand the contribution of individual modalities and core GOMFuNet modules.

Table 6 shows the impact of removing each modality. Removing any single modality leads to a performance drop, confirming that each data stream provides valuable complementary information captured by GOMFuNet. Text data appear most influential in this case study, followed by structured data, and then audio data, but the best performance is achieved only when all three are fused by GOMFuNet.

Table 7 shows the impact of removing the core CLFM or LCLM modules. Removing CLFM (implying simpler fusion before LCLM and task loss) causes a significant drop (Accuracy 88.03%,

R^{2}

85.99%), highlighting the importance of the geometric fusion approach in CLFM. Removing LCLM (using CLFM with only task loss, akin to the Focal Loss result in Table 5) also degrades performance (Accuracy 89.52%,

R^{2}

87.88%), confirming the contribution of the LCLM’s explicit reliability optimization. Removing both modules leads to the worst performance, emphasizing the synergistic benefit of GOMFuNet’s combined mathematical design.

5.5. Visualization of Fusion Effect

To qualitatively assess the impact of the CLFM’s geometric fusion, we visualize the high-dimensional feature representations using t-SNE before and after the CLFM module. Figure 1 shows the embeddings colored by final course status (Pass/Fail) and numerical score. Before fusion (left column), features from simple concatenation show poor separation between classes and little structure related to scores. After fusion by CLFM (right column), the embeddings exhibit much clearer separation between Pass and Fail clusters and a more discernible structure corresponding to numerical scores. This visualization provides qualitative evidence that CLFM effectively integrates multimodal information into a more mathematically meaningful and discriminative feature space compared to naive concatenation, supporting its role in the GOMFuNet framework.

5.6. Interpretability via Feature Importance Analysis

To gain insights into the decision-making process of the GOMFuNet model within the case study context, particularly regarding the contribution of readily interpretable input features, we employed the SHAP (SHapley Additive exPlanations) framework. Figure 2 presents the SHAP summary plot (beeswarm) focusing on the impact of the structured input features (derived from classroom video analysis and exam scores) on the model’s prediction towards the ‘Pass’ outcome. Each point represents a student sample from the test set, where the color indicates the feature’s value (red = high, blue = low) and the horizontal position reflects the SHAP value, quantifying the feature’s contribution to shifting the prediction probability. This analysis reveals that ‘Midterm Score’ exerts the most substantial influence, with higher scores strongly driving predictions towards ‘Pass’. Other educationally relevant factors like ‘Participation Count’ and ‘Attendance Rate’ also emerge as significant positive predictors, aligning with pedagogical intuition. Conversely, features such as ‘Seat Row’ exhibit a smaller, context-dependent impact. This interpretability analysis enhances confidence in the model by demonstrating that GOMFuNet learns meaningful relationships from the input data, leveraging key indicators of student performance in a discernible manner.

5.7. Prediction Accuracy Analysis (ROC and Scatter Plots)

Figure 3 presents the ROC curve for the classification task and the scatter plot of actual versus predicted scores for the regression task, both obtained using the full GOMFuNet model. The high ROC-AUC value of 0.92 indicates strong discriminative ability between Pass and Fail classes. The scatter plot shows predicted scores closely aligning with the actual scores along the diagonal, demonstrating good predictive accuracy for the regression task, reflected also in the high

R^{2}

value of 88.03%. Some deviations exist for outlier cases, but the overall trend confirms GOMFuNet’s effectiveness in both prediction tasks within the case study.

6. Discussion

This paper introduced GOMFuNet, a novel mathematical framework for multimodal data fusion designed to enhance prediction reliability. The core innovations lie in the synergistic combination of geometric deep learning for fusion in the label space (CLFM) and an explicit optimization objective based on output vector orthogonality for confidence enhancement (LCLM). The empirical results from the educational performance prediction case study provide strong validation for the proposed mathematical concepts. GOMFuNet significantly outperformed both single-modality baselines and state-of-the-art multimodal fusion methods, demonstrating the practical benefits of its design. The ablation studies confirmed the necessity of both the CLFM’s geometric fusion approach and the LCLM’s reliability optimization for achieving peak performance. Furthermore, the analysis of confidence calibration and robustness to noise lends empirical support to the mathematical claims that GOMFuNet, particularly through LCLM, improves prediction reliability and stability beyond what is typically achieved with standard loss functions. The visualization of the fused feature space also qualitatively supports the effectiveness of CLFM’s fusion strategy.

From a mathematical perspective, GOMFuNet offers several advantages. The CLFM’s use of GCNs in the label space allows explicit modeling of inter-sample relationships at a high semantic level, potentially offering better robustness to low-level feature noise and inconsistencies compared to feature-level fusion. The LCLM introduces a novel way to regularize the output space directly, promoting class separability and confidence through a geometrically motivated orthogonality constraint. This explicit focus on the output structure complements traditional loss functions that focus on point-wise prediction accuracy. The synergy between preserving input structure (via GCN) and structuring the output space (via LCLM) appears crucial to GOMFuNet’s success.

However, the proposed framework also has limitations. Mathematically, we have provided justifications and empirical validations, but formal theoretical guarantees on error probability bounds or convergence properties of the LCLM loss remain subjects for future investigation. The LCLM loss, like many deep learning objectives, is likely non-convex, and its optimization dynamics warrant further study. The specific graph construction method involves hyperparameters (like k in kNN) that might require tuning. Computationally, GOMFuNet involves multiple components (encoders, GCNs, Transformer, LCLM calculation), potentially leading to higher computational cost compared to simpler models, although this was manageable for our case study dataset size. From an application perspective, the generalizability of GOMFuNet’s performance benefits needs to be validated across diverse multimodal datasets from different domains beyond education. The effectiveness might depend on the nature of the modalities and the underlying relationships between them.

The implications of this work extend to the broader fields of multimodal learning and trustworthy AI. It highlights the potential of integrating principles from geometric deep learning with novel optimization objectives that directly target prediction reliability. GOMFuNet provides a concrete example of how moving beyond standard accuracy-focused training can lead to more robust and confidence-aware models, which is critical for high-stakes applications.

Future work should focus on several directions. Developing theoretical analyses of LCLM’s properties (e.g., convergence, relationship to generalization bounds, impact on calibration) would strengthen its mathematical foundations. Exploring alternative geometric structures or propagation mechanisms within CLFM could yield further improvements. Investigating adaptive methods for setting the LCLM weight

λ

or extending the LCLM concept to other tasks like structured prediction or unsupervised learning presents interesting avenues. Applying and adapting GOMFuNet to other challenging multimodal problems (e.g., medical diagnosis, affective computing, robotics) will be crucial for assessing its broader utility and identifying necessary modifications. While the educational case study demonstrated GOMFuNet’s potential, further refinement for specific educational interventions, perhaps by incorporating domain knowledge or enhancing interpretability (e.g., using techniques like SHAP as explored in preliminary analyses, potentially detailed in Appendix A, remains a secondary but valuable direction.

Empirical Validation of Prediction Reliability via Calibration Analysis

A critical aspect of trustworthy prediction systems lies in the statistical consistency between the model’s expressed confidence and its empirical accuracy. To rigorously evaluate the mathematical property of enhanced reliability fostered by the LCLM component, we analyze the confidence calibration of GOMFuNet’s probabilistic outputs. A model f is considered perfectly calibrated if, for any predicted probability value

p \in [0, 1]

, the conditional expectation of the true label Y given the predicted probability

f (X) = p

equals p—i.e.,

E [Y | f (X) = p] = p

. Deviations from this ideal condition indicate miscalibration, potentially leading to unreliable decision-making based on the model’s confidence estimates.

We quantify calibration performance using the Expected Calibration Error (ECE), a standard metric defined as the weighted average of the absolute difference between accuracy and confidence across K pre-defined probability bins

B_{k}

:

ECE = \sum_{k = 1}^{K} \frac{| B_{k} |}{N} |acc (B_{k}) - conf (B_{k})|

(14)

where N is the total number of samples,

| B_{k} |

is the number of samples whose predicted confidence falls into bin

B_{k}

,

acc (B_{k})

is the empirical accuracy of samples in bin

B_{k}

, and

conf (B_{k})

is the average predicted confidence for samples in bin

B_{k}

. A lower ECE signifies superior calibration. Additionally, we utilize Reliability Diagrams, which visually plot

acc (B_{k})

against

conf (B_{k})

for each bin k; perfect calibration corresponds to points lying on the diagonal line

y = x

.

Figure 4 presents the reliability diagrams comparing the GOMFuNet architecture trained with the standard task loss (

L_{Task}

) alone versus the proposed configuration utilizing the combined loss (

L_{Task} + λ L_{LCLM}

). The corresponding ECE values, computed using

K = 10

bins, are reported in Table 8. A marked improvement in calibration is observed when incorporating the LCLM objective. GOMFuNet trained with LCLM achieves a significantly lower ECE of 2.83%, compared to an ECE of 6.15% for the baseline model trained using only the standard cross-entropy loss. Furthermore, the reliability diagram associated with the LCLM-enhanced GOMFuNet exhibits points clustering more closely around the diagonal of perfect calibration, indicating a stronger alignment between predicted confidence levels and empirical probabilities of correctness across the spectrum of confidence values.

This empirical evidence provides strong quantitative support for the hypothesis that the LCLM loss, by enforcing orthogonality among class prediction vectors, directly contributes to producing mathematically more meaningful and statistically reliable confidence estimates. The substantial reduction in ECE validates LCLM’s effectiveness in optimizing the reliability structure of the predictions, moving beyond mere accuracy improvements towards generating more trustworthy probabilistic outputs.

7. Conclusions

This paper addressed the mathematical challenges of fusing heterogeneous multimodal data while ensuring the reliability of predictions. We proposed the Geometric Orthogonal Multimodal Fusion Network (GOMFuNet), a novel framework characterized by two key mathematical innovations: (1) geometric fusion in a high-level label space using Graph Convolutional Networks (CLFM) to preserve topological structure and handle inconsistencies, and (2) explicit prediction reliability enhancement via a novel Label Confidence Learning Module (LCLM) that enforces orthogonality among output class probability vectors.

Through mathematical justification and rigorous empirical evaluation, including confidence calibration, robustness analysis, and a detailed case study on educational performance prediction, we demonstrated that GOMFuNet provides a principled and effective approach to multimodal learning. GOMFuNet significantly outperformed single-modality baselines and existing state-of-the-art multimodal fusion techniques in terms of both prediction accuracy and reliability metrics. The ablation studies confirmed the synergistic contribution of both the CLFM and LCLM modules. Our findings highlight the value of integrating geometric deep learning principles with direct optimization of output space structure for building more trustworthy AI systems. GOMFuNet offers a promising direction for future research in reliable multimodal data analysis and its application to complex real-world problems. Future mathematical work includes deriving theoretical guarantees for LCLM and exploring extensions of the GOMFuNet framework.

Author Contributions

Conceptualization, Y.G.; methodology, Y.G. and R.Z.; investigation, Y.G.; resources, R.Z.; data curation, Y.G.; formal analysis, Y.G.; writing—original draft preparation, Y.G.; writing—review and editing, Y.G. and R.Z.; visualization, Y.G.; supervision, R.Z.; project administration, Y.G. All authors have read and agreed to the published version of the manuscript.

Funding

No funding received.

Data Availability Statement

Data related to the case study may be made available on reasonable request to the corresponding authors, subject to privacy and ethical considerations.

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful and constructive comments, which significantly contributed to the improvement and mathematical rigor of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Detailed Feature Encoder Architectures

This appendix provides supplementary details regarding the architectures of the modality-specific feature encoders employed within the GOMFuNet framework. These encoders serve the crucial pre-processing function of transforming raw input data from each modality into fixed-dimensional vector representations suitable for subsequent fusion by the core CLFM module. Standard, well-established architectures were selected for their proven efficacy on the respective data types.

Appendix A.1. Deep Neural Network (DNN) for Structured Data

To process the structured data modality, encompassing features such as student attendance records, seating positions, engagement metrics derived from video analysis, and numerical exam scores, a standard Deep Neural Network (DNN) was utilized. DNNs are well suited for capturing complex, non-linear patterns and feature interactions often present within tabular or structured datasets. The implemented DNN consists of an input layer accepting the

D_{s}

-dimensional feature vector, followed by a sequence of fully connected hidden layers employing the Rectified Linear Unit (ReLU) activation function to introduce non-linearity. Dropout regularization was applied after each hidden layer to mitigate overfitting. The network culminates in an output layer producing the final 32-dimensional feature vector used as input to the CLFM. The specific layer configurations and parameters are summarized in Table A1.

Table A1. Architectural details of the DNN encoder for structured data.

Parameter	Value
Input Dimension	$D_{s}$
Hidden Layer 1 Units	128
Hidden Layer 1 Activation	ReLU
Hidden Layer 1 Dropout Rate	0.3
Hidden Layer 2 Units	64
Hidden Layer 2 Activation	ReLU
Hidden Layer 2 Dropout Rate	0.3
Hidden Layer 3 Units	32
Hidden Layer 3 Activation	ReLU
Hidden Layer 3 Dropout Rate	0.3
Output Dimension ( $d_{s}^{'}$ )	32
Output Activation	Linear

Appendix A.2. Bert with Multi-Head Attention (MHA) for Text Data

For extracting deep semantic features from the textual data modality (e.g., student essays), we employed the Bidirectional Encoder Representations from Transformers (BERT) model [44], specifically the ‘bert-base-uncased’ pre-trained variant, which was subsequently fine-tuned on our task-specific data. BERT’s Transformer-based architecture, particularly its bidirectional self-attention mechanism, allows it to capture rich contextual information at both word and sentence levels, significantly outperforming traditional context-free methods. The architecture, illustrated in Figure A1, processes tokenized input sequences. To further refine the representations obtained from BERT, a Multi-Head Attention (MHA) module was added. MHA allows the model to jointly attend to information from different representation subspaces at different positions, enhancing the focus on the most relevant textual segments from multiple perspectives. The output from MHA is processed through feed-forward layers and layer normalization, resulting in the final 768-dimensional semantic embedding, typically derived from the processed [CLS] token representation, which serves as input to the CLFM. Key parameters are detailed in Table A2.

Figure A1. Architecture of the BERT model with MHA used for text feature extraction.

Table A2. Architectural details of the text encoder (BERT + MHA).

Parameter	Value
Base Model	BERT (‘bert-base-uncased’, fine-tuned)
Input Max Sequence Length	512 tokens
BERT Output Dimension (CLS token)	768
MHA Number of Heads	4
MHA Dimension	768
Feed-Forward Layers	2 (with ReLU)
Layer Normalization	Applied
Final Output Dimension ( $d_{t}^{'}$ )	768

Appendix A.3. CNN and Bi-LSTM Combination for Audio Data

Extracting meaningful features from the audio data modality (e.g., one-minute oral recordings) requires capturing both local spectro-temporal patterns and long-range temporal dependencies. To achieve this, we designed a hybrid deep learning model combining Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (Bi-LSTM) networks [45], operating on log-Mel spectrograms derived from the audio signals. The CNN component, with its convolutional filters and pooling layers, excels at identifying localized frequency-specific features (like phonetic characteristics). The Bi-LSTM component processes the sequence of spectral coefficients over time, effectively modeling longer temporal patterns such as rhythm and intonation contours; the bidirectional nature allows consideration of both past and future context. An attention mechanism (Luong-style general attention) was applied to the Bi-LSTM outputs to allow the model to dynamically weigh the importance of different audio segments. Features extracted from both the CNN and the attention-weighted Bi-LSTM pathways are concatenated and passed through a final dense layer to produce the consolidated 128-dimensional audio feature vector for the CLFM. The overall structure is depicted in Figure A2, and the parameterization is summarized in Table A3.

Figure A2. Combined CNN and Bi-LSTM architecture used for audio feature extraction.

Table A3. Architectural details of the audio encoder (CNN+Bi-LSTM).

Parameter	Value
Input Representation	Log-Mel Spectrogram
CNN Channel
Conv Layer 1 Filters	32
Conv Layer 1 Kernel Size	3 × 3
Conv Layer 1 Activation	ReLU
Pooling 1	MaxPool (2 × 2)
Conv Layer 2 Filters	64
Conv Layer 2 Kernel Size	3 × 3
Conv Layer 2 Activation	ReLU
Pooling 2	MaxPool (2 × 2)
Bi-LSTM Channel
Bi-LSTM Layers	2
Bi-LSTM Hidden Units (per direction)	128
Bi-LSTM Dropout Rate	0.4
Attention Mechanism	Luong (General)
Output Stage
Concatenation	CNN Features + Bi-LSTM Attention Output
Final Dense Layer Units	128
Final Dense Activation	ReLU
Final Output Dimension ( $d_{a}^{'}$ )	128

References

Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef] [PubMed]
Bodaghi, M.; Hosseini, M.; Gottumukkala, R. A Multimodal Intermediate Fusion Network with Manifold Learning for Stress Detection. arXiv 2024, arXiv:2403.08077. [Google Scholar]
Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1050–1059. [Google Scholar]
Bashir, A.; Bashir, S.; Rana, K.; Lambert, P.; Vernallis, A. Post-COVID-19 adaptations; the shifts towards online learning, hybrid course delivery and the implications for biosciences courses in the higher education setting. In Proceedings of the Frontiers in Education; Frontiers Media SA: Lausanne, Switzerland, 2021; Volume 6, p. 711619. [Google Scholar]
Villegas-Ch, W.; Román-Cañizares, M.; Palacios-Pacheco, X. Improvement of an online education model with the integration of machine learning and data analysis in an LMS. Appl. Sci. 2020, 10, 5371. [Google Scholar] [CrossRef]
Gaonkar, A.; Chukkapalli, Y.; Raman, P.J.; Srikanth, S.; Gurugopinath, S. A comprehensive survey on multimodal data representation and information fusion algorithms. In Proceedings of the 2021 International Conference on Intelligent Technologies (CONIT), Hubli, India, 25–27 June 2021; pp. 1–8. [Google Scholar]
Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
Wei, H.; Jafarian, A.; Zeidman, P.; Litvak, V.; Razi, A.; Hu, D.; Friston, K.J. Bayesian fusion and multimodal DCM for EEG and fMRI. arXiv 2019, arXiv:1906.07354. [Google Scholar] [CrossRef]
Tian, Y.; Ma, Y. Information-theoretic approaches in multimodal learning: A survey. IEEE Access 2019, 7, 170181–170193. [Google Scholar]
Zadeh, A.; Liang, P.P.; Morency, L.P. Foundations of multimodal co-learning. Inf. Fusion 2020, 64, 188–193. [Google Scholar] [CrossRef]
Khan, A.U.; Mazaheri, A.; da Vitoria Lobo, N.; Shah, M. MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering. arXiv 2020, arXiv:2010.14095. [Google Scholar]
Jiao, T.; Guo, C.; Feng, X.; Chen, Y.; Song, J. A Comprehensive Survey on Deep Learning Multi-Modal Fusion: Methods, Technologies and Applications. Comput. Mater. Contin. 2024, 80, 1–35. [Google Scholar] [CrossRef]
Bronstein, M.M.; Bruna, J.; LeCun, Y.; Szlam, A.; Vandergheynst, P. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Process. Mag. 2017, 34, 18–42. [Google Scholar] [CrossRef]
Rahate, A.; Walambe, R.; Ramanna, S.; Kotecha, K. Multimodal co-learning: Challenges, applications with datasets, recent advances and future directions. Inf. Fusion 2022, 81, 203–239. [Google Scholar] [CrossRef]
Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1321–1330. [Google Scholar]
Diestel, R. Graph Theory, 5th ed.; Springer: New York City, NY, USA, 2017. [Google Scholar]
Namoun, A.; Alshanqiti, A. Predicting student performance using data mining and learning analytics techniques: A systematic literature review. Appl. Sci. 2020, 11, 237. [Google Scholar] [CrossRef]
Emerson, A.; Cloude, E.B.; Azevedo, R.; Lester, J. Multimodal learning analytics for game-based learning. Br. J. Educ. Technol. 2020, 51, 1505–1526. [Google Scholar] [CrossRef]
Sharma, K.; Giannakos, M. Multimodal data capabilities for learning: What can multimodal data tell us about learning? Br. J. Educ. Technol. 2020, 51, 1450–1484. [Google Scholar] [CrossRef]
Hooda, M.; Rana, C.; Dahiya, O.; Rizwan, A.; Hossain, M.S. Artificial intelligence for assessment and feedback to enhance student success in higher education. Math. Probl. Eng. 2022, 2022, 5215722. [Google Scholar] [CrossRef]
Khan, I.; Ahmad, A.R.; Jabeur, N.; Mahdi, M.N. An artificial intelligence approach to monitor student performance and devise preventive measures. Smart Learn. Environ. 2021, 8, 1–18. [Google Scholar] [CrossRef]
Mai, S.; Hu, H.; Xing, S. Modality to modality translation: An adversarial representation learning and graph fusion network for multimodal fusion. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 164–172. [Google Scholar]
Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Washington, DC, USA, 28 June–2 July 2011; pp. 689–696. [Google Scholar]
GeeksforGeeks. Early Fusion vs. Late Fusion in Multimodal Data Processing; GeeksforGeeks: Noida, India, 2025. [Google Scholar]
Li, L.; Liu, M.; Ma, L.; Han, L. Cross-Modal feature description for remote sensing image matching. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102964. [Google Scholar] [CrossRef]
Liu, Z.; Shen, Y.; Jin, X.; Liu, Y.; Wang, Y.; Zhang, Y. Dual Low-Rank Multimodal Fusion. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. Association for Computational Linguistics, Online, 16–20 November 2020; pp. 325–334. [Google Scholar]
Zhang, W.; Li, M.; Chen, X.; Wang, L. Multimodal Graph Neural Network for Recommendation with Dynamic De-redundancy and Modality-Guided Feature De-noisy. arXiv 2024, arXiv:2411.01561. [Google Scholar]
Kwon, Y.; Won, J.K.; Paik, M.W.; Lee, S.W. Uncertainty quantification using Bayesian neural networks in classification: Application to biomedical image segmentation. Comput. Stat. Data Anal. 2018, 142, 106816. [Google Scholar] [CrossRef]
Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. Adv. Neural Inf. Process. Syst. 2017, 30, 6405–6416. [Google Scholar]
Gao, R.; Wang, L.; Zhang, H. A Comparative Study of Confidence Calibration in Deep Learning. arXiv 2022, arXiv:2206.08833. [Google Scholar]
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Han, Z.; Wu, J.; Huang, C.; Huang, Q.; Zhao, M. A review on sentiment discovery and analysis of educational big-data. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1328. [Google Scholar] [CrossRef]
Alshanqiti, A.; Namoun, A. Predicting student performance and its influential factors using hybrid regression and multi-label classification. IEEE Access 2020, 8, 203827–203844. [Google Scholar] [CrossRef]
Huang, C.; Zhou, J.; Chen, J.; Yang, J.; Clawson, K.; Peng, Y. A feature weighted support vector machine and artificial neural network algorithm for academic course performance prediction. Neural Comput. Appl. 2023, 35, 11517–11529. [Google Scholar] [CrossRef]
Kabathova, J.; Drlik, M. Towards predicting student’s dropout in university courses using different machine learning techniques. Appl. Sci. 2021, 11, 3130. [Google Scholar] [CrossRef]
López Zambrano, J.; Lara Torralbo, J.A.; Romero Morales, C. Early prediction of student learning performance through data mining: A systematic review. Psicothema 2021, 33, 456. [Google Scholar] [CrossRef] [PubMed]
Wan, H.; Liu, K.; Yu, Q.; Gao, X. Pedagogical intervention practices: Improving learning engagement based on early prediction. IEEE Trans. Learn. Technol. 2019, 12, 278–289. [Google Scholar] [CrossRef]
Lu, O.H.; Huang, A.Y.; Huang, J.C.; Lin, A.J.; Ogata, H.; Yang, S.J. Applying learning analytics for the early prediction of Students’ academic performance in blended learning. J. Educ. Technol. Soc. 2018, 21, 220–232. [Google Scholar]
Buschetto Macarini, L.A.; Cechinel, C.; Batista Machado, M.F.; Faria Culmant Ramos, V.; Munoz, R. Predicting students success in blended learning—evaluating different interactions inside learning management systems. Appl. Sci. 2019, 9, 5523. [Google Scholar] [CrossRef]
Joshi, G.; Walambe, R.; Kotecha, K. A review on explainability in multimodal deep neural nets. IEEE Access 2021, 9, 59800–59821. [Google Scholar] [CrossRef]
Hosseini, S.S.; Yamaghani, M.R.; Poorzaker Arabani, S. Multimodal modelling of human emotion using sound, image and text fusion. Signal Image Video Process. 2024, 18, 71–79. [Google Scholar] [CrossRef]
Zhang, Y.; An, R.; Liu, S.; Cui, J.; Shang, X. Predicting and understanding student learning performance using multi-source sparse attention convolutional neural networks. IEEE Trans. Big Data 2021, 9, 118–132. [Google Scholar] [CrossRef]
Wang, Z.; Wan, Z.; Wan, X. Transmodality: An end2end fusion method with transformer for multimodal sentiment analysis. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 2514–2520. [Google Scholar]
Koroteev, M.V. BERT: A review of applications in natural language processing and understanding. arXiv 2021, arXiv:2103.11943. [Google Scholar]
Shashidhar, R.; Patilkulkarni, S.; Puneeth, S. Combining audio and visual speech recognition using LSTM and deep convolutional neural network. Int. J. Inf. Technol. 2022, 14, 3425–3436. [Google Scholar] [CrossRef]

Figure 1. t-SNE visualization of the student feature space before (left column) and after (right column) applying the CLFM fusion module within GOMFuNet. Top row: colored by final course status (Pass/Fail). Bottom row: colored by final numerical score (darker = lower, lighter = higher). CLFM yields better separation and structure.

Figure 2. SHAP summary plot (beeswarm) illustrating the impact of structured features on GOMFuNet’s prediction of course success (Pass = 1) for each student sample in the test set. Features are ranked by overall importance (mean absolute SHAP value).

Figure 3. (Left): ROC curve of GOMFuNet’s classification performance (AUC = 0.92). (Right): Scatter plot of actual vs. predicted scores from GOMFuNet’s regression output.

Figure 4. Reliability diagrams comparing GOMFuNet trained with standard loss only (Baseline) and with the combined loss including LCLM.

Table 1. Distribution of students’ final exam scores in the case study dataset (N = 472).

Classification	Score Range	Number of Students
Fail	<60	56
Pass	60–69	66
Pass	70–79	110
Pass	80–89	157
Pass	90–99	75
Pass	100	8
Total		472

Table 2. Key architectural details and hyperparameters of the GOMFuNet core modules.

Component	Parameter	Value
Feature Encoders (Output Dims)	Structured ( $d_{s}^{'}$ )	32
	Text ( $d_{t}^{'}$ )	768
	Audio ( $d_{a}^{'}$ )	128
CLFM: Graph Construction	Method	kNN + Gaussian Kernel
CLFM: GCN	Layers ( $L_{g}$ )	2
	Hidden Dimension	128
	Output Dimension (per mod, $d_{m}^{''}$ )	64
	Activation	ReLU
CLFM: Transformer Aggregator	Layers	2
	Heads	4
	Model Dimension ( $\sum d_{m}^{″}$ )	192 (3 × 64)
	Feedforward Inner Dimension	256
	Dropout	0.2
	Output Dimension ( $d_{fused}$ )	192
LCLM	Weighting Factor ( $λ$ )	[Specify best lambda value]
	Temperature ( $τ_{L}$ )	1.0
Training	Optimizer	Adam
	Initial Learning Rate	0.001
	Batch Size	32
	Epochs	35 (with Early Stopping)

Table 3. Performance comparison of single-modality models and the proposed GOMFuNet on classification and regression tasks (case study dataset).

Model	Modality	Accuracy	F1-Score	ROC-AUC	MSE	MAE	R²
Decision Tree	Structured	66.05%	0.63	0.71	0.0348	0.1535	62.95%
Random Forest	Structured	69.12%	0.65	0.74	0.0331	0.1460	66.81%
Support Vector Machine	Structured	67.98%	0.64	0.73	0.0340	0.1501	64.92%
Neural Network	Structured	71.15%	0.67	0.76	0.0319	0.1405	69.33%
LSTM	Audio	60.33%	0.58	0.66	0.0405	0.1760	53.81%
CNN	Audio	62.01%	0.60	0.68	0.0391	0.1695	56.24%
LSTM + CNN	Audio	63.42%	0.61	0.69	0.0378	0.1632	58.44%
Naive Bayes	Text	73.02%	0.70	0.79	0.0308	0.1340	70.95%
BERT	Text	75.88%	0.73	0.81	0.0289	0.1265	74.89%
RNN	Text	73.91%	0.71	0.80	0.0299	0.1301	72.98%
BERT + MHA	Text	76.73%	0.74	0.83	0.0280	0.1240	76.31%
GOMFuNet (Proposed)	Multimodal	90.17%	0.88	0.92	0.0205	0.0971	88.03%

Table 4. Performance comparison of GOMFuNet with other advanced multimodal models (case study dataset).

Model	Accuracy	F1-Score	ROC-AUC	MSE	MAE	R²
ATV [41]	87.05%	0.84	0.89	0.0245	0.1055	85.88%
MsaCNN [42]	86.31%	0.83	0.88	0.0256	0.1091	84.41%
Transmodality [43]	85.19%	0.82	0.87	0.0268	0.1130	82.74%
GOMFuNet (Proposed)	90.17%	0.88	0.92	0.0205	0.0971	88.03%

Table 5. Performance comparison of GOMFuNet using different loss function configurations.

Loss Configuration (Within GOMFuNet)	Accuracy	F1-Score	ROC-AUC	MSE	MAE	R²
CLFM + Cross-Entropy Loss	87.88%	0.85	0.90	0.0229	0.1028	86.61%
CLFM + Mean Squared Error Loss	89.01%	0.86	0.91	0.0220	0.1005	87.32%
CLFM + Focal Loss [31]	89.52%	0.87	0.91	0.0214	0.0989	87.88%
CLFM + LCLM + Task Loss (Proposed)	90.17%	0.88	0.92	0.0205	0.0971	88.03%

Table 6. Modality ablation study on GOMFuNet performance.

Structured	Text	Audio	Accuracy	F1-Score	ROC-AUC	MSE	MAE	R²
✔	✔	✔	90.17%	0.88	0.92	0.0205	0.0971	88.03%
×	✔	✔	87.41%	0.85	0.89	0.0242	0.1050	85.11%
✔	×	✔	86.05%	0.83	0.87	0.0260	0.1105	83.49%
✔	✔	×	88.12%	0.86	0.90	0.0235	0.1029	86.15%

Table 7. GOMFuNet module ablation study on overall model performance.

CLFM	LCLM	Accuracy	F1-Score	ROC-AUC	MSE	MAE	R²
✔	✔	90.17%	0.88	0.92	0.0205	0.0971	88.03%
×	✔	88.03%	0.86	0.90	0.0236	0.1030	85.99%
✔	×	89.52%	0.87	0.91	0.0214	0.0989	87.88%
×	×	85.39%	0.83	0.88	0.0271	0.1118	82.15%

Table 8. Expected Calibration Error (ECE) comparison.

Model Configuration	ECE (%)
GOMFuNet (CLFM + Standard Loss)	6.15
GOMFuNet (CLFM + LCLM + Standard Loss)	2.83

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Guo, Y.; Zhong, R. GOMFuNet: A Geometric Orthogonal Multimodal Fusion Network for Enhanced Prediction Reliability. Mathematics 2025, 13, 1791. https://doi.org/10.3390/math13111791

AMA Style

Guo Y, Zhong R. GOMFuNet: A Geometric Orthogonal Multimodal Fusion Network for Enhanced Prediction Reliability. Mathematics. 2025; 13(11):1791. https://doi.org/10.3390/math13111791

Chicago/Turabian Style

Guo, Yi, and Rui Zhong. 2025. "GOMFuNet: A Geometric Orthogonal Multimodal Fusion Network for Enhanced Prediction Reliability" Mathematics 13, no. 11: 1791. https://doi.org/10.3390/math13111791

APA Style

Guo, Y., & Zhong, R. (2025). GOMFuNet: A Geometric Orthogonal Multimodal Fusion Network for Enhanced Prediction Reliability. Mathematics, 13(11), 1791. https://doi.org/10.3390/math13111791

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

GOMFuNet: A Geometric Orthogonal Multimodal Fusion Network for Enhanced Prediction Reliability

Abstract

1. Introduction

2. Related Work

2.1. Mathematical Strategies for Multimodal Fusion

2.2. Uncertainty Quantification and Reliability in Deep Learning

2.3. Student Performance Prediction as an Application Context

3. A Mathematical Framework for Reliable Geometric Multimodal Fusion

3.1. Mathematical Problem Statement: Prediction Under Heterogeneity and Uncertainty

3.2. Geometric Integration via Graph Propagation in Semantic Latent Spaces

3.2.1. Graph Embedding of Modality-Specific Manifolds

3.2.2. Spectral Graph Convolution for Structure-Aware Learning

3.2.3. Global Context Aggregation Using Self-Attention

3.3. Output Space Geometry Regularization via Orthogonality Enforcement

3.3.1. Information Geometry Perspective: Separability and Orthogonality

3.3.2. The LCLM Regularizer: Maximizing Relative Self-Confidence

3.3.3. Gradient Analysis and Optimization Dynamics

3.4. The GOMFuNet Optimization Problem and Algorithm

3.5. Theoretical Impact Analysis of LCLM on Generalization Error and Prediction Margin

3.5.1. LCLM, Output Space Geometry, and Class Margin

3.5.2. LCLM as a Regularizer and Learning Theory

3.5.3. Error Probability and Its Relation to Margin

4. Case Study: Educational Performance Prediction

4.1. Dataset and Preprocessing

4.2. Experimental Setup and Evaluation Metrics

4.3. Baselines and Comparison Methods

5. Results and Analysis

5.1. Comparison with Single-Modality Baselines

5.2. Comparison with State-of-the-Art Multimodal Baselines

5.3. Impact of LCLM (Loss Function Comparison)

5.4. Ablation Studies

5.5. Visualization of Fusion Effect

5.6. Interpretability via Feature Importance Analysis

5.7. Prediction Accuracy Analysis (ROC and Scatter Plots)

6. Discussion

Empirical Validation of Prediction Reliability via Calibration Analysis

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A. Detailed Feature Encoder Architectures

Appendix A.1. Deep Neural Network (DNN) for Structured Data

Appendix A.2. Bert with Multi-Head Attention (MHA) for Text Data

Appendix A.3. CNN and Bi-LSTM Combination for Audio Data

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI