1. Introduction
Integrating information from heterogeneous data sources, known as multimodal learning, presents fundamental mathematical and computational challenges, particularly in ensuring the reliability and reducing the uncertainty of predictive models built upon such diverse inputs [1]. Effectively fusing data with disparate structures (e.g., geometric, sequential, tabular) while mitigating inconsistencies and noise inherent in real-world observations remains an open problem [2]. Furthermore, quantifying and actively minimizing the uncertainty associated with predictions derived from complex multimodal models is crucial for trustworthy decision-making in critical applications [3]. Traditional methods for analyzing complex phenomena often rely on single data sources or simple aggregation techniques, which struggle to capture the multifaceted nature of many real-world systems, leading to predictions with limited accuracy and applicability [4]. The advent of artificial intelligence (AI) has spurred efforts to improve prediction precision [5], but fully harnessing the rich information in diverse data streams requires sophisticated mathematical frameworks.
Existing mathematical approaches to multimodal fusion often fall into the categories of early (feature-level) fusion, late (decision-level) fusion, or intermediate strategies employing techniques such as attention mechanisms or tensor factorization [1,6]. Probabilistic graphical models [7] and Bayesian methods [8] offer principled ways to model dependencies and uncertainty but can face scalability issues or restrictive distributional assumptions. Information-theoretic approaches aim to maximize mutual information or minimize redundancy across modalities [9]. While deep learning methods, particularly Transformer-based architectures [10,11], have shown promise in capturing crossmodal correlations, many standard techniques face limitations from a mathematical perspective [12].
Specifically, many existing methods struggle with several key mathematical aspects: (1) Geometric Structure Preservation: simple concatenation or standard attention often fails to explicitly preserve the intrinsic geometric structure or topological relationships within the data manifold of each modality, potentially losing valuable information [13]. (2) Handling Inconsistency: fusing potentially conflicting signals from different modalities in a mathematically robust manner remains difficult, often leading to suboptimal or unstable results [14]. (3) Explicit Reliability Optimization: most standard loss functions (e.g., cross-entropy) primarily focus on matching predictions to ground-truth labels rather than directly optimizing the mathematical properties of the output distribution itself to enhance prediction confidence and separability [15]. This lack of explicit reliability control can limit the trustworthiness of predictions, especially in the presence of noise or ambiguity. Foundational mathematical concepts from graph theory [16], geometric deep learning [13], and statistical reliability provide tools to address these gaps.
To address these limitations, this paper introduces the Geometric Orthogonal Multimodal Fusion Network (GOMFuNet), a novel mathematical framework designed for reliable multimodal data integration and prediction. GOMFuNet uniquely combines two core mathematical concepts: First, it leverages principles from geometric deep learning, employing a Graph Convolutional Network (GCN) within its Crossmodal Label Fusion Module (CLFM) to perform fusion explicitly in a high-level semantic label space. This approach mathematically models and preserves the topological relationships between samples based on their modality-specific representations, facilitating robust integration even with inconsistent signals. Second, GOMFuNet incorporates a novel Label Confidence Learning Module (LCLM) derived from optimization principles. LCLM introduces an explicit mathematical objective that enforces orthogonality among the predicted class probability vectors. By minimizing the correlation between predictions for different classes, LCLM directly enhances the separability and confidence of the model’s output, thus reducing prediction uncertainty.
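To make the orthogonality idea concrete before the formal treatment in Section 3, the following PyTorch sketch shows one plausible form of such a penalty: it treats each class's predictions over a batch as a vector and drives the Gram matrix of these vectors toward the identity. The function name and this specific formulation are illustrative assumptions, not the exact LCLM loss.

```python
import torch
import torch.nn.functional as F

def lclm_orthogonality_loss(probs: torch.Tensor) -> torch.Tensor:
    """Illustrative orthogonality penalty on class prediction vectors.

    probs: (batch, num_classes) tensor of predicted class probabilities.
    Each class's column of predictions is treated as a vector; correlation
    between columns of different classes is penalized.
    """
    # L2-normalize each class column so the Gram matrix holds cosine similarities.
    cols = F.normalize(probs, p=2, dim=0)                # (batch, C)
    gram = cols.t() @ cols                               # (C, C)
    eye = torch.eye(gram.size(0), device=gram.device)
    # Off-diagonal entries vanish iff class prediction vectors are orthogonal.
    return ((gram - eye) ** 2).sum() / (gram.numel() - gram.size(0))

# In training this term would be added to the task loss, weighted by the
# LCLM factor lambda tuned in Section 4:
#   loss = task_loss + lam * lclm_orthogonality_loss(probs)
```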
The primary mathematical contributions of this work are the following: (1) the design and formalization of the GOMFuNet architecture, integrating geometric fusion in label space with output orthogonality optimization; (2) the derivation and mathematical justification of the LCLM loss function as a novel mechanism for explicit uncertainty reduction in multimodal classification/regression; and (3) rigorous empirical validation demonstrating the effectiveness of GOMFuNet in enhancing prediction accuracy and reliability compared to existing methods.
To demonstrate the practical efficacy of the proposed GOMFuNet framework in tackling complex, real-world problems involving heterogeneous data and the need for reliable predictions, we conduct a case study in the domain of educational performance prediction [17]. Using a multimodal dataset comprising structured, textual, and audio data from student interactions [18,19], we validate GOMFuNet's ability to outperform baseline and conventional fusion techniques. This case study, involving factors influencing student success [20,21], serves to illustrate the tangible benefits of our mathematically grounded approach in a challenging application context.
The remainder of this paper is organized as follows: Section 2 reviews related work from the perspectives of mathematical fusion strategies and uncertainty handling; Section 3 provides the detailed mathematical formulation and algorithmic description of the proposed GOMFuNet; Section 4 describes the experimental setup for the case study; Section 5 presents the empirical results of the case study, including comparisons and ablation studies; Section 6 discusses the findings, implications, and limitations; and Section 7 concludes the paper.
4. Case Study: Educational Performance Prediction
To evaluate the practical effectiveness and demonstrate the applicability of the proposed GOMFuNet framework, we conduct a case study in the domain of educational performance prediction. This domain provides a challenging real-world scenario involving heterogeneous multimodal data where prediction reliability is valuable.
4.1. Dataset and Preprocessing
The data for this case study were sourced from two cohorts of students (472 in total: 243 in the first cohort and 229 in the second) enrolled in an English course over two academic years. The dataset includes classroom performance records derived from video analysis, midterm online exam scores (including written scores, text essays, and audio recordings), and final exam scores. The objective is to predict students' final English grades, framed as both a binary classification task (Pass/Fail, based on a score threshold of 60) and a regression task (predicting the specific percentage score). Ethical considerations were addressed: all participating students were informed of and consented to the use of their anonymized learning data for research analysis.
Classroom performance data were extracted from video recordings using a combination of automated scripts (OpenCV) and manual annotation for metrics such as attendance, seating position (row/column, 0 if absent), proportion of time looking at the instructor, estimated note-taking time proportion, and participation count (hand-raising/speaking instances). Midterm data included structured written test scores (0–100), one-minute oral recordings (WAV format), and written assignments (TXT format). Final exam data, comprising pass/fail status (0/1) and specific scores (0–100), served as the prediction targets.
Preprocessing involved anonymization, standardization, and cleaning. Structured numerical features were normalized to the $[0, 1]$ range using Min–Max scaling. Audio (WAV) data underwent noise reduction (spectral subtraction) and segmentation using Voice Activity Detection (WebRTC VAD). Text (TXT) data were cleaned, tokenized using the WordPiece tokenizer ('bert-base-uncased'), formatted with special tokens ([CLS], [SEP], [PAD]), and converted to fixed-length sequences with attention masks. Missing values were handled appropriately (e.g., records were removed when critical fields were missing). The distribution of final exam scores in the combined dataset (N = 472) is shown in Table 1.
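As a brief illustration of the structured and text preprocessing steps described above (the actual pipeline scripts are not reproduced here; the data values, feature columns, and sequence length below are placeholders), one might proceed as follows with scikit-learn and Hugging Face Transformers.

```python
from sklearn.preprocessing import MinMaxScaler
from transformers import BertTokenizer

# Structured features (e.g., attendance, participation count, gaze proportion):
# Min-Max scaling to [0, 1]. Toy rows with hypothetical columns.
X_raw = [[12, 3, 0.45], [8, 1, 0.30], [15, 5, 0.80]]
X_structured = MinMaxScaler().fit_transform(X_raw)

# Text essays: WordPiece tokenization with special tokens, fixed-length
# sequences, and attention masks.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer(
    ["First essay text ...", "Second essay text ..."],
    padding="max_length",   # pads with [PAD]; [CLS]/[SEP] are added automatically
    truncation=True,
    max_length=256,         # assumed fixed sequence length
    return_tensors="pt",
)
input_ids, attention_mask = encoded["input_ids"], encoded["attention_mask"]
```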
4.2. Experimental Setup and Evaluation Metrics
Experiments were conducted using PyTorch 1.10.0 on a system with an NVIDIA Tesla V100 GPU. The dataset (N = 472) was split into training (80%, N = 378) and testing (20%, N = 94) sets, stratified by pass/fail status. Five-fold cross-validation on the training set was used for hyperparameter tuning (including the LCLM weight $\lambda$, the number of GCN layers, Transformer parameters, the learning rate, and the batch size). The final reported results are averaged over five independent runs with different random seeds on the held-out test set, after training the model on the full 80% training data using the best hyperparameters found. Key hyperparameters for the GOMFuNet core are detailed in Table 2, while encoder details are given in Appendix A.
The Adam optimizer was used with an initial learning rate of 0.001, standard betas, and $\epsilon = 1 \times 10^{-8}$. A 'ReduceLROnPlateau' learning rate scheduler (patience = 5, factor = 0.1) and early stopping (patience = 10, based on validation loss) were employed. The batch size was 32. The LCLM weighting factor $\lambda$ was tuned via cross-validation (tested values, e.g., 0.01, 0.1, 1.0) and set to [Specify best lambda value here].
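A minimal PyTorch sketch of this optimization setup is given below; `model`, the data loaders, and the `train_one_epoch`/`evaluate` helpers are hypothetical placeholders, but the optimizer, scheduler, and early-stopping settings mirror those stated above.

```python
import torch

# Assumed placeholders: `model`, `train_loader`, `val_loader`,
# `train_one_epoch`, and `evaluate` are hypothetical helpers.
max_epochs = 100

optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5)

best_val, patience_left = float("inf"), 10  # early stopping on validation loss
for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)
    scheduler.step(val_loss)  # reduce LR when validation loss plateaus
    if val_loss < best_val:
        best_val, patience_left = val_loss, 10
        torch.save(model.state_dict(), "best_model.pt")  # keep best checkpoint
    else:
        patience_left -= 1
        if patience_left == 0:
            break  # stop after 10 epochs without improvement
```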
Performance was evaluated using standard metrics. For the classification task (Pass/Fail), we report Accuracy, F1-score, and Area Under the ROC Curve (ROC-AUC); for the regression task (predicting the exact score), we report Mean Squared Error (MSE), Mean Absolute Error (MAE), and the Coefficient of Determination (R-squared, $R^2$). The metrics are defined as follows: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$; $\text{F1} = \frac{2TP}{2TP + FP + FN}$; ROC-AUC is the area under the Receiver Operating Characteristic curve; $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$; $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$; $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$. Here, TP, TN, FP, and FN denote True Positives, True Negatives, False Positives, and False Negatives; $n$ is the number of test samples; $y_i$ is the true value; $\hat{y}_i$ is the predicted value; and $\bar{y}$ is the mean of the true values.
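These metrics match their standard library implementations; a minimal scikit-learn sketch follows, with toy arrays standing in for real test-set outputs.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             mean_squared_error, mean_absolute_error, r2_score)

# Toy outputs for illustration only.
y_true = np.array([1, 0, 1, 1, 0])              # pass/fail labels
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.3])    # predicted P(pass)
y_pred = (y_prob >= 0.5).astype(int)            # thresholded class labels

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)             # uses raw probabilities

y_reg_true = np.array([72.0, 45.0, 88.0, 61.0, 55.0])  # true percentage scores
y_reg_pred = np.array([70.0, 50.0, 85.0, 58.0, 57.0])
mse = mean_squared_error(y_reg_true, y_reg_pred)
mae = mean_absolute_error(y_reg_true, y_reg_pred)
r2 = r2_score(y_reg_true, y_reg_pred)
```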
4.3. Baselines and Comparison Methods
To assess the effectiveness of GOMFuNet, we compare it against two sets of baseline methods on the case study data:
Single-Modality Models: Models trained on each data modality (structured, text, audio) independently using appropriate machine learning algorithms (Decision Tree, Random Forest, SVM, NN for structured; LSTM, CNN, LSTM + CNN for audio; Naive Bayes, RNN, BERT, BERT + MHA for text). This establishes the performance level achievable without multimodal fusion.
State-of-the-Art Multimodal Models: We compare GOMFuNet against several existing advanced multimodal fusion techniques reported in the literature, implemented on our dataset: ATV [41], MsaCNN [42], and Transmodality [43]. This comparison highlights GOMFuNet's performance relative to other fusion paradigms.
Additionally, we conduct ablation studies by removing key components of GOMFuNet (CLFM, LCLM) or individual modalities to understand their specific contributions.
6. Discussion
This paper introduced GOMFuNet, a novel mathematical framework for multimodal data fusion designed to enhance prediction reliability. The core innovations lie in the synergistic combination of geometric deep learning for fusion in the label space (CLFM) and an explicit optimization objective based on output vector orthogonality for confidence enhancement (LCLM). The empirical results from the educational performance prediction case study provide strong validation for the proposed mathematical concepts. GOMFuNet significantly outperformed both single-modality baselines and state-of-the-art multimodal fusion methods, demonstrating the practical benefits of its design. The ablation studies confirmed the necessity of both the CLFM’s geometric fusion approach and the LCLM’s reliability optimization for achieving peak performance. Furthermore, the analysis of confidence calibration and robustness to noise lends empirical support to the mathematical claims that GOMFuNet, particularly through LCLM, improves prediction reliability and stability beyond what is typically achieved with standard loss functions. The visualization of the fused feature space also qualitatively supports the effectiveness of CLFM’s fusion strategy.
From a mathematical perspective, GOMFuNet offers several advantages. The CLFM’s use of GCNs in the label space allows explicit modeling of inter-sample relationships at a high semantic level, potentially offering better robustness to low-level feature noise and inconsistencies compared to feature-level fusion. The LCLM introduces a novel way to regularize the output space directly, promoting class separability and confidence through a geometrically motivated orthogonality constraint. This explicit focus on the output structure complements traditional loss functions that focus on point-wise prediction accuracy. The synergy between preserving input structure (via GCN) and structuring the output space (via LCLM) appears crucial to GOMFuNet’s success.
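As a rough illustration of label-space propagation (not the exact CLFM specification, which appears in Section 3), the sketch below builds a cosine-similarity kNN graph over the samples' label-space representations and applies a single symmetrically normalized GCN step; the function name, the choice of cosine similarity, and the single-step design are assumptions.

```python
import torch
import torch.nn.functional as F

def label_space_gcn_step(H: torch.Tensor, W: torch.Tensor, k: int = 5) -> torch.Tensor:
    """One GCN propagation step over a kNN graph built in label space.

    H: (n_samples, C) label-space representations (e.g., stacked
       per-modality logits); W: (C, C_out) learnable weight matrix.
    """
    # kNN adjacency from cosine similarity between samples; since each sample
    # is maximally similar to itself, self-loops are included automatically.
    Hn = F.normalize(H, dim=1)
    S = Hn @ Hn.t()                                      # (n, n) similarities
    topk = S.topk(k + 1, dim=1).indices                  # k neighbors + self
    A = torch.zeros_like(S).scatter_(1, topk, 1.0)
    A = ((A + A.t()) > 0).float()                        # symmetrize the graph
    # Symmetric normalization: A_hat = D^{-1/2} A D^{-1/2}.
    deg = A.sum(dim=1)
    D_inv_sqrt = torch.diag(deg.clamp(min=1e-8).pow(-0.5))
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt
    return torch.relu(A_hat @ H @ W)                     # propagate, transform
```

In a full CLFM, several such propagation steps could be composed over the fused label-space representations; the hyperparameter $k$ here corresponds to the kNN graph-construction parameter discussed below.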
However, the proposed framework also has limitations. Mathematically, we have provided justifications and empirical validations, but formal theoretical guarantees on error probability bounds or convergence properties of the LCLM loss remain subjects for future investigation. The LCLM loss, like many deep learning objectives, is likely non-convex, and its optimization dynamics warrant further study. The specific graph construction method involves hyperparameters (like k in kNN) that might require tuning. Computationally, GOMFuNet involves multiple components (encoders, GCNs, Transformer, LCLM calculation), potentially leading to higher computational cost compared to simpler models, although this was manageable for our case study dataset size. From an application perspective, the generalizability of GOMFuNet’s performance benefits needs to be validated across diverse multimodal datasets from different domains beyond education. The effectiveness might depend on the nature of the modalities and the underlying relationships between them.
The implications of this work extend to the broader fields of multimodal learning and trustworthy AI. It highlights the potential of integrating principles from geometric deep learning with novel optimization objectives that directly target prediction reliability. GOMFuNet provides a concrete example of how moving beyond standard accuracy-focused training can lead to more robust and confidence-aware models, which is critical for high-stakes applications.
Future work should focus on several directions. Developing theoretical analyses of LCLM's properties (e.g., convergence, relationship to generalization bounds, impact on calibration) would strengthen its mathematical foundations. Exploring alternative geometric structures or propagation mechanisms within CLFM could yield further improvements. Investigating adaptive methods for setting the LCLM weight $\lambda$, or extending the LCLM concept to other tasks such as structured prediction or unsupervised learning, presents interesting avenues. Applying and adapting GOMFuNet to other challenging multimodal problems (e.g., medical diagnosis, affective computing, robotics) will be crucial for assessing its broader utility and identifying necessary modifications. While the educational case study demonstrated GOMFuNet's potential, further refinement for specific educational interventions, perhaps by incorporating domain knowledge or enhancing interpretability (e.g., using techniques like SHAP, as explored in preliminary analyses and potentially detailed in Appendix A), remains a secondary but valuable direction.
Empirical Validation of Prediction Reliability via Calibration Analysis
A critical aspect of trustworthy prediction systems lies in the statistical consistency between the model's expressed confidence and its empirical accuracy. To rigorously evaluate the mathematical property of enhanced reliability fostered by the LCLM component, we analyze the confidence calibration of GOMFuNet's probabilistic outputs. A model $f$ is considered perfectly calibrated if, for any predicted probability value $p \in [0, 1]$, the conditional expectation of the true label $Y$ given the predicted probability equals $p$, i.e., $\mathbb{E}[Y \mid f(X) = p] = p$. Deviations from this ideal condition indicate miscalibration, potentially leading to unreliable decision-making based on the model's confidence estimates.
We quantify calibration performance using the Expected Calibration Error (ECE), a standard metric defined as the weighted average of the absolute difference between accuracy and confidence across $K$ pre-defined probability bins $B_1, \dots, B_K$:
$$\text{ECE} = \sum_{k=1}^{K} \frac{|B_k|}{N} \left| \text{acc}(B_k) - \text{conf}(B_k) \right|,$$
where $N$ is the total number of samples, $|B_k|$ is the number of samples whose predicted confidence falls into bin $B_k$, $\text{acc}(B_k)$ is the empirical accuracy of samples in bin $B_k$, and $\text{conf}(B_k)$ is the average predicted confidence for samples in bin $B_k$. A lower ECE signifies superior calibration. Additionally, we utilize Reliability Diagrams, which visually plot $\text{acc}(B_k)$ against $\text{conf}(B_k)$ for each bin $k$; perfect calibration corresponds to points lying on the diagonal line $\text{acc}(B_k) = \text{conf}(B_k)$.
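For reference, a minimal NumPy implementation of the ECE defined above is sketched below, assuming equal-width bins over $(0, 1]$; the helper name and bin convention are ours, not the paper's.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |accuracy - confidence| over equal-width bins.

    confidences: (N,) predicted confidence of the chosen class, in [0, 1].
    correct:     (N,) 1 if the prediction was correct, else 0.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)  # samples in bin B_k
        if mask.any():
            acc = correct[mask].mean()            # acc(B_k)
            conf = confidences[mask].mean()       # conf(B_k)
            ece += mask.mean() * abs(acc - conf)  # weight by |B_k| / N
    return ece
```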
Figure 4 presents the reliability diagrams comparing the GOMFuNet architecture trained with the standard task loss ($\mathcal{L}_{\text{task}}$) alone versus the proposed configuration utilizing the combined loss ($\mathcal{L}_{\text{task}} + \lambda\,\mathcal{L}_{\text{LCLM}}$). The corresponding ECE values, computed using the $K$ probability bins defined above, are reported in Table 8. A marked improvement in calibration is observed when incorporating the LCLM objective: GOMFuNet trained with LCLM achieves a significantly lower ECE of 2.83%, compared to an ECE of 6.15% for the baseline model trained using only the standard cross-entropy loss. Furthermore, the reliability diagram associated with the LCLM-enhanced GOMFuNet exhibits points clustering more closely around the diagonal of perfect calibration, indicating a stronger alignment between predicted confidence levels and empirical probabilities of correctness across the spectrum of confidence values.
This empirical evidence provides strong quantitative support for the hypothesis that the LCLM loss, by enforcing orthogonality among class prediction vectors, directly contributes to producing mathematically more meaningful and statistically reliable confidence estimates. The substantial reduction in ECE validates LCLM’s effectiveness in optimizing the reliability structure of the predictions, moving beyond mere accuracy improvements towards generating more trustworthy probabilistic outputs.