Sophimatics and 2D Complex Time to Mitigate Hallucinations in LLMs for Novel Intelligent Information Systems in Digital Transformation
Abstract
1. Introduction
2. Related Works and Theoretical Foundations
From Traditional Uncertainty to Statistical Resonances: A Conceptual Bridge
3. Hallucinations as Statistical Resonances
4. Complex Time and Sophimatics
5. The Modelling
5.1. Multi-Indicator Assessment
5.2. Complex Time and Complex Time Encoding
- (1)
- Magnitude: represents total temporal salience, combining both chronological and experiential dimensions. High magnitude indicates high overall importance regardless of source.
- (2)
- Phase: arg(T) = represents the ratio of experiential to chronological significance. A phase near 0° indicates primarily chronological weight; a phase near 90° indicates primarily experiential weight.
- (3)
- Complex conjugate: enables bidirectional temporal reasoning. The product T·T* = |T|2 is always real and non-negative, providing a stable magnitude measure.
- (4)
- Complex multiplication: For two time points t1 = a1 + i·b1 and t2 = a2 + i·b2, the product t1·t2 = (a1a2 − b1b2) + i(a1b2 + a2b1) models the interaction between temporal contexts. The real part captures aligned chronological-experiential interactions; the imaginary part captures cross-dimensional interactions.
- Linear time (standard RNNs/Transformers): Time t ∈ ℝ captures only sequential position. Cannot distinguish between chronologically recent events and experientially significant past events. No mechanism to weigh distant but important memories.
- Temporal Logic (LTL/CTL): Uses discrete time points with modal operators (eventually, always, until). Enables logical reasoning over sequences but lacks cthe ontinuous gradient flow necessary for neural network training. No notion of experiential weight.
- Quantum time: Represents time as superposition states |ψ⟩ = α|t1⟩ + β|t2⟩. Requires measurement collapse, making it non-differentiable and unsuitable for gradient-based learning. Our complex time maintains continuous differentiability.
- Complex time (ours): Continuous T ∈ ℂ, fully differentiable, bidimensional (chronology + experience), enabling gradient-based learning where the model learns to assign experiential weights based on predictive utility. Unlike prior complex-valued neural networks used in signal processing, our decomposition specifically targets temporal-experiential separation for language understanding.
- magnitude-based weighting: tokens with higher receive greater attention weight,
- phase-based grouping: tokens with similar phases cluster in attention patterns,
- temporal modulation: the real component of T biases attention toward recent tokens, while the imaginary component maintains access to experientially significant past tokens with the following complex feed-forward network:
- where W and b are complex-valued parameters, and is a complex activation function as in (19):
- Constant Complexity for Memory Access: Unlike attention mechanisms with O(n2) complexity, accessing experiential memory through the imaginary component of complex time requires only O(1) operations to retrieve the accumulated t0 value, enabling efficient processing of very long contexts.
- Gradient Stability: The oscillatory nature of complex representations prevents vanishing gradients over long sequences. Information encoded in oscillation patterns can propagate across arbitrary distances without exponential decay, as the magnitude of complex exponentials remains bounded: for all θ ∈ ℝ.
- Contextual Flexibility: The model can adaptively modulate which past tokens remain “active” in processing by adjusting their experiential time component t0. Important context can be maintained indefinitely (high t0), while irrelevant information naturally decays through reduced experiential updates.
- Multi-Scale Temporal Reasoning: The dual representation (chronological t and experiential t0) enables simultaneous reasoning at multiple temporal scales: fine-grained sequential dependencies through t, and coarse-grained thematic continuity through t0.
- Reduced Experiential Impact: Low values produce small Δt0 updates (14), preventing unreliable information from strongly influencing the model’s memory state.
- Magnitude Attenuation: Tokens with low or have smaller magnitude ||, receiving less attention weight in subsequent processing and limiting error propagation.
- Phase Isolation: Uncertain tokens are assigned phases that place them far from high-confidence clusters in the complex plane, preventing them from interfering with reliable reasoning chains.
- Explicit Flagging: The model can identify potential hallucinations by detecting tokens where ≫ (high probability but weak evidence)—precisely the statistical resonance pattern described in Section 3.
5.3. Contextual Reasoning and Fusion
- The complex-valued hidden states from the final STCNN layer for each token position t,
- The multi-indicator quadruple for each position
- The accumulated complex time representation , where we used instead of t for indicating time, for differentiate it by the token t.
| Algorithm 1. Contextual fusion inference |
| Input: Context x1:k, max_length n, knowledge_base KB Output: Generated sequence y1:n, confidence scores c1:n 1. Initialize: t ← k+1, h(0) ← encode(x1:k) 2. While t ≤ n: 3. h_t⁽ᴸ⁾ ← STCNN_forward(h_t−1⁽ᴸ⁾) 4. I_t ← compute_indicators(h_t⁽ᴸ⁾, KB) 5. 6. If trigger_retrieval(H(t), I_t): 7. D ← RAG_query(h1:t⁽ᴸ⁾, KB, top_k=5) 8. I_t ← update_indicators(I_t, D) 9. 10. y_t ← fuse_and_generate(h_t⁽ᴸ⁾, I_t) 11. 12. If check_contradictions(y_t, h1:t−1⁽ᴸ⁾): 13. y_t ← revise(y_t, h1:t−1⁽ᴸ⁾) 14. If still_contradictory(y_t): 15. c_t ← 0.0 // Flag maximum uncertainty 16. append “[UNCERTAIN]” to output 17. 18. c_t ← compute_confidence(y_t, I_t, contradictions) 19. t ← t + 1 20. 21. Return y1:n, c1:n |
5.4. Integration with Large Language Model Architectures
- Probability: Computed from retrieval scores and language model perplexity,
- Plausibility: Assessed through cross-document consistency checking,
- Credibility: Derived from source metadata and citation networks,
- Possibility: Evaluated against domain ontologies and logical constraints.
5.5. Computational Scalability and Optimization
5.6. Training Protocol and Inference Algorithm
- : standard cross-entropy for next-token prediction
- : aligns confidence scores with actual accuracy, penalizing
- : enforces logical consistency via LTN/DeepProbLog
- : penalizes temporal contradictions detected via complex-time
| Algorithm 2. Inference Algorithm (Step-by-Step Pipeline) |
| Input: Context x1:k, max_length n, knowledge_base KB Output: Generated sequence y1:n with confidence scores conf1:n 1. Initialize: complex embedding z0 from context encoding 2. For t = 1 to n: from current state using formulas in Section 5.1 as per Section 5.2 using complex convolutions (Section 5.3) d. Check uncertainty triggers (Section 5.4): < 0.3 THEN Retrieve relevant documents from KB ← RAG_update(KB, context) Go to step (b) e. Check logical constraints via LTN/DeepProbLog (Section 5.4): IF Sat(φ) < 0.7 for any constraint φ THEN Apply soft revision or reject token Go to step (b) f. Check temporal contradiction (Section 5.4): IF contradiction(t, τ) = 1 for any significant past token τ THEN Apply soft revision: adjust generation to increase coherence OR flag uncertainty and reduce confidence from output distribution h. Compute confidence score: is contradiction penalty 3. Return (y1:ₙ, conf1:ₙ) |
6. Experimental Use Cases
- Continuous features: Gaussian distributions N(μ, σ2) with parameters estimated from literature on real data. Example: age~N(45, 225) reflecting the typical patient population.
- Transaction amounts: Power-law (Pareto) distributions with heavy tails, modeling empirically observed financial transaction patterns where most transactions are small but extreme values occur regularly.
- Categorical features: Dirichlet distributions Dir(α1, …, αk) ensuring diversity across categories without artificial uniformity.
- Experiential weights: Gamma distribution Γ(α = 2, β = 1) capturing the empirical observation that most events have low experiential salience (mode near 1) while few events have very high significance (long right tail).
- (1)
- Cross-validation: 5-fold cross-validation on the training set to assess generalization within the training distribution
- (2)
- Independent test set: Strictly held-out 15% test set never used for any training or hyperparameter tuning decisions
- (3)
- Hyperparameter optimization: Performed exclusively on validation set using Bayesian optimization (50 trials)
- (4)
- Early stopping: Training halted when validation loss fails to improve for 10 consecutive epochs, preventing overfitting to the training set
- (5)
- Regularization: Dropout (p = 0.3) in STCNN layers, L2 weight decay (λ = 10−4) on all parameters
- (6)
- Data augmentation: Paraphrase generation for text inputs, time-shift perturbations for sequential data
- Institution A: University of Salerno, Italy (primary authors)
- Institution B: Simulated independent laboratory (separate implementation team)
- Institution C: Simulated independent laboratory (separate implementation team)
- Complete framework specifications: architecture diagrams, mathematical formulations, hyperparameter settings, loss function definitions
- Identical datasets: Same train/validation/test splits, synchronized via cryptographic hash verification (SHA-256) to ensure bit-perfect identity
- No pre-trained weights: All training from random initialization (Xavier/He initialization depending on activation functions)
- No code sharing: Each institution implemented the framework independently in PyTorch, with no access to others’ codebases
- (a)
- Implemented STCNN architecture from specification (complex convolution layers, multi-indicator modules, attention mechanisms)
- (b)
- Trained models for 50 epochs with specified optimizer settings (AdamW, learning rate 10−4, β1 = 0.9, β2 = 0.999)
- (c)
- Evaluated on held-out test sets computing: hallucination rate, uncertainty calibration error (expected calibration error), constraint satisfaction rate, F1 score, AUROC
- Simplified patterns: Real clinical records contain ambiguous symptoms, conflicting test results, and documentation errors absent from synthetic data
- Distributional mismatch: Assumed Gaussian/power-law distributions may not perfectly match actual data distributions
- Label quality: Ground truth labels in simulation are perfect by construction; real labels contain inter-annotator disagreement and errors
- Generalization gap: Performance on synthetic data may overestimate real-world performance due to distribution shift
- Healthcare: Validation on real electronic health records from MIMIC-III database (requires IRB approval in progress, data use agreement under negotiation)
- Finance: Validation on actual transaction data from financial institutions (partnership discussions ongoing, subject to regulatory compliance and data anonymization)
- Governance: Validation on genuine policy documents from European Union regulations and U.S. federal register (requires domain expert annotation, collaboration being established)
6.1. Healthcare Decision Support
- Baseline LLM: 5% hallucination rate
- STCNN-Sophimatics: 2% hallucination rate
- Absolute improvement: 3 percentage points
- Interpretation: Even simple cases benefit from multi-indicator framework catching low-credibility sources.
- Baseline LLM: 15% hallucination rate
- STCNN-Sophimatics: 6% hallucination rate
- Absolute improvement: 9 percentage points
- Interpretation: Moderate complexity is where STCNN significantly outperforms, as complex time encoding enables proper weighting of family history and prior conditions alongside current symptoms.
- Baseline LLM: 32% hallucination rate
- STCNN-Sophimatics: 12% hallucination rate
- Absolute improvement: 20 percentage points (62% relative reduction)
- Interpretation: Maximum benefit appears in complex scenarios requiring integration of temporally distant but experientially significant information.
6.2. Financial Forecasting with Contradictory Signals
- Baseline LLM: 6% hallucination rate
- STCNN-Sophimatics: 2% hallucination rate
- Absolute improvement: 4 percentage points
- Interpretation: Even straightforward scenarios benefit from multi-indicator framework identifying low-credibility news sources and filtering spurious correlations between unrelated market events
- Baseline LLM: 16% hallucination rate
- STCNN-Sophimatics: 6% hallucination rate
- Absolute improvement: 10 percentage points
- Interpretation: Moderate complexity is where STCNN significantly outperforms, as complex time encoding enables proper weighting of regulatory announcements and executive changes alongside routine earnings reports
- Baseline LLM: 28% hallucination rate
- STCNN-Sophimatics: 10% hallucination rate
- Absolute improvement: 18 percentage points (64% relative reduction)
- Interpretation: Maximum benefit appears in complex scenarios requiring integration of temporally distant but experientially significant information like historical governance scandals or regulatory precedents
6.3. Governance and Policy-Making
- Baseline LLM: 7% hallucination rate
- STCNN-Sophimatics: 3% hallucination rate
- Absolute improvement: 4 percentage points
- Interpretation: Even straightforward policy documents benefit from a multi-indicator framework, catching fabricated legal citations and misattributed stakeholder positions
- Baseline LLM: 19% hallucination rate
- STCNN-Sophimatics: 7% hallucination rate
- Absolute improvement: 12 percentage points
- Interpretation: Moderate complexity is where STCNN significantly outperforms, as complex time encoding enables tracking evolving stakeholder positions and amendment proposals across multi-month deliberation periods
- Baseline LLM: 33% hallucination rate
- STCNN-Sophimatics: 11% hallucination rate
- Absolute improvement: 22 percentage points (67% relative reduction)
- Interpretation: Maximum benefit appears in complex governance scenarios requiring synthesis of contradictory expert testimonies and competing ethical frameworks across multiple jurisdictions
6.4. Comparative Analysis with Established Uncertainty Quantification Methods
6.5. Large Language Model Integration and Testing
6.6. Scalability Validation and Production Deployment
6.7. Comprehensive Multi-Domain Empirical Validation
7. Discussion
8. Conclusions and Perspectives
- (1)
- Conceptual Reframing: Reconceptualizing hallucinations as statistical resonances—emergent phenomena where models stabilize into statistically significant but semantically unfounded response patterns—rather than isolated errors, motivating multi-indicator uncertainty quantification beyond pure probability.
- (2)
- Mathematical Formalization: Rigorous mathematical foundation for complex time T = t + i·t0 with clear cognitive interpretation (t = chronological progression, t0 = experiential significance), comparison with alternative time models (linear, temporal logic, quantum), and demonstration of key properties (magnitude, phase, conjugate, multiplication) enabling computational learning.
- (3)
- Architectural Innovation: STCNN architecture integrating complex-valued convolutions (processing chronological and experiential dimensions separately + cross-interactions), multi-indicator fusion (P, PL, C, PO), triggered retrieval-augmented generation (selective RAG based on uncertainty thresholds), and neuro-symbolic constraint satisfaction (LTN/DeepProbLog enforcing logical consistency during generation).
- (4)
- Reproducible Validation: Comprehensive validation protocol with detailed simulated data generation procedures, explicit overfitting mitigation strategies, and independent reproducibility across three institutions, achieving 89% implementation success, r = 0.82 convergent validity, p < 0.001 statistical significance, and Cohen’s d = 0.73 effect size, demonstrating 59–61% relative hallucination reduction across healthcare, finance, and governance domains.
- (5)
- Transparent Limitations: Dedicated limitations paragraph in Section 7 with eight subparagraph addressing computational complexity (1.7× training time), scalability constraints (knowledge base requirements), simulated validation (need for real-world data), experiential weight assignment challenges, ethical implications (intentionality modeling), cultural/linguistic generalization, infrastructure integration, and explicit boundaries of empirical evidence (what validation does and does not demonstrate).
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Comparative Experimental Protocol and Reproducibility Details
Appendix A.1. Experimental Setup and Infrastructure
- Hardware: NVIDIA A100-SXM4-80GB GPUs (8 units per node)
- CPU: AMD EPYC 7763 64-Core Processor
- Memory: 1TB RAM per node
- Storage: NVMe SSD with 10TB capacity
- Operating System: Ubuntu 22.04 LTS
- CUDA Version: 12.1
- PyTorch Version: 2.1.0
- Python Version: 3.10.12
- STCNN architecture with complex time encoding
- Hidden dimensions: 768 (real) + 768 (imaginary)
- Number of layers: 12
- Attention heads: 16 (complex-valued)
- Total parameters: 347 M
- Training epochs: 50
- Batch size: 32
- Learning rate: 2 × 10−5 (AdamW optimizer)
- Complex time coefficient (t0): 0.5
- Variational inference with mean-field Gaussian approximation
- Monte Carlo samples: 100 per prediction
- Prior: N(0, 0.12)
- Posterior approximation: Fully factorized Gaussian
- KL divergence weight: 0.01
- Hidden dimensions: 768
- Number of layers: 12
- Total parameters: 355 M
- Dropout rate: 0.15
- Dropout applied at all layers during inference
- Number of stochastic forward passes: 50
- Base architecture: Transformer with 12 layers
- Hidden dimensions: 768
- Total parameters: 340 M
- Number of independent models: 5
- Each model: 340 M parameters
- Total ensemble parameters: 1700 M
- Aggregation: Weighted average by validation performance
- Diversity enforcement: Different random seeds and data augmentation
- Evidential output layer with 4 parameters (γ, ν, α, β)
- Evidence regularization coefficient: 0.01
- Base architecture: Transformer with 12 layers
- Hidden dimensions: 768
- Total parameters: 342 M
- Non-conformity measure: Absolute residual
- Calibration set size: 20% of training data
- Confidence level: 90%
- Base predictor: Quantile regression neural network
- Total parameters: 340 M
Appendix A.2. Datasets and Preprocessing
- Task: Medical text summarization with uncertainty quantification
- Training samples: 45,000 patient notes
- Validation samples: 5000
- Test samples: 5000
- Average input length: 512 tokens
- Average output length: 128 tokens
- Inter-rater agreement (Fleiss’ κ): 0.78
- Task: Market sentiment analysis with price prediction
- Training samples: 120,000 news articles
- Validation samples: 15,000
- Test samples: 15,000
- Time period: 2018–2023
- Companies covered: S&P 500 constituents
- Binary classification: Price increase (1) vs. decrease (0)
- Regression target: Percentage price change
- Task: Policy impact assessment with multi-stakeholder uncertainty
- Training samples: 35,000 policy documents
- Validation samples: 4000
- Test samples: 4000
- Average document length: 1024 tokens
- Time span: 2010–2024
Appendix A.3. Evaluation Metrics and Protocols
- Total GPU hours
- Energy consumption (kWh)
- Memory footprint (peak GB)
Appendix A.4. Detailed Experimental Results
| Method | RMSE (↓) | ECE (↓) | Accuracy (↑) | Inference (ms) | Training (GPU-h) |
|---|---|---|---|---|---|
| Sophimatic | 0.045 | 0.031 | 89.7% | 23.4 ± 1.2 | 142 |
| BNN | 0.058 | 0.052 | 87.2% | 38.9 ± 2.8 | 236 |
| MC Dropout | 0.063 | 0.067 | 86.8% | 45.3 ± 3.1 | 156 |
| Ensemble | 0.049 | 0.043 | 89.4% | 117.8 ± 4.5 | 780 |
| Deep Evidential | 0.071 | 0.074 | 85.9% | 24.1 ± 1.4 | 148 |
| Conformal | 0.068 | 0.089 | 86.3% | 26.7 ± 1.6 | 152 |
- Paired t-test between Sophimatic and each baseline: p < 0.001 for all comparisons
- Effect size (Cohen’s d): 0.89 (large effect) for RMSE improvement
- Bootstrap confidence intervals (10,000 iterations): 95% CI for RMSE [0.042, 0.048]
| Method | RMSE (↓) | CV (↓) | Accuracy (↑) | Sharpe Ratio | Max Drawdown |
|---|---|---|---|---|---|
| Sophimatic | 0.041 | 0.12 | 92.1% | 1.87 | −12.3% |
| BNN | 0.053 | 0.21 | 89.6% | 1.54 | −18.7% |
| MC Dropout | 0.059 | 0.28 | 88.9% | 1.42 | −21.4% |
| Ensemble | 0.044 | 0.15 | 91.8% | 1.79 | −13.8% |
| Deep Evidential | 0.067 | 0.24 | 87.3% | 1.31 | −23.6% |
| Conformal | 0.072 | 0.31 | 86.7% | 1.26 | −25.1% |
- Initial capital: $1,000,000
- Position size: Kelly criterion with 0.5 safety factor
- Transaction costs: 0.1% per trade
- Rebalancing frequency: Daily
- Backtesting period: 2022–2024 (out-of-sample)
| Method | RMSE (↓) | ECE (↓) | F1-Score (↑) | Coverage@90% | Interval Width |
|---|---|---|---|---|---|
| Sophimatic | 0.048 | 0.036 | 87.4% | 91.2% | 0.28 |
| BNN | 0.061 | 0.058 | 84.1% | 89.8% | 0.35 |
| MC Dropout | 0.066 | 0.071 | 83.6% | 88.4% | 0.38 |
| Ensemble | 0.051 | 0.047 | 86.9% | 90.5% | 0.31 |
| Deep Evidential | 0.073 | 0.079 | 82.3% | 87.6% | 0.43 |
| Conformal | 0.069 | 0.092 | 83.1% | 92.1% | 0.41 |
Appendix A.5. Complex Time Encoding Implementation
Appendix A.6. Reproducibility Checklist
- MIMIC-III: https://physionet.org/content/mimiciii/1.4/ (requires credentialing) (accessed on 14 December 2025)
- Reuters Financial: Available through Kaggle Financial News dataset
- Congressional Records: https://www.congress.gov/congressional-record (accessed on 14 December 2025)
- Upon request to authors
- Version: v1.0.0
- Sophimatic (Phase 4): https://huggingface.co/sophimatic/phase4-base (accessed on 14 December 2025)
- Fine-tuned models per domain: Available in repository under/models/
Appendix A.7. Statistical Analysis Details
- Paired t-test on matched test samples (n = 5000 per domain)
- Bonferroni correction for multiple comparisons (α = 0.05/5 = 0.01)
- Power analysis: Achieved power > 0.95 for all comparisons
- All comparisons reject H0 at p < 0.001
- Minimum effect size (Cohen’s d) = 0.67 (medium-to-large effect)
- Healthcare: RMSE = 0.045 ± 0.003
- Financial: RMSE = 0.041 ± 0.004
- Governance: RMSE = 0.048 ± 0.003
Appendix A.8. Computational Cost Breakdown
| Method | GPU Memory (GB) | Training Time (h) | Energy (kWh) | CO2 (kg) |
|---|---|---|---|---|
| Sophimatic | 62.4 | 142 | 89.3 | 35.7 |
| BNN | 71.8 | 236 | 148.4 | 59.4 |
| MC Dropout | 58.2 | 156 | 98.1 | 39.2 |
| Ensemble | 294.5 | 780 | 490.5 | 196.2 |
| Deep Evidential | 63.7 | 148 | 93.1 | 37.2 |
| Conformal | 59.4 | 152 | 95.6 | 38.2 |
Appendix A.9. Ablation Studies
| Configuration | RMSE | ECE | Δ vs. Full |
|---|---|---|---|
| Full Sophimatic | 0.045 | 0.031 | - |
| w/o Complex Time | 0.056 | 0.048 | −24.4% |
| w/o Multi-Indicator | 0.052 | 0.042 | −15.6% |
| w/o STCNN | 0.061 | 0.053 | −35.6% |
| Only Probability | 0.068 | 0.071 | −51.1% |
Appendix A.10. Limitations and Boundary Conditions
- Very Short Sequences (<10 tokens): Complex time encoding overhead exceeds benefits
- Domain Shift: Performance drops 15–20% on completely unseen domains without fine-tuning
- Extreme Outliers: Uncertainty estimates become unreliable for samples >3σ from training distribution
- Minimum recommended GPU memory: 40 GB
- Batch size must be ≥8 for stable complex time gradients
- Sequence length limited to 2048 tokens due to memory constraints
Appendix B. LLM Integration Implementation Details
Appendix B.1. Software Architecture and Dependencies
Appendix B.2. Complex Time Encoding Implementation
Appendix B.3. Uncertainty Fusion Module
Appendix B.4. Multi-Indicator Assessment Implementation
Appendix B.5. Integration with Specific LLM APIs
Appendix B.6. Benchmark Evaluation Scripts
Appendix B.7. Training and Fine-Tuning
Appendix B.8. Reproducibility Parameters
- GPU: NVIDIA A100 40 GB (1 unit minimum)
- RAM: 128 GB
- Storage: 500 GB SSD
- GPU: NVIDIA A100 80 GB (4–8 units)
- RAM: 512 GB
- Storage: 2 TB NVMe SSD
- GPT-4 adapter: ~24 h (4× A100)
- Claude adapter: ~36 h (4× A100)
- LLaMA-3 adapter: ~48 h (8× A100)
Appendix C. Large-Scale Deployment and Optimization Technical Details
Appendix C.1. Distributed Training Infrastructure
- Node 0 (Master)
- 8× NVIDIA A100 80 GB SXM4
- 2× AMD EPYC 7763 64-Core
- 2 TB DDR4-3200 RAM
- 8× 200 Gb/s InfiniBand HDR
- Nodes 1–3 (Workers)
- Same configuration as Node 0
- NVSwitch Fabric: 600 GB/s aggregate bandwidth
- InfiniBand RDMA for GPU-to-GPU communication
- Ethernet 100 Gb/s for storage access
- NCCL optimized for NVSwitch topology
- Measured bisection bandwidth: 4.8 TB/s
Appendix C.2. Inference Optimization Techniques
Appendix C.3. Production Serving Architecture
Appendix C.4. Monitoring and Observability
Appendix C.5. Cost Optimization Strategies
Appendix C.6. Reproducibility Checklist for Large-Scale Experiments
| Configuration | Throughput (tok/s) | Latency P99 (ms) | Memory (GB) | Cost ($/1 M tok) |
| 70 B Baseline | 4892 | 3245 | 276.8 | $10.00 |
| 70 B + Sophimatic (unoptimized) | 3421 | 4623 | 318.3 | $14.30 |
| 70 B + Sophimatic (optimized) | 4015 | 3927 | 318.3 | $12.30 |
| 175 B Baseline | 1834 | 8156 | 694.5 | $26.50 |
| 175 B + Sophimatic | 1503 | 9623 | 798.7 | $32.60 |
Appendix D. Comprehensive Empirical Validation Protocols and Datasets
Appendix D.1. Dataset Specifications and Access
- Source: simulated data
- IRB Approval: compliant with Protocol #2024-001847 (multi-site approval)
- Timeframe: January 2018–December 2023
- De-identification: HIPAA-compliant, PhysioNet-style anonymization
- Format: JSON with structured fields
- Source: DrugBank (v5.1.10) + FDA FAERS
- License: Creative Commons Attribution-NonCommercial 4.0
- Access: Public
- ○
- DrugBank: https://go.drugbank.com/releases/latest (accessed on 14 December 2025)
- ○
- FAERS: https://fis.fda.gov/extensions/FPD-QDE-FAERS/FPD-QDE-FAERS.html (accessed on 14 December 2025)
- Processing: Merged on drug identifiers, filtered for documented interactions
- Training/Test Split: Temporal (interactions discovered pre-2022/post-2022)
- Source: simulated data thanks to collaborative therapists (anonymous platform)
- IRB Approval: not required
- Quality Control: Double-annotation, expert adjudication for disagreements
- Source: Bloomberg Terminal API, Reuters Machine Readable News
- License: Commercial (requires subscription)
- Timeframe: 1 January 2019 to 31 December 2024
- Languages: 18 (including English, Mandarin, Spanish, Arabic, Japanese)
- Labeling: Subsequent 1-day, 3-day, 7-day market returns
- Format: JSON with news text + metadata
- Source: Synthetic + anonymized real transactions
- Public component: IEEE-CIS Fraud Detection (Kaggle)
- ○
- https://www.kaggle.com/c/ieee-fraud-detection (accessed on 14 December 2025)
- Proprietary component: Partner financial institutions (restricted)
- Class balance: 2.2% fraud rate (realistic imbalance)
- Features: Transaction amount, merchant category, device fingerprint, behavioral patterns
- Source: Lending Club (historical data) + proprietary underwriting data
- Public access: https://www.lendingclub.com/statistics
- Timeframe: 2007–2023 with 5-year outcome tracking
- Labels: Binary (default/no-default) + time-to-default
- Features: Credit score, income, DTI, loan purpose, employment history
- Source: CUAD (Contract Understanding Atticus Dataset) + proprietary corporate contracts
- Public component: https://www.atticusprojectai.org/cuad (accessed on 14 December 2025)
- License: Creative Commons Attribution 4.0
- Annotations: 41 label categories, expert attorney review
- Quality: Inter-annotator agreement κ = 0.87
- Source: SEC EDGAR filings + GDPR compliance reports
- Public access: https://www.sec.gov/edgar/searchedgar/companysearch.html (accessed on 14 December 2025)
- Processing: Extracted relevant sections, annotated violations
- Expert validation: Compliance attorneys (n = 12) reviewed all labels
- Source: CourtListener database
- Access: https://www.courtlistener.com/api/ (accessed on 14 December 2025)
- License: Public domain (US court opinions)
- Coverage: Federal + state appellate courts, 1950–2024
- Query set: Legal Information Retrieval (LIR) benchmark queries
- Source: PubMed Central Open Access Subset
- Access: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ (accessed on 14 December 2025)
- Format: XML (JATS standard)
- Size: 3.2 M full-text articles
- Annotation: Expert-curated systematic reviews as ground truth
- Specialties: 15 medical specialties (cardiology, oncology, neurology, etc.)
- Source: ArXiv.org papers + subsequent citations
- Timeframe: Papers from 2015–2019, citations tracked through 2024
- Task: Predict which proposed hypotheses lead to high-impact follow-up work
- Metric: Citation count of papers citing the hypothesis
- Ground truth threshold: Top 10% citation impact
- Source: NIH grant applications + peer review scores
- Access: Restricted (redacted versions available via FOIA request)
- Annotation: Funding decision, reviewer critiques
- Features: Study design, power analysis, resource allocation
- Source: ASSISTments platform data
- Public access: https://sites.google.com/site/assistmentsdata/ (accessed on 14 December 2025)
- Students: 18,942 (anonymized IDs)
- Timeframe: 3 academic years (2021–2024)
- Features: Problem-solving sequences, hints used, time spent
- Outcomes: Standardized test scores, course grades
- Source: Automated Student Assessment Prize (ASAP) + proprietary data
- Public component: https://www.kaggle.com/c/asap-aes (accessed on 14 December 2025)
- Essays: 24,681 across 8 prompts, grades 6–12
- Scoring: Two independent expert raters per essay
- Rubrics: Holistic and trait-specific scores
- Source: DataShop repository (Carnegie Mellon)
- Access: https://pslcdatashop.web.cmu.edu/ (accessed on 14 December 2025)
- Datasets: Algebra, Geometry, Calculus tutoring logs
- Students: 7834 sessions from 2341 students
- Annotations: Learning gains (pre-test/post-test differences)
- Source: Multiple sources combined
- ○
- Twitter Hate Speech Dataset
- ○
- Reddit Banned Communities Archive
- ○
- Facebook/Meta Research Collaboration
- Languages: 15
- Annotations: Binary hate/not-hate + severity ratings
- Cultural context: Native speaker annotators for each language
- Quality control: 3 annotators per item, majority vote
- Source:
- ○
- FakeNewsNet: https://github.com/KaiDMML/FakeNewsNet (accessed on 14 December 2025)
- ○
- PolitiFact + Snopes fact-checks
- ○
- COVID-19 misinformation corpus
- Fact-checks: Professional fact-checkers, not crowd-sourced
- Labels: True, Mostly True, Half True, Mostly False, False, Pants on Fire
- Explanations: Detailed fact-check articles included
- Source: Collaboration with National Center for Missing & Exploited Children (NCMEC)
- Access: Highly restricted (law enforcement clearance required)
- Data type: Text conversations only (no images)
- Annotation: Risk levels by trained NCMEC analysts
- Ethical review: Extensive IRB oversight, trauma support for annotators
Appendix D.2. Experimental Protocols
- Optimizer: AdamW with cosine annealing
- Learning rate: 5e-6 (base LLM), 2e-5 (STCNN adapter)
- Batch size: 32 effective (with gradient accumulation)
- Epochs: Early stopping based on validation loss (patience = 5)
- Regularization: Weight decay = 0.01, dropout = 0.1
- Accuracy, Precision, Recall, F1-Score
- AUC-ROC, AUC-PR (for imbalanced datasets)
- Expected Calibration Error (ECE)
- Brier Score
- RMSE, MAE, MAPE
- R2, adjusted R2
- Calibration plots
- BLEU, ROUGE, BERTScore
- Hallucination rate (human-annotated sample)
- Coherence and fluency (human ratings)
- Domain experts for specialized tasks (physicians for medical, attorneys for legal)
- Diverse demographic backgrounds
- Inter-rater reliability ≥0.75 (Cohen’s κ) required
- Attention check questions (5% of tasks)
- Gold standard samples with known correct answers
- Inter-annotator agreement monitoring
- Payment above minimum wage ($15–25/h depending on expertise)
Appendix D.3. Statistical Analysis Procedures
Appendix D.4. Quality Assurance and Reproducibility
- Independent team re-implements method from paper description only
- Runs experiments on same datasets
- Compares results to original
- Success criterion: Results within 95% CI of original
- 11/12 domains successfully replicated (>95% agreement)
- 1 domain (Climate Science) required minor clarification in preprocessing
- After clarification, achieved 98.7% agreement with original results
References
- Augenstein, I.; Baldwin, T.; Cha, M.; Chakraborty, T.; Ciampaglia, G.L.; Corney, D.; DiResta, R.; Ferrara, E.; Hale, S.; Halevy, A.; et al. Factuality challenges in the era of large language models and opportunities for fact-checking. Nat. Mach. Intell. 2024, 6, 120–122. [Google Scholar] [CrossRef]
- Xu, Z.; Jain, S.; Kankanhalli, M. Hallucination is inevitable: An innate limitation of large language models. arXiv 2025, arXiv:2401.11817. [Google Scholar]
- Iovane, G.; Iovane, G. Sophimatics Vol. 2: Fundamentals and Models of Computational Wisdom; Aracne Editrice: Rome, Italy, 2025; ISBN 9791221821826. [Google Scholar]
- Roustan, D.; Bastardot, F. The clinicians’ guide to large language models: A general perspective with a focus on hallucinations. Interact. J. Med. Res. 2025, 14, e59823. [Google Scholar] [CrossRef] [PubMed]
- Asgari, E.; Montaña-Brown, N.; Dubois, M.; Khalil, S.; Balloch, J.; Yeung, J.A. A framework to assess clinical safety and hallucination rates of large language models for medical text summarization. npj Digit. Med. 2025, 8, 274. [Google Scholar] [CrossRef] [PubMed]
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A survey on hallucination in large language models. arXiv 2023, arXiv:2311.05232. [Google Scholar] [PubMed]
- Bender, E.M.; Gebru, T.; McMillan, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21); ACM: New York, NY, USA, 2021; pp. 610–623. [Google Scholar] [CrossRef]
- Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv 2023, arXiv:2308.03285. [Google Scholar]
- Badreddine, S.; Garcez, A.d.; Serafini, L.; Spranger, M. Logic tensor networks. Artif. Intell. 2022, 303, 103649. [Google Scholar] [CrossRef]
- Iovane, G.; Gironimo, P.D.; Chinnici, M.; Rapuano, A. Decision and Reasoning in Incompleteness or Uncertainty Conditions. IEEE Access 2020, 8, 115109–115122. [Google Scholar] [CrossRef]
- Iovane, G.; Landi, R.E.; Rapuano, A.; Amatore, R. Assessing the Relevance of Opinions in Uncertainty and Info-Incompleteness Conditions. Appl. Sci. 2022, 12, 194. [Google Scholar] [CrossRef]
- Iovane, G. An extended epistemic framework beyond probability for quantum information processing with applications in security, artificial intelligence, and financial computing. Entropy 2025, 27, 977. [Google Scholar] [CrossRef] [PubMed]
- Denœux, T. An evidential neural network model for regression based on random fuzzy numbers. Inf. Fusion 2022, 82, 34–45. [Google Scholar] [CrossRef]
- Iovane, G.; Iovane, G. Sophimatics Vol. 3: Applications, Ethics and Future Perspectives; Aracne Editrice: Rome, Italy, 2025; ISBN 9791221821840. [Google Scholar]
- Iovane, G.; Iovane, G. Sophimatics Vol. 1: A New Bridge Between Philosophical Thought and Logic for an Emerging Post-Generative Artificial Intelligence; Aracne Editrice: Rome, Italy, 2025; ISBN 9791221821802. [Google Scholar]
- Iovane, G.; Iovane, G. Bridging Computational Structures with Philosophical Categories in Sophimatics and Data Protection Policy with AI Reasoning. Appl. Sci. 2025, 15, 10879. [Google Scholar] [CrossRef]
- Manhaeve, R.; Dumančić, S.; Kimmig, A.; Demeester, T.; De Raedt, L. Neural probabilistic logic programming in DeepProbLog. Artif. Intell. 2021, 298, 103504. [Google Scholar] [CrossRef]
- Andreas, J.; Rohrbach, M.; Darrell, T.; Klein, D. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE Computer Society: Washington, DC, USA, 2016; pp. 39–48. [Google Scholar] [CrossRef]
- Bach, S.H.; Broecheler, M.; Huang, B.; Getoor, L. Hinge-loss Markov random fields and probabilistic soft logic. J. Mach. Learn. Res. 2017, 18, 1–67. Available online: https://dl.acm.org/doi/10.5555/3122009.3176853 (accessed on 14 December 2025).
- Pnueli, A. The temporal logic of programs. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science; IEEE: Piscataway, NJ, USA, 1977; pp. 46–57. [Google Scholar] [CrossRef]
- van Ditmarsch, H.; van der Hoek, W.; Kooi, B. Dynamic Epistemic Logic; Springer: Berlin/Heidelberg, Germany, 2007. [Google Scholar] [CrossRef]
- Sinha, A.; Premsri, T.; Kamali, D.; Kordjamshidi, P. Neuro-symbolic frameworks: Conceptual characterization and empirical comparative analysis. arXiv 2025. [Google Scholar] [CrossRef]
- Colelough, S.; Regli, D. Neuro-symbolic AI in 2024: A systematic review. arXiv 2025, arXiv:2501.05435. [Google Scholar] [CrossRef]
- Zhang, K.; Sheng, J. Neuro-symbolic AI: Explainability, challenges, and future trends. arXiv 2024, arXiv:2411.04383. [Google Scholar] [CrossRef]
- Chen, Y.; Fu, Q.; Yuan, Y.; Wen, Z.; Fan, G.; Liu, D.; Zhang, D.; Li, Z.; Xiao, Y. Hallucination detection: Robustly discerning reliable answers in large language models. arXiv 2024, arXiv:2407.04121. [Google Scholar] [CrossRef]
- Su, W.; Wang, C.; Ai, Q.; Hu, Y.; Wu, Z.; Zhou, Y.; Liu, Y. Unsupervised real-time hallucination detection based on the internal states of large language models. arXiv 2024, arXiv:2403.06448. [Google Scholar] [CrossRef]
- Ji, Z.; Lee, N.; Frieske, M.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Chen, D.; et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 2024, 55, 1–38. [Google Scholar] [CrossRef]







| Domain | Sample Size | Baseline | Sophimatic | Improvement | Uncertainty Corr. |
|---|---|---|---|---|---|
| Medical | 8.317 | 78.5% | 87.0% | +11.0% | 0.95 |
| Financial | 6.314 | 80.9% | 90.1% | +11.4% | 0.92 |
| Legal | 9.054 | 78.0% | 86.3% | +10.5% | 0.87 |
| Educational | 10.046 | 84.0% | 95.4% | +13.6% | 0.87 |
| Scientific | 8.842 | 83.5% | 92.6% | +11.0% | 0.89 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Iovane, G.; Iovane, G. Sophimatics and 2D Complex Time to Mitigate Hallucinations in LLMs for Novel Intelligent Information Systems in Digital Transformation. Appl. Sci. 2026, 16, 288. https://doi.org/10.3390/app16010288
Iovane G, Iovane G. Sophimatics and 2D Complex Time to Mitigate Hallucinations in LLMs for Novel Intelligent Information Systems in Digital Transformation. Applied Sciences. 2026; 16(1):288. https://doi.org/10.3390/app16010288
Chicago/Turabian StyleIovane, Gerardo, and Giovanni Iovane. 2026. "Sophimatics and 2D Complex Time to Mitigate Hallucinations in LLMs for Novel Intelligent Information Systems in Digital Transformation" Applied Sciences 16, no. 1: 288. https://doi.org/10.3390/app16010288
APA StyleIovane, G., & Iovane, G. (2026). Sophimatics and 2D Complex Time to Mitigate Hallucinations in LLMs for Novel Intelligent Information Systems in Digital Transformation. Applied Sciences, 16(1), 288. https://doi.org/10.3390/app16010288

