Mixture of TSMixer Experts for Time Series Forecasting
Abstract
1. Introduction
2. Prior Work
3. Proposed Method
3.1. Weight Cloning and Modification
| Algorithm 1: Training of TSMixer Experts. |
|---|
| Input: |
| Parameter: |
| Output: (moment parameters) |
| 1: |
| 2: Precompute the Hermite coefficients using Equation (2) |
| 3: while not converged do |
| 4: for … do |
| 5: |
| 6: Higher-order moment morphing using Equation (1) |
| 7: Lower-order moment morphing using Equation (3) |
| 8: |
| 9: end for |
| 10: |
| 11: for … do |
| 12: - |
| 13: end for |
| 14: using a gating function |
| 15: using Equation (5) and final model output |
| 16: |
| 17: |
| 18: end while |
| 19: |

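
Steps 14 and 15 of Algorithm 1 weight the expert outputs with a gating function and combine them into the final model output. The exact form of Equation (5) is not reproduced in this section, so the following is only a minimal sketch that assumes a softmax gate over linear gate logits. The names `combine_experts` and `gate_weights`, and the linear toy experts, are hypothetical stand-ins and not the TSMixer experts of the proposed model.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def combine_experts(x, expert_fns, gate_weights):
    """Gate-weighted sum of expert outputs (assumed softmax gating).

    x            : (batch, features) input
    expert_fns   : list of callables, each mapping x to (batch, horizon)
    gate_weights : (features, n_experts) hypothetical linear gate
    """
    gate = softmax(x @ gate_weights, axis=-1)               # (batch, n_experts)
    outputs = np.stack([f(x) for f in expert_fns], axis=1)  # (batch, n_experts, horizon)
    combined = np.einsum("be,beh->bh", gate, outputs)       # (batch, horizon)
    return combined, gate


# Toy usage: three linear "experts" standing in for trained models.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
expert_mats = [rng.normal(size=(16, 8)) for _ in range(3)]
expert_fns = [lambda v, W=W: v @ W for W in expert_mats]
gate_weights = rng.normal(size=(16, 3))

y, gate = combine_experts(x, expert_fns, gate_weights)
```

Because the gate weights in each row sum to one, the combined forecast is a convex combination of the expert forecasts; a sigmoid or sparse top-k gate would change only the `gate` line.
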
3.2. Model Architecture
3.3. Computational Complexity
4. Experiments
4.1. Experimental Designs
- RQ1: How does the proposed model’s performance compare to that of the standard MLP-Mixer and its MoE variant?
- RQ2: How do the number of moment parameters, the number of experts, and other hyperparameters affect performance?
4.2. Experimental Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| MoE | Mixture-of-Experts |
| TSMixer | Time Series Mixer |
| MoE-TSMixer | Mixture-of-Experts version of TSMixer |
| MLP | Multi-Layer Perceptron |
| LTSF | Long-Term Time Series Forecasting |


| Data | Horizon | Proposed Model MSE | Proposed Model MAE | Proposed Model Param. | MoE-TSMixer MSE | MoE-TSMixer MAE | MoE-TSMixer Param. | TSMixer MSE | TSMixer MAE | TSMixer Param. | Transformer MSE | Transformer MAE | Transformer Param. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ETTh1 | 96 | 0.4842 | 0.4879 | 84,558 | 0.5191 | 0.5111 | 281,966 | 0.5110 | 0.5045 | 36,142 | 0.8866 | 0.7541 | 276,199 |
| | 192 | 0.6048 | 0.5598 | 131,493 | 0.6186 | 0.5686 | 291,278 | 0.5945 | 0.5531 | 45,454 | 0.9128 | 0.7682 | 285,511 |
| | 336 | 0.6998 | 0.6186 | 107,838 | 0.7541 | 0.649 | 305,246 | 0.6730 | 0.5979 | 59,422 | 1.0036 | 0.8179 | 299,479 |
| | 720 | 0.6776 | 0.6107 | 118,462 | 0.8064 | 0.6903 | 342,494 | 0.7478 | 0.6585 | 96,670 | 1.3498 | 0.9739 | 336,727 |
| ETTh2 | 96 | 0.5293 | 0.5578 | 46,935 | 1.2638 | 0.9305 | 281,966 | 0.6385 | 0.6015 | 36,142 | 1.567 | 1.0553 | 276,199 |
| | 192 | 1.4592 | 0.9239 | 93,870 | 1.8921 | 1.184 | 291,278 | 0.8027 | 0.6743 | 45,454 | 3.7647 | 1.7137 | 285,511 |
| | 336 | 1.3647 | 0.9067 | 107,838 | 1.8384 | 1.1324 | 305,246 | 0.8727 | 0.6975 | 59,422 | 3.592 | 1.6718 | 299,479 |
| | 720 | 2.5425 | 1.3626 | 145,086 | 1.7264 | 1.0791 | 342,494 | 0.9769 | 0.7262 | 96,670 | 3.6145 | 1.6709 | 336,727 |
| ETTm1 | 96 | 0.4112 | 0.4295 | 84,558 | 0.4474 | 0.4594 | 281,966 | 0.4200 | 0.4377 | 36,142 | 0.5586 | 0.5335 | 276,199 |
| | 192 | 0.4606 | 0.4658 | 147,118 | 0.5128 | 0.508 | 291,278 | 0.4461 | 0.4586 | 45,454 | 0.6036 | 0.5816 | 285,511 |
| | 336 | 0.5340 | 0.5228 | 107,838 | 0.5819 | 0.5585 | 305,246 | 0.4943 | 0.4935 | 59,422 | 0.6886 | 0.6288 | 299,479 |
| | 720 | 0.6260 | 0.5683 | 145,086 | 0.7625 | 0.6679 | 342,494 | 0.5940 | 0.5525 | 96,670 | 0.8517 | 0.7391 | 336,727 |
| ETTm2 | 96 | 0.2278 | 0.3458 | 84,558 | 0.288 | 0.3844 | 281,966 | 0.2707 | 0.3929 | 36,142 | 0.2935 | 0.3907 | 276,199 |
| | 192 | 0.4057 | 0.4863 | 94,126 | 0.4014 | 0.4772 | 291,278 | 0.3991 | 0.4764 | 45,454 | 0.6357 | 0.6456 | 285,511 |
| | 336 | 0.7628 | 0.6769 | 107,838 | 0.8391 | 0.7255 | 305,246 | 0.5450 | 0.5699 | 59,422 | 0.9315 | 0.7672 | 299,479 |
| | 720 | 1.2044 | 0.8137 | 178,094 | 2.6288 | 1.2806 | 342,494 | 1.2250 | 0.8884 | 96,670 | 3.3569 | 1.4834 | 336,727 |

| Data | Horizon | MSE Delta vs. MoE-TSMixer | Cohen’s d vs. MoE-TSMixer | Better vs. MoE-TSMixer | MSE Delta vs. TSMixer | Cohen’s d vs. TSMixer | Better vs. TSMixer | MSE Delta vs. Transformer | Cohen’s d vs. Transformer | Better vs. Transformer |
|---|---|---|---|---|---|---|---|---|---|---|
| ETTh1 | 96 | 0.0349 | 0.6882 | O | 0.0268 | 0.496 | O | 0.4024 | 1.8356 | O |
| | 192 | 0.0138 | 0.3306 | O | −0.0103 | −0.2532 | X | 0.308 | 2.3843 | O |
| | 336 | 0.0544 | 1.0556 | O | −0.0268 | −0.7879 | X | 0.3039 | 2.9745 | O |
| | 720 | 0.1288 | 1.5684 | O | 0.0702 | 2.7057 | O | 0.6722 | 3.5918 | O |
| ETTh2 | 96 | 0.7345 | 2.8891 | O | 0.1091 | 0.8593 | O | 1.0376 | 3.505 | O |
| | 192 | 0.4329 | 3.4922 | O | −0.6565 | −4.7372 | X | 2.3054 | 5.8116 | O |
| | 336 | 0.4738 | 4.7073 | O | −0.492 | −6.4997 | X | 2.2274 | 9.9016 | O |
| | 720 | −0.8161 | −2.1274 | X | −1.5655 | −3.1438 | X | 1.0721 | 2.7572 | O |
| ETTm1 | 96 | 0.0362 | 0.4743 | O | 0.0088 | 0.2212 | O | 0.1475 | 0.8398 | O |
| | 192 | 0.0522 | 0.5684 | O | −0.0144 | −0.4786 | X | 0.1431 | 0.9106 | O |
| | 336 | 0.0479 | 0.5864 | O | −0.0397 | −1.1686 | X | 0.1546 | 0.9306 | O |
| | 720 | 0.1365 | 1.2974 | O | −0.0321 | −1.1887 | X | 0.2257 | 1.5761 | O |
| ETTm2 | 96 | 0.0602 | 0.3931 | O | 0.043 | 0.2797 | O | 0.0658 | 0.4142 | O |
| | 192 | −0.0043 | −0.0289 | X | −0.0066 | −0.0459 | X | 0.23 | 1.5396 | O |
| | 336 | 0.0763 | 0.4187 | O | −0.2178 | −0.889 | X | 0.1687 | 1.3671 | O |
| | 720 | 1.4244 | 2.2226 | O | 0.0206 | 0.0235 | O | 2.1525 | 2.7558 | O |
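
The comparison columns can be read as follows: the MSE Delta entries match the baseline MSE minus the proposed model's MSE (for ETTh1 at horizon 96, 0.5191 − 0.4842 = 0.0349 against MoE-TSMixer), and "O" appears wherever that difference is positive. How Cohen's d was computed is not stated in this section. The sketch below assumes the common pooled-standard-deviation form over repeated runs; the function names and the per-run MSE values are hypothetical and serve only to illustrate the arithmetic.

```python
import math
from statistics import mean, stdev


def cohens_d(baseline, proposed):
    """Cohen's d with a pooled sample standard deviation (assumed form)."""
    n1, n2 = len(baseline), len(proposed)
    s1, s2 = stdev(baseline), stdev(proposed)
    pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(baseline) - mean(proposed)) / pooled


def compare(baseline, proposed):
    """Return (MSE delta, Cohen's d, marker) using the sign convention
    inferred from the tables: delta = baseline MSE - proposed MSE."""
    delta = mean(baseline) - mean(proposed)
    marker = "O" if delta > 0 else "X"
    return delta, cohens_d(baseline, proposed), marker


# Hypothetical per-run MSEs, not taken from the paper.
baseline_runs = [0.52, 0.50, 0.54]
proposed_runs = [0.48, 0.49, 0.47]

delta, d, marker = compare(baseline_runs, proposed_runs)
print(f"delta={delta:.4f}  d={d:.4f}  better={marker}")
```

A positive delta with a large d indicates that the proposed model's error is lower by a margin that is large relative to run-to-run variation; a delta near zero with a small d, as in several TSMixer rows, indicates that the two models are hard to distinguish.
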

| Data | Horizon | Proposed Model Time (s) | Proposed Model Memory (MB) | MoE-TSMixer Time (s) | MoE-TSMixer Memory (MB) | TSMixer Time (s) | TSMixer Memory (MB) | Transformer Time (s) | Transformer Memory (MB) |
|---|---|---|---|---|---|---|---|---|---|
| ETTh1 | 96 | 3.0968 | 21.68 | 0.3416 | 26.45 | 0.1503 | 20.75 | 0.1902 | 33.07 |
| | 192 | 4.8797 | 22.16 | 17.7099 | 26.93 | 17.5923 | 21.23 | 17.5694 | 33.55 |
| | 336 | 7.374 | 22.88 | 19.9737 | 27.65 | 21.0833 | 21.95 | 21.602 | 34.27 |
| | 720 | 10.6884 | 24.81 | 19.9831 | 29.58 | 21.2293 | 23.88 | 20.7512 | 36.2 |
| ETTh2 | 96 | 3.009 | 21.68 | 0.5486 | 26.45 | 0.2183 | 20.75 | 0.2333 | 33.07 |
| | 192 | 4.5492 | 22.16 | 16.9067 | 26.93 | 20.1527 | 21.23 | 17.3526 | 33.55 |
| | 336 | 6.5426 | 22.88 | 24.7284 | 27.65 | 17.0863 | 21.95 | 22.1932 | 34.27 |
| | 720 | 10.5266 | 24.81 | 22.498 | 29.58 | 14.9439 | 23.88 | 17.3449 | 36.2 |
| ETTm1 | 96 | 9.7696 | 21.7 | 1.7943 | 26.48 | 0.9897 | 20.77 | 1.1012 | 33.1 |
| | 192 | 21.9178 | 22.2 | 99.8257 | 26.97 | 101.4233 | 21.27 | 97.8145 | 33.59 |
| | 336 | 29.7039 | 22.94 | 106.3797 | 27.71 | 93.7403 | 22.01 | 88.9411 | 34.33 |
| | 720 | 58.0665 | 24.91 | 100.1851 | 29.69 | 77.5634 | 23.98 | 90.8661 | 36.31 |
| ETTm2 | 96 | 9.0746 | 21.7 | 1.9777 | 26.48 | 0.7884 | 20.77 | 1.3115 | 33.1 |
| | 192 | 22.8284 | 22.2 | 97.758 | 26.97 | 90.2805 | 21.27 | 99.6014 | 33.59 |
| | 336 | 33.0968 | 22.94 | 87.7527 | 27.71 | 86.1177 | 22.01 | 87.8673 | 34.33 |
| | 720 | 55.4874 | 24.91 | 88.313 | 29.69 | 87.7887 | 23.98 | 98.4895 | 36.31 |

| Data | Horizon | 4 moment parameters | 6 moment parameters | 8 moment parameters | 10 moment parameters | 12 moment parameters | 14 moment parameters | 16 moment parameters | 18 moment parameters | 20 moment parameters |
|---|---|---|---|---|---|---|---|---|---|---|
| ETTh1 | 96 | 0.5086 | 0.5091 | 0.5097 | 0.5092 | 0.5092 | 0.5092 | 0.5092 | 0.5092 | 0.5086 |
| | 192 | 0.6090 | 0.6084 | 0.6092 | 0.6091 | 0.609 | 0.6092 | 0.6092 | 0.6092 | 0.6090 |
| | 336 | 0.6998 | 0.699 | 0.6970 | 0.6978 | 0.6972 | 0.6974 | 0.6973 | 0.6973 | 0.6998 |
| | 720 | 0.7658 | 0.7668 | 0.7669 | 0.7669 | 0.7669 | 0.7669 | 0.7669 | 0.7669 | 0.7658 |
| ETTh2 | 96 | 0.6764 | 0.6661 | 0.6702 | 0.6698 | 0.67 | 0.6695 | 0.6697 | 0.6697 | 0.6764 |
| | 192 | 2.3290 | 2.3509 | 2.3464 | 2.3483 | 2.3486 | 2.3485 | 2.3485 | 2.3485 | 2.3290 |
| | 336 | 2.1882 | 2.1849 | 2.1538 | 2.1567 | 2.1577 | 2.1557 | 2.1551 | 2.1551 | 2.1882 |
| | 720 | 2.7564 | 2.7532 | 2.7526 | 2.7526 | 2.753 | 2.7525 | 2.7528 | 2.7528 | 2.7564 |
| ETTm1 | 96 | 0.4346 | 0.4345 | 0.4186 | 0.419 | 0.4187 | 0.435 | 0.4189 | 0.4189 | 0.4346 |
| | 192 | 0.4665 | 0.4618 | 0.4610 | 0.4637 | 0.4621 | 0.4613 | 0.4662 | 0.4662 | 0.4665 |
| | 336 | 0.5461 | 0.5305 | 0.5440 | 0.5442 | 0.5441 | 0.5443 | 0.544 | 0.544 | 0.5461 |
| | 720 | 0.6344 | 0.6327 | 0.6323 | 0.632 | 0.6321 | 0.6319 | 0.6319 | 0.6319 | 0.6344 |
| ETTm2 | 96 | 0.264 | 0.2687 | 0.2654 | 0.2652 | 0.2466 | 0.2518 | 0.2649 | 0.2649 | 0.264 |
| | 192 | 0.4129 | 0.425 | 0.4057 | 0.4060 | 0.4089 | 0.4064 | 0.4146 | 0.4146 | 0.4129 |
| | 336 | 0.6368 | 0.8676 | 0.647 | 0.6446 | 0.645 | 0.6479 | 0.652 | 0.652 | 0.6368 |
| | 720 | 1.5548 | 2.6094 | 2.8167 | 2.827 | 2.7854 | 2.8197 | 2.8269 | 2.8269 | 1.5548 |

| Data | Horizon | 4 local experts | 6 local experts | 8 local experts | 10 local experts | 12 local experts | 14 local experts | 16 local experts | 18 local experts | 20 local experts |
|---|---|---|---|---|---|---|---|---|---|---|
| ETTh1 | 96 | 0.5091 | 0.5136 | 0.5086 | 0.5006 | 0.5034 | 0.5071 | 0.4999 | 0.4981 | 0.5078 |
| | 192 | 0.5977 | 0.6134 | 0.609 | 0.6162 | 0.6053 | 0.5838 | 0.5967 | 0.6186 | 0.6088 |
| | 336 | 0.69 | 0.7025 | 0.6998 | 0.6938 | 0.707 | 0.6854 | 0.6853 | 0.7012 | 0.6963 |
| | 720 | 0.7467 | 0.7576 | 0.7658 | 0.7777 | 0.776 | 0.7497 | 0.7364 | 0.7486 | 0.7436 |
| ETTh2 | 96 | 0.6201 | 0.5970 | 0.6764 | 0.8565 | 0.6576 | 0.6044 | 0.9344 | 0.6121 | 0.715 |
| | 192 | 2.9602 | 2.525 | 2.3290 | 3.1365 | 3.0004 | 3.2259 | 2.0625 | 2.5026 | 3.0161 |
| | 336 | 1.9639 | 2.3804 | 2.1882 | 2.2667 | 2.5192 | 2.3533 | 2.1515 | 2.6271 | 2.6674 |
| | 720 | 2.3276 | 2.4612 | 2.7564 | 2.0419 | 2.3203 | 2.0532 | 2.2847 | 2.1858 | 1.8053 |
| ETTm1 | 96 | 0.4338 | 0.4247 | 0.4346 | 0.4181 | 0.4479 | 0.4381 | 0.4489 | 0.4374 | 0.4553 |
| | 192 | 0.4585 | 0.4550 | 0.4665 | 0.4711 | 0.4661 | 0.4926 | 0.5024 | 0.4999 | 0.5091 |
| | 336 | 0.5128 | 0.5485 | 0.5461 | 0.5337 | 0.5335 | 0.5265 | 0.5628 | 0.5388 | 0.5654 |
| | 720 | 0.6002 | 0.6239 | 0.6344 | 0.6283 | 0.6064 | 0.6214 | 0.6158 | 0.6207 | 0.6458 |
| ETTm2 | 96 | 0.3362 | 0.2815 | 0.264 | 0.3832 | 0.3231 | 0.2742 | 0.2572 | 0.2344 | 0.4162 |
| | 192 | 0.4549 | 0.4445 | 0.4129 | 0.4416 | 0.5183 | 0.4017 | 0.3974 | 0.4639 | 0.4108 |
| | 336 | 0.7856 | 0.7432 | 0.6368 | 0.9372 | 0.8718 | 0.6828 | 0.662 | 0.6354 | 0.6718 |
| | 720 | 2.0394 | 2.1679 | 1.5548 | 1.9819 | 2.2997 | 1.3312 | 1.6449 | 2.6045 | 1.5136 |

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Hong, J.; Lee, K.M. Mixture of TSMixer Experts for Time Series Forecasting. Biomimetics 2026, 11, 426. https://doi.org/10.3390/biomimetics11060426

