# Transformers for Multi-Horizon Forecasting in an Industry 4.0 Use Case


## Abstract


## 1. Introduction

#### 1.1. Contributions

- Innovative Transformer-Based Architectures for Multi-Horizon Forecasting: Two novel variations of the encoder-only transformer architecture were devised: one that applies attention to all input data (TRA-FLAT) and another that applies attention solely to the time dimension (TRA-TIME). To the best of our knowledge, this represents the first attempt to adapt and apply an encoder-only transformer model to a multi-horizon forecasting problem. Our experimental results indicate that both transformer-based models outperform traditional deep-learning models (LSTMs) in terms of accuracy in both multi-horizon and fixed-horizon scenarios.
- Superiority of Temporal-Based Deep-Learning Architectures in Multi-Horizon Forecasting: This study presents empirical evidence in favor of deep-learning architectures that take advantage of temporal relationships in the data, specifically TRA-TIME and LSTM, for multi-horizon forecasting problems. Despite the conventional assumption that fixed-horizon models, which are optimized to predict a specific point in the forecasting horizon, would outperform their multi-horizon counterparts, our experiments indicate that architectures capable of exploiting the temporal structure of the data, given an adequate amount of past temporal data as input, achieve superior performance. Moreover, our findings suggest that TRA-TIME and LSTM outperform other multi-horizon models, such as TRA-FLAT, that do not consider the temporal topology of the input data in multi-horizon forecasting scenarios.
- Advantages of Multi-Horizon Forecasting Models for Real-Time Forecasting: Our study suggests that, in real-time forecasting scenarios, multi-horizon models represent a more efficient alternative to fixed-horizon models. This is because a single multi-horizon model is capable of generating a sequence of forecasts across a given temporal range, as opposed to a single prediction for a particular time instant. Furthermore, training and validating multiple fixed-horizon models that span the same forecasting horizon requires more computational resources than a single multi-horizon model of equivalent precision. This finding is especially relevant in Industry 4.0 real-time deployments, where periodic model retraining is necessary to address data drift throughout the lifespan of the deployed models.
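The difference between the two problem formulations can be sketched with illustrative stand-ins; the shapes and toy predictors below are assumptions for illustration only, not the models evaluated in this work:

```python
import numpy as np

# Illustrative sizes only: window length W, features F, horizon H.
W, F, H = 60, 4, 15
window = np.random.rand(W, F)

def fixed_horizon_predict(x):
    # Stand-in for a trained fixed-horizon model: one value for one
    # chosen step of the horizon (e.g., t + 15 s).
    return float(np.tanh(x).mean())

def multi_horizon_predict(x):
    # Stand-in for a trained multi-horizon model: one value per step
    # of the horizon, produced in a single forward pass.
    return np.tanh(x).mean(axis=1)[:H]

# A single multi-horizon call covers the whole horizon; H separate
# fixed-horizon models would be needed for the same coverage.
print(multi_horizon_predict(window).shape)  # (15,)
```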

#### 1.2. Paper Structure

## 2. Related Work

## 3. Proposed Deep-Learning Architectures

#### 3.1. Time-Series Forecasting Description

#### 3.2. Deep-Learning Models

#### 3.2.1. New Transformer Architectures

- Time-Attention Encoder-Only Transformer (TRA-TIME): applies attention only to the time dimension. Unlike in [10], the input data batch is transposed by swapping only the feature and time dimensions. This helps the model learn time-related patterns, such as trends and seasonality. Furthermore, because this approach is efficient in most time-series problems, the dimension-reduction block used in [10] can be omitted; this omission improves the quality of the information received by the attention layers and, in turn, of the entire model.
- Flattened-Attention Encoder-Only Transformer (TRA-FLAT): applies attention to all input data. Unlike in [10], the input batch is flattened to a single dimension before the trainable positional-encoder block. This allows the model to generalize and learn relationships in the data regardless of their position in the feature or time dimensions. However, the model becomes more complex and less efficient, so the dimension-reduction block is still required in most cases.
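The input-handling step that differentiates the two variants can be sketched as a simple tensor manipulation; the batch, window, and feature sizes below are illustrative, not the paper's exact configuration:

```python
import numpy as np

batch, time, feats = 32, 60, 4   # illustrative sizes
x = np.random.rand(batch, time, feats)

# TRA-TIME: swap only the feature and time dimensions so that the
# attention layers operate across the time dimension of each sample.
x_time = np.swapaxes(x, 1, 2)            # shape (batch, feats, time)

# TRA-FLAT: flatten each sample to a single dimension before the
# trainable positional encoder, so attention can relate any pair of
# (time, feature) positions; a dimension-reduction block typically
# follows because the flattened input makes the model larger.
x_flat = x.reshape(batch, time * feats)  # shape (batch, 240)
```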

#### 3.2.2. LSTM Models

## 4. System Model

- The AGV sends status data to the PLC. The most important part for the control application is the deviation of the AGV with respect to the magnetic line of the circuit. This provides an estimate of the position of the AGV.
- The PLC uses the received AGV position to compute speed references for the wheels that correct the AGV's deviation with respect to the magnetic line. These references are sent to the physical AGV.
- The AGV applies these speed references and returns to step 1 to send its updated status to the PLC.
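A schematic version of one iteration of this control loop is sketched below; the proportional gain, base speed, and function name are hypothetical illustrations, not taken from the real PLC program:

```python
def plc_compute_speed_references(deviation, base_speed=1.0, k_p=0.05):
    """Hypothetical proportional correction: the larger the deviation
    from the magnetic line, the stronger the differential steering."""
    correction = k_p * deviation
    left_speed = base_speed - correction
    right_speed = base_speed + correction
    return left_speed, right_speed

# One loop iteration: the AGV reports its deviation (step 1), the PLC
# computes wheel-speed references (step 2), and the AGV applies them
# before reporting its updated status (step 3).
left, right = plc_compute_speed_references(deviation=10.0)
print(left, right)  # 0.5 1.5
```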

## 5. Dataset

- Set up and deploy the components of the use case together with the network degradation generator module.
- Collect the network packets generated in the communication between an AGV and a PLC under various network degradation scenarios.
- Clean the collected data and extract the relevant features of the network packets transmitted between the AGV and the PLC.
- Split the data into different partitions to be used for training the deep-learning models and verifying the proper functioning and generalization of the trained model.
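The windowing step implied by this pipeline can be sketched as follows; the window and horizon lengths are illustrative, and the real inputs come from the processed packet features rather than a synthetic signal:

```python
import numpy as np

def make_windows(series, window=60, horizon=15):
    """Slice a 1-D signal into (input window, multi-horizon target) pairs."""
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i : i + window])                     # past samples
        y.append(series[i + window : i + window + horizon])  # future steps
    return np.array(X), np.array(y)

series = np.arange(100, dtype=float)  # stand-in for the guide-error signal
X, y = make_windows(series)
print(X.shape, y.shape)  # (26, 60) (26, 15)
```

The resulting pairs can then be partitioned into training, validation, and test splits as described above.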

#### 5.1. Data Collection

#### 5.2. Data Processing

## 6. Deep-Learning Model Training

## 7. Experiments

#### 7.1. Fixed-Horizon Forecasting

#### 7.2. Multi-Horizon Forecasting

#### 7.3. Model Deployment

## 8. Discussion of Results

- Comparison between Fixed-Horizon Models and Multi-Horizon Models: The results of the experiments showed that the best performance was observed for fixed-horizon models, in line with prior expectations. However, the difference between the optimal fixed-horizon solution and the optimal multi-horizon solution was less than 1%, and its accuracy decayed only slowly as the prediction horizon expanded into the future. This indicates that a flexible multi-horizon model can be used with minimal loss of precision. This outcome holds substantial significance, as multi-horizon models give the system operator or technician the capability to dynamically choose the most suitable forecasting step in accordance with the AGV workload, network stability, and desired accuracy, without the need to train and validate a large number of new models for each forecasting horizon point.
- Performance Comparison between the Different Evaluated Architectures: The experiments indicated that one of our variations of the encoder-only transformer was the best performing model when trained as a fixed-horizon model. In particular, the encoder-only transformer with attention to all features (TRA-FLAT) and using a 60 s input time window without network input features was identified as the best model for the fixed-horizon problem. Conversely, the best model for the multi-horizon problem was found to be the encoder-only transformer with attention to time features (TRA-TIME) utilizing a 60 s input time window that comprised only the AGV guide error variable. Furthermore, the results obtained suggest that network input features are only useful for models operating with small time windows and in the context of fixed-horizon models as multi-horizon models do not seem to exploit this exogenous variable.
- Importance of the Temporal Dimension in Multi-Horizon Forecasting Scenarios: We observed that architectures that exploit the temporal relationships in their inputs, such as LSTMs and encoder-only transformer models with attention mechanisms for the temporal dimension (TRA-TIME), were more effective in multi-horizon forecasting scenarios than (i) their counterparts for fixed-horizon scenarios and (ii) other multi-horizon models that do not consider a temporal topology in their inputs. This result highlights the importance of considering architectures that are well-suited to learn patterns from the temporal dimension when forecasting in multi-horizon scenarios.
- Influence of Input Time Window Size: The results indicated that increasing the size of the input time window was positively correlated with an increase in the accuracy of the models. This relationship was found to be the most pronounced for multi-horizon models, which are known to benefit from the consideration of time relationships.
- Deployment in Real-World Scenarios: The outcomes of the model deployment evaluation were positive and highly supportive of the feasibility of using the models in real-time industrial scenarios. The models were evaluated using the symmetric mean absolute percentage error (SMAPE) metric, and all models exhibited a SMAPE below 50% (sufficiently precise for industrial applications), with the best models achieving a SMAPE of 26%. Additionally, the models were tested for their ability to manage real-time data sampling, where they had to produce at least 10 predictions per second. The results showed that all models were at least ten times faster than the minimum requirement, with the lowest-performing model still capable of producing 124 predictions per second. These results indicate the suitability of the models for real-time industrial use.
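For reference, the SMAPE metric used in the deployment evaluation can be computed as below; this is the common symmetric formulation, and the paper's exact variant may differ in details such as the denominator scaling:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_pred - y_true) / denom)

print(round(smape([1.0, 2.0, 4.0], [1.1, 2.0, 3.6]), 2))  # 6.68
```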

## 9. Conclusions and Future Work

#### 9.1. Conclusions

#### 9.2. Future Work

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

1. Lim, B.; Zohren, S. Time-series forecasting with deep learning: A survey. Philos. Trans. R. Soc. A **2021**, 379, 20200209.
2. Mozo, A.; Ordozgoiti, B.; Gómez-Canaval, S. Forecasting short-term data center network traffic load with convolutional neural networks. PLoS ONE **2018**, 13, e0191939.
3. Siami-Namini, S.; Namin, A.S. Forecasting economics and financial time series: ARIMA vs. LSTM. arXiv **2018**, arXiv:1803.06386.
4. Chen, K.; Zhou, Y.; Dai, F. A LSTM-based method for stock returns prediction: A case study of China stock market. In Proceedings of the 2015 IEEE International Conference on Big Data, Santa Clara, CA, USA, 29 October–1 November 2015; pp. 2823–2824.
5. Sierra-García, J.; Santos, M. Redes neuronales y aprendizaje por refuerzo en el control de turbinas eólicas. Rev. Iberoam. Autom. Inf. Ind. **2021**, 18, 327–335.
6. Sierra-Garcia, J.E.; Santos, M. Deep learning and fuzzy logic to implement a hybrid wind turbine pitch control. Neural Comput. Appl. **2022**, 34, 10503–10517.
7. Pierson, H.A.; Gashler, M.S. Deep learning in robotics: A review of recent research. Adv. Robot. **2017**, 31, 821–835.
8. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
9. Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in time series: A survey. arXiv **2022**, arXiv:2202.07125.
10. Zerveas, G.; Jayaraman, S.; Patel, D.; Bhamidipaty, A.; Eickhoff, C. A transformer-based framework for multivariate time series representation learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 2114–2124.
11. Sierra-García, J.E.; Santos, M. Mechatronic modelling of industrial AGVs: A complex system architecture. Complexity **2020**, 2020, 6687816.
12. Espinosa, F.; Santos, C.; Sierra-García, J. Transporte multi-AGV de una carga: Estado del arte y propuesta centralizada. Rev. Iberoam. Autom. Inf. Ind. **2020**, 18, 82–91.
13. Spinelli, F.; Mancuso, V. Toward enabled industrial verticals in 5G: A survey on MEC-based approaches to provisioning and flexibility. IEEE Commun. Surv. Tutor. **2020**, 23, 596–630.
14. Ahmad, I.; Kumar, T.; Liyanage, M.; Okwuibe, J.; Ylianttila, M.; Gurtov, A. Overview of 5G security challenges and solutions. IEEE Commun. Stand. Mag. **2018**, 2, 36–43.
15. Vakaruk, S.; Sierra-García, J.E.; Mozo, A.; Pastor, A. Forecasting automated guided vehicle malfunctioning with deep learning in a 5G-based industry 4.0 scenario. IEEE Commun. Mag. **2021**, 59, 102–108.
16. Yaovaja, K.; Bamrungthai, P.; Ketsarapong, P. Design of an Autonomous Tracked Mower Robot Using Vision-Based Remote Control. In Proceedings of the 2019 IEEE Eurasia Conference on IOT, Communication and Engineering (ECICE), Yunlin, Taiwan, 3–6 October 2019; pp. 324–327.
17. Ben Taieb, S.; Sorjamaa, A.; Bontempi, G. Multiple-Output Modeling for Multi-Step-Ahead Time Series Forecasting. Neurocomputing **2010**, 73, 1950–1957.
18. Bengio, Y. Learning Deep Architectures for AI; Now Publishers Inc.: Norwell, MA, USA, 2009.
19. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. **1997**, 9, 1735–1780.
20. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. **1994**, 5, 157–166.
21. Hochreiter, S.; Bengio, Y.; Frasconi, P.; Schmidhuber, J. Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies; IEEE Press: New York, NY, USA, 2001.
22. Sánchez, R.; Sierra-García, J.E.; Santos, M. Modelado de un AGV híbrido triciclo-diferencial. Rev. Iberoam. Autom. Inf. Ind. **2022**, 19, 84–95.
23. Mozo, A.; Karamchandani, A.; Gómez-Canaval, S.; Sanz, M.; Moreno, J.I.; Pastor, A. B5GEMINI: AI-driven network digital twin. Sensors **2022**, 22, 4106.
24. Pastor, A.; Mozo, A.; Lopez, D.R.; Folgueira, J.; Kapodistria, A. The Mouseworld, a security traffic analysis lab based on NFV/SDN. In Proceedings of the 13th International Conference on Availability, Reliability and Security, Hamburg, Germany, 27–30 August 2018; pp. 1–6.
25. Binder, M.; Moosbauer, J.; Thomas, J.; Bischl, B. Multi-Objective Hyperparameter Tuning and Feature Selection Using Filter Ensembles. arXiv **2020**, arXiv:1912.12912.
26. Blasco, B.C.; Moreno, J.J.M.; Pol, A.P.; Abad, A.S. Using the R-MAPE index as a resistant measure of forecast accuracy. Psicothema **2013**, 25, 500–506.

**Figure 2.** Overview of the steps of the followed method: Data Generation (see Section 5.1), Data Processing (see Section 5.2), Model Training (see Section 6), and Model Evaluation (see Section 7).

**Figure 4.** Comparison of the performance of fixed- vs. multi-horizon models in terms of mean absolute error (MAE) for the prediction of the t + 15 s step. The results are based solely on the guide error as the input feature and a time-window size of 60 s, which was determined to be the best configuration. The highest precision was attained by the transformer that considers all features (TRA-FLAT) in its fixed-horizon configuration, with an MAE of 0.923. The best multi-horizon model was the transformer that focuses only on the time dimension (TRA-TIME), with a slightly higher MAE of 0.936, which deviates by only about 1% from the best model yet offers greater robustness and flexibility.

**Figure 5.** Comparison of the most efficient multi-horizon architectures and input time-window sizes in terms of the average mean absolute error (MAE) across all predicted steps. The performance of all architectures improved as the input time-window size increased while using only the guide error as the input feature. The best-performing architecture, with an MAE of 0.918, was the transformer with attention to time features only (TRA-TIME) using a 60 s input time window.

**Figure 6.** Comparison of MAE values across the steps of the forecasting horizon. The horizontal axis represents the forecast step in seconds and the vertical axis the MAE. The best-performing architecture for multi-horizon forecasting was TRA-TIME, with an average MAE of 0.918, followed by TRA-FLAT with an average MAE of 0.939 and LSTM with an average MAE of 0.951. The figure clearly shows that the LSTM model outperformed TRA-FLAT after the tenth prediction second, which is significant because a fully loaded AGV requires at least 10 s to stop.

| Contribution | Yaovaja et al. [16] | Vakaruk et al. [15] | Our Work |
|---|---|---|---|
| Type of Models | Kinematic | Fixed-Horizon Prediction | Multi-Horizon and Fixed-Horizon Prediction |
| Use of AI Models | No | Traditional ML Architectures | State-of-the-Art DL Architectures |
| Collision Avoidance (Humans/Obstacles) | No | Yes | Yes |
| Correction of Trajectory Deviation | Yes | Yes | Yes |
| Trajectory Deviation Anticipation | No | Yes | Yes |
| Use of Industrial-Grade Components in Experiments | No | Yes | Yes |
| Use of State-of-the-Art Techniques | No | Yes | Yes |
| Comparison of Fixed-Horizon and Multi-Horizon Prediction Models | No | No | Yes |
| Analysis of the Response Time of the Control Algorithm | No | No | Yes |
| Verification of the Deployment Feasibility | No | Limited | Yes |
| Evaluation under Realistic Conditions in a Real Industrial Environment | No | Yes | Yes |

**Table 2.** Hyperparameter settings for the transformer, LSTM, and FCNN models. The table displays the architecture type in the first column, the hyperparameter name in the second column, the type of hyperparameter value in the third column, and the range of hyperparameter values in the fourth column. The FCNN model is represented as the regression component at the end of the transformer and LSTM models. (*) The reduction size hyperparameter is exclusive to the transformer architecture that focuses on all features (TRA-FLAT).

| Architecture | Hyperparameter | Value Type | Value Range |
|---|---|---|---|
| Transformer | Reduction Size * | Integer | [64, 4096] |
| | Number of Blocks | Integer | [1, 8] |
| | Number of Heads | Integer | [1, 10] |
| | Number of Block Output Neurons | Integer | [128, 8192] |
| | Dropout | Float | [$1\times {10}^{-5}$, 0.5] |
| LSTM | Number of Layers | Integer | [1, 5] |
| | Number of Neurons per Layer | Integer | [16, 128] |
| | Dropout | Float | [$1\times {10}^{-5}$, 0.5] |
| | Batch Normalization | Boolean | False/True |
| | L2 Penalty Term | Float | [$1\times {10}^{-5}$, 1.0] |
| FCNN | Number of Neurons per Layer | Integer | [256, 4096] |
| | Dropout | Float | [$1\times {10}^{-5}$, 0.5] |

**Table 3.** The best deep-learning architectures. For each model, the table presents the architecture, the size of the input time window in seconds (TW), the type of model (fixed- or multi-horizon), the input features (guide error only, GE, or with network features, GE + Net), the mean absolute error (MAE) and the symmetric mean absolute percentage error (SMAPE) for the 15 s step, the average of the MAEs over all steps (multi-horizon models only), and the number of predictions per second. The architecture can be either a long short-term memory (LSTM) network, defined by the number of layers (L-2) and LSTM neurons per layer (×119) followed by the number of output neurons (F-499), or a transformer with flat attention (TRA-FLAT), defined by the number of output neurons in the reduction layer (R-398), the number of attention blocks (B-4) and neurons per block (×2947), the number of heads per block (H-4), and the number of output neurons (F-2796). The transformer with time attention (TRA-TIME) is defined like TRA-FLAT but without the reduction layer. All the models displayed are suitable for real-time deployment, as they are faster than 10 predictions per second (the data collection speed) and have a SMAPE below 50% (sufficiently precise for industrial applications). The smallest MAE value for the 15 s step of each architecture is highlighted in bold.

| Architecture | TW | Horizon | Features | MAE 15 s | SMAPE 15 s | MAE Average | Predictions/Second |
|---|---|---|---|---|---|---|---|
| LSTM (L-2×119, F-499) | 15 s | Fixed | GE + Net | 1.052 | 29% | N/A | 411.311 |
| LSTM (L-2×112, F-446) | 15 s | Multi | GE | 1.011 | 27% | 0.977 | 480.215 |
| LSTM (L-2×121, F-609) | 30 s | Fixed | GE + Net | 1.058 | 28% | N/A | 381.997 |
| LSTM (L-2×124, F-821) | 30 s | Multi | GE | 1.000 | 27% | 0.976 | 396.051 |
| LSTM (L-2×126, F-503) | 60 s | Fixed | GE | 0.997 | 26% | N/A | 385.698 |
| LSTM (L-2×117, F-247) | 60 s | Multi | GE | **0.955** | 26% | 0.951 | 368.333 |
| TRA-FLAT (R-398, B-4×2947, H-4, F-2796) | 15 s | Fixed | GE | 0.992 | 28% | N/A | 176.139 |
| TRA-FLAT (R-344, B-4×5948, H-5, F-3083) | 15 s | Multi | GE | 1.010 | 28% | 0.962 | 156.427 |
| TRA-FLAT (R-936, B-2×6292, H-3, F-1759) | 30 s | Fixed | GE + Net | 0.962 | 27% | N/A | 220.363 |
| TRA-FLAT (R-490, B-5×3353, H-3, F-1441) | 30 s | Multi | GE | 1.009 | 28% | 0.951 | 175.153 |
| TRA-FLAT (R-378, B-4×4917, H-3, F-2671) | 60 s | Fixed | GE | **0.923** | 26% | N/A | 187.143 |
| TRA-FLAT (R-658, B-6×3240, H-3, F-1282) | 60 s | Multi | GE | 0.976 | 27% | 0.939 | 124.321 |
| TRA-TIME (B-2×5581, H-4, F-832) | 15 s | Fixed | GE | 0.987 | 27% | N/A | 279.559 |
| TRA-TIME (B-3×7637, H-4, F-1959) | 15 s | Multi | GE | 1.015 | 28% | 0.975 | 215.980 |
| TRA-TIME (B-2×5846, H-4, F-989) | 30 s | Fixed | GE | 0.968 | 26% | N/A | 277.005 |
| TRA-TIME (B-4×7685, H-4, F-1748) | 30 s | Multi | GE | 0.994 | 28% | 0.957 | 182.018 |
| TRA-TIME (B-1×6937, H-4, F-1275) | 60 s | Fixed | GE | 0.940 | 26% | N/A | 448.991 |
| TRA-TIME (B-2×8014, H-5, F-753) | 60 s | Multi | GE | **0.936** | 26% | 0.918 | 258.026 |


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Vakaruk, S.; Karamchandani, A.; Sierra-García, J.E.; Mozo, A.; Gómez-Canaval, S.; Pastor, A.
Transformers for Multi-Horizon Forecasting in an Industry 4.0 Use Case. *Sensors* **2023**, *23*, 3516.
https://doi.org/10.3390/s23073516
