3.5. Transformer
The Transformer is a type of artificial neural network built mainly around a self-attention mechanism. It was originally designed for natural language processing applications and was recently adopted for computer vision with the Vision Transformer (ViT). Like a recurrent neural network (RNN), it is intended for sequential data, but it works differently: whereas an RNN processes the sequence step by step, a Transformer processes the whole sequence at once. Transformers nevertheless suffer from several problems that limit their use in solar energy forecasting. The main limitations can be summarized in the following three points:
Quadratic time computation: the main operation of the self-attention block, the canonical dot product, is computationally expensive and requires large memory storage.
Very large memory for long inputs: long inputs require stacking more encoder/decoder layers, which multiplies the required memory by the number of stacked encoder/decoder layers. This limits the use of Transformers for processing long inputs, such as long time series.
Low processing speed: the encoder/decoder structure decodes the output sequentially, which increases the processing time.
The canonical self-attention mechanism is based on three main inputs: the query (Q), the key (K), and the value (V). Considering an input with a dimension d, the output of the canonical self-attention mechanism can be computed using Equation (7):

A(Q, K, V) = \mathrm{Softmax}\left( \frac{Q K^{T}}{\sqrt{d}} \right) V (7)
The output for a specific row q_i of Q, given the keys K and values V, can be computed using Equation (8):

A(q_i, K, V) = \sum_{j} \frac{k(q_i, k_j)}{\sum_{l} k(q_i, k_l)} \, v_j = \mathbb{E}_{p(k_j \mid q_i)}[v_j], \quad k(q_i, k_j) = \exp\left( \frac{q_i k_j^{T}}{\sqrt{d}} \right) (8)
The self-attention mechanism processes the input values and generates the output by calculating the probability p(k_j | q_i). This process requires quadratic time computation and memory storage in the order of O(L_Q L_K), where L_Q and L_K are the numbers of queries and keys. Enhancing the performance of the Transformer requires additional computation, which limits its use in real applications. Several works [33,34] have addressed these limitations by exploiting the sparsity of the probability distribution computed by the self-attention mechanism. Motivated by this observation, a new kind of self-attention mechanism was proposed. We started by evaluating the learned attention patterns and found that only a small number of dot-product pairs contribute to the overall performance, while the others contribute negligibly. The main idea is therefore to eliminate the dot-product pairs that do not affect the performance.
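For illustration, the following minimal Python sketch (our own; variable names and toy dimensions are illustrative and not from the original implementation) computes the canonical attention of Equations (7) and (8). The L_Q × L_K score matrix it builds makes the quadratic time and memory cost explicit.

```python
import numpy as np

def canonical_attention(Q, K, V):
    """Canonical dot-product self-attention, Equations (7)-(8).
    Q: (L_q, d), K: (L_k, d), V: (L_k, d_v)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (L_q, L_k): quadratic in the length
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)             # p(k_j | q_i), Equation (8)
    return p @ V                                   # expectation of V under p

# toy usage: L = 96 time steps, d = 8
L, d = 96, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
out = canonical_attention(Q, K, V)                 # out.shape == (96, 8)
```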
Considering the i-th query q_i, its attention output over all keys is the combination of the probability p(k_j | q_i) with the values V. Dominant dot-product pairs encourage the corresponding query's attention probability distribution to deviate from the uniform distribution. If the probability p(k_j | q_i) is close to the uniform distribution q(k_j | q_i) = 1/L_K, then the output of the self-attention mechanism degenerates into a trivial, uniformly weighted sum of the V values. To identify the relevant queries, the similarity between the distributions p and q can be used. We proposed measuring this similarity using the Kullback–Leibler (KL) divergence [35]. The divergence between q and p can be computed using Equation (9):

\mathrm{KL}(q \parallel p) = \ln \sum_{l=1}^{L_K} e^{q_i k_l^{T}/\sqrt{d}} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{T}}{\sqrt{d}} - \ln L_K (9)
After eliminating the constant term, the sparsity measure for the i-th query can be computed using Equation (10):

M(q_i, K) = \ln \sum_{j=1}^{L_K} e^{q_i k_j^{T}/\sqrt{d}} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{T}}{\sqrt{d}} (10)
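The following short sketch (ours, for illustration) computes the sparsity measure of Equation (10) for all queries at once; as noted above, it differs from the KL divergence of Equation (9) only by the constant ln L_K.

```python
import numpy as np

def sparsity_measure(Q, K):
    """M(q_i, K) = lse_j(q_i k_j^T / sqrt(d)) - mean_j(q_i k_j^T / sqrt(d)), Equation (10)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (L_q, L_k)
    m = scores.max(axis=-1, keepdims=True)
    lse = np.log(np.exp(scores - m).sum(axis=-1)) + m[:, 0]   # stable log-sum-exp
    return lse - scores.mean(axis=-1)                      # high M -> "active" query

# Relation to Equation (9): KL(uniform || p_i) = M(q_i, K) - ln(L_K), i.e., the same ranking.
```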
A query with a high measure M(q_i, K) has a high probability of containing the dominant dot-product pairs that contribute to the overall performance. Based on this measurement, a probe sparse self-attention operation was proposed to replace the canonical operation. The probe sparse operation processes only a fixed number n of queries against each key. The attention based on the proposed operation can be computed using Equation (11):

A(Q, K, V) = \mathrm{Softmax}\left( \frac{\bar{Q} K^{T}}{\sqrt{d}} \right) V (11)

where \bar{Q} is a sparse matrix of the same size as Q that contains only the n queries with the highest measure M. A sampling factor c was proposed to control the number of queries n. Hence, the number of queries can be controlled based on Equation (12):

n = c \cdot \ln L_Q (12)
This relation reduces the dot-product computation for each query–key lookup to O(ln L_Q) and keeps the memory occupation of each layer in the order of O(L_K ln L_Q). Under the multi-head perspective, this attention generates different sparse query–key pairs for each head, which avoids a significant loss of information.
However, processing all of the queries for the measurement M requires calculating every dot-product pair, and the first term of the measurement (the log-sum-exp) may suffer from numerical instability. To overcome this issue, we suggest an empirical approximation for efficiently acquiring the query sparsity measurement. The approximate measurement can be computed using Equation (13):

\bar{M}(q_i, K) = \max_{j} \left\{ \frac{q_i k_j^{T}}{\sqrt{d}} \right\} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{T}}{\sqrt{d}} (13)
The max operator in the proposed measurement is less sensitive to zero values and offers good numerical stability. In practice, the self-attention mechanism accepts the same input length for queries and keys; considering L as this input length, the computational complexity of the probe sparse self-attention is O(L ln L).
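A simplified sketch of the resulting probe sparse attention (Equations (11)–(13)) is given below; it is our own illustration and omits multi-head splitting. For clarity it scores every query against all keys, whereas an efficient implementation would score each query against only a sampled subset of keys to keep the cost at O(L ln L). Following the uniform-distribution argument above, the non-selected (lazy) queries simply return the uniformly weighted average of V.

```python
import numpy as np

def probsparse_attention(Q, K, V, c=5):
    """Probe sparse self-attention, Equations (11)-(13) (simplified sketch)."""
    L_q, d = Q.shape
    u = min(L_q, int(np.ceil(c * np.log(L_q))))          # Equation (12): n = c * ln(L_Q)

    scores = Q @ K.T / np.sqrt(d)                        # full scores, for clarity only
    M_bar = scores.max(axis=-1) - scores.mean(axis=-1)   # Equation (13): max-mean measure
    top = np.argsort(-M_bar)[:u]                         # indices of the u "active" queries

    out = np.repeat(V.mean(axis=0, keepdims=True), L_q, axis=0)  # lazy queries -> mean of V
    s = scores[top]                                      # (u, L_k)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    out[top] = p @ V                                     # Equation (11) on the reduced query set
    return out
```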
Memory limitation is a hard challenge that hinders the adoption of Transformers in time-series forecasting. To overcome this limitation, we designed an encoder that processes longer sequential inputs while requiring less memory. For this purpose, a new component was proposed that combines a 1D convolution layer with an embedding layer to generate the input of the self-attention block. The encoder's purpose is to extract reliable long-range dependencies from the lengthy sequential inputs. The input is reshaped into a matrix representation; the matrix corresponding to the t-th input sequence is X^{t} \in \mathbb{R}^{L_x \times d_{model}}, where L_x is the input length and d_{model} is the embedding dimension.
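The embedding component is not detailed further here; the PyTorch sketch below shows one plausible realization under our own assumptions: a Conv1d value embedding with kernel size 3 and circular padding, added to a fixed sinusoidal positional encoding, producing X^t of shape (L_x, d_model). The class name, kernel size, padding mode, and default d_model = 512 are our assumptions, not specifications from the text.

```python
import math
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Sketch of the encoder input block: 1D conv value embedding + positional encoding."""
    def __init__(self, c_in: int, d_model: int = 512):
        super().__init__()
        self.value_conv = nn.Conv1d(c_in, d_model, kernel_size=3,
                                    padding=1, padding_mode="circular")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L_x, c_in) raw multivariate time series
        v = self.value_conv(x.transpose(1, 2)).transpose(1, 2)   # (batch, L_x, d_model)
        # fixed sinusoidal positional encoding, added element-wise
        L, d = v.shape[1], v.shape[2]
        pos = torch.arange(L, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d))
        pe = torch.zeros(L, d)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return v + pe                                            # X^t: (batch, L_x, d_model)

# usage: 96 past steps of a 7-variable series embedded into d_model = 512
emb = InputEmbedding(c_in=7, d_model=512)
X_t = emb(torch.randn(2, 96, 7))   # (2, 96, 512)
```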
Due to the use of the proposed probe sparse operation in the self-attention mechanism, the feature map of the encoder contains redundant combinations of the values V. To handle this problem, a distilling operation was proposed to create a focused self-attention feature map in the next layer by prioritizing the dominant features. The proposed distilling operation was inspired by dilated convolution [36]. Passing from layer j to layer j + 1, the distilling operation can be computed using Equation (14):

X_{j+1}^{t} = \mathrm{MaxPool}\left( \mathrm{ELU}\left( \mathrm{Conv1d}\left( [X_{j}^{t}]_{\mathrm{AB}} \right) \right) \right) (14)
where MaxPool represents the maximum pooling layer with a stride of 2; ELU is the exponential linear unit activation function; Conv1d is a one-dimensional convolution layer with a kernel size of 3; and [X_{j}^{t}]_{AB} is the output of the proposed self-attention block at layer j. The proposed decoder structure is presented in Figure 6.
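As an illustration of Equation (14), the following PyTorch sketch (ours) chains the three named components; the pooling kernel size of 3 is an assumption, since only the stride of 2 is specified above.

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    """X_{j+1} = MaxPool(ELU(Conv1d(X_j))), Equation (14)."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)  # kernel size assumed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d_model) output of the self-attention block
        y = self.conv(x.transpose(1, 2))   # convolve along the time axis
        y = self.pool(self.act(y))         # stride-2 pooling halves the sequence length
        return y.transpose(1, 2)           # (batch, ceil(L/2), d_model)

x = torch.randn(2, 96, 512)
print(DistillingLayer()(x).shape)          # torch.Size([2, 48, 512])
```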
The proposed distilling operation reduces the memory occupation to O((2 − ε) L log L), where ε is a small value. In order to increase the robustness of the distilling process, we constructed replicas of the main stack with the inputs halved and gradually reduced the number of self-attention distilling layers by removing one layer at a time, so that their output dimensions remain aligned. We then combined the outputs of each stack to obtain the encoder's final hidden representation.
The next goal is to design a decoder that generates long sequential outputs in a single forward pass. We adopted the standard decoder of the Transformer model, which consists of two identical multi-head attention layers stacked on top of one another. The input vector of the decoder is computed using Equation (15):

X_{de}^{t} = \mathrm{Concat}(X_{token}^{t}, X_{0}^{t}) \in \mathbb{R}^{(L_{token} + L_y) \times d_{model}} (15)
where X_{token}^{t} is the start token and X_{0}^{t} is the placeholder for the target sequence. In the probe sparse self-attention computation, the masked dot-products are set to −∞, which implements masked multi-head attention. By preventing each position from attending to subsequent positions, auto-regressive decoding is avoided. To generate the output, a fully connected layer is used, whose size depends on the number of forecast variables.
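To make Equation (15) and the masking concrete, the sketch below (our own construction) builds the decoder input by concatenating a start-token slice of the embedded input with a placeholder for the target sequence, here filled with zeros as an assumption, and applies the −∞ mask that blocks attention to subsequent positions.

```python
import torch

def decoder_input(x_enc: torch.Tensor, L_token: int, L_y: int) -> torch.Tensor:
    """Equation (15): concatenate a start-token slice with a placeholder for the targets."""
    # x_enc: (batch, L_x, d) embedded input sequence; the last L_token steps are reused
    x_token = x_enc[:, -L_token:, :]
    x_zero = torch.zeros(x_enc.size(0), L_y, x_enc.size(-1))   # zero placeholder (assumption)
    return torch.cat([x_token, x_zero], dim=1)                 # (batch, L_token + L_y, d)

def causal_mask(scores: torch.Tensor) -> torch.Tensor:
    """Set masked dot-products to -inf so each position cannot attend to later ones."""
    L = scores.size(-1)
    mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    return scores.masked_fill(mask, float("-inf"))

x_enc = torch.randn(2, 96, 512)
x_dec = decoder_input(x_enc, L_token=48, L_y=24)               # (2, 72, 512)
masked = causal_mask(torch.randn(2, 72, 72))                   # upper triangle is -inf
```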
To achieve our goal, generative inference was adopted: instead of using a specific flag as the start token, an earlier slice of the input sequence is taken as the token and placed before the output sequence. In this way, the proposed decoder generates all predictions in a single forward pass instead of the dynamic (step-by-step) decoding of the original Transformer.
As a loss function, we used the Mean Squared Error (MSE). The loss is propagated back from the decoder's output through the entire model, until reaching the input of the encoder.
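A brief sketch of this training objective, using the standard torch.nn.MSELoss and our own illustrative tensor shapes, is shown below; the loss is computed on the forecast positions of the decoder output and back-propagated through the decoder and encoder.

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()

# dec_out: (batch, L_token + L_y, n_vars) decoder output; target: (batch, L_y, n_vars)
dec_out = torch.randn(2, 72, 1, requires_grad=True)
target = torch.randn(2, 24, 1)

loss = criterion(dec_out[:, -24:, :], target)   # MSE on the L_y forecast positions only
loss.backward()                                  # gradients flow back to the encoder input
```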