Article

Mathematical Formulation of Learning and Its Computational Complexity for Transformers’ Layers

by Danilo Pietro Pau * and Fabrizio Maria Aymone
Department of Systems Research and Applications, STMicroelectronics, 20864 Agrate Brianza, Italy
* Author to whom correspondence should be addressed.
Eng 2024, 5(1), 34-50; https://doi.org/10.3390/eng5010003
Submission received: 20 October 2023 / Revised: 26 November 2023 / Accepted: 7 December 2023 / Published: 21 December 2023
(This article belongs to the Section Electrical and Electronic Engineering)

Abstract:
Transformers are the cornerstone of natural language processing and of increasingly complex sequential modelling tasks. Training these models, however, requires an enormous number of computations, with substantial economic and environmental impacts. An accurate estimation of the computational complexity of training makes it possible to anticipate the associated latency and energy consumption. Furthermore, with the advent of forward learning workloads, an estimation of the computational complexity of such neural network topologies is required in order to reliably compare backpropagation with these advanced learning procedures. This work describes a mathematical approach, independent of deployment on a specific target, for estimating the complexity of training a transformer model. To this end, the equations used during backpropagation and forward learning are derived for each layer, and their complexity is expressed in the form of MACCs and FLOPs. By adding these contributions according to how the layers are embodied in a complete topology and to the learning rule under consideration, the total complexity of the desired transformer workload can be estimated.

1. Introduction

Transformers [1] have revolutionized the field of artificial intelligence (AI) by achieving unprecedented accuracy over a broad variety of complex tasks, including natural language processing (NLP). Their performance, however, has been shown analytically to scale as a power law of the number of model parameters and the dataset size [2]. The inevitable consequences of this dependence are bigger model footprints, larger datasets and an increasing number of gradient descent iterations. The associated enormous amount of computation and memory usage has critical economic, time and environmental impacts. As an example, the recent 175-billion-parameter GPT-3 [3] was trained on 300 billion tokens [4] with a total compute of ∼3640 petaflop/s-days. According to [5], this process emitted 502 tonnes of carbon dioxide and cost 1.8 million USD. Given the pivotal role computational complexity plays when training a transformer, it is fundamental to provide an accurate estimation of it. Moreover, a variety of alternative learning rules to backpropagation (BP) were recently proposed [6,7]. A precise complexity analysis would allow one to reliably compare these algorithms and to quantify the advantages that forward learning would bring with respect to BP. Ideally, the training complexity should be obtained by considering the single operations performed at the hardware level. This is not convenient for several reasons. First of all, it strictly depends upon the device used, which is generally only partially known, restricting the generalizability of the prediction. Secondly, it is not trivial to obtain the low-level implementation of the training algorithm, which in turn depends on the AI runtime library being used. This paper is organized as follows: Section 2 describes the objectives of this work and its contributions; Section 3 cites the main related works in the known literature; Section 4 describes the notation and conventions adopted during the quantitative analysis of the learning algorithms; Section 5 reports the equations for learning with BP, PEPITA and MEMPEPITA and estimates their complexity for all transformer layers; Section 6 presents an example application of the results obtained, and Section 7 concludes the paper.

2. Key Contributions of This Work

In order to eliminate the dependence on the deployment device, this study introduces a mathematical approach that assesses the complexity of the layers in a transformer topology solely on the basis of their mathematical expressions, rather than of hardware operations. In this respect, the contributions brought by this paper can be summarized as follows:
A description of the equations implemented in BP, PEPITA and MEMPEPITA for all transformer layers;
A mathematical derivation of the gradients of the loss function with respect to the weights and activations;
A quantitative complexity analysis, in terms of multiply-and-accumulate operations (MACCs) and floating point operations (FLOPs), of each layer for the forward pass, backward pass and weight update.

3. Related Works

3.1. Automatic Differentiation

Currently, most AI algorithms are implemented in two major libraries: TensorFlow [8] and PyTorch [9]. Both frameworks use reverse-mode automatic differentiation (i.e., autodiff) [10], namely BP. Autodiff creates a computational graph from the mathematical expression considered, where each node describes an operation and each edge a variable. During the forward pass, intermediate variables are populated, and each node is complemented with the derivatives of its outputs with respect to its inputs. During the backward pass, the gradient is obtained by leveraging the chain rule of differential calculus to compute the partial derivatives of the objective with respect to the weights. The operations involved during training can be referred to as forward pass, backward pass (gradient of the loss with respect to the activations) and weight update (gradient of the loss with respect to the weights). In several works [3,11], the complexity of the backward pass and weight update is assumed to be 2× that of the forward pass, which is true only in specific cases such as fully connected and convolutional layers. Regarding transformers, Ref. [2] empirically identified the relations between performance, training time, dataset size, number of parameters and amount of computation. Moreover, also in that case, the computational complexity of training was approximated as 3× that of a forward pass. To the best of the authors' knowledge, no work has analytically described the computational complexity of the transformer topology.

3.2. Alternatives to Backpropagation

It is well known from theory that BP does not describe the learning process happening in the human brain [12,13]. There are four main aspects considered to be in contrast with neuro-biological observations. Firstly, during the backward pass, the weights previously used during the forward pass are reused to backpropagate the error. Considering that synapses in the brain are unidirectional, this characteristic of BP gives rise to the "weight symmetry" problem [14]. Secondly, when calculating the error gradient, the activities of the neurons computed during the forward pass are left unaffected. The freezing of the activities during the backward pass is incompatible with the behavior of feedback connections in neural circuits, through which the signal travels by modulating activities [15]. Thirdly, in BP the modification of synaptic weights is influenced by downstream neurons and synapses, whereas synaptic learning in the brain is predominantly governed by localized signals that are contingent upon the activity of the interconnected neurons [16]. Lastly, in order to update the weights of the l-th layer, the forward pass has to end, and the backward pass has to arrive at such a layer. This means that learning cannot happen in an online fashion, contrary to biological evidence. This problem is referred to as "update-locking" [17,18]. With the hope of creating a correspondence between deep learning and the nature of the brain, a vast research field focused on finding biologically plausible alternatives to BP has emerged. Addressing the "weight symmetry" issue, learning has been found to happen even when the error is backpropagated with matrices which only share the sign with the forward weights [19] or which are random and fixed, as in feedback alignment (FA) [20]. The latter can be modified by propagating the error directly from the output layer to each layer through random connectivity matrices. Such a technique is denoted as direct feedback alignment [21]. A broad variety of other algorithms have been proposed in the literature [22,23,24,25,26]; however, their discussion goes beyond the scope of this work.

3.3. Forward Learning

Recently, two promising bioplausible learning algorithms were proposed: forward-forward (FF) [6] and PEPITA [7]. The use of forward-only passes solves several implausible aspects of BP. Their effects on memory usage and computational complexity have been studied by [27] on the MLCommons/Tiny industrial benchmarks [28], suggesting that FF is unsuitable for multiclass classification [27]. Moreover, Ref. [27] proposed MEMPEPITA, a memory-efficient version of PEPITA, which introduces an additional forward pass, saving on average a third of the RAM at the expense of a third more complexity.

3.3.1. PEPITA

PEPITA [7] performs two forward passes. The first pass, named the standard pass, calculates the error of the model's output with respect to the ground truth. As the output and input dimensions are generally different, the error is projected onto the input through a fixed random matrix F with zero mean and a small standard deviation (e.g., 0.05·√(2/fan_in) [29]). The second pass, named the modulated pass, transforms the input by adding to it the projected error calculated by the standard pass and computes the corresponding activations. The difference between the activations of the two passes is then used to update the weights. The weights of the last layer can be updated by the error at the output layer as in BP without compromising accuracy [7]. Algorithm 1 illustrates the procedure implemented in PEPITA, where a_0, a_l and a_L are the activations of the first, l-th and last layer, respectively, during the standard pass, a_0^err, a_l^err and a_L^err are the activations of the first, l-th and last layer, respectively, during the modulated pass, σ_l is the nonlinearity of the l-th layer, and W_l are the weights of the l-th layer. A theoretical analysis of the learning dynamics of PEPITA was performed in [29]. By observing that the perturbation is small compared to the input (Fe ≪ x), it was possible to perform a Taylor expansion of the presynaptic term a_l − a_l^err, thus obtaining the update rule for the first layer, as described in Equation (1). It was considered that W(t+1) = W(t) − η ΔW, with η symbolizing the learning rate, and x was used instead of (x + Fe) since the small perturbation was determined to have a negligible impact on performance.
Algorithm 1: PEPITA
Given: features x and label target
  Standard pass
   a_0 = x
   for l = 1, …, L do
    a_l = σ_l(W_l a_{l−1})
   end for
   e = a_L − target
  Modulated pass
   a_0^err = x + F e
   for l = 1, …, L do
    a_l^err = σ_l(W_l a_{l−1}^err)
    Weight update
     W_l := W_l − η (a_l − a_l^err) · (a_{l−1}^err)^T
   end for
ΔW_1 ≈ [(W_1 F e) ⊙ σ_1′] x^T    (1)
PEPITA essentially adopts an update similar to DFA, and equivalent to FA in two-layer networks. However, it uniquely employs an adaptive feedback matrix (AF) in which the network weights modulate the random component. In this way, the learning behavior of PEPITA observed experimentally was justified theoretically.
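For concreteness, the following NumPy sketch applies Algorithm 1 to a small fully connected network. The network shape, the ReLU nonlinearity, the learning rate and the scaling of the random matrix F are illustrative assumptions made for this example, not values prescribed by [7].

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 128, 10]                      # input, hidden and output widths (example values)
W = [rng.normal(0.0, 0.05, (sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
F = rng.normal(0.0, 0.05, (sizes[0], sizes[-1]))   # fixed random projection of the error onto the input
sigma = lambda v: np.maximum(v, 0.0)        # ReLU, chosen purely as an example nonlinearity
eta = 0.01                                  # learning rate (illustrative)

def pepita_step(x, target):
    # Standard pass: compute and store all activations.
    a = [x]
    for W_l in W:
        a.append(sigma(W_l @ a[-1]))
    e = a[-1] - target                      # output error
    # Modulated pass: perturb the input with the projected error, then update layer by layer.
    a_err = x + F @ e
    for l, W_l in enumerate(W):
        a_err_next = sigma(W_l @ a_err)
        W[l] = W_l - eta * np.outer(a[l + 1] - a_err_next, a_err)
        a_err = a_err_next

x = rng.random(sizes[0])
target = np.eye(sizes[-1])[3]               # one-hot label for class 3
pepita_step(x, target)
```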

3.3.2. MEMPEPITA

In its original form, the PEPITA algorithm [7] requires retaining the activations calculated during the standard pass for the subsequent evaluation of (a_l − a_l^err) in the modulated pass. This requirement unfortunately matches the memory demands characteristic of backpropagation (BP). To circumvent this memory constraint, one can introduce a second standard pass running alongside the modulated pass. This approach enables the recalculation of the activations needed for the weight update. However, this solution introduces an additional computational overhead. This variant of the original algorithm, termed MEMPEPITA, was presented in [27]; it significantly enhances memory efficiency by avoiding the storage of intermediate activations, which is detrimental in deep neural networks (DNNs). The variant, detailed in Algorithm 2, while maintaining the core principles of PEPITA, offers a more resource-conscious alternative, particularly in scenarios where memory resources are a critical constraint.
Algorithm 2: MEMPEPITA
Given: features x and label target
  Standard pass
   a_0 = x
   for l = 1, …, L do
    a_l = σ_l(W_l a_{l−1})
   end for
   e = a_L − target
  Modulated + 2nd standard pass
   a_0^err = x + F e
   for l = 1, …, L do
    Standard pass
     a_l = σ_l(W_l a_{l−1})
    Modulated pass
     a_l^err = σ_l(W_l a_{l−1}^err)
    Weight update
     W_l := W_l − η (a_l − a_l^err) · (a_{l−1}^err)^T
   end for
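Continuing the previous sketch (reusing W, F, sigma and eta), a MEMPEPITA step can be sketched as follows: the standard-pass activations are no longer stored but recomputed layer by layer alongside the modulated pass, trading one extra forward pass for the reduced memory footprint.

```python
def mempepita_step(x, target):
    # First standard pass: only the final output is kept, to form the error.
    a = x
    for W_l in W:
        a = sigma(W_l @ a)
    e = a - target
    # Second standard pass interleaved with the modulated pass (no stored activations).
    a, a_err = x, x + F @ e
    for l, W_l in enumerate(W):
        a_next = sigma(W_l @ a)             # standard activation, recomputed on the fly
        a_err_next = sigma(W_l @ a_err)     # modulated activation
        W[l] = W_l - eta * np.outer(a_next - a_err_next, a_err)
        a, a_err = a_next, a_err_next
```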

4. Notation and Conventions

The objective of the quantitative analysis in this paper is to accurately model the mathematical equations behind BP, PEPITA and MEMPEPITA in order to estimate the computational complexity [30,31,32] of training the transformer architecture. Therefore, it is necessary to clearly define beforehand the notations and conventions used in the proposed analysis. Each "mathematical" operation (e.g., exponentiation, sum, product, division) is considered one 32-bit FLOP, even if the underlying hardware may require performing more operations. Hence, each MACC is equivalent to two FLOPs, one ADD and one MULTIPLY [33]. Even if a MULTIPLY operation is more complex than an ADD operation when implemented in hardware, this work considers them both equivalent to one FLOP, as each consists of one mathematical operation.
For the sake of a clear and lean notation, the symbol ∂y/∂x (i.e., the partial derivative) is used to indicate the gradient, whose more rigorous symbol would be ∇_x y. This choice was made because ∂y/∂x highlights the target with respect to which the gradient is computed. Moreover, for each layer, the total numbers of MACCs and FLOPs estimated for the macro-operations (forward pass, backward pass, weight update, etc.) are framed in a box to highlight them. Lastly, a new operator, indicated with ×_slice, is introduced. This operator receives a 2d matrix of size N × M as its left operand and a 3d matrix of size N × M × K as its right operand, and it outputs a 2d matrix of size N × K. The operator multiplies the first row of the left operand by the first 2d slice of the right operand, obtaining a row vector which corresponds to the first row of the output matrix. It then obtains the second row of the output matrix by multiplying the second row of the left operand by the second 2d slice of the right operand. This process is iterated N times, once for each row of the left operand.
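As an aid to the reader, a possible NumPy realization of the ×_slice operator is sketched below; the function names are arbitrary, and the einsum formulation is only one of several equivalent implementations.

```python
import numpy as np

def slice_product(A, B):
    # A: (N, M), B: (N, M, K) -> (N, K); row n of A multiplies the n-th M x K slice of B.
    assert A.shape == B.shape[:2]
    return np.einsum('nm,nmk->nk', A, B)

def slice_product_loop(A, B):
    # Equivalent loop form, making the N * M * K multiply-accumulate count explicit.
    N, M = A.shape
    K = B.shape[2]
    out = np.zeros((N, K))
    for n in range(N):
        out[n] = A[n] @ B[n]
    return out

A = np.random.rand(4, 3)
B = np.random.rand(4, 3, 5)
assert np.allclose(slice_product(A, B), slice_product_loop(A, B))
```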
The transformer is composed of an encoder and a decoder, and its architecture is reported in Figure 1 [1]. Given such a structure, a collection of hyperparameters is needed to uniquely identify a specific architecture embodiment. These are reported in Table 1 and are used as parameters throughout the analysis.

5. Complexity Analysis

The method adopted for estimating the complexity of a specific learning procedure involves subdividing it into a series of macro-operations (e.g., forward pass, backward pass, weight update, error projection), as reported in Table 2. The total complexity of a macro-operation is obtained by calculating the complexity of performing it at each single layer of the transformer and summing over all layers. In the following paragraphs, the structure and functionality of each layer are described, and their complexity for the forward pass, backward pass and weight update is computed.

5.1. Embedding Layer

The embedding layer consists of a matrix W_emb of size voc_size × d_model, where each row corresponds to the embedding of a token.

5.1.1. Forward Pass

Given a sequence of M tokens, these are represented as a matrix T of size M × voc_size, where each row is a one-hot-encoded representation of a token. By multiplying the token matrix with the embedding matrix, a matrix E = T W_emb of size M × d_model is obtained, where each row corresponds to the embedding representation of the original token in the sequence. Hence, the complexity of an embedding layer for a forward pass is, as in [11],
MACCs = M × voc_size × d_model
FLOPs = 2M × voc_size × d_model
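As a minimal illustration of the counting convention of Section 4 (one MACC equals two FLOPs), the embedding-layer forward complexity above can be evaluated with the following sketch; the function name and the hyperparameter values in the example call are illustrative, not taken from the paper.

```python
def embedding_forward_complexity(M, voc_size, d_model):
    # One MACC per multiply-accumulate of the (M x voc_size) by (voc_size x d_model) product,
    # each counted as two FLOPs.
    maccs = M * voc_size * d_model
    return maccs, 2 * maccs

print(embedding_forward_complexity(M=128, voc_size=32000, d_model=512))
```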

5.1.2. Weight Update (Only PEPITA and MEMPEPITA)

In BP, the gradient of the loss function with respect to the input tokens is directly calculated during the backward pass of the next layer, without involving the embedding matrix. Such a gradient is directly used to update the rows of the embedding layer corresponding to the tokens considered. Therefore, in BP, no computation is needed by the embedding layer for the backward pass and weight update. On the other hand, PEPITA updates the embedding layer as it does the other layers, by performing a matrix multiplication. The resulting complexity of the weight update is the same as that of the forward pass.
MACCs = M × voc_size × d_model
FLOPs = 2M × voc_size × d_model

5.2. Position Embeddings

A positional embedding matrix P of size max_len × d_model is used to store the positional embeddings for each position, up to the maximum number of tokens in the context. To encode positional information, the first M rows of the positional matrix are added to E. The matrix obtained, of size M × d_model, is indicated with X. The positional embedding matrix can be learned, or it can be fixed following the sinusoidal positional encoding proposed in [1]. Being a simple addition, its complexity is negligible and can be considered already incorporated in the MACCs/FLOPs of the embedding layer.

5.3. Multihead Attention

This is the most important block in the transformer, and it occupies the first stage of the encoder layer and the first two stages of the decoder layer. Its structure is reported in Figure 2 [1].

5.3.1. Forward Pass

The first step consists in identifying the input query, key and value matrices: X_Q of size M × d_model and X_K, X_V of size N × d_model. Then, for each head i, Q_i, K_i and V_i are computed, where the number of heads h is determined by dividing d_model by d_k. This is achieved by multiplying the inputs with appropriate weight matrices of size d_model × d_k.
(Q_i, K_i, V_i) = (X_Q W_i^Q, X_K W_i^K, X_V W_i^V)
MACCs = M × d_model × d_model + 2N × d_model × d_model
FLOPs = 2M × d_model × d_model + 4N × d_model × d_model
Then, the Q_i, K_i and V_i matrices are fed into the attention mechanism of each head. It is assumed, in accordance with the conventions adopted in [11], that computing the softmax of an array of size N requires 5N FLOPs.
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i
MACCs = 2M × N × d_model
FLOPs = 4M × N × d_model + 5M × N × h + M × N × h
The outputs of the attention of all heads are concatenated and multiplied by a weight matrix W^O of size d_model × d_model. Summing up, the output of a multihead attention block is
Multihead(X_Q, X_K, X_V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q_i, K_i, V_i)
MACCs = M × d_model × d_model
FLOPs = 2M × d_model × d_model
The multihead attention block in the decoder builds the query matrices Q_i starting from the output of the previous masked multihead attention block in the decoder, X_dec, while the key and value matrices are obtained from the output of the encoder stack, X_enc. Namely,
(Q_i, K_i, V_i) = (X_dec W_i^Q, X_enc W_i^K, X_enc W_i^V)
The masked multihead attention applies a mask to the softmax input during training, so that tokens cannot attend to the next tokens in the sequence. The complexity of this operation is not considered, as it consists in setting certain cells of the matrix to −inf.
The total number of MACCs is
2M × d_model × d_model + 2M × N × d_model + 2N × d_model × d_model
The total number of FLOPs is
4M × d_model × d_model + 4M × N × d_model + 4N × d_model × d_model + 6M × N × h
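The forward-pass totals above can be collected in a small helper, sketched below in Python; the function name and signature are assumptions made for illustration.

```python
def mha_forward_complexity(M, N, d_model, h):
    # M: query sequence length, N: key/value sequence length (M == N in self-attention).
    maccs = (2 * M * d_model * d_model
             + 2 * M * N * d_model
             + 2 * N * d_model * d_model)
    flops = (4 * M * d_model * d_model
             + 4 * M * N * d_model
             + 4 * N * d_model * d_model
             + 6 * M * N * h)
    return maccs, flops
```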

5.3.2. Backward Pass and Weight Update

The learnable parameters in the multihead attention block are the weight matrices W_i^Q, W_i^K, W_i^V and W^O. During backpropagation, the derivatives of the loss with respect to these matrices and with respect to the inputs X_Q, X_K and X_V must therefore be calculated. For the multihead encoder or the masked multihead decoder, X_Q = X_K = X_V = X; for the multihead decoder, X_Q = X_dec and X_K = X_V = X_enc. Let us denote f(X_Q, X_K, X_V) = MultiheadAttention(X_Q, X_K, X_V) and indicate with L the loss function.
f = Concat(head_1, …, head_h) W^O
∂L/∂W^O = Concat(head_1, …, head_h)^T ∂L/∂f
∂L/∂Concat(head_1, …, head_h) = ∂L/∂f (W^O)^T
Then, the derivatives of the attention module, considering the different heads, result in the following:
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i
∂L/∂softmax = ∂L/∂Attention V_i^T
Every row of the softmax output is independent of the other rows. Given a row x_1, x_2, x_3, …, x_n and the softmax of that row s_1, s_2, s_3, …, s_n, the Jacobian of the softmax with respect to the row is
[ s_1·(1−s_1)   −s_2·s_1      −s_3·s_1     …   −s_n·s_1    ]
[ −s_1·s_2      s_2·(1−s_2)   −s_3·s_2     …   −s_n·s_2    ]
[     ⋮              ⋮             ⋮        ⋱       ⋮       ]
[ −s_1·s_n      −s_2·s_n      −s_3·s_n     …   s_n·(1−s_n) ]
It is possible to define a 3-dimensional matrix composed of the Jacobians of each row of the softmax layer, denoted as ∂softmax/∂X, where X is the 2-dimensional argument of softmax(X). In order to maintain a concise notation, the matrix product ×_slice introduced in Section 4 is used, which multiplies each row of the 2-dimensional left operand by the corresponding slice of the 3d right operand. The complexity of this product is the same as that of a regular matrix product.
∂L/∂(Q_i K_i^T) = ∂L/∂softmax ×_slice (∂softmax/∂(Q_i K_i^T / √d_k))^T · 1/√d_k
Now, the derivative of the loss function with respect to Q_i, K_i, V_i is obtained.
∂L/∂Q_i = ∂L/∂(Q_i K_i^T) K_i
∂L/∂K_i = (∂L/∂(Q_i K_i^T))^T Q_i
∂L/∂V_i = softmax^T ∂L/∂Attention
The derivatives with respect to the inputs X_Q, X_K and X_V for each head i are the following:
(∂L/∂X_Q)_i = ∂L/∂Q_i (W_i^Q)^T
(∂L/∂X_K)_i = ∂L/∂K_i (W_i^K)^T
(∂L/∂X_V)_i = ∂L/∂V_i (W_i^V)^T
Then, the derivatives with respect to the weight matrices are computed:
∂L/∂W_i^Q = (X_Q)^T ∂L/∂Q_i
∂L/∂W_i^K = (X_K)^T ∂L/∂K_i
∂L/∂W_i^V = (X_V)^T ∂L/∂V_i
To obtain the derivatives with respect to X_Q, X_K and X_V, the contributions of all heads must be added. As X_Q, X_K and X_V coincide in the encoder, all of these contributions are added together. In the decoder, only the derivatives with respect to X_K and X_V are added together, as they both come from the encoder. Let N be the dimension of X_K and X_V and M be the dimension of X_Q.
The complexity for each operation during the backward pass can be calculated as follows:
∂L/∂Concat(head_1, …, head_h) = ∂L/∂f (W^O)^T
MACCs = M × d_model × d_model
FLOPs = 2M × d_model × d_model
∂L/∂W^O = Concat(head_1, …, head_h)^T ∂L/∂f
MACCs = d_model × M × d_model
FLOPs = 2 d_model × M × d_model
In order to compute the derivatives of the loss function with respect to X_V and W_i^V, the following complexities are obtained:
∂L/∂V_i = softmax^T ∂L/∂Attention_i
MACCs = N × M × d_k × h = N × M × d_model
FLOPs = 2N × M × d_model
(∂L/∂X_V)_i = ∂L/∂V_i (W_i^V)^T
∂L/∂X_V = Σ_{i=1}^{h} (∂L/∂X_V)_i
MACCs = N × d_k × d_model × h = N × d_model × d_model
FLOPs = 2N × d_model × d_model
∂L/∂W_i^V = (X_V)^T ∂L/∂V_i
MACCs = d_model × N × d_k × h = d_model × N × d_model
FLOPs = 2 d_model × N × d_model
In order to compute the derivatives of the loss function with respect to X_Q, X_K, W_i^Q and W_i^K, the following complexities are obtained:
∂softmax/∂(Q_i K_i^T / √d_k)
MACCs = 0
FLOPs = N × N × M × h
∂L/∂(Q_i K_i^T) = ∂L/∂Attention V_i^T ×_slice (∂softmax/∂(Q_i K_i^T / √d_k))^T · 1/√d_k
MACCs = M × d_model × N + M × N × M × h
FLOPs = 2M × d_model × N + 2M × N × M × h + M × N × h
∂L/∂Q_i = ∂L/∂(Q_i K_i^T) K_i
MACCs = M × N × d_model
FLOPs = 2M × N × d_model
(∂L/∂X_Q)_i = ∂L/∂Q_i (W_i^Q)^T
∂L/∂X_Q = Σ_{i=1}^{h} (∂L/∂X_Q)_i
MACCs = M × d_model × d_model
FLOPs = 2M × d_model × d_model
∂L/∂W_i^Q = (X_Q)^T ∂L/∂Q_i
MACCs = d_model × M × d_model
FLOPs = 2 d_model × M × d_model
∂L/∂K_i = (∂L/∂(Q_i K_i^T))^T Q_i
MACCs = N × M × d_model
FLOPs = 2N × M × d_model
(∂L/∂X_K)_i = ∂L/∂K_i (W_i^K)^T
∂L/∂X_K = Σ_{i=1}^{h} (∂L/∂X_K)_i
MACCs = N × d_model × d_model
FLOPs = 2N × d_model × d_model
∂L/∂W_i^K = (X_K)^T ∂L/∂K_i
MACCs = d_model × N × d_model
FLOPs = 2 d_model × N × d_model
In the case of the multihead attention encoder and of the masked multihead attention decoder, it should be considered that X_Q = X_K = X_V (M = N), and the derivatives with respect to X_Q, X_K and X_V are added. Conversely, in the multihead attention decoder, it holds that X_Q ≠ X_K = X_V (M ≠ N), and only the derivatives with respect to X_K and X_V are added. The complexity of these additions is marginal and can be considered already incorporated in the previous MACCs/FLOPs.
In conclusion, the backward pass has a complexity of
MACCs = 2M × d_model × d_model + 2N × d_model × d_model + 4M × N × d_model + M × N × M × h
FLOPs = 4M × d_model × d_model + 4N × d_model × d_model + 8M × N × d_model + 2M × N × M × h + M × N × h
and the weight update has a complexity of
MACCs = 2M × d_model × d_model + 2N × d_model × d_model
FLOPs = 4M × d_model × d_model + 4N × d_model × d_model
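Analogously, the backward-pass and weight-update totals just derived can be evaluated with the following sketch, written in the same style as the forward-pass helper.

```python
def mha_backward_complexity(M, N, d_model, h):
    maccs = (2 * M * d_model**2 + 2 * N * d_model**2
             + 4 * M * N * d_model + M * N * M * h)
    flops = (4 * M * d_model**2 + 4 * N * d_model**2
             + 8 * M * N * d_model + 2 * M * N * M * h + M * N * h)
    return maccs, flops

def mha_weight_update_complexity(M, N, d_model):
    maccs = 2 * M * d_model**2 + 2 * N * d_model**2
    return maccs, 2 * maccs
```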

5.4. Feed-Forward Network

The feed-forward network (FFN) is a 2-layer neural network, where the output of the first layer has size M × d_ff and that of the second M × d_model. Only the first layer uses an activation function.
FFN(x) = GELU(x W_1 + b_1) W_2 + b_2

5.4.1. Forward Pass

MACCs = 2M × d_model × d_ff
The GELU activation is assumed to cost 8 FLOPs in the forward pass and 13 FLOPs for computing its derivative.
The FLOPs, accounting for the biases and the nonlinearity, are
FLOPs = 4M × d_model × d_ff + 9M × d_ff + M × d_model

5.4.2. Backward Pass

The backward pass is characterized by the following complexity:
MACCs = 2M × d_ff × d_model
FLOPs = 4M × d_model × d_ff + 13M × d_ff

5.4.3. Weight Update

The weight update requires the following complexity:
MACCs = 2M × d_ff × d_model
FLOPs = 4M × d_ff × d_model
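The three totals of this subsection can be gathered as follows; the dictionary layout and the function name are illustrative choices.

```python
def ffn_complexity(M, d_model, d_ff):
    # Returns (MACCs, FLOPs) pairs for the three macro-operations of Section 5.4,
    # including the assumed 8-FLOP GELU and its 13-FLOP derivative.
    forward = (2 * M * d_model * d_ff,
               4 * M * d_model * d_ff + 9 * M * d_ff + M * d_model)
    backward = (2 * M * d_model * d_ff,
                4 * M * d_model * d_ff + 13 * M * d_ff)
    weight_update = (2 * M * d_model * d_ff,
                     4 * M * d_model * d_ff)
    return {"forward": forward, "backward": backward, "weight_update": weight_update}
```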

5.5. Add and Norm

After each multihead attention and feed-forward block, the input x to the block is added to the block's output, denoted Sublayer(x), and a layer normalization is applied.
LayerNorm(x + Sublayer(x))
The layer normalization normalizes the features across each token, multiplies the result by γ and adds β, where γ and β are learnable parameters.
LayerNorm(x) = (x − E[x]) / √(Var[x]) · γ + β

5.5.1. Forward Pass

This operation does not properly constitute any MACCs:
MACCs = 0
The operations performed for each neuron are considered; the square root is only performed once. The per-neuron operations are an addition for the mean; a subtraction, a square and an addition for the variance; then a subtraction, a division, a bias (add) and a scale (multiply). This results in 8 FLOPs per neuron. Furthermore, the FLOPs relative to the addition between x and the sublayer output should also be considered:
FLOPs = 9M × d_model

5.5.2. Backward Pass and Weight Update

To train the parameters γ and β, we first need to compute the derivative of the layer normalization. The layer-normalized activation matrix is denoted as z.
∂L/∂γ_j = Σ_{i=1}^{M} ∂L/∂Layernorm_ij · z_ij
MACCs = M × d_model
FLOPs = 2M × d_model
∂L/∂β_j = Σ_{i=1}^{M} ∂L/∂Layernorm_ij
MACCs = 0
FLOPs = M × d_model
∂L/∂z_ij = ∂L/∂Layernorm_ij · γ_j
MACCs = 0
FLOPs = M × d_model
The Jacobian of the activation vector z_i with respect to x_i is defined as jac_i = { ∂z_ij/∂x_ik }_jk = { (1/σ) (δ_jk − 1/d_model − (x_j − μ)(x_k − μ)/(d_model σ²)) }_jk. By stacking the Jacobians of the various rows z_i, a 3d matrix jac is obtained, where each slice corresponds to the Jacobian of one row. The operations involved in this computation do not properly constitute MACCs. The FLOPs needed to calculate each element of the Jacobian are 3 multiplications, 2 divisions and 4 additions, namely 9 FLOPs per element.
MACCs = 0
FLOPs = 9 × M × d_model × d_model
To obtain the derivative with respect to the input,
∂L/∂X = ∂L/∂z ×_slice jac
MACCs = M × d_model × d_model
FLOPs = 2M × d_model × d_model
Then, the FLOPs used to add the derivative coming from the skip connection should also be taken into account:
MACCs = 0
FLOPs = M × d_model
The total complexity of the backward pass is
MACCs = M × d_model × d_model
FLOPs = 11M × d_model × d_model + 2M × d_model
The total complexity of the weight update is
MACCs = M × d_model
FLOPs = 3M × d_model
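Since the layer-normalization Jacobian is the least obvious term in this subsection, the following NumPy sketch checks the analytical expression for jac_i against a central finite-difference approximation; the row length of 8 and the random test vector are arbitrary choices.

```python
import numpy as np

def layernorm_jacobian(x):
    # Analytical Jacobian of z = (x - mean(x)) / std(x) for one row x of length d,
    # matching the expression given above (population variance, no epsilon term).
    d = x.size
    mu, var = x.mean(), x.var()
    sigma = np.sqrt(var)
    outer = np.outer(x - mu, x - mu)
    return (np.eye(d) - 1.0 / d - outer / (d * var)) / sigma

def z(x):
    return (x - x.mean()) / x.std()

# Finite-difference check of the analytical expression.
rng = np.random.default_rng(0)
x = rng.normal(size=8)
eps = 1e-6
J = layernorm_jacobian(x)
J_num = np.stack([(z(x + eps * np.eye(8)[k]) - z(x - eps * np.eye(8)[k])) / (2 * eps)
                  for k in range(8)], axis=1)
print(np.allclose(J, J_num, atol=1e-5))     # expected: True
```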

5.6. Softmax Layer

At the end of the transformer, there is a softmax layer, where the weight matrix W_S is of size d_model × voc_size.
Softmax(X W_S)

5.6.1. Forward Pass

The forward pass requires
MACCs = M × d_model × voc_size
The softmax function requires 5N FLOPs for an array of N elements. Hence, the number of FLOPs is
FLOPs = 2M × d_model × voc_size + 5M × voc_size

5.6.2. Backward Pass and Weight Update

The derivative of the loss with respect to Z, with Z being the product of X with W_S, is ∂L/∂Z = Softmax(Z) − targets. As the target is usually a one-hot-encoded vector, it is assumed that this operation requires no FLOPs or MACCs.
∂L/∂X = ∂L/∂Z (W_S)^T
MACCs = M × voc_size × d_model
FLOPs = 2M × voc_size × d_model
∂L/∂W_S = X^T ∂L/∂Z
MACCs = d_model × M × voc_size
FLOPs = 2 d_model × M × voc_size
The total complexity of the backward pass is
MACCs = M × voc_size × d_model
FLOPs = 2M × voc_size × d_model
The total complexity of the weight update is
MACCs = d_model × M × voc_size
FLOPs = 2 d_model × M × voc_size

5.7. Error Projection (Only PEPITA and MEMPEPITA)

The output error has dimensionality M × voc_size, which is the same dimensionality as the decoder input. Therefore, a projection matrix to project it onto the decoder input is not needed. On the other hand, an attention mechanism is used to project the error of dimensionality M × d_model onto the dimensionality of the encoder input, N × d_model.
T_err^enc = Attention(T^enc, T_err^dec, T_err^dec)
MACCs = 2N × M × voc_size
FLOPs = 4N × M × voc_size + 6N × M

6. Exemplary Application

To better explain the applicability of the proposed mathematical formulation, this section reports the complexity estimation, in terms of MACCs, of a one-block encoder-only simplified architecture trained with BP, PEPITA or MEMPEPITA. The layers involved in the architecture are the following: embedding layer (Section 5.1), multihead attention (Section 5.3), add and norm (Section 5.5), feed-forward network (Section 5.4) and softmax (Section 5.6). To compute the number of MACCs required for a forward pass, its complexity at each layer is added together.
MACCs_forward = M × voc_size × d_model + 2M × d_model^2 + 2M × N × d_model + 2N × d_model^2 + 0 + 2M × d_model × d_ff + 0 + M × d_model × voc_size
Analogously, the total numbers of MACCs for the backward pass and the weight update are the following:
MACCs_backward = 0 + 2M × d_model^2 + 2N × d_model^2 + 4M × N × d_model + M × N × M × h + M × d_model^2 + 2M × d_model × d_ff + M × d_model^2 + d_model × M × voc_size
MACCs_weight update = M × voc_size × d_model + 2M × d_model^2 + 2N × d_model^2 + M × d_model + 2M × d_model × d_ff + M × d_model + M × voc_size × d_model
The first term of the sum in the weight-update MACC estimation shall be discarded when considering BP. As the output dimension coincides with the input dimension for an encoder-only architecture, the error projections for PEPITA and MEMPEPITA are not required. Referring to Table 2, the total numbers of MACCs for training a one-block encoder-only transformer with the different learning procedures are the following:
MACCs BP = MACCs forward + MACCs backward + MACCs weight update
MACCs PEPITA = 2 MACCs forward + MACCs weight update
MACCs MEMPEPITA = 3 MACCs forward + MACCs weight update
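The three expressions above can be combined into a single sketch that returns the total training MACCs of the one-block encoder-only example for each learning rule; the function name and the hyperparameter values in the example call are illustrative assumptions.

```python
def encoder_block_training_maccs(M, voc_size, d_model, d_ff, h, rule="BP"):
    N = M                                   # encoder self-attention: key/value length equals query length
    forward = (M * voc_size * d_model + 2 * M * d_model**2 + 2 * M * N * d_model
               + 2 * N * d_model**2 + 2 * M * d_model * d_ff + M * d_model * voc_size)
    backward = (2 * M * d_model**2 + 2 * N * d_model**2 + 4 * M * N * d_model
                + M * N * M * h + M * d_model**2 + 2 * M * d_model * d_ff
                + M * d_model**2 + d_model * M * voc_size)
    weight_update = (2 * M * d_model**2 + 2 * N * d_model**2 + M * d_model
                     + 2 * M * d_model * d_ff + M * d_model + M * voc_size * d_model)
    if rule == "BP":
        # BP: embedding update term discarded, one forward and one backward pass.
        return forward + backward + weight_update
    embedding_update = M * voc_size * d_model        # PEPITA/MEMPEPITA also update the embedding matrix
    passes = {"PEPITA": 2, "MEMPEPITA": 3}[rule]     # MEMPEPITA adds a third forward pass
    return passes * forward + weight_update + embedding_update

# Example call with illustrative hyperparameter values:
for rule in ("BP", "PEPITA", "MEMPEPITA"):
    print(rule, encoder_block_training_maccs(M=128, voc_size=32000, d_model=512,
                                             d_ff=2048, h=8, rule=rule))
```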

7. Conclusions

In this work, the equations behind BP (reverse-mode autodiff), PEPITA and MEMPEPITA for the layers of a generic transformer architecture were derived and described. The computational complexity of the forward pass, backward pass and weight update was expressed in terms of MACCs and FLOPs for each layer, using the mathematical formulas previously obtained. An exemplary application to the computation of the complexity of a one-block encoder-only transformer was also reported for illustration purposes. The method proposed in this work combines the advantage of being device-agnostic with mathematical rigour, providing a robust estimation of complexity independent of the specific target. By taking advantage of the results of this paper, the reader can easily obtain a reliable estimation of the computational complexity involved in training a transformer architecture of their choice using BP or forward learning procedures.

Author Contributions

Conceptualization, D.P.P. and F.M.A.; methodology, D.P.P. and F.M.A.; investigation, D.P.P. and F.M.A.; resources, D.P.P. and F.M.A.; writing—original draft preparation, writing—review and editing, D.P.P. and F.M.A.; supervision, D.P.P.; project administration, D.P.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

Authors Danilo Pietro Pau and Fabrizio Maria Aymone were employed by the company STMicroelectronics. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  2. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv 2020, arXiv:2001.08361. [Google Scholar]
  3. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  4. Mielke, S.J.; Alyafeai, Z.; Salesky, E.; Raffel, C.; Dey, M.; Gallé, M.; Raja, A.; Si, C.; Lee, W.Y.; Sagot, B.; et al. Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. arXiv 2021, arXiv:2112.10508. [Google Scholar]
  5. Maslej, N.; Fattorini, L.; Brynjolfsson, E.; Etchemendy, J.; Ligett, K.; Lyons, T.; Manyika, J.; Ngo, H.; Niebles, J.C.; Parli, V.; et al. The AI Index 2023 Annual Report; Technical report; AI Index Steering Committee, Institute for Human-Centered AI, Stanford University: Stanford, CA, USA, 2023. [Google Scholar]
  6. Hinton, G. The Forward-Forward Algorithm: Some Preliminary Investigations. arXiv 2022, arXiv:2212.13345. [Google Scholar]
  7. Dellaferrera, G.; Kreiman, G. Error-driven Input Modulation: Solving the Credit Assignment Problem without a Backward Pass. arXiv 2022, arXiv:2201.11665. [Google Scholar]
  8. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2016, arXiv:1603.04467. [Google Scholar]
  9. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019, arXiv:1912.01703. [Google Scholar]
  10. Baydin, A.G.; Pearlmutter, B.A.; Radul, A.A.; Siskind, J.M. Automatic Differentiation in Machine Learning: A Survey. J. Mach. Learn. Res. 2017, 18, 5595–5637. [Google Scholar]
  11. Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. Pre-Training Transformers as Energy-Based Cloze Models. In Proceedings of the EMNLP, Online, 16–20 November 2020. [Google Scholar]
  12. Crick, F. The recent excitement about neural networks. Nature 1989, 337, 129–132. [Google Scholar] [CrossRef]
  13. Lillicrap, T.; Santoro, A.; Marris, L.; Akerman, C.; Hinton, G. Backpropagation and the brain. Nat. Rev. Neurosci. 2020, 21, 335–346. [Google Scholar] [CrossRef]
  14. Burbank, K.S.; Kreiman, G. Depression-Biased Reverse Plasticity Rule Is Required for Stable Learning at Top-Down Connections. PLoS Comput. Biol. 2012, 8, e1002393. [Google Scholar] [CrossRef] [PubMed]
  15. Liao, Q.; Leibo, J.Z.; Poggio, T. How Important is Weight Symmetry in Backpropagation? arXiv 2016, arXiv:1510.05067. [Google Scholar] [CrossRef]
  16. Baldi, P.; Sadowski, P. A theory of local learning, the learning channel, and the optimality of backpropagation. Neural Netw. 2016, 83, 51–74. [Google Scholar] [CrossRef] [PubMed]
  17. Jaderberg, M.; Czarnecki, W.M.; Osindero, S.; Vinyals, O.; Graves, A.; Silver, D.; Kavukcuoglu, K. Decoupled Neural Interfaces using Synthetic Gradients. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1627–1635. [Google Scholar]
  18. Czarnecki, W.M.; Świrszcz, G.; Jaderberg, M.; Osindero, S.; Vinyals, O.; Kavukcuoglu, K. Understanding Synthetic Gradients and Decoupled Neural Interfaces. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 904–912. [Google Scholar]
  19. Xiao, W.; Chen, H.; Liao, Q.; Poggio, T. Biologically-plausible learning algorithms can scale to large datasets. arXiv 2018, arXiv:1811.03567. [Google Scholar]
  20. Lillicrap, T.; Cownden, D.; Tweed, D.; Akerman, C. Random synaptic feedback weights support error backpropagation for deep learning. Nat. Commun. 2016, 7, 13276. [Google Scholar] [CrossRef] [PubMed]
  21. Nøkland, A. Direct Feedback Alignment Provides Learning in Deep Neural Networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Barcelona, Spain, 5–10 December 2016; pp. 1045–1053. [Google Scholar]
  22. Akrout, M.; Wilson, C.; Humphreys, P.; Lillicrap, T.; Tweed, D.B. Deep Learning without Weight Transport. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: New York, NY, USA, 2019; Volume 32. [Google Scholar]
  23. Frenkel, C.; Lefebvre, M.; Bol, D. Learning Without Feedback: Fixed Random Learning Signals Allow for Feedforward Training of Deep Neural Networks. Front. Neurosci. 2021, 15, 629892. [Google Scholar] [CrossRef] [PubMed]
  24. Xie, X.; Seung, H. Equivalence of Backpropagation and Contrastive Hebbian Learning in a Layered Network. Neural Comput. 2003, 15, 441–454. [Google Scholar] [CrossRef] [PubMed]
  25. Scellier, B.; Bengio, Y. Equilibrium Propagation: Bridging the Gap between Energy-Based Models and Backpropagation. Front. Comput. Neurosci. 2017, 11, 24. [Google Scholar] [CrossRef]
  26. Clark, D.; Abbott, L.; Chung, S. Credit Assignment Through Broadcasting a Global Error Vector. In Proceedings of the Advances in Neural Information Processing Systems 34—35th Conference on Neural Information Processing Systems, NeurIPS 2021, Virtual, 6–14 December 2021; pp. 10053–10066. [Google Scholar]
  27. Pau, D.P.; Aymone, F.M. Suitability of Forward-Forward and PEPITA Learning to MLCommons-Tiny benchmarks. In Proceedings of the 2023 IEEE International Conference on Omni-layer Intelligent Systems (COINS), Berlin, Germany, 23–25 July 2023; pp. 1–6. [Google Scholar] [CrossRef]
  28. Banbury, C.; Reddi, V.J.; Torelli, P.; Holleman, J.; Jeffries, N.; Kiraly, C.; Montino, P.; Kanter, D.; Ahmed, S.; Pau, D.; et al. MLCommons Tiny Benchmark. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Virtual, 6–14 December 2021. [Google Scholar]
  29. Srinivasan, R.F.; Mignacco, F.; Sorbaro, M.; Refinetti, M.; Cooper, A.; Kreiman, G.; Dellaferrera, G. Forward Learning with Top-Down Feedback: Empirical and Analytical Characterization. arXiv 2023, arXiv:2302.05440. [Google Scholar]
  30. Justus, D.; Brennan, J.; Bonner, S.; McGough, A.S. Predicting the Computational Cost of Deep Learning Models. arXiv 2018, arXiv:1811.11880. [Google Scholar]
  31. Zargar, B.; Ponci, F.; Monti, A. Evaluation of Computational Complexity for Distribution Systems State Estimation. IEEE Trans. Instrum. Meas. 2023, 72, 9001512. [Google Scholar] [CrossRef]
  32. Muhammad, N.; Bibi, N.; Jahangir, A.; Mahmood, Z. Image denoising with norm weighted fusion estimators. Form. Pattern Anal. Appl. 2018, 21, 1013–1022. [Google Scholar] [CrossRef]
  33. Getzner, J.; Charpentier, B.; Günnemann, S. Accuracy is not the only Metric that matters: Estimating the Energy Consumption of Deep Learning Models. arXiv 2023, arXiv:2304.00897. [Google Scholar]
Figure 1. Transformer architecture.
Figure 2. Structure of the attention block.
Table 1. Architecture hyperparameters.
Name        Description
voc_size    Number of words/tokens in the corpus
d_model     Dimension of the embeddings
d_k         Dimension of a single attention head
d_ff        Dimension of the first layer in the feed-forward network
n_enc       Number of encoder layers
n_dec       Number of decoder layers
max_len     Maximum number of tokens in the context
Table 2. Summary of the learning procedures (number of times each macro-operation is performed per training step).
Learning Methods    BP    PEP    MPE
Forward pass         1     2      3
Backward pass        1     0      0
Weight update        1     1      1
Error projection     0     1      1
PEP stands for PEPITA and MPE for MEMPEPITA.
