1. Introduction
Generative artificial intelligence (AI) models are important for modeling intelligent machines as physically described in [1, 2]. Generative AI is based on deep neural networks (DNNs for short), and a common characteristic of DNNs is their compositional nature (cf. [3]): data is processed sequentially, layer by layer, resulting in a discrete-time dynamical system. The introduction of the transformer architecture for generative AI in 2017 marked the most striking advancement in terms of DNNs (cf. [4]). Indeed, the transformer is a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. At each step, the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next. The transformer has achieved great success in natural language processing (cf. [5]).
The transformer has a modular structure and is built from two main building blocks: self-attention and feed-forward neural networks. Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. However, despite its meteoric rise within deep learning, we believe there is a gap in our theoretical understanding of what the transformer is and why it works physically (cf. [6]).
We think that there are two origins for the modular structure of generative AI models. One is a mathematical origin, in which a joint probability distribution can be computed as a product of sequential conditional probabilities. For instance, the probability of generating a text 
 given an input 
X in a transformer architecture is equal to the joint probability distribution 
 such that
      where the conditional probability 
 is given by the 
ℓ-th attention block in the transformer. Another is a physical origin, in which a physical process is considered to be a sequence of events as a history. As such, generating a text 
 given an input 
X in a physical machine is a process in which, given an input 
X at time 
 an event 
 occurs at time 
 an event 
 occurs at time 
 …, and last, an event 
 occurs at time 
. A theory of the “histories” approach to physical systems was established by Isham [2], and the associated mathematical theory of joint probability distributions was then developed by Gudder [1]. Based on their theory, in this paper, we present a mathematical formalism for generative AI and describe the associated physical models.
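The mathematical origin described above, a joint probability computed as a product of sequential conditional probabilities, can be illustrated with a minimal sketch. The two-symbol vocabulary and the rule in `conditional` are assumptions made up for illustration; they are not the paper's model, and in a transformer the conditional would come from the attention blocks.

```python
from itertools import product

VOCAB = ("a", "b")  # toy vocabulary (assumption, not from the paper)

def conditional(history, token):
    """Hypothetical P(token | history): favours repeating the last symbol.

    Assumes a non-empty history, i.e. a non-empty prompt.
    """
    return 0.7 if token == history[-1] else 0.3

def joint(prompt, text):
    """Chain rule: multiply the conditional probabilities step by step."""
    prob, history = 1.0, tuple(prompt)
    for token in text:
        prob *= conditional(history, token)
        history += (token,)
    return prob

# Joint probabilities of all length-3 continuations of the prompt ("a",).
distribution = {y: joint(("a",), y) for y in product(VOCAB, repeat=3)}
total = sum(distribution.values())
```

Because each conditional is a probability distribution over the vocabulary, the joint probabilities of all continuations of a fixed length add up to one.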
To the best of our knowledge, physical models for generative AI are usually described by using systems of mean-field interacting particles (cf. [7, 8] and references therein); i.e., generative AI models are regarded as classical statistical systems. However, since modern chips process data by controlling the flow of electric current, i.e., the dynamics of many electrons, they should be regarded as quantum statistical ensembles and open quantum systems from a physical perspective (cf. [9, 10]). Consequently, based on our mathematical formalism for generative AI, we construct physical models realizing generative AI systems as open quantum systems. As an illustration, we construct physical models realizing large language models based on a transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens.
The paper is organized as follows. In Section 2, we include some notation and definitions on the attention mechanism, the transformer, and the effect algebras. In Section 3, we give the definition of a generative AI system as a family of sequential joint probabilities associated with input texts and temporal sequences of tokens. This is based on the mathematical theory developed by Gudder (cf. [1]) for a historical approach to physical evolution processes. Those joint probabilities characterize the attention mechanisms as well as the mathematical structure of the transformer architecture. In Section 4, we present the construction of physical models realizing generative AI systems as open quantum systems. Our physical models are given by an event-history approach to physical systems; we refer to [2] for the physical background of this formulation. In Section 5, we construct physical models realizing large language models based on a transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens. Finally, in Section 6, we summarize our main contributions item by item and conclude the paper.
  2. Preliminaries
In this section, we present a mathematical description of the attention mechanism and transformer architecture for generative AI and include some notations and basic properties of σ-effect algebras (cf. [11]). For the sake of convenience, we collect some notations and definitions. Denote by 
 the natural number set 
 and for 
 we use the notation 
 to represent the set 
 For 
 we denote by 
 the 
d-dimensional Euclidean space with the usual inner product 
 For two sets 
 we denote by 
 the set of all maps from 
X into 
 For a set 
 we denote 
 where 
 is the set of all sequences 
 of 
n elements in 
S; i.e., 
 is the set of all finite sequences of elements in 
  2.1. Deep Neural Networks
A DNN is constructed by connecting multiple neurons. Recall that a (feed-forward) neural network of depth 
L consists of some number of neurons arranged in 
 layers. Layer 
 is the input layer, where data is presented to the network, while layer 
 is where the output is read out. All layers in between are referred to as the hidden layers, and each hidden layer has an activation that is a map in the same layer. Specifically, let 
 be a sequence of sets where 
 indexes the neurons in layer 
 and let 
 be a sequence of vector spaces. A mapping 
 is called a feed-forward neural network of depth 
L if there exists a sequence 
 of maps 
 and a sequence 
 of maps 
, which is called the activation function at the layer 
 such that
        for 
 where 
 is called the input and 
 is the output. We call 
 the architecture of the neural network 
 Of course, 
 is determined by its architecture, and there exist different choices of architectures yielding the same 
In their most basic form, 
 is a finite set of 
 elements and 
 a feed-forward neural network 
 is a function of the following form: the input is 
  for 
 and
        where 
 is the output. This can be illustrated as follows:
		
		Here, the map 
 is usually of the form
        where 
 is an 
 matrix called a weight matrix and 
 is called a bias vector for each 
 and the function 
 represents the activation function at the 
ℓ-th layer. The entries of the weight matrices and bias vectors of a neural network 
 are called the parameters of 
 These parameters are adjustable and learned during the training process, determining the specific function realized by the network. Also, the depth 
 the number of neurons in each layer, and the activation functions of a neural network 
 are called the hyperparameters of 
 They define the network’s architecture (and training process) and are typically set before training begins. For a fixed architecture, every choice of network parameters as in (3) defines a specific function 
 and this function is often referred to as a model.
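As a rough illustration of how fixed parameters determine a specific function, the following sketch evaluates a small feed-forward network layer by layer. It assumes the common convention that each layer applies an affine map followed by its activation; the weights, biases, and activations below are invented for the example.

```python
def affine(W, b, x):
    """Compute W x + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, t) for t in v]

def identity(v):
    return list(v)

def feed_forward(layers, x):
    """Apply the layers in order; each layer is (weights, bias, activation)."""
    for W, b, sigma in layers:
        x = sigma(affine(W, b, x))
    return x

# Illustrative parameters (assumptions): 2 inputs, 2 hidden neurons, 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0], relu),
    ([[1.0, 1.0]], [-1.0], identity),
]
y = feed_forward(layers, [2.0, 1.0])
```

Here the list `layers` plays the role of the parameters, while its length and the shapes of the matrices correspond to hyperparameters fixed before training.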
In a feed-forward neural network, the inputs to neurons in the 
ℓ-th layer are usually exclusively neurons from the 
-th layer. However, residual neural networks (ResNets for short) allow skip connections; that is, information is allowed to skip layers in the sense that the neurons in layer 
ℓ may have 
 as their input (and not just 
). In their most basic form, 
 and
        where 
 is a vector function, 
’s are 
 matrices, and 
’s are vectors in 
 In contrast to feed-forward neural networks, recurrent neural networks (RNNs for short) allow information to flow backward in the sense that 
 may serve as input for the neurons in layer 
ℓ and not just 
 We refer to [12] for more details, such as the training of neural networks.
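A skip connection can be sketched as follows, assuming the familiar residual form in which a layer adds a transformed copy of its input to the input itself. The choice of `tanh` and the zero parameters are assumptions used only to make the effect of the skip visible.

```python
import math

def residual_step(W, b, x, sigma=math.tanh):
    """One residual layer: x + sigma(W x + b), with sigma applied entrywise."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [xi + sigma(p) for xi, p in zip(x, pre)]

x = [1.0, -2.0]
zero_W = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]

# With vanishing parameters the layer reduces to the identity map,
# so the input "skips" the layer unchanged.
out = residual_step(zero_W, zero_b, x)
```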
  2.2. Attention
The fundamental definition of attention was given by Bahdanau et al. in 2014. To describe the mathematical definition of attention, we denote by  the query space,  the key space, and  the value space. We call an element  a query,  a key, , and so on.
Definition 1 (cf. [13]). Let  be a function. Let  be a set of keys and  a set of values. Given a  the attention  is defined by  where  is a probability distribution over  defined by  This means that a value  in (6) occurs with probability  for  For  we define  In particular, when   is said to be self-attention at  and the mapping  defined by  is called the self-attention map.
We remark that
        
- (1)
- For a finite sequence  of real numbers, define  Then,  as usual in the literature.
- (2)
- We have  but  in general.
- (3)
- The function  is called a similarity function, usually given by  where  is a  real matrix called a query matrix and  is a  real matrix called a key matrix. For  the real number  is interpreted as the similarity between the query q and the key 
- (4)
- In the representation learning framework of attention, we usually assume the finite set  of tokens has been embedded in  where d is called the embedding dimension, so we identify each  with one of finitely many vectors x in  We assume that the structure (positional information, adjacency information, etc.) is encoded in these vectors. In the case of self-attention, we assume  
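The probabilistic reading of attention in Definition 1 can be sketched as follows, assuming a softmax over similarity scores. The function returns both the distribution over values and its mean; the weighted mean is the deterministic form commonly seen in the literature. The plain dot product used as the similarity and the toy keys and values are assumptions for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax of a finite sequence of real numbers."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(q, keys, values, similarity=dot):
    """Return the distribution over values and the corresponding mean value."""
    probs = softmax([similarity(q, k) for k in keys])
    dim = len(values[0])
    mean = [sum(p * v[i] for p, v in zip(probs, values)) for i in range(dim)]
    return probs, mean

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
# A zero query is equally similar to every key, so the values are equally likely.
probs, mean = attention([0.0, 0.0], keys, values)
```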
Since the self-attention mechanism can be composed to arbitrary depth, making it a crucial building block of the transformer architecture, we mainly focus on it in what follows. In practice, one uses multi-headed attention (cf. [4]), which processes independent copies of the data X and combines them with concatenation and matrix multiplication. Let 
 be the input set of tokens embedded in 
 Let us consider 
-headed attention with the dimension 
 for every head. For every 
 let 
 be 
 (query, key, value) matrices associated with the 
i-th self-attention, and the similarity function
 Let 
 denote the output projection matrix, where 
 is a 
 matrix for every 
 For 
 the multi-headed self-attention (MHSelfAtt for short) is then defined by
        that is, an output
        occurs with the probability 
 As such,
        yields a basic building block of the transformer
        as in the case of one-headed attention.
  2.3. Transformer
In line with successful models, such as large language models, we focus on the decoder-only setting of the transformer, where the model iteratively predicts the next tokens based on a given sequence of tokens. This procedure is called autoregressive since the prediction of new tokens is only based on previous tokens. Such conditional sequence generation using autoregressive transformers is referred to as the transformer architecture.
Specifically, in the transformer architecture defined by a composition of blocks, each block consists of a self-attention layer 
 a multi-layer perceptron 
 and a prediction head layer 
 First, the self-attention layer SelfAtt is the only layer that combines different tokens. Let us denote the input text to the layer by 
 embedded in 
 and focus on the 
n-th output. For each 
 letting
        where 
 and 
 are two 
 matrices (i.e., the query and key matrices), we can interpret 
 as similarities between the 
n-th token 
 (i.e., the query) and the other tokens (i.e., keys); to respect the autoregressive structure, we only consider 
 The softmax layer is given by
        which can be interpreted as the probability for the 
n-th query to “attend” to the 
j-th key. Then, the self-attention layer 
 can be defined as
        where 
 is the 
 real matrix such that 
 for any 
 the output 
 occurring with the probability 
 is often referred to as the values of the token 
 Thus, 
 is a random map such that 
 for each 
If the attention is a multi-headed attention with 
 heads of the dimension 
 where for 
  are the 
 (query, key, value) matrices and 
 is the 
 (output) matrix of the 
i-th self-attention, then the multi-headed self-attention layer 
 is defined by
        where
        i.e., an output 
 occurs with the probability 
 for each 
 In what follows, we only consider the case of one-headed attention, since the multi-headed case is similar.
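The autoregressive restriction in the self-attention layer, where the n-th token only attends to tokens up to position n, can be sketched as below. The similarity is taken to be the inner product of the transformed query and key; any scaling factor is omitted, and the identity matrices and embeddings are assumptions chosen for readability. Positions are 0-based in the code.

```python
import math

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def causal_attention_weights(Q, K, xs, n):
    """Softmax weights with which token n attends to tokens 0..n (0-based)."""
    q = matvec(Q, xs[n])
    scores = [sum(a * b for a, b in zip(q, matvec(K, xs[j]))) for j in range(n + 1)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

I2 = [[1.0, 0.0], [0.0, 1.0]]               # toy query and key matrices
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three embedded tokens
weights = causal_attention_weights(I2, I2, xs, 1)
```

The returned list has one entry per admissible key, so later tokens never influence the weights of earlier positions.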
Second, the multi-layer perceptron is a feed-forward neural network 
 such that 
 with the probability 
 (
) for each 
 Finally, the prediction head layer can be represented as a mapping 
 which maps the sequence of 
 to a probability distribution 
 where 
 is the probability of predicting 
 as the next token. Since 
 contains information about the whole input text, we may define
        such that the next token 
 with the probability 
 for 
Hence, a basic building block for the transformer, consisting of a self-attention module (SelfAtt) and a feed-forward network (FFN) followed by a prediction head layer (PH), can be illustrated as follows:
		 
        where the input text 
 is embedded as a sequence 
 in 
  occurs with the probability 
 for each 
  is generated with the probability 
 for each 
 and so the output is 
 One can then apply the same operations to the extended sequence 
 in the next block, obtaining 
 to iteratively compute further tokens (there is usually a stopping criterion based on a special token or the mapping 
). Below, without loss of generality, we omit the prediction head layer 
Typically, a transformer of depth 
L is defined by a composition of 
L blocks, denoted by 
 consisting of 
L self-attention maps 
 and 
L feed-forward neural networks 
 that is,
        where the indices of the layers SelfAtt and FFN in (24) indicate the use of different trainable parameters in each of the blocks. This can be illustrated as follows:
		
        that is,
 Also, we can consider the transformer of the form
        where 
 denotes the identity mapping in 
 commonly known as a skip or residual connection.
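The composition of blocks, with and without skip connections, can be sketched with plain function composition. The two maps `toy_attention` and `toy_ffn` are stand-ins invented for this example (a causal running mean and an entrywise ReLU of twice the input); they are not the paper's layers and use no trainable parameters.

```python
from functools import reduce

def compose(*maps):
    """Return the map that applies the given maps from left to right."""
    return lambda xs: reduce(lambda acc, f: f(acc), maps, xs)

def with_skip(f):
    """Turn f into Id + f, i.e. add a skip (residual) connection."""
    return lambda xs: [[a + b for a, b in zip(x, fx)] for x, fx in zip(xs, f(xs))]

def toy_attention(xs):
    """Stand-in for self-attention: token n becomes the mean of tokens 0..n."""
    dim = len(xs[0])
    return [[sum(xs[j][i] for j in range(n + 1)) / (n + 1) for i in range(dim)]
            for n in range(len(xs))]

def toy_ffn(xs):
    """Stand-in for a feed-forward network, applied token by token."""
    return [[max(0.0, 2.0 * t) for t in x] for x in xs]

block = compose(toy_attention, toy_ffn)
transformer = compose(block, block)  # depth 2, no skip connections
residual_block = compose(with_skip(toy_attention), with_skip(toy_ffn))

tokens = [[1.0], [3.0]]
plain_output = transformer(tokens)
residual_output = residual_block(tokens)
```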
  2.4. Effect Algebras
For the sake of convenience, we collect some notations and basic properties of σ-effect algebras (cf. [1, 11, 14] and references therein). Recall that an effect algebra is an algebraic system 
 where 
 is a non-empty set, 
, which are called the zero and unit elements of this algebra, respectively, and ⊕ is a partial binary operation on 
 that satisfies the following conditions for any 
:
- (E1)
- (Commutative Law): If  is defined, then  is defined and  which is called the orthogonal sum of a and  
- (E2)
- (Associative Law): If  and  are defined, then  and  are defined and  which is denoted by  
- (E3)
- (Orthosupplementation Law): there exists a unique  such that  is defined and  such  is unique and called the orthosupplement of  
- (E4)
- (Zero–One Law): if  is defined, then  
 We simply call 
 an effect algebra in the sequel. From the associative law (E2), we can write 
 if this orthogonal sum is defined. For any 
 we define 
 if there exists a 
 such that 
 this 
c is unique and denoted by 
 so 
 We also define 
 if 
 is defined; i.e., 
a is orthogonal to 
 It can be shown (cf. [
14]) that 
 is a bounded partially ordered set (poset for short) and 
 if and only if 
 For a sequence 
 in 
 if 
 is defined for all 
 such that 
 exists, then the sum 
 of 
 exists and define 
 We say that 
 is a 
-effect algebra if 
 exists for any sequence 
 in 
 satisfying that 
 is defined for all 
 It was shown in (Lemma 3.1, [1]) that 
 is a 
-effect algebra if and only if the least upper bound 
 exists for any monotone sequence 
 i.e., 
Let 
 and 
 be 
-effect algebras. A map 
 is said to be additive if for 
  implies that 
 and 
 An additive map 
 is 
-additive if for any sequence 
 such that 
 exists, 
 exists and 
 A 
-additive map 
 is said to be a 
-morphism if 
 and moreover, 
 is called a 
-isomorphism if 
 is a bijective 
-morphism and 
 is a 
σ-morphism. It can be shown (cf. [1]) that
- (1)
- A map  is additive if and only if  is monotone in the sense that  implies  
- (2)
- An additive map  is -additive if and only if  implies  
- (3)
- A -morphism  satisfies  
The unit interval  is a -effect algebra defined as follows: For any   is defined if  and in this case  Then, we have that  and  are the zero and unit elements, respectively. In what follows, we always regard  as a -effect algebra in this way. Let  be a -effect algebra, a -morphism  is called a state on  and we denote by  the set of all states on  A subset S of  is said to be order determining if  for all  implies 
Another example of a -effect algebra is a measurable space  defined as follows: For any   is defined if  and in this case,  We then have  and  We always regard a measurable space  as a -effect algebra in this way. Let  be a -effect algebra, a -morphism  is called an observable on  with values in  (a -valued observable for short). The elements of a -effect algebra are called effects, and so an observable X maps effects in  into effects in ; i.e.,  is an effect in  for  We denote by  the set of all -valued observables. Note that  is equal to the set of all probability measures on  For  and  we have  which is called the probability distribution of X in the state 
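The unit interval example can be made concrete with a short sketch. It assumes the standard convention that the orthogonal sum of a and b is defined exactly when a + b ≤ 1, in which case it equals a + b, and that the orthosupplement of a is 1 − a. Exact fractions are used to avoid rounding issues; returning `None` for an undefined sum is a choice made for this example.

```python
from fractions import Fraction

ZERO, ONE = Fraction(0), Fraction(1)

def oplus(a, b):
    """Partial orthogonal sum on [0, 1]; None signals that it is undefined."""
    s = a + b
    return s if s <= ONE else None

def orthosupplement(a):
    """The unique a' with a (+) a' = 1."""
    return ONE - a

def less_equal(a, b):
    """a <= b holds when some c in [0, 1] satisfies a (+) c = b."""
    c = b - a
    return ZERO <= c <= ONE and oplus(a, c) == b

third, two_thirds = Fraction(1, 3), Fraction(2, 3)
defined_sum = oplus(third, two_thirds)         # defined, equals the unit
undefined_sum = oplus(two_thirds, two_thirds)  # 4/3 > 1, so not defined
```

In this example the induced order coincides with the usual order of real numbers, and the zero-one law reads: a ⊕ 1 is defined only for a = 0.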
  3. Mathematical Formalism
In this section, we introduce a mathematical formalism for generative AI. We utilize the theory of σ-effect algebras to give a mathematical definition for a generative AI system. Let  be a σ-effect algebra and  a measurable space. An orthogonal decomposition in  is a sequence  in  such that  exists, and moreover, it is complete if  We denote by  the set of all complete orthogonal decompositions in  A complete orthogonal decomposition in  is called a countable partition of  i.e., a sequence  of elements in  such that  for  and  We denote by  the set of all countable partitions of  For  an ordered n-tuple  of effects in  is called an n-time chain-of-effect, and we interpret  as an inference process of an intelligent machine in which the effect  occurs at time  for  where  Alternatively, no specific times may be involved and we regard  as a sequential effect in which  occurs first,  occurs second, …, and  occurs last.
Definition 2.  With the above notations, a generative artificial intelligence system  is defined to be a triple  where  is a σ-effect algebra,  is a measurable space, such that
- (G1) 
- The input set  of  is equal to the set ; i.e., an input is interpreted by a state  
- (G2) 
- The output set  of  is equal to the set ; i.e., the set of all finite sequences of elements in  
- (G3) 
- An inference process in  is interpreted by a chain-of-effect  for  
 Remark 1.  We refer to  [15] for a mathematical definition of general artificial intelligence systems in terms of topos theory, including quantum artificial intelligence systems.  In practice, we are not concerned with a generative AI system  itself but deal with models for  such as large language models. To this end, we need to introduce the definition of a model for  in terms of joint probability distributions for observables associated with 
For  and  we may view the effect  as the event for which X has a value in  For a partition  we may view  as a set of possible alternative events that can occur. One interpretation is that  represents a building block of an artificial intelligence architecture for processing X and the alternatives result from the dial readings of the block. Given    an ordered n-tuple  of events is called an n-time chain-of-event, and we interpret  as an inference process of an intelligent machine in which  has a value  in  first,  has  in  second, …, and  has  in  last, so that the output result is  We denote the set of all n-time chain-of-events by  and the set of all chain-of-events by 
An n-step inference set has the form 
 where 
   We interpret 
 as ordered successive processes of observables 
 with partitions 
 for 
 We denote the collection of all 
n-step inference sets by 
 and the collection of all inference sets by 
 If 
 and 
 such that 
 for every 
 we say the chain-of-event 
 is an element of the inference set 
 and write 
 This can be illustrated as follows:
      which means that the machine firstly obtains 
 as part of an output with the probability 
 then obtains 
 with the conditional probability 
 …, and lastly obtains 
 with the conditional probability 
 and finally combines them to obtain the output result 
 with the probability
      where 
 will be explained later.
If 
 and 
 are two inference sets, then we define their sequential product by
      and obtain a 
-step inference set. Mathematically, we can include the empty inference set 
∅ that satisfies 
 such that 
 becomes a semigroup under this product.
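The algebraic structure of the sequential product can be sketched by modelling an inference set as a tuple of steps and the product as concatenation. This representation and the string labels are assumptions made for illustration; in the text each step is an observable together with a partition.

```python
def sequential_product(first, second):
    """Run the steps of `first`, then the steps of `second`."""
    return tuple(first) + tuple(second)

EMPTY = ()  # the empty inference set

# Hypothetical labels standing in for (observable, partition) pairs.
A = ("step_1", "step_2")
B = ("step_3",)
C = ("step_4",)

AB = sequential_product(A, B)  # a 3-step inference set
left = sequential_product(sequential_product(A, B), C)
right = sequential_product(A, sequential_product(B, C))
```

Concatenation is associative and the empty tuple acts as a two-sided identity, which matches the role of the empty inference set described above.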
For a partition 
 we denote by 
 the 
-subalgebra of 
 generated by 
 and for 
n partitions 
 we denote by 
 the 
-algebra on 
 generated by 
 i.e.,
 We denote by 
 the set of all probability measures on 
 Also, we write 
 for 
 Given an input 
 for an inference set 
, we denote by 
 the probability measure such that for 
  is the probability within the inference set 
 that the event 
 occurs first, 
 occurs second, …, 
 occurs last. We call 
 the joint probability distribution of an inference set 
 under the input 
For interpreting a model for a generative AI system, ’s need to satisfy physically motivated axioms as follows.
Definition 3.  With the above notations, a model  for  is defined to be a family of joint probability distributions of inference sets  that satisfies the following axioms:
- (P1) 
- For  and  if  for all  then  
- (P2) 
- For   if  and  then  for every  
- (P3) 
- For   with  if  then 
- (P4) 
- If   and  for  then  for every  
 For the physical meanings of the model structure axioms, we remark that
- (1)
- The axiom (P1) means that the input set can distinguish different events; 
- (2)
- The axiom (P2) means that the partition of the last processing is irrelevant; 
- (3)
- The axiom (P3) means that the last processing does not affect the previous ones; 
- (4)
- The axiom (P4) means that the probability of a chain of events does not depend on the partitions and hence is unambiguous. However, for  in   in general if ’s are quantum observables due to quantum interference. 
If 
 and 
 are two inference sets, 
  and if 
 is an input such that 
 then we define the conditional probability of 
B given 
A within 
 under the input 
 as follows:
 Since 
 is a probability measure on 
 where 
 for 
 and 
 for 
 so 
 is a probability measure on 
 which is called a conditional sequential joint probability distribution.
Proposition 1.  Given   and  if  then the conditional sequential joint probability distribution  satisfies the axioms (P2)–(P4) in Definition 3.
 Proof.  By the axiom (P2), we have
        hence, 
 satisfies the axiom (P2) and, likewise, the axiom (P3). Similarly, the axiom (P4) implies that 
 satisfies the axiom (P4); we omit the details.    □
 We remark that when observables are quantum ones, Bayes’ formula need not hold, i.e.,
      in general. This is because the left-hand side is 
 and the right-hand side is 
 so the order of the occurrences is changed. For instance, consider a qubit with the standard basis 
 and 
 Let 
 If 
  and 
; then
      and so, 
 We refer to Section 4 for more details.
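The order dependence behind this remark can be checked numerically for a qubit. The sketch below uses projections and the Lüders rule without any intermediate dynamics; the particular state and projections are assumptions chosen for the example and need not coincide with those of the example in the text.

```python
def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(A):
    return A[0][0] + A[1][1]

def sequential_probability(rho, first, second):
    """P(first occurs, then second) = tr(second first rho first second)."""
    m = matmul(second, matmul(first, matmul(rho, matmul(first, second))))
    return trace(m).real

rho = [[1.0, 0.0], [0.0, 0.0]]     # the state |0><0|
P0 = [[1.0, 0.0], [0.0, 0.0]]      # projection onto |0>
Pplus = [[0.5, 0.5], [0.5, 0.5]]   # projection onto (|0> + |1>)/sqrt(2)

p_plus_then_zero = sequential_probability(rho, Pplus, P0)
p_zero_then_plus = sequential_probability(rho, P0, Pplus)
```

The two sequential probabilities are 1/4 and 1/2, so exchanging the order of the two events changes the joint probability, which is why a Bayes-type identity cannot hold in general.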
  4. Physical Models for Generative AI
Physical models for generative AI, such as large language models based on attention mechanisms, are usually described by using systems of mean-field interacting particles (cf. [7, 8] and references therein); i.e., generative AI systems are regarded as classical statistical ensembles. However, since modern chips process data by controlling the flow of electric current, i.e., the dynamics of a very large number of electrons, they should be regarded as quantum statistical ensembles from a physical perspective (cf. [10]). Consequently, we need to model modern intelligent machines as open quantum systems. To this end, combining the history theory of quantum systems (cf. [2]) and the theory of effect algebras (cf. [1, 14]), we construct physical models realizing generative AI systems as open quantum systems.
Let 
 be a separable complex Hilbert space with the inner product 
 being conjugate-linear in the first variable and linear in the second variable. We denote by 
 the set of all bounded linear operators on 
 by 
 the set of all bounded self-adjoint operators, and by 
 the set of all orthogonal projection operators. We denote by 
I the identity operator on 
 Unless stated otherwise, an operator means a bounded linear operator in the sequel. An operator 
T is positive if 
 for all 
 and in this case we write 
 We define 
 for a positive operator 
 where 
 is an orthonormal basis of 
 It is known that 
 is independent of the choice of the basis, and it is called the trace of 
T if 
 A positive operator 
 is a density operator if 
 and the set of all density operators on 
 is denoted by 
 Each positive operator is self-adjoint, and if two self-adjoint operators 
 such that 
 we write 
 or 
 We refer to [16, 17, 18] for more details on the theory of operators on Hilbert spaces.
A self-adjoint operator 
E that satisfies 
 is called an effect, and the set of all effects on 
 is denoted by 
 For 
 we define 
 if 
 and in this case we write 
 It can be shown (cf. (Lemma 5.1, [1])) that 
 is a 
-effect algebra, and each state 
 on 
 has the form 
 for every 
 where 
 is a unique density operator on 
 and vice versa. Thus, we identify 
 Let 
 be a measurable space. An observable 
 is a positive operator valued (POV for short) measure on 
; i.e.,
- (1)
-  is an effect on  for any  
- (2)
-  and  
- (3)
- For an orthogonal decomposition  in   where the series on the right-hand side is convergent in the strong operator topology on  i.e.,  for every  
To understand the inference process, let us show the conventional interpretation of joint probability distributions in an open quantum system that is subject to measurements by an external observer. To this end, let 
 denote the time-evolution operator from time 
s to 
 where 
’s are usually called Kraus operators, such that
 That is, 
 are quantum operations (cf. [19]) such that for every state 
      in the Schrödinger picture, while for each observable 
      in the Heisenberg picture. We refer to [9] for the details on the theory of open quantum systems.
Then the density operator state 
 at time 
 evolves in time 
 to the state 
 where
 Suppose that a measurement 
 is made at time 
 where 
 and 
 Then the probability that an event 
 with 
 occurs is
 If the result of this measurement is kept, then, according to the von Neumann–Lüders reduction postulate, the appropriate density operator to use for any further calculation is
Next, suppose a measurement 
 is performed at time 
 Then, according to the above, the conditional probability that an event 
 for 
 occurs at time 
 given that the event 
 occurs at time 
 (and that the original state was 
) is
      where 
 and the appropriate density operator to use for any further calculation is
 The joint probability of 
 occurring at 
 and 
 occurring at 
 is then
Generalizing to a sequence of measurements 
 at times 
 where 
 and 
 for 
 the sequential joint probability of associated events 
 with 
 occurring at 
 for 
 is
      where 
 for 
  and
      for 
Therefore, given an inference set 
 for an input 
 the sequential joint probability within the inference set 
 that the event 
 occurs at 
  occurs at 
 …, 
 occurs at 
 where 
 for 
 and 
 is given by
      where 
 and
      for 
 in the Schrödinger picture operator defined with respect to the fiducial time 
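The procedure described in this section (evolve the state with Kraus operators, measure, keep the reduced state, repeat) can be sketched for a single qubit. The amplitude-damping channel, the standard-basis projective measurement, and the parameter value are assumptions chosen for the example; the code keeps the unnormalized post-measurement operator, whose trace is the joint probability of the outcome chain.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def dagger(A):
    """Conjugate transpose of a 2x2 matrix."""
    return [[A[j][i].conjugate() for j in range(2)] for i in range(2)]

def trace(A):
    return A[0][0] + A[1][1]

def apply_channel(kraus, rho):
    """Schrödinger picture: rho -> sum over Kraus operators of K rho K^dagger."""
    out = [[0.0, 0.0], [0.0, 0.0]]
    for K in kraus:
        out = add(out, matmul(K, matmul(rho, dagger(K))))
    return out

def amplitude_damping(gamma):
    """Kraus operators of a standard amplitude-damping channel (example choice)."""
    return [[[1.0, 0.0], [0.0, math.sqrt(1.0 - gamma)]],
            [[0.0, math.sqrt(gamma)], [0.0, 0.0]]]

PROJECTORS = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]

def joint_distribution(rho, channels, projectors):
    """Joint probability of every outcome chain (i_1, ..., i_n)."""
    chains = {(): rho}
    for kraus in channels:
        updated = {}
        for chain, sigma in chains.items():
            evolved = apply_channel(kraus, sigma)
            for i, P in enumerate(projectors):
                # Lueders reduction, kept unnormalized on purpose.
                updated[chain + (i,)] = matmul(P, matmul(evolved, P))
        chains = updated
    return {chain: trace(sigma).real for chain, sigma in chains.items()}

rho0 = [[0.0, 0.0], [0.0, 1.0]]  # initial state |1><1|
dist = joint_distribution(rho0, [amplitude_damping(0.25)] * 2, PROJECTORS)
```

Since the channel is trace preserving and the projections sum to the identity, the probabilities of all outcome chains of a given length sum to one.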
Proposition 2.  Let  be a separable complex Hilbert space, and let  be a measurable space. A physical model associated with  defined by  where ’s are given by (49), satisfies the axioms in Definition 3.
Proof.  For 
 and 
 by (49) we have
 If 
 for all 
 then 
 i.e., the axiom (P1) holds.
For 
  if 
 for 
 and 
 by (49), we have
        where 
   
        for 
 and
        for 
 Also, for 
 and 
 by (49) we have
 Hence, we have
        for 
 Since 
 is generated by 
’s, this proves the axiom (P2). The axioms (P3) and (P4) can be checked similarly; we omit the details.    □
 Remark 2.  Note that the probability family ’s are determined by the time-evolution operators ’s. Therefore, a family of the discrete-time evolution operators  defines a physical model realizing a generative AI system, based on the mathematical formalism in Definition 3 for models of generative AI systems.
   5. Large Language Models
In this section, we describe physical models for large language models based on a transformer architecture in the Fock space over the Hilbert space of tokens. Consider a large language model  with the set  of N tokens. A finite sequence  of tokens is called a text for  simply denoted by  or  where n is called the length of the text 
Let 
 be the Hilbert space with 
 being an orthonormal basis, and we identify 
 for 
 Let 
 be the Fock space over 
 that is,
      where 
 is the 
n-fold tensor product of 
 We refer to [16] for the details of Fock spaces. In what follows, for the sake of convenience, we work with the finite Fock space
      for a large integer 
 Note that an operator 
 for 
 satisfies that for all 
      and in particular, if 
 for 
 then 
 Given 
 and a sequence 
 for 
 the operator 
 is defined by
      for every 
 In particular, if 
 then
      where 
 denotes the zero operator in 
 for 
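To get a feeling for the size of the finite Fock space, one can enumerate its natural basis of texts. The sketch assumes that the basis of the n-fold tensor power is indexed by texts of length n; whether the zero-length (vacuum) sector is part of the finite Fock space used in the text is not visible here, so it is exposed as a flag. The two-token vocabulary and the cut-off 3 are arbitrary.

```python
from itertools import product

def fock_basis(tokens, max_length, include_vacuum=True):
    """List the basis texts of length up to max_length, shortest first."""
    start = 0 if include_vacuum else 1
    return [text
            for n in range(start, max_length + 1)
            for text in product(tokens, repeat=n)]

TOKENS = ("a", "b")  # hypothetical vocabulary with N = 2 tokens
basis = fock_basis(TOKENS, 3)

# Position of each text in the basis, usable as a coordinate index.
index = {text: i for i, text in enumerate(basis)}
dimension = len(basis)  # 1 + N + N^2 + N^3 when the vacuum is included
```

The dimension grows geometrically with the cut-off, which is one reason explicit matrices are only practical for toy vocabularies.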
Since large language models are based on a transformer architecture, it suffices to construct a physical model in the Fock space 
 (
) for a transformer 
 (24) with a composition of 
L blocks, consisting of 
L self-attention maps 
 and 
L feed-forward neural networks 
 Precisely, let us denote the input text to the layer by 
 As noted above,
      where 
 and
 Then, a physical model for 
 consists of an input 
 and a sequence of quantum operations 
 in the Fock space 
 defined above, where 
 We show how to construct this model step by step as follows.
To this end, we denote by 
 and write 
 At first, the input state 
 is given as
 Then there is a physical operation 
 in 
 (see Proposition 3 below), depending only on the attention mechanism 
 and 
 such that
      where 
 and 
 Define 
 by
      and for every 
 Making a measurement 
 at time 
 we obtain an output 
 with probability 
, and the appropriate density operator to use for any further calculation is
      for every 
 where 
 and
Next, there is a physical operation 
 in 
 (see Proposition 3 again), depending only on the attention mechanism 
 and 
 at time 
 such that
      for 
 where 
 (with 
) and 
 Define 
 by
      and for every 
 Making a measurement 
 at time 
 we obtain an output 
 with probability 
, and the appropriate density operator to use for any further calculation is
      for each 
 where 
 and
Step by step, we can obtain a physical model 
 with the input state 
 such that a text 
 is generated with the probability
      within the inference 
Thus, we can obtain a physical model for  if we prove that ’s exist.
Proposition 3.  With the above notations, there exists a physical model  in  () for a transformer  (24) such that given an input text  a text  is generated with the probability  within the inference  
Proof.  We regard 
  and 
 as elements in 
 in a natural way, i.e.,
        for 
 We need to construct 
 to satisfy (60). We first define
        where 
 is a certain token. Secondly, define
        and in general, for 
 define
        for any 
 and 
 Let
 Then 
 extends uniquely to a positive map 
 from 
 into 
 that is,
        where 
 are any complex numbers for 
 Since 
 is a commutative C*-algebra, by Stinespring’s theorem (cf. (Theorem 3.11, [20])), it follows that 
 is completely positive. Hence, by Arveson’s extension theorem (cf. (Theorem 7.5, [20])), 
 extends to a completely positive operator 
 in 
 (note that 
 is not necessarily unique), i.e., a quantum operation in 
 By the construction, 
 satisfies (60). Also, by Kraus’s theorem (cf. [19]), we conclude that 
 has the Kraus decomposition (39).
In the same way, we can prove that 
 exists and satisfies (65). Step by step, we can thus obtain a physical model 
 as required.    □
Remark 3.  A physical model for the transformer with multi-headed attention (21) can be constructed in a similar way. Also, we can construct physical models for the transformer of the form (26), and even for transformers of a more complex structure (cf. [21] and references therein). We omit the details.
Physical models satisfying the above joint probability distributions associated with a transformer 
 are not necessarily unique. However, a physical model 
 uniquely determines the joint probability distributions; that is, it defines a unique physical process for operating the large language model based on 
 Therefore, in a physical model 
 for 
 training for 
 corresponds to training for the Kraus operators 
 which are adjustable and learned during the training process, determining the physical model, as corresponding to the parameters 
 and 
 in 
 From a physical perspective, training a large language model amounts to determining the Kraus operators 
 associated with the corresponding physical system (cf. [22]).
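One way to picture how next-token probabilities could be encoded in Kraus operators is sketched below. For each token y it uses an operator that sends a basis text x to the text x followed by y, weighted by the square root of p(y | x). This is only one possible choice, given as an assumption for illustration, and it is not claimed to be the extension produced in the proof of Proposition 3; `next_token_probs` is a made-up stand-in for a trained model.

```python
import math

TOKENS = ("a", "b")

def next_token_probs(text):
    """Hypothetical next-token distribution depending on the last token."""
    if text and text[-1] == "a":
        return {"a": 0.25, "b": 0.75}
    return {"a": 0.5, "b": 0.5}

def kraus(y):
    """K_y |x> = sqrt(p(y | x)) |x y> on basis texts, extended linearly."""
    def apply(psi):
        return {x + (y,): math.sqrt(next_token_probs(x)[y]) * amp
                for x, amp in psi.items()}
    return apply

def norm_squared(psi):
    return sum(abs(amp) ** 2 for amp in psi.values())

# A normalized superposition of two length-1 texts (amplitudes 0.6 and 0.8).
psi = {("a",): 0.6, ("b",): 0.8}
outcome_probs = {y: norm_squared(kraus(y)(psi)) for y in TOKENS}
```

In this sketch, adjusting the model's next-token probabilities changes the Kraus operators directly, which mirrors the correspondence between training and determining Kraus operators discussed above.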
Example 1.  Let  be the set of two tokens embedded in  such that  and  Then,  with the standard basis  and  Let  Suppose that  in , and let  i.e.,  and  Below, we construct a quantum operation  associated with  and  in  To this end, define  and  We regard   () as elements in  in a natural way. Let  Then  is a subspace of  and Φ extends uniquely to a positive map  from  into  i.e.,  for any  As shown in Proposition 3,  can extend to a completely positive operator in  which is a quantum operation in  associated with  and  in  Note that  is not necessarily unique.
Example 2.  As in Example 1,  is the set of two tokens embedded in  such that  and  Then  with the standard basis  and  Let  Assume an input text  The input state  is then given by  If  and  in  an associated physical operation  at time  satisfies  By measurement, we obtain  with probability  and obtain  with probability  while  If  and  in  an associated quantum operation  at time  satisfies  and  By measurement at time  when  occurs at  we obtain  with probability  and obtain  with probability  when  occurs at  we obtain  with probability  and obtain  with probability  Therefore, we obtain the joint probability distributions:  This can be illustrated as follows: