
Mathematical Formalism and Physical Models for Generative Artificial Intelligence

Wuhan Institute of Physics and Mathematics, IAPM, Chinese Academy of Sciences, 30 West District, Xiao-Hong-Shan, Wuhan 430071, China
Foundations 2025, 5(3), 23; https://doi.org/10.3390/foundations5030023
Submission received: 13 May 2025 / Revised: 9 June 2025 / Accepted: 16 June 2025 / Published: 24 June 2025
(This article belongs to the Section Physical Sciences)

Abstract

This paper presents a mathematical formalism for generative artificial intelligence (GAI). Our starting point is the observation that the “histories” approach to physical systems agrees with the compositional nature of deep neural networks. Mathematically, we define a GAI system as a family of sequential joint probabilities associated with input texts and temporal sequences of tokens (regarded as physical event histories). From a physical perspective on modern chips, we then construct physical models realizing GAI systems as open quantum systems. Finally, as an illustration, we construct physical models realizing large language models based on a transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens. Our physical models underlie the transformer architecture for large language models.

1. Introduction

Generative artificial intelligence (AI) models are important for modeling intelligent machines, as physically described in [1,2]. Generative AI is based on deep neural networks (DNNs for short), and a common characteristic of DNNs is their compositional nature (cf. [3]): data is processed sequentially, layer by layer, resulting in a discrete-time dynamical system. The introduction of the transformer architecture for generative AI in 2017 marked the most striking advancement in DNNs to date (cf. [4]). Indeed, the transformer is a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. At each step, the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next. The transformer has achieved great success in natural language processing (cf. [5]).
The transformer has a modularization framework and is constructed from two main building blocks: self-attention and feed-forward neural networks. Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. However, despite its meteoric rise within deep learning, we believe there is a gap in our theoretical understanding of what the transformer is and why it works physically (cf. [6]).
We think that there are two origins for the modularization framework of generative AI models. One is a mathematical origin, in which a joint probability distribution can be computed by sequential conditional probabilities. For instance, the probability of generating a text $t_1 t_2 \cdots t_N$ given an input $X$ in a transformer architecture is equal to the joint probability distribution $P_X(t_1,\ldots,t_N)$ such that
$P_X(t_1,\ldots,t_N) = P_X(t_1)\, P_X(t_2 \mid t_1) \cdots P_X(t_N \mid t_1,\ldots,t_{N-1}),$
where the conditional probability $P_X(t_\ell \mid t_1,\ldots,t_{\ell-1})$ is given by the $\ell$-th attention block in the transformer. Another is a physical origin, in which a physical process is considered to be a sequence of events, i.e., a history. As such, generating a text $t_1 t_2 \cdots t_N$ given an input $X$ in a physical machine is a process in which, given the input $X$ at time $\tau_0$, an event $|t_1\rangle\langle t_1|$ occurs at time $\tau_1$, an event $|t_2\rangle\langle t_2|$ occurs at time $\tau_2$, …, and finally an event $|t_N\rangle\langle t_N|$ occurs at time $\tau_N$. A theory of the “histories” approach to physical systems was established by Isham [2], and the associated mathematical theory of joint probability distributions was then developed by Gudder [1]. Based on their theory, in this paper, we present a mathematical formalism for generative AI and describe the associated physical models.
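To make the chain-rule reading concrete, here is a minimal sketch (in Python) of how a joint probability of a token sequence factors into sequential conditional probabilities. The callable `next_token_probs` is a hypothetical stand-in for whatever supplies the conditional distributions (in a transformer, the $\ell$-th attention block); the uniform toy distribution is purely illustrative.

```python
import numpy as np

def joint_probability(next_token_probs, input_x, tokens):
    """Multiply the sequential conditional probabilities P(t_1), P(t_2|t_1), ..."""
    prob, prefix = 1.0, []
    for t in tokens:
        p = next_token_probs(input_x, prefix)   # distribution over the vocabulary
        prob *= p[t]                            # P(t | prefix)
        prefix.append(t)
    return prob

# Toy stand-in: a uniform next-token distribution over a 4-token vocabulary.
vocab_size = 4
uniform = lambda x, prefix: np.full(vocab_size, 1.0 / vocab_size)
print(joint_probability(uniform, "some input", [0, 2, 1]))  # (1/4)^3
```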
To the best of our knowledge, physical models for generative AI are usually described by using systems of mean-field interacting particles (cf. [7,8] and references therein); i.e., generative AI models are regarded as classical statistical systems. However, since modern chips process data by controlling the flow of electric current, i.e., the dynamics of many electrons, they should be regarded as quantum statistical ensembles and open quantum systems from a physical perspective (cf. [9,10]). Consequently, based on our mathematical formalism for generative AI, we construct physical models realizing generative AI systems as open quantum systems. As an illustration, we construct physical models realizing large language models based on a transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens.
The paper is organized as follows. In Section 2, we include some notation and definitions on the attention mechanism, the transformer, and the effect algebras. In Section 3, we give the definition of a generative AI system as a family of sequential joint probabilities associated with input texts and temporal sequences of tokens. This is based on the mathematical theory developed by Gudder (cf. [1]) for a historical approach to physical evolution processes. Those joint probabilities characterize the attention mechanisms as well as the mathematical structure of the transformer architecture. In Section 4, we present the construction of physical models realizing generative AI systems as open quantum systems. Our physical models are given by an event-history approach to physical systems; we refer to [2] for the background of physics for this formulation. In Section 5, we construct physical models realizing large language models based on a transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens. Finally, in Section 6, we give a summary of our innovative points listed item by item and conclude the contributions of the paper.

2. Preliminaries

In this section, we present a mathematical description of the attention mechanism and transformer architecture for generative AI and include some notations and basic properties of σ-effect algebras (cf. [11]). For the sake of convenience, we collect some notations and definitions. Denote by $\mathbb{N}$ the set of natural numbers $\{1,2,\ldots\}$, and for $n \in \mathbb{N}$, we use the notation $[n]$ to represent the set $\{1,\ldots,n\}$. For $d \in \mathbb{N}$, we denote by $\mathbb{R}^d$ the $d$-dimensional Euclidean space with the usual inner product $\langle \cdot, \cdot \rangle$. For two sets $X, Y$, we denote by $\mathrm{Hom}(X,Y)$ the set of all maps from $X$ into $Y$. For a set $S$, we denote $S^* = \bigcup_{n \in \mathbb{N}} S^{(n)}$, where $S^{(n)}$ is the set of all sequences $(s_1,\ldots,s_n)$ of $n$ elements in $S$; i.e., $S^*$ is the set of all finite sequences of elements in $S$.

2.1. Deep Neural Networks

A DNN is constructed by connecting multiple neurons. Recall that a (feed-forward) neural network of depth $L$ consists of some number of neurons arranged in $L+1$ layers. Layer $\ell = 0$ is the input layer, where data is presented to the network, while layer $\ell = L$ is where the output is read out. All layers in between are referred to as the hidden layers, and each hidden layer has an activation, which is a map within the same layer. Specifically, let $\{X_\ell\}_{\ell=0}^{L}$ be a sequence of sets, where $X_\ell$ indexes the neurons in layer $\ell$, and let $\{V_\ell\}_{\ell=0}^{L}$ be a sequence of vector spaces. A mapping $\Phi: \mathrm{Hom}(X_0, V_0) \to \mathrm{Hom}(X_L, V_L)$ is called a feed-forward neural network of depth $L$ if there exist a sequence $\{W_\ell\}_{\ell=1}^{L}$ of maps $W_\ell: \mathrm{Hom}(X_{\ell-1}, V_{\ell-1}) \to \mathrm{Hom}(X_\ell, V_\ell)$ and a sequence $\{\sigma_\ell\}_{\ell=1}^{L-1}$ of maps $\sigma_\ell: V_\ell \to V_\ell$, called the activation functions at the layers $\ell$, such that
$\Phi(f_0) = W_L\bigl(\sigma_{L-1}\bigl(W_{L-1}\bigl(\cdots \sigma_1(W_1(f_0))\cdots\bigr)\bigr)\bigr),$
for $f_0 \in \mathrm{Hom}(X_0, V_0)$, where $f_0$ is called the input and $f_L = \Phi(f_0) \in \mathrm{Hom}(X_L, V_L)$ is the output. We call $(\{W_\ell\}_{\ell=1}^{L}, \{\sigma_\ell\}_{\ell=1}^{L-1})$ the architecture of the neural network $\Phi$. Of course, $\Phi$ is determined by its architecture, and there exist different choices of architectures yielding the same $\Phi$.
In their most basic form, $X_\ell$ is a finite set of $n_\ell$ elements and $V_\ell = \mathbb{R}$, so a feed-forward neural network $\Phi: \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ is a function of the following form: the input is $x^{(0)} = x \in \mathbb{R}^{n_0}$, $x^{(\ell)} = \sigma_\ell(W_\ell(x^{(\ell-1)}))$ for $\ell = 1,\ldots,L-1$, and
$\Phi(x) = x^{(L)} = W_L\bigl(\sigma_{L-1}\bigl(W_{L-1}\bigl(\cdots \sigma_1(W_1(x^{(0)}))\cdots\bigr)\bigr)\bigr),$
where $x^{(L)} \in \mathbb{R}^{n_L}$ is the output. This can be illustrated as follows:
[Scheme: the layer-by-layer composition of a feed-forward neural network.]
Here, the map $W_\ell: \mathbb{R}^{n_{\ell-1}} \to \mathbb{R}^{n_\ell}$ is usually of the form
$W_\ell x^{(\ell-1)} = A_\ell x^{(\ell-1)} + b_\ell, \quad \ell = 1,\ldots,L,$
where $A_\ell$ is an $n_\ell \times n_{\ell-1}$ matrix called a weight matrix and $b_\ell \in \mathbb{R}^{n_\ell}$ is called a bias vector for each $\ell$, and the function $\sigma_\ell: \mathbb{R}^{n_\ell} \to \mathbb{R}^{n_\ell}$ represents the activation function at the $\ell$-th layer. The set of all entries of the weight matrices and bias vectors of a neural network $\Phi$ are called the parameters of $\Phi$. These parameters are adjustable and learned during the training process, determining the specific function realized by the network. Also, the depth $L$, the number of neurons in each layer, and the activation functions of a neural network $\Phi$ are called the hyperparameters of $\Phi$. They define the network’s architecture (and training process) and are typically set before training begins. For a fixed architecture, every choice of network parameters as in (3) defines a specific function $\Phi$, and this function is often referred to as a model.
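As a concrete illustration, the following is a minimal sketch (in Python, with NumPy) of the forward pass just described: affine maps $W_\ell x = A_\ell x + b_\ell$ followed by activations on the hidden layers. The layer widths, random weights, and the ReLU activation are illustrative choices, not parameters from the text.

```python
import numpy as np

def feed_forward(x, weights, biases, activation=lambda z: np.maximum(z, 0.0)):
    """Compute Phi(x) = W_L(sigma_{L-1}(W_{L-1}(... sigma_1(W_1(x)) ...)))."""
    L = len(weights)
    for ell, (A, b) in enumerate(zip(weights, biases), start=1):
        x = A @ x + b                      # affine map W_ell
        if ell < L:                        # activations act on hidden layers only
            x = activation(x)
    return x

rng = np.random.default_rng(0)
widths = [3, 5, 4, 2]                      # n_0, n_1, n_2, n_3 (depth L = 3)
weights = [rng.normal(size=(widths[ell + 1], widths[ell])) for ell in range(3)]
biases = [rng.normal(size=widths[ell + 1]) for ell in range(3)]
print(feed_forward(rng.normal(size=3), weights, biases))
```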
In a feed-forward neural network, the inputs to neurons in the $\ell$-th layer are usually exclusively neurons from the $(\ell-1)$-th layer. However, residual neural networks (ResNets for short) allow skip connections; that is, information is allowed to skip layers in the sense that the neurons in layer $\ell$ may have $x^{(0)},\ldots,x^{(\ell-1)}$ as their input (and not just $x^{(\ell-1)}$). In their most basic form, $x^{(0)} = x \in \mathbb{R}^d$, and
$x^{(\ell)} = x^{(\ell-1)} + Q_\ell\, \sigma_\ell\bigl(A_\ell x^{(\ell-1)} + b_\ell\bigr), \quad \ell = 1,\ldots,L-1,$
where $\sigma_\ell: \mathbb{R}^d \to \mathbb{R}^d$ is a vector function, the $Q_\ell, A_\ell$ are $d \times d$ matrices, and the $b_\ell$ are vectors in $\mathbb{R}^d$. In contrast to feed-forward neural networks, recurrent neural networks (RNNs for short) allow information to flow backward, in the sense that $x^{(\ell-1)}, x^{(\ell+1)},\ldots,x^{(L)}$ may serve as input for the neurons in layer $\ell$, and not just $x^{(\ell-1)}$. We refer to [12] for more details, such as training for a neural network.
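For comparison with the feed-forward sketch above, here is a minimal residual update in the same style; the matrix shapes and the tanh activation are illustrative assumptions.

```python
import numpy as np

def residual_step(x, Q, A, b, activation=np.tanh):
    """One ResNet update: x^(l) = x^(l-1) + Q_l sigma_l(A_l x^(l-1) + b_l)."""
    return x + Q @ activation(A @ x + b)

rng = np.random.default_rng(1)
d, x = 4, np.zeros(4)
for _ in range(3):                         # three residual layers of equal width
    x = residual_step(x, rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d))
print(x)
```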

2.2. Attention

The fundamental definition of attention was given by Bahdanau et al. in 2014. To describe the mathematical definition of attention, we denote by $\mathcal{Q} \subset \mathbb{R}^{d_q}$ the query space, $\mathcal{K} \subset \mathbb{R}^{d_k}$ the key space, and $\mathcal{V} \subset \mathbb{R}^{d_v}$ the value space. We call an element $q \in \mathcal{Q}$ a query, $k \in \mathcal{K}$ a key, $v \in \mathcal{V}$ a value, and so on.
Definition 1 
(cf. [13]). Let $a: \mathcal{Q} \times \mathcal{K} \to \mathbb{R}$ be a function. Let $K = \{k_1,\ldots,k_N\} \subset \mathcal{K}$ be a set of keys and $V = \{v_1,\ldots,v_N\} \subset \mathcal{V}$ a set of values. Given a query $q \in \mathcal{Q}$, the attention $\mathrm{Att}(q, K, V)$ is defined by
$\mathrm{Att}(q, K, V) = \sum_{n=1}^{N} \mathrm{softmatch}_a(q, K)_n \cdot v_n,$
where $\mathrm{softmatch}_a(q, K)$ is a probability distribution over $K = \{k_1,\ldots,k_N\}$ defined by
$\mathrm{softmatch}_a(q, K)_n = \frac{e^{a(q, k_n)}}{\sum_{j=1}^{N} e^{a(q, k_j)}}, \quad n = 1,\ldots,N.$
This means that a value $v_n$ in (6) occurs with probability $\mathrm{softmatch}_a(q, K)_n$ for $n \in [N]$.
For $Q = \{q_1,\ldots,q_M\} \subset \mathcal{Q}$, we define
$\mathrm{Att}(Q, K, V) = \{\mathrm{Att}(q_m, K, V)\}_{m=1}^{M}.$
In particular, when $Q = K = V$, $\mathrm{Att}(Q, Q, Q)$ is said to be self-attention at $Q$, and the mapping $\mathrm{SelfAtt}$ defined by
$\mathrm{SelfAtt}(Q) = \mathrm{Att}(Q, Q, Q),$
is called the self-attention map.
We remark that
(1)
For a finite sequence $\{x_n\}_{n=1}^{N}$ of real numbers, define
$\mathrm{softmax}(\{x_j\}_{j=1}^{N})_n = \frac{e^{x_n}}{\sum_{j=1}^{N} e^{x_j}}, \quad n \in [N].$
Then,
$\mathrm{softmatch}_a(q, K)_n = \mathrm{softmax}(\{a(q, k_j)\}_{j=1}^{N})_n,$
as usual in the literature.
(2)
We have $|K| = |V| = N$, but $|Q| = M \neq N$ in general.
(3)
The function $a: \mathcal{Q} \times \mathcal{K} \to \mathbb{R}$ is called a similarity function, usually given by
$a(q, k) = \frac{1}{\sqrt{d}} \langle W^Q q, W^K k \rangle,$
where $W^Q$ is a $d \times d_q$ real matrix called a query matrix and $W^K$ is a $d \times d_k$ real matrix called a key matrix. For $q \in \mathcal{Q}$, $k \in \mathcal{K}$, the real number $a(q, k)$ is interpreted as the similarity between the query $q$ and the key $k$.
(4)
In the representation learning framework of attention, we usually assume that the finite set $T$ of tokens has been embedded in $\mathbb{R}^d$, where $d$ is called the embedding dimension, so we identify each $t \in T$ with one of finitely many vectors $x$ in $\mathbb{R}^d$. We assume that the structure (positional information, adjacency information, etc.) is encoded in these vectors. In the case of self-attention, we assume $d_q = d_k = d_v = d$.
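The following is a minimal numerical sketch (in Python) of Definition 1 together with the similarity function of remark (3): $\mathrm{softmatch}_a(q,K)$ is a softmax over similarities, and $\mathrm{Att}(q,K,V)$ is the corresponding average of the values. The dimensions and the random matrices $W^Q, W^K$ are toy choices.

```python
import numpy as np

def softmatch(q, K, a):
    """softmatch_a(q, K)_n = exp(a(q, k_n)) / sum_j exp(a(q, k_j))."""
    scores = np.array([a(q, k) for k in K])
    w = np.exp(scores - scores.max())        # subtract the max for numerical stability
    return w / w.sum()

def attention(q, K, V, a):
    """Att(q, K, V) = sum_n softmatch_a(q, K)_n * v_n."""
    return softmatch(q, K, a) @ V

rng = np.random.default_rng(2)
d, d_q, d_k, d_v, N = 4, 4, 4, 3, 5
W_Q, W_K = rng.normal(size=(d, d_q)), rng.normal(size=(d, d_k))
a = lambda q, k: (W_Q @ q) @ (W_K @ k) / np.sqrt(d)   # the similarity function
K = rng.normal(size=(N, d_k))                          # keys k_1, ..., k_N
V = rng.normal(size=(N, d_v))                          # values v_1, ..., v_N
print(attention(rng.normal(size=d_q), K, V, a))        # a vector in R^{d_v}
```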
Since the self-attention mechanism can be composed to arbitrary depth, making it a crucial building block of the transformer architecture, we mainly focus on it in what follows. In practice, we need multi-headed attention (cf. [4]), which processes independent copies of the data $X$ and combines them with concatenation and matrix multiplication. Let $X = \{x_n\}_{n=1}^{N}$ be the input set of tokens embedded in $\mathbb{R}^d$. Let us consider $n_h$-headed attention with dimension $d_h$ for every head. For every $i \in [n_h]$, let $W_i^Q, W_i^K, W_i^V$ be $d_h \times d$ (query, key, value) matrices associated with the $i$-th self-attention, and let the similarity function be
$a_i(x, y) = \frac{1}{\sqrt{d_h}} \langle W_i^Q x, W_i^K y \rangle.$
Let $W^O = [W_1^O, \ldots, W_{n_h}^O]$ denote the output projection matrix, where $W_i^O$ is a $d \times d_h$ matrix for every $i \in [n_h]$. For $n \in [N]$, the multi-headed self-attention (MHSelfAtt for short) is then defined by
$\mathrm{MHSelfAtt}(x_n, X, X) = \sum_{i=1}^{n_h} \sum_{j_i=1}^{n} \mathrm{softmax}\Bigl(\Bigl\{\tfrac{1}{\sqrt{d_h}} \langle W_i^Q x_n, W_i^K x_\ell \rangle\Bigr\}_{\ell=1}^{n}\Bigr)_{j_i} \bigl[W_i^O (W_i^V x_{j_i})\bigr],$
that is, an output
$u_n = \sum_{i=1}^{n_h} W_i^O (W_i^V x_{j_i}), \quad j_i \in [n],$
occurs with the probability $\prod_{i=1}^{n_h} \mathrm{softmax}\bigl(\bigl\{\tfrac{1}{\sqrt{d_h}} \langle W_i^Q x_n, W_i^K x_\ell \rangle\bigr\}_{\ell=1}^{n}\bigr)_{j_i}$. As such,
$\mathrm{MHSelfAtt}(X) = \{\mathrm{MHSelfAtt}(x_n, X, X)\}_{n=1}^{N},$
yields a basic building block of the transformer
$\mathrm{Transf}(X) = \mathrm{FFN} \circ \mathrm{MHSelfAtt}(X),$
as in the case of one-headed attention.
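The next sketch computes multi-headed self-attention in its usual deterministic (expectation) form, summing each head's values with its softmax weights and applying the output projections; the paper's probabilistic reading would instead sample one index $j_i$ per head with those weights. All matrices below are random placeholders rather than trained parameters.

```python
import numpy as np

def softmax(s):
    w = np.exp(s - s.max())
    return w / w.sum()

def mh_self_attention(X, WQ, WK, WV, WO):
    """X: (N, d) token embeddings; WQ/WK/WV: lists of (d_h, d) matrices; WO: list of (d, d_h)."""
    N, d = X.shape
    out = np.zeros((N, d))
    for n in range(N):
        for Wq, Wk, Wv, Wo in zip(WQ, WK, WV, WO):
            d_h = Wq.shape[0]
            scores = np.array([(Wq @ X[n]) @ (Wk @ x) for x in X[: n + 1]]) / np.sqrt(d_h)
            probs = softmax(scores)                          # this head's attention weights
            out[n] += Wo @ (probs @ (X[: n + 1] @ Wv.T))     # weighted values, projected by W_i^O
    return out

rng = np.random.default_rng(3)
n_h, d_h, d, N = 2, 3, 6, 4
WQ, WK, WV = ([rng.normal(size=(d_h, d)) for _ in range(n_h)] for _ in range(3))
WO = [rng.normal(size=(d, d_h)) for _ in range(n_h)]
print(mh_self_attention(rng.normal(size=(N, d)), WQ, WK, WV, WO))
```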

2.3. Transformer

In line with successful models, such as large language models, we focus on the decoder-only setting of the transformer, where the model iteratively predicts the next tokens based on a given sequence of tokens. This procedure is called autoregressive since the prediction of new tokens is only based on previous tokens. Such conditional sequence generation using autoregressive transformers is referred to as the transformer architecture.
Specifically, in the transformer architecture defined by a composition of blocks, each block consists of a self-attention layer $\mathrm{SelfAtt}$, a multi-layer perceptron $\mathrm{FFN}$, and a prediction head layer $\mathrm{PH}$. First, the self-attention layer $\mathrm{SelfAtt}$ is the only layer that combines different tokens. Let us denote the input text to the layer by $X = \{x_n\}_{n=1}^{N}$, embedded in $\mathbb{R}^d$, and focus on the $n$-th output. For each $n \in [N]$, letting
$s_j^{(n)} = \frac{1}{\sqrt{d}} \langle W^Q x_n, W^K x_j \rangle, \quad j \in [n],$
where $W^Q$ and $W^K$ are two $d \times d$ matrices (i.e., the query and key matrices), we can interpret $S^{(n)} = \{s_j^{(n)}\}_{j=1}^{n}$ as similarities between the $n$-th token $x_n$ (i.e., the query) and the other tokens (i.e., keys); to satisfy the autoregressive structure, we only consider $j = 1,\ldots,n$. The softmax layer is given by
$\mathrm{softmax}(S^{(n)})_j = \frac{e^{s_j^{(n)}}}{\sum_{i=1}^{n} e^{s_i^{(n)}}}, \quad j \in [n],$
which can be interpreted as the probability for the $n$-th query to “attend” to the $j$-th key. Then, the self-attention layer $\mathrm{SelfAtt}$ can be defined as
$\mathrm{SelfAtt}(X)_n = \sum_{j=1}^{n} \mathrm{softmax}(S^{(n)})_j\, W^V x_j, \quad n \in [N],$
where $W^V$ is a $d \times d$ real matrix such that $W^V x \in T$ for any $x \in T$; the output $W^V x_j$, occurring with the probability $\mathrm{softmax}(S^{(n)})_j$, is often referred to as the value of the token $x_j$. Thus, $\mathrm{SelfAtt}: (\mathbb{R}^d)^* \to (\mathbb{R}^d)^*$ is a random map such that $\mathrm{SelfAtt}[(\mathbb{R}^d)^{(N)}] \subset (\mathbb{R}^d)^{(N)}$ for each $N \in \mathbb{N}$.
If the attention is a multi-headed attention with $n_h$ heads of dimension $d_h$, where for $i \in [n_h]$, $W_i^Q, W_i^K, W_i^V$ are the $d_h \times d$ (query, key, value) matrices and $W_i^O$ is the $d \times d_h$ (output) matrix of the $i$-th self-attention, then the multi-headed self-attention layer $\mathrm{MHSelfAtt}$ is defined by
$\mathrm{MHSelfAtt}(X)_n = \sum_{i=1}^{n_h} \sum_{j_i=1}^{n} \mathrm{softmax}(S_i^{(n)})_{j_i} \bigl[W_i^O (W_i^V x_{j_i})\bigr], \quad n \in [N],$
where
$\mathrm{softmax}(S_i^{(n)})_{j_i} = \frac{e^{\frac{1}{\sqrt{d_h}} \langle W_i^Q x_n, W_i^K x_{j_i} \rangle}}{\sum_{\ell=1}^{n} e^{\frac{1}{\sqrt{d_h}} \langle W_i^Q x_n, W_i^K x_\ell \rangle}}, \quad j_i \in [n],$
i.e., an output $u_n = \sum_{i=1}^{n_h} W_i^O (W_i^V x_{j_i})$ occurs with the probability $\prod_{i=1}^{n_h} \mathrm{softmax}(S_i^{(n)})_{j_i}$ for each $n \in [N]$. In what follows, we only consider the case of one-headed attention, since the multi-headed case is similar.
Second, the multi-layer perceptron is a feed-forward neural network $\mathrm{FFN}$ such that $y_n = \mathrm{FFN}(W^V x_j)$ with the probability $\mathrm{softmax}(S^{(n)})_j$ ($j \in [n]$) for each $n \in [N]$. Finally, the prediction head layer can be represented as a mapping $\mathrm{PH}: (\mathbb{R}^d)^* \to [0,1]^*$, which maps the sequence $\{y_n\}_{n=1}^{N}$ to a probability distribution $\{p_n\}_{n=1}^{N}$, where $p_n$ is the probability of predicting $y_n$ as the next token. Since $y_N$ contains information about the whole input text, we may define
$\mathrm{PH}[\{y_n\}_{n=1}^{N}] = \sum_{j=1}^{N} \mathrm{softmax}(S^{(N)})_j\, \mathrm{FFN}(W^V x_j),$
such that the next token $x_{N+1} = y_j = \mathrm{FFN}(W^V x_j)$ with the probability $\mathrm{softmax}(S^{(N)})_j$ for $j \in [N]$.
Hence, a basic building block for the transformer, consisting of a self-attention module (SelfAtt) and a feed-forward network (FFN) followed by a prediction head layer (PH), can be illustrated as follows:
[Scheme: one transformer block mapping the embedded text $x_1 \cdots x_n$ through SelfAtt, FFN, and PH to the next token $x_{n+1}$.]
where the input text $t_1 t_2 \cdots t_n$ is embedded as a sequence $\{x_i\}_{i=1}^{n}$ in $\mathbb{R}^d$, $y_j = \mathrm{FFN}(W^V x_j)$ occurs with the probability $\mathrm{softmax}(S^{(n)})_j$ for each $j \in [n]$, and $x_{n+1} = y_j$ is generated with the probability $\mathrm{softmax}(S^{(n)})_j$ for each $j \in [n]$, so that the output is $x_{n+1} = \mathrm{PH} \circ \mathrm{FFN} \circ \mathrm{SelfAtt}(\{x_i\}_{i=1}^{n})$. One can then apply the same operations to the extended sequence $x_1 x_2 \cdots x_n x_{n+1}$ in the next block, obtaining $x_{n+2} = \mathrm{PH} \circ \mathrm{FFN} \circ \mathrm{SelfAtt}(\{x_i\}_{i=1}^{n+1})$, and iteratively compute further tokens (there is usually a stopping criterion based on a special token or the mapping $\mathrm{PH}$). Below, without loss of generality, we omit the prediction head layer $\mathrm{PH}$.
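Read probabilistically, one block samples the index $j$ with the softmax weights and emits $\mathrm{FFN}(W^V x_j)$ as the next token. The following minimal sketch does exactly that; the embeddings, the random matrices, and the identity FFN are illustrative stand-ins.

```python
import numpy as np

def sample_next_token(X, W_Q, W_K, W_V, ffn, rng):
    """Sample x_{n+1}: attend to key j <= n with probability softmax(S^(n))_j, emit ffn(W_V x_j)."""
    n, d = X.shape
    q = W_Q @ X[-1]                                   # query built from the last token
    s = np.array([q @ (W_K @ X[j]) for j in range(n)]) / np.sqrt(d)
    p = np.exp(s - s.max()); p /= p.sum()             # softmax(S^(n))
    j = rng.choice(n, p=p)
    return ffn(W_V @ X[j])

rng = np.random.default_rng(4)
d, n = 4, 3
X = rng.normal(size=(n, d))                           # embedded input text x_1 ... x_n
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
print(sample_next_token(X, W_Q, W_K, W_V, lambda z: z, rng))
```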
Typically, a transformer of depth $L$ is defined by a composition of $L$ blocks, denoted by $\mathrm{Transf}_L$, consisting of $L$ self-attention maps $\{\mathrm{SelfAtt}_\ell\}_{\ell=1}^{L}$ and $L$ feed-forward neural networks $\{\mathrm{FFN}_\ell\}_{\ell=1}^{L}$, that is,
$\mathrm{Transf}_L = (\mathrm{FFN}_L \circ \mathrm{SelfAtt}_L) \circ \cdots \circ (\mathrm{FFN}_1 \circ \mathrm{SelfAtt}_1),$
where the indices $\ell$ of the layers $\mathrm{SelfAtt}_\ell$ and $\mathrm{FFN}_\ell$ in (24) indicate the use of different trainable parameters in each of the blocks. This can be illustrated as follows:
[Scheme: the composition of $L$ transformer blocks, each appending one generated token.]
that is,
$\mathrm{Transf}_L(t_1 \cdots t_n) = t_1' t_2' \cdots t_L',$ the $L$ tokens generated by the successive blocks.
Also, we can consider the transformer of the form
$\mathrm{Transf}_L = \bigl((\mathrm{id} + \mathrm{FFN}_L) \circ (\mathrm{id} + \mathrm{SelfAtt}_L)\bigr) \circ \cdots \circ \bigl((\mathrm{id} + \mathrm{FFN}_1) \circ (\mathrm{id} + \mathrm{SelfAtt}_1)\bigr),$
where $\mathrm{id}$ denotes the identity mapping in $\mathbb{R}^d$, commonly known as a skip or residual connection.
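Putting the pieces together, the sketch below composes $L$ such blocks, each with its own (random, untrained) parameters, and appends one sampled token per block; `next_token` is a compact copy of the single-block sampler above, with the FFN taken as the identity for brevity.

```python
import numpy as np

def next_token(X, W_Q, W_K, W_V, rng):
    q = W_Q @ X[-1]
    s = np.array([q @ (W_K @ x) for x in X]) / np.sqrt(X.shape[1])
    p = np.exp(s - s.max()); p /= p.sum()
    return W_V @ X[rng.choice(len(X), p=p)]           # FFN taken as the identity here

def transformer(X, blocks, rng):
    """Compose L blocks; block l appends the token it generates to the sequence."""
    for W_Q, W_K, W_V in blocks:
        X = np.vstack([X, next_token(X, W_Q, W_K, W_V, rng)])
    return X

rng = np.random.default_rng(5)
d, n, L = 4, 3, 2
blocks = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(L)]
print(transformer(rng.normal(size=(n, d)), blocks, rng))   # the n input rows plus L generated rows
```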

2.4. Effect Algebras

For the sake of convenience, we collect some notations and basic properties of σ-effect algebras (cf. [1,11,14] and references therein). Recall that an effect algebra is an algebraic system $(E, 0, 1, \oplus)$, where $E$ is a non-empty set, $0, 1 \in E$ are called the zero and unit elements of this algebra, respectively, and $\oplus$ is a partial binary operation on $E$ that satisfies the following conditions for any $a, b, c \in E$:
(E1)
(Commutative Law): If $a \oplus b$ is defined, then $b \oplus a$ is defined and $b \oplus a = a \oplus b$, which is called the orthogonal sum of $a$ and $b$;
(E2)
(Associative Law): If $a \oplus b$ and $(a \oplus b) \oplus c$ are defined, then $b \oplus c$ and $a \oplus (b \oplus c)$ are defined and
$(a \oplus b) \oplus c = a \oplus (b \oplus c),$
which is denoted by $a \oplus b \oplus c$;
(E3)
(Orthosupplementation Law): For every $a \in E$, there exists a unique $a' \in E$ such that $a \oplus a'$ is defined and $a \oplus a' = 1$; this $a'$ is called the orthosupplement of $a$;
(E4)
(Zero–One Law): If $a \oplus 1$ is defined, then $a = 0$.
We simply call $E$ an effect algebra in the sequel. From the associative law (E2), we can write $a_1 \oplus a_2 \oplus \cdots \oplus a_n$ if this orthogonal sum is defined. For any $a, b \in E$, we define $a \le b$ if there exists a $c \in E$ such that $a \oplus c = b$; this $c$ is unique and denoted by $c = b \ominus a$, so $a' = 1 \ominus a$. We also define $a \perp b$ if $a \oplus b$ is defined; i.e., $a$ is orthogonal to $b$. It can be shown (cf. [14]) that $(E, \le)$ is a bounded partially ordered set (poset for short) and $a \perp b$ if and only if $a \le b'$. For a sequence $\{a_i\}_{i=1}^{\infty}$ in $E$, if $a_1 \oplus \cdots \oplus a_n$ is defined for all $n \in \mathbb{N}$ and $\bigvee_{n=1}^{\infty} (a_1 \oplus \cdots \oplus a_n)$ exists, then the sum $\oplus_i a_i$ of $\{a_i\}_{i=1}^{\infty}$ exists and we define $\oplus_i a_i = \bigvee_{n=1}^{\infty} (a_1 \oplus \cdots \oplus a_n)$. We say that $E$ is a σ-effect algebra if $\oplus_i a_i$ exists for any sequence $\{a_i\}_{i=1}^{\infty}$ in $E$ such that $a_1 \oplus \cdots \oplus a_n$ is defined for all $n \in \mathbb{N}$. It was shown in (Lemma 3.1, [1]) that $E$ is a σ-effect algebra if and only if the least upper bound $\bigvee_i a_i$ exists for any monotone sequence $\{a_i\}_{i=1}^{\infty}$, i.e., $a_1 \le a_2 \le \cdots$.
Let $E$ and $F$ be σ-effect algebras. A map $\phi: E \to F$ is said to be additive if for $a, b \in E$, $a \perp b$ implies that $\phi(a) \perp \phi(b)$ and $\phi(a \oplus b) = \phi(a) \oplus \phi(b)$. An additive map $\phi: E \to F$ is σ-additive if for any sequence $\{a_i\}_{i=1}^{\infty}$ such that $\oplus_i a_i$ exists, $\oplus_i \phi(a_i)$ exists and $\phi(\oplus_i a_i) = \oplus_i \phi(a_i)$. A σ-additive map $\phi: E \to F$ is said to be a σ-morphism if $\phi(1) = 1$; moreover, $\phi$ is called a σ-isomorphism if $\phi$ is a bijective σ-morphism and $\phi^{-1}: F \to E$ is a σ-morphism. It can be shown (cf. [1]) that
(1)
A map $\phi: E \to F$ is additive if and only if $\phi$ is monotone in the sense that $a \le b$ implies $\phi(a) \le \phi(b)$;
(2)
An additive map $\phi$ is σ-additive if and only if $a_1 \le a_2 \le \cdots$ implies $\phi(\bigvee_i a_i) = \bigvee_i \phi(a_i)$;
(3)
A σ-morphism $\phi$ satisfies $\phi(a') = \phi(a)'$.
The unit interval $[0,1]$ is a σ-effect algebra defined as follows: For any $a, b \in [0,1]$, $a \oplus b$ is defined if $a + b \le 1$, and in this case $a \oplus b = a + b$. Then we have $a' = 1 - a$, and $0, 1$ are the zero and unit elements, respectively. In what follows, we always regard $[0,1]$ as a σ-effect algebra in this way. Let $E$ be a σ-effect algebra; a σ-morphism $\phi: E \to [0,1]$ is called a state on $E$, and we denote by $S(E)$ the set of all states on $E$. A subset $S$ of $S(E)$ is said to be order determining if $\alpha(a) \le \alpha(b)$ for all $\alpha \in S$ implies $a \le b$.
Another example of a σ-effect algebra is a measurable space $(\Omega, \mathcal{F})$ defined as follows: For any $A, B \in \mathcal{F}$, $A \oplus B$ is defined if $A \cap B = \emptyset$, and in this case, $A \oplus B = A \cup B$. We then have $0 = \emptyset$, $1 = \Omega$, and $A' = \Omega \setminus A$. We always regard a measurable space $(\Omega, \mathcal{F})$ as a σ-effect algebra in this way. Let $E$ be a σ-effect algebra; a σ-morphism $X: (\Omega, \mathcal{F}) \to E$ is called an observable on $E$ with values in $(\Omega, \mathcal{F})$ (an $\Omega$-valued observable for short). The elements of a σ-effect algebra are called effects, and so an observable $X$ maps effects in $\mathcal{F}$ into effects in $E$; i.e., $X(A)$ is an effect in $E$ for $A \in \mathcal{F}$. We denote by $O(E, \Omega, \mathcal{F})$ the set of all $\Omega$-valued observables. Note that $S(\Omega, \mathcal{F})$ is equal to the set of all probability measures on $(\Omega, \mathcal{F})$. For $\alpha \in S(E)$ and $X \in O(E, \Omega, \mathcal{F})$, we have $\alpha \circ X \in S(\Omega, \mathcal{F})$, which is called the probability distribution of $X$ in the state $\alpha$.

3. Mathematical Formalism

In this section, we introduce a mathematical formalism for generative AI. We utilize the theory of σ-effect algebras to give a mathematical definition of a generative AI system. Let $E$ be a σ-effect algebra and $(\Omega, \mathcal{F})$ a measurable space. An orthogonal decomposition in $E$ is a sequence $\{a_i\}$ in $E$ such that $\oplus_i a_i$ exists; moreover, it is complete if $\oplus_i a_i = 1$. We denote by $D(E)$ the set of all complete orthogonal decompositions in $E$. A complete orthogonal decomposition in $\mathcal{F}$ is called a countable partition of $\Omega$, i.e., a sequence $\{A_i\}$ of elements in $\mathcal{F}$ such that $A_i \cap A_j = \emptyset$ for $i \neq j$ and $\bigcup_i A_i = \Omega$. We denote by $D(\Omega, \mathcal{F})$ the set of all countable partitions of $\Omega$. For $n \in \mathbb{N}$, an ordered $n$-tuple $R = (e_1,\ldots,e_n)$ of effects in $E$ is called an $n$-time chain-of-effect, and we interpret $R$ as an inference process of an intelligent machine in which the effect $e_i$ occurs at time $\tau_i$ for $i \in [n]$, where $\tau_1 < \tau_2 < \cdots < \tau_n$. Alternatively, no specific times may be involved, and we regard $R$ as a sequential effect in which $e_1$ occurs first, $e_2$ occurs second, …, and $e_n$ occurs last.
Definition 2. 
With the above notations, a generative artificial intelligence system $S$ is defined to be a triple $(E, \Omega, \mathcal{F})$, where $E$ is a σ-effect algebra and $(\Omega, \mathcal{F})$ is a measurable space, such that
(G1) 
The input set $\mathrm{In}(S)$ of $S$ is equal to the set $S(E)$; i.e., an input is interpreted as a state $\alpha \in S(E)$;
(G2) 
The output set $\mathrm{Out}(S)$ of $S$ is equal to the set $\Omega^* = \bigcup_{n=1}^{\infty} \Omega^{(n)}$, i.e., the set of all finite sequences of elements in $\Omega$;
(G3) 
An inference process in $S$ is interpreted as a chain-of-effect $(e_1,\ldots,e_n)$ for $n \in \mathbb{N}$.
Remark 1. 
We refer to [15] for a mathematical definition of general artificial intelligence systems in terms of topos theory, including quantum artificial intelligence systems.
In practice, we are not concerned with a generative AI system S = ( E , Ω , F ) itself but deal with models for S , such as large language models. To this end, we need to introduce the definition of a model for S in terms of joint probability distributions for observables associated with S .
For $X \in O(E, \Omega, \mathcal{F})$ and $A \in \mathcal{F}$, we may view the effect $X(A)$ as the event that $X$ has a value in $A$. For a partition $D = \{A_i\} \in D(\Omega, \mathcal{F})$, we may view $(X, D)$ as a set of possible alternative events that can occur. One interpretation is that $(X, D)$ represents a building block of an artificial intelligence architecture for processing $X$, and the alternatives result from the dial readings of the block. Given $X_i \in O(E, \Omega, \mathcal{F})$, $A_i \in \mathcal{F}$, $i = 1,\ldots,n$, an ordered $n$-tuple $R = (X_1(A_1),\ldots,X_n(A_n))$ of events is called an $n$-time chain-of-events, and we interpret $R$ as an inference process of an intelligent machine in which $X_1$ has a value $a_1$ in $A_1$ first, $X_2$ has a value $a_2$ in $A_2$ second, …, and $X_n$ has a value $a_n$ in $A_n$ last, so that the output result is $(a_1, a_2, \ldots, a_n)$. We denote the set of all $n$-time chain-of-events by $\mathcal{R}^{(n)}$ and the set of all chain-of-events by $\mathcal{R}^* = \bigcup_n \mathcal{R}^{(n)}$.
An $n$-step inference set has the form $I = ((X_1, D_1),\ldots,(X_n, D_n))$, where $X_i \in O(E, \Omega, \mathcal{F})$, $D_i \in D(\Omega, \mathcal{F})$, $i \in [n]$. We interpret $I$ as ordered successive processings of the observables $X_i$ with partitions $D_i$ for $i \in [n]$. We denote the collection of all $n$-step inference sets by $\mathcal{I}^{(n)}$ and the collection of all inference sets by $\mathcal{I}^* = \bigcup_n \mathcal{I}^{(n)}$. If $R = (X_1(A_1),\ldots,X_n(A_n))$ and $I = ((X_1, D_1),\ldots,(X_n, D_n))$ are such that $A_i \in D_i$ for every $i \in [n]$, we say that the chain-of-events $R$ is an element of the inference set $I$ and write $R \in I$. This can be illustrated as follows:
[Scheme: an $n$-step inference set producing the output $(a_1, a_2, \ldots, a_n)$ step by step.]
which means that the machine first obtains $a_1$ as part of an output with probability $P(A_1)$, then obtains $a_2$ with the conditional probability $P(A_2 \mid A_1)$, …, and lastly obtains $a_n$ with the conditional probability $P(A_n \mid A_{n-1},\ldots,A_1)$, finally combining them to obtain the output result $(a_1, a_2, \ldots, a_n)$ with the probability
$P_{\alpha, I}(A_1 \times \cdots \times A_n) = P(A_1)\, P(A_2 \mid A_1) \cdots P(A_n \mid A_{n-1},\ldots,A_1),$
where $P_{\alpha, I}$ will be explained later.
If $I_1 = ((X_1, D_1),\ldots,(X_n, D_n))$ and $I_2 = ((Y_1, J_1),\ldots,(Y_m, J_m))$ are two inference sets, then we define their sequential product by
$I = I_1 \circ I_2 = ((X_1, D_1),\ldots,(X_n, D_n),(Y_1, J_1),\ldots,(Y_m, J_m)),$
and obtain an $(n+m)$-step inference set. Mathematically, we can include the empty inference set $\emptyset$, which satisfies $\emptyset \circ I = I \circ \emptyset = I$, so that $\mathcal{I}^*$ becomes a semigroup under this product.
For a partition $D \in D(\Omega, \mathcal{F})$, we denote by $\sigma(D)$ the σ-subalgebra of $\mathcal{F}$ generated by $D$, and for $n$ partitions $\{D_i\}_{i=1}^{n}$, we denote by $\sigma(\{D_i\}_{i=1}^{n})$ the σ-algebra on $\Omega^{(n)}$ generated by $\{D_i\}_{i=1}^{n}$, i.e.,
$\sigma(\{D_i\}_{i=1}^{n}) = \sigma\bigl(\{A_1 \times \cdots \times A_n \subset \Omega^{(n)} : A_i \in D_i, i \in [n]\}\bigr).$
We denote by $P(\Omega^{(n)}, \sigma(\{D_i\}_{i=1}^{n}))$ the set of all probability measures on $(\Omega^{(n)}, \sigma(\{D_i\}_{i=1}^{n}))$. Also, we write $\sigma(I) = \sigma(\{D_i\}_{i=1}^{n})$ for $I = ((X_1, D_1),\ldots,(X_n, D_n))$. Given an input $\alpha \in S(E)$, for an inference set $I = ((X_1, D_1),\ldots,(X_n, D_n))$, we denote by $P_{\alpha, I} \in P(\Omega^{(n)}, \sigma(I))$ the probability measure such that for $A_1 \times \cdots \times A_n \in \sigma(\{D_i\}_{i=1}^{n})$, $P_{\alpha, I}(A_1 \times \cdots \times A_n)$ is the probability within the inference set $I$ that the event $X_1(A_1)$ occurs first, $X_2(A_2)$ occurs second, …, and $X_n(A_n)$ occurs last. We call $P_{\alpha, I} \in P(\Omega^{(n)}, \sigma(I))$ the joint probability distribution of the inference set $I$ under the input $\alpha \in S(E)$.
For interpreting a model for a generative AI system, P α , I ’s need to satisfy physically motivated axioms as follows.
Definition 3. 
With the above notations, a model M for S = ( E , Ω , F ) is defined to be a family of joint probability distributions of inference sets
$M = \bigcup_{n \in \mathbb{N}} \bigl\{ P_{\alpha, I} \in P(\Omega^{(n)}, \sigma(I)) : \alpha \in S(E),\ I \in \mathcal{I}^{(n)} \bigr\},$
that satisfies the following axioms:
(P1) 
For $I_1 = (X, D), I_2 = (Y, J) \in \mathcal{I}^{(1)}$ and $A \in \sigma(D)$, $B \in \sigma(J)$, if $P_{\alpha, I_1}(A) = P_{\alpha, I_2}(B)$ for all $\alpha \in S(E)$, then $X(A) = Y(B)$.
(P2) 
For $I \in \mathcal{I}^*$ and $I_i = (X, D_i)$, $i = 1, 2$, if $A \in \sigma(I)$ and $B \in \sigma(D_1) \cap \sigma(D_2)$, then
$P_{\alpha, I \circ I_1}(A \times B) = P_{\alpha, I \circ I_2}(A \times B),$
for every $\alpha \in S(E)$.
(P3) 
For $I \in \mathcal{I}^*$ and $J = (X, D)$ with $D = \{B_i\}$, if $A \in \sigma(I)$, then
$P_{\alpha, I \circ J}(A \times \Omega) = \sum_i P_{\alpha, I \circ J}(A \times B_i) = P_{\alpha, I}(A), \quad \alpha \in S(E).$
(P4) 
If $I_1 = ((X_1, D_1),\ldots,(X_n, D_n))$, $I_2 = ((X_1, J_1),\ldots,(X_n, J_n))$, and $A_i \in D_i \cap J_i$ for $i \in [n]$, then
$P_{\alpha, I_1}(A_1 \times \cdots \times A_n) = P_{\alpha, I_2}(A_1 \times \cdots \times A_n),$
for every $\alpha \in S(E)$.
For the physical meanings of the model structure axioms, we remark that
(1)
The axiom ( P 1 ) means that the input set can distinguish different events;
(2)
The axiom ( P 2 ) means that the partition of the last processing is irrelevant;
(3)
The axiom ( P 3 ) means that the last processing does not affect the previous ones;
(4)
The axiom (P4) means that the probability of a chain of events does not depend on the partitions and hence is unambiguous. However, for $B \in \sigma(I_1) \cap \sigma(I_2)$ in (P4), $P_{\alpha, I_1}(B) \neq P_{\alpha, I_2}(B)$ in general if the $X_i$'s are quantum observables, due to quantum interference.
If $I_1 = ((X_1, D_1),\ldots,(X_n, D_n))$ and $I_2 = ((Y_1, J_1),\ldots,(Y_m, J_m))$ are two inference sets, $A \in \sigma(\{D_i\}_{i=1}^{n})$, $B \in \sigma(\{J_j\}_{j=1}^{m})$, and if $\alpha$ is an input such that $P_{\alpha, I_1}(A) \neq 0$, then we define the conditional probability of $B$ given $A$ within $I_1 \circ I_2$ under the input $\alpha$ as follows:
$P_{\alpha, I_2 \mid I_1}(B \mid A) = \frac{P_{\alpha, I_1 \circ I_2}(A \times B)}{P_{\alpha, I_1}(A)}.$
Since $P_{\alpha, I_1 \circ I_2}$ is a probability measure on $(\Omega^{(n+m)}, \sigma(\{D_i'\}_{i=1}^{n+m}))$, where $D_i' = D_i$ for $i \in [n]$ and $D_{n+j}' = J_j$ for $j \in [m]$, $P_{\alpha, I_2 \mid I_1}(\cdot \mid A)$ is a probability measure on $(\Omega^{(m)}, \sigma(\{J_j\}_{j=1}^{m}))$, which is called a conditional sequential joint probability distribution.
Proposition 1. 
Given $\alpha \in S(E)$, $I \in \mathcal{I}^*$ and $A \in \sigma(I)$, if $P_{\alpha, I}(A) \neq 0$, then the conditional sequential joint probability distribution $P_{\alpha, \cdot \mid I}(\cdot \mid A)$ satisfies the axioms (P2)–(P4) in Definition 3.
Proof. 
By the axiom (P2), for $I_1 \in \mathcal{I}^*$ and $I_2 = (X, D_1)$, $I_2' = (X, D_2)$, with $B \in \sigma(I_1)$ and $C \in \sigma(D_1) \cap \sigma(D_2)$, we have
$P_{\alpha, I_1 \circ I_2 \mid I}(B \times C \mid A) = \frac{P_{\alpha, I \circ I_1 \circ I_2}(A \times B \times C)}{P_{\alpha, I}(A)} = \frac{P_{\alpha, I \circ I_1 \circ I_2'}(A \times B \times C)}{P_{\alpha, I}(A)} = P_{\alpha, I_1 \circ I_2' \mid I}(B \times C \mid A);$
hence, $P_{\alpha, \cdot \mid I}(\cdot \mid A)$ satisfies the axiom (P2), and similarly the axiom (P3). Likewise, the axiom (P4) implies that $P_{\alpha, \cdot \mid I}(\cdot \mid A)$ satisfies the axiom (P4); we omit the details. □
We remark that when the observables are quantum ones, Bayes’ formula need not hold, i.e.,
$P_{\alpha, I_2 \mid I_1}(B \mid A)\, P_{\alpha, I_1}(A) \neq P_{\alpha, I_1 \mid I_2}(A \mid B)\, P_{\alpha, I_2}(B),$
in general. This is because the left-hand side is $P_{\alpha, I_1 \circ I_2}(A \times B)$ and the right-hand side is $P_{\alpha, I_2 \circ I_1}(B \times A)$, so the order of the occurrences is changed. For instance, consider a qubit with the standard basis $|0\rangle$ and $|1\rangle$. Let $|x\rangle = \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)$. If $X_1(A) = |0\rangle\langle 0|$, $X_2(B) = |x\rangle\langle x|$, and $\alpha = |0\rangle\langle 0|$, then
$P_{\alpha, I_1 \circ I_2}(A \times B) = \mathrm{Tr}\bigl[\alpha \bigl(X_2(B)^{\frac12} X_1(A)^{\frac12}\bigr)^{\dagger} X_2(B)^{\frac12} X_1(A)^{\frac12}\bigr] = \mathrm{Tr}\bigl[\alpha\, X_1(A)^{\frac12} X_2(B)\, X_1(A)^{\frac12}\bigr] = \frac12, \qquad P_{\alpha, I_2 \circ I_1}(B \times A) = \mathrm{Tr}\bigl[\alpha \bigl(X_1(A)^{\frac12} X_2(B)^{\frac12}\bigr)^{\dagger} X_1(A)^{\frac12} X_2(B)^{\frac12}\bigr] = \mathrm{Tr}\bigl[\alpha\, X_2(B)^{\frac12} X_1(A)\, X_2(B)^{\frac12}\bigr] = \frac14,$
and so, $P_{\alpha, I_2 \mid I_1}(B \mid A)\, P_{\alpha, I_1}(A) \neq P_{\alpha, I_1 \mid I_2}(A \mid B)\, P_{\alpha, I_2}(B)$. We refer to Section 4 for more details.
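A quick numerical check of this qubit example (both effects are projections, so $E^{1/2} = E$):

```python
import numpy as np

ket0 = np.array([1.0, 0.0])
ketx = np.array([1.0, 1.0]) / np.sqrt(2.0)
alpha = np.outer(ket0, ket0)                 # the input state |0><0|
X1A = np.outer(ket0, ket0)                   # the effect X_1(A)
X2B = np.outer(ketx, ketx)                   # the effect X_2(B)

p_AB = np.trace(alpha @ X1A @ X2B @ X1A)     # Tr[alpha X_1(A)^{1/2} X_2(B) X_1(A)^{1/2}]
p_BA = np.trace(alpha @ X2B @ X1A @ X2B)     # Tr[alpha X_2(B)^{1/2} X_1(A) X_2(B)^{1/2}]
print(p_AB, p_BA)                            # 0.5 and 0.25: the order of the events matters
```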

4. Physical Models for Generative AI

Physical models for generative AI are usually described by using systems of mean-field interacting particles, such as large language models based on attention mechanisms (cf. [7,8] and references therein); i.e., generative AI systems are regarded as classical statistical ensembles. However, since modern chips process data by controlling the flow of electric current, i.e., the dynamics of a vast number of electrons, they should be regarded as quantum statistical ensembles from a physical perspective (cf. [10]). Consequently, we need to model modern intelligent machines as open quantum systems. To this end, combining the history theory of quantum systems (cf. [2]) and the theory of effect algebras (cf. [1,14]), we construct physical models realizing generative AI systems as open quantum systems.
Let $H$ be a separable complex Hilbert space with the inner product $\langle \cdot \mid \cdot \rangle$, which is conjugate-linear in the first variable and linear in the second variable. We denote by $L(H)$ the set of all bounded linear operators on $H$, by $O(H)$ the set of all bounded self-adjoint operators, and by $P(H)$ the set of all orthogonal projection operators. We denote by $I$ the identity operator on $H$. Unless stated otherwise, an operator means a bounded linear operator in the sequel. An operator $T$ is positive if $\langle x \mid Tx \rangle \ge 0$ for all $x \in H$, and in this case we write $T \ge 0$. We define $\mathrm{Tr}[T] = \sum_i \langle x_i \mid T x_i \rangle$ for a positive operator $T$, where $\{x_i\}$ is an orthonormal basis of $H$. It is known that $\mathrm{Tr}[T]$ is independent of the choice of the basis, and it is called the trace of $T$ if $\mathrm{Tr}[T] < \infty$. A positive operator $\rho$ is a density operator if $\mathrm{Tr}[\rho] = 1$, and the set of all density operators on $H$ is denoted by $S(H)$. Each positive operator is self-adjoint, and if two self-adjoint operators $S, T$ are such that $T - S \ge 0$, we write $T \ge S$ or $S \le T$. We refer to [16,17,18] for more details on the theory of operators on Hilbert spaces.
A self-adjoint operator $E$ that satisfies $0 \le E \le I$ is called an effect, and the set of all effects on $H$ is denoted by $E(H)$. For $E, F \in E(H)$, we define $E \oplus F = E + F$ if $E + F \le I$, and in this case we write $E \perp F$. It can be shown (cf. (Lemma 5.1, [1])) that $(E(H), 0, I, \oplus)$ is a σ-effect algebra, and each state $\alpha$ on $E(H)$ has the form $\alpha(E) = \mathrm{Tr}[\rho E]$ for every $E \in E(H)$, where $\rho$ is a unique density operator on $H$, and vice versa. Thus, we identify $S(E(H)) = S(H)$. Let $(\Omega, \mathcal{F})$ be a measurable space. An observable $X \in O(E(H), \Omega, \mathcal{F})$ is a positive operator-valued (POV for short) measure on $(\Omega, \mathcal{F})$; i.e.,
(1)
$X(F)$ is an effect in $E(H)$ for any $F \in \mathcal{F}$;
(2)
$X(\emptyset) = 0$ and $X(\Omega) = I$;
(3)
For an orthogonal decomposition $\{F_j\}$ in $\mathcal{F}$,
$X\bigl(\bigcup_j F_j\bigr) = \sum_j X(F_j),$
where the series on the right-hand side converges in the strong operator topology on $L(H)$, i.e.,
$X\bigl(\bigcup_j F_j\bigr) h = \sum_j X(F_j)\, h,$
for every $h \in H$.
To understand the inference process, let us recall the conventional interpretation of joint probability distributions in an open quantum system that is subject to measurements by an external observer. To this end, let $\mathcal{E}(t,s) = \{K_m(t,s)\}$ denote the time-evolution operation from time $s$ to $t$, where the $K_m(t,s)$ are usually called Kraus operators, such that
$\sum_m K_m(t,s)^{\dagger} K_m(t,s) = I.$
That is, the $\mathcal{E}(t,s)$ are quantum operations (cf. [19]) such that for every state $\rho \in S(H)$,
$\mathcal{E}(t,s)\rho = \sum_m K_m(t,s)\, \rho\, K_m(t,s)^{\dagger},$
in the Schrödinger picture, while for each observable $X \in O(H)$,
$\mathcal{E}(t,s) X = \sum_m K_m(t,s)^{\dagger}\, X\, K_m(t,s),$
in the Heisenberg picture. We refer to [9] for the details of the theory of open quantum systems.
Then the density operator state $\rho(t_0)$ at time $t_0$ evolves in time $t_1 \ge t_0$ to the state $\rho(t_1)$, where
$\rho(t_1) = \mathcal{E}(t_1, t_0)\rho(t_0) = \sum_m K_m(t_1, t_0)\, \rho(t_0)\, K_m(t_1, t_0)^{\dagger}.$
Suppose that a measurement $(X_1, D_1)$ is made at time $t_1$, where $X_1 \in O(E(H), \Omega, \mathcal{F})$ and $D_1 \in D(\Omega, \mathcal{F})$. Then the probability that an event $X_1(A_1)$ with $A_1 \in D_1$ occurs is
$P(X_1(A_1), \rho(t_1)) = \mathrm{Tr}[X_1(A_1)\, \rho(t_1)].$
If the result of this measurement is kept, then, according to the von Neumann–Lüders reduction postulate, the appropriate density operator to use for any further calculation is
$\rho_{\mathrm{red}}(t_1) = \frac{X_1(A_1)^{\frac12}\, \rho(t_1)\, X_1(A_1)^{\frac12}}{\mathrm{Tr}[X_1(A_1)\, \rho(t_1)]}.$
Next, suppose a measurement $(X_2, D_2)$ is performed at time $t_2 > t_1$. Then, according to the above, the conditional probability that an event $X_2(A_2)$ with $A_2 \in D_2$ occurs at time $t_2$, given that the event $X_1(A_1)$ occurred at time $t_1$ (and that the original state was $\rho(t_0)$), is
$P(X_2(A_2) \mid X_1(A_1), \rho(t_0)) = \mathrm{Tr}[X_2(A_2)\, \rho(t_2)],$
where $\rho(t_2) = \mathcal{E}(t_2, t_0)\rho_{\mathrm{red}}(t_1)$, and the appropriate density operator to use for any further calculation is
$\rho_{\mathrm{red}}(t_2) = \frac{X_2(A_2)^{\frac12}\, \rho(t_2)\, X_2(A_2)^{\frac12}}{\mathrm{Tr}[X_2(A_2)\, \rho(t_2)]}.$
The joint probability of $X_1(A_1)$ occurring at $t_1$ and $X_2(A_2)$ occurring at $t_2$ is then
$P((X_1(A_1), X_2(A_2)), \rho(t_0)) = P(X_1(A_1), \rho(t_1))\, P(X_2(A_2) \mid X_1(A_1), \rho(t_0)) = \mathrm{Tr}[X_1(A_1)\, \rho(t_1)]\, \mathrm{Tr}[X_2(A_2)\, \rho(t_2)].$
Generalizing to a sequence of measurements $(X_1, D_1), (X_2, D_2), \ldots, (X_n, D_n)$ at times $t_1 < t_2 < \cdots < t_n$, where $X_i \in O(E(H), \Omega, \mathcal{F})$ and $D_i \in D(\Omega, \mathcal{F})$ for $i \in [n]$, the sequential joint probability of the associated events $X_i(A_i)$ with $A_i \in D_i$ occurring at $t_i$ for $i \in [n]$ is
$P((X_1(A_1), X_2(A_2), \ldots, X_n(A_n)), \rho(t_0)) = \mathrm{Tr}[X_1(A_1)\, \rho(t_1)]\, \mathrm{Tr}[X_2(A_2)\, \rho(t_2)] \cdots \mathrm{Tr}[X_n(A_n)\, \rho(t_n)],$
where $\rho(t_i) = \mathcal{E}(t_i, t_0)\rho_{\mathrm{red}}(t_{i-1})$ for $i \in [n]$, $\rho_{\mathrm{red}}(t_0) = \rho(t_0)$, and
$\rho_{\mathrm{red}}(t_i) = \frac{X_i(A_i)^{\frac12}\, \rho(t_i)\, X_i(A_i)^{\frac12}}{\mathrm{Tr}[X_i(A_i)\, \rho(t_i)]},$
for $i \in [n-1]$.
Therefore, given an inference set $I = ((X_1, D_1),\ldots,(X_n, D_n))$ and an input $\rho(t_0) \in S(H)$, the sequential joint probability within the inference set $I$ that the event $X_1(A_1)$ occurs at $t_1$, $X_2(A_2)$ occurs at $t_2$, …, and $X_n(A_n)$ occurs at $t_n$, where $A_i \in D_i$ for $i \in [n]$ and $t_0 < t_1 < t_2 < \cdots < t_n$, is given by
$P_{\rho(t_0), I}(A_1 \times \cdots \times A_n) = \mathrm{Tr}[X_1(A_1)\, \rho(t_1)]\, \mathrm{Tr}[X_2(A_2)\, \rho(t_2)] \cdots \mathrm{Tr}[X_n(A_n)\, \rho(t_n)],$
where $\rho(t_i) = \mathcal{E}(t_i, t_0)\rho_{\mathrm{red}}(t_{i-1})$ and
$\mathcal{E}(t_i, t_0)\rho_{\mathrm{red}}(t_{i-1}) = \sum_m K_m(t_i, t_0)\, \rho_{\mathrm{red}}(t_{i-1})\, K_m(t_i, t_0)^{\dagger},$
for $i \in [n]$, in the Schrödinger picture defined with respect to the fiducial time $t_0$.
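The following is a minimal numerical sketch of this sequential joint probability for a single qubit: a toy amplitude-damping channel plays the role of $\mathcal{E}(t_i, t_0)$ (the same channel is reused at every step for simplicity), and each measurement is a two-outcome projective POV measure. The parameter $\gamma$, the input state, and the chosen outcomes are illustrative assumptions.

```python
import numpy as np

gamma = 0.3                                  # toy damping rate
K0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1 - gamma)]])
K1 = np.array([[0.0, np.sqrt(gamma)], [0.0, 0.0]])
kraus = [K0, K1]
assert np.allclose(sum(K.conj().T @ K for K in kraus), np.eye(2))   # sum_m K_m^dagger K_m = I

P0 = np.diag([1.0, 0.0])                     # effect X_i({0})
P1 = np.diag([0.0, 1.0])                     # effect X_i({1})

def evolve(rho):
    """Schrodinger-picture quantum operation: rho -> sum_m K_m rho K_m^dagger."""
    return sum(K @ rho @ K.conj().T for K in kraus)

def sequential_probability(rho0, effects):
    """Product of Tr[X_i(A_i) rho(t_i)] with von Neumann-Lueders reduction after each step."""
    prob, rho_red = 1.0, rho0
    for E in effects:
        rho_t = evolve(rho_red)              # rho(t_i) applied to the previous reduced state
        p = float(np.trace(E @ rho_t).real)
        prob *= p
        sqrtE = np.sqrt(E)                   # valid here because each effect is diagonal
        rho_red = sqrtE @ rho_t @ sqrtE / p  # reduced state for the next step
    return prob

rho0 = np.diag([0.5, 0.5])                   # maximally mixed input state
print(sequential_probability(rho0, [P1, P0]))   # P(outcome 1 at t_1, outcome 0 at t_2)
```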
Proposition 2. 
Let $H$ be a separable complex Hilbert space, and let $(\Omega, \mathcal{F})$ be a measurable space. The physical model associated with $(E(H), \Omega, \mathcal{F})$ defined by
$M = \bigcup_{n \in \mathbb{N}} \bigl\{ P_{\rho, I} \in P(\Omega^{(n)}, \sigma(I)) : \rho \in S(H),\ I \in \mathcal{I}^{(n)} \bigr\},$
where the $P_{\rho, I}$ are given by (49), satisfies the axioms in Definition 3.
Proof. 
For $I_1 = (X, D), I_2 = (Y, J) \in \mathcal{I}^{(1)}$ and $A \in \sigma(D)$, $B \in \sigma(J)$, by (49) we have
$P_{\rho, I_1}(A) = \mathrm{Tr}[\rho X(A)], \qquad P_{\rho, I_2}(B) = \mathrm{Tr}[\rho Y(B)].$
If $P_{\rho, I_1}(A) = P_{\rho, I_2}(B)$ for all $\rho \in S(H)$, then $X(A) = Y(B)$; i.e., the axiom (P1) holds.
For $I = ((X_1, D_1),\ldots,(X_n, D_n)) \in \mathcal{I}^*$ and $I_1 = (Y, J_1)$, if $A_i \in D_i$ for $i \in [n]$ and $B \in J_1$, by (49) we have
$P_{\rho, I \circ I_1}(A_1 \times \cdots \times A_n \times B) = \mathrm{Tr}[X_1(A_1)\, \rho(t_1)] \cdots \mathrm{Tr}[X_n(A_n)\, \rho(t_n)]\, \mathrm{Tr}[Y(B)\, \rho(t_{n+1})],$
where $t_0 < t_1 < \cdots < t_n < t_{n+1}$, $\rho(t_0) = \rho$, $\rho(t_i) = \mathcal{E}(t_i, t_0)\rho_{\mathrm{red}}(t_{i-1})$, $\rho_{\mathrm{red}}(t_0) = \rho(t_0)$,
$\rho_{\mathrm{red}}(t_i) = \frac{X_i(A_i)^{\frac12}\, \rho(t_i)\, X_i(A_i)^{\frac12}}{\mathrm{Tr}[X_i(A_i)\, \rho(t_i)]},$
for $i \in [n]$, and
$\rho(t_i) = \sum_m K_m(t_i, t_0)\, \rho_{\mathrm{red}}(t_{i-1})\, K_m(t_i, t_0)^{\dagger},$
for $i \in [n+1]$. Also, for $I_2 = (Y, J_2)$ and $B \in J_2$, by (49) we have
$P_{\rho, I \circ I_2}(A_1 \times \cdots \times A_n \times B) = \mathrm{Tr}[X_1(A_1)\, \rho(t_1)] \cdots \mathrm{Tr}[X_n(A_n)\, \rho(t_n)]\, \mathrm{Tr}[Y(B)\, \rho(t_{n+1})].$
Hence, we have
$P_{\rho, I \circ I_1}(A_1 \times \cdots \times A_n \times B) = P_{\rho, I \circ I_2}(A_1 \times \cdots \times A_n \times B)$
for $B \in \sigma(J_1) \cap \sigma(J_2)$. Since $\sigma(I)$ is generated by the $D_i$'s, this establishes the axiom (P2). Similarly, we can check the axioms (P3) and (P4); we omit the details. □
Remark 2. 
Note that the probabilities $P_{\rho(t_0), I}$ are determined by the time-evolution operations $\mathcal{E}(t,s)$. Therefore, a family of discrete-time evolution operations $\{\mathcal{E}(t_i, s)\}_{i=1}^{n}$ defines a physical model realizing a generative AI system, based on the mathematical formalism in Definition 3 for models of generative AI systems.

5. Large Language Models

In this section, we describe physical models for large language models based on a transformer architecture in the Fock space over the Hilbert space of tokens. Consider a large language model $S$ with a set $T$ of $N$ tokens. A finite sequence $\{x_i\}_{i=1}^{n}$ of tokens is called a text for $S$, simply denoted by $\mathbf{T} = x_1 x_2 \cdots x_n$ or $(x_1, x_2, \ldots, x_n)$, where $n$ is called the length of the text $\mathbf{T}$.
Let $h$ be the Hilbert space with $\{|x\rangle : x \in T\}$ as an orthonormal basis, and identify $x = |x\rangle$ for $x \in T$. Let $H = \mathcal{F}(h)$ be the Fock space over $h$, that is,
$\mathcal{F}(h) = \mathbb{C} \oplus \bigoplus_{n=1}^{\infty} h^{\otimes n},$
where $h^{\otimes n}$ is the $n$-fold tensor product of $h$. We refer to [16] for the details of Fock spaces. In what follows, for the sake of convenience, we work with the finite Fock space
$H = \mathcal{F}^{(M)}(h) = \mathbb{C} \oplus \bigoplus_{n=1}^{M} h^{\otimes n},$
for a large integer $M \in \mathbb{N}$. Note that an operator $A^{(n)} = A_1 \otimes \cdots \otimes A_n \in L(h^{\otimes n})$ with $A_j \in L(h)$ satisfies, for all $h^{(n)} = h_1 \otimes \cdots \otimes h_n \in h^{\otimes n}$,
$A^{(n)} h^{(n)} = (A_1 h_1) \otimes \cdots \otimes (A_n h_n) \in h^{\otimes n},$
and in particular, if $\rho_i \in S(h)$ for $i \in [n]$, then $\rho^{(n)} = \rho_1 \otimes \cdots \otimes \rho_n \in S(h^{\otimes n})$. Given $\alpha \in \mathbb{C}$ and a sequence $A^{(n)} \in L(h^{\otimes n})$ for $n \in [M]$, the operator $\mathrm{diag}(\alpha, A^{(1)}, \ldots, A^{(M)}) \in L(H)$ is defined by
$\mathrm{diag}(\alpha, A^{(1)}, \ldots, A^{(M)})\, h^{(M)} = (\alpha c, A^{(1)} h^{(1)}, \ldots, A^{(M)} h^{(M)}),$
for every $h^{(M)} = (c, h^{(1)}, \ldots, h^{(M)}) \in H$. In particular, if $\rho^{(n)} \in S(h^{\otimes n})$, then
$\rho^{(M)} = \mathrm{diag}(0, 0^{(1)}, \ldots, 0^{(n-1)}, \rho^{(n)}, 0^{(n+1)}, \ldots, 0^{(M)}) \in S(H),$
where $0^{(i)}$ denotes the zero operator on $h^{\otimes i}$ for $i \ge 1$.
Since large language models are based on a transformer architecture, it suffices to construct a physical model in the Fock space $H = \mathcal{F}^{(M)}(h)$ ($M \ge L$) for a transformer $\mathrm{Transf}_L$ (24) given by a composition of $L$ blocks, consisting of $L$ self-attention maps $\{\mathrm{SelfAtt}_\ell\}_{\ell=1}^{L}$ and $L$ feed-forward neural networks $\{\mathrm{FFN}_\ell\}_{\ell=1}^{L}$. Precisely, let us denote the input text to the $\ell$-th block by $\mathbf{T}_\ell = \{x_i\}_{i=1}^{n+\ell-1}$. As noted above,
$\mathrm{FFN}_\ell \circ \mathrm{SelfAtt}_\ell(\mathbf{T}_\ell) = \sum_{i=1}^{n+\ell-1} \mathrm{softmax}(S_\ell^{(n+\ell-1)})_i\, \mathrm{FFN}_\ell(W_\ell^V x_i),$
where $S_\ell^{(n+\ell-1)} = \{s_i^{(\ell)}\}_{i=1}^{n+\ell-1}$ and
$s_i^{(\ell)} = \frac{1}{\sqrt{d}} \langle W_\ell^Q x_{n+\ell-1}, W_\ell^K x_i \rangle, \quad i \in [n+\ell-1].$
Then, a physical model for $\mathrm{Transf}_L$ consists of an input $\rho(t_0)$ and a sequence of quantum operations $\{\mathcal{E}(t_\ell, t_0)\}_{\ell=1}^{L}$ in the Fock space $H$ defined above, where $t_0 < t_1 < \cdots < t_L$. We show how to construct this model step by step as follows.
To this end, we set $\Omega = \{\emptyset\} \cup T$ and write $D = (\{\omega\} : \omega \in \Omega)$. At first, the input state $\rho_{\mathbf{T}}$ is given as
$\rho_{\mathbf{T}} = \rho(t_0) = \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(n-1)}, |x_1\rangle\langle x_1| \otimes \cdots \otimes |x_n\rangle\langle x_n|, 0^{(n+1)}, \ldots\bigr) \in S(H).$
Then there is a physical operation $\mathcal{E}(t_1, t_0)$ in $H$ (see Proposition 3 below), depending only on the attention mechanism $(W_1^Q, W_1^K, W_1^V)$ and $\mathrm{FFN}_1$, such that
$\mathcal{E}(t_1, t_0)\rho(t_0) = \sum_{i_1=1}^{n} \mathrm{softmax}(S_1^{(n)})_{i_1}\, \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(n)}, |x_1\rangle\langle x_1| \otimes \cdots \otimes |x_n\rangle\langle x_n| \otimes |y_{i_1}^{(1)}\rangle\langle y_{i_1}^{(1)}|, 0^{(n+2)}, \ldots\bigr),$
where $y_{i_1}^{(1)} = \mathrm{FFN}_1(W_1^V x_{i_1})$ and $\{y_{i_1}^{(1)}\}_{i_1=1}^{n} \subset \{|x\rangle : x \in T\}$. Define $X_1: 2^{\Omega} \to E(H)$ by
$X_1(\{\emptyset\}) = \mathrm{diag}\bigl(1, I_h, \ldots, I_{h^{\otimes n}}, 0^{(n+1)}, I_{h^{\otimes (n+2)}}, \ldots\bigr),$
and for every $x \in T$,
$X_1(\{x\}) = \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(n)}, I_{h^{\otimes n}} \otimes |x\rangle\langle x|, 0^{(n+2)}, \ldots\bigr).$
Making a measurement $(X_1, D)$ at time $t_1$, we obtain an output $y_{i_1}^{(1)}$ with probability $\mathrm{softmax}(S_1^{(n)})_{i_1}$, and the appropriate density operator to use for any further calculation is
$\rho_{\mathrm{red}}(t_1)_{i_1} = \frac{E_{i_1}^{(1)}\, \rho(t_1)\, E_{i_1}^{(1)}}{\mathrm{Tr}[E_{i_1}^{(1)}\, \rho(t_1)]} = \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(n)}, |x_1\rangle\langle x_1| \otimes \cdots \otimes |x_n\rangle\langle x_n| \otimes |y_{i_1}^{(1)}\rangle\langle y_{i_1}^{(1)}|, 0^{(n+2)}, \ldots\bigr),$
for every $i_1 \in [n]$, where $\rho(t_1) = \mathcal{E}(t_1, t_0)\rho(t_0)$, and
$E_{i_1}^{(1)} = \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(n)}, I_{h^{\otimes n}} \otimes |y_{i_1}^{(1)}\rangle\langle y_{i_1}^{(1)}|, 0^{(n+2)}, \ldots\bigr).$
Next, there is a physical operation $\mathcal{E}(t_2, t_0)$ in $H$ (see Proposition 3 again), depending only on the attention mechanism $(W_2^Q, W_2^K, W_2^V)$ and $\mathrm{FFN}_2$ at time $t_2$, such that
$\mathcal{E}(t_2, t_0)\rho_{\mathrm{red}}(t_1)_{i_1} = \sum_{i_2=1}^{n+1} \mathrm{softmax}(S_2^{(n+1)})_{i_2}\, \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(n+1)}, |x_1\rangle\langle x_1| \otimes \cdots \otimes |x_n\rangle\langle x_n| \otimes |y_{i_1}^{(1)}\rangle\langle y_{i_1}^{(1)}| \otimes |y_{i_2}^{(2)}\rangle\langle y_{i_2}^{(2)}|, 0^{(n+3)}, \ldots\bigr)$
for $i_1 \in [n]$, where $y_{i_2}^{(2)} = \mathrm{FFN}_2(W_2^V x_{i_2})$ (with $x_{n+1} = y_{i_1}^{(1)}$) and $\{y_i^{(2)}\}_{i=1}^{n+1} \subset \{|x\rangle : x \in T\}$. Define $X_2: 2^{\Omega} \to E(H)$ by
$X_2(\{\emptyset\}) = \mathrm{diag}\bigl(1, I_h, \ldots, I_{h^{\otimes (n+1)}}, 0^{(n+2)}, I_{h^{\otimes (n+3)}}, \ldots\bigr),$
and for every $x \in T$,
$X_2(\{x\}) = \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(n+1)}, I_{h^{\otimes (n+1)}} \otimes |x\rangle\langle x|, 0^{(n+3)}, \ldots\bigr).$
Making a measurement $(X_2, D)$ at time $t_2$, we obtain an output $y_{i_2}^{(2)}$ with probability $\mathrm{softmax}(S_2^{(n+1)})_{i_2}$, and the appropriate density operator to use for any further calculation is
$\rho_{\mathrm{red}}(t_2)_{i_1, i_2} = \frac{E_{i_2}^{(2)}\, \rho(t_2)_{i_1}\, E_{i_2}^{(2)}}{\mathrm{Tr}[E_{i_2}^{(2)}\, \rho(t_2)_{i_1}]} = \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(n+1)}, |x_1\rangle\langle x_1| \otimes \cdots \otimes |x_n\rangle\langle x_n| \otimes |y_{i_1}^{(1)}\rangle\langle y_{i_1}^{(1)}| \otimes |y_{i_2}^{(2)}\rangle\langle y_{i_2}^{(2)}|, 0^{(n+3)}, \ldots\bigr),$
for each $i_2 \in [n+1]$, where $\rho(t_2)_{i_1} = \mathcal{E}(t_2, t_0)\rho_{\mathrm{red}}(t_1)_{i_1}$ and
$E_{i_2}^{(2)} = \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(n+1)}, I_{h^{\otimes (n+1)}} \otimes |y_{i_2}^{(2)}\rangle\langle y_{i_2}^{(2)}|, 0^{(n+3)}, \ldots\bigr).$
Step by step, we obtain a physical model $\{\mathcal{E}(t_\ell, t_0)\}_{\ell=1}^{L}$ with the input state $\rho(t_0)$ such that a text $(y_{i_1}^{(1)}, y_{i_2}^{(2)}, \ldots, y_{i_L}^{(L)})$ is generated with the probability
$P_{\mathbf{T}}(y_{i_1}^{(1)}, y_{i_2}^{(2)}, \ldots, y_{i_L}^{(L)}) = \mathrm{softmax}(S_1^{(n)})_{i_1} \cdots \mathrm{softmax}(S_L^{(n+L-1)})_{i_L},$
within the inference set $((X_1, D), \ldots, (X_L, D))$.
Thus, we obtain a physical model for $\mathrm{Transf}_L$ once we prove that the $\mathcal{E}(t_\ell, t_0)$ exist.
Proposition 3. 
With the above notations, there exists a physical model $\{\mathcal{E}(t_\ell, t_0)\}_{\ell=1}^{L}$ in $H = \mathcal{F}^{(M)}(h)$ ($M \ge L$) for a transformer $\mathrm{Transf}_L$ (24) such that, given an input text $\mathbf{T} = \{x_i\}_{i=1}^{n}$, a text $(y_{i_1}^{(1)}, y_{i_2}^{(2)}, \ldots, y_{i_L}^{(L)})$ is generated with the probability
$P_{\mathbf{T}}(y_{i_1}^{(1)}, y_{i_2}^{(2)}, \ldots, y_{i_L}^{(L)}) = \mathrm{softmax}(S_1^{(n)})_{i_1} \cdots \mathrm{softmax}(S_L^{(n+L-1)})_{i_L},$
within the inference set $((X_1, D), \ldots, (X_L, D))$.
Proof. 
We regard $1$, $|x\rangle\langle x|$, and $|x_1\rangle\langle x_1| \otimes \cdots \otimes |x_n\rangle\langle x_n|$ as elements of $L(H)$ in a natural way, i.e.,
$1 \cong \mathrm{diag}(1, 0^{(1)}, 0^{(2)}, \ldots), \quad |x\rangle\langle x| \cong \mathrm{diag}(0, |x\rangle\langle x|, 0^{(2)}, \ldots), \quad |x_1\rangle\langle x_1| \otimes \cdots \otimes |x_n\rangle\langle x_n| \cong \mathrm{diag}(0, 0^{(1)}, \ldots, 0^{(n-1)}, |x_1\rangle\langle x_1| \otimes \cdots \otimes |x_n\rangle\langle x_n|, 0^{(n+1)}, \ldots),$
for $n \ge 1$. We need to construct $\mathcal{E}(t_1, t_0)$ so as to satisfy (60). We first define
$\Phi(1) = |x_0\rangle\langle x_0|,$
where $x_0 \in T$ is a certain token. Secondly, define
$\Phi(|x\rangle\langle x|) = \mathrm{diag}\bigl(0, 0^{(1)}, |x\rangle\langle x| \otimes |\mathrm{FFN}_1(W_1^V x)\rangle\langle \mathrm{FFN}_1(W_1^V x)|, 0^{(3)}, \ldots\bigr), \quad x \in T,$
and in general, for $n \in [L]$, define
$\Phi(|x_1\rangle\langle x_1| \otimes \cdots \otimes |x_n\rangle\langle x_n|) = \sum_{i=1}^{n} \mathrm{softmax}(S_1^{(n)})_i\, \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(n)}, |x_1\rangle\langle x_1| \otimes \cdots \otimes |x_n\rangle\langle x_n| \otimes |y_i^{(1)}\rangle\langle y_i^{(1)}|, 0^{(n+2)}, \ldots\bigr),$
for any $x_i \in T$, $i \in [n]$. Let
$\mathcal{S} = \mathrm{span}\bigl\{1, |x_1\rangle\langle x_1| \otimes \cdots \otimes |x_\ell\rangle\langle x_\ell| : x_i \in T, i \in [\ell];\ \ell = 1, \ldots, L\bigr\}.$
Then $\Phi$ extends uniquely to a positive map $\mathcal{E}_{\Phi}$ from $\mathcal{S}$ into $L(H)$, that is,
$\mathcal{E}_{\Phi}\Bigl(a_0 + \sum_{n \ge 1} \sum_{x_1, \ldots, x_n \in T} a_{x_1, \ldots, x_n} |x_1\rangle\langle x_1| \otimes \cdots \otimes |x_n\rangle\langle x_n|\Bigr) = a_0 |x_0\rangle\langle x_0| + \sum_{n \ge 1} \sum_{x_1, \ldots, x_n \in T} a_{x_1, \ldots, x_n} \Phi(|x_1\rangle\langle x_1| \otimes \cdots \otimes |x_n\rangle\langle x_n|),$
where $a_0, a_{x_1, \ldots, x_n}$ are arbitrary complex numbers for $n \ge 1$. Since $\mathcal{S}$ is a commutative $C^*$-algebra, by Stinespring’s theorem (cf. (Theorem 3.11, [20])), it follows that $\mathcal{E}_{\Phi}: \mathcal{S} \to L(H)$ is completely positive. Hence, by Arveson’s extension theorem (cf. (Theorem 7.5, [20])), $\mathcal{E}_{\Phi}$ extends to a completely positive operator $\mathcal{E}(t_1, t_0)$ on $L(H)$ (note that $\mathcal{E}(t_1, t_0)$ is not necessarily unique), i.e., a quantum operation in $H$. By construction, $\mathcal{E}(t_1, t_0)$ satisfies (60). Also, by Kraus’s theorem (cf. [19]), we conclude that $\mathcal{E}(t_1, t_0)$ has a Kraus decomposition of the form (39).
In the same way, we can prove that $\mathcal{E}(t_2, t_0)$ exists and satisfies (65). Step by step, we thus obtain a physical model $\{\mathcal{E}(t_\ell, t_0)\}_{\ell=1}^{L}$ as required. □
Remark 3. 
A physical model for the transformer with multi-headed attention (21) can be constructed in a similar way. Also, we can construct physical models for the transformer of the form (26), and even for transformers with more complex structures (cf. [21] and references therein). We omit the details.
Physical models satisfying the above joint probability distributions associated with a transformer $\mathrm{Transf}_L$ are not necessarily unique. However, a physical model $\{\mathcal{E}(t_\ell, t_0)\}_{\ell=1}^{L}$ uniquely determines the joint probability distributions; that is, it defines a unique physical process for operating the large language model based on $\mathrm{Transf}_L$. Therefore, in a physical model $\{\mathcal{E}(t_\ell, t_0)\}_{\ell=1}^{L}$ for $\mathrm{Transf}_L$, training $\mathrm{Transf}_L$ corresponds to training the Kraus operators $\mathcal{E}(t_\ell, t_0) = \{K_j^{(\ell)}(t_\ell, t_0)\}$, which are adjustable and learned during the training process and determine the physical model, in correspondence with the parameters $W_\ell^Q$, $W_\ell^K$, and $W_\ell^V$ in $\mathrm{Transf}_L$. From a physical perspective, training a large language model is just determining the Kraus operators $\mathcal{E}(t_\ell, t_0) = \{K_j^{(\ell)}(t_\ell, t_0)\}$ associated with the corresponding physical system (cf. [22]).
Example 1. 
Let $T = \{e_0, e_1\}$ be the set of two tokens embedded in $\mathbb{R}^2$ such that $e_0 = (1, 0)$ and $e_1 = (0, 1)$. Then, $h = \mathbb{C}^2$ with the standard basis $|0\rangle = |e_0\rangle$ and $|1\rangle = |e_1\rangle$. Let
$H = \mathcal{F}^{(3)}(\mathbb{C}^2) = \mathbb{C} \oplus \mathbb{C}^2 \oplus [\mathbb{C}^2 \otimes \mathbb{C}^2] \oplus [\mathbb{C}^2 \otimes \mathbb{C}^2 \otimes \mathbb{C}^2].$
Suppose that $W^Q = W^K = \mathrm{FFN} = I$ in $\mathbb{R}^2$, and let $W^V = \sigma_x$, i.e., $W^V e_0 = e_1$ and $W^V e_1 = e_0$. Below, we construct a quantum operation $\mathcal{E}$ associated with $\mathrm{SelfAtt} = (I, I, \sigma_x)$ and $\mathrm{FFN} = I$ in $\mathbb{R}^2$. To this end, define $\Phi(1) = |0\rangle\langle 0|$,
$\Phi(|0\rangle\langle 0|) = |0\rangle\langle 0| \otimes |W^V e_0\rangle\langle W^V e_0| = |0\rangle\langle 0| \otimes |1\rangle\langle 1|, \qquad \Phi(|1\rangle\langle 1|) = |1\rangle\langle 1| \otimes |W^V e_1\rangle\langle W^V e_1| = |1\rangle\langle 1| \otimes |0\rangle\langle 0|;$
and
$\Phi(|0\rangle\langle 0| \otimes |0\rangle\langle 0|) = |0\rangle\langle 0| \otimes |0\rangle\langle 0| \otimes |1\rangle\langle 1|, \qquad \Phi(|0\rangle\langle 0| \otimes |1\rangle\langle 1|) = \tfrac{e}{1+e}\, |0\rangle\langle 0| \otimes |1\rangle\langle 1| \otimes |0\rangle\langle 0| + \tfrac{1}{1+e}\, |0\rangle\langle 0| \otimes |1\rangle\langle 1| \otimes |1\rangle\langle 1|, \qquad \Phi(|1\rangle\langle 1| \otimes |0\rangle\langle 0|) = \tfrac{1}{1+e}\, |1\rangle\langle 1| \otimes |0\rangle\langle 0| \otimes |0\rangle\langle 0| + \tfrac{e}{1+e}\, |1\rangle\langle 1| \otimes |0\rangle\langle 0| \otimes |1\rangle\langle 1|, \qquad \Phi(|1\rangle\langle 1| \otimes |1\rangle\langle 1|) = |1\rangle\langle 1| \otimes |1\rangle\langle 1| \otimes |0\rangle\langle 0|.$
We regard $1$, $|e_i\rangle\langle e_i|$, and $|e_j\rangle\langle e_j| \otimes |e_k\rangle\langle e_k|$ ($i, j, k = 0, 1$) as elements of $L(\mathcal{F}^{(3)}(\mathbb{C}^2))$ in a natural way. Let
$\mathcal{S} = \mathrm{span}\bigl\{1, |e_i\rangle\langle e_i|, |e_j\rangle\langle e_j| \otimes |e_k\rangle\langle e_k| : i, j, k = 0, 1\bigr\}.$
Then $\mathcal{S}$ is a subspace of $L(\mathcal{F}^{(3)}(\mathbb{C}^2))$, and $\Phi$ extends uniquely to a positive map $\mathcal{E}$ from $\mathcal{S}$ into $L(\mathcal{F}^{(3)}(\mathbb{C}^2))$, i.e.,
$\mathcal{E}\Bigl(a + \sum_{i=0,1} b_i |e_i\rangle\langle e_i| + \sum_{j,k=0,1} c_{j,k} |e_j\rangle\langle e_j| \otimes |e_k\rangle\langle e_k|\Bigr) = a |0\rangle\langle 0| + \sum_{i=0,1} b_i \Phi(|e_i\rangle\langle e_i|) + \sum_{j,k=0,1} c_{j,k} \Phi(|e_j\rangle\langle e_j| \otimes |e_k\rangle\langle e_k|),$
for any $a, b_i, c_{j,k} \in \mathbb{C}$. As shown in Proposition 3, $\mathcal{E}$ extends to a completely positive operator on $L(\mathcal{F}^{(3)}(\mathbb{C}^2))$, which is a quantum operation in $H = \mathcal{F}^{(3)}(\mathbb{C}^2)$ associated with $\mathrm{SelfAtt} = (I, I, \sigma_x)$ and $\mathrm{FFN} = I$ in $\mathbb{R}^2$. Note that $\mathcal{E}$ is not necessarily unique.
Example 2. 
As in Example 1, $T = \{x_0, x_1\}$ is the set of two tokens embedded in $\mathbb{R}^2$ such that $x_0 = (1, 0)$ and $x_1 = (0, 1)$. Then $h = \mathbb{C}^2$ with the standard basis $|0\rangle = |x_0\rangle$ and $|1\rangle = |x_1\rangle$. Let $H = \mathcal{F}^{(6)}(\mathbb{C}^2)$. Assume an input text $\mathbf{T} = (x_0, x_1, x_0)$. The input state $\rho_{\mathbf{T}}$ is then given by
$\rho_{\mathbf{T}} = \rho(t_0) = \mathrm{diag}\bigl(0, 0^{(1)}, 0^{(2)}, |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1| \otimes |x_0\rangle\langle x_0|, 0^{(4)}, 0^{(5)}, 0^{(6)}\bigr).$
If $W_1^Q = W_1^K = \mathrm{FFN}_1 = I$ and $W_1^V = \sigma_x$ in $\mathbb{R}^2$, an associated physical operation $\mathcal{E}(t_1, t_0)$ at time $t_1$ satisfies
$\mathcal{E}(t_1, t_0)\rho(t_0) = \tfrac{1}{2e+1}\, \mathrm{diag}\bigl(0, 0^{(1)}, 0^{(2)}, 0^{(3)}, |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1| \otimes |x_0\rangle\langle x_0| \otimes |x_0\rangle\langle x_0|, 0^{(5)}, 0^{(6)}\bigr) + \tfrac{2e}{2e+1}\, \mathrm{diag}\bigl(0, 0^{(1)}, 0^{(2)}, 0^{(3)}, |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1| \otimes |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1|, 0^{(5)}, 0^{(6)}\bigr).$
By measurement, we obtain $x_0$ with probability $\tfrac{1}{2e+1}$ and $x_1$ with probability $\tfrac{2e}{2e+1}$, while
$\rho_{\mathrm{red}}(t_1)_0 = \mathrm{diag}\bigl(0, 0^{(1)}, 0^{(2)}, 0^{(3)}, |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1| \otimes |x_0\rangle\langle x_0| \otimes |x_0\rangle\langle x_0|, 0^{(5)}, 0^{(6)}\bigr), \qquad \rho_{\mathrm{red}}(t_1)_1 = \mathrm{diag}\bigl(0, 0^{(1)}, 0^{(2)}, 0^{(3)}, |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1| \otimes |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1|, 0^{(5)}, 0^{(6)}\bigr).$
If $W_2^Q = W_2^V = \mathrm{FFN}_2 = I$ and $W_2^K = \sigma_x$ in $\mathbb{R}^2$, an associated quantum operation $\mathcal{E}(t_2, t_0)$ at time $t_2$ satisfies
$\mathcal{E}(t_2, t_0)\rho_{\mathrm{red}}(t_1)_0 = \tfrac{3}{e+3}\, \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(4)}, |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1| \otimes |x_0\rangle\langle x_0| \otimes |x_0\rangle\langle x_0| \otimes |x_0\rangle\langle x_0|, 0^{(6)}\bigr) + \tfrac{e}{e+3}\, \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(4)}, |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1| \otimes |x_0\rangle\langle x_0| \otimes |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1|, 0^{(6)}\bigr),$
and
$\mathcal{E}(t_2, t_0)\rho_{\mathrm{red}}(t_1)_1 = \tfrac{e}{e+1}\, \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(4)}, |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1| \otimes |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1| \otimes |x_0\rangle\langle x_0|, 0^{(6)}\bigr) + \tfrac{1}{e+1}\, \mathrm{diag}\bigl(0, 0^{(1)}, \ldots, 0^{(4)}, |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1| \otimes |x_0\rangle\langle x_0| \otimes |x_1\rangle\langle x_1| \otimes |x_1\rangle\langle x_1|, 0^{(6)}\bigr).$
By measurement at time $t_2$: when $x_0$ occurred at $t_1$, we obtain $x_0$ with probability $\tfrac{3}{e+3}$ and $x_1$ with probability $\tfrac{e}{e+3}$; when $x_1$ occurred at $t_1$, we obtain $x_0$ with probability $\tfrac{e}{e+1}$ and $x_1$ with probability $\tfrac{1}{e+1}$.
Therefore, we obtain the joint probability distribution:
$P_{\mathbf{T}}(x_0, x_0) = \tfrac{1}{2e+1} \cdot \tfrac{3}{e+3} = \tfrac{3}{(2e+1)(e+3)}, \qquad P_{\mathbf{T}}(x_0, x_1) = \tfrac{1}{2e+1} \cdot \tfrac{e}{e+3} = \tfrac{e}{(2e+1)(e+3)}, \qquad P_{\mathbf{T}}(x_1, x_0) = \tfrac{2e}{2e+1} \cdot \tfrac{e}{e+1} = \tfrac{2e^2}{(2e+1)(e+1)}, \qquad P_{\mathbf{T}}(x_1, x_1) = \tfrac{2e}{2e+1} \cdot \tfrac{1}{e+1} = \tfrac{2e}{(2e+1)(e+1)}.$
This can be illustrated as follows:
[Scheme: the two-step branching of generated tokens $x_0, x_1$ with the above probabilities.]
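As a sanity check, the following sketch reproduces Example 2's joint distribution by treating the two blocks as classical sequential sampling. As the example's coefficients suggest, the similarity is taken here as the raw inner product $\langle W^Q q, W^K k \rangle$ without a $1/\sqrt{d}$ factor; this is an assumption made to match the stated fractions.

```python
import numpy as np
from math import e

x0, x1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
sx = np.array([[0.0, 1.0], [1.0, 0.0]])                        # sigma_x

def step_probs(text, WQ, WK, WV):
    """Return (P(next = x0), P(next = x1)) for one attention block."""
    q = WQ @ text[-1]                                          # query from the last token
    w = np.exp([q @ (WK @ x) for x in text])                   # unnormalized attention weights
    w /= w.sum()
    p0 = sum(wi for wi, x in zip(w, text) if np.allclose(WV @ x, x0))
    return p0, 1.0 - p0

T = [x0, x1, x0]
p0_1, p1_1 = step_probs(T, np.eye(2), np.eye(2), sx)           # block 1: W1^V = sigma_x
p0_2a, _ = step_probs(T + [x0], np.eye(2), sx, np.eye(2))      # block 2, after x0: W2^K = sigma_x
p0_2b, _ = step_probs(T + [x1], np.eye(2), sx, np.eye(2))      # block 2, after x1
print(p0_1, 1 / (2 * e + 1))                                   # step-1 probability of x0
print(p0_1 * p0_2a, 3 / ((2 * e + 1) * (e + 3)))               # P_T(x0, x0)
print(p1_1 * p0_2b, 2 * e**2 / ((2 * e + 1) * (e + 1)))        # P_T(x1, x0)
```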

6. Conclusions

Our primary innovative points are summarized as follows:
  • Mathematical formalism for generative AI models. We present a mathematical formalism for generative AI models by using the theory of the history approach to physical systems developed by Isham and Gudder [1,2].
  • Physical models realizing generative AI systems. We give a construction of physical models realizing generative AI systems as open quantum systems by using the theories of σ -effect algebras and open quantum systems.
  • Large language models realized as open quantum systems. We construct physical models realizing large language models based on the transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens. The Fock space structure plays a crucial role in this construction.
In conclusion, we present a mathematical formalism for generative AI and describe physical models realizing generative AI systems as open quantum systems. Our formalism shows that a transformer architecture used for generative AI systems is characterized by a family of sequential joint probability distributions. The physical models realizing generative AI systems are described by sequential event histories in open quantum systems. The Kraus operators in the physical models correspond to the query, key, and value matrices in the attention mechanism of a transformer, which are adjustable and learned during the training process. As an illustration, we construct physical models in the Fock space over the Hilbert space of tokens, realizing large language models based on a transformer architecture as open quantum systems. This means that our physical models underlie the transformer architecture for large language models. We refer to [23] for an argument on the physical principle of generative AI and to [15] for a mathematical foundation of general AI, including quantum AI.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in this article.

Acknowledgments

The author thanks the anonymous referees for making helpful comments and suggestions, which have been incorporated into this version of the paper.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Gudder, S. A histories approach to quantum mechanics. J. Math. Phys. 1998, 39, 5772–5788.
  2. Isham, C.J. Quantum logic and the histories approach to quantum theory. J. Math. Phys. 1994, 35, 2157–2185.
  3. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
  5. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223.
  6. Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large language models: A survey. arXiv 2024, arXiv:2402.06196.
  7. Geshkovski, B.; Letrouit, C.; Polyanskiy, Y.; Rigollet, P. A mathematical perspective on transformers. Bull. Am. Math. Soc. 2025, in press.
  8. Vuckovic, J.; Baratin, A.; Combes, R.T. A mathematical theory of attention. arXiv 2020, arXiv:2007.02876.
  9. Breuer, H.P.; Petruccione, F. The Theory of Open Quantum Systems; Oxford University Press: Oxford, UK, 2002.
  10. Villas-Boas, C.J.; Máximo, C.E.; Paulino, P.J.; Bachelard, R.P.; Rempe, G. Bright and dark states of light: The quantum origin of classical interference. Phys. Rev. Lett. 2025, 134, 133603.
  11. Dvurečenskij, A.; Pulmannová, S. New Trends in Quantum Structures; Springer: Berlin/Heidelberg, Germany, 2000.
  12. Petersen, P.; Zech, J. Mathematical theory of deep learning. arXiv 2025, arXiv:2407.18384v3.
  13. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
  14. Foulis, D.J.; Bennett, M.K. Effect algebras and unsharp quantum logics. Found. Phys. 1994, 24, 1331–1352.
  15. Chen, Z.; Ding, L.; Liu, H.; Yu, J. A topos-theoretic formalism of quantum artificial intelligence. Sci. Sin. Math. 2025, 55, 1–32. (In Chinese)
  16. Reed, M.; Simon, B. Methods of Modern Mathematical Physics, Vol. I; Academic Press: San Diego, CA, USA, 1980.
  17. Reed, M.; Simon, B. Methods of Modern Mathematical Physics, Vol. II; Academic Press: Cambridge, UK, 1980.
  18. Rudin, W. Functional Analysis, 2nd ed.; The McGraw-Hill Companies, Inc.: New York, NY, USA, 1991.
  19. Nielsen, M.A.; Chuang, I.L. Quantum Computation and Quantum Information; Cambridge University Press: Cambridge, UK, 2001.
  20. Paulsen, V. Completely Bounded Maps and Operator Algebras; Cambridge University Press: Cambridge, UK, 2002.
  21. Zhang, Y.; Liu, Y.; Yuan, H.; Qin, Z.; Yuan, Y.; Gu, Q.; Yao, A.C. Tensor product attention is all you need. arXiv 2025, arXiv:2501.06425.
  22. Sharma, K.; Cerezo, M.; Cincio, L.; Coles, P.J. Trainability of dissipative perceptron-based quantum neural networks. Phys. Rev. Lett. 2022, 128, 180505.
  23. Chen, Z. Turing’s thinking machine and ’t Hooft’s principle of superposition of states. ChinaXiv 1207.