Abstract
This paper presents a mathematical formalism for generative artificial intelligence (GAI). Our starting point is an observation that a “histories” approach to physical systems agrees with the compositional nature of deep neural networks. Mathematically, we define a GAI system as a family of sequential joint probabilities associated with input texts and temporal sequences of tokens (as physical event histories). From a physical perspective on modern chips, we then construct physical models realizing GAI systems as open quantum systems. Finally, as an illustration, we construct physical models realizing large language models based on a transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens. Our physical models underlie the transformer architecture for large language models.
1. Introduction
Generative artificial intelligence (AI) models are important for modeling intelligent machines, as physically described in [1,2]. Generative AI is based on deep neural networks (DNNs for short), and a common characteristic of DNNs is their compositional nature (cf. [3]): data is processed sequentially, layer by layer, resulting in a discrete-time dynamical system. The introduction of the transformer architecture for generative AI in 2017 marked the most striking advance for DNNs (cf. [4]). Indeed, the transformer is a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output. At each step, the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next. The transformer has achieved great success in natural language processing (cf. [5]).
The transformer has a modular framework and is built from two main building blocks: self-attention and feed-forward neural networks. Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence. However, despite its meteoric rise within deep learning, we believe there is a gap in our theoretical understanding of what the transformer is and why it works physically (cf. [6]).
We think that there are two origins for the modularization framework of generative AI models. One is a mathematical origin, in which a joint probability distribution can be computed as a product of sequential conditional probabilities. For instance, the probability of generating a text $x_1 x_2 \cdots x_n$ given an input X in a transformer architecture is equal to the joint probability distribution such that
$$P(x_1, x_2, \ldots, x_n \mid X) = \prod_{\ell=1}^{n} P(x_\ell \mid X, x_1, \ldots, x_{\ell-1}),$$
where the conditional probability $P(x_\ell \mid X, x_1, \ldots, x_{\ell-1})$ is given by the ℓ-th attention block in the transformer. Another is a physical origin, in which a physical process is considered to be a sequence of events, i.e., a history. As such, generating a text given an input X in a physical machine is a process in which, given an input X at time $t_0$, an event $x_1$ occurs at time $t_1$, an event $x_2$ occurs at time $t_2$, …, and last, an event $x_n$ occurs at time $t_n$. A theory of the “histories” approach to physical systems was established by Isham [2], and the associated mathematical theory of joint probability distributions was then developed by Gudder [1]. Based on their theory, in this paper, we present a mathematical formalism for generative AI and describe the associated physical models.
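As a minimal sketch of this chain-rule factorization (in Python, with a hypothetical `cond_prob` table standing in for the distribution an attention block would compute), the following code generates tokens one at a time and multiplies the conditionals to obtain the joint probability of the generated continuation.

```python
import random

# Hypothetical conditional distribution P(next token | history); in a real
# transformer this table would be produced by the attention blocks.
def cond_prob(history):
    vocab = ["a", "b", "<eos>"]
    # Toy rule: favour repeating the last token, otherwise spread mass evenly.
    weights = [0.5 if t == history[-1] else 0.2 for t in vocab]
    total = sum(weights)
    return {t: w / total for t, w in zip(vocab, weights)}

def generate(prompt, max_len=5):
    history, joint = list(prompt), 1.0
    for _ in range(max_len):
        dist = cond_prob(history)
        token = random.choices(list(dist), weights=list(dist.values()))[0]
        joint *= dist[token]          # chain rule: multiply the conditionals
        history.append(token)
        if token == "<eos>":
            break
    return history, joint

print(generate(["a"]))
```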
To the best of our knowledge, physical models for generative AI are usually described by using systems of mean-field interacting particles (cf. [7,8] and references therein); i.e., generative AI models are regarded as classical statistical systems. However, since modern chips process data by controlling the flow of electric current, i.e., the dynamics of many electrons, they should be regarded as quantum statistical ensembles and open quantum systems from a physical perspective (cf. [9,10]). Consequently, based on our mathematical formalism for generative AI, we construct physical models realizing generative AI systems as open quantum systems. As an illustration, we construct physical models realizing large language models based on a transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens.
The paper is organized as follows. In Section 2, we collect some notation and definitions on the attention mechanism, the transformer, and effect algebras. In Section 3, we give the definition of a generative AI system as a family of sequential joint probabilities associated with input texts and temporal sequences of tokens. This is based on the mathematical theory developed by Gudder (cf. [1]) for a histories approach to physical evolution processes. These joint probabilities characterize the attention mechanisms as well as the mathematical structure of the transformer architecture. In Section 4, we present the construction of physical models realizing generative AI systems as open quantum systems. Our physical models are given by an event-history approach to physical systems; we refer to [2] for the physical background of this formulation. In Section 5, we construct physical models realizing large language models based on a transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens. Finally, in Section 6, we summarize our main contributions item by item and conclude the paper.
2. Preliminaries
In this section, we present a mathematical description of the attention mechanism and transformer architecture for generative AI and collect some notation and basic properties of $\sigma$-effect algebras (cf. [11]). For the sake of convenience, we fix the following notation. Denote by $\mathbb{N}$ the set of natural numbers, and for $n \in \mathbb{N}$ we use the notation $[n]$ to represent the set $\{1, 2, \ldots, n\}$. For $d \in \mathbb{N}$, we denote by $\mathbb{R}^d$ the d-dimensional Euclidean space with the usual inner product $\langle \cdot, \cdot \rangle$. For two sets X and Y, we denote by $\mathrm{Map}(X, Y)$ the set of all maps from X into Y. For a set S, we denote $S^{*} = \bigcup_{n \in \mathbb{N}} S^{n}$, where $S^{n}$ is the set of all sequences of n elements in S; i.e., $S^{*}$ is the set of all finite sequences of elements in S.
2.1. Deep Neural Networks
A DNN is constructed by connecting multiple neurons. Recall that a (feed-forward) neural network of depth L consists of some number of neurons arranged in $L+1$ layers. Layer 0 is the input layer, where data is presented to the network, while layer L is where the output is read out. All layers in between are referred to as the hidden layers, and each hidden layer has an activation, that is, a map acting within that layer. Specifically, let $(N_\ell)_{\ell=0}^{L}$ be a sequence of sets, where $N_\ell$ indexes the neurons in layer ℓ, and let $(V_\ell)_{\ell=0}^{L}$ be a sequence of vector spaces. A mapping $F: V_0 \to V_L$ is called a feed-forward neural network of depth L if there exist a sequence of maps $T_\ell: V_{\ell-1} \to V_\ell$ and a sequence of maps $\sigma_\ell: V_\ell \to V_\ell$, where $\sigma_\ell$ is called the activation function at the layer ℓ, such that
$$x^{(\ell)} = \sigma_\ell\big(T_\ell(x^{(\ell-1)})\big)$$
for $\ell = 1, \ldots, L$, where $x^{(0)} = x \in V_0$ is called the input and $F(x) = x^{(L)}$ is the output. We call $\big((T_\ell)_{\ell=1}^{L}, (\sigma_\ell)_{\ell=1}^{L}\big)$ the architecture of the neural network F. Of course, F is determined by its architecture, and there exist different choices of architectures yielding the same F.
In their most basic form, $N_\ell$ is a finite set of $d_\ell$ elements, $V_\ell = \mathbb{R}^{d_\ell}$, and a feed-forward neural network is a function $F: \mathbb{R}^{d_0} \to \mathbb{R}^{d_L}$ of the following form: the input is $x^{(0)} = x \in \mathbb{R}^{d_0}$, and
$$x^{(\ell)} = \sigma_\ell\big(T_\ell(x^{(\ell-1)})\big), \quad \ell = 1, \ldots, L,$$
where $F(x) = x^{(L)}$ is the output. This can be illustrated as follows:
$$x = x^{(0)} \longmapsto x^{(1)} \longmapsto \cdots \longmapsto x^{(L)} = F(x).$$
Here, the map $T_\ell$ is usually of the form
$$T_\ell(z) = W_\ell z + b_\ell,$$
where $W_\ell$ is a $d_\ell \times d_{\ell-1}$ matrix called a weight matrix and $b_\ell \in \mathbb{R}^{d_\ell}$ is called a bias vector for each $\ell \in [L]$, and the function $\sigma_\ell$ represents the activation function at the ℓ-th layer. The entries of the weight matrices and bias vectors of a neural network F are called the parameters of F. These parameters are adjustable and learned during the training process, determining the specific function realized by the network. Also, the depth L, the number of neurons in each layer, and the activation functions of a neural network F are called the hyperparameters of F. They define the network’s architecture (and training process) and are typically set before training begins. For a fixed architecture, every choice of network parameters as in (3) defines a specific function, and this function is often referred to as a model.
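To make the roles of the weights $W_\ell$, biases $b_\ell$, and activations $\sigma_\ell$ concrete, here is a minimal NumPy sketch of a depth-3 feed-forward network; the layer widths, the ReLU activation, and the choice to omit the activation on the output layer are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths d_0, ..., d_L for a network of depth L = 3 (illustrative).
widths = [4, 8, 8, 2]
# Parameters: weight matrices W_l of shape (d_l, d_{l-1}) and bias vectors b_l.
Ws = [rng.standard_normal((widths[l + 1], widths[l])) for l in range(3)]
bs = [rng.standard_normal(widths[l + 1]) for l in range(3)]

relu = lambda z: np.maximum(z, 0.0)          # activation function sigma_l

def feed_forward(x):
    """Compute x^(l) = sigma_l(W_l x^(l-1) + b_l) layer by layer."""
    for l, (W, b) in enumerate(zip(Ws, bs)):
        x = W @ x + b
        if l < len(Ws) - 1:                  # no activation on the output layer
            x = relu(x)
    return x

print(feed_forward(rng.standard_normal(4)))
```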
In a feed-forward neural network, the inputs to neurons in the ℓ-th layer are usually exclusively neurons from the $(\ell-1)$-th layer. However, residual neural networks (ResNets for short) allow skip connections; that is, information is allowed to skip layers in the sense that the neurons in layer ℓ may have $x^{(0)}, \ldots, x^{(\ell-1)}$ as their input (and not just $x^{(\ell-1)}$). In their most basic form, $x^{(0)} = x$ and
$$x^{(\ell)} = x^{(\ell-1)} + \sigma_\ell\big(W_\ell x^{(\ell-1)} + b_\ell\big), \quad \ell = 1, \ldots, L,$$
where $\sigma_\ell$ is a vector function, the $W_\ell$'s are matrices, and the $b_\ell$'s are vectors in $\mathbb{R}^{d_\ell}$. In contrast to feed-forward neural networks, recurrent neural networks (RNNs for short) allow information to flow backward, in the sense that $x^{(\ell)}, x^{(\ell+1)}, \ldots$ may serve as input for the neurons in layer ℓ, and not just $x^{(\ell-1)}$. We refer to [12] for more details, such as the training of a neural network.
2.2. Attention
The fundamental definition of attention was given by Bahdanau et al. in 2014. To describe the mathematical definition of attention, we denote by $\mathcal{Q}$ the query space, by $\mathcal{K}$ the key space, and by $\mathcal{V}$ the value space. We call an element $q \in \mathcal{Q}$ a query, an element $k \in \mathcal{K}$ a key, an element $v \in \mathcal{V}$ a value, and so on.
Definition 1
(cf. [13]). Let $s: \mathcal{Q} \times \mathcal{K} \to \mathbb{R}$ be a function. Let $K = \{k_1, \ldots, k_n\} \subset \mathcal{K}$ be a set of keys and $V = \{v_1, \ldots, v_n\} \subset \mathcal{V}$ a set of values. Given a query $q \in \mathcal{Q}$, the attention is defined by
$$\mathrm{Att}(q, K, V) = v_j \quad \text{with probability } p_j, \quad j \in [n],$$
where $(p_1, \ldots, p_n)$ is a probability distribution over $V$ defined by
$$p_j = \mathrm{softmax}_j\big(s(q, k_1), \ldots, s(q, k_n)\big), \quad j \in [n].$$
This means that a value $v_j$ in (6) occurs with probability $p_j$ for $j \in [n]$.
For a finite set of queries $Q = \{q_1, \ldots, q_m\} \subset \mathcal{Q}$, we define
$$\mathrm{Att}(Q, K, V) = \big(\mathrm{Att}(q_1, K, V), \ldots, \mathrm{Att}(q_m, K, V)\big).$$
In particular, when $Q = K = V = X$, $\mathrm{Att}(x, X, X)$ is said to be self-attention at $x \in X$, and the mapping defined by
$$\mathrm{SelfAtt}(X) = \mathrm{Att}(X, X, X)$$
is called the self-attention map.
We remark that
- (1)
- For a finite sequence of real numbers $(a_1, \ldots, a_n)$, define $\mathrm{softmax}_j(a_1, \ldots, a_n) = e^{a_j} / \sum_{i=1}^{n} e^{a_i}$ for $j \in [n]$. Then, $\sum_{j=1}^{n} \mathrm{softmax}_j(a_1, \ldots, a_n) = 1$, as usual in the literature.
- (2)
- We have $\mathbb{E}\big[\mathrm{Att}(q, K, V)\big] = \sum_{j=1}^{n} p_j v_j$, which is the deterministic attention output commonly used in the literature, but $\mathrm{Att}(q, K, V) \neq \mathbb{E}\big[\mathrm{Att}(q, K, V)\big]$ in general.
- (3)
- The function s is called a similarity function, usually given by $s(q, k) = \langle W^{Q} q, W^{K} k \rangle$, where $W^{Q}$ is a real matrix called a query matrix and $W^{K}$ is a real matrix called a key matrix. For $q \in \mathcal{Q}$ and $k \in \mathcal{K}$, the real number $s(q, k)$ is interpreted as the similarity between the query q and the key k.
- (4)
- In the representation learning framework of attention, we usually assume the finite set of tokens has been embedded in $\mathbb{R}^d$, where d is called the embedding dimension, so we identify each token with one of finitely many vectors x in $\mathbb{R}^d$. We assume that the structure (positional information, adjacency information, etc.) is encoded in these vectors. In the case of self-attention, we assume $\mathcal{Q} = \mathcal{K} = \mathcal{V} = \mathbb{R}^d$.
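The following NumPy sketch illustrates Definition 1 with the bilinear similarity of remark (3): it forms the softmax distribution over the keys and then samples one value (rather than returning the expectation); the matrices and token vectors are randomly generated placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                    # embedding dimension (illustrative)
W_Q, W_K = rng.standard_normal((d, d)), rng.standard_normal((d, d))

def softmax(a):
    e = np.exp(a - a.max())              # subtract max for numerical stability
    return e / e.sum()

def attention(q, keys, values):
    """Sample Att(q, K, V): value v_j is returned with probability p_j."""
    scores = np.array([(W_Q @ q) @ (W_K @ k) for k in keys])   # s(q, k_j)
    p = softmax(scores)
    j = rng.choice(len(values), p=p)
    return values[j], p

X = rng.standard_normal((5, d))          # five embedded tokens, used as q, k, v
value, p = attention(X[-1], X, X)        # self-attention at the last token
print(p, value)
```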
Since the self-attention mechanism can be composed to arbitrary depth, making it a crucial building block of the transformer architecture, we mainly focus on it in what follows. In practice, we need multi-headed attention (cf. [4]), which processes independent copies of the data X and combines them with concatenation and matrix multiplication. Let $X = \{x_1, \ldots, x_n\}$ be the input set of tokens embedded in $\mathbb{R}^d$. Let us consider H-headed attention with the dimension $d_H = d / H$ for every head. For every $i \in [H]$, let $(W^{Q}_i, W^{K}_i, W^{V}_i)$ be the (query, key, value) matrices associated with the i-th self-attention, and $s_i(q, k) = \langle W^{Q}_i q, W^{K}_i k \rangle$ the similarity function.
Let $W^{O} = [W^{O}_1, \ldots, W^{O}_H]$ denote the output projection matrix, where $W^{O}_i$ is a $d \times d_H$ matrix for every $i \in [H]$. For $x \in X$, the multi-headed self-attention (MHSelfAtt for short) is then defined by the heads acting independently; that is, an output
$$\mathrm{MHSelfAtt}(x, X) = \sum_{i=1}^{H} W^{O}_i\, W^{V}_i\, x_{j_i}$$
occurs with the probability $\prod_{i=1}^{H} p^{(i)}_{j_i}$, where $p^{(i)}_{j} = \mathrm{softmax}_j\big(s_i(x, x_1), \ldots, s_i(x, x_n)\big)$ for $j \in [n]$. As such,
$$\mathrm{MHSelfAtt}(X) = \big(\mathrm{MHSelfAtt}(x_1, X), \ldots, \mathrm{MHSelfAtt}(x_n, X)\big)$$
yields a basic building block of the transformer, as in the case of one-headed attention.
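A sketch of H-headed self-attention under the probabilistic reading above, assuming each head samples its value index independently and the per-head outputs are mapped back to $\mathbb{R}^d$ by the output projections and summed; all matrices are random placeholders, and the per-head dimension $d_H = d/H$ follows the usual convention.

```python
import numpy as np

rng = np.random.default_rng(2)
d, H = 8, 2
d_h = d // H                                     # per-head dimension d_H = d/H

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Per-head query/key/value matrices (d_h x d) and output projections (d x d_h).
W_Q = rng.standard_normal((H, d_h, d))
W_K = rng.standard_normal((H, d_h, d))
W_V = rng.standard_normal((H, d_h, d))
W_O = rng.standard_normal((H, d, d_h))

def mh_self_attention(X, n):
    """Sample the multi-headed self-attention output at token x_n."""
    out = np.zeros(d)
    for i in range(H):
        scores = np.array([(W_Q[i] @ X[n]) @ (W_K[i] @ x) for x in X])
        p = softmax(scores)                      # head i's distribution over keys
        j = rng.choice(len(X), p=p)              # head i samples a key/value index
        out += W_O[i] @ (W_V[i] @ X[j])          # project the head output to R^d
    return out

X = rng.standard_normal((6, d))                  # six embedded tokens
print(mh_self_attention(X, n=5))
```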
2.3. Transformer
In line with successful models, such as large language models, we focus on the decoder-only setting of the transformer, where the model iteratively predicts the next tokens based on a given sequence of tokens. This procedure is called autoregressive since the prediction of new tokens is only based on previous tokens. Such conditional sequence generation using autoregressive transformers is referred to as the transformer architecture.
Specifically, in the transformer architecture defined by a composition of blocks, each block consists of a self-attention layer (SelfAtt), a multi-layer perceptron (FFN), and a prediction head layer (PH). First, the self-attention layer SelfAtt is the only layer that combines different tokens. Let us denote the input text to the layer by $X = (x_1, \ldots, x_n)$, embedded in $\mathbb{R}^d$, and focus on the n-th output. For each $j \in [n]$, letting
$$a_{n,j} = \langle W^{Q} x_n, W^{K} x_j \rangle,$$
where $W^{Q}$ and $W^{K}$ are two matrices (i.e., the query and key matrices), we can interpret the $a_{n,j}$'s as similarities between the n-th token (i.e., the query) and the other tokens (i.e., keys); to satisfy the autoregressive structure, we only consider $j \le n$. The softmax layer is given by
$$p_{n,j} = \mathrm{softmax}_j\big(a_{n,1}, \ldots, a_{n,n}\big) = \frac{e^{a_{n,j}}}{\sum_{i=1}^{n} e^{a_{n,i}}}, \quad j \in [n],$$
which can be interpreted as the probability for the n-th query to “attend” to the j-th key. Then, the self-attention layer can be defined as
$$\mathrm{SelfAtt}(X)_n = W^{V} x_j \quad \text{with probability } p_{n,j},$$
where $W^{V}$ is the real value matrix such that, for any $j \in [n]$, the output $W^{V} x_j$, occurring with the probability $p_{n,j}$, is often referred to as the value of the token $x_j$. Thus, SelfAtt is a random map such that, for each $j \in [n]$, its n-th output equals $W^{V} x_j$ with probability $p_{n,j}$.
If the attention is a multi-headed attention with H heads of the dimension $d_H = d / H$, where $(W^{Q}_i, W^{K}_i, W^{V}_i)$ for $i \in [H]$ are the (query, key, value) matrices and $W^{O}_i$ is the (output) matrix of the i-th self-attention, then the multi-headed self-attention layer is defined by
$$\mathrm{MHSelfAtt}(X)_n = \sum_{i=1}^{H} W^{O}_i\, W^{V}_i\, x_{j_i},$$
where
$$p^{(i)}_{n,j} = \mathrm{softmax}_j\big(\langle W^{Q}_i x_n, W^{K}_i x_1 \rangle, \ldots, \langle W^{Q}_i x_n, W^{K}_i x_n \rangle\big), \quad j \in [n];$$
i.e., an output $\sum_{i=1}^{H} W^{O}_i W^{V}_i x_{j_i}$ occurs with the probability $\prod_{i=1}^{H} p^{(i)}_{n, j_i}$ for each $(j_1, \ldots, j_H) \in [n]^H$. In what follows, we only consider the case of one-headed attention, since the multi-headed case is similar.
Second, the multi-layer perceptron FFN is a feed-forward neural network such that the n-th output is $y_n = \mathrm{FFN}(W^{V} x_j)$ with the probability $p_{n,j}$ ($j \in [n]$) for each n. Finally, the prediction head layer PH can be represented as a mapping which maps the sequence of the $y_k$'s to a probability distribution $(\pi_z)_{z \in \mathcal{T}}$ over the token set $\mathcal{T}$, where $\pi_z$ is the probability of predicting z as the next token. Since $y_n$ contains information about the whole input text, we may define
$$\mathrm{PH}(y_1, \ldots, y_n) = (\pi_z)_{z \in \mathcal{T}},$$
such that the next token $x_{n+1} = z$ with the probability $\pi_z$ for $z \in \mathcal{T}$.
Hence, a basic building block for the transformer, consisting of a self-attention module (SelfAtt) and a feed-forward network (FFN) followed by a prediction head layer (PH), can be illustrated as follows:
$$X = (x_1, \ldots, x_n) \xrightarrow{\ \mathrm{SelfAtt}\ } W^{V} x_j \xrightarrow{\ \mathrm{FFN}\ } y_n \xrightarrow{\ \mathrm{PH}\ } x_{n+1},$$
where the input text $X = (x_1, \ldots, x_n)$ is embedded as a sequence in $\mathbb{R}^d$, $W^{V} x_j$ occurs with the probability $p_{n,j}$ for each $j \in [n]$, the next token $x_{n+1}$ is generated with the probability $\pi_{x_{n+1}}$, and so the output is the extended text $(x_1, \ldots, x_n, x_{n+1})$. One can then apply the same operations to the extended sequence in the next block, obtaining $x_{n+2}$, to iteratively compute further tokens (there is usually a stopping criterion based on a special token or the mapping PH). Below, without loss of generality, we omit the prediction head layer PH.
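Putting the pieces together, here is a sketch of one decoder block step under this probabilistic reading: masked self-attention samples a value for the last position, the FFN transforms it, and a hypothetical prediction head turns the result into a next-token distribution from which the new token is drawn; all weights, the vocabulary size, and the ReLU activation are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
d, vocab_size = 8, 10
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
W_ffn, b_ffn = rng.standard_normal((d, d)), rng.standard_normal(d)
W_ph = rng.standard_normal((vocab_size, d))      # prediction head weights (hypothetical)
embed = rng.standard_normal((vocab_size, d))     # token embeddings

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def block_step(token_ids):
    """Extend the text by one token via SelfAtt -> FFN -> PH at the last position."""
    X = embed[token_ids]                          # embedded input text
    n = len(token_ids) - 1
    scores = np.array([(W_Q @ X[n]) @ (W_K @ X[j]) for j in range(n + 1)])
    p = softmax(scores)                           # attend only to j <= n (causal)
    j = rng.choice(n + 1, p=p)                    # SelfAtt samples a value index
    y = np.maximum(W_ffn @ (W_V @ X[j]) + b_ffn, 0.0)   # FFN on the sampled value
    pi = softmax(W_ph @ y)                        # PH: distribution over next tokens
    next_id = rng.choice(vocab_size, p=pi)
    return token_ids + [next_id], pi[next_id]

print(block_step([1, 4, 2]))
```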
Typically, a transformer of depth L is defined by a composition of L blocks, denoted by $T_L$, consisting of L self-attention maps $\mathrm{SelfAtt}_\ell$ and L feed-forward neural networks $\mathrm{FFN}_\ell$; that is,
$$T_L = \mathrm{FFN}_L \circ \mathrm{SelfAtt}_L \circ \cdots \circ \mathrm{FFN}_1 \circ \mathrm{SelfAtt}_1, \qquad (24)$$
where the indices of the layers SelfAtt and FFN in (24) indicate the use of different trainable parameters in each of the blocks. This can be illustrated as follows:
$$X \xrightarrow{\ \mathrm{SelfAtt}_1\ } \cdot \xrightarrow{\ \mathrm{FFN}_1\ } \cdot \xrightarrow{\ \mathrm{SelfAtt}_2\ } \cdots \xrightarrow{\ \mathrm{FFN}_L\ } T_L(X);$$
that is, the blocks are applied successively to the input. Also, we can consider the transformer of the form
$$T_L = (\mathrm{id} + \mathrm{FFN}_L) \circ (\mathrm{id} + \mathrm{SelfAtt}_L) \circ \cdots \circ (\mathrm{id} + \mathrm{FFN}_1) \circ (\mathrm{id} + \mathrm{SelfAtt}_1),$$
where $\mathrm{id}$ denotes the identity mapping in $\mathbb{R}^d$, commonly known as a skip or residual connection.
2.4. Effect Algebras
For the sake of convenience, we collect some notations and basic properties of $\sigma$-effect algebras (cf. [1,11,14] and references therein). Recall that an effect algebra is an algebraic system $(\mathcal{E}, 0, 1, \oplus)$, where $\mathcal{E}$ is a non-empty set, $0, 1 \in \mathcal{E}$ are called the zero and unit elements of this algebra, respectively, and ⊕ is a partial binary operation on $\mathcal{E}$ that satisfies the following conditions for any $a, b, c \in \mathcal{E}$:
- (E1)
- (Commutative Law): If $a \oplus b$ is defined, then $b \oplus a$ is defined and $a \oplus b = b \oplus a$, which is called the orthogonal sum of a and b.
- (E2)
- (Associative Law): If $a \oplus b$ and $(a \oplus b) \oplus c$ are defined, then $b \oplus c$ and $a \oplus (b \oplus c)$ are defined and $(a \oplus b) \oplus c = a \oplus (b \oplus c)$, which is denoted by $a \oplus b \oplus c$.
- (E3)
- (Orthosupplementation Law): For every $a \in \mathcal{E}$, there exists an $a' \in \mathcal{E}$ such that $a \oplus a'$ is defined and $a \oplus a' = 1$; such $a'$ is unique and is called the orthosupplement of a.
- (E4)
- (Zero–One Law): If $a \oplus 1$ is defined, then $a = 0$.
We simply call $\mathcal{E}$ an effect algebra in the sequel. From the associative law (E2), we can write $a_1 \oplus a_2 \oplus \cdots \oplus a_n$ if this orthogonal sum is defined. For any $a, b \in \mathcal{E}$, we define $a \le b$ if there exists a $c \in \mathcal{E}$ such that $a \oplus c = b$; this c is unique and denoted by $b \ominus a$, so $b = a \oplus (b \ominus a)$. We also define $a \perp b$ if $a \oplus b$ is defined; i.e., a is orthogonal to b. It can be shown (cf. [14]) that $(\mathcal{E}, \le)$ is a bounded partially ordered set (poset for short) and that $a \perp b$ if and only if $a \le b'$. For a sequence $(a_i)$ in $\mathcal{E}$, if $a_1 \oplus \cdots \oplus a_n$ is defined for all n and the least upper bound $\bigvee_{n} (a_1 \oplus \cdots \oplus a_n)$ exists, then the sum of $(a_i)$ exists and we define $\bigoplus_{i} a_i = \bigvee_{n} (a_1 \oplus \cdots \oplus a_n)$. We say that $\mathcal{E}$ is a $\sigma$-effect algebra if $\bigoplus_{i} a_i$ exists for any sequence $(a_i)$ in $\mathcal{E}$ satisfying that $a_1 \oplus \cdots \oplus a_n$ is defined for all n. It was shown in (Lemma 3.1, [1]) that $\mathcal{E}$ is a $\sigma$-effect algebra if and only if the least upper bound $\bigvee_{n} a_n$ exists for any monotone sequence $a_1 \le a_2 \le \cdots$ in $\mathcal{E}$; i.e., any monotone sequence has a least upper bound in $\mathcal{E}$.
Let $\mathcal{E}_1$ and $\mathcal{E}_2$ be $\sigma$-effect algebras. A map $\phi: \mathcal{E}_1 \to \mathcal{E}_2$ is said to be additive if $a \perp b$ for $a, b \in \mathcal{E}_1$ implies that $\phi(a) \perp \phi(b)$ and $\phi(a \oplus b) = \phi(a) \oplus \phi(b)$. An additive map $\phi$ is $\sigma$-additive if for any sequence $(a_i)$ in $\mathcal{E}_1$ such that $\bigoplus_i a_i$ exists, $\bigoplus_i \phi(a_i)$ exists and $\phi(\bigoplus_i a_i) = \bigoplus_i \phi(a_i)$. A $\sigma$-additive map $\phi$ is said to be a $\sigma$-morphism if $\phi(1) = 1$; moreover, $\phi$ is called a $\sigma$-isomorphism if $\phi$ is a bijective $\sigma$-morphism and $\phi^{-1}$ is a $\sigma$-morphism. It can be shown (cf. [1]) that
- (1)
- A map $\phi$ is additive if and only if $\phi$ is monotone in the sense that $a \le b$ implies $\phi(a) \le \phi(b)$ and $\phi(b \ominus a) = \phi(b) \ominus \phi(a)$.
- (2)
- An additive map $\phi$ is $\sigma$-additive if and only if $a_n \nearrow a$ in $\mathcal{E}_1$ implies $\phi(a_n) \nearrow \phi(a)$ in $\mathcal{E}_2$.
- (3)
- A $\sigma$-morphism $\phi$ satisfies $\phi(0) = 0$ and $\phi(a') = \phi(a)'$ for every $a \in \mathcal{E}_1$.
The unit interval $[0, 1]$ is a $\sigma$-effect algebra defined as follows: For any $s, t \in [0, 1]$, $s \oplus t$ is defined if $s + t \le 1$, and in this case $s \oplus t = s + t$. Then, we have that 0 and 1 are the zero and unit elements, respectively. In what follows, we always regard $[0, 1]$ as a $\sigma$-effect algebra in this way. Let $\mathcal{E}$ be a $\sigma$-effect algebra; a $\sigma$-morphism $\rho: \mathcal{E} \to [0, 1]$ is called a state on $\mathcal{E}$, and we denote by $\mathcal{S}(\mathcal{E})$ the set of all states on $\mathcal{E}$. A subset S of $\mathcal{S}(\mathcal{E})$ is said to be order determining if $\rho(a) \le \rho(b)$ for all $\rho \in S$ implies $a \le b$.
Another example of a $\sigma$-effect algebra is a measurable space $(\Omega, \mathcal{F})$, defined as follows: For any $A, B \in \mathcal{F}$, $A \oplus B$ is defined if $A \cap B = \emptyset$, and in this case, $A \oplus B = A \cup B$. We then have $0 = \emptyset$ and $1 = \Omega$. We always regard a measurable space as a $\sigma$-effect algebra in this way. Let $\mathcal{E}$ be a $\sigma$-effect algebra; a $\sigma$-morphism $X: \mathcal{F} \to \mathcal{E}$ is called an observable on $\mathcal{E}$ with values in $(\Omega, \mathcal{F})$ (an $(\Omega, \mathcal{F})$-valued observable for short). The elements of a $\sigma$-effect algebra are called effects, and so an observable X maps effects in $\mathcal{F}$ into effects in $\mathcal{E}$; i.e., $X(A)$ is an effect in $\mathcal{E}$ for $A \in \mathcal{F}$. We denote by $\mathcal{O}(\Omega, \mathcal{E})$ the set of all $(\Omega, \mathcal{F})$-valued observables. Note that the set of states on the $\sigma$-effect algebra $(\Omega, \mathcal{F})$ is equal to the set of all probability measures on $(\Omega, \mathcal{F})$. For $\rho \in \mathcal{S}(\mathcal{E})$ and $X \in \mathcal{O}(\Omega, \mathcal{E})$, we have $\rho \circ X \in \mathcal{S}(\Omega, \mathcal{F})$, which is called the probability distribution of X in the state $\rho$.
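A small NumPy illustration of these notions for a qubit: effects are matrices E with $0 \le E \le I$, a density matrix D induces the state $\rho(E) = \mathrm{tr}(DE)$, and a two-outcome observable (a POV measure on a two-point Ω) yields the probability distribution $\rho \circ X$; the specific matrices are arbitrary examples.

```python
import numpy as np

# A density operator D on C^2 (arbitrary example); it defines the state rho(E) = tr(D E).
D = np.array([[0.75, 0.25],
              [0.25, 0.25]])

def rho(E):
    return float(np.trace(D @ E).real)

# A two-outcome observable on Omega = {0, 1}: X(0) and X(1) are effects with X(0) + X(1) = I.
E0 = np.array([[0.9, 0.0],
               [0.0, 0.2]])
X = {0: E0, 1: np.eye(2) - E0}

# The probability distribution of X in the state rho: a probability measure on Omega.
dist = {outcome: rho(effect) for outcome, effect in X.items()}
print(dist, "sum =", sum(dist.values()))
```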
3. Mathematical Formalism
In this section, we introduce a mathematical formalism for generative AI. We utilize the theory of $\sigma$-effect algebras to give a mathematical definition of a generative AI system. Let $\mathcal{E}$ be a $\sigma$-effect algebra and $(\Omega, \mathcal{F})$ a measurable space. An orthogonal decomposition in $\mathcal{E}$ is a sequence $(a_i)$ in $\mathcal{E}$ such that $\bigoplus_i a_i$ exists; moreover, it is complete if $\bigoplus_i a_i = 1$. We denote by $\mathrm{OD}(\mathcal{E})$ the set of all complete orthogonal decompositions in $\mathcal{E}$. A complete orthogonal decomposition in $(\Omega, \mathcal{F})$ is called a countable partition of $\Omega$, i.e., a sequence $(A_i)$ of elements in $\mathcal{F}$ such that $A_i \cap A_j = \emptyset$ for $i \neq j$ and $\bigcup_i A_i = \Omega$. We denote by $\mathrm{P}(\Omega)$ the set of all countable partitions of $\Omega$. An ordered n-tuple $(a_1, \ldots, a_n)$ of effects in $\mathcal{E}$ is called an n-time chain-of-effect, and we interpret it as an inference process of an intelligence machine in which the effect $a_k$ occurs at time $t_k$ for $k \in [n]$, where $t_1 < t_2 < \cdots < t_n$. Alternatively, no specific times may be involved, and we regard $(a_1, \ldots, a_n)$ as a sequential effect in which $a_1$ occurs first, $a_2$ occurs second, …, and $a_n$ occurs last.
Definition 2.
With the above notations, a generative artificial intelligence system is defined to be a triple $\mathfrak{G} = (\mathcal{E}, \Omega, \mathcal{F})$, where $\mathcal{E}$ is a σ-effect algebra and $(\Omega, \mathcal{F})$ is a measurable space, such that
- (G1)
- The input set of $\mathfrak{G}$ is equal to the set $\mathcal{S}(\mathcal{E})$ of states; i.e., an input is interpreted by a state $\rho \in \mathcal{S}(\mathcal{E})$.
- (G2)
- The output set of $\mathfrak{G}$ is equal to the set $\Omega^{*}$; i.e., the set of all finite sequences of elements in $\Omega$.
- (G3)
- An inference process in $\mathfrak{G}$ is interpreted by an n-time chain-of-effect $(a_1, \ldots, a_n)$ for some $n \in \mathbb{N}$.
Remark 1.
We refer to [15] for a mathematical definition of general artificial intelligence systems in terms of topos theory, including quantum artificial intelligence systems.
In practice, we are not concerned with a generative AI system $\mathfrak{G}$ itself but deal with models for $\mathfrak{G}$, such as large language models. To this end, we need to introduce the definition of a model for $\mathfrak{G}$ in terms of joint probability distributions for observables associated with $\mathfrak{G}$.
For $X \in \mathcal{O}(\Omega, \mathcal{E})$ and $A \in \mathcal{F}$, we may view the effect $X(A)$ as the event for which X has a value in A. For a partition $\mathcal{A} = (A_i) \in \mathrm{P}(\Omega)$, we may view $\{X(A_i)\}_i$ as a set of possible alternative events that can occur. One interpretation is that $\mathcal{A}$ represents a building block of an artificial intelligence architecture for processing X, and the alternatives $X(A_i)$ result from the dial readings of the block. An ordered n-tuple $(X_1(A_1), \ldots, X_n(A_n))$ of events is called an n-time chain-of-event, and we interpret it as an inference process of an intelligence machine in which $X_1$ has a value in $A_1$ first, $X_2$ has a value in $A_2$ second, …, and $X_n$ has a value in $A_n$ last, so that the output result is $A_1 \times A_2 \times \cdots \times A_n$. We denote the set of all n-time chain-of-events by $\mathcal{C}_n$ and the set of all chain-of-events by $\mathcal{C} = \bigcup_{n \in \mathbb{N}} \mathcal{C}_n$.
An n-step inference set has the form $\mathcal{A} = \big((X_1, \mathcal{A}_1), \ldots, (X_n, \mathcal{A}_n)\big)$, where $X_k \in \mathcal{O}(\Omega, \mathcal{E})$ and $\mathcal{A}_k \in \mathrm{P}(\Omega)$ for $k \in [n]$. We interpret $\mathcal{A}$ as ordered successive processes of the observables $X_1, \ldots, X_n$ with partitions $\mathcal{A}_1, \ldots, \mathcal{A}_n$. We denote the collection of all n-step inference sets by $\mathrm{IS}_n$ and the collection of all inference sets by $\mathrm{IS} = \bigcup_{n \in \mathbb{N}} \mathrm{IS}_n$. If $(X_1(A_1), \ldots, X_n(A_n)) \in \mathcal{C}_n$ and $\mathcal{A} \in \mathrm{IS}_n$ such that $A_k \in \mathcal{A}_k$ for every $k \in [n]$, we say the chain-of-event is an element of the inference set $\mathcal{A}$ and write $(X_1(A_1), \ldots, X_n(A_n)) \in \mathcal{A}$. This can be illustrated as follows:
$$\rho \xrightarrow{\ (X_1, \mathcal{A}_1)\ } A_1 \xrightarrow{\ (X_2, \mathcal{A}_2)\ } A_2 \longrightarrow \cdots \xrightarrow{\ (X_n, \mathcal{A}_n)\ } A_n,$$
which means that the machine firstly obtains $A_1$ as part of an output with the probability $P^{\mathcal{A}}_{\rho}(A_1)$, then obtains $A_2$ with the conditional probability $P^{\mathcal{A}}_{\rho}(A_2 \mid A_1)$, …, and lastly obtains $A_n$ with the conditional probability $P^{\mathcal{A}}_{\rho}(A_n \mid A_1, \ldots, A_{n-1})$, and finally combines them to obtain the output result $A_1 \times \cdots \times A_n$ with the probability
$$P^{\mathcal{A}}_{\rho}(A_1 \times \cdots \times A_n) = P^{\mathcal{A}}_{\rho}(A_1)\, P^{\mathcal{A}}_{\rho}(A_2 \mid A_1) \cdots P^{\mathcal{A}}_{\rho}(A_n \mid A_1, \ldots, A_{n-1}),$$
where $P^{\mathcal{A}}_{\rho}$ will be explained later.
If $\mathcal{A} = ((X_1, \mathcal{A}_1), \ldots, (X_m, \mathcal{A}_m))$ and $\mathcal{B} = ((Y_1, \mathcal{B}_1), \ldots, (Y_n, \mathcal{B}_n))$ are two inference sets, then we define their sequential product by
$$\mathcal{A} \circ \mathcal{B} = \big((X_1, \mathcal{A}_1), \ldots, (X_m, \mathcal{A}_m), (Y_1, \mathcal{B}_1), \ldots, (Y_n, \mathcal{B}_n)\big)$$
and obtain an $(m+n)$-step inference set. Mathematically, we can include the empty inference set ∅, which satisfies $\mathcal{A} \circ \emptyset = \emptyset \circ \mathcal{A} = \mathcal{A}$, so that $\mathrm{IS} \cup \{\emptyset\}$ becomes a semigroup under this product.
For a partition $\mathcal{A}_k \in \mathrm{P}(\Omega)$, we denote by $\sigma(\mathcal{A}_k)$ the $\sigma$-subalgebra of $\mathcal{F}$ generated by $\mathcal{A}_k$, and for n partitions $\mathcal{A}_1, \ldots, \mathcal{A}_n$, we denote by $\sigma(\mathcal{A}_1) \otimes \cdots \otimes \sigma(\mathcal{A}_n)$ the $\sigma$-algebra on $\Omega^{n}$ generated by the corresponding rectangles; i.e.,
$$\sigma(\mathcal{A}_1) \otimes \cdots \otimes \sigma(\mathcal{A}_n) = \sigma\big(\{A_1 \times \cdots \times A_n : A_k \in \mathcal{A}_k,\ k \in [n]\}\big).$$
We denote by $\mathrm{Prob}\big(\sigma(\mathcal{A}_1) \otimes \cdots \otimes \sigma(\mathcal{A}_n)\big)$ the set of all probability measures on this $\sigma$-algebra. Given an input $\rho \in \mathcal{S}(\mathcal{E})$ for an inference set $\mathcal{A} = ((X_1, \mathcal{A}_1), \ldots, (X_n, \mathcal{A}_n))$, we denote by $P^{\mathcal{A}}_{\rho}$ the probability measure such that, for $A_k \in \mathcal{A}_k$ ($k \in [n]$), $P^{\mathcal{A}}_{\rho}(A_1 \times \cdots \times A_n)$ is the probability within the inference set $\mathcal{A}$ that the event $X_1(A_1)$ occurs first, $X_2(A_2)$ occurs second, …, $X_n(A_n)$ occurs last. We call $P^{\mathcal{A}}_{\rho}$ the joint probability distribution of the inference set $\mathcal{A}$ under the input $\rho$.
To interpret a model for a generative AI system, the $P^{\mathcal{A}}_{\rho}$'s need to satisfy physically motivated axioms, as follows.
Definition 3.
With the above notations, a model for $\mathfrak{G}$ is defined to be a family of joint probability distributions of inference sets
$$\big\{ P^{\mathcal{A}}_{\rho} : \rho \in \mathcal{S}(\mathcal{E}),\ \mathcal{A} \in \mathrm{IS} \big\}$$
that satisfies the following axioms:
- (P1)
- For $X, Y \in \mathcal{O}(\Omega, \mathcal{E})$, partitions $\mathcal{A}, \mathcal{B} \in \mathrm{P}(\Omega)$, and $A \in \mathcal{A}$, $B \in \mathcal{B}$: if $P^{(X, \mathcal{A})}_{\rho}(A) = P^{(Y, \mathcal{B})}_{\rho}(B)$ for all $\rho \in \mathcal{S}(\mathcal{E})$, then $X(A) = Y(B)$.
- (P2)
- For $\mathcal{A} = ((X_1, \mathcal{A}_1), \ldots, (X_n, \mathcal{A}_n))$ and $\mathcal{A}' = ((X_1, \mathcal{A}_1), \ldots, (X_{n-1}, \mathcal{A}_{n-1}), (X_n, \mathcal{A}'_n))$: if $A_k \in \mathcal{A}_k$ for $k \in [n-1]$ and $A_n \in \mathcal{A}_n \cap \mathcal{A}'_n$, then $P^{\mathcal{A}}_{\rho}(A_1 \times \cdots \times A_n) = P^{\mathcal{A}'}_{\rho}(A_1 \times \cdots \times A_n)$ for every $\rho \in \mathcal{S}(\mathcal{E})$.
- (P3)
- For $\mathcal{A} = ((X_1, \mathcal{A}_1), \ldots, (X_n, \mathcal{A}_n)) \in \mathrm{IS}_n$ with $n \ge 2$, writing $\mathcal{A}^{-} = ((X_1, \mathcal{A}_1), \ldots, (X_{n-1}, \mathcal{A}_{n-1}))$: if $A_k \in \mathcal{A}_k$ for $k \in [n-1]$, then $P^{\mathcal{A}}_{\rho}(A_1 \times \cdots \times A_{n-1} \times \Omega) = P^{\mathcal{A}^{-}}_{\rho}(A_1 \times \cdots \times A_{n-1})$ for every $\rho \in \mathcal{S}(\mathcal{E})$.
- (P4)
- If $\mathcal{A} = ((X_1, \mathcal{A}_1), \ldots, (X_n, \mathcal{A}_n))$ and $\mathcal{A}' = ((X_1, \mathcal{A}'_1), \ldots, (X_n, \mathcal{A}'_n))$, and $A_k \in \mathcal{A}_k \cap \mathcal{A}'_k$ for $k \in [n]$, then $P^{\mathcal{A}}_{\rho}(A_1 \times \cdots \times A_n) = P^{\mathcal{A}'}_{\rho}(A_1 \times \cdots \times A_n)$ for every $\rho \in \mathcal{S}(\mathcal{E})$.
For the physical meanings of the model structure axioms, we remark that
- (1)
- The axiom (P1) means that the input set $\mathcal{S}(\mathcal{E})$ can distinguish different events;
- (2)
- The axiom (P2) means that the partition of the last processing is irrelevant;
- (3)
- The axiom (P3) means that the last processing does not affect the previous ones;
- (4)
- The axiom (P4) means that the probability of a chain of events does not depend on the partitions and hence is unambiguous. However, for $A_k \in \sigma(\mathcal{A}_k)$ (rather than $A_k \in \mathcal{A}_k$), $P^{\mathcal{A}}_{\rho}(A_1 \times \cdots \times A_n)$ may depend on the partitions in general if the $X_k$'s are quantum observables, due to quantum interference.
If $\mathcal{A} \in \mathrm{IS}_m$ and $\mathcal{B} \in \mathrm{IS}_n$ are two inference sets, and if $\rho$ is an input such that $P^{\mathcal{A}}_{\rho}(A) > 0$ for $A = A_1 \times \cdots \times A_m$, then we define the conditional probability of B given A within $\mathcal{A} \circ \mathcal{B}$ under the input $\rho$ as follows:
$$P_{\rho}(B \mid A) = \frac{P^{\mathcal{A} \circ \mathcal{B}}_{\rho}(A \times B)}{P^{\mathcal{A}}_{\rho}(A)}, \quad B = B_1 \times \cdots \times B_n.$$
Since $P^{\mathcal{A} \circ \mathcal{B}}_{\rho}$ is a probability measure on $\sigma(\mathcal{A}_1) \otimes \cdots \otimes \sigma(\mathcal{A}_m) \otimes \sigma(\mathcal{B}_1) \otimes \cdots \otimes \sigma(\mathcal{B}_n)$, where $A_k \in \mathcal{A}_k$ for $k \in [m]$ and $B_k \in \mathcal{B}_k$ for $k \in [n]$, it follows that $P_{\rho}(\cdot \mid A)$ is a probability measure on $\sigma(\mathcal{B}_1) \otimes \cdots \otimes \sigma(\mathcal{B}_n)$, which is called a conditional sequential joint probability distribution.
Proposition 1.
Given $\mathcal{A} \in \mathrm{IS}_m$, $\mathcal{B} \in \mathrm{IS}_n$, $A = A_1 \times \cdots \times A_m$, and $\rho \in \mathcal{S}(\mathcal{E})$, if $P^{\mathcal{A}}_{\rho}(A) > 0$, then the conditional sequential joint probability distribution $P_{\rho}(\cdot \mid A)$ satisfies the axioms (P2)–(P4) in Definition 3.
Proof.
By the axiom (P2), both $P^{\mathcal{A} \circ \mathcal{B}}_{\rho}(A \times B)$ and $P^{\mathcal{A}}_{\rho}(A)$ are unchanged when the last partition of $\mathcal{B}$ is replaced;
hence, $P_{\rho}(\cdot \mid A)$ satisfies the axiom (P2), and so does the axiom (P3). Similarly, the axiom (P4) implies that $P_{\rho}(\cdot \mid A)$ satisfies the axiom (P4); we omit the details. □
We remark that when observables are quantum ones, Bayes’ formula need not hold, i.e.,
$$P_{\rho}(B \mid A)\, P^{\mathcal{A}}_{\rho}(A) \neq P_{\rho}(A \mid B)\, P^{\mathcal{B}}_{\rho}(B)$$
in general. This is because the left-hand side is the probability that A occurs first and B occurs second, while the right-hand side is the probability that B occurs first and A occurs second, so the order of the occurrences is changed. For instance, consider a qubit with the standard basis $|0\rangle$ and $|1\rangle$, and let $|+\rangle = (|0\rangle + |1\rangle)/\sqrt{2}$. Let $\rho = |0\rangle\langle 0|$. If $a = |+\rangle\langle +|$ and $b = |1\rangle\langle 1|$, then
$$\mathrm{tr}\big(b\, a \rho a\big) = \tfrac{1}{4}, \quad \text{while} \quad \mathrm{tr}\big(a\, b \rho b\big) = 0,$$
and so the probability of a occurring first and then b differs from that of b occurring first and then a. We refer to Section 4 for more details.
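As a sanity check on this order dependence, the following NumPy computation evaluates the two sequential probabilities for a qubit example of this kind (the concrete state and projections are chosen here for illustration) via the von Neumann–Lüders rule.

```python
import numpy as np

ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])
plus = (ket0 + ket1) / np.sqrt(2)

rho = np.outer(ket0, ket0)              # initial state |0><0|
a = np.outer(plus, plus)                # projection onto |+>
b = np.outer(ket1, ket1)                # projection onto |1>

def sequential_prob(state, first, second):
    """P(first occurs, then second occurs) via the von Neumann-Lueders rule."""
    p1 = float(np.trace(first @ state).real)
    if p1 == 0.0:
        return 0.0
    reduced = first @ state @ first / p1         # post-measurement state
    p2 = float(np.trace(second @ reduced).real)
    return p1 * p2

print(sequential_prob(rho, a, b))       # 0.25: |+> first, then |1>
print(sequential_prob(rho, b, a))       # 0.0:  |1> first, then |+>
```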
4. Physical Models for Generative AI
Physical models for generative AI are usually described by using systems of mean-field interacting particles, such as large language models based on attention mechanisms (cf. [7,8] and references therein); i.e., generative AI systems are regarded as classical statistical ensembles. However, since modern chips process data by controlling the flow of electric current, i.e., the dynamics of a very large number of electrons, they should be regarded as quantum statistical ensembles from a physical perspective (cf. [10]). Consequently, we need to model modern intelligence machines as open quantum systems. To this end, combining the history theory of quantum systems (cf. [2]) and the theory of effect algebras (cf. [1,14]), we construct physical models realizing generative AI systems as open quantum systems.
Let $\mathcal{H}$ be a separable complex Hilbert space with the inner product $\langle \cdot, \cdot \rangle$ being conjugate-linear in the first variable and linear in the second variable. We denote by $\mathcal{B}(\mathcal{H})$ the set of all bounded linear operators on $\mathcal{H}$, by $\mathcal{B}_{\mathrm{sa}}(\mathcal{H})$ the set of all bounded self-adjoint operators, and by $\mathcal{P}(\mathcal{H})$ the set of all orthogonal projection operators. We denote by I the identity operator on $\mathcal{H}$. Unless stated otherwise, an operator means a bounded linear operator in the sequel. An operator T is positive if $\langle \xi, T\xi \rangle \ge 0$ for all $\xi \in \mathcal{H}$, and in this case we write $T \ge 0$. We define $\mathrm{tr}(T) = \sum_i \langle e_i, T e_i \rangle$ for a positive operator T, where $(e_i)$ is an orthonormal basis of $\mathcal{H}$. It is known that $\mathrm{tr}(T)$ is independent of the choice of the basis, and it is called the trace of T if $\mathrm{tr}(T) < \infty$. A positive operator D is a density operator if $\mathrm{tr}(D) = 1$, and the set of all density operators on $\mathcal{H}$ is denoted by $\mathcal{D}(\mathcal{H})$. Each positive operator is self-adjoint, and for two self-adjoint operators S and T such that $T - S \ge 0$, we write $S \le T$ or $T \ge S$. We refer to [16,17,18] for more details on the theory of operators on Hilbert spaces.
A self-adjoint operator E that satisfies $0 \le E \le I$ is called an effect, and the set of all effects on $\mathcal{H}$ is denoted by $\mathcal{E}(\mathcal{H})$. For $E, F \in \mathcal{E}(\mathcal{H})$, we define $E \oplus F$ if $E + F \le I$, and in this case we write $E \oplus F = E + F$. It can be shown (cf. (Lemma 5.1, [1])) that $\mathcal{E}(\mathcal{H})$ is a $\sigma$-effect algebra, and each state $\rho$ on $\mathcal{E}(\mathcal{H})$ has the form $\rho(E) = \mathrm{tr}(DE)$ for every $E \in \mathcal{E}(\mathcal{H})$, where D is a unique density operator on $\mathcal{H}$, and vice versa. Thus, we identify $\mathcal{S}(\mathcal{E}(\mathcal{H})) = \mathcal{D}(\mathcal{H})$. Let $(\Omega, \mathcal{F})$ be a measurable space. An observable X is a positive operator valued (POV for short) measure on $(\Omega, \mathcal{F})$; i.e.,
- (1)
- $X(A)$ is an effect on $\mathcal{H}$ for any $A \in \mathcal{F}$;
- (2)
- $X(\emptyset) = 0$ and $X(\Omega) = I$; and
- (3)
- For an orthogonal decomposition $(A_i)$ in $\mathcal{F}$ (i.e., a sequence of mutually disjoint sets in $\mathcal{F}$),
$$X\Big(\bigcup_i A_i\Big) = \sum_i X(A_i),$$
where the series on the right-hand side is convergent in the strong operator topology on $\mathcal{B}(\mathcal{H})$; i.e.,
$$X\Big(\bigcup_i A_i\Big)\xi = \sum_i X(A_i)\xi$$
for every $\xi \in \mathcal{H}$.
To understand the inference process, let us recall the conventional interpretation of joint probability distributions in an open quantum system that is subject to measurements by an external observer. To this end, let $\Lambda_{t,s}$ denote the time-evolution operator from time s to time t, determined by a family of operators $\{K_j\}$ (usually called Kraus operators) with $\sum_j K_j^{\dagger} K_j = I$, such that
$$\Lambda_{t,s}(\rho) = \sum_j K_j\, \rho\, K_j^{\dagger} \qquad (39)$$
for every state $\rho$. That is, the $\Lambda_{t,s}$ are quantum operations (cf. [19]) acting as (39) in the Schrödinger picture, while for each observable (effect) E,
$$\Lambda^{*}_{t,s}(E) = \sum_j K_j^{\dagger}\, E\, K_j$$
in the Heisenberg picture. We refer to [9] for the details on the theory of open quantum systems.
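A quick NumPy sketch of a quantum operation given by Kraus operators (the amplitude-damping channel is used as an arbitrary example), checking the completeness condition and the duality $\mathrm{tr}(\Lambda(\rho)E) = \mathrm{tr}(\rho\,\Lambda^{*}(E))$ between the Schrödinger and Heisenberg pictures.

```python
import numpy as np

gamma = 0.3                                   # damping parameter (arbitrary example)
K = [np.array([[1.0, 0.0], [0.0, np.sqrt(1 - gamma)]]),
     np.array([[0.0, np.sqrt(gamma)], [0.0, 0.0]])]

# Completeness of the Kraus operators: sum_j K_j^dagger K_j = I.
print(np.allclose(sum(k.conj().T @ k for k in K), np.eye(2)))

def schrodinger(rho):                         # Lambda(rho) = sum_j K_j rho K_j^dagger
    return sum(k @ rho @ k.conj().T for k in K)

def heisenberg(E):                            # Lambda*(E) = sum_j K_j^dagger E K_j
    return sum(k.conj().T @ E @ k for k in K)

rho = np.array([[0.5, 0.5], [0.5, 0.5]])      # a density operator
E = np.array([[0.7, 0.0], [0.0, 0.1]])        # an effect (0 <= E <= I)
print(np.isclose(np.trace(schrodinger(rho) @ E),
                 np.trace(rho @ heisenberg(E))))
```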
Then the density operator state $\rho_0$ at the fiducial time $t_0$ evolves in time to the state $\rho(t) = \Lambda_{t, t_0}(\rho_0)$ at time $t > t_0$.
Suppose that a measurement of an observable $X_1$ is made at time $t_1$, where $t_1 > t_0$ and $X_1 \in \mathcal{O}(\Omega, \mathcal{E}(\mathcal{H}))$. Then the probability that an event $X_1(A_1)$ with $A_1 \in \mathcal{F}$ occurs is
$$P(A_1) = \mathrm{tr}\big( X_1(A_1)\, \Lambda_{t_1, t_0}(\rho_0) \big).$$
If the result of this measurement is kept, then, according to the von Neumann–Lüders reduction postulate, the appropriate density operator to use for any further calculation is
$$\rho_1 = \frac{X_1(A_1)^{1/2}\, \Lambda_{t_1, t_0}(\rho_0)\, X_1(A_1)^{1/2}}{\mathrm{tr}\big( X_1(A_1)\, \Lambda_{t_1, t_0}(\rho_0) \big)}.$$
Next, suppose a measurement of an observable $X_2$ is performed at time $t_2 > t_1$. Then, according to the above, the conditional probability that an event $X_2(A_2)$ for $A_2 \in \mathcal{F}$ occurs at time $t_2$, given that the event $X_1(A_1)$ occurs at time $t_1$ (and that the original state was $\rho_0$), is
$$P(A_2 \mid A_1) = \mathrm{tr}\big( X_2(A_2)\, \Lambda_{t_2, t_1}(\rho_1) \big),$$
where $\rho_1$ is as above, and the appropriate density operator to use for any further calculation is
$$\rho_2 = \frac{X_2(A_2)^{1/2}\, \Lambda_{t_2, t_1}(\rho_1)\, X_2(A_2)^{1/2}}{\mathrm{tr}\big( X_2(A_2)\, \Lambda_{t_2, t_1}(\rho_1) \big)}.$$
The joint probability of $X_1(A_1)$ occurring at $t_1$ and $X_2(A_2)$ occurring at $t_2$ is then
$$P(A_1 \times A_2) = P(A_1)\, P(A_2 \mid A_1).$$
Generalizing to a sequence of measurements of observables $X_1, \ldots, X_n$ at times $t_1 < t_2 < \cdots < t_n$, where $t_0 < t_1$ and $A_k \in \mathcal{F}$ for $k \in [n]$, the sequential joint probability of the associated events $X_k(A_k)$, with $X_k(A_k)$ occurring at $t_k$ for $k \in [n]$, is
$$P(A_1 \times \cdots \times A_n) = \prod_{k=1}^{n} \mathrm{tr}\big( X_k(A_k)\, \Lambda_{t_k, t_{k-1}}(\rho_{k-1}) \big),$$
where $\rho_0$ is the initial state and
$$\rho_k = \frac{X_k(A_k)^{1/2}\, \Lambda_{t_k, t_{k-1}}(\rho_{k-1})\, X_k(A_k)^{1/2}}{\mathrm{tr}\big( X_k(A_k)\, \Lambda_{t_k, t_{k-1}}(\rho_{k-1}) \big)}$$
for $k \in [n-1]$.
Therefore, given an inference set $\mathcal{A} = ((X_1, \mathcal{A}_1), \ldots, (X_n, \mathcal{A}_n))$ for an input $\rho_0 \in \mathcal{D}(\mathcal{H})$, the sequential joint probability within the inference set that the event $X_1(A_1)$ occurs at $t_1$, $X_2(A_2)$ occurs at $t_2$, …, $X_n(A_n)$ occurs at $t_n$, where $A_k \in \mathcal{A}_k$ for $k \in [n]$ and $t_0 < t_1 < \cdots < t_n$, is given by
$$P^{\mathcal{A}}_{\rho_0}(A_1 \times \cdots \times A_n) = \prod_{k=1}^{n} \mathrm{tr}\big( X_k(A_k)\, \Lambda_{t_k, t_{k-1}}(\rho_{k-1}) \big), \qquad (49)$$
where $\rho_0$ is the input state and
$$\rho_k = \frac{X_k(A_k)^{1/2}\, \Lambda_{t_k, t_{k-1}}(\rho_{k-1})\, X_k(A_k)^{1/2}}{\mathrm{tr}\big( X_k(A_k)\, \Lambda_{t_k, t_{k-1}}(\rho_{k-1}) \big)}$$
for $k \in [n-1]$, in the Schrödinger picture defined with respect to the fiducial time $t_0$.
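The following NumPy sketch evaluates the sequential joint probability in the spirit of (49) as reconstructed above: given a list of channels (one per time step, each specified by Kraus operators) and a list of measured effects, it alternates evolution, probability evaluation, and Lüders reduction; the identity channels and qubit projections below are arbitrary examples.

```python
import numpy as np

def herm_sqrt(E):
    """Square root of a positive semidefinite Hermitian matrix."""
    vals, vecs = np.linalg.eigh(E)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.conj().T

def sequential_joint_prob(rho, channels, effects):
    """Probability that effects[k] occurs at step k, with channels[k] acting just before."""
    prob = 1.0
    for kraus, E in zip(channels, effects):
        rho = sum(k @ rho @ k.conj().T for k in kraus)       # evolve (Schrodinger picture)
        p = float(np.trace(E @ rho).real)
        if p == 0.0:
            return 0.0
        prob *= p
        s = herm_sqrt(E)
        rho = s @ rho @ s / p                                # Lueders reduction
    return prob

# Arbitrary example: identity channels; measure |+><+| and then |1><1| on a qubit in |0>.
ket0, ket1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
plus = (ket0 + ket1) / np.sqrt(2)
identity_channel = [np.eye(2)]
print(sequential_joint_prob(np.outer(ket0, ket0),
                            [identity_channel] * 2,
                            [np.outer(plus, plus), np.outer(ket1, ket1)]))   # 0.25
```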
Proposition 2.
Let $\mathcal{H}$ be a separable complex Hilbert space, and let $(\Omega, \mathcal{F})$ be a measurable space. A physical model associated with $(\mathcal{E}(\mathcal{H}), \Omega, \mathcal{F})$, defined by
$$\big\{ P^{\mathcal{A}}_{\rho} : \rho \in \mathcal{D}(\mathcal{H}),\ \mathcal{A} \in \mathrm{IS} \big\},$$
where the $P^{\mathcal{A}}_{\rho}$'s are given by (49), satisfies the axioms in Definition 3.
Proof.
For one-step inference sets $(X, \mathcal{A})$ and $(Y, \mathcal{B})$ with $A \in \mathcal{A}$ and $B \in \mathcal{B}$, by (49) we have
$$P^{(X, \mathcal{A})}_{\rho}(A) = \mathrm{tr}\big( X(A)\, \Lambda_{t_1, t_0}(\rho) \big) = \mathrm{tr}\big( \Lambda^{*}_{t_1, t_0}(X(A))\, \rho \big).$$
If $P^{(X, \mathcal{A})}_{\rho}(A) = P^{(Y, \mathcal{B})}_{\rho}(B)$ for all $\rho \in \mathcal{D}(\mathcal{H})$, then the corresponding effects coincide, since the density operators form an order-determining set of states; i.e., the axiom (P1) holds. The verification of the axioms (P2)–(P4) follows directly from (49); we omit the details. □
Remark 2.
Note that the probability family $\{P^{\mathcal{A}}_{\rho}\}$ is determined by the time-evolution operators $\Lambda_{t_k, t_{k-1}}$. Therefore, a family of discrete-time evolution operators $\{\Lambda_{t_k, t_{k-1}}\}_{k \ge 1}$ defines a physical model realizing a generative AI system, based on the mathematical formalism in Definition 3 for models of generative AI systems.
5. Large Language Models
In this section, we describe physical models for large language models based on a transformer architecture in the Fock space over the Hilbert space of tokens. Consider a large language model with the set $\mathcal{T} = \{w_1, \ldots, w_N\}$ of N tokens. A finite sequence of tokens $x_1, x_2, \ldots, x_n \in \mathcal{T}$ is called a text, simply denoted by $x = x_1 x_2 \cdots x_n$, where n is called the length of the text x.
Let $\mathcal{H} = \mathbb{C}^{N}$ be the Hilbert space with $\{|w_1\rangle, \ldots, |w_N\rangle\}$ being an orthonormal basis, and we identify the token $w_k$ with the basis vector $|w_k\rangle$ for $k \in [N]$. Let $\Gamma(\mathcal{H})$ be the Fock space over $\mathcal{H}$; that is,
$$\Gamma(\mathcal{H}) = \mathbb{C} \oplus \bigoplus_{n=1}^{\infty} \mathcal{H}^{\otimes n},$$
where $\mathcal{H}^{\otimes n}$ is the n-fold tensor product of $\mathcal{H}$. We refer to [16] for the details of Fock spaces. In what follows, for the sake of convenience, we work with the finite Fock space
$$\Gamma_M(\mathcal{H}) = \mathbb{C} \oplus \bigoplus_{n=1}^{M} \mathcal{H}^{\otimes n}$$
for a large integer M. Note that an operator $T \in \mathcal{B}(\mathcal{H}^{\otimes n})$ for $n \le M$ can be regarded as an operator on $\Gamma_M(\mathcal{H})$ satisfying that, for all $\xi = \xi_0 \oplus \xi_1 \oplus \cdots \oplus \xi_M$ with $\xi_m \in \mathcal{H}^{\otimes m}$,
$$T\xi = T\xi_n,$$
and in particular, if $\xi \in \mathcal{H}^{\otimes m}$ for $m \neq n$, then $T\xi = 0$. Given $n \le M$ and a sequence of operators $T_1, \ldots, T_n \in \mathcal{B}(\mathcal{H})$, the operator $T_1 \otimes \cdots \otimes T_n$ is defined by
$$(T_1 \otimes \cdots \otimes T_n)(\eta_1 \otimes \cdots \otimes \eta_n) = T_1\eta_1 \otimes \cdots \otimes T_n\eta_n$$
for every $\eta_1 \otimes \cdots \otimes \eta_n \in \mathcal{H}^{\otimes n}$. In particular, if $\xi \in \mathcal{H}^{\otimes m}$ with $m \neq n$, then
$$(T_1 \otimes \cdots \otimes T_n)\xi = 0;$$
i.e., $T_1 \otimes \cdots \otimes T_n$ coincides with the zero operator in $\mathcal{B}(\mathcal{H}^{\otimes m})$ for $m \neq n$.
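As a small concrete picture of the truncated Fock space over the token Hilbert space, the following NumPy sketch sends a text of length n to the basis vector $|x_1\rangle \otimes \cdots \otimes |x_n\rangle$ of the n-th sector, so that texts of different lengths are orthogonal; the two-token alphabet and the omission of the vacuum sector are simplifying assumptions.

```python
import numpy as np
from functools import reduce

N, M = 2, 3                                  # two tokens; texts of length at most 3
basis = np.eye(N)                            # |w_1>, |w_2> in C^N

def text_vector(token_ids):
    """Embed a text (x_1, ..., x_n) as |x_1> (x) ... (x) |x_n> in the n-fold tensor power."""
    return reduce(np.kron, (basis[i] for i in token_ids))

def fock_vector(token_ids):
    """Place the text vector in its length-n sector of the truncated Fock space."""
    sectors = [np.zeros(N ** n) for n in range(1, M + 1)]
    sectors[len(token_ids) - 1] = text_vector(token_ids)
    return np.concatenate(sectors)           # the vacuum sector C is omitted here

v, w = fock_vector([0, 1]), fock_vector([0, 1, 0])
print(v @ w)                                 # 0.0: texts of different length are orthogonal
print(fock_vector([0, 1]) @ fock_vector([0, 1]))   # 1.0: each text gives a unit vector
```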
Since large language models are based on a transformer architecture, it suffices to construct a physical model in the Fock space $\Gamma_M(\mathcal{H})$ for a transformer (24) given by a composition of L blocks, consisting of L self-attention maps $\mathrm{SelfAtt}_\ell$ and L feed-forward neural networks $\mathrm{FFN}_\ell$. Precisely, let us denote the input text to the first layer by $x = x_1 x_2 \cdots x_n$. As noted above, the ℓ-th block generates a next token y with a conditional probability
$$p_\ell(y \mid x_1, \ldots, x_{n+\ell-1}),$$
where this probability is determined by $\mathrm{SelfAtt}_\ell$ and $\mathrm{FFN}_\ell$ (together with the prediction head).
Then, a physical model for $T_L$ consists of an input state and a sequence of quantum operations $\Phi_1, \Phi_2, \ldots$ in the Fock space $\Gamma_M(\mathcal{H})$ defined above, where $n + L \le M$. We show how to construct this model step by step as follows.
To this end, we denote by $|x\rangle = |x_1\rangle \otimes \cdots \otimes |x_n\rangle \in \mathcal{H}^{\otimes n}$ the vector associated with the text $x = x_1 \cdots x_n$ and write $|x y\rangle = |x\rangle \otimes |y\rangle$ for a token y. At first, the input state is given as
$$\rho_0 = |x\rangle \langle x|.$$
Then there is a quantum operation $\Phi_1$ in $\Gamma_M(\mathcal{H})$ (see Proposition 3 below), depending only on the attention mechanism $\mathrm{SelfAtt}_1$ and $\mathrm{FFN}_1$, such that
$$\Phi_1(\rho_0) = \sum_{y \in \mathcal{T}} p_1(y \mid x_1, \ldots, x_n)\, |x y\rangle \langle x y|, \qquad (60)$$
where $|x y\rangle = |x\rangle \otimes |y\rangle$ and $p_1(y \mid x_1, \ldots, x_n)$ is the probability that the first block generates y as the next token. Define the observable $Y_1$ by
$$Y_1(\{y\}) = I_{\mathcal{H}^{\otimes n}} \otimes |y\rangle \langle y|$$
for every $y \in \mathcal{T}$.
Making a measurement of $Y_1$ at time $t_1$, we obtain an output $x x_{n+1}$ with probability $p_1(x_{n+1} \mid x_1, \ldots, x_n)$, and the appropriate density operator to use for any further calculation is
$$\rho_1 = |x x_{n+1}\rangle \langle x x_{n+1}|$$
for every $x_{n+1} \in \mathcal{T}$, where the measurement outcome $x_{n+1}$ is appended to the text.
Next, there is a quantum operation $\Phi_2$ in $\Gamma_M(\mathcal{H})$ (see Proposition 3 again), depending only on the attention mechanism $\mathrm{SelfAtt}_2$ and $\mathrm{FFN}_2$, at time $t_2$ such that
$$\Phi_2(\rho_1) = \sum_{z \in \mathcal{T}} p_2(z \mid x_1, \ldots, x_n, x_{n+1})\, |x x_{n+1} z\rangle \langle x x_{n+1} z| \qquad (65)$$
for every $x_{n+1} \in \mathcal{T}$, where $|x x_{n+1} z\rangle = |x\rangle \otimes |x_{n+1}\rangle \otimes |z\rangle$ (with $n + 2 \le M$) and $p_2(z \mid x_1, \ldots, x_{n+1})$ is the probability that the second block generates z as the next token. Define the observable $Y_2$ by
$$Y_2(\{z\}) = I_{\mathcal{H}^{\otimes (n+1)}} \otimes |z\rangle \langle z|$$
for every $z \in \mathcal{T}$.
Making a measurement of $Y_2$ at time $t_2$, we obtain an output $x x_{n+1} x_{n+2}$ with probability $p_2(x_{n+2} \mid x_1, \ldots, x_{n+1})$, and the appropriate density operator to use for any further calculation is
$$\rho_2 = |x x_{n+1} x_{n+2}\rangle \langle x x_{n+1} x_{n+2}|$$
for each $x_{n+2} \in \mathcal{T}$, where the measurement outcome $x_{n+2}$ is appended to the text.
Step by step, we can obtain a physical model with the input state $\rho_0 = |x\rangle\langle x|$ such that a text $x_{n+1} x_{n+2} \cdots x_{n+m}$ is generated with the probability
$$\prod_{k=1}^{m} p_k(x_{n+k} \mid x_1, \ldots, x_{n+k-1})$$
within the inference given by the quantum operations $\Phi_1, \ldots, \Phi_m$ and the measurements at times $t_1 < t_2 < \cdots < t_m$.
Thus, we can obtain a physical model for $T_L$ if we prove that the $\Phi_k$'s exist.
Proposition 3.
With the above notations, there exists a physical model in $\Gamma_M(\mathcal{H})$ for a transformer (24) such that, given an input text $x = x_1 \cdots x_n$, a text $x_{n+1} x_{n+2} \cdots x_{n+m}$ is generated with the probability
$$\prod_{k=1}^{m} p_k(x_{n+k} \mid x_1, \ldots, x_{n+k-1})$$
within the inference given by the quantum operations $\Phi_1, \ldots, \Phi_m$ and the measurements at times $t_1 < t_2 < \cdots < t_m$.
Proof.
We regard the texts $w = w_{i_1} \cdots w_{i_k}$ of length $k \le M$ as elements in $\Gamma_M(\mathcal{H})$ in a natural way, i.e.,
$$|w\rangle = |w_{i_1}\rangle \otimes \cdots \otimes |w_{i_k}\rangle \in \mathcal{H}^{\otimes k} \subset \Gamma_M(\mathcal{H})$$
for $k \le M$. We need to construct $\Phi_1$ to satisfy (60). We first define $\Phi$ on the projection $|x\rangle \langle x|$ by the right-hand side of (60), where texts of maximal length M are handled by appending a certain fixed token. Secondly, define $\Phi$ in the same way on the projections $|w\rangle \langle w|$ associated with the other texts w, and in general, for finite linear combinations, define
$$\Phi\Big( \sum_{k} c_k\, |w_k\rangle \langle w_k| \Big) = \sum_{k} c_k\, \Phi\big(|w_k\rangle \langle w_k|\big)$$
for any complex numbers $c_k$ and texts $w_k$. Let $\mathcal{C}$ denote the norm closure in $\mathcal{B}(\Gamma_M(\mathcal{H}))$ of the linear span of the projections $|w\rangle \langle w|$. Then $\Phi$ extends uniquely to a positive map from $\mathcal{C}$ into $\mathcal{B}(\Gamma_M(\mathcal{H}))$; that is, $\Phi$ is linear on $\mathcal{C}$ and maps positive elements of $\mathcal{C}$ to positive operators. Since $\mathcal{C}$ is a commutative $C^{*}$-algebra, by Stinespring’s theorem (cf. (Theorem 3.11, [20])), it follows that $\Phi$ is completely positive. Hence, by Arveson’s extension theorem (cf. (Theorem 7.5, [20])), $\Phi$ extends to a completely positive operator $\Phi_1$ on $\mathcal{B}(\Gamma_M(\mathcal{H}))$ (note that $\Phi_1$ is not necessarily unique), i.e., a quantum operation in $\Gamma_M(\mathcal{H})$. By the construction, $\Phi_1$ satisfies (60). Also, by Kraus’s theorem (cf. [19]), we conclude that $\Phi_1$ has the Kraus decomposition (39).
In the same way, we can prove that $\Phi_2$ exists and satisfies (65). Step by step, we can thus obtain a physical model as required. □
Remark 3.
Physical models satisfying the above joint probability distributions associated with a transformer $T_L$ are not necessarily unique. However, a physical model uniquely determines the joint probability distributions; that is, it defines a unique physical process for operating the large language model based on $T_L$. Therefore, in a physical model for $T_L$, training for $T_L$ corresponds to training the Kraus operators of the quantum operations $\Phi_\ell$, which are adjustable and learned during the training process and determine the physical model, just as the query, key, and value matrices and the FFN parameters do in $T_L$. From a physical perspective, to train a large language model is just to determine the Kraus operators associated with the corresponding physical system (cf. [22]).
Example 1.
Let $\mathcal{T} = \{w_1, w_2\}$ be the set of two tokens embedded in $\mathbb{R}^d$. Then $\mathcal{H} = \mathbb{C}^2$ with the standard basis $|w_1\rangle$ and $|w_2\rangle$. Suppose that, given the input text $w_i$ ($i = 1, 2$), the attention mechanism generates the next token $w_j$ with probability $p_{ij}$, where $p_{i1} + p_{i2} = 1$. Below, we construct a quantum operation Φ associated with these probabilities in $\Gamma_M(\mathcal{H})$. To this end, define
$$\Phi\big(|w_1\rangle \langle w_1|\big) = p_{11}\, |w_1 w_1\rangle \langle w_1 w_1| + p_{12}\, |w_1 w_2\rangle \langle w_1 w_2|$$
and
$$\Phi\big(|w_2\rangle \langle w_2|\big) = p_{21}\, |w_2 w_1\rangle \langle w_2 w_1| + p_{22}\, |w_2 w_2\rangle \langle w_2 w_2|.$$
We regard the texts $w_i$ and $w_i w_j$ ($i, j \in [2]$) as elements in $\Gamma_M(\mathcal{H})$ in a natural way. Let
$$\mathcal{C} = \mathrm{span}\big\{ |w_1\rangle \langle w_1|,\ |w_2\rangle \langle w_2| \big\}.$$
Then $\mathcal{C}$ is a subspace of $\mathcal{B}(\Gamma_M(\mathcal{H}))$, and Φ extends uniquely to a positive map from $\mathcal{C}$ into $\mathcal{B}(\Gamma_M(\mathcal{H}))$, i.e.,
$$\Phi\big( c_1\, |w_1\rangle \langle w_1| + c_2\, |w_2\rangle \langle w_2| \big) = c_1\, \Phi\big(|w_1\rangle \langle w_1|\big) + c_2\, \Phi\big(|w_2\rangle \langle w_2|\big)$$
for any $c_1, c_2 \in \mathbb{C}$. As shown in Proposition 3, Φ can be extended to a completely positive operator on $\mathcal{B}(\Gamma_M(\mathcal{H}))$, which is a quantum operation in $\Gamma_M(\mathcal{H})$ associated with the given probabilities. Note that such an extension is not necessarily unique.
Example 2.
As in Example 1, $\mathcal{T} = \{w_1, w_2\}$ is the set of two tokens embedded in $\mathbb{R}^d$, and $\mathcal{H} = \mathbb{C}^2$ with the standard basis $|w_1\rangle$ and $|w_2\rangle$. Assume an input text $x = w_1$. The input state is then given by
$$\rho_0 = |w_1\rangle \langle w_1|.$$
If the first block generates $w_1$ and $w_2$ with probabilities $p(w_1 \mid w_1)$ and $p(w_2 \mid w_1)$, respectively, an associated quantum operation $\Phi_1$ at time $t_1$ satisfies
$$\Phi_1(\rho_0) = p(w_1 \mid w_1)\, |w_1 w_1\rangle \langle w_1 w_1| + p(w_2 \mid w_1)\, |w_1 w_2\rangle \langle w_1 w_2|.$$
By measurement, we obtain $w_1 w_1$ with probability $p(w_1 \mid w_1)$ and obtain $w_1 w_2$ with probability $p(w_2 \mid w_1)$, while the corresponding post-measurement states are $|w_1 w_1\rangle \langle w_1 w_1|$ and $|w_1 w_2\rangle \langle w_1 w_2|$, respectively.
If the second block generates $w_1$ and $w_2$ with the conditional probabilities $p(w_j \mid w_1 w_i)$ for $i, j \in [2]$, an associated quantum operation $\Phi_2$ at time $t_2$ satisfies
$$\Phi_2\big(|w_1 w_1\rangle \langle w_1 w_1|\big) = p(w_1 \mid w_1 w_1)\, |w_1 w_1 w_1\rangle \langle w_1 w_1 w_1| + p(w_2 \mid w_1 w_1)\, |w_1 w_1 w_2\rangle \langle w_1 w_1 w_2|$$
and
$$\Phi_2\big(|w_1 w_2\rangle \langle w_1 w_2|\big) = p(w_1 \mid w_1 w_2)\, |w_1 w_2 w_1\rangle \langle w_1 w_2 w_1| + p(w_2 \mid w_1 w_2)\, |w_1 w_2 w_2\rangle \langle w_1 w_2 w_2|.$$
By measurement at time $t_2$: when $w_1 w_1$ occurs at $t_1$, we obtain $w_1 w_1 w_1$ with probability $p(w_1 \mid w_1 w_1)$ and obtain $w_1 w_1 w_2$ with probability $p(w_2 \mid w_1 w_1)$; when $w_1 w_2$ occurs at $t_1$, we obtain $w_1 w_2 w_1$ with probability $p(w_1 \mid w_1 w_2)$ and obtain $w_1 w_2 w_2$ with probability $p(w_2 \mid w_1 w_2)$.
Therefore, we obtain the joint probability distribution:
$$P(w_1 w_i w_j) = p(w_i \mid w_1)\, p(w_j \mid w_1 w_i), \quad i, j \in [2].$$
This can be illustrated as a branching tree: the input text $w_1$ branches into $w_1 w_1$ and $w_1 w_2$ at time $t_1$, and each of these branches into two texts of length three at time $t_2$, with the above probabilities attached to the branches.
6. Conclusions
Our primary innovative points are summarized as follows:
- Mathematical formalism for generative AI models. We present a mathematical formalism for generative AI models by using the theory of the histories approach to physical systems developed by Isham and Gudder [1,2].
- Physical models realizing generative AI systems. We give a construction of physical models realizing generative AI systems as open quantum systems by using the theories of $\sigma$-effect algebras and open quantum systems.
- Large language models realized as open quantum systems. We construct physical models realizing large language models based on the transformer architecture as open quantum systems in the Fock space over the Hilbert space of tokens. The Fock space structure plays a crucial role in this construction.
In conclusion, we present a mathematical formalism for generative AI and describe physical models realizing generative AI systems as open quantum systems. Our formalism shows that a transformer architecture used for generative AI systems is characterized by a family of sequential joint probability distributions. The physical models realizing generative AI systems are described by sequential event histories in open quantum systems. The Kraus operators in the physical models correspond to the query, key, and value matrices in the attention mechanism of a transformer, which are adjustable and learned during the training process. As an illustration, we construct physical models in the Fock space over the Hilbert space of tokens, realizing large language models based on a transformer architecture as open quantum systems. This means that our physical models underlie the transformer architecture for large language models. We refer to [23] for an argument on the physical principle of generic AI and to [15] for a mathematical foundation of general AI, including quantum AI.
Funding
This research received no external funding.
Data Availability Statement
The data presented in this study are available in this article.
Acknowledgments
The author thanks the anonymous referees for making helpful comments and suggestions, which have been incorporated into this version of the paper.
Conflicts of Interest
The author declares no conflicts of interest.
References
- Gudder, S. A histories approach to quantum mechanics. J. Math. Phys. 1998, 39, 5772–5788. [Google Scholar] [CrossRef]
- Isham, C.J. Quantum logic and the histories approach to quantum theory. J. Math. Phys. 1994, 35, 2157–2185. [Google Scholar] [CrossRef]
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
- Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large language models: A Survey. arXiv 2024, arXiv:2402.06196. [Google Scholar]
- Geshkovski, B.; Letrouit, C.; Polyanskiy, Y.; Rigollet, P. A mathematical perspective on transformers. Bull. Am. Math. Soc. 2025, in press.
- Vuckovic, J.; Baratin, A.; Combes, R.T. A mathematical theory of attention. arXiv 2020, arXiv:2007.02876. [Google Scholar]
- Breuer, H.P.; Petruccione, F. The Theory of Open Quantum Systems; Oxford University Press: Oxford, UK, 2002. [Google Scholar]
- Villas-Boas, C.J.; Máximo, C.E.; Paulino, P.J.; Bachelard, R.P.; Rempe, G. Bright and dark states of light: The quantum origin of classical interference. Phys. Rev. Lett. 2025, 134, 133603. [Google Scholar] [CrossRef] [PubMed]
- Dvurečenskij, A.; Pulmannová, S. New Trends in Quantum Structures; Springer: Berlin/Heidelberg, Germany, 2000. [Google Scholar]
- Petersen, P.; Zech, J. Mathematical Theory of Deep Learning. arXiv 2025, arXiv:2407.18384v3. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
- Foulis, D.J.; Bennett, M.K. Effect algebras and unsharp quantum logics. Found. Phys. 1994, 24, 1331–1352. [Google Scholar] [CrossRef]
- Chen, Z.; Ding, L.; Liu, H.; Yu, J. A topos-theoretic formalism of quantum artificial intelligence. Sci. Sin. Math. 2025, 55, 1–32. (In Chinese) [Google Scholar] [CrossRef]
- Reed, M.; Simon, B. Methods of Modern Mathematical Physics, Vol. I; Academic Press: San Diego, CA, USA, 1980. [Google Scholar]
- Reed, M.; Simon, B. Methods of Modern Mathematical Physics, Vol. II; Academic Press: Cambridge, UK, 1980. [Google Scholar]
- Rudin, W. Functional Analysis, 2nd ed.; The McGraw-Hill Companies, Inc.: New York, NY, USA, 1991. [Google Scholar]
- Nielsen, M.A.; Chuang, I.L. Quantum Computation and Quantum Information; Cambridge University Press: Cambridge, UK, 2001. [Google Scholar]
- Paulsen, V. Completely Bounded Maps and Operator Algebras; Cambridge University Press: Cambridge, UK, 2002. [Google Scholar]
- Zhang, Y.; Liu, Y.; Yuan, H.; Qin, Z.; Yuan, Y.; Gu, Q.; Yao, A.C. Tensor product attention is all you need. arXiv 2025, arXiv:2501.06425. [Google Scholar]
- Sharma, K.; Cerezo, M.; Cincio, L.; Coles, P.J. Trainability of dissipative perceptron-based quantum neural networks. Phys. Rev. Lett. 2022, 128, 180505. [Google Scholar] [CrossRef] [PubMed]
- Chen, Z. Turing’s thinking machine and ’t Hooft’s principle of superposition of states. ChinaXiv 1207. [Google Scholar] [CrossRef]