Hybrid Quantum Vision Transformers for Event Classification in High Energy Physics

Models based on vision transformer architectures are considered state-of-the-art when it comes to image classification tasks. However, they require extensive computational resources both for training and deployment. The problem is exacerbated as the amount and complexity of the data increases. Quantum-based vision transformer models could potentially alleviate this issue by reducing the training and operating time while maintaining the same predictive power. Although current quantum computers are not yet able to perform high-dimensional tasks, they do offer one of the most efficient solutions for the future. In this work, we construct several variations of a quantum hybrid vision transformer for a classification problem in high energy physics (distinguishing photons and electrons in the electromagnetic calorimeter). We test them against classical vision transformer architectures. Our findings indicate that the hybrid models can achieve comparable performance to their classical analogues with a similar number of parameters.


Introduction
The first transformer architecture was introduced in 2017 by Vaswani et al. in the famous paper "Attention Is All You Need" [1]. The new model was shown to outperform the existing state-of-the-art models by a significant margin on the English-to-German and English-to-French newstest2014 tests. Since then, the transformer architecture has been implemented in numerous fields and has become the go-to model for many different applications such as sentiment analysis [2] and question answering [3].
The vision transformer architecture can be considered as the implementation of the transformer architecture for image classification. It utilizes the encoder part of the transformer architecture and attaches a multi-layer perceptron (MLP) layer to classify images. This architecture was first introduced by Dosovitskiy et al. in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" [4]. It was shown that on a multitude of datasets, a vision transformer model is capable of outperforming the state-of-the-art model ResNet152x4 while using less computation time to pre-train. Similar to their language counterparts, vision transformers became the state-of-the-art models for a multitude of computer vision problems such as image classification [5] and semantic segmentation [6].
However, these advantages come at a cost. Transformer architectures are known to be computationally expensive to train and operate [7]. Specifically, their demands on computation power and memory increase quadratically with the input length. A number of studies have attempted to approximate self-attention in order to decrease the associated quadratic complexity in memory and computation power [8][9][10][11]. There are also proposed modifications to the architecture which aim to alleviate the quadratic complexity [12][13][14]. A recent review of the different methods for reducing the complexity of transformers can be found in [15]. As the amount of data grows, these problems are exacerbated. In the future, it will be necessary to find a substitute architecture that has similar performance but demands fewer resources.
A quantum machine learning model might be one of those substitutes. Although the hardware for quantum computation is still in its infancy, there is a high volume of research that is focused on the algorithms that can be used on this hardware. The main appeal of quantum algorithms is that they are already known to have computational advantages over classical algorithms for a variety of problems. For instance, Shor's algorithm can factorize numbers significantly faster than the best classical methods [16]. Furthermore, there are studies suggesting that quantum machine learning can lead to computational speedups [17,18].
In this work, we develop a quantum-classical hybrid vision transformer architecture. We demonstrate our architecture on a problem from experimental high energy physics, which is an ideal testing ground because experimental collider physics data are known to have a significant amount of complexity and computational resources represent a major bottleneck [19][20][21]. Specifically, we use our model to classify the parent particle in an electromagnetic shower event inside the CMS detector. In addition, we test the performance of our hybrid architecture by benchmarking it against a classical vision transformer of equivalent architecture.
This paper is structured as follows. In Section 2, we introduce and describe the dataset. The model architectures for both the classical and hybrid models are discussed in Section 3. The model parameters and the training are specified in Sections 4 and 5, respectively. Finally, in Section 6 we present our results and discuss their implications in Section 7. We consider the future directions for study in Section 8.

Dataset and Preprocessing Description
The Compact Muon Solenoid (CMS) is one of the four main experiments at the Large Hadron Collider (LHC), which has been in operation since 2009 at CERN. The CMS detector [22] has been recording the products from collisions between beams consisting of protons or Pb ions, at several different center-of-mass energy milestones, up to the current 13.6 TeV [23]. Among the various available CMS datasets [24], we have chosen to study data from proton-proton collisions at 13.6 TeV. Among the basic types of objects reconstructed from those collisions are photons and electrons, which leave rather similar signatures in the CMS electromagnetic calorimeter (ECAL) (see, e.g., Ref. [25] and references therein). A common task in high-energy physics is to classify the resulting electromagnetic shower in the ECAL as a photon (γ) or electron ($e^-$). In practice, one also uses information from the CMS tracking system [26] and leverages the fact that an electron leaves a track, while a photon does not. However, for the purposes of our study, we shall limit ourselves to the ECAL only.
The dataset used in our study contains the reconstructed hits of 498,000 simulated electromagnetic shower events in the ECAL sub-detector of the CMS experiment (photon conversions were not simulated) [27]. Half of the events originate from photons, while the remaining half are initiated by electrons. In each case, an event is generated with exactly one particle (γ or $e^-$) which is fired from the interaction point with fixed transverse momentum magnitude $|\vec{p}_T| = 50$ GeV, see Figure 1. The direction of the momentum $\vec{p}$ is sampled uniformly in azimuthal angle −π ≤ φ ≤ π and pseudorapidity −1.4 ≤ η ≤ 1.4, where the latter is defined in terms of the polar angle θ as η = −ln tan(θ/2).
For each event, the dataset includes two image grids, representing energy and timing information, respectively. The first grid gives the peak energy detected by the crystals of the detector in a 32 × 32 grid centered around the crystal with the maximum energy deposit. The second image grid gives the arrival time when the peak energy was measured in the associated crystal (in our work, we shall only use the first image grid with the energy information). Each pixel in an image grid corresponds to exactly one ECAL crystal, though not necessarily the same crystal from one event to another. The images were then scaled so that the maximum entry for each event was set to 1.
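The two numerical conventions used in the dataset description, the pseudorapidity definition and the per-event scaling of the energy image, can be illustrated with a minimal Python sketch. The function names are hypothetical and are not taken from the dataset or from the authors' code; the handling of an all-zero image is an assumption made only so that the example is well defined.

```python
import math


def pseudorapidity(theta: float) -> float:
    """Return eta = -ln(tan(theta / 2)) for a polar angle theta in (0, pi)."""
    return -math.log(math.tan(theta / 2.0))


def scale_to_unit_max(image):
    """Divide every pixel by the event's maximum entry so that the peak equals 1."""
    peak = max(max(row) for row in image)
    if peak <= 0:
        # Assumption: an empty (all-zero) image is returned unchanged.
        return [list(row) for row in image]
    return [[pixel / peak for pixel in row] for row in image]


if __name__ == "__main__":
    # A particle emitted perpendicular to the beam axis has eta close to 0.
    print(pseudorapidity(math.pi / 2))
    # Toy 2 x 2 "energy image"; the real grids are 32 x 32.
    print(scale_to_unit_max([[0.0, 2.0], [4.0, 1.0]]))
```

With this definition, the sampled range −1.4 ≤ η ≤ 1.4 corresponds to polar angles between roughly 28° and 152°.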

Several representative examples of our image data are shown in Figure 2. The first row shows the image grids for the energy (normalized and displayed in $\log_{10}$ scale), while the second row displays the timing information (not used in our study). In each case, the top row in the title lists the label predicted by one of the benchmark classical models, while the bottom row shows the correct label for that instance, i.e., whether the image was generated by an actual electron or photon.
As can be gleaned from Figure 2 with the naked eye, electron-photon discrimination is a challenging task: for example, the first and third images in Figure 2 are wrongly classified. To a first approximation, the $e^-$ and γ shower profiles are identical, and mostly concentrated in a 3 × 3 grid of crystals around the main deposit. However, interactions with the magnetic field of the CMS solenoid (B = 3.8 T) cause electrons to emit bremsstrahlung radiation, preferentially in φ. This introduces a higher-order perturbation on the shower shape, causing the electromagnetic shower profiles [29] to be more spread out and slightly asymmetric in φ.

Model Architectures
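The following definitions will be used for the rest of the paper and are listed here for convenience.
• $n_t$: number of patches (tokens) per image.
• $d_i$: length of a flattened patch.
• $d_t$: token dimension.
• $n_h$: number of attention heads.
• $d_h \equiv d_t/n_h$: number of columns processed by each attention head.
• $d_{ff}$: hidden layer size of the multilayer perceptron in each encoder layer.
• $n_l$: number of encoder layers.
Both the benchmark and hybrid models utilize the same architectures except for the type of encoder layers. These architectures are shown in Figure 3. As can be seen in the figure, there are two main variants of the architecture: (a) the column-pooling variant and (b) the class token variant.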
As the encoder layer is the main component of both the classical and the hybrid models, the two versions are discussed in more detail in Sections 3.2 and 3.3, respectively. The rest of the architecture is discussed here.
First, we divide the input image into $n_t$ patches of equal area, which are then flattened to obtain $n_t$ vectors of length $d_i$. The resulting vectors are then stacked to obtain an $n_t \times d_i$ matrix for each image sample. This matrix is passed through a linear layer with a bias (called "Linear Embedding" in the figure) to change the number of columns from $d_i$ to the desired token dimension $d_t$.
If the model is a class token variant, a trainable vector of length $d_t$ is concatenated as the first row of the matrix at hand (module "Concat" in Figure 3b). After that, a non-trainable vector (the positional embedding vector) is added to each row. Then the result is fed to a series of encoder layers, where each subsequent encoder layer uses its predecessor's output as its input.
If the model is a class token variant, the first row of the output matrix of the final encoder layer is fed into the classifying layer to obtain the classification probabilities ("Extract Class Token" layer in Figure 3b). Otherwise, a column-pooling method (taking either the mean or the maximum of each column over all rows) is used to reduce the output matrix to a vector, which is then fed into the classifying layer to obtain the classification probabilities ("Column-wise Pooling" layer in Figure 3a).
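The data flow around the encoder layers (patching, linear embedding, and the two read-out options) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the image dimensions are assumed to be divisible by the patch dimensions, and the weights are passed in as plain nested lists rather than trained parameters.

```python
def image_to_patches(image, patch_h, patch_w):
    """Split a 2D image into non-overlapping patches and flatten each one.

    Assumes the image height and width are multiples of patch_h and patch_w.
    Returns n_t rows, each of length d_i = patch_h * patch_w.
    """
    rows, cols = len(image), len(image[0])
    patches = []
    for r0 in range(0, rows, patch_h):
        for c0 in range(0, cols, patch_w):
            patches.append([image[r][c]
                            for r in range(r0, r0 + patch_h)
                            for c in range(c0, c0 + patch_w)])
    return patches


def linear_embedding(patches, weight, bias):
    """Map each flattened patch (length d_i) to a token (length d_t).

    weight is a d_i x d_t nested list and bias has length d_t.
    """
    d_t = len(bias)
    return [[sum(p[k] * weight[k][j] for k in range(len(p))) + bias[j]
             for j in range(d_t)]
            for p in patches]


def column_pool(tokens, mode="max"):
    """Reduce an n_t x d_t matrix to a length-d_t vector, column by column."""
    columns = list(zip(*tokens))
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError("mode must be 'max' or 'mean'")


def extract_class_token(tokens):
    """Class token variant: read out the first row of the encoder output."""
    return list(tokens[0])
```

In the column-pooling variant the classifier sees `column_pool(tokens, ...)`, while in the class token variant it sees `extract_class_token(tokens)` after a trainable row has been prepended before the encoder layers.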

The Classical Encoder Layer
The structure of the classical encoder layer can be seen in Figure 4a. First, we standardize the input data to have zero mean and a standard deviation of one. Afterward, the normalized data are fed to the multi-head attention (discussed in the next paragraph) and the output is summed with the unnormalized data. Then, the modified output is again normalized to have zero mean and a standard deviation of one. These normalized modified data are then fed into a multilayer perceptron of two layers with hidden layer size $d_{ff}$, and the result is summed with the modified data to obtain the final result.
The multi-head attention works by splitting the $n_t \times d_t$ input matrix column-wise into $n_h$ matrices of size $n_t \times d_h$. Afterward, the split matrices are fed to the attention heads described in Equations (1) and (2). Finally, the outputs of the attention heads are concatenated to obtain an $n_t \times d_t$ matrix, which has the same size as our input matrix. Each attention head is defined as
$$\text{Attention Head}\,(x_i; W_K^{(i)}, W_Q^{(i)}, W_V^{(i)}) = \text{SoftMax}\!\left(\frac{(x_i W_Q^{(i)})\,(x_i W_K^{(i)})^T}{\sqrt{d_h}}\right)(x_i W_V^{(i)}), \qquad (1)$$
where
$$x_i \in \mathbb{R}^{n_t \times d_h}, \qquad W_K^{(i)},\, W_Q^{(i)},\, W_V^{(i)} \in \mathbb{R}^{d_h \times d_h}, \qquad (2)$$
and $x_i$ is the input matrix of the $i$th attention head.
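As an illustration of the column split and of a single attention head, the following Python sketch assumes the standard scaled dot-product form of self-attention with bias-free $d_h \times d_h$ weight matrices (consistent with the $3d_h^2$ parameters per head quoted in the hyper-parameter section). The helper names are hypothetical, and the weights are supplied as nested lists instead of trained parameters.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row of m."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product attention for one head; x has shape n_t x d_h."""
    d_h = len(x[0])
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_h) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def multi_head_attention(x, heads):
    """Split x (n_t x d_t) column-wise, run each head, concatenate the results.

    heads is a list of (w_q, w_k, w_v) tuples, one per head, so n_h = len(heads).
    """
    n_h = len(heads)
    d_h = len(x[0]) // n_h
    outputs = []
    for h, (w_q, w_k, w_v) in enumerate(heads):
        x_h = [row[h * d_h:(h + 1) * d_h] for row in x]
        outputs.append(attention_head(x_h, w_q, w_k, w_v))
    # Concatenate the head outputs row by row back to shape n_t x d_t.
    return [sum((out[t] for out in outputs), []) for t in range(len(x))]
```

A quick sanity check of the sketch: with zero query and key weights every score vanishes, the softmax becomes uniform, and each head returns the column means of its value matrix.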

Hybrid Encoder Layer
The structure of the hybrid encoder layer can be seen in Figure 5a. First, we standardize the input data to have zero mean and a standard deviation of one. Afterward, the normalized data are fed to the hybrid multi-head attention layer (discussed in the next paragraph). Then, the output is fed into a multilayer perceptron of two layers with hidden layer size $d_{ff}$, and the result is summed with the unnormalized data to obtain the final result.
The hybrid multi-head attention works by splitting the $n_t \times d_t$ input matrix column-wise into $n_h$ matrices of size $n_t \times d_h$. Afterward, the split matrices are fed to the hybrid attention heads (which are described in the bulleted procedure below). Finally, the outputs of the attention heads are concatenated to obtain an $n_t \times d_t$ matrix, which has the same size as our input matrix.
The hybrid attention heads we used are almost identical to the architecture implemented in "Quantum Self-Attention Neural Networks for Text Classification" by Li et al. [31]. In order to replace the self-attention mechanism of a classical vision transformer in Equation (1), we use the following procedure:
• Define $x_i$ as the $i$th row of the input matrix $X$.
• Define the data loader operator $\hat{U}(x_i)$ in terms of the Hadamard gate $\hat{H}$ and the parameterized rotation $\hat{R}_x$ around the x-axis.
• Apply the key circuit (data loader $\hat{U}(x_i)$ + key operator $\hat{K}(\theta_K)$) for each $x_i$ and obtain the column vector $K$ from spin measurements in the z direction (see Figure 6), with $\hat{Z}_i$ denoting the measurement of the $i$th qubit.
• Apply the query circuit (data loader $\hat{U}(x_i)$ + query operator $\hat{Q}(\theta_Q)$) for each $x_i$ and obtain the column vector $Q$ (see Figure 6).
• Obtain the so-called attention matrix from the key and the query vectors.
• Apply the value circuit (data loader $\hat{U}(x_i)$ + value operator $\hat{V}(\theta_V)$) to each row of the image and measure each qubit separately to obtain the value matrix (see Figure 7).
• Define the self-attention operation of the hybrid attention head as the SoftMax of the attention matrix applied to the value matrix.
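The classical post-processing in the last steps of the procedure can be sketched in Python as follows. This is a sketch under explicit assumptions, not the authors' implementation: the quantum circuits are not simulated, so the query vector, key vector, and value matrix are passed in as plain numbers standing in for measured expectation values; the attention matrix is assumed to take the Gaussian-projected form $A_{ij} = -(Q_i - K_j)^2$ used by Li et al. [31]; and no additional scaling is applied inside the SoftMax. All function names are hypothetical.

```python
import math


def gaussian_attention_matrix(q, k):
    """ASSUMPTION: A[i][j] = -(q[i] - k[j])**2, the Gaussian-projected form of [31]."""
    return [[-(qi - kj) ** 2 for kj in k] for qi in q]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row of m."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def hybrid_attention_head(q, k, v):
    """Combine measured query/key vectors (length n_t) with a value matrix (n_t x d_h).

    q, k, and v stand in for circuit expectation values; no circuit is run here.
    """
    weights = softmax_rows(gaussian_attention_matrix(q, k))
    d_h = len(v[0])
    return [[sum(w * v[j][c] for j, w in enumerate(row)) for c in range(d_h)]
            for row in weights]
```

Because each query and key is a single expectation value per token, the attention matrix is built from two length-$n_t$ vectors rather than from two $n_t \times d_h$ matrices as in the classical head.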
Figure 6. Key and query circuit for the $d_h = 8$ case. The first two rows of gates load the data into the circuit (the $\hat{U}(x_i)$ operator), while the rest are the parts of the trainable ansatz. Therefore, the total number of parameters for each circuit is equal to $3d_h + 1$.

Figure 7. The value circuit used for the $d_h = 8$ case. The first two rows of gates load the data into the circuit (the $\hat{U}(x_i)$ operator), while the rest are the parts of the trainable ansatz. Therefore, the total number of trainable parameters for each circuit is equal to $3d_h$.

Hyper-Parameters
The number of parameters is a function of the hyper-parameters for both the classical and the hybrid models. However, these functions are different. Both models share the same linear embedding and classifying layer. The linear embedding layer contains $(d_i + 1)d_t$ parameters and the classifying layer contains $32d_t + 65$ parameters.
For each classical encoder layer, we have $n_h$ attention heads, each of which contains $3d_h^2$ parameters from the Q, K, and V layers. In addition, the MLP layer inside each encoder layer contains $2d_{ff}d_t + d_{ff} + d_t$ parameters. Overall, using $n_h d_h = d_t$, each classical vision transformer has $d_t\,(33 + d_i + n_l(3d_h + 2d_{ff} + 1)) + n_l d_{ff} + 65$ parameters, except for the class token variation, which has $d_t$ extra parameters.
For each hybrid encoder layer, we have $n_h$ attention heads, each of which contains $9d_h + 2$ parameters from the Q, K, and V circuits. Similar to the classical model, each encoder layer MLP contains $2d_{ff}d_t + d_{ff} + d_t$ parameters. Overall, each hybrid vision transformer has $d_t\,(33 + d_i + n_l(2d_{ff} + 10)) + n_l(d_{ff} + 2n_h) + 65$ parameters, except for the class token variation, which has $d_t$ extra parameters.
Therefore, assuming they have the same hyper-parameters, the difference between the number of parameters for the classical and hybrid models is $n_l\,(d_t(3d_h - 9) - 2n_h)$.
Our purpose was to investigate whether our architecture might perform similarly to a classical vision transformer when the numbers of parameters are close to each other. In order to use a similar number of parameters, we picked a region of hyper-parameters such that this difference is rather minimal. The same hyper-parameter values were used for all models. Therefore, for our experiment the number of parameters for the classical models (4785 to 4801) is slightly larger than that for the quantum models (4585 to 4601).
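The parameter counts above can be checked with a short Python sketch that adds up the per-component counts quoted in this section. The function names are hypothetical. The hyper-parameter values in the usage example ($d_i = 32$, $d_t = 16$, $n_h = 4$, $d_{ff} = 16$, $n_l = 5$) are an assumption: they are not stated in the text and were chosen here only because they reproduce the quoted totals.

```python
def shared_params(d_i, d_t):
    """Linear embedding, (d_i + 1) * d_t, plus classifying layer, 32 * d_t + 65."""
    return (d_i + 1) * d_t + (32 * d_t + 65)


def mlp_params(d_t, d_ff):
    """Two-layer MLP inside each encoder layer."""
    return 2 * d_ff * d_t + d_ff + d_t


def classical_params(d_i, d_t, n_h, d_ff, n_l, class_token=False):
    """Total for the classical model: 3 * d_h**2 parameters per attention head."""
    d_h = d_t // n_h
    per_layer = n_h * 3 * d_h ** 2 + mlp_params(d_t, d_ff)
    total = shared_params(d_i, d_t) + n_l * per_layer
    return total + (d_t if class_token else 0)


def hybrid_params(d_i, d_t, n_h, d_ff, n_l, class_token=False):
    """Total for the hybrid model: 9 * d_h + 2 parameters per attention head."""
    d_h = d_t // n_h
    per_layer = n_h * (9 * d_h + 2) + mlp_params(d_t, d_ff)
    total = shared_params(d_i, d_t) + n_l * per_layer
    return total + (d_t if class_token else 0)


if __name__ == "__main__":
    # ASSUMED illustrative values, not taken from the text.
    hp = dict(d_i=32, d_t=16, n_h=4, d_ff=16, n_l=5)
    print(classical_params(**hp), classical_params(**hp, class_token=True))
    print(hybrid_params(**hp), hybrid_params(**hp, class_token=True))
```

For these assumed values the difference between the two totals equals $n_l\,(d_t(3d_h - 9) - 2n_h) = 5\,(16 \cdot 3 - 8) = 200$, in agreement with the closed-form expression.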

Training Process
All the classical parts of the models were implemented in PyTorch [32]. The quantum circuit simulations were conducted using TensorCircuit with the JAX backend [33,34]. We explored a few different hyperparameter settings before settling on the following. Each model was trained for 40 epochs, which was typically sufficient to ensure convergence (see Figures 8 and 9). The criterion for the selection of the best model iteration was the accuracy on the validation data. The optimizer used was the Adam optimizer with learning rate $\lambda = 5 \times 10^{-3}$ [35]. All models were trained on GPUs, and the typical training times were on the order of 10 min (5 h) for the classical (quantum) models. The batch size was 512 for all models. The loss function utilized was the binary cross entropy. The code used to create and train the models can be found at the following GitHub repository: https://github.com/EyupBunlu/QViT_HEP_ML4Sci (accessed on 7 March 2024).
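Two ingredients of the training setup, the binary cross-entropy loss and the selection of the best model iteration by validation accuracy, can be illustrated with the following dependency-free Python sketch. It is not the authors' training code (which is available in the repository cited above); the function names, the clipping constant `eps`, and the tie-breaking rule are assumptions made for the example.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels.

    Probabilities are clipped to [eps, 1 - eps] (an assumption) to avoid log(0).
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def best_epoch(val_accuracies):
    """Index of the epoch with the highest validation accuracy (first one on ties)."""
    return max(range(len(val_accuracies)), key=lambda e: val_accuracies[e])
```

A classifier that outputs 0.5 for every event has a loss of ln 2 ≈ 0.693, which is a useful reference value when reading the loss curves.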

Results
The training loss and the accuracy of the validation and training data are plotted in Figures 8 and 9, respectively.In addition, the models were compared on several metrics such as the accuracy, binary cross-entropy loss, and AUC (area under the ROC curve) on the test data.This comparison is shown in Table 1.
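For readers less familiar with these metrics, the accuracy and the AUC can be computed from the classifier scores as in the following Python sketch. The function names and the 0.5 decision threshold are assumptions made for illustration; the AUC is evaluated through its pairwise-ranking interpretation, i.e., the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count one half).

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded score matches the 0/1 label."""
    correct = sum(int(s >= threshold) == y for s, y in zip(scores, labels))
    return correct / len(labels)


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form scales quadratically with the number of events, so for a large test set a sort-based implementation from a standard library would be preferable; the sketch is meant only to make the definition concrete.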

Discussion
As seen in Table 1, the positional encoding has no significant effect on the performance metrics. In retrospect, this is not that surprising, since the position information is already used in the linear embedding in Figure 3 (we thank the anonymous referee for this clarification). We note that the CMX variant (either with or without positional encoding) performs similarly to the corresponding classical model. This suggests that a quantum advantage could be achieved when extrapolating to higher-dimensional problems and datasets, since the quantum models scale better with dimensionality.
On the other hand, Table 1 shows that hybrid CMN variants are inferior to their hybrid CMX counterparts for all metrics.This might be due to the fact that taking the mean forces each element of the output matrix of the final encoder layer to be relevant, unlike the CMX variant, where the maximum values are chosen.This could explain the larger number of epochs required to converge in the case of the hybrid CMN (see Figures 8 and 9).It is also possible that the hybrid model lacks the expressiveness required to encode enough meaningful information to the column means.
Somewhat surprisingly, the training plots (upper left panels in Figures 8 and 9) show that the hybrid class token variants did not converge during our numerical experiments. The reason behind this behavior is currently unknown and is being investigated.

Outlook
Quantum machine learning is a relatively new field.In this work, we explored a few of the many possible ways that it could be used to perform different computational tasks as an alternative to classical machine learning techniques.As the current hardware for quantum computers improves further, it is important to explore more ways in which this hardware could be utilized.
Our study raises several questions that warrant future investigations. First, we observe that the hybrid CMX models perform similarly to the classical vision transformer models that we used for benchmarking. It is fair to ask if this similarity is due to the comparable number of trainable parameters or the result of an identical choice of hyper-parameter values. If it is the latter, we can extrapolate and conclude that as the size of the data grows, hybrid models will still perform as well as the classical models while having significantly fewer parameters.
It is fair to say that both the classical and hybrid models perform similarly at this scale.However, the hybrid model discussed in this work is mostly classical, except for the attention heads.The next step in our research is to investigate the effect of increasing the fraction of quantum elements of the model.For instance, the conversion of feedforward layers into quantum circuits such as the value circuit might lead to an even bigger advantage in the number of trainable parameters between the classical and hybrid models.
Although the observed limitations in the class token and column mean variants might appear disappointing at first glance, they are also important findings of this work.It is worth investigating whether this is due to the nature of the dataset or a sign of a fundamental limitation in the method.

Figure 1. The CMS coordinate system against the backdrop of the LHC, with the location of the four main experiments (CMS, ALICE, ATLAS and LHCb). The z-axis points to the Jura mountains, while the y-axis points toward the sky. In spherical coordinates, the components of a particle momentum $\vec{p}$ are its magnitude $|\vec{p}|$, the polar angle θ (measured from the z-axis), and the azimuthal angle φ (measured from the x-axis). The transverse momentum $\vec{p}_T$ is the projection of $\vec{p}$ on the transverse (xy) plane. Figure generated with TikZ code adapted from Ref. [28].

Figure 2. Four representative image grid examples from the dataset, in the (φ, η) plane. The first row shows the image grids for the energy (normalized and displayed in $\log_{10}$ scale), while the second row displays the timing information. The titles list the correct labels (real electron or real photon), as well as the corresponding labels predicted by one of the benchmark classical models (see text for more details).

Figure 3. The architecture for the (a) column-wise pooling and (b) class-token models. For clarity, we use an MNIST image [30] to demonstrate the process. The hybrid and the classical model differ by the architecture of their encoder layers (see Figures 4 and 5).

Figure 4. The classical encoder layer (a) and multi-head attention (b) architecture for the benchmark models.

Figure 5. The hybrid encoder layer (a) and multi-head attention (b) architecture for the hybrid models.

Figure 8. BCE loss on the validation and training set during training for the (a) quantum and (b) classical models. From left to right, each column corresponds to a different model variant: class token (left column), column max (middle column) and column mean variant (right column). For each plot, the blue (orange) line corresponds to the validation (training) set loss for the model with positional encoding, whereas the dashed green (red) line corresponds to the validation (training) set loss for the model without positional encoding layer.

Table 1. Comparison table for the models. The accuracy, the BCE loss and the AUC score were calculated on the test data. For each entry, the first number corresponds to the classical model, whereas the second one corresponds to the hybrid model. For each variant and metric, the best value is shown in bold.