Spatio-Temporal Anomaly Detection with Graph Networks for Data Quality Monitoring of the Hadron Calorimeter

The Compact Muon Solenoid (CMS) experiment is a general-purpose detector for high-energy collisions at the Large Hadron Collider (LHC) at CERN. It employs an online data quality monitoring (DQM) system to promptly spot and diagnose particle data acquisition problems and avoid data quality loss. In this study, we present a semi-supervised spatio-temporal anomaly detection (AD) monitoring system for the physics particle reading channels of the Hadron Calorimeter (HCAL) of the CMS using three-dimensional digi-occupancy map data of the DQM. We propose the GraphSTAD system, which employs convolutional and graph neural networks to learn, respectively, the local spatial characteristics induced by particles traversing the detector and the global behavior owing to shared backend circuit connections and housing boxes of the channels. Recurrent neural networks capture the temporal evolution of the extracted spatial features. We validate the accuracy of the proposed AD system in capturing diverse channel fault types using LHC collision data sets. The GraphSTAD system achieves production-level accuracy and is being integrated into the CMS core production system for real-time monitoring of the HCAL. We provide a quantitative performance comparison with alternative benchmark models to demonstrate the advantage of the presented system.

The Large Hadron Collider (LHC) is the largest particle collider ever built. It is designed to conduct experiments in physics and increase our understanding of the universe, with the expectation that new findings will lead to practical applications. The LHC was started in 2008 and consists of a 27 km ring tunnel located 100 meters underground at the France-Switzerland border near Geneva Evans and Bryant [2008]. The LHC is a two-ring superconducting hadron accelerator and collider capable of accelerating and colliding beams of protons and heavy ions with the unprecedented luminosities of $10^{34}$ cm$^{-2}$ s$^{-1}$ and $10^{27}$ cm$^{-2}$ s$^{-1}$, respectively, at a velocity close to the speed of light ($3 \times 10^{8}$ m s$^{-1}$) Evans and Bryant [2008], Heuer [2012]. The LHC hosts several experiments on its sites, and its ring holds several detectors for these experiments. The four major detectors of the LHC are A Toroidal LHC Apparatus (ATLAS), the Compact Muon Solenoid (CMS), the Large Hadron Collider beauty (LHCb), and A Large Ion Collider Experiment (ALICE). Each detector studies particle collisions from a different perspective with different technologies. The ATLAS at point 1 (P1) and the CMS at point 5 (P5) are the two high-luminosity general-purpose detectors at the LHC, and they are located in diametrically opposite sections.
The CMS experiment employs the data quality monitoring (DQM) system to guarantee high-quality physics data through online monitoring that provides live feedback during data acquisition, and offline monitoring that certifies the data quality after offline processing Azzolini et al. [2019]. The online DQM identifies emerging problems using reference distributions and predefined tests that detect known failure modes in summary histograms, such as the digi-occupancy map of the calorimeters Tuura et al. [2010], De Guio and Collaboration [2014]. A digi-occupancy map contains the histogram record of particle hits of the data-taking channels of the calorimeters. The CMS calorimeters could have several flaws, such as issues with the frontend particle sensing scintillators, digitization and communication systems, backend hardware, and algorithms, which are usually reflected in the digi-occupancy map. The growing complexity of the detector and the variety of physics experimentation make data-driven anomaly detection (AD) systems essential tools for the CMS to identify and localize detector anomalies automatically. Recent efforts in the CMS have proposed deep learning (DL) for AD applications in the DQM Azzolin et al. [2019], Azzolini et al. [2019], Pol et al. [2019a,b]. The CMS detector consists of a tracker to reconstruct particle paths accurately; two calorimeters, the electromagnetic (ECAL) and the hadronic (HCAL), to detect electrons and photons, and hadrons, respectively; and several muon detectors. These AD efforts have thus far achieved promising results on spatial 2D histogram maps of the DQM for the ECAL Azzolin et al. [2019] and the muon detectors Pol et al. [2019b].
Previous studies only considered extreme anomalies, such as dead (no reading) and hot (high noise) particle sensing channels. Detecting degrading channels, which is essential for quality deterioration monitoring and early intervention, is often overlooked. For instance, improperly tuned bias voltage on the HCAL physics particle sensing channels caused non-uniformity in the hit map of the DQM, but the channels were neither dead nor hot Viazlo and Collaboration [2022]. Calorimeter channels may degrade with subtle abnormality before reaching extreme channel fault status. Capturing such subtle anomalies (e.g., slow system degradation) makes temporal AD models appealing for early anomaly prediction before ultimate system failure. Time-aware models extract temporal context to enhance AD performance. Few efforts have thus far focused on temporal models despite their acknowledged potential for future automation challenges at the LHC Azzolin et al. [2019], Wielgosz et al. [2018]. Our study focuses on DQM automation through time-aware AD modeling using digi-occupancy histogram maps of the HCAL. The digi-occupancy data of the HCAL is 3D and poses multidimensional challenges due to its depth-wise calorimeter segmentation; it remains relatively unexplored by ML efforts. The particle hit map data of the HCAL depend strongly on the collision luminosity (a measure of how many collisions are happening in a particle accelerator) and the number of particles traversing the calorimeter. Efforts on data normalization that enhance the generalization of machine learning models are still limited.
We address the above gaps while investigating the performance enhancement of temporal AD DL models for the HCAL DQM. We propose to detect anomalies of the HCAL particle sensing channels through a semi-supervised AD system, GraphSTAD, from spatial digi-occupancy maps of the DQM. Anomalies can be unpredictable and come in different patterns of severity, shape, and size, which often limits the availability of labeled anomaly data covering all possible faults. We therefore employ a semi-supervised approach for the AD system: an autoencoder (AE) trained to reconstruct healthy digi-occupancy maps would adequately reconstruct the healthy maps, whereas it yields a high reconstruction error for maps with anomalies.
Since abnormal events can have spatial appearance and temporal context, we combine both the spatial and temporal features (spatio-temporal, ST) for AD Xu et al. [2017], Chang et al. [2022], Luo et al. [2019], Hasan et al. [2016], Wu et al. [2020], Hsu [2017], Ullah et al. [2021], Hu et al. [2019], Banković et al. [2012], Zhang et al. [2020]. Moreover, modeling the digi-occupancy map of the HCAL is challenging because its spatial nature may exhibit irregularity: although channels that are adjacent in Euclidean distance are exposed to collision particle hits around their region, the channels may belong to different backend circuits, resulting in non-Euclidean spatial behavior in the digi measurements.
The GraphSTAD system captures the behavior of channels arising from regional collision particle hits, together with the electrical and environmental characteristics due to the shared backend circuits of the channels, to effectively detect the degradation of faulty channels. The AD system achieves this using a deep AE model that learns local spatial behavior, physical connectivity-induced shared behavior, and temporal behavior through convolutional neural networks (CNN), graph neural networks (GNN), and recurrent neural networks (RNN), respectively.
We have evaluated our proposed AD approach in detecting spatial faults and temporal discords on digi-occupancy maps of the HCAL. To analyze the effectiveness of the AD model, we simulated different realistic types of anomalies: dead channels without registered hits, hot channels dominated by electronic noise resulting in a much higher hit count than expected, and degrading channels with deteriorated particle detection efficiency resulting in lower hit counts than expected. The results have demonstrated promising performance in detecting and localizing the anomalies. We have validated the accuracy in detecting real anomalies and discussed a comparison to the existing DQM system.
In this paper, we briefly describe the DQM and HCAL systems in Section 2, and our data sets in Section 3. Section 4 explains the methodology of the proposed GraphSTAD model, and Section 5 presents the performance evaluation and result discussion. Finally, we summarize the contribution of our study in Section 6.

Background
This section describes the CMS DQM and the HCAL systems.

The Data Quality Monitoring of the CMS Experiment
The particle collision data of the LHC is aggregated into runs, where each run contains thousands of lumisections. A lumisection (LS) corresponds to approximately 23 seconds of data taking and comprises hundreds or thousands of collision events containing particle hit records. The DQM of the CMS provides feedback on detector performance and data reconstruction; it generates a list of certified data for physics analyses, the "Golden JSON" Azzolini et al. [2019]. The DQM employs online and offline monitoring mechanisms: 1) the online monitoring is real-time DQM during data acquisition, and 2) the offline monitoring, performed 48 hours after the collisions were recorded, provides the final fine-grained data quality analysis for data certification. The online DQM populates a set of histogram-based maps on a selection of events and provides summary plots with alarms that DQM experts inspect to spot problems. The digi-occupancy map, one of the histogram maps generated by the online DQM, contains particle hit histogram records of the particle readout channel sensors of the calorimeters. A digi, also called a hit, is a reconstructed and calibrated collision physics signal of the calorimeter. Several faults in the calorimeter, affecting the frontend particle sensing scintillators, digitization, communications, the backend hardware, and the algorithms, could appear in the digi-occupancy map. Previous efforts by Azzolin et al. [2019], Azzolini et al. [2019], Pol et al. [2019a,b] demonstrate the promising AD efficacy of using digi-occupancy maps for calorimeter channel monitoring using machine learning. However, end-to-end DL with temporal models is relatively unexplored Azzolin et al. [2019], Pol et al. [2019b].
The purpose of enhancing the DQM through machine learning is to address particular challenges: 1) human intervention has latency, and thresholds require sufficient statistics; 2) the volume of data a human can process in a finite time is limited; 3) rule-based approaches do not scale and assume a limited set of potential failure scenarios; 4) dynamic running conditions change the reference samples; 5) training human shifters who monitor DQM dashboards, and maintaining instructions, is expensive. Despite the potential, developing machine learning models for the DQM comes with some impediments; data normalization to handle variation in experimental settings, the granularity of the failures to spot, and the limited availability of ground truth labels are among the challenges Pol et al. [2019b].
We extend the efforts in AD with spatio-temporal (ST) modeling of the digi-occupancy maps of the DQM for the HCAL. Several promising AD models have been proposed in the literature for ST data in non-HEP domains Atluri et al. [2018], such as crowd monitoring using visual streaming data Xu et al.

The Readout Boxes of the HCAL
The HCAL is a specialized calorimeter to capture hadronic particles. The calorimeter is made of multiple subsystems such as the HCAL endcap (HE), HCAL barrel (HB), HCAL forward (HF), and HCAL outer (HO) (see Fig. 1).
Figure 1: Schematic of the CMS detector at P5 of the LHC and its calorimeters Focardi [2012].
The HCAL frontend electronic systems use readout boxes (RBXes) to house the data acquisition electronics. The RBXes provide high voltage, low voltage, backplane communications, and cooling to the data acquisition electronics. The use case of our study, the HE, is made of 36 RBXes arranged on the plus and minus hemispheres of the CMS. Its frontend particle detection system is built on brass and plastic scintillators, and transmits the produced photons through wavelength-shifting fibers to silicon photomultipliers (SiPMs) (see Fig. 2). Each RBX houses four readout modules (RMs) for signal digitization Strobbe [2017]; each RM has a SiPM control card, 48 SiPMs, and four readout cards, each of which includes 12 charge integrating and encoding chips (QIE11) and a field programmable gate array (Microsemi Igloo2 FPGA). The QIE integrates charge from each SiPM at 40 MHz, and the FPGA serializes and encodes the data from the QIE. The encoded data is optically transmitted to the backend system via the CERN versatile twin transmitter (VTTx) at 4.8 Gbps. The current HCAL system has 17 detector scintillator layers that are read out in seven groups, hereafter referred to as depths; the light from the scintillators in any given group is optically added together by sending it to a single SiPM. More channels allow for a more refined depth segmentation, which is ideal for precisely calibrating the depth-dependent radiation damage on the HCAL Azzolini et al. [2019].
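The component counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the numbers in the text; it assumes one QIE11 chip per SiPM and one SiPM per readout channel, which the text does not state explicitly, and the function name is illustrative.

```python
# Figures quoted in the text for the HE frontend electronics.
RBX_COUNT = 36      # readout boxes in the HE
RMS_PER_RBX = 4     # readout modules per RBX
SIPMS_PER_RM = 48   # SiPMs per readout module
CARDS_PER_RM = 4    # readout cards per readout module
QIE_PER_CARD = 12   # QIE11 chips per readout card


def he_frontend_counts():
    """Return (QIE11 chips per RM, SiPMs per RBX, SiPMs in the whole HE)."""
    qie_per_rm = CARDS_PER_RM * QIE_PER_CARD
    sipms_per_rbx = RMS_PER_RBX * SIPMS_PER_RM
    sipms_total = RBX_COUNT * sipms_per_rbx
    return qie_per_rm, sipms_per_rbx, sipms_total


print(he_frontend_counts())
```

Under these assumptions, the number of QIE11 chips per RM matches the number of SiPMs per RM. The per-RBX and total SiPM counts are hardware totals; the number of channels actually populated in a digi-occupancy map may be lower.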

Data Set Description
We used digi-occupancy map data of the online DQM system of the CMS experiment to train and validate the proposed AD system. The digi-occupancy map data has 3D spatial dimensions with η, ϕ, and depth axes, and contains digi histogram records of the calorimeter readout channels referenced by the iη ∈ [−32, 32], iϕ ∈ [1, 72], and depth ∈ [1, 7] axes (see Fig. 3). The value of the digi-occupancy map varies with the received luminosity (the luminosity recorded by the CMS, hereafter referred to as the luminosity) and the number of events (particles traversing the calorimeter), both of which may differ across LSs. The maps from a sequence of LSs constitute ST data with correlated spatial and temporal relations Atluri et al. [2018].
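To make the axis convention concrete, the following sketch maps a channel coordinate (iη, iϕ, depth) to a zero-based array index. It assumes that iη = 0 is not a valid tower index, so that the iη axis has 64 bins; this, the constant names, and the helper `to_index` are illustrative assumptions rather than part of the DQM software.

```python
import numpy as np

# Assumed array layout for one digi-occupancy map: (ieta, iphi, depth).
N_IETA, N_IPHI, N_DEPTH = 64, 72, 7


def to_index(ieta, iphi, depth):
    """Map detector coordinates to zero-based array indices.

    Assumes ieta runs over [-32, -1] and [1, 32] (no ieta = 0),
    iphi over [1, 72], and depth over [1, 7].
    """
    if ieta == 0 or abs(ieta) > 32:
        raise ValueError("ieta must be in [-32, -1] or [1, 32]")
    if not 1 <= iphi <= N_IPHI or not 1 <= depth <= N_DEPTH:
        raise ValueError("iphi or depth out of range")
    row = ieta + 32 if ieta < 0 else ieta + 31
    return row, iphi - 1, depth - 1


# An empty map with the assumed shape, and one hit recorded in it.
occupancy = np.zeros((N_IETA, N_IPHI, N_DEPTH))
occupancy[to_index(-32, 1, 1)] += 1.0
```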
We utilized data collected in 2018 during the LHC Run-2 collision experiment. The data set contains about 20K LSs from 20 healthy runs collected by the CMS experiment, pre-scrutinized by the CMS certifiers, and registered in the "Golden JSON" of the DQM as being of good quality Rapsevicius et al. [2011]. The maps, one per LS, were populated with a per-LS received luminosity of up to 0.4 pb$^{-1}$ and a number of events of up to 2250. Our working dataset contains 20K map samples, each with a dimension of

Methodology
This section presents the proposed GraphSTAD system for online DQM of the HCAL using digi-occupancy map data.
An anomaly is an odd observation from the bulk of observations, often indicating peculiar underlying incidents Chalapathy and Chawla [2019]. AD methods can broadly be categorized as supervised or unsupervised: 1) supervised approaches require annotated ground-truth anomaly observations, and 2) unsupervised approaches do not require labeled anomaly data and are more pragmatic in many real-world application settings, as data annotation is an expensive task. Unsupervised AD models trained with only healthy observations are called semi-supervised AD approaches.
We present an ST reconstruction AE to detect abnormality in the HCAL channels using reconstruction deviation scores on ST digi-occupancy maps from consecutive lumisections (see Fig. 4). The AE combines CNN, GNN, and RNN to capture ST characteristics of digi-occupancy maps. The spatial feature extraction of the CNNs is leveraged with GNNs

Data Preprocessing
This section describes the data preprocessing stages of the proposed approach-i.e., digi-occupancy renormalization for particle collision experiment setting variations and graph adjacency matrix generation for the readout channels of the HCAL.

Digi-occupancy Map Renormalization
The digi-occupancy (γ) map data of the HCAL varies with the received luminosity (β) and the number of events (ξ) (see Fig. 5). We devise a renormalization of γ through a regression model $R$ to obtain a consistent quantitative interpretation of γ and to build a model that generalizes robustly to previously unseen run settings, i.e., β and ξ variations. The model $R$ estimates the renormalizing quantity $\hat{\gamma}_s$ at the $s$-th LS from β and ξ. It is trained to minimize the MSE cost function $E[(\gamma_s - \hat{\gamma}_s)^2]$, where the target $\gamma_s$ is calculated from the per-channel values $\gamma(s, i)$, the digi-occupancy of the $i$-th channel in the map at the $s$-th LS. Finally, each per-channel $\gamma(s, i)$ is renormalized by its corresponding $\hat{\gamma}_s$ to give the renormalized occupancy $\tilde{\gamma}(s, i)$, using a scaling factor $K$ that compensates for the difference in the number of channels on the depth axes.
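One plausible way to write the three relations described in this paragraph is given below. This is a reconstruction from the prose only: the subscripts $\beta_s$ and $\xi_s$, the tilde notation for the renormalized occupancy, and in particular the choice of a sum (rather than, say, a mean) over channels for $\gamma_s$ are assumptions.

```latex
\hat{\gamma}_s = R(\beta_s, \xi_s), \qquad
\gamma_s = \sum_{i} \gamma(s, i), \qquad
\tilde{\gamma}(s, i) = K \, \frac{\gamma(s, i)}{\hat{\gamma}_s}
```

Read this way, $R$ predicts the overall occupancy level expected for the run settings of an LS, and dividing each channel by that prediction removes the dependence on luminosity and event count while preserving the relative pattern across channels.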
We have employed fully connected (FC) neural networks to build the regression model to effectively capture the non-linear relationships illustrated in Fig. 5. Fig. 6 depicts the distribution of $\gamma_s$ before and after renormalization with $R$. The renormalization has successfully handled the discrepancies in $\gamma_s$ across several runs: it overlaps and centers the per-run distributions and minimizes the outliers.
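The renormalization step can be sketched in a few lines of NumPy. To keep the example self-contained, the FC regression network is swapped for an ordinary least-squares fit on the features (1, β, ξ, βξ); this stand-in, the per-LS sum used as the regression target, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def design(beta, xi):
    """Feature matrix [1, beta, xi, beta * xi] for the stand-in regressor."""
    beta = np.asarray(beta, dtype=float)
    xi = np.asarray(xi, dtype=float)
    return np.stack([np.ones_like(beta), beta, xi, beta * xi], axis=1)


def fit_regressor(beta, xi, gamma_s):
    """Least-squares fit of the per-LS occupancy level (replaces the FC network)."""
    coef, *_ = np.linalg.lstsq(design(beta, xi), gamma_s, rcond=None)
    return coef


def renormalize(maps, beta, xi, coef, K=1.0):
    """Divide every channel of each map by the predicted per-LS level.

    maps has shape (S, N_ieta, N_iphi, N_depth); beta and xi have shape (S,).
    """
    gamma_hat = design(beta, xi) @ coef
    return K * maps / gamma_hat[:, None, None, None]


# Toy data: the overall level scales with beta * xi, the spatial pattern is fixed.
rng = np.random.default_rng(0)
beta = rng.uniform(0.05, 0.4, 50)
xi = rng.uniform(500, 2250, 50)
pattern = rng.uniform(0.5, 1.5, (4, 3, 2))
pattern /= pattern.sum()
maps = (3.0 * beta * xi + 10.0)[:, None, None, None] * pattern

coef = fit_regressor(beta, xi, maps.sum(axis=(1, 2, 3)))
renormalized = renormalize(maps, beta, xi, coef)
```

In this toy setting the renormalized maps have the same total for every LS, so only the spatial pattern remains. On real data the fit is not exact, and a guard against near-zero predictions would be needed before dividing.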

Adjacency Matrix Generation for Graph Network
We deploy an undirected graph network $G(V, \Theta)$ to represent the HCAL channels based on their connection to a shared RBX system. The graph $G$ contains nodes $\upsilon \in V$, with edges $(\upsilon_i, \upsilon_j) \in \Theta$ in a binary adjacency matrix $A \in \mathbb{R}^{M \times M}$, where $M$ is the number of channel nodes. An edge indicates that two channels share the same RBX: $A(\upsilon_i, \upsilon_j) = 1$ if $\Omega(\upsilon_i) = \Omega(\upsilon_j)$, and $0$ otherwise, where $\Omega(\upsilon)$ returns the RBX ID of the channel $\upsilon$.
There are approximately 7K channels in a graph representation of the digi-occupancy map of the HE, where each RBX network contains roughly 190 nodes. We retrieved the channel-to-RBX mapping from the HCAL's calorimeter segmentation map.
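A minimal sketch of the adjacency construction follows, assuming the channel-to-RBX mapping is available as a flat array of RBX IDs (one per channel node). Whether a channel is connected to itself is not stated in the text, so self-loops are left as an option; the function name is illustrative.

```python
import numpy as np


def rbx_adjacency(rbx_ids, self_loops=False):
    """Binary adjacency matrix linking channels that share an RBX.

    rbx_ids: sequence of length M giving the RBX ID of each channel node.
    """
    rbx_ids = np.asarray(rbx_ids)
    adjacency = (rbx_ids[:, None] == rbx_ids[None, :]).astype(np.uint8)
    if not self_loops:
        np.fill_diagonal(adjacency, 0)
    return adjacency


# Five channels: two in RBX 0 and three in RBX 1.
A = rbx_adjacency([0, 0, 1, 1, 1])
```

The resulting matrix is block-structured, with one fully connected block per RBX. For roughly 7K nodes, a dense matrix of this kind holds about 49 million entries, so a sparse edge list may be preferable in practice.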

Anomaly Detection Modeling
We denote the AE model of the GraphSTAD system as $F$. The ST data is $X \in \mathbb{R}^{T \times N_{i\eta} \times N_{i\phi} \times N_d \times N_f}$, where $T$ is the time window, $N_{i\eta} \times N_{i\phi} \times N_d$ is the spatial dimension corresponding to the iη, iϕ, and depth axes, respectively, and $N_f = 1$ is the number of input variables (only the digi-occupancy quantity in the spatial data). The model $F_{\theta,\omega}: X \to \hat{X}$, parameterized by θ and ω, attempts to reconstruct the input ST data $X$ and outputs $\hat{X}$. The encoder network $E_\theta: X \to z$ provides a low-dimensional latent space, $z = E_\theta(X)$, and the decoder $D_\omega: z \to \hat{X}$ reconstructs the ST data from $z$, so that $\hat{X} = D_\omega(E_\theta(X))$.
The channel anomalies can be transient (short-lived and impacting only a single digi-occupancy map) or persistent over time (affecting a sequence of maps). The spatial reconstruction error $e_i$ is calculated to detect a transient anomaly from $x_i \in X$ and $\hat{x}_i \in \hat{X}$, the input and reconstructed digi-occupancy of the $i$-th channel. The $e_i$ detects channel abnormality on isolated maps. To capture a time-persistent anomaly, we use an error aggregated over a time window $T$ using the mean absolute error (MAE), $e_{i,\mathrm{MAE}}$. We standardize $e_i$ to regularize the reconstruction accuracy variations among the channels, which allows a single AD decision threshold α for all the channels in the spatial map; the resulting anomaly score $a_i$ is $e_i$ scaled by $\sigma_i$, the standard deviation of $e_i$ (or of $e_{i,\mathrm{MAE}}$ if the time window is considered) on the training dataset. The anomaly flags are generated by applying α to the anomaly scores, $a_i > \alpha$. The α is a tunable constant that controls the detection sensitivity.
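The scoring procedure can be sketched as follows. The sketch assumes that the per-channel error is the absolute difference between input and reconstruction, that the windowed error is its mean over the T maps, and that standardization is a division by the per-channel training standard deviation; these forms and the function name are assumptions consistent with the description, not a confirmed implementation.

```python
import numpy as np


def anomaly_flags(x, x_hat, sigma, alpha, use_window=True):
    """Score and flag channels from a window of maps.

    x, x_hat: arrays of shape (T, M) holding the input and reconstructed
              digi-occupancy of M channels over T consecutive maps.
    sigma:    array of shape (M,), per-channel std of the error on training data.
    alpha:    decision threshold applied to the standardized score.
    """
    abs_err = np.abs(x - x_hat)
    # Windowed MAE for persistent anomalies, last map only for transient ones.
    error = abs_err.mean(axis=0) if use_window else abs_err[-1]
    score = error / sigma
    return score, score > alpha


# Three maps, four channels; channel 2 is consistently mis-reconstructed.
x = np.zeros((3, 4))
x_hat = np.zeros((3, 4))
x_hat[:, 2] = 1.0
score, flags = anomaly_flags(x, x_hat, sigma=np.full(4, 0.1), alpha=5.0)
```

Dividing by the per-channel standard deviation puts channels with naturally noisier reconstructions on the same scale as quiet ones, which is what makes a single threshold usable across the whole map.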

Autoencoder Model Architecture
Convolutional neural networks have achieved state-of-the-art performance in several AD DL applications with image data Chang et al. [2022], Luo et al. [2019], Hasan et al. [2016], Hsu [2017], Wu et al. [2020]. The shared nature of the kernel filters of the CNNs substantially reduces the number of trainable parameters in the model compared to fully connected neural networks. Directly supplying the learned spatial features into an architecture that can learn temporal data, such as an RNN, could become inherently challenging due to the considerable computational demand for high-dimensional data. We employ CNN and GNN with a pooling mechanism to extract relevant features from high-dimensional spatial data, followed by RNN to capture temporal characteristics of the extracted features (see Fig. 7). We integrate variational layer Kingma and Welling
The CNN of the encoder has $L_c$ networks, each containing Conv3D(•, kernel_size = [3 × 3 × 3]) for regular spatial learning followed by batch normalization (BN) for network weight regularization and faster convergence, ReLU for nonlinear activation, and MaxPooling3D for spatial dimension reduction. In this model, $x^l_t$ is the input spatial γ map data at time-step $t$ and $N^l_c$ is the feature size of the $l$-th network. The $y^c_t$ is the extracted feature set of the CNN at $t$. The Pool(•) denotes MaxPooling3D(•, stride = [2 × 2 × 2]). The $\psi^c_t$ holds the pooling spatial location indices of the MaxPooling3D layers to be used later for upsampling in the decoder during map reconstruction. The final extracted feature set $Y^c \in \mathbb{R}^{T \times N_c}$ of the CNN is an aggregation of all $y^c_t$ in the time window $T$, concatenated on the time dimension: $Y^c = [y^c_1, \ldots, y^c_T]$. We have used
The GNN of the encoder has $L_g$ networks of a graph convolutional network (GCN) with ReLU activation, and a final global attention pooling. In the temporal network, $N^l_r$ is the feature size of the $l$-th LSTM layer. The last layer ($N^2_r = N_z = 32$) generates the low-dimensional latent representation of the encoder. The VAE layer of the encoder generates the normally distributed latent features $z$ as: $z = \mu_z + \sigma_z \odot \epsilon$ (14), where ⊙ signifies an element-wise product with the standard normal distribution sampling $\epsilon \sim N(0, 1)$ An and Cho [2015]. The $\mu_z$ and the $\sigma_z$ of the VAE are implemented with FC layers taking the ζ as input.
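The sampling step in (14) is the usual reparameterization trick and can be sketched with NumPy. The sketch assumes the network outputs the log-variance rather than σ directly, which is a common numerical convenience and not something the text specifies; the function name is illustrative.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise.

    mu, log_var: arrays of the same shape (here the latent size is 32).
    rng:         a numpy.random.Generator, passed in for reproducibility.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


rng = np.random.default_rng(0)
mu = np.linspace(-1.0, 1.0, 32)
z = sample_latent(mu, np.zeros(32), rng)
```

Because the randomness enters only through ϵ, the sample is a differentiable function of the mean and the variance, which is what allows the encoder to be trained by backpropagation.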
The decoder network of the AE is made of RNN and CNN to reconstruct the target ST data from the latent features. The decoding starts with temporal feature reconstruction using an LSTM network.

Model Training
We trained the AE on healthy digi-occupancy maps of LHC collision runs (described in Section 3). The modeling task becomes a multivariate learning problem since the target data contains readings from multiple calorimeter channels in the spatial digi-occupancy map. Appropriate scaling of the spatial data is thus necessary for effective model training; we further normalized the spatial data per channel into a range of [0, 1]. We have also observed that the γ distribution of the channels at the first depth of the spatial map is different from that of the channels at the higher depths (see Fig. 3); distribution imbalance in the target channel data affects model training efficacy when well-known statistical measures, such as MSE, are employed as loss functions. MSE loss minimizes the cost over the entire space, and it may converge to a non-optimal local minimum in the presence of an imbalanced data distribution; this phenomenon is known as the class imbalance challenge in machine learning classification problems. A popular remedy is to employ a weighting mechanism that assigns weights to the different targets. We applied a weighted MSE loss function to scale the loss from the two distributions, depth = 1 and depth ∈ {2, ..., 7}. In the loss, $x_i$ is the γ of the $i$-th channel in the $j$-th group set $C_j$, $M_j$ is the number of channels in $C_j$, and $\varsigma_j$ is the weight factor of the MSE loss of the $j$-th group. We empirically set $\varsigma_1 = 0.4$ and $\varsigma_2 = 1$ after experimenting with several different values.
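A depth-grouped weighted MSE of this kind can be sketched as below. The form assumed here is the sum over the two groups of the group weight times the mean squared error within the group, $\sum_j \varsigma_j \frac{1}{M_j} \sum_{i \in C_j} (x_i - \hat{x}_i)^2$; this expression is inferred from the symbol definitions, and the array layout (depth on the last axis, depth 1 at index 0) is an assumption.

```python
import numpy as np


def weighted_mse(x, x_hat, weights=(0.4, 1.0)):
    """Weighted MSE over two channel groups: depth 1 and depths 2 to 7.

    x, x_hat: arrays whose last axis is the depth axis (length 7),
              with depth 1 stored at index 0.
    weights:  (weight for depth 1, weight for depths 2 to 7).
    """
    squared_error = (x - x_hat) ** 2
    mse_depth1 = squared_error[..., :1].mean()
    mse_higher = squared_error[..., 1:].mean()
    return weights[0] * mse_depth1 + weights[1] * mse_higher


# A unit error on every channel contributes 0.4 from depth 1 and 1.0 from the rest.
x = np.zeros((2, 3, 7))
loss_all = weighted_mse(x, x + 1.0)
```

Averaging within each group before weighting keeps the contribution of a group independent of how many channels it contains, so the weights directly set the relative importance of the two distributions.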
The VAE regularizes the training MSE loss using the KL divergence loss $D_{KL}$ to achieve a normally distributed latent space. In the regularized loss, $N$ is a normal distribution with zero mean and unit variance, and $\lVert \cdot \rVert$ is the Frobenius norm used for the $L_2$ regularization of the trainable model parameters $W$. The $\lambda = 0.003$ and $\rho = 10^{-7}$ are tunable regularization hyperparameters. Finally, we used the Adam optimizer with super-convergence one-cycle learning rate scheduling Smith and Topin [2019] for training.
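The regularized objective can be sketched as the reconstruction loss plus a KL term weighted by λ and a weight penalty weighted by ρ. The closed-form KL divergence between a diagonal Gaussian and the standard normal is standard; the log-variance parameterization, the use of the squared norm in the weight penalty, and the function names are assumptions made for this illustration.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL divergence of N(mu, diag(exp(log_var))) from N(0, I), summed over dims."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))


def total_loss(mse, mu, log_var, weights, lam=0.003, rho=1e-7):
    """Reconstruction loss + lam * KL + rho * squared L2 norm of the weights.

    weights: list of NumPy arrays holding the trainable parameters.
    """
    l2_penalty = sum(np.sum(w ** 2) for w in weights)
    return mse + lam * kl_to_standard_normal(mu, log_var) + rho * l2_penalty


# A latent code that already matches the prior adds no KL penalty.
kl_zero = kl_to_standard_normal(np.zeros(32), np.zeros(32))
```

With λ and ρ as small as the values quoted above, the reconstruction term dominates, and the two penalties act as mild regularizers on the latent distribution and the weights.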

Results and Discussion
In this section, we discuss the AD performance of the proposed GraphSTAD on simulated and real anomalies.
The ML studies for the CMS DQM mostly inject simulated anomalies into good data to validate the effectiveness of the developed models Azzolin et al. [2019], since only a small fraction of the data is affected by real anomalies. We trained the AE model on 10K digi-occupancy maps (from LS sequence numbers [1, 500]) and evaluated it on LSs [500, 1500] injected with synthetic anomalies simulating real dead, hot, and degrading channels. We trained the AE on four GPUs with early stopping, using 20% of the training dataset to estimate the validation loss during each training epoch. High reconstruction accuracy on the healthy data is essential to reduce false-positive flags when a semi-supervised AE is employed for AD. We further discuss the comparison of the reconstruction error distributions of the healthy and abnormal channels in the AD performance in Section 5.1.2.
We will discuss below the AD performance of the proposed system and comparisons with benchmark models, and present the detection of real faulty channels to demonstrate the accuracy of the proposed approach.

Anomaly Detection Performance
We generated synthetic anomalies simulating real dead, hot, and degrading channels and injected them into healthy digi-occupancy maps; the anomaly generation algorithm involves three steps: 1) selection of a random set of LSs from the test set, 2) random selection of spatial locations φ for each LS, where φ ∈ {iη × iϕ × depth} on the HE axes (see Fig. 3), and 3) injection of anomalies such as dead ($\gamma_{anomaly} = 0$), hot ($\gamma_{anomaly} \gg \gamma_{expected}$), and degrading channels ($0 \le \gamma_{anomaly} < \gamma_{expected}$) into the digi-occupancy maps of the LSs. We have kept the same spatial locations of the generated anomalies for consistency when evaluating the AD performance of the different anomaly types. The figure illustrates the total digi-occupancy per LS across the seven depths ($\gamma_l$). The proposed GraphSTAD AE operates on ST γ map data, and we present the curves corresponding to the $\gamma_l$ per LS only to demonstrate the capability of the AE in handling the fluctuation across the sequence of lumisections.
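The injection procedure can be sketched as scaling the occupancy of randomly chosen channels by a factor that depends on the anomaly type. For brevity the sketch draws one set of locations and reuses it for all selected maps and all anomaly types; the hot factor of 2.0 (200% of the healthy value) and the degrading factor are illustrative parameters, and the function names are hypothetical.

```python
import numpy as np


def random_mask(shape, n_channels, rng):
    """Boolean mask of spatial shape `shape` with n_channels random True entries."""
    flat = rng.choice(int(np.prod(shape)), size=n_channels, replace=False)
    mask = np.zeros(shape, dtype=bool)
    mask[np.unravel_index(flat, shape)] = True
    return mask


def inject(maps, mask, kind, degrade_factor=0.8, hot_factor=2.0):
    """Return a copy of maps (S, N_ieta, N_iphi, N_depth) with faulty channels.

    kind: "dead" sets the occupancy to zero, "hot" multiplies it by hot_factor,
          "degrading" multiplies it by degrade_factor (a value in [0, 1)).
    """
    factor = {"dead": 0.0, "hot": hot_factor, "degrading": degrade_factor}[kind]
    out = maps.astype(float)  # astype returns a copy, so the input is untouched
    out[:, mask] *= factor
    return out


rng = np.random.default_rng(0)
healthy = np.full((3, 4, 5, 2), 10.0)
mask = random_mask(healthy.shape[1:], n_channels=6, rng=rng)
dead = inject(healthy, mask, "dead")
hot = inject(healthy, mask, "hot")
degraded = inject(healthy, mask, "degrading", degrade_factor=0.6)
```

Reusing the same mask for every anomaly type mirrors the choice described above of keeping the spatial locations fixed, so that differences in detection performance reflect the anomaly type rather than the channel positions.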

Detection of Dead and Hot Channels
We have evaluated the AD accuracy on dead ($\gamma_a = 0$) and hot ($\gamma_a = 200\% \, \gamma_h$) channels on the 10K maps, with 5K maps for each anomaly type. We have investigated the AD performance on transient channel anomalies that are short-lived in isolated maps (see Table 1) and on persisting anomalies that affect consecutive maps in a time window (see Table 2). The model has achieved high accuracy with good localization of the faulty channels: 0.99 precision when capturing 99% of the 335K faulty channels. Time-persistent anomalies are easier to detect than short-lived anomalies on isolated LSs: the FPR generally improves by 13% to 23% and by 28% to 40% for the dead and hot anomalies, respectively.
We have observed that most FPs occur on channels with low expected γ, where the model achieves relatively lower reconstruction accuracy. The performance is not entirely unexpected, as we trained the AE to minimize a global MSE loss function (19). The reconstruction errors become relatively higher for channels with low γ ranges, which limits the AD effectiveness in distinguishing the anomalies when capturing 99% of the time-persistent dead channels using (8).
We have monitored roughly 31.28M HE sensor channels, of which 335K (1.07%) are simulated abnormal channels, from the 5K maps on the isolated map evaluation in Table 1. The monitored channels grow to 156M with 1.68M (1.07%) anomalies for the evaluation of time-persistent anomalies in

Detection of Degrading Channels
Table 3 presents the AD accuracy of time-persistent degrading channels simulated with $R_D$ = [80%, 60%, 40%, 20%, 0%]; $R_D$ = 0% corresponds to a dead channel. We injected the generated channel faults into 1K maps for each decay factor. We have monitored around 156M channels, of which 1.74M (1.11%) are abnormal channels, in a total of 25K digi-occupancy maps (5K maps per decay factor within the time window). The AD system has demonstrated promising potential in detecting degraded channel anomalies. The FPR to capture 99% of the anomalies is 2.988%, 0.155%, 0.022%, 0.002%, and 0.001% when channels operate at 80%, 60%, 40%, 20%, and 0% of their expected capacity, respectively.
The relatively lower precision at $R_D$ = 80% indicates that a few anomalies remain challenging to catch, even though the FPR is very low considering the accurate classification of numerous TN healthy channels (see Fig. 9). The channels operating at $R_D$ = 80% are mostly inliers, overlapping with the healthy operating ranges, and detecting such anomalies is difficult when the expected γ of the channel is low. The significant improvement of the FPR by 88% and 95% when the amount of the captured anomaly is reduced to 95% and 90%, respectively, demonstrates that a small percentage of the channels causes the performance drop at $R_D$ = 80%. Fig. 10 illustrates the overlap regions in the distributions of the reconstruction errors of the healthy and faulty channels at the various $R_D$ values.
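The operating points quoted in this section, an FPR at a fixed fraction of captured anomalies, can be computed from per-channel anomaly scores and ground-truth labels as sketched below. The function name and the convention of flagging scores greater than or equal to the chosen threshold are assumptions for illustration.

```python
import numpy as np


def fpr_at_recall(scores, labels, recall=0.99):
    """Pick the largest threshold that captures at least `recall` of the
    anomalies, then report (threshold, false-positive rate, precision).

    scores: anomaly score per monitored channel.
    labels: True for injected anomalies, False for healthy channels.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    anomaly_scores = np.sort(scores[labels])
    # Number of lowest-scoring anomalies that may be missed at this recall.
    n_missed = int(np.floor((1.0 - recall) * anomaly_scores.size))
    alpha = anomaly_scores[n_missed]
    flagged = scores >= alpha
    tp = np.sum(flagged & labels)
    fp = np.sum(flagged & ~labels)
    tn = np.sum(~flagged & ~labels)
    return alpha, fp / (fp + tn), tp / (tp + fp)


scores = [0.1, 0.2, 0.3, 0.9, 0.8, 0.25]
labels = [False, False, False, True, True, True]
alpha, fpr, precision = fpr_at_recall(scores, labels, recall=1.0)
```

In the toy example, capturing every anomaly forces the threshold down to the weakest anomaly score, which lets one healthy channel through. Relaxing the target recall raises the threshold and removes that false positive, which is the same trade-off seen above when the captured fraction is reduced from 99% to 95% or 90%.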

Performance Comparison with Benchmark Models
We have quantitatively compared alternative benchmark models to validate the capability of the GraphSTAD (see Fig. 11). The benchmark AE models employ a similar architecture to the GraphSTAD AE but with different layers. The results demonstrate that integrating the GNN improves the FPR by a factor of 1.6 to 3.9. The temporal models (with RNN) have achieved a 3 to 5-fold boost over the non-temporal spatial AD model when capturing severely degraded channels. For subtle or inlier anomalies (e.g., when channels deteriorate by 20% at $R_D$ = 80%), the GraphSTAD achieves a substantial 25-fold improvement over the non-temporal model. Incorporating temporal modeling and GNN has enhanced degrading channel detection performance.
Fig. 12 and Fig. 13 illustrate that the detected faults fall into the dead channel category except at the last LS = 57, where the channels operated in a degraded state (the γ is lower than expected). Detecting degraded channels is challenging since the γ reading is not extreme as it is for dead and hot channels, and the γ drop overlaps with other false down-spikes (see LS > 57 in Fig. 12). The down-spikes in the digi-occupancy for LS > 57 are due to non-linearity in the LHC, i.e., changes in collision run settings (see Fig. 12b); our normalizing regression model has successfully handled the fluctuation during preprocessing before it could cause false-positive alerts (see Fig. 12a). Fig. 14 and Fig. 15 portray the spatial anomaly scores during the dead and degraded status of the faulty channels; the high anomaly scores localized at the faulty channels demonstrate the GraphSTAD AD performance at a channel-level granularity. The existing production DQM system of the CMS, which uses rule-based and statistical methods, has also reported these abnormal channels at run-level analysis; the results are only available at the end of the run after analyzing all the LSs of the run Tuura et al. [2010]. Our approach is adaptive to variability in the digi-occupancy maps and provides anomaly localization that detects faulty channels, including non-extreme degrading channels, at per-lumisection granularity.
The median inference time of the GraphSTAD on a single GPU is roughly 0.05 seconds with a standard deviation of 0.006 seconds. The integration of the GNN makes the inference relatively slower compared to the benchmark models. The processing cost is, nonetheless, within an acceptable range for the CMS production requirement, as the input digi-occupancy map is generated at each lumisection with a time interval of 23 seconds.

Conclusion
Our study presents a semi-supervised anomaly detection system for the hadronic calorimeter's data quality monitoring system using spatio-temporal digi-occupancy maps. We extend the synergy of temporal deep learning developments for the CMS experiment. Our approach addresses modeling challenges, such as digi-occupancy map renormalization, learning non-Euclidean spatial behavior, and degrading channel detection. To overcome these challenges, we propose the GraphSTAD model, which combines convolutional, graph, and temporal learning networks to capture spatio-temporal behavior and achieve robust localization of anomalies at a channel granularity on high spatial data. The AD performance evaluation has demonstrated the efficacy of the proposed system for channel monitoring. Our proposed AD system will facilitate monitoring and diagnostics of faults in the frontend particle hit sensing hardware and software system of the calorimeter. It will enhance the accuracy and automation of the existing DQM system by providing instant anomaly alerts on a broader range of channel faults, both in real time and offline; the improved monitoring of the calorimeter will result in the collection of high-quality physics data. The methods and approaches discussed in this study are domain-agnostic and can be adopted in other spatio-temporal fields, particularly when the data exhibit both regular and irregular spatial characteristics.

hh Also at Adiyaman University, Adiyaman, Turkey
ii Also at Istanbul Gedik University, Istanbul, Turkey
jj Also at Necmettin Erbakan University, Konya, Turkey
kk Also at Bozok Universitetesi Rektörlügü, Yozgat, Turkey
ll Also at Marmara University, Istanbul, Turkey
mm Also at Milli Savunma University, Istanbul, Turkey
nn Also at Kafkas University, Kars, Turkey
oo Also at Hacettepe University, Ankara, Turkey
pp Also at Istanbul University - Cerrahpasa, Faculty of Engineering, Istanbul, Turkey
qq Also at Yildiz Technical University, Istanbul, Turkey
rr Also at Bingol University, Bingol, Turkey
ss Also at Sinop University, Sinop, Turkey
tt Also at Erciyes University, Kayseri, Turkey

[2017], Chang et al. [2022], Luo et al. [2019], Wu et al. [2020], Hasan et al. [2016], Ullah et al. [2021], Hu et al. [2019], traffic monitoring Hsu [2017], Deng et al. [2022], cyber-security on sensor systems Banković et al. [2012], Tišljarić et al. [2021], Jiang et al. [2022], medical diagnosis Ahmedt-Aristizabal et al. [2021], and environment monitoring Zhang et al. [2020]. A unique quality of ST data that differentiates it from other traditionally studied data is the presence of dependencies among measurements induced by the spatial and temporal attributes, which make data correlations more difficult to capture with conventional techniques Atluri et al. [2018]. A spatio-temporal anomaly is defined as a data point or cluster of data points that violates the nominal ST correlation structure of the normal points Hsu [2017], Deng et al. [2022], Banković et al. [2012], Tišljarić et al. [2021], Xu et al. [2017], Chang et al. [2022], Luo et al. [2019], Wu et al. [2020], Hasan et al. [2016], Ullah et al. [2021], Ahmedt-Aristizabal et al. [2021], Jiang et al. [2022]. Previous ST AD studies on video data sets Chang et al. [2022], Luo et al. [2019], Wu et al. [2020], Hasan et al. [2016] focus on CNN models for regular spatial feature extraction, whereas GNNs are gaining popularity for sensor and traffic flow data Hsu [2017], Deng et al. [2022] that exhibit irregular spatial attributes with non-Euclidean distances among nodes. GNNs have recently achieved promising results at the LHC Duarte and Vlimant [2022], Shlomi et al. [2020], and outperformed CNNs in learning irregular calorimeter geometry Qasim et al. [2019] and in pileup mitigation Martínez et al. [2019]. The spatial characteristics of the HCAL channels combine regular spatial positioning of particle hits in the calorimeter with irregularity in measurement, as adjacent channels may be connected to different backend circuits. Our proposed model integrates both CNN and GNN Bruna et al. [2013], Kipf and Welling [2016] to capture Euclidean and non-Euclidean spatial characteristics, respectively, and an RNN for temporal learning for the HCAL channels.
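As a rough illustration of how shared-RBX connectivity could enter a graph convolution, the sketch below links channels housed in the same RBX and applies one propagation step in the normalized form popularized by Kipf and Welling [2016]. The channel-to-RBX assignment, the feature and weight sizes, and the function names `rbx_adjacency` and `gcn_layer` are hypothetical; this is a toy NumPy example, not the GraphSTAD implementation.

```python
import numpy as np

def rbx_adjacency(rbx_ids):
    """Connect every pair of channels that share an RBX (toy graph construction)."""
    ids = np.asarray(rbx_ids)
    adj = (ids[:, None] == ids[None, :]).astype(float)
    np.fill_diagonal(adj, 0.0)  # self-loops are added in the normalization step
    return adj

def gcn_layer(adj, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])        # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)

# Hypothetical example: six channels housed in two RBXes.
rbx_ids = [0, 0, 0, 1, 1, 1]
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))  # four features per channel (assumed)
weights = rng.normal(size=(4, 2))   # untrained random layer weights
adj = rbx_adjacency(rbx_ids)
out = gcn_layer(adj, features, weights)
```

Because every RBX forms a fully connected clique in this toy graph, the propagation step averages features over the channels of each RBX, so channels in the same RBX receive identical outputs. A real graph would normally be sparser or carry edge weights to retain channel-level differences.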

Figure 2: The data acquisition chain of the HE, including the SiPMs, the frontend readout cards, and the optical link connected to the back-end electronics Strobbe [2017]. Each readout card contains twelve QIE11 for charge integration, an Igloo2 FPGA for data serialization and encoding, and a VTTx optical transmitter. A fault in the chain may cause anomalous digi-occupancy reading in the online DQM.

Figure 3: Digi-occupancy map (year=2018, RunId=325170, LS=15) of the HE. The HE channels are placed in |iη|=[14, 29], iϕ=[1, 72], and depth=[1, 7]. Each pixel in the map corresponds to the readout of a calorimeter channel after a particle hit. The HCAL covers a considerable volume of CMS and has a fine segmentation along three axes (iη, iϕ, and depth). The missing section near the top-left of the map is due to two failed RBX (HEM15 and HEM16) sectors during the 2018 collision runs.

to learn circuit and housing connectivity-induced spatial behavior irregularities among the HCAL sensor channels. There are approximately 7K channels (pixels) on the digi-occupancy map of the HCAL endcap subsystem, housed in 36 RBXes. The channels in a given RBX are susceptible to system faults in the RBX due to the shared backbone circuit and environmental factors, such as temperature and humidity. Behavior variations among RBXes have also been observed due to intrinsic deviations of the custom-built electronic components in the RBXes. Our proposed GraphSTAD employs GNNs in its spatial feature extraction pipeline to capture the characteristics of the HCAL channels owing to their shared physical connectivity to a given RBX. GNNs have recently achieved promising results in several applications at the LHC Duarte and Vlimant [2022], Shlomi et al. [2020], and outperformed CNNs in learning irregular calorimeter geometry Qasim et al. [2019] and in pileup mitigation Martínez et al. [2019]. The GraphSTAD system exploits both CNN and GNN Bruna et al. [2013], Kipf and Welling [2016] to capture Euclidean and non-Euclidean spatial characteristics of the HCAL channels, respectively.

Figure 4: The proposed AE-based reconstruction AD system. The AE reconstructs the input ST digi-occupancy map, and the AD decision is performed on the anomaly scores estimated from the reconstruction errors.
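The decision step described in this caption can be sketched as follows. The sketch assumes the anomaly score is the per-channel absolute reconstruction error and that a fixed threshold is applied; both are illustrative assumptions, and the array `x_rec` is only a stand-in for the AE output rather than a trained model.

```python
import numpy as np

def anomaly_scores(x, x_rec):
    """Per-channel anomaly score as the absolute reconstruction error (assumed form)."""
    return np.abs(np.asarray(x, dtype=float) - np.asarray(x_rec, dtype=float))

def flag_anomalies(scores, threshold):
    """Flag channels whose score exceeds the chosen decision threshold."""
    return scores > threshold

# Toy 2D map: healthy channels read about 1.0 after renormalization (hypothetical values).
x = np.ones((4, 5))
x[1, 2] = 0.0            # dead channel: no digi-occupancy
x[3, 0] = 0.8            # degraded channel: 20% drop
x_rec = np.ones((4, 5))  # stand-in for the AE output, which reproduces healthy behavior

scores = anomaly_scores(x, x_rec)
flags = flag_anomalies(scores, threshold=0.1)  # threshold is an illustrative choice
```

In this toy case the dead channel gets a score of 1.0 and the degraded channel a score of about 0.2, which shows why degraded channels sit closer to the decision threshold than dead ones.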

Figure 5: Dependency between digi-occupancy and run settings (the received luminosity and the number of events) in LS granularity. The number of events did not fully follow the drop in the luminosity (right-side plot) and the digi-occupancy (middle plot); this reflects the non-linear behavior of the LHC. The different colors correspond to different collision runs.

Figure 6: Distribution of total digi-occupancy per LS before and after renormalization. From left to right: a) the received luminosity, b) the number of events, c) the digi-occupancy, and d) the renormalized digi-occupancy obtained with the regression model described in the text. The different colors correspond to different runs.
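The renormalization idea can be illustrated with a small regression sketch. It assumes a plain linear least-squares model that predicts the total digi-occupancy per LS from the luminosity and the number of events, and it divides the observed value by the prediction; the linear form, the synthetic data, and the function names are assumptions for illustration and may differ from the regression model used in the study.

```python
import numpy as np

def fit_occupancy_regression(lumi, n_events, total_occ):
    """Least-squares fit of total_occ ~ a*lumi + b*n_events + c (assumed linear form)."""
    design = np.column_stack([lumi, n_events, np.ones(len(lumi))])
    coef, *_ = np.linalg.lstsq(design, total_occ, rcond=None)
    return coef

def renormalize(total_occ, lumi, n_events, coef):
    """Divide the observed occupancy by the value expected from the run settings."""
    design = np.column_stack([lumi, n_events, np.ones(len(lumi))])
    return np.asarray(total_occ, dtype=float) / (design @ coef)

# Synthetic lumisections (hypothetical numbers, not CMS data).
rng = np.random.default_rng(1)
lumi = rng.uniform(0.5, 1.5, size=200)
n_events = rng.uniform(1000.0, 3000.0, size=200)
total_occ = 40.0 * lumi + 0.02 * n_events + 5.0 + rng.normal(0.0, 0.1, size=200)

coef = fit_occupancy_regression(lumi, n_events, total_occ)
gamma_norm = renormalize(total_occ, lumi, n_events, coef)
```

After this step the synthetic occupancy clusters tightly around 1.0 regardless of the run settings, which is the behavior panel d) is meant to convey.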

Figure 7: The architecture of the proposed AE for the GraphSTAD system. The GNN and CNN perform spatial feature extraction at each time step, and the RNN captures the temporal behavior of the extracted features. The encoder (feature extraction) incorporates the GNN for backend physical connectivity among the spatial channels, the CNN for regional spatial proximity of the channels, and the RNN for temporal behavior extraction. The decoder (reconstruction) contains RNN and deconvolutional neural networks to reconstruct the spatio-temporal input data from the low-dimensional latent features.

Fig. 8 demonstrates the capability of the proposed ST AE in reconstructing normal digi-occupancy maps from a sequence of lumisections. The AE reconstructs the ST digi-occupancy data with promising accuracy. High reconstruction accuracy on the healthy data is essential to reduce false-positive flags when a semi-supervised AE is employed for AD. We further compare the reconstruction error distributions of the healthy and abnormal channels in the AD performance discussion in Section 5.1.2.

Figure 8: ST digi-occupancy map reconstruction on samples from the test dataset (RunId: 325170, LS=[500, 750]). The figure illustrates the total digi-occupancy per LS across the seven depths (γ_l). The proposed GraphSTAD AE operates on ST γ map data, and we present the curves corresponding to the γ_l per LS only to demonstrate the capability of the AE in handling the fluctuation across the sequence of lumisections.

Figure 9: AD classification performance on time-persistent degrading channels.

Figure 10: Reconstruction error distribution of healthy and anomalous channels at different health rates. The overlap region decreases substantially as the channel deterioration increases (left to right).

Figure 11: Comparison with benchmark models on detecting time-persistent degrading channels. The GraphSTAD model achieves a significantly lower FPR.
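One way to compare models by FPR is to fix the fraction of anomalous channels that must be flagged and count how many healthy channels exceed the resulting threshold. The sketch below follows that idea; the operating point, the function name `fpr_at_recall`, and the toy scores are assumptions for illustration, and the study may define its operating point differently.

```python
import numpy as np

def fpr_at_recall(healthy_scores, anomaly_scores, target_recall=0.99):
    """FPR at the highest threshold that still flags `target_recall` of the anomalies."""
    anomalies = np.sort(np.asarray(anomaly_scores, dtype=float))
    # Number of anomalies that may be missed while keeping the target recall.
    n_missed = int(np.floor((1.0 - target_recall) * anomalies.size))
    threshold = anomalies[n_missed]
    fpr = float(np.mean(np.asarray(healthy_scores, dtype=float) >= threshold))
    return fpr, float(threshold)

# Toy anomaly scores (hypothetical values, not results from the study).
healthy = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
anomalous = [0.75, 1.0, 1.2, 1.5]
fpr, threshold = fpr_at_recall(healthy, anomalous, target_recall=1.0)
```

With these toy scores, flagging every anomalous channel requires a threshold of 0.75, and two of the ten healthy channels exceed it. Running the same function on the scores of two models at the same target recall gives FPR values whose ratio can be read as a fold improvement.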

Figure 13: Detection of real faulty channels from RunId=324841 collision run data. a) The 3D digi-occupancy maps with faulty channels (dead on the left at LS=6 and degraded on the right at LS=57), and b) visualization of the channel anomaly flags on the 2D map per depth axis (red for anomaly and green for healthy).

Figure 14: The detected real dead channels at LS=6 from RunId=324841: a) the raw 2D digi-occupancy maps at the depth axes of the faulty channels, and b) the corresponding anomaly score maps. The GraphSTAD localizes the anomaly scores on the faulty dead channels.

Figure 15: The detected real degraded channels at LS=57 from RunId=324841: a) the raw 2D digi-occupancy maps at the depth axes of the faulty channels, and b) the corresponding anomaly score maps. The GraphSTAD localizes the anomaly scores on the faulty degraded channels with strength proportional to anomaly severity (lower scores in the color bars than for the dead channels).

Table 1: AD on dead and hot channel anomalies on isolated digi-occupancy maps.

Table 2 using time window five maps resulting in 25K maps.

Table 2: AD on time-persistent dead and hot channel anomalies.

Table 3: AD on time-persistent degrading channels.