Article

Bridging Human and Artificial Intelligence: Modeling Human Learning with Explainable AI Tools

1 Stanford University, Stanford, CA 94305, USA
2 SLAC National Accelerator Laboratory, Menlo Park, CA 94025, USA
* Author to whom correspondence should be addressed.
Submission received: 14 November 2025 / Revised: 9 February 2026 / Accepted: 10 February 2026 / Published: 1 March 2026

Abstract

We address a gap in Machine Learning–human alignment research by proposing that methods from Explainable AI (XAI) can be repurposed to quantitatively model human learning. To achieve alignment between human experts and Machine Learning (ML) models, we must first be able to explain the problem-solving strategies of human experts with the same rigor we apply to ML models. To demonstrate this approach, we model expertise in the complex domain of particle accelerator operations. Analyzing 14 years of operational text logs, we construct weighted graphs where nodes represent operational subtasks and edges capture their strategic relationships. We then examine these strategic models across four granularity levels. Our analysis reveals statistically significant changes with expertise at three of four graph levels. Remarkably, despite numerous possible ways to partition subtasks, operators across all expertise levels demonstrate a striking consistency in high-level strategy, partitioning the task into the same three functional communities. This suggests a shared “divide and conquer” cognitive framework. Expertise develops within this stable framework, as experts exhibit greater cognitive flexibility (forming more cross-community connections) and build more refined internal models. The primary contribution of this work is a methodology for creating a quantitative, interpretable baseline of expert human performance. This provides a “ground truth” for future research in alignment between humans and ML models, enabling a new approach to verification: the ML model’s representation of the task can be quantitatively compared against the human expert benchmark to measure their alignment. This paves the way for building safer, more interpretable partnerships between humans and ML models.

1. Introduction

As Machine Learning (ML) applications expand into safety-critical domains (e.g., scientific facilities, industrial pipelines, and autonomous systems), explaining AI behavior has become critically important. This has driven significant advances in Explainable AI (XAI), leading to the development of tools that trace how models learn, reason, and act. However, a core limitation persists: we still struggle to explain AI behavior in complex tasks [1,2,3,4]. Remarkably, this mirrors a parallel challenge in cognitive science—explaining human learning and expertise in complex domains remains an active research frontier [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]. This convergence suggests an opportunity: before we can build ML models that seamlessly align with human experts, we must first be able to model and “explain” human problem solving with the same analytical rigor we apply to ML models. This investigation argues for a new direction to this end: applying methods from the XAI toolkit to model the structure of human expertise.
Graph-based methods offer a powerful lens for this task. Complex reasoning occurs within structured problem spaces naturally representable as networks [19]. Consequently, tools such as centrality analysis and community detection have become core instruments in graph-based XAI for explaining the behavior of Graph Neural Networks (GNNs). Recent surveys on graph-XAI explicitly categorize these measures as core explainability methods [20,21], and evaluations [22] have shown that algorithms such as PageRank can be highly effective at generating faithful explanations. These measures are part of the toolkit of popular GNN explainability libraries, for instance, GraphFramEx [22]. Just as XAI methods identify “feature importance” by ranking influential nodes and connections within an AI model, these same graph-theoretic tools can be used to identify a form of “human feature importance” by revealing which subtasks are central to an expert’s strategy. Our work operationalizes this idea by borrowing established analytical methods from the graph-XAI toolkit and applying them to model human performance data.
In this work, we demonstrate this approach by modeling human expertise in a real-world task: operating particle accelerators. This extremely complex task requires trained human operators to control a large set of parameters to generate bright, powerful X-ray beams (Section 5). We represent this task environment as a graph, where nodes represent elementary tasks, and edges capture the sequential relationships between them, inferred from over a decade of operational text logs. To examine the characteristics of operator performance and their evolution with experience, we employ a set of graph-theoretic measures at four levels of granularity: individual nodes, edges, communities, and global network topology.
Our findings reveal consistent, statistically significant changes in operator strategy with increasing expertise. While operators at all levels partition the task into similar, functionally distinct modules (a shared “divide and conquer” strategy), experts exhibit more efficient and refined pathways within these modules. This analysis provides two primary contributions:
  • A novel methodology for creating quantitative, structural models of human expertise from observational data.
  • An empirical baseline of expert behavior that defines “human feature importance” in this domain.
This baseline is a critical prerequisite for developing and validating future ML models, as it provides a ground truth against which an ML model’s strategy can be compared to ensure alignment with trusted human problem-solving patterns. Ultimately, this work provides a concrete step towards a data-driven science of human-ML model teaming.

2. Graph Theory as a Foundation for Explaining Complex Behavior

A complex task can be imagined as a hierarchy of many simple subtasks, where the higher-level subtasks serve as the goals for the lower-level ones [19]. For example, making a turn while driving is a high-level complex task that serves as the goal for several lower-level subtasks: checking traffic signals, pressing the brakes, rotating the steering wheel, and so on. Turning a car is itself a subtask of the even higher-level task of driving a car. Due to their innate network-like structure, graphs provide a direct and natural way to represent complex behavior and to analyze nested structures at varying levels of granularity.
A graph G = (V, E) consists of a set of nodes (or vertices) V, connected by edges E that encode relationships between node pairs. These relationships may include temporal order, co-occurrence, or functional dependencies, which can either be integrated into a single graph or distributed across multiple graphs. The resulting structures can be analyzed using a well-established suite of graph-theoretic measures and methods at different levels of granularity, including contributions of individual nodes, their interconnections and similarities, groups of highly connected nodes, or the entire graph.
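The driving example above can be sketched as a small weighted graph with networkx; the node labels, edges, and weights here are purely illustrative, not data from this study.

```python
# A minimal sketch of representing subtasks as a weighted graph.
# Node names and weights are illustrative assumptions only.
import networkx as nx

G = nx.Graph()
# Nodes: hypothetical driving subtasks; edge weights: how often
# the two subtasks were observed together.
G.add_edge("check_signals", "press_brakes", weight=5)
G.add_edge("press_brakes", "rotate_wheel", weight=3)
G.add_edge("check_signals", "rotate_wheel", weight=1)

print(G.number_of_nodes(), G.number_of_edges())  # 3 3
```

Once behavior is encoded this way, the full suite of graph-theoretic measures discussed below (centrality, betweenness, community detection) applies directly.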
In ML and XAI, graph theory underpins several successful approaches to improving model efficiency and interpreting their behavior, leveraging the strengths of the graph-based formalism for integrating information about individual nodes and their structural relationships. Graph Neural Networks (GNNs), for instance, leverage structural features (e.g., node embeddings, edge weights, subgraphs) in making predictions. Current best methods combine multiple graph features to outperform traditional ML methods in node classification and link prediction tasks [23,24,25]. To explain the decisions of these complex GNNs, a major subfield of XAI has emerged that uses graph-theoretic principles to provide explanations. These GNN explainability methods identify critical components of the input graph (such as influential nodes, important edges, or explanatory subgraphs, often referred to as motifs or rationales) that are most responsible for a given prediction. By ranking nodes by their importance (e.g., via attribution methods or centrality) or by identifying key communities, these techniques reveal the model’s underlying structuring and rationale.
Cognitive Network Science extends graph-theoretic principles to the study of human cognition and behavior [26,27]. For example, in semantic retrieval tasks, people’s responses are more accurately predicted by the PageRank of words—a measure of node importance within a graph—than by word frequency or pairwise association strength alone [28]. Researchers also employ graph-based models to investigate how individuals navigate problem spaces, form conceptual associations, and refine strategies over time. Several studies show that humans navigate internal cognitive representations of problems (modeled as graphs) in ways that mirror their navigation of physical environments [29,30,31,32]. The graph-based methods have also proven valuable in personality research, political sentiment analysis, developmental psychology, and psychiatry [33,34,35].
These parallel traditions in graph-based XAI and Cognitive Network Science reveal a powerful synergy. Both fields often use the same fundamental tools (such as node ranking and community detection) to explain complex behavior, albeit in different domains (ML models versus human cognition). This shared analytical language provides a unique opportunity: by applying graph-based XAI tools to model human expertise, we can create quantitative, interpretable models of human performance that are directly comparable to explanations of AI behavior, providing a concrete step towards studying and improving human–AI alignment.

3. Solving and Learning Complex Tasks

Explaining human behavior and expertise in complex domains has been a central focus in psychology and cognitive science. Researchers have focused on the roles of structured representations, hierarchical planning, and adaptive strategies in learning and expertise, with the explanations often falling into one of two overlapping categories: (1) information-based and (2) process-based.
Information-based explanations focus on the information content and representations used in solving problems. Central to this view is the chunking theory, which suggests that people group information into compact, meaningful units called chunks, allowing them to store and manipulate data efficiently within limited memory [36,37,38,39,40]. This mechanism has been used to explain expertise across domains, including chess [41], air traffic control [9,42], cybersecurity [43,44], and video games [45]. Building on chunking, cognitive architectures (e.g., Adaptive Control of Thought–Rational, ACT-R [46]) and learning theories (e.g., instance-based learning [47]) model the learning process as the acquisition of chunks and decision rules used to deduce the actions with maximum utility. For reviews on diverse successful applications, please see Gonzalez and Dutt [48] and Ritter et al. [49].
Despite these successes, scaling up information-based explanations for complex problems remains a challenge due to their dependence on utility maximization [50] as the rationality principle [51]. This principle states that the rational choice among possible options is the one with maximum utility, but choosing this option requires computing the expected utilities of all possible options. As complexity increases, the associated computations become intractable for any computationally limited model, making utility maximization a strategy possible only in “small worlds” [51,52,53]. Simon and Schaeffer [54] provide an example using game theory for chess play. In game theory, the optimal strategy is to first explore the tree of all possible states, then choose the path with the maximum utility (i.e., a win). However, as the number of possible states in chess is extremely large (O(10^120) [55]), this strategy cannot be implemented within our computational limits.
The impossibility of utility-based optimization led Simon and colleagues to call for process-based explanations (i.e., through processes implementable within computational limits) and develop the principles of bounded rationality [6,51,53,56,57,58,59,60]. This principle states that, in complex problems, humans “satisfice” rather than optimize; that is, they approximate good-enough solutions using heuristics that satisfy constraints and suffice for current goals [57,58,60,61,62,63]. When optimality is beyond reach, heuristics provide frugal and efficient alternatives that often outperform sophisticated algorithms in complex and uncertain environments, such as trading stocks, driving a car, or flying a plane (for a review, see [63]). The need for process-level explanations also extends to learning [6,8,57,58]. While learning simple tasks is often described as perfecting a single optimal strategy [64,65,66,67], humans generally go through a sequence of improved strategies while learning complex tasks (e.g., [11,12,15,17,68,69,70,71,72]). For example, in learning well-structured problems (with clear goals to achieve using a specific set of actions in sequences), such as the Tower of Hanoi (ToH), humans go through several strategies—(1) selective information search, (2) means-end analysis, (3) recursion, and (4) hierarchical recursion [6].
The progression from simple heuristics to sophisticated, structured strategies, as is seen in tasks like the Tower of Hanoi, highlights a central methodological challenge: how can we quantitatively model these evolving behavioral processes in complex, real-world environments? While classic studies provide deep insights, they often rely on controlled experiments that are difficult to apply in operational settings, such as a particle accelerator control room. This creates a need for data-driven methods that can extract and analyze these cognitive strategies directly from observational data. Our work addresses this need by proposing a graph-based framework to create the very process-based explanations that Simon and others called for, but in a scalable, real-world context.

4. Research Questions and Hypotheses

Our primary goal is to demonstrate how analytical methods from graph-based XAI can be repurposed to create quantitative, process-based models of human expertise. Such models may be beneficial for data-driven human-AI alignment. To this end, we apply a suite of graph-theoretic measures to a large dataset of operational logs from a particle accelerator control room. By representing the complex tuning task as a graph of its elementary subtasks, we can systematically investigate the main exploratory question: how do operator strategies evolve with expertise? As operator strategies may span several task levels (from low-level subtasks to high-level task goals), we divide this focal question into two parts to probe both the breadth and the depth of strategic changes with expertise.
Q1
What are the changes at the different levels of the complex task with expertise?
Q2
How are the subtasks organized to make up the complex task?
The following are our hypotheses for these questions.
H1. 
There will be measurable, directed changes with expertise at some or all graph levels.
With expertise, we expect changes in the role of individual subtasks, how the subtasks are interconnected and organized into communities, as well as the structure of entire graphs. To test this hypothesis, we examine whether the changes with experience are statistically significant at four levels of our graph analysis, representing subtasks (nodes), their interconnections (edges), groupings (communities), and the whole complex task (entire graphs). The statistical tests are detailed in Section 6.6.
H2. 
The operators will divide the subtasks nested in the complex task and group them into distinct functional categories.
We expect the subtasks to exhibit strong community structure—clusters of operational subtasks that are more densely connected to each other than to the rest of the graph. Such a result would suggest that operators organize the task into functional groups, reflecting a divide-and-conquer strategy and implying an organizational structure of task information in this domain. To test this hypothesis, we first apply community detection and clustering algorithms to partition the graphs into communities, before evaluating the strength of these partitions using modularity [73] (detailed in Section 6.4.3). Note that the edge weights were estimated based on subtask co-occurrences; therefore, the detected communities represent clusters of subtasks that tend to be performed in close temporal or functional proximity.

5. The Complex Task for Operating Particle Accelerators

We implement the graph-based approach to examine human expertise in an extremely complex real-world task: operating particle accelerators. The accelerator we studied is the Linear Coherent Light Source (LCLS) at the SLAC National Accelerator Laboratory, a Free-Electron Laser (FEL) that produces very bright, powerful X-rays at ultra-fast pulse rates. These X-rays enable researchers to examine matter at the molecular and even atomic scales, capturing high-resolution, high-frequency snapshots. LCLS hosts thousands of experiments annually that aid our understanding of phenomena such as photosynthesis [74] and molecular interactions for drug discovery [75].
Safe, efficient, and reliable beam delivery is ensured by a complex, human-in-the-loop control system in which control room operators serve as the central decision-makers, performing subtasks by directing control hardware, software suites, computational algorithms, ML-powered solvers, and automated machine controls to configure, operate, and troubleshoot the accelerator within policy and procedural constraints and in consultation with a broad team of experts (physicists, machine specialists, system experts, technicians, safety officers, and others). Expertise in integrating domain knowledge and selecting effective control actions develops gradually. Operators, typically hired with physics or engineering backgrounds, progress through a multi-year on-the-job training program in accelerator physics and operations, beginning with a foundational year that prepares them to work with minimal supervision. Mastery is difficult because control room problems involve extremely complex moving optima in large parameter spaces, substantial uncertainty from measurement noise, machine drifts, and hidden parameters, and the absence of deterministic solution strategies.
FEL tuning is one such complex problem, whose goal is to set three X-ray beam parameters: (i) Photon energy, (ii) Pulse length, and (iii) Pulse intensity (or more simply, beam brightness). Once the desired energy and pulse length are established, operators iteratively adjust numerous parameters to maximize beam brightness by increasing the number of X-ray photons emitted by the source, a process called tuning. This tuning process is the focus of our investigations. While the first two parameters are established through highly constrained task sequences, tuning beam brightness requires creative problem-solving and is thus more likely to reveal differences in strategy.
Operators have identified 27 commonly used tuning parameters (Appendix A). While this set is finite, the optimal values and the sequences in which they are adjusted depend heavily on the operator’s strategy and the accelerator’s evolving state. The combinatorial complexity is enormous: there are 27! ≈ 1.09 × 10^28 possible ordered tuning sequences, and when considering strategies as subsets of interconnected parameters, there are approximately 5.45 × 10^20 possible partitions (the Bell number for 27). Since it is impossible for operators to explore all potential methods, they rely on heuristics to guide performance, as predicted by principles of bounded rationality. Maximizing beam brightness, for instance, is reframed from a pure optimization problem into a satisficing one by setting an acceptable threshold informed by historical data, current machine configuration, and available time. Operators also frequently use trial-and-error search based on feedback—a heuristic search strategy [76]—for certain parameters, and they draw heavily on past solutions that succeeded in similar situations, reflecting the mirror-the-successful heuristic [63].
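The two combinatorial counts quoted above can be sanity-checked with a few lines of Python; this is an illustrative verification, not code from the study.

```python
# Verify the scale of the two counts for n = 27 tuning parameters.
import math

n = 27
orderings = math.factorial(n)  # number of ordered tuning sequences

# Bell number B(n) via the Bell triangle: the number of ways to
# partition the n parameters into unordered, non-empty groups.
row = [1]
for _ in range(n - 1):
    nxt = [row[-1]]
    for v in row:
        nxt.append(nxt[-1] + v)
    row = nxt
bell = row[-1]

print(f"27! = {orderings:.2e}")   # on the order of 10^28
print(f"B(27) = {bell:.2e}")      # on the order of 10^20
```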
Detailed information on operations is archived and made available in many forms, including a myriad of digital training articles that serve as references in the control room. Another crucial source of data for our analysis comes from the operators’ own electronic logs (elogs), which are timestamped text entries cataloging events as they unfold in the control room. Conceptually, these entries can be viewed as the “chunks” of information that operators deem important enough to record for future use. However, we acknowledge that written logs are an observational proxy and not a perfect trace of the full cognitive process. As discussed in the Limitations (Section 8), factors such as logging habits can influence the data. Despite this, the elog dataset presents a rare and valuable opportunity to reconstruct operator strategies from real-world, high-stakes observational data and to quantitatively investigate how these strategies change with expertise.

6. Methods

Our analysis draws on two primary data sources: (1) a large database of operator elogs (please see examples in Appendix B), which provide the behavioral data for our graph models, and (2) an expert-authored training article on FEL tuning, which was used as a ground truth to identify and label elog entries corresponding to the tuning task.

6.1. Participants

The elog dataset contains approximately 350,000 entries accumulated over a 14-year period between 2009 (the start of LCLS operations) and 2022. For each elog entry, we calculated the experience of the operator who authored it as the time elapsed since the operator wrote their first elog. We examine the graph changes in two ways: (i) by years of experience (Figure 1) and (ii) by dividing all operators and their entries into three groups of expertise (Table 1).
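The experience computation described above can be sketched with pandas; the column names and dates below are illustrative assumptions, not the actual elog schema.

```python
# Sketch: compute each author's experience at the time of every entry
# as the time elapsed since that author's first logged entry.
import pandas as pd

elogs = pd.DataFrame({
    "author": ["op1", "op1", "op2", "op1"],
    "timestamp": pd.to_datetime(
        ["2009-03-01", "2010-09-15", "2012-01-10", "2014-03-01"]),
})

# Each author's first entry time, broadcast back to all of their rows.
first_entry = elogs.groupby("author")["timestamp"].transform("min")
elogs["experience_years"] = (elogs["timestamp"] - first_entry).dt.days / 365.25

print(elogs["experience_years"].round(2).tolist())
```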
Figure 1 shows the number of operators at each stage of experience (Figure 1a, in half-year increments) and the number of elog entries they created (Figure 1b), both of which can be observed to decrease with expertise. This pattern is a finding in itself, potentially reflecting both attrition over time and changes in logging behavior as operators gain experience (e.g., novices may log more granular steps, while experts write more consolidated entries). To have a reasonable amount of data at each stage, we use the criterion of a minimum of 50 entries (the red dashed line in Figure 1b) in a period to be included. Therefore, we include all stages up to Year 7.
Table 1 shows the three expertise groups containing all operators and their entries: the Novice group with less than 1 year of experience, the Intermediate group with between 1 and 4 years of experience, and the Expert group with more than 4 years of experience. These bins are selected to align with existing operator training stages. As the table indicates, the number of operators and entries varies considerably across groups. This uneven distribution is a key feature of the dataset and a potential source of bias, a point we address explicitly in our Limitations section. To ensure that the unequal bin sizes do not mislead us, the group-wise investigations are preceded by the examination of the more continuous changes with years of experience.

6.2. Data Processing

Each elog entry consists of a title and a main text, along with the metadata that includes author information, timestamps, occasional references to related entries on the same topic, image attachments, and optional tags for related areas of the accelerators. To simplify analysis, we combined the title and the main text into a single piece of text.
A primary challenge was to systematically identify the small subset of entries related to our specific experimental task (FEL tuning) from the large and varied elog database. To solve this, we employed a natural language processing (NLP) approach, using an expert-authored training article on FEL tuning as a “ground truth” document. This article provides detailed descriptions of all 27 tuning parameters. By measuring the semantic similarity between each elog entry and this expert document, we could first filter for tuning-related entries and then identify which specific parameters were discussed within each. The precise steps of this data cleaning, filtering, and labeling process are detailed in the following section.

6.3. Developing Graphs of Tuning Parameters from Elogs

Figure 2 shows the steps used to develop the graphs from elog data. The first step involved extensive pre-processing to isolate relevant, human-authored text. We filtered out: (1) automated machine logs, (2) duplicate entries, (3) trivial entries with fewer than 10 words, and (4) entries unrelated to the LCLS accelerator. Subsequently, following standard NLP practices to reduce noise, we removed all punctuation, URLs, email IDs, stopwords, and numbers from both the elog entries and the expert training article. After this step, the resulting dataset contains about 132,000 entries and about 12,000 unique words.
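The text-cleaning step can be sketched as follows; this is a minimal illustration, and the actual stopword list and pipeline details are not reproduced here.

```python
# Sketch of the noise-reduction step: strip URLs, email IDs,
# punctuation, numbers, and stopwords from an entry's text.
import re

STOPWORDS = {"the", "a", "is", "to", "and"}  # illustrative subset only

def clean(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"\S+@\S+", " ", text)        # email IDs
    text = re.sub(r"[^a-z\s]", " ", text)       # punctuation and numbers
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean("Tuned the FEL to 120 Hz, see https://example.org"))
# -> "tuned fel hz see"
```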
Next, we represented the entries and the article as multi-dimensional vectors of all words. We employed Latent Semantic Indexing (LSI) [77], a robust topic modeling technique that represents documents as vectors in a lower-dimensional “topic space”. LSI is well-suited for this task as it captures semantic relationships (e.g., synonymy) beyond simple keyword matching. We decided against modern transformer-based language models due to their requirement for large training corpora and their computational expense, among other reasons. For the elogs, we reduced the dimensions of the entry vectors from 12,000+ words to 100 topics (Appendix C). Then, we used cosine similarity to find similarities between the vectorized entries and the article on tuning, identifying the relevant entries. A similarity threshold of 0.3 was empirically selected to balance precision and recall, yielding approximately 2000 relevant entries. Finally, for each of these relevant entries, we calculated its similarity to the text describing each of the 27 individual parameters and identified the top three most likely parameters discussed. This focus on the top three parameters is a simplification to capture the most salient aspects of each entry.
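The LSI retrieval step can be sketched with a numpy-only implementation (term counts plus truncated SVD, a standard way to realize LSI). The toy corpus, two topics, and labels below are illustrative; the actual pipeline used 100 topics over roughly 132,000 entries.

```python
# Sketch of LSI: build a term-document matrix, truncate its SVD to
# k "topics", then score entries against the reference article by
# cosine similarity in topic space.
import numpy as np

article = "undulator launch angle fel intensity tuning"
entries = [
    "adjusted launch angle and undulator taper for fel intensity",
    "replaced vacuum pump in sector 21",
]
docs = [article] + entries

vocab = sorted({w for d in docs for w in d.split()})
counts = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2                         # latent topics (100 in the actual study)
Z = U[:, :k] * S[:k]          # documents projected into topic space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = [cos(Z[0], Z[i]) for i in range(1, len(docs))]
print([round(s, 2) for s in sims])  # tuning-related entry scores higher
```

Entries whose similarity to the article exceeds the chosen threshold (0.3 in the study) would be retained as tuning-related.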
The final and most critical step was the graph construction itself. This step is based on the core assumption of our study, which is inspired by the distributional hypothesis from computational linguistics: co-occurrence in language implies a functional or semantic relationship. We posit that tuning parameters frequently mentioned together within the same log entry are also conceptually linked in the operator’s cognitive model of the task. Based on this principle, we identified which of the labeled tuning parameters co-occurred within each entry. These collocations were used to populate an adjacency matrix for a given set of entries (for instance, all entries from the Novice group). The weighted, undirected graph was then constructed from this matrix, where nodes represent the tuning parameters and edge weights represent the frequency of their co-occurrence.
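The adjacency-matrix construction can be sketched with networkx: for each entry, every pair of co-occurring parameter labels increments the weight of the corresponding undirected edge. The parameter names below are illustrative, not the actual 27 tuning parameters.

```python
# Sketch: build a weighted, undirected co-occurrence graph from the
# top-3 parameter labels assigned to each relevant elog entry.
from itertools import combinations
import networkx as nx

# Hypothetical top-3 parameter labels per entry.
entries = [
    ["launch_angle", "undulator_taper", "bunch_charge"],
    ["launch_angle", "undulator_taper", "phase"],
    ["phase", "bunch_charge", "launch_angle"],
]

G = nx.Graph()
for params in entries:
    for a, b in combinations(sorted(params), 2):
        # Increment the edge weight (0 if the edge does not exist yet).
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

print(G["launch_angle"]["undulator_taper"]["weight"])  # co-occurs twice
```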

6.4. Graph Analysis

The graphs constructed from elog data serve as models of operator strategy. In these graphs, nodes represent the tuning parameters and edges represent the strength of their co-occurrence in the operators’ logs. The graphs can be investigated at different scales, ranging from microscopic views of node or edge importance to more macroscopic views of clusters of closely related nodes and entire graphs. Figure 3 shows an example graph containing 10 nodes, 13 edges, and 3 clusters. We investigate the FEL tuning graphs (consisting of 27 interconnected nodes) at four ascending levels of complexity: (1) Node level—individual subtasks, (2) Edge level—interconnections between subtasks, (3) Community level—groups or clusters of subtasks, and (4) Graph level—the whole task.
To perform this multi-level analysis, we adopt a framework inspired by graph-based XAI. Our approach is to apply established graph-theoretic measures that are used in XAI to interpret complex models (like GNNs) and re-purpose them to interpret our models of human behavior.

6.4.1. Node Level

These measures are central to graph-based XAI, where they are used to determine “feature importance” by identifying which input nodes are most influential to a model’s decision. We apply the same logic to identify “human feature importance” in our operator graphs. For our main analysis, we used PageRank centrality (or simply PageRank) [78]. While originally famous for ranking webpages, PageRank has been shown to be a highly effective explainability method for Graph Neural Networks, sometimes outperforming more complex techniques [22]. Within our framework, a node with high PageRank represents a tuning parameter that is not only frequently used but is also central to a cluster of other important parameters, making it a key component of the operator’s strategy. PageRank is a specific variant of Eigenvector centrality [79] that assigns importance to nodes based on their connectivity and the importance of their neighbors. The values are computed by initializing the node scores and iteratively recalculating each node’s PageRank from the PageRanks of its neighbors until the values converge.
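Weighted PageRank is directly available in networkx; the toy graph below is illustrative, not a result from the operator data.

```python
# Sketch: rank tuning parameters by weighted PageRank. The node with
# the strongest weighted connections should rank highest.
import networkx as nx

G = nx.Graph()
G.add_edge("launch_angle", "undulator_taper", weight=4)
G.add_edge("launch_angle", "phase", weight=2)
G.add_edge("phase", "bunch_charge", weight=1)

pr = nx.pagerank(G, weight="weight")
ranked = sorted(pr, key=pr.get, reverse=True)
print(ranked[0])  # launch_angle: the hub of this toy graph
```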
Other measures we explored are (a) the degree centrality, which estimates node importance based solely on the number of connections, and (b) the clustering coefficient, which estimates the probability of a node being part of a cluster. The distributions of these measures across the tuning parameters closely resembled those observed for PageRank. Therefore, they are not discussed in the main text, but are included in Appendix F.

6.4.2. Edge Level

We explored two measures of edge importance: (a) the edge weights and (b) the edge betweenness centrality. The edge weights express the strength of connections between nodes in our model; they are simply the normalized frequencies of co-occurrence between two tuning parameters in the logs. A higher edge weight indicates a stronger perceived relationship between two parameters in the operators’ workflow. The second measure, edge betweenness, represents the importance of an edge in connecting different parts of the graph. In an XAI context, edges with high betweenness metrics are critical “bridges” in the model’s reasoning process. In our human model, a high-betweenness edge connects two otherwise disparate clusters of parameters, representing a crucial step that links two different sub-goals of the tuning task. It is calculated as the ratio of the number of shortest paths in the graph that contain the edge and the total number of shortest paths in the graph. We observed that the edge weights and the betweenness centrality portray similar (but opposite) changes with expertise. For brevity, we discuss the edge weights in the main text and include the edge betweenness in Appendix F.
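The bridging role captured by edge betweenness can be illustrated on a toy graph of two tight clusters joined by a single edge; labels are illustrative, not operator data.

```python
# Sketch: the single edge bridging two clusters carries every
# cross-cluster shortest path, so it has the highest betweenness.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a1", "a2"), ("a2", "a3"), ("a1", "a3")])  # cluster A
G.add_edges_from([("b1", "b2"), ("b2", "b3"), ("b1", "b3")])  # cluster B
G.add_edge("a1", "b1")  # the only bridge between the clusters

ebc = nx.edge_betweenness_centrality(G)
bridge = max(ebc, key=ebc.get)
print(sorted(bridge))  # the a1-b1 bridge
```

In the operator graphs, such a high-betweenness edge would correspond to a step linking two otherwise separate sub-goals of the tuning task.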

6.4.3. Community Level

Community Detection. Beyond individual nodes and edges, identifying groups of densely connected nodes, or communities, is a core task in both network science and graph-based XAI. In an XAI context, such communities can represent explanatory sub-structures or “motifs” that are collectively responsible for a model’s behavior. In our model of human expertise, we hypothesize that these communities correspond to the distinct functional categories that operators use to manage task complexity (a “divide and conquer” strategy). Community detection is an NP-hard problem, and therefore, finding optimal partitions through exhaustive search is intractable beyond small graphs. To illustrate, there are 52 possible partitions for 5 nodes, but 115,975 for 10 nodes. Because the number of possible partitions grows faster than exponentially with the number of nodes, heuristic-based approaches are used for community detection [80,81].
To test our second hypothesis, which posits that operators group subtasks into distinct functional categories, we examined the presence of community structures using three heuristic algorithms: (1) the Louvain algorithm, (2) Spectral Clustering, and (3) the Girvan–Newman (G-N) algorithm [73]. Notably, only Spectral Clustering requires specifying the number of partitions, whereas the other two automatically identify the number of communities by optimizing their respective objective functions. The strength of a partition of a graph into communities was evaluated with the modularity metric [82], which measures how unlikely the observed density of in-community edges would be if connections were distributed randomly across a size-matched graph. Modularity values range between [−1, 1], where a modularity of 0 corresponds to the random case and positive values indicate the presence of communities in the graph. Modularity close to 1 is rare in real graphs, and values between 0.3 and 0.7 are considered to indicate strong partitions [73,82]. We use the modularity value to test the second hypothesis. The Louvain algorithm and Spectral Clustering both yielded strong partitions with modularity above 0.3, whereas the G-N algorithm failed to find strong partitions. In the main text, we present the results of the Louvain algorithm, while all three are compared in Appendix F.
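The Louvain detection and modularity evaluation can be sketched as follows; recent versions of NetworkX ship both, and the two-clique graph here is a synthetic example rather than the operator data:

```python
import networkx as nx

# Synthetic example: two 5-node cliques joined by a single bridge edge,
# an unambiguous community structure
G = nx.complete_graph(5)
G.update(nx.relabel_nodes(nx.complete_graph(5), {i: i + 5 for i in range(5)}))
G.add_edge(0, 5)

# Louvain heuristic; the number of communities is found automatically
communities = nx.community.louvain_communities(G, seed=0)

# Modularity of the detected partition; values above 0.3 indicate strong structure
Q = nx.community.modularity(G, communities)
```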
Structures Within Communities. While community detection reveals what the functional groups are, it does not explain how they are internally organized or how they relate to each other. Understanding this nested, hierarchical structure is crucial for modeling complex, compositional reasoning. For this purpose, hierarchical clustering presents a suitable class of methods. Among several popular methods, we chose agglomerative hierarchical clustering, which groups nodes based on pairwise distances, due to its simplicity of interpretation [83,84]. The distances between clusters are computed using a linkage criterion. We adopted complete linkage, which uses the maximum distance between node pairs across two clusters. Agglomerative clustering begins at the lowest level with individual nodes and iteratively merges them based on pairwise distances, progressively forming larger, higher-level clusters. The resulting merging process can be visualized as a dendrogram. This tree-like structure provides a visual explanation of the task’s compositional structure, revealing how operators conceptually build up complex strategies by progressively grouping elementary subtasks into larger, higher-level functional units.
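A sketch of the agglomerative procedure with SciPy, using a hypothetical 4-subtask distance matrix and complete linkage:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical pairwise distances between 4 subtasks (symmetric, zero diagonal)
D = np.array([
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
])

# Complete linkage merges the two clusters with the smallest *maximum*
# pairwise distance between them
Z = linkage(squareform(D), method="complete")

# Cutting the dendrogram at height 0.5 recovers two functional groups
labels = fcluster(Z, t=0.5, criterion="distance")
```

The linkage matrix `Z` is what `scipy.cluster.hierarchy.dendrogram` draws; its merge heights are the quantities compared across expertise groups in Figure 6.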

6.4.4. Entire Graph Level

The graph-level measures enable us to combine information from different levels of the graph into a single quantity. Various representations and measures have been useful for comparing entire graphs (for reviews, see [85,86]). We represent each graph of n nodes in two standard matrix forms: (1) the adjacency matrix and (2) the Graph Laplacian matrix. For each form, we find the eigenvalues, which form the spectrum of the given graph. The set of eigenvalues then serves as the coordinates of the graph, represented as a point in an n-dimensional space.

6.5. Measuring and Comparing Changes at Different Levels of Graphs

At both the node and edge levels, we obtained probability distributions of the measures over all possible options. To quantify how the importance of individual subtasks (nodes) and their connections (edges) evolves with expertise, we must compare these distributions across different experience levels. To measure the changes at each of the two levels, we use two measures of distributional similarity: (a) Relative Entropy (RE), also known as Kullback–Leibler (KL) Divergence, and (b) the Overlapping Index (OI). RE is an information-theoretic measure of the difference between two distributions [87]. The OI measures the area of overlap between the two distribution functions [88]. Please see Appendix D for the equations.
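For discrete distributions, both quantities reduce to a few lines. This is a sketch with hypothetical values; the exact expressions are in Appendix D:

```python
import numpy as np
from scipy.stats import entropy

# Two hypothetical PageRank distributions over the same set of subtasks
p = np.array([0.5, 0.3, 0.2])   # reference period (e.g., first half-year)
q = np.array([0.4, 0.4, 0.2])   # later period

# Relative Entropy (KL divergence) of p from q, in nats
re = entropy(p, q)

# Overlapping Index: shared probability mass of the two distributions
oi = np.minimum(p, q).sum()
```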
At the community level, we find groups of nodes. To assess whether operators converge on a common “divide and conquer” strategy, we measured the similarity of the community structures found in the graphs of different expertise groups. We measured the changes using two similarity measures widely used in ML for evaluating clustering performance [89,90,91]: (a) the Adjusted Rand Index (ARI) and (b) the Adjusted Mutual Information (AMI). The expressions are included in Appendix D. The unadjusted versions of the measures range between 0 (all cluster assignments incorrect) and 1 (all correct). Notably, getting every assignment wrong is extremely unlikely and performs worse than random assignment. The adjusted versions are corrected for chance and range from −0.5 (all assignments incorrect) to 1 (all correct), with a value of 0 corresponding to the clustering accuracy expected from random assignments.
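Both adjusted measures are available in scikit-learn; note that they are invariant to permutations of the label names. The labelings below are hypothetical:

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Community labels for the same 6 subtasks from two expertise groups
novice = [0, 0, 1, 1, 2, 2]
expert = [1, 1, 0, 0, 2, 2]  # same grouping, different label names

ari = adjusted_rand_score(novice, expert)
ami = adjusted_mutual_info_score(novice, expert)
```

Because the groupings are identical up to renaming, both scores equal 1 here; random labelings would score near 0 after the chance correction.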
At the graph level, we sought a holistic measure to quantify the overall change in an operator’s strategic model. For this, we use two well-known spectral distances—(a) the Adjacency Spectral Distance (ASD) and (b) the Laplacian Spectral Distance (LSD)—based on the adjacency and Graph Laplacian matrices, respectively [86]. In each case, we first compute the eigenvalues of the corresponding matrix. Thereafter, the difference between two graphs is calculated as the Euclidean (L2) distance between their eigenvalue sets.
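A sketch of both spectral distances for undirected graphs of equal size; `spectral_distance` is an illustrative helper, not a function from the original pipeline:

```python
import numpy as np
import networkx as nx

def spectral_distance(G1, G2, kind="adjacency"):
    """L2 distance between the sorted eigenvalue spectra of two graphs."""
    spectra = []
    for G in (G1, G2):
        if kind == "adjacency":
            M = nx.adjacency_matrix(G).toarray()
        else:
            M = nx.laplacian_matrix(G).toarray()
        # eigvalsh: eigenvalues of a symmetric matrix, returned in ascending order
        spectra.append(np.linalg.eigvalsh(M))
    return float(np.linalg.norm(spectra[0] - spectra[1]))

# Two toy graphs on 6 nodes standing in for two expertise-group graphs
G_a = nx.cycle_graph(6)
G_b = nx.path_graph(6)

asd = spectral_distance(G_a, G_b, kind="adjacency")
lsd = spectral_distance(G_a, G_b, kind="laplacian")
```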
Please note that not all of these quantities are strictly metrics, as some do not obey the triangle inequality. However, they are numerical indices well suited to our purpose of examining changes in the measures over time. Another point to note is that some measures of change we use are distance-based, whereas others are similarity-based. The distance measures have a minimum of 0 (identical distributions) and grow without bound as the distributions diverge (zero similarity), whereas the similarity measures are scaled between 0 (no similarity) and 1 (identical distributions). To aid comparisons across levels, we convert all measures to distance measures in the main paper (Section 7.1). There are multiple ways to transform between distance and similarity metrics. In this work, we use the Gaussian kernel relationship, Similarity = exp(−Distance²), equivalently Distance = √(−ln(Similarity)), to convert between the similarity and distance measures.
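Assuming the squared-exponential (Gaussian kernel) form Similarity = exp(−Distance²), the conversion and its inverse are:

```python
import math

def similarity_to_distance(s):
    # d = sqrt(-ln(s)), the inverse of s = exp(-d**2); requires 0 < s <= 1
    return math.sqrt(-math.log(s))

def distance_to_similarity(d):
    return math.exp(-d ** 2)

# Round trip: a similarity of 0.8 maps to a distance and back
d = similarity_to_distance(0.8)
s = distance_to_similarity(d)
```

Identical distributions (similarity 1) map to distance 0, and the distance grows without bound as similarity approaches 0, matching the conventions above.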

6.6. Statistical Significance of Changes at Graph Levels

To test our first hypothesis (i.e., whether changes in graph metrics with operator experience are statistically significant), we employed both frequentist and Bayesian linear regression. For each metric, we tested for a positive slope against the null hypothesis of no change. For frequentist regression, we used ordinary least squares to estimate slopes, one-sided p-values, 95% confidence intervals, and R², assuming normality and homoscedasticity. For Bayesian regression, we standardized the inputs, placed a Normal(0, 1) prior on the slope, and sampled with a high target acceptance rate to compute posterior mean slopes, 95% highest density intervals, and Bayes Factors (BF10) via the Savage–Dickey density ratio [92]. We use p < 0.005 and BF10 > 3 as thresholds for statistical significance. Notably, the Bayes factor provides a direct measure of the strength of evidence in favor of each hypothesis [92]. This dual approach therefore offers complementary insights, combining traditional significance testing with Bayesian evidence to assess expertise-driven graph changes. Please see Appendix E for more details of the tests.
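The frequentist half of this procedure can be sketched with SciPy on simulated data (the `alternative="greater"` option requires SciPy ≥ 1.7; the Bayesian half would typically use a probabilistic programming library and is omitted here):

```python
import numpy as np
from scipy.stats import linregress

# Simulated half-yearly distance values that increase with experience
experience = np.arange(1, 13) * 0.5                      # years
rng = np.random.default_rng(0)
metric = 0.05 * experience + rng.normal(0.0, 0.01, 12)   # hypothetical metric

# One-sided test for a positive slope (H1: slope > 0)
result = linregress(experience, metric, alternative="greater")

slope, p_value, r_squared = result.slope, result.pvalue, result.rvalue ** 2
```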

7. Results

We begin by (1) summarizing the changes at four levels (node PageRanks, edge weights, communities, and the whole graph) with respect to experience and (2) performing statistical tests. Thereafter, we examine changes at each level across three stages of operator expertise (Novice, Intermediate, and Expert), taking a top-down approach (i.e., starting with whole graphs and progressively moving down to nodes).

7.1. Changes in the Graphs with Years of Experience

We measured how much the operators’ strategic models, as represented by the graphs, change with each half-year of experience at different levels. We use data from the first half-year as a reference and measure the changes across years relative to this reference. The results are shown in Figure 4.
The first two graph levels (respectively, in Figure 4a,b) are: (a) the “human feature importance” of individual subtasks, represented by PageRank distributions over the nodes, and (b) the strategic linking of subtasks, represented by edge weight distributions over all possible pairs of subtasks. We measured changes in the PageRank and edge weight distributions using two metrics: RE (blue) and OI (orange). As we see, both metrics show consistent trends: distances increase with experience, peak around Years 2.5 and 5, and then reach plateaus. Notably, the edge weights evolve more dynamically than the PageRanks, suggesting that learning how to combine subtasks is a more significant aspect of expertise development than learning the importance of individual subtasks. However, this difference may stem in part from the unequal number of options in the distributions (27 individual subtasks vs. 351 edges). Nonetheless, the changes are largely continuous and unidirectional. Therefore, the systematic changes with expertise should be well captured in larger groups spanning years, which we explore in the next sections.
We estimated the changes in the communities (Figure 4c) with two measures—the ARI (blue) and the AMI (orange) values—using the first period as the reference. As we see, both measures capture nearly identical changes. However, compared to PageRank and edge weights, the community changes with experience seem discontinuous. The greatest change occurs in the first year, after which distance values remain consistent across years. This plateau suggests that operators quickly converge on a stable, high-level “divide and conquer” strategy for organizing the task. As we show in later sections, the communities remain similar even in our grouped investigation. Crucially, this interpretation was externally validated by domain experts, who confirmed that the communities discovered by our model align perfectly with the three main functional sub-goals of the real-world task: (1) beam transport corrections, (2) beam energy and compression tuning, and (3) miscellaneous, less routine tasks. This alignment provides strong evidence that our graph models are capturing a meaningful, real-world cognitive structure.
For the whole graph (Figure 4d), we used the adjacency (blue) and the Laplacian (orange) spectral distances, which measure the differences between the adjacency and Graph Laplacian matrix representations of the graphs, respectively. These spectral distances capture shifts in the global structure of the task graph, providing a compact and sensitive measure of how operators’ overall workflow organization changes across years. Both spectral distances rise continuously with expertise, indicating that the whole graphs change consistently as operators gain expertise. As the whole graphs encapsulate the graph levels examined earlier, these distances aggregate all the different ways the graphs change over the years. This aggregation is reflected in the peak at the 2.5-year mark and the trough at Year 3, resembling the PageRank and edge weight curves (Figure 4a,b). There are also some notable differences: the peaks observed at Year 5 for PageRank and edge weights are not reflected in the whole-graph distances. These differences suggest that while node- and edge-level changes contribute substantially to overall graph evolution, they do not fully account for the changes captured by spectral distances. The continuous rise in whole-graph distance provides a powerful, holistic confirmation of our central hypothesis: expertise involves a global, structural transformation of an operator’s mental model of the task, reflecting a deeper and more organized understanding of the domain.

Significance of Results and Hypothesis Tests

Overall, these results largely support our first hypothesis, as we find measurable changes at all graph levels tested except the communities. To test whether the changes are significant, we conducted both frequentist and Bayesian hypothesis tests. Here we present a summary of the tests (Table 2). Please see Appendix E for more details.
Frequentist linear regression (Table 2 and Table A2) revealed significant positive slopes for six of eight change metrics, supporting expertise-driven changes in FEL tuning strategies at the PageRank, edge weight, and whole-graph levels. Community-level changes (ARI and AMI) were statistically insignificant, suggesting stable community structures across expertise levels. Bayesian linear regression (Table 2 and Table A3) corroborated these results, demonstrating extremely strong evidence for positive slopes in six of eight graph metrics, with BF10 ranging from 50 to ≈5 × 10^30. Once again, community-level changes (ARI and AMI) showed weak evidence (BF10 ≈ 1), consistent with the frequentist findings and confirming that the communities remained stable at all stages of experience.
Based on these results, we accept the first hypothesis for three of the four levels (excluding the community level). In the next sections, we take a deeper dive into the changes at each level by binning the operators into three groups: Novice (<1 year of experience), Intermediate (1–4 years), and Expert (>4 years).

7.2. Whole Graphs

Figure 5 provides a bird’s eye view of the FEL Tuning graphs for the three groups of operators. Each node represents one of the 27 subtasks (i.e., tuning parameters to set), and their sizes in the plot are set proportional to their PageRank, representing their “human feature importance”. The graphs are drawn using a spring layout, where edges act as springs with spring constants proportional to their weights. Therefore, nodes that are strongly linked in the operators’ strategies are pulled close together. The communities detected using the Louvain algorithm, representing high-level functional groupings, are shown in different colors.
Visually, the strategic models for all three groups are characterized by a core of dense communities and a periphery of less-connected nodes. The internal organization and density of these communities appear to evolve with expertise. There are also changes in node sizes (i.e., PageRanks) and in the distances between nodes (i.e., edge weights), which are difficult to verify in this qualitative visualization. Notably, we observe largely the same three communities across all groups. The clear separation of these communities in the layout provides strong visual evidence for the “divide and conquer” approach operators take, partitioning the complex problem space into distinct, manageable sub-problems. In the next section, we examine the communities in detail and test our second hypothesis.

7.3. Groups of Subtasks: Communities and Hierarchies in the Graphs

We first tested our hypothesis (H2) that operators employ a “divide and conquer” strategy by examining the community structure of their strategic models. We used the Louvain algorithm to detect communities and evaluated the partition strength using modularity, which measures how much more densely connected the communities are than expected under a random configuration. Values between 0.3 and 0.7 are considered to indicate strong partitions [73,82].
For each group, modularity values of the partitions were all above 0.30, indicating strong community structure across levels of expertise. Specifically, the modularity was 0.367 for novices, 0.349 for intermediates, and 0.313 for experts. These consistently high values suggest that the detected communities are unlikely to have arisen by chance and reflect meaningful structural organization in the task networks of all operator groups. To validate these findings and examine the uncertainty of these estimates, we conducted bootstrap tests across 100 resampled subsets of the elogs (80% of the data sampled without replacement). The estimated mean modularity values for the novice, intermediate, and expert groups were 0.370, 0.352, and 0.311, respectively, with corresponding standard deviations of 0.011, 0.007, and 0.008, confirming the stability and reliability of the detected partitions. These results provide strong support for our second hypothesis (H2), which posited that operators at all levels of expertise would exhibit strong community structures in their task networks.
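The bootstrap procedure can be sketched as follows. The planted-partition graph is a synthetic stand-in for the elog-derived graphs, and for simplicity the resampling here is over edges rather than log entries:

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

# Synthetic stand-in: a graph with three planted communities of 9 nodes each
G = nx.planted_partition_graph(3, 9, p_in=0.8, p_out=0.05, seed=0)
edges = list(G.edges())

modularities = []
for _ in range(100):
    # Resample 80% of the data without replacement, as in the elog bootstrap
    idx = rng.choice(len(edges), size=int(0.8 * len(edges)), replace=False)
    H = nx.Graph([edges[i] for i in idx])
    comms = nx.community.louvain_communities(H, seed=0)
    modularities.append(nx.community.modularity(H, comms))

# Mean and standard deviation quantify the stability of the partitions
mean_Q, std_Q = float(np.mean(modularities)), float(np.std(modularities))
```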
Importantly, the three groups of operators demonstrate remarkable similarities in categorizing the subtasks into communities. We find exactly three communities in the graphs for all groups. These communities are largely similar, with only one or two subtasks being classified differently across groups (e.g., Parameter 0 for the experts and Parameters 3 & 4 for the intermediates). As mentioned earlier, the three communities correspond to beam transport (green), beam energy/compression (purple), and miscellaneous (pink), respectively, and this classification was validated by several system experts. These similarities among the communities are quite striking, given the extremely large number of possible partitions (≈5.45 × 10^20). They strongly suggest that operators at all stages of expertise recognize and categorize the parameters into similar groups. Therefore, any differences in tuning performance with expertise are unlikely to stem from improvements in categorizing the parameters into communities.
If the what of the strategy is stable, where does expertise come from? We hypothesized it lies in the how: the internal organization and integration of subtasks within these stable communities. We examined this compositional structure using hierarchical clustering. The results are presented as dendrograms (Figure 6) which demonstrate clear, directional changes across expertise levels. The height of the dendrograms represents the distance at which two nodes are clustered into the same group (marked by the horizontal connections). The dendrograms decrease in height as expertise increases, indicating that the conceptual distance between subtasks shrinks. Experts develop a more compact and tightly integrated strategic model. For Novices, the heights at which subtasks converge are much higher and vary considerably more than for the other two groups. For the Intermediate and Expert operators, the dendrogram heights not only become lower and more concentrated within a smaller range, but we also observe that several subtasks converge at almost the same height, suggesting the formation of well-defined, stereotyped routines. Finally, the differences in dendrograms appear to be much larger between the novices and the intermediates than between the intermediates and the experts, reflecting the steep learning curve for novice operators observed in Figure 4.
Taken together, these findings paint a clear picture of expertise development. The high-level “divide and conquer” strategy is a stable foundation. Expertise is the process of refining the internal, hierarchical structure of this strategy, building a more compact, efficient, and interconnected cognitive model of the task. In the next sections, we build on our understanding of the communities and hierarchies in the graphs to probe these differences at the node and edge levels.

7.4. Interconnections Between Subtasks—Edge Weight Distributions

Figure 7 shows the distributions of edge weights over all possible edges. The edges are presented in ascending order of the nodes they connect (e.g., Edge 0 connects Nodes 0 and 1, Edge 1 connects Nodes 0 and 2, etc.), and the weight represents the strength of connections between the nodes. We classify edges as in-community (green plus signs) or out-of-community (black cross signs), depending on whether they connect nodes within the same community.
We see that edges with high weights are predominantly in-community, while those with low weights are mostly out-of-community. To summarize these differences, we calculated the ratio of average weights for out-of-community vs. in-community edges. The edge-weight ratios are 0.27, 0.31, and 0.43 for the Novice, Intermediate, and Expert groups, respectively. This increase in the ratio is due to both reduced in-community weights and increased out-of-community weights. As the edges connect subtasks recorded in close proximity, these results suggest that Novices tend to stay within communities when choosing subtasks, whereas operators with more expertise tend to choose subtasks across communities.
Regarding the distributional changes, the distributions generally become more uniform with expertise. For Novices, the distributions are highly skewed, with weights markedly higher for edges near the right extreme of the plot. For instance, the most pronounced peak appears between subtasks 24 (Undulator Launch) and 25 (Undulator Orbit), indicating that these two tasks are strongly interconnected in their mental models. With expertise, the weight of these dominant “novice” links decreases, and the overall distribution becomes more uniform. This suggests that experts move beyond a few rigid, high-weight routines and develop a more balanced and adaptable repertoire of strategies.
Apart from these differences, we observe that edges with higher weights than the others (near the center of distribution) remain quite similar across the three groups. For example, for each group, the edges labeled 200–250 and 300–350 seem to have the highest weights. We attribute the similarities at the edge level to the similarities of the communities. As the communities remain the same, the same sets of edges connect the elements within and outside of the communities. Our analysis indicates that the evolution towards greater cross-community linkage is most prominent in the “miscellaneous” community, suggesting that a key part of becoming an expert is learning how to integrate these less routine tasks with the core functional groups of the system.

7.5. Individual Subtasks—Node PageRank Distributions

We investigated the “human feature importance” of each subtask using PageRank. As discussed in the Methods, PageRank estimates a node’s importance (in a graph-theoretic sense, not as a domain-specific measure of importance) based on both the number of edges and their weights. The PageRank distributions over the parameters are shown in Figure 8. Each parameter is color-coded by the community it belongs to, based on the communities detected using the Louvain algorithm (Figure 5).
Similar to the edge weights, the PageRank distributions appear to become more uniform with increasing expertise. These distributions exhibit several interesting patterns in relation to the communities in the graph. For all operator groups, the beam transport (green) subtasks have consistently high PageRank values. This result is expected, since transporting the beam from the origin to the destination is a crucial step of FEL tuning. The PageRank values for the beam energy and compression (purple) subtasks also appear to remain largely consistent, increasing only slightly with expertise. However, for the Miscellaneous (pink) subtasks, there are large changes in the distributions across stages of expertise. For Novices, this set contains subtasks with the highest PageRank values. This suggests that novices initially focus heavily on a few “miscellaneous” (and perhaps more easily accessible) tuning knobs. As they gain experience, the importance of these specific subtasks decreases, and a more balanced strategic focus emerges.
The most dramatic shift occurs with the Undulator tuning subtasks (24, 25, 26). These have very high importance for novices but see their PageRank decrease significantly for experts. These parameters are physically co-located and represent a basic, local tuning strategy. Conversely, Subtasks 11 and 12 (Quadrupole Magnets) show the opposite trend: their “human feature importance” increases significantly with expertise. The devices related to these subtasks are located far upstream in the machine, and the results from adjusting them are more dramatic and less predictable than those from other tuning subtasks. The PageRank values suggest that with expertise, operators develop the confidence and deep system understanding required to use these powerful, high-impact controls. This provides a clear signature of the cognitive shift from a simple, spatially aligned mental model (tuning what is in front of you) to a more sophisticated, physics-based model that understands action at a distance.
Finally, we observed in Section 7.3 that, while the communities remain the same, the graph structures and community structures differ across groups. Therefore, even if they divide the whole task into the same parts, operators of varying expertise differ in the underlying steps of task selection and sequencing. These differences are likely to account for the observed differences in the PageRank distributions. As we see, the PageRank values for subtasks within particular communities tend to move together, suggesting that operators adapt their task selections and sequencing within stable community boundaries rather than redefine those boundaries. In other words, expertise appears to manifest not through restructuring the whole task but through selective reorganization of its elements (e.g., prioritizing different subtasks and altering connection weights) to navigate the same functional categories with increasingly efficient strategies.

8. General Discussion

This study set out to demonstrate how analytical methods from the graph-XAI toolkit could be repurposed to create quantitative, interpretable models of human expertise. By applying this framework to over a decade of operational logs from a particle accelerator, we constructed structural models of operator strategy at different levels of experience, revealing a clear and systematic picture of how expertise develops in this complex, high-stakes domain.
In this work, we adopted a graph-theoretic approach to investigating expertise in the complex task of operating particle accelerators. Processing the text in a large archive of more than 350,000 operational logs, we modeled the task as a graph of its elementary subtasks. Thereafter, we investigated the changes with expertise using a host of measures and methods at four graph levels: (1) individual subtasks (nodes), (2) the interconnections among subtasks (edges), (3) the groups of subtasks (communities), and (4) the whole task (entire graph) of tuning the brightness of the particle accelerator beam.
At each level, we examined the changes using multiple measures and statistical tests. Supporting our first hypothesis, our results reveal measurable, statistically significant changes with expertise at three of the four levels we tested: specifically, the subtask, interconnection, and whole graph levels. On the other hand, the communities remained remarkably similar despite a very large scope of variation, as the operators consistently grouped subtasks into three specific communities that discernibly aligned with the transport, compression, and less routine subtasks, as validated by control room experts.
Our modularity-based tests show that graph partitions into these three specific communities are extremely unlikely to occur randomly. After consulting with multiple operations experts, we found that two communities correspond to two important subgoals of FEL tuning (beam transport and energy compression), whereas a “miscellaneous” community accommodates the remaining, less routine subtasks. These results strongly support our second hypothesis that the operators divide the numerous subtasks of FEL tuning into distinct functional categories, reflecting a divide-and-conquer approach to dealing with task complexity. This consistency implies an organizational structure inherent to the strategy space that experts can leverage for more competent control of the machine; at the same time, part of the strong agreement in subtask groupings and strategy is likely due to operators passing through a common training program, suggesting a strong opportunity for impactful interventions through training. However, due to the limitations of our study, these findings need to be further substantiated using new datasets and across domains for generalization (discussed in the next section).
Probing beneath this consistency in communities, we find that the organization of subtasks within them changes considerably with increasing expertise, as subtasks become closer and the graphs become denser—indicating improved organizational structure with the acquisition of domain knowledge about subtasks. These changes were also reflected at the node and edge levels. For novices, in each case, we find the distributions of these measures skewed towards the miscellaneous group, whereas the distributions become more uniform at higher levels of expertise. Moreover, while novices were more likely to follow one task with another in the same community, experts demonstrated an increase in out-of-community connections, suggesting that confidence with less routine subtasks (riskier due to low reproducibility or high impact) develops alongside expertise. Generally, these results demonstrate that operator strategies are built on common foundations (i.e., the same partition of the graphs) but evolve systematically with expertise: novices show a more scattered understanding and an in-community focus (e.g., PageRank skewed toward miscellaneous subtasks and higher in-community edge weights), whereas experts adopt integrated, cross-community approaches (more uniform distributions and rising out-of-community weights).
Generally, our findings highlight the promise of graph-based approaches as a unified lens for examining and explaining how humans acquire system expertise. The space of possible solution strategies is vast in complex problems, and learners need to explore and find improved strategies with experience to improve performance. To explain complex-task learning, we need to explain how problem-solving strategies evolve with experience. As we demonstrated, graph-based methods offer a powerful framework for capturing the evolution of strategies. By analyzing changes across different graph levels (e.g., shifts in node centrality, edge weights, and community organization), we can trace how learners transition from fragmented, localized approaches to more integrated, efficient strategies. These representations illuminate the cognitive scaffolding that underpins expertise and provide interpretable metrics to track learning progress. Ultimately, such studies would deepen our understanding of the nuanced ways in which expertise develops, offering a promising path toward designing ML models that align with and enhance human capabilities in complex, high-stakes environments. This way, the unified lens of graph theory can help us bridge cognitive science, technical training, and artificial intelligence to empower human-machine teams to tackle the complexity of our real world.

Limitations and Future Directions

Among the numerous control room tasks, we focused on the critical task of FEL tuning. A holistic set of tasks would yield much larger graphs with many more subtasks and, consequently, a sterner test for our methods. To study expertise non-invasively, we relied on elog entries: timestamped text logs describing control room events.
Our methodology relies on the core assumption that co-occurrence in written logs is a valid proxy for an operator’s cognitive strategy. As we noted in the Methods (Section 6), this assumption is grounded in the well-established distributional hypothesis. However, we acknowledge that this proxy has limitations. Elog data is an incomplete trace of cognition. It does not capture unstated thoughts, verbal communication, or actions deemed too trivial to log. Additionally, a portion of the logging is retrospective, wherein the sequence of events may represent a summary constructed after the fact rather than a perfect real-time record. While this can affect the fine-grained temporal accuracy of co-occurrence, such summaries still represent the operator’s own reconstructed narrative, highlighting the subtasks they deemed most significant and how they chose to mentally group the procedure for reporting. Furthermore, logging behavior itself may change with expertise, wherein experts might write more concise, higher-level summaries than novices. While our analysis accounts for these differences by focusing on aggregated patterns, future work could enrich these models by integrating additional data streams, such as eye-tracking or control system command logs, to create a more complete picture of the operator’s process. Another limitation is that our findings are derived from a single facility with its own unique training culture. Comparative studies across different complex control environments would be a valuable next step to test the generalizability of these findings. Furthermore, this study uses years of experience as an indicator of proficiency. This ignores individual differences in learning and is, at best, a statistical descriptor of expertise.
An exciting future direction is to build on our methods. There are many ways to represent graphs, assess their characteristics, and compare them to investigate changes in task selection and organization with expertise; this methodological freedom is itself a limitation. To mitigate it, we (1) used multiple algorithms and measures and accepted only results consistent across them, (2) performed significance and stability tests, and (3) conducted several rounds of expert validation. A promising automated approach to graph construction comes from graph-based ML (e.g., graph representation learning and graph neural networks). These methods have been helpful in capturing the evolution of graphs [23,24] and can aid in automating graph construction. Developing graph models that combine multivariate and multimodal streams of information is itself an interesting problem. Naturally occurring data from real-world environments are often available in different forms. Graphs can be constructed from the varying relationships across different data streams, but integrating them into a single model remains underexplored and requires testing and replication.
Finally, the most significant future direction is to build upon the foundational models developed in this work to inform the design and evaluation of AI systems. Our quantitative models of expert strategy provide an empirical “ground truth” for human-AI alignment. Instead of designing ML models based on abstract needs, future work can use these expert graphs as a concrete blueprint. A key next step would be to develop an ML model for FEL tuning and then use the same graph-based methods to model its strategy. By comparing the ML model’s strategic graph to the human expert graph using the distance metrics developed here, we can create a quantitative, interpretable measure of human-model alignment. This moves beyond simple performance metrics to answer a deeper question: does the ML model “think” about the problem in a way that is compatible and legible to its human partner?
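The proposed comparison between an ML model's strategic graph and the human expert benchmark can be illustrated with a small sketch. This is a minimal, hypothetical example assuming networkx and numpy; the toy graphs and the spectral-distance helper stand in for the study's actual graphs and distance metrics.

```python
import numpy as np
import networkx as nx

def adjacency_spectral_distance(g1, g2):
    """Euclidean distance between the sorted adjacency spectra of two weighted graphs."""
    nodes = sorted(set(g1.nodes) | set(g2.nodes))
    a1 = nx.to_numpy_array(g1, nodelist=nodes)
    a2 = nx.to_numpy_array(g2, nodelist=nodes)
    s1 = np.sort(np.linalg.eigvalsh(a1))
    s2 = np.sort(np.linalg.eigvalsh(a2))
    return float(np.linalg.norm(s1 - s2))

# Toy "strategic" graphs over the same subtasks (illustrative weights only)
expert = nx.Graph()
expert.add_weighted_edges_from([(0, 1, 2.0), (1, 2, 1.5), (0, 2, 0.5)])
model = nx.Graph()
model.add_weighted_edges_from([(0, 1, 1.0), (1, 2, 1.0)])

print(adjacency_spectral_distance(expert, model))   # larger value = less aligned
print(adjacency_spectral_distance(expert, expert))  # 0.0 for identical graphs
```

In this framing, a smaller distance between the model's graph and the expert graph would indicate closer strategic alignment.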

9. Summary and Conclusions

We introduced a novel approach to studying expertise by re-purposing analytical tools from the graph-XAI toolkit to model human problem-solving in a complex, real-world task. Using 14 years of operational logs from a particle accelerator, we constructed weighted networks of subtasks to represent operator strategies at different experience levels. Our results reveal directed expertise-driven changes at the subtask, interconnection, and whole-graph levels. Crucially, we found that while the high-level “divide and conquer” strategy (partitioning the task into three stable, functionally distinct communities) is a common scaffold shared by all operators, the development of expertise lies in the refinement of strategies within and between these communities. Experts build more compact and flexible strategic models, demonstrating a shift from simple, local routines to a more integrated, physics-based understanding of the system.
The primary contribution of this work is a concrete methodology for creating quantitative, interpretable models of human expertise directly from observational data. By establishing an empirical, structural baseline of expert human performance, our framework provides a necessary “ground truth” for the future of human-ML model alignment. This enables a new, concrete approach to alignment verification: one can now quantitatively compare the strategic graph of an ML model against this human expert benchmark to identify precisely where their problem-solving approaches converge and diverge. This moves beyond simply measuring task success to asking a deeper question: does the ML model “think” about the problem in a way that is legible and compatible with its human partner? Ultimately, this work takes a step toward a data-driven science of human-ML model alignment, paving the way for safer and more effective partnerships in complex environments.

Author Contributions

Conceptualization, R.R., J.S. and W.-L.H.; methodology, R.R., A.A.M., J.S. and W.-L.H.; software, R.R. and A.A.M.; validation, R.R., A.A.M., J.S. and W.-L.H.; formal analysis, R.R. and A.A.M.; investigation, R.R., J.S. and W.-L.H.; resources, J.S. and W.-L.H.; data curation, R.R. and J.S.; writing—original draft preparation, R.R.; writing—review and editing, A.A.M., J.S. and W.-L.H.; visualization, R.R.; supervision, A.A.M., J.S. and W.-L.H.; project administration, J.S. and W.-L.H.; funding acquisition, J.S. and W.-L.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the SLAC Laboratory Directed Research and Development (LDRD) program (grant number: 22-031) under the project “Performance Optimization for Human-in-the-Loop Complex Control Systems,” led by Wan-Lin Hu.

Institutional Review Board Statement

This study was conducted using secondary data from operational logs involving human participants. All personally identifiable information (PII) was removed prior to analysis. The research protocol was reviewed and approved by the Stanford University Institutional Review Board (IRB), Protocol No. 72743, in March 2024.

Informed Consent Statement

Participant consent was waived due to the use of de-identified secondary data and the absence of direct interaction with participants.

Data Availability Statement

A Python (3) implementation of all methods and tests used can be found here: https://github.com/Roussel006/Expertise-in-Operating-Particle-Accelerators-through-Network-Models-of-Performance (accessed on 1 February 2026). The elog dataset can be found here: https://osf.io/qmt2x/ (accessed on 1 February 2026). Note that the elog dataset has been pre-processed and anonymized.

Acknowledgments

We would like to thank the accelerator operations team at SLAC National Laboratory, who generously allowed us into the control room and shared their wealth of system expertise to make our work possible. Specifically, we would like to thank: the operators for giving us insight into their unique and incredible skill set and generating the elog data set; Peter Schuh, Johnny Warren, and Alex Saad for supporting and enabling this research and connecting us with the operators; Benjamin Ripman and Janice Nelson for consulting on the design of this study with special thanks to Matt Gibbs who did all that and helped us navigate a myriad of networking and software development challenges. We would like to thank John Schmerge for being our upper-management advocate and extend our deepest gratitude to him and Mike Dunne for securing the additional funding we needed to complete this research.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study, in the collection, analysis, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript (ordered alphabetically):
AI: Artificial Intelligence
AMI: Adjusted Mutual Information
ARI: Adjusted Rand Index
ASD: Adjacency Spectral Distance
BC: Bunch Compressor
FEL: Free Electron Laser
G-N Algorithm: Girvan–Newman Algorithm
IRB: Institutional Review Board
KL: Kullback–Leibler (Divergence)
LCLS: Linac Coherent Light Source
LSA: Latent Semantic Analysis
LSD: Laplacian Spectral Distance
LSI: Latent Semantic Indexing
ML: Machine Learning
OI: Overlapping Index
PII: Personally Identifiable Information
RE: Relative Entropy
ToH: Tower of Hanoi
XAI: Explainable Artificial Intelligence

Appendix A. The 27 FEL Tuning Subtasks

In a Free-Electron Laser (FEL), electron bunches are accelerated to nearly the speed of light in a linear accelerator. As these high-energy charged particles travel through a long array of alternating magnets—called an undulator—they wiggle side to side and emit radiation. Under appropriate conditions, this radiation interacts coherently with the electron bunches, causing a runaway amplification process that produces extremely bright, ultrafast X-ray pulses.
Tuning is the process by which accelerator operators adjust machine parameters so that amplification occurs as efficiently as possible. Even small mismatches between the electron beam and the undulator conditions can dramatically reduce X-ray output, so operators iteratively refine many aspects of the machine to bring everything into alignment. At a high level, the tuning process typically involves achieving three beam parameters requested by the scientists. The first two parameters are target photon energy and pulse length, which are set by configuring the accelerator energy, undulator gap, and compression settings. The third parameter is the X-ray beam brightness or pulse intensity. Maximizing beam brightness is a core goal of FEL tuning and must be achieved by adjusting a set of 27 known tuning parameters shown in Table A1.
As discussed in the main text, understanding the changes in operator strategies in adjusting this set of tuning parameters through their graphs is a main goal of our study. One reason we emphasized identifying communities as distinct subsets of tuning parameters is that they correspond to distinct subgoals of the tuning process. For example, to align the electron beam through the undulator in the intended direction of travel, operators adjust steering magnets, quadrupoles, and orbit feedback systems to ensure the beam travels cleanly through the undulator with minimal losses. While doing so, operators adjust parameters such as bunch charge, compression, energy spread, and emittance to optimize conditions for coherent X-ray generation. Some common features of control room problems that highlight why they are hard to master are: (1) extremely complex moving optima within large parameter spaces [93,94,95,96], (2) large uncertainty of outcomes (e.g., due to measurement errors, periodic drifts in machine states, and hidden/unobserved parameters), and consequently, (3) no deterministic solution strategies. In this context, operators must rely on heuristics, past experience, and trial-and-error search to converge on a configuration that produces stable, bright X-ray pulses.
Table A1. List of FEL tuning parameters.
ID | Parameter
0 | LASER Iris Position
1 | Gun Solenoid Strength
2 | Schottky Phase
3 | CQ01 SQ01
4 | XC01 YC01
5 | 135 MeV Bunch Length
6 | LASER Pulse Stacker Delay
7 | LASER Heater e-beam Overlap (3D)
8 | LASER Heater Power
9 | Injector Matching Quads QA01 QA02 QE01-04
10 | XCAV Launch Horizontal and Vertical
11 | S21 Matching Quads 21Q201 QM11-13
12 | BC1 Dispersion Quads CQ11 CQ12
13 | BC1 Horn Cutting
14 | BC1 Compression/Bunch Length
15 | 21-2 L1X Amplitude and Phase
16 | L2 Transverse Steering Launch
17 | BC2 Compression/Bunch Length
18 | BC2 Dispersion Quads CQ21 CQ22
19 | Sector 26 Matching Quads
20 | DL2 Dispersion Quads
21 | LTU Matching Quads
22 | SXRSS Chicane Delay
23 | HXRSS Chicane Delay
24 | Undulator Launch
25 | Undulator Orbit
26 | Undulator Taper

Appendix B. Sample Elogs

Figure A1. Sample elogs on the Tuning Task. Personally identifiable information has been redacted in these examples, reflecting our text preprocessing step. The top example shows an independent elog, whereas the bottom example shows two related elogs created on the same thread.

Appendix C. Latent Semantic Indexing or Analysis (LSI/LSA)

In this work, the elog entries and the article on FEL tuning are represented as multi-dimensional vectors over all words or terms. The vector space representation offers several advantages, the most relevant being the use of similarities between entries in the vector space for information retrieval or classification [77]. However, vector space representations based solely on the counts or frequencies of terms suffer from two common issues in NLP: (a) synonymy—many terms may have the same meaning, and (b) polysemy—one word may have different meanings. In this work, we use Latent Semantic Indexing (LSI), a popular topic modeling method that is well-equipped to deal with synonymy by taking the context of each term into account [77]. The problem of polysemy is mitigated in our case, as the operators rarely use the same word to mean different entities. In addition, LSI allows for reducing a document-word matrix (representing a whole database) from a high-dimensional space of words into a lower-dimensional space of topics by mapping sets of words to topics.
LSI consists of two steps: (a) Term Frequency-Inverse Document Frequency (TF-IDF) vectorization and (b) Singular Value Decomposition (SVD). In the TF-IDF representation for each document, the dimensions are the words, and the magnitudes are their TF-IDF values. TF of a term or word refers to the ratio between the number of times a word is used in a document and the total number of words in the document. IDF of a term is the logarithm of the ratio between the number of documents in the dataset (i.e., the elog entries and the tuning article) and the number of documents in which the term appears. The TF-IDF value for a term is the product of TF and IDF. Scaling by IDF helps to reduce the importance of terms that may occur very frequently (e.g., the term “electron” in FEL tuning) but convey little information about the specific document.
The second step is performing SVD on the vectorized documents. SVD decomposes the original term-document matrix M of dimensions n terms × m documents into the form M = U S V^T. The three factor matrices are (a) U, the term-topic matrix of dimensions n × k; (b) S, a k × k diagonal matrix containing k singular values representing the importance of the k topics; and (c) V^T, a topic-document matrix of dimensions k × m. In the full case, the number of topics k is the minimum of (m, n). To truncate, we set k lower, keeping only the k highest singular values while zeroing out the rest. In this way, SVD reduces the dimensionality of the data by mapping words into a reduced number of topics. In our work, we truncate our original matrix of more than 12,000 unique words using k = 100 topics.
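The two-step LSI pipeline (TF-IDF vectorization followed by truncated SVD) can be sketched with scikit-learn. This is an illustrative example on a toy corpus of invented elog-like sentences, not the study's actual data or code; the real pipeline uses k = 100 topics over more than 12,000 unique words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus standing in for elog entries (invented examples)
docs = [
    "tuned undulator taper to recover fel pulse intensity",
    "adjusted undulator taper and orbit feedback",
    "bc2 compression scan to shorten bunch length",
    "changed bc2 compression settings after energy drift",
]

# Step (a): TF-IDF vectorization of the documents
tfidf = TfidfVectorizer()
m = tfidf.fit_transform(docs)           # sparse document-term matrix

# Step (b): truncated SVD maps documents into a k-dimensional topic space
k = 2                                    # toy value; the paper uses k = 100
svd = TruncatedSVD(n_components=k, random_state=0)
topic_vecs = svd.fit_transform(m)        # one k-dimensional topic vector per document

print(topic_vecs.shape)  # (4, 2)
```

Document similarities can then be computed in the reduced topic space rather than the raw word space, which mitigates synonymy.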

Appendix D. Measuring Changes at Different Graph Levels

Appendix D.1. Distributional Changes

Suppose we have two distributions p(x) and q(x), and we aim to measure how different or similar they are. One of the most suitable and popular measures for this purpose is the Relative Entropy (RE), or KL-Divergence [87,97]. RE measures the information contained in a target distribution p(x) that is not contained in a reference distribution q(x). The RE of p(x) relative to q(x) can be expressed as follows [87]:
RE(p(x) \,\|\, q(x)) = \int_X p(x) \log_a \frac{p(x)}{q(x)} \, dx \qquad (A1)
Some points to note about RE: (1) It is a relative and asymmetric metric; that is, RE(p||q) ≠ RE(q||p). A common technique (dating back to the original inventors of the metric) to make it symmetric is to average RE(p||q) and RE(q||p); we used this technique in our investigations. (2) The minimum value of RE is 0, corresponding to p(x) = q(x); the maximum is infinite. (3) The unit of RE depends on the base a of the logarithm in Equation (A1): for base 2, the unit is bits of information, and for the natural logarithm, nats. In this work, we used the natural logarithm to maintain consistency with the Gaussian kernel used to transform between distance and similarity metrics.
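For discrete distributions (e.g., PageRank mass over nodes or communities), the symmetrized RE described above can be sketched as follows. This is a minimal illustration assuming numpy; the example distributions are invented, and the small epsilon guards against zero entries.

```python
import numpy as np

def relative_entropy(p, q, eps=1e-12):
    """KL divergence RE(p || q) in nats for discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()      # renormalize after smoothing
    return float(np.sum(p * np.log(p / q)))

def symmetric_re(p, q):
    """Symmetrized RE: the average of the two directed divergences."""
    return 0.5 * (relative_entropy(p, q) + relative_entropy(q, p))

p = [0.5, 0.3, 0.2]     # e.g., a skewed distribution (hypothetical novice)
q = [0.34, 0.33, 0.33]  # a more uniform distribution (hypothetical expert)
print(symmetric_re(p, q))  # > 0 for differing distributions
print(symmetric_re(p, p))  # 0.0 for identical distributions
```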
Another metric we used is the Overlapping Index (OI) [88]. The OI is dimensionless and directly measures the overlap between the areas under two distribution functions. The overlap between two distributions p(x) and q(x) can be expressed as follows:
OI(p(x), q(x)) = \int_X \min(p(x), q(x)) \, dx = 1 - \frac{1}{2} \int_X |p(x) - q(x)| \, dx \qquad (A2)
Whereas RE is a distance or dissimilarity metric, OI is a similarity metric. The minimum OI is 0 (corresponding to RE → ∞) and the maximum is 1 (corresponding to RE = 0 and p(x) = q(x)). We use the Gaussian kernel relationship to convert back and forth between the similarity and the distance metrics:
\mathrm{Similarity} = e^{-\mathrm{Distance}^2}, \qquad \mathrm{Distance} = \sqrt{-\log_e(\mathrm{Similarity})} \qquad (A3)
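The OI and the kernel conversion can be sketched for discrete distributions as below. This is an illustrative snippet assuming numpy and the squared-exponential (Gaussian) form of the similarity-distance relationship; the example distributions are invented.

```python
import numpy as np

def overlapping_index(p, q):
    """OI: shared area of two discrete distributions (1 = identical, 0 = disjoint)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.minimum(p / p.sum(), q / q.sum()).sum())

def similarity_to_distance(s):
    """Invert the Gaussian kernel S = exp(-D**2)."""
    return float(np.sqrt(-np.log(s)))

def distance_to_similarity(d):
    """Apply the Gaussian kernel S = exp(-D**2)."""
    return float(np.exp(-d ** 2))

p = [0.5, 0.3, 0.2]
q = [0.34, 0.33, 0.33]
oi = overlapping_index(p, q)           # similarity in [0, 1]
d = similarity_to_distance(oi)         # corresponding distance
print(oi, d, distance_to_similarity(d))  # round-trip recovers oi
```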

Appendix D.2. Community or Cluster Changes

In this work, we used two indices to measure cluster similarity: the Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI). The details of these indices are beyond the scope of this work, but given their popularity they can be found in [89,90,91], among many other resources. The principles behind the indices are simple to explain. Suppose we have a binary classification problem (Yes/No). Each element labeled positive (i.e., a "Yes") may be correct (true positive, TP) or incorrect (false positive, FP). Similarly, each negative label may be correct (true negative, TN) or incorrect (false negative, FN). In this case, the Rand Index (unadjusted) is simply the accuracy, expressed as the proportion of correct labels (Equation (A4)).
RI(U, V) = \frac{\text{number of correct labels}}{\text{total number of labels}} = \frac{TP + TN}{TP + FP + FN + TN} \qquad (A4)
The Rand Index ranges from 0 (no correct labels) to 1 (all correct labels). The ARI is adjusted so that (i) it ranges from −1 to 1 (all correct labels), and (ii) an index of 0 corresponds to the expected number of correct labels under random assignment. This is accomplished as shown in Equation (A5), in which E[·] denotes the expected value and max(·) the maximum value.
ARI(U, V) = \frac{RI - E[RI]}{\max(RI) - E[RI]} \qquad (A5)
This approach extends easily beyond binary classification. Generally, for classification problems, our goal is to find a "true" partition: a set of elements divided into subsets, clusters, or communities. For example, for a set X = {a, b, c, d, e}, two possible partitions among many are p1 = {{a, b}, {c, d}, {e}} and p2 = {{a}, {b, c, d}, {e}}. When comparing two partitions, we estimate the accuracy based on the pairs of elements in the clusters: pairs correctly labeled as belonging to the same cluster are the TP, pairs correctly labeled as belonging to different clusters are the TN, and so on. To adjust, we use the expected number of correct labels in the random case. The full expression for the ARI between two partitions P1 and P2 of N total elements is given by Equation (A6) [98] (Chapter 3), where C_i and C_j refer to the clusters in partitions P1 and P2, respectively. The other notations are N_i = N(C_i), N_j = N(C_j), N_{ij} = N(C_i ∩ C_j), and N = N(C_i ∪ C_j).
ARI(P_1, P_2) = \frac{\sum_{i,j} \binom{N_{ij}}{2} - \left[\sum_i \binom{N_i}{2} \sum_j \binom{N_j}{2}\right] \Big/ \binom{N}{2}}{\frac{1}{2}\left[\sum_i \binom{N_i}{2} + \sum_j \binom{N_j}{2}\right] - \left[\sum_i \binom{N_i}{2} \sum_j \binom{N_j}{2}\right] \Big/ \binom{N}{2}} \qquad (A6)
The Adjusted Mutual Information (AMI) between two partitions P1 and P2 can be expressed by Equation (A7) [91], where H(·) denotes information entropy, MI(·,·) the mutual information between two partitions, E[·] the expected value, and max(·,·) the maximum.
AMI(P_1, P_2) = \frac{MI(P_1, P_2) - E[MI(P_1, P_2)]}{\max(H(P_1), H(P_2)) - E[MI(P_1, P_2)]} \qquad (A7)
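Both indices are available in scikit-learn; the snippet below is a toy illustration (the community labelings are invented, not the study's partitions). Note that both indices are invariant to relabeling of the communities.

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Two partitions of the same six subtasks into communities (labels are arbitrary)
novice_communities = [0, 0, 1, 1, 2, 2]
expert_communities = [1, 1, 0, 0, 2, 2]  # same grouping, different label names

print(adjusted_rand_score(novice_communities, expert_communities))         # 1.0
print(adjusted_mutual_info_score(novice_communities, expert_communities))

shuffled = [0, 1, 0, 1, 2, 2]  # partial agreement with the novice partition
print(adjusted_rand_score(novice_communities, shuffled))  # between 0 and 1
```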

Appendix E. Hypothesis Tests

To evaluate how graph metrics evolve with operator experience, we test for systematic changes in slope over time using both frequentist and Bayesian hypothesis testing frameworks. The null hypothesis ( H 0 ) posits that there is no directional change in the metric with increasing years of experience—that is, the slope of the relationship is zero. The alternative hypothesis ( H 1 ) asserts that the slope is positive, indicating that the metric increases with experience. We apply frequentist linear regression to estimate the slope and compute one-sided p-values and confidence intervals. In parallel, we use Bayesian linear regression to quantify the posterior probability of a positive slope, derive credible intervals, and estimate Bayes Factors to assess the strength of evidence against the null. This dual approach allows us to compare traditional significance testing with probabilistic inference, offering complementary insights into how expertise shapes graph structure.

Appendix E.1. Frequentist Regression Test

The frequentist approach uses ordinary least squares to estimate the slope, intercept, and associated statistics. We compute a one-sided p-value to assess the probability of observing a positive slope under the null hypothesis. The 95% confidence interval (CI) is derived using the critical value from the t-distribution and the standard error of the slope. Additionally, the coefficient of determination ( R 2 ) quantifies the proportion of variance in the dependent variable explained by the predictor. This method provides a direct test of the null hypothesis and a bounded interval estimate for the slope, assuming normality and homoscedasticity.
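The frequentist procedure above can be sketched with scipy. The data here are hypothetical toy values (metric vs. years of experience), not the study's measurements; scipy reports a two-sided p-value, so we halve it for the one-sided alternative of a positive slope.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: a graph metric against years of experience
years = np.array([1, 3, 5, 8, 12, 15], dtype=float)
metric = np.array([0.10, 0.18, 0.22, 0.35, 0.41, 0.52])

res = stats.linregress(years, metric)
# Convert the two-sided p-value to one-sided for H1: slope > 0
p_one_sided = res.pvalue / 2 if res.slope > 0 else 1 - res.pvalue / 2

# 95% CI for the slope from the t critical value and the standard error
t_crit = stats.t.ppf(0.975, df=len(years) - 2)
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)

print(res.slope, p_one_sided, res.rvalue ** 2, ci)
```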
Frequentist linear regression (Table A2) revealed significant positive slopes (p < 0.005, one-sided) for six of eight graph metrics, indicating expertise-driven changes in graph structure. Whole-graph changes (adjacency spectral distance, β₁ = 35.004, R² = 0.864; Laplacian spectral distance, β₁ = 95.626, R² = 0.832) and edge-weight changes (relative entropy, β₁ = 0.182, R² = 0.814) showed the strongest effects, while community changes (ARI, p = 0.069; AMI, p = 0.038) were less robust, suggesting stable community structures.
Table A2. Frequentist Linear Regression Results for Graph Metrics Across Expertise Levels.
Metric | Change Measure | One-Sided p-Value | R² | Slope (β₁) | 95% CI (Lower, Upper)
PageRank | RE | 0.000 | 0.761 | 0.022 * | (0.014, 0.029)
PageRank | OI | 0.000 | 0.689 | 0.048 * | (0.027, 0.068)
Edge Weight | RE | 0.000 | 0.814 | 0.182 * | (0.128, 0.237)
Edge Weight | OI | 0.001 | 0.597 | 0.063 * | (0.031, 0.096)
Community | ARI | 0.069 | 0.175 | 0.050 | (−0.018, 0.118)
Community | AMI | 0.038 | 0.239 | 0.055 | (−0.007, 0.116)
Whole Graph | ASD | 0.000 | 0.864 | 35.004 * | (26.277, 43.731)
Whole Graph | LSD | 0.000 | 0.832 | 95.626 * | (68.642, 122.611)
Note: Asterisks (*) indicate significant positive slopes (p < 0.005, one-sided). One-sided p-values of 0.000 reflect rounding of values below 0.0005, indicating highly significant positive slopes.

Appendix E.2. Bayesian Regression Test

To assess the relationship between experience and a graph metric, we standardize inputs and fit a Bayesian linear regression. The slope is assigned a N o r m a l ( 0 , 1 ) prior, and the model is sampled with a high target acceptance rate to ensure reliable posterior estimates. After sampling, slope values are rescaled to the original units for interpretability. We compute the posterior mean, the probability that the slope is positive, and the 95% highest density interval (HDI) to quantify uncertainty. To evaluate evidence for a non-zero slope, we use the Savage-Dickey density ratio, comparing prior and posterior densities at zero via Gaussian KDE. This yields an approximate Bayes Factor indicating how strongly the data support a directional effect.
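The authors' analysis uses MCMC sampling with a KDE-based Savage-Dickey ratio; as a simplified, runnable stand-in, the sketch below uses a conjugate normal approximation for the slope posterior (plug-in residual variance, N(0, 1) prior on the standardized slope) and closed-form normal densities for the Savage-Dickey ratio. All data values and simplifications here are illustrative, not the paper's.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: a graph metric against years of experience
x = np.array([1, 3, 5, 8, 12, 15], dtype=float)
y = np.array([0.10, 0.18, 0.22, 0.35, 0.41, 0.52])
x = (x - x.mean()) / x.std()   # standardize inputs, as in the text
y = (y - y.mean()) / y.std()

# Conjugate normal approximation: prior slope ~ N(0, 1), plug-in noise variance
slope_hat, intercept = np.polyfit(x, y, 1)
resid = y - (slope_hat * x + intercept)
sigma2 = resid.var(ddof=2)                 # plug-in residual variance
like_prec = (x ** 2).sum() / sigma2        # likelihood precision for the slope
post_var = 1.0 / (1.0 + like_prec)         # combine with the N(0, 1) prior
post_mean = post_var * like_prec * slope_hat

# Savage-Dickey density ratio: prior vs. posterior density at slope = 0
bf10 = stats.norm.pdf(0, 0, 1) / stats.norm.pdf(0, post_mean, np.sqrt(post_var))
p_positive = 1 - stats.norm.cdf(0, post_mean, np.sqrt(post_var))

print(post_mean, bf10, p_positive)
```

With strongly increasing toy data, the posterior concentrates away from zero, so the Bayes Factor favors a positive slope.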
Bayesian linear regression (Table A3) confirmed strong evidence (BF₁₀ > 3) for positive slopes in six of eight graph metrics, supporting expertise-driven graph reorganization. Whole-graph changes (adjacency spectral distance, posterior mean slope = 34.601, BF₁₀ = 4.95 × 10³⁰; Laplacian spectral distance, posterior mean slope = 93.461, BF₁₀ = 5.16 × 10²⁰) and edge-weight changes (relative entropy, posterior mean slope = 0.179, BF₁₀ = 6.40 × 10²) exhibited the strongest evidence, while community changes (ARI, BF₁₀ = 0.82; AMI, BF₁₀ = 1.36) showed weaker evidence, consistent with the frequentist findings on the stability of the communities.
Table A3. Bayesian Linear Regression Results for Graph Metrics Across Expertise Levels.
Metric | Change Measure | BF₁₀ | Posterior Mean Slope | 95% HDI (Lower, Upper)
PageRank | RE | 2.06 × 10¹⁴ | 0.021 * | (0.014, 0.029)
PageRank | OI | 2.85 × 10² | 0.046 * | (0.026, 0.067)
Edge Weight | RE | 6.40 × 10² | 0.179 * | (0.124, 0.236)
Edge Weight | OI | 5.02 × 10¹ | 0.061 * | (0.028, 0.093)
Community | ARI | 0.82 | 0.046 | (−0.019, 0.110)
Community | AMI | 1.36 | 0.051 | (−0.007, 0.107)
Whole Graph | ASD | 4.95 × 10³⁰ | 34.601 * | (25.636, 43.017)
Whole Graph | LSD | 5.16 × 10²⁰ | 93.461 * | (64.856, 119.965)
Note: Asterisks (*) indicate strong evidence for a positive slope ( B F 10 > 3). B F 10 is the Bayes Factor comparing the model with a non-zero slope to the null (slope = 0).

Appendix F. Detailed Results from Graph Analysis for Operator Groups

In this section, we present the detailed results from the graph analysis of the different operator groups. In the main text, we focused on the changes with expertise at different levels of granularity using a specific set of measures. These measures were selected at each level after extensive testing, since there is no single "optimal" or best measure at any level. Here, we demonstrate some of the measures we tried and how we selected the set used in our analysis. In our first demonstration (Figure A2), we develop the graph from the tuning-related entries of all operators at all stages of expertise; this example is discussed in detail. The detailed results for each of the three expertise groups discussed in the main text (i.e., Novices, Intermediates, and Experts) are shown in Figure A3, Figure A4, and Figure A5, respectively.
Figure A2. Graph analysis for all participants from all stages of experience. Panes (a–c) show the distributions of the node-level measures: the degree, the clustering coefficient, and the PageRank, respectively. The distributions are highly similar across measures. Panes (d,e) show the distributions of the edge-level measures, edge weights and edge betweenness centrality, over all 351 edges (in the same order in (d,e)). There are some noticeable differences between the two distributions; for example, edges with low weights appear to have high betweenness centrality. Panes (f–i) display the community-level analysis: (f) illustrates the modularity trend over increasing partitions from the G-N algorithm; (g–i) show the communities identified by three different algorithms. Interestingly, while the G-N algorithm fails to find a strong partition of the graph into communities (highest modularity ≈ 0), the Louvain and spectral clustering algorithms both find a strong and very similar partition of the graph (modularity > 0.3) into three communities.

Appendix F.1. Node Level

We begin at the node level of the graph, that is, the importance of individual parameters of FEL tuning. The three measures used at this level are the degree, the clustering coefficient, and the PageRank centrality of each node, shown respectively in Figure A2a–c. The degree centrality (or simply degree) of a node captures its importance by the number (or total weight) of its connections. The clustering coefficient of a node captures the tendency of the node and its neighbors to form (triangular) clusters. It is measured as the ratio between the actual number (unweighted graphs) or weight (weighted graphs) of edges between the neighbors of a node and the maximum possible number or weight of such edges. As graphs whose nodes have high clustering coefficients are resilient to small changes (e.g., removal of an edge), the distribution of clustering coefficients provides an indicator of a graph's robustness to random changes.
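The three node-level measures can be computed with networkx; the snippet below is a minimal sketch on a toy weighted graph (the node IDs and weights are illustrative, not the actual 27-node FEL tuning graph).

```python
import networkx as nx

# Toy weighted subtask graph (nodes stand in for tuning parameters)
g = nx.Graph()
g.add_weighted_edges_from([
    (0, 1, 2.0), (0, 2, 1.0), (1, 2, 1.5),   # a tight triangle
    (2, 3, 0.5), (3, 4, 0.8),                 # a sparser tail
])

degree = dict(g.degree(weight="weight"))          # weighted degree centrality
clustering = nx.clustering(g, weight="weight")    # weighted clustering coefficient
pagerank = nx.pagerank(g, weight="weight")        # PageRank centrality

print(degree)
print(max(pagerank, key=pagerank.get))  # node with the highest PageRank
```

Comparing the three dictionaries over the same nodes shows how similar (or not) the measures' distributions are, mirroring the comparison in Figure A2a–c.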
Figure A3. Graph analysis for Group 1 consisting of Novice Operators.
Figure A4. Graph analysis for Group 2 consisting of Intermediate Operators.
Figure A5. Graph analysis for Group 3 consisting of Expert Operators.
The first point to note is that the three measures provide very similar distributions over the 27 nodes. For each measure, some parameters are clearly more important than others; for example, the values for parameters 9, 11, 12, and 18–26 are higher than those of the other parameters. These differences across parameters are most prominent in the degree measure, followed by PageRank, and least prominent in the clustering coefficient. A possible reason is that the PageRank of a node reflects both the degree and the clustering coefficient of the node: the degree of a node is the total weight of its connected edges, whereas the clustering coefficient represents the edge weights between its neighbors. PageRank combines both sets of information by considering the importance of a node's neighbors when estimating its own importance. As the distributions are fairly consistent across measures, we use only PageRank centrality as the node-level measure in the main analysis.

Appendix F.2. Edge Level

At the edge level, we look at two measures: the weights and the betweenness centrality values of the edges. The results are shown in Figure A2d,e. Note that in both panels the edges are presented in the same ascending order of the IDs of the node pairs they connect. Unlike the similarity of the node-level measures, the weights and betweenness centrality exhibit drastically different distributions over the edges. A general pattern emerges: edges with high weights tend to have low betweenness centrality, and vice versa. Moreover, the high-weight edges tend to cluster together, as do the high-betweenness edges. These contrasts follow from what the two measures capture. An edge's weight represents the importance of the connection between a pair of nodes, whereas its betweenness centrality represents its role in connecting otherwise separate parts of the graph. The contrasting distributions of weights and betweenness centrality therefore indicate a strong community structure in the graph, as high-weight edges are likely to fall within the same community, while high-betweenness edges are likely to be the connections between communities.
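The within-community versus between-community contrast can be seen on a toy graph of two dense triangles joined by a single light bridge. The graph and the use of networkx are illustrative assumptions; note also that networkx treats the `weight` attribute as a shortest-path distance in betweenness routines, so for strength-weighted graphs one would typically pass inverted weights, while the unweighted call below already shows the pattern because the in-community weights are uniform.

```python
import networkx as nx

# Two dense triangles joined by one low-weight bridge (illustrative).
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 3.0), ("b", "c", 3.0), ("a", "c", 3.0),  # community 1
    ("d", "e", 3.0), ("e", "f", 3.0), ("d", "f", 3.0),  # community 2
    ("c", "d", 1.0),                                    # bridge
])

# Unweighted edge betweenness: fraction of shortest paths using each edge.
ebc = nx.edge_betweenness_centrality(G)

def bc(u, v):
    # Edge keys may appear in either orientation for undirected graphs.
    return ebc[(u, v)] if (u, v) in ebc else ebc[(v, u)]

# The low-weight bridge carries all cross-community shortest paths,
# so it has the highest betweenness despite the lowest weight.
print(bc("c", "d") > bc("a", "b"))
```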

Appendix F.3. Communities

We confirm this possibility by investigating the communities in the graph using three algorithms. The strength of a partition of the graph into communities is quantified by the modularity metric, which measures how much more densely connected the nodes within each community are than would be expected in a random graph. As a brief reminder, a modularity value of 0 corresponds to the random case, and values greater than 0 indicate a departure from randomness and the presence of communities in the graph; the recommended range for a strong partition is between 0.3 and 0.7. The first algorithm we use is the Girvan–Newman (G-N) algorithm, which creates communities by progressively removing edges in descending order of betweenness centrality. Intuitively, if island-like communities are bridged together by a few edges, we can find the communities by removing these "bridges". The distribution of modularity values over the steps of the G-N algorithm is shown in Figure A2f. As all modularity values are close to or below 0, the algorithm fails to find a strong partition in the graph. The communities from its best partition are shown in Figure A2g, coded by different colors. While such low modularity values could indicate a lack of community structure, it is also possible that the relatively simple principle of removing high-betweenness edges contributes to the failure. One reason is that the G-N algorithm considers only a few possible partitions (specifically, the number of nodes minus 1) before reaching the state in which all nodes are singletons, whereas there are approximately 5.45 × 10^20 possible partitions of the graph. This possibility is also supported by the large number of high-betweenness edges observed in Figure A2e, indicating that the communities in the graph may be connected by several edges instead of the single bridges that would be ideal for the G-N algorithm.
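A minimal sketch of the Girvan–Newman procedure, again on the illustrative two-triangle toy graph rather than the paper's task graph (networkx's `girvan_newman` removes edges by unweighted betweenness by default, an assumption on our part; modularity is evaluated with weights):

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

# Illustrative two-triangle graph with a single bridge.
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 3.0), ("b", "c", 3.0), ("a", "c", 3.0),
    ("d", "e", 3.0), ("e", "f", 3.0), ("d", "f", 3.0),
    ("c", "d", 1.0),
])

# girvan_newman yields successive partitions as the highest-betweenness
# edges are removed; keep the partition with the best weighted modularity.
best = max(girvan_newman(G), key=lambda p: modularity(G, p, weight="weight"))
print(len(best))
```

On this toy graph a single bridge exists, so the first removal already yields the two triangles; the paper's graph, with many high-betweenness edges, is exactly the setting where this strategy struggles.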
The second algorithm we use is the Louvain algorithm (Figure A2h), another heuristic approach that optimizes the modularity metric hierarchically (see Section 6.4.3 for details). The Louvain algorithm yields a strong partition with a modularity of 0.34, dividing the graph into three communities.
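A hedged sketch of the Louvain step using networkx's implementation (`louvain_communities`, available in networkx 2.8 and later; the toy graph is illustrative, not the paper's):

```python
import networkx as nx

# Illustrative two-triangle graph with a single bridge.
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 3.0), ("b", "c", 3.0), ("a", "c", 3.0),
    ("d", "e", 3.0), ("e", "f", 3.0), ("d", "f", 3.0),
    ("c", "d", 1.0),
])

# Louvain greedily moves nodes between communities, then aggregates,
# repeating hierarchically until modularity stops improving.
parts = nx.community.louvain_communities(G, weight="weight", seed=0)
q = nx.community.modularity(G, parts, weight="weight")
print(len(parts), round(q, 3))
```

Because Louvain explores partitions far beyond the n − 1 considered by Girvan–Newman, it can recover strong partitions (modularity above 0.3) that edge removal alone misses.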
Finally, the third algorithm we use is spectral clustering (Figure A2i). Spectral clustering applies a spectral decomposition of the original graph and then uses closeness in the spectral domain to identify groups of nodes that cluster together. For a binary partition, the Fiedler vector (the eigenvector associated with the second-smallest eigenvalue of the graph Laplacian) is used. For more than two classes, a common approach, which we also employ in this work, is K-means clustering in the spectral domain. Spectral clustering yields the same partition as the Louvain algorithm: three communities, with a modularity value of 0.338. These results, together with our earlier observations, confirm the presence of communities in the task graph.
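The spectral route can be sketched directly from the Laplacian: embed the nodes using the eigenvectors of the k smallest eigenvalues, then run K-means in that space. The adjacency matrix below encodes the same illustrative two-triangle graph (numpy and scikit-learn are our assumed tooling; the paper does not specify its implementation).

```python
import numpy as np
from sklearn.cluster import KMeans

# Weighted adjacency for the toy graph: nodes 0-1-2 form one triangle,
# 3-4-5 the other, and edge 2-3 is the low-weight bridge.
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 3), (1, 2, 3), (0, 2, 3),
                (3, 4, 3), (4, 5, 3), (3, 5, 3), (2, 3, 1)]:
    W[i, j] = W[j, i] = w

D = np.diag(W.sum(axis=1))
L = D - W  # unnormalized graph Laplacian

# Eigenvectors of the k smallest eigenvalues give the spectral embedding;
# the second column is the Fiedler vector used for binary partitions.
vals, vecs = np.linalg.eigh(L)
k = 2
embedding = vecs[:, :k]
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
print(labels)
```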
Further support for the Louvain and spectral clustering results is visible in the graph itself. The graph plots in Figure A2g–i are drawn using a spring layout in which the edges act like springs with spring constants proportional to the edge weights, so nodes joined by high-weight edges are pulled close together. The communities found by both algorithms are well separated in the spring layout, lending an extra layer of support to the detected partition. Since the Louvain and spectral clustering algorithms produce very similar results, we use only the Louvain results when comparing the groups in the main paper.
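The spring-layout check can be sketched with networkx's `spring_layout` (Fruchterman–Reingold), again on the illustrative toy graph; the seed and the specific distance comparison are our assumptions.

```python
import math
import networkx as nx

# Two weighted triangles joined by a light bridge (illustrative only).
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 3.0), ("b", "c", 3.0), ("a", "c", 3.0),
    ("d", "e", 3.0), ("e", "f", 3.0), ("d", "f", 3.0),
    ("c", "d", 1.0),
])

# Spring layout: edges act as springs whose stiffness grows with edge
# weight, pulling strongly connected nodes together in the plane.
pos = nx.spring_layout(G, weight="weight", seed=42)

def dist(u, v):
    (x1, y1), (x2, y2) = pos[u], pos[v]
    return math.hypot(x1 - x2, y1 - y2)

# Nodes within a triangle should land closer than nodes across the bridge.
print(dist("a", "b") < dist("a", "e"))
```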

Figure 1. Histograms showing (a) the number of operators in each stage of experience and (b) the corresponding number of elog entries. The numbers are shown in half-year increments up to Year 10, after which all operators are binned into one group. In our year-wise investigations, we include periods with at least 50 entries (the red dashed line in (b)), which is the case up to Year 7.
Figure 2. A schematic of the processes used to develop graphs from the text data in the elog database.
Figure 3. An example of a weighted graph with 10 nodes and three communities (shown in different colors). The edge lengths represent the strengths of relationships between node pairs. The solid edges denote in-community edges, and the dashed edges represent out-of-community edges.
Figure 4. Changes at different levels of the graphs with experience. The first bin represents an operator’s first six months and serves as a reference graph for the distance metrics used to estimate changes.
Figure 5. Graphs of FEL tuning subtasks for three groups of operators. The node sizes represent the node’s PageRank value. The distances between nodes represent edge weights (lower distances indicate higher weights). The node colors denote three communities identified using the Louvain algorithm and verified using Spectral Clustering.
Figure 6. Hierarchical clustering of graphs of tuning performance and their changes with expertise. The height indicates the average distance between nodes in the embedding space, reflecting the strength of interconnections between subtasks. The horizontal lines in the hierarchy represent the distance at which nodes were grouped into the same cluster.
Figure 7. The distributions of edge weights across all groups of expertise. The high edge weights correspond mainly to the in-community edges. The distributions also change slightly with expertise, reflecting changes in the connections between subtasks.
Figure 8. PageRank distributions of the four groups. The subtasks are color-coded by the communities detected in earlier sections.
Table 1. Three groups of operators by stage of expertise.

| Group | Years of Experience | Number of Operators | Number of Entries |
|---|---|---|---|
| (1) Novice operators | ≤1 | 54 | 390 |
| (2) Intermediate operators | 1–4 | 54 | 876 |
| (3) Expert operators | ≥4 | 26 | 822 |
| All operators | 0–14 | 67 * | 2088 |

* Not a sum of group sizes, as the same operators may contribute to different groups as they gain experience.
Table 2. Summary of hypothesis tests across graph levels.

| Graph Level | Graph Metric | Change Measure | One-Sided p-Value | BF10 |
|---|---|---|---|---|
| Node | PageRank | RE | 0.000 * | 2.06 × 10^14 * |
| Node | PageRank | OI | 0.000 * | 2.85 × 10^2 * |
| Edge | Edge Weight | RE | 0.000 * | 6.40 × 10^2 * |
| Edge | Edge Weight | OI | 0.001 * | 5.02 × 10^1 * |
| Community | Community Sets | ARI | 0.069 | 0.82 |
| Community | Community Sets | AMI | 0.038 | 1.36 |
| Whole Graph | Adjacency Matrix | ASD | 0.000 * | 4.95 × 10^30 * |
| Whole Graph | Laplacian Matrix | LSD | 0.000 * | 5.16 × 10^20 * |

Note: Asterisks (*) indicate significant positive slopes (p < 0.005 and BF10 > 3). BF10 is the Bayes factor comparing the evidence for a non-zero slope (alternative hypothesis) against a zero slope (null hypothesis). One-sided p-values of 0.000 reflect rounding of values below 0.0005, indicating highly significant positive slopes.
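The ARI and AMI change measures used at the community level compare two community assignments of the same items. A minimal sketch using scikit-learn's implementations (an assumption on our part, as are the made-up labelings of 27 subtasks below; the paper's actual assignments are not reproduced):

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Hypothetical community labels for 27 subtasks in two groups;
# in the second labeling one subtask has switched community.
group_a = [0] * 9 + [1] * 9 + [2] * 9
group_b = [0] * 9 + [1] * 8 + [2] * 10

# Both scores are chance-corrected: 1.0 for identical partitions,
# values near 0 for random agreement.
ari = adjusted_rand_score(group_a, group_b)
ami = adjusted_mutual_info_score(group_a, group_b)
print(round(ari, 3), round(ami, 3))
```

Chance correction matters here: with only three communities, two unrelated partitions can still agree often by accident, which uncorrected agreement scores would overstate.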

Share and Cite

MDPI and ACS Style

Rahman, R.; Shtalenkova, J.; Mishra, A.A.; Hu, W.-L. Bridging Human and Artificial Intelligence: Modeling Human Learning with Explainable AI Tools. AI 2026, 7, 82. https://doi.org/10.3390/ai7030082

