1. Introduction
Pedestrian flow simulation has become an indispensable analytical framework across multiple disciplines, including urban design, transportation systems engineering, architecture, crowd safety, and emergency evacuation planning. As cities grow denser and public spaces become more dynamic and multifunctional, the ability to understand and predict pedestrian movement is critical to ensuring not only the efficiency of infrastructure but also the safety and comfort of individuals in shared environments. From large-scale transit terminals and commercial complexes to narrow corridors and emergency exits, modeling pedestrian behavior allows stakeholders to anticipate congestion, improve spatial layouts, and design policies that promote smooth and safe pedestrian flows.
The traditional modeling techniques for pedestrian dynamics generally fall into two main categories: macroscopic and microscopic models. Macroscopic models, such as the Lighthill–Whitham–Richards (LWR) model [1], treat pedestrian flows as continuous media, borrowing heavily from hydrodynamic analogies. These models are computationally efficient and suitable for large-scale simulation, but they lack the ability to reflect the discrete, intentional, and often unpredictable behavior of individuals in crowded environments. As a result, while macroscopic models are useful for estimating overall flow patterns and densities, they fall short in scenarios where localized interactions and decision-making play critical roles.
Microscopic models aim to fill this gap by representing each pedestrian as an individual agent with unique decision-making capabilities. Among the most established of these are the Social Force model [2] and the Boids model [3]. The Social Force model expresses pedestrian dynamics through attractive and repulsive forces, capturing phenomena such as lane formation and bottleneck congestion. The Boids model, originally designed to simulate flocking behavior in birds, emphasizes local rule-based interactions among agents and has been adapted for human crowd simulation. While these models provide more realistic and granular behavior than macroscopic models, they still rely heavily on manually tuned parameters and predefined interaction rules, limiting their ability to adapt to diverse or unforeseen scenarios.
In recent years, advancements in artificial intelligence and data-driven modeling have paved the way for learning-based simulation techniques. Reinforcement learning (RL), in particular, has been applied to agent-based pedestrian models to facilitate adaptive decision-making in complex environments [4]. These agents can learn optimal policies by interacting with the environment, receiving feedback through rewards or penalties. While these RL models have achieved commendable performance in various tasks, they often suffer from a lack of interpretability due to the opaque nature of neural policies and the large number of parameters involved. This “black-box” character complicates model validation and restricts their use in safety-critical or policy-sensitive applications.
To address the trade-off between adaptability and transparency, this study proposes a novel approach using the Anticipatory Classifier System 2 (ACS2) [5,6], a type of learning classifier system that integrates symbolic rule representation with reinforcement learning principles. Originally inspired by cognitive systems theory [7], the ACS2 extends traditional classifier systems [8,9] by incorporating an “effect” component that encodes expected environmental changes, thereby allowing agents to form predictive models of their environment. Through evolutionary mechanisms such as genetic algorithms and covering strategies, the rule population evolves over time, enabling agents to adapt to complex tasks while maintaining a set of interpretable behavior rules.
One of the central motivations of this study is the interpretability of the resulting behavior. While black-box models such as deep reinforcement learning can outperform traditional algorithms in many cases, their internal workings remain opaque, often making them unsuitable for domains that require transparency, such as public policy, urban safety planning, or collaborative robotic systems. In contrast, the if–then rules derived through ACS2 learning can be directly examined, allowing human researchers and practitioners to understand, verify, and even manually refine agent behavior. This opens the door for more trustworthy and verifiable pedestrian simulation frameworks, especially in high-stakes applications.
The design of our pedestrian simulation environment incorporates realistic elements, including static obstacles, moving agents, and multiple destinations. Agent perception is modeled through sector-based visual fields, and the state space includes information about surrounding pedestrians and obstacles. A discrete set of actions, including movement in various directions and stopping, allows agents to choose from a limited but sufficient set of behavioral responses. Rewards are provided based on goals such as reaching destinations, avoiding collisions, and minimizing stopping behavior. These components collectively define a dynamic and responsive environment suitable for learning-based navigation.
To validate our approach, we conduct multiple simulation experiments designed to assess the behavior of ACS2-based agents in realistic scenarios. These include bottleneck navigation under various density conditions [10], escape dynamics similar to panic evacuation [11], and the emergence of cooperative patterns in bidirectional pedestrian flows [12,13]. Performance metrics include throughput rates, average travel time, conflict frequency, and convergence of behavior. The outcomes demonstrate that ACS2 agents are capable of acquiring socially plausible and efficient navigation strategies under diverse environmental constraints.
Additionally, we compare the ACS2 with conventional reinforcement learning models, such as tabular Q-learning. While both approaches are capable of learning effective behavior in the pedestrian domain, the ACS2 shows a distinct advantage in terms of rule compactness and interpretability. Moreover, the ACS2’s evolutionary learning component promotes generalization, allowing agents to transfer their knowledge across scenarios with minimal retraining [14,15]. These properties make the ACS2 a promising candidate for scalable and human-understandable agent modeling in pedestrian simulation and beyond.
This paper is organized as follows. Section 2 discusses related work in pedestrian simulation, reinforcement learning, and classifier systems, establishing the theoretical foundation for our study [16,17,18]. Section 3 describes the ACS2-based simulation framework, including the agent architecture, state representation, action space, reward formulation, and learning mechanisms. Section 5 presents a comprehensive evaluation of the simulation experiments and provides comparisons with the baseline models. Finally, Section 6 summarizes the findings and outlines potential directions for future research.
3. Simulation Framework
To evaluate the applicability of the Anticipatory Classifier System 2 (ACS2) to pedestrian flow modeling, we designed a simulation framework in which agents interact with a dynamic environment through perceptual input, rule-based decision-making, and reinforcement-guided learning. This section describes the architecture of ACS2-powered pedestrian agents, the state representation derived from the environment, the available action space, and the reward structure that drives learning.
3.1. Agent Architecture and Visual Perception
Each pedestrian agent in the simulation is implemented as an autonomous decision-making entity equipped with a sector-based field of view and a rule-based controller. Inspired by human visual perception, the field of view is divided into nine discrete sectors spanning ±60° from the agent’s facing direction. Each sector carries information on whether it is empty, occupied by an obstacle or another agent, or contains a valid destination cell. This discretized sensory input forms the condition part of the ACS2 rules and serves as the agent’s perception of the environment at each time step.
Figure 1 illustrates the perceptual field of a pedestrian agent. The visual field spans 120 degrees, divided into nine sectors covering the forward direction and peripheral views. Each sector is encoded with a 2-bit value representing the type of object detected—whether it is empty, an obstacle, another agent, or a goal cell. This structured representation enables compact and interpretable rule matching in the ACS2 framework.
The sensory input from the nine sectors is encoded into a binary string, which constitutes the environmental state as perceived by the agent. Specifically, each sector is assigned a 2-bit code indicating its contents: 00 for empty, 01 for obstacle, 10 for another agent, and 11 for a goal-related cell. The full state representation thus becomes an 18-bit string that succinctly captures the spatial configuration around the agent. This binary encoding enables efficient rule matching and classification within the ACS2 framework, as rules are defined over condition strings of identical length.
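For concreteness, the following minimal Python sketch builds the 18-bit condition string from nine sector observations; the sector ordering and function name are illustrative assumptions, not the authors’ implementation.

```python
EMPTY, OBSTACLE, AGENT, GOAL = range(4)  # encoded as 00, 01, 10, 11

def encode_state(sectors):
    """Encode nine sector observations into an 18-bit condition string."""
    assert len(sectors) == 9
    return "".join(format(s, "02b") for s in sectors)

# Example: an obstacle straight ahead and a goal cell in the rightmost sector.
sectors = [EMPTY] * 9
sectors[4] = OBSTACLE   # assumed center sector (forward direction)
sectors[8] = GOAL       # assumed rightmost peripheral sector
print(encode_state(sectors))  # -> '000000000100000011'
```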
At every time step, each agent selects one action from a discrete set of seven: moving forward, turning ±10° or ±60° relative to the current heading, stopping in place, or moving directly toward the goal if it is within the field of view. This action set is designed to balance navigational flexibility with computational tractability, enabling agents to react adaptively to dynamic surroundings while maintaining a manageable policy space. The selected action is executed only if the target cell is not occupied by an obstacle or another agent; otherwise, the agent is forced to stop and receives a penalty in the reward function.
To guide the learning process, agents receive scalar rewards based on their interactions with the environment. A positive reward is granted when an agent successfully reaches its goal, whereas penalties are applied in the case of collisions, invalid moves, or unnecessary stopping. More specifically, a high positive reward (e.g., +100) is assigned upon goal arrival, a moderate negative reward (e.g., −10) is given for collisions or blocked movements, and a small penalty (e.g., −1) is applied when the agent chooses to stop without necessity. This reward structure encourages agents to navigate efficiently and safely, balancing goal-directed movement with collision avoidance.
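This reward logic can be summarized in a few lines. A minimal sketch follows, assuming boolean event flags; the numeric values (+100, −10, −1) come from the text, while the neutral 0 for an ordinary step is an assumption.

```python
def compute_reward(reached_goal: bool, collided_or_blocked: bool,
                   stopped_unnecessarily: bool) -> float:
    """Scalar reward per time step; values follow the text above."""
    if reached_goal:
        return 100.0   # goal arrival
    if collided_or_blocked:
        return -10.0   # collision or blocked (invalid) move
    if stopped_unnecessarily:
        return -1.0    # idling without necessity
    return 0.0         # ordinary step (assumed neutral)
```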
Learning in the ACS2 is driven by a combination of reinforcement signals and anticipatory predictions. When an agent selects and executes an action, the ACS2 evaluates the corresponding rule based on the reward received and the accuracy of its predicted environmental outcome. Rules that produce both effective actions and accurate predictions are reinforced and selected for reproduction through a genetic algorithm. Conversely, poorly performing rules are penalized or replaced. In cases where no matching rule exists for the current situation, a covering mechanism generates a new rule based on the observed condition, action, and resulting effect. This dynamic rule evolution enables agents to construct compact and interpretable policy sets that adapt to diverse scenarios over time.
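As an illustration of the covering step described above, the following simplified Python sketch creates a new condition–action–effect rule from an observed transition. The classifier fields, initial values, and the fully specified condition are simplifications for illustration, not the reference ACS2 implementation.

```python
from dataclasses import dataclass

@dataclass
class Classifier:
    condition: str    # 18 symbols over {'0', '1', '#'}; '#' matches anything
    action: int
    effect: str       # anticipated perception; '#' means "unchanged"
    quality: float = 0.5  # assumed initial anticipation quality

def matches(cl: Classifier, state: str) -> bool:
    return all(c in ("#", s) for c, s in zip(cl.condition, state))

def cover(state: str, action: int, next_state: str) -> Classifier:
    """Create a rule for an unmatched situation from the observed transition."""
    # The effect records only the attributes that actually changed; fully
    # specifying the condition is a simplification of ACS2's covering.
    effect = "".join(n if n != s else "#" for s, n in zip(state, next_state))
    return Classifier(condition=state, action=action, effect=effect)

cl = cover("00" * 8 + "11", action=0, next_state="00" * 9)
print(cl.effect)  # -> '################00' (only the goal sector changed)
```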
3.2. Action Set and Movement Dynamics
The action set available to each agent is designed to balance simplicity, navigational realism, and learning efficiency. At every time step, the agent selects one of seven possible actions: move forward, turn ±10° or ±60° relative to its current heading, stop, or move directly toward the goal direction if the goal is visible. Each action corresponds to a transition to an adjacent cell in a discretized grid environment. The resulting movement is subject to environmental constraints: agents cannot move into cells occupied by obstacles or other agents. If the selected move is invalid, the agent remains in place and incurs a penalty, as defined by the reward structure described in Section 3.1.
The grid-based environment assumes that agents move at a uniform speed of one cell per time step, provided the selected action is valid. The direction of movement is updated according to the agent’s current heading and the relative angle associated with the chosen action. In the case of goal-directed movement, the agent computes the angular offset between its current heading and the goal direction and selects the action that most closely aligns with that vector. After each movement decision, the agent’s heading is updated to reflect the direction of the most recent action, which affects its future perception and decision-making. This orientation-aware mechanism allows agents to exhibit more realistic and consistent motion trajectories over time.
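A minimal sketch of this goal-directed action choice follows; the turn-angle table reflects the seven-action set described above, while the function and variable names are hypothetical.

```python
import math

# Assumed relative angles for the five heading-changing actions; 'stop' and
# direct goal movement are omitted for brevity.
TURN_ANGLES = {"fwd": 0.0, "left10": 10.0, "left60": 60.0,
               "right10": -10.0, "right60": -60.0}

def goal_directed_action(heading_deg, pos, goal):
    """Pick the action whose resulting heading best aligns with the goal."""
    bearing = math.degrees(math.atan2(goal[1] - pos[1], goal[0] - pos[0]))
    offset = (bearing - heading_deg + 180.0) % 360.0 - 180.0  # in (-180, 180]
    return min(TURN_ANGLES, key=lambda a: abs(TURN_ANGLES[a] - offset))

print(goal_directed_action(0.0, (0, 0), (1, 1)))  # goal at +45 deg -> 'left60'
```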
To prevent overlapping and simulate realistic crowd behavior, the model enforces collision avoidance through a simple but effective mechanism. When two or more agents attempt to move into the same target cell simultaneously, a conflict resolution rule is applied: all conflicting agents are prevented from moving and instead remain in their current positions. This approach models the hesitation and re-planning often observed in real pedestrian interactions. Additionally, agents maintain a soft repulsion effect by assigning higher penalties for selecting actions that bring them into close proximity with others, even if no direct collision occurs. These mechanisms contribute to smoother flow dynamics and more natural lane formation during dense crowd scenarios.
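The cell-claim conflict rule can be sketched as follows, assuming a mapping from agent identifiers to intended target cells; the data structures are illustrative.

```python
from collections import Counter

def resolve_moves(intended):
    """intended: {agent_id: target_cell}; conflicting agents stay put (None)."""
    claims = Counter(intended.values())
    return {aid: (cell if claims[cell] == 1 else None)
            for aid, cell in intended.items()}

print(resolve_moves({"a": (2, 3), "b": (2, 3), "c": (4, 1)}))
# -> {'a': None, 'b': None, 'c': (4, 1)}
```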
5. Experiments
To evaluate the effectiveness and validity of the proposed ACS2-based pedestrian simulation framework, we conducted a series of experiments under diverse crowd scenarios.
Table 1 summarizes the key parameters used in the simulation. These include environmental configurations such as agent speed and field of view, as well as ACS2-specific learning and evolutionary parameters. The reward structure reflects incentives for goal-reaching and penalties for collisions and unnecessary idling.
5.1. Validation with Empirical Pedestrian Flow
The first set of experiments focuses on validating whether ACS2-powered agents can reproduce pedestrian flow patterns observed in empirical studies. We use data from controlled bottleneck experiments reported by Liu et al. [10], which provide measurements of pedestrian density, velocity, and throughput under various inflow conditions. The simulation environment is configured to mimic these settings, including corridor width, entrance size, and agent arrival rates.
The simulation environment, illustrated in Figure 2, models a corridor with a fixed width of 3 m and a variable agent inflow rate. The bottleneck is represented by a narrowed passage 1 m wide, located at the corridor exit. Agents are generated at the entrance with randomized arrival intervals following a Poisson distribution, replicating inflow variability. Each agent is assigned a random destination at the far end of the corridor, and its behavior is governed by ACS2-learned rules. The simulation runs for 300 time steps per episode, and results are averaged over 20 episodes to ensure statistical robustness. Evaluation metrics include the density–velocity relationship, the number of agents passing through the bottleneck per unit time, and the spatial distribution of congestion.
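Such a Poisson arrival stream can be generated via exponentially distributed inter-arrival gaps; a small sketch follows, with a placeholder arrival rate (not a value from the paper).

```python
import random

def arrival_times(rate_per_step, horizon):
    """Yield the integer time steps at which new agents enter the corridor."""
    t = 0.0
    while True:
        t += random.expovariate(rate_per_step)  # exponential inter-arrival gap
        if t >= horizon:
            return
        yield int(t)

print(list(arrival_times(rate_per_step=0.5, horizon=30)))  # one short episode
```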
To assess the realism of the learned pedestrian behavior, we compare the simulation results with the empirical density–velocity relationship. A designated measurement area is placed in front of the bottleneck, where pedestrian density and average velocity are recorded. Throughput is also computed by tracking the number of agents passing through the bottleneck per unit time. These metrics jointly assess the plausibility and generalizability of the learned behavior under realistic flow conditions.
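One possible implementation of the measurement-area statistics is sketched below, assuming a rectangular area and a simple agent record; both are illustrative rather than the paper’s data model.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    x: float
    y: float
    speed: float  # m/s

def area_stats(agents, x0, x1, y0, y1):
    """Density [1/m^2] and mean speed [m/s] inside a rectangular area."""
    inside = [a for a in agents if x0 <= a.x < x1 and y0 <= a.y < y1]
    area_m2 = (x1 - x0) * (y1 - y0)
    mean_v = sum(a.speed for a in inside) / len(inside) if inside else 0.0
    return len(inside) / area_m2, mean_v

print(area_stats([Agent(1.0, 1.0, 1.2), Agent(5.0, 1.0, 0.8)], 0, 2, 0, 3))
# -> (0.1666..., 1.2): one agent inside the 6 m^2 area
```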
Figure 3 shows the relationship between pedestrian density and average velocity, measured within the designated area. The ACS2-based simulation reproduces the characteristic inverse correlation observed in empirical studies: as density increases, average velocity declines, with a critical threshold near 1.8 persons/m², beyond which flow efficiency drops sharply. This pattern aligns well with the experimental data reported by Liu et al. [10]. In this figure, the horizontal axis ρ denotes pedestrian density [1/m²], and the vertical axis v represents the average walking speed [m/s] measured within the designated area.
Liu et al. [10] report a similar trend, particularly beyond 3.0 persons/m², where velocity drops significantly. Our results exhibit comparable behavior under different inflow conditions (labeled 1.6-1, 1.6-2, and 1.6-3), with the transition from free to congested flow clearly emerging near 2.0 persons/m².
5.2. Comparison with Reinforcement Learning (Q-Learning)
To further evaluate the performance and interpretability of the proposed ACS2-based model, we compared it with a conventional reinforcement learning approach—namely tabular Q-learning. Q-learning is a widely used algorithm for learning optimal policies through trial-and-error interactions with the environment. In our implementation, agents maintain a Q-table indexed by discretized state–action pairs and update the values using the standard Bellman equation with ϵ-greedy exploration. The state representation mirrors the 18-bit perception vector used in the ACS2 setting, and the action set is identical. Both models are trained in the same environment under equivalent simulation conditions, allowing for a fair comparison in terms of convergence behavior, throughput efficiency, and policy complexity.
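For reference, a minimal tabular Q-learning baseline consistent with this description might look as follows; the hyperparameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

ACTIONS = range(7)
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # assumed hyperparameters
Q = defaultdict(float)                   # keyed by (state_string, action)

def select_action(state):
    """Epsilon-greedy over the seven actions."""
    if random.random() < EPSILON:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """Standard one-step Q-learning (Bellman) update."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```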
The simulation environment used for this comparison is illustrated in Figure 4. Unlike the bottleneck scenario described in Section 5.1, this setup features a T-shaped corridor layout with two symmetric entry points and a shared central zone. This configuration is designed to induce moderate agent interference and decision-making complexity while maintaining a balanced flow across both learning models.
Figure 5 compares the learning performance of the ACS2 and tabular Q-learning in terms of average episode reward over training iterations. Both models show an upward trend in reward, indicating successful learning; however, the convergence behavior differs significantly. The ACS2 model exhibits smoother and more stable convergence, with less variance across episodes. In contrast, tabular Q-learning shows oscillatory behavior during early training and requires more iterations to reach a comparable level of performance. These results suggest that the ACS2 is more sample-efficient and robust in sparse-reward multi-agent environments like pedestrian flow scenarios.
As Figure 5 also shows, both the ACS2 and tabular Q-learning eventually reach similar levels of average episode reward, indicating comparable final performance. However, their internal policy representations differ substantially in structural compactness and interpretability. In the ACS2 model, knowledge is encoded as a set of condition–action–effect rules whose wildcard conditions generalize across similar perceptual inputs, so the learned policy can in principle be represented far more compactly than an exhaustive value table. In contrast, tabular Q-learning stores a value for each discretized state–action pair: given the 18-bit state encoding and seven possible actions, the full Q-table contains 2¹⁸ × 7 ≈ 1.8 million entries, most of which remain unused or sparsely updated due to the combinatorial explosion of the state space. Our experimental setup capped the population at 300,000 entries for both the ACS2 and Q-learning, and most experiments reached this upper bound. However, we did not record quantitative data on the number or simplicity of the generated rules, which limits our ability to quantify interpretability. Nonetheless, the modular nature of ACS2 rules still qualitatively enhances human interpretability compared to Q-learning’s large numerical tables.
Despite these strengths, the ACS2 may face limitations in extremely diverse environments, where the number of rules required for comprehensive coverage can grow significantly, potentially affecting scalability and learning efficiency. This growth in the rule population could slow down learning processes and complicate rule management in highly heterogeneous scenarios. Therefore, careful tuning of the genetic algorithm parameters and periodic rule pruning might be necessary to maintain efficiency and effectiveness in practical applications.
5.3. Analysis of Crowd Dynamics and Rule-Based Behavior
While the previous sections focused on performance metrics and learning stability, we now turn to a qualitative analysis of agent behavior and crowd-level dynamics emerging from ACS2-based learning. Since the ACS2 produces human-readable rules that map perceptual inputs to actions, we can inspect these rules to uncover typical behavioral patterns, such as lane formation, collision avoidance, and turn-taking at intersections. Furthermore, we analyze spatial flow patterns and congestion formation over time to examine how local rule execution leads to global crowd phenomena. This analysis provides insight into the interpretability and emergent coordination properties of the learned system.
To investigate the influence of environmental design on pedestrian flow, we adopt a simulation environment in which a central obstacle is placed, as illustrated in Figure 6. This layout is motivated by urban design considerations, aiming to assess how such obstacles affect collision avoidance behavior and overall flow efficiency. The simulation runs for 800 episodes per trial, with 10 independent trials conducted for statistical validity. The first 700 episodes are used for learning, and the remaining 100 episodes are used for evaluation. We also analyze the acquired ACS2 rules to evaluate how learned behaviors contribute to crowd-level coordination and improved flow.
To quantitatively evaluate the emerging coordination and behavioral convergence of agents, we introduce three metrics: the separation rate, the information entropy, and the action selection probability.

The separation rate measures the degree to which the two pedestrian groups are spatially divided when passing around the central obstacle. It is defined as

$$ r_{\mathrm{sep}} = \frac{\max\left(N_L^{\mathrm{low}} + N_R^{\mathrm{up}},\; N_L^{\mathrm{up}} + N_R^{\mathrm{low}}\right)}{N_L^{\mathrm{low}} + N_L^{\mathrm{up}} + N_R^{\mathrm{low}} + N_R^{\mathrm{up}}} \qquad (1) $$

where $N_L^{\mathrm{low}}$ and $N_L^{\mathrm{up}}$ represent the number of agents passing on the left side from the lower and upper groups, respectively, and $N_R^{\mathrm{low}}$ and $N_R^{\mathrm{up}}$ represent those passing on the right side.

Information entropy quantifies behavioral diversity, capturing the uncertainty in pedestrian decision-making. High entropy indicates more varied and less predictable movements, while low entropy reflects convergence towards stable, coordinated behaviors. To assess behavioral diversity, we calculate the information entropy H, which captures how uniformly actions are distributed under observed states:
$$ H = \sum_{s \in S} p(s) \left( - \sum_{a \in A} \pi(a \mid s) \log \pi(a \mid s) \right) \qquad (2) $$

where $S$ is the set of observed states, $A$ is the action set, $p(s)$ is the empirical frequency of state $s$, and $\pi(a \mid s)$ is the probability of choosing action $a$ in state $s$. This selection probability is computed based on the fitness of ACS2 classifiers as

$$ \pi(a \mid s) = \frac{\sum_{i \in M(s,a)} F_i}{\sum_{i \in M(s)} F_i} \qquad (3) $$

where $M(s)$ is the set of classifiers matching state $s$, $M(s,a) \subseteq M(s)$ is the subset proposing action $a$, and $F_i$ is the fitness of classifier $i$. These metrics jointly allow us to analyze how local rule adaptations scale up to collective order in pedestrian flows.
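Equations (2) and (3) can be computed directly from logged visit counts and matching classifiers. The following Python sketch assumes simple input formats (a state maps to its visit count and a list of (action, fitness) pairs); these formats are illustrative.

```python
import math
from collections import Counter

def selection_probs(matching):
    """Equation (3): fitness-weighted action probabilities.

    matching: (action, fitness) pairs for classifiers matching one state;
    assumed non-empty.
    """
    total = sum(f for _, f in matching)
    by_action = Counter()
    for a, f in matching:
        by_action[a] += f
    return {a: f / total for a, f in by_action.items()}

def information_entropy(state_log):
    """Equation (2): state_log maps state -> (visit_count, matching)."""
    n = sum(count for count, _ in state_log.values())
    H = 0.0
    for count, matching in state_log.values():
        probs = selection_probs(matching)
        H += (count / n) * -sum(p * math.log(p) for p in probs.values() if p > 0)
    return H
```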
To assess the emergence of coordinated behavior between pedestrian groups, we analyze the entropy of action distributions for agents originating from the upper and lower sides of the corridor. In this study, we define the convergence of collective behavior as the point at which the entropy of both groups remains below a predefined threshold (set to the maximum entropy value observed during the evaluation period) for 20 consecutive episodes. The number of episodes until this convergence is reached is used as the time to consensus formation.
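This criterion reduces to detecting the first run of 20 consecutive below-threshold entropy values; a small sketch with hypothetical names follows.

```python
def episodes_to_consensus(entropy_series, threshold, window=20):
    """First episode of a run of `window` consecutive below-threshold values."""
    run = 0
    for episode, h in enumerate(entropy_series):
        run = run + 1 if h < threshold else 0
        if run >= window:
            return episode - window + 1
    return None  # no consensus within the series

print(episodes_to_consensus([1.2, 0.9] + [0.3] * 25, threshold=0.5))  # -> 2
```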
Figure 7 shows the maximum flow rate and the separation rate observed under varying obstacle widths w. Figure 8 depicts the number of episodes required to achieve consensus under the same conditions. These results reveal a clear relationship between obstacle width and emergent flow efficiency. To derive the summary metric in Figure 8, we computed the information entropy (Equation (2)) and the action selection probability (Equation (3)) to quantitatively evaluate pedestrian behavioral diversity and convergence. Due to the extensive volume of numerical data, individual values are not reported, as they offer limited additional insight.
From Figure 7 and Figure 8, we observe that reduced consensus time and high separation rates contribute to an increased maximum flow rate. When the obstacle width is zero (i.e., no obstacle is present), pedestrians naturally separate according to their direction of travel, which results in a high flow rate. However, the time required to reach consensus in this condition is longer, since the wide corridor allows agents to choose diverse paths freely, delaying the emergence of coordinated flow. When the obstacle width ranges from 0.2 to 0.6 m, the consensus time remains long, while the flow rate tends to decrease. This is because, although the obstacle exists, the passage is still wide enough for divergent path choices, resulting in dispersed agent movement and delayed coordination. In contrast, when the width is between 1.2 and 1.4 m, both the maximum flow rate and the separation rate increase significantly, and the time to consensus shortens. A narrower passage restricts directional options and promotes group alignment, leading to more efficient movement. Notably, a width of 1.4 m yields the best performance across all three metrics. For widths above 1.6 m, both flow rate and separation rate decline again, and consensus takes longer to achieve. This degradation is attributed to the difficulty of self-organizing in excessively narrow paths, which increases agent interference and delays agreement on travel direction.
These findings suggest that inserting a central obstacle can indeed improve pedestrian flow, but the effect is highly dependent on the obstacle’s width. An appropriately sized obstacle facilitates alignment and separation, promoting smoother movement. Conversely, poorly sized obstacles can delay consensus formation and reduce efficiency. Therefore, in urban design, careful adjustment of obstacle width can lead to improved pedestrian throughput and spatial organization.
6. Conclusions
This study proposed a rule-based pedestrian simulation framework using the Anticipatory Classifier System 2 (ACS2), aiming to model emergent behavior in crowd dynamics through interpretable learning-based agent policies. Each pedestrian agent perceives its local environment through a sector-based visual field and selects actions based on compact condition–action–effect rules evolved via reinforcement learning and genetic algorithms.
Through simulation experiments, we validated the plausibility of learned behaviors under bottleneck scenarios, demonstrating that ACS2 agents can reproduce realistic density–velocity relationships and adapt to variable inflow conditions. A comparative analysis with tabular Q-learning revealed that, while both methods achieved similar performance in terms of cumulative reward, the ACS2 offered notable advantages in rule compactness, learning stability, and interpretability.
Further experiments in environments containing a central obstacle showed that the ACS2 agents developed coordinated avoidance and alignment strategies. Metrics such as separation rate, information entropy, and consensus time revealed that appropriately sized obstacles promote faster convergence to structured movement and improve the overall flow efficiency. In particular, obstacle widths around 1.4 m yielded optimal performance across all the evaluation metrics.
Overall, this research highlights the potential of rule-based learning approaches like the ACS2 in modeling complex social behaviors in pedestrian dynamics. The interpretability of evolved rules provides valuable insight for analyzing collective decision-making processes, and the framework can inform future work in urban design and adaptive crowd management systems. Future research will explore practical applications, such as evacuation simulations during emergencies or crowd management at large public events, leveraging the interpretability and adaptability of ACS2-generated behavior rules.