Article

Nonblocking Modular Supervisory Control of Discrete Event Systems via Reinforcement Learning and K-Means Clustering

1 School of Automation and Electrical Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
2 Department of Engineering Design, KTH Royal Institute of Technology, 10044 Stockholm, Sweden
* Author to whom correspondence should be addressed.
Machines 2025, 13(7), 559; https://doi.org/10.3390/machines13070559
Submission received: 10 May 2025 / Revised: 24 June 2025 / Accepted: 25 June 2025 / Published: 27 June 2025

Abstract

Traditional supervisory control methods for the nonblocking control of discrete event systems often suffer from exponential computational complexity. Reinforcement learning (RL)-based approaches mitigate state explosion by sampling many random event sequences instead of computing the synchronous product of multiple modular supervisors, but the resulting supervisors are often restricted to a small reachable state space. A primary novelty of this study is the use of K-means clustering for online inference with the learned state-action values. The clustering method divides the events defined at a state into a good group and a bad group, and the events in the good group are allowed by the supervisor. The obtained supervisor policy ensures the system constraints while providing larger control freedom than conventional RL-based supervisors. The proposed framework is validated by two case studies: an industrial transfer line (TL) system and an automated guided vehicle (AGV) system. In the TL case study, the number of nonblocking reachable states increases from 56 to 72, while in the AGV case study, a substantial expansion from 481 to 3558 states is observed. Our new method achieves a balance between computational efficiency and nonblocking supervisory control.

1. Introduction

Industrial automation requires multiple autonomous machines to work collaboratively to produce products with high productivity and safety. These machines include, for instance, assembly stations, mobile logistic vehicles, and storage buffers. The movement and operations of these machines must be coordinated by logic-level supervisory controllers. The supervisory control theory (SCT) [1,2] provides a theoretical framework for designing nonblocking and maximally permissive supervisors for discrete event systems (DESs). The supervisors may be designed by traditional exhaustive search methods or emerging sample-based methods using reinforcement learning (RL).
The traditional search-based methods suffer from exponential computational complexity. To reduce complexity, modular supervisory control [3] decomposes global specifications into multiple smaller local specifications, which are enforced by local modular supervisors [4]. Only the plant components that share events with the local specification are considered when designing the local modular supervisor. These supervisors collectively ensure compliance with the global specification. However, blocking states may appear in the controlled system due to the lack of coordination between independently designed local modular supervisors. A straightforward way to ensure nonblocking control is to introduce a coordinator to resolve the conflicting control decisions among modular supervisors. Nevertheless, the complexity of computing such a nonblocking coordinator can still grow exponentially with the number of supervisors and plant components. Developing coordination mechanisms that ensure nonblocking behavior with tractable computational complexity remains an open research question.

1.1. Traditional Methods for Coordinator Computing

Traditional SCT employs two primary strategies to efficiently achieve nonblocking modular supervisory control. The first strategy leverages the structural properties of the system to ensure that local modular supervisors maintain the nonblocking property. For instance, Feng and Wonham [5] propose the control-flow-network method, which analyzes the interactions among local supervisors and plant components. This method identifies specific network configurations that inherently guarantee nonblocking behavior for certain modular supervisors. These favorable network structures are common in manufacturing lines. In addition, Goorden et al. [6] present three structural properties that can be checked to verify nonblocking behavior in modular supervisory control applications.
The second strategy aims to alleviate computational complexity by employing system decomposition and hierarchical abstraction techniques in the coordinator synthesis process. This approach partitions the system into smaller, more manageable subsystems and abstracts their behavior into higher-level models, enabling efficient synthesis of a global supervisor. For instance, Schmidt et al. [7,8] propose a hierarchical abstraction technique that uses state aggregation to reduce the state space of individual modules while preserving their essential control properties. Their method significantly reduces the computational burden of synthesizing a coordinator, particularly in large-scale systems. Similarly, Leduc et al. [9] introduce a compositional synthesis framework that combines abstraction and decomposition to achieve scalable nonblocking control. Their approach leverages the concept of “partial models,” where each module is abstracted independently, and the global supervisor is synthesized incrementally. This method not only reduces computational complexity but also enhances the flexibility of the control method.
Recent advancements have further extended these strategies by integrating learning-based methods and formal verification tools. For instance, learning-based techniques are exploited for synthesizing modular supervisors in uncertain environments [10,11]. Additionally, tools like Supremica [12] and MIDES [13] have been developed to automate the synthesis and verification of modular supervisors, enabling the application of these strategies to industrial-scale systems. These developments highlight the ongoing evolution of SCT, where traditional strategies are being enhanced with modern computational techniques to address the challenges of increasingly complex systems.

1.2. RL-Based Methods

RL [14] has emerged as a promising alternative to traditional supervisory control methods in DESs. Traditional methods exhaustively search the state-transition model of the controlled system and prune unwanted states, whereas RL learns control policies through sample-based simulations of the target system. This sample-based approach effectively circumvents the state explosion problem commonly encountered in conventional coordinator design techniques. Kajiwara and Yamasaki [15] propose an RL-based approach for implementing supervisory control in decentralized DESs, where the central supervisor and local subsystems are characterized by distinct preference functions. The reward function is designed to account for both event occurrence and event disabling during the learning process. Furthermore, Zielinski et al. [16] integrate SCT and RL to assign an optimal schedule to an industrial system, where SCT is utilized to build a safe search environment for RL algorithms. Hu et al. [17] combine SCT and RL to synthesize an optimal directed controller for a DES. Deep reinforcement learning (DRL) algorithms are incorporated into the above framework to effectively address control problems in large-scale DESs. Konishi et al. [18] integrate DRL and SCT to solve a safety control problem in DESs, where DRL algorithms are used to compute a suboptimal solution efficiently and SCT is employed to restrict the suboptimal solution to a safe solution. Liu [10] combines SCT and RL to design nonblocking modular supervisors for DESs with unknown environments. Our previous research [19] employs SCT and DRL to compute a nonblocking coordinator for a set of local modular supervisors. Furthermore, SCT is employed to reduce the learning time of DRL for supervisor design for DESs with linear temporal logic specifications [20].

1.3. Motivation and Contributions

Classical SCT strives for the supremal supervisor with the greatest reachable state set, but RL-based supervisors do not explicitly maximize the reachable state set. Hence, the RL-based supervisors are often limited in the reachable state space.
At a reachable state, a supervisor allows all uncontrollable events and a subset of controllable events. The cardinality of the controllable event subset is flexible, ranging from 0 to all controllable events. This flexibility helps the supervisor satisfy all control specifications and achieve the largest reachable state set. In contrast, RL algorithms choose only one action at a reachable state. This limitation significantly confines the reachable state set of the controlled systems. Our previous studies [19,20] attempt to enlarge the reachable state set of the RL-learned supervisor by selecting multiple actions, but these methods choose a fixed number of controllable events and incur longer training time. The flexibility of choosing an arbitrary number of controllable events at a reachable state is desirable for the RL-learned supervisor.
This paper adds this missing flexibility to RL-learned supervisors. Suppose that an RL algorithm maximizes the reward. After the learning process, the learned supervisor has an evaluation for every pair of a reachable state and a controllable event. If the supremal nonblocking supervisor allows a controllable event at a state, the evaluation of that (state, event) pair is likely to be high. Otherwise, the evaluation of the pair is likely to be low. Our research hypothesis is that the events with higher evaluations learned by RL should be those allowed by the supremal nonblocking supervisor. Since there is no fixed threshold value for dichotomizing the controllable events, we apply a clustering method to identify the pattern of the evaluations and choose a flexible group of allowed events. Among clustering algorithms such as hierarchical clustering [21], support vector machines [22], density-based methods [23], and fuzzy C-means [24], K-means [25] emerges as the most suitable choice for this study because its simplicity and computational efficiency align with the binary classification requirement on the evaluation values.
We evaluate the proposed method on a transfer line system and an automated guided vehicle (AGV) system. The results show that the proposed method greatly enlarges the state size of the controlled system compared with the method using RL alone. The structure of the paper is as follows. Section 2 reviews the necessary preliminaries. Section 3 outlines the overall approach, and the research problem and its solution are presented in Section 4. Section 5 demonstrates the implementation of the proposed method, and Section 6 concludes the paper.

2. Preliminaries

2.1. Deterministic Finite Automata (DFA)

A DFA is formally represented as a five-tuple G = ⟨Q, Σ, η, q0, Qm⟩, where:
  • Q denotes a finite, nonempty set of states;
  • Σ is a finite, nonempty alphabet representing the set of event labels;
  • η: Q × Σ → Q is a (partially defined) transition function;
  • q0 ∈ Q is the designated initial state;
  • Qm ⊆ Q represents the set of marker (or accepting) states.
For a state q ∈ Q and an event σ ∈ Σ, denote η(q, σ)! if η(q, σ) is defined. The set of events that are defined at q is Enb(q) = {σ ∈ Σ ∣ η(q, σ)!}; q is a deadlock state if Enb(q) = ∅.
An event reward function Re: Σ → ℝ and a state reward function Rs: Q → ℝ are defined for DESs as follows. For an event σ ∈ Σ,

R_e(\sigma) = \begin{cases} w_1, & \text{if } \sigma \in \Sigma_s \\ 0, & \text{otherwise} \end{cases} \qquad (1)

where Σs ⊆ Σ defines a specific event set, and w1 is a positive real number. For a state q ∈ Q,

R_s(q) = \begin{cases} w_1, & \text{if } q \in Q_m \\ w_2, & \text{if } q \notin Q_m \text{ and } Enb(q) = \emptyset \\ 0, & \text{otherwise} \end{cases} \qquad (2)

where w2 is a negative real number.
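The two reward functions map directly to code. The following Python sketch is illustrative only; the reward values and the sets SIGMA_S and MARKER_STATES are placeholders, not values taken from the case studies.

```python
# Minimal sketch of the event reward R_e and state reward R_s defined above.
# W1, W2, SIGMA_S, and MARKER_STATES are illustrative placeholders.
W1, W2 = 30.0, -50.0              # positive and negative reward magnitudes
SIGMA_S = {"sigma_marked"}        # a specific event set Sigma_s (assumed for illustration)
MARKER_STATES = {"q_m"}           # marker states Q_m (assumed for illustration)

def event_reward(sigma: str) -> float:
    """R_e: w1 for events in Sigma_s, 0 otherwise."""
    return W1 if sigma in SIGMA_S else 0.0

def state_reward(q: str, enabled: set) -> float:
    """R_s: w1 at marker states, w2 at deadlocks (non-marker, Enb(q) empty), else 0."""
    if q in MARKER_STATES:
        return W1
    if not enabled:
        return W2
    return 0.0
```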

2.2. SCT and Local Modular Supervisors

A plant in SCT is modeled as a DFA G, whose event set Σ is partitioned into a controllable event set Σc and an uncontrollable event set Σu. The specification of the plant behavior is described by another DFA E, whose event set is also Σ. A supervisory controller is designed to ensure that the plant satisfies the desired requirements by disabling some controllable events. The supervisory control function is V: Q → C, where C = {Π ⊆ Σ ∣ Σu ⊆ Π} denotes the set of all event subsets that may be enabled at a given system state. Note that uncontrollable events are always permitted to occur. At a system state q ∈ Q, the supervisory control function determines a permissible event set V(q), and all events not in V(q) are disabled. Consequently, the closed-loop behavior of the plant G is constrained to a subset of the intersection of the behaviors of the plant G and the specification E, and the closed-loop behavior is nonblocking. SCT synthesizes a maximally permissive supervisor S* that permits the greatest reachable state set of the closed-loop behavior.
Many DES control problems involve multiple concurrent plant components and modular control specifications. Let Gi (i ∈ {1, …, b}) and Ej (j ∈ {1, …, d}) denote the DFAs modeling the i-th plant component and the j-th specification, respectively, each defined over the event sets Σi and Σj. Here, b and d are integers greater than one. The set of plant components that interact with specification Ej is given by Share(Ej) = {i ∈ {1, …, b} ∣ Σi ∩ Σj ≠ ∅}. The synchronous product of two DFAs is defined as follows. Given two DFAs G1 = ⟨Q1, Σ1, η1, q10, Q1m⟩ and G2 = ⟨Q2, Σ2, η2, q20, Q2m⟩, their synchronization is G1 ∥ G2 = ⟨Q, Σ, η, q0, Qm⟩, where
  • Q = Q1 × Q2 is the state set;
  • Σ = Σ1 ∪ Σ2 is the set of events;
  • η: Q × Σ → Q is the partial transition function. For q = (q1, q2) ∈ Q and σ ∈ Σ,

\eta(q, \sigma) = \begin{cases} (q_1', q_2), & \text{if } \sigma \in \Sigma_1 \setminus \Sigma_2,\ q_1' = \eta_1(q_1, \sigma) \\ (q_1, q_2'), & \text{if } \sigma \in \Sigma_2 \setminus \Sigma_1,\ q_2' = \eta_2(q_2, \sigma) \\ (q_1', q_2'), & \text{if } \sigma \in \Sigma_1 \cap \Sigma_2,\ q_1' = \eta_1(q_1, \sigma),\ q_2' = \eta_2(q_2, \sigma) \\ \text{undefined}, & \text{otherwise}; \end{cases}

  • q0 = (q10, q20) is the initial state;
  • Qm = Q1m × Q2m is the set of marker states.
The ∥ operation can be extended to more than two DFAs. For each specification Ej, the associated local plant is constructed as Hj = ∥i∈Share(Ej) Gi. The corresponding modular supervisor Sj, which enforces the specification Ej, is computed from the composition of Hj and Ej using SCT [26,27]. Consequently, the concurrent behavior of all plant components Gi = ⟨Qi, Σi, ηi, qi0, Qim⟩ satisfies all specifications under the supervisory control of the supervisors Sj = ⟨Yj, Σ, δj, yj0, Yjm⟩. The overall behavior of the controlled DES is captured by the synchronous composition of all local modular supervisors along with their respective plant models. As noted in Section 1, the controlled DES may exhibit blocking behavior, which necessitates a coordinator to guarantee nonblocking execution [1,28].
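The synchronous product can be computed by breadth-first exploration of the reachable composite states. The sketch below is a minimal Python rendering of the definition above; the dictionary-based DFA encoding (keys 'alphabet', 'trans', 'init', 'marked') is an assumption made for illustration, not a format prescribed by the paper.

```python
from collections import deque

def sync_product(dfa1, dfa2):
    """Reachable part of the synchronous product of two DFAs.
    Each DFA is a dict with 'alphabet' (set), 'trans' ({(state, event): next_state}),
    'init' (state), and 'marked' (set of states)."""
    alphabet = dfa1["alphabet"] | dfa2["alphabet"]
    init = (dfa1["init"], dfa2["init"])
    trans, states, queue = {}, {init}, deque([init])
    while queue:
        q1, q2 = queue.popleft()
        for sigma in alphabet:
            # an event is blocked if some component contains it but cannot execute it
            in1, in2 = sigma in dfa1["alphabet"], sigma in dfa2["alphabet"]
            if (in1 and (q1, sigma) not in dfa1["trans"]) or \
               (in2 and (q2, sigma) not in dfa2["trans"]):
                continue
            n1 = dfa1["trans"][(q1, sigma)] if in1 else q1   # only components sharing sigma move
            n2 = dfa2["trans"][(q2, sigma)] if in2 else q2
            trans[((q1, q2), sigma)] = (n1, n2)
            if (n1, n2) not in states:
                states.add((n1, n2))
                queue.append((n1, n2))
    marked = {s for s in states if s[0] in dfa1["marked"] and s[1] in dfa2["marked"]}
    return {"states": states, "alphabet": alphabet, "trans": trans,
            "init": init, "marked": marked}
```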

2.3. Optimal Controller for the Markov Decision Process

A Markov decision process (MDP) is formally represented as a five-tuple M = ⟨S, A, T, s0, R⟩:
  • S denotes a finite state space containing at least one state;
  • A represents the finite set of available actions;
  • T: S × A × S → [0, 1] specifies the state transition probabilities;
  • s0 ∈ S identifies the initial system state;
  • R: S × A × S → ℝ defines the step reward function.
Within this framework, a deterministic control policy maps states to actions through a function π: S → A. Implementing policy π on M generates a Markov chain Mπ with transition probabilities T(s, π(s), s′) between states s and s′. The system’s evolution under this policy produces infinite trajectories τ = (s0, a0, s1), (s1, a1, s2), …, where each transition satisfies T(si, ai, si+1) > 0 with ai = π(si).
The cumulative discounted return for trajectory τ is calculated as:

G(\tau) = \sum_{k=0}^{\infty} \gamma^k R(s_k, a_k, s_{k+1}) \qquad (3)

where γ ∈ (0, 1] serves as the discount factor, progressively diminishing the impact of future rewards. The state-value function Vπ: S → ℝ quantifies the expected cumulative discounted return from state s:

V^{\pi}(s) = \mathbb{E}_{\pi}\left[\, G(\tau) \mid s_0 = s \,\right] \qquad (4)

Following Bellman’s principle of optimality [14], this value function satisfies the recursive relationship:

V^{\pi}(s) = \sum_{s' \in S} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right] \qquad (5)

The optimal control policy π* maximizes expected returns across all states:

\pi^* = \arg\max_{\pi} \sum_{s' \in S} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right] \qquad (6)
There are two main approaches to determining π * : model-based methods using dynamic programming when full knowledge of T and R is available, and model-free reinforcement learning methods when transition dynamics are unknown.
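As a concrete illustration of the model-based route, the value iteration sketch below solves the Bellman optimality recursion of Equations (5) and (6) for a small finite MDP; the dictionary encoding of T and R is an assumption chosen for readability.

```python
def value_iteration(states, actions, T, R, gamma=0.95, tol=1e-6):
    """T[(s, a, s2)] -> transition probability, R[(s, a, s2)] -> step reward.
    Returns the optimal value function V* and a greedy deterministic policy."""
    def backup(V, s, a):
        # one-step lookahead: sum over successors of T * (R + gamma * V)
        return sum(T.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                   for s2 in states)

    V = {s: 0.0 for s in states}
    while True:
        diff = 0.0
        for s in states:
            best = max(backup(V, s, a) for a in actions)
            diff = max(diff, abs(best - V[s]))
            V[s] = best
        if diff < tol:
            break
    policy = {s: max(actions, key=lambda a: backup(V, s, a)) for s in states}
    return V, policy
```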

2.4. K-Means Clustering

K-means clustering [25] is a popular unsupervised learning algorithm designed to partition a dataset into multiple disjoint clusters based on similarity. The method iteratively assigns data points to clusters and refines the cluster centers to minimize the variance within each cluster. Hence, similar data points are effectively grouped together. Mathematically, given a dataset U = {ui ∈ ℝm ∣ i = 1, …, n}, K-means aims to partition it into k clusters C = {C1, …, Ck} to minimize the sum of within-cluster variances:

\min_{\mathcal{C}} \sum_{C_l \in \mathcal{C}} \sum_{u_i \in C_l} \| u_i - \mu_l \|^2 \qquad (7)

subject to

\mu_l = \frac{1}{|C_l|} \sum_{u_i \in C_l} u_i \qquad (8)

C_l \cap C_p = \emptyset, \quad l \neq p, \quad l, p \in \{1, \ldots, k\}

\bigcup_{l=1}^{k} C_l = U

where μl is the centroid of cluster Cl and ‖·‖ represents the Euclidean norm.
The K-means algorithm first randomly selects k centroids from the dataset. It then calculates the distance of each data point to all centroids and assigns each point to the cluster of the nearest centroid. After the assignment step, each cluster’s centroid is recalculated as the mean of its members. This process repeats iteratively until the cluster assignments converge.
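Because the later sections only cluster scalar Q values, a one-dimensional implementation suffices. The following NumPy sketch of Lloyd's algorithm is illustrative; the paper does not specify which K-means implementation was used.

```python
import numpy as np

def kmeans_1d(values, k=2, max_iter=100, seed=0):
    """Lloyd's algorithm for scalar data; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    centroids = rng.choice(x, size=k, replace=False)           # random initial centroids
    for _ in range(max_iter):
        # assignment step: each point joins the cluster of its nearest centroid
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        # update step: recompute every centroid as the mean of its cluster
        new_centroids = np.array([x[labels == j].mean() if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):               # assignments have converged
            break
        centroids = new_centroids
    return labels, centroids

# Example: split three Q values (cf. Table 2) into a high-value and a low-value group.
labels, centroids = kmeans_1d([89.41, 51.41, 83.66], k=2)
```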

3. Schema of the Overall Approach

The proposed methodology for nonblocking modular supervisory control follows a structured four-step process, as illustrated in Figure 1. Step 1 uses the plant models {G1, G2, …, Gb} and specification models {E1, E2, …, Ed} to synthesize a group of local modular supervisors {S1, S2, …, Sd}. Step 2 augments the obtained DFAs to MDPs according to the method elaborated in Section 4. Step 3 applies the Q-learning algorithm to learn the optimal state-action value function Q*(x, a) for every action a ∈ A, which evaluates the reward value of every state-action pair (x, a). Step 4 uses K-means clustering to infer the allowed event subset at the current state.

4. Combination of SCT, RL, and K-Means Clustering

We propose a procedure that combines SCT, RL, and K-means clustering to compute supervisory controllers that meet the control requirements, ensure the nonblocking property, and allow close-to-optimal freedom. Let Gi (i = 1, …, b) and Ej (j = 1, …, d) represent the DFA models of the i-th plant component and the j-th requirement, respectively. For the specification Ej, the associated local plant is computed as Hj = ∥i∈Share(Ej) Gi. The nonblocking local modular supervisor Sj = ⟨Yj, Σj, δj, yj0, Yjm⟩ is synthesized by SCT from Hj and Ej. As a result, d local modular supervisors are derived. Consequently, the controlled system under the local modular supervisors is described by a compositional DFA G = ⟨X, Σ, Δ, x0, Xm⟩, where X and Xm are the sets of global states and global marker states, respectively.
As introduced in [19], a global state x = (q1, q2, …, qb, y1, y2, …, yd) ∈ X of the system contains the states of all plant components and local modular supervisors; x ∈ Xm is called a global marker state if (∀i ∈ {1, …, b}, ∀j ∈ {1, …, d}) qi ∈ Qim ∧ yj ∈ Yjm. The initial global state is x0 = (q10, …, qb0, y10, …, yd0). Define the index function I: Σ → 2^{1, …, b} to determine the indexes of the plant components that contain a given event. For an event σ ∈ Σ, I(σ) = {i ∈ {1, …, b} ∣ σ ∈ Σi}. An event σ ∈ Σ is allowed to occur at a global state x, namely Δ(x, σ)!, if (∀i ∈ I(σ)) ηi(qi, σ)! ∧ (∀j ∈ {1, …, d}) δj(yj, σ)!. The set of all allowed events at state x is Enb(x) = {σ ∈ Σ ∣ Δ(x, σ)!}. If Enb(x) = ∅, then x is called a deadlock state. After an event σ ∈ Enb(x) occurs at state x, the DES reaches the next global state x′ = Δ(x, σ) = (q1′, q2′, …, qb′, y1′, y2′, …, yd′), where (∀i ∈ I(σ)) qi′ = ηi(qi, σ), (∀i ∉ I(σ)) qi′ = qi, and (∀j ∈ {1, …, d}) yj′ = δj(yj, σ).
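The global transition structure can be evaluated on demand instead of being enumerated in advance. The sketch below computes Enb(x) and Δ(x, σ) from the component models; representing every plant and supervisor as a (transition dict, alphabet) pair is an assumption made for illustration.

```python
def enabled_events(x, plants, sups, alphabet):
    """Enb(x): events executable in every plant containing the event and in every supervisor.
    x = (q_1, ..., q_b, y_1, ..., y_d); plants and sups are lists of (trans, alph) pairs
    with trans[(state, event)] -> next_state."""
    b = len(plants)
    q, y = x[:b], x[b:]
    enb = []
    for sigma in alphabet:
        ok_plants = all((q[i], sigma) in trans
                        for i, (trans, alph) in enumerate(plants) if sigma in alph)
        ok_sups = all((y[j], sigma) in trans for j, (trans, _) in enumerate(sups))
        if ok_plants and ok_sups:
            enb.append(sigma)
    return enb

def global_step(x, sigma, plants, sups):
    """Delta(x, sigma): advance every plant containing sigma and every supervisor."""
    b = len(plants)
    q, y = list(x[:b]), list(x[b:])
    for i, (trans, alph) in enumerate(plants):
        if sigma in alph:
            q[i] = trans[(q[i], sigma)]
    for j, (trans, _) in enumerate(sups):
        y[j] = trans[(y[j], sigma)]
    return tuple(q + y)
```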

4.1. Augmenting DFAs to MDPs

This part constructs an MDP from the obtained DFA G = ⟨X, Σ, Δ, x0, Xm⟩ and computes the evaluations of all pairs of reachable states and controllable events for the subsequent K-means clustering step. Since only controllable events can be disabled in SCT, the action set is defined as

A = \Sigma_c \qquad (9)

to suit the RL framework. As introduced in Section 2.2, a supervisor of G decides a subset of allowed events Π ∈ C at every reachable state x. To apply RL methods to obtain the event subset, we define a mapping π: A → C that maps an action to an event subset. For an action a ∈ A, define

\pi(a) = \{a\} \cup \Sigma_u \qquad (10)
If the agent chooses action a ∈ A at a global state x, exactly one controllable event is permitted. As a result, each controllable event at a reachable state has an evaluation value after training. In addition, no controllable event is allowed at this state if the evaluations of all controllable events at x fail the selection condition. Define the function Ψ: X × A → 2^Σ as

\Psi(x, a) = \pi(a) \cap Enb(x) \qquad (11)

to obtain all executable events at state x for action a.
Since an arbitrary event in Ψ(x, a) may occur, the next state is uncertain. This implies that, after the introduction of the action set A, the obtained global DFA becomes stochastic for the RL agent and can hence be considered as an MDP M = ⟨X, A, P, x0, r⟩ for the RL agent, where the state transition probability function is P: X × A × X → [0, 1]. For a global state x and an action a ∈ A, if Ψ(x, a) = ∅, then the state transition probability P(x, a, x′) = 0 for any next global state x′. Otherwise, we derive the probability value P(x, a, x′) from the assumption that every event in the allowed event subset Ψ(x, a) has an equal chance to occur, i.e., (∀σ ∈ Ψ(x, a)) Pr(σ ∣ x, a) = 1/|Ψ(x, a)|. The uniform event occurrence probability assumption is verified in [19] by a case study. The results show that this assumption does not affect the nonblocking property of the controlled system and that the majority of the episode rewards under the uniform distribution are larger than those under a non-uniform probability distribution.
Furthermore, a function E: X × A × X → 2^Σ is defined such that

E(x, a, x') = \{\sigma \in \Psi(x, a) \mid \Delta(x, \sigma) = x'\} \qquad (12)

to collect the events that can reach x′ from x by action a. The probability value P(x, a, x′) is the cardinality of E(x, a, x′) divided by the number of all events in Ψ(x, a), i.e.,

P(x, a, x') = \frac{|E(x, a, x')|}{|\Psi(x, a)|} \qquad (13)

Evidently, if x′ ∉ Δ(x, π(a)), then P(x, a, x′) = 0, where Δ(x, π(a)) = {x′ ∣ x′ = Δ(x, σ), σ ∈ π(a)}.
The step reward function r: X × A × X → ℝ for the MDP is defined based on Equations (1) and (2). For a step from x to x′ by action a ∈ A, r(x, a, x′) = 0 if E(x, a, x′) = ∅. Otherwise,

r(x, a, x') = \frac{1}{|E(x, a, x')|} \sum_{\sigma \in E(x, a, x')} R_e(\sigma) + R_s(x') \qquad (14)
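Under the uniform-occurrence assumption, Ψ(x, a), P(x, a, x′), and r(x, a, x′) can be computed directly from the global DFA. The sketch below reuses the hypothetical helpers enabled_events and global_step from the earlier sketch; the reward callables R_e and R_s stand for the functions of Equations (1) and (2).

```python
def psi(x, a, sigma_u, plants, sups, alphabet):
    """Psi(x, a) = pi(a) ∩ Enb(x), with pi(a) = {a} ∪ Sigma_u."""
    allowed = {a} | set(sigma_u)
    return [s for s in enabled_events(x, plants, sups, alphabet) if s in allowed]

def step_distribution(x, a, sigma_u, plants, sups, alphabet, R_e, R_s):
    """Return {x2: (P(x, a, x2), r(x, a, x2))} under the uniform-occurrence assumption."""
    events = psi(x, a, sigma_u, plants, sups, alphabet)
    if not events:
        return {}                                   # Psi(x, a) is empty: no transition
    groups = {}                                     # E(x, a, x2): events leading to x2
    for sigma in events:
        x2 = global_step(x, sigma, plants, sups)
        groups.setdefault(x2, []).append(sigma)
    dist = {}
    for x2, evs in groups.items():
        p = len(evs) / len(events)                  # P = |E| / |Psi|
        r = sum(R_e(s) for s in evs) / len(evs) + R_s(x2)   # averaged event reward + state reward
        dist[x2] = (p, r)
    return dist
```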
According to (6), the optimal controller f* for the MDP M = ⟨X, A, P, x0, r⟩ is computed by

f^* = \arg\max_{f} \sum_{x' \in \Delta(x, \Psi(x, f(x)))} P(x, f(x), x') \left[ r(x, f(x), x') + \lambda U^{f}(x') \right] \qquad (15)

where Δ(x, Ψ(x, f(x))) = {x′ ∣ x′ = Δ(x, σ), σ ∈ Ψ(x, f(x))}. To employ Q-learning, a state-action value function Qf: X × A → ℝ for the controller f is defined to replace Uf in the learning process. Executing an action a ∈ A at a global state x, the expected return is

Q^{f}(x, a) = \sum_{x' \in \Delta(x, \Psi(x, a))} P(x, a, x') \left[ r(x, a, x') + \lambda U^{f}(x') \right] = \frac{1}{|\Psi(x, a)|} \sum_{\sigma \in \Psi(x, a)} \left[ R_e(\sigma) + R_s(\Delta(x, \sigma)) + \lambda U^{f}(\Delta(x, \sigma)) \right] \qquad (16)

The optimal Q function is

Q^{*}(x, a) = \frac{1}{|\Psi(x, a)|} \sum_{\sigma \in \Psi(x, a)} \left[ R_e(\sigma) + R_s(\Delta(x, \sigma)) + \lambda \max_{a'} Q^{*}(\Delta(x, \sigma), a') \right] \qquad (17)

under the optimal control function f*.
To evaluate and improve the learning performance, the concept of temporal difference (TD) [14] is introduced. Temporal difference methods enable the agent to learn optimal value functions incrementally by comparing estimates across successive states. Specifically, the TD error measures the discrepancy between the predicted value of the current state-action pair and the updated estimate based on subsequent states. For employing the Q learning algorithm, the temporal difference error is defined as
TD_{error} = \frac{1}{|\Psi(x, a)|} \sum_{\sigma \in \Psi(x, a)} \left[ R_e(\sigma) + R_s(\Delta(x, \sigma)) + \lambda \max_{a'} Q(\Delta(x, \sigma), a') \right] - Q(x, a) \qquad (18)

Consequently,

Q(x, a) \leftarrow Q(x, a) + \alpha \times TD_{error} \qquad (19)
where α denotes the learning rate and regulates the learning speed. By incorporating the temporal difference, the RL agent learns to approximate the optimal Q-function efficiently.
Algorithm 1 outlines the RL algorithm employed in this paper. Each episode begins at the initial global state x0 and terminates when the maximal step count is reached or no event is available at the reached state, i.e., Enb(x) = ∅. Note that the second termination condition covers two situations. First, the terminal state is a marker state, which earns the positive state reward w1. Second, the terminal state is a deadlock state and receives the negative reward w2. The exploration-exploitation trade-off is managed using the ϵ-greedy strategy, where ϵ ∈ (0, 1). At the end of each episode (Line 25), ϵ is reduced, with a minimal value of 0.01. At a global state x, the algorithm randomly selects an action a ∈ A with probability ϵ and otherwise chooses the greedy action a = arg max_a Q(x, a).
The event subset decided at x by the chosen action a is Ψ(x, a). If Ψ(x, a) = ∅, meaning that the selected controllable event is not defined at x, the state-action value is updated in Line 13, where the negative real number w2 is assigned as a penalty for this selection. The current state is then retained as the next state to facilitate sufficient exploration during the learning process in this situation. Conversely, if Ψ(x, a) ≠ ∅, we consider the transitions caused by all events in Ψ(x, a) and update the Q function according to Equation (19). In this scenario, the next state is arbitrarily chosen from the set of reachable states Δ(x, Ψ(x, a)).
Algorithm 1 is developed for finite-state MDPs, and its convergence is guaranteed according to [29]. Moreover, the reward generated in one episode, the episode number, and the average reward over the previous Len episodes are denoted by EpReward, EpNum, and AvR, respectively. The iteration terminates when either AvR exceeds the threshold e or EpNum surpasses the maximum allowed limit MaxEp. At this point, the converged state-action table Q(x, a) is returned. To track the mean reward over the most recent Len episodes, an empty circular buffer Que of size Len is initialized. After each episode completes, the obtained episode reward is inserted into Que.
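Since Algorithm 1 appears only as a figure in this version, the following Python sketch reconstructs its main loop from the description above. The default hyperparameters mirror the TL row of Table 1, and the four environment callables (enb_fn, psi_fn, step_fn, reward_fn) are hypothetical hooks corresponding to Enb, Ψ, Δ, and Re + Rs.

```python
import random
from collections import deque, defaultdict

def q_learning(x0, actions, enb_fn, psi_fn, step_fn, reward_fn,
               alpha=0.9, lam=0.96, eps=0.98, dec=0.05, e=0.001,
               max_st=100, max_ep=1000, length=30, w2=-50.0):
    """Tabular Q-learning over the augmented MDP (cf. Algorithm 1).
    enb_fn(x) -> Enb(x); psi_fn(x, a) -> Psi(x, a); step_fn(x, sigma) -> next state;
    reward_fn(sigma, x2) -> R_e(sigma) + R_s(x2). All four are assumed environment hooks."""
    Q = defaultdict(float)
    que = deque(maxlen=length)                      # rewards of the last `length` episodes
    for _ in range(max_ep):
        x, ep_reward = x0, 0.0
        for _ in range(max_st):
            if random.random() < eps:               # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(x, b)])
            events = psi_fn(x, a)
            if not events:                          # selected controllable event undefined at x
                Q[(x, a)] += alpha * (w2 - Q[(x, a)])
                continue                            # retain the current state
            succ = {s: step_fn(x, s) for s in events}
            # TD target averaged over all events in Psi(x, a), cf. Equation (18)
            td_target = sum(reward_fn(s, succ[s]) +
                            lam * max(Q[(succ[s], b)] for b in actions)
                            for s in events) / len(events)
            Q[(x, a)] += alpha * (td_target - Q[(x, a)])
            sigma = random.choice(events)           # arbitrary successor in Delta(x, Psi(x, a))
            x = succ[sigma]
            ep_reward += reward_fn(sigma, x)
            if not enb_fn(x):                       # terminal: Enb(x) is empty
                break
        que.append(ep_reward)
        eps = max(0.01, eps - dec)                  # reduce epsilon, floored at 0.01
        if len(que) == length and sum(que) / length > e:
            break                                   # average reward exceeds threshold e
    return Q
```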

4.2. Online Inference by K-Means Clustering

Algorithm 1 provides the learned Q function values, where each pair of a reachable state and a controllable event is associated with a value Q*(x, a). This section focuses on computing the final supervisor by employing the K-means clustering algorithm for online inference. The goal is to map each reachable state to a permissible event subset based on the Q function values. At a reachable state x, the set of Q function values for the controllable events is expressed as

U(x) = \{ Q^{*}(x, a) \mid a \in A \} \qquad (20)
Algorithm 1: Q-learning for computing state-action values
Data: M = ⟨X, A, P, x0, r⟩, λ, α, ϵ, dec, e, MaxSt, MaxEp, and Len
Result: A converged state-action table Q ( x , a )
Machines 13 00559 i001
The K-means clustering algorithm, as discussed in Section 2.4, is applied to partition U(x) into two clusters, C1 and C2, such that min(C1) > max(C2). The cluster with larger values, C1, corresponds to the set of allowed actions at x. The resulting action set is

A_1 = \{ a \mid Q^{*}(x, a) \in C_1 \} \qquad (21)

Since each action corresponds to a controllable event, A1 is the permitted controllable event subset at x. Finally, the allowed event subset at x is (A1 ∪ Σu) ∩ Enb(x). This method ensures that the permissible event set incorporates the actions with high evaluation values while also including all defined uncontrollable events.
Algorithm 2 computes the permissible event subset γ for a given state x based on the learned Q function. The process leverages variance analysis and K-means clustering to determine whether clustering is necessary and to enlarge the set of permitted controllable events when appropriate. The algorithm requires three inputs: the current state x, the learned Q function values Q*(x, a) for all actions a ∈ A, and a predefined threshold δ ∈ [0, 1]. The value set U(x) is computed at x according to Equation (20). If all values are less than zero, all controllable events lead to deadlock states. In this case, no controllable event is permitted, and the event subset γ includes only the defined uncontrollable events. Otherwise, δ is used to determine the partition of the controllable events. To eliminate manual recalibration of the threshold δ across datasets with differing magnitudes, the values at x are normalized as

U_{norm}(x) = \left\{ \frac{Q^{*}(x, a) - U_{min}}{U_{max} - U_{min}} \;\middle|\; a \in A \right\} \qquad (22)

where U_{max} and U_{min} represent the maximum and minimum of U(x), respectively. Let u_i be the i-th element of U_{norm}(x). The standard deviation σ_{norm} is calculated as

\sigma_{norm} = \sqrt{ \frac{1}{|A|} \sum_{i=1}^{|A|} \left( u_i - \bar{u} \right)^2 } \qquad (23)

where

\bar{u} = \frac{1}{|A|} \sum_{i=1}^{|A|} u_i \qquad (24)

If σ_{norm} < δ, the variance is deemed insignificant: all controllable events are permitted, and the permissible event subset includes the defined controllable and uncontrollable events at the current state. If σ_{norm} ≥ δ, K-means clustering is applied to divide U(x) into two clusters. The actions corresponding to the cluster C1 with larger values are used to construct the permissible event subset, and the event subset γ is computed in Line 11.
Algorithm 2: Computing the permitted event subset at x based on Q * ( x , a ) and K-means clustering
  Data: The learned Q function Q*(x, a), a reachable state x, and the threshold δ
  Result: Permitted event subset γ
Machines 13 00559 i002
The threshold δ largely influences the clustering process. When δ is small, the Q values of the actions at state x are more likely to be divided into two distinct clusters: one representing favorable controllable events and the other unfavorable ones. Consequently, the controllable events in the unfavorable cluster are disabled. In contrast, a larger δ may lead to all controllable events being permitted, which increases the risk of violating the nonblocking property. The tuning goal for δ is therefore to find the largest value that still ensures the nonblocking property.
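Because Algorithm 2 is likewise shown only as a figure, the sketch below reconstructs its decision logic from the description above: a sign check, a normalized variance test against δ, and a two-cluster split. It reuses the illustrative kmeans_1d helper from Section 2.4, and enb_fn is again a hypothetical hook returning Enb(x).

```python
import numpy as np

def permitted_events(x, actions, Q, enb_fn, delta=0.5, sigma_u=frozenset()):
    """Compute the permitted event subset gamma at state x (cf. Algorithm 2)."""
    enabled = set(enb_fn(x))
    uncontrollable = enabled & set(sigma_u)          # defined uncontrollable events
    values = np.array([Q[(x, a)] for a in actions], dtype=float)
    if np.all(values < 0):                           # every controllable event leads to deadlock
        return uncontrollable
    # normalize to [0, 1] so the threshold delta is independent of the Q-value scale
    rng = values.max() - values.min()
    u = (values - values.min()) / rng if rng > 0 else np.zeros_like(values)
    if u.std() < delta:                              # variance insignificant: allow all actions
        allowed_ctrl = set(actions)
    else:                                            # split the Q values into two clusters
        labels, centroids = kmeans_1d(values, k=2)
        good = int(np.argmax(centroids))             # cluster with the larger centroid
        allowed_ctrl = {a for a, lab in zip(actions, labels) if lab == good}
    return (allowed_ctrl | uncontrollable) & enabled
```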

5. Case Studies for Quantitative Evaluations

The proposed approach is verified by two case studies related to industrial automation. To show the strengths of the new method in computational efficiency and control freedom, we compare it with two existing methods. Method 1, denoted SCT + RL, combines SCT and RL, and its RL controller chooses at most two controllable events at every reachable state [19]. The comparison shows that the new method requires similar RL training time, but the K-means inference significantly increases control freedom. Method 2, denoted SCT, is the standard supervisory control synthesis method based on exhaustive search of finite-state automata [30]. Method 2 yields the supremal nonblocking supervisor. The comparison shows that the new approach achieves a close-to-supremal supervisor.

5.1. Supervisory Control of a Production Transfer Line

The first case study considers a production transfer line composed of two machines, M 1 and M 2 , a testing unit TU , and two intermediate product buffers, B 1 and B 2 , as illustrated in Figure 2. Suppose that buffers B 1 and B 2 have storage capacities of 4 and 2, respectively.
In this system, machine M 1 loads and processes a part, then transfers the completed part to buffer B 1 . Next, machine M 2 retrieves a part from B 1 , processes it, and passes the finished item to buffer B 2 . The test unit TU picks up parts from B 2 for inspection. If a part passes the quality test, it is discharged from the production line; otherwise, it is redirected back to B 1 for reprocessing by M 2 . This operation introduces a feedback loop in the material flow. The system has two control requirements. (1) Buffers B 1 and B 2 must be protected against underflow and overflow. (2) The controlled system must not have any deadlock state.
The models of M1, M2, TU, B1, and B2 are displayed in Figure 3. The sets of controllable and uncontrollable events are Σc = {ld1, ld2, tst} and Σu = {uld1, uld2, fns, rjt}, respectively. The models of the two buffers are specifications, and the others are plant components. The proposed method first computes two reduced local modular supervisors, S1 and S2, based on the plant components and specifications, where S1 corresponds to B1 with nine states and 51 transitions, and S2 corresponds to B2 with three states and 12 transitions. Under the restriction of S1 and S2, the controlled system satisfies the specifications but can reach deadlocks.
For the new method, the initial global state is x0 = (q1,0, q2,0, q3,0, y1,0, y2,0), where q1,0, q2,0, q3,0, y1,0, and y2,0 are the initial states of M1, M2, TU, S1, and S2, respectively. The corresponding MDP is constructed as introduced in Section 4.1, where the action space is {ld1, ld2, tst}.

5.1.1. Computation of the Q Function Values

Algorithm 1 is applied to compute the optimal Q function, with the hyperparameters summarized in the TL row of Table 1. For instance, the positive and negative rewards used are w1 = 30 and w2 = −50, respectively. Figure 4 shows the plots of the average rewards over the past 30 episodes for the learning processes of Algorithm 1 and of the SCT + RL method in our previous work [19]. In the SCT + RL method, the RL algorithm allows at most two controllable events at every reachable state. The plots show that Algorithm 1 requires 300 episodes to converge, while the SCT + RL method needs more than 400 episodes. The reason is that the corresponding action space sizes are three and four, respectively. Consequently, Algorithm 1 has a higher training efficiency.

5.1.2. Online Inference for the Final Supervisor

After training, the Q table storing the learned values for all pairs of reachable states and controllable events is obtained. Table 2 displays the learned Q values at four reachable states. Each global state contains the states of five components: M1, M2, TU, S1, and S2.
Algorithm 2 is employed to decide the allowed events based on the learned Q function. For instance, as shown in Table 2, all values of the controllable events at x3 are negative; hence, no controllable event is allowed at this state. Since the defined uncontrollable events are uld1 and uld2, the final permitted event subset at x3 is {uld1, uld2}. Moreover, the controllable events are divided into {ld1, tst} and {ld2} at x2 according to the values, and the permitted event subset for x2 is {ld1, tst}. Similarly, the determined event subset at x0 is {ld1}. Since all values of the controllable events at x1 are positive, Algorithm 2 computes σnorm = 0.77. If δ = 0.5, i.e., σnorm > δ, the controllable events are partitioned into {ld1, tst} and {ld2}, and the decided controllable event subset is {ld1, tst}. Since tst is not allowed by the modular supervisors, the final permitted controllable event subset is {ld1}. If δ = 0.8, all controllable events are selected, and the final permitted controllable event subset is {ld1, ld2}.
Figure 5 illustrates the relationship between the threshold parameter δ and the number of reachable states of the controlled system, where the number of nonblocking reachable states is 72 for 0.85 ≤ δ ≤ 0.88. The result shows that the controlled system is blocking if δ > 0.88.
In this case, we employ δ = 0.88. Table 3 presents the control results from three methods: integrating RL and SCT [19]; our new method combining RL, SCT, and K-means clustering; and directly employing SCT [31]. The results show that the SCT + RL method obtains a nonblocking supervisor with 56 reachable states. The novel SCT + RL + K-means method results in a nonblocking controlled system with 72 reachable states. For comparison, SCT is utilized to synthesize the supremal nonblocking supervisor, and the controlled system under this supervisor also has 72 reachable states. All three methods obtain nonblocking supervisors, and our proposed method computes a nonblocking supervisor with the same number of reachable states as the supremal nonblocking supervisor.

5.2. Control of an Automatically Guided Vehicles System

The second application is the control of automatically guided vehicles (AGVs) [19] for a manufacturing workcell. The system consists of 15 components, including two input stations (IPS1, IPS2), three workstations (WS1, WS2, WS3), five AGVs (AGV1–AGV5), four zones (Zone1–Zone4), and the station (CPS) for completed parts.
As shown in Figure 6, five AGVs travel on fixed circular routes, transporting parts to their designated places. Figure 7 and Figure 8 respectively display the DFA models of AGVs and specifications, where the controllable event set is
Σc = {σ11, σ13, σ21, σ23, σ31, σ33, σ41, σ43, σ51, σ53}
The models of the local modular supervisors corresponding to the specifications, together with the physical meanings of these events, are given in [19]. The controlled system under these modular supervisors alone has deadlocks.
A global state consists of the states of the five AGV models and the eight reduced modular supervisors obtained by SCT. The constructed action set is A = Σc, with 10 actions. The proposed method then applies Algorithm 1 to learn the value function Q*(x, a) for the MDP that corresponds to the global state transition system. Figure 9 shows the plots of the average rewards over the past 100 episodes for the learning processes of Algorithm 1 and of the method in our previous work [19], where the parameters used are displayed in the AGV row of Table 1. Algorithm 1 requires fewer than 3000 episodes to converge, while the other method needs more than 4000 episodes.
Algorithm 2 is then applied to infer the supervisor’s policy online. Figure 10 displays the variation of nonblocking reachable states with δ . The largest number of states, 3558, is obtained when δ = 0.32 , and the controlled system reaches deadlock if δ > 0.32 . Therefore, we use δ = 0.32 in this case.
Table 4 presents the control results of the system under the methods employed in the previous case. The controlled system under the SCT + RL method is nonblocking and includes 481 states, where the number of controllable events at a state is limited to at most two. With the proposed method, the controlled system remains nonblocking and expands to 3558 states. Furthermore, with the supervisor synthesized by SCT, the controlled system achieves 4406 nonblocking reachable states. Compared with the method that only combines RL algorithms and SCT, the proposed method admits a much larger state size of the controlled system while still ensuring satisfaction of the specifications.

6. Conclusions

This paper presents an integrated approach for nonblocking modular supervisory control in DESs by combining SCT, RL, and K-means clustering. The proposed method enhances learning efficiency by reducing the action space of RL algorithms while improving system freedom in RL-based supervisors. Through RL, the system efficiently learns values for all state action pairs, and K-means clustering infers the allowed controllable events online to ensure a balance between permissiveness and system constraints. Experimental validation on two case studies confirms the effectiveness of the new approach, demonstrating a significant increase in the number of nonblocking reachable states compared to prior methods.
In the future, we plan to enhance the scalability and adaptability of the approach by incorporating a Deep Q-Network (DQN). A DQN extends traditional Q-learning by using deep neural networks to approximate Q-values, enabling efficient learning in high-dimensional and large-scale discrete event systems. This will allow the method to handle more complex plant structures, larger event spaces, and dynamic environments without requiring exhaustive state enumeration. Furthermore, techniques such as experience replay and target networks [32] will be explored to stabilize training and improve convergence. Moreover, handling partial observability due to sensor noise, delays, or failures is another topic of our future work.

Author Contributions

J.Y., conceptualization, methodology, software, formal analysis, investigation, writing—original draft; K.T., software; L.F., conceptualization, methodology, formal analysis, investigation, supervision, writing—review & editing. All authors have read and agreed to the published version of the manuscript.

Funding

The work is supported in part by the Youth Science and Technology Fund Project of Gansu Province, China, under Grant 25JRRA960 and the Youth Foundation of Lanzhou Jiaotong University, China, under Grant 2024044. Lei Feng is financially supported by the research center of XPRES at KTH.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Wonham, W.M.; Cai, K. Supervisory Control of Discrete-Event Systems; Springer: Berlin/Heidelberg, Germany, 2019. [Google Scholar]
  2. Patrik, B.; Fabian, M. Calculating Restart States for Systems Modeled by Operations Using Supervisory Control Theory. Machines 2013, 1, 116–141. [Google Scholar] [CrossRef]
  3. Wonham, W.M.; Ramadge, P.J. Modular supervisory control of discrete-event systems. Math. Control. Signals Syst. 1988, 1, 13–30. [Google Scholar] [CrossRef]
  4. Queiroz, M.H.d.; Cury, J.E.R. Modular supervisory control of large scale discrete event systems. In Proceedings of the International Conference on Discrete Event Systems, Anchorage, AK, USA, 27 September 2000; pp. 103–110. [Google Scholar]
  5. Feng, L.; Wonham, W.M. Computationally Efficient Supervisor Design: Control Flow Decomposition. In Proceedings of the IEEE 8th International Workshop on Discrete Event Systems, Ann Arbor, MI, USA, 10–12 July 2006; pp. 9–14. [Google Scholar]
  6. Goorden, M.; van de Mortel-Fronczak, J.; Reniers, M.; Fabian, M.; Fokkink, W.; Rooda, J. Model properties for efficient synthesis of nonblocking modular supervisors. Control Eng. Pract. 2021, 112, 104830. [Google Scholar] [CrossRef]
  7. Schmidt, K.; Marchand, H.; Gaudin, B. Modular and decentralized supervisory control of concurrent discrete event systems using reduced system models. In Proceedings of the IEEE 8th International Workshop on Discrete Event Systems, Ann Arbor, MI, USA, 10–12 July 2006; pp. 149–154. [Google Scholar]
  8. Schmidt, K.; Moor, T.; Perk, S. Nonblocking hierarchical control of decentralized discrete event systems. IEEE Trans. Autom. Control 2008, 53, 2252–2265. [Google Scholar] [CrossRef]
  9. Leduc, R.J.; Brandin, B.A.; Lawford, M.; Wonham, W.M. Hierarchical interface-based supervisory control-part I: Serial case. IEEE Trans. Autom. Control 2005, 50, 1322–1335. [Google Scholar]
  10. Liu, Y.; Liu, F. Optimal control of discrete event systems under uncertain environment based on supervisory control theory and reinforcement learning. Sci. Rep. 2024, 14, 25077. [Google Scholar] [CrossRef]
  11. Dai, J.; Lin, H. A learning-based synthesis approach to decentralized supervisory control of discrete event systems with unknown plants. Control Theory Technol. 2014, 12, 218–233. [Google Scholar] [CrossRef]
  12. Malik, R.; Åkesson, K.; Flordal, H.; Fabian, M. Supremica–an efficient tool for large-scale discrete event systems. IFAC-PapersOnLine 2017, 50, 5794–5799. [Google Scholar] [CrossRef]
  13. Farooqui, A.; Hagebring, F.; Fabian, M. MIDES: A tool for supervisor synthesis via active learning. In Proceedings of the IEEE 17th International Conference on Automation Science and Engineering, Lyon, France, 23–27 August 2021; pp. 792–797. [Google Scholar]
  14. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  15. Kajiwara, K.; Yamasaki, T. Adaptive supervisory control based on a preference of agents for decentralized discrete event systems. In Proceedings of the IEEE SICE Annual Conference, Taipei, Taiwan, 18–21 August 2010; pp. 1027–1032. [Google Scholar]
  16. Zielinski, K.M.C.; Hendges, L.V.; Florindo, J.B.; Lopes, Y.K.; Ribeiro, R.; Teixeira, M.; Casanova, D. Flexible control of Discrete Event Systems using environment simulation and Reinforcement Learning. Appl. Soft Comput. 2021, 111, 107714. [Google Scholar] [CrossRef]
  17. Hu, Y.; Wang, D.; Yang, M.; He, J. Integrating reinforcement learning and supervisory control theory for optimal directed control of discrete-event systems. Neurocomputing 2024, 613, 128720. [Google Scholar] [CrossRef]
  18. Konishi, M.; Sasaki, T.; Cai, K. Efficient safe control via deep reinforcement learning and supervisory control-case study on multi-robot warehouse automation. IFAC-PapersOnLine 2022, 55, 16–21. [Google Scholar] [CrossRef]
  19. Yang, J.; Tan, K.; Feng, L.; Li, Z. A model-based deep reinforcement learning approach to the nonblocking coordination of modular supervisors of discrete event systems. Inf. Sci. 2023, 630, 305–321. [Google Scholar] [CrossRef]
  20. Yang, J.; Tan, K.; Feng, L.; El-Sherbeeny, A.M.; Li, Z. Reducing the learning time of reinforcement learning for the supervisory control of discrete event systems. IEEE Access 2023, 11, 59840–59853. [Google Scholar] [CrossRef]
  21. Ran, X.; Xi, Y.; Lu, Y.; Wang, X.; Lu, Z. Comprehensive survey on hierarchical clustering algorithms and the recent developments. Artif. Intell. Rev. 2023, 56, 8219–8264. [Google Scholar] [CrossRef]
  22. Kurani, A.; Doshi, P.; Vakharia, A.; Shah, M. A comprehensive comparative study of artificial neural network (ANN) and support vector machines (SVM) on stock forecasting. Ann. Data Sci. 2023, 10, 183–208. [Google Scholar] [CrossRef]
  23. Kriegel, H.P.; Kröger, P.; Sander, J.; Zimek, A. Density-based clustering. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2011, 1, 231–240. [Google Scholar] [CrossRef]
  24. Bezdek, J.C.; Ehrlich, R.; Full, W. FCM: The fuzzy c-means clustering algorithm. Comput. Geosci. 1984, 10, 191–203. [Google Scholar] [CrossRef]
  25. Ikotun, A.M.; Ezugwu, A.E.; Abualigah, L.; Abuhaija, B.; Heming, J. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Inf. Sci. 2023, 622, 178–210. [Google Scholar] [CrossRef]
  26. Feng, L.; Wonham, W.M. TCT: A computation tool for supervisory control synthesis. In Proceedings of the IEEE 8th International Workshop on Discrete Event Systems, Ann Arbor, MI, USA, 10–12 July 2006; pp. 3–8. [Google Scholar]
  27. Su, R.; Wonham, W.M. What information really matters in supervisor reduction? Automatica 2018, 95, 368–377. [Google Scholar] [CrossRef]
  28. Malik, R.; Teixeira, M. Optimal modular control of discrete event systems with distinguishers and approximations. Discret. Event Dyn. Syst. 2021, 31, 659–691. [Google Scholar] [CrossRef]
  29. Lewis, F.L.; Vrabie, D.; Vamvoudakis, K.G. Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controller. IEEE Control Syst. Mag. 2012, 32, 76–105. [Google Scholar]
  30. Feng, L.; Cai, K.; Wonham, W. A structural approach to the non-blocking supervisory control of discrete-event systems. Int. J. Adv. Manuf. Technol. 2009, 41, 1152–1168. [Google Scholar] [CrossRef]
  31. Wonham, W.M.; Ramadge, P.J. On the supremal controllable sublanguage of a given language. SIAM J. Control Optim. 1987, 25, 637–659. [Google Scholar] [CrossRef]
  32. Saglam, B.; Mutlu, F.B.; Cicek, D.C.; Kozat, S.S. Actor prioritized experience replay. J. Artif. Intell. Res. 2023, 78, 639–672. [Google Scholar] [CrossRef]
Figure 1. Schema for the proposed approach.
Figure 2. Illustration of the transfer line.
Figure 3. Component DES.
Figure 4. Plots of average rewards [19].
Figure 5. Variation of nonblocking reachable states with δ.
Figure 6. A manufacturing workcell with AGVs.
Figure 7. DFA models of AGVs.
Figure 8. DFA models of specifications on AGVs.
Figure 9. Plots of average rewards [19].
Figure 10. Variation of nonblocking reachable states with δ.
Table 1. Values of parameters.

Case | α    | λ    | ϵ    | w1 | w2  | e     | dec      | MaxSt | MaxEp | Len
TL   | 0.9  | 0.96 | 0.98 | 30 | −50 | 0.001 | 0.05     | 100   | 1000  | 30
AGV  | 0.95 | 0.99 | 0.95 | 30 | −50 | 0.001 | 5 × 10−5 | 100   | 5000  | 100
Table 2. Values of controllable events at four reachable states.

State            | ld1    | ld2    | tst
x0: (0,0,0,0,0)  | 68.77  | −50    | −50
x1: (0,0,1,5,0)  | 89.41  | 51.41  | 83.66
x2: (0,0,0,4,2)  | 82.89  | −50    | 40.83
x3: (1,1,0,6,1)  | −42.88 | −43.08 | −43.29
Table 3. Comparison of the control results by three methods.

Methods          | SCT + RL | SCT + RL + K-Means | SCT
State space size | 56       | 72                 | 72
Nonblocking      | Yes      | Yes                | Yes
Table 4. Comparison of the control results by three methods.

Methods          | SCT + RL | SCT + RL + K-Means | SCT
State space size | 481      | 3558               | 4406
Nonblocking      | Yes      | Yes                | Yes
