Article

Adaptive Frequency Control for Multi-Relay MC-WPT Systems Based on Clustering and Reinforcement Learning

1 Chongqing Research Institute, China Coal Technology and Engineering Group, Chongqing 400039, China
2 School of Electronic and Electrical Engineering, Chongqing University of Science and Technology, Chongqing 401331, China
3 Faculty of Electric Power Engineering, Kunming University of Science and Technology, Kunming 650500, China
4 Chongqing Anbiao Testing and Research Institute Co., Ltd., Chongqing 400052, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(3), 705; https://doi.org/10.3390/electronics15030705
Submission received: 16 December 2025 / Revised: 31 January 2026 / Accepted: 2 February 2026 / Published: 6 February 2026

Abstract

Magnetically coupled resonant wireless power transfer (MC-WPT) systems with multi-relay coupling structures can significantly extend the transmission distance. However, system performance is highly sensitive to the spatial positions and coupling conditions of the relay coils. Any misalignment can alter the energy transfer path, causing shifts in the optimal operating frequency and reductions in efficiency. This makes conventional single-frequency or static-tuning strategies unsuitable for handling complex variations in coupling states. To address this issue, this paper investigates a three-relay MC-WPT system and proposes an adaptive frequency control and energy routing method that combines clustering and Q-learning for scenarios with severe coil misalignment. First, a physical model based on coupled-mode theory is established to describe the relationships among coupling coefficients, operating frequency, and transmission efficiency. High-dimensional coupling state data are then collected under different relay coil misalignment conditions. Next, principal component analysis (PCA) and clustering algorithms are used to extract representative coupling patterns and identify the system’s optimal efficiency points, forming an offline database that includes mappings of optimal frequencies. Furthermore, Q-learning is introduced to enable adaptive frequency control through online state recognition. Finally, under severe coil misalignment, frequency retuning of non-misaligned coils is applied to actively shield misaligned coils and reconstruct the energy transfer path. Simulation and experimental results show that the proposed method can achieve real-time frequency control and dynamic energy routing in multi-relay MC-WPT systems without additional hardware. 
The system transmission efficiency is significantly improved under all relay misalignment scenarios, effectively addressing the optimal frequency shift problem in multi-relay coupling structures and providing a new approach for intelligent and efficient MC-WPT systems under complex coupling conditions.

1. Introduction

Multi-relay magnetically coupled wireless power transfer (MC-WPT) systems have great potential in electric vehicles, portable devices, and unmanned systems, where they provide high efficiency and enable medium-range power transfer. However, their overall performance depends strongly on the spatial coupling conditions among the coils. In practical environments, lateral, axial, and angular misalignments are almost unavoidable, making ideal alignment difficult to maintain. These misalignments change the coupling network, causing fluctuations in the coupling coefficients and detuning of the resonance condition, which in turn lead to significant reductions in transfer efficiency and output power. Conventional fixed-frequency or static-tuning methods cannot adapt to such variations in complex systems. Therefore, improving the misalignment tolerance of MC-WPT systems has become a key research focus in this field.
Extensive studies have been carried out to address the coil misalignment problem. Existing approaches can generally be divided into three main categories: structural optimization, compensation network adjustment, and frequency control. For structural optimization, researchers have attempted to enhance misalignment tolerance by designing asymmetric coil structures or multi-coil arrays. An asymmetric three-coil structure was proposed to improve the system’s robustness against misalignment, and the effects of the relay coil radius and position on transmission efficiency were analyzed from a geometric design perspective [1]. An eight-coil MC-WPT system was further developed using dual transmitting and dual receiving coils to generate an approximately uniform magnetic field. This configuration maintained stable power output and efficiency even under large variations in coupling conditions [2].
A superimposed star-shaped coil array structure has been proposed to improve the angular misalignment tolerance at the receiver [3]. The star-shaped transmitter consists of two square planar coil arrays rotated 45° relative to each other. Both simulations and experiments show that even when the receiver coil is rotated by 90°, a certain level of transmission efficiency can still be maintained. This demonstrates that coil misalignment tolerance can be achieved through geometric optimization of the coil array, without relying on complex control strategies. For a single-transmitter multiple-receiver WPT system, a structure using cross-overlapping dual-pole transmitter coils and multiple receiver coils was proposed [4]. By adjusting the magnetic field distribution and employing an impedance-matching network, the system improved power distribution stability under lateral and axial misalignment as well as angular rotation of the coils, effectively enhancing the system’s adaptability to environmental variations. Although these methods improve system robustness by optimizing the magnetic field distribution or increasing the number of coils, they often lead to complex coil geometries, larger system size, and greater implementation difficulty. In the second category, compensation network adjustment, variable compensation components or switchable networks are used to dynamically adapt to changes in the coupling coefficient. For example, a switch-controlled capacitor (SCC) was integrated into an LCC-S topology to enable efficient energy transfer under coil misalignment. Experimental results showed that the system maintained high efficiency under different offset directions [5]. An active compensation network based on variable resonant capacitance was also proposed. 
Digital control strategies and optimization algorithms were employed to adjust the transmitter-side series–parallel compensation capacitors in real time, thereby maintaining a constant output voltage and improving overall efficiency [6]. Detuning effects in the LCC-S topology were further investigated to assess their impacts on power, efficiency, and input phase. An auxiliary compensation coil was then proposed to reduce reactance deviations caused by temperature drift or manufacturing errors [7]. Although compensation network adjustment provides superior adaptability and control precision compared to structural optimization, it often requires complex switching circuits or tunable components. This leads to higher hardware costs and increased control complexity. The third category of methods is based on adaptive frequency control strategies. Due to their fast response and low hardware requirements, these methods have received wide attention in recent years. According to the implementation approach, they can be roughly divided into three types. The first type is measurement-based methods, which dynamically adjust the frequency by online acquisition of system parameters to maintain operation at the optimal efficiency point. Ref. [8] proposed a WPT system using an integrated coupler, where coaxially overlapping capacitors and coils achieve frequency tracking. This system maintains stable resonance under significant changes in coil coupling without requiring additional compensation components. Furthermore, to address detuning caused by parameter variations, ref. [9] proposed a resonant frequency adaptive tracking strategy based on switch-controlled capacitors and a perturb-and-observe frequency tracking algorithm. This method can adjust the resonant frequency in real time using only signals collected from the transmitter. 
Experiments verified that the system maintains continuous resonance within the frequency range specified by the SAEJ2954 standard [10], with a response time of less than 1 ms and significantly improved transmission efficiency. The second type is compensation-element tuning methods, which adjust the parameters of compensation elements to adapt to coil variations and achieve frequency control. For the LCC-S compensation network, ref. [11] proposed a detuned capacitance selection method. By slightly detuning the transmitter capacitance, the method achieves stable voltage output under coupling coefficients ranging from 0.18 to 0.38. Ref. [12] further proposed a power-compensation-based tuning method, which adjusts only the receiver compensation capacitor in combination with a DC/DC converter to achieve adaptive impedance matching for coil parameter variations. The third type is digital control methods, which combine digital signal processing with advanced control strategies to achieve high-precision frequency tracking and adaptive control. Optimal frequency tracking techniques based on perturb-and-observe and phase-locked loop (PLL) methods allow the system to maintain high-efficiency adaptive frequency control under various operating conditions [13]. In addition, ref. [14] proposed a high-precision frequency tracking strategy using an FPGA-based control system. By continuously monitoring the phase difference between the transmitter voltage and current and dynamically adjusting the phase compensation angle, the system operates in zero-voltage switching mode, reducing losses and improving output power and transmission efficiency under different coupling and load conditions.
Taken together, these studies demonstrate that misalignment tolerance can be improved not only through frequency control but also by coil topology design, compensation network combination, and geometric optimization. However, in multi-relay MC-WPT systems, due to high-dimensional coupling paths, complex mutual inductance matrices, and multiple resonance peaks, these methods cannot be directly applied. Traditional frequency-tracking algorithms struggle to accurately identify the global optimal operating frequency, and frequent frequency switching may lead to dynamic instability or slow system response. Moreover, most existing frequency control strategies do not consider relay routing mechanisms, and systematic research on bypassing or shielding low-efficiency or severely misaligned relay coils remains limited [15,16]. Therefore, it is necessary to adopt adaptive frequency optimization and energy routing strategies to cope with complex coupling conditions and improve both transmission efficiency and system robustness. To overcome the limitations of existing studies, this paper proposes an adaptive frequency optimization method for a three-relay MC-WPT system that integrates coupled-mode theory, offline clustering analysis, and online reinforcement learning inference. First, a physical correlation model among the coupling coefficient, operating frequency, and transmission efficiency is established based on coupled-mode theory. A large set of system state data is then generated offline under various coupling conditions. Next, clustering algorithms are employed to categorize these states and extract the optimal frequency points corresponding to the highest efficiency in each cluster. This constructs the mapping between coupling features and optimal frequencies. Furthermore, a Q-learning-based reinforcement learning framework is introduced to achieve online correction and adaptive reconstruction of the physical model through temporal-difference updates. 
This enables accurate estimation of optimal operating frequencies even for coupling states not covered in the offline database. For severe misalignment scenarios, a frequency retuning-based relay shielding strategy is also proposed. It allows inefficient relays to automatically exit the resonant chain, thereby enabling dynamic energy-path reconfiguration and global efficiency optimization [17].
In summary, this paper focuses on maintaining optimal transmission efficiency in a three-relay MC-WPT system under coil misalignment conditions. It proposes an adaptive optimal-frequency decision-making framework that integrates theoretical modeling, data clustering, and machine-learning inference. The main contributions are summarized as follows:
(1) A three-relay MC-WPT multi-coupling-path efficiency correlation model based on coupled-mode theory is proposed to quantitatively relate the coupling coefficients, operating frequency, and transmission efficiency.
(2) An offline coupling-state database covering multiple misalignment scenarios is constructed. Clustering analysis is used to extract representative coupling states, optimal frequencies, and maximum-efficiency clusters, achieving a structured representation of the high-dimensional coupling state space. Furthermore, a Q-learning-assisted machine learning framework is introduced. Given the current coupling state as input, it can predict the optimal frequency in real time. Continuous interaction and temporal-difference updates allow adaptive reconstruction of the physical model, enabling the system to quickly retune to the maximum efficiency point under misalignment conditions and avoiding the delays and local optima associated with conventional frequency-scanning methods.
(3) A shielding-based routing strategy for severely misaligned coils is proposed. By adjusting the frequencies of non-misaligned coils, detuned relays automatically exit the resonant path, enabling dynamic reconstruction of the energy transfer route and achieving globally optimal efficiency.

2. System Modeling Based on Coupled-Mode Theory

Figure 1 illustrates the system, which comprises a high-frequency inverter, a multi-relay coupling structure, and a high-frequency rectifier with load. The inverter generates a high-frequency square wave with a frequency between $f_1$ and $f_n$, while the coupling structure is designed to resonate at the system’s initial frequency $f^{*}$, corresponding to optimal efficiency under the given coupling condition. This figure highlights the energy transfer path and key functional modules, providing a foundation for subsequent physical modeling and frequency-adaptive optimization. To achieve adaptive frequency selection for a multi-relay MC-WPT system under complex spatial configurations, it is first necessary to establish a mathematical model describing the interrelationships among the physical coupling state, transmission efficiency, and operating frequency. This section begins with the definition of the system state vector. By incorporating the concepts of effective coupling coefficient and coupled-mode theory, a frequency–feature mapping model of the system is constructed. This model provides the mathematical foundation for the subsequent clustering analysis and Q-learning-based optimization algorithm [18].

2.1. Definition and Normalization of System State Vector

To characterize the operating characteristics of the three-relay MC-WPT system under different coil misalignment conditions, the system state vector is defined as follows:
$$s = [f, \eta, \phi]^{T}$$
where $f$ denotes the operating frequency of the system, $\eta$ represents the transmission efficiency, and $\phi$ indicates the coil misalignment state; a larger value of $\phi$ corresponds to a more severe misalignment condition. The misalignment state is defined as
$$\phi = \frac{\lVert k - k_{\text{ideal}} \rVert}{\lVert k_{\text{ideal}} \rVert}$$
where $k_{\text{ideal}}$ denotes the ideal coupling coefficient matrix, and $k$ represents the cross-coupling matrix of the system:
$$k = \begin{bmatrix} 0 & k_{12} & k_{13} & k_{14} & k_{15} \\ k_{21} & 0 & k_{23} & k_{24} & k_{25} \\ k_{31} & k_{32} & 0 & k_{34} & k_{35} \\ k_{41} & k_{42} & k_{43} & 0 & k_{45} \\ k_{51} & k_{52} & k_{53} & k_{54} & 0 \end{bmatrix}$$
To eliminate the influence of dimensional differences and facilitate subsequent calculations, the state vector is normalized as follows:
$$\tilde{s} = \left[ \frac{f}{f_{\max}}, \frac{\eta}{\eta_{\max}}, \frac{\phi}{\phi_{\max}} \right]^{T}$$
The normalization coefficients are determined based on the system design parameters, where $\eta_{\max}$ denotes the theoretical maximum efficiency, $f_{\max}$ represents the maximum allowable operating frequency, and $\phi_{\max}$ indicates the maximum permissible coil misalignment.
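As a minimal sketch of the definitions in this subsection, the following Python snippet computes the misalignment state from a coupling matrix and then normalizes the state vector. The text does not specify which matrix norm is used, so the Frobenius norm is assumed here. The helper `chain_coupling` and all numerical values (coupling coefficients, 85 kHz operating point, normalization limits) are hypothetical and chosen only for illustration.

```python
import math

def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows (assumed norm)."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

def misalignment_index(k, k_ideal):
    """phi = ||k - k_ideal|| / ||k_ideal||."""
    diff = [[a - b for a, b in zip(row_k, row_i)]
            for row_k, row_i in zip(k, k_ideal)]
    return frobenius_norm(diff) / frobenius_norm(k_ideal)

def normalize_state(f, eta, phi, f_max, eta_max, phi_max):
    """Normalized state vector [f/f_max, eta/eta_max, phi/phi_max]."""
    return [f / f_max, eta / eta_max, phi / phi_max]

def chain_coupling(values):
    """Hypothetical helper: symmetric matrix with nearest-neighbour coupling only."""
    n = len(values) + 1
    k = [[0.0] * n for _ in range(n)]
    for i, v in enumerate(values):
        k[i][i + 1] = k[i + 1][i] = v
    return k

# Illustrative five-coil example: the link between coils 2 and 3 is weakened.
k_ideal = chain_coupling([0.10, 0.10, 0.10, 0.10])
k_shifted = chain_coupling([0.10, 0.05, 0.10, 0.10])
phi = misalignment_index(k_shifted, k_ideal)
state = normalize_state(f=85e3, eta=0.80, phi=phi,
                        f_max=200e3, eta_max=1.0, phi_max=1.0)
print(phi, state)
```

Cross-coupling between non-adjacent coils is set to zero in this example only to keep the numbers easy to follow; the full matrix in the text allows every pair to couple.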

2.2. Frequency–Coupling Coefficient Mapping Modeling

To establish a quantitative relationship between the coil coupling coefficient and the system operating frequency, the concept of effective coupling is introduced. The effective coupling coefficients of various coupling paths in the system vary with the operating frequency. The relationship between the intrinsic coil coupling and the system effective coupling can be expressed as follows:
$$k_{mn}^{\text{eff}}(f) = k_{mn} \times H_{mn}(f)$$
where $k_{mn}^{\text{eff}}(f)$ denotes the effective coupling coefficient, which is a dynamic parameter dependent on the operating frequency, and $H_{mn}(f)$ represents the frequency response function, characterizing the modulation effect of circuit resonance on the coupling performance.
$$H_{mn}(f) = \frac{1}{1 + Q_m^{2} \left( \dfrac{f}{f_m} - \dfrac{f_m}{f} \right)} \times \frac{1}{1 + Q_n^{2} \left( \dfrac{f}{f_n} - \dfrac{f_n}{f} \right)}$$
where $f_m$ and $f_n$ denote the self-resonant frequencies of coils $m$ and $n$, respectively. Therefore, the effective coupling coefficient matrix of the five-coil system at frequency $f_i$ is given as follows:
$$k_i^{\text{eff}} = \begin{bmatrix} 0 & k_{12} H_{12}(f_i) & \cdots & k_{15} H_{15}(f_i) \\ k_{21} H_{21}(f_i) & 0 & \cdots & \vdots \\ \vdots & \vdots & \ddots & k_{45} H_{45}(f_i) \\ k_{51} H_{51}(f_i) & \cdots & k_{54} H_{54}(f_i) & 0 \end{bmatrix}$$
In summary, the modulation effect of frequency on the effective coupling coefficient reveals the resonant filtering characteristics of the system. When the operating frequency approaches the self-resonant band of the coils, the energy transfer channels of the coupling links are effectively activated. This demonstrates a strong frequency response behavior. These observations indicate that the transmission performance of the system depends heavily on the coupling dynamics at the operating frequency. Therefore, based on the established mapping between frequency and coupling coefficient, it is important to further investigate their relationship with system transmission efficiency.

2.3. Modeling of Effective Coupling–Efficiency Mapping

To establish a quantitative relationship between the system’s effective coupling and transmission efficiency, the coupled-mode theory (CMT) is introduced. The system is assumed to be a five-coil structure operating under steady-state conditions. Radiation losses are neglected, while only ohmic losses and load losses are considered. All coils are assumed to operate at the same excitation frequency. The energy state of coil $m$ is described by a complex amplitude $a_m$, where $|a_m|^2$ represents the energy stored in coil $m$. For this system, the coupled-mode equations can be expressed as follows:
$$\frac{d a_m}{d t} = (j \omega_m - \Gamma_m) a_m + j \sum_{n \neq m} k_{mn} a_n + F_m \delta_{m1}$$
where $\omega_m$ is the resonant angular frequency of coil $m$, $\Gamma_m$ denotes the total decay rate of coil $m$, and $k_{mn}$ represents the coupling coefficient between coils $m$ and $n$. The term $F_m$ denotes the external driving source or excitation applied to coil $m$, i.e., the high-frequency power supply connected to the transmitting coil, which serves as the total energy input of the system. $\delta_{m1}$ is the Kronecker delta, which equals 1 when $m = 1$ and 0 otherwise.
The radiation attenuation of the system coils in air can be approximately estimated by the following formula:
$$P_{\text{rad}} = \frac{\mu_0 \pi^{4} f^{4} r^{4} I^{2}}{6 c^{3}}$$
In the equation, $\mu_0 = 4\pi \times 10^{-7}$ H/m denotes the permeability of free space; $f$ is the system operating frequency, ranging from 10 to 200 kHz; $r$ represents the coil radius, which is 10 cm; $I$ is the coil current amplitude, assumed to be 10 A (and not exceeding 10 A in actual experiments); and $c$ is the speed of light. Based on these parameters, the radiated power of the system coils in air is estimated to lie approximately within the range of $7.556 \times 10^{-17}$ W to $1.209 \times 10^{-11}$ W. Therefore, in this low-frequency system operating between 10 and 200 kHz, coil radiation loss can be considered negligible.
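The estimate above can be checked numerically. The short sketch below evaluates the radiation formula at the two ends of the stated frequency band with the stated radius and current; the function name `radiated_power` is introduced here for illustration. Because the formula includes the squared current, the result is a power in watts.

```python
import math

MU_0 = 4 * math.pi * 1e-7   # permeability of free space, H/m
C_LIGHT = 299_792_458.0     # speed of light, m/s

def radiated_power(f_hz, radius_m, current_a):
    """P_rad = mu_0 * pi^4 * f^4 * r^4 * I^2 / (6 * c^3)."""
    return (MU_0 * math.pi ** 4 * f_hz ** 4 * radius_m ** 4 * current_a ** 2
            / (6 * C_LIGHT ** 3))

# Stated parameters: r = 10 cm, I = 10 A, f from 10 kHz to 200 kHz.
p_low = radiated_power(10e3, 0.10, 10.0)
p_high = radiated_power(200e3, 0.10, 10.0)
print(f"{p_low:.3e} W at 10 kHz, {p_high:.3e} W at 200 kHz")
```

The two values agree with the quoted range to within a fraction of a percent; the small difference comes from using the exact speed of light rather than a rounded value. The fourth-power dependence on frequency explains the factor of $20^4$ between the two ends of the band.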
Based on this theory, the overall transmission efficiency of the system can be expressed as follows:
$$\eta = \frac{2 \Gamma_{5,\text{load}}}{\Gamma_5} \times \frac{|a_5|^2}{|a_1|^2}$$
where $\frac{2 \Gamma_{5,\text{load}}}{\Gamma_5}$ denotes the transmission efficiency from the receiving coil to the system load, $\frac{|a_5|^2}{|a_1|^2}$ represents the energy transfer efficiency from the transmitting coil to the receiving coil, which includes the coupling effects of all transmission paths, and $\Gamma_5$ denotes the total decay rate of the receiving coil:
$$\Gamma_5 = \Gamma_{5,\text{load}} + \Gamma_{5,\text{loss}}$$
where $\Gamma_{5,\text{load}}$ denotes the rate of energy transferred from the receiving coil to the load, and $\Gamma_{5,\text{loss}}$ represents the rate of energy dissipation due to various losses.
In summary, $\frac{|a_5|^2}{|a_1|^2}$ encompasses the net effect of all effective coupling paths, which directly determines the overall transmission efficiency of the system. Therefore, it can intuitively reflect the quantitative influence of the system’s effective coupling state on its transmission efficiency.
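To make the role of the amplitude ratio concrete, the following sketch solves the coupled-mode equations in steady state for a five-coil chain and evaluates $|a_5|^2 / |a_1|^2$ on and off resonance. Substituting $a_m = A_m e^{j\omega t}$ turns the differential equations into the linear system $[j(\omega - \omega_m) + \Gamma_m] A_m - j \sum_{n \neq m} k_{mn} A_n = F_m \delta_{m1}$, which is solved here by plain Gaussian elimination. Several things are assumptions of this sketch rather than values from the paper: the coupling terms are treated as coupling rates in rad/s (named `kappa`), only nearest-neighbour coupling is included, and the resonant frequency, decay rates, and coupling rate are illustrative numbers.

```python
import math

def solve_linear(a, b):
    """Solve a x = b for a small complex system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def steady_state_amplitudes(omega, omega_res, gamma, kappa, drive=1.0):
    """Steady-state complex amplitudes A_m when only coil 1 is driven."""
    n = len(omega_res)
    system = [[0j] * n for _ in range(n)]
    for m in range(n):
        for j in range(n):
            if m == j:
                system[m][j] = 1j * (omega - omega_res[m]) + gamma[m]
            else:
                system[m][j] = -1j * kappa[m][j]
    rhs = [complex(drive)] + [0j] * (n - 1)
    return solve_linear(system, rhs)

def amplitude_ratio(amplitudes):
    """|a_5|^2 / |a_1|^2: last coil relative to the driven coil."""
    return abs(amplitudes[-1]) ** 2 / abs(amplitudes[0]) ** 2

# Illustrative values (assumptions): 85 kHz resonance, nearest-neighbour coupling.
omega_0 = 2 * math.pi * 85e3
omega_res = [omega_0] * 5
gamma = [1e3, 1e3, 1e3, 1e3, 6e3]   # receiver: 1e3 loss + 5e3 load, in 1/s
kappa = [[2e4 if abs(m - j) == 1 else 0.0 for j in range(5)] for m in range(5)]

on_res = amplitude_ratio(steady_state_amplitudes(omega_0, omega_res, gamma, kappa))
off_res = amplitude_ratio(steady_state_amplitudes(1.2 * omega_0, omega_res, gamma, kappa))
print(on_res, off_res)
```

With these numbers the ratio collapses by several orders of magnitude when the drive is detuned by 20%, which is the frequency sensitivity that the efficiency expression inherits through the $|a_5|^2 / |a_1|^2$ factor.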

2.4. Frequency–Feature Mode Mapping

Through the analyses presented in the previous two subsections, the physical mapping relationships among the system operating frequency, coupling coefficient, and transmission efficiency have been established. To further achieve holistic feature modeling from input frequency to overall system response, a frequency–feature mode mapping matrix is introduced. This matrix characterizes the optimal efficiency modes and their distribution under different coupling conditions. It provides a unified physical model foundation for the subsequent Q-learning process.
To establish the mapping matrix from frequency to system characteristic modes, the set of selectable operating frequencies of the system is defined as follows:
$$F = \{ f_1, f_2, \ldots, f_M \}$$
where M denotes the number of available operating frequencies.
Establishment of the Frequency–Feature Mode Mapping Matrix:
$$M_{\text{route}} = \begin{bmatrix} f_1 & \eta(f_1) & \phi(f_1) \\ f_2 & \eta(f_2) & \phi(f_2) \\ \vdots & \vdots & \vdots \\ f_M & \eta(f_M) & \phi(f_M) \end{bmatrix} \tag{13}$$
Each row of data corresponds to the system characteristic mode associated with a specific operating frequency. Each mode represents the condition under which the system achieves optimal transmission efficiency for the given frequency and coupling state. The detailed procedure for obtaining the offline database is as follows:
As shown in Figure 2, the physical modeling and mapping matrix framework illustrates the process of establishing quantitative relationships among system frequency, coupling, and transmission efficiency. In this framework, a physical model based on coupled-mode theory is used to generate system characteristic modes at different operating frequencies. Each mode represents the steady-state performance of the system under a given frequency and coupling condition, thereby forming an offline database. This database provides physically consistent prior knowledge for the subsequent Q-learning process and enables the mapping from a continuous physical space to a finite state–action space. However, in practical systems, factors such as environmental disturbances, component parameter drifts, and non-ideal coupling may cause certain operating conditions to fall outside the coverage of the offline model. To address this issue, a physics-guided hybrid exploration strategy is introduced during the reinforcement learning stage. When the system encounters new states not covered by the offline database, the Q-learning mechanism initiates an adaptive exploration process. Through temporal-difference (TD) updates and experience replay, it continuously optimizes the frequency selection policy and feeds the newly acquired experience samples back into the mapping matrix. This design establishes a closed-loop self-evolution mechanism that integrates physical modeling, data-driven learning, and model correction. The physical model provides an interpretable initial structure for learning, while reinforcement learning continuously extends its applicability during real-time operation, achieving the co-evolution and adaptive reconstruction of the physical model and the intelligent control algorithm. In addition, the system state vector is reflected in the mapping matrix of the physical model. 
As a result, the model does not rely on a single measurement during decision-making but instead refers to multiple coupling state patterns. This makes the method robust to measurement noise and transient parameter disturbances. Minor parameter fluctuations are smoothed through the averaging process of the mapping matrix, making them unlikely to affect the selection of the optimal frequency. Moreover, the temporal-difference (TD) update mechanism in Q-learning accumulates expected rewards over multiple iterations, which helps suppress short-term disturbances. By integrating the physical model with reinforcement learning, the proposed control strategy enables stable decision-making even in the presence of measurement noise or transient parameter variations, ensuring the reliability of the algorithm in practical applications.
In summary, the mapping model established based on coupled-mode theory describes the relationship between the system’s energy transfer characteristics and coupling states under different operating frequencies. However, the mapping data are continuous and high-dimensional. Directly applying them to intelligent optimization algorithms would lead to an excessively large state space and high computational complexity, making effective decision-making difficult. Therefore, clustering analysis is required to extract representative characteristic patterns from the system state features in the mapping matrix. This process divides the complex high-dimensional physical feature space into a finite number of state clusters, providing a discretized and learnable state set for the subsequent adaptive Q-learning process. Based on this foundation, the next section introduces the implementation of clustering analysis and the feature extraction procedure.

3. K-Means Clustering Algorithm

Based on the previously established physical model, the state vectors in the mapping matrix represent continuous and high-dimensional information. To reduce the complexity of these continuous high-dimensional data and provide a discrete foundation for the subsequent adaptive Q-learning process, it is necessary to extract representative feature patterns from the mapping matrix. First, since the mapping matrix is symmetric, its upper triangular part is vectorized to eliminate redundant information. Then, principal component analysis (PCA) is applied to project the high-dimensional data onto orthogonal principal components, preserving the directions with the greatest variance while removing noise and redundant information. Although PCA significantly reduces data dimensionality, the reduced data remain continuous. To further discretize them and construct a finite and interpretable set of system states, the K-means clustering algorithm is employed, where each cluster center represents a typical system feature pattern. Compared with other clustering methods, K-means offers high computational efficiency, physical interpretability of cluster centers, and dynamic updatability. Therefore, the combination of PCA and K-means not only transforms the high-dimensional continuous state-space data into a learnable form but also preserves the key physical characteristics of the system.
In the feature space reduced by PCA, the K-means clustering algorithm is employed to partition the continuous states into a finite set of discrete state clusters:
$$C = \{ c_i \}, \quad i = 1, 2, \ldots, K$$
where each cluster center represents a typical system state:
$$c_i = [\eta_i, f_i, \phi_i]^{T} = \tilde{s}_i$$
The similarity between different states can be measured using a weighted Euclidean distance:
$$d(\tilde{s}_i, \tilde{s}_j) = \sqrt{ \sum_{l=1}^{3} w_l \times (\tilde{s}_{i,l} - \tilde{s}_{j,l})^{2} }$$
where the weight $w_l$ is determined according to the importance of each state component in the routing decision: efficiency, as the core optimization objective, is assigned the highest weight.
For the continuous state vector $\tilde{s}_t$ at the current time, the nearest cluster is identified using the equation
$$i^{*} = \arg\min_{i = 1, \ldots, K} d(\tilde{s}_t, c_i)$$
If the minimum weighted Euclidean distance satisfies
$$d(\tilde{s}_t, c_{i^{*}}) > \delta$$
and the current number of clusters $K$ satisfies
$$K < K_{\max}$$
where $\delta$ is the cluster creation threshold and $K_{\max}$ denotes the maximum number of clusters, which prevents unlimited growth of the cluster set, then a new cluster is created with its center initialized to the current state $\tilde{s}_t$, and a new feature mode is formed in the physical feature space (e.g., a newly emerged dominant coupling path); that is,
$$c_{K+1} = \tilde{s}_t, \quad K \leftarrow K + 1$$
Otherwise, the current state $\tilde{s}_t$ is assigned to the existing cluster $i^{*}$, and the corresponding cluster center is updated as follows:
$$c_{i^{*}} \leftarrow (1 - \beta) c_{i^{*}} + \beta \tilde{s}_t$$
where $\beta$ denotes the update rate of the cluster center, which is dynamically adjusted according to the clustering quality:
$$\beta = \beta_{\text{base}} \times \frac{Q_{\text{target}}}{Q_i}$$
where $\beta_{\text{base}}$ denotes the initial update rate, $Q_i$ represents the quality evaluation index of the $i$-th cluster, and $Q_{\text{target}}$ is the target clustering quality.
The clustering quality index is defined as follows:
$$Q_{\text{cluster}} = \frac{1}{K} \sum_{i=1}^{K} \frac{1}{|C_i|} \sum_{\tilde{s} \in C_i} d(\tilde{s}, c_i)$$
This index reflects the cohesion and stability of the clustering results; a smaller value indicates a more compact and representative cluster. By updating the parameter β online, the system can dynamically optimize the positions of the cluster centers under varying state distributions, thereby achieving adaptive discretization of the state space.
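The online assignment rule described in this section can be sketched as follows: find the nearest center under the weighted distance, create a new cluster if the distance exceeds the threshold and the cluster budget allows it, and otherwise move the matched center toward the new state with a quality-dependent rate. This is a minimal illustration, not the authors' implementation. The class name `OnlineClusters`, the per-cluster quality (taken here as the mean member-to-center distance of that cluster), the cap of the update rate at 1, the seeding of the first cluster from the first state, and all numerical settings are assumptions.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two normalized state vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

class OnlineClusters:
    """Hypothetical container for cluster centers with threshold-based growth."""

    def __init__(self, weights, delta, k_max, beta_base, q_target):
        self.weights = weights
        self.delta = delta          # cluster creation threshold
        self.k_max = k_max          # maximum number of clusters
        self.beta_base = beta_base  # initial update rate
        self.q_target = q_target    # target clustering quality
        self.centers = []
        self.members = []

    def _new_cluster(self, state):
        self.centers.append(list(state))
        self.members.append([list(state)])
        return len(self.centers) - 1

    def quality(self, i):
        """Mean member-to-center distance of cluster i (smaller is more compact)."""
        pts = self.members[i]
        return sum(weighted_distance(p, self.centers[i], self.weights)
                   for p in pts) / len(pts)

    def assign(self, state):
        if not self.centers:                      # assumed bootstrap step
            return self._new_cluster(state)
        dists = [weighted_distance(state, c, self.weights) for c in self.centers]
        best = min(range(len(dists)), key=dists.__getitem__)
        if dists[best] > self.delta and len(self.centers) < self.k_max:
            return self._new_cluster(state)
        self.members[best].append(list(state))
        q_i = self.quality(best)
        beta = self.beta_base * self.q_target / q_i if q_i > 0 else self.beta_base
        beta = min(beta, 1.0)                     # assumed cap, not from the text
        self.centers[best] = [(1 - beta) * c + beta * s
                              for c, s in zip(self.centers[best], state)]
        return best

# States are ordered [f, eta, phi]; efficiency gets the largest weight.
clusters = OnlineClusters(weights=[1.0, 2.0, 1.0], delta=0.2, k_max=3,
                          beta_base=0.1, q_target=0.05)
labels = [clusters.assign(s) for s in ([0.40, 0.80, 0.10],
                                       [0.42, 0.78, 0.12],
                                       [0.80, 0.40, 0.60])]
print(labels, clusters.centers)
```

In this run the second state joins the first cluster and shifts its center slightly, while the third state is far enough away to open a new cluster.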
In summary, the K-means clustering algorithm provides a finite and physically consistent state space for the subsequent Q-learning process. It offers an interpretable representation of the system model and establishes the foundation for adaptive frequency selection and energy routing optimization under severe misalignment conditions.

4. Physics-Guided Q-Learning-Based Multi-Objective Frequency Routing Optimization

4.1. Q-Function Design for Frequency Selection

In the reinforcement learning framework, the Q-function serves as the core function characterizing the long-term value of an action under a given state. To enable the algorithm to effectively evaluate the advantages and disadvantages of different frequency selections under various system states, a frequency-selection Q-value function model is constructed. The long-term value of choosing frequency f for clustered state i is defined as follows:
Q ( i , f ) = E [ r route ( i , f ) + γ Q max f F Q ( i , f ) ]
where r route ( i , f ) denotes the immediate reward obtained by selecting frequency f under the clustered state i; γ Q is the discount factor that represents the degree of importance assigned to future rewards ( 0 γ Q 1 ); max f F Q ( i , f ) denotes the future reward, i.e., the maximum value attainable in the next clustered state i when the optimal frequency f is selected; and the entire expression represents the expected value—the immediate reward r route ( i , f ) brought by the current action f, plus the discounted expectation of future rewards.
The frequency–performance mapping model described above is utilized to initialize the Q-function for frequency selection as follows:
$$Q_0(i,f) = \frac{\eta(f,c_i)}{\eta_{\max}} \tag{25}$$
where $\eta(f,c_i)$ is obtained from the mapping matrix $M_{\text{route}}$, as given in Equation (13). This initialization provides physical prior knowledge for the subsequent learning process, allowing high-efficiency actions to be preferentially selected at the early stage. It also significantly accelerates the convergence of Q-learning and ensures consistency between the learned policy and the physical model under high-confidence states, thereby providing a reliable foundation for subsequent adaptive exploration and model correction.
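The initialization can be sketched as follows. This is a minimal illustration, assuming the mapping matrix is available as a nested list `eta_map[i][f]` of model efficiencies and that $\eta_{\max}$ is the largest entry of that matrix (the text does not state whether the maximum is global or per state). The function name and the sample values are hypothetical.

```python
def init_q_table(eta_map):
    """Build the initial Q-table by normalizing model efficiencies.

    eta_map[i][f] is the model efficiency for clustered state i at
    candidate frequency index f (hypothetical layout of the mapping matrix).
    """
    # Assumption: eta_max is the global maximum over all states and frequencies.
    eta_max = max(max(row) for row in eta_map)
    return [[eta / eta_max for eta in row] for row in eta_map]

# Hypothetical 2-state, 2-frequency mapping matrix.
q0 = init_q_table([[0.50, 0.90], [0.45, 0.30]])
```

With this prior, the highest-efficiency entry starts at a Q-value of 1 and all other entries start proportionally lower, so a greedy policy initially follows the physical model.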

4.2. Design of the Composite Reward Function

To achieve comprehensive optimization of system performance, a multi-objective weighted composite reward function is designed, with objectives including transmission efficiency, routing gain, stability, and physical-model consistency.
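$$r_{\text{total}} = r_{\text{efficiency}} + r_{\text{routing}} + r_{\text{stability}} + r_{\text{physical}} \tag{26}$$
1. Design of the Efficiency Reward
$$r_{\text{efficiency}} = \lambda_\eta \times \tanh\left(\frac{\eta - \eta_{\text{target}}}{\eta_{\text{target}}}\right) \tag{27}$$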
The efficiency reward is designed using a hyperbolic tangent function, which maps the efficiency deviation into the interval $(-1, 1)$, thereby preventing excessively large reward values that could cause learning instability. The function is approximately linear near the origin, which facilitates gradient computation. The target efficiency $\eta_{\text{target}}$ is determined based on system theoretical analysis and experimental data. The weighting coefficient $\lambda_\eta$ serves as the reference weight, reflecting the central importance of efficiency in the overall system optimization.
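2. Design of the Routing-Gain Reward
$$r_{\text{routing}} = \lambda_r \times \frac{\eta_f - \bar{\eta}_{\text{hist}}}{\bar{\eta}_{\text{hist}}} \tag{28}$$
where $\eta_f$ denotes the actual efficiency at the current frequency, and $\bar{\eta}_{\text{hist}}$ represents the historical average efficiency, which is obtained using an exponentially weighted moving average as follows:
$$\bar{\eta}_{\text{hist}} \leftarrow (1 - \alpha_{\text{hist}}) \times \bar{\eta}_{\text{hist}} + \alpha_{\text{hist}} \times \eta_f \tag{29}$$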
This equation is an exponentially weighted moving average (EWMA) update, where $\alpha_{\text{hist}}$ (between 0 and 1) is the historical learning rate that balances responsiveness to recent performance against long-term stability. If $\alpha_{\text{hist}}$ is relatively large (e.g., 0.9), the historical average responds more quickly to recent performance; if $\alpha_{\text{hist}}$ is smaller (e.g., 0.1), the historical average changes more slowly, maintaining greater long-term stability. The weighting coefficient $\lambda_r$ is introduced to moderately encourage exploration without excessively affecting the primary optimization objective. This design encourages the model to explore new routing paths that can improve the overall system performance.
3. Design of the Stability Penalty
$$r_{\text{stability}} = -\lambda_S \times \mathbf{1}_{\text{switch}} \times e^{-\frac{t - t_{\text{last}}}{\tau}} \tag{30}$$
where $\mathbf{1}_{\text{switch}}$ is the switching indicator function, which takes the value of 1 when a frequency switch occurs and 0 otherwise, directly penalizing switching behaviors. $e^{-(t - t_{\text{last}})/\tau}$ denotes the exponential decay term, reflecting the time-dependent attenuation of the penalty. $\tau = 10$ is the time constant that controls the decay rate and is determined based on the system’s dynamic response characteristics. $\lambda_S$ represents the weighting coefficient used to balance performance optimization and operational stability.
4. Design of the Physical-Model Consistency Reward
$$r_{\text{physical}} = \lambda_C \times \max\left[0,\ 1 - \frac{|1 - \eta(f,c_i)|}{\eta(f,c_i)}\right] \tag{31}$$
This component encourages consistency between the actual system performance and the frequency–routing mapping model, thereby enhancing the physical interpretability of the learning process and preventing the learned results from deviating beyond physically reasonable boundaries.
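The reward components above can be sketched as small functions. This is a minimal illustration, not the authors' implementation: all weight values (`lam_eta`, `lam_r`, `lam_s`) and the default `alpha_hist` are hypothetical, the stability term is assumed to enter with a negative sign because it is described as a penalty, and the physical-model consistency term is passed in as a precomputed value `r_physical`.

```python
import math

def efficiency_reward(eta, eta_target, lam_eta=1.0):
    # tanh keeps the relative efficiency deviation inside (-1, 1).
    return lam_eta * math.tanh((eta - eta_target) / eta_target)

def update_history(eta_hist, eta_f, alpha_hist=0.1):
    # EWMA update of the historical average efficiency.
    return (1.0 - alpha_hist) * eta_hist + alpha_hist * eta_f

def routing_reward(eta_f, eta_hist, lam_r=0.2):
    # Relative gain of the current efficiency over the historical average.
    return lam_r * (eta_f - eta_hist) / eta_hist

def stability_penalty(switched, t, t_last, lam_s=0.1, tau=10.0):
    # Assumption: the penalty is negative and decays with the time
    # elapsed since the previous frequency switch.
    if not switched:
        return 0.0
    return -lam_s * math.exp(-(t - t_last) / tau)

def total_reward(eta, eta_target, eta_hist, switched, t, t_last, r_physical=0.0):
    # Sum of the four components; r_physical is supplied by the caller.
    return (efficiency_reward(eta, eta_target)
            + routing_reward(eta, eta_hist)
            + stability_penalty(switched, t, t_last)
            + r_physical)
```

Under these assumptions, switching again shortly after the previous switch costs close to the full `lam_s`, while a switch made long after the previous one is penalized only slightly.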

4.3. Q-Learning Update Rule

After defining the Q-function and the multi-objective reward, the algorithm requires a mechanism to update the Q-values in order to capture the temporal dynamics of the system. Conventional Q-learning relies solely on empirical data for updates, making it susceptible to noise and model bias. Therefore, in this section, a physics-based correction term is introduced within the temporal-difference (TD) learning framework, enabling the update process to integrate both data-driven adaptation and physical consistency. This design enhances the robustness and interpretability of the Q-value iteration process.
Based on the temporal-difference (TD) principle, the Q-learning update rule is defined as follows:
$$Q(i,f) \leftarrow Q(i,f) + \alpha(i,f) \times \left[\delta_{\text{TD}} + \lambda_{\text{ph}}\,\Delta_{\text{physical}}(i,f)\right] \tag{32}$$
In this equation, the Q-value represents the expected long-term cumulative reward obtained by taking action $f$ in state $i$. The learning objective is to update the Q-values continuously so that they approach the true optimal values, thereby guiding the agent to make optimal decisions. Here, $\alpha(i,f)$ is the adaptive learning rate, which is adjusted based on the physical-model consistency mechanism (see Equation (44)); $\lambda_{\text{ph}}$ denotes the weighting coefficient of the physical correction term; and $\delta_{\text{TD}}$ represents the temporal-difference (TD) error, i.e., the deviation between the current Q-value estimate and the more accurate estimate obtained from actual experience, which is defined as follows:
$$\delta_{\text{TD}} = r_{\text{total}} + \gamma_\delta \times \max_{f' \in F} Q(i',f') - Q(i,f) \tag{33}$$
where $r_{\text{total}}$ denotes the composite immediate reward (see Equation (26)), and $\gamma_\delta \in (0,1)$ is the discount factor used to balance the trade-off between current and future returns.
Here, $\Delta_{\text{physical}}(i,f)$ represents the physical-model correction term, which is used to correct the physical mapping model in the reverse direction, based on feedback from the learning process:
$$\Delta_{\text{physical}}(i,f) = \eta(f,c_i) - Q(i,f) \tag{34}$$
This equation represents the deviation between the actual system performance and the current Q-value. It is used to update the Q-function in a backward-update manner, making the learning process more consistent with the underlying physical laws.
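A single update step can be sketched as follows. This is a minimal illustration under stated assumptions: the Q-table is a nested list indexed as `q[state][frequency_index]`, the next clustered state `i_next` is supplied by the caller, and the default values of `gamma` and `lam_ph` are hypothetical.

```python
def q_update(q, i, f, reward, i_next, eta_model, alpha, gamma=0.9, lam_ph=0.1):
    """Apply one physics-corrected TD update in place and return the TD error.

    eta_model is the mapping-model efficiency for (state i, frequency f).
    """
    # TD error: bootstrapped target minus the current estimate.
    delta_td = reward + gamma * max(q[i_next]) - q[i][f]
    # Physics correction: pulls Q(i, f) toward the model efficiency.
    delta_physical = eta_model - q[i][f]
    q[i][f] += alpha * (delta_td + lam_ph * delta_physical)
    return delta_td

# Hypothetical 2-state, 2-frequency table.
q = [[0.0, 0.0], [0.5, 1.0]]
td = q_update(q, 0, 1, reward=1.0, i_next=1, eta_model=0.9, alpha=0.5)
```

In this example the TD error is 1.9, the physics correction is 0.9, and the updated entry becomes 0.5 × (1.9 + 0.1 × 0.9) = 0.995.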

4.4. Physics-Guided Intelligent Routing Strategy

Based on the aforementioned Q-learning framework and update mechanism, the system can dynamically adjust the long-term value function of frequency selection while maintaining physical consistency. On this basis, a physics-guided hybrid strategy is further developed to achieve adaptive integration of the physical model with the learning policy. This approach enhances the system’s intelligent decision-making capability under varying confidence levels.
Construction of a Physics-Confidence-Based Hybrid Decision Strategy:
$$\pi(i) = \begin{cases} \pi_{\text{physical}}(i), & \text{if } C_{\text{physical}}(i) > \theta_{\text{high}} \\ \pi_{\text{learned}}(i), & \text{if } \theta_{\text{high}} \ge C_{\text{physical}}(i) > \theta_{\text{low}} \\ \pi_{\text{explore}}(i), & \text{otherwise} \end{cases} \tag{35}$$
The physical strategy $\pi_{\text{physical}}(i)$ is defined as:
$$\pi_{\text{physical}}(i) = \arg\max_{f \in F} \eta(f,c_i) \tag{36}$$
where the data $\eta(f,c_i)$ are obtained from the mapping matrix, as given in Equation (13). This strategy is a deterministic policy that selects the optimal frequency action based on the physical model data, with the aim of maximizing system transmission efficiency.
It operates independently of reinforcement learning experience and relies solely on the theoretical physical model.
The learning strategy $\pi_{\text{learned}}(i)$ is defined as:
$$\pi_{\text{learned}}(i) = \arg\max_{f \in F} Q(i,f) \tag{37}$$
where $Q(i,f)$ denotes the frequency selection function (see Equation (24)). This learning strategy not only reflects experiential learning but also retains the guidance of the physical model to a certain extent.
The exploration strategy $\pi_{\text{explore}}(i)$ (activated only when the confidence level is low) is defined as:
$$\pi_{\text{explore}}(i) = \arg\max_{f \in F_{\text{promising}}(i)} \left[ Q(i,f) + \lambda_N \times \text{Novelty}(i,f) \right] \tag{38}$$
where $\lambda_N$ is the novelty weighting coefficient, and $F_{\text{promising}}(i)$ denotes the promising frequency candidate set:
$$F_{\text{promising}}(i) = \{ f \in F \mid \eta(f,c_i) > 0.8 \times \eta_{\max} \} \tag{39}$$
This formulation is used to narrow the search space within the combined strategy of the physical model and reinforcement learning. A frequency $f$ is included in the candidate set only when its performance value exceeds 80% of the maximum achievable system performance.
Here, $\text{Novelty}(i,f)$ denotes the novelty exploration term:
$$\text{Novelty}(i,f) = \left( 1 - \frac{N_{\text{visit}}(i,f)}{\max(1, N_{\text{total}})} \right) \times \phi(f,c_i) \tag{40}$$
where $N_{\text{visit}}(i,f)$ denotes the number of times frequency $f$ has been selected under state $i$, and $N_{\text{total}}$ represents the total number of frequency selections under state $i$ throughout the entire training process. This formulation aims to increase the exploration rate for frequencies with fewer selections. If a frequency $f$ has rarely been attempted (i.e., $N_{\text{visit}}(i,f)$ is small), the novelty term becomes larger, encouraging exploration; conversely, if frequency $f$ has been frequently selected, the novelty reward is reduced, discouraging further exploration. As shown in Equation (2), $\phi(f,c_i)$ corresponds to the coil misalignment state. This exploration strategy enables the discovery of potentially high-performance frequencies when the confidence in the physical model is low. It also prevents the algorithm from prematurely converging to a local optimum.
Here, $C_{\text{physical}}(i)$ denotes the physical-model confidence:
$$C_{\text{physical}}(i) = \frac{1}{T} \sum_{t=1}^{T} \left[ \frac{\eta_{\text{actual}}(t)}{\eta_{\text{pred}}(t)} > 0.9 \right] \times A(i) \tag{41}$$
where $\eta_{\text{actual}}(t)$ and $\eta_{\text{pred}}(t)$ denote the experimentally measured and model-predicted performance values, respectively, and the bracket equals 1 when the condition holds and 0 otherwise. $A(i)$ represents the physical consistency metric, which is used to evaluate the consistency between the frequency action and the ideal physical model under clustered state $i$:
$$A(i) = \frac{1}{|F|} \sum_{f \in F} \exp\left( -\frac{|\eta_{\text{actual}} - \eta(f,c_i)|}{\eta(f,c_i)} \right) \tag{42}$$
$\theta_{\text{low}}$ and $\theta_{\text{high}}$ represent the confidence thresholds used to distinguish among the physical, learning, and exploration strategies:
$$\theta_{\text{high}} = 0.8 + 0.1 \times \left( 1 - \frac{t}{T_{\max}} \right), \qquad \theta_{\text{low}} = 0.6 + 0.2 \times \left( 1 - \frac{t}{T_{\max}} \right) \tag{43}$$
where $t$ denotes the current learning iteration, and $T_{\max}$ is the total number of learning iterations. When the confidence exceeds $\theta_{\text{high}}$, the system exhibits high confidence in the physical model, indicating that the model performance closely approximates the real state; hence, the optimal frequency recommended by the physical model is preferentially selected. When the confidence lies between $\theta_{\text{low}}$ and $\theta_{\text{high}}$, the system maintains a moderate level of confidence in the physical model, which still provides valuable reference information; therefore, the decision is made by combining the physical model with reinforcement learning. When the confidence does not exceed $\theta_{\text{low}}$, the system exhibits low confidence in the physical model, and the exploration strategy is forcibly activated to discover potentially high-performance frequencies.
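The three-branch decision rule can be sketched as follows. This is a minimal illustration for a single clustered state, assuming that the confidence value has already been computed, that $\eta_{\max}$ in the promising set is the best model efficiency for that state (which keeps the set non-empty), and that `lam_n = 0.5` is a hypothetical novelty weight. All function and argument names are illustrative.

```python
def thresholds(t, t_max):
    # Both thresholds relax linearly as training progresses.
    decay = 1.0 - t / t_max
    return 0.8 + 0.1 * decay, 0.6 + 0.2 * decay  # (theta_high, theta_low)

def novelty(n_visit, n_total, phi):
    # Rarely visited frequencies receive a larger bonus.
    return (1.0 - n_visit / max(1, n_total)) * phi

def select_frequency(q_row, eta_row, visits_row, phi_row, confidence,
                     t, t_max, lam_n=0.5):
    """Return the index of the chosen frequency for one clustered state."""
    theta_high, theta_low = thresholds(t, t_max)
    idx = range(len(q_row))
    if confidence > theta_high:
        # Physical strategy: trust the model efficiency.
        return max(idx, key=lambda f: eta_row[f])
    if confidence > theta_low:
        # Learned strategy: trust the Q-values.
        return max(idx, key=lambda f: q_row[f])
    # Exploration strategy, restricted to the promising candidate set.
    eta_max = max(eta_row)  # assumption: per-state maximum
    promising = [f for f in idx if eta_row[f] > 0.8 * eta_max]
    n_total = sum(visits_row)
    return max(promising,
               key=lambda f: q_row[f] + lam_n * novelty(visits_row[f], n_total, phi_row[f]))

# Hypothetical state with three candidate frequencies.
eta_row, q_row = [0.80, 0.85, 0.90], [0.5, 0.7, 0.6]
visits_row, phi_row = [0, 5, 5], [1.0, 1.0, 1.0]
choices = [select_frequency(q_row, eta_row, visits_row, phi_row, c, 0, 100)
           for c in (0.95, 0.85, 0.50)]
```

For these sample values the three confidence levels select different frequencies: the model optimum (index 2), the Q-value optimum (index 1), and the unvisited candidate (index 0).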

4.5. Adaptive Learning Rate

In the physics-guided hybrid strategy framework, the learning rate at different stages has a significant impact on the convergence and stability of the algorithm. To improve learning efficiency and convergence in dynamic environments, an adaptive learning rate mechanism is designed based on feedback regarding the consistency of the physical model.
$$\alpha(i,f) = a_{\text{base}} \times \min\left( 1, \frac{\eta(f,c_i)}{\eta_{\text{actual}}} \right) \times \frac{1}{1 + N(i,f)} \tag{44}$$
where $a_{\text{base}}$ denotes the initial reference learning rate, and $N(i,f)$ represents the number of visits to the state–action pair, which is used to prevent oscillations caused by frequent updates. When $\eta(f,c_i) < \eta_{\text{actual}}$, the actual system performance is superior to the model prediction, and the learning rate increases (reinforcement-dominant behavior) to rapidly absorb new information. Conversely, when $\eta(f,c_i) > \eta_{\text{actual}}$, the physical model prediction is overly optimistic, and the system automatically decreases the learning rate to prevent the model from deviating from the underlying physical laws.
The proposed adaptive learning rate, combined with the aforementioned hybrid decision strategy, enables rapid exploration in low-confidence regions and stable exploitation in high-confidence regions. This approach enhances both the adaptability and the physical interpretability of the overall intelligent routing system.
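The learning-rate expression can be evaluated with a few lines of code. The sketch below follows the expression exactly as printed in Equation (44), with the ratio of model efficiency to measured efficiency capped at 1 and the result divided by one plus the visit count; the function name and the sample values are hypothetical.

```python
def adaptive_learning_rate(a_base, eta_model, eta_actual, n_visits):
    """Evaluate the adaptive learning rate for one state-action pair."""
    # Ratio of model efficiency to measured efficiency, capped at 1.
    ratio = min(1.0, eta_model / eta_actual)
    # The visit count damps the rate for frequently updated pairs.
    return a_base * ratio / (1.0 + n_visits)

# Hypothetical values: model and measurement agree, first and fifth visit.
first = adaptive_learning_rate(0.5, 0.9, 0.9, 0)
fifth = adaptive_learning_rate(0.5, 0.9, 0.9, 4)
```

With matching model and measured efficiencies the rate starts at `a_base` and falls to one fifth of that value by the fifth visit.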

4.6. Routing Shielding Strategy Under Severe Misalignment Conditions

In the previously described hybrid decision strategy, the system defines high and low confidence thresholds ($\theta_{\text{high}}$ and $\theta_{\text{low}}$) to distinguish among the physical, learning, and exploration policies. On this basis, to address the degradation of coupling channels caused by severe misalignment of relay coils, a shielding threshold $\theta_{\text{block}}$ is introduced into the Q-learning framework. When the physical confidence of a relay coil falls below $\theta_{\text{block}}$ ($\theta_{\text{block}} < \theta_{\text{low}}$), the system determines that the corresponding channel has severely deteriorated and automatically performs energy shielding and frequency retuning operations. This mechanism ensures prioritized energy transmission along the main coupling path and achieves a virtual bypass effect for detuned relays. Through frequency retuning and confidence-weighted redistribution, the system autonomously reconfigures the coupling network. As illustrated in Figure 3, this physics-confidence-based routing shielding approach enables the MC-WPT system to maintain high transmission efficiency even under severe misalignment conditions. It also demonstrates the deep integration of the physical model and the reinforcement learning strategy.

5. Simulation Verification

5.1. Verification of Multi-Coupling Effects in Multi-Relay Systems

Table 1 presents the system simulation parameters. As illustrated in Figure 4, the transmission efficiency of the three-relay MC-WPT system is analyzed under different coupling conditions as a function of operating frequency. When the coils are well aligned, the system exhibits strong overall coupling. The main resonance peak appears at 119.39 kHz, achieving a maximum transmission efficiency of 95.92%. Two secondary resonance peaks are observed at 90.60 kHz and 67.58 kHz, corresponding to efficiencies of 93.83% and 89.50%, respectively. This indicates the presence of distinct multi-mode coupling characteristics. Due to the symmetric coil configuration and complete mutual inductive paths, the resonance peaks are concentrated, and the primary peak yields the highest efficiency, demonstrating optimal power transfer performance. Under slight relay coil misalignment, the coupling channels are moderately weakened. The primary resonance peak shifts from 119 kHz to approximately 90.61 kHz, with a reduction in peak efficiency to 92.96%. Although the multi-mode structure remains, several resonance peaks (e.g., at 69.49 kHz and 111.7 kHz) are noticeably attenuated, and the spacing between peaks increases. This behavior reflects an imbalance in the magnetic coupling network, resulting in uneven energy distribution. Overall, minor misalignment primarily causes a downward shift in resonant frequency and modest degradation in transmission efficiency. When the relay coil is severely misaligned and unshielded, the system exhibits pronounced deterioration in coupling strength and significant drift of resonance peaks. A new dominant resonance occurs at 84.83 kHz, with a maximum efficiency of 87.63%, while a secondary peak emerges at 134.75 kHz with 66.21% efficiency. The efficiency near 100 kHz drops drastically to only 29.7%, indicating that the coupling pathway is disrupted and the system becomes highly detuned. 
In this case, power transfer mainly occurs through asymmetric secondary coupling channels, leading to waveform distortion and resonance peak splitting. To mitigate the detuning effect caused by severe misalignment, the non-misaligned relay coils are retuned to the new optimal operating frequency of 84.83 kHz, effectively shielding the misaligned coil originally designed for 100 kHz operation. As shown in Figure 4, the main resonance peak shifts to 136.67 kHz, and the system efficiency is restored to 91.10%, corresponding to an approximately 23% improvement over the unshielded case. These results confirm that the proposed Q-learning-based routing shielding mechanism effectively enhances system robustness against coil misalignment and significantly improves power transfer efficiency under severe offset conditions. The system adaptively retunes its operating frequency to bypass inefficient relay coils, thereby achieving dynamic reconstruction of the main coupling pathway and maintaining global efficiency optimality.

5.2. Dimensionality Reduction-Based Adaptive Clustering and Q-Learning Optimization

Based on the previously established physical model, 40 sets of system simulation data were collected under different coupling conditions (see Appendix A). These data were then processed using feature dimensionality reduction, clustering analysis, and a Q-learning-based intelligent optimization scheme to learn the mapping between coupling states and the optimal tuning frequencies.
As shown in Figure 5, the sample distribution obtained from 40 coupling matrices after feature extraction and PCA-based dimensionality reduction is illustrated. The principal components (PCs) represent linear combinations of the original features that capture the greatest variance among the samples. In the multi-relay MC-WPT system, PC1 primarily reflects the overall coupling strength and energy distribution along the transmission paths. A larger PC1 value indicates higher mutual inductance across the system and stronger energy coupling channels. PC2, on the other hand, is associated with the asymmetry of relay coil misalignment or the non-uniformity of coupling paths. It reveals the influence of different offset directions on the coupling characteristics of the power transfer network. By projecting the data samples onto the PC1–PC2 plane, the spatial distribution of various coupling states can be intuitively visualized in the reduced feature space. Each point in the figure corresponds to one coupling scenario, and its color represents the cluster label. The clustering results in different colors indicate that the system can be categorized into several typical operating regions according to coupling strength and offset condition. This figure confirms that PCA effectively distinguishes the coupling patterns of the system under different relay coil misalignment states, thereby providing a physically consistent and interpretable state-space partition for the subsequent Q-learning-based control strategy.
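To make the role of PC1 concrete, the sketch below extracts the first principal component of a small feature set and projects a sample onto it. It is an illustration only: it uses power iteration on the sample covariance matrix as a compact stand-in for a full PCA routine, it recovers PC1 but not PC2, and the function names and data are hypothetical.

```python
import math

def first_principal_component(samples, iterations=100):
    """Return (mean, pc1), where pc1 is the unit direction of largest variance."""
    n, d = len(samples), len(samples[0])
    mean = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in samples]
    # Sample covariance matrix of the centered features (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeated multiplication by cov converges to the
    # eigenvector with the largest eigenvalue, i.e. the PC1 direction.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # cov maps v to zero; keep the current direction
        v = [x / norm for x in w]
    return mean, v

def pc1_score(sample, mean, pc1):
    """Project one feature vector onto PC1 after removing the mean."""
    return sum((x - m) * p for x, m, p in zip(sample, mean, pc1))

# Hypothetical features that vary along a single direction (1, 2).
samples = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
mean, pc1 = first_principal_component(samples)
```

Because the sample points lie on one line, PC1 aligns with that line and a single score per sample captures all of the variance; in the system described here, the analogous score would order the coupling states by overall coupling strength.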
As shown in Figure 6, the average transmission performance of the system under different clustered states is illustrated. Each subfigure corresponds to one clustering group and compares the average power transfer efficiency between the adaptive resonant frequency and the fixed 100 kHz driving condition. It can be observed that, in most clusters, the efficiency at the adaptive resonant frequency is significantly higher than under the fixed-frequency condition. This demonstrates the necessity of adaptive frequency optimization in complex misalignment environments. The result also serves as an intuitive reference for designing the reward function in the Q-learning optimization process.
The heatmap of the policy values obtained from the Q-learning algorithm is presented in Figure 7, showing the value distribution of different candidate frequency actions across various clustered states. It can be observed that the Q-values exhibit distinct stratification across the clustering states, indicating that the algorithm has developed differentiated decision preferences for various physical coupling modes. Overall, the gradient variation in the Q-value distribution reflects the effectiveness of the designed reward function and further validates the adaptive learning capability of the proposed method in managing multi-physical coupling scenarios.
As shown in Figure 8, the optimal action frequencies and their corresponding Q-values are illustrated for each clustered state. It can be observed that, as the cluster index varies, the optimal frequencies are dynamically distributed within the range of 80–120 kHz, demonstrating the Q-learning algorithm’s adaptive frequency-tuning capability across multiple states. Specifically, Clusters 1, 2, and 4 exhibit a tendency toward higher-frequency decisions, while Clusters 3, 6, 7, and 8 are concentrated around 100 kHz. Cluster 5 favors lower-frequency selections. This distribution trend is consistent with the patterns observed in the Q-value heatmap, indirectly validating the stability and physical consistency of the learned strategy.

5.3. Adaptive Model Correction and Validation Under Unseen Coupling States

In the previous phase, the Q-learning model was trained offline for 1000 episodes to construct a comprehensive frequency optimization policy table. To further evaluate its adaptive learning capability under unseen coupling conditions, a misalignment scenario outside the offline dataset was introduced for simulation-based testing.
The offline database includes the following coil coupling states: relay coil 1 with various degrees of axial misalignment, relay coil 2 with various degrees of axial misalignment, relay coil 3 with various degrees of axial misalignment, and simultaneous misalignment of relay coils 1 and 3. To evaluate the algorithm’s frequency selection performance under coupling states not present in the offline database, 15 example states were considered. These states included lateral misalignments of relay coils 1 and 2, lateral misalignments of relay coils 2 and 3, and simultaneous axial misalignments of relay coils 1, 2, and 3. Table 2 presents the comparison between the algorithm-predicted optimal frequencies and the actual optimal frequencies for different coil displacement distances. The data indicate that the offline database covers most typical coupling patterns. For new coupling states, the algorithm can still identify similar patterns in the database and achieve reasonably accurate predictions of the optimal frequency. Only in a few cases where direct matches were unavailable did the predicted frequency show some deviation, but the error remained within approximately 2%. Figure 9 illustrates the adaptive learning and model correction process of the Q-learning algorithm under a new coupling pattern not included in the offline database. Figure 9a shows the temporal-difference (TD) error as a function of training episodes. At the early stages of training, the TD error is relatively high, indicating a discrepancy between the initial Q-values and the actual system performance. As training progresses, the TD error gradually decreases and converges to a low level after about 120 episodes. This demonstrates that the reinforcement learning agent progressively learns the optimal frequency action for the new state through exploration. Meanwhile, the experience samples are fed back to update the mapping matrix, enabling adaptive correction of the physical model. 
Figure 9b compares the predicted transfer efficiency before and after model updating under the new coupling condition. After updating, the predicted efficiency aligns more closely with the simulated optimal efficiency, reducing the deviation from 5.9% to approximately 2.0%. These results confirm the algorithm’s generalization and self-evolution capability under previously unseen coupling conditions. The Q-learning agent can gradually identify the new optimal frequency through exploration and iteratively update the learned experience into the mapping matrix. Under new misalignment conditions, the updated physical model exhibits significantly reduced prediction errors, demonstrating its adaptive reconstruction capability. Although the experiments in this study mainly focus on coupling states included in the offline database, the proposed theoretical framework allows for extension to unmodeled states through online Q-learning exploration. When new coupling conditions occur, the TD-error-driven backward updates progressively optimize the physical mapping matrix, enabling the offline model to adaptively evolve into an online optimal model.

6. Experimental Verification

6.1. Experimental Platform and Measurement System

To verify the effectiveness of the previously developed simulation and learning models, a three-relay MC-WPT experimental platform was constructed, as shown in Figure 10.
The system consists of a transmitting coil, a high-frequency inverter, a high-frequency PWM sweeping circuit, an FPGA, three relay coils, a receiving coil, a high-frequency rectifier circuit, and a load. The main experimental parameters are summarized in Table 3.
The control system of the experimental platform is implemented on an FPGA. It acquires the system voltage and current signals through a measurement module and computes their amplitudes and phases to obtain the system coupling feature vector. This feature vector is reduced in dimensionality using PCA. The transformation matrix obtained from offline training is stored in the FPGA’s BRAM, and DSP slices perform fixed-point matrix multiplication. A pipelined structure ensures real-time dimensionality reduction, producing the PCA feature vector in each sampling cycle without relying on the CPU. This enables online Q-learning frequency tuning. The feature vector is then compared with 40 sets of offline data in a k-means clustering module to determine the optimal initial frequency for the current coupling state. Afterward, the FPGA executes the adaptive Q-learning control logic. When the transmission efficiency deviates from the offline physical model or the reduced-dimension feature cannot match any offline data, the FPGA activates online Q-learning to continuously adjust the system frequency and maintain optimal efficiency under different coupling conditions.
The proposed adaptive Q-learning algorithm is implemented on the experimental platform using an FPGA with a finite state machine architecture. Each computation unit—including state recognition, action selection, reward calculation, Q-value update, and learning rate adjustment—operates in a fixed-point parallel pipelined manner. The FPGA performs all computations using DSP slices, and the Q-table is stored in BRAM for subsequent read and write operations. A complete temporal-difference update path is constructed in hardware. The details of the control flow of the experimental platform are as follows:
(1) State Recognition Computation Unit:
The FPGA compares the real-time system features with the offline data using parallel comparators and a minimum-value selector. It identifies the nearest cluster and generates a state vector encoding, which is stored in the state register for indexing the Q-table.
(2) Action Selection Computation Unit:
The action frequency is selected from the Q-table based on the current state. An adaptive strategy determines whether to choose the action with the maximum Q-value or to explore other actions. The resulting action index is used to set the frequency of the PWM modulator.
(3) Reward Calculation Computation Unit:
Each reward component is calculated independently in a parallel pipelined manner using DSP slices. The weighted sum of all components is computed to obtain the overall reward. This value is then stored in the registers of the Q-value update computation unit for temporal-difference calculation.
(4) Q-Value Update Computation Unit:
The FPGA calculates the temporal-difference (TD) error using the current state–action pair and the next state–action pair. The resulting Q-value update is then written back to the BRAM.
(5) Frequency Output Computation Unit.
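The first two units of this control flow can be mirrored in software as follows. This is a floating-point sketch for illustration only, not the fixed-point FPGA logic: it assumes Euclidean distance for the nearest-cluster search and shows only the greedy branch of action selection. The cluster centers, Q-table, and frequency list are hypothetical.

```python
import math

def recognize_state(feature, centers):
    """Return the index of the cluster center nearest to the feature vector."""
    # Software analogue of the parallel comparators and minimum-value selector.
    return min(range(len(centers)), key=lambda k: math.dist(feature, centers[k]))

def greedy_action(q_table, state, frequencies_khz):
    """Return the candidate frequency with the largest Q-value in this state."""
    row = q_table[state]
    best = max(range(len(row)), key=lambda f: row[f])
    return frequencies_khz[best]

# Hypothetical reduced features, cluster centers, and Q-table.
centers = [(0.0, 0.0), (1.0, 1.0)]
q_table = [[0.1, 0.9], [0.7, 0.2]]
state = recognize_state((0.9, 0.8), centers)
freq = greedy_action(q_table, state, [90.0, 119.0])
```

Here the feature vector is closest to the second center, so state 1 is selected and the greedy lookup returns the first candidate frequency.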
In the experiments, a wide frequency sweep of 10–200 kHz was applied to fully observe the multi-peak resonance characteristics of the three-relay WPT system and to verify the accuracy of the proposed adaptive Q-learning control strategy in predicting the optimal frequency under different coil coupling conditions. In practical applications, however, the adaptive Q-learning strategy in the FPGA only performs a narrow-band sweep around the currently predicted optimal frequency. This approach ensures high transmission efficiency while improving the system’s response speed.

6.2. Experimental Results and Model Validation

As shown in Figure 11, the experimental efficiency–frequency curves and the model-predicted optimal frequencies under different spatial misalignment conditions of the relay coils are presented. By performing a frequency sweep within the 10–200 kHz range, typical multi-peak resonance characteristics can be clearly observed. This multi-peak structure mainly arises from the mode-splitting effect in the multi-relay coupled system. When both primary and secondary coupling paths coexist, the system exhibits relatively stable power transfer peaks at multiple frequency points. When all coils are properly aligned, the system’s mutual inductance matrix exhibits a nearly symmetric distribution, and the primary coupling path remains dominant and stable.
The experimental results show that the primary resonance peak appears around 119 kHz, corresponding to a maximum transmission efficiency of approximately 89.90%. Two secondary peaks are observed at 60 kHz and 160 kHz, reflecting the energy splitting phenomenon of the secondary coupling modes in the multi-relay system. At the designed resonant point of 100 kHz, the efficiency is relatively low (≈67.85%), indicating that the coil configuration at this moment does not correspond to the optimal coupling state at 100 kHz. The model-predicted optimal frequency is 119 kHz, which matches the experimental peak frequency almost exactly, verifying the high accuracy of the proposed model. When the relay coils experience slight misalignment, some elements of the mutual inductance matrix decrease slightly, and the overall symmetry of the system is disrupted. The main resonance peak shifts leftward to approximately 99.26 kHz, with a maximum efficiency of about 84.05%. Distinct secondary peaks still appear near 70 kHz and 140 kHz, indicating that minor misalignment primarily causes resonant frequency tuning rather than complete degradation. The efficiency at 100 kHz increases to about 84%, meaning that the current coupling condition becomes the optimal state for the 100 kHz frequency. The model-predicted frequency (100 kHz) is fully consistent with the experimental observation, further confirming its accuracy and robustness. Under severe misalignment without shielding, the primary coupling path is severely distorted, and the system transitions into an asymmetric coupling mode. The main peak shifts to around 90 kHz, where the maximum efficiency drops to approximately 70.43%. At the designed 100 kHz resonance point, the efficiency decreases drastically to 18.62%. Although multiple peaks still exist, their relative amplitudes change significantly. The model predicts an optimal frequency of 91 kHz, which differs from the experimental peak by only 1 kHz. 
This demonstrates that the Q-learning model can accurately adapt to highly nonlinear coupling conditions. When the frequency retuning and shielding strategy is applied to suppress the coupling of misaligned relays, a portion of the parasitic interference is effectively mitigated, allowing the system to re-establish a stable dominant transmission path. The experimental results indicate that the main peak shifts to the range of 103–113 kHz, with the maximum efficiency recovered to approximately 86.2%, while the efficiency at 100 kHz increases to 75.04%. These findings confirm that relay shielding effectively alleviates multipath interference and enhances the concentration of the frequency response. The model-predicted optimal frequency (110 kHz) lies within the experimental main peak region, demonstrating the excellent adaptability of the proposed learning model. The detailed experimental data are presented in Table 4.
Table 5 compares the efficiency performance and misalignment tolerance of the proposed multi-relay wireless power transfer (WPT) system with those reported in references [19,20,21,22] under various lateral coil misalignment conditions. The proposed system employs a three-relay structure, achieving high-efficiency energy transfer through an optimized magnetic path layout and a shielding-based routing strategy. Under small misalignment conditions (40 mm), the system’s original efficiency is 89.90%, and the efficiency after misalignment is 87.50%, corresponding to a relative decrease of only 2.67%, demonstrating strong misalignment resilience. In comparison, ref. [19] utilized a solenoid–dual combination planar magnetic coupler for autonomous underwater vehicles, resulting in a 6.05% efficiency drop under the same misalignment; ref. [20], based on a single-receiver coil with multi-layer Litz wire, exhibited a 23.89% efficiency drop under a 30 mm misalignment. These results indicate that single- or dual-coil systems still suffer significant efficiency loss under small misalignment, whereas multi-relay systems inherently maintain efficiency by distributing energy transfer paths and reducing the impact of individual coil load variations. Under large misalignment conditions (approximately 50% or more), the proposed system’s efficiency drops to 70.43%. However, with the shielding-based routing optimization, the efficiency is restored to 86.20%, corresponding to a relative decrease of only 4.12% from the original 89.90%. This is substantially lower than the efficiency reduction observed in the stacked-coil system of [21] (87 mm misalignment, 17.44% drop) and the rotating-mechanism system of [22] (85 mm misalignment, 25.17% drop). These results indicate not only that multi-relay systems maintain high efficiency under severe misalignment but also that shielding optimization and multiple energy path design effectively suppress coupling interference between non-adjacent coils, significantly enhancing system robustness.
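The percentage decreases quoted for the proposed system are consistent with a relative drop measured against the aligned efficiency of 89.90%, as the following worked calculation shows. The symbol $\Delta_{\text{rel}}$ is introduced here for illustration only.

$$\Delta_{\text{rel}} = \frac{\eta_{\text{aligned}} - \eta_{\text{misaligned}}}{\eta_{\text{aligned}}} \times 100\%$$

$$\text{Small misalignment (40 mm):}\quad \frac{89.90 - 87.50}{89.90} \times 100\% = \frac{2.40}{89.90} \times 100\% \approx 2.67\%$$

$$\text{Severe misalignment, shielded:}\quad \frac{89.90 - 86.20}{89.90} \times 100\% = \frac{3.70}{89.90} \times 100\% \approx 4.12\%$$

$$\text{Severe misalignment, unshielded:}\quad \frac{89.90 - 70.43}{89.90} \times 100\% = \frac{19.47}{89.90} \times 100\% \approx 21.66\%$$

On this measure, shielding reduces the relative efficiency loss under severe misalignment from about 21.66% to about 4.12%.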
In summary, the proposed multi-relay wireless power transfer system demonstrates excellent misalignment tolerance and high energy recovery under various coil displacement conditions. Compared with single- or dual-coil systems or systems with a small number of relays, it exhibits greater stability and practical potential in complex multi-coupling environments.
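The efficiency reductions quoted above can be reproduced with a short calculation. The sketch below assumes that "efficiency reduction" is the relative drop with respect to the baseline efficiency; this definition is inferred from the Table 5 figures rather than stated explicitly, and the function and dictionary names are illustrative only.

```python
def relative_drop(baseline: float, final: float) -> float:
    """Relative efficiency reduction in percent, measured against the baseline."""
    return (baseline - final) / baseline * 100.0

# (baseline efficiency %, final efficiency %) taken from Table 5.
# For the 90 mm case of this work, "final" is the efficiency after shielding.
rows = {
    "this work, 40 mm": (89.90, 87.50),
    "this work, 90 mm, after shielding": (89.90, 86.20),
    "[19], 40 mm": (91.64, 86.10),
    "[20], 30 mm": (87.50, 66.60),
    "[21], 87 mm": (96.90, 80.00),
    "[22], 85 mm": (75.50, 56.50),
}

drops = {name: round(relative_drop(b, f), 2) for name, (b, f) in rows.items()}

for name, drop in drops.items():
    print(f"{name}: {drop:.2f} %")
```

Under this assumed definition, the computed values agree with the last column of Table 5.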
Figure 12 presents a comparison between the experimental results and model predictions under different relay coil misalignment conditions, focusing on transmission efficiency differences and trends in optimal operating frequency. As shown in Figure 12a, the model-predicted and experimentally measured efficiencies exhibit good overall consistency, with relative errors below 2% across all misalignment conditions (Table 4). This confirms the validity of the proposed model. Under aligned and slightly misaligned conditions, the experimental optimal efficiencies both exceed 84%, and the model-predicted values closely coincide with the measured results, demonstrating high fitting accuracy in the strong-coupling region. In the case of severe misalignment without shielding, the overall transmission efficiency decreases significantly, with experimental and model results of approximately 70% and 69%, respectively; nevertheless, their trends remain consistent, indicating that the model accurately captures the coupling degradation caused by coil displacement. After the shielding strategy is applied, the efficiency recovers to above 85%, further validating the model’s ability to represent the compensating effect of shielding. Overall, Figure 12a demonstrates the strong consistency and robustness of the proposed model across various coupling environments. Figure 12b illustrates the variation in the optimal operating frequency under different misalignment conditions. As the misalignment severity increases, the optimal frequency gradually shifts from 119 kHz to 90 kHz, while under the shielding condition it rises again to approximately 110 kHz. In summary, Figure 12 verifies the accuracy and stability of the proposed clustering–reinforcement-learning-based adaptive frequency control model under multiple misalignment conditions.
The model not only effectively captures the variation trend of transmission efficiency but also accurately tracks the optimal operating frequency in complex coupling structures, providing both theoretical and experimental support for adaptive control in multi-relay MC-WPT systems.
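The agreement between model and experiment can be checked directly from the optimal-efficiency columns of Table 4. The following is a minimal sketch, assuming the relative error is normalized by the measured value (the normalization is not stated in the text); all variable names are illustrative.

```python
# (experimental optimal efficiency %, model-predicted optimal efficiency %), Table 4.
table4 = {
    "aligned": (89.90, 89.00),
    "slight misalignment": (84.05, 84.00),
    "severe misalignment, no shielding": (70.43, 69.12),
    "severe misalignment, shielding": (86.20, 85.80),
}

def relative_error(measured: float, predicted: float) -> float:
    """Absolute deviation as a percentage of the measured value."""
    return abs(measured - predicted) / measured * 100.0

errors = {name: relative_error(m, p) for name, (m, p) in table4.items()}
worst = max(errors, key=errors.get)

for name, err in errors.items():
    print(f"{name}: {err:.2f} %")
print("largest deviation:", worst)
```

With this normalization, the largest deviation occurs in the severe-misalignment case without shielding, which is also where the efficiency curve is most distorted.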

7. Conclusions

This paper proposed a Q-learning-based intelligent decision-making method with physical-model consistency for adaptive frequency selection and energy routing in multi-relay MC-WPT systems. Physical constraints and adaptive learning mechanisms were integrated into the reinforcement learning framework to enhance system performance, stability, and consistency. A physical model was developed to describe the relationships among the coupling state, resonant frequency, and transmission efficiency. PCA and K-means were employed to adaptively cluster the coupling features. A physics-informed correction term and a confidence-guided decision strategy were designed to enable flexible switching between model-driven and learning-based modes. To address severe coil misalignment, a frequency-retuning and virtual-shielding mechanism was introduced, which significantly improved transmission efficiency. The proposed model demonstrated strong adaptability and generalization capability under complex operating conditions. Overall, the proposed method provides valuable insights for the intelligent optimization of multi-relay MC-WPT systems.
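To make the frequency-selection idea concrete, the following toy sketch applies a tabular Q-learning update with epsilon-greedy exploration to a frequency choice per coupling state. It is not the authors' implementation: the states are the four conditions of Table 4 rather than PCA/K-means clusters, the reward is a hypothetical Lorentzian-shaped efficiency curve centred on the measured peaks, and the candidate grid, bandwidth, learning rate, and exploration rate are all assumed values.

```python
import random

# Peak (frequency in kHz, efficiency in %) per condition, from Table 4.
# The shielded case reports a 103-113 kHz range; 108 kHz is its midpoint (assumption).
PEAKS = {
    "aligned": (119.66, 89.90),
    "slight": (99.26, 84.05),
    "severe": (90.34, 70.43),
    "severe_shielded": (108.0, 86.20),
}
ACTIONS = list(range(85, 130, 5))  # candidate frequencies in kHz (hypothetical grid)
BANDWIDTH_KHZ = 10.0               # hypothetical half-width of the efficiency peak

def efficiency(state: str, freq_khz: float) -> float:
    """Hypothetical reward: Lorentzian-shaped efficiency around the measured peak."""
    f_peak, eta_peak = PEAKS[state]
    return eta_peak / (1.0 + ((freq_khz - f_peak) / BANDWIDTH_KHZ) ** 2)

def train(episodes=8000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    states = list(PEAKS)
    q = {s: {a: 0.0 for a in ACTIONS} for s in states}
    for episode in range(episodes):
        s = states[episode % len(states)]      # visit the states in turn
        if rng.random() < epsilon:
            a = rng.choice(ACTIONS)            # explore
        else:
            a = max(q[s], key=q[s].get)        # exploit the current estimate
        r = efficiency(s, a)
        # Each episode ends after one frequency choice, so the bootstrap
        # term gamma * max_a' Q(s', a') is zero for the terminal next state.
        target = r + gamma * 0.0
        q[s][a] += alpha * (target - q[s][a])
    return q

q_table = train()
policy = {s: max(q_table[s], key=q_table[s].get) for s in q_table}
print(policy)
```

In this toy setting the greedy policy settles on the grid frequency closest to each assumed efficiency peak. A real controller would replace `efficiency` with measured feedback and the condition labels with the recognized cluster state.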

Author Contributions

Conceptualization, X.Q. and M.S.; methodology, X.Q.; software, X.Q.; validation, X.Q.; formal analysis, Z.Y.; investigation, X.Q.; resources, Z.C.; data curation, Z.Z.; writing—original draft preparation, X.Q.; writing—review and editing, X.Q. and M.S.; visualization, X.Q.; supervision, T.Y. and M.S.; project administration, Z.Z.; funding acquisition, X.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under the grants 51977015 and 52307204; in part by the Natural Science Foundation of Chongqing under the grant 2023NSCQ-MSX3764; and in part by the Youth Project of the Science and Technology Research Program of Chongqing Education Commission of China under the grant KJQN20230152.

Data Availability Statement

The data supporting the findings of this study are available upon reasonable request to cathy-ying.liu@connect.polyu.hk.

Conflicts of Interest

The authors Xiaodong Qing, Zhao Chen, Tingfa Yang and Zhigang Zhang have received research grants from companies Chongqing Research Institute, China Coal Technology and Engineering Group and Chongqing Anbiao Testing and Research Institute Co. Ltd., where the latter is a subsidiary company of the former.

Appendix A

Table A1. Offline Dataset.
| Coupling Offset | Optimal Frequency/kHz | Optimal Efficiency/% | Efficiency at 100 kHz/% |
|---|---|---|---|
| 0.038137 | 119.39 | 95.92 | 89.86 |
| 0.076274 | 92.53 | 93.63 | 91.62 |
| 0.114411 | 90.61 | 94.02 | 89.22 |
| 0.152548 | 90.61 | 94.43 | 88.29 |
| 0.195290 | 92.52 | 94.83 | 84.40 |
| 0.238504 | 91.39 | 92.26 | 75.45 |
| 0.281973 | 92.99 | 85.94 | 57.88 |
| 0.325594 | 92.99 | 84.50 | 29.91 |
| 0.369314 | 94.59 | 78.45 | 43.63 |
|  | 94.44 | 89.23 | 66.50 |
| 0.040200 | 86.77 | 93.16 | 84.63 |
| 0.080400 | 86.77 | 92.78 | 84.84 |
| 0.120600 | 86.77 | 92.70 | 84.95 |
| 0.160800 | 86.77 | 92.99 | 85.07 |
| 0.209657 | 86.77 | 93.05 | 84.49 |
| 0.259281 | 88.48 | 91.33 | 84.38 |
| 0.309304 | 121.33 | 92.45 | 85.70 |
| 0.359559 | 112.31 | 82.31 | 71.11 |
| 0.409960 | 121.34 | 88.91 | 45.40 |
|  | 121.51 | 92 | 84.83 |
| 0.038137 | 90.68 | 94.06 | 91.15 |
| 0.076274 | 90.68 | 93.88 | 91.06 |
| 0.114411 | 95.62 | 93.30 | 91.95 |
| 0.152548 | 97.26 | 93.14 | 92.48 |
| 0.195290 | 98.91 | 91.45 | 90.94 |
| 0.238504 | 85.74 | 87.17 | 23.22 |
| 0.281973 | 85.74 | 82.35 | 31.62 |
| 0.325594 | 115.37 | 74.30 | 38.05 |
| 0.369314 | 115.37 | 73.27 | 42.10 |
|  | 115.37 | 83.32 | 42.23 |
| 0.050849 | 109.09 | 94.88 | 67.04 |
| 0.101699 | 120.10 | 93.99 | 57.94 |
| 0.152548 | 120.10 | 92.19 | 50.82 |
| 0.203397 | 117.90 | 84.70 | 45.71 |
| 0.254246 | 89.27 | 89.97 | 43.05 |
| 0.305096 | 78.26 | 85.90 | 29.28 |
| 0.355945 | 117.90 | 85.84 | 64.17 |
| 0.406794 | 113.50 | 89.87 | 83.96 |
| 0.457644 | 102.49 | 92.63 | 89.90 |
|  | 102.49 | 96.92 | 96.38 |
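As an illustration of how such an offline dataset could serve as a frequency map, the sketch below performs a nearest-neighbour lookup on the coupling offset using the first nine rows of Table A1 as listed. The nearest-neighbour rule and the function name `lookup_frequency` are assumptions made for this example; the paper itself relies on clustering and Q-learning for state recognition.

```python
from bisect import bisect_left

# First nine rows of Table A1: (coupling offset, optimal frequency in kHz).
OFFLINE = [
    (0.038137, 119.39),
    (0.076274, 92.53),
    (0.114411, 90.61),
    (0.152548, 90.61),
    (0.195290, 92.52),
    (0.238504, 91.39),
    (0.281973, 92.99),
    (0.325594, 92.99),
    (0.369314, 94.59),
]
OFFSETS = [offset for offset, _ in OFFLINE]  # already sorted in ascending order

def lookup_frequency(offset: float) -> float:
    """Return the stored optimal frequency of the nearest tabulated offset."""
    i = bisect_left(OFFSETS, offset)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = OFFLINE[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda row: abs(row[0] - offset))
    return nearest[1]

print(lookup_frequency(0.08))
```

Queries outside the tabulated range fall back to the first or last entry, which is a deliberate simplification of this sketch.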

References

  1. Huan, T.; Zhong, W. Design of Three-Coil WPT System with Wide Range Misalignment Tolerance Based on Generalized Impedance Matrix. In Proceedings of the 2025 Zhejiang Power Electronics Conference (ZPEC), Hangzhou, China, 22–24 August 2025; pp. 66–70.
  2. Liu, S.; Yan, X.; Xu, G.; Wang, G.; Liu, Y. An Eight-Coil Wireless Power Transfer Method for Improving the Coupling Tolerance Based on Uniform Magnetic Field. Processes 2024, 12, 2109.
  3. Pahlavan, S.; Shooshtari, M.; Ashtiani, S.J. Star-Shaped Coils in the Transmitter Array for Receiver Rotation Tolerance in Free-Moving Wireless Power Transfer Applications. Energies 2022, 15, 8643.
  4. Luo, Y.; Dai, Z.; Yang, Y. A Single-Transmitter Multi-Receiver Wireless Power Transfer System with High Coil Misalignment Tolerance and Variable Power Allocation Ratios. Electronics 2024, 13, 3838.
  5. Wang, W.; Deng, J.; Chen, D. An LCC-S compensated wireless power transfer system using receiver-side switched-controlled capacitor combined semi-active rectifier for constant voltage charging with misalignment tolerance. IET Power Electron. 2023, 16, 1103–1114.
  6. Luo, Z.; Zhao, Y.; Xiong, M.; Wei, X.; Dai, H. A Self-Tuning LCC/LCC System Based on Switch-Controlled Capacitors for Constant-Power Wireless Electric Vehicle Charging. IEEE Trans. Ind. Electron. 2022, 70, 709–720.
  7. Zhang, B.; Cao, Y.; Hou, Y.; Hou, S.; Guo, Y.; Tian, J.; He, X. Analysis and Design Considerations for Transmitter-Compensated Inductance Mistuning in a WPT System with LCC-S Topology. World Electr. Veh. J. 2024, 15, 45.
  8. Wang, F.; Yang, Q.; Zhang, X.; Chen, T.; Li, G. Enhancing Misalignment Tolerance in Hybrid Wireless Power Transfer System with Integrated Coupler via Frequency Tuning. IEEE Trans. Power Electron. 2024, 39, 11885–11899.
  9. Liu, C.; Han, W.; Yan, G.; Zhang, B.; Li, C. Receiver Resonant Frequency Adaptive-Tracking in Wireless Power Transfer Systems Using Primary Variable Capacitors. IEEE J. Emerg. Sel. Top. Ind. Electron. 2025, 6, 1567–1576.
  10. SAE J2954; Wireless Power Transfer for Light-Duty Plug-in/Electric Vehicles and Alignment Methodology. SAE International: Warrendale, PA, USA, 2024.
  11. Li, W.; Mei, W.; Yuan, Q.; Song, Y.; Dongye, Z.; Diao, L. Detuned Resonant Capacitors Selection for Improved Misalignment Tolerance of LCC-S Compensated Wireless Power Transfer System. IEEE Access 2022, 10, 49474–49484.
  12. Ouchi, Y.; Matsumoto, R.; Imura, T.; Hori, Y. Power Compensation Method for Coil Parameters Variation in LCC-S Wireless Power Transfer. In Proceedings of the IECON 2023—49th Annual Conference of the IEEE Industrial Electronics Society (IES), Singapore, 16–19 October 2023; pp. 1–6.
  13. Kim, D.-H.; Kim, M.-S.; Kim, H.-J. Frequency-Tracking Algorithm Based on SOGI-FLL for Wireless Power Transfer System to Operate ZPA Region. Electronics 2020, 9, 1303.
  14. Zhang, X.; Chu, Z.; Geng, Y.; Pan, X.; Han, R.; Xue, M. FPGA-Based Frequency Tracking Strategy with High-Accuracy for Wireless Power Transmission Systems. Appl. Sci. 2023, 13, 2316.
  15. Zhong, W.; Zhang, C.; Liu, X.; Hui, S.Y.R. A Methodology for Making a Three-Coil Wireless Power Transfer System More Energy Efficient Than a Two-Coil Counterpart for Extended Transfer Distance. IEEE Trans. Power Electron. 2015, 30, 933–942.
  16. Feng, Y.; Sun, Y.; Lin, T.; Hu, H.; Chen, F. Mutual inductance surrogate model of the UWPT system and its constant power optimization at misaligned positions. Wirel. Power Transf. 2024, 11, e001.
  17. Li, Z.; Li, S.; Deng, H.; Zhang, Y.; Hu, W. A wireless power transfer based on P-LCC-S compensated topology for artificial catheterization. Wirel. Power Transf. 2024, 11, e005.
  18. Yang, J.; Zhang, X.; Wang, Y.; Shang, S.; Wang, K.; Yang, Y. A frequency-adjustable PCB shielding coil in wireless power transfer system. Wirel. Power Transf. 2024, 11, e006.
  19. Wen, H.; Wang, P.; Li, J.; Yang, J.; Zhang, K.; Yang, L.; Zhao, Y.; Tong, X. Improving the Misalignment Tolerance of Wireless Power Transfer System for AUV with Solenoid-Dual Combined Planar Magnetic Coupler. J. Mar. Sci. Eng. 2023, 11, 1571.
  20. Chang, H.; Lim, T.; Lee, Y. Compensation for Lateral Misalignment in Litz Wire Based on Multilayer Coil Technology. Sensors 2021, 21, 2295.
  21. Lim, T.; Lee, Y. Stacked-Coil Technology for Compensation of Lateral Misalignment in Nonradiative Wireless Power Transfer Systems. IEEE Trans. Ind. Electron. 2020, 68, 12771–12780.
  22. Ding, S.; Niu, W.; Gu, W. Lateral Misalignment Tolerant Wireless Power Transfer with a Tumbler Mechanism. IEEE Access 2019, 7, 125091–125100.
Figure 1. Circuit Schematic.
Figure 2. Physical model construction and frequency–feature mapping matrix.
Figure 3. Shielding Strategy.
Figure 4. Transmission efficiency curves of the three-relay MC-WPT system under different coupling and shielding conditions.
Figure 5. PCA Clustering Distribution.
Figure 6. Bar Chart of Average Efficiency for Each Cluster.
Figure 7. Heatmap of Q-Learning Strategy Values.
Figure 8. Q-L Action Frequency Distribution across Cluster States.
Figure 9. Adaptive Learning and Model Correction.
Figure 10. Experimental System Platform.
Figure 11. Experimental Efficiency Curve Fitting and Model Prediction Verification.
Figure 12. Experimental and Model Prediction Comparison under Different Coil Misalignments.
Table 1. Simulation Parameters of the Three-Relay MC-WPT System.
| Main Parameters | Parameter Values |
|---|---|
| System Design Frequency/kHz | 100 |
| Frequency Sweeping Range/kHz | 10–200 |
| Coil Internal Resistance $R_1, R_2, R_3, R_4, R_5$/mΩ | 120 |
| Capacitance $C_1, C_2, C_3, C_4, C_5, C_6, C_7, C_8$/nF | 126 |
| Input Voltage/V | 30 |
Table 2. Predictions for Unseen Misalignment Scenarios.
| Misalignment Scenarios | Relay Coil 1 (cm) | Relay Coil 2 (cm) | Relay Coil 3 (cm) | Predicted Optimal Frequency (kHz) | Actual Optimal Frequency (kHz) | Predicted Frequency Efficiency (%) | Actual Frequency Efficiency (%) |
|---|---|---|---|---|---|---|---|
| Relay Coils 1 and 2 | 4.5 | 4.5 | 0 | 115.37 | 115.50 | 91 | 91 |
| Relay Coils 1 and 2 | 4.5 | 6.1 | 0 | 128.00 | 131.84 | 86 | 88 |
| Relay Coils 1 and 2 | 6.1 | 4.5 | 0 | 90.68 | 90.70 | 86 | 86 |
| Relay Coils 1 and 2 | 4.5 | 7.3 | 0 | 90.68 | 89.62 | 81 | 81 |
| Relay Coils 1 and 2 | 4.5 | Shielded | 0 | 120.10 | 120.23 | 87 | 87 |
| Relay Coils 1 and 2 | 7.3 | 4.5 | 0 | 94.59 | 94.17 | 78 | 78 |
| Relay Coils 1 and 2 | Shielded | 4.5 | 0 | 119.39 | 119.00 | 86 | 86 |
| Relay Coils 2 and 3 | 0 | 4.5 | 4.5 | 89.27 | 89.13 | 87 | 87 |
| Relay Coils 2 and 3 | 0 | 4.5 | 6.1 | 89.27 | 89.55 | 87 | 87 |
| Relay Coils 2 and 3 | 0 | 6.1 | 4.5 | 92.99 | 92.37 | 88 | 88 |
| Relay Coils 2 and 3 | 0 | 4.5 | 7.3 | 86.77 | 86.02 | 78 | 79 |
| Relay Coils 2 and 3 | 0 | 4.5 | Shielded | 97.26 | 97.42 | 88 | 88 |
| Relay Coils 2 and 3 | 0 | 7.3 | 4.5 | 95.62 | 95.50 | 80 | 80 |
| Relay Coils 2 and 3 | 0 | Shielded | 4.5 | 117.90 | 117.00 | 86 | 86 |
| Relay Coils 1, 2 and 3 | 2.6 | 5.2 | 7.8 | 95.62 | 95.20 | 86 | 86 |
Table 3. Experimental System Parameters.
| Main Parameters | Parameter Values |
|---|---|
| System Design Frequency/kHz | 100 |
| Frequency Sweeping Range/kHz | 10–200 |
| Transmitting Coil $L_1$/μH | 20.7820 |
| Transmitting Coil $\mathrm{ESR}_1$/Ω | 0.3125 |
| Relay Coil $L_2$/μH | 20.6830 |
| Relay Coil $\mathrm{ESR}_2$/Ω | 0.3369 |
| Relay Coil $L_3$/μH | 20.5890 |
| Relay Coil $\mathrm{ESR}_3$/Ω | 0.3275 |
| Relay Coil $L_4$/μH | 20.6650 |
| Relay Coil $\mathrm{ESR}_4$/Ω | 0.3377 |
| Relay Coil $L_5$/μH | 20.7890 |
| Relay Coil $\mathrm{ESR}_5$/Ω | 0.3347 |
| Relay Coil $L_6$/μH | 20.548 |
| Relay Coil $\mathrm{ESR}_6$/Ω | 0.3274 |
| Input Voltage/V | 30 |
| Resonant Compensation Capacitor $C$/nF | 126.8 |
Table 4. Model Prediction and Experimental Verification Data.
| Misalignment Condition | Experimental Optimal Frequency/kHz | Model-Predicted Optimal Frequency/kHz | Experimental Optimal Efficiency/% | Model-Predicted Optimal Efficiency/% | Efficiency at 100 kHz/% |
|---|---|---|---|---|---|
| Aligned | 119.66 | 119 | 89.90 | 89.00 | 67.85 |
| Slight Misalignment | 99.26 | 100 | 84.05 | 84.00 | 84.00 |
| Severe Misalignment (Without Shielding) | 90.34 | 91 | 70.43 | 69.12 | 18.62 |
| Severe Misalignment (Shielding) | 103–113 | 110 | 86.2 | 85.80 | 72.04 |
Table 5. Comparison Data.
| Reference | Coil Displacement (mm) | Baseline Efficiency (%) | Post-Misalignment Efficiency (%) | Efficiency After Shielding-Based Routing (%) | Efficiency Reduction (%) |
|---|---|---|---|---|---|
| This Work | 40 | 89.90 | 87.50 | - | 2.67 |
| This Work | 90 | 89.90 | 70.43 | 86.20 | 4.12 |
| [19] | 40 | 91.64 | 86.10 | - | 6.05 |
| [20] | 30 | 87.50 | 66.60 | - | 23.89 |
| [21] | 87 | 96.90 | 80.00 | - | 17.44 |
| [22] | 85 | 75.50 | 56.50 | - | 25.17 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
