Article

QL-HIT2F: A Q-Learning-Aided Adaptive Fuzzy Path Planning Algorithm with Enhanced Obstacle Avoidance

1 School of Computer Science, Shandong Xiehe University, Jinan 250109, China
2 College of Electrical Engineering, Sichuan University, Chengdu 610065, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(1), 200; https://doi.org/10.3390/s26010200
Submission received: 16 November 2025 / Revised: 17 December 2025 / Accepted: 23 December 2025 / Published: 27 December 2025

Abstract

There has been significant interest in solving robot path planning problems using fuzzy logic-based methods. Recently, the Genetic Algorithm-based Hierarchical Interval Type-2 Fuzzy (GA-HIT2F) system has been introduced as a novel planner in this domain. However, this method lacks adaptability to changes in target points, and insufficient flexibility can lead to planning failures in local minimum traps, making it difficult to apply to complex scenarios. In this paper, we identify the limitations of the original GA-HIT2F approach and propose an enhanced Q-Learning-aided Adaptive Hierarchical Interval Type-2 Fuzzy (QL-HIT2F) algorithm for path planning. The proposed planner incorporates reinforcement learning to improve a robot’s capability to avoid collisions with special obstacles. Additionally, the average obstacle orientation (AOO) is introduced to optimize the robot’s angular adjustments. Two supplementary robot parameters are integrated into the reinforcement learning action space, along with fuzzy membership parameters. The training process also introduces the concepts of meta-map and sub-training. Simulation results from a series of path planning experiments validate the feasibility and effectiveness of the proposed QL-HIT2F approach.

1. Introduction

Smart mobile robots (SMRs) are used across a wide range of industries to perform complex, tedious, and dangerous tasks in diverse engineering environments, improving work quality and efficiency; examples include precise and flexible handling of goods in manufacturing and logistics, as well as exploration, detection, and special tasks in high-risk environments [1,2,3,4,5]. Endowing robotic systems with superior autonomous navigation is crucial to mobile robot intelligence, and safe, reliable path planning is the foundation of any navigation technology. Path planning seeks a path that enables the robot to reach the target while avoiding the obstacles along the way. A large number of path planning algorithms have been proposed, including the visibility graph [6] and cell decomposition [7] methods based on geometric principles; the artificial potential field (APF) method [8], which assumes fabricated target-attraction and obstacle-repulsion fields; the A* [9] and D* [10] algorithms based on discrete graph search; numerous heuristic methods based on nature-inspired optimization algorithms; and sampling-based methods such as the probabilistic road map (PRM) and rapidly exploring random trees (RRT) families [11,12]. However, these methods require complete global information, and robot path planning also involves uncertainties, such as errors from sensors and actuators. For such local path planning (LPP) situations, fuzzy logic-based path planning methods [13] are effective at handling uncertainty.
The advantages of fuzzy logic stem from the strong ability of fuzzy systems [14,15,16] to handle uncertainty and to universally approximate nonlinear functional relations, which has led to broad application of fuzzy logic in robot path planning [17,18,19,20,21,22]. When applied to path planning, fuzzy logic generally models the robot's obstacle avoidance behavior and encodes that model as a suitable set of fuzzy rules within a fuzzy system; the robot then plans its navigation toward the target according to those rules. The fuzzy system takes the information it obtains (obstacle distances, etc.) as inputs and, through fuzzy reasoning, outputs the variables needed for path planning, such as the steering angle and step length. Additionally, fuzzy logic is commonly developed as a local path planning (LPP) method and is often employed as a complementary planning stage after global planning. For example, in [23], a set of guidance points derived from D* Lite through a proposed forward search optimization technique was used as sub-goals within an improved fuzzy inference system (FIS), which served as the local planner in their hybrid framework.
In [22], a hierarchical interval type-2 fuzzy system (HIT2F) was proposed for local path planning. This approach leverages the capability of interval type-2 fuzzy inference to handle uncertainties in unknown environments, while the hierarchical structure mitigates rule explosion and simplifies system complexity. The system takes as inputs the distance between the detected obstacle and the smart mobile robot (SMR), as well as the angle formed by the robot, the obstacle, and the target point, and outputs the linear and angular velocities that guide the robot's motion in local path planning. In addition, in designing each membership function, that study applied the meta-heuristic genetic algorithm (GA) to find near-optimal membership-function parameters, yielding a parameter-optimized version of the HIT2F algorithm (called GA-HIT2F) with better planning performance. However, this planner requires re-optimization of its fixed parameters whenever the map or target changes. Furthermore, its path planning may fail in particular wall-like environments that create local minimum traps.
Considering the above defects, the motivation of this research is to make up for the weaknesses of GA-HIT2F, and the problems we focus on are as follows: (1) how to self-adjust the membership-function parameters for an unknown environment; (2) how to escape local minima in LPP caused by specific (irregular) obstacles; (3) which crucial variables should determine the state and action spaces of the mobile robot. Therefore, building upon the GA-HIT2F path planner and inspired by the work in [24,25,26,27,28,29], we introduce a reinforcement learning mechanism to enable dynamic, adaptive parameter adjustment within the HIT2F framework. Additionally, two new parameters are incorporated: an AOO switch and the detected distance. Both are designed to enhance obstacle avoidance learning and performance. This paper first explains the defects in the parameters of the original method and its potential planning failures in certain wall-shaped maps, and then argues for the necessity of dynamically adjusting the membership parameters; furthermore, we propose and implement a new planning algorithm, QL-HIT2F. The improved algorithm introduces the concept of average obstacle orientation and uses Q-Learning, a reinforcement learning method, for dynamic adaptive adjustment of the membership parameters in the HIT2F system. The meta-map is also proposed to train the SMR, and we verify that the fully sub-trained SMR can plan satisfactory obstacle-avoidance paths in generalized map environments.

2. The Original GA-HIT2F and Its Performance Analysis

2.1. Basic GA-HIT2F Path Planning Algorithm

As described in [22], the HIT2F approach achieves excellent results; however, some problems remain. Based on an analysis of the work in [22], this paper renames some variables and functions according to the steps and principles of the HIT2F algorithm for simpler representation; a detailed pseudo-code version of GA-HIT2F is shown below (Algorithm 1):
Algorithm 1 The original GA-HIT2F
1: Input necessary Map and Robot information
2: [Rx, Ry, θR] ← InitRobotState();
3: Path ← ∅;
4: Path.append([Rx, Ry]);
5: while disRG > disthr
6:   [disROmin, θGRO] ← RangeSensor(RobotState);
7:   fin1 ← disROmin/(disrange/5);
8:   fin2 ← θGRO;
9:   [fout1, fout2] ← GAHIT2Fsystem(fin1, fin2);
10:   step ← m·fout1;
11:   Δθ ← fout2;
12:   [Rx, Ry, θR] ← RobotState.update([step, Δθ]);
13:   disRG.update([Rx, Ry]);
14:   Path.append([Rx, Ry]);
15: end while
16: return Path;
According to the algorithm flow, the information necessary for planning is entered before the loop begins, including map data, the start and target state points, and information about the robot and its sensors. InitRobotState initializes the robot state and returns the initial pose of the SMR, where Rx and Ry are the horizontal and vertical coordinates of the robot and θR denotes its orientation (pose). While the distance to the target exceeds a preset threshold, the robot continuously acquires the distribution of surrounding obstacles via its sensors. The function RangeSensor represents the sensor feedback and returns the minimum distance disROmin between the robot and nearby obstacles detected by all sensors, as well as the angle θGRO formed between the vector from the robot's centroid to the nearest obstacle and the vector toward the target point. The robot is equipped with nine sensors oriented at angles of −π/3, −π/4, −π/6, −π/12, 0, π/12, π/6, π/4, and π/3 relative to the robot's heading direction θR. The principle of sensor-based data acquisition is illustrated in Figure 1. Before recording sensor information, the robot's current position is recorded to initialize the Path.
The sensor data is fed into a hierarchical interval type-2 fuzzy logic system (IT2 FLS) with two inputs, fin1 and fin2. Here, fin1 is the normalized value of the minimum obstacle distance disROmin, scaled to match the expected input range, and fin2 corresponds directly to the angle θGRO. The system has a two-layer structure: the first layer outputs a steering adjustment Δθ, which is then combined with disROmin as input to the second layer for further fuzzy inference. The second layer finally produces the step size for the SMR. The overall input–output architecture of this GAHIT2Fsystem is illustrated in Figure 2. Once the outputs are obtained from the type-2 FIS, the step size is scaled by a map-dependent multiplier and then used together with Δθ to actuate the robot for the current motion step. The robot's state is updated to record its new pose, which is appended to the planned path. Simultaneously, the target distance disRG is recalculated. The path-planning loop terminates when the SMR enters a threshold region around the target, at which point the final path is output.
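As a concrete illustration, the planning loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: `range_sensor` and `fuzzy_system` are hypothetical stand-ins for RangeSensor and GAHIT2Fsystem, and the state update assumes simple unicycle kinematics.

```python
import math

def plan_path(robot_state, goal, range_sensor, fuzzy_system,
              dis_thr=0.5, dis_range=2.0, step_scale=1.0, max_steps=500):
    """Sketch of the Algorithm 1 loop; callbacks stand in for the
    sensor feedback and the hierarchical type-2 fuzzy system."""
    x, y, theta = robot_state                      # InitRobotState()
    path = [(x, y)]
    for _ in range(max_steps):
        if math.hypot(goal[0] - x, goal[1] - y) <= dis_thr:
            break                                  # reached the target region
        dis_ro_min, theta_gro = range_sensor((x, y, theta))
        f_in1 = dis_ro_min / (dis_range / 5)       # normalized obstacle distance
        f_in2 = theta_gro                          # robot-obstacle-goal angle
        f_out1, f_out2 = fuzzy_system(f_in1, f_in2)
        step = step_scale * f_out1                 # map-dependent multiplier m
        theta += f_out2                            # apply steering Δθ
        x += step * math.cos(theta)                # advance one motion step
        y += step * math.sin(theta)
        path.append((x, y))
    return path
```

With stub callbacks that report no nearby obstacle and command a unit step with zero steering, the sketch reproduces the expected straight-line behavior toward the goal.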

2.2. Principle of Parameter Optimization of the Membership Functions in GA-HIT2F

During the development of the GA-HIT2F system, the path planner employs a two-layer fuzzy rule design to enhance the robot’s obstacle avoidance performance. All variables utilize interval type-2 fuzzy sets, with both upper and lower membership functions represented by triangular membership curves. Within each layer, every input variable is divided into five linguistic variables across its respective universe of discourse. For definitions of the fuzzy sets associated with these linguistic variables and their corresponding membership functions, please refer to [22].
The initial parameters of the membership functions are set empirically. To further optimize the decision-making at each step of the path planning process, the system applies a Genetic Algorithm (GA) to globally optimize all parameters of the triangular membership functions. This optimization enables the SMR to execute more appropriate obstacle avoidance actions, namely step length and steering angle, based on its current state. The fitness function used in the GA optimization is designed as follows:
fitHIT2F(paras) = (fit1 + fit2 + fit3)/n
where paras is the vector of all parameters of the triangular membership functions, about 68 parameters in total; fit1 is the length of the path obtained from one independent path-planning run; fit2 is the distance between the agent and the target at the termination of the current planning round; fit3 is a penalty term applied if the SMR has still not arrived, or has moved away from the target area, when the set maximum number of planning steps is reached; and n is a constant related to the map, robot, and sensor scale.
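A minimal sketch of this fitness computation, under the assumption that one planning run is summarized by its waypoint list; the timeout penalty value and goal radius are illustrative choices, not taken from the paper.

```python
import math

def fitness(path, goal, max_steps, n, penalty=100.0, goal_radius=0.5):
    """GA fitness sketch: path length (fit1) + terminal distance to the
    goal (fit2) + a timeout penalty (fit3), scaled by the constant n."""
    fit1 = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    fit2 = math.dist(path[-1], goal)
    # Penalize runs that exhausted the step budget away from the goal.
    fit3 = penalty if (len(path) - 1 >= max_steps and fit2 > goal_radius) else 0.0
    return (fit1 + fit2 + fit3) / n
```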

2.3. Parameter Optimization Performance and Its Defect Analysis

From the fuzzy rules of GAHIT2Fsystem we can observe the general mechanism of the GA-HIT2F path planner: when there is an obstacle near the robot, the planner outputs a steering angle that tends to deviate away from the obstacle and, according to the distance of the nearest obstacle, assigns an appropriate step length so that the robot maintains a safe margin from its surroundings after moving forward. For example, when the agent detects a nearby obstacle at a large angle to its left, rule 1 makes the LPP planner GA-HIT2F output a rightward deflection and a smaller speed (step length) to avoid the risk of collision. We note that when no obstacle is detected, GA-HIT2F defaults the obstacle angle to zero degrees, i.e., directly in front of the robot; fuzzy rules 2 to 5 then make the robot execute a turning action, which is clearly unwise and non-optimal. This paper finds that GA-HIT2F compensates for this overly cautious avoidance behavior through GA parameter optimization: the GA search for optimal parameters favors the robot's tendency to approach the target, so that the robot attempts to head straight for the target whenever the area ahead is safe. However, this creates a problem: the optimized parameters fit only the map and target information used during optimization. When a new target or a new map emerges, the GA-optimized parameters still push the robot toward the original map and original target, because they are the optimal values found by the GA after many iterations on the old configuration; they are not suitable for planning with other targets and locations. In this section, a small planning test is conducted to illustrate this phenomenon.
Two types of maps are prepared as follows: the first is a blank environment without any obstacles, and the theoretical optimal path of the robot is the line between the starting and target points; the second is an environment with a rectangular obstacle in the center of the map, as shown in Figure 3.
GA-HIT2F path planning (LPP) is first performed in the two environments: the optimal parameters are obtained by GA based on each map and its start/target information, these parameters are input into the GAHIT2Fsystem, and the path planning is then completed using Algorithm 1. The results are shown in Figure 4.
Notably, in the empty environment, the optimized parameters allow the robot to head directly to the target along the straight line between the two points, which demonstrates that the GA-optimized parameters enable the robot to make fine adjustments and take optimal actions under the original fuzzy rules.
To further explore the effects of the parameters, two special tests were conducted: (1) swapping the parameters between the two environments, i.e., using the parameters optimized for the obstacle environment to plan in the empty environment; (2) changing the target point in the obstacle environment while still using the parameters GA-optimized for the original target. The planning results of the two tests are shown in Figure 4 and Figure 5.
From the results of test (1), it can be seen that the robot unexpectedly turns to avoid an "obstacle" in an otherwise empty environment, ignoring the actual target point. This strange behavior reinforces the point of this section: the optimized parameters already encode key information about the original map, and even when applied to other maps, the robot still makes decisions oriented toward the original map or the original target.
As an LPP planner, GA-HIT2F also possesses a defect that is fundamentally difficult for this class of LPP algorithms to solve. In special environments such as the concave wall or the narrow-passage wall in Figure 6, the SMR often falls into local extrema, resulting in poor path planning. In the GA-HIT2F algorithm, this phenomenon arises as follows: the SMR first has to avoid the wall while traveling toward the target point, and once it has bypassed the wall it should, in theory, advance to the end point. However, this special wall environment causes the robot to produce, at a certain location, exactly the same fuzzy inputs as before it bypassed the wall, and the same optimized parameters are used in both cases. These two coincidences make the GAHIT2Fsystem produce a steering command that instead drives the robot further away from the target.
To summarize, the basic GA-HIT2F algorithm's parameters have two major defects. First, if the map or target point changes, the fixed GA-optimized parameters must be re-optimized. Second, in some wall environments with local minimum traps, identical fuzzy inputs combined with fixed optimized parameters cause the robot, which should take two different obstacle-avoidance strategies, to produce the same fuzzy outputs, making the path planning fail.

2.4. The Necessity of Dynamically Adjusting the Parameters of the Membership Functions

Recognizing the limitations of the parameter optimization approach in the basic GA-HIT2F, this paper argues that its fixed optimized parameters are obtained by seeking a balance between the robot and the overall map environment, including the start and end points. Consequently, this method struggles to effectively capture the distribution relationship among the three core elements: the robot, varying surrounding obstacles, and the target point. Ideally, fuzzy operations require parameters capable of reflecting the real-time state of the robot, encompassing sensor-detected obstacle information and the target point status. In other words, the parameters of the membership functions should be dynamically adjustable and possess the self-adaptive ability to select values that are optimal specifically for the current time step under each distinct configuration of the three elements.

3. QL-HIT2F Planner Based on Reinforcement Learning and the Concept of Average Obstacle Orientation

3.1. Average Obstacle Orientation

In the GA-HIT2F algorithm, sensors in nine directions detect the distance to obstacles within their respective detection ranges. The RangeSensor function returns the angle θGRO, obtained by subtracting θRO from θGR. Here, θGR is the angle of the vector from the SMR to the goal, while θRO is the angle of the vector from the SMR to the nearest obstacle. θRO reflects, to some extent, the positional relationship between the agent and environmental threats, but it considers only the nearest obstacle. This makes it difficult to reflect obstacle features collected by sensors in other directions, leaving information from the unknown environment underutilized within the HIT2F system. To comprehensively account for obstacle information from all directions, ideally, the closer an obstacle in a given direction is to the robot, the greater its influence should be, and the influences of obstacles in all nine directions should be aggregated. Therefore, this paper proposes a new concept called the average obstacle orientation angle (AOO) [26]: an average tendency angle computed from the distance and angle information of the obstacles detected by each sensor, representing the overall distribution of obstacles around the robot. It is computed according to the following formula:
θARO = (Σ_{i=1}^{9} wi·θROi) / (Σ_{i=1}^{9} wi)
where wi is the weight of the obstacle detected by each sensor, with wi = 1/disROi. Accordingly, the RangeSensor function incorporating the AOO concept should return θGRO = θGR − θARO. However, a limitation of the average orientation angle is that when multiple independent obstacles are distributed on both sides of the robot, their positive and negative angles relative to the robot's heading cancel each other out during the weighted averaging. Consequently, a correction to θARO is required, as detailed below.
θGRO = θGR − θRO,   if |θARO − θRO| > θthr
θGRO = θGR − θARO,  otherwise
where θthr is the threshold for the difference between the AOO and the orientation of the nearest obstacle. The implication of this formula is that when independent obstacles exist on both sides of the robot, θARO may deviate excessively, rendering the average-orientation estimate inaccurate; in such cases, the AOO angle is not applied.
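The AOO computation and its threshold correction can be sketched as follows; the function and argument names are ours, introduced for illustration.

```python
def average_obstacle_orientation(distances, angles):
    """Distance-weighted mean of the obstacle angles over the nine
    sensors: closer obstacles receive larger weights w_i = 1/dis_ROi."""
    weights = [1.0 / d for d in distances]
    return sum(w * a for w, a in zip(weights, angles)) / sum(weights)

def corrected_theta_gro(theta_gr, theta_ro, distances, angles, theta_thr):
    """Return theta_GRO, falling back to the nearest-obstacle angle when
    the AOO deviates too far from it (obstacles on both sides cancel)."""
    theta_aro = average_obstacle_orientation(distances, angles)
    if abs(theta_aro - theta_ro) > theta_thr:
        return theta_gr - theta_ro    # AOO unreliable: use nearest obstacle
    return theta_gr - theta_aro       # AOO accepted
```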

3.2. Dynamic Parameter Configuration of Reinforcement Learning

The Q-Learning (QL) algorithm [24,25,26] is a discrete, Q-table-based reinforcement learning method that explores an unknown environment and maps states to actions through a reward-and-punishment mechanism. QL is generally employed in path planning in two ways. The first uses it as the main method, training a mobile robot to seek an optimized path through a state-action policy; the second uses QL as a decision-maker for certain parameters or obstacle-avoidance strategies, supporting other path planning methods through its learning process; studies [24,25] provide explanatory examples of each, respectively. In the field of SMR path planning, QL can be understood as follows: based on its Q-table and accumulated experience, the SMR chooses and executes an appropriate obstacle-avoidance strategy in its current state. The robot then moves through its environment, the environment provides timely reward feedback, and the SMR arrives at the next state. The executed strategy is interpreted as the action of this state transition. The reward and the new state then update the Q-table using the Bellman equation (Algorithm 2) and feed back to the robot, prompting it to dynamically adjust its actions so that, through continual trial and error, it converges to the optimal policy. The standard QL reinforcement learning algorithm is shown below.
Algorithm 2 Standard Q-Learning algorithm
1: QTable ← InitializeZeros(NumS, NumA)
2: for preset episodes
3:   St ← InitRandomState(S)
4:   while St ≠ terminal state
5:     At ← GreedyChooseAction(St, QTable)
6:     [St+1, R] ← PerformActionToGetEnvFeedback(St, At)
7:     Q(St, At) ← Q(St, At) + α[R + γ·max Q(St+1, A) − Q(St, At)]
8:     St ← St+1
9:   end while
10: end for
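A compact runnable version of Algorithm 2 is sketched below, with an epsilon-greedy action choice and a per-episode step cap added for safety; the environment callback `step_fn` and all hyperparameter defaults are our own illustrative assumptions.

```python
import random

def q_learning(num_states, num_actions, step_fn, episodes=200, alpha=0.5,
               gamma=0.9, epsilon=0.2, start=0, terminal=None, max_steps=1000):
    """Tabular Q-Learning; step_fn(s, a) -> (next_state, reward) plays the
    role of PerformActionToGetEnvFeedback in Algorithm 2."""
    q = [[0.0] * num_actions for _ in range(num_states)]
    terminal = num_states - 1 if terminal is None else terminal
    for _ in range(episodes):
        s = start
        for _ in range(max_steps):                 # step cap per episode
            if s == terminal:
                break
            if random.random() < epsilon:          # explore
                a = random.randrange(num_actions)
            else:                                  # exploit
                a = max(range(num_actions), key=lambda k: q[s][k])
            s_next, r = step_fn(s, a)
            # Bellman update (line 7 of Algorithm 2)
            q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
            s = s_next
    return q
```

On a toy corridor where one action moves right toward a rewarding terminal state, the learned Q-values come to favor moving right in every state.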
Because the unknown environments of path planning are complex and obstacles with special shapes are likely to appear, the parameters must change dynamically according to the robot's state. Consider the extreme narrow-passage environment in Figure 6: testing shows that the GA-HIT2F algorithm does not easily enable the robot to traverse narrow areas. Investigation reveals that this dilemma arises because the maximum detection radius disrange of the robot (which is also the minimum safe distance of the SMR) is too large relative to the width of the narrow passage. When the robot travels near the narrow area, it detects an obstacle within disrange; since disrange is greater than the passage width, the robot misjudges the entire detection range as full of obstacles and avoids the passage entrance. Given that the sensor's maximum detection range is physically fixed, we propose to adaptively change the value of disrange used in the RangeSensor computation in real time rather than using the fixed sensor value; i.e., disrange is also treated as an unfixed parameter to be dynamically adjusted by QL, denoted disrange_in_coding (disrange_in_coding ≤ disrange).
Additionally, this paper introduces the average obstacle orientation angle θARO and specifies a threshold condition for its activation. However, since the threshold θthr is set manually, relying solely on this criterion to decide whether to activate θARO may not be sufficiently robust across different specialized environments. To address this, an activation switch SwitchARO is further proposed. This switch is a binary logic variable taking only the values 0 or 1, with θARO activated only when SwitchARO is turned on (set to 1). SwitchARO is thus another parameter to be dynamically adjusted through Q-Learning decision-making. The average obstacle orientation angle is therefore activated only when the following condition is satisfied:
|θARO − θRO| > θthr  &&  SwitchARO = 1
The parameters that the QL-HIT2F algorithm dynamically adjusts via QL during LPP in an unknown environment are: all membership-function parameters of the fuzzy system GAHIT2F, the detection radius disrange_in_coding in the RangeSensor function, and the AOO activation switch SwitchARO, totaling 70 dynamic parameters.

3.3. Reinforcement Learning State-Space for QL-HIT2F Path Planner

In the QL-HIT2F path planning algorithm, since reinforcement learning must generalize to different maps and environments, the state space S of QL is built from the relative relationships among the trinity of the robot state, the surrounding obstacle information detected by the sensors, and the target point; together these uniquely describe the current features at a given moment. In this paper, these relative relations are captured by the following four variables: (1) the Euclidean distance disRG between the robot's current position and the target; (2) the angle θRRG between the current robot orientation and the line connecting the robot and the target point; (3) the orientation angle θRO of the nearest obstacle detected by the sensors; and (4) the steering angle Δθ output by the QL-HIT2F system at the current control time step. The values of these four variables determine the four sub-states st1, st2, st3, and st4. The first two measure how close the agent is to the target; the third reflects the distribution of obstacles; and the last indicates the intensity of the planner's current steering control: a larger Δθ implies the planner judged the situation dangerous enough to give the robot a larger deviation, indirectly reflecting the possibility of collision and the safety level of the surroundings. The discretized values of each sub-state are listed in Table 1, which shows that the four sub-states define a state space S of C(2,1) × C(3,1) × C(4,1) × C(2,1) = 48 dimensions; the threshold values of these variables are selected similarly to [26].
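The discretization can be sketched as a simple binning scheme. The bin edges below are hypothetical placeholders (the paper takes its thresholds from [26], and Table 1 lists the actual levels), but the 2 × 3 × 4 × 2 = 48 mixed-radix indexing is as described.

```python
import bisect

# Hypothetical bin edges -- stand-ins for the thresholds in Table 1.
DIS_RG_EDGES    = [2.0]              # 2 levels: near / far from the target
THETA_RRG_EDGES = [0.5, 1.5]         # 3 levels of heading deviation
THETA_RO_EDGES  = [-1.0, 0.0, 1.0]   # 4 sectors for the nearest obstacle
DTHETA_EDGES    = [0.4]              # 2 levels of steering intensity

def state_index(dis_rg, theta_rrg, theta_ro, dtheta):
    """Map the four continuous variables to one of 2*3*4*2 = 48 states."""
    s1 = bisect.bisect(DIS_RG_EDGES, dis_rg)
    s2 = bisect.bisect(THETA_RRG_EDGES, theta_rrg)
    s3 = bisect.bisect(THETA_RO_EDGES, theta_ro)
    s4 = bisect.bisect(DTHETA_EDGES, dtheta)
    return ((s1 * 3 + s2) * 4 + s3) * 2 + s4       # mixed-radix index
```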
Defining which action to perform in a particular state, and what reward the SMR obtains after transitioning to a new state through action selection, is the key issue in training the agent's policy with the QL algorithm. The planner generates control commands via the two sets of fuzzy rules formulated in the GAHIT2Fsystem together with the dynamically QL-optimized parameters. According to the analysis in the previous sections, the optimization parameters requiring dynamic self-adaptation through QL can be divided into two groups: the membership-parameter group and the other-variable group. In the first group, the fuzzy input disROmin corresponds to 26 parameters, θGRO to 21 parameters, and the input Δθ to 21 parameters; the roughly 68 parameters of this group are unified into one vector parameter paras. The other-variable group contains the radius disrange_in_coding and the AOO activation switch SwitchARO. The parameters constituting the QL action space are thus reduced to three variables, and the action space is generated by combining the possible values of each. The design and definition of the QL action space are elaborated in the following sections.

3.4. Reinforcement Learning Action Space for QL-HIT2F Path Planner

The Q-Learning algorithm records all discrete state-action pairs, so the values of the dynamic parameters should be chosen carefully to keep the action space manageable. Before discussing the values of the optimization parameters, a new concept, the meta-map, is introduced. Since the purpose of employing reinforcement learning is to let the parameter values change dynamically and adapt automatically to different environment maps, the unknown environment used to train the agent (the robot) for QL path planning must be a map specifically designed to contain various special obstacles. Once the robot has explored this environment and learned the characteristics of its obstacles by trial and error, it can make sound decisions in any other map without separately re-optimizing parameters for each. In this paper, the map used by QL for training is called a 'meta-map', as shown in Figure 7; the meta-map is specifically designed to contain narrow passages and concave walls, as well as other types of multi-state obstacles.
The previous subsection clarified that the parameters obtained from the original GA-HIT2F optimization adapt only to the same environment and the same target point; even if the target point changes, the robot tends to converge toward the original target. This paper addresses this shortcoming by optimizing the parameters independently for multiple fixed targets: instead of optimizing against the actual planning target, we select eight special points and optimize the parameters eight times, so that the eight optimized parameter sets respectively represent eight tendencies of movement (toward the front, back, left, right, and the four oblique directions). To obtain parameters that endow the SMR with these eight movement tendencies, and considering that LPP operates in unknown environments where a priori obstacle information is unavailable, we place eight SPOs (Subgoals for Parameters Optimization) on the map boundary, as shown in Figure 8; each corresponds to one movement tendency, and each SPO is used to optimize the membership parameters of its extreme scenario in the meta-map independently using the original GA-HIT2F system. (Because only the movement tendency in each direction is needed, GA-HIT2F itself need not handle complex environments here; it merely uses GA to obtain the optimal parameters for each tendency, while the ability to traverse extreme environments comes from QL training in the meta-map.) Thus paras, the first dynamic parameter in the QL action space, has eight possible values, expressed as the following equation.
paras ∈ {pa1, pa2, pa3, pa4, pa5, pa6, pa7, pa8}
where pak denotes the parameter vector optimized for SPOk, the subgoal of the kth direction in the simple meta-map; each pak is a vector of about 70 membership parameters.
The value sets of the other membership-function-related parameters are given below. The discrete range of the sensing radius used by the SMR during sensor operation, dis_range_in_coding, is defined as follows: dis_range_in_coding ∈ {(1/3)·dis_range, (2/3)·dis_range, dis_range}.
The binary switch variable Switch_AOO that activates the AOO is expressed by binary logic and takes the values Switch_AOO ∈ {0, 1}.
In summary, by the multiplication principle there are C(8,1) · C(3,1) · C(2,1) = 48 possible combinations of the parameter groups that constitute the QL action space, i.e., an action space of 48 discrete actions is defined. A typical action of the system is, for example: paras = pa1, dis_range_in_coding = (1/3)·dis_range, and Switch_AOO = 0.
Specifically, the SMR states and the action space described above generate a 48 × 48 matrix QTable. The principles of the environmental reward and the whole QL training procedure are discussed in detail in the subsequent sections.
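As a hedged illustration (the variable names and the placeholder sensing radius are ours, not the authors'), the 48-action space can be enumerated as the Cartesian product of the three dynamic-parameter groups, and the 48 × 48 Q-table initialized to zeros:

```python
from itertools import product

# Eight membership-parameter vectors pa1..pa8 (each a vector of ~70 values
# in the paper; represented here by their labels only).
PARAS = [f"pa{k}" for k in range(1, 9)]

DIS_RANGE = 3.0  # placeholder value for the sensor's maximum radius
DIS_RANGE_IN_CODING = [DIS_RANGE / 3, 2 * DIS_RANGE / 3, DIS_RANGE]
SWITCH_AOO = [0, 1]  # AOO mechanism off / on

# 8 * 3 * 2 = 48 discrete actions.
ACTIONS = list(product(PARAS, DIS_RANGE_IN_CODING, SWITCH_AOO))

# The four discrete robot states of Table 1 give 2 * 3 * 4 * 2 = 48 states,
# so the Q-table is a 48 x 48 matrix of zeros.
N_STATES = 2 * 3 * 4 * 2
q_table = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
```

An action index then demaps directly to a (paras, dis_range_in_coding, Switch_AOO) triple, mirroring the example action given above.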

3.5. Reinforcement Learning Reward Mechanism and Training Process for QL-HIT2F Path Planner

A reasonable reward mechanism provides an accurate performance assessment of each step executed by the agent during path planning, and the continuously accumulated rewards or penalties for a specific state indirectly assess the robot's current degree of safety with respect to its surroundings. In this paper, the reward principles are set up as shown in Table 2; the reward score is determined by the values of four newly defined states formed after performing a specific action. These four states are: the distance between the robot and the target, the change in proximity to the target, the length of the feasible path upon successful arrival, and finally the robot's position.
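The reward principle of Table 2 can be sketched as a single scoring function. The constants follow the table, while the function signature, argument names, and defaults are our own assumptions about how the feedback step is organised:

```python
# Constants follow Table 2; the helper signature is our own assumption.
ARRIVE_BONUS = 5000        # dis_RG <= dis_thr: target reached
SHORTER_PATH_BONUS = 5000  # new feasible path shorter than the last one
STEP_REWARD = 10           # per-step reward for approaching the target
COLLISION_PENALTY = -200   # robot position [Rx, Ry] hits an obstacle

def step_reward(dis_rg, dis_thr, delta_dis_rg, collided,
                traj_cost_new=None, traj_cost_last=None):
    """Score one executed action according to the Table 2 principles."""
    if collided:
        return COLLISION_PENALTY
    rw = 0
    if delta_dis_rg < 0:    # moved closer to the target
        rw += STEP_REWARD
    elif delta_dis_rg > 0:  # moved away from the target
        rw -= STEP_REWARD
    if dis_rg <= dis_thr:   # successful arrival
        rw += ARRIVE_BONUS
        if traj_cost_last is not None and traj_cost_new < traj_cost_last:
            rw += SHORTER_PATH_BONUS
    return rw
```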
To obtain a well-trained agent, the policy for choosing an action from the action space at each training step of the QL algorithm is also crucial. The ε-greedy strategy avoids both pure exploration and pure exploitation and is often used as the reference for action selection in QL training. However, in an unknown environment all states in the state space generated by the agent are discrete; if the state has not shifted compared with the previous step, no new action should be generated by the greedy strategy, and the previous action group is kept to maintain the continuity of robot control. A second case arises in the early episodes of training, when QTable has not yet been fully updated and a discrete state may correspond to multiple actions of equal value (all 0); here the greedy strategy cannot discriminate either, and an action is taken at random. In summary, the QL training process used in this paper is shown in Algorithms 3 and 4.
Algorithm 3 QL training flow for QL-HIT2F path planner
1: Input necessary Map and Robot preliminary information
2: S ← InitQLStateSpace()
3: A ← InitQLActionSpace()
4: QTable ← InitializeZeros(48, 48)
5: for preset episodes do
6:   (st1, st2, st3, st4) ← InitRobotState()
7:   St ← QStatesMapping(st1, st2, st3, st4)
8:   while St ≠ terminal state do
9:     At ← ModifiedGreedyChooseAction(St, At(last), QTable, StaticFlag, ε)
10:    [paras, disRangeInCoding, SwitchAOO] ← QActionDemapping(A, At)
11:    RobotState.update ← HIT2F(paras, disRangeInCoding, SwitchAOO)
12:    (st1, st2, st3, st4) ← RobotState(new)
13:    St+1 ← QStatesMapping(st1, st2, st3, st4)
14:    [R, StopFlag] ← PerformActToGetEnvFeedBack(RobotState(new))
15:    QTable.update(St, St+1, At, R)
16:    St(last) ← St
17:    St ← St+1
18:    StaticFlag ← 0
19:    if St = St(last) then StaticFlag ← 1 end if
20:    if StopFlag = 1 then break end if
21:  end while
22: end for
23: return QTable
In this algorithm, InitQLStateSpace() and InitQLActionSpace() initialize the state space and action space of the reinforcement learning, respectively. InitializeZeros() initializes a Q-value table with all entries set to 0. InitRobotState() initializes the actual physical state of the robot at the beginning of a new episode and returns four state variables (st1, st2, st3, st4), which are raw sensor or environmental data. QStatesMapping() maps the robot's current physical state to the discrete state index used in the Q-table. ModifiedGreedyChooseAction() selects the action At based on the current state St, the previous action At(last), the Q-table, and StaticFlag. QActionDemapping() decodes the abstract action index At selected from the Q-table into the actual parameters that directly control the HIT2F planner. HIT2F() executes the core path-planning algorithm proposed in this study: it receives the parameters obtained from the preceding QActionDemapping() step, calculates the robot's next motion command, and updates the robot's physical state accordingly; RobotState(new) denotes this newly updated state. PerformActToGetEnvFeedBack() evaluates the new state and returns an immediate reward R and a termination flag StopFlag.
QL training generally consists of two loops: the outer loop iterates over the reinforcement-learning episodes, set by experience to 5000–20,000; the inner loop represents an independent path-planning process, which may be interrupted because the agent has reached the target or a collision with the environment has occurred. The QStatesMapping function in the pseudocode maps the robot's four predefined decision variables to a specific state in the state space, and the index St reflects the specific discrete state the robot occupies at that moment. Inversely, QActionDemapping takes the discrete action number At produced by the modified greedy policy (the detailed selection policy is given in Algorithm 4) and demaps it into the values of the dynamic parameters. Once the current parameter values are determined, they are integrated into the HIT2F system for path planning. The robot then moves one step to a new state through this planner, after which the PerformActToGetEnvFeedBack function is immediately triggered to give the robot the reward or penalty feedback it should receive; the Q-table can then be updated from the triplet (state, action, reward) using the Bellman equation.
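The QTable.update step, i.e., the Bellman update over the (state, action, reward) triplet, can be sketched as follows; the learning rate and discount factor are illustrative values, not taken from the paper:

```python
ALPHA, GAMMA = 0.1, 0.9  # illustrative learning rate and discount factor

def q_update(q_table, s_t, a_t, reward, s_next):
    """Tabular Bellman update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table[s_next])
    q_table[s_t][a_t] += ALPHA * (reward + GAMMA * best_next - q_table[s_t][a_t])
```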
There are a few details in the algorithm worth noting. The variable StopFlag compels the robot to terminate the training of a single episode after a collision with an obstacle, and StaticFlag indicates that the robot should keep the previous action group; both flags take effect when set to 1. In addition, the action-selection policy of QL accounts for the continuity of robot actions; the CoverZeros function in Algorithm 4 checks whether all entries of the Q-table row corresponding to a state are 0, in which case the action is generated completely at random.
Algorithm 4 Modified greedy choose action algorithm
1: if StaticFlag = 1 then
2:   At ← At(last)
3: elif rand ≤ ε or CoverZeros(QTable(St)) then
4:   At ← RandAction(A)
5: else
6:   Atemp ← argmax over A of QTable(St)
7:   if length(Atemp) > 1 then
8:     At ← Atemp(rand)
9:   else
10:    At ← Atemp
11:  end if
12: end if
13: return At
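A minimal Python sketch of Algorithm 4, assuming the Q-table row for the current state is passed in; the epsilon value and the tie-breaking RNG are illustrative:

```python
import random

def choose_action(q_row, a_last, static_flag, epsilon=0.1, rng=random):
    """Modified epsilon-greedy selection over one Q-table row (Algorithm 4)."""
    # Keep the previous action when the discrete state has not changed.
    if static_flag == 1:
        return a_last
    # Explore at random with probability epsilon, or when the whole row is
    # still zero (CoverZeros): all actions are equally valued.
    if rng.random() < epsilon or all(v == 0 for v in q_row):
        return rng.randrange(len(q_row))
    # Exploit: argmax with random tie-breaking among equal maxima.
    best = max(q_row)
    candidates = [i for i, v in enumerate(q_row) if v == best]
    return rng.choice(candidates)
```

Random tie-breaking mirrors the `length(Atemp) > 1` branch of the pseudocode, avoiding a systematic bias toward low-index actions early in training.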

4. QL-HIT2F Path Planning Training and Simulation Test

4.1. Path Planning Training of QL-HIT2F for Metamap Scenarios

This study adopts a sequential training paradigm to learn a versatile Q-table that allows the SMR to handle a wide range of unknown obstacles in real time across unfamiliar maps. Using the QL-HIT2F algorithm, training is conducted successively on two distinct meta-maps. The process begins with the simpler map to derive an initial Q-table, which then serves as the foundation for subsequent training on a more complex map characterized by narrow passages. This sequential strategy is designed to boost the robot's generalization capacity, ensuring comprehensive learning of diverse structural configurations. The meta-map in Figure 7 (left) represents a mixed dense and sparse environment, while the meta-map in Figure 7 (right) is a complex environment filled with various special obstacles, predominantly narrow channels. These two environments encourage the SMR to autonomously explore and learn a rich set of state–action pairs.
Regarding the robot and QL-related parameter settings for path-planning training in the meta-map scenario, the map boundary length is 300 units, and the starting and target locations of the SMR are the lower-left and upper-right corners of the map, each at a distance of the SMR safety radius from the boundary, so that the robot can traverse as many special obstacles and walls as possible on both sides of the diagonal. The number of reinforcement-learning episodes is 20,000, and QL updates the Q-values according to the Bellman equation. To further strengthen the generalization of the training results to new environments, the training procedure is enriched: in addition to the lower-left-to-upper-right navigation task, path-planning training is added in five symmetric directions, using other start and target points selected on the map boundary, for each meta-map.
In conclusion, one complete QL training on each meta-map contains six sub-trainings in these six directions, as listed in Table 3 and shown in Figure 8. The Q-table output of the first sub-training is used as the initial Q-table of the second, and the third through sixth proceed in the same way. Thus, after the sixth sub-training, the Q-table holds the values accumulated as the robot explored and updated from multiple directions over six rounds of obstacle-avoidance experience. The results of all QL trainings are shown in Figure 9; it can be seen that, after up to 20,000 episodes, the robot has fully explored the meta-map environment from all directions and found feasible paths of different homotopy classes.
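The chaining of the six sub-trainings can be sketched as follows, where train_one_subtask is a stand-in for one run of Algorithm 3 (not the authors' code) and the start/target/heading triples follow Table 3, with heading signs inferred from the start-to-target geometry:

```python
# (start, target, initial heading in degrees) for the six sub-trainings;
# headings follow the start->target geometry of Table 3.
SUB_TRAININGS = [
    ((20, 20), (280, 280), 45.00),
    ((20, 20), (280, 150), 26.57),
    ((20, 20), (150, 280), 63.43),
    ((20, 280), (280, 20), -45.00),
    ((20, 280), (150, 20), -63.43),
    ((20, 280), (280, 150), -26.57),
]

def train_one_subtask(q_table, start, target, heading):
    """Stand-in for one run of Algorithm 3: explore from `start` toward
    `target` with the given initial heading and return the updated Q-table."""
    return q_table  # placeholder; a real run would update Q-values here

def train_meta_map(n_states=48, n_actions=48):
    """Chain the six sub-trainings: each one starts from the Q-table
    produced by the previous one, so experience accumulates."""
    q_table = [[0.0] * n_actions for _ in range(n_states)]
    for start, target, heading in SUB_TRAININGS:
        q_table = train_one_subtask(q_table, start, target, heading)
    return q_table
```

The same carry-over applies between the two meta-maps: the Q-table finished on the first map seeds training on the second.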

4.2. QL-HIT2F Algorithm Path Planning Test in Multiple Scenarios

After sufficient training in the two meta-maps, the QTable obtained by QL encodes how to dynamically adjust the parameters of the QL-HIT2F system when dealing with unknown environments, and the SMR can then plan by consulting the QTable to make the optimal parameter-selection action. In this section, two scenarios (map1 and map2 of Figure 10) are selected based on [22], and additional scenarios with concave or narrow walls (map3 and map4 of Figure 11 and Figure 12) are designed to compare the proposed QL-HIT2F with the original planner GA-HIT2F. Note that when path planning is performed in a new map, the obstacle environment is unknown; the only information available to the planner is the starting and target points, the map boundary, the real-time sensor data, and the well-trained QTable. The SMR needs only this information to complete planning and navigation. The planning process of QL-HIT2F is basically the same as the original GA-HIT2F (Algorithm 1), with two extra steps: the selection of the dynamic membership parameters based on the trained Q-table (how this decision is made is given in Training Algorithm 3), and the determination of whether to enable the AOO to calculate the obstacle-related angles based on the AOO switch.
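At planning time the decision reduces to a greedy look-up: map the sensed state to its discrete index, take the best-valued action, and demap it into the three dynamic parameters. A sketch under our own naming assumptions (the radius levels 1–3 stand for the three coded fractions of dis_range):

```python
from itertools import product

# Mirror of the 48-action space: (paras label, coded sensing-radius level,
# AOO switch). Labels and level codes are our own shorthand.
ACTIONS = list(product([f"pa{k}" for k in range(1, 9)], [1, 2, 3], [0, 1]))

def plan_step(q_table, state_index):
    """One planning-time decision: greedily pick the best-valued action for
    the current discrete state and demap it into the dynamic parameters."""
    row = q_table[state_index]
    best = row.index(max(row))  # pure exploitation; no exploration here
    paras, dis_range_level, switch_aoo = ACTIONS[best]
    return paras, dis_range_level, switch_aoo
```

Unlike training, no epsilon-greedy exploration is used here: the trained Q-table is consulted deterministically at every step.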

4.3. Path Planning Analysis in General Scenarios

Initially, the LPP navigation test was conducted on the two maps used in the original literature, with the proposed QL-HIT2F algorithm benchmarked against GA-HIT2F. Both algorithms started navigation from identical initial positions and orientations. The path planning results are presented in Figure 13, where the actual robot paths are represented by blue-black circles for QL-HIT2F and green-black circles for GA-HIT2F.
Based on the two path planning results, it is clear that the LPP path produced by QL-HIT2F outperforms that of GA-HIT2F in both path-length cost and steering-angle smoothness. The flexible collision-hazard judgment that reinforcement learning confers on the SMR can be observed from the blue path. Although both paths successfully avoid the obstacle, the robot on the QL-HIT2F path begins to deflect slightly earlier than on the green-black GA-HIT2F path, and these slight turns cumulatively let the robot clear the central obstacle at an almost optimal position (near the upper-left vertex of the obstacle's convex hull). On the green path, by contrast, a sharper turn is made only very close to the obstacle surface, and avoidance is then executed along the surface of the central obstacle. Moreover, after bypassing the central obstacle, the GA-HIT2F algorithm produces a more conservative action output in the unknown environment, and the robot reaches the target from the map boundary along an arc-shaped trajectory with an almost 45-degree attitude, whereas QL-HIT2F directly takes the optimal action and drives the robot straight to the target. Finally, as evidenced by Figure 13a,b, the introduction of the AOO mechanism yields a significantly smoother and safer planned path, successfully avoiding obstacles in both scenarios.
In addition, the literature that proposed the GA-HIT2F planner also demonstrates obstacle-avoidance LPP navigation in dense multi-obstacle environments. Figure 14 compares the path planning results of our local planner QL-HIT2F with the original GA-HIT2F in such a map of distributed convex obstacles, where further advantages of QL-HIT2F are clearly visible.
The QL-HIT2F planner demonstrates superior performance in Figure 14. Its path (Figure 14b) uses the Q-table to optimize the dynamic parameters, resulting in a smoother, more continuous trajectory that avoids unnecessary turns and shortens the travel distance. This stands in sharp contrast to the GA-HIT2F path (Figure 14c), which contains multiple turns and ends with a redundant and unsafe near-90-degree reorientation of the robot immediately before reaching the goal. The results in Figure 14a,b also imply the effectiveness of the AOO mechanism.
To verify that the defects of the original algorithm elaborated above are effectively solved by QL-HIT2F, this paper additionally performs path planning with special obstacle walls; note that the environment is completely unknown and the robot cannot foresee the special geometric patterns of the obstacles. The comparative results of the two algorithms are shown in Figure 15, Figure 16 and Figure 17. The paths planned by our algorithm (Figure 15b, Figure 16b and Figure 17b) successfully navigate past the obstacles, including both horizontal narrow walls and concave walls, to reach the target. Notably, this is achieved without prior map knowledge, relying solely on real-time planning during motion, which confirms that the Q-table effectively selects optimal parameters and triggers the AOO mechanism at critical positions to escape local minima. In contrast, the proposed method without AOO shows poor planning results: as listed in Figure 15a, Figure 16a and Figure 17a, the absence of the AOO mechanism leads to trajectories that collide with obstacles, i.e., planning failures. Moreover, the original GA-HIT2F algorithm (Figure 15c, Figure 16c and Figure 17c) also fails, as its path collides with obstacles and becomes trapped in a local minimum. This result directly corresponds to the limitation of GA-HIT2F identified previously: in wall environments featuring local-extremum traps, its fixed optimization parameters generate identical, inappropriate fuzzy outputs from similar inputs even when a detour is essential. This inflexibility prevents the robot from adopting an obstacle-avoidance strategy and results in planning failure. These observations collectively verify that QL-HIT2F overcomes the limitations of its predecessor.
It is capable of selecting optimal parameters in unknown and complex environments, making judicious collision-avoidance decisions to ensure successful navigation even in the presence of challenging concave walls.
The detailed evaluations of GA-HIT2F and QL-HIT2F for Figure 15, Figure 16 and Figure 17 are listed in Table 4, in which Cost denotes the length of the planned trajectory and maxΔθ denotes the maximum turning angle along the planned path; the results without AOO are also reported. Analysis of the statistics in Table 4 reveals that QL-HIT2F holds strong advantages not only in path cost but also in the steering-angle (SA) variation during planning: its maximum absolute SA variation remains within the robot's ideal angular range, indicating superior motion continuity compared with the original GA-HIT2F. These results demonstrate that QL-HIT2F overcomes the limitations of the original algorithm by enabling optimal parameter selection in unknown and complex environments; even in narrow areas with concave walls, it successfully guides the robot with reasonable collision-avoidance actions. The data also confirm that the introduced AOO mechanism effectively resolves collision avoidance, underscoring its critical role in the algorithm's overall effectiveness.

5. Conclusions

This paper introduces QL-HIT2F, a dynamic adaptive local path-planning method that combines Q-Learning to give robots decision-making autonomy for optimal action selection in unknown maps. The proposed approach features three major enhancements. First, a dynamic parameter-adjustment scheme in which the membership parameters adapt to the changing triplet state; second, beyond the membership parameters of the original algorithm, a novel angular concept, the average obstacle orientation (AOO), is introduced, and the switch that activates it is treated as an adaptive parameter; third, the operational value of the robot's maximum sensing radius is also treated as a dynamic parameter. Q-Learning is utilized to unify these three types of dynamic parameters for adaptive decision-making; on this basis, the state space, action space, and reward mechanism, the three elements required for RL, are designed from the mutual motion and geometric relationships among the triplet (the robot, the target, and the environment). Training equips the robot with a universal policy for determining the optimal dynamic parameters corresponding to its current state in any unknown environment. Finally, a series of simulations is conducted in general environments and in environments with special obstacles and walls. The simulation results demonstrate that the actions and paths decided by the QL-HIT2F local planner outperform those of the original GA-HIT2F planner, which verifies the effectiveness of combining reinforcement learning with type-2 fuzzy dynamic parameter tuning to assist robot path planning in unknown environments.

Author Contributions

Conceptualization, methodology, and writing—original draft preparation, N.Z.; data curation, F.Z.; writing—review and editing, J.Z.; visualization, C.L.; funding acquisition, N.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the High-Level Talent Start-up Fund of Shandong Xiehe University, project "Evaluation and Risk Prediction of Health Status Based on Multimodal Big Data and Visual Intelligence" (No. SDXHQD2025089).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GA-HIT2F    Genetic-algorithm-based hierarchical interval type-2 fuzzy
QL-HIT2F    Q-Learning-aided adaptive hierarchical interval type-2 fuzzy
HIT2F       Hierarchical interval type-2 fuzzy

References

1. Sharma, K.; Doriya, R. Coordination of multi-robot path planning for warehouse application using smart approach for identifying destinations. Intell. Serv. Robot. 2021, 14, 313–325.
2. Tang, Y.; Muhammad, A.Z.; Maryam, Y. Path Planning Trends for Autonomous Mobile Robot Navigation: A Review. Sensors 2025, 25, 1206.
3. Azpúrua, H.; Rezende, A.; Potje, G.; Júnior, G.P.d.C.; Fernandes, R.; Miranda, V.; Filho, L.W.d.R.; Domingues, J.; Rocha, F.; de Sousa, F.L.M.; et al. Towards Semi-autonomous Robotic Inspection and Mapping in Confined Spaces with the EspeleoRobô. J. Intell. Robot. Syst. 2021, 101, 69.
4. Liao, J.; Guo, B.; Zhang, Y.; Miao, X.; Zheng, T. Non-singular optimal terminal sliding mode control for autonomous surface vessels with actuator faults and model uncertainties via adaptive dynamic programming. Ocean Eng. 2026, 343, 123334.
5. Vygantas, U.; Michał, N.; Andrius, D.; Česnavičius, R.; Augutis, E.; Martišius, I.; Jurevičius, R.; Eidukynas, D.; Vaitkus, V.; Kačeniauskas, A.; et al. Sensor-Fusion Based Navigation for Autonomous Mobile Robot. Sensors 2025, 25, 1248.
6. Krell, E.; King, S.A.; Garcia, R. Autonomous Surface Vehicle energy-efficient and reward-based path planning using Particle Swarm Optimization and Visibility Graphs. Appl. Ocean Res. 2022, 122, 103125.
7. Salama, O.A.A.; Eltaib, M.E.H.; Mohamed, H.A.; Salah, O. RCD: Radial Cell Decomposition Algorithm for Mobile Robot Path Planning. IEEE Access 2021, 9, 149982–149992.
8. Lin, P.; Yang, J.H.; Quan, Y.S.; Chung, C.C. Potential field-based path planning for emergency collision avoidance with a clothoid curve in waypoint tracking. Asian J. Control 2022, 24, 1074–1087.
9. Tang, G.; Tang, C.; Claramunt, C.; Hu, X.; Zhou, P. Geometric A-Star Algorithm: An Improved A-Star Algorithm for AGV Path Planning in a Port Environment. IEEE Access 2021, 9, 59196–59210.
10. Lee, M.-T.; Chuang, M.-L.; Kuo, S.-T.; Chen, Y.-R. UAV Swarm Real-Time Rerouting by Edge Computing D Lite Algorithm. Appl. Sci. 2022, 12, 1056.
11. Wang, Z.; Cai, J. Probabilistic roadmap method for path-planning in radioactive environment of nuclear facilities. Prog. Nucl. Energy 2018, 109, 113–120.
12. Su, Y.; Xin, J.; Sun, C. Dynamic Path Planning for Mobile Robots Based on Improved RRT* and DWA Algorithms. IEEE Trans. Ind. Electron. 2025, 72, 10595–10604.
13. Punriboon, C.; So-In, C.; Aimtongkham, P.; Leelathakul, N. Fuzzy Logic-Based Path Planning for Data Gathering Mobile Sinks in WSNs. IEEE Access 2021, 9, 96002–96020.
14. Sahu, S.; Choudhury, B.B. Fuzzy Logic Based Path Planning for Industrial Robot. Int. J. Manuf. Mater. Mech. Eng. 2018, 8, 1–11.
15. Samadi, G.M.; Jond, H.B. An intelligent approach for autonomous mobile robots path planning based on adaptive neuro-fuzzy inference system. Ain Shams Eng. J. 2022, 13, 10149.
16. Dirik, M.; Castillo, O.; Kocamaz, A.F. Visual-Servoing Based Global Path Planning Using Interval Type-2 Fuzzy Logic Control. Axioms 2019, 8, 58.
17. Mirza, N. Robotic Path Planning and Fuzzy Neural Networks. Int. Arab. J. Inf. Technol. 2020, 17, 615–620.
18. Mohseni, A.; Duchaine, V.; Wong, T. Experimental study of path planning problem using EMCOA for a holonomic mobile robot. J. Syst. Eng. Electron. 2021, 32, 1450–1462.
19. Alomari, A.; Phillips, W.; Aslam, N.; Comeau, F. Dynamic Fuzzy-Logic Based Path Planning for Mobility-Assisted Localization in Wireless Sensor Networks. Sensors 2017, 17, 1904.
20. Taheri, E.; Ferdowsi, M.H.; Danesh, M. Fuzzy Greedy RRT Path Planning Algorithm in a Complex Configuration Space. Int. J. Control Autom. Syst. 2018, 16, 3026–3035.
21. Sathiya, V.; Chinnadurai, M.; Ramabalan, S. Mobile robot path planning using fuzzy enhanced improved Multi-Objective particle swarm optimization (FIMOPSO). Expert Syst. Appl. 2022, 198, 116875.
22. Zhao, T.; Xiang, Y.; Dian, S.; Guo, R.; Li, S. Hierarchical interval type-2 fuzzy path planning based on genetic optimization. J. Intell. Fuzzy Syst. 2020, 39, 937–948.
23. Li, H.; Zhao, T.; Dian, S. Forward search optimization and subgoal-based hybrid path planning to shorten and smooth global path for mobile robots. Knowl.-Based Syst. 2022, 258, 110034.
24. Ee, S.L.; Pauline, O.; Cheng, Y.L. A modified Q-learning path planning approach using distortion concept and optimization in dynamic environment for autonomous mobile robot. Comput. Ind. Eng. 2023, 181, 109338.
25. Zhao, M.; Lu, H.; Yang, S.; Guo, F. The Experience-Memory Q-Learning Algorithm for Robot Path Planning in Unknown Environment. IEEE Access 2020, 8, 47824–47844.
26. Bonny, T.; Kashkash, M. Highly optimized Q-learning-based bees approach for mobile robot path planning in static and dynamic environments. J. Field Robot. 2022, 39, 317–334.
27. Xi, M.; Yang, J.; Wen, J.; Liu, H.; Li, Y.; Song, H.H. Comprehensive Ocean Information-Enabled AUV Path Planning via Reinforcement Learning. IEEE Internet Things J. 2022, 9, 17440–17451.
28. Lu, C.; Liang, S.; Chao, J.; Jia, Q.; Guo, W. Reinforcement based mobile robot path planning with improved dynamic window approach in unknown environment. Auton. Robot. 2021, 45, 51–76.
29. Hao, B.; Du, H.; Yan, Z. A path planning approach for unmanned surface vehicles based on dynamic and fast Q-learning. Ocean Eng. 2023, 270, 113632.
Figure 1. Illustration of the detection of SMR of GA-HIT2F.
Figure 2. Hierarchical type-2 fuzzy inference system of GA-HIT2F.
Figure 3. Empty environment (left) and central obstacle environment (right).
Figure 4. LPP results of GA-HIT2F for two environments.
Figure 5. Results of GA-HIT2F for two environments with changed parameters.
Figure 6. Special environments with concave surfaces (left) or narrow walls (right).
Figure 7. Meta-map 1 and meta-map 2 for two environments.
Figure 8. Schematic diagram of the eight directions of SPO.
Figure 9. Schematic diagram of sub-training. Blue star is the starting point, and the green one represents the ending point.
Figure 10. Q-Learning training results in meta-maps. The blue star is the starting point, and the green one represents the ending point.
Figure 11. Basic scenarios.
Figure 12. Special scenarios.
Figure 13. QL-HIT2F planning results without AOO (a), QL-HIT2F planning path with AOO (b) and GA-HIT2F planning path (c). The red part represents the ending point of a path, and the green star is the start point.
Figure 14. QL-HIT2F planning path without AOO (a), QL-HIT2F planning path with AOO (b) and GA-HIT2F planning path (c). The red part represents the ending point of a path.
Figure 15. Planning results in concave scenario (QL-HIT2F without AOO (a), QL-HIT2F with AOO (b), GA-HIT2F (c)). Red part represents the ending point of a path, and the green star is the start point.
Figure 16. Planning results in narrow scenario 1 (QL-HIT2F without AOO (a), QL-HIT2F with AOO (b), GA-HIT2F (c)). Red part represents the ending point of a path, and the green star is the start point.
Figure 17. Planning results in narrow scenario 2 (QL-HIT2F without AOO (a), QL-HIT2F with AOO (b), GA-HIT2F (c)). Red part represents the ending point of a path.
Table 1. Q-Learning state space.
Table 1. Q-Learning state space.
State | Value | Condition on dependence variable
s_t1  | 1     | dis_RG ≤ dis_start2goal / 2
      | 2     | dis_RG > dis_start2goal / 2
s_t2  | 1     | 0 ≤ θ_RRG ≤ 60°
      | 2     | −60° ≤ θ_RRG < 0
      | 3     | θ_RRG > 60°
s_t3  | 1     | 0 < θ_RO ≤ 60°
      | 2     | −60° ≤ θ_RO < 0
      | 3     | θ_RO > 60°
      | 4     | θ_RO = 0
s_t4  | 1     | Δθ ≤ 60°
      | 2     | Δθ > 60°
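The mapping in Table 1 can be sketched as a discretization function. The variable names follow the table; the function itself is an illustrative reconstruction under the stated thresholds, not the authors' code, and the behavior for angles outside the listed ranges (e.g., θ_RRG < −60°) is an assumption.

```python
def discretize_state(dis_RG, dis_start2goal, theta_RRG, theta_RO, delta_theta):
    """Map continuous readings (distances, angles in degrees) to the
    discrete Q-Learning state tuple (s_t1, s_t2, s_t3, s_t4) of Table 1."""
    # s_t1: has the robot closed at least half of the start-to-goal distance?
    s1 = 1 if dis_RG <= dis_start2goal / 2 else 2

    # s_t2: robot-to-goal heading error
    if 0 <= theta_RRG <= 60:
        s2 = 1
    elif -60 <= theta_RRG < 0:
        s2 = 2
    else:
        s2 = 3

    # s_t3: relative obstacle orientation (exact zero is its own state)
    if theta_RO == 0:
        s3 = 4
    elif 0 < theta_RO <= 60:
        s3 = 1
    elif -60 <= theta_RO < 0:
        s3 = 2
    else:
        s3 = 3

    # s_t4: magnitude of the angular adjustment
    s4 = 1 if delta_theta <= 60 else 2
    return (s1, s2, s3, s4)
```

With four states of sizes 2 × 3 × 4 × 2, the discrete state space contains 48 combinations, which keeps the Q-table compact.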
Table 2. SMR’s environment reward mechanism and principles for QL training.
Variable    | Reward Mechanism | Reward Principle
dis_RG      | RW = RW + 5000   | dis_RG ≤ dis_thr
Δdis_RG     | RW = RW + 10     | Δdis_RG < 0
            | RW = RW − 10     | Δdis_RG > 0
traj_cost   | RW = RW + 5000   | traj_cost(new) < traj_cost(last)
[R_x, R_y]  | RW = RW − 200    | Collision
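The reward rules of Table 2 can be sketched as a single update applied to the running reward RW per step. The reward magnitudes are taken from the table; the call pattern and the `collision` flag (standing in for the [R_x, R_y] collision check) are assumptions for illustration.

```python
def update_reward(RW, dis_RG, dis_thr, delta_dis_RG,
                  traj_cost_new, traj_cost_last, collision):
    """Apply the Table 2 reward mechanism to a running reward RW."""
    if dis_RG <= dis_thr:                  # goal region reached
        RW += 5000
    if delta_dis_RG < 0:                   # moved closer to the goal
        RW += 10
    elif delta_dis_RG > 0:                 # moved away from the goal
        RW -= 10
    if traj_cost_new < traj_cost_last:     # cheaper trajectory found
        RW += 5000
    if collision:                          # robot position hit an obstacle
        RW -= 200
    return RW
```

The large goal and trajectory-cost bonuses dominate the small per-step distance shaping, so the agent is steered toward reaching the goal with a shorter path rather than merely drifting closer.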
Table 3. Initial settings of QL sub-training for SMR.
               | Start     | Target    | Initial Orientation
Sub-training 1 | (20, 20)  | (280, 280) | 45.00°
Sub-training 2 | (20, 20)  | (280, 150) | 26.57°
Sub-training 3 | (20, 20)  | (150, 280) | 63.43°
Sub-training 4 | (20, 280) | (280, 20)  | 45.00°
Sub-training 5 | (20, 280) | (150, 20)  | 63.43°
Sub-training 6 | (20, 280) | (280, 150) | 26.57°
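The initial orientations in Table 3 are consistent with pointing the robot straight at the target: for the first three rows the values match atan2(Δy, Δx) measured from the x-axis. This is an assumed convention for illustration (the sign convention for the downward-heading sub-trainings 4-6 is not specified here, where the table lists unsigned magnitudes).

```python
import math

def initial_heading(start, target):
    """Heading from start toward target, in degrees from the x-axis,
    rounded to two decimals as in Table 3."""
    dx = target[0] - start[0]
    dy = target[1] - start[1]
    return round(math.degrees(math.atan2(dy, dx)), 2)
```

For example, `initial_heading((20, 20), (280, 150))` reproduces the 26.57° of sub-training 2.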
Table 4. Evaluation of Path Planning Results for GA-HIT2F and QL-HIT2F.
Algorithm              | Map                                  | Cost   | max Δθ | Steps
GA-HIT2F               | map in Figure 13                     | 384.79 | 43.21  | 99
                       | map in Figure 14                     | 376.84 | 46.04  | 87
                       | maps in Figure 15, Figure 16 and Figure 17 | planning failed
QL-HIT2F (without AOO) | map in Figure 13                     | 380.42 | 32.48  | 67
                       | map in Figure 14                     | 368.37 | 28.41  | 62
                       | maps in Figure 15, Figure 16 and Figure 17 | planning failed
QL-HIT2F (with AOO)    | map in Figure 13                     | 377.26 | 17.80  | 60
                       | map in Figure 14                     | 358.81 | 26.74  | 57
                       | map in Figure 15                     | 305.46 | 36.43  | 47
                       | map in Figure 16                     | 438.01 | 41.42  | 87
                       | map in Figure 17                     | 415.75 | 24.29  | 64
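On the two maps where all planners succeed, the gains of QL-HIT2F (with AOO) over GA-HIT2F follow directly from Table 4 and can be checked with a few lines of arithmetic (the percentages below are derived from the table, not reported in it):

```python
def improvement(ga_cost, ga_steps, ql_cost, ql_steps):
    """Percent reduction in path cost and step count relative to GA-HIT2F,
    using the values reported in Table 4."""
    return (round(100 * (ga_cost - ql_cost) / ga_cost, 1),
            round(100 * (ga_steps - ql_steps) / ga_steps, 1))

# Figure 13 map: GA-HIT2F (384.79, 99) vs QL-HIT2F with AOO (377.26, 60)
print(improvement(384.79, 99, 377.26, 60))
# Figure 14 map: GA-HIT2F (376.84, 87) vs QL-HIT2F with AOO (358.81, 57)
print(improvement(376.84, 87, 358.81, 57))
```

The cost reductions are modest (roughly 2-5%), but the step counts drop by about a third, and only the AOO variant completes the concave and narrow scenarios at all.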
Zhou, N.; Zhou, F.; Li, C.; Zhong, J. QL-HIT2F: A Q-Learning-Aided Adaptive Fuzzy Path Planning Algorithm with Enhanced Obstacle Avoidance. Sensors 2026, 26, 200. https://doi.org/10.3390/s26010200