1. Introduction
With the widespread application of 5G technology in underground coal-mine communications, the construction of “smart mines” with the goals of visualization, unmanned operation, informatization, and intelligence is gradually being realized [
1,
2,
3]. To effectively address the challenges of signal quality and coverage blind spots in the complex underground propagation environment, novel superconducting reflective surface technology with passive controllable signal characteristics has been introduced into mine safety production and information exchange processes. By dynamically adjusting signal-propagation characteristics, this technology constructs an adaptive underground signal coverage network, enabling the controllability of random and uncontrollable complex channels [
4,
5,
6,
7]. Currently, research on reflector-assisted wireless communication technology in mines can be primarily categorized into two aspects: the first focuses on the hardware structure design of reflectors [
8,
9,
10] and the study of mine channel enhancement mechanisms using traditional algorithms [
11,
12,
13]; the second focuses on the autonomous optimization design of mine reflectors based on intelligent algorithms such as deep learning.
Reference [
14] investigates the field-strength superposition principle of multiple Reconfigurable Intelligent Surfaces (RIS) in tunnel scenarios. By constructing a vector model, it studies the signal field-strength characteristics of the RIS receiving end at a frequency of 28 GHz in subway tunnels and provides a deployment scheme for multiple RIS within a limited dynamic distribution space. Reference [
15] focused on wireless communication in mine tunnels, specifically investigating non-line-of-sight propagation at the 28 GHz frequency band. It employed 3D ray-tracing simulations to model mine tunnel environments and evaluated the effectiveness of different RIS deployment modes—reflective and absorptive—in enhancing communication capacity under high path loss and multipath interference conditions. In [
16], a joint deployment strategy for 5G-R base stations assisted by RIS is proposed, providing a wireless coverage solution for mountainous railways using RIS. It studies system gains using an optimized SINR criterion and the Charnes–Cooper solution method, and investigates the signal compensation characteristics of a 100-unit array RIS at the 930 MHz frequency band. In [
17], a joint beamforming scheme for RIS-assisted multi-user MIMO systems based on deep learning is proposed, investigating system performance and data rates under various RIS array configurations. However, this study does not fully consider multi-RIS deployment scenarios or multi-hop signal propagation. In [
18,
19], multi-hop RIS-assisted joint communication and sensing (JCAS) methods are employed to enhance the energy efficiency and overall sensing rate of JCAS access points in underground coal mines. In [
20], an error-rate detection method for mine RIS systems based on the Parallel Interference Cancellation (PIC) algorithm is proposed and its error characteristics are analyzed. However, this study only targets ideal rectangular tunnels and does not consider the complex tunnel structures in real-world scenarios. In [
21], a RIS-NOMA mine communication system model based on wireless local area network noise interference is proposed, evaluating system transmission reliability by analyzing the interruption probability. In [
22], an underground communication system model based on the Nakagami-g fading channel model and RIS signal-propagation model is constructed, with preliminary channel estimation performed using the least squares (LS) algorithm, and the channel estimation results optimized using an octave convolution (OCT) neural network. In [
23], the reconfigurable characteristics of passive RIS beams are implemented through the principles of passive coding and splicing, suitable for coal-mine tunnels with different turning angles.However, traditional algorithms exhibit slower computation speeds and lower efficiency when handling complex environments. Therefore, adopting more efficient intelligent algorithms is particularly crucial for addressing the intricate scenarios encountered in coal-mine communication systems.
Reference [
24] introduces RIS into the Vehicle-to-Everything (V2X) communication system for mines. By integrating meta-heuristic optimization algorithms with deep learning techniques and incorporating an error-correction strategy, it optimizes the RIS phase configuration to enhance channel gain.Reference [
25] utilizes DRL to obtain the optimal RIS-assisted computational unloading strategy in dynamic mining environments, proposing a RIS-assisted computational unloading scheme based on Deep Deterministic Policy Gradient (DDPG) to jointly optimize the phase shift and unloading rate of RIS components, thereby maximizing the utility of mining IoT devices, improving energy efficiency, and reducing computational latency. In [
26], RIS is used to optimize signal quality and maximize spectral efficiency in high-speed railway tunnels, proposing an algorithm combining Long Short-Term Memory (LSTM) and DDPG, namely LSTM-DDPG, to address this issue. In [
27,
28,
29,
30], the channel model and fading for RIS-assisted rectangular mines are derived, the feasibility of RIS in mine communications is verified, and the DDPG algorithm is used to optimize the system and rate in rectangular tunnels.DDPG and LSTM-DDPG have achieved some progress in RIS optimization, but they still face challenges in handling noise issues in complex environments and multi-hop RIS deployments. The DDPG algorithm suffers from overestimation problems, which may lead to inefficient and unstable policy optimization. Therefore, there is an urgent need to propose more efficient intelligent algorithms.
The existing literature primarily focuses on simple rectangular tunnel scenarios assisted by RIS, without addressing more complex mine tunnel geometric structures. The main contributions of this paper are summarized in the following three key aspects:
- (1)
This paper proposes for the first time a novel RIS channel propagation model specifically designed to model complex semi-circular vaulted mine structures. This model accounts for the geometric characteristics of mine tunnels, filling a gap in the existing literature that has only addressed simple rectangular tunnels.
- (2)
To overcome the noise issues in the existing Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, this paper proposes an improved TD3 algorithm—TD3 with Noise Fading. By dynamically adjusting the standard deviation of Ornstein–Uhlenbeck (OU) noise, this paper effectively mitigates the negative impact of noise on algorithm performance, enhancing the stability and efficiency of the optimization process.
- (3)
The proposed TD3-NF algorithm jointly optimizes the base-station beamforming matrix and the RIS phase-shift matrix to maximize the system link rate. The paper also analyzes the impact of base-station transmit power and the number of RIS units on the link rate, further explores the influence of key neural network parameters on algorithm performance, and provides new insights for the design of coal-mine RIS communication systems.
The remainder of this paper is structured as follows:
Section 2 introduces the system model,
Section 3 presents the research questions and proposes corresponding solutions,
Section 4 evaluates the proposed approach through simulation, and
Section 5 concludes the paper.
2. System Model
Figure 1 shows the simulated distribution of multi-branch mine tunnels. The tunnel geometric cross-section consists of a combination of rectangular tunnel structures and slightly curved semi-circular arched roofs. The figure illustrates two RIS-assisted tunnel scenarios: one is a communication system model assisted by RIS on the same side wall, and the other is a communication system model assisted by RIS on the opposite side wall. To enhance the feasibility of the system model design, the study fully considers the curved and narrow conditions of the mine tunnels. It is important to note that RIS is used for signal coverage in underground environments, so the research focuses solely on the signal-propagation characteristics of RIS-assisted mine tunnels under Non-Line-of-Sight (NLoS) conditions. Due to the complex long-distance propagation in confined spaces, the study does not consider the impact of Line-of-Sight (LoS) propagation on the system.
Consider a dual RIS-assisted multi-user coal-mine communication system with obstacles. The system is specified as consisting of a base station (BS) with M antennas and two RISs composed of reflective units.Here, we assume that the two RIS systems have an equal number of reflectors (i.e., ). This ensures dimensional matching in matrix operations, thereby guaranteeing the validity of the formula and the mathematical consistency of the system model. Furthermore, setting the reflector count equal simplifies the signal-processing model and ensures symmetry during signal optimization. RISs deployed on the curved walls of the mine tunnels, RIS1 near the BS end, RIS2 near the UE end and K single-antenna users (UE). The channels from BS to RIS1 and RIS2 are denoted as and , respectively, the channels from RIS1 and RIS2 to UE are denoted as and respectively, and the channel from RIS1 to RIS2 is denoted as .
2.1. Coal-Mine Tunnel Modeling
Define the reflection coefficient matrix of RIS as shown in Equation (
1):
where
is the phase change caused by the
reflection element of the
RIS, and
is an amplitude function based on the
phase change:
where
represents the horizontal offset of the phase shift; and
controls the steepness of the function curve.
depends on the hardware implementation constants of the RIS. If the signal undergoes ideal reflection on the RIS, then
, i.e.,
, or
.
The aggregated channel from the BS to the user
k is shown in Equation (
3):
Due to the wide-frequency spectrum characteristics of electrical noise, electrical noise that affects radio signals is regarded as a pulse interference response. Therefore, underground noise can be represented by an independent and identically distributed Bernoulli–Gaussian process:
where
is additive Gaussian white noise, pulse interference noise
is Gaussian noise with random pulses,
is a Bernoulli random process with mean 0 and variance 1, taking values 0 or 1,
.
The signal received from the
user can be expressed as shown in Equation (
5):
where the first term represents the expected signal of the
user, the second term represents the interference caused by the signals of all other users
to the
user, i.e., co-channel interference (CCI),
is the data symbol transmitted by the base station to the user
k,
is the transmission beamforming vector of the base station to user
k, and
P is the transmission power of each signal. Therefore, the signal-to-interference-plus-noise ratio (SINR) of the
user in the system can be expressed as shown in Equation (
6):
where
p is the probability that
equals 1. Then, the total link rate in the system (in units of
) can be expressed as shown in Equation (
7):
2.2. Signal-Propagation Path Analysis
The study uses the mirror method to predict the propagation characteristics of signals. First, it is necessary to obtain the effective reflection surface required for the mirror method. As shown in
Figure 2, the arch is divided into
Q equal parts. Regardless of whether
Q is odd or even, the Y-axis coordinate and the Z-axis coordinate of any coordinate point
can be expressed as shown in Equation (
8):
The
small plane of the arch is determined by two adjacent points
and
. The plane equation of the
small plane is shown in Equation (
9):
The plane equation of the coal-mine wall surface is shown in Equation (
10):
Specify the coordinates of the launch point
, the coordinates of the receiving point
, the coordinates of the reflection point
, and the equation of the reflection plane
m is
, constructed in the Cartesian coordinate system shown in
Figure 2. The mirror point
can be expressed as shown in Equation (
11):
According to the line connecting the mirror point and the receiving point
and the reflection plane intersecting at the reflection point, we have:
From this, we can obtain:
The reflection point is also located on the reflection plane, so it satisfies the reflection plane equation, giving us:
The coordinates of the reflection point can be obtained from Equations (
13) and (
14).
In a semi-circular arch-shaped mine tunnel, the signal undergoes more than three reflections, resulting in significant energy attenuation. Therefore, considering only the first three reflections is sufficient to fully reveal the signal-propagation behavior. Thus, the upper limit for the number of reflections is set to three [
31,
32]. The number of reflection lines depends on the number of reflective surfaces
in the tunnel, where reflective surfaces include the tunnel walls and RIS. A single reflection line can have up to
lines, with each reflection line involving one reflective surface; a secondary reflection line can have up to
lines, with any two different reflective surfaces forming a secondary reflection line; a tertiary reflection line can have up to
lines, with each tertiary reflection line passing through three different reflective points, and adjacent reflective points cannot be on the same reflective surface, but the first and third reflective points can be on the same reflective surface. In this process, the RIS, as a new type of controllable reflective surface, collaborates with traditional reflective surfaces to construct a multipath propagation environment. Therefore, the total effective scattering paths are shown in Equation (
15):
Assuming that the signal passes through n reflection points from the transmission point to the reception point, the path length of the signal propagation can be expressed as: ; where represents the distance from the reflection point to the reflection point , and the reflection points include the position of RIS. In the case of dual RIS, the length of the reflection path changes with the number of RIS, so it is necessary to calculate the distance between each reflection point step by step.
According to Fresnel’s law, the reflection coefficients of vertically polarized waves and horizontally polarized waves can be expressed as shown in Equation (
16) and (
17):
where
is the relative dielectric constant of the tunnel wall. Assuming that the roughness distribution of the tunnel wall follows a Gaussian distribution with a mean of 0 and a variance of
h and
is the roughness loss factor, the roughness loss coefficient caused by the rough surface of the tunnel wall is:
where
. Multiple reflections of the signal on both sides of the tunnel and the top and bottom plates will cumulatively affect the total reflection coefficient. Assuming the ray undergoes
m reflections at the side walls and
n reflections at the top and bottom plates (i.e., experiencing
m reflections of horizontally polarized waves followed by
n reflections of vertically polarized waves at the top and bottom plates), the reflection coefficient at this point can be expressed as shown in Equation (
19):
3. Joint Design of Base-Station Beamforming and RIS Phase Shifting Based on TD3-NF
3.1. Optimization of Beamforming Matrix and RIS Phase Shift Matrix
The power of the multi-antenna BS transmission signal is subject to the maximum transmission power constraint, expressed as shown in Equation (
20):
where
is the maximum transmission power at BS,
represents the statistically expected value, and
represents the trace of the matrix.
In order to maximize the total link rate of the system by optimizing
and
, define the optimization problem:
Due to the non-convexity of the objective function and the complexity of the constraints, traditional mathematical optimization methods require a large number of iterative calculations, resulting in high computational complexity. Therefore, this paper adopts a framework based on the TD3 algorithm to solve the problem in Equation (
21) and obtain feasible
and
under the premise of satisfying all feasible constraints.
3.2. TD3-NF Algorithm Design
The TD3 algorithm is an improvement on the DDPG algorithm, both of which are used for continuous control problems. Compared to DDPG, TD3 introduces two key innovations: dual updates of the target network and an action delay update mechanism to reduce estimation errors during training.
In standard TD3, Ornstein–Uhlenbeck (OU) noise is used to increase the randomness of action selection to promote exploration. However, the standard deviation of OU noise is typically fixed, which may result in an overly prolonged exploration phase or excessive noise persisting in the later stages of training, thereby impairing the model’s exploitation efficiency. To address this issue, this paper proposes the TD3-NF algorithm to solve problem in Equation (
21). This algorithm balances exploration and exploitation by gradually reducing the standard deviation of the noise, i.e.,
where
is the noise decay rate,
is the minimum standard deviation of noise, and
is the standard deviation of noise in the previous step. This decay strategy allows the algorithm to conduct extensive exploration in the early stages and gradually transition to a more strategy-dependent utilization phase as training progresses. The design process of problem in Equation (
21) is shown in
Figure 3.
Using CSI as input for the TD3-NF agent, generate the optimal and . The specific settings for state, action, and reward are as follows:
State
: Transmit power at time
t; channel matrices from BS to RIS1 and RIS2; channel matrices from RIS1 and RIS2 to UE; channel matrix from RIS1 to RIS2. The state space size is shown in Equation (
23):
Action
: Composed of the beamforming matrix and phase shift matrix at time
t; the size of the action space is shown in Equation (
24):
Reward
: The reward at time
t is the value of the objective function in Equation (
21):
The TD3-NF algorithm proposed in this paper follows the procedure below, as summarized in Algorithm 1.
Algorithm 1. TD3-NF algorithm |
- 1
Initial: transmit beamforming matrix G, phase-shift matrix , experience replay pool E, parameters of the Critic training network , parameters of the Actor training network , parameters of the target network - 2
Input: channel matrices from BS to RIS1 and RIS2 ; channel matrices from RIS1 and RIS2 to UE ; channel matrix from RIS1 to RIS2 - 3
Output: Q-value function, optimal action - 4
for to J do - 5
Get the initial state for to T do - 6
From the Actor training network , - 7
Based on the action , get the state at the next moment, and get the immediate reward Store into the experience replay pool E - 8
A small batch of samples of size W is randomly drawn from the empirical replay pool E - 9
Select two target Critic networks for Q-value prediction, - 10
Calculate TD Objectives - 11
Update Critic: - 12
Update Actor: - 13
Update target networks: - 14
Let - 15
end - 16
end
|
As the number of training iterations increases, the agent continuously interacts with the environment to select the optimal actions, ultimately maximizing long-term returns. When training approaches convergence, the agent calculates the optimal beamforming matrix and RIS phase-shift matrix. The neural network structure designed is shown in
Figure 4. The action network and evaluation network, as well as their respective target networks, are structurally similar, both consisting of four fully connected layers: an input layer, two hidden layers (with batch normalization applied in the hidden layers), and an output layer. Based on this, all layers of the deep neural networks in this paper use the tanh activation function, and all networks uniformly use the Adam optimizer. The learning rate in Adam is adaptively adjusted according to the gradient changes of each parameter, i.e.,
, and
, where
and
are the decaying rates of the action network and critic network, respectively.
is the learning rate for updating the Critic training network, and
is the learning rate for updating the Actor training network.
During the training phase, the variables L, and represent the size of the training layer, the size of the input layer, and the number of neurons in layer l, respectively, and the number of learned samples is set to B. The computational complexity of each time step is . During the training phase, each minibatch contains events, each event is T time steps long, and each training model undergoes multiple iterations until convergence. Therefore, the total computational complexity of the training is expressed as: . During the training phase, in addition to the forward propagation and backpropagation computations of the neural network, environmental simulation is required. This involves calculating the composite channel matrix for each user and performing matrix multiplication for RIS phase shifts. The computational complexity of channel calculation is , where represents the number of reflectors in the RIS and K denotes the number of users. Additionally, the SINR for each user must be computed, introducing further computational overhead. During the simulation phase, the overall computational complexity increases to , where J corresponds to the number of SINR calculations performed per user.
4. Results
To validate the feasibility of the proposed TD3-NF-based algorithm for improving transmission rates in RIS-assisted semi-circular arch tunnels, this section conducts simulation experiments using numerical modeling, with Python as the programming language. First, an environment initialization function is constructed and initial values are set. An environment transition function is designed to describe the changes in the environment state after each action step. The neural network structure and parameters of the proposed TD3-NF-based algorithm are designed, and a DRL agent is constructed to interact with the environment. The coal-mine height is set to 6 m, width to 5 m, arch crown radius to 2.5 m, and the distances from the base station and receiver to the tunnel sidewall are 4.5 m and 2 m, respectively. The number of reflector array elements is
,
,
, and the element size is
= 5 cm. At 2.4 GHz, the dimensions of the RIS are
cm,
cm, and
cm, with a transmission distance of 200 m. The distance between the RIS and the base station is fixed at 50 m. In this study, we assume that the underground mine communication system employs a leaky feeder antenna configuration, a common antenna structure in complex environments such as mines that provides effective signal coverage. Signals are modulated using QPSK, with a transmission bit rate set at 1 Mbps, a beamwidth of 120°, and a standard time-division multiplexing (TDM) frame structure. The channel matrices
,
and
,
are implemented using ray tracing, where the total effective scattering paths are
, the gain of the receiving and transmitting antennas is 5 dBm, the roughness variance of the tunnel walls is
, and the relative permittivity is
= 5. The simulation parameters are shown in
Table 1. In this paper, the average reward is used as the metric to evaluate system performance, defined as shown in Equation (
26):
As shown in
Figure 5, this paper compares and analyzes the performance of four reinforcement learning algorithms (TD3-NF, DDPG, Asynchronous Advantage Actor–Critic (A3C), and Deep Q-Network(DQN)) in a RIS-assisted tunnel communication system. The experimental results indicate that the TD3-NF algorithm demonstrates the best performance, with its link rate significantly improving from the initial 2.5 bps/Hz to 11.1 bps/Hz. In contrast, the DDPG algorithm eventually stabilizes at 8.6 bps/Hz, with the performance bottleneck primarily stemming from the issue of overestimating Q-values; the learning curve of the A3C algorithm was relatively flat, increasing from approximately 1.0 bps/Hz to 5.7 bps/Hz. The DQN algorithm, constrained by the expressive capability of its discrete action space, only slowly improved from 2.3 bps/Hz to 4.8 bps/Hz. Overall, the TD3-NF algorithm emerged as the most suitable for this tunnel communication system due to its strong adaptability and efficient policy learning. In contrast, DDPG, A3C, and DQN failed to achieve comparable optimization results due to their respective limitations. These algorithms face varying degrees of performance bottlenecks, particularly when handling complex continuous or large-scale action spaces.
When
dBm, the changes in the system average reward under different RIS arrays for the DDPG and TD3-NF algorithms are shown in
Figure 6. As the number of RIS elements increases, the system average reward using the TD3-NF algorithm improves from 7.8 bps/Hz to 11.1 bps/Hz, an increase of 3.3 bps/Hz, while DDPG only improves by 1.8 bps/Hz. This indicates that the TD3-NF algorithm can more effectively boost the system’s average reward, exhibiting stronger scalability and optimization capabilities as the number of RIS elements increases.
When the number of RIS elements in the system is 64, the average reward changes for the two algorithms under different transmit power levels are shown in
Figure 7. When the power increases by 2 dBm, the TD3-NF algorithm exhibits a faster reward growth rate. When
dBm, the average reward of the system under this algorithm increases from 3 bps/Hz to 11.1 bps/Hz, an increase of 8.1 bps/Hz; while the reward growth of DDPG only improves by 3 bps/Hz.The TD3-NF algorithm consistently outperforms the DDPG algorithm in efficiency across different transmission power levels, particularly under high-power conditions, demonstrating superior adaptability and optimization capabilities in complex environments. Even under identical conditions, the TD3-NF algorithm achieves significantly higher average reward improvements compared to DDPG.
Figure 6 and
Figure 7 show that the performance of the TD3-NF algorithm is more prominent when the number of RIS elements and transmission power are increased, especially under higher RIS configurations and transmission power conditions, where the learning speed and reward value are significantly improved. This indicates that optimizing the number of RIS elements and transmission power has a positive effect on algorithm performance.
The CDF of the total system rate for different numbers of RIS elements and transmit powers is shown in
Figure 8. As the transmit power and number of RIS elements increase, the CDF curve shifts to the right. When CDF = 0.6 and
dBm, the number of RIS elements is positively correlated with the system rate, and the rate for an 8 × 8 array is 13.45 bps/Hz.The CDF curve validates the observations in
Figure 6 and
Figure 7, confirming that both the average and peak rates of the system improve with increases in transmit power and the number of RIS elements.
In the algorithm proposed in this paper, the Actor and Critic networks use fixed learning rates and decay rates. Under the conditions of
and
dBm, the relationship between different learning rates, decay rates, average rewards, and episode steps is shown in
Figure 9 and
Figure 10.
The impact of different learning rates on algorithm performance is shown in
Figure 9. When the learning rate is
, the system achieves optimal performance, so this paper selects
as the learning rate for this model. However, when the learning rate is
, the algorithm performs the worst, which may be due to the excessively high learning rate causing the algorithm to oscillate continuously during the optimization process and unable to find the optimal solution.
The relationship between the average reward and the time-step length under different decay rates is shown in
Figure 10. Compared with the learning rate, the decay rate has a smaller impact on system performance. When the decay rate is
, the system achieves optimal performance, while the algorithm performs worst when the decay rate is
. Therefore, this paper selects
as the decay rate for this model.
5. Conclusions
This study proposes a communication design scheme for semi-circular arch mine tunnels based on RIS assistance. By jointly optimizing the base-station beamforming matrix and the RIS phase-shift matrix, the TD3-NF algorithm is used to evaluate the system link rate. The results show that the TD3-NF algorithm significantly outperforms traditional DDPG, A3C, and DQN algorithms in link-rate optimization, with the average link rate improving from an initial 2.5 bps/Hz to 11.1 bps/Hz. Compared to the second-best DDPG algorithm, the link rate achieved using the TD3-NF algorithm is superior to that of the DDPG algorithm under the same conditions. When the transmit power is 38 dBm and and 8 × 8 RIS elements, the system’s average link rate reached 11.1 bps/Hz. The study also analyzed the impact of RIS element count and base-station transmit power on system performance, finding that increasing the number of RIS elements and boosting transmit power can further enhance system performance, offering new solutions to improve the efficiency of intelligent mine communication systems.However, current research still faces several challenges, including performance optimization in complex mine structures, interference issues in multi-hop RIS deployment, and stability concerns with intelligent algorithms. Future research could focus on the following directions: further optimizing RIS deployment and multi-RIS collaborative optimization in complex environments, enhancing algorithm adaptability and stability in dynamic mine settings, and exploring the integrated application of RIS with other intelligent technologies. Concurrently, practical deployment and experimental validation will provide crucial evidence for further refining this solution.