Our goal is to sequentially decide the optimal packet transmit power level and the optimal selection of the next talker while accounting for the energy efficiency of the message dissemination. In EMDS, the optimal decision making is conducted in every slot, as explained in
Section 3, based on the network conditions: the deadline of the message dissemination, the number of vehicles in the platoon, and the platoon velocity. To this end, we formulate an MDP model with four elements: (1) state space; (2) action space; (3) state transition function; and (4) reward and cost functions [
23]. Subsequently, we introduce the optimality equation and a value iteration algorithm to solve it.
4.2. Action Space
Based on the current state information, a talker of EMDS chooses a multiobjective action which consists of deciding the transmit power level and the next talker. Therefore, we define the action space as a finite set
$\mathbf{A}$:
where
$\mathbf{P}$ is the set of possible transmit power levels and
$\mathbf{H}$ is the set of the numbers of hops to the next talker that the current talker wants to indicate.
$\mathbf{P}$ can be represented as
where
${P}_{\mathrm{max}}$ is the maximum power level for the packet transmission. We normalize all transmit powers with respect to a minimum possible transmit power in
mW,
${P}_{\left[mW\right]}$, which typically corresponds to the lower end of the linearity range of the RF amplifier on wireless network interfaces. Thus, the transmit power is an integer multiple of
${P}_{\left[mW\right]}$, and the
$n$th power level is defined as
$n\times {P}_{\left[mW\right]}$. Note that the zero transmit power level,
$0\in \mathbf{P}$, means that a talker chooses not to send a packet. For example, if the message dissemination is completed before the deadline, the talker does not need to send the control packet in the remaining slots; in this case, it selects the zero transmit power level. Meanwhile,
$\mathbf{H}$ can be defined as
where the total number of vehicles in the platoon is
N. This means that the current talker includes the number of hops to the next talker,
$i\in \mathbf{H},0\le i\le N-2$, in a control packet to specify the next talker. For example, suppose the current talker is the third vehicle from the leader and it wants to indicate the sixth vehicle as the next talker. Then, the current talker sets three hops in the control packet. Meanwhile, if the current talker sets zero hops,
$0\in \mathbf{H}$, it designates itself as the next talker.
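To make the construction concrete, the action space $\mathbf{A}=\mathbf{P}\times \mathbf{H}$ can be enumerated as in the following sketch; the function name and the parameter values are illustrative, not part of EMDS itself.

```python
from itertools import product

def build_action_space(n_max: int, N: int):
    """Enumerate the action space A = P x H (illustrative helper).

    P = {0, 1, ..., n_max}: transmit power levels, i.e., integer
        multiples of P_[mW]; level 0 means "do not transmit".
    H = {0, 1, ..., N-2}: hops to the next talker; 0 means the
        current talker designates itself again.
    """
    P = range(n_max + 1)
    H = range(N - 1)  # 0 .. N-2
    return [(b, h) for b, h in product(P, H)]

# e.g., 5 power levels x 7 hop choices for an 8-vehicle platoon
actions = build_action_space(n_max=4, N=8)
print(len(actions))  # 35
```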
4.3. State Transition Function
Let
$k$, $g$, $c$, $m$, $x$, and $t$ be the indices for the components of the state
$\mathbf{V},\mathbf{G},\mathbf{C},\mathbf{M},\mathbf{U},\mathrm{and}\phantom{\rule{0.222222em}{0ex}}\mathbf{T}$, respectively, while
b and
h are the indices for action components
$\mathbf{P}$ and
$\mathbf{H}$. In addition, let two arbitrary states in
$\mathbf{S}$ be
$s\stackrel{\mathsf{\Delta}}{=}\left\{k,g,c,m,x,t\right\}$ and
${s}^{\prime}\stackrel{\mathsf{\Delta}}{=}\left\{{k}^{\prime},{g}^{\prime},{c}^{\prime},{m}^{\prime},{x}^{\prime},{t}^{\prime}\right\}$, and an arbitrary action in
$\mathbf{A}$ be
$a\stackrel{\mathsf{\Delta}}{=}\left\{b,h\right\}$. The state transition function is the probability that the system starts from state
s and ends in state
${s}^{\prime}$ by taking an action
a. Since the message dissemination time and the driving operation time occur sequentially in the order of slots, every component is dependent on
$\mathbf{M}$. During the driving operation times,
$\mathbf{V}$ is dependent on
$\mathbf{G}$,
$\mathbf{C}$, and
$\mathbf{U}$, while
$\mathbf{G}$ and
$\mathbf{C}$ are dependent on
$\mathbf{U}$. Meanwhile, during the message dissemination times,
$\mathbf{U}$ and
$\mathbf{T}$ are dependent on
$\mathbf{V}$ as well as
$\mathbf{A}$. Therefore, the state transition function can be described by
The transition probability for time slots,
$\mathrm{Pr}\left[{m}^{\prime}\mid m\right]$, can be expressed as
where
${m}_{++}\stackrel{\mathsf{\Delta}}{=}\left(m+1\right)\phantom{\rule{0.277778em}{0ex}}mod\phantom{\rule{0.277778em}{0ex}}\left(M+1\right)$ and
$\delta \left({m}^{\prime},{m}_{++}\right)$ is the Kronecker delta function. Here, the term
$\delta \left({m}^{\prime},{m}_{++}\right)$ means that the time-slot index always increases by one until the end of the frame.
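As a small sketch, this deterministic slot transition can be written as follows; the zero-based slot convention $m\in \left\{0,\cdots ,M\right\}$ is an assumption inferred from the modulus, and the function names are illustrative.

```python
def next_slot(m: int, M: int) -> int:
    """m_{++} = (m + 1) mod (M + 1): cyclic slot update.

    Assumes the zero-based convention m in {0, ..., M}, where
    slots 0..M-1 carry the message dissemination and slot M is
    the driving operation time, after which the frame restarts.
    """
    return (m + 1) % (M + 1)

def slot_transition_prob(m_next: int, m: int, M: int) -> float:
    """Pr[m' | m] = delta(m', m_{++}): the Kronecker delta makes
    the time-slot transition deterministic."""
    return 1.0 if m_next == next_slot(m, M) else 0.0
```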
When
$m=M$ and
$x=P$, i.e., during the driving operation time after successful message dissemination within the section time, the platoon increases its velocity until it reaches the maximum speed, but only if
$g=1$ and
$c=C-1$. Therefore, the transition probability of
$\mathbf{V}$ can be derived as
and
On the other hand, when
$g=0$,
$m=M$, and
$x\ne P$, the platoon decreases its velocity until it reaches the minimum speed. Unlike the acceleration driving mode, there is no threshold for
c; a message dissemination failure is immediately reflected as a deceleration of the platoon velocity. Then, the transition probability of
$\mathbf{V}$ is given by
In the case of
$x=P$, the platoon returns to the acceleration mode while maintaining its velocity. Thus, we have the transition probability of
$\mathbf{V}$ as
Meanwhile, the platoon can change its speed only after the message dissemination deadline, and it maintains its velocity during a section time. Thus, if
$m\ne M$, the transition probability of
$\mathbf{V}$ can be represented as
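The velocity transition rules above each select a single next velocity index with probability one. A hedged sketch of that deterministic update follows; the `success` flag standing for the condition $x=P$, the parameter `K` for the number of velocity levels, and the zero-based indexing are illustrative assumptions.

```python
def next_velocity_index(k, g, c, m, *, success, K, C, M):
    """Deterministic velocity update implied by the transition rules.

    k: velocity index (0 = minimum speed, K-1 = maximum speed)
    g: driving mode (1 = acceleration, 0 = standby)
    c: successful-dissemination counter
    success: True iff the reception state is x = P, i.e., the
        message reached all followers before the deadline.
    Illustrative sketch only; the paper states these rules as
    transition probabilities over V.
    """
    if m != M:                             # dissemination time: fixed
        return k
    if success and g == 1 and c == C - 1:
        return min(k + 1, K - 1)           # accelerate up to max speed
    if (not success) and g == 0:
        return max(k - 1, 0)               # decelerate down to min speed
    return k                               # otherwise keep velocity
```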
With the start of the driving operation time, the platoon checks the success of the message dissemination and adjusts its driving mode,
g, and the successful message dissemination counter,
c. In the case of successful message dissemination, the platoon sets its driving mode to acceleration and increments the counter by one. In addition, if the message dissemination succeeds while
c is
$C-1$, the platoon resets
c to zero. Therefore, the transition probability of
$\mathbf{G}$ and
$\mathbf{C}$ can be derived as
and
Meanwhile, in the standby mode, the platoon changes its driving mode to acceleration and resets the counter after a successful message dissemination; thus, the transition probability of
$\mathbf{G}$ and
$\mathbf{C}$ can be given by
If the message dissemination fails, the driving mode remains standby, the counter is reset to zero, and the transition probability can be expressed as
Lastly, the driving mode and the counter maintain their values during the section time. Therefore, we have the transition probability as
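Similarly, the driving-mode and counter transitions described above reduce to a deterministic update. The following sketch uses an illustrative `success` flag for the condition $x=P$; it is a reading of the prose rules, not the paper's own pseudocode.

```python
def next_mode_counter(g, c, m, *, success, C, M):
    """Driving-mode/counter update implied by the transition rules.

    Returns (g', c'). g: 1 = acceleration, 0 = standby;
    c: successful-dissemination counter; success means x = P.
    """
    if m != M:                    # section time: keep both values
        return g, c
    if success:                   # dissemination succeeded
        if g == 1:                # acceleration: count up, wrap at C-1
            return 1, (0 if c == C - 1 else c + 1)
        return 1, 0               # standby -> acceleration, reset
    return 0, 0                   # failure: standby, counter reset
```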
During the driving operation time (
$m=M$), irrespective of the current status of
$\mathbf{U}$ and
$\mathbf{T}$, they are initialized for the dissemination of the next message. Therefore, the joint transition probability of
$\mathbf{U}$ and
$\mathbf{T}$ can be given by
In each slot in the message dissemination time (
$m\ne M$), the next packet reception state and the next talker are determined according to the ARQ-based relay protocol along with the action selected by the current talker. Accordingly, the joint transition probability of
$\mathbf{U}$ and
$\mathbf{T}$ is
$\mathrm{Pr}\left[{x}^{\prime},{t}^{\prime}\mid x,t,a=\left\{b,h\right\},k,m\right]$. Meanwhile,
${t}^{\prime}$ is dependent on
${x}^{\prime}$ as described in
Section 3.1; thus, by Bayes' rule, the joint transition probability can be represented as
where we assume
$\mathrm{Pr}\left[z\right]\stackrel{\mathsf{\Delta}}{=}\mathrm{Pr}\left[x,t,a=\left\{b,h\right\},k,m\ne M\right]$ for convenience. Therefore, when
$m\ne M$, the joint transition probability can be derived as
where
${\mathsf{\Phi}}_{t}^{t+h}\left({\mathbf{u}}_{{x}^{\prime}}\right)$ is the product of the elements from the
tth to the
$(t+h)$th in vector
${\mathbf{u}}_{{x}^{\prime}}$, and
${\mu}_{{v}_{k}}\left({u}_{\varsigma}^{\prime},{u}_{\varsigma},b\right)$ is the transition probability of
$\mathbf{U}$ according to the transmit power
b while the platoon velocity is
${v}_{k}$. In other words,
and
where
${u}_{\varsigma}^{\prime}\in {\mathbf{u}}_{{x}^{\prime}},{u}_{\varsigma}\in {\mathbf{u}}_{x}$.
Here,
${\mathrm{P}}_{E}^{{v}_{k}}\left(\varsigma ,b,t\right)$ is the probability that a control packet transmitted at power
${P}_{b}=b\times {P}_{\left[mW\right]}$ from talker
t will be received in error at the
$\varsigma $th follower vehicle when the platoon velocity is
${v}_{k}$. This probability depends on the modulation and coding scheme; if we consider quadrature phase-shift keying (QPSK) transmission over an additive white Gaussian noise (AWGN) channel, the packet error probability can be expressed as
where
$\mathrm{erf}\left(\cdot \right)$ is the Gauss error function,
l is the packet size in bits,
${\phi}_{t,\varsigma \left[dB\right]}$ is the path loss between vehicle
t and
$\varsigma $ described in (
2), and
${N}_{0}$ is the noise power spectral density.
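As an illustrative sketch (not the paper's exact expression, whose normalization of the bit-energy-to-noise ratio depends on the bit rate and bandwidth), the QPSK-over-AWGN packet error probability can be computed from the complementary error function, $\mathrm{erfc}\left(x\right)=1-\mathrm{erf}\left(x\right)$:

```python
import math

def qpsk_packet_error_prob(p_tx_mw: float, path_loss_db: float,
                           n0_mw: float, l_bits: int) -> float:
    """Hedged sketch of the packet error probability for QPSK/AWGN.

    Assumed form: BER = 0.5 * erfc(sqrt(Eb/N0)), and a packet of
    l bits fails if any bit is in error: PER = 1 - (1 - BER)^l.
    p_tx_mw stands for P_b = b * P_[mW]; path_loss_db for the path
    loss phi between talker and receiver; the mapping of n0_mw to
    the spectral density N_0 (bit-rate folding) is an assumption.
    """
    p_rx_mw = p_tx_mw * 10.0 ** (-path_loss_db / 10.0)  # apply path loss
    snr_bit = p_rx_mw / n0_mw                           # Eb/N0, linear
    ber = 0.5 * math.erfc(math.sqrt(snr_bit))           # QPSK bit errors
    return 1.0 - (1.0 - ber) ** l_bits                  # packet errors
```

Higher transmit power lowers the packet error probability, which is exactly the trade-off the MDP balances against energy cost.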
4.5. Optimality Equation
A power control and relay selection policy
$\pi $ describes a decision rule that determines the action taken by the talker. The expected total reward obtained over an infinite time horizon is expressed as
where
$n\in \left\{1,2,\cdots \right\}$ is the slot index,
${S}_{n}$ is the state sequence,
${a}_{n}$ is the action sequence,
${s}_{0}$ is the initial state, and
${E}_{\pi}$ denotes the expectation with the policy
$\pi $.
The goal here is to find a policy that maximizes the expected total reward. To this end, we first find the maximum expected total reward, which can be described as
where
$\mathsf{\Pi}$ is the set of all stationary deterministic policies. Note that the expected total reward is maximized when the talker takes the most beneficial action
${a}^{\ast}$ in each state
s. Therefore, the optimality equation, known as the Bellman optimality equation [
24] is given by
where
$\lambda $ is a discount factor in the MDP model.
A $\lambda $ closer to 1 gives greater weight to future rewards. Then, the optimal action
${a}^{\ast}$ is the action that satisfies the optimal equation. To solve the optimality equation and to obtain the optimal policy
${\pi}^{\ast}$, we use a value iteration algorithm, as shown in Algorithm 1, where
$\Vert V\Vert ={\mathrm{max}}_{s\in \mathbf{S}}V\left(s\right)$.
Algorithm 1: Value iteration algorithm. 

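A minimal, generic sketch of value iteration for the Bellman optimality equation above follows; Algorithm 1 itself is not reproduced here, and the function names, the in-place update order, and the stopping rule are illustrative assumptions.

```python
def value_iteration(states, actions, trans, reward, lam=0.95, eps=1e-6):
    """Generic value iteration for the optimality equation
    V(s) = max_a [ r(s,a) + lam * sum_s' Pr[s'|s,a] V(s') ].

    trans(s, a) -> list of (next_state, probability) pairs;
    reward(s, a) -> float; lam: discount factor.
    Stops when the largest value change falls below eps, then
    extracts a greedy (stationary deterministic) policy.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                reward(s, a) + lam * sum(p * V[s2] for s2, p in trans(s, a))
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    policy = {
        s: max(actions, key=lambda a: reward(s, a)
               + lam * sum(p * V[s2] for s2, p in trans(s, a)))
        for s in states
    }
    return V, policy
```

In EMDS, `states` would range over $\mathbf{S}$ and `actions` over $\mathbf{A}=\mathbf{P}\times \mathbf{H}$, with `trans` given by the transition probabilities derived above.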
In general, each iteration of the value iteration algorithm runs in polynomial time,
$O\left(|\mathbf{A}|{|\mathbf{S}|}^{2}\right)$ [
25]. Since this complexity cannot be neglected, each vehicle of the platoon uses a table to store the optimal policy for the transmit power and relay selection according to the platoon velocity. Each vehicle then makes its decision by referring to the table when it is designated as a talker. This table includes the decision for each state and can be computed by value iteration in advance, before driving begins. Thus, when the vehicles form their platoon, the leader creates the table and shares it with its follower vehicles. In this way, EMDS can be applied to the vehicles without high computational overhead.
25]. Since this complexity cannot be neglected, each vehicle of the platoon uses a table to store the optimal policy regarding the transmit power and relay selection according to the platoon velocity. Then, each of them performs the decision making referring the table when it is designated as a talker. This table includes the state and the decision for each state and can be computed in advance to the beginning of driving by the value iteration. Thus, when the vehicles forming their platoon, the leader creates the table and shares it with its follower vehicles. In this way, EMDS can be applied to the vehicle without high computational overhead.