Supervised deep learning algorithms iteratively train neural networks to minimize the error between the network output and the target solution, so an extensive amount of data is required for training. In this paper, we repeatedly trained our neural network toward the optimal solutions given in (4) and obtained the data for the repeated trainings from extensive channel realizations. In addition, we verified whether the algorithms were over-fitted by using extra channel realizations different from those used for training. If all the channel gains are available, the optimal combination given in (4) can be found by a brute-force searching algorithm. However, the brute-force searching algorithm causes a tremendous computational complexity, especially as $N$ increases. Thus, we formulated a sub-optimal scheme as an alternative to obtain the data samples required to train our deep learning algorithm.

#### 4.1. A Sub-Optimal Scheme to Obtain Data Samples for Training

The main concept of the sub-optimal scheme was proposed in our previous study [20], where it was shown that the sub-optimal scheme can achieve sum rates comparable to those of the brute-force searching scheme with an extremely low computational complexity. The sub-optimal scheme is described in Algorithm 1. In this paper, we used the sub-optimal scheme to obtain data samples instead of an optimal scheme merely because of the complexity of the optimal scheme; this choice does not change the proposed algorithm, nor does it limit the contributions of this paper. For given $N$ pairs, the brute-force scheme requires a maximum of ${2}^{N}$ iterations, while the sub-optimal scheme requires a maximum of only $N$ iterations. In the sub-optimal scheme, the $N$ pairs of mobile devices are sorted according to their channel gains in descending order, ignoring interference channels. The sorted pairs are re-indexed by $\widehat{i},\phantom{\rule{3.33333pt}{0ex}}\widehat{1}\le \widehat{i}\le \widehat{N}$. Thus, the sorted pairs satisfy:

$|{h}_{\widehat{1}\widehat{1}}{|}^{2}\ge |{h}_{\widehat{2}\widehat{2}}{|}^{2}\ge \cdots \ge |{h}_{\widehat{N}\widehat{N}}{|}^{2}.$

**Algorithm 1** A sub-optimal algorithm to obtain training samples.

Sort $|{h}_{ii}{|}^{2}$ in descending order  
Initialize: $\mathbb{T}=\varnothing $ and ${R}_{0}=0$  
**for** $k=1$ to $N$ **do**  
&emsp;**for** $\widehat{i}=\widehat{1}$ to $\widehat{k}$ **do**  
&emsp;&emsp;Calculate the SINR for the ${\widehat{i}}^{\mathrm{th}}$ pair, ${\gamma }_{\widehat{i}}$  
&emsp;**end for**  
&emsp;${R}_{k}={\sum }_{\widehat{i}=\widehat{1}}^{\widehat{k}}{log}_{2}(1+{\gamma }_{\widehat{i}})$  
&emsp;**if** ${R}_{k-1}\le {R}_{k}$ **then**  
&emsp;&emsp;$\mathbb{T}=\mathbb{T}\cup \left\{\widehat{k}\right\}$  
&emsp;**else**  
&emsp;&emsp;break  
&emsp;**end if**  
**end for**

In the ${k}^{\mathrm{th}}$ iteration $(1\le k\le N)$, the sub-optimal scheme calculates ${R}_{k}={\sum}_{\widehat{i}=\widehat{1}}^{\widehat{k}}{log}_{2}(1+{\gamma}_{\widehat{i}})$, the sum rate when the $k$ pairs $\widehat{1}$ through $\widehat{k}$ transmit data simultaneously, and compares it with ${R}_{k-1}$. If the calculated sum rate is greater than or equal to the sum rate of the previous iteration, i.e., ${R}_{k-1}\le {R}_{k}$, the pair $\widehat{k}$ is allowed to transmit data and is added to $\mathbb{T}$; thus, $\mathbb{T}$ is updated by $\mathbb{T}=\mathbb{T}\cup \left\{\widehat{k}\right\}$, and the algorithm moves on to the next iteration. Otherwise, the algorithm terminates. Once the algorithm stops, whether it terminates early before $N$ iterations or completes all $N$ iterations, the pairs included in the transmission set $\mathbb{T}$ are allowed to transmit data simultaneously.
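The greedy procedure above can be sketched in Python. This is an illustrative implementation under our own assumptions, not the paper's code: unit transmit power per pair, a `noise` parameter for the noise power, and `H[j, i]` holding the channel coefficient from transmitter $j$ to receiver $i$.

```python
import numpy as np

def suboptimal_schedule(H, noise=1e-3):
    """Greedy sketch of Algorithm 1 (illustrative, with assumed power model).

    H[j, i] is the channel coefficient from transmitter j to receiver i;
    unit transmit power is assumed for every pair.
    """
    G = np.abs(H) ** 2                      # channel gains |h_ji|^2
    order = np.argsort(-np.diag(G))         # sort pairs by direct gain, descending
    T, R_prev = [], 0.0
    for k in range(len(order)):
        cand = order[: k + 1]               # sorted pairs 1-hat through k-hat
        # SINR of each candidate pair under mutual interference
        R_k = 0.0
        for i in cand:
            interf = sum(G[j, i] for j in cand if j != i)
            gamma = G[i, i] / (interf + noise)
            R_k += float(np.log2(1.0 + gamma))
        if R_prev <= R_k:                   # sum rate did not decrease: keep pair k
            T, R_prev = list(cand), R_k
        else:                               # sum rate dropped: terminate early
            break
    return T, R_prev
```

With interference-free channels every pair is admitted, while strong cross-interference makes the sum rate drop after the first pair, so the loop breaks early, mirroring the early-termination rule above.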

#### 4.2. A Proposed Scheme Based on Convolutional Neural Networks

The architecture of our CNN for deep learning is shown in Figure 2 and consists of two hidden convolution layers. The first convolution layer consists of 256 convolution filters with an $N\times N$ input matrix. The input matrix consists of channel coefficients and is denoted by ${\left[{h}_{ji}\right]}_{1\le j\le N,1\le i\le N}$. Each convolution filter is initialized by the Xavier normal initializer [29]. The width and height of the output of a convolution filter can both be calculated by:

$O={\displaystyle \frac{N-K+2P}{S}}+1$ (6)

where $O$ is the width and height of the output of a convolution filter, $N$ is the input size, $K$ is the kernel (filter) size, $P$ is the number of paddings, and $S$ is the stride. In the first convolutional layer, it was assumed that the kernel size of each convolution filter was $5\times 5$ with a stride of one, and we did not pad zeros; thus, $K=5$, $S=1$, and $P=0$. Based on (6), the height and width of our first convolutional layer are given by:

${O}_{1}={\displaystyle \frac{N-5}{1}}+1=N-4.$ (7)
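The output-size relation in (6) can be checked with a small helper function (an illustrative sketch; the name `conv_output_size` is ours):

```python
def conv_output_size(n, k, p=0, s=1):
    """Output width/height of a convolution: O = (N - K + 2P)/S + 1."""
    return (n - k + 2 * p) // s + 1

# First convolutional layer of this CNN: K = 5, P = 0, S = 1, so O1 = N - 4.
for n in (16, 32, 64):
    assert conv_output_size(n, 5) == n - 4
```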

Each convolution filter was activated by a rectified linear unit (ReLU) function, which returns the element-wise $max(x,0)$ for a given input $x$. The output of each convolution filter was followed by a $2\times 2$ max pooling layer, which performs down-sampling along the spatial dimensions by applying a max filter to non-overlapping sub-regions: each element of the output matrix is the maximum value of the corresponding region of the input. If ${O}_{1}$ is odd, the $2\times 2$ max pool is applied only to the $({O}_{1}-1)\times ({O}_{1}-1)$ matrix, excluding the last column and row; otherwise, it is applied to the whole ${O}_{1}\times {O}_{1}$ matrix. Thus, the width and height of the output of the $2\times 2$ max pooling layer are given by $\left\lfloor {\displaystyle \frac{{O}_{1}}{2}}\right\rfloor$, which, after replacing ${O}_{1}$ by (7), can be calculated as:

$\left\lfloor {\displaystyle \frac{{O}_{1}}{2}}\right\rfloor =\left\lfloor {\displaystyle \frac{N-4}{2}}\right\rfloor =\left\lfloor {\displaystyle \frac{N}{2}}\right\rfloor -2.$ (8)
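The non-overlapping $2\times 2$ max pooling described above, including the truncation of an odd last row and column, can be sketched as follows (an illustrative helper, not the paper's code):

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling; trims an odd last row/column."""
    h, w = x.shape
    h, w = h - h % 2, w - w % 2          # drop last row/column if size is odd
    x = x[:h, :w]
    # group into 2x2 blocks and take the maximum of each block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.arange(25).reshape(5, 5)          # O1 = 5 (odd) -> output side is floor(5/2) = 2
print(max_pool_2x2(a).shape)             # -> (2, 2)
```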

The final output size of the first layer was $\left(\left\lfloor {\displaystyle \frac{N}{2}}\right\rfloor -2\right)\times \left(\left\lfloor {\displaystyle \frac{N}{2}}\right\rfloor -2\right)\times 256$. The second convolution layer consisted of 512 convolution filters. Each filter was also initialized by the Xavier normal initializer, with $K=2$. We also assumed that $S=1$ and $P=0$. The input size of the second convolutional layer was the output size of the first max pooling layer, which is given in (8). If $N$ is replaced by $\left\lfloor {\displaystyle \frac{{O}_{1}}{2}}\right\rfloor$ in (6), then the width and height of the output of each convolution filter in the second layer are given as:

${O}_{2}=\left(\left\lfloor {\displaystyle \frac{N}{2}}\right\rfloor -2\right)-2+1=\left\lfloor {\displaystyle \frac{N}{2}}\right\rfloor -3.$ (9)

As in the first convolution layer, each convolution filter was also activated by a ReLU function, and the output of each filter was down-sampled by a $2\times 2$ max pooling layer. The width and height of the output of the $2\times 2$ max pooling layer are given by $\left\lfloor {\displaystyle \frac{{O}_{2}}{2}}\right\rfloor$, which can be calculated as:

$\left\lfloor {\displaystyle \frac{{O}_{2}}{2}}\right\rfloor =\left\lfloor {\displaystyle \frac{1}{2}}\left\lfloor {\displaystyle \frac{N-6}{2}}\right\rfloor \right\rfloor =\left\lfloor {\displaystyle \frac{N-6}{4}}\right\rfloor$ (10)

where the second equality is valid because:

$\left\lfloor {\displaystyle \frac{1}{2}}\left\lfloor {\displaystyle \frac{n}{2}}\right\rfloor \right\rfloor =\left\lfloor {\displaystyle \frac{n}{4}}\right\rfloor$

for any positive integer $n$ [30]. The output size of the second max pooling layer was $\left(\left\lfloor {\displaystyle \frac{N-6}{4}}\right\rfloor \right)\times \left(\left\lfloor {\displaystyle \frac{N-6}{4}}\right\rfloor \right)\times 512$. The outputs of the max pooling layer were dropped out with a probability $p=0.2$ to prevent the neural network from over-fitting; thus, randomly selected neurons were ignored with a probability of 0.2 during training. The outputs were then flattened into a one-dimensional array of size $512{\left(\left\lfloor {\displaystyle \frac{N-6}{4}}\right\rfloor \right)}^{2}\times 1$ and reduced to $1000\times 1$ by a fully connected layer with a ReLU activation function. The $1000\times 1$ array went through another drop-out layer with $p=0.5$ and was reduced to an $N\times 1$ array by another fully connected layer. Finally, the output of this fully connected layer was activated by the sigmoid function. The sigmoid function, defined by $S\left(x\right)={\displaystyle \frac{1}{1+{e}^{-x}}}$ for a given input $x$, can be interpreted as a probability in many applications because $0\le S\left(x\right)\le 1$. The output activated by the sigmoid function is denoted by $\mathbb{P}$, and the $i$th element of $\mathbb{P}$, $\mathbb{P}\left[i\right]$, can be interpreted as the probability that the $i$th D2D pair is allowed to transmit data. Our scheduler determined whether each D2D pair $i$ would be allowed to transmit data based on the corresponding $\mathbb{P}\left[i\right]$. Thus, $\mathbb{B}\left[i\right]$, which indicates whether to allow the $i$th D2D pair to transmit data, can be determined as:
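The final hard decision can be sketched as follows; the 0.5 threshold and the helper name are our illustrative assumptions, not values taken from the text:

```python
import numpy as np

def schedule_from_probabilities(p, threshold=0.5):
    """Map sigmoid outputs P[i] to binary decisions B[i].

    The 0.5 threshold is an illustrative assumption, not taken from the paper.
    """
    return (np.asarray(p) >= threshold).astype(int)

P = np.array([0.91, 0.12, 0.55, 0.49])
print(schedule_from_probabilities(P))   # -> [1 0 1 0]
```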

Our proposed neural network was repeatedly trained to enhance the performance of scheduling by reducing the error between $\mathbb{B}$ and the result obtained by the sub-optimal scheme.
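Putting the size bookkeeping of this section together, the layer dimensions derived above can be checked numerically (an illustrative sketch using the stated kernel, stride, and padding values):

```python
def conv(n, k, p=0, s=1):
    """Convolution output size: O = (N - K + 2P)/S + 1."""
    return (n - k + 2 * p) // s + 1

def pool(n):
    """Non-overlapping 2x2 max pooling output size: floor(N/2)."""
    return n // 2

def cnn_output_side(n):
    """Spatial side length after conv(K=5) -> pool -> conv(K=2) -> pool."""
    o1 = conv(n, 5)        # first conv layer: N - 4
    m1 = pool(o1)          # floor(N/2) - 2
    o2 = conv(m1, 2)       # floor(N/2) - 3
    return pool(o2)        # floor((N - 6)/4)

# The closed form floor((N - 6)/4) from the text matches the step-by-step trace,
# and the flattened feature vector has 512 * side**2 elements.
for n in range(10, 40):
    assert cnn_output_side(n) == (n - 6) // 4
```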