This section details the GraMNet approach used in this study. We begin with a brief overview and then describe the inner workings of each SubNet and how SubNets are combined. The GraMNet architecture employed in our experiments is presented next. Finally, we describe how the network was trained.
  3.5.1. Overview
The GraMNet strategy was modelled on a child's toy: building blocks. We suggest that CNN architectures can be assembled from modular components in the same way that a child might use building blocks to erect a fort. Each module was simple and inexpensive to train, which makes a potentially large network feasible without the possibly prohibitive computational expense of training the final network all at once. To produce complementary information, the proposed GraMNet used a novel hybrid learning technique that constructively merged the various SubNets.
Our strategy consisted of adding new SubNet modules to the pre-existing architecture, either in series or in parallel. Each SubNet module was attached on top of the existing GraMNet feature computation layers, and the classification layers were relocated to the end of the network.
In addition, the existing learnable parameters of GraMNet were frozen, and only the parameters of the new SubNet were updated by back-propagation. This greatly reduces the computational burden of the back-propagation updates, especially for large networks. It is also possible to grow the network at several points simultaneously. The existing GraMNet feature maps needed to be concatenated with the SubNet feature maps, which required an additional layer; these feature maps were joined along the channel dimension.
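The channel-wise joining described above can be illustrated with a minimal sketch. It assumes feature maps stored as nested lists indexed `[row][column][channel]`; the function name `concat_channels` and the toy values are hypothetical and are not taken from the original implementation.

```python
def concat_channels(maps_a, maps_b):
    """Join two feature maps of shape [H][W][C] along the channel axis."""
    if len(maps_a) != len(maps_b) or len(maps_a[0]) != len(maps_b[0]):
        raise ValueError("spatial sizes must match before concatenation")
    # At every pixel, append the channels of maps_b to the channels of maps_a.
    return [
        [pix_a + pix_b for pix_a, pix_b in zip(row_a, row_b)]
        for row_a, row_b in zip(maps_a, maps_b)
    ]


# Hypothetical 2x2 maps: 3 channels from the frozen network, 2 from the new SubNet.
frozen_maps = [[[1.0, 2.0, 3.0] for _ in range(2)] for _ in range(2)]
subnet_maps = [[[4.0, 5.0] for _ in range(2)] for _ in range(2)]

merged = concat_channels(frozen_maps, subnet_maps)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 2 2 5
```

The spatial size is unchanged and only the channel count grows, which is why the two sets of feature maps must share the same height and width before they are joined.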
  3.5.2. SubNet Architecture
Figure 6 depicts the SubNet layer groups considered in this article. Figure 6A shows the layer group used to build the feature-generating SubNets, and Figure 6B shows the classification layers. Each convolutional layer group in Figure 6A consists of a convolution layer and a batch normalization layer, followed by ReLU and max pooling layers; the number and size of the convolution filters can vary across layer groups. The classification group in Figure 6B has a single fully connected layer and uses a cross-entropy loss function.
The output of a SubNet constructed from $L-1$ layer groups, such as those in Figure 6A, followed by an $L$-th layer group, such as that in Figure 6B, is formally defined below. To begin, we define a single input data sample as
$$\mathbf{X}_n \in \mathbb{R}^{H \times W \times D},$$
where $\mathbf{X}_n$ is the $n$-th sample from the minibatch. These inputs are possibly multi-channel images with $H$ rows, $W$ columns, and $D$ channels. We consider a classification problem with $M$ possible classes. The truth for each sample is denoted by the one-hot vector $\mathbf{y}_n \in \{0, 1\}^{M}$, for $n = 1, 2, \ldots, N$.
Let the input to Layer 1 of the network for the $n$-th sample be $\mathbf{X}_n^{(0)} = \mathbf{X}_n$. In lexicographic notation, we can write this as $\mathbf{x}_n^{(0)} = \mathrm{vec}(\mathbf{X}_n^{(0)})$, an $HWD \times 1$ vector. Note that this is a column vector produced by reshaping the 3D data cube in $\mathbf{X}_n^{(0)}$. The output of each convolutional layer group in Figure 6A can be written as
$$\mathbf{x}_n^{(l)} = f\left(\mathbf{W}^{(l)} \mathbf{x}_n^{(l-1)} + \mathbf{b}^{(l)}\right) \tag{5}$$
for sample $n$ within the minibatch, for $l = 1, 2, \ldots, L-1$. The weight matrix $\mathbf{W}^{(l)}$ represents the weights of all kernels in layer group $l$. $\mathbf{W}^{(l)}$ has the dimensions $HWD_{l} \times HWD_{l-1}$, where $D_l$ is the total number of filters in layer group $l$ (with $D_0 = D$). By using max pooling layers, the spatial dimensions can be decreased in consecutive layer groups. The vector $\mathbf{b}^{(l)}$ stores the bias terms. The vector $\mathbf{x}_n^{(l)}$ represents the lexicographic output of the 3D feature map cube of the present layer $l$.
The function $f(\cdot)$ stacks the ReLU and max pooling layers shown in Figure 6A and is expressed as
$$f(\cdot) = \mathrm{MaxPool}\left(\mathrm{ReLU}(\cdot)\right).$$
The ReLU operation takes the element-wise maximum of each value and 0. The ReLU activation function is used here to avoid the slowdown in learning and the loss of performance caused by vanishing gradients in other activation functions. The $\mathrm{MaxPool}(\cdot)$ operator uses a 2 × 2 spatial sub-sampling kernel, which downsamples the feature maps by a factor of 2 in each spatial dimension for each channel.
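As an illustration of the stacked ReLU and max pooling operation, the following minimal sketch applies both steps to one channel of a feature map. It assumes a stride of 2, even spatial dimensions, and a single channel stored as a nested list; the function names and the toy values are hypothetical.

```python
def relu(fmap):
    """Element-wise maximum of each value and 0."""
    return [[max(v, 0.0) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on one channel [H][W]; H and W assumed even."""
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, len(fmap[0]), 2)
        ]
        for r in range(0, len(fmap), 2)
    ]


def f(fmap):
    """Stacked operation: ReLU first, then max pooling."""
    return max_pool_2x2(relu(fmap))


# Hypothetical 4x4 single-channel feature map.
channel = [
    [-1.0, 2.0, 0.5, -3.0],
    [4.0, -2.0, -1.0, -0.5],
    [-5.0, -6.0, 1.5, 2.5],
    [-7.0, -8.0, 3.5, -4.5],
]
pooled = f(channel)
print(pooled)  # [[4.0, 0.5], [0.0, 3.5]]
```

The 4 × 4 input becomes a 2 × 2 output, which matches the factor-of-2 reduction in each spatial dimension.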
The convolution layer groups are followed by the classification layer group shown in Figure 6B. In the fully connected layer, every input is linked to every output by the weight matrix, and the output size equals the number of classes, $M$. This layer uses a dense weight matrix rather than convolution kernels, and neither ReLU nor max pooling is applied. The fully connected layer can be written as
$$\mathbf{x}_n^{(L)} = \mathbf{W}^{(L)} \mathbf{x}_n^{(L-1)} + \mathbf{b}^{(L)}, \tag{7}$$
where $\mathbf{x}_n^{(L)}$ is the $M \times 1$ output. The vector $\mathbf{x}_n^{(L-1)}$ is the final feature map from convolution layer group $L-1$, and the vector $\mathbf{b}^{(L)}$ holds the biases of the fully connected layer. The fully connected layer is followed by the softmax operation, which normalizes the output:
$$\hat{\mathbf{y}}_n = \mathrm{softmax}\left(\mathbf{x}_n^{(L)}\right), \tag{8}$$
where
$$\hat{y}_n(m) = \frac{\exp\left(x_n^{(L)}(m)\right)}{\sum_{j=1}^{M} \exp\left(x_n^{(L)}(j)\right)}.$$
Note that the outputs of the softmax operation, $\hat{y}_n(m)$, are in the range [0, 1], and
$$\sum_{m=1}^{M} \hat{y}_n(m) = 1.$$
The mathematical details above can be condensed into
$$\hat{\mathbf{y}}_n = g\left(\mathbf{X}_n; \boldsymbol{\theta}\right),$$
where $\boldsymbol{\theta} = \{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\}_{l=1}^{L}$ contains all of the learnable parameters, and the predicted labels are given by
$$\hat{\mathbf{y}}_n = \left[\hat{y}_n(1), \hat{y}_n(2), \ldots, \hat{y}_n(M)\right]^{T}.$$
Thus, $g(\cdot)$ represents the entire SubNet predictor module, and $\boldsymbol{\theta}$ stands for the network's trainable parameters. The empirical risk of each minibatch is used to update the learnable parameters. The empirical cross-entropy error function is
$$E\left(\mathcal{X}; \boldsymbol{\theta}\right) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{m=1}^{M} y_n(m) \log \hat{y}_n(m). \tag{13}$$
The inputs to $E(\cdot)$ are the minibatch data $\mathcal{X} = \{\mathbf{X}_n\}_{n=1}^{N}$ and the learnable parameters $\boldsymbol{\theta}$. Here, $N$ is the minibatch size and $M$ is the number of classes. The truth labels are represented by the variable $y_n(m)$, whereas the predicted labels are denoted by $\hat{y}_n(m)$. After the loss is calculated for a given minibatch, the SubNet's trainable parameters are updated by back-propagation using the adaptive moment estimation (Adam) optimizer.
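The classification stage and the loss can be traced numerically with a minimal sketch. It assumes one-hot truth vectors and a loss averaged over the minibatch; the weights, feature vectors, and function names are hypothetical toy values chosen for illustration, not the trained network.

```python
import math


def fully_connected(weights, bias, features):
    """Dense layer: one score per class, computed as W x + b."""
    return [
        sum(w * v for w, v in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(scores):
    """Normalize scores to values in [0, 1] that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(truth, predicted, eps=1e-12):
    """Mean cross-entropy over a minibatch of one-hot truth vectors."""
    total = 0.0
    for y_n, p_n in zip(truth, predicted):
        for y, p in zip(y_n, p_n):
            total -= y * math.log(max(p, eps))  # eps guards against log(0)
    return total / len(truth)


# Hypothetical toy values: M = 2 classes, 3-element feature vectors, N = 2 samples.
weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
bias = [0.0, 0.1]
minibatch = [[2.0, 1.0, 2.0], [3.0, 0.0, -1.0]]
truth = [[0, 1], [1, 0]]

predicted = [softmax(fully_connected(weights, bias, x)) for x in minibatch]
loss = cross_entropy(truth, predicted)
print(predicted, loss)
```

Because both toy samples are assigned the highest probability for their true class, the resulting loss is small; a wrong or uncertain prediction would raise it.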
  3.5.3. Series and Parallel Combinations
Consider the series combination of two SubNets, $A + B$. Let the convolution layer groups in SubNet A be $l = 1, 2, \ldots, L_A$, following Equation (5), and the convolution layer groups in SubNet B be $l = L_A + 1, \ldots, L_A + L_B$. Then $L = L_A + L_B + 1$ represents the total number of layers in the merged network, with the extra layer being the single fully connected layer in Equation (7). SubNet A's parameters are as follows:
$$\boldsymbol{\theta}_A = \{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\}_{l=1}^{L_A}.$$
These are fixed after SubNet A is trained. The parameters of SubNet B's convolution layers and the fully connected layer are as follows:
$$\boldsymbol{\theta}_B = \{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\}_{l=L_A+1}^{L}.$$
When $A + B$ is trained, only the parameters in $\boldsymbol{\theta}_B$ are updated. As before, the output of the series layers is sent to the softmax layer using Equation (8).
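The effect of fixing SubNet A while training SubNet B can be sketched as a parameter update that skips frozen groups. For brevity, this sketch uses a plain gradient step rather than the Adam optimizer used in the study; the function name `gradient_step`, the parameter groups, and all numbers are hypothetical.

```python
def gradient_step(params, grads, trainable, lr=0.5):
    """Apply one plain gradient step only to the groups named in `trainable`."""
    return {
        name: (
            [w - lr * g for w, g in zip(values, grads[name])]
            if name in trainable
            else list(values)  # frozen group: copied unchanged
        )
        for name, values in params.items()
    }


# Hypothetical parameter groups for SubNet A (already trained) and SubNet B (new).
params = {"A": [0.5, -0.5], "B": [1.0, 2.0]}
grads = {"A": [1.0, 1.0], "B": [1.0, -1.0]}

updated = gradient_step(params, grads, trainable={"B"})
print(updated)  # {'A': [0.5, -0.5], 'B': [0.5, 2.5]}
```

Gradients for the frozen group are simply ignored here; in a real framework they would not need to be computed at all, which is where the saving in back-propagation cost comes from.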
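Next, consider two SubNets, A and B, combined in parallel. Let the convolution layer groups in SubNet A be $l = 1, 2, \ldots, L_A$, following Equation (5), and the convolution layer groups in SubNet B be $l = 1, 2, \ldots, L_B$. Let the parameters of the layers in each SubNet be
$$\boldsymbol{\theta}_A = \{\mathbf{W}_A^{(l)}, \mathbf{b}_A^{(l)}\}_{l=1}^{L_A}$$
and
$$\boldsymbol{\theta}_B = \{\mathbf{W}_B^{(l)}, \mathbf{b}_B^{(l)}\}_{l=1}^{L_B}.$$
The output of the convolution layers in SubNet A is
$$\mathbf{x}_{A,n}^{(l)} = f\left(\mathbf{W}_A^{(l)} \mathbf{x}_{A,n}^{(l-1)} + \mathbf{b}_A^{(l)}\right),$$
where $l = 1, 2, \ldots, L_A$. The output of SubNet B is given by
$$\mathbf{x}_{B,n}^{(l)} = f\left(\mathbf{W}_B^{(l)} \mathbf{x}_{B,n}^{(l-1)} + \mathbf{b}_B^{(l)}\right),$$
where $l = 1, 2, \ldots, L_B$. Note that $\mathbf{x}_{A,n}^{(0)} = \mathbf{x}_{B,n}^{(0)}$, since the inputs to the two parallel SubNets are identical. Let the fully connected output layer be layer $L$. Concatenating the final feature maps of the two SubNets and applying this fully connected layer, we obtain
$$\mathbf{x}_n^{(L)} = \mathbf{W}^{(L)} \begin{bmatrix} \mathbf{x}_{A,n}^{(L_A)} \\ \mathbf{x}_{B,n}^{(L_B)} \end{bmatrix} + \mathbf{b}^{(L)}.$$
As before, the output of this layer is sent to the softmax layer via Equation (8). The parameters in $\boldsymbol{\theta}_A$ are fixed, and the parameters in $\boldsymbol{\theta}_B$, along with $\mathbf{W}^{(L)}$ and $\mathbf{b}^{(L)}$, are updated.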
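A parallel combination can be sketched as two branches that receive the same input, followed by one shared fully connected layer applied to the concatenated features. The two branch functions below are hypothetical stand-ins for trained SubNets, and the weights are toy values chosen for illustration.

```python
def subnet_a(x):
    """Hypothetical stand-in for frozen SubNet A: keeps positive parts."""
    return [max(v, 0.0) for v in x]


def subnet_b(x):
    """Hypothetical stand-in for new SubNet B: keeps negated negative parts."""
    return [max(-v, 0.0) for v in x]


def parallel_forward(x, fc_weights, fc_bias):
    """Feed the same input to both branches, concatenate, then apply the dense layer."""
    features = subnet_a(x) + subnet_b(x)  # list concatenation of the two outputs
    return [
        sum(w * v for w, v in zip(row, features)) + b
        for row, b in zip(fc_weights, fc_bias)
    ]


# Hypothetical input and fully connected parameters for M = 2 classes.
x = [1.0, -2.0]
fc_weights = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, -1.0]]
fc_bias = [0.0, 0.5]

scores = parallel_forward(x, fc_weights, fc_bias)
print(scores)  # [3.0, -0.5]
```

The fully connected layer has as many input columns as the two branches produce together, so adding a parallel SubNet always requires a new, wider classification layer.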
  3.5.4. Proposed GraMNet Architecture
By combining SubNets in different ways, GraMNet can generate an unlimited number of final architectures. For medical imaging, we present one specific instance that we believe strikes a good compromise between performance and computational complexity. Figure 7 depicts the proposed GraMNet architecture. SubNets A, B, C, D, and E, shown in Figure 7, are progressively added to form the whole network. Because keeping the computational cost low is a primary concern, we employ small, compact SubNet modules.
To begin, we trained SubNet A using all of the available minibatches and found the optimal parameters by minimizing the loss in Equation (13). Once SubNet A's training was complete, we froze its trainable parameters and referred to it as GraMNet A. The output of this network was the set of feature maps used as input by the newly created SubNet B: all of the feature maps from layer L−1 of SubNet A were fed into SubNet B's first convolutional layer. Because SubNet B was connected to the rest of the network in series, an arrangement represented by the symbol A + B, we called this configuration GraMNet A and B.
Next, we added a new SubNet C in parallel while freezing the existing GraMNet A and B. For notational simplicity, we referred to this set of SubNets as GraMNet A–C. Feature maps were computed with GraMNet A and B using the equations above and were then combined with the feature maps produced by the new SubNet C. With a reshape layer, we combined the feature maps from GraMNet A and B with the feature maps output by SubNet C along the depth dimension. SubNet D was then added in series to form the GraMNet A–D configuration. Finally, SubNet E was added in parallel with GraMNet A–D. As a shorthand, we referred to this network as GraMNet A–E, where the letters list the SubNets in the order in which they were added. Series and parallel SubNets were chosen for this arrangement because, in our experience, they work well when combined and can even improve one another.
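The order in which the SubNets are added can be recorded as a small nested structure, sketched below. The `grow` and `describe` helpers are hypothetical, and the `|` symbol for a parallel connection is an assumed notation used only in this sketch; only the `+` symbol for a series connection comes from the text.

```python
def grow(network, subnet, mode):
    """Attach `subnet` to the existing `network` in series or in parallel."""
    if mode not in ("series", "parallel"):
        raise ValueError("mode must be 'series' or 'parallel'")
    return (mode, network, subnet)


def describe(network):
    """Render the nested structure as a string ('|' is an assumed parallel symbol)."""
    if isinstance(network, str):
        return network
    mode, left, right = network
    symbol = " + " if mode == "series" else " | "
    return "(" + describe(left) + symbol + describe(right) + ")"


# Growth order described for the proposed architecture: B in series, C in parallel,
# D in series, E in parallel, each added on top of the previously frozen network.
net = "A"
for subnet, mode in [("B", "series"), ("C", "parallel"), ("D", "series"), ("E", "parallel")]:
    net = grow(net, subnet, mode)

print(describe(net))  # ((((A + B) | C) + D) | E)
```

At each step, everything inside the innermost parentheses is already trained and frozen, and only the newly attached SubNet and the classification layers are learned.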