Applications of Neural Networks in Biomedical Data Analysis

Neural networks for deep-learning applications, also called artificial neural networks, are important tools in science and industry. While their widespread use was limited because of inadequate hardware in the past, their popularity increased dramatically starting in the early 2000s when it became possible to train increasingly large and complex networks. Today, deep learning is widely used in biomedicine from image analysis to diagnostics. This also includes special topics, such as forensics. In this review, we discuss the latest networks and how they work, with a focus on the analysis of biomedical data, particularly biomarkers in bioimage data. We provide a summary on numerous technical aspects, such as activation functions and frameworks. We also present a data analysis of publications about neural networks to provide a quantitative insight into the use of network types and the number of journals per year to determine the usage in different scientific fields.


Introduction
Biomedical research creates a myriad of health-related data, which come in different formats, such as numerical (e.g., blood biomarker concentration and gene expression data), time series (e.g., electrocardiogram data) or image data (e.g., histological images and mammograms) [1][2][3]. Bioimage informatics is a subfield of bioinformatics that deals with large-scale, high-throughput computational methods for the analysis of bioimages, particularly cellular and molecular images [4,5].
However, this applies not only to biomedical research but also to other fields, such as forensics [6,7]. The goal is to extract useful knowledge from complicated and heterogeneous images and their associated metadata. Methods and algorithms are often applied to exploit image features corresponding to statistical, geometric and morphological properties. The frequency of image pixels and regions as well as the topological relationships between multiple image objects are also used in this process [4,5][8][9][10][11].
High-content screening and analysis (HCA) encompasses a range of methods for the objectified automated analysis of large image data sets using image processing, computer vision and machine learning. Often, HCA is realised using complex processing pipelines consisting of automated microscopes, automated liquid-handling and cell culture. Typical applications are found in cell biology and drug discovery to identify compounds with pharmacological activity or to identify biomarkers. Researchers use such methods to find specific cell patterns in cells with suitable biomarkers (e.g., tubulin) under a fluorescence microscope [12][13][14][15].
Image processing (including normalisation, segmentation, tracking, spatial transformation, registration and feature calculation) and machine learning are used to automatically process high-dimensional image data into numerical data [16]. From these, biomedical [...]

The most basic unit of a neural network is a so-called (artificial) neuron (AN), which mimics the function of biological neurons (Figure 1). A neuron is a function with n input parameters and one output value. The output value is calculated as a weighted sum of all input parameters and is passed through a so-called activation function, which modifies the output range of the neuron. Additionally, a bias can be added to shift the activation function on the x-axis. The simplest ANNs consist of a layered structure of interconnected ANs to mimic the function of a brain. As each neuron in an ANN receives the activation of all ANs of the previous layer as input parameters, this type of architecture is also called a Dense or Fully Connected Network (Figure 2). Another name is Feed-Forward Network, as this architecture only connects consecutive layers. ANNs with more than one layer between the input and output layer (so-called hidden layers) are called deep; thus, training such networks is called deep learning. Training is usually performed via supervised learning (manual annotation of data before training), with the weights for each input being adjusted via backpropagation.

Figure 2. Basic structure of an artificial neural network with three inputs. A neural network consists of artificial neurons, which calculate weighted sums with N input parameters. The output range of an individual neuron is limited by an activation function before the value is passed on. Neurons are arranged in a layered structure, where each neuron receives the activation of all neurons of the previous layer as input parameters. ANNs with more than one layer between the input and output layer (hidden layers) are called deep neural networks. X = Initial input of the network. O = Final output of the network.
Before starting to use deep learning for data analysis, the nature of the data to be analysed has to be determined in order to choose an appropriate network type and architecture. In addition, the activation function of each layer has to be chosen carefully to balance training speed against prediction quality.
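As an illustration of the neuron and dense-layer description above, the following minimal Python sketch computes the output of a small fully connected network. The layer sizes, the weights and the choice of the logistic activation are arbitrary assumptions made for this example and are not taken from the reviewed literature.

```python
import math

def sigmoid(z):
    """Logistic activation, used here only as an example choice."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Weighted sum of all inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def dense_layer(inputs, weight_rows, biases, activation):
    """Each neuron of the layer receives all activations of the previous layer."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron
x = [0.5, -1.0, 2.0]
hidden = dense_layer(x, [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.2]], [0.0, 0.1], sigmoid)
output = dense_layer(hidden, [[0.7, -0.5]], [0.05], sigmoid)
print(hidden, output)
```

Because every layer is only a list of weighted sums, adding a further hidden layer amounts to one more `dense_layer` call.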

Activation Functions
As stated previously, activation functions are used for limiting the output range of a neuron. Over the last decades, dozens of different activation functions have been proposed for usage in neural networks and deep learning, each with specific advantages and flaws [45] (Table 2).

Table 2. Commonly used activation functions. The graphs for each function (red) and its derivative (blue) are shown.

Name | Function [Range]
Linear | f(x) = a·x [(−∞, ∞)]
Heaviside | f(x) = 0 for x < 0, 1 for x ≥ 0 [{0, 1}]
Rectified Linear Unit | f(x) = max(0, x) [[0, ∞)]
Logistic/sigmoid | f(x) = 1 / (1 + e^(−x)) [(0, 1)]

Early Activation Functions
One of the earliest activation functions was the Heaviside function (named after Oliver Heaviside [46]), also called the unit step function:

$$H(x) = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0 \end{cases}$$

This function binarizes the activation by assigning all values below zero to zero and every other value to one. Whilst extremely efficient, the Heaviside function has two severe drawbacks. First, the strict binarization of the activation drastically limits the plasticity of the ANN by limiting the neurons to a switch-like function similar to that of transistors in electrical circuits. Second, and more critically, learning cannot be performed via backpropagation because the derivative of the function equals 0 for all x (except for x = 0, where the derivative is undefined) [46]. Another, comparably simple approach that solves this problem is the use of a linear function (also called the identity function if a equals 1) as an activation function [47]:

$$f(x) = a x$$

As the derivative of this function equals a, learning via backpropagation would theoretically be possible. However, due to the constant derivative of the function, gradient descent would be input independent, and thus the model cannot converge, i.e., reach the point at which further training no longer improves the model inference.

Sigmoid Activation Functions
The usage of non-linear functions solves the problem of a constant derivative and thus allows for learning via backpropagation. One of the oldest proposed non-linear functions is the sigmoid (s-shaped) logistic function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

It limits the neuron output to values between 0 and 1 and can thus be useful as an output layer activation for categorisation tasks. Another sigmoid function is the hyperbolic tangent (tanh):

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

This function is comparable to the logistic function but limits the output range to values between −1 and 1. Sigmoid functions suffer from the vanishing gradient problem: As the network becomes deeper, the calculated gradient of the loss function becomes smaller and smaller, reducing the possible weight update and thus impairing the learning of layers closer to the input [48].

Rectified Linear Activation Functions
A solution for the vanishing gradient problem was the introduction of the Rectified Linear Unit function (ReLU):

$$f(x) = \max(0, x)$$

This assigns 0 to all values below or equal to zero and otherwise is equal to the identity function [49]. This avoids the problem of constant, input-independent learning but introduces a new problem known as "dying ReLU". Here, a neuron outputs 0 regardless of its input; as its gradient is then also 0, its weights can no longer be changed by gradient descent [50]. Different modified forms of ReLU were proposed to address this problem, which all modify the values assigned below 0 to create a non-zero derivative and thus enable gradient descent. Examples are Leaky ReLU, which multiplies all values below zero with a small value α [51]:

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \leq 0 \end{cases}$$

and ELU, which replaces all values below zero with a shifted exponential function [52]:

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha (e^{x} - 1), & x \leq 0 \end{cases}$$

The minimal y value of the ELU function can be controlled by the variable α. Despite the "dying ReLU" problem, ReLU remains one of the most popular activation functions due to its simple implementation, its fast calculation and its good inference performance [45]. The function is frequently used for convolutional layers [45] but was also proposed as an output function for the last layer [53]. Compared to sigmoid functions, ReLU is far less costly to calculate and appears to be on par in inference quality [54,55].
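The activation functions discussed in this section can be written in a few lines each. The following Python sketch is an illustration only; the default values for `alpha` are common choices assumed here, not values prescribed by the cited works. Printing the derivatives shows why sigmoid functions are prone to vanishing gradients, whereas ReLU keeps a derivative of 1 for positive inputs.

```python
import math

def heaviside(x):
    return 1.0 if x >= 0 else 0.0

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_grad(x):
    s = logistic(x)
    return s * (1.0 - s)  # at most 0.25, close to 0 for large |x|

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # 0 for x <= 0: the source of "dying ReLU"

def leaky_relu(x, alpha=0.01):  # alpha: assumed small slope for x <= 0
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):  # alpha controls the minimum value (-alpha)
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, logistic_grad(x), tanh_grad(x), relu_grad(x))
```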

Training
However, training efficiency is influenced not only by the selected activation function; the choice of the training algorithm/optimiser is equally important. In general, a backpropagation algorithm attempts to determine the global minimum of the loss function of a neural network. In each training cycle, the network weights are updated according to their contribution to the network output, so that the loss follows the gradient towards a local or global minimum. Several optimisation algorithms have been developed over time. One of the most prominent algorithms used together with backpropagation is gradient descent.

Gradient Descent
Optimisation algorithms are used to minimise the loss function of a neural network, and the most popular optimisers in deep learning are based on gradient descent. Gradient descent is an iterative optimisation algorithm that minimises the objective function J(Θ) over the training data by updating the parameters Θ [56]. The main idea is to update randomly initialised parameters until the objective function J reaches a minimum. Gradient descent has proven to be highly effective in supervised learning [57]. It is based on the following process (Figure 3):

• Initialising Θ with a random or all-zero vector.
• Modifying Θ in order to decrease J(Θ).
To reach the optimal value of Θ, it is repeatedly updated until J(Θ) reaches its minimum value. This can be described by the following formula:

$$\Theta^{(k+1)} = \Theta^{(k)} - \alpha \nabla J(\Theta^{(k)})$$

Here, α is the step size or learning rate, which indicates the size of the steps taken to reach a (local) minimum. In gradient descent, we usually normalise the direction of the steepest descent:

$$d^{(k)} = -\frac{g^{(k)}}{\lVert g^{(k)} \rVert}, \qquad g^{(k)} = \nabla J(\Theta^{(k)})$$

Here, d is the descent direction and indicates the direction of the steepest descent. Following the direction of the steepest descent is guaranteed to yield an improvement if the objective function is smooth, the step size is small enough and the gradient is non-zero. The direction of the steepest descent is opposite to the direction of the gradient ∇J. If the step size is chosen to give the maximal decrease in J along d, the subsequent direction will always be orthogonal to the current direction [58]. The optimal step size α at each step is:

$$\alpha^{(k)} = \arg\min_{\alpha} J(\Theta^{(k)} + \alpha d^{(k)})$$

Setting the derivative with respect to α to zero gives [58]:

$$\nabla J(\Theta^{(k)} + \alpha d^{(k)})^{\top} d^{(k)} = 0, \qquad d^{(k+1)} = -\frac{\nabla J(\Theta^{(k)} + \alpha d^{(k)})}{\lVert \nabla J(\Theta^{(k)} + \alpha d^{(k)}) \rVert} \;\Rightarrow\; d^{(k+1)\top} d^{(k)} = 0$$

The above equation indicates that $d^{(k+1)}$ and $d^{(k)}$ are orthogonal, which was the condition for the maximum decrease in J.

Three important variations were developed from gradient descent: Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD) and Mini-Batch Gradient Descent, which differ in the amount of data that are used to calculate the gradient of the objective function. Depending on the amount of data, a trade-off between the accuracy and the parameter update time is necessary.
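The update process described above can be sketched in a few lines of Python. The quadratic objective, its hand-written gradient `grad_J`, the starting point and the step size are assumptions chosen for illustration; a fixed step size is used here instead of the line search discussed in the text.

```python
def gradient_descent(grad, theta, alpha=0.1, steps=100):
    """Repeatedly move theta against the gradient with a fixed step size alpha."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - alpha * gi for t, gi in zip(theta, g)]
    return theta

def grad_J(theta):
    """Gradient of the toy objective J = (t0 - 3)^2 + 2 * (t1 + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]

theta = gradient_descent(grad_J, [0.0, 0.0])
print(theta)  # approaches the minimum at (3, -1)
```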

Batch Gradient Descent
Batch Gradient Descent (BGD) performs calculations over the whole training set at each update. As a result, it is very slow on large datasets and is additionally limited by memory capacity. This also introduces redundancy in terms of computation.
Batch Gradient Descent is more appropriate for convex or relatively smooth error manifolds. It is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces [56,57].

Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is widely used in various machine learning methods, e.g., in neural networks and logistic regression. SGD calculates the error and updates the parameters of the model for each training example $x^{(i)}$ and label $y^{(i)}$:

$$\Theta^{(k+1)} = \Theta^{(k)} - \alpha \nabla J(\Theta^{(k)}; x^{(i)}, y^{(i)})$$

In contrast to the redundant computations of Batch Gradient Descent, SGD reduces the amount of computation by performing one update at a time, making SGD usually much faster than Batch Gradient Descent. Although SGD converges faster, the error function is not minimised as well as with Batch Gradient Descent. Furthermore, the descent path is noisier, as only one example per update is used. However, this can allow the model to escape from shallow local minima [59].

Mini-Batch Gradient Descent
Mini-Batch Gradient Descent can be seen as the middle ground between the robustness of Stochastic Gradient Descent and the efficiency of Batch Gradient Descent, and it is the most common gradient descent algorithm in the field of deep learning. It makes an update for every mini-batch of n training examples [56]:

$$\Theta^{(k+1)} = \Theta^{(k)} - \alpha \nabla J(\Theta^{(k)}; x^{(i:i+n)}, y^{(i:i+n)})$$

The mini-batch size n refers to the number of training examples utilised in one iteration; it is larger than 1 and smaller than the total dataset size N. Finding the appropriate batch size can be challenging: If a small batch size is chosen, the learning process converges quickly; however, the descent is noisy, and reaching a minimum is therefore more difficult. Large batch sizes, in contrast, result in a learning process that converges slowly while the error gradient is estimated more accurately. Some experiments have indicated that a small batch size improves the training stability [60] and speeds up the learning process [61]. The upsides and downsides of the gradient descent variants are listed in Table 3.

Table 3. Comparison of gradient descent algorithm variations and their advantages and disadvantages.

Algorithm Advantages Disadvantages
Batch Gradient Descent
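The three variants differ only in how many training examples contribute to each parameter update. The following Python sketch makes this explicit with a single `batch_size` argument on a toy linear regression; the data, the learning rate and the number of epochs are assumptions chosen so that the example runs quickly.

```python
import random

def train(xs, ys, batch_size, alpha=0.1, epochs=1000, seed=0):
    """Fit y = w * x + b by minimising the mean squared error.

    batch_size = len(xs)      -> Batch Gradient Descent
    batch_size = 1            -> Stochastic Gradient Descent
    1 < batch_size < len(xs)  -> Mini-Batch Gradient Descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit the examples in a new order every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of the mean squared error over the current batch
            gw = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= alpha * gw
            b -= alpha * gb
    return w, b

# Noise-free toy data generated from y = 2x + 1
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
for size in (len(xs), 1, 5):
    print(size, train(xs, ys, batch_size=size))
```

With noise-free data all three variants reach the same solution; on real data, the smaller batch sizes would show the noisier descent path described above.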

Optimisation Algorithms
In the following paragraphs, various algorithms are introduced to address the challenges of the three mentioned variants of gradient descent.

Momentum
SGD is a popular optimisation method; however, training can take a long time. Momentum can be used to reduce the training time, particularly when curvatures are high and gradients are either small and noisy or steady [62]. Momentum is a commonly used optimisation algorithm, and many of the latest models are trained with it. It accelerates SGD in the relevant direction and reduces oscillations [56]. The use of Momentum can be imagined as pushing a ball down a hill. The ball accumulates momentum via gravity and becomes faster. Similarly, the gradient acts as a force that builds up momentum over successive descent steps. As a result, convergence becomes faster, and oscillation is reduced. Momentum is described by the following equations [58]:

$$v^{(k+1)} = \beta v^{(k)} - \alpha \nabla J(\Theta^{(k)})$$
$$\Theta^{(k+1)} = \Theta^{(k)} + v^{(k+1)}$$

For β = 0, the gradient descent formula is recovered.
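A minimal Python sketch of the Momentum update is given here for illustration. The velocity `v` accumulates past gradients with decay `beta`; the toy objective, its gradient `grad_J` and all hyperparameter values are assumptions.

```python
def momentum_descent(grad, theta, alpha=0.05, beta=0.9, steps=500):
    """Gradient descent with a velocity term that accumulates past gradients."""
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        v = [beta * vi - alpha * gi for vi, gi in zip(v, g)]
        theta = [t + vi for t, vi in zip(theta, v)]
    return theta

def grad_J(theta):
    """Gradient of the toy objective J = t0^2 + 10 * t1^2 (a narrow valley)."""
    return [2.0 * theta[0], 20.0 * theta[1]]

print(momentum_descent(grad_J, [1.0, 1.0]))  # approaches the minimum at (0, 0)
```

Setting `beta=0.0` removes the accumulated velocity, so each step reduces to the plain gradient descent update.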

Nesterov Accelerated Gradient
A problem of Momentum is that it does not slow down sufficiently at the bottom of a valley but rather tends to follow the slope, making it prone to overshooting [58]. Therefore, a more sophisticated optimisation algorithm is needed. Nesterov Accelerated Gradient (NAG) is a slightly different version of Momentum. NAG evaluates the gradient at the point to which the current momentum is pointing (Figure 4) and is thus able to reduce the step size before the valley slopes up again [56]. The modified Momentum formulas are as follows:

$$v^{(k+1)} = \beta v^{(k)} - \alpha \nabla J(\Theta^{(k)} + \beta v^{(k)})$$
$$\Theta^{(k+1)} = \Theta^{(k)} + v^{(k+1)}$$

A glance at Figure 4 reveals the difference between Nesterov's momentum update and the regular momentum update: in NAG, the momentum step is applied before the gradient is calculated. This kind of update prevents the optimiser from moving too fast and increases its responsiveness. As a result, it can enhance the performance of RNNs on various tasks, such as the reconstruction of ultrasound images [63].
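The look-ahead step of NAG can be sketched as follows. The only difference from a regular Momentum update is that the gradient is evaluated at the projected position `lookahead` rather than at the current parameters. The toy objective and hyperparameters are assumptions chosen for illustration.

```python
def nesterov_descent(grad, theta, alpha=0.05, beta=0.9, steps=500):
    """Momentum update with the gradient evaluated at the look-ahead point."""
    v = [0.0] * len(theta)
    for _ in range(steps):
        # Point to which the current momentum is pointing
        lookahead = [t + beta * vi for t, vi in zip(theta, v)]
        g = grad(lookahead)
        v = [beta * vi - alpha * gi for vi, gi in zip(v, g)]
        theta = [t + vi for t, vi in zip(theta, v)]
    return theta

def grad_J(theta):
    """Gradient of the toy objective J = t0^2 + 10 * t1^2."""
    return [2.0 * theta[0], 20.0 * theta[1]]

print(nesterov_descent(grad_J, [1.0, 1.0]))  # approaches the minimum at (0, 0)
```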

Adagrad
Unlike Momentum and Nesterov momentum, which update all parameters with the same learning rate, the adaptive subgradient method (Adagrad) uses a different learning rate for each parameter. It performs larger updates for infrequent parameters and smaller updates for frequent parameters, which makes it suitable for sparse data. Dean et al. [64] used Adagrad to train large-scale neural nets at Google due to its improved robustness compared to SGD, and it was also utilised to recognise cats in videos. In addition, Adagrad was used to train GloVe, a network to find relations between words [65], due to its ability to perform much larger updates for infrequent parameters. Adagrad can be described with the following formulas:

$$\Theta_i^{(k+1)} = \Theta_i^{(k)} - \frac{\alpha}{\epsilon + \sqrt{s_i^{(k)}}}\, g_i^{(k)}$$
$$s_i^{(k)} = \sum_{j=1}^{k} \left( g_i^{(j)} \right)^2$$

Here, s is a vector, and $s_i^{(k)}$ is the sum of the squares of the partial derivatives with respect to $\Theta_i$ up to time step k. ε is a small value (about $1 \times 10^{-8}$) to prevent division by zero. Adagrad is less sensitive to the chosen learning rate (usual value: 0.01); however, a major weakness of the method is the accumulation of the squared gradients in the denominator. Every added term is positive, and thus the accumulated sum continues to increase during training. As a result, the learning rate often becomes infinitesimally small before convergence. It was shown that Adagrad can have a lower generalisation error than the Adam optimiser [66].
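The per-parameter scaling of Adagrad can be illustrated with the following Python sketch. The toy objective is an assumption, and the learning rate passed in the example call is deliberately larger than the usual value of 0.01 so that the toy problem converges within a few hundred steps.

```python
import math

def adagrad(grad, theta, alpha=0.01, eps=1e-8, steps=500):
    """Per-parameter learning rates scaled by the accumulated squared gradients."""
    s = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        s = [si + gi * gi for si, gi in zip(s, g)]  # the sum only ever grows
        theta = [t - alpha / (eps + math.sqrt(si)) * gi
                 for t, si, gi in zip(theta, s, g)]
    return theta

def grad_J(theta):
    """Gradient of the toy objective J = t0^2 + 10 * t1^2."""
    return [2.0 * theta[0], 20.0 * theta[1]]

# alpha=0.5 is an assumption for this toy problem, not a recommended default
print(adagrad(grad_J, [1.0, 1.0], alpha=0.5))
```

Although the two coordinates have gradients that differ by a factor of ten, both take a first step of the same size, because each gradient is divided by its own accumulated magnitude.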

Adadelta
Adaptive Delta (Adadelta) is a more robust extension of Adagrad. The aim of Adadelta is to deal with the monotonically decreasing learning rate of Adagrad by restricting the window of accumulated past gradients to some fixed size w [67] rather than accumulating all past gradients. The resulting update rule is:

$$\Theta^{(k+1)} = \Theta^{(k)} - \frac{\mathrm{RMS}(\Delta\Theta)^{(k-1)}}{\mathrm{RMS}(g)^{(k)}}\, g^{(k)}$$

Since the value of RMS(∆Θ) for the current time step is not known, it is estimated with the Root Mean Square (RMS) of parameter updates until the previous time step [60]. As can be seen, it is unnecessary to specify a default learning rate, since it has been eliminated from the update rule.

Root Mean Square Propagation
Root Mean Square Propagation (RMSProp), similar to Momentum, is a technique to speed up gradient descent [68]. It maintains an exponentially decaying average of squared gradients and divides the learning rate by the root of this average [56]. The average is updated according to:

$$s^{(k+1)} = \gamma s^{(k)} + (1 - \gamma) \left( g^{(k)} \odot g^{(k)} \right)$$

where the decay γ ∈ [0, 1] was suggested by Tieleman and Hinton [68] to be set to 0.9, while a good default value for the learning rate α is 0.001.
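A compact Python sketch of RMSProp follows. The small constant `eps` in the denominator is an assumption added to avoid division by zero, and the learning rate in the example call is larger than the default of 0.001 so that the toy problem is solved in a reasonable number of steps.

```python
import math

def rmsprop(grad, theta, alpha=0.001, gamma=0.9, eps=1e-8, steps=1000):
    """Divide the learning rate by the root of a decaying average of squared gradients."""
    s = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        s = [gamma * si + (1.0 - gamma) * gi * gi for si, gi in zip(s, g)]
        theta = [t - alpha / (eps + math.sqrt(si)) * gi
                 for t, si, gi in zip(theta, s, g)]
    return theta

def grad_J(theta):
    """Gradient of the toy objective J = t0^2 + 10 * t1^2."""
    return [2.0 * theta[0], 20.0 * theta[1]]

# alpha=0.01 is an assumption for this toy problem
print(rmsprop(grad_J, [1.0, 1.0], alpha=0.01))
```

Because the step is normalised by the recent gradient magnitude, the parameters end up fluctuating in a small region around the minimum whose size is set by the learning rate, rather than converging exactly.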

Adam
The adaptive moment estimation method (Adam) adapts learning rates to each parameter. Adam stores both an exponentially decaying average of squared gradients $s^{(k+1)}$, as in RMSProp and Adadelta, and an exponentially decaying average of gradients $v^{(k+1)}$, as in Momentum. $v^{(k+1)}$ and $s^{(k+1)}$ are estimates of the first moment and the second moment of the gradients, respectively [69]. Because v and s are initialised to zero, they are biased towards zero during the initial time steps, particularly when the decay rates are small (i.e., γ is close to 1); Adam therefore uses bias-corrected estimates. The mathematical notation for Adam can be expressed as follows:

$$v^{(k+1)} = \gamma_v v^{(k)} + (1 - \gamma_v)\, g^{(k)}$$
$$s^{(k+1)} = \gamma_s s^{(k)} + (1 - \gamma_s) \left( g^{(k)} \odot g^{(k)} \right)$$
$$\hat{v}^{(k+1)} = \frac{v^{(k+1)}}{1 - \gamma_v^{k}}, \qquad \hat{s}^{(k+1)} = \frac{s^{(k+1)}}{1 - \gamma_s^{k}}$$
$$\Theta^{(k+1)} = \Theta^{(k)} - \frac{\alpha\, \hat{v}^{(k+1)}}{\epsilon + \sqrt{\hat{s}^{(k+1)}}}$$

According to Kingma and Ba [70], good starting values are α = 0.001, $\gamma_v$ = 0.9, $\gamma_s$ = 0.999 and ε = $1 \times 10^{-8}$. Adam is computationally efficient, has low memory requirements and is invariant to diagonal rescaling of the gradients. Thus, it is suitable for huge data sets, noisy data, sparse gradients and non-stationary problems, and it requires little tuning [62].
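The interplay of the two moment estimates and their bias correction can be sketched in Python as follows. The step counter `k` starts at 1, as in the bias-correction terms above; the toy objective and the learning rate used in the example call are assumptions.

```python
import math

def adam(grad, theta, alpha=0.001, gamma_v=0.9, gamma_s=0.999,
         eps=1e-8, steps=1000):
    """Adam: decaying averages of gradients (v) and squared gradients (s)."""
    v = [0.0] * len(theta)
    s = [0.0] * len(theta)
    for k in range(1, steps + 1):
        g = grad(theta)
        v = [gamma_v * vi + (1.0 - gamma_v) * gi for vi, gi in zip(v, g)]
        s = [gamma_s * si + (1.0 - gamma_s) * gi * gi for si, gi in zip(s, g)]
        # Bias correction counteracts the zero initialisation of v and s
        v_hat = [vi / (1.0 - gamma_v ** k) for vi in v]
        s_hat = [si / (1.0 - gamma_s ** k) for si in s]
        theta = [t - alpha * vh / (eps + math.sqrt(sh))
                 for t, vh, sh in zip(theta, v_hat, s_hat)]
    return theta

def grad_J(theta):
    """Gradient of the toy objective J = t0^2 + 10 * t1^2."""
    return [2.0 * theta[0], 20.0 * theta[1]]

# alpha=0.05 is an assumption for this toy problem
print(adam(grad_J, [1.0, 1.0], alpha=0.05))
```

Thanks to the bias correction, the very first step has a magnitude of approximately α in every coordinate, independent of the scale of the gradient.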

AdaMax
AdaMax is a variant of Adam that is based on the infinity norm (u). The Adam update rule for weights scales the gradients inversely proportionally to an $l_2$ norm of the past and current gradients. The $l_2$ norm based update rule can be generalised to an $l_p$ norm based update rule. For large values of p, these norms become numerically unstable, whereas $l_\infty$ yields a stable algorithm [70]. The AdaMax algorithm can be described by the following formula:

$$u^{(k+1)} = \max\left( \gamma_s u^{(k)}, \lvert g^{(k)} \rvert \right)$$

By replacing $\epsilon + \sqrt{\hat{s}^{(k+1)}}$ in the Adam equation with $u^{(k+1)}$, the AdaMax update rule is obtained:

$$\Theta^{(k+1)} = \Theta^{(k)} - \frac{\alpha\, \hat{v}^{(k+1)}}{u^{(k+1)}}$$
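The following Python sketch illustrates the AdaMax update. Skipping the update of a parameter whose `u` is still zero is an assumption added here to avoid a division by zero; the toy objective and the learning rate in the example call are likewise assumptions.

```python
def adamax(grad, theta, alpha=0.002, gamma_v=0.9, gamma_s=0.999, steps=1000):
    """AdaMax: Adam with the squared-gradient average replaced by an infinity norm."""
    v = [0.0] * len(theta)
    u = [0.0] * len(theta)
    for k in range(1, steps + 1):
        g = grad(theta)
        v = [gamma_v * vi + (1.0 - gamma_v) * gi for vi, gi in zip(v, g)]
        u = [max(gamma_s * ui, abs(gi)) for ui, gi in zip(u, g)]
        v_hat = [vi / (1.0 - gamma_v ** k) for vi in v]
        # Guard (assumption): leave a parameter unchanged while its u is zero
        theta = [t - alpha * vh / ui if ui > 0.0 else t
                 for t, vh, ui in zip(theta, v_hat, u)]
    return theta

def grad_J(theta):
    """Gradient of the toy objective J = t0^2 + 10 * t1^2."""
    return [2.0 * theta[0], 20.0 * theta[1]]

# alpha=0.05 is an assumption for this toy problem
print(adamax(grad_J, [1.0, 1.0], alpha=0.05))
```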

Choosing the Right Optimiser
The choice of the right optimiser is highly dependent on the dataset to be analysed. If the input data are sparse, good results can be achieved using one of the adaptive learning-rate methods, such as Adam or Adadelta. Additionally, using adaptive learning-rate methods largely removes the need to fine-tune the learning rate to obtain optimal results. However, it should be considered that they are computationally more costly, since they maintain running statistics of past gradients and their squares for every parameter.
Furthermore, adaptive learning-rate optimisers converge to different minima than fixed learning-rate optimisers. Both RMSProp and Adadelta are extensions of Adagrad that overcome Adagrad's monotonically decreasing learning rate. The two methods differ in that Adadelta additionally uses the RMS of parameter updates in the numerator of the update rule.
The Adam algorithm extends RMSProp by adding bias correction and momentum. The Adam technique can be utilised in the case of high-dimensional parameters and huge data sets. RMSProp, Adadelta and Adam are algorithms with similar behaviour and can perform well under comparable conditions. However, Kingma and Ba [70] indicate that the bias correction helps Adam to outperform RMSProp towards the end of optimisation as gradients become sparser. Therefore, Adam might be the best of the presented choices.
Interestingly, recent work found that SGD can produce better results when combined with a good learning rate annealing schedule. Although SGD is usually able to find a minimum, it might take considerably longer than other optimisers. It also depends much more on a robust initialisation and choice of learning rate. Moreover, its fluctuation helps avoid local minima; however, it may become stuck in saddle points. In summary, when a good learning rate schedule is required, SGD with momentum can be a viable choice. In the case of searching for fast convergence and training a complex neural network, one of the adaptive learning rate methods should be chosen.

Back Propagation
Backpropagation, presented in 1986 by Rumelhart and McClelland [71], is a short form of "backward propagation of errors". It is a common method of training neural networks and an iterative gradient descent training procedure. It utilises the loss function and the gradient descent method to modify the parameters (called weights) of a network. The backpropagation process can be described as follows [72]:

Forward propagation: The input is entered into the network and propagated from the input layer, via the hidden layer, to the output layer. Input values are multiplied with the weights of the connecting nodes to obtain the values of the hidden layer nodes. The weights and offset values of the network are kept constant during forward propagation.
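Back propagation: If there is a difference between the expected output and the achieved output, the error of the network is propagated from the output layer back to the input layer. The network continues updating the weights until the error becomes minimal. Weights are updated first for the output layer and then for the hidden layer, which can be described via the following formulas [73,74]. The error function is defined as follows:

$$e_j(n) = d_j(n) - y_j(n)$$
$$E(n) = \frac{1}{2} \sum_{j} e_j^2(n)$$
$$E_{AV} = \frac{1}{N} \sum_{n=1}^{N} E(n)$$

Here, $e_j(n)$ is the error of the jth neuron, E(n) is the instantaneous error energy, $E_{AV}$ is the averaged squared error energy, and N is the total number of training data. Furthermore, $d_j(n)$ and $y_j(n)$ are the expected output and the obtained output of the jth neuron, respectively. The output of a layer is obtained as follows:

$$v_j(n) = \sum_{i=1}^{m} w_{ji}(n)\, x_i(n)$$
$$y_j(n) = \varphi(v_j(n))$$

Here, $w_{ji}$ indicates the weight of the connection from the ith neuron to the jth neuron, m is the number of neurons, $x_i$ is the input signal of the ith neuron, and φ is the activation function. The loss function is minimised by following its gradient with respect to the weights, which is obtained via the chain rule:

$$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)} = -e_j(n)\, \varphi'(v_j(n))\, x_i(n)$$

The weight correction ∆w is obtained as follows:

$$\Delta w_{ji}(n) = -\alpha \frac{\partial E(n)}{\partial w_{ji}(n)}$$

The local gradient $\delta_j$ of the jth neuron is computed using the chain rule:

$$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = e_j(n)\, \varphi'(v_j(n))$$

Here, α is the learning rate. Ultimately, the weight correction for the output layer becomes:

$$\Delta w_{ji}(n) = \alpha\, \delta_j(n)\, x_i(n)$$

To update the weights of the hidden layer, errors are propagated from the hidden layer down to the input layer. The local gradient is calculated differently here. For a hidden neuron j, it is computed as follows:

$$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)}\, \varphi'(v_j(n)), \qquad \frac{\partial E(n)}{\partial y_j(n)} = \sum_{k} e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)}$$

In order to compute $\partial e_k(n) / \partial v_k(n)$, remember:

$$e_k(n) = d_k(n) - y_k(n) = d_k(n) - \varphi(v_k(n))$$

so that $\partial e_k(n) / \partial v_k(n) = -\varphi'(v_k(n))$. With $\partial v_k(n) / \partial y_j(n) = w_{kj}(n)$, the local gradient of the hidden neuron becomes:

$$\delta_j(n) = \varphi'(v_j(n)) \sum_{k} \delta_k(n)\, w_{kj}(n)$$

where k runs over the neurons of the subsequent layer. Finally, the weight is updated as shown in the following formulas:

$$\Delta w_{ji}(n) = \alpha\, \delta_j(n)\, x_i(n)$$
$$w_{ji}(n + 1) = w_{ji}(n) + \Delta w_{ji}(n)$$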
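The forward and backward passes described above can be sketched for a network with one hidden layer. The following Python example is an illustration under several assumptions: a logistic activation for φ, no separate bias terms (the first input is fixed to 1.0 to act as a bias input), arbitrary initial weights and a single training sample.

```python
import math

def phi(v):
    """Logistic activation (an assumed choice for this sketch)."""
    return 1.0 / (1.0 + math.exp(-v))

def phi_prime(v):
    s = phi(v)
    return s * (1.0 - s)

def train_step(x, d, W_hidden, W_out, alpha):
    """One forward pass and one backpropagation update for a single sample."""
    # Forward propagation: weights are kept constant
    v_h = [sum(w * xi for w, xi in zip(row, x)) for row in W_hidden]
    y_h = [phi(v) for v in v_h]
    v_o = [sum(w * yj for w, yj in zip(row, y_h)) for row in W_out]
    y_o = [phi(v) for v in v_o]
    # Error of each output neuron and instantaneous error energy
    e = [dk - yk for dk, yk in zip(d, y_o)]
    energy = 0.5 * sum(ek * ek for ek in e)
    # Local gradients: output layer first, then hidden layer
    delta_o = [ek * phi_prime(vk) for ek, vk in zip(e, v_o)]
    delta_h = [phi_prime(v_h[j]) * sum(delta_o[k] * W_out[k][j]
                                       for k in range(len(W_out)))
               for j in range(len(W_hidden))]
    # Weight update: w(n + 1) = w(n) + alpha * delta * input
    new_out = [[W_out[k][j] + alpha * delta_o[k] * y_h[j]
                for j in range(len(y_h))] for k in range(len(W_out))]
    new_hidden = [[W_hidden[j][i] + alpha * delta_h[j] * x[i]
                   for i in range(len(x))] for j in range(len(W_hidden))]
    return new_hidden, new_out, energy

# Hypothetical weights and a single training sample; x[0] = 1.0 acts as bias input
W_hidden = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]
W_out = [[0.3, -0.2]]
x, d = [1.0, 0.5, -1.0], [1.0]
energies = []
for _ in range(500):
    W_hidden, W_out, energy = train_step(x, d, W_hidden, W_out, alpha=0.5)
    energies.append(energy)
print(energies[0], energies[-1])  # the error energy decreases during training
```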

Potential Training Problems
The automatic approach of neural network training has several potential pitfalls, which can negatively impact training and overall classification performance or even completely invalidate the classification. First, problems can arise due to the initialisation of the network weights. An intuitive approach would be to simply initialise all weights and biases with a fixed value, e.g., zero. This, however, impairs learning, as neurons initialised in this way tend to develop similar weights [75]. A solution to this problem is random value initialisation, i.e., the assignment of random initial values to the weights [75,76]. Further training problems can arise when an inappropriate learning rate is selected: if it is too small, the training progress per training cycle is small, whereas if it is too large, minima of the loss function might not be reachable.
Another potential point of failure is the data set used for training and the training time. Data sets should have a sufficient size and be diverse enough to fully represent the target data; otherwise, the network might not be able to generalise well enough to perform well under real testing conditions. Another error regarding the dataset is to break the strict separation of the training and the validation data set, which usually results in a gross overestimation of network performance.
Regarding the training time, two phenomena can be seen: underfitting and overfitting. Underfitting occurs when the model's training time is too short to properly adapt to the dataset, whilst overfitting occurs when a network is trained too extensively and thus loses its capability to generalise. Both under- and overfitting result in poor network performance. However, whilst underfitting can be easily spotted by the prediction accuracy during training, overfitting can easily be overlooked if no adequate validation dataset is used.
A more sinister problem is that of so-called Clever Hans Predictors (CHPs). These kinds of neural networks seemingly perform well under laboratory conditions, making them difficult to spot before deployment. The problem of CHPs arises if a network focuses on features that are logically irrelevant for inference [77]. These could be watermarks or other text on images, or background details, instead of the objects of interest. This problem becomes particularly serious when only certain classes contain irrelevant characteristics. If both the training and validation dataset are flawed, the only way to spot CHPs is via a review of the network's activations for specific samples, which both takes time and requires a certain degree of expertise in both the used dataset and programming.
Recent examples of flawed applications of neural networks in the health sector were described in the studies of Wynants et al. [78] and Roberts et al. [79], which analysed prediction models that had recently been described to support COVID-19 diagnosis (Wynants et al.: 232 models; Roberts et al.: 62 models). Neither study recommended any of the studied models for clinical use because all of them had at least one of the problems listed above. This highlights the importance of good datasets and network design.

Network Types
Different types of network architectures have been developed over time to address different types of data and problems. The first developed types were fully connected neural networks, followed by convolutional neural networks. Currently, more complicated networks, such as U-Nets or Generative Adversarial Neural Networks are also abundant.

Convolutional and Generative Adversarial Neural Networks
Convolutional Neural Networks (CNNs) are specialised ANNs that are designed to solve pattern recognition tasks via machine learning. Rather than receiving scalar input, as with dense networks, CNNs receive matrix input, such as images. The basis for modern CNNs was laid by the neocognitron by Fukushima in 1980 [43] and the time delay neural networks by Waibel in 1987 [80]. One of the first widely recognised networks was LeNet, a CNN for the recognition of postal zip codes, designed by LeCun et al. in 1989 [81]. CNNs are composed of three main components: convolutional, downsampling/pooling and dense layers.
In contrast to dense layers, convolutional layers perform convolution, which means that each neuron calculates a weighted sum over a predefined local area of the input rather than over all inputs. The size and weighting of this area are defined by a convolution kernel, which is shared between all neurons of a layer. This allows convolutional layers to perform image processing tasks, such as edge and corner detection. Per convolutional layer, multiple convolution kernels are trained to perform different processing tasks.
To reduce the input dimensionality as well as to abstract it, each convolutional layer is followed by a downsampling layer. Whilst different methods for pooling are available, the most commonly used is maximum pooling, where the maximum of the specified area is used as the output. In addition to reducing the output dimension of a convolutional layer, and thus the complexity of the network, pooling can also help to prevent overfitting by reducing the availability of raw input information. To be compatible with the dense part of the network, the output of the last downsampling layer is vectorised before being passed on. The subsequent processing is then performed as described for ANNs (Figure 5).
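The convolution and maximum pooling operations described above can be sketched in plain Python. As in most deep-learning frameworks, the "convolution" below is implemented as a sliding weighted sum without flipping the kernel. The image and the hand-set edge-detection kernel are assumptions for illustration; in a CNN, the kernel weights would be learned.

```python
def conv2d(image, kernel):
    """Slide a shared kernel over the image and compute local weighted sums."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(feature_map, size=2):
    """Keep only the maximum of each non-overlapping size x size area."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

# 5 x 5 toy image with a vertical edge between the second and third column
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-bright transitions

features = conv2d(image, kernel)  # 4 x 4 feature map
pooled = max_pool(features)       # 2 x 2 after 2 x 2 maximum pooling
print(features)
print(pooled)
```

The feature map is non-zero only at the position of the edge, and pooling preserves this response while halving each spatial dimension.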
Whilst CNNs are useful for whole image classification, their ability for image segmentation is limited: Due to their dense layers, a convolutional network can output the certainty that an object is contained in an image but not where the object is located. Another limitation is the detection of multiple different objects in the same image. One approach to overcome this hurdle was the introduction of regional CNNs (R-CNNs). R-CNNs are designed as described for regular CNNs but are fed with overlapping segments of the image, which are classified individually, which allows the creation of a heatmap of object locations.
A further advance for image segmentation was the introduction of fully convolutional networks (FCNs), first described by Long et al. in 2014 [82]. Contrary to CNNs, FCNs are composed entirely of convolutional and pooling layers. The dense part of a CNN is replaced by one upsampling layer to match the output and input dimensions. This design gives FCNs several advantages over CNNs. Due to the missing dense layers and the shared convolution kernels of convolutional layers, the FCN architecture allows for dynamic, arbitrary input sizes. Furthermore, as the output of an FCN is a matrix rather than a vector, pixelwise image segmentation can be achieved.
Another direction to deal with mentioned problems is superpixel segmentation. A Superpixel can be defined as a group of pixels that perceptually shares common characteristics while considering spatial constraints. Superpixels carry more information compared to pixels and also provide a compact representation of an image, which is useful for reducing computational complexity [83]. They are becoming increasingly popular in many computer vision and image processing algorithms, such as image segmentation, semantic labelling, object detection and tracking.
Gheshlaghi et al. [84] used the superpixel segmentation technique to overcome dimensionality problems for multiple sclerosis lesion detection. Fang et al. [85] proposed a superpixel segmentation algorithm to segment two-dimensional bone images and three-dimensional brain images. In that work, blocks with the same features were merged, and a final distance combining the intensity, location and gradient features was used to segment the superpixel/voxel medical image.
The FCN architecture was further refined by the introduction of U-Nets by Ronneberger et al. in 2015 [86]. A U-Net can be divided into two different sections: downsampling and upsampling. As with a normal FCN, the downsampling part of the network is composed of alternating convolutional and pooling layers. The upsampling part of a U-Net is composed of transposed convolutional layers, whereby the number of transposed convolutional layers matches the number of pooling layers.
Furthermore, the upsampling rate is set to match the downsampling rate so that the input and output shapes are equal. To further improve the segmentation quality, each upsampling layer is connected to a downsampling block: the first downsampling block is connected to the last upsampling layer, the second downsampling block is connected to the second-last upsampling layer and so on. This allows deeper layers to use low-level data, which helps to improve the predictions.
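The pairing of downsampling blocks with upsampling layers can be sketched on a one-dimensional toy signal. This is only a structural skeleton under stated assumptions: the convolutions are omitted, nearest-neighbour upsampling stands in for the learned transposed convolutions, and the skip connection is modelled as appending the stored channels. All function names are hypothetical.

```python
def max_pool_1d(signal):
    """Halve the resolution by keeping the maximum of each pair of values."""
    return [max(signal[i], signal[i + 1]) for i in range(0, len(signal) - 1, 2)]


def upsample_1d(signal):
    """Nearest-neighbour upsampling (stand-in for a transposed convolution)."""
    return [value for value in signal for _ in range(2)]


def unet_skeleton(signal, depth):
    """Return the output channels of a U-shaped down/up path with skips."""
    skips = []
    channels = [signal]
    for _ in range(depth):                       # downsampling path
        skips.append(channels)                   # remember block output
        channels = [max_pool_1d(ch) for ch in channels]
    for _ in range(depth):                       # upsampling path
        channels = [upsample_1d(ch) for ch in channels]
        channels = channels + skips.pop()        # skip connection, deepest first
    return channels


signal = [1, 5, 2, 8, 3, 3, 9, 4]                # length divisible by 2**depth
output = unet_skeleton(signal, depth=2)
```

Because the skips are taken from a stack, the first downsampling block is joined with the last upsampling step, and the output resolution equals the input resolution.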
In comparison to the older dense neural networks, neural networks with convolutional layers have several advantages: due to weight sharing, convolutional layers can process many more input parameters. The performed convolution also allows each neuron to consider local neighbourhood parameter relationships rather than weighing each parameter individually, which allows the network to recognise features, such as edges, corners and patterns. Additionally, the added pooling forces the network to become less reliant on the input data as part of it is cut at each pooling layer, which might help the network to generalise better.
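The effect of weight sharing on the number of trainable parameters can be made concrete with a short calculation. The image size (256 × 256, one channel), the 64 neurons/filters and the 3 × 3 kernel are assumptions chosen for illustration.

```python
def dense_parameters(n_inputs, n_neurons):
    """One weight per input and neuron, plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons


def conv_parameters(kernel_h, kernel_w, in_channels, n_filters):
    """One shared kernel per filter, plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


# Assumed example: 256x256 greyscale image, 64 neurons vs. 64 filters of 3x3
dense = dense_parameters(256 * 256, 64)
conv = conv_parameters(3, 3, 1, 64)

print(f"dense layer: {dense:,} parameters")
print(f"conv layer:  {conv:,} parameters")
```

The convolutional layer's parameter count is independent of the image size, whereas the dense layer's count grows linearly with the number of pixels.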
A major drawback of convolutional neural networks is their dependence on huge datasets for training, as they are neither rotation-, translation- nor colour-invariant. Training therefore requires multiple images of the object of interest in different positions and rotations as well as under different lighting conditions. This can partially be alleviated by introducing random rotations/translations into the dataset (data augmentation), but this might introduce artefacts that are difficult to detect and could negatively influence the classification results.
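A minimal sketch of data augmentation is given below, restricted to 90° rotations and horizontal flips. This restriction is an assumption made to keep the example exact: such transforms only rearrange pixels, whereas rotations by arbitrary angles require interpolation, which is one way the artefacts mentioned above can arise.

```python
def rotate_90(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in image]


def augment(image):
    """Return the image in four rotations, each with and without a flip."""
    variants = []
    current = image
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate_90(current)
    return variants


image = [[1, 2], [3, 4]]          # toy image
augmented = augment(image)        # 8 variants from a single training image
```

In practice, the transforms are usually drawn at random during training rather than enumerated in advance.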
Generative Adversarial Networks (GANs) are a special use case of FCNs/CNNs, which can be used to create images and other output data comparable to their training data from noise [87,88]. This is achieved by training an FCN and a CNN simultaneously, where one network is designed to receive noise and output fabricated images (FCN, called the generator), whilst the second network is trained to differentiate between the fabricated and real input images (CNN, called the discriminator).
The generator is generally trained using the inference difference of the discriminator between generated and real images. This allows the generator to train unsupervised. However, compared to the training of other types of networks, GANs are notoriously difficult to train as they require hyperparameter fine-tuning for both the generator and the discriminator: both overly high and overly low inference quality of the discriminator can cause poor generation quality. Several different approaches have been proposed to address this problem, such as noise addition for the discriminator input or "freezing" of the lower layers of a pretrained discriminator [89,90].
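How the discriminator output drives both networks can be sketched with the commonly used binary cross-entropy formulation, which is assumed here and not necessarily the loss used in the cited works. The values `d_real` and `d_fake` are hypothetical discriminator outputs (probabilities of "real") for one real and one generated image.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Cross-entropy: real images should score 1, generated images 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Non-saturating generator loss: generated images should score 1."""
    return -math.log(d_fake)


# Discriminator clearly ahead: generated images are easily recognised
strong_d = discriminator_loss(0.9, 0.1), generator_loss(0.1)

# Discriminator cannot tell real from generated images
confused_d = discriminator_loss(0.5, 0.5), generator_loss(0.5)
```

The generator loss is large when the discriminator confidently rejects generated images and shrinks as the discriminator is fooled, which illustrates why the balance between the two networks matters during training.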

Recurrent Neural Networks
Recurrent neural networks (RNNs) are a special class of neural networks that are derived from feed-forward networks and possess the ability to process sequential data. This is accomplished by saving previous states of the network in specialised recurrent units (RUs), which are organised in loop-like structures. These RUs can be regarded as dense layers with tanh activation, which save their respective activation as a hidden state. This hidden state is passed between and modified by each RU for inference.
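A single recurrent unit with tanh activation can be sketched with scalar weights as follows. The weight values and function names are illustrative assumptions; real networks use weight matrices and vector-valued states.

```python
import math


def recurrent_step(x, h_prev, w_x, w_h, b):
    """Combine the current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Process a sequence and return the hidden state after each element."""
    h = 0.0
    states = []
    for x in sequence:
        h = recurrent_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


states_a = run_rnn([1.0, 0.0])
states_b = run_rnn([0.0, 1.0])   # same values, different order
```

Because the hidden state is carried from step to step, the two sequences end in different states although they contain the same values, which is what makes the network sensitive to order.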
Whilst this, in theory, allows RNNs to learn long-term dependencies, they suffer in practice from either vanishing or exploding gradients during training, drastically limiting the inference performance of classical RNNs [91]. This problem was addressed by the development of Long Short-Term Memory Networks (LSTMs) [91,92]. Each standard LSTM unit has two states: the cell state, which is solely passed between the different LSTM cells, and the hidden state, which constitutes the cell output. These states are influenced by three different gates: the forget gate, the modification gate and the output gate.
The cell state is first influenced by the forget gate, which weighs the previous cell state. In other words, the forget gate decides which information of the cell state is relevant for further inference. The second gate is the modification gate, which modifies the cell state according to the cell's learned parameters. Both gates depend on the hidden state of the previous cell. The last gate is the output gate, which further modifies the cell state, dependent on the previous hidden state, to create this cell's hidden state.
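The interplay of the three gates can be sketched with the standard LSTM equations in scalar form. The parameter dictionary, its key names and the numerical values are assumptions for illustration; the modification gate described above corresponds to what is often called the input gate, combined with a tanh candidate value.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step; params maps a gate name to (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre_activation("forget"))          # weighs old cell state
    modify = sigmoid(pre_activation("modify")) * math.tanh(
        pre_activation("candidate")
    )                                                   # new information
    c = forget * c_prev + modify                        # updated cell state
    output = sigmoid(pre_activation("output"))
    h = output * math.tanh(c)                           # this cell's hidden state
    return h, c


# Illustrative parameters: keep the old cell state, add almost nothing new
keep_params = {
    "forget": (0.0, 0.0, 20.0),
    "modify": (0.0, 0.0, -20.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=keep_params)
```

With a forget gate close to one and a modification gate close to zero, the cell state passes through almost unchanged, which is the mechanism that lets LSTMs retain information over many steps.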
Since the first introduction of LSTMs, different modifications have been proposed. Peephole LSTMs [93], for example, interconnect the gates with the cell state and thus change the information flow inside each unit. A more radical change was proposed with Gated Recurrent Units (GRUs) [94], which replace the forget and the modification gate with a single update gate and merge the cell and hidden state.
Due to their structure, RNNs cannot be trained via regular backpropagation. Instead, training is usually performed via backpropagation through time (BPTT), a variant of backpropagation that unrolls the RNN and applies gradient descent using the accumulated error of each recurrent unit of the network. Despite these challenges in training convergence, RNNs exhibit remarkable performance in practical applications. As an example, in one application [95], a combination of CNN feature extraction and LSTM classification was used to classify histopathological images.
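Backpropagation through time can be sketched for a scalar RNN with a squared-error loss on the final hidden state. The network (one recurrent weight `w`, one input weight `u`, no bias), the loss and all values are assumptions chosen so that the gradient accumulation over the unrolled time steps stays readable.

```python
import math


def forward(xs, w, u):
    """Unroll the RNN; hs[t] is the hidden state after input xs[t - 1]."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs


def loss(xs, y, w, u):
    return 0.5 * (forward(xs, w, u)[-1] - y) ** 2


def bptt(xs, y, w, u):
    """Accumulate the gradients of the loss over all unrolled time steps."""
    hs = forward(xs, w, u)
    grad_w = grad_u = 0.0
    dh = hs[-1] - y                        # dL/dh at the last time step
    for t in range(len(xs), 0, -1):
        da = dh * (1.0 - hs[t] ** 2)       # back through tanh
        grad_w += da * hs[t - 1]           # contribution of this time step
        grad_u += da * xs[t - 1]
        dh = da * w                        # error passed to the previous step
    return grad_w, grad_u


xs, y, w, u = [1.0, -0.5, 0.3], 0.2, 0.7, 0.4   # illustrative values
grad_w, grad_u = bptt(xs, y, w, u)
```

At every step, the error is multiplied by `w` and by the tanh derivative; repeated over long sequences, this product tends to shrink or grow, which is the origin of the vanishing and exploding gradients of classical RNNs.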
The images were divided into patches to reduce the computational requirements, and then the CNN model was utilised to embed each patch into a latent space (e.g., feature vectors). Next, the feature descriptions of all patches were concatenated to form an input sequence for the LSTM model. Finally, with the sequence-to-sequence transformation capability of the LSTM model, the contextual dependency among patches was modelled to produce an image-level prediction. R. Azad et al. [96] further developed the LSTM idea into a segmentation model to unify the feature description deriving from the encoder-decoder module in a non-linear fashion.

Graph Neural Networks
Graph-like data have proven difficult to analyse using network architectures, such as CNNs or RNNs, which expect a predefined set of input features. To tackle this kind of data, a new kind of architecture was developed, named Graph Neural Network (GNN, [97]). A GNN is usually used to tackle the following tasks: node classification and regression; edge classification and prediction; and graph classification, regression and matching [98].
In general, a GNN consists of three main modules: a propagation module, used to aggregate information from neighbouring nodes; a sampling module, working in conjunction with the propagation module; and a pooling module, which reduces the data complexity [98]. An improvement of the GNN is the so-called Graph Convolutional Network (GCN), which applies convolutional layers to extract information [99,100]. The concept of convolution was applied to GNNs to decrease the high computational cost of the previous GNN design. In CNNs, convolution operates on local Euclidean structure, while in GCNs, it operates on non-Euclidean data (e.g., graphs) to incorporate irregular data structures [101]. The main difference between GNNs and dense neural networks is the graph transfer function. More specifically, similar to the dense neural network, the graph transfer module learns the full connection weights between all nodes; however, in addition, it considers the importance of the edge connections.
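The role of the propagation module can be sketched with the simplest possible aggregation: each node's feature is replaced by the mean over itself and its neighbours. Learned weights and non-linearities are deliberately omitted, and the node names, feature values and edges are hypothetical.

```python
def propagate(features, edges):
    """One propagation step: average each node's feature with its neighbours'."""
    neighbours = {node: {node} for node in features}   # include the node itself
    for a, b in edges:                                  # undirected edges
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {
        node: sum(features[n] for n in group) / len(group)
        for node, group in neighbours.items()
    }


# Hypothetical graph: chain A - B - C and an isolated node D
features = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 4.0}
edges = [("A", "B"), ("B", "C")]

step1 = propagate(features, edges)
step2 = propagate(step1, edges)
```

After one step, node C has only seen its direct neighbour B; after two steps, information from A has reached C, which shows how stacking propagation steps widens the neighbourhood each node can draw on.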
This might explain why the GCN network is more capable of learning structural information shared among all nodes and is more prone to missing data points. It should also be noted that the GCN model requires precise data structuring (unlike a CNN, which works on the raw data), which might limit the applicability of this architecture in different applications. Time and space complexity are further limitations of the GCN network. In particular, the backpropagation operation in GCN networks [22] requires saving all computed nodes along with the intermediate states, which requires a large amount of memory, especially for large graphs.
In addition, as stated before, the GCN is a generalised form of CNN architectures and requires more training time to capture the underlying representation. Yu et al. [102] used Graph Neural Networks for the determination of biomarkers from microarray data. In this work, the graph structure was first constructed using the gene interaction network, and the Graph Neural Network was then used for link prediction to enhance the graph structure data. Li et al. [103] proposed a novel biomarker selection method for microarray data by combining Graph Neural Networks and gene relationships. In this paper, Graph Neural Networks were used to select features and characterise node information. Then, a spectral clustering method was applied to filter redundant features.

Transformers
Transformers are a neural network architecture that was proposed in the year 2017 by Vaswani et al. [104]. This architecture is built around attention-driven blocks, which are neural network layers that aggregate information from the whole input sequence [105].
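How an attention block aggregates information from the whole input sequence can be sketched with scaled dot-product attention for a single head. The learned query, key and value projections and the multi-head structure are omitted, so the tokens themselves serve as queries, keys and values; the token values are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; returns outputs and attention weights."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)                 # one weight per input position
        weights.append(w)
        outputs.append([
            sum(wj * v[i] for wj, v in zip(w, values))
            for i in range(len(values[0]))
        ])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # toy sequence of 3 tokens
outputs, weights = attention(tokens, tokens, tokens)  # self-attention
```

Each output is a weighted mixture of all tokens in the sequence, with the weights determined by how similar a token's query is to every key.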
The model was originally designed for machine translation, modelling long-range dependencies with multi-head attention mechanisms, and the model proposed by Vaswani et al. is one of the most famous in the field of Natural Language Processing (NLP); however, it is currently also increasingly used for image processing. Although Convolutional Neural Networks (CNNs) have been the most popular deep neural networks in medical image analysis, they are weak in learning long-range dependencies because of their localised receptive field [106].
Application of the long range learning capabilities for computer vision tasks was made possible by the development of Vision Transformers (proposed by Dosovitskiy et al., 2020 [107]). In this model architecture, an image is transformed into a sequence of non-overlapped patches, where each patch indicates a spatial location on the input image. Next, by applying the multi-head attention mechanism followed by the Multilayer Perceptron (MLP) module, it learns the importance of each sequence component to model object-level recognition. An extension of this architecture (e.g., [108]) was further proposed to alleviate the problem of weak global representation stemming from the local nature of the CNN modules.
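The transformation of an image into a sequence of non-overlapping patches can be sketched as follows. The image size, the patch size and the function name are assumptions; the linear embedding and positional information that a Vision Transformer adds to each patch are omitted.

```python
def split_into_patches(image, patch):
    """Cut a 2D image into non-overlapping patch x patch tiles, flattened."""
    rows, cols = len(image), len(image[0])
    tokens = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            tokens.append([
                image[r + i][c + j]
                for i in range(patch)
                for j in range(patch)
            ])
    return tokens


# Toy 4x4 image with pixel values 0..15
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = split_into_patches(image, patch=2)   # 4 tokens of length 4

# Attention compares every token with every other token
pairwise_scores = len(tokens) ** 2
```

The position of a token in the sequence corresponds to the spatial location of its patch, and the number of pairwise attention scores grows with the square of the number of patches.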
From another perspective, the lack of global representation in the CNN model usually pushes the learning strategy of this network toward texture cues, which weakens the shape-based description. The Transformer network utilises the attention mechanism on top of the image patches to model global representation. Hence, it has the potential to learn global and, consequently, shape-based information.
In addition, the sequence-to-sequence learning strategy deployed in the Transformer model empowers this architecture to better model inductive bias compared to its CNN counterparts. However, the higher number of parameters, the quadratic computational complexity of the attention operation and the need for large training datasets are among the main drawbacks of this network architecture.
As an example of utilising Transformers in clinical applications, Lum et al. [109] proposed an attention-based video model to detect disease signatures and learn clinically relevant imaging biomarkers. A knowledge transfer approach was also used to overcome the problem of data limitations. In another application [110], a Transformer-based method was utilised to reconstruct the electrocardiography (ECG) signal from the corresponding photoplethysmography (PPG) signal. More specifically, a multi-head attention mechanism was designed to perform sequence-to-sequence prediction using waveform data. The predicted ECG signal along with the PPG signal was then used to monitor cardiovascular diseases.

Challenges of Neural Networks
Despite their advantages, neural networks also have to contend with some difficulties. The most glaring problem of neural networks is their extremely high complexity, which makes it difficult to explain the results obtained (also called black-box networks). In recent years, some efforts have been made to make neural networks more interpretable (e.g., through reverse engineering or various visualisation techniques, [111,112]); however, as neural networks grow in size and complexity, the problem is likely to become worse. This is problematic because black-box networks can hide other problems, such as CHP (Section 4) and dataset bias, i.e., the uneven distribution of available data for different social groups, which can lead to incorrect diagnoses and/or treatments for women and minority groups when such networks are used as tools by clinicians [113,114].

Usage of Neural Networks for Medical Data Analysis
The appropriate architecture for analysis largely depends on the type of data. We differentiate between four data types: scalar data, n × n matrices (images), series data and graph data.

Scalar Data
Scalar data are the most basic type of data obtained during diagnosis. Typical medical data with scalar characteristics are: A dense network usually receives and outputs scalar data, making it useful for classification and clustering. The main application of dense networks lies in diagnosis assistance for medical personnel. Networks trained with different biomarkers as input and the respective medical diagnosis as output could be utilised particularly for the diagnosis of rare and thus often overlooked diseases. The applicability of such diagnosis models has been evaluated for acute nephritis, abnormal cardiac behaviour, carcinoma, valve stenosis and other diseases [115,116].
Dense networks were also used for the estimation of skin parameters for the reconstruction of 2D maps of blood volume fraction and blood oxygen saturation in the skin, the measurement of oxygen saturation and haemoglobin concentration in living tissue, the differentiation of smokers from non-smokers and the discrimination of human bodies from bones and teeth remains [117][118][119][120].

Images
Several diagnostic procedures produce image data. Some typical examples are: In addition, there are applications in forensics that deal with, for example, the identification of image sources, forensic face verification in videos, drowning diagnosis or the interpretation of gunshot wounds [32][33][34][35]. The input for CNNs/FCNs/U-Nets is an n-dimensional matrix; it can thus be used for classification and clustering of image data. Since their development, CNNs, FCNs/U-Nets and derivatives thereof have been used extensively for classification of image-based medical data. Examples can be found, among others, for the detection of Alzheimer's Disease, brain tumours, lung cancer, liver cancer and mitosis and nuclear atypia detection for breast cancer [121][122][123][124][125]. GANs also have seen various uses in medical data analysis from the estimation of CT images from MR images, the detection of brain lesions and retinal vessel detection to image synthesis for recognition network training [126][127][128][129].

Series Data
Series data are a special case of data compared to the previously described ones. Instead of only analysing one data point for classification, the network needs to draw conclusions from a series of ordered data points. These data points can be scalar data points as well as image data. Typical series data are:
• Biomarker concentration over time.
• ECG.
• Live cell imaging.
RNNs with memory cells (LSTMs and GRUs) are a suitable choice to tackle this kind of data. They have been used for sepsis detection, survival prediction for heart transplantation, hospital readmission rate prediction for lupus patients and MRI image reconstruction [130][131][132][133][134].

Graph Data
A graph describes the characteristics of and relationships between data points. Typical graph data are:
• Protein/molecule structures.
Graphs are difficult to analyse using classical neural networks because of their lack of predefined structure. The network types to analyse graph data are GNNs and derivatives, such as GCNs. GCNs have been utilised in many fields, including computer vision applications, person re-identification, action localisation and also in medical image analysis. These networks were successfully used, e.g., for diagnosis prediction, prescription prediction and biomarker identification [135][136][137]. Zhang et al. [128] considered the supervoxels from the brain MRI volume as the nodes of the graph and used GCN to classify supervoxels into different types of tissues. Zhou et al. [101] utilized a GCN for the grading of colorectal cancer, and Shi et al. [138] used a GCN to classify cervical cells.

Publication Development between 2000 and 2021
To analyse the usage of different neural network architectures, we searched the PubMed API (https://pubmed.ncbi.nlm.nih.gov/, accessed on 27 January 2022) to find articles with publication dates between 2000 and 2021 that matched a respective keyword (Table 4). Overall, 42,335 publications were analysed. First, the PubMed IDs of matching publications were fetched using the esearch interface. A document summary, containing, among others, information about the authors and journal, was then requested for each ID using the efetch interface. For each keyword, the number of published articles and journals between 2000 and 2021 was analysed (52,940 publications in total; publications without a full journal name were excluded). For all searched keywords, an increase in the number of published articles and a respective increase in publishing journals was observed (Figure 6). The largest increase in publications per year was observable for the broad terms "Artificial Neural Network" (N = 185 (2000) to N = 3531 (2021)) and "Deep Learning" (N = 62 (2000) to N = 10,000 (2021)), which are generally broadly used to indicate the usage of neural networks.
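The search and counting steps can be sketched as follows. This is not the script used for the analysis but a minimal illustration under stated assumptions: the query is built for the public NCBI E-utilities `esearch.fcgi` endpoint with its `db`, `term`, `datetype`, `mindate`, `maxdate`, `retmax` and `retmode` parameters, the keyword is one of the terms named in the text, and the records passed to the counting function are placeholders rather than real results. No network request is made by the example.

```python
from collections import defaultdict
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(keyword, year_from, year_to, retmax=100):
    """Build an esearch query for PubMed IDs within a publication date range."""
    params = {
        "db": "pubmed",
        "term": f'"{keyword}"',
        "datetype": "pdat",          # filter on the publication date
        "mindate": str(year_from),
        "maxdate": str(year_to),
        "retmax": str(retmax),
        "retmode": "json",
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def per_year_counts(records):
    """Count articles and distinct journals per year from (year, journal) pairs."""
    articles = defaultdict(int)
    journals = defaultdict(set)
    for year, journal in records:
        if not journal:              # skip records without a full journal name
            continue
        articles[year] += 1
        journals[year].add(journal)
    return {year: (articles[year], len(journals[year])) for year in sorted(articles)}


url = esearch_url("Convolutional Neural Network", 2000, 2021)

# Placeholder records for illustration only, not data from the analysis
placeholder_records = [
    (2020, "Journal A"), (2020, "Journal A"),
    (2021, "Journal B"), (2021, "Journal C"), (2021, ""),
]
counts = per_year_counts(placeholder_records)
```

Each entry of `counts` holds the number of articles and the number of distinct journals for one year, mirroring the two quantities reported per keyword.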
For specific architectures, an increase in publications per year was observed for all architecture types, particularly after the year 2015. This might be explained by ever-increasing computing power over the last twenty years, the emergence of machine learning (ML) driven by Graphics Processing Units (GPUs; described as early as 2005 [139]) and cloud computing, as well as increasing popularity due to easy-to-access ML APIs, such as TensorFlow or Torch. The largest number of publications was observed for Convolutional Neural Networks with 5415 publications in 2021, followed by Recurrent Neural Networks (N = 926) and Graph Neural Networks (N = 872).
The lowest peak number of publications was found for Fully Convolutional Neural Networks (N = 608). The high number of papers incorporating convolutional neural networks can be explained by their ability to categorise multidimensional data as well as their easier implementation compared to generative or recurrent networks. Whilst approaches to analysing image data have been shown for both recurrent and generative networks, their implementation is usually more complicated, and more parameter fine-tuning is required for successful training. Additionally, the ten journals with the most publications overall between 2000 and 2021 were investigated (Figure 7). The most publications overall were published in the journal Sensors (Basel, Switzerland) with 4502 publications, followed by Scientific reports (1867 publications) and Conf Proc IEEE Eng Med Biol Soc (1570 publications).

Author Contributions: Conceptualization, R.W. and S.R.; methodology, R.W.; software, R.W.; formal analysis, R.W. and S.R.; investigation, R.W. and S.K.; resources, S.R.; data curation, R.W.; writing-original draft preparation, R.W.; writing-review and editing, R.W., S.K., D.R. and S.R.; visualization, R.W. and S.K.; supervision, S.R.; project administration, S.R.; and funding acquisition, S.R. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: