Semantic Segmentation of Medical Images Based on Runge–Kutta Methods

In recent years, deep learning has achieved good results in the semantic segmentation of medical images. A typical architecture for segmentation networks is an encoder–decoder structure. However, the design of these segmentation networks is fragmented and lacks a mathematical explanation. Consequently, segmentation networks are inefficient and less generalizable across different organs. To solve these problems, we reconstructed the segmentation network based on mathematical methods. We introduced the dynamical systems view into semantic segmentation and proposed a novel segmentation network based on Runge–Kutta methods, referred to hereafter as the Runge–Kutta segmentation network (RKSeg). RKSegs were evaluated on ten organ image datasets from the Medical Segmentation Decathlon. The experimental results show that RKSegs far outperform other segmentation networks in efficiency: they use few parameters and have short inference times, yet they achieve competitive or even better segmentation results compared to other models. RKSegs pioneer a new architectural design pattern for segmentation networks.


Introduction
Deep learning has recently achieved success in many fields [1][2][3][4][5][6]. In particular, deep convolutional neural networks have greatly advanced medical image segmentation. The state-of-the-art segmentation networks typically have encoder-decoder structures. Network models for image classification are usually adopted as the backbones of semantic segmentation networks [7][8][9][10]; the backbone is also referred to as the encoder, a decoder follows it, and skip connections closely connect the two. Consequently, the design of the segmentation network is fragmented and lacks overall consideration. Furthermore, the design lacks a mathematical explanation. As a result, there are several issues in medical image segmentation networks. First, segmentation networks are very large and computationally expensive. Second, the performance of segmentation networks often varies across different organs, i.e., their generalizability is poor. Therefore, it is necessary to redesign the segmentation network with holistic consideration and a mathematical explanation.
From experience in image classification [11][12][13][14][15][16], the dynamical systems view is a good perspective for designing efficient network models with appropriate mathematical interpretation. Reference [11] views the forward pass of a neural network as the trajectory of a dynamical system described by an ordinary differential equation (ODE). Since the trajectories of dynamical systems are usually approximated by numerical methods, numerical methods are also used to construct neural networks. The Runge-Kutta (RK) methods are frequently used numerical methods [17]. They are used to construct Runge-Kutta convolutional neural networks (RKCNNs) [16]. RKCNNs are state-of-the-art numerical networks for image classification. They are very efficient and save computing resources significantly. They also surpass the models using linear multi-step methods, another kind of numerical method.
No segmentation model has been constructed from the dynamical systems view, although semantic segmentation is based on image classification. To obtain higher segmentation efficiency, the construction of a numerical segmentation network is a valuable research topic.
We abandoned the concepts of encoder and decoder in segmentation models and instead modeled the entire network from the dynamical systems perspective. We regarded the process of semantic segmentation as a dynamical system, since it is also the forward pass of a neural network, just as in image classification. Accordingly, we used the RK methods in the segmentation networks to approximate the trajectory of the dynamical system. Due to the superiority of RKCNNs, we used them as the sole reference for constructing our segmentation networks.
Unlike all of the existing numerical models, including RKCNNs, we creatively used multiple scales within one time step of the RK methods. In other words, the existing models maintain the same scale within a time step, while we down-sampled and up-sampled within a time step, so that different stages approximated the increment of the time step at different scales. Consequently, we proposed a novel segmentation network structure using the RK methods, called RKSeg. Moreover, we evaluated the performance of RKSegs on ten organs using images from the Medical Segmentation Decathlon (MSD) [18,19].
Overall, the main contributions of our work are:
• We abandoned the encoder-decoder structure and considered the design of the segmentation network holistically from the dynamical systems perspective.
• We introduced RK methods into the segmentation network and inventively used various scales within one time step of the RK methods.
• We proposed a novel segmentation network architecture called RKSeg.

Related Work
First, the classification models AlexNet [20], VGG net [21], and GoogLeNet [22] were adapted into fully convolutional networks (FCNs) [7] for semantic segmentation. Specifically, the fully connected layers were cast into convolutional layers, and up-sampling layers were added at the end of the networks for pixel-wise prediction. Based on the single-stream FCN, they combine the predictions from multiple layers to improve performance.
Afterward, Reference [8] modified and extended FCN: the backbone is no longer an existing classification model but a new design, and the up-sampling part of FCN is expanded to transform it into a U-shaped architecture, the so-called U-Net. U-Nets focus on biomedical image segmentation. Next, densely connected convolutional networks (DenseNets) [23] were merged into U-Nets, resulting in FC-DenseNets [24]. Based on U-Nets, UNet++ [9] was proposed as a nested U-Net architecture with deep supervision. Subsequently, UNet 3+ [10] surpassed U-Net and UNet++ on two organ datasets. Then, nnU-Nets [25] optimized U-Nets and became the state-of-the-art segmentation models; they proved their generalizability on the ten organ datasets of MSD. U-shaped models are dominant in medical image segmentation.
The state-of-the-art segmentation networks treat the down-sampling part, which originates from a classification network, as the backbone or encoder and the up-sampling part as the decoder. These perspectives divide the semantic segmentation process into two parts and connect them via skip connections. However, these designs are empirical. To be specific, FCNs use two skip connections at small scales to improve precision. U-Net connects the downstream and upstream paths in pairs by scale. UNet++ overlays U-Nets of various depths and densely connects them at each scale. UNet 3+ introduces full-scale skip connections. DeepLabv3+ uses a skip connection at a medium scale. In general, there is no clear mathematical explanation for whether a node or a skip connection is necessary.

RKCNNs
RK methods are divided into explicit methods and implicit methods. The explicit RK methods are easy to implement with a neural network that approximates the ODE. However, the equations of the implicit RK methods are too complicated to compute directly; ordinarily, Newton iterations are used to approximate the implicit RK equations. In contrast, RKCNNs approximate the RK equations with neural networks whether they are explicit or implicit. Furthermore, the coefficients of the RK methods are learned through training rather than specified as in other models. More details of RK methods and RKCNNs are introduced below.
A neural network stands for a time-dependent dynamical system. The system state y is a function of time t, and the rate of change of y is described by an ODE [30]:

dy/dt = f(t, y(t)), y(t_0) = y_0, (1)

where y_0 is the initial value. The RK methods use the ODE to approximate the system state after a time step, and this approximation can be performed step by step. The (n + 1)th time step of the RK methods is written as [31]:

y_{n+1} = y_n + h ∑_{i=1}^{s} b_i z_i, (2)

where

z_i = f(t_n + c_i h, y_n + h ∑_{j=1}^{s} a_{ij} z_j). (3)

The system state at time t_{n+1} is y(t_{n+1}). In Equation (2), y_{n+1} is an approximation of y(t_{n+1}); h is the size of the (n + 1)th time step; h ∑_{i=1}^{s} b_i z_i is the increment of y after h. The slope z_i of the ith stage is computed using Equation (3); for explicit methods, the sum in Equation (3) runs only over j < i. For s-stage RK methods, the estimated slope is a weighted average of all s slopes, i.e., ∑_{i=1}^{s} b_i z_i. The coefficients a_{ij}, b_i, and c_i co-decide the accuracy of the approximation.
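As a concrete illustration of Equations (2) and (3), the sketch below applies explicit RK steps with the classical RK4 Butcher tableau (a fixed-coefficient example; in RKCNNs these coefficients are learned) to the scalar ODE dy/dt = -y:

```python
import math

def rk_step(f, t_n, y_n, h, a, b, c):
    """One explicit Runge-Kutta step, Equations (2)-(3):
    z_i = f(t_n + c_i*h, y_n + h*sum_j a_ij*z_j)  (j < i for explicit methods),
    y_{n+1} = y_n + h*sum_i b_i*z_i."""
    z = []
    for i in range(len(b)):
        y_stage = y_n + h * sum(a[i][j] * z[j] for j in range(i))
        z.append(f(t_n + c[i] * h, y_stage))
    return y_n + h * sum(bi * zi for bi, zi in zip(b, z))

# Classical 4-stage RK4 tableau (fixed here; learned in RKCNNs).
A = [[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0]]
B = [1 / 6, 1 / 3, 1 / 3, 1 / 6]
C = [0, 0.5, 0.5, 1]

# Integrate dy/dt = -y from y(0) = 1 over [0, 1]; the exact value is e^{-1}.
y, t, h = 1.0, 0.0, 0.1
for _ in range(10):
    y = rk_step(lambda t, y: -y, t, y, h, A, B, C)
    t += h
print(abs(y - math.exp(-1.0)))  # tiny truncation error, as expected of an order-4 method
```

With ten steps of size 0.1, the error stays well below 1e-6, which is the accuracy behaviour the coefficients a_ij, b_i, c_i control.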
RKCNNs are constructed based on the above equations. An RKCNN contains three components: the pre-processor, the post-processor, and the periods between the two. The raw images are processed by the pre-processor, which outputs an initial value to the subsequent periods. The last period outputs to the post-processor, and the classifier then makes predictions. The number of periods varies across datasets. If there are multiple periods, there are transition layers between them; the transition layers reduce the dimensions. Large-scale or complex images can be processed at multiple scales to improve performance. RKCNNs on the MNIST dataset have only one period, since MNIST consists of 28 × 28 pixel gray-scale images of handwritten digits 0-9. Nevertheless, RKCNNs on the SVHN [32] and CIFAR [33] datasets have three periods, since both datasets consist of complex 32 × 32 pixel color images. In addition, each period can be divided into multiple time steps.
If h b_i z_i is denoted by e_i, Equation (2) is rewritten as [16]:

y_{n+1} = y_n + ∑_{i=1}^{s} e_i. (4)

In RKCNNs, the convolutional subnetwork for every time step is constructed based on Equation (4). There are three different architectures of RKCNN, and they are distinguished by the suffixes -E, -I, and -R. The difference among these architectures is how e_i is approximated.
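The substitution e_i = h b_i z_i can be checked numerically: the sketch below recovers h a_ij z_j from stored increments as (a_ij/b_j) e_j, so the per-stage-increment view of Equation (4) reproduces the classical update. Heun's method is used here purely as a small fixed-coefficient example:

```python
def rk_step_increments(f, t_n, y_n, h, a, b, c):
    """Equation (4) view: y_{n+1} = y_n + sum_i e_i with e_i = h*b_i*z_i.
    Each e_i corresponds to one stage subnetwork's output in an RKCNN.
    Note h*a_ij*z_j is recovered from stored increments as (a_ij/b_j)*e_j."""
    e = []
    for i in range(len(b)):
        y_stage = y_n + sum((a[i][j] / b[j]) * e[j] for j in range(i))
        e.append(h * b[i] * f(t_n + c[i] * h, y_stage))
    return y_n + sum(e)

# Heun's method (2 stages, order 2) on dy/dt = -y, one step of h = 0.1:
A = [[0, 0], [1, 0]]
B = [0.5, 0.5]
C = [0, 1]
y1 = rk_step_increments(lambda t, y: -y, 0.0, 1.0, 0.1, A, B, C)
print(y1)  # matches Heun's update 1 - h + h^2/2 = 0.905
```

The division by b_j is only valid when all b_j are nonzero; in RKCNNs the issue disappears because the mixing coefficients are absorbed into learned parameters.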
In RKCNN-E, e_i is approximated by a network E_i as follows [16]:

e_i = E_i(y_n + ∑_{j=1}^{i-1} (a_{ij}/b_j) e_j). (5)

The parameters of this network are a function of t_n, a_{ij}, b_i, and c_i; i.e., t_n, a_{ij}, b_i, and c_i are learned through training rather than being specified [16].
RKCNN-I and RKCNN-R use the form of Equation (5) to approximate x_i, which is the initial value of e_i [16]; i.e., x_i is written as:

x_i = X_i(y_n + ∑_{j=1}^{i-1} (a_{ij}/b_j) x_j). (6)

In RKCNN-I, a network approximates e_i using the x_i [16]:

e_i = I_i(y_n + ∑_{j=1}^{s} (a_{ij}/b_j) x_j). (7)

In RKCNN-R, the network for approximating e_i is slightly different from Equation (7). It is written as [16]:

e_i = x_i + R_i(y_n + ∑_{j=1}^{s} (a_{ij}/b_j) x_j). (8)

We introduce RKCNNs into semantic segmentation. Table A1 describes all of the above mathematical symbols.
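The stage-wise wiring of the -E and -I architectures can be caricatured with scalar "subnetworks" (the real E_i, X_i, and I_i are convolutional subnetworks, the learned mixing coefficients a_ij/b_j are folded into each toy function here, and the exact wiring in [16] may differ; the -R variant differs from -I only in how e_i is formed, so it is omitted):

```python
def rkcnn_e_step(y0, E):
    """RKCNN-E pattern: stage i sees y0 plus the earlier increments e_1..e_{i-1}
    (explicit dependence: each stage uses only previous stages)."""
    e = []
    for E_i in E:
        e.append(E_i(y0 + sum(e)))
    return y0 + sum(e)

def rkcnn_i_step(y0, X, I):
    """RKCNN-I pattern: first estimate initial values x_i explicitly,
    then every I_i sees all of the x_j (the implicit, all-stage coupling)."""
    x = []
    for X_i in X:
        x.append(X_i(y0 + sum(x)))
    mixed = y0 + sum(x)
    e = [I_i(mixed) for I_i in I]
    return y0 + sum(e)

# Toy scalar "subnetworks": each stage just scales its input.
E = [lambda v: 0.1 * v, lambda v: 0.2 * v]
X = [lambda v: 0.1 * v, lambda v: 0.2 * v]
I = [lambda v: 0.05 * v, lambda v: 0.05 * v]
print(rkcnn_e_step(1.0, E))   # 1 + 0.1 + 0.2*1.1, i.e. about 1.32
print(rkcnn_i_step(1.0, X, I))
```

Note that the implicit pattern needs two subnetworks per stage (X_i and I_i), which is why the number of subnetworks per time step is even in RKCNN-I and RKCNN-R.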

RKCNN-Based FCN
We built the prototype of RKSegs based on single-stream FCNs. RKCNNs are classification networks, so we can adapt them to FCNs. Since the MNIST dataset and many medical image datasets are gray-scale, we choose the RKCNNs for MNIST as the backbone of the FCNs. Specifically, the down-sampling in the pre-processor is the same as in the original models. The pooling layer before the fully connected layer is removed, and the fully connected layer in the post-processor is changed to a 1 × 1 convolution for pixel-wise prediction. In addition, up-sampling is appended at the end of the network. Since truncation errors can accumulate over multiple time steps [34], only one time step is used, as in the original models, i.e., n = 0. We thus obtain RKCNN-based FCNs as the prototype of RKSegs. An example is shown in Figure 1.
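The two adaptations named above, casting a fully connected layer into a 1 × 1 convolution and appending up-sampling, can be sketched in plain Python (nested lists stand in for tensors; the channel counts and values are made up for illustration):

```python
def conv1x1(feature_map, weights):
    """A 1x1 convolution is the fully connected layer applied at every pixel:
    each output channel is a weighted sum over input channels, per position.
    feature_map: [C_in][H][W] nested lists; weights: [C_out][C_in]."""
    c_in, h, w = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [[[sum(weights[o][c] * feature_map[c][y][x] for c in range(c_in))
              for x in range(w)] for y in range(h)] for o in range(len(weights))]

def upsample_nearest(channel, factor):
    """Nearest-neighbour up-sampling, appended for pixel-wise prediction."""
    return [[v for v in row for _ in range(factor)]
            for row in channel for _ in range(factor)]

fm = [[[1.0, 2.0], [3.0, 4.0]]]           # 1 input channel, 2x2 spatial
logits = conv1x1(fm, [[2.0], [-1.0]])     # 2 hypothetical "classes"
print(logits[0])                          # per-pixel scores for class 0
print(upsample_nearest(logits[0], 2)[0])  # first row after 2x up-sampling
```

Because the same weight matrix is shared across all positions, a classifier's dense head becomes a dense prediction head without adding parameters, which is the core of the FCN adaptation.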

Figure 1. FCN with a 4-stage RKCNN-E as the backbone (arrows denote skip connections, down-sampling, and up-sampling). y_0 is the initial state, while y_1 is the system state after one time step. E_i is the convolutional subnetwork for the ith stage described in Equation (5), where 1 ≤ i ≤ 4. e_i is the weighted increment of the ith stage. According to the RK method, y_1 is the sum of y_0 and the weighted average of the increments. k is the number of output channels per subnetwork. c is the number of classes.

From FCN to RKSeg
The prototype of RKSeg is an FCN that uses the RKCNNs for the MNIST dataset as the backbone. The image size of the MNIST dataset is 28 × 28 pixels, but medical images are much larger, so we must improve the model. Additionally, the major computations in the prototype are all at the same scale, whereas multiple scales can benefit segmentation. Thus, we must consider a scheme for reducing the dimensions.
We remove down-sampling from the pre-processor and add down-sampling before each stage in order to preserve more multi-scale information. If RKCNN-E is used as the backbone, based on Equations (4) and (5), the model is described as follows:

y_1 = y_0 + ∑_{i=1}^{s} U(e_i), (9)

where

e_i = E_i(D(y_0 + ∑_{j=1}^{i-1} (a_{ij}/b_j) U(e_j))). (10)

In Equation (9), U(·) is a function that up-samples e_i to the same scale as y_0. In Equation (10), D(·) is a function that down-samples the input to the same scale as e_i. The resulting model is shown in Figure 2b. It is referred to as RKSeg-L, since the core of the network is on the left.
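A minimal 1-D sketch of this multi-scale time step: average pooling stands in for D(·), nearest-neighbour repetition for U(·), and toy scaling functions replace the convolutional subnetworks (the per-stage mixing coefficients a_ij/b_j are folded into the stages, and stage i is run at scale 1/2^i, both our simplifying assumptions):

```python
def down(v):
    """D(.): halve the resolution of a 1-D signal (stride-2 average pooling)."""
    return [(v[i] + v[i + 1]) / 2 for i in range(0, len(v) - 1, 2)]

def up(v):
    """U(.): double the resolution (nearest-neighbour repetition)."""
    return [x for x in v for _ in range(2)]

def rkseg_l_step(y0, stages):
    """One RKSeg-L-style time step: stage i is evaluated at a coarser scale,
    and its increment is up-sampled back to y0's scale before summation."""
    e_full = []                                # increments at y0's resolution
    for i, E_i in enumerate(stages, start=1):
        state = [a + sum(es) for a, *es in zip(y0, *e_full)]
        small = state
        for _ in range(i):                     # down-sample to stage i's scale
            small = down(small)
        e_i = E_i(small)                       # stage "subnetwork" (a toy here)
        for _ in range(i):                     # up-sample increment to y0's scale
            e_i = up(e_i)
        e_full.append(e_i)
    return [a + sum(es) for a, *es in zip(y0, *e_full)]

half = lambda v: [0.5 * x for x in v]          # hypothetical stage function
print(rkseg_l_step([1.0, 2.0, 3.0, 4.0], [half, half]))
```

The output has the same resolution as y_0 even though each stage worked at a coarser scale, which is exactly what lets the RK update y_1 = y_0 + Σ U(e_i) be formed.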
In consideration of the superiority of nnU-Nets on medical image segmentation, we use nnU-Nets as a reference to adapt RKSegs. In nnU-Nets, the subnetwork in each node consists of two 3 × 3 convolutional layers with m output channels each, where m gradually doubles as the scale gets smaller, up to 480. In RKSegs, we use two 3 × 3 convolutional layers with k output channels, where k does not change but remains the same as the initial number. Like the last node of nnU-Nets, the post-processor of RKSegs is a 1 × 1 convolutional layer with c output channels, where c is the number of classes.
In addition, except for the first stage, every stage in RKSegs has multi-scale input. Therefore, the convolutional down-sampling of nnU-Nets is not applicable to RKSegs. We adopt MaxPool for down-sampling and interpolation for up-sampling in RKSegs. Although deep supervision is helpful to UNet++, UNet 3+, and nnU-Net, we do not use it in RKSegs, since it cannot be explained from the dynamical systems view.

Figure 2. Segmentation architectures with the same down-sampling depth. (a) nnU-Net. (b) RKSeg-L based on RKCNN-E: E_i is the subnetwork described in Equation (10). (c) RKSeg-R based on RKCNN-E: E_i is the subnetwork described in Equation (14). (d) RKSeg-L based on RKCNN-I: X_i is the subnetwork described in Equation (11), and I_i is the subnetwork described in Equation (12). (e) RKSeg-R based on RKCNN-I: X_i is the subnetwork described in Equation (15), and I_i is the subnetwork described in Equation (16). (f) RKSeg-L based on RKCNN-R: X_i is the subnetwork described in Equation (11), and R_i is the subnetwork described in Equation (13). (g) RKSeg-R based on RKCNN-R: X_i is the subnetwork described in Equation (15), and R_i is the subnetwork described in Equation (17).
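The two sampling operators chosen above can be written out directly; the interpolation mode below is nearest-neighbour, a stand-in only, since the text does not fix the mode:

```python
def maxpool2x2(ch):
    """2x2 max pooling with stride 2, the down-sampling adopted in RKSegs."""
    return [[max(ch[r][c], ch[r][c + 1], ch[r + 1][c], ch[r + 1][c + 1])
             for c in range(0, len(ch[0]) - 1, 2)]
            for r in range(0, len(ch) - 1, 2)]

def interp_nearest(ch, factor=2):
    """Interpolation-based up-sampling (nearest-neighbour as a simple mode)."""
    return [[v for v in row for _ in range(factor)]
            for row in ch for _ in range(factor)]

x = [[1, 3, 2, 0],
     [4, 2, 1, 5],
     [0, 1, 9, 2],
     [3, 8, 6, 7]]
print(maxpool2x2(x))                      # one maximum per 2x2 block
print(interp_nearest(maxpool2x2(x))[0])   # first row after up-sampling back
```

Unlike strided convolution, neither operator has trainable parameters, which is one reason they suit stages whose inputs mix several scales.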
Similarly, to construct RKSeg-L with RKCNN-I or RKCNN-R as the backbone, we rewrite Equation (6) as:

x_i = X_i(D(y_0 + ∑_{j=1}^{i-1} (a_{ij}/b_j) U(x_j))). (11)

For RKSeg-L based on RKCNN-I, we rewrite Equation (7) as:

e_i = I_i(D(y_0 + ∑_{j=1}^{s} (a_{ij}/b_j) U(x_j))). (12)

For RKSeg-L based on RKCNN-R, we rewrite Equation (8) as:

e_i = x_i + R_i(D(y_0 + ∑_{j=1}^{s} (a_{ij}/b_j) U(x_j))). (13)

The resulting models are shown in Figure 2d,f.

More Variants
In RKSeg-L, the computations of the stages run from large scale to small scale. This sequence can be reversed: we down-sample the initial state y_0 and then up-sample after each stage. In other words, for RKSeg based on RKCNN-E, Equation (5) is rewritten as:

e_i = E_i(D(y_0) + ∑_{j=1}^{i-1} (a_{ij}/b_j) V(e_j)). (14)

In Equation (14), V(·) is a function that up-samples the input to the same scale as e_i. This type of RKSeg is shown in Figure 2c. It is referred to as RKSeg-R, since the core of the network is on the right.
Similarly, to construct RKSeg-R with RKCNN-I or RKCNN-R as the backbone, we rewrite Equation (6) as:

x_i = X_i(D(y_0) + ∑_{j=1}^{i-1} (a_{ij}/b_j) V(x_j)). (15)

For RKSeg-R based on RKCNN-I, we rewrite Equation (7) as:

e_i = I_i(D(y_0) + ∑_{j=1}^{s} (a_{ij}/b_j) V(x_j)). (16)

For RKSeg-R based on RKCNN-R, we rewrite Equation (8) as:

e_i = x_i + R_i(D(y_0) + ∑_{j=1}^{s} (a_{ij}/b_j) V(x_j)). (17)

The resulting models are shown in Figure 2e,g. According to the comparison in Figure 2, the number of nodes in RKSegs is almost half of that in nnU-Nets with the same down-sampling depth. Moreover, the number of feature maps is reduced remarkably. Most importantly, the nodes and skip connections of RKSegs are justified by the RK method.

Experiments
We evaluate RKSegs and state-of-the-art segmentation networks on the MSD dataset. The MSD tests the generalizability of algorithms across 10 different semantic segmentation tasks. It involves ten organs: brain, heart, liver, hippocampus, prostate, lung, pancreas, hepatic vessel, spleen, and colon. Some medical images in MSD are MRI scans, and others are CT scans.
Owing to the superiority of nnU-Nets on MSD, we evaluate UNet++, UNet 3+, and RKSeg following the configuration of nnU-Nets; i.e., they have the same initial number of feature maps, depth of down-sampling, convolutions, and loss function as nnU-Nets on each organ dataset. However, DeepLabv3+ and FC-DenseNet do not follow these configurations, since they are very different from U-Nets. Moreover, nnU-Net, UNet++, and UNet 3+ adopt deep supervision, while RKSeg, DeepLabv3+, and FC-DenseNet do not. For efficiency, MobileNetV2 [35] is used as the backbone of DeepLabv3+. FC-DenseNet56 is the FC-DenseNet variant evaluated. All of the evaluated models use 2D convolutions.
We implement RKSegs within the nnU-Net framework, which is written in PyTorch. The code of RKSegs is available at https://github.com/ZhuMai/RKSeg (accessed on 26 March 2023) and can be integrated into the nnU-Net framework for training.
The nnU-Net framework creates a five-fold cross-validation using all of the available training cases in MSD. We do not carry out cross-validation for RKSegs but entirely follow the configuration and training hyperparameters of nnU-Nets. For example, stochastic gradient descent with an initial learning rate of 0.01 and a Nesterov momentum of 0.99 is used. Moreover, RKSegs are trained in the same batch sizes as nnU-Nets.
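The optimizer configuration can be reproduced in a few lines; the update below follows the PyTorch formulation of SGD with Nesterov momentum (which is what the nnU-Net framework configures), demonstrated on a toy quadratic rather than a segmentation loss:

```python
def sgd_nesterov(grad, w, lr=0.01, momentum=0.99, steps=1000):
    """SGD with Nesterov momentum in the PyTorch formulation:
    v <- mu*v + g;  w <- w - lr*(g + mu*v)."""
    v = 0.0
    for _ in range(steps):
        g = grad(w)
        v = momentum * v + g
        w = w - lr * (g + momentum * v)
    return w

# Minimise f(w) = (w - 3)^2, whose gradient is 2*(w - 3); the minimum is w = 3.
w = sgd_nesterov(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w)  # approaches 3, after heavy oscillation at momentum 0.99
```

A momentum of 0.99 makes each update an average over a long gradient history, which is why nnU-Net pairs it with a modest learning rate of 0.01.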
We choose the first fold split by nnU-Net, i.e., fold 0, to evaluate all of the competitive models, since MSD does not release the ground truth of the testing cases. All of the evaluated models are trained from scratch for 150 epochs on a GeForce RTX 3080 GPU, and the training is carried out three times. The testing cases are only used to evaluate the inference time.
All of the data are pre-processed by the nnU-Net framework. Details of the data used in our experiments are listed in Table 1. We first compare RKSegs with different backbones, i.e., RKCNN-E, RKCNN-I, and RKCNN-R. The stages are updated alternately in RKCNN-I and RKCNN-R, so these backbones need at least two stages [16], and the number of subnetworks in a time step must be even. In the configuration of nnU-Nets, the depth of down-sampling is even only on the heart and prostate datasets. Therefore, we only compare RKSegs with different backbones on these two datasets. Next, we compare RKSegs with state-of-the-art models on all ten organ datasets of MSD.

Comparison of Backbones
Dice similarity coefficients (DSCs) are evaluated on the validation sets of MSD. The numbers of parameters and the DSCs are listed in Table 2. For the prostate models shown, we report the run whose mean DSC over the two segmentation targets is the highest of the three runs. According to the experimental results, RKSegs based on RKCNN-E achieve higher DSCs than the others. Furthermore, they place no restriction on the depth of down-sampling, so they can be used on more datasets. As a result, we consider RKCNN-E more suitable as the backbone of RKSegs.
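The DSC metric used throughout is the standard overlap measure DSC = 2|A ∩ B| / (|A| + |B|); a minimal sketch for binary masks (the smoothing constant is our addition to avoid division by zero on empty masks, not a value from the paper):

```python
def dice(pred, target, smooth=1e-5):
    """Dice similarity coefficient for flattened binary masks:
    DSC = 2*|intersection| / (|pred| + |target|)."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return (2.0 * inter + smooth) / (total + smooth)

pred   = [0, 1, 1, 1, 0, 0]   # hypothetical predicted mask
target = [0, 1, 1, 0, 0, 1]   # hypothetical ground-truth mask
print(dice(pred, target))     # 2*2 / (3+3), i.e. about 0.667
```

A DSC of 1 means perfect overlap and 0 means none, which is why the tables report it per segmentation target.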

Comparison with State-of-the-Art Models
We compare RKSegs based on RKCNN-E with the state-of-the-art models. If there are multiple segmentation targets on an organ dataset, we choose the model with the highest mean over all targets across the three runs. The experimental data are shown in Table 3.

Table 3. DSCs of competitive models on the validation sets of MSD. DSCs are listed sequentially according to the segmentation targets of the corresponding organs in Table 1. The mean ± std over three runs is in brackets. Parameters are given in millions. The training time is shown in the format of hh:mm:ss. The experimental data on the ten organ datasets are divided into (a)-(j). The fewest parameters and the highest DSCs are in blue.

Furthermore, we evaluate the inference time of the competitive models on the testing cases of MSD. The results are shown in Table 4. A segmentation result of RKSeg-L on the spleen dataset is shown in Figure 3. RKSeg-L can even segment the more curved details on the right side of the spleen; its segmentation result matches the raw image even more closely than the label does.

Discussion
We construct RKSegs based on RKCNNs. If RKCNN-I or RKCNN-R has the same number of nodes as RKCNN-E, then it has half the stages of RKCNN-E, because it updates each stage alternately [16]. Therefore, the number of skip connections from the stages to the addition in RKCNN-I or RKCNN-R is only half of that in RKCNN-E. Hence, RKSegs based on RKCNN-E have more information from different stages at different scales, and multi-scale information benefits segmentation. As expected, RKSegs based on RKCNN-E achieve higher DSCs than the corresponding RKSegs based on RKCNN-I or RKCNN-R. As a result, RKSegs based on RKCNN-E are chosen for the comparison with state-of-the-art models.
According to the experimental results, our RKSegs have the fewest parameters. At the same time, RKSegs achieve the highest mean DSCs over three runs on three CT datasets, namely the lung, spleen, and colon datasets. In addition, on the brain dataset, which is an MRI dataset, RKSeg wins on one segmentation target, while UNet++ wins on the other two targets and on the mean of the three targets. On the other six organ datasets, RKSegs obtain competitive DSCs with only 0.85% to 11% of the parameters of the best models. Additionally, according to the standard deviations over three runs, RKSegs are stable. In terms of training time, RKSegs obtain the shortest time once, the second shortest time eight times, and the third shortest time once. However, the inference time of RKSegs is the shortest among all of the evaluated models.
FC-DenseNet56 achieves the highest mean DSCs over three runs on two MRI datasets, the hippocampus and prostate datasets. On the prostate dataset, FC-DenseNet56 wins on a segmentation target and the mean of two targets, while UNet++ wins on the other target. Additionally, the performance of FC-DenseNet56 is very poor on six CT datasets. nnU-Nets achieve the highest mean DSCs over three runs on one MRI dataset and three CT datasets. Nevertheless, nnU-Nets have a lot of parameters on each organ dataset. UNet++ achieves the highest mean DSC over three runs on an MRI dataset, the brain dataset. However, UNet++ has the most parameters among all of the evaluated models.
In summary, RKSegs are general and efficient on diverse organ datasets with different modalities.
For the encoder-decoder structure, skip connections are used to introduce multi-scale information into the prediction. However, whether to add nodes to the decoder and where to add skip connections are determined experimentally.
In contrast, RKSegs are constructed from the dynamical systems view. Each node and skip connection of RKSegs is justified by the RK methods. Hence, RKSegs avoid many superfluous components found in other models, which greatly saves computational resources. The experimental results show that RKSegs are much more efficient than the competing models and generalize across diverse organ datasets. Even so, the efficiency of RKSegs could still be improved. For example, the convolutional subnetwork of each node, the depth of the network, and the training hyperparameters of RKSegs are all the same as in nnU-Nets in our experiments; they could be tuned to improve the performance of RKSegs.

Conclusions
The encoder-decoder structure is well known in medical image segmentation. However, the composition of the decoder and the skip connections between the encoder and decoder are designed experimentally. Therefore, segmentation models are either inefficient or not generalizable across different organs. To remedy these deficiencies, we introduce a dynamical systems view to build segmentation models.
We propose a novel segmentation network based on RKCNNs, which use RK methods to construct networks. Our network is called RKSeg. Unlike RKCNNs, RKSegs perform down-sampling and up-sampling within a time step of the RK methods. In RKSegs, each node and skip connection is meaningful in the RK methods. According to the experiments, RKSegs based on RKCNN-E achieve superior performance on the ten organ datasets of MSD, while they have far fewer parameters than other models. Furthermore, RKSegs have a shorter inference time than the competitive models on each organ dataset.
Mathematical methods bring benefits to the performance of network models. Our work may inspire new ideas about segmentation networks.

Conflicts of Interest:
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

MNIST  Modified National Institute of Standards and Technology
MSD  Medical Segmentation Decathlon
RK  Runge-Kutta
ODE  Ordinary Differential Equation
MRI  Magnetic Resonance Imaging
CT  Computerized Tomography

Appendix A
An introduction to the math symbols mentioned in the main text is summarized in Table A1.

Table A1. Mathematical symbols.

f(t, y(t))  The rate of change of the system state.
t_n  The nth moment.
t_0  The initial time.
y(t_n)  The system state at t_n.
y_n  An approximation of y(t_n).
y_0  The initial state of the system.
h  The size of the time step from t_n to t_{n+1}.
s  The number of stages of the RK methods.
a_{ij}, b_i, c_i  The coefficients of the RK methods.
z_i  The slope in the ith stage.
e_i  The weighted increment of the ith stage; it stands for h b_i z_i.
x_i  The initial value of e_i in RKCNN-I and RKCNN-R.
E_i  The convolutional subnetwork to get e_i in RKCNN-E.
X_i  The convolutional subnetwork to get x_i in RKCNN-I or RKCNN-R.
I_i  The convolutional subnetwork to get e_i in RKCNN-I.
R_i  The convolutional subnetwork to get e_i in RKCNN-R.
m  The number of convolution filters (variable).
k  The number of convolution filters (constant).
c  The number of classes.