Segmentation-Based vs. Regression-Based Biomarker Estimation: A Case Study of Fetus Head Circumference Assessment from Ultrasound Images

The fetus head circumference (HC) is a key biometric used to monitor fetal growth during pregnancy; it is estimated from ultrasound (US) images. The standard approach to automatically measuring the HC is to use a segmentation network to segment the skull, and then to estimate the head contour length from the segmentation map via ellipse fitting, usually after post-processing. In this application, segmentation is just an intermediate step toward the estimation of a parameter of interest. Another possibility is to estimate the HC directly with a regression network. Even though such segmentation-free approaches have been boosted by deep learning, it is not yet clear how well direct approaches compare to segmentation approaches, which are still expected to be more accurate. This observation motivates the present study, where we propose a fair, quantitative comparison of segmentation-based and segmentation-free (i.e., regression) approaches to estimate how far regression-based approaches stand from segmentation approaches. We experiment with various convolutional neural network (CNN) architectures and backbones for both segmentation and regression models and provide estimation results on the HC18 dataset, as well as an agreement analysis, to support our findings. We also investigate memory usage and computational efficiency to compare both types of approaches. The experimental results demonstrate that, even if segmentation-based approaches deliver the most accurate results, regression CNNs actually learn to find prominent features, leading to promising yet improvable HC estimation results.


Introduction
The automated measurement of the fetus head circumference (HC) is performed throughout pregnancy as a key biometric to monitor fetal growth and estimate gestational age. In clinical routine, this measurement is performed on ultrasound (US) images by sonographers, who manually trace the skull contour and fit it to an ellipse. Figure 1 shows a sample US image of a fetus head from the HC18 public dataset [1] used in this paper. Identifying the head contour is challenging due to the low signal-to-noise ratio in US images, and also because the contours have fuzzy (and sometimes missing) borders (Figure 1). Manual contouring is an operator-dependent task, subject to intra- and inter-operator variability, which yields inaccurate measurements, as quantified in [2]: the 95% limits of agreement have been measured at ±7 mm for intra-operator variability and ±12 mm for inter-operator variability.
Usually, automating the measurement of the fetus head circumference in US images is achieved through a segmentation methodology. Segmentation methods typically involve image-processing or machine learning-based approaches, followed by some post-processing of the result so as to fit it to an ellipse. This process involves multiple steps, is ad hoc, and can be prone to error. Let us emphasize that here, segmentation is just an intermediate step to compute a characteristic of the image, i.e., the length of the head contour. More generally, segmentation is often a prerequisite step toward the computation of biomarkers: for example, the cardiac ventricles are segmented in magnetic resonance images in order to estimate the cardiac contractile function via indices such as the ejection fraction [3]. Another example is anthropometry, where the skeletal muscle mass and fat mass, which are significant prognostic factors in cancer, are estimated from the segmentation of muscle and fat in computed tomography (CT) images [4]. Instead of resorting to segmentation, which is a costly and error-prone process, one can attempt to estimate the (single or multiple) characteristics or biomarkers directly. Works on this topic have gotten a second wind with the breakthrough of deep learning, which allows one to take advantage of the power of feature representation and to perform end-to-end regression [5][6][7][8][9][10].
However, direct, "segmentation-free" approaches rely on much less information to estimate the biomarker, and it is not yet clear whether segmentation-free approaches can reach the level of accuracy of segmentation-based approaches. To our knowledge, no study has rigorously compared segmentation-based and segmentation-free methods for a given biomarker estimation application and quantified the gap between them. This observation motivates the present study, where we propose a fair, quantitative comparison of segmentation-based and segmentation-free (i.e., regression) approaches to estimate how far regression-based approaches stand from segmentation approaches, for the estimation of the head circumference in US images. More precisely, we investigate several settings, i.e., state-of-the-art segmentation models and various backbones for the regression CNN architectures, to obtain the best of both worlds, and we also investigate time and memory consumption in addition to estimation accuracy.
The rest of the paper is organized as follows. Section 2 describes related segmentation-based works on HC estimation from US images on one side, and segmentation-free approaches for biomarker estimation on the other side. Section 3 introduces the methodological framework for segmentation-based and segmentation-free methods. In Section 4, we describe the dataset and the experimental protocol. We present and discuss the experimental results comparing the segmentation-based and segmentation-free approaches in Section 5. Finally, the conclusion is drawn in Section 6.

Fetus Head Circumference Estimation
Several approaches have been proposed in the literature to measure the fetus head circumference in US images based on image segmentation [11][12][13]. Some follow a two-step approach, namely fetus head localization followed by segmentation refinement [11]. For example, in [14], the first step locates the fetus head with Haar-like features used to train a random forest classifier, and the second step measures the HC via ellipse fitting and the Hough transform. Other approaches build upon deep segmentation models, also in a two-step process: contour prediction, then ellipse fitting [15]. In [16], the standard segmentation model U-Net [17] is trained on manually labeled images, and the segmentation results are fitted to ellipses. In [18], the authors build upon the same idea, using the multi-task learning paradigm to jointly segment the US image and estimate the ellipse parameters. In [19], the authors first use a region-proposal CNN for head localization, and then a regression CNN trained on distance fields to segment the HC. Ref. [20] advances the work of [19] by proposing a Mask-R2CNN neural network that performs HC distance-field regression for head delineation in an end-to-end way, without requiring prior HC localization or post-processing for outlier removal. All these methods rely on a segmentation of the fetus head as a prerequisite to estimating the HC.

Segmentation-Free Approaches for Biomarker Estimation
Works aiming at directly extracting biomarkers from medical images have gained traction in recent years, especially thanks to advances in deep learning. The goal is to avoid intermediate steps, such as segmentation and other ad hoc post-processing steps, that may be computationally expensive (for both model training and image annotation) and prone to errors [5]. Direct parameter estimation with deep learning can be found in various medical applications; for example, in [5], the authors propose a learning-based approach to perform direct volume estimation of the cardiac left and right ventricles from magnetic resonance (MR) images, without segmentation. The approach consists of computing shape descriptors using a bag-of-words model and performing Bayesian estimation with regression forests. Ref. [6] utilizes regression forests to directly estimate the kidney volume in computed tomography images. Ref. [7] quantifies spine indices from MRI via a regression CNN with feature amplifiers. Ref. [8] proposes multi-task learning for the measurement of cardiac volumes from MRI. For vascular disease diagnosis, Ref. [9] quantifies six indices of coronary artery stenosis from X-ray images using a multi-output regression CNN with an attention mechanism. Preliminary results on the estimation of the head circumference in US images with a regression CNN are presented in [10]. By taking advantage of the representation power of CNNs, one can now skip the feature design step and learn the features while at the same time estimating the value of interest, i.e., performing regression. Regression CNNs are also at the heart of other fields of computer vision, such as head-pose estimation [21], facial landmark detection [22], and human-body pose estimation [23].

CNN Segmentation Model
We investigate several segmentation architectures that are state-of-the-art networks in medical image segmentation, to segment the contour of the fetus head: the well-known U-Net model [17], U-Net++ [24], DoubleU-Net [25], FPN [26], LinkNet [27], and PSPNet [28]. We trained these architectures from scratch, but also investigated transfer learning as a way to mitigate the limited number of images in the HC18 dataset. Even though the natural images of ImageNet and US images have obvious dissimilarities, some generic representations learned from a large-scale dataset can be beneficial to other types of images, as has been shown in the context of MR images [29]. Thus, we used various backbone models, namely VGG16 [30], ResNet50 [31], and EfficientNet [32], pretrained on the ImageNet dataset, for all the architectures mentioned above. For the loss function, we use the Dice loss, highlighted by [33] as one of the best loss functions for medical image segmentation.
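As an illustration, the generic (soft) Dice loss can be written in a few lines of NumPy; this is a sketch of the standard formulation, not the exact implementation used in our experiments:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2|P.G| / (|P| + |G|).

    pred:   predicted foreground probabilities in [0, 1].
    target: binary ground-truth mask (same shape as pred).
    """
    pred, target = pred.ravel(), target.ravel()
    intersection = np.sum(pred * target)
    dice = (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)
    return 1.0 - dice
```

A perfect prediction gives a loss close to 0, while fully disjoint masks give a loss close to 1; the small `eps` term avoids division by zero on empty masks.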

Post-Processing of Segmentation Results
It can happen that the segmentation results contain noise or incomplete parts such as holes, which can cause inaccurate ellipse fitting. Thus, some post-processing is applied to the segmentation results: contours are detected from the segmentation map with the Canny filter, and then the largest connected component is kept.
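A minimal sketch of such a clean-up step, here using SciPy's connected-component labeling and hole filling as a stand-in for the Canny-based pipeline described above (the function name is ours):

```python
import numpy as np
from scipy import ndimage

def clean_mask(mask):
    """Keep the largest connected component of a binary mask and fill holes.

    mask: 2D boolean/0-1 array produced by the segmentation network.
    """
    labeled, num = ndimage.label(mask)
    if num == 0:
        return np.zeros_like(mask, dtype=bool)
    # Component sizes, one per label (labels start at 1; 0 is background).
    sizes = ndimage.sum(mask, labeled, range(1, num + 1))
    largest = np.argmax(sizes) + 1
    cleaned = labeled == largest
    # Fill interior holes so ellipse fitting sees a solid head region.
    return ndimage.binary_fill_holes(cleaned)
```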

HC Computation Based on Segmentation Results
After post-processing the segmentation results, the next step is to perform ellipse fitting in order to obtain the parameters of the ellipse (long axis, short axis, center point, angle) and compute its length. The length of the ellipse, denoted HC, is approximated with Ramanujan's approximation [34]:

$$\mathrm{HC} \approx \pi (a + b) \left( 1 + \frac{3h}{10 + \sqrt{4 - 3h}} \right), \qquad h = \left( \frac{a - b}{a + b} \right)^2,$$

where $a$ and $b$ are the semi-major and semi-minor axes of the ellipse.
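In code, Ramanujan's approximation amounts to the following short sketch, where `a` and `b` are the semi-axes returned by any ellipse-fitting routine:

```python
import math

def ellipse_circumference(a, b):
    """Ramanujan's approximation of an ellipse perimeter.

    a, b: semi-major and semi-minor axes (same unit, e.g. mm).
    """
    h = ((a - b) / (a + b)) ** 2
    return math.pi * (a + b) * (1.0 + 3.0 * h / (10.0 + math.sqrt(4.0 - 3.0 * h)))
```

For a circle (a = b = r) the formula reduces exactly to 2πr; for moderately eccentric ellipses such as fetal head contours, the approximation error is negligible compared to segmentation error.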

Regression CNN Model
As shown in Figure 2, the regression CNN is composed of a CNN backbone and a regression layer (linear activation function), which learns features of the input fetus head image to estimate the HC value directly. The backbone CNNs that we experimented with are state-of-the-art architectures: VGG16 [30], ResNet50 [31], EfficientNetB2 [32], DenseNet121 [35], Xception [36], MobileNet [37], and InceptionV3 [38]. In order to improve model convergence, and for the reasons stated in the previous section, we use them pretrained on ImageNet [39] and fine-tune them for the task at hand.
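A minimal sketch of such a regression model with the Keras API is shown below. The head layout is an illustrative assumption, and `weights=None` is used only to keep the sketch self-contained (in practice, pass `weights="imagenet"` for the pretraining described above):

```python
import tensorflow as tf

def build_regression_model(input_shape=(224, 224, 3), dropout_rate=0.7):
    """Regression CNN sketch: CNN backbone + dropout + linear output head."""
    backbone = tf.keras.applications.EfficientNetB2(
        include_top=False, weights=None, input_shape=input_shape, pooling="avg"
    )
    x = tf.keras.layers.Dropout(dropout_rate)(backbone.output)
    out = tf.keras.layers.Dense(1, activation="linear")(x)  # direct HC value
    model = tf.keras.Model(backbone.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mae")
    return model
```

The single linear output unit is what turns a classification backbone into a regressor: the network maps the pooled feature vector to one continuous HC value.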

Loss Functions
The loss functions commonly used in regression CNNs include the mean absolute error (MAE) loss, the mean squared error (MSE) loss, and the Huber loss (HL), defined as:

$$\mathcal{L}_{\mathrm{MAE}} = \frac{1}{N} \sum_{i=1}^{N} |p_i - g_i|, \qquad \mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (p_i - g_i)^2,$$

$$\mathcal{L}_{\mathrm{HL}} = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} \frac{1}{2} (p_i - g_i)^2 & \text{if } |p_i - g_i| \le \delta, \\ \delta \left( |p_i - g_i| - \frac{1}{2}\delta \right) & \text{otherwise,} \end{cases}$$

where $p_i$ is the predicted HC value, $g_i$ is the ground-truth HC value (in pixels), and $N$ is the number of images. $\delta$ is a hyperparameter, empirically set to 1. We investigate all three of them, as there is no heuristic to choose one loss over another, as highlighted in [40].
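The three losses above can be sketched in NumPy as follows; deep learning frameworks provide equivalent built-ins (e.g., `tf.keras.losses.Huber`):

```python
import numpy as np

def mae_loss(p, g):
    """Mean absolute error between predictions p and targets g."""
    return np.mean(np.abs(p - g))

def mse_loss(p, g):
    """Mean squared error between predictions p and targets g."""
    return np.mean((p - g) ** 2)

def huber_loss(p, g, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = np.abs(p - g)
    return np.mean(np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta)))
```

The Huber loss combines the robustness of MAE to outliers with the smooth gradients of MSE near zero, which is why it is a common compromise in regression.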

Model Configuration
For regression models, both the weights of the CNN feature extractor and those of the regression layer are trainable. As the amount of training data in HC18 is limited, to avoid over-fitting we set the dropout rate to 0.7; in other words, 70% of the activations are randomly dropped during training. The number of trainable parameters of each model is listed in Table 1.

Dataset and Pre-Processing
The HC18 dataset [1] contains 999 US images acquired during the various trimesters of pregnancy, along with the corresponding ground truth of the skull contour map and HC values. The reference contour of the fetus head is annotated as an ellipse by a professional sonographer, and the HC value, as well as the pixel size of each image, is given in a text file. The gestational age range of this dataset is 10-40 weeks [1].
Image preprocessing includes resizing from 800 × 540 pixels to 224 × 224, and normalization by subtracting the mean and dividing by the standard deviation. The HC values are normalized by dividing by the maximum HC value, in order to improve convergence. We randomly split the dataset into a training set (600 images), a validation set (199 images), and a test set (200 images). We augment the training set by horizontal flipping and rotation by 10 degrees, resulting in 1800 training images.
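The two normalization steps can be sketched as follows (resizing is done beforehand with any standard image library; the function names are ours):

```python
import numpy as np

def normalize_image(img):
    """Zero-mean, unit-variance normalization of a grayscale US image."""
    img = img.astype(np.float32)
    return (img - img.mean()) / (img.std() + 1e-8)

def normalize_hc(hc_mm, hc_max):
    """Scale an HC target to [0, 1] by the maximum HC of the training set."""
    return hc_mm / hc_max
```

At inference time, the network output is simply multiplied back by `hc_max` to recover the HC in millimeters.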

Experiment Configuration
Both approaches, segmentation and regression, are evaluated with the same protocol, namely 5-fold cross-validation, the folds being identical for all methods. We use the Adam optimizer with a learning rate of $10^{-4}$. The batch size is 16, and training runs for 100 epochs. The implementation is based on Keras, using the public Python library Segmentation Models [41]. Models are trained on a Tesla P100 GPU server with 16 GB of memory.

Evaluation Metrics
Evaluation metrics for the segmentation results are the Dice index (DI), the Hausdorff distance (HD), i.e., the maximum point-to-point distance between two contours, and the average symmetric surface distance (ASSD), computed between the segmented result (Seg) and the ground truth (GT) map. We use the mean absolute error (MAE) and the percentage MAE (PMAE) to compare the predicted and ground-truth HC values:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |p_i - g_i|, \qquad \mathrm{PMAE} = \frac{100\%}{N} \sum_{i=1}^{N} \frac{|p_i - g_i|}{g_i}.$$
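The two HC error metrics can be sketched as follows, assuming predictions and references are given in millimeters:

```python
import numpy as np

def hc_errors(pred_mm, gt_mm):
    """Return (MAE, PMAE) between predicted and reference HC values (mm)."""
    pred_mm = np.asarray(pred_mm, dtype=float)
    gt_mm = np.asarray(gt_mm, dtype=float)
    abs_err = np.abs(pred_mm - gt_mm)
    mae = abs_err.mean()
    pmae = (abs_err / gt_mm).mean() * 100.0  # percentage of the reference HC
    return mae, pmae
```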

HC Estimation Based on Segmentation
We train and test six different segmentation architectures (U-Net, U-Net++, DoubleU-Net, FPN, LinkNet, and PSPNet) with three CNN backbones (VGG16, ResNet50, and EfficientNet). We found that the segmentation models with a pretrained ResNet50 backbone outperformed the other two. We therefore only report the detailed quantitative evaluation for the ResNet50 backbone, to which we added the original U-Net architecture [17]; the results, shown in Table 2, contain both the segmentation accuracy and the HC estimation MAE.
From Table 2, one can gather that, segmentation-wise, all models obtained similar scores, as shown by the DI, HD, and ASSD columns. However, when it comes to the HC estimation error, U-Net-B2 and LinkNet-B2 are the best architectures, as assessed by a two-sided, paired Student's t-test between pairs of method scores, which resulted in a p-value below 0.05 for these two networks. They achieve MAE values (after post-processing) of 1.08 mm and 1.15 mm, respectively. Post-processing yields a small improvement in the MAE value.
We also analyzed some segmentation results (Figure 3) on ambiguous, low-contrast US fetus head images; the influence of image noise and artifacts on segmentation-based methods is smaller than on segmentation-free methods (presented in Figure 4).

HC Estimation Based on Regression CNN
We train and test regression CNN architectures with seven different pretrained CNN backbones, experimenting with three regression loss functions (MAE loss, MSE loss, and Huber loss) on the HC18 dataset. The evaluation of direct HC estimation is given in Table 3. We find that the regression EfficientNet (Reg-B3-L1), in conjunction with the MAE loss, performs better than the other CNN models: the resulting MAE for this regression network is 1.83 mm.

Saliency Maps of Regression CNN Results on HC
Contrary to segmentation models, regression models come at the cost of low interpretability, i.e., the model does not provide explicit explanations along with the HC prediction. In order to shed light on what is indeed learned by the regression CNN, we use a post hoc explanation method to analyze the regression model. In our previous work [42], we showed, with a quantitative perturbation study, that the Layer-wise Relevance Propagation (LRP) method [43] was appropriate to explain CNN regression models for this application. The idea of LRP is to compute a relevance score for each input pixel, layer by layer, in a backward direction. It first forward-passes the image to collect the activation maps, and then backpropagates relevance scores, taking into account the network weights and activations, yielding saliency maps [44] in which the areas that contributed most to the decision are highlighted. Note that in [45], the authors also used the LRP method to explain the results of a regression CNN aimed at counting leaves in plant photographs. One can see from Figure 5 that the regression CNN does find key features along the head contour in the input US images and relies, to some extent, on many contour pixels to make the HC estimation.

Saliency Maps on Outlier Analysis
We also display some saliency maps where regression models fail to make an accurate estimation (see Figure 4). We observe that the features extracted by regression CNN models are fooled by the hypersignal (i.e., high intensity pixels) above the head, which leads to increased predicted HC values. This illustrates the case where the background is heterogeneous and makes it difficult for the network to distinguish the head contour and thus to accurately estimate the head circumference.

Comparison of Segmentation CNN vs. Regression CNN
To compare the performance of the segmentation-free vs. the segmentation approaches, we gathered the two best results from Tables 2 and 3 into Table 4. From this table, one can see that the best segmentation approach (U-Net-B2: U-Net with pretrained ResNet50 and post-processed segmentation results) is better than the best regression approach (Reg-B3-L1) by 40.7%. We can also notice from Figure 6 that both segmentation and regression methods fit the data correctly, the fit of the segmentation-based method being even smoother.

We also analyze the agreement between the HC values estimated by both types of methods and the reference HC values via linear regression. From Figure 7, one can first observe a remarkable linear correlation between the predicted and reference values for all four models, whether segmentation or regression. A slight deviation of the regression CNN models in the top-right corner shows that the regression models tend to underestimate large HC values (a trend that also appears in the Bland-Altman analysis). The Bland-Altman plot is another way to analyze the agreement between two measurements: plotting the difference between the measurements against their mean makes it easy to spot a bias. From the Bland-Altman plot in Figure 8, obtained on a fold of 200 test images, we observe that regression approaches struggle with larger fetus heads, which is interesting since segmentation approaches usually fail on small structures. One can also see that, for the segmentation models, 8 out of 200 points lie outside the 95% agreement limits; for the regression models, there are 12 outliers out of 200, mostly for larger HC values. Unsurprisingly, room for improvement is left for regression-based approaches.
One can also identify the 95% agreement limits: for the best segmentation model, they are [−3.12 mm, 0.7 mm], and for the best regression model, [−3.25 mm, 2.92 mm]. We can compare these limits to the 95% agreement limits of inter-operator variability, which are ±12 mm ([2], Table 1, p. 272): the fact that they are much smaller highlights the high relevance of both segmentation-based and segmentation-free approaches as an alternative for automatically estimating the HC from US images. However, the comparison with manual variability should be handled with care, as those results were not obtained on the same dataset. In Figure 8, the x-axis represents the average of the ground-truth and predicted HC, and the y-axis their difference (in mm); the solid red lines represent the upper and lower 95% agreement limits, and the dotted green line the mean difference.
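The 95% agreement limits reported above follow the standard Bland-Altman definition, mean difference ± 1.96 standard deviations of the differences, which can be sketched as:

```python
import numpy as np

def limits_of_agreement(pred, ref):
    """95% Bland-Altman limits of agreement between two measurement series."""
    d = np.asarray(ref, dtype=float) - np.asarray(pred, dtype=float)
    bias = d.mean()
    sd = d.std(ddof=1)  # sample standard deviation of the differences
    return bias - 1.96 * sd, bias + 1.96 * sd
```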

Memory Usage and Computational Efficiency
The theoretical memory usage of a CNN during training includes storing the network parameters and the activation outputs of every layer, used to compute the gradients, for each batch. As shown in Table 5 (column Mem-M), regression CNN models generally require less memory than segmentation-based approaches, as one could expect. In practice, however, the gap between regression and segmentation models is not so large, as shown by the actual memory cost at prediction time, defined as the maximum memory used once inference is stable (computed using the Python library Memory Profiler). In particular, the best regression method (Reg-B3-L1) requires even more memory than the segmentation methods.
As Table 5 shows, one training epoch over the 1800 training US images takes 29 s on a Tesla P100 GPU for the segmentation method U-Net-B2 (U-Net with ResNet50), and 20 s for the best regression model Reg-B3-L1 (EfficientNet). At prediction time, on an Intel Core i7 CPU with 32 GB of RAM, Reg-B3-L1 only takes 36.95 s over 200 test images; in other words, predicting one image requires 0.18 s, to be compared with 0.35 s for U-Net-B2. Segmentation-based methods thus require more time than segmentation-free methods, at training as well as at inference. In conclusion, while the advantage of the regression-based approach is clear computation-wise, there is no clear evidence that regression models are less memory-greedy in the experimental conditions we set up. It is worth noting that, with the continuous progress of hardware and computing power, such a time difference between segmentation-based and segmentation-free methods may become negligible in clinical practice.

Comparison of HC Estimation with State-of-the-Art
Finally, the proposed segmentation-based and segmentation-free methods are compared with state-of-the-art (SotA) segmentation methods, which constitute the standard way of estimating the HC, as recalled in Section 2.1. Although a fully accurate comparison is not possible since the experimental protocols differ across papers (e.g., in [16], the model is trained on the HC18 dataset combined with other fetus head US images), we provide the results as a means to estimate the order of magnitude of the estimation error. The results given in Table 6 show that the estimation error of the proposed segmentation-based and regression-based approaches, based on transfer learning, is of the same order of magnitude as that of approaches made of multiple ad hoc steps and dedicated to this task.

Conclusions and Future Work
In this paper, we addressed the problem of HC estimation from US images via both a conventional segmentation approach with post-processing and ellipse fitting, and a regression-based approach that directly predicts the HC without any segmentation step. Our idea was to quantify how far regression-based approaches stand from segmentation approaches when the final task is to estimate a parameter, i.e., a biomarker, from the image. Although segmentation-based methods provide interpretable results for HC estimation, because the segmentation result is visible, they often require dedicated post-processing steps. On the other hand, regression approaches based on CNNs are end-to-end, less costly, and less prone to error; even though they do not offer explicit interpretability, this aspect can be explored using saliency maps, for example [42]. We explored both segmentation-based and segmentation-free approaches with state-of-the-art CNN architectures and backbones. By setting the same experimental conditions, we proposed a fair, quantitative comparison of these two approaches, in order to assess whether the direct estimation approach is viable for this task. Even though the estimation error is higher with the regression networks, the results are still promising and in line with inter-operator variability. Therefore, direct, regression-based estimation approaches have a high potential that should be explored further. While we used general-purpose architectures for our regression methods, it would be interesting to investigate architectures customized for this task, including attention mechanisms.
In future work, we will assess the generic regression CNNs on other medical datasets to estimate multiple biomarkers. In addition, we plan to investigate segmentation-free approaches with other, more recent architectures that have a higher feature-representation ability, e.g., transformer architectures, as well as multi-task learning combining a segmentation branch and a regression branch.