Deep Learning Automated Segmentation for Muscle and Adipose Tissue from Abdominal Computed Tomography in Polytrauma Patients

Manual segmentation of muscle and adipose compartments from computed tomography (CT) axial images is a potential bottleneck in early rapid detection and quantification of sarcopenia. A prototype deep learning neural network was trained on a multi-center collection of 3413 abdominal cancer surgery subjects to automatically segment truncal muscle, subcutaneous adipose tissue and visceral adipose tissue at the L3 lumbar vertebral level. Segmentations were externally tested on 233 polytrauma subjects. Abdominal CT scans after severe trauma must be delivered quickly and robustly, and therefore often contain motion or scatter artefacts, incomplete vertebral bodies or arms that degrade image quality. Nevertheless, the concordance was generally very good for the body composition indices of Skeletal Muscle Radiation Attenuation (SMRA) (Concordance Correlation Coefficient (CCC) = 0.92), Visceral Adipose Tissue Index (VATI) (CCC = 0.99) and Subcutaneous Adipose Tissue Index (SATI) (CCC = 0.99). In conclusion, this article presents an automated and accurate system for segmenting the cross-sectional muscle and adipose area at the L3 lumbar spine level on abdominal CT. Future work will include fine-tuning the algorithm and minimizing the outliers.


Introduction
Body composition, defined as the amount of fat relative to the amount of muscle in the body, has been linked to clinical outcomes in a number of conditions [1][2][3]. Sarcopenia, the progressive loss of skeletal muscle mass (SMM), is considered to be a muscle disease inducing muscle impairment [4]. Sarcopenia was initially described in the elderly population as loss of muscle mass and strength with advancing age [5], but has since been shown to be a problem in various patient groups. Poor muscle status and increased adipose tissue are associated with significant health effects and increased morbidity and mortality [1]. The clinical application of altered body composition and sarcopenia in risk assessment and treatment decisions, therefore, warrants early identification. CT imaging at the level of the third lumbar vertebra is the current gold standard for quantification of muscle mass [5][6][7][8][9][10]. Parameters are obtained from the analysis of a single CT slice. Shen et al. demonstrated that skeletal muscle area (SMA), which is the cross-sectional muscle area calculated at the level of the third lumbar vertebra, is able to correctly estimate total body muscle mass [11]. To relate SMA to total muscle mass, it is normalized to the square of height, resulting in the skeletal muscle index (SMI) [12].
Muscle radiation attenuation, calculated as the average Hounsfield Unit (HU) of the cross-sectional muscle area, is a measurement of muscle density, with lower values indicating higher muscle fat content. Segmentation of muscle area in the determination of sarcopenia on CT is done manually, by tracing the muscle group margins on axial sections. Although the inter- and intra-observer agreement for this method is excellent [13], it is a labor-intensive process, which requires diagnostic accuracy and time-consuming involvement from a radiologist. The required time investment limits the use of CT for routine clinical measurement of muscle mass, and CT is therefore not widely used for early identification of patients at risk.
Deep learning techniques are emerging to support diagnoses with high accuracy, enhancing the speed of image interpretation and, thus, improving clinical efficiency for a wide range of medical tasks. For example, recent studies showed improved and accurate body composition assessment on CT [14,15]. For muscle mass measurement, recent studies created and validated automated segmentation of the abdominal muscle on manually extracted CT images at the L3 level [16][17][18][19]. These studies, which included patients with varying medical conditions, provided detailed information regarding the severity of sarcopenia. However, patients had been excluded when a motion or scatter artefact was present, resulting in a highly restricted population; moreover, they were not multi-center studies and all studies suffered from small sample sizes [16,17].
Following severe trauma, abdominal CT scans are routinely obtained in the emergency evaluation upon admission. These CT scans provide accurate information regarding internal injuries, but need to be quickly and robustly delivered and must not cause a delay in definitive trauma care. This implies that trauma CT scans often contain motion or scatter artefacts, incomplete vertebral bodies or contain arms that influence image quality [20]. Another limiting factor is the heterogeneity in polytrauma patients, such as the different trauma mechanisms, medical conditions and ages. As recently shown, the prevalence of sarcopenia in older trauma patients is significant [21], thereby putting these patients at risk for complications.
Since early identification of sarcopenia in these patients would be clinically valuable for trauma case management, and since an abdominal CT scan is often present, the aim of this study was to investigate whether trauma-CT images can be used in a new deep learning algorithm for quantification of muscle mass. The necessary requirement was rapid automated segmentation and on-the-spot computation of body composition. In this article, we report on the initial development and external generalizability test of a deep learning neural network that was trained on thousands of CTs of abdominal cancer surgery patients. We hypothesized that by allowing the network to train on a large multi-institutional dataset, including a wide range of scanner manufacturers and image acquisition settings, the automated segmentation would be more widely generalizable. We then tested whether the deep learning model could segment and quantify truncal musculature at L3 lumbar vertebral level on abdominal CT scans of polytrauma patients, for the potential use in rapid clinical detection of sarcopenia.

Training and Validation Set: Cancer Surgery Cases
A deep learning neural network was trained on a multi-center collection of 3413 abdominal cancer surgery subjects to automatically segment truncal musculature, subcutaneous adipose and visceral adipose at the L3 lumbar vertebral level. Several different cohorts of patient groups were used to train and validate the algorithm. Two cohorts of patients with colorectal liver metastases were selected in the UK [22]. One cohort of patients with colorectal liver metastases was selected from Zuyderland Medical Center. Two cohorts of patients with ovarian cancer consist of data from five centers in the Netherlands [23,24]. One cohort consists of 304 patients treated for a resectable PDAC of the pancreatic head in the Uniklinik Aachen (Aachen, Germany) and MUMC+ (Maastricht, Netherlands).

Independent Test Set: Polytrauma Cases
Patients with an injury severity score (ISS) greater than or equal to 16 who presented at a single level-1 trauma center between 2015 and 2019, and who had an abdominal computed tomography (CT) axial image (supine position at third lumbar level) at admission, were extracted from a regional trauma registry. Included patients were selected as a related dataset that would pose an independent and challenging validation for a deep-learning algorithm trained on the aforementioned surgical CTs.
From the segmented areas, we computed Skeletal Muscle Radiation Attenuation (SMRA), Skeletal Muscle Index (SMI), VAT index (VATI) and SAT index (SATI) in a well-established manner [12,26]. No CT images were excluded on the basis of motion artefacts, scatter artefacts, signal-to-noise ratio, abdominal wall hernia or presence of hands/arms in the field of view.
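As an illustration of how such indices can be derived from a segmented slice, the following minimal sketch computes SMRA as the mean HU inside the muscle mask and SMI as muscle area divided by height squared, as described in the Introduction. Defining VATI and SATI by the same height-squared normalization is an assumption made here by analogy with SMI, and the function name, argument names and the use of a per-slice pixel spacing are hypothetical rather than taken from the study's software.

```python
import numpy as np


def body_composition_indices(hu, muscle, vat, sat, pixel_spacing_mm, height_m):
    """Compute SMRA (HU) and SMI, VATI, SATI (cm^2/m^2) for one axial slice.

    hu               : 2D array of Hounsfield Units
    muscle, vat, sat : boolean masks with the same shape as `hu`
    pixel_spacing_mm : (row spacing, column spacing) in millimetres
    height_m         : patient height in metres
    """
    # Area of a single pixel in cm^2 (1 cm^2 = 100 mm^2).
    pixel_area_cm2 = pixel_spacing_mm[0] * pixel_spacing_mm[1] / 100.0
    height_sq = height_m ** 2

    def area_cm2(mask):
        return float(mask.sum()) * pixel_area_cm2

    # SMRA is undefined when no muscle pixel was segmented.
    smra = float(hu[muscle].mean()) if muscle.any() else float("nan")
    return {
        "SMRA": smra,
        "SMI": area_cm2(muscle) / height_sq,
        # Assumption: adipose indices are normalized like SMI.
        "VATI": area_cm2(vat) / height_sq,
        "SATI": area_cm2(sat) / height_sq,
    }
```

In practice the pixel spacing would be read from the image metadata of each slice, since it differs between scanners and reconstructions.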

Deep Learning
A deep-learning neural network (DLNN) for multi-label segmentation of L3M, VAT and SAT was based on a standard two-dimensional U-Net [27], with only minor adjustments to fit this task. CT images were pre-processed using a widely used deep learning procedure, in which the CT image intensities were first clipped to fall between −200 HU and +200 HU. A schematic block diagram of the DLNN architecture is given as Figure S1 in the Supplementary Materials. Briefly, blocks in the downsizing path each consisted of two convolution layers with pixel padding of 1 to maintain consistent size of input to the next layer. Batch normalization [28] was performed after every convolution layer, with max-pooling [29] in between every block. Ridge regression regularization [30] was applied to reduce the effect of over-fitting. The upsizing path comprised transposed convolutions concatenated with some downsizing path features introduced via skip-connections. All except the final convolution layer used parametric rectified linear units (PReLUs) [31] as activation function, and the network was trained with the ADAM optimizer [32]. The final convolution layer was followed by softmax activation to extract probability density maps for four segmentation labels (L3M, VAT, SAT or other). Hyper-parameter fine-tuning was performed exclusively on the training dataset through trial-and-error; the final hyper-parameters used for this work are stated in Table S1 in the Supplementary Materials.
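The two ends of this pipeline, intensity clipping on the input side and softmax-based labelling on the output side, can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the rescaling to the range 0 to 1 after clipping, the channel-first layout of the network output and the label order are hypothetical choices that the text does not specify, and the network itself is not reproduced here.

```python
import numpy as np

HU_MIN, HU_MAX = -200.0, 200.0  # clipping window stated in the text
# Assumption: channel order of the four output labels.
LABELS = ("other", "L3M", "VAT", "SAT")


def clip_hu(image_hu):
    """Clip CT intensities to the [-200, +200] HU window."""
    return np.clip(image_hu, HU_MIN, HU_MAX)


def rescale_unit(clipped_hu):
    """Map clipped intensities to [0, 1] (assumed normalization step)."""
    return (clipped_hu - HU_MIN) / (HU_MAX - HU_MIN)


def softmax(logits, axis=0):
    """Numerically stable softmax along the label axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def label_map(logits):
    """Turn (labels, rows, cols) logits into a per-pixel label index map."""
    probabilities = softmax(logits, axis=0)
    return probabilities.argmax(axis=0)
```

Each pixel is assigned the label with the highest softmax probability, so the three tissue masks are mutually exclusive by construction.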

Performance Statistics
For each case in the trauma dataset, the geometric similarity of deep learning automatic segmentation was validated against the reference source of truth using a Dice Similarity Coefficient (DSC). We computed four body composition metrics (SMRA, SMI, VATI and SATI) from the deep-learning generated segmentations to compare against the reference values. Agreement was quantified using Lin's Concordance Correlation Coefficient (CCC) [33] and Bland-Altman's Limits of Agreement interval (LOA) [34]. The Bland-Altman plots are presented in the Supplementary Materials as Figure S2. Additionally, as a rough estimate of inter-observer divergence, we used SMI values from our previously published paper [21] and compared these to SMI from the abovementioned reference L3M segmentations. All statistical analyses were performed in R [35] (version 4.0.3) using the "epiR" library [36] (accessed on 15 January 2021).
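For readers who want to see what these agreement statistics measure, the sketch below computes Lin's CCC, its bias correction factor and the Bland-Altman limits of agreement in Python. It is an illustrative re-implementation and not the R code used in this study: it uses population (1/n) moments for the CCC and a 1.96 standard deviation interval for the limits, which may differ in small details from the "epiR" defaults, and the function names are hypothetical.

```python
import numpy as np


def lins_ccc(x, y):
    """Return (CCC, bias correction factor) for paired measurements.

    CCC = 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2), and the bias
    correction factor is CCC divided by the Pearson correlation.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()  # population variances (ddof=0)
    cov_xy = ((x - mean_x) * (y - mean_y)).mean()
    ccc = 2.0 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)
    pearson = cov_xy / np.sqrt(var_x * var_y)
    return ccc, ccc / pearson


def limits_of_agreement(reference, automated):
    """Return (bias, lower limit, upper limit) of automated minus reference."""
    diff = np.asarray(automated, dtype=float) - np.asarray(reference, dtype=float)
    bias = diff.mean()
    spread = 1.96 * diff.std(ddof=1)  # sample standard deviation
    return bias, bias - spread, bias + spread
```

A systematic offset between two otherwise perfectly correlated series lowers the CCC through the bias correction factor even though the Pearson correlation stays at 1, which is why CCC is preferred over plain correlation for agreement studies.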

Results
A wide range of abdominal surgery cases from 31 distinct cancer centers made up the training dataset. The general case mix summary is given in Table 1. Scanners from at least four different manufacturers were included, but possibly more, since vendor information was not always retained in the DICOM metadata. Median values for imaging parameters were 120 kVp, 5 mm reconstructed slice thickness and 0.768 mm × 0.768 mm axial spacing, and 90% of patients were scanned in the feet-first supine position. Arms and hands had been kept out of the L3 slice field of view, being held on or folded over the chest. Image quality metrics of Signal-to-Noise Ratio (SNR) and Contrast-to-Noise Ratio (CNR) are also reported in Table 1. The SNR was calculated for skeletal muscle compared to background, and the CNR was calculated for fat with respect to muscle.
In total, 233 polytrauma cases were suitable for inclusion as the test dataset. A flowchart of case attrition numbers is included as Figure 1, and the general case characteristics are also shown in Table 1. The average age was 74 (range 10-88) years and the mean BMI was 29.5 (range 13.2-45.7) kg/m². There were 156 male patients (67%) and 77 female patients. All cases were imaged according to the polytrauma protocol settings. Patients were scanned in the head-first supine position, with slice thickness 1-3 mm, 120 kVp and 254 mAs/slice, with an iodine-based contrast agent. Two scanners were used: a Philips Brilliance 64-slice scanner and a Siemens SOMATOM Definition Flash CT scanner.

Similarity of Segmentation
The overall geometric accuracy of automated segmentation compared to the reference segmentation was good; median DSC (and interquartile range) for L3M, VAT and SAT were 0.926 (0.866-0.959), 0.951 (0.888-0.974) and 0.953 (0.916-0.975), respectively. A DSC of 1 corresponds to perfect agreement, whereas a DSC of 0 corresponds to no overlap at all between the reference and automated segmentation.
For added transparency of our results, the distribution of DSC in each of the three body components is shown in a box-and-whisker plot as Figure 2. It is clear that, while overall geometric performance is good, there are several cases with low DSC. There was only a single case where the automated segmentation failed to produce any output at all; hence, its DSC was exactly 0.
From all 233 individual results of the automated segmentation, we arbitrarily selected six as representative examples of the overall geometric findings. These are presented in Figure 3. The majority of cases with poor automated segmentation performance coincided with unusually noisy or poor-quality CT images. The second most common cause of discrepancies was the appearance of hands and/or arms in the CT field of view, which were only rarely present in the training dataset. Much rarer trauma cases showing poor automated segmentation results involved the presence of external foreign objects adjacent to the abdomen, atypical anatomy (such as extremely low muscle mass) or clear signs of subcutaneous emphysema.
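The DSC values reported in this section follow the usual overlap definition, twice the intersection divided by the sum of the two mask sizes. A minimal sketch is given below; the function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not something the text specifies.

```python
import numpy as np


def dice_similarity(reference, automated):
    """Dice Similarity Coefficient between two binary masks."""
    ref = np.asarray(reference, dtype=bool)
    auto = np.asarray(automated, dtype=bool)
    total = ref.sum() + auto.sum()
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    overlap = np.logical_and(ref, auto).sum()
    return 2.0 * overlap / total
```

With this definition, an empty automated mask compared against a non-empty reference gives a DSC of exactly 0, which matches the single failed case described above.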

Agreement Analysis
Quantitative assessment of Lin's CCC and Bland-Altman's LOA intervals indicated generally good results. The concordance was generally very good for the body composition indices of SMRA (CCC = 0.92), VATI (CCC = 0.99) and SATI (CCC = 0.99), where a CCC of 1 corresponds to perfect agreement. Visual confirmation of these results is shown as the concordance plots in Figure 4. The overall result for SMI was degraded (CCC = 0.71) in comparison to the other indices. From Figure 4b, it is clear that hands/arms in the CT view lead to a consistent and systematic over-estimation of the SMI value in the polytrauma cases relative to the reference truth.
Exploring the differences between CT slices with or without hands/arms in the field of view, we see from Table 2 that hands/arms in the field of view tended to decrease the overall CCC for SMRA and SMI. The SMI agreement for only those cases with hands/arms in the field of view was significantly reduced (subgroup CCC = 0.58). Similar trends were shown by the bias correction factor (which is the scale shift required to align the data points around the ideal concordance line) and LOA intervals; the bias corrections were close to 1 (which implied perfect agreement) and the LOA intervals were close to 0 (which implied perfect agreement), except for SMI and particularly for SMI in the subgroup with hands/arms included in the CT.
We additionally compared the reference truth in this work with previously reported SMI values [21]. This provided a very limited but nonetheless indicative estimate of potential disagreement between independent observers working on the same data at different times (see again Table 2). The interobserver CCC for SMI was 0.88, which was slightly better than the automated SMI CCC of 0.83 on just those images without hands/arms, but the difference was not statistically significant.
Sarcopenia was determined using cutoff values for SMI as described by Prado et al. (52.4 cm²/m² and 38.5 cm²/m² for males and females, respectively). The accuracy of sarcopenia classification based on the automated segmentation relative to reference segmentation was 77% (sensitivity 59%, specificity 96%). The numbers of automated false positives and false negatives were 5/233 (2%) and 48/233 (21%), respectively. Considering only CTs with no hands or arms in the field of view, the accuracy was 90%, sensitivity 86% and specificity 94%, with 4/131 (3%) false positives and 9/131 (7%) false negatives.
For comparison, a previous investigator who independently segmented the polytrauma dataset had, relative to the present reference segmentation, an accuracy of 82%, sensitivity of 66% and specificity of 98%, with false positives in 2/233 (1%) and false negatives in 40/233 (17%), though there was no difference in the human-to-human discriminative performance with or without hands/arms in the CT field of view.
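The classification step and the reported accuracy, sensitivity and specificity can be illustrated with the short sketch below. It applies the sex-specific SMI cutoffs quoted above and tallies a confusion matrix against the reference labels. Treating an SMI strictly below the cutoff as sarcopenic is an assumption about how the threshold was applied, and the function names and the "M"/"F" sex coding are hypothetical.

```python
import numpy as np

# SMI cutoffs in cm^2/m^2 quoted in the text (Prado et al.).
SMI_CUTOFF = {"M": 52.4, "F": 38.5}


def is_sarcopenic(smi, sex):
    """Assumed rule: sarcopenic when SMI is below the sex-specific cutoff."""
    return smi < SMI_CUTOFF[sex]


def classification_metrics(reference, automated):
    """Accuracy, sensitivity and specificity of automated vs reference labels."""
    ref = np.asarray(reference, dtype=bool)
    auto = np.asarray(automated, dtype=bool)
    tp = int(np.sum(ref & auto))
    tn = int(np.sum(~ref & ~auto))
    fp = int(np.sum(~ref & auto))
    fn = int(np.sum(ref & ~auto))
    return {
        "accuracy": (tp + tn) / ref.size,
        # Undefined (NaN) when the reference has no positives or no negatives.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "false_positives": fp,
        "false_negatives": fn,
    }
```

Under this rule, an over-estimated SMI pushes borderline patients above the cutoff, so systematic over-segmentation would be expected to show up mainly as false negatives, in line with the pattern reported above.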

Discussion
This work showed that an automated deep learning segmentation algorithm trained on a massive and diverse multi-center surgical dataset yielded overall good geometric agreement with respect to a human expert reference segmentation, for adipose and muscle tissues, on abdominal CT scans at the L3 level. Geometric performance was degraded due to the presence of hands in the polytrauma test dataset, which was only very rarely found in the surgical training dataset. Geometric agreement was a surrogate measure of clinical relevance; in tests for agreement against human reference-based SMRA, SMI, VATI and SATI values, the deep learning algorithm performed well overall on CCC and LOA measures. The algorithm consistently and systematically over-estimated the SMI, as was confirmed in the geometric comparisons.
This constitutes the first suggestion that our deep learning model could potentially be used in the future, subject to further development and extensive validation, to provide rapid body composition quantification, and for supporting the diagnosis of sarcopenia in trauma patients, from a generic L3 abdominal axial CT. To the best of our knowledge, no deep learning system has yet been robustly tested to this degree, using related but wholly independent CT images from a completely different clinical setting.
Other studies approached the automated segmentation of muscle, VAT and SAT within the L3 vertebra region in a variety of ways, also including deep learning. These reported DSC ranged from 0.85-0.99 [16][17][18][19]37]; however, good comparisons between those studies cannot be made due to the use of similar patient cohorts in development and accuracy testing and restrictive sample sizes. These models were able to extract semantic information, overall muscle shape and adipose tissue; nevertheless, these studies analyzed computed tomography scans, which were made in a controlled (non-trauma) setting [16,17,38].
Given that abdominal CT scans are almost routinely obtained in the emergency evaluation after severe trauma, and that the features in CT scans that could be used to diagnose sarcopenia are presently not being used because of the time-consuming effort, the need for automated segmentation is evident. Although these CT scans were rapidly and non-selectively obtained, and often contained motion or scatter artefacts, imperfect vertebral bodies, arms/hands in the view and/or poorer image quality, the concordance was generally very good for the body composition indices of SMRA (CCC = 0.92), VATI (CCC = 0.99) and SATI (CCC = 0.99). The interobserver CCC in SMI was 0.88 (95% confidence interval 0.86-0.91). This provides a very limited but indicative estimate of potential disagreement between independent observers working on the same data at different times. The agreement in SMI for the automated segmentation initially looks much poorer, but closer inspection of Table 2 shows that the CT scans containing hands or arms in the field of view drag the overall result downwards. In the sub-group without hands/arms visible, the deep learning performance was almost comparable with human-to-human performance. However, a complete study of inter- and intra-observer divergences was not feasible using the current datasets. Our results are interesting because the unedited outcomes of our algorithm (CCC = 0.83) approach, but are still below, the estimate of concordance between human experts working independently (CCC = 0.88), so we acknowledge that our model should be further improved by more training.
The accuracy of the algorithm was 77% overall, which is only marginally poorer than human-to-human accuracy (82%), but we expect this to rise to 90% if future development work can overcome the over-segmentation of hands and arms, where present.
Assuming a qualified physician remains in supervision of the procedure, such an algorithm could provide clinically relevant information supporting the rapid detection of sarcopenia in the trauma setting. Sarcopenia is associated with negative health outcomes and is, therefore, an important piece of differential prognostic information in clinical practice. For a clinically useful algorithm that identifies patients with a high risk of sarcopenia, we would want the number of false negatives to be as low as possible, since this represents a detrimental health condition going undetected during care. Conversely, we might be willing to tolerate slightly more false positives, since treatment of sarcopenia involves better nutrition and more functional activity [39], which is unlikely to harm anyone. Every case of muscle decline that is detected through such an algorithm might then reduce the severity of sarcopenia, and thereby, its associated negative health outcomes.
Our deep learning algorithm was trained using a diverse geographical cohort of patients that did not contain any polytrauma patients. While the algorithm suggests potential for rapid clinical assessment of body composition, we note some important limitations in the present work. We have noted that this algorithm systematically overestimates muscle area, because it tends to include the muscle tissue in the hands and arms, if these are included in the field of view. Additionally, overlapping adjacent internal organs with muscle, and CT Hounsfield Units being similar between some organs and muscle, also lead to a degree of misidentification as muscle.
Due to time pressure in this trauma protocol CT setting, there were several imaging artefacts which affected the segmentation and which were not corrected by re-scanning. We chose to include all of these challenging cases, because they are fully representative of real trauma cases. Some segmentations contained a CT "streak" scatter artefact near the spine that led to internal organs and adipose being mislabeled. Foreign objects lying in the field of view sometimes created strong scatter artefacts that led to misclassification of subcutaneous adipose tissue as muscle. Noisy or poor-quality CT scans resulted in dispersed spots of undetected adipose tissue and muscle. Rarely, post-traumatic subcutaneous emphysema at the L3 level also led to missed detection of subcutaneous adipose tissue.
After locating the L3 slice, segmenting the muscle tissue alone using the HU thresholding software required approximately 20 to 30 min per slice [38]. From the same starting point, the deep learning model segmented muscle, subcutaneous fat and visceral fat in an average of 0.4 s per slice in total (on our device; for hardware specifications, please see the hyperparameters section in the Supplementary Materials). We acknowledge that segmentation timing may vary greatly on different hardware. The principal time efficiency gained by a deep learning approach is because the model executes with no additional human interaction (other than choosing the L3 slice), and our reported results do not contain any kind of post-segmentation editing by a human user. Most of the time consumed during the HU thresholding approach was due to the human operator adjusting the thresholds and growing the body compartments, then afterwards manually adjusting the results by hand. The latter is arguably the most accurate; however, it is comparatively time-consuming, though our results show that human-made segmentations are themselves susceptible to disagreement and uncertainty.
With regard to future work, we propose two promising areas for making significant improvements. Firstly, we will improve the overall agreement by retraining the neural network to exclude the hands. Secondly, we plan to test a workflow that includes automatic detection of the L3 slice from whole body axial CT scans. Ideally, our system should be extended to incorporate whole body analysis rather than just one axial slice, providing rapid and accurate characterization of comprehensive body morphometric parameters. Further work may also be done to comprehensively search for globally optimal hyper-parameter settings, as these may also lead to some additional improvements in geometric accuracy.

Conclusions
This article showed that a deep learning U-Net-based algorithm was able to segment cross-sectional muscle and adipose area at L3 lumbar spine level on abdominal CT of realistic trauma patients. Clinically relevant body composition metrics were computed from the automated segmentation, which showed good agreement and clinically congruent decisions compared against a reference human expert segmentation. Further work on algorithm development should be able to improve the geometric accuracy, quantitative agreement and diagnostic discrimination of sarcopenia in our clinical setting, as well as extend the approach to whole-body CT scans for comprehensive morphological analysis.

Funding:
This article has received funding from EIT Food and EIT Health. EIT Food and EIT Health are supported by the European Institute of Innovation and Technology (EIT), a body of the European Union receiving support from the European Union's Horizon 2020 Research and Innovation program.

Institutional Review Board Statement:
The investigations were carried out following the rules of the Declaration of Helsinki of 1975, revised in 2013. Surgical cancer data were obtained with agreement from the study principal investigators. Ethical review and approval were waived for the polytrauma cases, since these anonymized data were extracted with permission from a regional registry.

Informed Consent Statement:
For cancer surgery studies, the original research data were collected with individual IRB approval for each study; a list of institutional review board (IRB) reference numbers has been provided to the Editor. The principal investigators of the original studies provided the anonymized CT images and segmentation masks. For extraction and re-analysis of polytrauma case data, approval was given by the Maastricht University Medical Center IRB (reference METC 2018-0756). Requirement for informed consent was waived because of the retrospective nature of this study.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.