Radiomics-Based Prediction of Overall Survival in Lung Cancer Using Different Volumes-Of-Interest

provide a signature to predict Overall Survival in patients with Locally Advanced Non-Small Cell Lung Cancer. The results could offer the physicians advanced software tools to early evaluate the disease evolution before the treatment start so to personalize each patient’s therapy. Moreover, this work provides insight into the use of the different segmentation volumes usually applied in radiation oncology. Abstract: Lung cancer accounts for the largest amount of deaths worldwide with respect to the other oncological pathologies. To guarantee the most effective cure to patients for such aggressive tumours, radiomics is increasing as a novel and promising research ﬁeld that aims at extracting knowledge from data in terms of quantitative measures that are computed from diagnostic images, with prognostic and predictive ends. This knowledge could be used to optimize current treatments and to maximize their efﬁcacy. To this end, we hereby study the use of such quantitative biomarkers computed from CT images of patients affected by Non-Small Cell Lung Cancer to predict Overall Survival. The main contributions of this work are two: ﬁrst, we consider different volumes of interest for the same patient to ﬁnd out whether the volume surrounding the visible lesions can provide useful information; second, we introduce 3D Local Binary Patterns, which are texture measures scarcely explored in radiomics. As further validation, we show that the proposed signature outperforms not only the features automatically computed by a deep learning-based approach, but also another signature at the state-of-the-art using other handcrafted features.


Introduction
Lung cancer is worldwide recognised as the most common type of cancer, with about 18.4% of all cases [1], also accounting for the largest amount of deaths. Every year, there are about 470 thousand new cases in Europe and the Non-Small Cell Lung Cancer (NSCLC) is the most frequent one with about 85-90% of all cases [2,3] and, due to the lack of apparent symptoms in early stages, approximately 70% of the new cases are diagnosed with stage III or IV. There are several treatment options that are
Because of the topic discussed, hereinafter we shortly overview the work concerned with the prediction of the Overall Survival in patients suffering from the NSCLC [19,20,30,38,39] (Table 1). Furthermore, the interested readers can deepen AI applications in lung cancer starting from two reviews that discuss the prediction of lung cancer prognosis [40,41].
The literature explored so far shows a clear predominance of studies that focus on radiomics descriptors that were extracted from the GTV.
Moreover, the analysis of the literature reveals to us that most of the work computed both 2D and 3D shape and texture descriptors that mostly leverage on the Grey-Level Co-occurrence Matrices or wavelets. This can be justified by the fact that healthy tissue often differs from tumour tissue, mainly at the microstructure level.
On these grounds, the contribution of this manuscript is twofold: first, we investigate how the use of different volumes of interest impacts the Overall Survival prediction. Second, we introduce the 3D version of the local binary pattern (LBPs), a texture descriptor that is widely used for texture analysis in 2D images in other fields [42], but still not extended to the radiomic field.

Materials
This work studies 97 patients with NSCLC stage III that were treated with definitive concurrent chemoradiotherapy. The enrollment protocol was approved by Ethical Committee of Campus Bio-Medico University on 30 October 2012 and registered at ClinicalTrials.gov on 12 July 2018 with Identifier NCT03583723 after an initial exploratory phase. The Institutional review board approved this analysis, and a written informed consent was collected from all the patients. For each patient, the simulation CT images were acquired before the treatment using a Siemens Somatom Emotion, with 140 Kv, 80 mAs, and 3 mm for slice thickness. Subsequently all of the images were preprocessed using a lung filter (kernel B70) and a mediastinum filter (kernel B31). The patients were then clinically followed, for a median follow-up of 18.55 months after that it was possible to divide patient into two classes: 53 dead (class 0) and 44 alive (class 1). The patients had a mean Overall Survival (OS) time of 28.7 ± 26.4 months (min 4.8 months, max 142.6 months).
In this study we consider three different VOIs for each patient, the Gross Tumour Volume (GTV), the Clinical Target Volume (CTV), and the Planning Target Volume (PTV) segmented by radiation oncologists, as already introduced in Sections 1 and 2. For GTV, the expert radiation oncologists included all the macroscopic disease defined at CT or PET-CT and nodes PET positive and/or larger than 1 cm in the short axis at CT imaging. Starting from GTV, the other volumes can be segmented: CTV former contains GTV plus a margin for sub-clinical disease spread, which therefore cannot be fully imaged, and it represents the volume that must be adequately treated to achieve cure [43]. PTV is a geometric concept that is designed to ensure that the radiotherapy dose is actually delivered to CTV. It is determined expanding CTV with a safety margin, which is necessary to manage internal motion and set-up reproducibility [44].
It is worth noting that the ability to segment tumour mass and the surrounding tissue involved with the tumour to a certain extent is an important professional skill for a medical speciality that is radiation oncology. Several methods are proposed in the literature to reduce interpersonal variability [45,46]. This is why, in our work, the same radiation oncologist segmented all the images in our repository, and then another lung cancer expert radiation oncologist proofread the segmentations. Furthermore, to mitigate the inter-reader variability the segmentations were performed according to the international guidelines [47]. Figure 1 shows an example of these VOIs: the GTV is drawn in red, and it is the smallest segmentation that precisely delineates the visible tumour; CTV is represented in yellow and it includes areas, where, in the image, there is no evidence of tumour, but where sub-clinical disease may be. PTV, in blu, is the largest segmentation.  Figure 2 details the pipeline designed in this work, which can be divided in three main blocks: data processing with the feature computation, feature selection and final performance computation. First, the data processing flow is represented in the uppermost green block: it consists in the segmentation of the three VOIs (GTV, CTV, and PTV) from patients' CT scans and, then, in the computation of the descriptors outlined in the next Section 4.1. Second, the feature selection procedure and the final performance computation are both included in the grey block. Proceeding from left to right, the descriptors are standardised and ranked in the blue box, then we find the final subset selection in the purple box that is further detailed in the bottommost diagram. Third, the final model performance is shown in the rightmost grey blocks. All of these processes are fully described in the followings.

Feature Computation
For each patient, several quantitative biomarkers were extracted from all the VOIs. The features were computed using an in-house software tool coded in MATLAB R2019a (The MathWorks, Inc., Natick, MA, USA) that calculates the first order statistical features and two families of textural features: 3D Grey Level Co-occurrence Matrix (GLCM) and Three Orthogonal Planes-Local Binary Patterns (TOP-LBP). The TOP-LBP implementation of 3D LPBs computationally simplifies the extraction process, as discussed below.
First-order features: all of these features are computed from the grey levels' histogram of a ROI and therefore try to describe the statistical distribution of tissue's density inside the volume. From such histograms, we extracted 12 descriptors, which are the moments from first to fourth-order, namely the mean, the standard deviation, the skewness and the kurtosis, the histogram width, the energy, the entropy, the value of the histogram absolute maximum and the corresponding grey-level value, the energy around such maximum, and the number of relative maxima in the histogram and their energy.
3D Grey Level Co-occurrence Matrix: the healthy and the cancerous tissues usually present different structural patterns: for this reason, we go beyond the grey levels distribution of the VOIs voxels performing an additional textural analysis. In this respect, we computed the 3D Grey Level Co-occurrence Matrix, which generalises the 2D GLCM to the third dimension, being able to catch the differences of the tissues at the micro scale.
Given a 3D grey-scale image I and a 3D Cartesian reference system O(x, y, z), whose origin is located in the top-front-left corner of I, the position of each voxel can be identified by a vector p = p xî + p yĵ + p zk , with p x , p y and p z ∈ N. We also denote a displacement vector d = d xî + d yĵ + d zk , with d x , d y and d z ∈ N. If we define m equal to the number of bit used to represent I, a GLCM3 is a square matrix of size N = 2 m , where each entry (g i , g j ), with both g i and g j ∈ [0, 2 m − 1], represents the number of times a voxel in p with intensity g i is separated by a displacement d from another voxel with intensity g j , therefore located in p + d. Hence, a GLCM3 matrix synthesises the count of how many values co-occur between neighbouring pixels according to specific displacements. Those quantities vary, depending on the VOI structure. Denoting as d h the h-th component of d, in order to assess all possible directions, we considered the combination of d h ∈ {−1, 0, 1} as displacements, without considering the (0, 0, 0) vector.
This leads to 26 different displacement directions. Furthermore, when considering that each direction +d produces a GLCM3 matrix that is the transpose of the one produced by its opposite direction −d, thus giving a redundant informative contribution, we only considered the 13 different directions given by +d. More details on GLCM3 can be deepened in [48]. Finally, from each GLCM3, we extract seven second order statistical features, referred to as Haralick descriptors [49]. Namely, they are autocorrelation, homogeneity, entropy, energy, covariance, inertia, and absolute contrast. Their formal definition is reported in the Appendix A. Concatenating such measures for each GLCM3 we get 13 × 7 = 91 textural descriptors per patient. Three Orthogonal Planes-Local Binary Patterns: this is a relatively new descriptor in radiomics scenario, and it is one of the possible generalisation of the well-known 2D LBP to the 3D. Now, we will briefly recall the computation of a basic bi-dimensional LBP. If we consider a bi-dimensional image I, then we can compare the intensity I p of each pixel p with the intensity I j of all its jth neighbour pixels that lay on a circle centred in p with radius r. If I j > I p , the jth pixel is set to 1, 0 otherwise. Next, it is possible to process all p's neighbours in a circular direction, interpreting the sequence of 0s and 1s as a binary string and setting the value of p to the equivalent decimal value. Proceeding as described through all of the pixels of I, we obtain a new image encoding the intensity distribution of each pixel respect to its neighbours that can describe the texture of the original image. The number of patterns for this 2D implementation is 2 P , with P denoting the number of local neighbouring points around the central point. In order to extend a 2D LBP to the 3D environment, a first solution consists in considering a helix neighbourhood of each voxel [42]. However, to avoid the high increase of the computational burden, associated to the very large number of patterns for volume LBP, when P increases, as 2 3P+2 , we chose another 3D implementation of LBP transformation that was also shown to work fine [42]. It considers the co-occurrence on three orthogonal planes crossing the centre of the analysed volume, as depicted in Figure 3.
It is based on the computation of three 2D LBPs, each of one is derived from each of the three planes; their histograms are then computed and concatenated to obtain a unique histogram for the specific volume. This conspicuously alleviates the computational load, since the number of patterns for TOP-LBP is 3 × 2 P . Furthermore, in our LBP implementation, we consider two more variants to cope with other two issues of 2D LBP definition. First, we computed rotation invariant LBP, i.e., all of the binary strings obtained as the circular shift of a fundamental string are considered the same. Second, we implemented a uniform version of LBP, i.e., all binary strings containing more than two crossings from 0 to 1 or from 1 to 0 are considered not uniform and coded with a specific string. In our case, setting P = 8, we get 48 features by computing first-order measures from the each of the three 2D LBP. The interested readers can find more details about LBPs in [42,50].

Features Selection and Classification
An important step during a radiomics approach is the features selection step: it helps in reducing the dimensionality of the problem, lowering the curse of dimensionality and the risk of overfitting, and it helps finding the most informative set of descriptors for the problem at hand.
The feature selection pipeline, represented in the blue, purple and red blocks of Figure 2, consists in a wrapper-based approach, which searches and evaluates the best subset of features maximising the performance of a given classification algorithm. To avoid any bias, features are normalised before each step, using a standard scaler, as represented in Figure 2. This approach scales features while using the following equation: z = (x − µ)/σ, with µ mean value, σ standard deviation and with x and z as the original feature and the scaled feature, respectively.
Starting from the blue box in Figure 2, the feature set was analysed by the Recursive Feature Elimination (RFE) approach [51], making use of four different classification algorithms, namely AdaBoost, CART Decision Tree (DT), Random Forest (RF), and XGBoost (XGB) [52]. The use of four learners and the availability of three different types of VOIs gave rise to 12 runs of the wrapper, which are listed in Table 2.
The goal of the RFE is to select a feature subgroup, by initially considering all of the features and recursively examining a smaller and smaller dataset according to an assigned feature importance attribute. During this process, the selected classifier is trained on the standardised descriptor set to compute the feature importance attribute based on the accuracy level of the model. The feature with the lowest attribute is excluded from further analysis. This recursive analysis is performed until the best feature is found resulting in a final descriptor ranking.
To avoid any bias during the features selection approach, the RFE step was applied according to a nested leave-one-out (LOO) procedure: indeed, referring to Figure 2, there is an outer grey loop for final performance computation and an inner blue loop for RFE application. This means that this analysis was repeated N − 1 times, to obtain a global view of the feature importance. Hence, all of the N − 1 rankings obtained were summed up to get a final rank of each descriptor.
From this step on we considered two alternative paths referred to as single ranking and total ranking. On the one hand, in the first path, the ranking procedure was performed independently for each single experiment, i.e., considering a single classifier and a single type of VOI. Straightforwardly, each of the 12 experiments listed in Table 2 uses a specific features rank and, in turn, this permits us to optimise the feature set for each learning algorithm and for each VOI.
On the other hand, in the second path, the wrapper first computes the rank of the features for the GTV, CTV, and PTV; then, such rankings computed using the same classifier on the three volumes were summed up to gain a global ranking for every learning algorithm. This implies that we finally have four different feature sets, one for each classifier used in the wrapper, regardless of the VOI used. This permitted designing experiments that are comparable within the same classifier.
Turning our attention to the purple and red boxes in Figure 2, these represent the feature evaluation workflow to find the best subset and it was performed independently for both single ranking and total ranking analyses. To find out the best descriptors, we tested different combinations of features subsets whose performance was evaluated according to a given metric. To this goal, all of the features were initially sorted by their rank and, then, the subsets were sequentially inspected using a cumulative forward selection approach (yellow loop). The cumulative forward selection search procedure starts from a subset with one feature with the highest rank position and then incrementally adds descriptors with lower ranks, until the set contains all the descriptors. To avoid any bias, each subset was then evaluated while using a nested LOO configuration, where the inner loop (red LOO) evaluates the best subset maximising the Area Under the ROC Curve (AUC) metric, whereas the outer loop is again the grey LOO for final performance computation. Hence, the output of the purple box is a list of N − 1 best subsets found with the nested procedure. We chose as final set the intersection among these intermediate subsets, which represents the radiomic signature of the specific combination of VOI and classifier in Table 2.
Finally, with this set of descriptors, the outer grey leave-one-out computes the ultimate performance of the system.

Comparative Analysis
As reported in Section 1, in this work we deepen the analysis comparing the proposed approach with the state-of-the-art and with an approach that leverages on image features automatically computed. Therefore, this last section of the methods introduces the methodology of the competitors.

Comparison with the State-Of-The-Art
As a first point, we compare our approach with the most cited one at the state-of-the-art, represented by the radiomic signature that was proposed by Aerts et al. [19]. This signature includes four features: (i) the "statistics energy" that describes the overall density of the tumour volume, (ii) the "shape compactness" quantifying how compact the tumour shape is, (iii) the "grey level nonuniformity" that measures for intratumour heterogeneity, and (iv) the "grey level nonuniformity HLH" measuring intratumour heterogeneity after decomposing the image in midfrequencies through wavelet analysis. The interested readers can find the formal definition of such measures in the supplementary material of [19], while their implementation is available in Python [53]. We compute this signature on our dataset and, for a fair comparison, all of the tests are performed in LOO fashion for all of the classifier-VOI combinations reported in Table 2.

Comparison with a Deep Learning Approach
Besides the comparison with the handcrafted signature at the state-of-the art, we investigate the possibility to use automatic feature extractors. To this goal, we consider two well-established deep networks, i.e., the AlexNet [54] and the ResNet50 [55], and we train them from scratch on our dataset considering only the CTVs, since they yield the largest performance in the machine learning pipeline proposed here, as shown in Section 5. On the one hand, the AlexNet has a feature extraction step composed of three consecutive blocks containing two convolutional layers and one max pooling layer. After each convolutional layer, we included a batch normalization layer and a dropout layer, to prevent exploding gradient values and overfitting, respectively. Subsequently, the classification step is designed with two dense layers, one with 256 neurons and the other with one neuron, as it is the output layer. All of the layers implement a LeakyRelu activation function, except for the output that uses a sigmoid for the final classification. On the other hand, the ResNet50 was designed, as reported in [55], except for the first max pooling layer, which was eliminated in order to maximally preserve the information contained in the images, and the output layer that is a dense layer with one neuron with a sigmoid activation function. In both cases, due to the large computational efforts needed, the test procedure was conducted in a 10-fold cross validation, with eight folds as training, one as validation, and one as test. Here, cross validation is preferred to LOO to alleviate the computational burden, while having enough samples in the training and validation sets. In this case, the CNNs classify each slice and then patient label is given by majority voting.
As a further direction of investigation, we also study what happens using the CNNs only as feature extractors that feed the same four classifiers that were used in our approach. In practice, from both networks pretrained on our dataset, we use as feature set the last layer in the feature extraction blocks: this yields to 256 descriptors from the dense layer of AlexNet and to 8192 descriptors from the average pooling layer of ResNet50. These two feature sets separately fed the four classifiers used in the proposed pipeline, which were trained in order to classify each 2D slice. Again, patient label is given by majority voting and the performance are computed in 10-fold cross validation

Experimental Results
The main experiment described in Section 4 was applied to all possible combinations of classifiers and VOIs where the features are computed, summarised in Table 2. Furthermore, the experiments are run also using both the single ranking and the total ranking methods for feature selection. The use of four classifiers, three VOIs, and two feature selection ranking approaches results in twenty-four experiments, whose results are presented in Table 3. We measured the following performance metrics: accuracy = TP+TN P+N , the area under the ROC curve (AUC), the precision = TP TP+FP and the recall = TP TP+FN . Straightforwardly, TP, TN, P, and N stands for the true positive, true negative, total number of positive samples, and total number of negative samples, respectively. Hereinafter, we also use FP and FN to denote the false positive and false negative, respectively. Let us recall that the positive and negative classes correspond dead and alive patients, respectively, as described in Section 3.
In Table 3, the upper panel shows the performance of the twelve combinations of the classifiers and the VOIs when the single ranking method is used, whereas the lower panel reports the performance for the total ranking case. For the single ranking, the accuracy and AUC values range from 60.82% to 83.51% and from 61.06% to 82.78%, respectively. For the total ranking, the accuracy ranges from 44.33% to 77.32%, and the AUC ranges from 44.43% to 75.96%. The subsets of features selected in each combination range from two to nine descriptors for the single ranking and from two to 20 descriptors for the total ranking. Note that, in the total ranking case, we obtain a worse range of accuracy, AUC values and larger subsets of selected feature in comparison to the single ranking. This could be expected, since the single ranking approach provides a best feature subset optimised for both the classification algorithm and the considered VOI.
A more detailed analysis of Table 3 suggests us other three observations. The first is that the performance on GTV in the single ranking is stable across the four classification algorithms. The same does not happen for the total ranking case. We deem this indicates that a system built on this VOI is particularly stable when a volume-specific optimization occurs for the best feature set selection. Secondly, the overall largest performance was obtained combining the classifier Adaboost with the CTV in the single ranking procedure. Compared to this result, the best result for the total ranking procedure was obtained combining the Random Forest Classifier with, again, the CTV. Such best results for single and total rankings are highlighted in bold in Table 3. As a third observation, we notice that, although both the best experiments exploit the CTV, the largest performance are given by the single ranking approach, where the best feature set was customized on the specific VOI analysed. We also run the ANOVA test among the VOIs both for the results attained by single and total ranking, whatever the classification paradigm used. When we get a significant result, i.e., at least one of the groups tested differs from the other groups, we proceed with Least Significant Difference (LSD) that detects the statistical differences among a group of results. In the case of results provided by the single ranking, we obtain a p-value equal to 0.0334, suggesting that the performance achieved using the different VOIs are not the same. The LSD tests show that the results of CTV+Adaboost are significant different (p < 0.1) from the others in all cases, except for CTV+Random Forest, PTV+Random Forest, and PTV+XGBoost. In the case of the results provided by the total ranking, the p-value is 1.586 × 10 −5 , suggesting again that there are differences between the VOIs. The method with the best AUC, i.e., CTV+Random Forest, shows performance statistically different from all the others except for CTV+AdaBoost, PTV+AdaBoost, GTV+Random Forest, PTV+Random Forest, CTV+XGBoost, and PTV+XGBoost.
Let us now turn the attention to the radiomics signature underlying the best case, i.e., CTV+Adaboost in single ranking: as shown in the first column of Table 4 it is composed of eight features belonging to the LBP set, and one belonging to the GLCM set. In detail, such features are: "uniform 3D LBP kurtosis", which estimates the shape of the distribution of all patterns discharging the noisy patterns, "3D LBP energy", which represents the homogeneity of the patterns, "rotation invariant 3D LBP absolute maximum", which is the absolute maximum value among the patterns, regardless of their rotation, "3D LBP energy around the absolute maximum", which estimates the homogeneity of patterns around the absolute maximum value, "rotation invariant 3D LBP energy", which is the homogeneity of the patterns without considering their rotation, "uniform 3D LBP energy around the relative maximum", which estimates the homogeneity of pattern around a relative maximum value discharging the noisy patterns, "uniform 3D LBP entropy", which is a measure of the variety of patterns discharging the noisy ones, and "uniform 3D LBP skewness", which measures the asymmetry of distribution of all patterns discharging noisy patterns, and "inverse GLCM in direction (−1,−1,0)", which measures the inverse of grey level attenuation in the ROI over the direction (−1,−1,0).
Turning our attention to the newly introduced set of features, i.e., 3D LBPs, and still focusing on the single ranking approach, these descriptors were always chosen in the best feature subsets, regardless of the classifier and the VOIs under consideration. For the sake of brevity, we hereby do not report the full list of features chosen in each of the 24 experiments. However, to provide a synthetic analysis of the best feature sets, the second and the third columns of Table 4 show how many times each feature in the first column (i.e., the radiomic signature of the best case) was included in the best feature set using the Adaboost with the GTV and PTV (second column) and the DT, RF, and XGBoost with the CTV (third column). Note that finding zeros in the occurrences of Table 4 does not exclude the presence of other LBP features in the best subsets besides those chosen in the experiment with largest performance. Table 4. Best radiomic signatures. The fist column lists the feature selected in the best performance. The other two columns indicate the number of times that each descriptor was selected in the best feature set. Abbreviations: U-uniform, RI-rotation invariant. To deepen the discussion on the effectiveness of 3D LBPs, we excluded them from each signature selected by our approach and train again each learner. The results are reported in Table 5, where "-" indicates that the classifier was not trained, since the best set included only 3D LBP descriptors. When comparing these results with Table 3 we notice that there is not an evident pattern in the performance differences between the two experiments. Despite that, the use of 3D LBPs improves the largest accuracy obtained: indeed, without them, the overall best accuracy score decreases from 83.51% to 76.29% and from 77.32% to 75.26% for the single ranking and the total ranking, respectively. This trend is also manifested by the other performance measures reported for the best cases, indicating that LBPs actually give an important contribution to the classification task at hand. Furthermore, we statistically pairwise compared the results reported in Table 3 with those in Table 5 using the Wilcoxon signed rank test. In the case of single ranking, for AdaBoost+CTV, DT+CTV, RF+CTV, RF+PTV, and XGBoost+GTV the use of 3D LBPs provides performance larger and statistically different from those attained excluding these descriptors (p < 0.1). In the case of total ranking, at the same level of confidence, the results of the test show that the use of 3D LBP provides larger and statistically different performance for AdaBoost+PTV and XGBoost+CTV. Table 5. System performance excluding LBP descriptors from the final feature subset in all possible combinations of Table 2 for single ranking (upper panel) and total ranking (lower panel). The notation '-' indicates that no feature was left in the final set after excluding LBP descriptors. The largest performance is highlighted in bold.

Comparison with The-State-Of-The-Art
The results were compared against the signature presented by Aerts et al. [19] for OS prediction in NSCLS, as introduced in Section 4. Further, to be one of the most cited works in the state-of-the-art, such a radiomic signature was also used by Lambin et al. [17] and Kwan et al. [56]. Table 6 reports the performance computed with Aerts et al.'s signature on our dataset: the largest AUC value was obtained considering the CTV with Random Forest as classifier, and it is equal to 76.92%. Using the same VOI our signature gets an AUC equal to 82.78%, which suggests that the inclusion of the 3D LBP descriptors boosts the discrimination power and leads to larger performance. Despite the fact that the best results of Tables 3 and 6 are obtained with different classifiers, comparing the corresponding columns of the two tables further validates the previous assertion. Also note that the best combination in Table 6 is based on a Random Forest classifier with the CTV, which is also one of two best configurations in our experiments: this confirms our results that the CTV is the most informative volume of interest.

Comparison with a Deep Learning Approach
We conducted a comparison with two deep learning approaches to further assess the power of the handcrafted features, as described in Section 4. Hence, we performed two experiments: the first tests the AlexNet and the ResNet50 as a whole classification approach, whereas the second tests the two deep networks as feature extractors feeding the same four classifiers used herein-before. All of the results are summarised in Tables 7 and 8 and, in general, they show that the most performing network is the ResNet50.
In the first experiment, this deep network achieves an AUC equal to 76.49% and an accuracy equal to 71.11%. Furthermore, the performance differences between the ResNet50 and the AlexNet are statistically significant according to the Wilcoxon signed rank test (p = 0.0396).
In the second experiment, when the networks are used as feature extractors, we find that the automatic features they compute combined with the Adaboost classifier provide accuracy and AUC equal to 73.33% and 73.42%, respectively. Furthermore, using this learner, the performance differences using the features computed by the AlexNet and by ResNet50 are statistically significant with p < 0.1 according to the Wilcoxon signed rank test. We do not find statistical differences in the pairwise comparison between the automatic features while using the other three classifiers.
As it is evident, these values were lower than those achieved by the proposed approach (accuracy = 83.51% and AUC = 82.78%), showing the successful discriminative power of hand-crafted features. We also assess this difference using again the Wilcoxon signed rank test, comparing our proposal (AdaBoost+CTV in Table 3) with the best one provided by deep learning approach (ResNet50 features in Table 7). The results of such two approaches are statistically different with p < 0.05. A possible reason for this finding could be the size of our dataset: indeed, this could hinder the power of deep approaches, leading to the superiority of handcrafted features with respect to measures automatically learned.

Conclusions
In this manuscript, we propose a radiomics approach to predict the Overall Survival in a cohort of 97 patients suffering from locally advanced non-small cell lung cancer. NSCLC has been already studied in few other works and, differently from the literature, here we introduce two novel aspects which have not been analysed in this way so far. The first is the analysis of three different VOIs commonly used in clinical practice and segmented on CT images routinely collected. The second, but not secondary, is the introduction of 3D Local Binary Pattern descriptors, already used in computer vision problems. The both constitute novel and challenging directions of this research.
Among the different combinations of classification algorithms and VOIs, the best accuracy of proposed approach is equal to 83.51%, achieved using the Clinical Target Volume, the Adaboost learner, and nine features selected from the pool. The fact that the VOI providing the best results is the CTV confirms our initial hypothesis that there is information in the surroundings of the visible tumour and that this is important to predict patients' OS. Furthermore, the LBPs features proved to be determinant, improving the classification performance in comparison with respect to other texture measures.
The promising results described so far suggest three future directions worthy of investigation. First, the system performance and its variability could be explored for small variations of the segmentations of the target tumour volume, considering not only the GTV, but also the CTV and PTV. This will permit us to assess how important it is to get a very accurate automatic segmentation. Second, we could further explore the potential of LBPs, validating such descriptors on an external dataset, so to confirm their stability and effectiveness. Third, we are working to enlarge the size of the dataset and to extend the research in the deep learning area, even when considering deep approaches to artificially enlarge the size of the training set, such as the Generative Adversarial Networks.
Funding: This work was partially founded by Universita' Campus Bio-Medico di Roma under the programme "University Strategic Projects 2018 call" within the project "a CoLlAborative multi-sources Radiopathomics approach for personalised Oncology in non-small cell lung cancer (CLARO)".

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. Haralick Features
This appendix presents in Table A1 the formal definition of Haralick features listed in Section 4. The content of such tables adopts the following notation: • G is the number of grey levels in the image I; • g i and g j denotes two grey level values, with i and j lying in [0, G − 1]; • d is a displacement vector; • p(g i , g j , d) denotes the grey level co-occurrence matrix (GLCM) of I; • µ p(g j ,d) = ∑ G−1 i=0 p(g i , g j , d) and µ p(g i ,d) = ∑ G−1 j=0 p(g i , g j , d) denote the mean value of the one-dimensional marginal distributions of the GLCM; • ∂(g i , g j ) = (g i − g j ) 2 is a measure of dissimilarity.