1. Introduction
Lung cancer is an uncontrolled neoplastic growth in pulmonary tissues [1,2]. Accounting for 18.4% of all cancer-related deaths, lung cancer has the highest mortality rate of any cancer not just in Australia, but globally [3]. Moreover, lung cancer was the fourth leading cause of death in Australia as of 2015 [4]. That year, 1.7 million people died from lung cancer worldwide [5]. Since its mortality is closely linked to the spread of uncontrolled cell growth from the lung into other tissues, known as metastasis, early diagnosis and treatment considerably increase treatment success and survival rates in affected patients [6]. Across Europe, there are significant differences in five-year survival rates for metastatic cases, ranging from 20% in Sweden and Austria to only 8% in Bulgaria, further underscoring the importance of an early diagnosis and treatment onset [7]. The five-year survival rate for non-metastatic patients was 58.2% in 2010, whilst that of metastatic cases was less than 21% [8]. Whilst not visible on X-ray scans, small lung nodules can be detected via computed tomography (CT) [9]. In most cases, additional biopsies are ordered to examine the target nodule's histopathology [10]. With an ever-increasing amount of medical data, the elevated mortality rate of lung cancer, financial incentives such as cost reduction and the need for improved diagnostic aid further motivate research into optimising data-driven solutions for computer-aided diagnostic (CAD) systems [11].
The increasing number of patient records and medical images, as shown in Figure 1, makes clinical decision-making exceedingly time consuming. Vast quantities of CT scans, usually multiple hundreds per patient, can make it challenging to identify small, malignant, or suspicious lung nodules in the available timeframe [2,10]. Pulmonary nodules appear in various shapes with heterogeneous properties, such as diversified densities or calcification patterns that may imply malignancy [2], further complicating their detection and classification for radiologists [2]. In this paper, however, the focus lies on improving lung nodule classification performance only, as combinations with additional segmentation CNNs such as U-Net could pose a sustainable approach for a stand-alone CAD system in radiology diagnostics [12,13,14].
For the classification of the malignancy of any accumulated cell mass visible on CT scans, however, specialised machine learning (ML) applications could be used in the future [15,16]. In particular, deep learning (DL) methods and convolutional neural networks (CNNs) have demonstrated outstanding potential to analyse both feature-complex and large datasets [17], with examples including AlexNet [18], DenseNet [19] and GoogLeNet [20]. Models with outstanding performance on lung CT scans include ResNet and FractalNet [16,17]. However, other approaches, such as deep local–global networks (DLGNs), have also demonstrated remarkable lung nodule classification performance [18,19].
FractalNet, a CNN approach that does not incorporate residuals, relies on the repeated use of simple expansion rules, hence supporting the widened depth of image analysis through its truncated fractals [21,22,23]. In contrast to FractalNet, the DLGN relies on shortcuts that help minimise residuals between layers while preserving the identity and weights of the previous layer's outputs upon residual summation [24,25]. DLGNs have been demonstrated to outperform ResNet and other models, partly due to their ability to extract multi-scale features with high generalisation ability [24]. In contrast to other CNN approaches, DLGNs aim to perceive both local and global features without requiring full connections between all layers, which helps reduce computing time by cutting down the number of weight and connection computations. Since many CNNs with small kernel sizes fail to capture global features, DLGNs were developed around residual rather than dilated convolution [24], which has been shown to help identify both local and global features. The extraction of global features without increasing the kernel size is enabled by implementing self-attention layers [26].
The described high-performing models have been manually designed with the incorporation of both domain-specific knowledge of the problem space and expertise in deep learning hyperparameters [14]. While this combinatory approach yields excellent results, the discovery and design of optimised or even novel architectures remain dependent on CNN-specific expertise, which can hinder the discovery of optimal solutions [27]. Studies on hyperparameter and network optimisation have shown that the performance of models with the same hyperparameters and network architectures can vary significantly when applied to datasets of different domains, data properties, target class quantities, training example counts and event probabilities [28]. Therefore, the absence of a holistic recipe for a flawless deep learning architecture often results in a model design based on previous approaches, as well as extensive trial and error. Since DL training can be incredibly time consuming, smart solutions for CNN structure design are needed to ensure efficient optimisation of model designs in the future. The need for a smart and automated CNN design solution is particularly pressing in settings where it is not possible to allocate experts from all necessary disciplines to the task. In a clinical setting, for instance, medical professionals have strong expertise regarding their patients' datasets and the underlying physiology, yet usually do not have deep artificial intelligence (AI) knowledge or experts at their disposal. In such cases, an automated network design with automated hyperparameter optimisation could be used to deploy sophisticated classification models without having to rely on additional experts and resources from other fields. Moreover, such an approach can additionally remove the influence of cognitive bias during the network design and optimisation process, which can limit the resulting network's performance.
To save time and computational resources during the optimisation process, evolutionary algorithms can be deployed to automate the CNN architecture design with all its entailed parameters. Just as neural networks resemble computational replicas of natural concepts, genetic algorithms were developed to solve computational problems by approximating the optimal solution through simulated evolutionary processes [29,30]. Genetic algorithms (GAs) are a subset of evolutionary algorithms and have demonstrated outstanding performance on several different network optimisation problems [31]. They are used in various fields to help solve combinatorial puzzles such as the 'Knapsack' problem, as described by Tobias Dantzig in 1930 [32], since GAs circumvent the necessity of naïvely testing all possible solutions [33,34].
GA-evolved solutions are represented by a sequential encoding, i.e., the genome [35]. To find the best solution for any given problem and therefore to evolve the best genome (i.e., to create the fittest individual), GAs use bio-inspired genetic processes such as selection, crossover and mutation [36]. A generic pseudo-code for genetic algorithms is depicted in Figure 2. The structure and design of the genome, and therefore the tools or building blocks used to find a solution to any problem, may be tailored for each approach [14,37]. For image classification, GAs can therefore help to automatically and efficiently find optimal solutions for CNN hyperparameter settings, as well as the overall CNN architecture design [34]. Recently, GAs have demonstrated exceptional performance in automatically generating state-of-the-art CNNs, for example classifying images of the CIFAR10 dataset with accuracies of up to 96.78%, higher than most models manually tuned by DL experts [14]. Current research and recent advances in algorithms that can genetically design CNNs (CNN-GAs) do not only focus on algorithmic aspects such as variable-length encoding strategies with adaptive crossover operator length, but also address computational aspects, such as utilising all available hardware resources efficiently to optimise computing time and costs [14].
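To make the generic loop depicted in Figure 2 concrete, a minimal Python sketch of such a loop is given below; the genome representation, fitness function and variation operators are placeholders rather than the implementation used in this paper.

import random

def genetic_algorithm(random_genome, fitness, crossover, mutate,
                      pop_size=20, generations=19):
    # Generic GA loop: evaluate, select, recombine and mutate (illustrative sketch).
    population = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        scored = [(fitness(g), g) for g in population]        # fitness evaluation
        scored.sort(key=lambda pair: pair[0], reverse=True)
        parents = [g for _, g in scored[:pop_size // 2]]      # keep the fitter half as parents
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = random.sample(parents, 2)
            offspring.append(mutate(crossover(p1, p2)))       # variation operators
        population = offspring
    return max(population, key=fitness)                       # fittest evolved genome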
In this paper, a genetic algorithm was implemented to automatically evolve and select the best CNN architecture design for classifying lung nodules from the Lung Image Database Consortium (LIDC) image collection, adopting both the variable-length encoding and the evolutionary operators proposed by Sun et al. [14]. This ultimately selects the best-performing lung nodule classification model, since naïve approaches to network architecture optimisation are not feasible given the large number of hyperparameters and other architectural features in CNN designs, such as layer count and type.
2. Materials and Methods
In this section, the manually designed FractalNet and DLGN are presented. Furthermore, the automatic CNN architecture design via genetic algorithm is described. All three models were configured and trained independently, with subsequent comparison of their respective classification performances on the validation data splits for cross-validation.
In this study, all models were trained and validated with data sourced from the LIDC only. The sourced data consisted of thoracic CT scan files with additional annotations for a total of 1018 patients [8]. Utilised annotations included ratings of the nodules by radiologists ranging from 1 to 5, in which 1 indicates a low probability of malignancy and 5 indicates the highest probability of malignancy [38]. Examples of scans with different annotations for malignant nodules are displayed in Figure 3.
However, not all nodules were selected. The following criteria were applied to determine each sample's inclusion and class membership for both the training and validation datasets: (a) at least 3 radiologists acknowledged the nodule with annotations; (b) each sample with a mean annotation value greater than 3 was labelled as 'malignant' and each sample with a mean value less than 3 as 'benign'; and (c) samples with an average value of exactly 3 were considered 'ambiguous' and were therefore not included for training or validation purposes. Under those assumptions, 380 malignant and 421 benign nodules were identified. No further balancing of the dataset was performed, as the benign/malignant target class ratio of 1.07 indicates only a slight class imbalance. To augment the dataset, further samples were created by following general DL conventions, such as rotating images by 90, 180 and 270 degrees. Furthermore, images were cropped from random sides with a fixed stride, whilst relocating the centre of lung nodules as the area of interest. The described augmentation of the dataset was performed eight times.
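As an illustration only, the labelling rule and the rotation-based augmentation described above could be expressed as in the following sketch; the helper names are hypothetical and do not reproduce the authors' code.

import numpy as np

def label_from_annotations(ratings):
    # Map radiologist ratings (1-5) to a class label per the inclusion criteria.
    if len(ratings) < 3:
        return None                      # fewer than three annotations: excluded
    mean_rating = np.mean(ratings)
    if mean_rating > 3:
        return "malignant"
    if mean_rating < 3:
        return "benign"
    return None                          # mean of exactly 3: ambiguous, excluded

def augment_rotations(image):
    # Create rotated copies (90, 180 and 270 degrees) of a 2D nodule patch.
    return [np.rot90(image, k) for k in (1, 2, 3)]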
For both FractalNet and CNN-GA training, a total of 6408 images were used, 3040 of which were labelled as malignant and 3368 as benign. The first classification model, FractalNet, was implemented with the pre-processed dataset split into training and validation sets at a ratio of 8:2. With an overall total of 4 fractal blocks, a maximum pooling size of 2 and a convolutional filter size of 4, training was performed for 50 epochs. The adjustment of weights was performed via the adaptive moment estimation (ADAM) optimiser with an initial learning rate of 0.002. Cross entropy was used to compute the loss function. The global drop probability was set to 0.2. Dropout layers of the FractalNet were initialised with event probabilities of 0.1, 0.2, 0.3, 0.4 and 0.5, respectively. As the first layer was set with the lowest dropout probability, significant features are less likely to be overlooked at the beginning of the convolutional training.
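A minimal sketch of how such a dropout schedule and ADAM configuration might look in PyTorch is given below; the placeholder model and variable names are illustrative assumptions rather than the actual FractalNet implementation.

import torch
import torch.nn as nn

# Illustrative dropout schedule: lowest probability in the first block,
# increasing towards the deepest block, as described above.
dropout_probs = [0.1, 0.2, 0.3, 0.4, 0.5]
dropouts = nn.ModuleList([nn.Dropout(p) for p in dropout_probs])

model = nn.Sequential(nn.Conv2d(1, 16, kernel_size=4), nn.ReLU())  # placeholder for the FractalNet
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)         # ADAM, initial learning rate 0.002
criterion = nn.CrossEntropyLoss()                                  # cross-entropy loss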
The same data split was utilised for the training and validation of the DLGN, the training of which was performed with a batch size of 150 for both 50 and 100 epochs. The comparison of target and output was conducted via binary cross entropy, whilst the subsequent adaptation of weights was performed using ADAM with an initial learning rate of 0.1. Generally, the linear transformations used in DLGN models characterise all regions of interest and analysed features, whilst Softmax classifiers are utilised to further extract regions with non-zero attention values.
For the CNN-GA, all digital imaging and communications in medicine (DICOM) files were transformed from the 3-channel RGB format with dimensions of 3 × 32 × 32 to greyscale NumPy arrays with dimensions of 1 × 32 × 32 each. The training-validation data split was randomised with ratios between 10/90 and 60/40 for each evaluated CNN. Since the pre-processed CT scans did not carry significant information that would be accessible via the RGB channels only, the transformation to the single-channel, monochrome format was not expected to impact classification performance, whilst reducing the necessary use of computational resources. All generated monochrome NumPy arrays were subsequently transformed using PyTorch. The pre-processed data were then converted into Tensor images with dimensions of 1 × 32 × 32 (input channels × height × width). Here, normalisation and additional randomised horizontal flipping of all images were performed.
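A sketch of such a preprocessing pipeline using torchvision transforms is shown below; the normalisation statistics and the stand-in input patch are assumptions for illustration, not the exact values used in this study.

import numpy as np
from torchvision import transforms

# Illustrative preprocessing: greyscale 32 x 32 patches -> normalised, randomly flipped tensors.
preprocess = transforms.Compose([
    transforms.ToTensor(),                        # H x W uint8 array -> 1 x 32 x 32 float tensor in [0, 1]
    transforms.RandomHorizontalFlip(p=0.5),       # randomised horizontal flipping
    transforms.Normalize(mean=[0.5], std=[0.5]),  # assumed normalisation statistics
])

patch = np.random.randint(0, 256, (32, 32), dtype=np.uint8)  # stand-in for a greyscale nodule patch
tensor_image = preprocess(patch)                             # shape: (1, 32, 32)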
During the fitness evaluation, a fitness function evaluated all individuals of each generation. Individuals whose genome encoded CNNs with final validation performances below 50% were deemed unfit and were therefore assigned a fitness score of 0. All other individuals' fitness values were derived from their classification performances using cross-validation.
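The fitness assignment described above can be sketched as follows, where train_and_validate is a hypothetical helper that returns the final validation accuracy of the CNN encoded by a genome.

def evaluate_fitness(genome, train_and_validate):
    # Assign fitness: 0 for unfit individuals (validation accuracy below 50%),
    # otherwise the cross-validated classification accuracy itself.
    accuracy = train_and_validate(genome)    # hypothetical: decode genome, train CNN, validate
    if accuracy < 0.5:
        return 0.0                           # deemed unfit
    return accuracy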
The utilised GA design was developed to increase the depth of the CNN and therefore its classification capability [14]. Hence, fully connected layers were replaced with skip layers, which can prevent overfitting of the model. Following mutation, environmental selection was conducted by deploying the binary tournament selection approach, which is commonly used in single-objective optimisation with CNN-GAs [14,39].
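Binary tournament selection can be sketched as below; the population representation and function names are illustrative assumptions.

import random

def binary_tournament(population, fitnesses):
    # Pick two individuals at random and keep the fitter one.
    i, j = random.sample(range(len(population)), 2)
    return population[i] if fitnesses[i] >= fitnesses[j] else population[j]

def environmental_selection(population, fitnesses, n_parents):
    # Repeat the tournament until enough parents have been selected.
    return [binary_tournament(population, fitnesses) for _ in range(n_parents)]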
Upon the selection of individuals, variable crossover and mutation relied on randomisation. It was therefore possible that, via genetic drift, the currently fittest individual and its corresponding genome could be deconstructed during either of these random procedures [40]. Hence, elitism was implemented by checking whether the fittest individual had already been selected to generate the offspring for the next generation. If the fittest individual was not yet represented in the new parent generation, it replaced the least fit individual in the current selection, which ensured the survival of the fittest individual. After each generation, a new population was initialised with all selected individuals and their freshly acquired genomes. After evaluating and ranking all individuals according to their fitness, parents were selected to generate new offspring for the next generation via the mutation and crossover functions.
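A minimal sketch of this elitism step is given below, assuming that fitness values are available for the newly selected parents; the names are illustrative.

def apply_elitism(parents, parent_fitnesses, best_individual, best_fitness):
    # Guarantee that the fittest individual survives into the next parent generation.
    if best_individual in parents:
        return parents                                        # elite already selected
    weakest = min(range(len(parents)), key=lambda i: parent_fitnesses[i])
    parents[weakest] = best_individual                        # replace the least fit parent
    parent_fitnesses[weakest] = best_fitness
    return parents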
In this paper, CNN building blocks were implemented as per the proposal by Sun et al., as shown in Figure 4.
The utilised genetic algorithm encodes CNN architectures with a variable genome length. This means that layer counts can vary, which gives more flexibility when adapting the overall network architecture. Please note that the resulting genome encodes pooling layers and skip layers, as this setup was inspired by ResNet's architecture, which demonstrated high performance with the described skip layer-based network architecture [41,42].
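The variable-length encoding of skip and pooling layers can be pictured as an ordered list of layer genes, as in the hedged sketch below; the field names follow the description in the text rather than the exact encoding of Sun et al. [14].

from dataclasses import dataclass
from typing import List, Union

@dataclass
class SkipLayerGene:
    # A ResNet-inspired skip unit: convolutions plus a shortcut connection.
    out_channels_1: int
    out_channels_2: int

@dataclass
class PoolingLayerGene:
    # A pooling unit, e.g., 2 x 2 max or mean pooling.
    pool_type: str  # "max" or "mean"

# A variable-length genome is an ordered list of such genes; layer counts may
# differ between individuals, for example:
Genome = List[Union[SkipLayerGene, PoolingLayerGene]]
example_genome: Genome = [
    SkipLayerGene(64, 128),
    PoolingLayerGene("max"),
    SkipLayerGene(128, 256),
    PoolingLayerGene("mean"),
    SkipLayerGene(256, 256),
]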
3. Experimental Setup
The evolutionary simulation of the CNN-GA was initiated with a population count of n = 20. Each individual inherited an initial fitness of zero. In each generation, each individual's CNN was trained for 100 epochs, with subsequent mutation and crossover. All generated models were trained using stochastic gradient descent (SGD) with a learning rate of 0.1. Each training run was followed by a fitness evaluation, in which the final validation accuracies of all current models were compared.
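A minimal sketch of the per-individual training step with SGD at a learning rate of 0.1 is shown below; decode_to_cnn and the data loader are hypothetical placeholders.

import torch
import torch.nn as nn

def train_individual(genome, decode_to_cnn, train_loader, epochs=100, lr=0.1, device="cpu"):
    # Train the CNN encoded by a genome with SGD (learning rate 0.1) for a fixed number of epochs.
    model = decode_to_cnn(genome).to(device)          # hypothetical genome -> nn.Module decoder
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model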
The probability of evolving the architecture with an additional skip layer by random mutation was furthermore set to be seven-fold higher than that of the other possible mutations, which included adding pooling layers with random settings, removing layers, and changing hyperparameters of the currently encoded CNN design with all its components [14,30].
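The weighting of mutation types can be sketched with random.choices; the weights below reflect the seven-fold bias towards adding a skip layer, whilst the operator functions themselves remain placeholders.

import random

MUTATION_WEIGHTS = {
    "add_skip_layer": 7,        # seven-fold more likely than each of the other mutation types
    "add_pooling_layer": 1,
    "remove_layer": 1,
    "change_hyperparameters": 1,
}

def mutate(genome, operators):
    # Pick a mutation type according to the weights above and apply it (sketch);
    # 'operators' maps each mutation name to a placeholder function acting on the genome.
    choice = random.choices(list(MUTATION_WEIGHTS), weights=list(MUTATION_WEIGHTS.values()), k=1)[0]
    return operators[choice](genome)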
Randomised values for layer counts were generated within a pre-determined range of 5 to 15 convolutional layers, with output channels generated between 64 and 512. The number of convolutional layers of each network architecture design, as well as each layer's input channel dimensions, is changed by the genetic operators of the proposed GA.
Rectifier activation functions and batch normalisation were appended to each convolutional layer. The final classification of each iteration was generated using a SoftMax classifier for both target classes.
Each CNN architecture was predetermined to inherit three pooling layers, with an analogous setup to the convolutional layer channel limits. Pooling layers were initialised with a kernel size of 2 × 2 and a stride of 2.
Skip layers, which connect the prior convolutional layer to the following convolutional layer, were assigned filter sizes of 3 × 3 with a stride of 1.
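Under the settings above, decoding one skip layer gene and one pooling gene into PyTorch modules might look as follows; this is an illustrative sketch using the stated filter sizes and strides, not the exact decoder used in this work.

import torch.nn as nn

class SkipBlock(nn.Module):
    # Two 3 x 3, stride-1 convolutions with batch normalisation and ReLU, plus a shortcut.
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_channels),
        )
        # 1 x 1 convolution on the shortcut so that channel counts match before summation.
        self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

pooling = nn.MaxPool2d(kernel_size=2, stride=2)   # pooling gene: 2 x 2 kernel, stride 2
# Final classifier over the two target classes; when training with nn.CrossEntropyLoss,
# the explicit Softmax is usually omitted because the loss applies it internally.
classifier = nn.Sequential(nn.Flatten(), nn.LazyLinear(2), nn.Softmax(dim=1))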
After 19 generations, the algorithm was set to terminate the evolutionary process, and the performance of the populations throughout all generations, as well as the classification performance of all generated architectures, was evaluated. For comparison purposes, FractalNet [22,24] and DLGNs [25,43] were compared with the CNN-GA. The loss functions in this paper were computed using cross-entropy as implemented in PyTorch (Notation 1).
Notation 1. Cross-entropy loss function: L(x, y) = −Σ x log(y), summed over the target classes, where x = the label's probability and y = the predicted label's probability.
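In PyTorch, this corresponds to the standard nn.CrossEntropyLoss; a short, illustrative usage sketch with dummy tensors is shown below.

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()       # combines log-softmax and negative log-likelihood
logits = torch.randn(4, 2)              # raw model outputs for 4 samples and 2 classes
labels = torch.tensor([0, 1, 1, 0])     # illustrative ground-truth class indices
loss = criterion(logits, labels)        # scalar cross-entropy loss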
All solutions presented here that were generated by the CNN-GA were tracked using an individual identifier together with their generation count, with the generation count indexed from 0 for the first generation: e.g., the ID 09-02 corresponds to the genomic encoding of individual number 2 in the 10th generation, and the ID 01-15 corresponds to individual number 15 in the second generation.
4. Results
The FractalNet training accuracy peaked at 87% during the 46th epoch of training. The final classification performance was 85% with a loss of 36% under cross-validation. Subsequently increasing the batch size reduced the performance further. Increasing the number of filters of the convolutional layers did not improve the testing accuracy. Upon increasing the learning rate by 50% (to 0.003), the accuracy was reduced even further.
The DLGN demonstrated a classification performance of up to 92% during training. However, validation accuracy was significantly lower at 81%. Its performance during validation was less consistent, whilst the training accuracy improved more steadily over time. When trained over 100 epochs, the classification accuracy plateaued and fluctuated between 80% and 90%, with a final validation accuracy of 88.2%. However, the loss function had not yet approached its final plateau after 100 iterations.
The CNN-GA was able to generate models with classification accuracies of up to 96.5% on training data and 91.3% on validation data; the best-performing network architecture is depicted in Figure 5. Throughout the evolutionary process, underperforming individuals with elevated loss values were eliminated. The summary of the population's performance progression is shown in Figure 6. Note that the average final accuracy for each generation includes even the worst individuals of each generation, which were terminated prematurely due to their inability to classify.
The fittest individuals varied among the generations; therefore, the fittest individual of each generation was selected and plotted in Figure 6. Thus, the graph does not represent the classification performance progression of one single individual, i.e., one network architecture design solution. Despite fit individuals with good performance appearing after the first generation, the top performance improved only slowly over the course of 19 generations.
With the building blocks of the genomic encoding utilised here, the best classification performance on the validation dataset was achieved by a CNN with nine convolutional layers and three pooling layers, as shown in Figure 5.
However, loss values for the validation datasets were significantly higher than their training data counterparts in most cases. Nonetheless, most models started to approximate a classification accuracy of more than 80% after evolving for more than 9 generations. Many individuals remained unreliable despite demonstrating satisfactory final performances. As exemplarily shown in Figure 7, accuracies showed stark fluctuations during the training of certain models.
5. Discussion
Both the manually designed and off-the-shelf CNNs demonstrated respectable classification performance on hundreds of test cases within only one hour of computing, whilst the CNN-GA required over 79 h to compute the final CNN architecture, which demonstrated validation accuracies of over 91%. However, once the CNN-GA has produced and trained its final model, its operation time is very fast. The comparison of all models is summarised in Table 1.
To further improve the CNN-GA results, however, it is advisable to refrain from randomised data splitting, as the observed loss values showed stark variance for certain validation datasets. One may expect elevated validation accuracies if the CNN-GA were deployed with a fixed data split ratio. Additional runs would enable a better judgement of the CNN-GA's reproducibility, which is not as high as that of other classifiers, since both evolutionary algorithms and CNN training itself rely heavily on randomised parameters [44]. Furthermore, the utilised SGD might be replaced by ADAM, since the latter is generally a more favourable choice among stochastic optimisation methods, particularly when facing noisy gradients [45].
The fittest individuals of the first generation already accumulated a surprisingly high fitness of over 85%, and it remains to be proven that the relatively high mutation probability was favourable for optimising the evolutionary design approximation. To enhance classification reliability for future applications, decreased mutation probabilities could be combined with an increased number of generations, even if such an approach would require significantly more time and, perhaps, better computational resources.
To further develop the GA's ability to find the best CNN architecture, additional building blocks may be added to the setup to increase the number of possible solutions in the gene pool. Whilst the utilised algorithm focuses on convolutional layer quantity and hyperparameters, it is recommended to investigate an evolutionary simulation that provides a 'from scratch' approach, which would allow adding a plethora of different layer types, hyperparameters, epochs, batch sizes, filters, learning rate methods, loss functions, and even automated data pre-processing structures.
6. Conclusions
In this paper, a GA-designed, automated CNN architecture to address medical CT-imaging problems in lung cancer classification was successfully implemented, demonstrating that CNN-GAs can outperform manually tuned CNN classifiers, even in the absence of data-science expert knowledge. The proposed implementation of the CNN-GA for lung cancer classification may be deployed in various clinics in the future, as it may be possible to generate highly accurate, population-specific CAD tools for lung nodule malignancy classification on DICOM images with this approach.
The reproducibility of genetically evolved CNN classifiers may pose additional hurdles towards a full-scale CAD implementation, as clear clinical and medical regulation standards are yet to be established in many countries [45,46,47]. Furthermore, the automatic design process of deep learning models adds an additional layer of concern to ethical hurdles, such as questions regarding the liability of the artificially generated CAD system [48]. In the future, one can expect clearer standards to arise, even for automatically generated CAD systems [49,50]. In the meantime, further research in the field of automatically designing CNN architectures with evolutionary algorithms is to be expected, with a particular focus on optimising the reproducibility of the results for different populations. Additionally, further research may focus on speeding up CNN modelling and fitness evaluation to overcome computational limitations [14].
Overall, the outstanding performance of the GA-designed CNN classifier is an indicator of the future of deep learning and AI, with a plethora of possible applications outside of healthcare. The ability to develop optimised DL solutions in various fields, regardless of the field of interest and independent of machine learning expertise, may pave the way for a new era of AI development.
Thus, being able to deploy optimised mathematical classifiers for various problems without depending on the presence of deep learning expertise has a high potential to shape the evolution of data-driven applications in the future and may increasingly drive the discovery of new DL models, the impact of which is yet to be defined. Accelerating computational performance (e.g., with quantum computing) in combination with computational optimisation (e.g., with population-based algorithms) may fundamentally change society and our relationship with computed intelligence.